What is Computer Vision? How AI "sees", inspects and detects defects
Author: Michael Jan Rogocki (AI Engineer & Data Scientist) · Last updated:
A scratch on a surface, a crack in a weld, a missing part: these are defects that quality control should catch. And it does, but after a few hours and hundreds of inspected units, fatigue and routine set in, and concentration drops with them. Then a defective unit slips through, followed by a complaint, a return and costs that could have been avoided.
Computer Vision — a technology from the field of AI — lets you move that inspection to a machine that doesn't get tired and checks the thousandth unit as carefully as the first. But quality control is just one of the applications. Computer Vision supports medical diagnostics, detects vehicle damage, monitors safety on construction sites, lets autonomous cars "see" the road — in short, it proves itself wherever something has to be assessed on the basis of an image.
Below we explain what Computer Vision is, how it differs from OCR, where companies use it — and where to start if you're considering a deployment.
1. What is Computer Vision and how does it work?
⚡ In one sentence
Computer Vision is a field of AI in which algorithms analyze images or video — they recognize objects, detect defects and classify them.
💡 In plain terms
A quality inspector looks at a product and compares it with a model they hold in their head: they know what a good unit should look like and they look for deviations. Computer Vision works on a similar principle, except that the model comes not from experience but from thousands of labeled images. A camera captures an image of the product, and the algorithm compares it with the patterns it learned from those images and classifies the product as correct or defective.
The effect: every unit checked in a fraction of a second — the first the same as the thousandth.
But note: Computer Vision doesn't "think" like a human. It doesn't "understand" what a scratch is. It "learned" that a certain pattern of pixels in an image means a defect, because it was shown that pattern earlier in many labeled examples. If a defect appears that wasn't in the training data, the system may fail to catch it. That's why the effectiveness of Computer Vision depends on the quality and diversity of that data (more in section 4).
🔧 Deep dive
Computer Vision as a research field has existed since the 1960s, when the first experiments involved attempts to recognize simple geometric shapes. For decades, progress was slow, because hand-designed algorithms (so-called feature engineering, e.g. Sobel edge filters, SIFT and HOG descriptors) couldn't cope with the natural variability of images: a change in lighting, angle or background was enough for the algorithm to stop working.
The breakthrough came in 2012, when the AlexNet neural network won the ImageNet Large Scale Visual Recognition Challenge with a result significantly better than classic methods. From that moment on, Deep Learning (cf. What is Artificial Intelligence?) dominated Computer Vision.
The key architecture is the CNN — Convolutional Neural Network. It works on the principle of layers of filters that slide across the image and extract features — from the simplest (edges, color gradients) in the first layers, to complex ones (shapes, textures, objects) in the deeper ones. The network doesn't require features to be defined by hand — it's trained on data and extracts them on its own.
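To make the "sliding filter" idea concrete, here is a minimal sketch in plain Python. It slides a fixed Sobel kernel over a tiny toy image. The image values and the function name `slide_filter` are illustrative assumptions. In a real CNN the kernel values are not hand-written like this but learned during training, and many kernels are stacked in layers.

```python
from typing import List

Image = List[List[float]]

# Hand-designed Sobel kernel that responds to vertical edges.
# In a CNN, numbers like these are learned from data instead.
SOBEL_X: Image = [
    [-1.0, 0.0, 1.0],
    [-2.0, 0.0, 2.0],
    [-1.0, 0.0, 1.0],
]


def slide_filter(image: Image, kernel: Image) -> Image:
    """Slide `kernel` over `image` (no padding, stride 1).

    Each output value is the sum of element-wise products between the
    kernel and the image patch under it. Deep learning frameworks call
    this operation "convolution" (strictly, it is cross-correlation).
    """
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    output: Image = []
    for row in range(out_h):
        out_row = []
        for col in range(out_w):
            total = 0.0
            for i in range(k_h):
                for j in range(k_w):
                    total += image[row + i][col + j] * kernel[i][j]
            out_row.append(total)
        output.append(out_row)
    return output


# Toy 4x6 grayscale image: dark left half (0.0), bright right half (1.0).
toy_image: Image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(4)]
edge_map = slide_filter(toy_image, SOBEL_X)
print(edge_map)
```

The response is zero over the flat dark and bright regions and non-zero only where the window straddles the boundary, which is what "extracting an edge feature" means in practice.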
The main types of tasks in Computer Vision:
- Image classification — "what is in this image?" (e.g. an OK product vs a defective product)
- Object detection — "where in the image are the objects and what are they?" (e.g. finding all the scratches on a surface and marking them with a box)
- Segmentation — "which pixel belongs to which object?" (e.g. precisely outlining the contour of a defect or measuring its area)
In industrial applications (visual inspection), object detection and classification are most often used: a camera captures the image, a CNN model classifies it as "OK" or "defect" and — in the case of a defect — marks its location in the image.
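The three task types differ mainly in what the model returns. The sketch below is a hypothetical illustration, not the output format of any specific library: it takes a binary segmentation mask (1 = defect pixel), derives the bounding box a detector would report, and converts the pixel count into an area. The `mm_per_pixel` calibration value and all names are assumptions.

```python
from dataclasses import dataclass
from typing import List, Tuple

Mask = List[List[int]]


@dataclass
class Detection:
    """One detected defect: label, confidence and a pixel bounding box."""
    label: str
    confidence: float
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), inclusive


def mask_to_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Tightest box around all pixels marked 1 in a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        raise ValueError("mask contains no defect pixels")
    return (min(cols), min(rows), max(cols), max(rows))


def mask_area_mm2(mask: Mask, mm_per_pixel: float) -> float:
    """Defect area, assuming square pixels of a known physical size."""
    pixel_count = sum(sum(row) for row in mask)
    return pixel_count * mm_per_pixel ** 2


# Toy segmentation output for one image (values are made up).
mask: Mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]

detection = Detection(label="scratch", confidence=0.93, box=mask_to_box(mask))
area = mask_area_mm2(mask, mm_per_pixel=0.5)
print(detection, area)
```

Classification would return only the label, detection adds the box, and segmentation provides the mask from which both the box and a measurement can be derived.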
2. How does Computer Vision differ from OCR?
⚡ In one sentence
OCR recognizes text in an image and converts it into editable characters. Computer Vision recognizes objects, defects, scenes — everything that isn't text.
💡 In plain terms
OCR and Computer Vision are two different answers to the same question: "what is in this image?".
OCR answers: "in this image there is text — here's what's written". It extracts letters, digits, words. As a result, a scan of an invoice becomes a file in which you can search and copy text (cf. What is OCR, NLP and how does AI read documents?).
Computer Vision answers differently: "in this image there is a product — and it has a scratch on the left edge". Or: "in this photo of a parking lot there are 47 cars". Or: "this part is mounted crooked by 3 degrees".
A simple rule: if you want to extract text from a document — you need OCR. If you want the system to analyze an image and classify something on that basis — you need Computer Vision.
In practice the two technologies can complement each other — e.g. on a packing line, Computer Vision checks whether the label is in the right place, and OCR reads what's written on it (expiry date, batch number). They're two tools, not competitors.
🔧 Deep dive
Technically, OCR can be considered a narrow application of Computer Vision — after all, recognizing letters in an image is also "computer vision". But in industry practice these concepts live separately, because they solve different problems and require different training data.
OCR operates on alphanumeric characters — the model is trained on millions of examples of letters and digits in various fonts, sizes and qualities. The output is structured text.
Computer Vision in industrial applications operates on visual features specific to a given product or process. The model is trained on images of specific products — e.g. ceramic tiles, metal parts, packaging — with marked defects. The output is a classification (OK/defect), a location (a box around the defect) or a measurement (deviation from the norm in millimeters).
Modern multimodal models (e.g. Vision-Language Models) blur this line: they can simultaneously "read" text and "understand" visual context. But in applications requiring repeatability and certifiability (such as quality inspection), specialized CV models trained on data from a specific process still dominate.
3. Where do companies use Computer Vision?
⚡ In one sentence
Companies use Computer Vision in quality control, logistics, safety and construction — everywhere a decision depends on what can be seen in an image.
💡 In plain terms
Computer Vision in business isn't an abstract technology from a lab. It's a tool that companies deploy because it solves real, measurable problems.
Quality control in production — this is the most mature application. BMW uses the AIQX system (Artificial Intelligence Quality Next) in its factories in Germany. Cameras placed along the assembly line capture every vehicle and check whether the parts are correctly mounted — from the wiper cover to the warning triangle in the trunk. Earlier camera systems falsely signaled a problem when there was, for example, dust or an oil mark on a part. The deep-learning-based system distinguishes a genuine defect from contamination. The effect: faster inspection and fewer unnecessary line stoppages.
Volvo Cars has used the Atlas system (developed by UVeye) since 2020 in its plant in Torslanda, Sweden. Cameras perform a 360-degree scan of every assembled vehicle at the end of the assembly line and detect cosmetic defects — scratches, dents — as small as 0.5 millimeters across. The result appears on the operator's screen immediately, with the exact location indicated.
Logistics and warehousing — Computer Vision identifies parcels, checks stock levels from photos of shelves and verifies the completeness of shipments. Where you need to simultaneously check the position of a label and read its content, CV works together with OCR (cf. What is OCR, NLP and how does AI read documents?).
Safety and OHS — CV systems monitor whether employees are wearing the required protective equipment (helmet, vest, goggles). BMW uses such a system in its plant in Dingolfing.
Construction and infrastructure inspection — drones equipped with cameras and CV analyze the progress of construction work, comparing photos with the plan. Komatsu, the Japanese maker of construction machinery, in cooperation with NVIDIA deployed a CV system to monitor the movement of workers and machines on a construction site. Drones with CV are also increasingly replacing the manual inspection of hard-to-reach structures — wind turbines, power lines, bridges, building facades — cutting inspection time from days to hours.
🔧 Deep dive
The BMW AIQX system is based on neural networks that, for each inspected feature, have access to around 100 reference images: images of a correct part, images with dust, with oil, with an actual defect. This lets the network distinguish so-called pseudo-defects (contamination that looks like a defect) from real problems. Earlier camera systems, based on rigid rules rather than on learning, handled this distinction far worse. BMW has used these systems since 2018 and is rolling them out to further factories.
It's worth noting the difference between CV in controlled conditions (a factory: constant lighting, constant camera angle, a repeatable product) and CV in an open environment (a construction site, a road, a field). In controlled conditions, CV models achieve very high accuracy — provided the training data covers the actual variability of production (different batches, different materials, different sources of defects). In an open environment the variability is far greater and the models require more frequent retraining.
A CV deployment, however, delivers more than just inspection automation. Every image processed by the system is information: the type of defect, its frequency, its location on the product, the shift (which shift has more defects?), the line (which line generates more deviations?). This data feeds dashboards and analytics and allows the whole process to be improved — not just reacting to individual defects (cf. What is process optimization? — the section on KPIs).
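As a rough illustration of how inspection results become process data, the sketch below aggregates a handful of made-up inspection records by shift, by line and by defect type. The record fields and the sample values are assumptions for the example, not a description of any particular system.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class InspectionRecord:
    line: str
    shift: str
    defect_type: Optional[str]  # None means the unit passed inspection


def defect_rate_by(records: List[InspectionRecord], field: str) -> Dict[str, float]:
    """Share of defective units per value of `field` ("line" or "shift")."""
    totals: Counter = Counter()
    defects: Counter = Counter()
    for record in records:
        key = getattr(record, field)
        totals[key] += 1
        if record.defect_type is not None:
            defects[key] += 1
    return {key: defects[key] / totals[key] for key in totals}


# Hypothetical results logged by the CV system.
records = [
    InspectionRecord("A", "day", None),
    InspectionRecord("A", "day", "scratch"),
    InspectionRecord("A", "night", "scratch"),
    InspectionRecord("A", "night", "dent"),
    InspectionRecord("B", "day", None),
    InspectionRecord("B", "day", None),
    InspectionRecord("B", "night", None),
    InspectionRecord("B", "night", "scratch"),
]

rate_by_shift = defect_rate_by(records, "shift")
rate_by_line = defect_rate_by(records, "line")
defect_types = Counter(r.defect_type for r in records if r.defect_type)
print(rate_by_shift, rate_by_line, defect_types)
```

In this toy data the night shift and line A stand out, which is the kind of signal a dashboard built on CV output can surface.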
4. How Computer Vision works — data, training, accuracy
⚡ In one sentence
Computer Vision works on the principle of comparing a new image with patterns extracted from thousands of labeled training images.
💡 In plain terms
For a CV system to classify a scratch as a defect, it has to be trained beforehand on appropriate examples. It's not enough to define a rule "look for scratches". You have to provide labeled images of products — some with defects, some without — and mark each one: "there's a scratch here", "this is fine here". In the case of BMW AIQX, around 100 images per inspected feature were enough. The exact number depends on the complexity of the problem and on whether the model is trained from scratch or fine-tuned on top of an existing one (transfer learning — more in the 🔧 section).
This process is called labeling. A person reviews the images and marks the features of interest on them — outlining defects with a box, marking regions, assigning categories. It's tedious work, but the quality of the whole system depends on it.
When the model has enough labeled examples, it's trained — the algorithm extracts the patterns that distinguish a good product from a defective one. After training, it receives a new image (one that wasn't in the training set) and, on that basis, classifies it: OK or defect.
What affects the quality of the model:
- Representativeness of the data — the rarer the defect, the harder it is to collect enough examples. If a defect occurs once in a thousand units, gathering the training material can take weeks.
- Image quality — resolution, lighting, camera angle. A system trained on images with ideal lighting can give inaccurate results when conditions on the line change.
- Diversity — the images have to cover the natural variability: different product batches, different colors, different types of defects. A model trained only on scratches won't detect a crack.
- Label quality — if the person labeling the images mistakes a defect for an acceptable production mark, the model will reproduce that error in its results.
The conclusion is simple: you have to know what you're looking for and have the right images for training. Without a clearly defined visual problem and without training data — no camera and no algorithm will help.
🔧 Deep dive
In practice, models are rarely trained from scratch. The standard is transfer learning — using a model that was previously trained on a large dataset (e.g. ImageNet — millions of images, thousands of categories), and then fine-tuning it on a smaller dataset specific to the given application.
Thanks to transfer learning, the model has already learned basic visual features (edges, textures, shapes) and needs far less data to be fine-tuned for detecting a specific type of defect. This means a pilot CV deployment can start with as few as a few dozen labeled images per feature, rather than thousands.
Popular architectures in industrial inspection are families of models: ResNet, EfficientNet (classification), YOLO, Faster R-CNN (object detection), U-Net, Mask R-CNN (segmentation). The choice of architecture depends on the task: whether an "OK/defect" answer is enough (classification), whether the defect needs to be located in the image (detection), or whether a precise outline is needed (segmentation).
A significant limitation: class imbalance. If, in the training set, 99% of the images are correct products and only 1% defective, a model optimized for overall accuracy will classify everything as "OK" — achieving 99% correct answers, but missing all the defects. Standard techniques for dealing with this problem are data augmentation (artificially enlarging the set of defective images through rotations, reflections, brightness changes), oversampling the minority class and choosing an appropriate loss function that penalizes the model more heavily for missing a defect than for a false alarm.
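The 99% / 1% case described above can be checked with a few lines of arithmetic. The sketch below uses a made-up set of 1000 labels, evaluates a "model" that always answers "OK", and then shows one common way to set class weights (inverse class frequency) so that a missed defect costs more than a false alarm. The function names and the weighting formula are illustrative choices, not the only option.

```python
import math
from typing import Dict, List, Tuple


def accuracy_and_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Overall accuracy and recall on the defect class (label 1)."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    defects = sum(y_true)
    caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    recall = caught / defects if defects else 0.0
    return correct / len(y_true), recall


def inverse_frequency_weights(y_true: List[int]) -> Dict[int, float]:
    """Weight each class by n / (2 * class_count): rare classes weigh more."""
    n = len(y_true)
    n_defect = sum(y_true)
    n_ok = n - n_defect
    return {0: n / (2 * n_ok), 1: n / (2 * n_defect)}


def weighted_log_loss(label: int, p_defect: float, weights: Dict[int, float]) -> float:
    """Class-weighted cross-entropy for a single prediction."""
    p = min(max(p_defect, 1e-7), 1 - 1e-7)  # avoid log(0)
    loss = -math.log(p) if label == 1 else -math.log(1 - p)
    return weights[label] * loss


# 990 good units (0) and 10 defective units (1): the 99% / 1% split.
labels = [0] * 990 + [1] * 10
always_ok = [0] * 1000

accuracy, recall = accuracy_and_recall(labels, always_ok)
weights = inverse_frequency_weights(labels)

# Same confidence, opposite mistakes: a missed defect vs a false alarm.
miss_penalty = weighted_log_loss(1, 0.1, weights)
false_alarm_penalty = weighted_log_loss(0, 0.9, weights)
print(accuracy, recall, weights, miss_penalty, false_alarm_penalty)
```

The "always OK" model reaches 99% accuracy with zero recall on defects, and with the weights applied a missed defect is penalized far more heavily than a false alarm of equal confidence.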
5. Where to start with a Computer Vision deployment in a company?
⚡ In one sentence
Start with a single, clearly defined visual problem — one defect type, one product, one line — and gather images before you start looking for technology.
💡 In plain terms
Companies considering Computer Vision most often start with the question: "how much does it cost?". But the right first question is: "what exactly do you want to detect?" Because Computer Vision isn't a ready-made product you buy and plug in — it's a solution built for a clearly defined problem (cf. What is automation?).
How to approach the deployment step by step:
- Define the visual problem. Precision is key: "detecting scratches on the surface of part X after grinding" or "verifying the completeness of a parcel before shipping" — not generally "quality control". The narrower, the better.
- Gather images. Before you spend a penny on an algorithm, check what data you already have. Images of positive and negative examples, of appropriate quality and covering different variants — that's the foundation on which the model's effectiveness depends.
- A pilot, not a deployment. A Proof of Concept: does a model trained on the gathered images give sufficiently accurate results? A pilot lets you assess feasibility before you invest in infrastructure and integration with the process.
- Assess the results honestly. 95% accuracy sounds good, but ask about the consequences rather than the percentage: how many defects the system will miss and how many good units it will flag in error.
- Scale. When the pilot confirms feasibility — deploy in the target environment: the right hardware, integration with the existing process, monitoring of the model's operation over time. The model requires periodic updates, because conditions change.
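The "assess the results honestly" step can be turned into simple arithmetic. The sketch below is a back-of-the-envelope calculation under assumed numbers (daily volume, defect rate, recall and false alarm rate are all hypothetical); it shows why a single percentage hides the two figures that matter.

```python
from dataclasses import dataclass


@dataclass
class DailyOutcome:
    caught_defects: float
    missed_defects: float
    false_alarms: float

    @property
    def share_of_alarms_that_are_real(self) -> float:
        flagged = self.caught_defects + self.false_alarms
        return self.caught_defects / flagged if flagged else 0.0


def expected_daily_outcome(
    units_per_day: int,
    defect_rate: float,       # share of units that are really defective
    recall: float,            # share of real defects the model catches
    false_alarm_rate: float,  # share of good units flagged as defective
) -> DailyOutcome:
    defective = units_per_day * defect_rate
    good = units_per_day - defective
    caught = defective * recall
    return DailyOutcome(
        caught_defects=caught,
        missed_defects=defective - caught,
        false_alarms=good * false_alarm_rate,
    )


# Hypothetical pilot: 10,000 units a day, 1% defective,
# the model catches 95% of defects and wrongly flags 5% of good units.
outcome = expected_daily_outcome(10_000, 0.01, 0.95, 0.05)
print(outcome, outcome.share_of_alarms_that_are_real)
```

Under these assumptions the system misses about 5 defects a day and raises about 495 false alarms, so only roughly one alarm in six points at a real defect. Whether that is acceptable depends on what a missed defect and an unnecessary stop each cost.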
Considering a Computer Vision deployment? We help companies in Poland, Germany and across the EU assess whether their visual problem is suitable for automation — from analyzing the available data, through choosing the model architecture, to integration with the existing process.
— The cm-opti perspective
🔧 Deep dive
A CV deployment in a company is an investment in data infrastructure, not just in technology. Costs depend on three factors:
- Training data — gathering and labeling images is often the biggest cost, especially when it requires industry expertise (e.g. only an experienced specialist knows what in an image is a defect and what is an acceptable norm).
- Infrastructure — cameras, lighting (in the case of stationary systems), drones (in the case of infrastructure inspection), an inference server or an edge computing solution.
- Integration with the process — the CV system must have a configured trigger (when to take a photo) and a channel for communicating the result (e.g. an alert for the operator, a signal for automatic rejection or a report in a management dashboard).
The return on investment (ROI) depends on the cost the company bears today because of the problem CV is meant to solve — complaints, returns, downtime, manual inspection. If these costs are quantifiable, the ROI can be estimated even before the pilot.
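A rough payback estimate can be sketched as follows. All figures are hypothetical placeholders, and the formula is a deliberate simplification (no discounting, constant monthly costs), meant only to show how the quantities named above fit together.

```python
from typing import Optional


def payback_months(
    upfront_cost: float,          # cameras, labeling, integration
    monthly_running_cost: float,  # hosting, maintenance, retraining
    monthly_problem_cost: float,  # complaints, returns, downtime today
    share_avoided: float,         # part of that cost CV is expected to remove
) -> Optional[float]:
    """Months until savings cover the upfront cost, or None if they never do."""
    monthly_saving = monthly_problem_cost * share_avoided - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return upfront_cost / monthly_saving


# Hypothetical example: the problem costs 8,000 a month today,
# CV is expected to avoid 75% of it and costs 1,000 a month to run.
months = payback_months(60_000, 1_000, 8_000, 0.75)
no_payback = payback_months(60_000, 1_000, 1_000, 0.5)
print(months, no_payback)
```

The second call shows the opposite case: if the problem costs less than running the system, the function returns `None`, which is a useful result to have before the pilot rather than after it.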
A few technical aspects are also worth considering:
Edge vs Cloud. Inference (the processing of the image by the model) can take place on a server in the cloud or directly on a device in the field (edge computing). Edge gives lower latency (milliseconds instead of seconds) and works without an internet connection — which in an industrial environment or on a construction site is sometimes required. The cloud gives greater flexibility in scaling and updating models.
MLOps and monitoring. A CV model isn't "deploy and forget". Conditions change: new products, new suppliers, different lighting, seasonality. As production images move away from the data the model was trained on (so-called data drift), its accuracy can drop over time. A process is needed for monitoring the results and retraining the model on new data. This is part of MLOps, the operationalization of Machine Learning models (cf. Glossary).
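One simple way to watch for drift is to track a cheap statistic of the incoming images, for example mean brightness, and compare recent values with a baseline recorded when the model was validated. The sketch below is a simplified heuristic under that assumption. The threshold of 3 and the sample values are arbitrary illustrations, and production monitoring usually tracks several statistics as well as the model's own outputs.

```python
from statistics import mean, stdev
from typing import List


def drift_score(baseline: List[float], recent: List[float]) -> float:
    """Distance of the recent mean from the baseline mean,
    measured in standard errors of the recent window."""
    base_std = stdev(baseline)
    if base_std == 0:
        raise ValueError("baseline has no variation; cannot scale the score")
    standard_error = base_std / len(recent) ** 0.5
    return abs(mean(recent) - mean(baseline)) / standard_error


def has_drifted(baseline: List[float], recent: List[float], threshold: float = 3.0) -> bool:
    return drift_score(baseline, recent) > threshold


# Mean image brightness (0-255) logged per image; values are made up.
baseline_brightness = [118, 120, 122, 119, 121, 120, 118, 122]
recent_normal = [119, 121, 120, 122]
recent_dim = [100, 102, 101, 99]  # e.g. a lamp on the line was replaced

print(has_drifted(baseline_brightness, recent_normal))
print(has_drifted(baseline_brightness, recent_dim))
```

A flag like this does not prove that accuracy has dropped. It is a prompt to review recent predictions and, if needed, to label new images and retrain.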
Regulations and documentation. In regulated industries (automotive, medicine, aviation) a CV system may require certification and full documentation: what training data, what model architecture, what accuracy on the test set, what the procedure is in the case of a non-conformity. The EU AI Act classifies some CV applications (e.g. in medicine) as high-risk systems — which means additional requirements regarding transparency and documentation. With a large amount of regulatory documentation, RAG systems can be helpful, letting you quickly find the relevant provision.
Integration with the process (cf. What is systems integration?). A CV system isn't a standalone island: it has to communicate with the rest of the infrastructure, such as the production control system (PLC), the warehouse system (WMS) or the quality management system. The connectors are APIs and communication protocols: in an industrial environment most often OPC UA or MQTT, elsewhere typically a REST API.
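Whatever the transport, the CV system has to turn each verdict into a message other systems can consume. The sketch below only builds such a message and does not send it. The topic layout `factory/inspection/<line>/result` and the field names are hypothetical, and the same JSON body could be published over MQTT or posted to a REST endpoint.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class InspectionResult:
    unit_id: str
    line: str
    verdict: str                      # "OK" or "DEFECT"
    defect_type: Optional[str] = None
    confidence: float = 0.0


def build_message(result: InspectionResult,
                  topic_root: str = "factory/inspection") -> Tuple[str, str]:
    """Return (topic, JSON payload) for one inspection result.

    The topic naming scheme is an assumption made for this example.
    """
    topic = f"{topic_root}/{result.line}/result"
    payload = json.dumps(asdict(result), sort_keys=True)
    return topic, payload


result = InspectionResult(
    unit_id="unit-000123",
    line="line-2",
    verdict="DEFECT",
    defect_type="scratch",
    confidence=0.97,
)
topic, payload = build_message(result)
print(topic)
print(payload)
```

Keeping the payload small and explicit (unit, line, verdict, confidence) makes it easy for a PLC gateway, a dashboard or a quality system to act on the same message.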
Frequently asked questions (FAQ)
What is Computer Vision in simple terms?
Computer Vision is a technology that lets a computer analyze images — detect objects, identify defects, count items in a photo. It works on the principle of training on examples: the system receives hundreds of images with marked elements and, on that basis, classifies new images.
Will Computer Vision replace the quality inspector?
It depends on the process. For repetitive, well-defined checks — yes, Computer Vision can take over that work entirely. For assessments requiring experience and judgment — e.g. borderline cases, new defect types, approval decisions — a human is needed. In practice, many companies reduce the number of inspectors on the line, leaving experts to supervise and improve the process.
How many images are needed to deploy Computer Vision?
It depends on the complexity of the problem. Thanks to transfer learning (using a model pre-trained on large datasets), a pilot can start with as few as a few dozen labeled images per feature. A full deployment requires more data, covering different defect types and varying conditions.
What are the business applications of Computer Vision?
Quality control on the production line (defect detection), logistics (parcel identification, stock checks), safety and OHS (monitoring protective equipment), construction and infrastructure inspection (drone analysis). Deployment starts with a single, well-defined visual problem.
Can a small company afford Computer Vision?
A pilot Computer Vision deployment doesn't require a corporate budget. Costs depend on the complexity of the problem, the quality of existing images and the required integration with the process. The first step is to assess whether the problem is suitable for Computer Vision — that can be done without investing in infrastructure.
Considering a Computer Vision deployment in your company? Let's talk — we'll help you assess whether your visual problem is suitable for automation and where it's worth starting.
Related articles in the cm-opti Knowledge base
- What is Artificial Intelligence?
- What is process optimization?
- What is automation?
- What is OCR, NLP and how does AI read documents?
- What is RAG and an AI agent?
- What is systems integration?
- What is data analysis and BI?
Concepts explained in this article → Glossary
Computer Vision, CNN (Convolutional Neural Network), transfer learning, labeling, image classification, object detection, segmentation, edge computing, data drift, class imbalance, data augmentation, feature engineering
Sources and references
- AIQX system (BMW) — official BMW Group press release, 2019.
- Atlas system / UVeye (Volvo Cars) — Assembly Magazine, 2021.
- AlexNet — Krizhevsky, Sutskever, Hinton, NeurIPS 2012.