Computer vision is a branch of artificial intelligence that enables computers and machines to see, process, and interpret visual information from the world. By combining cameras, sensors, and machine learning models, computer vision systems can detect objects, recognize faces, read text, track motion, and make sense of images and video at a scale and speed that manual review cannot match.
How computer vision works
Modern computer vision relies on deep learning, most often convolutional neural networks (CNNs) and, more recently, transformer-based architectures. A model is trained on large labeled datasets, such as millions of photos tagged with the objects they contain. During training, the network learns to recognize recurring patterns: edges, textures, shapes, and eventually full objects.
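To make the idea of a learned edge pattern concrete, here is a minimal sketch of the sliding-window operation at the heart of a CNN layer. It assumes a hand-set Sobel kernel as a stand-in for weights that a real network would learn during training, and a tiny made-up image. The names `convolve2d`, `SOBEL_X`, and `image` are illustrative, not taken from any library.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image with no padding.

    This is cross-correlation, which is what deep learning
    frameworks conventionally call convolution.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            total = 0
            for i in range(kh):
                for j in range(kw):
                    total += image[r + i][c + j] * kernel[i][j]
            row.append(total)
        output.append(row)
    return output


# Hand-set stand-in for a learned filter: responds to vertical edges.
SOBEL_X = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

# Made-up 4x6 grayscale image: dark left half (0), bright right half (1).
image = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

response = convolve2d(image, SOBEL_X)
for row in response:
    print(row)
# [0, 4, 4, 0]
# [0, 4, 4, 0]
```

The response is zero over flat regions and large where the window straddles the dark-to-bright boundary. A trained network stacks many such filters, with later layers combining these simple responses into textures, shapes, and objects.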
At inference time, the system captures an image or video frame, runs it through the trained model, and outputs predictions. For example, given a photo of a street, a segmentation model might label each pixel by class, marking roads, pedestrians, traffic signs, and vehicles. The same capture-predict-output pipeline powers other tasks such as optical character recognition (OCR), where the model converts handwritten or printed text in an image into machine-readable characters.
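The last step of that per-pixel labeling can be sketched as follows. This assumes the model has already produced one score per class for every pixel. The scores below are invented for illustration, and `CLASSES` and `segment` are hypothetical names rather than part of any real model's API.

```python
# Hypothetical class list for a street scene.
CLASSES = ["road", "pedestrian", "traffic sign", "vehicle"]


def segment(scores):
    """Turn per-pixel class scores (height x width x classes)
    into a grid of class names by picking the top score."""
    label_map = []
    for row in scores:
        label_row = []
        for pixel_scores in row:
            best = max(range(len(pixel_scores)), key=pixel_scores.__getitem__)
            label_row.append(CLASSES[best])
        label_map.append(label_row)
    return label_map


# Invented model output for a 2x2 image (not real predictions).
scores = [
    [[0.9, 0.0, 0.0, 0.1], [0.2, 0.7, 0.0, 0.1]],
    [[0.8, 0.1, 0.0, 0.1], [0.1, 0.0, 0.1, 0.8]],
]

labels = segment(scores)
for row in labels:
    print(row)
# ['road', 'pedestrian']
# ['road', 'vehicle']
```

In a real system the scores would come from the trained network and cover hundreds of thousands of pixels per frame, but the final step is the same: pick the highest-scoring class at each location.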
Why it matters
Computer vision is the perceptual layer of AI for the physical world. It underpins medical imaging tools that help radiologists spot tumors, autonomous vehicles that navigate busy streets, manufacturing systems that detect defects on assembly lines, and retail applications that automate checkout. It also powers everyday features such as face unlock on phones, image search, AR filters, and security surveillance. By turning pixels into structured data, computer vision lets machines act on what they see, opening up automation in domains where the physical and digital worlds meet.
Key tasks and types
- Image classification: assigning a single label to an entire image, such as "cat" or "dog."
- Object detection: drawing bounding boxes around each object of interest and identifying it.
- Image segmentation: labeling every pixel by class for fine-grained scene understanding.
- Facial recognition: identifying or verifying a person from facial features.
- Optical character recognition (OCR): extracting printed or handwritten text from images.
- Pose estimation and tracking: detecting the position and movement of people or objects over time.
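As one concrete illustration of the object detection task above, predicted bounding boxes are commonly compared with reference boxes using intersection over union (IoU). The sketch below assumes boxes given as `(x_min, y_min, x_max, y_max)` tuples. The function name `iou` and the example coordinates are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Each box is (x_min, y_min, x_max, y_max).
    Returns a value between 0.0 (no overlap) and 1.0 (identical).
    """
    # Corners of the overlapping rectangle, if any.
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes give no intersection.
    intersection = max(0, ix_max - ix_min) * max(0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    return intersection / union if union > 0 else 0.0


# Made-up example: two 4x4 boxes overlapping in a 2x2 region.
predicted = (0, 0, 4, 4)
ground_truth = (2, 2, 6, 6)
overlap = iou(predicted, ground_truth)
print(round(overlap, 3))  # 0.143
```

A detection is typically counted as correct only when its IoU with a reference box exceeds a chosen threshold. The threshold itself varies by benchmark and application.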
Computer vision has become one of the most commercially deployed branches of AI because visual data is abundant, the underlying hardware (GPUs and specialized sensors) is mature, and standardized benchmarks like ImageNet have driven rapid model improvement since the early 2010s.