Many everyday scenarios require the simultaneous detection and localization of various objects. As humans, we have an inherent ability to recognize and categorize multiple objects within our field of view. For example, in a supermarket, we recognize shopping carts, other people, and various groceries. This enables us to find specific products, approach them, and place them in the cart.
In an economic context, these abilities are particularly important in areas such as process control, automation, and autonomous systems. Replicating the steps of recognizing and localizing objects with machines is therefore highly relevant and an intensely studied field of research and application. The corresponding methods and technologies are grouped under the topic of object detection, a specialized area of computer vision.
Fundamentally, a distinction can be made between classical methods and Machine Learning approaches. In this blog post, we focus solely on object detection methods based on Machine Learning, as they far surpass classical approaches in terms of accuracy (Zou et al., 2023).
The framework of object detection
The challenge of object detection lies in simultaneously classifying and localizing multiple objects in an image. Nowadays, neural networks trained through supervised learning are commonly used for this purpose. Alongside the images themselves, annotations must be provided. These annotations contain metadata about an image that a neural network needs in order to learn how and where to classify and localize specific objects within it. In object detection, annotations typically take the form of rectangular frames, known as bounding boxes, that indicate the positions of objects, together with labels identifying their classes.
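Such bounding-box annotations can be illustrated with a minimal, COCO-inspired record. The field names and values here are hypothetical, chosen only for illustration; real annotation formats differ in detail:

```python
# A hypothetical annotation record in the spirit of the COCO format:
# each object is described by a class label and a bounding box.
# Boxes are given as [x, y, width, height] in pixel coordinates,
# with (x, y) the top-left corner of the box.
annotation = {
    "image_id": 42,
    "objects": [
        {"category": "shopping cart", "bbox": [120, 80, 200, 260]},
        {"category": "person",        "bbox": [400, 50, 90, 310]},
    ],
}

# During training, the network receives the image together with these
# boxes and labels as supervision targets.
for obj in annotation["objects"]:
    x, y, w, h = obj["bbox"]
    print(f'{obj["category"]}: top-left=({x}, {y}), size={w}x{h}')
```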
Object detection methods rely on extracting features from input images. Visual features are properties of an image that provide information about its content, such as differences in brightness or color. How feature extraction works and how the gathered information leads to a classification has been discussed previously in a blog post on computer vision. Features suited to a dataset can be learned using Convolutional Neural Networks (CNNs). Some features are generic and only point to a specific class in combination with others. To learn these foundational features, CNNs are often pre-trained on general datasets such as ImageNet or COCO. These pre-trained CNNs form the backbone of object detection networks and are therefore referred to as backbone architectures. The quality of the visual features they provide is crucial to the performance of object detection algorithms.
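As a rough sketch of what a single low-level visual feature looks like, the following hand-written convolution filter responds to vertical brightness edges, the kind of feature the first layers of a CNN backbone learn automatically rather than by hand. The image and kernel are toy values, not taken from any real network:

```python
# Illustrative sketch (not a real backbone): one hand-crafted
# convolution filter detecting vertical brightness edges.

def conv2d(image, kernel):
    """Valid 2D convolution of a grayscale image (list of lists)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - kh + 1):
        row = []
        for j in range(w - kw + 1):
            s = sum(image[i + di][j + dj] * kernel[di][dj]
                    for di in range(kh) for dj in range(kw))
            row.append(s)
        out.append(row)
    return out

# A vertical edge filter: dark-to-bright transitions from left to
# right produce large positive responses, flat regions produce zero.
edge_kernel = [[-1, 0, 1],
               [-1, 0, 1],
               [-1, 0, 1]]

# Toy image: dark left half (0), bright right half (9).
image = [[0, 0, 0, 9, 9, 9]] * 4

features = conv2d(image, edge_kernel)
# The response is strong where the edge lies and zero elsewhere.
```

A trained backbone stacks many such learned filters in successive layers, so that later layers respond to increasingly class-specific patterns.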
Key approaches in object detection
There are two main approaches to implementing object detection based on the backbone architecture: two-stage detectors and one-stage detectors. Two-stage detectors first identify areas of interest in an image and then classify and localize objects within these areas. One-stage detectors tackle both tasks simultaneously by reformulating the problem to consider both steps holistically and optimize them together.
Let’s examine these two types of methods in more detail:
Two-stage methods
In two-stage methods, areas of interest in an image that contain features relevant to object detection are identified first (region proposal step) (Zou et al., 2023). These areas, referred to as candidate boxes, represent potential object locations and are used for further processing. Various techniques are available for finding these boxes. For instance, the Faster R-CNN, introduced in 2015, utilized a specially designed Region Proposal Network (RPN) for this task (Ren et al., 2015). Subsequently, box-specific features are derived from the identified boxes using another neural network. These features are evaluated in two ways: a classifier determines the class of each box, and a bounding box regressor adjusts the boxes to better fit the contained objects.
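When candidate boxes are matched against ground-truth objects, and when the output of a bounding box regressor is assessed, a commonly used overlap measure is Intersection over Union (IoU). A minimal implementation for axis-aligned boxes in corner format (the coordinates below are made-up example values):

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (may be empty, hence the max(0, ...)).
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A candidate box shifted relative to the ground truth overlaps partially.
candidate = (10, 10, 50, 50)      # 40x40 box
ground_truth = (30, 10, 70, 50)   # same size, shifted right by 20 pixels
print(iou(candidate, ground_truth))
```

IoU ranges from 0 (no overlap) to 1 (identical boxes); a threshold on it typically decides whether a candidate box counts as a match for a ground-truth object.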
Subsequent methods refined the features used by the RPN or the way candidate boxes are processed in order to improve classification quality or runtime.
One-stage methods
Since 2016, networks have been developed that combine localization and classification into a single step. These methods belong to the group of one-stage detectors. The motivation for developing these approaches was the high complexity and long runtime of two-stage detectors (Zou et al., 2023). Pioneers in this area include You-Only-Look-Once (YOLO) and the Single-Shot Detector (SSD). Instead of considering arbitrary regions of an image, both methods examine a fixed set of positions in the input image for objects of varying sizes and aspect ratios. This eliminates the need for a separate module to pre-filter regions of interest and allows classification and localization to be optimized and applied jointly.
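The fixed positions a one-stage detector examines can be sketched as a regular grid of anchor (or "default") boxes at several scales and aspect ratios. The grid size, scales, and ratios below are illustrative values, not those of any particular SSD or YOLO version:

```python
# Sketch of the fixed positions examined by a one-stage detector:
# a regular grid of anchor boxes at several scales and aspect ratios.

def make_anchors(image_size, grid, scales, ratios):
    """Generate center-format anchors (cx, cy, w, h) on a regular grid."""
    step = image_size / grid
    anchors = []
    for row in range(grid):
        for col in range(grid):
            # Anchor center at the middle of each grid cell.
            cx, cy = (col + 0.5) * step, (row + 0.5) * step
            for s in scales:
                for r in ratios:
                    # Aspect ratio r = w / h at constant area s * s.
                    w = s * r ** 0.5
                    h = s / r ** 0.5
                    anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(image_size=300, grid=4, scales=[60, 120], ratios=[1.0, 2.0])
# 4x4 grid positions x 2 scales x 2 aspect ratios = 64 anchors
print(len(anchors))
```

For each anchor, the network predicts class scores and small offsets that adjust the box to the nearest object, so localization and classification are trained as one joint task.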
These methods were significantly faster than two-stage approaches at the time of their development. Additionally, SSD achieved accuracy comparable to two-stage object detection methods. Since then, numerous one-stage architectures have been developed, including successors to YOLO, which offer improved runtime and accuracy.
Because one-stage detector networks view the tasks of classification and localization holistically, concepts from other application areas can also be employed, such as transformer networks from the field of natural language processing. Researchers see potential here for a paradigm shift in tackling these problems (Zaidi et al., 2022).
Which object detection method is suitable for which use case?
In general, object detection architectures are developed in several variants tailored to specific use cases and devices. Developing and applying object detection algorithms involves trade-offs between speed, accuracy, and memory usage (Huang et al., 2017). As a result, there is a wide range of models, each offering specific advantages for particular requirements. Deployment on smartphones or cameras, for example, demands small models with few parameters; these deliver quick results, but often at the expense of localization or classification accuracy.
The more complex the architecture, the more demanding the training process becomes. Depending on the use case, it is also essential to consider whether sufficient data is available to train a suitable model to the desired level of accuracy.
In conclusion, there is no universal model for object detection challenges. Instead, for each use case, it is necessary to determine which model delivers the best results given the specific requirements. The distinction between one-stage and two-stage architectures can serve as a guideline for categorizing models in advance. Beyond that, the specific architecture of the chosen approach and the selected backbone network must be considered. For an overview of various object detection models and their pros and cons, the works of Huang et al. and Zaidi et al. can be consulted. They provide a starting point for selecting a suitable architecture for a given application. Additionally, the PapersWithCode website offers an overview of state-of-the-art Machine Learning methods in various research fields, as well as datasets and code from scientific publications.
References
Speed/accuracy trade-offs for modern convolutional object detectors
J. Huang et al., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3296-3297, 2017
A Survey of Deep Learning-Based Object Detection
L. Jiao et al., IEEE Access, vol. 7, pp. 128837-128868, 2019
A survey of modern deep learning based object detection models
S.S.A. Zaidi et al., Digital Signal Processing, vol. 126, 2022
Object detection in 20 years: A survey
Z. Zou et al., Proceedings of the IEEE, 2023