
With the increasing importance of automation and digitalization, the field of AI-based computer vision offers enormous potential for industrial applications. Real-time tracking, which aims to continuously track moving objects in video streams, is particularly noteworthy here. Numerous industries are already benefiting from this technology, as it enables precise analyses, automated decisions and immediate reactions to changes in dynamic environments. For example, real-time tracking in the field of video analysis is already being used to analyze traffic flows at airports or to improve training methods and competition analysis in professional sports.
What is multi-object tracking (MOT)?
Object tracking is a process in which the movement of one or more objects is continuously tracked. With multi-object tracking (MOT), the position and identity of several objects are determined simultaneously across a video sequence. If several cameras are used for this, it is referred to as multi-camera multi-object tracking (MCMOT). In computer vision-based MOT, objects must first be detected in individual video frames with the help of object detection and then linked together across several frames to form so-called tracks. The central challenge here is the correct assignment of the objects over time and space.
Technical challenges in MCMOT
In contrast to tracking on individual cameras, MCMOT poses special challenges. Usually, the cameras' fields of view overlap partially or not at all. In the event of a partial overlap, cameras must be precisely calibrated and synchronized to clearly assign objects from different angles. If the fields of view do not overlap, a so-called re-identification (re-id) is necessary, in which objects are recognized across different camera streams based on visual characteristics. Technically, this re-identification is typically based on deep learning models that extract specific features (feature embeddings) from the visual data and compare them with each other.
Tracking is made even more difficult by similar objects, such as several trucks of the same type from the same carrier. Occlusions or visual similarities can quickly lead to incorrect assignments. Comprehensive camera placement and powerful algorithms are therefore crucial for robust implementation.
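The comparison of feature embeddings described above can be illustrated with a minimal sketch. It assumes that a re-id model has already produced the embeddings; the three-dimensional vectors, the names `truck_A` and `truck_B`, the function `re_identify` and the threshold of 0.8 are all hypothetical and chosen purely for illustration. Cosine similarity is one common way to compare embeddings, not necessarily the one used in any particular system.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def re_identify(query, gallery, threshold=0.8):
    """Return the ID of the most similar known embedding.

    Returns None if no gallery entry reaches the (assumed) threshold,
    in which case the object would be treated as new.
    """
    best_id, best_score = None, threshold
    for track_id, embedding in gallery.items():
        score = cosine_similarity(query, embedding)
        if score >= best_score:
            best_id, best_score = track_id, score
    return best_id


# Hypothetical embeddings of objects already seen by camera 1.
gallery = {
    "truck_A": [0.9, 0.1, 0.0],
    "truck_B": [0.0, 0.2, 0.95],
}

# Embedding of an object that appears in camera 2.
match = re_identify([0.85, 0.15, 0.05], gallery)

# An embedding that resembles neither known object.
unknown = re_identify([0.5, -0.7, 0.5], gallery)

print(match, unknown)
```

The sketch also hints at why visually similar vehicles are difficult: if two trucks produce almost identical embeddings, both exceed the threshold and the similarity alone no longer separates them.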
Multi-object tracking methods
To address the challenges of multi-object tracking, especially in demanding environments such as yard logistics, a range of algorithmic approaches have been developed. These differ in complexity, accuracy, and computational requirements – yet they all aim to ensure robust object association across frames and camera perspectives. One method in particular has emerged as the current gold standard in computer vision-based tracking: tracking-by-detection.
Tracking-by-detection
Advances in the field of deep learning have established “tracking-by-detection” as the dominant method in MOT. This approach comprises three main steps:
- Object detection: Objects are identified and localized in each video frame using neural detection models such as YOLO or RT-DETR. These detections typically contain the position and dimension of the objects in the form of bounding boxes or segmentations as well as a confidence score that indicates the reliability of the detection.
- Data association (matching): The detected objects are then assigned (matched) across successive frames based on similarity measures. The Hungarian method is particularly popular for matching: it computes the globally optimal assignment between all objects of two consecutive frames.
- Track management: In the last step, so-called tracks are created, updated or terminated, depending on whether the assignment between previous and current detections is successful.
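The association and track management steps above can be sketched as follows. This is a simplified illustration under several assumptions: bounding boxes are `(x1, y1, x2, y2)` tuples, intersection over union (IoU) serves as the similarity measure, and the optimal assignment is found by exhaustive search instead of the Hungarian method, which is only feasible for a handful of objects. In practice a solver such as `scipy.optimize.linear_sum_assignment` would typically take its place. The function names and the thresholds `min_iou` and `max_missed` are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(track_boxes, det_boxes, min_iou=0.3):
    """Find the assignment with the highest total IoU by exhaustive search.

    Stand-in for the Hungarian method; only suitable for tiny inputs.
    Returns a list of (track_index, detection_index) pairs.
    """
    n, m = len(track_boxes), len(det_boxes)
    best_pairs, best_score = [], -1.0
    for perm in permutations(range(max(n, m)), n):
        # Slots >= m mean "track stays unmatched"; weak overlaps are rejected.
        pairs = [(t, d) for t, d in enumerate(perm)
                 if d < m and iou(track_boxes[t], det_boxes[d]) >= min_iou]
        score = sum(iou(track_boxes[t], det_boxes[d]) for t, d in pairs)
        if score > best_score:
            best_pairs, best_score = pairs, score
    return best_pairs


def update_tracks(tracks, detections, next_id, max_missed=2):
    """Create, update or terminate tracks for one frame; returns the next free ID."""
    ids = list(tracks)
    pairs = associate([tracks[i]["box"] for i in ids], detections)
    matched_ids = {ids[t] for t, _ in pairs}
    matched_dets = {d for _, d in pairs}
    for t, d in pairs:  # update matched tracks
        tracks[ids[t]] = {"box": detections[d], "missed": 0}
    for i in ids:  # age unmatched tracks and terminate stale ones
        if i not in matched_ids:
            tracks[i]["missed"] += 1
            if tracks[i]["missed"] > max_missed:
                del tracks[i]
    for d, box in enumerate(detections):  # create tracks for new objects
        if d not in matched_dets:
            tracks[next_id] = {"box": box, "missed": 0}
            next_id += 1
    return next_id


# Two objects move slightly; the second is not detected in the last frame.
frames = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 0, 11, 10), (51, 50, 61, 60)],
    [(2, 0, 12, 10)],
]
tracks, next_id = {}, 0
for detections in frames:
    next_id = update_tracks(tracks, detections, next_id)
print(tracks)
```

After the three frames, both tracks still exist: the first has been updated in every frame, while the second has one missed frame and would only be terminated after exceeding `max_missed`.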
Established procedures are used to implement the tracking-by-detection method:
The Kalman filter
A Kalman filter is often used to improve the reliability of matching. In this iterative process, the future object state (e.g. position and speed) is first predicted (prediction) and then corrected using new measurement data (update). In this way, tracking can function reliably even in the event of partial occlusions or noise.
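The predict and update cycle can be made concrete with a small sketch. It assumes a one-dimensional constant-velocity model with the state (position, velocity), in which only the position is measured. Trackers usually filter the full bounding box, so this is a deliberately reduced illustration; the class name and the noise values `variance`, `q` and `r` are assumptions.

```python
class ConstantVelocityKalman:
    """1D Kalman filter with state [position, velocity]; position is measured."""

    def __init__(self, position, velocity=0.0, variance=10.0, q=0.01, r=0.1):
        self.position = position
        self.velocity = velocity
        # State covariance as a 2x2 matrix (rows/columns: position, velocity).
        self.p = [[variance, 0.0], [0.0, variance]]
        self.q = q  # process noise added to the diagonal in every prediction
        self.r = r  # measurement noise variance

    def predict(self, dt=1.0):
        """Propagate state and covariance one time step (P = F P F^T + Q)."""
        self.position += self.velocity * dt
        (p00, p01), (p10, p11) = self.p
        self.p = [
            [p00 + dt * (p01 + p10) + dt * dt * p11 + self.q, p01 + dt * p11],
            [p10 + dt * p11, p11 + self.q],
        ]
        return self.position

    def update(self, measured_position):
        """Correct the prediction with a new position measurement."""
        (p00, p01), (p10, p11) = self.p
        innovation = measured_position - self.position
        s = p00 + self.r              # innovation variance
        k0, k1 = p00 / s, p10 / s     # Kalman gain for position and velocity
        self.position += k0 * innovation
        self.velocity += k1 * innovation
        self.p = [
            [(1 - k0) * p00, (1 - k0) * p01],
            [p10 - k1 * p00, p11 - k1 * p01],
        ]


# An object moves one unit per frame; in the fourth frame it is occluded (None).
kf = ConstantVelocityKalman(position=0.0)
predictions = []
for z in [1.0, 2.0, 3.0, None, 5.0, 6.0]:
    predictions.append(kf.predict())
    if z is not None:
        kf.update(z)

print(predictions[3], kf.position, kf.velocity)
```

During the occluded frame the filter has no measurement, yet its prediction stays close to the true position of 4 because the velocity has already been estimated from the earlier frames. This predicted position is what the matching step can compare against once the object reappears.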
Concrete tracking algorithms (trackers)
- SORT (Simple Online and Real-time Tracking) combines detections with a Kalman filter and the Hungarian matching method.
- DeepSORT extends SORT with visual re-id features from neural networks and thus improves the matching of temporarily hidden objects.
- The ByteTrack approach is suitable for detecting and tracking objects that are difficult to recognize in complex scenarios. In this approach, high-confidence detections are associated first, and then uncertain detections are integrated to find lost objects.
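The two-stage association that distinguishes ByteTrack from the other trackers in this list can be sketched roughly as follows. The sketch makes several simplifying assumptions: matching is greedy by IoU rather than via the Hungarian method, track boxes are used directly instead of Kalman-predicted boxes, and the thresholds `high`, `low` and `min_iou` as well as the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(track_boxes, detections, min_iou):
    """Pair tracks and detections in order of descending IoU.

    track_boxes: dict track_id -> box; detections: list of (box, score).
    Returns a dict track_id -> index into detections.
    """
    candidates = sorted(
        ((iou(box, det[0]), tid, d)
         for tid, box in track_boxes.items()
         for d, det in enumerate(detections)),
        reverse=True,
    )
    matches, used = {}, set()
    for overlap, tid, d in candidates:
        if overlap < min_iou:
            break  # all remaining candidates overlap even less
        if tid not in matches and d not in used:
            matches[tid] = d
            used.add(d)
    return matches


def byte_associate(track_boxes, detections, high=0.6, low=0.1, min_iou=0.3):
    """Associate high-confidence detections first, then low-confidence ones."""
    high_dets = [d for d in detections if d[1] >= high]
    low_dets = [d for d in detections if low <= d[1] < high]

    # Stage 1: confident detections are matched against all tracks.
    first = greedy_match(track_boxes, high_dets, min_iou)

    # Stage 2: uncertain detections may only recover tracks left unmatched.
    remaining = {t: b for t, b in track_boxes.items() if t not in first}
    second = greedy_match(remaining, low_dets, min_iou)

    result = {tid: high_dets[d] for tid, d in first.items()}
    result.update({tid: low_dets[d] for tid, d in second.items()})
    return result


tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
detections = [
    ((1, 0, 11, 10), 0.9),         # clearly visible object
    ((51, 50, 61, 60), 0.3),       # partially occluded, low confidence
    ((200, 200, 210, 210), 0.2),   # low confidence and no nearby track
]
result = byte_associate(tracks, detections)
print(result)
```

Track 2 is kept alive by a detection that a plain confidence threshold would have discarded, while the isolated low-confidence detection matches no track and is ignored.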
In addition, there are methods that are specially designed for non-real-time-critical analyses and optimize the data association globally across the entire video sequence.
Transformer-based approaches: TrackFormer
Transformer-based methods differ from classic tracking-by-detection approaches in that they explicitly incorporate global context information from the entire scene and enable end-to-end training.
One example of this approach is TrackFormer, which integrates detection and tracking in a single, continuous neural architecture (“Joint Detection and Tracking”). Instead of a separate matching optimization, TrackFormer uses so-called queries to detect new objects as well as to consistently track already known objects.

Transformer-based methods promise improved robustness and accuracy, especially in complex MOT scenarios, and are characterized by their ability to work without explicit matching procedures. However, they require explicit MOT datasets for training and usually have increased data and hardware resource requirements.
Computer vision-based tracking in yard logistics
Real-time tracking unfolds its full potential especially when combined with a digital twin. A digital twin creates a virtual image of the physical environment that is continuously fed with current data. This allows the movements and positions of objects to be tracked in real-time and key performance indicators (KPIs) to be recorded and analyzed. In logistics, for example, this relates to warehouse capacity utilization, material flows or transport times. These KPIs provide valuable insights into process flows and enable data-driven optimization, allowing companies to react faster, plan more efficiently and continuously improve their processes.
Challenges in yard logistics
Specific logistical challenges regularly arise in large depots:
- Trucks are not parked at the designated loading ramps and may therefore be loaded incorrectly, which can lead to high consequential costs.
- Pure GPS-based systems are often not sufficient to ensure a complete and precise overview, especially for mixed fleets of own and third-party vehicles.
- Bottlenecks and congestion on the company premises often go unrecognized until they have a significant impact on operations.
- Manual processes, such as time-consuming searches and stocktaking, cause additional costs.
Solutions through computer vision and edge computing
To address these challenges early and automatically, it is possible to rely on camera-based monitoring systems. In this context, the Yard Lense on Edge research project was developed specifically for use in yard logistics (see also: “Logistics Today”).
So-called “edge devices” are used here: camera modules with integrated Nvidia Jetson devices that carry out tracking-by-detection directly on site using a YOLO detector and the ByteTrack method. Because only the results of these local analyses are subsequently combined into a central overview, the amount of data transmitted is reduced, which improves data protection and latency and makes the system easier to scale.
A global Kalman filter is also used to improve positioning accuracy and to optimally merge the individual camera perspectives. Merging the data from different cameras makes the predicted vehicle positions more reliable, so the system remains robust even in the event of short-term occlusions or misidentifications. Trucks are thus reliably detected, localized and tracked across multiple cameras in real-time.
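Why merging several camera perspectives improves the position estimate can be illustrated with a strongly reduced example. It is not the project's filter, whose details are not described here. It only shows the underlying principle of a Kalman update for a stationary object: measurements are weighted by the inverse of their variance. The function name `fuse_positions`, the ground-plane coordinates and the variances are assumptions.

```python
def fuse_positions(measurements):
    """Fuse ground-plane positions from several cameras.

    measurements: list of ((x, y), variance), where the variance expresses
    how uncertain a camera's position estimate is (assumed to be known).
    Returns the fused (x, y) and the variance of the fused estimate.
    """
    weights = [1.0 / variance for _, variance in measurements]
    total = sum(weights)
    x = sum(w * pos[0] for w, (pos, _) in zip(weights, measurements)) / total
    y = sum(w * pos[1] for w, (pos, _) in zip(weights, measurements)) / total
    return (x, y), 1.0 / total


# Camera A sees the truck up close (low variance), camera B from far away.
fused_position, fused_variance = fuse_positions([
    ((10.0, 20.0), 1.0),
    ((12.0, 20.0), 4.0),
])
print(fused_position, fused_variance)
```

The fused estimate lies closer to the more reliable camera, and its variance is lower than that of either camera alone, which is the effect a global filter exploits when several cameras observe the same vehicle.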
MCMOT in yard logistics: insights and potentials
Together with the logistics service provider Dachser as a practical partner, specific requirements from the industry were collected in the research project and an initial proof of concept was carried out using real data. The project partners succeeded in gaining important insights into the ideal camera positioning for the deployment location, the weather resistance of the modules and the synchronization of the edge devices.
The project results include an open-source reference implementation for multi-camera tracking in yard logistics and a Blender pipeline for the automatic generation of MCMOT data sets.
Future research work will deepen the knowledge gained from this project and integrate specific challenges into synthetic data sets. This includes:
- the scaling and coverage of entire yard areas (15+ cameras),
- dealing with difficult visibility conditions (e.g. rain, fog),
- the optimal calibration and synchronization of cameras on extensive premises,
- the evaluation of the most accurate and cost-effective MCMOT methods for logistical use cases (e.g. TrackFormer vs. multi-stage approaches),
- as well as the robust re-identification and association of visually similar or identical vehicles.
The development of such data sets makes it possible to make future algorithms even more efficient and practical and to ensure broad accessibility for further research purposes.
Object Tracking: Developments and Future Directions
Multi-object tracking has made significant progress in recent years, both methodologically and in terms of industrial application. The continuous development from classic tracking-by-detection methods through to end-to-end approaches based on transformers opens new potential in terms of accuracy, robustness and efficiency.
At the same time, key challenges remain: precise association in complex scenes, synchronization and calibration across multiple cameras and reliable processing in real-time under real conditions. The use of additional sensor technology such as LiDAR or GPS, the further development of powerful deep learning models and the creation of practical simulation and benchmark data sets are crucial to overcoming these hurdles.
For the industry, this means MOT technologies are increasingly becoming a viable component of modern automation and digitalization strategies. Concrete applications, such as in yard logistics, show that real-time tracking is not only technologically feasible but also delivers clear economic benefits. With targeted research and close cooperation between science and practice, this trend can be further expanded.
Further information:
There are popular data set challenges for benchmarking tracking algorithms, such as the Multiple Object Tracking Benchmark (MOT Challenge) or the AI City Challenge.
Special metrics are used for the evaluation itself, which look at different aspects of tracking: Understanding Object Tracking Metrics.
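As one example of such a metric, the widely used MOTA score (Multiple Object Tracking Accuracy) combines missed objects, false alarms and identity switches into a single number. The sketch below assumes that these error counts have already been determined by comparing tracker output with ground truth; the counts in the example are invented for illustration.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT.

    All counts are summed over every frame of the sequence. The score can
    become negative if the tracker makes more errors than there are objects.
    """
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Invented counts for a short sequence with 100 ground-truth boxes in total.
score = mota(false_negatives=10, false_positives=5, id_switches=2,
             num_ground_truth=100)
print(round(score, 2))
```

Because MOTA is dominated by detection errors, it is usually reported together with metrics that focus on identity consistency.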