

Brief Introduction

The solution provides the following:

  • Processing speed of up to 500 FPS on CPU when tracking a single face.

  • Ability to track any number of faces at once and assign a unique ID to each of them. 

  • Ability to measure the tracking error, so the system can tell if it loses tracking. 

  • Ability to resize the tracking rectangle when the object's distance from the camera changes.

  • Ability to track objects stably over a long period of time. 


The solution is not based on a pre-existing tracking algorithm; I wrote a completely new one. It is a special-purpose tracking solution. It is not suitable for tracking arbitrary objects, only for tracking things that are well structured and that the underlying model has been trained on. In this case I trained the model on faces, but it could just as well have been the structure of a bee or something similar.

Business case

Face detection is a slow, expensive process. With this solution, face detection needs to be performed only every few seconds. This frees up processor time and allows multiple faces to be tracked in real time even on low-performance hardware such as single-board computers (e.g. Raspberry Pi) or other embedded systems like cameras.
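The idea of running detection only every few seconds and relying on tracking in between can be sketched as a small scheduler. This is an illustrative sketch, not the solution's actual code: `DetectionScheduler`, `should_detect`, the `tracking_lost` flag and the interval value are hypothetical names chosen for the example.

```cpp
#include <cassert>

// Hypothetical scheduler: decides, per frame, whether the slow face
// detector should run or whether the fast tracker alone is enough.
class DetectionScheduler {
public:
    // interval_s: seconds between two detection runs (assumed value, e.g. 2.0)
    explicit DetectionScheduler(double interval_s)
        : interval_s_(interval_s), last_detect_s_(-1.0) {
        assert(interval_s > 0.0);
    }

    // Returns true when full detection should run on the frame at time now_s.
    // tracking_lost forces an early re-detection.
    bool should_detect(double now_s, bool tracking_lost) {
        const bool never_ran = last_detect_s_ < 0.0;
        const bool due = never_ran || (now_s - last_detect_s_) >= interval_s_;
        if (due || tracking_lost) {
            last_detect_s_ = now_s;  // remember when detection last ran
            return true;
        }
        return false;  // tracker only on this frame
    }

private:
    double interval_s_;
    double last_detect_s_;  // negative means detection has never run
};
```

In a frame loop, the caller would run the detector when `should_detect` returns true and only update the tracker otherwise.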

Technical details

Input (video or image sources the system can process):

  • mjpeg stream

  • rtsp stream

  • USB camera devices

  • video files (avi, mp4, mkv formats supported)

  • standalone image files (.png, .jpg formats supported)


Output:

  • Processed video frame

  • The faces in the frame (bounding boxes)

  • For each face:

    • Unique Tracking ID (when processing a video file, the same ID on each frame belongs to the same person)

    • 5 facial landmark points

  • The system is able to write the processed video to a video file.
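To make the per-frame output concrete, it could be modelled roughly as below. This is a sketch under assumptions: the type and field names (`TrackedFace`, `FrameResult`, `find_face`, and so on) are hypothetical and do not come from the solution's real interface. Only the content (bounding box, unique tracking ID, 5 landmark points per face) follows the list above.

```cpp
#include <array>
#include <cstdint>
#include <vector>

struct Point2f { float x; float y; };                 // pixel coordinates
struct Rect { int x; int y; int width; int height; }; // bounding box

// One tracked face in one frame (hypothetical layout).
struct TrackedFace {
    std::uint64_t tracking_id;         // same ID on every frame = same person
    Rect box;                          // bounding box of the face
    std::array<Point2f, 5> landmarks;  // 5 facial landmark points
};

// Everything reported for a single processed frame.
struct FrameResult {
    std::uint64_t frame_index;
    std::vector<TrackedFace> faces;
};

// Looks up a face by tracking ID; returns nullptr if it is not in the frame.
const TrackedFace* find_face(const FrameResult& result, std::uint64_t id) {
    for (const TrackedFace& face : result.faces) {
        if (face.tracking_id == id) {
            return &face;
        }
    }
    return nullptr;
}
```

Following one person through a video then amounts to calling `find_face` with the same ID on each frame's result.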

The 5 facial landmark points returned for each face are listed by index later in this section.

The demo video was recorded on an HP Laptop 15-DA0042NH (processor: Intel(R) Core(TM) i7-8550U CPU, RAM: 8 GB).

With visualization on:

It used 450 MB of RAM and the CPU usage was 30% during the recording.

Without visualization:

It used 380 MB of RAM and the CPU usage was 25% during the recording.


The input video was captured using a Xiaomi CMSXJ22A web camera. The input resolution was 1080p.

During recording, the system processing speed was about 500 FPS. When processing a single face, the system can maintain this speed on this hardware. When processing multiple faces, the system may be slower. The visualization was added to the video afterwards. The visualization in the video can be done live, but may slow down processing. 
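As a quick sanity check on what 500 FPS means in practice, the rate can be converted into a per-frame time budget. The helper name `frame_time_ms` is made up for this example.

```cpp
#include <cassert>

// Converts a processing rate in frames per second into the time
// spent per frame, in milliseconds.
double frame_time_ms(double fps) {
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// frame_time_ms(500.0) is 2.0, i.e. about 2 ms of processing per frame.
```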

The system is written entirely in C++ and uses the following libraries/technologies:


Facial landmark points, by index:

0. the tip of the nose

1. corner of right eye

2. corner of left eye

3. right corner of the mouth

4. left corner of the mouth
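In client code, the landmark indices above could be given names so that call sites do not rely on bare numbers. This is only a sketch: the enumerator names and `landmark_name` are hypothetical, and only the index-to-point mapping is taken from the list.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Index of each landmark point, in the order listed above.
enum Landmark : std::size_t {
    kNoseTip = 0,
    kRightEyeCorner = 1,
    kLeftEyeCorner = 2,
    kRightMouthCorner = 3,
    kLeftMouthCorner = 4,
    kLandmarkCount = 5
};

// Returns a human-readable name for a landmark index.
std::string landmark_name(std::size_t index) {
    static const char* const names[kLandmarkCount] = {
        "tip of the nose",
        "corner of right eye",
        "corner of left eye",
        "right corner of the mouth",
        "left corner of the mouth"
    };
    assert(index < kLandmarkCount);
    return names[index];
}
```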

Average sample error: 7.99014 pixels
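The page does not say how this error is computed. One common way to measure landmark error, shown here purely as an assumption, is the mean Euclidean distance in pixels between predicted points and reference (ground-truth) points. The names `Pt` and `mean_point_error` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Pt { double x; double y; };  // pixel coordinates

// Mean Euclidean distance between corresponding predicted and reference
// points. Assumed metric; the solution's own definition may differ.
double mean_point_error(const std::vector<Pt>& predicted,
                        const std::vector<Pt>& reference) {
    assert(!predicted.empty());
    assert(predicted.size() == reference.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < predicted.size(); ++i) {
        sum += std::hypot(predicted[i].x - reference[i].x,
                          predicted[i].y - reference[i].y);
    }
    return sum / static_cast<double>(predicted.size());
}
```

For example, two points with distances of 0 and 5 pixels to their references give a mean error of 2.5 pixels.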
