One of the most interesting questions in computer vision is how to create high-quality training data for tasks where collecting, annotating, or legally handling real-world data is difficult.
Driver Monitoring Systems, or DMS, are a perfect example. In this domain, we often need video data of a driver inside a vehicle cabin so that machine vision models can learn to detect fatigue, drowsiness, phone usage, smoking, hand position, seatbelt usage, gaze direction, or other driver states.
The challenge is that real videos of drivers are not easy to collect at scale. They are difficult to annotate precisely, and they can also be sensitive from a privacy and personality-rights perspective.
This is why I started experimenting with whether fully synthetic, yet practically useful, DMS training data can be generated with modern generative video models.
What was the goal?
The goal was not simply to generate a visually impressive video of a driver. The goal was to create a data structure that could later be used for training computer vision models:
realistic video
+ semantic segmentation mask
+ instance segmentation mask
= automatically processable training data
In the DMS example, the realistic video shows a driver sitting behind the steering wheel and gradually becoming drowsy. For that video, I wanted to generate:
1. RGB video: the realistic scene
2. Semantic mask: which semantic class each pixel belongs to
3. Instance mask: which individual object or object part each pixel belongs to
This is interesting because, once both the semantic mask and instance mask are available, object detector training data can be extracted automatically frame by frame.
The final pipeline
The experiment eventually converged into a simple three-step workflow:
1. generate the realistic RGB video
2. generate a semantic segmentation mask video from it
3. generate an instance segmentation mask video from it
In other words, the RGB video is created first, and then the same video is used as the source for the two annotation videos.
This order is practical because the realistic video, the semantic mask, and the instance mask all refer to the same scene. Later, they can be aligned frame by frame:
frame_000001_rgb.png
frame_000001_semantic.png
frame_000001_instance.png
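Once the three videos are exported as per-frame images with this naming scheme, the triplets can be paired programmatically. A minimal sketch, assuming frames follow the `frame_<index>_<kind>.png` pattern shown above (the function name and directory layout are illustrative):

```python
from pathlib import Path

def aligned_triplets(frame_dir):
    """Pair RGB, semantic, and instance frames that share the same index."""
    frames = {}
    for p in Path(frame_dir).glob("frame_*_*.png"):
        # e.g. "frame_000001_rgb" -> index "000001", kind "rgb"
        _, idx, kind = p.stem.split("_")
        frames.setdefault(idx, {})[kind] = p
    # keep only indices where all three modalities are present
    return {
        idx: parts for idx, parts in sorted(frames.items())
        if {"rgb", "semantic", "instance"} <= parts.keys()
    }
```

Indices missing one of the three modalities are dropped, which already acts as a first filter against incomplete exports.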
Original realistic DMS source video
A synthetic driver monitoring scene where the driver gradually becomes drowsy while driving.
Step 1: generating the realistic DMS video
In the first step, I generated a completely new synthetic video.
The scene takes place inside a vehicle cabin from a driver monitoring camera perspective. The camera is fixed, and the frame includes the driver’s upper body, head, hands, steering wheel, dashboard, seat, headrest, seatbelt, and windows.
The driver starts awake, then gradually becomes drowsy:
- heavy eyelids
- slower blinking
- wandering gaze
- head nodding forward
- relaxed posture
- hands remaining near the steering wheel
It was important that the scene should not look like a dramatic accident sequence. I wanted a realistic driver state that is useful from a DMS perspective, not a cinematic crash scenario.
Original prompt for Step 1 — Realistic DMS video generation
Generate a completely new photorealistic synthetic video from scratch.
Do not use any input video.
Do not use any real person, celebrity, public figure, or recognizable identity.
Create a fictional adult driver.
SCENE:
A realistic driver monitoring camera view inside a modern passenger car.
The camera is fixed near the dashboard or A-pillar, looking at the driver from the front-left side.
The frame clearly shows the driver’s face, head, neck, upper torso, both arms, both hands, steering wheel, dashboard, driver seat, headrest, seatbelt, side window, windshield area, and part of the cabin interior.
ACTION:
The driver starts awake and driving normally.
Over the video, the driver gradually becomes drowsy and starts falling asleep at the wheel.
The drowsiness should be clear and realistic:
- heavy eyelids
- slower blinking
- reduced alertness
- gaze drifting away from the road
- head slowly nodding downward
- shoulders becoming slightly slouched
- head tilting forward or sideways
- both hands remain near or on the steering wheel
- sleepy posture develops gradually
The scene must remain non-violent.
No crash.
No sudden impact.
No injury.
No panic.
No emergency event.
CAMERA AND MOTION:
Fixed driver monitoring camera.
No camera movement.
No zoom.
No cuts.
No scene changes.
Stable lighting.
Stable vehicle interior.
Realistic human motion.
Natural in-cabin lighting.
The driver, steering wheel, seatbelt, dashboard, seat, and windows must remain spatially consistent throughout the video.
VISUAL STYLE:
Photorealistic.
Live-action driver monitoring footage.
Realistic car interior.
Realistic skin, hair, clothing, hands, and facial motion.
No cartoon style.
No CGI look.
No text.
No subtitles.
No labels.
No watermark.
No split-screen.
OUTPUT:
A single photorealistic driver monitoring video of a fictional drowsy driver gradually falling asleep at the wheel.
Step 2: semantic segmentation mask
In the second step, I generated a semantic segmentation mask video from the same realistic source video.
Semantic segmentation assigns every pixel to a semantic class. In this case, each class is represented by a color.
Useful DMS classes can include:
- face, head, and neck skin
- hands and fingers
- hair
- clothing
- steering wheel
- seatbelt
- dashboard
- center console
- seat
- headrest
- window or windshield glass
- background or unknown regions
This mask is useful because it tells us what type of object or region each pixel belongs to.
From a DMS perspective, some especially important questions are:
- where the driver’s head is located
- where the hands are located
- where the steering wheel is located
- where the seatbelt crosses the body
- which regions belong to the vehicle interior
Semantic segmentation mask video
The semantic mask generated from the realistic DMS video. This works well as a side-by-side comparison with the original RGB video.
Original prompt for Step 2 — Semantic segmentation mask generation
Use the uploaded video as the only input video.
The uploaded video is a photorealistic driver monitoring video generated on-platform.
Create a frame-by-frame SEMANTIC SEGMENTATION MASK video from the uploaded video.
IMPORTANT:
This is semantic segmentation.
Colors represent semantic classes.
This is not instance segmentation.
OUTPUT FORMAT:
Output only the semantic segmentation mask video.
Do not output the original photorealistic video.
Do not output split-screen.
Do not add text, labels, captions, legends, borders, arrows, watermarks, or overlays.
TEMPORAL AND GEOMETRIC CONSTRAINTS:
- Preserve the exact duration of the uploaded video.
- Preserve the same frame rate as closely as possible.
- Preserve the same frame count as closely as possible.
- Preserve the same resolution and aspect ratio.
- Preserve the same camera viewpoint.
- The segmentation mask must be pixel-aligned with the uploaded video.
- Follow the visible object boundaries as accurately as possible.
- Preserve object shapes, positions, motion, and occlusions.
- The same semantic class must keep the same color across all frames.
- Minimize flicker.
SEMANTIC COLOR PALETTE:
Use only these exact flat solid RGB colors:
BACKGROUND_UNKNOWN = RGB(0, 0, 0)
Exterior scenery outside the vehicle, road, sky, buildings, outside vehicles, ambiguous regions, unknown regions, and non-annotated background.
FACE_HEAD_NECK_SKIN = RGB(0, 255, 0)
The driver’s visible face, ears, head skin, and neck skin.
HANDS_FINGERS_SKIN = RGB(255, 255, 0)
The driver’s visible left hand, right hand, fingers, and exposed hand skin.
HAIR = RGB(255, 128, 0)
The driver’s visible hair.
CLOTHING = RGB(255, 0, 255)
The driver’s shirt, jacket, sleeves, and visible clothed torso or arms.
STEERING_WHEEL = RGB(255, 0, 0)
The steering wheel and visible steering wheel spokes.
SEATBELT = RGB(128, 255, 0)
The seatbelt crossing the driver and visible seatbelt anchor regions.
DASHBOARD = RGB(0, 0, 255)
Dashboard and fixed front cabin surface.
CENTER_CONSOLE = RGB(0, 255, 255)
Center console, gear selector area, and middle cabin controls.
SEAT = RGB(128, 0, 255)
Driver seat, visible passenger seat parts, and seat upholstery.
HEADREST = RGB(255, 128, 128)
Driver headrest and visible headrest surfaces.
WINDOW_GLASS = RGB(64, 0, 128)
Windshield and side window glass regions.
MIRROR_OR_PILLAR = RGB(192, 192, 192)
Rearview mirror, A-pillar, visible cabin pillars, and mirror structures.
MASK RULES:
- Every visible pixel must belong to exactly one semantic class.
- Use flat solid colors only.
- No gradients.
- No shading.
- No texture.
- No transparency.
- No anti-aliasing.
- No soft edges.
- No blended colors.
- No realistic appearance.
- No compression-like color variation.
- Boundaries must be hard and clean.
OCCLUSION RULES:
- If a hand overlaps the steering wheel, visible hand pixels are HANDS_FINGERS_SKIN and visible steering wheel pixels are STEERING_WHEEL.
- If the seatbelt crosses the driver’s clothing, visible seatbelt pixels are SEATBELT.
- If the steering wheel occludes the driver, only visible driver pixels receive driver-related classes.
- If dashboard or interior objects occlude the driver, visible interior pixels keep their own semantic class.
OUTPUT:
A clean, temporally stable semantic segmentation mask video aligned frame-by-frame with the uploaded photorealistic driver monitoring video.
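In practice, generated masks rarely come out with perfectly flat colors: compression and soft edges introduce off-palette pixel values even when the prompt forbids them. A useful post-processing step is to snap every pixel to the nearest color in the 13-class palette defined above. A minimal NumPy sketch (the palette is copied from the prompt; the function name is mine):

```python
import numpy as np

# The 13-class palette from the prompt above (class name -> RGB).
PALETTE = {
    "BACKGROUND_UNKNOWN": (0, 0, 0),
    "FACE_HEAD_NECK_SKIN": (0, 255, 0),
    "HANDS_FINGERS_SKIN": (255, 255, 0),
    "HAIR": (255, 128, 0),
    "CLOTHING": (255, 0, 255),
    "STEERING_WHEEL": (255, 0, 0),
    "SEATBELT": (128, 255, 0),
    "DASHBOARD": (0, 0, 255),
    "CENTER_CONSOLE": (0, 255, 255),
    "SEAT": (128, 0, 255),
    "HEADREST": (255, 128, 128),
    "WINDOW_GLASS": (64, 0, 128),
    "MIRROR_OR_PILLAR": (192, 192, 192),
}

def snap_to_palette(mask):
    """Map every pixel of an (H, W, 3) uint8 mask to the nearest palette color."""
    colors = np.array(list(PALETTE.values()), dtype=np.int32)  # (K, 3)
    flat = mask.reshape(-1, 3).astype(np.int32)                # (N, 3)
    # squared distance from each pixel to each palette entry
    dists = ((flat[:, None, :] - colors[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    return colors[nearest].reshape(mask.shape).astype(np.uint8)
```

After snapping, the mask satisfies the "flat solid colors only" rule by construction, which makes all downstream per-class lookups exact comparisons instead of fuzzy matches.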
Step 3: instance segmentation mask
In the third step, I generated an instance segmentation mask from the same RGB video.
Here the main question is not just what class a pixel belongs to, but which specific object or object part it belongs to.
This is important because, for example, both hands may belong to the same semantic class, but at instance level I want to treat them as separate regions.
Useful DMS instances can include:
- face, head, and neck region
- left hand
- right hand
- hair
- upper clothing
- steering wheel
- seatbelt
- dashboard
- seat
- headrest
- center console
The instance mask makes it possible to handle individual objects or object parts separately.
Instance segmentation mask video
The instance mask generated from the same realistic video. The left hand, right hand, head, steering wheel, seatbelt, and other relevant objects should appear as separate regions.
Original prompt for Step 3 — Instance segmentation mask generation
Use the uploaded video as the only input video.
The uploaded video is a photorealistic driver monitoring video generated on-platform.
Create a frame-by-frame INSTANCE SEGMENTATION MASK video from the uploaded video.
IMPORTANT:
This is instance segmentation.
This is not semantic segmentation.
This is not class-color segmentation.
In the output, each unique color represents one visible instance ID.
Colors must not represent semantic classes.
Do not use a predefined fixed color palette.
Do not assign one color per class.
Do not make all skin regions the same color.
Do not make all clothing regions the same color.
Do not make all vehicle interior parts the same color.
Do not merge separated visible objects or separated visible body parts just because they belong to the same semantic class.
OUTPUT FORMAT:
Output only the instance segmentation mask video.
Do not output the original photorealistic video.
Do not output split-screen.
Do not add text, labels, legends, arrows, borders, watermarks, or overlays.
INSTANCE DEFINITION:
Every visually separate foreground object or visible object-part instance must receive its own unique solid color.
This is a visible-instance mask:
- separated visible regions should be separated into different instance colors
- different physical objects must have different colors
- different visible body parts must have different colors
- touching objects must remain separate if they are different objects or body parts
CORE EXAMPLES:
- The driver’s face/head/neck region must have its own unique color.
- The left hand must have a different unique color.
- The right hand must have another different unique color.
- The left hand and right hand must never share the same color.
- Hair must have its own unique color.
- Torso clothing must have its own unique color.
- Left sleeve and right sleeve must have different colors if visually separated.
- The steering wheel must have its own unique color.
- The seatbelt must have its own unique color.
- The dashboard must have its own unique color.
- The center console must have its own unique color if visible.
- The driver seat must have its own unique color.
- The headrest must have its own unique color if visually separable.
- The side window must have its own unique color if annotated.
- The windshield must have its own unique color if annotated.
- The mirror, A-pillar, and door panel must each receive separate unique colors when visible and separable.
DO NOT GROUP THESE TOGETHER:
- do not group both hands together
- do not group face and hands together
- do not group all skin regions together
- do not group all driver parts together
- do not group clothing and skin together
- do not group steering wheel and dashboard together
- do not group dashboard and center console together
- do not group seat and headrest together if separable
- do not group all cabin interior parts together
COLOR RULES:
- Use arbitrary unique colors chosen by the model.
- Each visible instance must have one clearly distinct solid color.
- Never reuse the same color for two different visible instances in the same video.
- The same visible instance should keep the same color across frames.
- Colors should be saturated and easy to separate.
- Black RGB(0, 0, 0) is reserved only for background, outside scenery, unknown regions, and non-annotated areas.
- No foreground instance may use black.
TEMPORAL CONSISTENCY:
The output must be temporally stable.
The same visible object or visible body part must keep the same color throughout the video.
The left hand must keep its color.
The right hand must keep its color.
The face/head region must keep its color.
The steering wheel must keep its color.
The seatbelt must keep its color.
The dashboard must keep its color.
Do not allow flicker.
Do not swap colors between similar objects.
If an instance is briefly occluded and then reappears, reuse its original color when it is clearly the same visible instance.
GEOMETRY AND ALIGNMENT:
- Preserve the exact duration of the uploaded video.
- Preserve the same frame rate as closely as possible.
- Preserve the same frame count as closely as possible.
- Preserve the same resolution and aspect ratio.
- Preserve the same camera viewpoint.
- The instance mask must be pixel-aligned with the uploaded video.
- Follow visible object boundaries accurately.
- Preserve object shapes, positions, motion, and occlusions.
OCCLUSION RULES:
- If the left hand overlaps the steering wheel, visible left-hand pixels keep the left-hand instance color, and visible steering-wheel pixels keep the steering-wheel instance color.
- If the right hand overlaps the steering wheel, visible right-hand pixels keep the right-hand instance color, and visible steering-wheel pixels keep the steering-wheel instance color.
- If the seatbelt crosses the torso, visible seatbelt pixels keep the seatbelt instance color and must not merge with clothing.
- If the steering wheel occludes part of the driver, only the visible driver body-part pixels are colored.
- If two objects touch, they must remain separate if they are different visible instances.
BACKGROUND RULE:
Use pure black RGB(0, 0, 0) only for:
- exterior scenery outside the vehicle
- road, sky, buildings, other vehicles outside
- reflections that are not clearly part of a foreground object
- unknown or ambiguous regions
- non-annotated background
MASK STYLE:
Flat solid colors only.
Hard edges only.
No gradients.
No shading.
No texture.
No transparency.
No anti-aliasing.
No soft edges.
No outlines.
No glow.
No realistic appearance.
No compression-like color variation.
No blended colors.
OUTPUT:
A clean, temporally stable, frame-by-frame instance ID mask video aligned with the uploaded photorealistic driver monitoring video.
Each visible object instance or separated visible object-part instance must have its own unique color, even when multiple instances belong to the same semantic class.
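Because the instance colors are chosen freely by the model, a first sanity check on the output is simply to enumerate the colors that actually appear per frame. A minimal NumPy sketch; `min_pixels` is an assumed noise threshold (not part of the prompt) that filters out tiny off-color blobs left by video compression:

```python
import numpy as np

def instance_regions(mask, min_pixels=50):
    """Return {color: pixel_count} for each non-black instance color,
    ignoring tiny regions that are likely encoding noise."""
    flat = mask.reshape(-1, 3)
    colors, counts = np.unique(flat, axis=0, return_counts=True)
    return {
        tuple(int(c) for c in color): int(n)
        for color, n in zip(colors, counts)
        if n >= min_pixels and tuple(color) != (0, 0, 0)
    }
```

Comparing these dictionaries across consecutive frames is also a cheap way to detect flicker: a temporally stable instance mask should show roughly the same color set with smoothly varying pixel counts.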
Why semantic and instance masks are useful together
The combination of semantic and instance masks is very powerful.
The instance mask tells us where the individual objects or object parts are. The semantic mask tells us what class those regions belong to.
This makes it possible to automatically generate object detector training data:
1. Extract RGB frames
2. Find unique regions in the instance mask
3. Compute a bounding box for each region
4. Assign a class using the semantic mask
5. Export YOLO or COCO annotations
For example, if most pixels of an instance region overlap with the “hand” class in the semantic mask, that instance can be labeled as hand. If another instance region overlaps with the steering wheel class, it becomes a steering wheel annotation.
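The steps above can be sketched for a single frame pair with NumPy. The `class_of_color` lookup table and the `min_pixels` threshold are assumptions for illustration, not part of the original prompts; each instance region gets a bounding box labeled by the semantic class covering the majority of its pixels:

```python
import numpy as np

def boxes_from_masks(instance_mask, semantic_mask, class_of_color, min_pixels=50):
    """Derive (class_name, x_min, y_min, x_max, y_max) boxes from one frame pair:
    each instance region is boxed, then labeled by the semantic class
    that covers the majority of its pixels."""
    boxes = []
    flat = instance_mask.reshape(-1, 3)
    for color in np.unique(flat, axis=0):
        if tuple(color) == (0, 0, 0):
            continue  # black is reserved for background
        ys, xs = np.where((instance_mask == color).all(axis=2))
        if len(xs) < min_pixels:
            continue
        # majority vote over the semantic colors inside this instance region
        sem_colors, counts = np.unique(
            semantic_mask[ys, xs], axis=0, return_counts=True)
        majority = tuple(int(c) for c in sem_colors[counts.argmax()])
        label = class_of_color.get(majority, "unknown")
        boxes.append((label, int(xs.min()), int(ys.min()),
                      int(xs.max()), int(ys.max())))
    return boxes
```

From here, exporting YOLO annotations is a formatting step: each box becomes one line of `class_id x_center/W y_center/H width/W height/H`, with pixel coordinates normalized by the frame size.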
This logic is not limited to DMS. It can be generalized to many other computer vision use cases.
Three-panel view: RGB + semantic + instance
The three outputs side by side: the original RGB video on the left, the semantic mask in the center, and the instance mask on the right.
What needs to be checked?
It is important to emphasize that generated annotations are not automatically perfect ground truth.
Generative models can make mistakes, so the pipeline needs a quality assurance step.
Some important QA checks:
- the RGB video, semantic mask, and instance mask should be temporally aligned
- resolution and aspect ratio should match
- there should be no frame drift
- the semantic mask should use class colors consistently
- the instance mask should not merge separate objects incorrectly
- the left and right hands should remain separated when visible
- the steering wheel, seatbelt, head, and hands should not be incorrectly merged
- there should be no flicker or color swapping between frames
- there should be no unwanted text, watermark, border, or overlay
- the scene should remain physically plausible
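The structural checks at the top of this list are easy to automate. A minimal sketch, assuming each stream has been decoded into a list of `(H, W, 3)` NumPy frames (the function name is mine):

```python
import numpy as np

def qa_alignment(rgb_frames, semantic_frames, instance_frames):
    """Basic structural QA: the three streams must agree in frame count
    and per-frame resolution. Returns a list of human-readable problems."""
    problems = []
    counts = {len(rgb_frames), len(semantic_frames), len(instance_frames)}
    if len(counts) != 1:
        problems.append(f"frame count mismatch: {sorted(counts)}")
    for i, (r, s, m) in enumerate(zip(rgb_frames, semantic_frames, instance_frames)):
        if not (r.shape == s.shape == m.shape):
            problems.append(f"frame {i}: resolution mismatch")
    return problems
```

The semantic checks (merged hands, color swaps, implausible scenes) are harder to automate and still call for human review, which is exactly the point of the next paragraph.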
This method is therefore not about getting a perfect dataset without human review. It is about automating a large part of the annotation workflow and creating prototype data quickly.
Why is this interesting for DMS?
Driver monitoring systems need to cover a wide variety of driver states and visual conditions.
Examples include:
- normal driving
- fatigue
- micro-sleep
- head nodding forward
- looking sideways
- phone usage
- smoking
- eating or drinking
- hands away from the steering wheel
- seatbelt visible or not visible
- daytime and nighttime conditions
- sunlight and glare
- urban and highway environments
- different vehicle interiors
Collecting and annotating all of these cases with real data is time-consuming. With synthetic video, rare or hard-to-record situations can be generated in a targeted way.
For example, it becomes possible to generate scenes where the driver gradually becomes tired, pays less attention to the road, or places their hands in unsafe positions.
Why am I so interested in this?
As a computer vision specialist, these experiments are both professional work and a hobby for me.
I enjoy testing what new generative tools can do, where their limits are, and how they can be integrated into real machine vision workflows.
For me, this DMS experiment was not only about generating a nice-looking video. The more interesting question was:
Can generative video models be used for structured training data generation?
Based on this experiment, my answer is: yes, but carefully.
This does not fully replace real data. It does not replace quality assurance. It does not guarantee flawless annotations. But for prototyping, simulating rare cases, validating ideas, and training initial models, it looks very promising.
What comes next?
The next step for me is to generalize this pipeline to other use cases.
The same idea can be applied not only to DMS, but also to:
- industrial safety camera data
- retail shelf monitoring
- traffic-camera object detection
- agricultural computer vision
- robotic manipulation scenes
- sports analytics
- animal behavior analysis
The shared logic remains the same:
1. generate a realistic video
2. generate a semantic mask
3. generate an instance mask
4. convert them into training data
5. validate and filter bad samples
More AI and computer vision experiments
I collect my AI and computer vision solutions, experiments, and products in my webshop: shop.antal.ai.
My professional website is available here: antal.ai.
If you are interested in generative AI, synthetic training data, computer vision, and practical machine vision workflows, this is a space worth watching closely.