One of the most interesting questions in computer vision is how to create high-quality training data for tasks where collecting, annotating, or legally handling real-world data is difficult.
Driver Monitoring Systems, or DMS, are a perfect example. In this domain, we often need video data of a driver inside a vehicle cabin so that machine vision models can learn to detect fatigue, drowsiness, phone usage, smoking, hand position, seatbelt usage, gaze direction, or other driver states.
The challenge is that real videos of drivers are not easy to collect at scale. They are difficult to annotate precisely, and they can also be sensitive from a privacy and personality-rights perspective.
This is why I started experimenting with whether fully synthetic, yet practically useful, DMS training data can be generated with modern generative video models.
What was the goal?
The goal was not simply to generate a visually impressive video of a driver. The goal was to create a data structure that could later be used for training computer vision models:
realistic video
+ semantic segmentation mask
+ instance segmentation mask
= automatically processable training data
In the DMS example, the realistic video shows a driver sitting behind the steering wheel and gradually becoming drowsy. For that video, I wanted to generate:
1. RGB video: the realistic scene
2. Semantic mask: which semantic class each pixel belongs to
3. Instance mask: which individual object or object part each pixel belongs to
This is interesting because, once both the semantic mask and instance mask are available, object detector training data can be extracted automatically frame by frame.
The final pipeline
The experiment eventually converged into a simple three-step workflow:
1. generate the realistic RGB video
2. generate a semantic segmentation mask video from it
3. generate an instance segmentation mask video from it
In other words, the RGB video is created first, and then the same video is used as the source for the two annotation videos.
This order is practical because the realistic video, the semantic mask, and the instance mask all refer to the same scene. Later, they can be aligned frame by frame:
frame_000001_rgb.png
frame_000001_semantic.png
frame_000001_instance.png
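Once the three videos are exported as per-frame images with this naming scheme, the triplets can be paired programmatically. A minimal sketch, assuming frames follow the `frame_<index>_<kind>.png` pattern shown above (the function name and directory layout are illustrative):

```python
from pathlib import Path

def aligned_triplets(frame_dir):
    """Pair RGB, semantic, and instance frames that share the same index."""
    frames = {}
    for p in Path(frame_dir).glob("frame_*_*.png"):
        # e.g. "frame_000001_rgb" -> index "000001", kind "rgb"
        _, idx, kind = p.stem.split("_")
        frames.setdefault(idx, {})[kind] = p
    # keep only indices where all three modalities are present
    return {
        idx: parts for idx, parts in sorted(frames.items())
        if {"rgb", "semantic", "instance"} <= parts.keys()
    }
```

Indices missing one of the three modalities are dropped, which already acts as a first filter against incomplete exports.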
Original realistic DMS source video
A synthetic driver monitoring scene where the driver gradually becomes drowsy while driving.
Step 1: generating the realistic DMS video
In the first step, I generated a completely new synthetic video.
The scene takes place inside a vehicle cabin from a driver monitoring camera perspective. The camera is fixed, and the frame includes the driver’s upper body, head, hands, steering wheel, dashboard, seat, headrest, seatbelt, and windows.
The driver starts awake, then gradually becomes drowsy:
- heavy eyelids
- slower blinking
- wandering gaze
- head nodding forward
- relaxed posture
- hands remaining near the steering wheel
It was important that the scene should not look like a dramatic accident sequence. I wanted a realistic driver state that is useful from a DMS perspective, not a cinematic crash scenario.
Original prompt for Step 1 — Realistic DMS video generation
Generate a completely new photorealistic synthetic video from scratch.
Do not use any input video.
Do not use any real person, celebrity, public figure, or recognizable identity.
Create a fictional adult driver.
SCENE:
A realistic driver monitoring camera view inside a modern passenger car.
The camera is fixed near the dashboard or A-pillar, looking at the driver from the front-left side.
The frame clearly shows the driver’s face, head, neck, upper torso, both arms, both hands, steering wheel, dashboard, driver seat, headrest, seatbelt, side window, windshield area, and part of the cabin interior.
ACTION:
The driver starts awake and driving normally.
Over the video, the driver gradually becomes drowsy and starts falling asleep at the wheel.
The drowsiness should be clear and realistic:
- heavy eyelids
- slower blinking
- reduced alertness
- gaze drifting away from the road
- head slowly nodding downward
- shoulders becoming slightly slouched
- head tilting forward or sideways
- both hands remain near or on the steering wheel
- sleepy posture develops gradually
The scene must remain non-violent.
No crash.
No sudden impact.
No injury.
No panic.
No emergency event.
CAMERA AND MOTION:
Fixed driver monitoring camera.
No camera movement.
No zoom.
No cuts.
No scene changes.
Stable lighting.
Stable vehicle interior.
Realistic human motion.
Natural in-cabin lighting.
The driver, steering wheel, seatbelt, dashboard, seat, and windows must remain spatially consistent throughout the video.
VISUAL STYLE:
Photorealistic.
Live-action driver monitoring footage.
Realistic car interior.
Realistic skin, hair, clothing, hands, and facial motion.
No cartoon style.
No CGI look.
No text.
No subtitles.
No labels.
No watermark.
No split-screen.
OUTPUT:
A single photorealistic driver monitoring video of a fictional drowsy driver gradually falling asleep at the wheel.
Step 2: semantic segmentation mask
In the second step, I generated a semantic segmentation mask video from the same realistic source video.
Semantic segmentation assigns every pixel to a semantic class. In this case, each class is represented by a color.
Useful DMS classes can include:
- face, head, and neck skin
- hands and fingers
- hair
- clothing
- steering wheel
- seatbelt
- dashboard
- center console
- seat
- headrest
- window or windshield glass
- background or unknown regions
This mask is useful because it tells us what type of object or region each pixel belongs to.
From a DMS perspective, some especially important questions are:
- where the driver’s head is located
- where the hands are located
- where the steering wheel is located
- where the seatbelt crosses the body
- which regions belong to the vehicle interior
Semantic segmentation mask video
The semantic mask generated from the realistic DMS video. This works well as a side-by-side comparison with the original RGB video.
Original prompt for Step 2 — Semantic segmentation mask generation
Use the uploaded video as the only input video.
The uploaded video is a photorealistic driver monitoring video generated on-platform.
Create a frame-by-frame SEMANTIC SEGMENTATION MASK video from the uploaded video.
IMPORTANT:
This is semantic segmentation.
Colors represent semantic classes.
This is not instance segmentation.
OUTPUT FORMAT:
Output only the semantic segmentation mask video.
Do not output the original photorealistic video.
Do not output split-screen.
Do not add text, labels, captions, legends, borders, arrows, watermarks, or overlays.
TEMPORAL AND GEOMETRIC CONSTRAINTS:
- Preserve the exact duration of the uploaded video.
- Preserve the same frame rate as closely as possible.
- Preserve the same frame count as closely as possible.
- Preserve the same resolution and aspect ratio.
- Preserve the same camera viewpoint.
- The segmentation mask must be pixel-aligned with the uploaded video.
- Follow the visible object boundaries as accurately as possible.
- Preserve object shapes, positions, motion, and occlusions.
- The same semantic class must keep the same color across all frames.
- Minimize flicker.
SEMANTIC COLOR PALETTE:
Use only these exact flat solid RGB colors:
BACKGROUND_UNKNOWN = RGB(0, 0, 0)
Exterior scenery outside the vehicle, road, sky, buildings, outside vehicles, ambiguous regions, unknown regions, and non-annotated background.
FACE_HEAD_NECK_SKIN = RGB(0, 255, 0)
The driver’s visible face, ears, head skin, and neck skin.
HANDS_FINGERS_SKIN = RGB(255, 255, 0)
The driver’s visible left hand, right hand, fingers, and exposed hand skin.
HAIR = RGB(255, 128, 0)
The driver’s visible hair.
CLOTHING = RGB(255, 0, 255)
The driver’s shirt, jacket, sleeves, and visible clothed torso or arms.
STEERING_WHEEL = RGB(255, 0, 0)
The steering wheel and visible steering wheel spokes.
SEATBELT = RGB(128, 255, 0)
The seatbelt crossing the driver and visible seatbelt anchor regions.
DASHBOARD = RGB(0, 0, 255)
Dashboard and fixed front cabin surface.
CENTER_CONSOLE = RGB(0, 255, 255)
Center console, gear selector area, and middle cabin controls.
SEAT = RGB(128, 0, 255)
Driver seat, visible passenger seat parts, and seat upholstery.
HEADREST = RGB(255, 128, 128)
Driver headrest and visible headrest surfaces.
WINDOW_GLASS = RGB(64, 0, 128)
Windshield and side window glass regions.
MIRROR_OR_PILLAR = RGB(192, 192, 192)
Rearview mirror, A-pillar, visible cabin pillars, and mirror structures.
MASK RULES:
- Every visible pixel must belong to exactly one semantic class.
- Use flat solid colors only.
- No gradients.
- No shading.
- No texture.
- No transparency.
- No anti-aliasing.
- No soft edges.
- No blended colors.
- No realistic appearance.
- No compression-like color variation.
- Boundaries must be hard and clean.
OCCLUSION RULES:
- If a hand overlaps the steering wheel, visible hand pixels are HANDS_FINGERS_SKIN and visible steering wheel pixels are STEERING_WHEEL.
- If the seatbelt crosses the driver’s clothing, visible seatbelt pixels are SEATBELT.
- If the steering wheel occludes the driver, only visible driver pixels receive driver-related classes.
- If dashboard or interior objects occlude the driver, visible interior pixels keep their own semantic class.
OUTPUT:
A clean, temporally stable semantic segmentation mask video aligned frame-by-frame with the uploaded photorealistic driver monitoring video.
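In practice, generated masks rarely come out with perfectly flat colors: compression and soft edges introduce off-palette pixel values even when the prompt forbids them. A useful post-processing step is to snap every pixel to the nearest color in the 13-class palette defined above. A minimal NumPy sketch (the palette is copied from the prompt; the function name is mine):

```python
import numpy as np

# The 13-class palette from the prompt above (class name -> RGB).
PALETTE = {
    "BACKGROUND_UNKNOWN": (0, 0, 0),
    "FACE_HEAD_NECK_SKIN": (0, 255, 0),
    "HANDS_FINGERS_SKIN": (255, 255, 0),
    "HAIR": (255, 128, 0),
    "CLOTHING": (255, 0, 255),
    "STEERING_WHEEL": (255, 0, 0),
    "SEATBELT": (128, 255, 0),
    "DASHBOARD": (0, 0, 255),
    "CENTER_CONSOLE": (0, 255, 255),
    "SEAT": (128, 0, 255),
    "HEADREST": (255, 128, 128),
    "WINDOW_GLASS": (64, 0, 128),
    "MIRROR_OR_PILLAR": (192, 192, 192),
}

def snap_to_palette(mask):
    """Map every pixel of an (H, W, 3) uint8 mask to the nearest palette color."""
    colors = np.array(list(PALETTE.values()), dtype=np.int32)  # (K, 3)
    flat = mask.reshape(-1, 3).astype(np.int32)                # (N, 3)
    # squared distance from each pixel to each palette entry
    dists = ((flat[:, None, :] - colors[None, :, :]) ** 2).sum(axis=2)
    nearest = dists.argmin(axis=1)
    return colors[nearest].reshape(mask.shape).astype(np.uint8)
```

After snapping, the mask satisfies the "flat solid colors only" rule by construction, which makes all downstream per-class lookups exact comparisons instead of fuzzy matches.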
Step 3: instance segmentation mask
In the third step, I generated an instance segmentation mask from the same RGB video.
Here the main question is not just what class a pixel belongs to, but which specific object or object part it belongs to.
This is important because, for example, both hands may belong to the same semantic class, but at instance level I want to treat them as separate regions.
Useful DMS instances can include:
- face, head, and neck region
- left hand
- right hand
- hair
- upper clothing
- steering wheel
- seatbelt
- dashboard
- seat
- headrest
- center console
The instance mask makes it possible to handle individual objects or object parts separately.
Instance segmentation mask video
The instance mask generated from the same realistic video. The left hand, right hand, head, steering wheel, seatbelt, and other relevant objects should appear as separate regions.
Original prompt for Step 3 — Instance segmentation mask generation
Use the uploaded video as the only input video.
The uploaded video is a photorealistic driver monitoring video generated on-platform.
Create a frame-by-frame INSTANCE SEGMENTATION MASK video from the uploaded video.
IMPORTANT:
This is instance segmentation.
This is not semantic segmentation.
This is not class-color segmentation.
In the output, each unique color represents one visible instance ID.
Colors must not represent semantic classes.
Do not use a predefined fixed color palette.
Do not assign one color per class.
Do not make all skin regions the same color.
Do not make all clothing regions the same color.
Do not make all vehicle interior parts the same color.
Do not merge separated visible objects or separated visible body parts just because they belong to the same semantic class.
OUTPUT FORMAT:
Output only the instance segmentation mask video.
Do not output the original photorealistic video.
Do not output split-screen.
Do not add text, labels, legends, arrows, borders, watermarks, or overlays.
INSTANCE DEFINITION:
Every visually separate foreground object or visible object-part instance must receive its own unique solid color.
This is a visible-instance mask:
- separated visible regions should be separated into different instance colors
- different physical objects must have different colors
- different visible body parts must have different colors
- touching objects must remain separate if they are different objects or body parts
CORE EXAMPLES:
- The driver’s face/head/neck region must have its own unique color.
- The left hand must have a different unique color.
- The right hand must have another different unique color.
- The left hand and right hand must never share the same color.
- Hair must have its own unique color.
- Torso clothing must have its own unique color.
- Left sleeve and right sleeve must have different colors if visually separated.
- The steering wheel must have its own unique color.
- The seatbelt must have its own unique color.
- The dashboard must have its own unique color.
- The center console must have its own unique color if visible.
- The driver seat must have its own unique color.
- The headrest must have its own unique color if visually separable.
- The side window must have its own unique color if annotated.
- The windshield must have its own unique color if annotated.
- The mirror, A-pillar, and door panel must each receive separate unique colors when visible and separable.
DO NOT GROUP THESE TOGETHER:
- do not group both hands together
- do not group face and hands together
- do not group all skin regions together
- do not group all driver parts together
- do not group clothing and skin together
- do not group steering wheel and dashboard together
- do not group dashboard and center console together
- do not group seat and headrest together if separable
- do not group all cabin interior parts together
COLOR RULES:
- Use arbitrary unique colors chosen by the model.
- Each visible instance must have one clearly distinct solid color.
- Never reuse the same color for two different visible instances in the same video.
- The same visible instance should keep the same color across frames.
- Colors should be saturated and easy to separate.
- Black RGB(0, 0, 0) is reserved only for background, outside scenery, unknown regions, and non-annotated areas.
- No foreground instance may use black.
TEMPORAL CONSISTENCY:
The output must be temporally stable.
The same visible object or visible body part must keep the same color throughout the video.
The left hand must keep its color.
The right hand must keep its color.
The face/head region must keep its color.
The steering wheel must keep its color.
The seatbelt must keep its color.
The dashboard must keep its color.
Do not allow flicker.
Do not swap colors between similar objects.
If an instance is briefly occluded and then reappears, reuse its original color when it is clearly the same visible instance.
GEOMETRY AND ALIGNMENT:
- Preserve the exact duration of the uploaded video.
- Preserve the same frame rate as closely as possible.
- Preserve the same frame count as closely as possible.
- Preserve the same resolution and aspect ratio.
- Preserve the same camera viewpoint.
- The instance mask must be pixel-aligned with the uploaded video.
- Follow visible object boundaries accurately.
- Preserve object shapes, positions, motion, and occlusions.
OCCLUSION RULES:
- If the left hand overlaps the steering wheel, visible left-hand pixels keep the left-hand instance color, and visible steering-wheel pixels keep the steering-wheel instance color.
- If the right hand overlaps the steering wheel, visible right-hand pixels keep the right-hand instance color, and visible steering-wheel pixels keep the steering-wheel instance color.
- If the seatbelt crosses the torso, visible seatbelt pixels keep the seatbelt instance color and must not merge with clothing.
- If the steering wheel occludes part of the driver, only the visible driver body-part pixels are colored.
- If two objects touch, they must remain separate if they are different visible instances.
BACKGROUND RULE:
Use pure black RGB(0, 0, 0) only for:
- exterior scenery outside the vehicle
- road, sky, buildings, other vehicles outside
- reflections that are not clearly part of a foreground object
- unknown or ambiguous regions
- non-annotated background
MASK STYLE:
Flat solid colors only.
Hard edges only.
No gradients.
No shading.
No texture.
No transparency.
No anti-aliasing.
No soft edges.
No outlines.
No glow.
No realistic appearance.
No compression-like color variation.
No blended colors.
OUTPUT:
A clean, temporally stable, frame-by-frame instance ID mask video aligned with the uploaded photorealistic driver monitoring video.
Each visible object instance or separated visible object-part instance must have its own unique color, even when multiple instances belong to the same semantic class.
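Because the instance colors are chosen freely by the model, a first sanity check on the output is simply to enumerate the colors that actually appear per frame. A minimal NumPy sketch; `min_pixels` is an assumed noise threshold (not part of the prompt) that filters out tiny off-color blobs left by video compression:

```python
import numpy as np

def instance_regions(mask, min_pixels=50):
    """Return {color: pixel_count} for each non-black instance color,
    ignoring tiny regions that are likely encoding noise."""
    flat = mask.reshape(-1, 3)
    colors, counts = np.unique(flat, axis=0, return_counts=True)
    return {
        tuple(int(c) for c in color): int(n)
        for color, n in zip(colors, counts)
        if n >= min_pixels and tuple(color) != (0, 0, 0)
    }
```

Comparing these dictionaries across consecutive frames is also a cheap way to detect flicker: a temporally stable instance mask should show roughly the same color set with smoothly varying pixel counts.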
Why semantic and instance masks are useful together
The combination of semantic and instance masks is very powerful.
The instance mask tells us where the individual objects or object parts are. The semantic mask tells us what class those regions belong to.
This makes it possible to automatically generate object detector training data:
1. Extract RGB frames
2. Find unique regions in the instance mask
3. Compute a bounding box for each region
4. Assign a class using the semantic mask
5. Export YOLO or COCO annotations
For example, if most pixels of an instance region overlap with the “hand” class in the semantic mask, that instance can be labeled as hand. If another instance region overlaps with the steering wheel class, it becomes a steering wheel annotation.
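The steps above can be sketched for a single frame pair with NumPy. The `class_of_color` lookup table and the `min_pixels` threshold are assumptions for illustration, not part of the original prompts; each instance region gets a bounding box labeled by the semantic class covering the majority of its pixels:

```python
import numpy as np

def boxes_from_masks(instance_mask, semantic_mask, class_of_color, min_pixels=50):
    """Derive (class_name, x_min, y_min, x_max, y_max) boxes from one frame pair:
    each instance region is boxed, then labeled by the semantic class
    that covers the majority of its pixels."""
    boxes = []
    flat = instance_mask.reshape(-1, 3)
    for color in np.unique(flat, axis=0):
        if tuple(color) == (0, 0, 0):
            continue  # black is reserved for background
        ys, xs = np.where((instance_mask == color).all(axis=2))
        if len(xs) < min_pixels:
            continue
        # majority vote over the semantic colors inside this instance region
        sem_colors, counts = np.unique(
            semantic_mask[ys, xs], axis=0, return_counts=True)
        majority = tuple(int(c) for c in sem_colors[counts.argmax()])
        label = class_of_color.get(majority, "unknown")
        boxes.append((label, int(xs.min()), int(ys.min()),
                      int(xs.max()), int(ys.max())))
    return boxes
```

From here, exporting YOLO annotations is a formatting step: each box becomes one line of `class_id x_center/W y_center/H width/W height/H`, with pixel coordinates normalized by the frame size.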
This logic is not limited to DMS. It can be generalized to many other computer vision use cases.
Three-panel view: RGB + semantic + instance
The three outputs side by side: the original RGB video on the left, the semantic mask in the center, and the instance mask on the right.
What needs to be checked?
It is important to emphasize that generated annotations are not automatically perfect ground truth.
Generative models can make mistakes, so the pipeline needs a quality assurance step.
Some important QA checks:
- the RGB video, semantic mask, and instance mask should be temporally aligned
- resolution and aspect ratio should match
- there should be no frame drift
- the semantic mask should use class colors consistently
- the instance mask should not merge separate objects incorrectly
- the left and right hands should remain separated when visible
- the steering wheel, seatbelt, head, and hands should not be incorrectly merged
- there should be no flicker or color swapping between frames
- there should be no unwanted text, watermark, border, or overlay
- the scene should remain physically plausible
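The structural checks at the top of this list are easy to automate. A minimal sketch, assuming each stream has been decoded into a list of `(H, W, 3)` NumPy frames (the function name is mine):

```python
import numpy as np

def qa_alignment(rgb_frames, semantic_frames, instance_frames):
    """Basic structural QA: the three streams must agree in frame count
    and per-frame resolution. Returns a list of human-readable problems."""
    problems = []
    counts = {len(rgb_frames), len(semantic_frames), len(instance_frames)}
    if len(counts) != 1:
        problems.append(f"frame count mismatch: {sorted(counts)}")
    for i, (r, s, m) in enumerate(zip(rgb_frames, semantic_frames, instance_frames)):
        if not (r.shape == s.shape == m.shape):
            problems.append(f"frame {i}: resolution mismatch")
    return problems
```

The semantic checks (merged hands, color swaps, implausible scenes) are harder to automate and still call for human review, which is exactly the point of the next paragraph.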
This method is therefore not about getting a perfect dataset without human review. It is about automating a large part of the annotation workflow and creating prototype data quickly.
Why is this interesting for DMS?
Driver monitoring systems need to cover a wide variety of driver states and visual conditions.
Examples include:
- normal driving
- fatigue
- micro-sleep
- head nodding forward
- looking sideways
- phone usage
- smoking
- eating or drinking
- hands away from the steering wheel
- seatbelt visible or not visible
- daytime and nighttime conditions
- sunlight and glare
- urban and highway environments
- different vehicle interiors
Collecting and annotating all of these cases with real data is time-consuming. With synthetic video, rare or hard-to-record situations can be generated in a targeted way.
For example, it becomes possible to generate scenes where the driver gradually becomes tired, pays less attention to the road, or places their hands in unsafe positions.
Why am I so interested in this?
As a computer vision specialist, these experiments are both professional work and a hobby for me.
I enjoy testing what new generative tools can do, where their limits are, and how they can be integrated into real machine vision workflows.
For me, this DMS experiment was not only about generating a nice-looking video. The more interesting question was:
Can generative video models be used for structured training data generation?
Based on this experiment, my answer is: yes, but carefully.
This does not fully replace real data. It does not replace quality assurance. It does not guarantee flawless annotations. But for prototyping, simulating rare cases, validating ideas, and training initial models, it looks very promising.
What comes next?
The next step for me is to generalize this pipeline to other use cases.
The same idea can be applied not only to DMS, but also to:
- industrial safety camera data
- retail shelf monitoring
- traffic-camera object detection
- agricultural computer vision
- robotic manipulation scenes
- sports analytics
- animal behavior analysis
The shared logic remains the same:
1. generate a realistic video
2. generate a semantic mask
3. generate an instance mask
4. convert them into training data
5. validate and filter bad samples
More AI and computer vision experiments
I collect my AI and computer vision solutions, experiments, and products in my webshop: shop.antal.ai.
My professional website is available here: antal.ai.
If you are interested in generative AI, synthetic training data, computer vision, and practical machine vision workflows, this is a space worth watching closely.