Task-solving video world models without paired demonstrations

World Model Self-Distillation

WMSD turns caption-guided video generators into instruction-conditioned Executors, then improves them with VLM feedback while keeping the pretrained Demonstrator as a stabilizing anchor.

Video Comparisons

WMSD vs. Base Model Generations

Here we showcase the task-solving ability of our WMSD-trained models and compare them against the original fine-tuned LTX-2 and HunyuanVideo-1.5 base models.

Robotics Transfer

Competitive Transfer Without Task-Specific Video Supervision

On DreamGen, WMSD trained on WorldTasks transfers to robotic tasks and is competitive with SFT-trained video generators.

DreamGen Transfer Scores

Grouped bars compare WMSD against zero-shot and supervised fine-tuned baselines.

Dataset

WorldTasks Pairs Scenes with Compact Task Instructions

Dataset Release Preview

WorldTasks is constructed from pre-extracted images, filtered for visual quality, aesthetic score, and VLM semantic suitability. WMSD uses each short task plus its generated solution to train the Demonstrator-to-Executor pipeline.
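The three filters described above can be pictured as a simple conjunction of checks. The sketch below is illustrative only: the `CandidateImage` fields, the score ranges, and the thresholds are all assumptions made for this example, not the actual WorldTasks pipeline.

```python
from dataclasses import dataclass


@dataclass
class CandidateImage:
    # Hypothetical schema; the page does not specify one.
    image_id: str
    visual_quality: float   # assumed to lie in [0, 1]
    aesthetic_score: float  # assumed to lie in [0, 10]
    vlm_suitable: bool      # assumed yes/no verdict from a VLM


def filter_candidates(images, min_quality=0.5, min_aesthetic=5.0):
    """Keep images that pass all three checks (thresholds are placeholders)."""
    return [
        img for img in images
        if img.visual_quality >= min_quality
        and img.aesthetic_score >= min_aesthetic
        and img.vlm_suitable
    ]


candidates = [
    CandidateImage("a", 0.9, 6.2, True),
    CandidateImage("b", 0.3, 7.0, True),   # fails visual quality
    CandidateImage("c", 0.8, 4.1, True),   # fails aesthetic score
    CandidateImage("d", 0.7, 5.5, False),  # fails VLM suitability
]
kept = filter_candidates(candidates)
print([img.image_id for img in kept])
```

Only the image that clears every check survives; in practice the order of the filters would likely be chosen so that the cheapest check runs first.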

20k training images
8 task-solution pairs per image
200 benchmark tasks
Appendix dataset sample 100_4 showing a man in a blue shirt on a street.
Sample 100_4

Initial Frame with a Person Centered on a Road Scene

Task 1 [Man in blue shirt]: Step onto the yellow lane marking and stop exactly at the white arrow's tip.

Description: The man in the blue shirt begins walking forward along the center of the road, his feet deliberately stepping onto the double yellow lane marking, and continues moving straight ahead until he reaches the tip of the white directional arrow painted on the asphalt, where he halts and stands still.

Task 2 [Person in blue shirt]: Move forward to the nearest building.

Description: The person in the blue shirt begins walking forward along the center of the road, maintaining a steady pace toward the building on the left side of the street, their body oriented directly ahead as they cross the yellow double lines; after a few steps, they continue moving forward until they reach the sidewalk adjacent to the building, then they halt beside the American flag mounted on the building's facade, coming to a complete stop with their feet planted on the pavement.

Appendix dataset sample 145_4 showing a character with a horned helmet in a forest scene.
Sample 145_4

Fantasy Scene with a Controllable Character and Nearby Objects

Task 1 [Character with horned helmet]: Use the bow to aim at the tree trunk directly ahead.

Description: The character with the horned helmet slowly turns their upper body toward the tree trunk directly ahead, simultaneously drawing the bowstring back with their right hand while keeping their left hand steady on the bow's grip, their gaze fixed on the target as the bowstring tenses and the arrow nocks align with the trunk.

Task 2 [Character with horned helmet]: Move to the largest boulder and stop beside its left edge.

Description: The character with the horned helmet begins walking forward along the stone path, their body oriented toward the largest boulder visible to the left, and after a few steps, they decelerate, shifting their weight slightly as they turn their head to the left to align their gaze with the boulder's edge, then halt precisely beside its left side, their right hand resting on their hip while their left hand remains near the hilt of their weapon.

Appendix dataset sample 6888_1 showing a first-person racing cockpit.
Sample 6888_1

First-Person Driving Frame with Steering Controls and Track Context

Task 1 [Driver in racing suit]: Press the red button on the steering wheel's right side.

Description: The driver's right hand, clad in a black racing glove, moves slightly forward and inward, pressing the red button located on the right side of the steering wheel, while the left hand remains steady on the left side of the wheel, and the vehicle continues forward along the track with the dashboard displaying 186 MPH and an overtaking indicator active.

Task 2 [First-person view]: Align the car's front bumper with the white track curb ahead.

Description: The driver's hands grip the steering wheel firmly, thumbs pressing the paddle shifters while the left hand subtly adjusts its position to maintain control; simultaneously, the right hand makes a slight inward rotation of the wheel to initiate a gentle steering correction toward the white track curb ahead, and the car's front bumper begins to approach the curb as the vehicle decelerates slightly, aligning its front edge with the curb's edge while the dashboard display updates to reflect the new position and speed.

Appendix dataset sample 7637_6 showing a city crosswalk with a person holding an umbrella and a black minivan.
Sample 7637_6

Crosswalk Scene with Multiple Possible Agents and Actions

Task 1 [Man holding black umbrella]: Step off the crosswalk and hand the umbrella to the sidewalk curb.

Description: The man holding the black umbrella continues walking forward, stepping off the crosswalk onto the sidewalk, then lowers his arm and extends his hand toward the curb, releasing the umbrella to rest against the sidewalk edge.

Task 2 [Black minivan]: Align its front bumper with the white pedestrian lane marking.

Description: The black minivan advances forward while maintaining its current trajectory, its front bumper gradually moving closer to the white pedestrian lane marking on the asphalt, adjusting its position as it proceeds along the crosswalk.
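The samples above share a consistent textual pattern: a line of the form "Task N [Agent]: instruction" followed by a "Description:" line holding the generated solution. As a minimal sketch, assuming that pattern holds across the dataset, such a pair could be parsed into a record as follows. The `TaskRecord` class and `parse_task` helper are hypothetical names for this example and do not reflect the released dataset format.

```python
import re
from dataclasses import dataclass

# Matches lines such as "Task 2 [Black minivan]: Align its front bumper ..."
TASK_LINE = re.compile(r"^Task (\d+) \[(.+?)\]: (.+)$")


@dataclass
class TaskRecord:
    index: int        # task number within the image
    agent: str        # the entity asked to act
    instruction: str  # the compact task instruction
    description: str  # the longer generated solution text


def parse_task(task_line, description_line):
    """Parse one task line and its description line into a TaskRecord."""
    match = TASK_LINE.match(task_line.strip())
    if match is None:
        raise ValueError(f"unrecognized task line: {task_line!r}")
    description = description_line.strip()
    prefix = "Description:"
    if description.startswith(prefix):
        description = description[len(prefix):].strip()
    return TaskRecord(int(match.group(1)), match.group(2), match.group(3), description)


record = parse_task(
    "Task 2 [Black minivan]: Align its front bumper with the white pedestrian lane marking.",
    # Description shortened here for brevity.
    "Description: The black minivan advances forward while maintaining its current trajectory.",
)
print(record.agent, "->", record.instruction)
```

Keeping the agent separate from the instruction makes it straightforward to check correct-agent behavior independently of task completion.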

Results

WMSD Improves Both LTX-2 and HunyuanVideo-1.5

On WorldTasksBench, WMSD raises task completion, correct-agent behavior, and physical consistency while preserving the inference cost of the underlying model.

WorldTasksBench Score Breakdown

The chart reports task completion, correct-agent behavior, physical consistency, and average score. Inference time is shown at the end of each bar.
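For readers reproducing the breakdown, one plausible reading is that the average score is the unweighted mean of the three per-metric scores. That is an assumption of this sketch, not something the page states, and the numbers below are placeholders rather than reported results.

```python
def average_score(task_completion, correct_agent, physical_consistency):
    """Unweighted mean of the three metrics (assumed aggregation rule)."""
    return (task_completion + correct_agent + physical_consistency) / 3.0


# Placeholder values for illustration only; not results from the paper.
scores = {
    "task_completion": 0.6,
    "correct_agent": 0.9,
    "physical_consistency": 0.75,
}
avg = average_score(**scores)
print(f"average score: {avg:.2f}")
```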

On-Policy Self-Distillation Keeps Improving

Both charts plot the paper's ablation traces: the WorldTasks score rises while PickScore remains stable.

Average WorldTasks Score

PickScore

Off-policy distillation saturates earlier, while on-policy variants continue improving across training steps.

The Demonstrator Anchor Needs the Right Strength

The beta_d sweep shows the best WorldTasks score at a moderate anchor setting.

Too little anchor weakens teacher guidance; too much anchor constrains task improvement.
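The trade-off can be illustrated with a generic KL-regularized objective, in which a policy maximizes expected reward minus beta_d times its KL divergence from a reference distribution. This is a standard textbook formulation used here as a stand-in. It is not claimed to be the WMSD loss, and the toy distributions and rewards are invented for the example. For discrete outcomes the maximizer has the closed form pi(x) proportional to p_demo(x) * exp(r(x) / beta_d).

```python
import math


def anchored_policy(demo_probs, rewards, beta_d):
    """Maximizer of E_pi[r] - beta_d * KL(pi || demo) over discrete outcomes."""
    weights = [p * math.exp(r / beta_d) for p, r in zip(demo_probs, rewards)]
    total = sum(weights)
    return [w / total for w in weights]


# Toy setup (hypothetical): three candidate behaviors.
demo_probs = [0.7, 0.2, 0.1]  # what the reference (Demonstrator-like) model prefers
rewards = [0.0, 1.0, 0.5]     # what the feedback signal prefers

for beta_d in (0.05, 0.5, 50.0):
    policy = anchored_policy(demo_probs, rewards, beta_d)
    print(beta_d, [round(p, 3) for p in policy])
```

In this toy setting a very small beta_d lets the policy collapse onto the highest-reward outcome and ignore the reference, while a very large beta_d leaves it almost identical to the reference, so the reward barely changes anything. A moderate value sits between these extremes, which mirrors the qualitative pattern described above.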

RL Gains Are Largest When Paired with On-Policy Distillation

Line charts show task, agent, and realism scores for the training settings from the paper.

Task Score

Agent Score

Realism Score

On-policy + RL climbs above the Demonstrator reference across the task-solving metrics.

Citation

Reference the Paper

Use the citation below for the current arXiv preprint version.

BibTeX
arXiv preprint

World Model Self-Distillation: Training World Models to Solve General Tasks

Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, and Paolo Favaro

Department of Computer Science, University of Bern

{sebastian.stapf, pablo.acuavivahuertos, aram.davtyan, paolo.favaro}@unibe.ch

@misc{stapf2026worldmodelselfdistillation,
  title = {World Model Self-Distillation: Training World Models to Solve General Tasks},
  author = {Stapf, Sebastian and Acuaviva Huertos, Pablo and Davtyan, Aram and Favaro, Paolo},
  institution = {Department of Computer Science, University of Bern},
  year = {2026},
  note = {arXiv preprint}
}