ControlVideo: Training-free Controllable Text-to-Video Generation

Outline

Video visualizations

ControlVideo on depth maps

"A charming flamingo gracefully wanders in the calm and serene water, its delicate neck curving into an elegant shape." "A striking mallard floats effortlessly on the sparkling pond." "A gigantic yellow jeep slowly turns on a wide, smooth road in the city."
"A sleek boat glides effortlessly through the shimmering river, van gogh style." "A majestic sailing boat cruises along the vast, azure sea." "A contented cow ambles across the dewy, verdant pasture."

ControlVideo on canny edges

"A young man riding a sleek, black motorbike through the winding mountain roads." "A white swan moving on the lake, cartoon style." "A dusty old jeep was making its way down the winding forest road, creaking and groaning with each bump and turn."
"A shiny red jeep smoothly turns on a narrow, winding road in the mountains." "A majestic camel gracefully strides across the scorching desert sands." "A fit man is leisurely hiking through a lush and verdant forest."

ControlVideo on human poses

"James bond moonwalk on the beach, animation style." "Hulk is jumping on the street, cartoon style." "Goku in a mountain range, surreal style."
"A man, wearing pink clothes, moonwalk at sunset." "The Simpsons in the city, Hockney style." "Wonder Woman in a desert, Pop Art style."

Long video generation

"A steamship on the ocean, at sunset, sketch style." "Hulk is dancing on the beach, cartoon style."
"An airplane flying on the grasslands." "Towers on grasslands, cartoon style." "A beautiful bird flying in the clear sky."

Novel view generation

Depth Maps "A turtle, high definition, 4k." Depth Maps "A plush teddy bear, high definition, 4k."

Limitations

Source Video Structure Sequence "Iron man runs in the road."

Qualitative comparisons

Depth map

Text Prompt: A daring man is scaling a treacherous and jagged peak in the alpine wilderness.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: A daring man performing gravity-defying stunts on a high-speed, blue motorbike in an empty parking lot.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: A dusty old jeep was making its way down the winding forest road, creaking and groaning with each bump and turn.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: A gigantic yellow jeep slowly turns on a wide, smooth road in the city.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: A contented cow ambles across the dewy, verdant pasture.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Canny edge

Text Prompt: A curious golden dog curiously wanders on the rocky mountain trail.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: A mighty elephant marches steadily through the rugged terrain.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: A shiny silver vehicle gracefully maneuvers towards a modern glass building.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: A yellow duck moving on the river, anime style.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: A lone camel strolls leisurely through the vast, arid expanse of the desert.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Vid2Vid-Zero FateZero ControlVideo (Ours)

Human pose

Text Prompt: Iron man does the moonwalk in the road.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Follow-Your-Pose Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: A robot dances on a road, animation style.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Follow-Your-Pose Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: The astronaut dances in futuristic city, cyberpunk style.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Follow-Your-Pose Vid2Vid-Zero FateZero ControlVideo (Ours)

Text Prompt: James bond moonwalk on the beach, animation style.

Source Video Structure Sequence Tune-A-Video Text2Video-Zero Follow-Your-Pose Vid2Vid-Zero FateZero ControlVideo (Ours)

Ablation studies

Non-deterministic DDPM sampler

Text Prompt: A striking mallard floats effortlessly on the sparkling pond.

Structure Sequence lambda=0.0 lambda=0.5 lambda=1.0
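
For readers unfamiliar with the lambda values compared above, the following is a minimal sketch. It assumes that lambda plays the role of the stochasticity coefficient of a DDIM-style sampler (often written eta), so that lambda=0.0 gives a deterministic step and lambda=1.0 a DDPM-like step. The function name `ddim_sigma` and the example noise-schedule values are hypothetical and are not taken from the ControlVideo code.

```python
import math


def ddim_sigma(alpha_bar_t, alpha_bar_prev, lam):
    """Standard deviation of the fresh noise added in one reverse step.

    alpha_bar_t    : cumulative signal level at the current (noisier) timestep
    alpha_bar_prev : cumulative signal level at the next (cleaner) timestep
    lam            : stochasticity coefficient, assumed to match "lambda" above
    """
    return (
        lam
        * math.sqrt((1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t))
        * math.sqrt(1.0 - alpha_bar_t / alpha_bar_prev)
    )


if __name__ == "__main__":
    # Hypothetical schedule values, chosen only for illustration.
    for lam in (0.0, 0.5, 1.0):
        print(lam, ddim_sigma(0.5, 0.8, lam))
```

Under this assumption, the injected noise scales linearly with lambda, which is one way to read the three columns of the ablation.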

Trade-off between text prompt and motion

Text Prompt: A rabbit walks in the grasslands.

Text Prompt: A mallard swims in the river.

Input Video Structure Sequence control_scale=1.0 (by default) control_scale=0.3

Effect of fully cross-frame interaction and interleaved-frame smoother

(Different number of key frames)

Text Prompt: A mighty elephant marches steadily through the rugged terrain.

Source Video Individual (k=0) First-only (k=1) Sparse-Causal (k=2)
Frame_ids={0,4,8,12} (k=4) Frame_ids={0,2,4,6,8,10,12,14} (k=8) Fully Cross-frame (k=15) Fully + Smoother (k=15)
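
The key-frame sets listed above are consistent with evenly strided indices over a 15-frame clip. The sketch below reproduces them under that assumption. The helper `key_frame_ids` is hypothetical, and the 15-frame length is inferred from the "k=15" label rather than stated on this page. The Sparse-Causal setting (k=2) is not covered, since it is usually defined per frame (first frame plus previous frame) rather than as a fixed set.

```python
import math


def key_frame_ids(num_frames, k):
    """Return up to k evenly strided key-frame indices in [0, num_frames)."""
    if k < 1:
        return []  # "Individual" (k=0): no cross-frame key frames
    stride = math.ceil(num_frames / k)
    return list(range(0, num_frames, stride))


if __name__ == "__main__":
    for k in (0, 1, 4, 8, 15):
        print(k, key_frame_ids(15, k))
```

For other values of k the ceiling stride can yield fewer than k indices, so this is only a way to reproduce the listed settings, not a general selection rule.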

At which timesteps is the interleaved-frame smoother applied?

Text Prompt: A dusty old jeep was making its way down the winding forest road, creaking and groaning with each bump and turn.

Structure Sequence w/o smoother Timesteps {0,1} Timesteps {30,31} Timesteps {48,49}
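
The timestep pairs above suggest that the smoother runs on two consecutive denoising steps. A minimal sketch of one plausible reading follows: on the first step the odd-indexed middle frames of three-frame clips are replaced by an interpolation of their neighbors, and on the second step the even-indexed ones. Plain averaging is used here as a stand-in for a learned frame-interpolation model, and the function name `smooth_interleaved` and its `offset` parameter are hypothetical.

```python
import numpy as np


def smooth_interleaved(frames, offset):
    """Replace every second interior frame by the mean of its two neighbors.

    frames : array of shape (num_frames, ...), one entry per video frame
    offset : 0 smooths frames 1, 3, 5, ...; 1 smooths frames 2, 4, 6, ...
    """
    src = np.asarray(frames, dtype=np.float64)
    out = src.copy()
    for i in range(1 + offset, len(src) - 1, 2):
        # Averaging is an assumption standing in for real frame interpolation.
        out[i] = 0.5 * (src[i - 1] + src[i + 1])
    return out


if __name__ == "__main__":
    video = np.array([0.0, 10.0, 0.0, 10.0, 0.0]).reshape(5, 1, 1)
    step_a = smooth_interleaved(video, offset=0)   # first of the two timesteps
    step_b = smooth_interleaved(step_a, offset=1)  # second of the two timesteps
    print(step_a.ravel(), step_b.ravel())
```

Alternating the offset across the two steps means every interior frame is smoothed once, which matches the "interleaved" naming.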

How many timesteps does the interleaved-frame smoother use?

Text Prompt: A sleek black jeep was speeding along the narrow forest road, dodging trees and rocks.

Structure Sequence 0 step 2 steps 4 steps 6 steps 8 steps