Seedance 2.0 Guide
AI video workflow
Seedance 2.0 is a multimodal audio-video model built for structured directing. Use text, image, video, and audio references together, then guide the model with clear roles, physical action, and shot-based timing.
Treat Seedance like a controlled directing system, not a poetic toy. Every reference needs a job, and every shot needs one readable action.
Seedance works best when you stop writing vague visual descriptions and start directing the model with scene logic, reference roles, physical motion, and clean constraints.
Combine text, images, video, and audio in one generation. Use each file for a specific role: character, environment, camera rhythm, motion, audio, or style.
Build scenes with a beginning, development, and payoff. A short beat structure gives the model temporal logic and keeps motion readable.
Describe what bodies, objects, light, dust, fabric, vehicles, and environments actually do on screen. Tangible behavior beats abstract mood.
These are the rules that keep Seedance generations controlled, readable, and cinematic.
Focus each frame on a single clear narrative or action point. Visual clutter lowers impact and makes the model guess.
The camera path should serve the subject, story, or emotion. Do not add camera moves just because they sound cinematic.
Prioritize visible behavior: falls, collisions, dust, fabric, fire, hands, eye movement, product interaction, and body mechanics.
The location should shape the action. Treat furniture, light, weather, crowds, streets, glass, walls, and props as part of the scene.
Define exactly how external information is used. A reference can control identity, wardrobe, palette, composition, motion, rhythm, or ambience.
A shorter prompt with hierarchy usually beats a long prompt full of competing ideas. Organize first, expand only where needed.
Use this repeatable process before every generation, no matter which platform gives you access to Seedance 2.0.
Decide whether you need a single-shot scene, multi-shot story, reference-based generation, continuation, POV, edit, extension, or detail shot.
Never upload files as decoration. Label what each reference should control: identity, wardrobe, palette, composition, camera motion, blocking, speech rhythm, or ambience.
Each shot should focus on one clear event. Avoid stacking too many actions, camera moves, style words, and character changes into the same moment.
For advanced results, write timecoded beats. Use a structure like establish, develop, payoff so the model understands progression across time.
If the result is chaotic, simplify the prompt. If the style is ignored, adjust the style weight. If the identity drifts, fix the reference role. If the camera resets, lock the camera path.
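The symptom-to-fix pairs above can be kept as a small lookup table during iteration. This is a minimal sketch: the symptom keys and the function name are hypothetical labels, not Seedance settings, and the fixes are paraphrased from the text above.

```python
# Hypothetical symptom labels mapped to the fixes described in the guide.
FIXES = {
    "chaotic": "Simplify the prompt.",
    "style_ignored": "Adjust the style weight.",
    "identity_drift": "Fix the reference role.",
    "camera_reset": "Lock the camera path.",
}


def suggest_fix(symptom: str) -> str:
    """Return the guide's fix for a known symptom, or a generic fallback."""
    return FIXES.get(symptom, "Change one thing at a time and regenerate.")


print(suggest_fix("identity_drift"))
```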
Choose the mode before writing. The mode determines how much control you need and how you should structure references.
Single-shot scene: One continuous readable event. Best for live performances, interviews, simple actions, and unedited footage.
Multi-shot story: Linked beats across time. Best for transformations, mini narratives, product launches, explainers, and event highlights.
Reference-based generation: Use a source image, video, or audio file to guide character, scene, motion, camera, rhythm, or style.
Continuation: Extend a story or scene from a previous output while preserving continuity and avoiding repeated action.
Edit: Modify specific elements of an existing image or clip while keeping the rest of the scene stable.
Extension: Expand the canvas or scene from an image while preserving the original visual logic.
POV: Change the visual perspective or camera angle so the viewer experiences the scene through the subject's eyes.
Detail shot: Add a detailed close-up or element shot to make the scene feel more complete and cinematic.
Seedance 2.0 can read multiple reference types, but every reference should have one job.
Use images for identity, wardrobe, palette, composition, environment, product design, pose, or first/last frame anchors.
Use video for motion rhythm, camera behavior, blocking, pacing, transition feel, choreography, or action timing.
Use audio for timing, speech rhythm, ambience, music energy, impact cues, or the emotional rhythm of the scene.
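One way to enforce "one job per reference" is to write the assignments down as data before turning them into prompt text. This is a minimal sketch: the role vocabulary is copied from the lists above, while the class, the function, the file names, and the output wording are illustrative assumptions, not a Seedance API.

```python
from dataclasses import dataclass

# Roles taken from the guide's lists, grouped by media type.
ALLOWED_ROLES = {
    "image": {"identity", "wardrobe", "palette", "composition", "environment",
              "product design", "pose", "first frame", "last frame"},
    "video": {"motion rhythm", "camera behavior", "blocking", "pacing",
              "transition feel", "choreography", "action timing"},
    "audio": {"timing", "speech rhythm", "ambience", "music energy",
              "impact cues", "emotional rhythm"},
}


@dataclass(frozen=True)
class Reference:
    filename: str  # hypothetical file name
    media: str     # "image", "video", or "audio"
    role: str      # the single job this reference performs


def describe_references(refs: list[Reference]) -> str:
    """Render one instruction line per reference, rejecting unlisted roles."""
    lines = []
    for ref in refs:
        if ref.role not in ALLOWED_ROLES.get(ref.media, set()):
            raise ValueError(
                f"{ref.filename}: '{ref.role}' is not a listed {ref.media} role"
            )
        lines.append(f"Use {ref.filename} ({ref.media}) for {ref.role} only.")
    return "\n".join(lines)


print(describe_references([
    Reference("hero.png", "image", "identity"),
    Reference("dolly.mp4", "video", "camera behavior"),
]))
```

Because each `Reference` holds exactly one `role`, a file that needs two jobs has to be listed twice, which makes the overlap visible before you generate.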
Use this as the fast version when you do not need a full shot script.
Subject: Who or what appears. Be specific about visual features, wardrobe, identity, and role.
Action: What happens physically. Use precise verbs and visible motion instead of abstract emotion.
Environment: Where it happens. Include light, atmosphere, space, and objects that can interact with the action.
Camera: Use one primary movement: push-in, tracking, orbit, handheld, aerial, fixed, pan, or pull-out.
Style: Anchor the visual feel without creating adjective soup. Keep it focused and consistent.
Constraints: Tell the model what to avoid: jitter, bent limbs, identity drift, flicker, chaotic composition.
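The six parts above can be assembled into a short, ordered prompt. This is a minimal sketch: the field names, the `Label: value` layout, and the sample scene are assumptions made for illustration, and the guide does not say Seedance requires this exact format.

```python
from dataclasses import dataclass, field


@dataclass
class QuickPrompt:
    subject: str      # who or what appears
    action: str       # what happens physically
    environment: str  # where it happens
    camera: str       # one primary camera movement
    style: str        # one focused style anchor
    avoid: list[str] = field(default_factory=list)  # things to keep out

    def render(self) -> str:
        """Join the parts in a fixed order, one labeled line each."""
        parts = [
            f"Subject: {self.subject}",
            f"Action: {self.action}",
            f"Environment: {self.environment}",
            f"Camera: {self.camera}",
            f"Style: {self.style}",
        ]
        if self.avoid:
            parts.append("Avoid: " + ", ".join(self.avoid))
        return "\n".join(parts)


prompt = QuickPrompt(
    subject="a courier in a yellow rain jacket",
    action="she stands still, lifts her eyes, tightens her grip, and steps forward",
    environment="a wet street at night, neon light reflecting in puddles",
    camera="slow push-in",
    style="grounded cinematic realism",
    avoid=["jitter", "bent limbs", "identity drift", "flicker"],
)
print(prompt.render())
```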
For stronger narrative clips, use a simple three-beat structure instead of a loose paragraph.
Establish: Show the subject, environment, relationship, or problem. The viewer should understand the scene immediately.
Develop: Introduce motion, conflict, transformation, product use, camera progression, or a visible change in the scene.
Payoff: End with the result: impact, reveal, product moment, final pose, explosion, joke, reaction, or clean cinematic finish.
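If you want timecodes on the three beats, a small helper can divide the clip length for you. This is a minimal sketch: the even split, the `[0s-3s]` notation, and the sample actions are assumptions, not a format the guide prescribes.

```python
def timecoded_beats(beats: list[tuple[str, str]], duration_s: int) -> str:
    """Split duration_s evenly across the beats and label each with a time range."""
    if not beats:
        raise ValueError("at least one beat is required")
    lines = []
    for i, (name, action) in enumerate(beats):
        # Integer arithmetic keeps ranges contiguous; the last beat absorbs any remainder.
        start = duration_s * i // len(beats)
        end = duration_s * (i + 1) // len(beats)
        lines.append(f"[{start}s-{end}s] {name}: {action}")
    return "\n".join(lines)


print(timecoded_beats(
    [
        ("Establish", "a sealed box sits on a workbench under a single lamp"),
        ("Develop", "hands cut the tape and fold back the flaps"),
        ("Payoff", "the product lifts into frame and catches the light"),
    ],
    duration_s=9,
))
```

Adjust the ranges by hand when one beat needs more screen time than the others.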
When extending a clip, the model needs anchors. Tell it what to preserve and what must happen next.
The continuation should start immediately after the final frame, not replay the previous action.
Preserve character, outfit, environment, emotional state, and camera direction unless you intentionally change them.
The previous video should guide continuity, not trap the model into repeating the same beat.
Add direct constraints like “do not repeat the previous action” or “continue from the final pose.”
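The continuation rules above fit a fixed template: a start anchor, a preserve list, the next action, then the direct constraints. This is a minimal sketch: the function name, the argument names, and the line layout are assumptions, while the two default constraints are the ones quoted above.

```python
from typing import Optional

# The two direct constraints quoted in the guide.
DEFAULT_CONSTRAINTS = [
    "Do not repeat the previous action.",
    "Continue from the final pose.",
]


def continuation_prompt(
    preserve: list[str],
    next_action: str,
    constraints: Optional[list[str]] = None,
) -> str:
    """Build a continuation prompt: anchor, preserve list, next action, constraints."""
    rules = DEFAULT_CONSTRAINTS if constraints is None else constraints
    lines = [
        "Continue immediately after the final frame of the previous clip.",
        "Preserve: " + ", ".join(preserve) + ".",
        f"Next: {next_action}",
    ]
    lines.extend(rules)
    return "\n".join(lines)


print(continuation_prompt(
    preserve=["character", "outfit", "environment", "emotional state", "camera direction"],
    next_action="she opens the door and steps into the rain",
))
```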
Most bad generations come from too much ambiguity, not from a weak model.
Too many actions, references, camera ideas, and style words compete with each other. Remove everything that does not serve the main shot.
Do not just upload files and hope the model understands. Say exactly what each reference controls.
Camera movement must serve the story. Use one primary camera idea and keep subject movement separate from camera movement.
Make the prompt easy to parse: style, duration, shot beats, consistency rules, and negative constraints.
Too many style words dilute the result. Use one strong style anchor, then describe physical details that support it.
If character, outfit, age, face, or role references conflict, the generated identity may drift. Give one clear identity source.
When refining, change one thing at a time: camera, style, reference, seed, duration, or prompt structure.
“She feels powerful” is weaker than “she stands still, lifts her eyes, tightens her grip, and steps forward.”
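The "change one thing at a time" habit from the list above is easy to check if you keep each attempt's settings as a dictionary. This is a minimal sketch: the keys mirror the variables named in the guide (camera, style, reference, seed, duration, prompt structure), but the dictionary layout, the sample values, and both function names are assumptions.

```python
def changed_fields(previous: dict, current: dict) -> list[str]:
    """Return the sorted keys whose values differ between two attempts."""
    keys = sorted(set(previous) | set(current))
    return [key for key in keys if previous.get(key) != current.get(key)]


def is_single_change(previous: dict, current: dict) -> bool:
    """True when exactly one setting changed between attempts."""
    return len(changed_fields(previous, current)) == 1


attempt_1 = {"camera": "orbit", "style": "film noir", "seed": 7, "duration": 8}
attempt_2 = {"camera": "push-in", "style": "film noir", "seed": 7, "duration": 8}

print(changed_fields(attempt_1, attempt_2))
print(is_single_change(attempt_1, attempt_2))
```

If more than one key shows up in the list, you cannot tell which change caused the difference in the output.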
Use this when a generation looks close but not quite usable.