Kling O1 (Omni One): A “Unified” Multimodal Video Model That Wants to Replace Your Whole Video Stack

Kling O1 (Omni One) aims to unify video generation and editing with MVL—text, images, and video references working in one seamless workflow.

Date: 2025-11-27

Kling’s O1 (Omni One) is being described (in a beta/internal-style guide) as a unified multimodal video foundation model—a single system meant to handle both video generation and instruction-based video editing through one interaction language. Instead of hopping between “text-to-video,” “reference-to-video,” “video edit,” and “extend shot” modes across different tools, O1’s pitch is: give it text + images + video references + a subject reference, and iterate like a director in one continuous workflow.

That direction also matches how Kuaishou has described Kling’s evolution around MVL (Multi-modal Visual Language): expressing identity, style, scene, action, and camera intent by combining text with visual references.


What is Kling O1 (Omni One), in plain English?

Think of O1 as trying to be one model that does “create + revise” end-to-end:

  • Generate a fresh shot from text
  • Generate from references (image/video)
  • Create a shot from first/last frames
  • Add/remove objects or people in a video
  • Apply modifications and transformations (look and appearance changes)
  • Repaint style (restyle the clip)
  • Extend the shot to continue motion and pacing

All of those are listed together as supported tasks inside the guide, under the umbrella of a single unified model rather than separate specialized pipelines.


The big idea underneath: MVL (Multi-modal Visual Language)

O1’s guide frames the interaction like this: your inputs aren’t “assets” you drop into a workflow—they’re instructions. Text is the high-level plan; the reference image/video provides visual constraints; the subject reference anchors identity.

Kuaishou’s MVL framing is similar: it’s meant to help users convey multi-dimensional creative intent—identity, appearance, style, scenes, actions, expressions, and camera motion—by integrating multimodal information like image references and video clips.

In practice, O1 is aiming for an experience closer to:

“Keep my hero’s face consistent, keep the same jacket, now remove the bystander, shift to golden-hour lighting, and extend the shot as the camera pushes in.”

…instead of exporting to an editor, masking, keyframing, and re-rendering.
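
To make "inputs are instructions" concrete, here is a minimal sketch of what an MVL-style request could look like if it were expressed as a JSON payload. Everything here is an assumption for illustration: the field names (instruction, subject_ref, style_ref, video_ref) are invented, not Kling's documented API.

```python
import json

# Hypothetical MVL-style request: the text carries the plan, the
# references carry the visual constraints. All field names below are
# invented for illustration; this is not Kling's documented API.
request = {
    "instruction": (
        "Keep my hero's face consistent, keep the same jacket, "
        "remove the bystander, shift to golden-hour lighting, "
        "and extend the shot as the camera pushes in."
    ),
    "subject_ref": ["hero_front.png", "hero_profile.png"],  # identity anchor
    "style_ref": "golden_hour_still.jpg",                   # look constraint
    "video_ref": "base_clip.mp4",                           # clip to revise
}

print(json.dumps(request, indent=2))
```

The point is that one payload mixes generation-style and edit-style intent, which is exactly what a unified model would have to parse.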


What O1 claims to combine (the “all-in-one capability stack”)

The guide is explicit about the scope it’s trying to unify:

  • Text-to-video
  • Reference-to-video
  • First/last-frame to video
  • Video add/remove
  • Video modification & transformation
  • Style repaint
  • Shot extension

Bundling this list matters because it implies O1 isn’t only competing on “how pretty is the first render,” but on whether you can finish a clip through iterative edits without leaving the model.
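
If you squint, the unified scope reads like a single task field rather than seven products. The enum below is a reader's mental model of the list above, not an official schema or API.

```python
from enum import Enum

# A mental model of the unified task space, mirroring the guide's list.
# The names are illustrative, not an official schema; the pitch is that
# one model serves all of these rather than routing to separate pipelines.
class O1Task(Enum):
    TEXT_TO_VIDEO = "text_to_video"
    REFERENCE_TO_VIDEO = "reference_to_video"
    FIRST_LAST_FRAME_TO_VIDEO = "first_last_frame_to_video"
    VIDEO_ADD_REMOVE = "video_add_remove"
    VIDEO_MODIFICATION = "video_modification_transformation"
    STYLE_REPAINT = "style_repaint"
    SHOT_EXTENSION = "shot_extension"
```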


One-sentence editing: turning post-production into chat

One of the most creator-relevant promises in the guide is single-sentence editing: natural-language requests like removing passersby, changing the time of day, or swapping an outfit or style, applied directly to an existing video (a sketch follows the list below).

If it holds up, it changes the economics of content creation:

  • Less time learning editor-specific techniques
  • More time iterating on story, pacing, and framing
  • Faster A/B testing for ads (multiple variants from the same base clip)
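
As a sketch, a one-sentence edit collapses post-production to a single call: clip in, instruction in, clip out. The edit_video helper and its fields are hypothetical, shown only to illustrate the shape of the interaction.

```python
# Hypothetical one-sentence edit pass. The helper and field names are
# invented for illustration; they only show the shape of the interaction.
def edit_video(clip_path: str, instruction: str) -> dict:
    """Package a natural-language edit request against an existing clip."""
    return {
        "task": "video_edit",
        "video_ref": clip_path,
        "instruction": instruction,
    }

job = edit_video(
    "street_scene.mp4",
    "Remove the passersby in the background, shift the time of day to dusk, "
    "and keep the main subject unchanged.",
)
print(job["instruction"])
```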


The hardest problem O1 is aiming at: consistency

Most AI video systems still struggle with the thing audiences notice instantly: continuity.

  • The face subtly changes
  • Logos warp
  • Outfit colors drift
  • Props teleport
  • Background architecture melts

O1’s guide directly emphasizes stronger understanding of inputs and multi-view subject creation (building a subject identity from multiple angles) to improve consistency across shots.

This is also why “unified” matters: if generation and editing share the same internal representation of your subject, you have a better chance of modifying a clip without re-rolling your character’s identity each time.
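
From the caller's side, multi-view subject creation might look like registering several angles of the same subject once, then pointing every generate and edit call at that single anchor. The shape below is an assumption for illustration, not a documented interface.

```python
# Hypothetical multi-view subject anchor: register several angles once,
# then reference the same subject_id in every call so generation and
# editing share one identity representation. All names are illustrative.
subject = {
    "subject_id": "hero_01",
    "views": [
        "hero_front.png",
        "hero_left_profile.png",
        "hero_right_profile.png",
        "hero_back.png",
    ],
}

generate_call = {
    "task": "reference_to_video",
    "subject_id": "hero_01",
    "instruction": "The hero walks through a rainy night market.",
}

edit_call = {
    "task": "video_edit",
    "subject_id": "hero_01",
    "video_ref": "shot_03.mp4",
    "instruction": "Change the jacket to red; keep the face identical.",
}
```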


“Skill combos”: stacking tasks in a single pass

A subtle but important point: the guide highlights that you can combine tasks, like doing a subject add and a background change together, or generating from an image reference while restyling.

That sounds minor until you’re producing at scale. Stacked operations (sketched after this list) can mean:

  • Fewer “generate → export → edit → reimport” hops
  • Fewer generations wasted on intermediate steps
  • More usable variants per iteration cycle
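
The difference is easiest to see side by side. Both snippets are illustrative payload sketches, not a real API; the point is the number of render hops.

```python
# Sequential: two separate renders, and each render is a fresh chance
# for the subject or background to drift.
step_1 = {"task": "video_add_remove", "video_ref": "base.mp4",
          "instruction": "Add the mascot standing on the left."}
step_2 = {"task": "video_modification", "video_ref": "step_1_output.mp4",
          "instruction": "Replace the background with a beach at sunset."}

# Stacked: one combined instruction, one pass, one result to review.
combined = {
    "task": "video_edit",
    "video_ref": "base.mp4",
    "instruction": ("Add the mascot standing on the left and replace the "
                    "background with a beach at sunset."),
}
```

Every hop you remove is also one less render in which consistency can break.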


What to watch next (product direction)

The guide references a newer “omni/new” creation workflow path, suggesting an “omni” hub where generation and editing live together instead of being split into separate modes.

And the MVL framing is consistent with Kling’s broader trajectory toward “directing with multimodal constraints,” not just typing prompts.


Where Kling O1 could matter most: practical use cases

1) Short narrative content (multi-shot continuity)

Recurring characters and coherent sequences benefit the most from subject anchoring + shot extension.

2) Product and brand ads (variant generation)

If you can do: “same product, new environment, different lighting, remove reflections, add a hand holding it,” you can produce multiple ad angles from one base.

3) Social volume workflows

One “hero clip,” then 10 variants: different styles, times of day, backgrounds, text removed, camera pacing extended—all without a full editor pipeline.
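
For the ad and social workflows above, the economics come from fanning one hero clip out into many one-sentence edits. A minimal sketch, reusing the same hypothetical payload shape as earlier (the "keep" field is likewise assumed):

```python
# Hypothetical variant fan-out: one hero clip, several one-sentence edits.
# The payload shape, including the "keep" field, is assumed for illustration.
base_clip = "hero_clip.mp4"

variant_edits = [
    "Shift the lighting to golden hour.",
    "Restyle the clip as hand-drawn animation.",
    "Move the scene to a rooftop at night.",
    "Remove all on-screen text.",
    "Extend the final camera push-in by two seconds.",
]

jobs = [
    {
        "task": "video_edit",
        "video_ref": base_clip,
        "instruction": edit,
        "keep": "subject identity, framing, pacing",
    }
    for edit in variant_edits
]

print(f"{len(jobs)} variants queued from one base clip")
```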

4) Previs / storyboarding

Explore camera moves, mood, blocking, and pacing before committing to a final sequence.


Quick-start prompt patterns (MVL-friendly)

A) Baseline shot (lock identity first)

Use subject reference + text:

  • Scene, time/lighting, camera framing + motion, action, mood
  • Negative constraints: face drift, outfit color shifting, logo deformation

B) Edit pass (one-sentence post)

“Remove X, change Y, keep identity unchanged.”

C) Extend shot (continue motion)

“Extend 2–4 seconds, continue action, keep tone, smooth motion.”

These map directly onto the sort of “instruction + reference” behavior O1 is aiming to unify.
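
Written out as reusable templates (plain strings; none of this is an official prompt syntax), the three patterns might look like:

```python
# Illustrative prompt templates for the three patterns above. The slot
# names are placeholders, not parameters of any real API.
BASELINE = (
    "Subject: {subject}. Scene: {scene}. Time/lighting: {lighting}. "
    "Camera: {camera}. Action: {action}. Mood: {mood}. "
    "Avoid: face drift, outfit color shifting, logo deformation."
)

EDIT_PASS = "Remove {remove}. Change {change}. Keep the subject's identity unchanged."

EXTEND = (
    "Extend the shot by {seconds} seconds, continuing the current action "
    "and camera motion; keep the tone consistent and the motion smooth."
)

print(EDIT_PASS.format(remove="the bystander on the right",
                       change="the lighting to golden hour"))
```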


Try Kling models now on Flux AI (recommended)

If you want to start generating today while the O1 (Omni One) direction evolves, try these current options:

Android & iOS Mobile Application for Flux AI

Download the Flux AI mobile application now to tap into Flux AI's robust tools—boost your creativity with a spark of inspiration that transforms words into stunning visuals!

Start on Web App

Advanced Image & Video AI Tools in Flux AI

Create stunning images and captivating videos with Flux AI's powerful tools. Unleash your creativity with our advanced AI technology.

  • Flux Image AI Tools: create stunning images instantly with Flux AI's text-to-image and image-to-image generation technology (Flux AI Image Generator)
  • Flux Video AI Tools: create magic animation videos with Flux AI's text-to-video and image-to-video technology (Flux AI Video Generator)
  • Flux Kontext


Start Creating with Flux AI Now

Try Flux AI for free now.