Sora’s video quality seems impossible so I dug into how it works under the hood
it combines diffusion (starting with noise, refining towards a desired video) with a transformer architecture (processing the video as a sequence of patches)
read on 🧵
here’s an example:
prompt: “a stop-motion animation of a flower growing out of the windowsill of a suburban house.”
Sora doesn’t directly translate text to video frames. instead, it works on spacetime patches.
these patches capture a snapshot of both space (what’s happening) and time (when it happens), like tiny puzzle pieces cut out of the video.
so imagine the video as a giant cuboid (space and time), and Sora cuts it into smaller cuboids, each representing a snippet of space and time.
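here’s a minimal sketch of the “cut the cuboid into smaller cuboids” idea in NumPy. the function name `to_spacetime_patches` and the patch sizes are made up for illustration, and it works on raw pixels for simplicity; OpenAI’s technical report says Sora patchifies a compressed latent version of the video, and the actual patch sizes aren’t public.

```python
import numpy as np

def to_spacetime_patches(video, pt, ph, pw):
    """Cut a (T, H, W, C) video into non-overlapping (pt, ph, pw, C) cuboids.

    Hypothetical helper for illustration; not Sora's actual code.
    """
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0, "sizes must divide evenly"
    # split each axis into (number of patches, patch size)
    x = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # bring the three "number of patches" axes to the front
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    # one row per patch, each flattened into a single token vector
    return x.reshape(-1, pt * ph * pw * C)

video = np.random.default_rng(0).random((16, 64, 64, 3))  # 16 frames of 64x64 RGB
tokens = to_spacetime_patches(video, pt=4, ph=16, pw=16)
print(tokens.shape)  # (64, 3072): 4*4*4 patches, each 4*16*16*3 values
```

each row of `tokens` is one small cuboid of space and time, which is the unit the transformer sees.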
for our example:
first, the prompt conditions the model. conceptually, you can think of it as picking out the core elements of the description:
• objects (the blooming flower, the sunlit windowsill)
• actions (growth unfolding over time)
• location (the suburban setting)
• and even artistic stylings (the stop-motion aesthetic)
next are the spacetime patches.
patches are cut on a regular grid rather than per object, so the flower lands in some patches, the sunlit windowsill in others, and the slow growth shows up as those patches change over time
together they form a sequence of patches that evolves throughout the video scene.
these patches, however, aren’t just random fragments.
to assemble them coherently, Sora leans on what it learned during training.
there’s no explicit knowledge graph or database in what OpenAI has described; information about the physical world, how objects interact, and even artistic styles is encoded in the model’s learned weights.
this allows Sora to understand
• how the flower realistically grows (petal by petal)
• how it interacts with sunlight (lighting changes over time)
• and how to stick to the stop-motion aesthetic (frame-by-frame transitions)
at generation time, these patches start out as pure noise: a noisy canvas
after this, the diffusion process takes each noisy, abstract patch and gradually refines it towards its final appearance.
each petal of the flower takes shape, the sunlight becomes more defined, and the stop-motion style emerges frame by frame
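to make the “start from noise, refine step by step” part concrete, here’s a toy sketch using a DDIM-style deterministic update. big assumption, flagged loudly: the “denoiser” here is an oracle that already knows the clean target, standing in for the trained network. the function name `denoise_with_oracle`, the linear schedule, and the step count are all made up for illustration; Sora’s actual sampler isn’t public.

```python
import numpy as np

def denoise_with_oracle(x0_target, steps=50, seed=0):
    """Toy DDIM-style sampling loop (illustrative only, not Sora's sampler)."""
    rng = np.random.default_rng(seed)
    # alpha_bar runs from ~0 (almost pure noise) to 1 (fully clean)
    alpha_bar = np.linspace(1e-4, 1.0, steps + 1)
    x = rng.standard_normal(x0_target.shape)  # start from pure noise
    errors = []
    for i in range(steps):
        a_now, a_next = alpha_bar[i], alpha_bar[i + 1]
        # ORACLE: a real model would predict this noise from x and the prompt
        eps = (x - np.sqrt(a_now) * x0_target) / np.sqrt(1.0 - a_now)
        # estimate of the clean signal implied by the predicted noise
        x0_pred = (x - np.sqrt(1.0 - a_now) * eps) / np.sqrt(a_now)
        # move to the next, less noisy level
        x = np.sqrt(a_next) * x0_pred + np.sqrt(1.0 - a_next) * eps
        errors.append(float(np.abs(x - x0_target).mean()))
    return x, errors

target = np.random.default_rng(1).random((64, 48))  # stand-in for clean patch tokens
sample, errors = denoise_with_oracle(target)
print(errors[0], errors[-1])  # error shrinks as the noise level drops
```

in the real thing the oracle line is replaced by a learned network conditioned on the text prompt, which is where all the hard work lives.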
the model doing the denoising is itself a transformer, so as the patches get refined it also tracks the relationships between patches across space and time.
so the flower grows smoothly, the sunlight shifts naturally, and the stop-motion aesthetic remains consistent throughout the video sequence.
step by step, in a compressed latent space that gets decoded back into pixels at the end
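a rough sketch of how a transformer relates patches to each other: single-head scaled dot-product self-attention over a sequence of patch tokens. the weights are random and the sizes (`n_patches`, `d_model`) are made up; this only illustrates the mechanism and says nothing about Sora’s real layer layout.

```python
import numpy as np

def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over patch tokens."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # similarity of every patch with every other patch
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over patches
    # each output token is a weighted mix of all patches' values
    return weights @ v, weights

rng = np.random.default_rng(0)
n_patches, d_model = 64, 32  # hypothetical sizes
patch_tokens = rng.standard_normal((n_patches, d_model))
w_q, w_k, w_v = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))
mixed, attn = self_attention(patch_tokens, w_q, w_k, w_v)
print(mixed.shape, attn.shape)  # (64, 32) (64, 64)
```

row `i` of `attn` says how much patch `i` looks at every other patch, including ones from earlier or later in time, which is what lets the output stay consistent across the clip.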
it can do all sorts of video-related tasks too.
but still, there’s a long way to go
it does not accurately replicate the physics of many basic interactions
e.g. the weird hand gestures of a woman waving in one clip, which someone described as “hyperdimensional aliens trying to figure out how to appear human in 3D space”
this is my first time digging into how some AI tech works but seems like I should do it again!
give me a follow @thatguybg & lmk what I should cover next