
HappyHorse Just Took #1 in AI Video — And Shipped the One Feature Sora 2 Skipped


Erhan Timur · 28 April 2026 · Founder, Digital by Default

The Artificial Analysis Video Arena leaderboard moves fast, but it rarely moves to a new #1 with a genuinely new capability attached. In April 2026, HappyHorse 1.0 from Alibaba's ATH AI Innovation Unit did both — took the top slot on pure video quality and introduced a feature every Western frontier video model, including Sora 2, had conspicuously skipped: joint video-and-audio generation in a single inference pass.

That's a technical distinction dressed up as a feature, but it matters. Every video tool you've used up to this point has produced silent output. Audio — ambient sound, spoken dialogue, music — came from a second model, bolted on afterwards, with all the sync and style mismatch that implies. HappyHorse ships audio as part of the same generation.

What just shipped

HappyHorse 1.0 is a text-to-video model with native audio generation. Give it a prompt — "a marketplace at dusk in Lisbon, tourists walking past a fado singer, camera tracking slowly" — and it returns a cinematic video with synchronised ambient sound, crowd murmur, and the fado music, all generated in one pass.

Currently free to use at happyhorse.app with open model weights planned and a public API coming soon. Backed by Alibaba's $293M investment in parent company ShengShu. Already at 1,100 reviews with a 4.8/5 rating.
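Until that API lands, any integration code is guesswork. Still, a minimal sketch helps make "single inference pass" concrete: one request, one response, audio already in the container. Everything below is assumed rather than documented; the endpoint, the parameter names, and the response format are placeholders for whatever the real API ships.

import requests  # standard third-party HTTP client

API_URL = "https://api.happyhorse.app/v1/generate"  # hypothetical endpoint

def generate_clip(prompt: str, duration_s: int = 10) -> bytes:
    """Request one clip; video and audio come back from a single pass."""
    resp = requests.post(
        API_URL,
        json={
            "prompt": prompt,
            "duration_seconds": duration_s,  # hypothetical parameter name
            "audio": "native",               # hypothetical flag: no second model
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.content  # assumed: an MP4 with the audio track already muxed in

clip = generate_clip(
    "a marketplace at dusk in Lisbon, tourists walking past a fado singer, "
    "camera tracking slowly"
)
with open("lisbon_dusk.mp4", "wb") as f:
    f.write(clip)

The point the sketch makes is the absence of a second call: there is no separate audio request to time-align or style-match afterwards.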

The #1 leaderboard position matters less than how it got there. HappyHorse isn't winning on marginal quality gains over Sora 2; it's winning because the integrated audio changes what "finished video" means. A model that produces silent clips is a draft tool. A model that produces clips with coherent sound is, depending on length, shippable output.

Why joint video-audio matters

Three reasons it's a bigger deal than it sounds; a schematic sketch of the pipeline difference follows them.

Sync is the problem. Adding audio to video post-hoc requires the audio model to either ignore timing (producing generic ambience that could sit under anything) or scene-read the video (hard, error-prone, computationally expensive). Joint generation sidesteps both problems — the model knows what it's showing because it's producing both modalities simultaneously.

Style coherence. A silent video with added stock music feels assembled. The added track always has a slightly different emotional register than the visuals. HappyHorse's output feels authored because the audio was generated to match the visual aesthetic and pacing.

Narrative density. Dialogue-driven scenes require the audio to drive the video as much as the other way round. Previous models handle this badly. HappyHorse, by virtue of generating both together, can produce a ten-second clip where lip movements match the words and the ambient sound responds to what's on screen.
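Here is the promised sketch of the two pipeline shapes. The models are placeholder stubs, not real APIs; the point is where the audio generation gets its information from, not the outputs.

from typing import List, Tuple

Frames = List[str]      # placeholder: one label per frame
Waveform = List[float]  # placeholder: audio samples

def video_model(prompt: str) -> Frames:
    return [f"frame<{prompt}>"] * 240  # stub: 10 s at 24 fps

def audio_model(prompt: str, frames: Frames) -> Waveform:
    # Must either ignore the frames (generic ambience) or scene-read them
    # (hard, error-prone): this is where sync and style drift creep in.
    return [0.0] * 48000  # stub: 1 s of silence at 48 kHz

def joint_model(prompt: str) -> Tuple[Frames, Waveform]:
    # One pass: both modalities are sampled together, so the audio is
    # conditioned on what the model is actually showing, and vice versa.
    return [f"frame<{prompt}>"] * 240, [0.0] * 48000

def post_hoc_pipeline(prompt: str) -> Tuple[Frames, Waveform]:
    frames = video_model(prompt)                 # stage 1: silent video
    return frames, audio_model(prompt, frames)   # stage 2: audio bolted on

def joint_pipeline(prompt: str) -> Tuple[Frames, Waveform]:
    return joint_model(prompt)                   # single stage, no second model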

For any professional use case — short-form ads, social video, film previz, game cinematics — integrated audio is the difference between "impressive demo" and "usable asset."

How it compares

Against Sora 2. Sora 2 has better consistency across longer clips and a more mature editing workflow. HappyHorse has audio and is currently free. For a 60-second branded clip, Sora 2 is still the pick; for a ten-second social video with ambient audio, HappyHorse.

Against Kling 2.6. Kling is stronger on photorealistic human motion. HappyHorse has the audio advantage. Our informal verdict: Kling for influencer-style content, HappyHorse for scene-based narrative work.

Against Runway. Runway has the most mature editing and export pipeline, plus better shot-level control. HappyHorse is a pure generation play without Runway's production tooling. Different parts of the workflow.

Against Google Veo 3. Veo has Google Cloud distribution and enterprise packaging. HappyHorse is free today, with open weights on the way. For serious enterprise deployment, Veo; for creators experimenting at no cost, HappyHorse.

Where the caveats live

It's Alibaba. For some UK and EU buyers, running production creative through Alibaba Cloud is off the table on data-residency grounds. For everyone else, it's a fine provider — with the usual "don't upload anything genuinely confidential" caveats.

API not yet live. "Coming soon" is doing a lot of work. For teams that need to plan a production pipeline against a dependable API, HappyHorse is a 2026-H2 bet at best.

Weights not yet open. The promise is open weights. The reality is currently closed. If self-hosting is the thing that matters, wait for the release.

Audio is great, not perfect. Native audio generation is a step-change, but lip sync on dialogue-heavy scenes is still imperfect. For narration-over-ambient, HappyHorse is fine; for scenes where dialogue timing is critical, you'll still need to fix things in post.
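A concrete example of "fix things in post": if you re-record or regenerate the dialogue, one standard route is to swap the audio track with ffmpeg without touching the video stream. This assumes ffmpeg is installed and on your PATH; the file names are illustrative.

import subprocess

# Replace the generated audio track with a corrected one, keeping the
# video stream as-is (no re-encode, so no quality loss).
subprocess.run(
    [
        "ffmpeg",
        "-i", "happyhorse_clip.mp4",  # generated video + imperfect dialogue
        "-i", "fixed_dialogue.wav",   # corrected audio track
        "-map", "0:v:0",              # video stream from the first input
        "-map", "1:a:0",              # audio stream from the second input
        "-c:v", "copy",               # copy video untouched
        "-c:a", "aac",                # encode the new audio for the MP4 container
        "-shortest",                  # stop at the shorter of the two streams
        "fixed_clip.mp4",
    ],
    check=True,
)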

Who should actually care

Short-form video marketers and social media teams. Ten-to-thirty-second clips with integrated audio are the exact sweet spot. This is the audience where HappyHorse will displace the current "generate in Sora + add ElevenLabs audio" workflow fastest.

Indie filmmakers and previz artists. Scene composition with ambient audio is enormously useful at the previz stage. Being able to prompt a scene and hear it as well as see it changes the director's workflow.

AI video researchers. Open weights (when they land) plus the architectural innovation of joint generation is going to kick off a wave of derivative work. If your job is to track where video models are going, HappyHorse is a file you want to have read.

Not ideal for: long-form video, regulated-industry content (data residency), anyone who needs a stable API today.

The signal

Joint video-audio generation is the new baseline. The next Western frontier release, whether it's Sora 3 or Veo 4, will ship this or look instantly outdated. Expect Western frontier labs to match HappyHorse's architectural approach within 2026; by the end of the year, silent-only video generation will feel as dated as text-only image captioning did by mid-2024.

The other signal worth noting: China shipped this first. This is no longer a one-off — DeepSeek, Kimi, Qwen, and now HappyHorse form a pattern. Leadership on specific model capabilities is distributed, and Western assumptions about which lab has the frontier on any particular axis need constant updating.


If you want to try joint video-audio generation today: HappyHorse on our marketplace has the access specifics, and the Design & Creative category groups the frontier video tools — Sora, Kling, Runway, Veo — worth running the same prompt through if you're picking a primary video model for 2026.
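If you do run that bake-off, the cheapest way to keep it honest is one prompt, every model, outputs saved side by side. A sketch follows, with the adapters left as stubs, since each tool has its own access route (and HappyHorse has no public API yet).

from typing import Callable, Dict

PROMPT = "a marketplace at dusk in Lisbon, camera tracking slowly"

def happyhorse(prompt: str) -> bytes:
    # No public API yet: generate manually at happyhorse.app for now.
    raise NotImplementedError("happyhorse: manual generation only")

# One adapter per candidate model; fill these in against each vendor's API.
ADAPTERS: Dict[str, Callable[[str], bytes]] = {
    "happyhorse": happyhorse,
    # "sora2": ..., "kling": ..., "runway": ..., "veo3": ...
}

for name, generate in ADAPTERS.items():
    try:
        with open(f"bakeoff_{name}.mp4", "wb") as f:
            f.write(generate(PROMPT))
    except NotImplementedError as note:
        print(f"{name}: skipped ({note})")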

Tags: HappyHorse · Alibaba · AI Video · Generative AI · Sora · Video Generation · 2026
