#1 on Artificial Analysis Video Arena

Happy Horse 1.0: AI Video with Native Audio

The first open-weight model to natively generate synchronized dialogue, ambient sound, and Foley effects jointly with video — no post-production dubbing required. Built on a 15B-parameter unified Transformer.

Native Audio-Video
1080p Output
7-Language Lip-Sync
Open Source
Elo 1392
Image-to-Video
15B Parameters
1080p Native Output
7 Languages
#1 Artificial Analysis Arena
Open Source with Commercial License
Surpassed Seedance 2.0

Why Happy Horse Tops the Leaderboard

A 15B-parameter unified Transformer that jointly generates video and audio in a single forward pass — eliminating the multi-stage pipeline of silent video, dubbing, then lip-syncing.

Joint Audio-Video

The first open-weight model to generate dialogue, ambient sound, and Foley effects natively alongside video frames — no separate dubbing step needed.

Single forward pass
Synchronized audio-visual output
Multilingual Lip-Sync

Native lip-sync support for 7 languages with industry-leading accuracy: English, Mandarin, Cantonese, Japanese, Korean, German, and French.

Low word error rate
Native CJK language support
Lightning Fast

DMD-2 distillation reduces denoising to just 8 steps without classifier-free guidance. A 5-second 1080p clip generates in about 38 seconds on an H100 GPU.

8-step generation
No CFG overhead
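
To put the quoted speed in perspective, here is a small back-of-the-envelope sketch in Python. It uses only the figures stated above (8 steps, about 38 seconds for a 5-second clip); the function names are illustrative, and the per-step figure is an upper bound because it assumes all wall-clock time goes to denoising.

```python
# Figures quoted above: 8 denoising steps, ~38 s for a 5 s clip on an H100.
CLIP_SECONDS = 5.0
WALL_SECONDS = 38.0  # approximate, as quoted
STEPS = 8


def real_time_factor(wall_seconds: float, clip_seconds: float) -> float:
    """Seconds of compute spent per second of generated video."""
    return wall_seconds / clip_seconds


def seconds_per_step_upper_bound(wall_seconds: float, steps: int) -> float:
    """Upper bound on per-step cost: assumes all wall time is denoising."""
    return wall_seconds / steps


rtf = real_time_factor(WALL_SECONDS, CLIP_SECONDS)
per_step = seconds_per_step_upper_bound(WALL_SECONDS, STEPS)

print(f"{rtf:.1f} s of compute per second of video")
print(f"at most {per_step:.2f} s per denoising step")
```

In other words, generation runs at roughly 7.6 times slower than real time under these numbers.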
1080p HD Output

Native 1080p high-definition output, with an optional super-resolution module. Supports multiple aspect ratios: 16:9, 9:16, 4:3, 21:9, and 1:1.

5-10 second clips
Flexible aspect ratios
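
As a rough guide to what those aspect ratios mean in pixels, the sketch below derives a frame size for each one. It assumes that "1080p" refers to the shorter side of the frame being 1080 pixels and that dimensions are rounded to even numbers; neither assumption is confirmed by the model's documentation, and `frame_size` is a hypothetical helper, not part of any Happy Horse API.

```python
from fractions import Fraction

ASPECT_RATIOS = ["16:9", "9:16", "4:3", "21:9", "1:1"]


def frame_size(aspect: str, short_side: int = 1080) -> tuple[int, int]:
    """Return (width, height) with the shorter side fixed at `short_side`.

    Assumption: the shorter side is what "1080p" refers to.
    """
    w, h = (int(part) for part in aspect.split(":"))
    ratio = Fraction(w, h)
    if ratio >= 1:  # landscape or square
        width, height = round(short_side * ratio), short_side
    else:  # portrait
        width, height = short_side, round(short_side / ratio)
    # Round odd dimensions up to the next even number (common codec constraint).
    return width + width % 2, height + height % 2


sizes = {aspect: frame_size(aspect) for aspect in ASPECT_RATIOS}
for aspect, (width, height) in sizes.items():
    print(f"{aspect:>5}: {width}x{height}")
```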
Open Source

Fully open source with commercial-use rights. Base model, distilled model, super-resolution module, and inference code — ready to self-host and fine-tune.

Commercial license
Self-hostable
Dual Input Modes

Supports both text-to-video and image-to-video generation through a unified pipeline. Ranked #1 in both categories on Artificial Analysis.

Elo 1333 (text-to-video)
Elo 1392 (image-to-video)

How Happy Horse Works

A unified pipeline processes text or image input and generates high-definition video with synchronized audio in a single pass.

1. Provide Input

Enter a text prompt describing the scene you want, or upload a reference image to animate. The model handles both input modes through a unified architecture.

Text-to-video generation
Image-to-video animation
Specify language for dialogue
2. Unified Generation

The 40-layer Transformer processes all tokens — text, image, video, and audio — in a single sequence. The middle 32 shared layers handle cross-modal reasoning.

Single-stream processing
8-step DMD-2 distillation
~38s per 5s clip on H100
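
The idea of placing every modality in one sequence can be illustrated with a minimal Python sketch. It only shows the bookkeeping: each modality occupies a contiguous span of a single token sequence, so self-attention can see all of them at once. The token counts and the `Span` and `pack_sequence` names are hypothetical and are not taken from the model's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    modality: str
    start: int  # inclusive index into the shared sequence
    end: int    # exclusive index into the shared sequence


def pack_sequence(lengths: dict[str, int]) -> list[Span]:
    """Lay out each modality's tokens back to back in one sequence."""
    spans, offset = [], 0
    for modality, count in lengths.items():
        spans.append(Span(modality, offset, offset + count))
        offset += count
    return spans


# Hypothetical token counts, for illustration only.
spans = pack_sequence({"text": 64, "image": 256, "video": 4096, "audio": 512})
total_tokens = spans[-1].end

for span in spans:
    print(f"{span.modality:>5}: tokens [{span.start}, {span.end})")
print(f"sequence length: {total_tokens}")
```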
3. HD Output with Audio

Get 1080p video with synchronized dialogue, ambient sound, and Foley effects. Optional super-resolution for even higher quality output.

1080p native resolution
Lip-synced dialogue
Ambient + Foley audio

Technical Architecture

A "sandwich" Transformer architecture that unifies text, image, video, and audio in a single sequence — no cross-attention branches.

Sandwich Layer Design

40-layer self-attention Transformer with a unique "sandwich" structure. The first and last 4 layers handle modality-specific projections; the middle 32 layers share parameters across all modalities.

4 input projection layers
32 shared reasoning layers
4 output projection layers
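
The layer split described above can be written down as a short Python sketch. It encodes only the 4 + 32 + 4 layout from the text; the role names and the `layer_role` helper are illustrative, not identifiers from the released code.

```python
from collections import Counter

NUM_LAYERS = 40   # total depth, as stated above
EDGE_LAYERS = 4   # modality-specific layers at each end


def layer_role(index: int) -> str:
    """Classify a layer (0-based index) by its position in the sandwich."""
    if not 0 <= index < NUM_LAYERS:
        raise IndexError(f"layer index {index} out of range")
    if index < EDGE_LAYERS:
        return "input_projection"
    if index >= NUM_LAYERS - EDGE_LAYERS:
        return "output_projection"
    return "shared"


roles = [layer_role(i) for i in range(NUM_LAYERS)]
counts = Counter(roles)
print(dict(counts))
```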

15 Billion Parameters

The unified architecture processes text, image, video, and audio tokens in a single sequence. This is the first open-source implementation of joint audio-video pre-training from scratch.

Pure self-attention (no cross-attention)
Joint multimodal pre-training
All tokens in one sequence

DMD-2 Distillation

Distribution Matching Distillation v2 compresses the denoising process to just 8 steps, eliminating the need for classifier-free guidance and dramatically reducing inference time.

8-step denoising
No CFG required
Distilled model included
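
To see why dropping classifier-free guidance matters, consider how many model evaluations a sampler needs. With CFG, each denoising step typically runs the model twice (once conditional, once unconditional). The sketch below compares the 8-step, no-CFG setting described above with a hypothetical 50-step CFG baseline; the 50-step figure is an assumption chosen for illustration, not a number from the source.

```python
def model_evaluations(steps: int, uses_cfg: bool) -> int:
    """Count network forward passes for one sample.

    Classifier-free guidance needs a conditional and an unconditional
    prediction at every step, so it doubles the evaluation count.
    """
    return steps * (2 if uses_cfg else 1)


distilled = model_evaluations(steps=8, uses_cfg=False)
baseline = model_evaluations(steps=50, uses_cfg=True)  # hypothetical baseline
reduction = baseline / distilled

print(f"distilled: {distilled} evaluations")
print(f"baseline:  {baseline} evaluations (assumed 50 steps with CFG)")
print(f"reduction: {reduction:.1f}x fewer evaluations")
```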

Native Audio Pipeline

Unlike traditional pipelines that generate silent video and then add audio, Happy Horse produces both simultaneously, so visual and audio content are aligned in time as they are generated rather than matched up afterward.

Dialogue generation
Ambient sound synthesis
Foley effect generation
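
Temporal alignment comes down to each video frame corresponding to a fixed slice of the audio track. The Python sketch below shows that mapping under assumed values of 24 frames per second and a 48 kHz sample rate; the source does not state the model's actual frame rate or audio sample rate, and `audio_span_for_frame` is a hypothetical helper.

```python
def audio_span_for_frame(
    frame_index: int,
    fps: int = 24,              # assumed frame rate, not from the source
    sample_rate: int = 48_000,  # assumed audio sample rate, not from the source
) -> tuple[int, int]:
    """Return the [start, end) audio sample range covered by one video frame."""
    start = frame_index * sample_rate // fps
    end = (frame_index + 1) * sample_rate // fps
    return start, end


# A 5-second clip at the assumed rates has 120 frames and 240,000 samples.
first = audio_span_for_frame(0)
last = audio_span_for_frame(5 * 24 - 1)
print(f"frame 0   -> samples {first}")
print(f"frame 119 -> samples {last}")
```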

Verified Benchmark Performance

Rankings from Artificial Analysis Video Arena — based on blind user voting with chess-style Elo rating, not self-reported benchmarks.

#1 Text-to-Video: Elo 1333 (no audio)
#1 Image-to-Video: Elo 1392 (no audio)
#2 Text-to-Video + Audio: Elo 1205
#2 Image-to-Video + Audio: Elo 1161

Source: Artificial Analysis Video Arena, April 7, 2026. Rankings determined by blind user preference voting.
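
For readers unfamiliar with Elo, the standard chess formula converts a rating gap into an expected win rate in head-to-head votes. The sketch below applies it to the 1333 text-to-video rating against a hypothetical opponent rated 100 points lower. The opponent rating is invented for illustration, and the exact rating computation used by Artificial Analysis is not described here, so treat this as the textbook formula only.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (between 0 and 1)."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


model_rating = 1333       # text-to-video Elo quoted above
opponent_rating = 1233    # hypothetical opponent, 100 points lower

win_rate = expected_score(model_rating, opponent_rating)
print(f"expected preference rate: {win_rate:.1%}")
```

Under this formula a 100-point lead corresponds to being preferred in roughly 64% of blind comparisons, and equal ratings correspond to 50%.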

The Story Behind Happy Horse

A mysterious model that appeared overnight and dominated the rankings — here's what credible sources have confirmed.

What We Know

HappyHorse-1.0 was submitted to the Artificial Analysis Video Arena without a publicly identified team or organization. It quickly rose to #1 in both text-to-video and image-to-video categories.

According to analysis by 36Kr and other tech media, HappyHorse is widely believed to be an optimized iteration of the daVinci-MagiHuman open-source model, developed through collaboration between Shanghai's GAIR research lab and Beijing-based Sand.ai.

The daVinci-MagiHuman model was open-sourced on GitHub in March 2026 and represents the first open-source implementation of audio-video joint pre-training from scratch.

Important Context

As of April 2026, many "Happy Horse" websites appearing in search results are SEO affiliate sites that are not affiliated with the actual model developers. The official platform is happyhorses.io, and model weights have not yet been publicly released.

The Artificial Analysis rankings are based on blind user preference voting — meaning the quality signal is real, even though the team behind it remains pseudonymous.

7 Languages, Native Lip-Sync

Generate videos with accurate lip-synced dialogue in any of these supported languages, with industry-leading word error rates.

EN: English
ZH: Mandarin
YUE: Cantonese
JA: Japanese
KO: Korean
DE: German
FR: French

Try AI Video Generation Today

Ready to Create AI-Powered Videos?

Happy Horse 1.0 represents the cutting edge of AI video generation — native audio, multilingual lip-sync, and #1 ranked quality. Start creating professional AI videos with state-of-the-art tools now.

#1 ranked quality
Native audio generation
7-language support
1080p HD output