Generating believable 3D human motion from plain English prompts has long been constrained by limited training data, inconsistent motion formats, and models that struggle to follow multi-step instructions over longer clips. Tencent Hunyuan’s latest release aims to address those gaps with a large-scale, open-weight text-to-motion system designed for practical animation pipelines.
What HY-Motion 1.0 is
HY-Motion 1.0 is an open-weight family of text-to-3D human motion generation models from Tencent Hunyuan’s 3D Digital Human team. The system takes a natural-language prompt plus an expected duration and outputs a 3D human motion clip on a unified SMPL-H skeleton. The package is intended to plug into common 3D character workflows, supporting use cases such as digital humans, cinematics, and interactive game characters.
The release includes two model variants:
- HY-Motion-1.0: the standard model with 1.0B parameters
- HY-Motion-1.0-Lite: a lighter option with 0.46B parameters
For developers, Tencent provides code, checkpoints, and a local Gradio interface, along with inference scripts and a batch-oriented CLI. The tooling is described as supporting macOS, Windows, and Linux. The official resources are available via the project’s GitHub repository, and the technical details are documented in the paper.
Why text-to-3D motion is hard (and what this release targets)
Text-to-motion generation sits at the intersection of language understanding, temporal sequence modeling, and human-body kinematics. Even when a model produces smooth movement, it can fail on instruction alignment—misinterpreting the requested action, missing the sequence of steps, or drifting over time in ways that look physically implausible (for example, foot sliding or root drift).
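To make those failure modes concrete, here is a minimal numpy sketch of how foot sliding and root drift are commonly measured on generated clips; the function names and thresholds are illustrative and not taken from the HY-Motion paper.

```python
import numpy as np

def foot_sliding_score(foot_pos, contact_height=0.05, fps=30):
    """Mean horizontal foot speed (m/s) on frames where the foot is near the
    ground and should therefore be planted. foot_pos: (T, 3), Y-up."""
    grounded = foot_pos[:-1, 1] < contact_height            # near-ground frames
    horiz_vel = np.diff(foot_pos[:, [0, 2]], axis=0) * fps  # XZ velocity
    speeds = np.linalg.norm(horiz_vel, axis=1)
    return float(speeds[grounded].mean()) if grounded.any() else 0.0

def root_drift(root_pos):
    """Net horizontal displacement of the root joint over the clip (m); large
    values on a prompt like 'stand still and wave' indicate drift."""
    delta = root_pos[-1, [0, 2]] - root_pos[0, [0, 2]]
    return float(np.linalg.norm(delta))

# Example: a 2-second clip at 30 fps with small random jitter.
rng = np.random.default_rng(0)
foot = rng.normal(scale=0.01, size=(60, 3))
root = np.cumsum(rng.normal(scale=0.001, size=(60, 3)), axis=0)
print(foot_sliding_score(foot), root_drift(root))
```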
HY-Motion 1.0 explicitly targets:
- Instruction following across diverse action categories and combinations
- Longer temporal coherence without destabilizing training
- Pipeline compatibility by standardizing outputs on a unified SMPL-H skeleton format
Data engine and taxonomy: how the motion corpus is built
One of the biggest differentiators in modern generative systems is data scale and curation strategy. HY-Motion 1.0 is trained using motion data drawn from three sources: in-the-wild human motion videos, motion capture (mocap) data, and 3D animation assets used in game production.
According to the project description, the pipeline begins with 12M high-quality video clips from HunyuanVideo. The team applies shot boundary detection to split scenes and runs a human detector to keep only clips featuring people. From there, the GVHMR algorithm is used to reconstruct SMPL-X motion tracks.
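At a high level, the video curation stage is a filter-then-reconstruct loop. The skeleton below captures only that control flow; the shot splitter, person detector, and SMPL-X reconstruction calls are stand-in stubs, not the team's actual tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

@dataclass
class MotionTrack:
    """Placeholder container for one reconstructed SMPL-X clip."""
    clip_id: str
    poses: list              # per-frame body pose parameters
    root_trajectory: list    # per-frame global root positions

def split_into_shots(video_path: str) -> List[str]:
    """Stub for shot-boundary detection; returns paths to single-shot clips."""
    raise NotImplementedError("plug in a shot-boundary detector here")

def contains_person(clip_path: str) -> bool:
    """Stub for the human detector that keeps only clips featuring people."""
    raise NotImplementedError("plug in a person detector here")

def reconstruct_smplx(clip_path: str) -> MotionTrack:
    """Stub for GVHMR-style SMPL-X motion reconstruction from video."""
    raise NotImplementedError("plug in GVHMR or a similar recovery method")

def curate(videos: Iterable[str]) -> Iterator[MotionTrack]:
    # Shot split -> human filter -> 3D motion reconstruction, as described above.
    for video in videos:
        for clip in split_into_shots(video):
            if contains_person(clip):
                yield reconstruct_smplx(clip)
```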
In addition to reconstructed video-based motion, mocap sessions and 3D animation libraries add about 500 hours of motion sequences.
Unifying and cleaning the dataset
All motion data is retargeted onto a single SMPL-H skeleton using mesh fitting and retargeting tools. The team then applies a multi-stage filtering process (a rough code sketch of such checks follows the list) intended to remove:
- Duplicate clips
- Abnormal poses
- Outliers in joint velocity
- Anomalous displacements
- Long static segments
- Artifacts such as foot sliding
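The sketch below shows what heuristics of this kind can look like in practice; the thresholds are illustrative guesses, not values reported for HY-Motion.

```python
import numpy as np

def flag_bad_clip(joint_pos, fps=30,
                  max_joint_speed=15.0,    # m/s: implausible limb velocity
                  max_root_jump=1.0,       # m per frame: teleport-style jumps
                  static_speed=0.02,       # m/s: below this a frame is "static"
                  max_static_ratio=0.9):
    """joint_pos: (T, J, 3) global joint positions, joint 0 = root, Y-up.
    Returns the reasons a clip would be dropped under simple velocity,
    displacement, and static-segment checks (empty list = clip passes)."""
    vel = np.diff(joint_pos, axis=0) * fps                  # (T-1, J, 3)
    speeds = np.linalg.norm(vel, axis=-1)                   # (T-1, J)
    reasons = []
    if speeds.max() > max_joint_speed:
        reasons.append("joint velocity outlier")
    root_step = np.linalg.norm(np.diff(joint_pos[:, 0, [0, 2]], axis=0), axis=-1)
    if root_step.max() > max_root_jump:
        reasons.append("anomalous root displacement")
    if (speeds.mean(axis=1) < static_speed).mean() > max_static_ratio:
        reasons.append("mostly static segment")
    return reasons
```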
The resulting motions are canonicalized and resampled to 30 fps, then segmented into clips shorter than 12 seconds. Clips are set into a fixed world frame: Y axis up and the character facing the positive Z axis.
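As a small illustration of canonicalization, the snippet below resamples a root trajectory to 30 fps and yaw-rotates it so the initial facing direction lines up with the positive Z axis in a Y-up frame; it handles translations only and is not the project's actual retargeting code.

```python
import numpy as np

def resample_to_fps(positions, src_fps, dst_fps=30):
    """Linearly resample a (T, 3) trajectory from src_fps to dst_fps."""
    duration = (len(positions) - 1) / src_fps
    src_t = np.linspace(0.0, duration, len(positions))
    dst_t = np.arange(0.0, duration + 1e-9, 1.0 / dst_fps)
    return np.stack([np.interp(dst_t, src_t, positions[:, d]) for d in range(3)], axis=1)

def face_positive_z(positions, facing_xz):
    """Yaw-rotate a Y-up trajectory so facing_xz (the character's initial
    facing direction in the XZ plane) maps onto the +Z axis."""
    yaw = np.arctan2(facing_xz[0], facing_xz[1])   # angle from +Z toward +X
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return positions @ rot_y.T

# Example: a 24 fps trajectory that initially moves along +X, canonicalized to 30 fps, +Z.
traj = np.linspace([0, 0.9, 0], [2, 0.9, 0], 48)
canonical = face_positive_z(resample_to_fps(traj, src_fps=24), facing_xz=(1.0, 0.0))
```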
The final training corpus contains over 3,000 hours of motion. Within that, 400 hours are labeled as high-quality 3D motion paired with verified captions.
A 3-level motion taxonomy with 6 top classes
To organize the motion space, the team defines a three-level taxonomy. At the top level are 6 classes:
- Locomotion
- Sports and Athletics
- Fitness and Outdoor Activities
- Daily Activities
- Social Interactions and Leisure
- Game Character Actions
These expand into more than 200 fine-grained motion categories at the leaves, covering atomic actions as well as concurrent and sequential combinations.
Motion representation: SMPL-H, 22 joints, and a 201D frame vector
HY-Motion 1.0 generates motion on an SMPL-H skeleton using 22 body joints without hands. Each frame is represented as a 201-dimensional vector (a packing sketch follows the list) that concatenates:
- Global root translation in 3D space
- Global body orientation using a continuous 6D rotation representation
- 21 local joint rotations in 6D form
- 22 local joint positions in 3D coordinates
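The arithmetic works out to 3 + 6 + 21 × 6 + 22 × 3 = 201. The snippet below packs those pieces into a single frame vector, assuming the common convention that the 6D rotation encoding is the first two columns of a rotation matrix, which the write-up does not explicitly confirm.

```python
import numpy as np

def rotmat_to_6d(R):
    """Continuous 6D rotation encoding: the first two columns of a 3x3
    rotation matrix, flattened column-major (Zhou et al., 2019)."""
    return R[:, :2].reshape(-1, order="F")                      # shape (6,)

def pack_frame(root_pos, root_rot, joint_rots, joint_pos):
    """root_pos: (3,), root_rot: (3, 3), joint_rots: (21, 3, 3),
    joint_pos: (22, 3). Returns the 201-D vector: 3 + 6 + 21*6 + 22*3."""
    frame = np.concatenate([
        root_pos,                                               # 3:   root translation
        rotmat_to_6d(root_rot),                                 # 6:   global orientation
        np.concatenate([rotmat_to_6d(R) for R in joint_rots]),  # 126: local rotations
        joint_pos.reshape(-1),                                  # 66:  local joint positions
    ])
    assert frame.shape == (201,)
    return frame

# Example with identity rotations and zeroed positions.
frame = pack_frame(np.zeros(3), np.eye(3),
                   np.tile(np.eye(3), (21, 1, 1)), np.zeros((22, 3)))
print(frame.shape)   # (201,)
```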
The team notes that velocities and foot contact labels were removed because they slowed training and did not improve final quality. The representation is positioned as compatible with animation workflows and close to the DART model representation.
The HY-Motion DiT architecture: hybrid multimodal fusion
At the model core is a hybrid HY-Motion DiT (Diffusion Transformer). The architecture is described as combining two stages of multimodal processing:
Dual-stream blocks
Early layers process motion latents and text tokens in separate streams. Each modality has its own QKV projections and MLP, while a joint attention module enables motion tokens to query semantic information from text tokens. This is designed to preserve modality-specific structure while still grounding motion in language.
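A simplified PyTorch sketch of that idea follows; the layer sizes, normalization placement, and absence of output projections are assumptions made to keep the example short, not details of the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamBlock(nn.Module):
    """Sketch of a dual-stream block: motion and text keep separate QKV
    projections and MLPs, while joint attention lets motion tokens read
    from text tokens. Sizes and layout are assumed, not from the release."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv_motion = nn.Linear(dim, 3 * dim)
        self.qkv_text = nn.Linear(dim, 3 * dim)
        self.mlp_motion = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_text = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_motion, self.norm_text = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def _split_heads(self, qkv):
        q, k, v = qkv.chunk(3, dim=-1)
        to_heads = lambda x: x.unflatten(-1, (self.heads, -1)).transpose(1, 2)
        return to_heads(q), to_heads(k), to_heads(v)

    def forward(self, motion, text):
        # Modality-specific projections over (batch, length, dim) inputs.
        qm, km, vm = self._split_heads(self.qkv_motion(self.norm_motion(motion)))
        qt, kt, vt = self._split_heads(self.qkv_text(self.norm_text(text)))

        # Joint attention: motion queries see motion + text keys; text queries
        # see only text keys (the asymmetry described later in the article).
        m_out = F.scaled_dot_product_attention(qm, torch.cat([km, kt], dim=2),
                                               torch.cat([vm, vt], dim=2))
        t_out = F.scaled_dot_product_attention(qt, kt, vt)

        merge = lambda x: x.transpose(1, 2).flatten(-2)
        motion = motion + merge(m_out)
        text = text + merge(t_out)
        return motion + self.mlp_motion(motion), text + self.mlp_text(text)

# Example: 120 motion tokens and 32 text tokens at width 512.
m, t = DualStreamBlock(512)(torch.randn(2, 120, 512), torch.randn(2, 32, 512))
```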
Single-stream blocks
Later layers concatenate motion and text tokens into one combined sequence, using parallel spatial and channel attention modules to deepen multimodal fusion.
Text conditioning with two encoders
Text conditioning uses a dual-encoder strategy:
- Qwen3 8B for token-level embeddings
- A CLIP-L model for global text features
A Bidirectional Token Refiner is included to counteract the causal attention bias typical of LLMs, which can be a mismatch for non-autoregressive generation.
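In effect, the refiner re-encodes the causally produced token embeddings with full bidirectional attention. A minimal sketch of that idea, with assumed layer counts and sizes:

```python
import torch
import torch.nn as nn

class BidirectionalTokenRefiner(nn.Module):
    """Re-encodes causal-LM token embeddings with bidirectional self-attention
    so every text token can see the whole prompt (layer sizes are assumed)."""

    def __init__(self, dim: int = 1024, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, llm_token_embeddings, padding_mask=None):
        # No causal mask is passed, so attention is fully bidirectional.
        return self.encoder(llm_token_embeddings, src_key_padding_mask=padding_mask)

# Refine (batch, seq, dim) embeddings taken from a causal text encoder.
refined = BidirectionalTokenRefiner()(torch.randn(2, 32, 1024))
```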
Notably, attention is asymmetric: motion tokens can attend to all text tokens, but text tokens do not attend back to motion. The intent is to prevent noisy motion states from degrading the language representation.
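Expressed as an attention mask over a concatenated [motion | text] sequence, the asymmetry looks like the sketch below; this illustrates the stated behavior and is not code from the release.

```python
import torch

def asymmetric_attention_mask(n_motion: int, n_text: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed) over a [motion | text] sequence:
    motion queries may attend to everything, text queries only to text."""
    total = n_motion + n_text
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:n_motion, :] = True            # motion rows see all tokens
    mask[n_motion:, n_motion:] = True    # text rows see text tokens only
    return mask

print(asymmetric_attention_mask(4, 3).int())
```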
Temporal attention designed for longer clips
Within the motion branch, temporal attention uses a narrow sliding window of 121 frames, focusing model capacity on local kinematics while controlling compute for longer sequences. After text and motion tokens are concatenated, Full Rotary Position Embedding is applied to encode relative positions across the combined sequence.
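A 121-frame window means each motion token attends to roughly 60 frames on either side of itself. The sketch below builds such a mask; applying rotary embeddings over the concatenated sequence is omitted.

```python
import torch

def sliding_window_mask(n_frames: int, window: int = 121) -> torch.Tensor:
    """Boolean mask (True = allowed): frame i attends only to frames j with
    |i - j| <= window // 2, i.e. a 121-frame local neighborhood."""
    idx = torch.arange(n_frames)
    return (idx[:, None] - idx[None, :]).abs() <= window // 2

mask = sliding_window_mask(300)            # 10 seconds of motion at 30 fps
print(mask.shape, mask[150].sum().item())  # row 150 sees the full 121 frames
```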
Flow Matching instead of denoising diffusion
Rather than using standard denoising diffusion training, HY-Motion 1.0 uses Flow Matching. In this setup, the model learns a velocity field along a continuous path interpolating between Gaussian noise and real motion data. Training uses a mean squared error objective between predicted and ground-truth velocities along the path.
During inference, the system integrates the learned ordinary differential equation from noise to a clean motion trajectory. The approach is presented as offering more stable training for long sequences while fitting well with a DiT-based design.
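The sketch below shows a generic rectified-flow-style version of this setup: interpolate linearly between noise and data, regress the velocity, then integrate with Euler steps at inference. The linear path, step count, and model signature are common choices assumed for illustration, not confirmed details of HY-Motion.

```python
import torch

def flow_matching_loss(model, x1, text_cond):
    """x1: (B, T, 201) clean motion. Linear path x_t = (1 - t) * x0 + t * x1
    with x0 ~ N(0, I); the target velocity along that path is x1 - x0."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    x_t = (1.0 - t) * x0 + t * x1
    v_pred = model(x_t, t.flatten(), text_cond)      # hypothetical model signature
    return torch.nn.functional.mse_loss(v_pred, x1 - x0)

@torch.no_grad()
def sample(model, text_cond, n_frames, dim=201, steps=50, device="cpu"):
    """Euler integration of the learned ODE from noise (t=0) to motion (t=1)."""
    x = torch.randn(1, n_frames, dim, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=device)
        x = x + dt * model(x, t, text_cond)
    return x

# Smoke test with a dummy velocity model that always predicts zeros.
dummy = lambda x, t, cond: torch.zeros_like(x)
print(sample(dummy, None, n_frames=90).shape)        # torch.Size([1, 90, 201])
```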
Prompt rewriting and duration prediction: a separate alignment module
To improve instruction adherence, HY-Motion 1.0 adds a dedicated Duration Prediction and Prompt Rewrite module. This component is built on Qwen3 30B A3B and is trained using synthetic user-style prompts derived from motion captions via a VLM and LLM pipeline (an example cited is Gemini 2.5 Pro).
The module has two jobs:
- Predict an appropriate motion duration
- Rewrite informal prompts into normalized text that is easier for the DiT to follow
Training begins with supervised fine-tuning and then applies Group Relative Policy Optimization, using Qwen3 235B A22B as a reward model scoring semantic consistency and duration plausibility.
Training curriculum: pretraining, fine-tuning, then RL alignment
The overall training is described as a three-stage curriculum:
- Stage 1: large-scale pretraining on the full 3,000-hour dataset to learn a broad motion prior and initial text-motion alignment
- Stage 2: fine-tuning on the 400-hour high-quality subset to improve realism and semantic correctness at a lower learning rate
- Stage 3: reinforcement learning alignment, including Direct Preference Optimization and Flow GRPO
In Stage 3, Direct Preference Optimization uses 9,228 curated human preference pairs sampled from about 40,000 generated pairs. Flow GRPO then introduces a composite reward combining a semantic score from a Text Motion Retrieval model and a physics score that penalizes artifacts such as foot sliding and root drift, with a KL regularization term to remain close to the supervised model.
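Conceptually, the Flow GRPO reward combines a semantic term, a physics term, and a KL penalty. The sketch below shows that composition with placeholder scores and made-up weights; the actual scorers (a text-motion retrieval model and physics checks) and their weighting are only described qualitatively in the source.

```python
def composite_reward(semantic_score, foot_slide, root_drift, kl_to_sft,
                     w_sem=1.0, w_phys=1.0, w_kl=0.1):
    """Hypothetical composition: reward semantic alignment, penalize physical
    artifacts, and regularize toward the supervised (SFT) model via a KL term."""
    physics_score = -(foot_slide + root_drift)   # fewer artifacts -> higher score
    return w_sem * semantic_score + w_phys * physics_score - w_kl * kl_to_sft

# Example: a well-aligned clip with mild foot sliding and a small policy shift.
print(composite_reward(semantic_score=0.82, foot_slide=0.03,
                       root_drift=0.05, kl_to_sft=0.4))
```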
Benchmarks and scaling behavior
For evaluation, the team constructs a test set of over 2,000 prompts spanning the 6 taxonomy categories, including simple, concurrent, and sequential actions. Human raters score instruction following and motion quality on a 1 to 5 scale.
Reported results include:
- HY-Motion 1.0 instruction following: 3.24 average
- HY-Motion 1.0 SSAE: 78.6 percent
- Baselines (DART, LoM, GoToZero, MoMask) instruction following: 2.17 to 2.31
- Baselines SSAE: 42.7 percent to 58.0 percent
- HY-Motion 1.0 motion quality: 3.43 average
- Best baseline motion quality: 3.11
What scaling experiments suggest
Scaling experiments compare DiT models at 0.05B, 0.46B, and 1B parameters, plus a 0.46B variant trained only on the 400-hour subset. Instruction following improves with model size, with the 1B model reaching an average of 3.34. Motion quality, however, appears to saturate around the 0.46B scale, with the 0.46B and 1B models landing in a similar 3.26 to 3.34 range.
The comparison between the two 0.46B models—one trained on 3,000 hours and one on only 400 hours—suggests that larger data volume is critical for instruction alignment, while high-quality curation mainly boosts realism.
Conclusion
HY-Motion 1.0 positions Tencent Hunyuan’s text-to-3D motion generation as a billion-parameter, Flow Matching-based DiT system built around large-scale motion data, a unified SMPL-H output format, and explicit alignment mechanisms for prompt understanding and physical plausibility. With open weights, tooling for local inference, and a documented training and evaluation setup, it offers a substantial new option for developers building language-driven animation workflows.
Based on reporting originally published by www.marktechpost.com. See the sources section below.
Sources
- www.marktechpost.com
- https://arxiv.org/pdf/2512.23464
- https://github.com/Tencent-Hunyuan/HY-Motion-1.0