Generating believable 3D human motion from plain English prompts has long been constrained by limited training data, inconsistent motion formats, and models that struggle to follow multi-step instructions over longer clips. Tencent Hunyuan’s latest release aims to address those gaps with a large-scale, open-weight text-to-motion system designed for practical animation pipelines.
What HY-Motion 1.0 is
HY-Motion 1.0 is an open-weight family of text-to-3D human motion generation models from Tencent Hunyuan’s 3D Digital Human team. The system takes a natural-language prompt plus an expected duration and outputs a 3D human motion clip on a unified SMPL-H skeleton. The package is intended to plug into common 3D character workflows, supporting use cases such as digital humans, cinematics, and interactive game characters.
The release includes two model variants:
- HY-Motion-1.0: the standard model with 1.0B parameters
- HY-Motion-1.0-Lite: a lighter option with 0.46B parameters
For developers, Tencent provides code, checkpoints, and a local Gradio interface, along with inference scripts and a batch-oriented CLI. The tooling is described as supporting macOS, Windows, and Linux. The official resources are available via the project’s GitHub repository, and the technical details are documented in the paper.
Why text-to-3D motion is hard (and what this release targets)
Text-to-motion generation sits at the intersection of language understanding, temporal sequence modeling, and human-body kinematics. Even when a model produces smooth movement, it can fail on instruction alignment—misinterpreting the requested action, missing the sequence of steps, or drifting over time in ways that look physically implausible (for example, foot sliding or root drift).
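To make those failure modes concrete, here is a minimal numpy sketch of how foot sliding and root drift are commonly measured on generated clips; the function names and thresholds are illustrative and not taken from the HY-Motion paper.

```python
import numpy as np

def foot_sliding_score(foot_pos, contact_height=0.05, fps=30):
    """Mean horizontal foot speed (m/s) on frames where the foot is near the
    ground and should therefore be planted. foot_pos: (T, 3), Y-up."""
    grounded = foot_pos[:-1, 1] < contact_height            # near-ground frames
    horiz_vel = np.diff(foot_pos[:, [0, 2]], axis=0) * fps  # XZ velocity
    speeds = np.linalg.norm(horiz_vel, axis=1)
    return float(speeds[grounded].mean()) if grounded.any() else 0.0

def root_drift(root_pos):
    """Net horizontal displacement of the root joint over the clip (m); large
    values on a prompt like 'stand still and wave' indicate drift."""
    delta = root_pos[-1, [0, 2]] - root_pos[0, [0, 2]]
    return float(np.linalg.norm(delta))

# Example: a 2-second clip at 30 fps with small random jitter.
rng = np.random.default_rng(0)
foot = rng.normal(scale=0.01, size=(60, 3))
root = np.cumsum(rng.normal(scale=0.001, size=(60, 3)), axis=0)
print(foot_sliding_score(foot), root_drift(root))
```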
HY-Motion 1.0 explicitly targets:
- Instruction following across diverse action categories and combinations
- Longer temporal coherence without destabilizing training
- Pipeline compatibility by standardizing outputs on a unified SMPL-H skeleton format
Data engine and taxonomy: how the motion corpus is built
One of the biggest differentiators in modern generative systems is data scale and curation strategy. HY-Motion 1.0 is trained using motion data drawn from three sources: in-the-wild human motion videos, motion capture (mocap) data, and 3D animation assets used in game production.
According to the project description, the pipeline begins with 12M high-quality video clips from HunyuanVideo. The team applies shot boundary detection to split scenes and runs a human detector to keep only clips featuring people. From there, the GVHMR algorithm is used to reconstruct SMPL-X motion tracks.
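At a high level, the video curation stage is a filter-then-reconstruct loop. The skeleton below captures only that control flow; the shot splitter, person detector, and SMPL-X reconstruction calls are stand-in stubs, not the team's actual tooling.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List

@dataclass
class MotionTrack:
    """Placeholder container for one reconstructed SMPL-X clip."""
    clip_id: str
    poses: list              # per-frame body pose parameters
    root_trajectory: list    # per-frame global root positions

def split_into_shots(video_path: str) -> List[str]:
    """Stub for shot-boundary detection; returns paths to single-shot clips."""
    raise NotImplementedError("plug in a shot-boundary detector here")

def contains_person(clip_path: str) -> bool:
    """Stub for the human detector that keeps only clips featuring people."""
    raise NotImplementedError("plug in a person detector here")

def reconstruct_smplx(clip_path: str) -> MotionTrack:
    """Stub for GVHMR-style SMPL-X motion reconstruction from video."""
    raise NotImplementedError("plug in GVHMR or a similar recovery method")

def curate(videos: Iterable[str]) -> Iterator[MotionTrack]:
    # Shot split -> human filter -> 3D motion reconstruction, as described above.
    for video in videos:
        for clip in split_into_shots(video):
            if contains_person(clip):
                yield reconstruct_smplx(clip)
```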
In addition to reconstructed video-based motion, mocap sessions and 3D animation libraries add about 500 hours of motion sequences.
Unifying and cleaning the dataset
All motion data is retargeted onto a single SMPL-H skeleton using mesh fitting and retargeting tools. The team then applies a multi-stage filtering process (a rough code sketch of such checks follows the list) intended to remove:
- Duplicate clips
- Abnormal poses
- Outliers in joint velocity
- Anomalous displacements
- Long static segments
- Artifacts such as foot sliding
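The sketch below shows what heuristics of this kind can look like in practice; the thresholds are illustrative guesses, not values reported for HY-Motion.

```python
import numpy as np

def flag_bad_clip(joint_pos, fps=30,
                  max_joint_speed=15.0,    # m/s: implausible limb velocity
                  max_root_jump=1.0,       # m per frame: teleport-style jumps
                  static_speed=0.02,       # m/s: below this a frame is "static"
                  max_static_ratio=0.9):
    """joint_pos: (T, J, 3) global joint positions, joint 0 = root, Y-up.
    Returns the reasons a clip would be dropped under simple velocity,
    displacement, and static-segment checks (empty list = clip passes)."""
    vel = np.diff(joint_pos, axis=0) * fps                  # (T-1, J, 3)
    speeds = np.linalg.norm(vel, axis=-1)                   # (T-1, J)
    reasons = []
    if speeds.max() > max_joint_speed:
        reasons.append("joint velocity outlier")
    root_step = np.linalg.norm(np.diff(joint_pos[:, 0, [0, 2]], axis=0), axis=-1)
    if root_step.max() > max_root_jump:
        reasons.append("anomalous root displacement")
    if (speeds.mean(axis=1) < static_speed).mean() > max_static_ratio:
        reasons.append("mostly static segment")
    return reasons
```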
The resulting motions are canonicalized and resampled to 30 fps, then segmented into clips shorter than 12 seconds. Clips are set into a fixed world frame: Y axis up and the character facing the positive Z axis.
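As a small illustration of canonicalization, the snippet below resamples a root trajectory to 30 fps and yaw-rotates it so the initial facing direction lines up with the positive Z axis in a Y-up frame; it handles translations only and is not the project's actual retargeting code.

```python
import numpy as np

def resample_to_fps(positions, src_fps, dst_fps=30):
    """Linearly resample a (T, 3) trajectory from src_fps to dst_fps."""
    duration = (len(positions) - 1) / src_fps
    src_t = np.linspace(0.0, duration, len(positions))
    dst_t = np.arange(0.0, duration + 1e-9, 1.0 / dst_fps)
    return np.stack([np.interp(dst_t, src_t, positions[:, d]) for d in range(3)], axis=1)

def face_positive_z(positions, facing_xz):
    """Yaw-rotate a Y-up trajectory so facing_xz (the character's initial
    facing direction in the XZ plane) maps onto the +Z axis."""
    yaw = np.arctan2(facing_xz[0], facing_xz[1])   # angle from +Z toward +X
    c, s = np.cos(-yaw), np.sin(-yaw)
    rot_y = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
    return positions @ rot_y.T

# Example: a 24 fps trajectory that initially moves along +X, canonicalized to 30 fps, +Z.
traj = np.linspace([0, 0.9, 0], [2, 0.9, 0], 48)
canonical = face_positive_z(resample_to_fps(traj, src_fps=24), facing_xz=(1.0, 0.0))
```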
The final training corpus contains over 3,000 hours of motion. Within that, 400 hours are labeled as high-quality 3D motion paired with verified captions.
A 3-level motion taxonomy with 6 top classes
To organize the motion space, the team defines a three-level taxonomy. At the top level are 6 classes:
- Locomotion
- Sports and Athletics
- Fitness and Outdoor Activities
- Daily Activities
- Social Interactions and Leisure
- Game Character Actions
These expand into more than 200 fine-grained motion categories at the leaves, covering atomic actions as well as concurrent and sequential combinations.
Motion representation: SMPL-H, 22 joints, and a 201D frame vector
HY-Motion 1.0 generates motion on an SMPL-H skeleton using 22 body joints without hands. Each frame is represented as a 201-dimensional vector (a packing sketch follows the list) that concatenates:
- Global root translation in 3D space
- Global body orientation using a continuous 6D rotation representation
- 21 local joint rotations in 6D form
- 22 local joint positions in 3D coordinates
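The arithmetic works out to 3 + 6 + 21 × 6 + 22 × 3 = 201. The snippet below packs those pieces into a single frame vector, assuming the common convention that the 6D rotation encoding is the first two columns of a rotation matrix, which the write-up does not explicitly confirm.

```python
import numpy as np

def rotmat_to_6d(R):
    """Continuous 6D rotation encoding: the first two columns of a 3x3
    rotation matrix, flattened column-major (Zhou et al., 2019)."""
    return R[:, :2].reshape(-1, order="F")                      # shape (6,)

def pack_frame(root_pos, root_rot, joint_rots, joint_pos):
    """root_pos: (3,), root_rot: (3, 3), joint_rots: (21, 3, 3),
    joint_pos: (22, 3). Returns the 201-D vector: 3 + 6 + 21*6 + 22*3."""
    frame = np.concatenate([
        root_pos,                                               # 3:   root translation
        rotmat_to_6d(root_rot),                                 # 6:   global orientation
        np.concatenate([rotmat_to_6d(R) for R in joint_rots]),  # 126: local rotations
        joint_pos.reshape(-1),                                  # 66:  local joint positions
    ])
    assert frame.shape == (201,)
    return frame

# Example with identity rotations and zeroed positions.
frame = pack_frame(np.zeros(3), np.eye(3),
                   np.tile(np.eye(3), (21, 1, 1)), np.zeros((22, 3)))
print(frame.shape)   # (201,)
```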
The team notes that velocities and foot contact labels were removed because they slowed training and did not improve final quality. The representation is positioned as compatible with animation workflows and close to the DART model representation.
The HY-Motion DiT architecture: hybrid multimodal fusion
At the model core is a hybrid HY-Motion DiT (Diffusion Transformer). The architecture is described as combining two stages of multimodal processing:
Dual-stream blocks
Early layers process motion latents and text tokens in separate streams. Each modality has its own QKV projections and MLP, while a joint attention module enables motion tokens to query semantic information from text tokens. This is designed to preserve modality-specific structure while still grounding motion in language.
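A simplified PyTorch sketch of that idea follows; the layer sizes, normalization placement, and absence of output projections are assumptions made to keep the example short, not details of the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualStreamBlock(nn.Module):
    """Sketch of a dual-stream block: motion and text keep separate QKV
    projections and MLPs, while joint attention lets motion tokens read
    from text tokens. Sizes and layout are assumed, not from the release."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.qkv_motion = nn.Linear(dim, 3 * dim)
        self.qkv_text = nn.Linear(dim, 3 * dim)
        self.mlp_motion = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_text = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm_motion, self.norm_text = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def _split_heads(self, qkv):
        q, k, v = qkv.chunk(3, dim=-1)
        to_heads = lambda x: x.unflatten(-1, (self.heads, -1)).transpose(1, 2)
        return to_heads(q), to_heads(k), to_heads(v)

    def forward(self, motion, text):
        # Modality-specific projections over (batch, length, dim) inputs.
        qm, km, vm = self._split_heads(self.qkv_motion(self.norm_motion(motion)))
        qt, kt, vt = self._split_heads(self.qkv_text(self.norm_text(text)))

        # Joint attention: motion queries see motion + text keys; text queries
        # see only text keys (the asymmetry described later in the article).
        m_out = F.scaled_dot_product_attention(qm, torch.cat([km, kt], dim=2),
                                               torch.cat([vm, vt], dim=2))
        t_out = F.scaled_dot_product_attention(qt, kt, vt)

        merge = lambda x: x.transpose(1, 2).flatten(-2)
        motion = motion + merge(m_out)
        text = text + merge(t_out)
        return motion + self.mlp_motion(motion), text + self.mlp_text(text)

# Example: 120 motion tokens and 32 text tokens at width 512.
m, t = DualStreamBlock(512)(torch.randn(2, 120, 512), torch.randn(2, 32, 512))
```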
Single-stream blocks
Later layers concatenate motion and text tokens into one combined sequence, using parallel spatial and channel attention modules to deepen multimodal fusion.
Text conditioning with two encoders
Text conditioning uses a dual-encoder strategy:
- Qwen3 8B for token-level embeddings
- A CLIP-L model for global text features
A Bidirectional Token Refiner is included to counteract the causal attention bias typical of LLMs, which can be a mismatch for non-autoregressive generation.
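In effect, the refiner re-encodes the causally produced token embeddings with full bidirectional attention. A minimal sketch of that idea, with assumed layer counts and sizes:

```python
import torch
import torch.nn as nn

class BidirectionalTokenRefiner(nn.Module):
    """Re-encodes causal-LM token embeddings with bidirectional self-attention
    so every text token can see the whole prompt (layer sizes are assumed)."""

    def __init__(self, dim: int = 1024, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, llm_token_embeddings, padding_mask=None):
        # No causal mask is passed, so attention is fully bidirectional.
        return self.encoder(llm_token_embeddings, src_key_padding_mask=padding_mask)

# Refine (batch, seq, dim) embeddings taken from a causal text encoder.
refined = BidirectionalTokenRefiner()(torch.randn(2, 32, 1024))
```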
Notably, attention is asymmetric: motion tokens can attend to all text tokens, but text tokens do not attend back to motion. The intent is to prevent noisy motion states from degrading the language representation.
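Expressed as an attention mask over a concatenated [motion | text] sequence, the asymmetry looks like the sketch below; this illustrates the stated behavior and is not code from the release.

```python
import torch

def asymmetric_attention_mask(n_motion: int, n_text: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed) over a [motion | text] sequence:
    motion queries may attend to everything, text queries only to text."""
    total = n_motion + n_text
    mask = torch.zeros(total, total, dtype=torch.bool)
    mask[:n_motion, :] = True            # motion rows see all tokens
    mask[n_motion:, n_motion:] = True    # text rows see text tokens only
    return mask

print(asymmetric_attention_mask(4, 3).int())
```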
Temporal attention designed for longer clips
Within the motion branch, temporal attention uses a narrow sliding window of 121 frames, focusing model capacity on local kinematics while controlling compute for longer sequences. After text and motion tokens are concatenated, Full Rotary Position Embedding is applied to encode relative positions across the combined sequence.
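A 121-frame window means each motion token attends to roughly 60 frames on either side of itself. The sketch below builds such a mask; applying rotary embeddings over the concatenated sequence is omitted.

```python
import torch

def sliding_window_mask(n_frames: int, window: int = 121) -> torch.Tensor:
    """Boolean mask (True = allowed): frame i attends only to frames j with
    |i - j| <= window // 2, i.e. a 121-frame local neighborhood."""
    idx = torch.arange(n_frames)
    return (idx[:, None] - idx[None, :]).abs() <= window // 2

mask = sliding_window_mask(300)            # 10 seconds of motion at 30 fps
print(mask.shape, mask[150].sum().item())  # row 150 sees the full 121 frames
```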
Flow Matching instead of denoising diffusion
Rather than using standard denoising diffusion training, HY-Motion 1.0 uses Flow Matching. In this setup, the model learns a velocity field along a continuous path interpolating between Gaussian noise and real motion data. Training uses a mean squared error objective between predicted and ground-truth velocities along the path.
During inference, the system integrates the learned ordinary differential equation from noise to a clean motion trajectory. The approach is presented as offering more stable training for long sequences while fitting well with a DiT-based design.
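The sketch below shows a generic rectified-flow-style version of this setup: interpolate linearly between noise and data, regress the velocity, then integrate with Euler steps at inference. The linear path, step count, and model signature are common choices assumed for illustration, not confirmed details of HY-Motion.

```python
import torch

def flow_matching_loss(model, x1, text_cond):
    """x1: (B, T, 201) clean motion. Linear path x_t = (1 - t) * x0 + t * x1
    with x0 ~ N(0, I); the target velocity along that path is x1 - x0."""
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    x_t = (1.0 - t) * x0 + t * x1
    v_pred = model(x_t, t.flatten(), text_cond)      # hypothetical model signature
    return torch.nn.functional.mse_loss(v_pred, x1 - x0)

@torch.no_grad()
def sample(model, text_cond, n_frames, dim=201, steps=50, device="cpu"):
    """Euler integration of the learned ODE from noise (t=0) to motion (t=1)."""
    x = torch.randn(1, n_frames, dim, device=device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt, device=device)
        x = x + dt * model(x, t, text_cond)
    return x

# Smoke test with a dummy velocity model that always predicts zeros.
dummy = lambda x, t, cond: torch.zeros_like(x)
print(sample(dummy, None, n_frames=90).shape)        # torch.Size([1, 90, 201])
```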
Prompt rewriting and duration prediction: a separate alignment module
To improve instruction adherence, HY-Motion 1.0 adds a dedicated Duration Prediction and Prompt Rewrite module. This component is built on Qwen3 30B A3B and is trained using synthetic user-style prompts derived from motion captions via a VLM and LLM pipeline (an example cited is Gemini 2.5 Pro).
The module has two jobs:
- Predict an appropriate motion duration
- Rewrite informal prompts into normalized text that is easier for the DiT to follow
Training begins with supervised fine-tuning and then applies Group Relative Policy Optimization, using Qwen3 235B A22B as a reward model scoring semantic consistency and duration plausibility.
Training curriculum: pretraining, fine-tuning, then RL alignment
The overall training is described as a three-stage curriculum:
- Stage 1: large-scale pretraining on the full 3,000-hour dataset to learn a broad motion prior and initial text-motion alignment
- Stage 2: fine-tuning on the 400-hour high-quality subset to improve realism and semantic correctness at a lower learning rate
- Stage 3: reinforcement learning alignment, including Direct Preference Optimization and Flow GRPO
In Stage 3, Direct Preference Optimization uses 9,228 curated human preference pairs sampled from about 40,000 generated pairs. Flow GRPO then introduces a composite reward combining a semantic score from a Text Motion Retrieval model and a physics score that penalizes artifacts such as foot sliding and root drift, with a KL regularization term to remain close to the supervised model.
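Conceptually, the Flow GRPO reward combines a semantic term, a physics term, and a KL penalty. The sketch below shows that composition with placeholder scores and made-up weights; the actual scorers (a text-motion retrieval model and physics checks) and their weighting are only described qualitatively in the source.

```python
def composite_reward(semantic_score, foot_slide, root_drift, kl_to_sft,
                     w_sem=1.0, w_phys=1.0, w_kl=0.1):
    """Hypothetical composition: reward semantic alignment, penalize physical
    artifacts, and regularize toward the supervised (SFT) model via a KL term."""
    physics_score = -(foot_slide + root_drift)   # fewer artifacts -> higher score
    return w_sem * semantic_score + w_phys * physics_score - w_kl * kl_to_sft

# Example: a well-aligned clip with mild foot sliding and a small policy shift.
print(composite_reward(semantic_score=0.82, foot_slide=0.03,
                       root_drift=0.05, kl_to_sft=0.4))
```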
Benchmarks and scaling behavior
For evaluation, the team constructs a test set of over 2,000 prompts spanning the 6 taxonomy categories, including simple, concurrent, and sequential actions. Human raters score instruction following and motion quality on a 1 to 5 scale.
Reported results include:
- HY-Motion 1.0 instruction following: 3.24 average
- HY-Motion 1.0 SSAE: 78.6 percent
- Baselines (DART, LoM, GoToZero, MoMask) instruction following: 2.17 to 2.31
- Baselines SSAE: 42.7 percent to 58.0 percent
- HY-Motion 1.0 motion quality: 3.43 average
- Best baseline motion quality: 3.11
What scaling experiments suggest
Scaling experiments compare DiT models at 0.05B, 0.46B, and 1B parameters, plus a 0.46B variant trained only on the 400-hour subset. Instruction following improves with model size, with the 1B model reaching an average of 3.34. Motion quality, however, appears to saturate around the 0.46B scale, with the 0.46B and 1B models landing in a similar 3.26 to 3.34 range.
The comparison between the two 0.46B models—one trained on 3,000 hours and one on only 400 hours—suggests that larger data volume is critical for instruction alignment, while high-quality curation mainly boosts realism.
Conclusion
HY-Motion 1.0 positions Tencent Hunyuan’s text-to-3D motion generation as a billion-parameter, Flow Matching-based DiT system built around large-scale motion data, a unified SMPL-H output format, and explicit alignment mechanisms for prompt understanding and physical plausibility. With open weights, tooling for local inference, and a documented training and evaluation setup, it offers a substantial new option for developers building language-driven animation workflows.
Based on reporting originally published by www.marktechpost.com. See the sources section below.
Sources
- www.marktechpost.com
- https://arxiv.org/pdf/2512.23464
- https://github.com/Tencent-Hunyuan/HY-Motion-1.0