Modern interactive applications increasingly demand dynamic 3D content, yet the transformation of static 3D models into animated assets constitutes a significant bottleneck in content creation pipelines. While recent advances in generative AI have revolutionized static 3D model creation, rigging and animation continue to depend heavily on expert intervention.
We present Puppeteer, a comprehensive framework that addresses both automatic rigging and animation for diverse 3D objects. Our system first predicts plausible skeletal structures with an auto-regressive transformer, which introduces a joint-based tokenization strategy for compact representation and a hierarchical ordering methodology with stochastic perturbation to enhance bidirectional learning. It then infers skinning weights with an attention-based architecture whose topology-aware joint attention explicitly encodes inter-joint relationships based on skeletal graph distances. Finally, we complement these rigging advances with a differentiable optimization-based animation pipeline that generates stable, high-fidelity animations while being computationally more efficient than existing approaches. Extensive evaluations across multiple benchmarks demonstrate that our method significantly outperforms state-of-the-art techniques in both skeletal prediction accuracy and skinning quality. The system robustly processes diverse 3D content, from professionally designed game assets to AI-generated shapes, producing temporally coherent animations free of the jittering common in existing methods.
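To make the topology-aware joint attention concrete, the following is a minimal PyTorch sketch, not the authors' implementation: attention logits between joints are penalized in proportion to their skeletal graph distance, so topologically close joints attend to each other more strongly. The BFS-based hop-distance computation and the bias scale `gamma` are our assumptions for illustration.

```python
# Hedged sketch of topology-aware joint attention: attention logits between
# joints are biased by skeletal graph distance. `gamma` and the BFS hop
# distance are illustrative assumptions, not values from the paper.
import torch
import torch.nn.functional as F
from collections import deque

def skeletal_graph_distances(num_joints: int, bones: list[tuple[int, int]]) -> torch.Tensor:
    """All-pairs hop distances over the skeleton graph, via BFS from each joint."""
    adj = [[] for _ in range(num_joints)]
    for a, b in bones:
        adj[a].append(b)
        adj[b].append(a)
    dist = torch.full((num_joints, num_joints), float("inf"))
    for src in range(num_joints):
        dist[src, src] = 0.0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[src, v] == float("inf"):
                    dist[src, v] = dist[src, u] + 1
                    queue.append(v)
    return dist

def topology_aware_attention(q, k, v, dist, gamma: float = 0.5):
    """Scaled dot-product attention over joints with a graph-distance penalty.

    q, k, v: (J, D) joint features; dist: (J, J) hop distances.
    Disconnected pairs (infinite distance) receive zero attention weight.
    """
    logits = q @ k.t() / q.shape[-1] ** 0.5 - gamma * dist
    return F.softmax(logits, dim=-1) @ v
```

In this formulation the penalty plays the role of a relative positional bias over the skeleton graph rather than over a 1D sequence; disconnected joint pairs are masked out automatically because their logits become negative infinity.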
Overview of our automatic rigging pipeline. Given a 3D mesh, we first sample point clouds with normals, then generate a skeleton using an auto-regressive transformer. The point clouds and skeleton are processed through an attention-based network with four key operations: (1) bone feature enhancement via topology-aware joint attention, (2) global context integration through cross-attention with shape latents, (3) bone-point interaction via cross-attention, and (4) point feature refinement. Finally, cosine similarity and softmax normalization produce the skinning weights.
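The final step of this pipeline, producing skinning weights from per-point and per-bone features via cosine similarity and softmax normalization, can be sketched as follows. This is a minimal illustration assuming matching feature dimensions; the temperature `tau` is a hypothetical parameter, not a value reported here.

```python
# Hedged sketch of the skinning-weight head: cosine similarity between point
# and bone features, then a softmax over bones. `tau` is an assumption.
import torch
import torch.nn.functional as F

def skinning_weights(point_feats: torch.Tensor,
                     bone_feats: torch.Tensor,
                     tau: float = 0.1) -> torch.Tensor:
    """point_feats: (N, D), bone_feats: (B, D) -> (N, B); each row sums to 1."""
    p = F.normalize(point_feats, dim=-1)   # unit-norm point features
    b = F.normalize(bone_feats, dim=-1)    # unit-norm bone features
    sim = p @ b.t()                        # cosine similarity in [-1, 1]
    return F.softmax(sim / tau, dim=-1)    # per-point distribution over bones

# Usage with random features: 1024 points, 24 bones, 256-dim features.
weights = skinning_weights(torch.randn(1024, 256), torch.randn(24, 256))
```

The softmax guarantees that each point's weights are non-negative and sum to one, the standard constraint for linear blend skinning.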
To address the data scarcity that limits previous approaches, we introduce Articulation-XL2.0, an expanded version of Articulation-XL comprising 59.4k 3D models with high-quality rigging. The dataset includes a carefully curated subset of 11.4k diverse pose examples that enhances generalization to novel articulations.
We compare our skeleton generation results with MagicArticulate on AI-generated meshes from Tripo2.0 and Hunyuan3D 2.0. Our approach consistently generates valid, robust skeletons across all categories.
We compare our skinning weight prediction results with MagicArticulate and RigNet. Each pair shows the predicted weight visualization alongside its L1 error map. Our predictions more closely match the artist-painted references.
We compare our animation results with L4GM and MotionDreamer. Despite well-aligned reference views, L4GM consistently produces geometric distortions. MotionDreamer generates only subtle motions and introduces unintended deformations in rigid parts. In contrast, our approach produces stable and accurate animations with the generated rigging.