Back to Yue Wang's homepage
Stony Brook University FutureWei Technologies WACV 2026
WACV 2026 LENS Workshop · Tucson, Arizona · Mar 6–10, 2026

Learning Domain Agnostic Latent Embeddings of 3D Faces for Zero-shot Animal Expression Transfer

Yue Wang1, Lawrence Amadi2, Xiang Gao1, Yazheng Chen1, Yuanpeng Liu1, Ning Lu2, and Xianfeng David Gu1
1Stony Brook University  ·  2FutureWei Technologies
Paper (PDF) arXiv Results Cite

Abstract

We propose a framework for zero-shot cross-species expression transfer on 3D face meshes. By learning domain-agnostic latent embeddings that disentangle identity and expression using intrinsic geometric descriptors (HKS and WKS), the model generalizes at inference time to animal meshes — including felines, dogs, and hippos — without any animal supervision during training. The approach is invariant to global translation, rotation, and scale, eliminating the need for pre-alignment or canonical template normalization.

Results

The model trained solely on human face data successfully transfers diverse expressions to unseen animal meshes at inference time — zero-shot, with no animal supervision.

Zero-shot dog and hippo expression transfer

Dog smile transfer
sliceddog → smile
Dog lip-up-eye-squint transfer
sliceddog → 1lipup1eyesquint
Hippo mouthslant transfer
slicedhippo → mouthslant1eyesquint
Hippo mouthsideway transfer
slicedhippo → mouthsideway

Cross-species expression transfer grid

Human and feline expression transfer grid
Top row: 5 different identities including humans and felines. Left column: 4 human expression inputs. Grid: predicted expression for each identity–expression pair.

Failure cases

Limitation: Some expressions, such as eye or mouth closing, are challenging to transfer to animal faces due to large morphological differences between human and animal facial anatomy in those regions. The model struggles most when the target species lacks the anatomical structures needed to reproduce a given expression.
Failure cases in expression transfer
Examples where expression transfer fails to capture the intended expression

See also: Method and Latent space analysis below for more details on how these results are achieved.

Motivation

Cross-species gap

  • Human facial animation is well-developed, but extending it to non-human characters remains challenging due to morphological differences
  • Collecting paired animal expression data is costly or infeasible for many species

Limitations of prior work

  • Typical human expression transfer methods rely on shared topology or extensive labels
  • They don't generalize to animals out of the box
  • Our method requires zero animal supervision at training time

Method

The framework has a single training stage on human face data, after which the learned model is directly applied to animal meshes at inference time without any additional training or fine-tuning.

Training: Learning domain-agnostic latent embeddings

Training: Learning Domain Agnostic Latent Embedding of 3D Faces
The model is trained on human face triplets. Identity and expression encoders produce disentangled latents zID and zExp, fused via cross-attention into a geometry decoder supervised by Jacobian and vertex losses.

Inference: Zero-shot human-to-animal expression transfer

Inference: Human-To-Animal Expression Transfer
At inference time, the trained model is directly applied to unseen animal identity meshes. No additional training, fine-tuning, or animal data is used — this is purely zero-shot evaluation.

Key components

Geometry

  • Intrinsic descriptors: HKS and WKS
  • Invariant to rigid transformations
  • Neural Jacobian Field decoder
  • Poisson solver for mesh reconstruction

Learning

  • DiffusionNet encoder for each branch
  • Cross-attention latent fusion
  • Mesh-agnostic inputs
  • Vertex loss + Jacobian loss

Dataset

Training — human faces

  • 1,000 synthesized mesh triplets (Mid, Mexp, Mgt) from the ICT Face Model
  • Identity mesh may have a non-neutral expression

Inference — animal meshes

  • Felines from CAFM (WACV Workshops 2020)
  • Dogs and hippos from SMAL (CVPR 2017)
  • Zero animal supervision used at any point

Latent Space Analysis

t-SNE visualization shows the ID encoder produces well-separated clusters per identity, while the expression encoder captures meaningful structure across 50 expression categories.

Latent Space of ID Encoder (t-SNE)
ID encoder latent space — each color is a different identity
Expression Latent Space (t-SNE)
Expression encoder latent space — each color is a different expression

Closest neighbors in latent space

Closest 3 faces in ID latent space
Closest 3 faces in the ID latent space
Closest 3 faces in expression latent space
Closest 3 faces in the expression latent space

Furthest neighbors in latent space

Furthest 3 faces in ID latent space
Furthest 3 faces in the ID latent space
Furthest 3 faces in expression latent space
Furthest 3 faces in the expression latent space

Future Directions

Richer training data

Train on larger, more diverse datasets (e.g., MultiFace) for better generalization across identities and expressions.

Richer supervision

Add semantic or muscle-based supervision to improve realism and anatomical plausibility.

Dynamic sequences

Extend to temporal sequences and real-time applications in VR, AR, and animation pipelines.

More species

Expand zero-shot transfer to a wider variety of animal species with larger morphological differences.

Citation

If you find this work useful, please cite:

@inproceedings{wang2026domain, title = {Learning Domain Agnostic Latent Embeddings of 3D Faces for Zero-shot Animal Expression Transfer}, author = {Wang, Yue and Amadi, Lawrence and Gao, Xiang and Chen, Yazheng and Liu, Yuanpeng and Lu, Ning and Gu, Xianfeng David}, booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops}, year = {2026} }