face_pose_mlp

feat.utils.face_pose_mlp

Landmark-only pose MLP inference.

The sole 6DoF-pose backend for non-img2pose face_model paths. The loaded MLP takes 68 face landmarks, normalized by their own centroid and inter-eye distance (no face bbox is used), and emits a 6DoF head pose calibrated to img2pose's coordinate frame.

Training: distillation from img2pose on CelebV-HQ. v2 (default) was trained on 2.78M frames from 35K clips with a 512→256→128 hidden stack using LayerNorm, GELU, and Dropout; v1 was the smaller 256→128→64 ReLU baseline trained on ~570K frames. v2 validation MAE on held-out CelebV-HQ: pitch 2.66°, roll 2.34°, yaw 1.58° (about 2.19° averaged over the three angles). For rough context, img2pose reports ~4° average MAE on BIWI; lower is better, but the two figures come from different datasets, so the comparison is only indicative.

Weights: models/pose_mlp_v2.safetensors locally, or py-feat/pose_mlp_v2 on HuggingFace.

PoseMLP

Bases: Module

Mirror of the architecture in scripts/train_pose_mlp.py.

v2 architecture: Linear → LayerNorm → GELU → Dropout per hidden block, with wider hidden layers (default 512/256/128). v1 used a bare Linear→ReLU→Dropout stack (256/128/64); we keep backward compatibility by inferring the architecture from the checkpoint when loading.
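As a rough sense of scale, the parameter counts implied by the two layer layouts can be worked out by hand. The sketch below assumes a 136-dimensional input (68 landmarks × 2 coordinates) and a 6-dimensional output, and counts only Linear and LayerNorm parameters. The helper name `count_params` is illustrative and not part of the module.

```python
def count_params(hidden, use_layernorm, in_dim=136, out_dim=6):
    """Count trainable parameters of a Linear(+LayerNorm) stack.

    Illustrative helper (not part of feat). Activation and Dropout
    layers carry no parameters, so they are ignored.
    """
    total = 0
    for h in hidden:
        total += in_dim * h + h      # Linear weight + bias
        if use_layernorm:
            total += 2 * h           # LayerNorm scale + shift
        in_dim = h
    total += in_dim * out_dim + out_dim  # final Linear to the 6 pose outputs
    return total


v2_params = count_params((512, 256, 128), use_layernorm=True)
v1_params = count_params((256, 128, 64), use_layernorm=False)
print(v2_params, v1_params)  # 236934 76614
```

Under these assumptions v2 has roughly three times as many parameters as v1, and both remain small models.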

Source code in feat/utils/face_pose_mlp.py
class PoseMLP(nn.Module):
    """Mirror of the architecture in ``scripts/train_pose_mlp.py``.

    v2 architecture: Linear → LayerNorm → GELU → Dropout per hidden
    block, with wider hidden layers (default 512/256/128). v1 used a
    bare Linear→ReLU→Dropout stack (256/128/64); we keep backward
    compatibility by inferring the architecture from the checkpoint
    when loading.
    """

    def __init__(self, hidden=(512, 256, 128), dropout: float = 0.15,
                 use_layernorm: bool = True):
        super().__init__()
        in_dim = 136
        layers: list[nn.Module] = []
        for h in hidden:
            layers.append(nn.Linear(in_dim, h))
            if use_layernorm:
                layers.extend([nn.LayerNorm(h), nn.GELU(), nn.Dropout(dropout)])
            else:
                layers.extend([nn.ReLU(), nn.Dropout(dropout)])
            in_dim = h
        layers.append(nn.Linear(in_dim, 6))
        self.net = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # noqa: D401
        return self.net(x)
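The docstring says the architecture is inferred from the checkpoint, but the loader itself is not shown here. One way such inference could work is sketched below, assuming the checkpoint keys follow the `net.<index>.weight` naming that `nn.Sequential` produces for the class above. The function name `infer_arch` and the shape-dictionary input are hypothetical; with a real state dict one would pass `{k: tuple(v.shape) for k, v in state_dict.items()}`.

```python
def infer_arch(shapes):
    """Guess (hidden, use_layernorm) from parameter shapes.

    Hypothetical helper, not the library's loader. ``shapes`` maps
    state-dict keys such as "net.0.weight" to shape tuples.
    """
    linear_out = {}        # Sequential index -> out_features of a Linear
    use_layernorm = False
    for key, shape in shapes.items():
        parts = key.split(".")
        if len(parts) != 3 or parts[0] != "net" or parts[2] != "weight":
            continue
        if len(shape) == 2:      # Linear weight is [out, in]
            linear_out[int(parts[1])] = shape[0]
        elif len(shape) == 1:    # LayerNorm weight is [features]
            use_layernorm = True
    # Sort numerically ("12" must come after "8"); the last Linear is the
    # 6-way output head, so it is not a hidden layer.
    widths = [linear_out[i] for i in sorted(linear_out)]
    return tuple(widths[:-1]), use_layernorm


# Weight shapes only; biases are omitted for brevity.
v2_shapes = {
    "net.0.weight": (512, 136), "net.1.weight": (512,),
    "net.4.weight": (256, 512), "net.5.weight": (256,),
    "net.8.weight": (128, 256), "net.9.weight": (128,),
    "net.12.weight": (6, 128),
}
v1_shapes = {
    "net.0.weight": (256, 136), "net.3.weight": (128, 256),
    "net.6.weight": (64, 128), "net.9.weight": (6, 64),
}
print(infer_arch(v2_shapes))  # ((512, 256, 128), True)
print(infer_arch(v1_shapes))  # ((256, 128, 64), False)
```

The returned pair maps directly onto the `hidden` and `use_layernorm` arguments of `PoseMLP.__init__`; `dropout` has no parameters and cannot be recovered from a checkpoint this way.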

pose_from_landmarks_mlp(landmarks_2d, bboxes=None)

Estimate 6DoF pose from 68 2D landmarks.

Bbox-free: normalizes landmarks by their own centroid + inter-eye distance, so the MLP is decoupled from upstream face-detector bbox conventions (img2pose loose vs retinaface tight).
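To make the bbox-free claim concrete, the sketch below re-implements the same centroid and inter-eye normalization in plain Python for a single face, then checks that shifting and scaling the landmarks leaves the features unchanged. The helper name `normalize_landmarks` and the synthetic landmark values are illustrative only; indices 36 and 45 are the outer eye corners in the dlib-68 layout.

```python
import math


def normalize_landmarks(points):
    """Centre 68 (x, y) points on their centroid and divide by the
    distance between landmarks 36 and 45 (outer eye corners).

    Illustrative re-implementation for one face; returns 136 floats,
    all x values followed by all y values.
    """
    if len(points) != 68:
        raise ValueError(f"expected 68 landmarks, got {len(points)}")
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    cx = sum(xs) / 68
    cy = sum(ys) / 68
    # Guard against a degenerate zero distance, as the library code does.
    iod = max(math.hypot(xs[36] - xs[45], ys[36] - ys[45]), 1e-6)
    return [(x - cx) / iod for x in xs] + [(y - cy) / iod for y in ys]


# Synthetic landmarks (not a real face): points on a slanted grid.
base = [(10.0 + 3.0 * (i % 17), 20.0 + 5.0 * (i // 17) + 0.5 * i)
        for i in range(68)]
# The same points after a 2.5x zoom and a shift, as a different crop
# or detector box convention might produce.
moved = [(2.5 * x + 100.0, 2.5 * y - 40.0) for x, y in base]

a = normalize_landmarks(base)
b = normalize_landmarks(moved)
max_diff = max(abs(u - v) for u, v in zip(a, b))
print(len(a), max_diff < 1e-9)  # 136 True
```

Because translation and uniform scale cancel out, the MLP input depends only on the landmark configuration itself, not on where or how large the detector's box was.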

Parameters:

Name          Type           Description                                               Default
landmarks_2d  Tensor         [B, 68, 2] landmark coordinates.                          required
bboxes        Tensor | None  Ignored (kept in signature for backward compatibility).   None

Returns:

Type           Description
Tensor | None  [B, 6] pose tensor with columns (Pitch, Roll, Yaw, X, Y, Z) matching FEAT_FACEPOSE_COLUMNS_6D, or None if the pose-MLP weights are not available.

Source code in feat/utils/face_pose_mlp.py
def pose_from_landmarks_mlp(
    landmarks_2d: torch.Tensor,
    bboxes: torch.Tensor | None = None,
) -> torch.Tensor | None:
    """Estimate 6DoF pose from 68 2D landmarks.

    Bbox-free: normalizes landmarks by their own centroid + inter-eye
    distance, so the MLP is decoupled from upstream face-detector bbox
    conventions (img2pose loose vs retinaface tight).

    Args:
        landmarks_2d: ``[B, 68, 2]`` landmark coordinates.
        bboxes: ignored (kept in signature for backward compatibility).

    Returns:
        ``[B, 6]`` pose tensor with columns
        ``(Pitch, Roll, Yaw, X, Y, Z)`` matching ``FEAT_FACEPOSE_COLUMNS_6D``.
        Returns ``None`` if the pose-MLP weights are not available.
    """
    device = landmarks_2d.device
    loaded = _load_pose_mlp(device=str(device))
    if loaded is None:
        return None
    model, y_mean, y_std = loaded
    y_mean_t = torch.as_tensor(y_mean, device=device, dtype=torch.float32)
    y_std_t = torch.as_tensor(y_std, device=device, dtype=torch.float32)

    if landmarks_2d.dim() != 3 or landmarks_2d.shape[-1] != 2:
        raise ValueError(
            f"landmarks_2d must be [B, 68, 2], got {tuple(landmarks_2d.shape)}"
        )
    if landmarks_2d.shape[1] != 68:
        raise ValueError(
            f"pose-MLP requires 68 landmarks, got {landmarks_2d.shape[1]}"
        )

    x = landmarks_2d[..., 0]  # [B, 68]
    y = landmarks_2d[..., 1]
    cx = x.mean(dim=1, keepdim=True)
    cy = y.mean(dim=1, keepdim=True)
    # Inter-eye distance (dlib-68: 36 = left-eye outer corner, 45 = right-eye outer corner).
    dx = x[:, 36] - x[:, 45]
    dy = y[:, 36] - y[:, 45]
    iod = torch.sqrt(dx * dx + dy * dy).clamp_min(1e-6).unsqueeze(1)
    x_norm = (x - cx) / iod
    y_norm = (y - cy) / iod
    feat = torch.cat([x_norm, y_norm], dim=1).to(torch.float32)

    with torch.inference_mode():
        z = model(feat)
        pose = z * y_std_t + y_mean_t

    return pose
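Because the function returns None when the weights are missing, callers need a guard before indexing the result. The sketch below shows one way to attach the column names to each output row. `POSE_COLUMNS` is spelled out from the docstring above rather than imported from feat, and `pose_rows_to_records` is a hypothetical helper. It takes plain nested lists, which a tensor result would provide via `pose.tolist()`.

```python
POSE_COLUMNS = ("Pitch", "Roll", "Yaw", "X", "Y", "Z")  # order from the docstring


def pose_rows_to_records(rows):
    """Turn [B, 6] pose rows into dicts keyed by column name.

    Hypothetical helper. ``rows`` is a list of 6-element lists, or None
    when the pose MLP could not be loaded; None yields an empty list.
    """
    if rows is None:
        return []
    records = []
    for row in rows:
        if len(row) != len(POSE_COLUMNS):
            raise ValueError(f"expected 6 pose values, got {len(row)}")
        records.append(dict(zip(POSE_COLUMNS, row)))
    return records


# Made-up numbers standing in for a model output.
example = [[1.5, -0.25, 10.0, 0.0, 0.0, 1.0]]
records = pose_rows_to_records(example)
print(records[0]["Yaw"])           # 10.0
print(pose_rows_to_records(None))  # []
```

Returning an empty list for None is one possible policy; raising an error may be preferable when a missing pose backend should stop the pipeline.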