# py-feat detector benchmark — 2026-05-03 23:09:29

## Run metadata

- **Date:** 2026-05-03 23:09:29
- **py-feat version:** 0.7.0
- **Git commit:** d71c0d7
- **Host:** vpn-two-factor-general-228-129-185.dartmouth.edu (arm64, 18 CPUs)
- **Python:** 3.13.12
- **PyTorch:** 2.11.0
- **GPU:** MPS available
- **OMP_NUM_THREADS:** `1`
- **Devices swept:** ['mps']
- **Batch sizes:** [1, 4, 16]
- **DataLoader workers:** [0]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
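
The warmup-then-time protocol and the derived table columns can be sketched as follows. This is a minimal illustration, not the benchmark script itself: `time_call`, `derive_metrics`, and the stand-in workload are hypothetical names, and it assumes that `ms/frame` and `fps` are derived from the timed wall time and the frame count.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object]) -> float:
    """Run fn once untimed (warmup), then return wall seconds for a second call."""
    fn()  # warmup: absorbs one-time costs such as lazy initialization
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def derive_metrics(seconds: float, n_frames: int) -> tuple[float, float]:
    """Return (ms per frame, frames per second) for a timed run."""
    return seconds / n_frames * 1000.0, n_frames / seconds


# Hypothetical stand-in workload; a real run would call the detector here.
elapsed = time_call(lambda: sum(range(100_000)))

# Reproduce one table row: retinaface, long video, batch=16 (8.86 s, 472 frames).
ms_per_frame, fps = derive_metrics(8.86, 472)
print(f"{ms_per_frame:.1f} ms/frame, {fps:.1f} fps")  # 18.8 ms/frame, 53.3 fps
```

On asynchronous backends such as MPS, a real harness may also need to synchronize the device (for example with `torch.mps.synchronize()`) before reading the clock; whether the benchmark script does so is not stated here.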

## Video: short (72 frames)

### img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 8.17 | 113.5 | 8.8 |
| mps | 4 | 5.79 | 80.5 | 12.4 |
| mps | 16 | 5.21 | 72.3 | 13.8 |


### retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 3.88 | 54.0 | 18.5 |
| mps | 4 | 1.58 | 21.9 | 45.6 |
| mps | 16 | 0.98 | 13.7 | 73.2 |


### MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 10.21 | 141.8 | 7.1 |
| mps | 4 | 3.24 | 45.0 | 22.2 |
| mps | 16 | 1.63 | 22.6 | 44.2 |


## Video: long (472 frames)

### img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 82.45 | 174.7 | 5.7 |
| mps | 4 | 57.08 | 120.9 | 8.3 |
| mps | 16 | 53.62 | 113.6 | 8.8 |


### retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 29.33 | 62.1 | 16.1 |
| mps | 4 | 12.10 | 25.6 | 39.0 |
| mps | 16 | 8.86 | 18.8 | 53.3 |


### MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 77.89 | 165.0 | 6.1 |
| mps | 4 | 23.16 | 49.1 | 20.4 |
| mps | 16 | 11.08 | 23.5 | 42.6 |


## Images: 16 x multi_face.jpg = 80 faces

### img2pose

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| mps | 1 | 3.71 | 232.2 | 80 |
| mps | 4 | 2.28 | 142.6 | 80 |
| mps | 16 | 2.69 | 168.2 | 80 |


### retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| mps | 1 | 1.43 | 89.5 | 80 |
| mps | 4 | 0.89 | 55.8 | 80 |
| mps | 16 | 0.90 | 56.4 | 80 |


### MPDetector retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| mps | 1 | 3.97 | 248.3 | 80 |
| mps | 4 | 1.74 | 108.8 | 80 |
| mps | 16 | 1.22 | 76.5 | 80 |

---

## Notes (hand-curated)

**Scope:** full-pipeline bench on M5 MBP MPS. All three configs run with `au_model='xgb'` (or `mp_blendshapes` for MPDetector), `emotion_model='resmasknet'`, and `identity_model='arcface'` — matching the default `Detector()` config most users run in production. Replaces the partial `svm`-AU bench at `2026-05-03-864962c-mps.md`.

**Headline numbers (best ms/frame or ms/img, MPS):**

| Config | Long video (472 frames) | Image batch (16 images, 80 faces) |
|---|---|---|
| `Detector(face_model='img2pose', au_model='xgb')` | 113.6 ms/frame (8.8 fps), batch=16 | 142.6 ms/img, batch=4 |
| `Detector(face_model='retinaface', au_model='xgb')` | 18.8 ms/frame (53.3 fps), batch=16 | 55.8 ms/img, batch=4 |
| `MPDetector(au_model='mp_blendshapes')` | 23.5 ms/frame (42.6 fps), batch=16 | 76.5 ms/img, batch=16 |

All three include emotion (resmasknet) + identity (arcface) on top of face + landmark + AU.

**Recommendation for users:** `Detector(face_model='retinaface')` is the fastest full-pipeline config on MPS: 53.3 fps on the 472-frame video at batch=16. `img2pose` pays for its head-pose regression in compute (~6× slower at batch=16, 113.6 vs 18.8 ms/frame). MPDetector falls in between: it provides the 478-point mediapipe mesh but runs about 25% slower than retinaface on the long video (23.5 vs 18.8 ms/frame).
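
The relative costs behind this recommendation can be checked with a few lines of arithmetic. The sketch below only re-uses the best long-video `ms/frame` values from the tables in this report; the dictionary and variable names are illustrative.

```python
# Best long-video ms/frame at batch=16, copied from the tables in this report.
best_ms_per_frame = {
    "img2pose": 113.6,
    "retinaface": 18.8,
    "MPDetector": 23.5,
}

# Express every config as a multiple of the fastest one (retinaface).
baseline = best_ms_per_frame["retinaface"]
slowdown = {name: ms / baseline for name, ms in best_ms_per_frame.items()}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.2f}x retinaface")
# img2pose: 6.04x retinaface
# retinaface: 1.00x retinaface
# MPDetector: 1.25x retinaface
```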

**HOG batching speedup (PR #292, isolated bench):**

The detector-level numbers above include face detection + landmark + emotion + identity, which dilutes the HOG speedup. Isolated `extract_hog_features_batched` vs the legacy `extract_hog_features` per-face loop on MPS:

| n_faces | Legacy | Batched (PR #292) | Speedup |
|---|---|---|---|
| 5 | 15.2 ms | 7.8 ms | 1.96x |
| 20 | 68.1 ms | 16.3 ms | 4.19x |
| 50 | 145.9 ms | 32.0 ms | 4.56x |
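
As a quick cross-check, the speedup column and the per-face cost can be recomputed from the rounded cells above. This is a sketch using only the table values. It assumes the Speedup column is legacy time divided by batched time, presumably computed from unrounded timings, so the last digit can differ slightly from the recomputed ratio.

```python
# (n_faces, legacy ms, batched ms), copied from the table above.
rows = [(5, 15.2, 7.8), (20, 68.1, 16.3), (50, 145.9, 32.0)]

results = []
for n_faces, legacy_ms, batched_ms in rows:
    speedup = legacy_ms / batched_ms          # assumed definition of the Speedup column
    legacy_per_face = legacy_ms / n_faces     # ms per face, per-face loop
    batched_per_face = batched_ms / n_faces   # ms per face, batched call
    results.append((n_faces, speedup, legacy_per_face, batched_per_face))
    print(
        f"n={n_faces}: {speedup:.2f}x, "
        f"{legacy_per_face:.2f} -> {batched_per_face:.2f} ms/face"
    )
```

In these numbers the legacy loop stays near 3 ms per face at every size, while the batched path drops from about 1.6 ms to about 0.6 ms per face as `n_faces` grows, which is why the speedup increases with face count.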

**Comparison to prior baselines:**
- `2026-05-03-864962c.md` (Linux + RTX 3090 + CUDA, hand-curated note) was run with `au_model='svm'`, per the bench script's pre-PR-NNN configs. Cell-by-cell comparison against the present configs drifts ~5-10% (svm vs xgb), so cross-platform speed gaps are illustrative, not exact. Re-running that bench with the new full-pipeline configs would close the comparison gap.
- `2026-05-03-437b651.md` is hand-curated from a partial v0.7 svm sweep; the workers-axis cells are still informative for `num_workers > 0` regression tracking.
- Pre-v0.7 community benchmarks live in py-feat issue #184 (Google Sheet, per-stage timings); they are not directly cell-comparable to the `detect()` wall-time format used here.

