# py-feat detector benchmark — 2026-05-04 02:03:51

## Run metadata

- **Date:** 2026-05-04 02:03:51
- **py-feat version:** 0.7.0
- **Git commit:** f44ccb1
- **Host:** vpn-two-factor-general-228-129-185.dartmouth.edu (arm64, 18 CPUs)
- **Python:** 3.13.12
- **PyTorch:** 2.11.0
- **GPU:** MPS available
- **OMP_NUM_THREADS:** `1`
- **Devices swept:** ['cpu']
- **Batch sizes:** [1, 4, 16]
- **DataLoader workers:** [0]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
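
The protocol above and the derived table columns can be sketched as follows. This is a minimal illustration, not the benchmark script itself: `time_with_warmup` and `per_frame_stats` are hypothetical helper names, and the callable being timed stands in for whatever detector call the harness actually makes.

```python
import time


def time_with_warmup(fn, *args, **kwargs):
    """Call fn once untimed (warmup), then once timed.

    Returns (result_of_timed_call, wall_seconds).
    """
    fn(*args, **kwargs)  # warmup: result and duration are discarded
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def per_frame_stats(seconds, n_frames):
    """Derive the ms/frame and fps columns from wall time and frame count."""
    ms_per_frame = 1000.0 * seconds / n_frames
    fps = n_frames / seconds
    return ms_per_frame, fps
```

Applied to the first img2pose row (28.37 s over 72 frames), `per_frame_stats` gives roughly 394.0 ms/frame and 2.5 fps; the table's 394.1 presumably comes from the unrounded wall time.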

## Video: short (72 frames)

### img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 28.37 | 394.1 | 2.5 |
| cpu | 4 | 28.10 | 390.3 | 2.6 |
| cpu | 16 | 95.05 | 1320.1 | 0.8 |


### retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 12.38 | 172.0 | 5.8 |
| cpu | 4 | 12.20 | 169.4 | 5.9 |
| cpu | 16 | 59.43 | 825.4 | 1.2 |


### MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 13.95 | 193.8 | 5.2 |
| cpu | 4 | 14.00 | 194.5 | 5.1 |
| cpu | 16 | 63.00 | 875.0 | 1.1 |


## Video: long (472 frames)

### img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 184.91 | 391.8 | 2.6 |
| cpu | 4 | 182.17 | 386.0 | 2.6 |
| cpu | 16 | 672.85 | 1425.5 | 0.7 |


### retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 82.55 | 174.9 | 5.7 |
| cpu | 4 | 79.91 | 169.3 | 5.9 |
| cpu | 16 | 427.30 | 905.3 | 1.1 |


### MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 93.58 | 198.3 | 5.0 |
| cpu | 4 | 92.36 | 195.7 | 5.1 |
| cpu | 16 | 446.98 | 947.0 | 1.1 |


## Images: 16 x multi_face.jpg = 80 faces

### img2pose

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cpu | 1 | 12.17 | 760.6 | 80 |
| cpu | 4 | 50.27 | 3141.7 | 80 |
| cpu | 16 | 48.63 | 3039.3 | 80 |


### retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cpu | 1 | 10.99 | 686.9 | 80 |
| cpu | 4 | 49.59 | 3099.2 | 80 |
| cpu | 16 | 50.62 | 3163.9 | 80 |


### MPDetector retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cpu | 1 | 13.29 | 830.3 | 80 |
| cpu | 4 | 52.06 | 3253.8 | 80 |
| cpu | 16 | 52.99 | 3312.0 | 80 |

---

## Notes (hand-curated)

**Scope:** full-pipeline (xgb AU + resmasknet emotion + arcface identity) on M5 MBP CPU. Companion to `2026-05-03-d71c0d7.md` (M5 MPS) and `2026-05-03-864962c.md` (Linux + RTX 3090 + svm-only AU).

**Headline finding: M5 CPU does not benefit from batching.**

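The clearest pattern in this file: no config benefits meaningfully from batching on M5 CPU. On video, batch 4 is within about 3% of batch 1 and batch 16 is 3.3-5.2× slower per frame; on images, both batch 4 and batch 16 are 3.9-4.6× slower per image. Example (short video, retinaface):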

| Batch | ms/frame |
|---|---|
| 1 | 172.0 |
| 4 | 169.4 |
| 16 | **825.4** (4.8× slower than batch 1) |

That's the opposite of the Linux CPU pattern in `2026-05-03-864962c.md`, where batching speeds things up (retinaface short video: 314 → 115 → 98 ms/frame at batch 1/4/16).
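
The batch-scaling pattern can be checked directly from the tables in this file. The sketch below copies a few rows by hand and normalizes each to its batch-1 value; `RESULTS` and `slowdown_vs_batch1` are illustrative names, not part of the benchmark tooling.

```python
# ms/frame (video) or ms/img (images), copied from the tables in this file.
RESULTS = {
    ("short video", "img2pose"): {1: 394.1, 4: 390.3, 16: 1320.1},
    ("short video", "retinaface"): {1: 172.0, 4: 169.4, 16: 825.4},
    ("long video", "retinaface"): {1: 174.9, 4: 169.3, 16: 905.3},
    ("images", "retinaface"): {1: 686.9, 4: 3099.2, 16: 3163.9},
}


def slowdown_vs_batch1(ms_by_batch):
    """Return {batch: ms / ms_at_batch_1}; values above 1.0 mean slower."""
    base = ms_by_batch[1]
    return {batch: ms / base for batch, ms in ms_by_batch.items()}


for key, ms_by_batch in RESULTS.items():
    ratios = slowdown_vs_batch1(ms_by_batch)
    print(key, {batch: round(r, 2) for batch, r in ratios.items()})
```

For short-video retinaface this yields about 0.98 at batch 4 and about 4.8 at batch 16, matching the table above.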

**M5 CPU vs Linux CPU (ms/frame at each batch size, short video, retinaface):**

| | M5 CPU (this file) | Linux CPU (svm-only) | Ratio |
|---|---|---|---|
| batch=1 | 172 ms/frame | 314 ms/frame | M5 1.8× faster |
| batch=4 | 169 ms/frame | 115 ms/frame | Linux 1.5× faster |
| batch=16 | 825 ms/frame | 98 ms/frame | Linux 8.4× faster |

At batch=1 (the package default and our current recommendation for CPU users), M5 is ~1.8× faster per frame than the Linux server CPU, which is consistent with M5's strong per-core throughput. At larger batches the ranking flips. A plausible explanation, not profiled here, is that the Linux build's MKL/OpenBLAS kernels exploit the larger L3 cache and AVX better than Apple's Accelerate does for single-threaded inference (this run sets `OMP_NUM_THREADS=1`). **The Linux file's `svm`-AU configs differ slightly from this file's `xgb` + emotion + identity stack, so cell-by-cell comparisons can be off by ~5-10%; the qualitative pattern (Linux scales with batch, M5 doesn't) is robust.**
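
The Ratio column in the table above is simple arithmetic on the two ms/frame columns. A small sketch, using the rounded table values and a hypothetical `ratio_label` helper:

```python
# Rounded ms/frame values from the comparison table (short video, retinaface).
M5_MS = {1: 172, 4: 169, 16: 825}
LINUX_MS = {1: 314, 4: 115, 16: 98}


def ratio_label(m5_ms, linux_ms):
    """Name the faster platform and how many times faster it is per frame."""
    if m5_ms < linux_ms:
        return f"M5 {linux_ms / m5_ms:.1f}x faster"
    return f"Linux {m5_ms / linux_ms:.1f}x faster"


labels = {batch: ratio_label(M5_MS[batch], LINUX_MS[batch]) for batch in M5_MS}
for batch, label in labels.items():
    print(f"batch={batch}: {label}")
```

Because the two files run different AU stacks, these ratios carry the same ~5-10% uncertainty noted above.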

**Implication for users:**

- **Apple Silicon (M-series) on CPU:** `batch_size=1` is the right default, and it is already what the package defaults to. Batch 4 is at best neutral on video, while batch 16 on video (and any batch above 1 on images) is roughly 3-5× slower per frame.
- **Apple Silicon on MPS:** `batch_size=16` wins (see `2026-05-03-d71c0d7.md` — 18.8 ms/frame at b=16 vs 62.1 at b=1).
- **Linux CPU:** `batch_size=16` wins by a wide margin.
- **Linux + CUDA:** `batch_size=4` is the sweet spot on small inputs (see `2026-05-03-864962c.md` — image bench shows non-monotonic regression at b=16).
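
The recommendations above amount to a small lookup table. The sketch below encodes them; `recommended_batch_size` and the platform labels are hypothetical and not part of py-feat, and falling back to 1 for unmeasured combinations is an assumption based on the package default mentioned above.

```python
# Hypothetical lookup of the batch sizes recommended in these notes.
# Keys are (platform, device); neither the labels nor the function are py-feat APIs.
RECOMMENDED_BATCH = {
    ("apple_silicon", "cpu"): 1,
    ("apple_silicon", "mps"): 16,
    ("linux", "cpu"): 16,
    ("linux", "cuda"): 4,
}


def recommended_batch_size(platform_name, device, default=1):
    """Return the benchmarked recommendation, or `default` if unmeasured."""
    key = (platform_name.lower(), device.lower())
    return RECOMMENDED_BATCH.get(key, default)


print(recommended_batch_size("apple_silicon", "mps"))
```

The value would then be passed as the `batch_size` argument of whichever detection call is in use.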

**MPS vs CPU on M5 (full pipeline, retinaface long video, batch=16, this file vs `d71c0d7.md`):**

- M5 MPS: 18.8 ms/frame (53.3 fps)
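- M5 CPU: 905 ms/frame (1.1 fps), **48× slower**. Batch 16 is the CPU's worst setting; at its best batch (4, 169.3 ms/frame) the CPU is still ~9× slower than MPS.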
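
The slowdown factors follow from the long-video retinaface rows in this file and the MPS figure quoted from `d71c0d7.md`. A short sketch of the arithmetic (variable names are illustrative), comparing MPS at batch 16 against both the same-batch and the best-batch CPU result:

```python
# MPS ms/frame at batch 16, as quoted in these notes from d71c0d7.md.
MPS_MS_B16 = 18.8
# CPU ms/frame by batch size: long video, retinaface, from this file.
CPU_MS = {1: 174.9, 4: 169.3, 16: 905.3}


def slowdown(cpu_ms, mps_ms):
    """How many times slower the CPU is per frame than MPS."""
    return cpu_ms / mps_ms


same_batch = slowdown(CPU_MS[16], MPS_MS_B16)      # CPU at batch 16
best_batch = min(CPU_MS, key=CPU_MS.get)           # CPU's fastest batch size
best_case = slowdown(CPU_MS[best_batch], MPS_MS_B16)

print(f"same batch: {same_batch:.0f}x, best CPU batch ({best_batch}): {best_case:.0f}x")
```

Even when the CPU is given its most favorable batch size, MPS remains about 9× faster per frame on this input.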

If a user has MPS available, they shouldn't be running on CPU. The `Detector(device='auto')` default should handle this correctly, but it is worth flagging in the docs.

