# py-feat detector benchmark — 2026-05-03

## Run metadata

- **Date:** 2026-05-03
- **py-feat version:** 0.7.0
- **Git commit:** 437b651
- **Host:** Mac (arm64, 18 CPUs)
- **Python:** 3.13.12
- **PyTorch:** 2.11.0
- **MPS available:** True
- **OMP_NUM_THREADS:** `1`
- **Devices swept:** `['cpu', 'mps']`
- **Batch sizes:** `[1, 4, 16]`
- **DataLoader workers:** `[0, 2]`
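
The three swept axes form a full grid, which is why each results table below has 12 rows. A minimal sketch of that enumeration, assuming the Cartesian product is walked in device, batch, workers order (the name `sweep_grid` is hypothetical and is not taken from `scripts/bench_detectors.py`):

```python
from itertools import product

# Axes copied from the run metadata above.
DEVICES = ["cpu", "mps"]
BATCH_SIZES = [1, 4, 16]
WORKERS = [0, 2]


def sweep_grid(devices, batch_sizes, workers):
    """Return every (device, batch_size, num_workers) combination.

    itertools.product varies the last axis fastest, which matches the
    row order of the tables in this file.
    """
    return list(product(devices, batch_sizes, workers))


grid = sweep_grid(DEVICES, BATCH_SIZES, WORKERS)
print(len(grid))          # number of configurations per detector and workload
print(grid[0], grid[-1])  # first and last configuration in table order
```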

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
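
The warmup-then-time pattern described above can be sketched as follows. This is an illustration only: `time_with_warmup` and `fake_detect` are hypothetical names, and the real script may structure its timing differently.

```python
import time


def time_with_warmup(fn, *args, warmups=1, **kwargs):
    """Call fn untimed `warmups` times, then return (result, seconds) for one timed call."""
    for _ in range(warmups):
        fn(*args, **kwargs)       # untimed: absorbs lazy initialisation and cold caches
    start = time.perf_counter()
    result = fn(*args, **kwargs)  # the only call whose wall time is reported
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; a real run would call the detector here instead.
def fake_detect(n_frames):
    return [i * i for i in range(n_frames)]


rows, seconds = time_with_warmup(fake_detect, 472)
print(f"{len(rows)} rows in {seconds:.4f} s")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short durations.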

> Note: this baseline file was hand-curated from the raw stdout of an earlier run. The structured tables follow the script's output format; the narrative sections ("Headline numbers" and "Findings & recommendations") were appended by hand. Future runs generated by `python scripts/bench_detectors.py --markdown` produce the structured part automatically; narrative is optional and should be added by hand only where it adds value.

## Video: long (472 frames)

### retinaface

| device | batch | workers | sec | ms/frame | fps |
|---|---|---|---|---|---|
| cpu | 1 | 0 | 38.30 | 81.2 | 12.3 |
| cpu | 1 | 2 | 71.35 | 151.2 | 6.6 |
| cpu | 4 | 0 | 37.97 | 80.4 | 12.4 |
| cpu | 4 | 2 | 69.77 | 147.8 | 6.8 |
| cpu | 16 | 0 | 158.81 | 336.5 | 3.0 |
| cpu | 16 | 2 | 188.09 | 398.5 | 2.5 |
| mps | 1 | 0 | 17.12 | 36.3 | 27.6 |
| mps | 1 | 2 | 49.97 | 105.9 | 9.4 |
| mps | 4 | 0 | 7.86 | 16.7 | 60.0 |
| mps | 4 | 2 | 40.28 | 85.3 | 11.7 |
| mps | 16 | 0 | 5.16 | 10.9 | 91.4 |
| mps | 16 | 2 | 32.88 | 69.7 | 14.4 |
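
The `ms/frame` and `fps` columns are both derived from `sec` and the frame count. A small sketch of that arithmetic, using the `mps`, batch=16, workers=0 row (the helper name `derived_metrics` is hypothetical):

```python
def derived_metrics(seconds, n_items):
    """Return (milliseconds per item, items per second) for one timed call."""
    ms_per_item = seconds * 1000.0 / n_items
    per_second = n_items / seconds
    return ms_per_item, per_second


# mps, batch=16, workers=0: 5.16 s for 472 frames.
ms, fps = derived_metrics(5.16, 472)
print(f"{ms:.1f} ms/frame, {fps:.1f} fps")
```

Recomputing from the two-decimal `sec` column can differ from the table in the last digit, presumably because the script derives these columns from the unrounded timing.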

## Images: 16 x multi_face.jpg = 80 faces

### retinaface

| device | batch | workers | sec | ms/img | rows |
|---|---|---|---|---|---|
| cpu | 1 | 0 | 3.72 | 232.3 | 80 |
| cpu | 1 | 2 | 36.44 | 2277.4 | 80 |
| cpu | 4 | 0 | 5.31 | 332.1 | 80 |
| cpu | 4 | 2 | 33.10 | 2069.1 | 80 |
| cpu | 16 | 0 | 14.80 | 924.9 | 80 |
| cpu | 16 | 2 | 37.99 | 2374.5 | 80 |
| mps | 1 | 0 | 0.97 | 60.8 | 80 |
| mps | 1 | 2 | 33.49 | 2093.3 | 80 |
| mps | 4 | 0 | 0.65 | 40.6 | 80 |
| mps | 4 | 2 | 28.36 | 1772.6 | 80 |
| mps | 16 | 0 | 0.65 | 40.4 | 80 |
| mps | 16 | 2 | 23.51 | 1469.5 | 80 |

---

## Headline numbers (hand-curated)

| workload | best ms per frame or image | config |
|---|---|---|
| Long video (472 frames) | **10.9 ms** (91.4 fps) | mps, batch=16, workers=0 |
| Image batch (16 imgs) | **40.4 ms/img** | mps, batch=16, workers=0 |
| CPU long video | 80.4 ms/frame (12.4 fps) | cpu, batch=4, workers=0 |
| CPU images | 232.3 ms/img | cpu, batch=1, workers=0 |

**MPS vs CPU at batch=16, workers=0:** 31× faster on long video (10.9 vs 336.5 ms/frame), 23× faster on image batch (40.4 vs 924.9 ms/img).
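
The speedup figures are simple ratios of per-item times. A short check using the batch=16, workers=0 rows from the tables above (the dictionary names are illustrative only):

```python
# ms per item at batch=16, workers=0, copied from the tables above.
CPU_MS = {"video": 336.5, "images": 924.9}
MPS_MS = {"video": 10.9, "images": 40.4}


def speedup(slow_ms, fast_ms):
    """How many times faster the fast configuration is than the slow one."""
    return slow_ms / fast_ms


for workload in ("video", "images"):
    ratio = speedup(CPU_MS[workload], MPS_MS[workload])
    print(f"{workload}: {ratio:.1f}x")
```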

## Findings & recommendations (hand-curated)

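1. **MPS is dramatically faster than CPU at batch ≥ 4.** At batch=16 and workers=0 it is 31× faster on video (10.9 vs 336.5 ms/frame) and 23× faster on images (40.4 vs 924.9 ms/img). At batch=4 the gap is 4.8× on video (16.7 vs 80.4 ms/frame) and 8.2× on images (40.6 vs 332.1 ms/img).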

2. **`num_workers=0` beats `num_workers=2` in every cell.** Worst case: 16 images on MPS at batch=4 → 44× slower with workers=2 (1772.6 vs 40.6 ms/img); MPS at batch=1 is 34× slower (2093.3 vs 60.8 ms/img). Cause: with `OMP_NUM_THREADS=1` (the xgb fix from #288), each DataLoader worker process is single-threaded, so we pay worker start-up overhead with no parallelism win.

3. **Batch size effect inverts on CPU.**
   - MPS video: batch=1→16 gives **3.3× speedup** (36.3 → 10.9 ms/frame)
   - CPU video: batch=1→16 is a **4.1× *slowdown*** (81.2 → 336.5 ms/frame), while batch=4 is flat (80.4 ms/frame); single-threaded matmul plus larger memory traffic makes large batches worse

4. **MPS image throughput saturates at batch=4.** Going to 16 doesn't help (40.6 vs 40.4 ms/img). The dataset is likely too small (16 images, 80 faces) to keep MPS busy beyond 4-image batches.
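
The worker penalty in finding 2 can be recomputed directly from the image table. The sketch below prints, per cell, both the slowdown ratio and the absolute extra seconds that workers=2 adds over workers=0 (names such as `worker_penalty` are hypothetical):

```python
# (device, batch) -> (sec with workers=0, sec with workers=2), from the image table above.
IMAGE_SEC = {
    ("cpu", 1): (3.72, 36.44),
    ("cpu", 4): (5.31, 33.10),
    ("cpu", 16): (14.80, 37.99),
    ("mps", 1): (0.97, 33.49),
    ("mps", 4): (0.65, 28.36),
    ("mps", 16): (0.65, 23.51),
}


def worker_penalty(table):
    """Map each cell to (slowdown ratio, extra seconds) of workers=2 over workers=0."""
    return {cell: (w2 / w0, w2 - w0) for cell, (w0, w2) in table.items()}


penalty = worker_penalty(IMAGE_SEC)
for cell, (ratio, extra) in penalty.items():
    print(cell, f"{ratio:.1f}x slower, +{extra:.1f} s")

# Cell with the largest slowdown ratio.
worst = max(penalty, key=lambda cell: penalty[cell][0])
print("largest ratio:", worst)
```

In this data the extra time stays within roughly 23 to 33 s in every cell, regardless of device or batch size. That pattern is consistent with a mostly fixed per-call cost of starting workers rather than a per-image slowdown, although this run does not isolate the cause.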

### Recommended user defaults

| User config | Best settings |
|---|---|
| Apple Silicon + long video | `device='mps', batch_size=16, num_workers=0` |
| Apple Silicon + image batch | `device='mps', batch_size=4, num_workers=0` |
| CPU-only + video | `device='cpu', batch_size=1 or 4, num_workers=0` |
| CPU-only + images | `device='cpu', batch_size=1, num_workers=0` |
| **All configs** | **`num_workers=0`** (the current default — leave it!) |
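
The table above can be encoded as a small lookup. This is a hypothetical helper, not part of py-feat: it only returns a dictionary of settings and makes no assumption about which py-feat call accepts them.

```python
def recommended_settings(mps_available, workload):
    """Return the settings recommended by the table above.

    workload is "video" or "images". For CPU video the table allows
    batch_size 1 or 4; this sketch arbitrarily picks 4.
    """
    if workload not in ("video", "images"):
        raise ValueError(f"unknown workload: {workload!r}")
    if mps_available:
        device = "mps"
        batch_size = 16 if workload == "video" else 4
    else:
        device = "cpu"
        batch_size = 4 if workload == "video" else 1
    # num_workers=0 won in every measured cell, so it is never varied here.
    return {"device": device, "batch_size": batch_size, "num_workers": 0}


print(recommended_settings(True, "video"))
print(recommended_settings(False, "images"))
```

With PyTorch installed, `torch.backends.mps.is_available()` is one way to obtain the first argument.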

## Notes

- Only the `retinaface` Detector configuration was swept on the workers axis here. The other configurations (img2pose, MPDetector retinaface) were measured separately in earlier sessions without the workers axis. To regenerate everything: `python scripts/bench_detectors.py --workers 0 2 --markdown` (~25 min on M5 MBP).
- The xgb AU model was not benchmarked here; the SVM AU model is held constant across runs so that comparisons stay apples-to-apples.