py-feat detector benchmark: 2026-05-03
Run metadata
Date: 2026-05-03
py-feat version: 0.7.0
Git commit: 437b651
Host: Mac (arm64, 18 CPUs)
Python: 3.13.12
PyTorch: 2.11.0
MPS available: True
OMP_NUM_THREADS: 1
Devices swept: ['cpu', 'mps']
Batch sizes: [1, 4, 16]
DataLoader workers: [0, 2]
Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
Note: this baseline file was hand-curated from the raw stdout of an earlier run. The structured tables match the script’s output format, and the narrative (the “Headline” and “Findings” sections) is appended at the end. Future runs generated by `python scripts/bench_detectors.py --markdown` produce the structured part automatically; the narrative is optional and is added by hand only if it adds value.
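The warmup-then-time protocol described above can be sketched as follows. This is a minimal illustration, not the code in `scripts/bench_detectors.py`; `time_after_warmup` and `fake_detect` are hypothetical names, and the workload is a stand-in for a real detector call.

```python
import time
from typing import Any, Callable


def time_after_warmup(fn: Callable[[], Any]) -> float:
    """Call fn once untimed, then return the wall time of a second call in seconds."""
    fn()  # warmup: absorbs one-time costs such as lazy initialization
    start = time.perf_counter()
    fn()  # the call whose wall time is reported
    return time.perf_counter() - start


def fake_detect() -> int:
    # Hypothetical stand-in workload; a real run would call the detector here.
    return sum(i * i for i in range(100_000))


elapsed = time_after_warmup(fake_detect)
print(f"timed call: {elapsed:.4f} s")
```

On an asynchronous device such as MPS, the timed region would likely also need a synchronization point (for example `torch.mps.synchronize()`) before the clock is read; whether the benchmark script does this is not stated here.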
Video: long (472 frames)
retinaface
| device | batch | workers | sec | ms/frame | fps |
|---|---|---|---|---|---|
| cpu | 1 | 0 | 38.30 | 81.2 | 12.3 |
| cpu | 1 | 2 | 71.35 | 151.2 | 6.6 |
| cpu | 4 | 0 | 37.97 | 80.4 | 12.4 |
| cpu | 4 | 2 | 69.77 | 147.8 | 6.8 |
| cpu | 16 | 0 | 158.81 | 336.5 | 3.0 |
| cpu | 16 | 2 | 188.09 | 398.5 | 2.5 |
| mps | 1 | 0 | 17.12 | 36.3 | 27.6 |
| mps | 1 | 2 | 49.97 | 105.9 | 9.4 |
| mps | 4 | 0 | 7.86 | 16.7 | 60.0 |
| mps | 4 | 2 | 40.28 | 85.3 | 11.7 |
| mps | 16 | 0 | 5.16 | 10.9 | 91.4 |
| mps | 16 | 2 | 32.88 | 69.7 | 14.4 |
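The `ms/frame` and `fps` columns follow from the wall time and the frame count. A minimal sketch of that arithmetic, assuming both are derived directly from `sec` and the 472-frame count; `per_frame_metrics` is a hypothetical helper, not part of the benchmark script. Because the table's `sec` values are already rounded, recomputed numbers can differ from the table in the last digit.

```python
def per_frame_metrics(wall_sec: float, n_frames: int) -> tuple[float, float]:
    """Return (milliseconds per frame, frames per second) for one timed run."""
    ms_per_frame = wall_sec * 1000.0 / n_frames
    fps = n_frames / wall_sec
    return ms_per_frame, fps


# mps, batch=16, workers=0 row from the table: 5.16 s over 472 frames
ms, fps = per_frame_metrics(5.16, 472)
print(f"{ms:.1f} ms/frame, {fps:.1f} fps")
```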
Images: 16 x multi_face.jpg = 80 faces
retinaface
| device | batch | workers | sec | ms/img | rows |
|---|---|---|---|---|---|
| cpu | 1 | 0 | 3.72 | 232.3 | 80 |
| cpu | 1 | 2 | 36.44 | 2277.4 | 80 |
| cpu | 4 | 0 | 5.31 | 332.1 | 80 |
| cpu | 4 | 2 | 33.10 | 2069.1 | 80 |
| cpu | 16 | 0 | 14.80 | 924.9 | 80 |
| cpu | 16 | 2 | 37.99 | 2374.5 | 80 |
| mps | 1 | 0 | 0.97 | 60.8 | 80 |
| mps | 1 | 2 | 33.49 | 2093.3 | 80 |
| mps | 4 | 0 | 0.65 | 40.6 | 80 |
| mps | 4 | 2 | 28.36 | 1772.6 | 80 |
| mps | 16 | 0 | 0.65 | 40.4 | 80 |
| mps | 16 | 2 | 23.51 | 1469.5 | 80 |
Headline numbers (hand-curated)
| Workload | Best time per item | Config |
|---|---|---|
| Long video (472 frames) | 10.9 ms/frame (91.4 fps) | mps, batch=16, workers=0 |
| Image batch (16 imgs) | 40.4 ms/img | mps, batch=16, workers=0 |
| CPU long video | 80.4 ms/frame (12.4 fps) | cpu, batch=4, workers=0 |
| CPU images | 232.3 ms/img | cpu, batch=1, workers=0 |
MPS vs CPU at batch=16, workers=0: 31× faster on long video (10.9 vs 336.5 ms/frame), 23× faster on the image batch (40.4 vs 924.9 ms/img).
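The two speedup factors are plain ratios of the per-item times in the tables above. A small sketch of the calculation, using the batch=16, workers=0 rows; `speedup` is a hypothetical helper name.

```python
def speedup(slow_ms: float, fast_ms: float) -> float:
    """How many times faster the fast configuration is than the slow one."""
    return slow_ms / fast_ms


# batch=16, workers=0 rows from the retinaface tables
video = speedup(slow_ms=336.5, fast_ms=10.9)   # cpu vs mps, ms/frame
images = speedup(slow_ms=924.9, fast_ms=40.4)  # cpu vs mps, ms/img
print(f"video: {video:.0f}x, images: {images:.0f}x")  # video: 31x, images: 23x
```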
Findings & recommendations (hand-curated)
1. MPS is much faster than CPU at batch ≥ 4 (workers=0). At batch=4 it is 4.8× faster on video (16.7 vs 80.4 ms/frame) and 8.2× on images (40.6 vs 332.1 ms/img); at batch=16 it is 31× faster on video (10.9 vs 336.5 ms/frame) and 23× on images (40.4 vs 924.9 ms/img).
2. `num_workers=0` beats `num_workers=2` in every cell. Worst case: 16 images on MPS at batch=4, where workers=2 is about 44× slower (1772.6 vs 40.6 ms/img); at batch=1 on MPS it is about 34× slower (2093.3 vs 60.8 ms/img). Cause: with `OMP_NUM_THREADS=1` (the xgb fix from #288), each forked DataLoader worker is single-threaded, so we pay fork overhead with no parallelism win.
3. Batch size effect inverts on CPU.
   - MPS video: batch=1→16 gives a 3.3× speedup (36.3 → 10.9 ms/frame).
   - CPU video: batch=1→16 is a 4× slowdown (81.2 → 336.5 ms/frame): single-threaded matmul plus larger memory traffic makes bigger batches worse.
4. MPS image throughput saturates at batch=4: going to 16 doesn’t help (about 40 ms/img at both). With only 16 images (80 faces), the dataset is too small to keep MPS busy beyond 4-image batches.
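The workers=2 penalty can be recomputed per cell from the image table. The sketch below copies the `ms/img` values for both worker settings and ranks the cells by slowdown; the `IMAGE_MS` dictionary layout is only an illustrative choice, not the benchmark script's data format.

```python
# (device, batch) -> (ms/img with workers=0, ms/img with workers=2), from the image table
IMAGE_MS = {
    ("cpu", 1): (232.3, 2277.4),
    ("cpu", 4): (332.1, 2069.1),
    ("cpu", 16): (924.9, 2374.5),
    ("mps", 1): (60.8, 2093.3),
    ("mps", 4): (40.6, 1772.6),
    ("mps", 16): (40.4, 1469.5),
}

# Slowdown factor of workers=2 relative to workers=0 for each cell
penalty = {cell: w2 / w0 for cell, (w0, w2) in IMAGE_MS.items()}

# Print cells from largest to smallest penalty
for (device, batch), ratio in sorted(penalty.items(), key=lambda kv: -kv[1]):
    print(f"{device} batch={batch}: workers=2 is {ratio:.1f}x slower")

worst_cell = max(penalty, key=penalty.get)
```

On these numbers every ratio is above 1, and the largest is mps at batch=4 (about 44×), with mps at batch=1 near 34×.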
Recommended user defaults
| User config | Best settings |
|---|---|
| Apple Silicon + long video | |
| Apple Silicon + image batch | |
| CPU-only + video | |
| CPU-only + images | |
| All configs | |
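One way to derive defaults like these from a sweep is to pick the fastest measured row per device. The sketch below does that for the long-video rows; `Row` and `best_per_device` are hypothetical names, not part of py-feat or the benchmark script, and the result is only as general as this single machine's measurements.

```python
from typing import NamedTuple


class Row(NamedTuple):
    device: str
    batch: int
    workers: int
    ms_per_frame: float


# Long-video retinaface rows copied from the video table
VIDEO_ROWS = [
    Row("cpu", 1, 0, 81.2), Row("cpu", 1, 2, 151.2),
    Row("cpu", 4, 0, 80.4), Row("cpu", 4, 2, 147.8),
    Row("cpu", 16, 0, 336.5), Row("cpu", 16, 2, 398.5),
    Row("mps", 1, 0, 36.3), Row("mps", 1, 2, 105.9),
    Row("mps", 4, 0, 16.7), Row("mps", 4, 2, 85.3),
    Row("mps", 16, 0, 10.9), Row("mps", 16, 2, 69.7),
]


def best_per_device(rows: list[Row]) -> dict[str, Row]:
    """Return the row with the lowest ms/frame for each device."""
    best: dict[str, Row] = {}
    for row in rows:
        current = best.get(row.device)
        if current is None or row.ms_per_frame < current.ms_per_frame:
            best[row.device] = row
    return best


best = best_per_device(VIDEO_ROWS)
for device, row in best.items():
    print(f"{device}: batch={row.batch}, workers={row.workers} ({row.ms_per_frame} ms/frame)")
```

For the video rows this selects cpu at batch=4, workers=0 and mps at batch=16, workers=0, matching the headline numbers.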
Notes
- Only the `retinaface` Detector configuration was swept on the workers axis here. The other configurations (img2pose, MPDetector retinaface) were measured separately in earlier sessions without the workers axis. To regenerate everything: `python scripts/bench_detectors.py --workers 0 2 --markdown` (~25 min on M5 MBP).
- xgb AU was not benchmarked here; SVM is the apples-to-apples constant.