py-feat detector benchmark — 2026-05-03

Run metadata

Date: 2026-05-03
py-feat version: 0.7.0
Git commit: 437b651
Host: Mac (arm64, 18 CPUs)
Python: 3.13.12
PyTorch: 2.11.0
MPS available: True
OMP_NUM_THREADS: 1
Devices swept: ['cpu', 'mps']
Batch sizes: [1, 4, 16]
DataLoader workers: [0, 2]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
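The warmup-then-time pattern described above can be sketched as follows. This is a minimal illustration, not the code in `scripts/bench_detectors.py`; the helper name `time_with_warmup` is hypothetical.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def time_with_warmup(fn: Callable[[], T]) -> tuple[T, float]:
    """Call fn once untimed (warmup), then once timed.

    Returns the result of the timed call and its wall time in seconds.
    """
    fn()  # warmup call: absorbs one-off costs such as lazy initialisation
    start = time.perf_counter()
    result = fn()  # only this call is measured
    elapsed = time.perf_counter() - start
    return result, elapsed


# Stand-in workload; a real benchmark would wrap a detector call here.
result, seconds = time_with_warmup(lambda: sum(range(100_000)))
print(f"timed call took {seconds * 1000:.2f} ms")
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock intended for measuring short intervals.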

Note: this baseline file was hand-curated from the raw stdout of an earlier run. The structured tables match the script’s output format, and the narrative (“Headline” + “Findings” sections) is appended at the end. Future runs generated by `python scripts/bench_detectors.py --markdown` produce the structured part automatically; the narrative is optional and is added by hand only if it adds value.

Video: long (472 frames)

retinaface

device	batch	workers	sec	ms/frame	fps
cpu	1	0	38.30	81.2	12.3
cpu	1	2	71.35	151.2	6.6
cpu	4	0	37.97	80.4	12.4
cpu	4	2	69.77	147.8	6.8
cpu	16	0	158.81	336.5	3.0
cpu	16	2	188.09	398.5	2.5
mps	1	0	17.12	36.3	27.6
mps	1	2	49.97	105.9	9.4
mps	4	0	7.86	16.7	60.0
mps	4	2	40.28	85.3	11.7
mps	16	0	5.16	10.9	91.4
mps	16	2	32.88	69.7	14.4
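The `ms/frame` and `fps` columns appear to be derived from the wall time and the frame count. The sketch below assumes exactly that (time divided by 472 frames, and its inverse); the function name is hypothetical. Because the `sec` column is itself rounded, recomputed values can differ from the table in the last digit.

```python
def per_frame_stats(seconds: float, n_frames: int) -> tuple[float, float]:
    """Return (milliseconds per frame, frames per second) for one timed run."""
    ms_per_frame = seconds / n_frames * 1000.0
    fps = n_frames / seconds
    return ms_per_frame, fps


# mps, batch=16, workers=0 row: 5.16 s for 472 frames
ms, fps = per_frame_stats(5.16, 472)
print(f"{ms:.1f} ms/frame, {fps:.1f} fps")
```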

Images: 16 x multi_face.jpg = 80 faces

retinaface

device	batch	workers	sec	ms/img	rows
cpu	1	0	3.72	232.3	80
cpu	1	2	36.44	2277.4	80
cpu	4	0	5.31	332.1	80
cpu	4	2	33.10	2069.1	80
cpu	16	0	14.80	924.9	80
cpu	16	2	37.99	2374.5	80
mps	1	0	0.97	60.8	80
mps	1	2	33.49	2093.3	80
mps	4	0	0.65	40.6	80
mps	4	2	28.36	1772.6	80
mps	16	0	0.65	40.4	80
mps	16	2	23.51	1469.5	80

Headline numbers (hand-curated)

workload	best time	config
Long video (472 frames)	10.9 ms (91.4 fps)	mps, batch=16, workers=0
Image batch (16 imgs)	40.4 ms/img	mps, batch=16, workers=0
CPU long video	80.4 ms/frame (12.4 fps)	cpu, batch=4, workers=0
CPU images	232.3 ms/img	cpu, batch=1, workers=0

MPS vs CPU at batch=16: 31× faster on long video, 23× faster on image batch.
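The two ratios quoted above follow directly from the batch=16, workers=0 rows of the tables. A short check of the arithmetic, using values copied from those rows:

```python
# Per-item times at batch=16, workers=0, copied from the tables above.
cpu_video_ms, mps_video_ms = 336.5, 10.9  # ms/frame, long video
cpu_image_ms, mps_image_ms = 924.9, 40.4  # ms/img, image batch

video_speedup = cpu_video_ms / mps_video_ms
image_speedup = cpu_image_ms / mps_image_ms

print(f"video: {video_speedup:.1f}x, images: {image_speedup:.1f}x")
# video: 30.9x, images: 22.9x
```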

Findings & recommendations (hand-curated)

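MPS is dramatically faster than CPU at batch ≥ 4. 31× faster on video at batch=16 (10.9 ms vs 336.5 ms), 23× on images (40.4 ms vs 924.9 ms). These same-batch ratios are inflated because CPU itself degrades at batch=16; comparing each device's best configuration, MPS is about 7.4× faster on video (10.9 ms vs 80.4 ms) and about 5.8× faster on images (40.4 ms vs 232.3 ms).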
num_workers=0 beats num_workers=2 in every cell. Worst case: 16 images on MPS at batch=4 → about 44× slower with workers=2 (40.6 → 1772.6 ms/img). Cause: with OMP_NUM_THREADS=1 (the xgb fix from #288), each DataLoader worker process is single-threaded, so we pay worker startup and inter-process overhead with no parallelism win.
Batch size effect inverts on CPU.
- MPS video: batch=1→16 gives 3.3× speedup (36.3 → 10.9 ms/frame)
- CPU video: batch=1→16 is a 4.1× slowdown (81.2 → 336.5 ms/frame), while batch=4 is flat (80.4 ms/frame). Single-threaded matmul plus larger memory traffic makes the larger batch slower.
MPS image throughput saturates at batch=4: going to 16 doesn’t help (40.6 vs 40.4 ms/img). The dataset is too small (16 images, 80 faces) to keep MPS busy beyond 4-image batches.
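The workers=0 versus workers=2 comparison can be recomputed per cell from the image table. The sketch below copies the `sec` column by hand into a dictionary (the variable names are illustrative) and prints the slowdown for each device and batch size, largest first.

```python
# (device, batch) -> (sec with workers=0, sec with workers=2), from the image table
image_sec = {
    ("cpu", 1): (3.72, 36.44),
    ("cpu", 4): (5.31, 33.10),
    ("cpu", 16): (14.80, 37.99),
    ("mps", 1): (0.97, 33.49),
    ("mps", 4): (0.65, 28.36),
    ("mps", 16): (0.65, 23.51),
}

# Ratio > 1 means workers=2 was slower than workers=0 for that cell.
slowdown = {cfg: w2 / w0 for cfg, (w0, w2) in image_sec.items()}

ranked = sorted(slowdown.items(), key=lambda kv: kv[1], reverse=True)
for (device, batch), ratio in ranked:
    print(f"{device:>3} batch={batch:<2} workers=2 is {ratio:.1f}x slower")
```

Every ratio is above 1, and the penalty is largest where the workers=0 run is shortest, which is consistent with a roughly fixed per-run worker overhead.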

Recommended user defaults

User config	Best settings
Apple Silicon + long video	`device='mps', batch_size=16, num_workers=0`
Apple Silicon + image batch	`device='mps', batch_size=4, num_workers=0`
CPU-only + video	`device='cpu', batch_size=1 or 4, num_workers=0`
CPU-only + images	`device='cpu', batch_size=1, num_workers=0`
All configs	`num_workers=0` (the current default — leave it!)
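The table can also be expressed as a small lookup. The helper below is a hypothetical sketch and not part of py-feat: it only returns the recommended settings as a dictionary. For CPU-only video the table allows `batch_size` 1 or 4; the sketch picks 1 for simplicity.

```python
def recommended_settings(mps_available: bool, workload: str) -> dict:
    """Return the settings from the table above for a given host and workload.

    workload must be "video" or "images". This helper is illustrative only.
    """
    if workload not in ("video", "images"):
        raise ValueError(f"unknown workload: {workload!r}")
    if mps_available:
        # Apple Silicon: larger batches pay off on video; images saturate at 4.
        batch_size = 16 if workload == "video" else 4
        return {"device": "mps", "batch_size": batch_size, "num_workers": 0}
    # CPU-only: small batches; the table lists 1 or 4 for video, 1 for images.
    return {"device": "cpu", "batch_size": 1, "num_workers": 0}


print(recommended_settings(mps_available=True, workload="video"))
```

In practice the `mps_available` flag could come from `torch.backends.mps.is_available()`.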

Notes

Only the retinaface Detector configuration was swept on the workers axis here. The other configurations (img2pose, MPDetector retinaface) were measured separately in earlier sessions without the workers axis. To regenerate everything: `python scripts/bench_detectors.py --workers 0 2 --markdown` (~25 min on M5 MBP).
The xgb AU model was not benchmarked here; the SVM AU model is the apples-to-apples constant across these runs.
