py-feat detector benchmark — 2026-05-04 02:03:51

Run metadata

Date: 2026-05-04 02:03:51
py-feat version: 0.7.0
Git commit: f44ccb1
Host: vpn-two-factor-general-228-129-185.dartmouth.edu (arm64, 18 CPUs)
Python: 3.13.12
PyTorch: 2.11.0
GPU: MPS available
OMP_NUM_THREADS: 1
Devices swept: ['cpu']
Batch sizes: [1, 4, 16]
DataLoader workers: [0]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
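
The timing protocol above can be sketched as follows. This is a minimal illustration, not the benchmark's actual harness: `time_call`, `metrics_from_wall_time`, and the zero-argument workload `fn` are hypothetical names. The conversion to ms/frame and fps simply divides the wall time by the frame count, which matches the columns in the tables below.

```python
import time


def time_call(fn, n_frames, warmup=1):
    """Run `fn` untimed `warmup` times, then time a single call.

    `fn` is a hypothetical zero-argument callable standing in for one
    full detection pass over a video; it is not a py-feat API.
    """
    for _ in range(warmup):
        fn()  # untimed warmup (model load, caches, lazy init)

    start = time.perf_counter()
    fn()
    sec = time.perf_counter() - start

    return {
        "sec": sec,
        "ms_per_frame": 1000.0 * sec / n_frames,
        "fps": n_frames / sec,
    }


def metrics_from_wall_time(sec, n_frames):
    """Derive the (ms/frame, fps) table columns from a wall time."""
    return round(1000.0 * sec / n_frames, 1), round(n_frames / sec, 1)


# Reported value: img2pose, short video (72 frames), batch 16.
print(metrics_from_wall_time(95.05, 72))  # (1320.1, 0.8)
```

Reporting only the timed call keeps one-off costs such as weight loading out of the per-frame figures.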

Video: short (72 frames)

img2pose

device	batch	sec	ms/frame	fps
cpu	1	28.37	394.1	2.5
cpu	4	28.10	390.3	2.6
cpu	16	95.05	1320.1	0.8

retinaface

device	batch	sec	ms/frame	fps
cpu	1	12.38	172.0	5.8
cpu	4	12.20	169.4	5.9
cpu	16	59.43	825.4	1.2

MPDetector retinaface

device	batch	sec	ms/frame	fps
cpu	1	13.95	193.8	5.2
cpu	4	14.00	194.5	5.1
cpu	16	63.00	875.0	1.1

Video: long (472 frames)

img2pose

device	batch	sec	ms/frame	fps
cpu	1	184.91	391.8	2.6
cpu	4	182.17	386.0	2.6
cpu	16	672.85	1425.5	0.7

retinaface

device	batch	sec	ms/frame	fps
cpu	1	82.55	174.9	5.7
cpu	4	79.91	169.3	5.9
cpu	16	427.30	905.3	1.1

MPDetector retinaface

device	batch	sec	ms/frame	fps
cpu	1	93.58	198.3	5.0
cpu	4	92.36	195.7	5.1
cpu	16	446.98	947.0	1.1

Images: 16 x multi_face.jpg = 80 faces

img2pose

device	batch	sec	ms/img	rows
cpu	1	12.17	760.6	80
cpu	4	50.27	3141.7	80
cpu	16	48.63	3039.3	80

retinaface

device	batch	sec	ms/img	rows
cpu	1	10.99	686.9	80
cpu	4	49.59	3099.2	80
cpu	16	50.62	3163.9	80

MPDetector retinaface

device	batch	sec	ms/img	rows
cpu	1	13.29	830.3	80
cpu	4	52.06	3253.8	80
cpu	16	52.99	3312.0	80

Notes (hand-curated)

Scope: full-pipeline (xgb AU + resmasknet emotion + arcface identity) on M5 MBP CPU. Companion to 2026-05-03-d71c0d7.md (M5 MPS) and 2026-05-03-864962c.md (Linux + RTX 3090 + svm-only AU).

Headline finding: M5 CPU does not benefit from batching.

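The clearest pattern in this file: on M5 CPU, no config gains meaningfully from batching. In the video benchmarks, batch 4 is within a few percent of batch 1 and batch 16 is roughly 3-5× slower per frame; in the image benchmark the regression (about 4×) already appears at batch 4. Example (short video, retinaface):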

Batch	ms/frame
1	172.0
4	169.4
16	825.4 (4.8× slower than batch 1)

That’s the opposite of the Linux CPU pattern in 2026-05-03-864962c.md, where batching speeds things up (retinaface short video: 314 → 115 → 98 ms/frame at batch 1/4/16).

M5 CPU vs Linux CPU (ms/frame by batch size, short video, retinaface):

	M5 CPU (this file)	Linux CPU (svm-only)	Ratio
batch=1	172 ms/frame	314 ms/frame	M5 1.8× faster
batch=4	169 ms/frame	115 ms/frame	Linux 1.5× faster
batch=16	825 ms/frame	98 ms/frame	Linux 8.4× faster
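
The Ratio column can be reproduced from the two ms/frame columns. The snippet below is only an arithmetic check on the rounded values in the table; `speed_ratio` is a hypothetical helper, not part of py-feat or the benchmark script.

```python
# ms/frame values copied from the table above (short video, retinaface).
m5_cpu = {1: 172.0, 4: 169.0, 16: 825.0}
linux_cpu = {1: 314.0, 4: 115.0, 16: 98.0}


def speed_ratio(m5_ms, linux_ms):
    """Return (winner, factor): which host is faster and by how much.

    Lower ms/frame is faster, so the factor is slower / faster.
    """
    if m5_ms <= linux_ms:
        return "M5", round(linux_ms / m5_ms, 1)
    return "Linux", round(m5_ms / linux_ms, 1)


ratios = {b: speed_ratio(m5_cpu[b], linux_cpu[b]) for b in (1, 4, 16)}
for batch, (winner, factor) in ratios.items():
    print(f"batch={batch}: {winner} {factor}x faster")
```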

At batch=1 (the package default and our current recommendation for CPU users), M5 is ~1.8× faster per frame than the Linux server CPU, which is expected given M5’s per-core throughput on ARM SIMD. At larger batches the ordering flips. A likely explanation is that the Linux CPU’s MKL/OpenBLAS kernels exploit the larger L3 cache and AVX much better than Apple’s Accelerate does for single-threaded inference (this run used OMP_NUM_THREADS=1), though this file does not isolate the cause. The Linux file’s svm-AU configs differ slightly from this file’s xgb + emotion + identity stack, so direct cell-by-cell comparisons drift by ~5-10%; the qualitative pattern (Linux scales with batch, M5 doesn’t) is robust.

Implication for users:

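Apple Silicon (M-series) on CPU: batch_size=1 is the right default, and it is already what the package defaults to. Larger batches either give no meaningful gain (batch 4 on video) or sharply regress per-frame throughput (batch 16 on video, batch 4 and above on images).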
Apple Silicon on MPS: batch_size=16 wins (see 2026-05-03-d71c0d7.md — 18.8 ms/frame at b=16 vs 62.1 at b=1).
Linux CPU: batch_size=16 wins by a wide margin.
Linux + CUDA: batch_size=4 is the sweet spot on small inputs (see 2026-05-03-864962c.md — image bench shows non-monotonic regression at b=16).
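
The recommendations above can be captured as a small lookup table. This is a sketch only: the `(platform, device)` keys and the `recommended_batch_size` helper are illustrative names, not a py-feat API, and the values are the ones stated in these notes for the benchmarked hardware.

```python
# Batch sizes recommended in the notes above, keyed by (platform, device).
# The key strings are hypothetical labels chosen for this example.
RECOMMENDED_BATCH_SIZE = {
    ("apple_silicon", "cpu"): 1,
    ("apple_silicon", "mps"): 16,
    ("linux", "cpu"): 16,
    ("linux", "cuda"): 4,
}


def recommended_batch_size(platform, device, default=1):
    """Look up a batch size; fall back to 1 for unmeasured setups."""
    return RECOMMENDED_BATCH_SIZE.get((platform, device), default)


print(recommended_batch_size("apple_silicon", "cpu"))
print(recommended_batch_size("linux", "cuda"))
```

The fallback of 1 mirrors the package default mentioned above; any configuration not covered by these benchmark files would need its own measurement.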

MPS vs CPU on M5 (full pipeline, retinaface long video, batch=16, this file vs d71c0d7.md):

M5 MPS: 18.8 ms/frame (53.3 fps)
M5 CPU: 905 ms/frame (1.1 fps) — 48× slower

If a user has MPS available, they shouldn’t be running on CPU. The Detector(device='auto') default should handle this correctly, but it is worth flagging in the docs.
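
For the docs, the preference order could be illustrated with a small helper. This sketch does not show how Detector(device='auto') is implemented; `pick_device` is a hypothetical function, and the ordering (CUDA, then MPS, then CPU) is an assumption based on the recommendations in these notes.

```python
def pick_device(cuda_available, mps_available):
    """Prefer an accelerator when one is present, else fall back to CPU.

    Hypothetical helper: the ordering is an assumption, not py-feat's
    documented behaviour.
    """
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# An M-series Mac with MPS available and no CUDA.
print(pick_device(cuda_available=False, mps_available=True))  # mps
```

With PyTorch installed, the two flags would typically come from `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.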
