py-feat detector benchmark - 2026-05-04 02:03:51
Run metadata
Date: 2026-05-04 02:03:51
py-feat version: 0.7.0
Git commit: f44ccb1
Host: vpn-two-factor-general-228-129-185.dartmouth.edu (arm64, 18 CPUs)
Python: 3.13.12
PyTorch: 2.11.0
GPU: MPS available
OMP_NUM_THREADS: 1
Devices swept: ['cpu']
Batch sizes: [1, 4, 16]
DataLoader workers: [0]
Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
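The warmup-then-time protocol described above can be sketched as follows. This is a minimal illustration, not the benchmark script itself: `time_call` and `per_frame_stats` are hypothetical helper names, and the lambda passed in the usage lines is a stand-in for a real detector call.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object]) -> float:
    """Run fn once untimed (warmup), then return the wall time of a second call in seconds."""
    fn()  # warmup: absorbs one-time costs such as lazy model loading
    start = time.perf_counter()
    fn()  # timed call
    return time.perf_counter() - start


def per_frame_stats(seconds: float, n_frames: int) -> tuple[float, float]:
    """Convert a total wall time into (ms/frame, fps)."""
    ms_per_frame = 1000.0 * seconds / n_frames
    fps = n_frames / seconds
    return ms_per_frame, fps


# Stand-in workload; a real run would call the detector on a video here.
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(f"timed call took {elapsed:.4f} s")

# Derived columns for a 72-frame video that took 12.38 s.
ms, fps = per_frame_stats(12.38, 72)
print(f"{ms:.1f} ms/frame, {fps:.1f} fps")
```

Because the tables report `sec` rounded to two decimals, recomputing ms/frame from the rounded value can differ from the tabulated figure in the last digit.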
Video: short (72 frames)
img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 28.37 | 394.1 | 2.5 |
| cpu | 4 | 28.10 | 390.3 | 2.6 |
| cpu | 16 | 95.05 | 1320.1 | 0.8 |

retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 12.38 | 172.0 | 5.8 |
| cpu | 4 | 12.20 | 169.4 | 5.9 |
| cpu | 16 | 59.43 | 825.4 | 1.2 |

MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 13.95 | 193.8 | 5.2 |
| cpu | 4 | 14.00 | 194.5 | 5.1 |
| cpu | 16 | 63.00 | 875.0 | 1.1 |

Video: long (472 frames)
img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 184.91 | 391.8 | 2.6 |
| cpu | 4 | 182.17 | 386.0 | 2.6 |
| cpu | 16 | 672.85 | 1425.5 | 0.7 |

retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 82.55 | 174.9 | 5.7 |
| cpu | 4 | 79.91 | 169.3 | 5.9 |
| cpu | 16 | 427.30 | 905.3 | 1.1 |

MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cpu | 1 | 93.58 | 198.3 | 5.0 |
| cpu | 4 | 92.36 | 195.7 | 5.1 |
| cpu | 16 | 446.98 | 947.0 | 1.1 |

Images: 16 x multi_face.jpg = 80 faces
img2pose

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cpu | 1 | 12.17 | 760.6 | 80 |
| cpu | 4 | 50.27 | 3141.7 | 80 |
| cpu | 16 | 48.63 | 3039.3 | 80 |

retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cpu | 1 | 10.99 | 686.9 | 80 |
| cpu | 4 | 49.59 | 3099.2 | 80 |
| cpu | 16 | 50.62 | 3163.9 | 80 |

MPDetector retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cpu | 1 | 13.29 | 830.3 | 80 |
| cpu | 4 | 52.06 | 3253.8 | 80 |
| cpu | 16 | 52.99 | 3312.0 | 80 |

Notes (hand-curated)
Scope: full-pipeline (xgb AU + resmasknet emotion + arcface identity) on M5 MBP CPU. Companion to 2026-05-03-d71c0d7.md (M5 MPS) and 2026-05-03-864962c.md (Linux + RTX 3090 + svm-only AU).
Headline finding: M5 CPU does not benefit from batching.
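The clearest pattern in this file: no config gets meaningfully faster per frame as batch size increases beyond 1 on M5 CPU. On video, batch 4 is within ~3% of batch 1 and batch 16 is roughly 3-5× slower; on images, batches 4 and 16 are both roughly 4-5× slower than batch 1. Example (short video, retinaface):

| Batch | ms/frame |
|---|---|
| 1 | 172.0 |
| 4 | 169.4 |
| 16 | 825.4 (4.8× slower than batch 1) |
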
That’s the opposite of the Linux CPU pattern in 2026-05-03-864962c.md, where batching speeds things up (retinaface short video: 314 → 115 → 98 ms/frame at batch 1/4/16).
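The slowdown figures can be recomputed directly from the retinaface short-video numbers in this file. The snippet below is only a convenience sketch; `slowdown_vs_baseline` is a hypothetical helper, not part of py-feat.

```python
# ms/frame for retinaface on the short video, copied from the table above.
ms_per_frame = {1: 172.0, 4: 169.4, 16: 825.4}


def slowdown_vs_baseline(timings: dict[int, float], baseline_batch: int = 1) -> dict[int, float]:
    """Return each batch size's per-frame cost relative to the baseline batch.

    Values above 1.0 mean the batch is slower per frame than the baseline.
    """
    base = timings[baseline_batch]
    return {batch: ms / base for batch, ms in timings.items()}


ratios = slowdown_vs_baseline(ms_per_frame)
for batch, ratio in ratios.items():
    print(f"batch={batch}: {ratio:.2f}x")
# batch=1: 1.00x
# batch=4: 0.98x
# batch=16: 4.80x
```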
M5 CPU vs Linux CPU (ms/frame by batch size, short video, retinaface):

| | M5 CPU (this file) | Linux CPU (svm-only) | Ratio |
|---|---|---|---|
| batch=1 | 172 ms/frame | 314 ms/frame | M5 1.8× faster |
| batch=4 | 169 ms/frame | 115 ms/frame | Linux 1.5× faster |
| batch=16 | 825 ms/frame | 98 ms/frame | Linux 8.4× faster |

At batch=1 (the package default and our current recommendation for CPU users), M5 is ~1.8× faster per frame than the Linux server CPU, which is consistent with M5's strong per-core throughput on ARM SIMD. At larger batches the ordering flips. A likely explanation, not verified here, is that the Linux CPU's MKL/OpenBLAS kernels exploit its larger L3 cache and AVX units better than Apple's Accelerate does for single-threaded inference. The Linux file's svm-only AU config differs from this file's xgb + emotion + identity stack, so cell-by-cell comparisons drift by ~5-10%; the qualitative pattern (Linux scales with batch size, M5 does not) is robust.
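The ratio column and the batch-scaling contrast can be reproduced from the rounded ms/frame values quoted in these notes. This is an illustrative sketch; `compare` and `batch_scaling` are hypothetical names introduced here.

```python
# ms/frame, retinaface, short video (rounded values from the comparison above).
m5_cpu = {1: 172, 4: 169, 16: 825}
linux_cpu = {1: 314, 4: 115, 16: 98}


def compare(m5_ms: float, linux_ms: float) -> tuple[str, float]:
    """Return the faster platform and how many times faster it is."""
    if m5_ms <= linux_ms:
        return "M5", linux_ms / m5_ms
    return "Linux", m5_ms / linux_ms


for batch in sorted(m5_cpu):
    winner, ratio = compare(m5_cpu[batch], linux_cpu[batch])
    print(f"batch={batch}: {winner} {ratio:.1f}x faster")
# batch=1: M5 1.8x faster
# batch=4: Linux 1.5x faster
# batch=16: Linux 8.4x faster


def batch_scaling(timings: dict[int, float]) -> float:
    """Speedup from the smallest to the largest batch size (>1 means batching helps)."""
    return timings[min(timings)] / timings[max(timings)]


print(f"Linux batch scaling: {batch_scaling(linux_cpu):.2f}x")  # 3.20x
print(f"M5 batch scaling: {batch_scaling(m5_cpu):.2f}x")        # 0.21x
```

A scaling value below 1 means the largest batch is slower per frame than the smallest, which is the M5 CPU case.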
Implication for users:
Apple Silicon (M-series) on CPU: batch_size=1 is the right default, and it is already what the package defaults to. On video, batch 4 gives no gain and batch 16 is several times slower per frame; on images, any batch above 1 is roughly 4× slower.
Apple Silicon on MPS: batch_size=16 wins (see 2026-05-03-d71c0d7.md: 18.8 ms/frame at b=16 vs 62.1 at b=1).
Linux CPU: batch_size=16 wins by a wide margin.
Linux + CUDA: batch_size=4 is the sweet spot on small inputs (see 2026-05-03-864962c.md: the image bench shows a non-monotonic regression at b=16).
MPS vs CPU on M5 (full pipeline, retinaface long video, batch=16, this file vs d71c0d7.md):
M5 MPS: 18.8 ms/frame (53.3 fps)
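M5 CPU: 905 ms/frame (1.1 fps), 48× slower
Even at the CPU's best batch size for this video (batch=4, 169.3 ms/frame), CPU is about 9× slower than MPS. If a user has MPS available, they shouldn't be running on CPU. The Detector(device='auto') default should handle this correctly, but it is worth flagging in the docs.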
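For users who want to check which accelerator is visible before running, a sketch like the following may help. It is not the implementation behind `device='auto'`; `pick_device` and `detect_device` are hypothetical helpers, and the CUDA > MPS > CPU preference order is an assumption. The only library calls used are PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device string, preferring accelerators over CPU (assumed order)."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Query PyTorch for available backends; fall back to CPU if torch is missing."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())


device = detect_device()
if device == "cpu":
    print("No accelerator detected; expect CPU-level throughput.")
else:
    print(f"Using {device}")
```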