py-feat detector benchmark, 2026-05-03 23:09:29
Run metadata
Date: 2026-05-03 23:09:29
py-feat version: 0.7.0
Git commit: d71c0d7
Host: vpn-two-factor-general-228-129-185.dartmouth.edu (arm64, 18 CPUs)
Python: 3.13.12
PyTorch: 2.11.0
GPU: MPS available
OMP_NUM_THREADS: 1
Devices swept: ['mps']
Batch sizes: [1, 4, 16]
DataLoader workers: [0]
Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
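The warmup-then-time protocol described above can be sketched as follows. This is a minimal illustration, not the benchmark script itself: `time_call` and the stand-in `workload` are hypothetical names, and a real run would wrap the detector call in the closure instead.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def time_call(fn: Callable[[], T], n_warmup: int = 1) -> Tuple[T, float]:
    """Run fn untimed n_warmup times, then time one further call.

    Returns (result of the timed call, wall-clock seconds).
    """
    for _ in range(n_warmup):
        fn()  # untimed: absorbs lazy initialization and cache fills
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return result, elapsed


# Hypothetical stand-in workload; replace with the call under test.
calls = []


def workload() -> int:
    calls.append(1)
    return sum(range(100_000))


result, seconds = time_call(workload)
print(f"{seconds:.4f} s after {len(calls) - 1} warmup call(s)")
```

Only the second call contributes to the reported time, so one-off start-up costs do not inflate the per-frame figures.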
Video: short (72 frames)
img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 8.17 | 113.5 | 8.8 |
| mps | 4 | 5.79 | 80.5 | 12.4 |
| mps | 16 | 5.21 | 72.3 | 13.8 |
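
The ms/frame and fps columns appear to be derived directly from the wall time and the frame count. A small sketch of that arithmetic, using the batch=1 img2pose row on the 72-frame video (the helper name `per_frame_metrics` is hypothetical):

```python
def per_frame_metrics(wall_seconds: float, n_frames: int) -> tuple[float, float]:
    """Return (milliseconds per frame, frames per second) for one timed call."""
    ms_per_frame = wall_seconds / n_frames * 1000.0
    fps = n_frames / wall_seconds
    return ms_per_frame, fps


ms, fps = per_frame_metrics(8.17, 72)
print(f"{ms:.1f} ms/frame, {fps:.1f} fps")  # 113.5 ms/frame, 8.8 fps
```

Recomputing other rows from the two-decimal sec column can differ in the last digit (5.79 s over 72 frames gives 80.4 rather than the reported 80.5), presumably because the reported values were computed from the unrounded wall time.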
retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 3.88 | 54.0 | 18.5 |
| mps | 4 | 1.58 | 21.9 | 45.6 |
| mps | 16 | 0.98 | 13.7 | 73.2 |
MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 10.21 | 141.8 | 7.1 |
| mps | 4 | 3.24 | 45.0 | 22.2 |
| mps | 16 | 1.63 | 22.6 | 44.2 |
Video: long (472 frames)
img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 82.45 | 174.7 | 5.7 |
| mps | 4 | 57.08 | 120.9 | 8.3 |
| mps | 16 | 53.62 | 113.6 | 8.8 |
retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 29.33 | 62.1 | 16.1 |
| mps | 4 | 12.10 | 25.6 | 39.0 |
| mps | 16 | 8.86 | 18.8 | 53.3 |
MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 77.89 | 165.0 | 6.1 |
| mps | 4 | 23.16 | 49.1 | 20.4 |
| mps | 16 | 11.08 | 23.5 | 42.6 |
Images: 16 x multi_face.jpg = 80 faces
img2pose

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| mps | 1 | 3.71 | 232.2 | 80 |
| mps | 4 | 2.28 | 142.6 | 80 |
| mps | 16 | 2.69 | 168.2 | 80 |
retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| mps | 1 | 1.43 | 89.5 | 80 |
| mps | 4 | 0.89 | 55.8 | 80 |
| mps | 16 | 0.90 | 56.4 | 80 |
MPDetector retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| mps | 1 | 3.97 | 248.3 | 80 |
| mps | 4 | 1.74 | 108.8 | 80 |
| mps | 16 | 1.22 | 76.5 | 80 |
Notes (hand-curated)
Scope: full-pipeline bench on an M5 MacBook Pro (MPS). All three configs run with au_model='xgb' (or mp_blendshapes for MPDetector), emotion_model='resmasknet', and identity_model='arcface', matching the default Detector() config most users run in production. This run replaces the partial svm-AU bench at 2026-05-03-864962c-mps.md.
Headline numbers (best ms/frame, MPS):

| Config | Long video (472 frames) | Image batch (80 faces) |
|---|---|---|
| img2pose | 113.6 ms/frame (8.8 fps), batch=16 | 142.6 ms/img, batch=4 |
| retinaface | 18.8 ms/frame (53.3 fps), batch=16 | 55.8 ms/img, batch=4 |
| MPDetector retinaface | 23.5 ms/frame (42.6 fps), batch=16 | 76.5 ms/img, batch=16 |

All three include emotion (resmasknet) + identity (arcface) on top of face + landmark + AU.
Recommendation for users: Detector(face_model='retinaface') is the fastest full-pipeline config on MPS at batch=16, reaching 53 fps on the 472-frame video. img2pose pays for its head-pose regression in compute (~6× slower at batch=16). MPDetector falls in between: it provides the 478-point mediapipe mesh at about 25% more wall time than retinaface (23.5 vs 18.8 ms/frame at batch=16).
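
The relative costs quoted in the recommendation follow from the best long-video ms/frame figures reported in the tables. A short sketch of that comparison, with retinaface as the baseline (the dictionary keys simply reuse the section names):

```python
# Best long-video ms/frame per config (batch=16), copied from the results above.
best_ms_per_frame = {
    "img2pose": 113.6,
    "retinaface": 18.8,
    "MPDetector retinaface": 23.5,
}

baseline = best_ms_per_frame["retinaface"]

# Slowdown of each config relative to the retinaface baseline.
ratios = {config: ms / baseline for config, ms in best_ms_per_frame.items()}

for config, ratio in ratios.items():
    print(f"{config:<22} {best_ms_per_frame[config]:6.1f} ms/frame  {ratio:.2f}x retinaface")
```

This gives roughly 6.0x for img2pose and 1.25x for MPDetector, which is where the "~6× slower" figure comes from.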
HOG batching speedup (PR #292, isolated bench):
The detector-level numbers above include face detection + landmark + emotion + identity, which dilutes the HOG speedup. Isolated extract_hog_features_batched vs the legacy extract_hog_features per-face loop on MPS:

| n_faces | Legacy | Batched (PR #292) | Speedup |
|---|---|---|---|
| 5 | 15.2 ms | 7.8 ms | 1.96x |
| 20 | 68.1 ms | 16.3 ms | 4.19x |
| 50 | 145.9 ms | 32.0 ms | 4.56x |

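
One way to read the HOG table is as amortized cost per face. The sketch below only divides the reported timings by the face count; it does not call the py-feat functions themselves.

```python
# (n_faces, legacy ms, batched ms) copied from the isolated HOG bench above.
rows = [(5, 15.2, 7.8), (20, 68.1, 16.3), (50, 145.9, 32.0)]

per_face = []
for n_faces, legacy_ms, batched_ms in rows:
    legacy_per_face = legacy_ms / n_faces    # cost of the per-face loop
    batched_per_face = batched_ms / n_faces  # cost of the batched path
    per_face.append((n_faces, legacy_per_face, batched_per_face))
    print(
        f"n={n_faces:>2}  legacy {legacy_per_face:.2f} ms/face  "
        f"batched {batched_per_face:.2f} ms/face"
    )
```

On these numbers the legacy loop stays near 3 ms per face at every size, while the batched path drops from about 1.6 ms per face at 5 faces to about 0.6 ms per face at 50, which is consistent with the speedup growing with face count.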
Comparison to prior baselines:
- 2026-05-03-864962c.md (Linux + RTX 3090 + CUDA, hand-curated note) was run with au_model='svm', the bench script's pre-PR-NNN configs. Because of the svm vs xgb difference, cell-by-cell comparisons against the present configs drift by ~5-10%, so cross-platform speed gaps are illustrative, not exact. Re-running that bench with the new full-pipeline configs would close the comparison gap.
- 2026-05-03-437b651.md is hand-curated from a partial v0.7 svm sweep; the workers-axis cells are still informative for num_workers > 0 regression tracking.
- Pre-v0.7 community benchmarks live in py-feat issue #184 (Google Sheet, per-stage timings); they are not directly cell-comparable to the detect() wall-time format used here.