py-feat detector benchmark - 2026-05-04 02:03:51

Run metadata

  • Date: 2026-05-04 02:03:51

  • py-feat version: 0.7.0

  • Git commit: f44ccb1

  • Host: vpn-two-factor-general-228-129-185.dartmouth.edu (arm64, 18 CPUs)

  • Python: 3.13.12

  • PyTorch: 2.11.0

  • GPU: MPS available

  • OMP_NUM_THREADS: 1

  • Devices swept: ['cpu']

  • Batch sizes: [1, 4, 16]

  • DataLoader workers: [0]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
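The measurement protocol described above can be sketched as follows. This is a minimal illustration, not the benchmark script itself: `time_call`, `summarize`, and the dummy workload are hypothetical names, and `time.perf_counter` is an assumed choice of clock.

```python
import time
from typing import Callable


def time_call(fn: Callable[[], object]) -> float:
    """Run fn once untimed (warmup), then return the wall time of a second call."""
    fn()  # warmup: absorbs one-time costs such as lazy model loading
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def summarize(seconds: float, n_frames: int) -> dict:
    """Derive the per-frame table columns from a total wall time."""
    return {
        "sec": round(seconds, 2),
        "ms/frame": round(1000.0 * seconds / n_frames, 1),
        "fps": round(n_frames / seconds, 1),
    }


# Hypothetical stand-in for a detector call on a 72-frame video.
elapsed = time_call(lambda: sum(i * i for i in range(100_000)))
print(summarize(elapsed, n_frames=72))

# Re-deriving a reported row from its wall time: 28.10 s over 72 frames.
print(summarize(28.10, n_frames=72))
```

Applied to the img2pose batch=4 row of the short video (28.10 s over 72 frames), `summarize` gives 390.3 ms/frame and 2.6 fps, matching the table. Some rows differ in the last digit because the report presumably divides the unrounded wall time.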

Video: short (72 frames)

img2pose

| device | batch | sec   | ms/frame | fps |
|--------|-------|-------|----------|-----|
| cpu    | 1     | 28.37 | 394.1    | 2.5 |
| cpu    | 4     | 28.10 | 390.3    | 2.6 |
| cpu    | 16    | 95.05 | 1320.1   | 0.8 |

retinaface

| device | batch | sec   | ms/frame | fps |
|--------|-------|-------|----------|-----|
| cpu    | 1     | 12.38 | 172.0    | 5.8 |
| cpu    | 4     | 12.20 | 169.4    | 5.9 |
| cpu    | 16    | 59.43 | 825.4    | 1.2 |

MPDetector retinaface

| device | batch | sec   | ms/frame | fps |
|--------|-------|-------|----------|-----|
| cpu    | 1     | 13.95 | 193.8    | 5.2 |
| cpu    | 4     | 14.00 | 194.5    | 5.1 |
| cpu    | 16    | 63.00 | 875.0    | 1.1 |

Video: long (472 frames)

img2pose

| device | batch | sec    | ms/frame | fps |
|--------|-------|--------|----------|-----|
| cpu    | 1     | 184.91 | 391.8    | 2.6 |
| cpu    | 4     | 182.17 | 386.0    | 2.6 |
| cpu    | 16    | 672.85 | 1425.5   | 0.7 |

retinaface

| device | batch | sec    | ms/frame | fps |
|--------|-------|--------|----------|-----|
| cpu    | 1     | 82.55  | 174.9    | 5.7 |
| cpu    | 4     | 79.91  | 169.3    | 5.9 |
| cpu    | 16    | 427.30 | 905.3    | 1.1 |

MPDetector retinaface

| device | batch | sec    | ms/frame | fps |
|--------|-------|--------|----------|-----|
| cpu    | 1     | 93.58  | 198.3    | 5.0 |
| cpu    | 4     | 92.36  | 195.7    | 5.1 |
| cpu    | 16    | 446.98 | 947.0    | 1.1 |

Images: 16 x multi_face.jpg = 80 faces

img2pose

| device | batch | sec   | ms/img | rows |
|--------|-------|-------|--------|------|
| cpu    | 1     | 12.17 | 760.6  | 80   |
| cpu    | 4     | 50.27 | 3141.7 | 80   |
| cpu    | 16    | 48.63 | 3039.3 | 80   |

retinaface

| device | batch | sec   | ms/img | rows |
|--------|-------|-------|--------|------|
| cpu    | 1     | 10.99 | 686.9  | 80   |
| cpu    | 4     | 49.59 | 3099.2 | 80   |
| cpu    | 16    | 50.62 | 3163.9 | 80   |

MPDetector retinaface

| device | batch | sec   | ms/img | rows |
|--------|-------|-------|--------|------|
| cpu    | 1     | 13.29 | 830.3  | 80   |
| cpu    | 4     | 52.06 | 3253.8 | 80   |
| cpu    | 16    | 52.99 | 3312.0 | 80   |


Notes (hand-curated)

Scope: full-pipeline (xgb AU + resmasknet emotion + arcface identity) on M5 MBP CPU. Companion to 2026-05-03-d71c0d7.md (M5 MPS) and 2026-05-03-864962c.md (Linux + RTX 3090 + svm-only AU).

Headline finding: M5 CPU does not benefit from batching.

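The clearest pattern in this file: on M5 CPU, no configuration gains meaningfully from batching. On video, batch 4 is within about 3% of batch 1, while batch 16 is 3.3× to 5.2× slower per frame. On images, batch 4 and batch 16 are both 3.9× to 4.6× slower per image than batch 1. Example (short video, retinaface):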

| Batch | ms/frame                               |
|-------|----------------------------------------|
| 1     | 172.0                                  |
| 4     | 169.4                                  |
| 16    | 825.4 (about 4.8× slower than batch 1) |
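The slowdown factor quoted for batch 16 can be re-derived from the three ms/frame values. A minimal sketch, with the values copied from the retinaface short-video table and variable names chosen for illustration only:

```python
# ms/frame for retinaface on the short video, keyed by batch size.
ms_per_frame = {1: 172.0, 4: 169.4, 16: 825.4}

# Express each batch size as a multiple of the batch=1 per-frame cost.
baseline = ms_per_frame[1]
relative = {batch: round(ms / baseline, 2) for batch, ms in ms_per_frame.items()}

for batch, ratio in relative.items():
    print(f"batch={batch:>2}: {ratio:.2f}x the batch=1 cost per frame")
```

Batch 4 comes out at 0.98× (a gain of under 2%) and batch 16 at 4.8×.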

That’s the opposite of the Linux CPU pattern in 2026-05-03-864962c.md, where batching speeds things up (retinaface short video: 314 → 115 → 98 ms/frame at batch 1/4/16).

M5 CPU vs Linux CPU (ms/frame by batch size, short video, retinaface):

| Batch    | M5 CPU (this file) | Linux CPU (svm-only) | Ratio             |
|----------|--------------------|----------------------|-------------------|
| batch=1  | 172 ms/frame       | 314 ms/frame         | M5 1.8× faster    |
| batch=4  | 169 ms/frame       | 115 ms/frame         | Linux 1.5× faster |
| batch=16 | 825 ms/frame       | 98 ms/frame          | Linux 8.4× faster |
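The Ratio column is the slower time divided by the faster time, labelled with the faster host. A short sketch of that arithmetic, assuming the rounded ms/frame values shown in the comparison; `faster_host` is a hypothetical helper name:

```python
# ms/frame, short video, retinaface. Linux values are those quoted from
# 2026-05-03-864962c.md in the comparison above.
m5_cpu = {1: 172, 4: 169, 16: 825}
linux_cpu = {1: 314, 4: 115, 16: 98}


def faster_host(m5_ms: float, linux_ms: float) -> str:
    """Name the faster host and its speedup (slower time / faster time)."""
    if m5_ms <= linux_ms:
        return f"M5 {linux_ms / m5_ms:.1f}x faster"
    return f"Linux {m5_ms / linux_ms:.1f}x faster"


ratios = {batch: faster_host(m5_cpu[batch], linux_cpu[batch]) for batch in m5_cpu}
for batch, label in ratios.items():
    print(f"batch={batch}: {label}")
```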

At batch=1 (the package default and our current recommendation for CPU users), M5 is ~1.8× faster per frame than the Linux server CPU, which is expected given M5’s per-core throughput on ARM SIMD. At larger batches, the Linux CPU’s MKL/OpenBLAS appears to exploit its larger L3 cache and AVX much better than Apple’s Accelerate does for single-threaded inference (OMP_NUM_THREADS=1 in this run). The Linux file’s svm-AU configs differ slightly from this file’s xgb + emotion + identity stack, so direct cell-to-cell comparisons drift by ~5-10%; the qualitative pattern (Linux scales with batch, M5 doesn’t) is robust.

Implication for users:

  • Apple Silicon (M-series) on CPU: batch_size=1 is the right default, and it is already what the package uses. batch_size=4 is at best ~3% faster on video and ~4× slower on images; batch_size=16 is roughly 3× to 5× slower throughout.

  • Apple Silicon on MPS: batch_size=16 wins (see 2026-05-03-d71c0d7.md — 18.8 ms/frame at b=16 vs 62.1 at b=1).

  • Linux CPU: batch_size=16 wins by a wide margin.

  • Linux + CUDA: batch_size=4 is the sweet spot on small inputs (see 2026-05-03-864962c.md — image bench shows non-monotonic regression at b=16).
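The recommendations above amount to a small lookup table. The sketch below encodes them; `recommended_batch_size` and its platform and device keys are hypothetical and not part of py-feat, and the fallback of 1 simply mirrors the package default mentioned above.

```python
def recommended_batch_size(platform: str, device: str) -> int:
    """Look up the batch size suggested by these benchmark notes.

    Hypothetical helper, not part of py-feat. Keys are (platform, device).
    """
    table = {
        ("apple_silicon", "cpu"): 1,
        ("apple_silicon", "mps"): 16,
        ("linux", "cpu"): 16,
        ("linux", "cuda"): 4,  # sweet spot on small inputs
    }
    # Untested combinations fall back to 1, the package default.
    return table.get((platform, device), 1)


print(recommended_batch_size("apple_silicon", "cpu"))
print(recommended_batch_size("linux", "cuda"))
```

These values reflect only the three benchmark files referenced here, so other hardware should be measured before it is added to such a table.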

MPS vs CPU on M5 (full pipeline, retinaface long video, batch=16, this file vs d71c0d7.md):

  • M5 MPS: 18.8 ms/frame (53.3 fps)

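  • M5 CPU: 905 ms/frame (1.1 fps), 48× slower. Even at the best CPU setting in this file (batch=4, 169.3 ms/frame), CPU is about 9× slower than MPS.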

If a user has MPS available, they shouldn’t be running on CPU. The Detector(device='auto') default should handle this correctly, but it is worth flagging in the docs.
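For the docs, one way to illustrate what an "auto" policy could resolve to is the sketch below. It is an assumption about reasonable behaviour, not py-feat's actual implementation: `resolve_device` is a hypothetical helper that relies only on PyTorch's `torch.cuda.is_available()` and `torch.backends.mps.is_available()`.

```python
def resolve_device(requested: str = "auto") -> str:
    """Resolve 'auto' to 'cuda', 'mps', or 'cpu'; pass other values through."""
    if requested != "auto":
        return requested
    try:
        import torch
    except ImportError:
        return "cpu"  # no PyTorch installed, so no accelerator to use
    if torch.cuda.is_available():
        return "cuda"
    if torch.backends.mps.is_available():
        return "mps"
    return "cpu"


print(resolve_device())
```

On the M5 host used here, a policy like this would be expected to pick `mps`, since the run metadata reports MPS as available. Users can print the resolved device to confirm they are not on the CPU path by accident.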