py-feat detector benchmark (2026-05-03 23:09:29)

Run metadata

  • Date: 2026-05-03 23:09:29

  • py-feat version: 0.7.0

  • Git commit: d71c0d7

  • Host: vpn-two-factor-general-228-129-185.dartmouth.edu (arm64, 18 CPUs)

  • Python: 3.13.12

  • PyTorch: 2.11.0

  • GPU: MPS available

  • OMP_NUM_THREADS: 1

  • Devices swept: ['mps']

  • Batch sizes: [1, 4, 16]

  • DataLoader workers: [0]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
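The warmup-then-time protocol described above can be sketched as follows. This is a minimal illustration rather than the bench script itself: `time_call` and the `work` placeholder are hypothetical names, and `time.perf_counter` is assumed as the wall clock.

```python
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[[], Any], warmups: int = 1) -> Tuple[float, Any]:
    """Run `fn` untimed `warmups` times, then time one further call.

    Returns (wall-clock seconds of the timed call, its result).
    """
    for _ in range(warmups):
        fn()  # untimed: absorbs one-off costs such as lazy initialisation
    start = time.perf_counter()
    result = fn()
    elapsed = time.perf_counter() - start
    return elapsed, result


if __name__ == "__main__":
    # Hypothetical stand-in for a detector call on one video.
    def work() -> int:
        return sum(i * i for i in range(100_000))

    seconds, value = time_call(work)
    print(f"timed call: {seconds:.4f} s")
```

Only the second invocation contributes to the reported number, so first-call overhead does not inflate the wall time.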

Video: short (72 frames)

img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 8.17 | 113.5 | 8.8 |
| mps | 4 | 5.79 | 80.5 | 12.4 |
| mps | 16 | 5.21 | 72.3 | 13.8 |
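The ms/frame and fps columns follow from the sec column and the frame count. The sketch below shows that arithmetic for the batch=1 row; `per_frame_stats` is a hypothetical helper, not part of py-feat or the bench script. Because the sec column is itself rounded, values recomputed this way can differ from the table in the last digit.

```python
def per_frame_stats(seconds: float, n_frames: int) -> tuple[float, float]:
    """Return (ms per frame, frames per second), each rounded to one decimal."""
    ms_per_frame = round(seconds / n_frames * 1000.0, 1)
    fps = round(n_frames / seconds, 1)
    return ms_per_frame, fps


# batch=1 row: 8.17 s wall time over the 72-frame short video
print(per_frame_stats(8.17, 72))  # (113.5, 8.8)
```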

retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 3.88 | 54.0 | 18.5 |
| mps | 4 | 1.58 | 21.9 | 45.6 |
| mps | 16 | 0.98 | 13.7 | 73.2 |

MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 10.21 | 141.8 | 7.1 |
| mps | 4 | 3.24 | 45.0 | 22.2 |
| mps | 16 | 1.63 | 22.6 | 44.2 |

Video: long (472 frames)

img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 82.45 | 174.7 | 5.7 |
| mps | 4 | 57.08 | 120.9 | 8.3 |
| mps | 16 | 53.62 | 113.6 | 8.8 |

retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 29.33 | 62.1 | 16.1 |
| mps | 4 | 12.10 | 25.6 | 39.0 |
| mps | 16 | 8.86 | 18.8 | 53.3 |

MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| mps | 1 | 77.89 | 165.0 | 6.1 |
| mps | 4 | 23.16 | 49.1 | 20.4 |
| mps | 16 | 11.08 | 23.5 | 42.6 |

Images: 16 x multi_face.jpg = 80 faces

img2pose

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| mps | 1 | 3.71 | 232.2 | 80 |
| mps | 4 | 2.28 | 142.6 | 80 |
| mps | 16 | 2.69 | 168.2 | 80 |

retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| mps | 1 | 1.43 | 89.5 | 80 |
| mps | 4 | 0.89 | 55.8 | 80 |
| mps | 16 | 0.90 | 56.4 | 80 |

MPDetector retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| mps | 1 | 3.97 | 248.3 | 80 |
| mps | 4 | 1.74 | 108.8 | 80 |
| mps | 16 | 1.22 | 76.5 | 80 |


Notes (hand-curated)

Scope: full-pipeline bench on M5 MBP MPS. All three configs run with au_model='xgb' (or mp_blendshapes for MPDetector), emotion_model='resmasknet', and identity_model='arcface' — matching the default Detector() config most users run in production. Replaces the partial svm-AU bench at 2026-05-03-864962c-mps.md.
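For reference, the three configs named in the scope note can be written out as keyword arguments. This is a sketch under stated assumptions: the keyword names come from the note itself, while the `device` keyword, the `from feat import Detector` import, and the helper `build_detector` are assumptions for illustration. The MPDetector import path is not given here, so only its keyword arguments are listed.

```python
# Settings shared by all three configs, per the scope note.
SHARED = {"emotion_model": "resmasknet", "identity_model": "arcface"}

CONFIGS = {
    "img2pose": {"face_model": "img2pose", "au_model": "xgb", **SHARED},
    "retinaface": {"face_model": "retinaface", "au_model": "xgb", **SHARED},
    # MPDetector uses mediapipe blendshapes in place of the xgb AU model.
    "MPDetector retinaface": {"au_model": "mp_blendshapes", **SHARED},
}


def build_detector(name: str, device: str = "mps"):
    """Instantiate a Detector for one of the two Detector-based configs.

    Assumptions: py-feat is installed, `Detector` is importable from `feat`,
    and its constructor accepts a `device` keyword.
    """
    if name == "MPDetector retinaface":
        raise ValueError("MPDetector is a separate class; build it directly")
    from feat import Detector  # lazy import: the dicts work without py-feat

    return Detector(device=device, **CONFIGS[name])
```

Keeping the shared emotion and identity settings in one dict makes it harder for the configs to drift apart between bench runs.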

Headline numbers (best ms/frame, MPS):

| Config | Long video (472 frames) | Image batch (80 faces) |
|---|---|---|
| Detector(face_model='img2pose', au_model='xgb') | 113.6 ms/frame (8.8 fps), batch=16 | 142.6 ms/img, batch=4 |
| Detector(face_model='retinaface', au_model='xgb') | 18.8 ms/frame (53.3 fps), batch=16 | 55.8 ms/img, batch=4 |
| MPDetector(au_model='mp_blendshapes') | 23.5 ms/frame (42.6 fps), batch=16 | 76.5 ms/img, batch=16 |

All three include emotion (resmasknet) + identity (arcface) on top of face + landmark + AU.

Recommendation for users: Detector(face_model='retinaface') is the fastest full-pipeline config on MPS at batch=16, reaching 53 fps on the 472-frame video. img2pose pays for its head-pose regression in compute: it is ~6× slower at batch=16 (113.6 vs 18.8 ms/frame). MPDetector falls in between; it provides the 478-point mediapipe mesh at about 25% more time per frame than retinaface (23.5 vs 18.8 ms/frame).
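The relative costs quoted in the recommendation can be checked against the headline numbers. The snippet below is a small arithmetic sketch; `slowdown_vs` is a hypothetical helper, and the inputs are the best long-video ms/frame values at batch=16.

```python
# Best long-video ms/frame at batch=16, copied from the headline numbers.
BEST_MS_PER_FRAME = {
    "img2pose": 113.6,
    "retinaface": 18.8,
    "MPDetector": 23.5,
}


def slowdown_vs(baseline: str, table: dict[str, float]) -> dict[str, float]:
    """Per-config time per frame as a multiple of the baseline config."""
    base = table[baseline]
    return {name: round(ms / base, 2) for name, ms in table.items()}


ratios = slowdown_vs("retinaface", BEST_MS_PER_FRAME)
print(ratios)  # {'img2pose': 6.04, 'retinaface': 1.0, 'MPDetector': 1.25}
```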

HOG batching speedup (PR #292, isolated bench):

The detector-level numbers above include face detection + landmark + emotion + identity, which dilutes the HOG speedup. Isolated extract_hog_features_batched vs the legacy extract_hog_features per-face loop on MPS:

| n_faces | Legacy | Batched (PR #292) | Speedup |
|---|---|---|---|
| 5 | 15.2 ms | 7.8 ms | 1.96x |
| 20 | 68.1 ms | 16.3 ms | 4.19x |
| 50 | 145.9 ms | 32.0 ms | 4.56x |
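The speedup column is the legacy time divided by the batched time. The sketch below recomputes it from the rounded table values and also prints the cost per face; `ROWS` and `speedup` are illustrative names. The recomputed ratios land within about 0.01 of the published column, presumably because the published ratios were taken from unrounded timings.

```python
# (n_faces, legacy ms, batched ms, published speedup) from the HOG table.
ROWS = [
    (5, 15.2, 7.8, 1.96),
    (20, 68.1, 16.3, 4.19),
    (50, 145.9, 32.0, 4.56),
]


def speedup(legacy_ms: float, batched_ms: float) -> float:
    """Ratio of legacy to batched wall time (values above 1 favour batching)."""
    return legacy_ms / batched_ms


for n_faces, legacy, batched, published in ROWS:
    print(
        f"n_faces={n_faces:>2}: recomputed {speedup(legacy, batched):.2f}x "
        f"(published {published:.2f}x), "
        f"per face: legacy {legacy / n_faces:.2f} ms, "
        f"batched {batched / n_faces:.2f} ms"
    )
```

Per face, the legacy loop stays near 3 ms at every size while the batched path falls from about 1.6 ms to about 0.6 ms, which is why the speedup grows with n_faces.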

Comparison to prior baselines:

  • 2026-05-03-864962c.md (Linux + RTX 3090 + CUDA, hand-curated note) was run with au_model='svm', as in the bench script's pre-PR-NNN configs. Direct cell comparisons drift ~5-10% from the present configs (svm vs xgb), so cross-platform speed gaps are illustrative, not exact. Re-running that bench with the new full-pipeline configs would close the comparison gap.

  • 2026-05-03-437b651.md is hand-curated from a partial v0.7 svm sweep; the workers-axis cells are still informative for num_workers > 0 regression tracking.

  • Pre-v0.7 community benchmarks live in py-feat issue #184 (Google Sheet, per-stage timings); they are not directly cell-comparable to the detect() wall-time format used here.