py-feat detector benchmark: 2026-05-03

Run metadata

  • Date: 2026-05-03

  • py-feat version: 0.7.0

  • Git commit: 437b651

  • Host: Mac (arm64, 18 CPUs)

  • Python: 3.13.12

  • PyTorch: 2.11.0

  • MPS available: True

  • OMP_NUM_THREADS: 1

  • Devices swept: ['cpu', 'mps']

  • Batch sizes: [1, 4, 16]

  • DataLoader workers: [0, 2]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
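The sweep and timing protocol above can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/bench_detectors.py: `time_with_warmup` and `fake_detect` are hypothetical names, and the stand-in workload replaces the real detector call.

```python
import itertools
import time

# Sweep axes taken from the run metadata above.
DEVICES = ["cpu", "mps"]
BATCH_SIZES = [1, 4, 16]
WORKERS = [0, 2]


def time_with_warmup(fn, *args, **kwargs):
    """Run fn once untimed, then once timed; return wall-clock seconds."""
    fn(*args, **kwargs)  # warmup call, not measured
    start = time.perf_counter()
    fn(*args, **kwargs)  # timed call
    return time.perf_counter() - start


def fake_detect(device, batch_size, num_workers):
    """Hypothetical stand-in for a detector call; does trivial work."""
    return sum(range(1000))


# One cell per (device, batch, workers) combination: 2 * 3 * 2 = 12 cells.
grid = list(itertools.product(DEVICES, BATCH_SIZES, WORKERS))
results = {cfg: time_with_warmup(fake_detect, *cfg) for cfg in grid}

for (device, batch, workers), sec in results.items():
    print(f"{device:>3} batch={batch:<2} workers={workers} sec={sec:.6f}")
```

Using a monotonic clock such as `time.perf_counter` keeps the measurement unaffected by system clock adjustments during a long sweep.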

Note: this baseline file was hand-curated from the raw stdout of an earlier run. The structured tables match the script’s output format, and the narrative (“Headline” + “Findings” sections) is appended at the end. Future runs generated by python scripts/bench_detectors.py --markdown produce the structured part automatically; the narrative is optional and should be added by hand only if it adds value.

Video: long (472 frames)

retinaface

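| device | batch | workers | sec | ms/frame | fps |
|---|---|---|---|---|---|
| cpu | 1 | 0 | 38.30 | 81.2 | 12.3 |
| cpu | 1 | 2 | 71.35 | 151.2 | 6.6 |
| cpu | 4 | 0 | 37.97 | 80.4 | 12.4 |
| cpu | 4 | 2 | 69.77 | 147.8 | 6.8 |
| cpu | 16 | 0 | 158.81 | 336.5 | 3.0 |
| cpu | 16 | 2 | 188.09 | 398.5 | 2.5 |
| mps | 1 | 0 | 17.12 | 36.3 | 27.6 |
| mps | 1 | 2 | 49.97 | 105.9 | 9.4 |
| mps | 4 | 0 | 7.86 | 16.7 | 60.0 |
| mps | 4 | 2 | 40.28 | 85.3 | 11.7 |
| mps | 16 | 0 | 5.16 | 10.9 | 91.4 |
| mps | 16 | 2 | 32.88 | 69.7 | 14.4 |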
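The ms/frame and fps columns appear to be derived from the wall time and the frame count. A minimal sketch of that arithmetic, assuming ms/frame = sec / frames × 1000 and fps = frames / sec (`derived_columns` is a hypothetical helper, not part of the benchmark script):

```python
N_FRAMES = 472  # frame count of the long video


def derived_columns(sec, n_frames=N_FRAMES):
    """Return (ms per frame, frames per second) for one timed run."""
    ms_per_frame = sec / n_frames * 1000.0
    fps = n_frames / sec
    return ms_per_frame, fps


# Best row of the video table: mps, batch=16, workers=0, 5.16 s.
ms, fps = derived_columns(5.16)
print(f"{ms:.1f} ms/frame, {fps:.1f} fps")
```

Because the sec column is itself rounded to two decimals, values recomputed this way can differ from the table in the last digit.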

Images: 16 x multi_face.jpg = 80 faces

retinaface

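| device | batch | workers | sec | ms/img | rows |
|---|---|---|---|---|---|
| cpu | 1 | 0 | 3.72 | 232.3 | 80 |
| cpu | 1 | 2 | 36.44 | 2277.4 | 80 |
| cpu | 4 | 0 | 5.31 | 332.1 | 80 |
| cpu | 4 | 2 | 33.10 | 2069.1 | 80 |
| cpu | 16 | 0 | 14.80 | 924.9 | 80 |
| cpu | 16 | 2 | 37.99 | 2374.5 | 80 |
| mps | 1 | 0 | 0.97 | 60.8 | 80 |
| mps | 1 | 2 | 33.49 | 2093.3 | 80 |
| mps | 4 | 0 | 0.65 | 40.6 | 80 |
| mps | 4 | 2 | 28.36 | 1772.6 | 80 |
| mps | 16 | 0 | 0.65 | 40.4 | 80 |
| mps | 16 | 2 | 23.51 | 1469.5 | 80 |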


Headline numbers (hand-curated)

| workload | best | config |
|---|---|---|
| Long video (472 frames) | 10.9 ms/frame (91.4 fps) | mps, batch=16, workers=0 |
| Image batch (16 imgs) | 40.4 ms/img | mps, batch=16, workers=0 |
| CPU long video | 80.4 ms/frame (12.4 fps) | cpu, batch=4, workers=0 |
| CPU images | 232.3 ms/img | cpu, batch=1, workers=0 |

MPS vs CPU at batch=16, workers=0: 31× faster on long video (10.9 vs 336.5 ms/frame), 23× faster on image batch (40.4 vs 924.9 ms/img).
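The speedup figures are simple ratios of the per-item times at batch=16, workers=0. A short sketch that reproduces them from the table values (the `speedup` helper is illustrative only):

```python
# Per-item times at batch=16, workers=0, copied from the tables above.
CPU_VIDEO_MS, MPS_VIDEO_MS = 336.5, 10.9  # ms/frame
CPU_IMAGE_MS, MPS_IMAGE_MS = 924.9, 40.4  # ms/img


def speedup(slow_ms, fast_ms):
    """How many times faster the fast configuration is."""
    return slow_ms / fast_ms


video_speedup = speedup(CPU_VIDEO_MS, MPS_VIDEO_MS)
image_speedup = speedup(CPU_IMAGE_MS, MPS_IMAGE_MS)
print(f"video: {video_speedup:.1f}x, images: {image_speedup:.1f}x")
```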

Findings & recommendations (hand-curated)

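  1. MPS is dramatically faster than CPU at batch ≥ 4 (workers=0). At batch=16 it is 31× faster on video (10.9 vs 336.5 ms/frame) and 23× faster on images (40.4 vs 924.9 ms/img). At batch=4 the gap is about 5× on video (16.7 vs 80.4 ms/frame) and 8× on images (40.6 vs 332.1 ms/img).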

  2. num_workers=0 beats num_workers=2 in every cell. The penalty is largest on the 16-image run on MPS: workers=2 is about 34× slower at batch=1 (60.8 vs 2093.3 ms/img) and about 44× slower at batch=4 (40.6 vs 1772.6 ms/img). Cause: with OMP_NUM_THREADS=1 (the xgb fix from #288), each DataLoader worker process is single-threaded, so we pay worker startup overhead with no parallelism win.

  3. Batch size effect inverts on CPU.

    • MPS video: batch=1→16 gives 3.3× speedup (36.3 → 10.9 ms/frame)

    • CPU video: batch=1→16 is a 4.1× slowdown (81.2 → 336.5 ms/frame), while batch=4 is roughly flat (80.4 ms/frame). Single-threaded matmul plus larger memory traffic makes large batches worse.

  4. MPS image throughput saturates at batch=4: going to 16 doesn’t help (40.6 vs 40.4 ms/img). The dataset is too small (16 images, 80 faces) to keep MPS busy beyond 4-image batches.
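The workers finding can be checked directly against the image table. The sketch below computes the workers=2 / workers=0 ratio for each (device, batch) cell and reports the largest; the dictionary simply restates the ms/img column, and the variable names are illustrative.

```python
# ms/img from the image table: (device, batch) -> (workers=0, workers=2)
IMAGE_MS = {
    ("cpu", 1): (232.3, 2277.4),
    ("cpu", 4): (332.1, 2069.1),
    ("cpu", 16): (924.9, 2374.5),
    ("mps", 1): (60.8, 2093.3),
    ("mps", 4): (40.6, 1772.6),
    ("mps", 16): (40.4, 1469.5),
}

# Slowdown factor from adding two DataLoader workers, per cell.
penalty = {cfg: w2 / w0 for cfg, (w0, w2) in IMAGE_MS.items()}
worst = max(penalty, key=penalty.get)

for (device, batch), factor in sorted(penalty.items()):
    print(f"{device} batch={batch:<2} workers=2 is {factor:.1f}x slower")
print(f"largest penalty: {worst[0]}, batch={worst[1]}")
```

A ratio above 1 in every cell corresponds to workers=0 winning everywhere on this workload.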

Notes

  • Only the retinaface Detector configuration was swept on the workers axis here. The other configurations (img2pose, MPDetector retinaface) were measured separately in earlier sessions without the workers axis. To regenerate everything: python scripts/bench_detectors.py --workers 0 2 --markdown (~25 min on M5 MBP).

  • The xgb AU model was not benchmarked here; the SVM AU model is held constant across runs for an apples-to-apples comparison.