py-feat detector benchmark: 2026-05-14 23:15:32

Run metadata

  • Date: 2026-05-14 23:15:32

  • py-feat version: 0.7.0

  • Git commit: c716340

  • Host: liquidswords2 (x86_64, 128 CPUs)

  • Python: 3.12.13

  • PyTorch: 2.11.0+cu128

  • GPU: CUDA 12.8, NVIDIA RTX PRO 6000 Blackwell Workstation Edition

  • OMP_NUM_THREADS: 1

  • Devices swept: ['cuda']

  • Batch sizes: [1, 4, 16]

  • DataLoader workers: [0]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
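The warmup-then-time protocol described above can be sketched as follows. This is a minimal illustration, not the benchmark script itself: `time_call` and its `sync` hook are hypothetical names. On CUDA one would typically pass `torch.cuda.synchronize` as `sync`, so that queued kernels finish before the clock is read.

```python
import time
from typing import Callable, Optional


def time_call(fn: Callable[[], object],
              sync: Optional[Callable[[], None]] = None) -> float:
    """Run fn once untimed (warmup), then return the wall time of a second call in seconds.

    `sync` is an optional hook (hypothetical name) that is called before the
    clock starts and again before it stops.
    """
    fn()  # untimed warmup: absorbs lazy initialization and cache effects
    if sync is not None:
        sync()  # make sure warmup work has finished before timing starts
    start = time.perf_counter()
    fn()  # the timed call
    if sync is not None:
        sync()  # wait for asynchronous work before stopping the clock
    return time.perf_counter() - start


# Usage with a stand-in workload; a real run would wrap a detector call instead.
elapsed = time_call(lambda: sum(range(100_000)))
print(f"{elapsed:.4f} s")
```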

Video: short (72 frames)

img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 8.72 | 121.2 | 8.3 |
| cuda | 4 | 3.83 | 53.2 | 18.8 |
| cuda | 16 | 2.21 | 30.7 | 32.6 |
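The ms/frame and fps columns follow from the wall time and the frame count. The sketch below recomputes them for the img2pose b=16 row; `per_frame_metrics` is a hypothetical helper name, not part of py-feat.

```python
def per_frame_metrics(seconds: float, n_frames: int) -> tuple[float, float]:
    """Convert total wall time into (ms per frame, frames per second)."""
    ms_per_frame = seconds / n_frames * 1000.0
    fps = n_frames / seconds
    return ms_per_frame, fps


# img2pose, short video (72 frames), b=16: 2.21 s wall time
ms, fps = per_frame_metrics(2.21, 72)
print(f"{ms:.1f} ms/frame, {fps:.1f} fps")  # prints "30.7 ms/frame, 32.6 fps"
```

Recomputing other rows from the rounded sec column can differ from the table in the last digit (for example, 8.72 s over 72 frames gives 121.1 ms/frame against the reported 121.2), presumably because the table was derived from unrounded wall times.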

retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 7.67 | 106.6 | 9.4 |
| cuda | 4 | 2.43 | 33.7 | 29.6 |
| cuda | 16 | 0.54 | 7.5 | 134.0 |

MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 2.76 | 38.4 | 26.0 |
| cuda | 4 | 0.78 | 10.9 | 91.7 |
| cuda | 16 | 0.65 | 9.1 | 110.2 |

Video: long (472 frames)

img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 54.89 | 116.3 | 8.6 |
| cuda | 4 | 26.19 | 55.5 | 18.0 |
| cuda | 16 | 16.30 | 34.5 | 29.0 |

retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 39.76 | 84.2 | 11.9 |
| cuda | 4 | 12.27 | 26.0 | 38.5 |
| cuda | 16 | 5.42 | 11.5 | 87.1 |

MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 19.32 | 40.9 | 24.4 |
| cuda | 4 | 5.66 | 12.0 | 83.4 |
| cuda | 16 | 2.71 | 5.7 | 174.4 |

Images: 16 x multi_face.jpg = 80 faces

img2pose

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cuda | 1 | 2.49 | 155.8 | 80 |
| cuda | 4 | 1.23 | 77.2 | 80 |
| cuda | 16 | 1.07 | 67.0 | 80 |

retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cuda | 1 | 1.86 | 116.6 | 80 |
| cuda | 4 | 0.62 | 38.5 | 80 |
| cuda | 16 | 0.56 | 35.3 | 80 |

MPDetector retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cuda | 1 | 0.74 | 46.0 | 80 |
| cuda | 4 | 0.31 | 19.4 | 80 |
| cuda | 16 | 1.50 | 94.0 | 80 |


Notes (hand-curated)

Scope: first run on the new RTX PRO 6000 Blackwell Workstation Edition (96 GB) on liquidswords2, replacing the prior RTX 3090. Companion to 2026-05-03-864962c.md (same host, same py-feat 0.7.0, same OMP_NUM_THREADS=1, but RTX 3090 + torch 2.5.1+cu124). Two things changed at once: the GPU (3090 → Blackwell) and torch (2.5.1+cu124 → 2.11.0+cu128, required for sm_120 kernels). The two effects cannot be separated in this comparison, but the GPU is the dominant factor for the configurations swept here.

Headline: at each configuration’s best batch size, Blackwell runs at 1.26–1.75× the throughput of the 3090. Peak throughput is MPDetector on long video at 174 fps.

Best-batch comparison vs 2026-05-03-864962c.md (the 3090 file did not include a long-video sweep):

| Config | 3090 best | Blackwell best | Speedup |
|---|---|---|---|
| img2pose, short video | 47.9 ms/frame (b=16, 20.9 fps) | 30.7 ms/frame (b=16, 32.6 fps) | 1.56× |
| retinaface, short video | 10.6 ms/frame (b=16, 94.5 fps) | 7.5 ms/frame (b=16, 134 fps) | 1.42× |
| MPDetector, short video | 11.5 ms/frame (b=16, 86.7 fps) | 9.1 ms/frame (b=16, 110 fps) | 1.27× |
| img2pose, image | 87.1 ms/img (b=4) | 67.0 ms/img (b=16) | 1.30× |
| retinaface, image | 44.4 ms/img (b=4) | 35.3 ms/img (b=16) | 1.26× |
| MPDetector, image | 33.9 ms/img (b=4) | 19.4 ms/img (b=4) | 1.75× |
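The speedup column is the ratio of the best per-item latencies (3090 over Blackwell). The sketch below recomputes it from the rounded ms values in the table; the variable names are illustrative. Because the inputs are rounded, two rows land 0.01 below the reported figure, which was presumably derived from unrounded timings.

```python
# (3090 best, Blackwell best) per-item latency in ms, copied from the table above
best_ms = {
    "img2pose, short video": (47.9, 30.7),
    "retinaface, short video": (10.6, 7.5),
    "MPDetector, short video": (11.5, 9.1),
    "img2pose, image": (87.1, 67.0),
    "retinaface, image": (44.4, 35.3),
    "MPDetector, image": (33.9, 19.4),
}

# Speedup = old latency / new latency (values above 1 mean Blackwell is faster)
speedups = {name: old / new for name, (old, new) in best_ms.items()}

for name, ratio in speedups.items():
    print(f"{name}: {ratio:.2f}x")
```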

Long video (472 frames) — new ground:

| Config | Best Blackwell |
|---|---|
| img2pose b=16 | 34.5 ms/frame, 29.0 fps |
| retinaface b=16 | 11.5 ms/frame, 87.1 fps |
| MPDetector retinaface b=16 | 5.7 ms/frame, 174.4 fps (~7× realtime at 24 fps) |
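The realtime factor quoted for MPDetector is the processing rate divided by the playback rate. A small sketch applying the same arithmetic to all three long-video results, with the 24 fps playback rate taken from the note above and the variable names chosen for illustration:

```python
PLAYBACK_FPS = 24.0  # playback rate assumed in the note above

# Best long-video throughput per detector (b=16), in frames per second
long_video_fps = {
    "img2pose": 29.0,
    "retinaface": 87.1,
    "MPDetector retinaface": 174.4,
}

# Realtime factor: how many seconds of video are processed per second of wall time
realtime_factor = {name: fps / PLAYBACK_FPS for name, fps in long_video_fps.items()}

for name, factor in realtime_factor.items():
    print(f"{name}: {factor:.1f}x realtime")
```

On these numbers img2pose is only slightly above realtime (about 1.2×), while MPDetector comes out at about 7.3×.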

Other observations:

  • MPDetector image bench regresses at b=16 (94 ms/img vs 19 ms/img at b=4). The 3090 file showed the same non-monotonic pattern, which points to a CPU-side bottleneck (likely resmasknet or DataLoader collation) rather than a GPU limit. The video paths don’t show the regression.

  • Retinaface ≫ img2pose on this hardware: at b=16 short video, retinaface is ~4× img2pose (134 vs 32.6 fps). MPDetector is competitive with retinaface alone (110 vs 134 fps short video) and faster on long video (174 vs 87 fps), benefiting from per-frame mesh + blendshape efficiency.

  • VRAM headroom unused. Sweep capped at b=16; with 96 GB available this is well below saturation for the video paths. A --batches 1 16 32 64 follow-up sweep would identify Blackwell’s actual ceiling — expect further gains on retinaface and img2pose long-video paths.

  • Driver stack: NVIDIA driver 580.159.04 (open modules) with the nvidia-driver-580-open package; required for sm_120. Both py-feat .venv and the face_jepa .venv were upgraded to torch 2.11.0+cu128 / torchcodec 0.11.1+cu128 same-day. No torchcodec regression observed.
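The non-monotonic batch scaling noted in the first observation can be flagged mechanically when reviewing a sweep. The sketch below uses a hypothetical helper, `batch_regressions`, applied to the image-benchmark ms/img values from this run; it marks any step to a larger batch size that makes per-item latency worse.

```python
def batch_regressions(ms_per_item: dict[int, float]) -> list[tuple[int, int]]:
    """Return (smaller_batch, larger_batch) pairs where the larger batch is slower per item."""
    batches = sorted(ms_per_item)
    return [
        (small, large)
        for small, large in zip(batches, batches[1:])
        if ms_per_item[large] > ms_per_item[small]
    ]


# ms/img by batch size, copied from the image benchmark tables
image_bench = {
    "img2pose": {1: 155.8, 4: 77.2, 16: 67.0},
    "retinaface": {1: 116.6, 4: 38.5, 16: 35.3},
    "MPDetector retinaface": {1: 46.0, 4: 19.4, 16: 94.0},
}

flagged = {name: batch_regressions(results) for name, results in image_bench.items()}
for name, pairs in flagged.items():
    print(name, pairs)
```

Only MPDetector retinaface is flagged, for the step from b=4 to b=16. The same check could be run over a wider sweep such as the suggested 1, 16, 32, 64 follow-up.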