py-feat detector benchmark - 2026-05-14 23:15:32
Run metadata
Date: 2026-05-14 23:15:32
py-feat version: 0.7.0
Git commit: c716340
Host: liquidswords2 (x86_64, 128 CPUs)
Python: 3.12.13
PyTorch: 2.11.0+cu128
GPU: CUDA 12.8, NVIDIA RTX PRO 6000 Blackwell Workstation Edition
OMP_NUM_THREADS: 1
Devices swept: ['cuda']
Batch sizes: [1, 4, 16]
DataLoader workers: [0]
Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
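
The warmup-then-time pattern described above can be sketched as follows. This is a minimal illustration, not the benchmark script itself, which is not shown here: `time_after_warmup` and its `sync` parameter are hypothetical names. Because CUDA kernel launches are asynchronous, a GPU measurement would typically pass `torch.cuda.synchronize` as `sync`; whether the actual benchmark does so is an assumption.

```python
import time
from typing import Any, Callable, Tuple


def time_after_warmup(
    fn: Callable[[], Any],
    sync: Callable[[], None] = lambda: None,
) -> Tuple[Any, float]:
    """Run fn once untimed, then once timed; return (result, wall seconds)."""
    fn()    # untimed warmup: absorbs one-off costs such as lazy initialization
    sync()  # ensure warmup work has finished before starting the clock
    start = time.perf_counter()
    result = fn()
    sync()  # wait for any asynchronous work before stopping the clock
    elapsed = time.perf_counter() - start
    return result, elapsed


if __name__ == "__main__":
    # Stand-in workload; a real run would call the detector here.
    _, sec = time_after_warmup(lambda: sum(i * i for i in range(100_000)))
    print(f"timed call: {sec:.4f} s")
```

A single warmup call keeps first-call overhead out of the reported wall time, at the cost of running every configuration twice.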
Video: short (72 frames)
img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 8.72 | 121.2 | 8.3 |
| cuda | 4 | 3.83 | 53.2 | 18.8 |
| cuda | 16 | 2.21 | 30.7 | 32.6 |

retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 7.67 | 106.6 | 9.4 |
| cuda | 4 | 2.43 | 33.7 | 29.6 |
| cuda | 16 | 0.54 | 7.5 | 134.0 |

MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 2.76 | 38.4 | 26.0 |
| cuda | 4 | 0.78 | 10.9 | 91.7 |
| cuda | 16 | 0.65 | 9.1 | 110.2 |

Video: long (472 frames)
img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 54.89 | 116.3 | 8.6 |
| cuda | 4 | 26.19 | 55.5 | 18.0 |
| cuda | 16 | 16.30 | 34.5 | 29.0 |

retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 39.76 | 84.2 | 11.9 |
| cuda | 4 | 12.27 | 26.0 | 38.5 |
| cuda | 16 | 5.42 | 11.5 | 87.1 |

MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 19.32 | 40.9 | 24.4 |
| cuda | 4 | 5.66 | 12.0 | 83.4 |
| cuda | 16 | 2.71 | 5.7 | 174.4 |

Images: 16 x multi_face.jpg = 80 faces
img2pose

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cuda | 1 | 2.49 | 155.8 | 80 |
| cuda | 4 | 1.23 | 77.2 | 80 |
| cuda | 16 | 1.07 | 67.0 | 80 |

retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cuda | 1 | 1.86 | 116.6 | 80 |
| cuda | 4 | 0.62 | 38.5 | 80 |
| cuda | 16 | 0.56 | 35.3 | 80 |

MPDetector retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cuda | 1 | 0.74 | 46.0 | 80 |
| cuda | 4 | 0.31 | 19.4 | 80 |
| cuda | 16 | 1.50 | 94.0 | 80 |

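
The derived columns in the tables above follow from the `sec` column and the item count (72 or 472 frames, or 16 images). A minimal sketch of that arithmetic, with hypothetical helper names `per_item_ms` and `throughput_fps`:

```python
def per_item_ms(total_sec: float, n_items: int) -> float:
    """Average wall time per frame or image, in milliseconds."""
    return total_sec / n_items * 1000.0


def throughput_fps(total_sec: float, n_frames: int) -> float:
    """Frames processed per second of wall time."""
    return n_frames / total_sec


# img2pose, short video (72 frames), batch 16: 2.21 s wall time
ms = per_item_ms(2.21, 72)
fps = throughput_fps(2.21, 72)
print(f"{ms:.1f} ms/frame, {fps:.1f} fps")  # 30.7 ms/frame, 32.6 fps
```

Recomputing from the rounded `sec` values can differ from the tabulated figures in the last digit (for example, 8.72 s over 72 frames gives 121.1 ms/frame against the reported 121.2), presumably because the tables were generated from unrounded timings.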
Notes (hand-curated)
Scope: first run on the new RTX PRO 6000 Blackwell Workstation Edition (96 GB) on liquidswords2, replacing the prior RTX 3090. Companion to 2026-05-03-864962c.md, which used the same host, py-feat 0.7.0, and OMP_NUM_THREADS=1, but an RTX 3090 with torch 2.5.1+cu124. Two things changed at once: the GPU (3090 → Blackwell) and torch (2.5.1+cu124 → 2.11.0+cu128, required for sm_120 kernels). This run cannot separate the two effects, but the GPU is likely the dominant factor for the configurations swept here.
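Headline: at best batch size, Blackwell is roughly 1.3–1.75× faster than the 3090 on every configuration the two runs share. Peak throughput is 174 fps, on MPDetector long video (no 3090 baseline exists for long video).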
Best-batch comparison vs 2026-05-03-864962c.md (3090 file did not include long-video sweep):

| Config | 3090 best | Blackwell best | Speedup |
|---|---|---|---|
| img2pose, short video | 47.9 ms/frame (b=16, 20.9 fps) | 30.7 ms/frame (b=16, 32.6 fps) | 1.56× |
| retinaface, short video | 10.6 ms/frame (b=16, 94.5 fps) | 7.5 ms/frame (b=16, 134 fps) | 1.42× |
| MPDetector, short video | 11.5 ms/frame (b=16, 86.7 fps) | 9.1 ms/frame (b=16, 110 fps) | 1.27× |
| img2pose, image | 87.1 ms/img (b=4) | 67.0 ms/img (b=16) | 1.30× |
| retinaface, image | 44.4 ms/img (b=4) | 35.3 ms/img (b=16) | 1.26× |
| MPDetector, image | 33.9 ms/img (b=4) | 19.4 ms/img (b=4) | 1.75× |

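
The speedup column is the ratio of the 3090's best per-item time to Blackwell's. As a rough check, it can be recomputed from the rounded ms values in the comparison table; the `speedup` helper is a hypothetical name used only for this sketch.

```python
# (3090 best ms, Blackwell best ms), copied from the comparison table above
best_ms = {
    "img2pose, short video": (47.9, 30.7),
    "retinaface, short video": (10.6, 7.5),
    "MPDetector, short video": (11.5, 9.1),
    "img2pose, image": (87.1, 67.0),
    "retinaface, image": (44.4, 35.3),
    "MPDetector, image": (33.9, 19.4),
}


def speedup(old_ms: float, new_ms: float) -> float:
    """How many times faster the new per-item time is than the old one."""
    return old_ms / new_ms


for config, (old, new) in best_ms.items():
    print(f"{config}: {speedup(old, new):.2f}x")
```

Two rows come out 0.01 lower than tabulated when computed this way (retinaface and MPDetector on short video). The tabulated 1.42× and 1.27× agree with the ratio of the fps figures instead, so the difference looks like rounding in the ms column rather than an error.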
Long video (472 frames) — new ground:

| Config | Best Blackwell |
|---|---|
| img2pose b=16 | 34.5 ms/frame, 29.0 fps |
| retinaface b=16 | 11.5 ms/frame, 87.1 fps |
| MPDetector retinaface b=16 | 5.7 ms/frame, 174.4 fps (~7× realtime at 24 fps) |

Other observations:
MPDetector image bench regresses at b=16 (94.0 ms/img vs 19.4 ms/img at b=4). The 3090 file showed the same non-monotonic pattern, which suggests a CPU-side bottleneck (likely resmasknet or DataLoader collation) rather than a GPU limit. The video paths don’t show the regression.
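
A non-monotonic sweep like this one can be flagged automatically by comparing per-item time across increasing batch sizes. The sketch below uses the ms/img values from the image tables in this report; `find_regressions` is a hypothetical helper, not part of py-feat.

```python
from typing import Dict, List, Tuple


def find_regressions(ms_by_batch: Dict[int, float]) -> List[Tuple[int, int, float]]:
    """Return (smaller_batch, larger_batch, slowdown) wherever per-item time rises."""
    batches = sorted(ms_by_batch)
    regressions = []
    for small, large in zip(batches, batches[1:]):
        if ms_by_batch[large] > ms_by_batch[small]:
            slowdown = ms_by_batch[large] / ms_by_batch[small]
            regressions.append((small, large, slowdown))
    return regressions


# ms/img by batch size, from the image benchmark tables
mpdetector_image = {1: 46.0, 4: 19.4, 16: 94.0}
retinaface_image = {1: 116.6, 4: 38.5, 16: 35.3}

for small, large, slowdown in find_regressions(mpdetector_image):
    print(f"b={small} -> b={large}: {slowdown:.1f}x slower per image")
print(find_regressions(retinaface_image))  # [] (monotonic)
```

On these numbers, MPDetector is about 4.8× slower per image at b=16 than at b=4, while retinaface alone improves at every step.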
Retinaface ≫ img2pose on this hardware: at b=16 on short video, retinaface is ~4× faster than img2pose (134 vs 32.6 fps). MPDetector is competitive with retinaface alone on short video (110 vs 134 fps) and about 2× faster on long video (174 vs 87 fps), benefiting from per-frame mesh + blendshape efficiency.
VRAM headroom unused. Sweep capped at b=16; with 96 GB available this is well below saturation for the video paths. A `--batches 1 16 32 64` follow-up sweep would identify Blackwell’s actual ceiling; expect further gains on retinaface and img2pose long-video paths.
Driver stack: NVIDIA driver 580.159.04 (open modules) with the `nvidia-driver-580-open` package; required for sm_120. Both the py-feat `.venv` and the face_jepa `.venv` were upgraded to torch 2.11.0+cu128 / torchcodec 0.11.1+cu128 same-day. No torchcodec regression observed.
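
Since sm_120 support is the reason for the torch upgrade, a quick environment check can confirm that an installed torch build targets the GPU's architecture. The sketch below is one way to do that, assuming torch is installed; it returns `None` when torch or a CUDA device is unavailable. `sm_tag` and `cuda_summary` are hypothetical helper names; the `torch.cuda` calls are standard PyTorch APIs.

```python
from typing import Optional, Tuple


def sm_tag(capability: Tuple[int, int]) -> str:
    """Format a (major, minor) compute capability as an 'sm_XY' tag."""
    major, minor = capability
    return f"sm_{major}{minor}"


def cuda_summary() -> Optional[dict]:
    """Describe torch's view of GPU 0, or None without torch or CUDA."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    tag = sm_tag(torch.cuda.get_device_capability(0))
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,
        "device": torch.cuda.get_device_name(0),
        "arch": tag,
        # True if this torch build ships kernels compiled for the GPU's arch
        "arch_in_build": tag in torch.cuda.get_arch_list(),
    }


print(sm_tag((12, 0)))  # sm_120 (Blackwell workstation GPUs)
print(sm_tag((8, 6)))   # sm_86 (RTX 3090)
print(cuda_summary())
```

Checking `arch_in_build` before benchmarking would catch a torch wheel that lacks kernels for the installed GPU.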