py-feat detector benchmark — 2026-05-14 23:15:32

Run metadata

Date: 2026-05-14 23:15:32
py-feat version: 0.7.0
Git commit: c716340
Host: liquidswords2 (x86_64, 128 CPUs)
Python: 3.12.13
PyTorch: 2.11.0+cu128
GPU: CUDA 12.8, NVIDIA RTX PRO 6000 Blackwell Workstation Edition
OMP_NUM_THREADS: 1
Devices swept: ['cuda']
Batch sizes: [1, 4, 16]
DataLoader workers: [0]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
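
The timing protocol described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark script itself: `time_call`, `derive_metrics`, `BenchResult`, and the optional `sync` hook are hypothetical names. On a GPU one would typically pass something like `torch.cuda.synchronize` as `sync` so that queued kernels finish before the clock stops; whether the real harness does this is an assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class BenchResult:
    sec: float            # wall time of the single timed call
    ms_per_item: float    # ms/frame for video, ms/img for images
    items_per_sec: float  # fps for video


def derive_metrics(sec: float, n_items: int) -> tuple[float, float]:
    """Turn a wall time into the (ms per item, items per second) table columns."""
    return sec * 1000.0 / n_items, n_items / sec


def time_call(fn: Callable[[], object], n_items: int,
              sync: Optional[Callable[[], None]] = None) -> BenchResult:
    """Run fn once untimed as warmup, then once timed (hypothetical helper)."""
    fn()  # untimed warmup: absorbs lazy initialisation and caching effects
    if sync is not None:
        sync()  # assumption: wait for pending device work before timing
    start = time.perf_counter()
    fn()
    if sync is not None:
        sync()  # make sure the timed work has actually finished
    sec = time.perf_counter() - start
    ms_per_item, items_per_sec = derive_metrics(sec, n_items)
    return BenchResult(sec, ms_per_item, items_per_sec)
```

For example, `derive_metrics(2.21, 72)` gives about 30.7 ms/frame and 32.6 fps after rounding to one decimal, matching the img2pose b=16 short-video row. Small mismatches elsewhere in the tables are expected if the reported columns were computed from unrounded wall times.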

Video: short (72 frames)

img2pose

device	batch	sec	ms/frame	fps
cuda	1	8.72	121.2	8.3
cuda	4	3.83	53.2	18.8
cuda	16	2.21	30.7	32.6

retinaface

device	batch	sec	ms/frame	fps
cuda	1	7.67	106.6	9.4
cuda	4	2.43	33.7	29.6
cuda	16	0.54	7.5	134.0

MPDetector retinaface

device	batch	sec	ms/frame	fps
cuda	1	2.76	38.4	26.0
cuda	4	0.78	10.9	91.7
cuda	16	0.65	9.1	110.2

Video: long (472 frames)

img2pose

device	batch	sec	ms/frame	fps
cuda	1	54.89	116.3	8.6
cuda	4	26.19	55.5	18.0
cuda	16	16.30	34.5	29.0

retinaface

device	batch	sec	ms/frame	fps
cuda	1	39.76	84.2	11.9
cuda	4	12.27	26.0	38.5
cuda	16	5.42	11.5	87.1

MPDetector retinaface

device	batch	sec	ms/frame	fps
cuda	1	19.32	40.9	24.4
cuda	4	5.66	12.0	83.4
cuda	16	2.71	5.7	174.4

Images: 16 x multi_face.jpg = 80 faces

img2pose

device	batch	sec	ms/img	rows
cuda	1	2.49	155.8	80
cuda	4	1.23	77.2	80
cuda	16	1.07	67.0	80

retinaface

device	batch	sec	ms/img	rows
cuda	1	1.86	116.6	80
cuda	4	0.62	38.5	80
cuda	16	0.56	35.3	80

MPDetector retinaface

device	batch	sec	ms/img	rows
cuda	1	0.74	46.0	80
cuda	4	0.31	19.4	80
cuda	16	1.50	94.0	80

Notes (hand-curated)

Scope: first run on the new RTX PRO 6000 Blackwell Workstation Edition (96 GB) on liquidswords2, replacing the prior RTX 3090. This is a companion to 2026-05-03-864962c.md (same host, same py-feat 0.7.0, same OMP_NUM_THREADS=1, but RTX 3090 + torch 2.5.1+cu124). Two things changed at once: the GPU (3090 → Blackwell) and torch (2.5.1+cu124 → 2.11.0+cu128, required for sm_120 kernels). This run cannot separate the two effects, so the speedups below describe the combined hardware and software change; the GPU is likely the dominant factor for the configurations swept here.

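Headline: at each configuration's best batch size, Blackwell delivers roughly 1.3–1.75× the 3090's throughput (1.26× to 1.75× in the table below). The highest throughput overall is MPDetector on long video at 174 fps, a configuration the 3090 run did not cover.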

Best-batch comparison vs 2026-05-03-864962c.md (3090 file did not include long-video sweep):

Config	3090 best	Blackwell best	Speedup
img2pose, short video	47.9 ms/frame (b=16, 20.9 fps)	30.7 ms/frame (b=16, 32.6 fps)	1.56×
retinaface, short video	10.6 ms/frame (b=16, 94.5 fps)	7.5 ms/frame (b=16, 134 fps)	1.42×
MPDetector, short video	11.5 ms/frame (b=16, 86.7 fps)	9.1 ms/frame (b=16, 110 fps)	1.27×
img2pose, image	87.1 ms/img (b=4)	67.0 ms/img (b=16)	1.30×
retinaface, image	44.4 ms/img (b=4)	35.3 ms/img (b=16)	1.26×
MPDetector, image	33.9 ms/img (b=4)	19.4 ms/img (b=4)	1.75×
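
The Speedup column is the ratio of the two best per-item times. As a quick check, the sketch below recomputes it from the rounded ms values printed in the table; `ROWS` and `speedup` are illustrative names, not part of the benchmark tooling. Because the inputs are already rounded, two rows come out 0.01 lower than the table (1.41 and 1.26 instead of 1.42 and 1.27), which suggests the table was computed from unrounded times or from the fps figures.

```python
# (config, 3090 best ms, Blackwell best ms, speedup as printed in the table)
ROWS = [
    ("img2pose, short video",   47.9, 30.7, 1.56),
    ("retinaface, short video", 10.6,  7.5, 1.42),
    ("MPDetector, short video", 11.5,  9.1, 1.27),
    ("img2pose, image",         87.1, 67.0, 1.30),
    ("retinaface, image",       44.4, 35.3, 1.26),
    ("MPDetector, image",       33.9, 19.4, 1.75),
]


def speedup(old_ms: float, new_ms: float) -> float:
    """Speedup of the new run over the old one: old time divided by new time."""
    return old_ms / new_ms


for name, old_ms, new_ms, reported in ROWS:
    s = speedup(old_ms, new_ms)
    print(f"{name:26s} {s:.2f}x (table: {reported:.2f}x)")
```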

Long video (472 frames) — new ground:

Config	Best Blackwell
img2pose b=16	34.5 ms/frame, 29.0 fps
retinaface b=16	11.5 ms/frame, 87.1 fps
MPDetector retinaface b=16	5.7 ms/frame, 174.4 fps (~7× realtime at 24 fps)
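
The "× realtime" figure is simply processing fps divided by the playback frame rate. A small sketch, assuming 24 fps footage as in the table; `realtime_factor` and `LONG_VIDEO_FPS` are illustrative names:

```python
# Best long-video throughput per detector, copied from the table above (b=16).
LONG_VIDEO_FPS = {
    "img2pose": 29.0,
    "retinaface": 87.1,
    "MPDetector retinaface": 174.4,
}


def realtime_factor(processing_fps: float, playback_fps: float = 24.0) -> float:
    """How many seconds of footage are processed per second of wall time."""
    return processing_fps / playback_fps


for name, fps in LONG_VIDEO_FPS.items():
    factor = realtime_factor(fps)
    # Wall time needed to process one hour of 24 fps footage at this rate.
    minutes_per_hour = 60.0 / factor
    print(f"{name:22s} {factor:.1f}x realtime, {minutes_per_hour:.1f} min per hour of video")
```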

Other observations:

MPDetector image bench regresses at b=16 (94.0 ms/img vs 19.4 ms/img at b=4). The 3090 file showed the same non-monotonic pattern, which points to a CPU-side bottleneck (likely resmasknet or DataLoader collation) rather than a GPU limit. The video paths don’t show the regression.
Retinaface ≫ img2pose on this hardware: at b=16 on short video, retinaface reaches about 4× the throughput of img2pose (134 vs 32.6 fps). MPDetector is competitive with retinaface alone on short video (110 vs 134 fps) and about twice as fast on long video (174 vs 87 fps), presumably benefiting from per-frame mesh + blendshape efficiency.
VRAM headroom unused. Sweep capped at b=16; with 96 GB available this is well below saturation for the video paths. A --batches 1 16 32 64 follow-up sweep would identify Blackwell’s actual ceiling — expect further gains on retinaface and img2pose long-video paths.
Driver stack: NVIDIA driver 580.159.04 (open modules) with the nvidia-driver-580-open package; required for sm_120. Both py-feat .venv and the face_jepa .venv were upgraded to torch 2.11.0+cu128 / torchcodec 0.11.1+cu128 same-day. No torchcodec regression observed.
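
The non-monotonic batch scaling noted in the observations above can be flagged automatically. The sketch below is one possible check, not part of the benchmark script: `batch_regressions`, `IMAGE_MS`, and the 5% tolerance are illustrative assumptions. It compares per-image time at each batch size with the next larger one, using the image-bench numbers from the tables.

```python
# ms/img by batch size, copied from the image-bench tables above.
IMAGE_MS = {
    "img2pose":              {1: 155.8, 4: 77.2, 16: 67.0},
    "retinaface":            {1: 116.6, 4: 38.5, 16: 35.3},
    "MPDetector retinaface": {1: 46.0,  4: 19.4, 16: 94.0},
}


def batch_regressions(ms_by_batch: dict[int, float],
                      tolerance: float = 1.05) -> list[tuple[int, int]]:
    """Return (smaller, larger) batch pairs where the larger batch is slower
    per item by more than the tolerance factor (an arbitrary 5% here)."""
    batches = sorted(ms_by_batch)
    flagged = []
    for small, large in zip(batches, batches[1:]):
        if ms_by_batch[large] > ms_by_batch[small] * tolerance:
            flagged.append((small, large))
    return flagged


for name, ms_by_batch in IMAGE_MS.items():
    for small, large in batch_regressions(ms_by_batch):
        print(f"{name}: b={large} is slower per image than b={small} "
              f"({ms_by_batch[large]} vs {ms_by_batch[small]} ms/img)")
```

With these inputs only the MPDetector retinaface b=4 to b=16 step is flagged. A check like this says nothing about the cause; confirming the CPU-side explanation would still need profiling.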
