# py-feat detector benchmark — 2026-05-14 23:15:32

## Run metadata

- **Date:** 2026-05-14 23:15:32
- **py-feat version:** 0.7.0
- **Git commit:** c716340
- **Host:** liquidswords2 (x86_64, 128 CPUs)
- **Python:** 3.12.13
- **PyTorch:** 2.11.0+cu128
- **GPU:** CUDA 12.8, NVIDIA RTX PRO 6000 Blackwell Workstation Edition
- **OMP_NUM_THREADS:** `1`
- **Devices swept:** ['cuda']
- **Batch sizes:** [1, 4, 16]
- **DataLoader workers:** [0]

Each timed call is preceded by one untimed warmup; the timed-call wall time is reported.
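
A minimal sketch of the warmup-then-time protocol described above, together with the arithmetic behind the `ms/frame` and `fps` columns. The names `time_call` and `per_item_metrics` are hypothetical and are not taken from the benchmark script; the real harness may differ, for example by synchronizing the CUDA stream before reading the clock.

```python
import time


def per_item_metrics(sec, n_items):
    """Derive the per-item table columns from a wall time in seconds."""
    return {
        "sec": sec,
        "ms_per_item": 1000.0 * sec / n_items,
        "items_per_sec": n_items / sec,
    }


def time_call(fn, n_items, warmup=1):
    """Call `fn` untimed `warmup` times, then time one further call."""
    for _ in range(warmup):
        fn()  # untimed: absorbs lazy initialization and cache effects
    start = time.perf_counter()
    fn()
    return per_item_metrics(time.perf_counter() - start, n_items)


# Recompute one row from its rounded wall time
# (long video, MPDetector retinaface, b=16: 2.71 s over 472 frames).
row = per_item_metrics(2.71, 472)
print(f"{row['ms_per_item']:.1f} ms/frame, {row['items_per_sec']:.1f} fps")
```

Recomputing from the rounded `sec` column gives 5.7 ms/frame and 174.2 fps, against 174.4 fps in the table. The table values appear to be derived from the unrounded wall time, so last-digit differences of this size are expected.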

## Video: short (72 frames)

### img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 8.72 | 121.2 | 8.3 |
| cuda | 4 | 3.83 | 53.2 | 18.8 |
| cuda | 16 | 2.21 | 30.7 | 32.6 |


### retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 7.67 | 106.6 | 9.4 |
| cuda | 4 | 2.43 | 33.7 | 29.6 |
| cuda | 16 | 0.54 | 7.5 | 134.0 |


### MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 2.76 | 38.4 | 26.0 |
| cuda | 4 | 0.78 | 10.9 | 91.7 |
| cuda | 16 | 0.65 | 9.1 | 110.2 |


## Video: long (472 frames)

### img2pose

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 54.89 | 116.3 | 8.6 |
| cuda | 4 | 26.19 | 55.5 | 18.0 |
| cuda | 16 | 16.30 | 34.5 | 29.0 |


### retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 39.76 | 84.2 | 11.9 |
| cuda | 4 | 12.27 | 26.0 | 38.5 |
| cuda | 16 | 5.42 | 11.5 | 87.1 |


### MPDetector retinaface

| device | batch | sec | ms/frame | fps |
|---|---|---|---|---|
| cuda | 1 | 19.32 | 40.9 | 24.4 |
| cuda | 4 | 5.66 | 12.0 | 83.4 |
| cuda | 16 | 2.71 | 5.7 | 174.4 |


## Images: 16 x multi_face.jpg = 80 faces

### img2pose

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cuda | 1 | 2.49 | 155.8 | 80 |
| cuda | 4 | 1.23 | 77.2 | 80 |
| cuda | 16 | 1.07 | 67.0 | 80 |


### retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cuda | 1 | 1.86 | 116.6 | 80 |
| cuda | 4 | 0.62 | 38.5 | 80 |
| cuda | 16 | 0.56 | 35.3 | 80 |


### MPDetector retinaface

| device | batch | sec | ms/img | rows |
|---|---|---|---|---|
| cuda | 1 | 0.74 | 46.0 | 80 |
| cuda | 4 | 0.31 | 19.4 | 80 |
| cuda | 16 | 1.50 | 94.0 | 80 |

---

## Notes (hand-curated)

**Scope:** first run on the new RTX PRO 6000 Blackwell Workstation Edition (96 GB) on `liquidswords2`, replacing the prior RTX 3090. Companion to `2026-05-03-864962c.md` (same host, same py-feat 0.7.0, same `OMP_NUM_THREADS=1`, RTX 3090 + torch 2.5.1+cu124). Two things changed at once: GPU (3090 → Blackwell) and torch (2.5.1+cu124 → 2.11.0+cu128, required for sm_120 kernels). This run cannot separate the two effects, though the GPU is likely the dominant factor for the configurations swept here.

**Headline: Blackwell is 1.26–1.75× the 3090 at best batch in every config that has a 3090 baseline; peak throughput is MPDetector long video at 174 fps.**

Best-batch comparison vs `2026-05-03-864962c.md` (3090 file did not include long-video sweep):

| Config | 3090 best | Blackwell best | Speedup |
|---|---|---|---|
| img2pose, short video | 47.9 ms/frame (b=16, 20.9 fps) | 30.7 ms/frame (b=16, 32.6 fps) | 1.56× |
| retinaface, short video | 10.6 ms/frame (b=16, 94.5 fps) | 7.5 ms/frame (b=16, 134 fps) | 1.42× |
| MPDetector, short video | 11.5 ms/frame (b=16, 86.7 fps) | 9.1 ms/frame (b=16, 110 fps) | 1.27× |
| img2pose, image | 87.1 ms/img (b=4) | 67.0 ms/img (b=16) | 1.30× |
| retinaface, image | 44.4 ms/img (b=4) | 35.3 ms/img (b=16) | 1.26× |
| MPDetector, image | 33.9 ms/img (b=4) | 19.4 ms/img (b=4) | 1.75× |
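
The speedup column is the ratio of best-batch latencies, 3090 over Blackwell. The sketch below recomputes it from the rounded per-item figures in the table; the dictionary layout and the `speedup` helper are illustrative and not part of the benchmark tooling.

```python
# (3090 best, Blackwell best, published speedup); latencies in ms per frame or image
best = {
    "img2pose, short video": (47.9, 30.7, 1.56),
    "retinaface, short video": (10.6, 7.5, 1.42),
    "MPDetector, short video": (11.5, 9.1, 1.27),
    "img2pose, image": (87.1, 67.0, 1.30),
    "retinaface, image": (44.4, 35.3, 1.26),
    "MPDetector, image": (33.9, 19.4, 1.75),
}


def speedup(old_ms, new_ms):
    """Latency ratio: values above 1 mean the new GPU is faster."""
    return old_ms / new_ms


for name, (old_ms, new_ms, published) in best.items():
    print(f"{name}: {speedup(old_ms, new_ms):.2f}x (published {published:.2f}x)")
```

Because the inputs are already rounded, two rows (retinaface and MPDetector on short video) come out 0.01 lower than the published figure; the published values match the ratio of the fps figures in those rows.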

**Long video (472 frames) — new ground:**

| Config | Best Blackwell |
|---|---|
| img2pose b=16 | 34.5 ms/frame, 29.0 fps |
| retinaface b=16 | 11.5 ms/frame, 87.1 fps |
| MPDetector retinaface b=16 | **5.7 ms/frame, 174.4 fps** (~7× realtime at 24 fps) |

**Other observations:**

- **MPDetector image bench regresses at b=16** (94.0 ms/img vs 19.4 ms/img at b=4). The 3090 file showed the same non-monotonic pattern, which points to a CPU-side bottleneck (likely resmasknet or DataLoader collation) rather than a GPU limit. The video paths don't show the regression.
- **Retinaface ≫ img2pose** on this hardware: at b=16, retinaface is ~4× img2pose on short video (134 vs 32.6 fps) and ~3× on long video (87.1 vs 29.0 fps). MPDetector is competitive with retinaface alone on short video (110 vs 134 fps) and *faster* on long video (174 vs 87 fps), presumably benefiting from per-frame mesh + blendshape efficiency.
- **VRAM headroom unused.** Sweep capped at b=16; with 96 GB available this is probably well below saturation for the video paths. A `--batches 1 16 32 64` follow-up sweep would identify Blackwell's actual ceiling; further gains are plausible on the retinaface and img2pose long-video paths.
- **Driver stack:** NVIDIA driver 580.159.04 (open modules) with the `nvidia-driver-580-open` package; required for sm_120. Both py-feat `.venv` and the face_jepa `.venv` were upgraded to torch 2.11.0+cu128 / torchcodec 0.11.1+cu128 same-day. No torchcodec regression observed.
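
The b=16 regression in the first observation can be flagged mechanically when a follow-up sweep adds more batch sizes. This is a rough sketch over the image-bench numbers from this run; `find_batch_regressions` and its 10% tolerance are assumptions for illustration, not part of py-feat or the benchmark script.

```python
def find_batch_regressions(ms_by_batch, tolerance=1.10):
    """Return (smaller, larger) batch pairs where per-item latency rose by
    more than `tolerance` when moving to the next larger batch size."""
    batches = sorted(ms_by_batch)
    return [
        (prev, cur)
        for prev, cur in zip(batches, batches[1:])
        if ms_by_batch[cur] > ms_by_batch[prev] * tolerance
    ]


# ms/img from the image bench above (16 x multi_face.jpg)
image_ms = {
    "img2pose": {1: 155.8, 4: 77.2, 16: 67.0},
    "retinaface": {1: 116.6, 4: 38.5, 16: 35.3},
    "MPDetector retinaface": {1: 46.0, 4: 19.4, 16: 94.0},
}

for name, ms_by_batch in image_ms.items():
    for prev, cur in find_batch_regressions(ms_by_batch):
        print(f"{name}: b={prev} -> b={cur} got slower "
              f"({ms_by_batch[prev]} -> {ms_by_batch[cur]} ms/img)")
```

On this run's data only the MPDetector b=4 to b=16 step is reported. The check says nothing about the cause; it only marks where profiling is worth the time.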
