2. Detecting facial expressions from videos
In this tutorial we'll use Detectorv2 — Py-Feat's single multi-task model — to process a video file, first one frame at a time and then in batches to speed things up on a GPU.
2.1 Setting up the detector
We create a Detectorv2 instance just like in the previous tutorial. One network predicts Action Units, emotions, valence/arousal, gaze, head pose, a 478-point 3D FaceMesh, and blendshapes in a single forward pass.
from feat import Detectorv2
detector = Detectorv2(device=device) # device selected above (cuda/mps/cpu)
2.2 Processing a video
Detecting facial expressions in a video uses the same .detect() method with data_type="video". This sample video included in Py-Feat is by Wolfgang Langer from Pexels.
from feat.utils.io import get_test_data_path
import os
test_data_dir = get_test_data_path()
test_video_path = os.path.join(test_data_dir, "WolfgangLanger_Pexels.mp4")
# (The input video is processed below; an inline preview is omitted in the
# static docs. Download the notebook or open it in molab to view it.)
test_video_path
'/home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4'
We pass skip_frames=24 to process only every 24th frame for speed, and face_detection_threshold=0.95 to be conservative about what counts as a face — we know this clip is a continuous front-on shot of one person, so raising it from the default 0.5 avoids spurious extra detections.
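To make the effect of the threshold concrete, here is a minimal sketch of confidence filtering. The candidate labels and scores are invented for illustration, `keep_confident` is a hypothetical helper, and the `>=` comparison is an assumption rather than a description of Py-Feat's internals.

```python
# Hypothetical (label, confidence) candidates for a single frame.
# These values are made up for illustration; they are not Py-Feat output.
candidates = [
    ("actual face", 0.9999),
    ("patterned cushion", 0.62),
    ("poster in background", 0.55),
    ("shadow", 0.31),
]


def keep_confident(detections, threshold):
    """Return the labels whose confidence meets the threshold (assumed >=)."""
    return [label for label, score in detections if score >= threshold]


default_kept = keep_confident(candidates, threshold=0.5)
strict_kept = keep_confident(candidates, threshold=0.95)

print(default_kept)  # three candidates survive the 0.5 default
print(strict_kept)   # only the high-confidence face survives 0.95
```

With a single, clearly visible face, the stricter cutoff costs nothing: the real face scores far above 0.95 in the output below, while lower-scoring candidates are discarded.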
By default .detect() processes one frame at a time (batch_size=1):
# Without batching: one frame at a time (batch_size=1, the default).
video_prediction = detector.detect(
    test_video_path,
    data_type="video",
    skip_frames=24,
    face_detection_threshold=0.95,
)
video_prediction.head()
100%|██████████| 20/20 [00:03<00:00, 6.06it/s]
| | FaceRectX | FaceRectY | FaceRectWidth | FaceRectHeight | FaceScore | x_0 | x_1 | x_2 | x_3 | x_4 | ... | mouthUpperUpLeft | mouthUpperUpRight | noseSneerLeft | noseSneerRight | FrameHeight | FrameWidth | input | frame | approx_time | Identity |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 249.928070 | 12.430298 | 366.052765 | 366.052734 | 0.999921 | 333.219360 | 337.151581 | 341.083801 | 346.445892 | 353.595367 | ... | 0.000458 | 0.000572 | 0.000002 | 0.000001 | 360.0 | 640.0 | /home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4 | 0 | 00:00 | Person_0 |
| 1 | 273.298859 | 21.706299 | 353.184052 | 353.184082 | 0.999950 | 351.247681 | 354.006927 | 357.111084 | 360.905060 | 366.423553 | ... | 0.014526 | 0.018311 | 0.000002 | 0.000002 | 360.0 | 640.0 | /home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4 | 24 | 00:01 | Person_1 |
| 2 | 264.965637 | 1.584579 | 356.801331 | 356.801270 | 0.999943 | 343.364380 | 342.667480 | 342.667480 | 344.758118 | 348.242493 | ... | 0.021973 | 0.028442 | 0.000004 | 0.000001 | 360.0 | 640.0 | /home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4 | 48 | 00:02 | Person_2 |
| 3 | 243.580078 | 27.420776 | 355.178040 | 355.178040 | 0.999921 | 316.419312 | 316.766174 | 317.459900 | 319.887878 | 324.743805 | ... | 0.072754 | 0.088867 | 0.000005 | 0.000003 | 360.0 | 640.0 | /home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4 | 72 | 00:03 | Person_3 |
| 4 | 271.361267 | 61.490356 | 328.899841 | 328.899841 | 0.999908 | 332.066406 | 331.745239 | 331.102844 | 333.351166 | 338.811432 | ... | 0.007568 | 0.004456 | 0.000007 | 0.000001 | 360.0 | 640.0 | /home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4 | 96 | 00:04 | Person_4 |
5 rows × 2183 columns
Our 20-second clip, recorded at 24 fps, yields 20 predictions because skip_frames=24 keeps one frame per second.
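The arithmetic behind that count can be sketched in a few lines. This is a model of the sampling, not Py-Feat's implementation: the total of 480 frames is an assumption (20 s × 24 fps), and `sampled_frames` and `approx_time` are hypothetical helpers.

```python
FPS = 24            # frame rate stated in the tutorial
TOTAL_FRAMES = 480  # assumption: 20 s x 24 fps; the real file may differ slightly
SKIP_FRAMES = 24


def sampled_frames(total_frames, skip_frames):
    """Indices of the frames kept when taking every `skip_frames`-th frame."""
    return list(range(0, total_frames, skip_frames))


def approx_time(frame, fps):
    """Format a frame index as an approximate mm:ss timestamp."""
    seconds = round(frame / fps)
    return f"{seconds // 60:02d}:{seconds % 60:02d}"


frames = sampled_frames(TOTAL_FRAMES, SKIP_FRAMES)

print(len(frames))                                 # 20
print(frames[:5])                                  # [0, 24, 48, 72, 96]
print([approx_time(f, FPS) for f in frames[:5]])   # ['00:00', '00:01', '00:02', '00:03', '00:04']
```

The first five indices and timestamps line up with the frame and approx_time columns in the table above.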
2.3 Speeding things up with batching
Passing batch_size > 1 runs several frames through the network in a single forward pass instead of one at a time. On a GPU (CUDA or MPS) this is usually much faster and is the recommended way to process video, although on a clip this short one-off startup cost may dominate the timing. On CUDA you can squeeze out a bit more by also passing pin_memory=True, which page-locks host memory for faster CPU→GPU transfers. The predictions are the same either way; only throughput changes:
# With batching: 8 frames per forward pass, typically much faster on a GPU.
# On CUDA, adding pin_memory=True can further speed up host->device
# transfers (it is not passed in this call).
video_prediction_batched = detector.detect(
    test_video_path,
    data_type="video",
    batch_size=8,
    skip_frames=24,
    face_detection_threshold=0.95,
)
video_prediction_batched.shape
100%|██████████| 3/3 [00:05<00:00, 1.83s/it]
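The progress bar above counts 3 steps rather than 20, which is consistent with it counting batches instead of frames. A minimal sketch of that grouping follows; `make_batches` is a hypothetical helper, the 480-frame total is an assumption, and Py-Feat's own data loading may be organized differently.

```python
import math


def make_batches(items, batch_size):
    """Split a list into consecutive chunks of at most `batch_size` items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


# The 20 sampled frame indices (assumes a 480-frame clip and skip_frames=24).
frames = list(range(0, 480, 24))
batches = make_batches(frames, batch_size=8)

print(len(batches))                # 3
print([len(b) for b in batches])   # [8, 8, 4]

# The number of batches is the ceiling of frames / batch_size.
assert len(batches) == math.ceil(len(frames) / 8)
```

The last batch is smaller than the others because 20 is not a multiple of 8; every sampled frame still appears exactly once.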
2.4 Visualizing predictions
You can plot detection results from a video. The frames aren't extracted from the video (that would produce thousands of images), so the visualization shows the detected face geometry without the underlying image.
The clip runs at 24 fps, so frame 48 falls at about 0:02 and frame 408 at about 0:17. The actress shows sadness in the first and happiness in the second.
# Frame 48 ~ 0:02 (sadness), Frame 408 ~ 0:17 (happiness)
_figs = video_prediction.query("frame in [48, 408]").plot_detections(
    faceboxes=False, add_titles=False
)
mo.vstack(_figs)
We can also use pandas plotting to show how emotions unfold over time — the shift from sadness to happiness is clearly visible: