
2. Detecting facial expressions from videos

In this tutorial we'll use Detectorv2, Py-Feat's single multi-task model, to process a video file: first one frame at a time, then in batches to speed things up on a GPU.

2.1 Setting up the detector

We create a Detectorv2 instance just like in the previous tutorial. One network predicts Action Units, emotions, valence/arousal, gaze, head pose, a 478-point 3D FaceMesh, and blendshapes in a single forward pass.

from feat import Detectorv2

detector = Detectorv2(device=device)  # device selected above (cuda/mps/cpu)
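The `device` variable passed above is chosen earlier in the notebook and is not shown in this section. As a minimal sketch, assuming PyTorch is the backend and that a plain string such as "cuda", "mps", or "cpu" is an acceptable value, it could be picked like this (the helper name `pick_device` is hypothetical, not part of Py-Feat):

```python
def pick_device() -> str:
    """Return "cuda", "mps", or "cpu" depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # Without PyTorch installed there is no accelerator to select.
        return "cpu"
    if torch.cuda.is_available():
        return "cuda"
    # torch.backends.mps only exists in newer PyTorch builds, so guard the lookup.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


device = pick_device()
print(device)
```

Preferring CUDA, then MPS, then CPU mirrors the order listed in the comment above; adjust it if your setup needs a different priority.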

2.2 Processing a video

Detecting facial expressions in a video uses the same .detect() method, this time with data_type="video". The sample video bundled with Py-Feat is by Wolfgang Langer from Pexels.

from feat.utils.io import get_test_data_path
import os

test_data_dir = get_test_data_path()
test_video_path = os.path.join(test_data_dir, "WolfgangLanger_Pexels.mp4")

# (The input video is processed below; an inline preview is omitted in the
# static docs. Download the notebook or open it in molab to view it.)
test_video_path

'/home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4'

We pass skip_frames=24 to process only every 24th frame for speed, and face_detection_threshold=0.95 to be conservative about what counts as a face. We know this clip is a continuous front-on shot of one person, so raising the threshold from the default 0.5 avoids spurious extra detections.
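To make the effect of the threshold concrete, here is a small standalone sketch of confidence filtering. It is not Py-Feat's internal code: the `keep_confident` helper and the candidate scores are made up for illustration, and the `>=` comparison is an assumption.

```python
def keep_confident(scores, threshold):
    """Keep only candidate detections whose confidence meets the threshold."""
    return [s for s in scores if s >= threshold]


# Hypothetical confidence scores for candidate face boxes in one frame:
# one clear face plus two weak candidates (e.g. background clutter).
candidates = [0.9999, 0.62, 0.51]

print(keep_confident(candidates, 0.5))   # all three candidates survive
print(keep_confident(candidates, 0.95))  # only the clear face survives
```

With a single, well-lit face in frame, the stricter cutoff costs nothing and removes the weak candidates.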

By default .detect() processes one frame at a time (batch_size=1):

# Without batching: one frame at a time (batch_size=1, the default).
video_prediction = detector.detect(
    test_video_path,
    data_type="video",
    skip_frames=24,
    face_detection_threshold=0.95,
)
video_prediction.head()

100%|██████████| 20/20 [00:03<00:00,  6.06it/s]

	FaceRectX	FaceRectY	FaceRectWidth	FaceRectHeight	FaceScore	x_0	x_1	x_2	x_3	x_4	...	mouthUpperUpLeft	mouthUpperUpRight	noseSneerLeft	noseSneerRight	FrameHeight	FrameWidth	input	frame	approx_time	Identity
0	249.928070	12.430298	366.052765	366.052734	0.999921	333.219360	337.151581	341.083801	346.445892	353.595367	...	0.000458	0.000572	0.000002	0.000001	360.0	640.0	/home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4	0	00:00	Person_0
1	273.298859	21.706299	353.184052	353.184082	0.999950	351.247681	354.006927	357.111084	360.905060	366.423553	...	0.014526	0.018311	0.000002	0.000002	360.0	640.0	/home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4	24	00:01	Person_1
2	264.965637	1.584579	356.801331	356.801270	0.999943	343.364380	342.667480	342.667480	344.758118	348.242493	...	0.021973	0.028442	0.000004	0.000001	360.0	640.0	/home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4	48	00:02	Person_2
3	243.580078	27.420776	355.178040	355.178040	0.999921	316.419312	316.766174	317.459900	319.887878	324.743805	...	0.072754	0.088867	0.000005	0.000003	360.0	640.0	/home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4	72	00:03	Person_3
4	271.361267	61.490356	328.899841	328.899841	0.999908	332.066406	331.745239	331.102844	333.351166	338.811432	...	0.007568	0.004456	0.000007	0.000001	360.0	640.0	/home/ljchang/Github/py-feat/feat/tests/data/WolfgangLanger_Pexels.mp4	96	00:04	Person_4

5 rows × 2183 columns

Our 20-second clip, recorded at 24 fps, yields 20 predictions because skip_frames=24 keeps only every 24th frame:

video_prediction.shape
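
The row count can be checked by hand. This sketch assumes the clip is exactly 20 s × 24 fps = 480 frames, that one face is detected per processed frame, and that sampling starts at frame 0 and steps by skip_frames (consistent with the frame column in the output above: 0, 24, 48, ...).

```python
fps = 24
duration_s = 20
skip_frames = 24

total_frames = fps * duration_s  # 480, assuming the clip is exactly 20 s long
processed = list(range(0, total_frames, skip_frames))

print(len(processed))   # 20
print(processed[:5])    # [0, 24, 48, 72, 96]
print(processed[-1])    # 456
```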

2.3 Speeding things up with batching

Passing batch_size > 1 runs several frames through the network in a single forward pass instead of one at a time. This is typically much faster on a GPU (CUDA or MPS) and is the recommended way to process video. On CUDA you can squeeze out a bit more by also passing pin_memory=True, which page-locks host memory for faster CPU→GPU transfers. The predictions are identical; only throughput changes:

# With batching: 8 frames per forward pass, which is typically much faster on a GPU.
# On CUDA, you can additionally pass pin_memory=True to speed up host->device transfers.
video_prediction_batched = detector.detect(
    test_video_path,
    data_type="video",
    batch_size=8,
    skip_frames=24,
    face_detection_threshold=0.95,
)
video_prediction_batched.shape

100%|██████████| 3/3 [00:05<00:00,  1.83s/it]
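
The progress bar now counts 3 iterations instead of 20 because it advances once per batch rather than once per frame. Assuming the 20 sampled frames are grouped in order with a smaller final batch, the split works out as follows:

```python
import math

n_frames = 20    # frames sampled with skip_frames=24
batch_size = 8

n_batches = math.ceil(n_frames / batch_size)
batch_sizes = [
    min(batch_size, n_frames - start)
    for start in range(0, n_frames, batch_size)
]

print(n_batches)    # 3
print(batch_sizes)  # [8, 8, 4]
```

Because each iteration now covers up to 8 frames, the seconds-per-iteration figure is not directly comparable with the per-frame rate from the unbatched run.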

2.4 Visualizing predictions

You can plot detection results from a video. The frames aren't extracted from the video (that would produce thousands of images), so the visualization shows the detected face geometry without the underlying image.

The clip runs at 24 fps; the actress shows sadness around 0:02 and happiness around 0:14.

# Frame 48 ~ 0:02 (sadness), Frame 408 ~ 0:14 (happiness)
_figs = video_prediction.query("frame in [48, 408]").plot_detections(
    faceboxes=False, add_titles=False
)
mo.vstack(_figs)

We can also use pandas plotting to show how emotions unfold over time; the shift from sadness to happiness is clearly visible:

_ax = video_prediction.emotions.plot()
_ax.figure