Chaplin: Local visual speech recognition (VSR) in real time

Published Feb 3, 2025

Chaplin

A visual speech recognition (VSR) tool that reads your lips in real time and types out whatever you silently mouth. It runs fully locally.

It relies on a model trained on the Lip Reading Sentences 3 (LRS3) dataset as part of the Auto-AVSR project.

Watch a demo of Chaplin here.

Setup

  1. Clone the repository and cd into it:
    git clone https://github.com/amanvirparhar/chaplin
    cd chaplin
    
  2. Download the required model components: LRS3_V_WER19.1 and lm_en_subword.
  3. Unzip both folders and place them in their respective directories (a command sketch for steps 3–5 follows this list):
    chaplin/
    ├── benchmarks/
    │   └── LRS3/
    │       ├── language_models/
    │       │   └── lm_en_subword/
    │       └── models/
    │           └── LRS3_V_WER19.1/
    └── ...
    
  4. Install and run ollama, and pull the llama3.2 model.
  5. Install uv.
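
A minimal shell sketch of steps 3–5 for macOS/Linux. The zip filenames and their location in the repo root are assumptions, so adjust the paths to match wherever you saved the archives:

    # Step 3: create the target directories and unzip the model components
    # into them (assumes each archive unpacks into a folder named after itself)
    mkdir -p benchmarks/LRS3/language_models benchmarks/LRS3/models
    unzip lm_en_subword.zip -d benchmarks/LRS3/language_models/
    unzip LRS3_V_WER19.1.zip -d benchmarks/LRS3/models/

    # Step 4: pull the llama3.2 model (ollama must already be installed
    # and running; see https://ollama.com for installers)
    ollama pull llama3.2

    # Step 5: install uv via its official installer script
    curl -LsSf https://astral.sh/uv/install.sh | sh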

Usage

  1. Run the following command:
    sudo uv run --with-requirements requirements.txt --python 3.12 main.py config_filename=./configs/LRS3_V_WER19.1.ini detector=mediapipe
    
  2. Once the camera feed is displayed, press the option key (macOS) or the alt key (Windows/Linux) to start "recording", then silently mouth your words.
  3. Press the option/alt key again to stop recording. You should see the recognized text typed out wherever your cursor is.
  4. To exit gracefully, focus on the window displaying the camera feed and press q.