Chaplin: Local visual speech recognition (VSR) in real time

Published Feb 3, 2025

Chaplin

A visual speech recognition (VSR) tool that reads your lips in real time and types out whatever you silently mouth. It runs fully locally.

It relies on a model trained on the Lip Reading Sentences 3 (LRS3) dataset as part of the Auto-AVSR project.

Watch a demo of Chaplin here.

Setup

  1. Clone the repository and cd into it:
    git clone https://github.com/amanvirparhar/chaplin
    cd chaplin
    
  2. Download the required model components: LRS3_V_WER19.1 and lm_en_subword.
  3. Unzip both folders and place them in their respective directories (a command sketch for steps 3–5 follows this list):
    chaplin/
    ├── benchmarks/
    │   └── LRS3/
    │       ├── language_models/
    │       │   └── lm_en_subword/
    │       └── models/
    │           └── LRS3_V_WER19.1/
    └── ...
    
  4. Install and run ollama, and pull the llama3.2 model.
  5. Install uv.
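
A minimal shell sketch of steps 3–5 for macOS/Linux. The zip filenames and their location in the repo root are assumptions, so adjust the paths to match wherever you saved the archives:

    # Step 3: create the target directories and unzip the model components
    # into them (assumes each archive unpacks into a folder named after itself)
    mkdir -p benchmarks/LRS3/language_models benchmarks/LRS3/models
    unzip lm_en_subword.zip -d benchmarks/LRS3/language_models/
    unzip LRS3_V_WER19.1.zip -d benchmarks/LRS3/models/

    # Step 4: pull the llama3.2 model (ollama must already be installed
    # and running; see https://ollama.com for installers)
    ollama pull llama3.2

    # Step 5: install uv via its official installer script
    curl -LsSf https://astral.sh/uv/install.sh | sh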

Usage

  1. Run the following command:
    sudo uv run --with-requirements requirements.txt --python 3.12 main.py config_filename=./configs/LRS3_V_WER19.1.ini detector=mediapipe
    
  2. Once the camera feed is displayed, press the option key (macOS) or the alt key (Windows/Linux) to start "recording", then silently mouth your words.
  3. Press the option/alt key again to stop recording. You should see the recognized text typed out wherever your cursor is.
  4. To exit gracefully, focus on the window displaying the camera feed and press q.