Chaplin: Local visual speech recognition (VSR) in real-time
Published: Feb 3, 2025
Chaplin
A visual speech recognition (VSR) tool that reads your lips in real-time and types whatever you silently mouth. Runs fully locally.
It relies on a model trained on the Lip Reading Sentences 3 (LRS3) dataset as part of the Auto-AVSR project.
Watch a demo of Chaplin here.
Setup
- Clone the repository and `cd` into it:

  ```
  git clone https://github.com/amanvirparhar/chaplin
  cd chaplin
  ```
- Download the required model components: `LRS3_V_WER19.1` and `lm_en_subword`.
- Unzip both folders and place them in their respective directories:

  ```
  chaplin/
  ├── benchmarks/
  │   └── LRS3/
  │       ├── language_models/
  │       │   └── lm_en_subword/
  │       └── models/
  │           └── LRS3_V_WER19.1/
  └── ...
  ```
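The unzip-and-place step above can be sketched as a few shell commands. The archive filenames below are assumptions (the downloads may be named differently); adjust the paths to match what you actually received:

```shell
# Create the expected directory layout, then unpack each archive into place.
# Archive names are illustrative, not guaranteed to match the downloads.
mkdir -p chaplin/benchmarks/LRS3/language_models chaplin/benchmarks/LRS3/models
if [ -f lm_en_subword.zip ]; then
  unzip -q lm_en_subword.zip -d chaplin/benchmarks/LRS3/language_models/
fi
if [ -f LRS3_V_WER19.1.zip ]; then
  unzip -q LRS3_V_WER19.1.zip -d chaplin/benchmarks/LRS3/models/
fi
```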
- Install and run `ollama`, and pull the `llama3.2` model.
- Install `uv`.
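For the last two steps, a sketch of the corresponding commands (guarded so each runs only when applicable; `ollama` itself must already be installed, and the `curl` line is the official `uv` installer for macOS/Linux):

```shell
# Pull the model Chaplin expects, if ollama is available on PATH.
if command -v ollama >/dev/null 2>&1; then
  ollama pull llama3.2
fi
# Install uv if it is not already present (official installer script).
if ! command -v uv >/dev/null 2>&1; then
  curl -LsSf https://astral.sh/uv/install.sh | sh
fi
```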
Usage
- Run the following command:

  ```
  sudo uv run --with-requirements requirements.txt --python 3.12 main.py config_filename=./configs/LRS3_V_WER19.1.ini detector=mediapipe
  ```
- Once the camera feed is displayed, you can start "recording" by pressing the `option` key (Mac) or the `alt` key (Windows/Linux), and start mouthing words.
- To stop recording, press the `option` key (Mac) or the `alt` key (Windows/Linux) again. You should see some text being typed out wherever your cursor is.
- To exit gracefully, focus on the window displaying the camera feed and press `q`.
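The `option`/`alt` key acts as a start/stop toggle around a buffer of camera frames: the first press begins buffering, the second hands the buffered frames to the recognizer. A minimal sketch of that toggle logic (hypothetical class and method names, not Chaplin's actual implementation):

```python
class RecordingToggle:
    """Tracks whether camera frames should be buffered for VSR inference."""

    def __init__(self):
        self.recording = False
        self.frames = []

    def on_hotkey(self):
        """Called on each option/alt press.

        Returns the buffered frames when a recording ends (so they can be
        passed to the model), or None when a recording starts.
        """
        self.recording = not self.recording
        if self.recording:
            self.frames = []  # start a fresh recording
            return None
        done, self.frames = self.frames, []
        return done  # hand the completed recording to the VSR model

    def on_frame(self, frame):
        """Called once per camera frame; buffers it only while recording."""
        if self.recording:
            self.frames.append(frame)
```

When `on_hotkey` returns a non-empty list, the frames would be run through the lip-reading model and the resulting text typed at the cursor.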