3D Avatar Voice Bot
A voice bot with a 3D avatar, featuring realtime lip sync and animation.
Make sure you have gone over QuickStart before trying this example.
In this example, we will build a realtime 3D chatbot.
Demo
Creating a 3D Chatbot in 3 steps
This app will process input from your microphone, send it to an LLM, and convert the response back to voice. It will also generate the lip sync and animation data needed to make the avatar move.
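At a high level, that pipeline can be sketched as a chain of stages. The stubs below are illustrative only and stand in for the real Deepgram, LLM, TTS, and Rhubarb Lip Sync services used by the example:

```python
# Minimal sketch of the processing chain; each stub stands in for a real service.
def speech_to_text(audio: bytes) -> str:
    return "hello"                          # stub: Deepgram STT

def llm_respond(text: str) -> dict:
    # stub: LLM returning a JSON-like reply with text and an animation hint
    return {"text": f"You said: {text}", "animation": "Talking"}

def text_to_speech(text: str) -> bytes:
    return text.encode()                    # stub: TTS

def lip_sync(speech: bytes) -> list:
    return [{"start": 0.0, "end": 0.5, "value": "A"}]  # stub: mouth cues

def pipeline(mic_audio: bytes) -> tuple[bytes, dict]:
    text = speech_to_text(mic_audio)        # 1. audio -> text
    reply = llm_respond(text)               # 2. text -> structured reply
    speech = text_to_speech(reply["text"])  # 3. reply text -> audio
    reply["lipsync"] = lip_sync(speech)     # 4. audio -> lip-sync cues
    return speech, reply
```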
- The full GitHub repository is available here.
This application will initially support only English as the source language.
Start by cloning the repository
git clone https://github.com/outspeed-ai/outspeed.git
cd outspeed/examples/cooking_assistant/
Install Dependencies
Install the example's Python dependencies (for instance via `pip install outspeed`, or the repository's requirements file; check the example's README for the exact command).
Run Backend
Ensure you’re in the same directory as cooking_assistant.py!
You will need the following environment variables:
- DEEPGRAM_API_KEY: You can get this by going to https://console.deepgram.com/ and clicking on the API key tab.
- GROQ_API_KEY: You can get this by going to https://console.groq.com/keys and clicking on the API key tab.
- CARTESIA_API_KEY: You can get this by going to https://play.cartesia.ai/keys and clicking on the API key tab.
All of these providers have a free tier. Once you have your keys, run the following command:
export DEEPGRAM_API_KEY=<your_deepgram_api_key>
export GROQ_API_KEY=<your_groq_api_key>
export CARTESIA_API_KEY=<your_cartesia_api_key>
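Before starting the server, you can sanity-check that all three keys are set. The helper below is not part of the example code; it is just a small illustrative check:

```python
import os

REQUIRED_KEYS = ["DEEPGRAM_API_KEY", "GROQ_API_KEY", "CARTESIA_API_KEY"]

def missing_keys(env=os.environ):
    """Return the required API keys that are not set (or are empty) in the environment."""
    return [k for k in REQUIRED_KEYS if not env.get(k)]

if missing_keys():
    # Print which variables still need to be exported before running the server.
    print("Missing environment variables:", ", ".join(missing_keys()))
```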
Finally, run the following command to start the server:
python cooking_assistant.py
The console will output the URL you can use to connect to the server (default is http://localhost:8080).
Run Demo
We have already created a simple frontend using our React SDK. Open the following page in your browser:
https://playground.outspeed.com/webrtc
- Paste the link you received from the previous step into the URL field.
- Select an audio device. Leave the video device blank. Click Run to begin.
Understanding the Process
Review the code in cooking_assistant.py.
Setup
The `setup` method initializes the services and configurations the chatbot needs:
- DeepgramSTT: This service converts spoken audio into text using Deepgram’s speech-to-text API. It is configured with a sample rate of 8000 Hz to match the expected audio input quality.
- FireworksLLM: Utilizes a large language model from Fireworks to generate responses based on the audio input converted to text. The model is set with a specific system prompt that defines the behavior and output format (JSON) of the chatbot, including facial expressions and animations.
- LipSync: This service synchronizes the generated audio with appropriate mouth movements on the 3D avatar using the Rhubarb Lip Sync tool.
- ElevenLabsTTS: Converts the generated text responses into spoken audio using Eleven Labs’ text-to-speech API. This service is optimized for streaming latency and uses a specific voice model.
- AudioConverter: Ensures that the audio output is in the correct format for playback, converting the audio stream as necessary.
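For illustration, the JSON the LLM is prompted to produce might be unpacked like this. The field names below are hypothetical; the actual schema is whatever the system prompt in the code specifies:

```python
import json

# Hypothetical LLM output following the JSON format set in the system prompt.
llm_output = '{"text": "Hi there!", "facialExpression": "smile", "animation": "Talking"}'

response = json.loads(llm_output)
text_for_tts = response["text"]            # forwarded to text-to-speech
expression = response["facialExpression"]  # drives the avatar's face
animation = response["animation"]          # selects a body animation clip
```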
Streaming Endpoint
The `run` method acts as the streaming endpoint for audio input:
- Audio to Text: The audio input stream is first converted to text using the `DeepgramSTT` service.
- Text to Response: The text is then processed by the `FireworksLLM` to generate a JSON-formatted response containing the text, facial expressions, and animations.
- Processing JSON: The JSON response is unpacked, and the text component is extracted and sent to the `ElevenLabsTTS` service to be converted into audio.
- Generating Lip Sync Data: Simultaneously, the text-to-speech audio is used to generate lip-sync data to match the avatar's mouth movements to the spoken audio.
- Combining Outputs: The audio stream and JSON data (now including lip-sync information) are combined and synchronized.
- Audio Conversion: Finally, the audio stream is converted to the appropriate format for output through the `AudioConverter`.
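Rhubarb Lip Sync, which the `LipSync` service wraps, emits timed mouth shapes. The sketch below shows how such cues could be read; the `mouthCues` layout follows Rhubarb's JSON output format, but the timing values are invented for illustration:

```python
import json

# Example of Rhubarb-style lip-sync output (values invented for illustration).
rhubarb_json = '''
{
  "mouthCues": [
    {"start": 0.00, "end": 0.25, "value": "X"},
    {"start": 0.25, "end": 0.60, "value": "B"},
    {"start": 0.60, "end": 0.90, "value": "A"}
  ]
}
'''

cues = json.loads(rhubarb_json)["mouthCues"]

def mouth_shape_at(t: float, cues: list) -> str:
    """Return the mouth shape active at time t (seconds), or 'X' (rest) if none."""
    for cue in cues:
        if cue["start"] <= t < cue["end"]:
            return cue["value"]
    return "X"
```

A renderer would call `mouth_shape_at` each frame with the audio playback time to pick the avatar's current viseme.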
Support
For any assistance or questions, feel free to join our Discord community. We’re excited to see what you build!