Make sure you have gone over QuickStart before trying this example.

In this example, we will build a realtime 3D chatbot.

Demo

Creating a 3D Chatbot in 4 steps

This app will process input from your microphone, send it to an LLM, and convert the response back to speech. It will also generate the lip sync and animation data needed to make the avatar move.

  • The full GitHub repository is available here.

This application initially supports only English as the source language.

1

Start by cloning the repository

git clone https://github.com/outspeed-ai/outspeed.git
cd outspeed/examples/chatbot/
2

Install Dependencies
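
A typical Python setup, assuming the example ships a requirements.txt and the SDK is published on PyPI as outspeed, looks like this:

python -m venv .venv                 # optional: isolate the example's dependencies
source .venv/bin/activate
pip install outspeed                 # the Outspeed Python SDK
pip install -r requirements.txt      # plus any extra deps the example lists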

3

Run Backend

Ensure you’re in the same directory as chatbot.py!

You will need the following environment variables:

  1. DEEPGRAM_API_KEY - You can get this by going to https://console.deepgram.com/ and clicking on the API key tab.
  2. GROQ_API_KEY - You can get this by going to https://console.groq.com/keys and clicking on the API key tab.
  3. CARTESIA_API_KEY - You can get this by going to https://play.cartesia.ai/keys and clicking on the API key tab.

All of these providers have a free tier. Once you have your keys, run the following commands:

export DEEPGRAM_API_KEY=<your_deepgram_api_key>
export GROQ_API_KEY=<your_groq_api_key>
export CARTESIA_API_KEY=<your_cartesia_api_key>

Finally, run the following command to start the server:

python chatbot.py

The console will output the URL you can use to connect to the server (default is http://localhost:8080).

4

Run Demo

We have already created a simple frontend using our React SDK. Open the following page in your browser:

https://playground.outspeed.com/webrtc

  1. Paste the link you received from the previous step into the URL field.
  2. Select an Audio device, leave Video device blank, and click Run to begin.

Understanding the Process

Review the code in chatbot.py.

Setup

The setup method initializes the services and configuration the chatbot needs (a condensed sketch follows the list):

  • DeepgramSTT: This service converts spoken audio into text using Deepgram’s speech-to-text API. It is configured with a sample rate of 8000 Hz to match the incoming audio.
  • FireworksLLM: Utilizes a large language model from Fireworks to generate responses from the transcribed text. The model is given a system prompt that defines the chatbot’s behavior and its JSON output format, including facial expressions and animations.
  • LipSync: This service synchronizes the generated audio with appropriate mouth movements on the 3D avatar using the Rhubarb Lip Sync tool.
  • ElevenLabsTTS: Converts the generated text responses into spoken audio using Eleven Labs’ text-to-speech API. This service is optimized for streaming latency and uses a specific voice model.
  • AudioConverter: Ensures that the audio output is in the correct format for playback, converting the audio stream as necessary.
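
Putting those pieces together, setup looks roughly like the sketch below. The class names and the 8000 Hz sample rate come from the list above; the import path, decorator, and constructor arguments are assumptions, so treat chatbot.py as the source of truth.

import outspeed as sp

@sp.App()
class Chatbot:
    async def setup(self):
        # Speech-to-text, matched to the 8000 Hz input audio.
        self.stt_node = sp.DeepgramSTT(sample_rate=8000)

        # LLM whose system prompt forces JSON output containing the reply
        # text, a facial expression, and an animation name.
        self.llm_node = sp.FireworksLLM(
            system_prompt=(
                "You are a friendly 3D avatar. Reply only with JSON of the "
                'form {"text": ..., "facialExpression": ..., "animation": ...}.'
            )
        )

        # Rhubarb-based lip sync and Eleven Labs text-to-speech
        # (see chatbot.py for the exact streaming-latency options).
        self.lip_sync_node = sp.LipSync()
        self.tts_node = sp.ElevenLabsTTS()

        # Converts the TTS output into a playable audio format.
        self.audio_converter_node = sp.AudioConverter()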

Streaming Endpoint

The run method acts as the streaming endpoint for audio input; each turn flows through the following steps (a worked example of the JSON payload follows the list):

  1. Audio to Text: The audio input stream is first converted to text using the DeepgramSTT service.
  2. Text to Response: The text is then processed by the FireworksLLM to generate a JSON-formatted response containing the text, facial expressions, and animations.
  3. Processing JSON: The JSON response is unpacked, and the text component is extracted and sent to the ElevenLabsTTS service to be converted into audio.
  4. Generating Lip Sync Data: Simultaneously, the text-to-speech audio is used to generate lip-sync data to match the avatar’s mouth movements to the spoken audio.
  5. Combining Outputs: The audio stream and JSON data (now including lip-sync information) are combined and synchronized.
  6. Audio Conversion: Finally, the audio stream is converted to the appropriate format for output through the AudioConverter.
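
To make step 3 concrete, here is a minimal, self-contained example of unpacking one LLM turn. The field names are illustrative; the actual schema is whatever the system prompt in chatbot.py specifies.

import json

# Illustrative LLM output; the real schema is set by the system prompt.
raw = '{"text": "Hello! How can I help?", "facialExpression": "smile", "animation": "Talking"}'

response = json.loads(raw)
speech_text = response["text"]  # sent to ElevenLabsTTS
expression = response["facialExpression"]  # drives the avatar's face
animation = response["animation"]  # drives the body animation
print(speech_text, expression, animation)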

Support

For any assistance or questions, feel free to join our Discord community. We’re excited to see what you build!