Overview
A multilingual voice bot built using Outspeed in Python.
Make sure you have reviewed QuickStart before proceeding with this example.
In this example, we will build a multilingual voice bot that leverages Outspeed’s AI services. The bot will process audio input, generate responses using a Large Language Model (LLM), and convert text responses back to speech.
Set Up the Multilingual Voice Bot Application in 4 Steps
This application will process input from your microphone, send it to AI services, and convert the response back to voice.
- The full code is available here.
Install Dependencies
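If you completed the QuickStart, Outspeed is already installed; otherwise, install it with pip. The package extra below is an assumption (SileroVAD is used later in this example), so check the QuickStart for the exact install command:

```bash
# Assumed install command; see the QuickStart for the exact package name and extras
pip install "outspeed[silero]"
```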
Create VoiceBot Application
Refer to QuickStart to understand the structure of an Outspeed application.
Create a file named `voice_bot.py` and add the following code:
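The full listing is linked above; the sketch below is reconstructed from the pipeline described under Understanding the Process. The exact constructor parameters, the `sp.merge` helper, the `clone()` call, and the returned streams are assumptions about the Outspeed API, so treat the linked code as authoritative:

```python
import outspeed as sp


@sp.App()
class VoiceBot:
    async def setup(self) -> None:
        # Transcribes audio input into text for the supported languages
        self.transcriber = sp.AzureTranscriber(languages=["en-US", "hi-IN"])
        # Generates responses with a system prompt and model of your choice
        self.llm = sp.OpenAILLM(
            system_prompt="You are a helpful assistant. Reply in the language the user speaks.",
        )
        # Buffers LLM tokens into phrases before they are sent to TTS
        self.token_aggregator = sp.TokenAggregator()
        # Converts text responses back to speech
        self.tts = sp.ElevenLabsTTS()
        # Detects voice activity so the bot can handle interruptions
        self.vad = sp.SileroVAD()

    @sp.streaming_endpoint()
    async def run(self, audio_input_queue: sp.AudioStream, text_input_queue: sp.TextStream):
        # 1. Transcription: audio -> text
        transcribed_text = self.transcriber.run(audio_input_queue)
        # 2. Voice activity detection on a copy of the audio stream
        vad_stream = self.vad.run(audio_input_queue.clone())
        # 3. Merge transcribed speech with typed user input, then generate responses
        llm_input = sp.merge([transcribed_text, text_input_queue])
        llm_tokens = self.llm.run(llm_input)
        # 4. Aggregate tokens into chunks suitable for TTS
        aggregated_text = self.token_aggregator.run(llm_tokens)
        # 5. Text-to-speech: text -> audio
        audio_output = self.tts.run(aggregated_text)
        return audio_output, vad_stream

    async def teardown(self) -> None:
        pass


if __name__ == "__main__":
    VoiceBot().start()
```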
Set Up API Keys and Run
You will need the following environment variables:
- `AZURE_SPEECH_KEY` - Obtain this by visiting the Azure Portal and generating a new key.
- `AZURE_SPEECH_REGION` - Obtain this by visiting the Azure Portal and copying the region from your Azure Speech resource.
- `OPENAI_API_KEY` - Obtain this by visiting the OpenAI Realtime API documentation and generating a new key.
- `ELEVEN_LABS_API_KEY` - Obtain this by visiting the ElevenLabs API documentation and generating a new key.
Once you have your keys, create a `.env` file in the same directory as `voice_bot.py` and add the following:
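Replace each placeholder with your own value:

```bash
AZURE_SPEECH_KEY=<your_azure_speech_key>
AZURE_SPEECH_REGION=<your_azure_speech_region>
OPENAI_API_KEY=<your_openai_api_key>
ELEVEN_LABS_API_KEY=<your_eleven_labs_api_key>
```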
Finally, run the following command to start the server:
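Assuming the standard QuickStart setup, the app starts directly with Python (the exact command is an assumption; use the QuickStart's run command if it differs):

```bash
python voice_bot.py
```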
The console will output the URL you can use to connect to the server (default is http://localhost:8080).
Run Demo
You can use our playground to interact with the voice bot.
- Navigate to playground and select “Voice Bot”.
- Paste the link you received from the previous step into the URL field.
- Select your Audio device. Leave the Video device blank. Click Run to begin.
The playground is built using our React SDK. You can use it to build your own frontends or integrate with an existing one!
Understanding the Process
Review the code in `voice_bot.py`.
Setup
The `VoiceBot` class initializes the necessary AI services and nodes in its `setup` method. Here’s a breakdown of the services initialized:
- `AzureTranscriber`: Transcribes audio input into text for the supported languages (`en-US`, `hi-IN`).
- `OpenAILLM`: Generates responses using OpenAI’s LLM with a specified system prompt and model.
- `TokenAggregator`: Aggregates tokens from the LLM for downstream processing.
- `ElevenLabsTTS`: Converts text responses back to speech using ElevenLabs’ TTS service.
- `SileroVAD`: Detects voice activity in order to handle audio interruptions.
Streaming Endpoint
The `run` method in the `VoiceBot` class is marked as a streaming endpoint, handling real-time audio and text streams.
The method processes the audio input through the following pipeline:
- Transcription: Converts audio to text using `AzureTranscriber`.
- Voice Activity Detection: Monitors the audio input for interruptions.
- LLM Processing: Merges the transcribed text with typed user input, then generates responses using `OpenAILLM`.
- Aggregation: Aggregates tokens from the LLM response.
- Text-to-Speech: Converts the aggregated text back to audio using `ElevenLabsTTS`.
Support
For any assistance or questions, feel free to join our Discord community. We’re excited to see what you build!