Make sure you have reviewed QuickStart before proceeding with this example.

In this example, we will build a multilingual voice bot that leverages Outspeed’s AI services. The bot will process audio input, generate responses using a Large Language Model (LLM), and convert text responses back to speech.

Set Up a Multilingual Voice Bot Application in 4 Steps

This application will process input from your microphone, send it to AI services, and convert the response back to voice.

  • The full code is available here.
1. Install Dependencies
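Install the Outspeed Python SDK. The command below is a minimal sketch that assumes the package is published on PyPI as outspeed; see the QuickStart for the exact command and any extras (for example, one for the Silero VAD model) your version may require:

pip install outspeed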

2. Create VoiceBot Application

Refer to QuickStart to understand the structure of an Outspeed application.

Create a file named voice_bot.py and add the following code:

import json
import outspeed as sp

@sp.App()
class VoiceBot:
    """
    VoiceBot class represents a voice-based AI assistant.

    This class handles the setup, running, and teardown of various AI services
    used to process audio input, generate responses, and convert text to speech.
    """

    async def setup(self) -> None:
        """
        Initialize the VoiceBot.

        This method is called when the app starts. It should be used to set up
        services, load models, and perform any necessary initialization.
        """
        # Initialize the AI services
        self.transcriber_node = sp.AzureTranscriber(languages=["en-US", "fr-FR"])
        self.llm_node = sp.OpenAILLM(
            system_prompt="You are a helpful assistant. Keep your answers very short. No special characters in responses. You can speak English and French. Reply in the same language as the user's response.",
            model="gpt-4o-mini",
        )
        self.token_aggregator_node = sp.TokenAggregator()
        self.tts_node = sp.ElevenLabsTTS(
            voice_id="1qZOLVpd1TVic43MSkFY",
            model="eleven_turbo_v2_5",
            volume=0.7,
        )
        self.vad_node = sp.SileroVAD(min_volume=0)

    @sp.streaming_endpoint()
    async def run(self, audio_input_queue: sp.AudioStream, text_input_queue: sp.TextStream) -> sp.AudioStream:
        """
        Handle the main processing loop for the VoiceBot.

        This method is called for each new connection request. It sets up and
        runs the various AI services in a pipeline to process audio input and
        generate audio output.

        Args:
            audio_input_queue (sp.AudioStream): The input stream of audio data.
            text_input_queue (sp.TextStream): The input stream of text messages from the client.

        Returns:
            sp.AudioStream: The output stream of generated audio responses.
        """
        # Set up the AI service pipeline
        transcriber_stream: sp.TextStream = self.transcriber_node.run(audio_input_queue)

        # Run VAD on a clone of the audio stream to detect when the user starts speaking
        vad_stream: sp.VADStream = self.vad_node.run(audio_input_queue.clone())

        # Incoming text messages are JSON strings; extract their "content" field
        text_input_queue = sp.map(text_input_queue, lambda x: json.loads(x).get("content"))

        # Merge transcribed speech and typed text into a single input stream for the LLM
        llm_input_queue = sp.merge(
            [transcriber_stream, text_input_queue],
        )

        llm_token_stream: sp.TextStream
        chat_history_stream: sp.TextStream
        llm_token_stream, chat_history_stream = self.llm_node.run(llm_input_queue)

        # Aggregate streamed LLM tokens before converting the text to speech
        token_aggregator_stream: sp.TextStream = self.token_aggregator_node.run(llm_token_stream)
        tts_stream: sp.AudioStream = self.tts_node.run(token_aggregator_stream)

        # Let detected user speech interrupt in-flight LLM generation, aggregation, and TTS output
        self.llm_node.set_interrupt_stream(vad_stream)
        self.token_aggregator_node.set_interrupt_stream(vad_stream.clone())
        self.tts_node.set_interrupt_stream(vad_stream.clone())

        return tts_stream

    async def teardown(self) -> None:
        """
        Clean up resources when the VoiceBot is shutting down.

        This method is called when the app stops or is shut down unexpectedly.
        It should be used to release resources and perform any necessary cleanup.
        """
        await self.transcriber_node.close()
        await self.llm_node.close()
        await self.token_aggregator_node.close()
        await self.tts_node.close()

if __name__ == "__main__":
    # Start the VoiceBot when the script is run directly
    VoiceBot().start()
3. Set Up API Keys and Run

You will need the following environment variables:

  1. AZURE_SPEECH_KEY - Obtain this by visiting the Azure Portal and generating a new key.
  2. AZURE_SPEECH_REGION - Obtain this by visiting the Azure Portal and copying the region from the Azure Speech resource.
  3. OPENAI_API_KEY - Obtain this by visiting the OpenAI platform dashboard and generating a new API key.
  4. ELEVEN_LABS_API_KEY - Obtain this by visiting the ElevenLabs API documentation and generating a new key.

Once you have your keys, create a .env file in the same directory as voice_bot.py and add the following:

OPENAI_API_KEY=<your_openai_api_key>
ELEVEN_LABS_API_KEY=<your_elevenlabs_api_key>
AZURE_SPEECH_KEY=<your_azure_speech_key>
AZURE_SPEECH_REGION=<your_azure_speech_region>
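If you want to confirm these are set before starting the server, an optional quick check like the one below can help. This is a sketch that assumes python-dotenv is installed; the Outspeed SDK may already load the .env file on its own.

import os
from dotenv import load_dotenv

load_dotenv()  # read .env into the process environment

required = ["OPENAI_API_KEY", "ELEVEN_LABS_API_KEY", "AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION"]
missing = [name for name in required if not os.getenv(name)]
if missing:
    raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
print("All required environment variables are set.")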

Finally, run the following command to start the server:

python voice_bot.py

The console will output the URL you can use to connect to the server (default is http://localhost:8080).

4. Run Demo

You can use our playground to interact with the voice bot.

  1. Navigate to playground and select “Voice Bot”.
  2. Paste the link you received from the previous step into the URL field.
  3. Select your Audio device. Leave the Video device blank. Click Run to begin.

The playground is built using our React SDK. You can use it to build your own frontends or integrate with an existing one!

Understanding the Process

Review the code in voice_bot.py.

Setup

The VoiceBot class initializes with the setup method, which sets up the necessary AI services and nodes. Here’s a breakdown of the services initialized:

  • AzureTranscriber: Transcribes audio input into text for the supported languages (en-US, fr-FR). See the sketch after this list for adding another language.
  • OpenAILLM: Generates responses using OpenAI’s LLM with a specified system prompt and model.
  • TokenAggregator: Aggregates tokens from the LLM for processing.
  • ElevenLabsTTS: Converts text responses back to speech using ElevenLabs’ TTS service.
  • SileroVAD: Voice Activity Detection to handle audio interruptions.
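For example, adding a third language only requires extending the transcriber's language list and mentioning the language in the system prompt. The snippet below sketches that change; it assumes es-ES is an Azure Speech locale available in your region.

# In setup(): add Spanish alongside English and French
self.transcriber_node = sp.AzureTranscriber(languages=["en-US", "fr-FR", "es-ES"])
self.llm_node = sp.OpenAILLM(
    system_prompt=(
        "You are a helpful assistant. Keep your answers very short. "
        "No special characters in responses. You can speak English, French, and Spanish. "
        "Reply in the same language as the user's response."
    ),
    model="gpt-4o-mini",
)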

Streaming Endpoint

The run method in the VoiceBot class is marked as a streaming endpoint, handling real-time audio and text streams.

The method processes the audio input through the following pipeline:

  1. Transcription: Converts audio to text using AzureTranscriber.
  2. Voice Activity Detection: Monitors audio input for interruptions.
  3. LLM Processing: Merges transcribed text and typed user input, then generates responses using OpenAILLM (see the sketch after this list).
  4. Aggregation: Aggregates tokens from the LLM response.
  5. Text-to-Speech: Converts the aggregated text back to audio using ElevenLabsTTS.
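Step 3 is where text chat from the playground joins the audio path. Incoming text messages arrive as JSON strings, so sp.map extracts the content field before sp.merge combines that stream with the transcriber output. The snippet below illustrates just the mapping step; the exact message shape is an assumption based on the json.loads(x).get("content") call in voice_bot.py.

import json

# A text message from the client might look like this (assumed shape):
raw_message = '{"role": "user", "content": "Bonjour, comment ça va ?"}'

# This mirrors the lambda passed to sp.map in voice_bot.py:
content = json.loads(raw_message).get("content")
print(content)  # -> Bonjour, comment ça va ?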

Support

For any assistance or questions, feel free to join our Discord community. We’re excited to see what you build!