In this guide, we’ll develop an advanced voice bot over 7 tutorials. The series covers the following:

  1. Introduction: Start here (this page 👇🏽). Learn about the essential elements of a voice bot built using OutspeedSDK.
  2. Recording: Add an endpoint where the call recording is stored.
  3. Interruption Handling: The bot will use a voice activity detector to handle interruptions.
  4. Backchanneling: The bot will respond with filler words to make the conversation sound more human.
  5. Conversational Pathways: Add conversational pathways to reduce hallucination and implement voice-based workflows.
  6. Database: Add a database to expand the knowledge the bot has access to.
  7. Integrations: Add an endpoint for alerting, or integrate your bot with a webhook.
  8. Meetings: Make your voice bot available in a Zoom meeting!

Understanding the Process

Refer to the Quickstart guide to clone the examples directory before proceeding.

Review the code in voice_bot.py.

Setup

The VoiceBot class initializes with the setup method, which is called automatically when the application starts. This method is responsible for setting up the necessary services and loading models. Here’s a breakdown of the services initialized (a sketch of the method follows the list):

  • DeepgramSTT: Converts spoken audio to text using Deepgram’s speech-to-text API.
  • GroqLLM: Processes text inputs to generate responses, simulating a conversational agent.
  • TokenAggregator: Aggregates tokens from the language model to form coherent responses.
  • ElevenLabsTTS: Converts text responses back into spoken audio using Eleven Labs’ text-to-speech API.
  • AudioConverter: Converts audio formats to ensure compatibility across different services.
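
For orientation, a setup method along these lines wires the services together. This is a minimal sketch: the constructor parameters (such as the system prompt) are assumptions, so consult voice_bot.py for the exact signatures.

```python
import outspeed

class VoiceBot:
    async def setup(self) -> None:
        # Speech-to-text: converts incoming audio into a text stream.
        self.stt = outspeed.DeepgramSTT()
        # LLM: generates responses from the transcript.
        # (system_prompt is an assumed parameter name.)
        self.llm = outspeed.GroqLLM(
            system_prompt="You are a helpful voice assistant. Keep replies short."
        )
        # Joins the LLM's token stream into complete phrases for TTS.
        self.token_aggregator = outspeed.TokenAggregator()
        # Text-to-speech: speaks the aggregated responses.
        self.tts = outspeed.ElevenLabsTTS()
        # Normalizes audio formats across services.
        self.audio_converter = outspeed.AudioConverter()
```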

Streaming Endpoint

The run method is decorated with @outspeed.streaming_endpoint(), indicating it handles real-time streaming data. This method is triggered whenever a new connection request is received. It processes incoming audio streams and coordinates the flow of data through the services (a sketch of the method follows the list):

  1. Audio Cloning: Clones the incoming audio stream to allow simultaneous processing by different services.
  2. Speech Recognition: Converts the cloned audio stream to text using DeepgramSTT.
  3. Language Processing: Processes the recognized text to generate tokens and manage conversation history using GroqLLM.
  4. Token Aggregation: Aggregates tokens to form complete responses.
  5. Text-to-Speech: Converts the aggregated text back into audio using ElevenLabsTTS.
  6. Audio Conversion: Ensures the audio format is suitable for output.
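
Continuing the sketch above, the run method chains those services into a pipeline. The .clone() helper, the stream argument, and the .run() chaining are assumptions based on the steps listed; the authoritative version is in voice_bot.py.

```python
    # Continuing the VoiceBot class sketched above.
    @outspeed.streaming_endpoint()
    async def run(self, audio_input_queue):
        # 1. Clone the incoming audio so multiple services can read it.
        audio_stream = audio_input_queue.clone()

        # 2. Speech recognition: audio -> text.
        text_stream = self.stt.run(audio_stream)

        # 3. Language processing: text -> token stream (with history).
        token_stream = self.llm.run(text_stream)

        # 4. Token aggregation: tokens -> complete responses.
        response_stream = self.token_aggregator.run(token_stream)

        # 5. Text-to-speech: responses -> synthesized audio.
        tts_stream = self.tts.run(response_stream)

        # 6. Audio conversion: ensure an output-compatible format.
        return self.audio_converter.run(tts_stream)
```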

Teardown

The teardown method is called when the application is stopped or shuts down unexpectedly. It ensures that all services are properly closed and resources are cleaned up, preventing resource leaks and allowing the application to restart cleanly if necessary.
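
A corresponding teardown might simply close each service. Whether each service exposes a close() coroutine is an assumption here; mirror whatever cleanup voice_bot.py actually performs.

```python
    # Continuing the VoiceBot class sketched above.
    async def teardown(self) -> None:
        # Release connections and resources held by each service.
        await self.stt.close()
        await self.llm.close()
        await self.token_aggregator.close()
        await self.tts.close()
        await self.audio_converter.close()
```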

Deploying with outspeed deploy hosts the bot on our serverless backend and lets you visually track inputs and outputs via the Outspeed dashboard!

Next Steps

In the next tutorial, learn how to record the conversation with the AI.