What is Outspeed?

Outspeed is a platform for building real-time voice and video AI applications. The Outspeed SDK provides a simple, intuitive interface for these applications: it abstracts away the complexities of handling multimedia inputs, allowing developers to focus on creating powerful multimodal AI experiences.

Key features of the Outspeed SDK include:

  • Easy-to-use API inspired by PyTorch
  • Built-in support for processing voice and video inputs
  • Seamless integration with various AI models and services
  • Real-time processing capabilities for responsive applications

Now, let’s get started!

Demo

In this quick start guide, we’ll demonstrate how to build a voice bot using Outspeed. The bot engages in real-time conversation and responds according to its LLM system prompt.

Below, you’ll find a brief video demonstration showcasing the capabilities of the voice bot: (PS: Adapt was our previous name 😅)

Prerequisites

Make sure to have the following installed:

  1. Python: version 3.9 or higher, but less than 3.13
  2. pip: latest version (comes with Python)
  3. Git: for cloning the repository
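
You can quickly confirm that your Python version is in the supported range:

python --version  # should print a version >= 3.9 and < 3.13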

Creating a Voice Bot in 4 steps!

This app will process input from your microphone or chatbox, send it to an LLM, and convert the response back to voice.

This application only supports English as the source language. For other languages, switch out the models/configs used in this example.

Step 1: Install Dependencies
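
A minimal sketch of this step, assuming the SDK is published on PyPI under the name outspeed (matching the import used below); check the repository README if the package name differs:

pip install outspeed

If you cloned the repository (see Prerequisites), run the command from the repository root so that the examples/voice_bot directory referenced later is available.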

Step 2: Create VoiceBot Application

To create a simple voice bot, let’s set up a file called voice_bot.py.

# voice_bot.py
import outspeed as sp

Next, we’ll create an application class with a streaming endpoint. This endpoint will:

  • Accept an sp.AudioStream as input (the complete example linked below also accepts an sp.TextStream for chat input)
  • Respond with an sp.AudioStream

To create a real-time application in Outspeed:

  1. Define a Python class
  2. Annotate it with @sp.App()
  3. Implement these three key methods:

Method      Purpose
setup()     Initialize AI services and resources
run()       Define the main processing pipeline (use @sp.streaming_endpoint())
teardown()  Clean up resources when the application stops

@sp.App()
class VoiceBot:
  async def setup(self) -> None:
    # Initialize the AI services
    # (Code for setup will be shown in the next snippet)
    ...

  @sp.streaming_endpoint()
  async def run(self, audio_input_queue: sp.AudioStream) -> sp.AudioStream:
    # Handle the main processing loop for the VoiceBot
    # (Code for run will be shown in a separate snippet)
    ...

  async def teardown(self) -> None:
    # Clean up resources when the VoiceBot is shutting down
    # (Code for teardown will be shown in the final snippet)
    ...

Now let’s break down each method:

  1. Setup method:
async def setup(self) -> None:
  # Initialize the AI services
  self.deepgram_node = sp.DeepgramSTT()
  self.llm_node = sp.GroqLLM(
      system_prompt="You are a helpful assistant. Keep your answers very short. No special characters in responses.",
  )
  self.token_aggregator_node = sp.TokenAggregator()
  self.tts_node = sp.CartesiaTTS(
      voice_id="95856005-0332-41b0-935f-352e296aa0df",
  )

The setup method initializes all the necessary AI services:

  • DeepgramSTT for speech-to-text conversion
  • GroqLLM for language model processing
  • TokenAggregator for aggregating streamed LLM output tokens (improves TTS synthesis)
  • CartesiaTTS for text-to-speech conversion
  2. Run method:
@sp.streaming_endpoint()
async def run(self, audio_input_queue: sp.AudioStream) -> sp.AudioStream:
  deepgram_stream: sp.TextStream = self.deepgram_node.run(audio_input_queue)

  llm_token_stream, chat_history_stream = self.llm_node.run(deepgram_stream)  # chat_history_stream is unused in this example
  token_aggregator_stream: sp.TextStream = self.token_aggregator_node.run(llm_token_stream)

  tts_stream: sp.AudioStream = self.tts_node.run(token_aggregator_stream)

  return tts_stream

The run method sets up the AI service pipeline:

  • Converts audio input to text using Deepgram
  • Runs the transcribed text through the LLM
  • Aggregates LLM output tokens for improved TTS synthesis
  • Converts the aggregated text to speech with Cartesia
  • Returns the final audio stream
  3. Teardown method:
async def teardown(self) -> None:
  await self.deepgram_node.close()
  await self.llm_node.close()
  await self.token_aggregator_node.close()
  await self.tts_node.close()

if __name__ == "__main__":
  # Start the VoiceBot app when the script is run directly
  VoiceBot().start()

The teardown method ensures proper cleanup of resources:

  • Closes all initialized AI services (Deepgram, LLM, TokenAggregator, TTS)
  • This method is called when the app stops or shuts down unexpectedly

The full voice_bot.py code is available here: View the complete voice_bot.py code

Step 3: Set Up API Keys and Run

To run this example locally, you’ll need API keys set as environment variables for the following services:

  1. Deepgram - For transcription. Sign up and navigate to https://console.deepgram.com/ to get the API key.
  2. Groq - For the LLM. Sign up and navigate to https://console.groq.com/keys to get the API key.
  3. Cartesia - For text-to-speech. Sign up and navigate to https://play.cartesia.ai/keys to get the API key.

All of these providers have a free tier. Once you have your keys, create a .env file in the same directory as voice_bot.py and add the following:

DEEPGRAM_API_KEY=<your_deepgram_api_key>
GROQ_API_KEY=<your_groq_api_key>
CARTESIA_API_KEY=<your_cartesia_api_key>
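
Note that Python does not read a .env file automatically; if the SDK doesn’t load it for you, export the variables in your shell or load them with a package such as python-dotenv. As a quick sanity check that the keys are visible to the process (standard library only, names matching the .env above):

import os

# Print which of the required keys are present in the environment
for key in ("DEEPGRAM_API_KEY", "GROQ_API_KEY", "CARTESIA_API_KEY"):
  print(f"{key}: {'set' if os.environ.get(key) else 'MISSING'}")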

Finally, run the following command to start the server:

# in examples/voice_bot
python voice_bot.py

The console will output the URL you can use to connect to the server (default: http://localhost:8080).

Step 4: Try it Out

You can use our playground to interact with the voice bot.

  1. Navigate to the playground and select “Voice Bot”
  2. Paste the link you received from the previous step into the URL field.
  3. Select an audio device. Leave the video device blank. Click Run to begin.

The playground is built using our React SDK. You can use it to build your own frontend, or integrate with an existing one!

Support

For any assistance or questions, feel free to join our Discord community. We’re excited to see what you build!