Make sure you have gone over QuickStart before trying this example.

In this example, we will build a realtime commentator that provides commentary on a live poker tournament. We accomplish this by using screenshare to capture the live game on YouTube and microphone audio to interact with the commentator.

Demo

Creating a Sports Commentator in 4 steps

This app takes audio from your microphone and video from screenshare, sends both to an LLM, and converts the response back to voice.

  • The full GitHub repository is available here.

This application will initially support only English as the source language.

1

Start by cloning the repository

git clone https://github.com/outspeed-ai/outspeed.git
cd outspeed/examples/sports_commentator/
2

Install Dependencies

3

Run Backend

Ensure you’re in the same directory as poker_commentator.py!

You will need the following environment variables:

  1. DEEPGRAM_API_KEY - You can get this by going to https://console.deepgram.com/ and clicking on the API key tab.
  2. GROQ_API_KEY - You can get this by going to https://console.groq.com/keys and clicking on the API key tab.
  3. CARTESIA_API_KEY - You can get this by going to https://play.cartesia.ai/keys and clicking on the API key tab.

All of these providers have a free tier. Once you have your keys, export them in your shell:

export DEEPGRAM_API_KEY=<your_deepgram_api_key>
export GROQ_API_KEY=<your_groq_api_key>
export CARTESIA_API_KEY=<your_cartesia_api_key>
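Before starting the server, it can help to confirm that all three variables are actually visible to Python. The following is a minimal sketch using only the standard library; the helper name `missing_keys` is hypothetical and is not part of the Outspeed SDK.

```python
import os

# Variable names taken from the list of required keys above.
REQUIRED_KEYS = ("DEEPGRAM_API_KEY", "GROQ_API_KEY", "CARTESIA_API_KEY")


def missing_keys(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_KEYS if not env.get(name)]


# Report which keys, if any, still need to be exported.
missing = missing_keys()
print("Missing keys:", ", ".join(missing) if missing else "none")
```

If the script lists any names, export them in the same shell session you will use to start the server.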

Finally, run the following command to start the server:

python poker_commentator.py

The console will output the URL you can use to connect to the server (default is http://localhost:8080).

4

Run Demo

We have already created a simple frontend using our React SDK. Open the following page in your browser:

https://playground.outspeed.com/webrtc

  1. Paste the link you received from the previous step into the URL field.
  2. Select Audio device. Leave Video device blank. Click Run to begin.

Understanding the Process

Review the code in poker_commentator.py.

Setup

The PokerCommentator class initializes with the setup method, which is automatically called when the application starts. This method is responsible for setting up the necessary services and loading models. Here’s a breakdown of the services initialized:

  • DeepgramSTT: Converts spoken audio to text using Deepgram’s speech-to-text API, configured with a sample rate of 8000 Hz.
  • KeyFrameDetector: Analyzes video streams to detect significant moments, using a sensitivity threshold of 0.2 and a maximum time interval of 15 seconds between key frames.
  • GeminiVision: Processes audio and video inputs to generate insightful poker commentary, guided by a detailed system prompt. It operates with a response temperature of 0.9 and maintains a chat history for context.
  • TokenAggregator: Aggregates tokens from GeminiVision to form coherent responses.
  • ElevenLabsTTS: Converts text responses back into spoken audio using ElevenLabs’ text-to-speech API, optimized for low latency and using a specific voice model.
  • AudioConverter: Converts audio formats to ensure compatibility across different services.
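To make the two KeyFrameDetector parameters more concrete, here is a rough sketch of how a sensitivity threshold and a maximum interval could interact. It is an assumption-laden stand-in, not Outspeed's implementation: the class `SimpleKeyFrameDetector`, the mean-absolute-difference metric, and the flat list-of-intensities frame format are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SimpleKeyFrameDetector:
    """Hypothetical stand-in; the real KeyFrameDetector may work differently."""

    threshold: float = 0.2        # sensitivity threshold quoted above
    max_interval_s: float = 15.0  # maximum gap between key frames, in seconds
    _last_frame: Optional[Sequence[float]] = None
    _last_time: float = 0.0

    def is_key_frame(self, frame: Sequence[float], timestamp: float) -> bool:
        """Decide whether `frame` (pixel intensities in [0, 1]) is a key frame."""
        if self._last_frame is None:
            changed = True  # the first frame is always a key frame
        else:
            # Mean absolute difference against the last key frame.
            total = sum(abs(a - b) for a, b in zip(frame, self._last_frame))
            changed = total / len(frame) > self.threshold
        # Force a key frame if too much time has passed since the last one.
        timed_out = timestamp - self._last_time >= self.max_interval_s
        if changed or timed_out:
            self._last_frame = list(frame)
            self._last_time = timestamp
            return True
        return False


detector = SimpleKeyFrameDetector()
for t, frame in [(0.0, [0.0] * 4), (1.0, [0.1] * 4), (2.0, [0.5] * 4)]:
    print(t, detector.is_key_frame(frame, t))
```

In this sketch a lower threshold forwards more frames to the vision model, while the maximum interval guarantees a fresh frame even when the picture barely changes.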

Streaming Endpoint

The run method in the PokerCommentator class is marked as a streaming endpoint, handling real-time audio and video streams. Here’s how it processes these streams:

  1. Audio Processing: The audio input stream is first converted to text using the DeepgramSTT service.
  2. Video Processing: Simultaneously, the video input stream is analyzed by the KeyFrameDetector to identify key moments.
  3. Commentary Generation: The text from Deepgram and the video analysis from KeyFrameDetector are then processed by GeminiVision to generate commentary.
  4. Token Aggregation: The commentary tokens generated are refined by the TokenAggregator for coherence.
  5. Text-to-Speech: The coherent text is then converted into spoken audio by ElevenLabsTTS.
  6. Audio Conversion: Finally, the audio stream is formatted by the AudioConverter for output.

The method outputs three streams: the audio stream of the commentary, the chat history text stream, and the cloned video input stream.
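The token aggregation step in the list above can be pictured as buffering streamed tokens until a sentence is complete, so that the text-to-speech service receives whole sentences rather than fragments. The sketch below is one plausible approach under that assumption; the function `aggregate_tokens` and its punctuation-based rule are hypothetical and do not describe how TokenAggregator is actually implemented.

```python
from typing import Iterable, Iterator

# Characters treated as sentence boundaries in this sketch (an assumption).
SENTENCE_END = (".", "!", "?")


def aggregate_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Buffer streamed tokens and yield one complete sentence at a time."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():
        # Flush any trailing text that never reached a sentence boundary.
        yield buffer.strip()


stream = ["All", " in", "!", " What", " a", " call", "."]
for sentence in aggregate_tokens(stream):
    print(sentence)
```

Yielding at sentence boundaries trades a small delay for more natural speech, since the voice model sees full phrases instead of isolated tokens.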

Support

For any assistance or questions, feel free to join our Discord community. We’re excited to see what you build!