Make sure you have gone over QuickStart before trying this example.

In this example, we will build a realtime cooking assistant that uses audio from your microphone and video from your camera to provide help while you cook.

Creating a Cooking Assistant in 4 steps

This app processes audio from your microphone and video from your camera, sends both to an LLM, and converts the response back to speech.

  • The full GitHub repository is available at https://github.com/outspeed-ai/outspeed.

This application will initially support only English as the source language.

Step 1: Clone the Repository

git clone https://github.com/outspeed-ai/outspeed.git
cd outspeed/examples/cooking_assistant/
Step 2: Install Dependencies
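
The repository README documents the exact install command. A typical setup, assuming the SDK is published on PyPI as outspeed (check the README if the package name or extras differ), looks like this:

pip install outspeed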

Step 3: Run Backend

Ensure you’re in the same directory as cooking_assistant.py!

You will need the following environment variables:

  1. DEEPGRAM_API_KEY - You can get this by going to https://console.deepgram.com/ and clicking on the API key tab.
  2. GROQ_API_KEY - You can get this by going to https://console.groq.com/keys and clicking on the API key tab.
  3. CARTESIA_API_KEY - You can get this by going to https://play.cartesia.ai/keys and clicking on the API key tab.

All of these providers have a free tier. Once you have your keys, run the following commands:

export DEEPGRAM_API_KEY=<your_deepgram_api_key>
export GROQ_API_KEY=<your_groq_api_key>
export CARTESIA_API_KEY=<your_cartesia_api_key>

Finally, run the following command to start the server:

python cooking_assistant.py

The console will output the URL you can use to connect to the server (default is http://localhost:8080).

Step 4: Run Demo

We have already created a simple frontend using our React SDK. Open the following page in your browser:

https://playground.outspeed.com/webrtc

  1. Paste the link you received from the previous step into the URL field.
  2. Select an Audio device and a Video device (the assistant needs your camera feed). Click Run to begin.

Understanding the Process

Review the code in cooking_assistant.py.

Setup

The CookingAssistant class initializes with the setup method, which is called automatically when the application starts. This method sets up the necessary services and loads models. Here’s a breakdown of the services initialized; a sketch of the resulting setup follows the list:

  • DeepgramSTT: Converts spoken audio to text using Deepgram’s speech-to-text API, configured with a sample rate of 8000 Hz.
  • OpenAIVision: Analyzes video inputs and generates cooking instructions, guided by a detailed system prompt. It operates with an auto-response setting and waits for the user’s first response to initiate interaction.
  • TokenAggregator: Groups the token-by-token output from OpenAIVision into coherent chunks, so that text-to-speech receives complete phrases rather than fragments.
  • ElevenLabsTTS: Converts text instructions back into spoken audio using Eleven Labs’ text-to-speech API.
  • SileroVAD: Detects voice activity to manage interruptions effectively during the audio stream processing.
  • AudioConverter: Converts audio formats to ensure compatibility across different services.
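
Putting the list together, setup looks roughly like the sketch below. The class names come from the descriptions above, but the decorator, parameter names, and exact signatures are assumptions (SYSTEM_PROMPT stands in for the detailed cooking prompt); the authoritative version is in cooking_assistant.py.

import outspeed as sp

@sp.App()  # decorator name is an assumption; see the repo for the real one
class CookingAssistant:
    async def setup(self):
        # Speech-to-text at 8000 Hz, per the description above
        self.deepgram_node = sp.DeepgramSTT(sample_rate=8000)
        # Vision LLM guided by a detailed cooking system prompt (prompt text elided)
        self.llm_node = sp.OpenAIVision(system_prompt=SYSTEM_PROMPT, auto_respond=True)
        # Groups streamed tokens into coherent chunks for text-to-speech
        self.token_aggregator_node = sp.TokenAggregator()
        # Text-to-speech via Eleven Labs
        self.tts_node = sp.ElevenLabsTTS()
        # Voice activity detection, used for interruption handling
        self.vad_node = sp.SileroVAD()
        # Output audio format conversion
        self.audio_convertor_node = sp.AudioConverter()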

Streaming Endpoint (run method)

This method handles the real-time processing of audio and video streams; a sketch of the full pipeline follows the list:

  1. Audio Stream Handling:

    • Receives and clones the audio input stream for parallel processing.
    • Processes the original audio stream through DeepgramSTT to convert speech to text.
    • Processes the cloned audio stream through SileroVAD for voice activity detection.
  2. Video Stream Handling:

    • The text output from DeepgramSTT and the video input stream are processed by OpenAIVision to generate cooking instructions.
  3. Text and Audio Processing:

    • The text from OpenAIVision is aggregated by TokenAggregator.
    • ElevenLabsTTS converts the aggregated text into spoken audio.
    • AudioConverter transforms this audio back into a stream format suitable for output.
  4. Interruption Management:

    • Sets up interruptions in the audio output based on voice activity detected by SileroVAD to manage the flow of responses.
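
In code, this pipeline maps to roughly the sketch below (a method of CookingAssistant). The stream method names (clone, run, set_interrupt_stream) and the decorator are assumptions drawn from the step descriptions; consult cooking_assistant.py for the real API.

@sp.streaming_endpoint()  # decorator name is an assumption
async def run(self, audio_input_queue, video_input_queue):
    # 1. Clone the audio stream so STT and VAD can consume it in parallel
    audio_copy_queue = audio_input_queue.clone()
    text_stream = self.deepgram_node.run(audio_input_queue)
    vad_stream = self.vad_node.run(audio_copy_queue)

    # 2. Feed transcribed text plus the camera frames to the vision LLM
    llm_token_stream = self.llm_node.run(text_stream, video_input_queue)

    # 3. Aggregate tokens, synthesize speech, and convert the audio format
    token_stream = self.token_aggregator_node.run(llm_token_stream)
    tts_stream = self.tts_node.run(token_stream)
    audio_output = self.audio_convertor_node.run(tts_stream)

    # 4. Interrupt playback whenever the user starts speaking
    self.tts_node.set_interrupt_stream(vad_stream)
    audio_output.set_interrupt_stream(vad_stream)

    return audio_output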

Support

For any assistance or questions, feel free to join our Discord community. We’re excited to see what you build!