Ensure you have completed the Quick Start - Simple Voice Bot guide before proceeding with this guide.

Integrating Cartesia TTS into Your Voice Bot

In this guide, we will enhance your basic Voice Bot by integrating Cartesia TTS (Text-to-Speech) to convert text responses into natural-sounding speech. This integration provides a more engaging and interactive user experience.

  • The full code is available here.

Prerequisites

Before integrating Cartesia TTS, ensure you have the following:

  1. API Keys:
       • Deepgram API key (for speech-to-text)
       • Groq API key (for the LLM)
       • Cartesia API key (for text-to-speech)

Setup Environment Variables

Create a .env file in the same directory as voice_bot.py and add your API keys:

DEEPGRAM_API_KEY=<your_deepgram_api_key>
GROQ_API_KEY=<your_groq_api_key>
CARTESIA_API_KEY=<your_cartesia_api_key>
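
Depending on how you run the bot, the values in .env may not be loaded into the process automatically. The sketch below assumes the python-dotenv package (pip install python-dotenv); skip it if your environment already exports these variables:

import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # read key=value pairs from .env into the process environment

# Fail fast if any key is missing rather than erroring mid-call
for key in ("DEEPGRAM_API_KEY", "GROQ_API_KEY", "CARTESIA_API_KEY"):
    if not os.getenv(key):
        raise RuntimeError(f"{key} is not set")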

Understanding the CartesiaTTS Plugin

The CartesiaTTS plugin is a powerful component that enables text-to-speech synthesis using the Cartesia API. Let’s explore its key features and functionality:

Initialization

The plugin is initialized with several parameters:

CartesiaTTS(
    api_key: Optional[str] = None,
    voice_id: str = "a0e99841-438c-4a64-b679-ae501e7d6091",
    model: str = "sonic-english",
    output_encoding: str = "pcm_s16le",
    output_sample_rate: int = 16000,
    stream: bool = True,
    base_url: str = "wss://api.cartesia.ai/tts/websocket",
    cartesia_version: str = "2024-06-10",
)
  • api_key: Your Cartesia API key (can be set via environment variable CARTESIA_API_KEY)
  • voice_id: The ID of the voice to use for synthesis
  • model: The TTS model to use (default is “sonic-english”)
  • output_encoding: The audio encoding format (default is “pcm_s16le”)
  • output_sample_rate: The sample rate of the output audio (default is 16000 Hz)
  • stream: Whether to stream the audio output (default is True)
  • base_url: The WebSocket endpoint for the Cartesia TTS API
  • cartesia_version: The Cartesia API version to use (default is “2024-06-10”)
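
Any of these defaults can be overridden at construction time. Below is a minimal sketch; the voice_id is the one used later in this guide, and the 8000 Hz sample rate is only an illustrative override (for example, to match a telephony-style pipeline):

import os
import outspeed as sp

tts_node = sp.CartesiaTTS(
    api_key=os.getenv("CARTESIA_API_KEY"),  # or rely on the CARTESIA_API_KEY env var
    voice_id="95856005-0332-41b0-935f-352e296aa0df",
    output_sample_rate=8000,  # illustrative override of the 16000 Hz default
)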

Key Features

  1. WebSocket Communication: The plugin establishes a WebSocket connection with the Cartesia API for real-time text-to-speech synthesis.

  2. Streaming Support: It supports streaming audio output, allowing for low-latency speech synthesis.

  3. Interrupt Handling: The plugin can handle interruptions (e.g., when the user starts speaking) by cancelling ongoing TTS generation and clearing queues, as sketched after this list.

  4. Asynchronous Operation: The plugin operates asynchronously, efficiently managing text input and audio output streams.

  5. Tracing and Logging: It includes tracing and logging functionality for monitoring performance and debugging.
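
The interrupt handling described in item 3 follows a common asyncio pattern: cancel the in-flight generation task, then drain any audio that was queued but not yet played. The sketch below is illustrative only, not the plugin's actual source:

import asyncio

async def handle_interrupt(generation_task: asyncio.Task, audio_queue: asyncio.Queue) -> None:
    # Stop the ongoing TTS generation...
    generation_task.cancel()
    try:
        await generation_task
    except asyncio.CancelledError:
        pass
    # ...and discard audio chunks queued before the interruption.
    while not audio_queue.empty():
        audio_queue.get_nowait()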

Usage

To use the CartesiaTTS plugin in your Voice Bot:

  1. Initialize the plugin in your setup method:
self.tts_node = sp.CartesiaTTS(
    voice_id="95856005-0332-41b0-935f-352e296aa0df",
    api_key=os.getenv("CARTESIA_API_KEY")
)
  2. In your run method, connect the aggregated LLM token stream to the TTS node:
tts_stream: sp.AudioStream = self.tts_node.run(token_aggregator_stream)

This integration allows your Voice Bot to convert text responses into natural-sounding speech, enhancing the interactive experience for users.

Updating voice_bot.py to Use Cartesia TTS

Ensure your voice_bot.py includes the Cartesia TTS integration as shown below:

# voice_bot.py
import json
import os

import outspeed as sp

@sp.App()
class VoiceBot:
    async def setup(self) -> None:
        # Initialize the AI services
        self.deepgram_node = sp.DeepgramSTT(sample_rate=8000)
        self.llm_node = sp.GroqLLM(
            system_prompt="You are a helpful assistant. Keep your answers very short. No special characters in responses.",
        )
        self.token_aggregator_node = sp.TokenAggregator()
        self.tts_node = sp.CartesiaTTS(
            voice_id="95856005-0332-41b0-935f-352e296aa0df",
            api_key=os.getenv("CARTESIA_API_KEY")  # Ensure API key is loaded from environment
        )

    @sp.streaming_endpoint()
    async def run(self, audio_input_queue: sp.AudioStream, text_input_queue: sp.TextStream) -> sp.AudioStream:
        deepgram_stream: sp.TextStream = self.deepgram_node.run(audio_input_queue)

        text_input_queue = sp.map(text_input_queue, lambda x: json.loads(x).get("content"))

        llm_input_queue: sp.TextStream = sp.merge(
            [deepgram_stream, text_input_queue],
        )

        llm_token_stream, chat_history_stream = self.llm_node.run(llm_input_queue)

        token_aggregator_stream: sp.TextStream = self.token_aggregator_node.run(llm_token_stream)
        tts_stream: sp.AudioStream = self.tts_node.run(token_aggregator_stream)

        return tts_stream

    async def teardown(self) -> None:
        await self.deepgram_node.close()
        await self.llm_node.close()
        await self.token_aggregator_node.close()
        await self.tts_node.close()

if __name__ == "__main__":
    VoiceBot().start()
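
With the API keys set (or loaded from .env), start the bot by running the file directly, e.g. python voice_bot.py; VoiceBot().start() launches the app and serves the streaming endpoint defined by run.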