# Amazon Nova Sonic Speech-to-Speech Implementation

This directory contains a complete implementation of a speech-to-speech conversational AI system using Amazon Nova Sonic. The system enables real-time voice conversations between users and an AI assistant through a web browser interface.
The Amazon Nova Sonic Speech-to-Speech implementation is built on a modern, distributed architecture that enables real-time bidirectional voice communication between users and an AI assistant. The system consists of three main components:
- Server: A FastAPI-based backend that manages connections, creates communication rooms, and spawns AI bot instances
- Bot: A sophisticated pipeline that processes audio streams, transcribes speech, generates responses using Amazon Nova Sonic, and synthesizes speech
- Client: A browser-based frontend that handles WebRTC connections, audio/video streaming, and user interactions
The system uses WebRTC (via Daily's transport layer) for real-time communication, enabling low-latency, high-quality audio streaming in both directions. The entire infrastructure is deployed on AWS using containerized services for scalability and reliability.
The server implementation (`server.py`) provides the foundation for the speech-to-speech system:
- Built with FastAPI for high-performance API endpoints
- Manages client connections and authentication
- Creates Daily rooms for WebRTC communication
- Spawns and monitors Amazon Nova Sonic bot instances
- Provides health check and status endpoints
Key server endpoints:
- `/connect`: Creates a Daily room and returns connection credentials
- `/health`: Health check endpoint for monitoring
- `/status/{pid}`: Gets the status of a specific bot process
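A hedged sketch of how these endpoints could fit together; the helper names, Daily REST payloads, and the bot's CLI flags below are assumptions for illustration, not the repository's actual code:

```python
import os
import subprocess

import httpx
from fastapi import FastAPI, HTTPException

app = FastAPI()
bot_procs: dict[int, subprocess.Popen] = {}  # pid -> running bot process

DAILY_API_KEY = os.environ["DAILY_API_KEY"]
DAILY_API_URL = os.getenv("DAILY_API_URL", "https://api.daily.co/v1")


@app.post("/connect")
async def connect():
    async with httpx.AsyncClient(
        headers={"Authorization": f"Bearer {DAILY_API_KEY}"}
    ) as client:
        # Create a room, then mint a meeting token for it (Daily REST API)
        room = (await client.post(f"{DAILY_API_URL}/rooms")).json()
        token_resp = await client.post(
            f"{DAILY_API_URL}/meeting-tokens",
            json={"properties": {"room_name": room["name"]}},
        )
        token = token_resp.json()["token"]

    # Spawn a bot instance that joins the room (CLI flags are hypothetical)
    proc = subprocess.Popen(
        ["python", "bot_bedrock_nova.py", "-u", room["url"], "-t", token]
    )
    bot_procs[proc.pid] = proc

    return {"room_url": room["url"], "token": token}


@app.get("/health")
async def health():
    return {"status": "ok"}


@app.get("/status/{pid}")
async def status(pid: int):
    proc = bot_procs.get(pid)
    if proc is None:
        raise HTTPException(status_code=404, detail="Bot not found")
    return {
        "bot_pid": pid,
        "status": "running" if proc.poll() is None else "finished",
    }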
The bot implementation (`bot_bedrock_nova.py`) handles the core AI functionality:
- Establishes bidirectional WebRTC connections using DailyTransport
- Processes audio streams using voice activity detection
- Transcribes user speech to text
- Generates contextual responses using Amazon Nova Sonic LLM
- Synthesizes speech from text responses
- Manages visual animations for the bot interface
The bot uses a sophisticated pipeline architecture for processing:
- Input Stage: Captures audio from WebRTC and processes it through voice activity detection
- Transcription: Converts speech to text
- Context Management: Maintains conversation context using OpenAILLMContext
- Language Processing: Generates responses using AWSNovaSonicLLMService
- Speech Synthesis: Converts text responses to speech
- Output Stage: Streams synthesized speech and visual animations back to the client
The pipeline is designed for real-time processing with minimal latency, enabling natural conversational flow.
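The class names above (DailyTransport, OpenAILLMContext, AWSNovaSonicLLMService) point to the Pipecat framework. As a hedged illustration only, the stages might be wired together roughly as follows; the import paths and constructor arguments are assumptions that vary between Pipecat versions:

```python
# Sketch of the bot's pipeline; import paths and constructor arguments
# are assumptions and may differ between Pipecat versions.
from pipecat.audio.vad.silero import SileroVADAnalyzer
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.runner import PipelineRunner
from pipecat.pipeline.task import PipelineTask
from pipecat.processors.aggregators.openai_llm_context import OpenAILLMContext
from pipecat.services.aws_nova_sonic.aws import AWSNovaSonicLLMService
from pipecat.transports.services.daily import DailyParams, DailyTransport


async def run_bot(room_url: str, token: str):
    # Input/output stage: WebRTC audio via Daily, with Silero VAD
    transport = DailyTransport(
        room_url,
        token,
        "Nova Sonic Bot",
        DailyParams(
            audio_in_enabled=True,
            audio_out_enabled=True,
            vad_analyzer=SileroVADAnalyzer(),
        ),
    )

    # Language processing: Nova Sonic handles transcription, response
    # generation, and speech synthesis inside a single service
    llm = AWSNovaSonicLLMService(
        access_key_id="...",      # NOVA_AWS_ACCESS_KEY_ID
        secret_access_key="...",  # NOVA_AWS_SECRET_ACCESS_KEY
        region="us-east-1",       # NOVA_AWS_REGION
        voice_id="tiffany",       # NOVA_VOICE_ID
    )

    # Context management: conversation history shared across turns
    context = OpenAILLMContext(
        [{"role": "system", "content": "You are a friendly voice assistant."}]
    )
    context_aggregator = llm.create_context_aggregator(context)

    pipeline = Pipeline([
        transport.input(),               # audio frames in from WebRTC
        context_aggregator.user(),       # aggregate the user's turn
        llm,                             # Nova Sonic speech-to-speech
        transport.output(),              # synthesized audio back out
        context_aggregator.assistant(),  # record the bot's turn
    ])

    await PipelineRunner().run(PipelineTask(pipeline))
```

In this sketch the transcription and synthesis stages live inside the single Nova Sonic service rather than as separate pipeline processors, since Nova Sonic is a speech-to-speech model.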
The client implementation (`app.js`) establishes and manages the WebRTC connection:
- Uses the RTVI client library for standardized communication
- Connects to the server's `/connect` endpoint to obtain room credentials
- Establishes a WebRTC connection via DailyTransport
- Handles connection state changes and reconnection logic
The client manages audio and video streams:
- Captures user microphone input
- Processes incoming bot audio for playback
- Displays bot video/animations
- Handles media track events (start/stop)
- Cleans up media resources on disconnection
The client provides a simple but effective user interface:
- Connect/disconnect controls
- Connection status display
- Transcript display for both user and bot speech
- Visual feedback through bot animations
- Debug logging for troubleshooting
The entire infrastructure is defined using AWS CDK with TypeScript:
- Enables reproducible deployments
- Manages all AWS resources as code
- Simplifies updates and modifications
- Provides consistent environments
Both server and client components are containerized:
- Server Container:
  - Python FastAPI application
  - Deployed on AWS Fargate
  - Configured with appropriate memory and CPU resources
  - Environment variables and secrets managed securely
- Client Container:
  - Vite-based web application
  - Nginx for static file serving
  - Deployed on AWS Fargate
  - Configured for optimal performance
The architecture includes comprehensive networking:
- Custom VPC with public and private subnets
- Application Load Balancers for both server and client
- HTTPS with SSL/TLS certificates
- Custom domain names via Route 53
- Health checks and monitoring
Security is implemented at multiple levels:
- HTTPS for all communications
- Secrets stored in AWS Parameter Store
- Environment-specific configuration
- Proper IAM roles and permissions
- Network security groups
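For example, a secret stored in Parameter Store can be read at startup with boto3; the parameter path below is hypothetical:

```python
import boto3

# Read a SecureString parameter from AWS Systems Manager Parameter Store
ssm = boto3.client("ssm", region_name="us-east-1")
response = ssm.get_parameter(
    Name="/speech-to-speech/DAILY_API_KEY",  # hypothetical parameter path
    WithDecryption=True,                     # decrypt SecureString values
)
daily_api_key = response["Parameter"]["Value"]
```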
The diagram above illustrates the AWS architecture of the speech-to-speech implementation:
- Users connect to the client application through a load balancer
- The client application runs in a Fargate container in a private subnet
- The client communicates with the server through WebRTC/HTTPS
- The server runs in a separate Fargate container in a private subnet
- The server communicates with Amazon Nova Sonic for language processing
- Supporting services include Route 53 for DNS, ACM for certificates, S3 for storage, and CloudWatch for logging
To run the application locally:
- Clone the repository
- Set up environment variables (see below)
- Start the server:

  ```bash
  cd daily/server
  pip install -r requirements.txt
  python server.py
  ```

- Start the client:

  ```bash
  cd daily/vite-client
  npm install
  npm run dev
  ```
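Once both processes are up, the server's `/connect` endpoint can be smoke-tested from Python; the exact response shape depends on `server.py`:

```python
import httpx

# POST /connect should create a Daily room and spawn a bot instance
resp = httpx.post("http://localhost:8000/connect")
resp.raise_for_status()
print(resp.json())  # e.g. a room URL plus a meeting token
```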
Environment variables:

- `DAILY_API_KEY`: API key for Daily.co
- `DAILY_API_URL`: URL for the Daily API (default: https://api.daily.co/v1)
- `NOVA_AWS_ACCESS_KEY_ID`: AWS access key for Nova Sonic
- `NOVA_AWS_SECRET_ACCESS_KEY`: AWS secret key for Nova Sonic
- `NOVA_AWS_REGION`: AWS region for Nova Sonic (default: us-east-1)
- `NOVA_VOICE_ID`: Voice ID for Nova Sonic (default: tiffany)
- `VITE_BASE_URL`: Base URL for the server API, used by the client (default: http://localhost:8000)
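A sketch of how the server side might resolve these variables, mirroring the defaults listed above (the actual code in `server.py` may differ):

```python
import os

daily_api_key = os.environ["DAILY_API_KEY"]  # required; no default
daily_api_url = os.getenv("DAILY_API_URL", "https://api.daily.co/v1")
nova_region = os.getenv("NOVA_AWS_REGION", "us-east-1")
nova_voice_id = os.getenv("NOVA_VOICE_ID", "tiffany")
# VITE_BASE_URL is consumed at build time by the Vite client, not by Python
```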