Introduction
Language barriers remain one of the biggest challenges in communication. Whether you are holding an all-hands meeting for a globally distributed team, consulting with non-native-speaking patients, or teaching students across continents, seamless real-time translation makes or breaks effective communication.
Traditional translation tools feel impersonal and disconnected. Text captions scroll across screens while speakers continue in their native tongue, creating a disjointed experience. What if your audience could see and hear an AI avatar speaking directly to them in their own language, with natural lip-sync and human-like expressions?
We can use Azure Speech Translation and Avatar to address this: a speaker talks in one language, and listeners watch an AI avatar deliver the translated speech in their chosen language. Imagine a CEO in Tokyo delivering a quarterly update. Employees in Munich, São Paulo, and Mumbai each see an AI avatar speaking to them in German, Portuguese, and Hindi respectively—all in real-time, with synchronized lip movements and natural speech patterns. The speaker focuses on their message; the technology handles the rest.
In this blog, we will walk through a sample implementation that uses Azure Speech, Translation, and Avatar capabilities.
How It Works
📚 Ready to build your own real-time translation avatar application? Grab the complete source code and documentation from GitHub: github.com/l-sudarsan/avatar-translation
The application uses a session-based Speaker/Listener architecture to separate the presenter's control interface from the audience's viewing experience. The speaker creates and configures a session based on their requirements.
Speaker Mode
The speaker interface gives presenters full control over the translation session:
- Session Management: Create sessions and generate shareable listener URLs
- Language Configuration: Select source language (what you speak) and target language (what listeners hear)
- Avatar Selection: Choose from prebuilt or custom avatars for the translation output
- Real-time Feedback: View live transcription of your speech and monitor listener count
- No Avatar Display: The interface intentionally hides the avatar video/audio to prevent microphone feedback loops
Listener Mode
The listener interface delivers an immersive, distraction-free viewing experience:
- Easy Access: Join via a simple URL containing the session code (e.g., /listener/123456)
- Avatar Video: Watch the AI avatar with synchronized lip movements matching the translated speech
- Translated Audio: Hear the avatar speak the translation in the target language
- Caption Display: Read real-time translation text alongside the avatar
- Translation History: Scroll through all translations from the session
Data Flow & Solution Components
The diagram below shows data flow and how the components interact. The Flask server acts as the central hub, coordinating communication between the speaker's browser, Azure Speech Services, and multiple listener clients.
Implementation Deep Dive
You can check the complete source code in the GitHub repository.
Core Components
Five main technical components power the application, each handling a specific part of the translation pipeline.
1. Backend: Flask + Socket.IO
The server uses Flask and Flask-SocketIO with the Eventlet async worker for WebSocket support. This combination delivers:
- HTTP endpoints for session management and avatar connection
- WebSocket rooms for real-time translation broadcasting
- Session storage for managing multiple concurrent translation sessions
# Session structure
sessions = {
    "123456": {
        "name": "Q1 Townhall",
        "source_language": "en-US",
        "target_language": "ja-JP",
        "avatar": "lisa",
        "listeners": set(),
        "is_translating": False
    }
}
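To show how a session like this might come into existence, here is a minimal sketch of the server-side session creation logic. The helper name `create_session` and its parameters are assumptions for illustration; in the real app this logic would sit inside a Flask route, which is omitted here.

```python
import secrets

def create_session(name, source_language, target_language, avatar, sessions):
    """Sketch (hypothetical helper): generate a unique 6-digit session
    code and register the session in the in-memory store. In the actual
    app this would run inside a Flask HTTP endpoint."""
    while True:
        code = f"{secrets.randbelow(1_000_000):06d}"  # e.g. "123456"
        if code not in sessions:  # avoid colliding with an active session
            break
    sessions[code] = {
        "name": name,
        "source_language": source_language,
        "target_language": target_language,
        "avatar": avatar,
        "listeners": set(),       # Socket.IO sids of joined listeners
        "is_translating": False,
    }
    return code
```

The 6-digit code doubles as both the shareable listener URL suffix and the Socket.IO room name, which keeps routing simple.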
2. Audio Streaming: Browser to Server
Instead of relying on server-side microphone access, the browser captures audio directly using the Web Audio API:
// Speaker captures microphone at 16kHz
const audioContext = new AudioContext({ sampleRate: 16000 });
const mediaStream = await navigator.mediaDevices.getUserMedia({ audio: true });
const source = audioContext.createMediaStreamSource(mediaStream);
const processor = audioContext.createScriptProcessor(4096, 1, 1);
source.connect(processor);
processor.connect(audioContext.destination);

// Convert each buffer to 16-bit PCM and send via Socket.IO
processor.onaudioprocess = (event) => {
    const pcmData = convertToPCM16(event.inputBuffer);
    socket.emit('audioData', { sessionId, audioData: pcmData });
};
This approach works seamlessly across different deployment environments without requiring server microphone permissions.
3. Azure Speech Translation
The server receives audio chunks and feeds them to Azure's TranslationRecognizer via a PushAudioInputStream:
# Configure translation
translation_config = speechsdk.translation.SpeechTranslationConfig(
    subscription=SPEECH_KEY,
    region=SPEECH_REGION
)
translation_config.speech_recognition_language = "en-US"
translation_config.add_target_language("ja")

# Push audio stream fed by chunks arriving from the browser
push_stream = speechsdk.audio.PushAudioInputStream()
audio_config = speechsdk.audio.AudioConfig(stream=push_stream)

# Handle recognition results
def on_recognized(evt):
    translation = evt.result.translations["ja"]
    socketio.emit('translationResult', {
        'original': evt.result.text,
        'translated': translation
    }, room=session_id)

# Wire up the recognizer and start continuous recognition
recognizer = speechsdk.translation.TranslationRecognizer(
    translation_config=translation_config,
    audio_config=audio_config
)
recognizer.recognized.connect(on_recognized)
recognizer.start_continuous_recognition()
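On the receiving end, the server's Socket.IO `audioData` handler just needs to route each PCM chunk into the right session's push stream. A minimal sketch of that routing logic follows; the function name `on_audio_data` and the `push_streams` mapping are assumptions for illustration, and any object with a `write()` method stands in for `speechsdk.audio.PushAudioInputStream` here.

```python
def on_audio_data(data, sessions, push_streams):
    """Sketch (hypothetical handler body): look up the session for an
    incoming audio chunk and feed the raw PCM16 bytes into its push
    stream. In the app this runs inside a @socketio.on('audioData')
    handler; push_streams maps session id -> audio input stream."""
    session_id = data["sessionId"]
    # Drop audio for unknown sessions or when translation is paused
    if session_id not in sessions or not sessions[session_id]["is_translating"]:
        return 0
    chunk = bytes(data["audioData"])
    push_streams[session_id].write(chunk)
    return len(chunk)
```

Because the recognizer consumes the push stream continuously, the handler only has to write bytes; the Speech SDK handles buffering and utterance segmentation.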
4. Avatar Synthesis with WebRTC
Each listener establishes a WebRTC connection to Azure's Avatar Service:
- ICE Token Exchange: Server provides TURN server credentials
- SDP Negotiation: Browser and Azure exchange session descriptions
- Avatar Connection: Listener sends local SDP offer, receives remote answer
- Video Stream: Avatar video flows directly to listener via WebRTC
// Listener connects to avatar: request receive-only audio/video tracks
const peerConnection = new RTCPeerConnection(iceConfig);
peerConnection.addTransceiver('video', { direction: 'recvonly' });
peerConnection.addTransceiver('audio', { direction: 'recvonly' });
const offer = await peerConnection.createOffer();
await peerConnection.setLocalDescription(offer);

// Send the local SDP offer to Azure via the Flask server
const response = await fetch('/api/connectListenerAvatar', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json', 'session-id': sessionId },
    body: JSON.stringify({ sdp: offer.sdp })
});
const { sdp: remoteSdp } = await response.json();
await peerConnection.setRemoteDescription({ type: 'answer', sdp: remoteSdp });
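Before the SDP exchange, the server fetches TURN credentials (the ICE token) from Azure on the listener's behalf. The sketch below only builds the request; the endpoint path matches Azure's published avatar samples, but treat both the URL and the helper name as assumptions and confirm against the current Azure Speech documentation for your region.

```python
def build_ice_token_request(region, speech_key):
    """Sketch (hypothetical helper): assemble the URL and headers for
    fetching the avatar relay (ICE/TURN) token from Azure. The actual
    HTTP GET and JSON parsing are omitted."""
    url = (f"https://{region}.tts.speech.microsoft.com"
           "/cognitiveservices/avatar/relay/token/v1")
    headers = {"Ocp-Apim-Subscription-Key": speech_key}
    return url, headers
```

The returned JSON (TURN URLs, username, credential) becomes the `iceConfig` the listener's browser passes to `RTCPeerConnection`.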
5. Real-Time Broadcasting
When the speaker talks, translations flow to all listeners simultaneously.
Each listener maintains their own WebRTC connection to the Avatar Service, ensuring independent video streams while receiving synchronized translation text.
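The fan-out itself is a single room-scoped emit. Here is a minimal sketch of that broadcast step; the function name `broadcast_translation` is an assumption, and `emit` stands in for `socketio.emit` so the logic can be shown without a running server.

```python
def broadcast_translation(session_id, original, translated, sessions, emit):
    """Sketch (hypothetical helper): send one recognized utterance and
    its translation to every listener in the session's Socket.IO room.
    The room name equals the session code."""
    if session_id not in sessions:
        return False  # session ended or never existed
    emit("translationResult",
         {"original": original, "translated": translated},
         room=session_id)
    return True
```

Because every listener joined the room named after the session code, one emit reaches all of them; each browser then displays the caption while its own WebRTC avatar stream speaks the translated text.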