Latency Results for Dhwani AI - Speech-to-Speech Voice Assistant

Latency Report

This report presents the latency analysis for Dhwani AI across five GPUs, organized into tables for comparison. It covers total latency, a breakdown by phase (Non-TTS and TTS), and closes with key insights and a recommendation.

Total Latency Across GPUs

The table below summarizes the total latency for three requests across different GPUs, along with the average latency and notable observations.

GPU	Request 1 (s)	Request 2 (s)	Request 3 (s)	Average (s)	Notes
A100	6.668	6.621	6.515	6.601	Consistent performance around 6.5–6.7 seconds.
L40 S	6.536	4.400	4.479	4.440*	First request slower (6.536s); stabilizes at ~4.4s.
L4	11.687	9.344	9.207	9.276*	Improves to ~9.2s after slow first request (11.687s).
T4 Medium	19.504	17.746	17.898	17.822*	High latency, stabilizing at ~17.8s.
T4	20.830	18.643	18.850	18.747*	Slowest overall, around 18.7s after warmup.

Note: Averages marked with * are calculated over requests 2 and 3 only, excluding the first request to account for initialization effects. The A100 average covers all three requests.
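The averaging rule can be reproduced with a short script. This is a minimal sketch that uses only the figures from the table above; the names `total_latency` and `warm_average` are chosen for illustration and are not taken from the Dhwani AI codebase.

```python
from statistics import mean

# Total latency per request in seconds, copied from the table above.
total_latency = {
    "A100": [6.668, 6.621, 6.515],
    "L40 S": [6.536, 4.400, 4.479],
    "L4": [11.687, 9.344, 9.207],
    "T4 Medium": [19.504, 17.746, 17.898],
    "T4": [20.830, 18.643, 18.850],
}


def warm_average(requests):
    """Mean latency of every request after the first (warmup) one."""
    return mean(requests[1:])


for gpu, requests in total_latency.items():
    print(f"{gpu}: all requests = {mean(requests):.3f}s, "
          f"after warmup = {warm_average(requests):.3f}s")
```

Printing both values side by side makes it easy to see which figure a given table cell reports: the A100 row uses the all-requests mean, while the other rows use the after-warmup mean.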

Latency Breakdown by Phase

The latency is broken down into two phases: Non-TTS Phase (transcription to processed text) and TTS Phase (processed text to request completion). Each phase is presented in a separate table.

Non-TTS Phase (Transcription to Processed Text)

GPU	Request 1 (s)	Average (Requests 2–3) (s)	Notes
A100	1.507	~1.5	Consistent across requests.
L40 S	1.515	~1.3	Slightly faster after first request.
L4	1.630	~1.3	Improves after first request.
T4 Medium	2.078	~1.8	Higher latency compared to others.
T4	2.189	~1.9	Highest latency in this phase.

TTS Phase (Processed Text to Request Completion)

GPU	Request 1 (s)	Average (Requests 2–3) (s)	Notes
A100	5.161	~5.0	Consistent performance.
L40 S	5.021	~3.1	Significant improvement after first request.
L4	10.057	~8.0	Reduces after initial request.
T4 Medium	17.426	~16.0	High latency, even after warmup.
T4	18.641	~17.0	Highest TTS latency.
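As a consistency check, the two phase times for the first request should add up to the first-request total reported earlier. The sketch below verifies that and also derives the fraction of each request spent in TTS; the names `request_1` and `tts_share` are illustrative only.

```python
# Request 1 timings in seconds, copied from the tables above:
# (non_tts, tts, total)
request_1 = {
    "A100": (1.507, 5.161, 6.668),
    "L40 S": (1.515, 5.021, 6.536),
    "L4": (1.630, 10.057, 11.687),
    "T4 Medium": (2.078, 17.426, 19.504),
    "T4": (2.189, 18.641, 20.830),
}


def tts_share(non_tts, tts):
    """Fraction of the request time spent in the TTS phase."""
    return tts / (non_tts + tts)


for gpu, (non_tts, tts, total) in request_1.items():
    # The two phases should account for the whole request.
    assert abs((non_tts + tts) - total) < 1e-6, gpu
    print(f"{gpu}: TTS share of request 1 = {tts_share(non_tts, tts):.0%}")
```

On these figures the TTS phase accounts for roughly 77% to 89% of the first request, depending on the GPU.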

Key Insights

Total Latency

Fastest: L40 S (~4.4s after warmup).
Most Consistent: A100 (~6.6s average, 6.5–6.7s across requests).
Moderate: L4 (~9.2s after warmup).
Slowest: T4 (~18.7s) and T4 Medium (~17.8s) after warmup.

Non-TTS Phase

Relatively quick across all GPUs (1.3–2.2s).
Best Performers: L40 S and L4 (~1.3s after warmup).
Slowest: T4 (~1.9s) and T4 Medium (~1.8s) after warmup.

TTS Phase

Primary source of latency variation:
Fastest: L40 S (~3.1s after warmup).
Consistent: A100 (~5s).
Moderate: L4 (~8s after warmup).
Slowest: T4 Medium (~16s) and T4 (~17s) after warmup.

Conclusion

The L40 S GPU delivers the lowest total latency (~4.4s after warmup, of which ~3.1s is the TTS phase), making it the best choice among the tested GPUs for real-time applications like Dhwani AI. The A100 GPU offers the most consistent performance (~6.6s total, ~5s TTS) and is the strongest alternative. The TTS phase is the primary bottleneck, particularly on the T4 (~17s) and T4 Medium (~16s), which makes it the critical area for optimization. The Non-TTS phase varies far less (1.3–2.2s) and has little impact on the overall ranking.
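To put the ranking in relative terms, the reported averages can be expressed as multiples of the fastest GPU. The following is a small sketch over the Average column of the total latency table; `average_latency` and `slowdown` are illustrative names.

```python
# Average total latency in seconds, from the Average column of the total
# latency table (the A100 value averages all three requests, the others
# average requests 2 and 3).
average_latency = {
    "A100": 6.601,
    "L40 S": 4.440,
    "L4": 9.276,
    "T4 Medium": 17.822,
    "T4": 18.747,
}

fastest_gpu = min(average_latency, key=average_latency.get)
baseline = average_latency[fastest_gpu]

# Latency of each GPU as a multiple of the fastest one.
slowdown = {gpu: latency / baseline for gpu, latency in average_latency.items()}

for gpu, factor in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{gpu}: {average_latency[gpu]:.3f}s ({factor:.2f}x the {fastest_gpu})")
```

On these figures the A100 takes about 1.5 times as long as the L40 S, the L4 about 2.1 times, and the two T4 configurations a little over 4 times.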

--

This document provides the latency results for Dhwani AI, a speech-to-speech voice assistant designed for Kannada and other Indian languages. The pipeline processes spoken Kannada input through transcription, translation to English, response generation, translation back to Kannada, and speech synthesis. We evaluated five GPU configurations (A100, L40 S, L4, T4 Medium, and T4) on total request time and per-phase processing time, both derived from server logs.
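For readers who want to collect comparable numbers, the sketch below shows one way per-phase timings like these could be recorded around such a pipeline. It is an assumption-laden illustration, not the Dhwani AI implementation: the stage functions (`transcribe`, `translate`, `generate_response`, `synthesize`) are hypothetical placeholders, and the report does not describe how the server actually logs its timings.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(phase, timings):
    """Add the wall-clock duration of the enclosed block to timings[phase]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[phase] = timings.get(phase, 0.0) + time.perf_counter() - start


# Hypothetical placeholder stages; the real models are not part of this report.
def transcribe(audio):
    return "kannada text"


def translate(text, target):
    return f"{text} -> {target}"


def generate_response(text):
    return f"reply to: {text}"


def synthesize(text):
    return b"audio-bytes"


def handle_request(audio):
    """Run the five pipeline steps and return (speech, per-phase timings)."""
    timings = {}
    with timed("non_tts", timings):
        kannada_in = transcribe(audio)
        english_in = translate(kannada_in, "en")
        english_out = generate_response(english_in)
        kannada_out = translate(english_out, "kn")
    with timed("tts", timings):
        speech = synthesize(kannada_out)
    timings["total"] = timings["non_tts"] + timings["tts"]
    return speech, timings


speech, timings = handle_request(b"raw-audio")
print({phase: f"{seconds:.6f}s" for phase, seconds in timings.items()})
```

Grouping the first four steps under one timer mirrors the Non-TTS and TTS split used in this report; timing each step separately would only require more `timed` blocks.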

Total Latency Across GPUs

The total request time represents the end-to-end duration from receiving audio input to delivering the spoken response. Below are the results for three requests per GPU, showing consistency and initialization effects:

A100:
Request 1: 6.668 seconds
Request 2: 6.621 seconds
Request 3: 6.515 seconds
Average (all three requests): 6.601 seconds
Note: Stable performance around 6.5–6.7 seconds.
L40 S:
Request 1: 6.536 seconds
Request 2: 4.400 seconds
Request 3: 4.479 seconds
Average (after first request): 4.440 seconds
Note: First request slower due to initialization; stabilizes at ~4.4 seconds.
L4:
Request 1: 11.687 seconds
Request 2: 9.344 seconds
Request 3: 9.207 seconds
Average (after first request): 9.276 seconds
Note: Improves to ~9.2 seconds after a slow first request.
T4 Medium:
Request 1: 19.504 seconds
Request 2: 17.746 seconds
Request 3: 17.898 seconds
Average (after first request): 17.822 seconds
Note: High latency, stabilizing at ~17.8 seconds.
T4:
Request 1: 20.830 seconds
Request 2: 18.643 seconds
Request 3: 18.850 seconds
Average (after first request): 18.747 seconds
Note: Slowest overall, around 18.7 seconds after warmup.

Summary of Total Latency

Fastest: L40 S (~4.4 seconds after warmup).
Most Consistent: A100 (~6.6 seconds).
Moderate: L4 (~9.2 seconds after warmup).
Slowest: T4 (~18.7 seconds) and T4 Medium (~17.8 seconds).
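The initialization effect can be quantified as the gap between the first request and the mean of the later ones. This is a minimal sketch over the figures listed above; `requests_by_gpu` and `warmup_overhead` are illustrative names.

```python
from statistics import mean

# Total latency per request in seconds, copied from the results above.
requests_by_gpu = {
    "A100": [6.668, 6.621, 6.515],
    "L40 S": [6.536, 4.400, 4.479],
    "L4": [11.687, 9.344, 9.207],
    "T4 Medium": [19.504, 17.746, 17.898],
    "T4": [20.830, 18.643, 18.850],
}


def warmup_overhead(requests):
    """Extra seconds the first request takes over the mean of the later ones."""
    return requests[0] - mean(requests[1:])


overhead = {gpu: warmup_overhead(reqs) for gpu, reqs in requests_by_gpu.items()}

for gpu, seconds in overhead.items():
    print(f"{gpu}: first request is {seconds:.2f}s slower than later requests")
```

On these figures the first request costs roughly 1.7 to 2.4 seconds extra on every GPU except the A100, where the difference is only about 0.1 seconds.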

Latency Breakdown by Phase

The pipeline splits into two main phases:

1. Non-TTS Phase: Transcription, translation to English, response generation, and translation to Kannada.
2. TTS Phase: Text-to-speech synthesis of the Kannada response.

The breakdown below gives the first-request time for each phase, followed by the average over requests 2 and 3 to separate out initialization effects:

Non-TTS Phase

A100:
Request 1: 1.507 seconds
Average (Requests 2–3): ~1.5 seconds
L40 S:
Request 1: 1.515 seconds
Average (Requests 2–3): ~1.3 seconds
L4:
Request 1: 1.630 seconds
Average (Requests 2–3): ~1.3 seconds
T4 Medium:
Request 1: 2.078 seconds
Average (Requests 2–3): ~1.8 seconds
T4:
Request 1: 2.189 seconds
Average (Requests 2–3): ~1.9 seconds

TTS Phase

A100:
Request 1: 5.161 seconds
Average (Requests 2–3): ~5 seconds
L40 S:
Request 1: 5.021 seconds
Average (Requests 2–3): ~3.1 seconds
L4:
Request 1: 10.057 seconds
Average (Requests 2–3): ~8 seconds
T4 Medium:
Request 1: 17.426 seconds
Average (Requests 2–3): ~16 seconds
T4:
Request 1: 18.641 seconds
Average (Requests 2–3): ~17 seconds

Phase Insights

Non-TTS: Quick across GPUs (1.3–2.2 seconds), with L40 S and L4 leading (~1.3 seconds after warmup).
TTS: Major contributor to latency differences:
L40 S excels (~3 seconds after warmup).
A100 steady (~5 seconds).
L4 moderate (~8 seconds).
T4 Medium and T4 lag (~16–17 seconds).
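The claim that TTS drives the differences between GPUs can be checked by comparing how far each phase varies across the five configurations. The sketch below uses the approximate after-warmup averages from the breakdown above; `warm_phase_latency` and `spread` are illustrative names.

```python
# Approximate after-warmup phase latencies in seconds (averages of
# requests 2 and 3), taken from the breakdown above.
warm_phase_latency = {
    "A100": {"non_tts": 1.5, "tts": 5.0},
    "L40 S": {"non_tts": 1.3, "tts": 3.1},
    "L4": {"non_tts": 1.3, "tts": 8.0},
    "T4 Medium": {"non_tts": 1.8, "tts": 16.0},
    "T4": {"non_tts": 1.9, "tts": 17.0},
}


def spread(phase):
    """Difference between the slowest and fastest GPU for one phase."""
    values = [phases[phase] for phases in warm_phase_latency.values()]
    return max(values) - min(values)


non_tts_spread = spread("non_tts")
tts_spread = spread("tts")

print(f"Non-TTS spread across GPUs: {non_tts_spread:.1f}s")
print(f"TTS spread across GPUs: {tts_spread:.1f}s")
```

With these rounded inputs the Non-TTS phase differs by about 0.6 seconds between the fastest and slowest GPU, while the TTS phase differs by about 13.9 seconds.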

Conclusion

The L40 S GPU offers the lowest latency (~4.4 seconds total, ~3.1 seconds TTS after warmup), making it the best fit for real-time use. The A100 is next (~6.6 seconds total, ~5 seconds TTS) and is the most consistent across requests. The TTS phase drives most of the latency variation, especially on slower GPUs such as the T4 Medium and T4 (~18–19 seconds total), which makes it the critical area for optimization.