Parler-TTS Latency Measurements
This report summarizes latency measurements for the Parler-TTS text-to-speech system (ai4bharat/indic-parler-tts) under different hardware configurations and optimization methods. Latency is reported in seconds and is the time taken to generate audio for an input of a given word count. All tests were conducted on March 16, 2025.
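The report does not describe the timing procedure, so the following is only a minimal sketch of how a per-request latency like the ones below could be collected. The `measure_latency` helper, its `warmup` and `repeats` parameters, and the `fake_synthesize` stand-in are all hypothetical; in a real run the callable would wrap the model's generation call.

```python
import statistics
import time


def measure_latency(synthesize, text, warmup=1, repeats=3):
    """Time synthesize(text) and return summary statistics in seconds."""
    # Warm-up calls are not timed; compiled models in particular
    # tend to be much slower on their first invocation.
    for _ in range(warmup):
        synthesize(text)

    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        synthesize(text)
        timings.append(time.perf_counter() - start)

    return {
        "words": len(text.split()),
        "min_s": min(timings),
        "mean_s": statistics.mean(timings),
        "max_s": max(timings),
    }


def fake_synthesize(text):
    """Hypothetical stand-in for the real TTS call; sleeps briefly per word."""
    time.sleep(0.001 * len(text.split()))


if __name__ == "__main__":
    print(measure_latency(fake_synthesize, "hello world this is a test"))
```

Reporting the minimum, mean, and maximum over several runs makes run-to-run variation visible, which matters here because repeated 21-word measurements in the table differ by a few seconds.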
Latency Table
Hardware | Optimization Method | Word Count | Latency (s) | Notes |
---|---|---|---|---|
T4 | Simple Transformer | 5 | 4.70 | Baseline measurement |
T4 | Simple Transformer | 21 | 23.63 | Baseline measurement |
T4 | Flash Attention | 5 | 8.16 | Slower than baseline |
T4 | Flash Attention | 21 | 38.29 | Significantly slower than baseline |
L4 | Flash Attention | 5 | 3.99 | Fastest Flash Attention result for 5 words |
L4 | Flash Attention | 21 | 20.82 | Improved over T4 FA |
L4 | Flash Attention (App) | 7 | 7.92 | App request measurement |
A10G | Flash Attention | 21 | 25.52 | Consistent but slower than L4 |
A10G | Flash Attention | 21 | 24.33 | Slight variation in repeated test |
L4 | Torch Compile (Regular) | 1 | 2.72 | Minimal input size |
L4 | Torch Compile (Regular) | 5 | 2.58 | Fastest for small input |
L4 | Torch Compile (Regular) | 7 | 4.70 | Comparable to baseline T4 |
L4 | Torch Compile (Regular) | 21 | 10.65 | Best regular compile for 21 words |
L4 | Torch Compile (Regular) | 21 | 11.99 | Slight variation |
L4 | Torch Compile (Regular) | 21 | 12.10 | Consistent performance |
L4 | Torch Compile (Regular) | 21 | 13.51 | Higher variation |
L4 | Torch Compile (Reduce OH) | 7 | 3.00 | Estimated from "3s - 7 words" |
L4 | Torch Compile (Reduce OH) | 21 | 10.00 | Estimated from "10 s - 21" |
L4 | Torch Compile (Reduce OH) | 21 | 12.00 | Estimated from "12 s - 21 words" |
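To compare configurations at a fixed input size, the table rows can be grouped by hardware and method. The sketch below copies the values from the table; the `MEASUREMENTS` list and the `latency_range` helper are illustrative names, not part of the original test setup.

```python
from collections import defaultdict

# (hardware, method, word_count, latency_s), copied from the table above.
MEASUREMENTS = [
    ("T4", "Simple Transformer", 5, 4.70),
    ("T4", "Simple Transformer", 21, 23.63),
    ("T4", "Flash Attention", 5, 8.16),
    ("T4", "Flash Attention", 21, 38.29),
    ("L4", "Flash Attention", 5, 3.99),
    ("L4", "Flash Attention", 21, 20.82),
    ("L4", "Flash Attention (App)", 7, 7.92),
    ("A10G", "Flash Attention", 21, 25.52),
    ("A10G", "Flash Attention", 21, 24.33),
    ("L4", "Torch Compile (Regular)", 1, 2.72),
    ("L4", "Torch Compile (Regular)", 5, 2.58),
    ("L4", "Torch Compile (Regular)", 7, 4.70),
    ("L4", "Torch Compile (Regular)", 21, 10.65),
    ("L4", "Torch Compile (Regular)", 21, 11.99),
    ("L4", "Torch Compile (Regular)", 21, 12.10),
    ("L4", "Torch Compile (Regular)", 21, 13.51),
    ("L4", "Torch Compile (Reduce OH)", 7, 3.00),   # estimated value
    ("L4", "Torch Compile (Reduce OH)", 21, 10.00),  # estimated value
    ("L4", "Torch Compile (Reduce OH)", 21, 12.00),  # estimated value
]


def latency_range(rows, word_count):
    """Return {(hardware, method): (min_s, max_s, runs)} for one word count."""
    grouped = defaultdict(list)
    for hardware, method, words, latency in rows:
        if words == word_count:
            grouped[(hardware, method)].append(latency)
    return {key: (min(vals), max(vals), len(vals)) for key, vals in grouped.items()}


if __name__ == "__main__":
    ranges = latency_range(MEASUREMENTS, 21)
    # Print configurations from fastest to slowest best-case latency.
    for (hw, method), (lo, hi, runs) in sorted(ranges.items(), key=lambda kv: kv[1][0]):
        print(f"{hw:5} {method:27} {lo:6.2f} - {hi:6.2f} s  (runs={runs})")
```

Most configurations have only one or two runs per word count, so the ranges indicate spread rather than statistically robust bounds.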
Observations
- Hardware Impact:
  - The L4 with Flash Attention gave the fastest Flash Attention result for 5 words (3.99s, versus 8.16s on T4), suggesting better optimization support or higher computational power than the T4.
  - The A10G with Flash Attention was slower for 21 words (24.33-25.52s) than the L4 (20.82s), indicating potential hardware or configuration differences.
- Optimization Methods:
  - Simple Transformer (T4): Served as the baseline, with 4.70s for 5 words and 23.63s for 21 words.
  - Flash Attention: Surprisingly slower than the baseline on T4 (8.16s for 5 words, 38.29s for 21 words), but faster on L4 (3.99s for 5 words, 20.82s for 21 words). This suggests Flash Attention benefits from specific hardware capabilities.
  - Torch Compile (Regular): Consistently faster than Flash Attention on L4, with the best overall result for 5 words (2.58s) and a range of 10.65-13.51s for 21 words.
  - Torch Compile (Reduce Overhead): Showed promising results, with approximately 3s for 7 words and 10-12s for 21 words, indicating potential for lower latency with this mode.
- Input Size:
  - Latency increases with word count. For Torch Compile (Regular), 5 words took 2.58s and 21 words took 10.65s in the best run, about 0.5s per word in both cases, so scaling is roughly proportional over that range. The 1-word run (2.72s) was no faster than the 5-word run, which points to a fixed cost that dominates very short inputs.
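To check how latency scales with input size, the per-word latency can be computed directly from the Torch Compile (Regular) rows of the table. This is a small sketch; the `REGULAR_COMPILE_L4` list and the `seconds_per_word` helper are illustrative names introduced here.

```python
# (word_count, latency_s) for Torch Compile (Regular) on L4, from the table.
REGULAR_COMPILE_L4 = [
    (1, 2.72),
    (5, 2.58),
    (7, 4.70),
    (21, 10.65),
    (21, 11.99),
    (21, 12.10),
    (21, 13.51),
]


def seconds_per_word(rows):
    """Return (words, latency_s, latency_s / words) for each measurement."""
    return [(words, latency, latency / words) for words, latency in rows]


if __name__ == "__main__":
    for words, latency, per_word in seconds_per_word(REGULAR_COMPILE_L4):
        print(f"{words:2d} words  {latency:5.2f} s  {per_word:.2f} s/word")
```

With these numbers, the 5-word run and the best 21-word run both come out at roughly 0.5s per word, while the single-word run costs 2.72s on its own. Since each word count below 21 has a single measurement, this is an indication rather than a fitted scaling law.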
Notes
- The "reduce-overhead" mode values (3s, 10s, 12s) were approximated from shorthand notes; actual measurements might vary slightly.
- All measurements were taken on March 16, 2025, using the ai4bharat/indic-parler-tts model.
- Latency values are in seconds (s), rounded to two decimal places.
- Word counts represent the number of words in the input text.
- "Reduce OH" refers to the "reduce-overhead" mode in Torch Compile.
- The table is grouped by optimization method, then by hardware, with word count increasing within each group.
Conclusion
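Torch Compile on the L4 gave the lowest latencies in these tests, bringing 21-word generation down to 10.65-13.51s compared with 20.82s for Flash Attention on the same hardware. The "reduce-overhead" mode appears to offer the best balance across input sizes (about 3s for 7 words and 10-12s for 21 words), although those values are estimates. The L4 with Flash Attention also performed well relative to the other Flash Attention runs, especially for smaller inputs. For optimal performance, consider using Torch Compile with "reduce-overhead" mode on capable hardware, though further testing with repeated, precisely timed runs could refine these findings.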