Skip to content

Technical Specification Document: Kannada Voice Model Development Demo

Project Overview

The Kannada Voice Model Development project aims to create a robust voice assistant solution for the Kannada language, leveraging open-source Large Language Models (LLMs) and tools from AI4BHARAT. The demo will showcase three core functionalities: Automatic Speech Recognition (ASR), Text-to-Speech (TTS), and translation services, tailored specifically for Kannada and extensible to other Indian languages. This technical specification outlines the system requirements, architecture, and deliverables for a functional demo to be developed over a three-month period with GPU access.


Objectives

  • Demonstrate real-time ASR for converting spoken Kannada into text.
  • Showcase natural-sounding TTS for converting Kannada text into speech.
  • Highlight translation capabilities between Kannada and at least one other Indian language (e.g., Hindi or Tamil).
  • Prove scalability and performance using GPU-accelerated processing.

Technical Requirements

1. Hardware Requirements

Current Setup

  • Laptop: GTX 1060 with 6GB VRAM
  • Limitations: Insufficient for large-scale training and real-time inference under load.

Demo Requirements

  • GPU Resources:
  • Minimum: 1 GPU with at least 12GB VRAM (e.g., RTX 4090 or equivalent).
  • Recommended: 3 GPUs with 24GB VRAM each (e.g., A100 or RTX A6000) for scalability testing.
  • Runtime:
    • Month 1: 8 GPU hours/day (development).
    • Month 2: 16 GPU hours/day (scalability tests).
    • Month 3: 24 GPU hours/day (large-scale testing).
  • Storage: 80GB minimum for model weights, datasets, and logs.
  • RAM: 18GB minimum for data preprocessing and inference.
  • vCPUs: 2-4 cores for parallel processing.

2. Software Requirements

Open-Source Tools

  • ASR: ASR Indic Server
  • Framework: Likely based on PyTorch or TensorFlow.
  • Model: Pre-trained Indic ASR model fine-tuned for Kannada.
  • TTS: Parler TTS Server
  • Framework: PyTorch.
  • Model: Pre-trained TTS model adapted for Kannada phonetics.
  • Translation: Indic Translate Server
  • Framework: Likely Transformer-based (e.g., Hugging Face models).
  • Model: Fine-tuned for Kannada-to-other Indian language translation.

Dependencies

  • Operating System: Ubuntu 22.04 LTS .
  • Programming Language: Python 3.10+.
  • Libraries:
  • PyTorch (GPU-enabled).
  • NumPy, Pandas (data handling).
  • Hugging Face Transformers (for model fine-tuning).
  • Torchaudio (audio processing).
  • FastAPI (for server deployment).

Dataset

  • Source: AI4BHARAT datasets (e.g., IndicSpeech, IndicTTS).
  • Size: ~10-20GB of Kannada audio/text pairs for training and validation.
  • Preprocessing: Audio normalization, text tokenization.

System Architecture

1. High-Level Architecture

[User Input] --> [ASR Module] --> [Text Processing] --> [TTS Module] --> [Audio Output]
                |                         |
                |--------------------[Translation Module] --> [Translated Text]
- Input: Microphone-captured Kannada speech or text input. - Output: Spoken Kannada audio or translated text/speech.

2. Component Breakdown

ASR Module

  • Function: Convert spoken Kannada to text.
  • Model: Fine-tuned Indic ASR model.
  • Input: WAV audio (16kHz, mono).
  • Output: Kannada text (UTF-8 encoded).
  • Latency Goal: < 500ms for real-time demo.

TTS Module

  • Function: Convert Kannada text to natural-sounding speech.
  • Model: Fine-tuned Parler TTS model.
  • Input: Kannada text (UTF-8 encoded).
  • Output: WAV audio (22kHz, mono).
  • Quality Goal: MOS (Mean Opinion Score) > 4.0.

Translation Module

  • Function: Translate Kannada text to another Indian language (e.g., Hindi).
  • Model: Fine-tuned Indic Translate model.
  • Input: Kannada text (UTF-8 encoded).
  • Output: Translated text (UTF-8 encoded).
  • Accuracy Goal: BLEU score > 0.8.

Server Infrastructure

  • Deployment: Flask/FastAPI server hosted on cloud GPU instance.
  • API Endpoints:
  • /asr: Audio → Text.
  • /tts: Text → Audio.
  • /translate: Text → Translated Text.

Demo Deliverables

  1. Live Demonstration:
  2. User speaks a Kannada phrase → System transcribes it → System responds with spoken Kannada output.
  3. User inputs Kannada text → System translates to Hindi and speaks it back.
  4. Performance Metrics:
  5. ASR latency, TTS quality (MOS), translation accuracy (BLEU).
  6. Source Code: GitHub repository with server and model configurations.
  7. Documentation: README with setup instructions and API usage.

Risks and Mitigation

  • Risk: Insufficient GPU performance for real-time inference.
  • Mitigation: Start with a single high-VRAM GPU (e.g., RTX 4090) and scale as needed.
  • Risk: Dataset quality affects model accuracy.
  • Mitigation: Validate and augment AI4BHARAT datasets with additional Kannada samples if required.
  • Risk: Cost overrun beyond $1,800.
  • Mitigation: Monitor usage daily and adjust GPU hours if approaching budget limits.

Conclusion

This technical specification outlines the requirements and roadmap for a successful demo of the Kannada Voice Model. With GPU access secured for three months, the project will deliver a functional, scalable voice assistant solution comparable to industry standards, tailored for Kannada speakers. The demo will serve as a proof-of-concept for further funding and development.