Project Dhwani: Enhancing Kannada Voice Model Development with GPU Access
Table of Contents
- Summary
- Introduction
- Budget
- Project Scope
- Proposed Plan
- Test Cloud Provider
- Alternate Cloud Providers for GPU Access
- Additional Reading Materials
- Conclusion
- Contact Information
Summary
Dhwani is a self-hosted GenAI platform designed to provide voice mode interaction for Kannada and other Indian languages.
Research Goals
- Measure and improve the Time to First Token Generation (TTFTG) for model architectures in ASR, Translation, and TTS systems.
- Develop and enhance a Kannada voice model that meets the industry standards set by OpenAI, Google, ElevenLabs, and xAI.
- Create robust voice solutions for Indian languages, with a specific emphasis on Kannada.
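As a rough illustration of the first research goal, the sketch below times how long a streaming response takes to produce its first item. It assumes the model output is exposed as a Python iterator; `fake_model_stream` is a hypothetical stand-in and not part of Dhwani or any of the Indic servers.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_token(stream: Iterable[T]) -> Tuple[float, T, Iterator[T]]:
    """Return (seconds until the first item, the first item, the remaining iterator)."""
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the stream yields its first output
    elapsed = time.perf_counter() - start
    return elapsed, first, iterator


def fake_model_stream(delay_s: float = 0.05):
    """Hypothetical stand-in for a streaming ASR, translation, or TTS response."""
    time.sleep(delay_s)  # simulated latency before the first token
    yield "ನಮಸ್ಕಾರ"
    yield "ಕರ್ನಾಟಕ"


if __name__ == "__main__":
    ttft, first, rest = time_to_first_token(fake_model_stream())
    print(f"TTFT: {ttft * 1000:.1f} ms, first token: {first}")
```

The same helper could wrap each stage (ASR, Translation, TTS) separately, so that per-stage latencies can be compared across model architectures.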
Introduction
Project Website - https://slabstech.com/dhwani
Report - Doc
Presentation - Slides
Background
Current voice assistants like Alexa, Siri, and Google dominate the consumer market but lack comprehensive support for Indian languages, particularly Kannada. OpenAI's recent entry into the voice assistant market highlights the growing demand for such technologies. By utilizing open-source models and tools, we can develop a voice solution that is accessible and robust, specifically tailored for Kannada speakers.
Objectives
The primary objective is to integrate and enhance the following models and services for Kannada:
- Automatic Speech Recognition (ASR): To convert spoken Kannada into text.
- Text-to-Speech (TTS): To convert Kannada text into natural-sounding speech.
- Translation Services: To enable translation between Kannada and other Indian languages.
Models and Tools
The project utilizes the following open-source tools:
Open-Source Tool | Source Repository | CPU / Available 24/7 - Free | GPU / On-demand |
---|---|---|---|
Automatic Speech Recognition (ASR) | ASR Indic Server | API Demo | - |
Text to Speech (TTS) | TTS Indic Server | Not suitable | App Demo |
Translation | Indic Translate Server | API Demo | |
Large Language Model (LLM) | LLM Indic Server | API Demo | |
Document Parser | Indic Document Server | Not suitable | - |
All-in-One Server (ASR + TTS + Translate) | indic-all-server | Not suitable | - |
Target Solution
[Figures: Answer Engine | Voice Translation]
Budget
Cloud Providers
- Cost: Estimated $2,880 for three months of cloud-based GPU access.
- Justification: Necessary for initial infra setup, model optimization and performance evaluation.
On-Premise GPU Setup
- Cost: $4,000 for hardware and setup: RTX 4090 - Workstation with 24GB VRAM
- Justification: Long-term investment for sustainable development and scalability.
We will target a single-GPU implementation.
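To put the two options side by side, the following sketch estimates how many cloud GPU-hours the on-premise cost would buy. It uses the $4,000 hardware figure above and the $0.5 per GPU-hour rate from the cost breakdown. It deliberately ignores electricity, maintenance, and depreciation, which is a simplifying assumption.

```python
def break_even_gpu_hours(hardware_cost_usd: float, cloud_rate_usd_per_hour: float) -> float:
    """GPU-hours of cloud usage that cost as much as buying the hardware outright."""
    if cloud_rate_usd_per_hour <= 0:
        raise ValueError("cloud rate must be positive")
    return hardware_cost_usd / cloud_rate_usd_per_hour


ON_PREM_COST_USD = 4000.0     # RTX 4090 workstation, from the budget above
CLOUD_RATE_USD_PER_HOUR = 0.5  # rate used in the cost breakdown

if __name__ == "__main__":
    hours = break_even_gpu_hours(ON_PREM_COST_USD, CLOUD_RATE_USD_PER_HOUR)
    print(f"Break-even after {hours:.0f} GPU-hours")
    print(f"That is about {hours / 24:.0f} days of continuous single-GPU use")
```

Under these assumptions the break-even point is 8,000 GPU-hours, so the on-premise option only pays off with sustained use well beyond the three-month cloud period.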
GPU Access Cost Estimation
Cost Breakdown
Month | Activity | Users | Cost per Hour/GPU ($) | Hours per Day | Daily Cost ($) | Monthly Cost ($) |
---|---|---|---|---|---|---|
1 | Development and optimization | 1-5 | 0.5 | 4 | 2.00 | 960 |
2 | Scalability tests and beta users | 10-20 | 0.5 | 24 | 12.00 | 960 |
3 | Large scale testing across timezones | 10-20 | 0.5 | 36 | 18.00 | 960 |
Total Cost: $960 + $960 + $960 = $2,880
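The usage-based columns of the table can be cross-checked with a few lines of arithmetic. The sketch below assumes a 30-day month and treats "Hours per Day" as aggregate GPU-hours (the 36-hour entry exceeds 24, so it cannot be wall-clock time on one GPU). Both are assumptions, not statements from the plan. Under them the three months come to $60, $360, and $540, or $960 in total, which suggests the flat $960 per month in the table may be a budgeted allowance rather than a figure derived from hourly usage.

```python
from dataclasses import dataclass

DAYS_PER_MONTH = 30  # assumption: the table does not state the month length


@dataclass(frozen=True)
class MonthPlan:
    month: int
    activity: str
    rate_usd_per_gpu_hour: float
    gpu_hours_per_day: float  # assumed to be aggregate GPU-hours

    @property
    def daily_cost(self) -> float:
        return self.rate_usd_per_gpu_hour * self.gpu_hours_per_day

    @property
    def monthly_cost(self) -> float:
        return self.daily_cost * DAYS_PER_MONTH


# Rates and hours copied from the cost breakdown table.
PLAN = [
    MonthPlan(1, "Development and optimization", 0.5, 4),
    MonthPlan(2, "Scalability tests and beta users", 0.5, 24),
    MonthPlan(3, "Large scale testing across timezones", 0.5, 36),
]

if __name__ == "__main__":
    for p in PLAN:
        print(f"Month {p.month}: ${p.daily_cost:.2f}/day, ${p.monthly_cost:.2f}/month")
    print(f"Usage-based total: ${sum(p.monthly_cost for p in PLAN):.2f}")
```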
Project Scope
Current Setup
Development currently runs on a laptop with a GTX 1060 GPU (6 GB VRAM). To ensure robustness and scalability, additional GPU resources are required.
Integrated Demos
- Demos for testing Dhwani components for accuracy and evaluation.
Feature | Description | Components | Source Code | Hardware |
---|---|---|---|---|
Kannada Voice AI | Provides answers to voice queries using an LLM | LLM | API / App | CPU / GPU |
Text Translate | Translates text from one language to another. | Translation | Link | CPU / GPU |
Text Query | Allows querying text data for specific information. | LLM | Link | CPU / GPU |
Voice to Text Translation | Converts spoken language to text and translates it. | ASR, Translation | Link | CPU / GPU |
PDF Translate | Translates content from PDF documents. | Translation | | |
Text to Speech | Generates speech from text. | TTS | Link | GPU |
Voice to Voice Translation | Converts spoken language to text, translates it, and then generates speech. | ASR, Translation, TTS | Link | GPU |
Answer Engine with Translate | Provides answers to queries with translation capabilities. | ASR, LLM, Translation, TTS | Link | GPU |
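The Voice to Voice Translation row chains ASR, Translation, and TTS in sequence. A minimal sketch of that composition is shown below. The function signatures and stub stages are hypothetical and chosen only for illustration. They do not reflect the actual APIs of the ASR, TTS, or Translate servers.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage signatures, not the real server APIs.
AsrFn = Callable[[bytes], str]                # audio -> transcript
TranslateFn = Callable[[str, str, str], str]  # text, source lang, target lang -> text
TtsFn = Callable[[str, str], bytes]           # text, lang -> audio


@dataclass
class VoiceToVoicePipeline:
    asr: AsrFn
    translate: TranslateFn
    tts: TtsFn

    def run(self, audio: bytes, src_lang: str, tgt_lang: str) -> bytes:
        """Speech in the source language -> speech in the target language."""
        transcript = self.asr(audio)
        translated = self.translate(transcript, src_lang, tgt_lang)
        return self.tts(translated, tgt_lang)


# Stub stages so the sketch runs without any model or GPU.
def stub_asr(audio: bytes) -> str:
    return audio.decode("utf-8")  # pretends the audio bytes are the transcript


def stub_translate(text: str, src_lang: str, tgt_lang: str) -> str:
    return f"[{src_lang}->{tgt_lang}] {text}"


def stub_tts(text: str, lang: str) -> bytes:
    return text.encode("utf-8")  # pretends the encoded text is synthesized audio


if __name__ == "__main__":
    pipeline = VoiceToVoicePipeline(stub_asr, stub_translate, stub_tts)
    print(pipeline.run("hello".encode("utf-8"), "kn", "hi"))
```

Keeping each stage behind a plain callable means a CPU-hosted and a GPU-hosted implementation can be swapped without changing the pipeline itself.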
Proposed Plan
Phase 1: Cloud Provider setup with Single GPU
- Objective: Utilize cloud-based GPU resources to enhance the models.
- Actions:
- Set up and configure cloud-based GPUs.
- Perform initial training and testing of ASR, TTS, and translation models.
- Evaluate the performance and make necessary adjustments.
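The plan does not name an evaluation metric. One common choice for ASR is word error rate (WER), sketched below as an assumption about how the performance evaluation could be done. It computes the word-level edit distance between a reference transcript and the model output, divided by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Whitespace tokenization is a simplification. Kannada text may need normalization before scoring, which this sketch does not attempt.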
Phase 2: Alpha user scaling with multi-GPU setup
- Objective: Assess the feasibility of multi-GPU solutions.
- Actions:
- Conduct a cost-benefit analysis of multi-GPU setup.
- Continue model training and optimization using cloud-based GPUs.
Phase 3: Resource Maximization and Scalability to Beta users
- Objective: Release to beta users on an advanced GPU.
- Actions:
- Monitor the performance and resource utilization.
- Adjust the project plan as needed to ensure efficient use of resources.
- Seek additional funding or resources based on project progress and demand.
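For the monitoring step, one possible approach on NVIDIA hardware is to poll `nvidia-smi` and parse its CSV output. The sketch below assumes `nvidia-smi` is installed on the host. The `GpuSample` type and function names are illustrative only.

```python
import subprocess
from dataclasses import dataclass
from typing import List

# Query flags supported by nvidia-smi; the output is one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


@dataclass
class GpuSample:
    utilization_pct: float
    memory_used_mib: float
    memory_total_mib: float


def parse_nvidia_smi_csv(output: str) -> List[GpuSample]:
    """Parse lines such as '35, 2048, 24564' into GpuSample records."""
    samples = []
    for line in output.strip().splitlines():
        # Fields reported as '[N/A]' by some GPUs would raise ValueError here.
        util, used, total = (float(field.strip()) for field in line.split(","))
        samples.append(GpuSample(util, used, total))
    return samples


def sample_gpus() -> List[GpuSample]:
    """Run nvidia-smi once and return one sample per GPU (requires an NVIDIA driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_nvidia_smi_csv(result.stdout)
```

Calling `sample_gpus()` at a fixed interval and logging the results would give a simple utilization history to support the resource decisions in this phase.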
Test Cloud Provider
- Hugging Face Spaces
- OlaKrutrim Cloud
Provider and Costs
- Hugging Face Spaces
Pricing from Hugging Face Spaces, chosen for ease of use and for keeping the model close to the server.
GPU Type | vCPU | Memory | GPU Model | GPU Memory | Price ($/hour) |
---|---|---|---|---|---|
Nvidia T4 - small | 4 | 15 GB | Nvidia T4 | 16 GB | $0.40 |
1x Nvidia L4 | 8 | 30 GB | Nvidia L4 | 24 GB | $0.80 |
1x Nvidia L40S | 8 | 62 GB | Nvidia L40S | 48 GB | $1.80 |
Nvidia A10G - small | 4 | 15 GB | Nvidia A10G | 24 GB | $1.00 |
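To relate these prices to the budget, the sketch below computes how many hours of each GPU type a $960 monthly budget would cover. It assumes the listed prices are per hour and that a month has 30 days. Both are assumptions made for this estimate.

```python
# Prices copied from the table above, assumed to be USD per hour.
HF_SPACES_USD_PER_HOUR = {
    "Nvidia T4 - small": 0.40,
    "1x Nvidia L4": 0.80,
    "1x Nvidia L40S": 1.80,
    "Nvidia A10G - small": 1.00,
}

MONTHLY_BUDGET_USD = 960.0  # per-month figure from the budget section
HOURS_IN_MONTH = 24 * 30    # assumption: 30-day month


def affordable_hours(budget_usd: float, rate_usd_per_hour: float) -> float:
    """Hours of GPU time the budget covers at the given hourly rate."""
    return budget_usd / rate_usd_per_hour


def can_run_always_on(budget_usd: float, rate_usd_per_hour: float) -> bool:
    """True if the budget covers a full month of continuous use."""
    return rate_usd_per_hour * HOURS_IN_MONTH <= budget_usd


if __name__ == "__main__":
    for gpu, rate in HF_SPACES_USD_PER_HOUR.items():
        hours = affordable_hours(MONTHLY_BUDGET_USD, rate)
        always_on = can_run_always_on(MONTHLY_BUDGET_USD, rate)
        print(f"{gpu}: {hours:.0f} h/month, always-on: {always_on}")
```

Under these assumptions every option except the L40S could run continuously within the monthly budget. The L40S would be limited to roughly 533 hours.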
- OlaKrutrim Cloud
Instance Type | Price (₹/hour) | GPUs | Availability | vCPUs | GPU Memory | RAM |
---|---|---|---|---|---|---|
A100-NVLINK-Mini | ₹ 45 | 1 | Medium | 16 | 20 GB | |
A100-NVLINK-Standard-1x | ₹ 105 | 1 | Medium | 16 | 40 GB | 60 GB |
H100-NVLINK-Nano | ₹ 83 | 1 | Medium | 16 | 20 GB | |
H100-NVLINK-Mini | ₹ 124 | 1 | Medium | 16 | 40 GB | 60 GB |
Additional Reading Materials
- Dhwani 3-month milestone plan
- Dhwani research milestone document
Technical Specifications
For more detailed technical specifications, please refer to the following documents:
Conclusion
This proposal aims to secure GPU access for three months to develop a robust Kannada/Indic Language voice model. By leveraging open-source tools and models, we can create a solution that meets the needs of Kannada speakers and contributes to the broader field of voice assistant technologies. Your support in providing GPU access will be instrumental in achieving this goal.
Contact Information
For any inquiries or further discussion, please contact:
- sachin
- To collaborate immediately with code, feedback, or issues, join our Discord Server.
- Clear, small pull requests for milestones are worth their weight in gold.
We appreciate your consideration and look forward to the possibility of collaborating on this exciting project.