Project Dhwani: Enhancing Kannada Voice Model Development with GPU Access

Summary
Introduction
- Background
- Objectives
Budget
Project Scope
- Models and Tools
- Current Setup
Proposed Plan
Test Cloud Provider
- Overview
- Provider and Costs
Alternate Cloud Providers for GPU Access
Additional Reading Materials
- Dhwani - 3 months Milestone Document
- Technical Specifications
Conclusion
Contact Information

Summary

Dhwani is a self-hosted GenAI platform designed to provide voice mode interaction for Kannada and other Indian languages.

Research Goals

Measure and improve the Time to First Token Generation (TTFTG) for model architectures in ASR, Translation, and TTS systems.
Develop and enhance a Kannada voice model that meets industry standards set by OpenAI, Google, ElevenLabs, and xAI.
Create robust voice solutions for Indian languages, with a specific emphasis on Kannada.
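The first research goal depends on a repeatable way to time the first output of a streaming model. The following is a minimal sketch, assuming the model call can be wrapped as a Python iterator; `time_to_first_item` and `fake_model_stream` are hypothetical names used only for illustration and are not part of the Dhwani codebase.

```python
import time
from typing import Iterable, Iterator, Tuple, TypeVar

T = TypeVar("T")


def time_to_first_item(stream: Iterable[T]) -> Tuple[T, float]:
    """Return the first item of a stream and the seconds spent waiting for it."""
    start = time.perf_counter()
    iterator = iter(stream)
    first = next(iterator)  # blocks until the stream yields its first output
    return first, time.perf_counter() - start


def fake_model_stream() -> Iterator[str]:
    """Hypothetical stand-in for a streaming ASR, translation, or TTS call."""
    time.sleep(0.05)  # simulated startup latency before the first token
    yield "ನಮಸ್ಕಾರ"
    yield "ಜಗತ್ತು"


token, seconds = time_to_first_item(fake_model_stream())
print(f"first token {token!r} after {seconds * 1000:.1f} ms")
```

Running the same wrapper around each stage (ASR, translation, TTS) would give per-stage latencies that can be compared across hardware.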

Introduction

Project Website - https://slabstech.com/dhwani

Report - Doc

Presentation - Slides

Background

Current voice assistants like Alexa, Siri, and Google Assistant dominate the consumer market but lack comprehensive support for Indian languages, particularly Kannada. OpenAI's recent entry into the voice assistant market highlights the growing demand for such technologies. By building on open-source models and tools, we can develop an accessible and robust voice solution tailored specifically to Kannada speakers.

Objectives

The primary objective is to integrate and enhance the following models and services for Kannada:
- Automatic Speech Recognition (ASR): To convert spoken Kannada into text.
- Text-to-Speech (TTS): To convert Kannada text into natural-sounding speech.
- Translation Services: To enable translation between Kannada and other Indian languages.

Models and Tools

The project utilizes the following open-source tools:

Open-Source Tool	Source Repository	CPU (available 24/7, free)	GPU (on demand)
Automatic Speech Recognition (ASR)	ASR Indic Server	API Demo	-
Text to Speech (TTS)	TTS Indic Server	Not Suitable	App Demo
Translation	Indic Translate Server	API Demo	-
Large Language Model	LLM Indic Server	API Demo	-
Document Parser	Indic Document Server	Not Suitable	-
All in One Server (ASR + TTS + Translate)	indic-all-server	Not Suitable	-

Target Solution

Answer Engine	Voice Translation

Budget

Cloud Providers

Cost: Estimated $2,880 for three months of cloud-based GPU access.
Justification: Necessary for initial infra setup, model optimization and performance evaluation.

On-Premise GPU Setup

Cost: $4,000 for hardware and setup (RTX 4090 workstation with 24 GB VRAM).
Justification: Long-term investment for sustainable development and scalability.
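To put the two options side by side, a rough break-even point can be computed from the figures above. This is a sketch only: it assumes cloud spending continues at the three-month average and ignores electricity, maintenance, and resale value for the workstation. The function name `break_even_months` is illustrative.

```python
def break_even_months(hardware_cost: float, cloud_cost_per_month: float) -> float:
    """Months of cloud spending that equal a one-time hardware purchase."""
    if cloud_cost_per_month <= 0:
        raise ValueError("cloud_cost_per_month must be positive")
    return hardware_cost / cloud_cost_per_month


CLOUD_MONTHLY_USD = 2880 / 3  # $2,880 estimate spread over three months
ON_PREMISE_USD = 4000         # RTX 4090 workstation, hardware and setup

months = break_even_months(ON_PREMISE_USD, CLOUD_MONTHLY_USD)
print(f"break-even after about {months:.1f} months")
```

Under these assumptions the workstation costs about as much as a little over four months of cloud access.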

We will target implementation on a single GPU.

GPU Access Cost Estimation

Cost Breakdown

Month	Activity	Users	Cost per Hour/GPU ($)	Hours per Day	Daily Cost ($)	Monthly Cost ($)
1	Development and optimization	1-5	0.5	4	2.00	960
2	Scalability tests and beta users	10-20	0.5	24	12.00	960
3	Large scale testing across timezones	10-20	0.5	36	18.00	960

Total Cost: $960 + $960 + $960 = $2,880
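The arithmetic behind the table can be sketched as follows. The daily cost is the hourly rate times the hours per day; the 30-day month used for the usage projection is an assumption, since the proposal does not state the number of billing days. Usage projected this way comes to less than the $960 listed for each month, so the listed monthly figures are treated here as budgeted amounts rather than derived ones; that reading is also an assumption. The class and field names are illustrative.

```python
from dataclasses import dataclass

DAYS_PER_MONTH = 30  # assumption: billing days are not stated in the proposal


@dataclass(frozen=True)
class MonthPlan:
    month: int
    rate_per_gpu_hour: float  # USD, "Cost per Hour/GPU" column
    hours_per_day: float      # "Hours per Day" column
    budget: float             # USD, "Monthly Cost" column, taken as given

    @property
    def daily_cost(self) -> float:
        return self.rate_per_gpu_hour * self.hours_per_day

    @property
    def projected_usage(self) -> float:
        # Usage-based estimate under the 30-day assumption above.
        return self.daily_cost * DAYS_PER_MONTH


PLAN = [
    MonthPlan(1, 0.5, 4, 960),
    MonthPlan(2, 0.5, 24, 960),
    MonthPlan(3, 0.5, 36, 960),
]

for p in PLAN:
    print(f"month {p.month}: daily ${p.daily_cost:.2f}, "
          f"projected usage ${p.projected_usage:.2f}, budget ${p.budget}")
print("total budget:", sum(p.budget for p in PLAN))
```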

Project Scope

Current Setup

Development currently runs on a laptop with a GTX 1060 GPU (6 GB VRAM). Additional GPU resources are required to ensure robustness and scalability.

Integrated Demos

Demos for testing Dhwani components for accuracy and evaluation:

Feature	Description	Components	Source Code	Hardware
Kannada Voice AI	Provides answers to voice queries using an LLM	LLM	API / App	CPU / GPU
Text Translate	Translates text from one language to another.	Translation	Link	CPU / GPU
Text Query	Allows querying text data for specific information.	LLM	Link	CPU / GPU
Voice to Text Translation	Converts spoken language to text and translates it.	ASR, Translation	Link	CPU / GPU
PDF Translate	Translates content from PDF documents.	Translation
Text to Speech	Generates speech from text.	TTS	Link	GPU
Voice to Voice Translation	Converts spoken language to text, translates it, and then generates speech.	ASR, Translation, TTS	Link	GPU
Answer Engine with Translate	Provides answers to queries with translation capabilities.	ASR, LLM, Translation, TTS	Link	GPU
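The Voice to Voice Translation row chains three components: ASR, translation, and TTS. A minimal sketch of that chaining, assuming each stage is available as a plain Python callable. The type aliases, the `voice_to_voice` function, and the stub stages are hypothetical; the actual Dhwani services and their request formats are not described in this document.

```python
from typing import Callable

# Hypothetical stage signatures, for illustration only.
Transcribe = Callable[[bytes], str]          # audio -> source-language text
Translate = Callable[[str, str, str], str]   # text, source, target -> text
Synthesize = Callable[[str, str], bytes]     # text, language -> audio


def voice_to_voice(audio: bytes, source: str, target: str,
                   asr: Transcribe, translate: Translate,
                   tts: Synthesize) -> bytes:
    """Run speech through ASR, then translation, then TTS."""
    text = asr(audio)
    translated = translate(text, source, target)
    return tts(translated, target)


# Stub stages so the sketch runs without any model or server.
def stub_asr(audio: bytes) -> str:
    return "hello"


def stub_translate(text: str, source: str, target: str) -> str:
    return f"[{source}->{target}] {text}"


def stub_tts(text: str, language: str) -> bytes:
    return text.encode("utf-8")


result = voice_to_voice(b"\x00\x01", "en", "kn",
                        stub_asr, stub_translate, stub_tts)
print(result)
```

Passing the stages in as arguments keeps the pipeline independent of where each model runs, so a CPU-hosted ASR stage and a GPU-hosted TTS stage can be swapped without changing the chaining logic.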

Proposed Plan

Phase 1: Cloud Provider setup with Single GPU

Objective: Utilize cloud-based GPU resources to enhance the models.
Actions:
Set up and configure cloud-based GPUs.
Perform initial training and testing of ASR, TTS, and translation models.
Evaluate the performance and make necessary adjustments.
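The proposal does not name an evaluation metric. As one possible starting point for the ASR evaluation step, word error rate (WER) compares a reference transcript with the model output. The sketch below is an assumption about how evaluation could be done, not the project's chosen method; it splits on whitespace, which may be too coarse for Kannada, where a character-level variant could be more informative.

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))
```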

Phase 2: Alpha user scaling with multi-GPU setup

Objective: Assess the feasibility of multi-GPU solutions.
Actions:
Conduct a cost-benefit analysis of multi-GPU setup.
Continue model training and optimization using cloud-based GPUs.

Phase 3: Resource Maximization and Scalability to Beta users

Objective: Release to beta users on more capable GPU hardware.
Actions:
Monitor the performance and resource utilization.
Adjust the project plan as needed to ensure efficient use of resources.
Seek additional funding or resources based on project progress and demand.

Test Cloud Provider

Huggingface Spaces
OlaKrutrim Cloud

Provider and Costs

Huggingface Spaces

Pricing from Huggingface Spaces, chosen for ease of use and because the models are hosted close to the server.

GPU Type	vCPU	Memory	GPU Model	GPU Memory	Price ($/hour)
Nvidia T4 - small	4	15 GB	Nvidia T4	16 GB	$0.40
1x Nvidia L4	8	30 GB	Nvidia L4	24 GB	$0.80
1x Nvidia L40S	8	62 GB	Nvidia L40S	48 GB	$1.80
Nvidia A10G - small	4	15 GB	Nvidia A10G	24 GB	$1.00
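One way to use this table is to pick the cheapest instance that meets a GPU memory requirement. A minimal sketch, reading the prices as hourly rates; the `GpuOption` type, the `cheapest_with_memory` helper, and the 24 GB threshold are illustrative choices, not requirements stated in the proposal.

```python
from typing import List, NamedTuple, Optional


class GpuOption(NamedTuple):
    name: str
    gpu_memory_gb: int
    usd_per_hour: float


# Values copied from the Huggingface Spaces table above.
HF_SPACES: List[GpuOption] = [
    GpuOption("Nvidia T4 - small", 16, 0.40),
    GpuOption("1x Nvidia L4", 24, 0.80),
    GpuOption("1x Nvidia L40S", 48, 1.80),
    GpuOption("Nvidia A10G - small", 24, 1.00),
]


def cheapest_with_memory(options: List[GpuOption],
                         min_gb: int) -> Optional[GpuOption]:
    """Return the lowest-priced option with at least min_gb of GPU memory."""
    candidates = [o for o in options if o.gpu_memory_gb >= min_gb]
    return min(candidates, key=lambda o: o.usd_per_hour, default=None)


choice = cheapest_with_memory(HF_SPACES, 24)
print(choice)
```

The helper returns `None` when no listed instance has enough memory, which signals that another provider or a multi-GPU setup would be needed.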

OlaKrutrim Cloud

Instance Type	Price (₹/hour)	GPUs	Availability	vCPUs	GPU Memory	RAM
A100-NVLINK-Mini	₹ 45	1	Medium	16	20 GB
A100-NVLINK-Standard-1x	₹ 105	1	Medium	16	40 GB	60 GB
H100-NVLINK-Nano	₹ 83	1	Medium	16	20 GB
H100-NVLINK-Mini	₹ 124	1	Medium	16	40 GB	60 GB

WIP - Cloud provider benchmark document

Additional Reading Materials

Dhwani - 3 month - Milestone plan

Dhwani Research Milestone document

Technical Specifications

For more detailed technical specifications, please refer to the following documents:

Conclusion

This proposal aims to secure GPU access for three months to develop a robust Kannada/Indic Language voice model. By leveraging open-source tools and models, we can create a solution that meets the needs of Kannada speakers and contributes to the broader field of voice assistant technologies. Your support in providing GPU access will be instrumental in achieving this goal.

Contact Information

For any inquiries or further discussion, please contact:

[sachin]
To collaborate right away with code, feedback, or issues, join our Discord Server.
- Clear, small pull requests for milestones are worth their weight in gold.

We appreciate your consideration and look forward to the possibility of collaborating on this exciting project.