Project Dhwani: Enhancing Kannada Voice Model Development with GPU Access

Table of Contents

  1. Summary
  2. Introduction
  3. Budget
  4. Project Scope
  5. Proposed Plan
  6. Test Cloud Provider
  7. Alternate Cloud Providers for GPU Access
  8. Additional Reading Materials
  9. Conclusion
  10. Contact Information

Summary

Dhwani is a self-hosted GenAI platform designed to provide voice mode interaction for Kannada and other Indian languages.

Research Goals

  • Measure and improve the Time to First Token Generation (TTFTG) for model architectures in ASR, Translation, and TTS systems.
  • Develop and enhance a Kannada voice model that meets the industry standards set by OpenAI, Google, ElevenLabs, and xAI.
  • Create robust voice solutions for Indian languages, with a specific emphasis on Kannada.
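
As a minimal sketch of how the first goal could be measured, the snippet below times how long a streaming component takes to yield its first output chunk. The names `time_to_first_token` and `fake_stream` are hypothetical and not part of Dhwani's code; a real measurement would wrap the streaming response of the ASR, translation, or TTS server instead of the simulated generator.

```python
import time
from typing import Iterable, Iterator, Tuple


def time_to_first_token(stream: Iterable[str]) -> Tuple[float, str]:
    """Return (seconds until the first item arrives, the first item)."""
    start = time.perf_counter()
    iterator: Iterator[str] = iter(stream)
    first = next(iterator)  # blocks until the stream emits its first token
    return time.perf_counter() - start, first


def fake_stream(delay_s: float = 0.05) -> Iterator[str]:
    """Hypothetical stand-in for a streaming model response."""
    time.sleep(delay_s)  # simulated model latency before the first token
    yield "ನಮಸ್ಕಾರ"
    yield "!"


if __name__ == "__main__":
    latency, token = time_to_first_token(fake_stream())
    print(f"first token {token!r} after {latency * 1000:.1f} ms")
```

Because the timer starts before the stream is first advanced, the measured value includes any setup work the stream does before producing output, which is usually what a user perceives as the initial delay.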

Introduction

Project Website - https://slabstech.com/dhwani

Report - Doc

Presentation - Slides

Background

Current voice assistants like Alexa, Siri, and Google dominate the consumer market but lack comprehensive support for Indian languages, particularly Kannada. OpenAI's recent entry into the voice assistant market highlights the growing demand for such technologies. By utilizing open-source models and tools, we can develop a voice solution that is accessible and robust, specifically tailored for Kannada speakers.

Objectives

The primary objective is to integrate and enhance the following models and services for Kannada:

  • Automatic Speech Recognition (ASR): To convert spoken Kannada into text.
  • Text-to-Speech (TTS): To convert Kannada text into natural-sounding speech.
  • Translation Services: To enable translation between Kannada and other Indian languages.

Models and Tools

The project utilizes the following open-source tools:

| Open-Source Tool | Source Repository | CPU / Available 24/7 - Free | GPU / On-demand |
|---|---|---|---|
| Automatic Speech Recognition (ASR) | ASR Indic Server | API Demo | - |
| Text to Speech (TTS) | TTS Indic Server | Not suitable | App - Demo |
| Translation | Indic Translate Server | API Demo | |
| Large Language Model | LLM Indic Server | API Demo | |
| Document Parser | Indic Document Server | Not suitable | - |
| All in One Server - ASR + TTS + Translate | indic-all-server | Not suitable | - |

Target Solution

Answer Engine Voice Translation

Budget

Cloud Providers

  • Cost: Estimated $2,880 for three months of cloud-based GPU access.
  • Justification: Necessary for initial infra setup, model optimization and performance evaluation.

On-Premise GPU Setup

  • Cost: $4,000 for hardware and setup (RTX 4090 workstation with 24 GB VRAM).
  • Justification: Long-term investment for sustainable development and scalability.
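
The trade-off between the two options can be illustrated with a short break-even calculation. This sketch assumes cloud spending continues at the same monthly rate as the three-month estimate and ignores electricity, maintenance, and resale value, none of which the proposal quantifies.

```python
import math

CLOUD_COST_3_MONTHS = 2880.0  # USD, from the cloud estimate above
ON_PREM_COST = 4000.0         # USD, RTX 4090 workstation


def break_even_months(on_prem_cost: float, cloud_monthly: float) -> float:
    """Months of cloud spending that equal the one-time on-premise cost."""
    return on_prem_cost / cloud_monthly


cloud_monthly = CLOUD_COST_3_MONTHS / 3  # 960.0 USD per month
months = break_even_months(ON_PREM_COST, cloud_monthly)
print(f"cloud: ${cloud_monthly:.0f}/month, break-even after {months:.2f} months")
print(f"on-premise is cheaper from month {math.ceil(months)} onward")
```

Under these assumptions the workstation pays for itself a little after four months, which is consistent with treating it as the longer-term option.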

We will target an implementation that runs on a single GPU.

GPU Access Cost Estimation

Cost Breakdown

| Month | Activity | Users | Cost per GPU Hour ($) | GPU Hours per Day | Daily Cost ($) | Monthly Cost ($) |
|---|---|---|---|---|---|---|
| 1 | Development and optimization | 1-5 | 0.5 | 4 | 2.00 | 960 |
| 2 | Scalability tests and beta users | 10-20 | 0.5 | 24 | 12.00 | 960 |
| 3 | Large-scale testing across time zones | 10-20 | 0.5 | 36 | 18.00 | 960 |

Total Cost: $960 + $960 + $960 = $2,880
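
The figures in the table can be cross-checked with a short calculation. The sketch below multiplies the hourly rate by the hours per day and by an assumed 30-day month (the proposal does not state the month length). Under that assumption the usage-based cost is $60, $360, and $540 for months 1 to 3, which suggests the $960 budgeted per month leaves headroom beyond raw GPU hours.

```python
DAYS_PER_MONTH = 30       # assumption: the source does not state the month length
BUDGET_PER_MONTH = 960.0  # USD, monthly figure from the table

# (month, cost per GPU hour in USD, hours per day), copied from the table
PLAN = [(1, 0.5, 4), (2, 0.5, 24), (3, 0.5, 36)]


def usage_cost(rate: float, hours_per_day: float, days: int = DAYS_PER_MONTH) -> float:
    """Usage-based cost for one month: rate x hours per day x days."""
    return rate * hours_per_day * days


monthly = {month: usage_cost(rate, hours) for month, rate, hours in PLAN}
for month, cost in monthly.items():
    print(f"month {month}: usage ${cost:.0f}, headroom ${BUDGET_PER_MONTH - cost:.0f}")
print(f"three-month usage total: ${sum(monthly.values()):.0f}")
```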

Project Scope

Current Setup

Development currently runs on a laptop with a GTX 1060 GPU (6 GB VRAM). To ensure robustness and scalability, additional GPU resources are required.

Integrated Demos

  • Demos for testing Dhwani components for accuracy and evaluation

| Feature | Description | Components | Source Code | Hardware |
|---|---|---|---|---|
| Kannada Voice AI | Provides answers to voice queries using an LLM. | LLM | API // APP | CPU / GPU |
| Text Translate | Translates text from one language to another. | Translation | Link | CPU / GPU |
| Text Query | Allows querying text data for specific information. | LLM | Link | CPU / GPU |
| Voice to Text Translation | Converts spoken language to text and translates it. | ASR, Translation | Link | CPU / GPU |
| PDF Translate | Translates content from PDF documents. | Translation | | |
| Text to Speech | Generates speech from text. | TTS | Link | GPU |
| Voice to Voice Translation | Converts spoken language to text, translates it, and then generates speech. | ASR, Translation, TTS | Link | GPU |
| Answer Engine with Translate | Provides answers to queries with translation capabilities. | ASR, LLM, Translation, TTS | Link | GPU |
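
The multi-component demos chain the same stages in sequence. As a rough sketch of that composition (not Dhwani's actual implementation), the voice-to-voice path can be written as three callables applied in order. The class `VoicePipeline` and the stub functions are hypothetical; in practice each stage would call the corresponding ASR, translation, or TTS server.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class VoicePipeline:
    """Chains ASR -> translation -> TTS, each stage feeding the next."""
    asr: Callable[[bytes], str]      # audio in, source-language text out
    translate: Callable[[str], str]  # source text in, target text out
    tts: Callable[[str], bytes]      # target text in, audio out

    def run(self, audio: bytes) -> bytes:
        text = self.asr(audio)
        translated = self.translate(text)
        return self.tts(translated)


# Hypothetical stubs so the sketch runs without any model or server.
def stub_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def stub_translate(text: str) -> str:
    return {"ನಮಸ್ಕಾರ": "hello"}.get(text, text)


def stub_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(asr=stub_asr, translate=stub_translate, tts=stub_tts)
print(pipeline.run("ನಮಸ್ಕಾರ".encode("utf-8")))  # b'hello'
```

Keeping each stage behind a plain callable makes it straightforward to time the stages separately or to swap a CPU-hosted component for a GPU-hosted one.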

Proposed Plan

Phase 1: Cloud Provider setup with Single GPU

  • Objective: Utilize cloud-based GPU resources to enhance the models.
  • Actions:
    • Set up and configure cloud-based GPUs.
    • Perform initial training and testing of ASR, TTS, and translation models.
    • Evaluate the performance and make necessary adjustments.

Phase 2: Alpha user scaling with multi-GPU setup

  • Objective: Assess the feasibility of multi-GPU solutions.
  • Actions:
    • Conduct a cost-benefit analysis of a multi-GPU setup.
    • Continue model training and optimization using cloud-based GPUs.

Phase 3: Resource Maximization and Scalability to Beta users

  • Objective: Release to beta users on an advanced GPU.
  • Actions:
    • Monitor performance and resource utilization.
    • Adjust the project plan as needed to ensure efficient use of resources.
    • Seek additional funding or resources based on project progress and demand.

Test Cloud Provider

  • Hugging Face Spaces
  • OlaKrutrim Cloud

Provider and Costs

  • Hugging Face Spaces

Hugging Face Spaces pricing, chosen for ease of use and because the models are hosted close to the server:

| GPU Type | vCPU | Memory | GPU Model | GPU Memory | Price ($) |
|---|---|---|---|---|---|
| Nvidia T4 - small | 4 | 15 GB | Nvidia T4 | 16 GB | $0.40 |
| 1x Nvidia L4 | 8 | 30 GB | Nvidia L4 | 24 GB | $0.80 |
| 1x Nvidia L40S | 8 | 62 GB | Nvidia L40S | 48 GB | $1.80 |
| Nvidia A10G - small | 4 | 15 GB | Nvidia A10G | 24 GB | $1.00 |

  • OlaKrutrim Cloud
| Instance Type | Price (₹/hour) | GPUs | Availability | vCPUs | GPU Memory | RAM |
|---|---|---|---|---|---|---|
| A100-NVLINK-Mini | ₹ 45 | 1 | Medium | 16 | 20 GB | |
| A100-NVLINK-Standard-1x | ₹ 105 | 1 | Medium | 16 | 40 GB | 60 GB |
| H100-NVLINK-Nano | ₹ 83 | 1 | Medium | 16 | 20 GB | |
| H100-NVLINK-Mini | ₹ 124 | 1 | Medium | 16 | 40 GB | 60 GB |
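
To relate the hourly prices to the monthly budget, the sketch below computes the cost of keeping each Hugging Face Spaces option running around the clock. It assumes the listed dollar prices are per hour and a 30-day month; the OlaKrutrim prices are left out because comparing them would require an exchange-rate assumption.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: always-on for a 30-day month
MONTHLY_BUDGET = 960.0     # USD per month, from the cost estimate

# Prices in USD, assumed to be per hour, copied from the Hugging Face table
HF_PRICES = {
    "Nvidia T4 - small": 0.40,
    "1x Nvidia L4": 0.80,
    "Nvidia A10G - small": 1.00,
    "1x Nvidia L40S": 1.80,
}


def always_on_cost(price_per_hour: float) -> float:
    """Cost of running one instance continuously for a month."""
    return round(price_per_hour * HOURS_PER_MONTH, 2)


for name, price in HF_PRICES.items():
    cost = always_on_cost(price)
    verdict = "within" if cost <= MONTHLY_BUDGET else "over"
    print(f"{name}: ${cost:.0f}/month ({verdict} budget)")
```

Under these assumptions only the L40S option exceeds the monthly figure when run continuously, so it would need to be used on demand rather than left running.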

Additional Reading Materials

Dhwani - 3-Month Milestone Plan

Dhwani Research Milestone document

Technical Specifications

For more detailed technical specifications, please refer to the following documents:

Conclusion

This proposal aims to secure GPU access for three months to develop a robust voice model for Kannada and other Indic languages. By leveraging open-source tools and models, we can create a solution that meets the needs of Kannada speakers and contributes to the broader field of voice assistant technologies. Your support in providing GPU access will be instrumental in achieving this goal.

Contact Information

For any inquiries or further discussion, please contact:

  • [sachin]

  • To collaborate right away with code, feedback, or issues, join our Discord server.

    • Clear, small pull requests for milestones are worth their weight in gold.

We appreciate your consideration and look forward to the possibility of collaborating on this exciting project.