AI-TUTOR, an avatar-based, localized, lightweight Gen-AI chatbot, converses with its target audience in the local language (audio format). It helps school students and teachers with their day-to-day learning and activities.
Core Features:
Audio input and output: the tutor accepts user input as audio, processes it, and responds in audio.
Localized: Retrieval-Augmented Generation (RAG) ensures the tutor has local context.
Subject-specific tutoring (e.g., math, science, language).
Interactive Q&A with explanations.
Multilingual support for regional languages.
Technical Aspects:
Target Model: efficient, scalable, and able to understand and generate human-like text.
Infrastructure :
A lightweight model that does not require substantial computational resources yet handles complex language tasks.
E.g., DistilBERT / mBERT / Sarvam / IndicTrans
Minimum requirements: 16 GB RAM, 8-core CPU, 256 GB storage, GPU (optional); or Raspberry Pi 4 (8 GB RAM) / Jetson (8-core, 4 GB RAM)
On-premise deployment that does not require internet connectivity for day-to-day use.
Libraries:
Audio Input: Translator from googletrans, KaldiRecognizer from vosk, queue
Detect language : Translator from googletrans
LLM: llama3.2 via Ollama (e.g., the OllamaLLM class from langchain-ollama)
Audio Output: gTTS from gtts, pyaudio
Avatar Animation : pygame
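The audio-input step above can be sketched with Vosk's KaldiRecognizer. This is a minimal sketch, not the project's implementation: the model directory name is a placeholder (a Vosk model must be downloaded separately), the microphone parameters are assumptions, and the heavy imports are deferred inside the function so the pure JSON-parsing helper stays importable on its own.

```python
import json


def extract_text(result_json: str) -> str:
    """Parse the JSON string returned by KaldiRecognizer.Result()."""
    return json.loads(result_json).get("text", "")


def listen_once(model_dir="vosk-model-small-hi-0.22", sample_rate=16000):
    """Capture one utterance from the default microphone and return its text.

    Imports are deferred so this module loads without vosk/pyaudio installed.
    The model directory name is only an example.
    """
    import pyaudio
    from vosk import Model, KaldiRecognizer

    recognizer = KaldiRecognizer(Model(model_dir), sample_rate)
    audio = pyaudio.PyAudio()
    stream = audio.open(format=pyaudio.paInt16, channels=1,
                        rate=sample_rate, input=True, frames_per_buffer=4000)
    try:
        while True:
            data = stream.read(4000, exception_on_overflow=False)
            if recognizer.AcceptWaveform(data):  # end of an utterance detected
                return extract_text(recognizer.Result())
    finally:
        stream.stop_stream()
        stream.close()
        audio.terminate()
```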
The described system integrates several advanced technologies to transform user queries into comprehensive, engaging responses. Here's an overview of the process:
Query Embedding and Retrieval:
Conversion to Embedding Vector: When a user submits a query, it's transformed into an embedding vector—a numerical representation capturing the query's semantic meaning.
Vector Database Retrieval: This vector is used to search a vector database for information closely related to the query, employing similarity metrics like cosine similarity.
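The retrieval step above can be illustrated with plain cosine similarity over an in-memory list of embeddings, a stand-in for a real vector database; the three-dimensional toy vectors and document IDs are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """index: list of (doc_id, embedding) pairs; returns the k closest doc_ids."""
    scored = sorted(index, key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]


index = [("fractions", [0.9, 0.1, 0.0]),
         ("photosynthesis", [0.1, 0.9, 0.1]),
         ("grammar", [0.0, 0.2, 0.9])]
print(top_k([0.8, 0.2, 0.1], index, k=1))  # → ['fractions']
```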
Retrieval-Augmented Generation (RAG):
Contextual Information Integration: The retrieved data serves as context for a Large Language Model (LLM), enhancing its ability to generate accurate and contextually relevant responses. This approach, known as Retrieval-Augmented Generation (RAG), combines external information retrieval with generative AI to improve response quality.
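A sketch of how the retrieved passages might be stitched into the LLM prompt. The prompt template is an assumption, and the call to the `ollama` Python package (for the llama3.2 model mentioned earlier) is deferred so the prompt builder can be exercised without a running Ollama server.

```python
def build_rag_prompt(question: str, passages: list[str]) -> str:
    """Combine retrieved context with the user's question (template is illustrative)."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(question, passages, model="llama3.2"):
    # Deferred import: requires the ollama package and a local server
    # with the model already pulled.
    import ollama
    prompt = build_rag_prompt(question, passages)
    return ollama.generate(model=model, prompt=prompt)["response"]
```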
Natural Language Response Generation:
LLM Processing: The LLM processes the query alongside the contextual information to produce a coherent, natural language answer.
Multimodal Output Presentation:
Text-to-Speech Conversion: The generated text is converted into speech using advanced text-to-speech technologies.
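The TTS step with gTTS might look like the sketch below. Note that gTTS calls Google's online service, which conflicts with fully offline use; an offline engine could be swapped in for on-premise deployments. The language-code table is a small illustrative mapping, not an exhaustive one.

```python
# Illustrative mapping from detected language names to gTTS language codes.
GTTS_LANG = {"english": "en", "hindi": "hi", "tamil": "ta", "telugu": "te"}


def tts_code(language_name: str) -> str:
    """Fall back to English for languages not in the table."""
    return GTTS_LANG.get(language_name.lower(), "en")


def speak(text: str, language_name: str, out_path="reply.mp3"):
    """Render text to an MP3 file; returns the file path."""
    from gtts import gTTS  # deferred: requires network access
    gTTS(text=text, lang=tts_code(language_name)).save(out_path)
    return out_path
```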
Avatar Animation: To enhance user engagement, the system employs AI-generated avatars that deliver the spoken response with synchronized facial expressions and body language. Companies like Synthesia specialize in creating such avatars, enabling the transformation of textual content into dynamic video presentations.
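A minimal pygame loop that alternates between two mouth images while the reply audio plays is a crude stand-in for real lip sync; the image filenames, window size, and frame interval are placeholders.

```python
def mouth_frame(elapsed_ms: int, interval_ms: int = 150) -> int:
    """Pure helper: which of two mouth frames to show at a given elapsed time."""
    return (elapsed_ms // interval_ms) % 2


def animate_avatar(audio_path, closed_img="mouth_closed.png", open_img="mouth_open.png"):
    import pygame  # deferred so the helper above stays importable without pygame
    pygame.init()
    screen = pygame.display.set_mode((320, 320))
    frames = [pygame.image.load(closed_img), pygame.image.load(open_img)]
    pygame.mixer.music.load(audio_path)
    pygame.mixer.music.play()
    clock = pygame.time.Clock()
    start = pygame.time.get_ticks()
    while pygame.mixer.music.get_busy():
        pygame.event.pump()  # keep the window responsive
        screen.blit(frames[mouth_frame(pygame.time.get_ticks() - start)], (0, 0))
        pygame.display.flip()
        clock.tick(30)
    pygame.quit()
```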
Example Scenario: If a teacher requests an analysis of a specific student's performance over the past three months, the system would:
Retrieve the relevant performance data from the database.
Use the LLM to generate an insightful analysis based on this data.
Present the analysis through an AI-generated avatar, delivering the information in an engaging, animated format.
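The three scenario steps above can be wired together as a thin orchestration function. All three step callables below are hypothetical stand-ins (injected so the real components remain swappable); the student ID and stub outputs are invented for illustration only.

```python
def analyze_student(student_id, months, retrieve, generate, present):
    """Hypothetical glue: retrieve data, generate analysis, present via avatar."""
    data = retrieve(student_id, months)
    analysis = generate(f"Analyse this performance data:\n{data}")
    return present(analysis)


# Usage with trivial stand-ins for the three components:
report = analyze_student(
    "S-42", 3,
    retrieve=lambda sid, m: f"{sid}: avg score 78 over {m} months",
    generate=lambda prompt: "Steady improvement in math.",
    present=lambda text: {"spoken": text},
)
print(report)  # → {'spoken': 'Steady improvement in math.'}
```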
This seamless integration of data retrieval, language generation, and multimedia presentation technologies results in an interactive and informative user experience.
Demo video