Soro: A Lightweight Foundation Model and Chatbot for Tajik
Original reporting by arXiv (cs.AI)

Large language models (LLMs) are spreading worldwide, but deploying them remains difficult in regions with limited computational resources and internet connectivity, particularly for less-resourced languages. A new initiative introduces Soro, a family of conversational LLMs built for real-world use in Tajikistan under exactly these constraints.
Soro starts from open-weight Gemma 3 checkpoints and is adapted through Tajik-only continual pretraining. For that stage, the researchers built a 1.9-billion-token corpus from filtered web text, PDF documents, and curriculum-aligned educational materials. They then applied supervised instruction tuning on 40,000 Tajik teacher-style examples. The goal was a set of models grounded in the language and cultural context of Tajikistan.
Measuring impact
Because Tajik-specific evaluation tools are scarce, the team also developed and open-sourced a suite of Tajik benchmarks covering general knowledge, linguistic competence, and local school and university entrance exam domains. On these benchmarks, Soro significantly outperforms same-size Gemma 3 baselines while maintaining strong performance on standard English datasets. Memory-efficient FP8 and INT4 quantization makes Soro viable for edge deployment, which matters for its ongoing education-sector pilot and planned expansion across schools in Tajikistan.
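To see why FP8 and INT4 matter for edge devices, it helps to estimate how much memory the weights alone occupy at each precision. The sketch below is a back-of-the-envelope calculation, not the Soro team's method: the parameter counts are hypothetical round numbers, the 16-bit baseline is an assumption, and the estimate ignores activations, the KV cache, and per-block quantization metadata.

```python
# Rough weight-only memory estimate. Ignores activations, KV cache, and
# quantization metadata such as scales and zero points.
BITS_PER_WEIGHT = {"bf16": 16, "fp8": 8, "int4": 4}  # bf16 baseline is assumed


def weight_memory_gib(num_params: int, fmt: str) -> float:
    """Return approximate weight storage in GiB for the given format."""
    return num_params * BITS_PER_WEIGHT[fmt] / 8 / 2**30


# Hypothetical parameter counts, chosen for illustration only.
for num_params in (1_000_000_000, 4_000_000_000):
    for fmt in BITS_PER_WEIGHT:
        print(f"{num_params / 1e9:.0f}B params, {fmt}: "
              f"{weight_memory_gib(num_params, fmt):.2f} GiB")
```

Under these assumptions, FP8 halves and INT4 quarters the weight footprint relative to a 16-bit baseline. Real quantized checkpoints are usually somewhat larger than this estimate because of the extra scaling data they store.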
Soro is a step toward making advanced AI available to historically underserved linguistic communities. By adapting open-weight models with a curated Tajik corpus and building dedicated evaluation benchmarks, the Soro team shows that high-performing, culturally relevant LLMs can be developed and deployed under tight resource constraints. Its stronger results on Tajik-specific tasks, combined with efficient quantization, address a language gap and point to a practical path for bringing conversational AI to edge devices. The ongoing pilot in Tajikistan's education sector is an early test of that path in practice.