AI Singapore (AISG) and Google Research have announced the launch of Project SEALD (Southeast Asian Languages in One Network Data), a pioneering research initiative aimed at enhancing language datasets for training and refining large language models (LLMs) specific to the languages of Southeast Asia (SEA). This collaboration marks a significant step towards improving the cultural context and linguistic capabilities of LLMs across the region, promising widespread societal benefits.
“Google is proud to be partnering with AISG to put Singapore and SEA on the map of AI model development. By focusing on languages spoken and used in SEA and cultural understanding, Project SEALD will significantly improve the existing corpus and evaluation benchmarks for these languages. This will open new opportunities and make AI more inclusive, accessible, and helpful for individuals and businesses throughout the region.”
Yolyn Ang, Vice President, Knowledge and Information Partnerships, Google APAC
Enhancing Linguistic Diversity and Inclusivity
Project SEALD aims to develop a diverse, high-quality data corpus, beginning with five key languages: Indonesian, Thai, Tamil, Filipino, and Burmese. The effort is part of AISG's SEA-LION (Southeast Asian Languages in One Network) initiative, which focuses on developing LLMs that reflect the unique cultural contexts and linguistic nuances of SEA. The collaboration between AISG and Google Research APAC includes the development of translocalization and translation models, the establishment of best practices for dataset instruction tuning, and the creation of scalable translocalization tools. The project will also publish pre-training recipes designed specifically for SEA languages.
“The SEA-LION LLM project has always been about building a community and ecosystem that will continuously work together to enhance the quality of the SEA-LION data corpus and continuously improve SEA-LION’s capabilities. We are happy that Google now stands as a key part of the SEA-LION ecosystem and we look forward to building better datasets through Project SEALD in collaboration with Google for the benefit of the entire community.”
Leslie Teo, Senior Director of AI Products, AISG
Open-Source Commitment for Regional Expertise
To foster regional expertise in language models, AISG and Google will release the datasets and outputs from Project SEALD as open source. The initiative aims in particular to improve communication with Singapore’s under-represented migrant worker populations, who are often more fluent in regional languages than in English. By collecting data that better captures linguistic nuances, the project seeks to strengthen engagement between these communities and both the Singapore Government and employers.
Bridging Gaps Across Domains
By integrating these advancements into generative AI solutions from the AI Trailblazers initiative, the project aims to support outreach in critical areas such as worker grievance redressal and the extension of assistance schemes. Project SEALD will also involve ecosystem partners from academia, industry, and government in data collection, curation, and quality checks, and in implementing advanced evaluation and benchmarking techniques.
Expanding Access and Collaboration
Building upon these foundations, AISG is collaborating with Google Cloud to make the SEA-LION LLMs accessible on Google Cloud’s Model Garden via Vertex AI. This will enable organizations to leverage enterprise-grade tools for customizing models to suit relevant use cases seamlessly. The SEA-LION LLMs will also continue to be available on Hugging Face, in partnership with Google Cloud, to aid developers in efficiently training, tuning, and deploying open models.
In addition to Project SEALD, AISG is actively pursuing collaborations across the SEA region, including MOUs and LOIs with entities in Indonesia, Malaysia, and Vietnam. These partnerships aim to develop datasets and applications for regional LLMs. AISG is also engaging with partners in Thailand, the Philippines, and Indonesia on resources for regional language syntax and semantics.
In parallel with Project SEALD, Google Research is conducting a similar initiative in India named Project Vaani, which focuses on transcribing and open-sourcing speech data from across India’s 773 districts. Together, the two projects reflect a growing commitment to language inclusivity and cultural context awareness in AI development across the Asia-Pacific region, with the goal of making technology more representative of, and accessible to, its diverse user base.


