Available on 1 platform
A large-scale text corpus, likely containing trillions of tokens, built by SKT AI Labs as part of Project Surya. It is intended to serve as a sovereign data foundation for indigenous large language model development in India. According to the dataset page, the corpus is still being built, with a target size of 8 TB.
The license is unknown, so the permissible uses of the dataset are unclear.