Available on 1 platform
Approximately 94 million tokens of professionally formatted screenplay text, pre-tokenized for direct use in GPT-2 training pipelines. The corpus was derived from the Movie-Script-Database by Aveek Saha and is provided as tokenized JSON splits. The dataset was created by kazkiryu and was last updated on June 9, 2026.
License is unknown; users should verify terms before use.
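
As an illustration of how a pre-tokenized JSON split might be fed to a GPT-2 training loop, the sketch below reads one split and cuts it into fixed-length blocks. The file layout is an assumption: the listing does not document the schema, so the code assumes each split is a flat JSON array of integer token IDs. The function names and the file name `train.json` are hypothetical. The block size of 1024 matches GPT-2's context length.

```python
import json
from typing import Iterator, List

GPT2_CONTEXT = 1024  # GPT-2's maximum context length


def load_token_ids(path: str) -> List[int]:
    """Read one split. ASSUMPTION: the file is a flat JSON array of integer token IDs."""
    with open(path, "r", encoding="utf-8") as f:
        data = json.load(f)
    if not isinstance(data, list) or not all(isinstance(t, int) for t in data):
        raise ValueError("expected a flat JSON array of integers; check the actual schema")
    return data


def blocks(ids: List[int], block_size: int = GPT2_CONTEXT) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping blocks; a trailing partial block is dropped."""
    for start in range(0, len(ids) - block_size + 1, block_size):
        yield ids[start:start + block_size]
```

With a split saved locally under the hypothetical name `train.json`, `list(blocks(load_token_ids("train.json")))` would give a list of 1024-token training examples. If the real files nest the IDs under a key or store one document per entry, `load_token_ids` would need to be adapted accordingly.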