Screenplay Corpus Tokenized for GPT-2 Fine-Tuning | DataSalon

Media & Communication

Name: Screenplay Corpus Tokenized for GPT-2 Fine-Tuning
Creator: kazkiryuu
Published: 2026-06-09T06:59:40
Keywords: Movie Scripts, GPT-2 Training, NLP Corpus, Text, Large Scale, Natural Language Processing, Screenplay Text

Available on 1 platform

Description

Approximately 94 million tokens of professionally formatted screenplay text, pre-tokenized for direct use in GPT-2 training pipelines. The corpus is derived from the Movie-Script-Database by Aveek Saha and is provided as tokenized JSON splits. The dataset was created by kazkiryuu and was last updated on June 9, 2026.

Use Cases

Fine-tuning GPT-2 models for screenplay generation based on the pre-tokenized corpus.
Evaluating language model performance on structured, professionally formatted text.
Analyzing stylistic patterns in movie scripts using the provided tokenized data.

Strengths

Contains approximately 94 million tokens of screenplay text.
Data is pre-tokenized and ready for direct consumption by a GPT-2 Trainer pipeline, requiring no preprocessing.
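
To put the token count in training terms, the sketch below estimates how many fixed-length blocks and optimizer steps one pass over the corpus would take. It is a back-of-the-envelope calculation, not a description of how the splits are actually laid out: it assumes the tokens are packed into non-overlapping blocks of 1,024 tokens (the context length of the original GPT-2 models), treats the listed 94 million as exact, and uses a hypothetical batch size. The function names are illustrative only.

```python
import math

# Context length of the original GPT-2 models.
GPT2_CONTEXT = 1024


def blocks_per_epoch(total_tokens, block_size=GPT2_CONTEXT):
    """Number of full, non-overlapping blocks (assumed packing; leftover tokens are dropped)."""
    return total_tokens // block_size


def steps_per_epoch(total_tokens, batch_size, block_size=GPT2_CONTEXT):
    """Optimizer steps for one pass, assuming one step per batch of blocks."""
    return math.ceil(blocks_per_epoch(total_tokens, block_size) / batch_size)


if __name__ == "__main__":
    approx_tokens = 94_000_000  # approximate figure from the listing
    print(blocks_per_epoch(approx_tokens))              # 91796
    print(steps_per_epoch(approx_tokens, batch_size=8))  # 11475 (batch size is hypothetical)
```

If the splits turn out to use a different sequence length, pass it as `block_size`; the estimate scales inversely with it.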

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and file size are unknown, which may limit suitability assessment.
Last updated 2026-06-09 06:59:41; freshness should be verified.
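
Because the fields are undocumented, a first step after download is to look at which keys each record carries and what types their values have. The sketch below is one minimal way to do that with the standard library. It assumes a split is either a JSON array of objects or a JSON Lines file; the listing does not say which, and the file name and field names shown are hypothetical. It reads the whole file into memory, which may be impractical for a very large split.

```python
import json
from collections import Counter, defaultdict


def load_records(path):
    """Load a split stored as a JSON array, a single JSON object, or JSON Lines."""
    with open(path, encoding="utf-8") as f:
        text = f.read()
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: fall back to one JSON value per line.
        data = [json.loads(line) for line in text.splitlines() if line.strip()]
    return data if isinstance(data, list) else [data]


def summarize_fields(records):
    """Count the Python types seen under each key across the records."""
    summary = defaultdict(Counter)
    for record in records:
        if not isinstance(record, dict):
            # Records that are not objects (e.g. bare token lists) are tallied separately.
            summary["<non-dict record>"][type(record).__name__] += 1
            continue
        for key, value in record.items():
            summary[key][type(value).__name__] += 1
    return {key: dict(counts) for key, counts in summary.items()}


if __name__ == "__main__":
    # "train.json" is a hypothetical file name; substitute the downloaded split.
    records = load_records("train.json")
    print(len(records), "records")
    print(summarize_fields(records))
```

A key whose values are always lists of integers is a likely candidate for token IDs, but that reading should be confirmed against the GPT-2 vocabulary rather than assumed.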

Provenance

Source: Derived from Movie-Script-Database by Aveek Saha.
Collection Method: Pre-tokenized screenplay corpus.
Freshness: Last updated 2026-06-09 06:59:41.

License is unknown; users should verify terms before use.

Quality Score

D36

Description

Source

Reputation

Community

1 like

0 views

Dataset Info

Author: kazkiryuu
Created: Jun 9, 2026
Updated: Jun 9, 2026
Last synced: Jun 13, 2026
