Available on 1 platform
This dataset contains Reddit sentences, each scored for its similarity to spoken dialogue and to written forum communication. It was created for the EMNLP 2025 paper, though the authors note that it was not used in the final results: in early experiments it showed no significant gains over the smaller C4 and Subtitle training sets.
Because the dataset does not figure in the cited paper's final results, it may be of limited use for replicating the reported outcomes. License and access details are unknown.