Reddit Sentence Similarity Scores Dataset | DataSalon

Media & Communication

Name: Reddit Sentence Similarity Scores Dataset
Creator: figmtu
Published: 2023-02-21T08:12:11
Keywords: Reddit, Text, Natural Language Processing, Dialogue Analysis

Available on 1 platform

Description

This dataset contains Reddit sentences scored for similarity to spoken dialogue and to written forum communication. It was created for the EMNLP 2025 paper, though the authors note it was not used in the final results. Early experiments with it showed no significant gains over training on smaller C4 and Subtitle sets.

Use Cases

Training models to distinguish between conversational and formal written styles
Analyzing linguistic features of online communication platforms
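
As a rough illustration of the first use case, the sketch below turns the two similarity scores into a coarse style label. The record layout, the field names (`dialogue_score`, `written_score`), the `margin` threshold, and the example rows are all hypothetical, since the dataset's actual schema is not documented here; treat this as one possible approach under those assumptions rather than the authors' method.

```python
from dataclasses import dataclass


@dataclass
class ScoredSentence:
    # Hypothetical record layout; the real column names are not documented.
    text: str
    dialogue_score: float  # assumed: similarity to spoken dialogue
    written_score: float   # assumed: similarity to written forum text


def label_style(row: ScoredSentence, margin: float = 0.1) -> str:
    """Label a sentence by whichever score leads by at least `margin`."""
    diff = row.dialogue_score - row.written_score
    if diff >= margin:
        return "conversational"
    if diff <= -margin:
        return "written"
    return "ambiguous"


# Made-up toy rows for illustration only; these are not values from the dataset.
rows = [
    ScoredSentence("lol yeah i guess so", 0.82, 0.31),
    ScoredSentence("The benchmarks were run on identical hardware.", 0.22, 0.77),
    ScoredSentence("Thanks for the update.", 0.50, 0.48),
]
labels = [label_style(r) for r in rows]
```

Sentences labeled "ambiguous" could be dropped before training a style classifier, at the cost of a smaller training set.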

Strengths

Provides similarity scores along two stylistic dimensions
Based on a large-scale platform (Reddit)

Limitations

Scoring methodology is not fully documented
Dataset scale (row count) and specific source subreddits are unknown
May reflect biases inherent to the Reddit platform and its user base

Provenance

Source: Authors of the EMNLP 2025 paper (figmtu)
Collection Method: Sentences were scored according to their similarity to spoken dialogue and to written communication, though the exact method is not fully documented.
Time Range: Unknown
Geography: Unknown

The dataset was not used in the final results of the cited paper, so its practical utility for replicating reported outcomes may be limited. License and access details are unknown.

Quality Score

D34

Community

2.7K downloads

1 like

0 views

Dataset Info

Author: figmtu
Created: Feb 21, 2023
Updated: Apr 29, 2026
Last synced: May 28, 2026
