News corpora, social media analysis, movie/music metadata, sports data, cultural datasets, misinformation
11,020 datasets
Reddit NSFW writing prompts, likely sourced from ShareGPT conversations. The dataset was uploaded by author 'lipilipic' to the Hugging Face platform and was last updated on 2026-04-04 16:24:02. Its specific content, scale, and structure require verification after download.
Movies is a dataset hosted on the Kaggle platform. The dataset's specific content, size, and provenance are not detailed in the available metadata. Users must download the data to verify its scope, features, and potential applications.
A dataset of news content published on Kaggle. The title suggests it likely contains textual news articles or headlines. The author, organization, and specific temporal coverage are unknown.
A high-fidelity facial expression dataset focused on Asian demographics. The dataset is hosted on Kaggle, but details about its size, creation date, and authorship are not provided.
This dataset delivers public data on cultural capital for selected counties designated as the most renewable in eight U.S. economic regions. It assesses community resources such as libraries, religious proclivities, ethnic heritage, language use, festivals, museums, symbolism, and education. The dataset was authored by Michael Petersen and is hosted by Harvard Dataverse.
Crystal Math Preview is a collection of 1,000 to 10,000 mathematical reasoning problems released by ycchen in February 2026 to accompany a research preprint. The dataset focuses on olympiad and competition-level mathematics, featuring specialized configurations derived from high-reasoning-budget rollouts. It serves as an early-access version of a larger planned release for training and evaluating mathematical reasoning models.
Finetuned-steam-reviews is a text dataset sourced from Kaggle. The dataset likely contains user reviews from the Steam gaming platform, potentially processed or annotated for machine learning tasks. Its specific size, author, and update history are not provided in the available metadata.
The Environmental Information Data Centre provides predicted outcomes for land use change scenarios across 127 sub-catchments in upland Wales. The data project the maximum and minimum change for 10 land-cover types based on factors such as agricultural land quality and ownership. This work was part of the NERC-funded DURESS project and uses underlying mapping data from 1998-2007.
wav2vec2 is a machine learning model for speech recognition. The dataset likely contains audio data and corresponding model weights or training artifacts. It is published on Kaggle under the identifier 'facebook/wav2vec2-base'.
Top Rated Movies data was collected using the TMDB API. The dataset likely contains information on films with high user ratings. The specific number of rows, columns, and last update date are unknown.
reviews_dataset is a text dataset hosted on Kaggle. The dataset likely contains user-generated review content. Its specific size, origin, and detailed contents are not described in the available metadata.
medical_news_vi is a dataset of medical news articles published on Kaggle. The dataset's specific size, source, and time period are not detailed in the available metadata. Its content likely contains text from medical news sources.
Sentimentanalysdata-facebook/nlbb is a dataset published on Kaggle. The title suggests it contains data from Facebook for sentiment analysis. The dataset's specific content, size, and creation details require verification after download.
A dataset of Steam game reviews intended for fine-tuning models. The data was published on Kaggle. The specific volume, time range, and collection methodology are unknown from the provided metadata.
A literature review by Cher Carney for Battelle's guideline development project, analyzing over 200 articles, several books, and more than 100 websites on in-vehicle information system (IVIS) symbols. The report synthesizes findings on icon design, standards, and evaluation methods, concluding with five key points about the state of IVIS icon development. It includes 7 appendices, 88 figures, and 7 tables across 247 pages.
A collection of COVID-19-related headlines and claims shared across the internet, each labeled for veracity. The dataset was published by Sumit Banik in response to research demand for a combined fake news resource. It contains a binary outcome column where 0 indicates a fake headline and 1 indicates a true one.
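The label convention described above (0 for a fake headline, 1 for a true one) can be checked with a short sketch like the following. The column names `headline` and `outcome` and the sample rows are assumptions made for illustration; the actual file layout should be verified after download.

```python
from collections import Counter

# Label mapping taken from the description: 0 = fake, 1 = true.
LABEL_NAMES = {0: "fake", 1: "true"}


def summarize_labels(rows, label_key="outcome"):
    """Count fake vs. true headlines.

    `label_key` is an assumed column name, not confirmed by the metadata.
    """
    counts = Counter(LABEL_NAMES[int(row[label_key])] for row in rows)
    return {"fake": counts.get("fake", 0), "true": counts.get("true", 0)}


# Made-up rows for illustration only; they are not taken from the dataset.
sample = [
    {"headline": "example claim A", "outcome": 0},
    {"headline": "example claim B", "outcome": 1},
    {"headline": "example claim C", "outcome": 0},
]

print(summarize_labels(sample))
```

A count like this is a quick way to see whether the two classes are balanced before training a classifier on the labels.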
A review paper discussing text classification techniques for social media data. The paper, authored by IOSR Journals, examines data from platforms such as Facebook, Twitter, LinkedIn, and YouTube, which contain user sentiments and opinions. It compares different machine learning classifiers for extracting meaningful information from informal, unstructured text.
MUSDB18-HQ is an uncompressed audio dataset containing 150 full-track songs across different styles, created by Zafar Rafii et al. in 2019. It provides stereo mixtures and isolated sources (vocals, bass, drums, other) for 100 training and 50 test songs, encoded as 44.1 kHz WAV files. The dataset serves as a reference for designing and evaluating source separation algorithms and was used in the SiSEC 2018 campaign.
Kaggle hosts a dataset titled 'review-chekpoints--2026-05-07--13246-13246'. The title suggests it likely contains evaluation data or metrics for machine learning model checkpoints. No further metadata on size, source, or specific content is available.
A collection of medical news articles. The dataset is hosted on Kaggle, but its specific source, size, and creation date are unknown. Columns and sample data are not provided in the metadata.