News corpora, social media analysis, movie/music metadata, sports data, cultural datasets, misinformation
10,967 datasets
A dataset likely containing text data related to the review of damage claims, potentially for insurance or property assessment. It was published on Kaggle, but its specific origin, size, and creation date are unknown. The dataset's content and structure must be verified after download.
Top Rated Movie Dataset is a collection of movie information and ratings published on Kaggle. The dataset's specific size, columns, and creation date are unknown. Its content likely includes titles and user or critic ratings.
A benchmark dataset for cross-cultural negotiation analysis. It contains records of the same B2B deal and the same negotiators, with the country context changed between Great Britain and the United States. The dataset appears to be designed for controlled comparison of negotiation behaviors across these two cultural settings.
Air pressure measurements from the Halftide Rocks automatic weather station (AWS), provided by the Australian Ocean Data Network. The dataset covers just over nine years, from 26 July 2000 to 19 December 2009, and was collected by deployed weather sensors.
A curated dataset for training models to distinguish between AI-generated 'slop' and quality human writing. It was created by feeding 200 prompts from ChaoticNeutrals/Reddit-SFW-Writing_Prompts_ShareGPT into various LLMs and comparing responses. The dataset was authored by DrRiceIO7 and last updated on March 24, 2026.
A 10-year dataset builder for NASDAQ market data, created by HaiwenWang. It includes daily and hourly OHLCV data, with optional news and ticker-level fundamental data attachments. The dataset page was last updated in April 2026.
A dataset for predicting the publishing channel of Mashable.com news articles from their title text and auxiliary numerical features. It originates from the UCI Machine Learning Repository's Online News Popularity collection and was referenced in a 2021 arXiv preprint benchmarking multimodal AutoML. Authors include Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J. Smola.
A challenging tabular dataset for predicting the log-scaled popularity of Mashable.com news articles based on title text and auxiliary numerical features. The dataset, sourced from a 2021 arXiv paper, is intended as a difficult benchmark for AutoML systems. Authors include Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J. Smola.
The news_channel dataset is a classification task: predict which Mashable.com news category an article belongs to from its title text and auxiliary numerical features. The original data was collected for the Online News Popularity dataset hosted by the UCI Machine Learning Repository. This version was referenced in the paper 'Benchmarking Multimodal AutoML for Tabular Data with Text Fields' by Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J. Smola.
Google review policy violation audit data by BHMarketer.ai. The dataset likely contains records of reviews flagged for violating Google's policies. The specific scope, size, and collection period are not detailed.
A dataset of IMDB reviews, likely containing user-generated review text for movies. It is hosted on Kaggle, a platform for data science competitions and projects. Specific details such as the number of reviews, time range, and collection method are not provided in the available metadata.
Water Corporation sewer pipes with no pumps or pressure systems connected. The dataset includes features such as gravity flow, wastewater type, and asset ownership. It was last updated by the Water Corporation in March 2026.
Metric-AI's Reddit Armenian Dataset is a subset of Reddit content containing titles and bodies translated into Armenian. The translations were produced with the Gemma-2-27B-it model, and the dataset is intended for training Armenian text embedding models. It was last updated on March 25, 2026.
Global Company Ratings & Employee Reviews contains employee ratings, sentiment tags, and workplace culture metrics. The dataset appears to be sourced from Kaggle, though the original author and organization are unknown. The last update date and specific data volume are not provided.
A collection of sports fields from the City of Moreton Bay's Data Hub, last updated on March 23, 2026. The dataset, created by moretonbaygis, is available in multiple formats including XLSX, CSV, and GeoJSON.
Facebook-scraper_data likely contains information extracted from Facebook's public pages or groups. The dataset is hosted on Kaggle, but its specific contents, size, and creation details are unknown. Columns, sample data, and authorship information are not provided in the metadata.
fake-reviews-dataset is a text dataset hosted on Kaggle. The dataset likely contains examples of fake reviews, which could be used for training models to detect deceptive or inauthentic text. Its specific size, origin, and creation date are unknown.
Trending movies data sourced from The Movie Database (TMDb) and published on Kaggle. The dataset's specific size, columns, and update frequency are not detailed in the provided metadata. Users should verify the actual content and structure after download.
A dataset concerning compressive strength, likely related to materials such as concrete or composites. It is hosted on Kaggle, but its author, creation date, and specific scope are not detailed in the provided metadata. The actual data content, including the number of records and specific features, requires verification after download.
A cleaned and structured version of the raw Reddit Pushshift dump, transformed into columnar Parquet files. The dataset includes both Reddit submissions and comments, prepared by the author 'anhchanghoangsg'. It was last updated on March 23, 2026.