News corpora, social media analysis, movie/music metadata, sports data, cultural datasets, misinformation
10,984 datasets
This dataset lists community solar projects identified from various sources as of Spring 2018. It includes project attributes such as State, Service Territory, and System Capacity. The database is maintained by the Department of Energy's National Renewable Energy Laboratory (NREL).
Missouri's Alcohol and Tobacco Control (ATC) dataset of conditionally approved product labels submitted for review. The data includes labels submitted over five business days prior to the current date, which are in a submitted, in-review, or conditionally approved status. It is published by data.mo.gov and was last updated on 2026-02-24.
A collection of New Zealand's publicly owned aerial and satellite imagery, ranging from 5cm resolution in urban areas to lower-resolution full national coverage. The dataset includes historical imagery scanned from film, orthorectified, and georeferenced, provided as Cloud Optimised GeoTIFFs with STAC metadata. It is published by Toitū Te Whenua Land Information New Zealand under a CC-BY-4.0 license.
Global prison facility locations with source provenance and review flags. The dataset is hosted on Kaggle, but the author, organization, and specific creation details are unknown. The last update date and data volume are also unspecified.
600 million news articles from the Common Crawl archive, processed from 2016 to June 2024. The data has been cleaned, deduplicated, and includes language detection for articles in over 100 languages. This dataset was created by kareenamehta and is hosted on Hugging Face.
Daily gridded analyses for the Northern Hemisphere, produced operationally by the National Meteorological Center (NMC), from August 1963 to December 1972. The dataset includes parameters such as upper-level winds, surface temperature, sea-level pressure, tropopause pressure and temperature, and 500 mb relative humidity. Data is structured on a 47x51 polar-stereographic grid centered on the North Pole.
Banglanewsmm-dataset is a text corpus hosted on Kaggle. The dataset's title suggests it contains news content in the Bangla language. Specific details regarding its size, collection method, and authorship are unavailable from the provided metadata.
Netflix content data includes movies and TV shows with associated ratings and genres. The dataset likely contains information on popularity and content types for analysis. Its origin and specific size are not detailed in the provided description.
Kaggle hosts a dataset titled 'Fake-News-Detection'. The dataset likely contains text articles or statements labeled for veracity. Its specific size, origin, and creation date are unknown from the provided metadata.
IMDB_Movie.csv is a dataset of movie information, likely sourced from the Internet Movie Database. The dataset's specific contents, such as columns for titles, ratings, or cast, are inferred from its name. It was published on Kaggle, but details on its creation, size, and update history are not provided.
The PHEME dataset contains a collection of Twitter rumours and non-rumours posted during five breaking news events. The dataset includes 1,972 rumours and 3,830 non-rumours across events like the Charlie Hebdo attack, Ferguson unrest, and Germanwings crash. It was created by Arkaitz Zubiaga for the paper 'Learning Reporting Dynamics during Breaking News for Rumour Detection in Social Media'.
California is the likely geographic focus of this dataset. The title suggests it contains text data related to news coverage of political interactions, specifically involving the Hilton entity and Governor Gavin Newsom. The dataset is hosted on Kaggle, but its specific content, size, and origin are not detailed in the provided metadata.
A 2026 systematic review and meta-analysis by Jianbin Guan compares unilateral biportal endoscopy (UBE) and percutaneous transforaminal endoscopic discectomy (PTED) for treating far lateral lumbar disc herniation. The dataset contains aggregated results from multiple clinical studies, focusing on efficacy, safety, and radiation exposure metrics. It was published in the Jianbin Guan Dataverse.
BenchPreS is a benchmark for evaluating persistent-memory large language models. It pairs 10 user profiles with 39 recipient-task contexts across five formal communication domains. The dataset was created by sangyon and last updated on March 20, 2026.
Afghanistan news articles collected from unspecified sources. The dataset is hosted on Kaggle, but the author, organization, and specific collection method are unknown. Its size, format, and exact publication date are also unspecified.
A dataset of reviews for companies based in India. It is hosted on the Kaggle platform. The specific source, collection method, and volume of data are not detailed in the available metadata.
Persian news articles likely organized for classification tasks. The dataset is hosted on Kaggle, but its specific size, creation date, and authorship are not detailed in the provided metadata. Columns and sample data are unknown, making a full assessment impossible without downloading the files.
A dataset sourced from the Reddit platform, published on Kaggle. The specific content, scale, and collection methodology are not detailed in the available metadata. Further verification after download is required to confirm the dataset's exact composition and potential applications.
A dataset related to movies, published on the Kaggle platform. The specific contents, scale, and origin are not detailed in the available metadata. Further details such as the number of records, specific features, and creation date require verification after accessing the data.
Data from a textile analysis of archaeological linen textiles. The dataset is authored by Payton Becker and was last updated in March 2026. It is a small dataset of 17.8 KB with an unknown number of rows and columns.