News corpora, social media analysis, movie/music metadata, sports data, cultural datasets, misinformation
10,999 datasets
James H. Merrell's account details the lives and work of cultural go-betweens on the Pennsylvania frontier. The text covers the period from the Quaker colony's founding in the 1680s into the 1750s, examining efforts to maintain peace between European colonists and Native Americans. It reflects on the meanings of wilderness and on the eventual failure of diplomacy, which led to war after 1750.
Jessala Grijalva developed this replication package in 2026, applying Gaussian Mixture Model clustering to 4,785 records from the 2006 Latino National Survey. The data identifies four distinct acculturation orientations—Culture Affirming, Assimilationist, Demicultural, and Bicultural—using a bootstrap-validated inferential framework. It includes the full R/Quarto analysis pipeline and processed data artifacts for two political science manuscripts.
Small-scale GIS data layers compiled by the National Park Service for a Baseline Water Quality Data Inventory and Analysis Report. The layers depict locations of water quality monitoring stations, industrial discharges, drinking intakes, water gages, and water impoundments within Big Cypress National Preserve. Data was last updated on March 4, 2026.
Small-scale GIS data layers compiled by the National Park Service for a Baseline Water Quality Data Inventory and Analysis Report. The layers were used to map locations of water quality monitoring stations, industrial discharges, drinking intakes, gages, and impoundments based on EPA databases. Data includes features like roads, hydrography, and political boundaries, generally at a 1:100,000 scale.
INFINI-NEWS Corpus is a large-scale multilingual collection of news articles extracted from Common Crawl News archives. The dataset, created by author 'ruggsea', contains articles from 2021 to 2025, with partial statistics showing 242 GB of data for 2021 and 356 GB for 2022. It was last updated on the platform in February 2026.
Seattle Parks and Recreation maintains a dataset of soccer fields, published as a hosted feature layer from the DPR.AthleticsFields feature class. The data is filtered using a definition query (WHERE SOCCER > 0) and is updated on a weekly refresh cycle. The number of fields, rows, and columns is not stated in the available metadata.
Kaggle dataset titled 'review-chekpoints--2026-05-29--13268-13268'. The dataset's content likely relates to checkpoints or evaluations for machine learning models, as suggested by its platform tags. Metadata is minimal; the actual data content and structure require verification after download.
Seattle Parks and Recreation maintains this dataset of football fields, filtered from a broader athletics fields feature class. It is updated weekly, though the specific number of field records is not provided. The data includes geographic features and is available in multiple formats including CSV, GeoJSON, and KML.
Seattle Parks and Recreation maintains a list of baseball and softball fields. The data is filtered from a larger athletics feature class using the query 'WHERE BASEBALL > 0' and is updated weekly.
Wine reviews from sommeliers, likely containing text descriptions for tasting notes and structured features like price and country of origin. The dataset was originally collected from WineEnthusiast and compiled by authors Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J. Smola for a benchmarking paper on multimodal AutoML.
WineEnthusiast reviews collected for a machine learning benchmark. The dataset likely contains tasting descriptions from sommeliers and features like price and country-of-origin. Authors Xingjian Shi, Jonas Mueller, Nick Erickson, Mu Li, and Alexander J. Smola published the dataset in a 2021 arXiv paper on multimodal AutoML.
FOMC press releases published on Kaggle. The dataset likely contains official statements and announcements from the Federal Open Market Committee. The specific number of documents, time range, and original source are not detailed in the provided metadata.
Independent Medical Review (IMR) decisions from the California Department of Managed Health Care, covering all determinations administered since January 1, 2001. The dataset documents reviews of health plan denials for services deemed not medically necessary, experimental, or non-urgent.
This California Department of Managed Health Care dataset contains all proposed health plan premium rate filings submitted since January 1, 2011. The dataset supports public transparency and accountability in health insurance rate setting. Row and column counts are not specified in the available metadata.
A June 2013 review details four submarine geolocation technologies for a 2012 CO2 release experiment offshore Oban, Scotland. The QICS1 experiment involved 200 instrument deployments, collection of 1,300 samples, and placement of 24 seabed indicator cages. The report compares audio (acoustic) and visual (photography, video) techniques for locating CO2 bubble streams and equipment.
British Geological Survey research analyzes residual saturation trapping of CO2 in sandstone reservoirs. Experimental results indicate 13–92% of injected CO2 can be residually trapped, providing evidence for storage security assessments. The data supports modeling of leakage event probabilities and financial mechanisms for carbon capture and storage projects.
Kaggle hosts a dataset titled 'movies'. The dataset's content likely pertains to films, but specific details such as the number of records, included features, and its origin are not provided in the available metadata. The platform tags suggest it is structured as tabular data.
ThaiSafetyBench contains 1,889 malicious Thai-language prompts developed by typhoon-ai in 2026 to evaluate the safety of large language models. The collection combines translated global safety benchmarks with original prompts specifically designed to test culturally specific attack vectors unique to the Thai context.
VietNews-Summarizer is a dataset published on Kaggle. The title suggests it likely contains Vietnamese-language news articles paired with summaries. The dataset's creator, size, and specific contents are not detailed in the available metadata.
Corporate Cyber Threat OSINT: Twitter & LinkedIn is a dataset likely containing open-source intelligence data gathered from social media platforms. The dataset is hosted on Kaggle, but its specific content, size, and creation details are not provided. Its columns, sample data, and update history are unknown.