This synthetic HR dataset contains employee records with intentional data quality issues: missing values, inconsistent formatting, and duplicate entries. It includes standard organizational fields such as employee names, department assignments, and hire dates, and is meant to simulate the problems found in real-world administrative data.
Use Cases
- Develop a deduplication algorithm to identify redundant records based on employee name and ID fields.
- Create a date normalization script to standardize the hire_date column into a single ISO format.
- Build a text cleaning pipeline to fix casing and whitespace issues in the employee_name and department columns.
- Practice outlier detection and handling on numerical fields like salary or years_of_service.
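
The use cases above can be sketched with the Python standard library alone. This is a minimal illustration, not a reference implementation. The column names `employee_name`, `department`, `hire_date`, and `salary` come from this card. The `employee_id` key, the list of date formats, and the helper names are assumptions made for the example and should be adjusted to the actual file.

```python
import re
import statistics
from datetime import datetime

# Candidate date layouts (assumed; the card does not list the formats present).
# Order matters: "03/05/2021" is read as month/day/year here.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y", "%d-%m-%Y")


def clean_text(value):
    """Collapse whitespace, trim, and title-case a string; empty or None gives None."""
    if value is None:
        return None
    collapsed = re.sub(r"\s+", " ", value).strip()
    # Title-casing is crude: it would turn an acronym like "HR" into "Hr".
    return collapsed.title() if collapsed else None


def normalize_date(value):
    """Return the date as ISO 8601 (YYYY-MM-DD), or None if no format matches."""
    if not value or not value.strip():
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean_records(records):
    """Clean each record, then keep the first row per (employee_id, employee_name)."""
    seen = set()
    cleaned = []
    for row in records:
        out = dict(row)
        out["employee_name"] = clean_text(row.get("employee_name"))
        out["department"] = clean_text(row.get("department"))
        out["hire_date"] = normalize_date(row.get("hire_date"))
        key = (out.get("employee_id"), out["employee_name"])  # employee_id is assumed
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(out)
    return cleaned


def iqr_outliers(values, k=1.5):
    """Return values outside [Q1 - k*IQR, Q3 + k*IQR], ignoring missing entries."""
    nums = [v for v in values if v is not None]
    if len(nums) < 4:
        return []
    q1, _, q3 = statistics.quantiles(nums, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in nums if v < low or v > high]
```

Cleaning runs before deduplication so that rows differing only in casing or whitespace collapse to the same key. Keeping the first occurrence is one possible policy; merging fields from duplicate rows may suit some tasks better.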
Strengths
- Includes synthetic employee records with intentional missing values and duplicate entries.
- Features inconsistent string formatting across name and department columns.
- Contains temporal data with mixed date formats to test parsing logic.
- Provides a controlled test bed for benchmarking automated data cleaning scripts against known categories of defects.
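
One way to benchmark a cleaning script is to profile the records before and after it runs and compare the counts. The sketch below assumes rows are loaded as dictionaries. `quality_profile` and the `employee_id` key field are hypothetical names chosen for the example, not part of the dataset.

```python
from collections import Counter


def quality_profile(records, key_fields=("employee_id", "employee_name")):
    """Count missing values per field and redundant rows per key (assumed fields)."""
    missing = Counter()
    keys = Counter()
    for row in records:
        for field, value in row.items():
            # Treat None and blank strings as missing.
            if value is None or (isinstance(value, str) and not value.strip()):
                missing[field] += 1
        keys[tuple(row.get(f) for f in key_fields)] += 1
    # Every occurrence of a key beyond the first counts as a duplicate row.
    duplicates = sum(count - 1 for count in keys.values() if count > 1)
    return {
        "rows": len(records),
        "missing": dict(missing),
        "duplicate_rows": duplicates,
    }
```

Running the function on the raw and the cleaned records gives two small dictionaries whose difference shows how many missing values and duplicates the script resolved.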