Data Prompts
Python Data Cleaning Script
Generate a pandas script to detect and fix common data quality issues in any dataset.
Prompt
Write a Python data cleaning script using pandas for the following dataset.

**Dataset description:** [DATASET_DESCRIPTION]
(Column names, types, and any known quality issues, e.g., "date column in mixed formats", "customer_id sometimes has leading zeros stripped", "revenue column has '$' and ',' characters")

**Known issues to fix:** [KNOWN_ISSUES]
(List specific problems, or write "unknown" to trigger a general-purpose audit)

The script should:

1. **Audit the raw data**: report null counts per column, duplicate row count, unique values for low-cardinality columns, value range for numeric columns, and a sample of distinct formats for date/string columns.
2. **Fix issues**: for each issue in [KNOWN_ISSUES] (or common issues if unknown):
   - Standardize date formats to ISO 8601 (YYYY-MM-DD)
   - Strip currency symbols and convert to float
   - Normalize whitespace and casing in string columns
   - Remove or flag duplicate rows
   - Impute or drop nulls (print a decision log explaining which columns were imputed vs. dropped and why)
3. **Re-audit**: after cleaning, run the same checks and print a before/after comparison showing what changed.
4. **Save the cleaned dataset**: output to `cleaned_[original_filename].csv`.

Add a comment above each logical block and make the script runnable as `python clean.py <input_file.csv>`.
How to Use
Replace [DATASET_DESCRIPTION] with a description of your data's columns and types. If you know specific problems (e.g., "the 'revenue' column contains strings like '$1,200.50'"), list them in [KNOWN_ISSUES]. If you write "unknown" instead, the generated script should run a general audit and apply the common fixes listed in the prompt (dates, currency, whitespace and casing, duplicates, nulls).
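To set expectations for the audit step, here is a minimal sketch of what it might look like. The function name `audit`, the `max_cardinality` threshold, and the sample data are all illustrative assumptions; the script the model returns will differ in naming and detail.

```python
import pandas as pd


def audit(df: pd.DataFrame, max_cardinality: int = 10) -> dict:
    """Collect basic data quality metrics for a DataFrame.

    max_cardinality is a hypothetical cutoff for what counts as a
    "low-cardinality" column; tune it for your data.
    """
    report = {
        "rows": len(df),
        "null_counts": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
        "low_cardinality": {},
        "numeric_ranges": {},
    }
    for col in df.columns:
        series = df[col]
        if pd.api.types.is_numeric_dtype(series):
            # Value range for numeric columns
            report["numeric_ranges"][col] = (series.min(), series.max())
        elif series.nunique(dropna=True) <= max_cardinality:
            # Distinct values for columns with few unique entries
            report["low_cardinality"][col] = sorted(
                series.dropna().astype(str).unique()
            )
    return report


# Made-up sample data, for illustration only
sample = pd.DataFrame(
    {
        "region": ["east", "west", "east", None],
        "revenue": [100.0, 250.5, 100.0, 75.0],
    }
)
audit_report = audit(sample)
for key, value in audit_report.items():
    print(f"{key}: {value}")
```

Returning a dictionary rather than printing directly makes the re-audit step easier: the same function can be called before and after cleaning and the two results compared key by key.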
Variables
| Variable | Description |
|---|---|
| [DATASET_DESCRIPTION] | Column names, types, and any domain context about what the data represents |
| [KNOWN_ISSUES] | Specific data quality problems you've observed — or "unknown" for a general-purpose audit |
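
As an illustration of how entries in [KNOWN_ISSUES] might translate into code, the sketch below handles three of the issues the prompt names: currency strings, mixed date formats, and inconsistent whitespace and casing. The helper names and the sample data are hypothetical, and the date parsing assumes month-first order for ambiguous values like `01/02/2024`, which may not match your data.

```python
import pandas as pd


def clean_currency(series: pd.Series) -> pd.Series:
    """Strip '$' and ',' then convert to float; unparseable values become NaN."""
    stripped = series.astype(str).str.replace(r"[$,]", "", regex=True).str.strip()
    return pd.to_numeric(stripped, errors="coerce")


def clean_dates(series: pd.Series) -> pd.Series:
    """Parse each value on its own, then format as ISO 8601 (YYYY-MM-DD).

    Parsing value by value tolerates mixed formats in one column.
    Assumption: ambiguous dates are read month-first (the pandas default).
    """
    parsed = pd.to_datetime(
        series.apply(lambda value: pd.to_datetime(value, errors="coerce"))
    )
    return parsed.dt.strftime("%Y-%m-%d")


def clean_text(series: pd.Series) -> pd.Series:
    """Trim, collapse internal runs of whitespace, and lowercase."""
    return series.str.strip().str.replace(r"\s+", " ", regex=True).str.lower()


# Made-up sample data, for illustration only
raw = pd.DataFrame(
    {
        "order_date": ["2024-01-05", "March 3, 2024", "01/15/2024"],
        "revenue": ["$1,200.50", "300", "$45"],
        "city": ["  New   York", "boston ", "BOSTON"],
    }
)
cleaned = raw.assign(
    order_date=clean_dates(raw["order_date"]),
    revenue=clean_currency(raw["revenue"]),
    city=clean_text(raw["city"]),
)
print(cleaned)
```

The more precisely an issue is described (for example, stating whether dates are day-first or month-first), the less the generated code has to guess at choices like the ones hard-coded here.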
Tips
- Run the audit section first without fixing anything by asking: "Generate only the audit section — no cleaning yet." Review the output before committing to a cleaning strategy.
- For large datasets, append this line to the prompt: "Add a --sample flag that runs the script on only the first 10,000 rows for testing."
- If you have a data dictionary or schema document, paste the relevant section into [DATASET_DESCRIPTION] for more accurate type handling.
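
For the `--sample` tip, this is roughly the command-line scaffolding one might expect back. It is a sketch under assumptions: the function names are hypothetical, and the audit and cleaning steps are left as a placeholder comment.

```python
import argparse
from pathlib import Path

import pandas as pd

SAMPLE_ROWS = 10_000  # row limit taken from the tip above


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Clean a CSV file with pandas.")
    parser.add_argument("input_file", type=Path, help="path to the raw CSV")
    parser.add_argument(
        "--sample",
        action="store_true",
        help="read only the first 10,000 rows for a quick test run",
    )
    return parser.parse_args(argv)


def load(path: Path, sample: bool) -> pd.DataFrame:
    # nrows=None reads the whole file; otherwise reading stops after SAMPLE_ROWS rows.
    return pd.read_csv(path, nrows=SAMPLE_ROWS if sample else None)


def output_path(input_file: Path) -> Path:
    # Follows the prompt's naming convention: cleaned_<original_filename>.csv
    return input_file.with_name(f"cleaned_{input_file.name}")


def main(argv=None):
    args = parse_args(argv)
    df = load(args.input_file, args.sample)
    print(f"Loaded {len(df)} rows from {args.input_file}")
    # Audit and cleaning steps would go here.
    # Note: with --sample, the saved file contains only the sampled rows.
    df.to_csv(output_path(args.input_file), index=False)
```

Calling `main()` under the usual `if __name__ == "__main__":` guard would make this runnable as `python clean.py <input_file.csv> --sample`.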