
Python Data Cleaning Script

Generate a pandas script to detect and fix common data quality issues in any dataset.

Intermediate | Works with any model | Data
Prompt
Write a Python data cleaning script using pandas for the following dataset.

**Dataset description:**
[DATASET_DESCRIPTION]
(Column names, types, and any known quality issues — e.g., "date column in mixed formats", "customer_id sometimes has leading zeros stripped", "revenue column has '$' and ',' characters")

**Known issues to fix:**
[KNOWN_ISSUES]
(List specific problems — or write "unknown" to trigger a general-purpose audit)

The script should:

1. **Audit the raw data** — report: null counts per column, duplicate row count, unique values for low-cardinality columns, value range for numeric columns, sample of distinct formats for date/string columns.

2. **Fix issues** — for each issue in [KNOWN_ISSUES] (or common issues if unknown):
   - Standardize date formats to ISO 8601 (YYYY-MM-DD)
   - Strip currency symbols and convert to float
   - Normalize whitespace and casing in string columns
   - Remove or flag duplicate rows
   - Impute or drop nulls (print a decision log explaining which columns were imputed vs. dropped and why)

3. **Re-audit** — after cleaning, run the same checks and print a before/after comparison showing what changed.

4. **Save the cleaned dataset** — output to `cleaned_[original_filename].csv`.

Add a comment above each logical block and make the script runnable as `python clean.py <input_file.csv>`.

How to Use

Replace [DATASET_DESCRIPTION] with a description of your data's columns and types. If you know specific problems (e.g., "the 'revenue' column contains strings like '$1,200.50'"), list them in [KNOWN_ISSUES]. If you write "unknown", the generated script should run a general audit and apply the common fixes listed in step 2 of the prompt.
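To give a sense of what to expect back, here is a minimal sketch of the audit, fix, and re-audit pattern the prompt asks for. It is not the script a model will actually produce: the `audit` and `clean` helpers, the `city` column, and the sample rows are all hypothetical, and only the `revenue` example is taken from the description above.

```python
import pandas as pd


def audit(df: pd.DataFrame) -> dict:
    """Collect a few of the checks from step 1 of the prompt."""
    return {
        "rows": len(df),
        "nulls_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(df.duplicated().sum()),
    }


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply three of the common fixes from step 2 (hypothetical columns)."""
    out = df.copy()
    # Normalize whitespace and casing in a string column.
    out["city"] = out["city"].str.strip().str.title()
    # Strip "$" and "," then convert to float; unparseable values become NaN.
    out["revenue"] = pd.to_numeric(
        out["revenue"].str.replace(r"[$,]", "", regex=True), errors="coerce"
    )
    # Drop rows that are exact duplicates once the values are normalized.
    return out.drop_duplicates().reset_index(drop=True)


# Made-up sample data, for illustration only.
raw = pd.DataFrame({
    "city": [" boston", "Boston ", "austin", None],
    "revenue": ["$1,200.50", "$1,200.50", "300", "n/a"],
})

before = audit(raw)
cleaned = clean(raw)
after = audit(cleaned)

# Before/after comparison, as in step 3 of the prompt.
for check in before:
    print(f"{check}: {before[check]} -> {after[check]}")
```

Note that the two Boston rows only register as duplicates after whitespace and casing are normalized, which is why running the same audit again after cleaning is useful.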

Variables

| Variable | Description |
| --- | --- |
| [DATASET_DESCRIPTION] | Column names, types, and any domain context about what the data represents |
| [KNOWN_ISSUES] | Specific data quality problems you've observed, or "unknown" for a general-purpose audit |

Tips

  • Run the audit section first without fixing anything by asking: "Generate only the audit section — no cleaning yet." Review the output before committing to a cleaning strategy.
  • For large datasets, append this line to the prompt: "Add a --sample flag to run on the first 10,000 rows for testing."
  • If you have a data dictionary or schema document, paste the relevant section into [DATASET_DESCRIPTION] for more accurate type handling.
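As an illustration of the `--sample` tip, the sketch below shows one way a generated script might wire that flag up with `argparse` and the `nrows` parameter of `pd.read_csv`. The function names `parse_args`, `load`, and `main` are assumptions for this example; a model may structure its script differently.

```python
import argparse

import pandas as pd

SAMPLE_ROWS = 10_000  # row limit used when --sample is passed


def parse_args(argv=None):
    """Parse `python clean.py <input_file.csv> [--sample]`."""
    parser = argparse.ArgumentParser(description="Clean a CSV file.")
    parser.add_argument("input_file", help="path to the raw CSV")
    parser.add_argument(
        "--sample",
        action="store_true",
        help=f"read only the first {SAMPLE_ROWS:,} rows for testing",
    )
    return parser.parse_args(argv)


def load(source, sample: bool) -> pd.DataFrame:
    """Read the CSV; nrows=None tells pandas to read the whole file."""
    return pd.read_csv(source, nrows=SAMPLE_ROWS if sample else None)


def main(argv=None):
    args = parse_args(argv)
    df = load(args.input_file, args.sample)
    print(f"Loaded {len(df):,} rows from {args.input_file}")
    return df


# In clean.py, call main() under an `if __name__ == "__main__":` guard.
```

With that guard in place, `python clean.py data.csv --sample` would read at most 10,000 rows. Because the sample is the head of the file, it may miss issues that only appear in later rows.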