Data Prompts
Feature Engineering Ideas
Generate candidate features for a machine learning model given a dataset description and prediction target.
Prompt
Generate feature engineering ideas for a machine learning model. **Prediction target:** [TARGET] (What are you trying to predict? e.g., "customer churn in the next 30 days", "house sale price", "loan default probability") **Available raw columns:** [RAW_COLUMNS] (List column names and types — e.g., "signup_date (datetime), last_login (datetime), total_orders (int), avg_order_value (float), country (string), plan_tier (categorical: free/pro/enterprise)") **Model type:** [MODEL_TYPE] (e.g., "gradient boosting (XGBoost/LightGBM)", "logistic regression", "neural network", "random forest") For each suggested feature: 1. **Feature name** — snake_case 2. **Construction** — exact formula or pandas/SQL code snippet to create it 3. **Rationale** — why this feature might predict [TARGET] 4. **Potential issue** — leakage risk, cardinality problem, or distribution concern Organize features into categories: - **Temporal features** (from datetime columns) - **Aggregation features** (counts, means, recency, frequency) - **Interaction features** (products or ratios of existing columns) - **Encoding features** (how to handle categoricals for [MODEL_TYPE]) - **Domain-specific features** (business logic derived features) Flag any features at high risk of data leakage.
How to Use
Describe the prediction target precisely in [TARGET] — the more specific (including the time horizon), the better the feature ideas. List all raw columns with their types in [RAW_COLUMNS]. Include [MODEL_TYPE] because the best encoding strategy (one-hot, target encoding, embeddings) differs by model family.
Variables
| Variable | Description |
|---|---|
| [TARGET] | What you are predicting — be specific about the definition and time horizon |
| [RAW_COLUMNS] | All available columns with their data types |
| [MODEL_TYPE] | The type of model, as this affects encoding and interaction feature strategies |
Tips
- After generating ideas, ask: "Which 5 of these features are most likely to have high predictive power and why?" to prioritize implementation.
- For temporal data, always ask the AI to check for leakage: "Verify that none of these features could contain information from after the prediction cutoff date."
- Use the SQL snippets to validate that the feature can actually be computed from your warehouse before implementing it in your Python pipeline.