Feature Engineering Ideas

Generate candidate features for a machine learning model given a dataset description and prediction target.

advancedWorks with any modelData

Prompt

Generate feature engineering ideas for a machine learning model.

**Prediction target:** [TARGET]
(What are you trying to predict? e.g., "customer churn in the next 30 days", "house sale price", "loan default probability")

**Available raw columns:**
[RAW_COLUMNS]
(List column names and types — e.g., "signup_date (datetime), last_login (datetime), total_orders (int), avg_order_value (float), country (string), plan_tier (categorical: free/pro/enterprise)")

**Model type:** [MODEL_TYPE]
(e.g., "gradient boosting (XGBoost/LightGBM)", "logistic regression", "neural network", "random forest")

For each suggested feature:
1. **Feature name** — snake_case
2. **Construction** — exact formula or pandas/SQL code snippet to create it
3. **Rationale** — why this feature might predict [TARGET]
4. **Potential issue** — leakage risk, cardinality problem, or distribution concern

Organize features into categories:
- **Temporal features** (from datetime columns)
- **Aggregation features** (counts, means, recency, frequency)
- **Interaction features** (products or ratios of existing columns)
- **Encoding features** (how to handle categoricals for [MODEL_TYPE])
- **Domain-specific features** (business logic derived features)

Flag any features at high risk of data leakage.

How to Use

Describe the prediction target precisely in [TARGET] — the more specific (including the time horizon), the better the feature ideas. List all raw columns with their types in [RAW_COLUMNS]. Include [MODEL_TYPE] because the best encoding strategy (one-hot, target encoding, embeddings) differs by model family.

Variables

Variable	Description
[TARGET]	What you are predicting — be specific about the definition and time horizon
[RAW_COLUMNS]	All available columns with their data types
[MODEL_TYPE]	The type of model, as this affects encoding and interaction feature strategies

Tips

After generating ideas, ask: "Which 5 of these features are most likely to have high predictive power and why?" to prioritize implementation.
For temporal data, always ask the AI to check for leakage: "Verify that none of these features could contain information from after the prediction cutoff date."
Use the SQL snippets to validate that the feature can actually be computed from your warehouse before implementing it in your Python pipeline.

Feature Engineering Ideas

How to Use

Variables

Tips

More Data prompts