Transform raw data into meaningful ML features and select the ones that matter
Deep learning hasn't eliminated the need for feature engineering; it has shifted it. Even with neural networks, the quality of your input representation directly affects model performance. For structured/tabular data (which dominates industry ML), thoughtful feature engineering often improves accuracy more than switching model architectures. The goal is simple: make the signal in your data more accessible to the model. Raw data often encodes relationships in forms the model can't easily exploit; a raw timestamp, for example, hides the weekly cycle that an explicit day-of-week column exposes.
- **Temporal decomposition**: Extract hour, day-of-week, is_weekend, days_since_event from timestamps
- **Aggregations**: Rolling 7-day average, max in last 30 days, count of events per user
- **Interaction features**: price_per_sqft = price / sqft; amount_to_income_ratio
- **Binning**: Age → age_group (0-18, 19-35, 36-55, 56+) for non-linear relationships
- **Target encoding**: Replace categorical ID with mean target value per category
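As a rough illustration of the techniques listed above, here is a minimal pandas sketch. The DataFrame, its column names (`user_id`, `merchant`, `churned`, and so on), and all values are made up for the example, so treat it as one possible way to build these features rather than a reference implementation.

```python
import pandas as pd

# Hypothetical transaction log; every column name and value is illustrative.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime([
        "2024-01-05 09:30", "2024-01-06 14:00", "2024-01-08 20:15",
        "2024-01-06 11:45", "2024-01-07 23:10",
    ]),
    "amount": [20.0, 35.0, 50.0, 10.0, 80.0],
    "income": [4000.0, 4000.0, 4000.0, 2500.0, 2500.0],
    "age": [17, 17, 17, 42, 42],
    "merchant": ["a", "b", "a", "b", "c"],
    "churned": [0, 0, 0, 1, 1],  # target
})

# Sort so per-user rolling results line up with the row order.
df = df.sort_values(["user_id", "timestamp"]).reset_index(drop=True)

# Temporal decomposition
df["hour"] = df["timestamp"].dt.hour
df["day_of_week"] = df["timestamp"].dt.dayofweek  # Monday=0 ... Sunday=6
df["is_weekend"] = df["day_of_week"] >= 5
first_seen = df.groupby("user_id")["timestamp"].transform("min")
df["days_since_first"] = (df["timestamp"] - first_seen).dt.days

# Aggregations: per-user event count and rolling 7-day mean of amount
df["user_event_count"] = df.groupby("user_id")["amount"].transform("count")
rolling_mean = (
    df.set_index("timestamp")
      .groupby("user_id")["amount"]
      .rolling("7D")
      .mean()
)
# Result is ordered by user, then time, which matches df after the sort above.
df["amount_7d_avg"] = rolling_mean.to_numpy()

# Interaction feature
df["amount_to_income_ratio"] = df["amount"] / df["income"]

# Binning: age -> age_group
df["age_group"] = pd.cut(
    df["age"],
    bins=[0, 18, 35, 55, float("inf")],
    labels=["0-18", "19-35", "36-55", "56+"],
    include_lowest=True,
)

# Target encoding: mean target per category.
# Computed on the full frame here for brevity; in practice compute the means
# on training data only, otherwise the target leaks into the feature.
merchant_means = df.groupby("merchant")["churned"].mean()
df["merchant_te"] = df["merchant"].map(merchant_means)

print(df[["user_id", "is_weekend", "amount_7d_avg", "age_group", "merchant_te"]])
```

The target-encoding step is the one most likely to cause trouble: computing category means on rows that are later used for evaluation inflates validation scores, so the means should come from the training split (or from out-of-fold estimates).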
Too many features slow training, cause overfitting, and make models harder to maintain. Select the features that add predictive value.
from sklearn.feature_selection import SelectFromModel, mutual_info_classif
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
X, y = load_features()  # placeholder for your own loader: X is a pandas DataFrame of features, y the class labels
# Method 1: Mutual information — catches non-linear relationships
mi_scores = mutual_info_classif(X, y, random_state=42)  # fixed seed: the estimator is randomized
mi_df = pd.DataFrame({"feature": X.columns, "mi_score": mi_scores})
top_features = mi_df.nlargest(20, "mi_score")["feature"].tolist()
print("Top 5 by mutual info:", top_features[:5])
# Method 2: Feature importance from a tree model
rf = RandomForestClassifier(n_estimators=100, random_state=42)
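selector = SelectFromModel(rf, threshold="median")  # keep features at or above the median importance
selector.fit(X, y)  # fits a clone of rf; transform() raises if the selector was never fitted
X_selected = selector.transform(X)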
selected_names = X.columns[selector.get_support()].tolist()
print(f"Selected {len(selected_names)} of {X.shape[1]} features")
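One way to check that a selected subset actually adds predictive value is to put the selector inside a scikit-learn `Pipeline` and cross-validate the whole thing, so that selection is re-fit on each training fold and never sees the held-out rows. The sketch below assumes synthetic data from `make_classification` and a logistic regression as the downstream model; both are stand-ins chosen for the example, not part of the original workflow.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): 30 features, 5 of them informative.
X, y = make_classification(
    n_samples=500, n_features=30, n_informative=5, n_redundant=0, random_state=0
)

pipe = Pipeline([
    # Selection step: keep features at or above the median forest importance.
    ("select", SelectFromModel(
        RandomForestClassifier(n_estimators=100, random_state=42),
        threshold="median",
    )),
    # Downstream model trained only on the selected columns.
    ("clf", LogisticRegression(max_iter=1000)),
])

# Each fold re-fits the selector on its own training split.
scores = cross_val_score(pipe, X, y, cv=5)
print(f"Mean CV accuracy with selection: {scores.mean():.3f}")

# Fit once on all data to inspect how many features survive.
pipe.fit(X, y)
n_kept = int(pipe.named_steps["select"].get_support().sum())
print(f"Kept {n_kept} of {X.shape[1]} features")
```

Comparing this score against the same classifier trained on all features gives a quick read on whether the dropped features were carrying signal.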