Y-aware Feature Engineering with High Cardinality Features (Part 2 of 4)
Most real-world data used in predictive modeling is not purely numeric. There are often columns/features of categorical data (e.g., product or customer identifiers, zip codes). When a categorical feature has many unique values, it is called a high cardinality feature. High cardinality features can carry a lot of strong signal, but they can also be very tricky to work with.
This is the second in a 4-part series in which Anders Larson and Shea Parkes discuss predictive analytics with high cardinality features. In this episode, they focus on y-aware feature engineering. Y-aware feature engineering is all about carefully bleeding information from your training response back into your engineered features without grossly misrepresenting your ability to generalize to new data.
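As a rough illustration of the idea, and not necessarily the approach discussed in the episode, one common y-aware technique is out-of-fold target encoding: each row's category is replaced by a smoothed mean of the response computed only from rows in other folds, so a row never sees its own response. The sketch below is plain Python. The function names, the `weight` smoothing parameter, and the index-modulo fold assignment are all assumptions made for illustration (the fold assignment presumes the rows are in no meaningful order).

```python
from collections import defaultdict


def smoothed_means(categories, y, prior, weight):
    """Per-category mean of y, shrunk toward `prior` by `weight` pseudo-observations."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, target in zip(categories, y):
        sums[cat] += target
        counts[cat] += 1
    return {
        cat: (sums[cat] + prior * weight) / (counts[cat] + weight)
        for cat in sums
    }


def out_of_fold_target_encode(categories, y, n_folds=5, weight=10.0):
    """Encode each row using only the responses of rows in the other folds.

    Hypothetical helper: folds are assigned by row index modulo n_folds,
    which assumes the rows are not sorted in a meaningful order.
    """
    n = len(y)
    if len(categories) != n:
        raise ValueError("categories and y must have the same length")
    if n_folds < 2 or n < n_folds:
        raise ValueError("need n_folds >= 2 and at least n_folds rows")

    encoded = [0.0] * n
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        held_idx = [i for i in range(n) if i % n_folds == fold]

        # Statistics come from the other folds only.
        train_y = [y[i] for i in train_idx]
        prior = sum(train_y) / len(train_y)
        means = smoothed_means(
            [categories[i] for i in train_idx], train_y, prior, weight
        )

        # Categories unseen in the other folds fall back to the prior.
        for i in held_idx:
            encoded[i] = means.get(categories[i], prior)
    return encoded


cats = ["a", "a", "b", "b", "c", "a"]
resp = [1, 0, 1, 1, 0, 1]
features = out_of_fold_target_encode(cats, resp, n_folds=2, weight=1.0)
```

The smoothing weight pulls rare categories toward the overall mean, and the fold structure keeps each row's own response out of its engineered feature. Both choices are ways of limiting how much the encoded feature overstates performance on new data.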