The Hashing Trick for High Cardinality Features (Part 3 of 4)
Most real-world data used in predictive modeling is not purely numeric. It often includes columns/features of categorical data (e.g., product or customer identifiers, zip codes). When a categorical feature has many unique values, it is called a high cardinality feature. High cardinality features can carry a lot of strong signal, but they can also be very tricky to work with.
This is the third in a four-part series in which Anders Larson and Shea Parkes discuss predictive analytics with high cardinality features. In this episode they focus on feature engineering via the hashing trick. The hashing trick is most applicable to features with extremely high cardinality, and at first glance it can seem almost ridiculous. In a lot of ways, it is the same as bucketing values at random. But there are times when including randomly engineered buckets is more valuable than excluding the original high cardinality feature entirely.
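For readers who want to see the idea in code, here is a minimal Python sketch of the hashing trick. It is an illustration under stated assumptions, not the approach discussed in the episode: the function names `hash_bucket` and `hash_encode`, the choice of MD5 as the hash, the bucket count of 8, and the sample zip codes are all hypothetical. Each value is hashed to one of a fixed number of buckets, and the bucket becomes a one-hot indicator column.

```python
import hashlib


def hash_bucket(value: str, n_buckets: int = 8) -> int:
    """Map a categorical value to a bucket index in [0, n_buckets)."""
    # hashlib is used instead of the built-in hash(), which is salted
    # per process for strings and so is not stable across runs.
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_buckets


def hash_encode(values, n_buckets: int = 8):
    """One-hot encode each value by its hashed bucket."""
    rows = []
    for value in values:
        row = [0] * n_buckets
        row[hash_bucket(value, n_buckets)] = 1
        rows.append(row)
    return rows


# Hypothetical sample data: a zip code column with a repeated value.
zip_codes = ["46204", "98101", "10001", "46204"]
encoded = hash_encode(zip_codes, n_buckets=8)

for zip_code, row in zip(zip_codes, encoded):
    print(zip_code, row)
```

The same value always lands in the same bucket, while unrelated values may collide and share one. Those collisions are what make the result resemble bucketing at random. The number of output columns is fixed by `n_buckets` no matter how many unique values the feature has.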