One-hot Encoding Techniques in Data Recipe for Einstein Discovery Dataset
One of the most common feature engineering methods for categorical attributes is transforming each categorical attribute into a numeric representation. This blog reviews the popular categorical-column encoding method “one-hot encoding” in Data Recipe, using Kaggle datasets as examples.
After finalizing the features and rows to be included in your model, the next step is to look for text-based values that can be converted into numbers. Aside from fixed text-based values such as True/False (which automatically convert to “1” and “0”, respectively), most algorithms are not compatible with non-numeric data. Another problem is that analyzing the correlation between a categorical attribute and the target variable is not straightforward. These challenges are why categorical attributes often need feature engineering.
One method to convert text-based values into numeric values is one-hot encoding, which transforms values into binary form, represented as “1” or “0” (True or False). A “0”, representing False, means that the value doesn't belong to a given feature, whereas a “1” (True, or “hot”) confirms that the value does belong to the feature.
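As a quick illustration outside Data Recipe, the same transformation can be sketched in pandas; the frame below is hypothetical, not the actual Kaggle file:

```python
import pandas as pd

# Small illustrative frame: one categorical column with two values.
df = pd.DataFrame({"id": [1, 2, 3], "diagnosis": ["M", "B", "M"]})

# get_dummies creates one binary (0/1) column per unique category value.
encoded = pd.get_dummies(df, columns=["diagnosis"], dtype=int)

# encoded now has the columns: id, diagnosis_B, diagnosis_M.
# diagnosis_M is [1, 0, 1] and diagnosis_B is [0, 1, 0].
```

Each new column answers a single True/False question ("is this row of type M?"), which is exactly the binary representation described above.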
Below is the Kaggle Cancer dataset, which we can use to observe one-hot encoding. For illustration purposes, let's assume that the diagnosis field categorizes the cancer type. We can convert this column into numeric values by applying the one-hot encoding method.
The diagnosis categorical column has 2 unique values.
One-hot encoding approaches in Data Recipe
We can convert categorical values into binary values using two methods.
1. Create an individual column for each categorical value using a case statement, filling in “1” when the value is present and “0” otherwise. This method, however, is time-consuming: if our categorical column has 10 unique values, we have to create 10 columns.
2. Pivot by grouping on the categorical column in the Aggregate node, which is a less time-consuming and more efficient method of encoding.
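Outside the recipe UI, the pivot-with-group-by idea can be sketched in pandas with `crosstab`, which counts rows per record id and category value (the data here is hypothetical):

```python
import pandas as pd

df = pd.DataFrame({"id": [1, 2, 3], "diagnosis": ["M", "B", "M"]})

# Count of rows, grouped by record id (rows) and category (columns).
# Because each id occurs exactly once, the counts are already 0/1 flags.
pivoted = pd.crosstab(df["id"], df["diagnosis"]).reset_index()

# pivoted columns: id, B, M
```

This mirrors the Aggregate node: "count of rows" is the aggregation, the record id preserves row granularity, and the categorical column becomes the pivoted column group.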
Step 1: After loading the input data source or transformation node, add the Aggregate node.
Step 2: In the Aggregate node, select “count of rows” as the aggregation type. In “group by rows”, select the record id in order to keep the granularity of the records, and in “group by column”, select the categorical column that needs to be binary encoded.
Step 3: After pivoting, augment/join the encoded columns back into the main dataset's register node.
Consideration
Pivoting limitation of the Aggregate node in Data Recipe:
Because pivoting increases the number of columns, keep in mind that the Aggregate node can create up to 5,000 columns. If the node exceeds the maximum, you can reduce the number of columns by changing the aggregates or row and column groupings.
Points to remember:
Values that aren't picked are automatically included in an “Other” column when you pivot with the Data Prep Aggregate node. For example, you can group columns by opportunity type, select your most important types, and let the rest be grouped as “Other”.
Using one-hot encoding, the dataset has expanded by two columns: we have created two new features from the original feature. This now makes it possible to input the data into our model and choose from a broader spectrum of machine learning algorithms.
The downside is that the dataset now has more features, which may slightly extend processing time. This is usually manageable but can be problematic for datasets where the original features are split into a large number of new features.
Consideration:
In the Einstein Discovery training dataset, the maximum number of columns is 50, so make sure that one-hot encoding does not push the dataset past this limit.
One trick to minimize the total number of features is to restrict binary cases to a single column. The Cancer dataset example above lists the diagnosis in a single column using one-hot encoding. Rather than creating discrete columns for both “B” and “M”, we merge these two values into one feature. According to the dataset's key, “B” is denoted as “1” and “M” as “0”.
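In code form, this single-column trick is just a boolean comparison; the frame below is illustrative, with the mapping taken from the dataset's key:

```python
import pandas as pd

df = pd.DataFrame({"diagnosis": ["B", "M", "B"]})

# One column instead of two: B -> 1, M -> 0, per the dataset's key.
df["diagnosis_encoded"] = (df["diagnosis"] == "B").astype(int)

# df["diagnosis_encoded"] is [1, 0, 1]
```

Because a two-value category carries only one bit of information, the second dummy column would always be redundant (it is the complement of the first).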
Below is a comparison of the encoded Data Recipe output against the dataset records.
How do you handle a high-dimensional categorical feature column?
Below is another Kaggle dataset, MELBOURNE_HOUSE_PRICES, which contains a high-dimensional categorical feature.
If we apply one-hot encoding to this column, we must add almost 86 new columns. Such a high-dimensional categorical representation often causes undesired consequences such as high training variance (eventually decreasing accuracy) and significant computation and memory consumption. In some cases, the number of categorical values can even exceed 300. In such a scenario, how can we leverage one-hot encoding while avoiding the loss of useful data rows in the training dataset?
Solution:
Instead of creating one-hot columns for all the categorical values, we can limit the encoding to the 5 or 10 most frequent values of the variable/column, keeping the Einstein Discovery column limitation in mind. This means we make one binary variable for each of the 5 most frequent values only, which is equivalent to grouping all the other values under a new category that, in this case, is dropped. The 5 or 10 new dummy variables thus indicate whether one of the 5 or 10 most frequent labels is present (1) or not (0) for a particular observation.
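A minimal pandas sketch of this top-n variant; the `one_hot_top_n` helper and the suburb values are hypothetical, not part of Data Recipe or the Kaggle file:

```python
import pandas as pd

def one_hot_top_n(df, column, n=5):
    """Create 0/1 dummy columns only for the n most frequent values;
    all remaining values fall into an implicit "other" bucket (all zeros)."""
    top_values = df[column].value_counts().head(n).index
    for value in top_values:
        df[f"{column}_{value}"] = (df[column] == value).astype(int)
    return df

houses = pd.DataFrame({"Suburb": ["Reservoir", "Richmond", "Reservoir",
                                  "Bentleigh", "Richmond", "Kew"]})
houses = one_hot_top_n(houses, "Suburb", n=2)

# Only Suburb_Reservoir and Suburb_Richmond are created; the Bentleigh
# and Kew rows are all zeros across the new columns (the implicit "other").
```

Capping `n` at 5 or 10 keeps the total column count well under the Einstein Discovery limit while still capturing the most informative categories.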
The screenshot below illustrates the top 5 most frequent categories for the variables Suburb and Council Area.
How can we do this in Data Recipe?
Please refer to the videos below to see how the top values can be encoded dynamically in Data Recipe.
Note: if you have more than one feature to encode for the top “n” values, extend the same two streams (finding the top-n values list and encoding the top-n values of the feature) for each feature, then augment/join the results back into the main dataset's register node.
Below is a comparison of the top-n-values encoded Data Recipe output against the dataset records.
Conclusion
For a feature with a large number of unique values or categories, one-hot encoding is not a great choice. There are various other techniques for encoding categorical (ordinal or nominal) features.