Recoding using One-Hot Encoding
What is One-Hot Encoding?
One-hot encoding is a technique for converting categorical data into a numerical format, which most machine learning algorithms require as input. It works by creating a new binary (0 or 1) column for each unique category in the original data. For a given row, the column corresponding to its category is set to 1, and all the other new columns are set to 0.
How it works
Identify unique categories:
First, identify all unique values within a categorical column (e.g., "Male," "Female," "Trans").
Create new binary columns:
Create a new column for each of these unique categories. For example, if your original column was "gender," you would create "gender_Male," "gender_Female," and "gender_Trans" columns.
Assign values:
For each row, place a '1' in the new column that matches the row's original category and a '0' in all the other new columns.
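The three steps above can be sketched by hand before reaching for a library helper. This is only an illustrative sketch: the variable names `categories` and `manual` are assumptions made for this example, and the loop is not how pandas implements its own encoder.

```python
import pandas as pd

data = pd.DataFrame({
    "gender": ["Male", "Trans", "Female", "Female", "Male", "Male", "Trans"]
})

# Step 1: identify the unique categories (sorted for a stable column order).
categories = sorted(data["gender"].unique())

# Steps 2 and 3: create one new column per category and fill it with
# 1 where the row matches that category and 0 everywhere else.
manual = pd.DataFrame(index=data.index)
for category in categories:
    manual[f"gender_{category}"] = (data["gender"] == category).astype(int)

print(manual)
```

Each row of `manual` contains exactly one 1, which is where the name "one-hot" comes from.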
Code:
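import pandas as pd

# Example data: a single categorical column
data = pd.DataFrame({
    'gender': ['Male', 'Trans', 'Female', 'Female', 'Male', 'Male', 'Trans']
})
# One-hot encode the column: one new column per unique category
gender_dummies = pd.get_dummies(data['gender'])
gender_dummies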
|   | Female | Male | Trans |
|---|--------|------|-------|
| 0 | False | True | False |
| 1 | False | False | True |
| 2 | True | False | False |
| 3 | True | False | False |
| 4 | False | True | False |
| 5 | False | True | False |
| 6 | False | False | True |
# Append the one-hot columns to the original DataFrame, side by side
data = pd.concat([data, gender_dummies], axis=1)
data
|   | gender | Female | Male | Trans |
|---|--------|--------|------|-------|
| 0 | Male | False | True | False |
| 1 | Trans | False | False | True |
| 2 | Female | True | False | False |
| 3 | Female | True | False | False |
| 4 | Male | False | True | False |
| 5 | Male | False | True | False |
| 6 | Trans | False | False | True |
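As an alternative to encoding the column and concatenating the result, `pd.get_dummies` can also be given the whole DataFrame together with a `columns` list. A minimal sketch, assuming you do not need to keep the original text column (the names `df` and `encoded` are chosen only for this example):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Trans", "Female", "Female", "Male", "Male", "Trans"]
})

# Encode the listed columns in place of the originals. The new columns are
# prefixed with the source column name, and "gender" itself is dropped.
encoded = pd.get_dummies(df, columns=["gender"], dtype=int)

print(encoded)
```

Unlike the `pd.concat` approach above, this drops the original "gender" column, so keep a copy of it if you still need the category labels.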
Statlearner