Indicator Variables
- 02:57
How to convert categorical variables into numerical inputs for machine learning algorithms by creating an indicator variable.
Glossary
Indicator Variables, Pandas, Python
Transcript
In machine learning, the open source packages you'll be using need numerical inputs for all of their algorithms. That means when you're using categorical variables like you see here, investment grade or high yield, you're going to need to convert them into numbers. We do that by creating, in this case, what's called an indicator variable, which tells you whether a condition is met or not. So in this case, we might assign investment grade to one, the positive case, and high yield to zero, the negative case. So our indicator variable is I, indicating whether this company is investment grade or not.
In this case, creating a normal Boolean mask doesn't work on its own, because the outputs are the Boolean values True and False instead of one and zero. But there's an easy way for us to fix that. So here I'm creating a Boolean mask, which is just a pandas Series, and I'm setting it equal to a comparison on our dataframe, ratings_df, where I'm specifically calling out the series IG or HY, investment grade or high yield. Notice that since that series name has spaces in it, dot notation won't work, so I use the square brackets method of calling out that series. And I'm checking to see if it's equal to the string investment grade, and notice that I'm using the double equal sign to check equality.
When I print that Boolean mask, I get values of True and False, which are Booleans rather than numbers. I can easily convert these into the integers zero and one so that my machine learning algorithms can interpret them. And I do that by going back and adding parentheses here around my Boolean mask.
And then I just write .astype to change the type of object that I'm producing, and in parentheses int for integer. So now when I create that Boolean mask and print it, I get ones for True and zeros for False. That gives me something that my machine learning algorithms can interpret.
So now I've created this Boolean mask, which I've named mask, and it has numeric values, ones and zeros in integer format so that my machine learning algorithms can interpret that variable.
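The steps described so far can be sketched as follows. This is a minimal illustration, not the exact code from the video: the sample rows, the Company column, the exact capitalization of "Investment Grade", and the name bool_mask are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data; the actual ratings_df from the video is not shown here.
ratings_df = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "IG or HY": ["Investment Grade", "High Yield", "Investment Grade", "High Yield"],
})

# A plain Boolean mask: True where the rating equals the string, False otherwise.
# Square brackets are required because the column name contains spaces.
bool_mask = ratings_df["IG or HY"] == "Investment Grade"
print(bool_mask.dtype)  # bool

# Wrap the comparison in parentheses, then cast True/False to 1/0.
mask = (ratings_df["IG or HY"] == "Investment Grade").astype(int)
print(mask.tolist())  # [1, 0, 1, 0]
```

The parentheses matter: without them, .astype(int) would be applied to the string "Investment Grade" rather than to the result of the comparison.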
Now, if I want to add it to my dataframe, I then create a new series in that dataframe. So here I've written ratings_df, the name of my dataframe, and then in square brackets I've written the name of the new series I want to create, IG indicator, for investment grade indicator, and then I'm setting that equal to the Boolean mask. When I display ratings_df, you can see that the IG indicator feature has now been added as a series to my dataframe.
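Adding the indicator to the dataframe might look like the sketch below. The sample rows, the Company column, and the exact spelling of the new column name "IG Indicator" are assumptions for illustration; the video's own data may differ.

```python
import pandas as pd

# Hypothetical sample data standing in for the ratings_df used in the video.
ratings_df = pd.DataFrame({
    "Company": ["A", "B", "C"],
    "IG or HY": ["Investment Grade", "High Yield", "High Yield"],
})

# Integer mask: 1 for investment grade, 0 for high yield.
mask = (ratings_df["IG or HY"] == "Investment Grade").astype(int)

# Assigning to a column name that does not exist yet creates a new column.
ratings_df["IG Indicator"] = mask

print(ratings_df)
```

Because the mask was built from ratings_df itself, it shares the dataframe's index, so each 1 or 0 lines up with the row it was computed from.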