Replace and Sparse Classes
- 03:01
How to identify and correct typos in datasets using the replace function in Pandas.
Transcript
Using the unique function and the countplot function, by now you should have found four typos in the stock data dataset: sectors that appear to be duplicates that were just misspelled. If you haven't found them already, go back and look very carefully and see if you can find those four misspelled sectors. They should contain a very small number of observations because they're typos. When you find errors like that in your dataset, what do you do?
Pandas has a useful replace function, and this is how it works. First, I'm going to call the name of my dataframe, which in this case is example df. I'm going back to the football players and the dancers. Then I want to look at the occupation series, and I'm using dot notation. I could also use square brackets, but I'm choosing the dot because the column name is only a single word. Once I have the series, I'm calling the replace function. The syntax of the replace function is that my first argument contains the old name of whatever I'm trying to replace. So in this case, I'm replacing wide receiver. In the second argument, I write whatever I want to replace that name with. So in this case, I'm replacing wide receiver with the new name football player. And then this inplace argument just says that I want the replacement made in the existing series, in its place. Totally take it out and stick it in there in its place.
In this case, the wide receiver observation is not necessarily incorrect, but it's what we call a sparse class: a class with very few observations. Now this is a small dataset, so there's only one, but sparse classes are not very helpful for machine learning algorithms. Our machine learning algorithms like to see lots and lots of examples to learn from.
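The single-value replace described above can be sketched roughly as follows. The dataframe contents and the name `example_df` are made up for illustration and are not the lesson's actual data. This sketch also assigns the result back to the column instead of passing `inplace=True`, on the assumption that, depending on the pandas version, an in-place call on a selected column may warn or may not update the dataframe.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's example dataframe.
example_df = pd.DataFrame({
    "name": ["Ann", "Bo", "Cy", "Di", "Ed"],
    "occupation": [
        "football player",
        "football player",
        "wide receiver",   # the sparse class: only one observation
        "tango dancer",
        "ballet dancer",
    ],
})

# First argument: the old value. Second argument: the new value.
# Dot notation works here because the column name is a single word.
example_df["occupation"] = example_df.occupation.replace(
    "wide receiver", "football player"
)

# Count the observations per class after the replacement.
counts = example_df.occupation.value_counts()
print(counts)
```

After this runs, the wide receiver row is counted together with the football players.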
So what I'm doing is I'm converting that sparse class, which is wide receiver, and I'm just dumping that bucket into my bigger bucket of football players, which is still accurate, but it gives my machine learning algorithm more to learn from because now there are going to be more observations.
Now I'm doing the same thing to combine tango dancer with ballet dancer, and in this case, what I'm doing is I'm just changing both of those names to a broader dancer bucket. The way that I do that is I'm replacing two values, so my first argument, where my old value goes, is actually a list. This list of two values contains tango dancer and ballet dancer, both names that currently exist in the series. Then in my second argument, I'm telling Python to replace both of those with the name dancer, with inplace set to True, so they'll be replaced.
Underneath that, I'm going to run the countplot code again. I'm going to go up to the cell above, take that countplot code, and paste it below so that I can see exactly the same plot after those values are replaced. When I do that, you can see I've created two classes, each of which has more observations, so I don't have to worry about that sparse class problem. And you can use the exact same method to replace observations with typos.
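As a rough sketch of the list form of replace and of the typo fix mentioned at the end, the snippet below uses invented data. The sector names and misspellings are hypothetical examples, not the four typos from the stock dataset, and `value_counts` is used as a text stand-in for the countplot so the sketch needs only pandas. Passing a dictionary to replace is one way to fix several typos at once; the lesson itself only shows the two-argument form.

```python
import pandas as pd

# Hypothetical occupation data (not the lesson's actual dataframe).
example_df = pd.DataFrame({
    "occupation": [
        "football player", "football player", "football player",
        "tango dancer", "ballet dancer", "ballet dancer",
    ]
})

# A list as the first argument replaces every listed value with the
# single new value given as the second argument.
example_df["occupation"] = example_df["occupation"].replace(
    ["tango dancer", "ballet dancer"], "dancer"
)
counts_after = example_df["occupation"].value_counts()
print(counts_after)

# The same idea applied to typos, with made-up sector names.
sectors = pd.Series(["Technology", "Tecnology", "Energy", "Enrgy", "Energy"])

# A dict maps each misspelling to its corrected spelling.
sectors_clean = sectors.replace({"Tecnology": "Technology", "Enrgy": "Energy"})
print(sectors_clean.unique())
```

Checking the counts or unique values before and after the replacement is a quick way to confirm that the sparse or misspelled classes are gone.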