NaN Object
- 02:10
How to handle missing values in datasets using the NumPy package and Pandas functions, specifically focusing on identifying and filtering NaN (not a number) values to maintain data integrity.
Glossary
Machine Learning NaN NaN object NULL Values NumPy Python
Transcript
Okay, I'm assuming that you're all caught up on all of the exercises up to this point. You should see something that looks a little bit funny in the count row of your summary statistics. Something's a little off. Most of the data looks fine, but why are there not 100 observations with price, target price increase, and the return values?

The reason is that the NumPy package includes a special type of object, NaN, which stands for "not a number." It allows you to represent null or missing values without altering the data type in your NumPy array or your pandas DataFrame, since all of the objects must be the same type. Price target, for example, only has a count of 93, but you know there are 100 observations in the dataset, so that means seven observations must have a price target value of NaN, or a null value.

Your first instinct to solve this problem might be to create a Boolean mask for NaN, just like we created a Boolean mask for our outliers. You'll see that I've done just that in the cell right here, but when I execute that cell, I get a blank result. This approach doesn't work because NaN isn't equal to anything, not even NaN.

Instead, pandas has a special function called isnull, which checks for NaN values within a given series in a pandas DataFrame. To use the pandas isnull function, use the format that you see right here: first point to pandas, then the isnull function. Then write the name of your DataFrame, and you can use the square bracket method of pointing to your series, or, if your series name is a continuous string, you can use the dot method of pointing to your series.

Then, to actually filter your DataFrame, you use that isnull function as a Boolean mask. So right here I have my DataFrame name and, just as I would normally filter with a Boolean mask, I have my square brackets, and I use that isnull function on the series to filter the DataFrame based on those NaN values in that series.
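The cells shown on screen are not part of this transcript, so here is a minimal sketch of the same steps. It assumes a small made-up DataFrame: the names `df`, `ticker`, and `price_target` and all of the values are hypothetical, not taken from the course dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are made up for illustration.
df = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD"],
    "price_target": [12.5, np.nan, 8.0, np.nan],
})

# count() skips NaN, which is why the count row can come up short.
non_missing = df["price_target"].count()

# NaN is not equal to anything, not even NaN, so this mask is all False
# and filtering with it returns an empty DataFrame.
naive_mask = df["price_target"] == np.nan
naive_result = df[naive_mask]

# pd.isnull flags the NaN entries, so it works as a Boolean mask.
# Square bracket method of pointing to the series:
null_mask = pd.isnull(df["price_target"])
# Dot method, possible here because the column name is one continuous string:
same_mask = pd.isnull(df.price_target)

# Filter the DataFrame down to the rows where price_target is NaN.
missing_rows = df[null_mask]

print(non_missing)
print(len(naive_result))
print(missing_rows)
```

Both ways of pointing to the series build the same mask. The bracket form is the one that still works when a column name contains spaces.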