One Error Workout
- 01:18
Practice cleaning a dataset by removing observations with negative or zero total fees and verifying the correction through histograms of continuous variables.
Glossary
Machine Learning, Python
Transcript
Use a Boolean mask to define a new data frame that includes only observations with total fees exceeding 0. Then display histograms for all continuous variables to verify that all errors have been removed.
We're going to store the new, filtered data frame back under the name of our investor data data frame, so it's equal to the existing investor data data frame restricted to the rows where the total fees series is greater than 0. Then we're going to plot histograms for all of the continuous variables using the hist function.
And finally, we call the pyplot show function to display the histograms.
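The steps just described can be sketched roughly as follows. This is a minimal illustration, not the course's actual solution: the names `investor_data`, `total_fees`, and `portfolio_value`, and the sample values, are assumptions made up for the example, since the transcript never shows the real identifiers or data.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the course dataset; the column names and
# values are assumptions, not taken from the real data.
investor_data = pd.DataFrame({
    "total_fees": [120.5, 89.0, -45.0, 230.25, 0.0, 310.0],
    "portfolio_value": [10000.0, 7500.0, 8200.0, 21000.0, 5000.0, 30500.0],
})

# Boolean mask: True for rows whose total fees exceed 0.
mask = investor_data["total_fees"] > 0

# Keep only the rows where the mask is True, reusing the same name.
investor_data = investor_data[mask]

# Draw one histogram per numeric column, then display the figure.
investor_data.hist()
plt.show()
```

Reassigning to the same name mirrors what the transcript describes. Keeping the filtered result under a new name instead would preserve the original rows for later inspection.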
When you've finished, take a look at these histograms. Do you notice anything funny-looking about these distributions? There's actually one major error in the dataset that you need to correct. Look at the total fees feature. Does it make sense for a client to pay negative fees? No, but it looks like there are some observations with a total fees value of less than 0. Normally, you'd want to investigate this further to figure out the source of this anomaly, but in this case, the error was added artificially to the dataset. Once it's removed, your data will be clean.
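A histogram suggests the problem visually, and a quick numeric check can confirm it. The sketch below is one possible way to do that, again using a made-up `investor_data` frame and an assumed `total_fees` column name rather than the real course data.

```python
import pandas as pd

# Hypothetical sample; the column name and values are assumptions.
investor_data = pd.DataFrame(
    {"total_fees": [120.5, 89.0, -45.0, 230.25, 310.0]}
)

# Select the rows whose fees are zero or negative.
suspect_rows = investor_data[investor_data["total_fees"] <= 0]

# Report how many suspect rows there are and the smallest fee value.
print(len(suspect_rows))
print(investor_data["total_fees"].min())
```

Looking at the suspect rows themselves, not only their count, is usually the first step in working out where an anomaly came from.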