Splitting Data
How to prepare data for machine learning by defining data frames for target variables and input features, and using stratified random sampling to ensure training and testing sets are representative of the whole dataset.
Transcript
Finally, just like you did in your regression algorithm, you're going to define two new dataframes: one for your target variable, commit decline, and one for your input features, which is the whole dataframe after dropping commit decline. Copy this code into your own Jupyter notebook and your data will be ready to go.
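The code referenced here isn't shown on the page, so here is a minimal sketch of the step being described. The dataframe contents and the exact column name (`commit_decline` is assumed here) are placeholders; the real ones come from the course materials.

```python
import pandas as pd

# Hypothetical stand-in for the lesson's dataset
df = pd.DataFrame({
    "feature_a": [1, 2, 3, 4, 5],
    "feature_b": [10, 20, 30, 40, 50],
    "commit_decline": [0, 0, 0, 0, 1],  # assumed target column name
})

# Target variable: the commit/decline column
y = df["commit_decline"]

# Input features: the whole dataframe after dropping the target
X = df.drop(columns=["commit_decline"])
```

After this, `X` holds only the input features and `y` holds the target, which is the shape `train_test_split` expects.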
Only about 20% of the observations contain the positive class, decline. When you split your data into training and testing sets, it's really important for each set to contain approximately equal proportions of commit and decline. Each set, your training and your testing, should be representative of the data as a whole. The greater the imbalance between positive and negative classes, the more likely you are to accidentally create varying proportions in your training and testing data, which can lead to problems with your machine learning algorithm. You can prevent these problems with stratified random sampling.
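You can verify that imbalance yourself before splitting. A quick sketch, using a hypothetical target series with the roughly 20% positive class the lesson describes:

```python
import pandas as pd

# Hypothetical target: 80 "commit" (0) and 20 "decline" (1) observations
y = pd.Series([0] * 80 + [1] * 20, name="commit_decline")

# normalize=True returns proportions rather than raw counts
proportions = y.value_counts(normalize=True)
print(proportions)  # class 0 -> 0.8, class 1 -> 0.2
```

Running the same check on the training and testing targets after the split is an easy way to confirm stratification worked.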
Stratified random sampling, also called proportional random sampling, separates your data into classes, or strata, and randomly selects proportionate samples from each class. Scikit-learn's train_test_split function makes stratified random sampling really easy. In addition to the arguments you used for your regression algorithm, you'll add stratify. The value for this argument should be the feature with which you want to separate your data.
Follow along with the code that you see here to split your data with stratified random sampling. First, from Scikit-learn's model selection module, import the train_test_split function. Then create a new object called split_list and make it the output of the train_test_split function. The first argument is going to be your input features. The second is your target variable. The third is your test size, which we're setting to 20%, just like we did with your regression algorithm. And then we're setting the random state equal to one so that your results are the same as you see in this video. All of that is the same as in your regression algorithm from the last project, but this time we're adding the stratify argument, where you're telling the train_test_split function to stratify this data using your target variable, commit decline. Since this function stratifies your data based on the values in commit decline before splitting, you can be certain that the proportions of commit and decline in both your training data and testing data are representative of the dataset as a whole. Remember, train_test_split returns a list containing four objects, in order: first your training inputs, then your testing inputs, then your training target values, and finally your testing target values.
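Since the on-screen code isn't reproduced here, the walkthrough above can be sketched as follows. The feature values and the target column name `commit_decline` are assumptions; only the call pattern (test_size, random_state, stratify, and the four-object return order) comes from the transcript.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data; your real X and y were defined earlier in the lesson
X = pd.DataFrame({"feature_a": range(100), "feature_b": range(100, 200)})
y = pd.Series([0] * 80 + [1] * 20, name="commit_decline")

# Split 80/20, stratifying on the target so both sets keep its class proportions
split_list = train_test_split(
    X, y,
    test_size=0.2,    # hold out 20% for testing
    random_state=1,   # reproduce the video's results
    stratify=y,       # stratify on the target variable
)

# train_test_split returns four objects, in order:
X_train, X_test, y_train, y_test = split_list

# Both sets preserve the ~20% positive-class proportion
print(y_train.mean(), y_test.mean())
```

With 80 negative and 20 positive observations, the test set receives exactly 16 negatives and 4 positives, so both sets keep the 20% decline rate.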