Splitting Your Data
- 01:57
Creating and evaluating several regression models, including Lasso, Ridge, Elastic Net, Random Forest, and Gradient Boosting.
Transcript
Later in the lesson, you're going to create several competing versions of these five regression models: lasso, ridge, elastic net, random forest, and gradient boosting. Your algorithm will evaluate the performance of each competing model, and then it will choose the top performer to make a prediction for your client. The top performer will be the model with the highest R-squared. However, it would not be valid to judge the competing models based on their performance on the training data. As you saw in the prior lesson, it's possible for a highly overfit model to perform well on the training data but poorly on new, unseen data. If you judge the competing models solely on their performance on the training data, there is no way for you to know whether a model is overfit. To guard against overfitting, you're going to split your known data into a training dataset and a testing dataset. By holding out a portion of the data where you know the actual target variable values, you can evaluate model performance after training.
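The selection logic described above can be sketched in a few lines. This is a minimal illustration, not the lesson's actual code: the synthetic data, the default hyperparameters, and the 30% test split are all assumptions made for the sake of a self-contained example.

```python
# Sketch: train five competing regressors, score each on held-out
# test data, and pick the one with the highest R-squared.
# Synthetic data is used here purely for illustration.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import Lasso, Ridge, ElasticNet
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

models = {
    "lasso": Lasso(),
    "ridge": Ridge(),
    "elastic_net": ElasticNet(),
    "random_forest": RandomForestRegressor(random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

# Score each model on the *test* set, never the training set,
# so an overfit model can't win on memorized data.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}

best_name = max(scores, key=scores.get)
print(best_name, round(scores[best_name], 3))
```

The key point is that `.score()` is called on the held-out test data, which is exactly why the train/test split in the rest of this lesson matters.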
The Scikit-Learn package contains a function specifically designed for splitting your data: train_test_split. In your Jupyter Notebook, execute the code that you see right here to import the train_test_split function from the model_selection module within Scikit-Learn. So go ahead and copy the code that you see here and put it underneath where you're importing your other packages in your code. The train_test_split function will accept as arguments, first, a series containing your target variable, and second, a data frame containing your input features. Before going any further, let's separate the dataset into those two objects.
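The two steps just described, importing train_test_split and separating the dataset into a target series and a feature data frame, might look like the sketch below. The column names ("sqft", "beds", "price") are illustrative assumptions; the lesson's actual dataset is not shown on this page.

```python
# Sketch: separate a DataFrame into a target Series and a feature
# DataFrame, then split both into training and testing sets.
# The columns here are hypothetical stand-ins for the lesson's data.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "sqft":  [1000, 1500, 1200, 1800, 900, 2100],
    "beds":  [2, 3, 2, 4, 1, 4],
    "price": [200_000, 310_000, 250_000, 400_000, 150_000, 450_000],
})

y = df["price"]                # target variable: a Series
X = df.drop(columns="price")   # input features: a DataFrame

# train_test_split returns splits in the same order the arrays are
# passed; here the target comes first, matching the transcript.
y_train, y_test, X_train, X_test = train_test_split(
    y, X, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```

Note that train_test_split accepts any number of arrays and simply splits each one the same way, so passing the target first is a convention of this lesson rather than a requirement of the function.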