Regularization
How regularization works in machine learning.
Transcript
Now let's talk about how you prevent overfitting with a tool called regularization. Because machine learning models tend to overfit, an important part of your design process is preventing overfitting. One cause of overfitting is poor feature selection.

To explain, consider this example where weight in pounds is your target variable and you're predicting based on someone's height. It's clear that there's a positive relationship between height and weight. Based on this data, you can make a reasonable prediction of a person's weight just by knowing their height. There's an obvious, strong, real relationship. However, height is not the only input feature that could explain weight. What other real relationships could explain weight variation between observations? In addition to knowing a person's height, if you also knew, say, their occupation and their sex, you could further increase the accuracy of your weight prediction because you're identifying other real relationships between other features and your target variable. So a limited number of additional features can add useful complexity to your model, so that the complexity of your model matches the complexity of the real relationships you're trying to represent, leading to a good fit. On the other hand, too many features, especially features that don't describe a significant relationship, can create excess complexity that leads to overfitting and bad predictions.

Look at the data that you have right here. What does someone's hometown, their hour of birth, and the number of hairs on their head have to do with their weight? Nothing. However, if you don't place limits on your machine learning models, they will seek out relationships where relationships don't really exist. They're going to look at these random pieces of data and try to match them to the random fluctuations in your target variable. For example, a model could take the data above and make predictions on the basis of someone's hour of birth. We did it right here. Look at it. So what happens when this model starts to evaluate new observations? Well, if they were born at 8:00 AM, then they must weigh 175 pounds, or if they were born at 8:00 PM, they must weigh 262 pounds. If you apply this model to the training data, you actually have 100% accuracy. But how accurate do you think the new predictions will be? Probably no different than a random guess, because the model is making predictions on the basis of a relationship that doesn't actually exist. That is overfitting.

So what is regularization and why does it matter? Regularization is a form of automatic feature selection that dampens the effect of insignificant features, thus reducing overfitting. To understand regularization, you first need to understand the idea of a cost function, which is the way that you choose to quantify your model's error. Machine learning algorithms are iterative, meaning that they begin with a poor model and cycle through a process to optimize it. During this process, in order to determine whether a change to the model is an improvement, the machine learning algorithm requires some measure of success. Since you're using the model to make predictions, the most successful model will make the most accurate predictions. That is to say, the best model will make predictions with the least error. The way that you measure error is your cost function, and the objective of any machine learning algorithm is to minimize its cost function.
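Here's a minimal sketch of that overfitting idea. The height, hour-of-birth, and weight numbers are invented for illustration, and an unconstrained decision tree stands in for the lesson's lookup-style model, since any model allowed to memorize its training data shows the same pattern:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Made-up data: height really does drive weight, hour of birth is pure noise.
n = 200
height = rng.normal(68, 3, n)                              # inches
hour_of_birth = rng.integers(0, 24, n)                     # irrelevant feature
weight = 11.3 * height - 635.8 + rng.normal(0, 10, n)      # pounds

X = np.column_stack([height, hour_of_birth])
X_train, X_test = X[:150], X[150:]
y_train, y_test = weight[:150], weight[150:]

# No limits on complexity, so the model is free to memorize the training set.
tree = DecisionTreeRegressor()
tree.fit(X_train, y_train)

print("Training R^2:", tree.score(X_train, y_train))   # near-perfect: memorized
print("Test R^2:    ", tree.score(X_test, y_test))      # noticeably worse on new data
```

The gap between the training score and the test score is the signature of overfitting: the model matched random fluctuations that don't carry over to new observations.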
For example, the linear regression that you see right here chooses 11.3 as the coefficient and negative 635.8 as the intercept because these values minimized its cost function, which is to say they minimized its error. As a tip, the cost function can also be called the loss function or the error function, so if you see either of those phrases, it's indicating the same idea.

The chart you see here visualizes the error of the height versus weight regression on the prior page. The most commonly used cost function is called the sum of squared errors. To calculate the sum of squared errors, you just measure the error of each prediction, like you see in these red lines. Then you square those error values, and finally you sum all of the squared values. The result of this cost function is that extreme errors are more heavily penalized because their values are squared.

So how does regularization come into play, and what does it have to do with the cost function? The basic cost function penalizes error, and the objective of the machine learning model is to minimize that penalty. Regularization adds a new penalty to the cost function that penalizes model coefficients. The larger the coefficient, which is to say the more influence a feature has on the model, the larger the penalty. Now, in order to minimize the cost function, your machine learning model must balance two competing objectives: minimizing the error on one hand and minimizing coefficients on the other. This means that the algorithm is incentivized to create a model that maximizes accuracy to minimize your error, while also minimizing complexity by minimizing the coefficients.

So let's look back at this example. A non-regularized, unconstrained algorithm would find a way to incorporate useless variables like hometown, hour of birth, and hairs on somebody's head. A regularized algorithm would either drastically reduce the coefficients of those features, thus reducing their influence on the final model, or it would drive their coefficients to zero and eliminate them from the model altogether. So you can imagine how useful this will be in preventing overfitting.

When you're designing your algorithms, you have the ability to tune the strength of the regularization penalty factor. A small penalty factor means that the model will prioritize error minimization, and a large penalty factor means that the model will prioritize complexity minimization. So you're on a seesaw, finding the right balance between accuracy and avoiding overfitting.

There are two types of regularization penalties. L1 regularization penalizes the absolute size of model coefficients, and L2 regularization penalizes the squared size of model coefficients. As a tip, a regularized algorithm with an L1 penalty will result in a smaller number of larger coefficients, meaning a smaller number of highly influential features. On the other hand, a regularized algorithm with an L2 penalty will result in a larger number of smaller coefficients, meaning a larger number of less influential features. Rather than randomly choosing L1 or L2 regularization and then guessing the optimal strength of the chosen penalty factor, later in the course you're going to learn how to create alternative models that compete against each other to find the model with the best performance through trial and error. You're going to use three types of regularized linear regression algorithms in this course. First, the lasso regressor, which is regularized with the L1 penalty factor.
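To make those cost functions concrete, here is a minimal sketch of the sum of squared errors and a regularized version that also penalizes coefficient size. The function names and the penalty-strength parameter alpha are illustrative choices for this sketch, not the lesson's notation:

```python
import numpy as np

def sum_of_squared_errors(y_true, y_pred):
    # Measure each prediction's error, square it, and sum the squares.
    errors = y_true - y_pred
    return np.sum(errors ** 2)

def regularized_cost(y_true, y_pred, coefficients, alpha, penalty="l1"):
    # Error penalty plus a complexity penalty on the coefficients.
    sse = sum_of_squared_errors(y_true, y_pred)
    if penalty == "l1":
        complexity = np.sum(np.abs(coefficients))   # absolute size (lasso-style)
    else:
        complexity = np.sum(coefficients ** 2)      # squared size (ridge-style)
    return sse + alpha * complexity                 # alpha tunes penalty strength
```

With alpha set to zero this reduces to the plain sum of squared errors; raising alpha shifts the balance away from pure error minimization and toward smaller coefficients.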
Then the ridge regressor, which is regularized with the L2 penalty factor. And finally, the elastic net, which is regularized with a blend of both penalty factors. We're going to look at how you find the optimal blend using the elastic net.
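As a preview, here is a minimal sketch of those three regressors using scikit-learn, fit to made-up data where only the first feature actually matters. The alpha and l1_ratio values are arbitrary illustrations, not recommended settings:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge, ElasticNet

rng = np.random.default_rng(1)

# Made-up data: column 0 is the only real predictor, columns 1-3 are noise.
n = 300
X = rng.normal(size=(n, 4))
y = 5.0 * X[:, 0] + rng.normal(0, 1, n)

models = {
    "Lasso (L1)":         Lasso(alpha=0.5),
    "Ridge (L2)":         Ridge(alpha=0.5),
    "ElasticNet (blend)": ElasticNet(alpha=0.5, l1_ratio=0.5),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, np.round(model.coef_, 3))
```

Running this, you would expect the lasso to drive the noise coefficients to exactly zero, the ridge to shrink them toward zero without eliminating them, and the elastic net to land somewhere in between, with l1_ratio controlling how the two penalties are blended.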