Binary Classification
Understand binary classification in Python and its application in distinguishing between two categories. Understand the difference between regression and classification problems.
Transcript
We're going to start with binary classification. What is that? Well, as you know by now, in a regression problem, the target variable you're predicting is continuous or numeric, such as the amount of liquidity a company should maintain or the percentage return of an investment. In contrast, in a classification problem, the target variable you're trying to predict is categorical, which means that you're predicting a class rather than a number. One common type of classification, called binary classification, sorts observations into two categories: a positive class and a negative class. For example, you could use binary classification for labeling credit card transactions as (a) fraudulent or (b) legitimate; distinguishing attractive investment opportunities from less promising alternatives; flagging investors who are likely to decline to participate in a debt offering; or sorting out investment banking clients who are likely to make a large acquisition, or who are likely targets of shareholder activism. You can imagine that being able to make these types of predictions and estimations would be very powerful in practice.
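As a quick sketch of that setup in Python, here is how the fraud example's categorical target could be encoded, using the 1-for-positive, 0-for-negative convention described next. The rows and column names here are synthetic and purely illustrative, not data from the lesson:

```python
import pandas as pd

# Hypothetical transaction records (synthetic, for illustration only)
transactions = pd.DataFrame({
    "amount": [25.00, 9800.00, 42.50, 310.75],
    "status": ["legitimate", "fraudulent", "legitimate", "fraudulent"],
})

# Encode the categorical target: positive class (fraudulent) -> 1,
# negative class (legitimate) -> 0
transactions["is_fraud"] = (transactions["status"] == "fraudulent").astype(int)
print(transactions)
```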
In each case, the positive class is labeled with a 1 (like a fraudulent credit card transaction) and the negative class is labeled with a 0 (like a legitimate credit card transaction), and the classification algorithm calculates the probability that any given observation belongs in one class or the other.

Let's take a look at this simplified example. The graph that you see here illustrates investor responses to initial public offerings based on each IPO's projected return. A 1, which is the positive class, indicates that an investor decided to participate in the IPO, and a 0, which is the negative class, indicates that an investor declined to participate. If an IPO's projected return is below 10%, most investors decline, and if the projected return exceeds 20%, almost every investor participates. Between those values, there are some mixed decisions based on each investor's preferences. As a tip, for simplicity this example will make predictions based on a single input feature: the IPO's projected return. In a real-world application, you would collect data on many other features for more accurate predictions between those 10% and 20% thresholds. Based solely on the IPO's projected return, you would want your model to predict decline with high confidence below 10%, predict participation with high confidence above 20%, and make lower-confidence predictions between those two values.

Take a look at the line cutting through this graph and ask yourself: how well will a linear regression work for this problem? It looks like the answer is not very well. First, the linear regression is predicting negative values when the projected return falls below 5%, which doesn't make much sense in the context of classification. Second, confidence should be very high below 10% and above 20%, meaning that the prediction should be very close to either 0 or 1, but a linear regression is not flexible enough to model that relationship accurately.
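To see that failure concretely, here is a minimal sketch that fits a linear regression to synthetic data mimicking the pattern in the graph (decline below 10%, participate above 20%, mixed in between). The data, feature ranges, and variable names are assumptions for illustration, not the lesson's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic IPO projected returns (%): the single input feature
projected_return = rng.uniform(0, 30, size=200)

# Synthetic decisions: 0 (decline) below 10%, 1 (participate) above 20%,
# and a random mix between those thresholds
participated = np.where(
    projected_return < 10, 0,
    np.where(projected_return > 20, 1, rng.integers(0, 2, size=200)),
)

X = projected_return.reshape(-1, 1)
model = LinearRegression().fit(X, participated)

# Probe predictions at low, middling, and high projected returns
probe = np.array([[2.0], [15.0], [28.0]])
print(model.predict(probe))
```

Running a sketch like this, you would typically see a prediction below 0 for the lowest projected return and a value drifting above 1 for the highest: the straight line neither stays inside the 0-to-1 range nor saturates near the extremes the way a confident classifier should.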