Exploring Relationships
- 05:03
Exploring relationships and data visualization in Python.
Glossary
Machine Learning, Python, Relationships
Transcript
In this example, we grouped data by the categorical variable occupation and then applied the mean function to each group. We could do the same thing with our investor data, like you see right here, but if you look at the output, it doesn't appear to highlight any really useful differences between the banks. But here's an interesting question. We're trying to predict if these investors will commit to a transaction or decline the transaction. So what if we asked, do banks decline the same proportion of transactions, or are some banks more selective than others? Could we take the countplot that you see right here, split the bars into commits and declines, and see the difference in those proportions by investor?
To answer that question, we would need to complete four steps. First, group the data by investor. Second, isolate the commit series, which is our target variable. Third, count each investor's commit values and decline values. And then finally, we would want to plot these values in a bar chart. You already know how to use the groupby function to group data by class, and you already know how to isolate a series in a dataframe using the dot format or the square bracket format.
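As a rough sketch of the two things you already know, here is a groupby followed by a mean, and then both ways of isolating a series. The dataframe name investor_data, the column names investor, commit, and amount, and the handful of rows are all assumptions made up for illustration, not the course dataset.

```python
import pandas as pd

# Toy rows invented for illustration; the real investor data is not shown here.
# The column names (investor, commit, amount) are assumptions.
investor_data = pd.DataFrame({
    "investor": ["Bank A", "Bank A", "Bank B", "Bank B", "Bank B"],
    "commit": ["Commit", "Decline", "Commit", "Commit", "Decline"],
    "amount": [10.0, 20.0, 30.0, 40.0, 50.0],
})

# Group by a categorical variable, then apply the mean to a numeric column.
mean_amount = investor_data.groupby("investor")["amount"].mean()
print(mean_amount)

# Two equivalent ways to isolate a series in a dataframe.
commit_dot = investor_data.commit          # dot format
commit_bracket = investor_data["commit"]   # square bracket format
print(commit_dot.equals(commit_bracket))
```

The dot format only works when the column name is a valid Python identifier and doesn't clash with an existing dataframe attribute, so the square bracket format is the safer default.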
In addition, Pandas provides the value_counts function that you see in the bottom cell here, which counts the values in each class of a series with a categorical variable. In this case, we have the commit series, our target variable, and it has two classes: commit and decline. So when we point Python to our investor data dataframe, then to the commit series inside that dataframe, and then call the value_counts function, it returns a count of both classes in that series. We see that we have 5,700 commits and 1,500 declines.
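Here is a minimal sketch of that value_counts call on a tiny made-up dataframe. The names investor_data and commit, and the class labels Commit and Decline, are assumptions standing in for whatever the course notebook actually uses.

```python
import pandas as pd

# Toy rows invented for illustration; names and labels are assumptions.
investor_data = pd.DataFrame({
    "commit": ["Commit", "Commit", "Decline", "Commit", "Decline"],
})

# Isolate the commit series, then count the observations in each class.
counts = investor_data["commit"].value_counts()
print(counts)
```

By default, value_counts sorts the classes from most frequent to least frequent.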
You can combine these groupby and value_counts steps into one cell that groups data by investor and then counts the values in each class of the commit feature. Look at what we're doing right here. First, we're calling the investor data dataframe. Second, we're segmenting the data in that dataframe by the classes in the investor series.
Third, we're isolating the commit series.
And finally, we're applying the value_counts function to tally the number of observations in each class of the commit series. If you need to, pause the video and take a look at this code to make sure that it all makes sense. What I wanna show you now is that the output of this piece of code is actually a series, even though it's grouped by investor. I can show you by copying and pasting that code into the type function, and the output tells me that this piece of code returns a series. What that means is that any functionality that you can perform on a series, you can also perform on this grouped value count. So all of the data visualization functions that you would normally apply to a Pandas series or a Pandas dataframe, you can also apply after you group the data by classes and count the values of those classes. This is a super useful tool when you're getting to know your data and want to visualize the relationships inside it. And it's going to help us answer questions like, do some investors commit to a larger proportion of transactions than others?
To visualize this, I can copy and paste this code again into a new cell and then add the plot function after it. Inside that plot function, I'm gonna pass the kind argument and give it barh, for a horizontal bar chart. You can actually plot several different kinds of graphs with the plot function by changing the kind argument. There's a link below this video if you want to check out the full list of options that you can pass into that plot function, and that'll give you a lot of useful tools in your data visualization toolbox.
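The combined groupby and value_counts step, along with the type check, might look something like this. Again, investor_data, investor, and commit are assumed names, and the rows are invented for illustration.

```python
import pandas as pd

# Toy rows invented for illustration; names and labels are assumptions.
investor_data = pd.DataFrame({
    "investor": ["Bank A", "Bank A", "Bank B", "Bank B", "Bank B"],
    "commit": ["Commit", "Decline", "Commit", "Commit", "Decline"],
})

# Group by investor, isolate the commit series, count each class per group.
grouped_counts = investor_data.groupby("investor")["commit"].value_counts()
print(grouped_counts)

# The result is a pandas Series, indexed by (investor, commit) pairs.
print(type(grouped_counts))
```

Because the result is a series, you can look up a single count with a pair of labels, for example grouped_counts[("Bank B", "Commit")].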
I'm gonna add the pyplot show function, and I've actually forgotten to include the dot right there. That dot is important because it separates the value_counts call from the plot function. So now when I execute the cell, I have this really nice horizontal bar chart. Take a look: what relationships do you notice in this bar plot? There seem to be some personality differences between the financial institutions in the proportion of transactions they decline. If you were inviting banks to participate in a transaction, do you think this information could influence your decisions?
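Putting the whole chain together, a sketch of the plotting cell could look like the following. As before, the dataframe and column names are assumptions and the rows are made up, so the chart here won't match the one in the video.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy rows invented for illustration; names and labels are assumptions.
investor_data = pd.DataFrame({
    "investor": ["Bank A", "Bank A", "Bank B", "Bank B", "Bank B"],
    "commit": ["Commit", "Decline", "Commit", "Commit", "Decline"],
})

# Group, isolate, count, then plot. Note the dot between value_counts() and plot.
ax = (
    investor_data.groupby("investor")["commit"]
    .value_counts()
    .plot(kind="barh")  # one horizontal bar per (investor, commit) pair
)
plt.show()
```

If you want to compare proportions rather than raw counts, one option is to pass normalize=True to value_counts, which returns each class's share within its investor group instead of the count.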