Removing Duplicates
- 04:20
Using profiling to help identify duplicates or errors.
Transcript
It's really important that we check for duplicates in our data set.
Most tables will have one column which contains some sort of unique information.
Each row contains a different value for that column, for example transaction IDs, ISIN numbers, client references, or company codes.
That isn't always the case. Sometimes we don't have a single column like that to use, but there's still duplication across the entire row.
Let's have a look at some examples. So if we have duplication in a single column, it would be the likes of the ISIN number here where no two rows should have the same value. We would remove the duplicates using that one single column.
Sometimes it's difficult to spot or identify duplicates, so Power BI has a fantastic tool called column distribution. What that allows us to do is check how many distinct values we have against how many unique values. The distinct values in a column are the different values that occur, so 196 different ISIN numbers.
The unique values are how many of those values are actually unique, so I have 196 distinct or different values but only 193 of them are unique. So there must be three values in there which have been used more than once and therefore I have duplicates.
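To make the distinct versus unique comparison concrete, here is a minimal Python sketch of the same counting. It is not how Power BI implements column distribution; the function name `column_profile` and the sample ISIN values are made up for illustration.

```python
from collections import Counter

def column_profile(values):
    """Return (distinct, unique) counts for a column of values."""
    counts = Counter(values)
    # Distinct: how many different values occur at all.
    distinct = len(counts)
    # Unique: how many of those values occur exactly once.
    unique = sum(1 for n in counts.values() if n == 1)
    return distinct, unique

# Hypothetical ISIN column in which "ISIN-B" has been used twice.
isins = ["ISIN-A", "ISIN-B", "ISIN-B", "ISIN-C"]
distinct, unique = column_profile(isins)

# The gap between the two counts is the number of values used more than once.
duplicated_values = distinct - unique
print(distinct, unique, duplicated_values)  # 3 2 1
```

With the figures from the lesson, 196 distinct and 193 unique, the same subtraction gives the three repeated values.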
Sometimes it's not so easy to spot. In this example, we don't have a single column that we can use to identify duplicates. The company name, for example, can be repeated. It's okay to have Merrill Lynch more than once, and it's okay to have the same nominal value, the same market price or the same currency code more than once. But if they were all duplicated across the entire row, then we would want to remove those rows.
Let's do a workout in Power BI to see how to remove duplicates.
So let's use Get Data then to connect to an Excel workbook, module 3 lesson 6 work out.
I have two worksheets in here. I'm just going to select both of them and go through to transform data and into query editor.
I'm going to make sure I have selected the duplicate single column table first.
And I can see here the ISIN number should contain unique values. If I want to check first of all whether there are any duplicates, I can use the View tab and click to turn on column distribution. Right away I can see there are 196 different values and only 193 of them are unique, so I know I have duplicates there. I'm going to remove duplicates just down here.
And I can see now all of my values are unique.
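The single-column step can be sketched outside Power BI as well. The Python below is only an illustration: the helper `remove_duplicates_by_key`, the column names and the sample rows are hypothetical, and keeping the first row seen for each key is an assumption of this sketch.

```python
def remove_duplicates_by_key(rows, key):
    """Keep the first row seen for each value of `key`, preserving row order."""
    seen = set()
    kept = []
    for row in rows:
        if row[key] not in seen:
            seen.add(row[key])
            kept.append(row)
    return kept

# Hypothetical table in which the ISIN column should be unique.
rows = [
    {"ISIN": "ISIN-A", "Company": "Company A"},
    {"ISIN": "ISIN-B", "Company": "Company B"},
    {"ISIN": "ISIN-B", "Company": "Company B"},  # repeated ISIN, dropped
]

deduped = remove_duplicates_by_key(rows, "ISIN")
print([row["ISIN"] for row in deduped])  # ['ISIN-A', 'ISIN-B']
```

After this step every ISIN appears once, so the distinct and unique counts for that column would match.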
If I go to the second table, multiple columns, this is slightly trickier. If I look at, for example, rows 2, 3 and 4, I can see Merrill Lynch has been duplicated, but it's okay in this example to have duplication there. If I look further across the row, I can see that some of the values are duplicated right across and some are different. Rows two and three have exactly the same value in every single column.
Row four is slightly different, so I would consider rows two and three duplicates, but row four I would want to leave. Now, to remove duplicates across all of the columns, I need to select all of the columns. I'm going to start with company name, scroll across to the last column, hold the Shift key and select, so that gets them all selected.
And then I can just right-click and choose Remove Duplicates.
So if I just double check, I can see I've got Merrill Lynch with two rows of data, both unique.
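The multiple-column case, where a row is a duplicate only if every column matches, can be sketched the same way. This Python is an illustration rather than Power BI's own logic: the helper `remove_duplicate_rows` is hypothetical, and the nominal values, prices and currency codes are invented sample data.

```python
def remove_duplicate_rows(rows):
    """Drop rows that repeat an earlier row in every column, preserving order."""
    seen = set()
    kept = []
    for row in rows:
        fingerprint = tuple(row)  # the whole row acts as the comparison key
        if fingerprint not in seen:
            seen.add(fingerprint)
            kept.append(row)
    return kept

# Hypothetical columns: company name, nominal value, market price, currency code.
rows = [
    ("Merrill Lynch", 1000, 101.5, "USD"),
    ("Merrill Lynch", 1000, 101.5, "USD"),  # identical in every column, dropped
    ("Merrill Lynch", 2000, 101.5, "USD"),  # differs in one column, kept
]

deduped = remove_duplicate_rows(rows)
print(len(deduped))  # 2
```

The repeated company name alone does not make a row a duplicate. Only the exact copy is removed, which leaves two Merrill Lynch rows.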