In this notebook, I show how to remove exact duplicate rows and merge near-duplicate rows into one in pandas. I ran into this problem recently while cleaning a dataset and decided the simplest fix was to combine the similar rows into a single row. The notebook below walks through how to do that.
import pandas as pd
Pull in some sample data I generated with ChatGPT; it contains some exact duplicates and some near-duplicates.
df1 = pd.read_csv('citizens.csv')
df1.head(15)
Name | Age | Country | |
---|---|---|---|
0 | John Smith | 25 | USA |
1 | Jane Doe | 30 | Canada |
2 | Mark Johnson | 40 | USA |
3 | John Smith | 25 | USA |
4 | Michael Brown | 35 | UK |
5 | Emily Wilson | 28 | Australia |
6 | Jane Doe | 30 | Canada |
7 | Robert Anderson | 45 | USA |
8 | Emily Wilson | 28 | New Zealand |
9 | Michael Brown | 35 | UK |
10 | John Smith | 25 | USA |
11 | Robert Anderson | 45 | USA |
12 | Jessica Lee | 32 | Canada |
13 | Mark Johnson | 40 | USA |
14 | Emily Wilson | 28 | New Zealand |
Drop the exact duplicates with the drop_duplicates function. With those removed, we are left with a much smaller dataframe. Notice that Emily Wilson still appears twice, with two different countries; we want to combine both of her entries into a single row.
df1 = df1.drop_duplicates()
df1
Name | Age | Country | |
---|---|---|---|
0 | John Smith | 25 | USA |
1 | Jane Doe | 30 | Canada |
2 | Mark Johnson | 40 | USA |
4 | Michael Brown | 35 | UK |
5 | Emily Wilson | 28 | Australia |
7 | Robert Anderson | 45 | USA |
8 | Emily Wilson | 28 | New Zealand |
12 | Jessica Lee | 32 | Canada |
15 | David Miller | 31 | USA |
16 | Sarah Thompson | 33 | Australia |
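As an aside, drop_duplicates takes a couple of useful parameters: subset restricts the comparison to certain columns, and keep controls which occurrence survives. A minimal sketch, using a small inline frame rather than the CSV from this post:

```python
import pandas as pd

# Inline stand-in for the sample data (the CSV itself isn't included here).
df = pd.DataFrame({
    "Name": ["John Smith", "John Smith", "Emily Wilson"],
    "Age": [25, 25, 28],
    "Country": ["USA", "USA", "Australia"],
})

# By default, drop_duplicates compares all columns and keeps the first occurrence.
deduped = df.drop_duplicates()

# subset= compares only the listed columns; keep="last" retains the final
# occurrence of each duplicate group instead of the first.
by_name = df.drop_duplicates(subset=["Name"], keep="last")
```

Here both calls reduce the frame to two rows, but by_name keeps the second John Smith row rather than the first.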
We combine the relevant rows by grouping on Name and Age and joining the countries with a comma. We call reset_index because groupby moves the grouping columns into the index.
df1.groupby(['Name', 'Age']).agg({'Country': ', '.join}).reset_index()
Name | Age | Country | |
---|---|---|---|
0 | David Miller | 31 | USA |
1 | Emily Wilson | 28 | Australia, New Zealand |
2 | Jane Doe | 30 | Canada |
3 | Jessica Lee | 32 | Canada |
4 | John Smith | 25 | USA |
5 | Mark Johnson | 40 | USA |
6 | Michael Brown | 35 | UK |
7 | Robert Anderson | 45 | USA |
8 | Sarah Thompson | 33 | Australia |
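The same merge can be written with pandas' named-aggregation syntax, which makes the output column names explicit and avoids the reset_index call via as_index=False. A sketch, again with a small inline frame standing in for the deduplicated data:

```python
import pandas as pd

# Inline stand-in for the deduplicated frame from the post.
df = pd.DataFrame({
    "Name": ["Emily Wilson", "Emily Wilson", "Jane Doe"],
    "Age": [28, 28, 30],
    "Country": ["Australia", "New Zealand", "Canada"],
})

# Named aggregation: each output column is a (source column, function) pair.
# as_index=False keeps Name and Age as regular columns.
merged = (
    df.groupby(["Name", "Age"], as_index=False)
      .agg(Country=("Country", ", ".join))
)
```

This produces one row per (Name, Age) pair, with Emily Wilson's two countries joined as "Australia, New Zealand".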
Pull in a second example containing rows where, for the same name, both the age and the country may differ.
df2 = pd.read_csv('citizens_improved.csv')
df2.head(15)
Name | Age | Country | |
---|---|---|---|
0 | John Smith | 25 | USA |
1 | Jane Doe | 30 | Canada |
2 | Mark Johnson | 40 | USA |
3 | John Smith | 25 | USA |
4 | Michael Brown | 35 | UK |
5 | Emily Wilson | 28 | Australia |
6 | Jane Doe | 30 | Canada |
7 | Robert Anderson | 45 | USA |
8 | Emily Wilson | 28 | New Zealand |
9 | Michael Brown | 35 | UK |
10 | John Smith | 25 | USA |
11 | Robert Anderson | 45 | USA |
12 | Jessica Lee | 32 | Canada |
13 | Mark Johnson | 40 | USA |
14 | Emily Wilson | 28 | New Zealand |
df2 = df2.drop_duplicates()
df2
Name | Age | Country | |
---|---|---|---|
0 | John Smith | 25 | USA |
1 | Jane Doe | 30 | Canada |
2 | Mark Johnson | 40 | USA |
4 | Michael Brown | 35 | UK |
5 | Emily Wilson | 28 | Australia |
7 | Robert Anderson | 45 | USA |
8 | Emily Wilson | 28 | New Zealand |
12 | Jessica Lee | 32 | Canada |
15 | David Miller | 31 | USA |
16 | Sarah Thompson | 33 | Australia |
85 | John Smith | 30 | USA |
86 | Mark Johnson | 35 | USA |
87 | Emily Wilson | 40 | New Zealand |
88 | Sarah Thompson | 28 | Australia |
89 | Michael Brown | 32 | UK |
90 | David Miller | 45 | USA |
91 | Jessica Lee | 28 | Canada |
92 | Robert Anderson | 33 | USA |
93 | Emily Wilson | 32 | Australia |
Here we define a function that joins only the unique countries with a comma; without the deduplication, repeated country names would appear in the output. The ages are averaged.
def join_unique(items):
    # sorted() makes the output order deterministic; a bare set has no guaranteed order
    return ', '.join(sorted(set(items)))
df2.groupby(['Name']).agg({'Age': 'mean', 'Country': join_unique}).reset_index()
Name | Age | Country | |
---|---|---|---|
0 | David Miller | 38.0 | USA |
1 | Emily Wilson | 32.0 | Australia, New Zealand |
2 | Jane Doe | 30.0 | Canada |
3 | Jessica Lee | 30.0 | Canada |
4 | John Smith | 27.5 | USA |
5 | Mark Johnson | 37.5 | USA |
6 | Michael Brown | 33.5 | UK |
7 | Robert Anderson | 39.0 | USA |
8 | Sarah Thompson | 30.5 | Australia |
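The two steps above can be wrapped into a small reusable helper; the function name and signature here are my own, not from any library. A sketch, assuming Age and Country columns like the examples in this post:

```python
import pandas as pd

def merge_near_duplicates(df, key="Name"):
    """Drop exact duplicate rows, then collapse rows sharing `key`:
    Age is averaged and the unique Country values are comma-joined."""
    def join_unique(items):
        # sorted() keeps the joined string deterministic
        return ", ".join(sorted(set(items)))
    return (
        df.drop_duplicates()
          .groupby(key, as_index=False)
          .agg({"Age": "mean", "Country": join_unique})
    )

# Inline stand-in for citizens_improved.csv.
df = pd.DataFrame({
    "Name": ["John Smith", "John Smith", "John Smith"],
    "Age": [25, 25, 30],
    "Country": ["USA", "USA", "USA"],
})
result = merge_near_duplicates(df)
```

After the call, each name appears exactly once: the two distinct John Smith rows collapse to a single row with the averaged age 27.5.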