Python ProgrammingPython Programming

Remove duplicate rows from Pandas DataFrame where only some columns have the same value

Remove duplicate rows based on two columns:

import pandas as pd

df = pd.DataFrame({'Age': [30, 40, 30, 40, 30, 30, 20, 25],
                   'Height': [120, 162, 120, 120, 120, 72, 120, 81],
                   'Score': [4.6, 4.6, 9.0, 3.3, 4, 8, 9, 3],
                   'State': ['NY', 'NY', 'FL', 'AL', 'NY', 'TX', 'FL', 'AL']},
                  index=['Jane', 'Jane', 'Aaron', 'Penelope', 'Jaane', 'Nicky',
                         'Armour', 'Ponting'])

print("\n -------- Duplicate Rows ----------- \n")
print(df)

df1 = df.reset_index().drop_duplicates(subset=['Age','Height'],
                                       keep='first').set_index('index')

print("\n ------- Unique Rows ------------ \n")
print(df1)



C:\python\pandas>python example54.py
 
 -------- Duplicate Rows -----------
 
          Age  Height  Score State
Jane       30     120    4.6    NY
Jane       40     162    4.6    NY
Aaron      30     120    9.0    FL
Penelope   40     120    3.3    AL
Jaane      30     120    4.0    NY
Nicky      30      72    8.0    TX
Armour     20     120    9.0    FL
Ponting    25      81    3.0    AL
 
 ------- Unique Rows ------------
 
          Age  Height  Score State
index
Jane       30     120    4.6    NY
Jane       40     162    4.6    NY
Penelope   40     120    3.3    AL
Nicky      30      72    8.0    TX
Armour     20     120    9.0    FL
Ponting    25      81    3.0    AL
 
C:\python\pandas>