Keep duplicated row from a specific data frame after append - python

I have two data frames (df1 & df2). After appending them I find two duplicate rows (same values in the three columns ["ID", "City", "Year"]). I would like to keep the one of the duplicate rows that comes from df2.
import pandas as pd

data1 = {
    'ID': [7, 2],
    'City': ["Berlin", "Paris"],
    'Year': [2012, 2000],
    'Number': [62, 43],
}
data2 = {
    'ID': [7, 5],
    'City': ["Berlin", "London"],
    'Year': [2012, 2019],
    'Number': [60, 100],
}
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
df_merged = df1.append(df2)
Is there a way to do this?
Expected output:
ID City Year Number
0 2 Paris 2000 43
1 7 Berlin 2012 60
2 5 London 2019 100

new = (pd.concat([df1, df2])
         .drop_duplicates(subset=["ID", "City", "Year"],
                          keep="last",
                          ignore_index=True))
append is deprecated and will be removed in a future pandas release, so please use pd.concat here. Then drop_duplicates over the said columns with keep="last":
In [376]: df1
Out[376]:
ID City Year Number
0 7 Berlin 2012 62
1 2 Paris 2000 43
In [377]: df2
Out[377]:
ID City Year Number
0 7 Berlin 2012 60
1 5 London 2019 100
In [378]: (pd.concat([df1, df2])
...: .drop_duplicates(subset=["ID", "City", "Year"],
...: keep="last",
...: ignore_index=True))
Out[378]:
ID City Year Number
0 2 Paris 2000 43
1 7 Berlin 2012 60
2 5 London 2019 100
ignore_index=True rebuilds the index as 0, 1, 2 after drop_duplicates leaves gaps in it.
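For illustration (a small sketch, not part of the original answer), this is roughly what the index looks like without ignore_index, assuming the df1/df2 built above:

# Without ignore_index the surviving rows keep their original labels,
# so the index ends up as [1, 0, 1] instead of [0, 1, 2].
kept = (pd.concat([df1, df2])
          .drop_duplicates(subset=["ID", "City", "Year"], keep="last"))
print(kept.index.tolist())   # expected: [1, 0, 1]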


Get the value of a data frame column with respect to another data frame column value

I have two data frames
df1:
ID Date Value
0 9560 07/3/2021 25
1 9560 03/03/2021 20
2 9712 12/15/2021 15
3 9712 08/30/2021 10
4 9920 4/11/2021 5
df2:
ID Value
0 9560
1 9712
2 9920
In df2, I want to get the latest value from the "Value" column of df1 for each ID.
This is my expected output:
ID Value
0 9560 25
1 9712 15
2 9920 5
How could I achieve it?
Based on Daniel Afriyie's approach, I came up with this solution:
import pandas as pd

# Setup for demo
df1 = pd.DataFrame(
    columns=['ID', 'Date', 'Value'],
    data=[
        [9560, '07/3/2021', 25],
        [9560, '03/03/2021', 20],
        [9712, '12/15/2021', 15],
        [9712, '08/30/2021', 10],
        [9920, '4/11/2021', 5],
    ]
)
df2 = pd.DataFrame(
    columns=['ID', 'Value'],
    data=[[9560, None], [9712, None], [9920, None]]
)

## Actual solution
# Casting 'Date' column to actual dates
df1['Date'] = pd.to_datetime(df1['Date'])
# Sorting by dates
df1 = df1.sort_values(by='Date', ascending=False)
# Dropping duplicates of 'ID' (since it's ordered by date, only the newest of each ID is kept)
df1 = df1.drop_duplicates(subset=['ID'])
# Merging the values from df1 into df2
df2 = pd.merge(df2[['ID']], df1[['ID', 'Value']])
output:
ID Value
0 9560 25
1 9712 15
2 9920 5
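As a side note (a sketch, not from the original answer), the same result can be obtained without deduplicating df1, by picking the row with the newest date per ID via idxmax. This assumes df1/df2 as freshly built in the demo setup above, with 'Date' already converted by pd.to_datetime:

# Index of the newest row per ID, then pick those rows and merge them in.
newest = df1.loc[df1.groupby('ID')['Date'].idxmax(), ['ID', 'Value']]
result = df2[['ID']].merge(newest, on='ID', how='left')
print(result)   # 9560 -> 25, 9712 -> 15, 9920 -> 5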

How to join two pandas dataframes based on a date in df1 being >= date in df2

I have a large data frame with key IDs, states, start dates and other characteristics. I have another data frame with states, a start date and a "1" to signify a flag.
I want to join the two, based on the state and the date in df1 being greater than or equal to the date in df2.
Take the example below. df1 is the table of states, start dates, and a 1 for the flag. df2 is a dataframe that needs those flags wherever the date in df1 is greater than or equal to the start date in df2. The end result is df3: only the observations whose states match and whose df1 date is >= the df2 start date get the flag.
import pandas as pd

dict1 = {'date': ['2020-01-01', '2020-02-15', '2020-02-04', '2020-03-17',
                  '2020-06-15'],
         'state': ['AL', 'FL', 'MD', 'NC', 'SC'],
         'flag': [1, 1, 1, 1, 1]}
df1 = pd.DataFrame(dict1)
df1['date'] = pd.to_datetime(df1['date'])

dict2 = {'state': ['AL', 'FL', 'MD', 'NC', 'SC'],
         'keyid': ['001', '002', '003', '004', '005'],
         'start_date': ['2020-02-01', '2020-01-15', '2020-01-30', '2020-05-18',
                        '2020-05-16']}
df2 = pd.DataFrame(dict2)
df2['start_date'] = pd.to_datetime(df2['start_date'])

df3 = df2
df3['flag'] = [0, 1, 1, 0, 1]
How do I get to df3 programmatically? My actual df1 has a row for each state. My actual df2 has over a million observations with different dates.
Use merge_asof to merge on greater-or-equal datetimes via the parameter direction='forward':
A “forward” search selects the first row in the right DataFrame whose ‘on’ key is greater than or equal to the left’s key.
df2['need'] = [0, 1, 1, 0, 1]

df1 = df1.sort_values('date')
df2 = df2.sort_values('start_date')

df = pd.merge_asof(df2,
                   df1,
                   left_on='start_date',
                   right_on='date',
                   by='state',
                   direction='forward')
df['flag'] = df['flag'].fillna(0).astype(int)
print (df)
state keyid start_date need date flag
0 FL 002 2020-01-15 1 2020-02-15 1
1 MD 003 2020-01-30 1 2020-02-04 1
2 AL 001 2020-02-01 0 NaT 0
3 SC 005 2020-05-16 1 2020-06-15 1
4 NC 004 2020-05-18 0 NaT 0
You can also rename the column to avoid it appearing twice in the output DataFrame:
df2['need'] = [0, 1, 1, 0, 1]

df1 = df1.sort_values('date')
df2 = df2.sort_values('start_date')

df = pd.merge_asof(df2,
                   df1.rename(columns={'date': 'start_date'}),
                   on='start_date',
                   by='state',
                   direction='forward')
df['flag'] = df['flag'].fillna(0).astype(int)
print (df)
state keyid start_date need flag
0 FL 002 2020-01-15 1 1
1 MD 003 2020-01-30 1 1
2 AL 001 2020-02-01 0 0
3 SC 005 2020-05-16 1 1
4 NC 004 2020-05-18 0 0
Use df.merge and numpy.where:
In [29]: import numpy as np
In [30]: df3 = df2.merge(df1)[['state', 'keyid', 'start_date', 'date']]
In [31]: df3['flag'] = np.where(df3['start_date'].ge(df3['date']), 0, 1)
In [33]: df3.drop('date', axis=1, inplace=True)
In [34]: df3
Out[34]:
state keyid start_date flag
0 AL 001 2020-02-01 0
1 FL 002 2020-01-15 1
2 MD 003 2020-01-30 1
3 NC 004 2020-05-18 0
4 SC 005 2020-05-16 1

Python: Compare two dataframes with different numbers of rows and a composite key

I have two different dataframes which I need to compare.
These two dataframes have different numbers of rows and don't have a single primary key; the composite primary key is (id || ver || name || prd || loc).
df1:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
b 1 alex 1b y
b 2 david 1b z
df2:
id ver name prd loc
a 1 surya 1a x
a 1 surya 1a y
a 2 ram 1a x
b 1 alex 1b z
I tried the code below and it works if there are the same number of rows, but in a case like the above it does not.
import numpy as np
import pandas as pd

df1 = pd.DataFrame(Source)
df1 = df1.astype(str)  # converting all elements to strings for easy comparison
df2 = pd.DataFrame(Target)
df2 = df2.astype(str)  # converting all elements to strings for easy comparison

header_list = df1.columns.tolist()  # list of column names from df1, as both dfs have the same structure

df3 = pd.DataFrame(data=None, columns=df1.columns, index=df1.index)
for x in range(len(header_list)):
    df3[header_list[x]] = np.where(df1[header_list[x]] == df2[header_list[x]], 'True', 'False')

df3.to_csv('Output', index=False)
Please let me know how to compare the datasets when there are different numbers of rows.
You can try this:
~df1.isin(df2)
# df1[~df1.isin(df2)].dropna()
Let's consider a quick example:
df1 = pd.DataFrame({
    'Buyer': ['Carl', 'Carl', 'Carl'],
    'Quantity': [18, 3, 5]})
#   Buyer  Quantity
# 0  Carl        18
# 1  Carl         3
# 2  Carl         5

df2 = pd.DataFrame({
    'Buyer': ['Carl', 'Mark', 'Carl', 'Carl'],
    'Quantity': [2, 1, 18, 5]})
#   Buyer  Quantity
# 0  Carl         2
# 1  Mark         1
# 2  Carl        18
# 3  Carl         5
~df2.isin(df1)
# Buyer Quantity
# 0 False True
# 1 True True
# 2 False True
# 3 True True
df2[~df2.isin(df1)].dropna()
# Buyer Quantity
# 1 Mark 1
# 3 Carl 5
Another idea can be a merge on the same column names.
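A sketch of that merge idea (my own illustration, not the original answer's code), using the question's df1/df2 and their composite key columns: an outer merge with indicator=True marks which frame each row came from, which works even when the row counts differ.

# Outer-merge on all key columns; 'indicator' adds a _merge column
# saying whether a row exists in both frames or only in one of them.
cmp = df1.merge(df2, how='outer',
                on=['id', 'ver', 'name', 'prd', 'loc'],
                indicator=True)
only_in_df1 = cmp[cmp['_merge'] == 'left_only']
only_in_df2 = cmp[cmp['_merge'] == 'right_only']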
Sure, tweak the code to your needs. Hope this helped :)

Dropping duplicate rows but keeping certain values Pandas

I have 2 similar dataframes that I concatenated; they have a lot of repeated values because they are basically the same data set, but for different years.
The problem is that one of the sets has some values missing whereas the other sometimes has these values.
For example:
Name Unit Year Level
Nik 1 2000 12
Nik 1 12
John 2 2001 11
John 2 2001 11
Stacy 1 8
Stacy 1 1999 8
.
.
I want to drop duplicates on the subset = ['Name', 'Unit', 'Level'] since some repetitions don't have years.
However, I'm left with the rows that have no Year, and I'd like to keep the rows that do have these values:
Name Unit Year Level
Nik 1 2000 12
John 2 2001 11
Stacy 1 1999 8
.
.
How do I keep these values rather than the blanks?
Use sort_values with the default parameter na_position='last' (so it can be omitted), and then drop_duplicates:
print (df)
Name Unit Year Level
0 Nik 1 NaN 12
1 Nik 1 2000.0 12
2 John 2 2001.0 11
3 John 2 2001.0 11
4 Stacy 1 NaN 8
5 Stacy 1 1999.0 8
subset = ['Name', 'Unit', 'Level']
df = df.sort_values('Year').drop_duplicates(subset)
Or:
df = df.sort_values(subset + ['Year']).drop_duplicates(subset)
print (df)
Name Unit Year Level
5 Stacy 1 1999.0 8
1 Nik 1 2000.0 12
2 John 2 2001.0 11
Another solution with GroupBy.first, which returns the first non-missing value of Year per group:
df = df.groupby(subset, as_index=False, sort=False)['Year'].first()
print (df)
Name Unit Level Year
0 Nik 1 12 2000.0
1 John 2 11 2001.0
2 Stacy 1 8 1999.0
One solution that comes to mind is to first sort the concatenated dataframe by year with the sort_values function:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
then drop duplicates with the keep='first' parameter
df.drop_duplicates(subset=['Name', 'Unit', 'Level'], keep="first")
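Putting both steps together (a sketch, assuming a concatenated frame df like the one in the question, and relying on NaN years sorting last, which is the sort_values default):

# Sort so rows with a Year come before rows with NaN, then keep the first
# occurrence of each (Name, Unit, Level) combination.
df = (df.sort_values('Year', na_position='last')
        .drop_duplicates(subset=['Name', 'Unit', 'Level'], keep='first'))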
I would suggest that you look at the creation step of your merged dataset.
When merging the data sets you can do so on multiple indices i.e.
df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
With the outer join you collect all data sets and remove duplicates right away. The only thing left is to merge the Year column which you can do like so:
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (x['Year'] is not np.nan and x['Year'] != '') else x['Year_r'], axis=1)
This fills the gaps and afterwards you are able to simply drop the 'Year_r' column.
The benefit here is that not only NaN values of missing years are covered but also missing Years which are represented as empty strings.
Following a small working example:
import pandas as pd
import numpy as np

left = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo', 'Peter', 'Adam'],
                     'Unit': ['2', '4', '6', '2', '4', '12'],
                     'Year': ['', '2009', '1954', '2025', '2012', '2024'],
                     'Level': ['L1', 'L1', 'L0', 'L4', 'L3', 'L10']})
right = pd.DataFrame({'Name': ['Adam', 'Beatrice', 'Crissy', 'Dumbo'],
                      'Unit': ['2', '4', '6', '2'],
                      'Year': ['2010', '2009', '1954', '2025'],
                      'Level': ['L1', 'L1', 'L0', 'L4']})

df = pd.merge(left, right, how='outer', on=['Name', 'Unit', 'Level'], suffixes=['', '_r'])
df['Year'] = df[['Year', 'Year_r']].apply(lambda x: x['Year'] if (x['Year'] is not np.nan and x['Year'] != '') else x['Year_r'], axis=1)
df
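The drop mentioned above is the only step not shown; assuming the merged df from this example, it is simply:

# Remove the helper column now that the gaps in 'Year' are filled.
df = df.drop(columns=['Year_r'])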

Pandas data frame sum of column and collecting the results

Given the following dataframe:
import pandas as pd
p1 = {'name': 'willy', 'age': 11, 'interest': "Lego"}
p2 = {'name': 'willy', 'age': 11, 'interest': "games"}
p3 = {'name': 'zoe', 'age': 9, 'interest': "cars"}
df = pd.DataFrame([p1, p2, p3])
df
age interest name
0 11 Lego willy
1 11 games willy
2 9 cars zoe
I want to know the sum of interests of each person and let each person only show once in the list. I do the following:
Interests = df[['age', 'name', 'interest']].groupby(['age' , 'name']).count()
Interests.reset_index(inplace=True)
Interests.sort_values('interest', ascending=False, inplace=True)
Interests
age name interest
1 11 willy 2
0 9 zoe 1
This works, but I have the feeling that I'm doing it wrong. Right now I'm using the column 'interest' to display my sum values, which is okay, but as I said I expect there to be a nicer way to do this.
I saw many questions about counting/sum in Pandas but for me the part where I leave out the 'duplicates' is key.
You can use size (the length of each group) rather than count, which counts the non-NaN entries in each column of the group.
In [11]: df[['age', 'name', 'interest']].groupby(['age' , 'name']).size()
Out[11]:
age name
9 zoe 1
11 willy 2
dtype: int64
In [12]: df[['age', 'name', 'interest']].groupby(['age' , 'name']).size().reset_index(name='count')
Out[12]:
age name count
0 9 zoe 1
1 11 willy 2
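For what it's worth (a sketch, not from the original answers), newer pandas can express the same count-per-pair directly with value_counts:

# Count occurrences of each (age, name) pair and get a tidy frame back.
counts = (df[['age', 'name']]
            .value_counts()
            .reset_index(name='count'))
print(counts)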
In [2]: df
Out[2]:
age interest name
0 11 Lego willy
1 11 games willy
2 9 cars zoe
In [3]: for name, group in df.groupby('name'):
   ...:     print(name)
   ...:     print(group.interest.count())
   ...:
willy
2
zoe
1
