How to efficiently do lookup in dataframe based on multiple columns [duplicate] - python

I have the following two dataframes, and I need to calculate the value column in df2 based on df1:
df1
col1 col2 col3 value
Chicago M 26 54
NY M 20 21
...
df2
col1 col2 col3 value
NY M 20 ? (should be 21 based on above dataframe)
I am looping as below, which is slow:
for index, row in df2.iterrows():
    df2.at[index, 'value'] = df1[(df1['col1'] == row['col1'])
                                 & (df1['col2'] == row['col2'])
                                 & (df1['col3'] == row['col3'])]['value'].values[0]
How can I do this more efficiently?

You need a left merge on the columns you compare first:
print (df2)
col1 col2 col3 value
0 LA M 20 20
1 NY M 20 ?
df = pd.merge(df2, df1, on=['col1','col2','col3'], how='left', suffixes=('','_'))
This creates a new column value_ with the matched values. Then use fillna to replace NaNs with the original values, and finally remove the helper column value_:
print (df)
col1 col2 col3 value value_
0 LA M 20 20 NaN
1 NY M 20 ? 21.0
df['value'] = df['value_'].fillna(df['value'])
df = df.drop('value_', axis=1)
print (df)
col1 col2 col3 value
0 LA M 20 20
1 NY M 20 21
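Alternatively, a minimal sketch that avoids the helper column by building a lookup Series keyed by the three columns (this assumes the (col1, col2, col3) combinations are unique in df1):
import numpy as np
import pandas as pd

# Map (col1, col2, col3) -> value using df1
lookup = df1.set_index(['col1', 'col2', 'col3'])['value']

# Align df2's key columns against the lookup; unmatched keys become NaN
keys = pd.MultiIndex.from_frame(df2[['col1', 'col2', 'col3']])
matched = lookup.reindex(keys).to_numpy()

# Keep df2's original value wherever no match was found
df2['value'] = np.where(pd.isna(matched), df2['value'], matched)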

Related

Drop duplicate rows in dataframe based on multiple columns with list values [duplicate]

I have a DataFrame with multiple columns, and a few of them contain list values. Considering only the columns with list values, duplicate rows have to be deleted.
Current Dataframe:
ID col1 col2 col3 col4
1 52 [kjd,pkh,sws] [aqs,zxc,asd] [plm,okn,ijb]
2 47 [qaz,wsx,edc] [aws,rfc,tgb] [rty,wer,dfg]
3 85 [kjd,pkh,sws] [aqs,zxc,asd] [plm,okn,ijb]
4 27 [asw,bxs,mdh] [wka,kdy,kaw] [pqm,lsc,yhb]
Desired output:
ID col1 col2 col3 col4
2 47 [qaz,wsx,edc] [aws,rfc,tgb] [rty,wer,dfg]
4 27 [asw,bxs,mdh] [wka,kdy,kaw] [pqm,lsc,yhb]
I have tried converting the lists to tuples and applying df.drop_duplicates(), but I am getting multiple errors.
You can convert each of the columns with lists into str and then drop duplicates.
Step 1: Convert each column that has lists into a string type using astype(str).
Step 2: Use drop_duplicates on the stringified columns. Since you want all duplicates to be removed, set keep=False.
Step 3: Drop the temporary astype(str) columns, as you no longer need them.
The full code will be:
import pandas as pd

c = ['col1','col2','col3','col4']
d = [[52,['kjd','pkh','sws'],['aqs','zxc','asd'],['plm','okn','ijb']],
     [47,['qaz','wsx','edc'],['aws','rfc','tgb'],['rty','wer','dfg']],
     [85,['kjd','pkh','sws'],['aqs','zxc','asd'],['plm','okn','ijb']],
     [27,['asw','bxs','mdh'],['wka','kdy','kaw'],['pqm','lsc','yhb']]]

df = pd.DataFrame(d, columns=c)
print(df)

# Step 1: stringified copies of the list columns
df['col2s'] = df['col2'].astype(str)
df['col3s'] = df['col3'].astype(str)
df['col4s'] = df['col4'].astype(str)

# Step 2: drop every row that appears more than once on the stringified columns
df.drop_duplicates(subset=['col2s','col3s','col4s'], keep=False, inplace=True)

# Step 3: remove the helper columns
df.drop(['col2s','col3s','col4s'], axis=1, inplace=True)
print(df)
The output of this will be:
Original DataFrame:
col1 col2 col3 col4
0 52 [kjd, pkh, sws] [aqs, zxc, asd] [plm, okn, ijb]
1 47 [qaz, wsx, edc] [aws, rfc, tgb] [rty, wer, dfg]
2 85 [kjd, pkh, sws] [aqs, zxc, asd] [plm, okn, ijb]
3 27 [asw, bxs, mdh] [wka, kdy, kaw] [pqm, lsc, yhb]
DataFrame after dropping the duplicates:
col1 col2 col3 col4
1 47 [qaz, wsx, edc] [aws, rfc, tgb] [rty, wer, dfg]
3 27 [asw, bxs, mdh] [wka, kdy, kaw] [pqm, lsc, yhb]
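A variant sketch that avoids the temporary string columns: convert the list cells to tuples (which are hashable) only for the duplicated check, then filter with the resulting boolean mask:
cols = ['col2','col3','col4']

# Tuples are hashable, so duplicated() can compare the rows directly
mask = df[cols].apply(lambda s: s.map(tuple)).duplicated(keep=False)

# Keep only the rows whose list columns are unique
df = df[~mask]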

Concatenate column values with the rows above when other columns are empty

I have a data frame like this,
df
col1 col2 col3
1 ab 4
hn
pr
2 ff 3
3 ty 3
rt
4 ym 6
Now I want to create one data frame from the above: if both the col1 and col3 values are empty (''), just concatenate the row's col2 with the rows above where both col1 and col3 values are present.
So the final data frame will look like,
df
col1 col2 col3
1 abhnpr 4
2 ff 3
3 tyrt 3
4 ym 6
I could do this using a for loop, comparing each row with the previous one, but the execution time would be high, so I am looking for a shortcut (a Pythonic way) to do the same task efficiently.
Replace empty values with missing values and forward fill them, then aggregate col2 with a join via GroupBy.agg, and finally reorder the columns with DataFrame.reindex:
import numpy as np

c = ['col1','col3']
df[c] = df[c].replace('', np.nan).ffill()
df = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print (df)
col1 col2 col3
0 1 abhnpr 4
1 2 ff 3
2 3 tyrt 3
3 4 ym 6
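One caveat: groupby sorts the group keys by default. Here the forward-filled keys are already in increasing order, so the output order matches the input; if that were not the case, a variant of the aggregation step with sort=False would preserve first-appearance order:
df = df.groupby(c, sort=False)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)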

Sort a column based on the sorting of a column from another pandas data frame

I have a dataframe like this:
df1:
col1 col2
P 1
Q 3
M 2
I have another dataframe:
df2:
col1 col2
Q 1
M 3
P 9
I want to sort the col1 of df2 based on the order of col1 of df1. So the final dataframe will look like:
df3:
col1 col2
P 9
Q 1
M 3
How can I do this using pandas or any other efficient method?
You could set col1 as the index in df2 using set_index and index the dataframe with df1.col1 using .loc:
df2.set_index('col1').loc[df1.col1].reset_index()
col1 col2
0 P 9
1 Q 1
2 M 3
Or, as @jpp suggests, you can also use .reindex instead of .loc:
df2.set_index('col1').reindex(df1.col1).reset_index()
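Another sketch: encode col1 as an ordered Categorical whose category order comes from df1 (this assumes df1.col1 has no duplicates), then sort:
df3 = df2.copy()
df3['col1'] = pd.Categorical(df3['col1'], categories=df1['col1'], ordered=True)
df3 = df3.sort_values('col1').reset_index(drop=True)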

Join pandas dataframes based on column values

I'm quite new to pandas dataframes, and I'm having some trouble joining two tables.
The first df has just 3 columns:
DF1:
item_id position document_id
336 1 10
337 2 10
338 3 10
1001 1 11
1002 2 11
1003 3 11
38 10 146
And the second has exactly the same two columns (and plenty of others):
DF2:
item_id document_id col1 col2 col3 ...
337 10 ... ... ...
1002 11 ... ... ...
1003 11 ... ... ...
What I need is to perform an operation which, in SQL, would look as follows:
DF1 join DF2 on
DF1.document_id = DF2.document_id
and
DF1.item_id = DF2.item_id
And, as a result, I want to see DF2, complemented with column 'position':
item_id document_id position col1 col2 col3 ...
What is a good way to do this using pandas?
I think you need merge with the default inner join, but there must be no duplicated combinations of values across these columns:
print (df2)
item_id document_id col1 col2 col3
0 337 10 s 4 7
1 1002 11 d 5 8
2 1003 11 f 7 0
df = pd.merge(df1, df2, on=['document_id','item_id'])
print (df)
item_id position document_id col1 col2 col3
0 337 2 10 s 4 7
1 1002 2 11 d 5 8
2 1003 3 11 f 7 0
But if you need the position column in the third position:
df = pd.merge(df2, df1, on=['document_id','item_id'])
cols = df.columns.tolist()
df = df[cols[:2] + cols[-1:] + cols[2:-1]]
print (df)
item_id document_id position col1 col2 col3
0 337 10 2 s 4 7
1 1002 11 2 d 5 8
2 1003 11 3 f 7 0
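An equivalent sketch for the reordering step, moving the column explicitly instead of slicing the column list:
df = pd.merge(df2, df1, on=['document_id','item_id'])

# pop() removes 'position' and returns it; insert() places it at index 2
df.insert(2, 'position', df.pop('position'))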
If you're merging on all common columns as in the OP, you don't even need to pass on=; simply calling merge() will do the job.
merged_df = df1.merge(df2)
The reason is that, under the hood, if on= is not passed, pd.Index.intersection is called on the columns to determine the common columns, and the merge is performed on all of them.
A special thing about merging on common columns is that it doesn't matter which dataframe is on the right or the left: the filtered rows are the same, because they are selected by looking up matching rows on the common columns. The only difference is where the columns are positioned; the columns of the right dataframe that are not in the left dataframe are appended to the right of the left dataframe's columns. So unless the column order matters (which is easily fixed using column selection or reindex()), it doesn't really matter which dataframe is on the right and which is on the left. In other words,
df12 = df1.merge(df2, on=['document_id','item_id']).sort_index(axis=1)
df21 = df2.merge(df1, on=['document_id','item_id']).sort_index(axis=1)
# df12 and df21 are the same.
df12.equals(df21) # True
This is not true if the columns to be merged on don't have the same name and you have to pass left_on= and right_on= (see #1 in this answer).
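For completeness, a minimal sketch of the left_on=/right_on= case, using hypothetical renamed key columns iid and did in the right frame:
# Hypothetical: the right frame uses different key names
df2_renamed = df2.rename(columns={'item_id': 'iid', 'document_id': 'did'})

df = df1.merge(df2_renamed,
               left_on=['item_id', 'document_id'],
               right_on=['iid', 'did']).drop(columns=['iid', 'did'])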

How to group by and divide the count of one column by the count of unique values of a second column in pandas?

I have a pandas data frame with 4 columns, say 'col1', 'col2', 'col3' and 'col4'. Now I want to group by col1 and col2 and take the aggregate below:
Count(col3)/(Count(unique col4)) As result_col
How do I do this? I am using MySQL with pandas.
I have tried many things from the internet, but have not found an exact solution, which is why I am posting here.
It seems you need to aggregate by size and nunique, and then divide the output columns:
import pandas as pd

df = pd.DataFrame({'col1':[1,1,1],
                   'col2':[4,4,6],
                   'col3':[7,7,9],
                   'col4':[3,3,5]})
print (df)
col1 col2 col3 col4
0 1 4 7 3
1 1 4 7 3
2 1 6 9 5
df1 = df.groupby(['col1','col2']).agg({'col3':'size','col4':'nunique'})
df1['result_col'] = df1['col3'].div(df1['col4'])
print (df1)
           col3  col4  result_col
col1 col2
1    4        2     1         2.0
     6        1     1         1.0
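On pandas 0.25 or newer, named aggregation expresses the same computation with explicit output column names:
df1 = (df.groupby(['col1','col2'])
         .agg(col3_count=('col3', 'size'),
              col4_nunique=('col4', 'nunique')))
df1['result_col'] = df1['col3_count'] / df1['col4_nunique']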
