I'm quite new to pandas dataframes, and I'm experiencing some trouble joining two tables.
The first df has just 3 columns:
DF1:
item_id position document_id
336 1 10
337 2 10
338 3 10
1001 1 11
1002 2 11
1003 3 11
38 10 146
And the second has exactly the same two columns (plus plenty of others):
DF2:
item_id document_id col1 col2 col3 ...
337 10 ... ... ...
1002 11 ... ... ...
1003 11 ... ... ...
What I need is to perform an operation which, in SQL, would look as follows:
DF1 join DF2 on
DF1.document_id = DF2.document_id
and
DF1.item_id = DF2.item_id
And, as a result, I want to see DF2 complemented with the 'position' column:
item_id document_id position col1 col2 col3 ...
What is a good way to do this using pandas?
You need merge with the default inner join; note that this requires that there are no duplicated combinations of values across both key columns:
print (df2)
item_id document_id col1 col2 col3
0 337 10 s 4 7
1 1002 11 d 5 8
2 1003 11 f 7 0
df = pd.merge(df1, df2, on=['document_id','item_id'])
print (df)
item_id position document_id col1 col2 col3
0 337 2 10 s 4 7
1 1002 2 11 d 5 8
2 1003 3 11 f 7 0
But if you need the position column to be the third column:
df = pd.merge(df2, df1, on=['document_id','item_id'])
cols = df.columns.tolist()
df = df[cols[:2] + cols[-1:] + cols[2:-1]]
print (df)
item_id document_id position col1 col2 col3
0 337 10 2 s 4 7
1 1002 11 2 d 5 8
2 1003 11 3 f 7 0
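An alternative to the column slicing above, if you'd rather move the column in place, is DataFrame.pop combined with DataFrame.insert (a small sketch on the merged frame):
df = pd.merge(df2, df1, on=['document_id','item_id'])
# remove 'position' and re-insert it as the third column (index 2)
df.insert(2, 'position', df.pop('position'))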
If you're merging on all common columns, as in the OP, you don't even need to pass on=; simply calling merge() will do the job.
merged_df = df1.merge(df2)
The reason is that under the hood, if on= is not passed, pd.Index.intersection is called on the columns to determine the common columns and merge on all of them.
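You can check which columns merge() would pick, a minimal sketch using the frames above:
# the key columns chosen when on= is omitted
common = df1.columns.intersection(df2.columns)
print(list(common))  # ['item_id', 'document_id']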
A special thing about merging on common columns is that it doesn't matter which dataframe is on the right or the left: the rows selected are the same, because they are matched on the common columns. The only difference is where the columns end up; columns of the right dataframe that are not in the left dataframe are appended to the right of the left dataframe's columns. So unless the order of the columns matters (which can easily be fixed using column selection or reindex()), it doesn't really matter which dataframe is on which side. In other words,
df12 = df1.merge(df2, on=['document_id','item_id']).sort_index(axis=1)
df21 = df2.merge(df1, on=['document_id','item_id']).sort_index(axis=1)
# df12 and df21 are the same.
df12.equals(df21) # True
This is not true if the columns to be merged on don't have the same name and you have to pass left_on= and right_on= (see #1 in this answer).
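For illustration, suppose (hypothetically) DF2 named its key columns doc_id and itm_id instead; then both sides must be spelled out:
# hypothetical column names on the right-hand frame
df2_renamed = df2.rename(columns={'document_id': 'doc_id', 'item_id': 'itm_id'})
df = df1.merge(df2_renamed,
               left_on=['document_id','item_id'],
               right_on=['doc_id','itm_id'])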
I have a DataFrame in Python like the one below, where some IDs have duplicate rows:
ID   COL1  COL2  COL3
123  XX    111   ENG
123  abc   111   ENG
444  ccc   2     o
444  ccc   2     o
67   a     89    xx
And I need to select rows like those for ID = 123, where the rows are duplicated on ID but differ in some column or columns, so as output I need something like below:
ID   COL1  COL2  COL3
123  XX    111   ENG
123  abc   111   ENG
How can I do that in Python Pandas? I should add that my real dataset has many more columns, so I need a solution that works for any number of columns, not only ID, COL1, COL2, COL3 :)
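(For anyone who wants to reproduce this, a minimal construction of the frame above:)
import pandas as pd
df = pd.DataFrame({'ID': [123, 123, 444, 444, 67],
                   'COL1': ['XX', 'abc', 'ccc', 'ccc', 'a'],
                   'COL2': [111, 111, 2, 2, 89],
                   'COL3': ['ENG', 'ENG', 'o', 'o', 'xx']})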
First drop duplicates across all columns, then flag the rows whose ID is still duplicated, and finally select those rows.
df = df.drop_duplicates()
mask = df.duplicated(subset=['ID'],keep=False)
df = df[mask]
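With the sample frame above, this leaves only the two ID 123 rows:
print(df)
ID COL1 COL2 COL3
0 123 XX 111 ENG
1 123 abc 111 ENG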
Here is one way to do it:
# drop the duplicates
df.drop_duplicates(inplace=True)
# groupby ID and filter the ones where group size is greater than 1
df[df.groupby('ID')['ID'].transform('size')>1]
ID COL1 COL2 COL3
0 123 XX 111 ENG
1 123 abc 111 ENG
Alternatively,
# preserve the original DF and create a secondary DF with the non-duplicate rows
df2 = df.drop_duplicates()
# using loc, select the rows in DF2 that has a group size exceeding 1
df2.loc[df2.groupby('ID')['ID'].transform('size')>1]
Using .query (note this hard-codes the example ID, so it only helps if you already know which ID to inspect):
df = df.query("ID == 123").drop_duplicates().reset_index(drop=True)
print(df)
ID COL1 COL2 COL3
0 123 XX 111 ENG
1 123 abc 111 ENG
Or, if you aren't also trying to filter to a single ID:
df = df.drop_duplicates().reset_index(drop=True)
print(df)
ID COL1 COL2 COL3
0 123 XX 111 ENG
1 123 abc 111 ENG
2 444 ccc 2 o
3 67 a 89 xx
I have a loop which generates dataframes with 2 columns each. When I try to stack the dataframes vertically using pd.concat within the loop, the code places them side by side instead. Rather than merging columns of the same length, it adds 2 new columns for every loop iteration, creating a bunch of NaNs. How do I solve this?
df_master = pd.DataFrame()
columns = list(df_master)
data = []
for i in range(1, 3):
    # ... do something and return a df2 with 2 columns ...
    data.append(df2)
df_master = pd.concat(data, axis=1)
df_master.head()
How do I collapse the 2 new columns from every iteration into one dataframe?
If you don't need to keep the column labels of the original dataframes, you can rename the column labels of each dataframe to the same set (e.g. 0 and 1) before concat, for example:
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
Demo
df1
57 59
0 1 2
1 3 4
df2
138 140
0 11 12
1 13 14
data = [df1, df2]
df_master = pd.concat([dfi.rename({old: new for new, old in enumerate(dfi.columns)}, axis=1) for dfi in data], ignore_index=True)
df_master
0 1
0 1 2
1 3 4
2 11 12
3 13 14
I suppose the problem is that your columns have different names in each iteration, so you could easily solve it by calling df2.rename() to give them the same names.
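A sketch of that idea, assuming each iteration's df2 has exactly two columns whose names vary:
# give every generated frame the same labels before collecting it
df2 = df2.rename(columns=dict(zip(df2.columns, [0, 1])))
data.append(df2)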
It works for me if I change axis to 0 inside the concat command.
df_master = pd.concat(data, axis=0)
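Note this assumes every generated frame uses the same column labels; if the labels differ between iterations, axis=0 still produces NaNs, just stacked vertically:
a = pd.DataFrame({'x': [1], 'y': [2]})
b = pd.DataFrame({'u': [3], 'v': [4]})
print(pd.concat([a, b], axis=0))
#      x    y    u    v
# 0  1.0  2.0  NaN  NaN
# 0  NaN  NaN  3.0  4.0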
Pandas fills empty cells with NaNs in each scenario, as in the examples below.
df1 = pd.DataFrame({'col1':[11,12,13], 'col2': [21,22,23], 'col3':[31,32,33]})
df2 = pd.DataFrame({'col1':[111,112,113, 114], 'col2': [121,122,123,124]})
merge / join / concatenate data frames [df1, df2] vertically - add rows
pd.concat([df1,df2], ignore_index=True)
# output
col1 col2 col3
0 11 21 31.0
1 12 22 32.0
2 13 23 33.0
3 111 121 NaN
4 112 122 NaN
5 113 123 NaN
6 114 124 NaN
merge / join / concatenate data frames horizontally (aligning by index)
pd.concat([df1,df2], axis=1)
# output
col1 col2 col3 col1 col2
0 11.0 21.0 31.0 111 121
1 12.0 22.0 32.0 112 122
2 13.0 23.0 33.0 113 123
3 NaN NaN NaN 114 124
I have two dataframes with same values but different column order and index.
df1=
index col1 col2 col3 col4
----------------------------------
0 1 2017 1.3 1
1 2 2017 2.4 1
2 3 2017 3.5 0
3 1 2018 3.5 0
df2=
index col3 col1 col2 col4
------------------------------------
0 1 2018 3.5 0
1 3 2017 3.5 0
2 1 2017 1.3 1
3 2 2017 2.4 1
Is there a way to transform one so that one becomes identical to the other?
I have found a way to sort columns
df1 = df1[df2.columns]
but I don't find a way to reorder rows.
Does this work?
df1.sort_values(by='col3') # change to the column you want to sort the rows by
You can pass a list to sort by multiple columns:
df1.sort_values(by=list(df2.columns))
df1.sort_values(by=['col3', 'col4'])
By default, sort_values sorts in ascending order. If you want the rows to be sorted in descending order you can use something like this:
df1.sort_values(by=['col3', 'col4'], ascending=False)
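Putting the steps together, a sketch using the frames above: align the columns, sort both frames by all columns, and reset the index so they compare equal:
cols = list(df2.columns)
a = df1[cols].sort_values(by=cols).reset_index(drop=True)
b = df2.sort_values(by=cols).reset_index(drop=True)
print(a.equals(b))  # True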
Have you tried sorting with the index? If both frames share the same row labels, you can align the columns and then sort the rows by index:
df1 = df1[df2.columns].sort_index()
I have a data frame like this,
df
col1 col2 col3
1    ab   4
     hn
     pr
2    ff   3
3    ty   3
     rt
4    ym   6
Now I want to create one data frame from the above: if both the col1 and col3 values of a row are empty (''), its col2 value should be appended (concatenated) to the row above, where both col1 and col3 are present.
So the final data frame will look like,
df
col1 col2 col3
1 abhnpr 4
2 ff 3
3 tyrt 3
4 ym 6
I could do this using a for loop, comparing each row with the previous one, but the execution time would be high, so I'm looking for a shortcut (a pythonic way) to do this task most efficiently.
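(For reference, the frame above can be built like this, assuming the blank cells are empty strings:)
import pandas as pd
df = pd.DataFrame({'col1': ['1', '', '', '2', '3', '', '4'],
                   'col2': ['ab', 'hn', 'pr', 'ff', 'ty', 'rt', 'ym'],
                   'col3': ['4', '', '', '3', '3', '', '6']})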
Replace the empty values with missing values and forward fill them, then aggregate col2 with a join via GroupBy.agg, and finally restore the column order with DataFrame.reindex:
import numpy as np

c = ['col1','col3']
df[c] = df[c].replace('', np.nan).ffill()
df = df.groupby(c)['col2'].agg(''.join).reset_index().reindex(df.columns, axis=1)
print (df)
col1 col2 col3
0 1 abhnpr 4
1 2 ff 3
2 3 tyrt 3
3 4 ym 6
I have the following two dataframes, and I need to calculate the value column in df2 based on df1
df1
col1 col2 col3 value
Chicago M 26 54
NY M 20 21
...
df2
col1 col2 col3 value
NY M 20 ? (should be 21 based on above dataframe)
I am doing a loop like the one below, which is slow:
for index, row in df2.iterrows():
    df1[(df1['col1'] == row['col1'])
        & (df1['col2'] == row['col2'])
        & (df1['col3'] == row['col3'])]['value'].values[0]
how to do it more efficiently/fast?
You need merge with a left join on the comparison columns first:
print (df2)
col1 col2 col3 value
0 LA M 20 20
1 NY M 20 ?
df = pd.merge(df2, df1, on=['col1','col2','col3'], how='left', suffixes=('','_'))
It creates a new column value_ with the matched values. Last, use fillna to replace the NaNs with the original values and then remove the helper column value_:
print (df)
col1 col2 col3 value value_
0 LA M 20 20 NaN
1 NY M 20 ? 21.0
df['value'] = df['value_'].fillna(df['value'])
df = df.drop('value_', axis=1)
print (df)
col1 col2 col3 value
0 LA M 20 20
1 NY M 20 21
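If you prefer, the fill-and-drop steps can be chained into a single pipeline (same logic, written compactly):
df = (pd.merge(df2, df1, on=['col1','col2','col3'], how='left', suffixes=('','_'))
        .assign(value=lambda d: d['value_'].fillna(d['value']))
        .drop(columns='value_'))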