Selecting columns from two dataframes according to another column - python

I have 2 dataframes: one contains some general information about football players, and the other contains further information such as matches won by each player. They both have an "id" column. However, they are not the same length.
What I want to do is create a new dataframe that contains 2 columns: "x" from the first dataframe and "y" from the second, ONLY where the "id" column contains the same value in both dataframes. That way I can match the "x" and "y" columns that belong to the same person.
I tried to do it using the concat function:
pd.concat([firstdataframe['x'], seconddataframe['y']], axis=1, keys=['x', 'y'])
But I couldn't work out how to apply the condition of the "id" being equal in both dataframes.

It seems you need merge with the default inner join; note also that the values in the id columns have to be unique:
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
Sample:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
   id  x
0   1  4
1   2  3
2   3  8
df2 = pd.DataFrame({'id':[1,2],'y':[7,0]})
print (df2)
   id  y
0   1  7
1   2  0
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
   id  x  y
0   1  4  7
1   2  3  0
Solution with concat is possible, but a bit complicated, becasue need join on indexes with inner join:
df = pd.concat([df1.set_index('id')['x'],
                df2.set_index('id')['y']],
               axis=1, join='inner').reset_index()
print (df)
   id  x  y
0   1  4  7
1   2  3  0
EDIT:
If the ids are not unique, duplicates create all combinations and the output dataframe is expanded:
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
print (df1)
   id  x
0   1  4
1   2  3
2   3  8
df2 = pd.DataFrame({'id':[1,2,1,1],'y':[7,0,4,2]})
print (df2)
   id  y
0   1  7
1   2  0
2   1  4
3   1  2
df = pd.merge(df1[['id','x']], df2[['id','y']], on='id')
print (df)
   id  x  y
0   1  4  7
1   1  4  4
2   1  4  2
3   2  3  0
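If you would rather have merge fail loudly than silently expand the rows, a minimal sketch using merge's validate argument (assuming a reasonably recent pandas version) that asserts the ids are unique on both sides:
import pandas as pd
df1 = pd.DataFrame({'id':[1,2,3],'x':[4,3,8]})
df2 = pd.DataFrame({'id':[1,2,1,1],'y':[7,0,4,2]})
# validate='one_to_one' raises MergeError because id is duplicated in df2
try:
    df = pd.merge(df1[['id','x']], df2[['id','y']], on='id', validate='one_to_one')
except pd.errors.MergeError as e:
    print(e)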

How to merge two dataframes for updating older one with new one?

I am sorry for being a noob, but I can't find a solution to my problem even after hours of searching.
import pandas as pd
df1 = pd.read_excel('df1.xlsx')
df1.set_index('time')
print(df1)
df2 = pd.read_excel('df2.xlsx')
df2.set_index('time')
print(df2)
new_df = pd.merge(df1, df2, how='outer')
print(new_df)
df1
   time  bought
0     1       0
1     2       0
2     3       0
3     4       0
4     5       1
df2
   time  bought
0     3       0
1     4       0
2     5       0
3     6       0
4     7       0
new_df
   time  bought
0     1       0
1     2       0
2     3       0
3     4       0
4     5       1
5     5       0
6     6       0
7     7       0
What I want is:
updating df1 (the existing data) with df2 (the new data feed); when it comes to the bought value, df1's data should come first
new_df should have all unique time values from df1 and df2, without duplicates
I tried every method I found, but none produced my desired outcome or they created unnecessary duplicates as above (two rows with a time value of 5).
The merge method created _x/_y suffixes or duplicates.
join() didn't work either.
What I desire should look like:
new_df
   time  bought
0     1       0
1     2       0
2     3       0
3     4       0
4     5       1
5     6       0
6     7       0
Thank you in advance
If you perform the merge as you have done, all you need to do is remove the duplicate rows, keeping only the more recent data.
drop_duplicates() takes the kwarg subset, which accepts a list of columns, and keep, which sets which row to keep if there are duplicates.
In this case we only need to check for duplicates in the time column, and we keep the first row.
import pandas as pd
df1 = pd.read_excel('df1.xlsx')
print(df1)
df2 = pd.read_excel('df2.xlsx')
print(df2)
# the set_index('time') calls from the question are omitted here:
# they had no effect, because the result was never assigned back
new_df = pd.merge(df1, df2, how='outer')
# drop later duplicates by time so df1's value wins, then renumber the index
new_df = new_df.drop_duplicates(subset=['time'], keep='first').reset_index(drop=True)
print(new_df)
Output:
   time  bought
0     1       0
1     2       0
2     3       0
3     4       0
4     5       1
5     6       0
6     7       0
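An alternative sketch that avoids creating the intermediate duplicates at all, using combine_first so that df1's values win wherever both frames share a time; the inline frames here just mirror the Excel data from the question:
import pandas as pd
df1 = pd.DataFrame({'time': [1, 2, 3, 4, 5], 'bought': [0, 0, 0, 0, 1]})
df2 = pd.DataFrame({'time': [3, 4, 5, 6, 7], 'bought': [0, 0, 0, 0, 0]})
# align on time; df1 takes priority wherever both frames have a value
new_df = (df1.set_index('time')
             .combine_first(df2.set_index('time'))
             .reset_index())
# alignment can upcast bought to float, so cast back if needed
new_df['bought'] = new_df['bought'].astype(int)
print(new_df)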

For each row return the column names of the smallest value - pandas

Assuming that I have a dataframe with the following values:
id  product1sold  product2sold  product3sold
 1             2             3             3
 2             0             0             5
 3             3             2             1
How do I add 'most_sold' and 'least_sold' columns containing lists of all the most and least sold products per id?
It should look like this.
id  product1  product2  product3  most_sold             least_sold
 1         2         3         3  [product2, product3]  [product1]
 2         0         0         5  [product3]            [product1, product2]
 3         3         2         1  [product1]            [product3]
Use a list comprehension, testing for the minimal and maximal values across the product columns:
# select all columns except the first (id)
df1 = df.iloc[:, 1:]
cols = df1.columns.to_numpy()
df['most_sold'] = [cols[x].tolist() for x in df1.eq(df1.max(axis=1), axis=0).to_numpy()]
df['least_sold'] = [cols[x].tolist() for x in df1.eq(df1.min(axis=1), axis=0).to_numpy()]
print (df)
   id  product1sold  product2sold  product3sold                     most_sold  \
0   1             2             3             3  [product2sold, product3sold]
1   2             0             0             5                [product3sold]
2   3             3             2             1                [product1sold]

                     least_sold
0                [product1sold]
1  [product1sold, product2sold]
2                [product3sold]
If performance is not important, it is possible to use DataFrame.apply:
df1 = df.iloc[:, 1:]
f = lambda x: x.index[x].tolist()
df['most_sold'] = df1.eq(df1.max(axis=1), axis=0).apply(f, axis=1)
df['least_sold'] = df1.eq(df1.min(axis=1), axis=0).apply(f, axis=1)
You can do something like this, but note that idxmin and idxmax return only the first matching column name per row, so ties are not collected into a list:
minValueCol = yourDataFrame.idxmin(axis=1)
maxValueCol = yourDataFrame.idxmax(axis=1)
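A quick sketch on the sample data (with the id column excluded first) showing that each row yields a single label rather than a list of ties:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3],
                   'product1sold': [2, 0, 3],
                   'product2sold': [3, 0, 2],
                   'product3sold': [3, 5, 1]})
products = df.iloc[:, 1:]
print(products.idxmax(axis=1).tolist())  # ['product2sold', 'product3sold', 'product1sold']
print(products.idxmin(axis=1).tolist())  # ['product1sold', 'product1sold', 'product3sold']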

Pandas dataframe: merge rows into 1 row and sum a column

I have a pandas dataframe that contains a user id and the ad clicks (if any) by this user across several days:
df = pd.DataFrame([['A',0], ['A',1], ['A',0], ['B',0], ['B',0], ['B',0], ['B',1], ['B',1], ['B',1]], columns=['user_id', 'click_count'])
Out[8]:
  user_id  click_count
0       A            0
1       A            1
2       A            0
3       B            0
4       B            0
5       B            0
6       B            1
7       B            1
8       B            1
I would like to convert this dataframe into a dataframe with 1 row per user, where 'click_cnt' is the sum of click_count across all rows for each user in the original dataframe, i.e.
Out[18]:
  user_id  click_cnt
0       A          1
1       B          3
What you're after is the function groupby:
df = df.groupby('user_id', as_index=False).sum()
Adding the flag as_index=False will add the keys as a separate column instead of using them for the new index.
groupby is super useful - have a read through the documentation for more info.
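If you also want the result column named click_cnt as in the desired output, a small sketch using named aggregation (available in pandas 0.25+):
import pandas as pd
df = pd.DataFrame([['A',0], ['A',1], ['A',0], ['B',0], ['B',0], ['B',0], ['B',1], ['B',1], ['B',1]],
                  columns=['user_id', 'click_count'])
# sum per user and rename the aggregated column in one step
out = df.groupby('user_id', as_index=False).agg(click_cnt=('click_count', 'sum'))
print(out)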

create pandas pivot table with a long multiindex

I have a dataframe df with the shape (4573,64) that I'm trying to pivot. The last column is an 'id' with two possible string values 'old' and 'new'. I would like to set the first 63 columns as index and then have the 'id' column across the top with values being the count of 'old' or 'new' for each index row.
I've created a list object named cols out of the column labels that I want as the index.
I tried the following:
df.pivot(index=cols, columns='id')['id']
this gives an error: 'all arrays must be same length'
I also tried the following to see if I could get a sum, but no luck either:
pd.pivot_table(df,index=cols,values=['id'],aggfunc=np.sum)
Any ideas greatly appreciated.
I found a thread online discussing a possible bug in pandas 0.23.0 where pandas.pivot_table() will not accept a multiindex as long as it contains NaNs (link to GitHub in the comments). My workaround was to do
df.fillna('empty', inplace=True)
then the solution below:
df1 = pd.pivot_table(df, index=cols, columns='id', aggfunc='size', fill_value=0)
as proposed by jezrael will work as intended, hence that answer is accepted.
I believe you need to convert the column names to a list and then aggregate by size with unstack:
df = pd.DataFrame({'B':[4,4,4,5,5,4],
                   'C':[1,1,9,4,2,3],
                   'D':[1,1,5,7,1,0],
                   'E':[0,0,6,9,2,4],
                   'id':list('aaabbb')})
print (df)
   B  C  D  E id
0  4  1  1  0  a
1  4  1  1  0  a
2  4  9  5  6  a
3  5  4  7  9  b
4  5  2  1  2  b
5  4  3  0  4  b
cols = df.columns.tolist()  # includes 'id', which becomes the last index level and is unstacked
df1 = df.groupby(cols)['id'].size().unstack(fill_value=0)
print (df1)
id       a  b
B C D E
4 1 1 0  2  0
  3 0 4  0  1
  9 5 6  1  0
5 2 1 2  0  1
  4 7 9  0  1
Solution with pivot_table (note that id must appear only once, so it is excluded from the index list here):
df1 = pd.pivot_table(df, index=cols[:-1], columns='id', aggfunc='size', fill_value=0)
print (df1)
id       a  b
B C D E
4 1 1 0  2  0
  3 0 4  0  1
  9 5 6  1  0
5 2 1 2  0  1
  4 7 9  0  1
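If a flat frame is preferred afterwards, the multiindex can be moved back into regular columns; a small sketch continuing from df1 above:
df1 = df1.reset_index()
df1.columns.name = None  # drop the leftover 'id' label from the columns
print(df1)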

copy row values based on result from isin() in pandas

I have 2 dataframes:
df_a = pd.DataFrame({'A':[1,2,3,4],'B':[4,5,6,7],'ID':['a','b','c','d']})
df_a
   A  B ID
0  1  4  a
1  2  5  b
2  3  6  c
3  4  7  d
df_b = pd.DataFrame({'A':[1,2,3],'ID':['b','a','c']})
df_b['CopyB'] = ""
   A ID CopyB
0  1  b
1  2  a
2  3  c
Now I want to match the ID columns in both dataframes and, upon a successful match, copy the respective value of B from df_a to df_b['CopyB']. I tried df_b.loc[df_b['ID'].isin(df_a['ID']), 'CopyB'] = df_a['B']
but that is not correct. Then I tried comparing the IDs using '==' but got an error since the lengths of the ID series are not equal. Any help? Sorry if it is a very trivial query.
Use join:
df_b.join(df_a.set_index('ID').B, on='ID')
   A ID  B
0  1  b  5
1  2  a  4
2  3  c  6
join works on indices, so I set the index of df_a to the ID column and access the B column, so that df_a.set_index('ID').B is a series whose values are the column I want to add and whose index is the merge column. Then I use join. I wouldn't have to specify an on parameter if ID were the index of df_b, but it isn't, so I do.
You could use merge:
In [247]: df_b.merge(df_a, on=['ID'])
Out[247]:
   A_x ID  A_y  B
0    1  b    2  5
1    2  a    1  4
2    3  c    3  6
In [248]: df_b.merge(df_a, on=['ID'])[['A_x', 'ID', 'B']].rename(columns={'A_x': 'A'})
Out[248]:
   A ID  B
0  1  b  5
1  2  a  4
2  3  c  6
If you set the ID column as a common index for both dataframes, you can easily add the column:
df_a = df_a.set_index('ID')
df_b = df_b.set_index('ID')
df_b['CopyB'] = df_a['B']
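Another option that avoids re-indexing df_b, a minimal sketch using Series.map to look up each B value by ID:
import pandas as pd
df_a = pd.DataFrame({'A':[1,2,3,4],'B':[4,5,6,7],'ID':['a','b','c','d']})
df_b = pd.DataFrame({'A':[1,2,3],'ID':['b','a','c']})
# map each ID in df_b to the matching B value from df_a
df_b['CopyB'] = df_b['ID'].map(df_a.set_index('ID')['B'])
print(df_b)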
