I have two dataframes with the same form:
> df1
Day ItemId Quantity
1 1 2
1 2 3
1 4 5
> df2
Day ItemId Quantity
1 1 0
1 2 0
1 3 0
1 4 0
I'd like to merge df1 and df2, and if a ['Day', 'ItemId'] pair exists in both df1 and df2, take the row from df1, which has the max Quantity.
I tried this command:
df = pd.concat([df1, df2]).groupby(level=0).max(df1['Quantity'],df2['Quantity'])
Use groupby on both columns and aggregate max:
df = pd.concat([df1, df2]).groupby(['Day','ItemId'], as_index=False)['Quantity'].max()
print (df)
Day ItemId Quantity
0 1 1 2
1 1 2 3
2 1 3 0
3 1 4 5
If there are possibly multiple columns, sort and drop duplicates instead:
df = (pd.concat([df1, df2])
.sort_values(['Day','ItemId','Quantity'], ascending=[True, True, False])
.drop_duplicates(['Day','ItemId']))
print (df)
Day ItemId Quantity
0 1 1 2
1 1 2 3
2 1 3 0
2 1 4 5
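For reference, a minimal sketch (the frame construction is my own, not part of the question) that reproduces the inputs and runs the groupby solution end to end:
import pandas as pd

df1 = pd.DataFrame({'Day': [1, 1, 1], 'ItemId': [1, 2, 4], 'Quantity': [2, 3, 5]})
df2 = pd.DataFrame({'Day': [1, 1, 1, 1], 'ItemId': [1, 2, 3, 4], 'Quantity': [0, 0, 0, 0]})

# keep the maximum Quantity per (Day, ItemId) pair
df = pd.concat([df1, df2]).groupby(['Day', 'ItemId'], as_index=False)['Quantity'].max()
print(df)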
Say I have two DataFrames that look like the following:
df1:
movieID 1 2 3 4
userID
0 2 0 0 2
1 1 1 4 0
2 0 2 3 0
3 1 2 0 0
and
df2:
userID movieID
0 0 2
1 0 3
2 0 4
3 1 3
What I am trying to accomplish is joining the two so that df2 contains a new column with the associated rating of a user for a specific movie. Thus df2 in this example would become:
df2:
userID movieID rating
0 0 2 0
1 0 3 0
2 0 4 2
3 1 3 4
I don't believe that simply reformatting df2 to have the same shape as df1 would work, because there is no guarantee that it will have all userIDs or movieIDs. I've looked into the merge function, but I'm confused about how to set the how and on parameters in this scenario. If anyone can explain how I could achieve this, it would be greatly appreciated.
You can apply() row-wise and index df1.loc[row.userID, row.movieID].
Just make sure the dtype of df1.columns matches df2.movieID, and the dtype of df1.index matches df2.userID.
df1.columns = df1.columns.astype(df2.movieID.dtype)
df1.index = df1.index.astype(df2.userID.dtype)
df2['rating'] = df2.apply(lambda row: df1.loc[row.userID, row.movieID], axis=1)
# userID movieID rating
# 0 0 2 0
# 1 0 3 0
# 2 0 4 2
# 3 1 3 4
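An alternative sketch (my addition, not from the answer above): reshape df1 into long form with stack and merge on the two key columns; this assumes df1.index is named userID and df1.columns is named movieID, as in the display above.
import pandas as pd

df1 = pd.DataFrame([[2, 0, 0, 2], [1, 1, 4, 0], [0, 2, 3, 0], [1, 2, 0, 0]],
                   index=pd.Index(range(4), name='userID'),
                   columns=pd.Index(range(1, 5), name='movieID'))
df2 = pd.DataFrame({'userID': [0, 0, 0, 1], 'movieID': [2, 3, 4, 3]})

# long-form lookup table: one row per (userID, movieID) with its rating
ratings = df1.stack().rename('rating').reset_index()

# left merge keeps every row of df2 and attaches the matching rating
df2 = df2.merge(ratings, on=['userID', 'movieID'], how='left')
print(df2)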
I want to join two pandas dataframes on "ColA", but the values in "ColA" are not in the same order in the two dataframes, and the dataframes are not the same length. I want to join them so that the values in "ColA" match up and the missing values are replaced with 0.
df1 = pd.DataFrame({"ColA":["num 1", "num 2", "num 3"],
"ColB":[5,6,7]})
print(df1)
df2 = pd.DataFrame({"ColA":["num 2", "num 3","num 1", "num 4"],
"ColC":[3,2,1,5]})
print(df2)
ColA ColB
0 num 1 5
1 num 2 6
2 num 3 7
ColA ColC
0 num 2 3
1 num 3 2
2 num 1 1
3 num 4 5
Result should look like this:
# num 1 is matched with the appropriate values and num 4 gets the value 0 for "ColB"
ColA ColB ColC
0 num 1 5 1
1 num 2 6 3
2 num 3 7 2
3 num 4 0 5
Use DataFrame.merge with an outer join, convert NaNs to 0, and finally, if necessary, convert the dtypes back to the originals via a dictionary:
# note: Series.append was removed in pandas 2.0; on newer versions use
# d = pd.concat([df1.dtypes, df2.dtypes]).to_dict()
d = df1.dtypes.append(df2.dtypes).to_dict()
df = df1.merge(df2, how='outer', on='ColA').fillna(0).astype(d)
print (df)
ColA ColB ColC
0 num 1 5 1
1 num 2 6 3
2 num 3 7 2
3 num 4 0 5
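A side note on the astype(d) step (my addition, not from the answer): an outer join introduces NaN for the unmatched rows, which promotes the integer columns to float64, and mapping back to the original dtypes undoes that. A small sketch, assuming a current pandas where Series.append is no longer available:
import pandas as pd

df1 = pd.DataFrame({"ColA": ["num 1", "num 2", "num 3"], "ColB": [5, 6, 7]})
df2 = pd.DataFrame({"ColA": ["num 2", "num 3", "num 1", "num 4"], "ColC": [3, 2, 1, 5]})

# after the outer merge plus fillna, ColB and ColC are float64
merged = df1.merge(df2, how='outer', on='ColA').fillna(0)
print(merged.dtypes)

# restore the original integer dtypes from the combined dtype mapping
d = pd.concat([df1.dtypes, df2.dtypes]).to_dict()
print(merged.astype(d).dtypes)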
Or use concat and convert all columns to integers (if possible):
df = (pd.concat([df1.set_index('ColA'),
df2.set_index('ColA')], axis=1, sort=True)
.fillna(0)
.astype(int)
.rename_axis('ColA')
.reset_index())
print (df)
ColA ColB ColC
0 num 1 5 1
1 num 2 6 3
2 num 3 7 2
3 num 4 0 5
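Another option (my suggestion, not part of the original answer) is combine_first, which also aligns the two frames on ColA; a minimal sketch:
import pandas as pd

df1 = pd.DataFrame({"ColA": ["num 1", "num 2", "num 3"], "ColB": [5, 6, 7]})
df2 = pd.DataFrame({"ColA": ["num 2", "num 3", "num 1", "num 4"], "ColC": [3, 2, 1, 5]})

# combine_first keeps df1's values and fills the gaps from df2
df = (df1.set_index('ColA')
         .combine_first(df2.set_index('ColA'))
         .fillna(0)
         .astype(int)
         .reset_index())
print(df)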
I have the following dataframe:
product Week_Number Sales
1 1 10
2 1 15
1 2 20
And I would like to group by product and week number and create a column containing the next week's sales for that product:
product Week_Number Sales next_week
1 1 10 20
2 1 15 0
1 2 20 0
Use DataFrame.sort_values with DataFrameGroupBy.shift:
# if the data is not already sorted by both columns
df = df.sort_values(['product','Week_Number'])
# pandas 0.24+
df['next_week'] = df.groupby('product')['Sales'].shift(-1, fill_value=0)
# pandas below 0.24
#df['next_week'] = df.groupby('product')['Sales'].shift(-1).fillna(0, downcast='int')
print (df)
product Week_Number Sales next_week
0 1 1 10 20
1 2 1 15 0
2 1 2 20 0
If duplicates are possible and you need to aggregate by sum first in the real data:
df = df.groupby(['product','Week_Number'], as_index=False)['Sales'].sum()
df['next_week'] = df.groupby('product')['Sales'].shift(-1).fillna(0, downcast='int')
print (df)
product Week_Number Sales next_week
0 1 1 10 20
1 1 2 20 0
2 2 1 15 0
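For completeness, a minimal sketch (the frame construction is my own) that reproduces the sample data and runs the first variant end to end:
import pandas as pd

df = pd.DataFrame({'product': [1, 2, 1],
                   'Week_Number': [1, 1, 2],
                   'Sales': [10, 15, 20]})

# sort so that shift(-1) looks at the following week within each product
df = df.sort_values(['product', 'Week_Number'])
df['next_week'] = df.groupby('product')['Sales'].shift(-1, fill_value=0)
print(df)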
First sort the data.
Then apply shift using transform:
df = pd.DataFrame(data={'product':[1,2,1],
'week_number':[1,1,2],
'sales':[10,15,20]})
df.sort_values(['product','week_number'],inplace=True)
df['next_week'] = df.groupby(['product'])['sales'].transform(pd.Series.shift,-1,fill_value=0)
print(df)
product week_number sales next_week
0 1 1 10 20
2 1 2 20 0
1 2 1 15 0
My goal here is to concat() alternating groups from two dataframes.
desired result :
group ordercode quantity
0 A 1
B 1
C 1
D 1
0 A 1
B 3
1 A 1
B 2
C 1
1 A 1
B 1
C 2
My dataframe:
import pandas as pd
df1=pd.DataFrame([[0,"A",1],[0,"B",1],[0,"C",1],[0,"D",1],[1,"A",1],[1,"B",2],[1,"C",1]],columns=["group","ordercode","quantity"])
df2=pd.DataFrame([[0,"A",1],[0,"B",3],[1,"A",1],[1,"B",1],[1,"C",2]],columns=["group","ordercode","quantity"])
print(df1)
print(df2)
I have used dfff = pd.concat([df1, df2]).sort_index(kind="merge")
but I got the result below:
group ordercode quantity
0 0 A 1
0 0 A 1
1 B 1
1 B 3
2 C 1
3 D 1
4 1 A 1
4 1 A 1
5 B 2
5 B 1
6 C 1
6 C 2
You can see here that the concatenation interleaves individual rows, not whole groups.
It should print in this order:
group 0 of df1
group 0 of df2
group 1 of df1
group 1 of df2, and so on.
Note:
I have created these DataFrames using the groupby() function:
# .as_matrix() was removed in newer pandas; .to_numpy() is the equivalent
df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], 1).to_numpy()),
columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df)//3, 1)) * 4)[0:len(df)]
df=df.groupby(['group', 'ordercode']).sum()
Question:
Where did I go wrong? It sorts by the index instead of by group.
I have used .set_index("group"), but it didn't work either.
Use cumcount to create a helper column, then sort with sort_values:
df1['g'] = df1.groupby('ordercode').cumcount()
df2['g'] = df2.groupby('ordercode').cumcount()
dfff = pd.concat([df1,df2]).sort_values(['group','g']).reset_index(drop=True)
print (dfff)
group ordercode quantity g
0 0 A 1 0
1 0 B 1 0
2 0 C 1 0
3 0 D 1 0
4 0 A 1 0
5 0 B 3 0
6 1 C 2 0
7 1 A 1 1
8 1 B 2 1
9 1 C 1 1
10 1 A 1 1
11 1 B 1 1
and finally remove the helper column:
dfff = dfff.drop('g', axis=1)
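An alternative sketch (my suggestion, not from the answer above): tag each frame with a source key via concat(keys=...), then sort by group and that key. Because the multi-key sort is stable, rows inside each block keep their original order, so df1's group comes before df2's group:
import pandas as pd

df1 = pd.DataFrame([[0, "A", 1], [0, "B", 1], [0, "C", 1], [0, "D", 1],
                    [1, "A", 1], [1, "B", 2], [1, "C", 1]],
                   columns=["group", "ordercode", "quantity"])
df2 = pd.DataFrame([[0, "A", 1], [0, "B", 3],
                    [1, "A", 1], [1, "B", 1], [1, "C", 2]],
                   columns=["group", "ordercode", "quantity"])

# 'src' marks which frame a row came from (0 = df1, 1 = df2)
dfff = (pd.concat([df1, df2], keys=[0, 1], names=['src'])
          .sort_values(['group', 'src'])
          .reset_index(drop=True))
print(dfff)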
I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
will be reduced to only the rows that don't match up when flipped. For instance, 1 and 3 is a combination I only want returned once, so if the same pair exists with the columns flipped (3 and 1), it can be removed. The table I'm looking to get is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Where there is only one occurrence of each pair of values that are mirrored if the columns are flipped.
I think you can use apply with sorted + drop_duplicates (note: on newer pandas versions this apply returns a Series of lists rather than a DataFrame, so the numpy solution below is preferable there):
df = df.apply(sorted, axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Faster solution with numpy.sort:
df = (pd.DataFrame(np.sort(df.values, axis=1), index=df.index, columns=df.columns)
        .drop_duplicates())
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Solution without sorting, using DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a,B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
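A possible follow-up sketch (my addition, not part of the answer): assign the result of drop_duplicates and remove the helper column afterwards; the frame construction below is just a compact stand-in for the data-loading step above.
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3, 0, 2, 3, 0, 1, 3, 0, 1, 2],
                   "B": [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3]})

# helper column holding the unordered pair (min, max)
df["trans"] = df.apply(
    lambda row: (min(row["A"], row["B"]), max(row["A"], row["B"])), axis=1
)

# drop_duplicates returns a new frame, so assign it and discard the helper column
result = df.drop_duplicates("trans").drop(columns="trans")
print(result)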