Groupby, Shift and Sum - python

I have the following dataframe:
product  Week_Number  Sales
1        1            10
2        1            15
1        2            20
And I would like to group by product and week number and create a column with that product's sales for the next week:
product  Week_Number  Sales  next_week
1        1            10     20
2        1            15     0
1        2            20     0

Use DataFrame.sort_values with DataFrameGroupBy.shift:
# if not sure whether the data is already sorted by both columns
df = df.sort_values(['product','Week_Number'])
# pandas 0.24+ supports fill_value in shift
df['next_week'] = df.groupby('product')['Sales'].shift(-1, fill_value=0)
# pandas older than 0.24
#df['next_week'] = df.groupby('product')['Sales'].shift(-1).fillna(0, downcast='int')
print(df)
   product  Week_Number  Sales  next_week
0        1            1     10         20
1        2            1     15          0
2        1            2     20          0
If duplicates are possible in the real data and you need to aggregate with sum first:
df = df.groupby(['product','Week_Number'], as_index=False)['Sales'].sum()
df['next_week'] = df.groupby('product')['Sales'].shift(-1).fillna(0, downcast='int')
print(df)
   product  Week_Number  Sales  next_week
0        1            1     10         20
1        1            2     20          0
2        2            1     15          0
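For reference, a minimal self-contained sketch of the first approach (the DataFrame construction is an assumed reconstruction of the sample data):
import pandas as pd

df = pd.DataFrame({'product': [1, 2, 1],
                   'Week_Number': [1, 1, 2],
                   'Sales': [10, 15, 20]})

# sort so the row below each (product, week) really is the next week
df = df.sort_values(['product', 'Week_Number'])
# shift(-1) pulls the following row's Sales up within each product;
# fill_value=0 fills the last week of each product and keeps the int dtype
df['next_week'] = df.groupby('product')['Sales'].shift(-1, fill_value=0)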

First sort the data, then apply the shift using transform:
import pandas as pd

df = pd.DataFrame(data={'product': [1, 2, 1],
                        'week_number': [1, 1, 2],
                        'sales': [10, 15, 20]})
df.sort_values(['product', 'week_number'], inplace=True)
df['next_week'] = df.groupby(['product'])['sales'].transform(pd.Series.shift, -1, fill_value=0)
print(df)
   product  week_number  sales  next_week
0        1            1     10         20
2        1            2     20          0
1        2            1     15          0
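Note that transform(pd.Series.shift, -1, fill_value=0) is equivalent here to calling .shift(-1, fill_value=0) on the groupby directly, as in the previous answer; transform simply forwards the extra arguments to the function it is given.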

Related

Group values by columns axis=1 dynamically

I have a matrix df with 70 columns.
id  day_1  day_2  day_3  day_4  ...  day_69  day_70
1   1      2      4      1           1       1
2   0      0      0      0           0       0
3   0      3      0      0           0       0
4   3      2      1      0           0       3
I would like to aggregate the columns dynamically by groups of [2, 7, 10, etc.] days, i.e. [bi-daily, weekly, ten-daily, etc.].
E.g. one of the results for aggregation (sum) by 2 days would be a dataframe with 35 columns, see below:
id  bi_daily_1  bi_daily_2  ...  bi_daily_35
1   3           5                2
2   0           0                0
3   3           0                0
4   5           1                3
where:
bi_daily_1 = aggregation(day_1, day_2)
bi_daily_2 = aggregation(day_3, day_4) and so on...
Note: the real matrix shape is approx. (2000, 1500).
Use floor division based on the number of days to determine the groups (df.shape[1] is the number of columns in the dataframe), then use groupby on these groups, specifying axis=1 (columns). Then just rename the columns.
days = 2
result = df.groupby([x // days for x in range(df.shape[1])], axis=1).sum()
result.columns = [f'bi_daily_{n + 1}' for n in result.columns]
>>> result
    bi_daily_1  bi_daily_2
id
1            3           5
2            0           0
3            3           0
4            5           1
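As an aside, groupby(..., axis=1) is deprecated in recent pandas (2.x). A sketch of an equivalent approach under that assumption, grouping the transposed frame instead (the sample data is reconstructed here with only the first four day columns):
import pandas as pd

df = pd.DataFrame({'day_1': [1, 0, 0, 3],
                   'day_2': [2, 0, 3, 2],
                   'day_3': [4, 0, 0, 1],
                   'day_4': [1, 0, 0, 0]},
                  index=pd.Index([1, 2, 3, 4], name='id'))

days = 2
groups = [i // days for i in range(df.shape[1])]
# group the transposed frame row-wise, then transpose back
result = df.T.groupby(groups).sum().T
result.columns = [f'bi_daily_{n + 1}' for n in result.columns]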
This could work, using a list comprehension: split the dataframe into pairs of consecutive columns with iloc slicing, sum each pair row-wise, then concat the results into a new dataframe.
   day_1  day_2  day_3  day_4
0      1      2      4      1
1      0      0      0      0
2      0      3      0      0
3      3      2      1      0
(pd.concat([df.iloc[:, [i, i + 1]].sum(axis=1)
            for i in range(0, df.shape[1], 2)],
           axis=1)
   .add_prefix('bi_daily_'))
   bi_daily_0  bi_daily_1
0           3           5
1           0           0
2           3           0
3           5           1
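The same idea generalizes to any chunk width. A sketch, assuming the df from the question and a days variable for the chunk size:
import pandas as pd

days = 2  # chunk width: 2 = bi-daily, 7 = weekly, 10 = ten-daily, ...
# slice `days` consecutive columns at a time, sum each slice row-wise,
# then stitch the per-chunk sums back together
chunks = [df.iloc[:, i:i + days].sum(axis=1)
          for i in range(0, df.shape[1], days)]
result = pd.concat(chunks, axis=1).add_prefix('bi_daily_')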

Conditional cumsum in pandas [duplicate]

This question already has an answer here: How can I use cumsum within a group in Pandas?
I have the following dataframe in pandas:
code  rank  quant  sales
123   1     0      2
123   1     12     2
123   1     0      2
123   2     0      1
123   2     10     1
I want a conditional cumulative sum of sales grouped by rank: where quant is not zero, add it to the cumulative sum on the same row.
code  rank  quant  sales  cumsum
123   1     0      2      2
123   1     12     2      16
123   1     0      2      18
123   2     0      1      1
123   2     10     1      12
How can I do this in pandas?
Add the columns first, then use GroupBy.cumsum, grouping by the df['rank'] Series:
df['cumsum'] = df['quant'].add(df['sales']).groupby(df['rank']).cumsum()
Or sum both columns row-wise first:
df['cumsum'] = df[['quant', 'sales']].sum(axis=1).groupby(df['rank']).cumsum()
An alternative is to create the new column before the groupby:
df['cumsum'] = (df.assign(cumsum=df['quant'].add(df['sales']))
                  .groupby('rank')['cumsum'].cumsum())
print(df)
   code  rank  quant  sales  cumsum
0   123     1      0      2       2
1   123     1     12      2      16
2   123     1      0      2      18
3   123     2      0      1       1
4   123     2     10      1      12
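For reference, a self-contained sketch of the first variant, with the sample data reconstructed from the question:
import pandas as pd

df = pd.DataFrame({'code': [123, 123, 123, 123, 123],
                   'rank': [1, 1, 1, 2, 2],
                   'quant': [0, 12, 0, 0, 10],
                   'sales': [2, 2, 2, 1, 1]})

# per-row quant + sales, then a running total within each rank group
df['cumsum'] = df['quant'].add(df['sales']).groupby(df['rank']).cumsum()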

Merge two data frames taking the max of two columns

I have two dataframes with the same form:
> df1
Day  ItemId  Quantity
1    1       2
1    2       3
1    4       5
> df2
Day  ItemId  Quantity
1    1       0
1    2       0
1    3       0
1    4       0
I'd like to merge df1 and df2, and if a row with the same ['Day', 'ItemId'] exists in both df1 and df2, take the one with the max Quantity.
I tried this command:
df = pd.concat([df1, df2]).groupby(level=0).max(df1['Quantity'],df2['Quantity'])
Group by both columns (passed as a list) and aggregate with max:
df = pd.concat([df1, df2]).groupby(['Day','ItemId'], as_index=False)['Quantity'].max()
print(df)
   Day  ItemId  Quantity
0    1       1         2
1    1       2         3
2    1       3         0
3    1       4         5
If there are possibly multiple columns to keep:
df = (pd.concat([df1, df2])
        .sort_values(['Day','ItemId','Quantity'], ascending=[True, True, False])
        .drop_duplicates(['Day','ItemId']))
print(df)
   Day  ItemId  Quantity
0    1       1         2
1    1       2         3
2    1       3         0
2    1       4         5
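For completeness, a runnable sketch of the first approach, with the two sample frames reconstructed from the question:
import pandas as pd

df1 = pd.DataFrame({'Day': [1, 1, 1],
                    'ItemId': [1, 2, 4],
                    'Quantity': [2, 3, 5]})
df2 = pd.DataFrame({'Day': [1, 1, 1, 1],
                    'ItemId': [1, 2, 3, 4],
                    'Quantity': [0, 0, 0, 0]})

# stacking both frames and taking the group max keeps the larger
# Quantity for any (Day, ItemId) pair present in both
df = pd.concat([df1, df2]).groupby(['Day', 'ItemId'], as_index=False)['Quantity'].max()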

Grouping customer orders by date, category and customer with one-hot-encoding result

I have a dataframe containing orders of customers from different categories (A-F). A one indicates a purchase from this category, whereas a zero indicates none. Now I would like to indicate, with 1/0 encoding, whether a purchase in each respective category was made on a per-day and per-customer basis.
YEAR  MONTH  DAY  A  B  C  D  E  F  Customer
2007  1      1    1  0  0  0  0  0  5000
2007  1      1    1  0  0  0  0  0  5000
2007  1      1    0  1  0  0  0  0  5000
2007  1      2    0  1  0  0  0  0  5000
2007  1      2    0  0  1  0  0  0  5000
The output should look something like this:
YEAR  MONTH  DAY  A  B  C  D  E  F  Customer
2007  1      1    1  1  0  0  0  0  5000
I've been trying to work this out using pandas' built-in groupby, however I can't get the right result. Does anyone know how to solve this?
Thank you very much!
I think you need groupby and aggregate max:
cols = ['YEAR','MONTH','DAY','Customer']
df = df.groupby(cols, as_index=False).max()
print(df)
   YEAR  MONTH  DAY  Customer  A  B  C  D  E  F
0  2007      1    1      5000  1  1  0  0  0  0
1  2007      1    2      5000  0  1  1  0  0  0
And if you need the same column order, add DataFrame.reindex_axis:
cols = ['YEAR','MONTH','DAY','Customer']
df = df.groupby(cols, as_index=False).max().reindex_axis(df.columns, axis=1)
print(df)
   YEAR  MONTH  DAY  A  B  C  D  E  F  Customer
0  2007      1    1  1  1  0  0  0  0      5000
1  2007      1    2  0  1  1  0  0  0      5000
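Note that reindex_axis was removed in later pandas releases; assuming a modern version, the same column reordering can be done with DataFrame.reindex:
cols = ['YEAR','MONTH','DAY','Customer']
# reindex(columns=...) restores the original column order
df = df.groupby(cols, as_index=False).max().reindex(columns=df.columns)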

Pandas dataframe - create new column based on simple calculation

I want to make a calculation based on 4 columns in a dataframe and apply the result to a new column.
The 4 columns I'm interested in are as follows.
   rating_1  time_1  rating_2  time_2  col_x  col_y  etc
0         1       1         1       1      1      1
If time_1 is greater than time_2 I want rating_1 in the new column; if time_2 is greater I want rating_2.
What's the simplest way to do this please?
You can use the numpy.where() method:
In [241]: x
Out[241]:
   rating_1  time_1  rating_2  time_2  col_x  col_y
0        11       1        21       1      1      1
1        12       2        21       1      1      1
2        13       1        21       5      1      1
3        14       5        21       5      1      1

In [242]: x['new'] = np.where(x.time_1 > x.time_2, x.rating_1, x.rating_2)

In [243]: x
Out[243]:
   rating_1  time_1  rating_2  time_2  col_x  col_y  new
0        11       1        21       1      1      1   21
1        12       2        21       1      1      1   12
2        13       1        21       5      1      1   21
3        14       5        21       5      1      1   21
Alternatively, define a row-wise function and use apply:
def myfunc(row):
    if row.time_1 >= row.time_2:
        return row.rating_1
    else:
        return row.rating_2

df.loc[:, 'calculatedColumn'] = df.apply(myfunc, axis=1)
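As a side note, the np.where version is vectorized and will generally be much faster than a row-wise apply on large frames. The two answers also differ on ties: myfunc uses >=, so it returns rating_1 when the times are equal, while the strict > in the np.where condition returns rating_2 in that case.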
