Sum two dataframes based on row and column - python

Given two DataFrames df_1
Code | Jan | Feb | Mar
a | 1 | 2 | 1
b | 3 | 4 | 3
and df_2
Code | Jan | Feb | Mar
a | 1 | 1 | 2
c | 7 | 0 | 0
I would like to sum these two tables based on the row and column. So my result dataframe should look like this:
Code | Jan | Feb | Mar
a | 2 | 3 | 3
b | 3 | 4 | 3
c | 7 | 0 | 0
Is there an easy way to do this? I can do this using a lot of for loops and if statements, but this is very slow for large datasets.

Use concat and aggregate sum:
df = pd.concat([df_1, df_2]).groupby('Code', as_index=False).sum()
print (df)
Code Jan Feb Mar
0 a 2 3 3
1 b 3 4 3
2 c 7 0 0
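For a quick check, here is a runnable sketch of the concat/groupby approach, using the sample frames from the question:

```python
import pandas as pd

df_1 = pd.DataFrame({"Code": ["a", "b"], "Jan": [1, 3], "Feb": [2, 4], "Mar": [1, 3]})
df_2 = pd.DataFrame({"Code": ["a", "c"], "Jan": [1, 7], "Feb": [1, 0], "Mar": [2, 0]})

# Stack the two frames on top of each other, then sum rows sharing the same Code
df = pd.concat([df_1, df_2]).groupby("Code", as_index=False).sum()
print(df)
```

An alternative with the same result is `df_1.set_index('Code').add(df_2.set_index('Code'), fill_value=0)`, though `add` with `fill_value` promotes the filled columns to float.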

Related

Pandas Dataframe keep rows where values of 2 columns are in a list of couples

I have a list of couples:
year_month = [(2020,8), (2021,1), (2021,6)]
and a dataframe df
| ID | Year | Month |
| 1 | 2020 | 1 |
| ... |
| 1 | 2020 | 12 |
| 1 | 2021 | 1 |
| ... |
| 1 | 2021 | 12 |
| 2 | 2020 | 1 |
| ... |
| 2 | 2020 | 12 |
| 2 | 2021 | 1 |
| ... |
| 2 | 2021 | 12 |
| 3 | 2021 | 1 |
| ... |
I want to select the rows where Year and Month correspond to one of the couples in the year_month list:
Output df :
| ID | Year | Month |
| 1 | 2020 | 8 |
| 1 | 2021 | 1 |
| 1 | 2021 | 6 |
| 2 | 2020 | 8 |
| 2 | 2021 | 1 |
| 2 | 2021 | 6 |
| 3 | 2020 | 8 |
| ... |
Any idea on how to automate it, so I only have to change the year_month couples?
I want to put many couples in year_month, so I want to keep a list of couples rather than listing all the possibilities in df.
I don't want to do something like this:
df = df[((df['Year'] == 2020) & (df['Month'] == 8)) |
        ((df['Year'] == 2021) & (df['Month'] == 1)) |
        ((df['Year'] == 2021) & (df['Month'] == 6))]
You can use a list comprehension and filter your dataframe with your list of tuples as below:
year_month = [(2020,8), (2021,1), (2021,6)]
df[[i in year_month for i in zip(df.Year,df.Month)]]
Which gives only the paired values back:
ID Year Month
2 1 2021 1
6 2 2021 1
8 3 2021 1
One way using pandas.DataFrame.merge:
df.merge(pd.DataFrame(year_month, columns=["Year", "Month"]))
Output:
ID Year Month
0 1 2021 1
1 2 2021 1
2 3 2021 1
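Another option, not from either answer above, is to build a MultiIndex from the two columns and test membership with isin, which keeps the whole pair test vectorised (this is my own sketch; `MultiIndex.from_frame` needs pandas >= 0.24):

```python
import pandas as pd

df = pd.DataFrame({
    "ID":    [1, 1, 1, 2, 2],
    "Year":  [2020, 2020, 2021, 2020, 2021],
    "Month": [1, 8, 1, 8, 6],
})
year_month = [(2020, 8), (2021, 1), (2021, 6)]

# Membership test on (Year, Month) tuples, done in one vectorised call
mask = pd.MultiIndex.from_frame(df[["Year", "Month"]]).isin(year_month)
out = df[mask]
```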

pull row with max date from groupby in python pandas

I'm trying to pull the max date from a df in the below format
| index1 | index2 | col1 |
| place1 | 2018   | 5    |
|        | 2019   | 4    |
|        | 2020   | 2    |
| place2 | 2016   | 9    |
|        | 2017   | 8    |
| place3 | 2018   | 6    |
|        | 2019   | 1    |
I'm trying to pull out the row with the maximum year available for each place. In the above example the final df would be:
place1 | 2020 | 2
place2 | 2017 | 8
place3 | 2019 | 1
You can use df.sort_values(...).groupby(...).last() to take the row with the maximum value in each group. In your case:
df.sort_values("index2").groupby("index1").last()
I am a newbie in Python, but maybe this can help:
import pandas as pd

data = [['place1', '2018', '5'],
        ['place1', '2019', '4'],
        ['place1', '2020', '2'],
        ['place2', '2016', '9'],
        ['place2', '2017', '8'],
        ['place3', '2018', '6'],
        ['place3', '2019', '1']]
df = pd.DataFrame(data, columns=['index1', 'index2', 'col1'])
df.set_index(['index1', 'index2'], inplace=True)
df.reset_index(level=1, inplace=True)
df = (df.sort_values(['index1', 'index2'], ascending=False)
        .groupby('index1').first())
df.set_index('index2', append=True, inplace=True)
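If the years are numeric, another sketch (my own, not from the answers above) keeps whole rows via groupby + idxmax:

```python
import pandas as pd

df = pd.DataFrame({
    "index1": ["place1", "place1", "place1", "place2", "place2", "place3", "place3"],
    "index2": [2018, 2019, 2020, 2016, 2017, 2018, 2019],
    "col1":   [5, 4, 2, 9, 8, 6, 1],
})

# For each place, find the row label holding the maximum year, then select those rows
latest = df.loc[df.groupby("index1")["index2"].idxmax()]
```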

Pandas: Transpose, groupby and summarize columns

I have a pandas DataFrame which looks like this:
| Id | Filter 1 | Filter 2 | Filter 3 |
|----|----------|----------|----------|
| 25 | 0 | 1 | 1 |
| 25 | 1 | 0 | 1 |
| 25 | 0 | 0 | 1 |
| 30 | 1 | 0 | 1 |
| 31 | 1 | 0 | 1 |
| 31 | 0 | 1 | 0 |
| 31 | 0 | 0 | 1 |
I need to transpose this table, add "Name" column with the name of the filter and summarize Filters column values. The result table should be like this:
| Id | Name | Summ |
| 25 | Filter 1 | 1 |
| 25 | Filter 2 | 1 |
| 25 | Filter 3 | 3 |
| 30 | Filter 1 | 1 |
| 30 | Filter 2 | 0 |
| 30 | Filter 3 | 1 |
| 31 | Filter 1 | 1 |
| 31 | Filter 2 | 1 |
| 31 | Filter 3 | 2 |
The only solution I have come up with so far was to use an apply function on the dataframe grouped by the Id column, but this method is too slow for my case: the dataset can have more than 40 columns and 50,000 rows. How can I do this with pandas native methods (e.g. pivot, transpose, groupby)?
Use:
df_new = (df.melt('Id', var_name='Name', value_name='Sum')
            .groupby(['Id', 'Name']).Sum.sum()
            .reset_index())
print(df_new)
Id Name Sum
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2
stack then groupby
df.set_index('Id').stack().groupby(level=[0,1]).sum().reset_index()
Id level_1 0
0 25 Filter 1 1
1 25 Filter 2 1
2 25 Filter 3 3
3 30 Filter 1 1
4 30 Filter 2 0
5 30 Filter 3 1
6 31 Filter 1 1
7 31 Filter 2 1
8 31 Filter 3 2
Short version
df.set_index('Id').sum(level=0).stack()  # or: df.groupby('Id').sum().stack()
Using filter and melt
df.filter(like='Filter').groupby(df.Id).sum().T.reset_index().melt(id_vars='index')
index Id value
0 Filter 1 25 1
1 Filter 2 25 1
2 Filter 3 25 3
3 Filter 1 30 1
4 Filter 2 30 0
5 Filter 3 30 1
6 Filter 1 31 1
7 Filter 2 31 1
8 Filter 3 31 2
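For reference, the melt approach above runs end-to-end on the sample data; a runnable sketch:

```python
import pandas as pd

df = pd.DataFrame({
    "Id":       [25, 25, 25, 30, 31, 31, 31],
    "Filter 1": [0, 1, 0, 1, 1, 0, 0],
    "Filter 2": [1, 0, 0, 0, 0, 1, 0],
    "Filter 3": [1, 1, 1, 1, 1, 0, 1],
})

# Unpivot the Filter columns into rows, then sum per (Id, Name)
out = (df.melt("Id", var_name="Name", value_name="Sum")
         .groupby(["Id", "Name"], as_index=False)["Sum"].sum())
```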

Pandas, create new column based on values from previous rows with certain values

Hi, I'm trying to use ML to predict some future sales, so I would like to add the mean sales from the previous month/year for each product.
My df is something like: [ id | year | month | product_id | sales ]. I would like to add prev_month_mean_sale and prev_month_id_sale columns:
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
I would also like to add prev_year_mean_sale and prev_year_id_sale columns.
prev_month_mean_sale is the mean of the total sales of the previous month, e.g. for month 2 it is (5+4+2)/3.
My actual code is something like:
for index, row in df.iterrows():
    loc = df.index[(df['month'] == row['month'] - 1) &
                   (df['year'] == row['year']) &
                   (df['product_id'] == row['product_id'])].tolist()[0]
    df.loc[index, 'prev_month_id_sale'] = df.loc[loc, 'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid the loop is to use merge() on the dataframe:
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales": "prev_month_id_sale"}),
                  how="left",
                  left_on=["year", "prev_month", "product_id"],
                  right_on=["year", "month", "product_id"])
The result will have more columns than you need; you should drop() some of them and/or rename() others.
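A runnable sketch of that self-merge idea on a small slice of the sample data (the column names are my assumption). Note that month 1 would need its previous month mapped to December of the previous year, which this simple subtraction does not handle:

```python
import pandas as pd

df = pd.DataFrame({
    "year":       [2018, 2018, 2018, 2018],
    "month":      [1, 1, 2, 2],
    "product_id": [123, 234, 123, 345],
    "sales":      [5, 4, 3, 2],
})

# Look up each product's sales one month back via a left self-merge
df["prev_month"] = df["month"] - 1
prev = df.rename(columns={"sales": "prev_month_id_sale"})
merged = df.merge(
    prev[["year", "month", "product_id", "prev_month_id_sale"]],
    how="left",
    left_on=["year", "prev_month", "product_id"],
    right_on=["year", "month", "product_id"],
    suffixes=("", "_prev"),
)
```

Products with no sales in the previous month come back as NaN, which you could fill with 0 afterwards.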

Create a month for every date between a period and make them columns

I want to separate out every month inside the period between the 'start' and 'end' columns; then I know I can use a pivot_table to make them columns:
subscription|values| start | end
x |1 |5/5/2018 |6/5/2018
y |2 |5/5/2018 |8/5/2018
z |1 |5/5/2018 |9/5/2018
a |3 |5/5/2018 |10/5/2018
b |4 |5/5/2018 |11/5/2018
c |2 |5/5/2018 |12/5/2018
Desired Output:
subscription| jan | feb | mar | apr | may | jun | jul | aug | sep | oct | nov | dec
x           |     |     |     |     |  1  |  1  |     |     |     |     |     |
y           |     |     |     |     |  2  |  2  |  2  |  2  |     |     |     |
z           |     |     |     |     |  1  |  1  |  1  |  1  |  1  |     |     |
a           |     |     |     |     |  3  |  3  |  3  |  3  |  3  |  3  |     |
b           |     |     |     |     |  4  |  4  |  4  |  4  |  4  |  4  |  4  |
c           |     |     |     |     |  2  |  2  |  2  |  2  |  2  |  2  |  2  |  2
Using a simple pd.DataFrame.cumsum:
import calendar
df2 = pd.DataFrame(np.zeros(shape=[len(df), 13]),
                   columns=[calendar.month_abbr[s] for s in np.arange(13)])
First, set the start month to values and the end month to -values.
r = np.arange(len(df))
df2.values[r, df.start.dt.month] = df['values']
df2.values[r, df.end.dt.month] = -df['values']
Then cumsum through axis=1
df2 = df2.cumsum(1)
Finally, reset the end month to values:
df2.values[r, df.end.dt.month]= df['values']
Final output:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 0 0 0 0 0 1 1 0 0 0 0 0 0
1 0 0 0 0 0 2 2 2 2 0 0 0 0
2 0 0 0 0 0 1 1 1 1 1 0 0 0
3 0 0 0 0 0 3 3 3 3 3 3 0 0
4 0 0 0 0 0 4 4 4 4 4 4 4 0
5 0 0 0 0 0 2 2 2 2 2 2 2 2
A method using sklearn's MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
df['L'] = [pd.date_range(x, y, freq='M') for x, y in zip(df.start, df.end)]
mlb = MultiLabelBinarizer()
yourdf=pd.DataFrame(mlb.fit_transform(df['L']),columns=mlb.classes_, index=df.index).mul(df['values'],0)
yourdf.columns=yourdf.columns.strftime('%Y%B')
yourdf['subscription']=df['subscription']
yourdf
Out[75]:
2018May 2018June ... 2018November subscription
0 1 0 ... 0 x
1 2 2 ... 0 y
2 1 1 ... 0 z
3 3 3 ... 0 a
4 4 4 ... 0 b
5 2 2 ... 2 c
[6 rows x 8 columns]
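A condensed, runnable sketch of the cumsum trick above on two of the sample rows (dates parsed month-first, as in the question; the numpy array is built first and wrapped in a DataFrame at the end, which avoids writing through .values):

```python
import calendar
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "subscription": ["x", "y"],
    "values": [1, 2],
    "start": pd.to_datetime(["5/5/2018", "5/5/2018"]),
    "end":   pd.to_datetime(["6/5/2018", "8/5/2018"]),
})

# One column per month (plus a dummy column 0): mark +value at the start
# month, -value at the end month, cumsum across, then restore the end month.
arr = np.zeros((len(df), 13))
r = np.arange(len(df))
arr[r, df["start"].dt.month] = df["values"]
arr[r, df["end"].dt.month] = -df["values"]
arr = arr.cumsum(axis=1)
arr[r, df["end"].dt.month] = df["values"]
df2 = pd.DataFrame(arr, columns=[calendar.month_abbr[i] for i in range(13)])
```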
