Group (sum) by Month, Year and another Variable in Python

I'm quite new to programming, and I'm using Python for data manipulation and analysis.
I have a dataframe that looks like:
Brand Date Unit
A 1/1/19 10
B 3/1/19 11
A 11/1/19 15
B 11/1/19 5
A 1/1/20 10
A 9/2/19 18
B 12/2/19 11
B 19/2/19 8
B 1/1/20 5
And I would like to group by month, year and Brand. If it helps, I also have separate columns for Month and Year. The expected result should look like this:
Brand Date Unit
A Jan 2019 25
B Jan 2019 16
A Feb 2019 18
B Feb 2019 19
A Jan 2020 8
B Feb 2020 5
I tried adapting an answer from someone else's question:
per = df.Date.dt.to_period("M")
g = df.groupby(per,'Brand')
g.sum()
but I get this error:
ValueError: No axis named Brand for object type <class 'pandas.core.frame.DataFrame'>
and I have no idea how to solve this.
I used to do this with dictionaries: selecting each month/year individually, summing per group, and then building the dictionary. That feels like brute force, and it won't help if the df gets updated with new data.
Maybe my whole approach to the problem is wrong, too. In the end I'd like to have a df that looks like:
Brand Jan 19 Feb 19 Jan 20
A 25 18 8
B 16 19 5

Use pandas.to_datetime and pandas.DataFrame.pivot_table:
import pandas as pd
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True).dt.strftime("%b %Y")
new_df = df.pivot_table(index="Brand", columns="Date", aggfunc="sum")
print(new_df)
Output:
Unit
Date Feb 2019 Jan 2019 Jan 2020
Brand
A 18 25 10
B 19 16 5
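The columns above come out in alphabetical order (Feb before Jan) because the dates are turned into strings before pivoting. A minimal sketch of one way to keep the months chronological, assuming the question's sample data: pivot on a monthly Period column and format the labels afterwards. The `Month` column and the `wide` name are introduced here purely for illustration.

```python
import pandas as pd

# Sample data from the question (dates are day/month/2-digit-year)
df = pd.DataFrame({
    "Brand": list("ABABAABBB"),
    "Date": ["1/1/19", "3/1/19", "11/1/19", "11/1/19", "1/1/20",
             "9/2/19", "12/2/19", "19/2/19", "1/1/20"],
    "Unit": [10, 11, 15, 5, 10, 18, 11, 8, 5],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")

# Periods sort chronologically, unlike formatted strings
df["Month"] = df["Date"].dt.to_period("M")
wide = df.pivot_table(index="Brand", columns="Month", values="Unit", aggfunc="sum")

# Format the labels only after the pivot has fixed the column order
wide.columns = [p.strftime("%b %y") for p in wide.columns]
print(wide)
```

Formatting last means the sort happens on real periods, so a year boundary (Jan 20 after Feb 19) is handled without extra work.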

You were close: DataFrame.groupby wants a list of groupers, not bare arguments.
Here's how I did it:
import pandas
from io import StringIO
csv = StringIO("""\
Brand Date Unit
A 1/1/19 10
B 3/1/19 11
A 11/1/19 15
B 11/1/19 5
A 1/1/20 10
A 9/2/19 18
B 12/2/19 11
B 19/2/19 8
B 1/1/20 5
""")
(
    pandas.read_csv(csv, parse_dates=['Date'], sep=r'\s+', dayfirst=True)
    # freq='1M' bins by month end; pandas 2.2+ prefers the alias 'ME'
    .groupby(['Brand', pandas.Grouper(key='Date', freq='1M')])
    .sum()
    .reset_index()
)
And that gives me:
Brand Date Unit
0 A 2019-01-31 25
1 A 2019-02-28 18
2 A 2020-01-31 10
3 B 2019-01-31 16
4 B 2019-02-28 19
5 B 2020-01-31 5
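For reference, the attempt from the question also runs once the two groupers are wrapped in a list. When 'Brand' is passed as a second positional argument it is not treated as another grouper, which is most likely why pandas complained about an axis named Brand. The sketch below uses the question's sample data; selecting the `Unit` column before summing is an addition made here so that only the numeric column is aggregated.

```python
import pandas as pd

# Sample data from the question (dates are day/month/2-digit-year)
df = pd.DataFrame({
    "Brand": list("ABABAABBB"),
    "Date": ["1/1/19", "3/1/19", "11/1/19", "11/1/19", "1/1/20",
             "9/2/19", "12/2/19", "19/2/19", "1/1/20"],
    "Unit": [10, 11, 15, 5, 10, 18, 11, 8, 5],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")

per = df["Date"].dt.to_period("M")

# Both groupers go into ONE list; the period Series is aligned on the index
monthly = df.groupby([per, "Brand"])["Unit"].sum().reset_index()
print(monthly)
```

The result is in long format with one row per month and brand, and the `Date` column holds monthly periods rather than month-end timestamps.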

Related

Pandas vectorization with two dataframe

Assume I have the following two dataframes, DataFrame A and DataFrame B.
DataFrame A has four columns: Year, Month, Day and Temperature (e.g. 2021 || 7 || 5 || 23). Currently, some of the temperature cells in DataFrame A are NaN.
DataFrame B has two columns: Date and Temperature (e.g. 2021/7/7 || 28).
The two dataframes are sampled at different time intervals. DataFrame A's interval is smaller than DataFrame B's, but some timestamps overlap (e.g. every 10 mins in DataFrame B and every 5 mins in DataFrame A).
Now I want to copy the temperature data from DataFrame B to DataFrame A wherever DataFrame A has a NaN value.
I have a method that uses a loop, but it is very slow. I want to use pandas vectorization, but I don't know how. Can anyone teach me?
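for i in tqdm(range(len(dfA['Temp']))):
    if pd.isna(dfA['Temp'].iloc[i]):
        # build the lookup key from this row's Year/Month/Day columns
        date_time_str = f"{dfA['Year'].iloc[i]}/{dfA['Month'].iloc[i]}/{dfA['Day'].iloc[i]}"
        try:
            # take the first matching row of dfB, second column (temperature)
            dfA.loc[dfA.index[i], 'Temp'] = float(dfB.loc[dfB['Date'] == date_time_str].iloc[0, 1])
        except IndexError:
            print("no value")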
My solution is very slow. How can I do this with pandas vectorization?
Method I tried for vectorization:
dfA.loc[df['temp'].isnull() & ((datetime.datetime(dfA['Year'], df['*Month'], dfA['Day']).strftime("%Y/%m/%d %H:%M"))in dfB.Date.values) , 'temp'] = float(dfB[dfB['Date'] == datetime.datetime(dfA['Year'], df['*Month'], dfA['Day']].iloc[:, 1])
That is my attempt; it doesn't work.
Example data:
DataFrame A
Year Month Day Temperature
2020 1 17 25
2020 1 18 NaN
2020 1 19 28
2020 1 20 NaN
2020 1 21 NaN
2020 1 22 NaN
DataFrame B
Date Temp
1/17/2020 25
1/19/2020 28
1/21/2020 31
1/23/2020 34
1/25/2020 23
1/27/2020 54
Expected Output
Year Month Day Temperature
2020 1 17 25
2020 1 18 NaN
2020 1 19 28
2020 1 20 NaN
2020 1 21 31
2020 1 22 NaN
Let's map them:
dfa['Date']=pd.to_datetime(dfa[['Day','Month','Year']])
dfb['Date']=pd.to_datetime(dfb['Date'])
dfa['Temperature']=dfa['Temperature'].fillna(dfa.pop('Date').map(dfb.set_index('Date')['Temp']))
OR
Let's Merge them:
dfa['Date']=pd.to_datetime(dfa[['Day','Month','Year']])
dfb['Date']=pd.to_datetime(dfb['Date'])
dfa=dfa.merge(dfb[['Date','Temp']],on='Date',how='left')
dfa['Temperature']=dfa['Temperature'].fillna(dfa.pop('Temp'))
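As a self-contained check of the map idea, here is a sketch that rebuilds the example data from the question and fills only the missing temperatures. The names `key` and `lookup` are made up for this example, and the month/day/year format for DataFrame B's dates is an assumption read off the sample rows.

```python
import numpy as np
import pandas as pd

dfa = pd.DataFrame({
    "Year": [2020] * 6,
    "Month": [1] * 6,
    "Day": [17, 18, 19, 20, 21, 22],
    "Temperature": [25, np.nan, 28, np.nan, np.nan, np.nan],
})
dfb = pd.DataFrame({
    "Date": ["1/17/2020", "1/19/2020", "1/21/2020",
             "1/23/2020", "1/25/2020", "1/27/2020"],
    "Temp": [25, 28, 31, 34, 23, 54],
})

# One datetime per row of A, assembled from the Year/Month/Day columns
key = pd.to_datetime(dfa[["Year", "Month", "Day"]])

# B's temperatures indexed by date, used as a lookup table
lookup = dfb.assign(Date=pd.to_datetime(dfb["Date"], format="%m/%d/%Y")).set_index("Date")["Temp"]

# Dates missing from B map to NaN, so those cells simply stay NaN
dfa["Temperature"] = dfa["Temperature"].fillna(key.map(lookup))
print(dfa)
```

Neither input gains a helper column here, which avoids the pop step at the cost of two temporary variables.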
One way using pandas.to_datetime with pandas.Series.fillna:
df1 = df1.set_index(pd.to_datetime(df1[["Year", "Month", "Day"]]))
s = df2.set_index(pd.to_datetime(df2.pop("Date"))).squeeze()
df1["Temperature"] = df1["Temperature"].fillna(s)
print(df1.reset_index(drop=True))
Output:
Year Month Day Temperature
0 2020 1 17 25.0
1 2020 1 18 NaN
2 2020 1 19 28.0
3 2020 1 20 NaN
4 2020 1 21 31.0
5 2020 1 22 NaN

Add new row of values to Pandas DataFrame in specific row

I'm trying to add a new line, with some values from an array, after the Gross Profit line in an income statement.
I tried to append it at that location, but nothing changed.
income_statement.loc[["Gross Profit"]].append(gross)
The only way I succeeded in doing something similar was by making it another dataframe and concatenating it to the end of income_statement.
I'm trying to make it look like this (the 'gross' line in yellow):
How can I do it?
I created a sample df that tried to look similar to yours (see below).
df
Unnamed: 0 2010 2011 2012 2013 ... 2016 2017 2018 2019 TTM
0 gross profit 10 11 12 13 ... 16 17 18 19 300
1 total revenue 1 2 3 4 ... 7 8 9 10 400
The aim now would be to add a row between them ('gross'), with the values you have listed in the picture.
One way to add the row is numpy.insert, which returns an array, so you have to convert the result back to a pd.DataFrame:
# Store the columns of your df
cols = df.columns
# Insert the row at position 1, i.e. as the 2nd row (Python indexes start from 0)
new = pd.DataFrame(
    np.insert(df.values, 1, values=['gross', 22, 45, 65, 87, 108, 130, 151, 152, 156, 135, 133], axis=0),
    columns=cols,
)
Which gets back:
new
Unnamed: 0 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 TTM
0 gross profit 10 11 12 13 14 15 16 17 18 19 300
1 gross 22 45 65 87 108 130 151 152 156 135 133
2 total revenue 1 2 3 4 5 6 7 8 9 10 400
Hopefully this will work for you. Let me know for issues.
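An alternative sketch that stays inside pandas: slice the frame at the insertion point and concatenate the pieces around the new row. DataFrame.append always returned a new frame instead of modifying the original (and was removed in pandas 2.0), which explains why nothing seemed to change in the question. The column name `item` and the shortened set of year columns are placeholders for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "item": ["gross profit", "total revenue"],
    "2010": [10, 1],
    "2011": [11, 2],
})

# The new row as a one-row frame with the same columns
new_row = pd.DataFrame([{"item": "gross", "2010": 22, "2011": 45}])

# Rows before position 1, then the new row, then the remaining rows
result = pd.concat([df.iloc[:1], new_row, df.iloc[1:]], ignore_index=True)
print(result)
```

Unlike the round trip through df.values, this keeps the numeric columns as integers instead of converting everything to a generic object type.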

Use pandas to get previous year sales in the same row

I have a table of sales for different companies.
company_name sales year
A 200 2019
A 100 2018
A 30 2017
B 15 2019
B 30 2018
B 45 2017
Now I want to add the previous year's sales to the same row, like this:
company_name sales year previous_sales
A 200 2019 100
A 100 2018 30
A 30 2017 NaN
B 15 2019 30
B 30 2018 45
B 45 2017 NaN
I tried the code below, but it did not give the right result:
df["previous_sales"] = df.groupby(['company_name', 'year'])['sales'].shift()
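The attempt above groups by both company_name and year, so every group holds a single row and the shift has nothing to pull from. A sketch of one possible fix, assuming one row per company and year: group by company only, order the rows by year, and shift by one. The helper name `add_previous_sales` is made up for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "company_name": ["A", "A", "A", "B", "B", "B"],
    "sales": [200, 100, 30, 15, 30, 45],
    "year": [2019, 2018, 2017, 2019, 2018, 2017],
})

def add_previous_sales(frame):
    """Return a copy of frame with each company's prior-year sales added."""
    # Sort so that, within a company, the previous row is the previous year
    ordered = frame.sort_values(["company_name", "year"])
    previous = ordered.groupby("company_name")["sales"].shift(1)
    out = frame.copy()
    # Assignment aligns on the index, so the original row order is kept
    out["previous_sales"] = previous
    return out

result = add_previous_sales(df)
print(result)
```

If a company can skip a year, a plain shift would return the sales of the last available year instead; merging the frame with a copy whose year is increased by one would be stricter.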

Pandas Rolling mean with GroupBy and Sort

I have a DataFrame that looks like:
f_period f_year f_month subject month year value
20140102 2014 1 a 1 2018 10
20140109 2014 1 a 1 2018 12
20140116 2014 1 a 1 2018 8
20140202 2014 2 a 1 2018 20
20140209 2014 2 a 1 2018 15
20140102 2014 1 b 1 2018 10
20140109 2014 1 b 1 2018 12
20140116 2014 1 b 1 2018 8
20140202 2014 2 b 1 2018 20
20140209 2014 2 b 1 2018 15
The f_period is the date when a forecast for a SKU (column subject) was made. The month and year columns are the period for which the forecast was made. For example, the first row says that on 01/02/2014, the model was forecasting 10 units of product a for month 1 of year 2018.
I am trying to create a rolling average prediction by subject, by month for 2 f_months. The DataFrame should look like:
f_period f_year f_month subject month year value mnthly_avg rolling_2_avg
20140102 2014 1 a 1 2018 10 10 13
20140109 2014 1 a 1 2018 12 10 13
20140116 2014 1 a 1 2018 8 10 13
20140202 2014 2 a 1 2018 20 17.5 null
20140209 2014 2 a 1 2018 15 17.5 null
20140102 2014 1 b 1 2018 10 10 13
20140109 2014 1 b 1 2018 12 10 13
20140116 2014 1 b 1 2018 8 10 13
20140202 2014 2 b 1 2018 20 17.5 null
20140209 2014 2 b 1 2018 15 17.5 null
Things I tried:
I was able to get mnthly_avg with:
data_df['monthly_avg'] = data_df.groupby(['f_month', 'f_year', 'year', 'month', 'period', 'subject']).\
    value.transform('mean')
I tried getting the rolling_2_avg:
rolling_monthly_df = data_df[['f_year', 'f_month', 'subject', 'month', 'year', 'value', 'f_period']].\
    groupby(['f_year', 'f_month', 'subject', 'month', 'year']).value.mean().reset_index()
rolling_monthly_df['rolling_2_avg'] = rolling_monthly_df.groupby(['subject', 'month']).\
    value.rolling(2).mean().reset_index(drop=True)
This gave me an unexpected output. I don't understand how it calculated the values for rolling_2_avg
How do I group by subject and month and then sort by f_month and then take the average of the next two-month average?
Unless I'm misunderstanding, it seems simpler than what you've done. What about this?
grp = pd.DataFrame(df.groupby(['subject', 'month', 'f_month'])['value'].sum())
grp['rolling'] = grp.rolling(window=2).mean()
grp
Output:
value rolling
subject month f_month
a 1 1 30 NaN
2 35 32.5
b 1 1 30 32.5
2 35 32.5
I would be a bit careful with Josh's solution. If you want to group by the subject you can't use the rolling function like that as it will roll across subjects (i.e. it will eventually take the mean of a month from subject A and B, rather than giving a null which you might prefer).
An alternative is to split the dataframe and run the rolling on each subject individually (I noticed that you want the nulls at the end of each group, so you might want to sort the dataframe before and after):
for unique_subject in df['subject'].unique():
    # copy the slice so that adding a column does not touch df itself
    df_subject = df[df['subject'] == unique_subject].copy()
    df_subject['rolling'] = df_subject['value'].rolling(window=2).mean()
    print(df_subject)  # just to print, you may wanna concatenate these
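The same per-subject isolation can be had without an explicit loop by letting groupby restart the window for every group. This is a sketch on a reduced version of the question's data (only the columns needed here); the names `monthly`, `monthly_avg` and `rolling_2_avg` follow the question loosely and are otherwise assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "f_month": [1, 1, 1, 2, 2] * 2,
    "subject": ["a"] * 5 + ["b"] * 5,
    "month": [1] * 10,
    "value": [10, 12, 8, 20, 15] * 2,
})

# One row per (subject, month, f_month) holding the mean forecast
monthly = (
    df.groupby(["subject", "month", "f_month"])["value"]
    .mean()
    .reset_index(name="monthly_avg")
)

# Rolling mean over two consecutive f_months, restarted for every
# (subject, month) group so values never mix across subjects
monthly["rolling_2_avg"] = (
    monthly.groupby(["subject", "month"])["monthly_avg"]
    .transform(lambda s: s.rolling(2).mean())
)
print(monthly)
```

This window looks backwards, so the first f_month of each group is NaN. Adding `.shift(-1)` inside the lambda would move the value onto the earlier f_month, which seems closer to the layout the question expects, and the result can be merged back onto the original rows on the grouping keys.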

Pandas dataframe: turn date in column into value in row

I'm trying to turn the following dataframe (with values for county and year)
county region 2012 2013 ... 2035
A 101 10 15 ... 7
B 101 13 8 ... 11
...
into a dataframe that looks like this:
county region year sum
A 101 2012 10
A 101 2013 15
... ... ... ...
A 101 2035 7
B 101 2012 13
B 101 2013 8
B 101 2035 11
My current dataframe has 400 rows (different counties) with values for the years 2012-2035.
My manual approach would be to slice the year columns off and put each of them below the last row of the preceding year. But of course there has to be a pythonic way.
I guess I'm missing a basic pandas concept here, probably I just couldn't find the right answer to this problem because I simply didn't know how to ask the right question. Please be gentle with the newcomer.
You can use melt from pandas:
In [26]: df
Out[26]:
county region 2012 2013
0 A 101 10 15
1 B 101 13 8
In [27]: pd.melt(df, id_vars=['county','region'], var_name='year', value_name='sum')
Out[27]:
county region year sum
0 A 101 2012 10
1 B 101 2012 13
2 A 101 2013 15
3 B 101 2013 8
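melt stacks the result one year column at a time, so the rows come out grouped by year. To get the county-by-county order shown in the question, a sort can be chained on afterwards. This is a small sketch on the two-year sample; the name `long_df` is introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "county": ["A", "B"],
    "region": [101, 101],
    "2012": [10, 13],
    "2013": [15, 8],
})

long_df = (
    pd.melt(df, id_vars=["county", "region"], var_name="year", value_name="sum")
    .sort_values(["county", "year"])  # all years of county A, then county B
    .reset_index(drop=True)
)
print(long_df)
```

The year values stay as strings because they come from column labels; `long_df["year"].astype(int)` would convert them if numeric years are needed.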
