I have a dataset like this
id event 2015 2016 2017
a 2015 33 na na
a 2016 na 32 na
a 2017 na na 31
b 2015 30 na na
b 2017 na na 20
How do I bring all the non-missing values for each id into a single row, like this:
id 2015 2016 2017
a 33 32 31
b 30 0 20
Sorry, the questions above do not solve my case, and the code does not work.
try:
df = df.set_index('event').replace('na', np.nan)
df1 = pd.concat([df[col].dropna() for col in df.columns], axis=0).to_frame().T
df1:
event 2015 2016 2017
0 33 32 31
First set event as the index and replace 'na' with NaN. Then drop all the NaN values from each column and concatenate what is left.
Use GroupBy.first to take the first non-missing value per id group:
df = (df.drop('event', axis=1)
        .replace('na', np.nan)
        .groupby('id', as_index=False)
        .first()
        .fillna(0))
print(df)
id 2015 2016 2017
0 a 33 32 31
1 b 30 0 20
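For reference, here is a self-contained sketch of the same idea that rebuilds the question's data, so it can be run as is. It assumes the year columns are strings and that 'na' is a literal string placeholder; pd.to_numeric is swapped in for replace so that the year columns end up numeric.

```python
import pandas as pd

# Sample data copied from the question; 'na' is a literal string placeholder.
df = pd.DataFrame({
    "id": ["a", "a", "a", "b", "b"],
    "event": [2015, 2016, 2017, 2015, 2017],
    "2015": [33, "na", "na", 30, "na"],
    "2016": ["na", 32, "na", "na", "na"],
    "2017": ["na", "na", 31, "na", 20],
})

year_cols = ["2015", "2016", "2017"]

# Turn the 'na' strings into real missing values and make the columns numeric.
values = df[year_cols].apply(pd.to_numeric, errors="coerce")

# first() picks the first non-missing value of each column within each id.
result = (
    values.groupby(df["id"])
    .first()
    .fillna(0)
    .astype(int)
    .reset_index()
)
print(result)
```

The fillna(0) step is what produces the 0 for id b in 2016, where no value was recorded.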
I'm trying to add a new line with some values from an array after the Gross Profit line in an income statement.
I tried to append it at that location, but nothing changed.
income_statement.loc[["Gross Profit"]].append(gross)
The only way I have managed something similar is by making it another dataframe and concatenating it to the end of income_statement.
I'm trying to make it look like this (the 'gross' line in yellow):
How can I do it?
I created a sample df that tried to look similar to yours (see below).
df
Unnamed: 0 2010 2011 2012 2013 ... 2016 2017 2018 2019 TTM
0 gross profit 10 11 12 13 ... 16 17 18 19 300
1 total revenue 1 2 3 4 ... 7 8 9 10 400
The aim now would be to add a row between them ('gross'), with the values you have listed in the picture.
One way to add the row could be with numpy.insert, which returns an array back so you have to convert back to a pd.DataFrame:
# Store the columns of your df
cols = df.columns
# Insert the row at position 1 (the second row, since Python indexing starts at 0)
new = pd.DataFrame(
    np.insert(df.values, 1, values=['gross', 22, 45, 65, 87, 108, 130, 151, 152, 156, 135, 133], axis=0),
    columns=cols)
Which gets back:
new
Unnamed: 0 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 TTM
0 gross profit 10 11 12 13 14 15 16 17 18 19 300
1 gross 22 45 65 87 108 130 151 152 156 135 133
2 total revenue 1 2 3 4 5 6 7 8 9 10 400
Hopefully this will work for you. Let me know for issues.
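One thing to be aware of: df.values on a frame that mixes text and numbers is an object array, so the rebuilt frame loses its numeric dtypes. A possible alternative is to slice the frame and use pd.concat. This is only a sketch on a made-up three-year miniature of the income statement; the column name item and the variable pos are illustrative, not taken from the question.

```python
import pandas as pd

# Hypothetical miniature of the income statement (three year columns only).
df = pd.DataFrame({
    "item": ["gross profit", "total revenue"],
    "2010": [10, 1],
    "2011": [11, 2],
    "2012": [12, 3],
})

# The new row as a one-row frame with the same columns.
gross = pd.DataFrame([["gross", 22, 45, 65]], columns=df.columns)

pos = 1  # the new row goes in front of the row currently at position 1
new = pd.concat([df.iloc[:pos], gross, df.iloc[pos:]], ignore_index=True)
print(new)
```

Because every piece keeps its own dtypes, the year columns stay integer after the concat.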
I'm quite new to programming, and I'm using Python for data manipulation and analysis.
I have a dataframe that looks like:
Brand Date Unit
A 1/1/19 10
B 3/1/19 11
A 11/1/19 15
B 11/1/19 5
A 1/1/20 10
A 9/2/19 18
B 12/2/19 11
B 19/2/19 8
B 1/1/20 5
And I would like to group by month, year and Brand. If it helps, I also have separate columns for Month and Year. The expected result should look like this:
Brand Date Unit
A Jan 2019 25
B Jan 2019 16
A Feb 2019 18
B Feb 2019 19
A Jan 2020 10
B Jan 2020 5
I tried adapting an answer from someone else's question:
per = df.Date.dt.to_period("M")
g = df.groupby(per,'Brand')
g.sum()
but I get prompted:
ValueError: No axis named Brand for object type <class 'pandas.core.frame.DataFrame'>
and I don't have any idea how to solve this.
I used to do this with dictionaries by selecting each month/year individually, group by sum and then create the dictionary, but it seems kind of brute force, really rough and it won't help if the df gets updated with new data.
Even more, maybe I'm having a bad approach to the situation. In the end I'd like to have a df looking like:
Brand Jan 19 Feb 19 Jan 20
A 25 18 10
B 16 19 5
Use pandas.to_datetime and pandas.DataFrame.pivot_table:
df["Date"] = pd.to_datetime(df["Date"], dayfirst=True).dt.strftime("%b %Y")
new_df = df.pivot_table(index="Brand", columns="Date", aggfunc="sum")
print(new_df)
Output:
Unit
Date Feb 2019 Jan 2019 Jan 2020
Brand
A 18 25 10
B 19 16 5
You were close: DataFrame.groupby wants a list of groupers, not bare arguments.
Here's how I did it:
import pandas
from io import StringIO
csv = StringIO("""\
Brand Date Unit
A 1/1/19 10
B 3/1/19 11
A 11/1/19 15
B 11/1/19 5
A 1/1/20 10
A 9/2/19 18
B 12/2/19 11
B 19/2/19 8
B 1/1/20 5
""")
(
    pandas.read_csv(csv, parse_dates=['Date'], sep=r'\s+', dayfirst=True)
    .groupby(['Brand', pandas.Grouper(key='Date', freq='1M')])
    .sum()
    .reset_index()
)
And that gives me:
Brand Date Unit
0 A 2019-01-31 25
1 A 2019-02-28 18
2 A 2020-01-31 10
3 B 2019-01-31 16
4 B 2019-02-28 19
5 B 2020-01-31 5
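The same list-of-groupers fix also works with the to_period approach from the question, and unstacking the result gives the wide layout asked for at the end. A minimal sketch, assuming the dates are day-first strings as in the sample; wide is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Brand": ["A", "B", "A", "B", "A", "A", "B", "B", "B"],
    "Date": ["1/1/19", "3/1/19", "11/1/19", "11/1/19", "1/1/20",
             "9/2/19", "12/2/19", "19/2/19", "1/1/20"],
    "Unit": [10, 11, 15, 5, 10, 18, 11, 8, 5],
})
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%y")

# Monthly period for each row; the groupers go into one list.
per = df["Date"].dt.to_period("M")
wide = df.groupby(["Brand", per])["Unit"].sum().unstack("Date")

# Periods sort chronologically, so the labels are only built at the end.
wide.columns = [p.strftime("%b %y") for p in wide.columns]
print(wide)
```

Formatting the labels last keeps the columns in calendar order, whereas formatting the dates as strings first sorts them alphabetically (Feb before Jan).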
I have a DataFrame that looks like:
f_period f_year f_month subject month year value
20140102 2014 1 a 1 2018 10
20140109 2014 1 a 1 2018 12
20140116 2014 1 a 1 2018 8
20140202 2014 2 a 1 2018 20
20140209 2014 2 a 1 2018 15
20140102 2014 1 b 1 2018 10
20140109 2014 1 b 1 2018 12
20140116 2014 1 b 1 2018 8
20140202 2014 2 b 1 2018 20
20140209 2014 2 b 1 2018 15
The f_period is the date when a forecast for a SKU (column subject) was made. The month and year columns give the period the forecast is for. For example, the first row says that on 01/02/2014 the model forecast 10 units of product a for month 1 of year 2018.
I am trying to create a rolling average prediction by subject, by month for 2 f_months. The DataFrame should look like:
f_period f_year f_month subject month year value mnthly_avg rolling_2_avg
20140102 2014 1 a 1 2018 10 10 13
20140109 2014 1 a 1 2018 12 10 13
20140116 2014 1 a 1 2018 8 10 13
20140202 2014 2 a 1 2018 20 17.5 null
20140209 2014 2 a 1 2018 15 17.5 null
20140102 2014 1 b 1 2018 10 10 13
20140109 2014 1 b 1 2018 12 10 13
20140116 2014 1 b 1 2018 8 10 13
20140202 2014 2 b 1 2018 20 17.5 null
20140209 2014 2 b 1 2018 15 17.5 null
Things I tried:
I was able to get mnthly_avg by :
data_df['monthly_avg'] = data_df.groupby(['f_month', 'f_year', 'year', 'month', 'period', 'subject']).\
value.transform('mean')
I tried getting the rolling_2_avg :
rolling_monthly_df = data_df[['f_year', 'f_month', 'subject', 'month', 'year', 'value', 'f_period']].\
groupby(['f_year', 'f_month', 'subject', 'month', 'year']).value.mean().reset_index()
rolling_monthly_df['rolling_2_avg'] = rolling_monthly_df.groupby(['subject', 'month']).\
value.rolling(2).mean().reset_index(drop=True)
This gave me unexpected output. I don't understand how it calculated the values for rolling_2_avg.
How do I group by subject and month, sort by f_month, and then average the next two monthly averages?
Unless I'm misunderstanding, it seems simpler than what you've done. What about this?
grp = pd.DataFrame(df.groupby(['subject', 'month', 'f_month'])['value'].sum())
grp['rolling'] = grp.rolling(window=2).mean()
grp
Output:
value rolling
subject month f_month
a 1 1 30 NaN
2 35 32.5
b 1 1 30 32.5
2 35 32.5
I would be a bit careful with Josh's solution. If you want to group by the subject you can't use the rolling function like that as it will roll across subjects (i.e. it will eventually take the mean of a month from subject A and B, rather than giving a null which you might prefer).
An alternative is to split the dataframe by subject and run the rolling mean on each part individually (I noticed that you want the nulls at the end of each group, so you may want to sort the dataframe before and after):
for unique_subject in df['subject'].unique():
    # .copy() avoids pandas' SettingWithCopyWarning when adding the column
    df_subject = df[df['subject'] == unique_subject].copy()
    df_subject['rolling'] = df_subject['value'].rolling(window=2).mean()
    print(df_subject)  # just to print; you may want to concatenate these
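The per-subject loop can also be written with groupby and transform, which keeps everything in one frame. This is a sketch only: it starts from the monthly averages given in the question (10 and 17.5 for each subject), and the frame name monthly is an assumption made here.

```python
import pandas as pd

# Monthly averages per subject and forecast month, taken from the question.
monthly = pd.DataFrame({
    "subject": ["a", "a", "b", "b"],
    "f_month": [1, 2, 1, 2],
    "value": [10.0, 17.5, 10.0, 17.5],
})

monthly = monthly.sort_values(["subject", "f_month"])

# The rolling window restarts for every subject, so it never mixes a and b.
monthly["rolling_2_avg"] = (
    monthly.groupby("subject")["value"]
    .transform(lambda s: s.rolling(window=2).mean())
)
print(monthly)
```

This is a trailing window, so the first month of each subject is NaN. If the nulls should sit at the end of each group instead, one option is to shift the result by -1 within each subject.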
I have a df with nations as the index and years (1990-2015) as the header. I want to make a new df2 where every column is the sum of 5 years, e.g. 1995-1999, 2000-2004, etc.
I have done this:
df2 = pd.DataFrame(index=df.index[:], columns=['1995', '2000', '2005', '2010', '2015'])
df2['1995'] = df.iloc[0:4].sum(axis=1)
But it doesn't replace the NaN values.
What am I doing wrong? Thanks in advance
Step 1
Transpose and reset index with df.T.reset_index
df2 = df.T.reset_index(drop=True)
Step 2
Using df.groupby, group by index in sets of 5, and then sum with DataFrameGroupBy.agg, passing np.nansum
df2 = df2.groupby(df2.index // 5).agg(np.nansum).T
Step 3
Assign the columns in place
df2.columns = pd.to_datetime(df.columns[::5]).year + 5
df = ... # Borrowed from Bharath
df2 = df.T.reset_index(drop=True)
df2 = df2.groupby(df2.index // 5).sum().T
df2.columns = pd.to_datetime(df.columns[::5]).year + 5
print(df2)
Output:
1995 2000 2005 2010
Country
IN 72 29 100 2
EG 31 40 40 24
I think you are looking for the sum of every 5 columns starting from a specific column. One way of doing it is to slice the data and concatenate the sums in a for loop, i.e. if you have a dataframe
df = pd.DataFrame({'Country':['IN','EG'],'1990':[2,4],'1991':[4,5],'1992':[2,4],'1993':[2,4],'1994':[62,14],'1995':[21,4],'1996':[2,14],'1997':[2,4],'1998':[2,14],'1999':[2,4],'2000':[2,4],'2001':[2,14],'2002':[92,4],'2003':[2,4],'2004':[2,14],'2005':[2,24]})
df.set_index('Country',drop=True,inplace=True)
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 \
Country
IN 2 4 2 2 62 21 2 2 2 2 2
EG 4 5 4 4 14 4 14 4 14 4 4
2001 2002 2003 2004 2005
Country
IN 2 92 2 2 2
EG 14 4 4 14 24
Then
df2 = pd.DataFrame(index=df.index[:])
columns = ['1990', '1995', '2000', '2005']
for x in columns:
    # take the five columns starting at year x and sum them row-wise
    start = df.columns.tolist().index(x)
    df2 = pd.concat([df2, df[df.columns[start:start + 5]].sum(axis=1)], axis=1)
df2.columns = columns
Output :
1990 1995 2000 2005
Country
IN 72 29 100 2
EG 31 40 40 24
If you want different column labels, then:
df2.columns = ['1990-1994', '1995-1999', '2000-2004', '2005-']
Hope it helps
You can use:
convert the columns with to_datetime
resample along the columns (axis=1) in 5A (five-year) bins and aggregate with sum
finally, get the years from the columns with DatetimeIndex.year and subtract 4
df.columns = pd.to_datetime(df.columns, format='%Y')
df2 = df.resample('5A',axis=1, closed='left').sum()
df2.columns = df2.columns.year - 4
print (df2)
1990 1995 2000 2005
Country
IN 72 29 100 2
EG 31 40 40 24
If you need different year labels, you can add 1 instead:
df.columns = pd.to_datetime(df.columns, format='%Y')
df2 = df.resample('5A',axis=1, closed='left').sum()
df2.columns = df2.columns.year + 1
print (df2)
1995 2000 2005 2010
Country
IN 72 29 100 2
EG 31 40 40 24
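If your pandas version warns about or rejects axis=1 in resample, the same sums can be obtained by grouping the transposed frame instead. A minimal sketch, reusing the sample frame from the answer above and assuming the column labels are year strings; block_start is a name introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['IN', 'EG'],
    '1990': [2, 4], '1991': [4, 5], '1992': [2, 4], '1993': [2, 4],
    '1994': [62, 14], '1995': [21, 4], '1996': [2, 14], '1997': [2, 4],
    '1998': [2, 14], '1999': [2, 4], '2000': [2, 4], '2001': [2, 14],
    '2002': [92, 4], '2003': [2, 4], '2004': [2, 14], '2005': [2, 24],
}).set_index('Country')

# Map every year label to the first year of its five-year block,
# e.g. 1990..1994 -> 1990 and 1995..1999 -> 1995.
block_start = ((df.columns.astype(int) // 5) * 5).to_numpy()

# Group the transposed frame by block, sum, and transpose back.
df2 = df.T.groupby(block_start).sum().T
print(df2)
```

The integer division puts the block boundaries on multiples of five, which matches this data because the series starts in 1990. A series starting in another year would need an offset.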
I wish to compare the output of multiple model runs, calculating these values:
Difference between current period revenue and previous period
Difference between actual current period revenue and forecasted current period revenue
I have experimented with multi-indexes, and suspect the answer lies in that direction with some creative shift(). However, I'm afraid I've mangled the problem through a haphazard application of various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this:
import pandas as pd
ids = [1,2,3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014', '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual','actual','actual','forecast','forecast','forecast','actual','actual','actual','forecast','forecast','forecast','actual','actual','actual']
revenue = [10,20,20,30,50,90,10,40,50,120,210,150,130,100,190]
change_from_previous_year = ['NA','NA','NA',20,30,70,0,20,30,90,160,60,120,60,140]
change_from_forecast = ['NA','NA','NA','NA','NA','NA',-20,-10,-40,'NA','NA','NA',30,-110,40]
d = {'ids':ids, 'year':year, 'run':run, 'revenue':revenue}
df = pd.DataFrame(data=d, columns=['ids','year','run','revenue'])
print(df)
ids year run revenue
0 1 2013 actual 10
1 2 2013 actual 20
2 3 2013 actual 20
3 1 2014 forecast 30
4 2 2014 forecast 50
5 3 2014 forecast 90
6 1 2014 actual 10
7 2 2014 actual 40
8 3 2014 actual 50
9 1 2015 forecast 120
10 2 2015 forecast 210
11 3 2015 forecast 150
12 1 2015 actual 130
13 2 2015 actual 100
14 3 2015 actual 190
....into this:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NA NA
1 2 2013 actual 20 NA NA
2 3 2013 actual 20 NA NA
3 1 2014 forecast 30 20 NA
4 2 2014 forecast 50 30 NA
5 3 2014 forecast 90 70 NA
6 1 2014 actual 10 0 -20
7 2 2014 actual 40 20 -10
8 3 2014 actual 50 30 -40
9 1 2015 forecast 120 90 NA
10 2 2015 forecast 210 160 NA
11 3 2015 forecast 150 60 NA
12 1 2015 actual 130 120 30
13 2 2015 actual 100 60 -110
14 3 2015 actual 190 140 40
EDIT-- I get pretty close with this:
df['prev_year'] = df.groupby(['ids','run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
df['curr_forecast'] = df.groupby(['ids','year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']
The only thing missed (as expected) is the comparison between 2014 forecast & 2013 actual. I could just duplicate the 2013 run in the dataset, calculate the chg_from_prev_year for 2014 forecast, and hide/delete the unwanted data from the final dataframe.
Firstly to get the change from previous year, do a shift on each of the groups:
In [11]: g = df.groupby(['ids', 'run'])
In [12]: df['chg_from_prev_year'] = g['revenue'].apply(lambda x: x - x.shift())
The next part is more complicated; I think you need a pivot_table for it:
In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')
In [14]: df1
Out[14]:
run actual forecast
ids year
1 2013 10 NaN
2014 10 30
2015 130 120
2 2013 20 NaN
2014 40 50
2015 100 210
3 2013 20 NaN
2014 50 90
2015 190 150
In [15]: g1 = df1.groupby(level='ids', as_index=False)
In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])
In [17]: out_by # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids ids year
1 1 2013 NaN
2014 -20
2015 10
2 2 2013 NaN
2014 -10
2015 -110
3 3 2013 NaN
2014 -40
2015 40
dtype: float64
These are the results you want, but not in the correct format (see [31] below if you're not too fussed). The following seems like a bit of a hack (to put it mildly), but here goes:
In [21]: df2 = df.set_index(['ids', 'year', 'run'])
In [22]: out_by.index = out_by.index.droplevel(0)
In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])
In [24]: out_by_df['run'] = 'forecast'
In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']
and we're done...
In [26]: df2.reset_index()
Out[26]:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NaN NaN
1 2 2013 actual 20 NaN NaN
2 3 2013 actual 20 NaN NaN
3 1 2014 forecast 30 NaN -20
4 2 2014 forecast 50 NaN -10
5 3 2014 forecast 90 NaN -40
6 1 2014 actual 10 0 NaN
7 2 2014 actual 40 20 NaN
8 3 2014 actual 50 30 NaN
9 1 2015 forecast 120 90 10
10 2 2015 forecast 210 160 -110
11 3 2015 forecast 150 60 40
12 1 2015 actual 130 120 NaN
13 2 2015 actual 100 60 NaN
14 3 2015 actual 190 140 NaN
Note: I think the first 6 results of chg_from_prev_year should be NaN.
However, I think you may be better off keeping it as a pivot:
In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')
In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values
In [33]: df3
Out[33]:
revenue chg_from_prev_year chg_from_forecast
run actual forecast actual forecast
ids year
1 2013 10 NaN NaN NaN NaN
2014 10 30 0 NaN -20
2015 130 120 120 90 10
2 2013 20 NaN NaN NaN NaN
2014 40 50 20 NaN -10
2015 100 210 60 160 -110
3 2013 20 NaN NaN NaN NaN
2014 50 90 30 NaN -40
2015 190 150 140 60 40
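If the long format has to be kept, another way to get chg_from_forecast is to merge each year's forecast back onto the original rows. This is a sketch of that alternative and not the method used above; forecast_revenue, out and is_actual are names introduced here for illustration.

```python
import pandas as pd

ids = [1, 2, 3] * 5
year = ["2013"] * 3 + ["2014"] * 6 + ["2015"] * 6
run = ["actual"] * 3 + (["forecast"] * 3 + ["actual"] * 3) * 2
revenue = [10, 20, 20, 30, 50, 90, 10, 40, 50, 120, 210, 150, 130, 100, 190]
df = pd.DataFrame({"ids": ids, "year": year, "run": run, "revenue": revenue})

# One row per (ids, year) holding that year's forecast revenue.
forecast = (
    df.loc[df["run"] == "forecast", ["ids", "year", "revenue"]]
    .rename(columns={"revenue": "forecast_revenue"})
)

# A left merge keeps every original row, in the original order.
out = df.merge(forecast, on=["ids", "year"], how="left")

# Only actual rows get a value; forecast rows and 2013 stay NaN.
is_actual = out["run"] == "actual"
out["chg_from_forecast"] = (out["revenue"] - out["forecast_revenue"]).where(is_actual)
print(out)
```

Here the difference lands on the actual rows, as in the layout the question asks for, and the original row order is preserved.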