I have the following data frame:
      Month  Day  Year     Open      High      Low    Close  Week
0         1    1  2003   46.593    46.656   46.405   46.468     1
1         1    2  2003   46.538     46.66    46.47   46.673     1
2         1    3  2003   46.717    46.781    46.53   46.750     1
3         1    4  2003   46.815    46.843    46.68   46.750     1
4         1    5  2003   46.935    47.000    46.56   46.593     1
...     ...  ...   ...      ...       ...      ...      ...   ...
7257     10   26  2022  381.619  387.5799  381.350  382.019    43
7258     10   27  2022   383.07    385.00  379.329   379.98    43
7259     10   28  2022  379.869   389.519   379.67  389.019    43
7260     10   31  2022   386.44   388.399   385.26  386.209    44
7261     11    1  2022   390.14    390.39   383.29  384.519    44
I want to create a new column titled 'week high' that holds, for each week of each year, the highest value of High within that week. So for Week 1 of 2003 it should take the highest High from rows 0 to 4, but for Week 43 of 2022 it should take the highest High from rows 7257 to 7259.
Is it possible to reference the columns Week and Year to calculate that value? Thanks!
Assuming pandas, create a weekly period and use it as grouper for transform('max'):
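import pandas as pd

group = pd.to_datetime(df[['Year', 'Month', 'Day']]).dt.to_period('W')
# or, if you already have "Year" and "Week" columns, group on both so that
# the same week number in different years is kept separate
# group = ['Year', 'Week']
df['week_high'] = df.groupby(group)['High'].transform('max')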
Output:
      Month  Day  Year     Open      High      Low    Close  Week  week_high
0         1    1  2003   46.593   46.6560   46.405   46.468   1.0     47.000
1         1    2  2003   46.538   46.6600   46.470   46.673   1.0     47.000
2         1    3  2003   46.717   46.7810   46.530   46.750   1.0     47.000
3         1    4  2003   46.815   46.8430   46.680   46.750   1.0     47.000
4         1    5  2003   46.935   47.0000   46.560   46.593   1.0     47.000
7257     10   26  2022  381.619  387.5799  381.350  382.019  43.0    389.519
7258     10   27  2022  383.070  385.0000  379.329  379.980  43.0    389.519
7259     10   28  2022  379.869  389.5190  379.670  389.019  43.0    389.519
7260     10   31  2022  386.440  388.3990  385.260  386.209  44.0    390.390
7261     11    1  2022  390.140  390.3900  383.290  384.519  44.0    390.390
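Because week numbers repeat from year to year, grouping on Year and Week together keeps each week separate. A minimal sketch, assuming the Year, Week and High columns from the question and using only the ten rows shown there:

```python
import pandas as pd

# Only the rows visible in the question; the real frame has about 7,000 more.
df = pd.DataFrame(
    {
        "Year": [2003] * 5 + [2022] * 5,
        "Week": [1] * 5 + [43] * 3 + [44] * 2,
        "High": [46.656, 46.66, 46.781, 46.843, 47.0,
                 387.5799, 385.0, 389.519, 388.399, 390.39],
    },
    index=[0, 1, 2, 3, 4, 7257, 7258, 7259, 7260, 7261],
)

# transform('max') returns one value per original row, so the result
# can be assigned straight back as a new column.
df["week_high"] = df.groupby(["Year", "Week"])["High"].transform("max")
print(df)
```

Unlike an aggregation, transform keeps the original index, so no join is needed afterwards.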
I am assuming you are using pandas. Other libraries work similarly.
Create a new DataFrame aggregated per week (and year) using groupby, then join it back to your original DataFrame:
df_grouped = df[["Year", "Week", "High"]].groupby(["Year", "Week"]).max().rename(columns={"High": "Highest High"})
df_result = df.join(df_grouped, on=["Year", "Week"])
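The aggregate-then-join idea can be sketched end to end as follows. This is an illustration under assumptions: it uses only the rows visible in the question, and it groups on both Year and Week so that equal week numbers from different years do not collide.

```python
import pandas as pd

# Only the rows visible in the question.
df = pd.DataFrame(
    {
        "Year": [2003] * 5 + [2022] * 5,
        "Week": [1] * 5 + [43] * 3 + [44] * 2,
        "High": [46.656, 46.66, 46.781, 46.843, 47.0,
                 387.5799, 385.0, 389.519, 388.399, 390.39],
    },
    index=[0, 1, 2, 3, 4, 7257, 7258, 7259, 7260, 7261],
)

# One row per (Year, Week) holding the weekly maximum, as a named Series.
weekly_high = df.groupby(["Year", "Week"])["High"].max().rename("Highest High")

# Join the aggregate back: the two key columns of df are matched
# against the two index levels of weekly_high.
df_result = df.join(weekly_high, on=["Year", "Week"])
print(df_result)
```

The intermediate weekly_high is also useful on its own when one row per week is wanted.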
Here's the data in csv format:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
Jack 1 15 25 3 5 11 5 8 3
Jill 5 10 32 5 5 14 6 8 7
I don't want the Name column to be included, as it gives an error.
I tried
df.cumsum()
Try with set_index and reset_index to keep the name column:
df.set_index('Name').cumsum().reset_index()
Output:
Name 2012 2013 2014 2015 2016 2017 2018 2019 2020
0 Jack 1 15 25 3 5 11 5 8 3
1 Jill 6 25 57 8 10 25 11 16 10
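Note that cumsum() accumulates down the rows by default, which is why Jill's row above is the sum of Jack's and Jill's values. If a running total across the years for each person is wanted instead, axis=1 is the relevant option. A small sketch of both directions, assuming the year headers were read in as strings:

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Jack", "Jill"],
    "2012": [1, 5], "2013": [15, 10], "2014": [25, 32],
    "2015": [3, 5], "2016": [5, 5], "2017": [11, 14],
    "2018": [5, 6], "2019": [8, 8], "2020": [3, 7],
})

# Moving Name into the index keeps it out of the arithmetic.
numbers = df.set_index("Name")

down_rows = numbers.cumsum().reset_index()           # accumulate over people
across_years = numbers.cumsum(axis=1).reset_index()  # accumulate over years

print(across_years)
```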
Fruit January Shipments January Sales February Shipments February Sales
------------ ------------------- --------------- -------------------- ----------------
Apple 30 11 18 31
Banana 12 49 39 14
Pear 25 50 44 21
Kiwi 41 25 10 25
Strawberry 11 33 35 50
I'm trying to achieve the following result:
Fruit Month Shipments Sales
------------ ---------- ----------- -------
Apple January 30 11
Banana January 12 49
Pear January 25 50
Kiwi January 41 25
Strawberry January 11 33
Apple February 18 31
Banana February 39 14
Pear February 44 21
Kiwi February 10 25
Strawberry February 35 50
I've tried pandas.pivot and pandas.pivot_table and had no luck. I'm in the process of creating two dataframes (Fruit/Month/Shipments) and (Fruit/Month/Sales), and concatenating the two into one with a loop, but I was hoping for an easier way to do this.
One way is to turn the columns into a MultiIndex and then use stack. Suppose your dataframe is called df. First set the column Fruit as the index, then define the multi-level columns:
df = df.set_index('Fruit')
# manual way to create the multiindex columns
#df.columns = pd.MultiIndex.from_product([['January','February'],
# ['Shipments','Sales']], names=['Month',None])
# more general way to create the multiindex columns, thanks to @Scott Boston
df.columns = df.columns.str.split(expand=True)
df.columns.names = ['Month',None]
Your data now looks like:
Month        January         February
           Shipments  Sales Shipments  Sales
Fruit
Apple             30     11        18     31
Banana            12     49        39     14
Pear              25     50        44     21
Kiwi              41     25        10     25
Strawberry        11     33        35     50
Now you can use stack on level 0 and reset_index
df_output = df.stack(0).reset_index()
which gives
Fruit Month Sales Shipments
0 Apple February 31 18
1 Apple January 11 30
2 Banana February 14 39
3 Banana January 49 12
4 Pear February 21 44
5 Pear January 50 25
6 Kiwi February 25 10
7 Kiwi January 25 41
8 Strawberry February 50 35
9 Strawberry January 33 11
Finally, if you want a specific order for values in the column Month you can use pd.Categorical:
df_output['Month'] = pd.Categorical(df_output['Month'].tolist(), ordered=True,
                                    categories=['January','February'])
This makes January sort before February. Now, doing
df_output = df_output.sort_values(['Month'])
gives the result:
Fruit Month Sales Shipments
1 Apple January 11 30
3 Banana January 49 12
5 Pear January 50 25
7 Kiwi January 25 41
9 Strawberry January 33 11
0 Apple February 31 18
2 Banana February 14 39
4 Pear February 21 44
6 Kiwi February 25 10
8 Strawberry February 50 35
I see it's not exactly the expected output (order in the Fruit column and order of the columns), but both can easily be changed if needed.
Here is how to use pd.wide_to_long, as @user3483203 suggests.
df1 = df.set_index('Fruit')
# First rename the columns: split the headers into a MultiIndex, then swap the levels and join them with '_'.
df1.columns = df1.columns.str.split(expand=True)
df1.columns = df1.columns.map('{0[1]}_{0[0]}'.format)
#Reset index and use pd.wide_to_long:
df1 = df1.reset_index()
df_out = pd.wide_to_long(df1, ['Shipments','Sales'], 'Fruit', 'Month', '_', r'\w+')\
    .reset_index()
print(df_out)
Output:
Fruit Month Shipments Sales
0 Apple January 30.0 11.0
1 Banana January 12.0 49.0
2 Pear January 25.0 50.0
3 Kiwi January 41.0 25.0
4 Strawberry January 11.0 33.0
5 Apple February 18.0 31.0
6 Banana February 39.0 14.0
7 Pear February 44.0 21.0
8 Kiwi February 10.0 25.0
9 Strawberry February 35.0 50.0
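For comparison, the same reshape can be sketched with melt and pivot. This is a swapped-in technique, not one of the answers above. It assumes every header has the form "Month Measure" separated by a single space, and the helper names tidy, column, value and Measure are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Fruit": ["Apple", "Banana", "Pear", "Kiwi", "Strawberry"],
    "January Shipments": [30, 12, 25, 41, 11],
    "January Sales": [11, 49, 50, 25, 33],
    "February Shipments": [18, 39, 44, 10, 35],
    "February Sales": [31, 14, 21, 25, 50],
})

# 1) Melt every non-Fruit column into (column name, value) pairs.
tidy = df.melt(id_vars="Fruit", var_name="column", value_name="value")

# 2) Split "January Shipments" into a Month part and a Measure part.
tidy[["Month", "Measure"]] = tidy["column"].str.split(" ", expand=True)

# 3) Spread the measures back out into one column each.
result = tidy.pivot(index=["Fruit", "Month"], columns="Measure",
                    values="value").reset_index()
result.columns.name = None

# 4) Restore the month order and the original fruit order.
result["Month"] = pd.Categorical(result["Month"], ordered=True,
                                 categories=["January", "February"])
result["Fruit"] = pd.Categorical(result["Fruit"], ordered=True,
                                 categories=df["Fruit"].tolist())
result = (result.sort_values(["Month", "Fruit"])
                .reset_index(drop=True)[["Fruit", "Month", "Shipments", "Sales"]])
print(result)
```

Passing a list as the pivot index needs a reasonably recent pandas (1.1 or later, as far as I know).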
I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different length. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add varying length columns to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep=r"\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
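Since the stated goal is a boxplot, it may not be necessary to force the years into equal-length columns at all. A hedged sketch that instead collects one list per year, typing in the question's data by hand (the names money_by_year, years and data are made up for the illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Paul", "Susan", "Gary", "Paul", "Andrea", "Albert", "Hal", "Paul"],
    "Money": [57.0, 67.0, 54.0, 77.0, 20.0, 23.0, 26.0, 23.0],
    "Year": [2012, 2012, 2011, 2011, 2011, 2011, 2010, 2010],
})

# One list of awards per year; the lists may have different lengths.
money_by_year = {int(year): money.tolist()
                 for year, money in df.groupby("Year")["Money"]}

# Newest year first, matching the layout sketched in the question.
years = sorted(money_by_year, reverse=True)
data = [money_by_year[y] for y in years]
print(dict(zip(years, data)))
```

A list of unequal-length lists like data is the kind of input matplotlib's boxplot accepts. Alternatively, pandas' own df.boxplot(column="Money", by="Year") may produce the plot straight from the long format.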
I wish to compare the output of multiple model runs, calculating these values:
Difference between current period revenue and previous period
Difference between actual current period revenue and forecasted current period revenue
I have experimented with multi-indexes, and suspect the answer lies in that direction with some creative shift(). However, I'm afraid I've mangled the problem through a haphazard application of various pivot/melt/groupby experiments. Perhaps you can help me figure out how to turn this:
import pandas as pd
ids = [1,2,3] * 5
year = ['2013', '2013', '2013', '2014', '2014', '2014', '2014', '2014', '2014', '2015', '2015', '2015', '2015', '2015', '2015']
run = ['actual','actual','actual','forecast','forecast','forecast','actual','actual','actual','forecast','forecast','forecast','actual','actual','actual']
revenue = [10,20,20,30,50,90,10,40,50,120,210,150,130,100,190]
change_from_previous_year = ['NA','NA','NA',20,30,70,0,20,30,90,160,60,120,60,140]
change_from_forecast = ['NA','NA','NA','NA','NA','NA',-20,-10,-40,'NA','NA','NA',30,-110,40]
d = {'ids':ids, 'year':year, 'run':run, 'revenue':revenue}
df = pd.DataFrame(data=d, columns=['ids','year','run','revenue'])
print(df)
ids year run revenue
0 1 2013 actual 10
1 2 2013 actual 20
2 3 2013 actual 20
3 1 2014 forecast 30
4 2 2014 forecast 50
5 3 2014 forecast 90
6 1 2014 actual 10
7 2 2014 actual 40
8 3 2014 actual 50
9 1 2015 forecast 120
10 2 2015 forecast 210
11 3 2015 forecast 150
12 1 2015 actual 130
13 2 2015 actual 100
14 3 2015 actual 190
....into this:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NA NA
1 2 2013 actual 20 NA NA
2 3 2013 actual 20 NA NA
3 1 2014 forecast 30 20 NA
4 2 2014 forecast 50 30 NA
5 3 2014 forecast 90 70 NA
6 1 2014 actual 10 0 -20
7 2 2014 actual 40 20 -10
8 3 2014 actual 50 30 -40
9 1 2015 forecast 120 90 NA
10 2 2015 forecast 210 160 NA
11 3 2015 forecast 150 60 NA
12 1 2015 actual 130 120 30
13 2 2015 actual 100 60 -110
14 3 2015 actual 190 140 40
EDIT-- I get pretty close with this:
df['prev_year'] = df.groupby(['ids','run']).shift(1)['revenue']
df['chg_from_prev_year'] = df['revenue'] - df['prev_year']
df['curr_forecast'] = df.groupby(['ids','year']).shift(1)['revenue']
df['chg_from_forecast'] = df['revenue'] - df['curr_forecast']
The only thing missed (as expected) is the comparison between 2014 forecast & 2013 actual. I could just duplicate the 2013 run in the dataset, calculate the chg_from_prev_year for 2014 forecast, and hide/delete the unwanted data from the final dataframe.
Firstly, to get the change from the previous year, take the difference from the previous row within each group (the same as subtracting a shift):
In [11]: g = df.groupby(['ids', 'run'])
In [12]: df['chg_from_prev_year'] = g['revenue'].diff()
The next part is more complicated; I think you need a pivot_table:
In [13]: df1 = df.pivot_table('revenue', ['ids', 'year'], 'run')
In [14]: df1
Out[14]:
run       actual  forecast
ids year
1   2013      10       NaN
    2014      10        30
    2015     130       120
2   2013      20       NaN
    2014      40        50
    2015     100       210
3   2013      20       NaN
    2014      50        90
    2015     190       150
In [15]: g1 = df1.groupby(level='ids', as_index=False)
In [16]: out_by = g1.apply(lambda x: x['actual'] - x['forecast'])
In [17]: out_by # hello levels bug, fixed in 0.13/master... yesterday :)
Out[17]:
ids  ids  year
1    1    2013    NaN
          2014    -20
          2015     10
2    2    2013    NaN
          2014    -10
          2015   -110
3    3    2013    NaN
          2014    -40
          2015     40
dtype: float64
These are the results you want, but not in the correct format (see [31] below if you're not too fussed)... The following seems like a bit of a hack (to put it mildly), but here goes:
In [21]: df2 = df.set_index(['ids', 'year', 'run'])
In [22]: out_by.index = out_by.index.droplevel(0)
In [23]: out_by_df = pd.DataFrame(out_by, columns=['revenue'])
In [24]: out_by_df['run'] = 'forecast'
In [25]: df2['chg_from_forecast'] = out_by_df.set_index('run', append=True)['revenue']
and we're done...
In [26]: df2.reset_index()
Out[26]:
ids year run revenue chg_from_prev_year chg_from_forecast
0 1 2013 actual 10 NaN NaN
1 2 2013 actual 20 NaN NaN
2 3 2013 actual 20 NaN NaN
3 1 2014 forecast 30 NaN -20
4 2 2014 forecast 50 NaN -10
5 3 2014 forecast 90 NaN -40
6 1 2014 actual 10 0 NaN
7 2 2014 actual 40 20 NaN
8 3 2014 actual 50 30 NaN
9 1 2015 forecast 120 90 10
10 2 2015 forecast 210 160 -110
11 3 2015 forecast 150 60 40
12 1 2015 actual 130 120 NaN
13 2 2015 actual 100 60 NaN
14 3 2015 actual 190 140 NaN
Note: I think the first 6 results of chg_from_prev_year should be NaN.
However, I think you may be better off keeping it as a pivot:
In [31]: df3 = df.pivot_table(['revenue', 'chg_from_prev_year'], ['ids', 'year'], 'run')
In [32]: df3['chg_from_forecast'] = g1.apply(lambda x: x['actual'] - x['forecast']).values
In [33]: df3
Out[33]:
         revenue           chg_from_prev_year           chg_from_forecast
run       actual  forecast             actual  forecast
ids year
1   2013      10       NaN                NaN       NaN               NaN
    2014      10        30                  0       NaN               -20
    2015     130       120                120        90                10
2   2013      20       NaN                NaN       NaN               NaN
    2014      40        50                 20       NaN               -10
    2015     100       210                 60       160              -110
3   2013      20       NaN                NaN       NaN               NaN
    2014      50        90                 30       NaN               -40
    2015     190       150                140        60                40
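If the flat layout from the question is preferred, both columns can also be sketched with a grouped diff and a merge. This is an alternative to the steps above, not the same method. It assumes the rows are already in year order within each (ids, run) group and that there is at most one forecast per id and year; the names fc and merged are made up for the illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "ids": [1, 2, 3] * 5,
    "year": ["2013"] * 3 + ["2014"] * 6 + ["2015"] * 6,
    "run": ["actual"] * 3 + (["forecast"] * 3 + ["actual"] * 3) * 2,
    "revenue": [10, 20, 20, 30, 50, 90, 10, 40, 50,
                120, 210, 150, 130, 100, 190],
})

# Change against the previous row with the same id and run type.
df["chg_from_prev_year"] = df.groupby(["ids", "run"])["revenue"].diff()

# Attach each (id, year) forecast to every row for that id and year.
fc = (df.loc[df["run"] == "forecast", ["ids", "year", "revenue"]]
        .rename(columns={"revenue": "forecast"}))
merged = df.merge(fc, on=["ids", "year"], how="left")

# Actual minus forecast, reported on the 'actual' rows only.
merged["chg_from_forecast"] = (
    (merged["revenue"] - merged["forecast"]).where(merged["run"] == "actual")
)
print(merged.drop(columns="forecast"))
```

As with the note above, the first forecast of each id has no earlier forecast to compare with, so its chg_from_prev_year comes out as NaN here.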