I am trying to figure out how to calculate the mean value of each row in this pandas pivot table that I have created.
I also want to add the sum of each year at the bottom of the pivot table.
The last step is to take the average value for each month calculated above and divide it by the total average, in order to get the average distribution per year.
import pandas as pd
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2017, 12, 31)
libor = web.DataReader('USD1MTD156N', 'fred', start, end) # Reading the data
libor = libor.dropna(axis=0, how='any') # Dropping the NaN values
libor = libor.resample('M').mean() # Calculating the mean value per month
libor['Month'] = pd.DatetimeIndex(libor.index).month # Adding a Month column
libor['Year'] = pd.DatetimeIndex(libor.index).year # Adding a Year column
pivot = libor.pivot(index='Month', columns='Year', values='USD1MTD156N')
print(pivot)
Any suggestions on how to proceed?
Thank you in advance.
I think this is what you want:
# Mean of each row (average for each month across years)
ave_month = pivot.mean(axis=1)
# Sum of each year, for the bottom of the pivot table
sum_year = pivot.sum(axis=0)
# Average distribution per year
ave_year = sum_year / sum_year.mean()
print(ave_month, '\n', sum_year, '\n', ave_year)
Month
1 0.324729
2 0.321348
3 0.342014
4 0.345907
5 0.345993
6 0.369418
7 0.382524
8 0.389976
9 0.392838
10 0.392425
11 0.406292
12 0.482017
dtype: float64
Year
2011 2.792864
2012 2.835645
2013 2.261839
2014 1.860015
2015 2.407864
2016 5.953718
2017 13.356432
dtype: float64
Year
2011 0.621260
2012 0.630777
2013 0.503136
2014 0.413752
2015 0.535619
2016 1.324378
2017 2.971079
dtype: float64
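If you also want the yearly sums to appear as an actual row at the bottom of the pivot table, rather than as a separate Series, one way to do it (a sketch, using a copy so the original pivot stays intact) is:
# Append the per-year sums as an extra 'Total' row (the label is arbitrary)
pivot_with_total = pivot.copy()
pivot_with_total.loc['Total'] = pivot.sum(axis=0)
print(pivot_with_total)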
I would use pivot_table over pivot, and then use the aggfunc parameter.
pivot = libor.pivot(index='Month', columns='Year', values='USD1MTD156N')
would be
import numpy as np
pivot = libor.pivot_table(index='Month', columns='Year', values='USD1MTD156N', aggfunc=np.mean)
You should also be able to drop the resample statement, if I'm not mistaken.
A link to the docs:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
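For completeness, here is a sketch of the whole pipeline without the explicit resample step (it assumes the same FRED series as in the question; aggfunc='mean' then averages the daily fixings within each Month/Year cell, which is what resample('M').mean() was doing before):
import datetime
import pandas as pd
import pandas_datareader.data as web

start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2017, 12, 31)
libor = web.DataReader('USD1MTD156N', 'fred', start, end).dropna()
libor['Month'] = libor.index.month
libor['Year'] = libor.index.year
# Each (Month, Year) cell becomes the mean of that month's daily values
pivot = libor.pivot_table(index='Month', columns='Year', values='USD1MTD156N', aggfunc='mean')
print(pivot)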
Related
The data have reported values for January 2006 through January 2019. I need to compute the total number of passengers (Passenger_Count) per month. The resulting dataframe should have 121 entries (10 years * 12 months, plus 1 for January 2019), with the range going from 2009 to 2019.
I have been doing:
df.groupby(['ReportPeriod'])['Passenger_Count'].sum()
But it doesn't give me the right result.
You can do
df['ReportPeriod'] = pd.to_datetime(df['ReportPeriod'])
# '%Y-%m' groups by year and month; '%Y-%m-%d' would group by individual day
out = df.groupby(df['ReportPeriod'].dt.strftime('%Y-%m'))['Passenger_Count'].sum()
Try this:
df.index = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")
df = df.groupby(pd.Grouper(freq="M")).sum()
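An alternative that keeps a proper monthly period as the group label (a sketch, assuming ReportPeriod parses cleanly as a date):
df['ReportPeriod'] = pd.to_datetime(df['ReportPeriod'])
out = df.groupby(df['ReportPeriod'].dt.to_period('M'))['Passenger_Count'].sum()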
Data import from csv:
Date        Item_1  Item_2
1990-01-01  34      78
1990-01-02  42      19
...         ...     ...
2020-12-31  41      23
df = pd.read_csv(r'Insert file directory')
df.index = pd.to_datetime(df.index)
gb= df.groupby([(df.index.year),(df.index.month)]).mean()
Issue:
So basically, the requirement is to group the data by year and month before processing, and I thought that groupby would group the data so that mean() calculates the average of all values grouped under Jan-1990, Feb-1990 and so on. However, I was wrong: the output is the average of all values under Item_1.
My example is similar to the post below, but in my case it is calculating the mean. I am guessing that it has to do with the way the data is arranged after groupby, or that some parameters in mean() have to be specified, but I have no idea which is the cause. Can someone enlighten me on how to correct the code?
Pandas groupby month and year
Update:
Hi all, I have created a sample .csv data file with 3 items and 3 months of data. I am wondering if the cause has to do with the conversion of the data into a dataframe when it is imported from the .csv, because I have noticed some weird time data in the leftmost column.
Link to sample file is:
https://www.mediafire.com/file/t81wh3zem6vf4c2/test.csv/file
import pandas as pd
df = pd.read_csv('test.csv', index_col='date')
df.index = pd.to_datetime(df.index)
df.groupby([(df.index.year), (df.index.month)]).mean()
Seems to do the trick with the provided data.
IIUC, you want the mean over all elements in each group. You can use numpy's mean function, which operates on the flattened array by default:
import numpy as np
df.index = pd.to_datetime(df.index, format='%d/%m/%Y')
gb = df.groupby([(df.index.year), (df.index.month)]).apply(lambda d: np.mean(d.values))
output:
date date
1990 1 0.563678
2 0.489105
3 0.459131
4 0.755165
5 0.424466
6 0.523857
7 0.612977
8 0.396031
9 0.452538
10 0.527063
11 0.397951
12 0.600371
dtype: float64
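An equivalent route (a sketch; it assumes every item column has a value for every date, as in the sample file) is to stack the item columns into one long Series and then group by year and month:
# Stack the items into a single Series indexed by (date, item)
long = df.stack()
dates = long.index.get_level_values(0)
gb = long.groupby([dates.year, dates.month]).mean()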
Hello StackOverflow Community,
I have been interested in calculating anomalies for data in pandas 1.2.0 using Python 3.9.1 and Numpy 1.19.5, but have been struggling to figure out the most "Pythonic" and "pandas" way to complete this task (or any way, for that matter). Below I have created some dummy data and put it into a pandas DataFrame. In addition, I have tried to clearly outline my methodology for calculating monthly anomalies for the dummy data.
What I am trying to do is take "n" years of monthly values (in this example, 2 years of monthly data = 25 months) and calculate monthly averages across all years (for example, group all the January values together and calculate their mean). I have been able to do this using pandas.
Next, I would like to take each monthly average and subtract it from all elements in my DataFrame that fall into that specific month (for example, subtract the overall January mean from each January value). In the code below you will see some lines that attempt to do this subtraction, but to no avail.
If anyone has any thoughts or tips on what may be a good way to approach this, I would really appreciate your insight. If you require further clarification, let me know. Thanks for your time and thoughts.
-Marian
#Import packages
import numpy as np
import pandas as pd
#-------------------------------------------------------------
#Create a pandas dataframe with some data that will represent:
#Column of dates for two years, at monthly resolution
#Column of corresponding values for each date.
#Create two years worth of monthly dates
dates = pd.date_range(start='2018-01-01', end='2020-01-01', freq='MS')
#Create some random data that will act as our data that we want to compute the anomalies of
values = np.random.randint(0,100,size=25)
#Put our dates and values into a dataframe to demonstrate how we have tried to calculate our anomalies
df = pd.DataFrame({'Dates': dates, 'Values': values})
#-------------------------------------------------------------
#Anomalies will be computed by finding the mean value of each month over all years
#And then subtracting that monthly mean from each element that falls in that particular month
#Group our df according to the month of each entry and calculate monthly mean for each month
monthly_means = df.groupby(df['Dates'].dt.month).mean()
#-------------------------------------------------------------
#Now, how do we go about subtracting these grouped monthly means from each element that falls
#in the corresponding month.
#For example, if the monthly mean over 2 years for January is 20 and the value is 21 in January 2018, the anomaly would be +1 for January 2018
#Example lines of code I have tried, but have not worked
#ValueError:Unable to coerce to Series, length must be 1: given 12
#anomalies = socal_csv.groupby(socal_csv['Date'].dt.month) - monthly_means
#TypeError: unhashable type: "list"
#anomalies = socal_csv.groupby(socal_csv['Date'].dt.month).transform([np.subtract])
You can use pd.merge like this:
import numpy as np
import pandas as pd
dates = pd.date_range(start='2018-01-01', end='2020-01-01', freq='MS')
values = np.random.randint(0,100,size=25)
df = pd.DataFrame({'Dates': dates, 'Values': values})
monthly_means = df.groupby(df['Dates'].dt.month)['Values'].mean().reset_index()
monthly_means = monthly_means.rename(columns={'Dates': 'month', 'Values': 'Mean'})
df['month'] = df['Dates'].dt.month
df = df.merge(monthly_means, on='month', how='left')
df['Diff'] = df['Mean'] - df['Values']
output:
df['Diff']
Out[19]:
0 33.333333
1 19.500000
2 -29.500000
3 -22.500000
4 -24.000000
5 -3.000000
6 10.000000
7 2.500000
8 14.500000
9 -17.500000
10 44.000000
11 31.000000
12 -11.666667
13 -19.500000
14 29.500000
15 22.500000
16 24.000000
17 3.000000
18 -10.000000
19 -2.500000
20 -14.500000
21 17.500000
22 -44.000000
23 -31.000000
24 -21.666667
You can use abs() if you want the absolute difference. Note that Diff here is mean minus value; flip the subtraction if you want value minus mean, as in your January example.
A one-line solution is:
df = pd.DataFrame({'Values': values}, index=dates)
df.groupby(df.index.month).transform(lambda x: x-x.mean())
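If you prefer to keep Dates as a regular column, as in the dummy DataFrame above, the same transform idea can attach the anomaly directly (a sketch; the Anomaly column name is just illustrative):
# Anomaly = value minus that month's mean across all years
df['Anomaly'] = df['Values'] - df.groupby(df['Dates'].dt.month)['Values'].transform('mean')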
In the dataframe below (a small snippet is shown; the actual dataframe spans 2000 to 2014), I want to compute the annual average, but starting in September of one year and running only until May of the next year.
Cnt Year JD Min_Temp
S 2000 1 277.139
S 2000 2 274.725
S 2001 1 270.945
S 2001 2 271.505
N 2000 1 257.709
N 2000 2 254.533
N 2000 3 258.472
N 2001 1 255.763
I can compute annual average (Jan - Dec) using this code:
df['Min_Temp'].groupby(df['Year']).mean()
How do I adapt this code to average from September of one year to May of the next?
--EDIT: Based on comments below, you can assume that a MONTH column is also available, specifying the month for each row
Not sure which column refers to the month, or if it is missing, but in the past I've used a quick and dirty method to assign custom seasons (I'd be interested if anyone has found a more elegant route).
I've used Yahoo Finance data to demonstrate the approach, unless one of your columns is Month?
EDIT: Requires the dataframe to be sorted by date, ascending.
import datetime
import pandas as pd
import pandas_datareader.data as web  # pandas.io.data has been removed; pandas-datareader replaces it

start = datetime.datetime(2010, 9, 1)
end = datetime.datetime(2015, 5, 31)
df = web.DataReader("F", 'yahoo', start, end)
# Ensure date sorted -- required
df = df.sort_index()
# Identify custom seasons and set months June-August to null
count = 0
season = 1
for i, row in df.iterrows():
    if i.month in [9, 10, 11, 12, 1, 2, 3, 4, 5]:
        if count == 1:
            season += 1
        df.loc[i, 'season'] = season  # set_value is gone in current pandas; .loc does the same job
        count = 0
    else:
        count = 1
        df.loc[i, 'season'] = None
# New data frame excluding months June-August
df_data = df[~df['season'].isnull()]
df_data['Adj Close'].groupby(df_data.season).mean()
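A vectorised alternative that works directly from the Year and Month columns in the question (a sketch; the Month and Min_Temp column names are assumptions based on the snippet and the edit) is to label each September-May stretch with the year it ends in and group on that:
# Keep only September-May rows
mask = df['Month'].isin([9, 10, 11, 12, 1, 2, 3, 4, 5])
# September-December rows count toward the season that ends the following year
season = df['Year'] + (df['Month'] >= 9).astype(int)
df.loc[mask, 'Min_Temp'].groupby(season[mask]).mean()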
Suppose I have a dataframe indexed by monthly timesteps. I know I can use dataframe.groupby(lambda x: x.year) to group the monthly data into years and apply other operations. Is there some way I could quickly group them by, say, decade?
Thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have an index on which you can call .year directly, you could still pass lambda x: (x.year // 10) * 10 to groupby.
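For instance (a sketch reusing the df above; groupby applies the lambda to each index label, so each label only needs a .year attribute):
>>> df.groupby(lambda ts: (ts.year // 10) * 10).sum()
This gives the same result as the (x//10)*10 version above.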
If your DataFrame has headers, say ['Population', 'Salary', 'Vehicle count'],
make Year your index: DataFrame = DataFrame.set_index('Year')
Then use the code below to resample the data into 10-year decades; it also gives you the sum of all the other columns within each decade:
dataframe = dataframe.resample('10AS').sum()
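Note that resample needs a datetime-like index, so if Year is stored as a plain integer you would first convert it (a sketch; the Year column name is an assumption):
# Turn the integer Year column into a DatetimeIndex before resampling
dataframe.index = pd.to_datetime(dataframe['Year'].astype(str), format='%Y')
decades = dataframe.drop(columns='Year').resample('10AS').sum()
Also be aware that the '10AS' bins start at the first year in the data, so they may not line up with calendar decades the way the (year // 10) * 10 trick does.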
Use the year attribute of index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group it up:
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: iloc[:, 0] here chooses the first column in your dataframe.
You can find the various offset aliases here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases