How to group by month and year from a specific range? - python

The data have reported values for January 2006 through January 2019. I need to compute the total number of passengers (Passenger_Count) per month. The dataframe should have 121 entries (10 years * 12 months, plus 1 for January 2019). The range should go from 2009 to 2019.
I have been doing:
df.groupby(['ReportPeriod'])['Passenger_Count'].sum()
But it doesn't give me the right result, it gives

You can do
df['ReportPeriod'] = pd.to_datetime(df['ReportPeriod'])
out = df.groupby(df['ReportPeriod'].dt.strftime('%Y-%m'))['Passenger_Count'].sum()

Try this:
df.index = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")
df = df.groupby(pd.Grouper(freq="M")).sum()
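Neither answer restricts the data to the 2009 to 2019 range from the question. Here is a minimal sketch of one way to do that, assuming ReportPeriod holds month/day/year strings; the sample rows are made up for illustration:

```python
import pandas as pd

# Hypothetical sample data: one row per report, dates as month/day/year strings.
df = pd.DataFrame({
    "ReportPeriod": ["12/01/2008", "01/01/2009", "01/01/2009", "02/01/2009", "01/01/2019"],
    "Passenger_Count": [100, 10, 20, 5, 7],
})
df["ReportPeriod"] = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")

# Keep only January 2009 through January 2019.
in_range = df[(df["ReportPeriod"] >= "2009-01-01") & (df["ReportPeriod"] < "2019-02-01")]

# Group on a monthly period so every date within a month lands in the same bucket.
monthly = in_range.groupby(in_range["ReportPeriod"].dt.to_period("M"))["Passenger_Count"].sum()

# Reindex over the full range so months without any rows still appear (as 0).
all_months = pd.period_range("2009-01", "2019-01", freq="M")
monthly = monthly.reindex(all_months, fill_value=0)

print(len(monthly))
```

With the reindex step the result always has one entry per month in the range, which is where the 121 entries come from.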

Related

How to produce a new data frame of mean monthly data, given a data frame consisting of daily data?

I have a data frame containing the daily CO2 data since 2015, and I would like to produce the monthly mean data for each year, then put this into a new data frame. A sample of the data frame I'm using is shown below.
month day cycle trend
year
2011 1 1 391.25 389.76
2011 1 2 391.29 389.77
2011 1 3 391.32 389.77
2011 1 4 391.36 389.78
2011 1 5 391.39 389.79
... ... ... ... ...
2021 3 13 416.15 414.37
2021 3 14 416.17 414.38
2021 3 15 416.18 414.39
2021 3 16 416.19 414.39
2021 3 17 416.21 414.40
I plan on using something like the code below to create the new monthly mean data frame, but the main problem I'm having is indicating the specific subset for each month of each year so that the mean can then be taken for this. If I could highlight all of the year "2015" for the month "1" and then average this etc. that might work?
Any suggestions would be hugely appreciated and if I need to make any edits please let me know, thanks so much!
dfs = list()
for l in L:
dfs.append(refined_data[index = 2015, "month" = 1. day <=31].iloc[l].mean(axis=0))
mean_matrix = pd.concat(dfs, axis=1).T
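The loop above is not needed if the grouping is left to pandas. A minimal sketch, assuming the frame has "year" as its index and "month", "day", "cycle" and "trend" as columns, as in the sample; the values here are invented:

```python
import pandas as pd

# Hypothetical stand-in for refined_data: "year" is the index, as in the sample.
refined_data = pd.DataFrame(
    {
        "month": [1, 1, 2, 2, 1],
        "day": [1, 2, 1, 2, 1],
        "cycle": [391.0, 393.0, 392.0, 394.0, 400.0],
        "trend": [389.0, 391.0, 390.0, 392.0, 398.0],
    },
    index=pd.Index([2015, 2015, 2015, 2015, 2016], name="year"),
)

# Group on the index level "year" together with the "month" column,
# then average the measurement columns within each (year, month) pair.
monthly_mean = (
    refined_data.groupby(["year", "month"])[["cycle", "trend"]]
    .mean()
    .reset_index()
)

print(monthly_mean)
```

Each row of monthly_mean is one month of one year, so no manual subsetting by year and month is required.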

Dataframe groupby to new dataframe

I have a table as below.
Month,Count,Parameter
March 2015,1,40
March 2015,1,10
March 2015,1,1
March 2015,1,25
March 2015,1,50
April 2015,1,15
April 2015,1,1
April 2015,1,1
April 2015,1,15
April 2015,1,15
I need to create a new table from above as shown below.
Unique Month,Total Count,<=30
March 2015,5,3
April 2015,5,5
The logic for new table is as follows. "Unique Month" column is unique month from original table and needs to sorted. "Total Count" is sum of "Count" column from original table for that particular month. "<=30" column is count of "Parameter <= 30" for that particular month.
Is there an easy way to do this in dataframes?
Thanks in advance.
IIUC, just check for Parameter <= 30 and then groupby:
(df.assign(le_30=df.Parameter.le(30))
.groupby('Month', as_index=False) # pass sort=False if needed
[['Count','le_30']].sum()
)
Or
(df.Parameter.le(30)
.groupby(df['Month']) # pass sort=False if needed
.agg(['count','sum'])
)
Output:
Month Count le_30
0 April 2015 5 5.0
1 March 2015 5 3.0
Update: as commented above, adding sort=False to groupby will respect your original sorting of Month. For example:
(df.Parameter.le(30)
.groupby(df['Month'], sort=False)
.agg(['count','sum'])
.reset_index()
)
Output:
Month count sum
0 March 2015 5 3.0
1 April 2015 5 5.0

Cumulative sum of all previous values

A similar question has been asked for cumsum and grouping but it didn't solve my case.
I have a financial balance sheet of a lot of years and need to sum all previous values by year.
This is my reproducible set:
df=pd.DataFrame(
{"Amount": [265.95,2250.00,-260.00,-2255.95,120],
"Year": [2018,2018,2018,2019,2019]})
The result I want is the following:
Year Amount
2017 0
2018 2255.95
2019 120.00
2020 120.00
So actually in a loop going from the lowest year in my whole set to the highest year in my set.
...
df[df.Year<=2017].Amount.sum()
df[df.Year<=2018].Amount.sum()
df[df.Year<=2019].Amount.sum()
df[df.Year<=2020].Amount.sum()
...
First aggregate the sum per year, then apply Series.cumsum. Series.reindex with method='ffill' adds every year in the range and carries the last running total forward, and fillna(0) sets the years before the first record to 0:
years = range(2017, 2021)
df1 = (df.groupby('Year')['Amount']
.sum()
.cumsum()
.reindex(years, method='ffill')
.fillna(0)
.reset_index())
print (df1)
Year Amount
0 2017 0.00
1 2018 2255.95
2 2019 120.00
3 2020 120.00

Python Pandas Pivot Table Calculations

I am trying to figure out how to calculate the mean values for each row in this Python Pandas Pivot table that I have created.
I also want to add the sum of each year at the bottom of the pivot table.
The last step I want to do is to take the average value for each month calculated above and divide it with the total average in order to get the average distribution per year.
import pandas as pd
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2017, 12, 31)
libor = web.DataReader('USD1MTD156N', 'fred', start, end) # Reading the data
libor = libor.dropna(axis=0, how= 'any') # Dropping the NAN values
libor = libor.resample('M').mean() # Calculating the mean value per date
libor['Month'] = pd.DatetimeIndex(libor.index).month # Adding month value after each
libor['Year'] = pd.DatetimeIndex(libor.index).year # Adding year value after each
pivot = libor.pivot(index='Month',columns='Year',values='USD1MTD156N')
print pivot
Any suggestions how to proceed?
Thank you in advance
I think this is what you want (This is on python3 - I think only the print command is different in this script):
# Mean of each row
ave_month = pivot.mean(1)
#sum of each year at the bottom of the pivot table.
sum_year = pivot.sum(0)
# average distribution per year.
ave_year = sum_year/sum_year.mean()
print(ave_month, '\n', sum_year, '\n', ave_year)
Month
1 0.324729
2 0.321348
3 0.342014
4 0.345907
5 0.345993
6 0.369418
7 0.382524
8 0.389976
9 0.392838
10 0.392425
11 0.406292
12 0.482017
dtype: float64
Year
2011 2.792864
2012 2.835645
2013 2.261839
2014 1.860015
2015 2.407864
2016 5.953718
2017 13.356432
dtype: float64
Year
2011 0.621260
2012 0.630777
2013 0.503136
2014 0.413752
2015 0.535619
2016 1.324378
2017 2.971079
dtype: float64
I would use pivot_table over pivot, and then use the aggfunc parameter.
pivot = libor.pivot(index='Month',columns='Year',values='USD1MTD156N')
would be
import numpy as np
pivot = libor.pivot_table(index='Month',columns='Year',values='USD1MTD156N', aggfunc=np.mean)
You should be able to drop the resample statement also if I'm not mistaken
A link to the docs:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
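As a rough sketch of that suggestion, the example below builds the pivot straight from daily values without resample. The data are invented stand-ins for the FRED download, and the last line is only one reading of the "average distribution" step in the question:

```python
import pandas as pd

# Hypothetical daily series standing in for the FRED download.
dates = pd.to_datetime(
    ["2011-01-03", "2011-01-04", "2011-02-01", "2012-01-03", "2012-02-01"]
)
libor = pd.DataFrame({"USD1MTD156N": [0.2, 0.4, 0.3, 0.5, 0.7]}, index=dates)
libor["Month"] = libor.index.month
libor["Year"] = libor.index.year

# pivot_table averages the daily values within each (Month, Year) cell.
pivot = libor.pivot_table(
    index="Month", columns="Year", values="USD1MTD156N", aggfunc="mean"
)

month_mean = pivot.mean(axis=1)  # mean of each row (month)
year_sum = pivot.sum(axis=0)     # sum of each column (year)

# Assumed interpretation: each month's mean relative to the overall monthly mean.
month_share = month_mean / month_mean.mean()

print(pivot)
```

Passing the aggregation as the string "mean" avoids the numpy import.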

Extract Quarterly Data from Multi Quarter Periods

Public companies in the US make quarterly filings (10-Q) and yearly filings (10-K). In most cases they will file three 10Qs per year and one 10K.
In most cases, the quarterly filings (10Qs) contain quarterly data. For example, "revenue for the three months ending March 31, 2005."
The yearly filings will often only have year end sums. For example: "revenue for the twelve months ending December 31, 2005."
In order to get the value for Q4 of 2005, I need to take the yearly data and subtract the values for each of the quarters (Q1-Q3).
In some cases, each of the quarterly data is expressed as year to date. For example, the first quarterly filing is "revenue for the three months ending March 31, 2005." The second is "revenue for the six months ending June 30, 2005." The third "revenue for the nine months ending September 30, 2005." The yearly is like above, "revenue for the twelve months ending December 31, 2005." This represents a generalization of the above issues in which the desire is to extract quarterly data which can be accomplished by repeated subtraction of the previous period data.
My question is what is the best way in pandas to accomplish this quarterly data extraction?
There are a large number of fields (revenue, profit, expenses, etc.) per period.
A related question that I asked in regards to how to express this period data in pandas: Creating Period for Multi Quarter Timespan in Pandas
Here is some example data of the first problem (three 10Qs and one 10K which only has year end data):
10Q:
http://www.sec.gov/Archives/edgar/data/1174922/000119312512225309/d326512d10q.htm#tx326512_4
http://www.sec.gov/Archives/edgar/data/1174922/000119312512347659/d360762d10q.htm#tx360762_3
http://www.sec.gov/Archives/edgar/data/1174922/000119312512463380/d411552d10q.htm#tx411552_3
10K:
http://www.sec.gov/Archives/edgar/data/1174922/000119312513087674/d459372d10k.htm#tx459372_29
Calcbench refers to this problem: http://www.calcbench.com/Home/userGuide: "Q4 calculation: Companies often do not report Q4 data, rather opting to report full year data instead. We’ll automatically calculate it for you. Data in blue is calculated."
There will be multiple years of data and for each year I want to calculate the missing fourth quarter:
2012Q2 2012Q3 2012Y 2013Q1 2013Q2 2013Q3 2013Y
Revenue 1 1 1 1 1 1 1
Expense 10 10 10 10 10 10 10
You could define a function to subtract the quarterly totals from the annual number, and then apply the function to each row, storing the result in a new column.
In [2]: df
Out[2]:
Annual Q1 Q2 Q3
Revenue 18 3 4 5
Expense 17 2 3 4
In [3]: def calc_Q4(row):
...: return row['Annual'] - row['Q1'] - row['Q2'] - row['Q3']
In [4]: df['Q4'] = df.apply(calc_Q4, axis = 1)
In [5]: df
Out[5]:
Annual Q1 Q2 Q3 Q4
Revenue 18 3 4 5 6
Expense 17 2 3 4 8
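For the year-to-date case described in the question, where each filing reports a cumulative figure, differencing along the columns is one possible approach. A minimal sketch; the column labels and numbers are hypothetical and chosen to match the table above:

```python
import pandas as pd

# Hypothetical year-to-date figures for the 3, 6, 9 and 12 months ending each quarter.
ytd = pd.DataFrame(
    {"3M": [3, 2], "6M": [7, 5], "9M": [12, 9], "12M": [18, 17]},
    index=["Revenue", "Expense"],
)

# Differencing along the columns recovers each standalone quarter.
# The first column has no predecessor, so it is restored from the original.
quarterly = ytd.diff(axis=1)
quarterly["3M"] = ytd["3M"]
quarterly.columns = ["Q1", "Q2", "Q3", "Q4"]

print(quarterly)
```

This assumes the columns are ordered in time and cover a single fiscal year; with several years in one frame, the differencing would have to be done per year.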
I work for Calcbench.
I wrote an API for Calcbench and have example of getting SEC data into Pandas dataframes, https://www.calcbench.com/home/api.
You would need to sign up for Calcbench to use it.
