In the dataframe below (small snippet shown; the actual dataframe spans 2000 to 2014), I want to compute an annual average, but one starting in September of one year and running only through May of the next year.
Cnt Year JD Min_Temp
S 2000 1 277.139
S 2000 2 274.725
S 2001 1 270.945
S 2001 2 271.505
N 2000 1 257.709
N 2000 2 254.533
N 2000 3 258.472
N 2001 1 255.763
I can compute the annual (Jan-Dec) average using this code:
df['Min_Temp'].groupby(df['Year']).mean()
How do I adapt this code to average from September of one year to May of the next?
--EDIT: Based on comments below, you can assume that a MONTH column is also available, specifying the month for each row
Not sure which column refers to the month, or if it is missing, but in the past I've used a quick-and-dirty method to assign custom seasons (I'd be interested if anyone has found a more elegant route).
I've used Yahoo Finance data to demonstrate the approach, unless one of your columns is the month?
EDIT: Requires the dataframe to be sorted by date ascending.
import datetime
import pandas as pd
import pandas_datareader.data as web  # pandas.io.data was removed; use the pandas-datareader package
start = datetime.datetime(2010, 9, 1)
end = datetime.datetime(2015, 5, 31)
df = web.DataReader("F", 'yahoo', start, end)
# Ensure date sorted -- required
df = df.sort_index()
# Identify the custom season; months June-August stay null
df['season'] = None
count = 0
season = 1
for i, row in df.iterrows():
    if i.month in [9, 10, 11, 12, 1, 2, 3, 4, 5]:
        if count == 1:
            season += 1
        df.at[i, 'season'] = season  # set_value was removed; .at is the replacement
        count = 0
    else:
        count = 1
# New data frame excluding months June-August
df_data = df[~df['season'].isnull()]
df_data['Adj Close'].groupby(df_data.season).mean()
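If a MONTH column is available, as the question's edit says, the season labeling needs no loop at all. Here is a sketch under the assumption that the columns are named Year, MONTH, and Min_Temp (MONTH holding month numbers); the toy values below are made up for illustration:

```python
import pandas as pd

# Toy rows standing in for the question's data (Year, MONTH, Min_Temp
# are assumed column names; MONTH is the month number)
df = pd.DataFrame({
    'Year':     [2000, 2000, 2000, 2001, 2001, 2001],
    'MONTH':    [9, 10, 5, 1, 9, 12],
    'Min_Temp': [277.1, 274.7, 270.9, 271.5, 255.7, 258.4],
})

# Keep only September-May rows
df = df[df['MONTH'].isin([9, 10, 11, 12, 1, 2, 3, 4, 5])].copy()

# Label each row with the year its season started: Sep-Dec keep their own
# year, Jan-May belong to the season that began the previous September
df['season'] = df['Year'].where(df['MONTH'] >= 9, df['Year'] - 1)

result = df.groupby('season')['Min_Temp'].mean()
print(result)
```

The `where` trick replaces the row-by-row loop: each Sep-May season is identified by its starting year, so a plain groupby gives the seasonal means.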
I'm trying to create a dataframe using pandas that counts the number of engaged, repeaters, and inactive customers for a company based on a JSON file with the transaction data.
For context, the columns of the new dataframe would be each month from Jan to June, while the rows are:
Repeater (customers who purchased in the current and previous month)
Inactive (customers in all transactions over the months including the current month who have purchased in previous months but not the current month)
Engaged (customers in all transactions over the months including the current month who have purchased in every month)
Hence, I've written code that first fetches the month of each transaction based on the provided transaction date for each record in the JSON. Then, it creates another month column ("month_no") which contains the number of the month in which the transaction was made. Next, a function is defined with the metrics to apply to each group, and it is applied to a dataframe grouped by name.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_json('data/data.json')
df = (df.astype({'transaction_date': 'datetime64'}).assign(month=lambda x: x['transaction_date'].dt.month_name()))
months = {'January': 1, 'February': 2, 'March': 3, 'April': 4, 'May': 5, 'June': 6}
df['month_no'] = df['month'].map(months)
df = df.set_flags(allows_duplicate_labels=False)
def grpProc(grp):
wrk = pd.DataFrame({'engaged': grp.drop_duplicates().sort_values('month_no').set_index('month_no').reindex(months).name.notna()})
wrk['inactive'] = ~wrk.engaged
wrk['repeaters'] = wrk.engaged & wrk.engaged.shift()
return wrk
act = df.groupby('name').apply(grpProc)
result = act.groupby(level=1).sum().astype(int).T
result.columns = months.keys()
However, this code produces these errors:
FutureWarning: reindexing with a non-unique Index is deprecated and will raise in a future version.
wrk = pd.DataFrame({'engaged': grp.drop_duplicates().sort_values('month_no').set_index('month_no').reindex(months.values()).name.notna()})
...
ValueError: cannot reindex on an axis with duplicate labels
It highlights the line:
act = df.groupby('name').apply(grpProc)
For your reference, here are the important columns of the dataframe and some dummy data:
Name | Purchase Month
Mark | March
John | January
Luke | March
John | March
Mark | January
Mark | February
Luke | February
John | January
The goal is to create a pivot table based on the above table by counting the repeaters, inactive, and engaged members:
Status    | January | February | March
Repeaters | 0       | 1        | 2
Inactive  | 1       | 1        | 0
Engaged   | 2       | 1        | 1
How do you do this and fix the error? If you have a completely different solution that works, please share it as well.
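One way to get the expected pivot without reindexing on duplicate labels is to build a boolean customer-by-month purchase matrix with pd.crosstab and derive the three statuses from it. This is a sketch using the dummy data; the column names name and month_no are assumptions carried over from the question's code:

```python
import pandas as pd

# Dummy data from the question, one row per transaction
df = pd.DataFrame({
    'name':     ['Mark', 'John', 'Luke', 'John', 'Mark', 'Mark', 'Luke', 'John'],
    'month_no': [3, 1, 3, 3, 1, 2, 2, 1],
})
months = [1, 2, 3]

# Boolean purchase matrix: one row per customer, one column per month
active = (pd.crosstab(df['name'], df['month_no'])
            .reindex(columns=months, fill_value=0)
            .astype(bool))

result = pd.DataFrame(index=['Repeaters', 'Inactive', 'Engaged'], columns=months)
for i, m in enumerate(months):
    cur = active[m]
    # Purchased both this month and the previous one
    result.loc['Repeaters', m] = int((cur & active[months[i - 1]]).sum()) if i else 0
    # Appears in the transactions but did not purchase this month
    result.loc['Inactive', m] = int((~cur).sum())
    # Purchased in every month up to and including this one
    result.loc['Engaged', m] = int(active[months[:i + 1]].all(axis=1).sum())
print(result)
```

Because crosstab counts transactions, duplicate (name, month) rows simply raise the count and then collapse to True, so the duplicate-label problem never arises.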
The data have reported values from January 2006 through January 2019. I need to compute the total Passenger_Count per month, restricted to the range 2009 to 2019, so the result should have 121 entries (10 years * 12 months, plus 1 for January 2019).
I have been doing:
df.groupby(['ReportPeriod'])['Passenger_Count'].sum()
But it doesn't give me the right result.
You can do
df['ReportPeriod'] = pd.to_datetime(df['ReportPeriod'])
out = df.groupby(df['ReportPeriod'].dt.strftime('%Y-%m'))['Passenger_Count'].sum()
Try this:
df.index = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")
df = df.groupby(pd.Grouper(freq="M"))["Passenger_Count"].sum()
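For reference, a minimal runnable version of the Grouper approach on toy data; the dates, counts, and the %m/%d/%Y format are assumptions, while ReportPeriod and Passenger_Count match the question's column names:

```python
import pandas as pd

# Toy stand-in for the report data; values are made up
df = pd.DataFrame({
    'ReportPeriod': ['01/15/2019', '01/20/2019', '02/03/2019'],
    'Passenger_Count': [100, 50, 75],
})

# Index by date, then group into calendar-month buckets
df.index = pd.to_datetime(df['ReportPeriod'], format='%m/%d/%Y')
monthly = df.groupby(pd.Grouper(freq='M'))['Passenger_Count'].sum()
print(monthly)
```

Selecting the Passenger_Count column before summing avoids aggregating unrelated numeric columns along the way.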
I have a dataframe that has records from 2011 to 2018. One of the columns has the drop_off_date, which is the date when the customer left the rewards program. For each month between 2011 and 2018, I want to count how many people dropped off during that month. So for the 84-month period, I want the count of people who dropped off, using the drop_off_date column.
I changed the column to datetime, and I know I can use the .agg and .count methods, but I am not sure how to count per month. I honestly do not know what the next step would be.
Example of the data:
Record ID | store ID | drop_off_date
a1274c212| 12876| 2011-01-27
a1534c543| 12877| 2011-02-23
a1232c952| 12877| 2018-12-02
The result should look like this:
Month: | #of dropoffs:
Jan 2011 | 15
........
Dec 2018 | 6
What I suggest is to work directly with the strings in drop_off_date and strip them into a new drop_off_ym column that keeps only the year and month:
df['drop_off_ym'] = df.drop_off_date.apply(lambda x: x[:-3])
Then you apply a groupby on the newly created column, followed by a count():
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
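As a concrete sketch of this string-stripping approach on the question's sample rows (this assumes drop_off_date is still a string column, not yet a datetime):

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    'Record ID': ['a1274c212', 'a1534c543', 'a1232c952'],
    'store ID': [12876, 12877, 12877],
    'drop_off_date': ['2011-01-27', '2011-02-23', '2018-12-02'],
})

# '2011-01-27' -> '2011-01': drop the last three characters (the day part)
df['drop_off_ym'] = df['drop_off_date'].str[:-3]
df_counts_by_month = df.groupby('drop_off_ym')['store ID'].count()
print(df_counts_by_month)
```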
Using your data,
I'm assuming your date has been cast to a datetime value, with errors='coerce' used to handle outliers.
You should then drop any NAs so you're only dealing with customers who dropped off.
You can do this in a multitude of ways; I would use a simple df.dropna(subset=['drop_off_date']).
print(df)
Record ID store ID drop_off_date
0 a1274c212 12876 2011-01-27
1 a1534c543 12877 2011-02-23
2 a1232c952 12877 2018-12-02
Let's create a month column to use as an aggregate:
df['Month'] = df['drop_off_date'].dt.strftime('%b')
Then we can do a simple groupby with a count of Record ID (assuming you only want to count unique IDs):
df1 = df.groupby(df['Month'])['Record ID'].count().reset_index()
print(df1)
Month Record ID
0 Dec 1
1 Feb 1
2 Jan 1
EDIT: To account for years,
first let's create a year helper column:
df['Year'] = df['drop_off_date'].dt.year
df1 = df.groupby(['Month', 'Year'])['Record ID'].count().reset_index()
print(df1)
Month Year Record ID
0 Dec 2018 1
1 Feb 2011 1
2 Jan 2011 1
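A shorter alternative that also keeps the months in chronological order is to group on a monthly period rather than month-name strings. A sketch on the sample data:

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    'Record ID': ['a1274c212', 'a1534c543', 'a1232c952'],
    'store ID': [12876, 12877, 12877],
    'drop_off_date': ['2011-01-27', '2011-02-23', '2018-12-02'],
})
df['drop_off_date'] = pd.to_datetime(df['drop_off_date'])

# One bucket per calendar month; periods sort chronologically,
# unlike month-name strings such as 'Feb'
dropoffs = df.groupby(df['drop_off_date'].dt.to_period('M'))['Record ID'].count()
print(dropoffs)
```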
I am trying to figure out how to calculate the mean values for each row in this Python Pandas Pivot table that I have created.
I also want to add the sum of each year at the bottom of the pivot table.
The last step I want to do is to take the average value for each month calculated above and divide it by the total average, in order to get the average distribution per year.
import pandas as pd
import pandas_datareader.data as web
import datetime
start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2017, 12, 31)
libor = web.DataReader('USD1MTD156N', 'fred', start, end) # Reading the data
libor = libor.dropna(axis=0, how= 'any') # Dropping the NAN values
libor = libor.resample('M').mean() # Calculating the mean value per date
libor['Month'] = pd.DatetimeIndex(libor.index).month # Adding month value after each
libor['Year'] = pd.DatetimeIndex(libor.index).year # Adding month value after each
pivot = libor.pivot(index='Month',columns='Year',values='USD1MTD156N')
print pivot
Any suggestions on how to proceed? Thank you in advance.
I think this is what you want (this is on Python 3 - I think only the print statement is different in this script):
# Mean of each row
ave_month = pivot.mean(1)
#sum of each year at the bottom of the pivot table.
sum_year = pivot.sum(0)
# average distribution per year.
ave_year = sum_year/sum_year.mean()
print(ave_month, '\n', sum_year, '\n', ave_year)
Month
1 0.324729
2 0.321348
3 0.342014
4 0.345907
5 0.345993
6 0.369418
7 0.382524
8 0.389976
9 0.392838
10 0.392425
11 0.406292
12 0.482017
dtype: float64
Year
2011 2.792864
2012 2.835645
2013 2.261839
2014 1.860015
2015 2.407864
2016 5.953718
2017 13.356432
dtype: float64
Year
2011 0.621260
2012 0.630777
2013 0.503136
2014 0.413752
2015 0.535619
2016 1.324378
2017 2.971079
dtype: float64
I would use pivot_table over pivot, and then use the aggfunc parameter.
pivot = libor.pivot(index='Month',columns='Year',values='USD1MTD156N')
would become
import numpy as np
pivot = libor.pivot_table(index='Month', columns='Year', values='USD1MTD156N', aggfunc=np.mean)
You should also be able to drop the resample statement, if I'm not mistaken.
A link to the docs:
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html
I am trying to obtain rolling sum on a data frame after multiple levels of grouping:
import pandas as pd
import numpy as np
year_vec = np.arange(2000, 2005)
month_vec = np.arange(1, 4)
soln_list = []
firmList = [61, 62, 63]
firmId = []
year_month = []
year = []
month = []
for firmIndex in range(len(firmList)):
    for yearIndex in range(len(year_vec)):
        for monthIndex in range(len(month_vec)):
            soln_list.append("soln_%s_%s_%s" % (firmList[firmIndex], year_vec[yearIndex], month_vec[monthIndex]))
            firmId.append(firmList[firmIndex])
            month.append(month_vec[monthIndex])
            year.append(year_vec[yearIndex])
            year_month.append("%s_%s" % (year_vec[yearIndex], month_vec[monthIndex]))
df = pd.DataFrame({'firmId': firmId, 'year': year, 'month': month, 'year_month': year_month,
                   'soln_vars': soln_list})
df = df.set_index(["firmId", "year_month"])
The resulting data frame looks as follows:
month soln_vars year
firmId year_month
61 2000_1 1 soln_61_2000_1 2000
2000_2 2 soln_61_2000_2 2000
2000_3 3 soln_61_2000_3 2000
2001_1 1 soln_61_2001_1 2001
2001_2 2 soln_61_2001_2 2001
2001_3 3 soln_61_2001_3 2001
2002_1 1 soln_61_2002_1 2002
... ... ...
At this point I want a rolling sum of soln_vars over every 2 years, over every month for each firm. To do so, I first group by firmId and year and then sum:
df = df.groupby([df.index.get_level_values(0), "year"])["soln_vars"].sum()
This operation gives me the sum of soln_vars over every year for each firm:
firmId year
61 2000 soln_61_2000_1soln_61_2000_2soln_61_2000_3
2001 soln_61_2001_1soln_61_2001_2soln_61_2001_3
2002 soln_61_2002_1soln_61_2002_2soln_61_2002_3
2003 soln_61_2003_1soln_61_2003_2soln_61_2003_3
2004 soln_61_2004_1soln_61_2004_2soln_61_2004_3
62 2000 soln_62_2000_1soln_62_2000_2soln_62_2000_3
2001 soln_62_2001_1soln_62_2001_2soln_62_2001_3
... ...
In my application the solution variables are provided by another library and are mathematical expressions such as soln_61_2000_1 + soln_61_2000_2 + soln_61_2000_3 - I am using strings here for simplicity.
Then grouping by firmId and applying a rolling sum:
df = df.groupby(level=0, group_keys=False).rolling(2).sum()
does not change df. Any help clarifying this is appreciated.
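rolling(...).sum() is defined for numeric data, which is why it does nothing useful on a string-valued series. If the goal is a 2-period rolling combination per firm, one sketch (using short placeholder strings in place of the question's expressions) emulates it with a per-group shift:

```python
import pandas as pd

# Stand-in for the yearly-summed series: one string per (firmId, year)
s = pd.Series(
    ['a1', 'a2', 'a3', 'b1', 'b2'],
    index=pd.MultiIndex.from_tuples(
        [(61, 2000), (61, 2001), (61, 2002), (62, 2000), (62, 2001)],
        names=['firmId', 'year']),
)

# Emulate rolling(2).sum() (here: concatenation) within each firm by
# combining each value with the previous value in the same group
rolled = s.groupby(level=0, group_keys=False).apply(lambda g: g.shift(1).fillna('') + g)
print(rolled)
```

For genuine expression objects (rather than strings), the same shift-and-add pattern applies as long as the objects support +.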