Calculating the mean of groups in python/pandas - python

My grouped data looks like:
deviceid time
01691cbb94f16f737e4c83eca8e5f5e5390c2801 January 10
022009f075929be71975ce70db19cd47780b112f April 566
August 210
January 4
July 578
June 1048
May 1483
02bad1cdf92fbaa9327a65babc1c081e59fbf435 November 309
October 54
Where the last column represents the count. I obtained this grouped representation using the expression:
data1.groupby(['deviceid', 'time'])
How do I get the average for each device id, i.e., the sum of the counts of all months divided by the number of months? My output should look like:
deviceid mean
01691cbb94f16f737e4c83eca8e5f5e5390c2801 10
022009f075929be71975ce70db19cd47780b112f 777.8
02bad1cdf92fbaa9327a65babc1c081e59fbf435 181.5

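In older pandas versions you can specify the level in the mean method:
s.mean(level=0) # or: s.mean(level='deviceid')
The level argument was deprecated in pandas 1.3 and removed in 2.0, so in current versions use the equivalent form, which groups by the first level of the index and takes the mean of each group:
s.groupby(level=0).mean()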
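As a minimal sketch of the level-based grouping, the example below builds a small Series with a (deviceid, time) MultiIndex. The device ids and counts are shortened, hypothetical stand-ins rather than the asker's real data.

```python
import pandas as pd

# Hypothetical counts indexed by (deviceid, time); the ids are stand-ins.
index = pd.MultiIndex.from_tuples(
    [
        ("dev_a", "January"),
        ("dev_b", "April"),
        ("dev_b", "August"),
        ("dev_c", "November"),
        ("dev_c", "October"),
    ],
    names=["deviceid", "time"],
)
s = pd.Series([10, 566, 210, 309, 54], index=index, name="count")

# Group on the deviceid index level and average the counts within each device.
device_mean = s.groupby(level="deviceid").mean()
print(device_mean)
```

Each device's mean is the sum of its monthly counts divided by the number of months that appear for it.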

Related

How to group by month and year from a specific range?

The data have reported values for January 2006 through January 2019. I need to compute the total number of passengers (Passenger_Count) per month. The dataframe should have 121 entries (10 years * 12 months, plus 1 for January 2019). The range should go from 2009 to 2019.
I have been doing:
df.groupby(['ReportPeriod'])['Passenger_Count'].sum()
But it doesn't give me the right result.
You can do
df['ReportPeriod'] = pd.to_datetime(df['ReportPeriod'])
out = df.groupby(df['ReportPeriod'].dt.strftime('%Y-%m'))['Passenger_Count'].sum()
Try this:
df.index = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")
df = df.groupby(pd.Grouper(freq="M")).sum()
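A possible way to combine the date-range restriction with the monthly grouping is sketched below. The sample rows are invented for illustration; only the column names ReportPeriod and Passenger_Count come from the question, and the date format is assumed to be month/day/year.

```python
import pandas as pd

# Hypothetical sample rows; column names follow the question.
df = pd.DataFrame({
    "ReportPeriod": ["01/01/2019", "01/01/2019", "02/01/2019", "12/01/2018", "03/01/2007"],
    "Passenger_Count": [100, 50, 70, 30, 999],
})
df["ReportPeriod"] = pd.to_datetime(df["ReportPeriod"], format="%m/%d/%Y")

# Keep only rows whose year falls in the requested 2009-2019 range.
in_range = df[df["ReportPeriod"].dt.year.between(2009, 2019)]

# Group by year-month period so that the same month in different years stays separate.
monthly = in_range.groupby(in_range["ReportPeriod"].dt.to_period("M"))["Passenger_Count"].sum()
print(monthly)
```

Grouping on a monthly period rather than on the raw column means rows that share a month are summed together even if their day values differ.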

Find mean for a given group by a given time period

I have a dataframe, df, where I would like to take an average or mean of a given group per month.
group startdate enddate diff percent
A 04/01/2019 05/01/2019 160 11
A 05/01/2019 06/01/2019 136 8
B 06/01/2020 07/01/2020 202 5
B 07/01/2020 08/01/2020 283 7
For example:
I am wanting to take the mean per id per month. For group 'A',
for the month of April the diff is: 160 and for month of May, the diff is: 136.
The monthly diff mean for 'A' is 148
The monthly percent mean for 'A' would be: 9.5
Desired output
group diff_mean percent_Mean
A 148 9.5
B 242.5 6
This is what I am doing:
df.groupby['group'].mean()
I am not getting the correct output. I will keep researching. Any assistance is appreciated.
df.head()
Most likely you just need parentheses around the groupby call: df.groupby('group') rather than df.groupby['group']. You also can't average over the start/end dates, so leave those columns out of the dataframe:
df[['diff', 'percent', 'group']].groupby(['group']).mean()
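To reproduce the desired output with its column names, one option is named aggregation. This is a sketch built from the four rows shown in the question; the output names diff_mean and percent_mean are chosen to match the desired table.

```python
import pandas as pd

# The four rows from the question.
df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "startdate": ["04/01/2019", "05/01/2019", "06/01/2020", "07/01/2020"],
    "enddate": ["05/01/2019", "06/01/2019", "07/01/2020", "08/01/2020"],
    "diff": [160, 136, 202, 283],
    "percent": [11, 8, 5, 7],
})

# Named aggregation: each keyword is an output column, given as (input column, function).
result = (
    df.groupby("group")
      .agg(diff_mean=("diff", "mean"), percent_mean=("percent", "mean"))
      .reset_index()
)
print(result)
```

Because only diff and percent are named in agg, the date columns are ignored without having to drop them first.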

How to handle Datetime data with Pandas when grouping by

I have a question. I am dealing with a Datetime DataFrame in Pandas. I want to perform a count on a particular column and group by the month.
For example:
df.groupby(df.index.month)["count_interest"].count()
Assuming that I am analyzing data starting from December 2019, I get a result like this:
date
1 246
2 360
3 27
12 170
In reality, December 2019 is supposed to come first. What can I do about this? When I plot the frame grouped by month, December 2019 shows up last, which is incorrect.
See plot below for your understanding:
You can try reindex:
df.groupby(df.index.month)["count_interest"].count().reindex([12,1,2,3])
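An alternative to reindexing by hand is to group on year-month periods, which sort chronologically on their own. The sketch below uses invented timestamps; only the column name count_interest comes from the question.

```python
import pandas as pd

# Hypothetical datetime-indexed data running from December 2019 into 2020.
idx = pd.to_datetime([
    "2019-12-05", "2019-12-20", "2020-01-03",
    "2020-02-11", "2020-02-12", "2020-03-01",
])
df = pd.DataFrame({"count_interest": [1, 1, 1, 1, 1, 1]}, index=idx)

# Year-month periods keep the year, so December 2019 sorts before January 2020.
monthly = df.groupby(df.index.to_period("M"))["count_interest"].count()
print(monthly)
```

This also avoids merging the same month from different years into a single bucket, which grouping on df.index.month alone would do.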

Get the average mean of entries per month with datetime in Pandas

I have a large df with many entries per month. I would like to see the average number of entries per month, for example to check whether some months normally have more entries than others. (Ideally I'd also like to plot this with a line for the overall mean to compare against, but that is maybe a later question.)
My df is something like this:
ufo=pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time']=pd.to_datetime(ufo.Time)
Where the head looks like this:
So if, as an example, I'd like to see whether there are more ufo-sightings in the summer, how would I go about it?
I have tried:
ufo.groupby(ufo.Time.month).mean()
But it only works if I am calculating a numerical value. If I use count() instead, I get the sum of all entries for all months.
EDIT: To clarify, I would like to have the mean of entries - ufo-sightings - per month.
You could do something like this:
# count the number of years spanned by the records in each month group
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
size total_month mean_count
Time
1 862 57 15.122807
2 817 70 11.671429
3 1096 55 19.927273
4 1045 68 15.367647
5 1168 53 22.037736
6 3059 71 43.084507
7 2345 65 36.076923
8 1948 64 30.437500
9 1635 67 24.402985
10 1723 65 26.507692
11 1509 50 30.180000
12 1034 56 18.464286
I think this is what you are looking for; please ask for clarification if I haven't understood what you need.
# Add a new column instance, this adds a value to each instance of ufo sighting
ufo['instance'] = 1
# set index to time, this makes df a time series df and then you can apply pandas time series functions.
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# create another df by resampling the original df and counting the instance column by Month ('M' is resample by month)
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# just to find month of resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and this is the output which looks like this:
month mean_instance
1 12.314286
2 11.671429
3 15.657143
4 14.928571
5 16.685714
6 43.084507
7 33.028169
8 27.436620
9 23.028169
10 24.267606
11 21.253521
12 14.563380
Do you mean you want to group your data by month? I think we can do this
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
This gives you 'year' and 'month' columns to group your data by. In the line below, 'place_holder' stands for whichever numeric column you want to average:
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()
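One way to read "mean of entries per month" is as a two-step calculation: count the sightings in each (year, month) pair, then average those counts per calendar month. The sketch below uses a handful of invented timestamps in place of the ufo data.

```python
import pandas as pd

# Hypothetical sighting timestamps standing in for the ufo 'Time' column.
ufo = pd.DataFrame({"Time": pd.to_datetime([
    "2000-06-01", "2000-06-15", "2000-12-24",
    "2001-06-03", "2001-06-10", "2001-06-20", "2001-06-30",
])})

# Step 1: count sightings in each (year, month) pair.
per_year_month = ufo.groupby([
    ufo["Time"].dt.year.rename("year"),
    ufo["Time"].dt.month.rename("month"),
]).size()

# Step 2: average those counts across years for each calendar month.
mean_per_month = per_year_month.groupby(level="month").mean()
print(mean_per_month)
```

Note that a month with no sightings in a given year does not appear in step 1 at all, so it is left out of the average instead of being counted as zero. A resample-based approach does count such months as zero, which is one reason different answers can produce different means.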

pandas dataframe group year index by decade

Suppose I have a dataframe whose index is a monthly timestep. I know I can use dataframe.groupby(lambda x:x.year) to group monthly data into yearly and apply other operations. Is there some way I could quickly group them by, let's say, decade?
thanks for any hints.
To get the decade, you can integer-divide the year by 10 and then multiply by 10. For example, if you're starting from
>>> dates = pd.date_range('1/1/2001', periods=500, freq="M")
>>> df = pd.DataFrame({"A": 5*np.arange(len(dates))+2}, index=dates)
>>> df.head()
A
2001-01-31 2
2001-02-28 7
2001-03-31 12
2001-04-30 17
2001-05-31 22
You can group by year, as usual (here we have a DatetimeIndex so it's really easy):
>>> df.groupby(df.index.year).sum().head()
A
2001 354
2002 1074
2003 1794
2004 2514
2005 3234
or you could do the (x//10)*10 trick:
>>> df.groupby((df.index.year//10)*10).sum()
A
2000 29106
2010 100740
2020 172740
2030 244740
2040 77424
If you don't have something on which you can use .year, you could still do lambda x: (x.year//10)*10.
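The integer-division trick can be checked on a small frame where the answer is easy to verify by hand. This sketch assumes a monthly DatetimeIndex covering 2001 through 2020 with a value of 1 per month, so each decade's sum is just its number of months.

```python
import numpy as np
import pandas as pd

# 240 month-start dates: January 2001 through December 2020.
dates = pd.date_range("2001-01-01", periods=240, freq="MS")
df = pd.DataFrame({"A": np.ones(len(dates), dtype=int)}, index=dates)

# Floor each year to the start of its decade, e.g. 2017 -> 2010.
decade = (df.index.year // 10) * 10

# Sum within each decade; with A == 1 this counts the months per decade.
by_decade = df.groupby(decade)["A"].sum()
print(by_decade)
```

The 2000s bucket holds 108 months here because the data start in 2001, the 2010s hold a full 120, and the 2020s hold only the 12 months of 2020.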
If your DataFrame has headers, say DataFrame ['Population','Salary','vehicle count'],
make Year your index (it must be a datetime-like index for resample to work): DataFrame=DataFrame.set_index('Year')
Use the code below to resample the data into 10-year bins; it also gives you the sum of all other columns within each decade:
dataframe=dataframe.resample('10AS').sum()
Use the year attribute of index:
df.groupby(df.index.year)
Let's say your date column goes by the name Date; then you can group with
dataframe.set_index('Date').iloc[:, 0].resample('10AS').count()
Note: iloc[:, 0] here chooses the first column in your dataframe.
The available offset aliases are listed here:
http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases
