Find mean for a given group by a given time period - python

I have a dataframe, df, where I would like to take the mean for each group per month.
group  startdate   enddate     diff  percent
A      04/01/2019  05/01/2019  160   11
A      05/01/2019  06/01/2019  136   8
B      06/01/2020  07/01/2020  202   5
B      07/01/2020  08/01/2020  283   7
For example:
I want to take the mean per group per month. For group 'A', the diff for April is 160 and the diff for May is 136, so the monthly diff mean for 'A' is (160 + 136) / 2 = 148 and the monthly percent mean is (11 + 8) / 2 = 9.5.
Desired output
group  diff_mean  percent_mean
A      148        9.5
B      242.5      6
This is what I am doing:
df.groupby['group'].mean()
I am not getting the correct output. I will keep researching. Any assistance is appreciated.

Most likely you just need parentheses in the groupby call: groupby is a method, so it is df.groupby('group'), not df.groupby['group']. Also, you can't average the start/end dates, so leave those columns out of the dataframe:
df[['diff', 'percent', 'group']].groupby(['group']).mean()
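For completeness, a runnable sketch of the whole flow (sample data reconstructed from the question):
import pandas as pd

# reconstruct the sample data from the question
df = pd.DataFrame({
    'group': ['A', 'A', 'B', 'B'],
    'startdate': ['04/01/2019', '05/01/2019', '06/01/2020', '07/01/2020'],
    'enddate': ['05/01/2019', '06/01/2019', '07/01/2020', '08/01/2020'],
    'diff': [160, 136, 202, 283],
    'percent': [11, 8, 5, 7],
})

# keep only the numeric columns plus the grouping key, then average per group
print(df[['diff', 'percent', 'group']].groupby('group').mean())
#         diff  percent
# group
# A      148.0      9.5
# B      242.5      6.0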

Related

Converting monthly values into daily using pandas interpolation

I have 12 average monthly values for 1000 columns and I want to convert the data to daily values using pandas. I tried interpolate, but I only got daily values from 31/01/1991 to 31/12/1991, which does not cover the whole year; the January values are missing. I used date_range for the index of my data frame.
date = pd.date_range(start="01/01/1991", end="31/12/1991", freq="M")
upsampled = df.resample("D")
interpolated = upsampled.interpolate(method='linear')
How can I get the interpolated values for 365 days?
Note that interpolation is between the known points.
So to interpolate throughout the whole year, it is not enough to have
just 12 values (for each month).
You must have 13 values (e.g. for the start of each month and
the start of the next year).
Thus I created the source df as:
import numpy as np
import pandas as pd

date = pd.date_range(start='01/01/1991', periods=13, freq='MS')
df = pd.DataFrame({'date': date, 'amount': np.random.randint(100, 200, date.size)})
getting e.g.:
date amount
0 1991-01-01 113
1 1991-02-01 164
2 1991-03-01 181
3 1991-04-01 164
4 1991-05-01 155
5 1991-06-01 157
6 1991-07-01 118
7 1991-08-01 133
8 1991-09-01 184
9 1991-10-01 183
10 1991-11-01 159
11 1991-12-01 193
12 1992-01-01 163
Then to upsample it to daily frequency and interpolate, I ran:
df.set_index('date').resample('D').interpolate()
If you don't want the result to contain the last row (for 1992-01-01),
take only a slice of the above result, dropping the last row:
df.set_index('date').resample('D').interpolate()[:-1]
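As a quick sanity check (with the random data above): 1991 is not a leap year, so the sliced result should contain exactly 365 daily rows:
daily = df.set_index('date').resample('D').interpolate()[:-1]
print(len(daily))  # 365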

Get the average number of entries per month with datetime in Pandas

I have a large df with many entries per month. I would like to see the average number of entries per month, to check, for example, whether some months typically have more entries. (Ideally I'd like to plot this with a line for the overall mean to compare against, but that is maybe a later question.)
My df is something like this:
ufo=pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time']=pd.to_datetime(ufo.Time)
So if I'd like to see if there are more ufo-sightings in the summer as an example, how would I go about?
I have tried:
ufo.groupby(ufo.Time.month).mean()
But it only works if I am calculating a numerical value. If I use count() instead, I get the sum of all entries for all months.
EDIT: To clarify, I would like to have the mean of entries - ufo-sightings - per month.
You could do something like this:
# count how many years each month spans in the records
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
size total_month mean_count
Time
1 862 57 15.122807
2 817 70 11.671429
3 1096 55 19.927273
4 1045 68 15.367647
5 1168 53 22.037736
6 3059 71 43.084507
7 2345 65 36.076923
8 1948 64 30.437500
9 1635 67 24.402985
10 1723 65 26.507692
11 1509 50 30.180000
12 1034 56 18.464286
I think this is what you are looking for; please ask for clarification if it isn't.
# add a column 'instance' with the value 1 for every ufo sighting
ufo['instance'] = 1
# set the index to Time; this makes df a time series, so pandas time-series functions apply
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# resample by month ('M') and count the instance column in each month
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# recover the month of each resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and this is the output:
month mean_instance
1 12.314286
2 11.671429
3 15.657143
4 14.928571
5 16.685714
6 43.084507
7 33.028169
8 27.436620
9 23.028169
10 24.267606
11 21.253521
12 14.563380
Do you mean you want to group your data by month? I think we can do this
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
In this way, you will have 'year' and 'month' to group your data.
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()
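Putting the pieces together for the original goal (mean number of sightings per calendar month), a minimal sketch building on the ufo dataframe above: count the sightings in each (year, month) pair, then average those counts across years for each month. Note this averages only over years in which that month actually has sightings:
# count sightings per (year, month), then average the counts per calendar month
monthly_counts = ufo.groupby([ufo.Time.dt.year, ufo.Time.dt.month]).size()
mean_per_month = monthly_counts.groupby(level=1).mean()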

Finding average on equal days using pandas

I have a data set where I'm trying to get an average over rows with equal days_remaining values.
Example:
ship_date   Order_date  cumulative_ordered  days_remaining
2018-07-01  2018-05-06  7                   56 days
2018-07-01  2018-05-07  10                  55 days
2018-07-01  2018-05-08  15                  54 days
The order_date counts down until it reaches the ship_date; by that time the cumulative order equals the total orders up to the shipping date. Then a new ship_date starts and the process repeats. I want to see the average percentage ordered on each day leading up to the ship date. For instance, if ship_date 2018-07-01 has a total of 100 orders and ship_date 2018-08-01 has a total of 200, I want to see what percentage, on average, was ordered 54 days prior to the ship_date.
Thanks.
You can obtain the average of cumulative_ordered per days_remaining using groupby:
df.groupby("days_remaining")['cumulative_ordered'].mean()
This returns a Series with the cumulative_ordered average for each group of rows sharing a specific days_remaining, for example:
days_remaining
2 days     10.50
56 days    50.22
...
Name: cumulative_ordered, dtype: float64
In order to extract one of the mean values from that Series, assign it to a variable and use the index. Say you want the average of cumulative_ordered for rows with days_remaining equal to 56; you would do:
g = df.groupby("days_remaining")['cumulative_ordered'].mean()
# value is the average cumulative_ordered for rows with 56 days remaining
value = g[g.index.days == 56].iloc[0]
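The percentage part of the question isn't covered above. A hedged sketch, assuming each shipment's total is the maximum cumulative_ordered within its ship_date group:
# each shipment's total = max cumulative_ordered for that ship_date
totals = df.groupby('ship_date')['cumulative_ordered'].transform('max')
# each row's share of its shipment's total, as a percentage
df['pct_of_total'] = df['cumulative_ordered'] / totals * 100
# average percentage ordered at each days_remaining, across shipments
avg_pct = df.groupby('days_remaining')['pct_of_total'].mean()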

Pandas: groupby and get median value by month?

I have a data frame that looks like this:
org date value
0 00C 2013-04-01 0.092535
1 00D 2013-04-01 0.114941
2 00F 2013-04-01 0.102794
3 00G 2013-04-01 0.099421
4 00H 2013-04-01 0.114983
Now I want to figure out:
1. the median value for each organisation in each month of the year
2. X for each organisation, where X is the percentage difference between the lowest and highest monthly median values.
What's the best way to approach this in Pandas?
I am trying to generate the medians by month as follows, but it's failing:
df['date'] = pd.to_datetime(df['date'])
ave = df.groupby(['row_id', 'date.month']).median()
This fails with KeyError: 'date.month'.
UPDATE: Thanks to @EdChum I'm now doing:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
which works great and gives me:
99P 1 0.106975
2 0.091344
3 0.098958
4 0.092400
5 0.087996
6 0.081632
7 0.083592
8 0.075258
9 0.080393
10 0.089634
11 0.085679
12 0.108039
99Q 1 0.110889
2 0.094837
3 0.100658
4 0.091641
5 0.088971
6 0.083329
7 0.086465
8 0.078368
9 0.082947
10 0.090943
11 0.086343
12 0.109408
Now I guess, for each item in the index, I need to find the min and max calculated values, then the difference between them. What is the best way to do that?
For your first error: groupby can take a list of column names or of the columns themselves; you passed a list of names, and 'date.month' doesn't exist as a column, hence the KeyError. You want:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
After that you can calculate the min and max for a specific index level so:
((ave.max(level=0) - ave.min(level=0))/ave.max(level=0)) * 100
should give you what you want.
This calculates the difference between the min and max value for each organisation, divides by the max at that level, and multiplies by 100 to give a percentage.
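Note that the level= argument to max/min was deprecated in later pandas releases; an equivalent form uses an explicit groupby over the first index level:
g = ave.groupby(level=0)
pct_range = (g.max() - g.min()) / g.max() * 100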

Calculating the mean of groups in python/pandas

My grouped data looks like:
deviceid                                  time
01691cbb94f16f737e4c83eca8e5f5e5390c2801  January      10
022009f075929be71975ce70db19cd47780b112f  April       566
                                          August      210
                                          January       4
                                          July        578
                                          June       1048
                                          May        1483
02bad1cdf92fbaa9327a65babc1c081e59fbf435  November    309
                                          October      54
Where the last column represents the count. I obtained this grouped representation using the expression:
data1.groupby(['deviceid', 'time']).size()
How do I get the average for each device id, i.e., the sum of the counts of all months divided by the number of months? My output should look like:
deviceid                                  mean
01691cbb94f16f737e4c83eca8e5f5e5390c2801  10
022009f075929be71975ce70db19cd47780b112f  777.8
02bad1cdf92fbaa9327a65babc1c081e59fbf435  181.5
You can specify the level in the mean method:
s.mean(level=0) # or: s.mean(level='deviceid')
This is equivalent to grouping by the first level of the index and taking the mean of each group: s.groupby(level=0).mean()
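For completeness, a minimal sketch of the whole pipeline, assuming data1 has deviceid and time columns as in the question (s is the counts Series described there). In recent pandas, s.mean(level=0) is deprecated in favour of the groupby form:
# counts per (deviceid, time), then the mean count per device
s = data1.groupby(['deviceid', 'time']).size()
mean_per_device = s.groupby(level='deviceid').mean()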
