Pandas: groupby and get median value by month?

I have a data frame that looks like this:
org date value
0 00C 2013-04-01 0.092535
1 00D 2013-04-01 0.114941
2 00F 2013-04-01 0.102794
3 00G 2013-04-01 0.099421
4 00H 2013-04-01 0.114983
Now I want to figure out:
the median value for each organisation in each month of the year
X for each organisation, where X is the percentage difference between the lowest and the highest median monthly value.
What's the best way to approach this in Pandas?
I am trying to generate the medians by month as follows, but it's failing:
df['date'] = pd.to_datetime(df['date'])
ave = df.groupby(['row_id', 'date.month']).median()
This fails with KeyError: 'date.month'.
UPDATE: Thanks to @EdChum I'm now doing:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
which works great and gives me:
99P 1 0.106975
2 0.091344
3 0.098958
4 0.092400
5 0.087996
6 0.081632
7 0.083592
8 0.075258
9 0.080393
10 0.089634
11 0.085679
12 0.108039
99Q 1 0.110889
2 0.094837
3 0.100658
4 0.091641
5 0.088971
6 0.083329
7 0.086465
8 0.078368
9 0.082947
10 0.090943
11 0.086343
12 0.109408
Now I guess, for each item in the index, I need to find the min and max calculated values, then the difference between them. What is the best way to do that?

For your first error: it's not a syntax error but a KeyError — groupby can take a list of column names or the columns themselves; you passed a list of names, and 'date.month' doesn't exist as a column, so you want:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
After that you can calculate the min and max for a specific index level so:
((ave.max(level=0) - ave.min(level=0))/ave.max(level=0)) * 100
should give you what you want.
This calculates the difference between the min and max value for each organisation, divides by the max at that level, and turns it into a percentage by multiplying by 100.
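A runnable sketch of the formula above, on made-up monthly medians (the org names and values are assumptions, not the asker's data). Note that in recent pandas the level= keyword on max/min has been removed, so .groupby(level=0) is the equivalent spelling:

```python
import pandas as pd

# Hypothetical monthly medians for two organisations (names made up)
df = pd.DataFrame({
    'org':   ['99P', '99P', '99P', '99Q', '99Q', '99Q'],
    'month': [1, 2, 3, 1, 2, 3],
    'value': [0.10, 0.08, 0.09, 0.11, 0.09, 0.10],
})
ave = df.groupby(['org', 'month'])['value'].median()

# per-org max/min over the month level; .groupby(level=0) replaces the
# deprecated max(level=0) spelling in recent pandas
hi = ave.groupby(level=0).max()
lo = ave.groupby(level=0).min()
pct = (hi - lo) / hi * 100
print(pct)
```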

Related

Creating a new column based on entries from another column in Python

I'm new to Python and hope you guys can help me with the following:
I have a data frame that contains the daily demand of a certain product. However, the demand is shown cumulative over time. I want to create a column that shows the actual daily demand (see table below).
Current Data frame:
Day#  Cumulative Demand
1     10
2     15
3     38
4     44
5     53
What I want to achieve:
Day#  Cumulative Demand  Daily Demand
1     10                 10
2     15                 5
3     38                 23
4     44                 6
5     53                 9
Thank you!
Firstly, we need the data of the old column
# My Dataframe is called df
demand = df["Cumulative Demand"].tolist()
Then recalculate the data
daily_demand = [demand[0]]
for i, d in enumerate(demand[1:]):
    daily_demand.append(d - demand[i])
Lastly append the data to a new column
df["Daily Demand"] = daily_demand
This assumes what you shared above is representative of your actual data, meaning you have one row per day and the Day column is sorted in ascending order.
You can use shift() (please read what it does), and perform a subtraction between the cumulative demand, and the shifted version of the cumulative demand. This will give you back the actual daily demand.
To make sure that it works, check whether the cumulative sum of the daily demand (the new column) matches the original Cumulative Demand column, using cumsum().
import pandas as pd
# Calculate your Daily Demand column
df['Daily Demand'] = (df['Cumulative Demand'] - df['Cumulative Demand'].shift()).fillna(df['Cumulative Demand'].iloc[0])
# Check whether the cumulative sum of daily demands sum up to the Cumulative Demand
>>> all(df['Daily Demand'].cumsum() == df['Cumulative Demand'])
True
Will print back:
Day Cumulative Demand Daily Demand
0 1 10 10.0
1 2 15 5.0
2 3 38 23.0
3 4 44 6.0
4 5 53 9.0
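Both answers above can also be written with diff(), which is shorthand for the shift-and-subtract step; a minimal sketch on a reconstruction of the sample data:

```python
import pandas as pd

# Hypothetical reconstruction of the question's frame
df = pd.DataFrame({'Day': [1, 2, 3, 4, 5],
                   'Cumulative Demand': [10, 15, 38, 44, 53]})

# diff() subtracts the previous row; the first row has no predecessor,
# so fill it with the first cumulative value
df['Daily Demand'] = (df['Cumulative Demand'].diff()
                      .fillna(df['Cumulative Demand'].iloc[0]))
print(df)
```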

Pandas groupby n rows starting from bottom of df

I have the following df:
Week  Sales
1     10
2     15
3     10
4     20
5     20
6     10
7     15
8     10
I would like to group every 3 weeks and sum up sales. I want to start with the bottom 3 weeks. If there are fewer than 3 weeks left at the top, like in this example, these weeks should be ignored. Desired output is this:
Week Sales
5-3 50
8-6 35
I tried this on my original df: df.reset_index(drop=True).groupby(by=lambda x: x // N, axis=0).sum()
but this solution is not starting from the bottom rows.
Can anyone point me into the right direction here? Thanks!
You can try inverting the data with .iloc[::-1]:
import numpy as np

N = 3
(df.iloc[::-1].groupby(np.arange(len(df)) // N)
   .agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
         'Sales': 'sum'})
)
Output:
Week Sales
0 8-6 35
1 5-3 50
2 2-1 25
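The output above still includes a 2-1 group for the leftover weeks, which the question wanted ignored. A hedged sketch that drops incomplete groups of fewer than N rows before aggregating (the filtering step is my addition, not from either answer):

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of the question's frame
df = pd.DataFrame({'Week': [1, 2, 3, 4, 5, 6, 7, 8],
                   'Sales': [10, 15, 10, 20, 20, 10, 15, 10]})
N = 3

rev = df.iloc[::-1].reset_index(drop=True)
# keep only complete groups of N rows, dropping the leftover weeks
rev = rev[np.arange(len(rev)) // N < len(rev) // N]
out = (rev.groupby(np.arange(len(rev)) // N)
          .agg(Week=('Week', lambda x: f'{x.iloc[0]}-{x.iloc[-1]}'),
               Sales=('Sales', 'sum')))
print(out)
```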
When dealing with period aggregation, I usually use .resample, as it is flexible for binning data over different time periods.
import io
from datetime import timedelta
import pandas as pd
dataf = pd.read_csv(io.StringIO("""Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10"""), sep=r'\s+').astype(int)
# reverse data and transform int weeks to actual date time
dataf = dataf.iloc[::-1]
dataf['Week'] = dataf['Week'].map(lambda x: timedelta(weeks=x))
# set date object to index for resampling
dataf = dataf.set_index('Week')
# now we resample
dataf.resample('21d').sum() # 21days
Note: the resulting labels are misleading, and setting kind='period' raises an error.

How to cut values on a range and count how many occurences there are for any range?

I have a dataframe like the following:
client_id order_id order_date id_medium revenues
0 Jack IML-0011101 2008-05-19 9526 69.84
1 deSys IML-0011744 2008-05-28 13868 68.32
2 deSys IML-0011744 2008-05-28 9472 9.48
3 deSys IML-0011744 2008-05-28 9526 6.86
4 Paul IML-0013360 2008-06-23 21585 37.09
5 Frank IML-0013951 2008-07-01 24539 29.61
6 Frank IML-0013951 2008-07-01 9472 10.38
7 Jack IML-0042758 2016-08-23 1190408 36.46
8 Frank IML-0094979 2017-09-29 1222195 58.24
9 madinside IML-0118214 2018-07-22 1240366 14.72
10 madinside IML-0118214 2018-07-22 1240374 9.27
11 madinside IML-0118214 2018-07-22 1240293 10.59
12 madinside IML-0118214 2018-07-22 1240377 13.76
13 madinside IML-0118214 2018-07-22 1240367 14.51
I would like to know:
how many unique clients there are every year for each "range of revenues", also by considering that where values for order_id are the same I should sum values in revenues.
That is, if I define bins of 10 euros (10 or less, between 10 and 20, between 20 and 30, and so on), how many clients are in each bin every year?
So, in the dataframe above, in year 2008 I would have 1 client (Jack) in bin 60-70, 1 client (deSys) in bin 80-90 (because he made 1 single order - IML-0011744 - of 3 mediums, so for my purpose I must sum 68.32+9.48+6.86), and so on.
the same thing for order_id, but in this case I would like to count the number of orders that are unique with respect to the year of the order and the range of revenues.
I think I could add a column that assigns each record a value approximating, based on the desired range, the revenues value. However, in this case I should assign the same value to records that share the same order_id, and the assigned value should take into account the sum of revenues for that order_id.
Expected result:
I expect two tables, one for clients and one for orders, where for each index there is a different bins (10 or less, between 20 included and 10 excluded, between 30 included and 20 excluded, ...) and columns are the year of order.
UPDATE
Still working on it, I might have reached a partial result, through this code:
pt = ordini.groupby(["order_id","client_id","order_date"]).agg({"revenues":"sum"}).reset_index()
om = pt.groupby(["revenues"]).agg({"client_id":"nunique","order_id":"count"}).reset_index()
binn = pd.cut(om.revenues,np.arange(0,4300,100))
om.groupby(binn)[["client_id"]].count().reset_index()
I get something similar to this:
incasso_tot id_cliente
0 (0, 100] 12891
1 (100, 200] 8730
2 (200, 300] 4525
3 (300, 400] 2264
4 (400, 500] 1190
That should be fine, except that it is missing the date information (i.e. I can't see the values split per year).
Anyone could help me, please?
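Building on the asker's partial code, one possible sketch that keeps the year dimension: collapse to one row per order, then group by bin and year. The sample rows are a reconstruction of part of the question's dataframe, and the 10-euro bin edges are an assumption:

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of a few rows from the question
ordini = pd.DataFrame({
    'client_id': ['Jack', 'deSys', 'deSys', 'deSys', 'Jack'],
    'order_id': ['IML-0011101', 'IML-0011744', 'IML-0011744',
                 'IML-0011744', 'IML-0042758'],
    'order_date': pd.to_datetime(['2008-05-19', '2008-05-28',
                                  '2008-05-28', '2008-05-28',
                                  '2016-08-23']),
    'revenues': [69.84, 68.32, 9.48, 6.86, 36.46],
})

# one row per order, summing the medium-level revenues
pt = (ordini.groupby(['order_id', 'client_id', 'order_date'])
            .agg(revenues=('revenues', 'sum'))
            .reset_index())
pt['year'] = pt['order_date'].dt.year
pt['bin'] = pd.cut(pt['revenues'], np.arange(0, 100, 10))

# unique clients per revenue bin (rows) and year (columns)
clients = (pt.groupby(['bin', 'year'], observed=False)['client_id']
             .nunique()
             .unstack(fill_value=0))
print(clients)
```

The same pipeline with .nunique() replaced by .count() on order_id would give the orders table.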

Get the average mean of entries per month with datetime in Pandas

I have a large df with many entries per month. I would like to see the average number of entries per month, to see, for example, if there are any months that normally have more entries. (Ideally I'd like to plot this with a line of the overall mean to compare with, but that is maybe a later question).
My df is something like this:
ufo=pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time']=pd.to_datetime(ufo.Time)
So if I'd like to see if there are more ufo-sightings in the summer as an example, how would I go about?
I have tried:
ufo.groupby(ufo.Time.dt.month).mean()
But it only works if I am calculating a numerical value. If I use count() instead, I get the total of all entries for each month, summed across all years.
EDIT: To clarify, I would like to have the mean of entries - ufo-sightings - per month.
You could do something like this:
# count the total months in the records
# count the number of years spanned by the records
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
size total_month mean_count
Time
1 862 57 15.122807
2 817 70 11.671429
3 1096 55 19.927273
4 1045 68 15.367647
5 1168 53 22.037736
6 3059 71 43.084507
7 2345 65 36.076923
8 1948 64 30.437500
9 1635 67 24.402985
10 1723 65 26.507692
11 1509 50 30.180000
12 1034 56 18.464286
I think this is what you are looking for; please ask for clarification if I didn't reach what you are looking for.
# Add a new column instance, this adds a value to each instance of ufo sighting
ufo['instance'] = 1
# set index to time, this makes df a time series df and then you can apply pandas time series functions.
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# create another df by resampling the original df and counting the instance column by Month ('M' is resample by month)
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# just to find month of resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and the output looks like this:
month mean_instance
1 12.314286
2 11.671429
3 15.657143
4 14.928571
5 16.685714
6 43.084507
7 33.028169
8 27.436620
9 23.028169
10 24.267606
11 21.253521
12 14.563380
Do you mean you want to group your data by month? I think we can do this
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
In this way, you will have 'year' and 'month' to group your data.
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()
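A third spelling of the same idea as the answers above: count sightings per (year, month) pair, then average those counts per month. Note this averages only over years in which the month actually has sightings, which is why it can differ from the size/total_month answer; the sample timestamps below are made up:

```python
import pandas as pd

# Hypothetical small sample of sighting timestamps
ufo = pd.DataFrame({'Time': pd.to_datetime([
    '1999-06-01', '1999-06-15', '1999-07-04',
    '2000-06-02', '2000-06-10', '2000-06-20',
    '2000-06-25', '2000-07-04'])})

# sightings per (year, month), then the mean of those counts per month
counts = ufo.groupby([ufo.Time.dt.year.rename('year'),
                      ufo.Time.dt.month.rename('month')]).size()
mean_per_month = counts.groupby(level='month').mean()
print(mean_per_month)
```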

How to add columns with the average percent and average count to the dataframe?

This question is related to my previous question. I have the following dataframe:
df =
QUEUE_1  QUEUE_2  DAY  HOUR  TOTAL_SERVICE_TIME  TOTAL_WAIT_TIME  EVAL
ABC123   DEF656   1    7     20                  30               1
ABC123            1    7     22                  32               0
DEF656   ABC123   1    8     15                  12               0
FED456   DEF656   2    8     15                  16               1
I need to get the following dataframe (it's similar to the one I wanted to get in my previous question, but here I need to add 2 additional columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1).
QUEUE HOUR AVG_TOT_SERVICE_TIME AVG_TOT_WAIT_TIME AVG_COUNT_PER_DAY_HOUR AVG_PERCENT_EVAL_1
ABC123 7 21 31 1 50
ABC123 8 15 12 0.5 100
DEF656 7 20 30 0.5 100
DEF656 8 15 14 1 50
FED456 7 0 0 0 0
FED456 8 15 14 0.5 100
The column AVG_COUNT_PER_DAY_HOUR should contain the average count of a corresponding HOUR value over days (DAY) grouped by QUEUE. For example, in df, in case of ABC123, the HOUR 7 appears 2 times for the DAY 1 and 0 times for the DAY 2. Therefore the average is 1. The same logic is applied to the HOUR 8. It appears 1 time in DAY 1 and 0 times in DAY 2 for ABC123. Therefore the average is 0.5.
The column AVG_PERCENT_EVAL_1 should contain the percent of EVAL equal to 1 over hours, grouped by QUEUE. For example, in case of ABC123, the EVAL is equal to 1 one time when HOUR is 7. It is also equal to 0 one time when HOUR is 7. So, AVG_PERCENT_EVAL_1 is 50 for ABC123 and hour 7.
I use this approach:
df = pd.lreshape(df, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index=['QUEUE'], columns=['HOUR'], fill_value=0)
result = piv_df.stack().add_prefix('AVG_').reset_index()
I get stuck with adding the columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1. For instance, to add the column AVG_COUNT_PER_DAY_HOUR I am thinking to use .apply(pd.value_counts, 1).notnull().groupby(level=0).sum().astype(int), while for calculating AVG_PERCENT_EVAL_1 I am thinking to use [df.EVAL==1].agg({'EVAL' : 'count'}). However, I don't know how to incorporate these into my current code in order to get the correct solution.
UPDATE:
Perhaps it is easier to adopt this solution to what I need in this questions:
result = pd.lreshape(df, {'QUEUE': ['QUEUE_1', 'QUEUE_2']})
mux = pd.MultiIndex.from_product([result.QUEUE.dropna().unique(),
                                  result.DAY.dropna().unique(),
                                  result.HOUR.dropna().unique()],
                                 names=['QUEUE', 'DAY', 'HOUR'])
print(result.groupby(['QUEUE', 'DAY', 'HOUR'])
            .mean()
            .reindex(mux, fill_value=0)
            .add_prefix('AVG_')
            .reset_index())
Steps:
1) To compute AVG_COUNT_PER_DAY_HOUR :
With the help of pd.crosstab(), compute the distinct counts of HOUR w.r.t. DAY (so that we also obtain the missing-day cases), grouped by QUEUE.
stack the DF so that HOUR, which was part of a hierarchical column before, now gets positioned in the index, leaving just DAY as columns. We take the mean column-wise after filling NaNs with 0.
2) To compute AVG_PERCENT_EVAL_1:
After getting the pivoted frame (same as before): since EVAL is binary (1/0), its mean is already the fraction of 1s, so we simply take EVAL from this DF and multiply its result by 100 (means were computed while pivoting itself, as the default agg is np.mean).
Finally, we join all these frames.
same as in the linked post:
df = pd.lreshape(df, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index='QUEUE', columns='HOUR', fill_value=0).stack()
avg_tot = piv_df[['TOTAL_SERVICE_TIME', 'TOTAL_WAIT_TIME']].add_prefix("AVG_")
additional portion:
avg_cnt = pd.crosstab(df['QUEUE'], [df['DAY'], df['HOUR']]).stack().fillna(0).mean(1)
avg_pct = piv_df['EVAL'].mul(100).astype(int)
avg_tot.join(
avg_cnt.to_frame("AVG_COUNT_PER_DAY_HOUR")
).join(avg_pct.to_frame("AVG_PERCENT_EVAL_1")).reset_index()
avg_cnt looks like:
QUEUE HOUR
ABC123 7 1.0
8 0.5
DEF656 7 0.5
8 1.0
FED456 7 0.0
8 0.5
dtype: float64
avg_pct looks like:
QUEUE HOUR
ABC123 7 50
8 0
DEF656 7 100
8 50
FED456 7 0
8 100
Name: EVAL, dtype: int32
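The avg_cnt step is the subtle one; a self-contained sketch of just that step on a reconstruction of the question's frame (using melt instead of pd.lreshape for the reshape, as a more common spelling):

```python
import pandas as pd

# Hypothetical reconstruction of the question's frame
df = pd.DataFrame({
    'QUEUE_1': ['ABC123', 'ABC123', 'DEF656', 'FED456'],
    'QUEUE_2': ['DEF656', None, 'ABC123', 'DEF656'],
    'DAY':  [1, 1, 1, 2],
    'HOUR': [7, 7, 8, 8],
    'EVAL': [1, 0, 0, 1],
})

# melt the two QUEUE columns into one long QUEUE column
long = (df.melt(id_vars=['DAY', 'HOUR', 'EVAL'], value_name='QUEUE')
          .dropna(subset=['QUEUE']))

# counts of each (DAY, HOUR) cell per QUEUE; stacking HOUR back into the
# index leaves DAY as columns, and the row mean averages over days
avg_cnt = (pd.crosstab(long['QUEUE'], [long['DAY'], long['HOUR']])
             .stack().fillna(0).mean(axis=1))
print(avg_cnt)
```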
