Pandas groupby n rows starting from bottom of df - python

I have the following df:
Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10
I would like to group every 3 weeks and sum up sales, starting with the bottom 3 weeks. If there are fewer than 3 weeks left at the top, as in this example, those weeks should be ignored. Desired output is this:
Week Sales
5-3 50
8-6 35
I tried this on my original df:
df.reset_index(drop=True).groupby(by=lambda x: x/N, axis=0).sum()
but this does not start grouping from the bottom rows.
Can anyone point me into the right direction here? Thanks!

You can try reversing the data with .iloc[::-1]:
import numpy as np

N = 3
(df.iloc[::-1].groupby(np.arange(len(df)) // N)
   .agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
         'Sales': 'sum'})
)
Output:
Week Sales
0 8-6 35
1 5-3 50
2 2-1 25
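The question asks to ignore the leftover weeks at the top (the 2-1 group above). One possible way, sketched here as an extension of the same idea rather than part of the original answer, is to keep only the groups that contain exactly N rows:
import numpy as np

N = 3
rev = df.iloc[::-1]
g = rev.groupby(np.arange(len(rev)) // N)

agg = g.agg({'Week': lambda x: f'{x.iloc[0]}-{x.iloc[-1]}',
             'Sales': 'sum'})
out = agg[g.size().eq(N)]   # keep only the complete groups of N weeks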

When dealing with period aggregation, I usually use .resample, as it is flexible for binning data into different time periods.
import io
from datetime import timedelta
import pandas as pd
dataf = pd.read_csv(io.StringIO("""Week Sales
1 10
2 15
3 10
4 20
5 20
6 10
7 15
8 10"""), sep=r'\s+').astype(int)
# reverse the data and convert the integer week numbers into timedeltas
dataf = dataf.iloc[::-1]
dataf['Week'] = dataf['Week'].map(lambda x: timedelta(weeks=x))
# set the timedelta column as the index for resampling
dataf = dataf.set_index('Week')
# now we resample
dataf.resample('21d').sum()  # bin into 21-day (3-week) periods
Note: the resulting bin labels are misleading (they are not the week ranges asked for in the question), and setting kind='period' raises an error.

Related

Use np.where on a Mixed data type column

I have a column with dtype 'O' (object). It contains numbers as well as strings. For example:
Days
5
10
15
7
No Sales Data available
9
I am trying to make a separate column using np.where, for which I have written the code as:
np.where(df['Days']=='No Sales Data available','No Sales',np.where(df['Days']<=10, 'Less than 10 days Sales','More than 10 Days Sales'))
Naturally, the code is giving problems due to mixed data types. Any idea how to get around such cases?
You could rewrite your statement in this way, which will preserve the data type of your 'Days' column:
df['new'] = np.where(pd.to_numeric(df['Days'], errors='coerce').isna(), 'No Sale',
                     np.where(pd.to_numeric(df['Days'], errors='coerce') <= 10,
                              'Less than 10 days Sales', 'More than 10 Days Sales'))
print(df)
Days new
0 5 Less than 10 days Sales
1 10 Less than 10 days Sales
2 15 More than 10 Days Sales
3 7 Less than 10 days Sales
4 No Sales Data available No Sale
5 9 Less than 10 days Sales
If you don't mind changing the type of your column, you could first convert to numeric and follow a similar logic:
df['Days'] = pd.to_numeric(df['Days'],errors='coerce')
df['new'] = np.where(df['Days'].isna(),'No Sale',np.where(df['Days']<=10,'Less than 10 days Sales','More than 10 Days Sales'))
print(df)
Days new
0 5.0 Less than 10 days Sales
1 10.0 Less than 10 days Sales
2 15.0 More than 10 Days Sales
3 7.0 Less than 10 days Sales
4 NaN No Sale
5 9.0 Less than 10 days Sales
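As a small variant (a sketch, not part of the original answer), the coerced numeric series can be computed once and reused, which avoids the repeated pd.to_numeric call while still leaving the original 'Days' column untouched:
import numpy as np
import pandas as pd

days_num = pd.to_numeric(df['Days'], errors='coerce')  # NaN wherever the value is not numeric
df['new'] = np.where(days_num.isna(), 'No Sale',
                     np.where(days_num <= 10,
                              'Less than 10 days Sales', 'More than 10 Days Sales'))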

In Pandas, how do I break down a human-readable time duration into units like days, hours, minutes, and seconds using regex?

In a dataframe, I have a duration column in a human-readable format like "29 days 4 hours 32 minutes 1 second". I want to break it down into columns of days, hours, minutes, and seconds, with the values derived from the duration column: 29 for days, 4 for hours, 32 for minutes, and 1 for seconds. I've already tried this, but it's not working correctly:
# Use regex to extract time values into their respective columns
new_df = df['duration'].str.extract(r'(?P<days>\d+(?= day))|(?P<hours>\d+(?= hour))|(?P<minutes>\d+(?= min))|(?P<seconds>\d+(?= sec))')
For example,
import pandas as pd
import re
data = {'id': ['123', '124', '125', '126', '127'],
        'date': ['1/1/2018', '1/2/2018', '1/3/2018', '1/4/2018', '1/5/2018'],
        'duration': ['29 days 4 hours 32 minutes',
                     '1 hour 23 minutes',
                     '3 hours 2 minutes 1 second',
                     '4 hours 46 minutes 22 seconds',
                     '2 hours 1 minute']}
df = pd.DataFrame(data)
# Use regex to extract time values into their respective columns
new_df = df['duration'].str.extract(r'(?P<days>\d+(?= day))|(?P<hours>\d+(?= hour))|(?P<minutes>\d+(?= min))|(?P<seconds>\d+(?= sec))')
Results in the following dataframe:
The new dataframe only captures the first value in each row: the 29 for days and the leading number of each of the other rows, while the remaining column values are NaN.
Ideally, the dataframe should look like this below:
I have a feeling something is wrong with my regex. Should I not use the "|" to separate the groups? Any help or nudge in the right direction is appreciated.
Your string format matches pd.Timedelta's string specification, so you can convert it directly to Timedelta and use its components attribute:
df_final = (df.duration.map(pd.Timedelta)
              .dt.components[['days', 'hours', 'minutes', 'seconds']])
Or
df_final = (pd.to_timedelta(df.duration)
              .dt.components[['days', 'hours', 'minutes', 'seconds']])
Output:
days hours minutes seconds
0 29 4 32 0
1 0 1 23 0
2 0 3 2 1
3 0 4 46 22
4 0 2 1 0
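For reference, this is what a single parsed value looks like (an illustrative aside, not part of the original answer):
import pandas as pd

td = pd.Timedelta('29 days 4 hours 32 minutes')
print(td.components)
# Components(days=29, hours=4, minutes=32, seconds=0,
#            milliseconds=0, microseconds=0, nanoseconds=0)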
str.extract returns only the first match, so with an alternation pattern only one named group gets filled per row; extractall captures every match instead. Here's my approach with extractall:
# same pattern as yours
# can replace this with a for loop
pattern = (r'(?P<days>\d+)(?= days?\s*)|'         # days
           + r'(?P<hours>\d+)(?= hours?\s*)|'     # hours
           + r'(?P<minutes>\d+)(?= minutes?\s*)|' # minutes
           + r'(?P<seconds>\d+)(?= seconds?\s*)'  # seconds
           )
(df.duration.str.extractall(pattern)      # extract all matches with regex
   .reset_index('match', drop=True)       # merge the matches of the same row
   .stack()
   .unstack(level=-1, fill_value=0)       # remove fill_value if you want NaN instead of 0
)
Output:
days hours minutes seconds
0 29 4 32 0
1 0 1 23 0
2 0 3 2 1
3 0 4 46 22
4 0 2 1 0
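Note that the values extracted by the regex are strings; if integer values are needed, the result can be cast at the end (a small addition, not in the original answer):
result = (df.duration.str.extractall(pattern)
            .reset_index('match', drop=True)
            .stack()
            .unstack(level=-1, fill_value=0)
            .astype(int))   # the extracted values are strings, so cast them to integers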

Get the average number of entries per month with datetime in Pandas

I have a large df with many entries per month. I would like to see the average number of entries per month, for example to see whether certain months normally have more entries. (Ideally I'd like to plot this against a line showing the overall mean, but that is maybe a later question.)
My df is something like this:
ufo=pd.read_csv('https://raw.githubusercontent.com/justmarkham/pandas-videos/master/data/ufo.csv')
ufo['Time']=pd.to_datetime(ufo.Time)
Where the head looks like this:
So if I'd like to see, for example, whether there are more ufo sightings in the summer, how would I go about it?
I have tried:
ufo.groupby(ufo.Time.month).mean()
But it only works if I am calculating a numerical value. If I use count() instead I get the total number of entries for each month across all years, not the average.
EDIT: To clarify, I would like to have the mean of entries - ufo-sightings - per month.
You could do something like this:
# number of years spanned by the records for this month
def total_month(x):
    return x.max().year - x.min().year + 1

new_df = ufo.groupby(ufo.Time.dt.month).Time.agg(['size', total_month])
new_df['mean_count'] = new_df['size'] / new_df['total_month']
Output:
size total_month mean_count
Time
1 862 57 15.122807
2 817 70 11.671429
3 1096 55 19.927273
4 1045 68 15.367647
5 1168 53 22.037736
6 3059 71 43.084507
7 2345 65 36.076923
8 1948 64 30.437500
9 1635 67 24.402985
10 1723 65 26.507692
11 1509 50 30.180000
12 1034 56 18.464286
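Since the question also mentions plotting this against the overall mean, a possible sketch (assuming matplotlib is installed and using the new_df built above) is:
import matplotlib.pyplot as plt

ax = new_df['mean_count'].plot(kind='bar')
ax.axhline(new_df['mean_count'].mean(), color='red', linestyle='--', label='overall mean')
ax.set_xlabel('month')
ax.set_ylabel('average sightings per month')
ax.legend()
plt.show()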
I think this is what you are looking for; please ask for clarification if it isn't.
# Add a new column instance, this adds a value to each instance of ufo sighting
ufo['instance'] = 1
# set index to time, this makes df a time series df and then you can apply pandas time series functions.
ufo.set_index(ufo['Time'], drop=True, inplace=True)
# create another df by resampling the original df and counting the instance column by Month ('M' is resample by month)
ufo2 = pd.DataFrame(ufo['instance'].resample('M').count())
# just to find month of resampled observation
ufo2['Time'] = pd.to_datetime(ufo2.index.values)
ufo2['month'] = ufo2['Time'].apply(lambda x: x.month)
and finally you can groupby month :)
ufo2.groupby(by='month').mean()
and this is the output:
month mean_instance
1 12.314286
2 11.671429
3 15.657143
4 14.928571
5 16.685714
6 43.084507
7 33.028169
8 27.436620
9 23.028169
10 24.267606
11 21.253521
12 14.563380
Do you mean you want to group your data by month? I think we can do this:
ufo['month'] = ufo['Time'].apply(lambda t: t.month)
ufo['year'] = ufo['Time'].apply(lambda t: t.year)
In this way, you will have 'year' and 'month' to group your data by:
ufo_2 = ufo.groupby(['year', 'month'])['place_holder'].mean()  # 'place_holder' stands for whichever column you want to aggregate
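To get from there to the average number of sightings per month, a possible continuation (a sketch, not part of the original answer) is to count entries per (year, month) and then average those counts across years:
# count sightings per (year, month), then average the counts across years for each month
per_year_month = ufo.groupby(['year', 'month']).size()
avg_per_month = per_year_month.groupby(level='month').mean()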

How to add columns with the average percent and average count to the dataframe?

This question is related to my previous question. I have the following dataframe:
df =
QUEUE_1 QUEUE_2  DAY  HOUR  TOTAL_SERVICE_TIME  TOTAL_WAIT_TIME  EVAL
ABC123  DEF656     1     7                  20               30     1
ABC123     NaN     1     7                  22               32     0
DEF656  ABC123     1     8                  15               12     0
FED456  DEF656     2     8                  15               16     1
I need to get the following dataframe (it's similar to the one I wanted to get in my previous question, but here I need to add 2 additional columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1).
QUEUE HOUR AVG_TOT_SERVICE_TIME AVG_TOT_WAIT_TIME AVG_COUNT_PER_DAY_HOUR AVG_PERCENT_EVAL_1
ABC123 7 21 31 1 50
ABC123 8 15 12 0.5 100
DEF656 7 20 30 0.5 100
DEF656 8 15 14 1 50
FED456 7 0 0 0 0
FED456 8 15 14 0.5 100
The column AVG_COUNT_PER_DAY_HOUR should contain the average count of a corresponding HOUR value over days (DAY) grouped by QUEUE. For example, in df, in case of ABC123, the HOUR 7 appears 2 times for the DAY 1 and 0 times for the DAY 2. Therefore the average is 1. The same logic is applied to the HOUR 8. It appears 1 time in DAY 1 and 0 times in DAY 2 for ABC123. Therefore the average is 0.5.
The column AVG_PERCENT_EVAL_1 should contain the percent of EVAL equal to 1 over hours, grouped by QUEUE. For example, in case of ABC123, the EVAL is equal to 1 one time when HOUR is 7. It is also equal to 0 one time when HOUR is 7. So, AVG_PERCENT_EVAL_1 is 50 for ABC123 and hour 7.
I use this approach:
df = pd.lreshape(df, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index=['QUEUE'], columns=['HOUR'], fill_value=0)
result = piv_df.stack().add_prefix('AVG_').reset_index()
I get stuck with adding the columns AVG_COUNT_PER_DAY_HOUR and AVG_PERCENT_EVAL_1. For instance, to add the column AVG_COUNT_PER_DAY_HOUR I am thinking of using .apply(pd.value_counts, 1).notnull().groupby(level=0).sum().astype(int), while for calculating AVG_PERCENT_EVAL_1 I am thinking of using [df.EVAL==1].agg({'EVAL' : 'count'}). However, I don't know how to incorporate this into my current code to get the correct solution.
UPDATE:
Perhaps it is easier to adapt this solution to what I need in this question:
result = pd.lreshape(df, {'QUEUE': ['QUEUE_1','QUEUE_2']})
mux = pd.MultiIndex.from_product([result.QUEUE.dropna().unique(),
                                  result.dropna().DAY.unique(),
                                  result.HOUR.dropna().unique()],
                                 names=['QUEUE', 'DAY', 'HOUR'])
print (result.groupby(['QUEUE','DAY','HOUR'])
             .mean()
             .reindex(mux, fill_value=0)
             .add_prefix('AVG_')
             .reset_index())
Steps:
1) To compute AVG_COUNT_PER_DAY_HOUR:
With the help of pd.crosstab(), compute the counts of HOUR with respect to DAY (so that we also obtain entries for the missing days), grouped by QUEUE.
stack the DF so that HOUR, which was part of the hierarchical columns before, now becomes part of the index, leaving just the DAY values as columns. We take the mean column-wise after filling NaNs with 0.
2) To compute AVG_PERCENT_EVAL_1:
After getting the pivoted frame (same as before): since EVAL is binary (1/0), its mean is simply the fraction of 1s, and that mean was already computed while pivoting (the default aggfunc is np.mean). So we just take EVAL from this DF and multiply it by 100 to express it as a percentage.
Finally, we join all these frames.
same as in the linked post:
df = pd.lreshape(df, {'QUEUE': df.columns[df.columns.str.startswith('QUEUE')].tolist()})
piv_df = df.pivot_table(index='QUEUE', columns='HOUR', fill_value=0).stack()
avg_tot = piv_df[['TOTAL_SERVICE_TIME', 'TOTAL_WAIT_TIME']].add_prefix("AVG_")
additional portion:
avg_cnt = pd.crosstab(df['QUEUE'], [df['DAY'], df['HOUR']]).stack().fillna(0).mean(1)
avg_pct = piv_df['EVAL'].mul(100).astype(int)
avg_tot.join(
    avg_cnt.to_frame("AVG_COUNT_PER_DAY_HOUR")
).join(avg_pct.to_frame("AVG_PERCENT_EVAL_1")).reset_index()
avg_cnt looks like:
QUEUE HOUR
ABC123 7 1.0
8 0.5
DEF656 7 0.5
8 1.0
FED456 7 0.0
8 0.5
dtype: float64
avg_pct looks like:
QUEUE HOUR
ABC123 7 50
8 0
DEF656 7 100
8 50
FED456 7 0
8 100
Name: EVAL, dtype: int32

Pandas: groupby and get median value by month?

I have a data frame that looks like this:
org date value
0 00C 2013-04-01 0.092535
1 00D 2013-04-01 0.114941
2 00F 2013-04-01 0.102794
3 00G 2013-04-01 0.099421
4 00H 2013-04-01 0.114983
Now I want to figure out:
1) the median value for each organisation in each month of the year
2) X for each organisation, where X is the percentage difference between the lowest monthly median and the highest monthly median.
What's the best way to approach this in Pandas?
I am trying to generate the medians by month as follows, but it's failing:
df['date'] = pd.to_datetime(df['date'])
ave = df.groupby(['row_id', 'date.month']).median()
This fails with KeyError: 'date.month'.
UPDATE: Thanks to @EdChum I'm now doing:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
which works great and gives me:
99P 1 0.106975
2 0.091344
3 0.098958
4 0.092400
5 0.087996
6 0.081632
7 0.083592
8 0.075258
9 0.080393
10 0.089634
11 0.085679
12 0.108039
99Q 1 0.110889
2 0.094837
3 0.100658
4 0.091641
5 0.088971
6 0.083329
7 0.086465
8 0.078368
9 0.082947
10 0.090943
11 0.086343
12 0.109408
Now I guess, for each item in the index, I need to find the min and max calculated values, then the difference between them. What is the best way to do that?
For your first error: you can pass a list of column names or the columns themselves, but you passed a list of names and 'date.month' doesn't exist as a column, so you want:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
After that you can calculate the min and max for a specific index level, so:
((ave.max(level=0) - ave.min(level=0))/ave.max(level=0)) * 100
should give you what you want.
This calculates the difference between the min and max value for each organisation, divides by the max at that level, and creates the percentage by multiplying by 100.
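As a side note (an assumption about newer pandas versions, not part of the original answer): DataFrame.max(level=...) has since been deprecated and removed, so on recent pandas the same calculation can be written with an explicit groupby:
# group by the first index level (the organisation) instead of using max(level=0)
mx = ave.groupby(level=0).max()
mn = ave.groupby(level=0).min()
pct_diff = (mx - mn) / mx * 100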
