I'm trying to calculate running difference on the date column depending on "event column".
So, to add another column with date difference between 1 in event column (there only 0 and 1).
Spo far I came to this half-working crappy solution
Dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Code:
x = df.loc[df['event']==1, 'date']
k = 0
for i in range(len(x)):
df.loc[k:x.index[i], 'duration'] = x.iloc[i] - k
k = x.index[i]
But I'm sure there is a more elegant solution.
Thanks for any advice.
Output format:
+------+-------+----------+
| date | event | duration |
+------+-------+----------+
| 1 | 0 | 3 |
| 2 | 0 | 3 |
| 3 | 1 | 3 |
| 4 | 0 | 6 |
| 5 | 0 | 6 |
| 6 | 0 | 6 |
| 7 | 0 | 6 |
| 8 | 0 | 6 |
| 9 | 1 | 6 |
| 10 | 0 | 4 |
| 11 | 0 | 4 |
| 12 | 0 | 4 |
| 13 | 1 | 4 |
| 14 | 0 | 2 |
| 15 | 1 | 2 |
+------+-------+----------+
Using your initial dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Add an index-like column to mark where the transitions occur (you could also base this on the date column if it is unique):
df = df.reset_index().rename(columns={'index':'idx'})
df.loc[df['event']==0, 'idx'] = np.nan
df['idx'] = df['idx'].fillna(method='bfill')
Then, use a groupby() to count the records, and backfill them to match your structure:
df['duration'] = df.groupby('idx')['event'].count()
df['duration'] = df['duration'].fillna(method='bfill')
# Alternatively, the previous two lines can be combined as pointed out by OP
# df['duration'] = df.groupby('idx')['event'].transform('count')
df = df.drop(columns='idx')
print(df)
date event duration
0 1 0 2.0
1 2 1 2.0
2 3 0 3.0
3 4 0 3.0
4 5 1 3.0
5 6 0 5.0
6 7 0 5.0
7 8 0 5.0
8 9 0 5.0
9 10 1 5.0
10 11 0 6.0
11 12 0 6.0
12 13 0 6.0
13 14 0 6.0
14 15 0 6.0
15 16 1 6.0
16 17 0 NaN
It ends up as a float value because of the NaN in the last row. This approach works well in general if there are obvious "groups" of things to count.
As an alternative, because the dates are already there as integers you can look at the differences in dates directly:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0]})
tmp = df[df['event']==1].copy()
tmp['duration'] = (tmp['date'] - tmp['date'].shift(1)).fillna(tmp['date'])
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').fillna(method='bfill')
I have a table as such
+------+------------+-------+
| Idx | date | value |
+------+------------+-------+
| A | 20/11/2016 | 10 |
| A | 21/11/2016 | 8 |
| A | 22/11/2016 | 12 |
| B | 20/11/2016 | 16 |
| B | 21/11/2016 | 18 |
| B | 22/11/2016 | 11 |
+------+------------+-------+
I'd like to create a column that creates a new column 'rolling_quantile_value' based on the column 'value' that calculates a quantile based on the past for each row and each possible Idx.
For the example above, if the quantile chosen is median, the output should look like this :
+------+------------+-------+-----------------------+
| Idx | date | value | rolling_median_value |
+------+------------+-------+-----------------------+
| A | 20/11/2016 | 10 | NaN |
| A | 21/11/2016 | 8 | 10 |
| A | 22/11/2016 | 12 | 9 |
| A | 23/11/2016 | 14 | 10 |
| B | 20/11/2016 | 16 | NaN |
| B | 21/11/2016 | 18 | 16 |
| B | 22/11/2016 | 11 | 17 |
+------+------------+-------+-----------------------+
I've done it the naive way where I just put a function that creates row by row based on precedents rows of value and flags the jump from one Id to another but I'm sure that it's not the most efficient way to do that, nor the most elegant.
Looking forward to your suggestions !
I think you want expanding
df['rolling_median_value']=(df.groupby('Idx',sort=False)
.expanding(1)['value']
.median()
.groupby(level=0)
.shift()
.reset_index(drop=True))
print(df)
Idx date value rolling_median_value
0 A 20/11/2016 10 NaN
1 A 21/11/2016 8 10.0
2 A 22/11/2016 12 9.0
3 A 23/11/2016 14 10.0
4 B 20/11/2016 16 NaN
5 B 21/11/2016 18 16.0
6 B 22/11/2016 11 17.0
UPDATE
df['rolling_quantile_value']=(df.groupby('Idx',sort=False)
.expanding(1)['value']
.quantile(0.75)
.groupby(level=0)
.shift()
.reset_index(drop=True))
print(df)
Idx date value rolling_quantile_value
0 A 20/11/2016 10 NaN
1 A 21/11/2016 8 10.0
2 A 22/11/2016 12 9.5
3 A 23/11/2016 14 11.0
4 B 20/11/2016 16 NaN
5 B 21/11/2016 18 16.0
6 B 22/11/2016 11 17.5
Hi I'm trying to use ML to predict some future sales. So i would like to add mean sales from the previous month/year for each product
My df is something like: [ id | year | month | product_id | sales ] I would like to add prev_month_mean_sale and prev_month_id_sale columns
id | year | month | product_id | sales | prev_month_mean_sale | prev_month_id_sale
----------------------------------------------------------------------
1 | 2018 | 1 | 123 | 5 | NaN | NaN
2 | 2018 | 1 | 234 | 4 | NaN | NaN
3 | 2018 | 1 | 345 | 2 | NaN | NaN
4 | 2018 | 2 | 123 | 3 | 3.6 | 5
5 | 2018 | 2 | 345 | 2 | 3.6 | 2
6 | 2018 | 3 | 123 | 4 | 2.5 | 3
7 | 2018 | 3 | 234 | 6 | 2.5 | 0
8 | 2018 | 3 | 567 | 7 | 2.5 | 0
9 | 2019 | 1 | 234 | 4 | 5.6 | 6
10 | 2019 | 1 | 567 | 3 | 5.6 | 7
also I would like to add prev_year_mean_sale and prev_year_id_sale
prev_month_mean_sale is the mean of the total sales of the previuos month, eg: for month 2 is (5+4+2)/3
My actual code is something like:
for index,row in df.iterrows():
loc = df.index[(df['month'] == row['month']-1) &
(df['year'] == row['year']) &
(df['product_id'] == row['product_id']).tolist()[0]]
df.loc[index, 'prev_month_id_sale'] = df.loc[ loc ,'sales']
but it is really slow and my df is really big. Maybe there is another option using groupby() or something like that.
A simple way to avoid loop is to use merge() from dataframe:
df["prev_month"] = df["month"] - 1
result = df.merge(df.rename(columns={"sales", "prev_month_id"sale"}),
how="left",
left_on=["year", "prev_month", "product_id"],
right_on=["year", "month", "product_id"])
The result in this way will have more columns than you needed. You should drop() some of them and/or rename() some other.
I have a challenge in front of me in python.
| Growth_rate | Month |
| ------------ |-------|
| 0 | 1 |
| -2 | 1 |
| 1.2 | 1 |
| 0.3 | 2 |
| -0.1 | 2 |
| 7 | 2 |
| 9 | 3 |
| 4.1 | 3 |
Now I want to average the growth rate according to the months in a new columns. Like 1st month the avg would be -0.26 and it should look like below table.
| Growth_rate | Month | Mean |
| ----------- | ----- | ----- |
| 0 | 1 | -0.26 |
| -2 | 1 | -0.26 |
| 1.2 | 1 | -0.26 |
| 0.3 | 2 | 2.2 |
| -0.1 | 2 | 2.2 |
| 7 | 2 | 2.2 |
| 9 | 3 | 6.5 |
| 4.1 | 3 | 6.5 |
This will calculate the mean growth rate and put it into mean column.
Any help would be great.
df.groupby(df.months).mean().reset_index().rename(columns={'Growth_Rate':'mean'}).merge(df,on='months')
Out[59]:
months mean Growth_Rate
0 1 -0.266667 0.0
1 1 -0.266667 -2.0
2 1 -0.266667 1.2
3 2 2.200000 -0.3
4 2 2.200000 -0.1
5 2 2.200000 7.0
6 3 6.550000 9.0
7 3 6.550000 4.1
Assuming that you are using the pandas package. If your table is in a DataFrame df
In [91]: means = df.groupby('Month').mean().reset_index()
In [92]: means.columns = ['Month', 'Mean']
Then join via merge
In [93]: pd.merge(df, means, how='outer', on='Month')
Out[93]:
Growth_rate Month Mean
0 0.0 1 -0.266667
1 -2.0 1 -0.266667
2 1.2 1 -0.266667
3 0.3 2 2.400000
4 -0.1 2 2.400000
5 7.0 2 2.400000
6 9.0 3 6.550000
7 4.1 3 6.550000
I have a dataframe of millions of rows like so, with no duplicate time-ID stamps:
ID | Time | Activity
a | 1 | Bar
a | 3 | Bathroom
a | 2 | Bar
a | 4 | Bathroom
a | 5 | Outside
a | 6 | Bar
a | 7 | Bar
What's the most efficient way to convert it to this format?
ID | StartTime | EndTime | Location
a | 1 | 2 | Bar
a | 3 | 4 | Bathroom
a | 5 | N/A | Outside
a | 6 | 7 | Bar
I have to do this with a lot of data, so wondering how to speed up this process as much as possible.
I am using groupby
df.groupby(['ID','Activity']).Time.apply(list).apply(pd.Series).rename(columns={0:'starttime',1:'endtime'}).reset_index()
Out[251]:
ID Activity starttime endtime
0 a Bar 1.0 2.0
1 a Bathroom 3.0 4.0
2 a Outside 5.0 NaN
Or using pivot_table
df.assign(I=df.groupby(['ID','Activity']).cumcount()).pivot_table(index=['ID','Activity'],columns='I',values='Time')
Out[258]:
I 0 1
ID Activity
a Bar 1.0 2.0
Bathroom 3.0 4.0
Outside 5.0 NaN
Update
df.assign(I=df.groupby(['ID','Activity']).cumcount()//2).groupby(['ID','Activity','I']).Time.apply(list).apply(pd.Series).rename(columns={0:'starttime',1:'endtime'}).reset_index()
Out[282]:
ID Activity I starttime endtime
0 a Bar 0 1.0 2.0
1 a Bar 1 6.0 7.0
2 a Bathroom 0 3.0 4.0
3 a Outside 0 5.0 NaN