Partition dataset by timestamp - python

I have a dataframe of millions of rows like the one below, with no duplicate (ID, Time) stamps:
ID | Time | Activity
a | 1 | Bar
a | 3 | Bathroom
a | 2 | Bar
a | 4 | Bathroom
a | 5 | Outside
a | 6 | Bar
a | 7 | Bar
What's the most efficient way to convert it to this format?
ID | StartTime | EndTime | Location
a | 1 | 2 | Bar
a | 3 | 4 | Bathroom
a | 5 | N/A | Outside
a | 6 | 7 | Bar
I have to do this with a lot of data, so I'm wondering how to speed up this process as much as possible.

I am using groupby. Note this assumes each (ID, Activity) pair occurs at most twice (the output below is for the first five sample rows); see the Update for repeated activities:
df.groupby(['ID','Activity']).Time.apply(list).apply(pd.Series).rename(columns={0:'starttime',1:'endtime'}).reset_index()
Out[251]:
ID Activity starttime endtime
0 a Bar 1.0 2.0
1 a Bathroom 3.0 4.0
2 a Outside 5.0 NaN
Or using pivot_table
df.assign(I=df.groupby(['ID','Activity']).cumcount()).pivot_table(index=['ID','Activity'],columns='I',values='Time')
Out[258]:
I 0 1
ID Activity
a Bar 1.0 2.0
Bathroom 3.0 4.0
Outside 5.0 NaN
Update — handling repeated activities by pairing consecutive occurrences with cumcount() // 2:
df.assign(I=df.groupby(['ID','Activity']).cumcount()//2).groupby(['ID','Activity','I']).Time.apply(list).apply(pd.Series).rename(columns={0:'starttime',1:'endtime'}).reset_index()
Out[282]:
ID Activity I starttime endtime
0 a Bar 0 1.0 2.0
1 a Bar 1 6.0 7.0
2 a Bathroom 0 3.0 4.0
3 a Outside 0 5.0 NaN
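For reference, here is a minimal end-to-end sketch of the Update approach that produces the exact StartTime/EndTime/Location layout requested. Sorting by Time within each ID first is my own assumption, so consecutive visits pair up even when rows arrive out of order:
import pandas as pd

df = pd.DataFrame({'ID': ['a'] * 7,
                   'Time': [1, 3, 2, 4, 5, 6, 7],
                   'Activity': ['Bar', 'Bathroom', 'Bar', 'Bathroom',
                                'Outside', 'Bar', 'Bar']})

s = df.sort_values(['ID', 'Time'])
# pair the 1st/2nd, 3rd/4th, ... occurrence of each (ID, Activity)
s['I'] = s.groupby(['ID', 'Activity']).cumcount() // 2
g = s.groupby(['ID', 'Activity', 'I'])['Time']
out = (pd.DataFrame({'StartTime': g.first(),
                     'EndTime': g.last().where(g.count() > 1)})  # NaN for unpaired visits
         .reset_index()
         .drop(columns='I')
         .rename(columns={'Activity': 'Location'})
         .sort_values('StartTime'))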

Related

Multiply dataframe's row non-NA values "element-wise" with list

Imagine we have pandas.DataFrame like :
| na | na | 3 | 3  | 5  | 2  |
| na | 5  | 2 | 2  | 1  | na |
| 1  | 2  | 2 | 3  | na | na |
The idea is to multiply the non-NA values of each row, element-wise, by a constant list such as const = [0, 1, 2, 3].
If a cell is na, it should stay na in the result:
| na | na | 0 | 3 | 10 | 6 |
| na | 0 | 2 | 4 | 3 | na |
| 0 | 2 | 4 | 9 | na | na |
Using cumsum and mul, which handles a variable number of NaNs per row and avoids stack. This works here because const = [0, 1, 2, 3] is exactly the 0-based position of each non-NA value within its row:
df.mul(df.notnull().cumsum(1).sub(1))
0 1 2 3 4 5
0 NaN NaN 0 3 10.0 6.0
1 NaN 0.0 2 4 3.0 NaN
2 0.0 2.0 4 9 NaN NaN
IIUC, you can stack then unstack:
df.stack().mul(np.tile(const, df.shape[0])).unstack()
Output:
0 1 2 3 4 5
0 NaN NaN 0.0 3.0 10.0 6.0
1 NaN 0.0 2.0 4.0 3.0 NaN
2 0.0 2.0 4.0 9.0 NaN NaN
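For completeness, a self-contained setup for both answers above, with the frame and const list reconstructed from the question. Note the stack/unstack version assumes every row has exactly len(const) non-NA values, since stack() drops the NaNs:
import numpy as np
import pandas as pd

df = pd.DataFrame([[np.nan, np.nan, 3, 3, 5, 2],
                   [np.nan, 5, 2, 2, 1, np.nan],
                   [1, 2, 2, 3, np.nan, np.nan]])
const = [0, 1, 2, 3]
result = df.stack().mul(np.tile(const, df.shape[0])).unstack()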

Create new columns based on other's columns value

I'm trying to do some feature engineering for a pandas data frame.
Say I have this:
Data frame 1:
X | date | is_holiday
a | 1/4/2018 | 0
a | 1/5/2018 | 0
a | 1/6/2018 | 1
a | 1/7/2018 | 0
a | 1/8/2018 | 0
...
b | 1/1/2018 | 1
I'd like additional indicator columns for each date, flagging whether it falls 1 or 2 days before a holiday, and likewise 1 or 2 days after.
Data frame 1:
X | date | is_holiday | one_day_before_hol | ... | one_day_after_hol
a | 1/4/2018 | 0 | 0 | ... | 0
a | 1/5/2018 | 0 | 1 | ... | 0
a | 1/6/2018 | 1 | 0 | ... | 0
a | 1/7/2018 | 0 | 0 | ... | 1
a | 1/8/2018 | 0 | 0 | ... | 0
...
b | 1/1/2018 | 1 | 0 | ... | 0
Is there an efficient way to do this? I believe I could do it with for statements, but since I'm new to Python, I'd like to see if there is a more elegant way. Dates might not be adjacent or continuous (i.e. for some values of X, a specific date might not be present).
Thank you so much!
Use pandas.DataFrame.groupby.shift:
import pandas as pd
g = df.groupby('X')['is_holiday']
df['one_day_before'] = g.shift(-1).fillna(0)
df['two_day_before'] = g.shift(-2).fillna(0)
df['one_day_after'] = g.shift(1).fillna(0)
Output:
X date is_holiday one_day_before two_day_before one_day_after
0 a 1/4/2018 0 0.0 1.0 0.0
1 a 1/5/2018 0 1.0 0.0 0.0
2 a 1/6/2018 1 0.0 0.0 0.0
3 a 1/7/2018 0 0.0 0.0 1.0
4 a 1/8/2018 0 0.0 0.0 0.0
5 b 1/1/2018 1 0.0 0.0 0.0
You could shift:
import pandas as pd
df = pd.DataFrame([1,0,0,1,1,0], columns=['day'])
df.head()
day
0 1
1 0
2 0
3 1
4 1
df['One Day Before'] = df['day'].shift(-1)
df['One Day After'] = df['day'].shift(1)
df['Two Days Before'] = df['day'].shift(-2)
df
day One Day Before One Day After Two Days Before
0 1 0.0 NaN 0.0
1 0 0.0 1.0 1.0
2 0 1.0 0.0 1.0
3 1 1.0 0.0 0.0
4 1 0.0 1.0 NaN
5 0 NaN 1.0 NaN
This shifts the indicator up or down into a new column. You will have to deal with the NaNs, though.
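Since the question notes dates may not be adjacent or continuous, here is a gap-tolerant sketch of my own (not from the answers above): instead of shifting rows, test each row's actual calendar date against the set of (X, holiday date) pairs. The one_day_before_hol / one_day_after_hol names come from the question; the two_days_* names are mine:
import pandas as pd

df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')
holidays = set(zip(df.loc[df.is_holiday == 1, 'X'],
                   df.loc[df.is_holiday == 1, 'date']))

def flag(days):
    # 1 if the date `days` days ahead of this row is a holiday for the same X
    return [int((x, d + pd.Timedelta(days=days)) in holidays)
            for x, d in zip(df['X'], df['date'])]

df['one_day_before_hol'] = flag(1)     # holiday is tomorrow
df['two_days_before_hol'] = flag(2)
df['one_day_after_hol'] = flag(-1)     # holiday was yesterday
df['two_days_after_hol'] = flag(-2)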

How to find average after sorting month column in python

I have a challenge in front of me in python.
| Growth_rate | Month |
| ------------ |-------|
| 0 | 1 |
| -2 | 1 |
| 1.2 | 1 |
| 0.3 | 2 |
| -0.1 | 2 |
| 7 | 2 |
| 9 | 3 |
| 4.1 | 3 |
Now I want to average the growth rate by month into a new column. For the first month the average would be about -0.27, and the table should look like this:
| Growth_rate | Month | Mean  |
| ----------- | ----- | ----- |
| 0           | 1     | -0.27 |
| -2          | 1     | -0.27 |
| 1.2         | 1     | -0.27 |
| 0.3         | 2     | 2.4   |
| -0.1        | 2     | 2.4   |
| 7           | 2     | 2.4   |
| 9           | 3     | 6.55  |
| 4.1         | 3     | 6.55  |
This should calculate the mean growth rate per month and put it into the Mean column.
Any help would be great.
df.groupby('Month').mean().reset_index().rename(columns={'Growth_rate':'Mean'}).merge(df,on='Month')
Out[59]:
Month Mean Growth_rate
0 1 -0.266667 0.0
1 1 -0.266667 -2.0
2 1 -0.266667 1.2
3 2 2.400000 0.3
4 2 2.400000 -0.1
5 2 2.400000 7.0
6 3 6.550000 9.0
7 3 6.550000 4.1
Assuming you are using pandas and your table is in a DataFrame df:
In [91]: means = df.groupby('Month').mean().reset_index()
In [92]: means.columns = ['Month', 'Mean']
Then join via merge
In [93]: pd.merge(df, means, how='outer', on='Month')
Out[93]:
Growth_rate Month Mean
0 0.0 1 -0.266667
1 -2.0 1 -0.266667
2 1.2 1 -0.266667
3 0.3 2 2.400000
4 -0.1 2 2.400000
5 7.0 2 2.400000
6 9.0 3 6.550000
7 4.1 3 6.550000
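A shorter alternative of my own: groupby.transform broadcasts each group's mean straight back onto the original rows, so no merge is needed:
import pandas as pd

df = pd.DataFrame({'Growth_rate': [0, -2, 1.2, 0.3, -0.1, 7, 9, 4.1],
                   'Month': [1, 1, 1, 2, 2, 2, 3, 3]})
df['Mean'] = df.groupby('Month')['Growth_rate'].transform('mean')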

pandas combining dataframes optimisation

Hey, I have a time-series order dataset in pandas with missing values for some dates. To correct it, I am trying to pick up the value from the most recent earlier date available.
for date in dates_missing:
    df_temp = df[df.order_date < date].sort_values('order_date', ascending=False)
    supplier_map = df_temp.groupby('supplier_id')['value'].first()
    for supplier_id in supplier_map.index.values:
        # .loc is needed here; chained indexing would silently do nothing
        df.loc[(df.order_date == date) & (df.supplier_id == supplier_id), 'value'] = supplier_map.get(supplier_id)
To explain the code: I loop over the missing dates, fetching the rows prior to each missing date.
Then I build a supplier-to-value map using pandas first().
Now, the slowest part is writing the values back into the original data frame:
I loop over each supplier and update the values one by one.
I need suggestions to speed up this inner for loop.
Example:
|order_date|supplier_id |value |sku_id|
|2017-12-01| 10 | 1.0 | 1 |
|2017-12-01| 9 | 1.3 | 7 |
|2017-12-01| 3 | 1.4 | 2 |
|2017-12-02| 3 | 0 | 2 |
|2017-12-02| 9 | 0 | 7 |
|2017-12-03| 3 | 1.0 | 2 |
|2017-12-03| 10 | 1.0 | 1 |
|2017-12-03| 9 | 1.3 | 7 |
Date to fix: 2017-12-02
|2017-12-02| 3 | 0 | 2 |
|2017-12-02| 9 | 0 | 7 |
Corrected data frame:
|order_date|supplier_id |value |sku_id|
|2017-12-01| 10 | 1.0 | 1 |
|2017-12-01| 9 | 1.3 | 7 |
|2017-12-01| 3 | 1.4 | 2 |
|2017-12-02| 3 | 1.4 | 2 |
|2017-12-02| 9 | 1.3 | 7 |
|2017-12-03| 3 | 1.0 | 2 |
|2017-12-03| 10 | 1.0 | 1 |
|2017-12-03| 9 | 1.3 | 7 |
PS:
I might not have been entirely clear with the question, so I'd be happy to answer any doubts and re-edit the post as needed.
You can group the dataframe by supplier_id and, within each group, replace 0 with null, forward-fill, and backward-fill any leading values.
It should reduce your run time:
import numpy as np
df['value'] = df.groupby('supplier_id')['value'].transform(lambda x: x.replace(0, np.nan).ffill().bfill())
Out:
order_date sku_id supplier_id value
0 2017-12-01 1 10 1.0
1 2017-12-01 7 9 1.3
2 2017-12-01 2 3 1.4
3 2017-12-02 2 3 1.4
4 2017-12-02 7 9 1.3
5 2017-12-03 2 3 1.0
6 2017-12-03 1 10 1.0
7 2017-12-03 7 9 1.3
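For reference, a self-contained sketch on the question's example data. Sorting within each supplier first is my own addition, since the forward fill relies on chronological order:
import numpy as np
import pandas as pd

df = pd.DataFrame({'order_date': ['2017-12-01', '2017-12-01', '2017-12-01',
                                  '2017-12-02', '2017-12-02',
                                  '2017-12-03', '2017-12-03', '2017-12-03'],
                   'supplier_id': [10, 9, 3, 3, 9, 3, 10, 9],
                   'value': [1.0, 1.3, 1.4, 0, 0, 1.0, 1.0, 1.3],
                   'sku_id': [1, 7, 2, 2, 7, 2, 1, 7]})
df = df.sort_values(['supplier_id', 'order_date'])
df['value'] = (df.groupby('supplier_id')['value']
                 .transform(lambda x: x.replace(0, np.nan).ffill().bfill()))
df = df.sort_index()   # restore the original row order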

Top Bottom pairings based on column values in a pandas dataframe

I would like to generate Sector/Group-wise pairs from a DataFrame based on the values in its Score column.
+---------+-------------------+---------+
| Ticker | Sector | Score |
+---------+-------------------+---------+
| ABC | Energy | 3.5 |
| XYZ | Energy | 4.5 |
| PQR | Tech | 5.5 |
| MNP | Tech | 1.5 |
| JKL | Energy | 10.5 |
| BCA | Energy | 8.5 |
| RDB | Tech | 6.5 |
| JMP | Tech | 2.5 |
+---------+-------------------+---------+
From the above example, in sector Energy, JKL/ABC would be one such pairing, as JKL is the highest and ABC the lowest scorer in that sector. Similarly, the next pairing within Energy would be BCA/XYZ, as BCA is the second highest and XYZ the second lowest within that sector.
As a next step I would like to retain those pairs within each sector where the pair-difference is greater than a certain threshold.
Thank you for your help.
The output could be:
+---------+-------------------+---------+
| Ticker | Sector | Result |
+---------+-------------------+---------+
| ABC | Energy | 0 |
| XYZ | Energy | 0 |
| PQR | Tech | 1 |
| MNP | Tech | 0 |
| JKL | Energy | 1 |
| BCA | Energy | 1 |
| RDB | Tech | 1 |
| JMP | Tech | 0 |
+---------+-------------------+---------+
Is this what you are after?
(
 df.groupby('Sector')
   .apply(lambda x: [x.Ticker.iloc[x.Score.values.argmin()],
                     x.Ticker.iloc[x.Score.values.argmax()],
                     x.Score.min(), x.Score.max()])
   .apply(pd.Series)
   .set_axis(['Low Ticker', 'High Ticker', 'Low', 'High'], axis=1)
   .assign(Diff = lambda x: x.High - x.Low)
)
Out[653]:
Low Ticker High Ticker Low High Diff
Sector
Energy ABC JKL 3.5 10.5 7.0
Tech MNP RDB 1.5 6.5 5.0
Then you can retain those pairs within each sector where the pair-difference is greater than a certain threshold by filtering the Diff column.
This is what I would do:
df = df.sort_values('Score')
df = df.assign(New=df.groupby('Sector').cumcount() % 2)
df = df.assign(New2=df.groupby('Sector').New.apply(lambda x: x.cumsum().replace(0, len(x) // 2)))
df.groupby(['Sector', 'New2']).Ticker.apply(list)
Out[1464]:
Sector New2
Energy 1 [XYZ, BCA]
2 [ABC, JKL]
Tech 1 [JMP, PQR]
2 [MNP, RDB]
Name: Ticker, dtype: object
Then
df['Result']=(df.Score==df.groupby(['Sector','New2']).Score.transform('max')).astype(int)
df.sort_index()
Out[1471]:
Ticker Sector Score New New2 Result
0 ABC Energy 3.5 0 2 0
1 XYZ Energy 4.5 1 1 0
2 PQR Tech 5.5 0 1 1
3 MNP Tech 1.5 0 2 0
4 JKL Energy 10.5 1 2 1
5 BCA Energy 8.5 0 1 1
6 RDB Tech 6.5 1 2 1
7 JMP Tech 2.5 1 1 0
Edit: per the OP's request, adding the diff:
df['DIFF'] = df.groupby(['Sector','New2']).Score.apply(lambda x: x.diff().bfill())
df.sort_index()
Out[1479]:
Ticker Sector Score New New2 Result DIFF
0 ABC Energy 3.5 0 2 0 7.0
1 XYZ Energy 4.5 1 1 0 4.0
2 PQR Tech 5.5 0 1 1 3.0
3 MNP Tech 1.5 0 2 0 5.0
4 JKL Energy 10.5 1 2 1 7.0
5 BCA Energy 8.5 0 1 1 4.0
6 RDB Tech 6.5 1 2 1 5.0
7 JMP Tech 2.5 1 1 0 3.0
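Finally, the threshold step from the question is a plain boolean filter on DIFF (the cutoff value here is hypothetical):
threshold = 4.0   # hypothetical cutoff
kept = df[df['DIFF'] > threshold]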
