Create new columns based on other columns' values - python

I'm trying to do some feature engineering for a pandas data frame.
Say I have this:
Data frame 1:
X | date | is_holiday
a | 1/4/2018 | 0
a | 1/5/2018 | 0
a | 1/6/2018 | 1
a | 1/7/2018 | 0
a | 1/8/2018 | 0
...
b | 1/1/2018 | 1
I'd like to have additional indicator columns for some dates, showing whether the date falls 1 or 2 days before a holiday, and also 1 or 2 days after.
Data frame 1 (desired output):
X | date | is_holiday | one_day_before_hol | ... | one_day_after_hol
a | 1/4/2018 | 0 | 0 | ... | 0
a | 1/5/2018 | 0 | 1 | ... | 0
a | 1/6/2018 | 1 | 0 | ... | 0
a | 1/7/2018 | 0 | 0 | ... | 1
a | 1/8/2018 | 0 | 0 | ... | 0
...
b | 1/1/2018 | 1 | 0 | ... | 0
Is there any efficient way to do it? I believe I could do it with for loops, but since I'm new to Python, I'd like to see if there is a more elegant way. Dates might not be adjacent or continuous (i.e. for some X values, a specific date might not be present).
Thank you so much!

Use groupby together with shift:
import pandas as pd
g = df.groupby('X')['is_holiday']
df['one_day_before'] = g.shift(-1).fillna(0)  # is_holiday of the next row within the same X
df['two_day_before'] = g.shift(-2).fillna(0)  # is_holiday two rows ahead
df['one_day_after'] = g.shift(1).fillna(0)    # is_holiday of the previous row
Output:
X date is_holiday one_day_before two_day_before one_day_after
0 a 1/4/2018 0 0.0 1.0 0.0
1 a 1/5/2018 0 1.0 0.0 0.0
2 a 1/6/2018 1 0.0 0.0 0.0
3 a 1/7/2018 0 0.0 0.0 1.0
4 a 1/8/2018 0 0.0 0.0 0.0
5 b 1/1/2018 1 0.0 0.0 0.0
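Note that shift works on row positions, so the answer above implicitly assumes each X group contains one row per consecutive day. If dates can be missing (as mentioned in the question), here is a rough sketch that checks the actual holiday dates per X instead; the sample frame and the new column names are placeholders, not part of the original answer:
import pandas as pd

# Hypothetical sample mirroring the question's layout
df = pd.DataFrame({
    'X': ['a', 'a', 'a', 'a', 'a', 'b'],
    'date': pd.to_datetime(['1/4/2018', '1/5/2018', '1/6/2018', '1/7/2018', '1/8/2018', '1/1/2018']),
    'is_holiday': [0, 0, 1, 0, 0, 1],
})

# Holiday dates per X, so gaps in the calendar do not matter
holidays = df[df['is_holiday'] == 1].groupby('X')['date'].agg(set)

def flag(row, offset):
    # 1 if the date `offset` days ahead is a holiday for the same X
    return int(row['date'] + pd.Timedelta(days=offset) in holidays.get(row['X'], set()))

for col, off in [('one_day_before_hol', 1), ('two_days_before_hol', 2),
                 ('one_day_after_hol', -1), ('two_days_after_hol', -2)]:
    df[col] = df.apply(flag, axis=1, offset=off)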

You could shift:
import pandas as pd
df = pd.DataFrame([1,0,0,1,1,0], columns=['day'])
df.head()
day
0 1
1 0
2 0
3 1
4 1
df['One Day Before'] = df['day'].shift(-1)
df['One Day After'] = df['day'].shift(1)
df['Two Days before'] = df['day'].shift(-2)
df
day One Day Before One Day After Two Days before
0 1 0.0 NaN 0.0
1 0 0.0 1.0 1.0
2 0 1.0 0.0 1.0
3 1 1.0 0.0 0.0
4 1 0.0 1.0 NaN
5 0 NaN 1.0 NaN
This moves the is_holiday values up or down into a new column. You will have to deal with the NaNs, though.
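One simple way to resolve those NaNs, assuming a missing neighbour simply means "not a holiday", is to fill with 0 and restore the integer dtype:
# Fill the NaNs introduced by shifting and cast back to integers
df = df.fillna(0).astype(int)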

Related

Create a counter of date values for a given max-min interval

Given the following Python pandas DataFrame:
| date | column_1 | column_2 |
| ---------- | -------- | -------- |
| 2022-02-01 | val | val2 |
| 2022-02-03 | val1 | val |
| 2022-02-01 | val | val3 |
| 2022-02-04 | val2 | val |
| 2022-02-27 | val2 | val4 |
I want to create a new DataFrame where each row corresponds to a date between the minimum and maximum dates of the original DataFrame. The counter column contains the number of rows with that date.
| date | counter |
| ---------- | -------- |
| 2022-02-01 | 2 |
| 2022-02-02 | 0 |
| 2022-02-03 | 1 |
| 2022-02-04 | 1 |
| 2022-02-05 | 0 |
...
| 2022-02-26 | 0 |
| 2022-02-27 | 1 |
Count the dates first and remove the duplicates using drop_duplicates. Then fill in the intermediate dates with asfreq: pandas provides asfreq for a DatetimeIndex, which is basically just a thin but convenient wrapper around reindex() that generates a date_range and calls reindex.
df['counts'] = df['date'].map(df['date'].value_counts())  # rows per date
df = df.drop_duplicates(subset='date', keep="first")
df.date = pd.to_datetime(df.date)
df = df.set_index('date').asfreq('D').reset_index()  # insert the missing days as NaN
df = df.fillna(0)
print(df)
Gives:
date counts
0 2022-02-01 2.0
1 2022-02-02 0.0
2 2022-02-03 1.0
3 2022-02-04 1.0
4 2022-02-05 0.0
5 2022-02-06 0.0
6 2022-02-07 0.0
7 2022-02-08 0.0
8 2022-02-09 0.0
9 2022-02-10 0.0
10 2022-02-11 0.0
11 2022-02-12 0.0
12 2022-02-13 0.0
13 2022-02-14 0.0
14 2022-02-15 0.0
15 2022-02-16 0.0
16 2022-02-17 0.0
17 2022-02-18 0.0
18 2022-02-19 0.0
19 2022-02-20 0.0
20 2022-02-21 0.0
21 2022-02-22 0.0
22 2022-02-23 0.0
23 2022-02-24 0.0
24 2022-02-25 0.0
25 2022-02-26 0.0
26 2022-02-27 1.0
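Since asfreq is described above as a thin wrapper around reindex, the same table can also be built with reindex over an explicit date_range. A rough sketch, starting again from the original df and assuming its date column holds parseable date strings:
dates = pd.to_datetime(df['date'])
counts = dates.value_counts()  # rows per date
full_range = pd.date_range(counts.index.min(), counts.index.max(), freq='D')
out = (counts.reindex(full_range, fill_value=0)  # 0 for the missing days
             .rename_axis('date')
             .reset_index(name='counter'))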
There are many ways to do this. Here is mine. It is probably not optimal, but at least I am not iterating over rows, nor using .apply, which are both sure recipes for slow solutions.
import pandas as pd
import numpy as np
import datetime
# A minimal example (you should provide such an example next time)
df=pd.DataFrame({'date':pd.to_datetime(['2022-02-01', '2022-02-03', '2022-02-01', '2022-02-04', '2022-02-27']), 'c1':['val','val1','val','val2','val2'], 'c2':range(5)})
# A delta of 1 day, to create list of date
dt=datetime.timedelta(days=1)
# Result dataframe, with a count of 0 for now
res=pd.DataFrame({'date':df.date.min()+dt*np.arange((df.date.max()-df.date.min()).days+1), 'count':0})
# Count dates
countDates=df[['date', 'c1']].groupby('date').agg('count')
# Merge the counted dates with the target array, filling missing values with 0
res['count']=res.merge(countDates, on='date', how='left').fillna(0)['c1']
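As a side note, the date column of res can also be built without numpy, using pd.date_range; a small sketch equivalent to the line above:
res = pd.DataFrame({'date': pd.date_range(df.date.min(), df.date.max(), freq='D'),
                    'count': 0})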

Dataframe: calculate difference in dates column by another column

I'm trying to calculate a running difference on the date column depending on the "event" column.
That is, I want to add another column with the date difference between 1s in the event column (it contains only 0s and 1s).
So far I came up with this half-working, crappy solution.
Dataframe:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Code:
x = df.loc[df['event']==1, 'date']
k = 0
for i in range(len(x)):
    df.loc[k:x.index[i], 'duration'] = x.iloc[i] - k
    k = x.index[i]
But I'm sure there is a more elegant solution.
Thanks for any advice.
Output format:
+------+-------+----------+
| date | event | duration |
+------+-------+----------+
| 1 | 0 | 3 |
| 2 | 0 | 3 |
| 3 | 1 | 3 |
| 4 | 0 | 6 |
| 5 | 0 | 6 |
| 6 | 0 | 6 |
| 7 | 0 | 6 |
| 8 | 0 | 6 |
| 9 | 1 | 6 |
| 10 | 0 | 4 |
| 11 | 0 | 4 |
| 12 | 0 | 4 |
| 13 | 1 | 4 |
| 14 | 0 | 2 |
| 15 | 1 | 2 |
+------+-------+----------+
Using your initial dataframe:
import numpy as np
import pandas as pd

df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0],'duration':None})
Add an index-like column to mark where the transitions occur (you could also base this on the date column if it is unique):
df = df.reset_index().rename(columns={'index':'idx'})
df.loc[df['event']==0, 'idx'] = np.nan
df['idx'] = df['idx'].fillna(method='bfill')
Then, use a groupby() to count the records, and backfill them to match your structure:
df['duration'] = df.groupby('idx')['event'].count()
df['duration'] = df['duration'].fillna(method='bfill')
# Alternatively, the previous two lines can be combined as pointed out by OP
# df['duration'] = df.groupby('idx')['event'].transform('count')
df = df.drop(columns='idx')
print(df)
date event duration
0 1 0 2.0
1 2 1 2.0
2 3 0 3.0
3 4 0 3.0
4 5 1 3.0
5 6 0 5.0
6 7 0 5.0
7 8 0 5.0
8 9 0 5.0
9 10 1 5.0
10 11 0 6.0
11 12 0 6.0
12 13 0 6.0
13 14 0 6.0
14 15 0 6.0
15 16 1 6.0
16 17 0 NaN
It ends up as a float value because of the NaN in the last row. This approach works well in general if there are obvious "groups" of things to count.
As an alternative, because the dates are already there as integers you can look at the differences in dates directly:
df = pd.DataFrame({'date':[1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17],'event':[0,1,0,0,1,0,0,0,0,1,0,0,0,0,0,1,0]})
tmp = df[df['event']==1].copy()
tmp['duration'] = (tmp['date'] - tmp['date'].shift(1)).fillna(tmp['date'])  # gap since the previous event; the first event keeps its own date
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').fillna(method='bfill')
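A side note: on newer pandas versions (2.1+) fillna(method='bfill') raises a deprecation warning; if that applies to you, .bfill() does the same thing:
df = pd.merge(df, tmp[['date','duration']], on='date', how='left').bfill()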

Interpolate time series and resample/pivot. How to get the expected output

I have a df that looks like this:
Video | Start | End | Duration |
vid1 |2018-10-02 16:00:29 |2018-10-02 20:07:05 | 246 |
vid2 |2018-10-04 16:03:08 |2018-10-04 16:10:11 | 7 |
vid3 |2018-10-04 10:13:40 |2018-10-06 12:07:38 | 113 |
What I want to do is resample the dataframe into 10-minute intervals based on the Start column and assign 1 if the video was running during that timestamp and 0 if not.
The desired output is:
Start | vid1 | vid2 | vid3 |
2018-10-02 16:00:00| 1 | 0 | 0 |
2018-10-02 16:10:00| 1 | 0 | 0 |
...
2018-10-04 16:10:00| 0 | 1 | 0 |
2018-10-04 16:20:00| 0 | 0 | 1 |
The output is presented only to visualize the desired result, so it may contain errors.
The problem is that I cannot resample the dataframe in a way that produces the desired crosstab output.
Try this:
df.apply(lambda x: pd.Series(x['Video'],
                             index=pd.date_range(x['Start'].floor('10T'),
                                                 x['End'].ceil('10T'),
                                                 freq='10T')), axis=1)\
  .stack().str.get_dummies().reset_index(level=0, drop=True)
Output:
vid1 vid2 vid3
2018-10-02 16:00:00 1 0 0
2018-10-02 16:10:00 1 0 0
2018-10-02 16:20:00 1 0 0
2018-10-02 16:30:00 1 0 0
2018-10-02 16:40:00 1 0 0
... ... ... ...
2018-10-06 11:30:00 0 0 1
2018-10-06 11:40:00 0 0 1
2018-10-06 11:50:00 0 0 1
2018-10-06 12:00:00 0 0 1
2018-10-06 12:10:00 0 0 1
[330 rows x 3 columns]
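The chained expression above assumes Start and End are already datetime (floor and ceil need a datetime dtype); if they are still strings after loading, convert them first:
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])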

Calculate streak in pandas without apply

I have a DataFrame like this:
date | type | column1
----------------------------
2019-01-01 | A | 1
2019-02-01 | A | 1
2019-03-01 | A | 1
2019-04-01 | A | 0
2019-05-01 | A | 1
2019-06-01 | A | 1
2019-07-01 | B | 1
2019-08-01 | B | 1
2019-09-01 | B | 0
I want to have a column called "streak" that has a streak, but grouped by column "type":
date | type | column1 | streak
-------------------------------------
2019-01-01 | A | 1 | 1
2019-02-01 | A | 1 | 2
2019-03-01 | A | 1 | 3
2019-04-01 | A | 0 | 0
2019-05-01 | A | 1 | 1
2019-06-01 | A | 1 | 2
2019-07-01 | B | 1 | 1
2019-08-01 | B | 1 | 2
2019-09-01 | B | 0 | 0
I managed to do it like that:
def streak(df):
    grouper = (df.column1 != df.column1.shift(1)).cumsum()
    df['streak'] = df.groupby(grouper).cumsum()['column1']
    return df

df = df.groupby(['type']).apply(streak)
But I'm wondering if it's possible to do it inline without using a groupby and apply, because my DataFrame contains about 100M rows and it takes several hours to process.
Any ideas on how to optimize this for speed?
You want the cumsum of 'column1', grouping by 'type' together with the cumsum of a Boolean Series that starts a new group at every 0.
df['streak'] = df.groupby(['type', df.column1.eq(0).cumsum()]).column1.cumsum()
date type column1 streak
0 2019-01-01 A 1 1
1 2019-02-01 A 1 2
2 2019-03-01 A 1 3
3 2019-04-01 A 0 0
4 2019-05-01 A 1 1
5 2019-06-01 A 1 2
6 2019-07-01 B 1 1
7 2019-08-01 B 1 2
8 2019-09-01 B 0 0
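If readability matters, the same grouping key can be given a name before the cumsum; this sketch is equivalent to the one-liner above:
reset = df['column1'].eq(0).cumsum()   # a new group id every time a 0 appears
df['streak'] = df.groupby(['type', reset])['column1'].cumsum()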
IIUC, this is what you need.
m = df.column1.ne(df.column1.shift()).cumsum()
df['streak'] = df.groupby([m, 'type'])['column1'].cumsum()
Output
date type column1 streak
0 1/1/2019 A 1 1
1 2/1/2019 A 1 2
2 3/1/2019 A 1 3
3 4/1/2019 A 0 0
4 5/1/2019 A 1 1
5 6/1/2019 A 1 2
6 7/1/2019 B 1 1
7 8/1/2019 B 1 2
8 9/1/2019 B 0 0

Partition dataset by timestamp

I have a dataframe of millions of rows like so, with no duplicate time-ID stamps:
ID | Time | Activity
a | 1 | Bar
a | 3 | Bathroom
a | 2 | Bar
a | 4 | Bathroom
a | 5 | Outside
a | 6 | Bar
a | 7 | Bar
What's the most efficient way to convert it to this format?
ID | StartTime | EndTime | Location
a | 1 | 2 | Bar
a | 3 | 4 | Bathroom
a | 5 | N/A | Outside
a | 6 | 7 | Bar
I have to do this with a lot of data, so wondering how to speed up this process as much as possible.
I am using groupby:
(df.groupby(['ID','Activity']).Time.apply(list).apply(pd.Series)
   .rename(columns={0:'starttime',1:'endtime'}).reset_index())
Out[251]:
ID Activity starttime endtime
0 a Bar 1.0 2.0
1 a Bathroom 3.0 4.0
2 a Outside 5.0 NaN
Or using pivot_table
df.assign(I=df.groupby(['ID','Activity']).cumcount()).pivot_table(index=['ID','Activity'],columns='I',values='Time')
Out[258]:
I 0 1
ID Activity
a Bar 1.0 2.0
Bathroom 3.0 4.0
Outside 5.0 NaN
Update
(df.assign(I=df.groupby(['ID','Activity']).cumcount()//2)
   .groupby(['ID','Activity','I']).Time.apply(list).apply(pd.Series)
   .rename(columns={0:'starttime',1:'endtime'}).reset_index())
Out[282]:
ID Activity I starttime endtime
0 a Bar 0 1.0 2.0
1 a Bar 1 6.0 7.0
2 a Bathroom 0 3.0 4.0
3 a Outside 0 5.0 NaN
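The same pairing can also be expressed without apply(pd.Series), which tends to be slow on large frames. A sketch equivalent to the Update line above; the StartTime/EndTime names are chosen to match the question, not produced by the original code:
key = df.groupby(['ID', 'Activity']).cumcount()
out = (df.assign(I=key // 2, pos=key % 2)
         .pivot_table(index=['ID', 'Activity', 'I'], columns='pos', values='Time')
         .rename(columns={0: 'StartTime', 1: 'EndTime'})
         .reset_index())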
