Pandas: time column addition and repeating all rows for a month

I'd like to expand my DataFrame by repeating every row for each hour of a month and adding a time column.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to do this?

Use a cross join via DataFrame.merge with a new DataFrame that holds every hour of the month, created by date_range:
df1 = pd.DataFrame({'a':1,
                    'time':pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
# the constant helper column 'a' pairs every row of df with every hour in df1
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
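For comparison, here is a minimal sketch of the same cross join without the helper column. It assumes pandas 1.2 or later, where merge accepts how='cross'; the names hours and out are chosen only for illustration.

```python
import pandas as pd

# Original frame from the question
df = pd.DataFrame({'money': [1, 4, 5], 'food': [2, 5, 7]})

# One row per hour of January 2020 (31 days * 24 hours = 744 rows)
hours = pd.DataFrame(
    {'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')}
)

# how='cross' pairs every row of df with every row of hours
out = df.merge(hours, how='cross')
print(out)
```

The result has 3 * 744 = 2232 rows, the same as the helper-column version.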

Related

Group rows by certain time period depending on other factors

What I start with is a large dataframe (more than a million entries) of this structure:
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
...
1 2021-02-01 00:00:00 0 ...
2 2020-01-15 00:05:00 0 ...
2 2020-03-10 00:07:00 0 ...
...
2 2021-05-22 00:00:00 1 ...
...
The rows are sorted by id and then by datetime, with no other ordering. The dataset is not complete (there is not data for every day, but there can be multiple entries for the same day).
For each row where indicator==1, I want to collect every row with the same id and a datetime that is at most 10 days earlier. All rows that are not within range of an indicator can be dropped. Ideally, the result is saved as a dataset of time series, each of which will later be used in a neural network. (There can be more than one indicator==1 case per id; the other values should be kept.)
An example for one id: I want to convert this
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
1 2020-01-17 00:13:00 0 ...
1 2020-01-20 00:05:00 0 ...
1 2020-03-10 00:07:00 0 ...
1 2020-05-19 00:00:00 0 ...
1 2020-05-20 00:00:00 1 ...
into this
id datetime group other_values ...
1 2020-01-14 00:12:00 A ...
1 2020-01-17 00:23:00 A ...
1 2020-01-17 00:13:00 A ...
1 2020-05-19 00:00:00 B ...
1 2020-05-20 00:00:00 B ...
or a similar way to label the groups A, B, ... .
A naive Python for-loop is not an option because it takes ages on a dataset of this size.
There is probably a clever way to use df.groupby('id'), df.groupby('id').agg(...), df.sort_values(...) or df.apply(), but I just do not see it.

Here is a way to do it with pd.merge_asof(). Let's create our data:
import pandas as pd
data = {'id': [1,1,1,1,1,1,1],
        'datetime': ['2020-01-14 00:12:00',
                     '2020-01-17 00:23:00',
                     '2020-01-17 00:13:00',
                     '2020-01-20 00:05:00',
                     '2020-03-10 00:07:00',
                     '2020-05-19 00:00:00',
                     '2020-05-20 00:00:00'],
        'ind': [0,1,0,0,0,0,1]
       }
df = pd.DataFrame(data)
# pd.to_datetime avoids the unit-less 'datetime64' cast, which pandas 2.x no longer accepts
df['datetime'] = pd.to_datetime(df['datetime'])
Data:
id datetime ind
0 1 2020-01-14 00:12:00 0
1 1 2020-01-17 00:23:00 1
2 1 2020-01-17 00:13:00 0
3 1 2020-01-20 00:05:00 0
4 1 2020-03-10 00:07:00 0
5 1 2020-05-19 00:00:00 0
6 1 2020-05-20 00:00:00 1
Next, let's add a date to the dataset and pull all dates where the indicator is 1.
df['date'] = df['datetime'].dt.normalize()
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)
Which gives us this:
df:
id datetime ind date
0 1 2020-01-14 00:12:00 0 2020-01-14
1 1 2020-01-17 00:23:00 1 2020-01-17
2 1 2020-01-17 00:13:00 0 2020-01-17
3 1 2020-01-20 00:05:00 0 2020-01-20
4 1 2020-03-10 00:07:00 0 2020-03-10
5 1 2020-05-19 00:00:00 0 2020-05-19
6 1 2020-05-20 00:00:00 1 2020-05-20
df2:
id date ind2
1 1 2020-01-17 1
6 1 2020-05-20 1
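Now let's join them using pd.merge_asof() with direction='forward' and a tolerance of 10 days. Each row is matched to the next indicator date of the same id if that date is at most 10 days ahead. Note that merge_asof requires both frames to be sorted by date. df keeps its ind column here so that it still appears in the output below.
df = pd.merge_asof(df, df2, by='id', on='date', tolerance=pd.Timedelta('10d'), direction='forward')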
Which gives us this:
id datetime ind date ind2
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0
Next, let's create the groups. A row starts a new group when it has a matched indicator (ind2 is not NaN) and at least one of the following holds:
The previous value of ind2 is NaN
The previous value of id differs from the current one (we are at the first row of a new id)
The current date is more than 10 days after the previous one
With these rules, we can build a Boolean column and cumulatively sum it to number the groups.
# comparing with == NaN is always False, so use isna() for the first rule
new_group = df['ind2'].notna() & (
    df['ind2'].shift().isna()
    | (df['id'].shift() != df['id'])
    | (df['date'] - df['date'].shift() > pd.Timedelta('10d'))
)
df['group_id'] = new_group.cumsum()
id datetime ind date ind2 group_id
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0 1
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0 1
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0 1
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN 1
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN 1
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0 2
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0 2
Now we need to drop all rows where ind2 is NaN and remove the helper columns date and ind2, and we're done.
df = df.dropna(subset=['ind2']).drop(['date', 'ind2'], axis=1)
Final output:
id datetime ind group_id
0 1 2020-01-14 00:12:00 0 1
1 1 2020-01-17 00:23:00 1 1
2 1 2020-01-17 00:13:00 0 1
5 1 2020-05-19 00:00:00 0 2
6 1 2020-05-20 00:00:00 1 2
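The steps above can be collected into one function. This is a sketch under a few assumptions rather than the answer's exact code: the function name group_before_indicator and the days parameter are made up for illustration, the frame is sorted by date before merge_asof (which requires a sorted key when several ids are interleaved), and it is re-sorted by id and date before the groups are numbered.

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1] * 7,
    'datetime': pd.to_datetime([
        '2020-01-14 00:12:00', '2020-01-17 00:23:00', '2020-01-17 00:13:00',
        '2020-01-20 00:05:00', '2020-03-10 00:07:00', '2020-05-19 00:00:00',
        '2020-05-20 00:00:00']),
    'ind': [0, 1, 0, 0, 0, 0, 1],
})

def group_before_indicator(df, days=10):
    """Keep rows at most `days` days before an indicator row and number the groups."""
    window = pd.Timedelta(days=days)
    df = df.assign(date=df['datetime'].dt.normalize())
    # merge_asof needs the `on` key sorted in both frames
    df = df.sort_values('date', kind='stable').reset_index(drop=True)
    events = df.loc[df['ind'] == 1, ['id', 'date']].drop_duplicates().assign(ind2=1)
    out = pd.merge_asof(df, events, on='date', by='id',
                        direction='forward', tolerance=window)
    # group numbering relies on rows being ordered by id, then date
    out = out.sort_values(['id', 'date'], kind='stable').reset_index(drop=True)
    new_group = out['ind2'].notna() & (
        out['ind2'].shift().isna()
        | (out['id'] != out['id'].shift())
        | (out['date'] - out['date'].shift() > window)
    )
    out['group_id'] = new_group.cumsum()
    return out.dropna(subset=['ind2']).drop(columns=['date', 'ind2'])

result = group_before_indicator(df)
print(result)
```

As in the walkthrough, two indicator rows of the same id that lie within 10 days of each other end up in the same group.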

I'm not aware of a way to do this with df.agg, but you can put your for loop inside the groupby using .apply(). That way, your comparisons/lookups can be done on smaller tables, then groupby will handle the re-concatenation:
import pandas as pd
import datetime
import uuid
df = pd.DataFrame({
"id": [1, 1, 1, 2, 2, 2],
"datetime": [
'2020-01-14 00:12:00',
'2020-01-17 00:23:00',
'2021-02-01 00:00:00',
'2020-01-15 00:05:00',
'2020-03-10 00:07:00',
'2021-05-22 00:00:00',
],
"indicator": [0, 1, 0, 0, 0, 1]
})
df.datetime = pd.to_datetime(df.datetime)
timedelta = datetime.timedelta(days=10)
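def consolidate(grp):
    grp = grp.copy()  # work on a copy so the original group is not modified
    grp['Group'] = None
    for time in grp[grp.indicator == 1]['datetime']:
        # .loc avoids chained assignment, which may silently fail to write
        grp.loc[grp['datetime'].between(time - timedelta, time), 'Group'] = str(uuid.uuid4())
    return grp.dropna(subset=['Group'])
df.groupby('id').apply(consolidate)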
If there are multiple rows with indicator == 1 in each id grouping, then the for loop will apply in index order (so a later group might overwrite an earlier group). If you can be certain that there is only one indicator == 1 in each grouping, we can simplify the consolidate function:
def consolidate(grp):
    time = grp[grp.indicator == 1]['datetime'].iloc[0]
    # copy the filtered slice so the new column is not written to a view
    grp = grp[grp['datetime'].between(time - timedelta, time)].copy()
    grp['Group'] = str(uuid.uuid4())
    return grp

pandas: evaluate if condition is met for consecutive data points in a given timeframe

I want to evaluate whether a given condition (e.g. a threshold) is met for a certain duration in a pandas DataFrame and set an output value accordingly.
E.g. set output to 1 if data > threshold for at least the next 45 min, and back to 0 once data <= threshold.
What works so far (for a threshold of 3 and a minimum duration of 45 min; the variable is spelled treshold in the code):
import pandas as pd
import random, math
df = pd.DataFrame({'dt': pd.date_range('2020-01-01 00:00:00','2020-01-01 06:00:00',freq='15min')})
treshold = 3
data = []
for i in range(0,df.size):
    n = random.randint(0, 10)
    data.append(n)
df['data'] = data
df = df.set_index('dt')
# .iloc[1] picks the second value by position (the first diff is NaT)
timestep = df.index.to_series().diff().dt.seconds.div(3600,fill_value=None).iloc[1]
min_duration_hours = 0.75
cell_range = math.ceil(min_duration_hours / timestep)
output = []
for i,e in enumerate(data):
    if i > (len(data) - cell_range):
        futures = data[i:len(data)]
    else:
        futures = data[i:i + cell_range]
    if i == 0:
        last = 0
    else:
        last = output[i-1]
    current = data[i]
    if (min(futures) > treshold or (last > 0 and current > treshold)):
        output.append(1)
    else:
        output.append(0)
df['output'] = output
result:
data output
dt
2020-01-01 00:00:00 1 0
2020-01-01 00:15:00 1 0
2020-01-01 00:30:00 5 1
2020-01-01 00:45:00 6 1
2020-01-01 01:00:00 7 1
2020-01-01 01:15:00 0 0
2020-01-01 01:30:00 4 0
2020-01-01 01:45:00 5 0
2020-01-01 02:00:00 0 0
2020-01-01 02:15:00 10 1
2020-01-01 02:30:00 5 1
2020-01-01 02:45:00 9 1
2020-01-01 03:00:00 6 1
2020-01-01 03:15:00 6 1
2020-01-01 03:30:00 4 1
2020-01-01 03:45:00 10 1
2020-01-01 04:00:00 6 1
2020-01-01 04:15:00 5 1
2020-01-01 04:30:00 0 0
2020-01-01 04:45:00 8 1
2020-01-01 05:00:00 9 1
2020-01-01 05:15:00 5 1
2020-01-01 05:30:00 9 1
2020-01-01 05:45:00 6 1
2020-01-01 06:00:00 3 0
However, I'm wondering if there is an easier (and more efficient) way to do this with python/pandas?

I found a solution which seems to work, using .rolling and .shift.
import pandas as pd
import numpy as np
import random, math
df = pd.DataFrame({'dt': pd.date_range('2020-01-01 00:00:00','2020-01-01 06:00:00',freq='15min')})
treshold = 3
data = []
for i in range(0,df.size):
    n = random.randint(0, 10)
    data.append(n)
df['data'] = data
df = df.set_index('dt')
# .iloc[1] picks the second value by position (the first diff is NaT)
timestep = df.index.to_series().diff().dt.seconds.div(3600,fill_value=None).iloc[1]
min_duration_hours = 0.75
cell_range = math.ceil(min_duration_hours / timestep)
df['above_treshold'] = np.where(df['data'] > treshold, df['data'], 0)
indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=cell_range)
rolling_min_fwd = df['above_treshold'].rolling(window=indexer).min()
rolling_min_bwd = df['above_treshold'].rolling(window=cell_range).min()
shifted_fwd = df['above_treshold'].shift(1)
shifted_bwd = df['above_treshold'].shift(-1)
start_condition = ((rolling_min_fwd > 0) & ((df['above_treshold'] - shifted_fwd) == df['above_treshold']))
stop_condition = ((rolling_min_bwd > 0) & ((df['above_treshold'] - shifted_bwd) == df['above_treshold']))
cycles = start_condition.cumsum()
idx = cycles == stop_condition.shift(1).cumsum()
cycles.loc[idx] = 0
df['output'] = np.where(cycles > 0, 1, 0)
resulting in:
data above_treshold output
dt
2020-01-01 00:00:00 8 8 0
2020-01-01 00:15:00 3 0 0
2020-01-01 00:30:00 3 0 0
2020-01-01 00:45:00 1 0 0
2020-01-01 01:00:00 3 0 0
2020-01-01 01:15:00 9 9 1
2020-01-01 01:30:00 4 4 1
2020-01-01 01:45:00 8 8 1
2020-01-01 02:00:00 6 6 1
2020-01-01 02:15:00 4 4 1
2020-01-01 02:30:00 6 6 1
2020-01-01 02:45:00 6 6 1
2020-01-01 03:00:00 1 0 0
2020-01-01 03:15:00 6 6 0
2020-01-01 03:30:00 7 7 0
2020-01-01 03:45:00 0 0 0
2020-01-01 04:00:00 2 0 0
2020-01-01 04:15:00 8 8 1
2020-01-01 04:30:00 8 8 1
2020-01-01 04:45:00 9 9 1
2020-01-01 05:00:00 1 0 0
2020-01-01 05:15:00 9 9 1
2020-01-01 05:30:00 10 10 1
2020-01-01 05:45:00 5 5 1
2020-01-01 06:00:00 8 8 1
I couldn't measure a significant performance difference (working on DataFrames with > 35k data points), but this still avoids iterating over each data point, even though it is less intuitive.
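Another way to express the same idea, offered as a sketch and not taken from either post: label each run of consecutive above-threshold samples and flag only the runs that contain at least cell_range points. The function name flag_sustained and its parameters are assumptions for illustration; the sample values are the ones from the question's result table. Unlike the question's loop, a run at the very end of the series that is shorter than cell_range is not flagged.

```python
import pandas as pd

def flag_sustained(data: pd.Series, threshold: float, min_points: int) -> pd.Series:
    """Return 1 where data is part of a run of >= min_points values above threshold."""
    above = data > threshold
    # a new run starts wherever the above/below state changes
    run_id = above.astype(int).diff().ne(0).cumsum()
    # length of the run each sample belongs to
    run_len = above.groupby(run_id).transform('size')
    return (above & (run_len >= min_points)).astype(int)

values = [1, 1, 5, 6, 7, 0, 4, 5, 0, 10, 5, 9, 6, 6, 4, 10, 6, 5, 0, 8, 9, 5, 9, 6, 3]
index = pd.date_range('2020-01-01 00:00:00', periods=len(values), freq='15min')
series = pd.Series(values, index=index, name='data')

# 45 minutes at a 15 minute timestep corresponds to 3 consecutive points
flags = flag_sustained(series, threshold=3, min_points=3)
print(pd.DataFrame({'data': series, 'output': flags}))
```

For this sample the flags match the output column shown in the question.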

Convert column of integers to time in HH:MM:SS format efficiently

I am trying to find a more efficient way to solve a problem. At the moment, the code below returns a time string for each specific integer value. However, the values follow a regular pattern, so a loop could make this process more efficient.
In the df below, integers represent time periods, and each increment of 1 equates to a 15 min period. So 1 == 8:00:00, 2 == 8:15:00, etc. At the moment I would repeat the if-statement for every time period up to the last one. If this goes up to 80, it becomes very inefficient. Could a loop be incorporated here?
import pandas as pd
d = ({
'Time' : [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6],
})
df = pd.DataFrame(data = d)
def time_period(row) :
    if row['Time'] == 1 :
        return '8:00:00'
    if row['Time'] == 2 :
        return '8:15:00'
    if row['Time'] == 3 :
        return '8:30:00'
    if row['Time'] == 4 :
        return '8:45:00'
    if row['Time'] == 5 :
        return '9:00:00'
    if row['Time'] == 6 :
        return '9:15:00'
    .....
    if row['Time'] == 80 :
        return '4:00:00'
df['24Hr Time'] = df.apply(lambda row: time_period(row), axis=1)
print(df)
Out:
Time 24Hr Time
0 1 8:00:00
1 1 8:00:00
2 1 8:00:00
3 2 8:15:00
4 2 8:15:00
5 2 8:15:00
6 3 8:30:00
7 3 8:30:00
8 3 8:30:00
9 4 8:45:00
10 4 8:45:00
11 4 8:45:00
12 5 9:00:00
13 5 9:00:00
14 5 9:00:00
15 6 9:15:00
16 6 9:15:00
17 6 9:15:00

This is possible with some simple timedelta arithmetic:
df['24Hr Time'] = (
pd.to_timedelta((df['Time'] - 1) * 15, unit='m') + pd.Timedelta(hours=8))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time timedelta64[ns]
dtype: object
If you need a string, use pd.to_datetime with unit and origin:
df['24Hr Time'] = (
pd.to_datetime((df['Time']-1) * 15, unit='m', origin='8:00:00')
.dt.strftime('%H:%M:%S'))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time object
dtype: object

In general, you would build a dictionary and apply it with map:
my_dict = {'old_val1': 'new_val1',...}
df['24Hr Time'] = df['Time'].map(my_dict)
But, in this case, you can do it with timedelta arithmetic:
df['24Hr Time'] = pd.to_timedelta(df['Time']*15, unit='min') + pd.to_timedelta('7:45:00')
Output (note that the new column is of type timedelta, not string):
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
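To make the dictionary route concrete, here is a small sketch that builds the lookup with a comprehension and applies it with map. The reference date 2000-01-01 and the name lookup are arbitrary choices for illustration, and the range of 1 to 80 follows the question.

```python
import pandas as pd

df = pd.DataFrame({'Time': [1, 1, 2, 3, 6]})

# Any date works as a reference; only the time of day is kept
start = pd.Timestamp('2000-01-01 08:00:00')

# Period k begins (k - 1) * 15 minutes after 08:00:00
lookup = {
    k: (start + pd.Timedelta(minutes=15 * (k - 1))).strftime('%H:%M:%S')
    for k in range(1, 81)
}

df['24Hr Time'] = df['Time'].map(lookup)
print(df)
```

Values of Time that are missing from the dictionary become NaN, which makes unexpected inputs easy to spot.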

I ended up using this:
pd.to_datetime((df.Time-1)*15*60+8*60*60,unit='s').dt.time
0 08:00:00
1 08:00:00
2 08:00:00
3 08:15:00
4 08:15:00
5 08:15:00
6 08:30:00
7 08:30:00
8 08:30:00
9 08:45:00
10 08:45:00
11 08:45:00
12 09:00:00
13 09:00:00
14 09:00:00
15 09:15:00
16 09:15:00
17 09:15:00
Name: Time, dtype: object

A fun way is to use pd.timedelta_range and Index.repeat (this assumes the Time column is sorted and its values are consecutive):
n = df.Time.nunique()
c = df.groupby('Time').size()
df['24_hr'] = pd.timedelta_range(start='8 hours', periods=n, freq='15min').repeat(c)
Out[380]:
Time 24_hr
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00

how to convert pd.to_timedelta() to time() object?

I need to convert 0 days 08:00:00 to 08:00:00.
Code:
import pandas as pd
df = pd.DataFrame({
'Slot_no':[1,2,3,4,5,6,7],
'start_time':['0:01:00','8:01:00','10:01:00','12:01:00','14:01:00','18:01:00','20:01:00'],
'end_time':['8:00:00','10:00:00','12:00:00','14:00:00','18:00:00','20:00:00','0:00:00'],
'location_type':['not considered','Food','Parks & Outdoors','Food',
'Arts & Entertainment','Parks & Outdoors','Food']})
df = df.reindex(columns=['Slot_no','start_time','end_time','location_type','loc_set'])
df['start_time'] = pd.to_timedelta(df['start_time'])
df['end_time'] = pd.to_timedelta(df['end_time'].replace('0:00:00', '24:00:00'))
output:
print (df)
Slot_no start_time end_time location_type loc_set
0 1 00:01:00 0 days 08:00:00 not considered NaN
1 2 08:01:00 0 days 10:00:00 Food NaN
2 3 10:01:00 0 days 12:00:00 Parks & Outdoors NaN
3 4 12:01:00 0 days 14:00:00 Food NaN
4 5 14:01:00 0 days 18:00:00 Arts & Entertainment NaN
5 6 18:01:00 0 days 20:00:00 Parks & Outdoors NaN
6 7 20:01:00 1 days 00:00:00 Food NaN

You can use to_datetime with dt.time:
df['end_time_times'] = pd.to_datetime(df['end_time']).dt.time
print (df)
Slot_no start_time end_time location_type loc_set \
0 1 00:01:00 0 days 08:00:00 not considered NaN
1 2 08:01:00 0 days 10:00:00 Food NaN
2 3 10:01:00 0 days 12:00:00 Parks & Outdoors NaN
3 4 12:01:00 0 days 14:00:00 Food NaN
4 5 14:01:00 0 days 18:00:00 Arts & Entertainment NaN
5 6 18:01:00 0 days 20:00:00 Parks & Outdoors NaN
6 7 20:01:00 1 days 00:00:00 Food NaN
end_time_times
0 08:00:00
1 10:00:00
2 12:00:00
3 14:00:00
4 18:00:00
5 20:00:00
6 00:00:00
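On recent pandas versions, passing a timedelta column directly to pd.to_datetime may raise a TypeError. A sketch that sidesteps this, assuming an arbitrary reference date of 1970-01-01: add the timedeltas to a Timestamp and then take dt.time. The variable names end_time and times are illustrative.

```python
import datetime
import pandas as pd

# Three of the end_time values from the question, with 0:00:00 already replaced by 24:00:00
end_time = pd.to_timedelta(pd.Series(['8:00:00', '10:00:00', '24:00:00']))

# Timestamp + timedelta Series gives a datetime Series; .dt.time drops the date part
times = (pd.Timestamp('1970-01-01') + end_time).dt.time
print(times)
```

A value of 1 days 00:00:00 wraps around to 00:00:00, the same as in the output above.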

Sort csv-data while reading, using pandas

I have a csv-file with entries like this:
1,2014 1 1 0 1,5
2,2014 1 1 0 1,5
3,2014 1 1 0 1,5
4,2014 1 1 0 1,6
5,2014 1 1 0 1,6
6,2014 1 1 0 1,12
7,2014 1 1 0 1,17
8,2014 5 7 1 5,4
The first column is the ID, the second is the arrival date (for the last entry: May 7, 1:05 a.m.) and the last column is the duration of work (in minutes).
Currently, I read in the data using pandas and the following function:
import pandas as pd
def convert_data(csv_path):
    store = pd.HDFStore(data_file)
    print('Loading CSV File')
    df = pd.read_csv(csv_path, parse_dates=True)
    print('CSV File Loaded, Converting Dates/Times')
    df['Arrival_time'] = map(convert_time, df['Arrival_time'])
    df['Rel_time'] = (df['Arrival_time'] - REF.timestamp)/60.0
    print('Conversion Complete')
    store['orders'] = df
My question is: How can I sort the entries according to their duration, but considering the arrival-date? So, I'd like to sort the csv-entries according to "arrival-date + duration". How is this possible?
Thanks for any hint! Best regards, Stan.

OK, the following shows how to convert the datetimes and then how to add the minutes:
In [79]:
df['Arrival_Date'] = pd.to_datetime(df['Arrival_Date'], format='%Y %m %d %H %M')
df
Out[79]:
ID Arrival_Date Duration
0 1 2014-01-01 00:01:00 5
1 2 2014-01-01 00:01:00 5
2 3 2014-01-01 00:01:00 5
3 4 2014-01-01 00:01:00 6
4 5 2014-01-01 00:01:00 6
5 6 2014-01-01 00:01:00 12
6 7 2014-01-01 00:01:00 17
7 8 2014-05-07 01:05:00 4
In [80]:
# Duration is in minutes; convert it to a timedelta and add it to the arrival time
df['Arrival_and_Duration'] = df['Arrival_Date'] + pd.to_timedelta(df['Duration'], unit='m')
df
Out[80]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
In [81]:
df.sort_values(by='Arrival_and_Duration')
Out[81]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
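Putting the pieces together, here is a self-contained sketch that reads the data, builds the combined column and sorts by it. It assumes the file has no header row and uses the column names from the answer above; the rows are the ones from the question, fed through io.StringIO in a deliberately shuffled order so that the sort has something to do.

```python
import io
import pandas as pd

# Rows from the question, with the last two moved to the top
csv_text = """8,2014 5 7 1 5,4
7,2014 1 1 0 1,17
1,2014 1 1 0 1,5
2,2014 1 1 0 1,5
3,2014 1 1 0 1,5
4,2014 1 1 0 1,6
5,2014 1 1 0 1,6
6,2014 1 1 0 1,12
"""

df = pd.read_csv(io.StringIO(csv_text), header=None,
                 names=['ID', 'Arrival_Date', 'Duration'])

# Parse "year month day hour minute" and add the duration in minutes
df['Arrival_Date'] = pd.to_datetime(df['Arrival_Date'], format='%Y %m %d %H %M')
df['Arrival_and_Duration'] = df['Arrival_Date'] + pd.to_timedelta(df['Duration'], unit='m')

# A stable sort keeps rows with equal keys in their original relative order
df = df.sort_values('Arrival_and_Duration', kind='stable').reset_index(drop=True)
print(df)
```

To read from a real file, the io.StringIO object can be replaced with the path to the CSV.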
