I need to convert 0 days 08:00:00 to 08:00:00 (a time of day).
Code:
import pandas as pd
df = pd.DataFrame({
'Slot_no':[1,2,3,4,5,6,7],
'start_time':['0:01:00','8:01:00','10:01:00','12:01:00','14:01:00','18:01:00','20:01:00'],
'end_time':['8:00:00','10:00:00','12:00:00','14:00:00','18:00:00','20:00:00','0:00:00'],
'location_type':['not considered','Food','Parks & Outdoors','Food',
'Arts & Entertainment','Parks & Outdoors','Food']})
df = df.reindex(columns=['Slot_no','start_time','end_time','location_type','loc_set'])
df['start_time'] = pd.to_timedelta(df['start_time'])
df['end_time'] = pd.to_timedelta(df['end_time'].replace('0:00:00', '24:00:00'))
Output:
print (df)
Slot_no start_time end_time location_type loc_set
0 1 00:01:00 0 days 08:00:00 not considered NaN
1 2 08:01:00 0 days 10:00:00 Food NaN
2 3 10:01:00 0 days 12:00:00 Parks & Outdoors NaN
3 4 12:01:00 0 days 14:00:00 Food NaN
4 5 14:01:00 0 days 18:00:00 Arts & Entertainment NaN
5 6 18:01:00 0 days 20:00:00 Parks & Outdoors NaN
6 7 20:01:00 1 days 00:00:00 Food NaN
You can use to_datetime with dt.time:
df['end_time_times'] = pd.to_datetime(df['end_time']).dt.time
print (df)
Slot_no start_time end_time location_type loc_set \
0 1 00:01:00 0 days 08:00:00 not considered NaN
1 2 08:01:00 0 days 10:00:00 Food NaN
2 3 10:01:00 0 days 12:00:00 Parks & Outdoors NaN
3 4 12:01:00 0 days 14:00:00 Food NaN
4 5 14:01:00 0 days 18:00:00 Arts & Entertainment NaN
5 6 18:01:00 0 days 20:00:00 Parks & Outdoors NaN
6 7 20:01:00 1 days 00:00:00 Food NaN
end_time_times
0 08:00:00
1 10:00:00
2 12:00:00
3 14:00:00
4 18:00:00
5 20:00:00
6 00:00:00
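An equivalent way (my addition, not from the original answer) is to anchor the timedeltas to a reference date and keep only the time-of-day component; the whole-day part of 1 days 00:00:00 falls away the same way:
# add the timedeltas to an arbitrary reference timestamp, then take the time of day
df['end_time_times'] = (pd.Timestamp('1970-01-01') + df['end_time']).dt.time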
Related
What I start with is a large dataframe (more than a million entries) of this structure:
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
...
1 2021-02-01 00:00:00 0 ...
2 2020-01-15 00:05:00 0 ...
2 2020-03-10 00:07:00 0 ...
...
2 2021-05-22 00:00:00 1 ...
...
There is no specific order other than a sort by id and then datetime. The dataset is not complete (there is no data for every day, but there can be multiple entries on the same day).
Now, for each row where indicator==1, I want to collect every row with the same id and a datetime that is at most 10 days earlier. All other rows that are not in range of an indicator can be dropped. Ideally the result is saved as a dataset of time series, each of which will later be fed to a neural network. (There can be more than one indicator==1 case per id; the other values should be kept.)
An example for one id: I want to convert this
id datetime indicator other_values ...
1 2020-01-14 00:12:00 0 ...
1 2020-01-17 00:23:00 1 ...
1 2020-01-17 00:13:00 0 ...
1 2020-01-20 00:05:00 0 ...
1 2020-03-10 00:07:00 0 ...
1 2020-05-19 00:00:00 0 ...
1 2020-05-20 00:00:00 1 ...
into this
id datetime group other_values ...
1 2020-01-14 00:12:00 A ...
1 2020-01-17 00:23:00 A ...
1 2020-01-17 00:13:00 A ...
1 2020-05-19 00:00:00 B ...
1 2020-05-20 00:00:00 B ...
or a similar way to group into group A, B, ... .
A naive Python for-loop is not feasible because it would take ages on a dataset like this.
There is probably a clever way to use df.groupby('id'), df.groupby('id').agg(...), df.sort_values(...) or df.apply(), but I just do not see it.
Here is a way to do it with pd.merge_asof(). Let's create our data:
data = {'id': [1,1,1,1,1,1,1],
'datetime': ['2020-01-14 00:12:00',
'2020-01-17 00:23:00',
'2020-01-17 00:13:00',
'2020-01-20 00:05:00',
'2020-03-10 00:07:00',
'2020-05-19 00:00:00',
'2020-05-20 00:00:00'],
'ind': [0,1,0,0,0,0,1]
}
df = pd.DataFrame(data)
df['datetime'] = pd.to_datetime(df['datetime'])
Data:
id datetime ind
0 1 2020-01-14 00:12:00 0
1 1 2020-01-17 00:23:00 1
2 1 2020-01-17 00:13:00 0
3 1 2020-01-20 00:05:00 0
4 1 2020-03-10 00:07:00 0
5 1 2020-05-19 00:00:00 0
6 1 2020-05-20 00:00:00 1
Next, let's add a date to the dataset and pull all dates where the indicator is 1.
df['date'] = df['datetime'].dt.normalize()  # datetime64 truncated to midnight
df2 = df.loc[df['ind'] == 1, ['id', 'date', 'ind']].rename({'ind': 'ind2'}, axis=1)
Which gives us this:
df:
id datetime ind date
0 1 2020-01-14 00:12:00 0 2020-01-14
1 1 2020-01-17 00:23:00 1 2020-01-17
2 1 2020-01-17 00:13:00 0 2020-01-17
3 1 2020-01-20 00:05:00 0 2020-01-20
4 1 2020-03-10 00:07:00 0 2020-03-10
5 1 2020-05-19 00:00:00 0 2020-05-19
6 1 2020-05-20 00:00:00 1 2020-05-20
df2:
id date ind2
1 1 2020-01-17 1
6 1 2020-05-20 1
Now let's join them using pd.merge_asof() with direction='forward' and a tolerance of 10 days. This attaches the nearest ind2 date up to 10 days ahead of each row; note that merge_asof() requires both frames to be sorted by the on key (they already are here).
df = pd.merge_asof(df, df2, by='id', on='date', tolerance=pd.Timedelta('10d'), direction='forward')
Which gives us this:
id datetime ind date ind2
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0
Next, let's work on creating groups. A new group starts at a row when any of three conditions holds relative to the previous row:
The previous value of ind2 is NaN
The previous row has a different id (we're at the first row of a new id)
The current date is more than 10 days after the previous date
With these rules we can build a Boolean mask and take its cumulative sum to number the groups.
df['group_id'] = df['ind2'].eq( df['ind2'].shift().isna()                                # previous ind2 is NaN
                              | (df['id'].shift() != df['id'])                            # a new id starts here
                              | (df['date'] - df['date'].shift() > pd.Timedelta('10d'))   # gap of more than 10 days
                              ).cumsum()
id datetime ind date ind2 group_id
0 1 2020-01-14 00:12:00 0 2020-01-14 1.0 1
1 1 2020-01-17 00:23:00 1 2020-01-17 1.0 1
2 1 2020-01-17 00:13:00 0 2020-01-17 1.0 1
3 1 2020-01-20 00:05:00 0 2020-01-20 NaN 1
4 1 2020-03-10 00:07:00 0 2020-03-10 NaN 1
5 1 2020-05-19 00:00:00 0 2020-05-19 1.0 2
6 1 2020-05-20 00:00:00 1 2020-05-20 1.0 2
Now we drop all rows where ind2 is NaN, remove the helper columns, and we're done.
df = df.dropna(subset=['ind2']).drop(['date', 'ind2'], axis=1)
Final output:
id datetime ind group_id
0 1 2020-01-14 00:12:00 0 1
1 1 2020-01-17 00:23:00 1 1
2 1 2020-01-17 00:13:00 0 1
5 1 2020-05-19 00:00:00 0 2
6 1 2020-05-20 00:00:00 1 2
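If you then want the separate time series the question asks for (my addition, not part of the original answer), each (id, group_id) pair can be split out with a plain groupby:
# one DataFrame per (id, group_id) pair, e.g. to feed a neural network later
series_list = [g for _, g in df.groupby(['id', 'group_id'])]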
I'm not aware of a way to do this with df.agg, but you can put your for loop inside the groupby using .apply(). That way, your comparisons/lookups can be done on smaller tables, then groupby will handle the re-concatenation:
import pandas as pd
import datetime
import uuid
df = pd.DataFrame({
"id": [1, 1, 1, 2, 2, 2],
"datetime": [
'2020-01-14 00:12:00',
'2020-01-17 00:23:00',
'2021-02-01 00:00:00',
'2020-01-15 00:05:00',
'2020-03-10 00:07:00',
'2021-05-22 00:00:00',
],
"indicator": [0, 1, 0, 0, 0, 1]
})
df.datetime = pd.to_datetime(df.datetime)
timedelta = datetime.timedelta(days=10)
def consolidate(grp):
    grp['Group'] = None
    for time in grp[grp.indicator == 1]['datetime']:
        # use .loc rather than chained indexing to avoid SettingWithCopy issues
        grp.loc[grp['datetime'].between(time - timedelta, time), 'Group'] = uuid.uuid4()
    return grp.dropna(subset=['Group'])
df.groupby('id').apply(consolidate)
If there are multiple rows with indicator == 1 in each id grouping, then the for loop will apply in index order (so a later group might overwrite an earlier group). If you can be certain that there is only one indicator == 1 in each grouping, we can simplify the consolidate function:
def consolidate(grp):
    time = grp[grp.indicator == 1]['datetime'].iloc[0]
    grp = grp[grp['datetime'].between(time - timedelta, time)].copy()
    grp['Group'] = uuid.uuid4()
    return grp
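Usage is the same as before, for example (my sketch, assuming exactly one indicator == 1 row per id):
result = df.groupby('id', group_keys=False).apply(consolidate)  # group_keys=False keeps a flat index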
Let's say the dataframe (df) consists of 3 columns.
V1 V2 V3
1 0 days 23:09:00 0 days 23:34:00
1 0 days 23:36:00 1 days 00:03:00
1 1 days 00:06:00 1 days 00:29:00
1 1 days 00:31:00 1 days 00:57:00
2 0 days 22:40:00 0 days 23:04:00
2 0 days 23:09:00 0 days 23:35:00
2 0 days 23:37:00 1 days 00:01:00
2 1 days 00:06:00 1 days 00:30:00
2 1 days 00:33:00 1 days 00:56:00
3 0 days 22:50:00 0 days 23:21:09
3 0 days 23:38:56 1 days 00:09:00
3 1 days 00:12:00 1 days 00:42:09
I have used the following code:
df['V4']=(df.groupby('V1')['V3'] - df.groupby('V1')['V2'].shift(1)).astype('timedelta64[m]')
Essentially, I want to perform the operation for each unique value in V1, and the result should look like this:
V1 V2 V3 V4
1 0 days 23:09:00 0 days 23:34:00 NaN
1 0 days 23:36:00 1 days 00:03:00 54
1 1 days 00:06:00 1 days 00:29:00 53
1 1 days 00:31:00 1 days 00:57:00 51
2 0 days 22:40:00 0 days 23:04:00 NaN
2 0 days 23:09:00 0 days 23:35:00 55
2 0 days 23:37:00 1 days 00:01:00 52
2 1 days 00:06:00 1 days 00:30:00 53
2 1 days 00:33:00 1 days 00:56:00 50
3 0 days 22:50:00 0 days 23:21:09 NaN
3 0 days 23:38:56 1 days 00:09:00 79
3 1 days 00:12:00 1 days 00:42:09 63
Error received:
Cannot add/subtract non-tick DateOffset to TimedeltaArray
Datatypes:
{'V1': {1: 1, 2: 2, 3: 3}, 'V2': {0: Timedelta('0 days 23:09:00'), 1: Timedelta('0 days 23:36:00')}, 'V3': {0: Timedelta('0 days 23:34:00'), 1: Timedelta('1 days 00:03:00')}, 'V4': {0: 54, 1: 53}}
Try this:
Do the subtraction on all rows.
Set the value to NaN wherever there is a change in V1.
df = df.sort_values(["V1", "V2", "V3"])
df["V4"] = (df["V3"]-df["V2"].shift()).dt.seconds//60
df["V4"] = df["V4"].where(df["V1"]==df["V1"].shift())
>>> df
V1 V2 V3 V4
0 1 0 days 23:09:00 0 days 23:34:00 NaN
1 1 0 days 23:36:00 1 days 00:03:00 54.0
2 1 1 days 00:06:00 1 days 00:29:00 53.0
3 1 1 days 00:31:00 1 days 00:57:00 51.0
4 2 0 days 22:40:00 0 days 23:04:00 NaN
5 2 0 days 23:09:00 0 days 23:35:00 55.0
6 2 0 days 23:37:00 1 days 00:01:00 52.0
7 2 1 days 00:06:00 1 days 00:30:00 53.0
8 2 1 days 00:33:00 1 days 00:56:00 50.0
9 3 0 days 22:50:00 0 days 23:21:09 NaN
10 3 0 days 23:38:56 1 days 00:09:00 79.0
11 3 1 days 00:12:00 1 days 00:42:09 63.0
If you want to use groupby:
df["V4"] = df.groupby("V1").apply(lambda x: (x["V3"]-x["V2"].shift()).dt.seconds//60).reset_index(drop=True)
>>> df
V1 V2 V3 V4
0 1 0 days 23:09:00 0 days 23:34:00 NaN
1 1 0 days 23:36:00 1 days 00:03:00 54.0
2 1 1 days 00:06:00 1 days 00:29:00 53.0
3 1 1 days 00:31:00 1 days 00:57:00 51.0
4 2 0 days 22:40:00 0 days 23:04:00 NaN
5 2 0 days 23:09:00 0 days 23:35:00 55.0
6 2 0 days 23:37:00 1 days 00:01:00 52.0
7 2 1 days 00:06:00 1 days 00:30:00 53.0
8 2 1 days 00:33:00 1 days 00:56:00 50.0
9 3 0 days 22:50:00 0 days 23:21:09 NaN
10 3 0 days 23:38:56 1 days 00:09:00 79.0
11 3 1 days 00:12:00 1 days 00:42:09 63.0
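For what it's worth (my addition, not from the original answer), the apply can probably be avoided by shifting within groups directly, which gives the same result on this data:
# shift V2 within each V1 group; the NaT at each group's first row becomes NaN minutes
df["V4"] = (df["V3"] - df.groupby("V1")["V2"].shift()).dt.total_seconds() // 60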
Input:
df = pd.DataFrame({"V1": [1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3],
"V2": [pd.Timedelta("0 days 23:09:00"), pd.Timedelta("0 days 23:36:00"), pd.Timedelta("1 days 00:06:00"), pd.Timedelta("1 days 00:31:00"),
pd.Timedelta("0 days 22:40:00"), pd.Timedelta("0 days 23:09:00"), pd.Timedelta("0 days 23:37:00"), pd.Timedelta("1 days 00:06:00"),
pd.Timedelta("1 days 00:33:00"), pd.Timedelta("0 days 22:50:00"), pd.Timedelta("0 days 23:38:56"), pd.Timedelta("1 days 00:12:00")],
"V3":[pd.Timedelta("0 days 23:34:00"), pd.Timedelta("1 days 00:03:00"), pd.Timedelta("1 days 00:29:00"), pd.Timedelta("1 days 00:57:00"),
pd.Timedelta("0 days 23:04:00"), pd.Timedelta("0 days 23:35:00"), pd.Timedelta("1 days 00:01:00"), pd.Timedelta("1 days 00:30:00"),
pd.Timedelta("1 days 00:56:00"), pd.Timedelta("0 days 23:21:09"), pd.Timedelta("1 days 00:09:00"), pd.Timedelta("1 days 00:42:09")]
})
>>> df
V1 V2 V3
0 1 0 days 23:09:00 0 days 23:34:00
1 1 0 days 23:36:00 1 days 00:03:00
2 1 1 days 00:06:00 1 days 00:29:00
3 1 1 days 00:31:00 1 days 00:57:00
4 2 0 days 22:40:00 0 days 23:04:00
5 2 0 days 23:09:00 0 days 23:35:00
6 2 0 days 23:37:00 1 days 00:01:00
7 2 1 days 00:06:00 1 days 00:30:00
8 2 1 days 00:33:00 1 days 00:56:00
9 3 0 days 22:50:00 0 days 23:21:09
10 3 0 days 23:38:56 1 days 00:09:00
11 3 1 days 00:12:00 1 days 00:42:09
I'd like to change my dataframe by adding time intervals for every hour of a month.
Original df
money food
0 1 2
1 4 5
2 5 7
Output:
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
...
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
where 2231 = out_rows_number-1 = month_days_number*hours_per_day*orig_rows_number - 1
What is the proper way to perform it?
Use a cross join via DataFrame.merge with a new DataFrame of all hours in the month, created by date_range:
df1 = pd.DataFrame({'a':1,
'time':pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.assign(a=1).merge(df1, on='a', how='outer').drop('a', axis=1)
print (df)
money food time
0 1 2 2020-01-01 00:00:00
1 1 2 2020-01-01 01:00:00
2 1 2 2020-01-01 02:00:00
3 1 2 2020-01-01 03:00:00
4 1 2 2020-01-01 04:00:00
... ... ...
2227 5 7 2020-01-31 19:00:00
2228 5 7 2020-01-31 20:00:00
2229 5 7 2020-01-31 21:00:00
2230 5 7 2020-01-31 22:00:00
2231 5 7 2020-01-31 23:00:00
[2232 rows x 3 columns]
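On pandas 1.2 and later the helper column a can be skipped, since merge supports a cross join directly (same idea, a sketch):
hours = pd.DataFrame({'time': pd.date_range('2020-01-01', '2020-01-31 23:00:00', freq='h')})
df = df.merge(hours, how='cross')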
I am trying to develop a more efficient approach to a problem. At the moment, the code below assigns a time string when a row matches a specific value. The values follow an identical order, so something better than listing every case should be possible.
Using the df below as an example, integers represent time periods and each increase of 1 equates to a 15 minute step, so 1 == 8:00:00, 2 == 8:15:00, etc. At the moment I repeat this pattern up to the last time period; if that goes up to 80, it becomes very inefficient. Could a loop be incorporated here?
import pandas as pd
d = ({
'Time' : [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6],
})
df = pd.DataFrame(data = d)
def time_period(row):
    if row['Time'] == 1:
        return '8:00:00'
    if row['Time'] == 2:
        return '8:15:00'
    if row['Time'] == 3:
        return '8:30:00'
    if row['Time'] == 4:
        return '8:45:00'
    if row['Time'] == 5:
        return '9:00:00'
    if row['Time'] == 6:
        return '9:15:00'
    .....
    if row['Time'] == 80:
        return '4:00:00'

df['24Hr Time'] = df.apply(lambda row: time_period(row), axis=1)
print(df)
Out:
Time 24Hr Time
0 1 8:00:00
1 1 8:00:00
2 1 8:00:00
3 2 8:15:00
4 2 8:15:00
5 2 8:15:00
6 3 8:30:00
7 3 8:30:00
8 3 8:30:00
9 4 8:45:00
10 4 8:45:00
11 4 8:45:00
12 5 9:00:00
13 5 9:00:00
14 5 9:00:00
15 6 9:15:00
16 6 9:15:00
17 6 9:15:00
This is possible with some simple timedelta arithmetic:
df['24Hr Time'] = (
pd.to_timedelta((df['Time'] - 1) * 15, unit='m') + pd.Timedelta(hours=8))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time timedelta64[ns]
dtype: object
If you need a string, use pd.to_datetime with unit and origin:
df['24Hr Time'] = (
pd.to_datetime((df['Time']-1) * 15, unit='m', origin='8:00:00')
.dt.strftime('%H:%M:%S'))
df.head()
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
df.dtypes
Time int64
24Hr Time object
dtype: object
In general, you want to make a dictionary and map it:
my_dict = {'old_val1': 'new_val1',...}
df['24Hr Time'] = df['Time'].map(my_dict)
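With 80 slots, the dictionary itself can be generated rather than typed out (a sketch, assuming slot 1 is 08:00:00 and each slot adds 15 minutes):
# build the slot -> time-string mapping programmatically
my_dict = {i: (pd.Timestamp('2000-01-01 08:00')
               + pd.Timedelta(minutes=15 * (i - 1))).strftime('%H:%M:%S')
           for i in range(1, 81)}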
But, in this case, you can do it with a timedelta:
df['24Hr Time'] = pd.to_timedelta(df['Time'] * 15, unit='m') + pd.to_timedelta('7:45:00')
Output (note that the new column is of type timedelta, not string)
Time 24Hr Time
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
I ended up using this:
pd.to_datetime((df.Time-1)*15*60+8*60*60,unit='s').dt.time
0 08:00:00
1 08:00:00
2 08:00:00
3 08:15:00
4 08:15:00
5 08:15:00
6 08:30:00
7 08:30:00
8 08:30:00
9 08:45:00
10 08:45:00
11 08:45:00
12 09:00:00
13 09:00:00
14 09:00:00
15 09:15:00
16 09:15:00
17 09:15:00
Name: Time, dtype: object
A fun way is using pd.timedelta_range and index.repeat
n = df.Time.nunique()
c = df.groupby('Time').size()
df['24_hr'] = pd.timedelta_range(start='8 hours', periods=n, freq='15min').repeat(c)
Out[380]:
Time 24_hr
0 1 08:00:00
1 1 08:00:00
2 1 08:00:00
3 2 08:15:00
4 2 08:15:00
5 2 08:15:00
6 3 08:30:00
7 3 08:30:00
8 3 08:30:00
9 4 08:45:00
10 4 08:45:00
11 4 08:45:00
12 5 09:00:00
13 5 09:00:00
14 5 09:00:00
15 6 09:15:00
16 6 09:15:00
17 6 09:15:00
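One caveat (my note, not in the original answer): the repeated values are assigned purely by position, so this assumes df is already sorted by Time; if it might not be, sort first:
df = df.sort_values('Time').reset_index(drop=True)  # keep equal Time values contiguous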
I have a DataFrame which looks like this:
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
Date
2013-05-07 112.25000 114.00000 97.52 109.60013 139626.724860 14898971.673747 106.705731
2013-05-08 109.60013 116.77700 109.50 113.20000 61680.324704 6990518.957611 113.334665
2013-05-09 113.20000 113.71852 108.80 112.79900 26894.458204 3003068.410660 111.661235
2013-05-10 112.79900 122.50000 111.54 117.70000 77443.672681 9140709.083964 118.030418
2013-05-11 117.70000 118.74000 113.00 113.47000 25532.277740 2952016.798507 115.619015
I'm looking for a way to transform this kind of data to
index open
index+1 low
index+2 high
index+3 open
index+4 low
index+5 high
so in my sample it should look like
Date
2013-05-07 00:00 112.25000
2013-05-07 08:00 97.52
2013-05-07 16:00 114.00000
2013-05-08 00:00 109.60013
2013-05-08 08:00 109.50
2013-05-08 16:00 116.77700
...
My first idea is to resample the DataFrame, but my first problem is that when I do
df2 = df.resample('8H').mean()
I get
Open High Low Close Volume (BTC) Volume (Currency) Weighted Price
2013-05-07 00:00:00 112.25000 114.00000 97.52000 109.60013 139626.724860 14898971.673747 106.705731
2013-05-07 08:00:00 NaN NaN NaN NaN NaN NaN NaN
2013-05-07 16:00:00 NaN NaN NaN NaN NaN NaN NaN
2013-05-08 00:00:00 109.60013 116.77700 109.50000 113.20000 61680.324704 6990518.957611 113.334665
2013-05-08 08:00:00 NaN NaN NaN NaN NaN NaN NaN
2013-05-08 16:00:00 NaN NaN NaN NaN NaN NaN NaN
2013-05-09 00:00:00 113.20000 113.71852 108.80000 112.79900 26894.458204 3003068.410660 111.661235
...
I now need to build a column of modulo-3 values, like this:
ModCol
2013-05-07 00:00:00 0
2013-05-07 08:00:00 1
2013-05-07 16:00:00 2
2013-05-08 00:00:00 0
2013-05-08 08:00:00 1
2013-05-08 16:00:00 2
2013-05-09 00:00:00 0
...
so I can use np.where to build the price column
(open if Mod==0, low if Mod==1 and high if Mod==2).
My problem is that I don't know how to build the ModCol column.
Here's how to create a mod column:
In [1]: Series(range(10))
Out[1]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
In [2]: Series(range(10)) % 3
Out[2]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
9 0
dtype: int64
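Putting it together for the resampled frame (a sketch of the approach described in the question, not from the original answer; np.select stands in for nested np.where):
import numpy as np

df2 = df.resample('8H').mean().ffill()  # each day's OHLC repeated over 00:00, 08:00, 16:00
mod = pd.Series(range(len(df2)), index=df2.index) % 3
df2['price'] = np.select([mod == 0, mod == 1, mod == 2],
                         [df2['Open'], df2['Low'], df2['High']])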