I have a time period from 9:00 to 22:00, and I need to list all possible durations with a step of 30 minutes within this period, e.g.:
9:00 - 9:30
9:00 - 10:00
9:00 - 10:30
...
21:00 - 22:00
21:30 - 22:00
I've googled and found itertools.combinations() for numbers, but nothing comparable for dates.
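For what it's worth, itertools.combinations() works on datetime objects too, since they compare like numbers. A minimal sketch (the date part below is an arbitrary placeholder):
from datetime import datetime, timedelta
from itertools import combinations

# build the 30-minute grid from 9:00 to 22:00 (the date itself doesn't matter)
start = datetime(2020, 1, 1, 9, 0)
end = datetime(2020, 1, 1, 22, 0)
grid = []
t = start
while t <= end:
    grid.append(t)
    t += timedelta(minutes=30)

# every pair (a, b) with a before b is one candidate duration
for a, b in combinations(grid, 2):
    print(f"{a:%H:%M} - {b:%H:%M}")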
I have two DataFrames as shown below:
df1 =
temperature Mon_start Mon_end Tues_start Tues_end
cold 1:00 3:00 9:00 10:00
warm 7:00 8:00 16:00 20:00
hot 4:00 6:00 12:00 14:00
df2 =
sample1 data_value
A 2:00
A 7:30
B 18:00
B 9:45
I need to use the values in df2['data_value'] together with df1 to find out what day an experiment was performed and at what temperature. Essentially, I want to use df1 as a lookup table: check whether data_value falls between a given start and end time for some temperature, and if so, record the day in a new column called day. The output I've been trying to get is:
sample1 data_value day temperature
A 2:00 Mon cold
A 7:30 Mon warm
B 18:00 Tues warm
B 9:45 Tues cold
The actual DataFrame is quite long, so I defined a function and used np.vectorize() to speed things up, but I can't seem to get the mapping and the new columns defined correctly.
Or do I need to use a for-loop and check every combination of *_start and *_end?
Any help would be greatly appreciated!
If your data are valid, e.g. there is no row in df2 with a time like 3:30 that falls outside every interval, then you can use merge_asof:
import pandas as pd

# convert the data to timedelta so comparisons work correctly
for col in df1.columns[1:]:
    df1[col] = pd.to_timedelta(df1[col] + ':00')
df2['data_value'] = pd.to_timedelta(df2['data_value'] + ':00')

pd.merge_asof(df2.sort_values('data_value'),
              df1.melt('temperature', var_name='day').sort_values('value'),
              left_on='data_value', right_on='value')
Output:
sample1 data_value temperature day value
0 A 0 days 02:00:00 cold Mon_start 0 days 01:00:00
1 A 0 days 07:30:00 warm Mon_start 0 days 07:00:00
2 B 0 days 09:45:00 cold Tues_start 0 days 09:00:00
3 B 0 days 18:00:00 warm Tues_start 0 days 16:00:00
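If you only need the day label rather than Mon_start etc., a small cleanup sketch on the merge result (stored here in a hypothetical variable out):
out['day'] = out['day'].str.split('_').str[0]  # 'Mon_start' -> 'Mon'
out = out.drop(columns='value')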
I am working on a production analysis data set (shift-wise, Day/Night). The day shift is 7 AM-7 PM and the night shift is 7 PM-7 AM.
Sometimes a day or night shift can be divided into two or more portions (e.g. the 7 AM-7 PM day shift can be split into 7 AM-10 AM and 10 AM-7 PM).
If a shift is divided into two or more portions, I first need to check whether the Brand is the same across all of that shift's partitions.
If YES, set the start time to the beginning of the first partition and the end time to the end of the last partition.
For Production: take the total production across the shift partitions.
For RPM: take the average across the shift partitions.
If NO, keep the appropriate values for each Brand separately.
(For clarity, please check the expected output.)
Sample of the raw DataFrame:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 17:07 Day A 5 50
7/9/2020 17:07 7/9/2020 17:58 Day A 10 100
7/9/2020 17:58 7/9/2020 19:00 Day A 5 60
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/9/2020 22:40 Night B 5 20
7/9/2020 22:40 7/10/2020 7:00 Night B 5 30
7/10/2020 7:00 7/10/2020 18:27 Day C 15 20
7/10/2020 18:27 7/10/2020 19:00 Day C 5 40
Expected Output:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 19:00 Day A 20 70
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/10/2020 7:00 Night B 10 25
7/10/2020 7:00 7/10/2020 19:00 Day C 20 30
Thanks in advance.
Here's a suggestion:
Make sure the columns Start and End have datetime values (I've renamed end to End and shift to Shift :)):
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
Then
df['Day'] = df['Start'].dt.strftime('%Y-%m-%d')
df = (df.groupby(['Day', 'Shift', 'Brand'])
        .agg(Start=pd.NamedAgg(column='Start', aggfunc='min'),
             End=pd.NamedAgg(column='End', aggfunc='max'),
             Production=pd.NamedAgg(column='Production', aggfunc='sum'),
             RPM=pd.NamedAgg(column='RPM', aggfunc='mean'))
        .reset_index()[df.columns]
        .drop('Day', axis='columns'))
gives you
Start End Shift Brand Production RPM
0 2020-07-08 19:00:00 2020-07-09 07:00:00 Night A 10 50
1 2020-07-09 07:00:00 2020-07-09 19:00:00 Day A 20 70
2 2020-07-09 19:00:00 2020-07-09 21:30:00 Night A 2 10
3 2020-07-09 21:30:00 2020-07-10 07:00:00 Night B 10 25
4 2020-07-10 07:00:00 2020-07-10 19:00:00 Day C 20 30
which seems to be your desired output (if I'm not mistaken).
If you want to transform the columns Start and End back to strings with a format similar to the one you've given above (there's some additional padding):
df['Start'] = df['Start'].dt.strftime('%m/%d/%Y %H:%M')
df['End'] = df['End'].dt.strftime('%m/%d/%Y %H:%M')
I am creating a dictionary for 7 days, from 22 January to 29 January, but there are two different timestamps in one column for a single day. The column name is Last Update. For example, the values I want to combine are '1/25/2020 10:00 PM' and '1/25/2020 12:00 PM', which sit in the same column. Since 25 January is a Saturday, I want to map both of them to Saturday.
For understanding the column:
Last Update
0 1/22/2020 12:00
1 1/22/2020 12:00
2 1/22/2020 12:00
3 1/22/2020 12:00
4 1/22/2020 12:00
...
363 1/29/2020 21:00
364 1/29/2020 21:00
365 1/29/2020 21:00
366 1/29/2020 21:00
367 1/29/2020 21:00
This is how far I got:
day_map = {'1/22/2020 12:00': 'Wednesday', '1/23/20 12:00 PM': 'Thursday',
           '1/24/2020 12:00 PM': 'Friday', ...}
You just need to convert the column to datetime and use the pandas .dt accessor. In this case:
df["Last Update"] = pd.to_datetime(df["Last Update"])
df["Last Update"].dt.day_name()
# returns
0 Wednesday
1 Wednesday
2 Wednesday
3 Wednesday
4 Wednesday
Name: Last Update, dtype: object
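If the goal is to keep this as a new column (so both 1/25 timestamps end up labelled Saturday), a one-line sketch, assuming your frame is named df:
df['day'] = pd.to_datetime(df['Last Update']).dt.day_name()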
Date_Time Position Trade
7/16/2018 13:00 Long 1
7/16/2018 13:30 Flat 1
7/16/2018 14:00 Flat 1
7/16/2018 14:30 Long 2
7/16/2018 15:00 Long 2
7/16/2018 15:30 Long 2
7/16/2018 17:00 Short 3
7/16/2018 17:30 Short 3
7/16/2018 18:00 Short 3
7/16/2018 18:30 Short 3
7/16/2018 19:00 Short 3
7/16/2018 19:30 Long 4
7/16/2018 20:00 Long 4
7/16/2018 20:30 Long 4
7/16/2018 21:00 Long 4
7/16/2018 21:30 Short 5
7/16/2018 22:00 Short 5
7/16/2018 22:30 Short 5
7/16/2018 23:00 Short 5
7/16/2018 23:30 Short 5
7/17/2018 0:00 Short 5
7/17/2018 0:30 Short 5
7/17/2018 1:00 Short 5
7/17/2018 1:30 Short 5
7/17/2018 2:00 Short 5
7/17/2018 2:30 Long 6
I have a dataframe that looks like the above. I'm trying to create a function that returns a series grouped by the trades.
def compact_view(groupby):
    agg_dict = {'EntryTime': groupby.iloc[0, :].name,
                'Trade Type': groupby['Position'].iat[0],
                'Size': groupby['Size'].iat[0],
                }
    return pd.Series(agg_dict, index=['EntryTime', 'Trade Type', 'Size', 'ExitTime'])

compact_results = results.groupby(['Trades']).apply(compact_view)
I'm having trouble with the syntax for one of the series items.
I'd like an entry called 'ExitTime' in the dictionary in the compact_view function that returns the index value of the row just below the final 'Long' or 'Short' position within each set of trade numbers.
So the first one would be 7/16/2018 13:30, the second 7/16/2018 17:00, and so on.
Expected Results:
Trades EntryTime Trade Type Size ExitTime
0 7/16/2018 3:30 Flat 0
1 7/16/2018 13:00 Long 5 7/16/2018 13:30
2 7/16/2018 14:30 Long 5 7/16/2018 17:00
3 7/16/2018 17:00 Short -5 7/16/2018 19:30
4 7/16/2018 19:30 Long 5 7/16/2018 21:30
5 7/16/2018 21:30 Short -5 7/17/2018 2:30
6 7/17/2018 2:30 Long 5 7/17/2018 4:30
IIUC, within each Trade group you need to find the last index where Position is Long or Short and then grab the row below it.
There are a lot of things that can go wrong here, and I don't know how you want to handle them:
What happens if a Trade group never contains Long or Short? (Currently this will throw an IndexError.)
What do you want to do if the last row in your DataFrame is Long or Short?
You can add exception handling (like try/except) to deal with these cases separately. At least for your sample data, you can do something like:
ids = df.reset_index().groupby('Trade').apply(lambda x: x[x.Position.isin(['Long', 'Short'])].index[-1]+1)
df.reset_index().reindex(ids)['Date_Time']
Output:
1 2018-07-16 13:30:00
6 2018-07-16 17:00:00
11 2018-07-16 19:30:00
15 2018-07-16 21:30:00
25 2018-07-17 02:30:00
26 NaT
Name: Date_Time, dtype: datetime64[ns]
Now you can just join these back onto your aggregation result if needed. As you can see, my last value is NaT because there is no row after the last Long value for group 6 in your DataFrame.
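If it helps, a sketch of wiring these into the compact_view result from the question (assuming results and df are the same frame, so the Trade groups line up one-to-one with the rows of compact_results):
exit_times = df.reset_index().reindex(ids)['Date_Time']
compact_results['ExitTime'] = exit_times.to_numpy()  # positional: one exit time per trade group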
One safer way might be:
def next_id(x):
    try:
        return x[x.Position.isin(['Long', 'Short'])].index[-1] + 1
    except IndexError:
        pass

ids = df.reset_index().groupby('Trade').apply(next_id)
You can identify the last row in a block using pandas.DataFrame.drop_duplicates():
df.drop_duplicates(subset=['Position','Trade'],keep='last')
So to get the next row indices:
row_indices = [x + 1 for x in df.drop_duplicates(
    subset=['Position', 'Trade'], keep='last').index]
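Then, assuming Date_Time is a regular column and df has a default RangeIndex (so index + 1 really is the next row), the exit times can be pulled with reindex, which tolerates the +1 running past the last row:
df.reindex(row_indices)['Date_Time']  # NaN for a block ending at the last row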
I have a set of data that looks like this (3 columns). The date and time are in 1 column and the timezone is in another column.
location,time,zone
EASTERN HILLSBOROUGH,1/27/2015 12:00,EST-5
EASTERN HILLSBOROUGH,1/24/2015 7:00,EST-5
EASTERN HILLSBOROUGH,1/27/2015 6:00,EST-5
EASTERN HILLSBOROUGH,2/14/2015 8:00,EST-5
EASTERN HILLSBOROUGH,2/7/2015 22:00,EST-5
EASTERN HILLSBOROUGH,2/2/2015 2:00,EST-5
I'm using pandas in order to parse the date and time with its respective timezone. In read_csv I can do parse_dates = [[1,2]] which, according to the docs, combines the columns into 1 and parses them.
So now the new data looks like this (2 columns)
location,time_zone
EASTERN HILLSBOROUGH,1/27/2015 12:00 EST-5
EASTERN HILLSBOROUGH,1/24/2015 7:00 EST-5
EASTERN HILLSBOROUGH,1/27/2015 6:00 EST-5
EASTERN HILLSBOROUGH,2/14/2015 8:00 EST-5
EASTERN HILLSBOROUGH,2/7/2015 22:00 EST-5
EASTERN HILLSBOROUGH,2/2/2015 2:00 EST-5
However, if I type df['time_zone'].dtype I get dtype('O'), which isn't datetime-like, so I can't use the .dt accessor with it.
How else can I parse those two columns properly?
Not sure if this is what you want, but you could just read the data in (without any datetime parsing) and then use to_datetime (note that the new variable time_zone is 5 hours later than time):
df['time_zone'] = pd.to_datetime(df.time + df.zone)
location time zone time_zone
0 EASTERN HILLSBOROUGH 1/27/2015 12:00 EST-5 2015-01-27 17:00:00
1 EASTERN HILLSBOROUGH 1/24/2015 7:00 EST-5 2015-01-24 12:00:00
2 EASTERN HILLSBOROUGH 1/27/2015 6:00 EST-5 2015-01-27 11:00:00
3 EASTERN HILLSBOROUGH 2/14/2015 8:00 EST-5 2015-02-14 13:00:00
4 EASTERN HILLSBOROUGH 2/7/2015 22:00 EST-5 2015-02-08 03:00:00
5 EASTERN HILLSBOROUGH 2/2/2015 2:00 EST-5 2015-02-02 07:00:00
df.info()
location 6 non-null object
time 6 non-null object
zone 6 non-null object
time_zone 6 non-null datetime64[ns]
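As a possible follow-up, if you'd rather carry tz-aware UTC timestamps than the naive ones shown above:
df['time_zone'] = df['time_zone'].dt.tz_localize('UTC')  # values are already in UTC; this just labels them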
Per the pytz module:
The preferred way of dealing with times is to always work in UTC,
converting to localtime only when generating output to be read by
humans.
I don't believe your timezones are standard, which makes the conversion a little trickier. We should, however, be able to strip out the timezone offset and subtract it with datetime.timedelta to get the UTC time. This is a hack, and I wish I knew a better way.
I assume all times are recorded in their local timezones, so 1/27/2015 12:00 EST-5 would be 1/27/2015 17:00 UTC.
from pytz import utc
import datetime as dt
df = pd.read_csv('times.csv')
df['UTC_time'] = [utc.localize(t) - dt.timedelta(hours=int(h))
                  for t, h in zip(pd.to_datetime(df.time),
                                  df.zone.str.extract(r'(-?\d+)', expand=False))]
>>> df
location time zone UTC_time
0 EASTERN HILLSBOROUGH 1/27/2015 12:00 EST-5 2015-01-27 17:00:00+00:00
1 EASTERN HILLSBOROUGH 1/24/2015 7:00 EST-5 2015-01-24 12:00:00+00:00
2 EASTERN HILLSBOROUGH 1/27/2015 6:00 EST-5 2015-01-27 11:00:00+00:00
3 EASTERN HILLSBOROUGH 2/14/2015 8:00 EST-5 2015-02-14 13:00:00+00:00
4 EASTERN HILLSBOROUGH 2/7/2015 22:00 EST-5 2015-02-08 03:00:00+00:00
5 EASTERN HILLSBOROUGH 2/2/2015 2:00 EST-5 2015-02-02 07:00:00+00:00
Examining a single timestamp, you'll notice the timezone is set to UTC:
>>> df.UTC_time.iat[0]
Timestamp('2015-01-27 17:00:00+0000', tz='UTC')
>>> df.UTC_time.iat[0].tzname()
'UTC'
To display them in a different time zone:
fmt = '%Y-%m-%d %H:%M:%S %Z%z'
>>> [t.astimezone('EST').strftime(fmt) for t in df.UTC_time]
['2015-01-27 12:00:00 EST-0500',
'2015-01-24 07:00:00 EST-0500',
'2015-01-27 06:00:00 EST-0500',
'2015-02-14 08:00:00 EST-0500',
'2015-02-07 22:00:00 EST-0500',
'2015-02-02 02:00:00 EST-0500']
Here is a test. Let's change the timezones in df and see if alternative solutions still work:
df['zone'] = ['EST-5', 'CST-6', 'MST-7', 'GST10', 'PST-8', 'AKST-9']
df['UTC_time'] = [utc.localize(t) - dt.timedelta(hours=int(h))
                  for t, h in zip(pd.to_datetime(df.time),
                                  df.zone.str.extract(r'(-?\d+)', expand=False))]
>>> df
location time zone UTC_time
0 EASTERN HILLSBOROUGH 1/27/2015 12:00 EST-5 2015-01-27 17:00:00+00:00
1 EASTERN HILLSBOROUGH 1/24/2015 7:00 CST-6 2015-01-24 13:00:00+00:00
2 EASTERN HILLSBOROUGH 1/27/2015 6:00 MST-7 2015-01-27 13:00:00+00:00
3 EASTERN HILLSBOROUGH 2/14/2015 8:00 GST10 2015-02-13 22:00:00+00:00
4 EASTERN HILLSBOROUGH 2/7/2015 22:00 PST-8 2015-02-08 06:00:00+00:00
5 EASTERN HILLSBOROUGH 2/2/2015 2:00 AKST-9 2015-02-02 11:00:00+00:00
Check the Python docs for more details about working with time.
Here is a good SO question on the subject: How to make an unaware datetime timezone aware in python
And here is a link to the tz database timezones.