I am working on a production analysis dataset that is organised by shift (Day/Night). The Day shift runs 7 AM-7 PM and the Night shift runs 7 PM-7 AM.
Sometimes a day or night shift is split into two or more partitions (e.g. the 7 AM-7 PM Day shift can be recorded as 7 AM-10 AM and 10 AM-7 PM).
If a shift is split into two or more partitions, I first need to check whether the Brand is the same across all partitions of that shift.
If yes, set the start time to the start of the first partition and the end time to the end of the last partition.
For Production: take the total across the shift's partitions.
For RPM: take the average across the shift's partitions.
If no, compute these values separately for each Brand.
(See the expected output below for an example.)
Sample of the Raw dataframe:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 17:07 Day A 5 50
7/9/2020 17:07 7/9/2020 17:58 Day A 10 100
7/9/2020 17:58 7/9/2020 19:00 Day A 5 60
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/9/2020 22:40 Night B 5 20
7/9/2020 22:40 7/10/2020 7:00 Night B 5 30
7/10/2020 7:00 7/10/2020 18:27 Day C 15 20
7/10/2020 18:27 7/10/2020 19:00 Day C 5 40
Expected Output:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 19:00 Day A 20 70
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/10/2020 7:00 Night B 10 25
7/10/2020 7:00 7/10/2020 19:00 Day C 20 30
Thanks in advance.
Here's a suggestion:
Make sure the columns Start and End have datetime values (I've renamed end to End and shift to Shift :)):
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
Then
df['Day'] = df['Start'].dt.strftime('%Y-%m-%d')
df = (df.groupby(['Day', 'Shift', 'Brand'])
        .agg(Start = pd.NamedAgg(column='Start', aggfunc='min'),
             End = pd.NamedAgg(column='End', aggfunc='max'),
             Production = pd.NamedAgg(column='Production', aggfunc='sum'),
             RPM = pd.NamedAgg(column='RPM', aggfunc='mean'))
        .reset_index()[df.columns]
        .drop('Day', axis='columns'))
gives you
Start End Shift Brand Production RPM
0 2020-07-08 19:00:00 2020-07-09 07:00:00 Night A 10 50
1 2020-07-09 07:00:00 2020-07-09 19:00:00 Day A 20 70
2 2020-07-09 19:00:00 2020-07-09 21:30:00 Night A 2 10
3 2020-07-09 21:30:00 2020-07-10 07:00:00 Night B 10 25
4 2020-07-10 07:00:00 2020-07-10 19:00:00 Day C 20 30
which seems to be your desired output (if I'm not mistaken).
If you want to transform the columns Start and End back to string with a format similar to the one you've given above (there's some additional padding):
df['Start'] = df['Start'].dt.strftime('%m/%d/%Y %H:%M')
df['End'] = df['End'].dt.strftime('%m/%d/%Y %H:%M')
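One caveat with grouping on the calendar date of Start: a Night shift that is split after midnight would land in two different groups. A minimal sketch of one way around this, assuming the 7 AM / 7 PM shift boundaries from the question, is to shift Start back by 7 hours before taking the date. The sample rows and the ShiftDate name below are made up for illustration.

```python
import pandas as pd

# Hypothetical sample: a Night shift split at 01:00, i.e. after midnight.
df = pd.DataFrame({
    "Start": pd.to_datetime(["2020-07-09 19:00", "2020-07-10 01:00", "2020-07-10 07:00"]),
    "End": pd.to_datetime(["2020-07-10 01:00", "2020-07-10 07:00", "2020-07-10 19:00"]),
    "Shift": ["Night", "Night", "Day"],
    "Brand": ["A", "A", "C"],
    "Production": [4, 6, 20],
    "RPM": [10, 30, 40],
})

# Moving Start back by 7 hours maps every partition of a shift
# (07:00-19:00 or 19:00-07:00) onto the calendar date on which the shift began.
shift_date = (df["Start"] - pd.Timedelta(hours=7)).dt.normalize().rename("ShiftDate")

result = (
    df.groupby([shift_date, "Shift", "Brand"], sort=False)
      .agg(Start=("Start", "min"), End=("End", "max"),
           Production=("Production", "sum"), RPM=("RPM", "mean"))
      .reset_index()
      [["Start", "End", "Shift", "Brand", "Production", "RPM"]]
)
print(result)
```

Note that this still merges all partitions of one Brand within a shift, even if another Brand ran in between them.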
I have two dataFrames as shown below:
df1 =
temperature Mon_start Mon_end Tues_start Tues_end
cold 1:00 3:00 9:00 10:00
warm 7:00 8:00 16:00 20:00
hot 4:00 6:00 12:00 14:00
df2 =
sample1 data_value
A 2:00
A 7:30
B 18:00
B 9:45
I need to use the values in df2['data_value'] to find out on which day an experiment was performed and at what temperature, using df1 as a lookup table: check whether data_value falls between a given start and end time, and if so, record the matching temperature and assign the day to a new column called day. The output I've been trying to get is:
sample1 data_value day temperature
A 2:00 Mon cold
A 7:30 Mon warm
B 18:00 Tues warm
B 9:45 Tues cold
The actual dataFrame is quite long, so I defined a function and did np.vectorize() to speed it up, but can't seem to get the mapping and new columns defined correctly.
Or do I need to do a for-loop and check over every combination of *_start and *_end to do so?
Any help would be greatly appreciated!
If your data are valid, i.e. every value in df2 falls inside one of the intervals in df1 (no row such as 3:30, which lies between two intervals), then you can use merge_asof:
# convert data to timedelta so we can compare correctly
for col in df1.columns[1:]:
    df1[col] = pd.to_timedelta(df1[col]+':00')
df2['data_value'] = pd.to_timedelta(df2['data_value'] + ':00')
pd.merge_asof(df2.sort_values('data_value'),
df1.melt('temperature', var_name='day').sort_values('value'),
left_on='data_value', right_on='value')
Output:
sample1 data_value temperature day value
0 A 0 days 02:00:00 cold Mon_start 0 days 01:00:00
1 A 0 days 07:30:00 warm Mon_start 0 days 07:00:00
2 B 0 days 09:45:00 cold Tues_start 0 days 09:00:00
3 B 0 days 18:00:00 warm Tues_start 0 days 16:00:00
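The merge_asof result above matches on start times only and leaves the day as Mon_start / Tues_start. A possible follow-up, sketched under the assumption that column names always follow the <day>_start / <day>_end pattern, is to reshape df1 into one row per interval, then blank out any match whose time lies past the interval's end. The names intervals, left, t and out are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    "temperature": ["cold", "warm", "hot"],
    "Mon_start": ["1:00", "7:00", "4:00"],
    "Mon_end": ["3:00", "8:00", "6:00"],
    "Tues_start": ["9:00", "16:00", "12:00"],
    "Tues_end": ["10:00", "20:00", "14:00"],
})
df2 = pd.DataFrame({"sample1": ["A", "A", "B", "B"],
                    "data_value": ["2:00", "7:30", "18:00", "9:45"]})

# Reshape df1 to one row per (temperature, day) with start and end columns.
long = df1.melt("temperature", var_name="key", value_name="time")
long[["day", "edge"]] = long["key"].str.split("_", expand=True)
long["time"] = pd.to_timedelta(long["time"] + ":00")
intervals = (long.pivot(index=["temperature", "day"], columns="edge", values="time")
                 .reset_index()
                 .sort_values("start"))

# merge_asof needs both sides sorted on their keys.
left = df2.assign(t=pd.to_timedelta(df2["data_value"] + ":00")).sort_values("t")
out = pd.merge_asof(left, intervals, left_on="t", right_on="start")

# A time after the matched interval's end is not a real match.
out.loc[out["t"] > out["end"], ["day", "temperature"]] = None
out = out[["sample1", "data_value", "day", "temperature"]]
print(out)
```

The rows come back sorted by time rather than in the original df2 order.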
I am creating a dictionary for 7 days, from 22nd January to 29th. But on some days the column holds two different values. The column is named Last Update. The values I want to combine are '1/25/2020 10:00 PM' and '1/25/2020 12:00 PM', which are in the same column. 25 January is a Saturday, so I want both of them mapped to Saturday.
For understanding the column:
Last Update
0 1/22/2020 12:00
1 1/22/2020 12:00
2 1/22/2020 12:00
3 1/22/2020 12:00
4 1/22/2020 12:00
...
363 1/29/2020 21:00
364 1/29/2020 21:00
365 1/29/2020 21:00
366 1/29/2020 21:00
367 1/29/2020 21:00
This is as far as I got:
day_map = {'1/22/2020 12:00': 'Wednesday', '1/23/20 12:00 PM': 'Thursday',
'1/24/2020 12:00 PM': 'Friday', .?.?.
You just need to convert the column to datetime and use the pandas .dt accessor. In this case
df["Last Update"] = pd.to_datetime(df["Last Update"])
df["Last Update"].dt.day_name()  # .dt.weekday_name was removed in pandas 1.0
# returns
0 Wednesday
1 Wednesday
2 Wednesday
3 Wednesday
4 Wednesday
Name: Last Update, dtype: object
I have a dataframe with two datetime columns (one recording when a surgery is clocked in and the other when it is clocked out). For each row (i.e. case), I need to create a column of total time within business hours (07:00 - 17:30) and another column of total time outside of business hours. I am not sure of the best approach.
Reproducible segment of my dataframe:
Actual Room In DateTime Actual Room Out DateTime
0 2013-11-01 02:16 2013-11-01 04:35
1 2016-06-10 16:42 2016-06-10 19:28
2 2014-12-13 09:15 2014-12-13 10:55
3 2014-01-03 19:46 2014-01-03 22:54
4 2015-01-12 18:13 2015-01-12 19:58
5 2017-03-24 18:55 2017-03-24 19:57
6 2015-08-07 18:46 2015-08-07 19:42
7 2016-03-18 20:43 2016-03-19 00:40
8 2017-02-23 15:21 2017-02-23 17:35
9 2013-11-29 17:08 2013-11-29 17:42
10 2014-05-28 18:17 2014-05-28 19:12
11 2017-07-15 17:04 2017-07-15 18:19
12 2017-02-16 09:14 2017-02-16 21:29
13 2014-07-11 12:04 2014-07-11 17:40
14 2017-07-05 12:27 2017-07-05 20:08
15 2014-08-18 17:55 2014-08-18 19:50
16 2015-01-23 15:41 2015-01-23 19:41
17 2015-01-12 16:59 2015-01-12 17:49
18 2014-02-23 11:24 2014-02-23 15:06
19 2017-09-21 13:40 2017-09-21 18:11
pd.read_clipboard(sep=',')
The maximum amount of time between the two columns is:
df['Room Difference'] = df['Actual Room Out DateTime'] - df['Actual Room In DateTime']
max(df['Room Difference'])
Timedelta('1 days 01:17:00')
Which helps me think about the problem and the algorithm I want to write.
I guess it would go something like this (as pseudocode):
if 00:00:00 <= 'Actual Room In DateTime' < 07:00:00 and 00:00:00 <= 'Actual Room Out DateTime' < 07:00:00:
'After-hours' = 'Actual Room Out DateTime' - 'Actual Room In DateTime'
... to cover all the possible cases.
Is there an easier way or some sort of framework/tool for this exact kind of problem?
Subtract the In time from the business start time to get the outside hours before the day starts (count a negative result as zero).
Subtract the business end time from the Out time to get the outside hours after the day ends (again, count a negative result as zero).
Subtract the In time from the Out time to get the total hours of surgery.
Add the two outside-hours values to get the total outside hours.
Subtract the total outside hours from the total hours to get the total time within business hours.
Make a separate column for every calculation.
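The steps above can also be expressed as an interval overlap, which handles cases that lie entirely outside business hours or run past midnight. This is only a sketch: the business_time helper is hypothetical, it treats every calendar day (weekends included) as a business day, and it uses a few rows from the question's sample.

```python
import pandas as pd

BUSINESS_START = pd.Timedelta(hours=7)
BUSINESS_END = pd.Timedelta(hours=17, minutes=30)

def business_time(start, end):
    """Time between start and end that falls inside 07:00-17:30 on any day."""
    total = pd.Timedelta(0)
    day = start.normalize()
    while day <= end:
        window_start = day + BUSINESS_START
        window_end = day + BUSINESS_END
        # Overlap of [start, end] with this day's business window.
        overlap = min(end, window_end) - max(start, window_start)
        if overlap > pd.Timedelta(0):
            total += overlap
        day += pd.Timedelta(days=1)
    return total

df = pd.DataFrame({
    "Actual Room In DateTime": pd.to_datetime(
        ["2013-11-01 02:16", "2016-06-10 16:42", "2014-12-13 09:15", "2017-02-16 09:14"]),
    "Actual Room Out DateTime": pd.to_datetime(
        ["2013-11-01 04:35", "2016-06-10 19:28", "2014-12-13 10:55", "2017-02-16 21:29"]),
})

df["Business Hours"] = [
    business_time(s, e)
    for s, e in zip(df["Actual Room In DateTime"], df["Actual Room Out DateTime"])
]
# Whatever is not inside business hours is after-hours time.
df["After Hours"] = (df["Actual Room Out DateTime"]
                     - df["Actual Room In DateTime"]
                     - df["Business Hours"])
print(df[["Business Hours", "After Hours"]])
```

A plain Python loop is used here for readability; for a very long frame a vectorized clip of the start and end columns may be worth considering.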
I have formatted my data with pandas so that I have the number of orders placed in every 2-hour period over the past 3 months. I need the total number of orders placed in each timeslot, grouped by day of the week.
Converted OrderCount day_of_week
2/1/2019 0:00 2 Friday
2/1/2019 2:00 0 Friday
2/1/2019 4:00 0 Friday
2/1/2019 6:00 0 Friday
2/1/2019 8:00 0 Friday
2/1/2019 10:00 1 Friday
2/1/2019 12:00 2 Friday
2/1/2019 14:00 3 Friday
2/1/2019 16:00 5 Friday
2/2/2019 0:00 2 Saturday
2/2/2019 2:00 1 Saturday
2/2/2019 4:00 0 Saturday
2/2/2019 6:00 0 Saturday
2/2/2019 8:00 0 Saturday
Converted is my index, and the OrderCount column contains the number of orders in each 2-hour timeslot.
I have tried the following code:
df.groupby([df.index.hour, df.index.weekday]).count()
But this gives a totally different result.
What I want is the total number of orders placed in each timeslot for each day of the week.
Ex
Converted OrderCount day_of_week
2/1/2019 0:00 2 Friday
2/8/2019 0:00 5 Friday
2/2/2019 4:00 1 Saturday
2/9/2019 4:00 10 Saturday
The Output Should be
TimeSlot OrderCount day_of_week
0:00 7 Friday
4:00 11 Saturday
where the total 7 is (2+5) and 11 is (1+10).
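One likely reason the attempt gives a different result is that count() counts rows per group, whereas sum() adds up OrderCount. A minimal sketch using the four example rows, assuming Converted is a DatetimeIndex (the TimeSlot column and the totals name are introduced here for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"OrderCount": [2, 5, 1, 10],
     "day_of_week": ["Friday", "Friday", "Saturday", "Saturday"]},
    index=pd.to_datetime(["2/1/2019 0:00", "2/8/2019 0:00",
                          "2/2/2019 4:00", "2/9/2019 4:00"]),
)
df.index.name = "Converted"

totals = (
    df.assign(TimeSlot=df.index.strftime("%H:%M"))      # time of day only
      .groupby(["TimeSlot", "day_of_week"], as_index=False)["OrderCount"]
      .sum()                                            # sum, not count
)
print(totals)
```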
Date_Time Position Trade
7/16/2018 13:00 Long 1
7/16/2018 13:30 Flat 1
7/16/2018 14:00 Flat 1
7/16/2018 14:30 Long 2
7/16/2018 15:00 Long 2
7/16/2018 15:30 Long 2
7/16/2018 17:00 Short 3
7/16/2018 17:30 Short 3
7/16/2018 18:00 Short 3
7/16/2018 18:30 Short 3
7/16/2018 19:00 Short 3
7/16/2018 19:30 Long 4
7/16/2018 20:00 Long 4
7/16/2018 20:30 Long 4
7/16/2018 21:00 Long 4
7/16/2018 21:30 Short 5
7/16/2018 22:00 Short 5
7/16/2018 22:30 Short 5
7/16/2018 23:00 Short 5
7/16/2018 23:30 Short 5
7/17/2018 0:00 Short 5
7/17/2018 0:30 Short 5
7/17/2018 1:00 Short 5
7/17/2018 1:30 Short 5
7/17/2018 2:00 Short 5
7/17/2018 2:30 Long 6
I have a dataframe that looks like the above. I'm trying to create a function that returns a series grouped by the trades.
def compact_view(groupby):
    agg_dict = {'EntryTime': groupby.iloc[0, :].name,
                'Trade Type': groupby['Position'].iat[0],
                'Size': groupby['Size'].iat[0],
                }
    return pd.Series(agg_dict, index=['EntryTime', 'Trade Type', 'Size', 'ExitTime'])
compact_results = results.groupby(['Trades']).apply(compact_view)
I'm having trouble with the syntax for one of the series items.
I'd like to add an 'ExitTime' entry to the dictionary in the compact_view function that returns the index value of the row below the last occurrence of 'Long' or 'Short' within each set of trade numbers.
So the first one would be 7/16/2018 13:30, the second would be 7/16/2018 17:00, and so on.
Expected Results:
Trades EntryTime Trade Type Size ExitTime
0 7/16/2018 3:30 Flat 0
1 7/16/2018 13:00 Long 5 7/16/2018 13:30
2 7/16/2018 14:30 Long 5 7/16/2018 17:00
3 7/16/2018 17:00 Short -5 7/16/2018 19:30
4 7/16/2018 19:30 Long 5 7/16/2018 21:30
5 7/16/2018 21:30 Short -5 7/17/2018 2:30
6 7/17/2018 2:30 Long 5 7/17/2018 4:30
IIUC, within each Trade group you need to find the last index at which either Long or Short occurs and then grab the row below that.
There are a lot of things that can go wrong, and I don't know how you want to handle that.
What happens if a Trade group never contains Long or Short? (Currently this will throw an IndexError.)
What do you want to do if the last row in your DataFrame is Long or Short?
So you can add exception handling to deal with these cases separately (e.g. try and except). At least for your sample data, you can do something like:
ids = df.reset_index().groupby('Trade').apply(lambda x: x[x.Position.isin(['Long', 'Short'])].index[-1]+1)
df.reset_index().reindex(ids)['Date_Time']
Output:
1 2018-07-16 13:30:00
6 2018-07-16 17:00:00
11 2018-07-16 19:30:00
15 2018-07-16 21:30:00
25 2018-07-17 02:30:00
26 NaT
Name: Date_Time, dtype: datetime64[ns]
Now you can just join these to your aggregation result if needed. As you can see my last line is NaT because there is no row after the last Long value for group 6 in your DataFrame
One safer way might be:
def next_id(x):
    try:
        return x[x.Position.isin(['Long', 'Short'])].index[-1]+1
    except IndexError:
        pass
ids = df.reset_index().groupby('Trade').apply(next_id)
You can identify the last row in a block using pandas.DataFrame.drop_duplicates():
df.drop_duplicates(subset=['Position','Trade'],keep='last')
So to get the next row indices:
row_indices = [x+1 for x in df.drop_duplicates(
    subset=['Position','Trade'],keep='last').index]  # Index.get_values() was removed in pandas 1.0
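The drop_duplicates idea can also be combined with shift(-1) to build the entry and exit times in one pass, without index arithmetic. This is a sketch on a shortened, made-up version of the sample; the NextTime column and the compact frame are names chosen here, and the Size column is left out because it does not appear in the sample data.

```python
import pandas as pd

df = pd.DataFrame({
    "Date_Time": pd.to_datetime([
        "7/16/2018 13:00", "7/16/2018 13:30", "7/16/2018 14:00", "7/16/2018 14:30",
        "7/16/2018 15:00", "7/16/2018 17:00", "7/16/2018 17:30"]),
    "Position": ["Long", "Flat", "Flat", "Long", "Long", "Short", "Short"],
    "Trade": [1, 1, 1, 2, 2, 3, 3],
})

# Timestamp of the following row; NaT for the last row of the frame.
df["NextTime"] = df["Date_Time"].shift(-1)

# Keep only rows where a position is actually held.
active = df[df["Position"].isin(["Long", "Short"])]
entries = active.drop_duplicates("Trade", keep="first").set_index("Trade")
exits = active.drop_duplicates("Trade", keep="last").set_index("Trade")

compact = pd.DataFrame({
    "EntryTime": entries["Date_Time"],
    "Trade Type": entries["Position"],
    "ExitTime": exits["NextTime"],   # row below the last Long/Short row
})
print(compact)
```

A trade that is still open at the end of the frame gets NaT as its ExitTime, which mirrors the NaT in the output of the first answer.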