I have two DataFrames as shown below:
df1 =
temperature Mon_start Mon_end Tues_start Tues_end
cold 1:00 3:00 9:00 10:00
warm 7:00 8:00 16:00 20:00
hot 4:00 6:00 12:00 14:00
df2 =
sample1 data_value
A 2:00
A 7:30
B 18:00
B 9:45
I need to use the values in df2['data_value'] to find out what day an experiment was performed and at what temperature, using df1 as a lookup table: check whether data_value falls between a given start and end time for some temperature, and if so, record the matching day in a new column called day (and the temperature likewise). The output I've been trying to get is:
sample1 data_value day temperature
A 2:00 Mon cold
A 7:30 Mon warm
B 18:00 Tues warm
B 9:45 Tues cold
The actual DataFrame is quite long, so I defined a function and used np.vectorize() to speed it up, but I can't seem to get the mapping and the new columns defined correctly.
Or do I need to do a for-loop and check over every combination of *_start and *_end to do so?
Any help would be greatly appreciated!
If your data are valid, i.e. every data_value actually falls inside one of the intervals (no row in df2 with e.g. 3:30, which lies between cold's Mon_end and hot's Mon_start), then you can use merge_asof:
# convert the time strings to timedelta so we can compare them correctly
for col in df1.columns[1:]:
    df1[col] = pd.to_timedelta(df1[col] + ':00')
df2['data_value'] = pd.to_timedelta(df2['data_value'] + ':00')

pd.merge_asof(df2.sort_values('data_value'),
              df1.melt('temperature', var_name='day').sort_values('value'),
              left_on='data_value', right_on='value')
Output:
sample1 data_value temperature day value
0 A 0 days 02:00:00 cold Mon_start 0 days 01:00:00
1 A 0 days 07:30:00 warm Mon_start 0 days 07:00:00
2 B 0 days 09:45:00 cold Tues_start 0 days 09:00:00
3 B 0 days 18:00:00 warm Tues_start 0 days 16:00:00
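The day column here still carries the _start suffix from the melt, and the helper value column remains. A small hedged cleanup (assuming the merged result above is assigned to a variable, here called out) could be:
out = pd.merge_asof(df2.sort_values('data_value'),
                    df1.melt('temperature', var_name='day').sort_values('value'),
                    left_on='data_value', right_on='value')
# keep only the day name ('Mon_start' -> 'Mon') and drop the helper column
out['day'] = out['day'].str.split('_').str[0]
out = out.drop(columns='value')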
I have a pandas Series s, and I would like to extract the Monday before the third Friday.
With the help of the answer in the following link, I can resample to the third Friday, but I am still not sure how to get the Monday just before it:
pandas resample to specific weekday in month
from pandas.tseries.offsets import WeekOfMonth
s.resample(rule=WeekOfMonth(week=2, weekday=4)).bfill().asfreq(freq='D').dropna()
Any help is welcome
Many thanks
For each source date, compute your "wanted" date in 3 steps:
Shift back to the first day of the current month.
Shift forward to the Friday of the third week.
Shift back 4 days (from Friday to Monday).
For a Series containing dates, the code to do it is:
s.dt.to_period('M').dt.to_timestamp() + pd.offsets.WeekOfMonth(week=2, weekday=4)\
- pd.Timedelta('4D')
To test this code I created the source Series as:
s = (pd.date_range('2020-01-01', '2020-12-31', freq='MS') + pd.Timedelta('1D')).to_series()
It contains the second day of each month, both as the index and value.
When you run the above code, you will get:
2020-01-02 2020-01-13
2020-02-02 2020-02-17
2020-03-02 2020-03-16
2020-04-02 2020-04-13
2020-05-02 2020-05-11
2020-06-02 2020-06-15
2020-07-02 2020-07-13
2020-08-02 2020-08-17
2020-09-02 2020-09-14
2020-10-02 2020-10-12
2020-11-02 2020-11-16
2020-12-02 2020-12-14
dtype: datetime64[ns]
The left column contains the original index (the source date) and the right column the "wanted" date.
Note that the third-Monday formula (as proposed in one of the comments) is wrong: e.g. the third Monday of January 2020 is 2020-01-20, whereas the correct date is 2020-01-13.
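A quick check of both formulas (a minimal sketch using the offsets discussed above):
import pandas as pd

d = pd.Timestamp('2020-01-01')
# third Monday of January 2020 -> 2020-01-20 (not what we want)
d + pd.offsets.WeekOfMonth(week=2, weekday=0)
# Monday before the third Friday -> 2020-01-13 (what we want)
d + pd.offsets.WeekOfMonth(week=2, weekday=4) - pd.Timedelta('4D')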
Edit
If you have a DataFrame, something like:
Date Amount
0 2020-01-02 10
1 2020-01-12 10
2 2020-01-13 2
3 2020-01-20 2
4 2020-02-16 2
5 2020-02-17 12
6 2020-03-15 12
7 2020-03-16 3
8 2020-03-31 3
and you want something like resample, but with each "period" starting on the Monday before the third Friday of each month, and e.g. computing a sum for each period, you can:
Define the following function:
def dateShift(d):
    d += pd.Timedelta(4, 'D')
    d = pd.offsets.WeekOfMonth(week=2, weekday=4).rollback(d)
    return d - pd.Timedelta(4, 'D')
i.e.:
Add 4 days (e.g. move 2020-01-13, a Monday, to 2020-01-17, a Friday).
Roll back to the offset (a date that is already on the offset, as in this case, is not moved).
Subtract 4 days.
Run:
df.groupby(df.Date.apply(dateShift)).sum()
The result is:
Amount
Date
2019-12-16 20
2020-01-13 6
2020-02-17 24
2020-03-16 6
E.g. the two values of 10 for 2020-01-02 and 2020-01-12 are assigned to the period starting on 2019-12-16 (the "wanted" date for December 2019).
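As a quick sanity check of dateShift (a standalone call on one of the source dates):
dateShift(pd.Timestamp('2020-01-02'))
# Timestamp('2019-12-16 00:00:00') - the period start for December 2019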
I'm working on a production analysis data set (shift-wise, Day/Night). The Day shift is 7 AM-7 PM and the Night shift is 7 PM-7 AM.
Sometimes a day or night shift can be divided into two or more portions (e.g. the 7 AM-7 PM day shift can be split into 7 AM-10 AM and 10 AM-7 PM).
If a shift is divided into two or more portions, I first need to check whether the Brand is the same across all partitions of that shift.
If YES, set the start time to the beginning of the first partition and the end time to the end of the last partition.
For Production: take the total production of the shift partitions.
For RPM: take the average of the shift partitions.
If NO, get the appropriate values for each Brand.
(For more detail, please check the expected output.)
Sample of the raw DataFrame:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 17:07 Day A 5 50
7/9/2020 17:07 7/9/2020 17:58 Day A 10 100
7/9/2020 17:58 7/9/2020 19:00 Day A 5 60
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/9/2020 22:40 Night B 5 20
7/9/2020 22:40 7/10/2020 7:00 Night B 5 30
7/10/2020 7:00 7/10/2020 18:27 Day C 15 20
7/10/2020 18:27 7/10/2020 19:00 Day C 5 40
Expected Output:
Start end shift Brand Production RPM
7/8/2020 19:00 7/9/2020 7:00 Night A 10 50
7/9/2020 7:00 7/9/2020 19:00 Day A 20 70
7/9/2020 19:00 7/9/2020 21:30 Night A 2 10
7/9/2020 21:30 7/10/2020 7:00 Night B 10 25
7/10/2020 7:00 7/10/2020 19:00 Day C 20 30
Thanks in advance.
Here's a suggestion:
Make sure the columns Start and End have datetime values (I've renamed end to End and shift to Shift :)):
df['Start'] = pd.to_datetime(df['Start'])
df['End'] = pd.to_datetime(df['End'])
Then
df['Day'] = df['Start'].dt.strftime('%Y-%m-%d')
df = (df.groupby(['Day', 'Shift', 'Brand'])
.agg(Start = pd.NamedAgg(column='Start', aggfunc='min'),
End = pd.NamedAgg(column='End', aggfunc='max'),
Production = pd.NamedAgg(column='Production', aggfunc='sum'),
RPM = pd.NamedAgg(column='RPM', aggfunc='mean'))
.reset_index()[df.columns]
.drop('Day', axis='columns'))
gives you
Start End Shift Brand Production RPM
0 2020-07-08 19:00:00 2020-07-09 07:00:00 Night A 10 50
1 2020-07-09 07:00:00 2020-07-09 19:00:00 Day A 20 70
2 2020-07-09 19:00:00 2020-07-09 21:30:00 Night A 2 10
3 2020-07-09 21:30:00 2020-07-10 07:00:00 Night B 10 25
4 2020-07-10 07:00:00 2020-07-10 19:00:00 Day C 20 30
which seems to be your desired output (if I'm not mistaken).
If you want to transform the columns Start and End back to string with a format similar to the one you've given above (there's some additional padding):
df['Start'] = df['Start'].dt.strftime('%m/%d/%Y %H:%M')
df['End'] = df['End'].dt.strftime('%m/%d/%Y %H:%M')
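One caveat (an assumption on my part, not something visible in your sample): the groupby above merges all partitions of a brand within a shift even if the same brand reappears after a different brand in that shift. If such non-contiguous runs can occur and must stay separate, a hedged alternative is to group on consecutive runs instead, applied right after the Day column is added and in place of the groupby above (this assumes the rows are sorted by Start, as in your sample):
# label each consecutive run of the same (Day, Shift, Brand) combination
keys = df[['Day', 'Shift', 'Brand']]
run = keys.ne(keys.shift()).any(axis=1).cumsum().rename('Run')
df = (df.groupby([run, df['Shift'], df['Brand']])
        .agg(Start=pd.NamedAgg(column='Start', aggfunc='min'),
             End=pd.NamedAgg(column='End', aggfunc='max'),
             Production=pd.NamedAgg(column='Production', aggfunc='sum'),
             RPM=pd.NamedAgg(column='RPM', aggfunc='mean'))
        .reset_index()
        .drop('Run', axis='columns'))
For your sample data this produces the same five rows as above.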
Problem: Return all rows for an ID (1, 2, 3, 4) if there is any instance where the time difference between dissimilar categories (A, B) for that ID is below 60 minutes. This time difference, or 'Delta', should be the minimum between two dissimilar categories within the same 'ID'.
Example df:
ID Category Time
1 A 1:00
1 A 3:00
1 B 3:30
2 A 13:00
2 B 13:15
2 B 1:00
3 B 12:30
3 B 12:00
4 A 1:00
4 B 3:00
4 B 4:00
4 B 4:30
Desired output. Note that the row 2 B 1:00 is included because ID 2 does have an instance where a time difference between dissimilar categories was <= 60.
ID Category Time Delta(minutes)
1 A 1:00 150
1 A 3:00 30
1 B 3:30 30
2 A 13:00 15
2 B 13:15 15
2 B 1:00 120
Not this, because there is no time difference between dissimilar categories (ID 3 has only one category):
ID Category Time Delta
3 B 12:00 n/a
3 B 12:30 n/a
Not this, because no Delta is < 60 min:
ID Category Time Delta
4 A 1:00 120
4 B 3:00 120
4 B 4:00 180
4 B 4:30 240
I've tried using:
df["Delta"] = df["Time"].groupby(df['ID']).diff()
But this does not take into account Category. Any assistance would be much appreciated. Thanks!
Here's a way:
df['Time'] = pd.to_datetime(df['Time'], format='%H:%M')

def f(x):
    # cross join the group with itself and keep only pairs of dissimilar categories
    x1 = x.assign(key=1).merge(x.assign(key=1), on='key').query('Category_x < Category_y')
    # use the absolute difference so the order within the pair doesn't matter
    x1['TimeDiff'] = (x1['Time_y'] - x1['Time_x']).abs()
    return (x1['TimeDiff'] <= pd.Timedelta('60min')).any()

df.groupby('ID').filter(f)
Output:
ID Category Time
0 1 A 1900-01-01 01:00:00
1 1 A 1900-01-01 03:00:00
2 1 B 1900-01-01 03:30:00
3 2 A 1900-01-01 13:00:00
4 2 B 1900-01-01 13:15:00
5 2 B 1900-01-01 01:00:00
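The filter above returns the qualifying IDs but not the Delta column from the desired output. A hedged sketch for that part (the helper min_delta is my own, not from the question): for each row, take the smallest absolute gap in minutes to any row of the other category within the same ID, reusing the converted Time column from above.
import numpy as np

def min_delta(g):
    # times as float minutes; pairwise absolute differences within the ID
    t = g['Time'].values.astype('datetime64[m]').astype(float)
    diff = np.abs(t[:, None] - t[None, :])
    # ignore pairs drawn from the same Category
    same = g['Category'].values[:, None] == g['Category'].values[None, :]
    diff[same] = np.nan
    # an ID with a single category yields NaN (with an all-NaN-slice warning)
    return pd.Series(np.nanmin(diff, axis=1), index=g.index)

df['Delta'] = df.groupby('ID', group_keys=False).apply(min_delta)
df[df.groupby('ID')['Delta'].transform('min') <= 60]
Note that a plain absolute difference gives 720 (12 hours) for ID 2's 1:00 row rather than the 120 shown in the desired output, so the exact convention for that row may need adjusting.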
I have the following DataFrame:
Date Holiday
0 2018-01-01 New Year's Day
1 2018-01-15 Martin Luther King, Jr. Day
2 2018-02-19 Washington's Birthday
3 2018-05-08 Truman Day
4 2018-05-28 Memorial Day
... ... ...
58 2022-10-10 Columbus Day
59 2022-11-11 Veterans Day
60 2022-11-24 Thanksgiving
61 2022-12-25 Christmas Day
62 2022-12-26 Christmas Day (Observed)
I would like to resample this data frame from daily to hourly (while copying the content of the Holiday column to every hour of the correct date). I'd like it to look like this [ignore the index of the table; it should have a lot more rows than this]:
Timestamp Holiday
0 2018-01-01 00:00:00 New Year's Day
1 2018-01-01 01:00:00 New Year's Day
2 2018-01-01 02:00:00 New Year's Day
3 2018-01-01 03:00:00 New Year's Day
4 2018-01-01 04:00:00 New Year's Day
5 2018-01-01 05:00:00 New Year's Day
... ... ...
62 2022-12-26 20:00:00 Christmas Day (Observed)
63 2022-12-26 21:00:00 Christmas Day (Observed)
64 2022-12-26 22:00:00 Christmas Day (Observed)
65 2022-12-26 23:00:00 Christmas Day (Observed)
What's the fastest way to go about doing so? Thanks in advance.
How about
df.set_index("Date").resample("H").ffill().reset_index().rename(
{"Date": "Timestamp"}, axis=1
)
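Note that a plain ffill also fills every hour of the non-holiday days in between with the previous holiday's name. If you only want hours on the holiday dates themselves (as in the expected output), one hedged follow-up, assuming Date is already a datetime column, is to filter afterwards:
hourly = (df.set_index("Date").resample("H").ffill()
            .reset_index().rename({"Date": "Timestamp"}, axis=1))
# keep only hours that fall on an actual holiday date
hourly = hourly[hourly["Timestamp"].dt.normalize().isin(df["Date"])]
# the resampled range ends at the last Date, so the final day only gets
# its 00:00 row unless you extend the index first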
(1) Create a new DataFrame using date_range, (2) concat it with the original DF, (3) make the dates a column again using reset_index, (4) fill the empty slots using groupby and ffill, (5) sort the values and drop duplicates/NaN values (this assumes the original frame is called df2 and its date column is named date):
dates = pd.DataFrame(pd.date_range(df2['date'].min(), df2['date'].max(), freq='H'),
                     columns=['date']).set_index('date')
df3 = pd.concat([df2.set_index('date'), dates], sort=False)
df3.reset_index(inplace=True)
df3['Holiday'] = df3.groupby(df3['date'].dt.date)['Holiday'].ffill()
df3 = df3.sort_values('date').drop_duplicates().dropna(axis=0)
I have a .csv file with some data. There is only one column in this file, which contains timestamps. I need to organize that data into bins of 30 minutes. This is what my data looks like:
Timestamp
04/01/2019 11:03
05/01/2019 16:30
06/01/2019 13:19
08/01/2019 13:53
09/01/2019 13:43
So in this case, the last two data points would be grouped together in the bin that includes all the data from 13:30 to 14:00.
This is what I have already tried:
df = pd.read_csv('book.csv')
df['Timestamp'] = pd.to_datetime(df.Timestamp)
df.groupby(pd.Grouper(key='Timestamp', freq='30min')).count().dropna()
I am getting around 7000 rows showing all hours for all days with the count next to them, like this:
2019-09-01 03:00:00 0
2019-09-01 03:30:00 0
2019-09-01 04:00:00 0
...
I want to create bins for only the hours that I have in my dataset. I want to see something like this:
Time Count
11:00:00 1
13:00:00 1
13:30:00 2 (we have two data points in this interval)
16:30:00 1
Thanks in advance!
Use groupby.size as:
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df = df.Timestamp.dt.floor('30min').dt.time.to_frame()\
.groupby('Timestamp').size()\
.reset_index(name='Count')
Or as per suggestion by jpp:
df = df.Timestamp.dt.floor('30min').dt.time.value_counts().reset_index(name='Count')
print(df)
Timestamp Count
0 11:00:00 1
1 13:00:00 1
2 13:30:00 2
3 16:30:00 1
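One caveat on the value_counts variant: it orders by frequency rather than by time, so to reproduce the chronological order shown above, sort the index before resetting it:
df = df.Timestamp.dt.floor('30min').dt.time.value_counts().sort_index().reset_index(name='Count')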