In my dataframe I have a timestamp column formatted like 2021-11-18 00:58:22.705.
I wish to create a column that displays the time elapsed from each row to the initial time (the first timestamp).
There are two ways I can think of doing this, but I don't know how to make either happen.
Method 1:
Subtract from each timestamp the one in the row above:
df["difference"]= df["timestamp"].diff()
Now that this time difference has been calculated, I would like to create another column that sums the differences while carrying the running total from the rows above (i.e. the elapsed time from the start of the process).
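I imagine something like a running total would do it, e.g.
df["elapsed"] = df["difference"].cumsum()
but I am not sure whether that is the right approach.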
Method 2:
I guess another way would be to calculate the difference between each row's timestamp and the initial timestamp (the first one).
I do not know how I would do that.
Thanks in advance.
I haven't completely understood which type of difference you need, so I'm adding both of these, which I think are reasonable:
import pandas as pd
times = pd.date_range('2022-05-23', periods=20, freq='30min')  # one row every 30 minutes
df = pd.DataFrame({'Timestamp': times})
df['difference_in_min'] = (df.Timestamp - df.Timestamp.min()).astype('timedelta64[m]')
df['cumulative_dif_in_min'] = df.difference_in_min.cumsum()
print(df)
Timestamp difference_in_min cumulative_dif_in_min
0 2022-05-23 00:00:00 0.0 0.0
1 2022-05-23 00:30:00 30.0 30.0
2 2022-05-23 01:00:00 60.0 90.0
3 2022-05-23 01:30:00 90.0 180.0
4 2022-05-23 02:00:00 120.0 300.0
5 2022-05-23 02:30:00 150.0 450.0
6 2022-05-23 03:00:00 180.0 630.0
7 2022-05-23 03:30:00 210.0 840.0
8 2022-05-23 04:00:00 240.0 1080.0
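Note that .astype('timedelta64[m]') is deprecated in recent pandas releases; a sketch of the equivalent on current versions, using total_seconds:
df['difference_in_min'] = (df.Timestamp - df.Timestamp.min()).dt.total_seconds() / 60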
EDIT: My main goal is to avoid a for loop and to find a way of grouping the data efficiently/fast.
I am trying to solve a problem which is about grouping together different rows of data based on an ID and a 30-day time window.
I have the following example data:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
And I would like to have the following data:
ID     Time                 Group
12345  2021-01-01 14:00:00  1
12345  2021-01-15 14:00:00  1
12345  2021-01-29 14:00:00  1
12345  2021-02-15 14:00:00  2
12345  2021-02-16 14:00:00  2
12345  2021-03-15 14:00:00  3
12345  2021-04-24 14:00:00  4
12344  2021-01-24 14:00:00  5
12344  2021-01-25 14:00:00  5
12344  2021-04-24 14:00:00  6
(The 4 could also be a 1 and the 5 a 2, since the numbering may restart with a new ID such as 12344.)
I could then differentiate based on the ID column, so the Group does not need to be unique, but it can be.
The most important thing is to separate the data by ID and then, for all the rows of each ID, assign a group to each 30-day time window. By a 30-day time window I mean that, e.g., the first window for ID 12345 starts at 2021-01-01 and goes up to 2021-01-31 (this should be group 1), and the second window for ID 12345 starts at 2021-02-01 and goes to 2021-03-02 (30 days).
The problem I have faced with the following code is that it uses the first date it finds in the dataframe as the origin:
grouped_data = df.groupby(["ID",pd.Grouper(key = "Time", freq = "30D")]).count()
In the code above I have just tried to count the rows (which wouldn't give me the Group, but it was my attempt at grouping with my logic).
I hope someone can help me with this, because I have tried so many different things and nothing worked. I have already used the following (but maybe incorrectly):
pd.rolling()
pd.Grouper()
for loop
etc.
I really don't want to use a for loop, as I have 1.5 million rows.
I have tried to vectorize the for loop, but I am not really familiar with vectorization and was struggling to translate my loop into vectorized form.
Please let me know if I can use pd.Grouper differently so that I get these results. Thanks in advance.
For arbitrary windows you can use pandas.cut.
E.g., for 30-day bins starting at 2021-01-01 00:00:00 and covering the entirety of 2021, you can use:
bins = pd.date_range("2021", "2022", freq="30D")
group = pd.cut(df["Time"], bins)
group will label each row with an interval, which you can then group on, etc. If you want the groups to have labels 0, 1, 2, etc. instead, you can map the intervals with:
group = group.map(dict(zip(group.unique(), range(group.nunique()))))
EDIT: an approach where the windows are 30-day intervals, disjoint, and each starting at a time in the Time column:
times = df["Time"].sort_values()
# left-closed intervals so that each start time falls inside its own window
ii = pd.IntervalIndex.from_arrays(times, times + pd.Timedelta("30 days"), closed="left")
disjoint_intervals = []
prev_interval = None
for i, interval in enumerate(ii):
    if prev_interval is None or interval.left >= prev_interval.right:  # no overlap
        prev_interval = interval
        disjoint_intervals.append(i)
bins = ii[disjoint_intervals]
group = pd.cut(df["Time"], bins)
Apologies, this is not a vectorised approach. I'm struggling to think whether one could exist.
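If the windows additionally need to restart for every ID, a sketch (untested) that applies the same idea per group; disjoint_30d_bins is a hypothetical helper wrapping the loop above, and .cat.codes relabels the intervals 0, 1, 2, ... within each ID:
def disjoint_30d_bins(times):
    # disjoint, left-closed 30-day intervals, each starting at an observed time
    times = times.sort_values()
    ii = pd.IntervalIndex.from_arrays(times, times + pd.Timedelta("30 days"), closed="left")
    keep, prev_right = [], None
    for i, interval in enumerate(ii):
        if prev_right is None or interval.left >= prev_right:
            keep.append(i)
            prev_right = interval.right
    return ii[keep]

df["Group"] = df.groupby("ID")["Time"].transform(
    lambda s: pd.cut(s, disjoint_30d_bins(s)).cat.codes
)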
SOLUTION:
The solution which worked for me is the following:
I have imported the sample data from Excel (sampleData.xlsx) into a dataframe. The data looks like this:
ID     Time
12345  2021-01-01 14:00:00
12345  2021-01-15 14:00:00
12345  2021-01-29 14:00:00
12345  2021-02-15 14:00:00
12345  2021-02-16 14:00:00
12345  2021-03-15 14:00:00
12345  2021-04-24 14:00:00
12344  2021-01-24 14:00:00
12344  2021-01-25 14:00:00
12344  2021-04-24 14:00:00
Then I have used the following steps:
Import the data:
df_test = pd.read_excel(r"sampleData.xlsx")
Order the dataframe so we have the correct order of ID and Time:
df_test_ordered = df_test.sort_values(["ID","Time"])
df_test_ordered = df_test_ordered.reset_index(drop=True)
I have also reset the index and dropped the old one, as the original index interfered with my calculations later on.
Create column with time difference between the previous row:
df_test_ordered.loc[df_test_ordered["ID"] == df_test_ordered["ID"].shift(1),"time_diff"] = df_test_ordered["Time"] - df_test_ordered["Time"].shift(1)
Transform timedelta64[ns] to timedelta64[D]:
df_test_ordered["time_diff"] = df_test_ordered["time_diff"].astype("timedelta64[D]")
Calculate the cumsum per ID:
df_test_ordered["cumsum"] = df_test_ordered.groupby("ID")["time_diff"].transform(pd.Series.cumsum)
Fill the NaN values from the neighbouring rows (forward fill, then backfill):
df_final = df_test_ordered.ffill().bfill()
Create the window by dividing by 30 (30 days time period):
df_final["Window"] = df_final["cumsum"] / 30
df_final["Window_int"] = df_final["Window"].astype(int)
The "Window_int" column is now a kind of ID (not unique; but unique within the groups of column "ID").
Furthermore, I needed to backfill the dataframe as there were NaN values due to the calculation of time difference only if the previous ID equals the ID. If not then NaN is set as time difference. Backfilling will just set the NaN value to the next time difference which makes no difference mathematically and assign the correct value.
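Putting the steps together, a consolidated sketch of the same logic (groupby(...).diff() and .dt.days replace the shift/astype steps; this should be equivalent here, since all timestamps share the same time of day):
import pandas as pd

df = pd.read_excel(r"sampleData.xlsx")
df = df.sort_values(["ID", "Time"]).reset_index(drop=True)
# day difference to the previous row within the same ID (NaN for each ID's first row)
df["time_diff"] = df.groupby("ID")["Time"].diff().dt.days
# running total of days per ID, then fill the NaN gaps from neighbouring rows
df["cumsum"] = df.groupby("ID")["time_diff"].cumsum()
df = df.ffill().bfill()
# integer 30-day window index (unique within each ID)
df["Window_int"] = (df["cumsum"] / 30).astype(int)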
Solution dataframe:
ID Time time_diff cumsum Window Window_int
0 12344 2021-01-24 14:00:00 1.0 1.0 0.032258 0
1 12344 2021-01-25 14:00:00 1.0 1.0 0.032258 0
2 12344 2021-04-24 14:00:00 89.0 90.0 2.903226 2
3 12345 2021-01-01 14:00:00 14.0 14.0 0.451613 0
4 12345 2021-01-15 14:00:00 14.0 14.0 0.451613 0
5 12345 2021-01-29 14:00:00 14.0 28.0 0.903226 0
6 12345 2021-02-15 14:00:00 17.0 45.0 1.451613 1
7 12345 2021-02-16 14:00:00 1.0 46.0 1.483871 1
8 12345 2021-03-15 14:00:00 27.0 73.0 2.354839 2
9 12345 2021-04-24 14:00:00 40.0 113.0 3.645161 3
I have generation data for two generators in 15-minute time blocks, and I want to convert it to hourly. Here is an example:
Time Gen1 Gen2
00:15:00 10 21
00:30:00 12 22
00:45:00 16 26
01:00:00 20 11
01:15:00 60 51
01:30:00 30 31
01:45:00 70 21
02:00:00 40 61
I want to take the average of each group of 4 values (the four 15-minute blocks that make up an hour) and put it in place of the 1-hour block. Expected output:
Time Gen1 Gen2
01:00:00 14.5 20
02:00:00 50 41
I know I can use pandas' groupby function to get the expected output, but I don't know its proper syntax. So can anyone please help?
Use resample with closed='right'. But first we convert your Time column to datetime type:
df['Time'] = pd.to_datetime(df['Time'])
df.resample('H', on='Time', closed='right').mean().reset_index()
Time Gen1 Gen2
0 2021-01-09 00:00:00 14.5 20.0
1 2021-01-09 01:00:00 50.0 41.0
To convert the Time column back to time format, use:
df['Time'] = df['Time'].dt.time
Time Gen1 Gen2
0 00:00:00 14.5 20.0
1 01:00:00 50.0 41.0
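If you also want the bins labelled by the hour that closes them (01:00:00 and 02:00:00, as in the expected output), resample accepts label='right' as well:
df.resample('H', on='Time', closed='right', label='right').mean().reset_index()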
You can try creating a column hour and then groupby('hour').mean():
df['date_time'] = pd.to_datetime(df['Time'], format="%H:%M:%S")
df['hour'] = df['date_time'].dt.strftime("%H:00:00")
gr_df = df.groupby('hour').mean()
gr_df.index.name = 'Time'
print(gr_df.reset_index())
Time Gen1 Gen2
0 00:00:00 12.666667 23.0
1 01:00:00 45.000000 28.5
2 02:00:00 40.000000 61.0
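Note that this variant assigns e.g. 01:00:00 to the 01 hour instead of closing the previous hour, which is why the averages differ from the expected output above. A sketch that rounds each timestamp up to the hour that closes its block (using dt.ceil) should reproduce the expected grouping:
df['hour'] = df['date_time'].dt.ceil('h').dt.strftime('%H:00:00')
print(df.groupby('hour')[['Gen1', 'Gen2']].mean())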
I need to essentially measure how much each employee gets paid during each hour of work. There was some data cleaning to do, so I'm trying to make the formatting consistent.
It is a homework problem and it's proving tough. I am new to Python, so please feel free to compress the code. I'm using the pandas library.
The CSV file loaded in pandas:
break_notes end_time pay_rate start_time
0 15-18 23:00 10.0 10:00
1 18.30-19.00 23:00 12.0 18:00
2 4PM-5PM 22:30 14.0 12:00
3 3-4 18:00 10.0 09:00
4 4-4.10PM 23:00 20.0 09:00
5 15 - 17 23:00 10.0 11:00
6 11 - 13 16:00 10.0 10:00
import pandas as pd
import datetime
import numpy as np
work_shifts = pd.read_csv('work_shifts.csv')
break_shifts = work_shifts['break_notes'].str.extract(r'(?P<start>[\d\.]+)?\D*(?P<end>[\d\.]+)?')
print(work_shifts)
for i in range(len(break_shifts['start'])):
    if '.' not in break_shifts['start'][i]:
        break_shifts['start'][i] = break_shifts['start'][i] + ':00'
    else:
        break_shifts['start'][i] = break_shifts['start'][i].replace('.', ':')
for i in range(len(break_shifts['end'])):
    if '.' in str(break_shifts['end'][i]):
        break_shifts['end'][i] = break_shifts['end'][i].replace('.', ':')
    elif '.' not in str(break_shifts['end'][i]):
        break_shifts['end'][i] = break_shifts['end'][i] + ':00'
for i in range(len(break_shifts['end'])):
    break_shifts['end'][i] = datetime.datetime.strptime(break_shifts['end'][i], '%H:%M').time()
    break_shifts['start'][i] = datetime.datetime.strptime(break_shifts['start'][i], '%H:%M').time()
work_shifts[['start_break', 'end_break']] = break_shifts[['start', 'end']]
for i in range(len(work_shifts['end_time'])):
    work_shifts['end_time'][i] = datetime.datetime.strptime(work_shifts['end_time'][i], '%H:%M').time()
for i in range(len(work_shifts['start_time'])):
    work_shifts['start_time'][i] = datetime.datetime.strptime(work_shifts['start_time'][i], '%H:%M').time()
print(work_shifts)
This is the result:
break_notes end_time pay_rate start_time start_break end_break
0 15-18 23:00:00 10.0 10:00:00 15:00:00 18:00:00
1 18.30-19.00 23:00:00 12.0 18:00:00 18:30:00 19:00:00
2 4PM-5PM 22:30:00 14.0 12:00:00 04:00:00 05:00:00
3 3-4 18:00:00 10.0 09:00:00 03:00:00 04:00:00
4 4-4.10PM 23:00:00 20.0 09:00:00 04:00:00 04:10:00
5 15 - 17 23:00:00 10.0 11:00:00 15:00:00 17:00:00
6 11 - 13 16:00:00 10.0 10:00:00 11:00:00 13:00:00
I tried adding the times, but they are inconsistent types. If there's a different approach, please provide guidance. I need to calculate how many employees are working at what time, and then calculate how much pay is given to the employees per hour.
My approach was to convert the break notes into times, and then to convert a 12-hour time to 24-hour, provided both end_break and start_break were before datetime.time(12, 0).
I'm not sure how to calculate the money per hour. Maybe using if statements?
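For the pay per shift, a hedged sketch (using the columns from the dataframe above; to_td is a hypothetical helper, and the 12-hour correction and per-hour headcount are left out): convert each time to a Timedelta, subtract the break from the shift length, and multiply the hours by the rate:
def to_td(t):
    # convert a datetime.time into a Timedelta since midnight (hypothetical helper)
    return pd.Timedelta(hours=t.hour, minutes=t.minute)

worked = work_shifts.apply(
    lambda r: to_td(r['end_time']) - to_td(r['start_time'])
              - (to_td(r['end_break']) - to_td(r['start_break'])),
    axis=1,
)
work_shifts['pay'] = worked.dt.total_seconds() / 3600 * work_shifts['pay_rate']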
I have a DataFrame that has time stamps in the form of (yyyy-mm-dd hh:mm:ss). I'm trying to delete data between two different time stamps. At the moment I can delete the data between 1 range of time stamps but I have trouble extending this to multiple time stamps.
For example, with the DataFrame I can delete a range of rows (e.g. 2015-03-01 00:20:00 to 2015-08-01 01:10:00) however, I'm not sure how to go about deleting another range alongside it. The code that does that is shown below.
index_list = df.timestamp[(df.timestamp >= "2015-07-01 00:00:00") & (df.timestamp <= "2015-12-30 23:50:00")].index.tolist()
df.drop(index_list, inplace=True)
The DataFrame extends over 3 years and has every day in the 3 years included.
I'm trying to delete all the rows from months July to December (2015-07-01 00:00:00 to 2015-12-30 23:50:00) for all 3 years.
I was thinking that I create a helper column that gets the Month from the Date column and then drops based off the Month from the helper column.
I would greatly appreciate any advice. Thanks!
Edit:
I've added a small, summarised version of the DataFrame. This is what the initial DataFrame looks like:
Date                 v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-04-01 00:30:00 65.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-07-01 01:00:00 74.0
2015-08-01 01:10:00 54.0
2015-09-01 01:20:00 86.0
2015-10-01 01:30:00 91.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
To get something like this
Date                 v
2015-01-01 00:00:00 30.0
2015-02-01 00:10:00 55.0
2015-03-01 00:20:00 36.0
2015-05-01 00:40:00 35.0
2015-06-01 00:50:00 22.0
2015-11-01 01:40:00 65.0
2015-12-01 01:50:00 35.0
Where the time stamp ranges "2015-07-01 00:20:00 to 2015-10-01 00:30:00" and "2015-07-01 01:00:00 to 2015-10-01 01:30:00" are removed. Sorry if my formatting isn't up to standard.
If your timestamp column uses the correct dtype, you can just do:
df.loc[df.timestamp.dt.month.isin([1, 2, 3, 5, 6, 11, 12])]
This should filter out the months not inside the list.
As you hinted, data manipulation is always easier when you use the right data types. To support time stamps, pandas has the Timestamp type. You can do this as follows:
df['Date'] = pd.to_datetime(df['Date']) # No date format needs to be specified,
# "YYYY-MM-DD HH:MM:SS" is the standard
Then, removing all entries in the months of July to December for all years is straightforward:
df = df[df['Date'].dt.month < 7] # Keep only months less than July
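If you instead need to drop several arbitrary timestamp ranges (not whole months), a sketch that builds one boolean mask from a list of ranges (the range values below are only examples):
ranges = [("2015-07-01 00:00:00", "2015-10-01 01:30:00"),
          ("2015-04-01 00:00:00", "2015-04-30 23:50:00")]
mask = pd.Series(False, index=df.index)
for start, end in ranges:
    mask |= df['Date'].between(start, end)
df = df[~mask]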
This is my data:
time id w
0 2018-03-01 00:00:00 39.0 1176.000000
1 2018-03-01 00:15:00 39.0 NaN
2 2018-03-01 00:30:00 39.0 NaN
3 2018-03-01 00:45:00 39.0 NaN
4 2018-03-01 01:00:00 39.0 NaN
5 2018-03-01 01:15:00 39.0 NaN
6 2018-03-01 01:30:00 39.0 NaN
7 2018-03-01 01:45:00 39.0 1033.461538
8 2018-03-01 02:00:00 39.0 1081.066667
9 2018-03-01 02:15:00 39.0 1067.909091
10 2018-03-01 02:30:00 39.0 NaN
11 2018-03-01 02:45:00 39.0 1051.866667
12 2018-03-01 03:00:00 39.0 1127.000000
13 2018-03-01 03:15:00 39.0 1047.466667
14 2018-03-01 03:30:00 39.0 1037.533333
I want to get index: 10
Because I need to know which time is not continuous, so that I can add the value there.
I want to know, for each row, whether the w value is a NaN with non-NaN values in front of it and behind it. If so, I need to know its index so that I can add a value for it.
My data is very large, so I need a fast way.
I really need your help. Many thanks.
I'm not sure if I understood you correctly. If you want the indices of the time column where the change is more than 15 minutes, you will get more than one index, and you can do so:
df['time'] = pd.to_datetime(df['time'], format='%Y-%m-%d %H:%M:%S')
df['Delta'] = df['time'].diff()  # difference to the previous row
print(df.index[df['Delta'] != pd.Timedelta('15min')].tolist())
And the output is:
[4561, 4723, 5154, 5220, 5293, 5437, 5484]
Edit
Again, if I understood you right, just use this:
df.index[(pd.isnull(df['w'])) & (pd.notnull(df['w'].shift(1))) & (pd.notnull(df['w'].shift(-1)))].tolist()
Output:
[10]
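If the next step is to add the value at those isolated gaps, a sketch that fills each one with the mean of its two neighbours (assuming that is an acceptable fill):
mask = df['w'].isnull() & df['w'].shift(1).notnull() & df['w'].shift(-1).notnull()
df.loc[mask, 'w'] = (df['w'].shift(1) + df['w'].shift(-1)) / 2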
This should work pretty fast:
import numpy as np
index = np.array([4561,4723,4724,4725,4726,5154,5220,5221,5222,5223,5224,5293,5437,5484,5485,5486,5487])
continuous = np.diff(index) == 1
not_continuous = np.where(~continuous[1:] & ~continuous[:-1])[0] + 1 # check on both 'sides', +1 because you 'loose' one index in the diff operation
index[not_continuous]
array([5154, 5293, 5437])
It doesn't handle the first value well, but that case is quite ambiguous since there is no preceding value to check against. It's up to you to add that extra check if it matters to you; the same goes, potentially, for the last value.