I'm trying to read a log and compute the duration of a certain workflow. So the dataframe containing the log looks something like this:
Timestamp Workflow Status
20:31:52 ABC Started
...
...
20:32:50 ABC Completed
In order to compute the duration, I am using the following code:
start_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Started')]['Timestamp']
compl_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Completed')]['Timestamp']
duration = compl_time - start_time
and the answer I get is:
1 NaT
72 NaT
Name: Timestamp, dtype: timedelta64[ns]
I think since the index is different, the time difference is not being calculated correctly. Of course, I could get the correct answer by using the index of each row explicitly by:
duration = compl_time.loc[72] - start_time[1]
But this seems to be an inelegant way of doing things. Is there a better way to accomplish the same?
You are right, there is a problem with the different indexes: the two outputs cannot be aligned, so you get NaT.
The simplest fix is to convert one output to a NumPy array via .values, but both Series need the same length (here each has length == 1). For selecting with a boolean mask it is also better to use loc:
mask = log_text['Workflow']=='ABC'
start_time = log_text.loc[mask & (log_text['Status']=='Started'), 'Timestamp']
compl_time = log_text.loc[mask & (log_text['Status']=='Completed'),'Timestamp']
print (len(start_time))
1
print (len(compl_time))
1
duration = compl_time - start_time.values
print (duration)
1 00:00:58
Name: Timestamp, dtype: timedelta64[ns]
duration = compl_time.values - start_time.values
print (pd.to_timedelta(duration))
TimedeltaIndex(['00:00:58'], dtype='timedelta64[ns]', freq=None)
print (pd.Series(pd.to_timedelta(duration)))
0 00:00:58
dtype: timedelta64[ns]
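If each workflow is guaranteed to have exactly one Started and one Completed row, a scalar-arithmetic sketch (my assumption, building on the loc selections above) avoids the alignment issue entirely:
# pull out the single Timestamp values instead of aligning two Series
start_time = log_text.loc[mask & (log_text['Status']=='Started'), 'Timestamp'].iloc[0]
compl_time = log_text.loc[mask & (log_text['Status']=='Completed'), 'Timestamp'].iloc[0]
duration = compl_time - start_time  # one Timedelta, e.g. 00:00:58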
I have a lot of data from a lot of different datasets with different time frames (hourly, every 5 minutes, and every minute). I decided to get all of the data onto even times, and I only want the rows whose timestamps fall exactly on the hour, i.e. YYYY:MM:DD HH:00:00 (I have decades of data).
I have tried a few different methods to filter out only the data I want:
df.loc[starting_row_value::value_to_skip_by], but unfortunately there is some missing data: I start off at HH:00:00, but by the end in a few of the frames it ends up at HH:00:05 or HH:00:55, so the missing data breaks this approach.
I also tried df[df.time_column[-5:] == 00:00], but that gives me:
TypeError: cannot do slice indexing on RangeIndex with these indexers
(followed by a Series repr with a few False values, Name: time, dtype: bool)
I've done a lot of looking and couldn't find anything about filtering by specific hours. Does anyone have any ideas on what I could do? Any help would be much appreciated!
Edit: the dtypes for each dataframe are as follows:
DATE (MM/DD/YYYY) object
MST object
Global PSP [W/m^2] float64
Direct NIP [W/m^2] float64
Reflected PSP [W/m^2] float64
time datetime64[ns]
dtype: object
Everything but the time column was kept as is; the time column I created with the following code:
df['time'] = pd.to_datetime(df['DATE (MM/DD/YYYY)'] + ' ' + df['MST'])
Assuming you are working with pd.Timestamp values, you could do the following:
import pandas as pd
df = pd.DataFrame([
    pd.to_datetime('2022-04-06 11:00:00'),
    pd.to_datetime('2022-04-06 11:00:05')
], columns=['time_column'])
idx1 = df['time_column'].dt.minute == 0
idx2 = df['time_column'].dt.second == 0
df2 = df[idx1 & idx2]
print(df2)
prints
          time_column
0 2022-04-06 11:00:00
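A one-line variant of the same idea (my sketch, not from the original answer): a timestamp is on the hour exactly when flooring it to hourly frequency leaves it unchanged.
import pandas as pd

df = pd.DataFrame({'time_column': pd.to_datetime([
    '2022-04-06 11:00:00', '2022-04-06 11:00:05'])})

# keep rows where flooring to the hour is a no-op
df2 = df[df['time_column'] == df['time_column'].dt.floor('h')]
print(df2)  # only the 11:00:00 row survives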
Date,hrs,Count,Status
2018-01-02,4,15,SFZ
2018-01-03,5,16,ACZ
2018-01-04,3,14,SFZ
2018-01-05,5,15,SFZ
2018-01-06,5,18,ACZ
This is a fraction of the data I've been working on. The actual data is in the same format, with around 1000 entries for each date. I am taking start_date and end_date as inputs from the user:
start_date=dt.date(2018, 1, 2)
end_date=dt.date(2018, 1, 23)
Now I have to display totals for hrs and Count within the selected date range. I am able to do so by entering the dates directly into the between clause, using this snippet:
df = df.loc[df['Date'].between('2018-01-02','2018-01-06'), ['hrs','Count']].sum()
print (df)
Output:
hrs 22
Count 78
dtype: int64
I am using the pandas and datetime libraries. But I want to pass the dates via the variables start_date and end_date, as they might change every time. I tried replacing them; it doesn't give me an error, but the total shows 0.
df = df.loc[df['Date'].between('start_date','end_date'), ['hrs','Count']].sum()
print (df)
Output:
Duration_hrs 0
Reject_Count 0
dtype: int64
You only need to drop the quotes, so that the variables are passed rather than literal strings, and convert all the values to a compatible type, pd.Timestamp:
df = df.loc[pd.to_datetime(df['Date']).between(pd.Timestamp(start_date),
                                               pd.Timestamp(end_date)),
            ['hrs','Count']].sum()
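A runnable end-to-end sketch of the same fix (the file name data.csv and the read_csv step are my assumptions):
import datetime as dt
import pandas as pd

df = pd.read_csv('data.csv', parse_dates=['Date'])  # hypothetical file name
start_date = dt.date(2018, 1, 2)
end_date = dt.date(2018, 1, 6)

totals = df.loc[df['Date'].between(pd.Timestamp(start_date),
                                   pd.Timestamp(end_date)),
                ['hrs', 'Count']].sum()
print(totals)  # hrs 22, Count 78 for the sample rows above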
I need help converting timestamps into the Python/pandas datetime format. For example, my times are saved like the following:
2017-01-01 05:30:24.468911+00:00
.....
2017-05-05 01:51:31.351718+00:00
and I want to know the simplest way to convert this into a datetime format so I can perform operations with time (like finding the range in days of my dataset so I can split it into chunks by time, or the time difference between two times). I don't mind losing some of the precision of the times if that makes things easier. Thank you so much!
Timestamp will convert it for you.
>>> pd.Timestamp('2017-01-01 05:30:24.468911+00:00')
Timestamp('2017-01-01 05:30:24.468911+0000', tz='UTC')
Let's say you have a dataframe that includes your timestamp column (let's call it stamp). You can use apply on that column together with Timestamp:
df = pd.DataFrame(
    {'stamp': ['2017-01-01 05:30:24.468911+00:00',
               '2017-05-05 01:51:31.351718+00:00']})
>>> df
stamp
0 2017-01-01 05:30:24.468911+00:00
1 2017-05-05 01:51:31.351718+00:00
>>> df['stamp'].apply(pd.Timestamp)
0 2017-01-01 05:30:24.468911+00:00
1 2017-05-05 01:51:31.351718+00:00
Name: stamp, dtype: datetime64[ns, UTC]
You could also convert the whole column with pd.to_datetime (pd.TimeSeries existed in older pandas but has since been removed):
>>> pd.to_datetime(df.stamp)
0   2017-01-01 05:30:24.468911+00:00
1   2017-05-05 01:51:31.351718+00:00
Name: stamp, dtype: datetime64[ns, UTC]
Once you have a Timestamp object, it is pretty efficient to manipulate. You can just difference their values, for example.
You may also want to have a look at this SO answer, which discusses converting timezone-unaware values to aware ones.
Let's say I have two strings 2017-06-06 and 1944-06-06 and I wanted to get the difference (what Python calls a timedelta) between the two.
First, I'll need to import datetime. Then I'll need to get both of those strings into datetime objects:
>>> a = datetime.datetime.strptime('2017-06-06', '%Y-%m-%d')
>>> b = datetime.datetime.strptime('1944-06-06', '%Y-%m-%d')
That gives us two datetime objects; subtracting them returns a timedelta object, and here I take the absolute number of days:
>>> c = abs((a-b).days)
This will give us 26663, and days is the largest resolution that timedelta supports: documentation
Since the Pandas tag is there:
df = pd.DataFrame(['2017-01-01 05:30:24.468911+00:00'])
df.columns = ['Datetime']
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S.%f', utc=True)
print(df.dtypes)
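Once the column is a true datetime64, the operations mentioned in the question become one-liners; a sketch using the two sample timestamps:
import pandas as pd

s = pd.to_datetime(pd.Series(['2017-01-01 05:30:24.468911+00:00',
                              '2017-05-05 01:51:31.351718+00:00']), utc=True)

span = s.max() - s.min()       # Timedelta covering the whole dataset
print(span.days)               # 123 -> range in days
print(s.iloc[1] - s.iloc[0])   # difference between two individual times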
I am trying to add more than two timestamp values, and I expect to see the output in minutes/seconds. How can I add two timestamps? I basically want to do '1995-07-01 00:00:01' + '1995-07-01 00:05:06' and see if the total time >= 60 minutes.
I tried this code: df['timestamp'][0] + df['timestamp'][1]. I referred to this post, but my timestamps come from a dataframe.
Head of my dataframe column looks like this:
0 1995-07-01 00:00:01
1 1995-07-01 00:00:06
2 1995-07-01 00:00:09
3 1995-07-01 00:00:09
4 1995-07-01 00:00:09
Name: timestamp, dtype: datetime64[ns]
I am getting this error:
TypeError: unsupported operand type(s) for +: 'Timestamp' and 'Timestamp'
The problem is that adding Timestamps makes no sense. What if they were on different days? What you want is the sum of Timedeltas. We can create Timedeltas by subtracting a common date from the whole series; let's subtract the minimum date. Then sum up the Timedeltas. Let s be your series of Timestamps:
s.sub(s.dt.date.min()).sum().total_seconds()
34.0
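For reference, a self-contained reproduction (building s from the sample column in the question):
import pandas as pd

s = pd.to_datetime(pd.Series(['1995-07-01 00:00:01', '1995-07-01 00:00:06',
                              '1995-07-01 00:00:09', '1995-07-01 00:00:09',
                              '1995-07-01 00:00:09']))

# subtracting the common date leaves only the time-of-day as Timedeltas
total = s.sub(s.dt.date.min()).sum().total_seconds()
print(total)             # 34.0
print(total >= 60 * 60)  # False: well under an hour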
# Adding two timestamps is not supported and not logical.
# Probably you really want to add the times of day rather than the timestamps themselves.
# This is how to extract the time from each timestamp and then sum them up:
import datetime

import pandas as pd

t = ['1995-07-01 00:00:01', '1995-07-01 00:00:06', '1995-07-01 00:00:09',
     '1995-07-01 00:00:09', '1995-07-01 00:00:09']
df = pd.DataFrame(t, columns=['timestamp'])

tSum = datetime.timedelta()
for value in df['timestamp']:
    # parse the string and keep only the time-of-day part
    t_of_day = datetime.datetime.strptime(value, "%Y-%m-%d %H:%M:%S").time()
    tSum += datetime.timedelta(hours=t_of_day.hour,
                               minutes=t_of_day.minute,
                               seconds=t_of_day.second)

# use total_seconds(): .seconds alone wraps around after one day
if tSum.total_seconds() >= 60 * 60:
    print("more than 1 hour")
else:
    print("less than 1 hour")
I have a file that has the following format:
20150426010203 name1
20150426010303 name2
20150426010307 name3
20150426010409 name1
20150426010503 name4
20150426010510 name1
I am interested in finding the time differences between appearances of name1 in the list and then calculating the frequency of such appearances (for example, delta time = 1s appeared 20 times, delta time = 30s appeared 1 time, etc.). The second problem is how to find the number of events per minute/hour/day.
I found all time differences by using
pd.to_datetime(pd.Series([time]))
to convert each string to datetime format, and placed all values in a list named times. Then I iterated through the list:
new=[x - times[i - 1] for i, x in enumerate(times)][1:]
and the resulting list was something like this:
dtype: timedelta64[ns], 0 00:00:50
dtype: timedelta64[ns], 0 00:00:10
dtype: timedelta64[ns], 0 00:00:51
dtype: timedelta64[ns], 0 00:00:09
dtype: timedelta64[ns], 0 00:00:50
dtype: timedelta64[ns], 0 00:00:11
Any further attempt to calculate frequencies results in a TypeError: 'Series' objects are mutable, thus they cannot be hashed. And I am not sure how to calculate the number of events per minute or any other time unit.
Obviously I don't have a lot of experience with datetime in Python, so any pointers would be appreciated.
Use resample and sum to get the number of events per time period; examples below.
I gather you want the intervals for individuals (name1: 1st to 2nd event interval; and then his/her 2nd to 3rd event interval). You will need to group by name and then difference the times for each group. In your dataset, only name1 has more than one event, and two events are necessary for a person-centric interval.
Quick and dirty ...
# --- get your data into a DataFrame so I can play with it ...
# first, put the data in a multi-line string (I would read it from a file
# if I had it in a file - but for my purposes a string will do).
data = """
time name
20150426010203 name1
20150426010303 name2
20150426010307 name3
20150426010409 name1
20150426010503 name4
20150426010510 name1"""
# second I will use StringIO and pandas.read_csv to pretend I am
# reading it from a file.
from io import StringIO  # in Python 2 this was: from StringIO import StringIO
import pandas as pd

df = pd.read_csv(StringIO(data), header=0, index_col=0, sep=r'\s+')
# third, because pandas did not recognise the date-time format
# of the column I made the index, I will force the string to be
# converted to a pandas Timestamp come DatetimeIndex.
df.index = pd.to_datetime(df.index, format='%Y%m%d%H%M%S')
# number of events per minute
df['event'] = 1  # we will sum this to get events per time-period
# note: pandas removed resample(..., how=sum); chain .sum() instead
dfepm = df['event'].resample('1min').sum()
# number of events per hour
dfeph = df['event'].resample('1h').sum()
# time differences by name
del df['event'] # we don't need this anymore
df['time'] = df.index
df['time_diff_by_name'] = df.groupby('name')['time'].diff()
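To also get the frequency of each interval (the asker's first goal), a short follow-up sketch (my addition, not part of the original answer):
# how often each distinct inter-event interval occurs (NaT rows are ignored)
print(df['time_diff_by_name'].value_counts())

# events per minute, from the resample above
print(dfepm)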