Working with datetime in Python

I have a file that has the following format:
20150426010203 name1
20150426010303 name2
20150426010307 name3
20150426010409 name1
20150426010503 name4
20150426010510 name1
I am interested in finding the time differences between appearances of name1 in the list and then calculating the frequency of those appearances (for example, delta time = 1s appeared 20 times, delta time = 30s appeared 1 time, etc.). The second problem is how to find the number of events per minute/hour/day.
I found all time differences by using
pd.to_datetime(pd.Series([time]))
to convert each string to datetime format, and placed all values in a list named 'times'. Then I iterated through the list:
new = [x - times[i - 1] for i, x in enumerate(times)][1:]
and the resulting list was something like this:
dtype: timedelta64[ns], 0 00:00:50
dtype: timedelta64[ns], 0 00:00:10
dtype: timedelta64[ns], 0 00:00:51
dtype: timedelta64[ns], 0 00:00:09
dtype: timedelta64[ns], 0 00:00:50
dtype: timedelta64[ns], 0 00:00:11
Any further attempt to calculate the frequency results in a 'TypeError: 'Series' objects are mutable, thus they cannot be hashed' error. And I am not sure how to calculate the number of events per minute or any other time unit.
Obviously, I don't have a lot of experience with datetime in Python, so any pointers would be appreciated.

Use resample and sum to get the number of events per time period - examples below
I gather you want the intervals for individuals (name1: the interval from their 1st to 2nd event, then from their 2nd to 3rd event). You will need to group by name and then difference the times within each group. In your dataset, only name1 has more than one event, and at least two events are needed for a person-centric interval.
Quick and dirty ...
# --- get your data into a DataFrame so I can play with it ...
# first, put the data in a multi-line string (I would read it from a file
# if I had it in a file - but for my purposes a string will do).
import pandas as pd
data = """
time name
20150426010203 name1
20150426010303 name2
20150426010307 name3
20150426010409 name1
20150426010503 name4
20150426010510 name1"""
# second, I will use StringIO and pandas.read_csv to pretend I am
# reading it from a file.
from io import StringIO  # Python 2: from StringIO import StringIO
df = pd.read_csv(StringIO(data), header=0, index_col=0, sep=r'\s+')
# third, because pandas did not recognise the date-time format
# of the column I made the index, I will force the strings to be
# converted to a pandas DatetimeIndex.
df.index = pd.to_datetime(df.index, format='%Y%m%d%H%M%S')
# number of events per minute
df['event'] = 1  # we will sum this to get events per time-period
dfepm = df.resample('1min')['event'].sum()  # modern API; the old how=sum keyword was removed
# number of events per hour
dfeph = df.resample('1h')['event'].sum()
# time differences by name
del df['event']  # we don't need this anymore
df['time'] = df.index
df['time_diff_by_name'] = df.groupby('name')['time'].diff()
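From here, the frequency table you describe is just a value count over the per-name differences. A minimal sketch continuing from the DataFrame above (each name's first event produces a NaT, which is dropped first):
delta_counts = df['time_diff_by_name'].dropna().value_counts()
print(delta_counts)
# 0 days 00:02:06    1
# 0 days 00:01:01    1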

Related

Python Datetime conversion for excel dataframe

Hello,
I am trying to extract a date and time column from my Excel data. I get the column as a DataFrame with float values, and after using pandas.to_datetime I get a different date than the actual date in Excel. For example, in Excel the starting date is 01.01.1901 00:00:00, but in Python I get 1971-01-03 00:00:00.000000.
How can I solve this problem?
I need the final output in total seconds as a DataFrame: the first cell starts at 0 seconds and each subsequent cell advances in seconds (the time difference between cells is 15 min).
Thank you.
Your input is fractional days, so there's actually no need to convert to datetime if you want the duration in seconds relative to the first entry. Subtract that from the rest of the column and multiply by the number of seconds in a day:
import pandas as pd
df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})
df["totalseconds"] = (df["Datum/Zeit"] - df["Datum/Zeit"].iloc[0]) * 86400
df["totalseconds"]
0 0.0000
1 900.0288
2 1799.9712
Name: totalseconds, dtype: float64
If you have to use datetime, you'll need to convert to timedelta (a duration) to do the same, e.g. like this:
df["datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="d")
# df["datetime"]
# 0 1971-01-03 00:00:00.000000
# 1 1971-01-03 00:15:00.028800
# 2 1971-01-03 00:29:59.971200
# Name: datetime, dtype: datetime64[ns]
# subtraction of datetime from datetime gives timedelta, which has total_seconds:
df["totalseconds"] = (df["datetime"] - df["datetime"].iloc[0]).dt.total_seconds()
# df["totalseconds"]
# 0 0.0000
# 1 900.0288
# 2 1799.9712
# Name: totalseconds, dtype: float64
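As an aside, the 1971 dates appear because pandas counts from the Unix epoch by default. If you ever need the actual Excel dates, to_datetime takes an origin argument; Excel's 1900 date system effectively counts days from 1899-12-30, so (a sketch under that assumption):
import pandas as pd
df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})
# interpret the floats as days since Excel's day zero instead of the Unix epoch
df["excel_datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="d", origin="1899-12-30")
# df["excel_datetime"]
# 0   1901-01-01 00:00:00.000000
# 1   1901-01-01 00:15:00.028800
# 2   1901-01-01 00:29:59.971200
# Name: excel_datetime, dtype: datetime64[ns]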

Pandas convert series of one to datetime object

I have a data frame with a lot of columns and rows, the index column contains datetime objects.
date_time column1 column2
10-10-2010 00:00:00 1 10
10-10-2010 00:00:03 1 10
10-10-2010 00:00:06 1 10
Now I want to calculate the difference in time between the first and last datetime object. Therefore:
start = df["date_time"].head(1)
stop = df["date_time"].tail(1)
However, I now want to extract this datetime value so that I can use the .total_seconds() method to calculate the number of seconds difference between the two datetime objects, something like:
delta_t_seconds = (start - stop).total_seconds()
This however doesn't give the desired result, since start and stop are still Series with only one member each.
Please help.
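A minimal sketch of one way to do this, assuming the date_time column already holds datetime values: pull scalar Timestamps out of the Series with .iloc, then subtract the scalars directly:
import pandas as pd
df = pd.DataFrame({"date_time": pd.to_datetime(
    ["2010-10-10 00:00:00", "2010-10-10 00:00:03", "2010-10-10 00:00:06"])})
start = df["date_time"].iloc[0]   # scalar Timestamp, not a one-row Series
stop = df["date_time"].iloc[-1]
delta_t_seconds = (stop - start).total_seconds()
print(delta_t_seconds)  # 6.0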

Drop certain character in Object before converting to Datetime column in Pandas

My dataframe has a column which measures time difference in the format HH:MM:SS.000
The DataFrame is created from an Excel file; the column which stores the time difference is an Object. However, some entries have a negative time difference. The negative sign doesn't matter to me and needs to be removed from the time, as it breaks a filter condition I have.
Note: I only have the negative time difference there because of the issue I'm currently having.
I've tried a few functions, but I get errors because some of the time difference data is just 00:00:00, some is 00:00:02.65, and some is 00:00:02.111.
Firstly, how would I ensure that all data in this column is in the format 00:00:00.000? And then, how would I remove the '-' from some of the data?
Here's a sample of the time diff column. I can't transform this column into datetime because some of the entries don't have 3 digits after the decimal. Is there a way to iterate through the column and add a 0 if the length of the value isn't equal to 12 digits?
00:00:02.97
00:00:03:145
00:00:00
00:00:12:56
28 days 03:05:23.439
It looks like you need to clean your input before you can parse to timedelta, e.g. with the following function:
import pandas as pd
def clean_td_string(s):
    # strings like 00:00:03:145 use a colon before the fractional part;
    # replace that last colon with a dot so pd.to_timedelta can parse it
    if s.count(':') > 2:
        return '.'.join(s.rsplit(':', 1))
    return s
Applied to a df's column, this looks like
df = pd.DataFrame({'Time Diff': ['00:00:02.97', '00:00:03:145', '00:00:00', '00:00:12:56', '28 days 03:05:23.439']})
df['Time Diff'] = pd.to_timedelta(df['Time Diff'].apply(clean_td_string))
# df['Time Diff']
# 0 0 days 00:00:02.970000
# 1 0 days 00:00:03.145000
# 2 0 days 00:00:00
# 3 0 days 00:00:12.560000
# 4 28 days 03:05:23.439000
# Name: Time Diff, dtype: timedelta64[ns]
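The leading negative signs the question mentions could be stripped in the same cleaner. A hypothetical extension of the function above (the lstrip call is the only addition; without it, pd.to_timedelta would parse '-00:00:02.97' as a negative duration):
def clean_td_string(s):
    s = s.lstrip('-')  # drop a leading negative sign, per the question
    if s.count(':') > 2:
        return '.'.join(s.rsplit(':', 1))
    return s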

Efficient timedelta calculator

I have time series data from a data logger that writes timestamps of the form DD.MM.YYYY HH:MM:SS.fff (e.g. --[ 29.08.2018 16:26:31.406 ] --), precise to the millisecond (the logger can record down to microseconds). Now you can imagine that a file recorded over a few minutes could be very big (hundreds of megabytes). I need to plot a bunch of data from this file vs time, ideally in milliseconds.
So I need to parse these dates in Python and calculate timedeltas to find the time elapsed between samples, and then generate plots. For example, when I subtract the two timestamps --[ 29.08.2018 16:23:41.052 ] -- and --[ 29.08.2018 16:23:41.114 ] --, I want to get 62 milliseconds as the time elapsed between them.
Currently I am using 'dateparser' (import dateparser as dp), which outputs a datetime after parsing; I can then subtract the datetimes to extract a timedelta and convert it into ms or seconds as I need.
But this function is taking too long and is the bottleneck of my post-processing script.
Can anyone suggest a better library that is more efficient at parsing dates and calculating timedeltas?
Here's the piece of code that is not so efficient
import dateparser as dp

def timedelta_local(date1, date2):
    # parse both timestamp strings, then express the difference in several units
    timedelta = dp.parse(date2) - dp.parse(date1)
    timediff = {'us': timedelta.microseconds + timedelta.seconds * 1000000 + timedelta.days * 24 * 60 * 60 * 1000000,
                'ms': timedelta.microseconds / 1000 + timedelta.seconds * 1000 + timedelta.days * 24 * 60 * 60 * 1000,
                'sec': timedelta.microseconds / 1000000 + timedelta.seconds + timedelta.days * 24 * 60 * 60,
                'minutes': timedelta.microseconds / 1000000 / 60 + timedelta.seconds / 60 + timedelta.days * 24 * 60}
    return timediff
Thanks in advance
@zvone is correct here. pandas is your best friend for this. Here is some sample code that will hopefully get you on the right track. It assumes your data is in a CSV file with a header line like the one you show in your example. I wasn't sure whether you wanted to keep the time difference as a timedelta object (easy for doing further math with) or just simplify it to a float, so I did both.
import pandas as pd
df = pd.read_csv("test.csv", parse_dates=[0])
# What are the data types after the initial import?
print(f'{df.dtypes}\n\n')
# What are the contents of the data frame?
print(f'{df}\n\n')
# Create a new column that strips away leading and trailing characters
# that surround the data we want
df['Clean Time Stamp'] = df['Time Stamp'].apply(lambda x: x[3:-4])
# Convert to a pandas Timestamp. Use infer_datetime_format for speed.
df['Real Time Stamp'] = pd.to_datetime(df['Clean Time Stamp'], infer_datetime_format=True)
# Calculate time difference between successive rows
df['Delta T'] = df['Real Time Stamp'].diff()
# Convert pandas timedelta to a floating point value in milliseconds.
df['Delta T ms'] = df['Delta T'].dt.total_seconds() * 1000
print(f'{df.dtypes}\n\n')
print(df)
The output looks like this. Note that the printing of the dataframe is wrapping the columns around to another line - this is just an artifact of printing it.
Time Stamp object
Limit A int64
Value A float64
Limit B int64
Value B float64
dtype: object
Time Stamp Limit A Value A Limit B Value B
0 --[ 29.08.2018 16:23:41.052 ] -- 15 3.109 30 2.907
1 --[ 29.08.2018 16:23:41.114 ] -- 15 3.020 30 8.242
Time Stamp object
Limit A int64
Value A float64
Limit B int64
Value B float64
Clean Time Stamp object
Real Time Stamp datetime64[ns]
Delta T timedelta64[ns]
Delta T ms float64
dtype: object
Time Stamp Limit A Value A Limit B Value B \
0 --[ 29.08.2018 16:23:41.052 ] -- 15 3.109 30 2.907
1 --[ 29.08.2018 16:23:41.114 ] -- 15 3.020 30 8.242
Clean Time Stamp Real Time Stamp Delta T \
0 29.08.2018 16:23:41.052 2018-08-29 16:23:41.052 NaT
1 29.08.2018 16:23:41.114 2018-08-29 16:23:41.114 00:00:00.062000
Delta T ms
0 NaN
1 62.0
If your files are large you may gain some efficiency by editing columns in place rather than creating new ones like I did.
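If parsing remains the bottleneck, an explicit format string usually beats inference, because pandas can skip format detection entirely. A sketch assuming the day-first layout shown above:
# same cleaning step as before, plus a strip in case of stray spaces
cleaned = df['Time Stamp'].apply(lambda x: x[3:-4]).str.strip()
# %d.%m.%Y matches '29.08.2018'; %f accepts the 3-digit fractional part
df['Real Time Stamp'] = pd.to_datetime(cleaned, format='%d.%m.%Y %H:%M:%S.%f')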

Pandas dataframe calculations between rows

I'm trying to read a log and compute the duration of a certain workflow. So the dataframe containing the log looks something like this:
Timestamp Workflow Status
20:31:52 ABC Started
...
...
20:32:50 ABC Completed
In order to compute the duration, I am doing using the following code:
start_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Started')]['Timestamp']
compl_time = log_text[(log_text['Workflow']=='ABC') & (log_text['Status']=='Completed')]['Timestamp']
duration = compl_time - start_time
and the answer I get is:
1 NaT
72 NaT
Name: Timestamp, dtype: timedelta64[ns]
I think since the indexes are different, the time difference is not being calculated correctly. Of course, I could get the correct answer by using the index of each row explicitly:
duration = compl_time.loc[72] - start_time.loc[1]
But this seems like an inelegant way of doing things. Is there a better way to accomplish the same?
You are right, the problem is the different indexes: the outputs cannot be aligned, so you get NaNs.
The simplest fix is to convert one output to a NumPy array with .values, but both Series need the same length (here both have length == 1). For selecting with boolean indexing it is better to use loc:
mask = log_text['Workflow']=='ABC'
start_time = log_text.loc[mask & (log_text['Status']=='Started'), 'Timestamp']
compl_time = log_text.loc[mask & (log_text['Status']=='Completed'),'Timestamp']
print(len(start_time))
1
print(len(compl_time))
1
duration = compl_time - start_time.values
print(duration)
1   00:00:58
Name: Timestamp, dtype: timedelta64[ns]
duration = compl_time.values - start_time.values
print(pd.to_timedelta(duration))
TimedeltaIndex(['00:00:58'], dtype='timedelta64[ns]', freq=None)
print(pd.Series(pd.to_timedelta(duration)))
0   00:00:58
dtype: timedelta64[ns]
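An alternative sketch that sidesteps alignment entirely: pull scalar values out of each Series with .iloc (assuming each mask matches exactly one row, as it does here):
duration = compl_time.iloc[0] - start_time.iloc[0]
print(duration)
0 days 00:00:58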
