pandas time differences (delta between rows) - python

I have a column with timestamps (strings) which look like the following:
2017-10-25T09:57:00.319Z
2017-10-25T09:59:00.319Z
2017-10-27T11:03:00.319Z
Tbh I do not know the meaning of Z but I guess it is not that important.
How can I convert these strings into proper timestamps so that I can calculate the difference/delta (e.g. in seconds or minutes)?
I want a column listing the delta from one timestamp to the next.

You can use pd.to_datetime() to convert the strings to datetime format. Then get the time difference/delta with .diff(), and finally convert the timedelta to seconds with .dt.total_seconds(), as follows (assuming your column of strings is named Date):
df['Date'] = pd.to_datetime(df['Date'])
df['TimeDelta'] = df['Date'].diff().dt.total_seconds()
Result:
Time delta in seconds:
print(df)
Date TimeDelta
0 2017-10-25 09:57:00.319000+00:00 NaN
1 2017-10-25 09:59:00.319000+00:00 120.0
2 2017-10-27 11:03:00.319000+00:00 176640.0
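A self-contained sketch of the steps above, using the three timestamps from the question (the trailing Z in the input marks UTC; the delta-column names are arbitrary choices):

```python
import pandas as pd

# Sample data matching the timestamps in the question
df = pd.DataFrame({'Date': ['2017-10-25T09:57:00.319Z',
                            '2017-10-25T09:59:00.319Z',
                            '2017-10-27T11:03:00.319Z']})

# Parse the ISO 8601 strings; the "Z" suffix makes them timezone-aware (UTC)
df['Date'] = pd.to_datetime(df['Date'])

# Row-to-row delta; the first row has no predecessor, so it is NaN
df['DeltaSeconds'] = df['Date'].diff().dt.total_seconds()
df['DeltaMinutes'] = df['DeltaSeconds'] / 60
```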

Related

Split Date Time string (not in usual format) and pull out month

I have a dataframe that has a date time string but is not in traditional date time format. I would like to separate out the date from the time into two separate columns. And then eventually also separate out the month.
This is what the date/time string looks like: 2019-03-20T16:55:52.981-06:00
>>> df.head()
Date Score
2019-03-20T16:55:52.981-06:00 10
2019-03-07T06:16:52.174-07:00 9
2019-06-17T04:32:09.749-06:00 1
I tried this but got a type error:
df['Month'] = pd.DatetimeIndex(df['Date']).month
This can be done using pandas itself. You can first convert the Date column to datetime by passing utc=True:
df['Date'] = pd.to_datetime(df['Date'], utc=True)
And then just extract the month using dt.month:
df['Month'] = df['Date'].dt.month
Output:
Date Score Month
0 2019-03-20 22:55:52.981000+00:00 10 3
1 2019-03-07 13:16:52.174000+00:00 9 3
2 2019-06-17 10:32:09.749000+00:00 1 6
From the documentation of pd.to_datetime you can see a parameter:
utc : boolean, default None
Return UTC DatetimeIndex if True (converting any tz-aware datetime.datetime objects as well).
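Since the question also asks to split the date and the time into separate columns, here is a short sketch using the .dt accessors (the column names DateOnly and TimeOnly are arbitrary, not from the original answer):

```python
import pandas as pd

# Frame shaped like the one in the question
df = pd.DataFrame({'Date': ['2019-03-20T16:55:52.981-06:00',
                            '2019-03-07T06:16:52.174-07:00'],
                   'Score': [10, 9]})

# utc=True normalizes the mixed offsets into a single UTC timezone
df['Date'] = pd.to_datetime(df['Date'], utc=True)

# Split into separate date and time columns, plus the month
df['DateOnly'] = df['Date'].dt.date
df['TimeOnly'] = df['Date'].dt.time
df['Month'] = df['Date'].dt.month
```

Note that .dt.date and .dt.time yield Python objects (object dtype), which is fine for display but loses the vectorized datetime operations of the combined column.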

Convert strings with and without time (mixed format) to datetime in a pandas dataframe

When converting a pandas dataframe column from object to datetime using astype function, the behavior is different depending on if the strings have the time component or not. What is the correct way of converting the column?
df = pd.DataFrame({'Date': ['12/07/2013 21:50:00','13/07/2013 00:30:00','15/07/2013','11/07/2013']})
df['Date'] = pd.to_datetime(df['Date'], format="%d/%m/%Y %H:%M:%S", exact=False, dayfirst=True, errors='ignore')
Output:
Date
0 12/07/2013 21:50:00
1 13/07/2013 00:30:00
2 15/07/2013
3 11/07/2013
but the dtype is still object. When doing:
df['Date'] = df['Date'].astype('datetime64')
it becomes of datetime dtype but the day and month are not parsed correctly on rows 0 and 3.
Date
0 2013-12-07 21:50:00
1 2013-07-13 00:30:00
2 2013-07-15 00:00:00
3 2013-11-07 00:00:00
The expected result is:
Date
0 2013-07-12 21:50:00
1 2013-07-13 00:30:00
2 2013-07-15 00:00:00
3 2013-07-11 00:00:00
If we look at the source code: when you pass both format= and dayfirst=, dayfirst= is never read, because passing format= calls a C function (np_datetime_strings.c) that doesn't use dayfirst= to make conversions. On the other hand, if you pass only dayfirst=, it is used to guess the format first, falling back on dateutil.parser.parse for the conversions. So, use only one of them.
In most cases,
df['Date'] = pd.to_datetime(df['Date'])
does the job.
In the specific example in the OP, passing dayfirst=True does the job.
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
That said, passing format= makes the conversion run ~25x faster (see this post for more info), so if your frame is anything larger than 10k rows it's better to pass format=. Since the format here is mixed, one way is to perform the conversion in two steps (the errors='coerce' argument will be useful):
1. convert the datetimes that have a time component;
2. fill in the NaT values (the "coerced" rows) with a Series converted using the other format.
with_time = pd.to_datetime(df['Date'], format='%d/%m/%Y %H:%M:%S', errors='coerce')
df['Date'] = with_time.fillna(pd.to_datetime(df['Date'], format='%d/%m/%Y', errors='coerce'))
Note that both pd.to_datetime calls must parse the original strings, so the first result is stored in a separate variable before df['Date'] is overwritten.
This method (of performing two or more conversions) can be used to convert any column with "weirdly" formatted datetimes.
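On pandas 2.0 or newer (an assumption about your environment), the same result can be had in a single call with format='mixed', which infers the format per element, combined with dayfirst=True:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['12/07/2013 21:50:00', '13/07/2013 00:30:00',
                            '15/07/2013', '11/07/2013']})

# format='mixed' guesses the format row by row; dayfirst=True makes
# 12/07/2013 parse as 12 July rather than 7 December
df['Date'] = pd.to_datetime(df['Date'], format='mixed', dayfirst=True)
```

Per-element inference is slower than a single explicit format=, so for large frames the two-step approach above remains the faster option.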

How can I efficiently extract hours and minutes from a pandas column that has integer values in format HHMM, HMM, MM, and M?

I have a csv file containing a column where each value is an integer representing the hour and minute of the day. The problem is that the values do not all follow the same format. If the time is between 12:00 AM and 12:10 AM, the value is just one digit, the minute. If it is between 12:10 AM and 1:00 AM, the value has two digits, again the minute. If it is between 1:00 AM and 10:00 AM, the value has three digits, the hour and minute. Finally, for all other values (those between 10:00 AM and 12:00 AM), the value has four digits, again the hour and minute.
I tried using the pandas, "to_datetime" function to operate on the whole column.
from pandas import read_csv, to_datetime
url = lambda year: f'ftp://sidads.colorado.edu/pub/DATASETS/NOAA/G00807/IIP_{year}IcebergSeason.csv'
df = read_csv(url(2011))
def convert_float_column_to_int_column(df, *column_names):
    for column_name in column_names:
        try:
            df[column_name] = df[column_name].astype(int)
        except ValueError:
            df = df.dropna(subset=[column_name]).reset_index(drop=True)
            df[column_name] = df[column_name].astype(int)
    return df
df2 = convert_float_column_to_int_column(df, 'ICEBERG_NUMBER', 'SIGHTING_TIME')
df2['SIGHTING_TIME'] = to_datetime(df2['SIGHTING_TIME'].astype(str), format='%H%M')
The result I got was:
ValueError: time data '0' does not match format '%H%M' (match).
Which was as expected.
I'm sure I could work around this problem by iterating through each row, using if statements, and converting each value to a four character string but these files are relatively big so that would be too slow of a solution.
No need for if statements. Series.str.zfill will pad it with the correct number of zeros to get it in the proper format. Then use pd.to_datetime, subtracting off 1900-01-01 which is the date it will use when none of those fields are present:
Input Data
import pandas as pd
df = pd.DataFrame({'Time': [1, 12, 123, 1234]})
# Time
#0 1
#1 12
#2 123
#3 1234
pd.to_datetime
df['Time'] = (pd.to_datetime(df.Time.astype(str).str.zfill(4), format='%H%M')
- pd.to_datetime('1900-01-01'))
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
pd.to_timedelta
Can also be used, but since you cannot specify a format parameter you need to clean everything beforehand:
df['Time'] = df.Time.astype(str).str.zfill(4)
# Pandas .str methods are slow, use a list comprehension to speed it up
#df['Time'] = df.Time.str[0:2] + ':' + df.Time.str[2:4] + ':00'
csize=2
df['Time'] = [':'.join(x[i:i+csize] for i in range(0, len(x), csize))+':00' for x in df.Time.values]
df['Time'] = pd.to_timedelta(df.Time)
#0 00:01:00
#1 00:12:00
#2 01:23:00
#3 12:34:00
#Name: Time, dtype: timedelta64[ns]
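The question title asks for the hours and minutes themselves; a minimal sketch that keeps them as separate integer columns, building on the same zfill idea (the column names Hour and Minute are arbitrary):

```python
import pandas as pd

df = pd.DataFrame({'Time': [1, 12, 123, 1234]})

# Zero-pad every value to a 4-character HHMM string,
# then slice out the hour and minute and cast back to int
padded = df['Time'].astype(str).str.zfill(4)
df['Hour'] = padded.str[:2].astype(int)
df['Minute'] = padded.str[2:].astype(int)
```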

Pandas - Add seconds from a column to datetime in other column

I have a DataFrame with two columns, ["StartDate", "Duration"].
The elements in the StartDate column are of datetime type, and the durations are ints.
Something like:
StartDate Duration
08:16:05 20
07:16:01 20
I expect to get:
EndDate
08:16:25
07:16:21
Simply add the seconds to the hour.
I've been looking at ideas like timedelta types, and I know datetimes support adding timedeltas, but so far I can't find how to do it with DataFrames in a vectorized fashion (although iterating over all the rows and performing the operation might also be possible).
consider this df
StartDate duration
0 01/01/2017 135
1 01/02/2017 235
You can get the datetime column like this
df['EndDate'] = pd.to_datetime(df['StartDate']) + pd.to_timedelta(df['duration'], unit='s')
df.drop(['StartDate', 'duration'], axis=1, inplace=True)
You get
EndDate
0 2017-01-01 00:02:15
1 2017-01-02 00:03:55
EDIT: with the sample dataframe that you posted, where StartDate is a time of day rather than a full date, convert both columns with to_timedelta:
df['EndDate'] = pd.to_timedelta(df['StartDate']) + pd.to_timedelta(df['Duration'], unit='s')
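Putting it together for the time-of-day data from the question, a runnable sketch (assuming StartDate holds HH:MM:SS strings and Duration holds seconds; the final string slicing is just one way to render the result):

```python
import pandas as pd

# Times of day stored as strings, durations in seconds
df = pd.DataFrame({'StartDate': ['08:16:05', '07:16:01'],
                   'Duration': [20, 20]})

# Both operands become timedeltas, so the addition is vectorized
end = pd.to_timedelta(df['StartDate']) + pd.to_timedelta(df['Duration'], unit='s')

# Timedeltas print as "0 days 08:16:25"; keep only the HH:MM:SS part
df['EndDate'] = end.astype(str).str[-8:]
```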

How to group a data frame by a time interval in pandas?

I have a data frame df
Date Mobile_No Amount Time .....
121526 2014-12-24 739637 200.00 9:44:00
121529 2014-12-28 199002 500.00 9:49:44
121531 2014-12-10 813770 100.00 9:50:41
121536 2014-12-09 178795 100.00 9:52:15
121537 2014-12-09 178795 100.00 9:52:24
having Date and Time of types datetime64 and object. I need to group this data frame by 5-minute time intervals and by Mobile_No. In my expected output the last two rows are counted as one (same Mobile_No, and the times fall within the same 5-minute interval).
Is there any way to achieve this?
First I thought to combine Date and Time column and make timestamp and then use it as index and apply pd.TimeGrouper(), but this doesn't seem to work
>>>import datetime as dt
>>>import pandas as pd
...
>>> df.apply(lambda x: dt.datetime.combine(x['Date'], dt.time(x['Time'])), axis=1)
gives the error
'an integer is required', u'occurred at index 121526'
If you are having issues, can you not convert both columns to strings, concatenate them, and pass the format to to_datetime:
df['Time']=df['Time'].astype(str)
df['Date']=df['Date'].astype(str)
df['Timestamp'] = df['Date'] +' ' + df['Time']
df.index = pd.to_datetime(df['Timestamp'], format='%Y-%m-%d %H:%M:%S')
(Note the format uses dashes, since datetime64 values render as e.g. 2014-12-24 when cast to string.) From there you can resample or use pd.Grouper as required.
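A minimal sketch of the full pipeline, using the last two rows from the question (pd.Grouper is the modern replacement for pd.TimeGrouper):

```python
import pandas as pd

df = pd.DataFrame({'Date': ['2014-12-09', '2014-12-09'],
                   'Mobile_No': [178795, 178795],
                   'Amount': [100.0, 100.0],
                   'Time': ['9:52:15', '9:52:24']})

# Combine the two string columns into one timestamp column
df['Timestamp'] = pd.to_datetime(df['Date'] + ' ' + df['Time'])

# Count rows per mobile number within 5-minute buckets
counts = df.groupby(['Mobile_No', pd.Grouper(key='Timestamp', freq='5min')]).size()
```

Both sample rows fall in the 09:50-09:55 bucket, so they are grouped together as the question expects.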
