Pandas: convert a one-element Series to a datetime object - python

I have a data frame with many columns and rows; the index column contains datetime objects.
date_time column1 column2
10-10-2010 00:00:00 1 10
10-10-2010 00:00:03 1 10
10-10-2010 00:00:06 1 10
Now I want to calculate the difference in time between the first and last datetime object. Therefore:
start = df["date_time"].head(1)
stop = df["date_time"].tail(1)
However, I now want to extract the datetime value itself, so that I can use the .total_seconds() method to calculate the number of seconds between the two datetime objects, something like:
delta_t_seconds = (start - stop).total_seconds()
This however doesn't give the desired result, since start and stop are still Series with a single element each.
Please help.
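For what it's worth, the usual fix is `.iloc`, which returns scalar Timestamps rather than one-row Series (a sketch on made-up data mirroring the example):

```python
import pandas as pd

# made-up frame mirroring the question's data
df = pd.DataFrame({"date_time": pd.to_datetime(
    ["2010-10-10 00:00:00", "2010-10-10 00:00:03", "2010-10-10 00:00:06"])})

# .iloc gives scalar Timestamps instead of one-row Series
start = df["date_time"].iloc[0]
stop = df["date_time"].iloc[-1]

# Timestamp - Timestamp -> Timedelta, which has .total_seconds()
delta_t_seconds = (stop - start).total_seconds()
```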

Related

Python Datetime conversion for excel dataframe

Hello,
I am trying to extract the date and time column from my Excel data. I get the column as a DataFrame with float values; after using pandas.to_datetime I get a different date than the actual date in Excel. For example, in Excel the starting date is 01.01.1901 00:00:00, but in Python I get 1971-01-03 00:00:00.000000.
How can I solve this problem?
I need the final output as a DataFrame of total seconds: the first cell starts at 0 sec and each following cell steps forward (the time difference between cells is 15 min).
Thank you.
Your input is fractional days, so there's actually no need to convert to datetime if you want the duration in seconds relative to the first entry. Subtract that from the rest of the column and multiply by the number of seconds in a day:
import pandas as pd
df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})
df["totalseconds"] = (df["Datum/Zeit"] - df["Datum/Zeit"].iloc[0]) * 86400
df["totalseconds"]
0 0.0000
1 900.0288
2 1799.9712
Name: totalseconds, dtype: float64
If you have to use datetime, you'll need to convert to timedelta (duration) to do the same, e.g. like
df["datetime"] = pd.to_datetime(df["Datum/Zeit"], unit="d")
# df["datetime"]
# 0 1971-01-03 00:00:00.000000
# 1 1971-01-03 00:15:00.028800
# 2 1971-01-03 00:29:59.971200
# Name: datetime, dtype: datetime64[ns]
# subtraction of datetime from datetime gives timedelta, which has total_seconds:
df["totalseconds"] = (df["datetime"] - df["datetime"].iloc[0]).dt.total_seconds()
# df["totalseconds"]
# 0 0.0000
# 1 900.0288
# 2 1799.9712
# Name: totalseconds, dtype: float64
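As an aside on the 1971-vs-1901 discrepancy: `unit="d"` counts from the Unix epoch (1970-01-01), while Excel's Windows 1900 date system counts serial days from a nominal 1899-12-30. If the real calendar dates matter, the `origin` keyword of `pd.to_datetime` can re-anchor the conversion (a sketch, assuming the data really came from that date system):

```python
import pandas as pd

df = pd.DataFrame({"Datum/Zeit": [367.0, 367.010417, 367.020833]})

# re-anchor the serial day numbers at Excel's 1899-12-30 epoch
dt = pd.to_datetime(df["Datum/Zeit"], unit="d", origin="1899-12-30")
```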

Convert HH:MM:SS AM/PM OBJECT column to SECONDS

I have an object-dtype column with time in the format HH:MM:SS AM/PM. The output I need is a column with this time converted to seconds.
For example:
import pandas as pd
df={'time_col':['10:10:10 PM','02:00:05 AM'],'time_seconds':[72610,7205]}
df2=pd.DataFrame(df)
I tried different ways; however, it adds 1900-01-01 to some rows and not to others.
Convert the time string to datetime (to account for AM/PM), take the string of the time component (ignoring the date), and convert that to timedelta. Then you can extract the seconds.
df = pd.DataFrame({'time_col':['10:10:10 PM','02:00:05 AM']})
# make sure we have time objects
df['time_col'] = pd.to_datetime(df['time_col']).dt.time
# time column to string, then to timedelta and extract seconds from that
df['time_seconds'] = pd.to_timedelta(df['time_col'].astype(str)).dt.total_seconds()
df['time_seconds']
0 79810.0
1 7205.0
Name: time_seconds, dtype: float64
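An alternative sketch that avoids the string round-trip: parse with an explicit 12-hour format and assemble the seconds from the components (the format string is an assumption about the input):

```python
import pandas as pd

df = pd.DataFrame({'time_col': ['10:10:10 PM', '02:00:05 AM']})

# parse with an explicit 12-hour clock format ...
dt = pd.to_datetime(df['time_col'], format='%I:%M:%S %p')
# ... then assemble seconds-since-midnight from the components
df['time_seconds'] = dt.dt.hour * 3600 + dt.dt.minute * 60 + dt.dt.second
```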
If you can fire up a PySpark session, this could also work and supplement @MrFuppes's answer:
from pyspark.sql import functions as F
df1 = spark.createDataFrame(df2)
df1.select("time_col", F.unix_timestamp(F.to_timestamp("time_col", "hh:mm:ss a")).cast("long").alias("time")).show()
+-----------+-----+
| time_col| time|
+-----------+-----+
|10:10:10 PM|79810|
|02:00:05 AM| 7205|
+-----------+-----+

Drop certain character in Object before converting to Datetime column in Pandas

My dataframe has a column which measures time difference in the format HH:MM:SS.000.
The dataframe is formed from an Excel file; the column which stores the time difference is an object. However, some entries have a negative time difference. The negative sign doesn't matter to me and needs to be removed, since it breaks a filter condition I have.
Note: I only have the negative time differences there because of the issue I'm currently having.
I've tried a few functions, but I get errors, since some of the time-difference data is just 00:00:00, some is 00:00:02.65, and some is 00:00:02.111.
Firstly, how would I ensure that all data in this column matches 00:00:00.000? And then, how would I remove the '-' from some of the data?
Here's a sample of the time-diff column. I can't transform it into datetime, as some of the entries don't have 3 digits after the decimal (and some use ':' instead of '.' before the fraction). Is there a way to iterate through the column and add a 0 if the length of the value isn't equal to 12 characters?
00:00:02.97
00:00:03:145
00:00:00
00:00:12:56
28 days 03:05:23.439
It looks like you need to clean your input before you can parse to timedelta, e.g. with the following function:
import pandas as pd
def clean_td_string(s):
    # a third ':' marks a misplaced fraction separator -> replace it with '.'
    if s.count(':') > 2:
        return '.'.join(s.rsplit(':', 1))
    return s
Applied to a df's column, this looks like
df = pd.DataFrame({'Time Diff': ['00:00:02.97', '00:00:03:145', '00:00:00', '00:00:12:56', '28 days 03:05:23.439']})
df['Time Diff'] = pd.to_timedelta(df['Time Diff'].apply(clean_td_string))
# df['Time Diff']
# 0 0 days 00:00:02.970000
# 1 0 days 00:00:03.145000
# 2 0 days 00:00:00
# 3 0 days 00:00:12.560000
# 4 28 days 03:05:23.439000
# Name: Time Diff, dtype: timedelta64[ns]
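A vectorized alternative to the helper function, using a regular expression to rewrite the stray third ':' as a '.' (the pattern assumes a fraction only ever follows a full HH:MM:SS prefix):

```python
import pandas as pd

df = pd.DataFrame({'Time Diff': ['00:00:02.97', '00:00:03:145', '00:00:00',
                                 '00:00:12:56', '28 days 03:05:23.439']})

# rewrite a misplaced third ':' (the fraction separator) as '.';
# strings that don't match the pattern pass through unchanged
cleaned = df['Time Diff'].str.replace(r'^(\d{2}:\d{2}:\d{2}):(\d+)$',
                                      r'\1.\2', regex=True)
td = pd.to_timedelta(cleaned)
```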

Pandas Dataframe Time column has float values

I am doing a cleanup of my database. In one of the tables, the time column has values like 0.013391204, and I am unable to convert this to time in [mm:ss] format. Is there a function to convert this to the required [mm:ss] format?
The head for the column
0 20:00
1 0.013391204
2 0.013333333
3 0.012708333
4 0.012280093
Use the below reproducible data:
import pandas as pd
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333", "0.012708333", "0.012280093"]})
I expect the output to be like the first row of the column values shown above.
What is the correct time interpretation for, say, the first float entry, 0.013391204? Is it 48 seconds?
Because if we use the datetime module, we can convert the float into a time format:
Updating answer to add the new information
import datetime
datetime.timedelta(days = 0.013391204)
str(datetime.timedelta(days = 0.013391204))
Output:'0:19:17.000026'
Hope this helps :))
First convert the values with to_numeric using errors='coerce' to turn non-floats into missing values, then fill those back with the original strings prefixed with '00:' for the hours, and finally convert with to_timedelta using unit='d':
df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333",
"0.012708333", "0.012280093"]})
s = pd.to_numeric(df['time'], errors='coerce').fillna(df['time'].radd('00:'))
df['new'] = pd.to_timedelta(s, unit='d')
print (df)
time new
0 20:00 00:20:00
1 0.013391204 00:19:17.000025
2 0.013333333 00:19:11.999971
3 0.012708333 00:18:17.999971
4 0.012280093 00:17:41.000035
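If the [mm:ss] strings the question asks for are actually required, the fractional days can also be formatted directly (a sketch; the `mmss` column name and the helper are made up):

```python
import pandas as pd

df = pd.DataFrame({"time": ["20:00", "0.013391204", "0.013333333",
                            "0.012708333", "0.012280093"]})
s = pd.to_numeric(df["time"], errors="coerce")  # floats; NaN for "20:00"

def to_mmss(frac_days):
    # fractional day -> total seconds -> "mm:ss"
    total = round(frac_days * 86400)
    return f"{total // 60:02d}:{total % 60:02d}"

# format the float rows; keep rows that are already "mm:ss" strings
df["mmss"] = [to_mmss(v) if pd.notna(v) else raw
              for v, raw in zip(s, df["time"])]
```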

Working with datetime in Python

I have a file that has the following format:
20150426010203 name1
20150426010303 name2
20150426010307 name3
20150426010409 name1
20150426010503 name4
20150426010510 name1
I am interested in finding time differences between appearances of name1 in the list and then calculating the frequency of such appearances (for example, delta time = 1 s appeared 20 times, delta time = 30 s appeared 1 time, etc.). The second problem is how to find the number of events per minute/hour/day.
I found all time differences by using
pd.to_datetime(pd.Series([time]))
to convert each string to datetime format and placed all values in list named 'times'. Then I iterated through the list:
new=[x - times[i - 1] for i, x in enumerate(times)][1:]
and the resulting list was something like this:
dtype: timedelta64[ns], 0 00:00:50
dtype: timedelta64[ns], 0 00:00:10
dtype: timedelta64[ns], 0 00:00:51
dtype: timedelta64[ns], 0 00:00:09
dtype: timedelta64[ns], 0 00:00:50
dtype: timedelta64[ns], 0 00:00:11
Any further attempt to calculate the frequency results in a "TypeError: 'Series' objects are mutable, thus they cannot be hashed" error. And I am not sure where to look for how to calculate the number of events per minute or any other time unit.
Obviously, I don't have a lot of experience with datetime in Python, so any pointers would be appreciated.
Use resample and sum to get the number of events per time period - examples below
I gather you want the intervals for individuals (name1: 1st to 2nd event interval; and then his/her 2nd to 3rd event interval). You will need to group by name and then difference the times for each group. In your dataset, only name1 has more than one event, and two events are necessary for a person-centric interval.
Quick and dirty ...
# --- get your data into a DataFrame so I can play with it ...
# first, put the data in a multi-line string (I would read it from a file
# if I had it in a file - but for my purposes a string will do).
data = """
time name
20150426010203 name1
20150426010303 name2
20150426010307 name3
20150426010409 name1
20150426010503 name4
20150426010510 name1"""
# second, I will use StringIO and pandas.read_csv to pretend I am
# reading it from a file.
import pandas as pd
from io import StringIO  # Python 2: from StringIO import StringIO
df = pd.read_csv(StringIO(data), header=0, index_col=0, sep=r'\s+')
# third, because pandas did not recognise the date-time format
# of the column I made the index, I will force the string to be
# converted to a pandas Timestamp come DatetimeIndex.
df.index = pd.to_datetime(df.index, format='%Y%m%d%H%M%S')
# number of events per minute
df['event'] = 1  # we will sum this to get events per time-period
dfepm = df['event'].resample('1min').sum()
# number of events per hour
dfeph = df['event'].resample('1h').sum()
# time differences by name
del df['event'] # we don't need this anymore
df['time'] = df.index
df['time_diff_by_name'] = df.groupby('name')['time'].diff()
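The grouped differences can then be fed straight into value_counts to get the frequency table the question asks for (a self-contained sketch of the whole pipeline):

```python
import io
import pandas as pd

data = """time name
20150426010203 name1
20150426010303 name2
20150426010307 name3
20150426010409 name1
20150426010503 name4
20150426010510 name1"""

df = pd.read_csv(io.StringIO(data), sep=r'\s+', dtype={'time': str})
df['time'] = pd.to_datetime(df['time'], format='%Y%m%d%H%M%S')

# interval between consecutive events of the same name
intervals = df.groupby('name')['time'].diff().dropna()
# how often each interval length occurs
counts = intervals.value_counts()
```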
