Pandas; transform column with MM:SS,decimals into number of seconds - python

Hey: I've spent several hours trying to do a quite simple thing, but couldn't figure it out.
I have a dataframe with a column, df['Time'], which contains times starting from 0 and going up to 20 minutes, like this:
1:10,10
1:16,32
3:03,04
The first field is minutes, the second is seconds, and the third is fractional seconds (only two digits, i.e. hundredths of a second).
Is there a way to automatically transform that column into seconds with Pandas, and without making that column the time index of the series?
I already tried the following, but it won't work:
pd.to_datetime(df['Time']).convert('s') # AttributeError: 'Series' object has no attribute 'convert'
If the only way is to parse the time, just point that out and I will prepare a proper / detailed answer to this question; don't waste your time =)
Thank you!

Code:
import pandas as pd
import numpy as np
import datetime
df = pd.DataFrame({'Time':['1:10,10', '1:16,32', '3:03,04']})
df['time'] = df.Time.apply(lambda x: datetime.datetime.strptime(x,'%M:%S,%f'))
df['timedelta'] = df.time - datetime.datetime.strptime('00:00,0','%M:%S,%f')
df['secs'] = df['timedelta'].apply(lambda x: x / np.timedelta64(1, 's'))
print(df)
Output:
      Time                       time       timedelta    secs
0  1:10,10 1900-01-01 00:01:10.100000 00:01:10.100000   70.10
1  1:16,32 1900-01-01 00:01:16.320000 00:01:16.320000   76.32
2  3:03,04 1900-01-01 00:03:03.040000 00:03:03.040000  183.04
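For what it's worth, the same conversion can be done without strptime at all: after swapping the comma for a dot and prepending an hours field, pd.to_timedelta parses the strings directly (a sketch based on the sample data above):

```python
import pandas as pd

df = pd.DataFrame({'Time': ['1:10,10', '1:16,32', '3:03,04']})
# "1:10,10" -> "0:1:10.10" so pandas can parse it as hours:minutes:seconds.fraction
td = pd.to_timedelta('0:' + df['Time'].str.replace(',', '.', regex=False))
df['secs'] = td.dt.total_seconds()
print(df)
```

This avoids the intermediate 1900-01-01 datetime column entirely.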
If you have also negative time deltas:
import pandas as pd
import numpy as np
import datetime
import re
regex = re.compile(r"(?P<minus>-)?((?P<minutes>\d+):)?(?P<seconds>\d+)(,(?P<centiseconds>\d{2}))?")
def parse_time(time_str):
    parts = regex.match(time_str)
    if not parts:
        return None
    parts = parts.groupdict()
    time_params = {}
    for name, param in parts.items():  # .iteritems() on Python 2
        if param and name != 'minus':
            time_params[name] = int(param)
    # timedelta has no "centiseconds" argument, so convert them to milliseconds
    time_params['milliseconds'] = time_params.pop('centiseconds', 0) * 10
    return (-1 if parts['minus'] else 1) * datetime.timedelta(**time_params)
df = pd.DataFrame({'Time':['-1:10,10', '1:16,32', '3:03,04']})
df['timedelta'] = df.Time.apply(parse_time)
df['secs'] = df['timedelta'].apply(lambda x: x / np.timedelta64(1, 's'))
print(df)
Output:
       Time         timedelta    secs
0  -1:10,10  -00:01:10.100000  -70.10
1   1:16,32   00:01:16.320000   76.32
2   3:03,04   00:03:03.040000  183.04

Related

Pandas subtracting number of days from date

I am trying to create a new column "Starting_time" by subtracting 60 days from "Harvest_date", but I get the same date each time. Can someone point out what I did wrong, please?
Harvest_date
20.12.21
12.01.21
10.03.21
import pandas as pd
from datetime import timedelta

df1 = pd.read_csv(r'C:\Flower_weight.csv')

def subtract_days_from_date(date, days):
    subtracted_date = pd.to_datetime(date) - timedelta(days=days)
    subtracted_date = subtracted_date.strftime("%Y-%m-%d")
    return subtracted_date

df1['Harvest_date'] = pd.to_datetime(df1.Harvest_date)
df1.style.format({"Harvest_date": lambda t: t.strftime("%Y-%m-%d")})

for harvest_date in df1['Harvest_date']:
    df1["Starting_date"] = subtract_days_from_date(harvest_date, 60)

print(df1["Starting_date"])
Starting_date
2021-10-05
2021-10-05
2021-10-05
I am not sure the loop was necessary here. Perhaps try the following:
df1['Starting_date'] = df1['Harvest_date'].apply(lambda x: pd.to_datetime(x) - timedelta(days=60))
df1['Starting_date'] = df1['Starting_date'].dt.strftime("%Y-%m-%d")
df1['Starting_date']
You're overwriting the series on each iteration of the last loop:
for harvest_date in df1['Harvest_date']:
    df1["Starting_date"] = subtract_days_from_date(harvest_date, 60)
You can do away with the loop by vectorizing the subtract_days_from_date function.
You could also reference an index with enumerate
np.vectorize
import numpy as np
subtract_days_from_date = np.vectorize(subtract_days_from_date)
df1["Starting_date"] = subtract_days_from_date(df1["Harvest_date"], 60)
enumerate
for idx, harvest_date in enumerate(df1['Harvest_date']):
    # .iloc[idx][...] would assign to a temporary copy; use .loc to write back
    df1.loc[idx, "Starting_date"] = subtract_days_from_date(harvest_date, 60)
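For completeness: since pandas subtracts a Timedelta element-wise, the helper and the loop can both be dropped entirely (a sketch, assuming day-first dates like those shown in the question):

```python
import pandas as pd

df1 = pd.DataFrame({'Harvest_date': ['20.12.21', '12.01.21', '10.03.21']})
# parse day-first DD.MM.YY strings, then shift the whole column at once
harvest = pd.to_datetime(df1['Harvest_date'], format='%d.%m.%y')
df1['Starting_date'] = (harvest - pd.Timedelta(days=60)).dt.strftime('%Y-%m-%d')
print(df1)
```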

Measuring elapsed time in Pandas

I'm trying to make a simple analysis of my sport activities where I have elapsed time in the string format like this:
00:22:05
00:30:34
00:30:31
00:37:19
00:28:43
00:22:08
I've tried converting it to the pandas datetime type, but I'm only interested in the time part of my activities, so that I can, for instance, calculate the mean, or how much I paused during the whole run.
I tried this code, but it doesn't resolve my issue:
df_test['Elapsed time'] = pd.to_datetime(df_test['Elapsed time'], format = '%H:%M:%S')
Any ideas how I can make that work? I've been trying to find answers but nothing helps. And I'm still new to Pandas. Thanks in advance.
Welcome to StackOverflow. I think the question you are looking to answer is how to convert the time string to a datetime format without the date portion. Doing so requires only a minor modification to your code.
pd.to_datetime(df['Elapsed Time'], format = '%H:%M:%S').dt.time
Complete code:
import pandas as pd
data_dict = { 'Elapsed Time': ['00:22:05', '00:30:34', '00:30:31', '00:37:19', '00:28:43', '00:22:08'] }
df = pd.DataFrame.from_dict(data_dict)
df['Formatted Time'] = pd.to_datetime(df['Elapsed Time'], format = '%H:%M:%S').dt.time
type(df['Elapsed Time'][0]) # 'str'
type(df['Formatted Time'][0]) # 'datetime.time'
Computing with Time
In order to perform analysis of the data you'll need to convert the time value to something useful, such as seconds. Here I'll present two methods of doing that.
The first method performs manual calculations using the original time string.
def total_seconds_in_time_string(time_string):
    segments = time_string.strip().split(':')
    # segments: ['HH', 'MM', 'SS']
    # total seconds = (((HH * 60) + MM) * 60) + SS
    return (((int(segments[0]) * 60) + int(segments[1])) * 60) + int(segments[2])
df['Total Seconds'] = df['Elapsed Time'].apply(lambda x: total_seconds_in_time_string(x))
type(df['Total Seconds'][0]) # 'numpy.int64'
df['Total Seconds'].mean() # 1713.3333333333333
def seconds_to_timestring(secs):
    import time
    time_secs = time.gmtime(round(secs))
    return time.strftime('%H:%M:%S', time_secs)
avg_time_str = seconds_to_timestring(df['Total Seconds'].mean())
print(avg_time_str) # '00:28:33'
The second method would be the more Pythonic solution using the datetime library.
def total_seconds_in_time(t):
    from datetime import timedelta
    return timedelta(hours=t.hour, minutes=t.minute, seconds=t.second) / timedelta(seconds=1)
df['TimeDelta Seconds'] = df['Formatted Time'].apply(lambda x: total_seconds_in_time(x))
type(df['TimeDelta Seconds'][0]) # 'numpy.float64'
df['TimeDelta Seconds'].mean() # 1713.3333333333333
def seconds_to_timedelta(secs):
    from datetime import timedelta
    return timedelta(seconds=round(secs))
mean_avg = seconds_to_timedelta(df['TimeDelta Seconds'].mean())
print(mean_avg) # '0:28:33'
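A third option, not shown above but worth knowing: pd.to_timedelta parses 'HH:MM:SS' strings directly, and a Series of timedeltas can be averaged without any manual conversion (a sketch using the same sample data):

```python
import pandas as pd

times = ['00:22:05', '00:30:34', '00:30:31', '00:37:19', '00:28:43', '00:22:08']
td = pd.to_timedelta(pd.Series(times))
mean_secs = td.dt.total_seconds().mean()
print(mean_secs)  # mean in seconds
print(td.mean())  # the same mean, as a Timedelta
```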

Timestamp string to seconds in Dataframe

I have a large dataframe containing a Timestamp column like the one shown below:
Timestamp
16T122109960
16T122109965
16T122109970
16T122109975
[73853 rows x 1 columns]
I need to convert this into seconds since the first timestamp (formatted like 12.523), using something like this:
start_time = log_file['Timestamp'][0]
log_file['Timestamp'] = log_file.Timestamp.apply(lambda x: x - start_time)
But first I need to parse the timestamps into seconds as quickly as possible. I've tried using a regex to split each timestamp into hours, minutes, seconds, and milliseconds and then multiplying and dividing appropriately, but I got a memory error. Is there a function within datetime or dateutil that would help?
The method I have at the moment is below:
def regex_time(time):
    # re.split returns ['', group1, ..., group6, ''] for a full match
    parts = re.split(r"(\d*)(T)(\d{2})(\d{2})(\d{2})(\d{3})", time)
    date, delim, hours, minutes, seconds, mills = parts[1:-1]
    seconds = int(seconds)
    seconds += int(mills) / 1000
    seconds += int(minutes) * 60
    seconds += int(hours) * 3600
    return seconds
df['Timestamp'] = df.Timestamp.apply(lambda j: regex_time(j))
You could try converting the timestamp to datetime format and then extracting the seconds in the format you want.
Here is a code sample of how it works:
from datetime import datetime
timestamp = 1545730073
dt_object = datetime.fromtimestamp(timestamp)
seconds = dt_object.strftime("%S.%f")
print(seconds)
Output:
53.000000
You can also apply it to the dataframe you are using, for instance:
from datetime import datetime
df = pd.DataFrame({'timestamp':[1545730073]})
df['datetime'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x))
df['seconds'] = df['datetime'].apply(lambda x: x.strftime("%S.%f"))
And it will return a dataFrame containing:
    timestamp            datetime    seconds
0  1545730073 2018-12-25 10:27:53  53.000000
You could parse the string with strptime, subtract the start_time as a pd.Timestamp, and use the total_seconds() of the resulting timedelta:
import pandas as pd
df = pd.DataFrame({'Timestamp': ['16T122109960','16T122109965','16T122109970','16T122109975']})
start_time = pd.Timestamp('1900-01-01')
df['totalseconds'] = (pd.to_datetime(df['Timestamp'], format='%dT%H%M%S%f')-start_time).dt.total_seconds()
df['totalseconds']
# 0 1340469.960
# 1 1340469.965
# 2 1340469.970
# 3 1340469.975
# Name: totalseconds, dtype: float64
To use the first entry of the 'Timestamp' column as reference time start_time, use
start_time = pd.to_datetime(df['Timestamp'].iloc[0], format='%dT%H%M%S%f')
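Putting both pieces together, the relative seconds the question asks for (seconds since the first row) can be sketched as:

```python
import pandas as pd

df = pd.DataFrame({'Timestamp': ['16T122109960', '16T122109965',
                                 '16T122109970', '16T122109975']})
ts = pd.to_datetime(df['Timestamp'], format='%dT%H%M%S%f')
# subtract the first timestamp and express the differences in seconds
df['secs'] = (ts - ts.iloc[0]).dt.total_seconds()
print(df)
```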

Pandas Python to count minutes between times

I'm trying to use pandas/Python to load a dataframe and count outage minutes that occur between 09:00 and 21:00. I've been trying to get this per site, but have only been able to get a sum value. An example dataframe is below; I'm trying to produce the data in the third column:
import pandas as pd
from pandas import Timestamp
import pytz
from pytz import all_timezones
import datetime
from datetime import time
from threading import Timer
import time as t
import xlrd
import xlwt
import numpy as np
import xlsxwriter
data = pd.read_excel('lab.xlsx')
data['outage'] = data['Up'] - data['Down']
data['outage'] = data['outage'] / np.timedelta64(1, 'm')
s = data.apply(lambda row: pd.date_range(row['Down'], row['Up'], freq='T'), axis=1).explode()

# returns the total amount of downtime between 9-21, but not by site
total = s.dt.time.between(time(9), time(21)).sum()

# range of index[0] for s
slist = range(0, 20)

# due to the way this loop iterates, it returns the number of minutes between down and up
for num in slist:
    Duration = s[num].count()
    print(Duration)

# percentage of minutes during business hours
percentage = (total / sum(data['duration'])) * 100
print('The percentage of outage minutes during business hours is:', percentage)

# secondary function to test
def by_month():
    s = data.apply(lambda row: pd.date_range(row['Adjusted_Down'], row['Adjusted_Up'], freq='T'), axis=1).explode()
    downtime = pd.DataFrame({
        'Month': s.astype('datetime64[M]'),
        'IsDayTime': s.dt.time.between(time(9), time(21))
    })
    downtime.groupby('Month')['IsDayTime'].sum()

# data.to_excel('delete.xls', 'a+')
You can use pandas' TimedeltaIndex to break the difference between your down time and up time into hour and minute components. Then you can multiply the hours by 60 and add the minutes to get your total down time in minutes. See the example below:
import datetime as dt
import pandas as pd

date_format = "%m-%d-%Y %H:%M:%S"

# Example up and down times to insert into the dataframe
down1 = dt.datetime.strptime('8-01-2019 00:00:00', date_format)
up1 = dt.datetime.strptime('8-01-2019 00:20:00', date_format)
down2 = dt.datetime.strptime('8-01-2019 02:26:45', date_format)
up2 = dt.datetime.strptime('8-01-2019 03:45:04', date_format)
down3 = dt.datetime.strptime('8-01-2019 06:04:00', date_format)
up3 = dt.datetime.strptime('8-01-2019 06:06:34', date_format)
time_df = pd.DataFrame([{'down': down1, 'up': up1},
                        {'down': down2, 'up': up2},
                        {'down': down3, 'up': up3}])

# Subtract the down column from the up column; the differences form a TimedeltaIndex
down_time = pd.TimedeltaIndex(time_df['up'] - time_df['down'])

# Convert the hours to minutes and add the minutes to get down time in whole minutes
down_time_min = down_time.components.hours * 60 + down_time.components.minutes

# Apply the above result to a new dataframe column
time_df['down_time'] = down_time_min
time_df
This is the result for this example:
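As for the per-site count the original question asks about, one simple (if not the fastest) sketch is to expand each outage into per-minute timestamps, as the asker already does, and count the ones inside the window row by row. The Down/Up sample values here are made up; inclusive='left' needs pandas >= 1.4 (older versions spell it closed='left'):

```python
import pandas as pd
from datetime import time

df = pd.DataFrame({
    'Down': pd.to_datetime(['2019-08-01 08:50', '2019-08-01 20:55']),
    'Up':   pd.to_datetime(['2019-08-01 09:10', '2019-08-01 21:05']),
})

def business_minutes(row):
    # one timestamp per outage minute, end-exclusive so minutes aren't double-counted
    minutes = pd.date_range(row['Down'], row['Up'], freq='min', inclusive='left')
    return sum(time(9) <= t.time() < time(21) for t in minutes)

df['biz_min'] = df.apply(business_minutes, axis=1)
print(df)
```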

How to change year value in numpy datetime64?

I have a pandas DataFrame with dtype=numpy.datetime64
In the data I want to change
'2011-11-14T00:00:00.000000000'
to:
'2010-11-14T00:00:00.000000000'
or another year. The timedelta is not known; only the year number to assign.
This displays the year as an int:
Dates_profit.iloc[50][stock].astype('datetime64[Y]').astype(int)+1970
but I can't assign a value to it.
Does anyone know how to assign a year to a numpy.datetime64?
Since you're using a DataFrame, consider using pandas.Timestamp.replace:
In [1]: import pandas as pd
In [2]: dates = pd.DatetimeIndex([f'200{i}-0{i+1}-0{i+1}' for i in range(5)])
In [3]: df = pd.DataFrame({'Date': dates})
In [4]: df
Out[4]:
Date
0 2000-01-01
1 2001-02-02
2 2002-03-03
3 2003-04-04
4 2004-05-05
In [5]: df.loc[:, 'Date'] = df['Date'].apply(lambda x: x.replace(year=1999))
In [6]: df
Out[6]:
Date
0 1999-01-01
1 1999-02-02
2 1999-03-03
3 1999-04-04
4 1999-05-05
numpy.datetime64 objects are hard to work with. To update a value, it is normally easier to convert the date to a standard Python datetime object, do the change and then convert it back to a numpy.datetime64 value again:
import numpy as np
from datetime import datetime
dt64 = np.datetime64('2011-11-14T00:00:00.000000000')
# convert to timestamp:
ts = (dt64 - np.datetime64('1970-01-01T00:00:00Z')) / np.timedelta64(1, 's')
# standard utctime from timestamp
dt = datetime.utcfromtimestamp(ts)
# get the new updated year
dt = dt.replace(year=2010)
# convert back to numpy.datetime64:
dt64 = np.datetime64(dt)
There might be simpler ways, but this works, at least.
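If the values are in a pandas column anyway, a vectorised alternative (a sketch, not from the answer above; note that, like replace, it will raise for Feb 29 pushed into a non-leap year) is to rebuild the dates from their components with pd.to_datetime:

```python
import pandas as pd

dates = pd.to_datetime(['2011-11-14', '2012-06-01'])
# reassemble each date with the year forced to 2010, keeping month and day
new = pd.to_datetime(pd.DataFrame({
    'year': 2010,
    'month': dates.month,
    'day': dates.day,
}))
print(new)
```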
This vectorised solution gives the same result as using pandas to iterate over the values with x.replace(year=n), but on large arrays it is at least 10x faster.
It is important to remember that the year the datetime64 object is moved to should be a leap year. Using the Python datetime library, datetime(2012, 2, 29).replace(year=2011) crashes; here, the function replace_year will simply move 2012-02-29 to 2011-03-01.
I'm using numpy v1.13.1.
import numpy as np
import pandas as pd

def replace_year(x, year):
    """Year must be a leap year for this to work."""
    # Add the number of days x is from JAN-01 to year-01-01
    x_year = np.datetime64(str(year) + '-01-01') + (x - x.astype('M8[Y]'))
    # Due to leap years, calculate an offset of 1 day for days in a non-leap year
    yr_mn = x.astype('M8[Y]') + np.timedelta64(59, 'D')
    leap_day_offset = (yr_mn.astype('M8[M]') - yr_mn.astype('M8[Y]') - 1).astype(int)
    # However, for days in non-leap years prior to March-01,
    # correct the previous step by removing an extra day
    non_leap_yr_beforeMarch1 = (x.astype('M8[D]') - x.astype('M8[Y]')).astype(int) < 59
    non_leap_yr_beforeMarch1 = np.logical_and(non_leap_yr_beforeMarch1, leap_day_offset).astype(int)
    day_offset = np.datetime64('1970') - (leap_day_offset - non_leap_yr_beforeMarch1).astype('M8[D]')
    # Finally, apply the day offset
    x_year = x_year - day_offset
    return x_year
x = np.arange('2012-01-01', '2014-01-01', dtype='datetime64[h]')
x_datetime = pd.to_datetime(x)
x_year = replace_year(x, 1992)
x_datetime = x_datetime.map(lambda x: x.replace(year=1992))
print(x)
print(x_year)
print(x_datetime)
print(np.all(x_datetime.values == x_year))
