How do I find the median in a DataFrame column? - python

df['diff']
23:59:01
23:59:13
23:59:17
23:59:27
23:59:52
The hh:mm:ss data is obtained by calculating the difference between sessions via Timedelta.
I converted the times into seconds and found the median that way. How do I find the median in hh:mm:ss format?

The diff column needs to be converted to numerical seconds first:
import pandas as pd

def time2sec(t):
    (h, m, s) = t.split(':')
    return int(h) * 3600 + int(m) * 60 + int(s)

df = pd.DataFrame(['23:59:01', '23:59:13', '23:59:17', '23:59:27', '23:59:52'], columns=['diff'])
df['diff_sec'] = df['diff'].map(time2sec)
print(df)
median = df['diff_sec'].median()
print('median :', median)
       diff  diff_sec
0  23:59:01     86341
1  23:59:13     86353
2  23:59:17     86357
3  23:59:27     86367
4  23:59:52     86392
median : 86357.0
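To present that median back in hh:mm:ss, as the question asks, one small follow-up (a sketch using the standard library):
from datetime import timedelta

# turn the median in seconds back into an hh:mm:ss string
print(str(timedelta(seconds=86357)))  # 23:59:17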

If your data is already in Timedelta format, as you mentioned, you can just call .median() on the series to get its median.
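For example (a minimal sketch, assuming the column already holds Timedelta values):
import pandas as pd

df = pd.DataFrame({'diff': pd.to_timedelta(['23:59:01', '23:59:13', '23:59:17',
                                            '23:59:27', '23:59:52'])})
print(df['diff'].median())  # Timedelta('0 days 23:59:17')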

You can try:
pd.to_timedelta(df['diff']).median()
pd.to_timedelta converts the time strings to Timedelta. Then we can use Series.median() to get the median.
Result:
Timedelta('0 days 23:59:17')

Related

Dataframe - mean of string type column with time values

I have to calculate the mean() of a time column, but the column's type is string. How can I do it?
id time
1 1h:2m
2 1h:58m
3 35m
4 2h
...
You can use regex to extract hours and minutes. To calculate the mean time in minutes:
h = df['time'].str.extract(r'(\d{1,2})h').fillna(0).astype(int)
m = df['time'].str.extract(r'(\d{1,2})m').fillna(0).astype(int)
(h * 60 + m).mean()
Result:
0 83.75
dtype: float64
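If you prefer the mean as a readable duration rather than a float of minutes, one small follow-up sketch:
import pandas as pd

# 83.75 minutes rendered as a timedelta
print(pd.to_timedelta(83.75, unit='m'))  # 0 days 01:23:45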
This is largely inspired by How to construct a timedelta object from a simple string, but you can do it as below:
import re
import pandas as pd
from datetime import timedelta

def convertToSecond(time_str):
    regex = re.compile(r'((?P<hours>\d+?)h)?:*((?P<minutes>\d+?)m)?:*((?P<seconds>\d+?)s)?')
    parts = regex.match(time_str)
    if not parts:
        return
    parts = parts.groupdict()
    time_params = {}
    for (name, param) in parts.items():
        if param:
            time_params[name] = int(param)
    return timedelta(**time_params).total_seconds()

df = pd.DataFrame({'time': ['1h:2m', '1h:58m', '35m', '2h']})
df['inSecond'] = df['time'].apply(convertToSecond)
mean_inSecond = df['inSecond'].mean()
print(f"Mean of Time Column: {timedelta(seconds=mean_inSecond)}")
Result:
Mean of Time Column: 1:23:45
Another possibility is to convert your string column into timedelta (since they don't seem to be times but rather durations?).
Since your strings are not all formatted equally, you unfortunately cannot use pandas' to_timedelta function. However, parser from dateutil has a fuzzy option that you can use to convert your column to datetime. If you subtract midnight today from that, you get the value as a timedelta.
import pandas as pd
from dateutil import parser
from datetime import date
from datetime import datetime
df = pd.DataFrame([[1,'1h:2m'],[2,'1h:58m'],[3,'35m'],[4,'2h']],columns=['id','time'])
today = date.today()
midnight = datetime.combine(today, datetime.min.time())
df['time'] = df['time'].apply(lambda x: (parser.parse(x, fuzzy=True)) - midnight)
This will convert your dataframe like this (print(df)):
id time
0 1 01:02:00
1 2 01:58:00
2 3 00:35:00
3 4 02:00:00
from which you can calculate the mean using print(df['time'].mean()):
0 days 01:23:45
Full example: https://ideone.com/Aze9mR

Timestamp string to seconds in Dataframe

I have a large dataframe containing a Timestamp column like the one shown below:
Timestamp
16T122109960
16T122109965
16T122109970
16T122109975
[73853 rows x 1 columns]
I need to convert this into a seconds-since-first-timestamp column (formatted like 12.523) using something like this:
start_time = log_file['Timestamp'][0]
log_file['Timestamp'] = log_file.Timestamp.apply(lambda x: x - start_time)
But first I need to parse the timestamps into seconds as quickly as possible. I've tried using regex to split each timestamp into hours, minutes, seconds, and milliseconds and then multiplying and dividing appropriately, but I was given a memory error. Is there a function within datetime or dateutil that would help?
The method I am using at the moment is below:
import re

def regex_time(time):
    parts = re.split(r"(\d*)(T)(\d{2})(\d{2})(\d{2})(\d{3})", time)
    date, delim, hours, minutes, seconds, mills = parts[1:-1]
    seconds = int(seconds)
    seconds += int(mills) / 1000
    seconds += int(minutes) * 60
    seconds += int(hours) * 3600
    return seconds

df['Timestamp'] = df.Timestamp.apply(lambda j: regex_time(j))
You could try to convert the timestamp to datetime format and then extract the seconds in the format you want.
Here is a code sample of how it works:
from datetime import datetime
timestamp = 1545730073
dt_object = datetime.fromtimestamp(timestamp)
seconds = dt_object.strftime("%S.%f")
print(seconds)
Output:
53.000000
You can also apply it to the dataframe you are using, for instance:
import pandas as pd
from datetime import datetime

df = pd.DataFrame({'timestamp': [1545730073]})
df['datetime'] = df['timestamp'].apply(lambda x: datetime.fromtimestamp(x))
df['seconds'] = df['datetime'].apply(lambda x: x.strftime("%S.%f"))
And it will return a dataFrame containing:
timestamp datetime seconds
0 1545730073 2018-12-25 10:27:53 53.000000
You could parse the string with strptime, subtract the start_time as a pd.Timestamp, and use the total_seconds() of the resulting timedelta:
import pandas as pd
df = pd.DataFrame({'Timestamp': ['16T122109960','16T122109965','16T122109970','16T122109975']})
start_time = pd.Timestamp('1900-01-01')
df['totalseconds'] = (pd.to_datetime(df['Timestamp'], format='%dT%H%M%S%f')-start_time).dt.total_seconds()
df['totalseconds']
# 0 1340469.960
# 1 1340469.965
# 2 1340469.970
# 3 1340469.975
# Name: totalseconds, dtype: float64
To use the first entry of the 'Timestamp' column as reference time start_time, use
start_time = pd.to_datetime(df['Timestamp'].iloc[0], format='%dT%H%M%S%f')
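Putting it together, a short sketch (assuming the same '%dT%H%M%S%f' layout as above) that yields seconds elapsed since the first sample:
import pandas as pd

df = pd.DataFrame({'Timestamp': ['16T122109960', '16T122109965',
                                 '16T122109970', '16T122109975']})
ts = pd.to_datetime(df['Timestamp'], format='%dT%H%M%S%f')
# elapsed seconds relative to the first row
df['elapsed'] = (ts - ts.iloc[0]).dt.total_seconds()
print(df['elapsed'])
# 0    0.000
# 1    0.005
# 2    0.010
# 3    0.015
# Name: elapsed, dtype: float64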

Accelerometer is sampled at high frequency, how to convert UTC time to float time?

import pandas as pd
df = pd.DataFrame({'Time': ['16:47:55.510', '16:47:55.511', '16:47:55.410']})
df
Output:
Time
0 16:47:55.510
1 16:47:55.511
2 16:47:55.410
How can I convert these time values to float values using Python?
If the time is in h:m:s.ms format, you can do something like this:
hours, minutes, seconds = df.Time[0].split(':')
total_seconds = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
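To convert the whole column at once, a vectorized sketch (assuming the values are durations or times of day in HH:MM:SS.fff):
import pandas as pd

df = pd.DataFrame({'Time': ['16:47:55.510', '16:47:55.511', '16:47:55.410']})
# parse as timedeltas, then take float seconds since midnight
df['seconds'] = pd.to_timedelta(df['Time']).dt.total_seconds()
print(df['seconds'])
# 0    60475.510
# 1    60475.511
# 2    60475.410
# Name: seconds, dtype: float64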

Convert string to timedelta in pandas

I have a series where the timestamp is in the format HHHHH:MM:
timestamp = pd.Series(['34:23', '125:26', '15234:52'], index=index)
I would like to convert it to a timedelta series.
For now I can manage to do it on a single string:
str[:-3]
str[-2:]
timedelta(hours=int(str[:-3]), minutes=int(str[-2:]))
I would like to apply it to the whole series, if possible in a cleaner way. Is there a way to do this?
You can use column-wise Pandas methods:
s = pd.Series(['34:23','125:26','15234:52'])
v = s.str.split(':', expand=True).astype(int)
s = pd.to_timedelta(v[0], unit='h') + pd.to_timedelta(v[1], unit='m')
print(s)
0 1 days 10:23:00
1 5 days 05:26:00
2 634 days 18:52:00
dtype: timedelta64[ns]
As pointed out in the comments, this can also be achieved in one line, albeit less clearly:
s = pd.to_timedelta((s.str.split(':', expand=True).astype(int) * (60, 1)).sum(axis=1), unit='min')
This is how I would do it:
timestamp = pd.Series(['34:23', '125:26', '15234:52'])
# hours*60 + minutes gives total minutes, so convert with unit='m'
x = timestamp.str.split(":").apply(lambda x: int(x[0]) * 60 + int(x[1]))
timestamp = pd.to_timedelta(x, unit='m')
Pass the delta in seconds as an argument to pd.to_timedelta, like this:
In [1]: import pandas as pd
In [2]: ts = pd.Series(['34:23','125:26','15234:52'])
In [3]: secs = 60 * ts.apply(lambda x: 60*int(x[:-3]) + int(x[-2:]))
In [4]: pd.to_timedelta(secs, 's')
Out[4]:
0 1 days 10:23:00
1 5 days 05:26:00
2 634 days 18:52:00
dtype: timedelta64[ns]
Edit: missed erncyp's answer above, which works as well; note that its total is in minutes, so it needs unit='m' in pd.to_timedelta (or multiply by 60 and keep unit='s'), since a raw minute count would otherwise be read as seconds.
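For instance, a quick check (a small sketch using the question's values as total minutes) that the two routes agree:
import pandas as pd

mins = pd.Series([2063, 7526, 914092])  # 34:23, 125:26, 15234:52 as total minutes
print((pd.to_timedelta(mins, unit='m') == pd.to_timedelta(mins * 60, unit='s')).all())  # True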
You can use pandas.Series.apply, i.e.:
from datetime import timedelta

def convert(args):
    return timedelta(hours=int(args[:-3]), minutes=int(args[-2:]))

s = pd.Series(['34:23', '125:26', '15234:52'])
s = s.apply(convert)

Calculate day time differences in a pandas Dataframe

I have the following dataframe:
import pandas as pd

data = [
("10/10/2016","A"),
("10/10/2016","B"),
("09/12/2016","B"),
("09/12/2016","A"),
("08/11/2016","A"),
("08/11/2016","C")]
#Create DataFrame base
df = pd.DataFrame(data, columns=("Time","User"))
# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], '%m/%d/%Y')
Each row represents when a user makes a specific action. I want to compute how frequently (in terms of days) each user makes that specific action.
Let's say user A transacted for the first time on 08/11/2016, then transacted again on 09/12/2016, i.e. around 30 days later. Then he transacted again on 10/10/2016, around 29 days after his second transaction. So his average frequency in days would be (29+30)/2.
What is the most efficient way to do that?
Thanks in advance!
Update
I wrote the following function that computes my desired output.
from datetime import timedelta

def averagetime(a):
    numdeltas = len(a) - 1
    sumdeltas = 0
    i = 1
    while i < len(a):
        delta = abs((a[i] - a[i-1]).days)
        sumdeltas += delta
        i += 1
    if numdeltas > 1:
        avg = sumdeltas / numdeltas
    else:
        avg = 'NaN'
    return avg
It works correctly, for example, when I pass the whole "Time" column:
averagetime(df["Time"])
But it gives me an error when I try to apply it after groupby:
df.groupby('User')['Time'].apply(averagetime)
Any suggestions how I can fix the above?
You can use diff, convert to float days by dividing by np.timedelta64(1, 'D'), and sum the absolute values:
print(averagetime(df["Time"]))
12.0

import numpy as np
su = (df["Time"].diff() / np.timedelta64(1, 'D')).abs().sum()
print(su / (len(df) - 1))
12.0
Then apply it per group; a length check is necessary, because groups with a single row would otherwise raise:
ZeroDivisionError: float division by zero
print(df.groupby('User')['Time']
        .apply(lambda x: np.nan if len(x) == 1
               else (x.diff() / np.timedelta64(1, 'D')).abs().sum() / (len(x) - 1)))
User
A 30.0
B 28.0
C NaN
Name: Time, dtype: float64
Building on jezrael's answer:
If by "how frequently" you mean how much time passes between each user performing the action, then here's an approach:
import pandas as pd
import numpy as np
data = [
("10/10/2016","A"),
("10/10/2016","B"),
("09/12/2016","B"),
("09/12/2016","A"),
("08/11/2016","A"),
("08/11/2016","C"),
]
# Create DataFrame base
df = pd.DataFrame(data, columns=("Time","User"))
# Convert time column to correct format for time calculations
df["Time"] = pd.to_datetime(df["Time"], dayfirst=True)
# Group the DF by min, max and count the number of instances
grouped = (df.groupby("User").agg([np.max, np.min, np.count_nonzero])
           # This step is a bit messy and could be improved,
           # but we need the count as an int
           .assign(counter=lambda x: x["Time"]["count_nonzero"].astype(int))
           # Use apply to calculate the time between first and last,
           # then divide by the count
           .apply(lambda x: (x["Time"]["amax"] - x["Time"]["amin"]) / x["counter"].astype(int),
                  axis=1))
# Output the DF if using an interactive prompt
grouped
Output:
User
A 20 days
B 30 days
C 0 days
