I have a dataframe like the one below:
cust_id,purchase_date
1,10/01/1998
1,10/12/1999
2,13/05/2016
3,14/02/2018
3,15/03/2019
I would like to display the output as text, such as 5 years and 9 months, instead of a decimal such as 5.93244.
I tried the following:
from datetime import timedelta
import numpy as np
import pandas as pd
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
gb = df.groupby(['cust_id'])
df_cust_age = gb['purchase_date'].agg(min_date=np.min, max_date=np.max).reset_index()
df_cust_age['diff_in_days'] = df_cust_age['max_date'] - df_cust_age['min_date']
df_cust_age['years_diff'] = df_cust_age['diff_in_days']/timedelta(days=365)
but the above code gives the output as decimal numbers.
I expect my output to look like this:
cust_id,years_diff
1, 1 years and 11 months and 0 day
2, 0 years
3, 1 year and 1 month and 1 day
If approximating a month as a 'default' 30 days is acceptable, use this custom function:
#https://stackoverflow.com/a/13756038/2901002
def td_format(td_object):
    seconds = int(td_object.total_seconds())
    periods = [
        ('year', 60*60*24*365),
        ('month', 60*60*24*30),
        ('day', 60*60*24),
        ('hour', 60*60),
        ('minute', 60),
        ('second', 1)
    ]
    strings = []
    for period_name, period_seconds in periods:
        # >= so that an exact multiple (e.g. exactly 1 day) is still reported
        if seconds >= period_seconds:
            period_value, seconds = divmod(seconds, period_seconds)
            has_s = 's' if period_value > 1 else ''
            strings.append("%s %s%s" % (period_value, period_name, has_s))
    return ", ".join(strings) if len(strings) > 0 else '0 year'
df_cust_age['years_diff'] = df_cust_age['diff_in_days'].apply(td_format)
print (df_cust_age)
cust_id min_date max_date diff_in_days years_diff
0 1 1998-10-01 1999-10-12 376 days 1 year, 11 days
1 2 2016-05-13 2016-05-13 0 days 0 year
2 3 2018-02-14 2019-03-15 394 days 1 year, 29 days
from io import StringIO
import numpy as np
import pandas as pd
from dateutil.relativedelta import relativedelta as RD
string_data = '''unique_key,purchase_date
1,10/01/1998
1,10/12/1999
2,13/05/2016
3,14/02/2018
3,15/03/2019'''
## Custom functions
diff_obj = lambda d1,d2:RD(d1, d2) if d1>d2 else RD(d2, d1)
date_tuple = lambda diff:(diff.years,diff.months,diff.days)
pipeline = lambda row:date_tuple(diff_obj(row['min_date'],row['max_date']))
def string_format(date_tuple):
    final_string = []
    for val,name in zip(date_tuple,['years','months','day']):
        if val:
            final_string.append(f'{val} {name}')
    return ' and '.join(final_string) if final_string else '0 years'
## Custom functions
df = pd.read_csv(StringIO(string_data))
df['purchase_date'] = pd.to_datetime(df['purchase_date'],format='%d/%m/%Y')
gb = df.groupby(['unique_key'])
df_cust_age = gb['purchase_date'].agg(min_date=np.min, max_date=np.max).reset_index()
df_cust_age['years_diff'] = df_cust_age.apply(pipeline,axis=1).apply(string_format)
print(df_cust_age)
unique_key min_date max_date years_diff
0 1 1998-01-10 1999-12-10 1 years and 11 months
1 2 2016-05-13 2016-05-13 0 years
2 3 2018-02-14 2019-03-15 1 years and 1 months and 1 day
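The output above always prints "years" and "months", even for a value of 1. As a minimal sketch of how the wording could follow the count, assuming the same relativedelta approach, the hypothetical helper `format_age` below (not part of the answer above) picks the singular or plural unit name for each part:

```python
from datetime import date
from dateutil.relativedelta import relativedelta


def format_age(start, end):
    """Describe the calendar gap between two dates, e.g. '1 year and 11 months'."""
    # Order the dates so the difference is never negative.
    if start > end:
        start, end = end, start
    diff = relativedelta(end, start)
    parts = []
    for value, unit in ((diff.years, "year"), (diff.months, "month"), (diff.days, "day")):
        if value:
            # Append "s" for any count other than 1.
            parts.append(f"{value} {unit}{'s' if value != 1 else ''}")
    return " and ".join(parts) if parts else "0 years"


first = format_age(date(1998, 1, 10), date(1999, 12, 10))
second = format_age(date(2018, 2, 14), date(2019, 3, 15))
same_day = format_age(date(2016, 5, 13), date(2016, 5, 13))
print(first)
print(second)
print(same_day)
```

It could be applied row by row with something like `df_cust_age.apply(lambda r: format_age(r['min_date'], r['max_date']), axis=1)`.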
Related
I have a dataframe with a column of time values stored as objects that I need to convert to datetime. The issue is that they are not all in the same format, so when I try:
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M:%S')
it gives me an error
ValueError: time data '3:22' does not match format '%H:%M:%S' (match)
or, if I use this code
df['Total call time'] = pd.to_datetime(df['Total call time'], format='%H:%M')
I get this error
ValueError: unconverted data remains: :58
These are the values in my data
Total call time
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
**45:48**
1:41:40
5:08:37
**3:22**
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58
times = """\
2:04:07
3:22:41
2:30:41
2:19:06
1:45:55
1:30:08
1:32:15
1:43:28
45:48
1:41:40
5:08:37
3:22
4:29:05
2:47:25
2:39:29
2:29:32
2:09:52
3:31:57
2:27:58
2:34:28
3:14:10
2:12:10
2:46:58""".split()
import pandas as pd
df = pd.DataFrame(times, columns=['elapsed'])
def pad(s):
    # Treat short values as M:SS or MM:SS and prepend the missing fields.
    if len(s) == 4:
        return '00:0'+s
    elif len(s) == 5:
        return '00:'+s
    return s
print(pd.to_timedelta(df['elapsed'].apply(pad)))
Output:
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 00:03:22
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58
Name: elapsed, dtype: timedelta64[ns]
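The same padding can be done without apply. This is only a sketch, assuming (as the answer above does) that every value with a single colon means MM:SS; the small sample Series `s` and the names `two_fields` and `padded` are made up for illustration:

```python
import pandas as pd

# Hypothetical sample containing both H:MM:SS and (M)M:SS values.
s = pd.Series(["2:04:07", "45:48", "3:22"])

# True where the value has only one colon, i.e. the hours field is missing.
two_fields = s.str.count(":").eq(1)

# Zero-pad the short values to MM:SS and prepend "00:" for the hours;
# values that already have three fields are left untouched.
padded = s.where(~two_fields, "00:" + s.str.zfill(5))

td = pd.to_timedelta(padded)
print(td)
```

If a value such as 3:22 should instead mean 3 hours 22 minutes, this rule gives the wrong answer, so the interpretation has to be decided from the data.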
Alternatively to grovina's answer ... instead of using apply you can directly use the dt accessor.
Here's a sample:
>>> data = [['2017-12-01'], ['2017-12-30'], ['2018-01-01']]
>>> df = pd.DataFrame(data=data, columns=['date'])
>>> df
date
0 2017-12-01
1 2017-12-30
2 2018-01-01
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: object
Note how df.date is an object? Let's turn it into a date like you want
>>> df.date = pd.to_datetime(df.date)
>>> df.date
0 2017-12-01
1 2017-12-30
2 2018-01-01
Name: date, dtype: datetime64[ns]
The format you want is for string formatting. I don't think you'll be able to convert the actual datetime64 to look like that format. For now, let's make a newly formatted string version of your date in a separate column
>>> df['new_formatted_date'] = df.date.dt.strftime('%d/%m/%y %H:%M')
>>> df.new_formatted_date
0 01/12/17 00:00
1 30/12/17 00:00
2 01/01/18 00:00
Name: new_formatted_date, dtype: object
Finally, since the df.date column is now of dtype datetime64, you can use the dt accessor right on it. No need to use apply
>>> df['month'] = df.date.dt.month
>>> df['day'] = df.date.dt.day
>>> df['year'] = df.date.dt.year
>>> df['hour'] = df.date.dt.hour
>>> df['minute'] = df.date.dt.minute
>>> df
        date new_formatted_date  month  day  year  hour  minute
0 2017-12-01     01/12/17 00:00     12    1  2017     0       0
1 2017-12-30     30/12/17 00:00     12   30  2017     0       0
2 2018-01-01     01/01/18 00:00      1    1  2018     0       0
Another idea is to check whether each value contains two colons and, if not, complete it before converting to timedeltas with to_timedelta. If the number before the first colon is at most 23, the value is parsed as HH:MM (so :00 is appended); if it is greater than 23, it is parsed as MM:SS (so 00: is prepended):
import numpy as np
m1 = df['Total call time'].str.count(':').ne(2)
m2 = df['Total call time'].str.extract(r'^(\d+):', expand=False).astype(float).gt(23)
s = np.select([m1 & m2, m1 & ~m2],
              ['00:' + df['Total call time'], df['Total call time'] + ':00'],
              df['Total call time'])
df['Total call time'] = pd.to_timedelta(s)
print (df)
Total call time
0 0 days 02:04:07
1 0 days 03:22:41
2 0 days 02:30:41
3 0 days 02:19:06
4 0 days 01:45:55
5 0 days 01:30:08
6 0 days 01:32:15
7 0 days 01:43:28
8 0 days 00:45:48
9 0 days 01:41:40
10 0 days 05:08:37
11 0 days 03:22:00
12 0 days 04:29:05
13 0 days 02:47:25
14 0 days 02:39:29
15 0 days 02:29:32
16 0 days 02:09:52
17 0 days 03:31:57
18 0 days 02:27:58
19 0 days 02:34:28
20 0 days 03:14:10
21 0 days 02:12:10
22 0 days 02:46:58
Currently I am reading in a data frame whose timestamps come from film and have the form 00 (days):00 (hours, rolling over into days at 24):00 (min):00 (sec).
pandas reads the time formats HH:MM:SS and YYYY:MM:DD HH:MM:SS fine.
Is there a way to have pandas read a duration in the form DD:HH:MM:SS?
Alternatively, using timedelta, how would I fold the DD into HH in the data frame so that pandas can make it "1 day HH:MM:SS", for example?
Data sample
00:00:00:00
00:07:33:57
02:07:02:13
00:00:13:11
00:00:10:11
00:00:00:00
00:06:20:06
01:12:13:25
Expected output for last sample
36:13:25
Thanks
If you want timedelta objects, a simple way is to replace the first colon with 'days ':
df['timedelta'] = pd.to_timedelta(df['col'].str.replace(':', 'days ', n=1))
output:
col timedelta
0 00:00:00:00 0 days 00:00:00
1 00:07:33:57 0 days 07:33:57
2 02:07:02:13 2 days 07:02:13
3 00:00:13:11 0 days 00:13:11
4 00:00:10:11 0 days 00:10:11
5 00:00:00:00 0 days 00:00:00
6 00:06:20:06 0 days 06:20:06
7 01:12:13:25 1 days 12:13:25
>>> df.dtypes
col object
timedelta timedelta64[ns]
dtype: object
From there it's also relatively easy to combine the days and hours as string:
c = df['timedelta'].dt.components
df['str_format'] = ((c['hours']+c['days']*24).astype(str)
+df['col'].str.split('(?=:)', n=2).str[-1]).str.zfill(8)
output:
col timedelta str_format
0 00:00:00:00 0 days 00:00:00 00:00:00
1 00:07:33:57 0 days 07:33:57 07:33:57
2 02:07:02:13 2 days 07:02:13 55:02:13
3 00:00:13:11 0 days 00:13:11 00:13:11
4 00:00:10:11 0 days 00:10:11 00:10:11
5 00:00:00:00 0 days 00:00:00 00:00:00
6 00:06:20:06 0 days 06:20:06 06:20:06
7 01:12:13:25 1 days 12:13:25 36:13:25
Convert the days separately, add them to the times, and finally apply a custom formatting function:
def f(x):
    ts = x.total_seconds()
    hours, remainder = divmod(ts, 3600)
    minutes, seconds = divmod(remainder, 60)
    return ('{}:{:02d}:{:02d}').format(int(hours), int(minutes), int(seconds))
d = pd.to_timedelta(df['col'].str[:2].astype(int), unit='d')
td = pd.to_timedelta(df['col'].str[3:])
df['col'] = d.add(td).apply(f)
print (df)
col
0 0:00:00
1 7:33:57
2 55:02:13
3 0:13:11
4 0:10:11
5 0:00:00
6 6:20:06
7 36:13:25
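If only the HH:MM:SS string is needed, the timedelta round trip can be skipped. The following is a rough sketch, assuming every value has exactly four colon-separated integer fields; the three-row sample frame and the names `parts`, `total_hours` and `hms` are made up for illustration:

```python
import pandas as pd

# Hypothetical sample in DD:HH:MM:SS form, taken from the question's data.
df = pd.DataFrame({"col": ["00:07:33:57", "02:07:02:13", "01:12:13:25"]})

# Split into four integer columns: days, hours, minutes, seconds.
parts = df["col"].str.split(":", expand=True).astype(int)
parts.columns = ["days", "hours", "minutes", "seconds"]

# Fold whole days into the hour count.
total_hours = parts["days"] * 24 + parts["hours"]

# Rebuild the string with every field zero-padded to two digits.
df["hms"] = (
    total_hours.astype(str).str.zfill(2)
    + ":" + parts["minutes"].astype(str).str.zfill(2)
    + ":" + parts["seconds"].astype(str).str.zfill(2)
)
print(df)
```

Malformed rows (missing fields, non-numeric text) would make `astype(int)` raise, which may or may not be the behaviour you want.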
I have the timeseries dataframe as:
timestamp            signal_value
2017-08-28 00:00:00  10
2017-08-28 00:05:00  3
2017-08-28 00:10:00  5
2017-08-28 00:15:00  5
I am trying to get the average Monthly percentage of the time where "signal_value" is greater than 5. Something like:
Month     metric
January   16%
February  2%
March     8%
April     10%
I tried the following code which gives the result for the whole dataset but how can I summarize it per each month?
total,count = 0, 0
for index, row in df.iterrows():
    total += 1
    if row["signal_value"] >= 5:
        count += 1
print((count/total)*100)
Thank you in advance.
Let us first generate some random data (generate random dates taken from here):
import pandas as pd
import numpy as np
import datetime
def randomtimes(start, end, n):
    frmt = '%d-%m-%Y %H:%M:%S'
    stime = datetime.datetime.strptime(start, frmt)
    etime = datetime.datetime.strptime(end, frmt)
    td = etime - stime
    dtimes = [np.random.random() * td + stime for _ in range(n)]
    return [d.strftime(frmt) for d in dtimes]
# Recreate some fake data
timestamp = randomtimes("01-01-2021 00:00:00", "01-01-2023 00:00:00", 10000)
signal_value = np.random.random(len(timestamp)) * 10
df = pd.DataFrame({"timestamp": timestamp, "signal_value": signal_value})
Now we can transform the timestamp column to pandas timestamps to extract month and year per timestamp:
df.timestamp = pd.to_datetime(df.timestamp, format='%d-%m-%Y %H:%M:%S')
df["month"] = df.timestamp.dt.month
df["year"] = df.timestamp.dt.year
We generate a boolean column indicating whether signal_value is larger than some threshold (here 5):
df["is_larger5"] = df.signal_value > 5
Finally, we can get the average for every month by using pandas.groupby:
>>> df.groupby(["year", "month"])['is_larger5'].mean()
year month
2021 1 0.509615
2 0.488189
3 0.506024
4 0.519362
5 0.498778
6 0.483709
7 0.498824
8 0.460396
9 0.542918
10 0.463043
11 0.492500
12 0.519789
2022 1 0.481663
2 0.527778
3 0.501139
4 0.527322
5 0.486936
6 0.510638
7 0.483370
8 0.521253
9 0.493639
10 0.495349
11 0.474886
12 0.488372
Name: is_larger5, dtype: float64
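The groupby result above is a fraction per year and month. As a hedged sketch of getting closer to the percentage table in the question, the example below groups by a monthly period and formats the share as text. The first four rows come from the question; the two September rows and the names `above`, `share` and `metric` are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2017-08-28 00:00:00", "2017-08-28 00:05:00",
        "2017-08-28 00:10:00", "2017-08-28 00:15:00",
        "2017-09-01 00:00:00", "2017-09-01 00:05:00",  # hypothetical extra rows
    ]),
    "signal_value": [10, 3, 5, 5, 7, 8],
})

# Boolean flag per row; its mean per group is the share of rows above 5.
above = df["signal_value"] > 5

# Group by calendar month (year and month together) and average the flag.
share = above.groupby(df["timestamp"].dt.to_period("M")).mean()

# Format as a percentage string with one decimal place.
metric = (share * 100).round(1).astype(str) + "%"
print(metric)
```

This counts rows, so it equals the share of time only if the samples are evenly spaced, as the 5-minute data in the question appears to be.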
I have a column with timedelta and I would like to create an extra column extracting the hour and minute from the timedelta column.
df
time_delta hour_minute
02:51:21.401000 2h:51min
03:10:32.401000 3h:10min
08:46:43.401000 08h:46min
This is what I have tried so far:
df['rh'] = df.time_delta.apply(lambda x: round(pd.Timedelta(x).total_seconds() \
% 86400.0 / 3600.0) )
Unfortunately, I'm not quite sure how to extract the minutes without incl. the hour
Use Series.dt.components to get the hours and minutes, then join them together:
td = pd.to_timedelta(df.time_delta).dt.components
df['rh'] = (td.hours.astype(str).str.zfill(2) + 'h:' +
td.minutes.astype(str).str.zfill(2) + 'min')
print (df)
time_delta hour_minute rh
0 02:51:21.401000 2h:51min 02h:51min
1 03:10:32.401000 3h:10min 03h:10min
2 08:46:43.401000 08h:46min 08h:46min
If the hours can exceed 24, it is also necessary to add the days:
print (df)
time_delta hour_minute
0 02:51:21.401000 2h:51min
1 03:10:32.401000 3h:10min
2 28:46:43.401000 28h:46min
td = pd.to_timedelta(df.time_delta).dt.components
print (td)
days hours minutes seconds milliseconds microseconds nanoseconds
0 0 2 51 21 401 0 0
1 0 3 10 32 401 0 0
2 1 4 46 43 401 0 0
df['rh'] = ((td.days * 24 + td.hours).astype(str).str.zfill(2) + 'h:' +
td.minutes.astype(str).str.zfill(2) + 'min')
print (df)
time_delta hour_minute rh
0 02:51:21.401000 2h:51min 02h:51min
1 03:10:32.401000 3h:10min 03h:10min
2 28:46:43.401000 28h:46min 28h:46min
See also this post which defines the function
def strfdelta(tdelta, fmt):
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)
Then, e.g.
strfdelta(pd.Timedelta('02:51:21.401000'), '{hours}h:{minutes}min')
gives '2h:51min'.
For your full dataframe
df['rh'] = df.time_delta.apply(lambda x: strfdelta(pd.Timedelta(x), '{hours}h:{minutes}min'))
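Because strfdelta passes its values through str.format, ordinary format specs can be used for zero padding. The sketch below repeats the helper so it runs on its own and adds a hypothetical variant, `strfdelta_total_hours` (not from the linked post), which folds whole days into the hour count; it assumes non-negative timedeltas:

```python
import pandas as pd


def strfdelta(tdelta, fmt):
    # Same helper as above: .seconds excludes whole days.
    d = {"days": tdelta.days}
    d["hours"], rem = divmod(tdelta.seconds, 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)


def strfdelta_total_hours(tdelta, fmt):
    # Hypothetical variant: hours may exceed 23 because days are folded in.
    d = {}
    d["hours"], rem = divmod(int(tdelta.total_seconds()), 3600)
    d["minutes"], d["seconds"] = divmod(rem, 60)
    return fmt.format(**d)


fmt = "{hours:02d}h:{minutes:02d}min"
padded = strfdelta(pd.Timedelta("02:51:21.401000"), fmt)
over_a_day = strfdelta_total_hours(pd.Timedelta("28:46:43.401000"), fmt)
print(padded)
print(over_a_day)
```

With the original strfdelta, the second value would report 4 hours, since the extra day sits in the days field.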
Struggling with something that should be easy:
today = '26/8/2018'
start = '1/8/2018'
diff = today - start
diff gives us 26 days.
How do I take the integer value of this difference, i.e. 26?
Basically, I'm trying to calculate a day-count fraction, say (diff / 365) * 10,000, but it won't work.
The actual values I have are:
0 304.548
1 371.397
2 350.466
3 -3574.36
4 255.452
and I'm trying to multiply them by:
duration
0 13 days
1 2 days
2 1 days
3 20 days
4 7 days
But I get:
0 TimedeltaIndex(['3959 days 02:57:32.054794', ...
1 TimedeltaIndex([ '4828 days 03:56:42.739725', ...
2 TimedeltaIndex([ '4556 days 01:18:54.246575', ...
3 TimedeltaIndex(['-46467 days +08:52:36.164383'...
4 TimedeltaIndex(['3320 days 21:02:27.945204', ...
desired output is
0 3959.124 as an integer (304.548*13), not as a daycount
Perhaps something like this might work:
In [1]: import datetime
In [4]: diff = datetime.datetime.today() - datetime.datetime(year=2018, month=8, day=1)
In [5]: diff.days
Out[5]: 25
Then you can do something like:
In [10]: diff.days / 365 * 10000
Out[10]: 684.931506849315
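For the column case in the question, the same idea carries over through the dt accessor. This is a minimal sketch, assuming the values are a float Series and the durations a timedelta64 Series; the names `values`, `duration`, `day_counts`, `product` and `fraction` are made up for illustration:

```python
import pandas as pd

# Sample data copied from the question.
values = pd.Series([304.548, 371.397, 350.466, -3574.36, 255.452])
duration = pd.Series(pd.to_timedelta([13, 2, 1, 20, 7], unit="D"))

# .dt.days returns plain integers instead of timedeltas.
day_counts = duration.dt.days

# Multiplying floats by integers gives ordinary floats, e.g. 304.548 * 13.
product = values * day_counts

# Day-count fraction scaled by 10,000, as described in the question.
fraction = day_counts / 365 * 10_000

print(product)
print(fraction)
```

Note that `.dt.days` drops any partial day; `duration.dt.total_seconds() / 86400` would keep the fractional part if that matters.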