I am facing a problem while computing a duration. The df is:
import pandas as pd

data = {
    'initial_time': ['2019-05-21 22:29:55','2019-10-07 17:43:09','2020-12-13 23:53:00','2018-04-17 23:51:23','2016-08-31 07:40:49'],
    'final_time': ['2019-05-22 01:10:30','2019-10-07 17:59:09','2020-12-13 00:30:10','2018-04-18 01:01:23','2016-08-31 08:45:49'],
    'duration': [0, 0, 0, 0, 0]
}
df = pd.DataFrame(data)
df
Output:
initial_time final_time duration
0 2019-05-21 22:29:55 2019-05-22 01:10:30 0
1 2019-10-07 17:43:09 2019-10-07 17:59:09 0
2 2020-12-13 23:53:00 2020-12-13 00:30:10 0
3 2018-04-17 23:51:23 2018-04-18 01:01:23 0
4 2016-08-31 07:40:49 2016-08-31 08:45:49 0
The output I'm expecting is the total duration, i.e. final_time - initial_time.
Note: it contains rows whose initial and final times fall on different dates (row 1).
The problem can be broken down into three parts:
1) Convert the strings to datetime objects:
datetime_object = datetime.strptime('2019-05-21 22:29:55', '%Y-%m-%d %H:%M:%S')
2) Write a function that computes the duration between two datetime objects (see "How do I find the time difference between two datetime objects in Python?").
3) Apply the function to a new column of your dataframe:
df['duration'] = df.apply(lambda row: getDuration(row['initial_time'], row['final_time'], 'seconds'), axis=1)
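Putting the three steps together: getDuration is not defined anywhere above, so here is one hypothetical sketch of such a helper (an assumption, not code from the original answer):

import pandas as pd
from datetime import datetime

def getDuration(start_str, end_str, interval='seconds'):
    # hypothetical helper: parse both timestamp strings and return their difference
    start = datetime.strptime(start_str, '%Y-%m-%d %H:%M:%S')
    end = datetime.strptime(end_str, '%Y-%m-%d %H:%M:%S')
    delta = end - start
    return delta.total_seconds() if interval == 'seconds' else delta

df['duration'] = df.apply(
    lambda row: getDuration(row['initial_time'], row['final_time'], 'seconds'),
    axis=1)

In plain pandas the same result can also be had vectorized, without a helper: df['duration'] = pd.to_datetime(df['final_time']) - pd.to_datetime(df['initial_time']).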
Edit: Changing example to use Timedelta indices.
I have a DataFrame of different time ranges that represent indices into my main DataFrame, e.g.:
import numpy as np
import pandas as pd

ranges = pd.DataFrame(data=np.array([[1, 10, 20], [3, 15, 30]]).T, columns=["Start", "Stop"])
ranges = ranges.apply(pd.to_timedelta, unit="s")
ranges
Start Stop
0 0 days 00:00:01 0 days 00:00:03
1 0 days 00:00:10 0 days 00:00:15
2 0 days 00:00:20 0 days 00:00:30
my_data = pd.DataFrame(data=list(range(0, 40 * 5, 5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")
I want to calculate the averages of the data in my_data for each of the time ranges in ranges. How can I do this?
One option would be as follows:
ranges.apply(lambda row: my_data.loc[row["Start"]:row["Stop"]].iloc[:-1].mean(), axis=1)
data
0 7.5
1 60.0
2 122.5
But can we do this without apply?
Here is one way to approach it:
Generate the timedeltas and concatenate into a single block:
# note the use of closed='left' (`Stop` is not included in the build)
timedelta = [pd.timedelta_range(a,b, closed='left', freq='1s')
for a, b in zip(ranges.Start, ranges.Stop)]
timedelta = timedelta[0].append(timedelta[1:])
Get the grouping which will be used for the groupby and aggregation:
counts = ranges.Stop.sub(ranges.Start).dt.seconds
counts = np.arange(counts.size).repeat(counts)
Group by and aggregate:
my_data.loc[timedelta].groupby(counts).mean()
data
0 7.5
1 60.0
2 122.5
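As a side note, the same result can also be reached by binning the index with pd.cut against an IntervalIndex built from ranges; this is a sketch, relying on pd.cut's support for timedelta data:

import numpy as np
import pandas as pd

ranges = pd.DataFrame(data=np.array([[1, 10, 20], [3, 15, 30]]).T, columns=["Start", "Stop"])
ranges = ranges.apply(pd.to_timedelta, unit="s")
my_data = pd.DataFrame(data=list(range(0, 40 * 5, 5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")

# half-open intervals [Start, Stop), matching closed='left' above
bins = pd.IntervalIndex.from_arrays(ranges.Start, ranges.Stop, closed="left")
groups = pd.cut(my_data.index, bins)
print(my_data.groupby(groups)["data"].mean())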
I am receiving data which consists of a 'StartTime' and a 'Duration' of time active. This is hard to work with when I need to do calculations on a specified time range over multiple days. I would like to break this data down to minutely data to make future calculations easier. Please see the example to get a better understanding.
Data which I currently have:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime']).dt.tz_localize('utc').dt.tz_convert('Australia/Melbourne')
What I would like to have:
data_expected = {'Time':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 04:37:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00','2019-01-02 05:14:00+11:00'],
'Duration':[1,1,1,1,1,1,1],
'Site':['1','2','3','3','4','5','5']
}
df_expected = pd.DataFrame(data_expected)
df_expected['Time'] = pd.to_datetime(df_expected['Time']).dt.tz_localize('utc').dt.tz_convert('Australia/Melbourne')
I would like to see if anyone has a good solution for this problem. Effectively, I would need data rows with Duration >1 to be duplicated with time +1minute for each minute above 1 minute duration. Is there a way to do this without creating a whole new dataframe?
******** EDIT ********
In response to @DavidErickson's answer (putting this here because I can't put images in comments): I ran into a bit of trouble. df1 is a subset of the original dataframe, and df2 is df1 after applying the code provided. You can see that the time added to index 635 is incorrect.
I think you might want to address the use case where Duration > 2 as well.
For the given input (modified to parse the datetimes directly):
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
This code should do the trick:
df['offset'] = df['Duration'].apply(lambda x: list(range(x)))
df = df.explode('offset')
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='T'))
df['StartTime'] += df['offset']
df["Duration"] = 1
Basically, it works as follows:
create a list of integers based on the Duration value;
replicate each row (explode) with consecutive integer offsets;
transform the integer offsets into timedelta offsets;
perform the datetime arithmetic and reset the Duration field.
The result is:
StartTime Duration Site offset
0 2018-12-30 12:45:00+11:00 1 1 00:00:00
1 2018-12-31 16:48:00+11:00 1 2 00:00:00
2 2019-01-01 04:36:00+11:00 1 3 00:00:00
2 2019-01-01 04:37:00+11:00 1 3 00:01:00
2 2019-01-01 04:38:00+11:00 1 3 00:02:00
3 2019-01-01 19:27:00+11:00 1 4 00:00:00
4 2019-01-02 05:13:00+11:00 1 5 00:00:00
4 2019-01-02 05:14:00+11:00 1 5 00:01:00
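If the helper column is not wanted in the final frame, a small cleanup step removes it:

df = df.drop(columns='offset').reset_index(drop=True)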
Use df.index.repeat according to the Duration column to add the relevant number of rows. Then create a mask with .groupby and cumcount that adds the appropriate number of minutes on top of the base time.
input:
data = {'StartTime':['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
'Duration':[1,1,3,1,2],
'Site':['1','2','3','4','5']
}
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
code:
df = df.loc[df.index.repeat(df['Duration'])]
mask = df.groupby('Site').cumcount()
df['StartTime'] = df['StartTime'] + pd.to_timedelta(mask, unit='m')
df = df.append(df).sort_values('StartTime').assign(Duration=1).drop_duplicates()
df
output:
StartTime Duration Site
0 2018-12-30 12:45:00+11:00 1 1
1 2018-12-31 16:48:00+11:00 1 2
2 2019-01-01 04:36:00+11:00 1 3
2 2019-01-01 04:37:00+11:00 1 3
2 2019-01-01 04:38:00+11:00 1 3
3 2019-01-01 19:27:00+11:00 1 4
4 2019-01-02 05:13:00+11:00 1 5
4 2019-01-02 05:14:00+11:00 1 5
If you are running into memory issues, then you can also try dask. I have taken @jlandercy's pandas answer and changed it to dask syntax, as I'm not sure whether the pandas index.repeat operation would work with dask. Here is the documentation on the functions/operations used in the code: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table
import dask.dataframe as dd
import pandas as pd

# read as a dask dataframe from csv or SQL or other
df = dd.read_csv(files)  # or: df = dd.read_sql_table(table, uri, index_col='StartTime')
# meta declares the name/dtype of the lazily computed output
df['offset'] = df['Duration'].apply(lambda x: list(range(x)), meta=('offset', 'object'))
df = df.explode('offset')
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='T'), meta=('offset', 'timedelta64[ns]'))
df['StartTime'] += df['offset']
df['Duration'] = 1
I've got a dataframe with a column like this:
Date
3 mins
2 hours
9-Feb
13-Feb
The type of the dates is string for every row. What is the easiest way to convert those dates to integer unixtime?
One idea is to convert the column to datetimes and to timedeltas:
import numpy as np

df['dates'] = pd.to_datetime(df['Date'] + '-2020', format='%d-%b-%Y', errors='coerce')
times = df['Date'].replace({r'(\d+)\s+mins': '00:\\1:00',
                            r'\s+hours': ':00:00'}, regex=True)
df['times'] = pd.to_timedelta(times, errors='coerce')
# remove rows with missing values in both dates and times
df = df[df['dates'].notna() | df['times'].notna()]
df['all'] = df['dates'].dropna().astype(np.int64).append(df['times'].dropna().astype(np.int64))
print (df)
Date dates times all
0 3 mins NaT 00:03:00 180000000000
1 2 hours NaT 02:00:00 7200000000000
2 9-Feb 2020-02-09 NaT 1581206400000000000
3 13-Feb 2020-02-13 NaT 1581552000000000000
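If you want plain unix seconds rather than the nanosecond integers shown above, a final integer division does it (a small follow-up to the answer):

df['all_seconds'] = df['all'] // 10**9  # nanoseconds -> whole seconds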
Edit: You can use the alleged duplicate's reindex() solution if your dates don't include times; otherwise you need a solution like the one by @kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
['2017-02-15 16:33:00', 'Scott', '10'],
['2017-02-15 16:45:00', 'Steve', '5']],
columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried making the datetime the index, adding the missing dates, and using reindex(), like so:
from datetime import timedelta

df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values)-1):
    if values[i].date() + timedelta(days=1) < values[i+1].date():
        values.insert(i+1, pd.Timestamp(values[i].date() + timedelta(days=1)))
print(df.reindex(values, fill_value=0))
This makes every row lose its other column values, and the same thing happens with asfreq('D') or resample():
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame that contains all the required datetimes and then left join it with your data frame.
A working code example is the following:
import datetime
import numpy as np

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist() # these are existing datetimes - will add here the missing
dates = [x.date() for x in datetimes] # these are converted to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
forward_date = min_date + datetime.timedelta(d)
if forward_date not in dates:
datetimes.append(np.datetime64(forward_date))
# create new dataframe, merge and fill 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'].fillna(0, inplace=True)
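Since the merge appends the filler dates after the original rows, a final sort brings the frame into the expected chronological order (not part of the original snippet):

df = df.sort_values('Datetime').reset_index(drop=True)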
I have the following DataFrame:
ID Minutes Datetime
1 30 6/4/2018 23:47:00
2 420
3 433 6/10/2018 2:50
4 580 6/9/2018 3:10
5 1020
I want to count how many of the Minutes values fall within certain ranges, and do a similar count for the Datetime field (how many timestamps fall within certain time-of-day ranges).
Below is the output I want:
MIN_RANGE COUNT
6-8 hours 2
8-10 hours 1
10-12 hours 0
12-14 hours 0
14-16 hours 0
16+ hours 1
RANGE COUNT
8pm - 10pm 0
10pm - 12am 1
12am - 2am 0
2am-4am 2
4am-6am 0
6am-8am 0
8am -10am 0
10am - 12pm 0
12pm - 2pm 0
2pm - 4pm 0
4pm - 6pm 0
6pm - 8pm 0
I have searched around Google and Stack Overflow for how to do this (binning and such) but couldn't find anything directly related to what I am trying to do.
Help?
This is a complex problem that can be solved using pd.date_range and pd.cut, plus some index manipulation.
First, cut your Datetime column using pd.cut:
cuts = pd.cut(pd.to_datetime(df.Datetime), pd.date_range('02:00:00', freq='2H', periods=13))
0 (2018-07-09 22:00:00, 2018-07-10 00:00:00]
1 NaN
2 (2018-07-09 02:00:00, 2018-07-09 04:00:00]
3 (2018-07-09 02:00:00, 2018-07-09 04:00:00]
4 NaN
This will yield the cuts based on your Datetime column and the ranges defined. (Note that pd.date_range('02:00:00', freq='2H', periods=13) anchors the bin edges to the date the code is run, which is why the intervals above show 2018-07-09; for this to bin your values, the timestamps' date parts must match, so in practice you would first reduce full timestamps to a common date or to the time of day.)
Let's start with a base data frame whose values are set to 0, which we will update later with the counts. Using the cuts from above:
cats = cuts.cat.categories
bases = ["{}-{}".format(v.left.strftime("%H%p"),v.right.strftime("%H%p")) for v in cats]
df_base = pd.DataFrame({"Range": bases, "Count":0}).set_index("Range")
which yields
COUNT
Range
02AM-04AM 0
04AM-06AM 0
06AM-08AM 0
08AM-10AM 0
10AM-12PM 0
12PM-14PM 0
14PM-16PM 0
16PM-18PM 0
18PM-20PM 0
20PM-22PM 0
22PM-00AM 0
00AM-02AM 0
Now, you can use collections.Counter to quickly count the occurrences:
from collections import Counter

x = Counter(cuts.dropna())
Notice that I used dropna() so as not to count NaNs. With the x variable, we can then build:
values = {"{}-{}".format(k.left.strftime("%H%p"), k.right.strftime("%H%p")) : v for k,v in x.items()}
counts_df = pd.DataFrame([values]).T
which yields
0
02AM-04AM 2
22PM-00AM 1
Finally, we just update our previous data frame with these values
df_base.loc[counts_df.index, "Count"] = counts_df[0]
COUNT
Range
02AM-04AM 2
04AM-06AM 0
06AM-08AM 0
08AM-10AM 0
10AM-12PM 0
12PM-14PM 0
14PM-16PM 0
16PM-18PM 0
18PM-20PM 0
20PM-22PM 0
22PM-00AM 1
00AM-02AM 0
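As an aside to the approach above, the zero-filled base frame can be skipped entirely: value_counts on a categorical Series reports every category, including the empty bins. A sketch using the cuts variable from above:

counts = cuts.value_counts(sort=False)  # one row per bin, zeros included, in category order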
import numpy as np

# bin edges in minutes: 360, 480, ..., 960 (6h to 16h in 2h steps), plus a catch-all edge at 24h
counts = np.histogram(df['Minutes'],
                      bins=list(range(6*60, 18*60, 2*60)) + [24*60])[0]
bin_labels = [ '6-8 hours',
'8-10 hours',
'10-12 hours',
'12-14 hours',
'14-16 hours',
'16+ hours']
pd.Series(counts, index = bin_labels)
You can do a similar thing with the hours, using the hour attribute of datetime objects. You will have to fill in the empty parts of the Datetime column first.
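Here is a sketch of that time-of-day binning (for brevity it drops the missing timestamps instead of filling them; the data frame is the one from the question):

import numpy as np
import pandas as pd

# Datetime column as in the question; two rows have no timestamp
df = pd.DataFrame({'Datetime': ['6/4/2018 23:47:00', None, '6/10/2018 2:50',
                                '6/9/2018 3:10', None]})
hours = pd.to_datetime(df['Datetime']).dropna().dt.hour
# twelve 2-hour bins over the day: [0,2), [2,4), ..., [22,24)
counts = np.histogram(hours, bins=range(0, 26, 2))[0]
labels = ['{:02d}-{:02d}h'.format(h, h + 2) for h in range(0, 24, 2)]
print(pd.Series(counts, index=labels))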
@RafaelC has already addressed the binning and counting, but I'll make a note about reading the data from a file.
First, let's assume you separate your columns by commas (CSV), and start with:
dates.csv
ID,Minutes,Datetime
1,30,6/4/2018 23:47:00
2,420,
3,433,6/10/2018 2:50
4,580,6/9/2018 3:10
5,1020,
Then, you can read the values and parse the third column as dates as follows.
from datetime import datetime
import pandas as pd
def my_date_parser(date_str):
# Allow empty values to be coerced to NaT (Not a Time)
# rather than throw an exception
return pd.to_datetime(date_str, errors='coerce')
df = pd.read_csv(
'./dates.csv',
date_parser=my_date_parser,
parse_dates=['Datetime']
)
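Alternatively, for the same result without a custom parser, you can read first and coerce afterwards:

import pandas as pd

df = pd.read_csv('./dates.csv')
df['Datetime'] = pd.to_datetime(df['Datetime'], errors='coerce')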
You can also get the counts by using the built-in floor method of datetime objects. In this case, you want to use a frequency of '2h' so that you are looking at 2-hour bins. Then just grab the time part:
import pandas as pd
df['Datetime'] = pd.to_datetime(df.Datetime)
df.Datetime.dt.floor('2h').dt.time
#0 22:00:00
#1 NaT
#2 02:00:00
#3 02:00:00
#4 NaT
(Alternatively you could also just use df.Datetime.dt.hour//2 to get the same grouping logic, but slightly different labels)
So you can easily just groupby this and then count:
df.groupby(df.Datetime.dt.floor('2h').dt.time).size()
#Datetime
#02:00:00 2
#22:00:00 1
#dtype: int64
Now to get the full list, we can just reindex, and change the index labels to be a bit more informative.
import datetime
import numpy as np
df_counts = df.groupby(df.Datetime.dt.floor('2h').dt.time).size()
ids = [datetime.time(2*x,0) for x in range(12)]
df_counts = df_counts.reindex(ids).fillna(0).astype('int')
# Appropriately label the ranges with more info if needed
df_counts.index = '['+df_counts.index.astype(str) + ' - ' + np.roll(df_counts.index.astype(str), -1)+')'
Output:
df_counts
[00:00:00 - 02:00:00) 0
[02:00:00 - 04:00:00) 2
[04:00:00 - 06:00:00) 0
[06:00:00 - 08:00:00) 0
[08:00:00 - 10:00:00) 0
[10:00:00 - 12:00:00) 0
[12:00:00 - 14:00:00) 0
[14:00:00 - 16:00:00) 0
[16:00:00 - 18:00:00) 0
[18:00:00 - 20:00:00) 0
[20:00:00 - 22:00:00) 0
[22:00:00 - 00:00:00) 1
dtype: int64