Group by list of different time ranges in Pandas - python

Edit: Changing example to use Timedelta indices.
I have a DataFrame of different time ranges that represent indices in my main DataFrame. eg:
import numpy as np
import pandas as pd

ranges = pd.DataFrame(data=np.array([[1, 10, 20], [3, 15, 30]]).T,
                      columns=["Start", "Stop"])
ranges = ranges.apply(pd.to_timedelta, unit="s")
ranges
            Start            Stop
0 0 days 00:00:01 0 days 00:00:03
1 0 days 00:00:10 0 days 00:00:15
2 0 days 00:00:20 0 days 00:00:30
my_data = pd.DataFrame(data=list(range(0, 40*5, 5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")
I want to calculate the averages of the data in my_data for each of the time ranges in ranges. How can I do this?
One option would be as follows:
# .loc label slicing is inclusive of both endpoints, so .iloc[:-1] drops the
# Stop row and makes each window half-open, i.e. [Start, Stop)
ranges.apply(lambda row: my_data.loc[row["Start"]:row["Stop"]].iloc[:-1].mean(), axis=1)
    data
0    7.5
1   60.0
2  122.5
But can we do this without apply?

Here is one way to approach it:
Generate the timedeltas and concatenate into a single block:
# note the use of closed='left' (`Stop` is not included in the build)
timedelta = [pd.timedelta_range(a, b, closed='left', freq='1s')
             for a, b in zip(ranges.Start, ranges.Stop)]
timedelta = timedelta[0].append(timedelta[1:])
Get the grouping which will be used for the groupby and aggregation:
# width of each range in seconds = number of 1-second timestamps per group
counts = ranges.Stop.sub(ranges.Start).dt.seconds
# repeat each group's label once per timestamp
counts = np.arange(counts.size).repeat(counts)
Group by and aggregate:
my_data.loc[timedelta].groupby(counts).mean()
    data
0    7.5
1   60.0
2  122.5
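Not part of the original answer, but a sketch of the same grouping using pd.IntervalIndex and pd.cut, assuming a reasonably recent pandas and that my_data keeps its TimedeltaIndex: build half-open intervals from ranges, label each row of my_data with the interval it falls in, and group on those labels.
import pandas as pd
# bin the TimedeltaIndex directly instead of materializing every 1-second timestamp
bins = pd.IntervalIndex.from_arrays(ranges["Start"], ranges["Stop"], closed="left")
labels = pd.cut(my_data.index, bins)  # rows outside every range map to NaN
my_data.groupby(labels).mean()        # one mean per interval; NaN-labelled rows are dropped
This avoids the per-second expansion entirely, at the cost of interval-valued group labels instead of 0, 1, 2.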

Related

Python pandas - Finding time duration

I am facing a problem while finding the duration. The df is:
data = {
    'initial_time': ['2019-05-21 22:29:55', '2019-10-07 17:43:09', '2020-12-13 23:53:00', '2018-04-17 23:51:23', '2016-08-31 07:40:49'],
    'final_time':   ['2019-05-22 01:10:30', '2019-10-07 17:59:09', '2020-12-13 00:30:10', '2018-04-18 01:01:23', '2016-08-31 08:45:49'],
    'duration':     [0, 0, 0, 0, 0]
}
df = pd.DataFrame(data)
df
df
Output:
          initial_time           final_time  duration
0  2019-05-21 22:29:55  2019-05-22 01:10:30         0
1  2019-10-07 17:43:09  2019-10-07 17:59:09         0
2  2020-12-13 23:53:00  2020-12-13 00:30:10         0
3  2018-04-17 23:51:23  2018-04-18 01:01:23         0
4  2016-08-31 07:40:49  2016-08-31 08:45:49         0
The output I'm expecting is the total duration, i.e. final_time - initial_time.
Note: it contains values whose initial and final times fall on different dates (row 1).
The problem can be broken down into 3 parts:
1) convert the strings to datetime objects
2) write a function that computes the duration between 2 datetime objects
3) apply the function to a new column of your dataframe
1) datetime_object = datetime.strptime('2019-05-21 22:29:55', '%Y-%m-%d %H:%M:%S')
2) See: How do I find the time difference between two datetime objects in python?
3) df['duration'] = df.apply(lambda row: getDuration(row['initial_time'], row['final_time'], 'seconds'), axis=1)
(getDuration here is the helper from the linked question.)
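Alternatively, here is a fully vectorized sketch (not from the answer above) that skips both the helper function and apply, assuming the two columns hold parseable timestamp strings:
import pandas as pd
df['initial_time'] = pd.to_datetime(df['initial_time'])
df['final_time'] = pd.to_datetime(df['final_time'])
df['duration'] = df['final_time'] - df['initial_time']      # Timedelta column
df['duration_sec'] = df['duration'].dt.total_seconds()      # numeric seconds, if preferred
Subtracting two datetime columns yields a timedelta64 column directly, so no row-wise function is needed.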

Pandas - Add at least one row for every day (datetimes include a time)

Edit: You can use the alleged duplicate's reindex() solution if your dates don't include times; otherwise you need a solution like the one by #kosnik. In addition, their solution doesn't need your dates to be the index!
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
                        ['2017-02-15 16:33:00', 'Scott', '10'],
                        ['2017-02-15 16:45:00', 'Steve', '5']],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
             Datetime Sender Count
0 2017-02-12 20:25:00    Sam     8
1 2017-02-15 16:33:00  Scott    10
2 2017-02-15 16:45:00  Steve     5
I need there to be at least one row for every date, so the expected result would be
             Datetime Sender Count
0 2017-02-12 20:25:00    Sam     8
1 2017-02-13 00:00:00   None     0
2 2017-02-14 00:00:00   None     0
3 2017-02-15 16:33:00  Scott    10
4 2017-02-15 16:45:00  Steve     5
I have tried to make datetime the index, add the missing dates and use reindex() like so:
df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values) - 1):
    if values[i].date() + timedelta(days=1) < values[i + 1].date():
        values.insert(i + 1, pd.Timestamp(values[i].date() + timedelta(days=1)))
print(df.reindex(values, fill_value=0))
This makes every row forget about the other columns, and the same thing happens with asfreq('D') or resample():
                     ID Sender Count
Datetime
2017-02-12 16:25:00   0    Sam     8
2017-02-13 00:00:00   0      0     0
2017-02-14 00:00:00   0      0     0
2017-02-15 20:25:00   0      0     0
2017-02-15 20:25:00   0      0     0
What would be the appropriate way of going about this?
I would create a new DataFrame column which contains all the required dates and then left join it with your data frame.
A working code example is the following:
import datetime
import numpy as np
import pandas as pd

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()  # existing datetimes - the missing ones are added below
dates = [x.date() for x in datetimes]  # these are converted to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(np.datetime64(forward_date))
# create new dataframe, merge and fill 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'].fillna(0, inplace=True)
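Not part of the original answer, but a more compact sketch of the same idea using pd.date_range, assuming df is the frame from the question with a parsed Datetime column:
import pandas as pd
days = df['Datetime'].dt.normalize()                        # midnight of each row's day
all_days = pd.date_range(days.min(), days.max(), freq='D')  # every calendar day in range
missing = all_days.difference(days)                         # days with no row at all
filler = pd.DataFrame({'Datetime': missing, 'Sender': None, 'Count': 0})
df = pd.concat([df, filler]).sort_values('Datetime', ignore_index=True)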

Pandas Dataframe: Count and Bin integers and datetime into ranges producing two output dataframes

I have the following DataFrame:
ID  Minutes  Datetime
1   30       6/4/2018 23:47:00
2   420
3   433      6/10/2018 2:50
4   580      6/9/2018 3:10
5   1020
I want to count the number of times the Minutes values fall within certain ranges, and do a similar count for the Datetime field (timestamps falling within certain time-of-day ranges).
Below is the output I want:
MIN_RANGE COUNT
6-8 hours 2
8-10 hours 1
10-12 hours 0
12-14 hours 0
14-16 hours 0
16+ hours 1
RANGE COUNT
8pm - 10pm 0
10pm - 12am 1
12am - 2am 0
2am-4am 2
4am-6am 0
6am-8am 0
8am -10am 0
10am - 12pm 0
12pm - 2pm 0
2pm - 4pm 0
4pm - 6pm 0
6pm - 8pm 0
I have searched around Google and Stack Overflow for how to do this (binning and such) but couldn't find anything directly related to what I am trying to do.
Help?
This is a complex problem that can be approached using pd.date_range and pd.cut, plus some index manipulation.
First of all, you can start by cutting your data frame using pd.cut:
# note: date_range anchors the time-only string '02:00:00' to today's date,
# hence the 2018-07-09 stamps in the output below
cuts = pd.cut(pd.to_datetime(df.Datetime), pd.date_range('02:00:00', freq='2H', periods=13))
0    (2018-07-09 22:00:00, 2018-07-10 00:00:00]
1                                           NaN
2    (2018-07-09 02:00:00, 2018-07-09 04:00:00]
3    (2018-07-09 02:00:00, 2018-07-09 04:00:00]
4                                           NaN
This will yield the cuts based on your Datetime column and the ranges defined.
Let's start by having a base data frame with values set to 0, which we will update later with your counts. Using your cuts from above:
cats = cuts.cat.categories
bases = ["{}-{}".format(v.left.strftime("%H%p"), v.right.strftime("%H%p")) for v in cats]
df_base = pd.DataFrame({"Range": bases, "Count": 0}).set_index("Range")
which yields
           Count
Range
02AM-04AM      0
04AM-06AM      0
06AM-08AM      0
08AM-10AM      0
10AM-12PM      0
12PM-14PM      0
14PM-16PM      0
16PM-18PM      0
18PM-20PM      0
20PM-22PM      0
22PM-00AM      0
00AM-02AM      0
Now, you can use collections.Counter to quickly count your occurrences:
from collections import Counter
x = Counter(cuts.dropna())
Notice that I have used dropna() so as not to count NaNs. With your x variable, we can
values = {"{}-{}".format(k.left.strftime("%H%p"), k.right.strftime("%H%p")): v for k, v in x.items()}
counts_df = pd.DataFrame([values]).T
which yields
           0
02AM-04AM  2
22PM-00AM  1
Finally, we just update our previous data frame with these values
df_base.loc[counts_df.index, "Count"] = counts_df[0]
           Count
Range
02AM-04AM      2
04AM-06AM      0
06AM-08AM      0
08AM-10AM      0
10AM-12PM      0
12PM-14PM      0
14PM-16PM      0
16PM-18PM      0
18PM-20PM      0
20PM-22PM      0
22PM-00AM      1
00AM-02AM      0
import numpy as np
counts = np.histogram(df['Minutes'],
                      bins=list(range(6*60, 18*60, 2*60)) + [24*60])[0]
bin_labels = ['6-8 hours',
              '8-10 hours',
              '10-12 hours',
              '12-14 hours',
              '14-16 hours',
              '16+ hours']
pd.Series(counts, index=bin_labels)
You can do a similar thing with the hours, using the hour attribute of datetime objects. You will have to fill in the empty parts of the Datetime column first.
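A minimal sketch of that hours version (my addition, not the answerer's code), assuming df['Datetime'] has been parsed with pd.to_datetime; here rows with missing datetimes are simply dropped rather than filled in:
import numpy as np
import pandas as pd
hours = pd.to_datetime(df['Datetime']).dt.hour.dropna()     # hour of day, NaT rows removed
hour_counts = np.histogram(hours, bins=range(0, 26, 2))[0]  # twelve 2-hour bins
hour_labels = ['{:02d}-{:02d}'.format(h, h + 2) for h in range(0, 24, 2)]
pd.Series(hour_counts, index=hour_labels)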
#RafaelC has already addressed the binning and counting, but I'll add a note about reading the data from a file.
First, let's assume your columns are separated by commas (CSV) and you start with:
dates.csv
ID,Minutes,Datetime
1,30,6/4/2018 23:47:00
2,420,
3,433,6/10/2018 2:50
4,580,6/9/2018 3:10
5,1020,
Then, you can read the values and parse the third column as dates as follows:
import pandas as pd

def my_date_parser(date_str):
    # Allow empty values to be coerced to NaT (Not a Time)
    # rather than throw an exception
    return pd.to_datetime(date_str, errors='coerce')

df = pd.read_csv(
    './dates.csv',
    date_parser=my_date_parser,
    parse_dates=['Datetime']
)
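As a side note, a minimal sketch without the custom parser may be enough on recent pandas versions (where date_parser is deprecated), since empty CSV fields already come back as NaT when the column is parsed as dates:
import pandas as pd
df = pd.read_csv('./dates.csv', parse_dates=['Datetime'])
print(df.dtypes)  # Datetime should be datetime64[ns]; empty fields become NaT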
You can also get the counts by using the floor method of pandas' datetime accessor. In this case, you want to use a frequency of '2h' so that you are looking at 2-hour bins; then just grab the time part:
import pandas as pd
df['Datetime'] = pd.to_datetime(df.Datetime)
df.Datetime.dt.floor('2h').dt.time
#0 22:00:00
#1 NaT
#2 02:00:00
#3 02:00:00
#4 NaT
(Alternatively you could also just use df.Datetime.dt.hour//2 to get the same grouping logic, but slightly different labels)
So you can easily just groupby this and then count:
df.groupby(df.Datetime.dt.floor('2h').dt.time).size()
#Datetime
#02:00:00 2
#22:00:00 1
#dtype: int64
Now to get the full list, we can just reindex, and change the index labels to be a bit more informative.
import datetime
import numpy as np
df_counts = df.groupby(df.Datetime.dt.floor('2h').dt.time).size()
ids = [datetime.time(2 * x, 0) for x in range(12)]
df_counts = df_counts.reindex(ids).fillna(0).astype('int')
# Appropriately label the ranges with more info if needed
df_counts.index = ('[' + df_counts.index.astype(str) + ' - '
                   + np.roll(df_counts.index.astype(str), -1) + ')')
Output:
df_counts
[00:00:00 - 02:00:00) 0
[02:00:00 - 04:00:00) 2
[04:00:00 - 06:00:00) 0
[06:00:00 - 08:00:00) 0
[08:00:00 - 10:00:00) 0
[10:00:00 - 12:00:00) 0
[12:00:00 - 14:00:00) 0
[14:00:00 - 16:00:00) 0
[16:00:00 - 18:00:00) 0
[18:00:00 - 20:00:00) 0
[20:00:00 - 22:00:00) 0
[22:00:00 - 00:00:00) 1
dtype: int64

Pandas: How to group the non-continuous date column?

I have a column in a dataframe which contains non-continuous dates. I need to group these dates by a frequency of 2 days. Data sample (after normalization):
2015-04-18 00:00:00
2015-04-20 00:00:00
2015-04-20 00:00:00
2015-04-21 00:00:00
2015-04-27 00:00:00
2015-04-30 00:00:00
2015-05-07 00:00:00
2015-05-08 00:00:00
I tried the following, but as the dates are not continuous I am not getting the desired result:
df.groupby(pd.Grouper(key='l_date', freq='2D'))
Is there a way to achieve the desired grouping using pandas, or should I write separate logic?
Once you have an l_date-sorted dataframe, you can create a continuous dummy date (dum_date) column and group by a 2D frequency on it:
df = df.sort_values(by='l_date')
df['dum_date'] = pd.date_range(pd.Timestamp.today(), periods=df.shape[0]).tolist()
df.groupby(pd.Grouper(key='dum_date', freq='2D'))
OR
If you are fine with groupings other than dates, then a generalized way to group n consecutive rows could be:
n = 2  # n = 2 for your use case
df = df.sort_values(by='l_date')
df['grouping'] = [(i // n + 1) for i in range(df.shape[0])]
df.groupby(pd.Grouper(key='grouping'))
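A quick usage sketch of that second approach (my example data, not the asker's full frame):
import pandas as pd
df = pd.DataFrame({'l_date': pd.to_datetime(
    ['2015-04-18', '2015-04-20', '2015-04-20', '2015-04-21'])})
n = 2
df = df.sort_values(by='l_date')
df['grouping'] = [(i // n + 1) for i in range(df.shape[0])]
print(df.groupby('grouping')['l_date'].agg(['min', 'max', 'count']))
Each group simply holds n consecutive rows of the sorted frame, which is what the integer-division label encodes.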

How Can I Detect Gaps and Consecutive Periods In A Time Series In Pandas

I have a pandas DataFrame that is indexed by date. I would like to select all consecutive gaps by period and all consecutive days by period. How can I do this?
Example of a DataFrame with no columns but a date index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 form a two-day range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another DataFrame in each case, since I want to use other columns in the DataFrame to groupby.
Pandas has a built-in method DataFrame.diff() which you can use to accomplish this. One benefit is that you can use pandas series functions like mean() to quickly compute summary statistics on the resulting gaps series:
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
    '2016-08-03',
    '2016-08-04',
    '2016-08-05',
    '2016-08-17',
    '2016-09-05',
    '2016-09-06',
    '2016-09-07',
    '2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
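The question also asks about consecutive periods; here is a companion sketch under the same assumptions (the df from the snippet above, one midnight timestamp per row) that labels runs where the day-to-day diff is exactly one day, then summarizes each run:
import pandas as pd
runs = (df['date'].diff() != pd.Timedelta(days=1)).cumsum()  # new label at every break
run_stats = df.groupby(runs)['date'].agg(['min', 'max', 'count'])
print(run_stats[run_stats['count'] > 1])  # runs of 2+ consecutive days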
Here's something to get you started:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.ones(5), columns=['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39',
                             '2016-09-05 11:11:30', '2016-09-05 11:10:46',
                             '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df.cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df to, say, a spreadsheet it will make sense:
print(df.head())
                     ones
2016-08-03 00:00:00     0
2016-08-03 10:53:39     1
2016-08-04 00:00:00     1
2016-08-05 00:00:00     1
2016-08-06 00:00:00     1
print(df.tail())
                     ones
2016-09-16 00:00:00     4
2016-09-17 00:00:00     4
2016-09-18 00:00:00     4
2016-09-19 00:00:00     4
2016-09-19 10:23:03     5
Now to complete, using named aggregation:
df = df.reset_index()
df = df.groupby('ones').aggregate(first_spotted=('index', 'min'),
                                  gaps=('index', 'count'))
which gives:
            first_spotted  gaps
ones
0     2016-08-03 00:00:00     1
1     2016-08-03 10:53:39    34
2     2016-09-05 11:10:46     1
3     2016-09-05 11:11:30     2
4     2016-09-06 10:53:39    14
5     2016-09-19 10:23:03     1
