Pandas Dataframe Time Duration Expand to Minute Data - python

I am receiving data which consists of a 'StartTime' and a 'Duration' of time active. This is hard to work with when I need to do calculations on a specified time range over multiple days. I would like to break this data down to minutely data to make future calculations easier. Please see the example to get a better understanding.
Data which I currently have:
import pandas as pd

data = {'StartTime': ['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
        'Duration': [1, 1, 3, 1, 2],
        'Site': ['1', '2', '3', '4', '5']
        }
df = pd.DataFrame(data)
# the strings already carry a UTC offset, so parse as UTC and convert
df['StartTime'] = pd.to_datetime(df['StartTime'], utc=True).dt.tz_convert('Australia/Melbourne')
What I would like to have:
data_expected = {'Time': ['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 04:37:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00','2019-01-02 05:14:00+11:00'],
                 'Duration': [1, 1, 1, 1, 1, 1, 1],
                 'Site': ['1', '2', '3', '3', '4', '5', '5']
                 }
df_expected = pd.DataFrame(data_expected)
df_expected['Time'] = pd.to_datetime(df_expected['Time'], utc=True).dt.tz_convert('Australia/Melbourne')
I would like to see if anyone has a good solution for this problem. Effectively, rows with Duration > 1 need to be duplicated, adding one minute to the time for each minute of duration beyond the first. Is there a way to do this without creating a whole new dataframe?
******** EDIT ********
In response to @DavidErickson's answer. Putting this here because I can't put images in comments. I ran into a bit of trouble: df1 is a subset of the original dataframe, and df2 is df1 after applying the code provided. You can see that the time added on to index 635 is incorrect.

I think you might want to address the use case where Duration > 2 as well.
For the modified given input:
data = {'StartTime': ['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
        'Duration': [1, 1, 3, 1, 2],
        'Site': ['1', '2', '3', '4', '5']
        }
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
This code should do the trick:
df['offset'] = df['Duration'].apply(lambda x: list(range(x)))  # e.g. 3 -> [0, 1, 2]
df = df.explode('offset')                                      # one row per offset
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='min'))
df['StartTime'] += df['offset']
df['Duration'] = 1
Basically, it works as follows:
create a list of integers based on the Duration value;
replicate the row (explode) with consecutive integer offsets;
transform each integer offset into a timedelta offset;
perform the datetime arithmetic and reset the Duration field.
The result looks like this:
StartTime Duration Site offset
0 2018-12-30 12:45:00+11:00 1 1 00:00:00
1 2018-12-31 16:48:00+11:00 1 2 00:00:00
2 2019-01-01 04:36:00+11:00 1 3 00:00:00
2 2019-01-01 04:37:00+11:00 1 3 00:01:00
2 2019-01-01 04:38:00+11:00 1 3 00:02:00
3 2019-01-01 19:27:00+11:00 1 4 00:00:00
4 2019-01-02 05:13:00+11:00 1 5 00:00:00
4 2019-01-02 05:14:00+11:00 1 5 00:01:00
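If you want the frame to line up with df_expected from the question, a small optional cleanup step (the column names here just mirror the expected output):
df = df.drop(columns='offset').rename(columns={'StartTime': 'Time'}).reset_index(drop=True)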

Use df.index.repeat according to the Duration column to add the relevant number of rows. Then use .groupby with cumcount() to build a per-row counter that adds the appropriate number of minutes on top of the base time.
input:
data = {'StartTime': ['2018-12-30 12:45:00+11:00','2018-12-31 16:48:00+11:00','2019-01-01 04:36:00+11:00','2019-01-01 19:27:00+11:00','2019-01-02 05:13:00+11:00'],
        'Duration': [1, 1, 3, 1, 2],
        'Site': ['1', '2', '3', '4', '5']
        }
df = pd.DataFrame(data)
df['StartTime'] = pd.to_datetime(df['StartTime'])
code:
df = df.loc[df.index.repeat(df['Duration'])]         # one row per minute of duration
mask = df.groupby('Site').cumcount()                 # 0, 1, 2, ... within each group
df['StartTime'] = df['StartTime'] + pd.to_timedelta(mask, unit='m')
df = df.sort_values('StartTime').assign(Duration=1)
df
output:
StartTime Duration Site
0 2018-12-30 12:45:00+11:00 1 1
1 2018-12-31 16:48:00+11:00 1 2
2 2019-01-01 04:36:00+11:00 1 3
2 2019-01-01 04:37:00+11:00 1 3
2 2019-01-01 04:38:00+11:00 1 3
3 2019-01-01 19:27:00+11:00 1 4
4 2019-01-02 05:13:00+11:00 1 5
4 2019-01-02 05:14:00+11:00 1 5
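One caveat worth noting: cumcount here is grouped on Site, which assumes each Site value appears in only one input row. If sites can repeat in the source data, counting within each original (repeated) row is a safer variant:
# defensive variant: count within each repeated index value instead of per Site
mask = df.groupby(level=0).cumcount()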
If you are running into memory issues, then you can also try dask. I have included @jlandercy's pandas answer and changed the syntax to dask, as I'm not sure the pandas operation index.repeat would work with dask. Here is the documentation on the functions/operations used in the code: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.read_sql_table
import dask.dataframe as dd
import pandas as pd

# read as a dask dataframe from CSV or SQL or another source
df = dd.read_csv(files)  # or: df = dd.read_sql_table(table, uri, index_col='StartTime')
df['offset'] = df['Duration'].apply(lambda x: list(range(x)))
df = df.explode('offset')  # explode is a method of the dataframe, not the dd module
df['offset'] = df['offset'].apply(lambda x: pd.Timedelta(x, unit='min'))  # Timedelta lives in pandas
df['StartTime'] += df['offset']
df['Duration'] = 1

Related

Group by list of different time ranges in Pandas

Edit: Changing example to use Timedelta indices.
I have a DataFrame of different time ranges that represent indices in my main DataFrame. eg:
import numpy as np
import pandas as pd

ranges = pd.DataFrame(data=np.array([[1, 10, 20], [3, 15, 30]]).T, columns=["Start", "Stop"])
ranges = ranges.apply(pd.to_timedelta, unit="s")
ranges
Start Stop
0 0 days 00:00:01 0 days 00:00:03
1 0 days 00:00:10 0 days 00:00:15
2 0 days 00:00:20 0 days 00:00:30
my_data = pd.DataFrame(data=list(range(0, 40*5, 5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")
I want to calculate the averages of the data in my_data for each of the time ranges in ranges. How can I do this?
One option would be as follows:
ranges.apply(lambda row: my_data.loc[row["Start"]:row["Stop"]].iloc[:-1].mean(), axis=1)
data
0 7.5
1 60.0
2 122.5
But can we do this without apply?
Here is one way to approach it:
Generate the timedeltas and concatenate into a single block:
# note the use of closed='left' (`Stop` is not included in the build)
timedelta = [pd.timedelta_range(a, b, closed='left', freq='1s')
             for a, b in zip(ranges.Start, ranges.Stop)]
timedelta = timedelta[0].append(timedelta[1:])
Get the grouping which will be used for the groupby and aggregation:
counts = ranges.Stop.sub(ranges.Start).dt.seconds
counts = np.arange(counts.size).repeat(counts)
Group by and aggregate:
my_data.loc[timedelta].groupby(counts).mean()
data
0 7.5
1 60.0
2 122.5
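Put end to end with the question's sample data, the whole thing reads as follows (a self-contained sketch):
import numpy as np
import pandas as pd

ranges = pd.DataFrame(data=np.array([[1, 10, 20], [3, 15, 30]]).T,
                      columns=["Start", "Stop"]).apply(pd.to_timedelta, unit="s")
my_data = pd.DataFrame(data=list(range(0, 40*5, 5)), columns=["data"])
my_data.index = pd.to_timedelta(my_data.index, unit="s")

# build the per-range second marks, then a matching group label per mark
timedelta = [pd.timedelta_range(a, b, closed='left', freq='1s')
             for a, b in zip(ranges.Start, ranges.Stop)]
timedelta = timedelta[0].append(timedelta[1:])
counts = np.arange(len(ranges)).repeat(ranges.Stop.sub(ranges.Start).dt.seconds)
print(my_data.loc[timedelta].groupby(counts).mean())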

Pandas - Add at least one row for every day (datetimes include a time)

Edit: You can use the alleged duplicate solution with reindex() if your dates don't include times, otherwise you need a solution like the one by @kosnik. In addition, their solution doesn't need your dates to be the index!
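For the date-only case, that reindex() approach looks roughly like this (a sketch; the sample values are illustrative):
import pandas as pd

# hypothetical frame with a date-only DatetimeIndex
s = pd.DataFrame({'Count': [8, 10]},
                 index=pd.to_datetime(['2017-02-12', '2017-02-15']))
full_range = pd.date_range(s.index.min(), s.index.max(), freq='D')
s = s.reindex(full_range, fill_value=0)  # inserts 02-13 and 02-14 with Count 0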
I have data formatted like this
df = pd.DataFrame(data=[['2017-02-12 20:25:00', 'Sam', '8'],
                        ['2017-02-15 16:33:00', 'Scott', '10'],
                        ['2017-02-15 16:45:00', 'Steve', '5']],
                  columns=['Datetime', 'Sender', 'Count'])
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y-%m-%d %H:%M:%S')
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-15 16:33:00 Scott 10
2 2017-02-15 16:45:00 Steve 5
I need there to be at least one row for every date, so the expected result would be
Datetime Sender Count
0 2017-02-12 20:25:00 Sam 8
1 2017-02-13 00:00:00 None 0
2 2017-02-14 00:00:00 None 0
3 2017-02-15 16:33:00 Scott 10
4 2017-02-15 16:45:00 Steve 5
I have tried to make datetime the index, add the dates and use reindex() like so
df.index = df['Datetime']
values = df['Datetime'].tolist()
for i in range(len(values) - 1):
    if values[i].date() + timedelta < values[i+1].date():  # timedelta being datetime.timedelta(days=1)
        values.insert(i + 1, pd.Timestamp(values[i].date() + timedelta))
print(df.reindex(values, fill_value=0))
This makes every row lose its other column values, and the same thing happens with asfreq('D') or resample()
ID Sender Count
Datetime
2017-02-12 16:25:00 0 Sam 8
2017-02-13 00:00:00 0 0 0
2017-02-14 00:00:00 0 0 0
2017-02-15 20:25:00 0 0 0
2017-02-15 20:25:00 0 0 0
What would be the appropriate way of going about this?
I would create a new DataFrame that contains all the required datetimes and then left join it with your data frame.
A working code example is the following
import datetime

import numpy as np
import pandas as pd

df['Datetime'] = pd.to_datetime(df['Datetime'])  # first convert to datetimes
datetimes = df['Datetime'].tolist()  # these are the existing datetimes - the missing ones get added here
dates = [x.date() for x in datetimes]  # these are converted to dates
min_date = min(dates)
max_date = max(dates)
for d in range((max_date - min_date).days):
    forward_date = min_date + datetime.timedelta(d)
    if forward_date not in dates:
        datetimes.append(np.datetime64(forward_date))
# create new dataframe, merge and fill 'Count' column with zeroes
df = pd.DataFrame({'Datetime': datetimes}).merge(df, on='Datetime', how='left')
df['Count'].fillna(0, inplace=True)
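If you'd rather avoid the explicit loop, here is a vectorized sketch of the same idea applied to the original df (the building blocks are standard pandas, but treat it as a starting point):
import pandas as pd

# find calendar days with no rows at all, then concat filler rows for just those days
days_present = pd.to_datetime(df['Datetime'].dt.date.unique())
all_days = pd.date_range(days_present.min(), days_present.max(), freq='D')
missing = all_days.difference(days_present)
filler = pd.DataFrame({'Datetime': missing, 'Sender': None, 'Count': 0})
df = pd.concat([df, filler]).sort_values('Datetime').reset_index(drop=True)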

(python) Using diff() function in a DataFrame

How can I use the diff() function, resetting the result to zero when the date in the current row differs from the date in the previous one?
For instance, I have the df below containing ts and value, when generating value_diff I can use:
df['value_diff'] = df.value.diff()
but in this case the row of index 4 will have value_diff = 200 and I need it to reset to zero because date has changed.
i ts value value_diff
0 2019-01-02 11:48:01.001 100 0
1 2019-01-02 14:26:01.001 150 50
2 2019-01-02 16:12:01.001 75 -75
3 2019-01-02 18:54:01.001 50 -25
4 2019-01-03 09:12:01.001 250 0
5 2019-01-03 12:25:01.001 310 60
6 2019-01-03 16:50:01.001 45 -265
7 2019-01-03 17:10:01.001 30 -15
I know I can build a loop for it, but I was wondering if it can be solved in a more fancy way, maybe using lambda functions.
You want to use groupby and then fillna to get the 0 values.
import pandas as pd
# Reading your example and getting back to correct format from clipboard
df = pd.read_clipboard()
df['ts'] = df['i'] + ' ' + df['ts']
df.drop(['i', 'value_diff'], axis=1, inplace=True) # The columns get misaligned from reading clipboard
# Now we have your original
print(df.head())
# Convert ts to datetime
df['ts'] = pd.to_datetime(df['ts'])
# Add a date column for us to groupby
df['date'] = df['ts'].dt.date
# Apply diff and fillna
df['value_diff'] = df.groupby('date')['value'].diff().fillna(0)
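For completeness, the same idea without the clipboard step, as a self-contained sketch using the question's numbers:
import pandas as pd

df = pd.DataFrame({
    'ts': pd.to_datetime(['2019-01-02 11:48:01.001', '2019-01-02 14:26:01.001',
                          '2019-01-02 16:12:01.001', '2019-01-02 18:54:01.001',
                          '2019-01-03 09:12:01.001', '2019-01-03 12:25:01.001',
                          '2019-01-03 16:50:01.001', '2019-01-03 17:10:01.001']),
    'value': [100, 150, 75, 50, 250, 310, 45, 30],
})
# diff within each calendar day; the first row of each day becomes NaN -> 0
df['value_diff'] = df.groupby(df['ts'].dt.date)['value'].diff().fillna(0)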

Pandas group by number (instead of time)

In pd.Grouper we can group by time, for example using 10s
Time Count
10:05:03 2
10:05:04 3
10:05:05 4
10:05:11 3
10:05:12 4
Will provide the result of:
Time Count
10:05:10 9
10:05:20 7
I'm looking for the other way around: can I group the time by count, for example using 5?
Count Time (s)
5 (4-3)=1s
5 (11-5)=6s
5 (12-11)=1s
Thanks a bunch!
Maybe this is what you have in mind. Start with a pandas Series df:
2018-03-14 06:38:46.308425+00:00 2
2018-03-14 06:38:47.308425+00:00 3
2018-03-14 06:38:48.308425+00:00 4
2018-03-14 06:38:54.308425+00:00 3
2018-03-14 06:38:55.308425+00:00 4
dtype: int64
Find the indices where the cumulative sum crosses a multiple of 5:
import numpy as np

df[:] = df.values.cumsum() // 5 * 5
hit5 = np.flatnonzero(df.diff() == 5)  # Series.nonzero() was removed in newer pandas
In this case it's array([1, 3, 4]). Then iterate over those indices and take the difference with the previous index:
for i in hit5:
    print(df.index[i] - df.index[i-1])
Giving:
0 days 00:00:01
0 days 00:00:06
0 days 00:00:01
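If you want the durations as data rather than printed output, they can be collected into a Series (a small follow-up, assuming pandas is imported as pd):
# collect the per-crossing durations into a timedelta Series
durations = pd.Series([df.index[i] - df.index[i - 1] for i in hit5])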
If I understand your question correctly, you can try
import io
import numpy as np
import pandas as pd
df_txt = """
Time Count
10:05:03 2
10:05:04 3
10:05:05 4
10:05:11 3
10:05:12 4"""
df = pd.read_csv(io.StringIO(df_txt), sep='\t')
df['Time'] = pd.to_datetime(df.Time)
df['CumCount'] = df.Count.cumsum()
df['Ind1'] = df.CumCount // 5
df['Ind2'] = df.Ind1.shift()
df['LagTime'] = df.Time.shift()
df.loc[df.Ind1 == df.Ind2, 'LagTime'] = np.nan
df['StartTime'] = df.LagTime.bfill()
out = df.groupby(['StartTime'], as_index=False).last()
out['Time (s)'] = out.Time.values - out.StartTime.values
Output:
print(out['Time (s)'])
# 0 00:00:01
# 1 00:00:06
# 2 00:00:01
# Name: Time (s), dtype: timedelta64[ns]

How Can I Detect Gaps and Consecutive Periods In A Time Series In Pandas

I have a pandas Dataframe that is indexed by Date. I would like to select all consecutive gaps by period and all consecutive days by Period. How can I do this?
Example of Dataframe with No Columns but a Date Index:
In [29]: import pandas as pd
In [30]: dates = pd.to_datetime(['2016-09-19 10:23:03', '2016-08-03 10:53:39','2016-09-05 11:11:30', '2016-09-05 11:10:46','2016-09-05 10:53:39'])
In [31]: ts = pd.DataFrame(index=dates)
As you can see there is a gap between 2016-08-03 and 2016-09-19. How do I detect these so I can create descriptive statistics, i.e. 40 gaps with a median gap duration of "x", etc.? Also, I can see that 2016-09-05 and 2016-09-06 form a two-day consecutive range. How can I detect these and also print descriptive stats?
Ideally the result would be returned as another Dataframe in each case, since I want to use other columns in the Dataframe to groupby.
Pandas has a built-in diff() method that works directly on datetime columns, which you can use to accomplish this. One benefit is that you can then use pandas Series functions like mean() to quickly compute summary statistics on the resulting gaps series.
from datetime import datetime, timedelta
import pandas as pd
# Construct dummy dataframe
dates = pd.to_datetime([
    '2016-08-03',
    '2016-08-04',
    '2016-08-05',
    '2016-08-17',
    '2016-09-05',
    '2016-09-06',
    '2016-09-07',
    '2016-09-19'])
df = pd.DataFrame(dates, columns=['date'])
# Take the diff of the first column (drop 1st row since it's undefined)
deltas = df['date'].diff()[1:]
# Filter diffs (here days > 1, but could be seconds, hours, etc)
gaps = deltas[deltas > timedelta(days=1)]
# Print results
print(f'{len(gaps)} gaps with average gap duration: {gaps.mean()}')
for i, g in gaps.items():  # iteritems() is deprecated in newer pandas
    gap_start = df['date'][i - 1]
    print(f'Start: {datetime.strftime(gap_start, "%Y-%m-%d")} | '
          f'Duration: {str(g.to_pytimedelta())}')
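Since gaps is a plain timedelta Series, the descriptive statistics the question asked about come essentially for free, e.g.:
# median and full summary over the gap durations
print(f'Median gap duration: {gaps.median()}')
print(gaps.describe())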
Here's something to get started:
df = pd.DataFrame(np.ones(5),columns = ['ones'])
df.index = pd.DatetimeIndex(['2016-09-19 10:23:03', '2016-08-03 10:53:39', '2016-09-05 11:11:30', '2016-09-05 11:10:46', '2016-09-06 10:53:39'])
daily_rng = pd.date_range('2016-08-03 00:00:00', periods=48, freq='D')
daily_rng = daily_rng.append(df.index)
daily_rng = sorted(daily_rng)
df = df.reindex(daily_rng).fillna(0)
df = df.astype(int)
df['ones'] = df['ones'].cumsum()
The cumsum() creates a grouping variable on 'ones', partitioning your data at the points you provided. If you print df (or dump it to a spreadsheet) it will make sense:
print(df.head())
ones
2016-08-03 00:00:00 0
2016-08-03 10:53:39 1
2016-08-04 00:00:00 1
2016-08-05 00:00:00 1
2016-08-06 00:00:00 1
print(df.tail())
ones
2016-09-16 00:00:00 4
2016-09-17 00:00:00 4
2016-09-18 00:00:00 4
2016-09-19 00:00:00 4
2016-09-19 10:23:03 5
now to complete:
df = df.reset_index()
# named aggregation (newer pandas removed the old nested-dict renaming syntax)
df = df.groupby('ones').agg(first_spotted=('index', 'min'), gaps=('index', 'count'))
which gives:
first_spotted gaps
ones
0 2016-08-03 00:00:00 1
1 2016-08-03 10:53:39 34
2 2016-09-05 11:10:46 1
3 2016-09-05 11:11:30 2
4 2016-09-06 10:53:39 14
5 2016-09-19 10:23:03 1
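From there, summary statistics over the runs are one step away, e.g.:
# distribution of consecutive-run lengths (in rows per group)
print(df['gaps'].describe())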
