Pandas custom re-sample for time series data

I have time series data at 1-minute frequency. I would like to resample it every 5 minutes, and each resampled row should include the values from the first, middle, and last time steps of the interval.
I tried the following, but I am not getting what I expect:
import numpy as np
import pandas as pd
def my_fun(array):
    return array[0], array[-1]
df = pd.DataFrame(np.arange(60), index=pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1T'))
df.resample('5T').apply(my_fun)

If I understood you correctly, you want the data for minutes 0, 2, 4, 5, 7, 9, 10, ... in a new dataframe. A faster way than using resample may be:
df = pd.DataFrame(np.arange(60), index=pd.date_range('2017-01-01 00:00', '2017-01-01 00:59', freq='1T'))
# Index.union is used here because the `|` operator on an Index is deprecated as a set union in newer pandas
df.loc[df.index[0::5].union(df.index[2::5]).union(df.index[4::5])]
Output:
0
2017-01-01 00:00:00 0
2017-01-01 00:02:00 2
2017-01-01 00:04:00 4
2017-01-01 00:05:00 5
2017-01-01 00:07:00 7
2017-01-01 00:09:00 9
2017-01-01 00:10:00 10
If you just wanted a combined list of your selected data in one row then you were almost there:
def my_fun(array):
    return [array[0], array[2], array[4]]
df=pd.DataFrame({'0':np.arange(60)}, index=pd.date_range('2017-01-01 00:00','2017-01-01 00:59', freq='1T'))
df.resample('5T').apply(my_fun)
Output:
0
2017-01-01 00:00:00 (0, 2, 4)
2017-01-01 00:05:00 (5, 7, 9)
2017-01-01 00:10:00 (10, 12, 14)
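A sketch of one way to pick the first, middle, and last row of every 5-minute bin without hard-coding positions, assuming the frame has a sorted DatetimeIndex. The helper name `first_mid_last` and the column name `value` are made up for this illustration, and `groupby` with `pd.Grouper` is swapped in for the `resample(...).apply(...)` call used above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"value": np.arange(60)},
    index=pd.date_range("2017-01-01 00:00", periods=60, freq="min"),
)

def first_mid_last(group):
    """Return the first, middle, and last rows of one bin (hypothetical helper)."""
    n = len(group)
    # a set removes duplicate positions when the bin has fewer than three rows
    positions = sorted({0, n // 2, n - 1})
    return group.iloc[positions]

# split into 5-minute bins, skip empty bins, then pick rows from each bin
parts = [
    first_mid_last(group)
    for _, group in df.groupby(pd.Grouper(freq="5min"))
    if len(group) > 0
]
picked = pd.concat(parts)
print(picked.head(7))
```

Because the positions are computed per bin, this should also cope with bins that are missing a few minutes, at the cost of a Python-level loop over the bins.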

Related

Pandas upsampling does not include the 23 hours of last day in year

I have a time series dataframe with dates|weather information that looks like this:
2017-01-01 5
2017-01-02 10
.
.
2017-12-31 6
I am trying to upsample it to hourly data using the following:
weather.resample('H').pad()
I expected to see 8760 entries for 24 intervals * 365 days. However, it only returns 8737 with the last 23 intervals missing for 31st of december. Is there something special I need to do to get 24 intervals for the last day?
Thanks in advance.

Pandas normalizes 2017-12-31 to 2017-12-31 00:00 and then creates a range that ends at that last datetime, so the remaining 23 hours of the final day are never generated. I would append one more row before resampling with
weather.loc[pd.Timestamp('2018-01-01')] = 0
Edit:
You can get the result you want with numpy.repeat
Take this df
np.random.seed(1)
weather = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-12-31'),
                       data={'WEATHER_MAX': np.random.random(365)*15})
WEATHER_MAX
2017-01-01 6.255330
2017-01-02 10.804867
2017-01-03 0.001716
2017-01-04 4.534989
2017-01-05 2.201338
... ...
2017-12-27 4.503725
2017-12-28 2.145087
2017-12-29 13.519627
2017-12-30 8.123391
2017-12-31 14.621106
[365 rows x 1 columns]
By repeating on axis=1 you can then transform the default range(24) column names to hourly timediffs
# repeat, then stack
hourly = pd.DataFrame(np.repeat(weather.values, 24, axis=1),
                      index=weather.index).stack()
# combine date and hour
hourly.index = (
    hourly.index.get_level_values(0) +
    pd.to_timedelta(hourly.index.get_level_values(1), unit='h')
)
hourly = hourly.rename('WEATHER_MAX').to_frame()
Output
WEATHER_MAX
2017-01-01 00:00:00 6.255330
2017-01-01 01:00:00 6.255330
2017-01-01 02:00:00 6.255330
2017-01-01 03:00:00 6.255330
2017-01-01 04:00:00 6.255330
... ...
2017-12-31 19:00:00 14.621106
2017-12-31 20:00:00 14.621106
2017-12-31 21:00:00 14.621106
2017-12-31 22:00:00 14.621106
2017-12-31 23:00:00 14.621106
[8760 rows x 1 columns]
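An alternative sketch that reaches the same 8760 rows by building the full hourly index explicitly and forward-filling onto it with `reindex`. The placeholder values and the variable names `one_hour` and `hourly_index` are assumptions for the illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

weather = pd.DataFrame(
    {"WEATHER_MAX": np.arange(365, dtype=float)},  # placeholder values, not real data
    index=pd.date_range("2017-01-01", "2017-12-31", freq="D"),
)

one_hour = pd.Timedelta(hours=1)
# extend the end of the range to 23:00 on the last day
hourly_index = pd.date_range(
    start=weather.index.min(),
    end=weather.index.max() + 23 * one_hour,
    freq=one_hour,
)
# each hour takes the most recent daily value
hourly = weather.reindex(hourly_index, method="ffill")
print(len(hourly))
```

This keeps the forward-fill semantics of the original `resample` call while making the end of the range explicit.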

The approach and the reasoning are the same as in @RichieV's answer.
However, the appended value should not be 0 or some other placeholder; it should be a valid value actually measured on 2018-01-01.
A meaningless value degrades the resampled 2017-12-31 data and any results derived from it.
Prepare a valid value for 2018-01-01 at the end of the data.
Call resample.
Delete the 2018-01-01 row after resampling.
You will get 8760 rows for 2017.
After reading @RichieV's edited answer:
I had misunderstood the question.
My answer was meant to complement resample with interpolate and similar methods.
I want to extrapolate (interpolate data) using resample
If repeating the day's 00:00 value for every hour is acceptable, that calls for a different approach.
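The three steps above (append a real next-day value, resample, drop the extra row) might look like the following sketch. The three-day sample and the value 7.0 for the following day are invented for illustration; in practice that value would be an actual measurement.

```python
import pandas as pd

weather = pd.DataFrame(
    {"WEATHER_MAX": [5.0, 10.0, 6.0]},
    index=pd.date_range("2017-12-29", periods=3, freq="D"),
)

# step 1: append a value for the day after the last one (hypothetical measurement)
next_day = weather.index[-1] + pd.Timedelta(days=1)
extra = pd.DataFrame({"WEATHER_MAX": [7.0]}, index=[next_day])
extended = pd.concat([weather, extra])

# step 2: upsample to hourly and interpolate linearly between daily values
hourly = extended.resample(pd.Timedelta(hours=1)).interpolate()

# step 3: drop the helper row so only the original days remain
hourly = hourly.loc[hourly.index < next_day]
print(hourly.tail(3))
```

With linear interpolation the appended value shapes every hour of the last day, which is why a placeholder such as 0 would distort the result.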

Add time interval values in new column Pandas

I have a large pandas dataframe (40 million rows) with the following format :
ID DATETIME TIMESTAMP
81215545953683710540 2017-01-01 17:39:57 1483243205
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243288
I am trying to assign a value to each row based on its timestamp, so that rows in the same time interval share the same value.
Say t0 = 1483243205 and I want a different value once TIMESTAMP reaches t0 + 10, so the interval width is 10.
I would like something like this:
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261 5
48126186377367976994 2017-01-01 17:19:29 1483243263 5
23522333658893375671 2017-01-01 12:50:46 1483243266 6
16194691060240380504 2017-01-01 15:59:23 1483243288 8
Here is my code :
df['VALUE']=''
t=1483243205
j=0
for i in range(0,len(df['TIMESTAMP'])):
    while(df.iloc[i][2])<(t+10):
        df['VALUE'][i]=j
        i+=1
    t+=10
    j+=1
I have a warning when executing my code (SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame) and I have the following result :
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243288
It is not the first time I encounter the warning and I always overcame it, but I am confused with the fact I only got a value for the first row.
Does anyone know what I am missing ?
Thanks

I would suggest using pandas' cut method to achieve this, preventing the need to explicitly loop through your DataFrame.
tmin, tmax = df['TIMESTAMP'].min(), df['TIMESTAMP'].max()
bins = list(range(tmin, tmax+10, 10))
labels = list(range(len(bins)-1))
df['VALUE'] = pd.cut(df['TIMESTAMP'], bins=bins, labels=labels, include_lowest=True)
ID DATETIME TIMESTAMP VALUE
0 81215545953683710540 2017-01-01 17:39:57 1483243205 0
1 74994612102903447699 2017-01-01 19:14:12 1483243261 5
2 48126186377367976994 2017-01-01 17:19:29 1483243263 5
3 23522333658893375671 2017-01-01 12:50:46 1483243266 6
4 16194691060240380504 2017-01-01 15:59:23 1483243288 8
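For fixed-width intervals, a sketch using plain integer division gives the same labels without building bins. It assumes the timestamps are integers and that the first interval starts at the smallest timestamp; `interval` is a name introduced here. With `pd.cut`'s default right-closed bins, a timestamp exactly equal to t0 + 10 falls into bin 0, whereas integer division puts it in bin 1, which is what the question asks for.

```python
import pandas as pd

df = pd.DataFrame(
    {"TIMESTAMP": [1483243205, 1483243261, 1483243263, 1483243266, 1483243288]}
)

t0 = df["TIMESTAMP"].min()
interval = 10  # width of each time interval, in the same unit as TIMESTAMP

# rows whose timestamps fall in the same [t0 + k*interval, t0 + (k+1)*interval) share k
df["VALUE"] = (df["TIMESTAMP"] - t0) // interval
print(df)
```

This stays vectorized, so it should scale to tens of millions of rows far better than a Python loop.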

Pandas groupby aggregation to truncate earliest date instead of oldest date

I'm trying to aggregate from the end of a date range instead of from the beginning. Despite the fact that I would think that adding closed='right' to the grouper would solve the issue, it doesn't. Please let me know how I can achieve my desired output shown at the bottom, thanks.
import pandas as pd
df = pd.DataFrame(columns=['date','number'])
df['date'] = pd.date_range('1/1/2000', periods=8, freq='T')
df['number'] = pd.Series(range(8))
df
date number
0 2000-01-01 00:00:00 0
1 2000-01-01 00:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 00:07:00 7
With the groupby and aggregation of the date I get the following. Since I have 8 dates and I'm grouping by periods of 3, it must truncate either the earliest group or the latest one, and it truncates the latest (the last group has a count of 2):
df.groupby(pd.Grouper(key='date', freq='3T')).agg('count')
date number
2000-01-01 00:00:00 3
2000-01-01 00:03:00 3
2000-01-01 00:06:00 2
My desired output is to instead truncate the earliest date group:
date number
2000-01-01 00:00:00 2
2000-01-01 00:02:00 3
2000-01-01 00:05:00 3
Please let me know how this can be achieved, I'm hopeful there's just a parameter that can be set that I've overlooked. Note that this is similar to this question, but my question is specific to the date truncation.
EDIT: To reframe the question (thanks Alexdor) the default behavior in pandas is to bin by period [0, 3), [3, 6), [6, 9) but instead I'd like to bin by (-1, 2], (2, 5], (5, 8]

It seems that the Grouper builds its bins starting from the oldest time in the series that you pass to it. I couldn't find a way to make it build the bins from the newest time, but it's fairly easy to construct the bins from scratch.
freq = '3min'
minTime = df.date.min()
maxTime = df.date.max()
deltaT = pd.Timedelta(freq)
minTime -= deltaT - (maxTime - minTime) % deltaT # adjust min time to start of first bin
r = pd.date_range(start=minTime, end=maxTime, freq=freq)
df.groupby(pd.cut(df["date"], r)).agg('count')
Gives
date date number
(1999-12-31 23:58:00, 2000-01-01 00:01:00] 2 2
(2000-01-01 00:01:00, 2000-01-01 00:04:00] 3 3
(2000-01-01 00:04:00, 2000-01-01 00:07:00] 3 3

This is one hack, which lets you group by a constant group size, counting from the bottom up.
from itertools import chain
def grouper(x, k=3):
    n = len(x.index)
    # the first n % k rows form the short group 0; the rest fill groups of size k
    return list(chain.from_iterable([[0]*(n % k)] + [[i]*k for i in range(1, n//k + 1)]))
df['grouper'] = grouper(df, 3)
res = df.groupby('grouper', as_index=False)\
        .agg({'date': 'first', 'number': 'count'})\
        .drop(columns='grouper')
# date number
# 0 2000-01-01 00:00:00 2
# 1 2000-01-01 00:02:00 3
# 2 2000-01-01 00:05:00 3
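A time-based variant of the same idea, sketched under the assumption that bins should be anchored at the latest timestamp: the bin number is the whole count of 3-minute widths between each row and the maximum date. The names `width` and `bins_from_end` are introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "date": pd.date_range("2000-01-01", periods=8, freq="min"),
        "number": range(8),
    }
)

width = pd.Timedelta(minutes=3)
# distance of each row from the newest timestamp, in whole bin widths
bins_from_end = (df["date"].max() - df["date"]) // width

res = (
    df.groupby(bins_from_end.rename("bin"))
      .agg(date=("date", "first"), number=("number", "count"))
      .sort_values("date")
      .reset_index(drop=True)
)
print(res)
```

Unlike the positional hack, this groups by elapsed time, so irregularly spaced rows still end up in the right 3-minute window.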

how to merge group rows in dataframe based on differences between datetime?

I have a dataframe that contains one event per row, each with a Start and End datetime.
import pandas as pd
import datetime
df = pd.DataFrame({ 'Value' : [1.,2.,3.],
                    'Start' : [datetime.datetime(2017,1,1,0,0,0),datetime.datetime(2017,1,1,0,1,0),datetime.datetime(2017,1,1,0,4,0)],
                    'End' : [datetime.datetime(2017,1,1,0,0,59),datetime.datetime(2017,1,1,0,5,0),datetime.datetime(2017,1,1,0,6,00)]},
                  index=[0,1,2])
df
Out[7]:
End Start Value
0 2017-01-01 00:00:59 2017-01-01 00:00:00 1.0
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0
2 2017-01-01 00:07:00 2017-01-01 00:06:00 3.0
I would like to group consecutive rows where the difference between the End of one row and the Start of the next is smaller than a given timedelta.
E.g. here, for a timedelta of 5 seconds I would like to group the rows with index 0 and 1, and with a timedelta of 2 minutes it should yield rows 0, 1, 2.
A solution would be to compare consecutive rows with their shifted version using .shift(); however, I would need to iterate the comparison multiple times if groups of more than 2 rows need to be merged.
As my df is very large, this is not an option.

threshold = datetime.timedelta(minutes=5)
# gap between each row's Start and the previous row's End
df['gap'] = df['Start'] - df['End'].shift()
# a gap larger than the threshold starts a new group
df['group'] = (df['gap'] > threshold).cumsum()
groups = df.groupby('group')
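A self-contained sketch of gap-based grouping followed by an actual merge of each group into one row. It assumes rows are sorted by Start and that no event is nested inside an earlier, longer one. The helper name `merge_events`, the sample values (chosen to mirror the table printed in the question), and the choice of summing Value are all assumptions for the illustration.

```python
import pandas as pd

events = pd.DataFrame(
    {
        "Value": [1.0, 2.0, 3.0],
        "Start": pd.to_datetime(
            ["2017-01-01 00:00:00", "2017-01-01 00:01:00", "2017-01-01 00:06:00"]
        ),
        "End": pd.to_datetime(
            ["2017-01-01 00:00:59", "2017-01-01 00:05:00", "2017-01-01 00:07:00"]
        ),
    }
)

def merge_events(frame, threshold):
    """Merge consecutive rows whose gap (Start minus previous End) is within threshold."""
    frame = frame.sort_values("Start")
    gap = frame["Start"] - frame["End"].shift()
    # the comparison is False for the first row (NaT), so it stays in group 0
    group = (gap > threshold).cumsum().rename("group")
    return (
        frame.groupby(group)
             .agg(Start=("Start", "min"), End=("End", "max"), Value=("Value", "sum"))
             .reset_index(drop=True)
    )

merged_5s = merge_events(events, pd.Timedelta(seconds=5))
merged_2min = merge_events(events, pd.Timedelta(minutes=2))
print(merged_5s)
print(merged_2min)
```

The single `shift` plus `cumsum` pass handles groups of any length, so no repeated pairwise comparison is needed.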

I assume you are trying to aggregate based on the time difference.
marker = 60
df = df.assign(diff=df.apply(lambda row: (row.End - row.Start).total_seconds() <= marker, axis=1))
for g in df.groupby('diff'):
    print(g[1])
End Start Value diff
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0 False
2 2017-01-01 00:06:00 2017-01-01 00:04:00 3.0 False
End Start Value diff
0 2017-01-01 00:00:59 2017-01-01 1.0 True

Boxplot Pandas data

DataFrame is as follows:
ID1 ID2
0 00:00:01.002 00:00:01.002
1 00:00:01.001 00:00:01.006
2 00:00:01.004 00:00:01.011
3 00:00:00.998 00:00:01.012
4 NaT 00:00:01.000
...
20 NaT 00:00:00.998
What I am trying to do is create a boxplot for each ID. There may or may not be multiple IDs depending on the dataset I provide. For right now I am trying to solve this for 2 datasets. If possible I would like a solution that has all the data on the same boxplot and then another with the data displayed on its own boxplot per ID.
I am very new to pandas (trying to learn it...) and am just getting frustrated at how long this is taking to figure out... Here is my code...
deltaT = pd.DataFrame()  # create blank df
for x in range(0, len(totIDs)):
    ID = IDList[x]
    df = pd.DataFrame(data[ID]).T
    deltaT[ID] = pd.to_datetime(df[TIME_COL]).diff()
deltaT.boxplot()
It is pretty simple, but I just can't seem to get it to plot a boxplot for each ID. I should note that the data is given to me by a homegrown file reader that takes several complex files and sorts them into the data dictionary, which is indexed by ID.
I am running pandas version 0.14.0 and python version 2.7.7

I am not sure how this works in version 0.14.0, because the latest is 0.19.2. I recommend upgrading if possible:
#sample data
np.random.seed(180)
dates = pd.date_range('2017-01-01 10:11:20', periods=10, freq='T')
cols = ['ID1','ID2']
df = pd.DataFrame(np.random.choice(dates, size=(10,2)), columns=cols)
print (df)
ID1 ID2
0 2017-01-01 10:12:20 2017-01-01 10:17:20
1 2017-01-01 10:16:20 2017-01-01 10:20:20
2 2017-01-01 10:18:20 2017-01-01 10:17:20
3 2017-01-01 10:12:20 2017-01-01 10:16:20
4 2017-01-01 10:14:20 2017-01-01 10:18:20
5 2017-01-01 10:18:20 2017-01-01 10:19:20
6 2017-01-01 10:17:20 2017-01-01 10:12:20
7 2017-01-01 10:13:20 2017-01-01 10:17:20
8 2017-01-01 10:16:20 2017-01-01 10:11:20
9 2017-01-01 10:13:20 2017-01-01 10:19:20
Call DataFrame.diff and then convert timedeltas to total_seconds:
df = df.diff().apply(lambda x: x.dt.total_seconds())
print(df)
ID1 ID2
0 NaN NaN
1 240.0 180.0
2 120.0 -180.0
3 -360.0 -60.0
4 120.0 120.0
5 240.0 60.0
6 -60.0 -420.0
7 -240.0 300.0
8 180.0 -360.0
9 -180.0 480.0
Finally, use DataFrame.plot.box:
df.plot.box()
You can also check docs.
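A sketch covering both plots the question asks for: all IDs on one axes, and one axes per ID. It assumes a reasonably recent pandas with matplotlib installed. The sample timestamps, the `times` dictionary, and the helper `draw_boxplots` are invented for illustration, and the IDs deliberately have different lengths to mimic the NaT padding in the question.

```python
import pandas as pd

# hypothetical per-ID timestamps; ID1 has fewer samples than ID2
times = {
    "ID1": pd.to_datetime(
        ["2017-01-01 10:00:00.000", "2017-01-01 10:00:01.002", "2017-01-01 10:00:02.003"]
    ),
    "ID2": pd.to_datetime(
        [
            "2017-01-01 10:00:00.000",
            "2017-01-01 10:00:01.002",
            "2017-01-01 10:00:02.008",
            "2017-01-01 10:00:03.019",
        ]
    ),
}

# one column per ID holding the time between consecutive samples, in seconds
delta_seconds = pd.DataFrame(
    {name: pd.Series(stamps).diff().dt.total_seconds() for name, stamps in times.items()}
)

def draw_boxplots(frame):
    """Draw one combined boxplot and one boxplot per column (needs matplotlib)."""
    combined = frame.plot.box()               # single axes, one box per ID
    separate = frame.plot.box(subplots=True)  # one axes per ID
    return combined, separate

print(delta_seconds)
```

Calling `draw_boxplots(delta_seconds)` and then `matplotlib.pyplot.show()` in a script should display both figures; missing values from the shorter ID are skipped when the boxes are computed.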
