Pandas custom re-sample for time series data - python

I have a time series data in 1 Min frequency. I would like re-sample the data for every 5 min and re-sample data should include the data of first time step, middle time step and last time step.
I have tried like this, but I am not getting what I am expecting...
def my_fun(array)
return array[0],array[-1]
df=pd.DataFrame(np.arange(60),index=pd.date_range('2017-01-01 00:00','2017-01-01 00:59', freq='1T'
df.resample('5T').apply(my_fun)

If I understood you correctly then you want the data for minutes 0,2,4,5,7,9,10,... in a new dataframe. A faster way than using resample may be:
df=pd.DataFrame(np.arange(60),index=pd.date_range('2017-01-01 00:00','2017-01-01 00:59', freq='1T'))
l = len(df)
df.loc[df.iloc[range(2,l,5)].index | df.iloc[range(4,l,5)].index | df.iloc[range(0,l,5)].index]
Output:
0
2017-01-01 00:00:00 0
2017-01-01 00:02:00 2
2017-01-01 00:04:00 4
2017-01-01 00:05:00 5
2017-01-01 00:07:00 7
2017-01-01 00:09:00 9
2017-01-01 00:10:00 10
If you just wanted a combined list of your selected data in one row then you were almost there:
def my_fun(array):
return [array[0], array[2], array[4]]
df=pd.DataFrame({'0':np.arange(60)}, index=pd.date_range('2017-01-01 00:00','2017-01-01 00:59', freq='1T'))
df.resample('5T').apply(my_fun)
Output:
0
2017-01-01 00:00:00 (0, 2, 4)
2017-01-01 00:05:00 (5, 7, 9)
2017-01-01 00:10:00 (10, 12, 14)

Related

Pandas upsampling does not include the 23 hours of last day in year

I have a time series dataframe with dates|weather information that looks like this:
2017-01-01 5
2017-01-02 10
.
.
2017-12-31 6
I am trying to upsample it to hourly data using the following:
weather.resample('H').pad()
I expected to see 8760 entries for 24 intervals * 365 days. However, it only returns 8737 with the last 23 intervals missing for 31st of december. Is there something special I need to do to get 24 intervals for the last day?
Thanks in advance.
Pandas normalizes 2017-12-31 to 2017-12-31 00:00 and then creates a range that ends in that last datetime... I would include a last row before resampling with
df.loc['2018-01-01'] = 0
Edit:
You can get the result you want with numpy.repeat
Take this df
np.random.seed(1)
weather = pd.DataFrame(index=pd.date_range('2017-01-01', '2017-12-31'),
data={'WEATHER_MAX': np.random.random(365)*15})
WEATHER_MAX
2017-01-01 6.255330
2017-01-02 10.804867
2017-01-03 0.001716
2017-01-04 4.534989
2017-01-05 2.201338
... ...
2017-12-27 4.503725
2017-12-28 2.145087
2017-12-29 13.519627
2017-12-30 8.123391
2017-12-31 14.621106
[365 rows x 1 columns]
By repeating on axis=1 you can then transform the default range(24) column names to hourly timediffs
# repeat, then stack
hourly = pd.DataFrame(np.repeat(weather.values, 24, axis=1),
index=weather.index).stack()
# combine date and hour
hourly.index = (
hourly.index.get_level_values(0) +
pd.to_timedelta(hourly.index.get_level_values(1), unit='h')
)
hourly = hourly.rename('WEATHER_MAX').to_frame()
Output
WEATHER_MAX
2017-01-01 00:00:00 6.255330
2017-01-01 01:00:00 6.255330
2017-01-01 02:00:00 6.255330
2017-01-01 03:00:00 6.255330
2017-01-01 04:00:00 6.255330
... ...
2017-12-31 19:00:00 14.621106
2017-12-31 20:00:00 14.621106
2017-12-31 21:00:00 14.621106
2017-12-31 22:00:00 14.621106
2017-12-31 23:00:00 14.621106
[8760 rows x 1 columns]
What to do and the reason are the same as #RichieV's answer.
However, the value to be used is not 0 or a meaningless value, it is necessary to use valid data actually measured on 2018-01-01.
This is because using a meaningless value reduces the effectiveness of the resampled 2017-12-31 data and the results derived using that data.
Prepare a valid value for 2018-01-01 at the end of the data.
Call resample.
Delete the data of 2018-01-01 after resample.
You will get 8670 data for 2017.
Look at #RichieV's modified answer:
I was misunderstanding the question.
My answer was to complement resample with interpolate etc.
resampleを用いた外挿 (データ補間) を行いたい
If the same value as 00:00 on the day is all right, it would be a different way of thinking.

Add time interval values in new column Pandas

I have a large pandas dataframe (40 million rows) with the following format :
ID DATETIME TIMESTAMP
81215545953683710540 2017-01-01 17:39:57 1483243205
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243353
I am trying to assign a value to each row depending on the timestamp so that i have group of rows with the same value if they are in the same time interval.
Let's say I have t0 = 1483243205 and I want a differently value when TIMESTAMP = t0+10 . So here my time interval would be of 10.
I would like something like that :
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261 5
48126186377367976994 2017-01-01 17:19:29 1483243263 5
23522333658893375671 2017-01-01 12:50:46 1483243266 6
16194691060240380504 2017-01-01 15:59:23 1483243288 8
Here is my code :
df['VALUE']=''
t=1483243205
j=0
for i in range(0,len(df['TIMESTAMP'])):
while(df.iloc[i][2])<(t+10):
df['VALUE'][i]=j
i+=1
t+=10
j+=1
I have a warning when executing my code (SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame) and I have the following result :
ID DATETIME TIMESTAMP VALUE
81215545953683710540 2017-01-01 17:39:57 1483243205 0
74994612102903447699 2017-01-01 19:14:12 1483243261
48126186377367976994 2017-01-01 17:19:29 1483243263
23522333658893375671 2017-01-01 12:50:46 1483243266
16194691060240380504 2017-01-01 15:59:23 1483243288
It is not the first time I encounter the warning and I always overcame it, but I am confused with the fact I only got a value for the first row.
Does anyone know what I am missing ?
Thanks
I would suggest using pandas' cut method to achieve this, preventing the need to explicitly loop through your DataFrame.
tmin, tmax = df['TIMESTAMP'].min(), df['TIMESTAMP'].max()
bins = [i for i in range(tmin, tmax+10, 10)]
labels = [i for i in range(len(bins)-1)]
df['VALUE'] = pd.cut(df['TIMESTAMP'], bins=bins, labels=labels, include_lowest=True)
ID DATETIME TIMESTAMP VALUE
0 81215545953683710540 2017-01-01 17:39:57 1483243205 0
1 74994612102903447699 2017-01-01 19:14:12 1483243261 5
2 48126186377367976994 2017-01-01 17:19:29 1483243263 5
3 23522333658893375671 2017-01-01 12:50:46 1483243266 6
4 16194691060240380504 2017-01-01 15:59:23 1483243288 8

Pandas groupby aggregation to truncate earliest date instead of oldest date

I'm trying to aggregate from the end of a date range instead of from the beginning. Despite the fact that I would think that adding closed='right' to the grouper would solve the issue, it doesn't. Please let me know how I can achieve my desired output shown at the bottom, thanks.
import pandas as pd
df = pd.DataFrame(columns=['date','number'])
df['date'] = pd.date_range('1/1/2000', periods=8, freq='T')
df['number'] = pd.Series(range(8))
df
date number
0 2000-01-01 00:00:00 0
1 2000-01-01 00:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 00:07:00 7
With the groupby and aggregation of the date I get the following. Since I have 8 dates and I'm grouping by periods of 3 it must choose whether to truncate the earliest date group or the oldest date group, and it chooses the oldest date group (the oldest date group has a count of 2):
df.groupby(pd.Grouper(key='date', freq='3T')).agg('count')
date number
2000-01-01 00:00:00 3
2000-01-01 00:03:00 3
2000-01-01 00:06:00 2
My desired output is to instead truncate the earliest date group:
date number
2000-01-01 00:00:00 2
2000-01-01 00:02:00 3
2000-01-01 00:05:00 3
Please let me know how this can be achieved, I'm hopeful there's just a parameter that can be set that I've overlooked. Note that this is similar to this question, but my question is specific to the date truncation.
EDIT: To reframe the question (thanks Alexdor) the default behavior in pandas is to bin by period [0, 3), [3, 6), [6, 9) but instead I'd like to bin by (-1, 2], (2, 5], (5, 8]
It seems like the grouper function build up the bins starting from the oldest time in the series that you pass to it. I couldn't see a way to make it build up the bins from the newest time, but it's fairly easy to construct the bins from scratch.
freq = '3min'
minTime = df.date.min()
maxTime = df.date.max()
deltaT = pd.Timedelta(freq)
minTime -= deltaT - (maxTime - minTime) % deltaT # adjust min time to start of first bin
r = pd.date_range(start=minTime, end=maxTime, freq=freq)
df.groupby(pd.cut(df["date"], r)).agg('count')
Gives
date date number
(1999-12-31 23:58:00, 2000-01-01 00:01:00] 2 2
(2000-01-01 00:01:00, 2000-01-01 00:04:00] 3 3
(2000-01-01 00:04:00, 2000-01-01 00:07:00] 3 3
This is one hack, which let's you group by a constant group size, counting bottom up.
from itertools import chain
def grouper(x, k=3):
n = len(df.index)
return list(chain.from_iterable([[0]*int(n//k)] + [[i]*k for i in range(1, int(n/k)+1)]))
df['grouper'] = grouper(df, 3)
res = df.groupby('grouper', as_index=False)\
.agg({'date': 'first', 'number': 'count'})\
.drop('grouper', 1)
# date number
# 0 2000-01-01 00:00:00 2
# 1 2000-01-01 00:02:00 3
# 2 2000-01-01 00:05:00 3

how to merge group rows in dataframe based on differences between datetime?

I have a dataframe with contains events on each row, with a Start and End datatime.
import pandas as pd
import datetime
df = pd.DataFrame({ 'Value' : [1.,2.,3.],
'Start' : [datetime.datetime(2017,1,1,0,0,0),datetime.datetime(2017,1,1,0,1,0),datetime.datetime(2017,1,1,0,4,0)],
'End' : [datetime.datetime(2017,1,1,0,0,59),datetime.datetime(2017,1,1,0,5,0),datetime.datetime(2017,1,1,0,6,00)]},
index=[0,1,2])
df
Out[7]:
End Start Value
0 2017-01-01 00:00:59 2017-01-01 00:00:00 1.0
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0
2 2017-01-01 00:07:00 2017-01-01 00:06:00 3.0
I would like to group consecutive rows where the the differences between End and Start of consecutive rows is smaller than a given timedelta.
e.g. here for a timedelta of 5 seconds I would like to group row with index 0,1 and with timedelta of 2 minutes it should yield in rows 0,1,2
A solution would be to compare consecutive rows with their shifted version using .shift(), however, I would need to iterate the comparison multiple times if groups of more than 2 rows need to be merged.
As my df is very large, this is not an option.
threshold = datetime.timedelta(minutes=5)
df['delta'] = df['End'] - df['Start']
df['group'] = (df['delta'] - df['delta'].shift(-1) <= threshold).cumsum()
groups = df.groupby('group')
i assume you try to aggregate based on time difference.
marker = 60
df = df.assign(diff=df.apply(lambda row:(row.End - row.Start).total_seconds() <= marker, axis=1))
for g in df.groupby('diff'):
print g[1]
End Start Value diff
1 2017-01-01 00:05:00 2017-01-01 00:01:00 2.0 False
2 2017-01-01 00:06:00 2017-01-01 00:04:00 3.0 False
End Start Value diff
0 2017-01-01 00:00:59 2017-01-01 1.0 True

Boxplot Pandas data

DataFrame is as follows:
ID1 ID2
0 00:00:01.002 00:00:01.002
1 00:00:01.001 00:00:01.006
2 00:00:01.004 00:00:01.011
3 00:00:00.998 00:00:01.012
4 NaT 00:00:01.000
...
20 NaT 00:00:00.998
What I am trying to do is create a boxplot for each ID. There may or may not be multiple IDs depending on the dataset I provide. For right now I am trying to solve this for 2 datasets. If possible I would like a solution that has all the data on the same boxplot and then another with the data displayed on its own boxplot per ID.
I am very new to pandas (trying to learn it...) and am just getting frustrated at how long this is taking to figure out... Here is my code...
deltaTime = pd.DataFrame() #Create blank df
for x in range(0, len(totIDs)):
ID = IDList[x]
df = pd.DataFrame(data[ID]).T
deltaT[ID] = pd.to_datetime(df[TIME_COL]).diff()
deltaT.boxplot()
Pretty simple just cant seem to get it do what I want in plotting a boxplot for each ID. I should not that data is given to me by a homegrown file reader that takes several complex files and sorts them into the data dictionary which is indexed by IDs.
I am running pandas version 0.14.0 and python version 2.7.7
I am not sure how this works in 0.14.0 version, because last is 0.19.2 - I recommend upgrade if possible:
#sample data
np.random.seed(180)
dates = pd.date_range('2017-01-01 10:11:20', periods=10, freq='T')
cols = ['ID1','ID2']
df = pd.DataFrame(np.random.choice(dates, size=(10,2)), columns=cols)
print (df)
ID1 ID2
0 2017-01-01 10:12:20 2017-01-01 10:17:20
1 2017-01-01 10:16:20 2017-01-01 10:20:20
2 2017-01-01 10:18:20 2017-01-01 10:17:20
3 2017-01-01 10:12:20 2017-01-01 10:16:20
4 2017-01-01 10:14:20 2017-01-01 10:18:20
5 2017-01-01 10:18:20 2017-01-01 10:19:20
6 2017-01-01 10:17:20 2017-01-01 10:12:20
7 2017-01-01 10:13:20 2017-01-01 10:17:20
8 2017-01-01 10:16:20 2017-01-01 10:11:20
9 2017-01-01 10:13:20 2017-01-01 10:19:20
Call DataFrame.diff and then convert timedeltas to total_seconds:
df = df.diff().apply(lambda x: x.dt.total_seconds())
print(df)
ID1 ID2
0 NaN NaN
1 240.0 180.0
2 120.0 -180.0
3 -360.0 -60.0
4 120.0 120.0
5 240.0 60.0
6 -60.0 -420.0
7 -240.0 300.0
8 180.0 -360.0
9 -180.0 480.0
Last use DataFrame.plot.box
df.plot.box()
You can also check docs.

Categories