merging pandas DataFrames after resampling - python

I have a DataFrame with a datetime index.
df1 = pd.DataFrame(index=pd.date_range('20100201', periods=24, freq='8h3min'),
                   data=np.random.rand(24), columns=['Rubbish'])
df1.index = df1.index.to_datetime()
I want to resample this DataFrame, as in:
df1 = df1.resample('7D').agg(np.median)
Then I have another DataFrame, whose index has a different frequency and starts at a different hour:
df2 = pd.DataFrame(index=pd.date_range('20100205', periods=24, freq='6h3min'),
                   data=np.random.rand(24), columns=['Rubbish'])
df2.index = df2.index.to_datetime()
df2 = df2.resample('7D').agg(np.median)
The operations work well independently, but when I try to merge the results using
print(pd.merge(df1,df2,right_index=True,left_index=True,how='outer'))
I get:
Rubbish_x Rubbish_y
2010-02-01 0.585986 NaN
2010-02-05 NaN 0.423316
2010-02-08 0.767499 NaN
Instead, I would like to resample both with the same offset and get the following result after the merge:
Rubbish_x Rubbish_y
2010-02-01 AVALUE AVALUE
2010-02-08 AVALUE AVALUE
I have tried the following, but it only generates NaNs:
df2.reindex(df1.index)
print(pd.merge(df1,df2,right_index=True,left_index=True,how='outer'))
I have to stick to pandas 0.20.1.
I have also tried merge_asof:
df1.index
Out[48]: Index([2015-03-24, 2015-03-31, 2015-04-07, 2015-04-14, 2015-04-21, 2015-04-28], dtype='object')
df2.index
Out[49]: Index([2015-03-24, 2015-03-31, 2015-04-07, 2015-04-14, 2015-04-21, 2015-04-28], dtype='object')
output=pd.merge_asof(df1,df2,left_index=True,right_index=True)
but it crashes with the following traceback:
Traceback (most recent call last):
TypeError: 'NoneType' object is not callable

I believe you need merge_asof:
print(pd.merge_asof(df1,df2,right_index=True,left_index=True))
Rubbish_x Rubbish_y
2010-02-01 0.446505 NaN
2010-02-08 0.474330 0.606826
Or use the method='nearest' parameter of reindex:
df2 = df2.reindex(df1.index, method='nearest')
print (df2)
Rubbish
2010-02-01 0.415248
2010-02-08 0.415248
print(pd.merge(df1,df2,right_index=True,left_index=True,how='outer'))
Rubbish_x Rubbish_y
2010-02-01 0.431966 0.415248
2010-02-08 0.279121 0.415248
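A further option is to let merge_asof match each bin of df1 to the nearest bin of df2 instead of only looking backward. This is just a sketch; the direction parameter was added to merge_asof in pandas 0.20.0, so it should be available in 0.20.1:
# match each 7D bin of df1 to the nearest 7D bin of df2
out = pd.merge_asof(df1, df2, left_index=True, right_index=True, direction='nearest')
print(out)
Note that merge_asof needs sorted, compatible (e.g. datetime) indexes; the object-dtype index shown in the question may be what triggered the TypeError there.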

I think the following code would achieve your task.
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Freq: T, dtype: int64
>>> series.resample('3T').sum()
2000-01-01 00:00:00 3
2000-01-01 00:03:00 12
2000-01-01 00:06:00 21
Freq: 3T, dtype: int64
https://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.DataFrame.resample.html
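Another way to guarantee that both frames share exactly the same 7D bin edges is to join the raw (un-resampled) frames first and resample the combined frame once. This is only a sketch; the Rubbish_x / Rubbish_y renames are just there to mirror the merge output above:
# concatenate the raw frames, then resample once so both columns share
# 7D bins anchored at the earliest timestamp (2010-02-01 here)
combined = pd.concat([df1.rename(columns={'Rubbish': 'Rubbish_x'}),
                      df2.rename(columns={'Rubbish': 'Rubbish_y'})], axis=1)
print(combined.resample('7D').median())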

Related

How to use groupby() with between_time()?

I have a DataFrame and want to multiply all values in column a for a certain day by the value of a at 06:00:00 of that day. If there is no 06:00:00 entry, that day should stay unchanged.
The code below unfortunately gives an error.
How do I correct this code or replace it with a working solution?
import pandas as pd
import numpy as np
start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)
datetime1 = pd.to_datetime(t)
df = pd.DataFrame( {'a':[1,3,4,5,6,7,8,9,14]})
df['date']= datetime1
print(df)
def myF(x):
    y = x.set_index('date').between_time('05:59', '06:01').a
    return y
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
The output of print(df) is:
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
and the transform call fails with:
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here since it implicitly passes all the columns of each group as a DataFrame to the custom function.
To read more about the difference between the two methods, see this answer.
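Here is a toy sketch (unrelated to the data above) of the difference just described: a function that needs more than one column works under apply but not under transform:
import pandas as pd

toy = pd.DataFrame({'g': ['a', 'a', 'b'], 'x': [1, 2, 3], 'y': [4, 5, 6]})

# apply: each group arrives as a DataFrame, so cross-column logic works
print(toy.groupby('g').apply(lambda grp: grp['x'] + grp['y']))

# transform: the function ends up seeing one column at a time as a Series,
# so the same cross-column logic fails -- just like .set_index('date') above
# toy.groupby('g').transform(lambda grp: grp['x'] + grp['y'])   # raises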
If you want to stick with the between_time(...) function, this would be the way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
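The snippet above only demonstrates selecting and modifying the 06:00 rows; to perform the multiplication the question actually asks for, one possible completion (an assumption-laden sketch, not part of the original answer) is to turn the 06:00 value of each day into a per-day factor:
df = df.set_index('date')
six = df.between_time('05:59', '06:01')['a']           # the 06:00 value of each day
six.index = six.index.floor('D')                       # key it by the day
days = pd.Series(df.index.floor('D'), index=df.index)
df['a'] = df['a'] * days.map(six).fillna(1)            # days without 06:00 stay unchanged
df = df.reset_index()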
If I got your goal right, you can use apply to return a dataframe with the same number of rows as the original dataframe (simulating a transform):
def myF(grp):
    time = grp.date.dt.strftime('%T')
    target_idx = time == '06:00:00'
    if target_idx.any():
        grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
    else:
        grp.loc[~target_idx, 'a_sum'] = np.nan
    return grp
df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
See that, for each day, each value with a time other than 06:00:00 is multiplied by the value with time equal to 06:00:00. It returns NaN for the 06:00:00 values themselves, as well as for groups without this time.
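Since the question also requires that days without a 06:00:00 entry stay unchanged, a small follow-up sketch (built on the a_sum column produced above, and only one possible interpretation) is to fall back to the original a wherever the multiplication produced NaN:
result = df.groupby(df.date.dt.floor('D')).apply(myF)
# keep the multiplied value where it exists, otherwise fall back to the original 'a'
result['a_sum'] = result['a_sum'].fillna(result['a'])
print(result)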

Sampling dataframe Considering NaN values+Pandas

I have a data frame like the one below. I want to do sampling with '3S'.
There are situations where NaN is present. What I expect is that the data frame should be sampled with '3S', and if any NaN is found in between, the sampling should stop there and restart from that index. I tried using the dataframe.apply method to achieve this, but it looks very complex. Is there any shorter way to achieve it?
df.sample(n=3)
Code to generate Input:
index = pd.date_range('1/1/2000', periods=13, freq='T')
series = pd.DataFrame(range(13), index=index)
print series
series.iloc[4] = 'NaN'
series.iloc[10] = 'NaN'
I tried to do the sampling, but after that I have no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled based on '3S', should take any 'NaN' into account, and should restart the sampling where 'NaN' records are found.
Expected Output:
2015-01-01 02:00:00 2.0 -- Sampling after 3S
2015-01-01 03:00:00 2.0 -- Print because NaN has found in Next
2015-01-01 04:00:00 NaN -- print NaN record
2015-01-01 07:00:00 4.0 -- Sampling after 3S
2015-01-01 08:00:00 4.0 -- Print because NaN has found in Next
2015-01-01 09:00:00 NaN -- print NaN record
2015-01-01 12:00:00 4.0 -- Sampling after 3S
Use:
index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print (df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()
s1 = m.ne(m.shift()).cumsum()
t = pd.Timedelta(2, unit='H')
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]
print (df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of missing values with Series.isna.
Create groups of consecutive values by comparing with shifted values using Series.ne (!=):
print (s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value per group, add a timedelta (for the expected output, 2 hours are added) and compare with the DatetimeIndex.
Finally, filter by boolean indexing, chaining both masks with | (bitwise OR).
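To make the middle step concrete, here is a small debugging sketch (the names mirror the answer above) that prints, for every row, its run number, the first timestamp of that run, and the resulting keep/drop decision:
starts = df.groupby(s1)['col'].transform(lambda x: x.index[0])   # first timestamp of each run
debug = pd.DataFrame({'group': s1,
                      'run_start': starts,
                      'keep': (df.index >= starts + t) | m})
print(debug)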
One way would be to fill the NAs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
and then do the resampling on the series (if datetime is your index):
series.resample('30S').asfreq()

how to understand closed and label arguments in pandas resample method?

Based on the pandas documentation from here: Docs
And the examples:
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)
>>> series
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Freq: T, dtype: int64
After resampling:
>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00 0
2000-01-01 00:03:00 6
2000-01-01 00:06:00 15
2000-01-01 00:09:00 15
In my understanding, the bins should look like this after resampling:
=========bin 01=========
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
=========bin 02=========
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
=========bin 03=========
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
Am I right about this step?
So after .sum I thought it should be like this:
2000-01-01 00:02:00 3
2000-01-01 00:05:00 12
2000-01-01 00:08:00 21
I just do not understand how it comes out:
2000-01-01 00:00:00 0
(because with label='right', 2000-01-01 00:00:00 cannot be the right edge of any bin in this case).
2000-01-01 00:09:00 15
(the label 2000-01-01 00:09:00 does not even exist in the original Series).
Short answer: If you use closed='left' and loffset='2T' then you'll get what you expected:
series.resample('3T', label='left', closed='left', loffset='2T').sum()
2000-01-01 00:02:00 3
2000-01-01 00:05:00 12
2000-01-01 00:08:00 21
Long answer (or why the results you got were correct, given the arguments you used): this may not be clear from the documentation, but open and closed in this setting are about strict vs non-strict inequality (e.g. < vs <=).
An example should make this clear. Using an interior interval from your example, this is the difference from changing the value of closed:
closed='right' => ( 3:00, 6:00 ] or 3:00 < x <= 6:00
closed='left' => [ 3:00, 6:00 ) or 3:00 <= x < 6:00
You can find an explanation of the interval notation (parentheses vs brackets) in many places like here, for example:
https://en.wikipedia.org/wiki/Interval_(mathematics)
The label parameter merely controls whether the left (3:00) or right (6:00) side is displayed, but doesn't impact the results themselves.
Also note that you can change the starting point for the intervals with the loffset parameter (which should be entered as a time delta).
Back to the example, where we change only the labeling from 'right' to 'left':
series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00 0
2000-01-01 00:03:00 6
2000-01-01 00:06:00 15
2000-01-01 00:09:00 15
series.resample('3T', label='left', closed='right').sum()
1999-12-31 23:57:00 0
2000-01-01 00:00:00 6
2000-01-01 00:03:00 15
2000-01-01 00:06:00 15
As you can see, the results are the same, only the index label changes. Pandas only lets you display the right or left label, but if it showed both, then it would look like this (below I'm using standard index notation where ( on the left side means open and ] on the right side means closed):
( 1999-12-31 23:57:00, 2000-01-01 00:00:00 ] 0 # = 0
( 2000-01-01 00:00:00, 2000-01-01 00:03:00 ] 6 # = 1+2+3
( 2000-01-01 00:03:00, 2000-01-01 00:06:00 ] 15 # = 4+5+6
( 2000-01-01 00:06:00, 2000-01-01 00:09:00 ] 15 # = 7+8
Note that the first bin (23:57:00,00:00:00] is not empty, it's just that it contains a single row and the value in that single row is zero. If you change 'sum' to 'count' this becomes more obvious:
series.resample('3T', label='left', closed='right').count()
1999-12-31 23:57:00 1
2000-01-01 00:00:00 3
2000-01-01 00:03:00 3
2000-01-01 00:06:00 2
Per JohnE's answer I put together a little helpful infographic which should settle this issue once and for all (infographic not reproduced here).
It is important to understand that resampling works by first producing a raster, which is a sequence of instants (not periods, intervals, or durations), and this is done independently of the 'label' and 'closed' parameters. It uses only the 'freq' and 'loffset' parameters. In your case, the system will produce the following raster:
2000-01-01 00:00:00
2000-01-01 00:03:00
2000-01-01 00:06:00
2000-01-01 00:09:00
Note again that at this moment there is no interpretation in terms of intervals or periods. You can shift it using 'loffset'.
Then the system will use the 'closed' parameter in order to choose between two options:
(start, end]
[start, end)
Here start and end are two adjacent timestamps in the raster. The 'label' parameter is used to choose whether start or end is used as the representative of the interval.
In your example, if you choose closed='right' then you will get the following intervals:
( previous_interval , 2000-01-01 00:00:00] - {0}
(2000-01-01 00:00:00, 2000-01-01 00:03:00] - {1,2,3}
(2000-01-01 00:03:00, 2000-01-01 00:06:00] - {4,5,6}
(2000-01-01 00:06:00, 2000-01-01 00:09:00] - {7,8}
Note that after you aggregate the values over these intervals, the result is displayed in two versions depending on the 'label' parameter, that is, whether one and the same interval is represented by its left or right time stamp.
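For completeness, a short check that the binning the OP expected is exactly what closed='left' (the default for this frequency) produces, only with left-edge labels:
series.resample('3T', label='left', closed='left').sum()
2000-01-01 00:00:00     3
2000-01-01 00:03:00    12
2000-01-01 00:06:00    21
Freq: 3T, dtype: int64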
I now realize how it works, but the strange thing is still why the additional timestamp is added on the right side, which is counter-intuitive in a way. I guess this is similar to the range or iloc convention.

Vectorized computation of new pandas column using only hour-of-date from existing dates column

What I have:
A pandas dataframe with a column containing dates
Python 3.6
What I want:
Compute a new column, where the new value for every row depends only on a part of the date in the existing column for the same row (for example, an operation that depends only on the hour of the date)
Do so in an efficient manner (i.e. vectorized), as opposed to row-by-row computation.
Example dataframe (a small dataframe is convenient for printing, but I also have an actual use case with a larger dataframe which I can't share, but can use for timing different solutions):
import numpy as np
import pandas as pd
from datetime import datetime
from datetime import timedelta
df = pd.DataFrame({'Date': np.arange(datetime(2000, 1, 1),
                                     datetime(2000, 1, 2),
                                     timedelta(hours=3)).astype(datetime)})
print(df)
Which gives:
Date
0 2000-01-01 00:00:00
1 2000-01-01 03:00:00
2 2000-01-01 06:00:00
3 2000-01-01 09:00:00
4 2000-01-01 12:00:00
5 2000-01-01 15:00:00
6 2000-01-01 18:00:00
7 2000-01-01 21:00:00
Existing solution (too slow):
df['SinHour'] = df.apply(
    lambda row: np.sin((row.Date.hour + float(row.Date.minute) / 60.0) * np.pi / 12.0),
    axis=1)
print(df)
Which gives:
Date SinHour
0 2000-01-01 00:00:00 0.000000e+00
1 2000-01-01 03:00:00 7.071068e-01
2 2000-01-01 06:00:00 1.000000e+00
3 2000-01-01 09:00:00 7.071068e-01
4 2000-01-01 12:00:00 1.224647e-16
5 2000-01-01 15:00:00 -7.071068e-01
6 2000-01-01 18:00:00 -1.000000e+00
7 2000-01-01 21:00:00 -7.071068e-01
I say this solution is too slow, because it computes every value in the column row-by-row. Of course, if this really is the only possibility, I'll have to settle for this. However, in the case of simpler functions, I've gotten huge speedups by using vectorized numpy functions, which I'm hoping will be possible in some way here too.
Direction for desired solution (does not work):
I was hoping to be able to do something like this:
df = df.assign(
    SinHour=lambda data: np.sin((data.Date.hour + float(data.Date.minute) / 60.0)
                                * np.pi / 12.0))
This is the direction I was hoping to go in, because it's no longer a row-by-row apply. However, it obviously doesn't work, because it can't access the hour and minute properties of the entire Date column at once in a "vectorized" manner.
You were really close, you only need .dt to work with the datetime Series, and astype for the cast:
df = df.assign(SinHour=np.sin((df.Date.dt.hour +
                               df.Date.dt.minute.astype(float) / 60.0) * np.pi / 12.0))
print(df)
Date SinHour
0 2000-01-01 00:00:00 0.000000e+00
1 2000-01-01 03:00:00 7.071068e-01
2 2000-01-01 06:00:00 1.000000e+00
3 2000-01-01 09:00:00 7.071068e-01
4 2000-01-01 12:00:00 1.224647e-16
5 2000-01-01 15:00:00 -7.071068e-01
6 2000-01-01 18:00:00 -1.000000e+00
7 2000-01-01 21:00:00 -7.071068e-01
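An equivalent vectorized form (just a sketch, not from the original answer) computes the fractional hour from the time-of-day offset, which also picks up seconds if they are present:
# fractional hours since midnight, fully vectorized
frac_hour = (df['Date'] - df['Date'].dt.normalize()).dt.total_seconds() / 3600.0
df = df.assign(SinHour=np.sin(frac_hour * np.pi / 12.0))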

Another way to use downsampling in pandas

Let's look at some one-minute data:
In [513]: rng = pd.date_range('1/1/2000', periods=12, freq='T')
In [514]: ts = Series(np.arange(12), index=rng)
In [515]: ts
Out[515]:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking the sum of each group:
In [516]: ts.resample('5min', how='sum')
Out[516]:
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T
However, I don't want to use the resample method but still want the same input and output. How can I use groupby, reindex, or other such methods?
You can use a custom pd.Grouper this way:
In [78]: ts.groupby(pd.Grouper(freq='5min', closed='right')).sum()
Out [78]:
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int64
The closed='right' ensures that the binning matches; note that the labels shown here are the left bin edges, so you can also pass label='right' to pd.Grouper if you want identical labels.
However, if your aim is to do more custom grouping, you can use .groupby with your own vector:
In [78]: buckets = (ts.index - ts.index[0]) / pd.Timedelta('5min')
In [79]: grp = ts.groupby(np.ceil(buckets.values))
In [80]: grp.sum()
Out[80]:
0 0
1 15
2 40
3 11
dtype: int64
The output is not exactly the same, but the method is more flexible (e.g. can create uneven buckets).
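As a sketch of that flexibility (the bucket edges below are made up purely for illustration), you can group by any vector you like, for example uneven time buckets built with pd.cut:
# uneven buckets (0-2 min, 2-7 min, 7-12 min) cut on elapsed minutes
elapsed = (ts.index - ts.index[0]) / pd.Timedelta('1min')
buckets = pd.cut(elapsed, bins=[-1, 2, 7, 12], labels=['0-2min', '2-7min', '7-12min'])
print(ts.groupby(buckets).sum())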
