I have one set of values measured at regular times. Say:
import pandas as pd
import numpy as np
rng = pd.date_range('2013-01-01', periods=12, freq='H')
data = pd.Series(np.random.randn(len(rng)), index=rng)
And another set of more arbitrary times, for example (in reality these times are not a regular sequence):
ts_rng = pd.date_range('2013-01-01 01:11:21', periods=7, freq='87Min')
ts = pd.Series(index=ts_rng, dtype=float)
I want to know the value of data interpolated at the times in ts.
I can do this in numpy:
x = np.asarray(ts_rng, dtype=np.float64)
xp = np.asarray(data.index, dtype=np.float64)
fp = np.asarray(data)
ts[:] = np.interp(x,xp,fp)
But I feel pandas has this functionality somewhere in resample, reindex etc. but I can't quite get it.
You can concatenate the two time series and sort by index. Since the values in the second series are NaN you can interpolate and then just select out the values that represent the points from the second series:
pd.concat([data, ts]).sort_index().interpolate().reindex(ts.index)
or
pd.concat([data, ts]).sort_index().interpolate()[ts.index]
Assume you would like to evaluate a time series ts on a different datetime_index. This index and the index of ts may overlap. I recommend using the following groupby trick, which essentially gets rid of dubious duplicate stamps. I then forward fill, but feel free to apply fancier methods:
def interpolate(ts, datetime_index):
    x = pd.concat([ts, pd.Series(index=datetime_index)])
    return x.groupby(x.index).first().sort_index().fillna(method="ffill")[datetime_index]
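For example, reusing data and ts_rng from the question above (a minimal usage sketch, assuming those names are still in scope):
# evaluate the hourly series at the irregular timestamps via forward fill
interpolated = interpolate(data, ts_rng)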
Here's a clean one-liner:
ts = np.interp(ts_rng.asi8, data.index.asi8, data.values)
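If you want to keep the target timestamps attached to the interpolated values, you can wrap the result in a Series (a small variation, not part of the original answer):
# same interpolation, returned as a Series indexed by the target times
ts = pd.Series(np.interp(ts_rng.asi8, data.index.asi8, data.values), index=ts_rng)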
This question follows one I previously asked here, and that was answered for numeric values.
I now raise this second one relative to data of Period type.
While the example given below appears simple, my actual windows are of variable size. Being interested in the first row of each window, I am looking for a technique that makes use of this definition.
import pandas as pd
from random import seed, randint
# DataFrame
pi1h = pd.period_range(start='2020-01-01 00:00+00:00', end='2020-01-02 00:00+00:00', freq='1h')
seed(1)
values = [randint(0, 10) for ts in pi1h]
df = pd.DataFrame({'Values' : values, 'Period' : pi1h}, index=pi1h)
# This works (numeric type)
df['first'] = df['Values'].rolling(3).agg(lambda rows: rows[0])
# This doesn't (Period type)
df['OpeningPeriod'] = df['Period'].rolling(3).agg(lambda rows: rows[0])
Result of 2nd command
DataError: No numeric types to aggregate
Please, any idea? Thanks for any help!
The first row of a rolling window of size 3 is the row 2 rows above the current one - just use pd.Series.shift(2):
df['OpeningPeriod'] = df['Period'].shift(2)
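More generally (a small sketch of the same idea, assuming a fixed window size w rather than the hard-coded 3):
# for a fixed window of size w, the opening period sits w - 1 rows above the current row
w = 3
df['OpeningPeriod'] = df['Period'].shift(w - 1)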
For the variable size (for the sake of the example, I took the Values column as this variable size):
import numpy as np
# positional index of the opening row of each (variable-sized) window
x = (np.arange(len(df)) - df['Values'])
df['OpeningPeriod'] = np.where(x.ge(0), df.loc[df.index[x.tolist()], 'Period'], np.nan)
Convert your period[H] to a float
# convert to float
df['Period1'] = df['Period'].dt.to_timestamp().values.astype(float)
# rolling and convert back to period
df['OpeningPeriod'] = pd.to_datetime(df['Period1'].rolling(3)\
.agg(lambda rows: rows[0])).dt.to_period('1h')
# drop column
df = df.drop(columns='Period1')
I am trying to efficiently split a 3D point cloud into a number of 2D tiles/segments.
Using a combination of numpy's searchsorted() and pandas' groupby() functions, I have been able to sort the data into groups with pleasing speed.
For example:
import numpy as np
import pandas as pd
import time
scale=100
n_points= 1000000
n_tiles = 1000000
pos = np.empty((n_points,3))
pos[:,0]=np.random.random(n_points)*scale
pos[:,1]=np.random.random(n_points)*scale
pos[:,2]=np.random.random(n_points)
df = pd.DataFrame(pos)
# create bounds for each segment
min_bound,max_bound = 0,scale
x_segment_bounds, xstep = np.linspace(min_bound, max_bound, num=int(n_tiles**0.5), retstep=True)
x_segment_bounds[0] = x_segment_bounds[0] + xstep/2
y_segment_bounds, ystep = np.linspace(min_bound, max_bound, num=int(n_tiles**0.5), retstep=True)
y_segment_bounds[0] = y_segment_bounds[0] + ystep/2
# sort into bins
time_grab = time.perf_counter()
bins_x = np.searchsorted(x_segment_bounds, pos[:, 0])
bins_y = np.searchsorted(y_segment_bounds, pos[:, 1])
print("Time for binning: ", time.perf_counter() - time_grab)
df["bins_x"] = bins_x.astype(np.uint16)
df["bins_y"] = bins_y.astype(np.uint16)
# group points
time_grab = time.perf_counter()
segments = df.groupby(['bins_x', 'bins_y'])
print("Time for grouping: ", time.perf_counter() - time_grab)
Produces:
Time for binning: 0.1390
Time for grouping: 0.0043
The problem I am having is efficiently accessing the point indexes that belong to each group in the pandas groupby object.
For example, looping through each group is very inefficient:
segment_indices = []
for i, segment in enumerate(segments):
    segment_indices.append(segment[1].index.values)
takes ~70 seconds.
I have found this method for retrieving the indexes:
segments = df.groupby(['bins_x', 'bins_y']).apply(lambda x: x.index.tolist())
which takes ~10 seconds, but that is still quite slow compared to the binning and grouping functions. Since I'm simply trying to copy the data to a new array or list, and not actually performing any computation on it, I am expecting much better efficiency - at least speeds similar to the binning and grouping operations.
I am curious whether there is a more efficient way of extracting the indexes (or any of the information) from a groupby object. Alternatively, is there another method for segmenting/grouping points which doesn't use pandas, such as a numpy or scipy alternative?
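For reference, one thing worth timing (my own suggestion, not something from the original post) is the GroupBy.indices attribute, which hands back the row positions per group without a Python-level loop over the groups:
# dict mapping each (bins_x, bins_y) key to a numpy array of row positions;
# positions can be mapped back to index labels via df.index if needed
segment_indices = df.groupby(['bins_x', 'bins_y']).indices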
I have a series vec which has been sampled at 2000Hz. What I would like to do is downsample this series in 50Hz steps. My problem is that I do not quite understand how to do this with pandas, or how to wrap vec into a DataFrame and set the timestamps using pd.date_range accordingly.
The code I have is plain wrong, so I cannot really show you what I did so far. But I can show you in pseudo-Python what I'd like to do:
# Get a date range for vec
date_range = pd.date_range(len(vec), sampling_rate=2000, unit='Hz')
# Create a DataFrame for the 2000Hz series
df_2k = pd.DataFrame(vec, index=date_range)
# Sample down to 1950Hz, 1900Hz, ..
df_1950Hz = df_2k.resample(sampling_rate=1950, unit='Hz')
df_1900Hz = df_2k.resample(sampling_rate=1900, unit='Hz')
Any idea how I can do this?
I think what could also work is something like
df_1950Hz = df_2k.drop_every(nth_sample=int(2000/50))
First, construct a period from your frequency:
freq = 1950
period = '{}N'.format(int(1e9 / freq))
This gives you '512820N', which pandas understands as nanoseconds. Then:
df_2k.resample(period).mean() # you could use e.g. `last()` instead
As for your initial index, maybe you want this:
freq = 2000
period = '{}N'.format(int(1e9 / freq))
index = pd.date_range(start, periods=len(vec), freq=period)
Where start is arbitrary.
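Putting the two pieces together (a minimal sketch, under the assumption that vec is a plain 1-D numpy array and that the start timestamp does not matter):
import numpy as np
import pandas as pd
vec = np.random.randn(20000)                # stand-in for the 2000Hz signal
src_period = '{}N'.format(int(1e9 / 2000))  # 500000N, i.e. 0.5 ms per sample
index = pd.date_range('2013-01-01', periods=len(vec), freq=src_period)
df_2k = pd.DataFrame({'value': vec}, index=index)
# ~512820N per sample gives the 1950Hz version; repeat with other target rates
tgt_period = '{}N'.format(int(1e9 / 1950))
df_1950Hz = df_2k.resample(tgt_period).mean()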
Let's say I have a pandas.DataFrame that has hourly data for 3 days:
import pandas as pd
import numpy as np
import datetime as dt
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24,2),index=dates,columns=list('AB'))
I would like to get every, let's say, 6 hours of data and independently fit a curve to that data. Since pandas' resample function has a how keyword that is supposed to be any numpy array function, I thought that I could maybe try to use resample to do that with polyfit, but apparently there is no way (right?).
So the only alternative I thought of is separating df into a sequence of DataFrames, so I am trying to create a function that would work like
l = splitDF(df, '6H')
and would return a list of dataframes, each one with 6 hours of data (except maybe the first and last ones). So far I have nothing that works except something like the following manual method:
def splitDF(data, rule):
    res_index = data.resample(rule).index
    out = []
    cont = 0
    for date in data.index:
        ... check for date in res_index ...
        ... and start cutting at those points ...
But this method would be extremely slow and there is probably a faster way to do it. Is there a fast (maybe even pythonic) way of doing this?
Thank you!
EDIT
A better method (that needs some improvement but it's faster) would be the following:
def splitDF(data, rule):
    res_index = data.resample(rule).index
    out = []
    pdate = res_index[0]
    for date in res_index:
        out.append(data[pdate:date][:-1])
        pdate = date
    out.append(data[pdate:])
    return out
But still seems to me that there should be a better method.
Ok, so this sounds like a textbook case for using groupby. Here's my thinking:
import pandas as pd
import numpy as np
# let's define a function that'll group a datetime-indexed dataframe by hour-interval/date
def create_date_hour_groups(df, hr):
    new_df = df.copy()
    hr_int = int(hr)
    new_df['hr_group'] = new_df.index.hour // hr_int
    new_df['dt_group'] = new_df.index.date
    return new_df
# now we define a wrapper for polyfit to pass to groupby.apply
def polyfit_x_y(df, x_col='A', y_col='B', poly_deg=3):
    df_new = df.copy()
    coef_array = np.polyfit(df_new[x_col], df_new[y_col], poly_deg)
    poly_func = np.poly1d(coef_array)
    df_new['poly_fit'] = poly_func(df[x_col])
    return df_new
# now to the actual stuff
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24, 2), index=dates, columns=list('AB'))
df = create_date_hour_groups(df, 6)
df_fit = df.groupby(['dt_group', 'hr_group'], as_index=False).apply(polyfit_x_y)
How about:
np.array_split(df, len(df) // 6)
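As a usage sketch (my own illustration, assuming column 'A' as x and 'B' as y for the curve fit):
import numpy as np
# split into 6-row (i.e. 6-hour) chunks and fit a cubic to each one
chunks = np.array_split(df, len(df) // 6)
fits = [np.polyfit(chunk['A'], chunk['B'], 3) for chunk in chunks]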
I have a couple of sets of data with timestamp, value and quality flag. The value and quality flag are missing for some of the timestamps, and need to be filled with a dependence on the surrounding data. That is:
If the quality flags on the valid data bracketing the NaN data are different, then set the value and quality flag to the same as the bracketing row with the highest quality flag. In the example below, the first set of NaNs would be replaced with qf=3 and value=3.
If the quality flags are the same, then interpolate the value between the two valid values on either side. In the example, the second set of NaNs would be replaced by qf = 1 and v = 6 and 9.
Code:
from datetime import datetime
import pandas as pd
start = datetime.strptime("2004-01-01 00:00", "%Y-%m-%d %H:%M")
end = datetime.strptime("2004-01-01 03:00", "%Y-%m-%d %H:%M")
df = pd.DataFrame(
    data={'v':  [1, 2, 'NaN', 'NaN', 'NaN', 3, 2, 1, 5, 3, 'NaN', 'NaN', 12, 43, 23, 12, 32, 12, 12],
          'qf': [1, 1, 'NaN', 'NaN', 'NaN', 3, 1, 5, 1, 1, 'NaN', 'NaN', 1, 3, 4, 2, 1, 1, 1]},
    index=pd.date_range(start, end, freq="10min"))
I have tried to solve this by finding the NA rows and looping through them to fix the first criterion, then using interpolate to solve the second. However, this is really slow, as I am working with a large set.
One approach would just be to do all the possible fills and then choose among them as appropriate. After doing df = df.astype(float) if necessary (your example uses the string "NaN"), something like this should work:
is_null = df.qf.isnull()
fill_down = df.ffill()
fill_up = df.bfill()
df.loc[is_null & (fill_down.qf > fill_up.qf)] = fill_down
df.loc[is_null & (fill_down.qf < fill_up.qf)] = fill_up
df = df.interpolate()
It does more work than is necessary, but it's easy to see what it's doing, and the work it does is vectorized and so happens pretty quickly. On a version of your dataset expanded to ~10M rows (with the same density of nulls), it takes ~6s on my old notebook. Depending on your requirements, that might suffice.