Lets say I have a pandas.DataFrame that has hourly data for 3 days:
import pandas as pd
import numpy as np
import datetime as dt
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(np.random.randn(3*24,2),index=dates,columns=list('AB'))
I would like to get every, let's say, 6 hours of data and independently fit a curve to that data. Since pandas' resample function has a how keyword that is supposed to be any numpy array function, I thought that I could maybe try to use resample to do that with polyfit, but apparently there is no way (right?).
So the only alternative way I thought of doing that is separating df into a sequence of DataFrames, so I am trying to create a function that would work such as
l=splitDF(df, '6H')
and it would return to me a list of dataframes, each one with 6 hours of data (except maybe the first and last ones). So far I got nothing that could work except something like the following manual method:
def splitDF(data, rule):
res_index=data.resample(rule).index
out=[]
cont=0
for date in data.index:
... check for date in res_index ...
... and start cutting at those points ...
But this method would be extremely slow and there is probably a faster way to do it. Is there a fast (maybe even pythonic) way of doing this?
Thank you!
EDIT
A better method (that needs some improvement but it's faster) would be the following:
def splitDF(data, rule):
res_index=data.resample(rule).index
out=[]
pdate=res_index[0]
for date in res_index:
out.append(data[pdate:date][:-1])
pdate=date
out.append(data[pdate:])
return out
But still seems to me that there should be a better method.
Ok, so this sounds like a textbook case for using groupby. Here's my thinking:
import pandas as pd
#let's define a function that'll group a datetime-indexed dataframe by hour-interval/date
def create_date_hour_groups(df, hr):
new_df = df.copy()
hr_int = int(hr)
new_df['hr_group'] = new_df.index.hour/hr_int
new_df['dt_group'] = new_df.index.date
return new_df
#now we define a wrapper for polyfit to pass to groupby.apply
def polyfit_x_y(df, x_col='A', y_col='B', poly_deg=3):
df_new = df.copy()
coef_array = pd.np.polyfit(df_new[x_col], df_new[y_col], poly_deg)
poly_func = pd.np.poly1d(coef_array)
df_new['poly_fit'] = poly_func(df[x_col])
return df_new
#to the actual stuff
dates = pd.date_range('20130101', periods=3*24, freq='H')
df = pd.DataFrame(pd.np.random.randn(3*24,2),index=dates,columns=list('AB'))
df = create_date_hour_groups(df, 6)
df_fit = df.groupby(['dt_group', 'hr_group'],
as_index=False).apply(polyfit_x_y)
How about?
np.array_split(df,len(df)/6)
Related
I basically have a dataframe (df1) with columns 7 columns. The values are always integers.
I have another dataframe (df2), which has 3 columns. One of these columns is a list of lists with a sequence of 7 integers. Example:
import pandas as pd
df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
data = [['Alfred','Chicago',
np.random.randint(1,5,(100,7))],
['Nicola','New York',
np.random.randint(1,5,(100,7))]])
I now want to compare the sequence of the rows in df1 with the 'Sequence' column in df2 and get a percentage of overlap. In a primitive for loop this would look like this:
df2['Overlap'] = 0.
for i in range(len(df2)):
c = sum(el in list(df2.at[i, 'Sequence']) for el in df1.values.tolist())
df2.at[i, 'Overlap'] = c/len(df1)
Now the problem is that my df2 has 500000 rows and my df1 usually around 50-100. This means that the task easily gets very time consuming. I know that there must be a way to optimize this with numpy, but I cannot figure it out. Can someone please help me?
By default engine used in pandas cython, but you can also change engine to numba or use njit decorator to speed up. Look up enhancingperf.
Numba converts python code to optimized machine codee, pandas is highly integrated with numpy and hence numba also. You can experiment with parallel, nogil, cache, fastmath option for speedup. This method shines for huge inputs where speed is needed.
Numba you can do eager compilation or first time execution take little time for compilation and subsequent usage will be fast
import pandas as pd
df1 = pd.DataFrame(columns = ['A','B','C','D','E','F','G'],
data = np.random.randint(1,5,(100,7)))
df2 = pd.DataFrame(columns = ['Name','Location','Sequence'],
data = [['Alfred','Chicago',
np.random.randint(1,5,(100,7))],
['Nicola','New York',
np.random.randint(1,5,(100,7))]])
a = df1.values
# Also possible to add `parallel=True`
f = nb.njit(lambda x: (x == a).mean())
# This is just illustration, not correct logic. Change the logic according to needs
# nb.njit((nb.int64,))
# def f(x):
# sum = 0
# for i in nb.prange(x.shape[0]):
# for j in range(a.shape[0]):
# sum += (x[i] == a[j]).sum()
# return sum
# Experiment with engine
print(df2['Sequence'].apply(f))
You can use direct comparison of the arrays and sum the identical values. Use apply to perform the comparison per row in df2:
df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
output:
0 0.270000
1 0.298571
To save the output in your original dataframe:
df2['Overlap'] = df2['Sequence'].apply(lambda x: (x==df1.values).sum()/df1.size)
I am very new to to trying to parallelize my python code. I am trying to perform some analysis on an xarray, then fill in a pandas dataframe with the results. The columns of the dataframe are independent, so I think it should be trivial to parallelise using dask delayed, but can't work out how. My xarrays are quite big, so this loop takes a while, and is big in memory. It could also be chunked by time, instead, if that's easier (this might help with memory)!
Here is the un-parallelized version:
from time import sleep
import time
import pandas as pd
import dask.dataframe as dd
data1 = np.random.rand(4, 3,3)
data2=np.random.randint(4,size=(3,3))
locs1 = ["IA", "IL", "IN"]
locs2 = ['a', 'b', 'c']
times = pd.date_range("2000-01-01", periods=4)
xarray1 = xr.DataArray(data1, coords=[times, locs1, locs2], dims=["time", "space1", "space2"])
xarray2= xr.DataArray(data2, coords=[locs1, locs2], dims=[ "space1", "space2"])
def delayed_where(xarray1,xarray2,id):
sleep(1)
return xarray1.where(xarray2==id).mean(axis=(1,2)).to_dataframe(id)
final_df=pd.DataFrame(columns=range(4),index=times)
for column in final_df:
final_df[column]=delayed_where(xarray1,xarray2,column)
I would like to parallelize the for loop, but have tried:
final_df_delayed=pd.DataFrame(columns=range(4),index=times)
for column in final_df:
final_df_delayed[column]=delayed(delayed_where)(xarray1,xarray2,column)
final_df.compute()
Or maybe something with dask delayed?
final_df_dd=dd.from_pandas(final_df, npartitions=2)
for column in final_df:
final_df_dd[column]=delayed(delayed_where)(xarray1,xarray2,column)
final_df_dd.compute()
But none of these work. Can anyone help?
You're using delayed correctly, but it's not possible to construct a dask dataframe in the way you specified.
from dask import delayed
import dask
#delayed
def delayed_where(xarray1,xarray2,id):
sleep(1)
return xarray1.where(xarray2==id).mean(axis=(1,2)).to_dataframe(id)
#delayed
def form_df(list_col_results):
final_df=pd.DataFrame(columns=range(4),index=times)
for n, column in enumerate(final_df):
final_df[column]=list_col_results[n]
return final_df
delayed_cols = [delayed_where(xarray1,xarray2, col) for col in final_df.columns]
delayed_df = form_df(delayed_cols)
delayed_df.compute()
Note that the enumeration is a clumsy way to get correct order of the columns, but your actual problem might guide you to a better way of specifying this (e.g. by explicitly specifying each column as an individual argument).
I would like to apply a specific function (in this case a logit model) to a dataframe which can be grouped (by the variable "model"). I know the task can be performed through a loop, however I believe this to be inefficient at best. Example code below:
import pandas as pd
import numpy as np
import statsmodels.api as sm
df1=pd.DataFrame(np.random.randint(0,100,size=(100,10)),columns=list('abcdefghij'))
df2=pd.DataFrame(np.random.randint(0,100,size=(100,10)),columns=list('abcdefghij'))
df1['model']=1
df1['target']=np.random.randint(2,size=100)
df2['model']=2
df2['target']=np.random.randint(2,size=100)
data=pd.concat([df1,df2])
### Clunky, but works...
for i in range(1,2+1):
lm=sm.Logit(data[data['model']==i]['target'],
sm.add_constant(data[data['model']==i].drop(['target'],axis=1))).fit(disp=0)
print(lm.summary2())
### Can this work?
def elegant(self):
lm=sm.Logit(data['target'],
sm.add_constant(data.drop(['target'],axis=1))).fit(disp=0)
better=data.groupby(['model']).apply(elegant)
If the above groupby can work, is this a more efficient way to perform than looping?
This could work:
def elegant(df):
lm = sm.Logit(df['target'],
sm.add_constant(df.drop(['target'],axis=1))).fit(disp=0)
return lm
better = data.groupby('model').apply(elegant)
Using .apply you passe the dataframe groups to the function elegant so elegant has to take a dataframe as the first argument here. Also your function needs to return the result of your calculation lm.
For more complexe functions the following structure can be used:
def some_fun(df, kw_param=1):
# some calculations to df using kw_param
return df
better = data.groupby('model').apply(lambda group: some_func(group, kw_param=99))
So I'm trying to apply a function to the index row by row but having some problems
startDate = '2015-05-01 00:00'
endDate = '2015-05-08 00:00'
idx = pd.date_range(startDate, endDate, freq="1min")
df = pd.DataFrame(columns=['F(t)'])
df = df.reindex(idx, fill_value=0)
def circadian_function(T):
return math.cos(math.pi*(T-delta)/12)
Everything is okay up to here but trying to apply the function I'm not sure what to do
df['F(t)'] = df.index.apply(lambda x: circadian_function x[index].hour, axis=1)
Should I be using a lambda? Or just an apply?
I don't have 50 rep so I can't comment on #Ted Petrou's answer ;-;
I just wanted to say a couple things that you should know.
If you are going to feed df.index.hour into your carcadian_function, make sure you use numpy instead of math. Otherwise the interpreter will throw a TypeError (I just found out about this).
Make sure to define delta.
Example:
import numpy as np
def circadian_function(T, delta):
return np.cos(np.pi * (T-delta) / 12)
What #Ted Petrou recommends you do in full:
df['F(x)'] = circadian_function(df.index.hour, 0.5) #I picked an arbitrary delta
Numpy will automatically vectorize the function for you. Props to Ted I learned something new as well :>
Use apply only as a last result. This can be easily vectorized. Make sure you define delta.
import numpy as np
df['F(t)'] = np.cos(np.pi*(idx.hour-delta)/12)
I have one set of values measured at regular times. Say:
import pandas as pd
import numpy as np
rng = pd.date_range('2013-01-01', periods=12, freq='H')
data = pd.Series(np.random.randn(len(rng)), index=rng)
And another set of more arbitrary times, for example, (in reality these times are not a regular sequence)
ts_rng = pd.date_range('2013-01-01 01:11:21', periods=7, freq='87Min')
ts = pd.Series(index=ts_rng)
I want to know the value of data interpolated at the times in ts.
I can do this in numpy:
x = np.asarray(ts_rng,dtype=np.float64)
xp = np.asarray(data.index,dtype=np.float64)
fp = np.asarray(data)
ts[:] = np.interp(x,xp,fp)
But I feel pandas has this functionality somewhere in resample, reindex etc. but I can't quite get it.
You can concatenate the two time series and sort by index. Since the values in the second series are NaN you can interpolate and the just select out the values that represent the points from the second series:
pd.concat([data, ts]).sort_index().interpolate().reindex(ts.index)
or
pd.concat([data, ts]).sort_index().interpolate()[ts.index]
Assume you would like to evaluate a time series ts on a different datetime_index. This index and the index of ts may overlap. I recommend to use the following groupby trick. This essentially gets rid of dubious double stamps. I then forward interpolate but feel free to apply more fancy methods
def interpolate(ts, datetime_index):
x = pd.concat([ts, pd.Series(index=datetime_index)])
return x.groupby(x.index).first().sort_index().fillna(method="ffill")[datetime_index]
Here's a clean one liner:
ts = np.interp( ts_rng.asi8 ,data.index.asi8, data[0] )