I'm trying to build an iterating interpolation of a series x and a DataFrame y.
DataFrame y has n rows and m columns, and I would like to run the interpolation for every row of y.
So far, I've been able to successfully build the interpolation for a single row by using iloc[0, :]:
# SX5E
import numpy as np
import pandas as pd
from scipy import interpolate

z = np.linspace(0.2, 0.99, 200)
z_pd_SX5E = pd.Series(z)

def f(z_pd_SX5E):
    # x is the series and y the DataFrame described above; here only row 0 of y is used
    x_SX5E = x
    y_SX5E = y.iloc[0, :]
    tck_SX5E = interpolate.splrep(x_SX5E, y_SX5E)
    return interpolate.splev(z_pd_SX5E, tck_SX5E)

Optimal_trigger_P_SX5E = z_pd_SX5E[f(z_pd_SX5E).argmax(axis=0)]
How can I run the function through every row of y?
Many thanks
In general you can run any function on each row by using .apply with axis=1. So something like:
y.apply(lambda row: interpolate.splrep(x, row), axis=1)
This will then return a new Series object.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
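For the specific setup in the question, a minimal sketch of how that could look (assuming x, y, and z_pd_SX5E as defined above; the spline is refit on each row and the z value that maximises the interpolated curve is returned per row):

from scipy import interpolate

def best_trigger(row):
    # fit a spline of this row of y against x, evaluate it on the z grid,
    # and return the z value where the interpolated curve peaks
    tck = interpolate.splrep(x, row)
    values = interpolate.splev(z_pd_SX5E, tck)
    return z_pd_SX5E[values.argmax()]

Optimal_triggers = y.apply(best_trigger, axis=1)   # one value per row of y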
I have a DataFrame and use geopy to calculate distances between two geo coordinates, as follows:
import geopy.distance
import pandas as pd

distCalcExample = geopy.distance.geodesic((49.18443, -0.36098), (49.184335, -0.361185)).m

r = {'poly': [(49.419453, 0.232884), (49.41956, 0.23269), (49.41956, 0.23261),
              (49.41953, 0.23255), (49.41946, 0.23247)]}
df = pd.DataFrame(r)
df['dist'] = 0
df
I need to calculate the distance between the coordinates of rows n and n+1.
I was thinking of using geopy as in distCalcExample, along with apply and a lambda function,
but I have not managed to achieve it. What would be the simplest way to do it?
First, create a column containing the shifted coordinates:
df["shifted"] = df["poly"].shift()
Then apply row-wise, skipping the first row (its shifted value is NaN):
df[["poly", "shifted"]].dropna().apply(lambda x: geopy.distance.geodesic(x["poly"], x["shifted"]).m, axis=1)
I have a DataFrame (df).
I need to build the standard deviation (variance) DataFrame from it. For the first row I want to use the traditional variance formula,
sum((x - mean(x))^2) / n
and from the second row (= i) onward I want to use the following formula:
lamb * (variance of the previous row) + (1 - lamb) * (previous row of returns)^2
# Generate sample DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1, 7),
                   'b': [x**2 for x in range(1, 7)],
                   'c': [x**3 for x in range(1, 7)]})
# Generate returns DataFrame
returns = df.pct_change()
# Generate new zero DataFrame
d = pd.DataFrame(0, index=np.arange(len(returns)), columns=returns.columns)
# Populate first row
lamb = 0.94
d.iloc[0] = list(returns.var())
Now my question is how to populate the rows from the second one to the end using the second formula.
It should be something like
d[1:].agg(lambda x: lamb*x.shift(-1) + (1-lamb)*returns[:2])
but it obviously returns a long error.
Could you please help?
for i in range(1, len(d)):
    d.iloc[i] = lamb * d.iloc[i - 1] + (1 - lamb) * returns.iloc[i - 1]
I'm not completely sure this gives the right answer, but it won't throw an error. Iterating over the rows with a for loop and .iloc should do the job for you once you plug in the correct formula.
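For reference, a self-contained sketch of that recursion using the setup from the question, with the square of the previous returns row as the question's formula states (the leading all-NaN row of pct_change is dropped here so the recursion has numbers to work with):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1, 7),
                   'b': [x**2 for x in range(1, 7)],
                   'c': [x**3 for x in range(1, 7)]})
returns = df.pct_change().dropna()    # drop the first all-NaN row of pct_change

lamb = 0.94
d = pd.DataFrame(0.0, index=np.arange(len(returns)), columns=returns.columns)
d.iloc[0] = returns.var()             # traditional variance for the first row

# from the second row on: mix the previous estimate with the squared previous returns
for i in range(1, len(d)):
    d.iloc[i] = lamb * d.iloc[i - 1] + (1 - lamb) * returns.iloc[i - 1] ** 2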
I have a Pandas dataframe with two columns, x and y, that correspond to a large signal. It is about 3 million rows in size.
[figure: wavelength signal from the dataframe]
I am trying to isolate the peaks from the signal. After using scipy, I got a 1D Python list corresponding to the indexes of the peaks. However, they are not the actual x-values of the signal, but just the index of their corresponding row:
from scipy.signal import find_peaks
peaks, _ = find_peaks(y, height=(None, peakline))
So, I decided I would just filter the original dataframe by setting all values in its y column to NaN unless they were at an index found in the peak list. I did this iteratively; however, since there are 3,000,000 rows, it is extremely slow:
peak_index = 0
for data_index in list(data.index):
    if data_index != peaks[peak_index]:
        data.iloc[data_index, 1] = float('NaN')   # blank out y for non-peak rows
    else:
        peak_index += 1
Does anyone know what a faster method of filtering a Pandas dataframe might be?
Looping is in most cases extremely inefficient when it comes to pandas. Assuming you just need a filtered DataFrame that contains the values of both the x and y columns only where y is a peak, you may use the following piece of code:
df.iloc[peaks]
Alternatively, if you are hoping to retrieve the original DataFrame with the y column retaining its peak values and holding NaN otherwise, then use:
df.y = df.y.where(df.y.iloc[peaks] == df.y.iloc[peaks])
Finally, since you seem to care about just the x values of the peaks, you might just rework the first piece in the following way:
df.iloc[peaks].x
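A small self-contained sketch of these options on a toy signal (the sine-plus-noise data and column names are placeholders; df.index is assumed to be a plain RangeIndex, so positions and labels coincide):

import numpy as np
import pandas as pd
from scipy.signal import find_peaks

# toy stand-in for the 3-million-row signal
x = np.linspace(0, 10, 1000)
y = np.sin(5 * x) + np.random.normal(scale=0.1, size=x.size)
df = pd.DataFrame({'x': x, 'y': y})

peaks, _ = find_peaks(df['y'].to_numpy())

peak_rows = df.iloc[peaks]        # only the peak rows
peak_xs = df.iloc[peaks]['x']     # only the x positions of the peaks

# keep y at the peaks and set NaN elsewhere, without any Python-level loop
# (a boolean mask expressing the same idea as the .where line above)
df['y'] = df['y'].where(df.index.isin(peaks))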
I have a table (DataFrame) created in Pandas. It is a 2D table with integers as the column index and integers as the row index (they represent positions x and y).
I know how to get the value in a "cell" of that table using the indexes, but I would like to get a value "from between" the columns and rows, linearly interpolated.
Preferably, I would like to do this for a large number of x, y points kept in two tables, Position_x (m x n) and Position_y (m x n), and put the results into a table Results (m x n).
https://i.stack.imgur.com/utv03.png
Here is an example of such a procedure in Excel:
https://superuser.com/questions/625154/what-is-the-simplest-way-to-interpolate-and-lookup-in-an-x-y-table-in-excel
Thanks
Szymon
I've found something that works 90% of the way; however, it has two disadvantages:
1) the index and columns need to be strictly increasing,
2) for a set of n input pairs it produces an n x n result array instead of just n results (in the example below, for 3 pairs of input points I need only 3 resulting values, but this code gives 9 values, i.e. all combinations of the input points).
Here is what I've found:
import numpy as np
import pandas as pd
import scipy.interpolate

x = np.array([0, 10, 25, 60, 100])        # index
y = np.array([1000, 1200, 1400, 1600])    # columns
data = np.array([[60, 54, 33, 0],
                 [50, 46, 10, 0],
                 [42, 32, 5, 0],
                 [30, 30, 2, 0],
                 [10, 10, 0, 0]])
Table_to_Interpolate = pd.DataFrame(data, index=x, columns=y)

sp = scipy.interpolate.RectBivariateSpline(x, y, data, kx=1, ky=1, s=0)

Input_Xs = 12, 44, 69
Input_Ys = 1150, 1326, 1416
Results = pd.DataFrame(sp(Input_Xs, Input_Ys), index=Input_Xs, columns=Input_Ys)
It's not perfect, but it's the best I could find.
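On the second disadvantage: the spline object can also be evaluated at paired points rather than on the full grid via its .ev method, which gives n results for n input pairs. A short sketch reusing sp, Input_Xs and Input_Ys from above:

# evaluate at the (x, y) pairs themselves instead of on the outer-product grid
paired_values = sp.ev(Input_Xs, Input_Ys)
Results_paired = pd.Series(paired_values, index=list(zip(Input_Xs, Input_Ys)))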
If I understood your question correctly:
you can start by using pandas.melt to convert the multi-column result table into a one-column result table.
Then you can use ben-t's great answer to interpolate.
Hope this helps.
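A minimal sketch of that melt step, assuming the Table_to_Interpolate frame from the code above; the long-form (x, y, value) triples can then be passed to a scattered-data interpolator such as scipy.interpolate.griddata, which stands in here for the linked answer's approach:

from scipy.interpolate import griddata

# wide table -> long (x, y, value) triples
long_form = (Table_to_Interpolate
             .reset_index()
             .melt(id_vars='index', var_name='y', value_name='value')
             .rename(columns={'index': 'x'}))

# linear interpolation at the (x, y) input pairs
points = long_form[['x', 'y']].to_numpy(dtype=float)
values = long_form['value'].to_numpy(dtype=float)
interpolated = griddata(points, values, list(zip(Input_Xs, Input_Ys)), method='linear')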
I have a dataframe with time series.
I'd like to compute the rolling correlation (periods=20) between columns.
import numpy as np
import pandas as pd

store_corr = []   # empty list to store the rolling correlation of each pair
names = []        # empty list to store the column-pair names
df = df.pct_change(periods=1).dropna(axis=0)   # prepare the DataFrame of returns
for i in range(0, len(df.columns)):
    for j in range(i, len(df.columns)):
        corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
        names.append('col ' + str(i) + ' -col ' + str(j))
        store_corr.append(corr)
df_corr = pd.DataFrame(np.transpose(np.array(store_corr)), columns=names)
This solution works and gives me the rolling correlations (this solution is with the help of Austin Mackillop in the comments).
Is there another, faster way? (I.e., I want to avoid the double for loop.)
This line:
corr = df.rolling(20).corr(df[df.columns[i]], df[df.columns[j]])
will produce an error because the second argument of corr expects a bool, but you passed a DataFrame, which has an ambiguous truth value. You can view the docs here.
Does applying the rolling method to the first DataFrame, as in the second line of code that you provided, achieve what you are trying to do?
corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
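If the main goal is just to avoid the explicit double loop, one possible alternative (a sketch; whether it is actually faster depends on the data) is pandas' built-in pairwise rolling correlation, which returns all column pairs at once as a MultiIndexed DataFrame. This assumes df is the original time-series frame from the question:

returns = df.pct_change(periods=1).dropna(axis=0)

# rows carry a MultiIndex of (original index, column); columns are the columns again
pairwise_corr = returns.rolling(20).corr()

# e.g. the rolling correlation between the first two columns
first, second = returns.columns[0], returns.columns[1]
corr_first_second = pairwise_corr.xs(first, level=1)[second]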