dataframe apply lambda function that requires value from row n+1 - python

I have a dataframe and use geopy to calculate distances between two geo coordinates as follows:
import geopy.distance
import pandas as pd

distCalcExample = geopy.distance.geodesic((49.18443, -0.36098), (49.184335, -0.361185)).m

r = {'poly': [(49.419453, 0.232884), (49.41956, 0.23269), (49.41956, 0.23261),
              (49.41953, 0.23255), (49.41946, 0.23247)]}
df = pd.DataFrame(r)
df['dist'] = 0
df
I need to calculate the distance between coordinates of rows n and n+1.
I was thinking of using geopy as in distCalcExample, along with apply and a lambda function.
But I have not managed to achieve it. What would be the simplest way to do it?

First, create a column holding the shifted values:
df["shifted"] = df["poly"].shift()
Then use apply row-wise (note the first row's shifted value is NaN, so drop it first; label-based access avoids relying on column positions):
df.dropna(subset=["shifted"]).apply(lambda x: geopy.distance.geodesic(x["poly"], x["shifted"]).m, axis=1)
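A minimal, self-contained sketch of the shift-then-apply pattern; math.dist (plain Euclidean distance) stands in for geopy's geodesic here purely so the example runs without geopy, and the coordinates are made up:

```python
import math
import pandas as pd

df = pd.DataFrame({'poly': [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]})
df['shifted'] = df['poly'].shift()

# math.dist is a stand-in for geopy.distance.geodesic(...).m;
# the first row has no predecessor, so its 'shifted' value is NaN.
df['dist'] = df.apply(
    lambda row: math.dist(row['poly'], row['shifted'])
    if isinstance(row['shifted'], tuple) else float('nan'),
    axis=1,
)
```

With real coordinates you would swap math.dist back for geopy.distance.geodesic(...).m.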

Related

How to utilize .apply on functions from geopy in pandas when creating new column from existing columns

So I am trying to find a more efficient way of doing a task I already wrote some code for. The purpose of the code is to use 4 columns (LATITUDE, LONGITUDE, YORK_LATITUDE, YORK_LONGITUDE) to create a new column in a pandas DataFrame that holds the distance in kilometers between two coordinates, where the first coordinate is (LATITUDE, LONGITUDE) and the second is (YORK_LATITUDE, YORK_LONGITUDE).
A link of what the table looks like
Right now I complete the task by building a list with the following code (geopy and pandas iterrows), converting it into a column, and concatenating it to the DataFrame. This is cumbersome; I know there is an easier way using .apply and the geopy function, but I haven't been able to figure out the syntax.
from geopy.distance import geodesic as GD

distances = []  # avoid shadowing the built-in name `list`
for index, row in result.iterrows():
    coordinate1 = (row['LATITUDE'], row['LONGITUDE'])
    coordinate2 = (row['LATITUDE_YORK_UNIVERSITY'], row['LONGITUDE_YORK_UNIVERSITY'])
    distances.append(GD(coordinate1, coordinate2).km)
TL;DR
df.apply(lambda x: distance(x[:2], x[2:]), axis=1)
Some explanation
Let's say we have a function, which requires two tuples as arguments. For example:
from math import dist

def distance(point1: tuple, point2: tuple) -> float:
    # suppose the developer checks the types,
    # so only tuples can be passed as arguments
    assert type(point1) is tuple
    assert type(point2) is tuple
    return dist(point1, point2)
Let's apply the function to this data:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    data=np.arange(10 * 4).reshape(10, 4),
    columns=['long', 'lat', 'Y long', 'Y lat'],
)
We pass two parameters to apply: axis=1 to iterate over rows, and a wrapper over distance as a lambda function. To split the row into tuples we can apply tuple(...) or (*...,); note the comma at the end in the latter option:
df.apply(lambda x: distance((*x[:2],), (*x[2:],)), axis=1)
The thing is that geopy.distance doesn't strictly require tuples as arguments; they can be any iterables with 2 to 3 elements (see how an argument is converted to the internal Point type when a distance is defined). So we can simplify this to:
df.apply(lambda x: distance(x[:2], x[2:]), axis=1)
To make it independent from the columns order we could write this (in your terms):
common_point = ['LATITUDE','LONGITUDE']
york_point = ['LATITUDE_YORK_UNIVERSITY','LONGITUDE_YORK_UNIVERSITY']
result.apply(lambda x: GD(x[common_point], x[york_point]).km, axis=1)
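A runnable sketch of the column-name version above; math.dist replaces geopy's GD(...).km here purely so the example has no external dependency, and the coordinate values are hypothetical:

```python
import math
import pandas as pd

result = pd.DataFrame({
    'LATITUDE': [0.0, 3.0],
    'LONGITUDE': [0.0, 4.0],
    'LATITUDE_YORK_UNIVERSITY': [0.0, 0.0],
    'LONGITUDE_YORK_UNIVERSITY': [0.0, 0.0],
})

common_point = ['LATITUDE', 'LONGITUDE']
york_point = ['LATITUDE_YORK_UNIVERSITY', 'LONGITUDE_YORK_UNIVERSITY']

# math.dist stands in for GD(...).km; selecting by column names keeps
# the computation independent of the column order in the frame.
result['dist'] = result.apply(
    lambda x: math.dist(tuple(x[common_point]), tuple(x[york_point])),
    axis=1,
)
```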

How to run different functions in different parts of a dataframe in python?

I have a dataframe (df).
I need to build the standard deviation dataframe from this one. For the first row I want to use the traditional variance formula:
sum((x - mean(x))^2) / n
and from the second row (row i) onward I want to use the following formula:
lamb * (variance in the previous row) + (1 - lamb) * (previous row of returns)^2
※ by "first row" above, I meant the previous row.
# Generate sample DataFrame
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(1, 7),
                   'b': [x**2 for x in range(1, 7)],
                   'c': [x**3 for x in range(1, 7)]})
# Generate returns DataFrame
returns = df.pct_change()
# Generate new zero-filled DataFrame
d = pd.DataFrame(0, index=np.arange(len(returns)), columns=returns.columns)
# Populate the first row
lamb = 0.94
d.iloc[0] = list(returns.var())
Now my question is how to populate the second row through to the end using the second formula?
It should be something like
d[1:].agg(lambda x: lamb*x.shift(-1) + (1-lamb)*returns[:2])
but it obviously returned a long error.
Could you please help?
for i in range(1, len(d)):
    d.iloc[i] = lamb * d.iloc[i - 1] + (1 - lamb) * returns.iloc[i - 1] ** 2
I'm not completely sure this gives the right answer, but it won't throw an error. The key points are to assign the result back to the row with .iloc rather than calling apply on it, and to iterate with a plain for loop; that should do the job for you if you use the correct formula.
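Putting that together, a self-contained sketch of the recursion (the sample returns are made up; the variance seed and lamb value follow the question):

```python
import numpy as np
import pandas as pd

lamb = 0.94
returns = pd.DataFrame({'a': [0.01, 0.02, -0.01, 0.03]})  # hypothetical returns

d = pd.DataFrame(0.0, index=returns.index, columns=returns.columns)
d.iloc[0] = returns.var()  # seed the first row with the sample variance

# EWMA-style recursion: each row mixes the previous variance estimate
# with the previous squared return.
for i in range(1, len(d)):
    d.iloc[i] = lamb * d.iloc[i - 1] + (1 - lamb) * returns.iloc[i - 1] ** 2
```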

Iterating optimization on a dataframe

I'm trying to build an iterating interpolation over series x and dataframe y.
DataFrame y has n rows and m columns, and I would like to run the interpolation for every row of y.
So far, I've been able to successfully build the iteration for one single row by using iloc[0, :]:
### SX5E
from scipy import interpolate

z = np.linspace(0.2, 0.99, 200)
z_pd_SX5E = pd.Series(z)

def f(z_pd_SX5E):
    x_SX5E = x
    y_SX5E = y.iloc[0, :]
    tck_SX5E = interpolate.splrep(x_SX5E, y_SX5E)
    return interpolate.splev(z_pd_SX5E, tck_SX5E)

Optimal_trigger_P_SX5E = z_pd_SX5E[f(z_pd_SX5E).argmax(axis=0)]
How can I run the function through every row of y?
Many thanks
In general you can run any function on each row by using .apply with axis=1. So something like:
y.apply(lambda val: interpolate.splrep(x, val), axis=1)
This will then return a new series object.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.apply.html
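A dependency-light sketch of the row-wise pattern; np.interp (linear) stands in for splrep/splev here, and the x, y, z data are made up:

```python
import numpy as np
import pandas as pd

x = np.array([0.0, 1.0, 2.0])             # common x grid
y = pd.DataFrame([[0.0, 1.0, 4.0],
                  [0.0, 2.0, 8.0]])       # one curve per row
z = np.linspace(0.0, 2.0, 5)              # evaluation points

# axis=1 makes apply pass each row of y to the lambda;
# np.interp is a linear stand-in for splrep/splev.
interpolated = y.apply(lambda row: pd.Series(np.interp(z, x, row)), axis=1)
best_z = z[interpolated.values.argmax(axis=1)]  # per-row argmax, as in f()
```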

How to convert from degrees to radians in 2 out of 3 columns of a table?

I have a table (40000 x 3) of 3 columns and 40000 rows; one column is not related to the problem, and the other two columns contain numerical values in degrees. How can I convert all of the values in columns 2 and 3 from degrees to radians?
Note: DF4 is the name of the table. latitude and longitude are the names of the two columns.
I've tried creating a function (deg_to_rad()), assigning one variable to column 2 and another to column 3, and then calling the function once per column with the variable as the argument. It works in the sense that it does convert the values to radians, but I can't put the two columns back together into a table with the third column.
Is there an easier way to achieve this goal?
Also, is it possible to use lambda instead?
This is the code I've written:
import math

def deg_to_rad(dr):
    return (dr * math.pi) / 180

DEG_TO_RAD_ATTEMPT_LATITUDE = DF4['latitude']
DEG_TO_RAD_ATTEMPT_LONGITUDE = DF4['longitude']
deg_to_rad(DEG_TO_RAD_ATTEMPT_LATITUDE)
deg_to_rad(DEG_TO_RAD_ATTEMPT_LONGITUDE)
IIUC you can just do:
DF4['latitude'] = deg_to_rad(DF4['latitude'])
DF4['longitude'] = deg_to_rad(DF4['longitude'])
You're not assigning the result of your function back anywhere.
Here you take a reference to the column:
DEG_TO_RAD_ATTEMPT_LATITUDE = DF4['latitude']
You then pass it to your function, which returns the result, but that result isn't assigned to anything, nor does the function modify the passed-in column.
Also you can use np.deg2rad to achieve the same:
import numpy as np
DF4['latitude'] = np.deg2rad(DF4['latitude'])
DF4['longitude'] = np.deg2rad(DF4['longitude'])
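For instance, converting both columns in one shot (a small hypothetical frame; np.deg2rad is a NumPy ufunc, so it applies elementwise):

```python
import math
import numpy as np
import pandas as pd

DF4 = pd.DataFrame({'city': ['a', 'b'],        # unrelated column, untouched
                    'latitude': [0.0, 90.0],
                    'longitude': [180.0, 45.0]})

# Vectorised conversion on just the two numeric columns
DF4[['latitude', 'longitude']] = np.deg2rad(DF4[['latitude', 'longitude']])
```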
Assuming the table is like:
x, lat, long
. . .
. . .
Wrap the conversion in a lambda and use apply():
df['lat'] = df['lat'].apply(lambda x: x * math.pi / 180)
df['long'] = df['long'].apply(lambda x: x * math.pi / 180)
However, if the main goal of this is to calculate a distance, you can use the Haversine formula, which works without this conversion.
I presume that you are using Pandas.
To add a new column, you do:
df['col_name'] = deg_to_rad(some_val)
NumPy provides vectorised functions for converting values from degrees to radians and vice versa, so you don't actually need to define your own.

Rolling correlation pandas

I have a dataframe with time series.
I'd like to compute the rolling correlation (periods=20) between columns.
store_corr = []  # empty list to store the rolling correlation of each pair
names = []       # empty list to store the column names

df = df.pct_change(periods=1).dropna(axis=0)  # prepare DataFrame of returns
for i in range(0, len(df.columns)):
    for j in range(i, len(df.columns)):
        corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
        names.append('col ' + str(i) + ' -col ' + str(j))
        store_corr.append(corr)
df_corr = pd.DataFrame(np.transpose(np.array(store_corr)), columns=names)
This solution works and gives me the rolling correlations (arrived at with the help of Austin Mackillop in the comments).
Is there another, faster way? (I.e. I want to avoid the double for loop.)
This line:
corr = df.rolling(20).corr(df[df.columns[i]], df[df.columns[j]])
will produce an error because the second argument of corr (pairwise) expects a bool, but you passed a DataFrame, which has an ambiguous truth value. You can view the docs here.
Does applying the rolling method to the first DataFrame in the second line of code that you provided achieve what you are trying to do?
corr = df[df.columns[i]].rolling(20).corr(df[df.columns[j]])
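One way to drop the double loop entirely (a sketch on random data): pandas can compute all pairwise rolling correlations in a single call, returning a frame indexed by a (row, column) MultiIndex:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((100, 3)), columns=['a', 'b', 'c'])

# All pairwise 20-period rolling correlations in one call;
# the result has a (row, column) MultiIndex on the rows.
pairwise = df.rolling(20).corr()
```

Each self-pair (e.g. 'a' with 'a') is 1.0 once the window is full, and cross-pairs can be pulled out with .loc or .xs.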
