I have a working script that converts latitude and longitude coordinates to Cartesian coordinates. However, I have to perform this for specific points at each point in time (row by row).
I want to do something similar on a larger df, and I'm not sure whether a loop that iterates over each row is the most efficient way to do it. Below is the script that converts a single XY point.
import math
import numpy as np
import pandas as pd

point1 = [-37.83028766, 144.9539561]
r = 6371000  # radius of the Earth in meters
phi_0 = point1[1]
cos_phi_0 = math.cos(np.radians(phi_0))

def to_xy(point, r, cos_phi_0):
    lam = point[0]
    phi = point[1]
    return (r * np.radians(lam) * cos_phi_0, r * np.radians(phi))

point1_xy = to_xy(point1, r, cos_phi_0)
This works fine if I want to convert single points. The issue is if I have a large data frame or list (>100,000 rows) of coordinates. Would a loop that iterates through each row be inefficient? Is there a better way to perform the same function?
Below is an example of a fractionally bigger df.
d = ({
'Time' : [0,1,2,3,4,5,6,7,8],
'Lat' : [37.8300,37.8200,37.8200,37.8100,37.8000,37.8000,37.7900,37.7900,37.7800],
'Long' : [144.8500,144.8400,144.8600,144.8700,144.8800,144.8900,144.8800,144.8700,144.8500],
})
df = pd.DataFrame(data = d)
I would do this if I were you. (By the way, the tuple-casting part can be optimized.)
import numpy as np
import pandas as pd

point1 = [-37.83028766, 144.9539561]

def to_xy(point):
    r = 6371000  # radius of the Earth in meters
    lam, phi = point
    cos_phi_0 = np.cos(np.radians(phi))
    return (r * np.radians(lam) * cos_phi_0,
            r * np.radians(phi))

point1_xy = to_xy(point1)
print(point1_xy)

d = {
    'Lat' : [37.8300,37.8200,37.8200,37.8100,37.8000,37.8000,37.7900,37.7900,37.7800],
    'Long' : [144.8500,144.8400,144.8600,144.8700,144.8800,144.8900,144.8800,144.8700,144.8500],
}
df = pd.DataFrame(d)

df['to_xy'] = df.apply(lambda x: tuple(x.values), axis=1).map(to_xy)
print(df)
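If the apply/map step ever becomes a bottleneck on very large frames (>100,000 rows), the same arithmetic can be applied to whole columns at once with NumPy. A sketch, meant to be run after the snippet above (it reuses the df and imports defined there and adds hypothetical x/y columns):

# Vectorised alternative: the same arithmetic as to_xy, applied to whole columns at once.
r = 6371000  # radius of the Earth in meters
lam = np.radians(df['Lat'].to_numpy())
phi = np.radians(df['Long'].to_numpy())
df['x'] = r * lam * np.cos(phi)
df['y'] = r * phi
print(df[['x', 'y']])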
I am trying to convert the function create.bspline.basis(rangval, nbasis, norder=norder, breaks=breaks) from R to Python.
I have tried using the BSpline(t, c, degree) function from scipy.interpolate, but I cannot seem to get the same results as I got in R.
Here is my R code:
library('fda')
df <- read.csv('data.csv', header = T)
df <- df[,1] # convert data frame to vector. Vector has a length of 1941.
rangval <- c(1, length(df))
breaks = seq(1,length(df),length.out=length(df)/60)
norder = 6
nbasis = length(breaks) - 2 + norder
bbasis = create.bspline.basis(rangval,nbasis,norder=norder,breaks=breaks)
plot(bbasis)
Here is my Python code:
from scipy.interpolate import BSpline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import math
Load Data files as data frames:
df = pd.read_csv(r'RData\Data.csv')
Convert data frames to arrays:
df = df.to_numpy()
breaks = np.linspace(1, len(df), math.ceil(len(df)/60))
k = math.ceil(len(df)/60) - 2
degree = 5
order = degree + 1
n = order + k
t = np.zeros(math.ceil(len(df)/60) + (2 * order))  # create an array to store knots
t[:order] = 0
t[-order:] = len(df)
t[order:-order] = breaks
xx = np.arange(len(df))
for i in range(0, n):
    c = np.zeros(n)
    c[i] = 1
    spl = BSpline(t, c, degree)
    plt.plot(xx, spl(xx))
plt.show()
With the Python code above I get the plot:
For my Python code, I would like to have all the B-splines in a single object, not just be able to plot each B-spline one at a time. My goal is to use the full set of B-splines and pass it into another function to perform smoothing.
Basically, I am trying to follow the steps below, but using Python:
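One possible way to get every basis function into a single object, rather than rebuilding a BSpline per column: scipy's BSpline accepts a 2-D coefficient array, so passing an identity matrix evaluates all basis functions at once. This is only a sketch of the scipy mechanics (it reuses t, order, degree, df and plt from the snippet above and makes no guarantee of matching fda's basis exactly):

# All basis functions in one BSpline object via a 2-D (identity) coefficient array.
n_basis = len(t) - order            # number of basis functions implied by the knot vector t
basis = BSpline(t, np.eye(n_basis), degree)

xx = np.arange(len(df))
B = basis(xx)                       # shape (len(xx), n_basis); column j is basis function j
plt.plot(xx, B)
plt.show()
# B can then be handed to a smoothing routine, e.g. np.linalg.lstsq(B, y) for a
# plain least-squares fit of coefficients to data y.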
I'm trying to build a counter that detects the number of oscillations in a given data set.
I'm following a method where the slope at each point is calculated, and oscillations are counted based on changes between the negative and positive direction.
Is there a pre-existing function for this?
I'm using the following code, and I'm unable to leave out the cells with zero values after taking the difference between each cell.
import pandas as pd
import xlsxwriter
from asammdf import MDF
import numpy as np
dat = MDF("file_name.dat")
app = dat.get('variable_name')
df = pd.DataFrame(app)
print(df)
data = df.loc[0, 0:]
#time step = T
T = 0.01
# Number of sample points
N = len(data)
# sample spacing
x = np.linspace(0.0, N*T, N, endpoint=False)
x1 = data.diff()
print(x1)
df1_1 = pd.DataFrame([x1])
df1_1 = df1_1.replace(0, np.nan)
df1_1 = df1_1.dropna(how='all', axis=0)
df1_1 = df1_1.dropna()
df1 = pd.DataFrame.transpose(df1_1)
df1.to_csv("output.csv")
My data looks like this:
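As far as I know there is no ready-made oscillation counter in pandas or NumPy (scipy.signal.find_peaks is the closest relative), but the direction-change idea described above only takes a few lines. A minimal sketch, assuming the signal is the 1-D Series data from the code above and that one oscillation corresponds to two direction changes:

import numpy as np

# Slope between consecutive samples, dropping the leading NaN and flat (zero) segments.
slope = data.diff().to_numpy()
slope = slope[~np.isnan(slope)]
slope = slope[slope != 0]

# A direction change is a sign change between consecutive slopes.
direction_changes = np.sum(np.diff(np.sign(slope)) != 0)
oscillations = direction_changes / 2   # assumption: two direction changes = one oscillation
print(direction_changes, oscillations)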
I'm trying to adapt code from Matlab to Python, specifically measuring the phase angle from a group of angles. Using the df below, I have 5 individual labels, each with an associated angle. I want to measure the phase angle between these points. In Matlab, the angle for each label is passed to the following:
exp(i*ang)
This equals the following:
A_ang = 0.9648 + 0.2632i
B_ang = 0.7452 + 0.6668i
C_ang = 0.9923 + 0.1241i
D_ang = 0.8615 + 0.5077i
E_ang = 0.9943 + 0.1066i
I then sum these, divide by the number of angles, and take the absolute value:
out = (ac + bc + cc + dc + ec)/5
total = abs(out)
The final output should be 0.9708
import pandas as pd
import numpy as np
df = pd.DataFrame({
'Label' : ['A','B','C','D','E'],
'Angle' : [0.266252,0.729900,0.124355,0.532504,0.106779],
})
Python supports the complex data type out of the box, and cmath provides mathematical functions for complex numbers. Try the following:
import cmath
angs = [0.266252,0.729900,0.124355,0.532504,0.106779]
# Create the complex numbers associated with the angles (with r = 1)
nums = [cmath.cos(ang) + cmath.sin(ang) * 1j for ang in angs]
# Compute the total following what you described.
total = abs(sum(nums))/len(angs)
print (total) #0.9707616522067346
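Since the angles are already in a DataFrame, the same calculation can also be done with NumPy directly on the column (a sketch, reusing the df from the question):

import numpy as np

# exp(1j * angle) builds the unit complex numbers; abs of their mean is the result.
total = np.abs(np.exp(1j * df['Angle'].to_numpy()).mean())
print(total)  # ~0.97076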
I have tables like these:
import pandas as pd
import numpy as np

df1 = pd.DataFrame([
    ['A', (37.55, 126.97)],
    ['B', (37.56, 126.97)],
    ['C', (37.57, 126.98)]
], columns=['STA_NM', 'COORD'])

df2 = pd.DataFrame([
    ['A-01', (37.57, 126.99)]
], columns=['ID', 'COORD'])
I'm trying to pick each coordinate from df2, find the two closest stations (STA_NM) and their distances to each coordinate from df1, then add them to a new column of df2. I tried the following code:
from heapq import nsmallest
from math import cos, asin, sqrt

def dist(x, y):
    p = 0.017453292519943295
    a = 0.5 - cos((y[0] - x[0]) * p) / 2 + cos(x[0] * p) * cos(y[0] * p) * (1 - cos((y[1] - x[1]) * p)) / 2
    return 12741 * asin(sqrt(a))

def shortest(df, v):
    l_sta = []
    # get a list of coords
    l_coord = df['COORD'].tolist()
    # get the two nearest coordinates
    near_coord = nsmallest(2, l_coord, key=lambda p: dist(v, p))
    # find station names
    l_sta.append((df.loc[df['COORD'] == near_coord[0], 'STA_NM'].to_string(index=False), round(dist(near_coord[0], v) * 1000)))
    l_sta.append((df.loc[df['COORD'] == near_coord[1], 'STA_NM'].to_string(index=False), round(dist(near_coord[1], v) * 1000)))
    # e.g.: [('A', 700), ('B', 1000)]
    return l_sta

df2['NEAR_STA'] = df2['COORD'].map(lambda x: shortest(df1, x))
In the original data, df1 has about 700 rows and df2 has about 55k rows. When I tried the above code, it took nearly two minutes. Is there a better way to make it faster?
You could convert the lat/lon coordinates to earth-centered, earth-fixed (ECEF) coordinates (lat and lon become x/y/z from the earth's core) before doing the distance calculation. That would make your dist function faster, since it would become a single euclidean distance calculation.
You could also ditch the dataframe/lambda approach and use cython or numba to speed this up significantly.
There is also an opportunity to speed things up if you know what the spatial distribution of your stations looks like. For example, if they are on a regular grid, then you only have to look at the four neighboring stations. If you know that there are usually at least two stations within some distance of another, then you only need to search within that radius. If you have no such prior information, there are no such tricks.
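A rough sketch of the ECEF idea combined with a KD-tree (scipy.spatial.cKDTree), assuming the df1/df2 layout from the question and a spherical Earth. Nearest neighbours by straight-line (chord) distance are also nearest by great-circle distance, so the ranking is preserved:

import numpy as np
from scipy.spatial import cKDTree

R = 6371.0  # mean Earth radius in km

def to_ecef(coords):
    """(n, 2) array of (lat, lon) in degrees -> (n, 3) array of x, y, z in km."""
    lat = np.radians(coords[:, 0])
    lon = np.radians(coords[:, 1])
    return np.column_stack((R * np.cos(lat) * np.cos(lon),
                            R * np.cos(lat) * np.sin(lon),
                            R * np.sin(lat)))

sta_xyz = to_ecef(np.array(df1['COORD'].tolist()))
qry_xyz = to_ecef(np.array(df2['COORD'].tolist()))

tree = cKDTree(sta_xyz)
chord, idx = tree.query(qry_xyz, k=2)          # 2 nearest stations per query point
arc_km = 2 * R * np.arcsin(chord / (2 * R))    # convert chord length to great-circle distance

df2['NEAR_STA'] = [
    list(zip(df1['STA_NM'].to_numpy()[i], (d * 1000).round().astype(int)))
    for i, d in zip(idx, arc_km)
]
print(df2)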
I would like to come up with a faster way to create a distance matrix between all lat/lon pairs. This Q&A addresses a vectorized approach using standard linear algebra, but without lat/lon coordinates.
In my case these lat/lons are farms. Here is my Python code, which for the full data set (4,000 (lat, lon) pairs) takes at least five minutes. Any ideas?
def slowdistancematrix(df, distance_calc=True, sparse=False, dlim=100):
    """
    inputs: df
    returns:
        1.) distance between all farms in miles
        2.) distance^2
    """
    from scipy.spatial import distance_matrix
    from geopy.distance import geodesic

    unique_farms = pd.unique(df.pixel)
    df_unique = df.set_index('pixel')
    df_unique = df_unique[~df_unique.index.duplicated(keep='first')]  # only keep unique index values

    distance = np.zeros((unique_farms.size, unique_farms.size))
    for i in range(unique_farms.size):
        lat_lon_i = df_unique.Latitude.iloc[i], df_unique.Longitude.iloc[i]
        for j in range(i):
            lat_lon_j = df_unique.Latitude.iloc[j], df_unique.Longitude.iloc[j]
            if distance_calc == True:
                distance[i, j] = geodesic(lat_lon_i, lat_lon_j).miles
                distance[j, i] = distance[i, j]  # make use of symmetry

    return distance, np.power(distance, 2)
My solution is a vectorized version of this implementation:
import numpy as np

def dist(v):
    v = np.radians(v)
    dlat = v[:, 0, np.newaxis] - v[:, 0]
    dlon = v[:, 1, np.newaxis] - v[:, 1]
    a = np.sin(dlat / 2.0) ** 2 + np.cos(v[:, 0, np.newaxis]) * np.cos(v[:, 0]) * np.sin(dlon / 2.0) ** 2
    c = 2 * np.arcsin(np.sqrt(a))
    result = 3956 * c  # Earth radius in miles
    return result
However, you will need to convert your DataFrame to a NumPy array, using the values attribute. For example:
df = pd.read_csv('some_csv_file.csv')
distances = dist(df[['lat', 'lng']].values)
This is not a pure Python solution; instead, it relies on having R installed with the geodist package and the rpy2 interface:
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter

def pygeodist(pd_df):
    """
    pd_df must have columns 'x' and 'y' such that 'x' is the lng coordinate
    and 'y' is the lat coordinate
    """
    geodist = importr('geodist')
    with localconverter(ro.default_converter + pandas2ri.converter):
        return geodist.geodist(pd_df, measure="geodesic")
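Hypothetical usage, assuming a DataFrame with an 'x' (lng) and a 'y' (lat) column; geodist returns the full pairwise distance matrix (in metres, as far as I know):

import pandas as pd

# Made-up example coordinates, purely for illustration.
farms = pd.DataFrame({'x': [-93.60, -94.10], 'y': [41.60, 42.00]})
dmat = pygeodist(farms)
print(dmat)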