Distance between coordinates Python vs R computation time - python

I am trying to calculate the distance between one point and many others on a WGS84 ellipsoid - not the haversine approximation as explained in other answers. I would like to do it in Python but the computation time is very long with respect to R. My Python script below takes almost 23 seconds while the equivalent one in R takes 0.13 seconds. Any suggestion for speeding up my python code?
Python script:
import numpy as np
import pandas as pd
import xarray as xr
from geopy.distance import geodesic
from timeit import default_timer as timer
df = pd.DataFrame()
city_coord_orig = (4.351749, 50.845701)
city_coord_orig_r = tuple(reversed(city_coord_orig))
N = 100000
np.random.normal()
df['or'] = [city_coord_orig_r] * N
df['new'] = df.apply(lambda x: (x['or'][0] + np.random.normal(), x['or'][1] + np.random.normal()), axis=1)
start = timer()
df['d2city2'] = df.apply(lambda x: geodesic(x['or'], x['new']).km, axis=1)
end = timer()
print(end - start)
R script
# clean up
rm(list = ls())
# read libraries
library(geosphere)
city.coord.orig <- c(4.351749, 50.845701)
N<-100000
many <- data.frame(x=rep(city.coord.orig[1], N) + rnorm(N),
y=rep(city.coord.orig[2], N) + rnorm(N))
city.coord.orig <- c(4.351749, 50.845701)
start_time <- Sys.time()
many$d2city <- distGeo(city.coord.orig, many[,c("x","y")])
end_time <- Sys.time()
end_time - start_time

You are using .apply(), which runs your function in a plain Python loop, once for each and every row. The distance calculation itself is also done entirely in Python (geopy uses geographiclib, which appears to be written in Python only). Non-vectorised distance calculations like this are slow; what you need is a vectorised solution backed by compiled code, just as when calculating the Haversine distance.
pyproj offers vectorised WGS84 distance calculations (the methods of the pyproj.Geod class accept numpy arrays) and wraps the PROJ library (formerly PROJ.4), meaning it runs these calculations in native machine code:
from pyproj import Geod
# split out coordinates into separate columns
df[['or_lat', 'or_lon']] = pd.DataFrame(df['or'].tolist(), index=df.index)
df[['new_lat', 'new_lon']] = pd.DataFrame(df['new'].tolist(), index=df.index)
wsg84 = Geod(ellps='WGS84')
# numpy matrix of the lon / lat columns, iterable in column order
or_and_new = df[['or_lon', 'or_lat', 'new_lon', 'new_lat']].to_numpy().T
df['d2city2'] = wsg84.inv(*or_and_new)[-1] / 1000 # as km
This clocks in at considerably better times:
>>> from timeit import Timer
>>> count, total = Timer(
... "wsg84.inv(*df[['or_lon', 'or_lat', 'new_lon', 'new_lat']].to_numpy().T)[-1] / 1000",
... 'from __main__ import wsg84, df'
... ).autorange()
>>> total / count * 10 ** 3 # milliseconds
66.09873340003105
66 milliseconds to calculate 100k distances, not bad!
To make the comparison objective, here is your geopy / df.apply() version on the same machine:
>>> count, total = Timer("df.apply(lambda x: geodesic(x['or'], x['new']).km, axis=1)", 'from __main__ import geodesic, df').autorange()
>>> total / count * 10 ** 3 # milliseconds
25844.119450000107
25.8 seconds, not even in the same ballpark.
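To make the idea of "vectorised" from the first paragraph concrete without depending on pyproj, here is a minimal numpy sketch of the Haversine calculation mentioned above. It is the spherical approximation, so it will not reproduce the WGS84 numbers; the function name haversine_km and the mean Earth radius of 6371 km are assumptions of this sketch.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumed mean radius; a sphere, not the WGS84 ellipsoid


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km; accepts scalars or numpy arrays (degrees)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(h))


# one origin against many destinations, all in a single array expression
rng = np.random.default_rng(0)
n = 100_000
new_lat = 50.845701 + rng.normal(size=n)
new_lon = 4.351749 + rng.normal(size=n)
d_km = haversine_km(50.845701, 4.351749, new_lat, new_lon)
```

There is no Python-level loop over rows here: each arithmetic step runs over the whole array in compiled code, which is the same reason the pyproj call is fast.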

Related

How to implement multiprocessing in Monte Carlo integration

I created a Python program that integrates a given function over a given interval using Monte Carlo simulation. It works well, except for the fact that it runs painfully slow when you want higher levels of accuracy (larger N value). I figured I'd give multiprocessing a try in order to speed it up, but then I realized I have no clue how to implement it. Here's what I have right now:
from scipy import random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Process
import os
# GOAL: Approximate the integral of a function f(x) from lower bound a to upper bound b using Monte Carlo simulation
# bounds of integration
a = 0
b = np.pi
# function to integrate
def f(x):
    return np.sin(x)
N = 10000
areas = []
def mcIntegrate():
    for i in range(N):
        # array filled with random numbers between limits
        xrand = random.uniform(a, b, N)
        # sum the return values of the function of each random number
        integral = 0.0
        for i in range(N):
            integral += f(xrand[i])
        # scale integral by difference of bounds divided by amount of random values
        ans = integral * ((b - a) / float(N))
        # add approximation to list of other approximations
        areas.append(ans)
if __name__ == "__main__":
    processes = []
    numProcesses = os.cpu_count()
    for i in range(numProcesses):
        process = Process(target=mcIntegrate)
        processes.append(process)
    for process in processes:
        process.start()
    for process in processes:
        process.start()
    # graph approximation distribution
    plt.title("Distribution of Approximated Integrals")
    plt.hist(areas, bins=30, ec='black')
    plt.xlabel("Areas")
    plt.show()
Can I get some help with this implementation?
Took advice from the comments and used multiprocessing.Pool, and also cut down on some operations by using NumPy instead. Went from taking about 5 min to run to now about 6 sec (for N = 10000). Here's my implementation:
import scipy
import numpy as np
import matplotlib.pyplot as plt
import multiprocessing
import os
# GOAL: Approximate the integral of function f from lower bound a to upper bound b using Monte Carlo simulation
a = 0 # lower bound of integration
b = np.pi # upper bound of integration
f = np.sin # function to integrate
N = 10000 # sample size
def mcIntegrate(p):
    xrand = np.random.uniform(a, b, N) # create array filled with random numbers within bounds (scipy.random was only an alias for np.random)
    integral = np.sum(f(xrand)) # sum return values of function of each random number
    approx = integral * ((b - a) / float(N)) # scale integral by difference of bounds divided by sample size
    return approx
if __name__ == "__main__":
    # run simulation N times in parallel and store results in array
    with multiprocessing.Pool(os.cpu_count()) as pool:
        areas = pool.map(mcIntegrate, range(N))
    # graph approximation distribution
    plt.title("Distribution of Approximated Integrals")
    plt.hist(areas, bins=30, ec='black')
    plt.xlabel("Areas")
    plt.show()
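One thing to watch with this pattern, stated here as a general caveat rather than something observed in the original run: on platforms where worker processes are forked, every worker starts with a copy of the parent's global NumPy random state, so different workers can draw identical samples. A minimal sketch of one way around it is to give each task its own seed via np.random.SeedSequence.spawn. The names mc_estimate and ROOT_SEED are hypothetical, and the built-in map stands in for pool.map so the sketch runs anywhere.

```python
import numpy as np

a, b = 0.0, np.pi
N = 10_000         # samples per estimate
TRIALS = 100       # number of estimates (kept small for the sketch)
ROOT_SEED = 12345  # hypothetical; any integer works


def mc_estimate(seed_seq):
    """One Monte Carlo estimate of the integral of sin on [a, b], using its own generator."""
    rng = np.random.default_rng(seed_seq)
    xrand = rng.uniform(a, b, N)
    return np.sin(xrand).sum() * (b - a) / N


# one independent child seed per task
child_seeds = np.random.SeedSequence(ROOT_SEED).spawn(TRIALS)

# serial stand-in; with a Pool the intent is pool.map(mc_estimate, child_seeds)
areas = list(map(mc_estimate, child_seeds))
```

Because the seed travels with the task, the result no longer depends on which worker happens to pick it up, and a run can be reproduced from ROOT_SEED alone.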
This turned out to be a more interesting problem than I thought it would when I got to optimising it. The basic method is very simple:
from multiprocessing import Pool
def f(x):
    return x
if __name__ == "__main__":
    with Pool() as pool:
        results = pool.map(f, range(100))
Here is your mcIntegrate adapted for multiprocessing:
from tqdm import tqdm
def mcIntegrate(steps):
    tasks = []
    print("Setting up simulations")
    # linear
    for _ in tqdm(range(steps)):
        xrand = random.uniform(a, b, steps)
        for i in range(steps):
            tasks.append(xrand[i])
    pool = Pool(cpu_count())
    print("Simulating (no progress)")
    results = pool.map(f, tasks)
    pool.close()
    print("summing")
    areas = []
    for chunk in tqdm(range(steps)):
        vals = results[chunk * steps : (chunk + 1) * steps]
        integral = sum(vals)
        ans = integral * ((b - a) / float(steps))
        areas.append(ans)
    return areas
tqdm is just used to display a progress bar.
This is the basic workflow for multiprocessing: break the question up into tasks, solve all the tasks, then add them all back together again. And indeed the code as given works. (Note that I've changed your N for steps).
For completeness, the script now begins:
from numpy import random # scipy.random was only an alias for numpy.random
import numpy as np
import matplotlib.pyplot as plt
from multiprocessing import Pool, cpu_count
from tqdm import tqdm
# function to integrate
def f(x):
    return np.sin(x)
and ends
# a and b must be defined before mcIntegrate runs, since it reads them
a = 0
b = np.pi
if __name__ == "__main__":
    areas = mcIntegrate(3_000)
    plt.title("Distribution of Approximated Integrals")
    plt.hist(areas, bins=30, ec="black")
    plt.xlabel("Areas")
    plt.show()
Optimisation
I deliberately split the problem up at the smallest possible level. Was this a good idea? To answer that, consider: how might we optimise the linear process of generating the tasks? This does take a considerable while at the moment. We could parallelise it:
def _prepare(steps):
    xrand = random.uniform(a, b, steps)
    return [xrand[i] for i in range(steps)]
def mcIntegrate(steps):
    ...
    tasks = []
    for res in tqdm(pool.imap(_prepare, (steps for _ in range(steps))), total=steps):
        tasks += res # slower except for very large steps
Here I've used pool.imap, which returns an iterator that yields results as soon as they are available, allowing us to build a progress bar. If you do this and compare, you will see that it runs slower than the linear solution. Removing the progress bar and replacing the loop with the following (timed on my machine):
import time
start = time.perf_counter()
results = pool.map(_prepare, (steps for _ in range(steps)))
tasks = []
for res in results:
    tasks += res
print(time.perf_counter() - start)
This is only marginally faster: it's still slower than running linearly. Serialising data to send it to a process and then deserialising it on the other side has an overhead. If you try to get a progress bar on the whole thing, it becomes excruciatingly slow:
results = []
for result in tqdm(pool.imap(f, tasks), total=len(tasks)):
    results.append(result)
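multiprocessing moves arguments and results between processes by pickling them, so part of the overhead described above can be seen without starting any workers. A rough sketch, assuming 1,000 float samples; the variable names are made up for illustration, and the per-item figure models the case where each sample travels as its own message:

```python
import pickle
import numpy as np

xrand = np.random.default_rng(0).uniform(0, np.pi, 1_000)

# fine-grained tasks: as if every sample were pickled as its own message
per_item_bytes = sum(len(pickle.dumps(float(x))) for x in xrand)

# coarse-grained task: the whole array is pickled once
whole_array_bytes = len(pickle.dumps(xrand))

print(per_item_bytes, whole_array_bytes)
```

Byte counts are only part of the picture, since every message also has to be dispatched and collected. pool.imap defaults to chunksize=1, whereas pool.map groups the tasks into larger chunks when no chunksize is given, which fits the slowdown seen with the imap progress bar.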
So what about iterating at a higher level? Here's another adaptation of your mcIntegrate:
a = 0
b = np.pi
def _mcIntegrate(steps):
    xrand = random.uniform(a, b, steps)
    integral = 0.0
    for i in range(steps):
        integral += f(xrand[i])
    ans = integral * ((b - a) / float(steps))
    return ans
def mcIntegrate(steps):
    areas = []
    p = Pool(cpu_count())
    for ans in tqdm(p.imap(_mcIntegrate, ((steps) for _ in range(steps))), total=steps):
        areas.append(ans)
    return areas
This, on my machine, is considerably faster. It's also considerably simpler. I was expecting a difference, but not such a considerable difference.
Takeaways
Multiprocessing isn't free. Something as simple as np.sin() is too cheap to multiprocess: we pay to serialise, deserialise, append, and so on, all for one sin() calculation. But if you put too many calculations into each task, you will waste time as you lose granularity. Here the effect is more striking than I was expecting. The only way to know the right level of granularity for a particular problem... is to profile and try.
My experience is that multiprocessing is often not very efficient (a ton of overhead). The more you push your code into numpy the faster it will be, with one caveat; you can overload your memory if you're not careful (10k x 10k is getting large). Lastly, it looks like N is doing double duty, both defining sample size for each estimate, and also serving as the number of trial estimates.
Here is how I would do this (with minor style changes):
import numpy as np
f = np.sin
a = 0
b = np.pi
# number samples for each trial, trial count, and number calculated at once
N = 10000
TRIALS = 10000
BATCH_SIZE=1000
def mc_integrate(f, a, b, N, batch_size=BATCH_SIZE):
    # compute everything carrying `batch_size` copies by extending the array dimension.
    # samples.shape == (N, batch_size)
    samples = np.random.uniform(a, b, size=(N, batch_size))
    integrals = np.sum(f(samples), axis=0)
    mc_estimates = integrals * ((b - a) / N)
    return mc_estimates
# loop over batch values to get final result
n, r = divmod(TRIALS, BATCH_SIZE)
results = []
for j in [BATCH_SIZE]*n + [r]:
    results.extend(mc_integrate(f, a, b, N, batch_size=j))
On my machine this takes a few seconds.
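A quick sanity check on any of the versions above is to compare the spread of the estimates with what theory predicts. For f = sin on [0, pi] the exact integral is 2, and the standard error of one N-sample estimate is (b - a) * std(f(U)) / sqrt(N), with U uniform on [a, b]. The sketch below assumes a seeded generator and far fewer trials than the original so that it runs quickly; the names are illustrative.

```python
import numpy as np

a, b = 0.0, np.pi
N = 10_000      # samples per estimate
TRIALS = 200    # fewer trials than the original, to keep the sketch fast

rng = np.random.default_rng(42)
samples = rng.uniform(a, b, size=(N, TRIALS))
estimates = np.sin(samples).sum(axis=0) * (b - a) / N

# Var[sin(U)] for U uniform on [0, pi] is 1/2 - (2/pi)^2
predicted_se = (b - a) * np.sqrt(0.5 - (2 / np.pi) ** 2) / np.sqrt(N)

print(f"mean estimate : {estimates.mean():.4f} (exact value is 2)")
print(f"observed std  : {estimates.std(ddof=1):.5f}")
print(f"predicted std : {predicted_se:.5f}")
```

If the histogram of areas is much wider or narrower than the predicted value, or not centred near 2, that points to a bug in the sampling or the scaling rather than to Monte Carlo noise.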

[pandas] Is there any way to calculate cumulative travel distance faster or more simply?

I am trying to speed up my code.
Here is my sample code. (The actual code is more complex.)
import pandas as pd
import time, math, random
length=10000
x = [random.randint(0,100) for _ in range(length)]
y = [random.randint(0,100) for _ in range(length)]
x_pd = pd.Series(data=x)
y_pd = pd.Series(data=y)
print(x)
print(y)
print(x_pd)
print(y_pd)
distance= 0
distance2= 0
t = time.time()
for k in range(1, len(x)):
    distance += math.sqrt((x[k] - x[k-1])**2 + (y[k] - y[k-1])**2)
print("dist from list : %lf"% distance)
print("duration for compute moving distance = ", time.time()-t)
# compute by rolling
t = time.time()
for k in range(1, len(x_pd)):
    distance2 += math.sqrt((x_pd[k] - x_pd[k-1])**2 + (y_pd[k] - y_pd[k-1])**2)
print("dist from pd.Series : %lf"% distance2)
print("duration for compute moving distance = ", time.time()-t)
As you can see above, I have two lists (or pandas Series) holding the X and Y coordinates of a sequence of poses, and I want to calculate the cumulative travel distance.
I think that as length grows, calculating it with pandas as above gets slower because of the for loop.
Is there any way to calculate this faster or more simply than what I came up with?
Thank you!
Try the vectorized Pandas functions:
((x_pd.diff()**2 + y_pd.diff()**2)**.5).sum()
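The expression above returns the total only. If the running (cumulative) distance at each pose is wanted as well, a cumulative sum over the segment lengths gives it. A minimal numpy sketch, with made-up sample coordinates:

```python
import numpy as np

x = np.array([0, 3, 3, 6])
y = np.array([0, 4, 4, 8])

# length of each segment between consecutive poses
segment = np.hypot(np.diff(x), np.diff(y))

# distance travelled up to each pose, starting at 0 for the first one
cumulative = np.concatenate(([0.0], np.cumsum(segment)))
total = cumulative[-1]
```

With the Series from the question, the same idea should read np.hypot(x_pd.diff(), y_pd.diff()).fillna(0).cumsum().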
You can also use another package, modin.pandas. It is approximately 4x faster than pandas, it has the same functions, and it uses parallel processing.
import modin.pandas as pd

Vectorizing calculation of values using numpy which requires previously calculated value

I'm trying to calculate a particular formula for EMA from Investopedia which looks like
EmaToday = (ValueToday * (Smoothing / (1 + Days)))
+ (EmaYesterday * (1 - (Smoothing / (1 + Days))))
Smoothing and Days are constants, so we can simplify this.
Let's call (Smoothing / (1 + Days)) 'M'.
The simplified equation becomes:
EmaToday = ((ValueToday - EmaYesterday) * M) + EmaYesterday
We can do this in traditional python using loops as follows:
# Initialize an empty numpy array to hold calculated ema values
emaTodayArray = np.zeros((1, valueTodayArray.size - Days), dtype=np.float32)
ema = emaYesterday
# Calculate ema
for i, valueToday in enumerate(np.nditer(valueList)):
    ema = ((valueToday - ema) * M) + ema
    emaTodayArray[i] = ema
emaTodayArray holds all the computed EMA values.
I'm having a hard time trying to figure out how to vectorize this completely as the emaYesterday value is needed for every new calculation.
If full vectorization with numpy is possible at all, I'd really appreciate it if someone could show me the way.
Note: I had to fill in a few dummies to make your code run, pls check whether they are ok.
The loop can be vectorized by transforming ema[i] ~> ema'[i] = ema[i] x (1-M)^-i after which it becomes just a cumsum.
This is implemented below as ema_pp_naive.
The problem with this method is that for medium sized i (~10^3) the (1-M)^-i term may overflow rendering the result useless.
We can circumvent this problem by going to log space (using np.logaddexp for the summation). This ema_pp_safe is quite a bit more expensive than the naive method but still >10x faster than the original loop. In my quick and dirty testing this gave correct results for a million terms and beyond.
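Spelled out, the substitution works as follows. The notation is mine rather than the question's: e_i stands for ema[i], v_i for the i-th value, and e_0 for emaYesterday.

```latex
\begin{aligned}
e_i &= (1-M)\,e_{i-1} + M\,v_i \\
\frac{e_i}{(1-M)^{i}} &= \frac{e_{i-1}}{(1-M)^{i-1}} + \frac{M\,v_i}{(1-M)^{i}} \\
e'_i &= e'_{i-1} + M\,v_i\,(1-M)^{-i}, \qquad e'_i := e_i\,(1-M)^{-i} \\
e_i &= (1-M)^{i}\Bigl(e_0 + \sum_{k=1}^{i} M\,v_k\,(1-M)^{-k}\Bigr)
\end{aligned}
```

The sum in the last line is the cumsum, and the factor (1-M)^{-k} inside it is the term that grows without bound as k increases.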
Code:
import numpy as np
K = 1000
Days = 0
emaYesterday = np.random.random()
valueTodayArray = np.random.random(K)
M = np.random.random()
valueList = valueTodayArray
import time
T = []
T.append(time.perf_counter())
# Initialize an empty numpy array to hold calculated ema values
emaTodayArray = np.zeros((valueTodayArray.size - Days), dtype=np.float32)
ema = emaYesterday
# Calculate ema
for i, valueToday in enumerate(np.nditer(valueList)):
    ema = ((valueToday - ema) * M) + ema
    emaTodayArray[i] = ema
T.append(time.perf_counter())
scaling = np.broadcast_to(1/(1-M), valueTodayArray.size+1).cumprod()
ema_pp_naive = ((np.concatenate([[emaYesterday], valueTodayArray * M]) * scaling).cumsum() / scaling)[1:]
T.append(time.perf_counter())
logscaling = np.log(1-M)*np.arange(valueTodayArray.size+1)
log_ema_pp = np.logaddexp.accumulate(np.log(np.concatenate([[emaYesterday], valueTodayArray * M])) - logscaling) + logscaling
ema_pp_safe = np.exp(log_ema_pp[1:])
T.append(time.perf_counter())
print(f'K = {K}')
print('naive method correct:', np.allclose(ema_pp_naive, emaTodayArray))
print('safe method correct:', np.allclose(ema_pp_safe, emaTodayArray))
print('OP {:.3f} ms naive {:.3f} ms safe {:.3f} ms'.format(*np.diff(T)*1000))
Sample runs:
K = 100
naive method correct: True
safe method correct: True
OP 0.236 ms naive 0.061 ms safe 0.053 ms
K = 1000
naive method correct: False
safe method correct: True
OP 2.397 ms naive 0.224 ms safe 0.183 ms
K = 1000000
naive method correct: False
safe method correct: True
OP 2145.956 ms naive 18.342 ms safe 108.528 ms
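As a different technique from the two in this answer, the same recursion is a first-order IIR filter, so scipy.signal.lfilter can evaluate it in compiled code without any rescaling. It also does not need the inputs to be positive, which the log-space version does because it takes np.log of them. A minimal sketch, assuming SciPy is available; ema_lfilter is a hypothetical helper name:

```python
import numpy as np
from scipy.signal import lfilter


def ema_lfilter(values, M, ema_yesterday):
    """EMA via ema[i] = M * values[i] + (1 - M) * ema[i-1], seeded with ema_yesterday."""
    b = [M]                  # feed-forward coefficient
    a = [1.0, -(1.0 - M)]    # feedback coefficients
    # initial filter state that reproduces the given previous EMA value
    zi = [(1.0 - M) * ema_yesterday]
    ema, _ = lfilter(b, a, values, zi=zi)
    return ema


rng = np.random.default_rng(0)
values = rng.normal(size=1_000)   # includes negative values on purpose
M, ema_yesterday = 0.1, 0.5

# reference: the plain loop from the question
expected = np.empty_like(values)
ema = ema_yesterday
for i, v in enumerate(values):
    ema = (v - ema) * M + ema
    expected[i] = ema

print(np.allclose(ema_lfilter(values, M, ema_yesterday), expected))
```

I have not timed this against the naive and safe methods above, so treat it as an option to benchmark rather than a known improvement.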

Optimization of a timeline builder function

I've got a square signal with a frequency f, and I'm interested in the times at which each square starts.
def time_builder(f, t0=0, tf=300):
    """
    Function building the time line in ms between t0 and tf with a frequency f.
    f: Hz
    t0 and tf: ms
    """
    time = [t0] # /!\ time in ms
    i = 1
    while time[len(time)-1] < tf:
        if t0 + (i/f)*1000 < tf:
            time.append(t0 + (i/f)*1000)
        else:
            break
        i += 1
    return time
So this function loops between t0 and tf to create a list of the times at which a square starts. I'm quite sure it's not the best way to do it, and I'd like to know how to improve it.
Thanks.
If I am interpreting this correctly, you are looking for a list of the times of the waves, starting at t0 and ending at tf.
import math
def time_builder(f, t0=0, tf=300):
    """
    Function building the time line in ms between t0 and tf with a frequency f.
    f: Hz
    t0 and tf: ms
    """
    T = 1000 / f # period [ms]
    n = math.ceil((tf - t0) / T) # number of wavefronts strictly before tf, matching the loop in the question
    return [t0 + i*T for i in range(n)]
Using standard library python for this might not be the best approach... particularly considering that you might want to do other things later on.
An alternative is to use numpy. This lets you do the following:
import numpy as np
from scipy import signal
t = np.linspace(0, 1, 500, endpoint=False)
s = signal.square(2 * np.pi * 5 * t) # we create a square signal using scipy
d = np.diff(s) # obtaining the differences, this tells when there is a step.
# In this particular case, 2 means step up, -2 step down.
starts = t[1:][d == 2] # d[i] = s[i+1] - s[i], so the step up happens at t[i+1]
# note: a square that begins exactly at t[0] has no preceding sample and is not detected
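If only the start times are needed and the signal itself is not, a more direct numpy version of time_builder is a single np.arange call. This is a sketch that keeps the question's convention (times strictly below tf); the name time_builder_np is hypothetical, and as with any floating-point step, a start time landing exactly on tf may be included or excluded by rounding.

```python
import numpy as np


def time_builder_np(f, t0=0, tf=300):
    """Start times in ms of a square wave of frequency f (Hz) in [t0, tf)."""
    period_ms = 1000 / f
    return np.arange(t0, tf, period_ms)


print(time_builder_np(10))   # period of 100 ms
print(time_builder_np(7))    # period of about 142.86 ms
```

The result is a numpy array rather than a list, which is convenient if the times are later used to index or compare against a sampled signal.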

Efficiently Running Newton Algorithm

This is related to another question I asked earlier. I want to run the Newton method on a large dataset. Below is the code I created using a loop. I need to run it on ~50 million lines and the loop is quite unwieldy. Is there a more efficient way to run it using Pandas/Numpy/etc.? Thanks in advance
In:
from pandas import *
from pylab import *
import pandas as pd
import pylab as plt
import numpy as np
from scipy import *
import scipy
df = DataFrame(list([100,2,34.1556,9,105,-100]))
df = DataFrame.transpose(df)
df = df.rename(columns={0:'Face',1:'Freq',2:'N',3:'C',4:'Mkt_Price',5:'Yield'})
df2= df
df = concat([df, df2])
df = df.reset_index(drop=True)
df
Out:
   Face  Freq        N  C  Mkt_Price  Yield
0   100     2  34.1556  9        105   -100
1   100     2  34.1556  9        105   -100
In:
def Px(Rate):
    return Mkt_Price - (Face * ( 1 + Rate / Freq ) ** ( - N ) + ( C / Rate ) * ( 1 - (1 + ( Rate / Freq )) ** -N ) )
for count, row in df.iterrows():
    Face = row['Face']
    Freq = row['Freq']
    N = row['N']
    C = row['C']
    Mkt_Price = row['Mkt_Price']
    row['Yield'] = scipy.optimize.newton(Px, .1, tol=.0001, maxiter=100)
df
Out:
   Face  Freq        N  C  Mkt_Price     Yield
0   100     2  34.1556  9        105  0.084419
1   100     2  34.1556  9        105  0.084419
One possibility that pops into my mind is that you might do it vectorized. However, you must then throw away all conditional code, and just run the required amount of iterations.
The basic step in Newton-Raphson is always the same, so you do not need to have any conditional code. Your function Px looks as if it could be vectorized without any extra effort.
The steps are roughly:
def Px(Rate, Mkt_Price, Face, Freq, N, C):
    return Mkt_Price - (Face * ( 1 + Rate / Freq ) ** ( - N ) + ( C / Rate ) * ( 1 - (1 + ( Rate / Freq )) ** -N ) )
# initialize the iteration vector with the starting guess 0.1
# (np.zeros would start at Rate = 0 and divide by zero in C / Rate)
y = np.full(num_rows, 0.1)
# just a guess for the differentiation, might be smaller
h = 1e-6
# then iterate for a suitable number of iterations
for i in range(100):
    f = Px(y, Mkt_Price, Face, Freq, N, C)
    fp = Px(y+h, Mkt_Price, Face, Freq, N, C)
    y -= h * f / (fp - f)
After this you have the iteration results in y. I have assumed Mkt_Price, Face, etc. are 50-million-row vectors.
There will be billions of calculations, so this will still take maybe a dozen seconds. Also, there is no error checking, so if something goes wildly oscillating, there is nothing to warn you about it.
One way to make this better is to calculate the first derivative analytically, as it can be done. The practical improvement may be small, though. You will have to experiment to find the best number of iterations. If the function converges fast (as I suppose), 20 iterations will be plenty.
The code is completely untested, but it should illustrate the idea.
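A sketch of the two refinements mentioned above, an analytic derivative and a stopping test, assuming the inputs are equal-length numpy arrays. The derivative below is my own differentiation of Px with respect to Rate, and the names dPx and newton_yield are hypothetical:

```python
import numpy as np


def Px(rate, mkt_price, face, freq, n, c):
    u = 1 + rate / freq
    return mkt_price - (face * u ** -n + (c / rate) * (1 - u ** -n))


def dPx(rate, face, freq, n, c):
    """Derivative of Px with respect to rate (Mkt_Price drops out)."""
    u = 1 + rate / freq
    return ((n * face / freq) * u ** (-n - 1)
            + (c / rate ** 2) * (1 - u ** -n)
            - (c * n / (rate * freq)) * u ** (-n - 1))


def newton_yield(mkt_price, face, freq, n, c, guess=0.1, tol=1e-10, max_iter=50):
    y = np.full(np.shape(mkt_price), guess, dtype=float)
    for _ in range(max_iter):
        step = Px(y, mkt_price, face, freq, n, c) / dPx(y, face, freq, n, c)
        y -= step
        if np.max(np.abs(step)) < tol:   # every row has converged
            break
    return y


# the two sample rows from the question
face = np.array([100.0, 100.0])
freq = np.array([2.0, 2.0])
n = np.array([34.1556, 34.1556])
c = np.array([9.0, 9.0])
mkt_price = np.array([105.0, 105.0])

yields = newton_yield(mkt_price, face, freq, n, c)
print(yields)
```

The stopping test uses the largest step over all rows, so one slow row keeps the whole batch iterating; for 50 million rows it may be worth masking out rows that have already converged, which this sketch does not attempt.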
