How to multiprocess finding closest geographic point in two pandas dataframes? - python

I have a function which I'm trying to apply in parallel and within that function I call another function that I think would benefit from being executed in parallel. The goal is to take in multiple years of crop yields for each field and combine all of them into one pandas dataframe. I have the function I use for finding the closest point in each dataframe, but it is quite intensive and takes some time. I'm looking to speed it up.
I've tried creating a pool and using map_async on the inner function. I've also tried doing the same with the loop for the outer function. The latter is the only thing I've gotten to work the way I intended it to. I can use this, but I know there has to be a way to make it faster. Check out the code below:
from multiprocessing import Pool

import pandas as pd
from geopy import distance

return_columns = []
return_columns_cb = lambda x: return_columns.append(x)

def getnearestpoint(gdA, gdB, retcol):
    dist = lambda point1, point2: distance.great_circle(point1, point2).feet

    def find_closest(point):
        distances = gdB.apply(
            lambda row: dist(point, (row["Longitude"], row["Latitude"])), axis=1
        )
        return (gdB.loc[distances.idxmin(), retcol], distances.min())

    append_retcol = gdA.apply(
        lambda row: find_closest((row["Longitude"], row["Latitude"])), axis=1
    )
    return append_retcol

def combine_yield(field):
    #field is a list of the files for the field I'm working with
    #lots of pre-processing
    #dfs in this case is a list of the dataframes for the current field
    #mdf is the dataframe with the most points which I popped from this list
    p = Pool()
    for i in range(0, len(dfs)):
        p.apply_async(getnearestpoint, args=(mdf, dfs[i], dfs[i].columns[-1]), callback=return_columns_cb)
    for col in return_columns:
        mdf = mdf.append(col)
    '''I unzip my points back to longitude and latitude here in the final
    dataframe so I can write to csv without tuples'''
    mdf[["Longitude", "Latitude"]] = pd.DataFrame(
        mdf["Point"].tolist(), index=mdf.index
    )
    return mdf

def multiprocess_combine_yield():
    '''do stuff to get dictionary below with each field name as key and values
    as all the files for that field'''
    yield_by_field = {'C01': ('files...'), ...}
    #The farm I'm working on has 30 fields and below is too slow
    for k,v in yield_by_field.items():
        combine_yield(v)
I guess what I need help with is this: I envision using a pool to imap or apply_async over each tuple of files in the dictionary. Then, within combine_yield applied to that tuple of files, I want to be able to run the distance function in parallel. That function bogs the program down because it calculates the distance between every pair of points across the dataframes for each year of yield. The files average around 1200 data points each, and multiplying that by 30 fields means I need something better. Maybe the efficiency improvement lies in finding a better way to pull in the closest point. I still need something that gives me the value from gdB and the distance, though, because of what I do later on when selecting which rows to use from the 'mdf' dataframe.

Thanks to @ALollz's comment, I figured this out. I went back to my getnearestpoint function and, instead of doing a bunch of Series.apply calls, I now use cKDTree from scipy.spatial to find the closest point and then a vectorized haversine distance to calculate the true distances for each of these matched points. Much, much quicker. Here are the basics of the code:
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree

def getnearestpoint(gdA, gdB, retcol):
    gdA_coordinates = np.array(
        list(zip(gdA.loc[:, "Longitude"], gdA.loc[:, "Latitude"]))
    )
    gdB_coordinates = np.array(
        list(zip(gdB.loc[:, "Longitude"], gdB.loc[:, "Latitude"]))
    )
    tree = cKDTree(data=gdB_coordinates)
    #k=1 returns only the single nearest neighbour in gdB for each point in gdA
    distances, indices = tree.query(gdA_coordinates, k=1)
    #These column names are done as so due to formatting of my 'retcols'
    df = pd.DataFrame.from_dict(
        {
            f"Longitude_{retcol[:4]}": gdB.loc[indices, "Longitude"].values,
            f"Latitude_{retcol[:4]}": gdB.loc[indices, "Latitude"].values,
            retcol: gdB.loc[indices, retcol].values,
        }
    )
    gdA = pd.merge(left=gdA, right=df, left_on=gdA.index, right_on=df.index)
    gdA.drop(columns="key_0", inplace=True)
    return gdA
import re
from itertools import zip_longest

def combine_yield(field):
    #same preprocessing as before
    for i in range(0, len(dfs)):
        mdf = getnearestpoint(mdf, dfs[i], dfs[i].columns[-1])
    #haversine needs radians, so convert the main coordinates as well
    main_coords = np.deg2rad(np.array(list(zip(mdf.Longitude, mdf.Latitude))))
    lat_main = main_coords[:, 1]
    longitude_main = main_coords[:, 0]
    longitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Longitude_B\d{4}", c)] if m
    ]
    latitude_cols = [
        c for c in mdf.columns for m in [re.search(r"Latitude_B\d{4}", c)] if m
    ]
    year_coords = list(zip_longest(longitude_cols, latitude_cols, fillvalue=np.nan))
    for i in year_coords:
        year = re.search(r"\d{4}", i[0]).group(0)
        year_coords = np.array(list(zip(mdf.loc[:, i[0]], mdf.loc[:, i[1]])))
        year_coords = np.deg2rad(year_coords)
        lat_year = year_coords[:, 1]
        longitude_year = year_coords[:, 0]
        diff_lat = lat_main - lat_year
        diff_lng = longitude_main - longitude_year
        d = (
            np.sin(diff_lat / 2) ** 2
            + np.cos(lat_main) * np.cos(lat_year) * np.sin(diff_lng / 2) ** 2
        )
        #2.0902 * 10 ** 7 is the Earth's radius in feet
        mdf[f"{year} Distance"] = 2 * (2.0902 * 10 ** 7) * np.arcsin(np.sqrt(d))
    return mdf
Then I'll just do p.map(combine_yield, (v for k,v in yield_by_field.items())), with p = Pool().
This has made a substantial difference. Hope it helps anyone else in a similar predicament.
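For around 1200 points per file, the nearest-point search can also be checked with plain NumPy broadcasting. The sketch below is a brute-force alternative to the cKDTree approach, not the author's code: it builds the full haversine distance matrix and takes the row-wise minimum, so the match is made on great-circle distance rather than on Euclidean distance in degrees. The function names and sample coordinates are assumptions for illustration; the radius in feet is the one used in the answer.

```python
import numpy as np

EARTH_RADIUS_FT = 2.0902e7  # Earth's radius in feet, as in the answer above

def haversine_ft(lon1, lat1, lon2, lat2):
    """Great-circle distance in feet; inputs are degrees and may be arrays."""
    lon1, lat1, lon2, lat2 = map(np.deg2rad, (lon1, lat1, lon2, lat2))
    d = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_FT * np.arcsin(np.sqrt(d))

def nearest_points(lon_a, lat_a, lon_b, lat_b):
    """For each point in A, return (index of nearest B point, distance in feet)."""
    # Broadcasting builds a (len(A), len(B)) distance matrix in one call.
    dist = haversine_ft(lon_a[:, None], lat_a[:, None],
                        lon_b[None, :], lat_b[None, :])
    idx = dist.argmin(axis=1)
    return idx, dist[np.arange(len(lon_a)), idx]

# Hypothetical sample coordinates (degrees).
lon_a = np.array([-93.000, -93.010])
lat_a = np.array([42.000, 42.010])
lon_b = np.array([-93.0101, -93.0001, -92.5])
lat_b = np.array([42.0101, 42.0001, 41.0])

idx, dist_ft = nearest_points(lon_a, lat_a, lon_b, lat_b)
print(idx)  # [1 0]
```

The memory cost grows with len(A) * len(B), so this only suits files of a few thousand points; beyond that a tree-based search is the better fit.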


Avoid for loop in Python DataFrame

Problem 1.
Suppose I have n years of annual returns r and my initial wealth is 100. Every year I have a fixed expense of 6. I want to compute the wealth at the end of each year. I can do it in a for loop, but for my purpose that is too time-consuming. How do I do it with a DataFrame?
wealth = pd.Series(index = range(n+1))
wealth[0] = 100
for i in range(n):
wealth.iloc[i+1] = wealth.iloc[i]*(1+r.iloc[i]) - 6
Initially I thought
wealth = ((1 + r - 0.06).cumprod()).multiply(other = 100)
to be the solution. But it is not. Expenses are not 6%. They are fixed. It is 6.
Problem 2.
I want to do the above N times. In each case I generate r by sampling n returns with replacement.
r = returnY.sample(n,replace=True).reset_index(drop=True)
Then, for that return series, I create the wealth path described above and build an n*N dataframe of wealth paths. I can do this in a for loop, but for big N and n it takes a long time to run. Is there an efficient and elegant way to do this?
Problem 3.
Suppose allWealth is the DataFrame with all wealth paths. I want the percentage of columns in each row that are less than 0. This is how I solved it:
yy = allWealth.copy()
yy[yy>0] = 1
yy[yy<=0] = 0
yy.sum(axis = 1)/N
Any better, more elegant solution?
Problem 1: It looks like you want to apply the "reduce" pattern. You can use reduce function from functools.
import numpy as np
from functools import reduce
rs = np.random.random(50)*0.3 #sequence of annual returns
result = reduce(lambda w,r: w*(1+r)-6, rs, 100)
If you want to keep all the intermediate values, use itertools.accumulate() instead (its initial argument requires Python 3.8 or later). For example, replace the last line with the following:
import itertools
ts_iter = itertools.accumulate(rs, lambda w,r: w*(1+r)-6, initial=100)
ts = list(ts_iter) #itertools.accumulate returns an iterator
Problem 2: You can first generate a random matrix of nxN by sampling with replacement. Then you can use "apply_along_axis" method for each column.
import numpy as np
from functools import reduce
rm = np.random.random((n,N))
def sim(rs):
    return reduce(lambda w,r: w * (1+r) - 6, rs, 100)
result = np.apply_along_axis(sim, 0, rm)
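The recursion w[t+1] = w[t]*(1+r[t]) - 6 also has a closed form that avoids Python-level loops entirely. Writing P_t for the cumulative product of (1 + r), the wealth is W_t = P_t * (W_0 - 6 * sum over k <= t of 1/P_k). The sketch below is one possible vectorized version under that identity, not part of the original answers; the function name, the uniform stand-in returns, and the sizes are assumptions. It requires every return to be greater than -100% so that P_t is never zero.

```python
import numpy as np

def wealth_paths(returns, w0=100.0, expense=6.0):
    """returns: (n, N) array with one column per simulated path.

    Gives an (n + 1, N) array of wealth; row 0 holds the initial wealth.
    """
    growth = np.cumprod(1.0 + returns, axis=0)         # P_t for every path
    spent = expense * np.cumsum(1.0 / growth, axis=0)  # expenses discounted to t = 0
    paths = growth * (w0 - spent)                      # W_t = P_t * (W_0 - spent_t)
    first_row = np.full((1, returns.shape[1]), w0)
    return np.vstack([first_row, paths])

rng = np.random.default_rng(0)
n, N = 30, 1000
returns = rng.uniform(-0.1, 0.3, size=(n, N))  # stand-in for the sampled returns
all_wealth = wealth_paths(returns)
print(all_wealth.shape)  # (31, 1000)
```

With real data, the (n, N) matrix could be drawn with replacement via rng.choice(returnY.to_numpy(), size=(n, N), replace=True), assuming returnY is a one-dimensional Series. For very long horizons the division by P_t can lose precision, in which case a loop over years that is vectorized across paths is the safer choice.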
Problem 3: you don't need to assign ones and zeros to your original dataframe. A mask dataframe of True and False implicitly acts as a dataframe of ones and zeros in this case.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((50,30)))
mask = df < 0.5
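To turn such a mask into a per-row share, taking the mean along the columns is enough, because True counts as 1 and False as 0. A minimal sketch, assuming a small hypothetical allWealth-style frame with one column per path:

```python
import pandas as pd

# Hypothetical wealth paths: rows are years, columns are simulated paths.
all_wealth = pd.DataFrame(
    [[120.0, -5.0, 30.0, 0.0],
     [80.0, 10.0, -1.0, 5.0]]
)

# Share of paths at or below zero in each row.
share_non_positive = (all_wealth <= 0).mean(axis=1)
print(share_non_positive.tolist())  # [0.5, 0.25]
```

The share of strictly positive paths, which is what the yy.sum(axis = 1)/N code in the question computes, would be (all_wealth > 0).mean(axis=1).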
I used @chi's solution with a small edit.
import numpy as np
import itertools
rm = np.random.random((n,N)) #sequence of annual returns
rm0 = np.insert(rm, 0, 100, axis=1)
def wealth(rs):
    return list(itertools.accumulate(rs, lambda w,r: w*(1+r)-6))
result = np.apply_along_axis(wealth, 1, rm0)
In my Python version itertools.accumulate does not recognize initial (the keyword was added in Python 3.8), so I inserted the initial wealth at the front of the return array.

How do I add a matrix constraint `Ax=b` to a Pyomo model efficiently?

I want to add the constraints Ax=b to a Pyomo model from my numpy arrays A and b as efficiently as possible. Unfortunately, the performance is currently very bad. For the following example
import time
import numpy as np
import pyomo.environ as pyo
start = time.time()
rows = 287
cols = 2765
A = np.random.rand(rows, cols)
b = np.random.rand(rows)
mdl = pyo.ConcreteModel()
mdl.rows = range(rows)
mdl.cols = range(cols)
mdl.A = A
mdl.b = b
mdl.x_var = pyo.Var(mdl.cols, bounds=(0.0, None))
mdl.constraints = pyo.ConstraintList()
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
mdl.obj = pyo.Objective(expr=sum(mdl.x_var[col] for col in mdl.cols), sense=pyo.minimize)
end = time.time()
print(end - start)
it takes almost 30 seconds because of the add statement and the large number of columns. Is it possible to pass A, x, and b directly and quickly instead of adding the constraints row by row?
The main thing that is slowing down your construction above is the fact that you are building the constraint list elements within a list comprehension, which is unnecessary and causes a lot of bloat.
This line:
[mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows]
Constructs a list of the captured results of each ConstraintList.add() expression, which is a "rich" return. That list is an unnecessary byproduct of the loop you want to run over the add() function. Change your generation scheme to a plain for loop to avoid that capture. A generator (parentheses instead of brackets) also avoids the list, but it is lazy: the add() calls only run once something iterates over it, so the following line on its own adds no constraints:
(mdl.constraints.add(sum(mdl.A[row, col] * mdl.x_var[col] for col in mdl.cols) <= mdl.b[row]) for row in mdl.rows)
With the generator form the model construction time drops to about 0.02 seconds, but only because the constraints have not been built yet.
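The difference between the three loop forms can be seen without Pyomo. In the sketch below, add is a hypothetical stand-in for ConstraintList.add() that just records each call; it shows that a list comprehension and a for loop run immediately, while a generator expression runs nothing until it is consumed.

```python
added = []

def add(item):
    """Hypothetical stand-in for ConstraintList.add(); records the call."""
    added.append(item)
    return item

# List comprehension: runs immediately and also builds a throwaway list.
_ = [add(row) for row in range(3)]
after_listcomp = len(added)

# Generator expression: nothing runs until something iterates over it.
lazy = (add(row) for row in range(3))
after_generator = len(added)

# Plain for loop: runs immediately, with no throwaway list.
for row in range(3):
    add(row)
after_loop = len(added)

print(after_listcomp, after_generator, after_loop)  # 3 3 6
```

So a timing taken right after creating the generator measures only the creation of the generator object, not the work of building the constraints.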

Parallelizing for loops in python

I know similar questions on this topic have been asked before, but I'm still struggling to make any headway with my problem.
Basically, I have three dataframes (of sizes 402 x 402, 402 x 3142, and 1 x 402) and I'm combining elements from them into a calculation. I then write the calculation to another dataframe - see code below using dummy data. Each calculation takes between 0.3-0.8 ms, but there are (402 x 3142)^2 total calculations, which obviously takes a long time!
Since none of the calculations is dependent on any other, this is ripe for parallelization, but I'm really having a hard time figuring out how to do this - sorry the code is probably pretty ugly, very new to python, and parallel computing.
One additional thing to note is that the non-vector matrices are sparse (0.4 and 0.3, respectively), so could be changed to coordinate or compressed row/column format so that not all of the possible combinations of calculations need to be made. This might reduce the time by half.
import numpy as np
import pandas as pd
A = pd.DataFrame(np.random.choice([0, 1], size=(402,402), p=[0.6,0.4]))
B = pd.DataFrame(np.random.choice([0, 1], size=(402,3142), p=[0.7,0.3]))
x = A.sum(axis = 1)
col_names = ["R", "I", "S", "J","value"]
results = pd.DataFrame(columns = col_names)
row = 0
for r in B.columns:
    for s in B.columns:
        for i in A.index:
            for j in A.columns:
                results.loc[row,"R"] = r
                results.loc[row,"I"] = i
                results.loc[row,"S"] = s
                results.loc[row,"J"] = j
                results.loc[row, "value"] = A.loc[i,j]*B.loc[j,s]*B.loc[i,r]/x[i]
                row = row + 1
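Before reaching for multiprocessing, the four nested loops can be written as a single tensor contraction, since every value is just a product of array entries. The sketch below uses np.einsum on deliberately small stand-in sizes (the names n_i and n_r and the sizes are assumptions); it is an illustration of the idea, not a drop-in solution for the full problem.

```python
import numpy as np

rng = np.random.default_rng(0)
n_i, n_r = 4, 5  # small stand-ins for 402 and 3142
A = rng.integers(0, 2, size=(n_i, n_i)).astype(float)
np.fill_diagonal(A, 1.0)  # guarantees every row sum x[i] is non-zero
B = rng.integers(0, 2, size=(n_i, n_r)).astype(float)
x = A.sum(axis=1)

# value[r, s, i, j] = A[i, j] * B[j, s] * B[i, r] / x[i]
value = np.einsum("ij,js,ir,i->rsij", A, B, B, 1.0 / x)
print(value.shape)  # (5, 5, 4, 4)
```

At the real sizes the dense result would have (402 x 3142)^2 entries and would not fit in memory, so in practice one would compute it in blocks over r (or s) and keep only the non-zero entries, which the sparsity of A and B makes attractive.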

How to find a root of a dynamic function based in columns of a dataset using Python

I'm a beginner in Python and I need to translate some R code to Python.
I need to find one root per row in a dataset, based on a dynamic function. The R code is:
dataset = data.frame(A = c(10,20,30),B=c(20,10,40), FX = c("A+B-x","A-B+x","A*B-x"))
sol<- adply(dataset,1, summarize,
solution_0= uniroot.all(function(x)(eval(parse(text=as.character(FX),dataset))),lower = -10000, upper = 10000, tol = 0.00001))
This code returns [30,-10,1200] as the solution for each row.
In Python, I read the documentation for SciPy's optimize package, but I didn't find code that works for me.
I tried a solution like the one below, but without success:
import pandas as pd
from scipy.optimize import fsolve as fs
data = {'A': [10,20,30],
'B': [20,10,40],
'FX': ["A+B-x","A-B+x","A*B-x"]}
df = pd.DataFrame(data)
def func(FX):
Does anyone have an idea how to solve this?
Thank you very much.
SymPy is a symbolic math library for Python. Your question can be solved as:
import pandas as pd
from sympy import Symbol, solve
from sympy.parsing.sympy_parser import parse_expr
data = {'A': [10,20,30],
'B': [20,10,40],
'FX': ["A+B-x","A-B+x","A*B-x"]}
df = pd.DataFrame(data)
x = Symbol("x", real=True)
for index, row in df.iterrows():
    F = parse_expr(row['FX'], local_dict={'A': row['A'], 'B': row['B'], 'x':x})
    print (row['A'], row['B'], row['FX'], "-->", F, "-->", solve(F, x))
This outputs:
10 20 A+B-x --> 30 - x --> [30]
20 10 A-B+x --> x + 10 --> [-10]
30 40 A*B-x --> 1200 - x --> [1200]
Note that SymPy's solve returns a list of solutions. If you are sure there is always exactly one solution, just use solve(F, x)[0]. (Remember that unlike R, Python always starts indexing with 0.)
With list comprehension, you could write the solution as:
sol = [ solve(parse_expr(row['FX'], local_dict={'A': row['A'], 'B': row['B'], 'x':x}),
x)[0] for _, row in df.iterrows() ]
If you have many columns, you can also create the dictionary with a comprehension: dict({c:row[c] for c in df.columns}, **{'x':x}). The ** syntax is needed if you want to combine the dictionaries inside the list comprehension. See this post about the union of dictionaries.
cols = df.columns # change this if you won't need all columns
sol = [ solve(parse_expr(row['FX'],
local_dict=dict({c:row[c] for c in cols}, **{'x':x}) ),
x)[0].evalf() for _, row in df.iterrows() ]
PS: SymPy normally keeps the solutions in a symbolic form because it prefers exact expressions. When there are e.g. fractions or square roots, they are not evaluated immediately. To get the evaluated form, use evalf() as in solve(F, x)[0].evalf().
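If a numeric search on a fixed interval is preferred, closer in spirit to R's uniroot, a bisection can be written with the standard library alone. This is a swapped-in technique, not the SymPy approach above. The sketch assumes each function changes sign exactly once on [-10000, 10000], uses plain dicts in place of DataFrame rows, and relies on eval, which is only acceptable because the FX strings are trusted input; the helper names are hypothetical.

```python
def bisect_root(f, lower, upper, tol=1e-5):
    """Bisection on [lower, upper]; assumes one sign change in the interval."""
    f_lower = f(lower)
    if f_lower == 0:
        return lower
    if f_lower * f(upper) > 0:
        raise ValueError("f(lower) and f(upper) must have opposite signs")
    while upper - lower > tol:
        mid = (lower + upper) / 2
        f_mid = f(mid)
        if f_mid == 0:
            return mid
        if f_lower * f_mid < 0:
            upper = mid                    # root lies in the left half
        else:
            lower, f_lower = mid, f_mid    # root lies in the right half
    return (lower + upper) / 2

def make_function(row):
    """Turn the row's FX string into f(x); eval is only OK for trusted input."""
    code = compile(row["FX"], "<FX>", "eval")
    names = {k: v for k, v in row.items() if k != "FX"}
    return lambda x: eval(code, {"__builtins__": {}}, {**names, "x": x})

rows = [
    {"A": 10, "B": 20, "FX": "A+B-x"},
    {"A": 20, "B": 10, "FX": "A-B+x"},
    {"A": 30, "B": 40, "FX": "A*B-x"},
]

roots = [bisect_root(make_function(row), -10000, 10000) for row in rows]
print([round(r, 3) for r in roots])  # approximately [30.0, -10.0, 1200.0]
```

With a DataFrame, the same list of dicts can be obtained from df.to_dict("records").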

How to find median and quantiles using Spark

How can I find median of an RDD of integers using a distributed method, IPython, and Spark? The RDD is approximately 700,000 elements and therefore too large to collect and find the median.
This question is similar to this question. However, the answer to the question is using Scala, which I do not know.
How can I calculate exact median with Apache Spark?
Using the thinking for the Scala answer, I am trying to write a similar answer in Python.
I know I first want to sort the RDD. I do not know how. I see the sortBy (Sorts this RDD by the given keyfunc) and sortByKey (Sorts this RDD, which is assumed to consist of (key, value) pairs.) methods. I think both use key value and my RDD only has integer elements.
First, I was thinking of doing myrdd.sortBy(lambda x: x)?
Next I will find the length of the rdd (rdd.count()).
Finally, I want to find the element or 2 elements at the center of the rdd. I need help with this method too.
I had an idea. Maybe I can index my RDD and then key = index and value = element. And then I can try to sort by value? I don't know if this is possible because there is only a sortByKey method.
Ongoing work
SPARK-30569 - Add DSL functions invoking percentile_approx
Spark 2.0+:
You can use approxQuantile method which implements Greenwald-Khanna algorithm:
Python:
df.approxQuantile("x", [0.5], 0.25)
Scala:
df.stat.approxQuantile("x", Array(0.5), 0.25)
where the last parameter is a relative error. The lower the number the more accurate results and more expensive computation.
Since Spark 2.2 (SPARK-14352) it supports estimation on multiple columns:
Python:
df.approxQuantile(["x", "y", "z"], [0.5], 0.25)
Scala:
df.approxQuantile(Array("x", "y", "z"), Array(0.5), 0.25)
The underlying methods can also be used in SQL aggregation (both global and grouped) using the approx_percentile function:
> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
> SELECT approx_percentile(10.0, 0.5, 100);
Spark < 2.0
As I've mentioned in the comments it is most likely not worth all the fuss. If data is relatively small like in your case then simply collect and compute median locally:
import numpy as np
rdd = sc.parallelize(np.random.randint(1000000, size=700000))
%time np.median(rdd.collect())
It takes around 0.01 second on my few years old computer and around 5.5MB of memory.
If the data is much larger, sorting will be the limiting factor, so instead of getting an exact value it is probably better to sample, collect, and compute locally. But if you really want to use Spark, something like this should do the trick (if I didn't mess anything up):
from numpy import floor
import time

def quantile(rdd, p, sample=None, seed=None):
    """Compute a quantile of order p ∈ [0, 1]
    :rdd a numeric rdd
    :p quantile (between 0 and 1)
    :sample fraction of an rdd to use. If not provided we use the whole dataset
    :seed random number generator seed to be used with sample
    """
    assert 0 <= p <= 1
    assert sample is None or 0 < sample <= 1
    seed = seed if seed is not None else int(time.time())
    rdd = rdd if sample is None else rdd.sample(False, sample, seed)
    # Python 3 has no tuple unpacking in lambda arguments, hence xi[1], xi[0]
    rddSortedWithIndex = (rdd.
        sortBy(lambda x: x).
        zipWithIndex().
        map(lambda xi: (xi[1], xi[0])).
        cache())
    n = rddSortedWithIndex.count()
    h = (n - 1) * p
    lower = int(floor(h))
    upper = min(lower + 1, n - 1)  # stay in range when p == 1
    rddX, rddXPlusOne = (
        rddSortedWithIndex.lookup(x)[0]
        for x in (lower, upper))
    return rddX + (h - floor(h)) * (rddXPlusOne - rddX)
And some tests:
np.median(rdd.collect()), quantile(rdd, 0.5)
## (500184.5, 500184.5)
np.percentile(rdd.collect(), 25), quantile(rdd, 0.25)
## (250506.75, 250506.75)
np.percentile(rdd.collect(), 75), quantile(rdd, 0.75)
## (750069.25, 750069.25)
Finally, let's define median:
from functools import partial
median = partial(quantile, p=0.5)
So far so good, but it takes 4.66 s in local mode without any network communication. There is probably a way to improve this, but why even bother?
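The interpolation rule used by quantile, position h = (n - 1) * p with a linear blend between the two neighbouring order statistics, can be checked locally without Spark. The sketch below is a NumPy-only restatement under that assumption (the function name and sample data are hypothetical); it is the same rule np.percentile applies by default.

```python
import numpy as np

def quantile_local(values, p):
    """Linear interpolation at position h = (n - 1) * p in the sorted data."""
    ordered = np.sort(np.asarray(values, dtype=float))
    n = len(ordered)
    h = (n - 1) * p
    lo = int(np.floor(h))
    hi = min(lo + 1, n - 1)  # clamp so that p = 1 stays in range
    return ordered[lo] + (h - lo) * (ordered[hi] - ordered[lo])

data = [7, 1, 5, 3]
print(quantile_local(data, 0.5))   # 4.0
print(quantile_local(data, 0.25))  # 2.5
```

For an even number of elements the median therefore falls halfway between the two middle values, which is where the approximate methods in this thread can differ from the exact result.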
Language independent (Hive UDAF):
If you use HiveContext you can also use Hive UDAFs. With integral values:
rdd.map(lambda x: (float(x), )).toDF(["x"]).registerTempTable("df")
sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df")
With continuous values:
sqlContext.sql("SELECT percentile(x, 0.5) FROM df")
In percentile_approx you can pass an additional argument which determines a number of records to use.
Here is the method I used using window functions (with pyspark 2.2.0).
from pyspark.sql import DataFrame
class median():
    """ Create median class with over method to pass partition """
    def __init__(self, df, col, name):
        assert col
        self.column = col
        self.df = df
        self.name = name
    def over(self, window):
        from pyspark.sql.functions import percent_rank, pow, first
        first_window = window.orderBy(self.column) # first, order by column we want to compute the median for
        df = self.df.withColumn("percent_rank", percent_rank().over(first_window)) # add percent_rank column, percent_rank = 0.5 corresponds to median
        second_window = window.orderBy(pow(df.percent_rank-0.5, 2)) # order by (percent_rank - 0.5)^2 ascending
        return df.withColumn(self.name, first(self.column).over(second_window)) # the first row of the window corresponds to median
def addMedian(self, col, median_name):
    """ Method to be added to spark native DataFrame class """
    return median(self, col, median_name)
# Add method to DataFrame class
DataFrame.addMedian = addMedian
Then call the addMedian method to calculate the median of col2:
from pyspark.sql import Window
median_window = Window.partitionBy("col1")
df = df.addMedian("col2", "median").over(median_window)
Finally you can group by if needed.
df.groupby("col1", "median")
Adding a solution if you want an RDD-only method and don't want to move to DataFrames.
This snippet can get you a percentile for an RDD of doubles.
If you input the percentile as 50, you should obtain your required median.
Let me know if there are any corner cases not accounted for.
/**
 * Gets the nth percentile entry for an RDD of doubles
 * @param inputScore : Input scores consisting of a RDD of doubles
 * @param percentile : The percentile cutoff required (between 0 to 100), e.g 90%ile of [1,4,5,9,19,23,44] = ~23.
 *                     It prefers the higher value when the desired quantile lies between two data points
 * @return : The number best representing the percentile in the Rdd of double
 */
def getRddPercentile(inputScore: RDD[Double], percentile: Double): Double = {
  val numEntries = inputScore.count().toDouble
  val retrievedEntry = (percentile * numEntries / 100.0 ).min(numEntries).max(0).toInt
  inputScore
    .sortBy { case (score) => score }
    .zipWithIndex()
    .filter { case (score, index) => index == retrievedEntry }
    .map { case (score, index) => score }
    .collect()(0)
}
There are two ways that can be used. One is the approxQuantile method and the other is the percentile_approx function. However, both might not give accurate results when there is an even number of records.
import pyspark.sql.functions as F
df.select(F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5).alias("MEDIAN")) # might not give proper results when there is an even number of records
I have written a function that takes a dataframe as input and returns a dataframe with the median computed over a partition. order_col is the column for which we want to calculate the median, and part_col is the level at which we want to calculate it:
from pyspark.sql import Window
import pyspark.sql.functions as F
def calculate_median(dataframe, part_col, order_col):
win = Window.partitionBy(*part_col).orderBy(order_col)
# count_row = dataframe.groupby(*part_col).distinct().count()
temp = dataframe.withColumn("rank", F.row_number().over(win))
temp = temp.withColumn(
temp = temp.withColumn(
F.col("count_row_part") %2 == 0,
temp = temp.withColumn(
(F.col("even_flag")==1) &
(F.col("rank") == F.col("mid_value"))|
((F.col("rank")-1) == F.col("mid_value")),
F.col("rank") == F.col("mid_value")+1,
return temp.filter(
F.col("avg_flag") == 1
part_col + ["avg_flag"]
For exact median computation you can use the following function and use it with PySpark DataFrame API:
from typing import Union
import pyspark.sql.functions as F
from pyspark.sql import Column

def median_exact(col: Union[Column, str]) -> Column:
    """
    For grouped aggregations, Spark provides a way via pyspark.sql.functions.percentile_approx("col", .5) function,
    since for large datasets, computing the median is computationally expensive.
    This function manually computes the median and should only be used for small to mid sized datasets / groupings.
    :param col: Column to compute the median for.
    :return: A pyspark `Column` containing the median calculation expression
    """
    list_expr = F.filter(F.collect_list(col), lambda x: x.isNotNull())
    sorted_list_expr = F.sort_array(list_expr)
    size_expr = F.size(sorted_list_expr)
    even_num_elements = (size_expr % 2) == 0
    odd_num_elements = ~even_num_elements
    return F.when(size_expr == 0, None).otherwise(
        F.when(odd_num_elements, sorted_list_expr[F.floor(size_expr / 2)]).otherwise(
            (
                sorted_list_expr[(size_expr / 2 - 1).cast("long")]
                + sorted_list_expr[(size_expr / 2).cast("long")]
            )
            / 2
        )
    )
Apply it like this:
output_df = input_spark_df.groupby("group").agg(
We can calculate the median and quantiles in Spark using the following code:
df.stat.approxQuantile(col, [quantiles], error)
For example, finding the median in the following dataframe [1,2,3,4,5]:
df.stat.approxQuantile(col, [0.5], 0)
The smaller the error, the more accurate the results.
From version 3.4+ (and also already in 3.3.1) the median function is directly available
import pyspark.sql.functions as f
I guess the respective documentation will be added if the version is finally released.
