I am working on a data frame that looks like this :
lat lon
id_zone
0 40.0795 4.338600
1 45.9990 4.829600
2 45.2729 2.882000
3 45.7336 4.850478
4 45.6981 5.043200
I'm trying to build a Haversine distance matrix. For each zone, I would like to calculate the distance between it and every other zone in the dataframe, so the diagonal should contain only zeros. Here is the Haversine function I use, but I can't get from it to the matrix.
def haversine(x):
    x.lon, x.lat, x.lon2, x.lat2 = map(radians, [x.lon, x.lat, x.lon2, x.lat2])
    # Haversine formula
    dlon = x.lon2 - x.lon
    dlat = x.lat2 - x.lat
    a = sin(dlat / 2) ** 2 + cos(x.lat) * cos(x.lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    km = 6367 * c
    return km
You can adapt the solution from the answer to "Pandas - Creating Difference Matrix from Data Frame".
Or in your specific case, where you have a DataFrame like this example:
lat lon
id_zone
0 40.0795 4.338600
1 45.9990 4.829600
2 45.2729 2.882000
3 45.7336 4.850478
4 45.6981 5.043200
And your function is defined as:
import numpy as np
import pandas as pd

def haversine(first, second):
    # convert decimal degrees to radians
    lat, lon, lat2, lon2 = map(np.radians, [first["lat"], first["lon"], second["lat"], second["lon"]])
    # haversine formula
    dlon = lon2 - lon
    dlat = lat2 - lat
    a = np.sin(dlat/2)**2 + np.cos(lat) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    r = 6371  # Radius of earth in kilometers. Use 3956 for miles
    return c * r
Where you pass the lat and lon of the first location and the second location.
You can then create a distance matrix using Numpy and then replace the zeros with the distance results from the haversine function:
# create a matrix for the distances between each pair of zones
distances = np.zeros((len(df), len(df)))
for i in range(len(df)):
    for j in range(len(df)):
        distances[i, j] = haversine(df.iloc[i], df.iloc[j])
pd.DataFrame(distances, index=df.index, columns=df.index)
Your output should be similar to this:
id_zone 0 1 2 3 4
id_zone
0 0.000000 659.422944 589.599339 630.083979 627.383858
1 659.422944 0.000000 171.597296 29.555376 37.325316
2 589.599339 171.597296 0.000000 161.731366 174.983855
3 630.083979 29.555376 161.731366 0.000000 15.474533
4 627.383858 37.325316 174.983855 15.474533 0.000000
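For larger frames, the double loop can be replaced with NumPy broadcasting. The following is only a minimal sketch, not part of the approach above: the helper name `haversine_matrix` is made up for illustration, and it assumes the columns are called `lat` and `lon` (in degrees) as in the example, with the same Earth radius of 6371 km.

```python
import numpy as np
import pandas as pd

def haversine_matrix(df, r=6371.0):
    """Pairwise haversine distances in km.

    Hypothetical helper; assumes 'lat' and 'lon' columns in decimal degrees.
    """
    lat = np.radians(df["lat"].to_numpy())
    lon = np.radians(df["lon"].to_numpy())
    # A column vector minus a row vector yields every pairwise difference
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = np.sin(dlat / 2) ** 2 + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2
    # Clip guards against tiny floating-point overshoot before arcsin
    c = 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))
    return pd.DataFrame(r * c, index=df.index, columns=df.index)

df = pd.DataFrame(
    {"lat": [40.0795, 45.9990, 45.2729, 45.7336, 45.6981],
     "lon": [4.338600, 4.829600, 2.882000, 4.850478, 5.043200]},
    index=pd.Index(range(5), name="id_zone"),
)
distances = haversine_matrix(df)
print(distances.round(3))
```

The whole matrix is computed in one pass, at the cost of holding an n x n array in memory, so this suits a few thousand zones rather than millions.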
I have a pandas dataframe that represents the GPS trajectory of a vehicle
d1 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9], 'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467, 4.924583, 4.922133, 4.921400, 4.920967], 'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282, 52.375950, 52.376301, 52.376232, 52.374481]}
df1 = pd.DataFrame(data=d1)
id longitude latitude
1 4.929783 52.372250
2 4.932333 52.370884
3 4.933950 52.371101
4 4.933900 52.372234
5 4.928467 52.375282
6 4.924583 52.375950
7 4.922133 52.376301
8 4.921400 52.376232
9 4.920967 52.374481
I already calculated the (haversine) distance in meters between consecutive GPS points as follows:
import numpy as np

def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m

df1['distance'] = haversine(df1['latitude'], df1['longitude'],
                            df1['latitude'].shift(), df1['longitude'].shift())
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
3 4.933950 52.371101 112.398101
4 4.933900 52.372234 126.029572
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
8 4.921400 52.376232 50.345227
9 4.920967 52.374481 196.908503
Now I would like to create a function that:
- removes the second (i.e. the following) point if the distance between consecutive GPS points is less than 150 meters;
- always keeps the last (and the first) GPS point, regardless of the distance to the previously kept point.
Meaning this should be the output:
id longitude latitude distance
1 4.929783 52.372250 NaN
2 4.932333 52.370884 230.305288
5 4.928467 52.375282 500.896578
6 4.924583 52.375950 273.918990
7 4.922133 52.376301 170.828592
9 4.920967 52.374481 196.908503
What is the best way to achieve this in Python?
NOTE: This doesn't account for maximum distance... that would require some look ahead and optimization.
I would iterate through and pass back just the index values of the rows you'd like to keep. Use those index values in a loc call.
Distance
Use whatever metric you want. I used OP's haversine distance.
def haversine(lat1, lon1, lat2, lon2, earth_radius=6371):
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    km = earth_radius * 2 * np.arcsin(np.sqrt(a))
    m = km * 1000
    return m

def dis(t0, t1):
    lat0 = t0.latitude
    lon0 = t0.longitude
    lat1 = t1.latitude
    lon1 = t1.longitude
    return haversine(lat0, lon0, lat1, lon1)
The Loop
def f(d, threshold=50):
    itups = d.itertuples()
    last = next(itups)
    indices = [last.Index]
    distances = [0]
    for tup in itups:
        # distance is measured from the last point that was kept
        distance = dis(tup, last)
        if distance > threshold:
            indices.append(tup.Index)
            distances.append(distance)
            last = tup
    return indices, distances
The Results
idx, distances = f(df1, 150)
df1.loc[idx].assign(distance=distances)
id longitude latitude distance
0 1 4.929783 52.372250 0.000000
1 2 4.932333 52.370884 230.305288
3 4 4.933900 52.372234 183.986479
4 5 4.928467 52.375282 500.896578
5 6 4.924583 52.375950 273.918990
6 7 4.922133 52.376301 170.828592
8 9 4.920967 52.374481 217.302775
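Note that the loop keeps the final row only when it happens to be farther than the threshold from the previously kept point. If the last point must always survive, one possible variant is sketched below. The names `thin_track` and `haversine_m` are made up for illustration, and the sketch assumes a non-empty frame with `latitude` and `longitude` columns as in the question.

```python
import numpy as np
import pandas as pd

def haversine_m(lat1, lon1, lat2, lon2, earth_radius=6371):
    """Haversine distance in meters between two points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return earth_radius * 2 * np.arcsin(np.sqrt(a)) * 1000

def thin_track(d, threshold=150):
    """Hypothetical variant of f(): same greedy filter, but the last row is always kept."""
    rows = list(d.itertuples())
    last = rows[0]
    indices, distances = [last.Index], [0.0]
    for tup in rows[1:]:
        distance = haversine_m(tup.latitude, tup.longitude, last.latitude, last.longitude)
        if distance > threshold:
            indices.append(tup.Index)
            distances.append(distance)
            last = tup
    final = rows[-1]
    if indices[-1] != final.Index:
        # The final point failed the threshold test: add it back,
        # measured from the last point that was kept
        indices.append(final.Index)
        distances.append(haversine_m(final.latitude, final.longitude, last.latitude, last.longitude))
    return d.loc[indices].assign(distance=distances)

d1 = {'id': [1, 2, 3, 4, 5, 6, 7, 8, 9],
      'longitude': [4.929783, 4.932333, 4.933950, 4.933900, 4.928467, 4.924583, 4.922133, 4.921400, 4.920967],
      'latitude': [52.372250, 52.370884, 52.371101, 52.372234, 52.375282, 52.375950, 52.376301, 52.376232, 52.374481]}
df1 = pd.DataFrame(data=d1)
kept = thin_track(df1, 150)
print(kept)
```

With the sample data the last point already passes the 150 m test, so the result matches the table above; the extra branch only matters when the track ends with a short hop.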
I have the following two dataframes. Call this df1
City Latitude Longitude
0 NewYorkCity 40.7128 74.0060
1 Chicago 41.8781 87.6298
2 LA 34.0522 118.2437
3 Paris 48.8566 2.3522
and call this one df2
Place Latitude Longitude
0 75631 26.78436 -80.103
1 89210 26.75347 -80.0192
I want to know how I can calculate the distance between place and all cities listed. So it should look something like this.
Place Latitude Longitude NewYorkCity Chicago Paris
0 75631 26.78436 -80.103 some number ..... ....
1 89210 26.75347 -80.0192 some number .... ....
I'm reading through this particular post and attempting to adapt it: "Pandas Latitude-Longitude to distance between successive rows"
def haversine(lat1, lon1, lat2, lon2, to_radians=True, earth_radius=6371):
    if to_radians:
        lat1, lon1, lat2, lon2 = np.radians([lat1, lon1, lat2, lon2])
    a = np.sin((lat2-lat1)/2.0)**2 + \
        np.cos(lat1) * np.cos(lat2) * np.sin((lon2-lon1)/2.0)**2
    return earth_radius * 2 * np.arcsin(np.sqrt(a))

df['dist'] = haversine(df1.Latitude, df.Longitude, df2.Latitude, df2.Longitude)
I know this looks wrong. Do I need a for loop to go through each row of df1?
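from scipy.spatial import distance

# Note: 'euclidean' on raw latitude/longitude gives a distance in degrees, not in km or miles
a = df.iloc[:, 1:].values   # array of the cities' lat/long
b = df2.iloc[:, 1:].values  # array of the places' lat/long
df.join(pd.DataFrame(distance.cdist(a, b, 'euclidean')).rename(columns={0: 75631, 1: 89210}))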
City Latitude Longitude 75631 89210
0 NewYorkCity 40.7128 74.0060 154.737149 154.656475
1 Chicago 41.8781 87.6298 168.410550 168.329860
2 LA 34.0522 118.2437 198.479810 198.397200
3 Paris 48.8566 2.3522 85.358326 85.285379
Alternatively, the long way:
df2.rename(columns={'Latitude': 'Lat', 'Longitude': 'Long'}, inplace=True)  # rename lat/long in df2
g = pd.concat([df, df2.iloc[:1]], axis=1).ffill()  # attach the 1st place to every city
h = pd.concat([df, df2.iloc[1:]], axis=1).ffill().bfill()  # attach the 2nd place to every city
l = pd.concat([g, h])  # stacked dataframe (DataFrame.append was removed in pandas 2.0)
# Compute the distance
u = l.Latitude.sub(l.Lat)
v = l.Longitude.sub(l.Long)
l['dist'] = np.sqrt(u**2 + v**2)
print(l)
City Latitude Longitude Place Lat Long dist
0 NewYorkCity 40.7128 74.0060 75631.0 26.78436 -80.1030 154.737149
1 Chicago 41.8781 87.6298 75631.0 26.78436 -80.1030 168.410550
2 LA 34.0522 118.2437 75631.0 26.78436 -80.1030 198.479810
3 Paris 48.8566 2.3522 75631.0 26.78436 -80.1030 85.358326
0 NewYorkCity 40.7128 74.0060 89210.0 26.75347 -80.0192 154.656475
1 Chicago 41.8781 87.6298 89210.0 26.75347 -80.0192 168.329860
2 LA 34.0522 118.2437 89210.0 26.75347 -80.0192 198.397200
3 Paris 48.8566 2.3522 89210.0 26.75347 -80.0192 85.285379
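The values above are Euclidean distances between raw latitude/longitude pairs, so they are in degrees rather than kilometers. A haversine version of the same cross table can be sketched with broadcasting, as below. The helper name `haversine_cross` is made up for illustration, the radius of 6371 km is taken from the question's function, and the coordinates are copied as posted (including the positive longitudes given for the US cities).

```python
import numpy as np
import pandas as pd

def haversine_cross(lat1, lon1, lat2, lon2, earth_radius=6371.0):
    """Distances in km from every point in set 1 (rows) to every point in set 2 (columns)."""
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    a = np.sin((lat2 - lat1) / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2.0) ** 2
    return earth_radius * 2 * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))

cities = pd.DataFrame({"City": ["NewYorkCity", "Chicago", "LA", "Paris"],
                       "Latitude": [40.7128, 41.8781, 34.0522, 48.8566],
                       "Longitude": [74.0060, 87.6298, 118.2437, 2.3522]})
places = pd.DataFrame({"Place": [75631, 89210],
                       "Latitude": [26.78436, 26.75347],
                       "Longitude": [-80.103, -80.0192]})

# Rows are places, columns are cities
dist = haversine_cross(places.Latitude, places.Longitude, cities.Latitude, cities.Longitude)
result = places.join(pd.DataFrame(dist, columns=cities["City"].tolist(), index=places.index))
print(result)
```

This produces one column per city next to each place, which is the layout asked for in the question, without an explicit loop.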
The following code worked for me:
import math
import numpy as np

a = list(range(19))
for i in a:
    Lat1 = df1.iloc[i, 2]  # works down the 3rd column
    Lon1 = df1.iloc[i, 3]  # works down the 4th column
    Lat2 = df2['Latitude']
    Lon2 = df2['Longitude']
    # df1.iloc[i, 0] works down the 1st column to grab the names,
    # which then become the new column names in df2
    df2[df1.iloc[i, 0]] = 3958.756 * np.arccos(
        np.cos(math.radians(90 - Lat1)) * np.cos(np.radians(90 - Lat2))
        + np.sin(math.radians(90 - Lat1)) * np.sin(np.radians(90 - Lat2))
        * np.cos(np.radians(Lon1 - Lon2)))
Note that this gives the straight-line (great-circle) distance in miles between each pair of locations. It does not account for the twists and turns of an actual route.
I need to produce 5000 kg of steel by mixing 7 alloys.
I need to minimize the cost, so I need to pick the best combination of alloys.
The result must respect the main characteristics of the steel; for example, the carbon level must be between 2% and 3%, no more, no less.
An Excel linear solver model for this already exists; it comes from a professional book.
I'm now trying to translate it into PuLP code.
My problem is: how do I create the carbon, copper, and manganese constraints? There are two tables, so I don't know how to combine them.
Everything is in percent, and I don't know how to handle that. My result is currently wrong; I left my bad constraints in for information. It seems that I need to divide by 5000 at some point, but how should I do it?
Let me try to explain what I cannot understand:
I need the 5000 kg of steel to contain 0.60% copper, but my copper alloys contain 90% and 96% copper.
Do you see what I mean, and why it is so difficult to describe my constraints?
"""
Mining and metals
We make steel with raw materials, we want to reduce the cost of producing this steel
to make more money, but still respecting the minimum characteristics of quality steel
"""
# Minimize the cost of metal alloys.
# Characteristics of the steel to be made
"""Element      %Minimum   %Max    %Real (it is a var)
Carbon       2          3       2.26
Copper       0.4        0.6     0.60
Manganese    1.2        1.65    1.20
"""
# Characteristics, stocks and purchase price of alloys
"""
Alloy             C%      Cu%      Mn%     Stocks kg   Price € / kg
Iron alloy        2.50    0.00     1.30    4000        1.20
Iron alloy        3.00    0.00     0.80    3000        1.50
Iron alloy        0.00    0.30     0.00    6000        0.90
Copper alloy      0.00    90.00    0.00    5000        1.30
Copper alloy      0.00    96.00    4.00    2000        1.45
Aluminum alloy    0.00    0.40     1.20    3000        1.20
Aluminum alloy    0.00    0.60     0.00    2500        1.00
"""
# Import the PuLP lib
from pulp import *
# Create the problem variable
prob = LpProblem ("MinimiserLpAlliage", LpMinimize)
# The 7 vars have a zero limit
x1 = LpVariable ("Iron alloy 1", 0)
x2 = LpVariable ("Iron alloy 2", 0)
x3 = LpVariable ("Iron alloy 3", 0)
x4 = LpVariable ("Copper alloy 1", 0)
x5 = LpVariable ("Copper alloy 2", 0)
x6 = LpVariable ("Aluminum alloy 1", 0)
x7 = LpVariable ("Aluminum alloy 2", 0)
# The objective function is to minimize the total cost of the alloys in EUROS for a given quantity in KGS
prob += 1.20 * x1 + 1.50 * x2 + 0.90 * x3 + 1.30 * x4 + 1.45 * x5 + 1.20 * x6 + 1.00 * x7, "AlliageCost"
# Quantity constraint in KGS.
prob += x1 + x2 + x3 + x4 + x5 + x6 + x7 == 5000, "RequestedQuantity"
# MIN constraints of % carbon, by alloy // ITS NOT WHAT I NEED
prob += x1 >= 2.5, "MinCarboneRequirement1"
prob += x2 >= 3, "MinCarboneRequirement2"
prob += x3 >= 0, "MinCarboneRequirement3"
prob += x4 >= 0, "MinCarboneRequirement4"
prob += x5 >= 0, "MinCarboneRequirement5"
prob += x6 >= 0, "MinCarboneRequirement6"
prob += x7 >= 0, "MinCarboneRequirement7"
# MIN constraints of % copper, by alloy // ITS WRONG ITS NOT WHAT I NEED
prob += x1 >= 0, "MinCuivreRequirement1"
prob += x2 >= 0, "MinCuivreRequirement2"
prob += x3 >= 0.3, "MinCuivreRequirement3"
prob += x4 >= 90, "MinCuivreRequirement4"
prob += x5 >= 96, "MinCuivreRequirement5"
prob += x6 >= 0.4, "MinCuivreRequirement6"
prob += x7 >= 0.6, "MinCuivreRequirement7"
# MIN constraints of % manganese, by alloy // ITS WRONG ITS NOT WHAT I NEED
prob += x1 >= 1.3, "MinManganeseRequirement1"
prob += x2 >= 0.8, "MinManganeseRequirement2"
prob += x3 >= 0, "MinManganeseRequirement3"
prob += x4 >= 0, "MinManganeseRequirement4"
prob += x5 >= 4, "MinManganeseRequirement5"
prob += x6 >= 1.2, "MinManganeseRequirement6"
prob += x7 >= 0, "MinManganeseRequirement7"
# MAX constraints of % manganese, by alloy // ITS WRONG ITS NOT WHAT I NEED
prob += x1 <= 1.3, "MaxManganeseRequirement1"
prob += x2 <= 0.8, "MaxManganeseRequirement2"
prob += x3 <= 0, "MaxManganeseRequirement3"
prob += x4 <= 0, "MaxManganeseRequirement4"
prob += x5 <= 4, "MaxManganeseRequirement5"
prob += x6 <= 1.2, "MaxManganeseRequirement6"
prob += x7 <= 0, "MaxManganeseRequirement7"
# 5. MAX constraints from available stock, by alloy // I THINK IT IS OK
prob += x1 <= 4000, "MaxStock"
prob += x2 <= 3000, "MaxStock1"
prob += x3 <= 6000, "MaxStock2"
prob += x4 <= 5000, "MaxStock3"
prob += x5 <= 2000, "MaxStock4"
prob += x6 <= 3000, "MaxStock5"
prob += x7 <= 2500, "MaxStock6"
# The problem data is written to an .lp file
prob.writeLP ( "WhiskasModel.lp")
# We use the solver
prob.solve ()
# The status of the solution
print ("Status:", LpStatus [prob.status])
# Display the optimal value of each variable
for v in prob.variables ():
    print (v.name, "=", v.varValue)
# The result of the objective function is here
print ("Total", value (prob.objective))
This is the output, but of course it is wrong, because I don't know how to write the constraints:
Status: Optimal
Aluminum_alloy_1 = 1.2
Aluminum_alloy_2 = 0.6
Copper_alloy_1 = 90.0
Copper_alloy_2 = 96.0
Iron_alloy_1 = 2.5
Iron_alloy_2 = 3.0
Iron_alloy_3 = 4806.7
Total 4591.7699999999995
EDIT: Hello!
This is version 2 of my code, improved. Sorry, it is partly in French, but I bet you can see what I mean. It still doesn't work, though it is closer to what I need:
"""
Mining and metals
We make steel from raw materials and want to reduce the cost of producing this steel
to earn more money while still respecting the minimum characteristics of quality steel
"""
# Characteristics of the steel to be made
""" Element      % Minimum   % Max
Carbon       2           3
Copper       0.4         0.6
Manganese    1.2         1.65
"""
# Characteristics, stocks and purchase price of alloys per kilo
"""
Alloy                  C %     Cu %     Mn %    Stock kg   Price €/kg
Alliage de fer 1       2.50    0.00     1.30    4000       1.20
Alliage de fer 2       3.00    0.00     0.80    3000       1.50
Alliage de fer 3       0.00    0.30     0.00    6000       0.90
Alliage de cuivre 1    0.00    90.00    0.00    5000       1.30
Alliage de cuivre 2    0.00    96.00    4.00    2000       1.45
Alliage d'alu 1        0.00    0.40     1.20    3000       1.20
Alliage d'alu 2        0.00    0.60     0.00    2500       1.00
"""
# Import the PuLP lib
from pulp import *
# Create the problem variable
prob = LpProblem("MinimiserLpAlliage",LpMinimize)
# The 7 vars have a zero limit, these decision variables are expressed in KILOS
x1 = LpVariable("Alliage de fer 1",0)
x2 = LpVariable("Alliage de fer 2",0)
x3 = LpVariable("Alliage de fer 3",0)
x4 = LpVariable("Alliage de cuivre 1",0)
x5 = LpVariable("Alliage de cuivre 2",0)
x6 = LpVariable("Alliage d'alu 1",0)
x7 = LpVariable("Alliage d'alu 2",0)
# The objective function is to minimize the total cost of the alloys in EUROS
prob += 1.20 * x1 + 1.50 * x2 + 0.90 * x3 + 1.30 * x4 + 1.45 * x5 + 1.20 * x6 + 1.00 * x7, "CoutAlliages"
# Quantity constraint in KGS.
prob += x1 + x2 + x3 + x4 + x5 + x6 + x7 == 5000, "QuantitéDemandée"
# Carbon constraint.
prob += (2.50 * x1 + 3.00 * x2 + x3 + x4 + x5 + x6 + x7 ) / 5000 <= 3,"carBmax"
prob += (2.50 * x1 + 3.00 * x2 + x3 + x4 + x5 + x6 + x7 ) / 5000 >= 2,"carBmin"
# Copper constraint.
prob += (x1 + x2 + 0.30 * x3 + 90 * x4 + 96 * x5 + 0.40 * x6 + 0.60 * x7) / 5000 <= 0.6,"cuBmax"
prob += (x1 + x2 + 0.30 * x3 + 90 * x4 + 96 * x5 + 0.40 * x6 + 0.60 * x7) / 5000 >= 0.4,"cuBmin"
# Manganese constraint.
prob += (1.30 * x1 + 0.80 * x2 + x3 + x4 + 4 * x5 + 1.20 * x6 + x7 ) / 5000 <= 1.65,"mgBmax"
prob += (1.30 * x1 + 0.80 * x2 + x3 + x4 + 4 * x5 + 1.20 * x6 + x7 ) / 5000 >= 1.2,"mgBmin"
# 5. MAX constraints from available stock, by alloy
prob += x1 <= 4000 , "MaxStock"
prob += x2 <= 3000 , "MaxStock1"
prob += x3 <= 6000 , "MaxStock2"
prob += x4 <= 5000 , "MaxStock3"
prob += x5 <= 2000 , "MaxStock4"
prob += x6 <= 3000 , "MaxStock5"
prob += x7 <= 2500 , "MaxStock6"
# The problem data is written to an .lp file
prob.writeLP("acier.lp")
# Run the solver
prob.solve()
# The status of the solution
print ("Status:", LpStatus[prob.status])
# Display the optimal value of each variable
for v in prob.variables():
    print (v.name, "=", v.varValue)
# The result of the objective function is here
print ("Total payable in euros", value(prob.objective))
""" Status: Infeasible
Alliage_d'alu_1 = 0.0
Alliage_d'alu_2 = 0.0
Alliage_de_cuivre_1 = 0.0
Alliage_de_cuivre_2 = 0.0
Alliage_de_fer_1 = 0.0
Alliage_de_fer_2 = 0.0
Alliage_de_fer_3 = 10000.0
Total payable in euros 9000.0 """
The book says the result with the Excel solver is:
iron_1 : 4000 kgs
iron_2 : 0 kgs
iron_3 : 397.76kgs
cu_1 : 0 kgs
cu_2 : 27.61kgs
al_1 : 574.62kgs
al_2 : 0kgs
Cost in euros 5887.57
The steel contains 2% carbon, 0.6% copper, and 1.2% manganese.
Part of your problem is how you are understanding and applying percentages. My recommendation is to convert percentages [0-100] to fractions [0-1.0] as early as possible.
In Excel, when a cell says 50%, the numeric value of the cell is actually 0.5. Working with percentages this way means you don't have to keep dividing by 100, and you can multiply one percentage by another and it all just works.
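To make the idea concrete, here is a small sketch in plain Python that checks the book's solution quoted in the question. The content of an element in the blend is the kilograms of that element divided by the kilograms of steel. The dictionary and function names are made up for illustration; the numbers come from the question's tables.

```python
# Fractions (0-1), not percentages: columns are C, Cu, Mn
composition = {
    "iron_1": (0.025, 0.000, 0.013),
    "iron_3": (0.000, 0.003, 0.000),
    "cu_2":   (0.000, 0.960, 0.040),
    "al_1":   (0.000, 0.004, 0.012),
}
# Quantities in kg from the book's solution (alloys not listed are at 0 kg)
kg = {"iron_1": 4000.0, "iron_3": 397.76, "cu_2": 27.61, "al_1": 574.62}

total_kg = sum(kg.values())

def content(element_index):
    """Mass fraction of one element in the blend: kg of element / kg of steel."""
    element_kg = sum(kg[name] * composition[name][element_index] for name in kg)
    return element_kg / total_kg

carbon, copper, manganese = content(0), content(1), content(2)
print(f"total = {total_kg:.2f} kg")
print(f"C = {carbon:.4%}, Cu = {copper:.4%}, Mn = {manganese:.4%}")
```

Because the total is fixed at 5000 kg, a bound such as "copper content at most 0.006" can be multiplied through by 5000, giving the linear form sum(fraction_i * x_i) <= 0.006 * 5000 that an LP solver accepts.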
The code below does what you want:
"""
Mining and metals
We make steel with raw materials, we want to reduce the cost of producing this steel
to make more money, but still respecting the minimum characteristics of quality steel
"""
# Minimize the cost of metal alloys.
# Characteristics of the steel to be made
"""Element %Minimum %Max %Real (it is a var)
Carbon 2 3 2.26
Copper 0.4 0.6 0.60
Manganese 1.2 1.65 1.20
"""
# Characteristics, stocks and purchase price of alloys
"""
Alloy C% Cu% Mn% Stocks kg Price € / kg
Iron alloy 2.50 0.00 1.30 4000 1.20
Iron alloy 3.00 0.00 0.80 3000 1.50
Iron alloy 0.00 0.30 0.00 6000 0.90
Copper alloy 0.00 90.00 0.00 5000 1.30
Copper alloy 0.00 96.00 4.00 2000 1.45
Aluminum alloy 0.00 0.40 1.20 3000 1.20
Aluminum alloy 0.00 0.60 0.00 2500 1.00
"""
# Import the PuLP lib
from pulp import *
# Create the problem variable
prob = LpProblem ("MinimiserLpAlliage", LpMinimize)
# Problem Data
input_mats = ["iron_1", "iron_2", "iron_3",
              "cu_1", "cu_2",
              "al_1", "al_2"]
input_costs = {"iron_1": 1.20, "iron_2": 1.50, "iron_3": 0.90,
               "cu_1": 1.30, "cu_2": 1.45,
               "al_1": 1.20, "al_2": 1.00}
#                               C%     Cu%    Mn%
input_composition = {"iron_1": [0.025, 0.000, 0.013],
                     "iron_2": [0.030, 0.000, 0.008],
                     "iron_3": [0.000, 0.003, 0.000],
                     "cu_1":   [0.000, 0.900, 0.000],
                     "cu_2":   [0.000, 0.960, 0.040],
                     "al_1":   [0.000, 0.004, 0.012],
                     "al_2":   [0.000, 0.006, 0.000]}
input_stock = {"iron_1": 4000, "iron_2": 3000, "iron_3": 6000,
               "cu_1": 5000, "cu_2": 2000,
               "al_1": 3000, "al_2": 2500}
request_quantity = 5000
Carbon_min = 0.02
Carbon_max = 0.03
Cu_min = 0.004
Cu_max = 0.006
Mn_min = 0.012
Mn_max = 0.0165
# Problem variables - amount in kg of each input
x = LpVariable.dicts("input_mat", input_mats, 0)
# The objective function is to minimize the total cost of the alloys in EUROS for a given quantity in KGS
prob += lpSum([input_costs[i]*x[i] for i in input_mats]), "AlliageCost"
# Quantity constraint in KGS.
prob += lpSum([x[i] for i in input_mats]) == request_quantity, "RequestedQuantity"
# MIN/MAX constraint of carbon in resultant steel
prob += lpSum([x[i]*input_composition[i][0] for i in input_mats]) >= Carbon_min*request_quantity, "MinCarbon"
prob += lpSum([x[i]*input_composition[i][0] for i in input_mats]) <= Carbon_max*request_quantity, "MaxCarbon"
# MIN/MAX constraints of copper in resultant steel
prob += lpSum([x[i]*input_composition[i][1] for i in input_mats]) >= Cu_min*request_quantity, "MinCu"
prob += lpSum([x[i]*input_composition[i][1] for i in input_mats]) <= Cu_max*request_quantity, "MaxCu"
# MIN/MAX constraints of manganese in resultant steel
prob += lpSum([x[i]*input_composition[i][2] for i in input_mats]) >= Mn_min*request_quantity, "MinMn"
prob += lpSum([x[i]*input_composition[i][2] for i in input_mats]) <= Mn_max*request_quantity, "MaxMn"
# MAX constraints of available stock
for i in input_mats:
    prob += x[i] <= input_stock[i], ("MaxStock_" + i)
# Solve the problem
prob.solve()
# The status of the solution
print ("Status:", LpStatus [prob.status])
# Display the optimal value of each variable
for v in prob.variables ():
    print (v.name, "=", v.varValue)
# Display mat'l compositions
Carbon_value = sum([x[i].varValue*input_composition[i][0] for i in input_mats])/request_quantity
Cu_value = sum([x[i].varValue*input_composition[i][1] for i in input_mats])/request_quantity
Mn_value = sum([x[i].varValue*input_composition[i][2] for i in input_mats])/request_quantity
print ("Carbon content: " + str(Carbon_value))
print ("Copper content: " + str(Cu_value))
print ("Manganese content: " + str(Mn_value))
# The result of the objective function is here
print ("Total", value (prob.objective))
From which I get:
Status: Optimal
input_mat_al_1 = 574.62426
input_mat_al_2 = 0.0
input_mat_cu_1 = 0.0
input_mat_cu_2 = 27.612723
input_mat_iron_1 = 4000.0
input_mat_iron_2 = 0.0
input_mat_iron_3 = 397.76302
Carbon content: 0.02
Copper content: 0.006000000036
Manganese content: 0.012000000008
Total 5887.57427835
I have a dataframe with >2.7MM coordinates, and a separate list of ~2,000 coordinates. I'm trying to return the minimum distance between the coordinates in each individual row compared to every coordinate in the list. The following code works on a small scale (dataframe with 200 rows), but when calculating over 2.7MM rows, it seemingly runs forever.
from haversine import haversine
df
Latitude Longitude
39.989 -89.980
39.923 -89.901
39.990 -89.987
39.884 -89.943
39.030 -89.931
end_coords_list = [(41.342,-90.423),(40.349,-91.394),(38.928,-89.323)]
for row in df.itertuples():
    def min_distance(row):
        beg_coord = (row.Latitude, row.Longitude)
        return min(haversine(beg_coord, end_coord) for end_coord in end_coords_list)
    df['Min_Distance'] = df.apply(min_distance, axis=1)
I know the issue lies in the sheer number of calculations that are happening (2.7MM * 2,000 = ~5.4BN), and the fact that running this many loops is incredibly inefficient.
Based on my research, it seems like a vectorized NumPy function might be a better approach, but I'm new to Python and NumPy so I'm not quite sure how to implement this in this particular situation.
Ideal Output:
df
Latitude Longitude Min_Distance
39.989 -89.980 3.7
39.923 -89.901 4.1
39.990 -89.987 4.2
39.884 -89.943 5.9
39.030 -89.931 3.1
Thanks in advance!
The haversine function is, in essence:
# convert all latitudes/longitudes from decimal degrees to radians
lat1, lng1, lat2, lng2 = map(radians, (lat1, lng1, lat2, lng2))
# calculate haversine
lat = lat2 - lat1
lng = lng2 - lng1
d = sin(lat * 0.5) ** 2 + cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2
h = 2 * AVG_EARTH_RADIUS * asin(sqrt(d))
Here's a vectorized method leveraging the powerful NumPy broadcasting and NumPy ufuncs to replace those math-module funcs so that we would operate on entire arrays in one go -
# Get array data; convert to radians to simulate 'map(radians,...)' part
coords_arr = np.deg2rad(coords_list)
a = np.deg2rad(df.values)
# Get the differences
lat = coords_arr[:,0] - a[:,0,None]
lng = coords_arr[:,1] - a[:,1,None]
# Compute the "cos(lat1) * cos(lat2) * sin(lng * 0.5) ** 2" part.
# Add into "sin(lat * 0.5) ** 2" part.
add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
d = np.sin(lat * 0.5) ** 2 + add0
# Get h and assign into dataframe
h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
df['Min_Distance'] = h.min(1)
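One caveat at the scale mentioned in the question: with about 2.7MM rows and 2,000 target points, each full broadcast intermediate would hold roughly 5.4 billion float64 values (around 43 GB), which may not fit in memory. A possible workaround, sketched below, is to apply the same broadcasting to one block of rows at a time. The function name `min_haversine_chunked`, the `chunk_size` parameter, and the radius value of 6371 km are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

AVG_EARTH_RADIUS = 6371.0  # km; assumed value for the mean Earth radius

def min_haversine_chunked(df, coords_list, chunk_size=50_000):
    """Minimum haversine distance (km) from each row of df to any point in coords_list."""
    coords_arr = np.deg2rad(np.asarray(coords_list, dtype=float))
    a = np.deg2rad(df[["Latitude", "Longitude"]].to_numpy(dtype=float))
    out = np.empty(len(a))
    for start in range(0, len(a), chunk_size):
        block = a[start:start + chunk_size]
        # Only a (chunk_size x len(coords_list)) matrix is alive at any time
        lat = coords_arr[:, 0] - block[:, 0, None]
        lng = coords_arr[:, 1] - block[:, 1, None]
        d = (np.sin(lat * 0.5) ** 2
             + np.cos(block[:, 0, None]) * np.cos(coords_arr[:, 0]) * np.sin(lng * 0.5) ** 2)
        h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
        out[start:start + chunk_size] = h.min(axis=1)
    return out

df = pd.DataFrame({"Latitude": [39.989, 39.923, 39.990, 39.884, 39.030],
                   "Longitude": [-89.980, -89.901, -89.987, -89.943, -89.931]})
coords_list = [(41.342, -90.423), (40.349, -91.394), (38.928, -89.323)]
# A tiny chunk size is used here only to exercise the chunking logic
df["Min_Distance"] = min_haversine_chunked(df, coords_list, chunk_size=2)
print(df)
```

The chunk size trades memory for loop overhead; a value in the tens of thousands keeps each intermediate array at a few hundred megabytes for 2,000 target points.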
For a further performance boost, we can use the numexpr module to evaluate the transcendental functions.
Runtime test and verification
Approaches -
def loopy_app(df, coords_list):
    for row in df.itertuples():
        df['Min_Distance1'] = df.apply(min_distance, axis=1)

def vectorized_app(df, coords_list):
    coords_arr = np.deg2rad(coords_list)
    a = np.deg2rad(df.values)
    lat = coords_arr[:,0] - a[:,0,None]
    lng = coords_arr[:,1] - a[:,1,None]
    add0 = np.cos(a[:,0,None])*np.cos(coords_arr[:,0])* np.sin(lng * 0.5) ** 2
    d = np.sin(lat * 0.5) ** 2 + add0
    h = 2 * AVG_EARTH_RADIUS * np.arcsin(np.sqrt(d))
    df['Min_Distance2'] = h.min(1)
Verification -
In [158]: df
Out[158]:
Latitude Longitude
0 39.989 -89.980
1 39.923 -89.901
2 39.990 -89.987
3 39.884 -89.943
4 39.030 -89.931
In [159]: loopy_app(df, coords_list)
In [160]: vectorized_app(df, coords_list)
In [161]: df
Out[161]:
Latitude Longitude Min_Distance1 Min_Distance2
0 39.989 -89.980 126.637607 126.637607
1 39.923 -89.901 121.266241 121.266241
2 39.990 -89.987 126.037388 126.037388
3 39.884 -89.943 118.901195 118.901195
4 39.030 -89.931 53.765506 53.765506
Timings -
In [163]: df
Out[163]:
Latitude Longitude
0 39.989 -89.980
1 39.923 -89.901
2 39.990 -89.987
3 39.884 -89.943
4 39.030 -89.931
In [164]: %timeit loopy_app(df, coords_list)
100 loops, best of 3: 2.41 ms per loop
In [165]: %timeit vectorized_app(df, coords_list)
10000 loops, best of 3: 96.8 µs per loop