How do I solve a Travelling Salesman Problem in Python? I did not find any library; there should be a way using scipy's optimization functions or other libraries.
My hacky-extremely-lazy-pythonic brute-forcing solution is:
from itertools import permutations
tsp_solution = min((sum(Dist[i] for i in zip(per, per[1:])), n, per) for n, per in enumerate(permutations(range(Dist.shape[0]))))[2]
where Dist (numpy.array) is the distance matrix.
If Dist is too big this will take forever.
Suggestions?
The scipy.optimize functions are not constructed to allow straightforward adaptation to the traveling salesman problem (TSP). For a simple solution, I recommend the 2-opt algorithm, which is a well-accepted algorithm for solving the TSP and relatively straightforward to implement. Here is my implementation of the algorithm:
import numpy as np
# Calculate the Euclidean distance in n-space of the route r traversing cities c, ending at the path start.
path_distance = lambda r,c: np.sum([np.linalg.norm(c[r[p]]-c[r[p-1]]) for p in range(len(r))])
# Reverse the order of all elements from element i to element k in array r.
two_opt_swap = lambda r,i,k: np.concatenate((r[0:i],r[k:-len(r)+i-1:-1],r[k+1:len(r)]))
def two_opt(cities, improvement_threshold):  # 2-opt Algorithm adapted from https://en.wikipedia.org/wiki/2-opt
    route = np.arange(cities.shape[0])  # Make an array of row numbers corresponding to cities.
    improvement_factor = 1  # Initialize the improvement factor.
    best_distance = path_distance(route, cities)  # Calculate the distance of the initial path.
    while improvement_factor > improvement_threshold:  # If the route is still improving, keep going!
        distance_to_beat = best_distance  # Record the distance at the beginning of the loop.
        for swap_first in range(1, len(route) - 2):  # From each city except the first and last,
            for swap_last in range(swap_first + 1, len(route)):  # to each of the cities following,
                new_route = two_opt_swap(route, swap_first, swap_last)  # try reversing the order of these cities
                new_distance = path_distance(new_route, cities)  # and check the total distance with this modification.
                if new_distance < best_distance:  # If the path distance is an improvement,
                    route = new_route  # make this the accepted best route
                    best_distance = new_distance  # and update the distance corresponding to this route.
        improvement_factor = 1 - best_distance / distance_to_beat  # Calculate how much the route has improved.
    return route  # When the route is no longer improving substantially, stop searching and return the route.
Here is an example of the function being used:
# Create a matrix of cities, with each row being a location in 2-space (function works in n-dimensions).
cities = np.random.RandomState(42).rand(70,2)
# Find a good route with 2-opt ("route" gives the order in which to travel to each city by row number.)
route = two_opt(cities,0.001)
And here is the approximated solution path shown on a plot:
import matplotlib.pyplot as plt
# Reorder the cities matrix by route order in a new matrix for plotting.
new_cities_order = np.concatenate((np.array([cities[route[i]] for i in range(len(route))]),np.array([cities[0]])))
# Plot the cities.
plt.scatter(cities[:,0],cities[:,1])
# Plot the path.
plt.plot(new_cities_order[:,0],new_cities_order[:,1])
plt.show()
# Print the route as row numbers and the total distance travelled by the path.
print("Route: " + str(route) + "\n\nDistance: " + str(path_distance(route,cities)))
If the speed of the algorithm is important to you, I recommend pre-calculating the distances and storing them in a matrix. This dramatically decreases the convergence time.
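For example, a minimal sketch of that pre-calculation (not part of the original code; it reuses the cities array from the example above, and path_distance_fast is just an illustrative name):
from scipy.spatial.distance import cdist
# Pre-compute all pairwise distances once; dist_matrix[i, j] is the distance between city i and city j.
dist_matrix = cdist(cities, cities)
# Drop-in replacement for path_distance that looks distances up instead of recomputing norms.
path_distance_fast = lambda r, d=dist_matrix: np.sum(d[r, np.roll(r, -1)])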
Edit: Custom Start and End Points
For a non-circular path (one which ends at a location different from where it starts), edit the path distance formula to
path_distance = lambda r,c: np.sum([np.linalg.norm(c[r[p+1]]-c[r[p]]) for p in range(len(r)-1)])
and then reorder the cities for plotting using
new_cities_order = np.array([cities[route[i]] for i in range(len(route))])
With the code as it is, the starting city is fixed as the first city in cities, and the ending city is variable.
To make the ending city the last city in cities, restrict the range of swappable cities by changing the range of swap_first and swap_last in two_opt() with the code
for swap_first in range(1,len(route)-3):
for swap_last in range(swap_first+1,len(route)-1):
To make both the starting and ending cities variable, instead expand the range of swap_first and swap_last with
for swap_first in range(0,len(route)-2):
for swap_last in range(swap_first+1,len(route)):
I recently came across this option of using linear optimization for the TSP:
https://gist.github.com/mirrornerror/a684b4d439edbd7117db66a56f2483e0
Nonetheless I agree with some of the other comments; this is just a reminder that there are ways to use linear optimization for this problem.
Some references on the topic include the following:
http://www.opl.ufc.br/post/tsp/
https://phabi.ch/2021/09/19/tsp-subtour-elimination-by-miller-tucker-zemlin-constraint/
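Purely as an illustration of what such an integer linear programming formulation looks like, here is a rough sketch of the Miller-Tucker-Zemlin (MTZ) model; the choice of the PuLP package and all helper names are mine, not taken from the links above:
import itertools
import numpy as np
import pulp

def tsp_mtz(dist):
    n = len(dist)
    prob = pulp.LpProblem("TSP", pulp.LpMinimize)
    # x[i][j] = 1 if the tour goes directly from city i to city j
    x = pulp.LpVariable.dicts("x", (range(n), range(n)), cat="Binary")
    # u[i] = position of city i in the tour (the MTZ ordering variable)
    u = pulp.LpVariable.dicts("u", range(n), lowBound=0, upBound=n - 1, cat="Continuous")
    prob += pulp.lpSum(dist[i][j] * x[i][j] for i in range(n) for j in range(n) if i != j)
    for i in range(n):
        prob += x[i][i] == 0
        prob += pulp.lpSum(x[i][j] for j in range(n) if j != i) == 1  # leave each city exactly once
        prob += pulp.lpSum(x[j][i] for j in range(n) if j != i) == 1  # enter each city exactly once
    for i, j in itertools.permutations(range(1, n), 2):
        prob += u[i] - u[j] + n * x[i][j] <= n - 1  # MTZ subtour-elimination constraints
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    # Recover the tour starting from city 0.
    tour = [0]
    while len(tour) < n:
        tour.append(next(j for j in range(n) if j not in tour and pulp.value(x[tour[-1]][j]) > 0.5))
    return tour

dist = np.random.RandomState(0).rand(8, 8)
dist = (dist + dist.T) / 2  # small symmetric example matrix
print(tsp_mtz(dist))
This is exact but scales poorly; for large instances a heuristic like 2-opt above is usually the more practical choice.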
So I have two different files containing multiple trajectories in a square map (512x512 pixels). Each file contains information about the spatial position of each particle within a track/trajectory (X and Y coordinates) and which track/trajectory that spot belongs to (TRACK_ID).
My goal was to find a way to cluster similar trajectories between both files. I found a nice way to do this (distance clustering comparison), but the code is too slow. I was just wondering if someone has some suggestions to make it faster.
The approach that I implemented finds similar trajectories based on something called the Fréchet distance (maybe not too relevant here). Below you can find the function that I wrote, but briefly this is the rationale:
group all the spots by track using the pandas.groupby function for file1 (growth_xml) and file2 (shrinkage_xml)
for each trajectory in growth_xml (loop), compare it with each trajectory in shrinkage_xml
if they pass the Fréchet distance criterion that I defined (an if statement), save both tracks in a new table. You can see an additional filter condition that I called delay, but I guess that is not important to explain here.
so really simple:
import numpy as np
import pandas as pd
# frechetDist comes from an external Fréchet-distance implementation (defined elsewhere).

def distance_clustering(growth_xml, shrinkage_xml):
    coords_g = pd.DataFrame()  # empty dataframes to save filtered tracks
    coords_s = pd.DataFrame()
    counter = 0  # initialize counter to count number of filtered tracks
    for track_g, param_g in growth_xml.groupby('TRACK_ID'):
        # define growing track as multi-point line object
        traj1 = [(x, y) for x, y in zip(param_g.POSITION_X.values, param_g.POSITION_Y.values)]
        for track_s, param_s in shrinkage_xml.groupby('TRACK_ID'):
            # define shrinking track as a second multi-point line object
            traj2 = [(x, y) for x, y in zip(param_s.POSITION_X.values, param_s.POSITION_Y.values)]
            # compute delay between shrinkage and growing ends to use as an extra filter
            delay = (param_s.FRAME.iloc[0] - param_g.FRAME.iloc[0])
            # keep track only if the Fréchet distance is lower than 0.2 microns
            if frechetDist(traj1, traj2) < 0.2 and delay > 0:
                counter += 1
                param_g = param_g.assign(NEW_ID=np.ones(param_g.shape[0]) * counter)
                coords_g = pd.concat([coords_g, param_g])
                param_s = param_s.assign(NEW_ID=np.ones(param_s.shape[0]) * counter)
                coords_s = pd.concat([coords_s, param_s])
    coords_g.reset_index(drop=True, inplace=True)
    coords_s.reset_index(drop=True, inplace=True)
    return coords_g, coords_s
The main problem is that most of the time I have more than 2,000 tracks (!!), and this pairwise comparison takes forever. I'm wondering if there's a simple and more efficient way to do this. Perhaps by doing the pairwise comparison in multiple small areas instead of the whole map? Not sure...
Have you tried building a (DeltaX, DeltaY) lookup table for the pairwise distances? It will take a long time to calculate the LUT once, but you can write it to a file and load it when the algorithm starts.
Then you only have to look up the right entry to get the result instead of calculating it each time.
You could also fit a polynomial regression for the distance calculation; it will be less precise but definitely faster.
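A minimal sketch of that lookup-table idea, assuming integer pixel coordinates on the 512x512 map (if the positions are floats in microns they would have to be quantized first; all names here are illustrative):
import numpy as np

SIZE = 512
dx, dy = np.meshgrid(np.arange(SIZE), np.arange(SIZE), indexing="ij")
dist_lut = np.hypot(dx, dy)        # compute the LUT once ...
np.save("dist_lut.npy", dist_lut)  # ... or persist it and reload it when the algorithm starts

def point_distance(p, q, lut=dist_lut):
    # p and q are (x, y) integer pixel coordinates; look the distance up instead of recomputing it
    return lut[abs(p[0] - q[0]), abs(p[1] - q[1])]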
Maybe not an outright answer, but it's been a while. Could you not segment the lines and use a minimum bounding box around each segment to assess similarities? I might be thinking of your problem the wrong way around; I'm not sure. Right now I'm trying to work with polygons from two different data sets and want to optimize the processing by first identifying the polygons in both geometries that overlap.
In your case, I think segments would leave you with some edge artifacts. Maybe look at this paper: https://drops.dagstuhl.de/opus/volltexte/2021/14879/pdf/OASIcs-ATMOS-2021-10.pdf or this paper (with python code): https://www.austriaca.at/0xc1aa5576_0x003aba2b.pdf
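A rough sketch of that pruning idea as I read it (illustrative names only): compare axis-aligned bounding boxes first, and only run the expensive Fréchet check when the boxes are within the 0.2 threshold of each other:
import numpy as np

def bbox(traj):
    pts = np.asarray(traj)
    return pts.min(axis=0), pts.max(axis=0)

def boxes_within(traj1, traj2, threshold=0.2):
    (min1, max1), (min2, max2) = bbox(traj1), bbox(traj2)
    gap = np.maximum(0, np.maximum(min1 - max2, min2 - max1))  # per-axis gap, 0 where the boxes overlap
    return np.hypot(*gap) <= threshold
Because the Fréchet distance can never be smaller than the gap between the two bounding boxes, any pair rejected by boxes_within can safely be skipped.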
I am trying to find the distance between a point and other 40,000 points.
Each point is a 300 dimension vector.
I am able to find the closest point. How do I find the 10 nearest points in decreasing order?
Function for closest point:
from scipy.spatial import distance
import pandas as pd

def closest_node(node, df):
    closest_index = distance.cdist([node], df.feature.tolist()).argmin()
    return pd.Series([df.title.tolist()[closest_index], df.id.tolist()[closest_index]])
This command returns the closest title and id:
df3[["closest_title","closest_id"]]=df3.feature.apply(lambda row: closest_node(row,df2))
df2 - pandas dataframe of 40,000 points (each a 300-dimensional vector)
How do I return the title and index for the 10 closest points?
Thanks
Just slice the sorted distance matrix for the top 10 nodes.
Something like this:
from scipy.spatial import distance
import pandas as pd

# Find the query node
query_node = df.iloc[10]  # Not sure what you're looking for
# Find the distance between this node and everyone else
euclidean_distances = df.apply(lambda row: distance.euclidean(row, query_node), axis=1)
# Create a new dataframe with distances.
distance_frame = pd.DataFrame(data={"dist": euclidean_distances, "idx": euclidean_distances.index})
distance_frame.sort_values("dist", inplace=True)
# The 10 nearest nodes (index 0 is the query node itself, so skip it).
smallest_dist_ixs = distance_frame.iloc[1:11]["idx"]
most_similar_nodes = df.loc[smallest_dist_ixs]
My assumption, based on the word 'title' you have used here and the choice of 300-dimensional vectors, is that these are word or phrase vectors.
Gensim actually has a way to get the top N most similar words based on this idea, and it is reasonably fast.
https://tedboy.github.io/nlps/generated/generated/gensim.models.Word2Vec.most_similar.html
>>> trained_model.most_similar(positive=['woman', 'king'], negative=['man'])
[('queen', 0.50882536), ...]
For something slightly different: this is also related to the traveling salesman problem (TSP) if you want to get the shortest path between all points and then simply slice out the first 10 'cities'.
Google has a pretty simple and quick python implementation with OR-Tools here: https://developers.google.com/optimization/routing/tsp.
As I don't know your complete code or have a sample of the data, here is my suggestion:
Instead of using ".argmin()", just sort your list by distance and then return the first ten elements of the sorted list. Then find their indices like you're already doing.
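A sketch of that suggestion, keeping the cdist call from the question (node is one 300-dimensional vector and df holds the 40,000 candidates in a 'feature' column, as in the question; closest_nodes is an illustrative name):
import numpy as np
from scipy.spatial import distance

def closest_nodes(node, df, k=10):
    dists = distance.cdist([node], df.feature.tolist())[0]
    nearest = np.argsort(dists)[:k]  # positions of the k smallest distances
    return df.iloc[nearest][["title", "id"]].assign(dist=dists[nearest])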
If I have two known locations and a known speed, how can I calculate the current position at distance d (in km)?
For example, given:
Two GPS locations in EPSG:4326:
37.783333, -122.416667 # San Francisco
32.715, -117.1625 # San Diego
Traveling at 1km/min in a straight line (ignoring altitude)
How can I find the gps coordinate at a certain distance? A similar SO question uses VincentyDistance in geopy to calculate the next point based on bearing and distance.
I guess, more specifically:
How can I calculate the bearing between two gps points using geopy?
Using VincentyDistance to get the next GPS point by bearing and distance, how do I know if I have arrived at my destination, or if I should keep going? It doesn't need to be exactly on the destination to be considered arrived; maybe any point within a radius of 0.5 km of the destination is considered 'arrived'.
i.e.,
import geopy

POS1 = (37.783333, -122.416667)  # origin
POS2 = (32.715, -117.1625)       # dest

def get_current_position(d):
    # use geopy to calculate bearing between POS1 and POS2
    # then use VincentyDistance to get next coord
    return gps_coord_at_distance_d

# If current position is within .5 km of destination, consider it 'arrived'
def has_arrived(curr_pos):
    return True/False

d = 50  # 50 km
print(get_current_position(d))
print(has_arrived(get_current_position(d)))
Ok, figured I'd come back to this question and give it my best shot given that it hasn't seen any other solutions. Unfortunately I can't test code right now, but I believe there is a solution to your problem using both geopy and geographiclib. Here goes.
From the terminal (possibly with sudo)
pip install geographiclib
pip install geopy
Now with Python
Get Current Position
from geographiclib.geodesic import Geodesic
import geopy.distance

# Get the first azimuth (azi1), which is the bearing from the start point
bearing = Geodesic.WGS84.Inverse(37.783333, -122.416667, 32.715, -117.1625)['azi1']
# Now we use geopy to step 1 km along that bearing
dist = geopy.distance.VincentyDistance(kilometers=1)
san_fran = geopy.Point(37.783333, -122.416667)
print(dist.destination(point=san_fran, bearing=bearing))
Has Arrived
def has_arrived(curr_pos):
    return geopy.distance.vincenty(curr_pos, (32.715, -117.1625)).kilometers < 0.5
Like I said, I unfortunately can't test this, but I believe it is correct. It's possible there will be some unit differences with the bearing calculation: geographiclib measures the azimuth clockwise from North. Sorry if this isn't exactly right, but since this hasn't received a response, I figured I may as well throw in what I know.
I'm using a version of Dijkstra's algorithm written in Python which I found online, and it works great. But because this is for bus routes, changing 10 times might be the shortest route, but probably not the quickest and definitely not the easiest. I need to modify it somehow to return the path with the least number of changes, regardless of distance to be honest (obviously if 2 paths have equal number of changes, choose the shortest one). My current code is as follows:
from priodict import priorityDictionary

def Dijkstra(stops, start, end=None):
    D = {}  # dictionary of final distances
    P = {}  # dictionary of predecessors
    Q = priorityDictionary()  # est. dist. of non-final vertices
    Q[start] = 0
    for v in Q:
        D[v] = Q[v]
        print(v)
        if v == end: break
        for w in stops[v]:
            vwLength = D[v] + stops[v][w]
            if w in D:
                if vwLength < D[w]:
                    raise ValueError("Dijkstra: found better path to already-final vertex")
            elif w not in Q or vwLength < Q[w]:
                Q[w] = vwLength
                P[w] = v
    return (D, P)

def shortestPath(stops, start, end):
    D, P = Dijkstra(stops, start, end)
    Path = []
    while 1:
        Path.append(end)
        if end == start: break
        end = P[end]
    Path.reverse()
    return Path

stops = ...  # MASSIVE DICTIONARY WITH VALUES (7800 lines)
print(shortestPath(stops, 'Airport-2001', 'Comrie-106'))
I must be honest - I'm no mathematician, so I don't quite understand the algorithm fully, despite all my research on it.
I have tried changing a few things, but I don't get anywhere close.
Any help? Thanks!
Here is a possible solution:
1) Run breadth-first search from the start vertex. It will find the path with the least number of changes, but not the shortest among them. Let's assume that after running breadth-first search, dist[i] is the distance between the start and the i-th vertex.
2) Now one can run Dijkstra's algorithm on a modified graph (add only those edges from the initial graph which satisfy the condition dist[from] + 1 == dist[to]). The shortest path in this graph is the one you are looking for; a rough sketch of this two-pass idea follows below.
P.S. If you don't want to use breadth-first search, you can use Dijkstra's algorithm after making all edge weights equal to 1.
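A rough sketch of this two-pass idea (my own illustration, not the poster's code; stops is assumed to be the same {stop: {neighbour: distance}} dictionary used in the question):
import heapq
from collections import deque

def fewest_changes_shortest_path(stops, start, end):
    # Pass 1: breadth-first search gives the minimum number of hops to every stop.
    hops = {start: 0}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in stops[v]:
            if w not in hops:
                hops[w] = hops[v] + 1
                queue.append(w)
    # Pass 2: Dijkstra restricted to edges that keep the hop count minimal
    # (the dist[from] + 1 == dist[to] condition above).
    dist = {start: 0}
    prev = {}
    pq = [(0, start)]
    while pq:
        d, v = heapq.heappop(pq)
        if v == end:
            break
        if d > dist.get(v, float('inf')):
            continue
        for w, length in stops[v].items():
            if hops.get(w) == hops[v] + 1 and d + length < dist.get(w, float('inf')):
                dist[w] = d + length
                prev[w] = v
                heapq.heappush(pq, (dist[w], w))
    # Reconstruct the path from the predecessor map.
    path = [end]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]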
What I would do is add an offset to the actual costs if you have to change lines. For example, if your edge weights represent the time needed between two stations, I would add the average waiting time between Line1 and Line2 at station X (e.g. 0.5*maxWaitingTime) during the search process. Of course this is a heuristic solution for the problem. If your timetables are known, you can calculate an "exact" solution, or at least a solution that satisfies the model, because in reality you can't assume that every bus is always on time.
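A tiny sketch of that penalty (max_waiting_time and the line labels are assumptions, not from the question):
def edge_cost(travel_time, line_from, line_to, max_waiting_time=10.0):
    # add an estimated waiting time whenever the edge implies changing lines
    transfer_penalty = 0.5 * max_waiting_time if line_from != line_to else 0.0
    return travel_time + transfer_penalty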
The solution is simple: instead of using the distances as weights, use a weight of 1 for each edge (each leg between stops). Dijkstra's algorithm will then minimize the number of changes as you requested (the total path weight is the number of rides, which is the number of changes + 1). If you want to use the distance to break ties, use something like
vwLength = D[v] + 1+ alpha*stops[v][w]
where alpha<<1, e.g. alpha=0.0001
Practically, I think your approach is exaggerated. You don't want to fly from Boston to Toronto through Paris even if two flights is the minimum. I would play with alpha to get an approximation of the total traveling time, which is probably what matters.
I'm trying to solve a problem related to graphs in Python. Since it's a competitive programming problem, I'm not using any other 3rd-party packages.
The problem presents a graph in the form of a 5 X 5 square grid.
A bot is assumed to be at a user supplied position on the grid. The grid is indexed at (0,0) on the top left and (4,4) on the bottom right. Each cell in the grid is represented by any of the following 3 characters. ‘b’ (ascii value 98) indicates the bot’s current position, ‘d’ (ascii value 100) indicates a dirty cell and ‘-‘ (ascii value 45) indicates a clean cell in the grid.
For example below is a sample grid where the bot is at 0 0:
b---d
-d--d
--dd-
--d--
----d
The goal is to clean all the cells in the grid, in minimum number of steps.
A step is defined as a task, where either
i) The bot changes its position
ii) The bot changes the state of the cell (from d to -)
Assume that initially the position marked as b need not be cleaned. The bot is allowed to move UP, DOWN, LEFT and RIGHT.
My approach
I've read a couple of tutorials on graphs, and decided to model the graph as a 25 x 25 adjacency matrix, with 0 representing no path and 1 representing a path (since we can move only in 4 directions). Next, I decided to apply the Floyd-Warshall all-pairs shortest path algorithm to it, and then sum up the values of the paths.
But I have a feeling that it won't work.
I'm in a dilemma: the problem is either one of the following:
i) A Minimal Spanning Tree (which I'm unable to do, as I'm not able to model and store the grid as a graph).
ii) A* Search (Again a wild guess, but the same problem here, I'm not able to model the grid as a graph properly).
I'd be thankful if you could suggest a good approach to problems like these. Also, some hints and pseudocode about various forms of graph-based problems (or links to those) would be helpful. Thanks.
I think you're asking two questions here.
1. How do I represent this problem as a graph in Python?
As the robot moves around, he'll be moving from one dirty square to another, sometimes passing through some clean spaces along the way. Your job is to figure out the order in which to visit the dirty squares.
# Code is untested and may contain typos. :-)
# A list of the (x, y) coordinates of all of the dirty squares.
dirty_squares = [(0, 4), (1, 1), etc.]
n = len(dirty_squares)

# Everywhere after here, refer to dirty squares by their index
# into dirty_squares.
def compute_distance(i, j):
    return (abs(dirty_squares[i][0] - dirty_squares[j][0])
            + abs(dirty_squares[i][1] - dirty_squares[j][1]))

# distances[i][j] is the cost to move from dirty square i to
# dirty square j.
distances = []
for i in range(n):
    distances.append([compute_distance(i, j) for j in range(n)])

# The x, y coordinates of where the robot starts.
start_node = (0, 0)

# first_move_distances[i] is the cost to move from the robot's
# start location to dirty square i.
first_move_distances = [
    abs(start_node[0] - dirty_squares[i][0])
    + abs(start_node[1] - dirty_squares[i][1])
    for i in range(n)]

# order is a list of the dirty squares.
def cost(order):
    if not order:
        return 0  # Cleaning 0 dirty squares is free.
    return (first_move_distances[order[0]]
            + sum(distances[order[i]][order[i+1]]
                  for i in range(len(order)-1)))
Your goal is to find a way to reorder list(range(n)) that minimizes the cost.
2. How do I find the minimum number of moves to solve this problem?
As others have pointed out, the generalized form of this problem is intractable (NP-Hard). You have two pieces of information that help constrain the problem to make it tractable:
The graph is a grid.
There are at most 24 dirty squares.
I like your instinct to use A* here. It's often good for solving find-the-minimum-number-of-moves problems. However, A* requires a fair amount of code. I think you'd be better off going with a Branch-and-Bound approach (sometimes called Branch-and-Prune), which should be almost as efficient but is much easier to implement.
The idea is to start enumerating all possible solutions using a depth-first-search, like so:
# Each list represents a sequence of dirty nodes.
[]
[1]
[1, 2]
[1, 2, 3]
[1, 3]
[1, 3, 2]
[2]
[2, 1]
[2, 1, 3]
Every time you're about to recurse into a branch, check to see if that branch is more expensive than the cheapest solution found so far. If so, you can skip the whole branch.
If that's not efficient enough, add a function to calculate a lower bound on the remaining cost. Then if cost([2]) + lower_bound(set([1, 3])) is more expensive than the cheapest solution found so far, you can skip the whole branch. The tighter lower_bound() is, the more branches you can skip.
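A minimal branch-and-bound sketch along these lines (my own illustration; it reuses the n, distances and first_move_distances defined in the untested code above, and the bound is deliberately loose):
best_cost = float('inf')
best_order = None

def lower_bound(remaining):
    # Every remaining dirty square is at least one move away from wherever the bot is,
    # so this is a valid (if very loose) lower bound on the movement still required.
    return len(remaining)

def branch(order, remaining, cost_so_far):
    global best_cost, best_order
    if not remaining:
        if cost_so_far < best_cost:
            best_cost, best_order = cost_so_far, list(order)
        return
    if cost_so_far + lower_bound(remaining) >= best_cost:
        return  # prune: this branch cannot beat the cheapest solution found so far
    for i in list(remaining):
        step = first_move_distances[i] if not order else distances[order[-1]][i]
        branch(order + [i], remaining - {i}, cost_so_far + step)

branch([], set(range(n)), 0)
print(best_order, best_cost)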
Let's say V = {v | v = b or v = d}, and build a fully connected graph G(V, E). You can calculate the cost of each edge in E with a time complexity of O(n^2). Afterwards the problem becomes exactly the same as: start at a specified vertex, and find a shortest path of G which covers V.
This has been called the Traveling Salesman Problem (TSP) since 1832.
The problem can certainly be stored as a graph. The cost between nodes (dirty cells) is their Manhattan distance. Ignore the cost of cleaning cells, because that total cost will be the same no matter what path is taken.
This problem looks to me like the Minimum Rectilinear Steiner Tree problem. Unfortunately, that problem is NP-hard, so you'll need to come up with an approximation (a minimum spanning tree based on Manhattan distance), if I am correct.
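A hedged sketch of that minimum-spanning-tree approximation (my own illustration, not a full solution): build the Manhattan-distance matrix over the start cell plus the dirty cells from the sample grid and let scipy extract the MST weight as a rough estimate of the movement needed:
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# (row, col) of the bot's start plus the dirty cells in the sample grid from the question.
cells = np.array([(0, 0), (0, 4), (1, 1), (1, 4), (2, 2), (2, 3), (3, 2), (4, 4)])
manhattan = squareform(pdist(cells, metric='cityblock'))
mst = minimum_spanning_tree(manhattan)
print("MST total weight:", mst.sum())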