Python: all simple paths in a directed graph

I am working with a number of directed graphs with no cycles in them, and I need to find all simple paths between any two nodes. In general I wouldn't worry about execution time, but I have to do this for very many nodes during very many timesteps - I am dealing with a time-based simulation.
In the past I tried the facilities offered by NetworkX, but in general I found them slower than my approach. Not sure if anything has changed lately.
I have implemented this recursive function:
import timeit
def all_simple_paths(adjlist, start, end, path):
    path = path + [start]
    if start == end:
        return [path]
    paths = []
    for child in adjlist[start]:
        if child not in path:
            child_paths = all_simple_paths(adjlist, child, end, path)
            paths.extend(child_paths)
    return paths
fid = open('digraph.txt', 'rt')
adjlist = eval(fid.read().strip())
number = 1000
stmnt = 'all_simple_paths(adjlist, 166, 180, [])'
setup = 'from __main__ import all_simple_paths, adjlist'
elapsed = timeit.timeit(stmnt, setup=setup, number=number)/number
print 'Elapsed: %0.2f ms'%(1000*elapsed)
On my computer, I get an average of 1.5 ms per iteration. I know this is a small number, but I have to do this operation very many times.
In case you're interested, I have uploaded a small file containing the adjacency list here:
adjlist
I am using adjacency lists as inputs, coming from a NetworkX DiGraph representation.
Any suggestion for improvements of the algorithm (i.e., does it have to be recursive?) or other approaches I may try are more than welcome.
Thank you.
Andrea.

You can save time without changing the algorithm's logic by caching the results of shared sub-problems here.
For example, calling all_simple_paths(adjlist, 'A', 'E', []) on a graph in which several routes from 'A' reach 'E' through 'D' will compute all_simple_paths(adjlist, 'D', 'E', []) multiple times.
Python has a built-in decorator, functools.lru_cache, for this task. It hashes the arguments to memoize results, so you will need to change adjlist and path to tuples, since lists are not hashable.
import timeit
import functools

@functools.lru_cache()
def all_simple_paths(adjlist, start, end, path):
    path = path + (start,)
    if start == end:
        return [path]
    paths = []
    for child in adjlist[start]:
        if child not in path:
            child_paths = all_simple_paths(adjlist, child, end, path)
            paths.extend(child_paths)
    return paths
fid = open('digraph.txt', 'rt')
adjlist = eval(fid.read().strip())
# you can also change your data format in the txt file
adjlist = tuple(tuple(pair) for pair in adjlist)
number = 1000
stmnt = 'all_simple_paths(adjlist, 166, 180, ())'
setup = 'from __main__ import all_simple_paths, adjlist'
elapsed = timeit.timeit(stmnt, setup=setup, number=number)/number
print('Elapsed: %0.2f ms'%(1000*elapsed))
Running time on my machine:
- original: 0.86ms
- with cache: 0.01ms
Note that this method only helps when there are a lot of shared sub-problems.
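To the "does it have to be recursive?" part of the question: no. Below is a minimal iterative sketch using an explicit stack; it assumes the same adjlist structure as above (indexable by node, yielding child nodes) and is an illustration rather than the original poster's code.
def all_simple_paths_iterative(adjlist, start, end):
    # Enumerate all simple paths from start to end without recursion.
    paths = []
    # Each stack entry is (current node, path taken so far including that node).
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            paths.append(path)
            continue
        for child in adjlist[node]:
            if child not in path:  # keep the path simple (no repeated nodes)
                stack.append((child, path + [child]))
    return paths
Whether this is actually faster depends on the graph; the recursion itself is usually a smaller cost than the repeated 'child not in path' membership checks on lists.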

Related

Is there a better way to do this? Counting files and directories via for loop vs map

Folks,
I'm trying to optimize this to help speed up the process...
What I am doing is creating a dictionary of scandir entries...
e.g.
fs_data = {}
for item in Path(fqpn).iterdir():
    # snipped out a bunch of normalization code
    fs_data[item.name.title().strip()] = item
{'file1': <file1 scandir data>, etc}
and then later using a function to gather the count of files and directories in the data.
Now I suspect that the new code using map could be optimized to be faster than the old code, but I also suspect the cost comes from having to run the summing pass twice, once for files and once for directories.
But I can't think of a way to optimize it so it only has to run once.
Can anyone suggest a way to sum the files and directories at the same time in the new version? (I could fall back to the old code, if necessary.)
But maybe I'm over-optimizing at this point?
Any feedback would be welcome.
def new_fs_counts(fs_entries) -> (int, int):
    """
    Quickly count the files vs directories in a list of scandir entries

    Used primarily by sync_database_disk to count a path's files & directories

    Parameters
    ----------
    fs_entries (list) - list of scandir entries

    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    def counter(fs_entry):
        return (fs_entry.is_file(), not fs_entry.is_file())

    mapdata = list(map(counter, fs_entries.values()))
    files = sum(files for files, _ in mapdata)
    dirs = sum(dirs for _, dirs in mapdata)
    return (files, dirs)
vs
def old_fs_counts(fs_entries) -> (int, int):
    """
    Quickly count the files vs directories in a list of scandir entries

    Used primarily by sync_database_disk to count a path's files & directories

    Parameters
    ----------
    fs_entries (list) - list of scandir entries

    Returns
    -------
    tuple - (# of files, # of dirs)
    """
    files = 0
    dirs = 0
    for fs_item in fs_entries:
        is_file = fs_entries[fs_item].is_file()
        files += is_file
        dirs += not is_file
    return (files, dirs)
map is fast here if you map the is_file function directly:
files = sum(map(os.DirEntry.is_file, fs_entries.values()))
dirs = len(fs_entries) - files
(Something with filter might be even faster, at least if most entries aren't files. Or filter with is_dir if that works for you and most entries aren't directories. Or itertools.filterfalse with is_file. Or using itertools.compress. Also, counting True with list.count or operator.countOf instead of summing bools might be faster. But all of these ideas take more code (and some also memory). I'd prefer my above way.)
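For concreteness, here is a rough sketch of a couple of those alternatives; it assumes fs_entries is the same dict of os.DirEntry objects as above, and I haven't benchmarked it against the original data.
import operator
import os
from itertools import filterfalse

def counts_with_countof(fs_entries):
    # operator.countOf counts how many of the mapped booleans equal True.
    files = operator.countOf(map(os.DirEntry.is_file, fs_entries.values()), True)
    return files, len(fs_entries) - files

def counts_with_filterfalse(fs_entries):
    # filterfalse keeps only the entries that are not files, i.e. directories.
    dirs = sum(1 for _ in filterfalse(os.DirEntry.is_file, fs_entries.values()))
    return len(fs_entries) - dirs, dirs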
Okay, map is definitely not the right answer here.
This morning I got up and created a test using timeit...
and it was a bit of a splash of reality to the face.
Without optimizations, new vs old, the new map code was roughly 2x the time.
New : 0.023185124970041215
old : 0.011841499945148826
I really ended up falling for a bit of clickbait, thinking that rewriting with map would gain some efficiency.
For the sake of completeness.
from timeit import timeit
import os

new = '''
def counter(fs_entry):
    files = fs_entry.is_file()
    return (files, not files)

mapdata = list(map(counter, fs_entries.values()))
files = sum(files for files, _ in mapdata)
dirs = sum(dirs for _, dirs in mapdata)
#dirs = len(fs_entries)-files
'''
#dirs = sum(dirs for _, dirs in mapdata)

old = '''
files = 0
dirs = 0
for fs_item in fs_entries:
    is_file = fs_entries[fs_item].is_file()
    files += is_file
    dirs += not is_file
'''
fs_location = '/Volumes/4TB_Drive/gallery/albums/collection1'
fs_data = {}
for item in os.scandir(fs_location):
    fs_data[item.name] = item
print("New : ", timeit(stmt=new, number=1000, globals={'fs_entries':fs_data}))
print("old : ", timeit(stmt=old, number=1000, globals={'fs_entries':fs_data}))
And while I was able to close the gap with some optimizations (thank you Lee for your suggestion):
New : 0.10864979098550975
old : 0.08246175001841038
It is clear that the for loop solution is easier to read, faster, and just simpler.
The speed difference between new and old doesn't seem to be map specifically.
The duplicate sum statement added .021, and the biggest slowdown was the second fs_entry.is_file call, which added .06 to the timings...

Python fastest way to find if a tuple or its reverse is in a list?

I have a nested list of tuples that acts as a cache, for a script to decide which Queue to put packet data into, for its respective multiprocessing.Process() to process.
For example:
queueCache = [[('100.0.1.19035291111.0.0.255321TCP', '111.0.0.255321100.0.1.19035291TCP'), ('100.0.0.41842111.0.0.280TCP', '111.0.0.280100.0.0.41842TCP')], [('100.0.1.18722506111.0.0.345968TCP', '111.0.0.345968100.0.1.18722506TCP'), ('100.0.0.11710499111.0.0.328881TCP', '111.0.0.328881100.0.0.11710499TCP')], [('100.0.0.14950710111.0.0.339767TCP', '111.0.0.339767100.0.0.14950710TCP'), ('100.0.0.8663063111.0.0.280TCP', '111.0.0.280100.0.0.8663063TCP')]]
If the tuple ('100.0.1.19035291111.0.0.255321TCP', '111.0.0.255321100.0.1.19035291TCP') or ('111.0.0.255321100.0.1.19035291TCP', '100.0.1.19035291111.0.0.255321TCP') (the reverse) is given, 0 should be returned, indicating index 0, and that its respective data should be put into the "first" queue. If it is not in the cache, it will be sent to the queue with the shortest queue record. (This is irrelevant to the question but relevant to my example code below).
This is what I have so far:
for i in range(len(queueCache)):
    if any(x in queueCache[i] for x in ((fwd_tup, bwd_tup), (bwd_tup, fwd_tup))):
        index = i
        break
else:
    index = queueCache.index(min(queueCache, key=len))
    queueCache[index].append((fwd_tup, bwd_tup))
Running some benchmarks with much larger sample data, if the tuple is at the end of the cache (which should be common, since the newest tuples are appended), 100,000 runs take about 9.5 seconds, which is probably more than it should.
# using timeit:
print("Normal search, target at start: %f" % timeit(f"search('{fwd_tup1}', '{bwd_tup1}')", "from __main__ import search", number=100000))
print("Normal search, target at middle: %f" % timeit(f"search('{fwd_tup2}', '{bwd_tup2}')", "from __main__ import search", number=100000))
print("Normal search, target at end: %f" % timeit(f"search('{fwd_tup3}', '{bwd_tup3}')", "from __main__ import search", number=100000))
Normal search, target at start: 0.155531
Normal search, target in the middle: 4.507907
Normal search, target at end: 9.470369
Perhaps there is a way to start the search from the end?
Note: The script deletes records when possible, but for specific reasons most records need to stay in the cache, thus it ends up being significantly sized.
Edit: Using a dictionary yields much better results; thank you to @SilvioMayolo and @MisterMiyagi for the help.
Transforming the sample data to a dictionary (just for this specific testing purpose):
d = {}
for i, sublist in enumerate(queueCache):
    for tup in sublist:
        d[tup] = i
Then simply:
def dictSearch(fwd_tup, bwd_tup):
    global d
    return d[(fwd_tup, bwd_tup)]
Results:
Dictionary search, target at start: 0.044860
Dictionary search, target at middle: 0.042514
Dictionary search, target at end: 0.044760
Amazing. Faster than the list search where the target is at the start!
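For completeness, here is a minimal sketch of what a dict-backed lookup could look like once both orderings and the fall-back to the shortest queue are handled; the helper name and exact structure are illustrative, not taken from the original script.
def find_queue_index(cache_dict, queueCache, fwd_tup, bwd_tup):
    # O(1) lookups for both the tuple and its reverse.
    key = (fwd_tup, bwd_tup)
    index = cache_dict.get(key)
    if index is None:
        index = cache_dict.get((bwd_tup, fwd_tup))
    if index is None:
        # Not cached: fall back to the shortest queue, as in the original code.
        index = queueCache.index(min(queueCache, key=len))
        queueCache[index].append(key)
        cache_dict[key] = index
    return index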

How to Quickly Remove Similar Strings?

Right now, I have a dataset of around 70,000 tweets and I'm trying to remove tweets that are overly similar. I have decided to use a Levenshtein ratio of greater than 0.9 as a threshold for tweets that are too similar. However, my current code is essentially comparing every tweet to every other tweet, giving me O(N^2), which is painfully slow. It takes over 12 hours to run, which is just too much. I tried parallelizing it, which is what I'm running right now, but I don't know if that will speed it up to an acceptable degree. Is there some way I can accomplish this in O(N)?
import json
import multiprocessing as mp
from pathlib import Path

import Levenshtein

def check_ratio(similar_removed, try_tweet):
    for tweet in similar_removed:
        if Levenshtein.ratio(try_tweet, tweet[1]) > 0.9:
            return 1
    return 0

def main():
    path = Path('C:/Data/Python/JobLoss')
    orig_data = []
    with open(path / 'Processed.json') as f:
        data = json.load(f)
    for tweet in data:
        orig_data.append(tweet[2])
    pool = mp.Pool(mp.cpu_count())
    similar_removed = []
    similar_removed.append((0, orig_data[0]))
    for i in range(1, len(orig_data)):
        print('%s / %s' % (i, len(orig_data)))
        too_similar = 0
        try_tweet = orig_data[i]
        too_similar = pool.apply(check_ratio, args=(similar_removed, try_tweet))
        if not too_similar:
            similar_removed.append((i, try_tweet))
    pool.close()
    final = [data[ind[0]] for ind in similar_removed]
    with open(path / 'ProcessedSimilarRemoved.json', 'w') as f:
        json.dump(final, f)

if __name__ == '__main__':
    main()
I ended up using a method described in the top answer to this question, which uses LSH for sublinear-time queries, giving an overall sub-quadratic complexity; fast enough for my needs. Essentially you turn each tweet into a set with k-shingling, then use MinHash LSH to quickly bucket similar sets together. Then, instead of comparing each tweet to every other tweet, you only need to look in its bucket for matches, which makes this considerably faster than running the Levenshtein ratio on all pairs.
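As a rough illustration of that idea, here is a minimal sketch assuming the third-party datasketch library (not the exact code used above); note the threshold here is a Jaccard similarity on shingle sets, not a Levenshtein ratio.
from datasketch import MinHash, MinHashLSH

def shingles(text, k=5):
    # k-shingling: the set of all overlapping k-character substrings.
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def deduplicate(tweets, threshold=0.9, num_perm=128):
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for idx, tweet in enumerate(tweets):
        m = MinHash(num_perm=num_perm)
        for sh in shingles(tweet):
            m.update(sh.encode('utf8'))
        # Only tweets that landed in the same LSH buckets are candidates.
        if not lsh.query(m):
            lsh.insert(str(idx), m)
            kept.append(tweet)
    return kept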

Generate all paths in an efficient way using networkx in python

I am trying to generate all paths with at most 6 nodes from every origin to every destination in a fairly large network (20,000+ arcs). I am using networkx and Python 2.7. For small networks it works well, but I need to run this for the whole network. I was wondering if there is a more efficient way to do this in Python. My code contains a recursive function (see below). I am thinking about keeping some of the paths in memory so that I don't create them again for other paths, but I am not sure how I can accomplish that quickly. Right now it can't finish even within a few days; 3-4 hours would be fine for my project.
Here is a sample that I created. Feel free to ignore the print functions, as I added them for illustration purposes. Also, here is the sample input file: input
import networkx as nx
import pandas as pd
import copy
import os

class ODPath(object):
    def __init__(self, pathID='', transittime=0, path=[], vol=0, OD='', air=False, sortstrat=[], arctransit=[]):
        self.pathID = pathID
        self.transittime = transittime
        self.path = path
        self.vol = vol
        self.OD = OD
        self.air = air
        self.sortstrat = sortstrat  # a list of sort strategies
        self.arctransit = arctransit  # keep the transit time of each arc as a list
    def setpath(self, newpath):
        self.path = newpath
    def setarctransitpath(self, newarctransit):
        self.arctransit = newarctransit
    def settranstime(self, newtranstime):
        self.transittime = newtranstime
    def setpathID(self, newID):
        self.pathID = newID
    def setvol(self, newvol):
        self.vol = newvol
    def setOD(self, newOD):
        self.OD = newOD
    def setAIR(self, newairTF):
        self.air = newairTF
    def setsortstrat(self, newsortstrat):
        self.sortstrat = newsortstrat

def find_allpaths(graph, start, end, pathx=ODPath(None, 0, [], 0, None, False)):
    path = copy.deepcopy(pathx)  # to make sure the original has not been revised
    newpath = path.path + [start]
    path.setpath(newpath)
    if len(path.path) > 6:
        return []
    if start == end:
        return [path]
    if start not in graph:  # check if node:start exists in the graph
        return []
    paths = []
    for node in graph[start]:  # loop over all outbound nodes of starting point
        if node not in path.path:  # makes sure there are no cycles
            newpaths = find_allpaths(graph, node, end, path)
            for newpath in newpaths:
                if len(newpath.path) < 7:  # generate only the paths with at most 6 hops
                    paths.append(newpath)
    return paths

def printallpaths(path_temp):
    map(printapath, path_temp)

def printapath(path):
    print path.path

filename = 'transit_sample1.csv'
df_t = pd.read_csv(filename, delimiter=',')
df_t = df_t.reset_index()
G = nx.from_pandas_dataframe(df_t, 'node1', 'node2', ['Transit Time'], nx.DiGraph())
allpaths = find_allpaths(G, 'A', 'S')
printallpaths(allpaths)
I would really appreciate any help.
I actually asked a similar question previously about optimizing an algorithm I had written using networkx. Essentially, what you'll want to do is move away from a recursive function and towards a solution that uses memoization, like I did.
From there you can do further optimizations, like using multiple cores, or picking the next node to traverse based on criteria such as degree.
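As a side note, recent versions of networkx ship a generator for exactly this. A minimal sketch, assuming the six-node limit translates to a cutoff of five edges (worth double-checking the cutoff semantics for your networkx version):
import networkx as nx

def paths_up_to_six_nodes(G, source, target):
    # cutoff is the maximum number of edges; 6 nodes means at most 5 edges.
    return nx.all_simple_paths(G, source=source, target=target, cutoff=5)
Because this returns a generator, you can iterate over the paths lazily instead of materializing them all at once.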

Python: Using multiprocessing module as possible solution to increase the speed of my function

I wrote a function in Python 2.7 (on Windows OS 64-bit) to calculate the mean value of the intersection area between a reference polygon (Ref) and one or more segmented (Seg) polygon(s) in ESRI shapefile format. The code is quite slow because I have more than 2000 reference polygons, and for each Ref polygon the function runs over all Seg polygons (more than 7000). I am sorry, but the function is a prototype.
I wish to know if multiprocessing can help me increase the speed of my loop, or if there are better-performing solutions. If multiprocessing is a possible solution, I wish to know the best way to optimize my following function
import os
import numpy as np
import ogr
import osr, gdal
from shapely.geometry import Polygon
from shapely.geometry import Point
import osgeo.gdal
import osgeo.gdal as gdal

def AreaInter(reference, segmented, outFile):
    # open shapefiles
    ref = osgeo.ogr.Open(reference)
    if ref is None:
        raise SystemExit('Unable to open %s' % reference)
    seg = osgeo.ogr.Open(segmented)
    if seg is None:
        raise SystemExit('Unable to open %s' % segmented)
    ref_layer = ref.GetLayer()
    seg_layer = seg.GetLayer()
    # create outfile
    if not os.path.split(outFile)[0]:
        file_path, file_name_ext = os.path.split(os.path.abspath(reference))
        outFile_filename = os.path.splitext(os.path.basename(outFile))[0]
        file_out = open(os.path.abspath("{0}\\{1}.txt".format(file_path, outFile_filename)), "w")
    else:
        file_path_name, file_ext = os.path.splitext(outFile)
        file_out = open(os.path.abspath("{0}.txt".format(file_path_name)), "w")
    # For each reference object-i
    for index in xrange(ref_layer.GetFeatureCount()):
        ref_feature = ref_layer.GetFeature(index)
        # get FID (=Feature ID)
        FID = str(ref_feature.GetFID())
        ref_geometry = ref_feature.GetGeometryRef()
        pts = ref_geometry.GetGeometryRef(0)
        points = []
        for p in xrange(pts.GetPointCount()):
            points.append((pts.GetX(p), pts.GetY(p)))
        # convert to a shapely polygon
        ref_polygon = Polygon(points)
        # get the area
        ref_Area = ref_polygon.area
        # create empty lists
        seg_Area, intersect_Area = ([] for _ in range(2))
        # For each segmented object-j
        for segment in xrange(seg_layer.GetFeatureCount()):
            seg_feature = seg_layer.GetFeature(segment)
            seg_geometry = seg_feature.GetGeometryRef()
            pts = seg_geometry.GetGeometryRef(0)
            points = []
            for p in xrange(pts.GetPointCount()):
                points.append((pts.GetX(p), pts.GetY(p)))
            seg_polygon = Polygon(points)
            seg_Area.append(seg_polygon.area)
            # intersection (overlap) of reference object with the segmented object
            intersect_polygon = ref_polygon.intersection(seg_polygon)
            # area of intersection (= 0, no intersection)
            intersect_Area.append(intersect_polygon.area)
        # Average over all segmented objects (1 or more segmented polygons can intersect the reference polygon)
        seg_Area_average = np.average(seg_Area)
        intersect_Area_average = np.average(intersect_Area)
        file_out.write(" ".join(["%s" % i for i in [FID, ref_Area, seg_Area_average, intersect_Area_average]]) + "\n")
    file_out.close()
You can use the multiprocessing package, and especially the Pool class. First create a function that does all the stuff you want to do within the for loop, and that takes as an argument only the index:
def process_reference_object(index):
    ref_feature = ref_layer.GetFeature(index)
    # all your code goes here
    return (" ".join(["%s" % i for i in [FID, ref_Area, seg_Area_average, intersect_Area_average]]) + "\n")
Note that this doesn't write to a file itself; that would be messy, because you'd have multiple processes writing to the same file at the same time. Instead, it returns the string that needs to be written. Also note that there are objects in this function like ref_layer or ref_geometry that will need to reach it somehow; that's up to you how to do it (you could make process_reference_object a method on a class initialized with them, or it could be as ugly as just defining them globally).
Then, you create a pool of process resources, and run all of your indices using Pool.imap_unordered (which will itself allocate each index to a different process as necessary):
from multiprocessing import Pool

p = Pool()  # run multiple processes
for l in p.imap_unordered(process_reference_object, range(ref_layer.GetFeatureCount())):
    file_out.write(l)
This will parallelize the independent processing of your reference objects across multiple processes, and write them to the file (in an arbitrary order, note).
Threading can help to a degree, but first you should make sure you can't simplify the algorithm. If you're checking each of 2000 reference polygons against 7000 segmented polygons (perhaps I misunderstood), then you should start there. Stuff that runs at O(n^2) is going to be slow, so maybe you can prune away things that will definitely not intersect, or find some other way to speed things up. Otherwise, running multiple processes or threads will only improve things linearly when your data grows geometrically.
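One common way to do that pruning is a spatial index. Here is a minimal sketch assuming Shapely 2.x, where STRtree.query returns integer indices of candidate geometries whose bounding boxes intersect (older Shapely versions return the geometries themselves); it also assumes the polygons have already been converted to Shapely objects, as in the question.
from shapely.strtree import STRtree

def mean_intersection_areas(ref_polygons, seg_polygons):
    # Build the spatial index once over all segmented polygons.
    tree = STRtree(seg_polygons)
    averages = []
    for ref in ref_polygons:
        # Bounding-box query prunes polygons that cannot possibly intersect.
        candidate_idx = tree.query(ref)
        areas = [ref.intersection(seg_polygons[i]).area for i in candidate_idx]
        averages.append(sum(areas) / len(areas) if areas else 0.0)
    return averages
Note that this averages only over the surviving candidates rather than over all 7000 segmented polygons, so adjust the denominator if you need to match the original semantics exactly.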
