I'm looking for a more efficient way to reprioritize items in a priority queue. I have a (quite naive) priority queue implementation based on heapq. The relevant parts look like this:
from heapq import heapify, heappop

class pq(object):
    def __init__(self, init= None):
        self.inner, self.item_f= [], {}
        if not None is init:
            self.inner= [[priority, item] for item, priority in enumerate(init)]
            heapify(self.inner)
            self.item_f= {pi[1]: pi for pi in self.inner}
    def top_one(self):
        if not len(self.inner): return None
        priority, item= heappop(self.inner)
        del self.item_f[item]
        return item, priority
    def re_prioritize(self, items, prioritizer= lambda x: x+ 1):
        for item in items:
            if not item in self.item_f: continue
            entry= self.item_f[item]
            entry[0]= prioritizer(entry[0])
        heapify(self.inner)
And here is a simple coroutine to demonstrate the reprioritization characteristics of my real application.
def fecther(priorities, prioritizer= lambda x: x+ 1):
    q= pq(priorities)
    for k in xrange(len(priorities)+ 1):
        items= (yield k, q.top_one())
        if not None is items:
            q.re_prioritize(items, prioritizer)
With this test code:
if __name__ == '__main__':
    def gen_tst(n= 3):
        priorities= range(n)
        priorities.reverse()
        priorities= priorities+ range(n)
        def tst():
            result, f= range(2* n), fecther(priorities)
            k, item_t= f.next()
            while not None is item_t:
                result[k]= item_t[0]
                k, item_t= f.send(range(item_t[0]))
            return result
        return tst
producing:
In []: gen_tst()()
Out[]: [2, 3, 4, 5, 1, 0]
In []: t= gen_tst(123)
In []: %timeit t()
10 loops, best of 3: 26 ms per loop
Now, my question is: does there exist any data structure which would avoid the calls to heapify(.) when reprioritizing the priority queue? I'm willing to trade memory for speed here, but it should be possible to implement it in pure Python (and obviously with much better timings than my naive implementation).
Update:
To help you understand the specific case better, let's assume that no items are added to the queue after the initial (batch) pushes, and that every fetch (pop) from the queue generates a number of reprioritizations roughly following this scheme:
0* n, very seldom
0.05* n, typically
n, very seldom
where n is the current number of items in the queue. Thus, in any round there are more or less only relatively few items to reprioritize. So I'm hoping there could exist a data structure able to exploit this pattern and therefore outperform the cost of doing the mandatory heapify(.) in every round (in order to satisfy the heap invariant).
Update 2:
So far it seems that the heapify(.) approach is quite efficient (relatively speaking) indeed. All the alternatives I have been able to figure out need to utilize heappush(.), which seems to be more expensive than I originally anticipated. (Anyway, if the state of the issue remains like this, I'll be forced to find a better solution outside the Python realm.)
Since the new prioritization function may have no relationship to the previous one, you have to pay the cost to get the new ordering (and it's at minimum O(n) just to find the minimum element in the new ordering). If you have a small, fixed number of prioritization functions and switch frequently between them, then you could benefit from keeping a separate heap going for each function (although not with heapq, because it doesn't support cheaply locating and removing an object from the middle of a heap).
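One pure-Python pattern worth sketching, given that only a small fraction of entries changes per round, is the lazy-invalidation approach described in the heapq documentation: instead of re-heapifying, mark the stale entry invalid and push a replacement, skipping invalidated entries on pop. The class below is a minimal sketch of that idea; the names LazyPQ and _REMOVED are mine, not from the original code.

from heapq import heappush, heappop
from itertools import count

class LazyPQ(object):
    """Priority queue with O(log n) reprioritization via lazy invalidation."""
    _REMOVED = '<removed>'                          # placeholder marking stale entries

    def __init__(self, init=None):
        self.heap, self.item_f, self._tie = [], {}, count()
        if init is not None:
            for item, priority in enumerate(init):
                self.push(item, priority)

    def push(self, item, priority):
        entry = [priority, next(self._tie), item]   # tie-breaker keeps comparisons sane
        self.item_f[item] = entry
        heappush(self.heap, entry)

    def re_prioritize(self, items, prioritizer=lambda x: x + 1):
        for item in items:
            entry = self.item_f.get(item)
            if entry is None:
                continue
            entry[-1] = self._REMOVED               # mark the old entry stale in place
            self.push(item, prioritizer(entry[0]))  # push a fresh entry: O(log n)

    def top_one(self):
        while self.heap:
            priority, _, item = heappop(self.heap)
            if item is not self._REMOVED:           # skip stale entries
                del self.item_f[item]
                return item, priority
        return None

Whether this beats one heapify(.) per round depends on how many entries go stale: with roughly 0.05*n reprioritizations per pop, k pushes at O(log n) each can undercut an O(n) heapify, at the price of a heap that carries stale entries until they surface.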
I have a nested list of tuples that acts as a cache, for a script to decide which Queue to put packet data into, for its respective multiprocessing.Process() to process.
For example:
queueCache = [[('100.0.1.19035291111.0.0.255321TCP', '111.0.0.255321100.0.1.19035291TCP'), ('100.0.0.41842111.0.0.280TCP', '111.0.0.280100.0.0.41842TCP')], [('100.0.1.18722506111.0.0.345968TCP', '111.0.0.345968100.0.1.18722506TCP'), ('100.0.0.11710499111.0.0.328881TCP', '111.0.0.328881100.0.0.11710499TCP')], [('100.0.0.14950710111.0.0.339767TCP', '111.0.0.339767100.0.0.14950710TCP'), ('100.0.0.8663063111.0.0.280TCP', '111.0.0.280100.0.0.8663063TCP')]]
If the tuple ('100.0.1.19035291111.0.0.255321TCP', '111.0.0.255321100.0.1.19035291TCP') or ('111.0.0.255321100.0.1.19035291TCP', '100.0.1.19035291111.0.0.255321TCP') (the reverse) is given, 0 should be returned, indicating index 0, and that its respective data should be put into the "first" queue. If it is not in the cache, it will be sent to the queue with the shortest queue record. (This is irrelevant to the question but relevant to my example code below).
This is what I have so far:
for i in range(len(queueCache)):
    if any(x in queueCache[i] for x in ((fwd_tup, bwd_tup), (bwd_tup, fwd_tup))):
        index = i
        break
else:
    index = queueCache.index(min(queueCache, key=len))
queueCache[index].append((fwd_tup, bwd_tup))
Running some benchmarks with much greater sample data, if the tuple is at the end of the cache (which should be common, since the newest tuples are appended), 100,000 runs take about 9.5 seconds, which is probably more than it should be.
# using timeit:
print("Normal search, target at start: %f" % timeit(f"search('{fwd_tup1}', '{bwd_tup1}')", "from __main__ import search", number=100000))
print("Normal search, target at middle: %f" % timeit(f"search('{fwd_tup2}', '{bwd_tup2}')", "from __main__ import search", number=100000))
print("Normal search, target at end: %f" % timeit(f"search('{fwd_tup3}', '{bwd_tup3}')", "from __main__ import search", number=100000))
Normal search, target at start: 0.155531
Normal search, target at middle: 4.507907
Normal search, target at end: 9.470369
Perhaps there is a way to start the search from the end?
Note: The script deletes records when possible, but for specific reasons most records need to stay in the cache, thus it ends up being significantly sized.
Edit: Using a dictionary yields much better results; thank you to @SilvioMayolo and @MisterMiyagi for the help.
Transforming the sample data to a dictionary (just for this specific testing purpose):
d = {}
for i, tups in enumerate(queueCache):  # renamed to avoid shadowing the built-in list
    for tup in tups:
        d[tup] = i
Then simply:
def dictSearch(fwd_tup, bwd_tup):
    global d
    return d[(fwd_tup, bwd_tup)]
Results:
Dictionary search, target at start: 0.044860
Dictionary search, target at middle: 0.042514
Dictionary search, target at end: 0.044760
Amazing. Faster than the list search where the target is at the start!
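For completeness, here is a minimal sketch of how the whole lookup could be dictionary-backed, including the reversed-tuple check and the shortest-queue fallback from the list version; assign_queue and queue_of are illustrative names, not from the original script.

queueCache = [[], [], []]        # per-queue record lists, as in the post
queue_of = {}                    # (fwd_tup, bwd_tup) -> queue index

def assign_queue(fwd_tup, bwd_tup):
    # O(1) lookup for either orientation of the flow
    for key in ((fwd_tup, bwd_tup), (bwd_tup, fwd_tup)):
        if key in queue_of:
            return queue_of[key]
    # not cached: fall back to the queue with the fewest records
    index = min(range(len(queueCache)), key=lambda i: len(queueCache[i]))
    queueCache[index].append((fwd_tup, bwd_tup))
    queue_of[(fwd_tup, bwd_tup)] = index
    return index

The only extra bookkeeping is that deleting a record must remove it from both queueCache and queue_of so the two stay in sync.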
I found a solution to this topological sorting question, but it's not a solution I came across in any of my research, leading me to believe it's not the most optimal. The question (from AlgoExpert) reads along the lines of: "Return one of the possible graph traversals given a graph where each node represents a job and each edge represents that job's prereq. The first param is a list of numbers representing the jobs; the second param is a list of arrays (size 2) where the first number in the array is the prereq of the job given by the second number. For example, inputs([1,2,3], [[1,3],[2,1],[2,3]]) => [2, 1, 3]. Note: the graph may not be acyclic, in which case the algorithm should return an empty array. Example: inputs([1,2], [[1,2],[2,1]]) => []."
A popular optimal solution is a bit confusing to me; I've tried implementing it, but I keep getting situations where my implementation detects a cycle and short-circuits, returning an empty array. This algorithm "works backwards" in a depth-first manner, with "in-progress" and "visited" nodes kept in memory while searching the graph.
My algorithm initially finds the graph nodes with no prereqs (as these can be immediately added to the return array) and removes each such node from all other nodes' prereqs. During this removal, if a node ends up with 0 prereqs, it is added to the stack. When the stack is empty, return the return array if its size matches the size of the first param (the jobs list); otherwise return an empty array, which in this case means a cycle was present in the graph. Here's the code for my algorithm:
def topologicalSort(jobs, relations):
    rtn = []
    jobsHash = {}
    stackSet = set()
    for job in jobs:
        stackSet.add(job)
    for relation in relations:
        if relation[1] in stackSet:
            stackSet.remove(relation[1])
        if relation[0] not in jobsHash:
            jobsHash[relation[0]] = {"prereqs": set(), "depends": set()}
        jobsHash[relation[0]]["depends"].add(relation[1])
        if relation[1] not in jobsHash:
            jobsHash[relation[1]] = {"prereqs": set(), "depends": set()}
        jobsHash[relation[1]]["prereqs"].add(relation[0])
        if relation[1] in jobsHash[relation[0]]["prereqs"]:  # 2 node cycle shortcut
            return []

    stack = []
    for job in stackSet:
        stack.append(job)

    while len(stack):
        job = stack.pop()
        rtn.append(job)
        clearDepends(jobsHash, job, stack)

    if len(rtn) == len(jobs):
        return rtn
    else:
        return []

def clearDepends(jobsHash, job, stack):
    if job in jobsHash:
        for dependJob in jobsHash[job]["depends"]:
            jobsHash[dependJob]["prereqs"].remove(job)
            if not len(jobsHash[dependJob]["prereqs"]):
                stack.append(dependJob)
        jobsHash[job]["depends"] = set()

print(topologicalSort([1,2,3,4],[[1,2],[1,3],[3,2],[4,2],[4,3]]))
I found this algorithm to have a time complexity of O(j + d) and a space complexity of O(j + d), which is on par with the popular algorithm's characteristics. My question is: did I find the correct complexities, and is this an optimal solution to this problem? Thanks!
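For comparison, here is a minimal sketch of the textbook in-degree formulation of the same idea (Kahn's algorithm), which also runs in O(j + d) time and space; kahn_sort is an illustrative name, not from the post.

from collections import defaultdict, deque

def kahn_sort(jobs, relations):
    # indegree[j] counts unsatisfied prereqs; dependents[p] lists jobs unlocked by p
    indegree = {job: 0 for job in jobs}
    dependents = defaultdict(list)
    for prereq, job in relations:
        dependents[prereq].append(job)
        indegree[job] += 1

    ready = deque(job for job in jobs if indegree[job] == 0)
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for nxt in dependents[job]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)

    # if a cycle exists, some jobs never reach in-degree 0
    return order if len(order) == len(jobs) else []

On the example above, kahn_sort([1,2,3,4], [[1,2],[1,3],[3,2],[4,2],[4,3]]) returns [1, 4, 3, 2], one of the valid orderings, and it returns [] whenever a cycle keeps some job's in-degree above zero.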
I have a big list of items and some auxiliary data. For each item in the list and each element in data, I compute a thing and add all the things into an output set (there may be many duplicates). In code:
def process_list(myList, data):
    ret = set()
    for item in myList:
        for foo in data:
            thing = compute(item, foo)
            ret.add(thing)
    return ret

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    what_I_Want = process_list(myList, data)
Because myList is big and compute(item, foo) is costly, I need to use multiprocessing. For now this is what I have:
from multiprocessing import Pool

def initialize_worker(bar):
    global data
    data = bar

def process_item(item):
    ret = set()
    for foo in data:
        thing = compute(item, foo)
        ret.add(thing)
    return ret

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    p = Pool(nb_proc, initializer = initialize_worker, initiargs = (data))
    ret = p.map(process_item, myList)
    what_I_Want = set().union(*ret)
What I do not like about that is that ret can be big. I am thinking about 3 options:
1) Chop myList into chunks and pass them to the workers, who will use process_list on each chunk (hence some duplicates will be removed at that step), and then union all the sets obtained to remove the last duplicates.
Question: is there an elegant way of doing that? Can we tell Pool.map that it should pass the chunks to the workers instead of each item in the chunks? I know I could chop the list by myself, but this is damn ugly.
2) Have a shared set between all processes.
Question: why doesn't multiprocessing.Manager feature a set()? (I know it has dict(), but still...) If I use a manager.dict(), won't the communication between the processes and the manager slow things down considerably?
3) Have a shared multiprocessing.Queue(). Each worker puts the things it computes into the queue. Another worker does the union until some stopItem is found (which we put in the queue after the p.map).
Question: is this a stupid idea? Are communications between processes and a multiprocessing.Queue faster than those with, say, a manager.dict()? Also, how could I get back the set computed by the worker doing the union?
A minor thing: the keyword argument is initargs (not initiargs), and it takes a tuple, so pass initargs=(data,).
If you want to avoid creating all the results before reducing them into a set, you can use Pool.imap_unordered() with some chunk size. That will produce chunksize results from each worker as they become available.
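For example, a minimal sketch of that, reusing process_item and initialize_worker from the question (with the corrected initargs); the chunksize of 100 is just an illustrative value, and compute, create_data, create_list and nb_proc are the question's placeholders:

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    p = Pool(nb_proc, initializer=initialize_worker, initargs=(data,))
    what_I_Want = set()
    # results arrive as workers finish; fold each per-item set in immediately,
    # so the full list of intermediate sets never has to exist at once
    for partial in p.imap_unordered(process_item, myList, chunksize=100):
        what_I_Want |= partial
    p.close()
    p.join()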
If you want to change process_item to accept chunks directly, you have to do it manually. toolz.partition_all could be used to partition the initial dataset.
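And a minimal sketch of the chunked variant of option 1 with toolz.partition_all; process_chunk and the chunk size of 500 are illustrative:

from toolz import partition_all

def process_chunk(chunk):
    # dedupe inside the worker before anything crosses the process boundary
    return {compute(item, foo) for item in chunk for foo in data}

if __name__ == "__main__":
    data = create_data()
    myList = create_list()
    p = Pool(nb_proc, initializer=initialize_worker, initargs=(data,))
    partial_sets = p.map(process_chunk, partition_all(500, myList))
    what_I_Want = set().union(*partial_sets)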
Finally, the managed data structures are bound to have much higher synchronization overhead. I'd avoid them as much as possible.
Go with imap_unordered and see if that's good enough; if not, then partition; if you cannot avoid having more than a couple of duplicates in total, use a managed dict.
I created a demo problem while testing out an auto-scaling Dask Distributed implementation on Kubernetes and AWS, and I'm not sure I'm tackling the problem correctly.
My scenario: given an MD5 hash of a string (representing a password), find the original string. I hit three main problems.
A) the parameter space is massive and trying to create a dask bag with 2.8211099e+12 members caused memory issues (hence the 'explode' function you'll see in the sample code below).
B) Clean exit on early find. I think using take(1, npartitions=-1) will achieve this, but I wasn't sure. Originally I raised an exception, raise Exception("%s is your answer" % test_str), which worked but felt "dirty".
C) Given this is long running and sometimes workers or AWS boxes die, how would it be best to store progress?
Example code:
import distributed
import math
import dask.bag as db
import hashlib
import dask
import os

if os.environ.get('SCHED_URL', False):
    sched_url = os.environ['SCHED_URL']
    client = distributed.Client(sched_url)
    versions = client.get_versions(True)
    dask.set_options(get=client.get)

difficulty = 'easy'

settings = {
    'hard': (hashlib.md5('welcome1'.encode('utf-8')).hexdigest(), 'abcdefghijklmnopqrstuvwxyz1234567890', 8),
    'mid-hard': (hashlib.md5('032abgh'.encode('utf-8')).hexdigest(), 'abcdefghijklmnop1234567890', 7),
    'mid': (hashlib.md5('b08acd'.encode('utf-8')).hexdigest(), '0123456789abcdef', 6),
    'easy': (hashlib.md5('0812'.encode('utf-8')).hexdigest(), '0123456789', 4)
}

hashed_pw, keyspace, max_guess_length = settings[difficulty]

def is_pw(guess):
    return hashlib.md5(guess.encode('utf-8')).hexdigest() == hashed_pw

def guess(n):
    guess = ''
    size = len(keyspace)
    while n > 0:
        n -= 1
        guess += keyspace[n % size]
        n = math.floor(n / size)
    return guess

def make_exploder(num_partitions, max_val):
    """Creates a function that maps an int to a range, based on the maximum value
    aimed for and the number of partitions that are expected.
    Used in this code with map and flatten to take a short list,
    i.e. 1->1e6, to a large one, 1->1e20, in dask rather than on the host machine."""
    steps = math.ceil(max_val / num_partitions)
    def explode(partition):
        return range(partition * steps, partition * steps + steps)
    return explode

max_val = len(keyspace) ** max_guess_length  # how many possible password permutations there are
partitions = math.floor(max_val / 100)
partitions = partitions if partitions < 100000 else 100000  # split into at most 100,000 partitions; too many partitions caused issues, memory I think
exploder = make_exploder(partitions, max_val)  # sort of the opposite of a reduce. make_exploder(10, 100)(3) => [30, 31, ..., 39]. Expands the problem back into the full problem space.

print("max val: %s, partitions:%s" % (max_val, partitions))

search = db.from_sequence(range(partitions), npartitions=partitions).map(exploder).flatten().filter(lambda i: i <= max_val).map(guess).filter(is_pw)

search.take(1, npartitions=-1)
I find 'easy' works well locally and 'mid-hard' works well on our 6 to 8 * m4.2xlarge AWS cluster, but so far I haven't got 'hard' to work.
A) the parameter space is massive and trying to create a dask bag with 2.8211099e+12 members caused memory issues (hence the 'explode' function you'll see in the sample code below).
This depends strongly on how you arrange your elements into a bag. If each element is in its own partition then yes, this will certainly kill everything. 1e12 partitions is very expensive. I recommend keeping the number of partitions in the thousands or tens of thousands.
B) Clean exit on early find. I think using take(1, npartitions=-1) will achieve this, but I wasn't sure. Originally I raised an exception, raise Exception("%s is your answer" % test_str), which worked but felt "dirty".
If you want this then I recommend not using dask.bag, but instead using the concurrent.futures interface and in particular the as_completed iterator.
C) Given this is long running and sometimes workers or AWS boxes die, how would it be best to store progress?
Dask should be resilient to this as long as you can guarantee that the scheduler survives. If you use the concurrent futures interface rather than dask bag then you can also track intermediate results on the client process.
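A minimal sketch of that futures-based early exit, reusing guess, is_pw, max_val, partitions and math from the question; check_block and the block splitting are illustrative, and the Client setup assumes the sched_url from the question:

from dask.distributed import Client, as_completed

def check_block(start, stop):
    """Scan one contiguous block of the keyspace; return the password or None."""
    for n in range(start, stop):
        candidate = guess(n)
        if is_pw(candidate):
            return candidate
    return None

client = Client(sched_url)                      # scheduler URL from the question
step = math.ceil(max_val / partitions)
futures = client.map(check_block,
                     range(0, max_val, step),
                     range(step, max_val + step, step))

found = None
for future in as_completed(futures):            # yields futures as soon as they finish
    result = future.result()
    if result is not None:
        found = result
        client.cancel(futures)                  # stop the remaining blocks early
        break
print("answer:", found)

For problem C, the ranges of completed blocks could be appended to a file or an external store as their futures finish, so a restarted run can skip them.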
I've created a Python script that generates a list of words by permutation of characters. I'm using itertools.product to generate my permutations. My char list is composed of letters and numbers: 01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ. Here is my code:
#!/usr/bin/python
import itertools, hashlib, math

class Words:

    chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'

    def __init__(self, size):
        self.make(size)

    def getLenght(self, size):
        res = []
        for i in range(1, size+1):
            res.append(math.pow(len(self.chars), i))
        return sum(res)

    def getMD5(self, text):
        m = hashlib.md5()
        m.update(text.encode('utf-8'))
        return m.hexdigest()

    def make(self, size):
        file = open('res.txt', 'w+')
        res = []
        i = 1
        for i in range(1, size+1):
            prod = list(itertools.product(self.chars, repeat=i))
            res = res + prod
        j = 1
        for r in res:
            text = ''.join(r)
            md5 = self.getMD5(text)
            res = text+'\t'+md5
            print(res + ' %.3f%%' % (j/float(self.getLenght(size))*100))
            file.write(res+'\n')
            j = j + 1
        file.close()

Words(3)
This script works fine for lists of words with at most 4 characters. If I try 5 or 6 characters, my computer consumes 100% of CPU and 100% of RAM, and freezes.
Is there a way to restrict the use of those resources or optimize this heavy processing?
Does this do what you want?
I've made all the changes in the make method:
def make(self, size):
    with open('res.txt', 'w+') as file_:  # file is a builtin function in python 2
        # also, use with statements for files used in only a small block;
        # it handles file closure even if an error is raised.
        for i in range(1, size+1):
            prod = itertools.product(self.chars, repeat=i)
            for j, r in enumerate(prod):
                text = ''.join(r)
                md5 = self.getMD5(text)
                res = text+'\t'+md5
                print(res + ' %.3f%%' % ((j+1)/float(self.getLenght(size))*100))
                file_.write(res+'\n')
Be warned this will still chew up gigabytes of memory, but not virtual memory.
EDIT: As noted by Padraic, there is no file keyword in Python 3, and as it is a "bad builtin", it's not too worrying to override it. Still, I'll name it file_ here.
EDIT2:
To explain why this works so much faster and better than the previous, original version, you need to know how lazy evaluation works.
Say we have a simple expression as follows (for Python 3) (use xrange for Python 2):
a = [i for i in range(10**12)]
This immediately evaluates 1 trillion elements into memory, overflowing your memory.
So we can use a generator to solve this:
a = (i for i in range(10**12))
Here, none of the values have been evaluated; we've just given the interpreter instructions on how to evaluate them. We can then iterate through the items one by one and do work on each separately, so almost nothing is in memory at a given time (only one integer at a time). This makes the seemingly impossible task very manageable.
The same is true with itertools: it allows you to do memory-efficient, fast operations by using iterators rather than lists or arrays to do operations.
In your example, you have 62 characters and want to do the Cartesian product with 5 repeats, or 62**5 (nearly a billion elements, or over 30 gigabytes of RAM). This is prohibitively large.
In order to solve this, we can use iterators.
chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'
for i in itertools.product(chars, repeat=5):
    print(i)
Here, only a single item from the cartesian product is in memory at a given time, meaning it is very memory efficient.
However, if you evaluate the full iterator using list(), it exhausts the iterator and puts everything in a list, meaning the nearly one billion combinations are suddenly in memory again. We don't need all the elements in memory at once: just one. Which is the power of iterators.
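As an aside (not from the original answer), itertools.islice is a handy way to peek at a lazy product without materializing it, in contrast to list():

import itertools

chars = '01234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVXYZ'
prod = itertools.product(chars, repeat=5)

# lazily take just the first three 5-character combinations: O(1) memory
first_three = list(itertools.islice(prod, 3))
print(first_three)  # [('0','0','0','0','0'), ('0','0','0','0','1'), ('0','0','0','0','2')]

# by contrast, list(prod) would try to build all ~916 million remaining tuples in memory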
For more, see the documentation for the itertools module and any of the explanations of iterators written for Python 2 (mostly still true for 3).