Python sort using key: possible to parallelize?

Let's say I have a list of lists my_list.
I want to sort my_list using some operation on each of its elements, which are themselves lists (call each one inner_list).
import time

def fun(inner_list):
    # do some calculation on the inner list to produce its cost
    time.sleep(1)  # simulate an expensive operation
    return cost

sorted_list = sorted(my_list, key=lambda inner_list: fun(inner_list))
Since fun is expensive in terms of time, is there any way to calculate the cost from fun in parallel? If so, would threading be a good choice? AFAIK, because of the GIL, threads can't actually run Python code in parallel. However, if the expensive part is iterating over the items of a long list, can threading still help?
What other ways can I speed this up?
Note: I am limited to Python 2.7.
Edit: added details.
Here is the exact code. fun is an abstraction for get_delay. graph is a networkx graph in which each link (u, v) has some value for delay. Basically, I am iterating over all edges on the path and calculating the cumulative delay. The path is a list of nodes; for example, if path = [1, 2, 3, 4], then the (u, v) links would be [(1, 2), (2, 3), (3, 4)].
def get_delay(path):
    return sum([float(graph.get_edge_data(u, v)['delay']) for u, v in zip(path[:-1], path[1:])])

To do that you need a sorting algorithm that allows you to parallelize the problem. A nice, straightforward choice is merge sort: split the data, sort the parts in parallel, and then merge them in a final, single-threaded pass. However, that can lead to the keys being calculated more than once unless you cache the values.
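In practice, a simpler route is often to compute the expensive keys in parallel exactly once, cache them, and then do a single ordinary sort on the cached values (decorate-sort-undecorate). Here is a minimal sketch using multiprocessing, which sidesteps the GIL and works on Python 2.7; it assumes get_delay and my_list (the list of paths) from the question, and that graph is available in the worker processes (e.g. inherited via fork):

from multiprocessing import Pool

# Compute every key exactly once, in parallel worker processes.
pool = Pool()                         # defaults to one worker per CPU
costs = pool.map(get_delay, my_list)  # my_list is the list of paths
pool.close()
pool.join()

# Decorate-sort-undecorate: sort on the precomputed costs, then drop them.
# The index breaks ties so the paths themselves are never compared.
decorated = sorted(zip(costs, range(len(my_list)), my_list))
sorted_list = [path for _, _, path in decorated]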

Related

Is it possible to make a "get values that are not duplicated at all" algorithm without using sort? (Time complexity O(n))

I want to know whether a "get values that are not duplicated at all" algorithm is possible without sorting.
Let me explain what I mean by "get values that are not duplicated at all":
if the list is [1, 3, 3, 2]
I want to get 1, 2.
I have tried lots of algorithms, but I can't find one with a stable O(n) time complexity.
I tried using a dict and a queue, but the time complexity is never a stable O(n).
Using a dict:
li = [1, 2, 3, 3, 5]
li_dict = dict()
for i in range(len(li)):
    try:
        a = li_dict[li[i]]
        li_dict[li[i]] = False
    except KeyError:
        li_dict[li[i]] = True
for k in li_dict.keys():
    if li_dict[k]:
        print(k, end="")
Is this possible? If so, please explain how to do it (without sort).
I've got a hardcoded solution for you. You can make use of the Set type:
li = [1, 3, 3, 2]
result = set()
seen = set()
for i in li:
    if i in seen:
        result.discard(i)
    else:
        result.add(i)
    seen.add(i)
print(result)
Because Python sets use a hash table under the hood, lookups are very fast, just as in dictionaries. Sets are a bit better suited to your problem, however, because you only care about the values themselves and not about mapping them to anything.
In my code example I iterate over the list only once, and since set membership tests are near instant (O(1) on average), I believe this is O(n) complexity.
I use two sets to separate the items I've already seen from the items I want to keep.
There are most likely more Pythonic solutions. This is my first attempt at reasoning about the complexity of a piece of code, but I believe my conclusions are valid. If not, someone else may correct me!
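For what it's worth, one possibly more Pythonic variant of the same single-pass counting idea uses collections.Counter and stays O(n) on average:

from collections import Counter

li = [1, 3, 3, 2]
counts = Counter(li)                        # one pass to count occurrences
unique = [x for x in li if counts[x] == 1]  # keep only the values seen exactly once
print(unique)  # [1, 2]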

Multiprocessing iteration without repeating duplicates in Python

In python, I have a large list of numbers (~0.5 billion items) with some duplicates in it, for example:
[1,1,2,2,3,4,5,6,6]
I want to apply a function over these numbers to create a dictionary, keyed by the number, giving the function result. So if the function is simply (say) lambda x: x*10, I want a dictionary like:
{1:10, 2:20, 3:30, 4:40, 5:50, 6:60}
The thing is, I want to use the Python multiprocessing module to do this (I don't care in what order the functions are run), and I don't really want to make the list of numbers unique beforehand: I'd prefer to check when iterating over the numbers whether there's a duplicate, and if so, not add the calculation to the multiprocessing pool or queue.
So should I use something like multiprocessing.Pool.imap_unordered for this, and check for previously visited items, e.g.:
import multiprocessing
import itertools
import time

def f(x):
    print(x)
    time.sleep(0.1)
    return x, x*10.0

input = [1, 1, 2, 2, 3, 4, 5, 6, 6]
result = {}

def unique_everseen(iterable):
    for element in iterable:
        if element not in result:
            result[element] = None  # Mark this result as being processed
            yield element

with multiprocessing.Pool(processes=2) as pool:
    for k, v in pool.imap_unordered(f, unique_everseen(input)):
        result[k] = v
I ask because it seems a little hacky to use the result dictionary to also check whether we have visited a value before (I've done this to avoid creating a separate set of half a billion items just to check for duplicates). Is there a more Pythonic way to do this, perhaps adding the items to a Queue or something? I haven't used multiprocessing much before, so perhaps I'm doing this wrong and, for example, opening myself up to race conditions?

Unpacking values from a python 3 `product` iterator built from `zip` iterators

I am converting some code from Python 2 to Python 3. I ran into an issue where some Python 2 builtins used to generate lists but now generate iterators in Python 3. I wanted to rewrite this code but have not found a way to make the program work without converting the iterators to lists, which does not seem very efficient. The question is whether there is a way to make this code work using iterators instead of converting them to lists.
Here is a simplified version of the Python 3 program.
from itertools import product

points = {(1, 1), (0.5, 0.5), (0.75, 0.75)}
labels = range(len(points))
zips = zip(points, labels)
for pair in product(zips, zips):
    if pair[0][1] != pair[1][1]:
        print("good")
Note, I simplified the for loop to just print "good." The real program calculates a Euclidean distance, but "good" preserves the basic idea.
Further note that in Python 3 the zip and range builtins return iterators, and the product function from itertools then builds its iterator on top of those.
When I run this code in Python 3, the loop just ends without ever printing "good," which is not correct. The code seems to bypass the loop body entirely; I don't think it finds any values in pair.
The solution that I found was to just convert the variable zips to a list. So the code looks like.
from itertools import product

points = {(1, 1), (0.5, 0.5), (0.75, 0.75)}
labels = range(len(points))
zips = list(zip(points, labels))
for pair in product(zips, zips):
    if pair[0][1] != pair[1][1]:
        print("good")
In the modified code things work, but I have lost the nice iterator structure.
So again, the question is whether I can rewrite this code without having to convert the iterators to lists?
Instead of passing zips to product twice, use the repeat argument:
for pair in product(zip(points, labels), repeat=2):
    ...
Note that product will materialize the iterator under the hood anyway. It has to, since it needs to iterate over the elements more than once.
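Putting that suggestion into the question's own example (only the product call changes, everything else is as posted):

from itertools import product

points = {(1, 1), (0.5, 0.5), (0.75, 0.75)}
labels = range(len(points))

# repeat=2 lets product reuse its internally materialized input,
# so the zip iterator is built and consumed only once.
for pair in product(zip(points, labels), repeat=2):
    if pair[0][1] != pair[1][1]:
        print("good")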

Lists in python using itertools (moving only 2 items, permutations)

I come from C++ as my first programming language and I'm just getting into Python. I'm looking for a way to switch numbers around in a list; in C++ this would be done using pointers, moving them around with loops. This time, however, I need to generate all the permutations from a list A to a list B in Python.
List A is the starting list and list B is the result list:
A = 1234
B = 4231
The program has to show all the possible combinations in order, moving only 2 numbers at the same time, until list A becomes list B (the following example is simplified to 4 numbers and might not show all the combinations):
[1,2,3,4]
[1,2,4,3]
[1,4,2,3]
[4,1,2,3]
[4,2,1,3]
[4,2,3,1]
In order to accomplish this, I found the itertools module, which contains a lot of functions, but I haven't been able to apply many of them so far. The following code kind of does what is needed, but it does not move the numbers in pairs, nor in order:
import itertools
from itertools import product, permutations
A = ([1,2,3,4])
B = ([4,2,3,1])
print "\n"
print (list(permutations(sorted(B),4)))
I'm thinking about adding a while (A != B) loop and then stopping the permutations. I already tried this, but I'm not familiar with Python's syntax. Any help on how I can accomplish this would be appreciated.
I am assuming sorting of the input list is not really required:
from itertools import permutations

A = ([4, 3, 2, 1])
B = ([1, 2, 4, 3])

def print_combinations(start, target):
    # use list(permutations(sorted(start), len(start))) if sorting of start is really required
    all_perms = list(permutations(start, len(start)))
    if tuple(target) not in all_perms:
        # return an empty list if target is not found among the permutations
        return []
    # return all combinations up to and including target, using list slicing
    temp = all_perms[: all_perms.index(tuple(target)) + 1]
    return temp

print print_combinations(A, B)
On the assumption that you are asking for the best way to solve this permutation question, here is a different answer:
Think of all the permutations as a set. itertools.permutations generates all those permutations in some order. That is just what you want if you want to find all or some of those permutations. But that's not what you are looking for. You are trying to find paths through those permutations. itertools.permutations generates all the permutations in an order but not necessarily the order you want. And certainly not all the orders: it only generates them once.
So, you could generate all the permutations and consider them as the nodes of a network. Then you could link nodes whenever they are connected by a single swap, to get a graph. This is called the permutohedron. Then you could do a search on that graph to find all the loop-free paths from a to b that you are interested in. This is certainly possible but it's not really optimal. Building the whole graph in advance is an unnecessary step since it can easily be generated on demand.
Here is some Python code that does just that: it runs a depth-first search over the permutohedron, generating the neighbours of a node only when they are needed. It doesn't use itertools, though.
a = (1, 2, 3, 4)
b = (4, 2, 3, 1)

def children(current):
    for i in range(len(a) - 1):
        yield (current[:i] + (current[i+1], current[i]) +
               current[i+2:])

def dfs(current, path, path_as_set):
    path.append(current)
    path_as_set.add(current)
    if current == b:
        yield path
    else:
        for next_perm in children(current):
            if next_perm in path_as_set:
                continue
            for path in dfs(next_perm, path, path_as_set):
                yield path
    path.pop()
    path_as_set.remove(current)

for path in dfs(a, [], set()):
    print(path)
If you are really interested in using itertools.permutations, then the object you are trying to study is actually:
itertools.permutations(itertools.permutations(a))
This generates all possible orderings of the full set of permutations. You could work through it, rejecting any ordering that doesn't start at a or that contains a step that is not a single swap, but that is a very bad approach: the list is astronomically long (for 4 elements there are 4! = 24 permutations and therefore 24! possible orderings).
It's not completely clear what you are asking. I think you are asking for a pythonic way to swap two elements in a list. In Python it is usual to separate data structures into immutable and mutable. In this case you could be talking about tuples or lists.
Let's assume you want to swap elements i and j, with j larger.
For immutable tuples the pythonic approach will be to generate a new tuple via slicing like this:
next = (current[:i] + current[j:j+1] + current[i+1:j]
        + current[i:i+1] + current[j+1:])
For mutable lists it would be pythonic to do much the same as C++, though it's prettier in Python:
list[i],list[j] = list[j],list[i]
Alternatively you could be asking about how to solve your permutation question, in which case the answer is that itertools does not really provide much help. I would advise depth first search.
I guess the following is a simpler way. I had nearly the same issue (I wanted the swapped numbers) in a list: append a copy of the list to itself (list = list + list) and then run:
from itertools import combinations_with_replacement
mylist = ['a', 'b']
list(set(combinations_with_replacement(mylist + mylist, r=2)))
results:
[('a', 'b'), ('b', 'a'), ('b', 'b'), ('a', 'a')]

Most efficient way to cycle over Python sublists while making them grow (insert method)?

My problem is about managing insert/append methods within loops.
I have two lists of length N: the first one (let's call it s) indicates the subset to which each element belongs, while the second one represents a quantity x that I want to evaluate. For the sake of simplicity, let's say that every subset contains T elements.
cont = 0
for i in range(NSUBSETS):
    for j in range(T):
        subcont = 0
        if x[(i*T)+j] < 100:
            s.insert(((i+1)*T)+cont, s[(i*T)+j+cont])
            x.insert(((i+1)*T)+cont, x[(i*T)+j+cont])
            subcont += 1
        cont += subcont
While cycling over all the elements of the two lists, I'd like that, when a certain condition is fulfilled (e.g. x[i] < 100), a copy of that element is put at the end of its subset, with the loop then carrying on until all the original members of the subset have been analyzed. It is important to maintain the "order", i.e. to insert each copy right after the last element of the subset it comes from.
I thought one way could be to store in two counter variables the number of copies made within the subset and globally, respectively (see the code): this way I could shift the index of the element I am looking at accordingly. I wonder whether there is some simpler way to do this, maybe using some Python magic.
If the idea is to interpolate your extra copies into the lists without making a complete copy of the whole list, you can try this with a generator. As you loop through your lists, yield each item as you process it, and then yield each of the matching items again at the end of the range.
This is a simplified example with only one list, but hopefully it illustrates the idea. You only get a copy if you do as I've done here and expand the generator with a comprehension. If you just wanted to stream or further analyze the processed list (e.g. to write it to disk), you would never need to have it all in memory at once.
def append_matches(input_list, start, end, predicate):
    # where predicate is a filter function or lambda
    for item in input_list[start:end]:
        yield item
    for item in filter(predicate, input_list[start:end]):
        yield item

example = lambda p: p < 100
data = [1, 2, 3, 101, 102, 103, 4, 5, 6, 104, 105, 106]
print [k for k in append_matches(data, 0, 6, example)]
print [k for k in append_matches(data, 5, 11, example)]

[1, 2, 3, 101, 102, 103, 1, 2, 3]
[103, 4, 5, 6, 104, 105, 4, 5, 6]
I'm guessing that your desire not to copy the lists comes from your C background, an assumption that copying would be more expensive. In Python, lists are not linked lists; they are more like C++ vectors, so each insert is O(n) because the elements after the insertion point have to be shifted.
Building a new copy with the extra elements would be more efficient than trying to update in place. If you really want to go that way, you would need to write a LinkedList class that holds prev/next references, so that your Python code really was a copy of the C approach.
The most Pythonic approach would not try to do an in-place update, as it is simpler to express what you want using values rather than references:
def expand(origLs):
    subsets = [origLs[i*T:(i+1)*T] for i in range(NSUBSETS)]
    result = []
    for s in subsets:
        copies = [e for e in s if e < 100]
        result += s + copies
    return result
The main thing to keep in mind is that the underlying cost model for an interpreted garbage-collected language is very different to C. Not all copy operations actually cause data movement, and there are no guarantees that trying to reuse the same memory will be successful or more efficient. The only real answer is to try both techniques on your real problem and profile the results.
I'd be inclined to make a copy of your lists and then, while looping across the originals, insert into the copy at the place you need whenever you come across an element that meets the criterion. You can then output the copied and updated lists.
I think I have found a simple solution.
I cycle over the subsets backwards, starting from the last one, and put the copies at the end of each subset. This way I avoid encountering the "new" elements and get rid of the counters and the like.
for i in range(NSUBSETS-1, -1, -1):
    for j in range(T-1, -1, -1):
        if x[(i*T)+j] < 100:
            s.insert(((i+1)*T), s[(i*T)+j])
            x.insert(((i+1)*T), x[(i*T)+j])
One possibility would be to use numpy's advanced indexing to provide the illusion of copying elements to the ends of the subsets, by building a list of "copy" indices into the original list and appending it to the index/slice list that represents each subset. You would then combine all the index/slice lists at the end and use the final index list to access all your items (I believe there is support for doing this generator-style too, which you may find useful, since advanced indexing/slicing returns a copy rather than a view). Depending on how many elements meet the criterion to be copied, this should be decently efficient, since each subset can keep its indices as a slice object, reducing the number of indices that need to be tracked.
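A rough sketch of that indexing idea for a single subset; the array contents and the names start, end and threshold are made up for illustration, and the subset indices are shown as an explicit arange rather than a slice object:

import numpy as np

x = np.array([1, 2, 3, 101, 102, 103])
start, end, threshold = 0, 6, 100

subset_idx = np.arange(start, end)                  # indices of the original subset
copy_idx = subset_idx[x[subset_idx] < threshold]    # indices of the elements to "copy"
final_idx = np.concatenate([subset_idx, copy_idx])  # subset followed by its copies

expanded = x[final_idx]  # advanced indexing returns a new array with the copies appended
print(expanded)          # [  1   2   3 101 102 103   1   2   3]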
