Fast way of getting all sublists of a list - python

I recently bombed a coding interview because I wasn't able to generate all possible sublists of a list fast enough. More specifically: (using python)
We're given a list of string numbers ["1", "3", "2", ...]
How many size-6 sublists of this list, when concatenated, are divisible by 16?
Note that though the elements in the original list may not be unique, you should treat them as unique when constructing your sublists. E.g., for [1, 1, 1], the sublist made of the first two 1's and the sublist made of the last two 1's are different sublists.
Using itertools.combinations I was able to generate all my sublists fast enough, but then looping through all those sublists to determine which ones were divisible by 16 was too slow.
So is there a way to create the sublists at the same speed as (or faster than) itertools.combinations, checking each sublist as I create it to see whether it's divisible by 16?
Any insight would be very appreciated!

Sort the list.
Find the smallest sublist (by length) whose sum is at least 16 and divisible by it; call that length s.
Then check all sublists of sizes from s up to 6.
That should reduce the search space substantially, since the longer the sublist, the fewer sublists there are of that length.
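Another angle on the original question: since only the concatenation's remainder mod 16 matters, you can count the qualifying sublists with a memoized recursion over positions instead of materializing every combination. A minimal sketch, assuming the elements are digit strings as in the question (count_divisible and its parameters are illustrative names, not from the posts):

from functools import lru_cache

def count_divisible(nums, size=6, mod=16):
    vals = [int(s) % mod for s in nums]            # each element's value mod 16
    shifts = [pow(10, len(s), mod) for s in nums]  # 10**len(element) mod 16
    n = len(nums)

    @lru_cache(maxsize=None)
    def count(i, picked, residue):
        if picked == size:
            return 1 if residue == 0 else 0
        if i == n:
            return 0
        # Either skip nums[i], or append it to the running concatenation:
        # appending s turns remainder r into (r * 10**len(s) + int(s)) % mod.
        skip = count(i + 1, picked, residue)
        take = count(i + 1, picked + 1, (residue * shifts[i] + vals[i]) % mod)
        return skip + take

    return count(0, 0, 0)

Because positions (not values) drive the recursion, duplicate elements are treated as distinct, matching the question's requirement, and the state space is only n * 6 * 16, so this stays fast even where the number of combinations explodes.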

Is there a way to solve the element uniqueness problem in O(n)?

Is there any way to approach such a problem?
Do you mean this?
def check_elements(arr):
    # Unique iff the set of elements is as large as the list itself.
    return len(arr) == len(set(arr))
UPD
I think I get the point. We're given a list of constant length (say 50), and we need to add conditions to the problem such that solving it takes O(n) time. And I suppose not O(n) dummy operations, but a reasonable O(n).
Well... the only place I see where we can get O(n) is the elements themselves. Say we have something like this:
[
1.1111111111111111..<O(n) digits>..1,
1.1111111111111111..<O(n) digits>..2,
1.1111111111111111..<O(n) digits>..3,
1.1111111111111111..<O(n) digits>..1,
]
Basically, we can treat the elements as very long strings. And to check whether a constant number of n-character strings are unique, we have to at least read them all, which takes at least O(n) time.
You can just use a counting sort; in your case it will be O(n). Create an array from 0 to N (N is your maximum value), and then for each value v in the original array, add one to the v-th entry of the counting array. This takes O(n) (just one pass over the original array), and then just search the counting array for an entry greater than 1...
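A minimal sketch of that counting idea (has_duplicates and max_value are illustrative names; it assumes the elements are integers in the range 0..max_value):

def has_duplicates(arr, max_value):
    counts = [0] * (max_value + 1)  # one bucket per possible value
    for v in arr:
        counts[v] += 1
        if counts[v] > 1:  # a second occurrence means a duplicate
            return True
    return False

This makes one pass over the input and uses O(N) extra space for the buckets.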

Generate permutations of list by swapping specific elements

I'm trying to write a function that generates all possible configurations of a list by swapping certain allowable pairs of elements.
For example, if we have the list:
lst = [1, 2, 3, 4, 5]
And we only allow the swapping of the following pairs of elements:
pairs = [[0, 2], [4, 1]]
i.e., we can only swap the 0th element of the list with the 2nd, and the 4th element with the 1st (there can be any number of allowed pairs of swaps).
I would like the function to return the number of distinct configurations of the list given the allowable swaps.
Since I'm planning on running this for large lists and many allowable swaps, it would be preferable for the function to be as efficient as possible.
I've found examples that generate permutations by swapping all the elements, two at a time, but I can't find a way to specify certain pairs of allowable swaps.
You've been lured off other productive paths by the common term "swap". Switch your attack. Instead, note that you need the product of [a[0], a[2]] and [a[1], a[4]] to get all the possible permutations. You take each of these products (four of them) and distribute the elements in your result sets in the proper sequence. It will look vaguely like this ... I'm using Python as pseudo-code, to some extent.
import itertools

seq = itertools.product(itertools.permutations([a[0], a[2]]),
                        itertools.permutations([a[1], a[4]]))
for (b0, b2), (b1, b4) in seq:
    # Each solution is four elements to be distributed.
    # Construct a permutation "b" by putting each in its proper place:
    # the first pair goes to b[0] and b[2], the second to b[1] and b[4].
    b = [b0, b1, b2, a[3], b4]
Can you take it from there? That's the idea; I'll leave you to generalize the algorithm.
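One way that generalization might look, as a hedged sketch (count_configurations is an illustrative name): try every on/off choice for each allowed swap and collect the distinct results. This assumes the pairs are disjoint, so the choices are independent.

import itertools

def count_configurations(lst, pairs):
    configs = set()
    for choices in itertools.product([False, True], repeat=len(pairs)):
        b = list(lst)
        for (i, j), do_swap in zip(pairs, choices):
            if do_swap:
                b[i], b[j] = b[j], b[i]  # apply this pair's swap
        configs.add(tuple(b))
    return len(configs)

print(count_configurations([1, 2, 3, 4, 5], [(0, 2), (4, 1)]))  # 4

If pairs can share positions, the swaps generate a permutation group and the independent-choice assumption breaks; you'd then enumerate the reachable configurations with a BFS over states instead.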

How can a list's lists be modified efficiently to have equal length to the list's longest list?

I have a 2-D list of shape (300,000, X), where each of the sublists has a different size. In order to convert the data to a Tensor, all of the sublists need to have equal length, but I don't want to lose any data from my sublists in the conversion.
That means that I need to fill all sublists smaller than the longest sublist with filler (-1) in order to create a rectangular array. For my current dataset, the longest sublist is of length 5037.
My conversion code is below:
for seq in new_format:
    for i in range(0, length - len(seq)):  # length = longest sublist size (5037 here)
        seq.append(-1)
However, when there are 300,000 sequences in new_format, and length-len(seq) is generally >4000, the process is extraordinarily slow. How can I speed this process up or get around the issue efficiently?
Individual append calls can be rather slow, so use list multiplication to create the whole filler value at once, then concatenate it all at once, e.g.:
for seq in new_format:
    seq += [-1] * (length - len(seq))
seq.extend([-1] * (length-len(seq))) would be equivalent (trivially slower due to generalized method call approach, but likely unnoticeable given size of real work).
In theory, seq.extend(itertools.repeat(-1, length-len(seq))) would avoid the potentially large temporaries, but IIRC, the actual CPython implementation of list.__iadd__/list.extend forces the creation of a temporary list anyway (to handle the case where the generator is defined in terms of the list being extended), so it wouldn't actually avoid the temporary.
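Since the end goal is a Tensor anyway, a further option (an alternative sketch, not from the answer above; it assumes the sublists hold integers) is to allocate the rectangular array up front in NumPy and copy each row over the filler:

import numpy as np

padded = np.full((len(new_format), length), -1, dtype=np.int64)  # all filler
for row, seq in enumerate(new_format):
    padded[row, :len(seq)] = seq  # overwrite the prefix with the real data

This avoids growing 300,000 Python lists entirely and leaves you with an array that's already rectangular.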

Choosing python data structures to speed up algorithm implementation

So I'm given a large collection (roughly 200k) of lists. Each contains a subset of the numbers 0 through 27. I want to return two of the lists where the product of their lengths is greater than the product of the lengths of any other pair of lists. There's another condition, namely that the lists have no numbers in common.
There's an algorithm I found for this (can't remember the source, apologies for non-specificity of props) which exploits the fact that there are fewer total subsets of the numbers 0 through 27 than there are words in the dictionary.
The first thing I've done is looped through all the lists, found the unique subset of integers that comprise it and indexed it as a number between 0 and 1<<28. As follows:
def index_lists(lists):
    index_hash = {}
    for raw_list in lists:
        length = len(raw_list)
        index = find_index(raw_list)  # bitset index of the list's elements
        if length > index_hash.get(index, {}).get("length", 0):
            index_hash[index] = {"list": raw_list, "length": length}
    return index_hash
This gives me, for each subset that's actually present, the longest list and that list's length. Naturally, not all subsets from 0 to (1<<28)-1 are necessarily included, since there's no guarantee the supplied collection has a list for each unique subset.
What I then want, for each subset 0 through 1<<28 (all of them this time), is the longest list that contains at most that subset. This is the part that is killing me. At a high level it should, for each subset, first check to see if that subset is contained in index_hash. It should then compare the length of that entry in the hash (if it exists there) to the lengths stored previously in the new hash for the current subset minus one number (an inner loop over all 28 bits). The greatest of these is stored in the new hash for the current subset of the outer loop. The code right now looks like this:
def at_most_hash(index_hash):
    most_hash = {}
    for i in xrange(1 << 28):  # pretty sure this is a bad idea
        max_entry = index_hash.get(i)
        if max_entry:
            max_length = max_entry["length"]
            max_list = max_entry["list"]
        else:
            max_length = 0
            max_list = []
        for j in xrange(28):  # again, probably not great
            subset_index = i & ~(1 << j)  # drops bit j: a pre-computed subset
            at_most_entry = most_hash.get(subset_index, {})
            at_most_length = at_most_entry.get("length", 0)
            if at_most_length > max_length:
                max_length = at_most_length
                max_list = at_most_entry["list"]
        most_hash[i] = {"length": max_length, "list": max_list}
    return most_hash
This loop obviously takes several forevers to complete. I feel that I'm new enough to python that my choice of how to iterate and what data structures to use may have been completely disastrous. Not to mention the prospective memory problems from attempting to fill the dictionary. Is there perhaps a better structure or package to use as data structures? Or a better way to set up the iteration? Or maybe I can do this more sparsely?
The next part of the algorithm just cycles through all the lists we were given and takes the product of the subset's max_length and complementary subset's max length by looking them up in at_most_hash, taking the max of those.
Any suggestions here? I appreciate the patience for wading through my long-winded question and less than decent attempt at coding this up.
In theory, this is still a better approach than working with the collection of lists alone, since that approach is roughly O(200k^2) and this one is roughly O(28 * 2^28 + 200k), yet my implementation is holding me back.
Given that your indexes are just ints, you could save some time and space by using lists instead of dicts. I'd go further and bring in NumPy arrays. They offer compact storage representation and efficient operations that let you implicitly perform repetitive work in C, bypassing a ton of interpreter overhead.
Instead of index_hash, we start by building a NumPy array where index_array[i] is the length of the longest list whose set of elements is represented by i, or 0 if there is no such list:
import numpy
index_array = numpy.zeros(1 << 28, dtype=int)  # We could probably get away with dtype=int8.
for raw_list in lists:
    i = find_index(raw_list)
    index_array[i] = max(index_array[i], len(raw_list))
We then use NumPy operations to bubble up the lengths in C instead of interpreted Python. Things might get confusing from here:
for bit_index in xrange(28):
    index_array = index_array.reshape([1 << (28 - bit_index), 1 << bit_index])
    numpy.maximum(index_array[::2], index_array[1::2], out=index_array[1::2])
index_array = index_array.reshape([1 << 28])
Each reshape call takes a new view of the array where data in even-numbered rows corresponds to sets with the bit at bit_index clear, and data in odd-numbered rows corresponds to sets with the bit at bit_index set. The numpy.maximum call then performs the bubble-up operation for that bit. At the end, each cell index_array[i] of index_array represents the length of the longest list whose elements are a subset of set i.
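To see the mechanics at toy scale, here is the same bubble-up run on a 2-bit universe of 4 subsets (the lengths are made up for illustration):

import numpy

# lengths indexed by subset: {} -> 0, {0} -> 3, {1} -> 2, {0,1} -> 1
demo = numpy.array([0, 3, 2, 1])
for bit_index in range(2):
    demo = demo.reshape([1 << (2 - bit_index), 1 << bit_index])
    numpy.maximum(demo[::2], demo[1::2], out=demo[1::2])
demo = demo.reshape([4])
print(demo)  # [0 3 2 3]: each cell now holds the max over all its subsets

Index 3 ({0,1}) ends up with 3, the maximum over {}, {0}, {1}, and {0,1}, which is exactly the "longest list whose elements are a subset of i" property.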
We then compute the products of lengths at complementary indices:
products = index_array * index_array[::-1] # We'd probably have to adjust this part
# if we picked dtype=int8 earlier.
find where the best product is:
best_product_index = products.argmax()
and the longest lists whose elements are subsets of the set represented by best_product_index and its complement are the lists we want.
This is a bit too long for a comment so I will post it as an answer. One more direct way to index your subsets as integers is to use "bitsets" with each bit in the binary representation corresponding to one of the numbers.
For example, the set {0,2,3} would be represented by 2^0 + 2^2 + 2^3 = 13, and {4,5} would be represented by 2^4 + 2^5 = 48.
This would allow you to use simple lists instead of dictionaries and Python's generic hashing function.
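The code in the question and the NumPy answer both call a find_index helper that is never shown; here is a minimal sketch of what the bitset encoding implies it does (an assumption consistent with the examples above):

def find_index(raw_list):
    index = 0
    for k in set(raw_list):  # elements are numbers 0 through 27
        index |= 1 << k      # set bit k for each number present
    return index

print(find_index([0, 2, 3]))  # 13 = 2^0 + 2^2 + 2^3
print(find_index([4, 5]))     # 48 = 2^4 + 2^5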

Split a list of n floating point numbers into roughly equal lists

I have a sorted list of n floating point numbers, and I want to split this list into m lists in such a way that the sum of the numbers in each list is roughly equal. Any thoughts?
I don't care how many numbers each list gets, but the sums should be fairly equal. One thing I have tried is to take subsections of the list equally (based on m) from the start and from the end to form a list, then shrink the remaining list and repeat the process.
I've been fairly successful with that, but the last list ends up either much heavier or much lighter than the others. Is there a better solution?
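For comparison, a common heuristic for this kind of balanced-sum partitioning (not from the post; split_balanced is an illustrative name) is to walk the values from largest to smallest and drop each into the bucket whose running sum is currently smallest:

import heapq

def split_balanced(sorted_nums, m):
    heap = [(0.0, i) for i in range(m)]  # (bucket sum, bucket index)
    heapq.heapify(heap)
    buckets = [[] for _ in range(m)]
    for x in reversed(sorted_nums):      # largest values first
        total, i = heapq.heappop(heap)
        buckets[i].append(x)
        heapq.heappush(heap, (total + x, i))
    return buckets

Handing out the big numbers first leaves the small ones to smooth out the differences at the end, which tends to avoid the one-heavy-list problem described above.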
