Getting the indexes of an element in a subset - python

I have a list and a subset of it, and I want to find the index of each element of the subset in the full list. I have currently tried this code:
def convert_toindex(listof_elements, listof_indices):
    for i in range(len(listof_elements)):
        listof_elements[:] = [listof_indices.index(x) for x in listof_elements]
    return listof_elements
list1 = ['lol', 'please', 'help']
list2 = ['help', 'lol', 'please', 'extra']
What I want to happen when I do convert_toindex(list1, list2) is for the output to be [1, 2, 0].
However, when I do this I get a ValueError: '0' is not in list.
0, however, appears nowhere in either list so I am not sure why this is happening.
Secondly, if I have a list of lists, and I want to do this process for all the nested lists inside the big list, would I do something like this?
for smalllist in biglist:
    smalllist[:] = [dict_of_indices[x] for x in smalllist]
Where dict_of_indices is the dictionary of indices created following the top answer.

The problem is that, instead of doing this once, you're doing it over and over, N times:
for i in range(len(listof_elements)):
    listof_elements[:] = [listof_indices.index(x) for x in listof_elements]
The first time through, you replace every value in listof_elements with its index in listof_indices. So far, so good. In fact, you should be done there.
But then you do it a second time. You look up each of those indices, as if they were values, in listof_indices. And some of them aren't there. So you get an error.
You can solve this just by removing the outer loop. You're already done after the first time.
You may be confused because this problem seems to inherently require two loops—but you already do have two loops. The first is the obvious one in the list comprehension, and the second one is the one hidden inside listof_indices.index.
While we're at it: while this problem does require two loops, it doesn't require them to be nested.
Instead of looping over listof_indices to find each x, you can loop over it in advance to build a dictionary:
dict_of_indices = {value: index for index, value in enumerate(listof_indices)}
And then just do a direct lookup in that dictionary:
listof_elements[:] = [dict_of_indices[x] for x in listof_elements]
Besides being a whole lot faster (O(N+M) time rather than O(N*M)), I think this might also be easier to understand and to debug. The first line may be a bit tricky, but you can easily print out the dict and verify that it's correct. And then the second line is about as trivial as you can get.
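For reference, here is a minimal runnable sketch of the dictionary approach using the lists from the question; the biglist at the end is made-up sample data to illustrate the nested-list follow-up:

list1 = ['lol', 'please', 'help']
list2 = ['help', 'lol', 'please', 'extra']

# Build the value -> index mapping once, in O(M) time.
dict_of_indices = {value: index for index, value in enumerate(list2)}

# Replace each element of list1 with its index in list2, in O(N) time.
list1[:] = [dict_of_indices[x] for x in list1]
print(list1)  # [1, 2, 0]

# The same lookup applied in place to every sublist of a list of lists.
biglist = [['help', 'lol'], ['please', 'extra']]
for smalllist in biglist:
    smalllist[:] = [dict_of_indices[x] for x in smalllist]
print(biglist)  # [[0, 1], [2, 3]]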

Related

Reduce a list in a specific way

I have a list of strings which looks like this:
['(num1, num2):1', '(num3, num4):1', '(num5, num6):1', '(num7, num8):1']
What I try to achieve is to reduce this list and combine every two elements and I want to do this until there is only one big string element left.
So the intermediate list would look like this:
['((num1, num2):1,(num3, num4):1)', '((num5, num6):1,(num7, num8):1)']
The complicated thing is (as you can see in the intermediate list) that each combined pair of strings needs to be wrapped in parentheses. So for the above mentioned starting point the final result should look like this:
(((num_1,num_2):1,(num_3,num_4):1),((num_5,num_6):1,(num_7,num_8):1))
Of course this should work in a generic way also for 8, 16 or more string elements in the starting list. Or to be more precise, it should work for list lengths of a_n = 2^(n+1).
Just to be very specific how the result should look with 8 elements:
'((((num_1,num_2):1,(num_3,num_4):1),((num_5,num_6):1,(num_7,num_8):1)),(((num_9,num_10):1,(num_11,num_12):1),((num_13,num_14):1,(num_15,num_16):1)))'
I already solved the problem using nested for loops but I thought there should be a more functional or short-cut solution.
I also found this solution on stackoverflow:
import itertools as it
l = [map( ",".join ,list(it.combinations(my_list, l))) for l in range(1,len(my_list)+1)]
Although the join isn't bad, I still need the parentheses. I tried to use:
"{},{}".format
instead of .join but this seems to be too easy to work :).
I also thought of using reduce but obviously this is not the right function. Maybe one could implement their own reduce function?
I hope some advanced Pythonistas can help me.
Sounds like a job for the zip clustering idiom: zip(*[iter(x)]*n), where you want to break iterable x into chunks of size n. This will discard "leftover" elements that don't make up a full chunk. For x=[1, 2, 3], n=2 this would yield (1, 2), discarding the 3.
def reducer(l):
    while len(l) > 1:
        l = ['({},{})'.format(x, y) for x, y in zip(*[iter(l)]*2)]
    return l

reducer(['(num1, num2):1', '(num3, num4):1', '(num5, num6):1', '(num7, num8):1'])
# ['(((num1, num2):1,(num3, num4):1),((num5, num6):1,(num7, num8):1))']
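If you are curious how the intermediate list from the question shows up, printing l at the top of the loop makes each pass visible (a small illustrative variation on the answer's code, not part of it):

def reducer_verbose(l):
    while len(l) > 1:
        print(l)  # show each intermediate stage
        l = ['({},{})'.format(x, y) for x, y in zip(*[iter(l)]*2)]
    return l

reducer_verbose(['(num1, num2):1', '(num3, num4):1', '(num5, num6):1', '(num7, num8):1'])
# ['(num1, num2):1', '(num3, num4):1', '(num5, num6):1', '(num7, num8):1']
# ['((num1, num2):1,(num3, num4):1)', '((num5, num6):1,(num7, num8):1)']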
This is an explanation of what is happening in zip(*[iter(l)]*2)
[iter(l)]*2 creates a list of length 2 containing the same iterable twice, or to be more precise, two references to the same iterator object.
zip(*...) does the extracting. On each pass it pulls one element from each of the two references, which are really the same iterator:
Pass 1: the first element via the first reference, the second element via the second reference.
Pass 2: the third element via the first reference, the fourth element via the second reference.
Pass 3: the fifth element via the first reference, the sixth element via the second reference.
and so on...
Therefore we have the extracted elements available in the for-loop and can use them as x and y for further processing.
This is really handy.
I also want to point to this thread since it helped me to understand the concept.
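As a quick standalone sketch of the clustering idiom itself, with made-up data (list() is needed in Python 3, where zip returns an iterator):

x = [1, 2, 3, 4, 5, 6, 7]
pairs = list(zip(*[iter(x)]*2))  # two references to one iterator, consumed in lockstep
print(pairs)  # [(1, 2), (3, 4), (5, 6)] -- the leftover 7 is discarded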

Altering a list using append during a list comprehension

Caveat: this is a straight up question for code-golfing, so I know what I'm asking is bad practice in production.
I'm trying to alter an array during a list comprehension, but for some reason it is hanging and I don't know why or how to fix this.
I'm dealing with a list of lists of indeterminate depth and need to condense them to a flat list - for those curious, it's this question. But at this stage, let's just say I need a flat list of all the elements in the list, and 0 if it is a list.
The normal method is to iterate through the list and, if it's a list, add it to the end, like so:
for o in x:
    if type(o) == type([]): x += o
    else: i += o
print i
I'm trying to shorten this using list comprehension, like so.
print sum([
    [o, x.append(o) or 0][type(o) == type([])]
    for o in x
])
Now, I know list.append returns None, so to ensure that I get a numeric value, lazy evaluation says I can do x.append(o) or 0, and since None is "falsy" it will evaluate the second part and the value is 0.
But it doesn't. If I put x.append() into the list comprehension over x, it doesn't break or error, or return an iteration error, it just freezes. Why does append freeze during the list comprehension, but the for loop above works fine?
edit: To keep this question from being deleted, I'm not looking for golfing tips (they are very educational though), I was looking for an answer as to why the code wasn't working as I had written it.
or may be lazy, but list definitions aren't. For each o in x, when the [o,x.append(o) or 0][type(o)==type([])] monstrosity is evaluated, Python has to evaluate [o, x.append(o) or 0], which means evaluating x.append(o) or 0, which means that o will be appended to x regardless of whether it's a list. Thus, you end up with every element of x appended to x, and then those get appended again and again and again until you run out of memory.
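Here is a bounded sketch of that runaway growth; the data is made up and the cap is added only so the example terminates:

x = [1, 2, 3]
iterations = 0
for o in x:
    iterations += 1
    x.append(o)  # the list grows as fast as the loop consumes it
    if iterations >= 10:  # without this cap the loop would never reach the end of x
        break
print(len(x))  # 13 -- x has grown by one element per iteration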
What about:
y = [element for element in x if type(element) != list or x.extend(element)]
(note that extend will flatten, while append will only add the nested list back to the end, unflattened).
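For illustration, a quick check of that one-liner on a made-up nested list (not data from the question):

x = [1, [2, 3], 4, [5, [6, 7]]]
y = [element for element in x if type(element) != list or x.extend(element)]
print(y)  # [1, 4, 2, 3, 5, 6, 7] -- the nested lists get flattened as x grows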

Most efficient way to cycle over Python sublists while making them grow (insert method)?

My problem is about managing insert/append methods within loops.
I have two lists of length N: the first one (let's call it s) indicates the subset to which each element belongs, while the second one represents a quantity x that I want to evaluate. For the sake of simplicity, let's say that every subset has T elements.
cont = 0;
for i in range(NSUBSETS):
    for j in range(T):
        subcont = 0;
        if (x[(i*T)+j] < 100):
            s.insert(((i+1)*T)+cont, s[(i*T)+j+cont]);
            x.insert(((i+1)*T)+cont, x[(i*T)+j+cont]);
            subcont += 1;
        cont += subcont;
While cycling over all the elements of the two lists, I'd like that, when a certain condition is fulfilled (e.g. x[i] < 100), a copy of that element is put at the end of its subset, and the loop then goes on until all the original members of the subset have been analyzed. It would be important to maintain the "order", i.e. to insert each copy right after the last element of the subset it comes from.
I thought a way could have been to store within 2 counter variables the number of copies made within the subset and globally, respectively (see code): this way, I could shift the index of the element I was looking at according to that. I wonder whether there exists some simpler way to do that, maybe using some Python magic.
If the idea is to interpolate your extra copies into the lists without making a complete copy of the whole list, you can try this with a generator expression. As you loop through your lists, collect the matches you want to append. Yield each item as you process it, then yield each collected item too.
This is a simplified example with only one list, but hopefully it illustrates the idea. You would only get a copy if you do as I've done and expand the generator with a comprehension. If you just wanted to store or further analyze the processed list (e.g., to write it to disk), you would never need to hold it all in memory at once.
def append_matches(input_list, start, end, predicate):
    # where predicate is a filter function or lambda
    for item in input_list[start:end]:
        yield item
    for item in filter(predicate, input_list[start:end]):
        yield item

example = lambda p: p < 100
data = [1, 2, 3, 101, 102, 103, 4, 5, 6, 104, 105, 106]
print [k for k in append_matches(data, 0, 6, example)]
print [k for k in append_matches(data, 5, 11, example)]
[1, 2, 3, 101, 102, 103, 1, 2, 3]
[103, 4, 5, 6, 104, 105, 4, 5, 6]
I'm guessing that your desire not to copy the lists is based on your C background - an assumption that it would be more expensive that way. In Python, lists are not actually linked lists; they are more like vectors, so insert is an O(n) operation and each of those inserts effectively shifts (copies) part of the list.
Building a new copy with the extra elements would be more efficient than trying to update in-place. If you really want to go that way you would need to write a LinkedList class that held prev/next references so that your Python code really was a copy of the C approach.
The most Pythonic approach would not try to do an in-place update, as it is simpler to express what you want using values rather than references:
def expand(origLs):
    subsets = [origLs[i*T:(i+1)*T] for i in range(NSUBSETS)]
    result = []
    for s in subsets:
        copies = [e for e in s if e < 100]
        result += s + copies
    return result
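A quick usage sketch with made-up values (NSUBSETS and T are assumed to be globals, matching the question's setup):

NSUBSETS, T = 2, 3
x = [10, 200, 30, 300, 40, 50]
print(expand(x))  # [10, 200, 30, 10, 30, 300, 40, 50, 40, 50]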
The main thing to keep in mind is that the underlying cost model for an interpreted garbage-collected language is very different to C. Not all copy operations actually cause data movement, and there are no guarantees that trying to reuse the same memory will be successful or more efficient. The only real answer is to try both techniques on your real problem and profile the results.
I'd be inclined to make a copy of your lists and then, while looping across the originals, insert into the copy at the place each new element needs to go whenever you come across a criterion for insertion. You can then output the copied and updated lists.
I think I have found a simple solution.
I cycle from the last subset backwards, putting the copies at the end of each subset. This way, I avoid encountering the "new" elements and get rid of the counters and the like.
for i in range(NSUBSETS-1, -1, -1):
    for j in range(T-1, -1, -1):
        if (x[(i*T)+j] < 100):
            s.insert(((i+1)*T), s[(i*T)+j])
            x.insert(((i+1)*T), x[(i*T)+j])
One possibility would be using numpy's advanced indexing to provide the illusion of copying elements to the ends of the subsets by building a list of "copy" indices for the original list, and adding that to an index/slice list that represents each subset. Then you'd combine all the index/slice lists at the end, and use the final index list to access all your items (I believe there's support for doing so generator-style, too, which you may find useful as advanced indexing/slicing returns a copy rather than a view). Depending on how many elements meet the criteria to be copied, this should be decently efficient as each subset will have its indices as a slice object, reducing the number of indices needed to keep track of.
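A rough sketch of that numpy idea, with made-up data and illustrative names; the per-subset index arrays play the role of the index/slice lists described above:

import numpy as np

NSUBSETS, T = 2, 3
x = np.array([10, 200, 30, 300, 40, 50])

index_chunks = []
for i in range(NSUBSETS):
    original = np.arange(i*T, (i+1)*T)    # indices of this subset
    copies = original[x[original] < 100]  # indices of elements to "copy" to the subset's end
    index_chunks.append(np.concatenate([original, copies]))

final_index = np.concatenate(index_chunks)
print(x[final_index])  # [ 10 200  30  10  30 300  40  50  40  50]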

Python: Listing the duplicates in a list

I am fairly new to Python and I am interested in listing duplicates within a list. I know how to remove the duplicates ( set() ) within a list and how to list the duplicates within a list by using collections.Counter; however, for the project that I am working on this wouldn't be the most efficient method to use since the run time would be n(n-1)/2 --> O(n^2) and n is anywhere from 5k-50k+ string values.
So, my idea is that, since Python lists are linked data structures and are assigned to memory when created, I could begin counting duplicates from the very beginning of the creation of the list.
List is created and the first index value is the word 'dog'
Second index value is the word 'cat'
Now, it would check if the second index is equal to the first index, if it is then append to another list called Duplicates.
Third index value is assigned 'dog', and the third index would check if it is equal to 'cat' then 'dog'; since it matches the first index, it is appended to Duplicates.
Fourth index is assigned 'dog', but it would check the third index only, and not the second and first, because now you can assume that since the third and second are not duplicates that the fourth does not need to check before, and since the third/first are equal, the search stops at the third index.
My project gives me these values and appends them to a list, so I would want to implement the above algorithm because I don't care how many duplicates there are, I just want to know if there are duplicates.
I can't think of how to write the code, but I have figured out the basic structure of it, though I might be completely off (using a random number generator for ease of use):
for x in xrange(0, 10):
    list1.append(x)
for rev, y in enumerate(reversed(list1)):
    while x is not list1(y):
        cond()
        if ???
I really don't think you'll get better than a collections.Counter for this:
from collections import Counter

c = Counter(mylist)
duplicates = [x for x, y in c.items() if y > 1]
Building the Counter should be O(n) (unless you're using keys which are particularly bad for hashing -- but in my experience you need to try pretty hard to make that happen), and then getting the duplicates list is also O(n), giving you a total complexity of O(2n) == O(n) (for typical uses).
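A runnable sketch on a small made-up list (in recent Python versions the Counter preserves insertion order, so the exact ordering of the duplicates may differ elsewhere):

from collections import Counter

mylist = ['dog', 'cat', 'dog', 'bird', 'cat']
c = Counter(mylist)
duplicates = [x for x, y in c.items() if y > 1]
print(duplicates)        # ['dog', 'cat']
print(bool(duplicates))  # True -- enough if you only need to know whether duplicates exist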

python function slowing down for no apparent reason

I have a python function defined as follows which I use to delete from list1 the items which are already in list2. I am using Python 2.6.2 on Windows XP.
def compareLists(list1, list2):
    curIndex = 0
    while curIndex < len(list1):
        if list1[curIndex] in list2:
            list1.pop(curIndex)
        else:
            curIndex += 1
Here, list1 and list2 are a list of lists
list1 = [ ['a', 11221, '2232'], ['b', 1321, '22342'] .. ]
# list2 has a similar format.
I tried this function with list1 with 38,000 elements and list2 with 150,000 elements. If I put in a print statement to print the current iteration, I find that the function slows down with each iteration. At first, it processes around 1000 or more items a second, and then after a while it drops to around 20-50 a second. Why could that be happening?
EDIT: In the case with my data, the curIndex remains 0 or very close to 0 so the pop operation on list1 is almost always on the first item.
If possible, can someone also suggest a better way of doing the same thing in a different way?
Try a more pythonic approach to the filtering, something like
[x for x in list1 if x not in set(list2)]
Converting both lists to sets is unnecessary, and will be very slow and memory hungry on large amounts of data.
Since your data is a list of lists, you need to do something in order to hash it.
Try out
list2_set = set([tuple(x) for x in list2])
diff = [x for x in list1 if tuple(x) not in list2_set]
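For illustration, the same approach on tiny made-up lists of lists (not the question's data):

list1 = [['a', 1], ['b', 2], ['c', 3]]
list2 = [['b', 2], ['d', 4]]

list2_set = set([tuple(x) for x in list2])
diff = [x for x in list1 if tuple(x) not in list2_set]
print(diff)  # [['a', 1], ['c', 3]]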
I tested out your original function, and my approach, using the following test data:
list1 = [[x+1, x*2] for x in range(38000)]
list2 = [[x+1, x*2] for x in range(10000, 160000)]
Timings - not scientific, but still:
#Original function
real 2m16.780s
user 2m16.744s
sys 0m0.017s
#My function
real 0m0.433s
user 0m0.423s
sys 0m0.007s
There are 2 issues that cause your algorithm to scale poorly:
x in list is an O(n) operation.
pop(n) where n is in the middle of the array is an O(n) operation.
Both situations cause it to scale poorly O(n^2) for large amounts of data. gnud's implementation would probably be the best solution since it solves both problems without changing the order of elements or removing potential duplicates.
If we rule the data structure itself out, look at your memory usage next. If you end up asking the OS to swap in for you (i.e., the list takes up more memory than you have), Python's going to sit in iowait waiting on the OS to get the pages from disk, which makes sense given your description.
Is Python sitting in a jacuzzi of iowait when this slowdown happens? Anything else going on in the environment?
(If you're not sure, update with your platform and one of us will tell you how to tell.)
The only reason why the code can become slower is that you have big elements in both lists which share a lot of common elements (so the list1[curIndex] in list2 takes more time).
Here are a couple of ways to fix this:
If you don't care about the order, convert both lists into sets and use set1.difference(set2)
If the order in list1 is important, then at least convert list2 into a set because in is much faster with a set.
Lastly, try a filter: filter(lambda x: x not in set2, list1)
[EDIT] Since set() doesn't work on recursive lists (didn't expect that), try:
result = filter(lambda x: x not in list2, list1)
It should still be much faster than your version. If it isn't, then your last option is to make sure that there can't be duplicate elements in either list. That would allow you to remove items from both lists (and therefore making the compare ever cheaper as you find elements from list2).
EDIT: I've updated my answer to account for lists being unhashable, as well as some other feedback. This one is even tested.
It probably relates to the cost of popping an item out of the middle of a list.
Alternatively have you tried using sets to handle this?
def difference(list1, list2):
    list2_set = set(tuple(y) for y in list2)
    return [x for x in list1 if tuple(x) not in list2_set]
You can then set list1 to the resulting list if that is your intention by doing
list1 = difference(list1, list2)
The often-suggested set won't work here, because the two lists contain lists, which are unhashable. You need to change your data structure first.
You can either convert the sublists into tuples or class instances to make them hashable and then use sets, or keep both lists sorted, in which case you only need to compare the lists' heads.
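As a rough sketch of the sorted-lists idea (illustrative names and made-up data; note that it sorts list1, so the original order is not preserved):

def sorted_difference(list1, list2_sorted):
    # Walk both sorted sequences in step, keeping items of list1 absent from list2_sorted.
    result = []
    j = 0
    for item in sorted(list1):
        while j < len(list2_sorted) and list2_sorted[j] < item:
            j += 1
        if j >= len(list2_sorted) or list2_sorted[j] != item:
            result.append(item)
    return result

print(sorted_difference([[5, 6], [1, 2], [3, 4]], sorted([[3, 4], [9, 9]])))
# [[1, 2], [5, 6]]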
