Indexing a list with a unique index - python

I have a list, say l = [10,10,20,15,10,20]. I want to assign each unique value a certain "index" to get [1,1,2,3,1,2].
This is my code:
a = list(set(l))
res = [a.index(x) for x in l]
This turns out to be very slow.
l has 1M elements, and 100K unique elements. I have also tried map with lambda and sorting, which did not help. What is the ideal way to do this?

You can do this in O(N) time using a defaultdict and a list comprehension:
>>> from itertools import count
>>> from collections import defaultdict
>>> lst = [10, 10, 20, 15, 10, 20]
>>> d = defaultdict(count(1).next)
>>> [d[k] for k in lst]
[1, 1, 2, 3, 1, 2]
In Python 3 use __next__ instead of next.
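For reference, a minimal Python 3 version of the same trick (a sketch using the question's sample list):
from itertools import count
from collections import defaultdict

lst = [10, 10, 20, 15, 10, 20]
d = defaultdict(count(1).__next__)  # the factory fires only for missing keys
print([d[k] for k in lst])  # [1, 1, 2, 3, 1, 2]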
If you're wondering how it works:
The default_factory (i.e. count(1).next in this case) passed to defaultdict is called only when Python encounters a missing key. So for the first 10 the value is going to be 1; the next 10 is no longer a missing key, hence the previously calculated 1 is used. Then 20 is again a missing key, so Python calls the default_factory again to get its value, and so on.
d at the end will look like this:
>>> d
defaultdict(<method-wrapper 'next' of itertools.count object at 0x1057c83b0>,
{10: 1, 20: 2, 15: 3})

The slowness of your code arises because a.index(x) performs a linear search and you perform that linear search for each of the elements in l. So for each of the 1M items you perform (up to) 100K comparisons.
The fastest way to transform one value to another is looking it up in a map. You'll need to create the map and fill in the relationship between the original values and the values you want. Then retrieve the value from the map when you encounter another of the same value in your list.
Here is an example that makes a single pass through l. There may be room for further optimization to eliminate the need to repeatedly reallocate res when appending to it.
res = []
conversion = {}
i = 1  # start at 1 so the indexes match the desired output [1, 1, 2, 3, 1, 2]
for x in l:
    if x not in conversion:
        value = conversion[x] = i
        i += 1
    else:
        value = conversion[x]
    res.append(value)
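If the repeated appends do matter, one possible variant (a sketch, not benchmarked) fills the map in a first pass and then builds res with a single list comprehension:
conversion = {}
for x in l:
    if x not in conversion:
        # len(conversion) + 1 yields 1, 2, 3, ... in first-seen order
        conversion[x] = len(conversion) + 1
res = [conversion[x] for x in l]  # [1, 1, 2, 3, 1, 2] for the sample l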

Well, I guess it depends on whether you want the indexes returned in that specific order or not. If you want the example to return:
[1,1,2,3,1,2]
then you can look at the other answers submitted. However, if you only care about getting a unique index for each unique number, then I have a fast solution for you:
import numpy as np
l = [10,10,20,15,10,20]
a = np.array(l)
x,y = np.unique(a,return_inverse = True)
and for this example the output of y is:
y = [0,0,2,1,0,2]
I tested this for 1,000,000 entries and it was done essentially immediately.
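If you do need the first-occurrence ordering asked for in the question, np.unique can still get you there via return_index; a sketch (the double argsort ranks each unique value by where it first appears):
import numpy as np

l = [10, 10, 20, 15, 10, 20]
uniq, first_pos, inv = np.unique(np.array(l), return_index=True, return_inverse=True)
order = np.argsort(np.argsort(first_pos))  # rank of each unique value by first appearance
res = order[inv] + 1
print(res.tolist())  # [1, 1, 2, 3, 1, 2]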

Your solution is slow because its complexity is O(nm) with m being the number of unique elements in l: a.index() is O(m) and you call it for every element in l.
To make it O(n), get rid of index() and store indexes in a dictionary:
>>> idx, indexes = 1, {}
>>> for x in l:
...     if x not in indexes:
...         indexes[x] = idx
...         idx += 1
...
>>> [indexes[x] for x in l]
[1, 1, 2, 3, 1, 2]
If l contains only integers in a known range, you could also store indexes in a list instead of a dictionary for faster lookups.
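For example, a sketch of that idea, assuming the values are non-negative integers with a known upper bound (max_val here is illustrative):
max_val = 20  # assumed known bound on the values in l
indexes = [0] * (max_val + 1)  # 0 means "not seen yet"
idx = 1
for x in l:
    if indexes[x] == 0:
        indexes[x] = idx
        idx += 1
res = [indexes[x] for x in l]  # [1, 1, 2, 3, 1, 2] for the sample l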

You can use collections.OrderedDict() to preserve the unique items in order, loop over the enumerate of these ordered unique items to build a dict of items and their indices (based on order), then pass that dictionary along with the main list to operator.itemgetter() to get the corresponding index for each item:
>>> from collections import OrderedDict
>>> from operator import itemgetter
>>> itemgetter(*lst)({j:i for i,j in enumerate(OrderedDict.fromkeys(lst),1)})
(1, 1, 2, 3, 1, 2)

For completeness, you can also do it eagerly:
from itertools import count
wordid = dict(zip(set(list_), count(1)))
This uses a set to obtain the unique words in list_, pairs
each of those unique words with the next value from count() (which
counts upwards), and constructs a dictionary from the results.
Original answer, written by nneonneo.
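A quick usage sketch; since set iteration order is arbitrary, the indexes are unique but need not follow first-occurrence order:
from itertools import count

list_ = [10, 10, 20, 15, 10, 20]
wordid = dict(zip(set(list_), count(1)))
res = [wordid[x] for x in list_]  # unique ids, but not necessarily [1, 1, 2, 3, 1, 2]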

Related

Finding first time value occurs in an array when you don't know what it is

I have a very long array (over 2 million values) with repeating value. It looks something like this:
array = [1,1,1,1,......,2,2,2.....3,3,3.....]
With a bunch of different values. I want to create individual arrays for each group of points, i.e. an array for the ones, an array for the twos, and so forth. So something that would look like:
array1 = [1,1,1,1...]
array2 = [2,2,2,2.....]
array3 = [3,3,3,3....]
...
The values don't all occur an equal number of times, however, and I don't know how many times each value occurs. Any advice?
Assuming that repeated values are grouped together (otherwise you simply need to sort the list), you can create a nested list (rather than a new list for every different value) using itertools.groupby:
from itertools import groupby
array = [1,1,1,1,2,2,2,3,3]
[list(v) for k,v in groupby(array)]
[[1, 1, 1, 1], [2, 2, 2], [3, 3]]
Note that this is more convenient than creating n new lists dynamically (as shown, for instance, in this post), since you have no idea how many lists will be created, and you would have to refer to each list by its name rather than by simply indexing a nested list.
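If the repeats are not already grouped, sorting first (as noted above) restores the precondition; a short sketch:
from itertools import groupby

array = [1, 2, 1, 3, 2, 1, 3, 1, 2]
grouped = [list(v) for k, v in groupby(sorted(array))]
print(grouped)  # [[1, 1, 1, 1], [2, 2, 2], [3, 3]]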
You can use bisect.bisect_left to find the index of the first occurrence of each element. This works only if the list is sorted:
from bisect import bisect_left
def count_values(l, values=None):
    if values is None:
        values = range(1, l[-1]+1)  # Default: assume list is [1..n]
    counts = {}
    consumed = 0
    val_iter = iter(values)
    curr_value = next(val_iter)
    next_value = next(val_iter)
    while True:
        ind = bisect_left(l, next_value, consumed)
        counts[curr_value] = ind - consumed
        consumed = ind
        try:
            curr_value, next_value = next_value, next(val_iter)
        except StopIteration:
            break
    counts[next_value] = len(l) - consumed
    return counts
l = [1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,3,3,3]
print(count_values(l))
# {1: 9, 2: 8, 3: 7}
This avoids scanning the entire list, trading that for a binary search for each value. Expect this to be more performant where there are very many of each element, and less performant where there are few of each element.
Well, it seems to be wasteful and redundant to create all those arrays, each of which just stores repeating values.
You might want to just create a dictionary of unique values and their respective counts.
From this dictionary, you can always selectively create any of the individual arrays easily, whenever you want, and whichever particular one you want.
To create such a dictionary, you can use:
from collections import Counter
my_counts_dict = Counter(my_array)
Once you have this dict, you can get the number of 23's, for example, with my_counts_dict[23].
And if this returns 200, you can create your list of 200 23's with:
my_list23 = [23]*200
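For instance, a sketch that rebuilds every per-value list from the counts in one dict comprehension (the variable names are illustrative):
from collections import Counter

my_array = [1, 1, 1, 1, 2, 2, 2, 3, 3]
my_counts_dict = Counter(my_array)
all_lists = {value: [value] * count for value, count in my_counts_dict.items()}
print(all_lists)  # {1: [1, 1, 1, 1], 2: [2, 2, 2], 3: [3, 3]}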
You can use this code (PHP):
<?php
$arrayName = array(2,2,5,1,1,1,2,3,3,3,4,5,4,5,4,6,6,6,7,8,9,7,8,9,7,8,9);
$arr = array();
foreach ($arrayName as $value) {
    $arr[$value][] = $value;
}
sort($arr);
print_r($arr);
?>
Solution with no helper functions:
array = [1,1,2,2,2,3,4]
result = [[array[0]]]
for i in array[1:]:
    if i == result[-1][-1]:
        result[-1].append(i)
    else:
        result.append([i])
print(result)
# [[1, 1], [2, 2, 2], [3], [4]]

Python3: Integer not repeated

Given a random list of integers, with some integers repeated in the list, what is the way to print an integer from the list which is not repeated at all?
I have tried to solve the question by making the following program:
K = int(input())
room_list = list(input().split())
room_set = set(room_list)
for i in room_set:
    count = room_list.count(i)
    if count == 1:
        i = int(i)
        print(i)
        break
K being the number of elements in the list.
When I try to run the above program, it works well with a small number of elements; however, when it is tested with a list having many elements (say, 825), the program times out.
Please help me in optimizing the above code.
Elements whose repetition count in the list is one will be your answer.
from collections import Counter
a = [1,1,1,2,2,3,4,5,5]
c = Counter(a) # O(n)
x = [key for key, val in c.items() if val == 1]
print(x)
Output:
[3, 4]
The Counter class creates a dictionary of elements and their repetition counts by iterating through the list once, which takes O(n) time; each element access then takes O(1) time.
The count function of a list iterates over the entire list every time you call it, so your code takes O(n^2) time.
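Applied to your input-reading setup (a sketch; I'm assuming the same input format as your code, with the count on the first line and the space-separated values on the second), this keeps the single Counter pass:
from collections import Counter

K = int(input())  # number of elements, as in the original code
room_list = input().split()
counts = Counter(room_list)  # one O(n) pass
for x in room_list:  # scan in input order for the first non-repeated value
    if counts[x] == 1:
        print(int(x))
        break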
This will print the number that occurred least often:
data = [3,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,93,4,5,6,7,8,9,0,1,2,3,4,5,6,7,8,9]
from collections import Counter
# least often
print (Counter(data).most_common()[-1][0])
# all non-repeated onces
# non_repeat = [k[0] for k in Counter(data).most_common() if k[1] == 1]
Output:
93
It uses a specialized dictionary, collections.Counter, that's built for counting things in the iterable you give it.
The method .most_common() returns a list of (key, count) tuples sorted by count; printing its last member gives you the one that occurs least often.
The built-up dict looks like this:
Counter({4: 4, 5: 4, 6: 4, 7: 4, 8: 4, 3: 3, 9: 3, 0: 2, 1: 2, 2: 2, 93: 1})
A similar approach is to use a collections.defaultdict and count them yourself, then get the one with the minimal value:
from collections import defaultdict
k = defaultdict(int)
for elem in data:
    k[elem] += 1
print( min(k.items(), key=lambda x:x[1]) )
The last solution is similar in approach but without the specialized Counter. The advantage of both is that you iterate once over the whole list and increment a value, instead of iterating over the whole list n times to count each distinct element's occurrences separately.
Using count() on a list of purely distinct elements would lead to n counting runs through n elements, i.e. n^2 actions.
The dictionary approach needs only one pass through the list, so only n actions.

Comparing a 3-tuple to a list of 3-tuples using only the first two parts of the tuple

I have a list of 3-tuples in a Python program that I'm building while looking through a file (so one at a time), with the following setup:
(feature,combination,durationOfTheCombination),
such that if a unique combination of feature and combination is found, it is added to the list. The list itself holds a similar setup, except that the durationOfTheCombination is the sum of all durations that share the unique (feature, combination) pair. Therefore, when deciding whether it should be added to the list, I need to compare only the first two parts of the tuple; if a match is found, the duration is added to the corresponding list item.
Here's an example for clarity. If the input is
(ABC,123,10);(ABC,123,10);(DEF,123,5);(ABC,123,30);(EFG,456,30)
The output will be (ABC,123,50);(DEF,123,5);(EFG,456,30).
Is there any way to do this comparison?
You can do this with Counter,
In [42]: from collections import Counter
In [43]: lst = [('ABC',123,10),('ABC',123,10),('DEF',123,5)]
In [44]: [(i[0],i[1],i[2]*j) for i,j in Counter(lst).items()]
Out[44]: [('DEF', 123, 5), ('ABC', 123, 20)]
As the OP pointed out, if the duplicate tuples can carry different duration values, use itertools.groupby instead (sorting by the first two fields first):
In [25]: from itertools import groupby
In [26]: lst = [('ABC',123,10),('ABC',123,10),('ABC',123,25),('DEF',123,5)]
In [27]: [tuple(list(n)+[sum([i[2] for i in g])]) for n,g in groupby(sorted(lst,key = lambda x:x[:2]), key = lambda x:x[:2])]
Out[27]: [('ABC', 123, 45), ('DEF', 123, 5)]
If you don't want to use Counter, you can use a dict instead.
setOf3Tuples = dict()

def add3TupleToSet(a):
    key = a[0:2]
    if key in setOf3Tuples:
        setOf3Tuples[key] += a[2]
    else:
        setOf3Tuples[key] = a[2]

def getRaw3Tuple():
    for k in setOf3Tuples:
        yield k + (setOf3Tuples[k],)

if __name__ == "__main__":
    add3TupleToSet(("ABC",123,10))
    add3TupleToSet(("ABC",123,10))
    add3TupleToSet(("DEF",123,5))
    print([i for i in getRaw3Tuple()])
It seems a dict is better suited than a list here, with the first 2 fields as the key. To avoid checking each time whether the key is already present, you can use a defaultdict.
from collections import defaultdict
d = defaultdict(int)
for t in your_list:
    d[t[:2]] += t[-1]
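To turn the accumulated dict back into 3-tuples, a self-contained sketch using the example data from the question:
from collections import defaultdict

your_list = [('ABC', 123, 10), ('ABC', 123, 10), ('DEF', 123, 5),
             ('ABC', 123, 30), ('EFG', 456, 30)]
d = defaultdict(int)
for t in your_list:
    d[t[:2]] += t[-1]
result = [key + (total,) for key, total in d.items()]
print(result)  # [('ABC', 123, 50), ('DEF', 123, 5), ('EFG', 456, 30)]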
Assuming your input is collected in a list as below, you can use pandas groupby to accomplish this quickly:
import pandas as pd
records = [('ABC',123,10),('ABC',123,10),('DEF',123,5),('ABC',123,30),('EFG',456,30)]
output = [tuple(x) for x in pd.DataFrame(records).groupby([0,1])[2].sum().reset_index().values]
# [('ABC', 123, 50), ('DEF', 123, 5), ('EFG', 456, 30)]

Optimize search to find next matching value in a list

I have a program that goes through a list and, for each object, finds the next instance that has a matching value. When it does, it prints out the location of each object. The program runs perfectly fine, but the trouble I am running into is that when I run it with a large volume of data (~6,000,000 objects in the list) it takes much too long. If anyone could provide insight into how I can make the process more efficient, I would greatly appreciate it.
def search(list):
    original = list
    matchedvalues = []
    count = 0
    for x in original:
        targetValue = x.getValue()
        count = count + 1
        copy = original[count:]
        for y in copy:
            if targetValue == y.getValue():
                print(str(x.getLocation()) + "," + str(y.getLocation()))
                break
Perhaps you can make a dictionary that contains a list of indexes that correspond to each item, something like this:
values = [1,2,3,1,2,3,4]
from collections import defaultdict
def get_matches(x):
    my_dict = defaultdict(list)
    for ind, ele in enumerate(x):
        my_dict[ele].append(ind)
    return my_dict
Result:
>>> get_matches(values)
defaultdict(<type 'list'>, {1: [0, 3], 2: [1, 4], 3: [2, 5], 4: [6]})
Edit:
I added this part, in case it helps:
values = [1,1,1,1,2,2,3,4,5,3]
def get_next_item_ind(x, ind):
    my_dict = get_matches(x)
    indexes = my_dict[x[ind]]
    temp_ind = indexes.index(ind)
    if len(indexes) > temp_ind + 1:
        return indexes[temp_ind + 1]
    return None
Result:
>>> get_next_item_ind(values, 0)
1
>>> get_next_item_ind(values, 1)
2
>>> get_next_item_ind(values, 2)
3
>>> get_next_item_ind(values, 3)
>>> get_next_item_ind(values, 4)
5
>>> get_next_item_ind(values, 5)
>>> get_next_item_ind(values, 6)
9
>>> get_next_item_ind(values, 7)
>>> get_next_item_ind(values, 8)
There are a few ways you could increase the efficiency of this search by minimising additional memory use (particularly when your data is BIG).
you can operate directly on the list you are passing in and don't need to make copies of it; this way you won't need original = list or copy = original[count:]
you can use slices of the original list to test against, and enumerate(p) to iterate through them; you won't need the extra variable count, and enumerate(p) is efficient in Python
Re-implemented, this would become:
def search(p):
    # iterate over p
    for i, value in enumerate(p):
        # if value occurs more than once, print locations
        # do not re-test values that have already been tested (value not in p[:i])
        if value not in p[:i] and value in p[(i + 1):]:
            print(value, ':', i, p[(i + 1):].index(value))

v = [1,2,3,1,2,3,4]
search(v)
1 : 0 2
2 : 1 2
3 : 2 2
Implementing it this way will only print out the values / locations where a value is repeated (which I think is what you intended in your original implementation).
Other considerations:
More than 2 occurrences of value: If the value repeats many times in the list, then you might want to implement a function to walk recursively through the list. As it is, the question doesn't address this - and it may be that it doesn't need to in your situation.
using a dictionary: I completely agree with Akavall above; dictionaries are a great way of looking up values in Python, especially if you need to look up values again later in the program. This will work best if you construct a dictionary instead of a list when you originally create the data. But if you are only doing this once, it is going to cost you more time to construct the dictionary and query it than to simply iterate over the list as described above.
Hope this helps!

Get list based on occurrences in unknown number of sublists

I'm looking for a way to make a list containing list (a below) into a single list (b below) with 2 conditions:
The order of the new list (b) is based on the number of times each value occurs across the sublists of a.
A value can only appear once
Basically turn a into b:
a = [[1,2,3,4], [2,3,4], [4,5,6]]
# value 4 occurs 3 times in list a and gets first position
# value 2 occurs 2 times in list a and get second position and so on...
b = [4,2,3,1,5,6]
I figure one could do this with set and some list magic. But I can't get my head around it when a can contain any number of lists. The a list is created based on user input (I guess it can contain between 1-20 lists with up to 200-300 items in each).
I'm trying something along the lines of [set(l) for l in a] but don't know how to perform set(l) & set(l)... to get all matched items.
Is it possible without a for loop iterating sublist count * items-per-sublist times?
I think this is probably the closest you're going to get:
from collections import defaultdict
d = defaultdict(int)
for sub in outer:
    for val in sub:
        d[val] += 1
print sorted(d.keys(), key=lambda k: d[k], reverse = True)
# Output: [4, 2, 3, 1, 5, 6]
Note that the order of elements that appear an identical number of times is indeterminate, since the output of d.keys() is not ordered.
import itertools
all_items = set(itertools.chain(*a))
b = sorted(all_items, key = lambda y: -sum(x.count(y) for x in a))
Try this -
a = [[1,2,3,4], [2,3,4], [4,5,6]]
s = set()
for l in a:
    s.update(l)
print s
#set([1, 2, 3, 4, 5, 6])
b = list(s)
This adds each list to the set, giving you a unique set of all elements in all the lists, if that is what you are after.
Edit: To preserve the order of elements in the original list, you can't use sets.
a = [[1,2,3,4], [2,3,4], [4,5,6]]
b = []
for l in a:
    for i in l:
        if not i in b:
            b.append(i)
print b
#[1, 2, 3, 4, 5, 6] - the same order as the set in this case, since that's the order they appear in the lists
import itertools
from collections import defaultdict
def list_by_count(lists):
    data_stream = itertools.chain.from_iterable(lists)
    counts = defaultdict(int)
    for item in data_stream:
        counts[item] += 1
    return [item for (item, count) in
            sorted(counts.items(), key=lambda x: (-x[1], x[0]))]
Having the x[0] in the sort key ensures that items with the same count are also ordered deterministically (by value).
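A quick usage check with the list from the question (reusing list_by_count defined above):
a = [[1, 2, 3, 4], [2, 3, 4], [4, 5, 6]]
print(list_by_count(a))  # [4, 2, 3, 1, 5, 6]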
