Rapid compression of multiple lists with value addition - python

I am looking for a pythonic way to iterate through a large number of lists and use the index of repeated values from one list to calculate a total value from the values with the same index in another list.
For example, say I have two lists
a = [ 1, 2, 3, 1, 2, 3, 1, 2, 3]
b = [ 1, 2, 3, 4, 5, 6, 7, 8, 9]
What I want to do is find the unique values in a, and then add together the corresponding values from b with the same index. My attempt, which is quite slow, is as follows:
a1=list(set(a))
b1=[0 for y in range(len(a1))]
for m in range(len(a)):
    for k in range(len(a1)):
        if a1[k]==a[m]:
            b1[k]+=b[m]
and I get
a1=[1, 2, 3]
b1=[12, 15, 18]
Please let me know if there is a faster, more pythonic way to do this.
Thanks

Use the zip() function and a defaultdict dictionary to collect values per unique value:
from collections import defaultdict
try:
    # Python 2 compatibility
    from future_builtins import zip
except ImportError:
    # Python 3, already there
    pass
values = defaultdict(int)
for key, value in zip(a, b):
    values[key] += value
a1, b1 = zip(*sorted(values.items()))
zip() pairs up the values from your two input lists; all that remains is to sum the values from b for each unique value of a.
The last line pulls out the keys and values from the resulting dictionary, sorts these, and puts just the keys and just the values into a1 and b1, respectively.
Demo:
>>> from collections import defaultdict
>>> a = [ 1, 2, 3, 1, 2, 3, 1, 2, 3]
>>> b = [ 1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> values = defaultdict(int)
>>> for key, value in zip(a, b):
...     values[key] += value
...
>>> list(zip(*sorted(values.items())))  # list() is needed on Python 3, where zip() returns an iterator
[(1, 2, 3), (12, 15, 18)]
If you don't care about output order, you can drop the sorted() call altogether.
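As a rough sketch of that order-insensitive variant: on Python 3.7 and later a plain dict keeps insertion order, so skipping sorted() returns the keys in first-seen order. The helper name sum_by_key is hypothetical and not part of the answer above.

```python
def sum_by_key(keys, values):
    """Sum the values that share a key, keeping first-seen key order."""
    totals = {}
    for key, value in zip(keys, values):
        # dict.get supplies 0 the first time a key is seen
        totals[key] = totals.get(key, 0) + value
    return list(totals), list(totals.values())


a = [3, 1, 2, 3, 1, 2]
b = [1, 2, 3, 4, 5, 6]
a1, b1 = sum_by_key(a, b)
print(a1, b1)  # [3, 1, 2] [5, 7, 9]
```

Unlike the sorted() version, this returns two lists rather than two tuples.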

Related

Sort a list based on the order of occurrence of that value in another list

How can I sort the values of A based on their order of occurrence in B, where values in A may repeat and values in B are unique?
A=[1, 2, 2, 2, 3, 4, 4, 5]
B=[8, 5, 6, 2, 10, 3, 1, 9, 4]
The expected list is C which should contain
C = [5, 2, 2, 2, 3, 1, 4, 4]
Solution:
Try using sorted:
C = sorted(A, key=B.index)
And now:
print(C)
Output:
[5, 2, 2, 2, 3, 1, 4, 4]
Documentation reference:
As mentioned in the documentation of sorted:
Return a new sorted list from the items in iterable.
Has two optional arguments which must be specified as keyword
arguments.
key specifies a function of one argument that is used to extract a
comparison key from each element in iterable (for example,
key=str.lower). The default value is None (compare the elements
directly).
reverse is a boolean value. If set to True, then the list elements are
sorted as if each comparison were reversed.
You can use the key argument of the sorted function:
A=[1, 2, 2, 2, 3, 4, 4, 5]
B=[8, 5, 6, 2, 10, 3, 1, 9, 4]
C = ((i, B.index(i)) for i in A) # <generator object <genexpr> at 0x000001CE8FFBE0A0>
output = [i[0] for i in sorted(C, key=lambda x: x[1])] #[5, 2, 2, 2, 3, 1, 4, 4]
You can sort it without actually using a sort. The Counter class (from collections) is a special dictionary that maintains counts for a set of keys. In this case, your B list contains all keys that are possible. So you can use it to initialize a Counter object with zero occurrences of each key (this preserves the order of B, since dictionaries keep insertion order on Python 3.7+) and then add the A list to that. Finally, get the repeated elements out of the resulting Counter object.
from collections import Counter
A=[1, 2, 2, 2, 3, 4, 4, 5]
B=[8, 5, 6, 2, 10, 3, 1, 9, 4]
C = Counter(dict.fromkeys(B,0)) # initialize order
C.update(A) # 'sort' A
C = list(C.elements()) # get sorted elements
print(C)
[5, 2, 2, 2, 3, 1, 4, 4]
You could also write it in a single line:
C = list((Counter(dict.fromkeys(B,0))+Counter(A)).elements())
While sorted(A, key=B.index) is simpler to write, this solution has lower complexity: O(K+N), versus O(N x K + N log N) for the sort, because the key function is called once per element of A and each B.index call scans B in O(K).
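A possible middle ground, sketched here under the assumption that every value of A also appears in B: build a value-to-position dictionary once, so that each key lookup is a dictionary access rather than a scan of B. The name position is illustrative only.

```python
A = [1, 2, 2, 2, 3, 4, 4, 5]
B = [8, 5, 6, 2, 10, 3, 1, 9, 4]

# Map each value of B to its index once: O(K)
position = {value: index for index, value in enumerate(B)}

# Each key lookup is now a dictionary access instead of B.index(...)
C = sorted(A, key=position.__getitem__)
print(C)  # [5, 2, 2, 2, 3, 1, 4, 4]
```

A value of A that is missing from B raises KeyError here, whereas B.index would raise ValueError.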

Sort a list by frequency

I would like to sort a list by frequency in descending order. If two values have the same frequency, I want those values in descending order as well.
For example,
mylist = [1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 5, 5, 5, 4, 4, 4, 4, 4, 4]
I would like my result to be
[4,4,4,4,4,4,3,3,3,3,3,5,5,5,2,2,2,1,1].
If I use
sorted(mylist,key = mylist.count,reverse = True)
I would get
[4,4,4,4,4,4,3,3,3,3,3,2,2,2,5,5,5,1,1];
I tried
sorted(mylist,key = lambda x:(mylist.count,-x),reverse = True)
But I think something is wrong, because it only gives me the result:
[1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5].
So my questions are: how can I get the result I want, and why is the result
[1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5]
if I use
sorted(mylist,key = lambda x:(mylist.count,-x),reverse = True)
Use a Counter to get the frequencies, then sort by the frequencies it gives:
from collections import Counter
def sorted_by_frequency(arr):
    counts = Counter(arr)
    # secondarily sort by value
    arr2 = sorted(arr, reverse=True)
    # primarily sort by frequency
    return sorted(arr2, key=counts.get, reverse=True)
# Usage:
>>> sorted_by_frequency([1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 5, 5, 5, 4, 4, 4, 4, 4, 4])
[4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 5, 5, 5, 2, 2, 2, 1, 1]
You can try:
from collections import Counter
counts = Counter(mylist)
new_list = sorted(mylist, key=lambda x: (counts[x], x), reverse=True)
Why does
sorted(mylist, key=lambda x: (mylist.count, -x), reverse=True)
go wrong?
It compares the keys, so for example the two values 3 and 1 become the pairs (mylist.count, -3) and (mylist.count, -1) and the comparison would be (mylist.count, -3) < (mylist.count, -1).
So the obvious mistake is that the pairs don't have the frequencies of the numbers as intended. Instead they have that function. And the function is not less than itself.
But I find it interesting to note what exactly happens then. How does that pair comparison work? You might think that (a, b) < (c, d) is equivalent to (a < c) or (a == c and b < d). That is not the case. Because that would evaluate mylist.count < mylist.count, and then you'd crash with a TypeError. The actual way tuples compare with each other is by first finding a difference, and that's done by checking equality. And mylist.count == mylist.count not only doesn't crash but returns True. So the tuple comparison then goes to the next index, where it will find the -3 and -1.
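The behaviour described above can be checked directly. This is a small sketch assuming Python 3 (CPython); the three-element list is only an illustration.

```python
mylist = [3, 1, 2]

# Equality between the two bound methods is defined and true
print(mylist.count == mylist.count)  # True

# Ordering between them is not defined
try:
    mylist.count < mylist.count
    ordering_failed = False
except TypeError as exc:
    ordering_failed = True
    print("TypeError:", exc)

# The tuples still compare fine: the first items are equal,
# so the comparison is decided by -3 < -1
print((mylist.count, -3) < (mylist.count, -1))  # True
```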
So essentially you're really only doing
sorted(mylist, key=lambda x: -x, reverse=True)
and the negation and the reverse=True cancel each other out, so you get the same as
sorted(mylist, key=lambda x: x)
or just
sorted(mylist)
Now how to get it right? One way is to call the function (and to remove the negation):
result = sorted(mylist, key=lambda x: (mylist.count(x), x), reverse=True)
Or negate both frequency and value, instead of reverse=True:
result = sorted(mylist, key=lambda x: (-mylist.count(x), -x))
Another would be to take advantage of the sort's stability and use two simpler sorts (which might even be faster than the one more elaborate sort):
result = sorted(mylist, reverse=True)
result.sort(key=mylist.count, reverse=True)
Note that here we don't have to call mylist.count ourselves, because as it is the key it will be called for us. Just like your "lambda function" does get called (just not the function inside its result). Also note that I use sorted followed by in-place sort - no point creating yet another list and incur the costs associated with that.
Though in all cases, for long lists it would be more efficient to use a collections.Counter instead of mylist.count, as the latter makes the solution take O(n^2) instead of O(n log n).

Take elements from multiple lists

Given multiple lists like the ones shown:
a = [1, 2, 3]
b = [5, 6, 7, 8]
c = [9, 0, 1]
d = [2, 3, 4, 5, 6, 7]
...
I want to be able to combine them to take as many elements from the first list as I can before starting to take elements from the second list, so the result would be:
result = [1, 2, 3, 8, 6, 7]
Is there a particularly nice way to write this? I can't think of a really simple one without a for loop. Maybe a list comprehension with a clever zip.
Simple slicing and concatenation:
a + b[len(a):]
Or with more lists:
res = []
for lst in (a, b, c, d):
    res += lst[len(res):]
# [1, 2, 3, 8, 6, 7]
With itertools.zip_longest() for Python 3, works on any number of input lists:
>>> from itertools import zip_longest
>>> [next(x for x in t if x is not None) for t in zip_longest(a,b,c,d)]
[1, 2, 3, 8, 6, 7]
The default fill value is None, so take the first non-None element in each tuple created by the zip_longest call. (You can change the fill value and the test if None is a valid data value.)
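If None can occur in the data, one way to change the fill value is a private sentinel object. The following is only a sketch; the names _MISSING and take_first_available are hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # hypothetical sentinel that cannot collide with real data


def take_first_available(*lists):
    """For each position, take the element from the first list long enough to have one."""
    return [
        next(x for x in column if x is not _MISSING)
        for column in zip_longest(*lists, fillvalue=_MISSING)
    ]


a = [1, None, 3]
b = [5, 6, 7, 8]
d = [2, 3, 4, 5, 6, 7]
print(take_first_available(a, b, d))  # [1, None, 3, 8, 6, 7]
```

Because the sentinel is compared with "is not", a None stored in the first list is kept as a real value.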
With functools.reduce:
from functools import reduce
print(list(reduce(lambda a, b: a + b[len(a):], [a, b, c, d])))
This outputs:
[1, 2, 3, 8, 6, 7]

How to replace numbers with order in (python) list

I have a list containing integers and want to replace them so that the element which previously contained the highest number now contains a 1, the second highest number set to 2, etc etc.
Example:
[5, 6, 34, 1, 9, 3] should yield [4, 3, 1, 6, 2, 5].
I personally only care about the 9 highest numbers, but I thought there might be a simple algorithm or possibly even a Python function to take care of this task?
Edit: I don't care how duplicates are handled.
A fast way to do this is to first generate a list of tuples of the element and its position:
sort_data = [(x,i) for i,x in enumerate(data)]
Next, we sort these elements in reverse order:
sort_data = sorted(sort_data,reverse=True)
which generates (for your sample input):
>>> sort_data
[(34, 2), (9, 4), (6, 1), (5, 0), (3, 5), (1, 3)]
and next we need to fill in these elements like:
result = [0]*len(data)
for i,(_,idx) in enumerate(sort_data,1):
    result[idx] = i
Or putting it together:
def obtain_rank(data):
    sort_data = [(x,i) for i,x in enumerate(data)]
    sort_data = sorted(sort_data,reverse=True)
    result = [0]*len(data)
    for i,(_,idx) in enumerate(sort_data,1):
        result[idx] = i
    return result
This approach runs in O(n log n), with n the number of elements in data.
A more compact algorithm (in the sense that no tuples are constructed for the sorting) is:
def obtain_rank(data):
    sort_data = sorted(range(len(data)),key=lambda i:data[i],reverse=True)
    result = [0]*len(data)
    for i,idx in enumerate(sort_data,1):
        result[idx] = i
    return result
Another option is the rankdata function from scipy, which provides options for handling duplicates:
from scipy.stats import rankdata
lst = [5, 6, 34, 1, 9, 3]
rankdata(list(map(lambda x: -x, lst)), method='ordinal')
# array([4, 3, 1, 6, 2, 5])
Assuming you do not have any duplicates, the following list comprehension will do:
lst = [5, 6, 34, 1, 9, 3]
tmp_sorted = sorted(lst, reverse=True) # kudos to @Wondercricket
res = [tmp_sorted.index(x) + 1 for x in lst] # [4, 3, 1, 6, 2, 5]
To understand how it works, you can break it up into pieces like so:
lst = [5, 6, 34, 1, 9, 3]
# let's see what the sorted returns
print(sorted(lst, reverse=True)) # [34, 9, 6, 5, 3, 1]
# biggest to smallest. that is handy.
# Since it returns a list, i can index it. Let's try with 6
print(sorted(lst, reverse=True).index(6)) # 2
# oh, python is 0-index, let's add 1
print(sorted(lst, reverse=True).index(6) + 1) # 3
# that's more like it. now the same for all elements of original list
for x in lst:
    print(sorted(lst, reverse=True).index(x) + 1) # 4, 3, 1, 6, 2, 5
# too verbose and not a list yet..
res = [sorted(lst, reverse=True).index(x) + 1 for x in lst]
# but now we are sorting in every iteration... let's store the sorted one instead
tmp_sorted = sorted(lst, reverse=True)
res = [tmp_sorted.index(x) + 1 for x in lst]
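Note that tmp_sorted.index(x) scans the sorted list for every element, so the comprehension is quadratic in the worst case. A possible variant, still assuming no duplicates, precomputes each value's rank in a dictionary; the name rank_of is illustrative.

```python
lst = [5, 6, 34, 1, 9, 3]

# Map each value to its 1-based rank in descending order.
# Assumes no duplicates: a repeated value would keep only its last rank.
rank_of = {value: rank for rank, value in enumerate(sorted(lst, reverse=True), start=1)}

res = [rank_of[x] for x in lst]
print(res)  # [4, 3, 1, 6, 2, 5]
```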
Using numpy.argsort:
numpy.argsort returns the indices that would sort an array.
>>> xs = [5, 6, 34, 1, 9, 3]
>>> import numpy as np
>>> np.argsort(np.argsort(-np.array(xs))) + 1
array([4, 3, 1, 6, 2, 5])
A short, log-linear solution using pure Python, and no look-up tables.
The idea: store the positions in a list of pairs, then sort the list to reorder the positions.
enum1 = lambda seq: enumerate(seq, start=1) # We want 1-based positions
def replaceWithRank(xs):
    # pos = position in the original list, rank = position in the top-down sorted list.
    vp = sorted([(value, pos) for (pos, value) in enum1(xs)], reverse=True)
    pr = sorted([(pos, rank) for (rank, (_, pos)) in enum1(vp)])
    return [rank for (_, rank) in pr]
assert replaceWithRank([5, 6, 34, 1, 9, 3]) == [4, 3, 1, 6, 2, 5]

Get index from a list where the key changes, groupby

I have a list that looks like this:
myList = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
What I want to do is record the indexes where the items in the list change value. So for my list above it would be 3, 6.
I know that using groupby like this:
[len(list(group)) for key, group in groupby(myList)]
will result in:
[4, 3, 3]
but what I want is the index where a group starts/ends rather than just the number of items in the groups. I know I could start summing each successive group count-1 to get the index, but thought there may be a cleaner way of doing so.
Thoughts appreciated.
Just use enumerate to generate indexes along with the list.
from operator import itemgetter
from itertools import groupby
myList = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]
[next(group) for key, group in groupby(enumerate(myList), key=itemgetter(1))]
# [(0, 1), (4, 2), (7, 3)]
This gives pairs of (start_index, value) for each group.
If you really just want [3, 6], you can use
[tuple(group)[-1][0] for key, group in
 groupby(enumerate(myList), key=itemgetter(1))][:-1]
or
indexes = (next(group)[0] - 1 for key, group in
           groupby(enumerate(myList), key=itemgetter(1)))
next(indexes)
indexes = list(indexes)
[i for i in range(len(myList)-1) if myList[i] != myList[i+1]]
In Python 2, replace range with xrange.
>>> x0 = myList[0]
>>> for i, x in enumerate(myList):
...     if x != x0:
...         print(i - 1)
...         x0 = x
...
3
6
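The running-sum idea mentioned in the question can be sketched with itertools.accumulate. This is one possible reading of that idea, not one of the answers above, and the variable names are illustrative.

```python
from itertools import accumulate, groupby

myList = [1, 1, 1, 1, 2, 2, 2, 3, 3, 3]

# Length of each run of equal values: [4, 3, 3]
run_lengths = [len(list(group)) for _, group in groupby(myList)]

# Running totals give the index just past each run: [4, 7, 10]
run_ends = list(accumulate(run_lengths))

# Last index of every run except the final one
change_indexes = [end - 1 for end in run_ends[:-1]]
print(change_indexes)  # [3, 6]
```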
