Lazy sorted function, timsort - python

Having a list with N (large number) elements:
from random import randint
eles = [randint(0, 10) for i in range(3000000)]
I'm trying to implement the best way (performance/resources spent) this function below:
def mosty(lst):
sort = sorted((v, k) for k, v in enumerate(lst))
count, maxi, last_ele, idxs = 0, 0, None, []
for ele, idx in sort:
if(last_ele != ele):
count = 1
idxs = []
idxs.append(idx)
if(last_ele == ele):
count += 1
if(maxi < count):
results = (ele, count, idxs)
maxi = count
last_ele = ele
return results
This function returns the most common element, number of occurrences, and the indexes where it was found.
Here is the benchmark with 300000 eles.
But I think I could improve, one of the reasons being python3 sorted function (timsort), if it returned a generator I didn't have to loop through the list twice right?
My questions are:
Is there any way for this code to be optimized? How?
With a lazy sorting I sure it would be, am I right? How can I implement lazy timsort

did not do any benchmarks, but that should not perform that badly (even though it iterates twice over the list):
from collections import Counter
from random import randint
eles = [randint(0, 10) for i in range(30)]
counter = Counter(eles)
most_common_element, number_of_occurrences = counter.most_common(1)[0]
indices = [i for i, x in enumerate(eles) if x == most_common_element]
print(most_common_element, number_of_occurrences, indices)
and the indices (the second iteration) can be found lazily in a generator expression:
indices = (i for i, x in enumerate(eles) if x == most_common_element)
if you need to care about multiple elements being the most common, this might work for you:
from collections import Counter
from itertools import groupby
from operator import itemgetter
eles = [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5]
counter = Counter(eles)
_key, group = next(groupby(counter.most_common(), key=itemgetter(1)))
most_common = dict(group)
indices = {key: [] for key in most_common}
for i, x in enumerate(eles):
if x in indices:
indices[x].append(i)
print(most_common)
print(indices)
you could of course still make the indices lazy the same way as above.

If you are willing to use numpy, then you can do something like this:
arr = np.array(eles)
values, counts = np.unique(arr, return_counts=True)
ind = np.argmax(counts)
most_common_elem, its_count = values[ind], counts[ind]
indices = np.where(arr == most_common_elem)
HTH.

Related

how to make integer as Key and List as Value in Map in Python?

I have this input
arr = [1,2,1,3,4,2,4,2]
What I'm trying to achieve is make elements of the given list as keys of a map and its indices as values in the form of a list.
Like this :
mp = {1:[0,2],2:[1,5,7],3:[3],4:[4,6]}
And what will be the time-complexity for this?
from collections import defaultdict
result = defaultdict(list)
for pos, num in enumerate(array):
result[num].append(pos)
Time complexity is O(n), there is only one loop and all things in the loop (looking up thing in dictionary, appending an item) are constant wrt. the number of items.
arr = [1, 2, 1, 3, 4, 2, 4, 2]
mp = {
num: [i for i, k in enumerate(arr) if i == num]
for num in set(arr)
}
or normally:
arr = [1, 2, 1, 3, 4, 2, 4, 2]
mp = {}
for num in set(arr):
mp[num] = []
for i, k in enumerate(arr):
if k == num:
mp[num] += [i]
Sorry, i do not know much about the time complexities stuff, but naturally two loops, so O(n^2).
try
arr = [1,2,1,3,4,2,4,2]
d = {}
for n in set(arr):
d[n] = [i for i, x in enumerate(arr) if x==n]
or
import numpy as np
arr = [1,2,1,3,4,2,4,2]
arr = np.array(arr)
d = {n:np.where(arr==n)[0].tolist() for n in set(arr)}

Find the minimum sum obtained from summing each integers from a list

I'm learning python and I'm training myself on different exercises. For one of them I have to find the minimum sums of the product of two numbers from a list. For this I had the idea of making a list and sort it in order and to create a second list where I'll sort it in reverse order then I will make the product of the first element of each list and sum them together.
Here's a little example:
list[5, 4, 2, 3]
list_sorted = [2, 3]
list_sorted_rev = [5, 4]
expectation = 22
calcul made: 5 * 2 + 3 * 4
But I have a problem, when I do that in a loop, my loop iterate through the second list with the first value of my first list and then it goes to the second value of my first list.
Here's the code I've done.
def min_sum(arr):
my_list = []
my_rev = []
srt = sorted(arr)
rev = sorted(arr, reverse=True)
rng = len(arr) / 2
res = 0
for i in range(0, int(rng)):
my_list.append(srt[i])
my_rev.append(rev[i])
for i in my_rev:
for j in my_list:
res += i * j
print(res)
Instead of using the nested for loops:
for i in my_rev:
for j in my_list:
res += i * j
You must iterate over both list simultaneously. This can be done using zip(*iterables):
for i, j in zip(my_rev, my_list):
res += i * j
This is because you're using a nested loop which makes it iterate over the first list multiple times. You can use this code instead:
def min_sum(arr):
my_list = []
my_rev = []
srt = sorted(arr)
rev = sorted(arr, reverse=True)
rng = len(arr) / 2
res = 0
for i in range(0, int(rng)):
my_list.append(srt[i])
my_rev.append(rev[i])
for ind in range(len(my_rev)):
res += my_rev[ind] * my_list[ind]
print(res)
min_sum([5, 4, 2, 3])

How do I find the Index of the smallest number in an array in python if I have multiple smallest numbers and want both indexes?

I have an array in which I want to find the index of the smallest elements. I have tried the following method:
distance = [2,3,2,5,4,7,6]
a = distance.index(min(distance))
This returns 0, which is the index of the first smallest distance. However, I want to find all such instances, 0 and 2. How can I do this in Python?
Use np.where to get all the indexes that match a given value:
import numpy as np
distance = np.array([2,3,2,5,4,7,6])
np.where(distance == np.min(distance))[0]
Out[1]: array([0, 2])
Numpy outperforms other methods as the size of the array grows:
Results of TimeIt comparison test, adapted from Yannic Hamann's code below
Length of Array x 7
Method 1 10 20 50 100 1000
Sorted Enumerate 2.47 16.291 33.643
List Comprehension 1.058 4.745 8.843 24.792
Numpy 5.212 5.562 5.931 6.22 6.441 6.055
Defaultdict 2.376 9.061 16.116 39.299
You may enumerate array elements and extract their indexes if the condition holds:
min_value = min(distance)
[i for i,n in enumerate(distance) if n==min_value]
#[0,2]
Surprisingly the numpy answer seems to be the slowest.
Update: Depends on the size of the input list.
import numpy as np
import timeit
from collections import defaultdict
def weird_function_so_bad_to_read(distance):
se = sorted(enumerate(distance), key=lambda x: x[1])
smallest_numb = se[0][1] # careful exceptions when list is empty
return [x for x in se if smallest_numb == x[1]]
# t1 = 1.8322973089525476
def pythonic_way(distance):
min_value = min(distance)
return [i for i, n in enumerate(distance) if n == min_value]
# t2 = 0.8458914929069579
def fastest_dont_even_have_to_measure(np_distance):
# np_distance = np.array([2, 3, 2, 5, 4, 7, 6])
min_v = np.min(np_distance)
return np.where(np_distance == min_v)[0]
# t3 = 4.247801031917334
def dd_answer_was_my_first_guess_too(distance):
d = defaultdict(list) # a dictionary where every value is a list by default
for idx, num in enumerate(distance):
d[num].append(idx) # for each number append the value of the index
return d.get(min(distance))
# t4 = 1.8876687170704827
def wrapper(func, *args, **kwargs):
def wrapped():
return func(*args, **kwargs)
return wrapped
distance = [2, 3, 2, 5, 4, 7, 6]
t1 = wrapper(weird_function_so_bad_to_read, distance)
t2 = wrapper(pythonic_way, distance)
t3 = wrapper(fastest_dont_even_have_to_measure, np.array(distance))
t4 = wrapper(dd_answer_was_my_first_guess_too, distance)
print(timeit.timeit(t1))
print(timeit.timeit(t2))
print(timeit.timeit(t3))
print(timeit.timeit(t4))
We can use an interim dict to store indices of the list and then just fetch the minimum value of distance from it. We will also use a simple for-loop here so that you can understand what is happening step by step.
from collections import defaultdict
d = defaultdict(list) # a dictionary where every value is a list by default
for idx, num in enumerate(distance):
d[num].append(idx) # for each number append the value of the index
d.get(min(distance)) # fetch the indices of the min number from our dict
[0, 2]
You can also do the following list comprehension
distance = [2,3,2,5,4,7,6]
min_distance = min(distance)
[index for index, val in enumerate(distance) if val == min_distance]
>>> [0, 2]

How to sort a NumPy array by frequency?

I am attempting to sort a NumPy array by frequency of elements. So for example, if there's an array [3,4,5,1,2,4,1,1,2,4], the output would be another NumPy sorted from most common to least common elements (no duplicates). So the solution would be [4,1,2,3,5]. If two elements have the same number of occurrences, the element that appears first is placed first in the output. I have tried doing this, but I can't seem to get a functional answer. Here is my code so far:
temp1 = problems[j]
indexes = np.unique(temp1, return_index = True)[1]
temp2 = temp1[np.sort(indexes)]
temp3 = np.unique(temp1, return_counts = True)[1]
temp4 = np.argsort(temp3)[::-1] + 1
where problems[j] is a NumPy array like [3,4,5,1,2,4,1,1,2,4]. temp4 returns [4,1,2,5,3] so far but it is not correct because it can't handle when two elements have the same number of occurrences.
You can use argsort on the frequency of each element to find the sorted positions and apply the indexes to the unique element array
unique_elements, frequency = np.unique(array, return_counts=True)
sorted_indexes = np.argsort(frequency)[::-1]
sorted_by_freq = unique_elements[sorted_indexes]
A non-NumPy solution, which does still work with NumPy arrays, is to use an OrderedCounter followed by sorted with a custom function:
from collections import OrderedDict, Counter
class OrderedCounter(Counter, OrderedDict):
pass
L = [3,4,5,1,2,4,1,1,2,4]
c = OrderedCounter(L)
keys = list(c)
res = sorted(c, key=lambda x: (-c[x], keys.index(x)))
print(res)
[4, 1, 2, 3, 5]
If the values are integer and small, or you only care about bins of size 1:
def sort_by_frequency(arr):
return np.flip(np.argsort(np.bincount(arr))[-(np.unique(arr).size):])
v = [1,1,1,1,1,2,2,9,3,3,3,3,7,8,8]
sort_by_frequency(v)
this should yield
array([1, 3, 8, 2, 9, 7]
Use zip and itemgetter should help
from operator import itemgetter
import numpy as np
temp1 = problems[j]
temp, idx, cnt = np.unique(temp1, return_index = True, return_counts=True)
cnt = 1 / cnt
k = sorted(zip(temp, cnt, idx), key=itemgetter(1, 2))
print(next(zip(*k)))
You can count up the number of each element in the array, and then use it as a key to the build-in sorted function
def sortbyfreq(arr):
s = set(arr)
keys = {n: (-arr.count(n), arr.index(n)) for n in s}
return sorted(list(s), key=lambda n: keys[n])

Permutations with repetition without two consecutive equal elements

I need a function that generates all the permutation with repetition of an iterable with the clause that two consecutive elements must be different; for example
f([0,1],3).sort()==[(0,1,0),(1,0,1)]
#or
f([0,1],3).sort()==[[0,1,0],[1,0,1]]
#I don't need the elements in the list to be sorted.
#the elements of the return can be tuples or lists, it doesn't change anything
Unfortunatly itertools.permutation doesn't work for what I need (each element in the iterable is present once or no times in the return)
I've tried a bunch of definitions; first, filterting elements from itertools.product(iterable,repeat=r) input, but is too slow for what I need.
from itertools import product
def crp0(iterable,r):
l=[]
for f in product(iterable,repeat=r):
#print(f)
b=True
last=None #supposing no element of the iterable is None, which is fine for me
for element in f:
if element==last:
b=False
break
last=element
if b: l.append(f)
return l
Second, I tried to build r for cycle, one inside the other (where r is the class of the permutation, represented as k in math).
def crp2(iterable,r):
a=list(range(0,r))
s="\n"
tab=" " #4 spaces
l=[]
for i in a:
s+=(2*i*tab+"for a["+str(i)+"] in iterable:\n"+
(2*i+1)*tab+"if "+str(i)+"==0 or a["+str(i)+"]!=a["+str(i-1)+"]:\n")
s+=(2*i+2)*tab+"l.append(a.copy())"
exec(s)
return l
I know, there's no need you remember me: exec is ugly, exec can be dangerous, exec isn't easy-readable... I know.
To understand better the function I suggest you to replace exec(s) with print(s).
I give you an example of what string is inside the exec for crp([0,1],2):
for a[0] in iterable:
if 0==0 or a[0]!=a[-1]:
for a[1] in iterable:
if 1==0 or a[1]!=a[0]:
l.append(a.copy())
But, apart from using exec, I need a better functions because crp2 is still too slow (even if faster than crp0); there's any way to recreate the code with r for without using exec? There's any other way to do what I need?
You could prepare the sequences in two halves, then preprocess the second halves to find the compatible choices.
def crp2(I,r):
r0=r//2
r1=r-r0
A=crp0(I,r0) # Prepare first half sequences
B=crp0(I,r1) # Prepare second half sequences
D = {} # Dictionary showing compatible second half sequences for each token
for i in I:
D[i] = [b for b in B if b[0]!=i]
return [a+b for a in A for b in D[a[-1]]]
In a test with iterable=[0,1,2] and r=15, I found this method to be over a hundred times faster than just using crp0.
You could try to return a generator instead of a list. With large values of r, your method will take a very long time to process product(iterable,repeat=r) and will return a huge list.
With this variant, you should get the first element very fast:
from itertools import product
def crp0(iterable, r):
for f in product(iterable, repeat=r):
last = f[0]
b = True
for element in f[1:]:
if element == last:
b = False
break
last = element
if b:
yield f
for no_repetition in crp0([0, 1, 2], 12):
print(no_repetition)
# (0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1)
# (1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0)
Instead of filtering the elements, you could generate a list directly with only the correct elements. This method uses recursion to create the cartesian product:
def product_no_repetition(iterable, r, last_element=None):
if r == 0:
return [[]]
else:
return [p + [x] for x in iterable
for p in product_no_repetition(iterable, r - 1, x)
if x != last_element]
for no_repetition in product_no_repetition([0, 1], 12):
print(no_repetition)
I agree with #EricDuminil's comment that you do not want "Permutations with repetition." You want a significant subset of the product of the iterable with itself multiple times. I don't know what name is best: I'll just call them products.
Here is an approach that builds each product line without building all the products then filtering out the ones you want. My approach is to work primarily with the indices of the iterable rather than the iterable itself--and not all the indices, but ignoring the last one. So instead of working directly with [2, 3, 5, 7] I work with [0, 1, 2]. Then I work with the products of those indices. I can transform a product such as [1, 2, 2] where r=3 by comparing each index with the previous one. If an index is greater than or equal to the previous one I increment the current index by one. This prevents two indices from being equal, and this also gets be back to using all the indices. So [1, 2, 2] is transformed to [1, 2, 3] where the final 2 was changed to a 3. I now use those indices to select the appropriate items from the iterable, so the iterable [2, 3, 5, 7] with r=3 gets the line [3, 5, 7]. The first index is treated differently, since it has no previous index. My code is:
from itertools import product
def crp3(iterable, r):
L = []
for k in range(len(iterable)):
for f in product(range(len(iterable)-1), repeat=r-1):
ndx = k
a = [iterable[ndx]]
for j in range(r-1):
ndx = f[j] if f[j] < ndx else f[j] + 1
a.append(iterable[ndx])
L.append(a)
return L
Using %timeit in my Spyder/IPython configuration on crp3([0,1], 3) shows 8.54 µs per loop while your crp2([0,1], 3) shows 133 µs per loop. That shows a sizeable speed improvement! My routine works best where iterable is short and r is large--your routine finds len ** r lines (where len is the length of the iterable) and filters them while mine finds len * (len-1) ** (r-1) lines without filtering.
By the way, your crp2() does do filtering, as shown by the if lines in your code that is execed. The sole if in my code does not filter a line, it modifies an item in the line. My code does return surprising results if the items in the iterable are not unique: if that is a problem, just change the iterable to a set to remove the duplicates. Note that I replaced your l name with L: I think l is too easy to confuse with 1 or I and should be avoided. My code could easily be changed to a generator: replace L.append(a) with yield a and remove the lines L = [] and return L.
How about:
from itertools import product
result = [ x for x in product(iterable,repeat=r) if all(x[i-1] != x[i] for i in range(1,len(x))) ]
Elaborating on #peter-de-rivaz's idea (divide and conquer). When you divide the sequence to create into two subsequences, those subsequences are the same or very close. If r = 2*k is even, store the result of crp(k) in a list and merge it with itself. If r=2*k+1, store the result of crp(k) in a list and merge it with itself and with L.
def large(L, r):
if r <= 4: # do not end the divide: too slow
return small(L, r)
n = r//2
M = large(L, r//2)
if r%2 == 0:
return [x + y for x in M for y in M if x[-1] != y[0]]
else:
return [x + y + (e,) for x in M for y in M for e in L if x[-1] != y[0] and y[-1] != e]
small is an adaptation from #eric-duminil's answer using the famous for...else loop of Python:
from itertools import product
def small(iterable, r):
for seq in product(iterable, repeat=r):
prev, *tail = seq
for e in tail:
if e == prev:
break
prev = e
else:
yield seq
A small benchmark:
print(timeit.timeit(lambda: crp2( [0, 1, 2], 10), number=1000))
#0.16290732200013736
print(timeit.timeit(lambda: crp2( [0, 1, 2, 3], 15), number=10))
#24.798989593000442
print(timeit.timeit(lambda: large( [0, 1, 2], 10), number=1000))
#0.0071403849997295765
print(timeit.timeit(lambda: large( [0, 1, 2, 3], 15), number=10))
#0.03471425700081454

Categories