Get random items from range of list - python

Let's say I have an unsorted set of items:
input = set([45, 235, 3, 77, 55, 80, 154])
I need to get random values from this input, but only within a specific range. E.g. when I have
ran = [50, 100]
I want it to return 77, 55, or 80. What's the fastest way to do this for large sets in Python?

Using a set here isn't the right approach, because its elements aren't sorted: you'd have to test every element against the boundaries, which is an O(N) solution.
I'd suggest turning the data into a sorted list; then you can use bisect to find the start and end indexes for your boundary values and pick a random index in that range:
import bisect,random
data = sorted([45, 235, 3, 77, 55, 80, 154])
def rand(start, stop):
    start_index = bisect.bisect_left(data, start)
    end_index = bisect.bisect_right(data, stop)
    return data[random.randrange(start_index, end_index)]
print(rand(30,100))
bisect has O(log(N)) complexity on sorted lists. Then pick an index with random.randrange.
bisect uses compiled code on mainstream platforms, so it's very efficient besides its low complexity.
Boundaries are validated by performing a limit test:
print(rand(235,235))
which prints 235 as expected (it's always tricky to make sure the indices stay in bounds when using random).
(If you want to update your data while running, you can also use bisect to insert elements. It's slower than with a set because of the O(log N) search plus the O(N) list insertion, of course, but you can't have everything.)
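For example, a minimal sketch of inserting a new value while keeping the list sorted (bisect.insort does the O(log N) search, but the underlying list insertion is O(N)):
import bisect
data = sorted([45, 235, 3, 77, 55, 80, 154])
bisect.insort(data, 60)  # insert 60 at its sorted position
print(data)              # [3, 45, 55, 60, 77, 80, 154, 235]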

You didn't clarify whether or not you can use numpy, but you did ask for "the fastest", so I'll include a numpy method for completeness. In this case, the "python_method" approach is the answer given by Jean-François Fabre above.
import numpy as np
import bisect,random
data = np.random.randint(0, 60, 10000)
high = 25
low = 20
def python_method(data, low, high):
    data = sorted(data)
    start_index = bisect.bisect_left(data, low)
    end_index = bisect.bisect_right(data, high)
    return data[random.randrange(start_index, end_index)]
def numpy_method(data, low, high):
    return np.random.choice(data[(data >= low) & (data <= high)])
Timings:
%timeit python_method(data, low, high)
2.34 ms ± 11.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit numpy_method(data, low, high)
33.2 µs ± 72.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Obviously, though, you'd only sort the list once if you were using that function several times, so that cuts the Python runtime down to roughly the same level.
%timeit new_data = sorted(data)
2.33 ms ± 39.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
numpy would pull ahead again in cases where you needed multiple results from within a single range as you could get them in a single call.
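For example, a sketch of such a helper (numpy_method_multi is a hypothetical name, not part of the code above; note that np.random.choice samples with replacement by default):
def numpy_method_multi(data, low, high, k):
    # Filter once, then draw k samples in a single vectorized call.
    candidates = data[(data >= low) & (data <= high)]
    return np.random.choice(candidates, size=k)
print(numpy_method_multi(data, low, high, 5))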
EDIT:
In the case that the input array is already sorted, and you're sure you can exploit that (taking sorted() out of timeit), the pure python method wins in the case of picking single values:
%timeit python_method(data, low, high)
5.06 µs ± 16.3 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The un-modified numpy method gives:
%timeit numpy_method(data, low, high)
20.5 µs ± 668 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So, as far as I can tell, in cases where the list is already sorted, and you only want one result, the pure-python method wins. If you wanted multiple results from within that range it might be different but I'm benchmarking against randrange.
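For completeness, here is a sketch of what drawing multiple values from the already-sorted list could look like (python_method_multi is a hypothetical name; it samples with replacement via random.choices and assumes data_sorted is already sorted):
def python_method_multi(data_sorted, low, high, k):
    start_index = bisect.bisect_left(data_sorted, low)
    end_index = bisect.bisect_right(data_sorted, high)
    # Pick k indices in the valid range, then look up the values.
    return [data_sorted[i] for i in random.choices(range(start_index, end_index), k=k)]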

from random import randint
input = set([45, 235, 3, 77, 55, 80, 154])
ran = [50, 100]
valid_values = []
for i in input:
    if ran[0] <= i <= ran[1]:
        valid_values.append(i)
random_index = randint(0, len(valid_values)-1)
print(valid_values[random_index])

Here is my suggestion that I find readable, easy to understand and quite short:
import random
inputSet = set([45, 235, 3, 77, 55, 80, 154])
ran = [50,100]
# Get list of elements inside the range
a = [x for x in inputSet if x in range(ran[0], ran[1] + 1)]  # +1 so the upper bound is inclusive
# Print a random element
print(random.choice(a)) # randomly 55, 77 or 80
Note that I have not used the name input for the set, because it would shadow the built-in input function.

Related

Why do these algorithms differ in their execution time?

For the LeetCode problem 'Top K Frequent Elements' (https://leetcode.com/problems/top-k-frequent-elements/submissions/),
there is a solution that completes the task in just 88 ms, while mine completes it in 124 ms, which I see as a large difference.
I tried to understand why, but the docs don't describe how the function I use, most_common(), is implemented. If I want to dig into details like that, so that I can write algorithms that run this fast in the future, what should I read (specific books, or any other resource)?
my code (124 ms)
def topKFrequent(self, nums, k):
    if k == len(nums):
        return nums
    c = Counter(nums)
    return [t[0] for t in c.most_common(k)]
The other solution (88 ms, faster):
def topKFrequent(self, nums, k):
    if k == len(nums):
        return nums
    count = Counter(nums)
    return heapq.nlargest(k, count.keys(), key=count.get)
Both take nearly the same amount of memory, so there's no difference there.
The implementation of most_common also uses heapq.nlargest, but it calls it with count.items() instead of count.keys(). This makes it a tiny bit slower, and it also requires the overhead of creating a new list in order to extract the [0] value from each element of the list returned by most_common().
The heapq.nlargest version avoids this extra overhead by passing count.keys() as the second argument, so it doesn't need to iterate over the result again to extract pieces into a new list.
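For reference, a rough sketch of the two call paths side by side (a simplification of what CPython's most_common does internally; the tiny nums and k are just for illustration):
import heapq
from collections import Counter
from operator import itemgetter
nums = [1, 1, 1, 2, 2, 3]
k = 2
count = Counter(nums)
# Roughly what count.most_common(k) does: nlargest over (key, count) pairs,
# which then still needs unpacking to get just the keys.
via_items = [key for key, _ in heapq.nlargest(k, count.items(), key=itemgetter(1))]
# The faster variant: nlargest directly over the keys, counts looked up via count.get.
via_keys = heapq.nlargest(k, count.keys(), key=count.get)
print(via_items, via_keys)  # [1, 2] [1, 2]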
@trincot seems to have answered the question, but if anyone is looking for a faster way to do this, then use numpy, provided nums can be stored as an np.array:
import numpy as np
def topKFrequent_numpy(nums, k):
    unique, counts = np.unique(nums, return_counts=True)
    return unique[np.argsort(-counts)[:k]]
One speed test
nums_array = np.random.randint(1000, size=1000000)
nums_list = list(nums_array)
%timeit topKFrequent_Counter(nums_list, 500)
# 116 ms ± 4.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit topKFrequent_heapq(nums_list, 500)
# 117 ms ± 3.35 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit topKFrequent_numpy(nums_array, 500)
# 39.2 ms ± 185 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
(Speeds may be dramatically different for other input values)

How to remove nested for loop?

I have the following nested loop:
sum_tot = 0.0
for i in range(len(N)-1):
    for j in range(len(N)-1):
        sum_tot = sum_tot + N[i]**2*N[j]**2*W[i]*W[j]*x_i[j][-1] / (N[j]**2 - x0**2) * (z_i[i][j] - z_j[i][j]) * x_j[i][-1] / (N[i]**2 - x0**2)
It's basically a mathematical function that has a double summation. Each sum goes up to the length of N. I've been trying to figure out if there was a way to write this without using a nested for-loop in order to reduce computational time. I tried using list comprehension, but the computational time is similar if not the same. Is there a way to write this expression as matrices to avoid the loops?
Note that with your current loops, the indices stop at len(N)-2: range goes up to but does not include its argument, and you've subtracted 1 as well. You probably mean to write for i in range(len(N)).
It's also difficult to reduce the summation itself: the time it takes is determined by the number of terms computed, so if you write it a different way that still involves the same number of terms, it will take just as long. However, O(n^2) isn't exactly bad: it looks like the best you can do here unless you find a mathematical simplification of the problem.
You might consider checking this post to gather ways to write out the summation in a neater fashion.
@Kraigolas makes valid points, but let's try a few benchmarks on a dummy double-nested operation either way. (Hint: Numba might help you speed things up.)
Note: I would avoid numpy arrays here specifically because the full cross product of the ranges would be in memory at once. If the range is massive, you may run out of memory.
Nested for loops
n = 5000
s1 = 0
for i in range(n):
    for j in range(n):
        s1 += (i/2) + (j/3)
print(s1)
#2.26 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
List comprehension
n = 5000
s2 = 0
s2 = sum([i/2+j/3 for i in range(n) for j in range(n)])
print(s2)
#3.2 s ± 307 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Itertools product
from itertools import product
n = 5000
s3 = 0
for i, j in product(range(n), repeat=2):
    s3 += (i/2) + (j/3)
print(s3)
#2.35 s ± 186 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Note: when using Numba, you'll want to run the function at least once before timing it, because the first call compiles the code and is therefore slow. The real speedup comes from the second run onwards.
Numba njit (SIMD)
from numba import njit
n = 5000
@njit
def f(n):
    s = 0
    for i in range(n):
        for j in range(n):
            s += (i/2) + (j/3)
    return s
s4 = f(n)
#29.4 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
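To make the warm-up note above concrete, here is a minimal sketch of the timing pattern (the first call pays the JIT compilation cost, so it is excluded from the measurement):
import time
f(10)                         # warm-up call: triggers JIT compilation of f
start = time.perf_counter()
s4 = f(n)                     # this call runs the already-compiled code
print(time.perf_counter() - start)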
Numba njit parallel with prange
An excellent suggestion by @Tim, added to the benchmarks:
from numba import njit, prange
@njit(parallel=True)
def f(n):
    s = 0
    for i in prange(n):
        for j in prange(n):
            s += (i/2) + (j/3)
    return s
s5 = f(n)
#21.8 ms ± 4.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
A significant speedup with Numba, as expected. Maybe try that?
To convert this to matrix calculations, I would suggest combining some terms first.
If these objects are not numpy arrays, it's better to convert them to numpy arrays, since they support element-wise operations.
To convert, simply do
import numpy
N = numpy.array(N)
W = numpy.array(W)
x_i = numpy.array(x_i)
x_j = numpy.array(x_j)
z_i = numpy.array(z_i)
z_j = numpy.array(z_j)
Then,
common_terms = N**2 * W / (N**2 - x0**2)
i_terms = common_terms * x_j[:, -1]
j_terms = common_terms * x_i[:, -1]
i_j_matrix = z_i - z_j
sum_output = (i_terms.reshape((1, -1)) @ i_j_matrix @ j_terms.reshape((-1, 1)))[0, 0]
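As a sanity check, here is a small sketch comparing the vectorized expression against the original nested loop on dummy data (the shapes are assumptions inferred from the indexing in the question):
import numpy as np
n = 4
rng = np.random.default_rng(0)
N = rng.random(n) + 2.0      # +2 keeps N**2 - x0**2 away from zero
W = rng.random(n)
x_i = rng.random((n, 3))
x_j = rng.random((n, 3))
z_i = rng.random((n, n))
z_j = rng.random((n, n))
x0 = 0.5
# Vectorized version from above
common_terms = N**2 * W / (N**2 - x0**2)
i_terms = common_terms * x_j[:, -1]
j_terms = common_terms * x_i[:, -1]
sum_vec = (i_terms.reshape((1, -1)) @ (z_i - z_j) @ j_terms.reshape((-1, 1)))[0, 0]
# Plain double loop over the full index range for comparison
sum_loop = 0.0
for i in range(n):
    for j in range(n):
        sum_loop += (N[i]**2 * N[j]**2 * W[i] * W[j] * x_i[j][-1] / (N[j]**2 - x0**2)
                     * (z_i[i][j] - z_j[i][j]) * x_j[i][-1] / (N[i]**2 - x0**2))
print(np.isclose(sum_vec, sum_loop))  # True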

Generating lists filled with unique integer values in Python

First of all, I thought the exact same question had been answered before, but after some brief research I couldn't find any threads leading me to the answer I wanted, which means I didn't dig enough or missed some keywords. Sorry if that question is already out there.
Anyway, I've started to learn Python and was going through some exercises. I needed to create a list containing 10 randomly generated integers, and each integer must have a different value.
So I tried to compare each element of the list with the next one, and if they're the same, to generate a new number with an if statement.
import random
listA = []
for i in range(0,10):
    x = random.randint(1,100)
    if listA[i] == listA[i-1]:
        x = random.randint(1,100)
    else:
        listA.append(x)
listA.sort()
print(listA)
But I got an error saying "list index out of range".
I expected the if statement to start at index 0 and count up to index 9 of listA, comparing elements as it goes and generating another random number whenever two are the same. But obviously my indexing was wrong.
Also, any other comments on the code would be appreciated.
Thanks for your time in advance.
In Python, a set can only contain unique values, so in the following code, duplicate random numbers won't increase the length of the set:
import random
s = set()
while len(s) < 10:
    s.add(random.randint(1,100))
print(sorted(s))
Output:
[18, 20, 26, 48, 51, 72, 75, 92, 94, 99]
Try the following.
import random
listA = []
while len(listA) < 10:
    x = random.randint(1,100)
    if x not in listA:
        listA.append(x)
listA.sort()
print(listA)
Explanation:
You should use a while loop so that you keep generating numbers until your desired list is actually 10 numbers. When using the for loop, if you happen to generate [2, 2, 30, 40, 2, 10, 20, 83, 92, 29] at random, your list will only be 8 numbers long because the duplicate 2's will not be added, although you have already looped through your for loop 10 times.
while is the key here, since you can never reliably predict how many attempts it will take to randomly get 10 different numbers; therefore you want to keep going while you haven't reached the desired length.
Also, the keyword in is a simple way to check if something already exists inside something else.
This can be thought of as sampling without replacement. In this case, you are "sampling" 10 items at random from range(1, 101) and each item that is sampled can only be sampled once (i.e. it is not "replaced" - imagine drawing numbered balls at random from a bag to understand the concept).
Sampling without replacement can be handled in one line:
import random
listA = random.sample(range(1, 101), 10)
Another way of thinking about it is to shuffle list(range(1, 101)) and take the first 10 elements:
import random
listA = list(range(1, 101))
random.shuffle(listA)
listA[:10]
Timing the different approaches
Using the %timeit magic in IPython, we can compare the different approaches suggested in the answers:
def random_sample():
    import random
    return sorted(random.sample(range(1, 101), 10))
def random_shuffle():
    import random
    listA = list(range(1, 101))
    random.shuffle(listA)
    return sorted(listA[:10])
def while_loop():
    import random
    listA = []
    while len(listA) < 10:
        x = random.randint(1, 100)
        if x not in listA:
            listA.append(x)
    return sorted(listA)
def random_set():
    import random
    s = set()
    while len(s) < 10:
        s.add(random.randint(1, 100))
    return sorted(s)
%timeit for i in range(100): random_sample()
# 1.38 ms ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit for i in range(100): random_shuffle()
# 6.81 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit for i in range(100): while_loop()
# 1.61 ms ± 18.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit for i in range(100): random_set()
# 1.48 ms ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Python, efficient parallel operation using a dict

First, sorry for my imperfect English.
My problem is simple to explain, I think.
result={}
list_tuple=[(float,float,float),(float,float,float),(float,float,float)...]#200k tuples
threshold=[float,float,float...] #max 1k values
for tuple in list_tuple:
    for value in threshold:
        if max(tuple) > value and min(tuple) < value:
            if value in result:
                result[value].append(tuple)
            else:
                result[value] = []
                result[value].append(tuple)
list_tuple contains around 200k tuples; I have to do this operation very fast (2-3 seconds max on a normal PC).
My first attempt was to do this in Cython with prange() (so I could benefit from the Cython optimization and from parallel execution), but the problem is, as always, the GIL: in prange() I can manage lists and tuples using Cython memoryviews, but I can't insert my results into a dict.
In Cython I also tried using the C++ std unordered_map, but now the problem is that I can't make a vector of arrays in C++ (which would be the value of my dict).
The second problem is similar:
list_tuple=[((float,float),(float,float)),((float,float),(float,float))...]#200k tuples of tuples
result={list_tuple[0][0]:[]}
for tuple in list_tuple:
    if tuple[0] in result:
        result[tuple[0]].append(tuple)
    else:
        result[tuple[0]] = []
Here I also have another problem: if I want to use prange() I have to use a custom hash function in order to use an array as the key of a C++ unordered_map.
As you can see, my snippets would be very simple to run in parallel.
I thought about trying numba, but it would probably be the same because of the GIL, and I prefer to use Cython because I need binary code (this library could be part of a commercial software product, so only binary libraries are allowed).
In general I would like to avoid C/C++ functions; what I hope to find is a way to manage something like dicts/lists in parallel, with Cython-level performance, while remaining as much as possible in the Python domain. But I'm open to any advice.
Thanks
Several performance improvements can be achieved, also by using numpy's vectorization features:
The min and max values are currently computed anew for each threshold. Instead they can be precomputed and then reused for each threshold.
The loop over data samples (list_tuple) is performed in pure Python. This loop can be vectorized using numpy.
In the following tests I used data.shape == (200000, 3); thresh.shape == (1000,) as indicated in the OP. I also omitted modifications to the result dict since depending on the data this can quickly overflow memory.
Applying 1.
v_min = [min(t) for t in data]
v_max = [max(t) for t in data]
for mi, ma in zip(v_min, v_max):
for value in thresh:
if ma > value and mi < value:
pass
This yields a performance increase of about 5x compared to the OP's code.
Applying 1. & 2.
v_min = data.min(axis=1)
v_max = data.max(axis=1)
mask = np.empty(shape=(data.shape[0],), dtype=bool)
for t in thresh:
    mask[:] = (v_min < t) & (v_max > t)
    samples = data[mask]
    if samples.size > 0:
        pass
This yields a performance increase of about 30x compared to the OP's code. This approach has the additional benefit that it doesn't involve incremental appends to lists, which can slow down the program since memory reallocation might be required. Instead it creates each list (one per threshold) in a single step.
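For reference, here is a sketch of how the omitted dict construction could look with the same masking approach (assuming the per-threshold lists fit in memory):
result = {}
for t in thresh:
    mask[:] = (v_min < t) & (v_max > t)
    samples = data[mask]
    if samples.size > 0:
        result[t] = samples.tolist()  # one list per threshold, built in a single step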
@a_guest's code:
def foo1(data, thresh):
    data = np.asarray(data)
    thresh = np.asarray(thresh)
    condition = (
        (data.min(axis=1)[:, None] < thresh)
        & (data.max(axis=1)[:, None] > thresh)
    )
    result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}
    return result
This code creates a dictionary entry once for each item in thresh.
The OP's code, simplified a bit with defaultdict (from collections):
def foo3(list_tuple, threshold):
    result = defaultdict(list)
    for tuple in list_tuple:
        for value in threshold:
            if max(tuple) > value and min(tuple) < value:
                result[value].append(tuple)
    return result
This one updates a dictionary entry once for each item that meets the criteria.
And with his sample data:
In [27]: foo1(data,thresh)
Out[27]: {0: [], 1: [[0, 1, 2]], 2: [], 3: [], 4: [[3, 4, 5]]}
In [28]: foo3(data.tolist(), thresh.tolist())
Out[28]: defaultdict(list, {1: [[0, 1, 2]], 4: [[3, 4, 5]]})
time tests:
In [29]: timeit foo1(data,thresh)
66.1 µs ± 197 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# In [30]: timeit foo3(data,thresh)
# 161 µs ± 242 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [31]: timeit foo3(data.tolist(),thresh.tolist())
30.8 µs ± 56.4 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Iteration over arrays is slower than over lists. The time for tolist() is minimal; np.asarray on lists takes longer.
With a larger data sample, the array version is faster:
In [42]: data = np.random.randint(0,50,(3000,3))
...: thresh = np.arange(50)
In [43]:
In [43]: timeit foo1(data,thresh)
16 ms ± 391 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [44]: %%timeit x,y = data.tolist(), thresh.tolist()
...: foo3(x,y)
...:
83.6 ms ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Edit
Since this approach basically performs an outer product between the data samples and the threshold values, it increases the required memory significantly, which might be undesirable. An improved approach can be found here. I keep this answer nevertheless for future reference, since it was referred to in this answer.
I found the performance increase compared to the OP's code to be a factor of ~20.
This is an example using numpy. The data is vectorized and so are the operations. Note that the resulting dict contains empty lists, as opposed to the OP's example, and hence might require an additional cleaning step, if appropriate.
import numpy as np
# Data setup
data = np.random.uniform(size=(200000, 3))
thresh = np.random.uniform(size=1000)
# Compute tuples for thresholds.
condition = (
    (data.min(axis=1)[:, None] < thresh)
    & (data.max(axis=1)[:, None] > thresh)
)
result = {v: data[c].tolist() for c, v in zip(condition.T, thresh)}

How to get random.sample() from deque in Python 3?

I have a collections.deque() of tuples from which I want to draw random samples.
In Python 2.7, I can use batch = random.sample(my_deque, batch_size).
But in Python 3.4 this raises TypeError: Population must be a sequence or set. For dicts, use list(d).
What's the best workaround, or recommended way to sample efficiently from a deque in Python 3?
The obvious way – convert to a list.
batch = random.sample(list(my_deque), batch_size)
But you can avoid creating an entire list.
idx_batch = set(sample(range(len(my_deque)), batch_size))
batch = [val for i, val in enumerate(my_deque) if i in idx_batch]
P.S. (Edited)
Actually, random.sample should work fine with deques in Python >= 3.5, because the class has been updated to match the Sequence interface.
In [3]: deq = collections.deque(range(100))
In [4]: random.sample(deq, 10)
Out[4]: [12, 64, 84, 77, 99, 69, 1, 93, 82, 35]
Note: as Geoffrey Irving correctly states in the comment below, you'd be better off converting the deque to a list, because deques are implemented as linked lists, making each index access O(n) in the size of the queue; therefore sampling m random values takes O(m*n) time.
sample() on a deque works fine in Python ≥3.5, and it's pretty fast.
In Python 3.4, you could use this instead, which runs about as fast:
sample_indices = sample(range(len(deq)), 50)
[deq[index] for index in sample_indices]
On my MacBook using Python 3.6.8, this solution is over 44 times faster than Eli Korvigo's solution. :)
I used a deque with 1 million items, and I sampled 50 items:
from random import sample
from collections import deque
deq = deque(maxlen=1000000)
for i in range(1000000):
    deq.append(i)
sample_indices = set(sample(range(len(deq)), 50))
%timeit [deq[i] for i in sample_indices]
1.68 ms ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit sample(deq, 50)
1.94 ms ± 60.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit sample(range(len(deq)), 50)
44.9 µs ± 549 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit [val for index, val in enumerate(deq) if index in sample_indices]
75.1 ms ± 410 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
That said, as others have pointed out, a deque is not well suited for random access. If you want to implement a replay memory, you could instead use a rotating list like this:
class ReplayMemory:
    def __init__(self, max_size):
        self.buffer = [None] * max_size
        self.max_size = max_size
        self.index = 0
        self.size = 0
    def append(self, obj):
        self.buffer[self.index] = obj
        self.size = min(self.size + 1, self.max_size)
        self.index = (self.index + 1) % self.max_size
    def sample(self, batch_size):
        indices = sample(range(self.size), batch_size)
        return [self.buffer[index] for index in indices]
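The mem object timed below is assumed to be created and filled like this, mirroring the deque setup above:
mem = ReplayMemory(max_size=1000000)
for i in range(1000000):
    mem.append(i)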
With a million items, sampling 50 items is blazingly fast:
%timeit mem.sample(50)
#58 µs ± 691 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
