I'm experiencing very slow performance with the algorithm below.
I have a very large (1,000,000+) list containing long strings.
e.g.: id_list = ['MYSUPERLARGEID:1123:123123', 'MYSUPERLARGEID:1123:134534389', 'MYSUPERLARGEID:1123:12763']...
num_reads is the maximum number of elements to randomly choose from this list.
The idea is to randomly choose one of the string IDs in id_list until num_reads is reached, and to add them (I say add, not append, because I don't care about the order of random_id_list) to random_id_list, which is empty at the beginning.
I can't repeat the same ID, so I remove it from the original list after it is randomly chosen. I suspect this is what makes the script so slow... but maybe I'm wrong and another part of this loop is responsible for the slow behavior.
for x in xrange(0, num_reads):
    id_index, id_string = random.choice(list(enumerate(id_list)))
    random_id_list.append(id_string)
    del id_list[id_index]
Use random.sample() to produce a sample of N elements with no repeats:
random_id_list = random.sample(id_list, num_reads)
Removing elements from the middle of a large list is indeed slow, as everything beyond that index has to be moved up a step.
This does not, of course, remove elements from the original list anymore, so repeated random.sample() calls can still give you samples with elements that have been picked before. If you need to produce samples repeatedly until your list is exhausted, then shuffle once and from there on out take consecutive slices of k elements from the shuffled list:
def random_samples(k):
    random.shuffle(id_list)
    for i in range(0, len(id_list), k):
        yield id_list[i : i + k]
Then use this to produce your samples, either in a loop or with next():
sample_gen = random_samples(num_reads)
random_id_list = next(sample_gen)
# some point later
another_random_id_list = next(sample_gen)
Because the list is shuffled entirely randomly, the slices produced this way are also all valid random samples.
The "hard" way, instead of just shuffling the list, is to evaluate each element of your list in order, and selecting the item with a probability that relies on both the number of items you still need to choose and the number of items left to choose from. This is useful if you don't have the entire list presented to you at once (a so-called on-line algorithm).
Let's say you need to select k of N items. That means each item has a k/N probability of being chosen, if you can consider all items at once. However, if you accept the first item, then you only need to select k-1 items from N-1 remaining items. If you reject it, you still need k items from N-1 remaining items. So the algorithm would look like
N = len(id_list)
k = 10  # for example
choices = []

for item in id_list:
    if random.randint(1, N) <= k:
        choices.append(item)
        k -= 1
    N -= 1
Initially, the first item is chosen with the expected probability of k/N. As you go through the list, N steadily decreases, while k decreases only as you actually accept items. Note that each item, overall, still has a p = k/N chance of being chosen. For example, let pi be the probability that you choose the ith element in the list. p1 is obviously k/N, given the starting values of k and N. Now consider p2:
p2 = p1 * (k-1)/(N-1) + (1-p1) * k/(N-1)
   = (p1*k - p1 + k - k*p1) / (N-1)
   = (k - p1) / (N-1)
   = (k - k/N) / (N-1)
   = k/(N-1) - k/(N*(N-1))
   = (k*N - k) / (N*(N-1))
   = k/N
Similar (but longer) analysis holds for p3, p4, etc.
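If you want to convince yourself empirically, here is a minimal simulation sketch (my own addition; the population size, k, and trial count are arbitrary) showing that every position ends up selected with frequency close to k/N:
import random
from collections import Counter

def select_k(items, k):
    """One pass of the on-line selection algorithm above."""
    N = len(items)
    chosen = []
    for item in items:
        if random.randint(1, N) <= k:
            chosen.append(item)
            k -= 1
        N -= 1
    return chosen

trials = 10000
counts = Counter()
for _ in range(trials):
    counts.update(select_k(range(20), 5))  # k/N = 5/20 = 0.25

for position in sorted(counts):
    print(position, counts[position] / trials)  # each ratio is close to 0.25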
I'm trying to utilize merge sort + divide & conquer on a user's input of random words. I'm taking the user's input and dividing the given words into two arrays:
inputted_sentence = input("Enter your sentence here: \n")
separated_inputs = [word.lower() for word in inputted_sentence.split()]
inputs_length = len(separated_inputs)

# separate the arrays
array_one = separated_inputs[:inputs_length // 2]
array_two = separated_inputs[inputs_length // 2:]

# grab the length of each array
length_array_one = len(array_one)
length_array_two = len(array_two)
After that, I'm sorting them (in alphabetical order):
# first array being sorted and stored
for a in range(length_array_one - 1):
    for b in range(length_array_one - a - 1):
        if array_one[b] > array_one[b + 1]:
            array_one[b], array_one[b + 1] = array_one[b + 1], array_one[b]

sorted_array_one = []
for words in array_one:
    sorted_array_one.append(words)

# second array being sorted and stored
for a in range(length_array_two - 1):
    for b in range(length_array_two - a - 1):
        if array_two[b] > array_two[b + 1]:
            array_two[b], array_two[b + 1] = array_two[b + 1], array_two[b]

sorted_array_two = []
for words in array_two:
    sorted_array_two.append(words)
This image shows the two arrays: https://i.stack.imgur.com/9ZrHb.png
Now I need to compare blue to aaple (blue is greater), compare blue to apple (greater), compare blue to cat (less), so blue takes index [2] in the final array.
After that, rabbit is compared with cat (greater), then with dog (greater), and takes the spot after dog.
Edit: my version one (below) does this, but it doesn't utilize the sorted arrays, as it just sorts all the words over again.
unsorted_final = sorted_array_one + sorted_array_two
length_unsorted_final = len(unsorted_final)
sorted_array_final = []

# final array sorted and stored
for a in range(length_unsorted_final - 1):
    for b in range(length_unsorted_final - a - 1):
        if unsorted_final[b] > unsorted_final[b + 1]:
            unsorted_final[b], unsorted_final[b + 1] = unsorted_final[b + 1], unsorted_final[b]

for words in unsorted_final:
    sorted_array_final.append(words)

print(sorted_array_final)
Merge sort uses a helper algorithm called "merge", which works by taking 2 sorted arrays, M and N, and combining them into a new array S that is also sorted. It does this by taking advantage of a simple invariant: given these 2 sorted input arrays, the smallest element in their union will always be either m = M[0] or n = N[0], and this remains true even after repeatedly removing the smaller of either m or n. If m <= n then we can do S.append(M.pop(0)), removing m from M UNION N and adding it to the end of S, and the invariant will still be true. We can keep doing this until both input lists are empty, leaving S = M UNION N and S is sorted.
You can implement that in Python like this:
def merge(left, right):
    result = []
    while left or right:
        if left and right:
            if left[0] <= right[0]:
                result.append(left.pop(0))
            else:
                result.append(right.pop(0))
        elif left:
            result.extend(left)
            break
        else:
            result.extend(right)
            break
    return result
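A quick check with made-up inputs (note that merge() as written consumes its arguments via pop(0), so pass copies if you still need them afterwards):
print(merge([1, 3, 5], [2, 4, 6]))        # [1, 2, 3, 4, 5, 6]
print(merge(["apple", "cat"], ["blue"]))  # ['apple', 'blue', 'cat']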
Note that this is not quite MergeSort; it's actually just the merge step. To turn this into a fully working MergeSort, you need to implement the divide-and-conquer part. The most straightforward way of doing that is by taking the single input list and repeatedly splitting it in half, calling mergesort on each half, then merging them:
def mergesort(A, depth=0):
    if (count := len(A)) > 1:
        left = mergesort(A[:count // 2], depth + 1)
        right = mergesort(A[count // 2:], depth + 1)
        print("{D}A: {A}\n{D}left: {L}\n{D}right: {R}".format(A=A, L=left, R=right, D=" " * depth))
        result = merge(left, right)
        print("{D}result: {S}".format(S=result, D=" " * depth))
        return result
    return A
(print statements and depth parameter added for demo purposes. This is not the most efficient way of implementing this, but I think it is the most illustrative.)
Hopefully you see how this works: since merge requires 2 sorted lists, we need to make sure to provide that. We do that by breaking the input down into smaller and smaller lists, until they only contain 1 element, which is trivially sorted. Then we proceed to build up longer and longer lists of sorted elements, until the whole input is done. I think the takeaway here is this: MergeSort is divide and conquer. Or, perhaps more accurately, MergeSort is the "divide" part, and the merge procedure is the "conquer".
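Applied to your question: you already have two sorted halves, so the merge step alone is enough (a sketch reusing the names from your code; copies are passed because this merge() empties its inputs):
sorted_array_final = merge(sorted_array_one[:], sorted_array_two[:])
print(sorted_array_final)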
I am looking to generate a list of combinations such that the total is always 100. The combinations have to be generated based on a jump value (similar to the step we use in range or a loop).
The number of elements in each combination is based on the length of the parent_list. If the parent list has 10 elements, each list in the output needs to have 10 elements too.
parent_list=['a','b','c', 'd']
jump=15
A sample of the expected output is:
[[15,25,25,35],[30,50,10,10],[20,15,20,45]]
I used the solution given in this question, but it doesn't give the option to add the jump parameter. Fractions are allowed too.
This program finds all combinations of n positive integers whose sum is total such that at least one of them is a multiple of jump. It works in a recursive way, setting jump to 1 if the current sequence already contains an element that's a multiple of the original jump.
def find_sum(n, total, jump, seq=()):
    if n == 0:
        if total == 0 and jump == 1:
            yield seq
        return
    for i in range(1, total + 1):
        yield from find_sum(n - 1, total - i, jump if i % jump else 1, seq + (i,))

for seq in find_sum(4, 100, 15):
    print(seq)
There are still a lot of solutions.
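If you just want to know how many solutions there are, the generator makes that cheap to count (a usage sketch):
num_solutions = sum(1 for _ in find_sum(4, 100, 15))
print(num_solutions)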
Having trouble figuring out a nice way to get this task done.
Say I have a list of triangular numbers up to 1000 -> [0, 1, 3, 6, 10, 15, ...] etc.
Given a number, I want to return the consecutive elements in that list that sum to that number.
e.g.
64 --> [15,21,28]
225 --> [105,120]
371 --> [36, 45, 55, 66, 78, 91]
if there's no consecutive numbers that add up to it, return an empty list.
882 --> [ ]
Note that the length of consecutive elements can be any number - 3,2,6 in the examples above.
The brute-force way would be to iteratively check every possible consecutive grouping for each starting element (start at 0, look at the sum of [0, 1], look at the sum of [0, 1, 3], and so on until the sum is greater than the target number). But that's probably O(n^2) or maybe worse. Is there any way to do it better?
UPDATE:
Ok, so a friend of mine figured out a solution that works in O(n) (I think) and is intuitively easy to follow. It might be similar (or even the same) as Gabriel's answer, but that one was difficult for me to follow, and I like that this solution is understandable even from a basic perspective. This is an interesting question, so I'll share her answer:
from functools import reduce  # needed on Python 3; built in on Python 2

def findConsec(input1=7735):
    list1 = range(1, 1001)
    newlist = [reduce(lambda x, y: x + y, list1[0:i]) for i in list1]
    curr = 0
    end = 2
    num = sum(newlist[curr:end])
    while num != input1:
        if num < input1:
            num += newlist[end]
            end += 1
        elif num > input1:
            num -= newlist[curr]
            curr += 1
        if curr == end:
            return []
    if num == input1:
        return newlist[curr:end]
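Checking it against the examples from the question:
print(findConsec(64))   # [15, 21, 28]
print(findConsec(882))  # []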
A 3-iteration max solution
Another solution is to start close to where your target would be and walk forward from one position behind. For any number in the triangular list vec, its value can be defined by its index as:
vec[i] = sum(range(0,i+1))
Dividing the sum value you are looking for by the length of the group gives the average of the group, which hence lies within it, but may well not itself exist in it.
Therefore, you can set the starting point for finding a group of n numbers whose sum matches a value val to the integer part of that division. As it may not be in the list, the position used is the one that minimizes their difference.
# vec as np.ndarray -> the triangular or whatever-type series
# val as int -> sum of n elements you are looking for
# n as int -> number of elements to be summed
import numpy as np

def seq_index(vec, n, val):
    index0 = np.argmin(abs(vec - (val/n))) - n/2 - 1  # covers odd and even n values
    intsum = 0  # running sum to keep track of
    count = 0   # counter
    seq = []    # indices of vec that sum up to val
    # walk forward from the initial guess of where the group begins, or prior to it
    while count <= 2:
        intsum = sum(vec[(index0 + count):(index0 + count + n)])
        if intsum == val:
            seq.append(range(index0 + count, index0 + count + n))
        count += 1
    return seq
# Example
vec = []
for i in range(0, 100):
    vec.append(sum(range(0, i)))  # build the triangular series from i = 0 (0) to i = 99 (whose sum equals 4851)
vec = np.array(vec)  # convert to numpy to make it easier to query ranges

# looking for a value that belongs to the interval 0-4851
indices = seq_index(vec, 3, 4)

# print indices
print indices[0]
print vec[indices]
print sum(vec[indices])
Returns
print indices[0] -> [1, 2, 3]
print vec[indices] -> [0 1 3]
print sum(vec[indices]) -> 4 (which we were looking for)
This seems like an algorithm question rather than a question on how to do it in python.
Thinking backwards, I would copy the list and use it in a way similar to the Sieve of Eratosthenes. I would not consider the numbers that are greater than x. Then I'd start from the greatest number and sum backwards. Whenever the sum gets greater than x, I'd subtract the greatest number (excluding it from the solution) and continue summing backwards.
This seems the most efficient way to me, and it actually is O(n) - you never go back (or forward, in this backward algorithm), except when you subtract or remove the biggest element, which doesn't require accessing the list again - just a temp var.
To answer Dunes' question:
Yes, there is a reason - to subtract the next largest element in the case of a no-solution whose sum is larger. Going from the first element, hitting a no-solution would require accessing the list again (or the temporary solution list) to subtract a set of elements that sums to more than the next element to be added. You risk increasing the complexity by accessing more elements.
To improve efficiency in cases where an eventual solution is at the beginning of the sequence, you can search for the starting pair using binary search: once a pair of 2 elements smaller than x is found, you sum the pair, and if it sums to more than x you go left, otherwise you go right. This search has logarithmic complexity in theory. In practice, complexity is never what it is in theory, and you can do whatever you like :)
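Here is a minimal sketch of that backward idea (my own illustration, assuming tri is the ascending list of triangular numbers; note it may find a different valid run than the forward approach, e.g. [28, 36] rather than [15, 21, 28] for 64):
def consec_sum_backwards(x, tri):
    # ignore the numbers greater than x, then walk a window from the top down
    hi = 0
    while hi + 1 < len(tri) and tri[hi + 1] <= x:
        hi += 1
    if tri[hi] > x:
        return []
    lo, total = hi, tri[hi]
    while lo >= 0:
        if total == x:
            return tri[lo:hi + 1]
        if total > x:
            total -= tri[hi]  # exclude the biggest element from the solution
            hi -= 1
        else:
            lo -= 1           # keep summing backwards
            if lo >= 0:
                total += tri[lo]
    return []

tri = [n * (n + 1) // 2 for n in range(45)]  # triangular numbers up to 990
print(consec_sum_backwards(64, tri))   # [28, 36]
print(consec_sum_backwards(882, tri))  # []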
You could pick the first three elements, sum them, and then keep subtracting the first of the three and adding the next element in the list, checking whether the sum adds up to whatever number you want. That would be O(n).
# vec as np.ndarray
import numpy as np

itsum = sum(vec[0:3])  # the sliding sum to iterate and check
sequence = [range(0, 3)] if itsum == whatever else []  # index ranges of vec that add up to `whatever`

for i in range(3, len(vec)):
    itsum -= vec[i - 3]
    itsum += vec[i]
    if itsum == whatever:
        sequence.append(range(i - 2, i + 1))  # record each window that adds up to `whatever`
The solution you provide in the question isn't truly O(n) time complexity -- the way you compute your triangle numbers makes the computation O(n^2). The list comprehension throws away the previous work that went into calculating the last triangle number. That is: tn_i = tn_{i-1} + i (where tn_i is the ith triangle number). Since you also store the triangle numbers in a list, your space complexity is not constant, but related to the size of the number you are looking for. Below is an identical algorithm, but with O(n) time complexity and O(1) space complexity (written for Python 3).
# for python 2, replace things like `highest = next(high)` with `highest = high.next()`
from itertools import count, takewhile, accumulate

def find(to_find):
    # next(low) == lowest number in total
    # next(high) == highest number not in total
    low = accumulate(count(1))   # generator of triangle numbers
    high = accumulate(count(1))
    total = highest = next(high)
    # highest = highest number in the sequence that sums to total
    # definitely can't find a solution if the highest number in the sum is greater than to_find
    while highest <= to_find:
        # found a solution
        if total == to_find:
            # keep taking numbers from the low iterator until we find the highest number in the sum
            return list(takewhile(lambda x: x <= highest, low))
        elif total < to_find:
            # add the next highest triangle number not in the sum
            highest = next(high)
            total += highest
        else:  # total > to_find
            # subtract the lowest triangle number in the sum
            total -= next(low)
    return []
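Trying it on the question's examples:
print(find(64))   # [15, 21, 28]
print(find(371))  # [36, 45, 55, 66, 78, 91]
print(find(882))  # []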
I have a list of numbers as follows -
L = [ 1430185458, 1430185456, 1430185245, 1430185246, 1430185001 ]
I am trying to determine which numbers are within a range of 2 of each other. The list will be unsorted when I receive it.
If there are numbers within a range of 2 of each other, I have to return "1" at the exact position the number was received in.
I was able to achieve the desired result, however the code runs very slowly. My approach involves sorting the list, then iterating over it twice with two pointers and comparing successively. I will have millions of records coming in as separate lists.
Just trying to see what the best possible approach to this problem is.
Edit - Apologies, I was away for a while. The list can have any number of elements, ranging from 1 to n. The idea is to return either 0 or 1 at the exact position the number was received in. I cannot post the actual code I implemented, but here is pseudo code.
a. Create a new list as a list of lists, with the second part as 0 for each element. We assume that there are no numbers within a range of 2 of each other.
[[1430185458,0], [1430185456,0], [1430185245,0], [1430185246,0], [1430185001,0]]
b. Sort the original list.
c. Compare the first element to the second, the second to the third, and so on until the end of the list is reached; whenever the difference is less than or equal to 2, update the corresponding second elements from step a to 1.
[[1430185458,1], [1430185456,1], [1430185245,1], [1430185246,1], [1430185001,0]]
The goal is to be fast, so that presumably means an O(N) algorithm. Building an NxN difference matrix is O(N^2), so that's not good at all. Sorting is O(N*log(N)), so that's out, too. Assuming average case O(1) behavior for dictionary insert and lookup, the following is an O(N) algorithm. It rips through a list of a million random integers in a couple of seconds.
def in_range(numbers):
    result = [0] * len(numbers)
    index = {}
    for idx, number in enumerate(numbers):
        for offset in range(-2, 3):
            match_idx = index.get(number + offset)
            if match_idx is not None:
                result[match_idx] = result[idx] = 1
        index[number] = idx
    return result
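Running it on the list from the question:
L = [1430185458, 1430185456, 1430185245, 1430185246, 1430185001]
print(in_range(L))  # [1, 1, 1, 1, 0]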
Update
I have to return "1" at exact same position number was received in.
The update to the question asks for a list of the form [[1,1],[2,1],[5,0]] given an input of [1,2,5]. I didn't do that. Instead, my code returns [1,1,0] given [1,2,5]. It's about 15% faster to produce that simple 0/1 list compared to the [[value,in_range],...] list. The desired list can easily be created using zip:
zip(numbers,in_range(numbers)) # Generator
list(zip(numbers,in_range(numbers))) # List of (value,in_range) tuples
I think this does what you need (process() modifies the list L). Very likely it's still optimizable, though:
def process(L):
    s = [(v, k) for k, v in enumerate(L)]
    s.sort()
    j = 0
    for i, (v, _) in enumerate(s):
        while j < i and v - s[j][0] > 2:
            j += 1
        while j < i:
            L[s[j][1]] = 1
            L[s[i][1]] = 1
            j += 1
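For example, with the question's list (note that positions with no neighbor within 2 keep their original value rather than becoming 0, so post-process the result if you need a pure 0/1 list):
L = [1430185458, 1430185456, 1430185245, 1430185246, 1430185001]
process(L)
print(L)  # [1, 1, 1, 1, 1430185001]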
I want to create a list given two inputs, under the condition that there cannot be any duplicates. The list should contain a random sequence of numbers. The numbers in the list are positive integers.
Input 1: the length of the list (var samples)
Input 2: the highest number of the list (var end)
I know how to do this, but I want the list to contain a vast quantity of numbers - 1 million, or more.
I have created 2 methods to solve this problem myself; both have their issues: one of them is slow, and the other produces a MemoryError.
Method 1, MemoryError:
import random

def create_lst_rand_int(end, samples):
    if samples > end:
        print('You cannot create this list')
    else:
        lst = []
        lst_possible_values = range(0, end)
        for item in range(0, samples):
            random_choice = random.choice(lst_possible_values)
            lst_possible_values.remove(random_choice)
            lst.append(random_choice)
        return lst

print create_lst_rand_int(1000000000000, 100000000001)
Method 2, slow:
import random

def lst_rand_int(end, samples):
    lst = []
    # lst cannot exist under these conditions
    if samples > end:
        print('List must be longer or equal to the highest value')
    else:
        while len(lst) < samples:
            random_int = random.randint(0, end)
            if random_int not in lst:
                lst.append(random_int)
    return lst

print lst_rand_int(1000000000000, 100000000001)
Since neither of my methods work well (method 1 does work better than method 2) I would like to know how I can create a list that meets my requirements better.
Try the solution given in the docs:
http://docs.python.org/2/library/random.html#random.sample
To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).
Or, in your case, random.sample(xrange(0,1000000000000), 100000000001)
This is still a giant data structure that may or may not fit in your memory. On my system:
>>> sys.getsizeof(1)
24
So 100000000001 samples will require 2400000000024 bytes, or roughly two terabytes. I suggest you find a way to work with smaller numbers of samples.
Try:
temp = xrange(end+1)
random.sample(temp, samples)
random.sample() does not pick any duplicates.
Since sample always returns a list, you're out of luck with such a large size. Try using a generator instead:
import random

def rrange(min, max):
    seen = set()
    while len(seen) <= max - min:
        n = random.randint(min, max)
        if n not in seen:
            seen.add(n)
            yield n
This still requires memory to store seen elements, but at least not everything at once.
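Then take only as many values as you need, for example with itertools.islice (a usage sketch):
from itertools import islice

samples = list(islice(rrange(0, 10**12), 1000))  # first 1000 unique random ints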
You could use a set instead of a list, and avoid checking for duplicates.
def lr2(end, samples):
    lst = set()
    # lst cannot exist under these conditions
    if samples > end:
        print('List must be longer or equal to the highest value')
    else:
        # loop until the set has enough elements; duplicates are simply absorbed
        while len(lst) < samples:
            random_int = random.randint(0, end)
            lst.add(random_int)
    return lst
Since your sample size is such a large percentage of the items being sampled, a much faster approach is to shuffle the list of items and then just take the first n items.
import random

def lst_rand_int(end, samples):
    lst = range(0, end)  # a real list on Python 2
    random.shuffle(lst)
    return lst[0:samples]
If samples > end, it will just return the whole list.
If the list is too large for memory, you can break it into parts and store the parts on disk. In that case, for each sample required, a random choice should be made to pick a section, then an item within that section, which is then removed.