I want to create a list given two inputs, under the condition that there cannot be any duplicates. The list should contain a random sequence of numbers. The numbers in the list are positive integers.
Input 1: the length of the list (var samples)
Input 2: the highest number of the list (var end)
I know how to do this, but I want the list to contain a vast quantity of numbers: 1 million or more.
I have created 2 methods to solve this problem myself; both have their issues. One of them is slow, the other produces a MemoryError.
Method 1, MemoryError:
import random

def create_lst_rand_int(end, samples):
    if samples > end:
        print('You cannot create this list')
    else:
        lst = []
        lst_possible_values = range(0, end)
        for item in range(0, samples):
            random_choice = random.choice(lst_possible_values)
            lst_possible_values.remove(random_choice)
            lst.append(random_choice)
        return lst

print create_lst_rand_int(1000000000000, 100000000001)
Method 2, slow:
import random

def lst_rand_int(end, samples):
    lst = []
    # lst cannot exist under these conditions
    if samples > end:
        print('List must be longer or equal to the highest value')
    else:
        while len(lst) < samples:
            random_int = random.randint(0, end)
            if random_int not in lst:
                lst.append(random_int)
        return lst

print lst_rand_int(1000000000000, 100000000001)
Since neither of my methods works well (though method 1 does work better than method 2), I would like to know how I can create a list that better meets my requirements.
Try the solution given in the docs:
http://docs.python.org/2/library/random.html#random.sample
To choose a sample from a range of integers, use an xrange() object as an argument. This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).
Or, in your case, random.sample(xrange(0,1000000000000), 100000000001)
This is still a giant data structure that may or may not fit in your memory. On my system:
>>> sys.getsizeof(1)
24
So 100000000001 samples will require 2400000000024 bytes, or roughly two terabytes. I suggest you find a way to work with smaller numbers of samples.
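For a sense of scale, sampling a million values from the same trillion-wide range is quick and fits comfortably in memory. A sketch, assuming a 64-bit Python 2 build (xrange arguments must fit in a native integer):
import random

# One million unique picks from [0, 10**12): xrange is lazy, and
# sample() only stores the selected elements.
picks = random.sample(xrange(0, 1000000000000), 1000000)
print len(picks)       # 1000000
print len(set(picks))  # 1000000 -- no duplicates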
Try:
temp = xrange(end+1)
random.sample(temp, samples)
random.sample() does not pick any duplicates.
Since sample always returns a list, you're out of luck with such a large size. Try using a generator instead:
import random

def rrange(min, max):
    seen = set()
    # Keep yielding until every value in [min, max] has been produced.
    while len(seen) <= max - min:
        n = random.randint(min, max)
        if n not in seen:
            seen.add(n)
            yield n
This still requires memory to store seen elements, but at least not everything at once.
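To actually consume a fixed number of values from that generator, itertools.islice works well (a usage sketch built on the rrange above):
from itertools import islice

# Ten unique random integers from [0, 99], produced lazily.
first_ten = list(islice(rrange(0, 99), 10))
print(first_ten)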
You could use a set instead of a list; a set silently absorbs duplicates, so there is no need for an explicit membership check.
import random

def lr2(end, samples):
    lst = set()
    # lst cannot exist under these conditions
    if samples > end:
        print('List must be longer or equal to the highest value')
    else:
        # Loop until the set holds `samples` unique values; a fixed
        # `for _ in range(samples)` loop could return fewer elements,
        # because repeated draws collapse into one set entry.
        while len(lst) < samples:
            lst.add(random.randint(0, end))
        return lst
Since your sample size is such a large percentage of the items being sampled, a much faster approach is to shuffle the list of items and then just take the first (or last) n items.
import random

def lst_rand_int(end, samples):
    lst = range(0, end)
    random.shuffle(lst)
    return lst[0:samples]
If samples > end, it will just return the whole list.
If the list is too large for memory, you can break it into parts and store the parts on disk. In that case, for each sample required, make a random choice of section, then a random choice of item within that section, and remove that item.
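A minimal in-memory sketch of that section-then-item idea (the sections here are plain lists standing in for chunks stored on disk; picks are weighted by current section size so every remaining item stays equally likely):
import random

def sample_from_sections(sections, samples):
    # `sections` is a list of lists, stand-ins for chunks on disk.
    chosen = []
    for _ in range(samples):
        # Weight sections by their current size so each remaining
        # item has the same chance of being picked.
        total = sum(len(s) for s in sections)
        r = random.randrange(total)
        for section in sections:
            if r < len(section):
                chosen.append(section.pop(r))
                break
            r -= len(section)
    return chosen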
Related
I was trying an online test. The test asked to write a function that, given a list of up to 100000 integers whose range is 1 to 100000, would find the first missing integer.
For example, if the list is [1,4,5,2] the output should be 3.
I iterated over the list as follows:
def find_missing(num):
    for i in range(1, 100001):
        if i not in num:
            return i
The feedback I received is that the code is not efficient in handling big lists.
I am quite new and I could not find an answer; how can I iterate more efficiently?
The first improvement would be to make yours linear by using a set for the repeated membership test:
def find_missing(nums):
    s = set(nums)
    for i in range(1, 100001):
        if i not in s:
            return i
Given how C-optimized Python sorting is, you could also do something like:
def find_missing(nums):
    s = sorted(set(nums))
    return next(i for i, n in enumerate(s, 1) if i != n)
But both of these are fairly space inefficient as they create a new collection. You can avoid that with an in-place sort:
from itertools import groupby

def find_missing(nums):
    nums.sort()  # in-place
    return next(i for i, (k, _) in enumerate(groupby(nums), 1) if i != k)
For any range of numbers, the sum is given by Gauss's formula:
# sum of all numbers up to and including nums[-1] minus
# sum of all numbers up to but not including nums[0]
expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
If a number is missing, the actual sum will be
actual = sum(nums)
The difference is the missing number:
result = expected - actual
This computation is O(n), which is as efficient as you can get. expected is an O(1) computation, while actual has to actually add up the elements.
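Putting those pieces together into one function might look like this (a sketch assuming the list is sorted and exactly one number in its range is missing, which the formula requires):
def find_missing(nums):
    # Requires nums sorted with exactly one gap in [nums[0], nums[-1]].
    expected = nums[-1] * (nums[-1] + 1) // 2 - nums[0] * (nums[0] - 1) // 2
    actual = sum(nums)
    return expected - actual

print(find_missing([1, 2, 4, 5]))  # 3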
A somewhat slower but similar complexity approach would be to step along the sequence in lockstep with either a range or itertools.count:
for a, e in zip(nums, range(nums[0], len(nums) + nums[0])):
    if a != e:
        return e  # or break if not in a function
Notice the difference between a single comparison a != e, vs a linear containment check like e in nums, which has to iterate on average through half of nums to get the answer.
You can use Counter to count every occurrence in your list. The minimum number with occurrence 0 will be your output. For example:
from collections import Counter

def find_missing(your_list):
    count = Counter(your_list)
    keys = count.keys()  # every distinct element of your_list
    main_list = list(range(1, 100001))  # the values from 1 to 100k
    missing_numbers = list(set(main_list) - set(keys))
    your_output = min(missing_numbers)
    return your_output
I am trying to create a list of data in a for loop and then store that list in another list if it satisfies some condition. My code is
import numpy as np

R = 10
lam = 1
proc_length = 100
L = 1

# Empty list to store lists
exponential_procs_lists = []

for procs in range(0, R):
    # Draw exponential random variables
    z_exponential = np.random.exponential(lam, proc_length)
    # Sort values in increasing order
    z_exponential.sort()
    # Insert 0 at start of the array
    z_dat_r = np.insert(z_exponential, 0, 0)
    sum = np.sum(np.diff(z_dat_r))
    if sum < 5*L:
        exponential_procs_lists.append(z_dat_r)
which will store some of the R lists that satisfy the sum < 5L condition. My question is: what is the best way to store R lists where the sum of each list is less than 5L? The lists can be different lengths, but they must satisfy the condition that the sum of the increments is less than 5*L. Any help much appreciated.
Okay, so based on your comment, I take it that you want to generate an exponential_procs_list in which every sublist has a sum < 5*L.
Well, I modified your code to chop the sublists as soon as the sum exceeds 5*L.
Edit : See answer history to see my last answer for the approach above.
Looking closer, notice that you don't actually need the discrete difference array: you're finding the difference array, summing it, checking whether the sum is < 5L, and if it is, appending the original array.
But notice this:
if your array is like so: [0, 0.00760541, 0.22281415, 0.60476231], its difference array would be [0.00760541, 0.21520874, 0.38194816].
If you add the first x terms of the difference array, you get the (x+1)th element of the original array. So you really just need to keep the elements that are less than 5L:
import numpy as np

R = 10
lam = 1
proc_length = 5
L = 1

exponential_procs_lists = []

def chop(nums, target):
    good_list = []
    for num in nums:
        if num >= target:
            break
        good_list.append(num)
    return good_list

for procs in range(0, R):
    z_exponential = np.random.exponential(lam, proc_length)
    z_exponential.sort()
    z_dat_r = np.insert(z_exponential, 0, 0)
    good_list = chop(z_dat_r, 5*L)
    exponential_procs_lists.append(good_list)
You could probably also just do a binary search (for better time complexity) or use a filter lambda; that's up to you.
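Since z_dat_r is already sorted, a binary search via np.searchsorted could replace the linear chop. A sketch of that variant (not part of the original answer):
import numpy as np

def chop_bisect(nums, target):
    # nums must be sorted ascending; searchsorted finds the first
    # index whose value is >= target in O(log n).
    cut = np.searchsorted(nums, target, side='left')
    return list(nums[:cut])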
I have tried the following code, and it takes a lot of time when I set lower = 0 and upper = 10000:
def sumPdivisors(n):
    '''This function returns the sum of proper divisors of a number'''
    lst = []
    for i in range(1, n//2+1):
        if n % i == 0:
            lst.append(i)
    return sum(lst)

lower = int(input("Enter the lower value of range: "))
upper = int(input("Enter the upper value of range: "))

lst = []
for i in range(lower, upper+1):
    if i == 0:
        continue
    else:
        for j in range(i, upper):
            if i != j and sumPdivisors(i) == j and sumPdivisors(j) == i:
                lst.append((i, j))
                break
print(lst)
There are two things that you could do here.
Memoization
There's already a great explanation of what memoization is elsewhere on this site [link], but here's how it's relevant to your problem:
sumPdivisors is called very frequently in the for-loop at the bottom of your code snippet. For really large inputs n, it will take a long time to run.
sumPdivisors is called with the same input n multiple times.
You can speed things up by saving the result of calling sumPdivisors on each input, for example in a dictionary that maps each integer to the output of sumPdivisors for that integer. That is essentially what memoization is: caching results so that repeated calls with the same argument are answered instantly instead of recomputed. Read the link for a more in-depth explanation.
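A quick way to get memoization without managing the dictionary yourself is functools.lru_cache (a sketch layered on your existing function, not a change to its logic):
from functools import lru_cache

@lru_cache(maxsize=None)
def sumPdivisors(n):
    '''Return the sum of proper divisors of n, caching each result
    so repeated calls with the same n are instant.'''
    lst = []
    for i in range(1, n//2 + 1):
        if n % i == 0:
            lst.append(i)
    return sum(lst)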
Don't add the numbers in sumPdivisors to a list
You can just add these numbers to a running total as you iterate, instead of appending them to a list and then summing it. This change won't have as great an impact as adding memoization.
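That version of sumPdivisors might look like this (same logic, just accumulating instead of building a list):
def sumPdivisors(n):
    '''Return the sum of proper divisors of n.'''
    total = 0
    for i in range(1, n//2 + 1):
        if n % i == 0:
            total += i
    return total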
import itertools
Num = 11
base = list(range(1,Num+1))
Permutations = list(itertools.permutations(base))
I'm getting a memory error trying to run this. In reality I only need to generate the first (Num-1)! permutations, but I'm not sure how to (so if Num = 7 I would need to generate the first 6! = 720 permutations). I would ideally like to be able to generate permutations for significantly higher values of Num, so any suggestions would be great.
range() and itertools.permutations() both return lazy objects that generate items on demand. You don't need to call list() and turn them into lists. Just iterate over them directly and access the items one by one.
import itertools

num = 11
base = range(1, num+1)
permutations = itertools.permutations(base)

for permutation in permutations:
    # Do something with `permutation`.
(Note that a generator can only be used once. If you want to iterate over the permutations more than once you'll need to call itertools.permutations() multiple times.)
To stop after n items use itertools.islice():
for permutation in itertools.islice(permutations, n):
    # Do something with `permutation`.
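For the case in the question, n would be (num-1)!, which the standard library's math.factorial computes. A sketch using the question's num = 7 example:
import itertools
import math

num = 7
base = range(1, num + 1)
permutations = itertools.permutations(base)

# First (num-1)! permutations, i.e. 6! = 720 for num = 7.
n = math.factorial(num - 1)
for permutation in itertools.islice(permutations, n):
    print(permutation)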
You can skip items at the beginning, too. This would skip the first five permutations:
for permutation in itertools.islice(permutations, 5, n):
    # Do something with `permutation`.
If you want to count the permutations you can add enumerate(), which attaches an index to every entry:
for i, permutation in enumerate(itertools.islice(permutations, n)):
    # Skip the fifth permutation.
    if i == 4:
        continue
    # Do something with `permutation`.
By the way, please use lowercase for variable names. Only class names should be capitalized.
I'm experiencing very slow performance with the algorithm below.
I have a very large (1,000,000+) list containing long strings.
i.e.: id_list = ['MYSUPERLARGEID:1123:123123', 'MYSUPERLARGEID:1123:134534389', 'MYSUPERLARGEID:1123:12763']...
num_reads is the maximum number of elements to choose at random from this list.
The idea is to randomly choose one of the string ids in id_list until num_reads is reached, and to add them (I say add, and not append, because I don't care about the order of random_id_list) to random_id_list, which is empty at the beginning.
I can't repeat the same id, so I remove it from the original list after it is randomly chosen. I suspect this is what is making the script run so slowly... maybe I'm wrong and another part of this loop is responsible for the slow behavior.
for x in xrange(0, num_reads):
    id_index, id_string = random.choice(list(enumerate(id_list)))
    random_id_list.append(id_string)
    del id_list[id_index]
Use random.sample() to produce a sample of N elements with no repeats:
random_id_list = random.sample(id_list, num_reads)
Removing elements from the middle of a large list is indeed slow, as everything beyond that index has to be moved up a step.
This does not, of course, remove elements from the original list anymore, so repeated random.sample() calls can still give you samples with elements that have been picked before. If you need to produce samples repeatedly until your list is exhausted, then shuffle once and from there on out take consecutive slices of k elements from the shuffled list:
def random_samples(k):
    random.shuffle(id_list)
    for i in range(0, len(id_list), k):
        yield id_list[i : i + k]
then use this to produce your samples; either in a loop or with next():
sample_gen = random_samples(num_reads)
random_id_list = next(sample_gen)
# some point later
another_random_id_list = next(sample_gen)
Because the list is shuffled entirely randomly, the slices produced this way are also all valid random samples.
The "hard" way, instead of just shuffling the list, is to evaluate each element of your list in order, and selecting the item with a probability that relies on both the number of items you still need to choose and the number of items left to choose from. This is useful if you don't have the entire list presented to you at once (a so-called on-line algorithm).
Let's say you need to select k of N items. That means each item has a k/N probability of being chosen, if you can consider all items at once. However, if you accept the first item, then you only need to select k-1 items from N-1 remaining items. If you reject it, you still need k items from N-1 remaining items. So the algorithm would look like
import random

N = len(id_list)
k = 10  # For example
choices = []

for i in id_list:
    if random.randint(1, N) <= k:
        choices.append(i)
        k -= 1
    N -= 1
Initially, the first item is chosen with the expected probability of k/N. As you go through your list, N steadily decreases, while k decreases as you actually accept items. Note that each item, overall, still has a p = k/N chance of being chosen, where k and N are the starting values. Let pi be the probability that you choose the ith element in the list. p1 is obviously k/N, given the starting values of k and N. As an example, consider p2:
p2 = p1 * (k-1)/(N-1) + (1 - p1) * k/(N-1)
   = (p1*k - p1 + k - k*p1) / (N-1)
   = (k - p1) / (N-1)
   = (k - k/N) / (N-1)
   = k/(N-1) - k/(N*(N-1))
   = (k*N - k) / (N*(N-1))
   = k/N
Similar (but longer) analysis holds for p3, p4, etc.
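If you want to sanity-check that each item really ends up with probability k/N, a quick simulation works (a sketch; each count should come out near num_trials * k/N):
import random
from collections import Counter

def select(items, k):
    # Selection sampling, as described above.
    N = len(items)
    choices = []
    for item in items:
        if random.randint(1, N) <= k:
            choices.append(item)
            k -= 1
        N -= 1
    return choices

counts = Counter()
num_trials = 100000
for _ in range(num_trials):
    counts.update(select(range(10), 3))

# Each of the 10 items should be chosen about 30000 times (k/N = 0.3).
print(sorted(counts.items()))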