Python: Number ranges that are extremely large?

import time
from random import shuffle

val = long(raw_input("Please enter the maximum value of the range:")) + 1
start_time = time.time()
numbers = range(0, val)
shuffle(numbers)
I cannot find a simple way to make this work with extremely large inputs - can anyone help?
I saw a question like this - but I could not implement the range function they described in a way that works with shuffle. Thanks.

To get a random permutation of the range [0, n) in a memory-efficient manner, you could use numpy.random.permutation():
import numpy as np
numbers = np.random.permutation(n)
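For a rough sense of the memory this needs (my own quick check; the dtype is platform dependent, typically int64 on a 64-bit build):
perm = np.random.permutation(10**7)   # 10**7 chosen just for illustration
print(perm.dtype)     # e.g. int64
print(perm.nbytes)    # about 80 MB for 10**7 int64 values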
If you need only a small fraction of the values from the range, e.g., to get k random values from the range [0, n):
import random
from functools import partial
def sample(n, k):
    # assume n is much larger than k
    randbelow = partial(random.randrange, n)
    # from random.py
    result = [None] * k
    selected = set()
    selected_add = selected.add
    for i in range(k):
        j = randbelow()
        while j in selected:
            j = randbelow()
        selected_add(j)
        result[i] = j
    return result
print(sample(10**100, 10))

If you don't need the full list of numbers (and if you are getting billions, it's hard to imagine why you would need them all), you might be better off taking a random.sample of your number range, rather than shuffling them all. In Python 3, random.sample can work on a range object too, so your memory use can be quite modest.
For example, here's code that will sample ten thousand random numbers from a range up to whatever maximum value you specify. It should require only a relatively small amount of memory beyond the 10000 result values, even if your maximum is 100 billion (or whatever enormous number you want):
import random
def get10kRandomNumbers(maximum):
    pop = range(1, maximum+1) # this is memory efficient in Python 3
    sample = random.sample(pop, 10000)
    return sample
Alas, this doesn't work as nicely in Python 2, since xrange objects don't allow maximum values greater than the system's integer type can hold.

An important point to note is that it will be impossible for a computer to have the list of numbers in memory if it is larger than a few billion elements: its memory footprint becomes larger than the typical RAM size (as it takes about 4 GB for 1 billion 32-bit numbers).
In the question, val is a long integer, which seems to indicate that you are indeed using more than a billion integers, so this cannot be done conveniently in memory (i.e., shuffling will be slow, as the operating system will swap).
That said, if the number of elements is small enough (let's say smaller than 0.5 billion), then a list of elements can fit in memory thanks to the compact representation offered by the array module, and be shuffled. This can be done with the standard module array:
import array, random
numbers = array.array('I', xrange(10**8)) # or 'L', if the number of bytes per item (numbers.itemsize) is too small with 'I'
random.shuffle(numbers)
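As a quick sanity check of the footprint (a small sketch assuming the numbers array created just above; itemsize depends on the platform):
print(numbers.itemsize)                 # typically 4 bytes per element for 'I'
print(numbers.itemsize * len(numbers))  # about 400 MB of raw buffer for 10**8 items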

Related

Justification of constants used in random.sample

I'm looking into the source code for the function sample in random.py (python standard library).
The idea is simple:
If a small sample (k) is needed from a large population (n): Just pick k random indices, since it is unlikely you'll pick the same number twice as the population is so large. And if you do, just pick again.
If a relatively large sample (k) is needed, compared to the total population (n): It is better to keep track of what you have picked.
My Question
There are a few constants involved, setsize = 21 and setsize += 4 ** _log(3*k,4). The critical ratio is roughly k : 21+3k. The comment says # size of a small set minus size of an empty list and # table size for big sets.
Where have these specific numbers come from? What is their justification?
The comments shed some light, however I find they bring as many questions as they answer.
I would kind of understand "size of a small set", but I find the "minus size of an empty list" part confusing. Can someone shed any light on this?
What is meant specifically by "table" size, as opposed to, say, "set" size?
Looking at the GitHub repository, it looks like a very old version simply used the ratio k : 6*k as the critical ratio, but I find that equally mysterious.
The code
def sample(self, population, k):
    """Chooses k unique random elements from a population sequence or set.

    Returns a new list containing elements from the population while
    leaving the original population unchanged. The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples. This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique. If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use range as an argument.
    This is especially fast and space efficient for sampling from a
    large population: sample(range(10000000), 60)
    """

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.
    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection. For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    if isinstance(population, _Set):
        population = tuple(population)
    if not isinstance(population, _Sequence):
        raise TypeError("Population must be a sequence or set. For dicts, use list(d).")
    randbelow = self._randbelow
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("Sample larger than population or is negative")
    result = [None] * k
    setsize = 21 # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize:
        # An n-length list is smaller than a k-length set
        pool = list(population)
        for i in range(k): # invariant: non-selected at [0,n-i)
            j = randbelow(n-i)
            result[i] = pool[j]
            pool[j] = pool[n-i-1] # move non-selected item into vacancy
    else:
        selected = set()
        selected_add = selected.add
        for i in range(k):
            j = randbelow(n)
            while j in selected:
                j = randbelow(n)
            selected_add(j)
            result[i] = population[j]
    return result
(I apologise if this question would be better placed on math.stackexchange. I couldn't think of any probability/statistics reasons for this particular ratio, and the comments sounded as though it was maybe something to do with the amount of space that sets and lists use - but I couldn't find any details anywhere.)
This code is attempting to determine whether using a list or a set would take more space (instead of trying to estimate the time cost, for some reason).
It looks like 21 was the difference between the size of an empty list and a small set on the Python build this constant was determined on, expressed in multiples of the size of a pointer. I don't have a build of that version of Python, but testing on my 64-bit CPython 3.6.3 gives a difference of 20 pointer sizes:
>>> sys.getsizeof(set()) - sys.getsizeof([])
160
and comparing the 3.6.3 list and set struct definitions to the list and set definitions from the change that introduced this code, 21 seems plausible.
I said "the difference between the size of an empty list and a small set" because both now and at the time, small sets used a hash table contained inside the set struct itself instead of externally allocated:
setentry smalltable[PySet_MINSIZE];
The
if k > 5:
    setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
check adds the size of the external table allocated for sets larger than 5 elements, with size again expressed in number of pointers. This computation assumes the set never shrinks, since the sampling algorithm never removes elements. I am not currently sure whether this computation is exact.
Finally,
if n <= setsize:
compares the base overhead of a set plus any space used by an external hash table to the n pointers required by a list of the input elements. (It doesn't seem to account for the overallocation performed by list(population), so it may be underestimating the cost of the list.)
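To see how well that heuristic tracks reality on a particular build, one can compare it against sys.getsizeof directly. This is my own quick sketch (Python 3, assuming a 64-bit build with 8-byte pointers), not part of the stdlib code:
import sys
from math import ceil, log

def estimated_setsize(k):
    # mirrors the heuristic in random.sample, in units of pointers
    setsize = 21                              # base set overhead minus empty-list overhead
    if k > 5:
        setsize += 4 ** ceil(log(k * 3, 4))   # external hash table for bigger sets
    return setsize

POINTER = 8  # bytes per pointer on a 64-bit build (an assumption)
for k in (4, 10, 100, 1000):
    actual = (sys.getsizeof(set(range(k))) - sys.getsizeof([])) // POINTER
    print(k, estimated_setsize(k), actual)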

Is there any efficient way to increment the corresponding set positions of an integer in an integer array?

Any solution consuming less than O(Bit Length) time is welcome. I need to process around 100 million large integers.
answer = [0 for i in xrange(100)]
def pluginBits(val):
    global answer
    for j in xrange(len(answer)):
        if val <= 0:
            break
        answer[j] += (val & 1)
        val >>= 1
A speedier way to do this would be to use '{:b}'.format(someval) to convert from integer to a string of '1's and '0's. Python still needs to do similar work to perform this conversion, but doing it at the C layer in the interpreter internals involves significantly less overhead for larger values.
For conversion to an actual list of integer 1s and 0s, you could do something like:
# Done once at top level to make translation table:
import string
bitstr_to_intval = string.maketrans(b'01', b'\x00\x01')
# Done for each value to convert:
bits = bytearray('{:b}'.format(origint).translate(bitstr_to_intval))
Since bytearray is a mutable sequence of values in range(256) that iterates the actual int values, you don't need to convert to list; it should be usable in 99% of the places the list would be used, using less memory and running faster.
This does generate the values in the reverse of the order your code produces (that is, bits[-1] here is the same as your answer[0], bits[-2] is your answer[1], etc.), and it's unpadded, but since you're summing bits, the padding isn't needed, and reversing the result is a trivial reversing slice (add [::-1] to the end). Summing the bits from each input can be made much faster by making answer a numpy array (which allows bulk element-wise addition at the C layer). Putting it all together gives:
import string
import numpy

bitstr_to_intval = string.maketrans(b'01', b'\x00\x01')
answer = numpy.zeros(100, numpy.uint64)

def pluginBits(val):
    bits = bytearray('{:b}'.format(val).translate(bitstr_to_intval))[::-1]
    answer[:len(bits)] += bits
In local tests, this definition of pluginBits takes a little under one-seventh the time to sum the bits at each position for 10,000 random input integers of 100 bits each, and gets the same results.
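A small usage sketch of the numpy version above (my addition, written for Python 2 to match string.maketrans; the 100-bit inputs match the size of the answer array):
import random

# feed in a few random 100-bit integers and inspect the per-position totals
for _ in xrange(1000):
    pluginBits(random.getrandbits(100))

print(answer[:8])   # how many of the inputs had bits 0..7 set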

How to Generate N random numbers in Python 3 between 0 to infinity

How do I generate n random numbers in Python 3? n is a variable to be determined, and the numbers should preferably be natural numbers (integers > 0).
All the answers I've found take random integers from a range; however, I don't want to generate numbers from a range (unless the range is 0 to infinity).
To paraphrase Wittgenstein, the limits of your machine are the limits of your language; i.e., there is no such thing as infinity in the world of computers and computation. However, you can use sys.maxsize (sys.maxint in Python 2) to get the maximum supported size of data structures, which can be used, for example, as the maximum list index or string length. You can also pass it to the random.randint function to get a very large random integer, though you may be able to go beyond that threshold depending on your machine's capabilities.
>>> import sys
>>> sys.maxsize
9223372036854775807
>>> random.randint(0,sys.maxsize)
7512061515276834201
And for generating multiple random numbers you can use a list comprehension like the following:
>>> N = 10
>>> [random.randint(0,sys.maxsize) for _ in range(N)]
[3275729488497352533, 7487884953907275260, 36555221619119354, 1813061054215861082, 619640952660975257, 9041692448390670491, 5863449945569266108, 8061742194513906273, 1435436865777681895, 8761466112930659544]
For more info about the difference between sys.maxint and sys.maxsize in Python 2.x and 3.x:
The sys.maxint constant was removed, since there is no longer a limit to the value of integers. However, sys.maxsize can be used as an integer larger than any practical list or string index. It conforms to the implementation’s “natural” integer size and is typically the same as sys.maxint in previous releases on the same platform (assuming the same build options).
I think you probably need to rethink what it is you're trying to do with the random number you want. In particular, what distribution are you sampling the number from?
If you want your random numbers uniformly distributed (equal probability of each number being chosen), you can't: you'd need an infinite amount of memory (or time, or both).
Of course, if you allow for non-uniform distributions, here are some random numbers between 1 and (roughly) the largest float my system allows, but there are gaps due to the way that such numbers are represented. And you may feel that the probability of "large" numbers being selected falls away rather quicker than you'd like...
In [254]: [int(1./random.random()) for i in range(10)]
Out[254]: [1, 1, 2, 1, 1, 117, 1, 3, 2, 6]
Here we are limited by the machine's memory, so we can only generate random numbers up to the maximum the system can reach. Just put as many digits as you want into the print call, and you can get the desired result. As an example, I tried 6-digit random numbers; one can adapt it as per the requirements. Hope this solves your question to an extent.
import sys
from random import *

for i in range(sys.maxsize):
    print(randint(0,9),randint(0,9),randint(0,9),randint(0,9),randint(0,9),randint(0,9),sep='')

Sum of primes below 2,000,000 in python

I am attempting problem 10 of Project Euler, which is the summation of all primes below 2,000,000. I have tried implementing the Sieve of Eratosthenes using Python, and the code I wrote works perfectly for numbers below 10,000.
However, when I attempt to find the summation of primes for bigger numbers, the code takes too long to run (finding the sum of primes up to 100,000 took 315 seconds). The algorithm clearly needs optimization.
Yes, I have looked at other posts on this website, like Fastest way to list all primes below N, but the solutions there had very little explanation as to how the code worked (I am still a beginner programmer) so I was not able to actually learn from them.
Can someone please help me optimize my code, and clearly explain how it works along the way?
Here is my code:
primes_below_number = 2000000 # number to find summation of all primes below number
numbers = (range(1, primes_below_number + 1, 2)) # creates a list excluding even numbers
pos = 0 # index position
sum_of_primes = 0 # total sum
number = numbers[pos]
while number < primes_below_number and pos < len(numbers) - 1:
    pos += 1
    number = numbers[pos] # moves to next prime in list numbers
    sum_of_primes += number # adds prime to total sum
    num = number
    while num < primes_below_number:
        num += number
        if num in numbers[:]:
            numbers.remove(num) # removes multiples of prime found
print sum_of_primes + 2
As I said before, I am new to programming, therefore a thorough explanation of any complicated concepts would be deeply appreciated. Thank you.
As you've seen, there are various ways to implement the Sieve of Eratosthenes in Python that are more efficient than your code. I don't want to confuse you with fancy code, but I can show how to speed up your code a fair bit.
Firstly, searching a list isn't fast, and removing elements from a list is even slower. However, Python provides a set type which is quite efficient at performing both of those operations (although it does chew up a bit more RAM than a simple list). Happily, it's easy to modify your code to use a set instead of a list.
Another optimization is that we don't have to check for prime factors all the way up to primes_below_number, which I've renamed to hi in the code below. It's sufficient to just go to the square root of hi, since if a number is composite it must have a factor less than or equal to its square root.
We don't need to keep a running total of the sum of the primes. It's better to do that at the end using Python's built-in sum() function, which operates at C speed, so it's much faster than doing the additions one by one at Python speed.
# number to find summation of all primes below number
hi = 2000000

# create a set excluding even numbers
numbers = set(xrange(3, hi + 1, 2))

for number in xrange(3, int(hi ** 0.5) + 1):
    if number not in numbers:
        # number must have been removed because it has a prime factor
        continue
    num = number
    while num < hi:
        num += number
        if num in numbers:
            # Remove multiples of prime found
            numbers.remove(num)

print 2 + sum(numbers)
You should find that this code runs in a few seconds; it takes around 5 seconds on my 2GHz single-core machine.
You'll notice that I've moved the comments so that they're above the line they're commenting on. That's the preferred style in Python since we prefer short lines, and also inline comments tend to make the code look cluttered.
There's another small optimization that can be made to the inner while loop, but I let you figure that out for yourself. :)
First, removing numbers from the list will be very slow. Instead of this, make a list
primes = [True] * primes_below_number
primes[0] = False
primes[1] = False
Now in your loop, when you find a prime p, change primes[k*p] to False for all suitable k. (You wouldn't actually multiply; you'd continually add p, of course.)
At the end,
primes = [n for n in range(primes_below_number) if primes[n]]
This should be a great deal faster.
Second, you can stop looking once your find a prime greater than the square root of primes_below_number, since a composite number must have a prime factor that doesn't exceed its square root.
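Putting those pieces together, a minimal sketch of this boolean-list sieve (my own assembly of the steps above, not the answerer's code) could look like this:
primes_below_number = 2000000

# primes[n] is True while n is still considered a candidate prime
primes = [True] * primes_below_number
primes[0] = False
primes[1] = False

p = 2
while p * p < primes_below_number:        # stop once p exceeds the square root
    if primes[p]:
        # mark every multiple of p, starting at p*p, as composite
        multiple = p * p
        while multiple < primes_below_number:
            primes[multiple] = False
            multiple += p
    p += 1

print(sum(n for n in range(primes_below_number) if primes[n]))  # 142913828922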
Try using numpy; it should make things faster. Replacing range with xrange may also help.
Here's an optimization for your code:
import itertools

primes_below_number = 2000000
numbers = list(range(3, primes_below_number, 2))
pos = 0
while pos < len(numbers) - 1:
    number = numbers[pos]
    numbers = list(
        itertools.chain(
            itertools.islice(numbers, 0, pos + 1),
            itertools.ifilter(
                lambda n: n % number != 0,
                itertools.islice(numbers, pos + 1, len(numbers))
            )
        )
    )
    pos += 1

sum_of_primes = sum(numbers) + 2
print sum_of_primes
The optimizations here are:
Moved the sum outside the loop.
Instead of removing elements from a list we can just create another one, memory is not an issue here (I hope).
When creating the new list we create it by chaining two parts, the first part is everything before the current number (we already checked those), and the second part is everything after the current number but only if they are not divisible by the current number.
Using itertools can make things faster since we'd be using iterators instead of looping through the whole list more than once.
Another solution would be to not remove parts of the list but disable them like #saulspatz said.
And here's the fastest way I was able to find: http://www.wolframalpha.com/input/?i=sum+of+all+primes+below+2+million 😁
Update
Here is the boolean method:
import itertools

primes_below_number = 2000000
numbers = [v % 2 != 0 for v in xrange(primes_below_number)]
numbers[0] = False
numbers[1] = False
numbers[2] = True

number = 3
while number < primes_below_number:
    n = number * 3 # We already excluded even numbers
    while n < primes_below_number:
        numbers[n] = False
        n += number
    number += 1
    while number < primes_below_number and not numbers[number]:
        number += 1

sum_of_numbers = sum(itertools.imap(lambda index_n: index_n[1] and index_n[0] or 0, enumerate(numbers)))
print(sum_of_numbers)
This executes in seconds (took 3 seconds on my 2.4GHz machine).
Instead of storing a list of numbers, you can instead store an array of boolean values. This use of a bitmap can be thought of as a way to implement a set, which works well for dense sets (there aren't big gaps between the values of members).
An answer on a recent python sieve question uses this implementation python-style. It turns out a lot of people have implemented a sieve, or something they thought was a sieve, and then come on SO to ask why it was slow. :P Look at the related-questions sidebar from some of them if you want more reading material.
Finding the element that holds the boolean that says whether a number is in the set or not is easy and extremely fast. array[i] is a boolean value that's true if i is in the set, false if not. The memory address can be computed directly from i with a single addition.
(I'm glossing over the fact that an array of boolean might be stored with a whole byte for each element, rather than the more efficient implementation of using every single bit for a different element. Any decent sieve will use a bitmap.)
Removing a number from the set is as simple as setting array[i] = false, regardless of the previous value. No searching, no comparison, no tracking of what happened, just one memory operation. (Well, two for a bitmap: load the old byte, clear the correct bit, store it. Memory is byte-addressable, but not bit-addressable.)
An easy optimization of the bitmap-based sieve is to not even store entries for the even numbers, because there is only one even prime, and we can special-case it to double our memory density. Then the membership-status of i is held in array[i/2]. (Dividing by powers of two is easy for computers. Other values are much slower.)
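Here is a minimal sketch of that odds-only idea (Python 3, my own illustration, using one byte per flag in a bytearray rather than a true bit-per-entry bitmap, for simplicity):
def sum_primes_below(hi):
    if hi <= 2:
        return 0
    # is_prime[i] is the flag for the odd number 2*i + 1 (all odd numbers below hi)
    is_prime = bytearray([1]) * (hi // 2)
    is_prime[0] = 0                     # 1 is not prime
    i = 1
    while True:
        p = 2 * i + 1
        if p * p >= hi:
            break
        if is_prime[i]:
            start = (p * p) // 2        # index of p*p, the first multiple to clear
            count = (len(is_prime) - start + p - 1) // p
            is_prime[start::p] = b"\x00" * count
        i += 1
    # 2 is the only even prime, so add it back explicitly
    return 2 + sum(2 * j + 1 for j in range(1, len(is_prime)) if is_prime[j])

print(sum_primes_below(2000000))        # 142913828922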
An SO question:
Why is Sieve of Eratosthenes more efficient than the simple "dumb" algorithm? has many links to good stuff about the sieve. This one in particular has some good discussion about it, in words rather than just code. (Nevermind the fact that it's talking about a common Haskell implementation that looks like a sieve, but actually isn't. They call this the "unfaithful" sieve in their graphs, and so on.)
Discussion on that question brought up the point that trial division may be faster than big sieves, for some uses, because clearing the bits for all multiples of every prime touches a lot of memory in a cache-unfriendly pattern. CPUs are much faster than memory these days.

Get the number of zeros and ones of a binary number in Python

I am trying to solve a binary puzzle; my strategy is to transform a grid into zeros and ones, and what I want to make sure is that every row has the same number of 0s and 1s.
Is there any way to count how many 1s and 0s a number has without iterating through the number?
What I am currently doing is:
def binary(num, length=4):
    return format(num, '#0{}b'.format(length + 2)).replace('0b', '')
n = binary(112, 8)
# '01110000'
and then
n.count('0')
n.count('1')
Is there any more efficient computational (or maths way) of doing that?
What you're looking for is the Hamming weight of a number. In a lower-level language, you'd probably use a nifty SIMD within a register trick or a library function to compute this. In Python, the shortest and most efficient way is to just turn it into a binary string and count the '1's:
def ones(num):
    # Note that bin is a built-in
    return bin(num).count('1')
You can get the number of zeros by subtracting ones(num) from the total number of digits.
def zeros(num, length):
    return length - ones(num)
Demonstration:
>>> bin(17)
'0b10001'
>>> # leading 0b doesn't affect the number of 1s
>>> ones(17)
2
>>> zeros(17, length=6)
4
If the length is moderate (say less than 20), you can use a list as a lookup table.
It's only worth generating the list if you're doing a lot of lookups, but it seems you might in this case.
e.g. for a 16-bit table of the 0 count, use this:
zeros = [format(n, '016b').count('0') for n in range(1<<16)]
ones = [format(n, '016b').count('1') for n in range(1<<16)]
20 bits still takes under a second to generate on this computer
Edit: this seems slightly faster:
zeros = [20 - bin(n).count('1') for n in range(1<<20)]
ones = [bin(n).count('1') for n in range(1<<20)]
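As a usage note (my own sketch, not from the answer above): such a table can also count the bits of integers wider than the table by looking up one 16-bit chunk at a time:
ones16 = [bin(n).count('1') for n in range(1 << 16)]

def popcount(num):
    # sum the table entries for each 16-bit chunk of num
    total = 0
    while num:
        total += ones16[num & 0xFFFF]
        num >>= 16
    return total

print(popcount(112))          # 3
print(popcount(2**100 - 1))   # 100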
