Massive Python set of integers using too much memory

Setup
Python 2.6
Ubuntu x64
I have a set of unique integers with values between 1 and 50 million. New integers are added at random e.g. numberset.add(random.randint(1, 50000000)). I need to be able to quickly add new integers and quickly check if an integer is already present.
Problem
After a while, the set grows too large for my low memory system and I experience MemoryErrors.
Question
How can I achieve this while using less memory? What's the fastest way to do this using the disk without reconfiguring the system e.g. swapfiles? Should I use a database file like sqlite? Is there a library that will compress the integers in memory?

You can avoid dependencies on 3rd-party bit-array modules by writing your own -- the functionality required is rather minimal:
import array
BITS_PER_ITEM = array.array('I').itemsize * 8
def make_bit_array(num_bits, initially=0):
    num_items = (num_bits + BITS_PER_ITEM - 1) // BITS_PER_ITEM
    return array.array('I', [initially]) * num_items
def set_bit(bit_array, offset):
    item_index = offset // BITS_PER_ITEM
    bit_index = offset % BITS_PER_ITEM
    bit_array[item_index] |= 1 << bit_index
def clear_bit(bit_array, offset):
    item_index = offset // BITS_PER_ITEM
    bit_index = offset % BITS_PER_ITEM
    bit_array[item_index] &= ~(1 << bit_index)
def get_bit(bit_array, offset):
    item_index = offset // BITS_PER_ITEM
    bit_index = offset % BITS_PER_ITEM
    return (bit_array[item_index] >> bit_index) & 1
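For the problem in the question, usage would look something like this (a quick sketch built on the helpers above; the 50-million limit comes from the question):
import random
present = make_bit_array(50000000 + 1)   # one bit per possible value; index 0 unused
n = random.randint(1, 50000000)
if not get_bit(present, n):   # fast membership test
    set_bit(present, n)       # add the integer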

Use a bit array. This will greatly reduce the space requirement.
Related SO question:
Python equivalent to Java's BitSet

Use an array of bits as flags for each integer - the memory needed will be only 50 million bits (about 6 MB). There are a few modules that can help. This example uses bitstring; another option is bitarray:
import random
from bitstring import BitArray
i = BitArray(50000000)      # initialise 50 million zero bits
for x in xrange(100):
    v = random.randint(1, 50000000)
    if not i[v]:            # test if it's already present
        i.set(1, v)         # set a single bit
Setting and checking bits is very fast and it uses very little memory.

Try using the array module.
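One way that suggestion could be applied (a sketch, not a tested solution): keep the integers sorted in an array.array('I'), which stores each one in 4 bytes, and use bisect for membership tests; note that inserts are O(n), so this trades insertion speed for memory:
import array
import bisect
numbers = array.array('I')   # compact storage: 4 bytes per integer
def contains(numbers, n):
    i = bisect.bisect_left(numbers, n)
    return i < len(numbers) and numbers[i] == n
def add(numbers, n):
    i = bisect.bisect_left(numbers, n)
    if i == len(numbers) or numbers[i] != n:
        numbers.insert(i, n)   # O(n) shift - the price paid for compactness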

Depending on your requirements, you might also consider a bloom filter. It is a memory-efficient data structure for testing whether an element is in a set. The catch is that it can give false positives, though it will never give false negatives.
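A minimal sketch of the idea (assuming Python 2 to match the question, and using md5 with different seeds purely as a stand-in for proper hash functions; the sizing is not tuned):
import hashlib
class BloomFilter(object):
    def __init__(self, num_bits, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)
    def _positions(self, item):
        # md5 with different seeds stands in for independent hash functions
        for seed in range(self.num_hashes):
            digest = hashlib.md5(str(seed) + str(item)).hexdigest()
            yield int(digest, 16) % self.num_bits
    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
    def __contains__(self, item):
        return all((self.bits[pos // 8] >> (pos % 8)) & 1 for pos in self._positions(item))
bf = BloomFilter(num_bits=100000000)   # sizing not tuned; more bits means fewer false positives
bf.add(12345)
print(12345 in bf)   # True
print(54321 in bf)   # False with high probability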

If the integers are unique then use bits. Example: binary 01011111, reading bit positions left to right starting at 0, means that 1, 3, 4, 5, 6 and 7 are present. This way every bit is used to check whether its integer index is used (value 1) or not (value 0).
It was described in one chapter of "Programming Pearls" by Jon Bentley (look for "The file contains at most ten million records; each record is a seven-digit integer.")
It seems that the bitarray module mentioned by Emil works this way.
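For illustration, the same idea with a plain Python long used as the bit field (just a sketch; bit n stands for the integer n):
bits = 0
bits |= 1 << 12345678                  # mark 12345678 as present
print(bool((bits >> 12345678) & 1))    # True: already seen
print(bool((bits >> 54321) & 1))       # False: not seen yet
# Note: each |= copies the whole long, so for many inserts an array-based
# bit array (as in the first answer) is a better fit.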

Python efficient assigning of even and odd bits to a bitstring/bit_array

I receive serialized DDR data from a setup with 8 chips. Due to the way the readout is implemented in hardware, the data that is received by the computer has the following structure:
bits 0 and 1 of Chip A, bits 0 and 1 of Chip B, ... bits 2 and 3 of Chip A, bits 2 and 3 of Chip B, ...
In order to make sense of the individual reply of each chip, the data needs to be split:
import bitstring
data = bitstring.BitArray(1024) # contains serialized DDR data of 8 chips (here just 0 values for demo purposes)
even_bits_of_chip_A = data[0::16] # starting at position 0, every 16th bit is an even bit of chip A
odd_bits_of_chip_A = data[1::16] # starting at position 1, every 16th bit is an odd bit of chip A
data_of_chip_A = bitstring.BitArray(len(even_bits_of_chip_A) + len(odd_bits_of_chip_A))
data_of_chip_A[0::2] = even_bits_of_chip_A
data_of_chip_A[1::2] = odd_bits_of_chip_A
This code works fine and does what I want it to do. However, it is not that fast (considering I have to do this for all 8 chips, and generally with a lot more than 1024 bits). Is there a way to speed it up?
Of course the code can be rewritten like this:
import bitstring
data = bitstring.BitArray(1024)
data_of_chip_A = bitstring.BitArray(int(len(data) / 8))
data_of_chip_A[0::2] = data[0::16]
data_of_chip_A[1::2] = data[1::16]
This avoids the creation of the even_bits, odd_bits variables and increases performance. But the final step of assigning values to every second bit of data_of_chip_A still takes quite some time.
Is there a way to, for example, join two bit_arrays in an "alternating" fashion?
The issue might be that the stepped assignments are quite slow, as the BitArray has to be manipulated a lot to remain a dense object representing the data. It could be faster to instead go via a string as an intermediary:
import bitstring
data = bitstring.BitArray(1024)
data_bin = data.bin # Convert to ordinary str of '0' and '1's
# Create a list to contain single '0' or '1' characters
data_of_chip_A_bin = [''] * (len(data) // 8)
data_of_chip_A_bin[0::2] = data_bin[0::16]
data_of_chip_A_bin[1::2] = data_bin[1::16]
# Convert back to dense binary object
data_of_chip_A = bitstring.BitArray(bin=''.join(data_of_chip_A_bin))
...or that might be slower; it's hard to tell without more work. In general, the bitarray module mentioned in the comments is going to be faster at this kind of bit manipulation than the approach in the original answer, as it's a compiled C module rather than pure Python.
As suggested in one of the comments, I switched to bitarray. It works fairly similarly to bitstring and is way faster. Together with some simple multiprocessing, the script now runs 24x faster. Thanks for the suggestions!
from bitarray import bitarray
data = bitarray(1024) # empty/random array with 1024 bits
even_bits_of_chip_A = data[0::16] # starting at position 0, every 16th bit is an even bit of chip A
odd_bits_of_chip_A = data[1::16] # starting at position 1, every 16th bit is an odd bit of chip A
data_of_chip_A = bitarray(len(even_bits_of_chip_A) + len(odd_bits_of_chip_A))
data_of_chip_A[0::2] = even_bits_of_chip_A
data_of_chip_A[1::2] = odd_bits_of_chip_A

Python high precision integer to array of numpy integer

I understand that numpy can't handle non-native integers, but how can I store Python high-precision integers as an array of native integers (in either endianness)? For example,
a = 105951305240504794066266398962584786593081186897777398483830058739006966285013
can't be stored as a native integer because it is 256 bits wide. But it can be stored as
A = array([18196013122530525909, 15462736877728584896,
           12869567647602165677, 16879016735257358861], dtype=uint64)
using little-endian word order (i.e. a == A[0] + (A[1] << 64) + (A[2] << 128) + (A[3] << 192)), or as A[::-1] for big-endian. How can I convert from a to A here?
I want to convert this "python-side" number to "numpy-side" so that I can run highly efficient algorithms on it (e.g. fast multiplication using Fourier transform).
I believe Python internally should already be using a similar structure. All I need to do is to "expose" it to numpy, but I'm not sure about the exact structure or how I can "expose" it. The most straightforward way is of course using a while loop:
import numpy as np
A = np.zeros(4, 'uint64')
i = 0
while a > 0:
    A[i] = a & (2**64 - 1)
    a >>= 64
    i += 1
But I'm wondering: are there more "native" or "efficient" ways?
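For example, one route that occurred to me (an untested sketch; it assumes Python 3, where int has to_bytes) is to go through a little-endian byte string, which should give the little-endian word order shown above:
import numpy as np
a = 105951305240504794066266398962584786593081186897777398483830058739006966285013
nwords = max(1, (a.bit_length() + 63) // 64)   # number of 64-bit words needed
buf = a.to_bytes(nwords * 8, 'little')         # little-endian byte string
A = np.frombuffer(buf, dtype='<u8').copy()     # copy() to get a writable array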
Thanks for your help!

Save list of numbers to (binary) file with defined bits per number

I have a list/array of numbers, which I want to save to a binary file.
The crucial part is that each number should not be saved as a pre-defined data type.
The bits per value are constant for all values in the list but do not correspond to the typical data types (e.g. byte or int).
import numpy as np
# create 10 random numbers in range 0-63
values = np.int32(np.round(np.random.random(10)*63))
# each value requires exactly 6 bits
# how to save this to a file?
# just for debug/information: bit string representation
bitstring = "".join(map(lambda x: str(bin(x)[2:]).zfill(6), values));
print(bitstring)
In the real project, there are more than a million values I want to store with a given bit depth.
I already tried the bitstring module, but appending each value to the BitArray takes a lot of time...
There may be some numpy-specific way that makes things easier, but here's a pure Python (2.x) way to do it. It first converts the list of values into a single integer, since Python supports int values of any length. Next it converts that int value into a string of bytes and writes it to the file.
Note: If you're sure all the values will fit within the bit-width specified, the array_to_int() function could be sped up slightly by changing the (value & mask) it's using to just value.
import random
def array_to_int(values, bitwidth):
    mask = 2**bitwidth - 1
    shift = bitwidth * (len(values)-1)
    integer = 0
    for value in values:
        integer |= (value & mask) << shift
        shift -= bitwidth
    return integer
# In Python 2.7 int and long don't have the "to_bytes" method found in Python 3.x,
# so here's one way to do the same thing.
def to_bytes(n, length):
    return ('%%0%dx' % (length << 1) % n).decode('hex')[-length:]
BITWIDTH = 6
#values = [random.randint(0, 2**BITWIDTH - 1) for _ in range(10)]
values = [0b000001 for _ in range(10)]  # create fixed pattern for debugging
values[9] = 0b011111  # make last one different so it can be spotted
# just for debug/information: bit string representation
bitstring = "".join(map(lambda x: bin(x)[2:].zfill(BITWIDTH), values))
print(bitstring)
bigint = array_to_int(values, BITWIDTH)
width = BITWIDTH * len(values)
print('{:0{width}b}'.format(bigint, width=width))  # show integer's value in binary
num_bytes = (width + 7) // 8  # round up to a whole number of 8-bit bytes
with open('data.bin', 'wb') as file:
    file.write(to_bytes(bigint, num_bytes))
Since you give an example with a string, I'll assume that's how you get the results. This means performance is probably never going to be great. If you can, try creating bytes directly instead of via a string.
Side note: I'm using Python 3 which might require you to make some changes for Python 2. I think this code should work directly in Python 2, but there are some changes around bytearrays and strings between 2 and 3, so make sure to check.
byt = bytearray(len(bitstring)//8 + 1)
for i, b in enumerate(bitstring):
    byt[i//8] += (b == '1') << i % 8
and for getting the bits back:
bitret = ''
for b in byt:
    for i in range(8):
        bitret += str((b >> i) & 1)
For millions of bits/bytes you'll want to convert this to a streaming method instead, as you'd need a lot of memory otherwise.
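A rough sketch of what such a streaming version could look like, packing each value's bits LSB-first into an accumulator and flushing whole bytes to the file as they fill up (untested; the exact bit layout differs from the string version above, but the idea is the same):
def pack_bits_to_file(values, bitwidth, out_file, chunk_bytes=4096):
    mask = (1 << bitwidth) - 1
    acc = 0        # bit accumulator
    nbits = 0      # number of valid bits currently in acc
    buf = bytearray()
    for v in values:
        acc |= (v & mask) << nbits
        nbits += bitwidth
        while nbits >= 8:
            buf.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
            if len(buf) >= chunk_bytes:
                out_file.write(buf)
                buf = bytearray()
    if nbits:
        buf.append(acc & 0xFF)   # flush the final partial byte
    if buf:
        out_file.write(buf)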

Pushing Radix Sort (and python) to its limits

I've been immensely frustrated with many of the implementations of python radix sort out there on the web.
They consistently use a radix of 10 and get the digits of the numbers they iterate over by dividing by a power of 10 or taking the log10 of the number. This is incredibly inefficient, as log10 is not a particularly quick operation compared to bit shifting, which is nearly 100 times faster!
A much more efficient implementation uses a radix of 256 and sorts the number byte by byte. This allows for all of the 'byte getting' to be done using the ridiculously quick bit operators. Unfortunately, it seems that absolutely nobody out there has implemented a radix sort in python that uses bit operators instead of logarithms.
So, I took matters into my own hands and came up with this beast, which runs at about half the speed of sorted on small arrays and runs nearly as quickly on larger ones (e.g. len around 10,000,000):
import itertools
def radix_sort(unsorted):
    "Fast implementation of radix sort for any size num."
    maximum, minimum = max(unsorted), min(unsorted)
    max_bits = maximum.bit_length()
    highest_byte = max_bits // 8 if max_bits % 8 == 0 else (max_bits // 8) + 1
    min_bits = minimum.bit_length()
    lowest_byte = min_bits // 8 if min_bits % 8 == 0 else (min_bits // 8) + 1
    sorted_list = unsorted
    for offset in xrange(lowest_byte, highest_byte):
        sorted_list = radix_sort_offset(sorted_list, offset)
    return sorted_list
def radix_sort_offset(unsorted, offset):
    "Helper function for radix sort, sorts each offset."
    byte_check = (0xFF << offset*8)
    buckets = [[] for _ in xrange(256)]
    for num in unsorted:
        byte_at_offset = (num & byte_check) >> offset*8
        buckets[byte_at_offset].append(num)
    return list(itertools.chain.from_iterable(buckets))
This version of radix sort works by finding which bytes it has to sort by (if you pass it only integers below 256, it'll sort just one byte, etc.), then sorting each byte from the LSB up by dumping the numbers into buckets in order and chaining the buckets back together. Repeat this for each byte that needs to be sorted and you have your nice sorted array in O(n) time.
However, it's not as fast as it could be, and I'd like to make it faster before I write about it as a better radix sort than all the other radix sorts out there.
Running cProfile on this tells me that a lot of time is being spent on the append method for lists, which makes me think that this block:
for num in unsorted:
    byte_at_offset = (num & byte_check) >> offset*8
    buckets[byte_at_offset].append(num)
in radix_sort_offset is eating a lot of time. This is also the block that, if you really look at it, does 90% of the work for the whole sort. This code looks like it could be numpy-ized, which I think would result in quite a performance boost. Unfortunately, I'm not very good with numpy's more complex features so haven't been able to figure that out. Help would be very appreciated.
I'm currently using itertools.chain.from_iterable to flatten the buckets, but if anyone has a faster suggestion I'm sure it would help as well.
Originally, I had a get_byte function that returned the nth byte of a number, but inlining the code gave me a huge speed boost, so I inlined it instead.
Any other comments on the implementation or ways to squeeze out more performance are also appreciated. I want to hear anything and everything you've got.
You already realized that
for num in unsorted:
    byte_at_offset = (num & byte_check) >> offset*8
    buckets[byte_at_offset].append(num)
is where most of the time goes - good ;-)
There are two standard tricks for speeding that kind of thing, both having to do with moving invariants out of loops:
Compute "offset*8" outside the loop. Store it in a local variable. Save a multiplication per iteration.
Add bucketappender = [bucket.append for bucket in buckets] outside the loop. Saves a method lookup per iteration.
Combine them, and the loop looks like:
for num in unsorted:
    bucketappender[(num & byte_check) >> ofs8](num)
Collapsing it to one statement also saves a pair of local variable store/fetch opcodes per iteration.
But, at a higher level, the standard way to speed radix sort is to use a larger radix. What's magical about 256? Nothing, apart from that it's convenient for bit-shifting. But so are 512, 1024, 2048 ... it's a classical time/space tradeoff.
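For instance, the inner loop with an 11-bit radix (2048 buckets) would look something like this (a sketch; pass_number and unsorted are placeholders here):
RADIX_BITS = 11
mask = (1 << RADIX_BITS) - 1
buckets = [[] for _ in xrange(1 << RADIX_BITS)]
bucketappender = [bucket.append for bucket in buckets]
shift = pass_number * RADIX_BITS   # pass_number is a placeholder for the current pass
for num in unsorted:
    bucketappender[(num >> shift) & mask](num)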
PS: for very long numbers,
(num >> offset*8) & 0xff
will run faster. That's because your num & byte_check takes time proportional to log(num) - it generally has to create an integer about as big as num.
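A quick, throwaway way to compare the two forms on big numbers (not from any profiling of the code above):
import timeit
setup = "num = (1 << 4096) - 1; byte_check = 0xFF << 16; offset = 2"
print(timeit.timeit("(num & byte_check) >> offset*8", setup=setup))
print(timeit.timeit("(num >> offset*8) & 0xff", setup=setup))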
This is an old thread, but I came across this when looking to radix sort an array of positive integers. I was trying to see if I can do any better than the already wickedly fast timsort (hats off to you again, Tim Peters) which implements python's builtin sorted and sort! Either I don't understand certain aspects of the above code, or if I do, the code as presented above has some problems IMHO.
It only sorts bytes starting with the highest byte of the smallest item and ending with the highest byte of the biggest item. This may be okay in some cases of special data. But in general the approach fails to differentiate items which differ on account of the lower bits. For example:
arr=[65535,65534]
radix_sort(arr)
produces the wrong output:
[65535, 65534]
The range used to loop over the helper function is not correct. What I mean is that if lowest_byte and highest_byte are the same, execution of the helper function is altogether skipped. BTW I had to change xrange to range in 2 places.
With modifications to address the above 2 points, I got it to work. But it is taking 10-20 times as long as python's builtin sorted or sort! I know timsort is very efficient and takes advantage of already-sorted runs in the data. But I was trying to see if I could use the prior knowledge that my data is all positive integers to some advantage in my sorting. Why is the radix sort doing so badly compared to timsort? The array sizes I was using are on the order of 80K items. Is it because the timsort implementation, in addition to its algorithmic efficiency, also has other efficiencies stemming from possible use of low-level libraries? Or am I missing something entirely? The modified code I used is below:
import itertools
def radix_sort(unsorted):
    "Fast implementation of radix sort for any size num."
    maximum, minimum = max(unsorted), min(unsorted)
    max_bits = maximum.bit_length()
    highest_byte = max_bits // 8 if max_bits % 8 == 0 else (max_bits // 8) + 1
    # min_bits = minimum.bit_length()
    # lowest_byte = min_bits // 8 if min_bits % 8 == 0 else (min_bits // 8) + 1
    sorted_list = unsorted
    # xrange changed to range, lowest_byte deleted from the arguments
    for offset in range(highest_byte):
        sorted_list = radix_sort_offset(sorted_list, offset)
    return sorted_list
def radix_sort_offset(unsorted, offset):
    "Helper function for radix sort, sorts each offset."
    byte_check = (0xFF << offset*8)
    # xrange changed to range
    buckets = [[] for _ in range(256)]
    for num in unsorted:
        byte_at_offset = (num & byte_check) >> offset*8
        buckets[byte_at_offset].append(num)
    return list(itertools.chain.from_iterable(buckets))
You could simply use one of the existing C or C++ implementations, such as, for example, integer_sort from Boost.Sort or u4_sort from usort. It is surprisingly easy to call native C or C++ code from Python; see How to sort an array of integers faster than quicksort?
I totally get your frustration. Although it's been more than 2 years, numpy still does not have radix sort. I will let the NumPy developers know that they could simply grab one of the existing implementations; licensing should not be an issue.

Process RGBA data efficiently using python?

I'm trying to process an RGBA buffer (list of chars), and run "unpremultiply" on each pixel. The algorithm is color_out=color*255/alpha.
This is what I came up with:
def rgba_unpremultiply(data):
    for i in range(0, len(data), 4):
        a = ord(data[i+3])
        if a != 0:
            data[i] = chr(255*ord(data[i])/a)
            data[i+1] = chr(255*ord(data[i+1])/a)
            data[i+2] = chr(255*ord(data[i+2])/a)
    return data
It works, but it is far too slow.
I'm wondering besides writing a C module, what are my options to optimize this particular function?
This is exactly the kind of code NumPy is great for.
import numpy
def rgba_unpremultiply(data):
a = numpy.fromstring(data, 'B') # Treat the string as an array of bytes
a = a.astype('I') # Cast array of bytes to array of uints, since temporary values needs to be larger than byte
alpha = a[3::4] # Every 4th element starting from index 3
alpha = numpy.where(alpha == 0, 255, alpha) # Don't modify colors where alpha is 0
a[0::4] = a[0::4] * 255 // alpha # Operates on entire slices of the array instead of looping over each element
a[1::4] = a[1::4] * 255 // alpha
a[2::4] = a[2::4] * 255 // alpha
return a.astype('B').tostring() # Cast back to bytes
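For example, called on a tiny two-pixel buffer (a Python 2 str, to match the question's data):
raw = '\x0a\x14\x1e\x80\x00\x00\x00\x00'          # RGB 10,20,30 with alpha 128, then a fully transparent pixel
print([ord(c) for c in rgba_unpremultiply(raw)])  # expected: [19, 39, 59, 128, 0, 0, 0, 0]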
How big is data? Assuming this is on Python 2.x, try using xrange instead of range so that you don't have to constantly allocate and reallocate a large list.
You could convert all the data to integers for working with them so you're not constantly converting to and from characters.
Look into using numpy to vectorize this. I suspect that simply storing the data as integers and using a numpy array will greatly improve the performance.
And another relatively simple thing you could do is write a little Cython:
http://wiki.cython.org/examples/mandelbrot
Basically Cython will compile your above function into C code with just a few lines of type hints. It greatly reduces the barrier to writing a C extension.
I don't have a concrete answer, but some useful pointers might be:
Python's array module
numpy
OpenCV if you have actual image data
There are some minor things you can do, but I do not think you can improve a lot.
Anyway, here are some hints:
def rgba_unpremultiply(data):
    # xrange() performs better than range() here; it does not build the whole list up front
    for i in xrange(0, len(data), 4):
        a = ord(data[i+3])
        if a != 0:
            # Not sure about this, but maybe (c << 8) - c is faster than c*255,
            # so you could rearrange the computation like this.
            # Check whether it actually improves performance.
            data[i] = chr(((ord(data[i]) << 8) - ord(data[i]))/a)
            data[i+1] = chr(255*ord(data[i+1])/a)
            data[i+2] = chr(255*ord(data[i+2])/a)
    return data
I've just run a quick benchmark of << vs *, and there doesn't seem to be a noticeable difference, but I guess you can do a better evaluation on your own project.
Anyway, a C module may be a good option, even if the problem does not seem to be "language related".
