Python: Size of (large) dict 10 times smaller when pickled

I'm trying to understand what's going on internally in Python in the following.
Situation (Python 3 on Debian):
A (large) dict that has integers as keys (running from zero) and tuples as values.
The elements of the tuple are ALL integers (randomly from zero to the number of the largest key).
All tuples have exactly 30 elements.
Problem:
The pickled dict on my hard disk is significantly (approx. 10 times!) smaller than the combined in-memory size of its individual elements.
Details:
The size of an integer is 28 bytes (except 0, which is just 24 bytes).
The size of a tuple is dependent on the number of elements it contains; assuming 30 elements it is 288 bytes.
The size of a dictionary is dependent on the number of elements it contains; assuming 1000 elements it is 49248 bytes.
Given the situation above, 1000 elements in the dict, and assuming the number 0 appears 29 times in the tuples, I get:
size of the integers in the tuples: 28 x 30 x 1000 - 4 x 29 = 839,884 bytes
size of the tuples: 288 x 1000 = 288,000 bytes
size of the keys: 28 x 1000 - 4 (the first key is zero) = 27,996 bytes
size of the dict with 1000 elements: 49,248 bytes
Sum of this all = 1,205,128 bytes
Now I pickle this dict to the hard disk as a binary file, and I actually get 91,207 bytes as the size of the file.
So my question is now: what is going on here?
Is the pickling "compressing" the integers down to just their bits (or something like that)? The number 1000, for example, can be represented with just 10 bits and would fit into 2 bytes (instead of 28).
Code that might be useful:
import os
import pickle
import random
import sys

max_key = 1000
zeros = 0
theoretical_size = 0
the_dict = {}

for i in range(max_key):
    the_tuple = tuple()
    ii = 0
    while ii < 30:
        number = random.randint(0, (max_key - 1))
        if number not in the_tuple:
            the_tuple += (number, )
            theoretical_size += sys.getsizeof(number)
            ii += 1
            if not number:
                zeros += 1
    theoretical_size += sys.getsizeof(the_tuple)
    theoretical_size += sys.getsizeof(i)
    the_dict[i] = the_tuple
theoretical_size += sys.getsizeof(the_dict)

outfile = '/path/to/outfile/outfilename'
with open(outfile, 'wb') as f:
    pickle.dump(the_dict, f)

print("           zeros:", zeros)
print("theoretical size:", theoretical_size)
print("      Calculated:", 28*30*max_key - 4*zeros + 288*max_key + 28*max_key - 4 + sys.getsizeof(the_dict))
print("         On disk:", os.path.getsize(outfile))

Related

How to measure the true size of an object in Python (including its elements/children) [duplicate]

I am trying to create a list of size 1 KB (1024 bytes). While the following code works:
dummy = ['a' for i in xrange(0, 1024)]
sys.getsizeof(dummy)
Out[1]: 9032
The following code does not work.
import os
import sys
dummy = []
dummy.append(os.urandom(1024))
sys.getsizeof(dummy)
Out[1]: 104
Can someone explain why?
If you're wondering why I am not using the first code snippet, I am writing a program to benchmark my memory by writing a for loop that writes blocks (of size 1 B, 1 KB and 1 MB) into memory.
start = time.time()
for i in xrange(1, (1024 * 10)):
    dummy.append(os.urandom(1024))  # loop to write 1 KB blocks into memory
end = time.time()
If you check the size of a list, it will provide the size of the list data structure itself, including the pointers to its constituent elements. It won't consider the sizes of the elements themselves.
str1_size = sys.getsizeof(['a' for i in xrange(0, 1024)])
str2_size = sys.getsizeof(['abc' for i in xrange(0, 1024)])
int_size = sys.getsizeof([123 for i in xrange(0, 1024)])
none_size = sys.getsizeof([None for i in xrange(0, 1024)])
str1_size == str2_size == int_size == none_size
The size of an empty list: sys.getsizeof([]) == 72
Add an element: sys.getsizeof([1]) == 80
Add another element: sys.getsizeof([1, 1]) == 88
So each element adds 8 bytes.
To get to 1024 bytes, we need (1024 - 72) / 8 = 119 elements.
The size of the list with 119 elements: sys.getsizeof([None for i in xrange(0, 119)]) == 1080.
This is because a list maintains an extra buffer for inserting more items, so that it doesn't have to resize every time. (The size comes out to be same as 1080 for number of elements between 107 and 126).
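You can watch this over-allocation happen by appending one element at a time and printing the size only when it changes (a quick sketch in Python 3; the exact numbers depend on the interpreter version and build):
import sys

lst = []
prev = sys.getsizeof(lst)
for _ in range(32):
    lst.append(None)
    size = sys.getsizeof(lst)
    if size != prev:
        # the size only jumps at reallocation points; in between,
        # appends fill the pre-allocated spare slots
        print(len(lst), size)
        prev = size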
So what we need is an immutable data structure, which doesn't need to keep this buffer - tuple.
empty_tuple_size = sys.getsizeof(()) # 56
single_element_size = sys.getsizeof((1,)) # 64
pointer_size = single_element_size - empty_tuple_size # 8
n_1mb = (1024 - empty_tuple_size) / pointer_size # (1024 - 56) / 8 = 121
tuple_1mb = (1,) * n_1mb
sys.getsizeof(tuple_1mb) == 1024
So this is your answer to get a 1024-byte data structure: (1,)*121
But note that this is only the size of the tuple and its constituent pointers. For the total size, you actually need to add up the sizes of the individual elements.
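If you want that total, a recursive helper along the following lines is a common recipe (a sketch, not a standard-library function; it tracks ids to avoid counting shared objects twice):
import sys

def deep_getsizeof(obj, seen=None):
    """Roughly sum sys.getsizeof over an object and everything it contains."""
    if seen is None:
        seen = set()
    if id(obj) in seen:  # don't count shared objects twice
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size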
Alternate:
sys.getsizeof('') == 37
sys.getsizeof('1') == 38 # each character adds 1 byte
For 1024 bytes, we need 987 characters:
sys.getsizeof('1'*987) == 1024
And this is the actual size, not just the size of pointers.

list.insert(position, value) grows list indefinitely

If I run the following code:
data = list()
length = 10
for i in range(1000):
    point = i % length
    data.insert(point, i)
len(data)
The output is: 1000
I was expecting the length to be 10, as I am restricting point to the range 0-9.
What am I doing wrong?
insert always adds a new element at the given position; to overwrite old ones, try this instead:
length = 10
data = [None] * length
for i in range(1000):
    point = i % length
    data[point] = i
len(data)
=> 10
Although it's not clear why you want to loop 1000 times when only the last 10 values are needed... Wouldn't it be better to use range(990, 1000)?
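If the real goal is just to keep the most recent length values, collections.deque with maxlen does that directly (a sketch based on the question's numbers; note it keeps the last items in arrival order rather than at i % length positions):
from collections import deque

length = 10
data = deque(maxlen=length)  # old items fall off the left end automatically
for i in range(1000):
    data.append(i)
list(data)
=> [990, 991, 992, 993, 994, 995, 996, 997, 998, 999]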

Generate values from a frequency distribution

I'm currently analyzing 16-bit binary strings - something like 0010001010110100. I have approximately 30 of these strings. I have written a simple program in Matlab that counts the number of 1s in each bit position across all 30 strings.
So, for example (bit position followed by the count of 1s):
1   30
2   15
3    1
4   10
etc.
I want to generate more strings (100s) that roughly follow the frequency distribution above. Is there a Matlab (or Python or R) command that does that?
What I'm looking for is something like this: http://www.prenhall.com/weiss_dswin/html/simulate.htm
In MATLAB: just use < (or lt, less than) on rand:
len = 16; % string length
% counts of 1s for each bit (just random integer here)
counts = randi([0 30],[1 len]);
% probability for 1 in each bit
prob = counts./30;
% generate 100 random strings
n = 100;
moreStrings = rand(n, len);
% for each bit check if number is less than the probability of the bit
moreStrings = bsxfun(@lt, moreStrings, prob); % lt(x,y) := x < y
In Python:
import numpy as np

nbits = 16  # string length (calling this "len" as in the MATLAB code would shadow the builtin)
# counts of 1's for each bit (just random integers here; note randint's upper bound is exclusive)
counts = np.random.randint(0, 31, (1, nbits)).astype(float)
# probability for 1 in each bit
prob = counts / 30
# generate 100 random strings
n = 100
moreStrings = np.random.rand(n, nbits)
# for each bit check if number is less than the probability of the bit
moreStrings = moreStrings < prob
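If you need the results back as text strings like the input, each boolean row can be joined up again (a small sketch, assuming the moreStrings array from above):
strings = [''.join('1' if bit else '0' for bit in row) for row in moreStrings]
print(strings[0])  # e.g. '0010001010110100'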

Python - reading 10 bit integers from a binary file

I have a binary file containing a stream of 10-bit integers. I want to read it and store the values in a list.
It is working with the following code, which reads my_file and fills pixels with integer values:
file = open("my_file", "rb")
pixels = []
new10bitsByte = ""
try:
    byte = file.read(1)
    while byte:
        bits = bin(ord(byte))[2:].rjust(8, '0')
        for bit in reversed(bits):
            new10bitsByte += bit
            if len(new10bitsByte) == 10:
                pixels.append(int(new10bitsByte[::-1], 2))
                new10bitsByte = ""
        byte = file.read(1)
finally:
    file.close()
It doesn't seem very elegant to read the bytes into bits, and then read them back into "10-bit" bytes. Is there a better way to do it?
With 8 or 16 bit integers I could just use file.read(size) and convert the result to an int directly. But here, as each value is stored in 1.25 bytes, I would need something like file.read(1.25)...
Here's a generator that does the bit operations without using text string conversions. Hopefully, it's a little more efficient. :)
To test it, I write all the numbers in range(1024) to a BytesIO stream, which behaves like a binary file.
from io import BytesIO

def tenbitread(f):
    ''' Generate 10 bit (unsigned) integers from a binary file '''
    while True:
        b = f.read(5)
        if len(b) == 0:
            break
        n = int.from_bytes(b, 'big')
        # Split n into 4 10 bit integers
        t = []
        for i in range(4):
            t.append(n & 0x3ff)
            n >>= 10
        yield from reversed(t)

# Make some test data: all the integers in range(1024),
# and save it to a byte stream
buff = BytesIO()
maxi = 1024
n = 0
for i in range(maxi):
    n = (n << 10) | i
    # Convert the 40 bit integer to 5 bytes & write them
    if i % 4 == 3:
        buff.write(n.to_bytes(5, 'big'))
        n = 0

# Rewind the stream so we can read from it
buff.seek(0)
# Read the data in 10 bit chunks
a = list(tenbitread(buff))
# Check it
print(a == list(range(maxi)))
output
True
Doing list(tenbitread(buff)) is the simplest way to turn the generator output into a list, but you can easily iterate over the values instead, e.g.
for v in tenbitread(buff):
or
for i, v in enumerate(tenbitread(buff)):
if you want indices as well as the data values.
Here's a little-endian version of the generator which gives the same results as your code.
def tenbitread(f):
    ''' Generate 10 bit (unsigned) integers from a binary file '''
    while True:
        b = f.read(5)
        if not len(b):
            break
        n = int.from_bytes(b, 'little')
        # Split n into 4 10 bit integers
        for i in range(4):
            yield n & 0x3ff
            n >>= 10
We can improve this version slightly by "un-rolling" that for loop, which lets us get rid of the final masking and shifting operations.
def tenbitread(f):
    ''' Generate 10 bit (unsigned) integers from a binary file '''
    while True:
        b = f.read(5)
        if not len(b):
            break
        n = int.from_bytes(b, 'little')
        # Split n into 4 10 bit integers
        yield n & 0x3ff
        n >>= 10
        yield n & 0x3ff
        n >>= 10
        yield n & 0x3ff
        n >>= 10
        yield n
This should give a little more speed...
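A rough way to compare the speed of these variants is a small timeit harness (a sketch; the payload here is arbitrary test data whose length is a multiple of 5 bytes):
import io
import timeit

data = bytes(range(256)) * 100  # 25600 bytes = 5120 packed 10-bit values

def run():
    return list(tenbitread(io.BytesIO(data)))

print(timeit.timeit(run, number=100), "seconds for 100 runs")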
As there is no direct way to read a file x bits at a time in Python, we have to read it byte by byte. Following MisterMiyagi's and PM 2Ring's suggestions I modified my code to read the file in 5-byte chunks (i.e. 40 bits) and then split the resulting string into 4 10-bit numbers, instead of looping over the bits individually. It turned out to be twice as fast as my previous code.
file = open("my_file", "rb")
pixels = []
exit_loop = False
try:
    while not exit_loop:
        # Read 5 consecutive bytes into fiveBytesString
        fiveBytesString = ""
        for i in range(5):
            byte = file.read(1)
            if not byte:
                exit_loop = True
                break
            byteString = format(ord(byte), '08b')
            fiveBytesString += byteString[::-1]
        # Split fiveBytesString into 4 10-bit numbers, and add them to pixels
        pixels.extend([int(fiveBytesString[i:i+10][::-1], 2) for i in range(0, 40, 10) if len(fiveBytesString[i:i+10]) > 0])
finally:
    file.close()
Adding a NumPy-based solution suitable for unpacking large 10-bit packed byte buffers like the ones you might receive from AVT and FLIR cameras.
This is a 10-bit version of @cyrilgaudefroy's answer to a similar question; there you can also find a Numba alternative capable of yielding an additional speed increase.
import numpy as np

def read_uint10(byte_buf):
    data = np.frombuffer(byte_buf, dtype=np.uint8)
    # 5 bytes contain 4 10-bit pixels (5x8 == 4x10)
    b1, b2, b3, b4, b5 = np.reshape(data, (data.shape[0]//5, 5)).astype(np.uint16).T
    o1 = (b1 << 2) + (b2 >> 6)
    o2 = ((b2 % 64) << 4) + (b3 >> 4)
    o3 = ((b3 % 16) << 6) + (b4 >> 2)
    o4 = ((b4 % 4) << 8) + b5
    unpacked = np.reshape(np.concatenate((o1[:, None], o2[:, None], o3[:, None], o4[:, None]), axis=1), 4*o1.shape[0])
    return unpacked
Reshape can be omitted if returning a buffer instead of a Numpy array:
unpacked = np.concatenate((o1[:, None], o2[:, None], o3[:, None], o4[:, None]), axis=1).tobytes()
Or if image dimensions are known it can be reshaped directly, e.g.:
unpacked = np.reshape(np.concatenate((o1[:, None], o2[:, None], o3[:, None], o4[:, None]), axis=1), (1024, 1024))
If the use of the modulus operator appears confusing, try playing around with:
np.unpackbits(np.array([255%64], dtype=np.uint8))
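As a quick sanity check of the MSB-first layout above, a round trip through a small packing helper should reproduce the input exactly (a test sketch; pack_uint10 is written here just for verification and is not part of any camera format):
def pack_uint10(values):
    # inverse of read_uint10: pack each group of 4 10-bit values
    # into 5 bytes, most significant bits first
    out = bytearray()
    for i in range(0, len(values), 4):
        v1, v2, v3, v4 = values[i:i+4]
        n = (v1 << 30) | (v2 << 20) | (v3 << 10) | v4
        out += n.to_bytes(5, 'big')
    return bytes(out)

vals = list(range(1020))  # 10-bit values, count divisible by 4
print(np.array_equal(read_uint10(pack_uint10(vals)), vals))  # True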
Edit: It turns out that the Allied Vision Mako-U cameras employ a different ordering than the one I originally suggested above:
o1 = ((b2 % 4) << 8) + b1
o2 = ((b3 % 16) << 6) + (b2 >> 2)
o3 = ((b4 % 64) << 4) + (b3 >> 4)
o4 = (b5 << 2) + (b4 >> 6)
So you might have to test different orders if images come out looking wonky initially for your specific setup.

Query Board challenge on Python, need some pointers

So, I have this challenge on CodeEval, but I don't seem to know where to start, so I need some pointers (and answers, if you can) to help me figure out this challenge.
DESCRIPTION:
There is a board (matrix). Every cell of the board contains one integer, which is 0 initially.
The next operations can be applied to the Query Board:
SetRow i x: it means that all values in the cells of row "i" are changed to value "x" after this operation.
SetCol j x: it means that all values in the cells of column "j" are changed to value "x" after this operation.
QueryRow i: it means that you should output the sum of values on row "i".
QueryCol j: it means that you should output the sum of values on column "j".
The board's dimensions are 256x256
i and j are integers from 0 to 255
x is an integer from 0 to 31
INPUT SAMPLE:
Your program should accept as its first argument a path to a filename. Each line in this file contains an operation of a query. E.g.
SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2
OUTPUT SAMPLE:
For each query, output the answer of the query. E.g.
5118
34
1792
3571
I'm not that great on Python, but this challenge is pretty interesting, although I don't have any clue how to solve it. So, I need some help from you guys.
Thanks!
You could use a sparse matrix for this, addressed by (row, column) tuples as keys in a dictionary, to save memory. 64k cells is a big list otherwise (2MB+ on a 64-bit system):
matrix = {}
This is way more efficient, as the challenge is unlikely to set values for all rows and columns on the board.
Setting a column or row is then:
def set_col(col, x):
    for i in range(256):
        matrix[i, col] = x

def set_row(row, x):
    for i in range(256):
        matrix[row, i] = x
and summing a row or column is then:
def get_col(col):
    return sum(matrix.get((i, col), 0) for i in range(256))

def get_row(row):
    return sum(matrix.get((row, i), 0) for i in range(256))
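Wiring these functions up to the input format could look like this (a sketch; it assumes the four functions above and the file path given as the first command-line argument, per the challenge):
import sys

setters = {'SetRow': set_row, 'SetCol': set_col}
queries = {'QueryRow': get_row, 'QueryCol': get_col}

with open(sys.argv[1]) as f:
    for line in f:
        op, *args = line.split()
        if op in setters:
            setters[op](*map(int, args))
        else:
            print(queries[op](int(args[0])))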
WIDTH, HEIGHT = 256, 256
board = [[0] * WIDTH for i in range(HEIGHT)]

def set_row(i, x):
    global board
    board[i] = [x] * WIDTH
... implement each function, then parse each line of input to decide which function to call:
for line in inf:
    dat = line.split()
    if dat[0] == "SetRow":
        set_row(int(dat[1]), int(dat[2]))
    elif ...
Edit: Per Martijn's comments:
total memory usage for board is about 2.1MB. By comparison, after 100 random row/column writes, matrix is 3.1MB (although it tops out there and doesn't get any bigger).
yes, global is unnecessary when modifying a global object (just don't try to assign to it).
while dispatching from a dict is good and efficient, I did not want to inflict it on someone who is "not that great on Python", especially for just four entries.
For the sake of comparison, how about:
time = 0
WIDTH, HEIGHT = 256, 256
INIT = 0
rows = [(time, INIT) for _ in range(WIDTH)]
cols = [(time, INIT) for _ in range(HEIGHT)]

def set_row(i, x):
    global time
    time += 1
    rows[int(i)] = (time, int(x))

def set_col(i, x):
    global time
    time += 1
    cols[int(i)] = (time, int(x))

def query_row(i):
    rt, rv = rows[int(i)]
    total = rv * WIDTH + sum(cv - rv for ct, cv in cols if ct > rt)
    print(total)

def query_col(j):
    ct, cv = cols[int(j)]
    total = cv * HEIGHT + sum(rv - cv for rt, rv in rows if rt > ct)
    print(total)

ops = {
    "SetRow": set_row,
    "SetCol": set_col,
    "QueryRow": query_row,
    "QueryCol": query_col
}

inf = """SetCol 32 20
SetRow 15 7
SetRow 16 31
QueryCol 32
SetCol 2 14
QueryRow 10
SetCol 14 0
QueryRow 15
SetRow 10 1
QueryCol 2""".splitlines()

for line in inf:
    line = line.split()
    op = line.pop(0)
    ops[op](*line)
which only uses 4.3k of memory for rows[] and cols[].
Edit 2: using your code from above for matrix, set_row, set_col:
import sys

for n in range(256):
    set_row(n, 1)
    print("{}: {}".format(2*(n+1)-1, sys.getsizeof(matrix)))
    set_col(n, 1)
    print("{}: {}".format(2*(n+1), sys.getsizeof(matrix)))
which returns (condensed):
1: 12560
2: 49424
6: 196880
22: 786704
94: 3146000
... basically, the allocated memory quadruples at each of these steps. If I change the memory measure to include the key tuples,
def get_matrix_size():
    return sys.getsizeof(matrix) + sum(sys.getsizeof(key) for key in matrix)
it increases more smoothly, but still takes big jumps at the points above:
5 : 127.9k
6 : 287.7k
21 : 521.4k
22 : 1112.7k
60 : 1672.0k
61 : 1686.1k <-- approx expected size on your reported problem set
93 : 2121.1k
94 : 4438.2k
