Generate values from a frequency distribution - python

I'm currently analyzing 16-bit binary strings - something like 0010001010110100. I have approximately 30 of these strings. I have written a simple program in Matlab that counts the number of 1s in each bit position across all 30 strings.
So, for example:
1 30
2 15
3 1
4 10
etc
I want to generate more strings (100s) that roughly follow the frequency distribution above. Is there a Matlab (or Python or R) command that does that?
What I'm looking for is something like this: http://www.prenhall.com/weiss_dswin/html/simulate.htm

In MATLAB: just use < (or lt, less than) on rand:
len = 16; % string length
% counts of 1s for each bit (just random integer here)
counts = randi([0 30],[1 len]);
% probability for 1 in each bit
prob = counts./30;
% generate 100 random strings
n = 100;
moreStrings = rand(n,len);
% for each bit, check whether the random number is less than that bit's probability
moreStrings = bsxfun(@lt, moreStrings, prob); % lt(x,y) := x < y
In Python:
import numpy as np
str_len = 16  # string length
# counts of 1's for each bit (random integers here as a stand-in for your measured counts)
counts = np.random.randint(0, 31, (1, str_len)).astype(float)
# probability of a 1 in each bit
prob = counts / 30
# generate 100 random strings
n = 100
moreStrings = np.random.rand(n, str_len)
# for each bit, check whether the random number is less than that bit's probability
moreStrings = moreStrings < prob
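If you need actual '0'/'1' strings rather than a boolean array, a small follow-up sketch (reusing the moreStrings array from above) converts each row:
# convert each boolean row back into a bit string
bit_strings = [''.join('1' if bit else '0' for bit in row) for row in moreStrings]
print(bit_strings[:3])  # first few generated strings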

Related

Python: Size of (large) dict 10 times smaller when pickled

I'm trying to understand what's going in internally with python in the following.
Situation (Python3 on debian):
A (large) dict that has integers as keys (running from zero) and tuples as values.
The elements of the tuple are ALL integers (randomly from zero to the number of the largest key).
All tuples have exactly 30 elements.
Problem:
The pickled dict on my hard disk is significantly (approx. 10 times!) smaller than the sum of the sizes its elements should occupy in memory.
Details:
The size of an integer is 28 bytes (except 0, which is just 24 bytes).
The size of a tuple is dependent on the number of elements it contains; assuming 30 elements it is 288 bytes.
The size of a dictionary is dependent on the number of elements it contains; assuming 1000 elements it is 49248 bytes.
Given the situation above, 1000 elements in the dict and assuming the number 0 appears 29 times in the tuples I get:
size of the integers in the tuples: 28 x 30 x 1000 - 4 x 29 = 839,884 bytes
size of the tuples: 288 x 1000 = 288,000 bytes
size of the keys: 28 x 1000 - 4 (the first key is zero) = 27,996 bytes
size of the dict with 1000 elements: 49,248 bytes
Sum of this all = 1,205,128 bytes
Now I pickle this dict to the hard disk as a binary file, and the file is actually only 91,207 bytes.
So my question is now: what is going on here?
Is the pickling "compressing" the integers down to roughly the bits they need? The number 1000, for example, can be represented with just 10 bits and would fit into 2 bytes (instead of 28).
Code that might be useful:
import os
import sys
import random
import pickle

max_key = 1000
zeros = 0
theoretical_size = 0

the_dict = {}
for i in range(max_key):
    the_tuple = tuple()
    ii = 0
    while ii < 30:
        number = random.randint(0, (max_key - 1))
        if number not in the_tuple:
            the_tuple += (number, )
            theoretical_size += sys.getsizeof(number)
            ii += 1
            if not number:
                zeros += 1
    theoretical_size += sys.getsizeof(the_tuple)
    theoretical_size += sys.getsizeof(i)
    the_dict[i] = the_tuple
theoretical_size += sys.getsizeof(the_dict)

outfile = '/path/to/outfile/outfilename'
with open(outfile, 'wb') as f:
    pickle.dump(the_dict, f)

print("           zeros:", zeros)
print("theoretical size:", theoretical_size)
print("      Calculated:", 28*30*max_key - 4*zeros + 288*max_key + 28*max_key - 4 + sys.getsizeof(the_dict))
print("         On disk:", os.path.getsize(outfile))

PYTHON How to generate 20 million non-repeating random numbers

I need to generate 20 million non-repeating random numbers, each 8 digits long, and save them in an array.
I tried multiprocessing and threading, but it stays slow.
My attempt with multiprocessing:
from numpy.random import default_rng
from multiprocessing import Process, Queue
import os, time
import numpy as np

rng = default_rng()
f = np.array([], dtype=np.int64)

def generate(q, start, stop):
    numbers = [rng.choice(range(start, stop), replace=False) for _ in range(1000)]
    q.put(numbers)

if __name__ == '__main__':
    timeInit = time.time()
    for x in range(20000):
        q = Queue()
        p = Process(target=generate, args=(q, 11111111, 99999999))
        p.start()
        f = np.append(f, q.get())
        p.join()
    print(f)
    timeStop = time.time()
    print('[TIME EXECUTED] ' + str(timeStop - timeInit) + ' secs')
This took less than 30 secs on my personal laptop, if it works for you:
import random
candidates = list(range(10**7, 10**8)) # all numbers from 10000000 to 99999999
random.shuffle(candidates)
result = candidates[:20 * 10**6]  # take the first 20 million
You haven't explained why you're doing all of that overhead. I simply took a random sample from the candidate numbers:
from random import sample
result = sample(
    list(range(10**7, 10**8)),
    2 * 10**7
)
51 seconds on my laptop, with interference from other jobs.
I just ran a more controlled test on both solutions. The one in this post took 48.5 seconds; the one from naicolas took 81.6 seconds, likely due to the extra list creation.
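A NumPy version of the same shuffle-and-slice idea (a sketch; it assumes roughly 1 GB of free RAM for the 90-million-element candidate array) may be faster still:
import numpy as np

rng = np.random.default_rng()
candidates = np.arange(10**7, 10**8, dtype=np.int64)  # all 8-digit numbers
rng.shuffle(candidates)                               # in-place permutation
result = candidates[:2 * 10**7]                       # first 20 million, unique by construction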
I hope I got your idea. The random numbers you are trying to generate are actually a bit tricky: basically, we are looking for a set of unique (non-repeating) but random numbers. In this case we cannot simply draw independent numbers from a uniform distribution, because there is no guarantee that they are unique.
There are 2 possible algorithms. The first one is to generate A LOT of candidate random numbers and remove the repeated ones. For instance,
import numpy as np
N = 20_000_000
L0 = 11_111_111 # legitimate int in Python
L1 = L0 * 9
not_enough_unique = True
while not_enough_unique:
    X = np.random.uniform(L0, L1, int(N * 2)).astype(int)
    X_unique = np.unique(X)  # remove repeated numbers
    not_enough_unique = len(X_unique) < N
random_numbers = X_unique[:N]
np.random.shuffle(random_numbers)
There is also another, more "physics" approach. We can start with equally spaced numbers and move each number a little bit. The result will not be as random as the first one, but it is much faster and pure fun.
import numpy as np
N = 20_000_000
L0 = 11_111_111 # legitimate int in Python
L1 = L0 * 9
lattice = np.linspace(L0, L1, N)  # equally spaced numbers
perturbation = np.random.normal(0, 0.4, N)  # move every number left/right a little bit
random_numbers = (lattice + perturbation).astype(int)
# check the minimum distance between successive numbers,
# i.e. whether all numbers are unique
min_dist = np.abs(np.diff(random_numbers)).min()
print(f"generating random numbers with minimum separation of {min_dist}")
print("(if it is >= 1 you are good)")
np.random.shuffle(random_numbers)
(Both algorithms generate the result within 10s on my laptop)

Binary mask with shift operation without cycle

We have some large binary number N (large meaning millions of digits). We also have a binary mask M, where a 1 means that we must remove the digit at that position in N and move all higher bits one position to the right.
Example:
N = 100011101110
M = 000010001000
Res 1000110110
Is it possible to solve this problem without a loop, using some set of logical or arithmetic operations? We can assume that we have access to bignum arithmetic in Python.
Feels like it should be something like this:
Res = N - (N xor M)
But it doesn't work
UPD: My current solution with a loop is the following:
def prepare_reduced_arrays(dict_of_N, mask):
    '''
    mask: string '0000011000'
    each element of dict_of_N - big python integer
    '''
    capacity = len(mask)
    answer = dict()
    for el in dict_of_N:
        answer[el] = 0
    new_capacity = 0
    for i in range(capacity - 1, -1, -1):
        if mask[i] == '1':
            continue
        cap2 = (1 << new_capacity)
        pos = (capacity - i - 1)
        for el in dict_of_N:
            current_bit = (dict_of_N[el] >> pos) & 1
            if current_bit:
                answer[el] |= cap2
        new_capacity += 1
    return answer, new_capacity
While this may not be possible without a loop in Python, it can be made extremely fast with numba and just-in-time compilation. I went on the assumption that your inputs could easily be represented as boolean arrays, which would be very simple to construct from a binary file using struct. The method I have implemented involves iterating over a few different objects; however, these iterations were chosen carefully so that they are compiler-optimized and never do the same work twice.
The first iteration uses np.where to locate the indices of all the bits to delete. This specific function (among many others) is optimized by the numba compiler. I then use this list of bit indices to build the slice indices for the slices of bits to keep. The final loop copies these slices to an empty output array.
import numpy as np
from numba import jit
from time import time

def binary_mask(num, mask):
    num_nbits = num.shape[0]  # how many bits are in our big num
    mask_bits = np.where(mask)[0]  # which bits are we deleting
    mask_n_bits = mask_bits.shape[0]  # how many bits are we deleting
    start = np.empty(mask_n_bits + 1, dtype=int)  # preallocate array for slice start indexes
    start[0] = 0  # first slice starts at 0
    start[1:] = mask_bits + 1  # subsequent slices start 1 after each True bit in mask
    end = np.empty(mask_n_bits + 1, dtype=int)  # preallocate array for slice end indexes
    end[:mask_n_bits] = mask_bits  # each slice ends on (but does not include) True bits in the mask
    end[mask_n_bits] = num_nbits + 1  # last slice goes all the way to the end
    out = np.empty(num_nbits - mask_n_bits, dtype=np.uint8)  # preallocate return array
    for i in range(mask_n_bits + 1):  # for each slice
        a = start[i]  # use local variables to reduce number of lookups
        b = end[i]
        c = a - i
        d = b - i
        out[c:d] = num[a:b]  # copy slices
    return out

jit_binary_mask = jit("b1[:](b1[:], b1[:])")(binary_mask)  # decorator without syntax sugar

###################### Benchmark ########################
bignum = np.random.randint(0, 2, 1000000, dtype=bool)  # 1 million random bits
bigmask = np.random.randint(0, 10, 1000000, dtype=np.uint8) == 9  # delete about 1 in 10 bits

t = time()
for _ in range(10):  # 10 cycles of the plain numpy implementation
    out = binary_mask(bignum, bigmask)
print(f"non-jit: {time()-t} seconds")

t = time()
out = jit_binary_mask(bignum, bigmask)  # run once ahead of time to compile
compile_and_run = time() - t

t = time()
for _ in range(10):  # 10 cycles of the compiled implementation
    out = jit_binary_mask(bignum, bigmask)
jit_runtime = time() - t
print(f"jit: {jit_runtime} seconds")
print(f"estimated compile_time: {compile_and_run - jit_runtime/10}")
In this example, I execute the benchmark on a boolean array of length 1,000,000 a total of 10 times for both the compiled and un-compiled version. On my laptop, the output is:
non-jit: 1.865583896636963 seconds
jit: 0.06370806694030762 seconds
estimated compile_time: 0.1652850866317749
As you can see with a simple algorithm like this, very significant performance gains can be seen from compilation. (in my case about 20-30x speedup)
As far as I know, this can be done without the use of loops if and only if M is a power of 2.
Let's take your example, and modify M so that it is a power of 2:
N = 0b100011101110 = 2286
M = 0b000000001000 = 8
Removing the fourth lowest bit from N and shifting the higher bits to the right would result in:
N = 0b10001110110 = 1142
We achieved this using the following algorithm:
Begin with N = 0b100011101110 = 2286
Iterate from the most-significant bit to the least-significant bit in M.
If the current bit in M is set to 1, then store the bits of N below that bit in some variable, x:
x = 0b110
Then, subtract every bit up to and including the current bit in M from N, so that we end up with the following:
N - (0b1000 + x) = N - (0b1000 + 0b110) = 0b100011101110 - 0b1110 = 0b100011100000
This step can also be achieved by and-ing those bits with 0, which may be more efficient.
Next, we shift the result once to the right:
0b100011100000 >> 1 = 0b10001110000
Finally, we add back x to the shifted result:
0b10001110000 + x = 0b10001110000 + 0b110 = 0b10001110110 = 1142
It may be possible to do this without loops somehow, but it is already reasonably efficient to simply iterate over M (from the most-significant bit to the least-significant bit) and perform this process on every set bit, since the loop runs only M.bit_length() times.
I wrote up the code for this algorithm as well, and I believe it's relatively efficient, but I don't have any big binary numbers to test it with:
def remove_bits(N, M):
    bit = 2 ** (M.bit_length() - 1)
    while bit != 0:
        if M & bit:
            ones = bit - 1
            # Store the bits below the current mask bit.
            temp = N & ones
            # Clear the lower bits up to and including the current bit.
            N &= ~(ones | bit)
            # Shift once to the right.
            N >>= 1
            # Restore the stored lower bits.
            N |= temp
        bit >>= 1
    return N

if __name__ == '__main__':
    N = 0b100011101110
    M = 0b000010001000
    print(bin(remove_bits(N, M)))
Using your example, this returns your result: 0b1000110110
I don't think there's any way to do this in a constant number of calls to the built-in bitwise operators. Python would have to provide something like PEXT for that to be possible.
For literally millions of digits, you may actually get best performance by working in terms of sequences of bits, sacrificing the space advantages of Python ints and the time advantages of bitwise operations in favor of more flexibility in the operations you can perform. I don't know where the break-even point would be:
import itertools
bits = bin(N)[2:]
maskbits = bin(M)[2:].zfill(len(bits))
bits = bits.zfill(len(maskbits))
chosenbits = itertools.compress(bits, map('0'.__eq__, maskbits))
result = int(''.join(chosenbits), 2)
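For instance, with the question's example values, this snippet produces the expected result (a quick self-contained check):
import itertools

N = 0b100011101110
M = 0b000010001000
bits = bin(N)[2:]
maskbits = bin(M)[2:].zfill(len(bits))
bits = bits.zfill(len(maskbits))
chosenbits = itertools.compress(bits, map('0'.__eq__, maskbits))
print(bin(int(''.join(chosenbits), 2)))  # 0b1000110110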

Standard deviation of combinations of dice

I am trying to find the standard deviation of a sequence of numbers extracted from combinations of 30 dice that sum up to 120. I am very new to Python; this code makes the console freeze because the number of combinations is enormous, and I am not sure how to turn it into a smaller, more efficient function. What I did is:
found all possible combinations of 30 dice;
filtered combinations that sum up to 120;
multiplied together the items of each tuple in the result list;
tried extracting standard deviation.
Here is the code:
import itertools
import numpy

dice = [1, 2, 3, 4, 5, 6]
subset = itertools.product(dice, repeat=30)

result = []
for x in subset:
    if sum(x) == 120:
        result.append(x)

my_result = numpy.product(result, axis=1).tolist()
std = numpy.std(my_result)
print(std)
Note that Var(X) = E(X^2) - E(X)^2, so you can solve this problem analytically with the following recurrences, where h[i][N] is the number of ordered sequences of i dice that sum to N, f[i][N] is the sum of the products of those sequences, and g[i][N] is the sum of their squared products:
f[i][N] = sum(k * f[i-1][N-k])    (1 <= k <= 6)
g[i][N] = sum(k^2 * g[i-1][N-k])  (1 <= k <= 6)
h[i][N] = sum(h[i-1][N-k])        (1 <= k <= 6)
f[1][k] = k      (1 <= k <= 6)
g[1][k] = k^2    (1 <= k <= 6)
h[1][k] = 1      (1 <= k <= 6)
Sample implementation:
import numpy as np
Nmax = 120
nmax = 30
min_value = 1
max_value = 6
f = np.zeros((nmax+1, Nmax+1), dtype ='object')
g = np.zeros((nmax+1, Nmax+1), dtype ='object') # the intermediate results will be really huge, to keep them accurate we have to utilize python big-int
h = np.zeros((nmax+1, Nmax+1), dtype ='object')
for i in range(min_value, max_value+1):
f[1][i] = i
g[1][i] = i**2
h[1][i] = 1
for i in range(2, nmax+1):
for N in range(1, Nmax+1):
f[i][N] = 0
g[i][N] = 0
h[i][N] = 0
for k in range(min_value, max_value+1):
f[i][N] += k*f[i-1][N-k]
g[i][N] += (k**2)*g[i-1][N-k]
h[i][N] += h[i-1][N-k]
result = np.sqrt(float(g[nmax][Nmax]) / h[nmax][Nmax] - (float(f[nmax][Nmax]) / h[nmax][Nmax]) ** 2)
# result = 32128174994365296.0
You ask for a result over an unfiltered set of 6^30 ≈ 2*10^23 combinations, which is impossible to handle as such.
There are two possibilities that can be combined:
1. Include more thinking to pre-treat the problem, e.g. on how to sample only those with sum 120.
2. Do a Monte Carlo simulation instead, i.e. don't sample all combinations, but only a random few thousand, to obtain a representative sample and determine the std sufficiently accurately.
Now, I only apply (2), giving the brute force code:
import random

N = 30      # number of dice
M = 100000  # number of samples
S = 120     # required sum

result = [[random.randint(1, 6) for _ in xrange(N)] for _ in xrange(M)]
result = [s for s in result if sum(s) == S]
Now, that result should be comparable to your result before using numpy.product ... that part I couldn't follow, though...
OK, if you are after the standard deviation of the product of the 30 dice, that is what your code does. Then I need 1,000,000 samples to get roughly reproducible values for the std (1 digit) - that takes my PC about 20 seconds, still considerably less than 1 million years :-D.
Is a number like 3.22*10^16 what you are looking for?
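For reference, the product/std step on such a Monte Carlo sample could look like this (a sketch; the products are accumulated as floats, since they reach about 4**30 and would overflow 32-bit integer dtypes):
import numpy as np
# result is the list of filtered samples from the snippet above
products = np.prod(np.asarray(result, dtype=np.float64), axis=1)
print(np.std(products))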
Edit after comments:
Well, sampling the frequency of numbers instead gives only 6 independent variables - even 4 actually, by substituting in the constraints (sum = 120, total number = 30). My current code looks like this:
import itertools
import numpy

def p2(b, s):
    return 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]

hits = range(31)
subset = itertools.product(hits, repeat=4)  # only the 3, 4, 5, 6 frequencies

product = []
permutations = []
for s in subset:
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])  # frequency of 2
    a = 30 - (b + sum(s))                         # frequency of 1
    if 0 <= b <= 30 and 0 <= a <= 30:
        product.append(p2(b, s))
        permutations.append(1)  # TODO: Replace 1 with possible permutations

print numpy.std(product)  # TODO: calculate std manually, considering permutations
This computes in about 1 second, but the confusing part is that I get as a result 1.28737023733e+17. Either my previous approaches or this one has a bug - or both.
Sorry - not that easy: the sampling is not equiprobable - that is the problem here. Each frequency vector corresponds to a different number of possible orderings, which gives its weight, and that has to be taken into account before taking the standard deviation. I have drafted that in the code above.
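For illustration, the weighting could be filled in along these lines (a sketch completing the two TODOs above, assuming the weight of each frequency vector is its multinomial coefficient; the helper name multinomial is mine):
import itertools
from math import factorial, sqrt

def multinomial(counts):
    out = factorial(sum(counts))
    for c in counts:
        out //= factorial(c)
    return out

total = 0    # number of ordered dice sequences with sum 120
sum_p = 0    # weighted sum of products
sum_p2 = 0   # weighted sum of squared products
for s in itertools.product(range(31), repeat=4):   # frequencies of faces 3, 4, 5, 6
    b = 90 - (2*s[0] + 3*s[1] + 4*s[2] + 5*s[3])   # frequency of face 2
    a = 30 - (b + sum(s))                          # frequency of face 1
    if 0 <= b <= 30 and 0 <= a <= 30:
        w = multinomial((a, b) + s)                # how many orderings share this frequency vector
        p = 2**b * 3**s[0] * 4**s[1] * 5**s[2] * 6**s[3]
        total += w
        sum_p += w * p
        sum_p2 += w * p * p
mean = sum_p / total
print(sqrt(sum_p2 / total - mean**2))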

Speeding up computations with numpy matrices

I have two matrices. Both are filled with zeros and ones. One is big (3000 x 2000 elements), and the other is smaller (20 x 20 elements). I am doing something like:
newMatrix = (size of bigMatrix), filled with zeros
l = (a constant)
for y in xrange(0, len(bigMatrix[0])):
    for x in xrange(0, len(bigMatrix)):
        for b in xrange(0, len(smallMatrix[0])):
            for a in xrange(0, len(smallMatrix)):
                if (bigMatrix[x, y] == smallMatrix[x + a - l, y + b - l]):
                    newMatrix[x, y] = 1
Which is being painfully slow. Am I doing anything wrong? Is there a smart way to make this work faster?
edit: Basically I am, for each (x,y) in the big matrix, checking all the pixels of both big matrix and the small matrix around (x,y) to see if they are 1. If they are 1, then I set that value on newMatrix. I am doing a sort of collision detection.
I can think of a couple of optimisations there -
As you are using 4 nested Python "for" statements, you are about as slow as you can be.
I can't figure out exactly what you are looking for -
but for one thing, if the density of "1"s in your big matrix is low, you can certainly use an "any" check on bigMatrix's slices to quickly see whether there are any set elements there at all -- you could get a several-fold speed increase that way:
step = len(smallMatrix[0])
for y in xrange(0, len(bigMatrix[0]), step):
    for x in xrange(0, len(bigMatrix), step):
        if not bigMatrix[x: x+step, y: y + step].any():  # ndarray.any() works on the whole 2-D slice
            continue
        (...)
At this point, if you still need to iterate over each element, you can use another pair of indices to walk each position inside the step - but I think you get the idea.
Apart from using vectorised operations like this "any" usage, you could certainly add some control-flow code to break off the (b, a) loop when the first matching pixel is found
(like inserting a "break" statement inside your last "if", and another if..break pair for the "b" loop).
I really can't figure out exactly what your intent is - so I can't give you more specific code.
Your example code makes no sense, but the description of your problem sounds like you are trying to do a 2D convolution of a small bit array over the big bit array. There's a convolve2d function in the scipy.signal package that does exactly this. Just do convolve2d(bigMatrix, smallMatrix) to get the result. Unfortunately the scipy implementation doesn't have a special case for boolean arrays, so the full convolution is rather slow. Here's a function that takes advantage of the fact that the arrays contain only ones and zeroes:
import numpy as np

def sparse_convolve_of_bools(a, b):
    if a.size < b.size:
        a, b = b, a
    offsets = list(zip(*np.nonzero(b)))
    n = len(offsets)
    dtype = np.byte if n < 128 else np.short if n < 32768 else int
    result = np.zeros(np.array(a.shape) + b.shape - (1, 1), dtype=dtype)
    for o in offsets:
        result[o[0]:o[0] + a.shape[0], o[1]:o[1] + a.shape[1]] += a
    return result
On my machine it runs in less than 9 seconds for a 3000x2000 by 20x20 convolution. The running time depends on the number of ones in the smaller array, being 20ms per each nonzero element.
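If the intent really is just "is there any overlapping 1 around each position", a sketch along these lines (my reading of the edit above; it assumes a same-sized output and counts any nonzero overlap as a hit) thresholds the convolution result:
import numpy as np
from scipy.signal import convolve2d

bigMatrix = np.random.randint(0, 2, (3000, 2000))  # stand-in data
smallMatrix = np.random.randint(0, 2, (20, 20))

overlap = convolve2d(bigMatrix, smallMatrix, mode='same')  # each cell counts overlapping ones
newMatrix = (overlap > 0).astype(np.uint8)                 # 1 wherever there is any overlap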
If your bits are really packed 8 per byte / 32 per int,
and you can reduce your smallMatrix to 20x16,
then try the following, here for a single row.
(newMatrix[x, y] = 1 when any bit of the 20x16 around x,y is 1 ??
What are you really looking for ?)
python -m timeit -s '
""" slide 16-bit mask across 32-bit pairs bits[j], bits[j+1] """
import numpy as np

bits = np.zeros( 2000 // 16, np.uint16 )  # 2000 bits
bits[::8] = 1
mask = 32+16
nhit = 16 * [0]

def hit16( bits, mask, nhit ):
    """
    slide 16-bit mask across 32-bit pairs bits[j], bits[j+1]
    bits: long np.array( uint16 )
    mask: 16 bits, int
    out: nhit[j] += 1 where pair & mask != 0
    """
    left = bits[0]
    for b in bits[1:]:
        pair = (left << 16) | b
        if pair:  # np idiom for non-0 words ?
            m = mask
            for j in range(16):
                if pair & m:
                    nhit[j] += 1
                    # hitposition = jb*16 + j
                m <<= 1
        left = b
    # if any(nhit): print "hit16:", nhit
' \
'
hit16( bits, mask, nhit )
'
# 15 msec per loop, bits[::4] = 1
# 11 msec per loop, bits[::8] = 1
# mac g4 ppc
