Numpy Unpacking the Byte and Reversing the Order - python

I have a binary file of about 10 MB, and what I want to do is read it bit by bit. In Python/NumPy, as far as I know, we cannot read data bit by bit, only byte by byte. So, in order to read the data bit by bit, I first read the file with np.fromfile and then unpack each byte into 8 bits with np.unpackbits. Here is how I did it:
fbyte = np.fromfile(binar_file, dtype='uint8')
fbit = np.unpackbits(fbyte)
What I get in fbit is a long bit stream, but within every 8 bits the order is reversed (MSB to LSB), e.g. 10010011 ..., whereas what I actually expect is LSB to MSB, like 11001001. Flipping every 8 bits with a for loop would solve the problem, but it takes time I would like to avoid, since I want to read thousands of files. So my question is: is there any way to unpack the bytes into bits directly in LSB-to-MSB order? Just as a comparison, in Matlab this is easy to do, since the fread function lets me specify the bit layout, e.g. 'ubit1' for reading bit by bit, and the result is in the order I expect (LSB to MSB). Any help/hints would be appreciated. Thanks.

You could simply reshape to 2D keeping 8 columns and then flip those, like so -
np.unpackbits(fbyte).reshape(-1,8)[:,::-1]
Sample run -
In [1176]: fbyte
Out[1176]: array([253, 35, 198, 182, 62], dtype=uint8)
In [1177]: np.unpackbits(fbyte).reshape(-1,8)[:,::-1]
Out[1177]:
array([[1, 0, 1, 1, 1, 1, 1, 1],
       [1, 1, 0, 0, 0, 1, 0, 0],
       [0, 1, 1, 0, 0, 0, 1, 1],
       [0, 1, 1, 0, 1, 1, 0, 1],
       [0, 1, 1, 1, 1, 1, 0, 0]], dtype=uint8)
Timings on a one-million-element array -
In [1173]: fbyte = np.random.randint(0,255,(1000000)).astype(np.uint8)
In [1174]: %timeit np.unpackbits(fbyte).reshape(-1,8)[:,::-1]
1000 loops, best of 3: 541 µs per loop
Seems crazy fast to me!
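If you need the bits back as one flat 1-D stream rather than a 2-D view, appending .ravel() keeps everything in a single vectorized expression (it makes a copy, because of the reversed stride):
fbit = np.unpackbits(fbyte).reshape(-1, 8)[:, ::-1].ravel()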

In NumPy 1.17 and newer, unpackbits accepts a bitorder parameter that will accomplish this -- just pass bitorder="little" to the np.unpackbits call.
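For example, with the same sample array as above, this reproduces the flipped bit order in a single call:
fbit = np.unpackbits(fbyte, bitorder="little")  # each byte unpacked LSB first
# same result as the reshape/flip approach:
assert np.array_equal(fbit, np.unpackbits(fbyte).reshape(-1, 8)[:, ::-1].ravel())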


Unit Testing Sudoku - should the tests try to be independent of the data structure for the Sudoku board?

I'm learning unit testing and TDD in python and have got the basics mastered with pytest.
I've started a learning project of writing a sudoku solver, and one thing I'm realising is that at the start I'm not sure how I want to store the data for the sudoku board. Many talks warn about testing behaviour rather than implementation. Would the board data structure count as implementation?
It's very hard to write a test that is independent of the board data structure.
For example, you could store the value of each cell as an integer (e.g. 1) or a string (e.g. '1'), and the board as a whole could be a dict (using cell labels like 'A1' or 'C5' as keys), a list (flat or 2D), or an 81-character string.
So I might start with
hardboard = [
    [0, 0, 0, 0, 0, 0, 4, 0, 0],
    [0, 0, 1, 0, 7, 0, 0, 9, 0],
    [5, 0, 0, 0, 3, 0, 0, 0, 6],
    [0, 8, 0, 2, 0, 0, 0, 0, 0],
    [7, 0, 0, 0, 0, 0, 9, 2, 0],
    [1, 2, 0, 6, 0, 5, 0, 0, 0],
    [0, 5, 0, 0, 0, 0, 0, 4, 0],
    [0, 7, 3, 9, 0, 8, 0, 0, 0],
    [6, 0, 0, 4, 5, 0, 2, 0, 0]
]
but then decide to use a dict
hardboard = {'A1': 0, 'A2': 0, 'A3': 0, ... 'I9': 0}
So how should I start? The only solution I can think of is to commit to one format, at least in terms of how I represent the board in tests, even if the actual code uses a different data structure later and has to translate. But in my TDD noobiness I'm unsure if this is a good idea?
I'm not a python developer, but still:
Don't think about the board in terms of implementation details (like, this is a dictionary, array, slice, whatever).
Instead, imagine that the board is an abstraction and it has some behaviors - which are, in plain words, the things you can do with a board:
place a digit at an index (i, j), which should be within the 1 to 9 inclusive boundaries
remove a digit by index (i, j)
So it's like you have a "contract": you create a board and can execute some operations on it. Of course it has some internal implementation, but it's hidden from everyone who uses it. It's your responsibility to provide a working implementation, and this is what you want to test: the implementation is considered working if it "reacts" to all the operations correctly (behavior = operation = function at the level of the actual programming implementation).
In this case your tests might, schematically, look like this:
Board Tests:
Test 1: Check that it's impossible to use a negative i index:
// setup
board = ...
// when:
board.put(-1, 3, 8) // i-index is wrong
// then:
expect exception to be thrown
Test 2: Check that it's impossible to use a negative j index:
// setup
board = ...
// when:
board.put(0, -1, 5) // j-index is wrong
// then:
expect exception to be thrown
Test 3: Check that it's impossible to put a negative number into a cell:
// setup
board = ...
// when:
board.put(0, 0, -7)
// then:
expect exception to be thrown
Test 4: Check that it's impossible to put a number into a cell that already contains another number:
// setup
board = ...
// when:
board.put(0, 0, 6)
board.put(0, 0, 5)
// then:
expect exception to be thrown
... and so on and so forth (also cover getting a cell's value with tests, along with any other behavior you add to the board), as in the pytest sketch below.
Note that all these tests check the behavior of your abstraction rather than its internal implementation.
At the implementation level you can create a class for the board: internal data fields hold the state, while the behavior is represented by the methods exposed to the users of the class.
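As an illustration, a minimal pytest sketch of two of the cases above might look like this (the Board class, its module, and a put(i, j, value) method raising ValueError are assumptions drawn from the pseudocode, not existing code):

import pytest

from sudoku import Board  # hypothetical module exposing the Board abstraction


def test_put_rejects_negative_row_index():
    board = Board()
    with pytest.raises(ValueError):
        board.put(-1, 3, 8)  # i-index is out of range


def test_put_rejects_occupied_cell():
    board = Board()
    board.put(0, 0, 6)
    with pytest.raises(ValueError):
        board.put(0, 0, 5)  # cell (0, 0) already holds a digit

Nothing in these tests depends on whether Board internally stores a list, a dict, or an 81-character string; they only exercise the put operation and its exception contract.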

Python: how to implement crossover of two integers?

I'm experimenting with a genetic search algorithm and after building the initial population at random, and then selecting the top two fittest entries, I need to 'mate' them (with some random mutation) to create 64
'children'. The crossover part, explained here:
https://towardsdatascience.com/introduction-to-genetic-algorithms-including-example-code-e396e98d8bf3
seems easy to follow, but I can't seem to figure out how to implement it in Python. How can I implement this crossover of two integers?
def crossover(a, b, index):
    return b[:index] + a[index:], a[:index] + b[index:]
Should be quite a bit faster than James' solution, since this one lets Python do all the work!
Here is a function called crossover that takes two parents and a crossover point. The parents should be lists of integers of the same length. The crossover point is the point before which genes get exchanged, as defined in the article that you linked to.
It returns the two offspring of the parents.
def crossover(a, b, crossover_point):
    a1 = a[:]
    b1 = b[:]
    for i in range(crossover_point):
        a1[i], b1[i] = b1[i], a1[i]
    return [a1, b1]
And here is some code that demonstrates its usage. It creates a population consisting of two lists of length 10, one with only zeros, and the other with only ones. It crosses them over at point 4, and adds the children to the population.
def test_crossover():
    a = [0]*10
    b = [1]*10
    population = [a,b]
    population += crossover(a,b,4)
    return population
print (test_crossover())
The output of the above is:
[
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
[1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
]
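If the two parents really are plain Python ints rather than lists, the same single-point crossover can also be done directly with bit masks. A rough sketch (the function name crossover_ints is mine, not from the linked article):

def crossover_ints(a, b, point):
    """Swap the lowest `point` bits of two integers (single-point crossover)."""
    mask = (1 << point) - 1             # e.g. point=4 -> 0b1111
    child1 = (a & ~mask) | (b & mask)   # high bits of a, low bits of b
    child2 = (b & ~mask) | (a & mask)   # high bits of b, low bits of a
    return child1, child2

# crossover_ints(0b11110000, 0b00001111, 4) -> (0b11111111, 0b00000000), i.e. (255, 0)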

Comparison between numpy.all and bitwise '&' for binary sequence

I need to perform a bitwise '&' between two large binary sequences of the same length and find the indexes where the 1's appear.
I used numpy to do it and here is my code:
>>> c = numpy.array([[0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1],[0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]]) #initialize 2d array
>>> c = c.all(axis=0)
>>> d = numpy.where(c)[0] #returns indices
I checked the timings for it.
>>> print("Time taken to perform 'numpy.all' : ",timeit.timeit(lambda :c.all(axis=0),number=10000))
>>> Time taken to perform 'numpy.all' : 0.01454929300234653
This operation was slower than what I expected.
Then, to compare, I performed a basic bitwise '&' operation:
>>> print("Time taken to perform bitwise & :",timeit.timeit('a = 0b0000000001111111111100000001111111111; b = 0b0000000001111111111100000001111111111; c = a&b',number=10000))
>>> Time taken to perform bitwise & : 0.0004252859980624635
This is much quicker than numpy.
I'm using numpy because it lets me find the indexes where the 1's are, but the numpy.all operation is much slower.
My original data will be an array/list just like in the first case. Will there be any repercussions if I convert this list into a binary number and then perform the computation as in the second case?
I don't think you can beat the speed of a&b (the actual computation is just a bunch of elementary cpu ops, I'm pretty sure the result of your timeit is >99% overhead). For example:
>>> from timeit import timeit
>>> import numpy as np
>>> import random
>>>
>>> k = 2**17-2
>>> a = random.randint(0, 2**k-1) + 2**k
>>> b = random.randint(0, 2**k-1) + 2**k
>>> timeit('a & b', globals=globals())
2.520026927930303
That's >100k bits and takes just ~2.5 us.
In any case the cost of & will be dwarfed by the cost of generating the list or array of indices.
numpy comes with significant overhead itself, so for a simple operation like yours one needs to check whether it is worth it.
So let's try a pure python solution first:
>>> c = a & b
>>> timeit("[x for x, y in enumerate(bin(c), -2) if y=='1']", globals=globals(), number=1000)
7.905808186973445
That's ~8 ms and as anticipated several orders of magnitude more than the & operation.
How about numpy?
Let's first move the index-gathering list comprehension over to numpy:
>>> timeit("np.where(np.fromstring(bin(c), np.uint8)[2:] - ord('0'))[0]", globals=globals(), number=1000)
1.0363857130287215
So in this case we get a ~8-fold speedup. This shrinks to ~4-fold if we require the result to be a list:
>>> timeit("np.where(np.fromstring(bin(c), np.uint8)[2:] - ord('0'))[0].tolist()", globals=globals(), number=1000)
1.9008758360287175
We can also let numpy do the binary conversion, which gives another small speedup:
>>> timeit("np.where(np.unpackbits(np.frombuffer(c.to_bytes(k//8+1, 'big'), np.uint8))[1:])[0]", globals=globals(), number=1000)
0.869781385990791
In summary:
numpy is not always faster, better leave the & to pure Python
locating nonzero bits seems fast enough in numpy to offset the cost of conversion between list and array
Please note that all this comes with the caveat that my pure Python code is not necessarily optimal. For example using a lookup table we can get a bit faster:
>>> lookup = [(np.where(i)[0]-1).tolist() for i in np.ndindex(*8*[2])]
>>> timeit("[(x<<3) + z for x, y in enumerate(c.to_bytes(k//8+1, 'big')) for z in lookup[y]]", globals=globals(), number=1000)
4.687953414046206
>>> c = numpy.random.randint(2, size=(2, 40)) #initialize
>>> c
array([[1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1,
        0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0],
       [1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1,
        0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1]])
Accessing this gives you two slow-downs:
You have to access the two rows, whereas your bit-wise test has the constants readily available in registers.
You are performing a series of 40 and operations, which may include casting from a full integer to a Boolean.
You severely handicapped the all test; the result is not a surprise (any more).
The factor you observe is a direct consequence of the fact that c = numpy.array([[0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1],[0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1]]) is an array of int, and an int is coded on 32 bits.
Therefore, when you call c.all(), you are doing an operation on 37*32 = 1184 bits.
However, a = 0b0000000001111111111100000001111111111 is composed of 37 bits, so when you do a&b the operation is on 37 bits only.
Therefore you are doing something roughly 32 times more costly with the numpy array.
Let's test that
import timeit
import numpy as np
print("Time taken to perform bitwise & :",timeit.timeit('a=0b0000000001111111111100000001111111111; b = 0b0000000001111111111100000001111111111; c = a&b',number=320000))
a = 0b0000000001111111111100000001111111111
b = 0b0000000001111111111100000001111111111
c=np.array([a,b])
print("Time taken to perform 'numpy.all' : ",timeit.timeit(lambda :c.all(axis=0),number=10000))
The & operation I run 320000 times and the all() operation 10000 times, to compensate for that factor of 32.
Time taken to perform bitwise & : 0.01527938833025152
Time taken to perform 'numpy.all' : 0.01583387375572265
It's the same thing!
Now back to your initial problem: you want to know the indices where the bits are 1 in a large binary number.
Maybe you could try the bitarray module:
import bitarray

a = bitarray.bitarray('0000000001111111111100000001111111111')
b = bitarray.bitarray('0000000001111111111100000001111111111')
i = 0
data = list()
for c in a & b:
    if c:
        data.append(i)
    i = i + 1
print(data)
outputs
[9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]
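The index-collecting loop can also be written as a list comprehension with enumerate, which gives the same result a little more compactly:
data = [i for i, bit in enumerate(a & b) if bit]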

How to convert a keycode to a char in python?

I'm writing a program on Linux which reads and distinguishes input from two USB devices (two barcode readers) that simulate a keyboard.
I can already read the input from USB, but this happens before the OS translates the keycode into a character.
For example, when I read 'a' I got 24, 'b' 25, etc.
Edit: actually the values I get are 4 for 'a', 5 for 'b', etc., as in the output examples below.
Is there any way to convert that code into a char without a manual mapping?
Some output examples:
KEYPRESS = a output = array('B', [0, 0, 4, 0, 0, 0, 0, 0])
KEYPRESS = SHIFT + a output = array('B', [2, 0, 4, 0, 0, 0, 0, 0])
KEYPRESS = 1 output = array('B', [0, 0, 30, 0, 0, 0, 0, 0])
KEYPRESS = ENTER output = array('B', [0, 0, 81, 0, 0, 0, 0, 0])
thx!
Use the chr function. Python uses a different character mapping (ASCII) from whatever you're receiving though, so you will have to add 73 to your key values to fix the offset.
>>> chr(24 + 73)
'a'
>>> chr(25 + 73)
'b'
I can already read the input from USB, but this happens before the OS translates the keycode into a character.
The problem seems to me to be in your interface or the driver program.
In ASCII, 'a' is supposed to have ordinal value 97, whose binary representation is 0b1100001, whereas what you are receiving is 24, whose binary representation is 0b11000; similarly for 'b' you were supposed to receive 0b1100010 (98), but instead you received 25, which is 0b11001. Check your hardware to determine whether the 1st and the 3rd bits are being dropped from the input.
What you are receiving is a USB scan code. I do not think there is a third-party Python library that does the conversion for you. I would suggest you refer to a USB scan code table and, from it, create a dictionary mapping USB scan codes to the corresponding ASCII characters.
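A minimal sketch of such a table, assuming the 8-byte arrays in the question are standard HID boot-keyboard reports (byte 0 = modifiers, byte 2 = first keycode); the dictionary and the helper name decode_report are mine:

import string

# HID usage IDs 4..29 are 'a'..'z'; 30..39 are '1'..'9' and '0'
HID_TO_CHAR = {i + 4: ch for i, ch in enumerate(string.ascii_lowercase)}
HID_TO_CHAR.update({i + 30: ch for i, ch in enumerate('1234567890')})


def decode_report(report):
    """Translate one 8-byte HID keyboard report into a character ('' if unmapped)."""
    modifiers, keycode = report[0], report[2]
    ch = HID_TO_CHAR.get(keycode, '')
    if ch and modifiers & 0x22:  # 0x02 = left Shift, 0x20 = right Shift
        ch = ch.upper()
    return ch


# decode_report([0, 0, 4, 0, 0, 0, 0, 0])  -> 'a'
# decode_report([2, 0, 4, 0, 0, 0, 0, 0])  -> 'A'
# decode_report([0, 0, 30, 0, 0, 0, 0, 0]) -> '1'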

Assigning multiple array indices at once in Python/Numpy

I'm looking to quickly (hopefully without a for loop) generate a Numpy array of the form:
array([a,a,a,a,0,0,0,0,0,b,b,b,0,0,0, c,c,0,0....])
Where a, b, c and other values are repeated at different points for different ranges. I'm really thinking of something like this:
import numpy as np
a = np.zeros(100)
a[0:3,9:11,15:16] = np.array([a,b,c])
Which obviously doesn't work. Any suggestions?
Edit (jterrace answered the original question):
The data is coming in the form of an N*M Numpy array. Each row is mostly zeros, occasionally interspersed by sequences of non-zero numbers. I want to replace all elements of each such sequence with the last value of the sequence. I'll take any fast method to do this! Using where and diff a few times, we can get the start and stop indices of each run.
raw_data = array([.....][....])
starts = array([0,0,0,1,1,1,1...][3, 9, 32, 7, 22, 45, 57,....])
stops = array([0,0,0,1,1,1,1...][5, 12, 50, 10, 30, 51, 65,....])
last_values = raw_data[stops]
length_to_repeat = stops[1]-starts[1]
Note that starts[0] and stops[0] carry the same information (which row the run occurs on). At this point, since the only route I know of is what jterrace suggests, we'll need to go through some contortions to get similar start/stop positions for the zeros, then interleave the zero start/stops with the value start/stops, and interleave the number 0 with the last_values array. Then we loop over each row, doing something like:
for i in range(N):
    values_in_this_row = where(starts[0]==i)[0]
    output[i] = numpy.repeat(last_values[values_in_this_row], length_to_repeat[values_in_this_row])
Does that make sense, or should I explain some more?
If you have the values and repeat counts fully specified, you can do it this way:
>>> import numpy
>>> values = numpy.array([1,0,2,0,3,0])
>>> counts = numpy.array([4,5,3,3,2,2])
>>> numpy.repeat(values, counts)
array([1, 1, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 3, 3, 0, 0])
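If the values and counts aren't known up front, they can be derived from a raw row with np.diff. A rough sketch of the "replace each non-zero run with its last value" operation described in the edit (the helper name fill_runs_with_last is mine):

import numpy as np

def fill_runs_with_last(row):
    """Replace every run of non-zero values in a 1-D array with that run's last value."""
    row = np.asarray(row)
    nz = (row != 0).astype(np.int8)
    # indices where the zero/non-zero pattern changes, plus both ends
    change = np.where(np.diff(nz) != 0)[0] + 1
    bounds = np.r_[0, change, len(row)]
    counts = np.diff(bounds)          # length of each run (zero runs included)
    last = row[bounds[1:] - 1]        # last element of each run (0 for zero runs)
    return np.repeat(last, counts)

# fill_runs_with_last([0, 0, 5, 7, 0, 3, 9, 9, 0]) -> array([0, 0, 7, 7, 0, 9, 9, 9, 0])

For the N*M case this can be applied row by row, e.g. with np.apply_along_axis(fill_runs_with_last, 1, raw_data).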
you can use numpy.r_:
>>> np.r_[[a]*4,[b]*3,[c]*2]
array([1, 1, 1, 1, 2, 2, 2, 3, 3])
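For reference, interleaving the zero runs the same way reproduces the exact pattern from the question (here assuming a, b, c = 1, 2, 3):
>>> a, b, c = 1, 2, 3
>>> np.r_[[a]*4, np.zeros(5, int), [b]*3, np.zeros(3, int), [c]*2]
array([1, 1, 1, 1, 0, 0, 0, 0, 0, 2, 2, 2, 0, 0, 0, 3, 3])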
