I'm trying to process an RGBA buffer (a list of chars) and run "unpremultiply" on each pixel. The algorithm is color_out = color * 255 / alpha.
This is what I came up with:
def rgba_unpremultiply(data):
    for i in range(0, len(data), 4):
        a = ord(data[i+3])
        if a != 0:
            data[i] = chr(255*ord(data[i])/a)
            data[i+1] = chr(255*ord(data[i+1])/a)
            data[i+2] = chr(255*ord(data[i+2])/a)
    return data
It works, but it's a major performance bottleneck.
I'm wondering: besides writing a C module, what are my options for optimizing this particular function?
This is exactly the kind of code NumPy is great for.
import numpy

def rgba_unpremultiply(data):
    a = numpy.fromstring(data, 'B')  # Treat the string as an array of bytes
    a = a.astype('I')  # Cast the bytes to uints, since temporary values need to be larger than a byte
    alpha = a[3::4]  # Every 4th element, starting from index 3
    alpha = numpy.where(alpha == 0, 255, alpha)  # Don't modify colors where alpha is 0
    a[0::4] = a[0::4] * 255 // alpha  # Operates on entire slices of the array instead of looping over each element
    a[1::4] = a[1::4] * 255 // alpha
    a[2::4] = a[2::4] * 255 // alpha
    return a.astype('B').tostring()  # Cast back to bytes
How big is data? Assuming this is Python 2.x, try using xrange instead of range so that you don't have to allocate a large list just to iterate over it.
You could convert all the data to integers up front so you're not constantly converting to and from characters (see the bytearray sketch after these suggestions).
Look into using numpy to vectorize this: Link. I suspect that simply storing the data as integers and using a numpy array will greatly improve the performance.
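As a rough illustration of the "work on integers" tip, here is a minimal sketch assuming Python 2 (as in the question) and a premultiplied buffer; bytearray exposes each channel as an int, so no ord()/chr() calls are needed:
def rgba_unpremultiply(data):
    buf = bytearray(data)                 # mutable view of the buffer as integers
    for i in xrange(0, len(buf), 4):
        a = buf[i+3]
        if a != 0:
            buf[i] = buf[i] * 255 // a
            buf[i+1] = buf[i+1] * 255 // a
            buf[i+2] = buf[i+2] * 255 // a
    return bytes(buf)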
And another relatively simple thing you could do is write a little Cython:
http://wiki.cython.org/examples/mandelbrot
Basically Cython will compile your above function into C code with just a few lines of type hints. It greatly reduces the barrier to writing a C extension.
I don't have a concrete answer, but some useful pointers might be:
Python's array module
numpy
OpenCV if you have actual image data
There are some minor things you can do, but I don't think you can improve it a lot.
Anyway, here are some hints:
def rgba_unpremultiply(data):
    # xrange() is more performant than range(): it does not precompute the whole list
    for i in xrange(0, len(data), 4):
        a = ord(data[i+3])
        if a != 0:
            # Not sure about this, but maybe (c << 8) - c is faster than c*255,
            # so maybe you can rearrange things to do that
            # Check whether it actually improves performance
            data[i] = chr(((ord(data[i]) << 8) - ord(data[i]))/a)
            data[i+1] = chr(255*ord(data[i+1])/a)
            data[i+2] = chr(255*ord(data[i+2])/a)
    return data
I've just run a quick benchmark of << versus *, and there doesn't seem to be a noticeable difference, but I guess you can do a better evaluation on your project.
Anyway, a C module may be a good thing, even if the problem doesn't really seem to be language-related.
Related
I'm having performance trouble getting data from a byte array into my internal data structure. The data contains several nested arrays and is extracted as in the attached code. In C it takes something like one second reading from a stream, but in Python it takes almost one minute. I guess indexing and calling int.from_bytes was not the best idea.
Has anybody a proposal to improve the performance?
...
ycnt = int.from_bytes(bytedat[idx:idx + 4], 'little')
idx += 4
while ycnt > 0:
    ky = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    dv = DataObject()
    xvec.update({ky: dv})
    dv.x = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    dv.y = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    cntv = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    while cntv > 0:
        dv.data_values.append(int.from_bytes(bytedat[idx:idx + 4], 'little', signed=True))
        idx += 4
        cntv -= 1
    dv.score = struct.unpack('d', bytedat[idx:idx + 8])[0]
    idx += 8
    ycnt -= 1
...
First, a factor of 60 between Python and C is normal for low-level code like this. This is not where Python shines, because it doesn't get compiled down to machine code.
Micro-Optimizations
The most obvious one is to reduce your integer math by using struct.unpack() properly. See the format string docs. Something like this:
ky, x, y, cntv = struct.unpack('<4i', bytedat[idx:idx + 4*4])
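A hedged refinement of the same idea: precompiling the format with struct.Struct and using unpack_from avoids re-parsing the format string and slicing bytedat on every iteration. The layout of four little-endian int32s and the dummy buffer are assumptions for illustration:
import struct

header = struct.Struct('<4i')                 # ky, x, y, cntv: four little-endian int32s
bytedat = struct.pack('<4i', 7, 1, 2, 3)      # dummy stand-in for the real buffer
idx = 0

ky, x, y, cntv = header.unpack_from(bytedat, idx)   # no intermediate slice object
idx += header.size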
The second one is to load your int arrays (if they are large) "in batch" instead of the (interpreted!) while cntv > 0 loop. I would use a numpy array:
numpy.frombuffer(bytedat[idx:idx + 4*cntv], dtype='int32')
Why not a list? A Python list contains (generic) Python objects. It requires extra memory and pointer indirection for each item. Libraries cannot use optimized C code (for example, to calculate the sum) because each item first has to be dereferenced and then checked for its type.
A numpy object, on the other hand, is basically a wrapper that manages the memory of a C array. Loading it will probably boil down to a memcpy(), or it may even just reference the bytes memory you passed.
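To make this concrete for the question's inner loop, a hedged sketch of replacing the interpreted while cntv > 0 loop with one batched read (the first three lines are dummy setup so the snippet runs on its own; in the real code bytedat, idx and cntv already exist):
import struct
import numpy as np

bytedat = struct.pack('<5i', 10, -20, 30, -40, 50)   # stand-in for the real buffer
idx = 0
cntv = 5

vals = np.frombuffer(bytedat, dtype='<i4', count=cntv, offset=idx)
data_values = vals.tolist()   # or keep the numpy array instead of a plain list
idx += 4 * cntv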
And thirdly, instead of xvec.update({ky: dv}) you can probably write xvec[ky] = dv. This avoids the creation of a temporary dict object.
Compiling your Python-Code
There are ways to compile Python (partially) down to machine code (PyPy, Numba, Cython). It's a bit involved, but your original byte-indexing code would then run at C speed.
However, you are filling a Python list and a dict in the inner loop. This is never going to get "C"-like fast because it will have to deal with Python objects and reference counting, even when it gets compiled down to C.
Different file format
The easiest way is to use a data format handled by a fast specialized library (like numpy, hd5, pillow, maybe even pandas).
The pickle module may also help, but only if you can control the writing and everything is trusted, and you mainly care about loading speed.
I do something similar, but big-endian.
I find that
(byte1 << 8) | byte2
is faster than int.from_bytes() and struct.unpack().
I also find pypy3 to be at least 4x faster than python3
for this sort of stuff.
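For what it's worth, a quick hedged timeit sketch for comparing the three approaches on a 2-byte big-endian value (absolute numbers will vary by machine and interpreter):
import timeit

setup = "import struct; data = bytes([0x12, 0x34])"
print(timeit.timeit("(data[0] << 8) | data[1]", setup=setup))
print(timeit.timeit("int.from_bytes(data, 'big')", setup=setup))
print(timeit.timeit("struct.unpack('>H', data)[0]", setup=setup))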
I saw a video about the speed of loops in Python, where it was explained that doing sum(range(N)) is much faster than manually looping through range and adding the variables together, since the former runs in C thanks to built-in functions, while in the latter the summation is done in (slow) Python. I was curious what happens when adding numpy to the mix. As I expected, np.sum(np.arange(N)) is the fastest, but sum(np.arange(N)) and np.sum(range(N)) are even slower than the naive for loop.
Why is this?
Here's the script I used to test, with some comments about the supposed cause of the slowdown where I know it (taken mostly from the video), and the results I got on my machine (Python 3.10.0, numpy 1.21.2):
updated script:
import numpy as np
from timeit import timeit
N = 10_000_000
repetition = 10
def sum0(N = N):
    s = 0
    i = 0
    while i < N: # condition is checked in python
        s += i
        i += 1 # both additions are done in python
    return s

def sum1(N = N):
    s = 0
    for i in range(N): # increment in C
        s += i # addition in python
    return s

def sum2(N = N):
    return sum(range(N)) # everything in C

def sum3(N = N):
    return sum(list(range(N)))

def sum4(N = N):
    return np.sum(range(N)) # very slow np.array conversion

def sum5(N = N):
    # much faster np.array conversion
    return np.sum(np.fromiter(range(N),dtype = int))

def sum5v2_(N = N):
    # much faster np.array conversion
    return np.sum(np.fromiter(range(N),dtype = np.int_))

def sum6(N = N):
    # possibly slow conversion to Py_long from np.int
    return sum(np.arange(N))

def sum7(N = N):
    # list returns a list of np.int-s
    return sum(list(np.arange(N)))

def sum7v2(N = N):
    # tolist conversion to python int seems faster than the implicit conversion
    # in sum(list()) (tolist returns a list of python int-s)
    return sum(np.arange(N).tolist())

def sum8(N = N):
    return np.sum(np.arange(N)) # everything in numpy (fortran libblas?)

def sum9(N = N):
    return np.arange(N).sum() # remove dispatch overhead

def array_basic(N = N):
    return np.array(range(N))

def array_dtype(N = N):
    return np.array(range(N),dtype = np.int_)

def array_iter(N = N):
    # np.sum's source code mentions to use fromiter to convert from generators
    return np.fromiter(range(N),dtype = np.int_)
print(f"while loop: {timeit(sum0, number = repetition)}")
print(f"for loop: {timeit(sum1, number = repetition)}")
print(f"sum_range: {timeit(sum2, number = repetition)}")
print(f"sum_rangelist: {timeit(sum3, number = repetition)}")
print(f"npsum_range: {timeit(sum4, number = repetition)}")
print(f"npsum_iterrange: {timeit(sum5, number = repetition)}")
print(f"npsum_iterrangev2: {timeit(sum5, number = repetition)}")
print(f"sum_arange: {timeit(sum6, number = repetition)}")
print(f"sum_list_arange: {timeit(sum7, number = repetition)}")
print(f"sum_arange_tolist: {timeit(sum7v2, number = repetition)}")
print(f"npsum_arange: {timeit(sum8, number = repetition)}")
print(f"nparangenpsum: {timeit(sum9, number = repetition)}")
print(f"array_basic: {timeit(array_basic, number = repetition)}")
print(f"array_dtype: {timeit(array_dtype, number = repetition)}")
print(f"array_iter: {timeit(array_iter, number = repetition)}")
print(f"npsumarangeREP: {timeit(lambda : sum8(N/1000), number = 100000*repetition)}")
print(f"npsumarangeREP: {timeit(lambda : sum9(N/1000), number = 100000*repetition)}")
# Example output:
#
# while loop: 11.493371912998555
# for loop: 7.385945574002108
# sum_range: 2.4605720699983067
# sum_rangelist: 4.509678105998319
# npsum_range: 11.85120212900074
# npsum_iterrange: 4.464334709002287
# npsum_iterrangev2: 4.498494338993623
# sum_arange: 9.537815956995473
# sum_list_arange: 13.290120724996086
# sum_arange_tolist: 5.231948580003518
# npsum_arange: 0.241889145996538
# nparangenpsum: 0.21876695199898677
# array_basic: 11.736577274998126
# array_dtype: 8.71628468400013
# array_iter: 4.303306431000237
# npsumarangeREP: 21.240833958996518
# npsumarangeREP: 16.690092379001726
np.sum(range(N)) is slow mostly because the current Numpy implementation does not use enough information about the exact type/content of the values provided by the generator range(N). The heart of the problem is Python's dynamic typing and arbitrary-precision integers, although Numpy could optimize this specific case.
First of all, range(N) returns a dynamically-typed Python object which behaves like a (special kind of) generator. The objects it yields are also dynamically-typed: in practice, each one is a pure-Python integer.
The thing is that Numpy is written in the statically-typed language C, so it cannot work efficiently on dynamically-typed pure-Python objects. Numpy's strategy is to convert such objects into C types when it can. One big problem in this case is that the integers provided by the generator can theoretically be huge: Numpy does not know whether the values would overflow an np.int32 or even an np.int64. Thus, Numpy first detects a suitable type to use and then computes the result using that type.
This translation process can be quite expensive and may appear unnecessary here, since all the values provided by range(10_000_000) fit in an np.int32. However, range(5_000_000_000) returns the same object type with pure-Python integers that overflow np.int32, and Numpy has to detect this case automatically in order not to return wrong results. Moreover, even when the input type is correctly identified (np.int32 on my machine), it does not mean the output will be correct, because overflows can occur during the computation of the sum. This is sadly the case on my machine.
Numpy developers decided to deprecate this usage and state in the documentation that np.fromiter should be used instead. np.fromiter has a required dtype parameter that lets the user define the right type to use.
One way to check this behaviour in practice is simply to create a temporary list:
tmp = list(range(10_000_000))
# Numpy implicitly converts the list to a Numpy array, but
# still automatically detects the input type to use
np.sum(tmp)
A faster implementation is the following:
tmp = list(range(10_000_000))
# The array is explicitly converted using a well-defined type and
# thus there is no need to perform an automatic detection
# (note that the result is still wrong since it does not fit in a np.int32)
tmp2 = np.array(tmp, dtype=np.int32)
result = np.sum(tmp2)
The first case takes 476 ms on my machine while the second takes 289 ms. Note that np.sum itself takes only 4 ms. Thus, a large part of the time is spent converting pure-Python integer objects to internal int32 types (more specifically, managing pure-Python integers). list(range(10_000_000)) is expensive too, as it takes 205 ms. This is again due to the overhead of pure-Python integers (i.e. allocations, deallocations, reference counting, incrementing variable-sized integers, memory indirections and branches due to dynamic typing) as well as the overhead of the generator.
sum(np.arange(N)) is slow because sum is a generic Python-level function working on a Numpy-defined object. The CPython interpreter needs to call Numpy functions to perform basic additions. Moreover, Numpy-defined integer objects are still Python objects and so are subject to reference counting, allocation, deallocation, etc. Not to mention that Numpy and CPython add many checks in the functions that ultimately just add two native numbers together. A Numpy-aware just-in-time compiler such as Numba can solve this issue. Indeed, Numba takes 23 ms on my machine to compute the sum of np.arange(10_000_000) (with the code still written in Python) while the CPython interpreter takes 556 ms.
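As a minimal sketch of the Numba approach (assuming numba is installed; the first call includes JIT compilation time, later calls run at native speed):
import numpy as np
from numba import njit

@njit
def numba_sum(arr):
    s = 0
    for v in arr:   # this loop is compiled to machine code
        s += v
    return s

print(numba_sum(np.arange(10_000_000)))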
Let's see if I can summarize the results.
sum can work with any iterable, repeatedly asking for the next value and adding it. range is a lazy sequence that is happy to supply the next value on demand:
# sum_range: 1.4830789409988938
Making a list from a range takes time:
# sum_rangelist: 3.6745876889999636
Summing a pregenerated list is actually faster than summing the range:
%%timeit x = list(range(N))
...: sum(x)
np.sum is designed to sum arrays. It's a wrapper to np.add.reduce.
np.sum has a deprecation warning for np.sum(generator), recommending the use of fromiter or Python sum:
# npsum_range: 16.216972655000063
fromiter is the best way of making an array from a generator. Using np.array on range is legacy code and may go away in the future. I think it's the only generator that np.array will accept.
np.array is a general purpose function that can handle many cases, including nested arrays, and conversion to various dtypes. As such it has to process the whole input argument, deducing both shape and dtype.
# npsum_fromiterrange: 3.47655400199983
Iteration on a numpy array is slower than a list, since it has to "unbox" each element.
# sum_arange: 16.656015603000924
Similarly making a list from an array is slow; same sort of python level iteration.
# sum_list_arange: 19.500842117000502
arr.tolist() is relatively fast, creating a pure python list in compiled code. So speed is similar to making a list from range.
# sum_arange_tolist: 4.004777374000696
np.sum of an array is pure numpy and quite fast. np.sum(x), where x = np.arange(N) is built outside the timed code, is even faster (by about 4x):
# npsum_arange: 0.2332638230000157
np.sum from range or list is dominated by the cost of creating the array first:
# array_basic: 16.1631146109994
# array_dtype: 16.550737804000164
# array_iter: 3.9803170430004684
From the CPython source code for sum, sum initially seems to attempt a fast path that assumes all inputs are the same type. If that fails, it will just iterate:
/* Fast addition by keeping temporary sums in C instead of new Python objects.
Assumes all inputs are the same type. If the assumption fails, default
to the more general routine.
*/
I'm not entirely certain what is happening under the hood, but it is likely the repeated creation/conversion of C types to Python objects that is causing these slow-downs. It's worth noting that both sum and range are implemented in C.
This next bit is not really an answer to the question, but I wondered if we could speed up sum for python ranges as range is quite a smart object.
To do this I've used functools.singledispatch to override the built-in sum function specifically for the range type; then implemented a small function to calculate the sum of an arithmetic progression.
from functools import singledispatch

def sum_range(range_, /, start=0):
    """Overloaded `sum` for range, compute arithmetic sum"""
    n = len(range_)
    if not n:
        return start
    return int(start + (n * (range_[0] + range_[-1]) / 2))

sum = singledispatch(sum)
sum.register(range, sum_range)
def test():
    """
    >>> sum(range(0, 100))
    4950
    >>> sum(range(0, 10, 2))
    20
    >>> sum(range(0, 9, 2))
    20
    >>> sum(range(0, -10, -1))
    -45
    >>> sum(range(-10, 10))
    -10
    >>> sum(range(-1, -100, -2))
    -2500
    >>> sum(range(0, 10, 100))
    0
    >>> sum(range(0, 0))
    0
    >>> sum(range(0, 100), 50)
    5000
    >>> sum(range(0, 0), 10)
    10
    """

if __name__ == "__main__":
    import doctest
    doctest.testmod()
I'm not sure if this is complete, but it's definitely faster than looping.
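For a rough check, a hedged timing snippet; it assumes it runs in the same session as the registration above, so that sum is the singledispatch-wrapped version:
from timeit import timeit

print(timeit(lambda: sum(range(10_000_000)), number=100))        # constant time via the arithmetic formula
print(timeit(lambda: sum(list(range(10_000_000))), number=10))   # a list still goes through the generic sum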
I want to process a big list of uint numbers (test1), and I do that in chunks of "length". I need them as signed ints, and then I need the absolute length of each pair of even- and odd-indexed values in this list.
But I want to get rid of two problems:
it uses a lot of RAM
it takes ages!
So how could I make this faster? Any trick? I could also use numpy, no problem in doing so.
Thanks in advance!
test2 = -127 + test1[i:i+length*2048000*2 + 2048000*2*1]
test3 = (test2[::2]**2 + test2[1::2]**2)**0.5
An efficient way is to try to use Numpy functions, e.g.:
import numpy as np

n = 10
ff = np.random.randint(0, 255, n)   # generate some data
ff2 = ff.reshape(n // 2, 2)         # new view on ff (only makes a copy if needed)
l_ff = np.linalg.norm(ff2, axis=1)  # calculate the vector length of each row
Note that when modifying an entry in ff2 then ff will change as well and vice versa.
Internally, Numpy stores data as contiguous memory blocks. So there are further methods besides np.reshape() to exploit that structure. For efficient conversion of data types, you can try:
dd_s = np.arange(-5, 10, dtype=np.int8)
dd_u = dd_s.astype(np.uint8) # conversion from signed to unsigned
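Putting both pieces together for the original problem, a hedged sketch (test1 is simulated below; -127 is the offset from the question, and widening to int16 before subtracting keeps the values from wrapping around):
import numpy as np

test1 = np.random.randint(0, 256, 2_000_000, dtype=np.uint8)   # stand-in for the real uint data
test2 = test1.astype(np.int16) - 127                           # signed values, no overflow
test3 = np.linalg.norm(test2.reshape(-1, 2), axis=1)           # length of each (even, odd) pair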
The context: my Python code pass arrays of 2D vertices to OpenGL.
I tested 2 approaches, one with ctypes, the other with struct, the latter being more than twice as fast.
from random import random
points = [(random(), random()) for _ in xrange(1000)]
from ctypes import c_float
def array_ctypes(points):
    n = len(points)
    return n, (c_float*(2*n))(*[u for point in points for u in point])

from struct import pack
def array_struct(points):
    n = len(points)
    return n, pack("f"*2*n, *[u for point in points for u in point])
Any other alternative?
Any hint on how to accelerate such code (and yes, this is one bottleneck of my code)?
You can pass numpy arrays to PyOpenGL without incurring any overhead. (The data attribute of the numpy array is a buffer that points to the underlying C data structure that contains the same information as the array you're building)
import numpy as np
def array_numpy(points):
    n = len(points)
    return n, np.array(points, dtype=np.float32)
On my computer, this is about 40% faster than the struct-based approach.
You could try Cython. For me, this gives:
function        usec per loop
                Python   Cython
array_ctypes      1370     1220
array_struct       384      249
array_numpy        336      339
So Numpy only gives 15% benefit on my hardware (old laptop running WindowsXP), whereas Cython gives about 35% (without any extra dependency in your distributed code).
If you can loosen your requirement that each point is a tuple of floats, and simply make 'points' a flattened list of floats:
def array_struct_flat(points):
    n = len(points)
    return pack(
        "f"*n,
        *[
            coord
            for coord in points
        ]
    )

points = [random() for _ in xrange(1000 * 2)]
then the resulting output is the same, but the timing goes down further:
function usec per loop:
Python Cython
array_struct_flat 157
Cython might be capable of substantially better than this too, if someone smarter than me wanted to add static type declarations to the code. (Running 'cython -a test.pyx' is invaluable for this: it produces an HTML file showing you where the slowest plain Python is in your code (yellow), versus Python that has been converted to pure C (white). That's why I spread the code above out onto so many lines: the coloring is done per line, so it helps to spread it out like that.)
Full Cython instructions are here:
http://docs.cython.org/src/quickstart/build.html
Cython might produce similar performance benefits across your whole codebase, and in ideal conditions, with proper static typing applied, can improve speed by factors of ten or a hundred.
There's another idea I stumbled across. I don't have time to profile it right now, but in case someone else does:
# untested, but I'm fairly confident it runs
# using 'flattened points' list, i.e. a list of n*2 floats
points = [random() for _ in xrange(1000 * 2)]
c_array = (c_float * len(points))()   # create the array instance, but don't populate it yet
c_array[:] = points                   # fill it via slice assignment
That is, first we create the ctypes array but don't populate it. Then we populate it using the slice notation. People smarter than I tell me that assigning to a slice like this may help performance. It allows us to pass a list or iterable directly on the RHS of the assignment, without having to use the *iterable syntax, which would perform some intermediate wrangling of the iterable. I suspect that this is what happens in the depths of creating pyglet's Batches.
Presumably you could just create c_array once, then just reassign to it (the final line in the above code) every time the points list changes.
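A hedged sketch of that reuse pattern (n_points and update() are illustrative names, not from the original post):
from ctypes import c_float
from random import random

n_points = 1000
c_array = (c_float * (n_points * 2))()   # allocate once, outside the per-frame path

def update(flat_points):
    c_array[:] = flat_points             # refill in place via slice assignment
    return c_array

update([random() for _ in range(n_points * 2)])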
There is probably an alternative formulation which accepts the original definition of points (a list of (x,y) tuples.) Something like this:
# very untested, likely contains errors
# using a list of n tuples of two floats
from itertools import chain
points = [(random(), random()) for _ in xrange(1000)]
c_array = (c_float * (len(points) * 2))()
c_array[:] = list(chain.from_iterable(points))
If performance is an issue, you do not want to use ctypes arrays with the star operation (e.g., (ctypes.c_float * size)(*t)).
In my test pack is the fastest, followed by the use of the array module with a cast of the address (or using the from_buffer function).
import timeit
repeat = 100
setup="from struct import pack; from random import random; import numpy; from array import array; import ctypes; t = [random() for _ in range(2* 1000)];"
print(timeit.timeit(stmt="v = array('f',t); addr, count = v.buffer_info();x = ctypes.cast(addr,ctypes.POINTER(ctypes.c_float))",setup=setup,number=repeat))
print(timeit.timeit(stmt="v = array('f',t);a = (ctypes.c_float * len(v)).from_buffer(v)",setup=setup,number=repeat))
print(timeit.timeit(stmt='x = (ctypes.c_float * len(t))(*t)',setup=setup,number=repeat))
print(timeit.timeit(stmt="x = pack('f'*len(t), *t);",setup=setup,number=repeat))
print(timeit.timeit(stmt='x = (ctypes.c_float * len(t))(); x[:] = t',setup=setup,number=repeat))
print(timeit.timeit(stmt='x = numpy.array(t,numpy.float32).data',setup=setup,number=repeat))
The array.array approach is slightly faster than Jonathan Hartley's approach in my test while the numpy approach has about half the speed:
python3 convert.py
0.004665990360081196
0.004661010578274727
0.026358536444604397
0.0028003649786114693
0.005843495950102806
0.009067213162779808
The net winner is pack.
You can use array (notice also the generator expression instead of the list comprehension):
array("f", (u for point in points for u in point)).tostring()
Another optimization would be to keep the points flattened from the beginning.
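A hedged sketch of keeping the data flat from the start (assumes Python 2, like the xrange examples above):
from array import array
from random import random

flat = array("f", (random() for _ in xrange(1000 * 2)))   # typed, flat storage built directly
data = flat.tostring()                                    # raw float32 bytes ready for OpenGL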
I have practically no knowledge of Matlab and need to translate some parsing routines into Python. They are for large files that are themselves divided into 'blocks', and I'm having difficulty right from the off with the checksum at the top of the file.
What exactly is going on here in Matlab?
status = fseek(fid, 0, 'cof');
fposition = ftell(fid);
disp(' ');
disp(['** Block ',num2str(iBlock),' File Position = ',int2str(fposition)]);
% ----------------- Block Start ------------------ %
[A, count] = fread(fid, 3, 'uint32');
if(count == 3)
    magic_l = A(1);
    magic_h = A(2);
    block_length = A(3);
else
    if(fposition == file_length)
        disp(['** End of file OK']);
    else
        disp(['** Cannot read block start magic ! Note File Length = ',num2str(file_length)]);
    end
    ok = 0;
    break;
end
fid is the file currently being looked at
iBlock is a counter for which 'block' you're in within the file
magic_l and magic_h are to do with checksums later, here is the code for that (follows straight from the code above):
disp(sprintf(' Magic_L = %08X, Magic_H = %08X, Length = %i', magic_l, magic_h, block_length));
correct_magic_l = hex2dec('4D445254');
correct_magic_h = hex2dec('43494741');
if(magic_l ~= correct_magic_l | magic_h ~= correct_magic_h)
    disp(['** Bad block start magic !']);
    ok = 0;
    return;
end
remaining_length = block_length - 3*4 - 3*4; % We read Block Header, and we expect a footer
disp(sprintf(' Remaining Block bytes = %i', remaining_length));
What is going on with the %08X and the hex2dec stuff?
Also, why specify 3*4 instead of 12?
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32'); in Python, as io.readline() is just pulling the first 3 characters of the file. Apologies if I'm missing the point somewhere here. It's just that using io.readline(3) on the file seems to return something it shouldn't, and I don't understand how the block_length can fit in a single byte when it could potentially be very long.
Thanks for reading this ramble. I hope you can understand kind of what I want to know! (Any insight at all is appreciated.)
Python Code for Reading a 1-Dimensional Array
When replacing Matlab with Python, I wanted to read binary data into a numpy.array, so I used numpy.fromfile to read the data into a 1-dimensional array:
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16)
Some advantages of using numpy.fromfile versus other Python solutions include:
Not having to manually determine the number of items to be read. You can specify them using the count= argument, but it defaults to -1 which indicates reading the entire file.
Being able to specify either an open file object (as I did above with fid) or a filename. I prefer using an open file object, but if you wanted to use a filename, you could replace the two lines above with:
data_array = numpy.fromfile(inputfilename, numpy.int16)
Matlab Code for a 2-Dimensional Array
Matlab's fread has the ability to read the data into a matrix of form [m, n] instead of just reading it into a column vector. For instance, to read data into a matrix with 2 rows use:
fid = fopen(inputfilename, 'r');
data_array = fread(fid, [2, inf], 'int16');
fclose(fid);
Equivalent Python Code for a 2-Dimensional Array
You can handle this scenario in Python using Numpy's shape and transpose.
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16).reshape((-1, 2)).T
The -1 tells numpy.reshape to infer the length of the array for that dimension based on the other dimension (the equivalent of Matlab's inf).
The .T transposes the array so that it is a 2-dimensional array whose first dimension (the axis) has a length of 2.
From the documentation of fread, it is a function to read binary data. The second argument specifies the size of the output vector, the third one the size/type of the items read.
In order to recreate this in Python, you can use the array module:
f = open(...)

import array
a = array.array("L")  # "L" (unsigned long) holds uint32 values on most platforms; use "I" where long is 8 bytes
a.fromfile(f, 3)
This will read three uint32 values from the file f, which are available in a afterwards. From the documentation of fromfile:
Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won’t do.
Arrays implement the sequence protocol and therefore support the same operations as lists, but you can also use the .tolist() method to create a normal list from the array.
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32');
In Matlab, one of fread()'s signatures is fread(fileID, sizeA, precision). This reads in the first sizeA elements (not bytes) of a file, each of a size sufficient for precision. In this case, since you're reading in uint32, each element is of size 32 bits, or 4 bytes.
So, instead, read 12 bytes, e.g. fid.read(12), to get the first 3 4-byte elements from the file (readline can stop early at a newline byte, so a plain read is safer for binary data).
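A hedged sketch of replicating [A, count] = fread(fid, 3, 'uint32') this way ('blockfile.bin' and little-endian byte order are assumptions, not from the question):
import struct

with open('blockfile.bin', 'rb') as fid:
    raw = fid.read(3 * 4)                      # 3 elements of 4 bytes each
    count = len(raw) // 4                      # Matlab's count: how many whole elements were read
    A = struct.unpack('<%dI' % count, raw[:count * 4])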
The first part is covered by Torsten's answer... you're going to need array or numarray to do anything with this data anyway.
As for the %08X and the hex2dec stuff, %08X is just the print format for those uint32 numbers (8-digit hex, exactly the same as in Python), and hex2dec('4D445254') is Matlab for 0x4D445254.
Finally, ~= in Matlab means "not equal"; use != in Python.
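Putting those two points together, a hedged, self-contained sketch of the same checks in Python (the three values are hard-coded here; they would normally come from the 12-byte read shown earlier):
magic_l, magic_h, block_length = 0x4D445254, 0x43494741, 64   # dummy values for illustration

correct_magic_l = 0x4D445254   # hex2dec('4D445254') in Matlab
correct_magic_h = 0x43494741   # hex2dec('43494741')

print('Magic_L = %08X, Magic_H = %08X, Length = %i' % (magic_l, magic_h, block_length))

if magic_l != correct_magic_l or magic_h != correct_magic_h:
    print('** Bad block start magic !')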