Faster bit-level data packing

Faster bit-level data packing - python

An 256*64 pixel OLED display connected to Raspberry Pi (Zero W) has 4 bit greyscale pixel data packed into a byte (i.e. two pixels per byte), so 8192 bytes in total. E.g. the bytes
0a 0b 0c 0d (only lower nibble has data)
become
ab cd
Converting these bytes either obtained from a Pillow (PIL) Image or a cairo ImageSurface takes up to 0.9 s when naively iterating the pixel data, depending on color depth.
Combining every two bytes from a Pillow "L" (monochrome 8 bit) Image:
imd = im.tobytes()
nibbles = [int(p / 16) for p in imd]
packed = []
msn = None
for n in nibbles:
nib = n & 0x0F
if msn is not None:
b = msn << 4 | nib
packed.append(b)
msn = None
else:
msn = nib
This (omitting state and saving float/integer conversion) brings it down to about half (0.2 s):
packed = []
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
Basically the first applied to an RGB24 (32 bit!) cairo ImageSurface, though with crude greyscale conversion:
mv = surface.get_data()
w = surface.get_width()
h = surface.get_height()
f = surface.get_format()
s = surface.get_stride()
print(len(mv), w, h, f, s)
# convert xRGB
o = []
msn = None
for p in range(0, len(mv), 4):
nib = int( (mv[p+1] + mv[p+2] + mv[p+3]) / 3 / 16) & 0x0F
if msn is not None:
b = msn << 4 | nib
o.append(b)
msn = None
else:
msn = nib
takes about twice as long (0.9 s vs 0.4 s).
The struct module does not support nibbles (half-bytes).
bitstring does allow packing nibbles:
>>> a = bitstring.BitStream()
>>> a.insert('0xf')
>>> a.insert('0x1')
>>> a
BitStream('0xf1')
>>> a.insert(5)
>>> a
BitStream('0b1111000100000')
>>> a.insert('0x2')
>>> a
BitStream('0b11110001000000010')
>>>
But there does not seem to be a method to unpack this into a list of integers quickly -- this takes 30 seconds!:
a = bitstring.BitStream()
for p in imd:
a.append( bitstring.Bits(uint=p//16, length=4) )
packed=[]
a.pos=0
for p in range(256*64//2):
packed.append( a.read(8).uint )
Does Python 3 have the means to do this efficiently or do I need an alternative?
External packer wrapped with ctypes? The same, but simpler, with Cython (I have not yet looked into these)? Looks very good, see my answer.

Down to 130 ms from 200 ms by just wrapping the loop in a function
def packer0(imd):
"""same loop in a def"""
packed = []
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
return packed
Down to 35 ms by Cythonizing the same code
def packer1(imd):
"""Cythonize python nibble packing loop"""
packed = []
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
return packed
Down to 16 ms with type
def packer2(imd):
"""Cythonize python nibble packing loop, typed"""
packed = []
cdef unsigned int b
for b in range(0, 256*64, 2):
packed.append( (imd[b]//16)<<4 | (imd[b+1]//16) )
return packed
Not much of a difference with a "simplified" loop
def packer3(imd):
"""Cythonize python nibble packing loop, typed"""
packed = []
cdef unsigned int i
for i in range(256*64/2):
packed.append( (imd[i*2]//16)<<4 | (imd[i*2+1]//16) )
return packed
Maybe a tiny bit faster even (15 ms)
def packer4(it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
return [ (it[i*2]//16)<<4 | it[i*2+1]//16 for i in range(n) ]
Here's with timeit
>>> timeit.timeit('packer4(data)', setup='from pack import packer4; data = [0]*256*64', number=100)
1.31725951000044
>>> exit()
pi#raspberrypi:~ $ python3 -m timeit -s 'from pack import packer4; data = [0]*256*64' 'packer4(data)'
100 loops, best of 3: 9.04 msec per loop
This already meets my requirements, but I guess there may be further optimization possible with the input/output iterables (-> unsigned int array?) or accessing the input data with a wider data type (Raspbian is 32 bit, BCM2835 is ARM1176JZF-S single-core).
Or with parallelism on the GPU or the multi-core Raspberry Pis.
A crude comparison with the same loop in C (ideone):
#include <stdio.h>
#include <stdint.h>
#define SIZE (256*64)
int main(void) {
uint8_t in[SIZE] = {0};
uint8_t out[SIZE/2] = {0};
uint8_t t;
for(t=0; t<100; t++){
uint16_t i;
for(i=0; i<SIZE/2; i++){
out[i] = (in[i*2]/16)<<4 | in[i*2+1]/16;
}
}
return 0;
}
It's apparently 100 times faster:
pi#raspberry:~ $ gcc p.c
pi#raspberry:~ $ time ./a.out
real 0m0.085s
user 0m0.060s
sys 0m0.010s
Eliminating the the shifts/division may be another slight optimization (I have not checked the resulting C, nor the binary):
def packs(bytes it):
"""Cythonize python nibble packing loop, typed"""
cdef unsigned int n = len(it)//2
cdef unsigned int i
return [ ( (it[i<<1]&0xF0) | (it[(i<<1)+1]>>4) ) for i in range(n) ]
results in
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 12.7 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 12 msec per loop
python3 -m timeit -s 'from pack import packs; data = bytes([0]*256*64)' 'packs(data)'
100 loops, best of 3: 11 msec per loop
python3 -m timeit -s 'from pack import pack; data = bytes([0]*256*64)' 'pack(data)'
100 loops, best of 3: 13.9 msec per loop

Related

How to get rid of for loop in my function? [duplicate]

T(i) = Tm(i) + (T(i-1)-Tm(i))**(-tau(i))
Tm and tau are NumPy vectors of the same length that have been previously calculated, and the desire is to create a new vector T. The i is included only to indicate the element index for what is desired.
Is a for loop necessary for this case?

You might think this would work:
import numpy as np
n = len(Tm)
t = np.empty(n)
t[0] = 0 # or whatever the initial condition is
t[1:] = Tm[1:] + (t[0:n-1] - Tm[1:])**(-tau[1:])
but it doesn't: you can't actually do recursion in numpy this way (since numpy calculates the whole RHS and then assigns it to the LHS).
So unless you can come up with a non-recursive version of this formula, you're stuck with an explicit loop:
tt = np.empty(n)
tt[0] = 0.
for i in range(1,n):
tt[i] = Tm[i] + (tt[i-1] - Tm[i])**(-tau[i])

2019 Update. The Numba code broke with the new version of numba. Changing dtype="float32" to dtype=np.float32 solved it.
I performed some benchmarks and in 2019 using Numba is the first option people should try to accelerate recursive functions in Numpy (adjusted proposal of Aronstef). Numba is already preinstalled in the Anaconda package and has one of the fastest times (about 20 times faster than any Python). In 2019 Python supports #numba annotations without additional steps (at least versions 3.6, 3.7, and 3.8). Here are three benchmarks: performed on 2019-12-05, 2018-10-20 and 2016-05-18.
And, as mentioned by Jaffe, in 2018 it is still not possible to vectorize recursive functions. I checked the vectorization by Aronstef and it does NOT work.
Benchmarks sorted by execution time:
-------------------------------------------
|Variant |2019-12 |2018-10 |2016-05 |
-------------------------------------------
|Pure C | na | na | 2.75 ms|
|C extension | na | na | 6.22 ms|
|Cython float32 | 0.55 ms| 1.01 ms| na |
|Cython float64 | 0.54 ms| 1.05 ms| 6.26 ms|
|Fortran f2py | 4.65 ms| na | 6.78 ms|
|Numba float32 |73.0 ms| 2.81 ms| na |
|(Aronstef) | | | |
|Numba float32v2| 1.82 ms| 2.81 ms| na |
|Numba float64 |78.9 ms| 5.28 ms| na |
|Numba float64v2| 4.49 ms| 5.28 ms| na |
|Append to list |73.3 ms|48.2 ms|91.0 ms|
|Using a.item() |36.9 ms|58.3 ms|74.4 ms|
|np.fromiter() |60.8 ms|60.0 ms|78.1 ms|
|Loop over Numpy|71.3 ms|71.9 ms|87.9 ms|
|(Jaffe) | | | |
|Loop over Numpy|74.6 ms|74.4 ms| na |
|(Aronstef) | | | |
-------------------------------------------
Corresponding code is provided at the end of the answer.
It seems that with time Numba and Cython times get better. Now both of them are faster than Fortran f2py. Cython is faster 8.6 times now and Numba 32bit is faster 2.5 times. Fortran was very hard to debug and compile in 2016. So now there is no reason to use Fortran at all.
I did not check Pure C and C extension in 2019 and 2018, because it is not easy to compile them in Jupyter notebooks.
I had the following setup in 2019:
Processor: Intel i5-9600K 3.70GHz
Versions:
Python: 3.8.0
Numba: 0.46.0
Cython: 0.29.14
Numpy: 1.17.4
I had the following setup in 2018:
Processor: Intel i7-7500U 2.7GHz
Versions:
Python: 3.7.0
Numba: 0.39.0
Cython: 0.28.5
Numpy: 1.15.1
The recommended Numba code using float32 (adjusted Aronstef):
#numba.jit("float32[:](float32[:], float32[:])", nopython=True, nogil=True)
def calc_py_jit32v2(Tm_, tau_):
tt = np.empty(len(Tm_),dtype=np.float32)
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i])
return tt[1:]
All the other code:
Data creation (like Aronstef + Mike T comment):
np.random.seed(0)
n = 100000
Tm = np.cumsum(np.random.uniform(0.1, 1, size=n).astype('float64'))
tau = np.random.uniform(-1, 0, size=n).astype('float64')
ar = np.column_stack([Tm,tau])
Tm32 = Tm.astype('float32')
tau32 = tau.astype('float32')
Tm_l = list(Tm)
tau_l = list(tau)
The code in 2016 was slightly different as I used abs() function to prevent nans and not the variant of Mike T. In 2018 the function is exactly the same as OP (Original Poster) wrote.
Cython float32 using Jupyter %% magic. The function can be used directly in Python. Cython needs a C++ compiler in which Python was compiled. Installation of the right version of Visual C++ compiler (for Windows) could be problematic:
%%cython
import cython
import numpy as np
cimport numpy as np
from numpy cimport ndarray
cdef extern from "math.h":
np.float32_t exp(np.float32_t m)
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.infer_types(True)
#cython.initializedcheck(False)
def cy_loop32(np.float32_t[:] Tm,np.float32_t[:] tau,int alen):
cdef np.float32_t[:] T=np.empty(alen, dtype=np.float32)
cdef int i
T[0]=0.0
for i in range(1,alen):
T[i] = Tm[i] + (T[i-1] - Tm[i])**(-tau[i])
return T
Cython float64 using Jupyter %% magic. The function can be used directly in Python:
%%cython
cdef extern from "math.h":
double exp(double m)
import cython
import numpy as np
cimport numpy as np
from numpy cimport ndarray
#cython.boundscheck(False)
#cython.wraparound(False)
#cython.infer_types(True)
#cython.initializedcheck(False)
def cy_loop(double[:] Tm,double[:] tau,int alen):
cdef double[:] T=np.empty(alen)
cdef int i
T[0]=0.0
for i in range(1,alen):
T[i] = Tm[i] + (T[i-1] - Tm[i])**(-tau[i])
return T
Numba float64:
#numba.jit("float64[:](float64[:], float64[:])", nopython=False, nogil=True)
def calc_py_jitv2(Tm_, tau_):
tt = np.empty(len(Tm_),dtype=np.float64)
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i])
return tt[1:]
Append to list. Fastest non-compiled solution:
def rec_py_loop(Tm,tau,alen):
T = [Tm[0]]
for i in range(1,alen):
T.append(Tm[i] - (T[i-1] + Tm[i])**(-tau[i]))
return np.array(T)
Using a.item():
def rec_numpy_loop_item(Tm_,tau_):
n_ = len(Tm_)
tt=np.empty(n_)
Ti=tt.item
Tis=tt.itemset
Tmi=Tm_.item
taui=tau_.item
Tis(0,Tm_[0])
for i in range(1,n_):
Tis(i,Tmi(i) - (Ti(i-1) + Tmi(i))**(-taui(i)))
return tt[1:]
np.fromiter():
def it(Tm,tau):
T=Tm[0]
i=0
while True:
yield T
i+=1
T=Tm[i] - (T + Tm[i])**(-tau[i])
def rec_numpy_iter(Tm,tau,alen):
return np.fromiter(it(Tm,tau), np.float64, alen)[1:]
Loop over Numpy (based on the Jaffe's idea):
def rec_numpy_loop(Tm,tau,alen):
tt=np.empty(alen)
tt[0]=Tm[0]
for i in range(1,alen):
tt[i] = Tm[i] - (tt[i-1] + Tm[i])**(-tau[i])
return tt[1:]
Loop over Numpy (Aronstef's code). On my computer float64 is the default type for np.empty.
def calc_py(Tm_, tau_):
tt = np.empty(len(Tm_),dtype="float64")
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = (Tm_[i] - (tt[i-1] + Tm_[i])**(-tau_[i]))
return tt[1:]
Pure C without using Python at all. Version from year 2016 (with fabs() function):
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <windows.h>
#include <sys\timeb.h>
double randn() {
double u = rand();
if (u > 0.5) {
return sqrt(-1.57079632679*log(1.0 - pow(2.0 * u - 1, 2)));
}
else {
return -sqrt(-1.57079632679*log(1.0 - pow(1 - 2.0 * u,2)));
}
}
void rec_pure_c(double *Tm, double *tau, int alen, double *T)
{
for (int i = 1; i < alen; i++)
{
T[i] = Tm[i] + pow(fabs(T[i - 1] - Tm[i]), (-tau[i]));
}
}
int main() {
int N = 100000;
double *Tm= calloc(N, sizeof *Tm);
double *tau = calloc(N, sizeof *tau);
double *T = calloc(N, sizeof *T);
double time = 0;
double sumtime = 0;
for (int i = 0; i < N; i++)
{
Tm[i] = randn();
tau[i] = randn();
}
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
LARGE_INTEGER Frequency;
for (int j = 0; j < 1000; j++)
{
for (int i = 0; i < 3; i++)
{
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
rec_pure_c(Tm, tau, N, T);
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
if (i == 0)
time = (double)ElapsedMicroseconds.QuadPart / 1000;
else {
if (time > (double)ElapsedMicroseconds.QuadPart / 1000)
time = (double)ElapsedMicroseconds.QuadPart / 1000;
}
}
sumtime += time;
}
printf("1000 loops,best of 3: %.3f ms per loop\n",sumtime/1000);
free(Tm);
free(tau);
free(T);
}
Fortran f2py. Function can be used from Python. Version from year 2016 (with abs() function):
subroutine rec_fortran(tm,tau,alen,result)
integer*8, intent(in) :: alen
real*8, dimension(alen), intent(in) :: tm
real*8, dimension(alen), intent(in) :: tau
real*8, dimension(alen) :: res
real*8, dimension(alen), intent(out) :: result
res(1)=0
do i=2,alen
res(i) = tm(i) + (abs(res(i-1) - tm(i)))**(-tau(i))
end do
result=res
end subroutine rec_fortran

Update: 21-10-2018
I have corrected my answer based on comments.
It is possible to vectorize operations on vectors as long as the calculation is not recursive. Because a recursive operation depends on the previous calculated value it is not possible to parallel process the operation.
This does therefore not work:
def calc_vect(Tm_, tau_):
return Tm_[1:] - (Tm_[:-1] + Tm_[1:]) ** (-tau_[1:])
Since (serial processing / a loop) is necessary, the best performance is gained by moving as close as possible to optimized machine code, therefore Numba and Cython are the best answers here.
A Numba approach can be achieves as follows:
init_string = """
from math import pow
import numpy as np
from numba import jit, float32
np.random.seed(0)
n = 100000
Tm = np.cumsum(np.random.uniform(0.1, 1, size=n).astype('float32'))
tau = np.random.uniform(-1, 0, size=n).astype('float32')
def calc_python(Tm_, tau_):
tt = np.empty(len(Tm_))
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - pow(tt[i-1] + Tm_[i], -tau_[i])
return tt
#jit(float32[:](float32[:], float32[:]), nopython=False, nogil=True)
def calc_numba(Tm_, tau_):
tt = np.empty(len(Tm_))
tt[0] = Tm_[0]
for i in range(1, len(Tm_)):
tt[i] = Tm_[i] - pow(tt[i-1] + Tm_[i], -tau_[i])
return tt
"""
import timeit
py_time = timeit.timeit('calc_python(Tm, tau)', init_string, number=100)
numba_time = timeit.timeit('calc_numba(Tm, tau)', init_string, number=100)
print("Python Solution: {}".format(py_time))
print("Numba Soltution: {}".format(numba_time))
Timeit comparison of the Python and Numba functions:
Python Solution: 54.58057559299999
Numba Soltution: 1.1389029540000024

This is a good question. I am also interested to know if this is possible but so far I have not found a way to do it except in some simple cases.
Option 1. numpy.ufunc.accumulate
This seems to be a promising option as mentioned by #Karl Knechtel. You need to create a ufunc first. This web page explains how.
In the simple case of a recurrent function that takes two scalars as input and outputs one scaler, it seems to work:
import numpy as np
def test_add(x, data):
return x + data
assert test_add(1, 2) == 3
assert test_add(2, 3) == 5
# Make a Numpy ufunc from my test_add function
test_add_ufunc = np.frompyfunc(test_add, 2, 1)
assert test_add_ufunc(1, 2) == 3
assert test_add_ufunc(2, 3) == 5
assert np.all(test_add_ufunc([1, 2], [2, 3]) == [3, 5])
data_sequence = np.array([1, 2, 3, 4])
f_out = test_add_ufunc.accumulate(data_sequence, dtype=object)
assert np.array_equal(f_out, [1, 3, 6, 10])
[Note the dtype=object argument which is necessary as explained on the web page linked above].
But in your case (and mine) we want to compute a recurrent equation that has more than one data input (and potentially more than one state variable too).
When I tried this using the ufunc.accumulate approach above I got ValueError: accumulate only supported for binary functions.
If anyone knows a way round that constraint I would be very interested.
Option 2. Python's builtin accumulate function
In the mean time, this solution doesn't quite achieve what you wanted in terms of a vectorized calculation in numpy, but it does at least avoid a for loop.
from itertools import accumulate, chain
def t_next(t, data):
Tm, tau = data # Unpack more than one data input
return Tm + (t - Tm)**tau
assert t_next(2, (0.38, 0)) == 1.38
t0 = 2 # Initial t
Tm_values = np.array([0.38, 0.88, 0.56, 0.67, 0.45, 0.98, 0.58, 0.72, 0.92, 0.82])
tau_values = np.linspace(0, 0.9, 10)
# Combine the input data into a 2D array
data_sequence = np.vstack([Tm_values, tau_values]).T
t_out = np.fromiter(accumulate(chain([t0], data_sequence), t_next), dtype=float)
print(t_out)
# [2. 1.38 1.81303299 1.60614649 1.65039964 1.52579703
# 1.71878078 1.66109554 1.67839293 1.72152195 1.73091672]
# Slightly more readable version possible in Python 3.8+
t_out = np.fromiter(accumulate(data_sequence, t_next, initial=t0), dtype=float)
print(t_out)
# [2. 1.38 1.81303299 1.60614649 1.65039964 1.52579703
# 1.71878078 1.66109554 1.67839293 1.72152195 1.73091672]

To build on NPE's answer, I agree that there has to be a loop somewhere. Perhaps your goal is to avoid the overhead associated with a Python for loop? In that case, numpy.fromiter does beat out a for loop, but only by a little:
Using the very simple recursion relation,
x[i+1] = x[i] + 0.1
I get
#FOR LOOP
def loopit(n):
x = [0.0]
for i in range(n-1): x.append(x[-1] + 0.1)
return np.array(x)
#FROMITER
#define an iterator (a better way probably exists -- I'm a novice)
def it():
x = 0.0
while True:
yield x
x += 0.1
#use the iterator with np.fromiter
def fi_it(n):
return np.fromiter(it(), np.float, n)
%timeit -n 100 loopit(100000)
#100 loops, best of 3: 31.7 ms per loop
%timeit -n 100 fi_it(100000)
#100 loops, best of 3: 18.6 ms per loop
Interestingly, pre-allocating a numpy array results in a substantial loss in performance. This is a mystery to me, though I would guess that there must be more overhead associated with accessing an array element than with appending to a list.
def loopit(n):
x = np.zeros(n)
for i in range(n-1): x[i+1] = x[i] + 0.1
return x
%timeit -n 100 loopit(100000)
#100 loops, best of 3: 50.1 ms per loop

PyCUDA LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered

I am trying to parallelize the bitonic sort with pycuda. For this I use SourceModule and the C code of the parallel bitonic sort. For the memory copies management I use InOut of the pycuda.driver that simplify some of the memory transfers
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule
from pycuda import gpuarray
import numpy as np
from time import time
ker = SourceModule(
"""
__device__ void swap(int & a, int & b){
int tmp = a;
a = b;
b = tmp;
}
__global__ void bitonicSort(int * values, int N){
extern __shared__ int shared[];
int tid = threadIdx.x + blockDim.x * blockIdx.x;
// Copy input to shared mem.
shared[tid] = values[tid];
__syncthreads();
// Parallel bitonic sort.
for (int k = 2; k <= N; k *= 2){
// Bitonic merge:
for (int j = k / 2; j>0; j /= 2){
int ixj = tid ^ j;
if (ixj > tid){
if ((tid & k) == 0){
//Sort ascending
if (shared[tid] > shared[ixj]){
swap(shared[tid], shared[ixj]);
}
}
else{
//Sort descending
if (shared[tid] < shared[ixj]){
swap(shared[tid], shared[ixj]);
}
}
}
__syncthreads();
}
}
values[tid] = shared[tid];
}
"""
)
N = 8 #lenght of A
A = np.int32(np.random.randint(1, 20, N)) #random numbers in A
BLOCK_SIZE = 256
NUM_BLOCKS = (N + BLOCK_SIZE-1)//BLOCK_SIZE
bitonicSort = ker.get_function("bitonicSort")
t1 = time()
bitonicSort(drv.InOut(A), np.int32(N), block=(BLOCK_SIZE,1,1), grid=(NUM_BLOCKS,1), shared=4*N)
t2 = time()
print("Execution Time {0}".format(t2 - t1))
print(A)
As in the kernel I use extern __shared__, in pycuda I use the shared parameter with the respective 4*N. Also try using __shared__ int shared[N] in the kernel but it doesn't work either (check here: Getting started with shared memory on PyCUDA)
Running in Google Collab I get the following error:
/usr/local/lib/python3.6/dist-packages/pycuda/compiler.py in __init__(self, source, nvcc, options, keep, no_extern_c, arch, code, cache_dir, include_dirs)
292
293 from pycuda.driver import module_from_buffer
--> 294 self.module = module_from_buffer(cubin)
295
296 self._bind_module()
LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered
Does anyone know what could be generating this error?

Your device code isn't accounting for the sizes of your arrays correctly.
You are launching 256 threads in a single block. That means that you will have 256 threads, with tid numbered 0..255, trying to execute each line of code. For example, in this case:
shared[tid] = values[tid];
You will have, for example, one thread trying to do shared[255] = values[255];
Neither your shared nor values array are that large. That is the reason for the illegal memory access error.
The simplest solution for this kind of trivial problem is to make your array sizes match your block size.
BLOCK_SIZE = N
According to my testing, that change clears up any errors and results in a properly sorted array.
It won't work for N greater than 1024, or multi-block usage, but your code would have to be modified for a multi-block sort, anyway.
If you still have trouble after making that change, I suggest restarting your python session or your colab session.

Python Ctypes passing pointer for data

I am accessing an API and can't get the data returned. The two float pointers will point to an array of data. I must assume the API is working properly. A different function call provides a the length of the data I am retrieving. This values is length down below when attempted.
C Header for Function
int function(int, float * data1, float * data2)
ctypes setup
dll.function.argtypes = (c_int, POINTER(c_float), POINTER(c_float))
dll.function.restypes = c_int
Failed Attempt 1:
x = c_float()
y = c_float()
status = dll.function(1, byref(x), byref(y))
Program crashes OR Access violation writing.
Failed Attempt 2:
x = POINTER(c_float)()
y = POINTER(c_float)()
status = dll.function(1, x, y)
Null Pointer Error
Failed Attempt 3:
dll.function.argtypes = (c_int, c_void_p, c_void_p)
x = c_void_p()
y = c_void_p()
status = dll.function(1, x, y)
Null Pointer Error
Failed Attempt 4:
array = c_float * length
x = array()
y = array()
status = dll.function(1, byref(x), byref(y))
Program crashes
Failed Attempt 5:
array = c_float * length
x = POINTER(array)()
y = POINTER(array)()
status = dll.function(1, x, y)
Null Pointer Error OR ArgumentError: expected LP_c_float instance instead of LP_c_float_Array_[length]
Failed Attempt 6:
x = (c_float*length)()
y = (c_float*length)()
a = cast(x, POINTER(c_float))
b = cast(y, POINTER(c_float))
status = dll.function(1, a, b)
Program crashes
What am I missing and why?
I believe the argtypes are correct. I am attempting to meet them properly, but there continues to be an issues. Do I need to "malloc" the memory somehow? (I'm sure I need to free after I get the data).
This is on Windows 7 with Python 2.7 32-bit.
I have looked through other similar issues and am not finding a solution. I am wondering if, at this point, I can blame the API for this issue.

Dealing with pointers and arrays is explained in [Python.Docs]: ctypes - Type conversions.
I prepared a dummy example for you.
main00.c:
#if defined(_WIN32)
# define DECLSPEC_DLLEXPORT __declspec(dllexport)
#else
# define DECLSPEC_DLLEXPORT
#endif
static int kSize = 5;
DECLSPEC_DLLEXPORT int size() {
return kSize;
}
DECLSPEC_DLLEXPORT int function(int dummy, float *data1, float *data2) {
for (int i = 0; i < kSize; i++) {
data1[i] = dummy * i;
data2[i] = -dummy * (i + 1);
}
return 0;
}
code00.py:
#!/usr/bin/env python
import sys
import ctypes as ct
c_float_p = ct.POINTER(ct.c_float)
def main(*argv):
dll = ct.CDLL("./dll00.so")
size = dll.size
size.argtypes = []
size.restype = ct.c_int
function = dll.function
function.argtypes = [ct.c_int, c_float_p, c_float_p]
function.restype = ct.c_int
sz = size()
print(sz)
data1 = (ct.c_float * sz)()
data2 = (ct.c_float * sz)()
res = function(1, ct.cast(data1, c_float_p), ct.cast(data2, c_float_p))
for i in range(sz):
print(data1[i], data2[i])
if __name__ == "__main__":
print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(item.strip() for item in sys.version.split("\n")), 64 if sys.maxsize > 0x100000000 else 32, sys.platform))
rc = main(*sys.argv[1:])
print("\nDone.")
sys.exit(rc)
Notes:
The C part tries to mimic what your .dll does (or at least what I understood):
size - gets the arrays sizes
function - populates the arrays (till their size - assuming that they were properly allocated by the caller)
Python part is straightforward:
Load the .dll
Define argtypes and restype (in your code it's restype's) for the 2 functions (for size_func not necessary)
Get the lengths
Initialize the arrays
Pass them to function_func using ctypes.cast
Output (on Lnx, as building the C code is much simpler, but works on Win as well):
[cfati#cfati-ubtu16x64-0:~/Work/Dev/StackOverflow/q050043861]> gcc -shared -o dll00.so main00.c
[cfati#cfati-ubtu16x64-0:~/Work/Dev/StackOverflow/q050043861]> python3 code00.py
Python 3.8.5 (default, Jan 27 2021, 15:41:15) [GCC 9.3.0] 64bit on linux
5
0.0 -1.0
1.0 -2.0
2.0 -3.0
3.0 -4.0
4.0 -5.0

It really depends on what you are doing with these float pointers.
If you are trying to traverse it, i.e.
for(int i = 0; i < size; i++)
printf("%f\n%, data1[i])
then for sure this is problematic as no array was allocated. You simply passed a pointer pointing to a float. That is all.
You need to first allocate that memory. To this end Attempt 4 looks like the more promising, but I suspect you have a problem inside your C function leading to the crash.
Difficult to say without seeing the implementation of that function.

vectorized radix sort with numpy - can it beat np.sort?

Numpy doesn't yet have a radix sort, so I wondered whether it was possible to write one using pre-existing numpy functions. So far I have the following, which does work, but is about 10 times slower than numpy's quicksort.
Test and benchmark:
a = np.random.randint(0, 1e8, 1e6)
assert(np.all(radix_sort(a) == np.sort(a)))
%timeit np.sort(a)
%timeit radix_sort(a)
The mask_b loop can be at least partially vectorized, broadcasting out across masks from &, and using cumsum with axis arg, but that ends up being a pessimization, presumably due to the increased memory footprint.
If anyone can see a way to improve on what I have I'd be interested to hear, even if it's still slower than np.sort...this is more a case of intellectual curiosity and interest in numpy tricks.
Note that you can implement a fast counting sort easily enough, though that's only relevant for small integer data.
Edit 1: Taking np.arange(n) out of the loop helps a little, but that's not very exiciting.
Edit 2: The cumsum was actually redundant (ooops!) but this simpler version only helps marginally with performance..
def radix_sort(a):
bit_len = np.max(a).bit_length()
n = len(a)
cached_arange = arange(n)
idx = np.empty(n, dtype=int) # fully overwritten each iteration
for mask_b in xrange(bit_len):
is_one = (a & 2**mask_b).astype(bool)
n_ones = np.sum(is_one)
n_zeros = n-n_ones
idx[~is_one] = cached_arange[:n_zeros]
idx[is_one] = cached_arange[:n_ones] + n_zeros
# next three lines just do: a[idx] = a, but correctly
new_a = np.empty(n, dtype=a.dtype)
new_a[idx] = a
a = new_a
return a
Edit 3: rather than loop over single bits, you can loop over two or more at a time, if you construct idx in multiple steps. Using 2 bits helps a little, I've not tried more:
idx[is_zero] = np.arange(n_zeros)
idx[is_one] = np.arange(n_ones)
idx[is_two] = np.arange(n_twos)
idx[is_three] = np.arange(n_threes)
Edits 4 and 5: going to 4 bits seems best for the input I'm testing. Also, you can get rid of the idx step entirely. Now only about 5 times, rather than 10 times, slower than np.sort (source available as gist):
Edit 6: This is a tidied up version of the above, but it's also a tiny bit slower. 80% of the time is spent on repeat and extract - if only there was a way to broadcast the extract :( ...
def radix_sort(a, batch_m_bits=3):
bit_len = np.max(a).bit_length()
batch_m = 2**batch_m_bits
mask = 2**batch_m_bits - 1
val_set = np.arange(batch_m, dtype=a.dtype)[:, nax] # nax = np.newaxis
for _ in range((bit_len-1)//batch_m_bits + 1): # ceil-division
a = np.extract((a & mask)[nax, :] == val_set,
np.repeat(a[nax, :], batch_m, axis=0))
val_set <<= batch_m_bits
mask <<= batch_m_bits
return a
Edits 7 & 8: Actually, you can broadcast the extract using as_strided from numpy.lib.stride_tricks, but it doesn't seem to help much performance-wise:
Initially this made sense to me on the grounds that extract will be iterating over the whole array batch_m times, so the total number of cache lines requested by the CPU will be the same as before (it's just that by the end of the process it has request each cache line batch_m times). However the reality is that extract is not clever enough to iterate over arbitrary stepped arrays, and has to expand out the array before beginning, i.e. the repeat ends up being done anyway.
In fact, having looked at the source for extract, I now see that the best we can do with this approach is:
a = a[np.flatnonzero((a & mask)[nax, :] == val_set) % len(a)]
which is marginally slower than extract. However, if len(a) is a power of two we can replace the expensive mod operation with & (len(a) - 1), which does end up being a bit faster than the extract version (now about 4.9x np.sort for a=randint(0, 1e8, 2**20). I suppose we could make this work for non-power of two lengths by zero-padding, and then cropping the extra zeros at the end of the sort...however this would be a pessimisation unless the length was already close to being power of two.

I had a go with Numba to see how fast a radix sort could be. The key to good performance with Numba (often) is to write out all the loops, which is very instructive. I ended up with the following:
from numba import jit
#jit
def radix_loop(nbatches, batch_m_bits, bitsums, a, out):
mask = (1 << batch_m_bits) - 1
for shift in range(0, nbatches*batch_m_bits, batch_m_bits):
# set bit sums to zero
for i in range(bitsums.shape[0]):
bitsums[i] = 0
# determine bit sums
for i in range(a.shape[0]):
j = (a[i] & mask) >> shift
bitsums[j] += 1
# take the cumsum of the bit sums
cumsum = 0
for i in range(bitsums.shape[0]):
temp = bitsums[i]
bitsums[i] = cumsum
cumsum += temp
# sorting loop
for i in range(a.shape[0]):
j = (a[i] & mask) >> shift
out[bitsums[j]] = a[i]
bitsums[j] += 1
# prepare next iteration
mask <<= batch_m_bits
# cant use `temp` here because of numba internal types
temp2 = a
a = out
out = temp2
return a
From the 4 inner loops, it's easy to see it's the 4th one making it hard to vectorize with Numpy.
One way to cheat around that problem is to pull in a particular C++ function from Scipy: scipy.sparse.coo.coo_tocsr. It does pretty much the same inner loops as the Python function above, so it can be abused to write a faster "vectorized" radix sort in Python. Maybe something like:
from scipy.sparse.coo import coo_tocsr
def radix_step(radix, keys, bitsums, a, w):
coo_tocsr(radix, 1, a.size, keys, a, a, bitsums, w, w)
return w, a
def scipysparse_radix_perbyte(a):
# coo_tocsr internally works with system int and upcasts
# anything else. We need to copy anyway to not mess with
# original array. Also take into account endianness...
a = a.astype('<i', copy=True)
bitlen = int(a.max()).bit_length()
radix = 256
work = np.empty_like(a)
_ = np.empty(radix+1, int)
for i in range((bitlen-1)//8 + 1):
keys = a.view('u1')[i::a.itemsize].astype(int)
a, work = radix_step(radix, keys, _, a, work)
return a
EDIT: Optimized the function a little bit.. see edit history.
One inefficiency of LSB radix sorting like above is that the array is completely shuffled in RAM a number of times, which means the CPU cache isn't used very well. To try to mitigate this effect, one could opt to first do a pass with MSB radix sort, to put items in roughly the right block of RAM, before sorting every resulting group with a LSB radix sort. Here's one implementation:
def scipysparse_radix_hybrid(a, bbits=8, gbits=8):
"""
Parameters
----------
a : Array of non-negative integers to be sorted.
bbits : Number of bits in radix for LSB sorting.
gbits : Number of bits in radix for MSB grouping.
"""
a = a.copy()
bitlen = int(a.max()).bit_length()
work = np.empty_like(a)
# Group values by single iteration of MSB radix sort:
# Casting to np.int_ to get rid of python BigInt
ngroups = np.int_(2**gbits)
group_offset = np.empty(ngroups + 1, int)
shift = max(bitlen-gbits, 0)
a, work = radix_step(ngroups, a>>shift, group_offset, a, work)
bitlen = shift
if not bitlen:
return a
# LSB radix sort each group:
agroups = np.split(a, group_offset[1:-1])
# Mask off high bits to not undo the grouping..
gmask = (1 << shift) - 1
nbatch = (bitlen-1) // bbits + 1
radix = np.int_(2**bbits)
_ = np.empty(radix + 1, int)
for agi in agroups:
if not agi.size:
continue
mask = (radix - 1) & gmask
wgi = work[:agi.size]
for shift in range(0, nbatch*bbits, bbits):
keys = (agi & mask) >> shift
agi, wgi = radix_step(radix, keys, _, agi, wgi)
mask = (mask << bbits) & gmask
if nbatch % 2:
# Copy result back in to `a`
wgi[...] = agi
return a
Timings (with best performing settings for each on my system):
def numba_radix(a, batch_m_bits=8):
a = a.copy()
bit_len = int(a.max()).bit_length()
nbatches = (bit_len-1)//batch_m_bits +1
work = np.zeros_like(a)
bitsums = np.zeros(2**batch_m_bits + 1, int)
srtd = radix_loop(nbatches, batch_m_bits, bitsums, a, work)
return srtd
a = np.random.randint(0, 1e8, 1e6)
%timeit numba_radix(a, 9)
# 10 loops, best of 3: 76.1 ms per loop
%timeit np.sort(a)
#10 loops, best of 3: 115 ms per loop
%timeit scipysparse_radix_perbyte(a)
#10 loops, best of 3: 95.2 ms per loop
%timeit scipysparse_radix_hybrid(a, 11, 6)
#10 loops, best of 3: 75.4 ms per loop
Numba performs very well, as expected. And also with some clever application of existing C-extensions it's possible to beat numpy.sort. IMO at the level of optimization you've already gotten it's worth-it to also consider add-ons to Numpy, but I wouldn't really consider the implementations in my answer "vectorized": The bulk of the work is done in a external dedicated function.
One other thing that strikes me is the sensitivity to the choice of radix. For most of the settings I tried my implementations were still slower than numpy.sort, so in practice some sort of heuristic would be required to offer good performance across the board.

Can you change this to be a counting / radix sort that works 8 bits at a time? For 32 bit unsigned integers, create a matrix[4][257] of counts of occurrence of byte fields, making one read pass over the array to be sorted. matrix[][0] = 0, matrix[][1] = # of occurences of 0, ... . Then convert the counts into indexes, where matrix[][0] = 0, matrix[][1] = # of bytes == 0, matrix[][2] = # of bytes == 0 + # of bytes == 1, ... . The last count is not used, since that would index the end of the array. Then do 4 passes of radix sort, moving data back and forth between the original array and the output array. Working 16 bits at time would need a matrix[2][65537], but only take 2 passes. Example C code:
size_t mIndex[4][257] = {0}; /* index matrix */
size_t i, j, m;
uint32_t u;
uint32_t *pData; /* ptr to original array */
uint32_t *pTemp; /* ptr to working array */
uint32_t *pSrc; /* working ptr */
uint32_t *pDst; /* working ptr */
/* n is size of array */
for(i = 0; i < n; i++){ /* generate histograms */
u = pData[i];
for(j = 0; j < 4; j++){
mIndex[j][1 + (size_t)(u & 0xff)]++; /* note [1 + ... */
u >>= 8;
}
}
for(j = 0; j < 4; j++){ /* convert to indices */
for(i = 1; i < 257; i++){ /* (last count never used) */
mIndex[j][i] += mIndex[j][i-1]
}
}
pDst = pTemp; /* radix sort */
pSrc = pData;
for(j = 0; j < 4; j++){
for(i = 0; i < count; i++){ /* sort pass */
u = pSrc[i];
m = (size_t)(u >> (j<<3)) & 0xff;
/* pDst[mIndex[j][m]++] = u; split into 2 lines */
pDst[mIndex[j][m]] = u;
mIndex[j][m]++;
}
pTmp = pSrc; /* swap ptrs */
pSrc = pDst;
pDst = pTmp;
}

Working with bytes and binary data in Python

Four consecutive bytes in a byte string together specify some value. However, only 7 bits in each byte are used; the most significant bit is always zero and therefore its ignored (that makes 28 bits altogether). So...
b"\x00\x00\x02\x01"
would be 000 0000 000 0000 000 0010 000 0001.
Or, for the sake of legibility, 10 000 0001. That's the value the four bytes represent. But I want a decimal, so I do this:
>>> 0b100000001
257
I can work all that out myself, but how would I incorporate it into a program?

Use bitshifting and addition:
bytes = b"\x00\x00\x02\x01"
i = 0
for b in bytes:
i <<= 7
i += b # Or use (b & 0x7f) if the last bit might not be zero.
print(i)
Result:
257

Using the bitarray module, you can do it a lot quicker for big numbers:
Benchmarks (factor 2.4x speedup!):
janus#Zeus /tmp % python3 -m timeit -s "import tst" "tst.tst(10000)"
10 loops, best of 3: 251 msec per loop
janus#Zeus /tmp % python3 -m timeit -s "import tst" "tst.tst(100)"
1000 loops, best of 3: 700 usec per loop
janus#Zeus /tmp % python3 -m timeit -s "import sevenbittoint, os" "sevenbittoint.sevenbittoint(os.urandom(10000))"
10 loops, best of 3: 73.7 msec per loop
janus#Zeus /tmp % python3 -m timeit -s "import quick, os" "quick.quick(os.urandom(10000))"
10 loops, best of 3: 179 msec per loop
quick.py (from Mark Byers):
def quick(bites):
i = 0
for b in bites:
i <<= 7
i += (b & 0x7f)
#i += b
return i
sevenbittoint.py:
import bitarray
import functools
def inttobitarray(x):
a = bitarray.bitarray()
a.frombytes(x.to_bytes(1,'big'))
return a
def concatter(accumulator,thisitem):
thisitem.pop(0)
for i in thisitem.tolist():
accumulator.append(i)
return accumulator
def sevenbittoint(bajts):
concatted = functools.reduce(concatter, map(inttobitarray, bajts), bitarray.bitarray())
missingbits = 8 - len(concatted) % 8
for i in range(missingbits): concatted.insert(0,0) # zeropad
return int.from_bytes(concatted.tobytes(), byteorder='big')
def tst():
num = 32768
print(bin(num))
print(sevenbittoint(num.to_bytes(2,'big')))
if __name__ == "__main__":
tst()
tst.py:
import os
import quick
import sevenbittoint
def tst(sz):
bajts = os.urandom(sz)
#for i in range(pow(2,16)):
# if i % pow(2,12) == 0: print(i)
# bajts = i.to_bytes(2, 'big')
a = quick.quick(bajts)
b = sevenbittoint.sevenbittoint(bajts)
if a != b: raise Exception((i, bin(int.from_bytes(bajts,'big')), a, b))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Faster bit-level data packing - python

Related

How to get rid of for loop in my function? [duplicate]

PyCUDA LogicError: cuModuleLoadDataEx failed: an illegal memory access was encountered

Python Ctypes passing pointer for data

vectorized radix sort with numpy - can it beat np.sort?

Working with bytes and binary data in Python

Categories

Resources