I'm having some performance trouble getting data from a byte array into my internal data structure. The data contains several nested arrays and can be extracted with the attached code. In C, reading from a stream takes about one second, but in Python it takes almost one minute. I guess indexing and calling int.from_bytes was not the best idea.
Does anybody have a suggestion for improving the performance?
...
ycnt = int.from_bytes(bytedat[idx:idx + 4], 'little')
idx += 4
while ycnt > 0:
    ky = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    dv = DataObject()
    xvec.update({ky: dv})
    dv.x = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    dv.y = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    cntv = int.from_bytes(bytedat[idx:idx + 4], 'little')
    idx += 4
    while cntv > 0:
        dv.data_values.append(int.from_bytes(bytedat[idx:idx + 4], 'little', signed=True))
        idx += 4
        cntv -= 1
    dv.score = struct.unpack('d', bytedat[idx:idx + 8])[0]
    idx += 8
    ycnt -= 1
...
First, a factor of 60 between Python and C is normal for low-level code like this. This is not where Python shines, because it doesn't get compiled down to machine code.
Micro-Optimizations
The most obvious one is to reduce your integer math by using struct.unpack() properly. See the format string documentation. There are four int fields before the inner loop (and dv must exist before you assign to it), so something like this:
ky, dv.x, dv.y, cntv = struct.unpack('<4i', bytedat[idx:idx + 4*4])
idx += 4*4
The second one is to load your int arrays (if they are large) "in batch" instead of the (interpreted!) while cntv > 0 loop. I would use a numpy array:
numpy.frombuffer(bytedat[idx:idx + 4*cntv], dtype='int32')
Why not a list? A Python list contains (generic) Python objects. Each item requires extra memory and a pointer indirection. Libraries cannot use optimized C code (for example, to calculate the sum) because each item first has to be dereferenced and then type-checked.
A numpy array, on the other hand, is basically a wrapper that manages the memory of a C array. Loading it will probably boil down to a memcpy(), or it may even just reference the bytes memory you passed in.
And thirdly, instead of xvec.update({ky: dv}) you can probably write xvec[ky] = dv. This avoids the creation of a temporary dict object.
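Putting these together, a minimal sketch of the parsing loop (assuming DataObject, xvec, bytedat and idx as in the question; a sketch, not drop-in code):

import struct
import numpy as np

header = struct.Struct('<4i')  # ky, x, y, cntv in a single unpack
(ycnt,) = struct.unpack_from('<i', bytedat, idx)
idx += 4
for _ in range(ycnt):
    ky, x, y, cntv = header.unpack_from(bytedat, idx)
    idx += header.size
    dv = DataObject()
    dv.x, dv.y = x, y
    # Batch-load the int array instead of an interpreted inner loop.
    dv.data_values = np.frombuffer(bytedat, dtype='<i4', count=cntv, offset=idx)
    idx += 4 * cntv
    (dv.score,) = struct.unpack_from('<d', bytedat, idx)
    idx += 8
    xvec[ky] = dv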
Compiling your Python code
There are ways to compile Python (partially) down to machine code (PyPy, Numba, Cython). It's a bit involved, but your original byte-indexing code would then run at C speed.
However, you are filling a Python list and a dict in the inner loop. This is never going to get "C"-like fast because it will have to deal with Python objects and reference counting, even when it gets compiled down to C.
Different file format
The easiest way is to use a data format handled by a fast, specialized library (like numpy, HDF5, Pillow, maybe even pandas).
The pickle module may also help, but only if you can control the writing and everything is trusted, and you mainly care about loading speed.
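For example, if you control the writing side and trust the data, a minimal pickle round-trip (data.pkl is a hypothetical path) might look like this:

import pickle

# Writing side (done once, in a trusted environment)
with open('data.pkl', 'wb') as f:
    pickle.dump(xvec, f, protocol=pickle.HIGHEST_PROTOCOL)

# Reading side: usually much faster than re-parsing a custom byte stream
with open('data.pkl', 'rb') as f:
    xvec = pickle.load(f)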
I do something similar, but big-endian.
I find
(byte1 << 8) | byte2
to be faster than int.from_bytes() and struct.unpack().
I also find pypy3 to be at least 4x faster than python3 for this sort of stuff.
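For illustration, a minimal sketch of that approach for one 16-bit big-endian value (assuming data is a bytes object and idx an offset; not benchmarked here):

def read_u16_be(data, idx):
    # Manual shift-and-or decode; under PyPy this kind of plain
    # integer arithmetic tends to JIT-compile very well.
    return (data[idx] << 8) | data[idx + 1]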
Related
I have a very long string in Python:
x = "12;14;14;14;18;12;17;19" # I only show a small part of it : there are 10 millions of ;
The goal is to transform it into:
y = array([12, 14, 14, 14, 18, 12, 17, 19], dtype=int)
One way to do this is to use array(x.split(";")) or numpy.fromstring.
But both are extremely slow.
Is there a quicker way to do it in Python?
Thank you very much and have a nice day.
String parsing is often slow. Unicode decoding often makes things slower (especially when there are non-ASCII characters) unless it is carefully optimized (hard). CPython is slow, especially loops. Numpy is not really designed to (efficiently) deal with strings. I do not think Numpy can do this faster than fromstring yet. The only solutions I can come up with are Numba, Cython or even basic C extensions. The simplest solution is to use Numba; the fastest is Cython/C extensions.
Unfortunately Numba is very slow for strings/bytes so far (this is an open issue that is not planned to be solved any time soon). Some tricks are needed so that Numba can compute this efficiently: the string needs to be converted to a Numpy array. This means it must first be encoded to a byte array to avoid any variable-sized encoding (like UTF-8). np.frombuffer seems the fastest solution to convert the buffer to a Numpy array. Since the input is a read-only array (unusual, but efficient), the Numba signature is not very easy to read.
Here is the final solution:
import numpy as np
import numba as nb

@nb.njit(nb.int32[::1](nb.types.Array(nb.uint8, 1, 'C', readonly=True)))
def compute(arr):
    sep = ord(';')
    base = ord('0')
    minus = ord('-')
    count = 1
    for c in arr:
        count += c == sep
    res = np.empty(count, np.int32)
    val = 0
    positive = True
    cur = 0
    for c in arr:
        if c != sep and c != minus:
            val = (val * 10) + c - base
        elif c == minus:
            positive = False
        else:
            res[cur] = val if positive else -val
            cur += 1
            val = 0
            positive = True
    if cur < count:
        res[cur] = val if positive else -val
    return res

x = ';'.join(np.random.randint(0, 200, 10_000_000).astype(str))
result = compute(np.frombuffer(x.encode('ascii'), np.uint8))
Note that the Numba solution performs no validity checks, for the sake of performance: it assumes the input is a well-formed sequence of (optionally negative) integers separated by ';'. You must ensure the input is valid. Alternatively, you can perform additional checks in it (at the expense of slower code).
Here are performance results on my machine with an i5-9600KF processor (with Numpy 1.22.4 on Windows):
np.fromstring(x, dtype=np.int32, sep=';'): 8927 ms
np.array(re.split(";", x), dtype=np.int32): 1419 ms
np.array(x.split(";"), dtype=np.int32): 1196 ms
Numba implementation: 78 ms
Numba implementation (without negative numbers): 67 ms
This solution is 114 times faster than np.fromstring and 15 times faster than the fastest alternative (based on split). Note that removing the support for negative numbers makes the Numba function 18% faster. Also note that 10~12% of the time is spent in encode. The rest of the time comes from the main loop in the Numba function. More specifically, the conditionals in the loop are the main source of the slowdown, because they can hardly be predicted by the processor and they prevent the use of fast (SIMD) instructions. This is often why string parsing is slow.
A possible improvement is to use a branchless implementation operating on chunks. Another possible improvement is to compute the chunks using multiple threads. However, both optimizations are tricky to do and they both make the resulting code significantly harder to read (and so to maintain).
I have a script in Python 3.6.8 which reads through a very large text file, where each line is an ASCII string drawn from the alphabet {a,b,c,d,e,f}.
For each line, I have a function which fragments the string using a sliding window of size k, and then increments a fragment counter dictionary fragment_dict by 1 for each fragment seen.
The same fragment_dict is used for the entire file, and it is initialized for all possible 5^k fragments mapping to zero.
I also ignore any fragment which has the character c in it. Note that c is uncommon, and most lines will not contain it at all.
def fragment_string(mystr, fragment_dict, k):
    for i in range(len(mystr) - k + 1):
        fragment = mystr[i:i+k]
        if 'c' in fragment:
            continue
        fragment_dict[fragment] += 1
Because my file is so large, I would like to optimize the performance of the above function as much as possible. Could anyone provide any potential optimizations to make this function faster?
I'm worried I may be rate limited by the speed of Python loops, in which case I would need to consider dropping down into C/Cython.
Numpy may help in speeding up your code:
import collections
import numpy as np

x = np.array([ord(c) - ord('a') for c in mystr])
filter = np.geomspace(1, 5**(k-1), k, dtype=int)
fragment_dict = collections.Counter(np.convolve(x, filter, mode='valid'))
The idea is to represent each length-k segment as a k-digit 5-ary number. Converting the list of 0-5 integers (one per character) to its 5-ary value is then equivalent to applying a convolution with [1, 5, 25, 125, ...] as the filter.
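One caveat: the resulting counter is keyed by integer codes, not fragment strings. A small sketch of decoding a code back into its fragment (assuming the filter above; np.convolve flips the filter, so the first character ends up as the most significant digit):

def decode(code, k):
    # Recover the k digits, least significant first, then reverse.
    chars = []
    for _ in range(k):
        code, digit = divmod(code, 5)
        chars.append(chr(digit + ord('a')))
    return ''.join(reversed(chars))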
I have a 2D cost matrix M, perhaps 400x400, and I'm trying to calculate the optimal path through it. As such, I have a function like:
M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
which is obviously recursive. P1 is some additive constant. My code, which works more or less, is:
def optimalcost(cost, P1=10):
    width1,width2 = cost.shape
    M = array(cost)
    for i in range(0,width1):
        for j in range(0,width2):
            try:
                M[i,j] = M[i,j] + min(M[i-1,j-1],M[i-1,j]+P1,M[i,j-1]+P1)
            except:
                M[i,j] = inf
    return M
Now I know looping in Numpy is a terrible idea, and for things like the calculation of the initial cost matrix I've been able to find shortcuts to cutting the time down. However, as I need to evaluate potentially the entire matrix I'm not sure how else to do it. This takes around 3 seconds per call on my machine and must be applied to around 300 of these cost matrices. I'm not sure where this time comes from, as profiling says the 200,000 calls to min only take 0.1s - maybe memory access?
Is there a way to do this in parallel somehow? I assume there may be, but to me it seems each iteration is dependent unless there's a smarter way to memoize things.
There are parallels to this question: Can I avoid Python loop overhead on dynamic programming with numpy?
I'm happy to switch to C if necessary, but I like the flexibility of Python for rapid testing and the lack of faff with file IO. Off the top of my head, is something like the following code likely to be significantly faster?
#define P1 10

static double dmin(double a, double b){ return a < b ? a : b; }

void optimalcost(double **costin, double **costout){
    /*
       We assume that costout is initially
       filled with costin's values.
    */
    double a, b, c, prevcost;
    int i, j;
    for(i = 1; i < 400; i++){
        prevcost = costout[i][0];
        for(j = 1; j < 400; j++){
            a = prevcost + P1;
            b = costout[i][j-1] + P1;
            c = costout[i-1][j-1];
            costout[i][j] += dmin(a, dmin(b, c));
            prevcost = costout[i][j];
        }
    }
}
Update:
I'm on Mac, and I don't want to install a whole new Python toolchain so I used Homebrew.
> brew install llvm --rtti
> LLVM_CONFIG_PATH=/usr/local/opt/llvm/bin/llvm-config pip install llvmpy
> pip install numba
New "numba'd" code:
from numba import autojit, jit
import time
import numpy as np

@autojit
def cost(left, right):
    height,width = left.shape
    cost = np.zeros((height,width,width))
    for row in range(height):
        for x in range(width):
            for y in range(width):
                cost[row,x,y] = abs(left[row,x]-right[row,y])
    return cost
@autojit
def optimalcosts(initcost):
    costs = np.zeros_like(initcost)
    for row in range(initcost.shape[0]):
        costs[row,:,:] = optimalcost(initcost[row])
    return costs
@autojit
def optimalcost(cost):
    width1,width2 = cost.shape
    P1 = 10
    prevcost = 0.0
    M = np.array(cost)
    for i in range(1,width1):
        for j in range(1,width2):
            M[i,j] += min(M[i-1,j-1],prevcost+P1,M[i,j-1]+P1)
            prevcost = M[i,j]
    return M
prob_size = 400
left = np.random.rand(prob_size,prob_size)
right = np.random.rand(prob_size,prob_size)
print '---------- Numba Time ----------'
t = time.time()
c = cost(left,right)
optimalcost(c[100])
print time.time()-t
print '---------- Native python Time --'
t = time.time()
c = cost.py_func(left,right)
optimalcost.py_func(c[100])
print time.time()-t
It's interesting writing code in Python that is so un-Pythonic. A note for anyone interested in writing Numba code: you need to express the loops explicitly. Before, I had the neat Numpy one-liner,
abs(left[row,:][:,newaxis] - right[row,:])
to calculate the cost. That took around 7 seconds with Numba. Writing out the loops properly gives 0.5s.
It's an unfair comparison to compare it to native Python code, because Numpy can do that pretty quickly, but:
Numba compiled: 0.509318113327s
Native: 172.70626092s
I'm impressed both by the numbers and how utterly simple the conversion is.
If it's not hard for you to switch to the Anaconda distribution of Python, you can try using Numba, which for this particular simple dynamic algorithm would probably offer a lot of speedup without making you leave Python.
Numpy is usually not very good at iterative jobs (though it does have some commonly used iterative functions such as np.cumsum, np.cumprod and parts of np.linalg). But for simple tasks like finding the shortest (or lowest-energy) path above, you can vectorize the problem by thinking about what can be computed at the same time (and by trying to avoid making copies).
Suppose we are finding a shortest path in the "row" direction (i.e. horizontally). We first create our algorithm input:
# The problem, 300 400*400 matrices
# Create infinitely high boundary so that we dont need to handle indexing "-1"
a = np.random.rand(300, 400, 402).astype('f')
a[:,:,::a.shape[2]-1] = np.inf
then prepare some utility arrays which we will use later (creation takes constant time):
# Create self-overlapping view for 3-way minimize
# This is the input in each iteration
# The shape is (400, 300, 400, 3), separately standing for row, batch, column, left-middle-right
A = np.lib.stride_tricks.as_strided(a, (a.shape[1],len(a),a.shape[2]-2,3), (a.strides[1],a.strides[0],a.strides[2],a.strides[2]))
# Create view for output, this is basically for convenience
# The shape is (399, 300, 400). 399 comes from the fact that first row is never modified
B = a[:,1:,1:-1].swapaxes(0, 1)
# Create a temporary array in advance (try to avoid cache miss)
T = np.empty((len(a), a.shape[2]-2), 'f')
and finally do the computation and timeit:
%%timeit
for i in np.arange(a.shape[1]-1):
    A[i].min(2, T)
    B[i] += T
The timing result on my (super old laptop) machine is 1.78 s, which is already way faster than 3 minutes. I believe you can improve it even more (while sticking to numpy) by optimizing the memory layout and alignment (somehow). Or you can simply use multiprocessing.Pool: it is easy to use, and this problem is trivial to split into smaller problems (by dividing along the batch axis), as sketched below.
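A minimal sketch of that route (an assumption here: optimalcost is a function that processes one 400x400 matrix, as in the question; this is not benchmarked code):

import numpy as np
from multiprocessing import Pool

if __name__ == '__main__':  # guard required on platforms that spawn workers
    batch = np.random.rand(300, 400, 400).astype('f')
    with Pool() as pool:
        # Each matrix is independent, so divide along the batch axis.
        results = np.stack(pool.map(optimalcost, list(batch)))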
I am new to Python. I am running the following code and it is giving a memory error with Python 2.7.
Since I am using OpenCV, I am working with Python 2.7. I have read the previous posts, but I am not understanding much from them.
s={}
ns={}
ts={}
for i in range(0,256):             # for red component
    for j in range(0,256):         # for green component
        for k in range(0,256):     # for blue component
            s[(i,j,k)]=0
            ns[(i,j,k)]=0
            ts[(i,j,k)]=i*j*k
Please help. The code tries to store the frequency of the red, green and blue components, and for that I am initializing these values to zero.
Thing 1: use itertools instead of constructing all the range lists each time around the loop. xrange will return an iterator object like range, and product will return an iterator choosing tuples of elements from the given iterable.
Thing 2: use numpy for large data. It's a matrix implementation designed for this sort of thing.
>>> import numpy as np
>>> from itertools import product
>>> x=np.zeros((256,256,256))
>>> for i, j, k in product(xrange(256), repeat=3):
... x[i,j,k]= i*j*k
...
Takes about five seconds for me, and the expected amount of memory.
$ cat /proc/27240/status
Name: python
State: S (sleeping)
...
VmPeak: 420808 kB
VmSize: 289732 kB
Note that you may actually run into system-wide memory limits if you try to allocate three 256*256*256 arrays, since each one has about 17 million entries. Fortunately numpy lets you persist arrays to disk.
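For instance, a minimal sketch with np.save/np.load (the counts.npy filename is just an example; mmap_mode keeps the array on disk and pages it in lazily):

import numpy as np

x = np.zeros((256, 256, 256))
np.save('counts.npy', x)                   # persist the array to disk
x = np.load('counts.npy', mmap_mode='r+')  # map it back in, writable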
Have you come across the PIL (Python Imaging Library)? You may find it helpful.
As a matter of fact, your program needs at least(!) 300*300*300*4*3 bytes solely for the value data of the dicts. Besides, your key tuples occupy 300*300*300*3*3*4 bytes.
This is 1296000000 bytes in total, or 1.2 GiB of data.
This calculation doesn't even include the overhead of maintaining the data in the dicts.
So whether it fails or not depends on the amount of memory your machine has.
As a first step, you could do
s = {}
ns = {}
ts = {}
for i in range(0, 300):
    for j in range(0, 300):
        for k in range(0, 300):
            index = (i, j, k)
            s[index] = j
            ns[index] = k
            ts[index] = i*j*k
which (in theory) will occupy only about half the memory as before - at least for the keys, since the index tuples are created once and then reused by all three dicts.
From what you describe (you want a mere counting), you don't need the full range of combinations to be pre-initialized. So you can omit the initialization shown in the question and instead build a storage where you only keep the values for which you actually have data; these are presumably far fewer than all possible combinations.
You could either use a defaultdict() or imitate its behaviour manually, as I think that most of the combinations are not used in your colour "palette".
from collections import defaultdict
make0 = lambda: 0
s = defaultdict(make0)
ns = defaultdict(make0)
# what is ts? do you need it?
Now you have dict-like objects which can be written to when needed. Then, for every combination of colours which you really have, you can do s[index] += 1 and ns[index] += 1 respectively.
I don't know about your ts - maybe you can calculate it, or you'll have to find a different solution.
Even if all your variables used a single byte, that program would need 405 MB of RAM.
You should use compression to store more in limited space.
Edit: If you want to make a histogram in Python, see this nice example of using the Python Imaging Library (PIL). The hard work is done with these 3 lines:
import Image
img = Image.open(imagepath)
hist = img.histogram()
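img.histogram() returns a single flat list with 256 counts per band concatenated, so for an 8-bit RGB image you can split it like this (a small sketch):

hist = img.histogram()  # 768 values for an RGB image
r, g, b = hist[0:256], hist[256:512], hist[512:768]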
I'm trying to process an RGBA buffer (list of chars), and run "unpremultiply" on each pixel. The algorithm is color_out=color*255/alpha.
This is what I came up with:
def rgba_unpremultiply(data):
    for i in range(0, len(data), 4):
        a = ord(data[i+3])
        if a != 0:
            data[i] = chr(255*ord(data[i])/a)
            data[i+1] = chr(255*ord(data[i+1])/a)
            data[i+2] = chr(255*ord(data[i+2])/a)
    return data
It works, but the performance is a major problem.
I'm wondering, besides writing a C module, what my options are for optimizing this particular function.
This is exactly the kind of code NumPy is great for.
import numpy

def rgba_unpremultiply(data):
    a = numpy.fromstring(data, 'B')    # Treat the string as an array of bytes
    a = a.astype('I')                  # Cast to uints, since temporaries need to be larger than a byte
    alpha = a[3::4]                    # Every 4th element, starting from index 3
    alpha = numpy.where(alpha == 0, 255, alpha)  # Don't modify colors where alpha is 0
    a[0::4] = a[0::4] * 255 // alpha   # Operate on whole slices instead of looping over elements
    a[1::4] = a[1::4] * 255 // alpha
    a[2::4] = a[2::4] * 255 // alpha
    return a.astype('B').tostring()    # Cast back to bytes
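A quick usage sketch (two hypothetical premultiplied pixels, Python 2 style to match the code above):

# Half-transparent red and opaque green, premultiplied
data = ''.join(chr(b) for b in [64, 0, 0, 128, 0, 255, 0, 255])
out = rgba_unpremultiply(data)
print [ord(c) for c in out]  # -> [127, 0, 0, 128, 0, 255, 0, 255]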
How big is data? Assuming this is Python 2.x, try using xrange instead of range so that you don't have to constantly allocate and reallocate a large list.
You could convert all the data to integers up front so that you're not constantly converting to and from characters.
Look into using numpy to vectorize this: Link. I suspect that simply storing the data as integers and using a numpy array will greatly improve the performance.
And another relatively simple thing you could do is write a little Cython:
http://wiki.cython.org/examples/mandelbrot
Basically Cython will compile your above function into C code with just a few lines of type hints. It greatly reduces the barrier to writing a C extension.
I don't have a concrete answer, but some useful pointers might be:
Python's array module (see the small sketch after this list)
numpy
OpenCV if you have actual image data
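For the first pointer, a minimal sketch with the array module (assuming data is the RGBA string from the question; 'B' means unsigned byte, Python 2 style):

from array import array

buf = array('B', data)  # compact byte storage, mutable, no per-item objects
for i in xrange(0, len(buf), 4):
    a = buf[i+3]
    if a != 0:
        buf[i] = 255 * buf[i] // a
        buf[i+1] = 255 * buf[i+1] // a
        buf[i+2] = 255 * buf[i+2] // a
data = buf.tostring()  # back to a str (Python 2)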
There are some minor things you can do, but I do not think you can improve a lot.
Anyway, here's some hint:
def rgba_unpremultiply(data):
    # xrange() is more performant than range(); it does not precalculate the whole list
    for i in xrange(0, len(data), 4):
        a = ord(data[i+3])
        if a != 0:
            # Not sure about this, but maybe (c << 8) - c is faster than c*255,
            # so you could arrange the code to do that.
            # Check for an actual performance improvement.
            data[i] = chr(((ord(data[i]) << 8) - ord(data[i]))/a)
            data[i+1] = chr(255*ord(data[i+1])/a)
            data[i+2] = chr(255*ord(data[i+2])/a)
    return data
I've just run a quick dummy benchmark of << vs *, and there doesn't seem to be a measurable difference, but you can do a better evaluation on your project.
Anyway, a C module may be a good option, even though the problem does not seem to be language-related.