np.matmul with large integer matrix without conversion/copy

import numpy as np
v = np.zeros((3,10000), dtype=np.float32)
mat = np.zeros((10000,10000000), dtype=np.int8)
w = np.matmul(v, mat)
yields
Traceback (most recent call last):
File "int_mul_test.py", line 6, in <module>
w = np.matmul(v, mat)
numpy.core._exceptions.MemoryError: Unable to allocate 373. GiB
for an array with shape (10000, 10000000) and data type float32
Apparently, numpy is trying to convert my 10k x 10m int8 matrix to dtype float32. Why does it need to do this? It seems extremely wasteful, and if matrix multiplication must work with float numbers in memory, it could convert say 1m columns at a time (which shouldn't sacrifice speed too much), instead of converting all 10m columns all at once.
My current solution is to use a loop to break the matrix into 10 pieces and reduce temporary memory allocation to 1/10 of the 373 GiB:
w = np.empty((v.shape[0], mat.shape[1]), dtype=np.float32)
start = 0
block = 1000000
for i in range(mat.shape[1] // block):
    end = start + block
    w[:, start:end] = np.matmul(v, mat[:, start:end])
    start = end
w[:, start:] = np.matmul(v, mat[:, start:])
# runs in 396 seconds
Is there a numpy-idiomatic way to multiply "piece by piece" without manually coding a loop?

The semantics of NumPy operations force the inputs of a binary operation to be cast when the left and right types differ. This is the case in almost all statically typed languages, including C, C++, Java and Rust, and also in many dynamically typed languages (where the rules are applied at runtime). Python (partially) applies such well-defined rules too. For example, when you evaluate the expression True * 1.7, the interpreter looks at the types of both operands (bool and float here) and applies conversion rules until both operands have the same type before performing the actual multiplication. In this case, True of type bool is cast to 1 of type int, which is then cast to 1.0 of type float. Such rules are generally defined so as to be both relatively unambiguous and safe. For example, you do not expect 2 * 1.7 to be equal to 3. NumPy uses rules similar to those of the C language because it is written in C and provides native types. The rules should be defined independently of any given implementation. That being said, performance and ease of use matter a lot when designing them. Unfortunately, in your case, this means a huge array has to be allocated.
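To make the promotion rule concrete, here is a minimal sketch using tiny arrays. The shapes are made up for illustration; only the dtypes match the question.

```python
import numpy as np

# Plain Python: the bool operand is promoted before the multiplication.
print(True * 1.7)  # 1.7

# NumPy: the common type of float32 and int8 is float32.
common = np.result_type(np.float32, np.int8)
print(common)  # float32

# A mixed matmul therefore produces float32 output
# (tiny illustrative shapes, not the 10k x 10M case from the question).
v = np.ones((3, 4), dtype=np.float32)
mat = np.ones((4, 5), dtype=np.int8)
w = np.matmul(v, mat)
print(w.dtype)  # float32
```

np.result_type only reports the type the operands are promoted to; it does not show where or when the temporary copy is made.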
Note that NumPy could theoretically bypass casting and implement the N * N possible versions for the N different types of each binary operation in order to make them faster (in the spirit of the "as-if" rule of the C language). However, this would be insane for developers to implement, and it would result in more bug-prone code (i.e. less stable and slower development) and huge code bloat (bigger binaries). This is especially true since other parameters have to be taken into account, such as the shape of the array, the memory layout (e.g. alignment) or even the target architecture. The current main casting generative function of NumPy is already quite complex and already results in 4 x 18 x 18 = 1296 different C functions being compiled and stored in the NumPy binaries!
In your case, you can use Cython or Numba to generate a memory-efficient (and possibly faster) implementation dedicated to your specific needs. Be careful about possible overflows though.
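As a small illustration of the overflow warning (a sketch with made-up values, not the Cython/Numba kernel itself): if both operands stay int8, the accumulation is also done in int8 and silently wraps around.

```python
import numpy as np

# Illustrative values: each product is 100 * 100 = 10000, far beyond the int8 range.
a = np.full((1, 4), 100, dtype=np.int8)
b = np.full((4, 1), 100, dtype=np.int8)

# int8 @ int8 stays int8, so the true result 4 * 10000 = 40000 cannot be represented.
wrapped = a @ b
print(wrapped.dtype, wrapped[0, 0])

# Widening the operands first gives the exact value, at the cost of larger temporaries.
exact = a.astype(np.int32) @ b.astype(np.int32)
print(exact.dtype, exact[0, 0])  # int32 40000
```

A hand-written kernel has to pick its accumulator type with the same care, for example by accumulating in float32 or int32 rather than in the input type.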

Related

Python finite field matrix exponentiation

Is there a simple way to calculate (especially powers/exponentiation) with matrices whose elements are integers from a finite field, or at least arbitrary-precision integer matrices with support for the % operator?
For example let's say we have a matrix
A = 1 1
    1 0
and want to compute something like (A**100) % 1000, how to achieve this?
I have tried numpy, but problem is that it uses fixed precision data types so it overflows quickly... Then I tried sympy since it supports arbitrary integer precision, but it does not seem to have support for finite fields operations (except for inverse)...

It might be overkill, but Sage has everything you want (and much more). It is Python-based software, but it is very large (~1.2GB download). You can use sage --preparse script.sage to create a Python file.
There is even an SO-like Q&A site, https://ask.sagemath.org/questions/, which specializes in Sage.
An example of your code might be:
m = Matrix(GF(5), [[1, 1], [1, 0]])
power = m^100
I have used GF(5) because 1000 is not a prime power, so GF(1000) is not a finite field. There are also some differences from plain Python; for instance, exponentiation can be written either as x**y or, equivalently, as x^y.
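If installing Sage is too heavy, the literal computation from the question, (A**100) % 1000, can also be sketched in plain Python, since Python integers have arbitrary precision. Note that this works in the integers modulo 1000 (a ring, not a field). The helper names mat_mul_mod and mat_pow_mod are made up for this sketch.

```python
def mat_mul_mod(a, b, mod):
    """Multiply two square matrices (lists of lists of int), reducing entries mod `mod`."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) % mod
             for j in range(n)]
            for i in range(n)]


def mat_pow_mod(m, exponent, mod):
    """Compute (m ** exponent) % mod by repeated squaring; exponent must be >= 0."""
    n = len(m)
    # Start from the identity matrix.
    result = [[int(i == j) % mod for j in range(n)] for i in range(n)]
    base = [[x % mod for x in row] for row in m]
    while exponent > 0:
        if exponent & 1:
            result = mat_mul_mod(result, base, mod)
        base = mat_mul_mod(base, base, mod)
        exponent >>= 1
    return result


A = [[1, 1], [1, 0]]
print(mat_pow_mod(A, 100, 1000))  # [[101, 75], [75, 26]]
```

Reducing after every multiplication keeps the entries below the modulus, so this would also work with fixed-width integers as long as the modulus is small enough. Repeated squaring needs only about log2(exponent) matrix products.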

How can numpy be so much faster than my Fortran routine?

I get a 512^3 array representing a Temperature distribution from a simulation (written in Fortran). The array is stored in a binary file that's about 1/2G in size. I need to know the minimum, maximum and mean of this array and as I will soon need to understand Fortran code anyway, I decided to give it a go and came up with the following very easy routine.
integer gridsize,unit,j
real mini,maxi
double precision mean
gridsize=512
unit=40
open(unit=unit,file='T.out',status='old',access='stream',&
form='unformatted',action='read')
read(unit=unit) tmp
mini=tmp
maxi=tmp
mean=tmp
do j=2,gridsize**3
  read(unit=unit) tmp
  if(tmp>maxi)then
    maxi=tmp
  elseif(tmp<mini)then
    mini=tmp
  end if
  mean=mean+tmp
end do
mean=mean/gridsize**3
close(unit=unit)
This takes about 25 seconds per file on the machine I use. That struck me as being rather long and so I went ahead and did the following in Python:
import numpy
mmap=numpy.memmap('T.out',dtype='float32',mode='r',offset=4,\
shape=(512,512,512),order='F')
mini=numpy.amin(mmap)
maxi=numpy.amax(mmap)
mean=numpy.mean(mmap)
Now, I expected this to be faster of course, but I was really blown away. It takes less than a second under identical conditions. The mean deviates from the one my Fortran routine finds (which I also ran with 128-bit floats, so I somehow trust it more) but only on the 7th significant digit or so.
How can numpy be so fast? I mean you have to look at every entry of an array to find these values, right? Am I doing something very stupid in my Fortran routine for it to take so much longer?
EDIT:
To answer the questions in the comments:
Yes, also I ran the Fortran routine with 32-bit and 64-bit floats but it had no impact on performance.
I used iso_fortran_env which provides 128-bit floats.
Using 32-bit floats my mean is off quite a bit though, so precision is really an issue.
I ran both routines on different files in different order, so the caching should have been fair in the comparison, I guess?
I actually tried OpenMP, but to read from the file at different positions at the same time. Having read your comments and answers this sounds really stupid now, and it made the routine take a lot longer as well. I might give it a try on the array operations, but maybe that won't even be necessary.
The files are actually 1/2G in size, that was a typo, Thanks.
I will try the array implementation now.
EDIT 2:
I implemented what @Alexander Vogt and @casey suggested in their answers, and it is as fast as numpy, but now I have a precision problem, as @Luaan pointed out I might get. Using a 32-bit float array, the mean computed by sum is 20% off. Doing
...
real,allocatable :: tmp (:,:,:)
double precision,allocatable :: tmp2(:,:,:)
...
tmp2=tmp
mean=sum(tmp2)/size(tmp)
...
Solves the issue but increases computing time (not by very much, but noticeably).
Is there a better way to get around this issue? I couldn't find a way to read singles from the file directly to doubles.
And how does numpy avoid this?
Thanks for all the help so far.

Your Fortran implementation suffers from two major shortcomings:
You mix IO and computations (and read from the file entry by entry).
You don't use vector/matrix operations.
This implementation does perform the same operation as yours and is faster by a factor of 20 on my machine:
program test
integer gridsize,unit
real mini,maxi,mean
real, allocatable :: tmp (:,:,:)
gridsize=512
unit=40
allocate( tmp(gridsize, gridsize, gridsize))
open(unit=unit,file='T.out',status='old',access='stream',&
form='unformatted',action='read')
read(unit=unit) tmp
close(unit=unit)
mini = minval(tmp)
maxi = maxval(tmp)
mean = sum(tmp)/gridsize**3
print *, mini, maxi, mean
end program
The idea is to read in the whole file into one array tmp in one go. Then, I can use the functions MAXVAL, MINVAL, and SUM on the array directly.
For the accuracy issue: Simply using double precision values and doing the conversion on the fly as
mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
only marginally increases the calculation time. I tried performing the operation element-wise and in slices, but that only increased the required time at the default optimization level.
At -O3, the element-wise addition performs ~3 % better than the array operation. The difference between double and single precision operations is less than 2% on my machine - on average (the individual runs deviate by far more).
Here is a very fast implementation using LAPACK:
program test
integer gridsize,unit, i, j
real mini,maxi
integer :: t1, t2, rate
real, allocatable :: tmp (:,:,:)
real, allocatable :: work(:)
! double precision :: mean
real :: mean
real :: slange
call system_clock(count_rate=rate)
call system_clock(t1)
gridsize=512
unit=40
allocate( tmp(gridsize, gridsize, gridsize), work(gridsize))
open(unit=unit,file='T.out',status='old',access='stream',&
form='unformatted',action='read')
read(unit=unit) tmp
close(unit=unit)
mini = minval(tmp)
maxi = maxval(tmp)
! mean = sum(tmp)/gridsize**3
! mean = sum(real(tmp, kind=kind(1.d0)))/real(gridsize**3, kind=kind(1.d0))
mean = 0.d0
do j=1,gridsize
  do i=1,gridsize
    mean = mean + slange('1', gridsize, 1, tmp(:,i,j),gridsize, work)
  enddo !i
enddo !j
mean = mean / gridsize**3
print *, mini, maxi, mean
call system_clock(t2)
print *,real(t2-t1)/real(rate)
end program
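This uses the single-precision matrix 1-norm SLANGE on matrix columns. The run time is even shorter than with the approach using single-precision array functions, and it does not show the precision issue. Note that the 1-norm sums absolute values, so this only equals the plain sum when all entries are non-negative (which should hold for temperatures in kelvin).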

NumPy is faster because you wrote much more efficient code in Python (and much of the NumPy backend is written in optimized Fortran and C) and terribly inefficient code in Fortran.
Look at your python code. You load the entire array at once and then call functions that can operate on an array.
Look at your fortran code. You read one value at a time and do some branching logic with it.
The majority of your discrepancy is the fragmented IO you have written in Fortran.
You can write the Fortran just about the same way as you wrote the python and you'll find it runs much faster that way.
program test
implicit none
integer :: gridsize, unit
real :: mini, maxi, mean
real, allocatable :: array(:,:,:)
gridsize=512
allocate(array(gridsize,gridsize,gridsize))
unit=40
open(unit=unit, file='T.out', status='old', access='stream',&
form='unformatted', action='read')
read(unit) array
maxi = maxval(array)
mini = minval(array)
mean = sum(array)/size(array)
close(unit)
end program test
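On the numpy side of the precision question: np.mean accepts a dtype argument that sets the accumulator type, and the NumPy documentation for np.sum describes pairwise summation being used in many cases, which limits round-off even in float32. A minimal sketch, with made-up stand-in data rather than the T.out file:

```python
import numpy as np

# Hypothetical stand-in for the data: one million float32 copies of 0.1.
a = np.full(1_000_000, 0.1, dtype=np.float32)

# Naive left-to-right accumulation in float32 (cumsum adds sequentially),
# similar in spirit to summing into a single-precision variable in a loop.
naive_mean = float(np.cumsum(a)[-1]) / a.size

# Default float32 mean as computed by NumPy.
default_mean = float(np.mean(a))

# Same data, but accumulated in float64 via the dtype argument.
wide_mean = float(np.mean(a, dtype=np.float64))

print(naive_mean, default_mean, wide_mean)
```

With the memmap from the question, the float64 variant would be numpy.mean(mmap, dtype=numpy.float64). How much it costs in time depends on the machine and on the memory layout.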

numpy.dot -> MemoryError, my_dot -> very slow, but works. Why?

I am trying to compute the dot product of two numpy arrays sized respectively (162225, 10000) and (10000, 100). However, if I call numpy.dot(A, B) a MemoryError happens.
I, then, tried to write my implementation:
def slower_dot(A, B):
    """Low-memory implementation of dot product"""
    # Assuming A and B are of the right type and size
    R = np.empty([A.shape[0], B.shape[1]])
    for i in range(A.shape[0]):
        for j in range(B.shape[1]):
            R[i, j] = np.dot(A[i, :], B[:, j])
    return R
and it works just fine, but is of course very slow. Any idea of 1) what is the reason behind this behaviour and 2) how I could circumvent / solve the problem?
I am using Python 3.4.2 (64bit) and Numpy 1.9.1 on a 64bit equipped computer with 16GB of ram running Ubuntu 14.10.

The reason you're getting a memory error is probably because numpy is trying to copy one or both arrays inside the call to dot. For small to medium arrays this is often the most efficient option, but for large arrays you'll need to micro-manage numpy in order to avoid the memory error. Your slower_dot function is slow largely because of the python function call overhead, which you suffer 162225 x 100 times. Here is one common way of dealing with this kind of situation when you want to balance memory and performance limitations.
import numpy as np

def chunking_dot(big_matrix, small_matrix, chunk_size=100):
    # Make a copy if the array is not already contiguous
    small_matrix = np.ascontiguousarray(small_matrix)
    R = np.empty((big_matrix.shape[0], small_matrix.shape[1]))
    for i in range(0, R.shape[0], chunk_size):
        end = i + chunk_size
        R[i:end] = np.dot(big_matrix[i:end], small_matrix)
    return R
You'll want to pick the chunk_size that works best for your specific array sizes. Typically larger chunk sizes will be faster as long as everything fits in memory.

I think the problem starts with the matrix A itself, as a 162225 x 10000 matrix already occupies about 12GB of memory if each element is a double-precision floating-point number. That, together with how numpy creates temporary copies to do the dot operation, will cause the error. The extra copies are made because numpy uses the underlying BLAS operations for dot, which need the matrices to be stored in contiguous C order.
Check out these links if you want more discussions about improving dot performance
http://wiki.scipy.org/PerformanceTips
Speeding up numpy.dot
https://github.com/numpy/numpy/pull/2730

Is there a way to view how much memory a SciPy matrix used?

I know in python it's hard to see the memory usage of an object.
Is it easier to do this for SciPy objects (for example, sparse matrix)?

You can use array.itemsize (the size of the contained type in bytes) and array.size (the number of elements) to obtain the size of the data:
# a is your array
nbytes = a.itemsize * a.size
It's not the exact value, as it ignores the whole array infrastructure, but for big arrays it's the value that matters (and I guess you care because you have something big).
If you want to use it on a sparse array you have to modify it, as a sparse matrix doesn't have the itemsize attribute. You have to access the dtype and get the itemsize from it:
nbytes = a.dtype.itemsize * a.size
In general I don't think it's easy to evaluate the real memory occupied by a Python object... the numpy array is an exception, being just a thin layer over a C array.
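For a dense NumPy array, the calculation above can be checked against the built-in nbytes attribute. A short sketch, with an array shape chosen only for illustration:

```python
import numpy as np

# Illustrative array: one million float64 values.
a = np.zeros((1000, 1000), dtype=np.float64)

data_bytes = a.itemsize * a.size  # 8 bytes per element * 1_000_000 elements
print(data_bytes)                 # 8000000
print(a.nbytes)                   # 8000000, the same number via the shortcut attribute
```

Both numbers count only the data buffer, not the small per-object overhead of the array itself.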

If you are inside IPython, you can also use its %whos magic function, which gives you information about the session's variables and includes how much RAM each takes.

Python list vs. array – when to use?

If you are creating a 1d array, you can implement it as a list, or else use the 'array' module in the standard library. I have always used lists for 1d arrays.
What is the reason or circumstance where I would want to use the array module instead?
Is it for performance and memory optimization, or am I missing something obvious?

Basically, Python lists are very flexible and can hold completely heterogeneous, arbitrary data, and they can be appended to very efficiently, in amortized constant time. If you need to shrink and grow your list time-efficiently and without hassle, they are the way to go. But they use a lot more space than C arrays, in part because each item in the list requires the construction of an individual Python object, even for data that could be represented with simple C types (e.g. float or uint64_t).
The array.array type, on the other hand, is just a thin wrapper on C arrays. It can hold only homogeneous data (that is to say, all of the same type) and so it uses only sizeof(one object) * length bytes of memory. Mostly, you should use it when you need to expose a C array to an extension or a system call (for example, ioctl or fcntl).
array.array is also a reasonable way to represent a mutable string in Python 2.x (array('B', bytes)). However, Python 2.6+ and 3.x offer a mutable byte string as bytearray.
However, if you want to do math on a homogeneous array of numeric data, then you're much better off using NumPy, which can automatically vectorize operations on complex multi-dimensional arrays.
To make a long story short: array.array is useful when you need a homogeneous C array of data for reasons other than doing math.
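A rough way to see the memory difference described above, as a sketch: the exact numbers depend on the Python build, and the list estimate counts each float object separately.

```python
import sys
from array import array

n = 100_000
as_list = [float(i) for i in range(n)]
as_array = array('d', as_list)  # 'd' = C double

# array.array stores raw C values: itemsize * length bytes of payload.
payload = as_array.itemsize * len(as_array)
print(as_array.itemsize, payload)

# The list stores pointers, and every float is a separate Python object.
list_total = sys.getsizeof(as_list) + sum(sys.getsizeof(x) for x in as_list)
print(list_total)

# The raw buffer can be handed to binary I/O or C code as-is.
raw = as_array.tobytes()
print(len(raw) == payload)  # True
```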

For almost all cases the normal list is the right choice. The array module is more like a thin wrapper over C arrays, which gives you a kind of strongly typed container (see docs), with access to more C-like types such as signed/unsigned short or double, which are not part of the built-in types. I'd say use the array module only if you really need it; in all other cases stick with lists.

The array module is kind of one of those things that you probably don't have a need for if you don't know why you would use it (and take note that I'm not trying to say that in a condescending manner!). Most of the time, the array module is used to interface with C code. To give you a more direct answer to your question about performance:
Arrays are more efficient than lists for some uses. If you need to allocate an array that you KNOW will not change, then arrays can be faster and use less memory. GvR has an optimization anecdote in which the array module comes out to be the winner (long read, but worth it).
On the other hand, part of the reason why lists eat up more memory than arrays is because python will allocate a few extra elements when all allocated elements get used. This means that appending items to lists is faster. So if you plan on adding items, a list is the way to go.
TL;DR I'd only use an array if you had an exceptional optimization need or you need to interface with C code (and can't use pyrex).

It's a trade-off!
Pros of each one:
list:
  flexible
  can be heterogeneous
array (e.g. numpy array):
  array of uniform values
  homogeneous
  compact (in size)
  efficient (functionality and speed)
  convenient

My understanding is that arrays are stored more efficiently (i.e. as contiguous blocks of memory vs. pointers to Python objects), but I am not aware of any performance benefit. Additionally, with arrays you must store primitives of the same type, whereas lists can store anything.

The standard library arrays are useful for binary I/O, such as translating a list of ints to a string to write to, say, a wave file. That said, as many have already noted, if you're going to do any real work then you should consider using NumPy.

With regard to performance, here are some numbers comparing python lists, arrays and numpy arrays (all with Python 3.7 on a 2017 Macbook Pro).
The end result is that the python list is fastest for these operations.
import timeit
import numpy as np
# Python list with append()
np.mean(timeit.repeat(setup="a = []", stmt="a.append(1.0)", number=1000, repeat=5000)) * 1000
# 0.054 +/- 0.025 msec
# Python array with append()
np.mean(timeit.repeat(setup="import array; a = array.array('f')", stmt="a.append(1.0)", number=1000, repeat=5000)) * 1000
# 0.104 +/- 0.025 msec
# Numpy array with append()
np.mean(timeit.repeat(setup="import numpy as np; a = np.array([])", stmt="np.append(a, [1.0])", number=1000, repeat=5000)) * 1000
# 5.183 +/- 0.950 msec
# Python list using +=
np.mean(timeit.repeat(setup="a = []", stmt="a += [1.0]", number=1000, repeat=5000)) * 1000
# 0.062 +/- 0.021 msec
# Python array using +=
np.mean(timeit.repeat(setup="import array; a = array.array('f')", stmt="a += array.array('f', [1.0]) ", number=1000, repeat=5000)) * 1000
# 0.289 +/- 0.043 msec
# Python list using extend()
np.mean(timeit.repeat(setup="a = []", stmt="a.extend([1.0])", number=1000, repeat=5000)) * 1000
# 0.083 +/- 0.020 msec
# Python array using extend()
np.mean(timeit.repeat(setup="import array; a = array.array('f')", stmt="a.extend([1.0]) ", number=1000, repeat=5000)) * 1000
# 0.169 +/- 0.034 msec

If you're going to be using arrays, consider the numpy or scipy packages, which give you arrays with a lot more flexibility.

This answer will sum up almost all the queries about when to use List and Array:
The main difference between these two data types is the operations you can perform on them. For example, you can divide an array by 3 and it will divide each element of the array by 3. The same cannot be done with a list.
The list is part of Python's syntax, so it doesn't need to be declared, whereas you have to declare the array before using it.
You can store values of different data types in a list (heterogeneous), whereas in an array you can only store values of the same data type (homogeneous).
Arrays are rich in functionality and fast, so they are widely used for arithmetic operations and for storing large amounts of data, compared to lists.
Arrays take less memory compared to lists.

Arrays can only be used for specific types, whereas lists can be used for any object.
Arrays can also only hold data of one type, whereas a list can have entries of various object types.
Arrays are also more efficient for some numerical computation.

An important difference between numpy array and list is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.
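A quick sketch of that difference, using small made-up data:

```python
import numpy as np

arr = np.arange(5)
view = arr[1:4]  # a view: shares memory with arr
view[0] = 99
print(arr)       # [ 0 99  2  3  4]

lst = list(range(5))
part = lst[1:4]  # a new list: an independent (shallow) copy
part[0] = 99
print(lst)       # [0, 1, 2, 3, 4]

# An explicit copy detaches the slice from the source array.
independent = arr[1:4].copy()
independent[0] = -1
print(arr[1])    # 99
```

np.shares_memory(arr, view) can be used to check whether two arrays overlap in memory.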
