I have a huge CSV file that cannot be loaded into memory. Transforming it to libsvm format may save some memory.
There are many NaN values in the CSV file. If I read it line by line and store the values as an np.array, with np.nan for NULL, will the array still occupy too much memory?
Does np.nan in an array also occupy memory?
When working with floating point representations of numbers, non-numeric values (NaN and inf) are also represented by a specific binary pattern occupying the same number of bits as any numeric floating point value. Therefore, NaNs occupy the same amount of memory as any other number in the array.
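A quick way to see this (a minimal check; the exact numbers assume the default float64 dtype):
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.0, np.nan, 3.0])

# nbytes counts only the data buffer; both arrays hold three 8-byte float64 values
print(a.nbytes, b.nbytes)  # 24 24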
As far as I know, yes: NaN and zero values occupy the same memory as any other value. However, you can address your problem in other ways:
Have you tried using a sparse vector? They are intended for vectors with a lot of zero values, and their memory consumption is optimized for that case.
SVM Module Scipy
Sparse matrices Scipy
There you have some info about SVM and sparse matrices; if you have further questions, just ask.
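For example, a minimal sketch of the idea (this assumes your missing values can be treated as zeros, since sparse formats skip zeros, not NaNs):
import numpy as np
from scipy import sparse

dense = np.array([[0.0, np.nan, 0.0],
                  [5.0, 0.0, np.nan]])
dense = np.nan_to_num(dense)   # sparse formats only skip zeros, so map NaN -> 0
m = sparse.csr_matrix(dense)
print(m.data.nbytes)           # only the single non-zero value (5.0) is stored: 8 bytes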
Edited to provide an answer as well as a solution
According to the getsizeof() function from the sys module, it does. A simple and quick example:
import sys
import numpy as np
x = np.array([1,2,3])
y = np.array([1,np.nan,3])
x_size = sys.getsizeof(x)
y_size = sys.getsizeof(y)
print(x_size)
print(y_size)
print(y_size == x_size)
This should print out
120
120
True
So my conclusion is that a NaN uses as much memory as a normal entry.
Instead you could use sparse matrices (scipy.sparse), which do not store zero/null entries at all and are therefore more memory efficient. Note, however, that SciPy strongly discourages using NumPy methods directly on sparse matrices (https://docs.scipy.org/doc/scipy/reference/sparse.html), since NumPy might not interpret them correctly.
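In practice that means preferring the sparse object's own methods over NumPy functions; a small sketch:
import numpy as np
from scipy import sparse

A = sparse.csr_matrix(np.eye(4))
v = np.arange(4.0)

print(A.dot(v))  # use the sparse matrix's own dot, not np.dot
print(A.sum())   # likewise for reductions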
Related
I am trying to build a matrix consisting of Kronecker products:
def kron_sparse_2(a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p):
    kron = sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(sparse.kron(a,b),c),d),e),f),g),h),i),j),k),l),m),n),o),p)
    return kron

res = 0
for i in sd:
    res = res + kron_sparse_2(i, i, I, I, I, I, I, I, I, I, I, I, I, I, I, I)
The i's in sd are 2x2 matrices.
Is there anything I can do further to calculate this without the memory problem?
The error I get is: MemoryError: Unable to allocate 16.0 GiB for an array with shape (536870912, 2, 2) and data type float64
If I understand correctly, you are trying to form the Hamiltonian for some spin problem, in which case you should be able to go up to about 20 spins with ease (if so, using np.roll and reduce can also help you rewrite your methods more efficiently). The main suggestion is to convert all of your matrices, even the 2x2 ones, to a sparse format (say CSR or CSC) and to call the scipy kron function with its format argument set to the sparse format you used to construct your matrices. As far as I remember, kron(format=None) uses an explicit (dense) representation of the matrices, which is what causes the memory problems; try format='csc', for instance.
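For illustration only (sx here is just a placeholder 2x2 matrix, not your actual operator), a sketch that keeps every factor and every intermediate product in CSC format and uses functools.reduce instead of the deeply nested call:
import numpy as np
from functools import reduce
from scipy import sparse

I = sparse.identity(2, format='csc')
sx = sparse.csc_matrix(np.array([[0.0, 1.0],
                                 [1.0, 0.0]]))  # example 2x2 matrix

def kron_chain(mats):
    # keep every intermediate product in CSC format to avoid dense blow-up
    return reduce(lambda a, b: sparse.kron(a, b, format='csc'), mats)

res = kron_chain([sx, sx] + [I] * 14)  # 16 factors in total
print(res.shape, res.nnz)              # (65536, 65536) with only 65536 stored entries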
I'm trying to use numpy to element-wise square an array. I've noticed that some of the values appear as negative numbers. The squared value isn't near the max int limit. Does anyone know why this is happening and how I could fix it? I'd rather avoid using a for loop to square an array element-wise, since my data set is quite large.
Here's an example of what is happening:
import numpy as np
test = [1, 2, 47852]
sq = np.array(test)**2
print(sq)
print(47852*47852)
Output:
[          1           4 -2005153392]
2289813904
This is because NumPy doesn't check for integer overflow - likely because that would slow down every integer operation, and NumPy is designed with efficiency in mind. So when you have an array of 32-bit integers and a result does not fit in 32 bits, it simply wraps around and is still interpreted as a 32-bit integer, giving you the strange negative value.
To avoid this, you can be mindful of the dtype you need to perform the operation safely, in this case 'int64' would suffice.
>>> np.array(test, dtype='int64')**2
array([         1,          4, 2289813904])
You aren't seeing the same issue with Python ints because Python checks for overflow and promotes to a larger integer type as needed. If I recall correctly, there was a question about this on the mailing list, and the response was that doing the same in NumPy would have a large performance impact on elementwise array operations.
As for why your default integer type may be 32-bit on a 64-bit system: as Goyo answered on a related question, the default integer type np.int_ is the same as a C long, which is platform dependent and can be 32 bits.
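If you want to check what the default integer type is on your platform, and why 47852**2 overflows a 32-bit one, something like this works:
import numpy as np

print(np.array([1]).dtype)     # the default integer dtype on this platform
print(np.iinfo(np.int32).max)  # 2147483647 -- 47852**2 = 2289813904 exceeds this
print(np.iinfo(np.int64).max)  # 9223372036854775807 -- plenty of headroom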
I am trying to read a 2D array of floats from a binary file with Python. The files were written in big-endian byte order by a Fortran program (it is the intermediate file of the Weather Research and Forecasting model). I already know the dimension sizes of the array to read (nx and ny), but as a Fortran and IDL programmer I am completely lost as to how to manage this in Python. (Later on I want to visualize the array.)
Shall I use struct.unpack or numpy.fromfile or the array module?
Do I have to read a vector first and reshape it afterwards? (I have seen this option only for the NumPy way.)
How do I define a 2D array with NumPy, and how do I define the dtype to read with big-endian byte ordering?
Is there an issue with array ordering (column-wise or row-wise) to take into account?
Short answers per sub-question:
I don't think the array module has a way to specify endianness.

Between the struct module and NumPy, I think NumPy is easier to use, especially for Fortran-like ordered arrays.

All data is inherently 1-dimensional as far as the hardware (disk, RAM, etc.) is concerned, so yes, reshaping to get a 2D representation is always necessary. With numpy.fromfile the reshape must happen explicitly afterwards, but numpy.memmap provides a way to reshape more implicitly.

The easiest way to specify endianness with NumPy is to use a short type string, actually very similar to the approach needed for the struct module. In NumPy, >f and >f4 specify single-precision and >d and >f8 double-precision big-endian floating point.

Your binary file could walk the array along the rows (C-like) or along the columns (Fortran-like). Whichever of the two it is, this has to be taken into account to represent the data properly. NumPy makes this easy with the order keyword argument of reshape and memmap (among others).
All in all, the code could be for example:
import numpy as np
filename = 'somethingsomething'
with open(filename, 'rb') as f:
    nx, ny = ...  # parse; advance file-pointer to data segment
    data = np.fromfile(f, dtype='>f8', count=nx*ny)
    array = np.reshape(data, [nx, ny], order='F')
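For completeness, the memmap route mentioned above might look roughly like this (a sketch; header_bytes is a hypothetical placeholder for whatever precedes the data segment in your file):
import numpy as np

filename = 'somethingsomething'
nx, ny = 100, 200    # placeholder dimensions
header_bytes = 0     # hypothetical offset of the data segment

# big-endian float64, Fortran (column-major) ordering, read-only mapping
array = np.memmap(filename, dtype='>f8', mode='r',
                  offset=header_bytes, shape=(nx, ny), order='F')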
scipy.random.rand() and other functions in the same package all produce arrays of float64 as output (at least for Python 2.7.3 64-bit on Mac OS, SciPy version 0.12.0). What I want is a rather large (N gigabytes) randomly initialized matrix of float32. Is there an easy way to produce one directly, rather than allocating double the space for float64 and then converting down to 32 bits?
I would preallocate the array, then copy in batches of random float64s as Warren Weckesser recommends in the comments.
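A minimal sketch of that batched approach (the array size and batch size below are arbitrary):
import numpy as np

n = 10000000                          # total number of floats wanted
out = np.empty(n, dtype=np.float32)   # preallocate the float32 buffer

batch = 1000000
for start in range(0, n, batch):
    stop = min(start + batch, n)
    # generate float64 only for one small batch, then downcast on assignment
    out[start:stop] = np.random.rand(stop - start)

print(out.dtype, out.nbytes)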
If you're in for a hack, here's ten floats generated using uniform random bits:
>>> bytes_per_float = np.float32(0).nbytes  # ugly, I know
>>> np.frombuffer(np.random.bytes(10 * bytes_per_float), dtype=np.float32)
array([ -3.42894422e-23, -3.33389699e-01, -7.63695071e-26,
7.02152836e-10, 3.45816648e-18, 2.80226597e-09,
-9.34621269e-10, -9.75820352e+08, 2.95705402e+20,
2.57654391e+25], dtype=float32)
Of course, these don't follow any nice distribution, the array might contain NaN or Inf, and the code might actually crash on some non-x86 machines due to alignment problems.
I know that in Python it's hard to see the memory usage of an object.
Is it easier to do this for SciPy objects (for example, a sparse matrix)?
You can use array.itemsize (the size of the contained type in bytes) and array.size to obtain the length:
# a is your array
bytes = a.itemsize * a.size
It's not the exact value, as it ignores the whole array infrastructure, but for a big array it's the value that matters (and I guess you care because you have something big).
If you want to use it on a sparse matrix you have to modify it, as a sparse matrix doesn't have the itemsize attribute; you have to access the dtype and get the itemsize from it:
bytes = a.dtype.itemsize * a.size
In general I don't think it's easy to evaluate the real memory occupied by a Python object... the NumPy array is an exception, being just a thin layer over a C array.
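For a CSR or CSC matrix specifically, you can sum the underlying arrays, which also accounts for the index structures (a small sketch):
from scipy import sparse

a = sparse.random(1000, 1000, density=0.01, format='csr')
# data holds the values; indices and indptr hold the compressed-row structure
nbytes = a.data.nbytes + a.indices.nbytes + a.indptr.nbytes
print(nbytes)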
If you are inside IPython, you can also use its %whos magic function, which gives you information about the session's variables, including how much RAM each one takes.