This must be easy, but I'm very new to PyTables. My application has datasets so large they cannot be held in memory, so I use PyTables CArrays. However, I need to find the maximum element in an array that is not infinity. Naively in numpy I'd do this:
max_element = numpy.max(array[array != numpy.inf])
Obviously that won't work in PyTables without introducing a whole array into memory. I could loop through the CArray in windows that fit in memory, but it'd be surprising to me if there weren't a max/min reduction operation. Is there an elegant mechanism to get the conditional maximum element of that array?
If your CArray is one dimensional, it is probably easier to stick it in a single-column Table. Then you have access to the where() method and can easily evaluate expressions like the following.
from itertools import imap
max(imap(lambda r: r['col'], tab.where('col != np.inf')))
This works because where() never reads in all the data at once and returns an iterator, which is handed off to map, which is handed off to max. Note that in Python 3, you don't need to import imap() and imap() becomes just the builtin map().
Not using a table means that you need to use the Expr class and do more of the wiring yourself.
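For comparison, the windowed loop mentioned in the question can be sketched in a few lines. This is only a minimal sketch: the function name and the chunk_size parameter are made up for illustration, and it assumes a one-dimensional array-like that supports len() and slicing (which a CArray does, returning NumPy arrays). It is demonstrated here on a plain NumPy array rather than on a real PyTables file.

```python
import numpy as np

def chunked_max_excluding_inf(arr, chunk_size=1_000_000):
    """Return the largest element of `arr` that is not +inf, or None if there is none.

    Only one slice of `chunk_size` elements is held in memory at a time.
    """
    best = None
    for start in range(0, len(arr), chunk_size):
        block = np.asarray(arr[start:start + chunk_size])
        block = block[block != np.inf]      # drop the infinite entries
        if block.size:
            block_max = block.max()
            if best is None or block_max > best:
                best = block_max
    return best

data = np.array([1.0, np.inf, 7.5, -np.inf, 3.0])
result = chunked_max_excluding_inf(data, chunk_size=2)
print(result)
```

The chunk size trades memory for the number of reads, so it should be tuned to whatever comfortably fits in RAM.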
I know in python it's hard to see the memory usage of an object.
Is it easier to do this for SciPy objects (for example, sparse matrix)?
You can use array.itemsize (the size of the contained type in bytes) and array.size (the number of elements) to estimate the memory used:
# a is your array
bytes = a.itemsize * a.size
It's not the exact value, as it ignores the rest of the array infrastructure, but for big arrays it's the value that matters (and I guess that you care because you have something big).
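As a quick check of this estimate, NumPy arrays also expose the same product directly as the nbytes attribute. The sketch below uses an arbitrary example array; sys.getsizeof additionally includes the small fixed overhead of the array object when the array owns its data, as it does here.

```python
import sys
import numpy as np

a = np.zeros((1000, 1000), dtype=np.float64)

manual = a.itemsize * a.size    # 8 bytes per element * 1,000,000 elements
builtin = a.nbytes              # NumPy reports the same product

print(manual, builtin)
# The difference is the per-object overhead that the estimate ignores.
print(sys.getsizeof(a) - a.nbytes)
```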
If you want to use it on a sparse array you have to modify it, as sparse matrices don't have the itemsize attribute. You have to access the dtype and get the itemsize from it (note that for a sparse matrix, size counts only the stored values):
bytes = a.dtype.itemsize * a.size
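For a sparse matrix the stored values are only part of the picture, because the format also keeps index arrays. A rough sketch for the CSR format, assuming SciPy is installed and using an identity matrix purely as example data, adds up the three underlying arrays:

```python
import numpy as np
from scipy import sparse

m = sparse.eye(1000, format="csr", dtype=np.float64)

# Stored values only, as in the formula above.
values_only = m.dtype.itemsize * m.size

# CSR keeps three arrays: the values, their column indices, and row pointers.
total = m.data.nbytes + m.indices.nbytes + m.indptr.nbytes

print(values_only, total)
```

Other sparse formats store different arrays (COO, for example, has row and col instead of indices and indptr), so the sum has to be adapted per format.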
In general I don't think it's easy to evaluate the real memory occupied by a Python object. The numpy array is an exception, being just a thin layer over a C array.
If you are inside IPython, you can also use its %whos magic function, which gives you information about the session's variables and includes how much RAM each takes.
How would I translate the following into Python from Matlab? I'm still trying to wrap my head around lists/matrices and arrays in numpy, etc.
outframe(:,[4:4:nout-1]) = 0.25*inframe(:,[1:n-1]) + 0.75*inframe(:,[2:n])
pos=(beamnum>0)*(beamnum<=nbeams)*(binnum>0)*(binnum<=nbins)*((beamnum-1)*nbins+binnum)
for index =1:512:
outarray(index,:) =uint8(interp1([1:n],inarray64(index,:),[1:.25:n],method))
(There's other stuff, these are just the particular statements I'm not sure how to make sense of. I have numpy imported,
The main workhorse in numpy is the ndarray (or array). It will for the most part replace matlab matrices when you translate code. Like a matlab matrix, the ndarray stores homogeneous data (ie float64) and is optimized for numerical operations.
The numpy matrix is a subclass of the ndarray which can be convenient for some linear algebra intensive applications. Here is more info about the differences between the two.
The python list is more like a matlab cell array (though not exactly the same). It's one of the basic python data structures, but in scientific applications I find that it comes up most often when you need to hold heterogeneous data. (Or when you're doing something very simple and don't want to go to the trouble of creating a numpy array).
Your code above can be converted almost verbatim to Python using the ndarray, replacing () with [] for indexing and taking into account that indexing starts at 1 in MATLAB and at 0 in Python, i.e. the first element in MATLAB is element 1, and in Python it is element 0. Also note that a MATLAB range includes its end point, while a Python slice excludes it.
Let's try this line by line:
outframe(:,[4:4:nout-1]) = 0.25*inframe(:,[1:n-1]) + 0.75*inframe(:,[2:n])
would translate in "English" to: all rows of outframe, but only every 4th column starting from 4 up to nout-1 (i.e. columns 4, 8, ...). I assume you understand what the inframe references mean.
pos=(beamnum>0)*(beamnum<=nbeams)*(binnum>0)*(binnum<=nbins)*((beamnum-1)*nbins+binnum)
Presumably beamnum is a vector, and (beamnum > 0) returns a vector of 0s and 1s whose elements are 1 where the respective beamnum element is > 0, else 0. The rest of it is clear, I hope.
The second last line is a for-loop and the last line should hopefully be clear.
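Putting the pieces together, one possible NumPy rendering of the three statements might look like the sketch below. Several things here are assumptions, not taken from the question: the sample sizes and data, that nout equals 4*n - 3 so the shapes match, that beamnum and binnum are arrays combined element-wise, and that method is 'linear' (np.interp only does linear interpolation). The uint8 conversion rounds and saturates to mimic MATLAB, since a plain astype would truncate.

```python
import numpy as np

# Assumed sizes and sample data, chosen so the shapes line up.
n = 5
nout = 4 * n - 3
inframe = np.arange(2.0 * n).reshape(2, n)
outframe = np.zeros((2, nout))

# MATLAB columns 4:4:nout-1 (1-based, inclusive) are 3, 7, ... (0-based).
outframe[:, 3:nout - 1:4] = 0.25 * inframe[:, :n - 1] + 0.75 * inframe[:, 1:n]

# Multiplying MATLAB logical vectors acts as an element-wise AND mask.
nbeams, nbins = 2, 4
beamnum = np.array([0, 1, 2, 3])
binnum = np.array([1, 2, 5, 1])
valid = (beamnum > 0) & (beamnum <= nbeams) & (binnum > 0) & (binnum <= nbins)
pos = np.where(valid, (beamnum - 1) * nbins + binnum, 0)

# interp1 over the query points 1:.25:n, one row at a time.
inarray64 = np.array([[0.0, 10.0, 20.0, 30.0, 40.0]])
x = np.arange(1, n + 1)
xq = np.linspace(1, n, 4 * (n - 1) + 1)
outarray = np.empty((inarray64.shape[0], xq.size), dtype=np.uint8)
for index in range(inarray64.shape[0]):
    row = np.interp(xq, x, inarray64[index])
    # Clip to 0..255 and round half up before converting to uint8.
    outarray[index] = np.floor(np.clip(row, 0, 255) + 0.5).astype(np.uint8)

print(outframe[0])
print(pos)
print(outarray[0])
```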
Maybe this is a simple issue, but I could not find any information about it so far.
For an optimization in numpy I need an array of functions. The number of functions I need depends on the current object which shall be optimized.
I have already figured out how to create these functions dynamically, but now I would like to store them in an array like this:
myArray = zeros(x)
for i in range(x):
myArray[i] = createFunction(i)
If I run this I get a type mismatch:
float() argument must be a string or a number, not 'function'
Creating the array directly works well:
myArray = array([createFunction(0)...])
But because I don't know the number of functions I need, this is exactly what I want to prevent.
Ah, I get it. You really do mean an array of functions.
The type mismatch error arises because the call to zeros creates an array of floats by default. So your original would work if instead you did myArray = numpy.empty(x, dtype=object) (note that empty makes more sense than zeros here; older NumPy versions also accepted the alias numpy.object, which has since been removed). The slightly more pythonic version is to use a list comprehension:
myArray = numpy.array([createFunction(i) for i in range(x)])
But you might not need to create a numpy array at all, depending on what you want to do with it:
myArray = [createFunction(i) for i in range(x)]
If you want to avoid the list, it might be better to use numpy.fromfunction along with numpy.vectorize:
myArray = numpy.fromfunction(numpy.vectorize(createFunction),
                             shape=(x,), dtype=object)
where (x,) is a tuple giving the shape of the array. The call to vectorize is needed because fromfunction assumes that the function can work on an array of inputs and return an array of scalars, and vectorize converts a function to do exactly that. The dtype=object is needed since otherwise numpy tries to create an array of floats.
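To make the first suggestion concrete, here is a minimal sketch of the original loop with an object-typed array. The createFunction below is a hypothetical stand-in for the asker's factory, which is not shown in the question.

```python
import numpy as np

def createFunction(i):
    # Hypothetical factory: returns a function that adds i to its argument.
    def f(value):
        return value + i
    return f

x = 4
myArray = np.empty(x, dtype=object)   # slots can hold any Python object
for i in range(x):
    myArray[i] = createFunction(i)

results = [f(10) for f in myArray]
print(results)
```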
Maybe you can use
myArray = array([createFunction(i) for i in range(x)])
If you need an array of functions, is it possible to not use NumPy? NumPy arrays have C-style types, and they default to float. If you can, just use a standard Python list. But if you absolutely must use NumPy, try defining the array like so:
import numpy as np
a = np.empty([x], dtype=np.dtype(np.object_))
Or however you need it to be with that dtype.
Numpy arrays are homogeneous. That is, all elements of a numpy array are of the same type: Python is duck-typed, numpy isn't. This is part of what makes matrix operations on numpy arrays and matrices so fast. However, because of this, a data type must be known when the array is first created. Numpy is generally very good at inferring the data type. The problem comes when creating an empty or zeroed array. Since there are no elements to examine, numpy must guess the data type. Numpy defaults to numpy.float64 if it isn't given a data type at array creation time. This is a decent choice, as numpy is typically used in scientific or engineering areas where floating point numbers are required. This is also why numpy is complaining: it can't store your functions as 64-bit floating point numbers.
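The inference and the default can be seen directly in a short session. This is only an illustrative sketch with made-up values; the lambda stands in for any function.

```python
import numpy as np

inferred_int = np.array([1, 2, 3]).dtype       # inferred from the elements
inferred_float = np.array([1.0, 2, 3]).dtype   # one float promotes them all
default = np.zeros(3).dtype                    # nothing to inspect: float64

arr = np.zeros(3)
try:
    arr[0] = lambda: 1      # a function cannot be converted to a float
    error = None
except TypeError as exc:
    error = exc

print(inferred_int, inferred_float, default)
print(error)
```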
The quick solution is to let numpy know the data type you want. eg.
myArray = numpy.zeros(x, dtype=object)
Note that the data type cannot be any class, but must be an instance of numpy.dtype (for advanced use you can create additional dtypes at runtime that numpy can then manipulate). For functions, numpy will store them with dtype=object (which means any generic Python object). I do not think you will get any performance benefit from using numpy to store arrays of functions. Perhaps you would be better off creating generator functions and chaining them, converting to a numpy array once you know the result will be a number.
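funcs = [createFunction(i) for i in range(x)]
def getItemFromEachFunction(i):
    # fromfunction passes whole index arrays, so vectorize is used to call this once per index
    return funcs[i]()
arr = numpy.fromfunction(numpy.vectorize(getItemFromEachFunction), (x,), dtype=int)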
If you are creating a 1d array, you can implement it as a list, or else use the 'array' module in the standard library. I have always used lists for 1d arrays.
What is the reason or circumstance where I would want to use the array module instead?
Is it for performance and memory optimization, or am I missing something obvious?
Basically, Python lists are very flexible and can hold completely heterogeneous, arbitrary data, and they can be appended to very efficiently, in amortized constant time. If you need to shrink and grow your list time-efficiently and without hassle, they are the way to go. But they use a lot more space than C arrays, in part because each item in the list requires the construction of an individual Python object, even for data that could be represented with simple C types (e.g. float or uint64_t).
The array.array type, on the other hand, is just a thin wrapper on C arrays. It can hold only homogeneous data (that is to say, all of the same type) and so it uses only sizeof(one object) * length bytes of memory. Mostly, you should use it when you need to expose a C array to an extension or a system call (for example, ioctl or fcntl).
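The size difference is easy to measure. The following is a rough sketch for CPython with an arbitrary element count; the list total adds the list's own pointer table to the size of each float object it refers to.

```python
import sys
from array import array

n = 1000
as_list = [float(i) for i in range(n)]
as_array = array("d", as_list)          # 'd' is a C double

# The array stores the raw doubles back to back.
array_payload = as_array.itemsize * len(as_array)

# The list stores pointers, and every float is a separate Python object.
list_total = sys.getsizeof(as_list) + sum(sys.getsizeof(v) for v in as_list)

print(array_payload, list_total)
```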
array.array is also a reasonable way to represent a mutable string in Python 2.x (array('B', bytes)). However, Python 2.6+ and 3.x offer a mutable byte string as bytearray.
However, if you want to do math on a homogeneous array of numeric data, then you're much better off using NumPy, which can automatically vectorize operations on complex multi-dimensional arrays.
To make a long story short: array.array is useful when you need a homogeneous C array of data for reasons other than doing math.
For almost all cases the normal list is the right choice. The array module is more like a thin wrapper over C arrays, which gives you a kind of strongly typed container (see docs), with access to more C-like types such as signed/unsigned short or double, which are not part of the built-in types. I'd say use the array module only if you really need it; in all other cases stick with lists.
The array module is kind of one of those things that you probably don't have a need for if you don't know why you would use it (and take note that I'm not trying to say that in a condescending manner!). Most of the time, the array module is used to interface with C code. To give you a more direct answer to your question about performance:
Arrays are more efficient than lists for some uses. If you need to allocate an array that you KNOW will not change, then arrays can be faster and use less memory. GvR has an optimization anecdote in which the array module comes out to be the winner (long read, but worth it).
On the other hand, part of the reason why lists eat up more memory than arrays is because python will allocate a few extra elements when all allocated elements get used. This means that appending items to lists is faster. So if you plan on adding items, a list is the way to go.
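The over-allocation can be observed by watching the reported size of a list while appending to it. This sketch relies on CPython behaviour, and the exact growth steps are an implementation detail that differs between versions.

```python
import sys

items = []
sizes = []
for i in range(100):
    items.append(i)
    sizes.append(sys.getsizeof(items))

# The size grows in jumps: most appends reuse space reserved earlier.
resizes = len(set(sizes))
print(resizes, "distinct sizes over", len(sizes), "appends")
```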
TL;DR I'd only use an array if you had an exceptional optimization need or you need to interface with C code (and can't use pyrex).
It's a trade-off! Pros of each one:
list:
    flexible
    can be heterogeneous
array (e.g. numpy array):
    array of uniform values
    homogeneous
    compact (in size)
    efficient (functionality and speed)
    convenient
My understanding is that arrays are stored more efficiently (i.e. as contiguous blocks of memory vs. pointers to Python objects), but I am not aware of any performance benefit. Additionally, with arrays you must store primitives of the same type, whereas lists can store anything.
The standard library arrays are useful for binary I/O, such as translating a list of ints to a string to write to, say, a wave file. That said, as many have already noted, if you're going to do any real work then you should consider using NumPy.
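As an illustration of the binary I/O use case, the sketch below writes a handful of made-up 16-bit samples to an in-memory WAV file and reads them back. The sample values and the 8000 Hz frame rate are arbitrary choices for the example.

```python
import io
import wave
from array import array

samples = array("h", [0, 1000, -1000, 32767, -32768])   # signed 16-bit

buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(samples.itemsize)    # 2 bytes per sample
    w.setframerate(8000)
    w.writeframes(samples.tobytes())    # raw bytes straight from the array

buf.seek(0)
with wave.open(buf, "rb") as r:
    raw = r.readframes(r.getnframes())

restored = array("h")
restored.frombytes(raw)
print(restored)
```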
With regard to performance, here are some numbers comparing python lists, arrays and numpy arrays (all with Python 3.7 on a 2017 Macbook Pro).
The end result is that the python list is fastest for these operations.
import timeit
import numpy as np
# Python list with append()
np.mean(timeit.repeat(setup="a = []", stmt="a.append(1.0)", number=1000, repeat=5000)) * 1000
# 0.054 +/- 0.025 msec
# Python array with append()
np.mean(timeit.repeat(setup="import array; a = array.array('f')", stmt="a.append(1.0)", number=1000, repeat=5000)) * 1000
# 0.104 +/- 0.025 msec
# Numpy array with append()
np.mean(timeit.repeat(setup="import numpy as np; a = np.array([])", stmt="np.append(a, [1.0])", number=1000, repeat=5000)) * 1000
# 5.183 +/- 0.950 msec
# Python list using +=
np.mean(timeit.repeat(setup="a = []", stmt="a += [1.0]", number=1000, repeat=5000)) * 1000
# 0.062 +/- 0.021 msec
# Python array using +=
np.mean(timeit.repeat(setup="import array; a = array.array('f')", stmt="a += array.array('f', [1.0]) ", number=1000, repeat=5000)) * 1000
# 0.289 +/- 0.043 msec
# Python list using extend()
np.mean(timeit.repeat(setup="a = []", stmt="a.extend([1.0])", number=1000, repeat=5000)) * 1000
# 0.083 +/- 0.020 msec
# Python array using extend()
np.mean(timeit.repeat(setup="import array; a = array.array('f')", stmt="a.extend([1.0]) ", number=1000, repeat=5000)) * 1000
# 0.169 +/- 0.034 msec
If you're going to be using arrays, consider the numpy or scipy packages, which give you arrays with a lot more flexibility.
This answer will sum up almost all the queries about when to use List and Array:
The main difference between these two data types is the operations you can perform on them. For example, you can divide an array by 3 and it will divide each element of the array by 3. The same cannot be done with a list.
Lists are part of Python's syntax, so they don't need to be declared, whereas you have to import and construct an array before using it.
You can store values of different data types in a list (heterogeneous), whereas in an array you can only store values of the same data type (homogeneous).
Because arrays are fast and rich in functionality, they are widely used for arithmetic operations and for storing large amounts of data, compared to lists.
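The division example can be tried directly. In this small sketch (assuming the array in question is a NumPy array), the same operation on a list raises a TypeError, so a list needs an explicit loop or comprehension instead.

```python
import numpy as np

values = [3, 6, 9]

divided = np.array(values) / 3          # element-wise on the array

try:
    values / 3                          # not defined for lists
    list_error = None
except TypeError as exc:
    list_error = exc

divided_list = [v / 3 for v in values]  # the list equivalent

print(divided, divided_list)
print(list_error)
```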
Arrays take less memory compared to lists.
Array can only be used for specific types, whereas lists can be used for any object.
Arrays can also only hold data of one type, whereas a list can have entries of various object types.
Arrays are also more efficient for some numerical computation.
An important difference between numpy array and list is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.
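A short sketch with made-up values shows the difference: writing through a NumPy slice changes the original array, while a list slice is an independent copy.

```python
import numpy as np

arr = np.arange(5)
view = arr[1:4]         # a view: shares memory with arr
view[0] = 99            # also changes arr[1]

lst = [0, 1, 2, 3, 4]
part = lst[1:4]         # a new list holding copies of the references
part[0] = 99            # lst is untouched

print(arr, lst)
```

When an independent array is needed, arr[1:4].copy() makes the copy explicit.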