I'm pretty illiterate in using Python/numpy.
I have the following piece of code:
data = np.array([])
for i in range(10):
    data = np.append(data, GetData())
return data
GetData() returns a numpy array with a custom dtype. However, when executing the above piece of code, the numbers are converted to float64, which I suspect is the culprit for other issues I'm having. How can I copy/append the output of the function while preserving its dtype?
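For example, a minimal demonstration of the promotion (the chunk array here is a stand-in for GetData()'s output):
import numpy as np

data = np.array([])                       # dtype is float64 by default
chunk = np.array([1, 2], dtype=np.int64)  # stand-in for GetData()'s output
np.append(data, chunk).dtype              # dtype('float64') -- the ints were promoted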
Given the comments stating that you will only know the type of data once you run GetData(), and that multiple types are expected, you could do it like so:
# [...]
dataByType = {}  # dictionary mapping each dtype encountered to the array holding data of that dtype
for i in range(10):
    newData = GetData()
    if newData.dtype not in dataByType:
        # If the dtype has not been encountered yet,
        # create an empty array with that dtype and store it in the dict
        dataByType[newData.dtype] = np.array([], dtype=newData.dtype)
    # Append the new data to the corresponding array in the dict, depending on dtype
    dataByType[newData.dtype] = np.append(dataByType[newData.dtype], newData)
Taking hpaulj's answer into account, if you wish to preserve the different types you might encounter without creating a new array at each iteration, you can adapt the above to:
# [...]
dataByType = {}  # dictionary mapping each dtype encountered to the list holding data of that dtype
for i in range(10):
    newData = GetData()
    if newData.dtype not in dataByType:
        # If the dtype has not been encountered yet,
        # create an empty list for it and store it in the dict
        dataByType[newData.dtype] = []
    # Append the new data to the corresponding list in the dict, depending on dtype
    dataByType[newData.dtype].append(newData)

# At this point, all your data pieces are stored according to their original dtype
# inside the dataByType dictionary. Now, if you wish, you can convert them to numpy arrays as well.

# Either by concatenation, updating what is stored in the dict:
for dataType in dataByType:
    # No need to specify the dtype in concatenate here, since the previous step
    # ensures all data pieces are of the same type
    dataByType[dataType] = np.concatenate(dataByType[dataType])

# Or by creating an array directly, storing each data piece at a different index:
for dataType in dataByType:
    # As for concatenate, no need to specify the dtype here
    dataByType[dataType] = np.array(dataByType[dataType])
A little example:
import numpy as np

# to get something similar to GetData in the example structure:
getData = [
    np.array([1., 2.], dtype=np.float64),
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4], dtype=np.int64),
    np.array([3., 4.], dtype=np.float64)
]  # dtype spelled out here for clarity, but not needed

dataByType = {}
for i in range(len(getData)):
    newData = getData[i]
    if newData.dtype not in dataByType:
        dataByType[newData.dtype] = []
    dataByType[newData.dtype].append(newData)

print(dataByType)  # output formatted below for clarity
# {dtype('float64'):
#      [array([1., 2.]), array([3., 4.])],
#  dtype('int64'):
#      [array([1, 2], dtype=int64), array([3, 4], dtype=int64)]}
Now if we use concatenate on that dataset, we get 1D arrays that preserve the original type (dtype=float64 is not shown in the output since it is the default type for floating point values):
for dataType in dataByType:
    dataByType[dataType] = np.concatenate(dataByType[dataType])

print(dataByType)  # once again, output formatted for clarity
# {dtype('float64'):
#      array([1., 2., 3., 4.]),
#  dtype('int64'):
#      array([1, 2, 3, 4], dtype=int64)}
And if we use array, we get 2D arrays:
for dataType in dataByType:
    dataByType[dataType] = np.array(dataByType[dataType])

print(dataByType)
# {dtype('float64'):
#      array([[1., 2.],
#             [3., 4.]]),
#  dtype('int64'):
#      array([[1, 2],
#             [3, 4]], dtype=int64)}
One important thing to note: using array will not work as intended if the arrays to combine don't all have the same shape:
import numpy as np
print(repr(np.array([
    np.array([1, 2, 3]),
    np.array([4, 5])])))
# array([array([1, 2, 3]), array([4, 5])], dtype=object)
You get an array of dtype object whose elements are, in this case, arrays of different lengths. (Note that recent NumPy versions refuse to build such ragged arrays unless you pass dtype=object explicitly.)
Your use of [] and append indicates that you are naively copying the common list idiom:
alist = []
for x in another_list:
    alist.append(x)
Your data is not a clone of the [] list:
In [220]: np.array([])
Out[220]: array([], dtype=float64)
It's an array with shape (0,) and dtype float.
np.append is not a list append clone. I stress that, because too many new users make that mistake, and the result is many different errors. It is really just a cover for np.concatenate, one that takes 2 arguments instead of a list of arguments. As the docs stress, it returns a new array, and when used iteratively, that means a lot of copying.
It is best to collect your arrays in a list, and give it to concatenate. List append is in-place, and better when done iteratively. If you give concatenate a list of arrays, the resulting dtype will be the common one (or whatever promoting requires). (new versions do let you specify dtype when calling concatenate.)
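A minimal sketch of that pattern (GetData here is a stand-in for the asker's function, assumed to return int64 arrays):
import numpy as np

def GetData():
    # stand-in for the asker's function
    return np.array([1, 2, 3], dtype=np.int64)

chunks = []                    # plain list append is in-place and cheap
for i in range(10):
    chunks.append(GetData())
data = np.concatenate(chunks)  # one copy at the end; dtype stays int64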
Keep the numpy documentation at hand (the python docs too, if necessary), and look up functions. Pay attention to how they are called, including the keyword parameters. And practice with small examples. I keep an interactive python session at hand, even when writing answers.
When working with arrays, pay close attention to shape and dtype. Don't make assumptions.
concatenating 2 int arrays:
In [238]: np.concatenate((np.array([1,2]),np.array([4,3])))
Out[238]: array([1, 2, 4, 3])
making one a float array (just by adding a decimal point to one number):
In [239]: np.concatenate((np.array([1,2]),np.array([4,3.])))
Out[239]: array([1., 2., 4., 3.])
It won't let me change the result to int:
In [240]: np.concatenate((np.array([1,2]),np.array([4,3.])), dtype=int)
Traceback (most recent call last):
File "<ipython-input-240-91b4e3fec07a>", line 1, in <module>
np.concatenate((np.array([1,2]),np.array([4,3.])), dtype=int)
File "<__array_function__ internals>", line 180, in concatenate
TypeError: Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'same_kind'
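If you really do want the cast, concatenate also accepts a casting keyword to relax that rule (a sketch, assuming NumPy >= 1.20, where the dtype and casting arguments were added):
np.concatenate((np.array([1,2]), np.array([4,3.])), dtype=int, casting='unsafe')
# array([1, 2, 4, 3])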
If an element is a string, the result is also a string dtype:
In [241]: np.concatenate((np.array([1,2]),np.array(['4',3.])))
Out[241]: array(['1', '2', '4', '3.0'], dtype='<U32')
Sometimes it is necessary to adjust dtypes after a calculation:
In [243]: np.concatenate((np.array([1,2]),np.array(['4',3.]))).astype(float)
Out[243]: array([1., 2., 4., 3.])
In [244]: np.concatenate((np.array([1,2]),np.array(['4',3.]))).astype(float).astype(int)
Out[244]: array([1, 2, 4, 3])
I learned about the axis indexing of numpy arrays from how is axis indexed in numpy's array.
That article says that, for a 2-D array, axis=0 stands for each column and axis=1 for each row. That works when I use np.mean, which with axis=0 averages values by column, but np.delete with axis=0 is different: it deletes elements by row.
import numpy as np
arr = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])
'''
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12]])
'''
np.mean(arr, 0)
'''
array([5., 6., 7., 8.])
'''
np.delete(arr,1,axis=0)
'''
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
'''
Am I wrong in my understanding?
Why do np.mean and np.delete seem to operate on different axes when axis=0 is given?
The accepted answer to the question you linked to actually says correctly that
Axis 0 is thus the first dimension (the "rows"), and axis 1 is the second dimension (the "columns")
which is what the code does, and which is the opposite of what you said.
This ought to be the source of your confusion. As we see from your own example:
np.delete(arr,1,axis=0)
'''
array([[ 1, 2, 3, 4],
[ 9, 10, 11, 12]])
'''
Row at index 1 is deleted, which is exactly what we want to happen.
This is a 2D example where we have rows and columns, but it is important to understand how shapes work in general, and then they will make sense in higher dimensions. Consider the following example:
[
[
[1, 2],
[3, 4]
],
[
[5, 6],
[7, 8],
],
[
[9, 10],
[11, 12],
]
]
Here, we have 3 grids, each itself is 2x2, so we have something of shape 3x2x2. This is why we have 12 elements in total. Now, how do we know that at axis=0 we have 3 elements? Because if you look at this as a simple array and not some fancy numpy object then len(arr) == 3. Then if you take any of the elements along that axis (any of the "grids" that is), we will see that their length is 2 or len(arr[0]) == 2. That is because each of the grids has 2 rows. Finally, to check how many items each row of each of these grids has, we just have to inspect any one of these rows. Let's look at the second row of the first grid for a change. We will see that: len(arr[0][1]) == 2.
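To check those length claims in code (a small sketch, building the 3x2x2 example with arange for brevity):
import numpy as np

arr = np.arange(1, 13).reshape(3, 2, 2)  # the 3x2x2 example from above
arr.shape        # (3, 2, 2)
len(arr)         # 3 -- the grids along axis 0
len(arr[0])      # 2 -- rows in each grid
len(arr[0][1])   # 2 -- items in each row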
Now, what does np.mean(a, axis=0) mean? It means we will go over each of the items along axis=0 and find their mean. If these items are simply numbers (if a=np.array([1,2,3])) that's easy because the average of 1,2,3 is just the sum of these numbers divided by their quantity.
So, what if we have vectors or grids? What is the average of [2,4,6] and [0,0,0]? The convention is that the average of these two lists is a list of the averages at each index. So in other words it's:
[np.mean([2,0]), np.mean([4,0]), np.mean([6,0])]
which is trivially [1,2,3].
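Checked in code:
import numpy as np
np.mean(np.array([[2, 4, 6], [0, 0, 0]]), axis=0)
# array([1., 2., 3.])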
So, why does np.delete behave differently? Well, because the purpose of delete is to remove an element along some axis rather than to perform an aggregation over that axis. So in this particular case, we had 3 grids. So removing one of them will simply leave us with 2 grids. We could alternatively remove the second row of every grid (axis=1). That would leave us with 3 grids but each would have only 1 row instead of 2.
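The shapes confirm this (again using the 3x2x2 example built with arange):
import numpy as np

arr = np.arange(1, 13).reshape(3, 2, 2)
np.delete(arr, 0, axis=0).shape  # (2, 2, 2) -- one whole grid removed
np.delete(arr, 1, axis=1).shape  # (3, 1, 2) -- second row removed from every grid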
Hopefully, this brings some clarity :)
Usually I like to think about the axis in numpy (or pandas) as an indicator of the axis "along which" computations are carried out.
In this sense, when you compute the mean along axis 0, that is, along the rows, you do it for each column. But if you delete along axis 0, it means you scroll along the rows to find the index you will delete.
I think your confusion possibly comes from the fact that in delete, the axis refers to the axis you are indexing along when finding the section to delete, while in mean, the axis refers to the axis you are averaging along.
In both cases, axis tells the function which axis to "move along" when performing its operation: for delete it moves along that axis when searching for what to delete, and for mean it moves along that axis when calculating averages.
I know this is a relatively common topic on stackoverflow, but I couldn't find the answer I was looking for. Basically, I am trying to write very efficient code (I have rather large data sets) to get certain columns of data from a matrix. Below is what I have so far. It gives me this error: could not broadcast input array from shape (2947,1) into shape (2947)
def get_data(self, colHeaders):
    temp = np.zeros((self.matrix_data.shape[0], len(colHeaders)))
    for col in colHeaders:
        index = self.header2matrix[col]
        temp[:,index:] = self.matrix_data[:,index]
    data = np.matrix(temp)
    return temp
Maybe this simple example will help:
In [70]: data=np.arange(12).reshape(3,4)
In [71]: header={'a':0,'b':1,'c':2}
In [72]: col=['c','a']
In [73]: index=[header[i] for i in col]
In [74]: index
Out[74]: [2, 0]
In [75]: data[:,index]
Out[75]:
array([[ 2, 0],
[ 6, 4],
[10, 8]])
data is some sort of 2D array, header is a dictionary mapping names to column numbers. Using the input col, I construct a column index list. You can select all columns at once, rather than one by one.
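Applied to the asker's method, that idea might look like this (a sketch; matrix_data and header2matrix are the attributes from the question):
import numpy as np

def get_data(self, colHeaders):
    # build the column-index list once, then select all columns in one step
    index = [self.header2matrix[col] for col in colHeaders]
    return self.matrix_data[:, index]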
I have a numpy.ndarray whose columns I'd like to access. I will be taking all columns after 8 and testing them for variance, removing a column if its variance/average is low. In order to do this, I need access to the columns, preferably with numpy. With my current methods, I encounter errors or fail to transpose.
To mine these arrays, I am using the IOPro adapter, which gives a regular numpy.ndarray.
import iopro
import sys
adapter = iopro.text_adapter(sys.argv[1], parser='csv')
all_data = adapter[:]
z_matrix = adapter[range(8,len(all_data[0]))][1:3]
print type(z_matrix) #check type
print z_matrix # print array
print z_matrix.transpose() # attempt transpose (fails)
print z_matrix[:,0] # attempt access by column (fails)
Can someone explain what is happening?
The output is this:
<type 'numpy.ndarray'>
[ (18.712, 64.903, -10.205, -1.346, 0.319, -0.654, 1.52398, 114.495, -75.2488, 1.52184, 111.31, 175.408, 1.52256, 111.699, -128.141, 1.49227, 111.985, -138.173)
 (17.679, 48.015, -3.152, 0.848, 1.239, -0.3, 1.52975, 113.963, -50.0622, 1.52708, 112.335, -57.4621, 1.52603, 111.685, -161.098, 1.49204, 113.406, -66.5854)]
[ (18.712, 64.903, -10.205, -1.346, 0.319, -0.654, 1.52398, 114.495, -75.2488, 1.52184, 111.31, 175.408, 1.52256, 111.699, -128.141, 1.49227, 111.985, -138.173)
 (17.679, 48.015, -3.152, 0.848, 1.239, -0.3, 1.52975, 113.963, -50.0622, 1.52708, 112.335, -57.4621, 1.52603, 111.685, -161.098, 1.49204, 113.406, -66.5854)]
Traceback (most recent call last):
File "z-matrix-filtering.py", line 11, in <module>
print z_matrix[:,0]
IndexError: too many indices
What is going wrong? Is there a better way to access the columns? I will be reading all lines of a file, testing all columns from the 8th for significant variance, removing any columns that don't vary significantly, and then reprinting the result as a new CSV.
EDIT:
Based on the responses, I have created the following very ugly and, I think, inane approach.
all_data = adapter[:]
z_matrix = []
for line in all_data:
    to_append = []
    for column in range(8, len(all_data.dtype)):
        to_append.append(line[column].astype(np.float16))
    z_matrix.append(to_append)
z_matrix = np.array(z_matrix)
The reason the columns must be accessed directly is that there is a string inside the data. If this string is not circumvented in some way, an error is thrown about a "void-array with object members using buffer".
Is there a better solution? This seems terrible, and it seems it will be inefficient for several gigabytes of data.
Notice that the output of print z_matrix has the form
[ (18.712, 64.903, ..., -138.173)
(17.679, 48.015, ..., -66.5854)]
That is, it is printed as a list of tuples. That is the output you get when the array is a "structured array". It is a one-dimensional array of structures. Each "element" in the array has 18 fields. The error occurs because you are trying to index a 1-D array as if it were 2-D; z_matrix[:,0] won't work.
Print the data type of the array to see the details. E.g.
print z_matrix.dtype
That should show the names of the fields and their individual data types.
You can get one of the elements as, for example, z_matrix[k] (where k is an integer), or you can access a "column" (really a field of the structured array) as z_matrix['name'] (change 'name' to one of the fields in the dtype).
If the fields all have the same data type (which looks like the case here--each field has type np.float64), you can create a 2-D view of the data by reshaping the result of the view method. For example:
z_2d = z_matrix.view(np.float64).reshape(-1, len(z_matrix.dtype.names))
Another way to get the data by column number rather than name is:
col = 8 # The column number (zero-based).
col_data = z_matrix[z_matrix.dtype.names[col]]
For more about structured arrays, see http://docs.scipy.org/doc/numpy/user/basics.rec.html.
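A small self-contained illustration of those access patterns (a sketch with made-up field names, not the asker's actual data):
import numpy as np

# a tiny structured array standing in for z_matrix (field names are invented)
z = np.array([(18.712, 64.903, -10.205), (17.679, 48.015, -3.152)],
             dtype=[('f0', np.float64), ('f1', np.float64), ('f2', np.float64)])
z['f1']               # access a field ("column") by name
z[z.dtype.names[2]]   # access a field by column number
z_2d = z.view(np.float64).reshape(-1, len(z.dtype.names))
z_2d[:, 0]            # ordinary 2-D column access now works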
The display of z_matrix is consistent with it being shape (2,), a 1d array of tuples.
np.array([np.array(a) for a in z_matrix])
produces a (2,18) 2d array. You should be able to do your column tests on that.
It is very easy to access numpy array columns. Here's a simple example which may be helpful:
import numpy as n
A = n.array([[1, 2, 3], [5, 6, 7]])
A
>>> array([[1, 2, 3],
       [5, 6, 7]])
A.T  # to obtain the transpose
>>> array([[1, 5],
       [2, 6],
       [3, 7]])
n.mean(A.T, axis=1)  # to obtain the column-wise mean of array A
>>> array([ 3., 4., 5.])
I hope this will help you perform your transpose and column-wise operations
I want to slice a NumPy nxn array. I want to extract an arbitrary selection of m rows and columns of that array (i.e. without any pattern in the numbers of rows/columns), making it a new, mxm array. For this example let us say the array is 4x4 and I want to extract a 2x2 array from it.
Here is our array:
from numpy import *
x = range(16)
x = reshape(x,(4,4))
print x
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]
[12 13 14 15]]
The rows and columns to remove are the same. The easiest case is when I want to extract a 2x2 submatrix that is at the beginning or at the end, i.e.:
In [33]: x[0:2,0:2]
Out[33]:
array([[0, 1],
[4, 5]])
In [34]: x[2:,2:]
Out[34]:
array([[10, 11],
[14, 15]])
But what if I need to remove another mixture of rows/columns? What if I need to remove the first and third rows/columns, thus extracting the submatrix [[5,7],[13,15]]? There can be any composition of rows/columns. I read somewhere that I just need to index my array using arrays/lists of indices for both rows and columns, but that doesn't seem to work:
In [35]: x[[1,3],[1,3]]
Out[35]: array([ 5, 15])
I found one way, which is:
In [61]: x[[1,3]][:,[1,3]]
Out[61]:
array([[ 5, 7],
[13, 15]])
First issue with this is that it is hardly readable, although I can live with that. If someone has a better solution, I'd certainly like to hear it.
Other thing is I read on a forum that indexing arrays with arrays forces NumPy to make a copy of the desired array, thus when treating with large arrays this could become a problem. Why is that so / how does this mechanism work?
To answer this question, we have to look at how indexing a multidimensional array works in Numpy. Let's first say you have the array x from your question. The buffer assigned to x will contain 16 ascending integers from 0 to 15. If you access one element, say x[i,j], NumPy has to figure out the memory location of this element relative to the beginning of the buffer. This is done by calculating in effect i*x.shape[1]+j (and multiplying with the size of an int to get an actual memory offset).
If you extract a subarray by basic slicing like y = x[0:2,0:2], the resulting object will share the underlying buffer with x. But what happens if you access y[i,j]? NumPy can't use i*y.shape[1]+j to calculate the offset into the array, because the data belonging to y is not consecutive in memory.
NumPy solves this problem by introducing strides. When calculating the memory offset for accessing x[i,j], what is actually calculated is i*x.strides[0]+j*x.strides[1] (and this already includes the factor for the size of an int):
x.strides
(16, 4)
When y is extracted like above, NumPy does not create a new buffer, but it does create a new array object referencing the same buffer (otherwise y would just be equal to x). The new array object will have a different shape than x and maybe a different starting offset into the buffer, but will share the strides with x (in this case at least):
y.shape
(2,2)
y.strides
(16, 4)
This way, computing the memory offset for y[i,j] will yield the correct result.
But what should NumPy do for something like z=x[[1,3]]? The strides mechanism won't allow correct indexing if the original buffer is used for z. NumPy theoretically could add some more sophisticated mechanism than the strides, but this would make element access relatively expensive, somehow defying the whole idea of an array. In addition, a view wouldn't be a really lightweight object anymore.
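You can verify the view-versus-copy behaviour yourself with np.shares_memory (available in modern NumPy):
import numpy as np

x = np.arange(16).reshape(4, 4)
np.shares_memory(x, x[0:2, 0:2])  # True  -- basic slicing returns a view
np.shares_memory(x, x[[1, 3]])    # False -- fancy indexing returns a copy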
This is covered in depth in the NumPy documentation on indexing.
Oh, and nearly forgot about your actual question: Here is how to make the indexing with multiple lists work as expected:
x[[[1],[3]],[1,3]]
This is because the index arrays are broadcasted to a common shape.
Of course, for this particular example, you can also make do with basic slicing:
x[1::2, 1::2]
As Sven mentioned, x[[[0],[2]],[1,3]] will give back rows 0 and 2 matched against columns 1 and 3, while x[[0,2],[1,3]] will return the values x[0,1] and x[2,3] in an array.
There is a helpful function for doing the first example I gave, numpy.ix_. You can do the same thing as my first example with x[numpy.ix_([0,2],[1,3])]. This can save you from having to enter in all of those extra brackets.
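For instance:
import numpy as np

x = np.arange(16).reshape(4, 4)
x[np.ix_([0, 2], [1, 3])]
# array([[ 1,  3],
#        [ 9, 11]])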
I don't think that x[[1,3]][:,[1,3]] is all that hard to read. If you want to be clearer about your intent, you can do:
x[[1,3],:][:,[1,3]]
I am not an expert in slicing, but typically, if you slice into an array and the values are contiguous, you get back a view in which only the stride values are changed.
e.g. in your inputs 33 and 34, although you get a 2x2 array, the row stride still spans 4 elements of the original array. Thus, when you index the next row, the pointer moves to the correct position in memory.
Clearly, this mechanism doesn't carry over to the case of an array of indices. Hence, numpy has to make the copy. After all, many other matrix math functions rely on the size, strides and contiguous memory allocation.
If you want to skip every other row and every other column, then you can do it with basic slicing:
In [49]: x=np.arange(16).reshape((4,4))
In [50]: x[1:4:2,1:4:2]
Out[50]:
array([[ 5, 7],
[13, 15]])
This returns a view, not a copy of your array.
In [51]: y=x[1:4:2,1:4:2]
In [52]: y[0,0]=100
In [53]: x # <---- Notice x[1,1] has changed
Out[53]:
array([[ 0, 1, 2, 3],
[ 4, 100, 6, 7],
[ 8, 9, 10, 11],
[ 12, 13, 14, 15]])
while z=x[(1,3),:][:,(1,3)] uses advanced indexing and thus returns a copy:
In [58]: x=np.arange(16).reshape((4,4))
In [59]: z=x[(1,3),:][:,(1,3)]
In [60]: z
Out[60]:
array([[ 5, 7],
[13, 15]])
In [61]: z[0,0]=0
Note that x is unchanged:
In [62]: x
Out[62]:
array([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11],
[12, 13, 14, 15]])
If you wish to select arbitrary rows and columns, then you can't use basic slicing. You'll have to use advanced indexing, using something like x[rows,:][:,columns], where rows and columns are sequences. This of course is going to give you a copy, not a view, of your original array. This is as one should expect, since a numpy array uses contiguous memory (with constant strides), and there would be no way to generate a view with arbitrary rows and columns (since that would require non-constant strides).
With numpy, you can pass a slice for each component of the index - so, your x[0:2,0:2] example above works.
If you just want to evenly skip columns or rows, you can pass slices with three components
(i.e. start, stop, step).
Again, for your example above:
>>> x[1:4:2, 1:4:2]
array([[ 5, 7],
[13, 15]])
Which is basically: slice in the first dimension, with start at index 1, stop when index is equal or greater than 4, and add 2 to the index in each pass. The same for the second dimension. Again: this only works for constant steps.
The syntax you found does something quite different internally: what x[[1,3]][:,[1,3]] actually does is create a new array including only rows 1 and 3 from the original array (done with the x[[1,3]] part), and then re-slice that, creating a third array, including only columns 1 and 3 of the previous array.
I asked a similar question here: Writting in sub-ndarray of a ndarray in the most pythonian way. Python 2.
Following the solution of that post, for your case the solution looks like:
columns_to_keep = [1,3]
rows_to_keep = [1,3]
And using ix_:
x[np.ix_(rows_to_keep, columns_to_keep)]
Which is:
array([[ 5, 7],
[13, 15]])
I'm not sure how efficient this is, but you can use range() to slice in both axes:
x=np.arange(16).reshape((4,4))
x[range(1,3), :][:,range(1,3)]