I have a ValueError: 'object too deep for desired array' in a Python program.
I have this error while using numpy.digitize.
I think it's how I use Pandas DataFrames:
To keep it simple (because this is done through an external library), I have a list in my program but the library needs a DataFrame so I do something like this:
ts = range(1000)
df = pandas.DataFrame(ts)
res = numpy.digitize(df.values, bins)
But then it seems like df.values is an array of lists instead of an array of floats. I mean:
array([[ 0],
[ 1],
[ 2],
...,
[997],
[998],
[999]], dtype=int64)
Help please, I spent too much time on this.
Try this:
numpy.digitize(df.iloc[:, 0], bins)
You are trying to get the values from a whole DataFrame. That is why you get the 2D array. Each row in the array is a row of the DataFrame.
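To make the shape difference concrete, here is a minimal sketch. The `bins` array is a made-up example, since the question does not show the real one.

```python
import numpy as np
import pandas as pd

ts = range(1000)
df = pd.DataFrame(ts)

# Hypothetical bin edges; the question does not show the real ones.
bins = np.array([0, 250, 500, 750, 1000])

# The whole frame gives a 2-D array: one row per DataFrame row.
print(df.values.shape)        # (1000, 1)

# A single column gives a 1-D array.
col = df.iloc[:, 0].to_numpy()
print(col.shape)              # (1000,)

res = np.digitize(col, bins)
```

If the frame really has only one column, `df.values.ravel()` should give the same 1-D array.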
At the most basic I have the following dataframe:
a = {'possibility' : np.array([1,2,3])}
b = {'possibility' : np.array([4,5,6])}
df = pd.DataFrame([a,b])
This gives me a dataframe of size 2x1, like so:
row 1: np.array([1,2,3])
row 2: np.array([4,5,6])
I have another vector of length 2. Like so:
[1,2]
These represent the index I want from each row.
So if I have [1,2] I want: from row 1: 2, and from row 2: 6.
Ideally, my output is [2,6] in a vector form, of length 2.
Is this possible? I can easily run through a for loop, but I am looking for FAST approaches, ideally vectorized ones, since the data is already in pandas/numpy.
For the actual use case, I need this to work in the 300k-400k row range, and it has to run inside optimization problems (hence the need for speed).
You could transform to a multi-dimensional numpy array and take_along_axis:
v = np.array([1,2])
a = np.vstack(df['possibility'])
np.take_along_axis(a.T, v[None], axis=0)[0]
output: array([2, 6])
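An equivalent way to pick one element per row, once the column has been stacked into a 2-D array, is integer (fancy) indexing with a row index and a column index. This is only a sketch of an alternative and assumes every row array has the same length.

```python
import numpy as np
import pandas as pd

a = {'possibility': np.array([1, 2, 3])}
b = {'possibility': np.array([4, 5, 6])}
df = pd.DataFrame([a, b])

v = np.array([1, 2])                              # index wanted from each row
stacked = np.vstack(df['possibility'].tolist())   # shape (2, 3)

# Pair row i with column v[i].
picked = stacked[np.arange(len(v)), v]
print(picked)
```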
I have a dataframe where one column consists of tuples, i.e
df['A'].values = array([(1,2), (5,6), (11,12)])
Now I want to split this into two different columns. A working solution is
df['A1'] = df['A'].apply(lambda x: x[0])
But this is extremely slow. On my Dataframe it takes multiple minutes. So I would like to vectorize this, to something like
df['A1'] = df['A'][:,0]
With pandas, or using numpy or anything. But all of them give me an error similar to
"*** KeyError: 'key of type tuple not found and not a MultiIndex'"
Is there any vectorized way? This feels like a super simple question and task, but I cannot find any working and properly vectorized function.
n: int = 2
df = pd.DataFrame(df["A"].apply(lambda x: (x[:n], x[n:])).tolist(), index=df.index)
You can also have a look at pandarallel.
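For the two-element tuples in the question, a common alternative is to turn the column into a list of tuples and let the DataFrame constructor split it. This is a sketch only: the column names `A1` and `A2` are chosen for illustration, and it has not been benchmarked against the real data.

```python
import pandas as pd

df = pd.DataFrame({"A": [(1, 2), (5, 6), (11, 12)]})

# Build a two-column frame from the tuples, keeping the original index.
parts = pd.DataFrame(df["A"].tolist(), index=df.index, columns=["A1", "A2"])
df = df.join(parts)
print(df)
```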
I'll do it in numpy and skip over the pandas bits.
You can get a decent speedup using np.fromiter together with either itertools.chain.from_iterable to extract everything in one go or operator.itemgetter for individual columns.
import operator as op
import itertools as it
import numpy as np
a = [*zip(range(10000),range(10000,20000))]
A = np.empty(10000,object)
A[...] = a
A
# array([(0, 10000), (1, 10001), (2, 10002), ..., (9997, 19997),
# (9998, 19998), (9999, 19999)], dtype=object)
(*np.fromiter(it.chain.from_iterable(A),int,len(A[0])*A.size).reshape(A.size,-1).T,)
# (array([ 0, 1, 2, ..., 9997, 9998, 9999]), array([10000, 10001,
# 10002, ..., 19997, 19998, 19999]))
np.fromiter(map(op.itemgetter(0),A),int,A.size)
# array([ 0, 1, 2, ..., 9997, 9998, 9999])
I know this is a relatively common topic on stackoverflow but I couldn't find the answer I was looking for. Basically, I am trying to make very efficient code (I have rather large data sets) to get certain columns of data from a matrix. Below is what I have so far. It gives me this error: could not broadcast input array from shape (2947,1) into shape (2947)
def get_data(self, colHeaders):
    temp = np.zeros((self.matrix_data.shape[0],len(colHeaders)))
    for col in colHeaders:
        index = self.header2matrix[col]
        temp[:,index:] = self.matrix_data[:,index]
    data = np.matrix(temp)
    return temp
Maybe this simple example will help:
In [70]: data=np.arange(12).reshape(3,4)
In [71]: header={'a':0,'b':1,'c':2}
In [72]: col=['c','a']
In [73]: index=[header[i] for i in col]
In [74]: index
Out[74]: [2, 0]
In [75]: data[:,index]
Out[75]:
array([[ 2, 0],
[ 6, 4],
[10, 8]])
data is some sort of 2D array, header is a dictionary mapping names to column numbers. Using the input col, I construct a column index list. You can select all columns at once, rather than one by one.
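Applied to the shape of the function in the question, the same idea might look like the sketch below. `get_columns` is a hypothetical standalone helper; the names `matrix_data` and `header2matrix` are borrowed from the question.

```python
import numpy as np

def get_columns(matrix_data, header2matrix, col_headers):
    """Return the requested columns, in the order given, as one 2-D array."""
    # Translate header names to column numbers, then index once.
    index = [header2matrix[name] for name in col_headers]
    return matrix_data[:, index]

data = np.arange(12).reshape(3, 4)
header = {'a': 0, 'b': 1, 'c': 2}
selected = get_columns(data, header, ['c', 'a'])
print(selected)
```

Indexing with a list returns a copy, so no preallocated `temp` array and no per-column assignment are needed.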
I've a numpy.ndarray the columns of which I'd like to access. I will be taking all columns after 8 and testing them for variance, removing the column if the variance/average is low. In order to do this, I need access to the columns, preferably with Numpy. By my current methods, I encounter errors or failure to transpose.
To mine these arrays, I am using the IOPro adapter, which gives a regular numpy.ndarray.
import iopro
import sys
adapter = iopro.text_adapter(sys.argv[1], parser='csv')
all_data = adapter[:]
z_matrix = adapter[range(8,len(all_data[0]))][1:3]
print type(z_matrix) #check type
print z_matrix # print array
print z_matrix.transpose() # attempt transpose (fails)
print z_matrix[:,0] # attempt access by column (fails)
Can someone explain what is happening?
The output is this:
<type 'numpy.ndarray'>
[ (18.712, 64.903, -10.205, -1.346, 0.319, -0.654, 1.52398, 114.495, -75.2488, 1.52184, 111.31, 175.408, 1.52256, 111.699, -128.141, 1.49227, 111.985, -138.173)
(17.679, 48.015, -3.152, 0.848, 1.239, -0.3, 1.52975, 113.963, -50.0622, 1.52708, 112.335, -57.4621, 1.52603, 111.685, -161.098, 1.49204, 113.406, -66.5854)]
[ (18.712, 64.903, -10.205, -1.346, 0.319, -0.654, 1.52398, 114.495, -75.2488, 1.52184, 111.31, 175.408, 1.52256, 111.699, -128.141, 1.49227, 111.985, -138.173)
(17.679, 48.015, -3.152, 0.848, 1.239, -0.3, 1.52975, 113.963, -50.0622, 1.52708, 112.335, -57.4621, 1.52603, 111.685, -161.098, 1.49204, 113.406, -66.5854)]
Traceback (most recent call last):
File "z-matrix-filtering.py", line 11, in <module>
print z_matrix[:,0]
IndexError: too many indices
What is going wrong? Is there a better way to access the columns? I will be reading all lines of a file, testing all columns from the 8th for significant variance, removing any columns that don't vary significantly, and then reprinting the result as a new CSV.
EDIT:
Based on responses, I have created the following very ugly and I think inane approach.
all_data = adapter[:]
z_matrix = []
for line in all_data:
    to_append = []
    for column in range(8,len(all_data.dtype)):
        to_append.append(line[column].astype(np.float16))
    z_matrix.append(to_append)
z_matrix = np.array(z_matrix)
The reason that the columns must be directly accessed is that there is a String inside the data. If this string is not circumvented in some way, an error will be thrown about a void-array with object members using buffer error.
Is there a better solution? This seems terrible, and it seems it will be inefficient for several gigabytes of data.
Notice that the output of print z_matrix has the form
[ (18.712, 64.903, ..., -138.173)
(17.679, 48.015, ..., -66.5854)]
That is, it is printed as a list of tuples. That is the output you get when the array is a "structured array". It is a one-dimensional array of structures. Each "element" in the array has 18 fields. The error occurs because you are trying to index a 1-D array as if it were 2-D; z_matrix[:,0] won't work.
Print the data type of the array to see the details. E.g.
print z_matrix.dtype
That should show the names of the fields and their individual data types.
You can get one of the elements as, for example, z_matrix[k] (where k is an integer), or you can access a "column" (really a field of the structured array) as z_matrix['name'] (change 'name' to one of the fields in the dtype).
If the fields all have the same data type (which looks like the case here--each field has type np.float64), you can create a 2-D view of the data by reshaping the result of the view method. For example:
z_2d = z_matrix.view(np.float64).reshape(-1, len(z_matrix.dtype.names))
Another way to get the data by column number rather than name is:
col = 8 # The column number (zero-based).
col_data = z_matrix[z_matrix.dtype.names[col]]
For more about structured arrays, see http://docs.scipy.org/doc/numpy/user/basics.rec.html.
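The points above can be tried on a small structured array. This is a sketch with a made-up three-field dtype (`f0`, `f1`, `f2`), not the 18-field array from the question.

```python
import numpy as np

# Hypothetical structured dtype: three float64 fields.
dt = np.dtype([('f0', np.float64), ('f1', np.float64), ('f2', np.float64)])
z = np.array([(1.0, 2.0, 3.0), (4.0, 5.0, 6.0)], dtype=dt)

print(z.shape)    # (2,): one dimension, each element is a record
print(z.dtype)

col_by_name = z['f1']                 # field access by name
col_by_pos = z[z.dtype.names[2]]      # field access by position

# All fields share one type, so a plain 2-D view is possible.
z_2d = z.view(np.float64).reshape(-1, len(z.dtype.names))
print(z_2d[:, 0])
```

The view only works when every field has the same type; a dtype containing a string field would need that field dropped or converted first.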
The display of z_matrix is consistent with it being shape (2,), a 1d array of tuples.
np.array([np.array(a) for a in z_matrix])
produces a (2,18) 2d array. You should be able to do your column tests on that.
Accessing the columns of a numpy array is easy. Here's a simple example that may help:
import numpy as n
A = n.array([[1, 2, 3], [5, 6, 7]])
A
# array([[1, 2, 3],
#        [5, 6, 7]])
A.T  # To obtain the transpose
# array([[1, 5],
#        [2, 6],
#        [3, 7]])
n.mean(A.T, axis=1)  # Mean of each row of A.T, i.e. the column-wise mean of A
# array([3., 4., 5.])
I hope this helps you perform your transpose and column-wise operations.
Right, perhaps I should be using the normal Python lists for this, but here goes:
I want a 9 by 4 multidimensional array/matrix (whatever really) that I want to store arrays in. These arrays will be 1-dimensional and of length 4096.
So, I want to be able to go something like
column = 0 #column to insert into
row = 7 #row to insert into
storageMatrix[column,row][0] = NEW_VALUE
storageMatrix[column,row][4092] = NEW_VALUE_2
etc..
I appreciate I could be doing something a bit silly/unnecessary here, but it will make it a lot easier for me to have it structured like this in my code (as there are a lot of these, and a lot of analysis to be done later).
Thanks!
Note that to leverage the full power of numpy, you'd be much better off with a 3-dimensional numpy array. Breaking apart the 3-d array into a 2-d array with 1-d values may complicate your code and force you to use loops instead of built-in numpy functions.
It may be worth investing the time to refactor your code to use the superior 3-d numpy arrays.
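As a rough sketch of the 3-d layout, assuming the stored values are floats and using the (4, 9) index order with 4096 samples per cell:

```python
import numpy as np

# One block of memory: 4 x 9 cells, each holding 4096 values.
storage = np.zeros((4, 9, 4096))

column, row = 0, 7
storage[column, row, 0] = 1.5
storage[column, row, 4092] = 2.5

# storage[column, row] is still a 1-d array of length 4096.
print(storage[column, row].shape)

# Whole-array operations need no Python loop, e.g. the mean of every cell.
means = storage.mean(axis=2)
print(means.shape)
```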
However, if that's not an option, then:
import numpy as np
storageMatrix=np.empty((4,9),dtype='object')
By setting the dtype to 'object', we are telling numpy to allow each element of storageMatrix to be an arbitrary Python object.
Now you must initialize each element of the numpy array to be a 1-d numpy array:
storageMatrix[column,row]=np.arange(4096)
And then you can access the array elements like this:
storageMatrix[column,row][0] = 1
storageMatrix[column,row][4092] = 2
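Since every cell starts out as `None`, each one needs its own array before it can be indexed. One way to fill them all is sketched below; it uses `np.zeros` as a placeholder initial value, which is an assumption rather than anything the question specifies.

```python
import numpy as np

storage_matrix = np.empty((4, 9), dtype=object)

# Give every cell its own independent 1-d array.
for idx in np.ndindex(storage_matrix.shape):
    storage_matrix[idx] = np.zeros(4096)

column, row = 0, 7
storage_matrix[column, row][0] = 1
storage_matrix[column, row][4092] = 2
print(storage_matrix[column, row][4092])
```

Creating a new array inside the loop matters: assigning one shared array to every cell would make all cells change together.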
The Tentative NumPy Tutorial says you can create a 2D array by passing a shape tuple:
x = ones( (3,4) )
and index into a 2D array like this:
>>> x[1,2] = 20
>>> x[1,:] # x's second row
array([ 1, 1, 20, 1])
>>> x[0] = a # change first row of x
>>> x
array([[10, 20, -7, -3],
[ 1, 1, 20, 1],
[ 1, 1, 1, 1]])
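A self-contained version of that snippet is sketched below. Two things are assumptions here because the excerpt leaves them out: the integer dtype (inferred from the integer output) and the contents of `a` (inferred from the final first row).

```python
import numpy as np

x = np.ones((3, 4), dtype=int)    # dtype=int is assumed from the printed output
x[1, 2] = 20
second_row = x[1, :]              # a view of x's second row
print(second_row)

a = np.array([10, 20, -7, -3])    # assumed; the excerpt does not define a
x[0] = a                          # replace the first row
print(x)
```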