I am very new to python and machine learning.
Let's say I have a 1D NumPy array (containing both numbers and NaNs) with 1308 entries, and I want to create two variables:
train_outcome = outcome[0:891, 0]
y_pred = outcome[891:, 0]
I tried this and got the expected IndexError: too many indices for array: array is 1-dimensional, but 2 were indexed.
Out of desperation I converted it back to a DataFrame to do the slicing. There must be an easier way to achieve this.
If the array has one dimension, there is no need for a second index, so drop the comma. Here is how I'd do it:
train_outcome = outcome[:891]
y_pred = outcome[891:]
Use np.split for a one-liner:
train_outcome, y_pred = np.split(outcome, [891])
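For example, a quick sanity check with a random stand-in for the real 1308-row data:

import numpy as np

outcome = np.random.rand(1308)                  # placeholder for the real array
train_outcome, y_pred = np.split(outcome, [891])
print(train_outcome.shape, y_pred.shape)        # (891,) (417,)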
Here is an example of my error. Say I created a NumPy array
X = np.zeros((1000, 50))
where 1000 is the number of features (rows) and 50 is the number of examples (columns).
Since I am adding examples one by one, I have to replace the columns of the array one at a time to get the final feature array. I tried this:
X[:,i] = example
where example is of shape (1000, 1) and i is incremented for every example. This does not work because X[:,i] has shape (1000,), a rank-1 array. How do I write it so that each example replaces a column of the X array without throwing the broadcast error? Thank you.
Reshape your vector before assigning it.
X[:,i] = example.reshape(-1,)
This drops the second dimension and turns example into shape (1000,), which matches X[:,i].
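A minimal sketch of the full loop, using random stand-ins for the incoming examples:

import numpy as np

X = np.zeros((1000, 50))
for i in range(50):
    example = np.random.rand(1000, 1)   # placeholder for each real example
    X[:, i] = example.reshape(-1)       # drop to shape (1000,) before assigning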
Alternatively, instead of assigning one column at a time in a loop, you can collect all of your arrays in a list, call np.array on the list, and transpose the result so they become columns. This works especially well if you can build the list of arrays in a list comprehension.
Example:
arrs = [np.random.randint(10, size=5) for _ in range(5)]
X = np.array(arrs).T
I have a 3-dimensional NumPy array of shape 333 × 333 × 52.
I have a 333-element list of indices ranging from 0 to 332, e.g. [4 12 332 0 ...], that I wish to use to rearrange the first two dimensions of the 3D array.
In matlab I would do:
rearranged_array = original_array(new_order, new_order, :)
But this approach does not work with numpy:
rearranged_array = original_array[new_order, new_order, :]
Produces a 333*52 array
While:
rearranged_array = original_array[new_order][new_order, :]
Does not get things in the right order
Edit:
This seems to work:
rearranged_array = original_array[new_order, :][:, new_order]
This seems a lot less intuitive to me than the matlab method - are there any better ways?
Your attempt
rearranged_array = original_array[new_order][new_order, :]
just applies the same reordering to the first axis twice.
You want
rearranged_array = original_array[new_order][:, new_order]
The reason your first attempt doesn't work is that passing several index arrays at once triggers elementwise (broadcast) advanced indexing: original_array[new_order, new_order, :] pairs the two arrays up and selects the "diagonal" elements (new_order[i], new_order[i], :), which is why you get a 333 × 52 result instead of an independent reordering of both axes.
Another solution is np.ix_, which constructs index arrays that broadcast into the outer-product form you want:
rearranged_array = original_array[np.ix_(new_order, new_order)]
N.B. you keep writing things like a[x, y, :] and a[x, :]. The trailing : is superfluous; a[x, y] and a[x] respectively do exactly the same thing.
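A small-scale check (3 × 3 × 2 instead of 333 × 333 × 52, with a made-up ordering) that the two working forms agree:

import numpy as np

original_array = np.arange(3 * 3 * 2).reshape(3, 3, 2)
new_order = [2, 0, 1]

a = original_array[new_order][:, new_order]       # two-step reordering
b = original_array[np.ix_(new_order, new_order)]  # one-step via np.ix_
print(np.array_equal(a, b))                       # True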
Matlab Code:
AP(queryIdx) = diff([0;recall]')*prec
My python code:
AP[queryIdx] = np.dot(np.diff(np.concatenate(([[0]], recall), axis=0).transpose()),prec)
Variables (checked, and I'm quite sure they are equivalent in Python and in MATLAB):
recall: 1000×1 np array*
prec: 1000×1 np array
* prints out as [[.],.....,[.]]
Results:
MATLAB: 0.1011
Python: 0.05263158
The only cause I can think of, outside of the code itself, is that Python uses more precision, but I doubt that would make such a large difference.
Edit: there was a problem with my prec variable; the above code worked.
That code looks a bit messy. Try replacing it with this:
AP[queryIdx] = np.dot(np.diff(np.hstack([0, recall.ravel()])), prec.ravel())
In your post, you mentioned that you have a 1000 × 1 array for both recall and prec. I read that as a 2D array with a singleton second dimension, so you need to convert each back to a 1D array using ravel.
np.hstack horizontally stacks 1D arrays together, so this prepends a 0 to recall, then applies the diff operator, and finally takes the dot product with prec.
One common gotcha for MATLAB coders coming to NumPy is the representation of 1D arrays. There is no such thing as the transpose of a 1D array: a 1D array is neither a row nor a column vector, and .T on it is a no-op. If you explicitly want a column vector, you need to add a second dimension of length 1:
r = v[:, None]
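To see the gotcha in action:

import numpy as np

v = np.array([1, 2, 3])
print(v.T.shape)         # (3,)  -- transposing a 1D array is a no-op
print(v[:, None].shape)  # (3, 1) -- an explicit column vector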
In any case, let's verify the results:
MATLAB
>> recall = (1:1000).';
>> prec = (1000:-1:1).';
>> diff([0; recall].')*prec
ans =
500500
Python (IPython)
In [1]: import numpy as np
In [2]: recall = np.arange(1,1001)
In [3]: prec = np.arange(1000,0,-1)
In [4]: np.dot(np.diff(np.hstack([0, recall.ravel()])), prec.ravel())
Out[4]: 500500
I have an np.ndarray with numbers that indicate spots of interest; I am interested in the spots with values 1 and 9.
Right now they are being extracted as such:
maskindex.append(np.where(extract.variables['mask'][0] == 1) or np.where(megadatalist[0].variables['mask'][0] == 9))
xval = maskindex[0][1]
yval = maskindex[0][0]
I need to apply these x and y values to the arrays that I am operating on, to speed things up.
I have 140 arrays, each 734 × 1468, and I need the mean, max, min, and std calculated for each field. I was hoping there was an easy way to apply the mask to speed up the operations; right now I am simply operating on the entire arrays, as such:
Average_List = np.mean([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Average_Error_List = np.mean([megadatalist[i].variables['analysis_error'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Std_List = np.std([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)], axis=0)
Maximum_List = np.maximum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Minimum_List = np.minimum.reduce([megadatalist[i].variables['analysed_sst'][0] for i in range(0,Numbers_of_datasets)])
Any ideas on how to speed things up would be highly appreciated
I may have solved it partially, depending on what you're aiming for. The following code reduces an array arr to a 1D array containing only the relevant indices. You can then do the needed calculations without considering the unwanted locations:
arr = np.array([[0,9,9,0,0,9,9,1],[9,0,1,9,0,0,0,1]])
target = [1,9] # wanted values
index = np.where(np.in1d(arr.ravel(), target).reshape(arr.shape))
no_zeros = arr[index]
At this stage, "all you need" is to reinsert the no_zeros values into an array of zeros of the appropriate shape, at the positions given in index. There is no need to flatten the array and recompute flat indices: fancy indexing works on the left-hand side of an assignment too, so you can create the zero array with np.zeros(arr.shape) and assign result[index] = no_zeros directly. (Reshaping, for what it's worth, is constant time in NumPy.)
Hope it helps.
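A sketch of the whole round trip on the toy array, including stats computed over just the selected values (the real 734 × 1468 fields would be handled the same way):

import numpy as np

arr = np.array([[0, 9, 9, 0, 0, 9, 9, 1],
                [9, 0, 1, 9, 0, 0, 0, 1]])
target = [1, 9]
index = np.where(np.in1d(arr.ravel(), target).reshape(arr.shape))
no_zeros = arr[index]

result = np.zeros(arr.shape, dtype=arr.dtype)
result[index] = no_zeros    # reinsert the values at their original positions

print(no_zeros.mean(), no_zeros.max(), no_zeros.min(), no_zeros.std())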
I have a 3D NumPy array and would like to take the mean over one axis, considering only certain elements from the other two dimensions.
This is an example code depicting my problem:
import numpy as np
myarray = np.random.random((5,10,30))
yy = [1,2,3,4]
xx = [20,21,22,23,24,25,26,27,28,29]
mymean = [ np.mean(myarray[t,yy,xx]) for t in np.arange(5) ]
However, this results in:
ValueError: shape mismatch: objects cannot be broadcast to a single shape
Why does indexing like myarray[:,[1,2,3,4],[1,2,3,4]] work, but not my code above?
This is how you fancy-index over more than one dimension:
>>> np.mean(myarray[np.arange(5)[:, None, None], np.array(yy)[:, None], xx],
axis=(-1, -2))
array([ 0.49482768, 0.53013301, 0.4485054 , 0.49516017, 0.47034123])
When you use fancy indexing, i.e. a list or array as an index, over more than one dimension, NumPy broadcasts those arrays to a common shape and uses the broadcast result to index the array. You need to add the extra length-1 dimensions at the end of the earlier indexing arrays for the broadcast to line up properly; the combination follows NumPy's standard broadcasting rules.
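If the manual [:, None] reshaping is hard to read, np.ix_ builds the same broadcastable index arrays for you (it inserts the length-1 dimensions itself):

import numpy as np

myarray = np.random.random((5, 10, 30))
yy = [1, 2, 3, 4]
xx = [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]

mymean = np.mean(myarray[np.ix_(np.arange(5), yy, xx)], axis=(-1, -2))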
Since you use consecutive elements you can use a slice:
import numpy as np
myarray = np.random.random((5,10,30))
yy = slice(1,5)
xx = slice(20, 30)
mymean = [np.mean(myarray[t, yy, xx]) for t in np.arange(5)]
To answer your question about why it doesn't work: when you use lists/arrays as indices, Numpy uses a different set of indexing semantics than it does if you use slices. You can see the full story in the documentation and, as that page says, it "can be somewhat mind-boggling".
If you want to do it for nonconsecutive elements, you must grok that complex indexing mechanism.
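As a quick check that the slice version and the broadcast fancy-indexing version agree:

import numpy as np

myarray = np.random.random((5, 10, 30))
via_slice = [np.mean(myarray[t, 1:5, 20:30]) for t in np.arange(5)]
via_fancy = np.mean(
    myarray[np.arange(5)[:, None, None],
            np.array([1, 2, 3, 4])[:, None],
            [20, 21, 22, 23, 24, 25, 26, 27, 28, 29]],
    axis=(-1, -2))
print(np.allclose(via_slice, via_fancy))  # True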