How to delete multiple values from a numpy matrix at low computational cost - python

I've recently started using numpy, and I'm looking for a way to delete the rows of a matrix whose value in column 2 equals any of the values stored in the variable element.
Since I have a large amount of data, I would like to know whether there is a more efficient method that takes less time to execute than a plain for loop.
Here is an example:
element = [ 85., 222., 166., 238.]
matrix = [[228., 1., 222.],
[140., 0., 85.],
[140., 0., 104.],
[230., 0., 217.],
[115., 1., 250.],
[12., 1., 166.],
[181., 1., 238.]]
the output:
matrix = [[140., 0., 104.],
[230., 0., 217.],
[115., 1., 250.]]
The method I used is the following:
for y in element:
    matrix = matrix[matrix[:, 2] != y]
Running this on a large amount of data takes a long time. Is there a more efficient approach that reduces the execution time?

Since you tagged numpy, I assume matrix is a numpy array (the conversion below is harmless if it already is). With that, you can use np.isin for your purpose:
import numpy as np
matrix = np.array(matrix)
matrix[~np.isin(matrix[:, 2], element)]
Output:
array([[140., 0., 104.],
[230., 0., 217.],
[115., 1., 250.]])
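For reference, here is a self-contained version of the same idea using the data from the question. It is only a sketch: the names keep and filtered are mine, and it assumes the values in column 2 can be compared with exact equality (for floats that come out of a computation, a tolerance-based comparison may be needed instead).

```python
import numpy as np

element = np.array([85., 222., 166., 238.])
matrix = np.array([[228., 1., 222.],
                   [140., 0., 85.],
                   [140., 0., 104.],
                   [230., 0., 217.],
                   [115., 1., 250.],
                   [12., 1., 166.],
                   [181., 1., 238.]])

# Boolean mask: True for rows whose column-2 value is NOT in element.
keep = ~np.isin(matrix[:, 2], element)

# A single boolean-indexing step replaces the Python-level loop.
filtered = matrix[keep]
print(filtered)
```

The loop in the question copies the array once per value in element, whereas this builds one mask and copies once.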

Related

Calculating Confusion Matrix by Using the Array of Arrays

I am using the transformers and datasets libraries to train a multi-class NLP model on a very specific dataset, and I need to get an idea of how my model performs for each label. So I'd like to calculate the confusion matrix. I have 4 labels. My result.prediction looks like
array([[ -6.906 , -8.11 , -10.29 , 6.242 ],
[ -4.51 , 3.705 , -9.76 , -7.49 ],
[ -6.734 , 3.36 , -10.27 , -6.883 ],
...,
[ 8.41 , -9.43 , -9.45 , -8.6 ],
[ 1.3125, -3.094 , -11.016 , -9.31 ],
[ -7.152 , -8.5 , -9.13 , 6.766 ]], dtype=float16)
Here, when a predicted value is positive the model predicts 1; otherwise it predicts 0. Next, my result.label_ids looks like
array([[0., 0., 0., 1.],
[1., 0., 0., 0.],
[0., 0., 0., 1.],
...,
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 0., 1.]], dtype=float32)
As you can see, the model returns an array of 4 values per sample, with 0 for false labels and 1 for true labels.
I have generally been using the following function to calculate the confusion matrix, but it does not work in this case because it expects 1-dimensional arrays.
import numpy as np
def compute_confusion_matrix(labels, true, pred):
    K = len(labels)  # Number of classes
    result = np.zeros((K, K))
    for i in range(len(true)):  # one iteration per sample
        result[true[i]][pred[i]] += 1
    return result
If possible, I'd like to modify this function to suit the case above. At the very least, I would like to understand how I can build a confusion matrix for results that come as multi-dimensional arrays.
One possibility is to reverse the encoding into the format required by compute_confusion_matrix; that way you can still use your function.
To convert the encoding, you can do:
pred = list(np.where(result.label_ids == 1.)[1])
where np.where(result.label_ids == 1.)[1] is a 1-dimensional numpy array containing the column index of each 1. in result.label_ids, in row order.
With your result.label_ids, pred will look like this:
[3, 0, 3, ..., 0, 0, 3]
This has the same format as the original true (if true is also one-hot encoded, the same strategy can be used to convert it), so it can be used as input to your function for computing the confusion matrix.
First of all I would like to thank Nicola Fanelli for the idea.
The function I gave above as well as the sklearn.metrics.confusion_matrix() both need to be provided a list of predicted and true values. After my prediction step, I try to retrieve my true and predicted values in order to calculate a confusion matrix. The results I was getting are in the following form
array([[0., 0., 0., 1.],
[1., 0., 0., 0.],
[0., 0., 0., 1.],
...,
[1., 0., 0., 0.],
[1., 0., 0., 0.],
[0., 0., 0., 1.]], dtype=float32)
Here the idea is to retrieve the positional index of the value 1. When I tried the approach suggested by Nicola Fanelli, the resulting arrays were shorter than the original ones, so the sizes did not match and the confusion matrix could not be calculated. To be honest, I couldn't find the reason behind it, but I'll investigate that later.
So I used a different technique to implement the same idea: I used np.argmax() and appended the positions to a new list. Here is the code sample for the true values:
true = []
for i in range(len(result.label_ids)):
    n = np.array(result.label_ids[i])
    true.append(np.argmax(n))
This way I got the results in the desired format without the sizes changing.
Even though this is a working solution for my problem, I am still open to more elegant ways to approach this problem.
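A possibly more compact variant of the same idea is to let np.argmax work on the whole array at once with axis=1, which always returns exactly one index per row. The sketch below uses small made-up arrays in place of result.label_ids and result.prediction, and it assumes each sample belongs to exactly one class, so the predicted class is taken as the largest score rather than by thresholding at zero. The confusion matrix is then filled with np.add.at instead of a Python loop.

```python
import numpy as np

# Made-up stand-ins for result.label_ids and result.prediction.
label_ids = np.array([[0., 0., 0., 1.],
                      [1., 0., 0., 0.],
                      [0., 0., 0., 1.],
                      [1., 0., 0., 0.]], dtype=np.float32)
predictions = np.array([[-6.9, -8.1, -10.3, 6.2],
                        [-4.5, 3.7, -9.8, -7.5],
                        [-6.7, 3.4, -10.3, -6.9],
                        [8.4, -9.4, -9.5, -8.6]], dtype=np.float16)

# One class index per row, so the lengths always match the number of samples.
true = label_ids.argmax(axis=1)
pred = predictions.argmax(axis=1)

# Rows are true classes, columns are predicted classes.
num_classes = label_ids.shape[1]
confusion = np.zeros((num_classes, num_classes), dtype=int)
np.add.at(confusion, (true, pred), 1)  # handles repeated (true, pred) pairs
print(confusion)
```

The resulting true and pred arrays can also be passed to sklearn.metrics.confusion_matrix() if you prefer that over a hand-written function.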

Finding the maximum non-zero matrix in python

Suppose we have a matrix:
a = array([[ 2., 3., 0., 0., 0.],
[ 0., 4., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.],
[ 0., 0., 0., 0., 0.]])
What is the best way to find the maximum non-zero matrix (i.e. a matrix which is not fully zero) spanning all the non-zero elements, like
[[2.,3.],
[0.,4.]]
I've gone through numpy.nonzero, which gives the indices of the non-zero elements, but how can I use it efficiently to get the expected matrix?
The matrix must be square. I've come up with this for now:
a[:np.nonzero(a)[0][-1]+1,:np.nonzero(a)[1][-1]+1]
It works, but it does not seem elegant. It also won't work if the non-zero block does not start at index 0, as in:
[[0,0,2,3,0],
[0,0,0,4,0],
[0,0,0,0,0],
[0,0,0,0,0],
[0,0,0,0,0]]
here the expected output is,
[[2,3],
[0,4]]
The second case does not work because your slice always starts at 0,0: you only specify the end of each range.
The following uses the minimum index from np.nonzero as the start of the range and the maximum index as the end, along both axes. It is therefore guaranteed to include all non-zero elements:
a[np.min(np.nonzero(a)[0]):np.max(np.nonzero(a)[0])+1,
np.min(np.nonzero(a)[1]):np.max(np.nonzero(a)[1])+1]
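The same slicing can be wrapped in a small helper so that np.nonzero is called only once. This is a sketch: the function name nonzero_bounding_box is mine, and the empty-result behaviour for an all-zero input is an assumption about what you would want. Note that it returns the bounding box of the non-zero entries, which is not necessarily square.

```python
import numpy as np

def nonzero_bounding_box(a):
    """Return the smallest sub-array of a that contains every non-zero entry."""
    rows, cols = np.nonzero(a)       # row and column indices of non-zero entries
    if rows.size == 0:               # all-zero input: return an empty slice
        return a[:0, :0]
    return a[rows.min():rows.max() + 1, cols.min():cols.max() + 1]

a = np.array([[0, 0, 2, 3, 0],
              [0, 0, 0, 4, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0]])
box = nonzero_bounding_box(a)
print(box)
```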

How does this Python memory optimization work?

OpenAI has published a set of Machine Learning/Reinforcement Learning environments called 'Open AI Gym'. Some of the environments are image based, and as such can potentially have a very large memory footprint when used with algorithms that store 100 000s or millions of frames worth of environment observations.
While poking around in their reference implementation of DeepQ Learning I found a pair of classes, LazyFrameStack and LazyFrames that claim to "ensure that common frames between the observations are only stored once... to optimize memory usage which can be huge for DQN's 1M frames replay buffers."
In the reference implementation, the DeepQ agent gets frames stacked together in groups of four, which are then put into the replay buffer. Having looked at the implementation of both classes, it's not obvious to me how these save memory -- if anything, because LazyFrames is basically a container object around a set of four numpy arrays, shouldn't a LazyFrame have a larger memory footprint?
In Python, objects are passed by reference. That means that even though a LazyFrames object might hold a list of extremely big numpy arrays, the LazyFrames object itself is small, since it only stores references to the np.ndarrays. In other words, you can think of LazyFrames as just pointing to the np.ndarray data, not storing its own copy of each array.
import numpy as np
a = np.ones((2,3))
b = np.ones((2,3))
X = [a, b]
print(X)
>>> [array([[1., 1., 1.],
[1., 1., 1.]]),
array([[1., 1., 1.],
[1., 1., 1.]])]
X_stacked = np.stack(X)
print(X_stacked)
>>> array([[[1., 1., 1.],
[1., 1., 1.]],
[[1., 1., 1.],
[1., 1., 1.]]])
a[0] = 2
print(X)
>>> [array([[2., 2., 2.],
[1., 1., 1.]]),
array([[1., 1., 1.],
[1., 1., 1.]])]
print(X_stacked)
>>> array([[[1., 1., 1.],
[1., 1., 1.]],
[[1., 1., 1.],
[1., 1., 1.]]])
As you can see here, X (which is a list of arrays) stores only the reference to a and b, thus when we do a[0] = 2, the change can be seen by printing X. But once you stack the arrays, you actually create a new array with that much memory.
To address your "how does it save memory" question a bit more directly, here's an example.
import sys
a = np.random.randn(210, 160, 3)
b = np.random.randn(210, 160, 3)
X = [a,b]
X_stacked = np.stack(X)
print(sys.getsizeof(X))
>>> 80
print(sys.getsizeof(X_stacked))
>>> 1612944
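To connect this to the overlapping frame stacks in the question, here is a rough sketch of why holding references helps when consecutive observations share frames. LazyStack is a hypothetical, much simplified stand-in written for this illustration, not the actual reference implementation, and the 84x84 uint8 frame shape is just an assumed example.

```python
import numpy as np

class LazyStack:
    """Hypothetical simplified container: keeps references, copies only on demand."""
    def __init__(self, frames):
        self._frames = list(frames)   # stores references to the arrays, no copies

    def to_array(self):
        return np.stack(self._frames)  # the copy is made only here

# Five distinct frames; two consecutive observations of four frames each.
frames = [np.full((84, 84), i, dtype=np.uint8) for i in range(5)]
obs_t = LazyStack(frames[0:4])
obs_t1 = LazyStack(frames[1:5])

# Three of the four frames are the very same objects in both observations.
shared = sum(f is g for f in obs_t._frames for g in obs_t1._frames)

# Eagerly stacking every observation stores the overlapping frames twice.
eager_bytes = obs_t.to_array().nbytes + obs_t1.to_array().nbytes
lazy_bytes = sum(f.nbytes for f in frames)
print(shared, eager_bytes, lazy_bytes)
```

With a stack of four frames, storing stacked copies in a replay buffer costs roughly four times the memory of storing each frame once and referencing it from every observation that uses it.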

Recommended way to create a matrix containing strings in python

I need to write a program that collects different datasets and merges them. For this I have to read in a comma-separated matrix in which each row represents an instance (in this case a protein) and each column represents an attribute of the instances. If an instance has an attribute, it is represented by a 1, otherwise by a 0. The matrix looks like the example given below, but is much larger, with 35000 instances and hundreds of attributes.
Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
Protein 4,1,1,1,0
Protein 5,0,0,0,0
Protein 6,1,1,1,1
I need a way to store the matrix before writing it into a new file together with other information about the instances. I thought of using numpy arrays, since I would like to be able to select and check single columns. I tried to use numpy.empty to create an array of the given size, but it seems that you have to fix the length of the strings in advance and cannot change it afterwards.
Is there a better way to deal with such data? I also thought of dictionaries of lists, but then I cannot select single columns.
You can use numpy.loadtxt, for example:
import numpy as np
a = np.loadtxt(filename, delimiter=',',usecols=(1,2,3,4),
skiprows=1, dtype=float)
Which will result in something like:
#array([[ 1., 1., 1., 0.],
# [ 0., 1., 0., 1.],
# [ 1., 0., 0., 0.],
# [ 1., 1., 1., 0.],
# [ 0., 0., 0., 0.],
# [ 1., 1., 1., 1.]])
Or, using structured arrays:
a = np.loadtxt('stack.txt', delimiter=',',usecols=(1,2,3,4),
skiprows=1, dtype=[('Attribute 1', float),
('Attribute 2', float),
('Attribute 3', float),
('Attribute 4', float)])
from where you can get each field like:
a['Attribute 1']
#array([ 1., 0., 1., 1., 0., 1.])
Take a look at pandas.
pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language.
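You could use genfromtxt instead:
data = np.genfromtxt('file.txt', delimiter=',', names=True, dtype=None)
With names=True and dtype=None, this will create a structured array (aka record array) of your table, with the field names taken from the header row.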
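If you also need the protein names, one option is to read them separately from the 0/1 values, so that the numeric part stays a plain integer array that is easy to slice by column. The following is a sketch using the sample data from the question; io.StringIO stands in for the real file, and the variable names are mine.

```python
import io
import numpy as np

csv_text = """Proteins,Attribute 1,Attribute 2,Attribute 3,Attribute 4
Protein 1,1,1,1,0
Protein 2,0,1,0,1
Protein 3,1,0,0,0
"""

# Column 0 holds the row labels; read it as strings.
names = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1,
                   usecols=0, dtype=str)

# The remaining columns are the 0/1 attribute flags.
values = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1,
                    usecols=(1, 2, 3, 4), dtype=int)

# Select a single column and use it to pick the matching proteins.
has_attribute_2 = names[values[:, 1] == 1]
print(has_attribute_2)
```

Because names and values keep the same row order, a boolean mask computed on one can be used to index the other.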

How do I add a column to a python (matrix) multi-dimensional array? [duplicate]

Possible Duplicate:
What's the simplest way to extend a numpy array in 2 dimensions?
As a Matlab user switching over to python, I've been frustrated because I don't know all the tricks and get stuck hacking code together until it works. Below is an example where I have a matrix that I want to add a dummy column to. Surely there is a simpler way than the zip vstack zip method below. It works, but it is totally a noob attempt. Please enlighten me. Thank you in advance for taking the time for this tutorial.
# BEGIN CODE
from pylab import *
# Unlike most things in python, I find I must build a dummy matrix
# to add stuff in a for loop.
H = ones((4,10-1))
print("shape(H):")
print(shape(H))
print(H)
### enter for loop to populate dummy matrix with interesting data...
# stuff happens in the for loop, which is awesome and off topic.
### exit for loop
# more awesome stuff happens...
# Now I need a new column on H
H = list(zip(*vstack((list(zip(*H)), ones(4))))) # THIS SEEMS LIKE THE DUMB WAY TO DO THIS...
print("shape(H):")
print(shape(H))
print(H)
# in conclusion. I found a hack job solution to adding a column
# to a numpy matrix, but I'm not happy with it.
# Could someone educate me on the better way to do this?
# END CODE
Use np.column_stack:
In [12]: import numpy as np
In [13]: H = np.ones((4,10-1))
In [14]: x = np.ones(4)
In [15]: np.column_stack((H,x))
Out[15]:
array([[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
[ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])
In [16]: np.column_stack((H,x)).shape
Out[16]: (4, 10)
There are several functions that let you concatenate arrays in different dimensions:
np.vstack along axis=0
np.hstack along axis=1
np.dstack along axis=2
In your case, np.hstack looks like what you want, provided the new column is given as a 2D array of shape (n, 1). np.column_stack stacks a set of 1D arrays as the columns of a 2D array, but you already have a 2D array to start with.
Of course, nothing prevents you from doing it the hard way:
>>> new = np.empty((a.shape[0], a.shape[1]+1), dtype=a.dtype)
>>> new.T[:a.shape[1]] = a.T
Here, we created an empty array with an extra column, then used some tricks to set the first columns to a (using the transpose operator T, so that new.T has an extra row compared to a.T...)
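To compare the approaches from the answers side by side, here is a short sketch. H and x follow the earlier example, except that x holds zeros here so the added column is easy to spot. Note that np.hstack needs the 1D vector reshaped into a column first, while np.column_stack does that for you.

```python
import numpy as np

H = np.ones((4, 9))
x = np.zeros(4)

# column_stack treats the 1D array x as a column.
via_column_stack = np.column_stack((H, x))

# hstack requires x to be 2D with shape (4, 1).
via_hstack = np.hstack((H, x[:, None]))

# Preallocation: create the target array, then fill both parts.
prealloc = np.empty((H.shape[0], H.shape[1] + 1), dtype=H.dtype)
prealloc[:, :-1] = H   # the original columns
prealloc[:, -1] = x    # the new last column

print(via_column_stack.shape)
```

All three produce the same (4, 10) array; the preallocated version mainly pays off when you know the final size up front and fill the array in pieces.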
