I have a 2D Python array, from which I would like to remove certain columns, but I don't know how many I would like to remove until the code runs.
I want to loop over the columns in the original array, and if the sum of the rows in any one column is above a certain value, I want to remove the whole column.
I started to do this the following way:
for i in range(original_number_of_columns):
    if sum(original_array[:, i]) < certain_value:
        new_array[:, new_index] = original_array[:, i]
        new_index += 1
But then I realised that I was going to have to define new_array first, and tell Python what size it is. But I don't know what size it is going to be beforehand.
I have got around it before by firstly looping over the columns to find out how many I will lose, then defining the new_array, and then lastly running the loop above - but obviously there will be much more efficient ways to do such things!
Thank you.
You can use the following:
import numpy as np
a = np.array([
    [1, 2, 3],
    [4, 5, 6],
    [7, 8, 9],
])
print(a.compress(a.sum(0) > 15, 1))  # keep the columns whose sum exceeds 15
[[3]
 [6]
 [9]]
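If you would rather stay close to the loop in the question, another option is to collect the surviving columns in a list and stack them at the end, so the size of the result never has to be known up front. A minimal sketch using the question's variable names, with certain_value chosen arbitrarily:

import numpy as np

original_array = np.array([[1, 2, 3],
                           [4, 5, 6],
                           [7, 8, 9]])
certain_value = 16  # column sums are 12, 15, 18

# keep only the columns whose sum is below the threshold
kept = [original_array[:, i]
        for i in range(original_array.shape[1])
        if original_array[:, i].sum() < certain_value]
new_array = np.column_stack(kept)
print(new_array)
# [[1 2]
#  [4 5]
#  [7 8]]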
Without numpy:
my_2d_table = [[...], [...], ...]
only_cols_that_sum_lt_x = [col for col in zip(*my_2d_table) if sum(col) < some_threshold]
new_table = list(map(list, zip(*only_cols_that_sum_lt_x)))
With numpy:
a = np.array(my_2d_table)
a[:, np.sum(a, 0) < some_threshold]
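For reference, here is the pure-Python variant run on a small concrete table (the threshold is just an illustrative value):

>>> my_2d_table = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
>>> some_threshold = 16  # column sums are 12, 15, 18
>>> only_cols_that_sum_lt_x = [col for col in zip(*my_2d_table) if sum(col) < some_threshold]
>>> list(map(list, zip(*only_cols_that_sum_lt_x)))
[[1, 2], [4, 5], [7, 8]]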
I suggest using numpy.compress.
>>> import numpy as np
>>> a = np.array([[1, 2, 3], [1, -3, 2], [4, 5, 7]])
>>> a
array([[ 1,  2,  3],
       [ 1, -3,  2],
       [ 4,  5,  7]])
>>> a.sum(axis=0) # sums each column
array([ 6, 4, 12])
>>> a.sum(0) < 5
array([ False, True, False], dtype=bool)
>>> a.compress(a.sum(0) < 5, axis=1)  # keeps only the columns whose entry in the condition array is True
array([[ 2],
       [-3],
       [ 5]])
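As a side note, when the condition is a full-length boolean mask, compress along axis=1 is equivalent to plain boolean column indexing, which some find more readable:

>>> a[:, a.sum(0) < 5]
array([[ 2],
       [-3],
       [ 5]])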
a = np.array(list(range(16))).reshape((4, 4))
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
Say I want the middle square. It'd seem reasonable to do this:
a[[1,2],[1,2]]
but I get this:
array([5, 10])
This works, but seems inelegant:
a[[1,2],:][:,[1,2]]
array([[ 5,  6],
       [ 9, 10]])
So my questions are:
Why is it this way? What premises are required to make the implemented way sensible?
Is there a canonical way to select along more than one index at once?
You can read more details in the NumPy documentation on advanced indexing. Basically, when you index an array with lists/arrays, the index arrays are broadcast against each other and iterated together.
In your case, you can do:
idx = np.array([1, 3])
a[idx[:, None], idx]
Or, as in the documentation mentioned above:
a[np.ix_(idx, idx)]
Output:
array([[ 5,  7],
       [13, 15]])
You can do both indexing operations at once instead of creating an intermediate array and indexing that again:
import numpy as np
a = np.arange(16).reshape((4, 4))
# preferred if possible
print(a[1:3, 1:3])
# [[ 5 6]
# [ 9 10]]
# otherwise add a second dimension to the first index to make it broadcastable
index1 = np.asarray([1, 2])
index2 = np.asarray([1, 2])
print(a[index1[:, None], index2])
# [[ 5 6]
# [ 9 10]]
You could chain multiple np.take calls to select indices along multiple axes:
a = np.arange(16).reshape((4, 4))
idx = np.array([1,2])
np.take(np.take(a, idx, axis=1), idx, axis=0)
Or (slightly more readable)
a.take(idx, axis=1).take(idx, axis=0)
Output:
array([[ 5,  6],
       [ 9, 10]])
np.take also allows you to conveniently wrap around out-of-bound indices and such.
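For instance, with the optional mode='wrap' parameter an out-of-bounds index is taken modulo the axis length; a small illustration using the same a:

np.take(a, [5, 2], axis=0, mode='wrap')  # row index 5 wraps to 5 % 4 == 1
# array([[ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])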
I am currently working on a problem where one requirement is to compare two 3D NumPy arrays, return the unmatched values with their index positions, and later recreate the same array. Currently, the only approach I can think of is to loop across the arrays to get the values during the comparison and again during recreation. The problem is scale: there will be hundreds of arrays, and looping affects the latency of the overall application. I would be thankful if anyone can help me make better use of NumPy comparisons while using minimal or no loops. Dummy code is below:
def compare_array(final_array_list):
    base_array = None
    i = 0
    for array in final_array_list:
        if i == 0:
            base_array = array
        else:
            index = np.where(base_array != array)
            # getting an index like (array([0, 1]), array([1, 1]), array([2, 2]));
            # to access all unmatched values I need to loop. Need to avoid the loop here.
        i = i + 1
    return [base_array, [unmatched values (8, 10) and their index (array([0, 1]), array([1, 1]), array([2, 2]))], ...]

# similarly recreate array_1 back
def recreate_array(array_list):
    # need to avoid looping while recreating the arrays
    return list of arrays  # i.e. [base_array, array_1]
# creating dummy array
base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = np.array([[[1, 2, 3], [3, 4, 8]], [[5, 6, 7], [7, 8, 10]]])
final_array_list = [base_array, array_1, ......]
# compare base_array with the other arrays and get the unmatched values (like 8, 10 in array_1) and their index
difff_array = compare_array(final_array_list)
# recreate array_1 from the base array after receiving the unmatched values and their index
recreate_array(difff_array)
I think this may be what you're looking for:
base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = np.array([[[1, 2, 3], [3, 4, 8]], [[5, 6, 7], [7, 8, 10]]])
match_mask = (base_array == array_1)
idx_unmatched = np.argwhere(~match_mask)
# idx_unmatched:
# array([[0, 1, 2],
#        [1, 1, 2]])
# values with associated with idx_unmatched:
values_unmatched = base_array[tuple(idx_unmatched.T)]
# values_unmatched:
# array([5, 9])
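To cover the second half of the question (recreating array_1 from the base array plus the stored differences), a minimal sketch along the same lines; it assumes the unmatched values are stored from array_1 rather than from base_array:

import numpy as np

base_array = np.array([[[1, 2, 3], [3, 4, 5]], [[5, 6, 7], [7, 8, 9]]])
array_1 = np.array([[[1, 2, 3], [3, 4, 8]], [[5, 6, 7], [7, 8, 10]]])

idx_unmatched = np.argwhere(base_array != array_1)  # shape (n_diffs, 3)
diff_values = array_1[tuple(idx_unmatched.T)]       # array([ 8, 10])

# recreate array_1 from base_array plus the stored differences
recreated = base_array.copy()
recreated[tuple(idx_unmatched.T)] = diff_values
assert np.array_equal(recreated, array_1)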
I'm not sure I understand what you mean by "recreate them" (completely recreate them? why not use the arrays themselves?).
I can help you though by noting that there are plenty of functions that vectorize with numpy, and as a general rule of thumb, do not use for loops unless G-d himself tells you to :)
For example:
If a, b are any np.arrays (regardless of dimensions), the simple a == b will return a numpy array of the same size with boolean values: True where they are equal at that coordinate, and False otherwise.
The function np.where(c) will treat c as a boolean np.array and return the indexes at which c is True.
To clarify:
Here I instantiate two arrays, with b differing from a at three positions (set to -1).
Note what a == b gives at the end.
>>> a = np.random.randint(low=0, high=10, size=(4, 4))
>>> b = np.copy(a)
>>> b[2, 3] = -1
>>> b[0, 1] = -1
>>> b[1, 1] = -1
>>> a
array([[9, 9, 3, 4],
       [8, 4, 6, 7],
       [8, 4, 5, 5],
       [1, 7, 2, 5]])
>>> b
array([[ 9, -1,  3,  4],
       [ 8, -1,  6,  7],
       [ 8,  4,  5, -1],
       [ 1,  7,  2,  5]])
>>> a == b
array([[ True, False,  True,  True],
       [ True, False,  True,  True],
       [ True,  True,  True, False],
       [ True,  True,  True,  True]])
Now to the function np.where, whose output is a bit tricky but easy to use. For a 2D input it returns two arrays of the same size: the first holds the row indices and the second the column indices of the places where the given array is True.
>>> np.where(a == b)
(array([0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3], dtype=int64), array([0, 2, 3, 0, 2, 3, 0, 1, 2, 0, 1, 2, 3], dtype=int64))
Now you can "fix" the b array to match a, by switching the values of b ar indexes in which it differs from a, to be a's indexes:
>>> b[np.where(a != b)]
array([-1, -1, -1])
>>> b[np.where(a != b)] = a[np.where(a != b)]
>>> np.all(a == b)
True
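As an aside, the np.where calls are optional here: a boolean mask can be used as an index directly, so the same repair can be written more compactly (same a and b as above):

mask = a != b
b[mask] = a[mask]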
I have the following code which contains several numpy arrays that I am trying to sort into numerical order according to ID:
from astropy.table import Table
import numpy as np
from math import log10
cat_fil = '/home/myname/filtered_catalogue.csv'
cat_fil = Table.read(cat_fil, format="ascii")
ID = np.array(cat_fil['id'])
ID = ID.astype(str)
redshift = np.array(cat_fil['z'])
redshift = redshift.astype(float)
radius = np.array(cat_fil['radius_pixels'])
radius = radius.astype(float)
mag = np.array(cat_fil['magntiude'])
mag = mag.astype(float)
stacked = np.column_stack((ID, redshift, radius, mag))
stacked = stacked.astype(float)
idx = ((stacked[:, 1] > 0.0) &
(stacked[:, 2] > 10.0) &
(stacked[:, 3] > 0.0))
filtered = stacked[idx]
filtered = np.sort(filtered, axis=0)  # <-- the problematic line
print(filtered[0])
The problem I'm having is with sorting the array. Whenever I sort the IDs into increasing order, the rest of the data does not move with them, meaning that when my code runs it assigns the wrong values to each ID. Is there a way to sort all of the numpy arrays by increasing numerical ID whilst also keeping the rows together, so that the correct data stays next to each ID?
I gather from a quick read of your question/code that you have a 2d array, which you want to sort on the first column. Let's make a simple array like that:
In [258]: arr = np.random.randint(0,10,(4,3))
In [259]: arr
Out[259]:
array([[0, 9, 7],
       [8, 7, 1],
       [3, 7, 8],
       [3, 5, 9]])
Sort, with the axis parameter 0, sorts each column independently:
In [260]: np.sort(arr, axis=0)
Out[260]:
array([[0, 5, 1],
       [3, 7, 7],
       [3, 7, 8],
       [8, 9, 9]])
But with argsort we get indices for sorting just the first column:
In [261]: idx = np.argsort(arr[:,0])
In [262]: idx
Out[262]: array([0, 2, 3, 1])
which we can then use to reorder all the rows:
In [263]: arr[idx,:]
Out[263]:
array([[0, 9, 7],
       [3, 7, 8],
       [3, 5, 9],
       [8, 7, 1]])
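Applied to the filtered array in the question, the same pattern is a one-liner; a sketch, assuming column 0 holds the IDs after the column_stack:

filtered = filtered[np.argsort(filtered[:, 0])]

This orders the rows by increasing ID while keeping each row intact.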
I've run into an interaction with arrays that leaves me a little confused. I can work around it, but for my own understanding I'd like to know what is going on.
Essentially, I have a datafile that I'm trying to tailor so I can use it as input for some code I've already written. This involves some calculations on columns, rows, etc. In particular, I also need to rearrange some elements, and here the original array isn't being modified as I expect it to be.
import numpy as np
ex_data = np.arange(12).reshape(4,3)
ex_data[2, 0] = 0  # constructing some fake data
ex_data[ex_data[:, 0] == 0][:, 1] = 3
print(ex_data)
Basically, I look in a column of interest, collect all the rows where that column contains a parameter value of interest and just reassigning values.
With the snippet of code above, I would expect ex_data's column-1 elements to be assigned a value of 3 wherever the column-0 element is equal to 0. However, what I'm seeing is that there is no effect at all.
>>> ex_data
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 0,  7,  8],
       [ 9, 10, 11]])
In another case, if I don't 'slice' my 'sliced' data file, then the reassignment goes on as normal.
ex_data[ex_data[:, 0] == 0] = 3
print(ex_data)
Here I'd expect each entire row where column 0 is equal to 0 to be populated with 3s, and that is indeed what happens.
>>> ex_data
array([[ 3,  3,  3],
       [ 3,  4,  5],
       [ 3,  3,  3],
       [ 9, 10, 11]])
Can anyone explain the interaction?
In [368]: ex_data
Out[368]:
array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 0,  7,  8],
       [ 9, 10, 11]])
The column 0 test:
In [369]: ex_data[:,0]==0
Out[369]: array([ True, False, True, False])
That boolean mask can be applied to the rows as:
In [370]: ex_data[ex_data[:,0]==0,0]
Out[370]: array([0, 0]) # the 0's you expected
In [371]: ex_data[ex_data[:,0]==0,1]
Out[371]: array([1, 7]) # the col 1 values you want to replace
In [372]: ex_data[ex_data[:,0]==0,1] = 3
In [373]: ex_data
Out[373]:
array([[ 0,  3,  2],
       [ 3,  4,  5],
       [ 0,  3,  8],
       [ 9, 10, 11]])
The indexing you tried:
In [374]: ex_data[ex_data[:,0]==0]
Out[374]:
array([[0, 3, 2],
       [0, 3, 8]])
produces a copy. Assigning ...[:,1]=3 just changes that copy, not the original array. Fortunately in this case, it is easy to use
ex_data[ex_data[:,0]==0,1]
instead of
ex_data[ex_data[:,0]==0][:,1]
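For completeness, here is the question's snippet with the working one-step indexing, run end to end:

import numpy as np

ex_data = np.arange(12).reshape(4, 3)
ex_data[2, 0] = 0                    # constructing the fake data
ex_data[ex_data[:, 0] == 0, 1] = 3   # one indexing expression writes into the original
print(ex_data)
# [[ 0  3  2]
#  [ 3  4  5]
#  [ 0  3  8]
#  [ 9 10 11]]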
I have a 2D and a 1D array. For each value in the 1D array, I am looking to find the two rows that contain it at least once, as follows:
import numpy as np
A = np.array([[0, 3, 1],
              [9, 4, 6],
              [2, 7, 3],
              [1, 8, 9],
              [6, 2, 7],
              [4, 8, 0]])
B = np.array([0,1,2,3])
results = []
for elem in B:
    results.append(np.where(A == elem)[0])
This works and results in the following list of arrays:
[array([0, 5], dtype=int64),
 array([0, 3], dtype=int64),
 array([2, 4], dtype=int64),
 array([0, 2], dtype=int64)]
But this is probably not the best way of proceeding. Following the answers given in this question (Search Numpy array with multiple values) I tried the following solutions:
out1 = np.where(np.in1d(A, B))
num_arr = np.sort(B)
idx = np.searchsorted(B, A)
idx[idx==len(num_arr)] = 0
out2 = A[A == num_arr[idx]]
But these give me incorrect values:
In [36]: out1
Out[36]: (array([ 0, 1, 2, 6, 8, 9, 13, 17], dtype=int64),)
In [37]: out2
Out[37]: array([0, 3, 1, 2, 3, 1, 2, 0])
Thanks for your help.
If you need to know whether each row of A contains ANY element of array B, without caring which particular element of B it is, the following can be used:
input:
np.isin(A,B).sum(axis=1)>0
output:
array([ True, False, True, True, True, True])
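If you want the row indices rather than the boolean mask, np.where converts one into the other (using the A and B from the question; .any(axis=1) is equivalent to .sum(axis=1) > 0):

import numpy as np

A = np.array([[0, 3, 1], [9, 4, 6], [2, 7, 3], [1, 8, 9], [6, 2, 7], [4, 8, 0]])
B = np.array([0, 1, 2, 3])

rows = np.where(np.isin(A, B).any(axis=1))[0]
print(rows)  # [0 2 3 4 5]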
Since you're dealing with a 2D array* you can use broadcasting to compare B with a raveled version of A. This gives you the matching indices in the raveled shape. You can then map the result back to indices in the original array using np.unravel_index.
In [50]: d = np.where(B[:, None] == A.ravel())[1]
In [51]: np.unravel_index(d, A.shape)
Out[51]: (array([0, 5, 0, 3, 2, 4, 0, 2]), array([0, 2, 2, 0, 0, 1, 1, 2]))
The first array reproduces the row indices expected in the question ([0, 5], [0, 3], [2, 4], [0, 2], in the order of B); the second array gives the matching column positions.
* From documentation: For 3-dimensional arrays this is certainly efficient in terms of lines of code, and, for small data sets, it can also be computationally efficient. For large data sets, however, the creation of the large 3-d array may result in sluggish performance.
Also, Broadcasting is a powerful tool for writing short and usually intuitive code that does its computations very efficiently in C. However, there are cases when broadcasting uses unnecessarily large amounts of memory for a particular algorithm. In these cases, it is better to write the algorithm's outer loop in Python. This may also produce more readable code, as algorithms that use broadcasting tend to become more difficult to interpret as the number of dimensions in the broadcast increases.
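Under that caveat, a memory-friendlier middle ground keeps the short outer loop over B but vectorizes each comparison, staying close to the question's original approach (A and B as defined in the question):

results = [np.where((A == elem).any(axis=1))[0] for elem in B]
# [array([0, 5]), array([0, 3]), array([2, 4]), array([0, 2])]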
Is something like this what you are looking for?
import numpy as np
from itertools import combinations
A = np.array([[0, 3, 1],
              [9, 4, 6],
              [2, 7, 3],
              [1, 8, 9],
              [6, 2, 7],
              [4, 8, 0]])
B = np.array([0,1,2,3])
for i in combinations(A, 2):
    if np.all(np.isin(B, np.hstack(i))):
        print(i[0], ' ', i[1])
which prints the following:
[0 3 1] [2 7 3]
[0 3 1] [6 2 7]
note: this solution does NOT require the rows to be consecutive. Please let me know if that is required.
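If consecutive rows were ever required, a hypothetical variant would slide a two-row window instead of taking all combinations; note that for this particular A and B no consecutive pair covers all of B, so the loop below prints nothing:

for r1, r2 in zip(A, A[1:]):
    if np.all(np.isin(B, np.hstack((r1, r2)))):
        print(r1, ' ', r2)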