A numpy array z is constructed from two Python lists x and y, where the values of y can be 0 and the values of x do not increment continuously (i.e. values can be skipped).
Since y values can themselves be 0, it would be confusing to represent missing values in z as 0 as well.
What is the best practice to avoid this confusion?
import numpy as np
# Construct `z`
x = [1, 2, 3, 5, 8, 13]
y = [12, 34, 56, 0, 78, 0]
z = np.zeros(max(x)+1, dtype=np.uint32) # missing values become 0
for i in range(len(x)):
z[x[i]] = y[i]
print(z) # [ 0 12 34 56 0 0 0 0 78 0 0 0 0 0]
print(z[4]) # missing value but is assigned 0
print(z[13]) # non-missing value but also assigned 0
Solution
You would typically assign np.nan or some other sentinel value to the indices that do not appear in x.
Also, there is no need for the for loop: you can assign all values of y in one line, as shown below.
However, since you are typecasting to uint32, you cannot use np.nan (NaN is a float and has no integer representation). Instead, choose a large number (for example, 999999) that, by design, will never show up in y. For more details, please refer to the links in the References section below.
import numpy as np
x = [1, 2, 3, 5, 8, 13]
y = [12, 34, 56, 0, 78, 0]
# cannot use np.nan with uint32 as np.nan is treated as a float
# choose some large value instead: 999999
z = np.full(max(x)+1, 999999, dtype=np.uint32)
z[x] = y
z
# array([999999, 12, 34, 56, 999999, 0, 999999, 999999,
# 78, 999999, 999999, 999999, 999999, 0], dtype=uint32)
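As an aside (not part of the original solution): if a sentinel value feels fragile, numpy's masked arrays can mark missing entries explicitly instead of encoding them in the data. A minimal sketch of that alternative:
import numpy as np
x = [1, 2, 3, 5, 8, 13]
y = [12, 34, 56, 0, 78, 0]
z = np.ma.masked_all(max(x)+1, dtype=np.uint32)  # every entry starts out masked (missing)
z[x] = y                                         # assigning unmasks exactly these entries
print(z)                     # [-- 12 34 56 -- 0 -- -- 78 -- -- -- -- 0]
print(z[4] is np.ma.masked)  # True: index 4 is missing, not 0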
References
Numpy integer nan
Maximum and minimum value of C types integers from Python
Related
I have a numpy array a = np.array([483, 39, 18, 999, 20, 48])
I have an array of indices indices = np.array([2, 3])
I would like to keep the values at those indices and fill the rest of the array with 0, so I get as a result:
np.array([0, 0, 18, 999, 0, 0])
Thank you for your answer.
Create an all zeros array and copy the values at the desired indices:
import numpy as np
a = np.array([483, 39, 18, 999, 20, 48])
indices = np.array([2, 3])
b = np.zeros_like(a)
b[indices] = a[indices]
# a = b # if needed
print(a)
print(indices)
print(b)
Output:
[483 39 18 999 20 48]
[2 3]
[ 0 0 18 999 0 0]
Hope that helps!
----------------------------------------
System information
----------------------------------------
Platform: Windows-10-10.0.16299-SP0
Python: 3.8.1
NumPy: 1.18.1
----------------------------------------
EDIT: Even better, use np.setdiff1d:
import numpy as np
a = np.array([483, 39, 18, 999, 20, 48])
indices = np.array([2, 3])
print(a)
print(indices)
a[np.setdiff1d(np.arange(a.shape[0]), indices, assume_unique=True)] = 0
print(a)
Output:
[483 39 18 999 20 48]
[2 3]
[ 0 0 18 999 0 0]
What about using a list comprehension?
a = np.array([n if i in indices else 0 for i, n in enumerate(a)])
print(a) #array([ 0, 0, 18, 999, 0, 0])
You can create a function that uses the input array and the index array to do this, as in the following:
import numpy as np
def remove_by_index(input_array, indexes):
    for i, _ in enumerate(input_array):
        if i not in indexes:
            input_array[i] = 0
    return input_array
input_array = np.array([483, 39, 18, 999, 20, 48])
indexes = np.array([2, 3])
new_out = remove_by_index(input_array, indexes)
expected_out = np.array([0, 0, 18, 999, 0, 0])
print(new_out == expected_out) # to check if it's correct
Edit
You can also use list comprehension inside the function, which would be better, as:
def remove_by_index(input_array, indexes):
    return [input_array[i] if (i in indexes) else 0 for i, _ in enumerate(input_array)]
Note that this version returns a Python list rather than a numpy array. It is not, as pointed out in the comments, the most efficient way of doing it, since the iteration happens at Python level instead of C level, but it does work, and for casual use it will do the job.
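For reference, a vectorized sketch of the same operation that keeps the iteration at C level (same input_array and indexes as above):
import numpy as np
input_array = np.array([483, 39, 18, 999, 20, 48])
indexes = np.array([2, 3])
# boolean mask of the positions to keep; everything else becomes 0
mask = np.isin(np.arange(input_array.size), indexes)
result = np.where(mask, input_array, 0)
print(result)  # [  0   0  18 999   0   0]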
I know that
a - a.min(axis=0)
will subtract the minimum of each column from every element in the column. I want to subtract the minimum in each row from every element in the row. I know that
a.min(axis=1)
specifies the minimum within a row, but how do I tell the subtraction to go by rows instead of columns? (How do I specify the axis of the subtraction?)
edit: For my question, a is a 2d array in NumPy.
Assuming a is a numpy array, you can use this:
new_a = a - np.min(a, axis=1)[:,None]
Try it out:
import numpy as np
a = np.arange(24).reshape((4,6))
print (a)
new_a = a - np.min(a, axis=1)[:,None]
print (new_a)
Result:
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]]
[[0 1 2 3 4 5]
[0 1 2 3 4 5]
[0 1 2 3 4 5]
[0 1 2 3 4 5]]
Note that np.min(a, axis=1) returns a 1d array of row-wise minimum values.
We then add an extra dimension to it using [:,None]. It then looks like this 2d array:
array([[ 0],
[ 6],
[12],
[18]])
When this 2d array participates in the subtraction, it gets broadcasted into a shape of (4,6), which looks like this:
array([[ 0, 0, 0, 0, 0, 0],
[ 6, 6, 6, 6, 6, 6],
[12, 12, 12, 12, 12, 12],
[18, 18, 18, 18, 18, 18]])
Now, element-wise subtraction happens between the two (4,6) arrays.
Specify keepdims=True to preserve a length-1 dimension in place of the dimension that min collapses, allowing broadcasting to work out naturally:
a - a.min(axis=1, keepdims=True)
This is especially convenient when the axis is determined at runtime, but even with a fixed axis it is arguably clearer than manually reintroducing the collapsed dimension.
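To illustrate the runtime-axis point, a small sketch (the array is just an example):
import numpy as np
a = np.arange(24).reshape((4, 6))
for axis in (0, 1):  # axis picked at runtime
    centered = a - a.min(axis=axis, keepdims=True)
    print(centered.shape)  # (4, 6) both times; no manual reshaping needed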
If you want to use only pandas, you can apply a lambda to every row (axis=1), subtracting min(row):
new_df = pd.DataFrame()
for col in df.columns:
    new_df[col] = df.apply(lambda row: row[col] - min(row), axis=1)
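For what it's worth, the same row-wise centering can be written as a single vectorized pandas expression; a sketch, assuming df is an all-numeric DataFrame:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.arange(24).reshape(4, 6))  # hypothetical example frame
new_df = df.sub(df.min(axis=1), axis=0)  # subtract each row's minimum down the columns
print(new_df)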
I'm working with a huge dataset. What I want to do is take all values > 0 from the array and place them in a new array, run statistics on those extracted values and then place the new values back in the original array.
Suppose I have an array [0, 0, 0, 0, 0, ..., .32, .44, 0, 0, 0] (i.e. the object arr in the script below): I want to remove the values such as .32, .44, etc., and put them in a new array arr2.
Then I want to do a statistical analysis (PCA) on this second array, take the new values corresponding with the original position in the original array and replace the original values with these new values. I've started coding this below, but have no idea how to extract values > 0 while maintaining the position in the array.
import os
import nibabel as nib
import numpy as np
import numpy.linalg as npl
import matplotlib.pyplot as plt
from matplotlib.mlab import PCA
#from dipy.io.image import load_nifti, save_nifti
np.set_printoptions(precision=4, suppress=True)
FA = './all_FA_skeletonised.nii'
from dipy.io.image import load_nifti
img = nib.load(FA)
data = img.get_data()
data.shape #get x,y,z and subject # parameters from image
#place subject number into a variable
vol_shape = data.shape[:-1] # x,y,z coordinates
n_vols = data.shape[-1] # 28 subjects volumes
# N is the num of voxels (dimensions) in a volume
N = np.prod(vol_shape)
#- Reshape first dimension of whole image data array to N, and take
#- transpose
arr2 = []
arr = data.reshape(N, n_vols).T # 28 X 7,200,000 array
for a in arr.flat:  # iterate over every value in arr
    if a > 0:
        arr2.append(a)
row_means = np.outer(np.mean(arr2, axis=1), np.ones(N))
X = arr2 - row_means # mean center data array
#- Calculate unscaled covariance matrix of X
unscaled_covariance = X.dot(X.T)
unscaled_covariance.shape
# Calculate U, S, VT with SVD on unscaled covariance matrix
U, S, VT = npl.svd(unscaled_covariance)
#- Use subplots to make axes to plot first 10 principal component
#- vectors
#- Plot one component vector per sub-plot.
fig, axes = plt.subplots(10, 1)
for i, ax in enumerate(axes):
    ax.plot(U[:, i])
#- Calculate scalar projections for projecting X onto U
#- Put results into array C.
C = U.T.dot(X)
#- Put values in C back into original data matrix
I would extract the wanted values together with their positions in the original array and store them in a dictionary as index_in_the_original_array: value_in_the_original_array. Then I would do the calculations on the values in the dictionary. Finally, the indices are preserved (as keys in the dictionary) for replacing the values back in the original array. In code:
from pprint import pprint
original_array = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
# Collecting all values & indices of the elements that are greater than 5:
my_dictionary = {index: value for index, value in enumerate(original_array) if value > 5}
pprint(original_array) # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
pprint(my_dictionary) # {5: 6, 6: 7, 7: 8, 8: 9, 9: 10}
# doing the processing (Here just incrementing the values by 2):
my_dictionary = {key: my_dictionary[key] + 2 for key in my_dictionary.keys()}
pprint(my_dictionary) # {5: 8, 6: 9, 7: 10, 8: 11, 9: 12}
# Replacing the new values into the original array:
for key in my_dictionary.keys():
    original_array[key] = my_dictionary[key]
pprint(original_array) # [1, 2, 3, 4, 5, 8, 9, 10, 11, 12]
Update
If we want to avoid the use of a dictionary, we could do the following, which does basically the same as above.
import numpy as np
def process_data(data):
return data * 5
original_array = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
new_array = np.array([[index, value] for index, value in enumerate(original_array) if value > 5])
print(new_array) # [[ 5 6]
# [ 6 7]
# [ 7 8]
# [ 8 9]
# [ 9 10]]
# doing the processing (Here, just using the above function that multiplies the values by 5):
new_array[:, 1] = process_data(new_array[:, 1])
print(new_array) # [[ 5 30]
# [ 6 35]
# [ 7 40]
# [ 8 45]
# [ 9 50]]
# Replacing the new values into the original array:
for indx, val in new_array:
    original_array[indx] = val
print(original_array) # [ 1 2 3 4 5 30 35 40 45 50]
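Incidentally, the write-back loop can be replaced by one fancy-indexed assignment (same new_array as above):
original_array[new_array[:, 0]] = new_array[:, 1]  # indices in column 0, values in column 1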
edit: got the question wrong (see comments) so here's an update.
Say we have a=[0,0,1,2,0,3] and b=[.1, .1, .1] and want to combine them to get [0, 0, .1, .1, 0, .1], i.e. 0 remains at the same indices and all the other values get substituted:
import numpy as np
b = np.array([.1, .1, .1])
a = np.array([0,0,1,2,0,3], dtype='float64') # expects same dtype
np.place(a, a>0, b) # modify in place
Back up a before the np.place line if you need its original values.
previous version:
Not sure whether I got you right; assuming by 'maintaining the position in the array' you mean, for example, that [0,0,1,2,0,3,0] should evaluate to [1,2,3] (instead of [1,3,2] or something else), you can do this with a[a != 0], where a is your array. If you only want to knock off leading/trailing zeros, try numpy.trim_zeros instead.
Things should be different if input is 2D arrays or matrices, as you'll need to keep them in shape.
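For the full round trip the question describes (extract, run statistics, write back), a boolean mask keeps the positions for free, and it works unchanged for 2D arrays. A minimal sketch, with a toy doubling step standing in for the real PCA:
import numpy as np
arr = np.array([0, 0, .32, .44, 0, .1])
mask = arr > 0     # remembers where the nonzero values live
vals = arr[mask]   # extracted values: [0.32 0.44 0.1 ]
vals = vals * 2    # stand-in for the statistics/PCA step
arr[mask] = vals   # written back to their original positions
print(arr)         # [0.   0.   0.64 0.88 0.   0.2 ]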
I have an array myA like this:
array([ 7, 4, 5, 8, 3, 10])
If I want to replace all values that are larger than a value val by 0, I can simply do:
myA[myA > val] = 0
which gives me the desired output (for val = 5):
array([0, 4, 5, 0, 3, 0])
However, my goal is to replace not all but only the first n elements of this array that are larger than a value val.
So, if n = 2 my desired outcome would look like this (10 is the third element larger than val and should therefore not be replaced):
array([ 0, 4, 5, 0, 3, 10])
A straightforward implementation would be:
import numpy as np
myA = np.array([7, 4, 5, 8, 3, 10])
n = 2
val = 5
# track the number of replacements
repl = 0
for ind, vali in enumerate(myA):
    if vali > val:
        myA[ind] = 0
        repl += 1
        if repl == n:
            break
That works, but maybe someone can come up with a smart way of masking!?
The following should work:
myA[(myA > val).nonzero()[0][:n]] = 0
since nonzero returns the indices where the boolean array myA > val is nonzero, i.e. True.
For example:
In [1]: myA = array([ 7, 4, 5, 8, 3, 10])
In [2]: myA[(myA > 5).nonzero()[0][:2]] = 0
In [3]: myA
Out[3]: array([ 0, 4, 5, 0, 3, 10])
The final solution is very simple:
import numpy as np
myA = np.array([7, 4, 5, 8, 3, 10])
n = 2
val = 5
myA[np.where(myA > val)[0][:n]] = 0
print(myA)
Output:
[ 0 4 5 0 3 10]
Here's another possibility (untested), probably no better than nonzero:
def truncate_mask(m, stop):
    m = m.astype(bool, copy=False)  # if we allow non-bool m, the next line becomes nonsense
    return m & (np.cumsum(m) <= stop)
myA[truncate_mask(myA > val, n)] = 0
By avoiding building and using an explicit index you might end up with slightly better performance...but you'd have to test it to find out.
Edit 1: while we're on the subject of possibilities, you could also try:
def truncate_mask(m, stop):
    m = m.astype(bool, copy=True)  # note we need to copy m here to safely modify it
    m[np.searchsorted(np.cumsum(m), stop, side='right'):] = 0  # side='right' so the stop-th True itself is kept
    return m
Edit 2 (the next day): I've just tested this and it seems that cumsum is actually worse than nonzero, at least with the kinds of values I was using (so neither of the above approaches is worth using). Out of curiosity, I also tried it with numba:
import numba

@numba.jit
def set_first_n_gt_thresh(a, val, thresh, n):
    ii = 0
    while n > 0 and ii < len(a):
        if a[ii] > thresh:
            a[ii] = val
            n -= 1
        ii += 1
This only iterates over the array once, or rather it only iterates over the necessary part of the array once, never even touching the latter part. This gives you vastly superior performance for small n, but even for the worst case of n>=len(a) this approach is faster.
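For example, a minimal call on the question's data (the function modifies a in place):
myA = np.array([7, 4, 5, 8, 3, 10])
set_first_n_gt_thresh(myA, 0, 5, 2)  # zero out the first 2 elements greater than 5
print(myA)  # [ 0  4  5  0  3 10]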
You could use the same solution as here by converting your np.array to a pd.Series:
import pandas as pd

s = pd.Series([7, 4, 5, 8, 3, 10])
n = 2
m = 5
s[s[s>m].iloc[:n].index] = 0
In [416]: s
Out[416]:
0 0
1 4
2 5
3 0
4 3
5 10
dtype: int64
Step-by-step explanation:
In [426]: s > m
Out[426]:
0 True
1 False
2 False
3 True
4 False
5 True
dtype: bool
In [428]: s[s>m].iloc[:n]
Out[428]:
0 7
3 8
dtype: int64
In [429]: s[s>m].iloc[:n].index
Out[429]: Int64Index([0, 3], dtype='int64')
In [430]: s[s[s>m].iloc[:n].index]
Out[430]:
0 7
3 8
dtype: int64
The output of In [430] looks the same as In [428], but In [428] is a copy while In [430] indexes the original series.
If you need an np.array back, you can use the .values attribute:
In [418]: s.values
Out[418]: array([ 0, 4, 5, 0, 3, 10], dtype=int64)
I have data in long format that stores the row#, column# and value as shown below:
ROW COLUMN VALUE
1 1 1
1 3 3
2 1 1
2 2 2
3 1 1
3 2 2
3 3 3
Please note that certain ROW, COLUMN combinations are missing (for instance there is no value for ROW = 1 and COLUMN = 2). I would like to convert this into a 3 x 3 array like so, where each missing row/column combination gets filled in with 0:
1 0 3
1 2 0
1 2 3
My initial approach to this problem was to declare an empty 3 x 3 array, read the three columns in as 1d arrays, and loop over rows and columns, updating the array from the value array. For small dimensions this seems doable, but for higher dimensions it does not seem to be the "Pythonic" way to do it. Has this problem been tackled by some canned function in the numpy package? I looked into reshape, but that assumes no missing values.
Once you have the row, column and values in numpy arrays, you can do something like the following. (Note that I've taken the more Pythonic approach of putting the 0-based indices in row and col).
Here's the data, in one-dimensional arrays:
In [13]: row = np.array([0, 0, 1, 1, 2, 2, 2])
In [14]: col = np.array([0, 2, 0, 1, 0, 1, 2])
In [15]: values = np.array([11, 12, 13, 14, 15, 16, 17])
Create a two-dimensional array to hold the values. I use the maxima from row and col to figure out how big the array should be. You might use some other values if row and col don't necessarily include values in the last row or column.
In [16]: a = np.zeros((row.max()+1, col.max()+1), dtype=values.dtype)
Now fill in the values with this assignment:
In [17]: a[row, col] = values
Et voilà:
In [18]: a
Out[18]:
array([[11, 0, 12],
[13, 14, 0],
[15, 16, 17]])
Your example is a 3x3 array, but if you actually have much larger arrays and not a lot of entries, you might consider using a scipy sparse matrix. For example, here's how you can create a "COO" matrix from the same data as above, using the coo_matrix class:
In [25]: from scipy.sparse import coo_matrix
In [26]: c = coo_matrix((values, (row, col)), shape=(row.max()+1, col.max()+1))
In [27]: c
Out[27]:
<3x3 sparse matrix of type '<type 'numpy.int64'>'
with 7 stored elements in COOrdinate format>
In [28]: c.A
Out[28]:
array([[11, 0, 12],
[13, 14, 0],
[15, 16, 17]])
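Applied to the question's own table (whose ROW/COLUMN indices are 1-based, so subtract 1 first), a quick sketch:
In [29]: row = np.array([1, 1, 2, 2, 3, 3, 3]) - 1  # 1-based -> 0-based
In [30]: col = np.array([1, 3, 1, 2, 1, 2, 3]) - 1
In [31]: values = np.array([1, 3, 1, 2, 1, 2, 3])
In [32]: a = np.zeros((row.max()+1, col.max()+1), dtype=values.dtype)
In [33]: a[row, col] = values
In [34]: a
Out[34]:
array([[1, 0, 3],
       [1, 2, 0],
       [1, 2, 3]])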