Array reclassification with numpy - python

I have a large (50000 x 50000) 64-bit integer NumPy array containing 10-digit numbers. There are about 250,000 unique numbers in the array.
I have a second reclassification table which maps each unique value from the first array to an integer between 1 and 100. My hope would be to reclassify the values from the first array to the corresponding values in the second.
I've tried two methods of doing this, and while they work, they are quite slow. In both methods I create a blank (zeros) array of the same dimensions.
new_array = np.zeros(old_array.shape)
First method:
for old_value, new_value in lookup_array:
    new_array[old_array == old_value] = new_value
Second method, where the lookup table is a pandas DataFrame (lookup_table) with the headings "Old" and "New":
for new_value, old_values in lookup_table.groupby("New"):
    new_array[np.in1d(old_array, old_values)] = new_value
Is there a faster way to reclassify the values?

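Store the lookup table as an array indexed by the old value, so that each index holds the mapped value. Note that this array needs idx.max() + 1 entries, so it is only practical when the largest old value is small enough for such an array to fit in memory. For example, if you have something like: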
lookups = [(old_value_1, new_value_1), (old_value_2, new_value_2), ...]
Then you can do:
idx, val = np.asarray(lookups).T
lookup_array = np.zeros(idx.max() + 1, dtype=val.dtype)  # keep the integer dtype of the new values
lookup_array[idx] = val
When you get that, you can get your transformed array simply as:
new_array = lookup_array[old_array]
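Because the old values in the question are 10-digit numbers, a direct index table could be too large to allocate. One possible alternative, sketched here under the assumption that every value in old_array appears in the lookup table, is to sort the keys once and locate each value with np.searchsorted. The function name reclassify and its parameter names are illustrative and do not come from the original answer.

```python
import numpy as np

def reclassify(old_array, old_values, new_values):
    """Map each entry of old_array via the pairs (old_values[i], new_values[i]).

    Assumption: every entry of old_array is present in old_values.
    """
    old_values = np.asarray(old_values)
    new_values = np.asarray(new_values)
    # Sort the keys once so they can be binary-searched.
    order = np.argsort(old_values)
    sorted_old = old_values[order]
    sorted_new = new_values[order]
    # Position of each entry of old_array within the sorted keys.
    pos = np.searchsorted(sorted_old, old_array)
    return sorted_new[pos]

old_array = np.array([[1000000005, 1000000001],
                      [1000000009, 1000000005]], dtype=np.int64)
new_array = reclassify(old_array,
                       [1000000009, 1000000001, 1000000005],
                       [3, 1, 2])
```

The memory needed by this approach depends on the number of unique keys rather than on their magnitude.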

Related

Sort np array based on summed selected values of each row

I have a 2D numpy array, filled with floats.
I want to take a selected chunk of each row (say, the 2nd to 3rd items), sum these values, and sort all the rows by that sum in descending order.
For example:
array([[0.80372444, 0.35468653, 0.9081662 , 0.69995566],
[0.53712474, 0.90619077, 0.69068265, 0.73794143],
[0.14056974, 0.34685164, 0.87505744, 0.56927803]])
Here's what I tried:
a = np.array(sorted(a, key = sum))
But that just sums all the values in each row, rather than, say, only the 2nd to 3rd elements.
You can start by using take to get the elements at indices [1, 2] from each row (axis=1). Then sum across those elements for each row (again axis=1), and use argsort to get the order of the sums (argsort sorts in ascending order). This gives an array of row indices, which you can use to index the array in the desired order.
import numpy as np
a = np.array([[0.80372444, 0.35468653, 0.9081662 , 0.69995566],
[0.53712474, 0.90619077, 0.69068265, 0.73794143],
[0.14056974, 0.34685164, 0.87505744, 0.56927803]])
a[a.take([1, 2], axis=1).sum(axis=1).argsort()]
# returns:
array([[0.14056974, 0.34685164, 0.87505744, 0.56927803],
[0.80372444, 0.35468653, 0.9081662 , 0.69995566],
[0.53712474, 0.90619077, 0.69068265, 0.73794143]])
Alternatively, keep sorted and replace key with the function you actually want:
a = np.array(sorted(a, key=lambda v: sum(v[1:3])))
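The question asks for descending order, while both approaches above sort ascending. A minimal sketch of one way to reverse the order, using the example array from the question; the names partial_sums, order and a_desc are illustrative only:

```python
import numpy as np

a = np.array([[0.80372444, 0.35468653, 0.9081662 , 0.69995566],
              [0.53712474, 0.90619077, 0.69068265, 0.73794143],
              [0.14056974, 0.34685164, 0.87505744, 0.56927803]])

# Sum the 2nd and 3rd items (indices 1 and 2) of every row.
partial_sums = a[:, 1:3].sum(axis=1)
# argsort is ascending, so reverse the index array for descending order.
order = np.argsort(partial_sums)[::-1]
a_desc = a[order]
```

With the sorted-based variant, passing reverse=True to sorted should have the same effect.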

Compare and store elements of multidimensional array to two new arrays

Assume I have the following simple array:
my_array = np.array([[1,2],[2,4],[3,6],[2,1]])
which corresponds to another parent array:
parent_array = np.array([0,1,2,3])
Of course, there is a function that maps parent_array to my_array, but it is not important what that function is.
Goal:
I want to use my_array to create two new arrays, A and B, by iterating over each row of my_array: for row i, if the value in the first column of my_array[i] is greater than the value in the second column, I store parent_array[i] in A. Otherwise (if the value in the second column of my_array[i] is bigger), I store parent_array[i] in B.
So for the case above the result would be:
A = [3]
because only in the 4th row of my_array does the first column have the greater value, and
B = [0,1,2]
because in the first three rows the second column has the greater value.
Now, although I know how to save the greater element in a row of columns to a new array, the fact that each row in my_array is associated with a row in parent_array is confusing me. I don't know how to correlate them.
Summary:
I therefore need to associate each row of parent_array with the corresponding row of my_array and then check the latter row by row: if the value in the first column of my_array[i] is greater, I save parent_array[i] in A, while if the value in the second column is greater, I save parent_array[i] in B.
Use boolean array indexing for this: create a boolean condition array by comparing the values in the 1st and 2nd columns of my_array, and then use it to select values from parent_array:
cond = my_array[:,0] > my_array[:,1]
A, B = parent_array[cond], parent_array[~cond]
A
# [3]
B
# [0 1 2]
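Note that ~cond also sends rows where both columns are equal to B. If ties should be kept separate, one option is to use two strict comparisons, as in this sketch. The extra row [5, 5] and the name ties are assumptions added for illustration and are not part of the question.

```python
import numpy as np

# Example data from the question plus one hypothetical tied row [5, 5].
my_array = np.array([[1, 2], [2, 4], [3, 6], [2, 1], [5, 5]])
parent_array = np.array([0, 1, 2, 3, 4])

first, second = my_array[:, 0], my_array[:, 1]
A = parent_array[first > second]      # first column strictly greater
B = parent_array[first < second]      # second column strictly greater
ties = parent_array[first == second]  # neither column is greater
```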

Save random values from multi-dimensional NumPy array

I have a 149x5 NumPy array. I need to save some (30%) of the values, selected randomly from the whole array. Additionally, the selected values will be deleted from the data.
What I have so far:
# Load dataset
data = pd.read_csv('iris.csv')
# Select randomly 30%(45) of rows from dataset
random_rows = data.sample(45)
# Object for values to be saved
values = []
# Iterate over rows and select a value randomly.
for index, row in data.iterrows():
    # Random column index
    rand_selector = randint(0, 4)
    # Somehow save deleted value and its position in data object
    value = ?? <-------
    values.append(value)
    # Delete random value
    del row[rand_selector]
To add further, each saved value will later be compared to the value imputed in its place by other methods (data imputation); therefore I need the position of each deleted value in the original dataset.
This function, given a 2D NumPy array data, returns an array with int(0.3 * data.size) rows, each of length 3, consisting of a randomly chosen value and its row and column indices in data.
import numpy as np
def pickRand30(data):
    # replace=False ensures that no position is picked twice
    rand = np.random.choice(data.size, size=int(data.size * 0.3), replace=False)
    indexes1 = rand // data.shape[1]  # row indices
    indexes2 = rand % data.shape[1]   # column indices
    return np.array((data[indexes1, indexes2], indexes1, indexes2)).T
You can delete the entries using their coordinates; however, you may want to look into masked arrays instead of deleting single entries from a matrix.
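As a rough illustration of the masked-array suggestion, the sketch below hides 30% of the entries instead of deleting them, while keeping the saved values and their positions for later comparison. The small stand-in array and the names rng, n_pick, saved and masked are assumptions for the example, not part of the original answer.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the real dataset: a small 4x5 float array.
data = np.arange(20, dtype=float).reshape(4, 5)

# Pick 30% of all positions without repetition.
n_pick = int(data.size * 0.3)
flat = rng.choice(data.size, size=n_pick, replace=False)
rows, cols = np.unravel_index(flat, data.shape)

# Keep the original values and their coordinates for later comparison.
saved = data[rows, cols].copy()

# Mark the picked entries as missing instead of deleting them.
mask = np.zeros(data.shape, dtype=bool)
mask[rows, cols] = True
masked = np.ma.masked_array(data, mask=mask)
```

The array keeps its shape, so the coordinates in rows and cols stay valid after the values are hidden.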

iterate in numpy array over rows just in column

I want to iterate over a single column of a NumPy array (interpol_values_array), and only that column, to find the position where a value (mole_percentage) should be inserted.
column = interpol_values_array[:][-2]
for column in interpol_values_array:
    place = []
    place_array = np.array(place)
    place_array = np.searchsorted([interpol_values_array], mole_percentage,
                                  side="right")
place_array should give me the index at which to insert my value (mole_percentage).
Is this a workable approach? Also, is there a way to do this with np.nditer, which might be far more elegant?
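One possible approach, sketched under the assumption that the column of interest is the second-to-last one and is sorted in ascending order: select it with arr[:, -2] (note that arr[:][-2] selects the second-to-last row instead) and call np.searchsorted on it directly, with no explicit loop. The sample values below are made up for illustration.

```python
import numpy as np

# Hypothetical data; the second-to-last column is sorted ascending.
interpol_values_array = np.array([[0.0, 10.0, 1.0],
                                  [0.0, 20.0, 2.0],
                                  [0.0, 30.0, 3.0],
                                  [0.0, 40.0, 4.0]])
mole_percentage = 25.0

# Second-to-last column as a 1D array.
column = interpol_values_array[:, -2]
# Index at which mole_percentage would be inserted to keep the column sorted.
place = np.searchsorted(column, mole_percentage, side="right")
```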

Python Numpy: Coalesce and return first nonzero observation

I am currently new to NumPy, but very proficient with SQL.
I used a function called coalesce in SQL, which I was disappointed not to find in NumPy. I need this function to create a third array, want, from two arrays, array1 and array2, where zero/missing observations in array1 are replaced by the observations in array2 at the same position. I can't figure out how to use np.where for this.
Once this task is accomplished, I would like to take the lower triangle of the array want and then populate a final array, want2, with the first non-zero observation from each column. If all observations in a column of the lower triangle are missing or 0, then want2 should default to zero for that column.
I have written an example demonstrating the desired behavior.
import numpy as np
array1= np.array(([-10,0,20],[-1,0,0],[0,34,-50]))
array2= np.array(([10,10,50],[10,0,25],[50,45,0]))
# Coalesce array1,array2 i.e. get the first non-zero value from array1, then from array2.
# if array1 is empty or zero, then populate table want with values from array2 under same address
want=np.tril(np.array(([-10,10,20],[-1,0,25],[50,34,-50])))
print(array1)
print(array2)
print(want)
# print first instance of nonzero observation from each column of table want
want2=np.array([-10,34,-50])
print(want2)
"Coalesce": use putmask to replace values equal to zero with values from array2:
want = array1.copy()
np.putmask(want, array1 == 0, array2)  # modify want in place, not a throwaway copy
First nonzero element of each column of np.tril(want):
where_nonzero = np.where(np.tril(want) != 0)
# For the where array, get the indices of only
# the first index for each column
first_indices = np.unique(where_nonzero[1], return_index=True)[1]
# Get the values from want for those indices
want2 = want[(where_nonzero[0][first_indices], where_nonzero[1][first_indices])]
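The np.unique approach skips any column that has no nonzero entry, whereas the question asks for a default of zero in that case. The following sketch shows one way to handle both steps: np.where for the coalesce, and a boolean mask with argmax for the first nonzero entry per column. The helper name first_nonzero_per_column and its default parameter are illustrative and not part of the original answer.

```python
import numpy as np

array1 = np.array([[-10, 0, 20], [-1, 0, 0], [0, 34, -50]])
array2 = np.array([[10, 10, 50], [10, 0, 25], [50, 45, 0]])

# Coalesce: take array1 where it is nonzero, otherwise array2.
want = np.where(array1 != 0, array1, array2)
lower = np.tril(want)

def first_nonzero_per_column(arr, default=0):
    """Return the first nonzero entry of each column, or default if none."""
    mask = arr != 0
    # argmax on a boolean array gives the index of the first True per column.
    first_row = mask.argmax(axis=0)
    cols = np.arange(arr.shape[1])
    # Columns without any nonzero entry fall back to the default value.
    return np.where(mask.any(axis=0), arr[first_row, cols], default)

want2 = first_nonzero_per_column(lower)
```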
