Making an index array from an array in numpy - python

Good morning experts,
I have an array a which contains integer numbers, and I have a list of the unique values in that array, sorted in a special order. What I want is to make another array b which contains, for each element of a, the index of that value in the list.
# a: numpy array with integer values
# size_x and size_y: array dimensions of a
# index_list: the unique values of a, sorted in a special order
# b: new array that will hold the index values
for i in xrange(0, size_x):
    for j in xrange(0, size_y):
        b[i][j] = index_list.index(a[i][j])
This works, but it takes a long time. Is there a faster way to do it?
Many thanks for your help
German

The slow part is the lookup
index_list.index(a[i][j])
It will be much quicker to use a Python dictionary for this task, i.e. rather than
index_list = [ item_0, item_1, item_2, ...]
use
index_dict = { item_0:0, item_1:1, item_2:2, ...}
Which can be created using:
index_dict = dict( (item, i) for i, item in enumerate(index_list) )
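For completeness, a minimal sketch of how this dictionary would then be used inside the original loop (assuming every value of a appears in index_list):
for i in xrange(0, size_x):
    for j in xrange(0, size_y):
        b[i][j] = index_dict[a[i][j]]  # O(1) dict lookup instead of an O(n) list search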

I didn't try it, but as this is pure numpy, it should be much faster than a dictionary-based approach:
# note that the code will use the next higher value if a value is
# missing from index_list.
new_vals, old_index = np.unique(index_list, return_index=True)
# use searchsorted to find the index:
b_new_index = np.searchsorted(new_vals, a)
# And the original index:
b = old_index[b_new_index]
Alternatively you could simply fill any holes in index_list.
Edit: I fixed the code; as originally posted it was quite simply wrong (or very limited)...
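For concreteness, a tiny run of the searchsorted approach above (the values of a and index_list here are made up):
import numpy as np

index_list = [30, 10, 20]                # unique values of a, in a special order
a = np.array([[10, 30], [20, 10]])
new_vals, old_index = np.unique(index_list, return_index=True)
# new_vals -> [10, 20, 30], old_index -> [1, 2, 0]
b = old_index[np.searchsorted(new_vals, a)]
# b -> [[1, 0], [2, 1]], i.e. index_list.index() applied element-wise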

Related

Sort a list of arrays using the sort order of another array in Python

I have two lists: a list of 2D arrays and a list of the times (integers) at which each array was generated. Both have length N and are both "equally disordered", so the indices that sort the 'time' list would also order the list of 2D arrays.
I would like to do something like:
ordered_list_of_arrays = np.asarray(disordered_list).argsort(np.asarray(time))
ordered_time = np.asarray(time).sort()
Another option would be to leave it as a list:
ordered_arrays = disordered_list[np.argsort(np.asarray(time))]
but this raises: TypeError: only integer scalar arrays can be converted to a scalar index
By iterating over np.argsort(time) I could sort my disordered_list, but I would like to know whether there is a better option, or which approach is best.
Thanks
Create an index array and sort it based on time_list's values. Then use those indices to construct sorted versions of both lists.
Code:
def sorted_lists(time_list, array_list):
    sorted_indices = sorted(range(len(time_list)), key=lambda i: time_list[i])
    sorted_time_list = [time_list[i] for i in sorted_indices]
    sorted_array_list = [array_list[i] for i in sorted_indices]
    return sorted_time_list, sorted_array_list
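If you would rather stay in numpy, the np.argsort the question was reaching for can be used directly; a sketch, assuming time is the list of integers and disordered_list is the list of equally shaped 2D arrays:
import numpy as np

order = np.argsort(time)                              # indices that sort the time list
ordered_time = np.asarray(time)[order]
ordered_arrays = [disordered_list[i] for i in order]  # still a list of 2D arrays
# or, stacked into a single 3D array:
ordered_stack = np.asarray(disordered_list)[order]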

Create an array of index values from one list with another list in Python

I have an array of values as well as another array of keys, and I would like to create an index of the values into the keys.
For example:
value_list = np.array([[2,2,3],[255,243,198],[2,2,3],[50,35,3]])
key_list = np.array([[2,2,3],[255,243,198],[50,35,3]])
MagicFunction(value_list,key_list)
#result = [[0,1,0,2]] which has the same length as value_list
The solutions I have found online are not quite what I am asking for, I believe; any help would be appreciated!
I have this brute-force code which produces the result, but I don't even want to try it on my actual data size:
T = np.zeros((len(value_list)), dtype=np.uint32)
for i in range(len(value_list)):
    for j in range(len(key_list)):
        if sum(value_list[i] == key_list[j]) == 3:
            T[i] = j
The issue is how to make this not terribly inefficient. I see two approaches:
Use a dictionary so that the lookups will be fast. numpy arrays are mutable, and thus not hashable, so you'll have to convert them into, e.g., tuples to use as dictionary keys.
Use broadcasting to check value_list against every "key" in key_list in a vectorized fashion. This will at least bring the for loops out of Python, but you will still have to compare every value to every key.
I'm going to assume here too that key_list only has unique "keys".
Here's how you could do the first approach:
value_list = np.array([[2,2,3],[255,243,198],[2,2,3],[50,35,3]])
key_list = np.array([[2,2,3],[255,243,198],[50,35,3]])
key_map = {tuple(key): i for i, key in enumerate(key_list)}
result = np.array([key_map[tuple(value)] for value in value_list])
result # array([0, 1, 0, 2])
And here's the second:
result = np.where((key_list[None] == value_list[:, None]).all(axis=-1))[1]
result # array([0, 1, 0, 2])
Which way is faster might depend on the sizes of key_list and value_list. I would time both on arrays of sizes typical of your data.
EDIT - as noted in the comments, the second solution doesn't appear to be entirely correct, but I'm not sure what makes it fail. Consider using the first solution instead.
Assumptions:
Every element of value_list will be present in key_list (at some position or the other)
We are interested in the index within key_list, of only the first match
Solution:
From the two arrays, we create views of 3-tuples. We then broadcast the two views in two orthogonal directions and then check for element-wise equality on the broadcasted arrays.
import numpy as np
value_list = np.array([[2,2,3],[255,243,198],[2,2,3],[50,35,3]], dtype='uint8')
key_list = np.array([[2,2,3],[255,243,198],[50,35,3]], dtype='uint8')
# Define a new dtype, describing a "structure" of 3 uint8's (since
# your original dtype is uint8). To the fields of this structure,
# give some arbitrary names 'first', 'sec', and 'third'
dt = np.dtype([('first', np.uint8), ('sec', np.uint8), ('third', np.uint8)])
# Now view the arrays as 1-d arrays of 3-tuples, using the dt
v_value_list = value_list.view(dtype=dt).reshape(value_list.shape[0])
v_key_list = key_list.view(dtype=dt).reshape(key_list.shape[0])
result = np.argmax(v_key_list[:,None] == v_value_list[None,:], axis=0)
print(result)
Output:
[0, 1, 0, 2]
Notes:
Though this is a pure numpy solution without any visible loops, it could have hidden inefficiencies, because it matches every element of value_list with every element of key_list, in contrast with a loop-based search that smartly stops at the first successful match. Any advantage gained will depend upon the actual size of key_list, and upon where the successful matches occur in key_list. As the size of key_list grows, there might be some erosion of the numpy advantage, especially if the successful matches happen mostly in the earlier part of key_list.
The views that we are creating are in fact numpy structured arrays, where each element of the view is a structure of three uint8 fields. One interesting question which I haven't yet explored is: when numpy compares one structure with another, does it compare every field in the structure, or does it short-circuit the field comparisons at the first failed field? Any such short-circuiting could imply a small additional advantage for this structured-array solution.

Selecting randomly from two arrays based upon condition in Python

Suppose I have two arrays of equal length:
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
Now I want to pick elements from these two arrays, in the sequence given, so that they form a new array of the same length as a and b, by randomly selecting between a and b at each position, with a ratio of a:b = 4.68, i.e. for every 1 value picked from a there should be 4.68 values picked from b in the resultant array.
So effectively the resultant array could be something like :
res = [0,1,1,0,1, 1(from a) ,0(from a),1,1,0,0,1,1,0, 0(from a),0,0]
The res array has: the first 5 values from b, the 6th & 7th from a, the 8th-14th from b, the 15th from a, and the 16th-17th from b.
The overall ratio of values from a:b in this example res array is 4.67 (from a = 3, from b = 14).
Thus values have to be chosen at random between the two arrays, but the sequence needs to be maintained, i.e. I cannot take the 7th value from one array and the 3rd value from the other. If the 3rd value of the resultant array is being populated, the choice is made at random between the 3rd elements of the two input arrays. The overall ratio needs to be maintained as well.
Can you please help me develop an efficient, Pythonic way of producing this result? The values need not be the same on every run.
Borrowing the a_count calculation from Barmar's answer (because it seems to work and I can't be bothered to reinvent it), this solution preserves the ordering of the values chosen from a and b:
from future_builtins import zip # Only on Python 2, to avoid temporary list of tuples
import random
# int() unnecessary on Python 3
a_count = int(round(1/(1 + 4.68) * len(a)))
# Use range on Python 3, xrange on Python 2, to avoid making actual list
a_indices = frozenset(random.sample(xrange(len(a)), a_count))
res = [aval if i in a_indices else bval for i, (aval, bval) in enumerate(zip(a, b))]
The basic idea here is that you determine how many a values you need, get a unique sample of the possible indices of that size, then iterate a and b in parallel, keeping the a value for the selected indices, and the b value for all others.
If you don't like the complexity of the list comprehension, you could use a different approach, copying b, then filling in the a values one by one:
res = b[:]  # Copy b in its entirety
# Replace selected indices with a values
# No need to convert to frozenset for efficiency here, and it's clean
# enough to just iterate the sample directly without storing it
for i in random.sample(xrange(len(a)), a_count):
    res[i] = a[i]
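An equivalent sketch using numpy to pick the indices (assuming a, b and a_count are as above):
import numpy as np

a_idx = np.random.choice(len(a), size=a_count, replace=False)  # positions that take values from a
res = np.array(b)
res[a_idx] = np.array(a)[a_idx]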
I believe this should work. You specify how many values you want from a (you can simply use your ratio to figure out that number), randomly generate a 'mask' of numbers, and choose from a or b based on the cutoff (notice that you only sort to figure out the cutoff; you use the unsorted mask later).
import numpy as np
a = [0,0,1,0,0,1,0,0,0,1,0,1,1,0,0,0,1]
b = [0,1,1,0,1,0,0,1,1,0,0,1,1,0,1,0,0]
mask = np.random.random(len(a))
from_a = 3
cutoff = np.sort(mask)[from_a]
res = []
for i in range(len(a)):
    if mask[i] < cutoff:  # exactly from_a positions fall below the cutoff
        res.append(a[i])
    else:
        res.append(b[i])
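The same choice can also be made without the Python loop; a minimal sketch using np.where, with mask, cutoff, a and b from above:
res = np.where(mask < cutoff, a, b)  # take a where the mask is below the cutoff, b otherwise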

"list indices must be integers, not tuple" but it worked before?

I have a 2d list of the form:
d = [[0.87768026489137663, -0.42848220833223599],
[0.87770426313019434, -0.428411425505765],
[0.87796388044104012, -0.42873867479872063],
[0.87801587662514491, -0.42860583582101786],
[0.87794315468933382, -0.42847396647067809]]
I want to get a single column from it. I've done this before in a different program using d[:,0] or d[:,1] and it worked perfectly, but now when I try that I get the error: list indices must be integers, not tuple. I know this must be a really simple fix, but I'm just not sure what's wrong. I'm using Python 3.4 if that matters.
You have a list of lists. What you want to do is iterate through the list of lists, and for every sub-list, pick out the first item if you want the first column, or the second item if you want the second column, etc. The following one-liner will do that:
column = [x[0] for x in d]
Note that x[0] selects the first item in the sub-list. If you want the second item, take x[1], etc. Generally, if you want the nth column in your 2d list (call it d), the code to grab that column is:
column = [x[n] for x in d]
In Python you can't get a column from a matrix using R-style notation; you can do that with the numpy library. If you want to get column i using pure Python, just do:
columns = map(list, zip(*d))
column_i = columns[i]  # i is the column that you want
Example
d = [[1,2],[3,4]]
new_d = zip(*d)
>> [(1,3),(2,4)]
map(list, new_d)
>> [[1,3],[2,4]]
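Note that on Python 3, zip and map return iterators rather than lists, so wrap the result to get the lists shown above:
columns = [list(col) for col in zip(*d)]
column_i = columns[i]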
It seems to me that you are going to use the data in the list for further calculations. My favourite for handling such lists is numpy. If you import the numpy module, you can access the data as you proposed:
import numpy as np
d = np.array([[0.87768026489137663, -0.42848220833223599],
[0.87770426313019434, -0.428411425505765],
[0.87796388044104012, -0.42873867479872063],
[0.87801587662514491, -0.42860583582101786],
[0.87794315468933382, -0.42847396647067809]])
d[:,1]
output:
array([-0.42848221, -0.42841143, -0.42873867, -0.42860584, -0.42847397])
I find it much easier to use numpy for such data as it is more intuitive to use than list comprehensions.
Hope this helps.

Vectorize iteration over two large numpy arrays in parallel

I have two large arrays of type numpy.core.memmap.memmap, called data and new_data, with > 7 million float32 items.
I need to iterate over them both within the same loop, which I'm currently doing like this:
for i in range(0, len(data)):
    if new_data[i] == 0: continue
    combo = (data[i], new_data[i])
    if combo not in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
However this is unreasonably slow, so I gather that using numpy's vectorising functions is the way to go.
Is it possible to vectorize with the index, so that the vectorised array can compare its items to the corresponding items in the other array?
I thought of zipping the two arrays, but I guess this would cause unreasonable overhead to prepare?
Is there some other way to optimise this operation?
For context: the goal is to effectively merge the two arrays such that each unique combination of corresponding values between the two arrays is represented by a different value in the resulting array, except zeros in the new_data array which are ignored. The arrays represent 3D bitmap images.
EDIT: available_values is a set of values that have not yet been used in data and persists across calls to this loop. new_values_map on the other hand is reset to an empty dictionary before each time this loop is used.
EDIT2: the data array only contains whole numbers; that is, it's initialised as zeros, and then with each usage of this loop with a different new_data it is populated with more values drawn from available_values, which is initially a range of integers. new_data could theoretically be anything.
In answer to your question about vectorising, the answer is probably yes, though you need to clarify what available_values contains and how it's used, as that is the core of the vectorisation.
Your solution will probably look something like this...
indices = new_data != 0
data[indices] = available_values
In this case, if available_values can be treated as a sequence of values, where the first value is allocated to the first position in data at which new_data is not 0, that should work, as long as available_values is a numpy array.
Let's say new_data and data take values 0-255, then you can construct an available_values array with unique entries for every possible pair of values in new_data and data like the following:
available_data = numpy.arange(256 * 256).reshape((256, 256))  # one entry for every (data, new_data) pair in 0-255
indices = new_data != 0
data[indices] = available_data[data[indices].astype(int), new_data[indices].astype(int)]  # cast: the memmaps are float32
Obviously, available_data can be whatever mapping you want. The above should be very quick whatever is in available_data (especially if you only construct available_data once).
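For a tiny illustration of the lookup-table idea (the array contents here are made up):
import numpy as np

data = np.array([0., 1., 0., 2.], dtype=np.float32)
new_data = np.array([0., 3., 5., 0.], dtype=np.float32)

available_data = np.arange(256 * 256).reshape((256, 256))
indices = new_data != 0
data[indices] = available_data[data[indices].astype(int), new_data[indices].astype(int)]
# data is now [0., 259., 5., 2.]: (1, 3) -> 1*256 + 3 = 259, (0, 5) -> 5,
# and positions where new_data is zero are left untouched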
Python gives you powerful tools for handling large arrays of data: generators and iterators.
Basically, they allow you to access your data as if it were a regular list, without fetching it all into memory at once, but accessing it piece by piece.
To access two large arrays at once, you can do:
from itertools import izip  # on Python 3, use the built-in zip

for item_a, item_b in izip(data, new_data):
    # ... do your stuff here
izip creates an iterator that walks over the elements of both arrays in parallel, but it picks pieces as you need them, not all at once.
It seems that replacing the first two lines of the loop to produce:
for i in numpy.where(new_data != 0)[0]:
    combo = (data[i], new_data[i])
    if combo not in new_values_map: new_values_map[combo] = available_values.pop()
    data[i] = new_values_map[combo]
has the desired effect.
So most of the time in the loop was spent skipping iterations upon encountering a zero in new_data. I don't really understand why that many null iterations were so expensive; maybe one day I will...
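For reference, a sketch of how the whole merge could be vectorized with np.unique, if labelling each unique (data, new_data) combination with a fresh value is acceptable (this does not reproduce the available_values bookkeeping, so treat it as an illustration rather than a drop-in replacement):
import numpy as np

mask = new_data != 0
pairs = np.stack((data[mask], new_data[mask]), axis=1)
# one label per unique (data, new_data) combination, in [0, n_unique)
_, labels = np.unique(pairs, axis=0, return_inverse=True)
data[mask] = labels + data.max() + 1  # offset so new labels don't collide with existing values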
