I have the following task:
Two lists: the first holds the items, the second holds the metadata (two floats) for every item.
Both lists get changed by various steps of an algorithm, but their lengths are kept equal, i.e. they grow, but grow by the same amount. This way the index identifies which item the metadata refers to.
At one (repeated) step of the algorithm I shorten the item list by identifying duplicate items, and accordingly I have to adjust the metadata list.
I could implement that using generic lists, but at some point it overloads the memory. So I tried using np.array, but the issue there is that the dimensions should be equal for every element, i.e. arr=np.array([1,2, [3, [4]] ],dtype=object) returns arr.ndim=1, whereas what I need is for it to return arr.ndim=3. I played around with it and discovered that [3,[4]] is of type list and has nothing to do with np.array. Only with equal dimensions for every element does np.array return elements of NumPy types along every axis, say np.int32 or np.ndarray.
The critical third step: when I go through the list and collect the metadata of identical items, I put it into meta_list under the same index, i.e. I create (or expand) a list of lists at that index. Example:
meta_list=[[1,2],[3,4],[5,6],[7,8]]
Then, say, the elements at indices 1 and 3 of item_list are the same, so I have to combine their metadata. That yields this:
meta_list=[[1,2],[[3,4],[7,8]],[5,6]]
But I cannot wrap my head around how to implement this step using np.array while profiting from its storage efficiency, since that [[3,4],[7,8]] element will be of type list.
Would be very grateful for hints.
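One direction that might work, sketched here with made-up item values: instead of nesting lists inside meta_list, keep the metadata as a flat (K, 2) float array and express the grouping with a separate integer index array, e.g. via np.unique with return_inverse=True. A minimal sketch, not necessarily matching how the duplicates are actually detected:

import numpy as np

item_arr = np.array([10, 20, 30, 20])                          # hypothetical items; indices 1 and 3 are duplicates
meta_arr = np.array([[1, 2], [3, 4], [5, 6], [7, 8]], float)   # stays a flat, contiguous (4, 2) array

# for every original row, the index of the group it gets merged into
unique_items, group_of_row = np.unique(item_arr, return_inverse=True)

# the metadata of group g is meta_arr[group_of_row == g]; no nested Python lists are needed
merged = [meta_arr[group_of_row == g] for g in range(len(unique_items))]
# merged[1] -> array([[3., 4.], [7., 8.]])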
I have a numpy array total_weights which is an IxI array of floats. Each row/column corresponds to one of I items.
During my main loop I acquire another real-valued float array weights of size NxM (N, M < I) where each row/column also corresponds to one of the original I items (duplicates may also exist).
I want to add this array to total_weights. However, the sizes and order of the two arrays are not aligned. Therefore, I maintain a position map, a pandas Series called pos_df, which maps item IDs to their proper index/position in total_weights.
In order to properly make the addition I want I perform the following operation inside the loop:
candidate_pos = pos_df.loc[candidate_IDs] # don't worry about how I get these
rated_pos = pos_df.loc[rated_IDs] # ^^
total_weights[candidate_pos, :][:, rated_pos] += weights
Unfortunately, the above operation must be editing a copy of the original total_weights matrix and not a view of it, since after the loop the total_weights array is still full of zeroes. How do I make it change the original data?
Edit:
I want to clarify that candidate_IDs are the N IDs of items and rated_IDs are the M IDs of items in the NxM array called weights. Through pos_df I can get their total order in all of I items.
Also, my guess as to the reason a copy is returned is that candidate_IDs and thus candidate_pos will probably contain duplicates e.g. [0, 1, 3, 1, ...]. So the same rows will sometimes have to be pulled into the new array/view.
Your first problem is in how you are using indexing. As candidate_pos is an array, total_weights[candidate_pos, :] is a fancy indexing operation that returns a new array. When you apply indexing again, i.e. ...[:, rated_pos] you are assigning elements to the newly created array rather than to total_weights.
The second problem, as you have already spotted, is in the actual logic you are trying to apply. If I understand your example correctly, you have an I x I matrix of weights, and you want to update the weights for a sequence of pairs ((Ix_1, Iy_1), ..., (Ix_N, Iy_N)), with repetitions, in a single line of code. This can't be done this way with the += operator, as you'll find yourself having added to weights[Ix_n, Iy_n] only the weight corresponding to the last time (Ix_n, Iy_n) appears in your sequence: you have to first merge all the repeating elements in your sequence of weight updates, and then update your weights matrix with the new "unique" sequence of updates. Alternatively, you can collect your weights as an I x I matrix and directly sum it to total_weights.
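Not the merge-first approach described above, but a closely related sketch: np.add.at performs the addition unbuffered, so repeated positions each contribute their own update (this assumes candidate_pos and rated_pos hold integer positions into total_weights):

import numpy as np

rows = np.asarray(candidate_pos)[:, None]   # shape (N, 1)
cols = np.asarray(rated_pos)[None, :]       # shape (1, M)
# unbuffered in-place addition: duplicate rows in candidate_pos accumulate instead of overwriting
np.add.at(total_weights, (rows, cols), weights)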
After #rveronese pointed out that it's impossible to do it in one go because of the duplicates in candidate_pos, I believe I have managed to do what I want with a for-loop over them:
candidate_pos = pos_df.loc[candidate_IDs] # don't worry about how I get these
rated_pos = pos_df.loc[rated_IDs] # ^^
for i, c in enumerate(candidate_pos):
    total_weights[c, rated_pos] += weights[i, :]
In this case, the indexing does not create a copy and the assignment should be working as expected...
For example, suppose I had an (n,2) dimensional tensor t whose elements are all from the set S containing random integers. I want to build another tensor d with size (m,2) where individual elements in each tuple are from S, but the whole tuples do not occur in t.
E.g.
S = [0,1,2,3,7]
t = [[0,1],
[7,3],
[3,1]]
d = some_algorithm(S,t)
/*
d =[[2,1],
[3,2],
[7,0]]
*/
What is the most efficient way to do this in python? Preferably with pytorch or numpy, but I can work around general solutions.
In my naive attempt, I just use
d = np.random.choice(S, (m, 2))
non_dupes = [row.tolist() not in t for row in d]   # t taken to be a plain list of [x, y] pairs
d = d[non_dupes]
But both t and S are incredibly large, and this takes an enormous amount of time (not to mention, rarely results in a (m,2) array). I feel like there has to be some fancy tensor thing I can do to achieve this, or maybe making a large hash map of the values in t so checking for membership in t is O(1), but this produces the same issue just with memory. Is there a more efficient way?
An approximate solution is also okay.
my naive attempt would be a base-transformation function that reduces the problem to an integer-set problem:
definitions and assumptions:
let S be a set (unique elements)
let L be the number of elements in S
let t be a set of M-tuples with elements from S
the original order of the elements in t is irrelevant
let I(x) be the index function of the element x in S
let x[n] be the n-th tuple-member of an element of t
let f(x) be our base-transform function (and f^-1 its inverse)
since S is a set we can write each element of t as an M-digit number in base L, using the elements of S as digits.
for M=2 the transformation looks like
f(x) = I(x[1])*L^1 + I(x[0])*L^0
f^-1(x) is also rather trivial: x mod L gives back the index of the least significant digit; take floor(x/L) and repeat until all indices are extracted, then look up the values in S and construct the tuple.
since you can now represent t as an integer set (read: hash table), calculating the inverse set d becomes rather trivial
loop over all encodable values, i.e. from 0 to L^M - 1, and ask your hash table whether the element is in t; everything that is not belongs to d
if the size of S is too big, you can also just draw random numbers and test them against the hash table (both t and the d built so far) to get a subset of the inverse of t
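a rough sketch of the transform for M = 2, with made-up helper names (idx, f, f_inv, t_codes) and the small S/t from the question:

S = [0, 1, 2, 3, 7]
L = len(S)
idx = {v: i for i, v in enumerate(S)}          # I(x): value -> index in S

def f(pair):                                   # tuple -> integer in [0, L**2)
    return idx[pair[1]] * L + idx[pair[0]]

def f_inv(n):                                  # integer -> tuple
    return (S[n % L], S[(n // L) % L])

t = [(0, 1), (7, 3), (3, 1)]
t_codes = {f(p) for p in t}                    # t as an integer hash set
d = [f_inv(n) for n in range(L ** 2) if n not in t_codes]   # full enumeration of the inverse set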
does this help you?
If |t| + |d| << |S|^2 then the probability of some random tuple being chosen again (in a single iteration) is relatively small.
To be more exact, if (|t|+|d|) / |S|^2 = C for some constant C<1, then if you redraw an element until it is a "new" one, the expected number of redraws needed is 1/(1-C).
This means that, by redrawing elements until you hit a new one, you need O((1/(1-C)) * |d|) draws on average to collect all of d, which is O(|d|) if C is indeed constant.
Checking if an element has already been "seen" can be done in several ways:
Keeping hash sets of t and d. This requires extra space, but each lookup takes constant O(1) time. You could also use a Bloom filter instead of storing the actual elements you have already seen; it will make some errors, saying an element has been "seen" although it was not, but never the other way around - so you will still get all elements of d as unique.
Sorting t in place and using binary search. This adds O(|t| log|t|) pre-processing and O(log|t|) per lookup, but requires no additional space (other than where you store d).
If, in fact, |d| + |t| is very close to |S|^2, then an O(|S|^2)-time solution could be to use a Fisher-Yates shuffle on the available choices and take the first |d| elements that do not appear in t.
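A minimal sketch of the redraw-until-new idea with a hash set (assuming S and t are NumPy arrays; sample_new_pairs is a made-up helper name):

import numpy as np

def sample_new_pairs(S, t, m, seed=None):
    # Redraw random pairs until m tuples are found that do not occur in t.
    rng = np.random.default_rng(seed)
    seen = set(map(tuple, t.tolist()))     # hash set of forbidden tuples
    out = []
    while len(out) < m:
        pair = tuple(rng.choice(S, 2).tolist())
        if pair not in seen:
            seen.add(pair)                 # also keeps d itself free of repeats
            out.append(pair)
    return np.array(out)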
loss=[[ -137.70171527 -81408.95809899 -94508.84395371 -311.81198933 -294.08711874]]
When I print loss it prints the sum of the numbers and not the individual numbers. I want to change this to a list so I can iterate over each individual number, and I don't know how; please help.
I have tried:
result = map(tuple,loss)
However it still prints the sum of the inner numbers. When I try to index it, it says there is only one element. It works if I put commas in between, but this is a matrix that is output by other code, so I can't change or add to it.
You have a list of a list of numbers: the outer list contains exactly one element (namely the inner list). The inner list is the list of numbers over which you want to iterate. Hence, to iterate over the inner list, you first have to access it from the outer list using an index, for example like this:
for list_item in loss[0]:
    do_something_for_each_element(list_item)
Moreover, I think you wanted separate elements in the inner list rather than one single computed number, didn't you? If that is the case, you have to separate the elements with commas.
E.g.:
loss=[[-137.70171527, -81408.95809899, -94508.84395371, -311.81198933, -294.08711874]]
EDIT:
As you clarified in the comments, you want to iterate over a numpy matrix. One way to do so is to convert the matrix into an n-dimensional array (ndarray) and then iterate over that structure. This could, for example, look like the following; other options have also been presented in this answer (Iterate over a numpy Matrix rows):
import numpy as np
test_matrix=np.matrix([[1, 2], [3, 4]])
for row in test_matrix.A:
    print(row)
Note that the A attribute of a matrix object is its ndarray representation (https://docs.scipy.org/doc/numpy/reference/generated/numpy.matrix.html).
I have a 2-D list of shape (300,000, X), where each of the sublists has a different size. In order to convert the data to a Tensor, all of the sublists need to have equal length, but I don't want to lose any data from my sublists in the conversion.
That means that I need to fill all sublists smaller than the longest sublist with filler (-1) in order to create a rectangular array. For my current dataset, the longest sublist is of length 5037.
My conversion code is below:
for seq in new_format:
    for i in range(0, length - len(seq)):
        seq.append(-1)
However, when there are 300,000 sequences in new_format, and length-len(seq) is generally >4000, the process is extraordinarily slow. How can I speed this process up or get around the issue efficiently?
Individual append calls can be rather slow, so use list multiplication to create the whole filler value at once, then concatenate it all at once, e.g.:
for seq in new_format:
    seq += [-1] * (length - len(seq))
seq.extend([-1] * (length - len(seq))) would be equivalent (trivially slower due to the generalized method-call approach, but likely unnoticeable given the size of the real work).
In theory, seq.extend(itertools.repeat(-1, length-len(seq))) would avoid the potentially large temporaries, but IIRC, the actual CPython implementation of list.__iadd__/list.extend forces the creation of a temporary list anyway (to handle the case where the generator is defined in terms of the list being extended), so it wouldn't actually avoid the temporary.
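If the end goal is a rectangular structure anyway, a possible alternative sketch (assuming the sublists hold numbers) is to write straight into a pre-allocated NumPy array instead of growing the Python lists:

import numpy as np

length = max(map(len, new_format))                 # 5037 in the example above
padded = np.full((len(new_format), length), -1.0)  # pre-filled with the -1 filler
for i, seq in enumerate(new_format):
    padded[i, :len(seq)] = seq                     # copy each sublist into its row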
I have an array of identifiers that have been grouped into threes. For each group, I would like to randomly assign its members to one of three sets and to have those assignments stored in another array. So, for a given array of grouped identifiers (I pre-sort them):
groupings = array([1,1,1,2,2,2,3,3,3])
A possible output would be
assignments = array([0,1,2,1,0,2,2,0,1])
Ultimately, I would like to be able to generate many of these assignment lists and to do so efficiently. My current method is just to create a zeros array and set each consecutive subarray of length 3 to a random permutation of 3.
import numpy

assignment = numpy.zeros((12, 10), dtype=int)
for i in range(0, 12, 3):
    for j in range(10):
        assignment[i:i+3, j] = numpy.random.permutation(3)
Is there a better/faster way?
Two things I can think of:
Instead of visiting the 2D array as 3 rows * 1 column in your inner loop, try to visit it as 1 row * 3 columns. Accessing a 2D array row-first is usually faster than column-first, since it gives better spatial locality, which is good for caching.
Instead of running numpy.random.permutation(3) each time, if 3 is fixed and small, generate the possible permutations beforehand and save them in a constant array of arrays like (array([0,1,2]), array([0,2,1]), array([1,0,2]), ...). You then just need to randomly pick one of them each time, as in the sketch below.
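A sketch of that second idea, assuming the group size is fixed at 3 (n_groups and n_trials are hypothetical stand-ins for the 12/3 groups and 10 columns above):

import itertools
import numpy as np

perms = np.array(list(itertools.permutations(range(3))))   # the 6 possible permutations of 0..2

n_groups, n_trials = 4, 10
picks = np.random.randint(0, len(perms), size=(n_groups, n_trials))  # one permutation per group and column
assignment = perms[picks]                                   # shape (n_groups, n_trials, 3)
assignment = assignment.transpose(0, 2, 1).reshape(3 * n_groups, n_trials)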