Equation calculations with 4D arrays - python

Basically I have over 1000 3D arrays with the shape (100, 100, 1000), so some pretty large arrays, which I need to use in some calculations. The great thing about Python and NumPy is that, instead of iterating, calculations on every element can be done very quickly. For example, I can take the element-wise sum across all the 3D arrays almost instantly; the result is one large array holding, at each index, the sum over all the arrays. In principle, that is ALMOST what I want to do, however, there is a bit of a problem.
What I need to do is use an equation that looks like this:
So as stated, I have around 1000 3D arrays; stacked together, the total array has shape (1000, 100, 100, 1000). I also have a list of length 1000 that corresponds to the 1000 3D arrays, and each entry of that list is either a 1 or a 0. If an entry is 1, the entire 3D array at that index should go into the first term of the equation; if it is 0, it goes into the other term.
I am, however, very much in doubt about how to do this without resorting to some kind of looping that might slow the calculation down by a great deal.

You could sort it by locating the 1's and 0's.
Something like:
list_ones = np.where(indicator == 1)[0]    # indicator is the length-1000 list of 0s and 1s
list_zeros = np.where(indicator == 0)[0]
Then Array[list_ones,:,:,:] will contain all the 3D arrays corresponding to a one, and Array[list_zeros,:,:,:] will contain all those corresponding to a zero.
Then you can just put
first_term = Array[list_ones,:,:,:]
second_term = Array[list_zeros,:,:,:]
And sum as appropriate.
Would this work for your purpose?
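For completeness, a minimal runnable sketch of that idea with toy sizes (here indicator stands in for the 0/1 list and Array for the stacked data; both names are just for illustration):

import numpy as np

# Toy sizes so the sketch runs quickly; swap in the real data.
Array = np.random.rand(10, 4, 4, 5)            # stands in for the (1000, 100, 100, 1000) array
indicator = np.random.randint(0, 2, size=10)   # the 0/1 list, one entry per 3D array

mask = indicator.astype(bool)

first_term = Array[mask]       # all 3D arrays flagged with 1
second_term = Array[~mask]     # all 3D arrays flagged with 0

# Element-wise sums over each group, no Python loop needed
first_sum = first_term.sum(axis=0)
second_sum = second_term.sum(axis=0)

Boolean masking avoids building intermediate index lists, and .sum(axis=0) does the element-wise summation over each group in compiled code.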

How can I solve for x with Ax=B, when A and X are 1-d arrays and I know A?

In my original code I have the following function:
B = np.inner(A,x)
where A.shape = [307_200] and has values -1 or 1
where x.shape = [307_200] and has values 0 to 256
where B results in an integer with a large value.
Assuming I know A and B, but don't know x, how can I solve for x?
To simplify the problem...
import numpy as np
A = np.random.choice(a=[-1,1], size=10)
x = np.random.choice(a=range(0,256), size=10)
B = np.inner(A, x)
I want to solve for x now. So something like one of the following...
x_solved = np.linalg.solve(A, B)
x_solved = np.linalg.lstsq(A, B)
Is it possible?
Extra info...
I could change A to be an n x m matrix, but since I am dealing with large matrices, when I try to use lstsq I quickly run out of memory. This is bad because 1. I can't run it on my local machine and 2. the end-use application needs to limit RAM.
However, for the problem above, I can accept RAM-intensive solutions, since I might be able to moderate the compute resources with some clever tricks.
Also, we could switch A to boolean values if that would help.
Apologies if solution is obvious or simple.
Thanks for the help.
Here is your problem re-stated:
I have an array A containing many 1s and -1s. I want to make another array x containing integers 0-255 so that when I multiply each entry by the corresponding entry of the first array, then add up all the results, I get some target number B.
Notice that the problem is just as difficult if you shuffle the array elements. So let's shuffle them so all the 1s are at the start and all the -1s are at the end. After solving this simplified version of the problem, we can shuffle them back.
Now the simplified problem is this:
I have A₁ entries equal to 1 and A₋₁ entries equal to -1. I want to make two arrays x₁ and x₋₁ containing numbers from 0-255 so that when I add up all the numbers in x₁ and subtract all the numbers in x₋₁ I get some target number B.
Can you work out how to solve this?
I'd start by filling x₁ with 255s until the next 255 would make the sum too high, then fill the next entry with the number that makes the sum equal the target, then fill the rest with 0s. Then fill x₋₁ with 0s. If the target number is negative, do the opposite. Then un-shuffle it: match up the x₁ and x₋₁ arrays with the positions of the 1s and -1s in your array A. And you're done.
You can actually write that algorithm so it puts the numbers directly into x without needing to make the temporary arrays x₁ and x₋₁.
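A rough sketch of that greedy fill, writing the numbers directly into x as suggested (the helper name solve_for_x is mine, not part of the answer, and it assumes B is actually reachable):

import numpy as np

def solve_for_x(A, B):
    """Return one x (values 0-255) with np.inner(A, x) == B, if possible."""
    x = np.zeros_like(A)
    # Work on the +1 positions if B is non-negative, the -1 positions otherwise.
    positions = np.flatnonzero(A == (1 if B >= 0 else -1))
    remaining = abs(B)
    for pos in positions:
        x[pos] = min(remaining, 255)
        remaining -= x[pos]
        if remaining == 0:
            break
    if remaining != 0:
        raise ValueError("B is out of reach for this A")
    return x

# Quick check on a small example
A = np.random.choice([-1, 1], size=10)
x_true = np.random.choice(range(256), size=10)
B = np.inner(A, x_true)
x_solved = solve_for_x(A, B)
assert np.inner(A, x_solved) == B

This runs in a single pass over A, so it stays cheap in both time and memory even for the 307_200-entry case.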

Gensim word2vec model outputs 1000 dimension ndarray but the maximum number of ndarray dimensions is 32 - how?

I'm trying to use this 1000 dimension wikipedia word2vec model to analyze some documents.
Using introspection I found out that the vector representation of a word is a 1000 dimension numpy.ndarray, however whenever I try to create an ndarray to find the nearest words I get a value error:
ValueError: maximum supported dimension for an ndarray is 32, found 1000
and from what I can tell by looking around online 32 is indeed the maximum supported number of dimensions for an ndarray - so what gives? How is gensim able to output a 1000 dimension ndarray?
Here is some example code:
doc = [model[word] for word in text if word in model.vocab]
out = []
n = len(doc[0])
print(n)
print(len(model["hello"]))
print(type(doc[0]))
for i in range(n):
    sum = 0
    for d in doc:
        sum += d[i]
    out.append(sum/n)
out = np.ndarray(out)
which outputs:
1000
1000
<class 'numpy.ndarray'>
ValueError: maximum supported dimension for an ndarray is 32, found 1000
The goal here would be to compute the average vector of all words in the corpus in a format that can be used to find nearby words in the model so any alternative suggestions to that effect are welcome.
You're calling numpy's ndarray() constructor-function with a list that has 1000 numbers in it – your hand-calculated averages of each of the 1000 dimensions.
The ndarray() function expects its argument to be the shape of the matrix being constructed, so it's trying to create a new matrix of shape (out[0], out[1], ..., out[999]) – and then every individual value inside that matrix would be addressed with a 1000-int set of coordinates. And indeed, numpy arrays can only have 32 independent dimensions.
But even if you reduced the list you're supplying to ndarray() to just 32 numbers, you'd still have a problem, because your 32 numbers are floating-point values, and ndarray() is expecting integral counts. (You'd get a TypeError.)
Along the approach you're trying to take – which isn't quite optimal, as we'll get to below – you really want to create a single vector of 1000 floating-point dimensions. That is, 1000 cell-like values – not out[0] * out[1] * ... * out[999] separate cell-like values.
So a crude fix along the lines of your initial approach could be replacing your last line with:
result = np.ndarray(len(out))
for i in range(len(out)):
    result[i] = out[i]
But there are many ways to incrementally make this more efficient, compact, and idiomatic – a number of which I'll mention below, even though the best approach, at bottom, makes most of these interim steps unnecessary.
For one, instead of that assignment-loop in my code just above, you could use Python's bracket-indexing assignment option:
result = np.ndarray(len(out))
result[:] = out  # same result as the previous 3-line loop
But in fact, numpy's array() function can essentially create the necessary numpy-native ndarray from a given list, so instead of using ndarray() at all, you could just use array():
result = np.array(out)  # same result as the previous 2 lines
But further, numpy's many functions for natively working with arrays (and array-like lists) already include things to do averages-of-many-vectors in a single step (where even the looping is hidden inside very-efficient compiled code or CPU bulk-vector operations). For example, there's a mean() function that can average lists of numbers, or multi-dimensional arrays of numbers, or aligned sets of vectors, and so forth.
This allows faster, clearer, one-liner approaches that can replace your entire original code with something like:
# get a list of available word-vectors
doc = [model[word] for word in text if word in model.vocab]
# average all those vectors
out = np.mean(doc, axis=0)
(Without the axis argument, it'd average together all individual dimension-values, in all slots, into just one single final average number.)
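As a follow-up for the stated goal of finding nearby words: assuming the loaded model exposes gensim's KeyedVectors interface (older gensim versions forward this from the Word2Vec model), the averaged vector can be fed to similar_by_vector(). A small sketch, reusing model and text from the question:

import numpy as np

# average the vectors of all in-vocabulary words in the document
doc = [model[word] for word in text if word in model.vocab]
doc_vector = np.mean(doc, axis=0)   # one 1000-dimensional vector

# words whose vectors lie closest to the document average
print(model.similar_by_vector(doc_vector, topn=10))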

Efficient way of doing comparisons and arithmetic in a list with irregular dimensions, Python

I'm trying to find an efficient way to transform an array in the following way:
Each element will get transformed into either None, a real number, or a tuple/list/array of size 2 (containing real numbers).
The transformation function I'm using is simple and just does number comparisons, so my first thought was to use np.where for fast comparisons. Now, if the transformation yields None or a real number, I have no problems.
But when the transformation yields a tuple/list/array, np.where gives me errors. This is of course because numpy arrays demand regular dimensions. So now I'm forced to work with lists...
So my idea now is, instead of transforming the element into a tuple/list/array of size 2, to transform it into a complex number. But then I have an array of complex numbers where most entries have zero imaginary part (since most transformations will be None or real numbers). I can't afford this, memory-wise. (Or can I?)
Once I have the transformed list/array/whatever, I will be doing sign operations, arithmetic between its elements, and comparisons again; that's why I would like to keep it as a numpy array.
Am I forced to work with lists in this scenario, or would you do something else?
EDIT:
I was asked to give some concrete examples of my transformation:
Input: an array containing elements with values None or real numbers in [0, 360).
Transformation (simplified):
None goes to None
an element in [0, 45) goes to 2 real numbers (left, right), say 2 random real numbers between 0 and the element.
an element in [45, 360) goes to 1 real number
What I do is, for example:
arrayTransformed = np.where((array >= 0) & (array < 45), transform(array), array)
# this of course gives problems
arrayTransformed = np.where((array >= 45) & (array < 360), transform(array), arrayTransformed)
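No answer is attached to this question here, but one possible sketch (my own assumption, not from the thread) is to keep two parallel float columns, left and right, and use NaN in place of None, so every element has the same regular shape and vectorized np.where-style operations stay available. The toy transform below is made up purely for illustration:

import numpy as np

rng = np.random.default_rng(0)
array = rng.uniform(0, 360, size=10)
array[2] = np.nan                      # NaN plays the role of None

left = np.full_like(array, np.nan)
right = np.full_like(array, np.nan)

small = (array >= 0) & (array < 45)    # these elements map to two numbers
big = (array >= 45) & (array < 360)    # these elements map to one number

# Toy stand-ins for the real transformations
left[small] = rng.uniform(0, array[small])
right[small] = rng.uniform(0, array[small])
left[big] = array[big] / 2.0           # single-number case stored in the left column

Since NaN propagates through arithmetic and makes comparisons come out False, later sign operations and comparisons can usually stay fully vectorized.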

NumPy: Compute mode row-wise spanning over multiple arrays from iterator

In my application I receive from an iterator an arbitrary amount (let's say 1000 for now) of big 1-dimensional arrays arr1, arr2, arr3, ..., arr1000 (10000 entries each). Each entry is an integer between 0 and n, where in this case n = 9. My ultimate goal is to compute a 1-dimensional array result such that result[i] == the mode of arr1[i], arr2[i], arr3[i], ..., arr1000[i].
However, it is not tractable to concatenate the arrays to one big matrix and then compute the mode row-wise, since this may exceed the RAM on my machine.
An alternative would be to set up an array res2 of shape (10000, 10), then loop through every array, use each entry e as an index, and increase the value of res2[i][e] by 1. After looping, I would apply something like argmax. However, this is too slow.
So: Is there a way to perform the task quickly, maybe by using NumPy's advanced indexing?
EDIT (due to the comments):
This is basically the code which calculates the modes row-wise, avoiding concatenating the arrays:
def foo(length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    for arr in array_iterator():
        i = 0
        for e in arr:
            counts[i][e] += 1
            i += 1
    return np.argmax(counts, axis=1)
It already takes 60 seconds for 100 arrays of size 10000 (although there is more work done behind the scenes, which contributes to that time – however, that work scales linearly with the number of arrays).
Regarding the real sizes:
The number of different arrays is really arbitrary. It's a parameter of the experiments and I'd like to be able to set it even to values like 10^6. The length of each array depends on the data set I'm working with. This could be 10000, or 100000, or even worse. However, splitting this into smaller pieces may be possible, though annoying.
My free RAM for this task is about 4 GB.
EDIT 2:
The running time I gave above gives a wrong impression. Actually, the running time belonging just to the inner loop (for e in arr) in the above scenario is only 5 seconds – which is now OK for me, since it's negligible compared to the remaining running time. I will leave this question open for a moment anyway, since there might be an even faster method out there.
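For reference, a sketch of how the counting loop from the question could be vectorized per incoming array with advanced indexing (length, n and array_iterator() are taken from the code above):

import numpy as np

def mode_rowwise(length, n):
    counts = np.zeros((length, n), dtype=np.int_)
    rows = np.arange(length)
    for arr in array_iterator():
        # One fancy-indexed update per incoming array instead of a Python loop
        # over its entries; each row index appears exactly once per arr, so the
        # in-place += is safe here.
        counts[rows, arr] += 1
    return np.argmax(counts, axis=1)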

Fast way to construct a matrix in Python

I have been browsing through the questions, and could find some help, but I prefer having confirmation by asking it directly. So here is my problem.
I have a (numpy) array u of dimension N, from which I want to build a square matrix k of dimension N^2. Basically, each matrix element k(i,j) is defined as k(i,j) = exp(-|u_i - u_j|^2).
My first naive way to do it was like this, which is, I believe, Fortran-like:
for i in range(N):
    for j in range(N):
        k[i][j] = np.exp(np.sum(-(u[i]-u[j])**2))
However, this is extremely slow. For N=1000, for example, it is taking around 15 seconds.
My other way to proceed is the following (inspired by other questions/answers):
i, j = np.ogrid[:N,:N]
k = np.exp(np.sum(-(u[i]-u[j])**2,axis=2))
This is way faster, as for N=1000, the result is almost instantaneous.
So I have two questions.
1) Why is the first method so slow, and why is the second one so fast?
2) Is there a faster way to do it? For N=10000, it already starts to take quite some time, so I really don't know if this was the "right" way to do it.
Thank you in advance!
P.S: the matrix is symmetric, so there must also be a way to make the process faster by calculating only the upper half of the matrix, but my question was more related to the way to manipulate arrays, etc.
First, a small remark: there is no need for np.sum if u is one-dimensional (for example, if it could be written as u = np.arange(N)), which seems to be the case since you wrote that it is of dimension N.
1) First question:
Accessing individual indices in Python is slow, so it is best to avoid [] where possible. You also call np.exp and np.sum many times, whereas they can be called once on whole vectors and matrices. So your second proposal is better, since it computes all of k at once instead of element by element.
2) Second question:
Yes there is. You should consider using only numpy functions and not using indices (around 3 times faster):
k = np.exp(-np.power(np.subtract.outer(u,u),2))
(NB: You can keep **2 instead of np.power, which is a bit faster but has smaller precision)
Edit (taking into account that u is an array of tuples, i.e. shape (N, 2)):
With tuple data, it's a bit more complicated:
ma = np.subtract.outer(u[:,0],u[:,0])**2
mb = np.subtract.outer(u[:,1],u[:,1])**2
k = np.exp(-np.add(ma, mb))
You'll have to use np.subtract.outer twice, since it would return a 4-dimensional array if you did it in one go (and compute lots of useless data), whereas u[i]-u[j] returns a 3-dimensional array.
I used np.add instead of np.sum since it keeps the array dimensions.
NB: I checked with
N = 10000
u = np.random.random_sample((N,2))
It returns the same as your proposals (but 1.7 times faster).
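If SciPy is available (an assumption on my part, not part of the answer above), an equivalent formulation is to compute the pairwise squared distances with cdist and exponentiate:

import numpy as np
from scipy.spatial.distance import cdist

N = 10000
u = np.random.random_sample((N, 2))

# Pairwise squared Euclidean distances, then the Gaussian kernel
k = np.exp(-cdist(u, u, 'sqeuclidean'))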
