Creating two concatenated arrays from a generator - python

Consider the following example in Python 2.7. We have an arbitrary function f() that returns two 1-dimensional numpy arrays. Note that in general f() may return arrays of different sizes, and that the size may depend on the input.
Now we would like to call map on f() and concatenate the results into two separate new arrays.
import numpy as np
def f(x):
    return np.arange(x), np.ones(x, dtype=int)
inputs = np.arange(1,10)
result = map(f,inputs)
x = np.concatenate([i[0] for i in result])
y = np.concatenate([i[1] for i in result])
This gives the intended result. However, since result may take up much memory, it may be preferable to use a generator by calling imap instead of map.
from itertools import imap
result = imap(f,inputs)
x = np.concatenate([i[0] for i in result])
y = np.concatenate([i[1] for i in result])
However, this gives an error because the generator is already exhausted by the time we compute y.
Is there a way to use the generator only once and still create these two concatenated arrays? I'm looking for a solution without a for loop, since it is rather inefficient to repeatedly concatenate/append arrays.
Thanks in advance.

Is there a way to use the generator only once and still create these two concatenated arrays?
Yes, a generator can be cloned with tee:
import itertools
a, b = itertools.tee(result)
x = np.concatenate([i[0] for i in a])
y = np.concatenate([i[1] for i in b])
However, using tee does not help with the memory usage in your case. The above solution would require 5 N memory to run:
N for caching the generator inside tee,
2 N for the list comprehensions inside np.concatenate calls,
2 N for the concatenated arrays.
Clearly, we could do better by dropping the tee:
x_acc = []
y_acc = []
for x_i, y_i in result:
    x_acc.append(x_i)
    y_acc.append(y_i)
x = np.concatenate(x_acc)
y = np.concatenate(y_acc)
This shaves off one more N, leaving 4 N. Going further means dropping the intermediate lists and preallocating x and y. Note that you needn't know the exact sizes of the arrays, only an upper bound:
x = np.empty(capacity)  # capacity: an upper bound on the total length
y = np.empty(capacity)
right = 0
for x_i, y_i in result:
    left = right
    right += len(x_i)  # == len(y_i)
    x[left:right] = x_i
    y[left:right] = y_i
x = x[:right].copy()
y = y[:right].copy()
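For the f and inputs from the question, such a bound is easy to compute up front; in fact it is the exact total length, since f(x) yields arrays of length x:

inputs = np.arange(1, 10)
# len(np.arange(x)) == x, so the concatenated length is exactly the sum:
capacity = int(inputs.sum())  # 45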
In fact, you don't even need an upper bound. Just ensure that x and y are big enough to accommodate the new item:
for x_i, y_i in result:
    left = right
    right += len(x_i)
    if right > len(x):
        # It would be slightly trickier for >1D, but the idea
        # remains the same: grow the 0th dimension to fit
        # the new item.
        new_capacity = int(max(right, len(x)) * 1.5)
        x = np.resize(x, new_capacity)  # np.resize returns a new array
        y = np.resize(y, new_capacity)
    x[left:right] = x_i
    y[left:right] = y_i

Related

Does numpy.corrcoef calculate correlation within matrices when two matrices are provided (intra-correlation)?

Here I'm calculating the Pearson correlation such that I'm accounting for every comparison.
x = pd.DataFrame({'a':[3,6,4,7,9],'b':[6,2,4,1,5],'c':[7,9,1,2,9]},index=['aa','bb','cc','dd','ee']).T
y = pd.DataFrame({'A':[9,4,1,3,5],'B':[9,8,9,5,7],'C':[1,1,3,1,2]},index=['aa','bb','cc','dd','ee']).T
table = pd.DataFrame(columns=['Correlation Coeff'])
for i in range(0, len(x)):
    for j in range(0, len(y)):
        xf = list(x.iloc[i])
        yf = list(y.iloc[j])
        n = np.corrcoef(xf, yf)[0, 1]
        name = x.index[i] + '|' + y.index[j]
        table.at[name, 'Correlation Coeff'] = n
table
This is the result:
Correlation Coeff
a|A -0.232973
a|B -0.713392
a|C -0.046829
b|A 0.601487
b|B 0.662849
b|C 0.29654
c|A 0.608993
c|B 0.16311
c|C -0.421398
Now when I apply these tables directly to numpy's function, removing duplicate values and ones, it looks like this.
x = pd.DataFrame({'a':[3,6,4,7,9],'b':[6,2,4,1,5],'c':[7,9,1,2,9]},index=['aa','bb','cc','dd','ee']).T.to_numpy()
y = pd.DataFrame({'A':[9,4,1,3,5],'B':[9,8,9,5,7],'C':[1,1,3,1,2]},index=['aa','bb','cc','dd','ee']).T.to_numpy()
n = np.corrcoef(x,y)
n = n.tolist()
n = [element for sub in n for element in sub]
# Rounding to ensure no duplicates are being picked up.
rnd = [round(num, 13) for num in n]
X = [i for i in rnd if i != 1]
X = list(dict.fromkeys(X))
X
[-0.3231828652987,
0.3157400783243,
-0.232972779074,
-0.7133922984085,
-0.0468292905791,
0.3196502842345,
0.6014868821052,
0.6628489803599,
0.2965401263095,
0.608993434846,
0.1631095635753,
-0.4213976904463,
0.2417468892076,
-0.5841782301194,
0.3674842076296]
There are 6 extra values not accounted for. I'm assuming that they are correlation values calculated within a single matrix, and if so, why? Is there a way to use this function without generating these additional values?
You are right in assuming that those are the correlations from variables within x and y, and so far as I can tell there is no way to turn this behaviour off.
You can see that this is true by looking at the implementation of numpy.corrcoef. As expected, most of the heavy lifting is being done by a separate function that computes covariance - if you look at the implementation of numpy.cov, particularly line 2639, you will see that, if you supply an additional y argument, this is simply concatenated onto x before computing the covariance matrix.
If necessary, it wouldn't be too hard to implement your own version of corrcoef that works how you want. Note that you can do this in pure numpy, which in most cases will be faster than the iterative approach from the example code above.
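As a minimal sketch of such a version (the name cross_corrcoef is my own; it standardizes the rows of each matrix and computes only the cross-covariance, so none of the intra-matrix values appear):

import numpy as np

def cross_corrcoef(x, y):
    # Pearson correlation between each row of x and each row of y only.
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xm = x - x.mean(axis=1, keepdims=True)
    ym = y - y.mean(axis=1, keepdims=True)
    cov = xm.dot(ym.T) / (x.shape[1] - 1)
    return cov / np.outer(xm.std(axis=1, ddof=1), ym.std(axis=1, ddof=1))

The (i, j) entry is then the correlation of row i of x with row j of y, matching the values produced by the loop-based table above.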

Efficiently adding two different sized one dimensional arrays

I want to add two numpy arrays of different sizes starting at a specific index. As I need to do this couple of thousand times with large arrays, this needs to be efficient, and I am not sure how to do this efficiently without iterating through each cell.
a = [5, 10, 15]
b = [0, 0, 10, 10, 10, 0, 0]
res = add_arrays(b, a, 2)
print(res)  # => [0, 0, 15, 20, 25, 0, 0]
naive approach:
# b is the bigger array
def add_arrays(b, a, i):
    for j in range(len(a)):
        b[i+j] += a[j]
You might assign the smaller array into a zeros array and then add; I would do it the following way:
import numpy as np
a = np.array([5,10,15])
b = np.array([0,0,10,10,10,0,0])
z = np.zeros(b.shape,dtype=int)
z[2:2+len(a)] = a # 2 is offset
res = z+b
print(res)
output
[ 0 0 15 20 25 0 0]
Disclaimer: I assume that offset + len(a) is always less than or equal to len(b).
Nothing wrong with your approach. You cannot get better asymptotic time or space complexity. If you want to reduce code lines (which is not an end in itself), you could use slice assignment and some other utils:
def add_arrays(b, a, i):
    b[i:i+len(a)] = map(sum, zip(b[i:i+len(a)], a))
But the functional overhead probably makes this less performant, if anything.
Some docs: map, sum, zip.
It should be faster than Daweo's answer, 1.5-5x (depending on the size ratio between a and b):
result = b.copy()
result[offset: offset+len(a)] += a
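Wrapped into a function for reuse (a sketch; like the disclaimer above, it assumes offset + len(a) <= len(b)):

import numpy as np

def add_arrays(b, a, offset):
    # Out-of-place: copy b, then add a onto the matching slice.
    result = b.copy()
    result[offset:offset + len(a)] += a
    return result

print(add_arrays(np.array([0, 0, 10, 10, 10, 0, 0]), np.array([5, 10, 15]), 2))
# [ 0  0 15 20 25  0  0]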

Is there a way to get every element of a list without using loops?

I found this task in a book by my professor:
def f(x):
    return log(1 + exp(x))

def problem(M: List):
    return np.array([f(x) for x in M])
How do I implement a solution?
Numpy is all about performing operations on entire arrays. Your professor is expecting you to use that functionality.
Start by converting your list M into array z:
z = np.array(M)
Now you can do elementwise operations like exp and log:
e = np.exp(z)
f = 1 + e
g = np.log(f)
The functions np.exp and np.log are applied to each element of an array. If the input is not an array, it will be converted into one.
Operations like 1 + e work on an entire array as well, in this case using the magic of broadcasting. Since 1 is a scalar, it can be unambiguously expanded to the same shape as e and added as if by np.add.
Normally, the sequence of operations can be compacted into a single line, similar to what you did in your initial attempt. You can reduce the number of operations slightly by using np.log1p:
def f(x):
    return np.log1p(np.exp(x))
Notice that I did not convert x to an array first since np.exp will do that for you.
A fundamental problem with this naive approach is that np.exp will overflow for values that we would expect to get reasonable results. This can be solved using the technique in this answer:
def f(x):
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)
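A quick check of why the rewrite matters: the naive version overflows for large positive inputs, while the stable one does not.

import numpy as np

def f(x):
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0)

x = np.array([-1000.0, 0.0, 1000.0])
print(np.log1p(np.exp(x)))  # [0.  0.6931...  inf] plus an overflow warning
print(f(x))                 # [0.  0.6931...  1000.] -- stable everywhere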

Having trouble with the variant of the "Two Sum" coding challenge?

The Two Sum problem seeks to find two elements x and y such that x + y == target. This can be implemented using a brute-force approach:
for x in arr:
    for y in arr:
        if x + y == target:
            return [x, y]
We are doing some redundant computation in the for loop -- we only want to consider combinations of two elements. We can do an N-choose-2 dual loop as follows:
for i, x in enumerate(arr):
    for y in arr[i+1:]:
        if x + y == target:
            return [x, y]
And we save a large constant factor of time complexity. Now note that the innermost loop is a search. We can use either a hash lookup or a binary search:
seen = set()
for i, x in enumerate(arr):
    if target - x in seen:
        y = target - x
        return [x, y]
    seen.add(x)
Note that seen only ever has length i, and the return will only trigger when we hit the second number of a pair (because its complement must already be in the set).
A variant of this problem is to find elements that satisfy x - y == target. It's a simple variant, but it adds a bit of logical complexity to the problem.
My question is: why does the following not work? After all, we're just modifying the previous code.
seen = set()
for i, x in enumerate(arr):
    for x-target in seen:
        y = x-target
        return [x,y]
    seen.add(x)
I've asked a friend, however I didn't understand him. He said that subtraction isn't associative. We're exploiting the associative property of addition in the two sum problem to achieve the constant time improvement. But that's all he told me. I don't get it to be honest. I still think my code should work. Can someone tell me why my code doesn't work?
Your algorithm (once the if/for mixup is fixed) still doesn't work because subtraction is not commutative. The algorithm only effectively checks x,y pairs where x comes later in the array than y. That's OK when it's testing x+y = target, since it doesn't matter which order the two values are in. But for x-y = target, the order does matter, since x - y is not the same thing as y - x.
A fix for this would be to check each number in the array to see if it could be either x or y with the other value being one of the earlier values from arr. There needs to be a different check for each, so you probably need two if statements inside the loop:
seen = set()
for n in arr:
    if n - target in seen:
        x = n
        y = n - target
        return [x, y]
    if n + target in seen:
        x = n + target
        y = n
        return [x, y]
    seen.add(n)
Note that I renamed the loop variable to n, since it could be either x or y depending on how the math worked out. It's not strictly necessary to use x and y variables in the bodies of the if statements, you could do those computations directly in the return statement. I also dropped the unneeded enumerate call, since the single-loop versions of the code don't use i at all.
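As a runnable sketch of the fixed algorithm, with a made-up input for illustration (the function name is mine):

def find_pair_with_difference(arr, target):
    # Returns [x, y] with x - y == target, or None if no such pair exists.
    seen = set()
    for n in arr:
        if n - target in seen:
            return [n, n - target]  # n plays the role of x
        if n + target in seen:
            return [n + target, n]  # n plays the role of y
        seen.add(n)
    return None

print(find_pair_with_difference([8, 3, 5, 2], 3))  # [8, 5], since 8 - 5 == 3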

Using nested for loops to evaluate function at every coordinate - python

I am trying to use a nested for loop to evaluate function f(r) at every combination of 2D polar coordinates.
My current code is:
r = np.linspace(0, 5, 10)
phi = np.linespace(0, 2*np.pi, 10)
for i in r:
    for j in phi:
        X = f(r) * np.cos(phi)
print X
When I run this as is, it returns X as a 1D array of f(r), and cos(phi) is 1 (i.e. phi = 0). However, I want f(r) at every value of r, multiplied by its corresponding phi value. This would be a 2D array (10 by 10) whereby every combination of r and phi is evaluated.
If you have any suggestions about possible efficiencies I would appreciate it as eventually I will be running this with a resolution much greater than 10 (maybe as high as 10,000) and looping it many thousands of times.
It's np.linspace, not np.linespace.
You're not iterating over the actual data at all: i and j are never used in your inner loop. Right now, you're executing the same statement with the same arguments and the same result over and over again. You could have executed the inner statement just once and X would be the same.
You need to preallocate a 2D array for the result.
Iterate over both axes, grab the items of r and phi, do your calculation, and put it into the right field of your output array.
The first point makes me think you never even tried to run your code, as it gives an obvious error message. Anyway, here is a solution:
def f(x):
    return x

r = np.linspace(0, 5, 10)
phi = np.linspace(0, 2*np.pi, 10)
X = np.zeros((len(r), len(phi)))
for i in xrange(len(r)):
    for j in xrange(len(phi)):
        X[i,j] = f(r[i]) * np.cos(phi[j])
print X
P.S. Don't go with solutions that put the results into ordinary lists; stick with numpy arrays for mathematical problems.
When I run this as is it returns X as a 1D array of f(r) and cos(phi) is 1 (ie. phi = 0).
That's because you don't store, print, or do anything else with all those intermediate values you generate. All you do is rebind X to the latest value over and over (forgetting whatever used to be in X), so at the end, it's bound to the last one.
If you want all of those X values, you have to actually store them in some way. For example:
Xs = []
for i in r:
    for j in phi:
        X = f(r) * np.cos(phi)
        Xs.append(X)
Now, you'll have a list of each X instead of just the last one.
Of course in general, the whole reason you use NumPy is to "vectorize" operations instead of looping in the first place. (This allows the loops and the arithmetic to be done in C/C++/Fortran/whatever instead of Python, typically making them an order of magnitude or so faster.) So, if you can rewrite your code to create a 2D array from your 1D array by broadcasting, instead of creating a list of 1D arrays, it will be better…
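For instance, a sketch of that broadcasting rewrite (assuming f is vectorized, like the identity f from the answer above):

r = np.linspace(0, 5, 10)
phi = np.linspace(0, 2*np.pi, 10)
# Shape (10, 1) times shape (10,) broadcasts to shape (10, 10):
X = f(r)[:, np.newaxis] * np.cos(phi)
print(X.shape)  # (10, 10)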
Instead of assigning to X, you should be adding the most recent 1D array to it. (You'll need to initialize X before the outer loop.)
Edit: You don't need the inner loop; it computes across all of phi (which is why you get the 1D array) repeatedly. (Note how it doesn't use j.) Removing it will speed things up a bit AND get you the correct answer!
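A minimal sketch of what this answer seems to suggest (one 1D array per value of r, stacked onto X, with no inner loop; f and phi as in the answer above):

X = np.empty((0, len(phi)))       # initialize before the outer loop
for r_i in r:
    row = f(r_i) * np.cos(phi)    # a full 1D array across phi
    X = np.vstack([X, row])       # add the most recent 1D array to X
print(X.shape)  # (10, 10)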
