A small intro to my question.
I want to plot a sound file with numpy in dB. Just doing 20 * np.log10(arr) doesn't work because of negative numbers.
So I was looking into np.where(). Because where is also part of the ufunc machinery (the where= keyword argument), I wanted to do it that way (easier and more readable than indexing with brackets).
I ran into some difficulties with it (unexpected results in a small test with random numbers), so I updated to the latest version of numpy (it was 1.18.x and is now 1.19.1).
I also updated Spyder to 4.1.4.
Then I ran the following steps in the console to check, adding comments to some of them.
In [1]: import numpy as np
In [2]: a = np.round(np.random.rand(10) * 3, 0)
In [3]: a
Out[3]: array([2., 1., 2., 1., 1., 3., 2., 0., 2., 1.])
In [4]: tf = np.where(a==2, True, False)
In [5]: b = np.power(a,3, where=np.where(a==2, True, False))
In [6]: b
Out[6]:
array([8. , 1.42180731, 8. , 1.31799691, 1.01436297,
2.82985094, 8. , 0.35036821, 8. , 0.73520376])
In [7]: np.power(a,3, where=tf)
Out[7]: array([8., 1., 8., 1., 1., 3., 8., 0., 8., 1.])
In[8]: np.power(a,3, where=np.where(a==2, True, False))
Out[8]: array([8., 1., 8., 1., 1., 3., 8., 0., 8., 1.])
So when just evaluating in the console it looks fine, but when I assign the result to a variable, the values that should have been left untouched come out messed up.
In[9]: b=np.power(a,3, where=tf)
In[10]: b
Out[10]:
array([8. , 1.42180731, 8. , 1.31799691, 1.01436297,
2.82985094, 8. , 0.35036821, 8. , 0.73520376])
In[11]: np.log10(a, where=np.where(a>2, True, False))
Out[11]:
array([8. , 1.42180731, 8. , 1.31799691, 1.01436297,
0.47712125, 8. , 0.35036821, 8. , 0.73520376])
Because of the weird results I checked whether a was still right:
In[12]: a # Check if a still right
Out[12]: array([2., 1., 2., 1., 1., 3., 2., 0., 2., 1.])
Looks like it is. So I tried np.log10() again. Check what happens:
In[13]: np.log10(a, where=np.where(a>2, True, False))
Out[13]:
array([2. , 1. , 2. , 1. , 1. ,
0.47712125, 2. , 0. , 2. , 1. ])
In[14]: c = np.log10(a, where=np.where(a>2, True, False))
In[15]: c
Out[15]:
array([2. , 1. , 2. , 1. , 1. ,
0.47712125, 2. , 0. , 2. , 1. ])
Somehow with c (log10) everything works / looks fine. So I removed b (I use Spyder, so I deleted it from the Variable Explorer).
Then I recreated b:
In[16]: b = np.power(a, 3, where=np.where(a==2, True, False))
In[17]: b
Out[17]: array([8., 1., 8., 1., 1., 3., 8., 0., 8., 1.])
In[18]: b1 = np.power(a, 3, where=tf)
In[19]: b1
Out[19]:
array([8.00000000e+000, 1.82804289e-322, 8.00000000e+000, 0.00000000e+000,
0.00000000e+000, 6.52741159e-038, 8.00000000e+000, 7.63251534e+169,
8.00000000e+000, 1.23967276e+224])
So I can't follow why this is the case. Did I do something wrong? (If yes, please explain.) Is this a bug in NumPy?
Edit: I see this on multiple laptops, so I created a mini script that should reproduce it on your PC / laptop. One addition before the script: I use Anaconda on all my machines, in case that matters.
a = np.round(np.random.rand(10) * 3, 0)
tf = np.where(a==2, True, False)
b = np.power(a, 3, where=np.where(a==2, True, False))
b1 = np.power(a, 3, where=tf)
c = np.log10(a, where=np.where(a>2, True, False))
bits = 16
linarr = np.arange(2 ** bits) - 2 ** (bits - 1)
logarr = np.copy(linarr)
logarr = 20 * np.log10(logarr, where=np.where(linarr > 0, True, False))
I would expect that, at least for logarr, I would get something along the lines of array([-32768, -32767, -32766, ..., 0, ..., 90.3085, 90.3087]), but instead I get array([1.76e-314, 1.72e-314, 2.12e-312, ..., 0, ..., 90.3085, 90.3087]).
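For reference, a sketch of what I assume is the way to get a deterministic result: pass an explicitly initialized out array, so the entries skipped by where keep a defined value instead of leftover memory.
import numpy as np
bits = 16
linarr = np.arange(2 ** bits) - 2 ** (bits - 1)
logarr = np.zeros(linarr.shape)                    # defined values for the skipped entries
np.log10(linarr, out=logarr, where=linarr > 0)     # only computes where linarr > 0
logarr *= 20                                       # dB scale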
I'm trying to use numpy.multiply.outer on multidimensional arrays, and I really need it to assume that any 0 * infinity it sees evaluates to zero. How can I do this efficiently?
>>> import numpy
>>> numpy.multiply.outer([0.], [float('inf')])
Warning (from warnings module):
File "__main__", line 2
RuntimeWarning: invalid value encountered in multiply
array([[ nan]])
Do you need to worry about other sources of nan values? If not, you could always just fix them up in a separate step:
import numpy as np
r = np.multiply.outer([0.], [float('inf')])
np.where(np.isnan(r), 0, r)
It's up to you whether you also want to suppress the warnings.
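For example, a small sketch using np.errstate to silence only that operation (the nan fix-up stays the same):
import numpy as np
with np.errstate(invalid='ignore'):        # suppress the 'invalid value' RuntimeWarning locally
    r = np.multiply.outer([0.], [float('inf')])
r = np.where(np.isnan(r), 0, r)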
One solution could be to avoid np.multiply.outer and instead do element-wise multiplication on broadcast matrices that have already been checked for the condition of interest (zero in one array, inf in the other).
import numpy as np
A = np.array([0., 0., 0.4, 2])
B = np.array([float('inf'), 1., 3.4, np.inf])
# Conditions of interest
c1 = (A == 0)
c2 = (B == np.inf)
condition1 = np.multiply.outer(c1, c2)
c3 = (A == np.inf)
c4 = (B == 0)
condition2 = np.multiply.outer(c3, c4)
condition = condition1 | condition2
AA = np.multiply.outer(A, np.ones(B.shape))
BB = np.multiply.outer(np.ones(A.shape), B)
AA[condition] = 0.
BB[condition] = 0.
AA*BB
This may not pass the 'efficiency' request of the poster, however.
Here's how to suppress the warnings (see also: mean, nanmean and warning: Mean of empty slice):
In [528]: import warnings
In [530]: x = np.array([0,1,2],float)
In [531]: y = np.array([np.inf,3,2],float)
In [532]: np.outer(x,y)
/usr/local/lib/python3.5/dist-packages/numpy/core/numeric.py:1093: RuntimeWarning: invalid value encountered in multiply
return multiply(a.ravel()[:, newaxis], b.ravel()[newaxis,:], out)
Out[532]:
array([[ nan, 0., 0.],
[ inf, 3., 2.],
[ inf, 6., 4.]])
In [535]: with warnings.catch_warnings():
...: warnings.simplefilter('ignore',category=RuntimeWarning)
...: z = np.outer(x,y)
...:
In [536]: z
Out[536]:
array([[ nan, 0., 0.],
[ inf, 3., 2.],
[ inf, 6., 4.]])
replace the nan with 1:
In [542]: z[np.isnan(z)]=1
In [543]: z
Out[543]:
array([[ 1., 0., 0.],
[ inf, 3., 2.],
[ inf, 6., 4.]])
In [547]: z[np.isinf(z)]=9999
In [548]: z
Out[548]:
array([[ 1.00000000e+00, 0.00000000e+00, 0.00000000e+00],
[ 9.99900000e+03, 3.00000000e+00, 2.00000000e+00],
[ 9.99900000e+03, 6.00000000e+00, 4.00000000e+00]])
=================
We could create a mask using the kind of testing that @P-robot demonstrates:
In [570]: np.outer(np.isinf(x),y==0)|np.outer(x==0,np.isinf(y))
Out[570]:
array([[ True, False, False],
[False, False, False],
[False, False, False]], dtype=bool)
In [571]: mask=np.outer(np.isinf(x),y==0)|np.outer(x==0,np.isinf(y))
In [572]: with warnings.catch_warnings():
...: warnings.simplefilter('ignore',category=RuntimeWarning)
...: z = np.outer(x,y)
...:
In [573]: z[mask]=1
In [574]: z
Out[574]:
array([[ 1., 0., 0.],
[ inf, 3., 2.],
[ inf, 6., 4.]])
Or with messier inputs:
In [587]: x = np.array([0,1,2,np.inf],float)
In [588]: y = np.array([np.inf,3,np.nan,0],float)
In [589]: mask=np.outer(np.isinf(x),y==0)|np.outer(x==0,np.isinf(y))
...
In [591]: with warnings.catch_warnings():
...: warnings.simplefilter('ignore',category=RuntimeWarning)
...: z = np.outer(x,y)
...:
In [592]: z[mask]=1
In [593]: z
Out[593]:
array([[ 1., 0., nan, 0.],
[ inf, 3., nan, 0.],
[ inf, 6., nan, 0.],
[ inf, inf, nan, 1.]])
While I agree with @ShadowRanger's answer, a cheap hack could be to take advantage of np.nan_to_num, which replaces inf with a very large finite number, so that inf * 0 evaluates to 0.
To convert unwanted remaining large finite numbers back to inf (given some other operations beyond your question), you can multiply the large number by anything > 1 and then divide by the same amount (so as not to affect other numbers). E.g.:
In [1]: np.nan_to_num(np.inf)
Out[1]: 1.7976931348623157e+308
In [2]: np.nan_to_num(np.inf)*1.1
RuntimeWarning: overflow encountered in double_scalars
Out[2]: inf
In [3]: np.nan_to_num(np.inf)*1.1/1.1
RuntimeWarning: overflow encountered in double_scalars
Out[3]: inf
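Applied to the outer-product example from the question, the idea would look roughly like this (my sketch):
import numpy as np
# inf is squashed to a huge finite number first, so 0 * (huge finite) gives 0 instead of nan
r = np.multiply.outer(np.array([0.]), np.nan_to_num(np.array([np.inf])))
# r -> array([[0.]])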
Before the flood of downvotes, this is clearly not a best practice, and can potentially have side effects depending on your use case, but just thought I'd throw an alternative out there.
Since collections.Counter is so slow, I am pursuing a faster method of summing mapped values in Python 2.7. It seems like a simple concept and I'm kind of disappointed in the built-in Counter method.
Basically, I need to be able to take arrays like this:
array([[ 0., 2.],
[ 2., 2.],
[ 3., 1.]])
array([[ 0., 3.],
[ 1., 1.],
[ 2., 5.]])
And then "add" them so they look like this:
array([[ 0., 5.],
[ 1., 1.],
[ 2., 7.],
[ 3., 1.]])
If there isn't a good way to do this quickly and efficiently, I'm open to any other ideas that will allow me to do something similar to this, and I'm open to modules other than Numpy.
Thanks!
Edit: Ready for some speedtests?
Intel win 64bit machine. All of the following values are in seconds; 20000 loops.
collections.Counter results:
2.131000, 2.125000, 2.125000
Divakar's union1d + masking results:
1.641000, 1.633000, 1.625000
Divakar's union1d + indexing results:
0.625000, 0.625000, 0.641000
Histogram results:
1.844000, 1.938000, 1.858000
Pandas results:
16.659000, 16.686000, 16.885000
Conclusions: union1d + indexing wins, the array size is too small for Pandas to be effective, and the histogram approach blew my mind with its simplicity but I'm guessing it takes too much overhead to create. All of the responses I received were very good, though. This is what I used to get the numbers. Thanks again!
Edit: And it should be mentioned that using Counter1.update(Counter2.elements()) is terrible despite doing the same exact thing (65.671000 sec).
Later Edit: I've been thinking about this a lot, and I've come to realize that, with Numpy, it might be more effective to zero-fill each array so that the first column isn't even needed, since we can just use the index; that would also make it much easier to add multiple arrays together as well as apply other functions. Additionally, Pandas makes more sense than Numpy since there would be no need to zero-fill, and it would definitely be more effective with large data sets (however, Numpy has the advantage of being compatible with more platforms, like GAE, if that matters at all). Lastly, the answer I accepted was definitely the best answer for the exact question I asked, adding the two arrays in the way I showed, but I think what I needed was a change in perspective.
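A rough sketch of that zero-filled idea (my own illustration, using the example data from above):
import numpy as np
num_fields = 4                      # assumed upper bound on the keys
dense_a = np.zeros(num_fields)
dense_a[[0, 2, 3]] = [2, 2, 1]      # first example array, keyed by position
dense_b = np.zeros(num_fields)
dense_b[[0, 1, 2]] = [3, 1, 5]      # second example array
total = dense_a + dense_b           # -> array([5., 1., 7., 1.])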
Here's one approach with np.union1d and masking -
def app1(a,b):
    c0 = np.union1d(a[:,0],b[:,0])
    out = np.zeros((len(c0),2))
    out[:,0] = c0
    mask1 = np.in1d(c0,a[:,0])
    out[mask1,1] = a[:,1]
    mask2 = np.in1d(c0,b[:,0])
    out[mask2,1] += b[:,1]
    return out
Sample run -
In [174]: a
Out[174]:
array([[ 0., 2.],
[ 12., 2.],
[ 23., 1.]])
In [175]: b
Out[175]:
array([[ 0., 3.],
[ 1., 1.],
[ 12., 5.]])
In [176]: app1(a,b)
Out[176]:
array([[ 0., 5.],
[ 1., 1.],
[ 12., 7.],
[ 23., 1.]])
Here's another with np.union1d and indexing -
def app2(a,b):
    n = np.maximum(a[:,0].max(), b[:,0].max())+1
    c0 = np.union1d(a[:,0],b[:,0])
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]
    out0[b[:,0].astype(int),1] += b[:,1]
    out = out0[c0.astype(int)]
    out[:,0] = c0
    return out
For the case where all indices are covered by the first column values in a and b -
def app2_specific(a,b):
    c0 = np.union1d(a[:,0],b[:,0])
    n = c0[-1]+1
    out0 = np.zeros((int(n), 2))
    out0[a[:,0].astype(int),1] = a[:,1]
    out0[b[:,0].astype(int),1] += b[:,1]
    out0[:,0] = c0
    return out0
Sample run -
In [234]: a
Out[234]:
array([[ 0., 2.],
[ 2., 2.],
[ 3., 1.]])
In [235]: b
Out[235]:
array([[ 0., 3.],
[ 1., 1.],
[ 2., 5.]])
In [236]: app2_specific(a,b)
Out[236]:
array([[ 0., 5.],
[ 1., 1.],
[ 2., 7.],
[ 3., 1.]])
If you know the number of fields, use np.bincount.
c = np.vstack([a, b])
counts = np.bincount(c[:, 0].astype(int), weights=c[:, 1], minlength=numFields)
out = np.vstack([np.arange(numFields), counts]).T
This works if you're getting all your data at once. Make a list of your arrays and vstack them. If you're getting data chunks sequentially, you can use np.add.at to do the same thing.
out = np.zeros((numFields, 2))
out[:, 0] = np.arange(numFields)
np.add.at(out[:, 1], a[:, 0].astype(int), a[:, 1])
np.add.at(out[:, 1], b[:, 0].astype(int), b[:, 1])
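A quick check with the arrays from the question (assuming numFields = 4):
import numpy as np
a = np.array([[0., 2.], [2., 2.], [3., 1.]])
b = np.array([[0., 3.], [1., 1.], [2., 5.]])
numFields = 4
c = np.vstack([a, b])
counts = np.bincount(c[:, 0].astype(int), weights=c[:, 1], minlength=numFields)
out = np.vstack([np.arange(numFields), counts]).T
# out -> [[0., 5.], [1., 1.], [2., 7.], [3., 1.]]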
You can use a basic histogram; this will deal with gaps, too. You can filter out zero-count entries if need be.
import numpy as np
x = np.array([[ 0., 2.],
[ 2., 2.],
[ 3., 1.]])
y = np.array([[ 0., 3.],
[ 1., 1.],
[ 2., 5.],
[ 5., 3.]])
c, w = np.vstack((x,y)).T
h, b = np.histogram(c, weights=w,
bins=np.arange(c.min(),c.max()+2))
r = np.vstack((b[:-1], h)).T
print(r)
# [[ 0. 5.]
# [ 1. 1.]
# [ 2. 7.]
# [ 3. 1.]
# [ 4. 0.]
# [ 5. 3.]]
r_nonzero = r[r[:,1]!=0]
Pandas has functions doing exactly what you intend:
import pandas as pd
pda = pd.DataFrame(a).set_index(0)
pdb = pd.DataFrame(b).set_index(0)
result = pd.concat([pda, pdb], axis=1).fillna(0).sum(axis=1)
Edit: If you actually need the data back in numpy format, just do
array_res = result.reset_index(name=1).values
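For the sample arrays in the question, result should come out with one summed value per key (0: 5, 1: 1, 2: 7, 3: 1), and array_res as [[0., 5.], [1., 1.], [2., 7.], [3., 1.]].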
This is a quintessential grouping problem, which numpy_indexed (disclaimer: I am its author) was created to solve elegantly and efficiently:
import numpy_indexed as npi
C = np.concatenate([A, B], axis=0)
labels, sums = npi.group_by(C[:, 0]).sum(C[:, 1])
Note: it's cleaner to maintain your label arrays as a separate int array; floats are finicky when it comes to labeling things, with positive and negative zeros, and printed values not conveying all the binary state. Better to use ints for that.
I think I've misunderstood something with indexing in numpy.
I have a 3D-numpy array of shape (dim_x, dim_y, dim_z) and I want to find the maximum along the third axis (dim_z), and set its value to 1 and all the others to zero.
The problem is that I end up with several 1 in the same row, even if values are different.
Here is the code:
>>> test = np.random.rand(2,3,2)
>>> test
array([[[ 0.13110146, 0.07138861],
[ 0.84444158, 0.35296986],
[ 0.97414498, 0.63728852]],
[[ 0.61301975, 0.02313646],
[ 0.14251848, 0.91090492],
[ 0.14217992, 0.41549218]]])
>>> result = np.zeros_like(test)
>>> result[:test.shape[0], np.arange(test.shape[1]), np.argmax(test, axis=2)]=1
>>> result
array([[[ 1., 0.],
[ 1., 1.],
[ 1., 1.]],
[[ 1., 0.],
[ 1., 1.],
[ 1., 1.]]])
I was expecting to end up with:
array([[[ 1., 0.],
[ 1., 0.],
[ 1., 0.]],
[[ 1., 0.],
[ 0., 1.],
[ 0., 1.]]])
Probably I'm missing something here. From what I've understood, 0:dim_x, np.arange(dim_y) gives dim_x repeats of the dim_y indices, and np.argmax(test, axis=2) has shape (dim_x, dim_y), so if the indexing is of the form [x, y, z], a pair [x, y] is not supposed to appear twice.
Could someone explain to me where I'm wrong? Thanks in advance.
What we are looking for
We get the argmax indices along the last axis -
idx = np.argmax(test, axis=2)
For the given sample data, we have idx :
array([[0, 0, 0],
[0, 1, 1]])
Now, idx covers the first and second axes, while getting those argmax indices.
To assign the corresponding ones in the output, we need to create range arrays for the first two axes covering the lengths along those and aligned according to the shape of idx. Now, idx is a 2D array of shape (m,n), where m = test.shape[0] and n = test.shape[1].
Thus, the range arrays for assignment into first two axes of output must be -
X = np.arange(test.shape[0])[:,None]
Y = np.arange(test.shape[1])
Notice that the extension of the first range array to 2D is needed to have it aligned against the rows of idx, while Y would align against the cols of idx -
In [239]: X
Out[239]:
array([[0],
[1]])
In [240]: Y
Out[240]: array([0, 1, 2])
Schematically put -
idx :
          Y array
         --------->
       x x x  |  X array
       x x x  |
              v
The fault in the original code
Your code was -
result[:test.shape[0], np.arange(test.shape[1]), ..
This is essentially:
result[:, np.arange(test.shape[1]), ...
So, you are selecting all elements along the first axis, instead of only the ones that correspond to the idx indices. In the process, you were selecting many more elements than required for the assignment, and hence you were seeing many more 1s than expected in the result array.
The correction
Thus, the only correction needed was indexing into the first axis with the range array and a working solution would be -
result[np.arange(test.shape[0])[:,None], np.arange(test.shape[1]), ...
The alternative(s)
Alternatively, using the range arrays created earlier with X and Y -
result[X,Y,idx] = 1
Another way to get X,Y would be with np.ogrid -
m,n = test.shape[:2]
X,Y = np.ogrid[:m,:n]
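Another option, not from the original answer and assuming NumPy 1.15 or newer, would be np.put_along_axis, which writes values at per-position indices along one axis:
import numpy as np
test = np.random.rand(2, 3, 2)
idx = np.argmax(test, axis=2)
result = np.zeros_like(test)
np.put_along_axis(result, idx[..., None], 1, axis=2)   # idx[..., None] broadcasts along axis 2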
I think there's a problem with mixing basic (slice) and advanced indexing. It's easier to see when selecting values from an array than with this assignment, but it can result in transposed axes. For a problem like this it is better to use advanced indexing all around, as provided by np.ix_:
In [24]: test = np.random.rand(2,3,2)
In [25]: idx=np.argmax(test,axis=2)
In [26]: idx
Out[26]:
array([[1, 0, 1],
[0, 1, 1]], dtype=int32)
with basic and advanced:
In [31]: res1 = np.zeros_like(test)
In [32]: res1[:, np.arange(test.shape[1]), idx]=1
In [33]: res1
Out[33]:
array([[[ 1., 1.],
[ 1., 1.],
[ 0., 1.]],
[[ 1., 1.],
[ 1., 1.],
[ 0., 1.]]])
with advanced:
In [35]: I,J = np.ix_(range(test.shape[0]), range(test.shape[1]))
In [36]: I
Out[36]:
array([[0],
[1]])
In [37]: J
Out[37]: array([[0, 1, 2]])
In [38]: res2 = np.zeros_like(test)
In [40]: res2[I, J , idx]=1
In [41]: res2
Out[41]:
array([[[ 0., 1.],
[ 1., 0.],
[ 0., 1.]],
[[ 1., 0.],
[ 0., 1.],
[ 0., 1.]]])
On further thought, the use of the slice for the 1st dimension is just wrong, if the goal is to set or find the 6 argmax values:
In [54]: test
Out[54]:
array([[[ 0.15288242, 0.36013289],
[ 0.90794601, 0.15265616],
[ 0.34014976, 0.53804266]],
[[ 0.97979479, 0.15898605],
[ 0.04933804, 0.89804999],
[ 0.10199319, 0.76170911]]])
In [55]: test[I, J, idx]
Out[55]:
array([[ 0.36013289, 0.90794601, 0.53804266],
[ 0.97979479, 0.89804999, 0.76170911]])
In [56]: test[:, J, idx]
Out[56]:
array([[[ 0.36013289, 0.90794601, 0.53804266],
[ 0.15288242, 0.15265616, 0.53804266]],
[[ 0.15898605, 0.04933804, 0.76170911],
[ 0.97979479, 0.89804999, 0.76170911]]])
With the slice it selects a (2,2,3) set of values from test (or res), not the intended (2,3). That's 2 extra rows of values.
Here is an easier way to do it:
>>> test == test.max(axis=2, keepdims=1)
array([[[ True, False],
[ True, False],
[ True, False]],
[[ True, False],
[False, True],
[False, True]]], dtype=bool)
...and if you really want that as floating-point 1.0 and 0.0, then convert it:
>>> (test==test.max(axis=2, keepdims=1)).astype(float)
array([[[ 1., 0.],
[ 1., 0.],
[ 1., 0.]],
[[ 1., 0.],
[ 0., 1.],
[ 0., 1.]]])
Here is a way to do it with only one winner per row-column combo (i.e. no ties, as discussed in comments):
rowmesh, colmesh = np.meshgrid(range(test.shape[0]), range(test.shape[1]), indexing='ij')
maxloc = np.argmax(test, axis=2)
flatind = np.ravel_multi_index( [rowmesh, colmesh, maxloc ], test.shape )
result = np.zeros_like(test)
result.flat[flatind] = 1
UPDATE after reading hpaulj's answer:
rowmesh, colmesh = np.ix_(range(test.shape[0]), range(test.shape[1]))
is a more efficient, more numpythonic alternative to my meshgrid call (the rest of the code stays the same).
The issue of why your approach fails is hard to explain, but here's one place where intuition could start: your slicing approach says "all rows, times all columns, times a certain sequence of layers". How many elements is that slice in total? By contrast, how many elements do you actually want to set to 1? It can be instructive to look at the values you get when you view the corresponding test values of the slice you're trying to assign to:
>>> test[:, :, maxloc].shape
(2, 3, 2, 3) # oops! it's because maxloc itself is 2x3
>>> test[:, :, maxloc]
array([[[[ 0.13110146, 0.13110146, 0.13110146],
[ 0.13110146, 0.07138861, 0.07138861]],
[[ 0.84444158, 0.84444158, 0.84444158],
[ 0.84444158, 0.35296986, 0.35296986]],
[[ 0.97414498, 0.97414498, 0.97414498],
[ 0.97414498, 0.63728852, 0.63728852]]],
[[[ 0.61301975, 0.61301975, 0.61301975],
[ 0.61301975, 0.02313646, 0.02313646]],
[[ 0.14251848, 0.14251848, 0.14251848],
[ 0.14251848, 0.91090492, 0.91090492]],
[[ 0.14217992, 0.14217992, 0.14217992],
[ 0.14217992, 0.41549218, 0.41549218]]]]) # note the repetition, because in maxloc you're repeatedly asking for layer 0 sometimes, and sometimes repeatedly for layer 1
I have a number of time series, each containing measurements across weeks of the year, but not all of them start and end on the same weeks. I know the offsets, that is I know in what weeks each one starts and ends. Now I would like to combine them into a matrix respecting the inherent offsets, such that all values will align with the correct week numbers.
If the horizontal direction contains the series and vertical direction represents the weeks, given two series a and b, where values correspond to week numbers:
a = np.array([[1,2,3,4,5,6]])
b = np.array([[0,1,2,3,4,5]])
I want to know if it is possible to combine them, e.g. using some method that takes an offset argument, in a fashion like combine((a, b), axis=0, offset=-1), such that the resulting array (let's call it c) looks like this:
print c
[[NaN 1 2 3 4 5 6 ]
[0 1 2 3 4 5 NaN]]
What's more, since the time series are enormous, I must stream them through my program and therefore cannot know all the offsets at the same time. I thought of using Pandas because it has nice indexing, but I felt there had to be a simpler way, since the essence of what I'm trying to do is super simple.
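(For reference, the pandas version I had in mind would be roughly this: index each series by its week numbers and let concat align them.)
import pandas as pd
a = pd.Series([1, 2, 3, 4, 5, 6], index=range(1, 7))   # starts one week later
b = pd.Series([0, 1, 2, 3, 4, 5], index=range(0, 6))
c = pd.concat([a, b], axis=1).T.values
# c -> [[nan, 1., 2., 3., 4., 5., 6.],
#       [ 0., 1., 2., 3., 4., 5., nan]]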
Update:
This seems to work
def offset_stack(a, b, offset=0):
    if offset < 0:
        a = np.insert(a, [0] * abs(offset), np.nan)
        b = np.append(b, [np.nan] * abs(offset))
    if offset > 0:
        a = np.append(a, [np.nan] * abs(offset))
        b = np.insert(b, [0] * abs(offset), np.nan)
    return np.concatenate(([a],[b]), axis=0)
You can do it in numpy:
def f(a, b, n):
    v = np.empty(abs(n))*np.nan
    if np.sign(n)==-1:
        return np.vstack((np.append(a,v), np.append(v,b)))
    elif np.sign(n)==1:
        return np.vstack((np.append(v,a), np.append(b,v)))
    else:
        return np.vstack((a,b))
#In [148]: a = np.array([23, 13, 4, 12, 4, 4])
#In [149]: b = np.array([4, 12, 3, 41, 45, 6])
#In [150]: f(a,b,-2)
#Out[150]:
#array([[ 23., 13., 4., 12., 4., 4., nan, nan],
# [ nan, nan, 4., 12., 3., 41., 45., 6.]])
#In [151]: f(a,b,2)
#Out[151]:
#array([[ nan, nan, 23., 13., 4., 12., 4., 4.],
# [ 4., 12., 3., 41., 45., 6., nan, nan]])
#In [152]: f(a,b,0)
#Out[152]:
#array([[23, 13, 4, 12, 4, 4],
# [ 4, 12, 3, 41, 45, 6]])
There is a really simple way to accomplish this.
You basically want to pad and then stack your arrays, and for both there are numpy functions:
numpy.lib.pad() aka offset
a = np.array([[1,2,3,4,5,6]], dtype=np.float_) # float because NaN is a float value!
b = np.array([[0,1,2,3,4,5]], dtype=np.float_)
from numpy.lib import pad
print(pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan))
# [[ nan 1. 2. 3. 4. 5. 6.]]
print(pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan))
# [[ 0., 1., 2., 3., 4., 5., nan]]
The ((0,0),(1,0)) means no padding in the first axis (top/bottom), and in the second axis padding of one element on the left and none on the right. So you have to tweak these if you want more/less shift.
numpy.vstack() aka stack along axis=0
import numpy as np
a_padded = pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan)
b_padded = pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan)
np.vstack([a_padded, b_padded])
# array([[ nan, 1., 2., 3., 4., 5., 6.],
# [ 0., 1., 2., 3., 4., 5., nan]])
Your function:
Combining these two would be very easy and is easy to extend:
from numpy.lib import pad
import numpy as np
def offset_stack(a, b, axis=0, offsets=(0, 1)):
    if (len(offsets) != a.ndim) or (a.ndim != b.ndim):
        raise ValueError('Offsets and dimensions of the arrays do not match.')
    offset1 = [(0, -offset) if offset < 0 else (offset, 0) for offset in offsets]
    offset2 = [(-offset, 0) if offset < 0 else (0, offset) for offset in offsets]
    a_padded = pad(a, offset1, mode='constant', constant_values=np.nan)
    b_padded = pad(b, offset2, mode='constant', constant_values=np.nan)
    return np.concatenate([a_padded, b_padded], axis=axis)
offset_stack(a, b)
This function works for generalized offsets in arbitrary dimensions and can stack along arbitrary dimensions. It doesn't work quite the same way as the original, since here you pad the second dimension: just passing in offset=1 would pad the first dimension instead. But if you keep track of the dimensions of your arrays it should work fine.
For example:
offset_stack(a, b, offsets=(1,2))
array([[ nan, nan, nan, nan, nan, nan, nan, nan],
[ nan, nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan, nan],
[ nan, nan, nan, nan, nan, nan, nan, nan]])
or for 3d arrays:
a = np.array([1,2,3], dtype=np.float_)[None, :, None] # makes it 3d
b = np.array([0,1,2], dtype=np.float_)[None, :, None] # makes it 3d
offset_stack(a, b, offsets=(0,1,0), axis=2)
array([[[ nan, 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., nan]]])
pad and concatenate (and the various stack and inserts) create a target array of the right size, and fill values from the input arrays. So we can do the same, and potentially do it faster.
Just for example using your 2 arrays and the 1 step offset:
In [283]: a = np.array([[1,2,3,4,5,6]])
In [284]: b = np.array([[0,1,2,3,4,5]])
create the target array, and fill it with the pad value. np.nan is a float (even though a is int):
In [285]: m=a.shape[0]+b.shape[0]
In [286]: n=a.shape[1]+1
In [287]: c=np.zeros((m,n),float)
In [288]: c.fill(np.nan)
Now just copy values into the right places on the target. More arrays and offsets will require some generalization here.
In [289]: c[:a.shape[0],1:]=a
In [290]: c[-b.shape[0]:,:-1]=b
In [291]: c
Out[291]:
array([[ nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan]])
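The same idea extends to more than two arrays; one possible generalization (my sketch, assuming 1-D series and integer start offsets):
import numpy as np

def combine(series, offsets):
    # one row per series; columns span from the earliest start to the latest end
    start = min(offsets)
    stop = max(off + len(s) for s, off in zip(series, offsets))
    c = np.full((len(series), stop - start), np.nan)
    for row, (s, off) in enumerate(zip(series, offsets)):
        c[row, off - start:off - start + len(s)] = s
    return c

# combine([np.arange(1, 7), np.arange(6)], [1, 0]) reproduces the two-row result above.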