Upsampling using Numpy - python

I want to upsample a given 1d array by adding 'k-1' zeros between the elements for a given upsampling factor 'k'.
import numpy as np

k=2
A = np.array([1,2,3,4,5])
B = np.insert(A,np.arange(1,len(A)), values=np.zeros(k-1))
The above code works for k=2.
Output: [1 0 2 0 3 0 4 0 5]
k=3
A = np.array([1,2,3,4,5])
B = np.insert(A,np.arange(1,len(A)), values=np.zeros(k-1))
For k=3, it throws an error.
The output I desire has k-1, i.e. 3-1 = 2, zeros between the elements:
Output: [1,0,0,2,0,0,3,0,0,4,0,0,5]
In short, I want to add k-1 zeros between the elements of the 1d array. Here is the traceback for k=3:
ValueError Traceback (most recent call last)
Cell In [98], line 4
1 k = 3
3 A = np.array([1,2,3,4,5])
----> 4 B = np.insert(A, np.arange(1,len(A)), values=np.zeros(k-1))
6 print(k,'\n')
7 print(A,'\n')
File <__array_function__ internals>:180, in insert(*args, **kwargs)
File c:\Users\Naruto\AppData\Local\Programs\Python\Python310\lib\site-packages\numpy\lib\function_base.py:5325, in insert(arr, obj, values, axis)
5323 slobj[axis] = indices
5324 slobj2[axis] = old_mask
-> 5325 new[tuple(slobj)] = values
5326 new[tuple(slobj2)] = arr
5328 if wrap:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (4,)

Would this be what you are looking for?
k=3
A=np.array([1,2,3,4,5])
B=np.insert(A, list(range(1,len(A)+1))*(k-1), 0)
I just duplicate the indices in the obj array. Plus, there is no need to build an array of zeros: a single 0 scalar will do for the values argument.
Note that there are certainly better ways than the list to create that index (since it actually builds a list). I fail to think of a one-liner for now. But if that list is big, it might be a good idea to create an iterator for it.
I am not sure (I've never asked myself this question before) whether this insert is optimal either.
For example
B = np.zeros((len(A)*k,), dtype=int)  # note: np.int was removed in recent NumPy; plain int works
B[::k] = A
also does the trick. Which one is better memory-wise (I would say this one, at first glance, because it doesn't create the obj list) and CPU-wise, I am not sure.
EDIT: in fact, I've just tried. The second solution is way faster (27 ms vs 1586 ms, for A with 50000 values and k=100), which is not surprising. It is quite easy to figure out what it does (in C, I mean, in the numpy code, not in Python): just an allocation, and then a for loop to copy some values. It could hardly be simpler. insert, on the other hand, probably has to compute index shifting and the like.
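For reference, that comparison could be reproduced with something like the following (a rough sketch; the exact numbers will of course depend on the machine):
import numpy as np
from timeit import timeit

k = 100
A = np.arange(50000)

def with_insert():
    return np.insert(A, list(range(1, len(A) + 1)) * (k - 1), 0)

def with_strided_assign():
    B = np.zeros(len(A) * k, dtype=int)
    B[::k] = A
    return B

print(timeit(with_insert, number=1))          # on the order of a second or two
print(timeit(with_strided_assign, number=1))  # on the order of tens of milliseconds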

A simple and fast method: use np.zeros to create B, then assign the values from A.
k = 3
A = np.array([1,2,3,4,5])
B = np.zeros(k*len(A)-k+1, dtype=A.dtype)
B[::k] = A
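Wrapped into a small helper (just a sketch; the name upsample is only for illustration), this reproduces the desired output with no trailing zeros:
import numpy as np

def upsample(a, k):
    # Insert k-1 zeros between consecutive elements; no zeros after the last element.
    out = np.zeros(k * len(a) - (k - 1), dtype=a.dtype)
    out[::k] = a
    return out

print(upsample(np.array([1, 2, 3, 4, 5]), 3))
# [1 0 0 2 0 0 3 0 0 4 0 0 5]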


Python: fast matrix multiplication with extra indices

I have two arrays, A and B, with dimensions (l,m,n) and (l,m,n,n), respectively. I would like to obtain an array C of dimensions (l,m,n), obtained by treating A as a vector in its third index and B as a matrix in its third and fourth indices. An easy way to do this is:
import numpy as np
#Define dimensions
l = 1024
m = l
n = 6
#Create some random arrays
A = np.random.rand(l,m,n)
B = np.random.rand(l,m,n,n)
C = np.zeros((l,m,n))
#Desired multiplication
for i in range(0,l):
    for j in range(0,m):
        C[i,j,:] = np.matmul(A[i,j,:],B[i,j,:,:])
It is, however, slow (about 3 seconds on my MacBook). What'd be the fastest, fully vectorial way to do this?
Try to use einsum.
It has many use cases, check the docs: https://numpy.org/doc/stable/reference/generated/numpy.einsum.html
Or, for more info, a really good explanation can be also found at: https://ajcr.net/Basic-guide-to-einsum/
In your case, it seems like
np.einsum('dhi,dhij->dhj',A,B)
should work. Also, you can try the optimize=True flag to get more speed, if needed.
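For reference, a small self-contained check of that einsum call against the loop from the question, with the dimensions reduced so it runs in a fraction of a second:
import numpy as np

l, m, n = 64, 64, 6   # smaller than the question's 1024 x 1024, just for a quick check
A = np.random.rand(l, m, n)
B = np.random.rand(l, m, n, n)

# Loop version from the question
C_loop = np.zeros((l, m, n))
for i in range(l):
    for j in range(m):
        C_loop[i, j, :] = np.matmul(A[i, j, :], B[i, j, :, :])

# einsum version
C_einsum = np.einsum('dhi,dhij->dhj', A, B, optimize=True)

print(np.allclose(C_loop, C_einsum))   # True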

How to prevent accidental assignment into empty NumPy views

Consider the following Python + NumPy code that executes without error:
a = np.array((1, 2, 3))
a[13:17] = 23
Using a slice beyond the limits of the array truncates the slice and even returns an empty view if start and stop are beyond the limits. Assigning to such a slice just drops the input.
In my use case the indices are calculated in a non-trivial way and are used to manipulate selected parts of an array. The above behavior means that I might silently skip parts of that manipulation if the indices are miscalculated. That can be hard to detect and can lead to "almost correct" results, i.e. the worst kind of programming errors.
For that reason I'd like to have strict checking for slices so that a start or stop outside the array bounds triggers an error. Is there a way to enable that in NumPy?
As additional information, the arrays are large and the operation is performed very often, i.e. there should be no performance penalty. Furthermore, the arrays are often multidimensional, including multidimensional slicing.
You could use np.put_along_axis instead, which seems to fit your needs:
>>> a = np.array((1, 2, 3))
>>> np.put_along_axis(a, indices=np.arange(13, 17), axis=0, values=23)
The above will raise the following error:
IndexError: index 13 is out of bounds for axis 0 with size 3
Parameter values can either be a scalar value or another NumPy array.
Or in a shorter form:
>>> np.put_along_axis(a, np.r_[13:17], 23, 0)
Edit: Alternatively np.put has a mode='raise' option (which is set by default):
np.put(a, ind, v, mode='raise')
a: ndarray - Target array.
ind: array_like - Target indices, interpreted as integers.
v: array_like - Values to place in a at target indices. [...]
mode: {'raise', 'wrap', 'clip'} optional - Specifies how out-of-bounds
indices will behave.
'raise' – raise an error (default)
'wrap' – wrap around
'clip' – clip to the range
The default behavior will be:
>>> np.put(a, np.r_[13:17], 23)
IndexError: index 13 is out of bounds for axis 0 with size 3
while with mode='clip', it remains silent:
>>> np.put(a, np.r_[13:17], 23, mode='clip')
Depending on how complicated your indices are (read: how much pain in the backside it is to predict shapes after slicing), you may want to compute the expected shape directly and then reshape to it. If the size of your actual sliced array doesn't match this will raise an error. Overhead is minor:
import numpy as np
from timeit import timeit
def use_reshape(a, idx, val):
    expected_shape = ((s.stop - s.start - 1) // (s.step or 1) + 1 if isinstance(s, slice) else 1 for s in idx)
    a[idx].reshape(*expected_shape)[...] = val

def no_check(a, idx, val):
    a[idx] = val

val = 23
idx = np.s_[13:1000:2, 14:20]

for f in (no_check, use_reshape):
    a = np.zeros((1000, 1000))
    print(f.__name__)
    print(timeit(lambda: f(a, idx, val), number=1000), 'ms')
    assert (a[idx] == val).all()  # check it works

print("\nThis should raise an exception:\n")
use_reshape(a, np.s_[1000:1001, 10], 0)
Please note that this is proof-of-concept code. To make it safe, you'd have to check for unexpected index kinds, matching numbers of dimensions and, importantly, indices that select a single element.
Running it anyway:
no_check
0.004587646995787509 ms
use_reshape
0.006306983006652445 ms
This should raise an exception:
Traceback (most recent call last):
File "check.py", line 22, in <module>
use_reshape(a,np.s_[1000:1001,10],0)
File "check.py", line 7, in use_reshape
a[idx].reshape(*expected_shape)[...] = val
ValueError: cannot reshape array of size 0 into shape (1,1)
One way to achieve the behavior you want is to use ranges instead of slices:
a = np.array((1, 2, 3))
a[np.arange(13, 17)] = 23
I think NumPy's behavior here is consistent with the behavior of pure Python's lists and should be expected. Instead of workarounds, it might be better for code readability to explicitly add asserts:
index_1, index_2 = ... # a complex computation
assert index_1 < index_2 and index_2 < a.shape[0]
a[index_1:index_2] = 23
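If you want to keep plain slice assignment without scattering asserts, another option is a small wrapper that validates the slice bounds before assigning. This is a hypothetical helper (checked_slice_assign is not a NumPy function), sketched only for plain slices:
import numpy as np

def checked_slice_assign(arr, idx, value):
    # Hypothetical helper: raise if any slice bound lies outside the array,
    # instead of silently truncating like plain slice assignment does.
    slices = idx if isinstance(idx, tuple) else (idx,)
    for s, size in zip(slices, arr.shape):
        if isinstance(s, slice):
            for bound in (s.start, s.stop):
                if bound is not None and not -size <= bound <= size:
                    raise IndexError(
                        f"slice bound {bound} out of range for axis of size {size}")
    arr[idx] = value

a = np.array((1, 2, 3))
checked_slice_assign(a, slice(0, 2), 23)     # fine: a is now [23, 23, 3]
checked_slice_assign(a, slice(13, 17), 23)   # raises IndexError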

Making the nested loop run faster, e.g. by vectorization in Python

I have a 2 * N integer array ids representing intervals, where N is about a million. It looks like this
0 2 1 ...
3 4 3 ...
The ints in the arrays can be 0, 1, ... , M-1, where M <= 2N. (Detail: if M = 2N, then the ints span all the 2N integers; if M < 2N, then some entries share the same value.)
I need to calculate a kind of inverse map from ids. By "inverse map" I mean: view each column of ids as an interval and, for every inner point, record which intervals (by column index) contain it.
Intuition: Intuitively,
0 2 1
3 4 3
can be seen as
0 -> 0, 1, 2
1 -> 2, 3
2 -> 1, 2
where the right-hand-side endpoints are excluded for my problem. The "inverse" map would be
0 -> 0
1 -> 0, 2
2 -> 0, 1, 2
3 -> 1
Code: I have a piece of Python code that attempts to calculate the inverse map in a dictionary inv:
for i in range(ids.shape[1]):
    for j in range(ids[0][i], ids[1][i]):
        inv[j].append(i)
where each inv[j] is an array-like, initialized as empty before the nested loop. Currently I use Python's built-in arrays to initialize it:
for i in range(M):
    inv[i] = array.array('I')
Question: the nested loop above is a mess. In my problem setting (image processing), the first loop has a million iterations and the second about 3000. Not only does it take a lot of memory (because inv is huge), it is also slow. I would like to focus on speed in this question. How can I accelerate this nested loop, e.g. with vectorization?
You could try the option below, in which your outer loop is hidden away inside numpy's apply_along_axis(). Not sure about the performance benefit; only a test at a decent scale can tell (especially as there's some initial overhead involved in converting lists to numpy arrays):
import numpy as np
import array
ids = [[0,2,1],[3,4,3]]
ids_arr = np.array(ids) # Convert to numpy array. Expensive operation?
range_index = 0 # Initialize. To be bumped up by each invocation of my_func()
inv = {}
for i in range(np.max(ids_arr)):
    inv[i] = array.array('I')

def my_func(my_slice):
    global range_index
    for i in range(my_slice[0], my_slice[1]):
        inv[i].append(range_index)
    range_index += 1

np.apply_along_axis(my_func, 0, ids_arr)
print (inv)
Output:
{0: array('I', [0]), 1: array('I', [0, 2]), 2: array('I', [0, 1, 2]),
3: array('I', [1])}
Edit:
I feel that using a dictionary might not be a good idea here. I suspect that in this particular context, dictionary-indexing might actually be slower than numpy array indexing. Use the below lines to create and initialize inv as a numpy array of Python arrays. The rest of the code can remain as-is:
inv_len = np.max(ids_arr)
inv = np.empty(shape=(inv_len,), dtype=array.array)
for i in range(inv_len):
    inv[i] = array.array('I')
(Note: This assumes that your application isn't doing dict-specific stuff on inv, such as inv.items() or inv.keys(). If that's the case, however, you might need an extra step to convert the numpy array into a dict)
Avoid the for loop; here is just a pandas sample:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "A": np.random.randint(0, 100, 100000),
    "B": np.random.randint(0, 100, 100000)
})
df.groupby("B")["A"].agg(list)
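For the interval data from the question, the same groupby idea could be applied after expanding every interval into its inner points. A sketch (not from the original answer; it assumes ids is the 2 x N integer array described in the question, with exclusive right endpoints):
import numpy as np
import pandas as pd

ids = np.array([[0, 2, 1],
                [3, 4, 3]])

# Expand each half-open interval [start, stop) into its inner points and
# remember which interval (column index) each point came from.
lengths = ids[1] - ids[0]
interval_idx = np.repeat(np.arange(ids.shape[1]), lengths)
offsets = np.arange(lengths.sum()) - np.repeat(np.cumsum(lengths) - lengths, lengths)
points = np.repeat(ids[0], lengths) + offsets

inv = pd.Series(interval_idx).groupby(points).agg(list)
print(inv)   # point 0 -> [0], point 1 -> [0, 2], point 2 -> [0, 1, 2], point 3 -> [1]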
Since the order of N is large, I've come up with what seems like a practical approach; let me know if there are any flaws.
For the ith interval [x,y], store it as [x,y,i]. Sort the intervals based on their start and end times. This should take O(N log N) time.
Create a frequency array freq[2*N+1]. For each interval, update the frequency using the concept of a range update, in O(1) per update. Generating the frequencies is then done in O(N) via a prefix sum (see the sketch after this answer).
Determine a threshold based on your data. According to that value, elements can be classified as either sparse or frequent. For sparse elements, do nothing. For frequent elements only, store the intervals in which they occur.
During lookup, if the element is frequent, you can directly access the pre-computed lists. If the element is sparse, you can search the intervals in O(log N) time, since the intervals are sorted and their indices were appended in step 1.
This seems like a practical approach to me; the rest depends on your usage, such as the amortized time complexity you need per query.
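A minimal sketch of the range-update / prefix-sum step mentioned in step 2, assuming ids is the 2 x N array from the question with exclusive right endpoints:
import numpy as np

ids = np.array([[0, 2, 1],
                [3, 4, 3]])
M = ids.max()                    # inner points are 0 .. M-1

# Range update: +1 at every start, -1 at every (exclusive) end, then prefix sum.
freq = np.zeros(M + 1, dtype=np.int64)
np.add.at(freq, ids[0], 1)
np.add.at(freq, ids[1], -1)
counts = np.cumsum(freq)[:-1]    # counts[j] = number of intervals containing point j
print(counts)                    # [1 2 3 1]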

Are Numpy arrays hashable?

I've read that numpy arrays are hashable, which would mean they are immutable, but I'm able to change their values, so what exactly does it mean to be hashable?
c=pd.Series('a',index=range(6))
c
Out[276]:
0 a
1 a
2 a
3 a
4 a
5 a
dtype: object
This doesn't give me an error, so why do I get an error if I try to do the same with a numpy array?
d=pd.Series(np.array(['a']),index=range(6))
Contrary to what you have read, arrays are not hashable. You can test this with
import numpy as np, collections.abc
isinstance(np.array(1), collections.abc.Hashable)
or
{np.array(1):1}
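Both checks confirm it (results shown as comments; the exact exception wording may vary between versions):
isinstance(np.array(1), collections.abc.Hashable)   # False
{np.array(1): 1}   # TypeError: unhashable type: 'numpy.ndarray'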
This has nothing to do with the error you are getting:
d=pd.Series(np.array('a'),index=range(6))
ValueError: Wrong number of dimensions
The error is specific and has nothing to do with hashes. The Series constructor expects something with at least 1 dimension, whereas the above has 0 dimensions. This happens because it is given an array, so it checks the dimensions (as opposed to being passed the string directly, which the pandas developers have chosen to broadcast, as you have shown; arguably they could have done the same for a 0-dimensional array).
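A quick check of the dimensions involved (not from the original answer):
import numpy as np

np.array('a').ndim      # 0 -> rejected by pandas ("Wrong number of dimensions")
np.array(('a',)).ndim   # 1 -> accepted, but its length must then match the index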
So you could try:
d=pd.Series(np.array(('a',)),index=range(6))
ValueError: Wrong number of items passed 1, placement implies 6
The index expects there to be 6 items along one dimension, so it fails. Finally,
pd.Series(np.array(['a']*6),index=range(6))
0 a
1 a
2 a
3 a
4 a
5 a
dtype: object
works. So the Series has no problem being initialized from an array, and this has nothing to do with hashability.

Replace part of numpy 1D array with shorter array

I have a 1D numpy array containing some audio data. I'm doing some processing and want to replace certain parts of the data with white noise. The noise should, however, be shorter than the replaced part. Generating the noise is not a problem, but I'm wondering what the easiest way to replace the original data with the noise is. My first thought of doing data[10:110] = noise[0:10] does not work due to the obvious dimension mismatch.
What's the easiest way to replace a part of a numpy array with another part of different dimension?
edit:
The data is uncompressed PCM data that can be up to an hour long, taking up a few hundred MB of memory. I would like to avoid creating any additional copies in memory.
What advantage does a numpy array have over a python list for your application? I think one of the weaknesses of numpy arrays is that they are not easy to resize:
http://mail.python.org/pipermail/python-list/2008-June/1181494.html
Do you really need to reclaim the memory from the segments of the array you're shortening? If not, maybe you can use a masked array:
http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
When you want to replace a section of your signal with a shorter section of noise, replace the first chunk of the signal, then mask out the remainder of the removed signal.
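A minimal sketch of that masked-array idea, using the same numbers as the snippet that follows (replace positions 2:7 with two samples of "noise", then mask the leftover three positions):
import numpy as np
import numpy.ma as ma

data = np.arange(10, dtype=float)
noise = -1 * np.ones(2)

masked = ma.asarray(data)      # masked view over the same buffer; no copy of the data
masked[2:4] = noise            # write the shorter noise at the start of the region
masked[4:7] = ma.masked        # hide the remainder of the replaced region
print(masked)                  # [0.0 1.0 -1.0 -1.0 -- -- -- 7.0 8.0 9.0]
print(masked.compressed())     # [ 0.  1. -1. -1.  7.  8.  9.]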
EDIT: Here's some clunky numpy code that doesn't use masked arrays, and doesn't allocate more memory. It also doesn't free any memory for the deleted segments. The idea is to replace data that you want deleted by shifting the remainder of the array, leaving zeros (or garbage) at the end of the array.
import numpy
a = numpy.arange(10)
# [0 1 2 3 4 5 6 7 8 9]
## Replace a[2:7] with length-2 noise:
insert = -1 * numpy.ones((2))
new = slice(2, 4)
old = slice(2, 7)
#Just to indicate what we'll be replacing:
a[old] = 0
# [0 1 0 0 0 0 0 7 8 9]
a[new] = insert
# [0 1 -1 -1 0 0 0 7 8 9]
#Shift the remaining data over:
a[new.stop:(new.stop - old.stop)] = a[old.stop:]
# [0 1 -1 -1 7 8 9 7 8 9]
#Zero out the dangly bit at the end:
a[(new.stop - old.stop):] = 0
# [0 1 -1 -1 7 8 9 0 0 0]
Not entirely familiar with numpy, but can't you just break the data array down into pieces that are the same size as the noise array and set each data piece to the noise piece? For example:
data[10:20] = noise[0:10]
data[21:31] = noise[0:10]
etc., etc.?
you could loop like this:
for x in range(10, 100, 10):
    data[x:10+x] = noise[0:10]
UPDATE:
if you want to shorten the original data array, you could do this:
data = np.concatenate((data[:10], noise[:10]))
(note that + on numpy arrays adds element-wise rather than concatenating, so np.concatenate is needed here). This will truncate the data array and append the noise after the 10th location; you could then append the rest of the data array to the new array if you need it.
