Replace part of numpy 1D array with shorter array - python

I have a 1D numpy array containing some audio data. I'm doing some processing and want to replace certain parts of the data with white noise. The noise should, however, be shorter than the replaced part. Generating the noise is not a problem, but I'm wondering what the easiest way to replace the original data with the noise is. My first thought of doing data[10:110] = noise[0:10] does not work due to the obvious dimension mismatch.
What's the easiest way to replace a part of a numpy array with another part of different dimension?
edit:
The data is uncompressed PCM data that can be up to an hour long, taking up a few hundred MB of memory. I would like to avoid creating any additional copies in memory.

What advantage does a numpy array have over a python list for your application? I think one of the weaknesses of numpy arrays is that they are not easy to resize:
http://mail.python.org/pipermail/python-list/2008-June/1181494.html
Do you really need to reclaim the memory from the segments of the array you're shortening? If not, maybe you can use a masked array:
http://docs.scipy.org/doc/numpy/reference/maskedarray.generic.html
When you want to replace a section of your signal with a shorter section of noise, replace the first chunk of the signal, then mask out the remainder of the removed signal.
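A minimal sketch of that masked-array idea (the array sizes and variable names here are placeholders, not taken from the question):
import numpy as np
import numpy.ma as ma

data = np.arange(20, dtype=float)  # stand-in for the audio signal
noise = -1.0 * np.ones(4)          # stand-in for 4 samples of noise

# Suppose we want to replace data[5:12] with the shorter noise:
masked = ma.masked_array(data)
masked[5:5 + len(noise)] = noise       # overwrite the first chunk of the span
masked[5 + len(noise):12] = ma.masked  # mask out the remainder of the removed signal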
EDIT: Here's some clunky numpy code that doesn't use masked arrays, and doesn't allocate more memory. It also doesn't free any memory for the deleted segments. The idea is to replace data that you want deleted by shifting the remainder of the array, leaving zeros (or garbage) at the end of the array.
import numpy
a = numpy.arange(10)
# [0 1 2 3 4 5 6 7 8 9]
# Replace a[2:7] with length-2 noise:
insert = -1 * numpy.ones(2)
new = slice(2, 4)
old = slice(2, 7)
# Just to indicate what we'll be replacing:
a[old] = 0
# [0 1 0 0 0 0 0 7 8 9]
a[new] = insert
# [0 1 -1 -1 0 0 0 7 8 9]
# Shift the remaining data over:
a[new.stop:(new.stop - old.stop)] = a[old.stop:]
# [0 1 -1 -1 7 8 9 7 8 9]
# Zero out the dangly bit at the end:
a[(new.stop - old.stop):] = 0
# [0 1 -1 -1 7 8 9 0 0 0]
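The same shift-in-place idea can be wrapped up as a helper. This is only a sketch generalizing the snippet above; the function name and the fill value are my own choices:
def replace_with_shorter(a, start, stop, noise, fill=0):
    # Overwrite a[start:stop] in place with the shorter noise,
    # shift the tail of the array left, and pad the end with fill.
    n = len(noise)
    a[start:start + n] = noise
    tail = len(a) - stop  # number of samples after the replaced span
    a[start + n:start + n + tail] = a[stop:]  # left shift: source lies ahead of destination
    a[start + n + tail:] = fill
    return a

# replace_with_shorter(numpy.arange(10), 2, 7, insert) reproduces the final array above.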

Not entirely familiar with numpy, but can't you just break the data array down into pieces that are the same size as the noise array and set each data piece to the noise piece? For example:
data[10:20] = noise[0:10]
data[20:30] = noise[0:10]
etc., etc.?
You could loop like this:
for x in range(10, 100, 10):
    data[x:x + 10] = noise[0:10]
UPDATE:
If you want to shorten the original data array, you could do this:
data = np.concatenate((data[:10], noise[:10]))
This truncates the data array and appends the noise after the 10th element; you could then concatenate the rest of the original data onto the new array if you need it.
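If making a shorter copy is acceptable, here is a minimal sketch of that idea with concatenation (the sizes and variable names are placeholders, and note this allocates a new array rather than working in place):
import numpy as np

data = np.arange(20, dtype=float)
noise = -1.0 * np.ones(4)

# Replace data[5:12] with the 4-sample noise, dropping the 3 extra samples:
shorter = np.concatenate((data[:5], noise, data[12:]))
print(len(data), len(shorter))  # 20 17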

Related

Upsampling using Numpy

I want to upsample a given 1d array by adding 'k-1' zeros between the elements for a given upsampling factor 'k'.
k=2
A = np.array([1,2,3,4,5])
B = np.insert(A,np.arange(1,len(A)), values=np.zeros(k-1))
The above code works for k=2.
Output: [1 0 2 0 3 0 4 0 5]
k=3
A = np.array([1,2,3,4,5])
B = np.insert(A,np.arange(1,len(A)), values=np.zeros(k-1))
For k=3, it's throwing me an error.
The output I desire has k-1, i.e. 3-1 = 2, zeros between the elements:
Output: [1,0,0,2,0,0,3,0,0,4,0,0,5]
I want to add k-1 zeros between the elements of the 1d array.
ValueError Traceback (most recent call last)
Cell In [98], line 4
1 k = 3
3 A = np.array([1,2,3,4,5])
----> 4 B = np.insert(A, np.arange(1,len(A)), values=np.zeros(k-1))
6 print(k,'\n')
7 print(A,'\n')
File <__array_function__ internals>:180, in insert(*args, **kwargs)
File c:\Users\Naruto\AppData\Local\Programs\Python\Python310\lib\site-packages\numpy\lib\function_base.py:5325, in insert(arr, obj, values, axis)
5323 slobj[axis] = indices
5324 slobj2[axis] = old_mask
-> 5325 new[tuple(slobj)] = values
5326 new[tuple(slobj2)] = arr
5328 if wrap:
ValueError: shape mismatch: value array of shape (2,) could not be broadcast to indexing result of shape (4,)
Would this be what you are looking for?
k = 3
A = np.array([1,2,3,4,5])
B = np.insert(A, list(range(1, len(A)+1))*(k-1), 0)
I just duplicate the indexes in the obj array. (Note that this also appends k-1 zeros after the last element; use range(1, len(A)) instead if you want to stop exactly at the last element, as in the desired output.) Plus, there is no need to build an array of zeros; a single 0 scalar will do for the values argument.
Note that there are certainly better ways than a list to create that index (since it actually builds a list). I can't think of a one-liner for now, but if that list is big, it might be a good idea to create an iterator for it.
I am also not sure (I've never asked myself this question before) whether this insert is optimal.
For example
B = np.zeros((len(A)*k,), dtype=int)
B[::k] = A
also does the trick. Which one is better memory-wise (I would say this one, but just at first glance, because it doesn't create the obj list) and CPU-wise, I am not sure.
EDIT: In fact, I've just tried. The second solution is way faster (27 ms vs 1586 ms, for A with 50000 values and k=100), which is not surprising. It is quite easy to figure out what it does (in C, I mean, in the numpy code, not in Python): just an allocation, and then a for loop to copy some values. It could hardly be simpler. Whereas insert probably computes shifting and the like.
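For what it's worth, here is a rough sketch of how that comparison could be reproduced; the exact numbers will of course vary by machine and NumPy version:
import timeit
import numpy as np

A = np.arange(50000)
k = 100

def with_insert():
    # The insert-based approach from above.
    return np.insert(A, list(range(1, len(A)+1))*(k-1), 0)

def with_zeros():
    # The preallocate-and-stride approach from above.
    B = np.zeros((len(A)*k,), dtype=int)
    B[::k] = A
    return B

print(timeit.timeit(with_insert, number=1))  # slow path
print(timeit.timeit(with_zeros, number=1))   # fast path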
A simple and fast method using np.zeros to create B, then assign values from A.
k = 3
A = np.array([1,2,3,4,5])
B = np.zeros(k*len(A)-k+1, dtype=A.dtype)
B[::k] = A
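As a quick check (assuming the same A and k as in the question), the snippet above gives the desired spacing:
print(B)  # [1 0 0 2 0 0 3 0 0 4 0 0 5]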

Can I make a new dtype for numpy ndarray?

I have a very big numpy matrix. Luckily there are only 4 possible options for values in the matrix, let say 0,1,2,3. Now of course instead of using float32, I can set the type to int8, to save memory (and storage, later when I save it).
My value range is exactly 2 to the power of 2 (four values), so theoretically I can use just 2 bits for each cell and save about 4 times the memory. Is it possible to give a specific number of bits for the dtype?
Ideally I would like to do something like:
compressed_matrix = matrix.astype(np.int2)
I know that theoretically I can pack a few cells together, like:
for i in range(0, len(arr), 4):
    comp_arr[i // 4] = arr[i+3] + 4 * arr[i+2] + 16 * arr[i+1] + 64 * arr[i]
And later it will be possible to reconstruct it. But in the compressed form it will be very hard to apply operations to this array, such as transposing it or summing along an axis. This solution might be good only for saving the matrix to disk.
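To illustrate, here is a hedged sketch of how that packing (and the corresponding unpacking) could be vectorized. pack2bit and unpack2bit are hypothetical helper names, and the length of arr is assumed to be a multiple of 4:
import numpy as np

def pack2bit(arr):
    # Pack four 2-bit values (0..3) into one uint8, using the same
    # weights as the loop above (64, 16, 4, 1).
    quads = arr.reshape(-1, 4).astype(np.uint8)
    weights = np.array([64, 16, 4, 1], dtype=np.uint8)
    return (quads * weights).sum(axis=1).astype(np.uint8)

def unpack2bit(packed):
    # Reverse the packing: shift each byte and mask off two bits at a time.
    shifts = np.array([6, 4, 2, 0], dtype=np.uint8)
    return ((packed[:, None] >> shifts) & 3).reshape(-1)

As the question notes, arithmetic on the packed form is still awkward; this only helps with storage.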
EDIT:
I add here a code example of the usage. Let's say window0 and window2 are two matrices that satisfy this condition (both very big, with values in the range 0 to 3, and both of the same shape).
We also have vectors named ref_freq and non_ref_freq, which are float vectors with values between 0 and 1, of the same length, defined by ref_freq = 1 - non_ref_freq.
My current code, which works, but not on very large matrices, is:
first_element = (ref_freq * window0) @ window0.T
second_element = (non_ref_freq * (2 - window2)) @ (2 - window2).T
similarity = (first_element + second_element) / 4
np.fill_diagonal(similarity, 0)
Any idea that can help me?
Thanks,
Shahar

How do I sort a txt file dataset into two datasets using a label?

I loaded the data set with
np.loadtxt("dataset")
which has given me an array of arrays. What I am trying to do is sort these internal arrays, which comprise three variables x, y and z, where z is either +1 or -1 and denotes whether the row is positive or negative.
I want to break these arrays down into two separate arrays for processing, so I can plot the negatively labeled rows against the positive ones.
Example dataset:
[[ 1 2 1 ],
[ 2 1 -1 ],
[ 3 2 1 ]]
This is what I've thought of so far:
negex = []
posex = []
if dataset[2] < 0:
    negex.append()
else:
    posex.append()
I know this is wrong, but it is the best I can think of. The reason I wrote dataset[2] is that I'm addressing the third variable of the array; basically I'm saying that if it is less than 0 (which -1 is), append to negex, and if it is not less than 0, append to posex.
Ultimately I want to transform this dataset to the point where I can plug it into matplotlib and get points. Also, I'm only allowed to use numpy.
You can break the dataset down into two separate arrays as follows:
negex = np.delete(dataset[dataset[:, 2] < 0], 2, 1)
posex = np.delete(dataset[dataset[:, 2] > 0], 2, 1)
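As a hedged illustration, applying those lines to the example dataset from the question:
import numpy as np

dataset = np.array([[1, 2,  1],
                    [2, 1, -1],
                    [3, 2,  1]])

# Boolean masks on the label column (index 2) select the rows,
# and np.delete(..., 2, 1) then drops that label column.
negex = np.delete(dataset[dataset[:, 2] < 0], 2, 1)
posex = np.delete(dataset[dataset[:, 2] > 0], 2, 1)

print(negex)  # [[2 1]]
print(posex)  # [[1 2]
              #  [3 2]]

The resulting two-column arrays can then be passed to matplotlib as x and y coordinates.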

Making the nested loop run faster, e.g. by vectorization in Python

I have a 2 * N integer array ids representing intervals, where N is about a million. It looks like this
0 2 1 ...
3 4 3 ...
The ints in the arrays can be 0, 1, ..., M-1, where M <= 2N. (Detail: if M = 2N, then the ints span all 2N distinct values; if M < 2N, then some of the ints share the same value.)
I need to calculate a kind of inverse map from ids. What I call the "inverse map" is to see ids as intervals and capture the relation from their inner points to their indices.
Intuition: Intuitively,
0 2 1
3 4 3
can be seen as
0 -> 0, 1, 2
1 -> 2, 3
2 -> 1, 2
where the right-hand-side endpoints are excluded for my problem. The "inverse" map would be
0 -> 0
1 -> 0, 2
2 -> 0, 1, 2
3 -> 1
Code: I have a piece of Python code that attempts to calculate the inverse map in a dictionary inv below:
for i in range(ids.shape[1]):
    for j in range(ids[0][i], ids[1][i]):
        inv[j].append(i)
where each inv[j] is an array-like container initialized as empty before the nested loop. Currently I use Python's built-in arrays to initialize it:
for i in range(M): inv[i] = array.array('I')
Question: The nested loop above is a mess. In my problem setting (image processing), the first loop has a million iterations and the second one about 3000 iterations. Not only does it take a lot of memory (because inv is huge), it is also slow. I would like to focus on speed in this question. How can I accelerate the nested loop above, e.g. with vectorization?
You could try the option below, in which your outer loop is hidden away inside numpy's apply_along_axis(). I'm not sure about the performance benefit; only a test at a decent scale can tell (especially as there's some initial overhead involved in converting lists to numpy arrays):
import numpy as np
import array
ids = [[0, 2, 1], [3, 4, 3]]
ids_arr = np.array(ids)  # Convert to numpy array. Expensive operation?
range_index = 0  # Initialize. To be bumped up by each invocation of my_func()
inv = {}
for i in range(np.max(ids_arr)):
    inv[i] = array.array('I')

def my_func(my_slice):
    global range_index
    for i in range(my_slice[0], my_slice[1]):
        inv[i].append(range_index)
    range_index += 1

np.apply_along_axis(my_func, 0, ids_arr)
print(inv)
Output:
{0: array('I', [0]), 1: array('I', [0, 2]), 2: array('I', [0, 1, 2]),
3: array('I', [1])}
Edit:
I feel that using a dictionary might not be a good idea here. I suspect that in this particular context, dictionary-indexing might actually be slower than numpy array indexing. Use the below lines to create and initialize inv as a numpy array of Python arrays. The rest of the code can remain as-is:
inv_len = np.max(ids_arr)
inv = np.empty(shape=(inv_len,), dtype=object)  # object dtype to hold the array.array instances
for i in range(inv_len):
    inv[i] = array.array('I')
(Note: This assumes that your application isn't doing dict-specific stuff on inv, such as inv.items() or inv.keys(). If that's the case, however, you might need an extra step to convert the numpy array into a dict)
To avoid the for loop, here is a pandas sample:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "A": np.random.randint(0, 100, 100000),
    "B": np.random.randint(0, 100, 100000)
})
df.groupby("B")["A"].agg(list)
Since the order of N is large, I've come up with what seems like a practical approach; let me know if there are any flaws.
For the ith interval [x,y], store it as [x,y,i]. Sort the intervals based on their start and end times. This should take O(N log N) time.
Create a frequency array freq[2*N+1]. For each interval, update the frequency using the concept of a range update, in O(1) per update. Generating the frequencies is done in O(N) (see the sketch after these steps).
Determine a threshold based on your data. According to that value, classify the elements as either sparse or frequent. For sparse elements, do nothing. For frequent elements only, store the intervals in which they occur.
During lookup, if the element is a frequent one, you can directly access the pre-computed lists. If the element is a sparse one, you can search the intervals in O(log N) time, since the intervals are sorted and their indexes were appended in step 1.
This seems like a practical approach to me; the rest depends on your usage, such as the amortized time complexity you need per query.
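As referenced above, here is a minimal sketch of the range-update (difference array) step, using the small ids example from the question; the variable names are my own. The resulting counts tell how many intervals cover each point, which is what the sparse/frequent classification would be based on:
import numpy as np

starts = np.array([0, 2, 1])    # ids[0] from the example
stops = np.array([3, 4, 3])     # ids[1] from the example

M = stops.max()                 # inner points are 0 .. M-1 (right endpoints excluded)
freq = np.zeros(M + 1, dtype=np.int64)
np.add.at(freq, starts, 1)      # +1 where an interval begins
np.add.at(freq, stops, -1)      # -1 just past where it ends
coverage = np.cumsum(freq)[:M]  # coverage[j] = number of intervals containing j

print(coverage)                 # [1 2 3 1]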

How to control reserved capacity of numpy array?

In a C++ vector there are .reserve(size) and .capacity() methods which allow you to reserve memory for the array and to get the current reserved size. This reserved size is greater than or equal to the vector's real size (obtained through .size()).
If I do .push_back(element) on this vector, memory is not reallocated as long as the current .size() < .capacity(). This allows fast appending of elements. If there is no more capacity, the vector gets reallocated to a new memory location and all data is copied.
I'd like to know if the same low-level methods are available for numpy arrays. Can I reserve a large capacity so that small appends/inserts don't reallocate the numpy array in memory too often?
Probably there is already some growth mechanism built into the numpy array, like 10% growth of reserved capacity on each reallocation. But I wonder if I can control this myself and maybe implement faster growth, like doubling the reserved capacity on each growth.
It would also be nice to know if there are in-place variants of numpy functions like insert/append, which modify the array in place without creating a copy, i.e. part of the array is somehow reserved and filled with zeros, and this part is used for shifting. E.g. if I have the array [1 0 0 0] with the last 3 zero elements reserved, then an in-place .append(2) would mutate this array into [1 2 0 0] with 2 reserved zero elements left. Then .insert(1, 3) would again modify it to become [1 3 2 0] with 1 reserved zero element left, i.e. everything like in C++.
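As far as I know, NumPy does not expose anything like reserve()/capacity() on ndarray itself, but as a hedged sketch, the doubling strategy described above can be implemented by managing a preallocated buffer yourself (the class and method names here are my own):
import numpy as np

class GrowableArray:
    # Sketch of a C++-vector-style buffer: capacity doubles when exhausted,
    # and appends within the reserved capacity do not reallocate.
    def __init__(self, capacity=16, dtype=float):
        self._buf = np.empty(capacity, dtype=dtype)
        self._size = 0

    def append(self, value):
        if self._size == len(self._buf):  # out of reserved capacity
            new_buf = np.empty(2 * len(self._buf), dtype=self._buf.dtype)
            new_buf[:self._size] = self._buf[:self._size]  # one copy, as in vector growth
            self._buf = new_buf
        self._buf[self._size] = value
        self._size += 1

    def view(self):
        return self._buf[:self._size]  # no copy, just a view of the filled part

Appends that fit in the reserved capacity are then plain element assignments; only the occasional growth pays for a reallocation and copy.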
