I have the following list or numpy array
ll=[7.2,0,0,0,0,0,6.5,0,0,-8.1,0,0,0,0]
and an additional list indicating the positions of non-zeros
i=[0,6,9]
I would like to make two new lists out of them: one that fills in the zeros and one that counts positions since the last non-zero. For this short example:
a=[7.2,7.2,7.2,7.2,7.2,7.2,6.5,6.5,6.5,-8.1,-8.1,-8.1,-8.1,-8.1]
b=[0,1,2,3,4,5,0,1,2,0,1,2,3,4]
Is there a way to do that without a for loop to speed things up, as the list ll is quite long in my case?
Array a is the result of a forward fill, and array b contains the indices into the range between each pair of consecutive non-zero elements.
pandas has a forward fill function, but it should be easy enough to compute with numpy and there are many sources on how to do this.
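For reference, a minimal pandas sketch of the same forward fill (not the approach used below; it treats the zeros as gaps and assumes zeros only ever mark missing values):

import numpy as np
import pandas as pd

ll = [7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0]
a = pd.Series(ll).replace(0, np.nan).ffill().to_numpy()
print(a)
# [ 7.2  7.2  7.2  7.2  7.2  7.2  6.5  6.5  6.5 -8.1 -8.1 -8.1 -8.1 -8.1]

The pure-numpy version follows.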
import numpy as np

ll = [7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0]
a = np.array(ll)
# mask the zero elements and record the index of each non-zero (0 elsewhere)
mask = a == 0
idx = np.where(~mask, np.arange(mask.size), 0)
# do the fill: index each position with the last non-zero index seen so far
a[np.maximum.accumulate(idx)]
output:
array([ 7.2,  7.2,  7.2,  7.2,  7.2,  7.2,  6.5,  6.5,  6.5, -8.1, -8.1,
       -8.1, -8.1, -8.1])
More information about forward fill is found here:
Most efficient way to forward-fill NaN values in numpy array
Finding the consecutive zeros in a numpy array
To compute array b, you can combine the forward-fill indices with a single np.arange:
fill_mask = np.maximum.accumulate(idx)
np.arange(len(fill_mask)) - fill_mask
output:
array([0, 1, 2, 3, 4, 5, 0, 1, 2, 0, 1, 2, 3, 4])
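Putting both pieces together, a minimal end-to-end sketch:

import numpy as np

ll = [7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0]
a = np.array(ll)
mask = a == 0                                   # True at the gaps
idx = np.where(~mask, np.arange(mask.size), 0)  # own index at non-zeros, 0 elsewhere
fill_idx = np.maximum.accumulate(idx)           # index of the last non-zero seen so far
a_filled = a[fill_idx]                          # forward-filled values (array a)
b = np.arange(len(a)) - fill_idx                # offset since the last non-zero (array b)
print(a_filled)  # [ 7.2  7.2 ... -8.1]
print(b)         # [0 1 2 3 4 5 0 1 2 0 1 2 3 4]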
So...
import numpy as np
ll = np.array([7.2, 0, 0, 0, 0, 0, 6.5, 0, 0, -8.1, 0, 0, 0, 0])
i = np.array([0, 6, 9])
counts = np.append(
    np.diff(i),       # difference between consecutive elements of i
                      # (one element shorter than i)
    len(ll) - i[-1],  # + length of the last repeat
)
repeated = np.repeat(ll[i], counts)
repeated becomes
[ 7.2 7.2 7.2 7.2 7.2 7.2 6.5 6.5 6.5 -8.1 -8.1 -8.1 -8.1 -8.1]
b could be computed with
b = np.concatenate([np.arange(c) for c in counts])
print(b)
# [0 1 2 3 4 5 0 1 2 0 1 2 3 4]
but that involves a loop in the form of that list comprehension; perhaps someone Numpyier could implement it without a Python loop.
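One loop-free way, building on counts from above, is to repeat the start positions and subtract them from a plain arange (a small sketch):

b = np.arange(len(ll)) - np.repeat(i, counts)
print(b)
# [0 1 2 3 4 5 0 1 2 0 1 2 3 4]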
I'm trying to manipulate an index and source array such that:
result[ i ][ j ][ k ] = source[ i ][ indices[ i ][ j ][ k ] ]
I know how to do this with for loops but I'm using giant arrays and I'd like to use something more time efficient. I've tried to use numpy's advanced indexing but I don't really understand it.
Example functionality:
source = [[0.0 0.1 0.2 0.3]
          [1.0 1.1 1.2 1.3]
          [2.0 2.1 2.2 2.3]]

indices = [[[3 1 0 1]
            [3 0 0 3]]
           [[0 1 0 2]
            [3 2 1 1]]
           [[1 1 0 1]
            [0 1 2 2]]]

# result[i][j][k] = source[i][indices[i][j][k]]
result = [[[0.3 0.1 0.0 0.1]
           [0.3 0.0 0.0 0.3]]
          [[1.0 1.1 1.0 1.2]
           [1.3 1.2 1.1 1.1]]
          [[2.1 2.1 2.0 2.1]
           [2.0 2.1 2.2 2.2]]]
Solution using Integer Advanced Indexing:
Given:
source = [[0.0, 0.1, 0.2, 0.3],
          [1.0, 1.1, 1.2, 1.3],
          [2.0, 2.1, 2.2, 2.3]]

indices = [[[3, 1, 0, 1],
            [3, 0, 0, 3]],
           [[0, 1, 0, 2],
            [3, 2, 1, 1]],
           [[1, 1, 0, 1],
            [0, 1, 2, 2]]]
Use this:
import numpy as np

nd_source = np.array(source)

source_rows = len(source)     # == 3, in the above example
source_cols = len(source[0])  # == 4, in the above example

row_indices = np.arange(source_rows).reshape(-1, 1, 1)
result = nd_source[row_indices, indices]
Result:
print(result)

[[[0.3 0.1 0.  0.1]
  [0.3 0.  0.  0.3]]

 [[1.  1.1 1.  1.2]
  [1.3 1.2 1.1 1.1]]

 [[2.1 2.1 2.  2.1]
  [2.  2.1 2.2 2.2]]]
Explanation:
To use Integer Advanced Indexing, the key rules are:
We must supply index arrays consisting of integer indices.
We must supply as many of these index arrays, as there are dimensions in the source array.
The shape of these index arrays must be the same, or, at least all of them must be broadcastable to a single final shape.
How the Integer Advanced Indexing works is:
Given that source array has n dimensions, and that we have therefore supplied n integer index arrays:
All of these index arrays, if not in the same uniform shape, will be broadcasted to be in a single uniform shape.
To access any element in the source array, we obviously need an n-tuple of indices. Therefore to generate the result array from the source array, we need several n-tuples, one n-tuple for each element-position of the final result array. For each element-position of the result array, the n-tuple of indices will be constructed from the corresponding element-positions in the broadcasted index arrays. (Remember the result array has exactly the same shape as the broadcasted index arrays, as already mentioned above).
Thus, by traversing the index arrays in tandem, we get all the n-tuples we need to generate the result array, in the same shape as the broadcasted index arrays.
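A tiny self-contained illustration of these rules (the source array and the two index arrays here are made up just for this illustration):

import numpy as np

src = np.arange(12).reshape(3, 4)  # small 2-D source array
rows = np.array([[0], [2]])        # shape (2, 1)
cols = np.array([[1, 3]])          # shape (1, 2)

# rows and cols broadcast to a common (2, 2) shape; each output element is
# src[rows_b[i, j], cols_b[i, j]], where rows_b/cols_b are the broadcasted arrays
print(src[rows, cols])
# [[ 1  3]
#  [ 9 11]]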
Applying this explanation to the above example:
Our source array is nd_source = np.array(source), which is 2d.
Our final result shape is (3,2,4).
We therefore need to supply 2 index arrays, and these index arrays must either be in the final result shape of (3,2,4), or broadcastable to the (3,2,4) shape.
Our first index array is row_indices = np.arange(source_rows).reshape(-1,1,1). (source_rows is the number of rows in the source, which is 3 in this example) This index array has shape (3,1,1), and actually looks like [[[0]],[[1]],[[2]]]. This is broadcastable to the final result shape of (3,2,4), and the broadcasted array looks like [[[0,0,0,0],[0,0,0,0]],[[1,1,1,1],[1,1,1,1]],[[2,2,2,2],[2,2,2,2]]].
Our second index array is indices. Though this is not an ndarray and is only a list of lists, numpy is flexible enough to automatically convert it into the corresponding ndarray when we pass it as our second index array. Note that this array is already in the final desired result shape of (3,2,4), even without any broadcasting.
Traversing these two index arrays in tandem (one a broadcasted array and the other as is), numpy generates all the 2-tuples needed to access our source 2d array nd_source, and generate the final result in the shape (3,2,4).
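For reference, np.take_along_axis can express the same gather; a minimal sketch (equivalent to the advanced indexing above, assuming the same source and indices):

import numpy as np

result = np.take_along_axis(np.array(source)[:, None, :],  # shape (3, 1, 4)
                            np.array(indices),             # shape (3, 2, 4)
                            axis=2)                        # result shape (3, 2, 4)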
What is a pythonic way to calculate the mean of a list, but only considering the positive values?
So if I have the values
[1,2,3,4,5,-1,4,2,3] and I want to calculate the rolling mean of three values, it is basically calculating the rolling average of [1,2,3,4,5,nan,4,2,3].
And that becomes
[nan,2,3,4,4.5,4.5,3,3,nan] where the first and the last nan are due to the missing elements.
The 2 = mean ([1,2,3])
the 3 = mean ([2,3,4])
but the 4.5 = mean ([4,5,nan])=mean ([4,5])
and so on. So it is important that negative values are excluded, and that the division is by the number of positive values only.
I tried:
def RollingPositiveAverage(listA, nElements):
    listB = [element for element in listA if element > 0]
    return pd.rolling_mean(listB, 3)
but the list B has elements missing. I tried to substitute those elements with nan but then the mean becomes nan itself.
Is there any nice and elegant way to solve this?
Thanks
Since you are using Pandas:
import numpy as np
import pandas as pd
def RollingPositiveAverage(listA, window=3):
    s = pd.Series(listA)
    s[s < 0] = np.nan
    result = s.rolling(window, center=True, min_periods=1).mean()
    result.iloc[:window // 2] = np.nan
    result.iloc[-(window // 2):] = np.nan
    return result  # or result.values or list(result) if you prefer an array or a list

print(RollingPositiveAverage([1, 2, 3, 4, 5, -1, 4, 2, 3]))
Output:
0    NaN
1    2.0
2    3.0
3    4.0
4    4.5
5    4.5
6    3.0
7    3.0
8    NaN
dtype: float64
Plain Python version:
import math
def RollingPositiveAverage(listA, window=3):
    result = [math.nan] * (window // 2)
    for win in zip(*(listA[i:] for i in range(window))):
        win = tuple(v for v in win if v >= 0)
        result.append(float(sum(win)) / max(len(win), 1))  # max() guards against an empty window
    result.extend([math.nan] * (window // 2))
    return result
print(RollingPositiveAverage([1, 2, 3, 4, 5, -1, 4, 2, 3]))
Output:
[nan, 2.0, 3.0, 4.0, 4.5, 4.5, 3.0, 3.0, nan]
Get rolling summations of the values clipped at zero and, from rolling summations of the mask of positive elements, the count of valid elements participating in each window; then simply divide the two to get the average values. For the rolling summations, we could use np.convolve.
Hence, the implementation -
import numpy as np

def rolling_mean(a, W=3):
    a = np.asarray(a)  # convert to array
    k = np.ones(W)     # kernel for convolution
    # Mask of positive numbers and get clipped array
    m = a >= 0
    a_clipped = np.where(m, a, 0)
    # Get rolling windowed summations and divide by the rolling valid counts
    return np.convolve(a_clipped, k, 'same') / np.convolve(m, k, 'same')
Extending to the specific case of NaN-padding at the boundaries -
def rolling_mean_pad(a, W=3):
    hW = (W - 1) // 2  # half window size for padding
    a = np.asarray(a)  # convert to array
    k = np.ones(W)     # kernel for convolution
    # Mask of positive numbers and get clipped array
    m = a >= 0
    a_clipped = np.where(m, a, 0)
    # Get rolling windowed summations and divide by the rolling valid counts
    out = np.convolve(a_clipped, k, 'same') / np.convolve(m, k, 'same')
    out[:hW] = np.nan
    out[-hW:] = np.nan
    return out
Sample run -
In [54]: a
Out[54]: array([ 1, 2, 3, 4, 5, -1, 4, 2, 3])
In [55]: rolling_mean_pad(a, W=3)
Out[55]: array([ nan, 2. , 3. , 4. , 4.5, 4.5, 3. , 3. , nan])
I start with an array a containing N unique values (product(a.shape) >= N).
I need to find the array b that, at the position of each element of a, holds that element's index 0 .. N-1 in the (sorted) list of unique values of a.
As an example
import numpy as np
np.random.seed(42)
a = np.random.choice([0.1,1.3,7,9.4], size=(4,3))
print(a)
prints a as
[[ 7.   9.4  0.1]
 [ 7.   7.   9.4]
 [ 0.1  0.1  7. ]
 [ 1.3  7.   7. ]]
The unique values are [0.1, 1.3, 7.0, 9.4], so the required outcome b would be
[[2 3 0]
 [2 2 3]
 [0 0 2]
 [1 2 2]]
(e.g. the value at a[0,0] is 7.; 7. has the index 2; thus b[0,0] == 2.)
Since numpy does not have an index function,
I could do this using a loop. Either looping over the input array, like this:
u = np.unique(a).tolist()
af = a.flatten()
b = np.empty(len(af), dtype=int)
for i in range(len(af)):
    b[i] = u.index(af[i])
b = b.reshape(a.shape)
print(b)
or looping over the unique values as follows:
u = np.unique(a)
b = np.empty(a.shape, dtype=int)
for i in range(len(u)):
    b[np.where(a == u[i])] = i
print(b)
I suppose that the second way, looping over the unique values, is already more efficient than the first in cases where not all values in a are distinct; but it still involves a loop and is rather inefficient compared to vectorized operations.
So my question is: what is the most efficient way of obtaining the array b filled with the indices of the unique values of a?
You could use np.unique with its optional argument return_inverse -
np.unique(a, return_inverse=1)[1].reshape(a.shape)
Sample run -
In [308]: a
Out[308]:
array([[ 7. ,  9.4,  0.1],
       [ 7. ,  7. ,  9.4],
       [ 0.1,  0.1,  7. ],
       [ 1.3,  7. ,  7. ]])
In [309]: np.unique(a, return_inverse=1)[1].reshape(a.shape)
Out[309]:
array([[2, 3, 0],
       [2, 2, 3],
       [0, 0, 2],
       [1, 2, 2]])
Going through the source code of np.unique, it looks pretty efficient to me, but pruning out the unnecessary parts, we end up with another solution, like so -
def unique_return_inverse(a):
    ar = a.flatten()
    perm = ar.argsort()
    aux = ar[perm]
    flag = np.concatenate(([True], aux[1:] != aux[:-1]))
    iflag = np.cumsum(flag) - 1
    inv_idx = np.empty(ar.shape, dtype=np.intp)
    inv_idx[perm] = iflag
    return inv_idx
Timings -
In [444]: a= np.random.randint(0,1000,(1000,400))
In [445]: np.allclose( np.unique(a, return_inverse=1)[1],unique_return_inverse(a))
Out[445]: True
In [446]: %timeit np.unique(a, return_inverse=1)[1]
10 loops, best of 3: 30.4 ms per loop
In [447]: %timeit unique_return_inverse(a)
10 loops, best of 3: 29.5 ms per loop
Not a great deal of improvement there over the built-in.
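Another common vectorized route (a small sketch, not necessarily faster than return_inverse) is to binary-search each element of a in its sorted unique values, shown here on the original 4x3 example:

u = np.unique(a)           # sorted unique values
b = np.searchsorted(u, a)  # index of each element of a within u, same shape as a
print(b)
# [[2 3 0]
#  [2 2 3]
#  [0 0 2]
#  [1 2 2]]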
I have a huge file whose first row is a string and whose other rows represent integers. The number of columns is variable and depends on the row.
I have one global list where I save my intermediate results. arr_of_arr is a list of lists of floats. The length of arr_of_arr is about 5000. Each element (again a list) has a length from 100,000 to 10,000,000. The maximum length varies, so I cannot extend each element in advance when I create arr_of_arr.
After I have processed the whole file, I artificially pad the elements, compute the mean over the elements of the global list, and plot it. max_size_arr is the length of the longest element (I compute it while iterating over the lines of the file):
arr = [x+[0]*(max_size_arr - len(x)) for x in arr_of_arr]
I need to calculate the means across the arrays element-wise.
For example,
[[1,2,3],[4,5,6],[0,2,10]] would result in [5/3,9/3,19/3] (mean of the first elements across arrays, mean od second elements across arrays etc.)
arr = np.mean(np.array(arr),axis=0)
However, this results in huge memory consumption (around 100 GB according to the cluster information). What would be a good way to structure the data to reduce memory consumption? Would numpy arrays be lighter than normal Python lists?
I think that the huge memory consumption is because you want to have the whole array in memory at once.
Why don't you use slices combined with numpy arrays? Doing that, you can simulate batch processing of your data. You can give a function the batch size (1000 or 10000 arrays), calculate the means, and write the results into a dict or a file indicating each slice and its respective mean; a rough sketch follows.
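A rough sketch of that idea (the function name and the zero-padding-style mean are assumptions chosen to mirror the original code):

import numpy as np

def padded_mean_in_batches(arr_of_arr, max_size_arr, batch_size=1000):
    # Accumulate column sums over zero-padded rows batch by batch, so only one
    # batch is ever materialized as a dense 2-D array at a time.
    sums = np.zeros(max_size_arr)
    n_rows = len(arr_of_arr)
    for start in range(0, n_rows, batch_size):
        batch = arr_of_arr[start:start + batch_size]
        padded = np.zeros((len(batch), max_size_arr))
        for r, row in enumerate(batch):
            padded[r, :len(row)] = row
        sums += padded.sum(axis=0)
    return sums / n_rows  # same result as zero-padding everything and averaging

# e.g. padded_mean_in_batches([[1, 2, 3], [4, 5, 6], [0, 2, 10]], 3, batch_size=2)
# -> array([1.66666667, 3.        , 6.33333333])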
Have you tried using the Numba package? It reduces computation time and memory usage with standard numpy arrays.
http://numba.pydata.org
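A minimal sketch of how that might look here (assuming Numba is installed; streaming one row at a time into running column sums keeps memory flat, and the jitted loop keeps it fast):

import numpy as np
from numba import njit

@njit
def add_row(sums, row):
    # accumulate one variable-length row into the running column sums in place
    for j in range(row.shape[0]):
        sums[j] += row[j]

# usage sketch: sums = np.zeros(max_size_arr); call
# add_row(sums, np.asarray(row, dtype=np.float64)) for every row,
# then divide sums by the number of rows for the zero-padded mean.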
If the lines vary widely in the number of values, I'd stick with a list of lists, as long as that is practical. numpy arrays are best when the 'row' lengths are all the same.
To illustrate with a small example:
In [453]: list_of_lists=[[1,2,3],[4,5,6,7,8,9],[0],[1,2]]
In [454]: list_of_lists
Out[454]: [[1, 2, 3], [4, 5, 6, 7, 8, 9], [0], [1, 2]]
In [455]: [len(x) for x in list_of_lists]
Out[455]: [3, 6, 1, 2]
In [456]: [sum(x) for x in list_of_lists]
Out[456]: [6, 39, 0, 3]
In [458]: [sum(x)/float(len(x)) for x in list_of_lists]
Out[458]: [2.0, 6.5, 0.0, 1.5]
With your array approach to taking the mean, I get different numbers - because of all the padding 0s. Is that intentional?
In [460]: max_len=6
In [461]: arr=[x+[0]*(max_len-len(x)) for x in list_of_lists]
In [462]: arr
Out[462]:
[[1, 2, 3, 0, 0, 0],
 [4, 5, 6, 7, 8, 9],
 [0, 0, 0, 0, 0, 0],
 [1, 2, 0, 0, 0, 0]]
mean along columns?
In [463]: np.mean(np.array(arr),axis=0)
Out[463]: array([ 1.5 , 2.25, 2.25, 1.75, 2. , 2.25])
mean along rows
In [476]: np.mean(np.array(arr),axis=1)
Out[476]: array([ 1. , 6.5, 0. , 0.5])
list mean with max length:
In [477]: [sum(x)/float(max_len) for x in list_of_lists]
Out[477]: [1.0, 6.5, 0.0, 0.5]
let's say I have:
c = array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], float)
then I take a fast fourier transform:
r = rfft(c)
which produces the following complex array:
r = [ 21.+0.j , -3.+5.19615242j , -3.+1.73205081j , -3.+0.j ]
the number of elements in the new array is N/2 + 1.
I'm trying to tell python to change the values of SPECIFIC elements in the new array. I want to tell python to keep the FIRST 50% of the elements and to set the others equal to zero, so instead the result would look like
r = [ 21.+0.j , -3.+5.19615242j , 0 , 0 ]
how would I go about this?
rfft returns a numpy array, which makes manipulating the array easy.
from numpy.fft import rfft

c = [1, 2, 3, 4, 5, 6]
r = rfft(c)
r[r.shape[0] // 2:] = 0  # integer division, so the slice index is an int
r
>> array([21.+0.j, -3.+5.19615242j, 0.+0.j, 0.+0.j])
You can use slice notation and pad the result back to the correct length:
half = len(r) // 2
r = list(r[:half]) + [0] * (len(r) - half)
The * syntax just repeats the zero element the specified number of times.
You can split the list in half, then append a list of zeros of the same length as the remaining part:
>>> i
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
>>> i[:len(i)//2] + [0]*len(i[len(i)//2:])
[1, 2, 3, 4, 5, 0, 0, 0, 0, 0]