I have multiple numpy masked arrays arr0, arr1, ..., arrn.
I put them in a list arrs = [arr0, ..., arrn].
I want to flatten these arrays and put a mask on them. I did something like:
for arr in arrs:
    arr = np.ravel(arr)
    arr[mask] = ma.masked
I do not understand when Python makes copies and when it just keeps a reference. This for loop does not flatten arr0, ..., arrn (even though ravel outputs a view, not a copy); it just flattens the variable arr, although it does change their masks!
As I understand it, arr is a view of the elements in the list arrs, so when I change elements of arr it changes the elements of the corresponding array in the list. But when I assign a new value to arr it does not change the original array, even if the assignment is supposed to be a view of that array. Why?
Edit with an example:
Arrays to flatten:
arr0 = masked_array(data=[[1,2],[3,4]], mask=False)
arr1 = masked_array(data=[[5,6],[7,8]], mask=False)
mask = [[False,True],[True,False]]
Expected output:
arr0 = masked_array(data=[[1,--],[--,4]], mask=[[False,True],[True,False]])
arr1 = masked_array(data=[[5,--],[--,8]], mask=[[False,True],[True,False]])
I'd like to do this in a loop because I have a lot of arrays (15 more or less), and I want to use the arrays' names in the code. Is there no other way than to do:
arr0 = np.ravel(arr0)
...
arrn = np.ravel(arrn)
In [1032]: arr0 = np.ma.masked_array(data=[[1,2],[3,4]], mask=False)
In [1033]: arr1 = np.ma.masked_array(data=[[5,6],[7,8]], mask=False)
This is the basic way of iterating over a list, applying some action to each element, and collecting the results in another list:
In [1037]: ll=[arr0,arr1]
In [1038]: ll1=[]
In [1047]: for a in ll:
      ......:     a1 = a.flatten()   # flatten makes a copy
      ......:     a1.mask = mask
      ......:     ll1.append(a1)
In [1049]: ll1
Out[1049]:
[masked_array(data = [1 -- -- 4], mask = [False True True False],
fill_value = 999999),
masked_array(data = [5 -- -- 8], mask = [False True True False],
fill_value = 999999)]
Often that can be written as a list comprehension:
[foo(a) for a in alist]
but the action here isn't a neat function.
If I use ravel instead, a1 is a view (not a copy), and applying mask to it changes the mask of a as well - the result is changed masks for arr0, but no change in shape:
In [1051]: for a in ll:
......: a1=a.ravel()
......: a1.mask=mask
(The same happens with your a = a.ravel(). The assignment binds the name a to a new object, breaking the link to the iteration value. That's true for any Python iteration. It's best to use a new variable name inside the loop, like a1, so you don't confuse yourself.)
Essentially the same as
In [1054]: for a in ll:
......: a.mask=mask
I can change the shape in the same in-place way
In [1055]: for a in ll:
......: a.shape=[-1] # short hand for inplace ravel
......: a.mask=mask
In [1056]: arr0
Out[1056]:
masked_array(data = [1 -- -- 4],
mask = [False True True False],
fill_value = 999999)
Here's a functional way of creating new arrays with the new shape and mask, used in a list comprehension (with no change to arr0):
[np.ma.masked_array(a,mask=mask).ravel() for a in [arr0,arr1]]
Understanding these alternatives does require understanding how Python assigns iterative variables, and how numpy makes copies and views.
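The view-vs-copy and name-rebinding points above can be checked directly. A minimal sketch using plain ndarrays (`np.shares_memory` reports whether two arrays overlap in memory):

```python
import numpy as np

a = np.arange(4).reshape(2, 2)

r = a.ravel()    # a view when possible: shares a's data buffer
f = a.flatten()  # always a copy: owns its own data

print(np.shares_memory(a, r))  # True
print(np.shares_memory(a, f))  # False

# Rebinding the loop variable never touches the original array:
for x in [a]:
    x = x.ravel()  # x now names a new (view) object; a itself keeps its shape
print(a.shape)     # (2, 2)
```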
Related
I'm trying to check whether one numpy array is contained within another, while debugging something.
#Pattern
arr1 = np.array([1.62434536, -0.61175641, -0.52817175])
#type : np.ndarray
#dtype : 'float64'
#shape : (3,)
Then I have a list of tuples, where the first element in each tuple is an n by m ndarray.
Let's say this object is called 'my_nest':
arr2 = my_nest[0][0][0][0:3]
arr2
#array([ 1.62434536, -0.61175641, -0.52817175])
#type : np.ndarray
#dtype : 'float64'
#shape : (3,)
But then using in1d returns an unintuitive result:
np.in1d(arr1,arr2)
#array([False, False, False], dtype=bool)
I know slicing an ndarray creates a view of the object as it is in memory, but I even tried wrapping np.copy around it to create a new object in memory and then compare and I still get False.
Anyone know what's going on here?
As mentioned in the comments this is an effect of floating point precision. You can reimplement in1d according to its source for small arrays using isclose instead of ==.
import numpy as np
arr1 = np.array([1.62434536, -0.61175641, -0.52817175])
arr2 = np.array([1.62434536, -0.61175641, -0.52817175+1e-12])
print(arr1)
print(arr2)
print('isin: ', np.in1d(arr1,arr2))
mask = np.zeros(len(arr1), dtype=bool)
for a in arr2:
    mask |= np.isclose(arr1, a)
print('isclose:', mask)
Output:
[ 1.62434536 -0.61175641 -0.52817175]
[ 1.62434536 -0.61175641 -0.52817175]
isin: [ True True False]
isclose: [ True True True]
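The same tolerant membership test can also be written without the Python loop, by broadcasting isclose over all pairs (a sketch of the same idea):

```python
import numpy as np

arr1 = np.array([1.62434536, -0.61175641, -0.52817175])
arr2 = np.array([1.62434536, -0.61175641, -0.52817175 + 1e-12])

# Compare every element of arr1 against every element of arr2
# (shape (3, 1) against (3,) broadcasts to (3, 3)),
# then reduce over the arr2 axis.
mask = np.isclose(arr1[:, None], arr2).any(axis=1)
print(mask)  # [ True  True  True]
```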
Consider some array arr and advanced indexing mask mask:
import numpy as np
arr = np.arange(4).reshape(2, 2)
mask = arr < 2
Using advanced indexing creates a new copy of an array. Accordingly, one cannot "chain" a mask with an additional mask or even with a basic slicing operation to replace elements of an array:
submask = [False, True]
arr[mask][submask] = -1 # chaining 2 masks
arr[mask][:] = -1 # chaining a mask with a basic slicing operation
print(arr)
[[0 1]
[2 3]]
I have two related questions:
1/ What is the best way to replace elements of an array using chained masks?
2/ If advanced indexing returns a copy of an array, why does the following work?
arr[mask] = -1
print(arr)
[[-1 -1]
[ 2 3]]
The short answer:
You have to figure out a way of combining the masks. Since masks can "chain" in different ways, I don't think there's a simple all-purpose substitute.
Indexing can either be a __getitem__ call or a __setitem__ call. Your last case is a set.
With chained indexing, a[mask1][mask2] = value gets translated into
a.__getitem__(mask1).__setitem__(mask2, value)
Whether a gets modified or not depends on what the first getitem produces (a view vs copy).
In [11]: arr = np.arange(4).reshape(2,2)
In [12]: mask = arr<2
In [13]: mask
Out[13]:
array([[ True, True],
[False, False]])
In [14]: arr[mask]
Out[14]: array([0, 1])
Indexing with a list or array may preserve the number of dimensions, but a boolean like this returns a 1d array, the items where the mask is true.
In your example, we could tweak the mask (details may vary with the intent of the 2nd mask):
In [15]: mask[:,0]=False
In [16]: mask
Out[16]:
array([[False, True],
[False, False]])
In [17]: arr[mask]
Out[17]: array([1])
In [18]: arr[mask] += 10
In [19]: arr
Out[19]:
array([[ 0, 11],
[ 2, 3]])
Or a logical combination of masks:
In [26]: (np.arange(4).reshape(2,2)<2)&[False,True]
Out[26]:
array([[False, True],
[False, False]])
Couple of good questions! My take:
I would do something like this:
x,y=np.where(mask)
arr[x[submask],y[submask]] = -1
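Filled out into a runnable sketch (using the arr, mask, and submask from the question):

```python
import numpy as np

arr = np.arange(4).reshape(2, 2)
mask = arr < 2
submask = [False, True]

# Translate the boolean mask into integer coordinates, then
# apply the submask to the coordinate arrays. Integer-array
# assignment writes into the original array.
x, y = np.where(mask)
arr[x[submask], y[submask]] = -1
print(arr)
# [[ 0 -1]
#  [ 2  3]]
```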
From the official documentation:
Most of the following examples show the use of indexing when referencing data in an array. The examples work just as well when assigning to an array. See the section at the end for specific examples and explanations on how assignments work.
which means arr[mask] = -1 assigns into the original array (a __setitem__), while arr[mask] on its own extracts data and creates a copy.
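A small sketch makes the distinction concrete: a boolean index directly in an assignment target is a single __setitem__ and modifies the array in place, while chained indexing assigns into a temporary copy that is immediately thrown away:

```python
import numpy as np

arr = np.arange(4).reshape(2, 2)
mask = arr < 2
submask = [False, True]

arr[mask][submask] = -1  # writes into a throwaway copy; arr is unchanged
print(arr.tolist())      # [[0, 1], [2, 3]]

arr[mask] = -1           # single __setitem__; arr is modified in place
print(arr.tolist())      # [[-1, -1], [2, 3]]
```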
I have already tried looking at other similar posts; however, their solutions do not solve this specific issue. Using the answer from this post, I found that I get the error "The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()" because I define my array differently from theirs: their array has size (n,) while my array has size (n,m). Moreover, the solution from this post does not work either because it applies to lists. The only method I could think of was this:
When there is at least one True in the array, the entire array is considered True:
filt = 4
tracktruth = list()
arraytruth = list()
arr1 = np.array([[1,2,4]])
for track in range(0, arr1.size):
    if filt == arr1[0, track]:
        tracktruth.append(True)
    else:
        tracktruth.append(False)
if any(tracktruth):
    arraytruth.append(True)
else:
    arraytruth.append(False)
When there is not a single True in the array, the entire array is considered False:
filt = 5
tracktruth = list()
arraytruth = list()
arr1 = np.array([[1,2,4]])
for track in range(0, arr1.size):
    if filt == arr1[0, track]:
        tracktruth.append(True)
    else:
        tracktruth.append(False)
if any(tracktruth):
    arraytruth.append(True)
else:
    arraytruth.append(False)
The reason the second if-else statement is there is because I wish to apply this mask to multiple arrays and ultimately create a master list that describes which arrays are true and which are false in their entirety. However, with a for loop and two if-else statements, I think this would be very slow with larger arrays. What would be a faster way to do this?
This seems overly complicated; you can use boolean indexing to achieve the result without loops:
arr1=np.array([[1,2,4]])
filt=4
arr1==filt
array([[False, False, True]])
np.sum(arr1==filt).astype(bool)
True
With more than one row, you can use the row or column index in the np.sum, or you can use the axis parameter to sum over rows or columns.
As pointed out in the comments, you can use np.any() instead of the np.sum(...).astype(bool) and it runs in roughly 2/3 the time on the test dataset:
np.any(a==filt, axis=1)
array([ True])
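For arrays with more than one row, the axis argument gives one boolean per row (a short sketch of the same idea):

```python
import numpy as np

filt = 4
a = np.array([[1, 2, 4],
              [5, 6, 7]])

# One True/False per row: does the row contain filt anywhere?
arraytruth = np.any(a == filt, axis=1)
print(arraytruth)  # [ True False]
```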
You can do this with list comprehension. I've done it here for one array but it's easily extended to multiple arrays with a for loop
filt = 4
arr1 = np.array([[1,2,4]])
print(any([part == filt for part in arr1[0]]))
You can get the arraytruth more generally, with a list comprehension, for arrays of size (n,m):
import numpy as np
filt = 4
a = np.array([[1, 2, 4]])
b = np.array([[1, 2, 3],
[5, 6, 7]])
array_lists = [a, b]
arraytruth = [True if a[a==filt].size>0 else False for a in array_lists]
print(arraytruth)
This will give you:
[True, False]
[edit] Use the numpy hstack method to flatten a list of arrays first. (Note: building a ragged np.array from unequal-length rows raises an error in recent NumPy, so keep the arrays in a plain list, and test equality against filt:)
filt = 4
arrs = [np.array([1, 2, 3, 4]), np.array([1, 2, 3])]
print(any(x == filt for x in np.hstack(arrs)))
Let's say that I have a function that requires a NumPy ndarray with 2 axes, e.g., a data matrix of rows and columns. If a "column" is sliced from such an array, the function should also work, so it should do some internal X[:, np.newaxis] for convenience. However, I don't want to create a new array object for this since that can be expensive in certain cases.
I am wondering if there is a good way to do it. For example, would the following code be safe (by that I mean, would the global arrays always be unchanged like Python lists)?
X1 = np.array([[1,2,3], [4,5,6], [7,8,9]])
X2 = np.array([1,4,7])
def some_func(X):
    if len(X.shape) == 1:
        X = X[:, np.newaxis]
    return X[:, 0].sum()
some_func(X2)
some_func(X1[:, 0])
some_func(X1)
I am asking because I heard that NumPy arrays are sometimes copied in certain cases, however, I can't find a good resource about this. Any ideas?
It shouldn't create a copy. For illustration:
>>> A = np.ones((50000000,))
>>> B = A[:,np.newaxis]
>>> B.flags
C_CONTIGUOUS : False
F_CONTIGUOUS : False
OWNDATA : False
WRITEABLE : True
ALIGNED : True
UPDATEIFCOPY : False
Note the OWNDATA : False - it's sharing data with A.
For a few more details, have a look at http://docs.scipy.org/doc/numpy/reference/arrays.indexing.html. The basic rule is that indexing doesn't create a copy unless you index with an array of indices (e.g. A[[1,2,4]]) or with a boolean array (e.g. A[np.array([True, False, True])]). Pretty much everything else returns a view with no copy.
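The rule can be checked directly with `np.shares_memory`; a minimal sketch (True means the two arrays overlap in memory, i.e. the result is a view):

```python
import numpy as np

A = np.arange(6)

print(np.shares_memory(A, A[1:4]))            # True  - basic slice: view
print(np.shares_memory(A, A[:, np.newaxis]))  # True  - newaxis: view
print(np.shares_memory(A, A[[1, 2, 4]]))      # False - fancy indexing: copy
print(np.shares_memory(A, A[A > 2]))          # False - boolean indexing: copy
```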
It shouldn't create a copy - these types of operations all return views: a new ndarray object with changed metadata (shape, strides), but sharing the same underlying data.
You can reshape the input array to force it to be an M x N array, where M is the number of elements along the first dimension. Then slice out the first column and sum its elements. The reshaping and slicing don't make copies here.
So, you could have this alternative approach without the IF statement -
def some_func2(X):
    return X.reshape(X.shape[0], -1)[:, 0].sum()
To check and confirm that it doesn't create a copy with reshaping and slicing, you can use np.may_share_memory like so -
In [515]: X1
Out[515]:
array([[1, 2, 3],
[4, 5, 6],
[7, 8, 9]])
In [516]: np.may_share_memory(X1,X1.reshape(X1.shape[0],-1)[:,0])
Out[516]: True
In [517]: X2
Out[517]: array([1, 4, 7])
In [518]: np.may_share_memory(X2,X2.reshape(X2.shape[0],-1)[:,0])
Out[518]: True
A True value with np.may_share_memory is a good indicator that they are views and not copies.
I want to calculate an indexed weight sum across a large (1,000,000 x 3,000) boolean numpy array. The large boolean array changes infrequently, but the weights come at query time, and I need answers very fast, without copying the whole large array, or expanding the small weight array to the size of the large array.

The result should be an array with 1,000,000 entries, each having the sum of the weights array entries corresponding to that row's True values.

I looked into using masked arrays, but they seem to require building a weights array the size of my large boolean array.

The code below gives the correct results, but I can't afford that copy during the multiply step. The multiply isn't even necessary, since the values array is boolean, but at least it handles the broadcasting properly.

I'm new to numpy, and loving it, but I'm about to give up on it for this particular problem. I've learned enough numpy to know to stay away from anything that loops in Python.

My next step will be to write this routine in C (which has the added benefit of letting me save memory by using bits instead of bytes, by the way).

Unless one of you numpy gurus can save me from Cython?
import numpy
from numpy import array, multiply, sum
# Construct an example values array, alternating True and False.
# This represents four records of three attributes each:
# array([[False, True, False],
# [ True, False, True],
# [False, True, False],
# [ True, False, True]], dtype=bool)
values = array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))
# Construct example weights, one for each attribute:
# array([1, 2, 3])
weights = array(range(1, 4))
# Create expensive NEW array with the weights for the True attributes.
# Broadcast the weights array into the values array.
# array([[0, 2, 0],
# [1, 0, 3],
# [0, 2, 0],
# [1, 0, 3]])
weighted = multiply(values, weights)
# Add up the weights:
# array([2, 4, 2, 4])
answers = sum(weighted, axis=1)
print(answers)
# Rejected masked_array solution is too expensive (and oddly inverts
# the results):
masked = numpy.ma.array([[1,2,3]] * 4, mask=values)
The dot product (or inner product) is what you want. It lets you take a matrix of size m×n and a vector of length n and multiply them together, yielding a vector of length m, where each entry is the weighted sum of a row of the matrix, with the entries of the vector as the weights.
Numpy implements this as array1.dot(array2) (or numpy.dot(array1, array2) in older versions). e.g.:
from numpy import array
values = array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))
weights = array(range(1, 4))
answers = values.dot(weights)
print(answers)
# output: [ 2 4 2 4 ]
(You should benchmark this though, using the timeit module.)
It seems likely that dbaupp's answer is the correct one. But just for the sake of diversity, here's another solution that saves memory. This will work even for operations that don't have a built-in numpy equivalent.
>>> values = numpy.array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))
>>> weights = numpy.array(range(1, 4))
>>> weights_stretched = numpy.lib.stride_tricks.as_strided(weights, (4, 3), (0, 8))
numpy.lib.stride_tricks.as_strided is a wonderful little function! It allows you to specify shape and strides values that allow a small array to mimic a much larger array. Observe -- there aren't really four rows here; it just looks that way:
>>> weights_stretched[0][0] = 4
>>> weights_stretched
array([[4, 2, 3],
[4, 2, 3],
[4, 2, 3],
[4, 2, 3]])
So instead of passing a huge array to MaskedArray, you can pass a smaller one. (But as you've already noticed, numpy masking works in the opposite way you might expect: True hides a value rather than revealing it, so you'll have to store your values inverted.) As you can see, MaskedArray doesn't copy any data; it just reflects whatever is in weights_stretched:
>>> masked = numpy.ma.MaskedArray(weights_stretched, numpy.logical_not(values))
>>> weights_stretched[0][0] = 1
>>> masked
masked_array(data =
[[-- 2 --]
[1 -- 3]
[-- 2 --]
[1 -- 3]],
mask =
[[ True False True]
[False True False]
[ True False True]
[False True False]],
fill_value=999999)
Now we can just pass it to sum:
>>> numpy.sum(masked, axis=1)
masked_array(data = [2 4 2 4],
mask = [False False False False],
fill_value=999999)
I benchmarked numpy.dot and the above against a 1,000,000 x 30 array. This is the result on a relatively modern MacBook Pro (numpy.dot is dot1; mine is dot2):
>>> %timeit dot1(values, weights)
1 loops, best of 3: 194 ms per loop
>>> %timeit dot2(values, weights)
1 loops, best of 3: 459 ms per loop
As you can see, the built-in numpy solution is faster. But stride_tricks is worth knowing about regardless, so I'm leaving this.
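On newer NumPy versions the same zero-copy stretching can be done with np.broadcast_to, which is safer than raw as_strided because the result is marked read-only. A sketch of the same trick:

```python
import numpy as np

values = np.array([(x % 2) for x in range(12)], dtype=bool).reshape((4, 3))
weights = np.array(range(1, 4))

# A (4, 3) read-only view of the length-3 weights array; no data is copied.
weights_stretched = np.broadcast_to(weights, (4, 3))
print(np.shares_memory(weights, weights_stretched))  # True

# As before, the mask must be inverted: True hides a value.
masked = np.ma.MaskedArray(weights_stretched, np.logical_not(values))
print(masked.sum(axis=1))  # [2 4 2 4]
```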
Would this work for you?
a = np.array([sum(row * weights) for row in values])
This uses sum() to immediately sum the row * weights values, so you don't need the memory to store all the intermediate values. Then the list comprehension collects all the values.
You said you want to avoid anything that "loops in Python". This at least does the looping with the C guts of Python, rather than an explicit Python loop, but it can't be as fast as a NumPy solution because that uses compiled C or Fortran.
I don't think you need numpy for something like that. And 1,000,000 by 3,000 is a huge array; most likely it will not fit in your RAM.
I would do it this way:
Let's say that you data is originally in a text file:
False,True,False
True,False,True
False,True,False
True,False,True
My code:
weight = range(1,4)
dicto = {'True':1, 'False':0}
with open('my_data.txt') as fin:
    a = sum(sum(dicto[ele]*w for ele, w in zip(line.strip().split(','), weight)) for line in fin)
Result:
>>> a
12
EDIT:
I think I slightly misread the question first time around, and summed up the everything together. Here is the solution that gives the exact solution that OP is after:
weight = range(1,4)
dicto = {'True':1, 'False':0}
with open('my_data.txt') as fin:
    a = [sum(dicto[ele]*w for ele, w in zip(line.strip().split(','), weight)) for line in fin]
Result:
>>> a
[2, 4, 2, 4]