Apply numpy index to matrix - python

I have spent the last hour trying to figure this out.
Suppose we have:
import numpy as np
a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)
print(amin_index)
> [ 0 12 5 18 1] # or something similar
This does not work:
a[amin_index]
So, in essence, I need to find the minima along a certain axis for the array np.abs(a), but then extract the values from the array a at these positions. How can I apply an index to just one axis?
Probably very simple, but I can't get it figured out. Also, I can't use any loops since I have to do this for arrays with several million entries.
Thanks 😊

One way is to pass in the array of row indexes (e.g. [0,1,2,3,4]) and the list of column indexes for the minimum in each corresponding row (your list amin_index).
This returns an array containing the value at [i, amin_index[i]] for each row i:
>>> a[np.arange(a.shape[0]), amin_index]
array([-0.0069325 , 0.04268358, -0.00128002, -0.01185333, -0.00389487])
This is advanced (integer-array) indexing rather than basic indexing, so the returned array is a new array in memory (a copy of the selected values), not a view of a.
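A quick way to verify the copy behaviour (a small sketch added here, not part of the original answer; np.shares_memory requires NumPy >= 1.11):
import numpy as np

a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)

picked = a[np.arange(a.shape[0]), amin_index]

# advanced (integer-array) indexing produces a copy, so the result
# does not share memory with the original array
print(np.shares_memory(a, picked))  # False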

This is because argmin (with axis=1) returns the column index of the minimum for each row, so you need to access each row at that particular column:
a[range(a.shape[0]), amin_index]

If all you need are the minimum absolute values (rather than the signed entries of a), why not simply do np.amin(np.abs(a), axis=1)? It's much simpler when you don't need the intermediate amin_index array from argmin.
NumPy's reference page is an excellent resource; see "Indexing".
Edit: timing is always useful:
In [3]: a=np.random.rand(4000, 4000)-.5
In [4]: %timeit np.amin(np.abs(a), axis=1)
10 loops, best of 3: 128 ms per loop
In [5]: %timeit a[np.arange(a.shape[0]), np.argmin(np.abs(a), axis=1)]
10 loops, best of 3: 135 ms per loop
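For completeness (an addition, not from the original answers): newer NumPy (1.15+) also has np.take_along_axis, which performs exactly this pick-one-index-per-row extraction:
import numpy as np

a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)

# take_along_axis expects the index array to have the same ndim as a,
# so add a length-1 column axis and flatten the result afterwards
signed_minima = np.take_along_axis(a, amin_index[:, None], axis=1).ravel()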

Related

Efficiently determining if large sorted numpy array has only unique values

I have a very large numpy array and I want to sort it and test if it is unique.
I'm aware of the function numpy.unique, but it sorts the array again to achieve this.
The reason I need the array sorted a priori is because the returned keys from the argsort function will be used to reorder another array.
I'm looking for a way to do both (argsort and unique test) without the need to sort the array again.
Example code:
import numpy as np
import numpy.random
# generating random arrays with 2 ^ 27 columns (it can grow even bigger!)
slices = np.random.random_integers(2 ** 32, size = 2 ** 27)
values = np.random.random_integers(2 ** 32, size = 2 ** 27)
# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()
# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]
# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
    print('it is unique!')
else:
    print('not unique!')
Both the arrays slices and values have 1 row and the same (huge) number of columns.
Thanks in advance.
You can check whether there are two or more equal values next to each other (non-unique values in a sorted array) by comparing their difference to 0:
numpy.any(numpy.diff(slices) == 0)
Be aware though that numpy will create two intermediate arrays: one with the difference values, one with boolean values.
Here's an approach that makes use of slicing: instead of actually computing the differences, we can just compare each element against the previous one, like so -
~((slices[1:] == slices[:-1]).any())
Runtime test -
In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))
# @Nils Werner's soln
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop
# @Marco's suggestion in comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop
# Proposed soln in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop
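Tying this back to the original use case, here is a rough sketch (with much smaller arrays than in the question) of sorting once with argsort, reordering both arrays, and testing uniqueness with the adjacent-element comparison instead of a second sort:
import numpy as np

slices = np.random.randint(0, 2**32, size=2**20, dtype=np.uint64)
values = np.random.randint(0, 2**32, size=2**20, dtype=np.uint64)

order = slices.argsort()      # the only sort we pay for
slices = slices[order]
values = values[order]

# unique iff no two adjacent elements of the sorted array are equal
if not (slices[1:] == slices[:-1]).any():
    print('it is unique!')
else:
    print('not unique!')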

Filter rows of a numpy array?

I am looking to apply a function to each row of a numpy array. If this function evaluates to true I will keep the row, otherwise I will discard it. For example, my function might be:
def f(row):
    if sum(row) > 10:
        return True
    else:
        return False
I was wondering if there was something similar to:
np.apply_over_axes()
which applies a function to each row of a numpy array and returns the result. I was hoping for something like:
np.filter_over_axes()
which would apply a function to each row of a numpy array and only return the rows for which the function returned true. Is there anything like this? Or should I just use a for loop?
Ideally, you would be able to implement a vectorized version of your function and use that to do boolean indexing. For the vast majority of problems this is the right solution. Numpy provides quite a few functions that can act over various axes as well as all the basic operations and comparisons, so most useful conditions should be vectorizable.
import numpy as np
x = np.random.randn(20, 3)
x_new = x[np.sum(x, axis=1) > .5]
If you are absolutely sure that you can't do the above, I would suggest using a list comprehension (or np.apply_along_axis) to create an array of bools to index with.
def myfunc(row):
    return sum(row) > .5
bool_arr = np.array([myfunc(row) for row in x])
x_new = x[bool_arr]
This will get the job done in a relatively clean way, but will be significantly slower than a vectorized version. An example:
x = np.random.randn(5000, 200)
%timeit x[np.sum(x, axis=1) > .5]
# 100 loops, best of 3: 5.71 ms per loop
%timeit x[np.array([myfunc(row) for row in x])]
# 1 loops, best of 3: 217 ms per loop
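For completeness, a sketch of the np.apply_along_axis route mentioned above (an addition to the original answer); it produces the same boolean mask as the list comprehension and is typically no faster:
import numpy as np

x = np.random.randn(20, 3)

def myfunc(row):
    return row.sum() > .5

# apply myfunc to each 1-D row slice (axis=1), yielding one bool per row
bool_arr = np.apply_along_axis(myfunc, 1, x)
x_new = x[bool_arr]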

Approach to speed up pandas multilevel index selection?

I have a data frame, and would like to work on a small partition each time for particular tuples of values of 'a', 'b','c'.
df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
so I chose to use pandas multiindex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)]  # 511 us
%timeit df[(df['a'].values == 2) &
           (df['b'].values == 1) &
           (df['c'].values == 7)]  # 247 us
I suspect there are some overheads somewhere. My program has ~1k tuples, so it takes 511 * 1000 = 0.5s for one run. How can I improve further?
update:
Hmm, I forgot to mention that the number of tuples is smaller than the total Cartesian product of distinct values of 'a', 'b', 'c' in df. Wouldn't groupby do an excess amount of work on index combinations that don't exist in my tuples?
It's not clear what 'work on' means, but I would do this.
This can be almost any function:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
Certain operations are cythonized, so they are very fast:
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing selection on a multi-index one key at a time is not efficient.
If you are operating on a very small subset of the total groups, then you might want to directly index into the multi-index; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter which you can use to pre-filter the groups based on some criteria.
As noted above, the Cartesian product of the group indexers is irrelevant. Only the actual groups will be iterated over by groupby (think of a MultiIndex as a sparse representation of the total possible space).
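As a rough illustration of the filter idea (the threshold below is made up for the example, not from the original answer):
# keep only the (a, b, c) groups whose total value exceeds some threshold,
# then aggregate just those groups
small = df.groupby(['a', 'b', 'c']).filter(lambda g: g['value'].sum() > 500)
result = small.groupby(['a', 'b', 'c']).sum()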
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.
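For the ~1k lookups mentioned in the question, that approach might be used roughly like this (tuples here is a hypothetical list of (a, b, c) keys present in the index):
# get_loc returns the positional location(s) of a full key on the sorted
# MultiIndex; applying them to the plain ndarray avoids per-lookup .ix overhead
results = {t: value[dfi.index.get_loc(t)] for t in tuples}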

Is numpy.transpose reordering data in memory?

In order to speed up functions like np.std, np.sum, etc. along an axis of a huge n-dimensional numpy array, it is recommended to apply them along the last axis.
So I use np.transpose to rotate the axis I want to operate on into the last position. Is it really reshuffling the data in memory, or just changing the way the axes are addressed?
When I tried to measure the time using %timeit, the transpose took microseconds, much less than the time required to copy the (112x1024x1024) array I was working with.
If it is not actually reordering the data in memory and only changing the addressing, will it still speed up np.sum or np.std when they are applied along the newly rotated last axis?
When I tried to measure it, it did seem to speed up. But I don't understand how.
Update
It doesn't really seem to speed up with transpose. The fastest axis is the last one when the array is C-ordered, and the first one when it is Fortran-ordered, so there is no point in transposing before applying np.sum or np.std.
For my specific code, I solved the issue by giving order='FORTRAN' during array creation, which made the first axis the fastest.
Thanks for all the answers.
Transpose just changes the strides; it doesn't touch the actual data. I think the reason why sum etc. along the final axis is recommended (I'd like to see the source for that, btw.) is that when an array is C-ordered, walking along the final axis preserves locality of reference. That won't be the case after you transpose, since the transposed array will be Fortran-ordered.
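A small sketch illustrating that point (an addition, not part of the original answer):
import numpy as np

a = np.random.randn(4, 5, 6)
t = a.transpose(2, 0, 1)       # move the last axis to the front

print(a.strides, t.strides)    # same byte steps, just reordered
print(np.shares_memory(a, t))  # True: no data was copied
print(a.flags['C_CONTIGUOUS'], t.flags['C_CONTIGUOUS'])  # True False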
To elaborate on larsman's answer, here are some timings:
# normal C (row-major) order array
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(a, axis=1)
1000 loops, best of 3: 272 us per loop
# transposing and summing along the first axis makes no real difference
# to performance
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(a.T, axis=0)
1000 loops, best of 3: 269 us per loop
# however, converting to Fortran (column-major) order does improve speed...
>>> %%timeit a = np.asfortranarray(np.random.randn(500,400))
>>> np.sum(a, axis=1)
10000 loops, best of 3: 114 us per loop
# ... but only if you don't count the conversion in the timed operations
>>> %%timeit a = np.random.randn(500, 400)
>>> np.sum(np.asfortranarray(a), axis=1)
1000 loops, best of 3: 599 us per loop
In summary, it might make sense to convert your arrays to Fortran order if you're going to apply a lot of operations over the columns, but the conversion itself is costly and almost certainly not worth it for a single operation.
You can force a physical reorg with np.ascontiguousarray. For example...
a = np.ascontiguousarray(a.transpose())

Two Columns of a pandas dataframe - Concat in Python

I'm new to pandas and Python.
I have a dataframe (df) with two columns of cusips.
I want to turn those columns into a list of the unique entries of the two columns.
My first attempt was to do the following:
cusips = pd.concat(df['long'], df['short']).
This returned the error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
I have read a few postings, but I am still having trouble with why this comes up. What am I missing here?
Also, what's the most efficient way to select the unique entries in a column or a dataframe? Can I call it in one function? Does the function differ if I want to create a list or a new, one-column dataframe?
Thank you.
Adding to Hayden's answer, you could also use the built-in set() for the same result. The performance is slightly better, if that's a consideration:
In [28]: %timeit set(np.append(df[0],df[1]))
100000 loops, best of 3: 19.6 us per loop
In [29]: %timeit np.append(df[0].unique(), df[1].unique())
10000 loops, best of 3: 55 us per loop
To obtain the unique values in a column you can use the unique Series method, which will return a numpy array of the unique values (and it is fast!).
df.long.unique()
# returns numpy array of unique values
You could then use numpy.append:
np.append(df.long.unique(), df.short.unique())
Note: This just appends the two unique results together and so itself is not unique!
Here's a (trivial) example:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2], [1, 4]], columns=['long','short'])
In [4]: df
Out[4]:
   long  short
0     1      2
1     1      4
In [5]: df.long.unique()
Out[5]: array([1])
In [6]: df.short.unique()
Out[6]: array([2, 4])
And then appending the resulting two arrays:
In [7]: np.append(df.long.unique(), df.short.unique())
Out[7]: array([1, 2, 4])
Using @Zalazny7's set is significantly faster (since it runs over the array only once) and, somewhat upsettingly, it's even faster than np.unique (which sorts the resulting array!).
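For reference, a sketch of getting the unique entries of both columns in one pass per approach (column names as in the example above):
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 4]], columns=['long', 'short'])

# hash-based and order-preserving (same idea as set(), but returns an array)
uniq = pd.unique(np.append(df['long'].values, df['short'].values))

# sorted unique values
uniq_sorted = np.unique(np.append(df['long'].values, df['short'].values))

print(uniq)         # [1 2 4]
print(uniq_sorted)  # [1 2 4]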
