I'm new to pandas and Python.
I have a dataframe (df) with two columns of cusips.
I want to turn those columns into a list of the unique entries of the two columns.
My first attempt was to do the following:
cusips = pd.concat(df['long'], df['short'])
This returned the error: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
I have read a few postings, but I am still having trouble understanding why this comes up. What am I missing here?
Also, what's the most efficient way to select the unique entries in a column or a dataframe? Can I do it with a single function call? Does the function differ if I want to create a list or a new, one-column dataframe?
Thank you.
Adding to Hayden's answer, you could also use the built-in set() for the same result. The performance is slightly better, if that's a consideration:
In [28]: %timeit set(np.append(df[0],df[1]))
100000 loops, best of 3: 19.6 us per loop
In [29]: %timeit np.append(df[0].unique(), df[1].unique())
10000 loops, best of 3: 55 us per loop
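If you actually want a plain Python list of the unique cusips (as the question asks), a minimal sketch along these lines should do it (the column names 'long' and 'short' are taken from the question; the sample data here is made up):
import pandas as pd
df = pd.DataFrame({'long': ['A', 'B', 'A'], 'short': ['B', 'C', 'C']})
# union of the two columns as a set, then back to a list
cusips = list(set(df['long']).union(df['short']))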
To obtain the unique values in a column you can use the unique Series method, which will return a numpy array of the unique values (and it is fast!).
df.long.unique()
# returns numpy array of unique values
You could then use numpy.append:
np.append(df.long.unique(), df.short.unique())
Note: this just appends the two unique results together, so the combined result is not necessarily unique itself!
Here's a (trivial) example:
import pandas as pd
import numpy as np
df = pd.DataFrame([[1, 2], [1, 4]], columns=['long','short'])
In [4]: df
Out[4]:
long short
0 1 2
1 1 4
In [5]: df.long.unique()
Out[5]: array([1])
In [6]: df.short.unique()
Out[6]: array([2, 4])
And then appending the resulting two arrays:
In [7]: np.append(df.long.unique(), df.short.unique())
Out[7]: array([1, 2, 4])
Using @Zalazny7's set is significantly faster (since it runs over the array only once), and somewhat upsettingly it's even faster than np.unique (which sorts the resulting array!).
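As for the error in the original question: pd.concat takes a sequence of objects as its first argument, so the two Series need to be wrapped in a list; passing the second Series positionally makes pandas treat it as the axis argument, which is what produces the ambiguous truth-value error here. A sketch of the concat-based route, reusing the df from the question:
# pd.concat expects a list (or other sequence) of objects
combined = pd.concat([df['long'], df['short']])
cusips_list = combined.unique().tolist()                # unique entries as a plain list
cusips_df = pd.DataFrame({'cusip': combined.unique()})  # or as a one-column DataFrame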
Related
I could perform filtering of numpy arrays via
a[np.where(a[:,0]==some_expression)]
or
a[a[:,0]==some_expression]
What are the (dis)advantages of each of these versions - especially with regard to performance?
Boolean indexing is transformed into integer indexing internally. This is indicated in the docs:
In general if an index includes a Boolean array, the result will be
identical to inserting obj.nonzero() into the same position and
using the integer array indexing mechanism described above.
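A quick illustration of that equivalence (a toy example of my own, not from the question):
import numpy as np
a = np.array([[1], [5], [3], [5]])
mask = a[:, 0] == 5
# the boolean mask and the integer indices from nonzero() select the same rows
print(np.array_equal(a[mask], a[mask.nonzero()]))  # True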
So the complexity of the two approaches is the same. But np.where is more efficient for large arrays:
np.random.seed(0)
a = np.random.randint(0, 10, (10**7, 1))
%timeit a[np.where(a[:, 0] == 5)] # 50.1 ms per loop
%timeit a[a[:, 0] == 5] # 62.6 ms per loop
Now np.where has other benefits: advanced integer indexing works well across multiple dimensions. For an example where Boolean indexing is unintuitive in this aspect, see NumPy indexing: broadcasting with Boolean arrays. Since np.where is more efficient than Boolean indexing, this is just an extra reason it should be preferred.
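For instance (a small sketch of my own), on a 2-D condition np.where returns one index array per axis, and those can be reused directly for further indexing:
import numpy as np
a = np.arange(12).reshape(3, 4)
rows, cols = np.where(a % 5 == 0)  # per-axis indices of every matching element
print(rows, cols)                  # [0 1 2] [0 1 2]
print(a[rows, cols])               # [ 0  5 10]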
To my surprise, the first one seems to perform slightly better:
a = np.random.random_integers(100, size=(1000,1))
import timeit
repeat = 3
numbers = 1000
def time(statement, _setup=None):
    print(min(
        timeit.Timer(statement, setup=_setup or setup).repeat(repeat, numbers)))
setup = """from __main__ import np, a"""
time('a[np.where(a[:,0]==99)]')
time('a[(a[:,0]==99)]')
prints (for instance):
0.017856399000000023
0.019185326999999974
Increasing the size of the array makes the numbers differ even more.
I have a numpy matrix that contains row vectors. I want to sort the matrix by its rows, like Python would sort a list of lists:
import numpy as np
def sortx(a):
    return np.array(sorted([list(i) for i in a]))
a = np.array([[1,4,0,2],[0,2,3,1]])
print(sortx(a))
Output:
[[0 2 3 1]
[1 4 0 2]]
Is there a numpy equivalent of my sortx() function so I don't have to convert the data twice?
You can try to use numpy's lexsort:
a = a[np.lexsort(a[:, ::-1].T)]
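Applied to the array from the question, this gives the same result as sortx():
a = np.array([[1, 4, 0, 2], [0, 2, 3, 1]])
a = a[np.lexsort(a[:, ::-1].T)]  # lexsort treats the last key as primary, hence the column reversal
print(a)
# [[0 2 3 1]
#  [1 4 0 2]]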
On my machine this was about four times faster than your sortx method when applied to a 4x4 matrix. On a matrix with 100 rows, the speed difference is even more significant.
arr=np.random.randint(0,100,(100,4))
%timeit np.lexsort(arr[:,::-1].T)
# 6.29 µs ± 27.1 ns
%timeit sortx(arr)
# 112 µs ± 1.2 µs
Edit:
Andyk suggested an improved version of the sortx() method.
def sortx_andyk(a):
    return np.array(sorted(a.tolist()))
Timing of this method:
%timeit sortx_andyk(arr)
# 43 µs ± 169 ns
You can use np.sort(arr, axis=0)
In your case
import numpy as np
a = np.array([[1,4,0,2],[0,2,3,1]])
np.sort(a, axis=0)
Edit
I misunderstood the question. Even though I do not have an exact answer for you, you might be able to use argsort. This returns the indices that would sort your array; however, it only does so along a single axis. You can use this to sort your array based on a specific column, e.g. the first. Then you would use it as such:
a = a[a.argsort(axis=0)[:, 0]]
where [:, 0] specifies the column by which to sort, i.e. [:, n] will sort on the n-th column.
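For the array from the question this already gives the fully row-sorted result, because the first column has no ties (unlike lexsort, this approach does not fall back to later columns to break ties):
a = np.array([[1, 4, 0, 2], [0, 2, 3, 1]])
print(a[a.argsort(axis=0)[:, 0]])  # order the rows by column 0 only
# [[0 2 3 1]
#  [1 4 0 2]]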
I have a very large numpy array and I want to sort it and test if it is unique.
I'm aware of the function numpy.unique but it sorts the array another time to achieve it.
The reason I need the array sorted a priori is because the returned keys from the argsort function will be used to reorder another array.
I'm looking for a way to do both (argsort and unique test) without the need to sort the array again.
Example code:
import numpy as np
import numpy.random
# generating random arrays with 2 ^ 27 columns (it can grow even bigger!)
slices = np.random.random_integers(2 ** 32, size = 2 ** 27)
values = np.random.random_integers(2 ** 32, size = 2 ** 27)
# get an array of keys to sort slices AND values
# this operation takes a long time
sorted_slices = slices.argsort()
# sort both arrays
# it would be nice to make this operation in place
slices = slices[sorted_slices]
values = values[sorted_slices]
# test 'uniqueness'
# here, the np.unique function sorts the array again
if slices.shape[0] == np.unique(slices).shape[0]:
print('it is unique!')
else:
print('not unique!')
Both the arrays slices and values have 1 row and the same (huge) number of columns.
Thanks in advance.
You can check whether there are two or more equal values next to each other (non-unique values in a sorted array) by comparing their differences to 0:
numpy.any(numpy.diff(slices) == 0)
Be aware though that numpy will create two intermediate arrays: one with the difference values, one with boolean values.
Here's an approach making use of slicing: instead of computing the actual differences, we can just compare each element against the previous one, like so -
~((slices[1:] == slices[:-1]).any())
Runtime test -
In [54]: slices = np.sort(np.random.randint(0,100000000,(10000000)))
# @Nils Werner's soln
In [55]: %timeit ~np.any(np.diff(slices) == 0)
100 loops, best of 3: 18.5 ms per loop
# @Marco's suggestion in comments
In [56]: %timeit np.diff(slices).all()
10 loops, best of 3: 20.6 ms per loop
# Proposed soln in this post
In [57]: %timeit ~((slices[1:] == slices[:-1]).any())
100 loops, best of 3: 6.12 ms per loop
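Plugged into the question's code, where slices has already been sorted via argsort, the uniqueness test would then look something like this:
# slices is already sorted, so any duplicates must be adjacent
if not (slices[1:] == slices[:-1]).any():
    print('it is unique!')
else:
    print('not unique!')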
I have spent the last hour trying to figure this out.
Suppose we have
import numpy as np
a = np.random.rand(5, 20) - 0.5
amin_index = np.argmin(np.abs(a), axis=1)
print(amin_index)
> [ 0 12 5 18 1] # or something similar
this does not work:
a[amin_index]
So, in essence, I need to find the minima along a certain axis for the array np.abs(a), but then extract the values from the array a at these positions. How can I apply an index to just one axis?
Probably very simple, but I can't get it figured out. Also, I can't use any loops since I have to do this for arrays with several million entries.
thanks 😊
One way is to pass in the array of row indexes (e.g. [0,1,2,3,4]) and the list of column indexes for the minimum in each corresponding row (your list amin_index).
This returns an array containing the value at [i, amin_index[i]] for each row i:
>>> a[np.arange(a.shape[0]), amin_index]
array([-0.0069325 , 0.04268358, -0.00128002, -0.01185333, -0.00389487])
Note that this uses advanced (integer array) indexing, so the returned array is a new array (a copy) rather than a view of a.
This is because argmin returns the column index for each of the rows (with axis=1); therefore you need to access each row at that particular column:
a[range(a.shape[0]), amin_index]
Why not simply do np.amin(np.abs(a), axis=1)? It's much simpler if you don't need the intermediate amin_index array via argmin (note, though, that this gives you the absolute values, not the signed entries of a).
Numpy's reference page is an excellent resource, see "Indexing".
Edit: timing is always useful:
In [3]: a=np.random.rand(4000, 4000)-.5
In [4]: %timeit np.amin(np.abs(a), axis=1)
10 loops, best of 3: 128 ms per loop
In [5]: %timeit a[np.arange(a.shape[0]), np.argmin(np.abs(a), axis=1)]
10 loops, best of 3: 135 ms per loop
I have a data frame, and would like to work on a small partition each time for particular tuples of values of 'a', 'b','c'.
df = pd.DataFrame({'a': np.random.randint(0, 10, 10000),
                   'b': np.random.randint(0, 10, 10000),
                   'c': np.random.randint(0, 10, 10000),
                   'value': np.random.randint(0, 100, 10000)})
so I chose to use pandas multiindex:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
However, the performance is not great.
%timeit dfi.ix[(2,1,7)] # 511 us
%timeit df[(df['a'].values == 2) &
           (df['b'].values == 1) &
           (df['c'].values == 7)]  # 247 us
I suspect there are some overheads somewhere. My program has ~1k tuples, so it takes 511 * 1000 = 0.5s for one run. How can I improve further?
update:
Hmm, I forgot to mention that the number of tuples is less than the total Cartesian product of distinct values of 'a', 'b', 'c' in df. Wouldn't groupby do an excess amount of work on index combinations that don't exist in my tuples?
It's not clear what 'work on' means, but I would do this; it can be almost any function:
In [33]: %timeit df.groupby(['a','b','c']).apply(lambda x: x.sum())
10 loops, best of 3: 83.6 ms per loop
Certain operations are cythonized, so they are very fast:
In [34]: %timeit df.groupby(['a','b','c']).sum()
100 loops, best of 3: 2.65 ms per loop
Doing a selection on a MultiIndex index by index is not efficient.
If you are operating on a very small subset of the total groups, then you might want to index directly into the MultiIndex; groupby wins if you are operating on a fraction (maybe 20%) of the groups or more. You might also want to investigate filter, which you can use to pre-filter the groups based on some criteria.
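For example (a sketch of my own using the df from the question, with an arbitrary threshold), filter keeps only the rows belonging to groups that satisfy some condition, which you can then group again and aggregate:
# keep only the (a, b, c) groups with more than 15 rows, then aggregate just those
filtered = df.groupby(['a', 'b', 'c']).filter(lambda g: len(g) > 15)
result = filtered.groupby(['a', 'b', 'c'])['value'].sum()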
As noted above, the Cartesian product of the group indexers is irrelevant. Only the groups that actually occur will be iterated over by groupby (think of a MultiIndex as a sparse representation of the total possible space).
How about:
dfi = df.set_index(['a','b','c'])
dfi.sortlevel(inplace = True)
value = dfi["value"].values
value[dfi.index.get_loc((2, 1, 7))]
The result is an ndarray without an index.