def match_score(vendor, company):
return max(fuzz.ratio(vendor, company), fuzz.partial_ratio(vendor, company), fuzz.token_sort_ratio(vendor, company))
Note: fuzz is from import fuzzywuzzy library
========================
vendor = 'RED DEER TELUS STORE'
When I try this code:
df['Vendor']=vendor
df['Score'] = np.array(match_score(tuple(df['Vendor']), tuple(df['Company'])))
I get this
However, when I try an almost identical code I get a different 'Score'?
df['Score'] = np.array(match_score(vendor, tuple(df['Company'])))
My logic in #2 is that the vendor is the same across the entire column so no need to put it in a tuple..I can just give it as a string and make the processing faster.
Can anyone explain why passing an entire column where vendor in each cell = 'RED DEER TELUS STORE' gives a different result than just passing 'RED DEER TELUS STORE' to the function as a string? Thanks!
tuple versus tolist:
In [166]: x=np.arange(10000)
In [167]: timeit tuple(x)
1.14 ms ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [168]: timeit list(x)
1.12 ms ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [169]: timeit x.tolist()
296 µs ± 9.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
or with a series
In [170]: ds = pd.Series(x)
In [171]: timeit tuple(ds)
1.22 ms ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [172]: timeit list(ds)
1.23 ms ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [173]: timeit ds.to_list()
394 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
With a series of string values (object dtype):
In [184]: ds = pd.Series(['' for _ in range(1000)])
In [185]: ds[:] = vendor
In [186]: timeit tuple(ds)
104 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [187]: timeit ds.to_list()
27.2 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As to why you get a difference between passing the Series/tuple versus the string, I think you need to examine the fuzz code/docs in more detail. Maybe even test the function(s) with small examples. I don't have fuzz installed, so can't explore that part of your calculations.
You might even want to make up some lists (or tuples) of strings, and experiment with those. I don't think this is a numpy/pandas issue. It's a matter of learning to use fuzz correctly.
Related
I need help in understanding the %timeit function works in the two programs.
Program A
a = [1,3,2,4,1,4,2]
%timeit [val + 5 for val in a]
830 ns ± 45.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
Program B
import numpy as np
a = np.array([1,3,2,4,1,4,2])
%timeit [a+5]
1.07 µs ± 23.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
My confusion:
µs is bigger than ns. How does the NumPy function execute slower than for loop here?
1.07 µs ± 23.7 ns per loop... why is the loop speed calculated in ns and not in µs?
Numpy adds an overhead, this will impact the speed on small datasets. Vectorization is mostly useful when using large datasets.
You must try on larger numbers:
N = 10_000_000
a = list(range(N))
%timeit [val + 5 for val in a]
import numpy as np
a = np.arange(N)
%timeit a+5
Output:
1.51 s ± 318 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
55.8 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
I have the following issue: I have a matrix yj of size (m,200) (m = 3683), and I have a dictionary that for each key, returns a numpy array of row indices for yj (for each key, the size array changes, just in case anyone is wondering).
Now, I have to access this matrix lots of times (around 1M times) and my code is slowing down because of the indexing (I've profiled the code and it takes 65% of time on this step).
Here is what I've tried out:
First of all, use the indices for slicing:
>> %timeit yj[R_u_idx_train[1]]
10.5 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The variable R_u_idx_train is the dictionary that has the row indices.
I thought that maybe boolean indexing might be faster:
>> yj[R_u_idx_train_mask[1]]
10.5 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
R_u_idx_train_mask is a dictionary that returns a boolean array of size m where the indices given by R_u_idx_train are set to True.
I also tried np.ix_
>> cols = np.arange(0,200)
>> %timeit ix_ = np.ix_(R_u_idx_train[1], cols); yj[ix_]
42.1 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I also tried np.take
>> %timeit np.take(yj, R_u_idx_train[1], axis=0)
2.35 ms ± 88.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And while this seems great, it is not, since it gives an array that is shape (R_u_idx_train[1].shape[0], R_u_idx_train[1].shape[0]) (it should be (R_u_idx_train[1].shape[0], 200)). I guess I'm not using the method correctly.
I also tried np.compress
>> %timeit np.compress(R_u_idx_train_mask[1], yj, axis=0)
14.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally I tried to index with a boolean matrix
>> %timeit yj[R_u_idx_train_mask2[1]]
244 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, is 10.5 µs ± 79.7 ns per loop the best I can do? I could try to use cython but that seems like a lot of work for just indexing...
Thanks a lot.
A very smart solution was given by V.Ayrat in the comments.
>> newdict = {k: yj[R_u_idx_train[k]] for k in R_u_idx_train.keys()}
>> %timeit newdict[1]
202 ns ± 6.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Anyway maybe it would still be cool to know if there is a way to speed it up using numpy!
Looking for the fastest way to cut a timeseries ... for example just taking the values that are more recent than a certain index.
I've found two commonly used methods:
df = original_series.truncate(before=example_time)
and
df = original_series[example_time:]
Which one is faster (for large time-series > 10**6 values) ?
This usually depends on what your dataframe index is, throwing a random DataFrame of 10^7 values into timeit we get the following.
From a performance standpoint in truncation more inefficient as pandas is optimized for integer based indexing via numpy.
Truncate:
62.6 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Bracket Indexing:
54.1 µs ± 4.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
ILoc:
69.5 µs ± 4.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Loc:
92 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Ix (which is deprecated):
110 µs ± 8.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
EDIT This is all on pandas 0.24.2, back in the 0.14-0.18 versions loc performance was much much worse
Consider the following code:
import numpy as np
import pandas as pd
a = pd.DataFrame({'case': np.arange(10000) % 100,
'x': np.random.rand(10000) > 0.5})
%timeit any(a.x)
%timeit a.x.max()
%timeit a.groupby('case').x.transform(any)
%timeit a.groupby('case').x.transform(max)
13.2 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
195 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.9 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.43 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
b = pd.DataFrame({'x': np.random.rand(100) > 0.5})
%timeit any(b.x)
%timeit b.x.max()
13.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.5 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We see that "any" works faster than "max" on a boolean pandas.Series of size 100 and 10000, but when we try to groupby and transform data in groups of 100, suddenly "max" is a lot faster than "any". Why?
Because any evaluation is lazy. Which means that the that the any function will stop at the first True boolean element.
The max, however, can't do so because it required to inspect every element in a sequence to be sure it haven't missed any greater element.
That's why, max always will inspect all element when any inspect only element before the first True.
The case when max works faster are probably the cases with type coercion because all values in numpy are stored in their own types and formats, mathematical operations may be faster that python's any.
As said in comment, the python any fonction have a short circuit mechanism, when np.any
have not. see here.
But True in a.x is even faster:
%timeit any(a.x)
53.6 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit True in (a.x)
3.39 µs ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I am looking to speed up the following piece of code:
NNlist=[np.unique(i) for i in NNlist]
where NNlist is a list of np.arrays with duplicated entries.
Thanks :)
numpy.unique is already pretty optimized, you're not likely to get get much of a speedup over what you already have unless you know something else about the underlying data. For example if the data is all small integers you might be able to use numpy.bincount or if the unique values in each of the arrays are mostly the same there might be some optimization that could be done over the whole list of arrays.
pandas.unique() is much faster than numpy.unique(). The Pandas version does not sort the result, but you can do that yourself and it will still be much faster if the result is much smaller than the input (i.e. there are a lot of duplicate values):
np.sort(pd.unique(arr))
Timings:
In [1]: x = np.random.randint(10, 20, 50000000)
In [2]: %timeit np.sort(pd.unique(x))
201 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit np.unique(x)
1.49 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I also took a look at list(set()) and strings in a list, between pandas Series and python lists.
data = np.random.randint(0,10,100)
data_hex = [str(hex(n)) for n in data] # just some simple strings
sample1 = pd.Series(data, name='data')
sample2 = data.tolist()
sample3 = pd.Series(data_hex, name='data')
sample4 = data_hex
And then the benchmarks:
%timeit np.unique(sample1) # 16.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample2) # 15.9 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample3) # 45.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.unique(sample4) # 20.6 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample1) # 60.3 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample2) # 196 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample3) # 79.7 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample4) # 214 µs ± 61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each
%timeit list(set(sample1)) # 16.3 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample2)) # 1.64 µs ± 83.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit list(set(sample3)) # 17.8 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample4)) # 2.48 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The take away is:
Starting with a Pandas Series with integers? Go with either np.unique() or list(set())
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
However, if N=1,000,000 instead, the results are different.
%timeit np.unique(sample1) # 26.5 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample2) # 98.1 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(sample3) # 1.31 s ± 78.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample4) # 174 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.unique(sample1) # 10.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.unique(sample2) # 99.3 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample3) # 46.4 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample4) # 113 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample1)) # 25.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample2)) # 11.2 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(set(sample3)) # 37.1 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample4)) # 20.2 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Starting with a Pandas Series with integers? Go with pd.unique()
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
Here are some benchmarks:
In [72]: ar_list = [np.random.randint(0, 100, 1000) for _ in range(100)]
In [73]: %timeit map(np.unique, ar_list)
100 loops, best of 3: 4.9 ms per loop
In [74]: %timeit [np.unique(ar) for ar in ar_list]
100 loops, best of 3: 4.9 ms per loop
In [75]: %timeit [pd.unique(ar) for ar in ar_list] # using pandas
100 loops, best of 3: 2.25 ms per loop
So pandas.unique seems to be faster than numpy.unique. However the docstring mentions that the values are "not necessarily sorted", which (partly) explains, that it is faster.
Using a list comprehension or map doesn't give a difference in this example.
The numpy.unique() is based on sorting (quicksort), and the pandas.unique() is based on hash table. Normally, the latter is faster according to my benchmarks. They are already very optimized.
For some special case, you can continue to optimize the performance.
For example, if the data already sorted, you can skip the sorting method:
# ar is already sorted
# this segment is from source code of numpy
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = ar[1:] != ar[:-1]
ret = ar[mask]
I meet the similar problem to yours. I wrote my unique function for my use. Because the pandas.unique doesn't support return_counts option. It is fast implementation. But my implementation only supports integers array. You can check out the source code here.