Fastest way to cut a pandas time-series - python

Looking for the fastest way to cut a time series ... for example, just taking the values that are more recent than a certain index.
I've found two commonly used methods:
df = original_series.truncate(before=example_time)
and
df = original_series[example_time:]
Which one is faster (for large time series, > 10**6 values)?

This usually depends on what your DataFrame's index is; throwing a random DataFrame of 10^7 values into timeit, we get the following.
From a performance standpoint, truncation is the less efficient option, as pandas is optimized for integer-based indexing via NumPy.
Truncate:
62.6 ms ± 3.63 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Bracket Indexing:
54.1 µs ± 4.41 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
ILoc:
69.5 µs ± 4.52 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Loc:
92 µs ± 5.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Ix (which is deprecated):
110 µs ± 8.44 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
EDIT: This is all on pandas 0.24.2; back in the 0.14-0.18 versions, loc performance was much, much worse.

Related

What is the most optimized way to replace nan with 0 in a pandas dataframe? [duplicate]

Pandas fillna() is significantly slow, especially if there is a large amount of missing data in a DataFrame.
Is there any quicker way?
(I know that it would help if I simply dropped some of the rows and/or columns that contain the NAs.)
I try to test:
import numpy as np
import pandas as pd

np.random.seed(123)
N = 60000
df = pd.DataFrame(np.random.choice(['a', None], size=(N, 20), p=(.7, .3)))
In [333]: %timeit df.fillna('b')
93.5 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df[df.isna()] = 'b'
122 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A slightly changed solution (but I feel it is a bit hacky):
# pandas below 0.24
In [335]: %timeit df.values[df.isna()] = 'b'
56.7 ms ± 799 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#pandas 0.24+
In [339]: %timeit df.to_numpy()[df.isna()] = 'b'
56.5 ms ± 951 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

Numpy gives different results for almost identical code?

def match_score(vendor, company):
    return max(fuzz.ratio(vendor, company),
               fuzz.partial_ratio(vendor, company),
               fuzz.token_sort_ratio(vendor, company))
Note: fuzz comes from the fuzzywuzzy library (from fuzzywuzzy import fuzz).
========================
vendor = 'RED DEER TELUS STORE'
When I try this code:
df['Vendor']=vendor
df['Score'] = np.array(match_score(tuple(df['Vendor']), tuple(df['Company'])))
I get one set of scores.
However, when I try almost identical code, I get a different 'Score':
df['Score'] = np.array(match_score(vendor, tuple(df['Company'])))
My logic in #2 is that the vendor is the same across the entire column, so there's no need to put it in a tuple... I can just give it as a string and make the processing faster.
Can anyone explain why passing an entire column where vendor in each cell = 'RED DEER TELUS STORE' gives a different result than just passing 'RED DEER TELUS STORE' to the function as a string? Thanks!
tuple versus tolist:
In [166]: x=np.arange(10000)
In [167]: timeit tuple(x)
1.14 ms ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [168]: timeit list(x)
1.12 ms ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [169]: timeit x.tolist()
296 µs ± 9.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
or with a series
In [170]: ds = pd.Series(x)
In [171]: timeit tuple(ds)
1.22 ms ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [172]: timeit list(ds)
1.23 ms ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [173]: timeit ds.to_list()
394 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
With a series of string values (object dtype):
In [184]: ds = pd.Series(['' for _ in range(1000)])
In [185]: ds[:] = vendor
In [186]: timeit tuple(ds)
104 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [187]: timeit ds.to_list()
27.2 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As to why you get a difference between passing the Series/tuple versus the string, I think you need to examine the fuzz code/docs in more detail. Maybe even test the function(s) with small examples. I don't have fuzz installed, so I can't explore that part of your calculations.
You might even want to make up some lists (or tuples) of strings and experiment with those. I don't think this is a numpy/pandas issue; it's a matter of learning to use fuzz correctly.

Why "any" sometimes works much faster, and sometimes much slower than "max" on Boolean values in python?

Consider the following code:
import numpy as np
import pandas as pd
a = pd.DataFrame({'case': np.arange(10000) % 100,
                  'x': np.random.rand(10000) > 0.5})
%timeit any(a.x)
%timeit a.x.max()
%timeit a.groupby('case').x.transform(any)
%timeit a.groupby('case').x.transform(max)
13.2 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
195 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.9 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.43 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
b = pd.DataFrame({'x': np.random.rand(100) > 0.5})
%timeit any(b.x)
%timeit b.x.max()
13.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.5 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We see that "any" works faster than "max" on a boolean pandas.Series of size 100 and 10000, but when we try to groupby and transform data in groups of 100, suddenly "max" is a lot faster than "any". Why?
Because any evaluates lazily: the any function will stop at the first True boolean element.
max, however, can't do so, because it is required to inspect every element in the sequence to be sure it hasn't missed a greater one.
That's why max always inspects all elements, while any inspects only the elements up to the first True.
The cases where max works faster are probably those dominated by type coercion: all values in NumPy are stored in their own types and formats, so vectorized mathematical operations may be faster than Python's any.
As said in the comments, the Python any function has a short-circuit mechanism, while np.any does not; see here.
But True in a.x is even faster:
%timeit any(a.x)
53.6 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit True in (a.x)
3.39 µs ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
(Beware, though: in on a pandas Series tests membership in the index, not the values, so True in a.x returns True here only because the default RangeIndex contains 1, which equals True.)

Fastest way to drop rows / get subset with difference from large DataFrame in Pandas

Question
I'm looking for the fastest way to drop a set of rows whose indices I've got, or to get the subset of the difference of these indices (which results in the same dataset), from a large Pandas DataFrame.
So far I have two solutions, which seem relatively slow to me:
df.loc[df.index.difference(indices)]
which takes ~115 sec on my dataset
df.drop(indices)
which takes ~215 sec on my dataset
Is there a faster way to do this? Preferably in Pandas.
Performance of proposed Solutions
~41 sec: df[~df.index.isin(indices)] by @jezrael
I believe you can create a boolean mask, invert it by ~, and filter by boolean indexing:
df1 = df[~df.index.isin(indices)]
As @user3471881 mentioned, to avoid chained indexing, if you are planning on manipulating the filtered df later it is necessary to add copy():
df1 = df[~df.index.isin(indices)].copy()
This filtering depends on the number of matched indices and also on the length of the DataFrame.
So another possible solution is to create an array/list of the indices to keep; then inverting is not necessary:
df1 = df[df.index.isin(need_indices)]
Using iloc (or loc, see below) and Index.drop:
df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)
%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]
41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
As @jezrael points out, you can only use iloc if the index is a RangeIndex; otherwise you will have to use loc. But this is still faster than df[~df.index.isin(indices)] (see why below).
All three options on 10 million rows:
df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)
%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]
4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why does super slow loc outperform boolean_indexing?
Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given the data above, with 10 million rows):
%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)
4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can compare this to the performance of boolean_indexing vs iloc vs loc:
boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)
%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]
489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the order of rows doesn't matter, you can rearrange them in place:
import numpy as np
import pandas as pd
from numba import njit

n = 10**7
df = pd.DataFrame(np.arange(4 * n).reshape(n, 4))
indices = np.unique(np.random.randint(0, n, size=n // 2))

@njit
def _dropfew(values, indices):
    # overwrite each dropped row with a row from the current tail
    k = len(values) - 1
    for ind in indices[::-1]:
        values[ind] = values[k]
        k -= 1

def dropfew(df, indices):
    _dropfew(df.values, indices)
    return df.iloc[:len(df) - len(indices)]
Runs:
In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s
In [40]: %time dropfew(df,indices)
Wall time: 219 ms

Python Speedup np.unique

I am looking to speed up the following piece of code:
NNlist=[np.unique(i) for i in NNlist]
where NNlist is a list of np.arrays with duplicated entries.
Thanks :)
numpy.unique is already pretty optimized; you're not likely to get much of a speedup over what you already have unless you know something else about the underlying data. For example, if the data is all small integers you might be able to use numpy.bincount, or if the unique values in each of the arrays are mostly the same, there might be some optimization that could be done over the whole list of arrays.
pandas.unique() is much faster than numpy.unique(). The Pandas version does not sort the result, but you can do that yourself, and it will still be much faster if the result is much smaller than the input (i.e. there are a lot of duplicate values):
np.sort(pd.unique(arr))
Timings:
In [1]: x = np.random.randint(10, 20, 50000000)
In [2]: %timeit np.sort(pd.unique(x))
201 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit np.unique(x)
1.49 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I also took a look at list(set()) and at strings in a list, comparing pandas Series and Python lists.
N = 100
data = np.random.randint(0, 10, N)
data_hex = [str(hex(n)) for n in data]  # just some simple strings
sample1 = pd.Series(data, name='data')
sample2 = data.tolist()
sample3 = pd.Series(data_hex, name='data')
sample4 = data_hex
And then the benchmarks:
%timeit np.unique(sample1) # 16.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample2) # 15.9 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample3) # 45.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.unique(sample4) # 20.6 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample1) # 60.3 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample2) # 196 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample3) # 79.7 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample4) # 214 µs ± 61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(set(sample1)) # 16.3 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample2)) # 1.64 µs ± 83.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit list(set(sample3)) # 17.8 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample4)) # 2.48 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The takeaway is:
Starting with a Pandas Series with integers? Go with either np.unique() or list(set())
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
However, if N=1,000,000 instead, the results are different.
%timeit np.unique(sample1) # 26.5 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample2) # 98.1 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(sample3) # 1.31 s ± 78.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample4) # 174 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.unique(sample1) # 10.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.unique(sample2) # 99.3 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample3) # 46.4 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample4) # 113 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample1)) # 25.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample2)) # 11.2 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(set(sample3)) # 37.1 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample4)) # 20.2 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Starting with a Pandas Series with integers? Go with pd.unique()
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
Here are some benchmarks:
In [72]: ar_list = [np.random.randint(0, 100, 1000) for _ in range(100)]
In [73]: %timeit map(np.unique, ar_list)
100 loops, best of 3: 4.9 ms per loop
In [74]: %timeit [np.unique(ar) for ar in ar_list]
100 loops, best of 3: 4.9 ms per loop
In [75]: %timeit [pd.unique(ar) for ar in ar_list] # using pandas
100 loops, best of 3: 2.25 ms per loop
So pandas.unique seems to be faster than numpy.unique. However, the docstring mentions that the values are "not necessarily sorted", which (partly) explains why it is faster.
Using a list comprehension or map doesn't make a difference in this example.
numpy.unique() is based on sorting (quicksort), while pandas.unique() is based on a hash table. Normally the latter is faster, according to my benchmarks. Both are already very optimized.
For some special cases, you can optimize performance further.
For example, if the data is already sorted, you can skip the sorting step:
# ar is assumed to be already sorted
# this segment is adapted from the source code of numpy.unique
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = ar[1:] != ar[:-1]
ret = ar[mask]
I met a similar problem and wrote my own unique function, because pandas.unique doesn't support the return_counts option. It is a fast implementation, but it only supports integer arrays. You can check out the source code here.
