Pandas fillna() is significantly slow, especially if there is a large amount of missing data in a DataFrame.
Is there any quicker way to do it?
(I know that it would help if I simply dropped some of the rows and/or columns that contain the NAs.)
I tried to test:
np.random.seed(123)
N = 60000
df = pd.DataFrame(np.random.choice(['a', None], size=(N, 20), p=(.7, .3)))
In [333]: %timeit df.fillna('b')
93.5 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df[df.isna()] = 'b'
122 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A slightly modified solution (though I feel it is a bit hacky):
#pandas below 0.24
In [335]: %timeit df.values[df.isna()] = 'b'
56.7 ms ± 799 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#pandas 0.24+
In [339]: %timeit df.to_numpy()[df.isna()] = 'b'
56.5 ms ± 951 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
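For completeness, here is a minimal sketch (the helper name fast_fillna is my own) that makes the trick explicit instead of relying on .values/.to_numpy() returning a writable view of the underlying data; it assumes a single-dtype object frame like the one above:

import numpy as np
import pandas as pd

def fast_fillna(df, value):
    # Fill on a plain ndarray copy and rebuild the frame, so the result
    # does not depend on .to_numpy() happening to be a view.
    arr = df.to_numpy(copy=True)      # pandas 0.24+
    arr[pd.isna(arr)] = value         # boolean-mask assignment
    return pd.DataFrame(arr, index=df.index, columns=df.columns)

np.random.seed(123)
df = pd.DataFrame(np.random.choice(['a', None], size=(60000, 20), p=(.7, .3)))
filled = fast_fillna(df, 'b')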
I was going through the pandas documentation, which says that passing infer_datetime_format=True to read_csv can make datetime parsing faster.
I have a sample CSV data file:
Date
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
Next I tried
In [174]: %timeit df = pd.read_csv("a.csv", parse_dates=["Date"])
1.5 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [175]: %timeit df = pd.read_csv("a.csv", parse_dates=["Date"], infer_datetime_format=True)
1.73 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, according to the documentation, it should take less time. Is my understanding correct? Or for what kind of data does that statement hold?
Update:
Pandas version - '1.0.5'
What you actually want to do is add dayfirst=True:
%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"],dayfirst = True, infer_datetime_format=True)
1.96 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Compared to
%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"])
2.38 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"], infer_datetime_format=True)
3.02 ms ± 670 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The solution is to reduce the number of choices read_csv has to make about how to parse the dates.
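For example, one way to remove the guesswork entirely is to parse the column with an explicit format after reading (a sketch of my own; the format string matches the dd-mm-yyyy sample file above):

import pandas as pd

# Read the raw strings, then convert with an explicit format so pandas
# does not have to infer anything about the layout of the dates.
df = pd.read_csv("a.csv")
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")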
Consider the following code:
import numpy as np
import pandas as pd
a = pd.DataFrame({'case': np.arange(10000) % 100,
                  'x': np.random.rand(10000) > 0.5})
%timeit any(a.x)
%timeit a.x.max()
%timeit a.groupby('case').x.transform(any)
%timeit a.groupby('case').x.transform(max)
13.2 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
195 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.9 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.43 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
b = pd.DataFrame({'x': np.random.rand(100) > 0.5})
%timeit any(b.x)
%timeit b.x.max()
13.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.5 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We see that "any" works faster than "max" on a boolean pandas.Series of size 100 and 10000, but when we try to groupby and transform data in groups of 100, suddenly "max" is a lot faster than "any". Why?
Because any() evaluates lazily, which means it stops at the first True boolean element.
max(), however, cannot do so, because it has to inspect every element in the sequence to be sure it has not missed a greater one.
That is why max() always inspects all elements, while any() only inspects the elements up to the first True.
The cases where max() works faster are probably cases with type coercion: since all values in numpy are stored in their own types and formats, the vectorized reduction can be faster than Python's any().
As said in the comments, the Python any() function has a short-circuit mechanism, whereas np.any() does not; see here.
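A small illustration of that short-circuiting (my own toy example, not part of the original benchmark):

import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.rand(10_000) > 0.5)

# Built-in any() pulls elements one at a time and stops at the first True,
# so with roughly half the values True it usually inspects only a few elements.
any(s)

# max() has to reduce over every element before it can answer.
s.max()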
But True in a.x is even faster (be aware, though, that in on a pandas Series checks the index, not the values; True in a.x.values checks the values):
%timeit any(a.x)
53.6 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit True in (a.x)
3.39 µs ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
I want to store in a new variable the last digit from a 'UserId' (such UserId is of type string).
I came up with this, but it's a long df and takes forever. Any tips on how to optimize it or avoid the for loop?
df['LastDigit'] = np.nan
for i in range(0,len(df['UserId'])):
    df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
Use str.strip with indexing by str[-1]:
df['LastDigit'] = df['UserId'].str.strip().str[-1]
If performance is important and there are no missing values, use a list comprehension (a NaN-safe variant is sketched after the timings below):
df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
Your solution is really slow; it is the last approach from this list:
6) updating an empty frame (e.g. using loc one-row-at-a-time)
Performance:
np.random.seed(456)
users = ['joe','jan ','ben','rick ','clare','mary','tom']
df = pd.DataFrame({
    'UserId': np.random.choice(users, size=1000),
})
In [139]: %%timeit
...: df['LastDigit'] = np.nan
...: for i in range(0,len(df['UserId'])):
...:     df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
...:
__main__:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
57.9 s ± 1.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [140]: %timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
1.38 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [141]: %timeit df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
343 µs ± 8.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
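If the column can contain missing values, a NaN-safe variant of the list comprehension keeps most of the speed (a sketch of my own, with a None added to the sample data to show the guard):

import numpy as np
import pandas as pd

np.random.seed(456)
users = ['joe', 'jan ', 'ben', 'rick ', 'clare', 'mary', None]
df = pd.DataFrame({'UserId': np.random.choice(users, size=1000)})

# Guard against non-strings (NaN/None) instead of letting .strip() raise.
df['LastDigit'] = [x.strip()[-1] if isinstance(x, str) else np.nan
                   for x in df['UserId']]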
Another option is to use apply. It is not as performant as the list comprehension, but it is very flexible depending on your goals. Here are some timings on a random DataFrame with shape (44289, 31):
%timeit df['LastDigit'] = df['UserId'].apply(lambda x: str(x)[-1]) #if some variables are not strings
12.4 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
31.5 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['LastDigit'] = [str(x).strip()[-1] for x in df['UserId']]
9.7 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
I am looking to speed up the following piece of code:
NNlist=[np.unique(i) for i in NNlist]
where NNlist is a list of np.arrays with duplicated entries.
Thanks :)
numpy.unique is already pretty optimized; you're not likely to get much of a speedup over what you already have unless you know something else about the underlying data. For example, if the data is all small integers you might be able to use numpy.bincount, or if the unique values in each of the arrays are mostly the same, there might be some optimization that could be done over the whole list of arrays.
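As an illustration of the bincount idea for small non-negative integers (my own sketch, not part of the original answer):

import numpy as np

arr = np.random.randint(0, 100, 1_000_000)   # small non-negative integers
uniques = np.nonzero(np.bincount(arr))[0]    # positions with a nonzero count = sorted unique values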
pandas.unique() is much faster than numpy.unique(). The Pandas version does not sort the result, but you can do that yourself and it will still be much faster if the result is much smaller than the input (i.e. there are a lot of duplicate values):
np.sort(pd.unique(arr))
Timings:
In [1]: x = np.random.randint(10, 20, 50000000)
In [2]: %timeit np.sort(pd.unique(x))
201 ms ± 9.32 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [3]: %timeit np.unique(x)
1.49 s ± 27.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
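Applied to the list comprehension from the question, a sketch could look like this (drop np.sort if order does not matter; the random arrays only stand in for the real data):

import numpy as np
import pandas as pd

NNlist = [np.random.randint(0, 100, 1000) for _ in range(100)]   # stand-in for the real data
NNlist = [np.sort(pd.unique(arr)) for arr in NNlist]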
I also took a look at list(set()), comparing integers and strings stored in both pandas Series and Python lists.
data = np.random.randint(0,10,100)
data_hex = [str(hex(n)) for n in data] # just some simple strings
sample1 = pd.Series(data, name='data')
sample2 = data.tolist()
sample3 = pd.Series(data_hex, name='data')
sample4 = data_hex
And then the benchmarks:
%timeit np.unique(sample1) # 16.4 µs ± 464 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample2) # 15.9 µs ± 743 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit np.unique(sample3) # 45.8 µs ± 5.88 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.unique(sample4) # 20.6 µs ± 680 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample1) # 60.3 µs ± 5.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample2) # 196 µs ± 18.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample3) # 79.7 µs ± 3.98 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit pd.unique(sample4) # 214 µs ± 61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit list(set(sample1)) # 16.3 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample2)) # 1.64 µs ± 83.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit list(set(sample3)) # 17.8 µs ± 1.96 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit list(set(sample4)) # 2.48 µs ± 439 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The takeaway is:
Starting with a Pandas Series with integers? Go with either np.unique() or list(set())
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
However, if N=1,000,000 instead, the results are different.
%timeit np.unique(sample1) # 26.5 ms ± 616 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample2) # 98.1 ms ± 3.64 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit np.unique(sample3) # 1.31 s ± 78.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit np.unique(sample4) # 174 ms ± 2.57 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit pd.unique(sample1) # 10.5 ms ± 472 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit pd.unique(sample2) # 99.3 ms ± 5.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample3) # 46.4 ms ± 4.73 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit pd.unique(sample4) # 113 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample1)) # 25.9 ms ± 2.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample2)) # 11.2 ms ± 496 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit list(set(sample3)) # 37.1 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit list(set(sample4)) # 20.2 ms ± 843 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Starting with a Pandas Series with integers? Go with pd.unique()
Starting with a Pandas Series with strings? Go with list(set())
Starting with a list of integers? Go with list(set())
Starting with a list of strings? Go with list(set())
Here are some benchmarks:
In [72]: ar_list = [np.random.randint(0, 100, 1000) for _ in range(100)]
In [73]: %timeit map(np.unique, ar_list)
100 loops, best of 3: 4.9 ms per loop
In [74]: %timeit [np.unique(ar) for ar in ar_list]
100 loops, best of 3: 4.9 ms per loop
In [75]: %timeit [pd.unique(ar) for ar in ar_list] # using pandas
100 loops, best of 3: 2.25 ms per loop
So pandas.unique seems to be faster than numpy.unique. However, the docstring mentions that the returned values are "not necessarily sorted", which (partly) explains why it is faster.
Using a list comprehension or map doesn't give a difference in this example.
numpy.unique() is based on sorting (quicksort), while pandas.unique() is based on a hash table. Normally, the latter is faster according to my benchmarks; both are already very optimized.
For some special cases, you can optimize the performance further.
For example, if the data is already sorted, you can skip the sorting step:
# ar is already sorted
# this segment is adapted from the numpy source code
mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True                   # always keep the first element
mask[1:] = ar[1:] != ar[:-1]      # keep elements that differ from their predecessor
ret = ar[mask]                    # the unique (and still sorted) values
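A quick sanity check of that shortcut on a pre-sorted array (my own example):

import numpy as np

ar = np.sort(np.random.randint(0, 100, 1_000_000))   # pre-sorted input

mask = np.empty(ar.shape, dtype=np.bool_)
mask[:1] = True
mask[1:] = ar[1:] != ar[:-1]

assert np.array_equal(ar[mask], np.unique(ar))        # same result as np.unique, without re-sorting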
I ran into a similar problem to yours, so I wrote my own unique function, because pandas.unique doesn't support the return_counts option. It is a fast implementation, but it only supports integer arrays. You can check out the source code here.
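For reference, the built-in numpy counterpart with counts (not the custom implementation linked above) looks like this:

import numpy as np

arr = np.random.randint(0, 100, 1_000_000)
values, counts = np.unique(arr, return_counts=True)   # pd.unique() has no return_counts equivalent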