I have a long 1-d numpy array with some 10% missing values. I want to change its missing values (np.nan) to other values repeatedly. I know of two ways to do this:
data[np.isnan(data)] = 0 or the function
np.copyto(data, 0, where=np.isnan(data))
Sometimes I want to put zeros there, other times I want to restore the nans.
I thought that recomputing the np.isnan function repeatedly would be slow, and it would be better to save the locations of the nans. Some of the timing results of the code below are counter-intuitive.
I ran the following:
import numpy as np
import sys
print(sys.version)
print(sys.version_info)
print(f'numpy version {np.__version__}')
data = np.random.random(100000)
data[data<0.1] = 0
data[data==0] = np.nan
%timeit missing = np.isnan(data)
%timeit wheremiss = np.where(np.isnan(data))
missing = np.isnan(data)
wheremiss = np.where(np.isnan(data))
print("Use missing list store 0")
%timeit data[missing] = 0
data[data==0] = np.nan
%timeit data[wheremiss] = 0
data[data==0] = np.nan
%timeit np.copyto(data, 0, where=missing)
print("Use isnan function store 0")
data[data==0] = np.nan
%timeit data[np.isnan(data)] = 0
data[data==0] = np.nan
%timeit np.copyto(data, 0, where=np.isnan(data))
print("Use missing list store np.nan")
data[data==0] = np.nan
%timeit data[missing] = np.nan
data[data==0] = np.nan
%timeit data[wheremiss] = np.nan
data[data==0] = np.nan
%timeit np.copyto(data, np.nan, where=missing)
print("Use isnan function store np.nan")
data[data==0] = np.nan
%timeit data[np.isnan(data)] = np.nan
data[data==0] = np.nan
%timeit np.copyto(data, np.nan, where=np.isnan(data))
And I got the following output (I have taken the liberty to add numbers to the timing lines, so that I can refer to them later):
3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
numpy version 1.17.1
01. 30 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
02. 219 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use missing list store 0
03. 339 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
04. 26 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
05. 287 µs ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store 0
06. 38.5 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
07. 43.8 µs ± 4.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Use missing list store np.nan
08. 328 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
09. 24.8 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
10. 322 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store np.nan
11. 356 µs ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
12. 300 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So here is the first question. Why would it take nearly 10 times longer to store a np.nan than to store a 0? (compare lines 6 and 7 vs. lines 11 and 12)
Why would it take much longer to use a stored list of missing as compared to recomputing the missing values using the isnan function? (compare lines 3 and 5 vs. 6 and 7)
This is just for curiosity. I can see that the fastest way is to use np.where to get a list of indices (because I have only 10% missing). But if I had many more, things might not be so obvious.
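Because you're not measuring what you think you are! The timed statement mutates data, and %timeit runs it many times, so every run after the first operates on already-changed data. Once the NaNs have been replaced with 0, the next np.isnan(data) finds nothing and the assignment is basically a no-op. When you assign NaN instead, the NaNs are still there on the next iteration, so every run does the full amount of work. The same effect explains the second observation: a stored mask still selects all of the original positions on every run, whereas a recomputed mask is empty after the first run.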
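A minimal sketch of that effect, using made-up data of the same shape as in the question (the seed and variable names are illustrative only). It just counts how many positions the recomputed mask selects before and after one run of the timed statement:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000)
data[data < 0.1] = np.nan           # roughly 10% missing values

# What the first %timeit iteration sees.
n_selected_first_run = int(np.isnan(data).sum())

data[np.isnan(data)] = 0            # the statement being timed

# What every later iteration sees: the mask is now empty.
n_selected_later_runs = int(np.isnan(data).sum())

print(n_selected_first_run, n_selected_later_runs)
```

To time the store-0 case fairly, the NaNs would have to be restored before every run, and the cost of restoring them kept out of the measurement.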
Your question about when to use np.where versus keeping an array of bools is a bit more difficult. It depends on the relative sizes of the datatypes (e.g. bool is 1 byte, int64 is 8 bytes), the proportion of values that are selected, how well their distribution matches the optimisations of the CPU/memory subsystem (e.g. are they mostly in one block or uniformly distributed), the cost of calling np.where relative to how many times the result will be reused, and other things I can't think of right now.
For other readers, it is worth pointing out that RAM access latency is more than 100 times higher than L1 cache latency, so keeping memory access predictable is important for making good use of the cache.
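To make the size trade-off concrete, here is a small sketch on made-up data that compares the memory footprint of the boolean mask with that of the index array returned by np.where. The variable names mirror the question; the exact byte counts depend on the platform's index dtype.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(100_000)
data[data < 0.1] = np.nan

missing = np.isnan(data)            # one bool (1 byte) per element
wheremiss = np.where(missing)[0]    # one integer index per selected element

mask_bytes = missing.nbytes
index_bytes = wheremiss.nbytes

print(mask_bytes, index_bytes, wheremiss.itemsize)
```

With about 10% of the values selected and 8-byte indices, the two representations are of similar size here; the index array grows with the fraction selected, while the mask does not.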
I am trying to structure a df for productivity. At some point I need to check whether each id exists in a list and set an indicator accordingly, but it is too slow (something like 30 seconds per df).
Can you point me to a better way to do it?
This is my current code:
data['first_time_it_happen'] = data['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
(I already tried using the column as a Series, but it did not work correctly.)
To settle some debate in the comment section, I ran some timings.
Methods to time:
def isin(df, old_data):
    return df["id"].isin(old_data["id"])
def apply(df, old_data):
    return df['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
def set_(df, old_data):
    old = set(old_data['id'].values)
    return [x in old for x in df['id']]
import pandas as pd
import string
old_data = pd.DataFrame({"id": list(string.ascii_lowercase[:15])})
df = pd.DataFrame({"id": list(string.ascii_lowercase)})
Small DataFrame tests:
# Tests ran in jupyter notebook
%timeit isin(df, old_data)
184 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit apply(df, old_data)
926 µs ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit set_(df, old_data)
28.8 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Large dataframe tests:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit isin(df, old_data)
122 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit apply(df, old_data)
56.9 s ± 6.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set_(df, old_data)
974 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It seems the set method is faster than the isin method for a small dataframe. However, that comparison flips radically for a much larger dataframe, so in most cases the isin method will be the best way to go. The apply method is always the slowest of the bunch, regardless of dataframe size.
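Note that isin returns True where the id is already present, while the code in the question wants 0 in that case and 1 otherwise. A minimal sketch of mapping one onto the other, with small made-up frames standing in for data and old_data:

```python
import pandas as pd

# Hypothetical stand-ins for the asker's frames.
old_data = pd.DataFrame({"id": ["a", "b", "c"]})
data = pd.DataFrame({"id": ["a", "x", "c", "y"]})

# 0 if the id already exists in old_data, 1 if it is seen for the first time.
data["first_time_it_happen"] = (~data["id"].isin(old_data["id"])).astype(int)

print(data)
```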
I have the following issue: I have a matrix yj of size (m, 200) (m = 3683), and I have a dictionary that, for each key, returns a numpy array of row indices for yj (the size of the array differs from key to key, in case anyone is wondering).
Now, I have to access this matrix lots of times (around 1M times) and my code is slowing down because of the indexing (I've profiled the code and it takes 65% of time on this step).
Here is what I've tried out:
First of all, use the indices for slicing:
>> %timeit yj[R_u_idx_train[1]]
10.5 µs ± 79.7 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
The variable R_u_idx_train is the dictionary that has the row indices.
I thought that maybe boolean indexing might be faster:
>> %timeit yj[R_u_idx_train_mask[1]]
10.5 µs ± 159 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
R_u_idx_train_mask is a dictionary that returns a boolean array of size m where the indices given by R_u_idx_train are set to True.
I also tried np.ix_
>> cols = np.arange(0,200)
>> %timeit ix_ = np.ix_(R_u_idx_train[1], cols); yj[ix_]
42.1 µs ± 353 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I also tried np.take
>> %timeit np.take(yj, R_u_idx_train[1], axis=0)
2.35 ms ± 88.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
And while this seems great, it is not, since it gives an array that is shape (R_u_idx_train[1].shape[0], R_u_idx_train[1].shape[0]) (it should be (R_u_idx_train[1].shape[0], 200)). I guess I'm not using the method correctly.
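As a sanity check on the shape concern, here is a small sketch with made-up data; row_idx is a hypothetical stand-in for R_u_idx_train[1]. In this setup np.take with axis=0 returns the same array as plain integer indexing, so a square result may point to the axis argument or to the contents of the index array rather than to np.take itself.

```python
import numpy as np

rng = np.random.default_rng(0)
yj = rng.random((3683, 200))
row_idx = np.array([5, 17, 42, 1000])    # hypothetical row indices

by_indexing = yj[row_idx]
by_take = np.take(yj, row_idx, axis=0)   # select whole rows along axis 0

print(by_indexing.shape, by_take.shape)
```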
I also tried np.compress
>> %timeit np.compress(R_u_idx_train_mask[1], yj, axis=0)
14.1 µs ± 124 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Finally I tried to index with a boolean matrix
>> %timeit yj[R_u_idx_train_mask2[1]]
244 µs ± 786 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, is 10.5 µs ± 79.7 ns per loop the best I can do? I could try to use cython but that seems like a lot of work for just indexing...
Thanks a lot.
A very smart solution was given by V.Ayrat in the comments.
>> newdict = {k: yj[R_u_idx_train[k]] for k in R_u_idx_train.keys()}
>> %timeit newdict[1]
202 ns ± 6.7 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
Anyway maybe it would still be cool to know if there is a way to speed it up using numpy!
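A small sketch of what the precomputed dictionary trades away, using made-up data in place of yj and R_u_idx_train. Integer-array indexing returns a copy, so the cached blocks cost extra memory and will not reflect later changes to yj:

```python
import numpy as np

rng = np.random.default_rng(0)
yj = rng.random((3683, 200))
# Hypothetical index dictionary with arrays of different lengths.
R_u_idx_train = {1: np.array([0, 3, 7]), 2: np.array([10, 11, 12, 13])}

# Do the indexing once per key; later lookups are plain dict accesses.
newdict = {k: yj[idx] for k, idx in R_u_idx_train.items()}

same_values = np.array_equal(newdict[1], yj[R_u_idx_train[1]])
shares_memory = np.shares_memory(newdict[1], yj)   # copies, not views

print(same_values, shares_memory)
```

This only pays off if yj stays fixed between lookups; if yj is updated inside the loop, the cache has to be rebuilt.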
I am curious why, when a function is applied to each element of a pd.Series inside a for loop, the execution time is significantly lower than N times the cost of a single call.
Consider the function below, which rotates a number bit-wise; the code itself is not important here.
def rotate(x: np.uint32) -> np.uint32:
    return np.uint32(x >> 1) | np.uint32((x & 1) << 31)
When this code is executed 1000 times in a for loop, it takes on the order of 1000 times as long, as expected.
x = np.random.randint(2 ** 32 - 1, dtype=np.uint32)
%timeit rotate(x)
# 13 µs ± 807 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%%timeit
for i in range(1000):
    rotate(x)
# 9.61 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
However, when I apply this code inside a for loop over a Series of size 1000, it gets significantly faster.
s = pd.Series(np.random.randint(2 ** 32 - 1, size=1000, dtype=np.uint32))
%%timeit
for x in s:
    rotate(x)
# 2.08 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
What mechanism makes this happen?
Note in your first loop you're not actually using the next value of the iterator. The following is a better comparison:
...: %%timeit
...: for i in range(1000):
...:     rotate(i)
...:
1.46 ms ± 71.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
...: %%timeit
...: for x in s:
...:     rotate(x)
...:
1.6 ms ± 66.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Not surprisingly, they perform more or less the same.
In your original example, x is a variable declared outside the loop, so the interpreter has to load it with LOAD_GLOBAL 2 (x), whereas a loop variable such as i can be loaded with LOAD_FAST 0 (i), which, as the name hints, is faster.
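One further thing that may be worth checking (my assumption, not part of the answer above) is the type of the value passed to rotate in each loop. In the question's first loop x is a np.uint32 scalar, while iterating over a Series yields plain Python int objects, so the shift and mask inside rotate run on different types. A minimal sketch that only inspects the types:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(5, dtype=np.uint32))
x = np.uint32(3)

type_of_x = type(x)                                # numpy scalar
type_from_series = type(next(iter(s)))             # Python scalar
type_from_array = type(next(iter(s.to_numpy())))   # numpy scalar again

print(type_of_x, type_from_series, type_from_array)
```

Timing rotate(int(x)) against rotate(x) would show how much of the gap this accounts for on a given machine.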
Consider the following code:
import numpy as np
import pandas as pd
a = pd.DataFrame({'case': np.arange(10000) % 100,
'x': np.random.rand(10000) > 0.5})
%timeit any(a.x)
%timeit a.x.max()
%timeit a.groupby('case').x.transform(any)
%timeit a.groupby('case').x.transform(max)
13.2 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
195 µs ± 811 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
25.9 ms ± 555 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
1.43 ms ± 13.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
b = pd.DataFrame({'x': np.random.rand(100) > 0.5})
%timeit any(b.x)
%timeit b.x.max()
13.1 µs ± 205 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
81.5 µs ± 1.81 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
We see that "any" works faster than "max" on a boolean pandas.Series of size 100 and 10000, but when we try to groupby and transform data in groups of 100, suddenly "max" is a lot faster than "any". Why?
Because any is evaluated lazily, which means that the any function stops at the first True element.
max, however, can't do so, because it has to inspect every element of the sequence to be sure it hasn't missed a greater one.
That's why max always inspects all elements, while any only inspects the elements up to the first True.
The cases where max is faster are probably those involving type coercion: since all values in numpy are stored in their own types and formats, vectorised mathematical operations may be faster than Python's any.
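The short-circuit behaviour of the built-in any can be observed directly. In this sketch, tracked is a hypothetical helper that records every element handed to any:

```python
consumed = []

def tracked(values):
    """Yield the values one by one, recording each element that is requested."""
    for v in values:
        consumed.append(v)
        yield v

result = any(tracked([False, True, False, False]))
n_consumed_by_any = len(consumed)

consumed.clear()
largest = max(tracked([False, True, False, False]))
n_consumed_by_max = len(consumed)

print(result, n_consumed_by_any, largest, n_consumed_by_max)
```

any stops after the second element, while max has to consume all four.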
As noted in the comments, Python's built-in any function has a short-circuit mechanism, whereas np.any does not.
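For the groupby case, one option that may be worth timing on your own data (an assumption on my part, not something measured in the answers above) is to pass the string names "any" and "max" to transform, so that pandas can use its own grouped implementations instead of calling a Python function once per group. A minimal sketch on data shaped like the question's:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = pd.DataFrame({"case": np.arange(10_000) % 100,
                  "x": rng.random(10_000) > 0.5})

# String aliases instead of the Python built-ins any / max.
by_name_any = a.groupby("case").x.transform("any")
by_name_max = a.groupby("case").x.transform("max")

print(by_name_any.head())
```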
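True in a.x is even faster, but note that the in operator on a Series tests the index labels, not the values, so it does not answer the same question: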
%timeit any(a.x)
53.6 µs ± 543 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit True in (a.x)
3.39 µs ± 31.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
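A small sketch of what the in operator actually checks on a Series, using a made-up Series with string labels so that labels and values cannot be confused:

```python
import pandas as pd

s = pd.Series([10, 20], index=["a", "b"])

label_in_series = "a" in s            # membership is tested against the index
value_in_series = 10 in s             # 10 is a value, not a label
value_in_values = 10 in s.to_numpy()  # test the values explicitly

print(label_in_series, value_in_series, value_in_values)
```

To test the values, check against s.to_numpy() (or s.values), or use s.any() for a boolean Series.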
Question
I'm looking for the fastest way to drop a set of rows whose indices I have, or equivalently to select the rows whose indices are not in that set (which results in the same dataset), from a large Pandas DataFrame.
So far I have two solutions, which seem relatively slow to me:
df.loc[df.index.difference(indices)]
which takes ~115 sec on my dataset
df.drop(indices)
which takes ~215 sec on my dataset
Is there a faster way to do this? Preferably in Pandas.
Performance of proposed Solutions
~41 sec: df[~df.index.isin(indices)] by @jezrael
I believe you can create a boolean mask, invert it with ~, and filter by boolean indexing:
df1 = df[~df.index.isin(indices)]
As @user3471881 mentioned, if you plan to manipulate the filtered df later, you need to add copy to avoid chained indexing:
df1 = df[~df.index.isin(indices)].copy()
The speed of this filtering depends on the number of matched indices and on the length of the DataFrame.
So another possible solution is to create an array/list of the indices to keep, so that inverting is not necessary:
df1 = df[df.index.isin(need_indices)]
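Both variants, sketched on a tiny made-up frame (the labels and values are illustrative only). The .copy() makes the result independent of df, so modifying it afterwards leaves the original untouched:

```python
import pandas as pd

df = pd.DataFrame({"value": [10, 20, 30, 40, 50]}, index=list("abcde"))
indices = ["b", "d"]             # labels to drop
need_indices = ["a", "c", "e"]   # labels to keep

dropped = df[~df.index.isin(indices)].copy()
kept = df[df.index.isin(need_indices)].copy()

dropped["value"] *= 2            # modifies the copy only

print(dropped)
print(df)
```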
Using iloc (or loc, see below) and Index.drop:
df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)
%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]
41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
As @jezrael points out, you can only use iloc if the index is a RangeIndex; otherwise you will have to use loc. But this is still faster than df[~df.index.isin(indices)] (see why below).
All three options on 10 million rows:
df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)
%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]
4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why does super slow loc outperform boolean_indexing?
Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):
%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)
4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can compare this to the performance of boolean_indexing vs iloc vs loc:
boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)
%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]
489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
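A small sketch, on a made-up 12-row frame, checking that the three approaches timed above select the same rows. The iloc variant relies on the labels being equal to the positions, which holds for the default RangeIndex used here:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"value": np.arange(12)})
indices = np.arange(0, 12, 3)            # labels to drop: 0, 3, 6, 9

via_mask = df[~df.index.isin(indices)]

kept_labels = df.index.drop(indices)     # Index.drop returns the remaining labels
via_loc = df.loc[kept_labels]
via_iloc = df.iloc[kept_labels]          # valid only because labels == positions

print(list(via_loc.index))
```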
If the order of rows doesn't matter, you can rearrange them in place:
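import numpy as np
import pandas as pd
from numba import njit
n = 10**7
df = pd.DataFrame(np.arange(4 * n).reshape(n, 4))
indices = np.unique(np.random.randint(0, n, size=n // 2))  # sorted, unique row positions
@njit
def _dropfew(values, indices):
    # Overwrite each dropped row with a row taken from the end of the array.
    k = len(values) - 1
    for ind in indices[::-1]:
        values[ind] = values[k]
        k -= 1
def dropfew(df, indices):
    # Relies on df.values being a view of the data (single dtype), so df is modified in place.
    _dropfew(df.values, indices)
    return df.iloc[:len(df) - len(indices)]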
Runs :
In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s
In [40]: %time dropfew(df,indices)
Wall time: 219 ms
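The same swap-from-the-end idea can be checked on a small array without numba. This is a plain-Python sketch, not the code that was timed above; drop_rows_unordered is a hypothetical name, and the function assumes the indices are sorted and unique, as np.unique guarantees:

```python
import numpy as np

def drop_rows_unordered(values, sorted_unique_indices):
    """Overwrite each dropped row with a row from the end, then cut off the tail."""
    k = len(values) - 1
    for ind in sorted_unique_indices[::-1]:   # largest index first
        values[ind] = values[k]
        k -= 1
    return values[: len(values) - len(sorted_unique_indices)]

values = np.arange(20).reshape(10, 2)
original = values.copy()
indices = np.array([1, 4, 9])

result = drop_rows_unordered(values, indices)

# Same set of rows as a regular delete, but in a different order.
expected_rows = {tuple(row) for row in np.delete(original, indices, axis=0)}
result_rows = {tuple(row) for row in result}

print(result)
```

The input array is modified in place, which is where the speed comes from, so it should not be reused afterwards.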