I am trying to structure a df for productivity at some point i need to verify if a id exist in list and give a indicator in function of that, but its too slow (something like 30 seg for df).
can you enlighten me on a better way to do it?
thats my current code:
data['first_time_it_happen'] = data['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
(i already try to use the colume like a serie but it do not work correctly)
To settle some debate in the comment section, I ran some timings.
Methods to time:
def isin(df, old_data):
return df["id"].isin(old_data["id"])
def apply(df, old_data):
return df['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
def set_(df, old_data):
old = set(old_data['id'].values)
return [x in old for x in df['id']]
import pandas as pd
import string
old_data = pd.DataFrame({"id": list(string.ascii_lowercase[:15])})
df = pd.DataFrame({"id": list(string.ascii_lowercase)})
Small DataFrame tests:
# Tests ran in jupyter notebook
%timeit isin(df, old_data)
184 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit apply(df, old_data)
926 µs ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit set_(df, old_data)
28.8 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Large dataframe tests:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit isin(df, old_data)
122 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit apply(df, old_data)
56.9 s ± 6.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set_(df, old_data)
974 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Seems like the set method is a smidge faster than the isin method for a small dataframe. However that comparison radically flips for a much larger dataframe. Seems like in most cases the isin method is will be the best way to go. Then the apply method is always the slowest of the bunch regardless of dataframe size.
Related
Pandas fillna() is significantly slow especially if there is a big amount of missing data in a dataframe.
Is there any quicker way than it?
(I know that it would help if I simply dropped some of the rows and/or columns that contain the NAs)
I try to test:
np.random.seed(123)
N = 60000
df = pd.DataFrame(np.random.choice(['a', None], size=(N, 20), p=(.7, .3)))
In [333]: %timeit df.fillna('b')
93.5 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df[df.isna()] = 'b'
122 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A bit changed solution (but I feel it is a bit hacky):
#pandas below
In [335]: %timeit df.values[df.isna()] = 'b'
56.7 ms ± 799 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#pandas 0.24+
In [339]: %timeit df.to_numpy()[df.isna()] = 'b'
56.5 ms ± 951 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
def match_score(vendor, company):
return max(fuzz.ratio(vendor, company), fuzz.partial_ratio(vendor, company), fuzz.token_sort_ratio(vendor, company))
Note: fuzz is from import fuzzywuzzy library
========================
vendor = 'RED DEER TELUS STORE'
When I try this code:
df['Vendor']=vendor
df['Score'] = np.array(match_score(tuple(df['Vendor']), tuple(df['Company'])))
I get this
However, when I try an almost identical code I get a different 'Score'?
df['Score'] = np.array(match_score(vendor, tuple(df['Company'])))
My logic in #2 is that the vendor is the same across the entire column so no need to put it in a tuple..I can just give it as a string and make the processing faster.
Can anyone explain why passing an entire column where vendor in each cell = 'RED DEER TELUS STORE' gives a different result than just passing 'RED DEER TELUS STORE' to the function as a string? Thanks!
tuple versus tolist:
In [166]: x=np.arange(10000)
In [167]: timeit tuple(x)
1.14 ms ± 26.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [168]: timeit list(x)
1.12 ms ± 2.01 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [169]: timeit x.tolist()
296 µs ± 9.98 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
or with a series
In [170]: ds = pd.Series(x)
In [171]: timeit tuple(ds)
1.22 ms ± 1.57 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [172]: timeit list(ds)
1.23 ms ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [173]: timeit ds.to_list()
394 µs ± 22.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
With a series of string values (object dtype):
In [184]: ds = pd.Series(['' for _ in range(1000)])
In [185]: ds[:] = vendor
In [186]: timeit tuple(ds)
104 µs ± 1.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [187]: timeit ds.to_list()
27.2 µs ± 179 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
As to why you get a difference between passing the Series/tuple versus the string, I think you need to examine the fuzz code/docs in more detail. Maybe even test the function(s) with small examples. I don't have fuzz installed, so can't explore that part of your calculations.
You might even want to make up some lists (or tuples) of strings, and experiment with those. I don't think this is a numpy/pandas issue. It's a matter of learning to use fuzz correctly.
I found two forms of replacing some values of a data frame based on a condition:
.loc
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
np.where()
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
Both forms work well, but which is the preferred one? And in relation to the question, when should I use .loc and when np.where?
Well, not a throughout test, but here's a sample. In each run (loc, np.where), the data is reset to the original random with seed.
toy data 1
Here, there are more np.nan than valid values. Also, the column is of float type.
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice((1, np.nan), 1000000, p=(0.3,0.7))})
# loc
%%timeit
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
# 46.7 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# np.where
%%timeit
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
# 86.8 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
toy data 2:
Here there are less np.nan than valid values, and the column is of object type:
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice(("1", np.nan), 1000000, p=(0.7,0.3))})
same story:
df.loc[mask, 'param'] = 'new_value'
# 47.8 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['param'] = np.where(mask, 'new_value', df['param'])
# 58.9 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So contrary to #cs95's comment, loc seems to outperform np.where.
for what it's worth, i'm working with a very large dataset (millions of rows, 100+ columns) and i was using df.loc for a simple substitution and it would often take literally hours. when i changed to np.where it worked instantly.
The code runs in jupyter notebook
np.random.seed(42)
df1 = pd.DataFrame({'a':np.random.randint(0, 10, 10000)})
%%timeit
df1["a"] = np.where(df1["a"] == 2, 8, df1["a"])
# 163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df1.loc[df1['a']==2,'a'] = 8
# 203 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df1.loc[np.where(df1.a.values==2)]['a'] = 8
# 383 µs ± 9.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
# I have a question about this, Why does df1.loc[np.where(df1.a.values==2), 'a']= 8 report an error
%%timeit
df1.iloc[np.where(df1.a.values==2),0] = 8
# 101 µs ± 870 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I have a question about the third way of writing, Why does df1.loc[np.where(df1.a.values==2), 'a']= 8 report an error
Question
I'm looking for the fastest way to drop a set of rows which indices I've got or get the subset of the difference of these indices (which results in the same dataset) from a large Pandas DataFrame.
So far I have two solutions, which seem relatively slow to me:
df.loc[df.difference(indices)]
which takes ~115 sec on my dataset
df.drop(indices)
which takes ~215 sec on my dataset
Is there a faster way to do this? Preferably in Pandas.
Performance of proposed Solutions
~41 sec: df[~df.index.isin(indices)] by #jezrael
I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:
df1 = df[~df.index.isin(indices)]
As #user3471881 mentioned for avoid chained indexing if you are planning on manipulating the filtered df later is necessary add copy:
df1 = df[~df.index.isin(indices)].copy()
This filtering depends of number of matched indices and also by length of DataFrame.
So another possible solution is create array/list of indices for keeping and then inverting is not necessary:
df1 = df[df.index.isin(need_indices)]
Using iloc (or loc, see below) and Series.drop:
df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)
%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]
41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
As #jezrael points out you can only use iloc if index is a RangeIndex otherwise you will have to use loc. But this is still faster than df[df.isin()] (see why below).
All three options on 10 million rows:
df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)
%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]
4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why does super slow loc outperform boolean_indexing?
Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):
%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)
4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can compare this to the performance of boolean_indexing vs iloc vs loc:
boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)
%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]
489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If order of rows doesn't mind, you can arrange them in place :
n=10**7
df=pd.DataFrame(arange(4*n).reshape(n,4))
indices=np.unique(randint(0,n,size=n//2))
from numba import njit
#njit
def _dropfew(values,indices):
k=len(values)-1
for ind in indices[::-1]:
values[ind]=values[k]
k-=1
def dropfew(df,indices):
_dropfew(df.values,indices)
return df.iloc[:len(df)-len(indices)]
Runs :
In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s
In [40]: %time dropfew(df,indices)
Wall time: 219 ms
I want to store in a new variable the last digit from a 'UserId' (such UserId is of type string).
I came up with this, but it's a long df and takes forever. Any tips on how to optimize/avoid for loop?
df['LastDigit'] = np.nan
for i in range(0,len(df['UserId'])):
df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
Use str.strip with indexing by str[-1]:
df['LastDigit'] = df['UserId'].str.strip().str[-1]
If performance is important and no missing values use list comprehension:
df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
Your solution is really slow, it is last solution from this:
6) updating an empty frame (e.g. using loc one-row-at-a-time)
Performance:
np.random.seed(456)
users = ['joe','jan ','ben','rick ','clare','mary','tom']
df = pd.DataFrame({
'UserId': np.random.choice(users, size=1000),
})
In [139]: %%timeit
...: df['LastDigit'] = np.nan
...: for i in range(0,len(df['UserId'])):
...: df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
...:
__main__:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
57.9 s ± 1.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [140]: %timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
1.38 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [141]: %timeit df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
343 µs ± 8.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Another option is to use apply. Not performant as the list comprehension but very flexible based on your goals. Here some tries on a random dataframe with shape (44289, 31)
%timeit df['LastDigit'] = df['UserId'].apply(lambda x: str(x)[-1]) #if some variables are not strings
12.4 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
31.5 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['LastDigit'] = [str(x).strip()[-1] for x in df['UserId']]
9.7 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)