I found two ways of replacing some values of a data frame based on a condition:
.loc
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
np.where()
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
Both work well, but which one is preferred? And, relatedly, when should I use .loc and when np.where?
Well, not a thorough test, but here's a sample. In each run (loc, np.where), the data is reset to the original seeded random values.
Toy data 1:
Here, there are more np.nan than valid values. Also, the column is of float type.
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice((1, np.nan), 1000000, p=(0.3,0.7))})
# loc
%%timeit
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
# 46.7 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# np.where
%%timeit
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
# 86.8 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Toy data 2:
Here there are fewer np.nan than valid values, and the column is of object type:
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice(("1", np.nan), 1000000, p=(0.7,0.3))})
Same story, with the mask recomputed the same way:
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
# 47.8 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['param'] = np.where(mask, 'new_value', df['param'])
# 58.9 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So, contrary to @cs95's comment, loc seems to outperform np.where here.
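One practical difference, beyond speed, is what happens to the values that are not replaced. With .loc the untouched floats stay floats (the column becomes object dtype holding a mix of floats and strings), whereas np.where builds a new numpy array and, when mixing strings with floats, numpy promotes everything to strings. A quick check of that claim, as observed on the pandas/numpy versions used for the original timings (recent pandas versions warn or raise when .loc assigns an incompatible dtype):
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice((1, np.nan), 1000000, p=(0.3, 0.7))})
mask = df['param'].isnull()
df_loc = df.copy()
df_loc.loc[mask, 'param'] = 'new_value'          # unmasked entries keep their float values
print(df_loc['param'].dtype, df_loc.loc[~mask, 'param'].head(2).tolist())
df_np = df.copy()
df_np['param'] = np.where(mask, 'new_value', df_np['param'])  # numpy promotes the floats to strings
print(df_np['param'].dtype, df_np.loc[~mask, 'param'].head(2).tolist())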
For what it's worth, I'm working with a very large dataset (millions of rows, 100+ columns), and I was using df.loc for a simple substitution, which would often take literally hours. When I changed to np.where, it worked instantly.
The code was run in a Jupyter notebook:
np.random.seed(42)
df1 = pd.DataFrame({'a':np.random.randint(0, 10, 10000)})
%%timeit
df1["a"] = np.where(df1["a"] == 2, 8, df1["a"])
# 163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df1.loc[df1['a']==2,'a'] = 8
# 203 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df1.loc[np.where(df1.a.values==2)]['a'] = 8
# 383 µs ± 9.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df1.iloc[np.where(df1.a.values==2),0] = 8
# 101 µs ± 870 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I have a question about the third way of writing it: why does df1.loc[np.where(df1.a.values==2), 'a'] = 8 report an error?
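A possible explanation (my reading, not from the original answer): np.where called with just a condition returns a tuple of index arrays, so .loc receives a tuple as the row indexer and cannot combine it with the 'a' column label. Unpacking the tuple works, because the integer positions happen to match the labels of the default RangeIndex here:
import numpy as np
import pandas as pd
np.random.seed(42)
df1 = pd.DataFrame({'a': np.random.randint(0, 10, 10000)})
rows = np.where(df1['a'].values == 2)[0]  # plain integer array instead of a tuple
df1.loc[rows, 'a'] = 8                    # labels coincide with positions on a RangeIndex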
Pandas fillna() is significantly slow, especially if there is a large amount of missing data in a dataframe.
Is there any quicker way?
(I know that it would help if I simply dropped some of the rows and/or columns that contain the NAs.)
I tried to test:
np.random.seed(123)
N = 60000
df = pd.DataFrame(np.random.choice(['a', None], size=(N, 20), p=(.7, .3)))
In [333]: %timeit df.fillna('b')
93.5 ms ± 1.28 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [337]: %timeit df[df.isna()] = 'b'
122 ms ± 2.75 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
A slightly changed solution (though I feel it is a bit hacky):
# pandas below 0.24
In [335]: %timeit df.values[df.isna()] = 'b'
56.7 ms ± 799 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
#pandas 0.24+
In [339]: %timeit df.to_numpy()[df.isna()] = 'b'
56.5 ms ± 951 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
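One caveat worth adding (my note, not from the original answer): writing through .values / .to_numpy() only modifies the DataFrame when the returned array is a view of its data, which holds here because every column shares the same object dtype; with mixed dtypes the array is a copy and the assignment is silently lost. It also relies on older pandas behaviour, as copy-on-write in newer versions may prevent the in-place write. A minimal sketch of the two patterns under those assumptions:
import numpy as np
import pandas as pd
np.random.seed(123)
df = pd.DataFrame(np.random.choice(['a', None], size=(60000, 20), p=(.7, .3)))
filled = df.fillna('b')                      # returns a new DataFrame; df is untouched
df.to_numpy()[df.isna().to_numpy()] = 'b'    # writes into df in place (view assumption)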
I am trying to restructure a DataFrame. At some point I need to check whether an id exists in a list and set an indicator based on that, but it is too slow (something like 30 seconds for the DataFrame).
Can you enlighten me on a better way to do it?
This is my current code:
data['first_time_it_happen'] = data['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
(I already tried using the column as a Series, but it did not work correctly.)
To settle some debate in the comment section, I ran some timings.
Methods to time:
def isin(df, old_data):
    return df["id"].isin(old_data["id"])

def apply(df, old_data):
    return df['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)

def set_(df, old_data):
    old = set(old_data['id'].values)
    return [x in old for x in df['id']]
import pandas as pd
import string
old_data = pd.DataFrame({"id": list(string.ascii_lowercase[:15])})
df = pd.DataFrame({"id": list(string.ascii_lowercase)})
Small DataFrame tests:
# Tests ran in jupyter notebook
%timeit isin(df, old_data)
184 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit apply(df, old_data)
926 µs ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit set_(df, old_data)
28.8 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Large DataFrame tests:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit isin(df, old_data)
122 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit apply(df, old_data)
56.9 s ± 6.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set_(df, old_data)
974 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It seems the set method is faster than the isin method for a small DataFrame, but that comparison radically flips for a much larger one. In most cases the isin method will be the best way to go, and the apply method is always the slowest of the bunch regardless of DataFrame size.
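For completeness, here is how the winning isin approach maps back onto the indicator column from the question (a sketch, assuming data and old_data are the frames from the question):
# 0 if the id already exists in old_data, 1 if it is new, mirroring the apply/lambda logic
data['first_time_it_happen'] = (~data['id'].isin(old_data['id'])).astype(int)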
I was going through the pandas documentation, which says that with parse_dates enabled, infer_datetime_format=True lets pandas try to infer the format of the datetime strings and, if it succeeds, switch to a faster parsing method, in some cases speeding up parsing by 5-10x.
I have a sample CSV data file:
Date
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
22-01-1943
15-10-1932
23-11-1910
04-05-2000
02-02-1943
01-01-1943
28-08-1943
31-12-1943
Next I tried:
In [174]: %timeit df = pd.read_csv("a.csv", parse_dates=["Date"])
1.5 ms ± 178 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [175]: %timeit df = pd.read_csv("a.csv", parse_dates=["Date"], infer_datetime_format=True)
1.73 ms ± 45 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
So, according to the documentation, it should take less time. Is my understanding correct? Or for what kind of data does the statement hold?
Update:
Pandas version - '1.0.5'
What you actually want to do is add dayfirst=True:
%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"],dayfirst = True, infer_datetime_format=True)
1.96 ms ± 115 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Compared to
%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"])
2.38 ms ± 182 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
and
%timeit df = pd.read_csv("C:/Users/k_sego/Dates.csv", parse_dates=["Date"], infer_datetime_format=True)
3.02 ms ± 670 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The point is to reduce the number of guesses read_csv has to make about how to parse the dates.
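If the format is known up front, a further option (not timed in the original answer, and assuming every date really is day-first DD-MM-YYYY) is to skip inference entirely and parse with an explicit format string after reading:
import pandas as pd
# path as in the question; to_datetime with an explicit format avoids any guessing
df = pd.read_csv("C:/Users/k_sego/Dates.csv")
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")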
I want to store the last digit of 'UserId' (a column of type string) in a new column.
I came up with this, but the DataFrame is long and it takes forever. Any tips on how to optimize it or avoid the for loop?
df['LastDigit'] = np.nan
for i in range(0, len(df['UserId'])):
    df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
Use str.strip with indexing by str[-1]:
df['LastDigit'] = df['UserId'].str.strip().str[-1]
If performance is important and there are no missing values, use a list comprehension:
df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
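If the column can contain missing values, the plain list comprehension would fail on NaN (a float). A hedged variant (my addition) that keeps NaN as NaN:
df['LastDigit'] = [x.strip()[-1] if isinstance(x, str) else np.nan for x in df['UserId']]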
Your solution is really slow; it is the last (slowest) approach from this list:
6) updating an empty frame (e.g. using loc one-row-at-a-time)
Performance:
np.random.seed(456)
users = ['joe','jan ','ben','rick ','clare','mary','tom']
df = pd.DataFrame({
    'UserId': np.random.choice(users, size=1000),
})
In [139]: %%timeit
...: df['LastDigit'] = np.nan
...: for i in range(0,len(df['UserId'])):
...:     df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
...:
__main__:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
57.9 s ± 1.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [140]: %timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
1.38 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [141]: %timeit df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
343 µs ± 8.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Another option is to use apply. It is not as performant as the list comprehension, but it is very flexible depending on your goals. Here are some timings on a random DataFrame with shape (44289, 31):
%timeit df['LastDigit'] = df['UserId'].apply(lambda x: str(x)[-1]) #if some variables are not strings
12.4 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
31.5 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['LastDigit'] = [str(x).strip()[-1] for x in df['UserId']]
9.7 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)