Given a DataFrame, I am trying to count how many rows have a specific value in one column while also having specific values in another column at the same index.
In this instance the output should be 2, since the condition is df['z'] == 4 and df['x'] == 'C', and only rows 10 and 11 meet this requirement.
My code does not output any result, only a warning message:
:5: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
  if (df[df['z']== 4].index.values) == (df[df['x']== 'C'].index.values):
Besides fixing this issue, is there a more 'pythonic' way of doing this without a for loop?
import numpy as np
import pandas as pd
data = [['A', 1, 2, 5, 'blue'],
        ['A', 1, 5, 6, 'blue'],
        ['A', 4, 4, 7, 'blue'],
        ['B', 6, 5, 4, 'yellow'],
        ['B', 9, 9, 3, 'blue'],
        ['B', 7, 9, 1, 'yellow'],
        ['B', 2, 3, 1, 'yellow'],
        ['B', 5, 1, 2, 'yellow'],
        ['C', 2, 10, 9, 'green'],
        ['C', 8, 2, 8, 'green'],
        ['C', 5, 4, 3, 'green'],
        ['C', 8, 4, 3, 'green']]
df = pd.DataFrame(data, columns=['x','y','z','xy', 'color'])
k = 0
print(df[df['z'] == 4].index.values)
print(df[df['x'] == 'C'].index.values)
for i in df['z']:
    if (df[df['z'] == 4].index.values) == (df[df['x'] == 'C'].index.values):
        k += 1
print(k)
Try:
c = df['z'].eq(4) & df['x'].eq('C')  # your condition
Finally:
count = df[c].index.size
# OR
count = len(df[c].index)
Output:
print(count)
>>> 2
You can do the following:
df[(df['z']==4) & (df['x']=='C')].shape[0]
#2
Assuming just the number is necessary and not the filtered frame, calculating the number of True values in the Boolean Series is faster:
Calculate the conditions as Boolean Series:
m = df['z'].eq(4) & df['x'].eq('C')
Count True values via Series.sum:
k = m.sum()
or via np.count_nonzero:
k = np.count_nonzero(m)
k:
2
Timing Information via %timeit:
All timings exclude creation of the boolean mask, since every approach uses the same mask and that cost is identical in all cases:
m = df['z'].eq(4) & df['x'].eq('C')
Henry Ecker (This Answer)
%timeit m.sum()
25.6 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.count_nonzero(m)
7 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
IoaTzimas
%timeit df[m].shape[0]
151 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Anurag Dabas
%timeit df[m].index.size
163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit len(df[m].index)
165 µs ± 5.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
SeaBean
%timeit df.loc[m].shape[0]
151 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(Without loc is the same as IoaTzimas)
You can use .loc with a boolean mask to select the matching rows. Then, use shape[0] to get the row count:
df.loc[(df['z']== 4) & (df['x']== 'C')].shape[0]
The row selection works the same with or without .loc, so this is equivalent to:
df[(df['z']== 4) & (df['x']== 'C')].shape[0]
However, it is good practice to use .loc rather than omit it. You can refer to this post for further information.
Result:
2
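For context (a sketch, not from the original answer): the choice mainly matters when assigning values, since chained indexing may operate on a copy while .loc assigns on the original DataFrame:
mask = (df['z'] == 4) & (df['x'] == 'C')
# df[mask]['color'] = 'red'    # chained indexing: may raise SettingWithCopyWarning and leave df unchanged
df.loc[mask, 'color'] = 'red'  # assigns on df itself, as intended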
Related
Given a large DataFrame df, which is faster in general?
# combining the masks first
sub_df = df[(df["column1"] < 5) & (df["column2"] > 10)]
# applying the masks sequentially
sub_df = df[df["column1"] < 5]
sub_df = sub_df[sub_df["column2"] > 10]
The first approach only selects from the DataFrame once, which may be faster; however, the second selection in the second example only has to consider a smaller DataFrame.
It depends on your dataset.
First let's generate a DataFrame with almost all values that should be dropped in the first condition:
n = 1_000_000
p = 0.0001
np.random.seed(0)
df = pd.DataFrame({'column1': np.random.choice([0,6], size=n, p=[p, 1-p]),
'column2': np.random.choice([0,20], size=n)})
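The timing commands themselves are not shown; presumably something like the following was run (IPython, each approach in its own cell):
%timeit df[(df['column1'] < 5) & (df['column2'] > 10)]   # simultaneous conditions

%%timeit
sub_df = df[df['column1'] < 5]                           # successive slicing
sub_df = sub_df[sub_df['column2'] > 10]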
And as expected:
# simultaneous conditions
5.69 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# successive slicing
2.99 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It is faster to first generate a small intermediate.
Now, let's change the probability to p = 0.9999. This means that the first condition will remove very few rows.
We could expect both solutions to run with a similar speed, but:
# simultaneous conditions
27.5 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# successive slicing
55.7 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now the overhead of creating the intermediate DataFrame is not negligible.
I am trying to restructure a df for productivity. At some point I need to verify whether an id exists in a list and set an indicator based on that, but it's too slow (something like 30 seconds for the df).
Can you enlighten me on a better way to do it?
This is my current code:
data['first_time_it_happen'] = data['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)
(I already tried to use the column as a Series, but it did not work correctly.)
To settle some debate in the comment section, I ran some timings.
Methods to time:
def isin(df, old_data):
    return df["id"].isin(old_data["id"])

def apply(df, old_data):
    return df['id'].apply(lambda x: 0 if x in old_data['id'].values else 1)

def set_(df, old_data):
    old = set(old_data['id'].values)
    return [x in old for x in df['id']]
import pandas as pd
import string
old_data = pd.DataFrame({"id": list(string.ascii_lowercase[:15])})
df = pd.DataFrame({"id": list(string.ascii_lowercase)})
Small DataFrame tests:
# Tests ran in jupyter notebook
%timeit isin(df, old_data)
184 µs ± 5.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit apply(df, old_data)
926 µs ± 64.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit set_(df, old_data)
28.8 µs ± 1.16 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Large DataFrame tests:
df = pd.concat([df] * 100000, ignore_index=True)
%timeit isin(df, old_data)
122 ms ± 22.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit apply(df, old_data)
56.9 s ± 6.37 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit set_(df, old_data)
974 ms ± 15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
It seems the set method is a smidge faster than the isin method for a small dataframe. However, that comparison radically flips for a much larger dataframe: in most cases the isin method will be the best way to go. The apply method is always the slowest of the bunch regardless of dataframe size.
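Mapping the fastest approach back onto the original question (a sketch, not part of the timings above): isin() is True when the id is already present in old_data, while the original apply() returned 0 in that case, so the mask needs inverting:
data['first_time_it_happen'] = (~data['id'].isin(old_data['id'])).astype(int)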
I found two forms of replacing some values of a data frame based on a condition:
.loc
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
np.where()
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
Both forms work well, but which is the preferred one? And in relation to the question, when should I use .loc and when np.where?
Well, not a thorough test, but here's a sample. In each run (loc, np.where), the data is reset to the original random values using the seed.
toy data 1
Here, there are more np.nan than valid values. Also, the column is of float type.
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice((1, np.nan), 1000000, p=(0.3,0.7))})
# loc
%%timeit
mask = df['param'].isnull()
df.loc[mask, 'param'] = 'new_value'
# 46.7 ms ± 177 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# np.where
%%timeit
mask = df['param'].isnull()
df['param'] = np.where(mask, 'new_value', df['param'])
# 86.8 ms ± 2.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
toy data 2:
Here there are fewer np.nan than valid values, and the column is of object type:
np.random.seed(1)
df = pd.DataFrame({'param': np.random.choice(("1", np.nan), 1000000, p=(0.7,0.3))})
same story:
df.loc[mask, 'param'] = 'new_value'
# 47.8 ms ± 350 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
df['param'] = np.where(mask, 'new_value', df['param'])
# 58.9 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So contrary to @cs95's comment, loc seems to outperform np.where.
For what it's worth, I'm working with a very large dataset (millions of rows, 100+ columns) and I was using df.loc for a simple substitution, and it would often take literally hours. When I changed to np.where it worked instantly.
The following code runs in a Jupyter notebook:
np.random.seed(42)
df1 = pd.DataFrame({'a':np.random.randint(0, 10, 10000)})
%%timeit
df1["a"] = np.where(df1["a"] == 2, 8, df1["a"])
# 163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%%timeit
df1.loc[df1['a']==2,'a'] = 8
# 203 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df1.loc[np.where(df1.a.values==2)]['a'] = 8
# 383 µs ± 9.44 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
%%timeit
df1.iloc[np.where(df1.a.values==2),0] = 8
# 101 µs ± 870 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
I have a question about the third variant: why does df1.loc[np.where(df1.a.values==2), 'a'] = 8 report an error?
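A likely explanation (not confirmed in the thread): np.where called with only a condition returns a tuple of arrays, and .loc tries to treat that tuple as a multi-axis key, hence the error. Unpacking the positions first should work here, since the default RangeIndex makes labels equal to positions:
pos = np.where(df1.a.values == 2)[0]  # take the array out of the tuple
df1.loc[pos, 'a'] = 8                 # labels == positions for the default RangeIndex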
Question
I'm looking for the fastest way to drop a set of rows whose indices I have, or to take the subset given by the difference of these indices (which results in the same dataset), from a large Pandas DataFrame.
So far I have two solutions, which seem relatively slow to me:
df.loc[df.index.difference(indices)]
which takes ~115 sec on my dataset
df.drop(indices)
which takes ~215 sec on my dataset
Is there a faster way to do this? Preferably in Pandas.
Performance of proposed solutions
~41 sec: df[~df.index.isin(indices)] by @jezrael
I believe you can create boolean mask, inverting by ~ and filtering by boolean indexing:
df1 = df[~df.index.isin(indices)]
As @user3471881 mentioned, to avoid chained indexing, if you are planning on manipulating the filtered df later it is necessary to add copy():
df1 = df[~df.index.isin(indices)].copy()
This filtering depends on the number of matched indices and also on the length of the DataFrame.
So another possible solution is to create an array/list of the indices to keep; then inverting is not necessary:
df1 = df[df.index.isin(need_indices)]
Using iloc (or loc, see below) and Index.drop:
df = pd.DataFrame(np.arange(0, 1000000, 1))
indices = np.arange(0, 1000000, 3)
%timeit -n 100 df[~df.index.isin(indices)]
%timeit -n 100 df.iloc[df.index.drop(indices)]
41.3 ms ± 997 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
32.7 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
As @jezrael points out, you can only use iloc if the index is a RangeIndex, otherwise you will have to use loc. But this is still faster than df[~df.index.isin(indices)] (see why below).
All three options on 10 million rows:
df = pd.DataFrame(np.arange(0, 10000000, 1))
indices = np.arange(0, 10000000, 3)
%timeit -n 10 df[~df.index.isin(indices)]
%timeit -n 10 df.iloc[df.index.drop(indices)]
%timeit -n 10 df.loc[df.index.drop(indices)]
4.98 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
752 ms ± 51.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.65 s ± 69.9 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Why does super slow loc outperform boolean_indexing?
Well, the short answer is that it doesn't. df.index.drop(indices) is just a lot faster than ~df.index.isin(indices) (given above data with 10 million rows):
%timeit -n 10 ~df.index.isin(indices)
%timeit -n 10 df.index.drop(indices)
4.55 s ± 129 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
388 ms ± 10.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
We can compare this to the performance of boolean_indexing vs iloc vs loc:
boolean_mask = ~df.index.isin(indices)
dropped_index = df.index.drop(indices)
%timeit -n 10 df[boolean_mask]
%timeit -n 10 df.iloc[dropped_index]
%timeit -n 10 df.loc[dropped_index]
489 ms ± 25.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
371 ms ± 10.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.38 s ± 153 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
If the order of rows doesn't matter, you can rearrange them in place:
import numpy as np
import pandas as pd
from numba import njit

n = 10**7
df = pd.DataFrame(np.arange(4 * n).reshape(n, 4))
indices = np.unique(np.random.randint(0, n, size=n // 2))

@njit
def _dropfew(values, indices):
    # overwrite each row to drop with a row taken from the end of the array
    k = len(values) - 1
    for ind in indices[::-1]:
        values[ind] = values[k]
        k -= 1

def dropfew(df, indices):
    # mutates df.values in place (relies on .values being a view over a
    # single homogeneous dtype block), then trims the now-unused tail rows
    _dropfew(df.values, indices)
    return df.iloc[:len(df) - len(indices)]
Runs:
In [39]: %time df.iloc[df.index.drop(indices)]
Wall time: 1.07 s
In [40]: %time dropfew(df,indices)
Wall time: 219 ms
I want to store in a new variable the last digit from a 'UserId' (the UserId is of type string).
I came up with this, but it's a long df and it takes forever. Any tips on how to optimize it / avoid the for loop?
df['LastDigit'] = np.nan
for i in range(0, len(df['UserId'])):
    df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
Use str.strip with indexing by str[-1]:
df['LastDigit'] = df['UserId'].str.strip().str[-1]
If performance is important and no missing values use list comprehension:
df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
Your solution is really slow; it is the last (slowest) solution from this:
6) updating an empty frame (e.g. using loc one-row-at-a-time)
Performance:
np.random.seed(456)
users = ['joe','jan ','ben','rick ','clare','mary','tom']
df = pd.DataFrame({
'UserId': np.random.choice(users, size=1000),
})
In [139]: %%timeit
...: df['LastDigit'] = np.nan
...: for i in range(0,len(df['UserId'])):
...: df.loc[i]['LastDigit'] = df.loc[i]['UserId'].strip()[-1]
...:
__main__:3: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
57.9 s ± 1.48 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [140]: %timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
1.38 ms ± 150 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [141]: %timeit df['LastDigit'] = [x.strip()[-1] for x in df['UserId']]
343 µs ± 8.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Another option is to use apply. It is not as performant as the list comprehension, but very flexible depending on your goals. Here are some timings on a random dataframe with shape (44289, 31):
%timeit df['LastDigit'] = df['UserId'].apply(lambda x: str(x)[-1]) #if some variables are not strings
12.4 ms ± 215 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%timeit df['LastDigit'] = df['UserId'].str.strip().str[-1]
31.5 ms ± 688 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit df['LastDigit'] = [str(x).strip()[-1] for x in df['UserId']]
9.7 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)