Given a large DataFrame df, which is faster in general?
# combining the masks first
sub_df = df[(df["column1"] < 5) & (df["column2"] > 10)]
# applying the masks sequentially
sub_df = df[df["column1"] < 5]
sub_df = sub_df[sub_df["column2"] > 10]
The first approach selects from the DataFrame only once, which may be faster. However, the second selection in the second approach only has to consider a smaller DataFrame.
It depends on your dataset.
First let's generate a DataFrame with almost all values that should be dropped in the first condition:
import numpy as np
import pandas as pd

n = 1_000_000
p = 0.0001
np.random.seed(0)
df = pd.DataFrame({'column1': np.random.choice([0, 6], size=n, p=[p, 1-p]),
                   'column2': np.random.choice([0, 20], size=n)})
And as expected:
# simultaneous conditions
5.69 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# successive slicing
2.99 ms ± 45.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
It is faster to first generate a small intermediate.
Now, let's change the probability to p = 0.9999. This means that the first condition will remove very few rows.
We could expect both solutions to run with a similar speed, but:
# simultaneous conditions
27.5 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# successive slicing
55.7 ms ± 3.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Now the overhead of creating the intermediate DataFrame is not negligible.
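The two scenarios above can be reproduced with a small harness along the following lines. This is a minimal sketch, not the exact benchmark behind the timings quoted here: the helper names make_frame, combined and sequential are made up for illustration, n is reduced to 100,000 so it runs quickly, and np.random.default_rng is used in place of np.random.seed. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(p, n=100_000, seed=0):
    """Build a frame where column1 is 0 with probability p and 6 otherwise."""
    rng = np.random.default_rng(seed)
    return pd.DataFrame({
        "column1": rng.choice([0, 6], size=n, p=[p, 1 - p]),
        "column2": rng.choice([0, 20], size=n),
    })


def combined(df):
    # One selection with both masks combined.
    return df[(df["column1"] < 5) & (df["column2"] > 10)]


def sequential(df):
    # Two selections; the second works on the (possibly smaller) intermediate.
    sub_df = df[df["column1"] < 5]
    return sub_df[sub_df["column2"] > 10]


for p in (0.0001, 0.9999):
    df = make_frame(p)
    # Both approaches must select exactly the same rows.
    assert combined(df).equals(sequential(df))
    t_comb = timeit.timeit(lambda: combined(df), number=20) / 20
    t_seq = timeit.timeit(lambda: sequential(df), number=20) / 20
    print(f"p={p}: combined {t_comb * 1e3:.2f} ms, sequential {t_seq * 1e3:.2f} ms")
```

Checking that both functions return the same frame before timing them guards against comparing two different computations.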
Given a DataFrame, I am trying to count how many rows have a specific value in one column and, at the same index, another specific value in a second column.
In this instance the output should be 2, since the condition is df['z'] == 4 and df['x'] == 'C', and only rows 10 and 11 meet it.
My code does not print any result, only a warning message:
:5: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
if (df[df['z']== 4].index.values) == (df[df['x']== 'C'].index.values):
Besides fixing this issue, is there another more 'pythonish' way of doing this without a for loop?
import numpy as np
import pandas as pd
data = [['A', 1, 2, 5, 'blue'],
        ['A', 1, 5, 6, 'blue'],
        ['A', 4, 4, 7, 'blue'],
        ['B', 6, 5, 4, 'yellow'],
        ['B', 9, 9, 3, 'blue'],
        ['B', 7, 9, 1, 'yellow'],
        ['B', 2, 3, 1, 'yellow'],
        ['B', 5, 1, 2, 'yellow'],
        ['C', 2, 10, 9, 'green'],
        ['C', 8, 2, 8, 'green'],
        ['C', 5, 4, 3, 'green'],
        ['C', 8, 4, 3, 'green']]
df = pd.DataFrame(data, columns=['x','y','z','xy', 'color'])
k=0
print((df[df['z']==4].index.values))
print(df[df['x']== 'C'].index.values)
for i in (df['z']):
    if (df[df['z']== 4].index.values) == (df[df['x']== 'C'].index.values):
        k+=1
print(k)
Try:
c = df['z'].eq(4) & df['x'].eq('C')  # your condition
Finally:
count = df[c].index.size
# OR
count = len(df[c].index)
Output:
print(count)
# 2
You can do the following:
df[(df['z']==4) & (df['x']=='C')].shape[0]
#2
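As for the warning in the question: the two index arrays being compared have different lengths (three labels where z is 4, four labels where x is 'C'), so NumPy cannot compare them element by element. If the intent was to find the index labels common to both selections, Index.intersection expresses that directly. The sketch below uses a condensed version of the question's frame with only the x and z columns; the variable names idx_z, idx_x and common are illustrative.

```python
import pandas as pd

# Condensed version of the question's data: only the 'x' and 'z' columns.
df = pd.DataFrame({
    'x': list('AAABBBBBCCCC'),
    'z': [2, 5, 4, 5, 9, 9, 3, 1, 10, 2, 4, 4],
})

idx_z = df[df['z'] == 4].index    # labels of rows where z is 4
idx_x = df[df['x'] == 'C'].index  # labels of rows where x is 'C'

# Labels present in both selections.
common = idx_z.intersection(idx_x)
print(list(common))  # [10, 11]
print(len(common))   # 2
```

Combining the two Boolean masks with &, as in the answers here, gives the same count without building the two intermediate frames.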
Assuming just the number is necessary and not the filtered frame, calculating the number of True values in the Boolean Series is faster:
Calculate the conditions as Boolean Series:
m = df['z'].eq(4) & df['x'].eq('C')
Count True values via Series.sum:
k = m.sum()
or via np.count_nonzero:
k = np.count_nonzero(m)
k:
2
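As a quick sanity check, the different ways of counting can be run side by side on a condensed version of the question's frame (only the x and z columns are reproduced; the counts dictionary is just for illustration). This is a sketch to confirm they agree, not a benchmark.

```python
import numpy as np
import pandas as pd

# Condensed version of the question's data: only the 'x' and 'z' columns.
df = pd.DataFrame({
    'x': list('AAABBBBBCCCC'),
    'z': [2, 5, 4, 5, 9, 9, 3, 1, 10, 2, 4, 4],
})

# Boolean Series marking the rows that satisfy both conditions.
m = df['z'].eq(4) & df['x'].eq('C')

counts = {
    'Series.sum': int(m.sum()),
    'np.count_nonzero': int(np.count_nonzero(m)),
    'df[m].shape[0]': df[m].shape[0],
    'len(df[m].index)': len(df[m].index),
}
print(counts)
```

The first two entries count True values in the mask directly, while the last two build the filtered frame first and then measure it.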
Timing Information via %timeit:
All timings exclude the creation of the Boolean mask m, since every approach uses the same mask and that cost is the same in all cases:
m = df['z'].eq(4) & df['x'].eq('C')
Henry Ecker (This Answer)
%timeit m.sum()
25.6 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit np.count_nonzero(m)
7 µs ± 267 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
IoaTzimas
%timeit df[m].shape[0]
151 µs ± 2.39 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Anurag Dabas
%timeit df[m].index.size
163 µs ± 3.47 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit len(df[m].index)
165 µs ± 5.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
SeaBean
%timeit df.loc[m].shape[0]
151 µs ± 5.12 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
(Without loc is the same as IoaTzimas)
You can pass the Boolean mask to .loc to select the matching rows, then use shape[0] to get the row count:
df.loc[(df['z']== 4) & (df['x']== 'C')].shape[0]
Row selection works with or without .loc, so this is equivalent to:
df[(df['z']== 4) & (df['x']== 'C')].shape[0]
However, it is good practice to use .loc. You can refer to this post for further information.
Result:
2
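One practical difference, shown here as a small sketch on a condensed version of the question's frame (x, y and z columns only), is that .loc accepts a row mask and a column label in a single indexing step. The names mask, n_rows and y_vals are illustrative.

```python
import pandas as pd

# Condensed version of the question's data: 'x', 'y' and 'z' columns.
df = pd.DataFrame({
    'x': list('AAABBBBBCCCC'),
    'y': [1, 1, 4, 6, 9, 7, 2, 5, 2, 8, 5, 8],
    'z': [2, 5, 4, 5, 9, 9, 3, 1, 10, 2, 4, 4],
})

mask = (df['z'] == 4) & (df['x'] == 'C')

# Row count of the matching rows, as in the answer above.
n_rows = df.loc[mask].shape[0]

# Rows and a column selected in one step.
y_vals = df.loc[mask, 'y']

print(n_rows)           # 2
print(y_vals.tolist())  # [5, 8]
```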
I have a dataframe df1 in python as below:
Type Category
a 1
b 2
c 3
d 4
Expected output:
Type
a/1
b/2
c/3
d/4
The actual DataFrame is much larger than this, so I can't type out every cell of the new DataFrame.
How can I extract the columns and output them to another DataFrame, separated by '/'? Maybe using a for loop?
Using str.cat
The right pandas-y way to proceed is by using str.cat
df['Type'] = df.Type.str.cat(others=df.Category.astype(str), sep='/')
others contains the pd.Series to concatenate, and sep the separator to use.
Result
Type
0 a/1
1 b/2
2 c/3
3 d/4
Performance comparison
%%timeit
df.Type.str.cat(others=df.Category.astype(str), sep='/')
>> 286 µs ± 449 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df['Type'] + '/' + df['Category'].astype(str)
>> 348 µs ± 5.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Both solutions give the same result, but using str.cat is about 20% faster.
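For reference, here is a self-contained sketch that builds the four-row frame from the question and checks that the two expressions agree. The names via_cat, via_plus and out are illustrative, and the timings above are not reproduced here.

```python
import pandas as pd

# The small example frame from the question.
df = pd.DataFrame({'Type': list('abcd'), 'Category': [1, 2, 3, 4]})

# Category must be converted to strings before concatenating.
via_cat = df['Type'].str.cat(others=df['Category'].astype(str), sep='/')
via_plus = df['Type'] + '/' + df['Category'].astype(str)

# New frame holding only the combined column.
out = pd.DataFrame({'Type': via_cat})
print(out)
```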
(Assuming you know the size of the list beforehand)
Comparing these two functions:
def foo_1(size=1000):
    res = []
    for i in range(size):
        res.append(i)

def foo_2(size=1000):
    res = [0]*size
    for i in range(size):
        res[i] = i
%timeit foo_1(5000)
290 µs ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit foo_2(5000)
207 µs ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit foo_1(1000000)
84.5 ms ± 1.76 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit foo_2(1000000)
52 ms ± 365 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
At a few thousand elements the difference is not that significant. At around a million elements there is a difference of roughly 30 ms.
Questions
In foo_1 we append elements one by one and hence traverse the list once, but in foo_2 we traverse the list twice (once to initialize it and once to fill in the correct values). Shouldn't foo_1 be faster?
I may be getting confused because I'm thinking of a linked-list implementation for Python lists (with a pointer to the current last position), but since lists allow index-based access, I'm sure the implementation is more complicated than a simple linked list.
Given that foo_2 is faster, is there any situation or use case where the foo_1 approach (appending rather than filling) is the way to go?
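On the second question, one way to see that a list is not a linked list is to watch its allocated size while appending. In CPython a list is a dynamic array of references that is occasionally reallocated with some spare room, so sys.getsizeof changes only at certain lengths rather than on every append. The sketch below is CPython-specific, the helper name resize_points is made up for illustration, and the exact lengths printed are an implementation detail that varies between versions. It does not by itself explain the whole timing gap; the cost of calling res.append on every iteration is likely another contributor.

```python
import sys


def resize_points(size=1000):
    """Return the list lengths at which sys.getsizeof(res) changed while appending."""
    res = []
    points = []
    last = sys.getsizeof(res)
    for i in range(size):
        res.append(i)
        current = sys.getsizeof(res)
        if current != last:
            # The underlying array was reallocated at this length.
            points.append(len(res))
            last = current
    return points


points = resize_points(1000)
print(f"{len(points)} reallocations for 1000 appends")
print("first few lengths where the size changed:", points[:8])
```

Preallocating with [0]*size performs a single allocation up front, whereas appending is the natural choice when the final size is not known in advance.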
Suppose I have a DataFrame such as:
df = pd.DataFrame(np.random.randn(10,5), columns = ['a','b','c','d','e'])
and I would like to retrieve the last value in column e. I could do:
df['e'].tail(1)
but this returns a Series that still carries the index label 9. Ideally, I just want to obtain the value as a number that I can work with directly. I could also do:
np.array(df['e'].tail(1))
but this would then require me to access the 0th element before I can really work with it. Is there a more direct or easier way to do this?
You could use iloc:
In [26]: df
Out[26]:
a b c d e
0 -1.079547 -0.722903 0.457495 -0.687271 -0.787058
1 1.326133 1.359255 -0.964076 -1.280502 1.460792
2 0.479599 -1.465210 -0.058247 -0.984733 -0.348068
3 -0.608238 -1.238068 -0.126889 0.572662 -1.489641
4 -1.533707 -0.218298 -0.877619 0.679370 0.485987
5 -0.864651 -0.180165 -0.528939 0.270885 1.313946
6 0.747612 -1.206509 0.616815 -1.758354 -0.158203
7 -2.309582 -0.739730 -0.004303 0.125640 -0.973230
8 1.735822 -0.750698 1.225104 0.431583 -1.483274
9 -0.374557 -1.132354 0.875028 0.032615 -1.131971
In [27]: df['e'].iloc[-1]
Out[27]: -1.1319705662711321
Or, if you want just a scalar, you could use iat, which is faster. From the docs:
If you only want to access a scalar value, the fastest way is to use the at and iat methods, which are implemented on all of the data structures
In [28]: df.e.iat[-1]
Out[28]: -1.1319705662711321
Benchmarking:
In [31]: %timeit df.e.iat[-1]
100000 loops, best of 3: 18 µs per loop
In [32]: %timeit df.e.iloc[-1]
10000 loops, best of 3: 24 µs per loop
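Because the frame in the question is random, the values shown above cannot be reproduced exactly. The sketch below uses a deterministic frame of the same shape (an assumption made only so the result is predictable) to show that both accessors return the same scalar.

```python
import numpy as np
import pandas as pd

# Deterministic 10x5 frame; the last value in column 'e' is 49.0.
df = pd.DataFrame(np.arange(50, dtype=float).reshape(10, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])

last_iloc = df['e'].iloc[-1]  # positional access via iloc
last_iat = df['e'].iat[-1]    # scalar-only positional access via iat

print(last_iloc, last_iat)  # 49.0 49.0
```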
Try
df['e'].iloc[[-1]]
Sometimes,
df['e'].iloc[-1]
doesn't work.
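Note that the two forms return different things: passing a list to iloc yields a one-element Series that keeps its index label, while passing a plain integer yields the scalar itself. A small sketch on a deterministic frame (assumed here only for predictable output):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(50, dtype=float).reshape(10, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])

as_series = df['e'].iloc[[-1]]  # list indexer: one-element Series
as_scalar = df['e'].iloc[-1]    # integer indexer: scalar value

print(type(as_series).__name__, as_series.index.tolist())  # Series [9]
print(as_scalar)                                           # 49.0
```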
We can also access it by indexing df.index and at:
df.at[df.index[-1], 'e']
It's faster than iloc but slower than without indexing.
If we decide to assign a value to the last element in column "e", the above method is much faster than the other two options (9-11 times faster):
>>> %timeit df.at[df.index[-1], 'e'] = 1
11.5 µs ± 355 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
>>> %timeit df['e'].iat[-1] = 1
107 µs ± 4.22 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
>>> %timeit df['e'].iloc[-1] = 1
127 µs ± 7.13 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
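One caveat about the last two timings: df['e'].iat[-1] = 1 and df['e'].iloc[-1] = 1 are chained assignments, and depending on the pandas version and its copy-on-write settings they may not write back to df at all. A sketch of assignments made directly on the frame, using a deterministic frame assumed for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(50, dtype=float).reshape(10, 5),
                  columns=['a', 'b', 'c', 'd', 'e'])

# Label-based scalar assignment on the frame itself.
df.at[df.index[-1], 'e'] = 1.0
value_after_at = df['e'].iat[-1]

# Position-based equivalent: look up the column position first.
df.iat[-1, df.columns.get_loc('e')] = 2.0
value_after_iat = df['e'].iat[-1]

print(value_after_at, value_after_iat)  # 1.0 2.0
```

Both forms index the DataFrame once, so the write goes to df rather than to a possibly temporary Series.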