I have a tiny test dataset of students' university applications across different majors. It looks like this:
```
   student_id  gender      major  admitted
0       35377  female  Chemistry     False
1       56105    male    Physics      True
2       31441  female  Chemistry     False
3       51765    male    Physics      True
4       53714  female    Physics      True
```
The shape is (500, 4).
I need to get the admission rate for females, and I have now solved it in three different ways. Each of them returns the correct result.
Using groupby:
```python
female_admitted_rate = df.groupby('gender').get_group('female')[df['admitted'] == True].count() / len(df.groupby('gender').get_group('female'))
```
```
[Out]
student_id    0.287938
gender        0.287938
major         0.287938
admitted      0.287938
dtype: float64
```
Using plain pandas:
```python
len(df[(df['gender'] == 'female') & (df['admitted'])]) / len(df[df['gender'] == 'female'])
```
```
[Out] 0.28793774319066145
```
Using query:
```python
len(df.query("gender == 'female' & admitted")) / len(df.query("gender == 'female'"))
```
```
[Out] 0.28793774319066145
```
QUESTIONS

- What would you use to get this information?
- Is there a particular advantage to one of the shown approaches?
- Is there one approach that makes absolutely no sense to you?
- Is there a particular performance benefit to using one of the three over the others on big data sets?
I think you only need `DataFrame.loc[]` + `Series.mean()`:
```python
df.loc[df['gender'].eq('female'), 'admitted'].mean()
```
`Series.mean` interprets `True` as 1 and `False` as 0.
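As a quick illustration of that boolean-mean behaviour, on toy data:

```python
import pandas as pd

# True counts as 1 and False as 0, so the mean of a boolean
# Series is simply the fraction of True values.
s = pd.Series([True, False, True, False, False])
print(s.mean())  # 0.4
```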
You can check with `timeit`:
```python
%%timeit
df.loc[df['gender'].eq('female'), 'admitted'].mean()
```
```
1.16 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
```python
%%timeit
len(df[(df['gender'] == 'female') & (df['admitted'])]) / len(df[df['gender'] == 'female'])
```
```
3.45 ms ± 428 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
```python
%%timeit
df.groupby('gender').get_group('female')[df['admitted'] == True].count() / len(df.groupby('gender').get_group('female'))
```
```
10.3 ms ± 718 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
```python
%%timeit
len(df.query("gender == 'female' & admitted")) / len(df.query("gender == 'female'"))
```
```
11.1 ms ± 604 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
These times are for the sample DataFrame; performance could vary greatly depending on the shape of yours. That said, I honestly believe the method I propose will be the fastest in most cases, in addition to providing a clean and simple syntax.
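If you wanted the rate for every gender at once, the same boolean-mean trick extends naturally to a groupby (a minimal sketch on a toy frame with made-up values):

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["female", "male", "female", "male", "female"],
    "admitted": [False, True, False, True, True],
})

# mean of the boolean column per group = admission rate per gender
print(df.groupby("gender")["admitted"].mean())
```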
I need to group a DataFrame and apply several chained functions on each group.
My problem is basically the same as in pandas - Groupby two functions: apply cumsum then shift on each group.
There are answers there on how to obtain a correct result; however, they seem to have suboptimal performance. My specific question is thus: is there a more efficient way than the ones I describe below?
First, here is some large testing data:
```python
from string import ascii_lowercase

import numpy as np
import pandas as pd

n = 100_000_000
np.random.seed(0)
df = pd.DataFrame(
    {
        "x": np.random.choice(np.array([*ascii_lowercase]), size=n),
        "y": np.random.normal(size=n),
    }
)
```
Below is the performance of each function:
```python
%timeit df.groupby("x")["y"].cumsum()
```
```
4.65 s ± 71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
```python
%timeit df.groupby("x")["y"].shift()
```
```
5.29 s ± 54.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
A basic solution is to group twice. It seems suboptimal since grouping is a large part of the total runtime and should only be done once.
```python
%timeit df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()
```
```
10.1 s ± 63.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
The accepted answer to the aforementioned question suggests using `apply` with a custom function to avoid this issue. However, for some reason it actually performs much worse than the previous solution.
```python
def cumsum_shift(s):
    return s.cumsum().shift()

%timeit df.groupby("x")["y"].apply(cumsum_shift)
```
```
27.8 s ± 858 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Do you have any idea how to optimize this code? Especially in a case where I'd like to chain more than two functions, performance gains can become quite significant.
Let me know if this helps; a few weeks back I was having the same issue. I solved it by just splitting the code and creating a separate groupby object, which contains information about the groups.
```python
# creating groupby object
g = df.groupby('x')['y']

%timeit g.cumsum()
```
```
592 ms ± 8.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
```python
%timeit g.shift()
```
```
1.7 s ± 8.68 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
I would suggest giving `transform` a try instead of `apply`. Try this:
```python
%timeit df.groupby("x")["y"].transform(lambda s: s.cumsum().shift())
```
Or also try `pipe` (note that the `shift` still has to be applied per group):
```python
%timeit df.groupby("x")["y"].pipe(lambda g: g.cumsum().groupby(df["x"]).shift())
```
I am pretty sure that `pipe` can be more efficient than `apply` or `transform`. Let us know if it works well.
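On a small toy frame you can sanity-check that doing both steps inside a single per-group lambda (via `transform`) matches the double-groupby result (a sketch with made-up data):

```python
import pandas as pd

df = pd.DataFrame({"x": list("aabbab"), "y": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]})

# grouping twice: cumsum per group, then shift per group
twice = df.groupby("x")["y"].cumsum().groupby(df["x"]).shift()

# grouping once: both steps inside one per-group lambda
once = df.groupby("x")["y"].transform(lambda s: s.cumsum().shift())

print(twice.equals(once))  # True
```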
I'm testing out feather-format as a way to store pandas DataFrame files. The performance of feather seems to be extremely poor when writing columns consisting entirely of None (info() gives 0 non-null object). The following code well encapsulates the issue:
```python
df1 = pd.DataFrame(data={'x': 1000*[None]})

%timeit df1.to_feather('.../x.feather')
```
```
5.35 s ± 303 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
```python
%timeit df1.to_pickle('.../x.pkl')
```
```
734 ms ± 60.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
```python
%timeit df1.to_parquet('.../x.parquet')
```
```
200 ms ± 5.84 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
I'm using feather-format 0.4.0, pandas 0.23.4, and pyarrow 0.13.0.
How can I get these kinds of DataFrames to save without taking forever?
You could try adding a specific dtype. That being said, the numbers are a little surprising in terms of how poor feather performance is.
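A sketch of that suggestion: cast the all-None column to a concrete dtype before writing, so the writer isn't handed a column of Python `None` objects. The write call is left commented out since it requires pyarrow, and the path is illustrative:

```python
import pandas as pd

df1 = pd.DataFrame(data={'x': 1000 * [None]})
print(df1['x'].dtype)  # object

# None becomes NaN once the column has a concrete float dtype
df1['x'] = df1['x'].astype('float64')
print(df1['x'].dtype)  # float64

# df1.to_feather('x.feather')  # then write as before (needs pyarrow)
```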
I have a long 1-D numpy array with about 10% missing values. I want to repeatedly change its missing values (`np.nan`) to other values. I know of two ways to do this:
```python
data[np.isnan(data)] = 0
```
or the function
```python
np.copyto(data, 0, where=np.isnan(data))
```
Sometimes I want to put zeros there; other times I want to restore the nans.
I thought that recomputing the np.isnan function repeatedly would be slow, and it would be better to save the locations of the nans. Some of the timing results of the code below are counter-intuitive.
I ran the following:
```python
import numpy as np
import sys

print(sys.version)
print(sys.version_info)
print(f'numpy version {np.__version__}')

data = np.random.random(100000)
data[data < 0.1] = 0
data[data == 0] = np.nan

%timeit missing = np.isnan(data)
%timeit wheremiss = np.where(np.isnan(data))

missing = np.isnan(data)
wheremiss = np.where(np.isnan(data))

print("Use missing list store 0")
%timeit data[missing] = 0
data[data == 0] = np.nan
%timeit data[wheremiss] = 0
data[data == 0] = np.nan
%timeit np.copyto(data, 0, where=missing)

print("Use isnan function store 0")
data[data == 0] = np.nan
%timeit data[np.isnan(data)] = 0
data[data == 0] = np.nan
%timeit np.copyto(data, 0, where=np.isnan(data))

print("Use missing list store np.nan")
data[data == 0] = np.nan
%timeit data[missing] = np.nan
data[data == 0] = np.nan
%timeit data[wheremiss] = np.nan
data[data == 0] = np.nan
%timeit np.copyto(data, np.nan, where=missing)

print("Use isnan function store np.nan")
data[data == 0] = np.nan
%timeit data[np.isnan(data)] = np.nan
data[data == 0] = np.nan
%timeit np.copyto(data, np.nan, where=np.isnan(data))
```
And I got the following output (I have taken the liberty to add numbers to the timing lines, so that I can refer to them later):
```
3.7.3 | packaged by conda-forge | (default, Jul 1 2019, 22:01:29) [MSC v.1900 64 bit (AMD64)]
sys.version_info(major=3, minor=7, micro=3, releaselevel='final', serial=0)
numpy version 1.17.1
01. 30 µs ± 2.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
02. 219 µs ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use missing list store 0
03. 339 µs ± 23.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
04. 26 µs ± 1.92 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
05. 287 µs ± 26.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store 0
06. 38.5 µs ± 2.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
07. 43.8 µs ± 4.67 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Use missing list store np.nan
08. 328 µs ± 30.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
09. 24.8 µs ± 2.03 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
10. 322 µs ± 30 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use isnan function store np.nan
11. 356 µs ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
12. 300 µs ± 4.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
```
So here is the first question: why would it take nearly 10 times longer to store an np.nan than to store a 0? (Compare lines 06 and 07 vs. lines 11 and 12.)
And why does it take much longer to use a stored list of missing values than to recompute them with the isnan function? (Compare lines 03 and 05 vs. 06 and 07.)
This is just for curiosity. I can see that the fastest way is to use np.where to get a list of indices (because I have only 10% missing). But if I had many more, things might not be so obvious.
Because you're not measuring what you think you are! You're mutating your data while running the test, and timeit runs the statement multiple times, so later runs operate on changed data. Once you've set the values to 0, the next isnan call returns nothing and the assignment is basically a no-op, whereas assigning nan creates more work for the next iteration.
Your question about when to use np.where versus leaving it as an array of bools is a bit more difficult. It involves the relative sizes of the different datatypes (e.g. bool is 1 byte, int64 is 8 bytes), the proportion of values that are selected, how well the distribution matches the CPU/memory subsystem's optimisations (e.g. are they mostly in one block vs. uniformly distributed), the cost of doing np.where relative to how many times the result will be reused, and other factors I can't think of right now.
For other readers, it might be worth pointing out that RAM latency (i.e. speed) is more than 100 times that of L1 cache, so keeping memory access predictable is important to maximize cache utilisation.
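One way to avoid that mutation pitfall (a sketch using the stdlib `timeit` module rather than the `%timeit` magic) is to copy the array inside the timed statement, so every run sees identical data, at the cost of also timing the copy:

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)
base = rng.random(100_000)
base[base < 0.1] = np.nan  # roughly 10% missing, as in the question

# Each run fills NaNs in a fresh copy, so the isnan mask is the
# same on every iteration instead of shrinking to nothing.
total = timeit.timeit(
    "d = base.copy(); d[np.isnan(d)] = 0",
    globals={"base": base, "np": np},
    number=100,
)
print(f"{total / 100 * 1e6:.1f} µs per loop (including the copy)")
```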
Very much a beginner question, sorry: is there a way to avoid repeating the dataframe name when operating on pandas columns?
In R, `data.table` allows operating on a column without repeating the data.table name, like this:
```r
very_long_dt_name = data.table::data.table(col1=c(1,2,3), col2=c(3,3,1))
# operate on the columns without repeating the dt name:
very_long_dt_name[, ratio := round(col1/col2, 2)]
```
I couldn't figure out how to do this with pandas in Python, so I keep repeating the df name:
```python
data = {'col1': [1, 2, 3], 'col2': [3, 3, 1]}
very_long_df_name = pd.DataFrame(data)

# operating on the columns requires repeating the df name
very_long_df_name['ratio'] = np.round(very_long_df_name['col1'] / very_long_df_name['col2'], 2)
```
I'm sure there's a way to avoid it but I can't find anything on Google. Any hint please? Thank you.
Try `assign`:
```python
very_long_df_name.assign(ratio=lambda x: np.round(x.col1 / x.col2, 2))
```
Output:
```
   col1  col2  ratio
0     1     3   0.33
1     2     3   0.67
2     3     1   3.00
```
Edit: to reflect the comments, tests on 1 million rows:
```python
%%timeit
very_long_df_name.assign(ratio=lambda x: x.col1 / x.col2)
# 18.6 ms ± 506 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
and
```python
%%timeit
very_long_df_name['ratio'] = very_long_df_name['col1'] / very_long_df_name['col2']
# 13.3 ms ± 359 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
And with np.round, assign:
```python
%%timeit
very_long_df_name.assign(ratio=lambda x: np.round(x.col1 / x.col2, 2))
# 64.8 ms ± 958 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
and not-assign:
```python
%%timeit
very_long_df_name['ratio'] = np.round(very_long_df_name['col1'] / very_long_df_name['col2'], 2)
# 55.8 ms ± 2.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
So it appears that assign is vectorized, just not as well tuned.
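Another option that avoids repeating the name is `DataFrame.eval`, which lets you refer to columns by bare name inside an expression string (a sketch; `eval` only supports a limited expression syntax, so the rounding is done in a separate step here):

```python
import pandas as pd

very_long_df_name = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 3, 1]})

# columns are referenced by bare name inside the expression string
very_long_df_name = very_long_df_name.eval("ratio = col1 / col2")
very_long_df_name['ratio'] = very_long_df_name['ratio'].round(2)
print(very_long_df_name['ratio'].tolist())  # [0.33, 0.67, 3.0]
```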
I am using the pandas vectorized str.split() method to extract the first element returned from a split on "~". I also have also tried using df.apply() with a lambda and str.split() to produce equivalent results. When using %timeit, I'm finding that df.apply() is performing faster than the vectorized version.
Everything that I have read about vectorization seems to indicate that the first version should have better performance. Can someone please explain why I am getting these results? Example:
```
     id                  facility
0  3466                 abc~24353
1  4853         facility1~3.4.5.6
2  4582  53434_Facility~34432~cde
3  9972   facility2~FACILITY2~343
4  2356             Test~23 ~FAC1
```
The above dataframe has about 500,000 rows and I have also tested at around 1 million with similar results. Here is some example input and output:
Vectorization:
```python
In [1]: %timeit df['facility'] = df['facility'].str.split('~').str[0]
1.1 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Lambda apply:
```python
In [2]: %timeit df['facility'] = df['facility'].astype(str).apply(lambda facility: facility.split('~')[0])
650 ms ± 52.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
Does anyone know why I am getting this behavior?
Thanks!
Pandas string methods are only "vectorized" in the sense that you don't have to write the loop yourself. There isn't actually any parallelization going on, because string operations (especially regex problems) are inherently difficult (impossible?) to parallelize. If you really want speed, you should actually fall back to Python here:
```python
%timeit df['facility'].str.split('~', n=1).str[0]
%timeit [x.split('~', 1)[0] for x in df['facility'].tolist()]
```
```
411 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
132 ms ± 302 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
For more information on when loops are faster than pandas functions, take a look at For loops with pandas - When should I care?.
As for why apply is faster, I'm of the belief that the function that apply is applying (i.e., str.split) is a lot more lightweight than the string-splitting machinery in the bowels of Series.str.split.
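As a sanity check on a toy frame (a minimal sketch), both approaches produce identical results:

```python
import pandas as pd

df = pd.DataFrame({"facility": ["abc~24353", "facility1~3.4.5.6", "Test~23 ~FAC1"]})

# pandas string method vs. a plain-Python list comprehension
vectorized = df["facility"].str.split("~", n=1).str[0]
listcomp = pd.Series([x.split("~", 1)[0] for x in df["facility"].tolist()])

print(vectorized.tolist())  # ['abc', 'facility1', 'Test']
assert vectorized.tolist() == listcomp.tolist()
```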