When should I (not) want to use pandas apply() in my code?
I have seen many answers posted to questions on Stack Overflow involving the use of the Pandas method apply. I have also seen users commenting under them saying that "apply is slow, and should be avoided".
I have read many articles on the topic of performance that explain apply is slow. I have also seen a disclaimer in the docs about how apply is simply a convenience function for passing UDFs (can't seem to find that now). So, the general consensus is that apply should be avoided if possible. However, this raises the following questions:
If apply is so bad, then why is it in the API?
How and when should I make my code apply-free?
Are there ever any situations where apply is good (better than other possible solutions)?
apply, the Convenience Function you Never Needed
We start by addressing the questions in the OP, one by one.
"If apply is so bad, then why is it in the API?"
DataFrame.apply and Series.apply are convenience functions defined on the DataFrame and Series objects, respectively. apply accepts any user-defined function that applies a transformation/aggregation to a DataFrame. In effect, apply is a silver bullet for whatever the existing pandas functions cannot do.
Some of the things apply can do:
Run any user-defined function on a DataFrame or Series
Apply a function either row-wise (axis=1) or column-wise (axis=0) on a DataFrame
Perform index alignment while applying the function
Perform aggregation with user-defined functions (however, we usually prefer agg or transform in these cases)
Perform element-wise transformations
Broadcast aggregated results to original rows (see the result_type argument).
Accept positional/keyword arguments to pass to the user-defined functions.
...Among others. For more information, see Row or Column-wise Function Application in the documentation.
So, with all these features, why is apply bad? It is because apply is slow. Pandas makes no assumptions about the nature of your function, and so iteratively applies it to each row/column as necessary. Additionally, handling all of the situations above means apply incurs some major overhead at each iteration. Further, apply consumes a lot more memory, which is a challenge for memory-bound applications.
There are very few situations where apply is appropriate to use (more on that below). If you're not sure whether you should be using apply, you probably shouldn't.
Let's address the next question.
"How and when should I make my code apply-free?"
To rephrase, here are some common situations where you will want to get rid of any calls to apply.
Numeric Data
If you're working with numeric data, there is likely already a vectorized cython function that does exactly what you're trying to do (if not, please either ask a question on Stack Overflow or open a feature request on GitHub).
Contrast the performance of apply for a simple addition operation.
df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})
df
A B
0 9 12
1 4 7
2 2 5
3 1 4
df.apply(np.sum)
A 16
B 28
dtype: int64
df.sum()
A 16
B 28
dtype: int64
Performance-wise, there's no comparison: the cythonized equivalent is much faster. There's no need for a graph, because the difference is obvious even for toy data.
%timeit df.apply(np.sum)
%timeit df.sum()
2.22 ms ± 41.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
471 µs ± 8.16 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Even if you enable passing raw arrays with the raw argument, it's still twice as slow.
%timeit df.apply(np.sum, raw=True)
840 µs ± 691 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Another example:
df.apply(lambda x: x.max() - x.min())
A 8
B 8
dtype: int64
df.max() - df.min()
A 8
B 8
dtype: int64
%timeit df.apply(lambda x: x.max() - x.min())
%timeit df.max() - df.min()
2.43 ms ± 450 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
1.23 ms ± 14.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In general, seek out vectorized alternatives if possible.
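As a minimal illustration of that advice (the toy frame and the transformation are my own, not from the timings above):

```python
import pandas as pd

df = pd.DataFrame({"A": [9, 4, 2, 1], "B": [12, 7, 5, 4]})

# apply-based: a Python-level function call per element
slow = df['A'].apply(lambda x: x ** 2 + 1)

# vectorized: one NumPy-backed operation over the whole column
fast = df['A'] ** 2 + 1
```

Both produce identical Series; the vectorized form simply avoids the per-element Python overhead.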
String/Regex
Pandas provides "vectorized" string functions in most situations, but there are rare cases where those functions do not... "apply", so to speak.
A common problem is to check whether a value in a column is present in another column of the same row.
df = pd.DataFrame({
    'Name': ['mickey', 'donald', 'minnie'],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse'],
    'Value': [20, 10, 86]})
df
     Name                       Title  Value
0  mickey                  wonderland     20
1  donald  welcome to donald's castle     10
2  minnie      Minnie mouse clubhouse     86
This should return the second and third rows, since "donald" and "minnie" appear in their respective "Title" columns.
Using apply, this would be done with:
df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)
0 False
1 True
2 True
dtype: bool
df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
However, a better solution exists using a list comprehension.
df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
Name Title Value
1 donald welcome to donald's castle 10
2 minnie Minnie mouse clubhouse 86
%timeit df[df.apply(lambda x: x['Name'].lower() in x['Title'].lower(), axis=1)]
%timeit df[[y.lower() in x.lower() for x, y in zip(df['Title'], df['Name'])]]
2.85 ms ± 38.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
788 µs ± 16.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The thing to note here is that iterative routines happen to be faster than apply because of their lower overhead. If you need to handle NaNs and invalid dtypes, build on this with a custom function that you call (with arguments) inside the list comprehension.
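As a sketch of that idea (the helper name `try_contains` and the extra None value are my own, for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['mickey', 'donald', None],
    'Title': ['wonderland', "welcome to donald's castle", 'Minnie mouse clubhouse']})

def try_contains(title, name):
    # Fall back to False for missing or non-string values.
    try:
        return name.lower() in title.lower()
    except AttributeError:
        return False

mask = [try_contains(x, y) for x, y in zip(df['Title'], df['Name'])]
```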
For more information on when list comprehensions should be considered a good option, see my writeup: Are for-loops in pandas really bad? When should I care?.
Note
Date and datetime operations also have vectorized versions. So, for example, you should prefer pd.to_datetime(df['date']) over, say, df['date'].apply(pd.to_datetime). Read more at the docs.
A Common Pitfall: Exploding Columns of Lists
s = pd.Series([[1, 2]] * 3)
s
0 [1, 2]
1 [1, 2]
2 [1, 2]
dtype: object
People are tempted to use apply(pd.Series). This is horrible in terms of performance.
s.apply(pd.Series)
0 1
0 1 2
1 1 2
2 1 2
A better option is to listify the column and pass it to pd.DataFrame.
pd.DataFrame(s.tolist())
0 1
0 1 2
1 1 2
2 1 2
%timeit s.apply(pd.Series)
%timeit pd.DataFrame(s.tolist())
2.65 ms ± 294 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
816 µs ± 40.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
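One detail worth knowing (sketched on toy data of my own): pd.DataFrame(s.tolist()) builds a fresh RangeIndex by default, so pass the original index explicitly if the expanded frame needs to align with the Series.

```python
import pandas as pd

s = pd.Series([[1, 2]] * 3, index=['a', 'b', 'c'])

# Pass the original index so the expanded frame aligns with s.
out = pd.DataFrame(s.tolist(), index=s.index)
```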
Lastly,
"Are there any situations where apply is good?"
Apply is a convenience function, so there are situations where the overhead is negligible enough to forgive. It really depends on how many times the function is called.
Functions that are Vectorized for Series, but not DataFrames
What if you want to apply a string operation on multiple columns? What if you want to convert multiple columns to datetime? These functions are vectorized for Series only, so they must be applied over each column that you want to convert/operate on.
df = pd.DataFrame(
    pd.date_range('2018-12-31', '2019-01-31', freq='2D').date.astype(str).reshape(-1, 2),
    columns=['date1', 'date2'])
df
date1 date2
0 2018-12-31 2019-01-02
1 2019-01-04 2019-01-06
2 2019-01-08 2019-01-10
3 2019-01-12 2019-01-14
4 2019-01-16 2019-01-18
5 2019-01-20 2019-01-22
6 2019-01-24 2019-01-26
7 2019-01-28 2019-01-30
df.dtypes
date1 object
date2 object
dtype: object
This is an admissible case for apply:
df.apply(pd.to_datetime, errors='coerce').dtypes
date1 datetime64[ns]
date2 datetime64[ns]
dtype: object
Note that it would also make sense to stack, or just use an explicit loop. All these options are slightly faster than using apply, but the difference is small enough to forgive.
%timeit df.apply(pd.to_datetime, errors='coerce')
%timeit pd.to_datetime(df.stack(), errors='coerce').unstack()
%timeit pd.concat([pd.to_datetime(df[c], errors='coerce') for c in df], axis=1)
%timeit for c in df.columns: df[c] = pd.to_datetime(df[c], errors='coerce')
5.49 ms ± 247 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.94 ms ± 48.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.16 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.41 ms ± 1.71 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can make a similar case for other operations such as string operations, or conversion to category.
u = df.apply(lambda x: x.str.contains(...))
v = df.apply(lambda x: x.astype('category'))
versus
u = pd.concat([df[c].str.contains(...) for c in df], axis=1)
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')
And so on...
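A runnable version of the category comparison, on toy data of my own choosing:

```python
import pandas as pd

df = pd.DataFrame({'x': ['a', 'b', 'a'], 'y': ['c', 'c', 'd']})

# apply-based conversion over each column
u = df.apply(lambda col: col.astype('category'))

# explicit per-column loop
v = df.copy()
for c in df:
    v[c] = df[c].astype('category')
```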
Converting Series to str: astype versus apply
This seems like an idiosyncrasy of the API. Using apply to convert integers in a Series to string is comparable to (and sometimes faster than) using astype.
The graph was plotted using the perfplot library.
import perfplot

perfplot.show(
    setup=lambda n: pd.Series(np.random.randint(0, n, n)),
    kernels=[
        lambda s: s.astype(str),
        lambda s: s.apply(str)
    ],
    labels=['astype', 'apply'],
    n_range=[2**k for k in range(1, 20)],
    xlabel='N',
    logx=True,
    logy=True,
    equality_check=lambda x, y: (x == y).all())
With floats, I see that astype is consistently as fast as, or slightly faster than, apply. So this has to do with the fact that the data in the test is of integer type.
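A quick equivalence check of the two conversions (toy data mine):

```python
import pandas as pd

s = pd.Series([1, 22, 333])

a = s.astype(str)
b = s.apply(str)
```

Both yield identical object-dtype Series of strings; only the speed differs.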
GroupBy operations with chained transformations
GroupBy.apply has not been discussed so far, but it, too, is an iterative convenience function that handles anything the existing GroupBy functions do not.
One common requirement is to perform a GroupBy followed by two chained operations, such as a "lagged cumsum":
df = pd.DataFrame({"A": list('aabcccddee'), "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})
df
A B
0 a 12
1 a 7
2 b 5
3 c 4
4 c 5
5 c 4
6 d 3
7 d 2
8 e 1
9 e 10
You'd need two successive groupby calls here:
df.groupby('A').B.cumsum().groupby(df.A).shift()
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
Using apply, you can shorten this to a single call.
df.groupby('A').B.apply(lambda x: x.cumsum().shift())
0 NaN
1 12.0
2 NaN
3 NaN
4 4.0
5 9.0
6 NaN
7 3.0
8 NaN
9 1.0
Name: B, dtype: float64
It is very hard to quantify the performance here because it depends on the data. But in general, apply is an acceptable solution if the goal is to avoid a second groupby call (because groupby itself is also quite expensive).
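For reference, a sketch verifying that the two formulations agree on the sample frame. Passing group_keys=False keeps the original index on the apply result (an assumption about version-dependent behaviour, so the comparison also resets indexes to be safe):

```python
import pandas as pd

df = pd.DataFrame({"A": list('aabcccddee'),
                   "B": [12, 7, 5, 4, 5, 4, 3, 2, 1, 10]})

two_calls = df.groupby('A').B.cumsum().groupby(df.A).shift()
one_call = df.groupby('A', group_keys=False).B.apply(lambda x: x.cumsum().shift())
```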
Other Caveats
Aside from the caveats mentioned above, it is also worth mentioning that apply operates on the first row (or column) twice. This is done to determine whether the function has any side effects. If it doesn't, apply may be able to use a fast path for evaluating the result; otherwise, it falls back to a slow implementation.
df = pd.DataFrame({
    'A': [1, 2],
    'B': ['x', 'y']
})

def func(x):
    print(x['A'])
    return x
df.apply(func, axis=1)
# 1
# 1
# 2
A B
0 1 x
1 2 y
This behaviour is also seen in GroupBy.apply on pandas versions <0.25 (it was fixed in 0.25; see here for more information).
Not all applys are alike
The chart below suggests when to consider apply [1]. Green means possibly efficient; red, avoid.
Some of this is intuitive: pd.Series.apply is a Python-level row-wise loop, ditto pd.DataFrame.apply row-wise (axis=1). The misuses of these are many and wide-ranging. The other post deals with them in more depth. Popular solutions are to use vectorised methods, list comprehensions (assumes clean data), or efficient tools such as the pd.DataFrame constructor (e.g. to avoid apply(pd.Series)).
If you are using pd.DataFrame.apply row-wise, specifying raw=True (where possible) is often beneficial. At this stage, numba is usually a better choice.
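As a sketch of the raw=True point (the toy frame is my own):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12.0).reshape(4, 3), columns=['a', 'b', 'c'])

# raw=True hands each row to the function as a bare NumPy array,
# skipping the per-row Series construction that axis=1 normally incurs.
row_range = df.apply(lambda arr: arr.max() - arr.min(), axis=1, raw=True)
```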
GroupBy.apply: generally favoured
Repeating groupby operations to avoid apply will hurt performance. GroupBy.apply is usually fine here, provided the methods you use in your custom function are themselves vectorised. Sometimes there is no native Pandas method for a groupwise aggregation you wish to apply. In this case, for a small number of groups apply with a custom function may still offer reasonable performance.
pd.DataFrame.apply column-wise: a mixed bag
pd.DataFrame.apply column-wise (axis=0) is an interesting case. For a small number of rows versus a large number of columns, it's almost always expensive. For a large number of rows relative to columns, the more common case, you may sometimes see significant performance improvements using apply:
# Python 3.7, Pandas 0.23.4
np.random.seed(0)
df = pd.DataFrame(np.random.random((10**7, 3))) # Scenario_1, many rows
df = pd.DataFrame(np.random.random((10**4, 10**3))) # Scenario_2, many columns
# Scenario_1 | Scenario_2
%timeit df.sum() # 800 ms | 109 ms
%timeit df.apply(pd.Series.sum) # 568 ms | 325 ms
%timeit df.max() - df.min() # 1.63 s | 314 ms
%timeit df.apply(lambda x: x.max() - x.min()) # 838 ms | 473 ms
%timeit df.mean() # 108 ms | 94.4 ms
%timeit df.apply(pd.Series.mean) # 276 ms | 233 ms
[1] There are exceptions, but these are usually marginal or uncommon. A couple of examples:
df['col'].apply(str) may slightly outperform df['col'].astype(str).
df.apply(pd.to_datetime) working on strings doesn't scale well with rows versus a regular for loop.
For axis=1 (i.e., row-wise functions), you can just use the following function in lieu of apply. I wonder why this isn't the pandas behaviour. (Untested with compound indexes, but it does appear to be much faster than apply.)
def faster_df_apply(df, func):
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)
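For illustration, a self-contained usage sketch of the helper above (redefined here, with toy data of my own):

```python
import pandas as pd

def faster_df_apply(df, func):
    # Build the result with itertuples, avoiding per-row Series construction.
    cols = list(df.columns)
    data, index = [], []
    for row in df.itertuples(index=True):
        row_dict = {f: v for f, v in zip(cols, row[1:])}
        data.append(func(row_dict))
        index.append(row[0])
    return pd.Series(data, index=index)

df = pd.DataFrame({'A': [1, 2], 'B': [10, 20]})
out = faster_df_apply(df, lambda r: r['A'] + r['B'])
```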
Are there ever any situations where apply is good?
Yes, sometimes.
Task: decode Unicode strings.
import numpy as np
import pandas as pd
import unidecode
s = pd.Series(['mañana','Ceñía'])
s.head()
0 mañana
1 Ceñía
s.apply(unidecode.unidecode)
0 manana
1 Cenia
Update
I was by no means advocating for the use of apply, just thinking that since NumPy cannot deal with the above situation, it could have been a good candidate for pandas apply. But I was forgetting the plain ol' list comprehension, thanks to the reminder by @jpp.
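A list-comprehension version of the same task. To keep the snippet dependency-free, str.translate stands in for unidecode here, covering only the accented characters in the toy data (an assumption for illustration):

```python
import pandas as pd

s = pd.Series(['mañana', 'Ceñía'])

# Minimal stand-in for unidecode.unidecode, handling just these accents.
table = str.maketrans({'ñ': 'n', 'í': 'i'})
out = pd.Series([x.translate(table) for x in s], index=s.index)
```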
Related
How can I insert a single value into a Pandas dataframe at a given location? [duplicate]
I have created a Pandas DataFrame df = DataFrame(index=['A','B','C'], columns=['x','y']) and have got this x y A NaN NaN B NaN NaN C NaN NaN Now, I would like to assign a value to particular cell, for example to row C and column x. I would expect to get this result: x y A NaN NaN B NaN NaN C 10 NaN with this code: df.xs('C')['x'] = 10 However, the contents of df has not changed. The dataframe contains yet again only NaNs. Any suggestions?
RukTech's answer, df.set_value('C', 'x', 10), is far and away faster than the options I've suggested below. However, it has been slated for deprecation. Going forward, the recommended method is .iat/.at. Why df.xs('C')['x']=10 does not work: df.xs('C') by default, returns a new dataframe with a copy of the data, so df.xs('C')['x']=10 modifies this new dataframe only. df['x'] returns a view of the df dataframe, so df['x']['C'] = 10 modifies df itself. Warning: It is sometimes difficult to predict if an operation returns a copy or a view. For this reason the docs recommend avoiding assignments with "chained indexing". So the recommended alternative is df.at['C', 'x'] = 10 which does modify df. In [18]: %timeit df.set_value('C', 'x', 10) 100000 loops, best of 3: 2.9 µs per loop In [20]: %timeit df['x']['C'] = 10 100000 loops, best of 3: 6.31 µs per loop In [81]: %timeit df.at['C', 'x'] = 10 100000 loops, best of 3: 9.2 µs per loop
Update: The .set_value method is going to be deprecated. .iat/.at are good replacements, unfortunately pandas provides little documentation The fastest way to do this is using set_value. This method is ~100 times faster than .ix method. For example: df.set_value('C', 'x', 10)
You can also use a conditional lookup using .loc as seen here: df.loc[df[<some_column_name>] == <condition>, [<another_column_name>]] = <value_to_add> where <some_column_name is the column you want to check the <condition> variable against and <another_column_name> is the column you want to add to (can be a new column or one that already exists). <value_to_add> is the value you want to add to that column/row. This example doesn't work precisely with the question at hand, but it might be useful for someone wants to add a specific value based on a condition.
Try using df.loc[row_index,col_indexer] = value
The recommended way (according to the maintainers) to set a value is: df.ix['x','C']=10 Using 'chained indexing' (df['x']['C']) may lead to problems. See: https://stackoverflow.com/a/21287235/1579844 http://pandas.pydata.org/pandas-docs/dev/indexing.html#indexing-view-versus-copy https://github.com/pydata/pandas/pull/6031
This is the only thing that worked for me! df.loc['C', 'x'] = 10 Learn more about .loc here.
To set values, use: df.at[0, 'clm1'] = 0 The fastest recommended method for setting variables. set_value, ix have been deprecated. No warning, unlike iloc and loc
.iat/.at is the good solution. Supposing you have this simple data_frame: A B C 0 1 8 4 1 3 9 6 2 22 33 52 if we want to modify the value of the cell [0,"A"] u can use one of those solution : df.iat[0,0] = 2 df.at[0,'A'] = 2 And here is a complete example how to use iat to get and set a value of cell : def prepossessing(df): for index in range(0,len(df)): df.iat[index,0] = df.iat[index,0] * 2 return df y_train before : 0 0 54 1 15 2 15 3 8 4 31 5 63 6 11 y_train after calling prepossessing function that iat to change to multiply the value of each cell by 2: 0 0 108 1 30 2 30 3 16 4 62 5 126 6 22
I would suggest: df.loc[index_position, "column_name"] = some_value To modifiy multiple cells at the same time: df.loc[start_idx_pos: End_idx_pos, "column_name"] = some_value
Avoid Assignment with Chained Indexing You are dealing with an assignment with chained indexing which will result in a SettingWithCopy warning. This should be avoided by all means. Your assignment will have to resort to one single .loc[] or .iloc[] slice, as explained here. Hence, in your case: df.loc['C', 'x'] = 10
In my example i just change it in selected cell for index, row in result.iterrows(): if np.isnan(row['weight']): result.at[index, 'weight'] = 0.0 'result' is a dataField with column 'weight'
Here is a summary of the valid solutions provided by all users, for data frames indexed by integer and string. df.iloc, df.loc and df.at work for both type of data frames, df.iloc only works with row/column integer indices, df.loc and df.at supports for setting values using column names and/or integer indices. When the specified index does not exist, both df.loc and df.at would append the newly inserted rows/columns to the existing data frame, but df.iloc would raise "IndexError: positional indexers are out-of-bounds". A working example tested in Python 2.7 and 3.7 is as follows: import numpy as np, pandas as pd df1 = pd.DataFrame(index=np.arange(3), columns=['x','y','z']) df1['x'] = ['A','B','C'] df1.at[2,'y'] = 400 # rows/columns specified does not exist, appends new rows/columns to existing data frame df1.at['D','w'] = 9000 df1.loc['E','q'] = 499 # using df[<some_column_name>] == <condition> to retrieve target rows df1.at[df1['x']=='B', 'y'] = 10000 df1.loc[df1['x']=='B', ['z','w']] = 10000 # using a list of index to setup values df1.iloc[[1,2,4], 2] = 9999 df1.loc[[0,'D','E'],'w'] = 7500 df1.at[[0,2,"D"],'x'] = 10 df1.at[:, ['y', 'w']] = 8000 df1 >>> df1 x y z w q 0 10 8000 NaN 8000 NaN 1 B 8000 9999 8000 NaN 2 10 8000 9999 8000 NaN D 10 8000 NaN 8000 NaN E NaN 8000 9999 8000 499.0
you can use .iloc. df.iloc[[2], [0]] = 10
set_value() is deprecated. Starting from the release 0.23.4, Pandas "announces the future"... >>> df Cars Prices (U$) 0 Audi TT 120.0 1 Lamborghini Aventador 245.0 2 Chevrolet Malibu 190.0 >>> df.set_value(2, 'Prices (U$)', 240.0) __main__:1: FutureWarning: set_value is deprecated and will be removed in a future release. Please use .at[] or .iat[] accessors instead Cars Prices (U$) 0 Audi TT 120.0 1 Lamborghini Aventador 245.0 2 Chevrolet Malibu 240.0 Considering this advice, here's a demonstration of how to use them: by row/column integer positions >>> df.iat[1, 1] = 260.0 >>> df Cars Prices (U$) 0 Audi TT 120.0 1 Lamborghini Aventador 260.0 2 Chevrolet Malibu 240.0 by row/column labels >>> df.at[2, "Cars"] = "Chevrolet Corvette" >>> df Cars Prices (U$) 0 Audi TT 120.0 1 Lamborghini Aventador 260.0 2 Chevrolet Corvette 240.0 References: pandas.DataFrame.iat pandas.DataFrame.at
One way to use index with condition is first get the index of all the rows that satisfy your condition and then simply use those row indexes in a multiple of ways conditional_index = df.loc[ df['col name'] <condition> ].index Example condition is like ==5, >10 , =="Any string", >= DateTime Then you can use these row indexes in variety of ways like Replace value of one column for conditional_index df.loc[conditional_index , [col name]]= <new value> Replace value of multiple column for conditional_index df.loc[conditional_index, [col1,col2]]= <new value> One benefit with saving the conditional_index is that you can assign value of one column to another column with same row index df.loc[conditional_index, [col1,col2]]= df.loc[conditional_index,'col name'] This is all possible because .index returns a array of index which .loc can use with direct addressing so it avoids traversals again and again.
I tested and the output is df.set_value is little faster, but the official method df.at looks like the fastest non deprecated way to do it. import numpy as np import pandas as pd df = pd.DataFrame(np.random.rand(100, 100)) %timeit df.iat[50,50]=50 # ✓ %timeit df.at[50,50]=50 # ✔ %timeit df.set_value(50,50,50) # will deprecate %timeit df.iloc[50,50]=50 %timeit df.loc[50,50]=50 7.06 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 5.52 µs ± 64.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 3.68 µs ± 80.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) 98.7 µs ± 1.07 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) 109 µs ± 1.42 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) Note this is setting the value for a single cell. For the vectors loc and iloc should be better options since they are vectorized.
If one wants to change the cell in the position (0,0) of the df to a string such as '"236"76"', the following options will do the work: df[0][0] = '"236"76"' # %timeit df[0][0] = '"236"76"' # 938 µs ± 83.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) Or using pandas.DataFrame.at df.at[0, 0] = '"236"76"' # %timeit df.at[0, 0] = '"236"76"' #15 µs ± 2.09 µs per loop (mean ± std. dev. of 7 runs, 100000 loops each) Or using pandas.DataFrame.iat df.iat[0, 0] = '"236"76"' # %timeit df.iat[0, 0] = '"236"76"' # 41.1 µs ± 3.09 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) Or using pandas.DataFrame.loc df.loc[0, 0] = '"236"76"' # %timeit df.loc[0, 0] = '"236"76"' # 5.21 ms ± 401 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) Or using pandas.DataFrame.iloc df.iloc[0, 0] = '"236"76"' # %timeit df.iloc[0, 0] = '"236"76"' # 5.12 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) If time is of relevance, using pandas.DataFrame.at is the fastest approach.
Soo, your question to convert NaN at ['x',C] to value 10 the answer is.. df['x'].loc['C':]=10 df alternative code is df.loc['C', 'x']=10 df
df.loc['c','x']=10 This will change the value of cth row and xth column.
If you want to change values not for whole row, but only for some columns: x = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]}) x.iloc[1] = dict(A=10, B=-10)
From version 0.21.1 you can also use .at method. There are some differences compared to .loc as mentioned here - pandas .at versus .loc, but it's faster on single value replacement
In addition to the answers above, here is a benchmark comparing different ways to add rows of data to an already existing dataframe. It shows that using at or set-value is the most efficient way for large dataframes (at least for these test conditions). Create new dataframe for each row and... ... append it (13.0 s) ... concatenate it (13.1 s) Store all new rows in another container first, convert to new dataframe once and append... container = lists of lists (2.0 s) container = dictionary of lists (1.9 s) Preallocate whole dataframe, iterate over new rows and all columns and fill using ... at (0.6 s) ... set_value (0.4 s) For the test, an existing dataframe comprising 100,000 rows and 1,000 columns and random numpy values was used. To this dataframe, 100 new rows were added. Code see below: #!/usr/bin/env python3 # -*- coding: utf-8 -*- """ Created on Wed Nov 21 16:38:46 2018 #author: gebbissimo """ import pandas as pd import numpy as np import time NUM_ROWS = 100000 NUM_COLS = 1000 data = np.random.rand(NUM_ROWS,NUM_COLS) df = pd.DataFrame(data) NUM_ROWS_NEW = 100 data_tot = np.random.rand(NUM_ROWS + NUM_ROWS_NEW,NUM_COLS) df_tot = pd.DataFrame(data_tot) DATA_NEW = np.random.rand(1,NUM_COLS) #%% FUNCTIONS # create and append def create_and_append(df): for i in range(NUM_ROWS_NEW): df_new = pd.DataFrame(DATA_NEW) df = df.append(df_new) return df # create and concatenate def create_and_concat(df): for i in range(NUM_ROWS_NEW): df_new = pd.DataFrame(DATA_NEW) df = pd.concat((df, df_new)) return df # store as dict and def store_as_list(df): lst = [[] for i in range(NUM_ROWS_NEW)] for i in range(NUM_ROWS_NEW): for j in range(NUM_COLS): lst[i].append(DATA_NEW[0,j]) df_new = pd.DataFrame(lst) df_tot = df.append(df_new) return df_tot # store as dict and def store_as_dict(df): dct = {} for j in range(NUM_COLS): dct[j] = [] for i in range(NUM_ROWS_NEW): dct[j].append(DATA_NEW[0,j]) df_new = pd.DataFrame(dct) df_tot = df.append(df_new) return df_tot # preallocate and fill 
using .at def fill_using_at(df): for i in range(NUM_ROWS_NEW): for j in range(NUM_COLS): #print("i,j={},{}".format(i,j)) df.at[NUM_ROWS+i,j] = DATA_NEW[0,j] return df # preallocate and fill using .at def fill_using_set(df): for i in range(NUM_ROWS_NEW): for j in range(NUM_COLS): #print("i,j={},{}".format(i,j)) df.set_value(NUM_ROWS+i,j,DATA_NEW[0,j]) return df #%% TESTS t0 = time.time() create_and_append(df) t1 = time.time() print('Needed {} seconds'.format(t1-t0)) t0 = time.time() create_and_concat(df) t1 = time.time() print('Needed {} seconds'.format(t1-t0)) t0 = time.time() store_as_list(df) t1 = time.time() print('Needed {} seconds'.format(t1-t0)) t0 = time.time() store_as_dict(df) t1 = time.time() print('Needed {} seconds'.format(t1-t0)) t0 = time.time() fill_using_at(df_tot) t1 = time.time() print('Needed {} seconds'.format(t1-t0)) t0 = time.time() fill_using_set(df_tot) t1 = time.time() print('Needed {} seconds'.format(t1-t0))
I too was searching for this topic and I put together a way to iterate through a DataFrame and update it with lookup values from a second DataFrame. Here is my code. src_df = pd.read_sql_query(src_sql,src_connection) for index1, row1 in src_df.iterrows(): for index, row in vertical_df.iterrows(): src_df.set_value(index=index1,col=u'etl_load_key',value=etl_load_key) if (row1[u'src_id'] == row['SRC_ID']) is True: src_df.set_value(index=index1,col=u'vertical',value=row['VERTICAL'])
Pandas v1.1.0: Groupby rolling count slower than rolling mean & sum
I am running a groupby rolling count, sum & mean using Pandas v1.1.0 and I notice that the rolling count is considerably slower than the rolling mean & sum. This seems counter intuitive as we can derive the count from the mean and sum and save time. Is this a bug or am I missing something? Grateful for advice. import pandas as pd # Generate sample df df = pd.DataFrame({'column1': range(600), 'group': 5*['l'+str(i) for i in range(120)]}) # sort by group for easy/efficient joining of new columns to df df=df.sort_values('group',kind='mergesort').reset_index(drop=True) # timing of groupby rolling count, sum and mean %timeit df['mean']=df.groupby('group').rolling(3,min_periods=1)['column1'].mean().values %timeit df['sum']=df.groupby('group').rolling(3,min_periods=1)['column1'].sum().values %timeit df['count']=df.groupby('group').rolling(3,min_periods=1)['column1'].count().values ### Output 6.14 ms ± 812 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 5.61 ms ± 179 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 76.1 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) ### df Output for illustration print(df.head(10)) column1 group mean sum count 0 0 l0 0.0 0.0 1.0 1 120 l0 60.0 120.0 2.0 2 240 l0 120.0 360.0 3.0 3 360 l0 240.0 720.0 3.0 4 480 l0 360.0 1080.0 3.0 5 1 l1 1.0 1.0 1.0 6 121 l1 61.0 122.0 2.0 7 241 l1 121.0 363.0 3.0 8 361 l1 241.0 723.0 3.0 9 481 l1 361.0 1083.0 3.0
Did you really mean count (number of non-NaN values)? That can not be inferred from just sum and mean. I suspect that what you are looking for would be a size operator (just the length of the group, irrespective of whether or not there are any NaNs). While size exists in regular groupby, it seems that it is absent in RollingGroupBy (at least as of pandas 1.1.4). One can calculate the size of the rolling groups with: # DRY: rgb = df.groupby('group').rolling(3, min_periods=1)['column1'] # size is either: rgb.apply(len) # or rgb.apply(lambda g: g.shape[0]) Neither of those two is as fast as it could, of course, because there needs to be a call to the function for each group, rather than being all vectorized and working just off of the rolling window indices start and end. On my system, either of the above is 2x slower than rgb.sum() or rgb.mean(). Thinking about how one would implement size: it is obvious (just end - start for each window). Now, in the case one really wanted to speed up count (count of non-NaN values): one could establish a "cumulative count" at first: cumcnt = (1 - df['column1'].isnull()).cumsum() (this is very fast, about 200x faster than rgb.mean() on my system). Then the rolling function could simply take cumcnt[end] - cumcnt[start]. I don't know enough about the internals of RollingGroupBy (and their use of various mixins) to assess feasibility, but at least functionally it seems pretty straightforward. Update: It seems that the issue is already fixed with these commits. That was fast and simple --I am impressed with the internal architecture of pandas and all the tools they already have on their Swiss army knife!
How to use Argument as a Function name ? Python [duplicate]
Let's have a small dataframe: df = pd.DataFrame({'CID': [1,2,3,4,12345, 6]}) When I search for membership the speed is vastly different based on whether I ask to search in df.CID or in df['CID']. In[25]:%timeit 12345 in df.CID Out[25]:89.8 µs ± 254 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) In[26]:%timeit 12345 in df['CID'] Out[26]:42.3 µs ± 334 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each) In[27]:type( df.CID) Out[27]: pandas.core.series.Series In[28]:type( df['CID']) Out[28]: pandas.core.series.Series Why is that?
df['CID'] delegates to NDFrame.__getitem__ and it is more obvious you are performing an indexing operation. On the other hand, df.CID delegates to NDFrame.__getattr__, which has to do some additional heavy lifting, mainly to determine whether 'CID' is an attribute, a function, or a column you're calling using the attribute access (a convenience, but not recommended for production code). Now, why is it not recommended? Consider, df = pd.DataFrame({'A': [1, 2, 3]}) df.A 0 1 1 2 2 3 Name: A, dtype: int64 There are no issues referring to column "A" as df.A, because it does not conflict with any attribute or function namings in pandas. However, consider the pop function (just as an example). df.pop # <bound method NDFrame.pop of ...> df.pop is a bound method of df. Now, I'd like to create a column called "pop" for various reasons. df['pop'] = [4, 5, 6] df A pop 0 1 4 1 2 5 2 3 6 Great, but, df.pop # <bound method NDFrame.pop of ...> I cannot use the attribute notation to access this column. However... df['pop'] 0 4 1 5 2 6 Name: pop, dtype: int64 Bracket notation still works. That's why this is better.
Speed difference between bracket notation and dot notation for accessing columns in pandas
Vectorized way for applying a function to a dataframe to create lists
I have seen a few questions like these: Vectorized alternative to iterrows, Faster alternative to iterrows, Pandas: Alternative to iterrow loops, for loop using iterrows in pandas, python: using .iterrows() to create columns, Iterrows performance. But it seems like each is a unique case rather than a generalized approach. My question is also about .iterrows. I am trying to pass the first and second value of each row to a function and create a list out of it.

What I have: a pandas DataFrame with two columns that looks like this:

```
   I.D  Score
1   11     26
3   12     26
5   13     26
6   14     25
```

What I did (where Points is a function I defined earlier):

```python
my_points = [Points(int(row[0]), row[1]) for index, row in score.iterrows()]
```

What I am trying to do: a faster, vectorized form of the above.
Try a list comprehension over zip:

```python
score = pd.concat([score] * 1000, ignore_index=True)

def Points(a, b):
    return (a, b)

In [147]: %timeit [Points(int(a), b) for a, b in zip(score['I.D'], score['Score'])]
1.3 ms ± 132 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [148]: %timeit [Points(int(row[0]), row[1]) for index, row in score.iterrows()]
259 ms ± 5.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [149]: %timeit [Points(int(row[0]), row[1]) for row in score.itertuples(index=False)]
3.64 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```

(Note index=False in itertuples: without it, row[0] is the index, not the 'I.D' column.)
Have you ever tried the method .itertuples()?

```python
my_points = [Points(int(row[0]), row[1]) for row in score.itertuples(index=False)]
```

(index=False is needed here; otherwise row[0] is the index, not the 'I.D' column.) It is a faster way to iterate over a pandas dataframe. I hope it helps.
The question is actually not about how you iterate through a DataFrame and return a list, but rather how you can apply a function to the values in a DataFrame, row by row. You can use pandas.DataFrame.apply with axis set to 1:

```python
df.apply(func, axis=1)
```

To put the result in a list, it depends what your function returns, but you could (here Points must accept a row Series):

```python
df.apply(Points, axis=1).tolist()
```

If you want to apply it to only some columns:

```python
df[['Score', 'I.D']].apply(Points, axis=1)
```

If your function takes multiple arguments, you can use numpy.vectorize (note that it is essentially a convenience loop under the hood, not true vectorization, though it is typically faster than apply):

```python
np.vectorize(Points)(df['Score'], df['I.D'])
```

Or a lambda:

```python
df.apply(lambda x: Points(x['Score'], x['I.D']), axis=1).tolist()
```
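To tie the alternatives together, here is a minimal sketch (assuming, as a placeholder, that Points simply pairs its two arguments) showing that the zip comprehension and the row-wise apply produce the same list:

```python
import pandas as pd

df = pd.DataFrame({'I.D': [11, 12, 13, 14], 'Score': [26, 26, 26, 25]})

def Points(a, b):
    return (a, b)

# zip-based comprehension: iterates the raw column values,
# with no per-row Series construction
via_zip = [Points(int(a), b) for a, b in zip(df['I.D'], df['Score'])]

# row-wise apply with a lambda: each row is boxed into a Series first
via_apply = df.apply(lambda x: Points(int(x['I.D']), x['Score']), axis=1).tolist()

print(via_zip == via_apply)  # True
```

Both give the same result; the difference is purely in overhead, which is why the zip version wins in the timings above.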