I have the following dataframe, where I want to assign the bottom 1% quantile value to a new column. When I do this with the .loc notation, the assignment takes around 10 seconds, whereas the direct assignment takes only 2 seconds.
import numpy as np
import pandas as pd

df_temp = pd.DataFrame(np.random.randn(100_000_000, 1), columns=list('A'))
%time df_temp["q"] = df_temp["A"].quantile(0.01)
%time df_temp.loc[:, "q1_loc"] = df_temp["A"].quantile(0.01)
Why is the .loc solution slower? I understand using the .loc solution is safer, but if I want to assign data to all indices in the column, what can go wrong with the direct assignment?
.loc validates and aligns against the full index and column axes of your df (in this case, only 1 column), which is time-consuming and, here, redundant; computing the quantile of df_temp['A'] is negligible by comparison. Your direct assignment method, on the other hand, just evaluates df_temp['A'].quantile(0.01) and assigns it to df_temp['q']. It doesn't need to exhaustively check the indices/columns of your df.
See this answer for a similar description of the .loc method.
As far as safety is concerned, you are not using chained indexing, so you're probably safe (you're not trying to set anything on a copy of your data, it's being set directly on the data itself). It's good to be aware of the potential issues with not using .loc (see this post for a nice overview of SettingWithCopy warnings), but I think that you're OK as far as that goes.
If you want to be more explicit about your column creation, you could do something along the lines of df_temp = df_temp.assign(q=df_temp["A"].quantile(0.01)). It won't really change performance (I don't think), nor the result, but it allows you to see that you're explicitly assigning a new column to your existing dataframe (and thus not setting anything on a copy of said dataframe).
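Putting the three variants side by side, as a minimal sketch (a smaller frame than yours so it runs quickly; exact timings will vary by machine):

import numpy as np
import pandas as pd

df_temp = pd.DataFrame(np.random.randn(1_000_000, 1), columns=list('A'))
q = df_temp['A'].quantile(0.01)

df_temp['q'] = q                # direct assignment: broadcasts the scalar to every row
df_temp.loc[:, 'q1_loc'] = q    # .loc assignment: same result, more indexing overhead
df_temp = df_temp.assign(q2=q)  # assign: returns a new DataFrame with the extra column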
Related
I have a column with strings and I'm trying to find the number of tokens in it, and then create a new column in the same dataframe with those values.
data['tokens'] = data['query'].str.split().apply(len)
I get SettingWithCopyWarning. I'm not sure how to fix this. I understand I need to use .loc[row_indexer,col_indexer] = value but don't get how that would apply to this.
A SettingWithCopyWarning happens when you have made a copy of a slice of a DataFrame, but pandas thinks you might be trying to modify the underlying object.
To fix it, you need to understand the difference between a copy and a view. A copy makes an entirely new object. When you index into a DataFrame, like:
data['query'].str.split().apply(len)
or
data['tokens']
you're creating a new Series that is (in general) a copy of the original's data. If you modify this new object, it won't change the original data object. You can check that with the internal _is_view attribute, which returns a boolean value.
data['tokens']._is_view
On the other hand, when you assign with the .at, .loc, or .iloc accessors, you are operating on the original DataFrame itself: you subset it according to some criteria and manipulate the original object directly.
Pandas raises the SettingWithCopyWarning when you modify a copy while probably meaning to modify the original. To avoid this, you can explicitly call .copy() on the data you are copying, or use .loc to specify the columns you want to modify in data (or both).
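As a hedged sketch of both fixes; full_df here is a hypothetical larger frame that data was sliced from:

# fix 1: make the copy explicit when slicing, so later writes are unambiguous
data = full_df[full_df['query'].notna()].copy()
data['tokens'] = data['query'].str.split().apply(len)

# fix 2: keep the slice, but be explicit about what you're setting with .loc
data.loc[:, 'tokens'] = data['query'].str.split().apply(len)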
Since it depends a lot on what transformations you've done to your DataFrame already and how it is set up, it's hard to say exactly where and how you can fix it without seeing more of your code. There's unfortunately no one-size-fits-all answer. If you can post more of your code, I'm happy to help you debug it.
One thing you might try is creating an intermediate lengths object explicitly, in case that is the problem. So your code would look like:
lengths = data['query'].str.split().apply(len).copy()
data['tokens'] = lengths
I am taking a Data Science course about data analysis in Python. At one point in the course the professor says:
You can chain operations together. For instance, we could have rewritten the query for all Store 1 costs as df.loc['Store 1']['Cost']. This looks pretty reasonable and gets us the result we wanted. But chaining can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause Pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting data, this is not a big deal, though it might be slower than necessary. If you are changing data, though, this is an important distinction and can be a source of error.
Later on, he describes chain indexing as:
Generally bad, pandas could return a copy of a view depending upon NumPy
So, he suggests using multi-axis indexing (df.loc['a', '1']).
I'm wondering whether it is always advisable to stay clear of chain indexing, or are there specific use cases where it shines?
Also, if it is true that it can return either a copy or a view (depending upon NumPy), what exactly does that depend on, and can I influence it to get the desired outcome?
I've found this answer that states:
When you use df['1']['a'], you are first accessing the series object s = df['1'], and then accessing the series element s['a'], resulting in two __getitem__ calls, both of which are heavily overloaded (handle a lot of scenarios, like slicing, boolean mask indexing, and so on).
...which makes it seem chain indexing is always bad. Thoughts?
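For concreteness, the two patterns I'm asking about look like this on a tiny made-up frame:

import pandas as pd

df = pd.DataFrame({'1': [10, 20]}, index=['a', 'b'])

df['1']['a']           # chained: two __getitem__ calls on intermediate objects
df.loc['a', '1']       # multi-axis: one lookup, always unambiguous

df['1']['a'] = 99      # may write to a temporary copy; pandas can't always tell
df.loc['a', '1'] = 99  # always writes to df itself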
As I understand it, the advantage to using the set_index function with a particular column is to allow for direct access to a row based on a value. As long as you know the value, this eliminates the need to search using something like loc thus cutting down the running time of the operation. Pandas also allows you to set multiple columns as the index using this function. My question is, after how many columns do these indexes stop being valuable? If I were to specify every column in my dataframe as the index would I still see increased speed in indexing rows over searching with loc?
The real downside of setting everything as index is buried deep in the advanced indexing docs of Pandas: indexing can change the dtype of the column being set to index. I would expect you to encounter this problem before realizing the prospective performance benefit.
As for that performance benefit, you pay for index construction up front when the DataFrame (or Series) is built, regardless of whether you explicitly set an index; AFAIK pandas indexes everything by default. And as Jake VanderPlas puts it in his excellent book:
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.
-- Jake VanderPlas, The Python Data Science Handbook
So, the reason to set something as index is to make it easier for you to work with your data or to support your data access pattern, not necessarily for performance optimization like a database index.
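A small made-up example of that access-pattern point:

import pandas as pd

df = pd.DataFrame({'user': ['ann', 'bob'], 'score': [1, 2]})

# without a meaningful index, lookups go through a boolean mask
df.loc[df['user'] == 'bob', 'score']

# with 'user' as the index, the same lookup reads as direct access
df = df.set_index('user')
df.loc['bob', 'score']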
Simple question, and my google-fu is not strong enough to find the right term to get a solid answer from the documentation. Any term I look for that includes either change or modify leads me to questions like 'How to change column name....'
I am reading in a large dataframe, and I may be adding new columns to it. These columns are based on interpolating values on a row-by-row basis, and the sheer number of rows makes this process take a couple of hours. Hence, I save the dataframe afterwards, which also takes a bit of time: 30 seconds at least.
My current code always saves the dataframe, even if I have not added any new columns. Since I am still developing some plotting tools around it, I am needlessly wasting a lot of time waiting for the save to finish every time the script terminates.
Is there a DataFrame attribute I can test to see if the DataFrame has been modified? Essentially, if this is False I can skip saving at the end of the script, but if it is True then a save is necessary. This simple one-line if would save me a lot of time and a lot of SSD writes!
You can use:
df.equals(old_df)
You can read about its functionality in the pandas documentation. It does exactly what you want, returning True only if both DataFrames are equal, and it's probably the fastest way to do this since it's implemented in pandas itself.
Note that you need to use .copy() when assigning old_df before making changes to your current df; otherwise old_df may just be another reference to the same object rather than an independent snapshot.
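Putting it together, a minimal sketch (the pickle path is just a placeholder for however you persist your frame):

old_df = df.copy()  # independent snapshot taken before any modifications

# ... interpolation / column-adding work may or may not happen here ...

if not df.equals(old_df):
    df.to_pickle('frames/output.pkl')  # only pay for the save when something changed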
I need to do an apply on a dataframe using inputs from multiple rows. As a simple example, I can do the following if all the inputs are from a single row:
df['c'] = df[['a','b']].apply(lambda x: x['a'] + x['b'], axis=1)  # the sum stands in for whatever function you need
# or
df['d'] = df[['b','c']].shift(1).apply(...) # to get the values from the previous row
However, if I need 'a' from the current row, and 'b' from the previous row, is there a way to do that with apply? I could add a new 'bshift' column and then just use df[['a','bshift']] but it seems there must be a more direct way.
Related but separate, when accessing a specific value in the df, is there a way to combine labeled indexing with integer-offset? E.g. I know the label of the current row but need the row before. Something like df.at['labelIknow'-1, 'a'] (which of course doesn't work). This is for when I'm forced to iterate through rows. Thanks in advance.
Edit: Some info on what I'm doing etc. I have a pandas store containing tables of OHLC bars (one table per security). When doing backtesting, currently I pull the full date range I need for a security into memory, and then resample it into a frequency that makes sense for the test at hand. Then I do some vectorized operations for things like trade entry signals etc. Finally I loop over the data from start to finish doing the actual backtest, e.g. checking for trade entry exit, drawdown etc - this looping part is the part I'm trying to speed up.
This should directly answer your question and let you use apply, although I'm not sure it's ultimately any better than a two-line solution. It does avoid creating extra variables at least.
df['c'] = pd.concat([df['a'], df['a'].shift()], axis=1).apply(np.mean, axis=1)
That will put the mean of 'a' values from the current and previous rows into 'c', for example.
This isn't as general, but for simpler cases you can do something like this (continuing the mean example):
df['c'] = ( df['a'] + df['a'].shift() ) / 2
That is about 10x faster than the concat() method on my tiny example dataset. I imagine that's as fast as you could do it, if you can code it in that style.
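For completeness, the explicit 'bshift' column you mentioned would look something like this (the sum is a stand-in for whatever your function actually is):

df['bshift'] = df['b'].shift()  # previous row's 'b', aligned to the current row
df['d'] = df[['a', 'bshift']].apply(lambda row: row['a'] + row['bshift'], axis=1)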
You could also look into reshaping the data with stack() and hierarchical indexing. That would be a way to get all your variables into the same row but I think it will likely be more complicated than the concat method or just creating intermediate variables via shift().
For the first part, I don't think such a thing is possible. If you update your question with what you actually want to achieve, I can update this answer.
Also, looking at the second part, your data structure seems to rely an awful lot on the order of rows. This is typically not how you want to manage your databases. Again, if you tell us what your overall goal is, we may be able to point you towards a solution (and potentially a better way to structure the database).
Anyhow, one way to get the row before, if you know a given index label, is to do:
df.loc[:'labelYouKnow'].iloc[-2]
Note that this is not the optimal thing to do efficiency-wise, so you may want to improve your db structure in order to avoid needing such lookups.
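If you need this kind of lookup often, an alternative sketch uses Index.get_loc to turn the known label into an integer position first:

i = df.index.get_loc('labelYouKnow')  # integer position of the known label
prev_row = df.iloc[i - 1]             # the row immediately before it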