I am taking a Data Science course about data analysis in Python. At one point in the course the professor says:
You can chain operations together.
For instance, we could have rewritten the query for
all Store 1 costs as df.loc['Store 1']['Cost'].
This looks pretty reasonable and gets us the result we wanted.
But chaining can come with some costs and
is best avoided if you can use another approach.
In particular, chaining tends to cause Pandas to return a copy of the DataFrame
instead of a view on the DataFrame.
For selecting data,
this is not a big deal, though it might be slower than necessary.
If you are changing data though, this is an important distinction and
can be a source of error.
Later on, he describes chain indexing as:
Generally bad, pandas could return a copy of a view depending upon NumPy
So, he suggests using multi-axis indexing (df.loc['a', '1']).
I'm wondering whether it is always advisable to steer clear of chain indexing, or are there specific use cases where it shines?
Also, if it is true that it can return a copy of a view or a view (depending upon NumPy), what exactly does it depend on and can I influence it to get the desired outcome?
I've found this answer that states:
When you use df['1']['a'], you are first accessing the series object s = df['1'], and then accessing the series element s['a'], resulting in two __getitem__ calls, both of which are heavily overloaded (handle a lot of scenarios, like slicing, boolean mask indexing, and so on).
...which makes it seem chain indexing is always bad. Thoughts?
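To make the copy-versus-view concern concrete, here is a minimal sketch. The data values are made up; only the 'Store 1' and 'Cost' labels come from the course example. It assumes a reasonably recent pandas, and the warning that pandas emits on the chained write is silenced only to keep the output quiet.

```python
import warnings

import pandas as pd

# Hypothetical data; only the labels 'Store 1' and 'Cost' come from the question.
df = pd.DataFrame(
    {"Cost": [22.5, 2.5, 5.0], "Name": ["Chris", "Kevyn", "Vinod"]},
    index=["Store 1", "Store 1", "Store 2"],
)

# Reading: the chained and the multi-axis form select the same values.
chained = df.loc["Store 1"]["Cost"]
multi_axis = df.loc["Store 1", "Cost"]
same_values = chained.equals(multi_axis)

# Writing through a chain: the boolean selection produces a copy, so the
# assignment lands on a temporary object and df itself is left unchanged.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    df[df["Cost"] > 20]["Cost"] = 0.0
cost_after_chained = df["Cost"].tolist()

# Writing with a single .loc call modifies df itself.
df.loc[df["Cost"] > 20, "Cost"] = 0.0
cost_after_loc = df["Cost"].tolist()
```

So for reads the chain is mostly a matter of speed and style, while for writes a single .loc call is the form that reliably hits the original frame.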
When to use list of objects over dataframes in Python?
I have a list of strings, each of which will have multiple attributes like score, word count, some boolean values, etc. I have created a list of objects with these attributes. But I wonder whether it would be better to simply create a dataframe with each string as a row and its attributes as columns.
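class MyObject:
    # Constructor added for completeness; attribute names are taken from the getters below.
    def __init__(self, str_name, similarity, similarity_band):
        self.str_name = str_name
        self.similarity = similarity
        self.similarity_band = similarity_band
    def getString(self):
        return self.str_name
    def getSimilarity(self):
        return self.similarity
    def getSimilarityBand(self):
        return self.similarity_band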
Which is a better design?
It's very dependent on your context.
If you're building a job which is reading some data, applying transformations on top of that data and then writing it to an output file/bucket then it is common to use dataframes (e.g. pandas if it will fit into memory or pyspark if it needs to be distributed). One reason for this is there are some optimisations that these libraries do under the hood when applying these kinds of transformations which make your jobs more efficient.
On the other hand, if you're building a more complex application with lots of object hierarchies or something that more closely models the real world where you feel well-defined objects will make your code easier to read, then the object approach makes more sense.
In the end, this comes down to style; and in a way functional programming vs object-oriented programming. Python sits in the middle of these worlds so it's natural that there's going to be some conflict. There's no right or wrong way.
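As a rough side-by-side sketch of the two designs (the class name ScoredString and the sample values are hypothetical; the attribute names follow the question), the same filter can be written against a list of objects or against a dataframe built from them:

```python
from dataclasses import asdict, dataclass

import pandas as pd


@dataclass
class ScoredString:  # hypothetical stand-in for the question's MyObject
    str_name: str
    similarity: float
    similarity_band: str


items = [
    ScoredString("alpha", 0.91, "high"),
    ScoredString("beta", 0.42, "low"),
    ScoredString("gamma", 0.77, "high"),
]

# Object style: plain Python iteration over well-defined objects.
high_objects = [item.str_name for item in items if item.similarity_band == "high"]

# Dataframe style: one row per string, one column per attribute.
df = pd.DataFrame([asdict(item) for item in items])
high_frame = df.loc[df["similarity_band"] == "high", "str_name"].tolist()

# Bulk transformations such as grouped aggregates are where the dataframe helps.
mean_by_band = df.groupby("similarity_band")["similarity"].mean()
```

Both give the same answer; the dataframe version starts to pay off once you need column-wise transformations over many rows.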
As I understand it, the advantage to using the set_index function with a particular column is to allow for direct access to a row based on a value. As long as you know the value, this eliminates the need to search using something like loc thus cutting down the running time of the operation. Pandas also allows you to set multiple columns as the index using this function. My question is, after how many columns do these indexes stop being valuable? If I were to specify every column in my dataframe as the index would I still see increased speed in indexing rows over searching with loc?
The real downside of setting everything as index is buried deep in the advanced indexing docs of Pandas: indexing can change the dtype of the column being set to index. I would expect you to encounter this problem before realizing the prospective performance benefit.
As for that performance benefit, you pay for indexing up front when you construct the Series object, regardless of whether you explicitly set an index. AFAIK pandas gives every Series and DataFrame an index by default. And as Jake VanderPlas puts it in his excellent book:
If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.
-- Jake VanderPlas, The Python Data Science Handbook
So, the reason to set something as index is to make it easier for you to work with your data or to support your data access pattern, not necessarily for performance optimization like a database index.
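A small sketch of that point, with made-up column names and values: the frame already has a default index before set_index is called, and setting one or more columns as the index mainly changes how you address rows.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame(
    {
        "city": ["Oslo", "Lima", "Oslo"],
        "year": [2020, 2020, 2021],
        "sales": [10, 20, 30],
    }
)

# Every frame already has an index, even if none was set explicitly.
default_index_type = type(df.index).__name__

# One column as index: rows are addressed by label.
by_city = df.set_index("city")
oslo_sales = by_city.loc["Oslo", "sales"].tolist()

# Two columns as index: rows are addressed by a (city, year) tuple.
by_city_year = df.set_index(["city", "year"]).sort_index()
cell_via_index = by_city_year.loc[("Oslo", 2021), "sales"]

# The same lookup without any custom index, using boolean masks.
cell_via_mask = df.loc[(df["city"] == "Oslo") & (df["year"] == 2021), "sales"].iloc[0]
```

Which of the two lookups reads better depends on the access pattern, which is the real reason to pick an index.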
I have the following dataframe, where I want to assign the bottom 1% value (the 0.01 quantile) to a new column. When I do this calculation using the ".loc" notation, the assignment takes around 10 seconds, whereas the alternative solution takes only 2 seconds.
import numpy as np
import pandas as pd

df_temp = pd.DataFrame(np.random.randn(100000000, 1), columns=["A"])
%time df_temp["q"] = df_temp["A"].quantile(0.01)
%time df_temp.loc[:, "q1_loc"] = df_temp["A"].quantile(0.01)
Why is the .loc solution slower? I understand using the .loc solution is safer, but if I want to assign data to all indices in the column, what can go wrong with the direct assignment?
.loc is searching along the entirety of indices and columns (in this case, only 1 column) in your df along the whole axes, which is time consuming and perhaps redundant, in addition to figuring out the quantile of df_temp['A'] (which is negligible as far as calculation time goes). Your direct assignment method, on the other hand, just evaluates df_temp['A'].quantile(0.01) and assigns the result to df_temp['q']. It doesn't need to exhaustively search the indices/columns of your df.
See this answer for a similar description of the .loc method.
As far as safety is concerned, you are not using chained indexing, so you're probably safe (you're not trying to set anything on a copy of your data, it's being set directly on the data itself). It's good to be aware of the potential issues with not using .loc (see this post for a nice overview of SettingWithCopy warnings), but I think that you're OK as far as that goes.
If you want to be more explicit about your column creation, you could do something along the lines of df_temp = df_temp.assign(q=df_temp["A"].quantile(0.01)). It won't really change performance (I don't think), nor the result, but it allows you to see that you're explicitly assigning a new column to your existing dataframe (and thus not setting anything on a copy of said dataframe).
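For reference, a scaled-down sketch of the three ways of creating the column discussed here. The frame is much smaller than in the question, the seed is arbitrary, and the column name q_assign is made up. It only checks that the results agree and makes no claim about timing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
df_temp = pd.DataFrame(rng.standard_normal((1_000, 1)), columns=["A"])

q = df_temp["A"].quantile(0.01)

df_temp["q"] = q                      # direct assignment
df_temp.loc[:, "q1_loc"] = q          # .loc assignment
df_temp = df_temp.assign(q_assign=q)  # explicit new-column creation

# All three columns hold the same scalar in every row.
all_equal = bool(
    (df_temp["q"] == df_temp["q1_loc"]).all()
    and (df_temp["q"] == df_temp["q_assign"]).all()
)
```

None of these goes through chained indexing, so none of them risks writing to a temporary copy.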
I am trying to use pandas pd.DataFrame.where as follows:
df.where(cond=mask, other=df.applymap(f))
Where f is a user defined function to operate on a single cell. I cannot use other=f as it seems to produce a different result.
So basically I want to evaluate the function f at all cells of the DataFrame which does not satisfy some condition which I am given as the mask.
The above usage using where is not very efficient as it evaluates f immediately for the entire DataFrame df, whereas I only need to evaluate it at some entries of the DataFrame, which can sometimes be very few specific entries compared to the entire DataFrame.
Is there an alternative usage/approach that could be more efficient in solving this general case?
As you correctly stated, df.applymap(f) is evaluated before df.where(). I'm fairly certain that df.where() is a quick function and is not the bottleneck here.
It's more likely that df.applymap(f) is inefficient, and there's usually a faster way of doing f in a vectorized manner. Having said that, if you believe this is impossible and f is itself slow, you could modify f to return its input unchanged for the cells that where keeps (those where your mask is True). This is most likely going to be really slow though, and you'll definitely prefer trying to vectorize f instead.
If you really must do it element-wise, you could use a NumPy array:
import numpy as np
result = df.to_numpy(copy=True)  # work on a copy so df itself is not modified
# where() keeps the cells whose mask is True, so f is applied only where it is False
for i, j in zip(*np.where(~np.asarray(mask))):
    result[i, j] = f(result[i, j])
It's critical that you use a NumPy array for this, rather than .iloc or .loc in the dataframe, because indexing a pandas dataframe is slow.
You could compare the speed of this with .applymap; for the same operation, I don't think .applymap is substantially faster (if at all) than simply a for loop, because all pandas does is run a for loop of its own in Python (maybe Cython? But even that only saves on the overhead, and not the function itself). This is different from 'proper' vectorization, because vector operations are implemented in C.
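A small sketch comparing the two routes on toy data. The function f, the frame and the mask are all hypothetical, and the vectorized expression was written by hand to match this particular f.

```python
import numpy as np
import pandas as pd


def f(x):
    # Hypothetical cell-wise function, standing in for the user-defined f.
    return x * x + 1.0


df = pd.DataFrame({"a": [1.0, -2.0, 3.0], "b": [-4.0, 5.0, -6.0]})
mask = df > 0  # cells where the mask is True are kept as they are

# Element-wise route: call f only on the cells where the mask is False.
values = df.to_numpy(copy=True)
for i, j in zip(*np.where(~mask.to_numpy())):
    values[i, j] = f(values[i, j])
selective = pd.DataFrame(values, index=df.index, columns=df.columns)

# Vectorized route: the same arithmetic expressed on whole columns at once.
vectorized = df.where(mask, df * df + 1.0)
```

The vectorized form still computes the replacement for every cell, but it does so in compiled code, which is usually far cheaper than calling a Python function per cell.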
Goal: sorting a sequence in a functional way without using builtin sorted(..) function.
def my_sorted(seq):
    """returns an iterator"""
    pass
Motivation: In the FP way, I am constrained:
never mutate seq (which could be an iterator or a realized list)
By implication, no in-place sorting.
Question 1 Since I cannot mutate seq, I would need to maintain a separate mutable data structure to store the sorted sequence. That seems wasteful compared to an in-place list.sort(). How do other functional programming languages handle this?
Question 2 If I return a mutable sequence, is that OK in the functional paradigm?
Of course sorting cannot be totally lazy (the last element of the input could be the first one of the output), but you could implement a computationally lazy sort that, after reading the whole sequence, only generates exact sorted output on request, element by element. You can also delay reading the input until at least one output is requested, so sorting and then ignoring the result requires no computation.
For this computationally lazy approach the best candidate I know is the heapsort algorithm (you only do the heap-building step upfront).
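One possible sketch of that idea using the standard heapq module. It fills in the my_sorted stub from the question as a generator, which is my own choice rather than anything prescribed above.

```python
import heapq
from itertools import islice


def my_sorted(seq):
    """Return an iterator over the items of seq in ascending order.

    Because this is a generator function, nothing below runs until the
    first item is requested.
    """
    heap = list(seq)     # private copy: the caller's data is never mutated
    heapq.heapify(heap)  # heap-building step, O(n), done once up front
    while heap:
        yield heapq.heappop(heap)  # O(log n) work per requested item


data = [5, 1, 4, 2, 3]
first_two = list(islice(my_sorted(data), 2))  # only two pops are performed
everything = list(my_sorted(iter(data)))      # also works on a plain iterator
```

The only mutable structure is the private heap inside the generator, so the input is left untouched and callers never see the mutation.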
Mutation in-place is only safe if no one else has references to the data, expecting it to be as it was prior to the sort. So it isn't really wasteful to have a new structure for the sorted results, in general. The in-place optimization is only safe if you're using the data in a linear fashion.
So, just allocate a new structure, since that is more generally useful. The in-place version is a special case.
The appropriate defensive programming is wasteful at times, but there's also nothing you can do about it.
This is why languages built to support functional use from the ground up use structural sharing for their natively immutable types; programming in a functional style in a language which isn't built for it (such as Python) isn't going to be as well-supported as a matter of course. That said, a sort operation isn't necessarily a good candidate for structural sharing (if more than minor changes need to be made).
As such, there is often at least one copy operation involved in a sort, even in other functional languages. Clojure, for instance, delegates to Java's native (highly optimized) sort operation on a temporary mutable array, and returns a seq wrapping that array (thus making the result just as immutable as the input which was used to populate it). If the inputs are immutable, the outputs are immutable, and what happens in between isn't visible to the outside world (particularly, to any other thread), transient mutability is often a necessary and appropriate thing.
Use a sorting algorithm that can be performed in a manner that creates a new datastructure, such as heapsort or mergesort.
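As an illustration, here is a minimal mergesort sketch that never touches its input and returns a new tuple. The function names are my own, and the temporary list inside the merge step is a purely local mutation that callers cannot observe.

```python
def merge_sort(seq):
    """Return a new sorted tuple; seq itself is never modified."""
    items = tuple(seq)
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    return _merge(merge_sort(items[:mid]), merge_sort(items[mid:]))


def _merge(left, right):
    """Merge two sorted tuples into a new sorted tuple."""
    merged = []  # local scratch list, invisible to callers
    i = j = 0
    while i < len(left) and j < len(right):
        if right[j] < left[i]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])  # ties take the left item, keeping the sort stable
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return tuple(merged)


original = [3, 1, 2, 3, 0]
result = merge_sort(original)
```

Returning a tuple also covers the question about mutable results: the caller gets an immutable value back.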
Wasteful of what? bits? electricity? wall-clock time? A parallel merge-sort may be the quickest to complete if you have enough cpus and a large amount of data, but may produce many intermediary representations.
In general, parallelising an algorithm may lead to a very different optimisation strategy than the one for a serial algorithm. For instance, because of Amdahl's Law, it can pay to redo redundant work locally in order to avoid sharing. This may be considered "wasteful" in a serial context, but leads to a much more scalable algorithm.
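To put rough numbers on the Amdahl's Law point, here is a small sketch of the standard formula, speedup = 1 / ((1 - p) + p / n), where p is the parallelisable fraction of the work and n the number of workers. The function name and the example fractions are my own choices.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Theoretical speedup when only parallel_fraction of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# With 16 workers, shrinking the serial (shared) part from 10% to 1%
# roughly doubles the achievable speedup.
speedup_with_sharing = amdahl_speedup(0.90, 16)
speedup_less_sharing = amdahl_speedup(0.99, 16)
```

This is why extra local work that removes a serial bottleneck can be a net win on many CPUs.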