I am trying to use pandas pd.DataFrame.where as follows:
df.where(cond=mask, other=df.applymap(f))
where f is a user-defined function that operates on a single cell. I cannot use other=f, as it seems to produce a different result.
So basically I want to evaluate the function f at all cells of the DataFrame that do not satisfy some condition, which I am given as the mask.
The above usage of where is not very efficient, as it evaluates f immediately for the entire DataFrame df, whereas I only need to evaluate it at some entries of the DataFrame, which can sometimes be very few compared to the whole thing.
Is there an alternative usage/approach that could be more efficient in solving this general case?
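For concreteness, here is a minimal sketch of the pattern in question; the DataFrame, the mask and the expensive f below are made-up placeholders:

import numpy as np
import pandas as pd

def f(x):
    # stand-in for an expensive per-cell computation
    return x ** 2 + 1

df = pd.DataFrame(np.arange(12, dtype=float).reshape(3, 4))
mask = df > 5    # True -> keep the original value, False -> apply f

result = df.where(cond=mask, other=df.applymap(f))   # f is evaluated for every single cell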
As you correctly stated, df.applymap(f) is evaluated before df.where(). I'm fairly certain that df.where() is a quick function and is not the bottleneck here.
It's more likely that df.applymap(f) is inefficient, and there's usually a faster way of doing f in a vectorized manner. Having said that, if you do believe this is impossible and f is itself slow, you could modify f to leave the input unchanged wherever your mask is True (i.e. wherever the original value should be kept). This is most likely going to be really slow though, and you'll definitely prefer trying to vectorize f instead.
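For illustration, assuming the mask can be expressed as a condition on the cell value itself (the cond function below is my assumption), that idea would look roughly like this:

def cond(x):
    return x > 5              # hypothetical: the condition that defines the mask

def f_masked(x):
    if cond(x):               # the cell satisfies the condition -> keep it unchanged
        return x
    return f(x)               # otherwise do the expensive work

result = df.applymap(f_masked)   # f is now only evaluated where it is needed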
If you really must do it element-wise, you could use a NumPy array:
result = df.values
for i, j in zip(*np.where(~mask)):   # iterate only over the positions where the condition is not met
    result[i, j] = f(result[i, j])
It's critical that you use a NumPy array for this, rather than .iloc or .loc on the DataFrame, because indexing a pandas DataFrame is slow.
You could compare the speed of this with .applymap; for the same operation, I don't think .applymap is substantially faster (if at all) than a plain for loop, because all pandas does is run a loop of its own in Python (maybe Cython, but even that only saves on the overhead, not on the function calls themselves). This is different from 'proper' vectorization, where the vector operations are implemented in C.
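If you need the result back as a DataFrame (and want to be explicit about not mutating df in place, since .values can be a view), here is a small variation of the loop above, assuming f and mask are as in the question:

import numpy as np
import pandas as pd

result = df.to_numpy(copy=True)          # explicit copy, df itself stays untouched
for i, j in zip(*np.where(~mask)):       # only the cells where the condition fails
    result[i, j] = f(result[i, j])
result_df = pd.DataFrame(result, index=df.index, columns=df.columns)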
Related
I have a DataFrame with mixed dtypes, e.g. of shape (1000, 200), and I'd like to assign/update 100 columns of this df with a NumPy (integer) array of shape (1000, 100).
Initially what I did was very basic, something like df.loc[:, [100 COLUMNS]] = my_np_array. However, I have a running-time limit, and the faster all my code executes the better. After doing line profiling on my code, it turned out that almost 70% of the time was spent on this assignment operation, so I went looking for a faster method. The only other method I came across is to modify the underlying array directly, i.e. df.values[:, [100 COLUMN INDICES]] = my_np_array. This was much, much faster and I was pretty satisfied with it for a moment. (If anyone can shed some light on why it is much faster, I would appreciate that too.)
That lasted only until I discovered that the values of my DataFrame were actually not changed: it seems that if the DataFrame has mixed dtypes, then .values returns a copy instead of a view/reference, which means all the changes I made were not applied to the original values, so I can't just modify .values. Now I'm stuck; if anyone can think of ways to improve this assignment operation, that would be great.
(One potential method I'm thinking of is to create a separate DataFrame from this NumPy array and then merge it with the original df. However, I actually need to do the assignment operation many times, so this probably doesn't work, as the time to create a DataFrame and merge would not be negligible either.)
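For reference, here is a small sketch that reproduces the behaviour described above; the column names and values are made up:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"], "c": [4, 5, 6]})
new_vals = np.array([[10, 40], [20, 50], [30, 60]])

# With mixed dtypes, .values materialises an object-dtype *copy*, so this write is lost:
df.values[:, [0, 2]] = new_vals
print(df["a"].tolist())                  # still [1, 2, 3]

# .loc assignment goes through pandas' alignment and dtype machinery, but it does update df:
df.loc[:, ["a", "c"]] = new_vals
print(df["a"].tolist())                  # [10, 20, 30]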
Is there a way to have custom C functions act over a pandas DataFrame? I know I can wrap a C function in a Python function and use that over row-wise iteration, for instance, but that seems inefficient. I know pandas is written in C. I would love a simple way of telling pandas "use this C function". This is naive, but something like this
...
cFunc = get_c_function_some_how()
for i in range(1000):
    df = df.use_c_function(cFunc)
use_df(df)
...
My use case is that I do a simple but somewhat computationally expensive set of computations over and over again, and I would like to make that particular set of computations significantly faster.
EDIT: I suppose passing the entirety of the pandas DataFrame to the C function somehow would be fine; realistically the iteration should probably happen inside C anyway. So if a Python-wrapped C function needs to be called once and the data is just handed over to C for computation, that seems like a pretty good solution. I personally couldn't find documentation on doing something like that.
There is a way to do it, but I wouldn't describe it as "easy."
Internally, Pandas uses numpy to store data. If you can get the data as a numpy vector, you can pass that to C, and have it operate on the vector.
Getting a numpy vector from a column is easy:
vec = df["foo"].to_numpy()
Next, you need to ensure that the vector is contiguous. You can't assume that it is, because pandas will store data from multiple columns in the same numpy array if the data has compatible types.
vec = np.ascontiguousarray(vec)
Then, you can pass the numpy array to C as described in this answer. This will work for numerical data. If you want to work with strings, that's more complicated.
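As a rough illustration of that last step, here is a sketch using ctypes; the shared library name and the scale function it exports are hypothetical, only the NumPy/ctypes calls are real:

import ctypes
import numpy as np
import pandas as pd

# Hypothetical C signature: void scale(double *data, size_t n, double factor);
lib = ctypes.CDLL("./libmyfuncs.so")
lib.scale.argtypes = [
    np.ctypeslib.ndpointer(dtype=np.float64, flags="C_CONTIGUOUS"),
    ctypes.c_size_t,
    ctypes.c_double,
]
lib.scale.restype = None

df = pd.DataFrame({"foo": [1.0, 2.0, 3.0]})
vec = np.ascontiguousarray(df["foo"].to_numpy(dtype=np.float64))
lib.scale(vec, vec.size, 2.0)   # the C code modifies vec in place
df["foo"] = vec                 # write the result back into the DataFrame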
I recommend reading Pandas Under The Hood if you go this route. It explains many important things, like why the numpy arrays are not contiguous.
Old-school C programmer trying to get with the times and learn Python. I'm struggling to see how to use vectorization effectively to replace for loops. I get the basic concept that Python can do mathematical functions on entire matrices in a single statement, and that's really cool. But I seldom work with mathematical relationships. Almost all my for loops apply CONDITIONAL logic.
Here's a very simple example to illustrate the concept:
import numpy as np
# Initial values
default = [1,2,3,4,5,6,7,8]
# Override values should only replace initial values when not nan
override = [np.nan,np.nan,3.5,np.nan,5.6,6.7,np.nan,8.95]
# I wish I knew how to replace this for loop with a single line of vectorized code
for i in range(len(default)):
    if not np.isnan(override[i]):  # only override when the override value is other than nan
        default[i] = override[i]
default
I have a feeling that for loop could be eliminated with a single python statement that only overwrites values of default with values of override that are not np.nan. But I can't see how to do it.
This is just a simplified example to illustrate the concept. My real question is whether or not vectorization is generally useful to replace for loops with conditional logic, or if it's only applicable to mathematical relationships, where the benefits and method of achieving them are obvious. All of my real code challenges are much more complex and the conditional logic is more complex than just a simple "only use this value if it's non-nan".
I found hundreds of articles online about how to use vectorization in Python, but they all seem to focus on replacing mathematical calculations in for loops. All my for loops involve conditional logic. Can vectorization help me or am I trying to fit a square peg in a round hole?
Thanks!
First things first, the vectorized version:
override_is_not_nan = np.logical_not(np.isnan(override))
np.where(override_is_not_nan, override, default)
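For the example data above, this evaluates to array([1. , 2. , 3.5 , 4. , 5.6 , 6.7 , 7. , 8.95]).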
As for your real question, vectorization is useful for parallel processing.
And not just for multi-core CPUs.
Considering today's GPUs have thousands of cores, using tensors with similar code can make it run much faster.
How much faster? That depends on your data, implementation and hardware.
Evidently, the combination of vectorization with GPUs is part of what enabled the huge progress in the field of Deep Learning.
A list comprehension is usually the preferred one-line alternative to a for loop in Python, and it is possible to throw a conditional into the comprehension as well.
In this specific case we iterate over the elements of default and override by zipping them together, and replace the values of default according to the conditional check.
>>> [y if not(np.isnan(y)) else x for (x,y) in zip(default, override)]
[1, 2, 3.5, 4, 5.6, 6.7, 7, 8.95]
To answer your broader question about vectorization and speedups, the answer unfortunately is: it depends. There are situations where a simple for loop performs better than its vectorized counterpart. List comprehensions, for example, are mostly about improving the readability of code rather than providing a serious speedup.
The answers on this question address this in more detail.
First, find the indices where the non-nan values are located.
Then replace the values at those indices in the default array with the corresponding values from the override array.
import numpy as np
np_default = np.array(default).astype(float) # Convert default to a NumPy array of float values
non_nan_indices = np.where(~np.isnan(override)) # Get non nan indices
np_default[non_nan_indices] = np.array(override)[non_nan_indices] # Replacing the values at non-nan indices
np_default # Returns array([1. , 2. , 3.5 , 4. , 5.6 , 6.7 , 7. , 8.95])
Vectorization is where NumPy comes to your help: it takes advantage of the typed nature of arrays, which results in much faster operations. See [BlogPost] for detail.
I am taking a Data Science course about data analysis in Python. At one point in the course the professor says:
You can chain operations together. For instance, we could have rewritten the query for all Store 1 costs as df.loc['Store 1']['Cost']. This looks pretty reasonable and gets us the result we wanted. But chaining can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause Pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting data, this is not a big deal, though it might be slower than necessary. If you are changing data though, this is an important distinction and can be a source of error.
Later on, he describes chain indexing as:
Generally bad, pandas could return a copy of a view depending upon NumPy
So, he suggests using multi-axis indexing (df.loc['a', '1']).
I'm wondering whether it is always advisable to stay clear of chain indexing, or are there specific use cases for it where it shines?
Also, if it is true that it can return either a copy or a view (depending upon NumPy), what exactly does that depend on, and can I influence it to get the desired outcome?
I've found this answer that states:
When you use df['1']['a'], you are first accessing the series object s = df['1'], and then accessing the series element s['a'], resulting in two __getitem__ calls, both of which are heavily overloaded (handle a lot of scenarios, like slicing, boolean mask indexing, and so on).
...which makes it seem chain indexing is always bad. Thoughts?
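For illustration, here is a small sketch (with made-up data) of the read vs. write distinction the quoted passage is getting at:

import pandas as pd

df = pd.DataFrame({"Cost": [10.0, 20.0, 30.0]}, index=["Store 1", "Store 1", "Store 2"])

# Chained indexing for *reading* works, but it issues two __getitem__ calls:
costs = df.loc["Store 1"]["Cost"]

# Chained *assignment* is where it bites: the first step returns a copy,
# so the write is silently lost (pandas warns about this):
df[df["Cost"] > 15]["Cost"] = 0.0
print(df["Cost"].tolist())        # unchanged: [10.0, 20.0, 30.0]

# Multi-axis indexing is a single, unambiguous __setitem__ on df itself:
df.loc[df["Cost"] > 15, "Cost"] = 0.0
print(df["Cost"].tolist())        # [10.0, 0.0, 0.0]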
I wrote a program using normal Python, and I now think it would be a lot better to use numpy instead of standard lists. The problem is there are a number of things where I'm confused how to use numpy, or whether I can use it at all.
In general, how do np.arrays work? Are they dynamic in size like a C++ vector, or do I have to declare their length and type beforehand like a standard C++ array? In my program I've got a lot of cases where I create a list
ex_list = [] and then cycle through something and append to it with ex_list.append(some_lst). Can I do something like that with a numpy array? What if I knew the size of ex_list, could I declare an empty one and then add to it?
If I can't, let's say I only read from this list afterwards: would it be worth it to convert it to numpy, i.e. is accessing a numpy array faster?
Can I do more complicated operations on each element using a numpy array (not just adding 5 to each, etc.)? Example below.
full_pallete = [(int(1+i*(255/127.5)),0,0) for i in range(0,128)]
full_pallete += [col for col in right_palette if col[1]!=0 or col[2]!=0 or col==(0,0,0)]
In other words, does it make sense to convert to a numpy array and then cycle through it using something other than for loop?
Numpy arrays can be appended to (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html), although in general calling the append function many times in a loop has a heavy performance cost - it is generally better to pre-allocate a large array and then fill it as necessary. This is because the arrays themselves have a fixed size under the hood, but this is hidden from you in Python.
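A minimal sketch of the pre-allocate-and-fill pattern (the size and fill rule are made up):

import numpy as np

n = 1_000_000

# Growing with np.append copies the whole array on every call; avoid this in loops:
# arr = np.array([])
# for i in range(n):
#     arr = np.append(arr, i * 0.5)

# Pre-allocate once, then fill in place:
arr = np.empty(n, dtype=np.float64)
for i in range(n):
    arr[i] = i * 0.5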
Yes, Numpy is well designed for many operations similar to these. In general, however, you don't want to be looping through numpy arrays (or arrays in general in python) if they are very large. By using inbuilt numpy functions, you basically make use of all sorts of compiled speed up benefits. As an example, rather than looping through and checking each element for a condition, you would use numpy.where().
The real reason to use numpy is to benefit from pre-compiled mathematical functions and data processing utilities on large arrays - both those in the core numpy library as well as many other packages that use them.