Is there a way to have custom C functions act over a pandas DF? I know I can wrap a c function in a python function and use that over row wise iteration, for instance, but that seems inefficient. I know pandas is written in c. I would love a simple way of telling pandas "use this c function". This is naiive, but something like this
...
cFunc = get_c_function_some_how()
for i in range(1000):
df = df.use_c_function(cFunc)
use_df(df)
...
My use case is that I do a simple, but somewhat computationally expensive set of computations over and over again, and I would like to make that particular set of computations significantly faster
EDIT: I suppose passing the entirety of the Pandas Dataframe somehow to the C function would be fine, realistically the iteration should probably happen inside C anyway, so If a python wrapped c function needs to be used once then the data is just handed over to C for computation, that seems like a pretty good solution. I personally couldn't find documentation on doing something like that.
There is a way to do it, but I wouldn't describe it as "easy."
Internally, Pandas uses numpy to store data. If you can get the data as a numpy vector, you can pass that to C, and have it operate on the vector.
Getting a numpy vector from a column is easy:
vec = df["foo"].to_numpy()
Next, you need to ensure that the vector is contiguous. You can't assume that it is, because pandas will store data from multiple columns in the same numpy array if the data has compatible types.
vec = np.ascontiguousarray(vec)
Then, you can pass the numpy array to C as described in this answer. This will work for numerical data. If you want to work with strings, that's more complicated.
I recommend reading Pandas Under The Hood if you go this route. It explains many important things, like why the numpy arrays are not contiguous.
Related
I am just learning to use dask and read many threads on this forum related to Dask and for loops. But I am still unclear how to apply those solutions to my problem. I am working with climate data that are functions of (time, depth, location). The 'location' coordinate is a linear index such that each value corresponds to a unique (longitude, latitude). I am showing below a basic skeleton of what I am trying to do, assuming var1 and var2 are two input variables. I want to parallelize over the location parameter 'nxy', as my calculations can proceed simultaneously at different locations.
for loc in range(0,nxy): # nxy = total no. of locations
for it in range(0,ntimes):
out1 = expression1 involving ( var1(loc), var2(it,loc) )
out2 = expression2 involving ( var1(loc), var2(it,loc) )
# <a dozen more output variables>
My questions:
(i) Many examples illustrating the use of 'delayed' show something like "delayed(function)(arg)". In my case, I don't have too many (if any) functions, but lots of expressions. If 'delayed' only operates at the level of functions, should I convert each expression into a function and add a 'delayed' in front?
(ii) Should I wrap the entire for loop shown above inside a function and then call that function using 'delayed'? I tried doing something like this but might not be doing it correctly as I did not get any speed-up compared to without using dask. Here's what I did:
def test_dask(n):
for loc in range(0,n):
# same code as before
return var1 # just returning one variable for now
var1=delayed(tast_dask)(nxy)
var1.compute()
Thanks for your help.
Every delayed task adds about 1ms of overhead. So if your expression is slow (maybe you're calling out to some other expensive function), then yes dask.delayed might be a good fit. If not, then you should probably look elsewhere.
In particular, it looks like you're just iterating through a couple arrays and operating element by element. Please be warned that Python is very slow at this. You might want to not use Dask at all, but instead try one of the following approaches:
Find some clever way to rewrite your computation with Numpy expressions
Use Numba
Also, given the terms your using like lat/lon/depth, it may be that Xarray is a good project for you.
In python, I'd like to minimize hundreds of thousands of scalar valued functions. Is there a faster way than a for-loop over a def of the objective function and a call to scipy.optimize.minimize_scalar? Basically a vectorized version, where I could give a function foo and a 2D numpy array where each row is given as extra data to the objective function.
This is not possible currently in scipy, without using a workaround like multiprecessing, threads, or converting the many one dimensional problems to a giant multidimensional one. There is however a pull request currently open to adress this.
I am trying to use pandas pd.DataFrame.where as follows:
df.where(cond=mask, other=df.applymap(f))
Where f is a user defined function to operate on a single cell. I cannot use other=f as it seems to produce a different result.
So basically I want to evaluate the function f at all cells of the DataFrame which does not satisfy some condition which I am given as the mask.
The above usage using where is not very efficient as it evaluates f immediately for the entire DataFrame df, whereas I only need to evaluate it at some entries of the DataFrame, which can sometimes be very few specific entries compared to the entire DataFrame.
Is there an alternative usage/approach that could be more efficient in solving this general case?
As you correctly stated, df.applymap(f) is evaluated before df.where(). I'm fairly certain that df.where() is a quick function and is not the bottleneck here.
It's more likely that df.applymap(f) is inefficient, and there's usually a faster way of doing f in a vectorized manner. Having said so, if you do believe this is impossible, and f is itself slow, you could modify f to leave the input unchanged wherever your mask is False. This is most likely going to be really slow though, and you'll definitely prefer trying to vectorize f instead.
If you really must do it element-wise, you could use a NumPy array:
result = df.values
for (i,j) in np.where(mask):
result[i,j] = f(result[i,j])
It's critical that you use a NumPy array for this, rather than .iloc or .loc in the dataframe, because indexing a pandas dataframe is slow.
You could compare the speed of this with .applymap; for the same operation, I don't think .applymap is substantially faster (if at all) than simply a for loop, because all pandas does is run a for loop of its own in Python (maybe Cython? But even that only saves on the overhead, and not the function itself). This is different from 'proper' vectorization, because vector operations are implemented in C.
I wrote a program using normal Python, and I now think it would be a lot better to use numpy instead of standard lists. The problem is there are a number of things where I'm confused how to use numpy, or whether I can use it at all.
In general how do np.arrays work? Are they dynamic in size like a C++ vector or do I have declare their length and type beforehand like a standard C++ array? In my program I've got a lot of cases where I create a list
ex_list = [] and then cycle through something and append to it ex_list.append(some_lst). Can I do something like with a numpy array? What if I knew the size of ex_list, could I declare and empty one and then add to it?
If I can't, let's say I only call this list, would it be worth it to convert it to numpy afterwards, i.e. is calling a numpy list faster?
Can I do more complicated operations for each element using a numpy array (not just adding 5 to each etc), example below.
full_pallete = [(int(1+i*(255/127.5)),0,0) for i in range(0,128)]
full_pallete += [col for col in right_palette if col[1]!=0 or col[2]!=0 or col==(0,0,0)]
In other words, does it make sense to convert to a numpy array and then cycle through it using something other than for loop?
Numpy arrays can be appended to (see http://docs.scipy.org/doc/numpy/reference/generated/numpy.append.html), although in general calling the append function many times in a loop has a heavy performance cost - it is generally better to pre-allocate a large array and then fill it as necessary. This is because the arrays themselves do have fixed size under the hood, but this is hidden from you in python.
Yes, Numpy is well designed for many operations similar to these. In general, however, you don't want to be looping through numpy arrays (or arrays in general in python) if they are very large. By using inbuilt numpy functions, you basically make use of all sorts of compiled speed up benefits. As an example, rather than looping through and checking each element for a condition, you would use numpy.where().
The real reason to use numpy is to benefit from pre-compiled mathematical functions and data processing utilities on large arrays - both those in the core numpy library as well as many other packages that use them.
i'm using python + numpy + scipy to do some convolution filtering over a complex-number array.
field = np.zeros((field_size, field_size), dtype=complex)
...
field = scipy.signal.convolve(field, kernel, 'same')
So, when i want to use a complex array in numpy all i need to do is pass the dtype=complex parameter.
For my research i need to implement two other types of complex numbers: dual (i*i=0) and double (i*i=1). It's not a big deal - i just take the python source code for complex numbers and change the multiplication function.
The problem: how do i make a numpy array of those exotic numeric types?
It looks like you are trying to create a new dtype for e.g. dual numbers. It is possible to do this with the following code:
dual_type = np.dtype([("a", np.float), ("b", np.float)])
dual_array = np.zeros((10,), dtype=dual_type)
However this is just a way of storing the data type, and doesn't tell numpy anything about the special algebra which it obeys.
You can partially achieve the desired effect by subclassing numpy.ndarray and overriding the relevant member functions, such as __mul__ for multiply and so on. This should work fine for any python code, but I am fairly sure that any C or fortran-based routines (i.e. most of numpy and scipy) would multiply the numbers directly, rather than calling the __mul__. I suspect that convolve would fall into this basket, therefore it would not respect the rules which you define unless you wrote your own pure python version.
Here's my solution:
from iComplex import SplitComplex as c_split
...
ctype = c_split
constructor = np.vectorize(ctype, otypes=[np.object])
field = constructor(np.zeros((field_size, field_size)))
That is the easy way to create numpy object array.
What about scipy.signal.convolve - it doesn't seem to work with my complex numbers and i had to make my own convolution and it works deadly slow. So now i am looking for ways to speed it up.
Would it work to turn things inside-out? I mean instead of an array as the outer container holding small containers holding a couple floating point values as a complex number, turn that around so that your complex number is the outer container. You'd have two arrays, one of plain floats as the real part, and another array as the imaginary part. The basic super-fast convolver can do its job although you'd have to write code to use it four times, for all combinations of real/imaginary of the two factors.
In color image processing, I have often refactored my code from using arrays of RGB values to three arrays of scalar values, and found a good speed-up due to simpler convolutions and other operations working much faster on arrays of bytes or floats.
YMMV, since locality of the components of the complex (or color) can be important.