Efficiently executing complex Python code in R

I'm looking for a fast and flexible way to execute Python code from R.
The problem is that I want to run a particular method from topological data analysis, called persistent homology, which you don't really need to know in detail to understand the problem I'm dealing with.
The thing is that computing persistent homology is not the easiest thing to do on large data sets, as it requires both a lot of memory and a lot of computation time.
The implementation in the R package TDA is, however, very inconvenient: it crashes starting at around 800 data points and doesn't really allow me to use approximation algorithms.
In contrast, the Python package Ripser lets me do the computation on a few thousand points with ease. Furthermore, it allows one to provide a parameter that approximates the result for even larger data sets, and it can also store the output I want. In summary, it is much more convenient to compute persistent homology with this package.
However, since everything else I'm working on is in R, it would also be a lot more convenient for me to execute the code from the Ripser package in R as well. An illustrative example of this is as follows:
# Import R libraries
library("reticulate") # Conduct Python code in R
library("ggplot2") # plotting
# Import Python packages
ripser <- import("ripser") # Python package for computing persistent homology
persim <- import("persim") # Python package for many tools used in analyzing Persistence Diagrams
# Generate data on circle
set.seed(42)
npoints <- 200
theta <- 2 * pi * runif(npoints)
noise <- cbind(rnorm(npoints, sd=0.1), rnorm(npoints, sd=0.1))
X <- data.frame(x=cos(theta) + noise[,1], y=sin(theta) + noise[,2])
# Compute persistent homology on full data
PHfull <- ripser$ripser(X, do_cocycles=TRUE) # slow if X is large
# Plot diagrams
persim$plot_diagrams(PHfull$dgms) # crashes
Now I have two problems when using this code:
The persistent homology computation performed by the ripser function works perfectly fine in this example. However, when I increase the number of data points in X to, say, npoints ~ 2000, the computation takes ages, compared to around 30 seconds when I perform the same computation directly in Python. I don't really know what happens behind the scenes that causes this huge difference in computation time. Is it that, instead of converting my arguments to the corresponding Python types and running the code in Python, the Python code is somehow being translated to R code? I am looking for an approach that combines this flexibility and type conversion with the speed of native Python.
In Python, the equivalent of the last line would plot an image based on the result of my persistent homology computation. However, executing this line causes R to crash. Is there a way to display in R the image that would normally be produced in Python?
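For reference, here is a minimal sketch of what the pure-Python equivalent of the example above looks like (assuming the ripser, persim, numpy and matplotlib packages are installed); writing the figure to a file via a non-interactive backend is one common way to sidestep display problems, though whether that helps when called through reticulate is exactly the open question here:
import numpy as np
import matplotlib
matplotlib.use("Agg")             # non-interactive backend: render to a file only
import matplotlib.pyplot as plt
from ripser import ripser
from persim import plot_diagrams
# Noisy circle, mirroring the R example above.
rng = np.random.default_rng(42)
theta = 2 * np.pi * rng.random(200)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(scale=0.1, size=(200, 2))
PHfull = ripser(X, do_cocycles=True)   # returns a dict with 'dgms', 'cocycles', ...
plot_diagrams(PHfull["dgms"])
plt.savefig("diagrams.png")            # the saved file can then be displayed from R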

Related

Is there a GPU/parallelized python implementation of k-NN which does not rebuild the KDTree upon the addition or subtraction of one point?

I have a dynamic point cloud in 3D, and I would like to use nanoflann to dynamically add/subtract points in between queries, without rebuilding the tree (as seen here):
https://github.com/jlblancoc/nanoflann/blob/master/examples/dynamic_pointcloud_example.cpp
I have found a python wrapper for nanoflann as well (awesome!):
https://github.com/u1234x1234/pynanoflann
This is great! However, speed is extremely important for my application, so I will almost certainly need to parallelize this k-NN implementation. Is there an existing Python implementation or wrapper of such a dynamic k-NN written with OpenCL or CUDA? I wanted to check whether one exists before writing my own. This "dynamic" is different from wanting to be able to specify a new k with each query: I don't mind if it is a single KDTree with a fixed k. I simply need to be able to add or remove points between queries without rebuilding the KDTree.
Thank you in advance!

Need something similar to biglm but for mixed effect models

I am currently working with a dataset of approximately 4 million data points. I am using R in RStudio on a MacBook Pro (32 GB RAM, 2.2 GHz Intel Core i7, 155 GB available on the hard drive).
My goal is to fit a non-linear mixed effect regression to the data. The data has two nested random effects, and the model needs varying slopes as well as intercepts.
My code for this model is:
model <- lmer(DV ~ I(IVr^2) + IV + (IV | group/episode), data = data, REML=FALSE)
However, when running the model, the process uses up ~42 GB of RAM before crashing. The error is:
Vector memory exhausted (limit reached?)
I want to configure R in some way so that it runs more slowly but can spill over into my available hard-drive space, so that it can handle this model. The closest solution I have found is the biglm package in R; however, I cannot find an equivalent for lmer(). I also can't find much on manipulating swap space on a Mac. Any solutions to the problem are welcome.
Alternatively, I speak Python, so a solution using Python instead would be great (i.e. a module that can handle polynomial mixed effect models and a module to sort out my memory issue).

multithreading optimization in python

I am using lmfit to do some optimization, and it is tediously slow. I have a large image and I am basically running a least squares minimization at every pixel. It seems to me that this is ideal for multithreading or other kinds of optimization, as it is painfully slow at the moment.
So, my optimization code is as follows:
I define the objective function as:
import numpy as np

def objective(params, x, data):
    # Mono-exponential recovery model; return the residual vector.
    s0 = params['S0']
    t1 = params['T1']
    m = s0 * (1.0 - np.exp(-x / t1))
    return m - data
So, I am trying to minimise the difference between the model and the observation. I think lmfit minimizes the sum of squares of this residual array, but I am not sure and need to check.
The main loop is as follows. The parameters I need to estimate are initialized with some starting values:
from lmfit import Minimizer, Parameters

p = Parameters()
p.add('S0', value=1.0)
p.add('T1', value=3.0)

final_data = <numpy image>  # 2D array: one row of measurements per pixel
tis = np.asarray([1.0, 2.0, 3.0])

s0_data = np.zeros(final_data.shape[0])
t1_data = np.zeros(final_data.shape[0])

for i in range(final_data.shape[0]):
    print("Processing pixel:", i)
    minner = Minimizer(objective, params=p,
                       fcn_args=(tis, final_data[i, :]),
                       nan_policy='propagate')
    result = minner.minimize(method='least_squares')
    s0_data[i] = result.params['S0'].value
    t1_data[i] = result.params['T1'].value
This works fine, but it is tediously slow. I was trying to figure out how to do multithreading in Python and got thoroughly confused by posts about GIL locking and claims that multithreading in Python does not really exist.
My questions are:
1: Can this be scaled easily with multithreading in Python?
2: Are there any other optimizations that I can try?
As the comments suggest, multithreading here won't be very fruitful. Basically, any single fit with lmfit or scipy ends up in a single-threaded Fortran routine that calls your Python objective function repeatedly and uses those results to generate the next step. Trying to use multithreading means that the Python objective function and parameters have to be managed among the threads; the Fortran code is not geared for this, and the calculation is not really I/O bound anyway.
Multiprocessing, in order to use multiple cores, is a better approach. But trying to use multiprocessing for a single fit is not as trivial as it sounds, as the objective function and parameters have to be pickleable. For your simple example this should work, but the approach can break down as the problem uses more complex objects. The dill package can help with that.
But also: there is an even easier solution for your problem, as it is naturally parallel. Just do a separate fit per pixel, each in its own process. You can use multiprocessing for that, or you could even break the problem into N chunks and run N separate scripts, each fitting 1/N of the pixels.
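A minimal sketch of that per-pixel approach with multiprocessing.Pool, assuming the objective(), the Parameters object p, tis, and final_data defined in the question; fit_pixel and the worker count are illustrative choices, not part of lmfit:
import numpy as np
from multiprocessing import Pool
from lmfit import Minimizer

def fit_pixel(row):
    # One independent fit per pixel row; arguments and results are picklable.
    minner = Minimizer(objective, params=p, fcn_args=(tis, row),
                       nan_policy='propagate')
    result = minner.minimize(method='least_squares')
    return result.params['S0'].value, result.params['T1'].value

if __name__ == '__main__':
    rows = [final_data[i, :] for i in range(final_data.shape[0])]
    with Pool(processes=4) as pool:   # 4 workers is an arbitrary choice
        fits = pool.map(fit_pixel, rows)
    s0_data, t1_data = (np.asarray(v) for v in zip(*fits))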

astropy.io fits efficient element access of a large table

I am trying to extract data from a binary table in a FITS file using Python and astropy.io. The table contains an events array with over 2 million events. What I want to do is store the TIME values of certain events in an array, so I can then do analysis on that array. The problem I have is that, whereas in Fortran (using FITSIO) the same operation takes maybe a couple of seconds on a much slower processor, the exact same operation in Python using astropy.io takes several minutes. I would like to know where exactly the bottleneck is, and whether there is a more efficient way to access the individual elements in order to determine whether or not to store each time value in the new array. Here is the code I have so far:
from astropy.io import fits

minenergy = 0.3
maxenergy = 0.4
xcen = 20000
ycen = 20000
radius = 50

datafile = fits.open('datafile.fits')
events = datafile['EVENTS'].data
datafile.close()

times = []
for i in range(len(events)):
    energy = events['PI'][i]
    if energy < maxenergy * 1000:
        if energy > minenergy * 1000:
            x = events['X'][i]
            y = events['Y'][i]
            radius2 = (x - xcen) * (x - xcen) + (y - ycen) * (y - ycen)
            if radius2 <= radius * radius:
                times.append(events['TIME'][i])
print(times)
Any help would be appreciated. I am an ok programmer in other languages, but I have not really had to worry about efficiency in Python before. The reason I have chosen to do this in Python now is that I was using fortran with both FITSIO and PGPLOT, as well as some routines from Numerical Recipes, but the newish fortran compiler I have on this machine cannot be persuaded to produce a properly working program (there are some issues of 32- vs. 64-bit, etc.). Python seems to have all the functionality I need (FITS I/O, plotting, etc), but if it takes forever to access the individual elements in a list, I will have to find another solution.
Thanks very much.
You need to do this using numpy vector operations. Without special tools like numba, doing large loops like you've done will always be slow in Python because it is an interpreted language. Your program should look more like:
energy = events['PI'] / 1000.
e_ok = (energy > minenergy) & (energy < maxenergy)
rad2 = (events['X'][e_ok] - xcen)**2 + (events['Y'][e_ok] - ycen)**2
r_ok = rad2 <= radius**2
times = events['TIME'][e_ok][r_ok]
This should have performance comparable to Fortran. You can also filter the entire event table, for instance:
events_filt = events[e_ok][r_ok]
times = events_filt['TIME']

Python Multiprocessing/EM

I wrote a machine learning Expectation Maximization algorithm in Python, basically an implementation of IBM Model 1 for doing machine translation (here is my GitHub if you want to look at the code), and it works, but reeeaaaaallly sloowwwlly. I'm taking a class in parallel computing now and I was wondering if I could use Python multiprocessing to reach convergence faster. Can anyone give me any pointers or tips? I don't even know where to start.
EDIT: I was reading around and found this paper on using EM with MapReduce to do parallelization -- maybe this is a better idea?
Most of your problem is that Python is really slow. Remember, your code is executing in an interpreter. When you write code (such as line 82) that performs a numerical computation one element at a time, you pay for that one computation - and all the overhead of the Python interpreter.
The first thing you will want to do is vectorize your code with numpy. Unlike your normal Python code, numpy calls out to precompiled, efficient binary code. The more work you can push into numpy, the less time you will waste in the interpreter.
Once you have vectorized your code, you can then start profiling it if it's still too slow. You should be able to find a lot of simple examples on how to vectorize Python, and some of the alternative options.
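To make the idea concrete, here is a minimal, self-contained illustration of the loop-vs-vectorized pattern; the arrays are made up and are not taken from the linked repository:
import numpy as np

# Made-up count data, just to illustrate the pattern.
counts = np.random.rand(1_000_000)
totals = np.random.rand(1_000_000) + 1.0

# Loop version: one division per interpreter iteration (slow).
probs_loop = np.empty_like(counts)
for i in range(len(counts)):
    probs_loop[i] = counts[i] / totals[i]

# Vectorized version: the same arithmetic in a single compiled numpy call.
probs_vec = counts / totals

assert np.allclose(probs_loop, probs_vec)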
EDIT: Let me clarify that parallelizing inherently slow code is mostly pointless. First, there is the issue that parallelizing slow code gives the false impression that you have made an improvement. The "scaling up" of parallel code should always be measured against the fastest possible single-threaded version of the same code (within reason; no need to write everything in assembly before starting any parallel code). For example, consider a lock under contention: the more threads fighting for the lock, the slower the code will run, and you will get no (or negative) performance gains. One way to reduce contention for the lock is to simply slow down the code competing for it. This makes it appear as if there is no overhead from lock contention, when in actuality you have gained nothing, because the fastest single-threaded version of your code will outperform your parallel code.
Also, Python really isn't a great language to learn how to write parallel code in. Python has the GIL, which essentially forces all multithreaded code in Python to run as if there were but one CPU core. This means bizarre hacks (such as the one you linked) must be used, which have their own additional drawbacks and issues (there are times when such tricks are needed/used, but they shouldn't be the default for running code on a single machine). Don't expect what you learn writing parallel Python code to carry over to other languages or help you with your course.
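A small, self-contained way to see the effect of the GIL on CPU-bound work (the loop sizes are arbitrary and timings will vary by machine): threads give essentially no speedup here, while processes do.
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def busy(n):
    # Pure-Python CPU-bound work; a thread running this holds the GIL throughout.
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == "__main__":
    work = [5_000_000] * 4
    for pool_cls in (ThreadPoolExecutor, ProcessPoolExecutor):
        start = time.perf_counter()
        with pool_cls(max_workers=4) as pool:
            list(pool.map(busy, work))
        print(pool_cls.__name__, round(time.perf_counter() - start, 2), "s")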
I think you will have some good success, depending on where your bottleneck is. One caveat: when I do code optimization I always like to profile the code, even informally, to get an idea of where the bottlenecks are. This helps identify where the time is being spent, i.e. file I/O, network latency, resource contention, not enough CPU cycles, etc.
For others who may not be familiar with the Expectation Maximization algorithm, a very nice introduction is Motion Segmentation using EM - a short tutorial, by Yair Weiss. Let us assume we have M data points and N classes/models.
In the EM algorithm there are two steps: computing the distance between data points and models, and updating the model weights using weighted least squares.
Step 1 - Expectation stage
for data_point in M:
    for current_model in N:
        compute distance or residual between data_point and current_model
Step 2 - Maximization stage
for each model, compute weighted least squares solving for the model parameters
This requires solving N weighted least squares problems, where the size of each depends on the number of parameters in the model being solved for.
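As a rough illustration of what one of those N weighted least squares solves can look like (made-up design matrix, observations, and membership weights; not code from the tutorial):
import numpy as np

# One model's M-step: minimize || sqrt(w) * (A @ x - b) || over the parameters x.
A = np.random.rand(100, 3)   # design matrix: 100 data points, 3 model parameters
b = np.random.rand(100)      # observations
w = np.random.rand(100)      # per-point membership weights from the E step

sw = np.sqrt(w)
x, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)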
Your bottleneck may be in computing the residuals or distances between the data points and the models in stage 1, the E step. In this stage the computations are all independent. I would consider the first stage embarrassingly parallel and quite amenable to parallel computation using a parallel map/reduce or some other tools in Python. I have had good success using IPython for such tasks, but there are other good Python packages as well. A rough sketch of that parallel E step is below.
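A minimal sketch of the embarrassingly parallel E step with multiprocessing.Pool; the data points, models, and residual function here are made up for illustration and are not IBM Model 1 quantities:
import numpy as np
from multiprocessing import Pool

# Made-up data: M data points in 2-D and N model centres.
data_points = np.random.rand(100_000, 2)
models = np.random.rand(5, 2)

def residuals_for_point(point):
    # Distance from one data point to every model; each call is independent.
    return np.linalg.norm(models - point, axis=1)

if __name__ == '__main__':
    with Pool() as pool:
        # E step: map the independent per-point computation across worker processes.
        dists = np.array(pool.map(residuals_for_point, data_points, chunksize=1000))
    # dists[i, j] is the residual of data point i against model j.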
