I would like to use kernlab's sigest() function from Python to estimate a good range of sigma values for constructing RBF kernels. I am using rpy2, but I can't figure out what to pass as the "na_action" argument.
Recommended syntax in R:
sigest(x, frac = 0.5, scaled = TRUE, na.action = na.omit)
My syntax:
sigest(np.asmatrix(x), frac = 0.5, scaled = True,
       na_action = pandas2ri.pandas.DataFrame.dropna)
x is the data matrix. I also tried
sigest(np.asmatrix(x), frac = 0.5, scaled = True,
       na_action = pd.DataFrame.dropna)
Libraries used: matplotlib, numpy, pandas, plus numpy2ri and pandas2ri:
import matplotlib
import numpy as np
import pandas as pd
import rpy2
import rpy2.robjects as robj
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri
rpy2.robjects.numpy2ri.activate()
lab = importr("kernlab")
# omitting the initialization of x: it reads data from a CSV file into an array of shape (40, 1)
y = lab.sigest(np.asmatrix(x), frac = 0.5, scaled = True, na_action = 'ignore')
None of those pandas methods will work for the na.action argument, which expects an R function such as stats::na.omit; you must therefore reference that R function somehow. Additionally, because the parameter name contains a dot, which is not allowed in Python identifiers, consider adjusting the parameter name manually with rpy2's SignatureTranslatedFunction if importr does not handle it automatically:
from rpy2.robjects.functions import SignatureTranslatedFunction
from rpy2.robjects.packages import importr
lab = importr('kernlab')
lab.sigest = SignatureTranslatedFunction(lab.sigest,
                                         init_prm_translate={'na_action': 'na.action'})
Then pass the needed action to the renamed parameter as a string, so that Python does not try to evaluate it directly. The same na.action pattern works for other R functions such as t.test, cor.test, and lm:
y = lab.sigest(np.asmatrix(x), frac=0.5, scaled=True, na_action="na.omit")
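For comparison, here is a sketch of the same pattern with lm (the data frame some_df with columns y and x is hypothetical, not part of the original question), assuming R resolves the string to the function as in the sigest call above. Because na.action appears in lm()'s formal signature, importr translates it automatically and no manual mapping is needed:

from rpy2.robjects import Formula
from rpy2.robjects.packages import importr

stats = importr('stats')
# na.action is in lm()'s formal signature, so importr already exposes
# it as the keyword na_action; the string is resolved on the R side.
fit = stats.lm(Formula('y ~ x'), data=some_df, na_action='na.omit')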
I am trying to do parameter inference for an ODE against experimental data observed at three time points (2 h, 4 h and 6 h). I set everything up according to the first example:
https://pyabc.readthedocs.io/en/latest/examples/adaptive_distances.html?highlight=PCMAD
But I get an error about parsing a list to numeric:
TypeError: Cannot parse variable Contamination=[array([253.36919232]), array([482.10280333]), array([700.764029])] of type <class 'list'> to numeric.
I think this refers to the output from the deterministic_run() function. How can I convert it to numeric?
Tested code that reproduces the error.
Preamble
import pyabc
from pyabc import (ABCSMC,
                   RV, Distribution,
                   MedianEpsilon,
                   LocalTransition)
from pyabc.visualization import plot_kde_2d, plot_data_callback
import matplotlib.pyplot as plt
import os
import tempfile
import numpy as np
#import scipy as sp
from scipy.integrate import odeint
import math
import seaborn as sns
#pyabc.settings.set_figure_params('pyabc') # for beautified plots
db_path = ("sqlite:///" +
           os.path.join(tempfile.gettempdir(), "test5.db"))
Here we define the ODE model
def ode_model(contamination, t, r, C, d, g):
    Contamination = contamination
    return r*(1 - Contamination/C) - d*math.exp(-g*t)*Contamination
Here we create the input parameters and extract only specific time-points
def deterministic_run(parameters):
    precision = 5000
    tmax = 6
    time_space = np.linspace(0, tmax, precision + 1)
    # initial_contamination is taken from the enclosing scope
    sim = odeint(ode_model, initial_contamination, time_space,
                 args=(parameters["r"], parameters["C"],
                       parameters["d"], parameters["g"]))
    num_at_2 = sim[int(precision*2/tmax)]
    num_at_4 = sim[int(precision*4/tmax)]
    num_at_6 = sim[int(precision*6/tmax)]
    return {"Contamination": [num_at_2, num_at_4, num_at_6]}
Parameter priors
parameter_prior = Distribution(r=RV("uniform", 0.0, 200.0),
C=RV("uniform", 1000.0, 6000.0),
d=RV("uniform", 10.0, 1000.0),
g=RV("uniform", 2.0, 200.0))
parameter_prior.get_parameter_names()
Distance function and set-up
distance = pyabc.PNormDistance(p=2)
abc = pyabc.ABCSMC(models=deterministic_run,
                   parameter_priors=parameter_prior,
                   distance_function=distance)
Observed data for comparison and initial conditions for ODE
initial_contamination=1200.0
measurement_data = np.array([134.0,202.0,294.0]) #Mean observed data at 2h, 4h and 6h.
s=np.array([93.70165,86.13942,162.11107]) #STD of observation
precision=5000
measurement_times = np.array([2,4,6])
And we define where to store the results:
history = abc.new(db_path, {"Contamination": measurement_data,"sd": s})
We run the ABC until the stopping criterion is met:
history = abc.run(max_nr_populations=7)
This gives the error:
TypeError: Cannot parse variable Contamination=[array([253.36919232]), array([482.10280333]), array([700.764029])] of type <class 'list'> to numeric.
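A hedged guess at a fix (a sketch of my own, not from the pyabc documentation): odeint is called with a scalar initial condition, so sim has shape (precision + 1, 1) and each num_at_* above is a length-1 array; wrapping them in a list produces exactly the list of arrays the parser rejects. Since the observed data is already a flat numpy array, returning the summary statistic in the same form should avoid the TypeError, e.g. by replacing the return statement in deterministic_run with:

    # Flatten the three length-1 arrays into a single 1-D numeric array
    return {"Contamination": np.array([num_at_2, num_at_4, num_at_6]).ravel()}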
I have data created and preprocessed in Python that I would like to import into R and use for a k-fold cross-validated LASSO fit with glmnet. I want control over which observations are used in each fold, so I want to use caret to do this.
However, I have found that caret interprets my data as a classification instead of a regression problem, and promptly fails. Here is what I hope is a reproducible example:
import numpy as np
import pandas as pd
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import numpy2ri
from rpy2.robjects.conversion import localconverter
pandas2ri.activate()
numpy2ri.activate()
# Import essential R packages
glmnet = importr('glmnet')
caret = importr('caret')
base = importr('base')
# Define X and y input
dummy_x = pd.DataFrame(np.random.rand(10000, 5), columns=('a', 'b', 'c', 'd', 'e'))
dummy_y = np.random.rand(10000)
# Convert pandas DataFrame to R data.frame
with localconverter(robjects.default_converter + pandas2ri.converter):
    dummy_x_R = robjects.conversion.py2rpy(dummy_x)
# Use caret to perform the fit using default settings
caret_test = caret.train(**{'x': dummy_x_R, 'y': dummy_y, 'method': 'glmnet'})
rpy2 fails, giving this cryptic error message from R:
RRuntimeError: Error: Metric RMSE not applicable for classification models
What could be causing this? According to this previous question, it may be the case that caret is assuming that at least one of my variables is an integer type, and so defaults to thinking this is a classification instead of a regression problem.
However, I have checked both X and y using typeof, and they are clearly doubles:
base.sapply(dummy_x_R, 'typeof')
>>> array(['double', 'double', 'double', 'double', 'double'], dtype='<U6')
base.sapply(dummy_y, 'typeof')
>>> array(['double', 'double', 'double', ..., 'double', 'double', 'double'],
dtype='<U6')
Why am I getting this error? All the default settings to train assume a regression model, so why does caret assume a classification model when used in this way?
In situations like this, the first step is to identify whether the unexpected outcome originates on the Python/rpy2 side or on the R side.
The conversion from pandas or numpy to R appears to work as expected, at least for array types:
>>> [x.typeof for x in dummy_x_R]
[<RTYPES.REALSXP: 14>,
<RTYPES.REALSXP: 14>,
<RTYPES.REALSXP: 14>,
<RTYPES.REALSXP: 14>,
<RTYPES.REALSXP: 14>]
I am guessing that this is what you might have done for dummy_y:
>>> from rpy2.robjects import numpy2ri
>>> with localconverter(robjects.default_converter + numpy2ri.converter):
...     dummy_y_R = robjects.conversion.py2rpy(dummy_y)
>>> dummy_y_R.typeof
<RTYPES.REALSXP: 14>
However, a rather subtle conversion detail is at the root of the issue. dummy_y_R has a "shape" (the dim attribute in R), while caret expects a shape-less R array (a "vector" in R lingo) in order to perform a regression. One can force dummy_y to be an R vector with:
caret_test = caret.train(**{'x': dummy_x_R,
                            'y': robjects.FloatVector(dummy_y),
                            'method': 'glmnet'})
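As a quick check (a sketch following the session above), you can query the dim attribute from R directly; the converted numpy array carries one, while a FloatVector does not:

r_dim = robjects.r('dim')
print(r_dim(dummy_y_R))                      # a length-1 dim, i.e. a shape
print(r_dim(robjects.FloatVector(dummy_y)))  # NULL: no dim attribute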
To use R methods, be sure all inputs are R objects. Therefore, consider converting the dummy_y numpy array to an R vector, which you can do with base.as_double:
...
dummy_y_R = base.as_double(dummy_y)
caret.train(x=dummy_x_R, y=dummy_y_R, method='glmnet')
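Both approaches work for the same reason: they produce a plain R numeric vector without a dim attribute (as.double drops attributes such as dim), which is what caret needs in order to treat the task as a regression.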
A now-closed discussion shows how to use the R dtw package in Python. This is a little clumsy, but the R dtw package is great and better than the currently available Python dtw implementations. Unfortunately, windowing functions like the Sakoe-Chiba band do not work when trying to specify a "window.size"; there appears to be an issue with the argument mapping. Note that "." in argument names is supposed to be replaced with "_" when using rpy2, but following this convention the argument is not picked up for some reason.
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True, window_type='sakoechiba',
                  window_size=5)
>>> RRuntimeError: Error in window.function(row(wm), col(wm), query.size = n, reference.size = m, :
  argument "window.size" is missing, with no default
You can see that the error states "window.size" is missing, despite "window_size" clearly being specified in the rpy2 fashion.
Just a note from the future: this question is now superseded by the feature-equivalent dtw-python package (also found on PyPI). The rpy2-R-dtw bridge should no longer be necessary.
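For reference, a rough dtw-python equivalent of the call below (a sketch from memory; check the package documentation for the exact signature, in particular how the band width is passed through window_args):

import numpy as np
from dtw import dtw

idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.random.uniform(size=100)/10

# The Sakoe-Chiba band is selected by name; its half-width goes through
# the window_args dict rather than a window_size keyword.
alignment = dtw(query, template, keep_internals=True,
                window_type="sakoechiba", window_args={"window_size": 10})
print(alignment.distance)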
Answering my own question in case anyone ever has the same issue. The problem is the argument mapping and the R three-dot ellipsis '...': because window.size is consumed by dtw's ... argument rather than appearing in its formal signature, rpy2's automatic name translation never sees it. This can be fixed by specifying the mapping manually:
from rpy2.robjects.functions import SignatureTranslatedFunction
R.dtw = SignatureTranslatedFunction(R.dtw,
                                    init_prm_translate={'window_size': 'window.size'})
With this mapping in place, the window_size argument is used correctly:
import numpy as np
import rpy2.robjects.numpy2ri
from rpy2.robjects.packages import importr
from rpy2.robjects.functions import SignatureTranslatedFunction
rpy2.robjects.numpy2ri.activate()
# Set up our R namespaces
R = rpy2.robjects.r
DTW = importr('dtw')
R.dtw = SignatureTranslatedFunction(R.dtw,
                                    init_prm_translate={'window_size': 'window.size'})
# Generate our data
idx = np.linspace(0, 2*np.pi, 100)
template = np.cos(idx)
query = np.sin(idx) + np.array(R.runif(100))/10
# Calculate the alignment vector and corresponding distance
alignment = R.dtw(query, template, keep=True, window_type='sakoechiba',
                  window_size=10)
dist = alignment.rx('distance')[0][0]
print(dist)
>>> 117.348292359
I am using rpy2 in a Jupyter notebook to fit a von Mises distribution; the code is:
%load_ext rpy2.ipython
%R require(movMF)
%%R -i dir_data,n_vM_dir -o theta,alpha
result = movMF(dir_data, n_vM_dir, nruns = 10)
theta = result$theta
alpha = result$alpha
Input: dir_data, n_vM_dir
Output: theta, alpha
It takes the dir_data and n_vM_dir variables from Python and passes them into R. After the fitting, theta and alpha are passed back to Python, so I can use them in later analysis.
Now, I want to refactor the code into a Python function, so I can reuse it, how can I do it?
So far I can do this:
import rpy2.robjects as robjects
# pass dir_data, n_vM_dir into R
robjects.r('''
result = movMF(dir_data, n_vM_dir, nruns = 10)
theta = result$theta
alpha = result$alpha
''')
theta = robjects.r('theta')
alpha = robjects.r('alpha')
# Return theta, alpha
I can access the data through robjects.r; the main problem is that I don't know how to pass data stored in Python to R (dir_data and n_vM_dir in this example).
I've read the docs in http://rpy2.readthedocs.io/en/version_2.8.x/introduction.html
I found that variables can be created with:
from rpy2.robjects import FloatVector
ctl = FloatVector([4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14])
But this looks very cumbersome compared to the R magic in the Jupyter notebook.
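For what it's worth, a minimal sketch of one way to wrap this into a reusable function (my own sketch, not from the original post): assign the Python objects into R's global environment, run the same R code, and read the results back. It assumes movMF is installed in R and that dir_data is a numpy array:

import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import numpy2ri

numpy2ri.activate()

def fit_movMF(dir_data, n_vM_dir, nruns=10):
    # Push the Python values into R's global environment
    robjects.globalenv['dir_data'] = dir_data
    robjects.globalenv['n_vM_dir'] = n_vM_dir
    robjects.globalenv['nruns'] = nruns
    robjects.r('''
        library(movMF)
        result <- movMF(dir_data, n_vM_dir, nruns = nruns)
        theta <- result$theta
        alpha <- result$alpha
    ''')
    # Read the fitted parameters back as numpy arrays
    theta = np.asarray(robjects.r('theta'))
    alpha = np.asarray(robjects.r('alpha'))
    return theta, alpha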
from numpy import *
from pylab import *
from scipy import *
from scipy.signal import *
from scipy.stats import *
testimg = imread('path')
hist = hist(testimg.flatten(), 256, range=[0.0,1.0])[0]
hist = hist + 0.000001
prob = hist/sum(hist)
entropia = -1.0*sum(prob*log(prob))#here is error
print 'Entropia: ', entropia
I have this code and I do not know what the problem could be. Thanks for any help.
This is an example of why you should never use from module import *: you lose sight of where functions come from, and with multiple from module import * calls, one module's namespace may clobber another's. Indeed, based on the error message, that appears to be what is happening here.
Notice that when log refers to numpy.log, then -1.0*sum(prob*np.log(prob)) can be computed without error:
In [43]: -1.0*sum(prob*np.log(prob))
Out[43]: 4.4058820963782122
but when log refers to math.log, then a TypeError is raised:
In [44]: -1.0*sum(prob*math.log(prob))
TypeError: only length-1 arrays can be converted to Python scalars
The fix is to use explicit module imports and explicit references to functions from the module's namespace:
import numpy as np
import matplotlib.pyplot as plt
testimg = np.random.random((10,10))
hist = plt.hist(testimg.flatten(), 256, range=[0.0,1.0])[0]
hist = hist + 0.000001
prob = hist/sum(hist)
# entropia = -1.0*sum(prob*np.log(prob))
entropia = -1.0*(prob*np.log(prob)).sum()
print 'Entropia: ', entropia
# prints something like: Entropia: 4.33996609845
The code you posted does not produce the error on its own, but somewhere in your actual code, log must be getting bound to math.log instead of numpy.log. Using import module and referencing functions as module.function will help you avoid this kind of error in the future.