I am trying to run DEseq2 from Python using rpy2.
How should I pass the design matrix?
My script is as follows:
from numpy import *
from numpy.random import multinomial, random
from rpy2 import robjects
import rpy2.robjects.numpy2ri
robjects.numpy2ri.activate()
from rpy2.robjects.packages import importr
deseq = importr('DESeq2')
# Generate some data. 1000 genes, 10 samples
n = 1000
probabilities = random(n)
probabilities /= sum(probabilities)
data = zeros((n,10), int)
for i in range(10):
data[:,i] = multinomial(1000000, probabilities)
# Make the data frame
d = {}
categories = ('1','2') * 5
d["key_1"] = robjects.IntVector(categories)
dataframe = robjects.DataFrame(d)
# Create the design matrix, and run DESeqDataSetFromMatrix
design = "~ key_1" # <--- I guess this is wrong
dds = deseq.DESeqDataSetFromMatrix(countData=data, colData=dataframe,design=design)
The error I am getting is
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rpy2-2.8.5-py3.6-macosx-10.11-x86_64.egg/rpy2/rinterface/__init__.py:186: RRuntimeWarning: Error: $ operator is invalid for atomic vectors
warnings.warn(x, RRuntimeWarning)
Traceback (most recent call last):
File "testrpy.py", line 23, in <module>
dds = deseq.DESeqDataSetFromMatrix(countData=data, colData=dataf,design=design)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rpy2-2.8.5-py3.6-macosx-10.11-x86_64.egg/rpy2/robjects/functions.py", line 178, in __call__
return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/rpy2-2.8.5-py3.6-macosx-10.11-x86_64.egg/rpy2/robjects/functions.py", line 106, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
rpy2.rinterface.RRuntimeError: Error: $ operator is invalid for atomic vectors
My guess is that the design argument is not correct.
Does anybody have an example of running DEseq via rpy2?
Thanks.
Ah ! You were almost there:
# Create the design matrix, and run DESeqDataSetFromMatrix
design = "~ key_1" # <--- I guess this is wrong
design is a string, but I guess that it should be a formula. Formulae are language objects in R.
Try with:
from rpy2.robjects import Formula
design = Formula("~ key_1")
Related
I am tackling a bayesian inference problem and am having trouble using a pymc3 sampler provided by pypesto on my windows laptop. To make sure I can run with the sampler I create a simple dummy objective to use.
I install create a conda (I tried both 3.7 & 3.8) environment and install the pymc3 and theano modules using pip3/pip. I've tried several different versions of both pymc3/theano and managed to import them succesfully. However, there is an error message I cannot figure out how to go around. I have tried looking online for a solution but was not able to find it either. I currently have the latest versions of pymc3 and theano installed (3.11.0 and 1.0.5 respectively). This is the final line of the message
theano.graph.fg.MissingInputError: Input 0 of the graph (indices start from 0), used to compute sigmoid(x2_interval__), was not provided and not given a value. Use the Theano flag exception_verbosity='high', for more information on this error.
Here is the full message:
Sampling 1 chain for 1_000 tune and 100 draw iterations (1_000 + 100 draws total) took 7 seconds.
Traceback (most recent call last):
File "samplingPymc3.py", line 70, in <module>
result2 = sample.sample(problem1, 100, sampler2, x0=np.array([0,0]))
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\pypesto\sample\sample.py", line 68, in sample
sampler.sample(n_samples=n_samples)
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\pypesto\sample\pymc3.py", line 102, in sample
**self.options)
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\pymc3\sampling.py", line 637, in sample
idata = arviz.from_pymc3(trace, **ikwargs)
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\arviz\data\io_pymc3.py", line 559, in from_pymc3
density_dist_obs=density_dist_obs,
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\arviz\data\io_pymc3.py", line 163, in __init__
self.observations, self.multi_observations = self.find_observations()
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\arviz\data\io_pymc3.py", line 176, in find_observations
multi_observations[key] = val.eval() if hasattr(val, "eval") else val
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\theano\graph\basic.py", line 554, in eval
self._fn_cache[inputs] = theano.function(inputs, self)
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\theano\compile\function\__init__.py", line 350, in function
output_keys=output_keys,
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\theano\compile\function\pfunc.py", line 532, in pfunc
output_keys=output_keys,
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\theano\compile\function\types.py", line 1978, in orig_function
name=name,
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\theano\compile\function\types.py", line 1584, in __init__
fgraph, additional_outputs = std_fgraph(inputs, outputs, accept_inplace)
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\theano\compile\function\types.py", line 188, in std_fgraph
fgraph = FunctionGraph(orig_inputs, orig_outputs, update_mapping=update_mapping)
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\theano\graph\fg.py", line 162, in __init__
self.import_var(output, reason="init")
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\theano\graph\fg.py", line 330, in import_var
self.import_node(var.owner, reason=reason)
File "C:\Users\germa\anaconda3\envs\sampling\lib\site-packages\theano\graph\fg.py", line 383, in import_node
raise MissingInputError(error_msg, variable=var)
theano.graph.fg.MissingInputError: Input 0 of the graph (indices start from 0), used to compute sigmoid(x2_interval__), was not provided and not given a value. Use the Theano flag exception_verbosity='high', for more information on this error.
I read somewhere that the issue may lie with the version of arviz used but that does not appear to be the issue in my case.
I wanted to include the script I am running. Here is the code for the script:
import numpy as np
import scipy as sp
import scipy.optimize as so
from scipy.stats import multivariate_normal
import pypesto
import pypesto.sample as sample
from pypesto import Objective
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([2.0, 1.0])
x_init = np.array([3.4302, 2.915])
x_true = np.array([1.0, 1.0])
temp = lambda x: A.dot(x) - b
f = lambda x: .5 * np.linalg.norm(temp(x))
A_t = A.transpose()
K = np.dot(A_t, A)
df = lambda x: K.dot(x) - A_t.dot(b)
def obj1(x):
# f_val = f(x)
# grad = df(x)
return (f(x), df(x))
objfun = lambda x: obj1(x)
dim_full = 2
lb = -10 * np.ones((dim_full, 1))
ub = 10 * np.ones((dim_full, 1))
x_names = ['x1', 'x2']
# step_fcn = pymc3.step_methods.hmc.hmc.HamiltonianMC
objective = pypesto.Objective(fun=objfun, grad=True, hess=False)
problem1 = pypesto.Problem(objective=objective, lb=lb, ub=ub, x_names=x_names)
sampler = sample.AdaptiveMetropolisSampler()
print('function val: ', objfun(x_init))
sampler2 = sample.Pymc3Sampler()
result2 = sample.sample(problem1, 100, sampler2, x0=np.array([0, 0]))
print('Done sampling!')
Thank you in advance for any help!
pymc3 support of pypesto is at the moment limited, as it was implemented at a time when theano was discontinued in favor of aesara in pymc3. Thus, pypesto only supports specific version of the involved tools, specifically
arviz >= 0.8.1, < 0.9.0
theano >= 1.0.4
packaging >= 20.0
pymc3 >= 3.8, < 3.9.2
(see https://github.com/ICB-DCM/pyPESTO/blob/main/setup.cfg#L111).
The switch to full aesara and later pymc3 version support is underway, but not out yet.
I am doing a quantile regression on the engel dataset with rpy2 (2.7.6):
import statsmodels as sm
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
pandas2ri.activate()
quantreg = importr('quantreg')
data = sm.datasets.engel.load_pandas().data
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
However this generates the following error:
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
Traceback (most recent call last):
File "<ipython-input-22-02ee1015737c>", line 1, in <module>
quantreg.rq('foodexp ~ income', data=data, tau=0.5)
File "C:\Anaconda\lib\site-packages\rpy2\robjects\functions.py", line 178, in __call__
return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
File "C:\Anaconda\lib\site-packages\rpy2\robjects\functions.py", line 106, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
RRuntimeError: Error in y - x %*% z$coef : non-conformable arrays
From what I understand, non-conformable arrays in this case would mean there are some missing values or the 'arrays' being used are different sizes. I can confirm that this is NOT the case:
data.count()
Out[26]:
income 235
foodexp 235
dtype: int64
data.shape
Out[27]: (235, 2)
What else could this error mean? Is it possible that the conversion from DataFrame to data.frame in rpy2 is not working correctly or maybe I'm missing something here? Can anyone else confirm this error?
Just in case here is some info regarding the version of R and Python.
R version 3.2.0 (2015-04-16) -- "Full of Ingredients"
Copyright (C) 2015 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)
Python 2.7.11 |Anaconda 2.3.0 (64-bit)| (default, Dec 7 2015, 14:10:42) [MSC v.1500 64 bit (AMD64)]
on win32
Any help would be appreciated.
Edit 1:
If I load the dataset directly from R I don't get an error:
from rpy2.robjects import r
r.data('engel')
data = r['engel']
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
So I think there is something wrong with the conversion with pandas2ri. The same error occurs when I try to convert the DataFrame to data.frame manually with pandas2ri.py2ri.
Edit 2:
Interestingly enough, if I used the deprecated pandas.rpy.common.convert_to_r_dataframe the error is gone:
import pandas.rpy.common as com
rdata = com.convert_to_r_dataframe(data)
qreg = quantreg.rq('foodexp ~ income', data=rdata, tau=0.5)
There is definitely a bug in pandas2ri which is also confirmed here.
As answered on the rpy2 issue tracker:
The root of the issue seems to be that the columns in the pandas data frame are converted to Array objects each with only one column.
>>> pandas2ri.py2ri_pandasdataframe(data)
<DataFrame - Python:0x7f8af3c2afc8 / R:0x92958b0>
[Array, Array]
income: <class 'rpy2.robjects.vectors.Array'>
<Array - Python:0x7f8af57ef908 / R:0x92e1bf0>
[420.157651, 541.411707, 901.157457, ..., 581.359892, 743.077243, 1057.676711]
foodexp: <class 'rpy2.robjects.vectors.Array'>
<Array - Python:0x7f8af3c2ab88 / R:0x92e7600>
[255.839425, 310.958667, 485.680014, ..., 468.000798, 522.601906, 750.320163]
The distinction is a subtle one, but this seems to be confusing the quantreg package. There are other R functions appear to be working independently of whether the objects is an array with one column or a vector.
Turning the columns to R vectors appears to be what is required to solve the problem:
from rpy2.robjects.vectors import FloatVector
mydata=pandas2ri.py2ri_pandasdataframe(data)
from rpy2.robjects.packages import importr
base=importr('base')
mydata[0]=base.as_vector(mydata[0])
mydata[1]=base.as_vector(mydata[1])
# now this is working
qreg = quantreg.rq('foodexp ~ income', data=mydata, tau=0.5)
Now I would like to gather more data about whether this could solve the issue without breaking things else. For this I turned the fix into a custom converter derived from the pandas converter:
from rpy2.robjects import default_converter
from rpy2.robjects.conversion import Converter, localconverter
from rpy2.robjects.packages import importr
from rpy2.robjects import numpy2ri, pandas2ri, vectors
import numpy
my_converter = Converter('my converter',
template=pandas2ri.converter)
base=importr('base')
def ndarray_forcevector(obj):
func=numpy2ri.converter.py2ri.registry[numpy.ndarray]
# current conversion as performed by numpy
res=func(obj)
if len(obj.shape) == 1:
# force into an R vector
res=base.as_vector(res)
return res
#my_converter.py2ri.register(pandas2ri.PandasSeries)
def py2ri_pandasseries(obj):
# this is a copy of the function with the same name in pandas2ri, with
# the call to ndarray_forcevector() as the only difference
if obj.dtype == '<M8[ns]':
# time series
d = [vectors.IntVector([x.year for x in obj]),
vectors.IntVector([x.month for x in obj]),
vectors.IntVector([x.day for x in obj]),
vectors.IntVector([x.hour for x in obj]),
vectors.IntVector([x.minute for x in obj]),
vectors.IntVector([x.second for x in obj])]
res = vectors.ISOdatetime(*d)
#FIXME: can the POSIXct be created from the POSIXct constructor ?
# (is '<M8[ns]' mapping to Python datetime.datetime ?)
res = vectors.POSIXct(res)
else:
# converted as a numpy array
res = ndarray_forcevector(obj)
# "index" is equivalent to "names" in R
if obj.ndim == 1:
res.do_slot_assign('names',
vectors.StrVector(tuple(str(x) for x in obj.index)))
else:
res.do_slot_assign('dimnames',
vectors.SexpVector(conversion.py2ri(obj.index)))
return res
The easiest way to use this new converter might be in a context manager:
with localconverter(default_converter + my_converter) as cv:
qreg = quantreg.rq('foodexp ~ income', data=data, tau=0.5)
I am trying to run a Augmented Dickey-Fuller test in statsmodels in Python, but I seem to be missing something.
This is the code that I am trying:
import numpy as np
import statsmodels.tsa.stattools as ts
x = np.array([1,2,3,4,3,4,2,3])
result = ts.adfuller(x)
I get the following error:
Traceback (most recent call last):
File "C:\Users\Akavall\Desktop\Python\Stats_models\stats_models_test.py", line 12, in <module>
result = ts.adfuller(x)
File "C:\Python27\lib\site-packages\statsmodels-0.4.1-py2.7-win32.egg\statsmodels\tsa\stattools.py", line 201, in adfuller
xdall = lagmat(xdiff[:,None], maxlag, trim='both', original='in')
File "C:\Python27\lib\site-packages\statsmodels-0.4.1-py2.7-win32.egg\statsmodels\tsa\tsatools.py", line 305, in lagmat
raise ValueError("maxlag should be < nobs")
ValueError: maxlag should be < nobs
My Numpy Version: 1.6.1
My statsmodels Version: 0.4.1
I am using windows.
I am looking at the documentation here but can't figure what I am doing wrong. What am I missing?
Thanks in Advance.
I figured it out. By default maxlag is set to None, while it should be set to integer. Something like this works:
import numpy as np
import statsmodels.tsa.stattools as ts
x = np.array([1,2,3,4,3,4,2,3])
result = ts.adfuller(x, 1) # maxlag is now set to 1
Output:
>>> result
(-2.6825663173365015, 0.077103947319183241, 0, 7, {'5%': -3.4775828571428571, '1%': -4.9386902332361515, '10%': -2.8438679591836733}, 15.971188911270618)
I am having problems interpolating some data points using Scipy. I guess that it might depend on the fact that the function I'm trying to interpolate is discontinuous at x roughly 4.
Here is the code I'm using to interpolate:
from scipy import *
y_interpolated = interp1d(x,y,buonds_error=False,fill_value=0.,kind='cubic')
new_x_array = arange(min(x),max(x),0.05)
plot(new_x_array,x_interpolated(new_x_array),'r-')
The error I get is
File "<stdin>", line 2, in <module>
File "/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/scipy/interpolate/interpolate.py", line 357, in __call__
out_of_bounds = self._check_bounds(x_new)
File "/Library/Frameworks/EPD64.framework/Versions/7.1/lib/python2.7/site-packages/scipy/interpolate/interpolate.py", line 415, in _check_bounds
raise ValueError("A value in x_new is above the interpolation "
ValueError: A value in x_new is above the interpolation range.
These are my data points:
1.56916432074 -27.9998263169
1.76773750527 -27.6198430485
1.98360238449 -27.2397962268
2.25133982943 -26.8596491107
2.49319293195 -26.5518194791
2.77823462692 -26.1896935372
3.07201297519 -25.9540514619
3.46090507092 -25.7362456112
3.65968688527 -25.6453922172
3.84116464506 -25.53652509
3.97070419447 -25.3374215879
4.03087127145 -24.8493356465
4.08217147954 -24.0540196233
4.12470899596 -23.0960856364
4.17612639206 -22.4634289328
4.19318305992 -22.1380894034
4.2708234589 -21.902951035
4.3745696768 -21.9027079759
4.52158254627 -21.9565591238
4.65985875536 -21.8839570732
4.80666329863 -21.6486676004
4.91026629192 -21.4496126386
5.05709528961 -21.2685401725
5.29054655428 -21.2860476871
5.54129211534 -21.3215908912
5.73174988353 -21.6645019816
6.06035782465 -21.772138994
6.30243916407 -21.7715483093
6.59656410998 -22.0238656166
6.86481948673 -22.3665921479
7.01182409559 -22.4385289076
7.17609125906 -22.4200564296
7.37494987052 -22.4376476472
7.60844044988 -22.5093814451
7.79869207061 -22.5812017094
8.00616642549 -22.5445612485
8.17903446593 -22.4899243886
8.29141325457 -22.4715846981
What version of scipy are you using?
The script you posted has some syntax errors (I assume due to wrong copy and paste).
This script works, with scipy.__version__ == 0.9.0. .
import sys
from scipy import *
from scipy.interpolate import *
from pylab import plot
x = []
y = []
for line in sys.stdin:
a, b = line.split()
x.append(float(a))
y.append(float(b))
y_interpolated = interp1d(x,y,bounds_error=False,fill_value=0.,kind='cubic')
new_x_array = arange(min(x),max(x),0.05)
plot(new_x_array,y_interpolated(new_x_array),'r-')
I am trying to fit a nonlinear curve using rpy2 from numpy array, but are stuck as I do not know how to pass the 'start' argument on the R side. I use R 2.12.1 and python 2.6.6
Error in function (formula, data = parent.frame(), start, control = nls.control(), :
parameters without starting value in 'data': responsev, predictorv
Traceback (most recent call last):
File "./employmentsHoro.py", line 279, in <module>
nls.nls2(formula=formula, data=dataf, start=mylist)
File "/usr/lib/python2.6/dist-packages/rpy2/robjects/functions.py", line 83, in __call__
return super(SignatureTranslatedFunction, self).__call__(*args, **kwargs)
File "/usr/lib/python2.6/dist-packages/rpy2/robjects/functions.py", line 35, in __call__
res = super(Function, self).__call__(*new_args, **new_kwargs)
rpy2.rinterface.RRuntimeError: Error in function (formula, data = parent.frame(),start, control = nls.control(), :
parameters without starting value in 'data': responsev, predictorv
Can anyone help me determine how to pass a list() object to the nls formula?
the relevant part of my code is this:
import rpy2.robjects as robjects
from rpy2.robjects import DataFrame, Formula
import rpy2.robjects.numpy2ri as npr
import numpy as np
from rpy2.robjects.packages import importr
nls = importr('nls2')
stats = importr('stats')
mylist = robjects.r('list(a=700,b=0.8,c=200000)')
dataf = DataFrame({'responsev': professions, 'predictorv': totalEmployment})
starter= DataFrame({'a':700,'b':0.80,'c':200000})
formula = Formula('responsev ~I( a*(predictorv/c)^b )/( 1+( predictorv/c )^b )')
nls.nls2(formula=formula, data=dataf, start=starter)
The main error is this one:
Error in function (formula, data = parent.frame(), start, control =
nls.control(), : parameters without starting value in
'data': responsev, predictorv
Where are declared the variable professions? and DataEmployment?
seems they don't have a starting value, maybe you have to change/transform in something that R
understands?