How to use rpy2 (V3.4.5) with R's tidyr nest()? - python

My attempt to use rpy2 in a Jupyter (iPython) notebook fails at the point where I wish to use tidyr::nest() to make a dataframe that has some elements which are a series. (This cannot be avoided, it is necessary for the next step of the analysis.)
After trying many things (including advice from the package home), I've managed to get most of the way to the end, but I fail to understand how rpy2 works with tidyr::nest at the last step. (Some of the following example may be superfluous.)
How can I fix this?
# Import packages
from rpy2 import robjects
from rpy2.robjects import Formula, Environment
from rpy2.robjects.vectors import IntVector, FloatVector
from rpy2.robjects.lib import grid
from rpy2.robjects.packages import importr, data
from rpy2.rinterface_lib.embedded import RRuntimeError
import warnings
base = importr('base')
datasets = importr('datasets')
from functools import partial
import rpy2.robjects.lib.tidyr as tidyr
from collections import OrderedDict
from rpy2.robjects.vectors import (StrVector,
                                   IntVector)
from rpy2.robjects.lib.tidyr import DataFrame
from rpy2.robjects import rl
tidyr = importr('tidyr')
dplyr = importr('dplyr')
# Obtain the R dataset "iris"
iris = data(datasets).fetch('iris')['iris']
# Use only the two columns necessary for the "tidyr::nest" trial
dataf = (
    DataFrame(iris)
    .select(base.c(1,5))
)
print(dataf.head())
# Failure occurs below
irisNested = tidyr.nest(dataf, data=rl('Sepal.Length'))
EDIT: This isn't a question anymore, as it turns out my original effort was correct. A fragment left over from another attempt to solve the problem caused it to fail. However, I asked the question because the various sources of help I found did not provide a clear way to address my issue; I feel I made a lucky educated guess. The fragment that caused the failure was:
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
pandas2ri.activate()
Removing this solves the problem.

Related

Using tsclust from dtwclust R library in Python. How do I access cluster?

I'm translating R code to Python and need to implement hierarchical time series clustering with DTW distance for time series of different lengths. The only solution I have found is to use the same library I used in R (dtwclust, with tsclust) through Python's rpy2 package:
import rpy2
import rpy2.robjects.packages as rpackages
from rpy2.robjects.vectors import StrVector
from rpy2.robjects.packages import importr
from mRMRr import *
from datasets import *
import rpy2.robjects.packages as packages
packa = packages.importr('dtwclust')
Z = packa.tsclust(list_t_matrix_prof,type = "hierarchical", distance = "dtw_basic")
Now I need to access the cluster assigned to each element; in R I can get this with Z@cluster.
Can anyone tell me which is the corresponding command in Python?
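One possible approach, assuming Z is an S4 object (which is what R's slot access Z@cluster implies): rpy2 exposes the slots of an R object through its .slots mapping, so the slot can be read by name. The helper name cluster_labels and the stand-in object below are hypothetical; the stand-in is only there so the sketch runs without R, and with a real fit you would pass Z instead.

```python
from types import SimpleNamespace


def cluster_labels(fit):
    """Return the `cluster` slot of an rpy2 S4 object as a plain Python list.

    Assumes `fit` has a `.slots` mapping, as rpy2 R objects do.
    """
    return list(fit.slots["cluster"])


# Hypothetical stand-in for the object returned by tsclust (no R needed here).
fake_fit = SimpleNamespace(slots={"cluster": [1, 2, 1]})
print(cluster_labels(fake_fit))
```

With the real object from the question, the call would be cluster_labels(Z), or simply Z.slots['cluster'] if an R vector is good enough.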

A library/package contains multiple modules, and those modules contain many classes (methods) / functions. How do I import them in the following case?

The following code works-
import sklearn.linear_model
clf= sklearn.linear_model.LogisticRegressionCV()
The following code does not work-
import sklearn
clf= sklearn.linear_model.LogisticRegressionCV()
whereas in case of Numpy, the following also works
import numpy as np
np.random.randint()
Why is that? Please elaborate.
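A sketch of the general mechanism, not a statement about how sklearn or numpy are laid out internally: importing a package runs only its __init__.py, so a submodule becomes an attribute of the package only if __init__.py imports it or you import it yourself. The two throwaway packages below (lazy_pkg and eager_pkg, both hypothetical names) are written to a temporary directory to show the difference.

```python
import importlib
import sys
import tempfile
from pathlib import Path

root = Path(tempfile.mkdtemp())

# lazy_pkg: empty __init__.py, so its submodule is not loaded automatically.
(root / "lazy_pkg").mkdir()
(root / "lazy_pkg" / "__init__.py").write_text("")
(root / "lazy_pkg" / "sub.py").write_text("VALUE = 1\n")

# eager_pkg: __init__.py imports the submodule, so it is available right away.
(root / "eager_pkg").mkdir()
(root / "eager_pkg" / "__init__.py").write_text("from . import sub\n")
(root / "eager_pkg" / "sub.py").write_text("VALUE = 2\n")

sys.path.insert(0, str(root))
importlib.invalidate_caches()

import lazy_pkg
import eager_pkg

lazy_before = hasattr(lazy_pkg, "sub")    # submodule not loaded yet
eager_has_sub = hasattr(eager_pkg, "sub")  # loaded by __init__.py

import lazy_pkg.sub  # explicit submodule import binds lazy_pkg.sub

lazy_after = hasattr(lazy_pkg, "sub")
print(lazy_before, eager_has_sub, lazy_after)
```

If the behaviour in the question follows this pattern, import sklearn.linear_model plays the role of the explicit import lazy_pkg.sub, while np.random works because the package makes it available on import.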

Dill installed - throwing error that part of the module is missing

I'm writing code in a Jupyter Notebook that involves cleaning and analyzing a large amount of consumer data. I want to save the dataframes, which have thousands of rows, so I don't have to rerun the code every time I make an adjustment, and dill seems like the perfect package for that. However, I'm getting this error when attempting to pickle the notebook session:
AttributeError: module 'dill' has no attribute 'dump_session'
Let me know if the program code is necessary - I don't think it should make a difference. The imports are:
import numpy as np
import pandas as pd
import dill
import scipy
from matplotlib import pyplot as plt
from __future__ import division
from collections import OrderedDict
from sklearn.cluster import KMeans
pd.options.display.max_columns = None
and when I run this code I get the error from above:
dill.dump_session('recengine.db')
Is there another package that's interfering with dill's use of pickle vs. cpickle?
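One thing that may be worth checking, assuming the cause is that Python is loading something other than the dill you expect (for example a local file named dill.py, or an older install): ask the import system which file it would load. This is only a diagnostic sketch; module_origin is a hypothetical helper name and the question does not confirm this is the cause.

```python
import importlib.util


def module_origin(name):
    """Return the file path Python would load for `name`, or None if it cannot be found."""
    spec = importlib.util.find_spec(name)
    return None if spec is None else spec.origin


# Prints the path of whatever "dill" resolves to in this environment (or None).
print(module_origin("dill"))
```

If the printed path points into your working directory rather than site-packages, a local file is shadowing the installed package.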

Running glmnet with rpy2 on sparse design matrix?

I have a python snippet which works just fine to run GLMNET on np.array X and y. However, when X is a column sparse matrix from scipy, the code fails as rpy2 is not able to convert X. Am I making an obvious mistake?
An MCVE is:
import numpy as np
from scipy import sparse
from rpy2 import robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects import numpy2ri
from rpy2.robjects import pandas2ri
if __name__ == "__main__":
    X = sparse.rand(5, 20, density=0.1)
    y = np.random.randn(5)
    numpy2ri.activate()
    pandas2ri.activate()
    utils = rpackages.importr('utils')
    utils.chooseCRANmirror(ind=1)
    if not rpackages.isinstalled('glmnet'):
        utils.install_packages("glmnet")
    glmnet = rpackages.importr('glmnet')
    glmnet = robjects.r['glmnet']
    glmnet_fit = glmnet(X, y, intercept=False, standardize=False)
And when I run it I get a NotImplementedError:
Conversion 'py2ri' not defined for objects of type '<class 'scipy.sparse.csc.csc_matrix'>'
Could I provide X in a different way? I'd be surprised if rpy2 could not handle sparse matrices.
You can create a sparse matrix with rpy2 as follows:
import numpy as np
import rpy2.robjects as ro
from rpy2.robjects.packages import importr
from scipy import sparse
X = sparse.rand(5, 20, density=0.1).tocoo()
r_Matrix = importr("Matrix")
r_Matrix.sparseMatrix(
    i=ro.IntVector(X.row + 1),
    j=ro.IntVector(X.col + 1),
    x=ro.FloatVector(X.data),
    dims=ro.IntVector(X.shape))
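The + 1 on the row and column indices above is there because scipy uses 0-based indices while R uses 1-based ones. A small sketch, using only numpy and scipy and a hand-made matrix (an assumption for illustration, not the data from the question), that checks the shifted triplets still describe the same matrix:

```python
import numpy as np
from scipy import sparse

# Small deterministic example matrix in COO (triplet) format.
dense = np.array([[0.0, 1.5, 0.0],
                  [2.0, 0.0, 3.0]])
X = sparse.coo_matrix(dense)

# 1-based triplets, as they would be handed to R's sparseMatrix(i=, j=, x=).
i = X.row + 1
j = X.col + 1
x = X.data

# Shift back to 0-based and rebuild to confirm nothing was lost.
rebuilt = sparse.coo_matrix((x, (i - 1, j - 1)), shape=X.shape)
same = np.allclose(rebuilt.toarray(), dense)
print(same)
```

Checking that i and j stay within 1..nrow and 1..ncol before crossing into R can save a confusing error on the R side.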
There is indeed no Python -> R converter for your object type included in rpy2. Your Python object is not a conventional array but a sparse matrix, as you note (scipy.sparse.csc.csc_matrix to be specific), implemented in one of the numerical extensions built on numpy. As numpy itself is not even required to use rpy2, support for numpy extensions is rather sparse, with the notable exception of pandas, since data tables are ubiquitous.
You may want to write your own converter from csc_matrix to dgCMatrix in the R package Matrix (https://stat.ethz.ch/R-manual/R-devel/library/Matrix/html/dgCMatrix-class.html), as the package glmnet appears to be able to handle them.
Writing a custom converter will require working out how to map or copy the content of the Python object to its chosen R counterpart, but once that is done, plugging the code into rpy2 should be quite easy:
https://rpy2.github.io/doc/v2.9.x/html/generated_rst/s4class.html#custom-conversion
Consider opening an issue as a "feature request" on the rpy2 issue tracker and reporting progress and outcome there, with the hope of seeing this turn into a pull request complete with unit tests.
Also, a quick solution that might work is to save the sparse matrix to a temporary file.
import rpy2.robjects as ro
from scipy.io import mmwrite
# matrix is the scipy sparse matrix (X in the question)
mmwrite('temp.mtx', matrix)
# readMM is provided by the R package Matrix
ro.r('library(Matrix)')
ro.r('X <- readMM("temp.mtx")')
I would be very interested, though, if someone comes up with a custom converter that avoids that copy to disk.

Syntax for rpy2 base.with function

I'm trying to figure out rpy2 for plotting some graphs. I'd like to be able to use the with function that's part of R's base, as it's used in the following R code:
with(res, plot(log2FoldChange, -log10(pvalue), pch=20, main="Volcano plot", xlim=c(-2.5,2)))
with(subset(res, padj<.05 ), points(log2FoldChange, -log10(pvalue), pch=20, col="red"))
Where res is a dataframe and log2FoldChange and pvalue are columns from that dataframe.
When I import the base package using rpy2's importr I can see that 'with' is in the object by doing:
from rpy2.robjects.packages import importr
base = importr('base')
dir(base)
However, I can't seem to figure out the correct syntax:
from rpy2.robjects.packages import importr
from rpy2 import robjects
base = importr('base')
base.with(res, robjects.r.plot(log2FoldChange, padj))
File "<stdin>", line 1
base.with(res, robjects.r.plot(log2FoldChange, padj))
^
SyntaxError: invalid syntax
Unfortunately, searching for something like 'base.with' has proven intractable. My question: what is the syntax for using 'base.with' in rpy2 python code?
Alternatively, while using 'with' is the most R forward approach to doing this, perhaps there's a more rpy2 friendly approach to this same problem that I'm unaware of.
Python is hitting a conflict with its own with statement: with is a reserved keyword in Python, so it cannot be written as an attribute name such as base.with. This is the challenge of interfacing with another language.
Try running the command as native R code, passed as a string to robjects.r. Below I pass Python objects into R's global environment.
import rpy2.robjects as ro
ro.globalenv['res'] = res_frompy
ro.globalenv['log2FoldChange'] = log2FoldChange_frompy
ro.globalenv['padj'] = padj_frompy
ro.r('with(res, plot(log2FoldChange, padj))')
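To see why base.with is rejected before rpy2 is even involved, here is a small sketch using only the standard library. The base object below is a hypothetical stand-in for importr('base'), used so the snippet runs without R; it shows that an attribute named after a keyword can still be reached with getattr.

```python
import keyword
from types import SimpleNamespace

# `with` is reserved by Python, so `base.with` cannot even be parsed.
is_reserved = keyword.iskeyword("with")
print(is_reserved)

# Hypothetical stand-in for importr('base'); the lambda just echoes its arguments.
base = SimpleNamespace(**{"with": lambda data, expr: (data, expr)})

# Attribute lookup by string sidesteps the parser.
r_with = getattr(base, "with")
print(r_with("res", "plot(...)"))
```

Even with getattr, Python would still evaluate the arguments before calling R, so an expression such as plot(log2FoldChange, padj) would need to reach R unevaluated; passing the whole call as an R string, as above, avoids that problem.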
