How to manipulate columns of a data frame - python

I read data from a CSV file using rpy2's read_csv() which creates a DataFrame. Now I want to directly manipulate an entire column. What I tried so far:
from rpy2.robjects.packages import importr
utils = importr('utils')
df = utils.read_csv(logn, header=args.head, skip=args.skip)
df.rx2('a').ro / 10
I expected this to write the result back to the DataFrame, but apparently it doesn't: df is not affected by the operation. So another idea was
df.rx2('a') = df.rx2('a').ro / 10
but that produces an error saying that function calls are not assignable, which is not obvious to me since the LHS should return a Vector(?)
So what did I miss?

In Python, function calls are indeed not assignable, so the R code has to be adapted a little.
Try:
df[df.names.index('a')] = df.rx2('a').ro / 10
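Putting it together, a minimal sketch (the file name data.csv and column a are made up for illustration):
from rpy2.robjects.packages import importr
utils = importr('utils')
df = utils.read_csv('data.csv')
# rx2('a') returns the column vector; assigning the result back by
# positional index is what actually mutates the DataFrame
df[df.names.index('a')] = df.rx2('a').ro / 10
print(df.rx2('a'))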

Related

Method called twice instead of single call in Dask's multiprocessing

I am trying to download files from a Google Storage bucket and parse them. There are millions of such files that need to be downloaded, parsed, and processed (natural language processing, etc.).
I am trying the code below, using Dask's parallel processing, and it works, but extract_skill is called twice instead of once for each row of the pandas DataFrame. Please help me understand why extract_skill is being called twice.
import pandas as pd
import numpy as np
import dask
import dask.dataframe as dd
# download each file, extract skill sets, and store them in the skill_sets column
chunk_size = 20
df_list = np.array_split(temp_df, temp_df.shape[0] // chunk_size)
temp_df["skill_sets"] = ""
result_df = pd.DataFrame(data={}, columns=temp_df.columns)
for df_ in df_list:
    df_["skill_sets"] = dd.from_pandas(df_, npartitions=4, sort=False, name='x').apply(extract_skill, axis=1, meta='object').compute()
    result_df = pd.concat([result_df, df_], axis=0)
where extract_skill is defined as:
def extract_skill(row):
    # download the file, parse it, and do some NLP
    file_name = row['file_path']
    ...
    ...
    return skill_sets
Thanks in advance.
The DataFrame.apply method runs your function on a small sample of data in order to determine the datatypes and columns of the output. See the docstring of this function and look for the keyword "meta" for more information.
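As a sketch, supplying an explicit meta (here a (name, dtype) tuple, one of the accepted forms) describes the output schema up front, so Dask does not need to probe the function on sample data; the toy extract_skill below is a hypothetical stand-in for the real download/parse logic:
import pandas as pd
import dask.dataframe as dd

def extract_skill(row):
    # stand-in for the real download/parse logic
    return row['file_path'].upper()

pdf = pd.DataFrame({'file_path': ['a.txt', 'b.txt', 'c.txt']})
ddf = dd.from_pandas(pdf, npartitions=2)
# meta names the output column and gives its dtype, so no sample run is needed
skills = ddf.apply(extract_skill, axis=1, meta=('skill_sets', 'object')).compute()
print(skills)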

Rpy2 (working with Dataframes) - How to solve the Python(NaN) and R(NA) conflict?

I have a pandas DataFrame, result, with 2 columns:
doy (day_of_year: the independent variable, values 1, 2, 3, ..., 365)
bookings (the dependent variable, 279 numeric values and 86 NaN values)
Please find a portion of the DataFrame below:
My goal is to impute the missing values using R (randomForest::rfImpute) for further spectral analysis.
So I am using rpy2 to run R code inside a Python script. I have imported the necessary packages/libraries and activated pandas2ri.
import random
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
base = importr('base')
utils = importr('utils')
randomForest = importr('randomForest')
from rpy2.robjects import pandas2ri
pandas2ri.activate()
r_df = pandas2ri.py2ri(result)
type(r_df)
print(r_df.head())
print(base.summary(r_df))
random.seed(222)
result.imputed = randomForest.rfImpute('bookings', 'doy', data = r_df)
But whenever I run the code, I get the error: No NA values found in bookings.
It's clear that the R code fails to interpret the missing values.
I have also tried to replace NaN with NA in the R data frame r_df,
robjects.r('r_df[is.nan(as.numeric(r_df))] = NA')
but when I run the code, I get the error: object r_df is not found.
Is there a way around this issue? As of now, I am a bit stuck and can't seem to find helpful documentation.
Please find below some other outputs of separate lines of code.
(...)
Regarding:
robjects.r('r_df[is.nan(as.numeric(r_df))] = NA')
failing with the error "object r_df is not found": r_df is the name of a Python variable (a name defined in your Python code), so the embedded R process does not know about it.
Regarding "Is there a way around this issue? As of now, I am a bit stuck and can't seem to find helpful documentation":
What about this:
https://rpy2.github.io/doc/v2.9.x/html/robjects_rinstance.html#r-the-instance-of-r
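For example, a minimal sketch that first binds the Python-side object into R's global environment, so that R code can then refer to it by name (variable names as in the question):
import rpy2.robjects as robjects
robjects.globalenv['r_df'] = r_df  # now embedded R knows the name 'r_df'
robjects.r('r_df$bookings[is.nan(r_df$bookings)] <- NA')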

python dataframe write to R data format

I have a question about writing a DataFrame to an R data format.
I have data with 1000 columns and 77 rows, and I want to write this DataFrame to an R data file.
When I use the function
r_dataframe = com.convert_to_r_dataframe(df)
it gives me an error like "DataFrame object has no attribute type".
Looking at the code of com.convert_to_r_dataframe(), it just takes each column of the DataFrame and reads column.dtype.type. At that point the column is itself a DataFrame; does a DataFrame with this many columns contain nested DataFrames?
Does anyone have an idea how to solve this problem?
Transferring a data frame between Python and R can be accomplished with the feather format; see the feather documentation for more information.
Quick example.
Export in Python:
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
Import in R:
library(feather)
path <- "my_data.feather"
df <- read_feather(path)
In this case you'll have the data in R as a data.frame. You can then decide to write it to an RData file.
save(df, file = 'my_data.RData')
The simplest practical solution is to export to CSV:
import pandas as pd
dataframe.to_csv('mypath/file.csv')
and then read it in R using read.csv.
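For completeness, the matching R side (assuming the same path) would be:
df <- read.csv('mypath/file.csv')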

Is there a way to access R data frame column names in python/rpy2?

I have an R data frame, saved in Database02.Rda. Loading it
import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")
works fine. However:
print(robjects.r.names("df"))
yields
NULL
Also, as an example, column 214 (213 if we count starting with 0) is named REGION.
print(robjects.r.table(robjects.r["df"][213]))
works fine:
Region 1 Region 2 ...
9811 3451 ...
but we should also be able to do
print(robjects.r.table("df$REGION"))
This, however, results in
df$REGION
1
(which it does also for column names that do not exist at all); also:
print(robjects.r.table(robjects.r["df"]["REGION"]))
gives an error:
TypeError: SexpVector indices must be integers, not str
Now, the docs say names cannot be used for subsetting in Python. Am I correct to assume that the column names are not imported with the rest of the data when loading the data frame with python/rpy2? Am I thus correct that the easiest way to access them is to save and load them as a separate list and construct a dict in Python mapping the names to the column index numbers? This does not seem very generic, however. Is there a way to extract the column names directly?
The versions of R, python, rpy2 I use are:
R: 3.2.2
python: 3.5.0
rpy2: 2.7.8
When doing the following, you are loading whatever objects are in Database02.Rda into R's "global environment".
import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")
robjects.globalenv is an Environment. You can list its content with:
tuple(robjects.globalenv.keys())
I gather from your example that one of your objects is called df. You can access it with:
df = robjects.globalenv['df']
If df is a list or a data frame, you can access its named elements with rx2 (the doc is your friend here again). To get the one called REGION, do:
df.rx2("REGION")
Listing all named elements in a list or data frame is easy:
tuple(df.names)
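Putting the pieces together, a short sketch (assuming Database02.Rda contains a data frame named df):
import rpy2.robjects as robjects
robjects.r.load("Database02.Rda")
df = robjects.globalenv['df']
print(tuple(df.names))           # all column names
region = df.rx2("REGION")        # one column, accessed by name
print(robjects.r.table(region))  # same result as with the integer index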
If you run R code from within Python, the global-environment answer above will not work. But kudos to @lgautier, the creator/maintainer of this package. In R the dollar sign $ is used frequently. This is what I learned:
print(pamk_clusters$pamobject$clusinfo)
will not work, and its equivalent
print(pamk_clusters[["pamobject"]][["clusinfo"]])
also will not work ... however, after some digging in the "man"
http://rpy2.readthedocs.io/en/version_2.7.x/vector.html#extracting-r-style
Access to R-style extracting/subsetting is granted through the two delegators rx and rx2, representing the R functions [ and [[ respectively.
This works as expected
print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))
I commented in the project's issue tracker about the clarity of the documentation:
https://bitbucket.org/rpy2/rpy2/issues/436/acessing-dataframe-elements-using-rpy2
I am using rpy2 on Win7 with ipython. To help others dig through the formatting, here is a setup that seems to work:
import rpy2
import rpy2.robjects as robjects
import rpy2.robjects.packages as rpackages
from rpy2.robjects.packages import importr
base = importr('base')
utils = importr('utils')
utils.chooseCRANmirror(ind=1)
cluster = importr('cluster')
stats = importr('stats')
#utils.install_packages("fpc")
fpc = importr('fpc')
import pickle
with open('points', 'rb') as fp:
    points = pickle.load(fp)
# the data above is stored as a binary object
# online: http://www.mshaffer.com/arizona/dissertation/points
import rpy2.robjects.numpy2ri as npr
npr.activate()
k = robjects.IntVector(range(3, 8))  # r-syntax 3:7 # I expect 5
pamk_clusters = fpc.pamk(points, k)
print(base.summary(pamk_clusters))
base.print(base.summary(pamk_clusters))
utils.str(pamk_clusters)
# these R-style forms will not work in Python (see above):
# print(pamk_clusters$pamobject$clusinfo)
# base.print(pamk_clusters$pamobject$clusinfo)
# print(pamk_clusters[["pamobject"]][["clusinfo"]])
# this one works:
print(pamk_clusters.rx2("pamobject").rx2("clusinfo"))
pam_clusters = cluster.pam(points, 5)      # much slower
kmeans_clusters = stats.kmeans(points, 5)  # much faster
utils.str(kmeans_clusters)
print(kmeans_clusters.rx2("cluster"))
R has been a standard for statistical computing for nearly 25 years, and is based on the forty-year-old S language, from a time when computing efficiency mattered a lot.
https://en.wikipedia.org/wiki/R_(programming_language)
Again, @lgautier, thank you for making R more readily accessible within Python.

rpy2: Converting a data.frame to a numpy array

I have a data.frame in R. It contains a lot of data: gene expression levels from many (125) arrays. I'd like the data in Python, due mostly to my incompetence in R and the fact that this was supposed to be a 30-minute job.
I would like the following code to work. To understand this code, know that the variable path contains the full path to my data set which, when loaded, gives me a variable called immgen. Know that immgen is an object (a Bioconductor ExpressionSet object) and that exprs(immgen) returns a data frame with 125 columns (experiments) and tens of thousands of rows (named genes). (Just in case it's not clear, this is Python code, using robjects.r to call R code)
import numpy as np
import rpy2.robjects as robjects
# ... some code to build path
robjects.r("load('%s')"%path) # loads immgen
e = robjects.r['data.frame']("exprs(immgen)")
expression_data = np.array(e)
This code runs, but expression_data is simply array([[1]]).
I'm pretty sure that e doesn't represent the data frame generated by exprs() due to things like:
In [40]: e._get_ncol()
Out[40]: 1
In [41]: e._get_nrow()
Out[41]: 1
But then again, who knows? Even if e did represent my data.frame, the fact that it doesn't convert straight to an array would be fair enough: a data frame has more in it than an array (rownames and colnames), so maybe life shouldn't be this easy. However, I still can't work out how to perform the conversion. The documentation is a bit too terse for me, though my limited understanding of the headings in the docs implies that this should be possible.
Anyone any thoughts?
This is the most straightforward and reliable way I've found to transfer a data frame from R to Python.
To begin with, I think exchanging the data through the R bindings is an unnecessary complication. R provides a simple method to export data; likewise, NumPy has decent methods for data import. The file format is the only common interface required here.
data(iris)
iris$Species = unclass(iris$Species)
write.table(iris, file="/path/to/my/file/np_iris.txt", row.names=F, sep=",")
# now start a python session
import numpy as NP
fpath = "/path/to/my/file/np_iris.txt"
A = NP.loadtxt(fpath, comments="#", delimiter=",", skiprows=1)
# print(type(A))
# returns: <type 'numpy.ndarray'>
print(A.shape)
# returns: (150, 5)
print(A[1:5,])
# returns:
[[ 4.9  3.   1.4  0.2  1. ]
[ 4.7  3.2  1.3  0.2  1. ]
[ 4.6  3.1  1.5  0.2  1. ]
[ 5.   3.6  1.4  0.2  1. ]]
According to the documentation (and my own experience, for what it's worth), loadtxt is the preferred method for conventional data import.
You can also pass loadtxt a tuple of data types (the argument is dtype), one item for each column. Notice skiprows=1, which steps over the column-header line.
Finally, I converted the data frame factor to integer (which is actually the underlying data type for factor) prior to exporting; unclass is probably the easiest way to do this.
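For instance, a sketch of a structured dtype for the iris file written above (the field names are made up):
import numpy as NP
dt = {'names': ('sepal_len', 'sepal_wid', 'petal_len', 'petal_wid', 'species'),
      'formats': ('f4', 'f4', 'f4', 'f4', 'i4')}
A = NP.loadtxt(fpath, delimiter=",", skiprows=1, dtype=dt)
print(A['species'][:5])  # columns are now accessible by field name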
If you have big data (i.e., you don't want to load the entire data file into memory but still need to access it), NumPy's memory-mapped data structure (memmap) is a good choice:
from tempfile import mkdtemp
import os.path as path
filename = path.join(mkdtemp(), 'tempfile.dat')
# now create a memory-mapped file with shape and data type
# based on the original R data frame:
A = NP.memmap(filename, dtype="float32", mode="w+", shape=(150, 5))
# methods are 'flush' (writes to disk any changes you make to the array) and 'close'
# write data to the memmap array (actually an array-like memory map to
# the data stored on disk):
A[:] = somedata[:]  # 'somedata' stands in for the array loaded earlier
A.flush()
Why go through a data.frame when exprs(immgen) returns a matrix and your end goal is to have your data in a matrix?
Passing the matrix to numpy is straightforward (and can even be made without making a copy):
http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
This should beat, in both simplicity and efficiency, the suggestion of going through a text representation of numerical data in flat files as a way to exchange data.
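A minimal sketch of that route (assuming immgen is already loaded in the embedded R session, and that Biobase, which provides exprs(), is installed):
import numpy as np
import rpy2.robjects as robjects
robjects.r("library(Biobase)")
m = robjects.r("exprs(immgen)")  # an R matrix, not a data.frame
expression_data = np.asarray(m)  # uses the array interface described in the linked doc
print(expression_data.shape)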
You seem to be working with bioconductor classes, and might be interested in the following:
http://pypi.python.org/pypi/rpy2-bioconductor-extensions/
