I have a data.frame in R. It contains a lot of data: gene expression levels from many (125) arrays. I'd like the data in Python, due mostly to my incompetence in R and the fact that this was supposed to be a 30 minute job.
I would like the following code to work. To understand this code, know that the variable path contains the full path to my data set which, when loaded, gives me a variable called immgen. Know that immgen is an object (a Bioconductor ExpressionSet object) and that exprs(immgen) returns a data frame with 125 columns (experiments) and tens of thousands of rows (named genes). (Just in case it's not clear, this is Python code, using robjects.r to call R code)
import numpy as np
import rpy2.robjects as robjects
# ... some code to build path
robjects.r("load('%s')"%path) # loads immgen
e = robjects.r['data.frame']("exprs(immgen)")
expression_data = np.array(e)
This code runs, but expression_data is simply array([[1]]).
I'm pretty sure that e doesn't represent the data frame generated by exprs() due to things like:
In [40]: e._get_ncol()
Out[40]: 1
In [41]: e._get_nrow()
Out[41]: 1
But then again who knows? Even if e did represent my data.frame, that it doesn't convert straight to an array would be fair enough - a data frame has more in it than an array (rownames and colnames) and so maybe life shouldn't be this easy. However I still can't work out how to perform the conversion. The documentation is a bit too terse for me, though my limited understanding of the headings in the docs implies that this should be possible.
Anyone any thoughts?
This is the most straightforward and reliable way I've found to transfer a data frame from R to Python.
To begin with, I think exchanging the data through the R bindings is an unnecessary complication. R provides a simple method to export data, likewise, NumPy has decent methods for data import. The file format is the only common interface required here.
data(iris)
iris$Species = unclass(iris$Species)
write.table(iris, file="/path/to/my/file/np_iris.txt", row.names=F, sep=",")
# now start a python session
import numpy as NP
fpath = "/path/to/my/file/np_iris.txt"
A = NP.loadtxt(fpath, comments="#", delimiter=",", skiprows=1)
# print(type(A))
# returns: <type 'numpy.ndarray'>
print(A.shape)
# returns: (150, 5)
print(A[1:5,])
# returns:
[[ 4.9 3. 1.4 0.2 1. ]
[ 4.7 3.2 1.3 0.2 1. ]
[ 4.6 3.1 1.5 0.2 1. ]
[ 5. 3.6 1.4 0.2 1. ]]
According to the documentation (and my own experience, for what it's worth), loadtxt is the preferred method for conventional data import.
You can also pass loadtxt a tuple of data types (the argument is dtype), one item for each column. Note skiprows=1, which steps over the single row of column headers.
Finally, I converted the data frame factor to integer (which is actually the underlying data type for factor) prior to exporting; unclass is probably the easiest way to do this.
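For instance, here is a hedged sketch of the dtype option (the field names below are my own invention, not taken from the iris file):
import numpy as NP
fpath = "/path/to/my/file/np_iris.txt"
# one (name, type) pair per column; the names are illustrative
iris_dtype = [("sepal_len", "f8"), ("sepal_wid", "f8"),
              ("petal_len", "f8"), ("petal_wid", "f8"),
              ("species", "i4")]
A = NP.loadtxt(fpath, delimiter=",", skiprows=1, dtype=iris_dtype)
# A is now a 1D structured array; columns are accessed by name:
print(A["species"][:5])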
If you have big data (i.e., you don't want to load the entire data file into memory but still need to access it), NumPy's memory-mapped data structure ('memmap') is a good choice:
from tempfile import mkdtemp
import os.path as path
filename = path.join(mkdtemp(), 'tempfile.dat')
# now create a memory-mapped file with shape and data type
# based on the original R data frame:
A = NP.memmap(filename, dtype="float32", mode="w+", shape=(150, 5))
# methods are 'flush' (writes to disk any changes you make to the array)
# and 'close'
# to write data to the memmap array (actually an array-like memory-map to
# the data stored on disk), assign into it; 'somedata' is whatever array you loaded:
A[:] = somedata[:]
Why go through a data.frame when exprs(immgen) returns a matrix and your end goal is to have your data in a matrix?
Passing the matrix to numpy is straightforward (and can even be done without making a copy):
http://rpy.sourceforge.net/rpy2/doc-2.1/html/numpy.html#from-rpy2-to-numpy
This should beat the suggested route through a text representation of numerical data in flat files in both simplicity and efficiency.
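A minimal sketch of that route (assuming rpy2's numpy conversion layer; the exact activation call has varied across rpy2 versions, so treat this as illustrative):
import numpy as np
import rpy2.robjects as robjects
from rpy2.robjects import numpy2ri
numpy2ri.activate()  # enable automatic R-to-NumPy conversion (rpy2 >= 2.2)

robjects.r("load('%s')" % path)     # loads immgen, as in the question
m = robjects.r("exprs(immgen)")     # an R matrix
expression_data = np.asarray(m)     # a NumPy array backed by the R matrix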
You seem to be working with bioconductor classes, and might be interested in the following:
http://pypi.python.org/pypi/rpy2-bioconductor-extensions/
Related
I would like to combine two FITS files, by taking a slice out of one and inserting it into the other. The slice would be based on an angle measured from the centre pixel, see example image below:
Can this be done using Astropy? There are many questions on combining FITS files on the site, but most of these are related to simply adding two files together, rather than combining segments like this.
Here is one recommended approach:
1. Read in your two files
Assuming the data is in an ImageHDU data array:
from astropy.io import fits
# read the numpy arrays out of the files
# assuming they contain ImageHDUs
data1 = fits.getdata('file1.fits')
data2 = fits.getdata('file2.fits')
2. Cut out the sections and put them into a new numpy array
Build up indices1 and indices2 for the desired sections in the new file; simple NumPy indexing then fills each section into a new array.
Inspired by https://stackoverflow.com/a/18354475/15531842, the sector_mask function defined in that answer can be used to get indices for each array using angular slices.
mask = sector_mask(data1.shape, centre=(53,38), radius=100, angle_range=(280,340))
indices1 = ~mask
indices2 = mask
Then these indices can be used to transfer the data into a new array.
import numpy as np
newdata = np.zeros_like(data1)
newdata[indices1] = data1[indices1]
newdata[indices2] = data2[indices2]
If the coordinate system is well known then it may be possible to use astropy's Cutout2D class, although I was not able to figure out how to use it fully, and it wasn't clear from the example given whether it could do an angular slice. See the astropy example at https://docs.astropy.org/en/stable/nddata/utils.html#d-cutout-using-an-angular-size
3a. Then write out the new array as a new file.
If special header information is not needed in the new file, the numpy array with the new image can be written out to a FITS file with one line of astropy code.
# this is an easy way to write a numpy array to FITS file
# no header information is carried over
fits.writeto('file_combined.fits', data=newdata)
3b. Carry the FITS header information over to the new file
If there is a desire to carry over header information, an ImageHDU can be built from the numpy array together with the desired header (an astropy.io.fits.Header instance).
img_hdu = fits.ImageHDU(data=newdata, header=my_header)  # my_header: a fits.Header
hdu_list = fits.HDUList()
hdu_list.append(fits.PrimaryHDU())
hdu_list.append(img_hdu)
hdu_list.writeto('file_combined.fits')
I've looked at this response to try and get numpy to print the full array rather than a summarized view, but it doesn't seem to be working.
I have a CSV with named headers. Here are the first five rows
v0 v1 v2 v3 v4
1001 5529 24 56663 16445
1002 4809 30.125 49853 28069
1003 407 20 28462 8491
1005 605 19.55 75423 4798
1007 1607 20.26 79076 12962
I'd like to read in the data and be able to view it fully. I tried doing this:
import numpy as np
np.set_printoptions(threshold=np.inf)
main_df2=np.genfromtxt('file location', delimiter=",")
main_df2[0:3,:]
However this still returns the truncated array, and the performance seems greatly slowed. What am I doing wrong?
OK, in a regular Python session (I usually use IPython instead), I set the print options and made a large array:
>>> np.set_printoptions(threshold=np.inf, suppress=True)
>>> x=np.random.rand(25000,5)
When I execute the next line, it spends about 21 seconds formatting the array, and then writes the resulting string to the screen (with more lines than fit the terminal's window buffer).
>>> x
This is the same as
>>> print(repr(x))
The internal storage for x is a buffer of floats (which you can 'see' with x.tostring()). To print x it has to format it, creating a multiline string that contains a print representation of each number, all 125000 of them. The result of repr(x) is a string 1850000 characters long, 25000 lines. This is what takes 21 seconds. Displaying that on the screen is just limited by the terminal scroll speed.
I haven't looked at the details, but I think the numpy formatting is mostly written in Python, not compiled. It's designed more for flexibility than speed. It's normal to want to see 10-100 lines of an array. 25000 lines is an unusual case.
Somewhat curiously, writing this array as a csv is fast, with a minimal delay:
>>> np.savetxt('test.txt', x, fmt='%10f', delimiter=',')
And I know what savetxt does: it iterates over the rows, and does a file write for each:
f.write(fmt % tuple(row))
Evidently all the bells-n-whistles of the regular repr are expensive. It can summarize, it can handle many dimensions, it can handle complicated dtypes, etc. Simply formatting each row with a known fixed format is not the time consuming step.
Actually that savetxt route might be more useful, as well as fast. You can control the display format, and you can view the resulting text file in an editor or terminal window at your leisure. You won't be limited by the scroll buffer of your terminal window. But how will this savetxt file be different from the original csv?
I'm surprised you get an array at all, as your example does not use ',' as the delimiter. But maybe you forgot to include commas in your example file.
I would use the DataFrame functionality of pandas if I worked with CSV data. It uses numpy under the hood, so all numpy operations work on pandas DataFrames.
Pandas has many tricks for operating on table-like data.
import pandas as pd
df = pd.read_csv('nothing.txt')
#==============================================================================
# next line remove blanks from the column names
#==============================================================================
df.columns = [name.strip(' ') for name in df.columns]
# pd.set_option('display.height', 1000)  # 'display.height' was removed in newer pandas versions
pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
print(df)
When I copied and pasted the data here it was open in Excel, but the file is a CSV.
I'm doing a class exercise and we have to use numpy. One thing I noticed was that the results were quite illegible thanks to the scientific notation, so I did the following and things are much smoother:
np.set_printoptions(threshold=100000, suppress=True)
The suppress option saved me a lot of formatting. The performance does suffer a lot when I change the threshold to something like np.nan or np.inf, and I'm not sure why.
I am regularly dealing with large amounts of data (order of several GB), which are stored in memory in NumPy arrays. Often, I will be dealing with nested lists/tuples of such NumPy arrays. How should I store these to disk? I want to preserve the list/tuple structure of my data, the data has to be compressed to conserve disk space, and saving/loading needs to be fast.
(The particular use case I'm facing right now is a 4000-element long list of 2-tuples x where x[0].shape = (201,) and x[1].shape = (201,1000).)
I have tried several options, but all have downsides:
pickle storage into a gzip archive. This works well, and results in acceptable disk space usage, but is extremely slow and consumes a lot of memory while saving.
numpy.savez_compressed. Is much faster than pickle, but unfortunately only allows either a sequence of numpy arrays (not nested tuples/lists as I have) or a dictionary-style way of specifying the arguments.
Storing into HDF5 through h5py. This seems too cumbersome for my relatively simple needs. More importantly, I looked a lot into this, and also there does not seem to be a straightforward way to store heterogeneous (nested) lists.
hickle seems to do exactly what I want, however unfortunately it's incompatible with Python 3 at the moment (which is what I'm using).
I was thinking of writing a wrapper around numpy.savez_compressed, which would determine the nested structure of the data, store this structure in some variable nest_structure, flatten the full graph, and store both nest_structure and all the flattened data using numpy.savez_compressed. Then, the corresponding wrapper around numpy.load would understand the nest_structure variable, and re-create the graph and return it. However, I was hoping there is something like this already out there.
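For what it's worth, here is a minimal sketch of that wrapper idea (my own illustration, not an existing library; save_nested and load_nested are made-up names):
import json
import numpy as np

def _flatten(obj, arrays):
    # replace each array with its index into `arrays`, recording the nesting
    if isinstance(obj, np.ndarray):
        arrays.append(obj)
        return len(arrays) - 1
    if isinstance(obj, (list, tuple)):
        return [type(obj).__name__, [_flatten(o, arrays) for o in obj]]
    raise TypeError("only arrays, lists and tuples are supported")

def save_nested(fname, data):
    arrays = []
    structure = _flatten(data, arrays)
    named = {"arr_%d" % i: a for i, a in enumerate(arrays)}
    np.savez_compressed(fname, nest_structure=json.dumps(structure), **named)

def _rebuild(node, npz):
    if isinstance(node, int):
        return npz["arr_%d" % node]
    kind, children = node
    seq = [_rebuild(c, npz) for c in children]
    return tuple(seq) if kind == "tuple" else seq

def load_nested(fname):
    with np.load(fname) as npz:
        return _rebuild(json.loads(str(npz["nest_structure"])), npz)
With this, save_nested('data.npz', x) followed by x = load_nested('data.npz') should round-trip a structure like the 4000-element list of 2-tuples described above.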
You may like the shelve module. It effectively wraps heterogeneous pickled objects in a convenient file. shelve is oriented more toward persistent storage than the classic save-to-file model.
The main benefit of using shelve is that you can conveniently save most kinds of structured data. The main disadvantage is that it is Python-specific: unlike HDF5 or saved Matlab files or even simple CSV files, it isn't so easy to use other tools with your data.
Example of saving (out of habit, I created objects and copied them into df, but you don't need to do this; you could just save directly to items in df):
import shelve
import numpy as np
a = np.arange(0, 1000, 12)
b = "This is a string"
class C(object):
    alpha = 1.0
    beta = [3, 4]
c = C()
df = shelve.open('test.shelve', 'c')
df['a'] = a
df['b'] = b
df['c'] = c
df.sync()
exit()
Following the above example, recovering data:
import shelve
import numpy as np
class C():
    alpha = 1.0
    beta = [3, 4]
df = shelve.open('test.shelve')
print(df['a'])
print(df['b'])
print(df['c'].alpha)
I have several huge arrays, and I am using np.save and np.load to save each array or dictionary in a single file and then reload them, so that I don't have to compute them another time, as follows.
save(join(dir, "ListTitles.npy"), self.ListTitles)
self.ListTitles = load(join(dir,"ListTitles.npy"))
The problem is that when I try to use them afterwards, I have errors like (field name not found) or (len() of unsized object).
For example:
len(self.ListTitles) raises an error, as does accessing a field of a dictionary.
I don't know how to resolve this, because when I simply use this code, it works perfectly:
M = array([[1,2,0], [3,4,0], [3,0,1]])
vector = zeros(3529)
save("M.npy", M)
save("vector.npy", vector)
vector = load("vector.npy")
B = load("M.npy")
print len(B)
print len(vector)
numpy's save and load functions are for numpy arrays, not for general Python data like dicts. Use the pickle module to save to file, and reload from file, most kinds of Python data structures (there are alternatives like dill which are however not in the standard library -- I'd recommend sticking with standard pickle unless it gives you specific problems).
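A minimal sketch of the pickle round trip (the file and variable names here are illustrative):
import pickle

# saving: works for dicts, lists, and most other Python objects
titles = {"1001": "foo", "1002": "bar"}
with open("ListTitles.pkl", "wb") as f:
    pickle.dump(titles, f)

# reloading
with open("ListTitles.pkl", "rb") as f:
    titles = pickle.load(f)
print(len(titles))  # works as expected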
I have a Python program that processes fairly large NumPy arrays (in the hundreds of megabytes), which are stored on disk in pickle files (one ~100MB array per file). When I want to run a query on the data I load the entire array, via pickle, and then perform the query (so that from the perspective of the Python program the entire array is in memory, even if the OS is swapping it out). I did this mainly because I believed that being able to use vectorized operations on NumPy arrays would be substantially faster than using for loops through each item.
I'm running this on a web server which has memory limits that I quickly run up against. I have many different kinds of queries that I run on the data so writing "chunking" code which loads portions of the data from separate pickle files, processes them, and then proceeds to the next chunk would likely add a lot of complexity. It'd definitely be preferable to make this "chunking" transparent to any function that processes these large arrays.
It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?
PyTables is a package for managing hierarchical datasets. It is designed to solve this problem for you.
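A minimal sketch (assuming the PyTables 3.x API; the file and node names are illustrative) of writing a large array to disk in pieces and then querying it chunk by chunk, without ever holding the whole thing in memory:
import numpy as np
import tables

# write: an extendable array on disk, appended to in pieces
with tables.open_file("data.h5", mode="w") as f:
    arr = f.create_earray(f.root, "data", tables.Float64Atom(), shape=(0, 10))
    for _ in range(100):
        arr.append(np.random.rand(1000, 10))  # each append goes to disk

# query: a slice reads only the rows it needs from disk
with tables.open_file("data.h5", mode="r") as f:
    chunk = f.root.data[5000:6000]  # a (1000, 10) NumPy array
    print(chunk.mean())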
NumPy's memory-mapped data structure (memmap) might be a good choice here.
You access your NumPy arrays from a binary file on disk, without loading the entire file into memory at once.
(Note, I believe, but I am not certain, that NumPy's memmap object is not the same as Python's mmap; in particular, NumPy's is array-like, Python's is file-like.)
The method signature is:
A = NP.memmap(filename, dtype, mode, shape, order='C')
All of the arguments are straightforward (i.e., they have the same meaning as used elsewhere in NumPy) except for 'order', which refers to order of the ndarray memory layout. I believe the default is 'C', and the (only) other option is 'F', for Fortran--as elsewhere, these two options represent row-major and column-major order, respectively.
The two methods are:
flush (which writes to disk any changes you make to the array); and
close (which closes the memory map to the data stored on disk).
Example use:
import numpy as NP
from tempfile import mkdtemp
import os.path as PH
my_data = NP.random.randint(10, 100, 10000).reshape(1000, 10)
my_data = NP.array(my_data, dtype="float")
fname = PH.join(mkdtemp(), 'tempfile.dat')
mm_obj = NP.memmap(fname, dtype="float32", mode="w+", shape=(1000, 10))
# now write the data to the memmap array:
mm_obj[:] = my_data[:]
# reload the memmap:
mm_obj = NP.memmap(fname, dtype="float32", mode="r", shape=(1000, 10))
# verify that it's there!:
print(mm_obj[:20,:])
It seems like the ideal solution would be something like a generator which periodically loaded a block of the data from the disk and then passed the array values out one by one. This would substantially reduce the amount of memory required by the program without requiring any extra work on the part of the individual query functions. Is it possible to do something like this?
Yes, but not by keeping the arrays on disk in a single pickle -- the pickle protocol just isn't designed for "incremental deserialization".
You can write multiple pickles to the same open file, one after the other (use dump, not dumps), and then the "lazy evaluator for iteration" just needs to use pickle.load each time.
Example code (Python 3.1; in 2.x you'll want cPickle instead of pickle, and a -1 for protocol, etc., of course;-):
>>> import pickle
>>> lol = [range(i) for i in range(5)]
>>> fp = open('/tmp/bah.dat', 'wb')
>>> for subl in lol: pickle.dump(subl, fp)
...
>>> fp.close()
>>> fp = open('/tmp/bah.dat', 'rb')
>>> def lazy(fp):
... while True:
... try: yield pickle.load(fp)
... except EOFError: break
...
>>> list(lazy(fp))
[range(0, 0), range(0, 1), range(0, 2), range(0, 3), range(0, 4)]
>>> fp.close()