Import MTX files into R - python

I have a large HDF file from which I extract one data frame and convert it into a sparse matrix in Python (sparse.csr_matrix). I then save this as an .mtx file and try to load it in R. I went through some documentation and links on loading MTX files into R using externalFormats {Matrix}. This is the call I am using:
TestDataMatrix = readMM(system.file("./Downloads/TestDataMatrix.mtx",
package = "Matrix"))
When I run the above code, I get the following error and I have no clue what it means.
TestDataMatrix = readMM(system.file("./Downloads/TestDataMatrix.mtx",
+ package = "Matrix"))
1:
Could someone let me know if there is an easy way to convert Python objects to R objects (like RDS)?
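For the Python side, here is a minimal sketch of writing a csr_matrix to Matrix Market format with scipy.io.mmwrite and reading it back with scipy.io.mmread to check that the file is valid. The small dense array and the temporary path are stand-ins, not the asker's data. On the R side, note that system.file() looks up files inside an installed package, so passing the plain file path straight to readMM() is probably what is needed; that is an assumption, since the error message above is truncated.

```python
import os
import tempfile

import numpy as np
from scipy import sparse
from scipy.io import mmread, mmwrite

# Hypothetical stand-in for the matrix extracted from the HDF file.
dense = np.array([[0.0, 1.5, 0.0],
                  [2.0, 0.0, 0.0],
                  [0.0, 0.0, 3.25]])
matrix = sparse.csr_matrix(dense)

# Write the sparse matrix in Matrix Market coordinate format.
mtx_path = os.path.join(tempfile.mkdtemp(), "TestDataMatrix.mtx")
mmwrite(mtx_path, matrix)

# Read the file back and convert to CSR to confirm the round trip.
round_trip = sparse.csr_matrix(mmread(mtx_path))
```

If the round trip succeeds in Python, the same file should be readable by any Matrix Market reader.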

I found a way to read HDF files in R and converted the data frame to a sparse matrix there. But there is a problem writing to HDF files. I posted a new question here.

Related

Picking out a specific column in a table

My goal is to import a table of astrophysical data that I have saved to my computer (obtained from matching 2 other tables in TOPCAT, if you know it), and extract certain relevant columns. I hope to then do further manipulations on these columns. I am a complete beginner in python, so I apologise for basic errors. I've done my best to try and solve my problem on my own but I'm a bit lost.
This script I have written so far:
import pandas as pd
input_file = "location\\filename"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The file that I'm trying to import is listed as having file type "File", in my drive. I've looked at this file in Notepad and it has a lot of descriptive bumf in the first few rows, so to try and get rid of this I've used "skiprows" as you can see. The data in the file is separated column-wise by lines--at least that's how it appears in Notepad.
The problem is that when I try to extract the first column using "usecols", it instead returns what appears to be the first row in the command window, along with a load of vertical bars between the values. I assume it is somehow not interpreting the table correctly, i.e. not understanding what is a column and what is a row.
What I've tried: Modifying the file and saving it in a different filetype. This gives the following error:
FileNotFoundError: [Errno 2] No such file or directory: 'location\\filename'
Despite the fact that the new file is saved in exactly the same location.
I've tried using "pd.read_table" instead of csv, but this doesn't seem to change anything (nor does it give me an error).
When I've tried to extract multiple columns (i.e. "usecols=[1,2]") I get the following error:
ValueError: Usecols do not match columns, columns expected but not found: [1, 2]
My hope is that someone with experience can give some insight into what's likely going on to cause these problems.
Maybe you can try dataset.iloc[:,0]. With iloc you can extract the column or row you want by index (among other things). [:,0] selects all the rows of the first column.
The file is incorrectly named.
I expect that you are reading a csv file or an xlsx or txt file. So the (windows) path would look similar to this:
import pandas as pd
input_file = "C:\\python\\tests\\test_csv.csv"
dataset = pd.read_csv(input_file,skiprows=12,usecols=[1])
The error message tells you this:
No such file or directory: 'location\\filename'
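The vertical bars mentioned in the question suggest that the file may be pipe-delimited rather than comma-delimited. A minimal sketch under that assumption, using made-up in-memory data in place of the real file (the column names and the number of header lines are hypothetical):

```python
import io

import pandas as pd

# Hypothetical file contents: two descriptive lines, then pipe-separated data.
raw = (
    "# catalogue export\n"
    "# matched in TOPCAT\n"
    "ra|dec|mag\n"
    "10.5|-3.2|18.1\n"
    "11.0|-3.9|17.4\n"
)

# skiprows drops the descriptive header; sep tells pandas the column delimiter.
dataset = pd.read_csv(io.StringIO(raw), skiprows=2, sep="|", usecols=[1])

# Alternatively, read every column and pick the first one by position.
first_column = pd.read_csv(io.StringIO(raw), skiprows=2, sep="|").iloc[:, 0]
```

Without the right sep, pandas sees each line as a single field, which would also explain why usecols=[1, 2] finds no matching columns.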

How can I save a list/matrix in binary format in KDB?

I am trying to save a matrix to file in binary format in KDB as per below:
matrix: (til 10)*/:til 10;
save matrix;
However, I get the error 'type.
I guess save only works with tables? In which case does anyone know of a workaround?
Finally, I would like to read the matrix from the binary file into Python with NumPy, which I presume is just:
import numpy as np
matrix = np.fromfile('C:/q/w32/matrix', dtype='f')
Is that right?
Note: I'm aware of KDB-Python libraries, but have been unable to install them thus far.
save does work, you just have to reference it by name.
save`matrix
You can also save using
`:matrix set matrix;
`:matrix 1: matrix;
But I don't think you'll be able to read this into Python directly using NumPy, as it is stored in kdb format. It could be read into Python using one of the Python-kdb interfaces (e.g. PyQ) or by storing it in a common format such as CSV.
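If the CSV route is taken, the Python side could look like the sketch below. It assumes the matrix was exported as plain comma-separated rows with no header; the CSV text here is generated in Python as a stand-in for that export.

```python
import io

import numpy as np

# Equivalent of q's (til 10)*/:til 10, a 10x10 multiplication table.
expected = np.outer(np.arange(10), np.arange(10))

# Hypothetical CSV export of the matrix, one row per line.
csv_text = "\n".join(",".join(str(v) for v in row) for row in expected)

# Parse the comma-separated text back into an integer array.
matrix = np.loadtxt(io.StringIO(csv_text), delimiter=",", dtype=np.int64)
```

With a real export, the io.StringIO object would be replaced by the path to the CSV file.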
Another option is to save in KDB+ IPC format and then read it into Python with qPython as a Pandas DataFrame.
On the KDB+ side you can save it with
matrix:(til 10)*/:til 10;
`:matrix.ipc 1: -8!matrix;
On the Python side you do
from pandas import DataFrame
from qpython.qreader import QReader
with open('matrix.ipc',"rb") as f:
    matrix = DataFrame(QReader(f).read().data)
print(matrix)

How to read this kind of .csv file (html style content) with Python 2.7?

I am practicing on a GitHub machine learning contest using Python. I started from someone else's submission, but I am stuck at the first step: using pandas to read a CSV file:
import pandas as pd
import numpy as np
filename = './facies_vectors.csv'
training_data = pd.read_csv(filename)
print(set(training_data["Well Name"]))
training_data.head()
This gave me the following error message:
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 104, saw 3
I cannot understand why the .csv file describes itself with an HTML DOCTYPE. Please help.
Representative segments of the CSV content are attached. Thanks.
It turns out I downloaded the CSV file the usual web way: right-click and save as, which saved the HTML page rather than the data. The right way is to open the item on GitHub and then open it from GitHub Desktop. I have the tables now. But working with HTML files from Python is definitely something I would like to learn more about. Thanks.
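A quick way to catch this situation before calling pd.read_csv is to look at the start of the file. The helper below is a hypothetical sketch, not part of pandas; it only checks whether the text begins like an HTML page.

```python
def looks_like_html(text):
    """Return True if the text starts like an HTML page rather than CSV data."""
    head = text.lstrip()[:200].lower()
    return head.startswith("<!doctype html") or head.startswith("<html")


# Made-up samples: a saved web page versus real CSV content.
saved_page = "\n<!DOCTYPE html>\n<html lang=\"en\">\n<head>..."
real_csv = "Facies,Formation,Well Name\n3,A1 SH,SHRIMPLIN\n"

page_is_html = looks_like_html(saved_page)
csv_is_html = looks_like_html(real_csv)
```

If the check returns True for the downloaded file, the download step needs fixing rather than the parsing code.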

Creating a .mat file of v7.3 in python

I need to perform a multiplication involving a 60000x70000 matrix in either Python or MATLAB. I have 16 GB of RAM and am able to load each row of the matrix easily (which is what I require). I am able to create the matrix as a whole in Python but not in MATLAB.
Is there any way I can save the array as a v7.3 .mat file using h5py or scipy so that I can load each row separately?
For MATLAB v7.3 you can use hdf5storage which requires h5py, download the file here, extract, then type: python setup.py install from a command prompt.
https://pypi.python.org/pypi/hdf5storage
import h5py
import hdf5storage
import numpy as np
matfiledata = {} # make a dictionary to store the MAT data in
matfiledata[u'variable1'] = np.zeros(100) # *** u prefix for variable name = unicode format, no issues thru Python 3.5; advise keeping u prefix indicator format based on feedback despite docs ***
matfiledata[u'variable2'] = np.ones(300)
hdf5storage.write(matfiledata, '.', 'example.mat', matlab_compatible=True)
If MATLAB can't load the whole thing at once, I think you'll have to save it in different variables matfiledata[u'chunk1'] matfiledata[u'chunk2'] matfiledata[u'chunk3'] etc.
Then in MATLAB, if you saved each chunk as a variable:
load(filename,'chunk1')
do stuff...
clear chunk1
load(filename,'chunk2')
do stuff...
clear chunk2
etc.
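The chunked dictionary described above could be built along these lines. This is a sketch using only NumPy; split_into_chunks is a hypothetical helper, and the small array stands in for the real 60000x70000 matrix. The resulting dictionary has the same shape as the matfiledata dictionary passed to hdf5storage.write in the answer.

```python
import numpy as np


def split_into_chunks(array, n_chunks):
    """Split an array row-wise into a dict keyed chunk1, chunk2, ..."""
    pieces = np.array_split(array, n_chunks, axis=0)
    return {u"chunk%d" % (i + 1): piece for i, piece in enumerate(pieces)}


# Small stand-in for the large matrix.
big = np.arange(12, dtype=np.float64).reshape(6, 2)
matfiledata = split_into_chunks(big, 3)
```

Each key then becomes a separate variable that MATLAB can load on its own with load(filename,'chunk1') and so on.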
hdf5storage.savemat has a parameter that allows the file to be read back into Python correctly later, so it is worth checking out, and it follows the scipy.io.loadmat format. You can also do something like this when saving data from MATLAB to make it easy to import back into Python:
MATLAB
save('example.mat','-v7.3')
Python
matdata = hdf5storage.loadmat('example.mat')
That will load back into Python as a dictionary which you can then convert into whatever datatypes you need.

Writing Fortran unformatted files with Python

I have some single-precision little-endian unformatted data files written by Fortran77. I am reading these files using Python using the following commands:
import numpy as np
original_data = np.dtype('float32')
f = open(file_name,'rb')
original_data = np.fromfile(f,dtype='float32',count=-1)
f.close()
After some data manipulation in Python, I (am trying to) write them back in the original format using Python using the following commands:
out_file = open(output_file,"wb")
s = struct.pack('f'*len(manipulated_data), *manipulated_data)
out_file.write(s)
out_file.close()
But it doesn't seem to be working. Any ideas what is the right way of writing the data using Python back in the original fortran unformatted format?
Details of the problem:
I am able to read the final file with manipulated data from Fortran. However, I want to visualize these data using a software (Paraview). For this I convert the unformatted data files in the *h5 format. I am able to convert both the original and manipulated data in h5 format using h5 utilities. But while Paraview is able to read the *h5 files created from original data, Paraview is not able to read the *h5 files created from the manipulated data. I am guessing something is being lost in translation.
This is how I am opening the file written by Python in Fortran (single precision data):
open (in_file_id,FILE=in_file,form='unformatted',access='direct',recl=4*n*n*n)
And this is how I am writing the original unformatted data with Fortran:
open(out_file_id,FILE=out_file,form="unformatted")
Is this information sufficient?
Have you tried using the .tofile method of the manipulated data array? It will write the array in C order but is capable of writing plain binary.
The documentation for .tofile also suggests this is the same as:
with open(outfile, 'wb') as fout:
    fout.write(manipulated_data.tobytes())
This creates an unformatted sequential access file:
open(out_file_id,FILE=out_file,form="unformatted")
Assuming you are writing a single array real a(n,n,n) using simply write(out_file_id)a, you should see a file size of 4*n^3+8 bytes. The extra 8 bytes are a 4-byte integer (=4*n^3) repeated at the start and end of the record.
The second form:
open (in_file_id,FILE=in_file,form='unformatted',access='direct',recl=4*n*n*n)
opens direct access, which does not have those headers. For writing you would now use write(unit,rec=1)a. If you read your sequential access file using direct access, it will read without error, but the integer header will be read as a float (garbage) into the (1,1,1) array value, and everything else is shifted. You say you can read the file with Fortran, but have you checked that you are really reading what you expect?
The best fix is to change your original Fortran code to use unformatted, direct access for both reading and writing. This gives you an 'ordinary' raw binary file with no headers.
Alternatively, in your Python code you need to first read that 4-byte integer, then your data. On output you can put the integer headers back or not, depending on what your Paraview filter expects.
Here is Python code to read, modify, and write an unformatted sequential Fortran file containing a single record:
import struct
import numpy as np
# Read one record: 4-byte length marker, payload, 4-byte length marker.
with open('infile', 'rb') as f:
    recl = struct.unpack('i', f.read(4))[0]  # record length in bytes
    numval = recl // np.dtype('float32').itemsize  # number of float32 values
    data = np.fromfile(f, dtype='float32', count=numval)
    endrec = struct.unpack('i', f.read(4))[0]
    if endrec != recl:
        print("error: unexpected end-of-record marker")
data = data ** 2  # example data modification
# Write the record back with the same length markers.
with open('outfile', 'wb') as f:
    f.write(struct.pack('i', recl))
    data.tofile(f)
    f.write(struct.pack('i', recl))
Just loop for multiple records. Note that the data here is read as a vector and assumed to be all floats. Of course, you need to know the actual data type to make use of it.
Also be aware you may need to deal with byte order issues depending on platform.
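The record layout described above (length marker, payload, length marker) and the byte-order point can be checked in memory with a short sketch. It assumes 4-byte little-endian markers, as in the answer; marker size can vary between compilers. pack_record and unpack_record are hypothetical helper names.

```python
import numpy as np


def pack_record(values):
    """Wrap float32 values in 4-byte little-endian record-length markers."""
    payload = np.asarray(values, dtype="<f4").tobytes()
    marker = np.array([len(payload)], dtype="<i4").tobytes()
    return marker + payload + marker


def unpack_record(buf):
    """Return the float32 payload of a single-record buffer, checking both markers."""
    head = int(np.frombuffer(buf, dtype="<i4", count=1)[0])
    tail = int(np.frombuffer(buf, dtype="<i4", count=1, offset=4 + head)[0])
    if head != tail:
        raise ValueError("record markers disagree")
    return np.frombuffer(buf, dtype="<f4", count=head // 4, offset=4)


n = 3
a = np.arange(n ** 3, dtype="<f4")  # flat stand-in for real a(n,n,n)
record = pack_record(a)
recovered = unpack_record(record)
```

Using explicit "<f4" and "<i4" dtypes pins the byte order to little-endian regardless of the platform running the script.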
