I am working in Python and I have a matrix stored in a text file. The text file is arranged in the following format, one pair per line:
row_id, col_id
row_id, col_id
...
row_id, col_id
row_id and col_id are integers and they take values from 0 to n (in order to know n for row_id and col_id I have to scan the entire file first).
There is no header; row_ids and col_ids appear multiple times in the file, but each combination row_id,col_id appears only once. There is no explicit value for each combination row_id,col_id: every cell value is 1. The file is almost 1 gigabyte in size.
Unfortunately the file is difficult to handle in memory: it contains 2257205 row_ids and 122905 col_ids, for 26622704 elements. So I was looking for better ways to handle it. Matrix Market format could be a way to deal with it.
Is there a fast and memory efficient way to convert this file into a file in a market matrix format (http://math.nist.gov/MatrixMarket/formats.html#mtx) using Python?
There is a fast and memory-efficient way of handling such matrices: the sparse matrices offered by SciPy (which is the de facto standard in Python for this kind of thing).
For a matrix of size N by N:
from scipy.sparse import lil_matrix

result = lil_matrix((N, N))  # In order to save memory, one may add: dtype=bool, or dtype=numpy.int8
with open('matrix.csv') as input_file:
    for line in input_file:
        x, y = map(int, line.split(',', 1))  # The "1" is only here to speed the splitting up
        result[x, y] = 1
(or, in one line instead of two: result[tuple(map(int, line.split(',', 1)))] = 1; the tuple() ensures the pair is treated as a single (row, column) index).
The argument 1 given to split() is just there to speed things up when parsing the coordinates: it tells Python to stop splitting the line after the first (and only) comma. This can matter, since you are reading a 1 GB file.
Depending on your needs, you might find one of the other six sparse matrix representations offered by SciPy to be better suited.
If you want a faster but also more memory-consuming array, you can use result = numpy.array(…) (with NumPy) instead.
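If the end goal really is a Matrix Market file, one way is to let SciPy do the writing once the coordinates are loaded. The following is only a rough sketch under a few assumptions (the comma-separated input is called matrix.csv, every value is 1, and the largest zero-based indices define the matrix shape); scipy.io.mmwrite takes care of the one-based MatrixMarket convention:

import numpy as np
from scipy.sparse import coo_matrix
from scipy.io import mmwrite

# Load all "row, col" pairs at once (simple, though not the most
# memory-frugal option for a ~1 GB file).
coords = np.loadtxt('matrix.csv', delimiter=',', dtype=np.int64)
rows, cols = coords[:, 0], coords[:, 1]

# Every cell value is 1; the shape is inferred from the largest zero-based indices.
data = np.ones(len(rows), dtype=np.int8)
matrix = coo_matrix((data, (rows, cols)), shape=(rows.max() + 1, cols.max() + 1))

# mmwrite converts the zero-based indices to the one-based MatrixMarket format.
mmwrite('matrix.mtx', matrix)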
Unless I am missing something...
The MatrixMarket MM format is a header line with the dimensions followed by one "row col value" line per entry. If you already have rows and cols and all values are 1, simply append the value and that should be it.
Wouldn't it be easier to simply use sed, as in
n=$(wc -l < file)
echo "2257205 122905 $n" > file.mm
sed -e 's/, */ /; s/$/ 1/' file >> file.mm
(the first substitution turns the comma separator into a space, the second appends the value 1).
That should work if your coordinates are one-offset (MatrixMarket indices are one-based). If they are zero-offset you should add 1 to each coordinate: simply read the coordinates, add one to each of them and print coordx, coordy, "1". You can do that from the shell, from Awk or from Python with very little effort.
Q&D code (untested, produced just as a hint, YMMV and you may want to preprocess the file to compute some values):
In the shell
echo "2257205 122905 $n"
cat file | while IFS=, read x y ; do x=$((x+1)); y=$((y+1)); echo "$x $y 1" ; done
In python, more or less...
f = open("file")
lines = f.readlines()   # note: this loads the whole ~1 GB file into memory
print 2257205, 122905, len(lines)
for l in lines:
    (x, y) = l.split(',')
    x = int(x) + 1
    y = int(y) + 1
    print x, y, 1
Or am I missing something?
Related
I want to save a sparse matrix of numbers (integers, but it could be floats) to a file for data exchange. By sparse matrix I mean a matrix where a high percentage of the values (typically 90%) are equal to 0. Sparse in this case refers not to the file format but to the actual content of the matrix.
The matrix is formatted in the following way:
        col1    col2    ....
row1    int1_1  int1_2  ....
row2    int2_1  ....    ....
....    ....    ....    ....
By using a text file (tab-delimited) the size of the file is 4.2G. Which file format, preferably ubiquitous such as a .txt file, can I use to easily load and save this sparse data matrix? We usually work with Python/R/Matlab, so formats that are supported by these are preferred.
I found the Feather format (which, as far as I know, currently does not support Matlab).
Some comparison of read/write speed and memory performance in Pandas is provided in this section.
It also provides support for the Julia language.
Edit:
I found that in my case this format uses more disk space than the .txt one, probably to increase I/O performance. Compressing with zip alleviates the problem, but compression during writing does not seem to be supported yet.
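For completeness, a minimal sketch of what Feather usage looks like from pandas (this is my illustration, not part of the original comparison; it assumes pyarrow is installed and that the matrix already lives in a DataFrame called df):

import pandas as pd

# df stands in for the real data matrix.
df = pd.DataFrame({'a': [0, 0, 3], 'b': [0, 5, 0]})

df.to_feather('matrix.feather')          # write
df2 = pd.read_feather('matrix.feather')  # read back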
You have several solutions, but generally what you need to do is output the indices of the non-zero elements as well as the values. Let's assume that you want to export to a single text file.
Generate array
Let's first generate a 10000 x 5000 sparse array with ~10% filled (it will be a bit less due to replicated indices):
N = 10000;
M = 5000;
rho = .1;
rN = ceil(sqrt(rho)*N);
rM = ceil(sqrt(rho)*M);
S = sparse(N, M);
S(randi(N, [rN 1]), randi(M, [rM 1])) = randi(255, rN, rM);
If your array is not stored as a sparse array, you can create one simply using (where F is the full matrix):
S = sparse(F);
Save as text file
Now we will save the matrix in the following format
row_indx col_indx value
row_indx col_indx value
row_indx col_indx value
This is done by extracting the row and column indices as well as data values and then saving it to a text file in a loop:
[n, m, s] = find(S);
fid = fopen('Sparse.txt', 'wt');
arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%d\n', n, m, s), n, m, s);
fclose(fid);
If the underlying data is not an integer, then you can use the %f flag on the last output, e.g. (saved with 15 decimal places)
arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%.15f\n', n, m, s), n, m, s);
Compare this to the full array:
fid = fopen('Full.txt', 'wt');
arrayfun(@(n) fprintf(fid, '%s\n', num2str(S(n, :))), (1:N).');
fclose(fid);
In this case, the sparse file is ~50MB and the full file ~170MB representing a factor of 3 efficiency. This is expected since I need to save 3 numbers for every nonzero element of the array, and ~10% of the array is filled, requiring ~30% as many numbers to be saved compared to the full array.
For floating-point data the savings are larger, since the indices are much smaller than the floating-point values themselves.
In Matlab, a quick way to extract the data would be to save the string given by:
mat2str(S)
This is essentially the same, but wraps it in the sparse command for easy loading in Matlab; one would need to parse this in other languages to be able to read it in. The command tells you how to recreate the array, implying you may need to store the size of the matrix in the file as well (I recommend putting it on the first line, since you can then read it in and create the sparse matrix before parsing the rest of the file).
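On the Python side, a file in this row/column/value layout can be loaded into a SciPy sparse matrix. This is a sketch of mine, not part of the answer; it assumes the tab-separated Sparse.txt written above, with Matlab's one-based indices:

import numpy as np
from scipy.sparse import coo_matrix

# Each line is "row<TAB>col<TAB>value", with one-based (Matlab-style) indices.
rows, cols, vals = np.loadtxt('Sparse.txt', delimiter='\t', unpack=True)

# Shift to zero-based indices for SciPy.
S = coo_matrix((vals, (rows.astype(int) - 1, cols.astype(int) - 1)))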
Save as binary file
A much more efficient method is to save as a binary file. Assuming the data and indices can be stored as unsigned 16 bit integers you can do the following:
[n, m, s] = find(S);
fid = fopen('Sparse.dat', 'w');
fwrite(fid, size(S), 'uint16');
fwrite(fid, [n m s], 'uint16');
fclose(fid);
Then to read the data:
fid = fopen('Sparse.dat', 'r');
sz = fread(fid, 2, 'uint16');
s = reshape(fread(fid, 'uint16'), [], 3);
s = sparse(s(:, 1), s(:, 2), s(:, 3), sz(1), sz(2));
fclose(fid);
Now we can check they are equal:
isequal(S, s)
Saving the full array:
fid = fopen('Full.dat', 'w');
fwrite(fid, full(S), 'uint16');
fclose(fid);
Comparing the sparse and full file sizes I get 21MB and 95MB.
A couple of notes:
Using a single write/read command is much (much much) quicker than looping, so the last method is by far the fastest, and also most space efficient.
The maximum index/data value size that can be saved as a binary integer is 2^n - 1, where n is the bitdepth. In my example of 16 bits (uint16), that corresponds to a range of 0..65,535. By the sounds of it, you may need to use 32 bits or even 64 bits just to store the indices.
Higher efficiency can be obtained by saving the indices as one data type (e.g. uint32) and the actual values as another (e.g. uint8). However, this adds additional complexity in the saving and reading.
You will still want to store the matrix size first, as I showed in the binary example.
You can store the values as doubles if required, but indices should always be integers. Again, extra complexity, but doable.
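If the binary file also needs to be read back from Python, something along these lines should work. It is my own sketch, assuming exactly the uint16 layout written above (two size values, then the [n m s] triples written column by column, as Matlab's fwrite does):

import numpy as np
from scipy.sparse import coo_matrix

with open('Sparse.dat', 'rb') as fid:
    raw = np.fromfile(fid, dtype=np.uint16)

n_rows, n_cols = raw[:2]            # size(S), written first
n, m, s = raw[2:].reshape(3, -1)    # fwrite stored all n, then all m, then all s

# Matlab indices are one-based; coo_matrix expects zero-based.
S = coo_matrix((s, (n - 1, m - 1)), shape=(int(n_rows), int(n_cols)))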
I have a Python script that is supposed to read large files into a dictionary in memory and do some operations. What puzzles me is that it runs out of memory in only one case: when the values in the file are integers...
The structure of my file is like this:
string value_1 .... value_n
The files I have vary in size from 2G to 40G, and I have 50G of memory to read the file into. When the lines look like this:
string 0.001334 0.001473 -0.001277 -0.001093 0.000456 0.001007 0.000314 ...
with n=100 and 10M rows, I am able to read the file into memory relatively fast; the file size is about 10G. However, when the lines look like
string 4 -2 3 1 1 1 ...
with the same dimension (n=100) and the same number of rows, I am not able to read it into memory.
for line in f:
    tokens = line.strip().split()
    if len(tokens) <= 5:  # ignore w2v first line
        continue
    word = tokens[0]
    number_of_columns = len(tokens) - 1
    features = {}
    for dim, val in enumerate(tokens[1:]):
        val = float(val)
        features[dim] = val
    matrix[word] = features
This gets Killed in the second case, while it works in the first case.
I know this does not answer the question specifically, but it probably offers a better solution to the problem you are trying to solve:
May I suggest you use Pandas for this kind of work?
It seems a lot more appropriate for what you're trying to do. http://pandas.pydata.org/index.html
import pandas as pd
pd.read_csv('file.txt', sep=' ', skiprows=1)
then do all your manipulations
Pandas is a package designed specifically to handle large datasets and process them. It has tons of useful features you will probably end up needing if you're dealing with big data.
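As a rough sketch of how that could look for the word-vector file in the question (my assumptions: the first token on each line is the word, the remaining tokens are the n numeric values, and the first line of the file should be skipped):

import pandas as pd

# header=None because the file has no column names; index_col=0 keeps the word
# as the row label, and the numeric columns end up in one compact float block.
df = pd.read_csv('file.txt', sep=' ', skiprows=1, header=None, index_col=0)

vectors = df.to_numpy()        # dense NumPy array of shape (n_rows, n)
row = df.loc['some_word']      # look a word up by label ('some_word' is hypothetical)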
I would like to create a Python script which would open the csv (or xls) file and, via an input box, let me copy and paste an Excel formula into a specific row, then apply it to the rest of the empty rows in that column. To help visualize it, here is an example:
DATA, FORMULA
001, [here would be inserted the formula]
002, [here would be populated the amended formula]
003, [here would be populated the amended formula]
004, [here would be populated the amended formula]
So, the idea is to have a script which would show me an input box asking:
- from which row do you want to start? | answer = B2
- what formula do you want to populate there? | "=COUNTIF(A:A,A2)"
...and then it would put the formula in cell B2 and auto-populate the next cells B3, B4, B5 and B6, with the formula adjusted for each specific cell. The reason I want to do this is that I deal with large Excel files which very often crash on my computer, so I would like to do this without running Excel itself.
I did some research and xlwt probably is not capable of doing this. Could you please help me find a way to do this? I would appreciate any ideas and guidance.
Unfortunately what you want to do can't be done without implementing a part of the spreadsheet program (Excel) in your code. There are no shortcuts there.
As for the file format, Python can deal natively with the CSV files, but I think you'd have trouble importing raw formulas (as opposed to numeric or textual content) from CSV into Excel itself.
Since you are already into Python, maybe it would be a better idea to move your logic from the spreadsheet into the program: use Excel or another spreadsheet program just to input your data, the plain numbers, and use your script not to modify the sheet but to perform the calculations you need, perhaps storing the data in a SQL database (Python's built-in sqlite3 will perform nicely for a single-user app like this one), and output just the calculated numbers to a spreadsheet file, or maybe generate the charts you intend directly from Python using matplotlib.
That said, what you are asking can be done from Python, but it might lead to more and more complications in your general workflow as your dataset grows.
Here, these helper functions will allow you to convert from the Excel cell naming convention to numeric indices and vice versa, so that you have the numeric indices to operate on in the Python program. Parsing a typed-in formula to extract the cell addresses is no easy deal, however; rendering them back into the formula after the numeric indices have been adjusted should be a lot easier. I'd suggest you hard-code your formula in the script instead of allowing the input of any possible formula.
def parse_num(address):
    """Extract the row number from an address like "B2" and return it 0-based."""
    x = ""
    for ch in address:
        if ch.isdigit():
            x += ch
    return int(x) - 1

def parse_col(address):
    """Convert the column letters of an address like "B2" or "AA3" to a 0-based index."""
    x = 0
    for ch in address:
        if ch.isdigit():
            break
        x = x * 26 + (ord(ch.upper()) - ord("A") + 1)
    return x - 1

def render_address(col, row):
    """Convert 0-based (col, row) indices back to an Excel-style address like "B2"."""
    col_letters = ""
    col += 1
    while col:
        col, rem = divmod(col - 1, 26)
        col_letters = chr(rem + ord("A")) + col_letters
    return col_letters + str(row + 1)
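For instance, a quick check of the helpers above:

print(parse_col("B2"), parse_num("B2"))   # 1 0
print(render_address(1, 1))               # B2
print(render_address(27, 0))              # AB1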
Now, if you are willing to do your work in Python, just have your data input as CSV and use a small Python program to get your results, instead of trying to fit them in a spreadsheet. For the formula above, COUNTIF(A:A,A2), you basically want to count how many other rows have the same value in the first column as this row; for 750000 data positions, that is a piece of cake in Python. It starts to get tougher if all the data won't fit in RAM, but that would happen at around 100 million data points on a 2GB machine; at that point you can still fit everything in RAM by resorting to specialized structures, and above that you would need some more logic, which would be a few lines long using SQLite as I mentioned above.
Now, the code that, given a CSV file with one column of data, produces a second CSV file where a second column contains the total number of occurrences of the value in the first column:
import csv
from collections import Counter

data_count = Counter()
with open("data.csv", "rt") as input_file:
    reader = csv.reader(input_file)
    # skip header:
    next(reader)
    for row in reader:
        data_count[int(row[0])] += 1

# everything is accounted for now - output the result:
with open("data.csv", "rt") as input_file, open("counted_data.csv", "wt") as output_file:
    reader = csv.reader(input_file)
    writer = csv.writer(output_file)
    header = next(reader)
    header.append("Count")
    writer.writerow(header)
    for row in reader:
        writer.writerow(row + [str(data_count[int(row[0])])])
And that is only if you really need all of the first column, in order, in the final file. If all you want are the counts for each number in column 1, regardless of the order in which they appear, you just need the data in data_count after the first block; you can play with that interactively at the Python prompt and get, in fractions of a second, results that would take tens of minutes in a spreadsheet program.
If you have datasets that don't fit in memory, you can just drop them in a database with a script even simpler than this one, and you will still have your results in a fraction of a second.
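A minimal sketch of that database route with Python's built-in sqlite3 (my own illustration; the table and file names are made up):

import csv
import sqlite3

conn = sqlite3.connect("data.db")
conn.execute("CREATE TABLE IF NOT EXISTS data (value INTEGER)")

with open("data.csv", "rt") as input_file:
    reader = csv.reader(input_file)
    next(reader)  # skip header
    conn.executemany("INSERT INTO data (value) VALUES (?)",
                     ((int(row[0]),) for row in reader))
conn.commit()

# The COUNTIF(A:A, A2) equivalent: how many times each value occurs.
for value, count in conn.execute("SELECT value, COUNT(*) FROM data GROUP BY value"):
    print(value, count)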
I'm writing a Python script that takes in a (potentially large) file. Here is an example of a way that input file could be formatted:
class1 1:v1 2:v2 3:v3 4:v4 5:v5
class2 1:v6 4:v7 5:v8 6:v9
class1 3:v10 4:v11 5:v12 6:v13 8:v14
class2 1:v15 2:v16 3:v17 5:v18 7:v19
Where class1 and class2 are some number, e.g. 1 and -1. (A curious user may notice that this is a LIBSVM-related file, but knowing the software isn't necessary in this case.) The values v1, v2, ..., v19 represent any integer or float value. Obviously, my files would be much larger than this, in terms of total lines and length per line, which is why I'm concerned about efficiency here.
I am trying to check what is the greatest value to the left of a colon. In LIBSVM, these are called "features" and are always integers here. For instance, in the example I outlined above, line 1 has 5 as its largest feature. Line 2 has 6 as its largest feature, line 3 has 8 as its largest feature, and finally, line 4 has 7 as its largest feature. Since 8 is the largest of these values, that is my desired value. I'm looking at a file with possibly thousands of features per line, and many hundreds of thousands of lines.
The file satisfies the following properties:
The features must be strictly increasing. I.e. "3:v1 4:v2" is allowed, but not "3:v1 3:v2."
The features are not necessarily consecutive and can be skipped. In the first example I gave, the first line has its features in consecutive order (1,2,3,4,5) and skips features 6, 7, and 8. The other 3 lines do not have their features in consecutive order. That's okay, as long as those features are strictly increasing.
Right now, my approach is to check each line, split up each line by a space, split up the final term by a colon, and then check the feature value. Following that, I do a procedure to check the maximum such featureNum.
file1 = open(...)
max_feature = 0
for line in file1:
    linesplit = line.rstrip('\n').split(' ')
    val = linesplit[-1]
    valsplit = val.split(':')
    featureNum = int(valsplit[0])
    if featureNum > max_feature:
        max_feature = featureNum
print(max_feature)
file1.close()
But I'm hoping there is a better or more efficient way of doing this, e.g. some way of analyzing the file by only getting those terms that directly precede a newline character (maybe to avoid reading all the lines?). I'm new to Python so it wouldn't surprise me if I missed something obvious.
Possible reference: http://docs.python.org/library/stdtypes.html
Since you don't care about all the features in a line but just the last one, you don't need to split the whole line. I don't know if this is actually faster, though; you need to time it and see. It definitely isn't as Pythonic as splitting the entire line.
def last_feature(line):
    start = line.rfind(' ') + 1
    end = line.rfind(':')
    return int(line[start:end])

with open(...) as file1:
    largest = max(last_feature(line) for line in file1)
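If the file may contain blank or malformed trailing lines, a slightly defensive variant (my tweak, not part of the answer) skips them before taking the max:

with open('data.libsvm') as file1:   # hypothetical file name
    largest = max(last_feature(line) for line in file1 if ':' in line)
print(largest)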
I have practically no knowledge of Matlab, and need to translate some parsing routines into Python. They are for large files, that are themselves divided into 'blocks', and I'm having difficulty right from the off with the checksum at the top of the file.
What exactly is going on here in Matlab?
status = fseek(fid, 0, 'cof');
fposition = ftell(fid);
disp(' ');
disp(['** Block ',num2str(iBlock),' File Position = ',int2str(fposition)]);

% ----------------- Block Start ------------------ %
[A, count] = fread(fid, 3, 'uint32');
if(count == 3)
    magic_l = A(1);
    magic_h = A(2);
    block_length = A(3);
else
    if(fposition == file_length)
        disp(['** End of file OK']);
    else
        disp(['** Cannot read block start magic ! Note File Length = ',num2str(file_length)]);
    end
    ok = 0;
    break;
end
fid is the file currently being looked at
iBlock is a counter for which 'block' you're in within the file
magic_l and magic_h are to do with checksums later, here is the code for that (follows straight from the code above):
disp(sprintf(' Magic_L = %08X, Magic_H = %08X, Length = %i', magic_l, magic_h, block_length));

correct_magic_l = hex2dec('4D445254');
correct_magic_h = hex2dec('43494741');

if(magic_l ~= correct_magic_l | magic_h ~= correct_magic_h)
    disp(['** Bad block start magic !']);
    ok = 0;
    return;
end

remaining_length = block_length - 3*4 - 3*4; % We read Block Header, and we expect a footer
disp(sprintf(' Remaining Block bytes = %i', remaining_length));
What is going on with the %08X and the hex2dec stuff?
Also, why specify 3*4 instead of 12?
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32'); in Python, as io.readline() is just pulling the first 3 characters of the file. Apologies if I'm missing the point somewhere here. It's just that using io.readline(3) on the file seems to return something it shouldn't, and I don't understand how the block_length can fit in a single byte when it could potentially be very long.
Thanks for reading this ramble. I hope you can understand kind of what I want to know! (Any insight at all is appreciated.)
Python Code for Reading a 1-Dimensional Array
When replacing Matlab with Python, I wanted to read binary data into a numpy.array, so I used numpy.fromfile to read the data into a 1-dimensional array:
import numpy as np

with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16)
Some advantages of using numpy.fromfile versus other Python solutions include:
Not having to manually determine the number of items to be read. You can specify it using the count= argument, but it defaults to -1, which means reading the entire file.
Being able to specify either an open file object (as I did above with fid) or you can specify a filename. I prefer using an open file object, but if you wanted to use a filename, you could replace the two lines above with:
data_array = numpy.fromfile(inputfilename, numpy.int16)
Matlab Code for a 2-Dimensional Array
Matlab's fread has the ability to read the data into a matrix of form [m, n] instead of just reading it into a column vector. For instance, to read data into a matrix with 2 rows use:
fid = fopen(inputfilename, 'r');
data_array = fread(fid, [2, inf], 'int16');
fclose(fid);
Equivalent Python Code for a 2-Dimensional Array
You can handle this scenario in Python using NumPy's reshape and transpose.
import numpy as np

with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16).reshape((-1, 2)).T
The -1 tells numpy.reshape to infer the length of the array for that dimension based on the other dimension—the equivalent of Matlab's inf infinity representation.
The .T transposes the array so that it is a 2-dimensional array with the first dimension—the axis—having a length of 2.
From the documentation of fread, it is a function to read binary data. The second argument specifies the size of the output vector, the third one the size/type of the items read.
In order to recreate this in Python, you can use the array module:
import array

f = open(..., "rb")     # the file must be opened in binary mode
a = array.array("I")    # "I" (C unsigned int) is 4 bytes on common platforms; check array.array("I").itemsize == 4
a.fromfile(f, 3)
This will read three uint32 values from the file f, which are available in a afterwards. From the documentation of fromfile:
Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won’t do.
Arrays implement the sequence protocol and therefore support the same operations as lists, but you can also use the .tolist() method to create a normal list from the array.
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32');
In Matlab, one of fread()'s signatures is fread(fileID, sizeA, precision). This reads in the first sizeA elements (not bytes) of a file, each of a size sufficient for precision. In this case, since you're reading in uint32, each element is of size 32 bits, or 4 bytes.
So, instead, read the first 12 bytes with f.read(12) to get the first three 4-byte elements from the file (readline() is line-oriented and may stop early at a newline byte inside the binary data).
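A sketch of reading and unpacking those 12 bytes with the standard struct module (this assumes the values were written little-endian, which is what fread/fwrite produce on x86 machines; the file name is hypothetical):

import struct

with open('datafile.bin', 'rb') as f:
    magic_l, magic_h, block_length = struct.unpack('<3I', f.read(12))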
The first part is covered by Torsten's answer... you're going to need array or numarray to do anything with this data anyway.
As for the %08X and the hex2dec stuff: %08X is just the print format for those uint32 numbers (8-digit zero-padded hex, exactly the same as in Python), and hex2dec('4D445254') is Matlab for 0x4D445254.
Finally, ~= in Matlab means "not equal"; the Python equivalent is !=.
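Putting those remarks together, a Python version of the block-header check might look like this (my illustration, not a translation supplied by the answer; the file name is hypothetical):

import struct

CORRECT_MAGIC_L = 0x4D445254   # hex2dec('4D445254')
CORRECT_MAGIC_H = 0x43494741   # hex2dec('43494741')

with open('datafile.bin', 'rb') as f:
    magic_l, magic_h, block_length = struct.unpack('<3I', f.read(12))

print('Magic_L = %08X, Magic_H = %08X, Length = %i' % (magic_l, magic_h, block_length))

if magic_l != CORRECT_MAGIC_L or magic_h != CORRECT_MAGIC_H:
    print('** Bad block start magic !')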