File format optimized for sparse matrix exchange - python

I want to save a sparse matrix of numbers (integers, but it could be floats) to a file for data exchange. By sparse matrix I mean a matrix where a high percentage of the values (typically 90%) are equal to 0. Sparse here refers to the actual content of the matrix, not to the file format.
The matrix is formatted in the following way:
col1 col2 ....
row1 int1_1 int1_2 ....
row2 int2_1 .... ....
.... .... .... ....
Using a tab-delimited text file, the file size is 4.2 GB. Which file format, preferably one as ubiquitous as a .txt file, can I use to easily save and load this sparse data matrix? We usually work with Python/R/Matlab, so formats supported by those are preferred.

I found the Feather format (which currently does not support Matlab, as far as I know).
Some comparison of reading/writing speed and memory performance in Pandas is provided in this section.
It also provides support for the Julia language.
Edit:
I found that in my case this format uses more disk space than the .txt one, probably to increase I/O performance. Compressing with zip alleviates the problem, but compression during writing does not seem to be supported yet.
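For reference, the round trip I tested looks roughly like this (a minimal sketch, assuming pandas with the pyarrow/feather dependency installed; matrix is the dense 2-D array):
import pandas as pd

df = pd.DataFrame(matrix)
df.columns = df.columns.astype(str)   # Feather requires string column names
df.to_feather('matrix.feather')
df2 = pd.read_feather('matrix.feather')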

You have several solutions, but generally what you need to do is output the indices of the non-zero elements as well as the values. Let's assume that you want to export to a single text file.
Generate array
Let's first generate a 10000 x 5000 sparse array with ~10% filled (it will be a bit less due to replicated indices):
N = 10000;
M = 5000;
rho = .1;
rN = ceil(sqrt(rho)*N);
rM = ceil(sqrt(rho)*M);
S = sparse(N, M);
S(randi(N, [rN 1]), randi(M, [rM 1])) = randi(255, rN, rM);
If your array is not already stored as a sparse array, you can create one simply using (where A is your full array):
S = sparse(A);
Save as text file
Now we will save the matrix in the following format
row_indx col_indx value
row_indx col_indx value
row_indx col_indx value
This is done by extracting the row and column indices as well as data values and then saving it to a text file in a loop:
[n, m, s] = find(S);
fid = fopen('Sparse.txt', 'wt');
arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%d\n', n, m, s), n, m, s);
fclose(fid);
If the underlying data is not an integer, then you can use the %f flag on the last output, e.g. (saved with 15 decimal places)
arrayfun(@(n, m, s) fprintf(fid, '%d\t%d\t%.15f\n', n, m, s), n, m, s);
Compare this to the full array:
fid = fopen('Full.txt', 'wt');
arrayfun(@(n) fprintf(fid, '%s\n', num2str(S(n, :))), (1:N).');
fclose(fid);
In this case, the sparse file is ~50MB and the full file ~170MB representing a factor of 3 efficiency. This is expected since I need to save 3 numbers for every nonzero element of the array, and ~10% of the array is filled, requiring ~30% as many numbers to be saved compared to the full array.
For floating-point data the saving is larger still, since the integer indices take far fewer characters to store than the printed floating-point values.
In Matlab, a quick way to extract the data would be to save the string given by:
mat2str(S)
This is essentially the same data, but wrapped in a sparse command for easy loading in Matlab - one would need to parse this string in other languages to read it in. The command tells you how to recreate the array, implying you may need to store the size of the matrix in the file as well (I recommend putting it on the first line, since you can then read it in and create the sparse matrix before parsing the rest of the file).
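On the Python side, reading that row/col/value text file back into a scipy sparse matrix takes only a few lines. A hedged sketch (assuming 1-based indices as written above; pass shape= explicitly if trailing rows or columns can be empty):
import numpy as np
from scipy.sparse import coo_matrix
from scipy.io import mmread, mmwrite

triplets = np.loadtxt('Sparse.txt')
rows = triplets[:, 0].astype(int) - 1   # 1-based -> 0-based
cols = triplets[:, 1].astype(int) - 1
S = coo_matrix((triplets[:, 2], (rows, cols)))
# scipy's Matrix Market functions write essentially the same triplet text file
# with a small header; readers also exist for R (the Matrix package) and Matlab.
mmwrite('Sparse.mtx', S)
S2 = mmread('Sparse.mtx')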
Save as binary file
A much more efficient method is to save as a binary file. Assuming the data and indices can be stored as unsigned 16 bit integers you can do the following:
[n, m, s] = find(S);
fid = fopen('Sparse.dat', 'w');
fwrite(fid, size(S), 'uint16');
fwrite(fid, [n m s], 'uint16');
fclose(fid);
Then to read the data:
fid = fopen('Sparse.dat', 'r');
sz = fread(fid, 2, 'uint16');
s = reshape(fread(fid, 'uint16'), [], 3);
s = sparse(s(:, 1), s(:, 2), s(:, 3), sz(1), sz(2));
fclose(fid);
Now we can check they are equal:
isequal(S, s)
Saving the full array:
fid = fopen('Full.dat', 'w');
fwrite(fid, full(S), 'uint16');
fclose(fid);
Comparing the sparse and full file sizes I get 21MB and 95MB.
A couple of notes:
Using a single write/read command is much (much much) quicker than looping, so the last method is by far the fastest, and also the most space efficient.
The maximum index/data value size that can be saved as a binary integer is 2^n - 1, where n is the bitdepth. In my example of 16 bits (uint16), that corresponds to a range of 0..65,535. By the sounds of it, you may need to use 32 bits or even 64 bits just to store the indices.
Higher efficiency can be obtained by saving the indices as one data type (e.g. uint32) and the actual values as another (e.g. uint8). However, this adds additional complexity in the saving and reading.
You will still want to store the matrix size first, as I showed in the binary example.
You can store the values as doubles if required, but indices should always be integers. Again, extra complexity, but doable.
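Since the point is exchange with Python, here is a hedged sketch of reading the binary file written above (it assumes the exact layout from the code: two uint16 values for the size, then the n, m and s columns written column by column):
import numpy as np
from scipy.sparse import coo_matrix

with open('Sparse.dat', 'rb') as fid:
    sz = np.fromfile(fid, dtype=np.uint16, count=2)
    n, m, s = np.fromfile(fid, dtype=np.uint16).reshape((3, -1))
# Matlab indices are 1-based, scipy's are 0-based
S = coo_matrix((s, (n.astype(int) - 1, m.astype(int) - 1)),
               shape=(int(sz[0]), int(sz[1])))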

Related

Gensim word2vec model outputs 1000 dimension ndarray but the maximum number of ndarray dimensions is 32 - how?

I'm trying to use this 1000 dimension wikipedia word2vec model to analyze some documents.
Using introspection I found out that the vector representation of a word is a 1000 dimension numpy.ndarray, however whenever I try to create an ndarray to find the nearest words I get a value error:
ValueError: maximum supported dimension for an ndarray is 32, found 1000
and from what I can tell by looking around online 32 is indeed the maximum supported number of dimensions for an ndarray - so what gives? How is gensim able to output a 1000 dimension ndarray?
Here is some example code:
doc = [model[word] for word in text if word in model.vocab]
out = []
n = len(doc[0])
print(n)
print(len(model["hello"]))
print(type(doc[0]))
for i in range(n):
    sum = 0
    for d in doc:
        sum += d[i]
    out.append(sum/n)
out = np.ndarray(out)
which outputs:
1000
1000
<class 'numpy.ndarray'>
ValueError: maximum supported dimension for an ndarray is 32, found 1000
The goal here would be to compute the average vector of all words in the corpus in a format that can be used to find nearby words in the model so any alternative suggestions to that effect are welcome.
You're calling numpy's ndarray() constructor-function with a list that has 1000 numbers in it – your hand-calculated averages of each of the 1000 dimensions.
The ndarray() function expects its argument to be the shape of the matrix constructed, so it's trying to create a new matrix of shape (d[0], d[1], ..., d[999]) – and then every individual value inside that matrix would be addressed with a 1000-int set of coordinates. And, indeed numpy arrays can only have 32 independent dimensions.
But even if you reduced the list you're supplying to ndarray() to just 32 numbers, you'd still have a problem, because your 32 numbers are floating-point values, and ndarray() is expecting integral counts. (You'd get a TypeError.)
Along the approach you're trying to take – which isn't quite optimal as we'll get to below – you really want to create a single vector of 1000 floating-point dimensions. That is, 1000 cell-like values – not d[0] * d[1] * ... * d[999] separate cell-like values.
So a crude fix along the lines of your initial approach could be to replace your last line with:
result = np.ndarray(len(d))
for i in range(len(d)):
    result[i] = d[i]
But there are many ways to incrementally make this more efficient, compact, and idiomatic – a number of which I'll mention below, even though the best approach, at bottom, makes most of these interim steps unnecessary.
For one, instead of that assignment-loop in my code just above, you could use Python's bracket-indexing assignment option:
result = np.ndarray(len(d))
result[:] = d # same result as previous 3-lines w/ loop
But in fact, numpy's array() function can essentially create the necessary numpy-native ndarray from a given list, so instead of using ndarray() at all, you could just use array():
result = np.array(d) # same result as previous 2-lines
But further, numpy's many functions for natively working with arrays (and array-like lists) already include things to do averages-of-many-vectors in a single step (where even the looping is hidden inside very-efficient compiled code or CPU bulk-vector operations). For example, there's a mean() function that can average lists of numbers, or multi-dimensional arrays of numbers, or aligned sets of vectors, and so forth.
This allows faster, clearer, one-liner approaches that can replace your entire original code with something like:
# get a list of available word-vectors
doc = [model[word] for word in text if word in model.vocab]
# average all those vectors
out = np.mean(doc, axis=0)
(Without the axis argument, it'd average together all individual dimension-values, in all slots, into just one single final average number.)

Obtain lengths of vectors without loading multiple .npy files

I have around 2000 .npy files, each representing a 1-dimensional vector of floats with between 100,000 and 1,000,000 entries (both of these numbers will substantially grow in the future). For each file, I would like the length of the vector it contains. The following option would be possible but time consuming:
lengths = [numpy.shape(numpy.load(whatever))[0] for whatever in os.listdir(some_dir)]
Question:
What is the most efficient/fastest way to derive this list of vector lengths? Surely I should be able to work directly from the file sizes - but what is the best way to do this?
Using memmapped files will speed this up considerably.
By memmapping the file, numpy only loads the header to get the array shape and datatype, while the actual array data is left on disk until needed.
import numpy as np
# Load files using memmap
data = [np.load(f, mmap_mode='r') for f in os.listdir(some_dir)]
# Checking your assumptions never hurts
assert all(d.ndim == 1 for d in data)
lengths = [d.shape[0] for d in data]
Edit: The reason you need to load the file headers rather than using the file size directly is that the header of an npy file is not necessarily a fixed length. Although for a one-dimensional array without fields or field names it probably won't change (see https://www.numpy.org/devdocs/reference/generated/numpy.lib.format.html).
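If you want to avoid creating the memmaps entirely, here is a minimal sketch that parses just the header using the helpers in numpy.lib.format (assuming version 1.0 files; read_magic tells you if a file uses a newer header version):
import os
from numpy.lib import format as npy_format

def npy_length(path):
    with open(path, 'rb') as f:
        major, minor = npy_format.read_magic(f)   # consumes the magic string
        shape, fortran_order, dtype = npy_format.read_array_header_1_0(f)
    return shape[0]

lengths = [npy_length(os.path.join(some_dir, name)) for name in os.listdir(some_dir)]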
You can probably try this:
import os
# array length
a = os.stat('1darray.npy')
length = int((a.st_size - 128) / itemsize)
128 bytes is the extra size an npy file takes for its header when saved to disk. The actual size in bytes of any numpy array can be found as array.nbytes, so a.st_size - 128 = array.nbytes, and array.nbytes/array.itemsize = array.size = the array length.
Here itemsize = 2 if the array is 16-bit float, 4 if it is 32-bit float, and 8 if it is 64-bit float.
Here is a demo
import numpy as np
import os
array = np.arange(12, dtype=np.float64)
print(array.itemsize) # >> gives 8 for float 64 bit
np.save('1darray.npy', array)
a = os.stat('1darray.npy')
length = int((a.st_size - 128)/8) # >> gives 12 which is equal to array.size
So you have to know the dtype of the saved npy files.
Therefore, for your case you might do this
lengths = [(os.stat(whatever).st_size - 128)/8 for whatever in os.listdir(some_dir)]
assuming the dtype of the npy arrays is float64

FFT in numpy vs FFT in MATLAB do not have the same results

I have a vector with complex numbers (can be found here), both in Python and in MATLAB. I am calculating the ifft-transformation with
ifft(<vector>)
in MATLAB and with
np.fft.ifft(<vector>)
in Python. My problem is that I get two completely different results out of it, i.e. while the vector in Python is complex, it is not in MATLAB. While some components in MATLAB are zero, none are in Python. Why is that? The fft-version works as intended. The minimal values are at around 1e-10, i.e. not too low.
Actually, they are the same, but Python is showing the imaginary part with extremely high precision. The imaginary components have magnitudes of around 1e-12.
Here's what I wrote to reconstruct your problem in MATLAB:
format long g;
data = importdata('data.txt');
out = ifft(data);
format long g; is a formatting option that shows more significant digits; here it displays 15 significant digits, including decimal places.
When I show the first 10 elements of the inverse FFT output, this is what I get:
>> out(1:10)
ans =
-6.08077329443768
-5.90538963023573
-5.72145198564976
-5.53037208039314
-5.33360059559345
-5.13261402212083
-4.92890104744583
-4.72394865937531
-4.51922820694745
-4.31618153490126
For numpy, be advised that complex numbers are read in with the j letter, not i. Therefore when you load in your text, you must transform all i characters to j. Once you do that, you can load in the data as normal:
In [15]: import numpy as np
In [16]: with open('data.txt', 'r') as f:
....: lines = map(lambda x: x.replace('i', 'j'), f)
....: data = np.loadtxt(lines, dtype=np.complex)
When you open up the file, the call to map would thus take the contents of the file and transform each i character into j and return a list of strings where each element in this list is a complex number in your text file with the i replaced as j. We would then call numpy.loadtxt function to convert these strings into an array of complex numbers.
Now when I take the IFFT and display the first 10 elements of the inversed result as we saw with the MATLAB version, we get:
In [20]: out = np.fft.ifft(data)
In [21]: out[:10]
Out[21]:
array([-6.08077329 +0.00000000e+00j, -5.90538963 +8.25472974e-12j,
-5.72145199 +3.56159535e-12j, -5.53037208 -1.21875843e-11j,
-5.33360060 +1.77529105e-11j, -5.13261402 -1.58326676e-11j,
-4.92890105 -6.13731196e-12j, -4.72394866 +5.46673985e-12j,
-4.51922821 -2.59774424e-11j, -4.31618154 -1.77484689e-11j])
As you can see the real part is the same but the imaginary part still exists. However, note how small in magnitude the imaginary components are. MATLAB in this case chose to not display the imaginary components because their magnitudes are very small. Actually, the data type returned from the ifft call in MATLAB is real so there was probably some post-processing after ifft was called to discard these imaginary components. numpy does not do the same thing by the way but you might as well consider these components to be very small and insignificant.
All in all, both ifft calls in Python and MATLAB are essentially the same; the difference is that Python/numpy returns the imaginary components even though they are insignificant, whereas the ifft call in MATLAB does not. Also note that you need to replace the imaginary unit i with j; you can't use i as in the original text file you've provided. If you know for certain that the output should be real, you can also drop the imaginary components by calling numpy.real on the ifft result if you so wish.
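For example, a small sketch of that last suggestion (assuming data is the loaded complex vector from above):
out = np.fft.ifft(data)
out_real = np.real(out)   # or out.real
# np.real_if_close(out, tol=1e6) also works, but only strips the imaginary
# part when it is tiny relative to machine precision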

Python Array size is doubled after saving in binary format

I have data stored in an array of size (4320, 2160), reshaped from a list of length 4320*2160. When I save the file in binary format using numpy's tofile method and then open the file, I notice that the array is double in length. How do I get the original values of the array? I'm assuming it has something to do with endianness, but I'm unfamiliar with dealing with it.
cdom=np.reshape(cdom, (4320,2160), order='F') # array of float values
len(cdom) # 4320*2160
cdom.tofile(filename)
arr = np.fromfile(filename, dtype=np.float32)
len(arr) # double the size of cdom: 2*4320*2160
It looks like cdom has type np.float64, and you are reading the binary file as np.float32, so the length is doubled (and the values are effectively garbage).
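A minimal sketch of the two obvious fixes, assuming cdom really is float64 (check cdom.dtype):
import numpy as np

# Option 1: cast before writing, so the write and read dtypes match
cdom.astype(np.float32).tofile(filename)
arr = np.fromfile(filename, dtype=np.float32).reshape(cdom.shape)

# Option 2: keep float64 on both ends
cdom.tofile(filename)
arr = np.fromfile(filename, dtype=np.float64).reshape(cdom.shape)
# tofile always writes in C order, so a plain reshape restores the values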

What is the equivalent of 'fread' from Matlab in Python?

I have practically no knowledge of Matlab, and need to translate some parsing routines into Python. They are for large files, that are themselves divided into 'blocks', and I'm having difficulty right from the off with the checksum at the top of the file.
What exactly is going on here in Matlab?
status = fseek(fid, 0, 'cof');
fposition = ftell(fid);
disp(' ');
disp(['** Block ',num2str(iBlock),' File Position = ',int2str(fposition)]);
% ----------------- Block Start ------------------ %
[A, count] = fread(fid, 3, 'uint32');
if(count == 3)
    magic_l = A(1);
    magic_h = A(2);
    block_length = A(3);
else
    if(fposition == file_length)
        disp(['** End of file OK']);
    else
        disp(['** Cannot read block start magic ! Note File Length = ',num2str(file_length)]);
    end
    ok = 0;
    break;
end
fid is the file currently being looked at
iBlock is a counter for which 'block' you're in within the file
magic_l and magic_h are to do with checksums later, here is the code for that (follows straight from the code above):
disp(sprintf(' Magic_L = %08X, Magic_H = %08X, Length = %i', magic_l, magic_h, block_length));
correct_magic_l = hex2dec('4D445254');
correct_magic_h = hex2dec('43494741');
if(magic_l ~= correct_magic_l | magic_h ~= correct_magic_h)
    disp(['** Bad block start magic !']);
    ok = 0;
    return;
end
remaining_length = block_length - 3*4 - 3*4; % We read Block Header, and we expect a footer
disp(sprintf(' Remaining Block bytes = %i', remaining_length));
What is going on with the %08X and the hex2dec stuff?
Also, why specify 3*4 instead of 12?
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32'); in Python, as io.readline() is just pulling the first 3 characters of the file. Apologies if I'm missing the point somewhere here. It's just that using io.readline(3) on the file seems to return something it shouldn't, and I don't understand how the block_length can fit in a single byte when it could potentially be very long.
Thanks for reading this ramble. I hope you can understand kind of what I want to know! (Any insight at all is appreciated.)
Python Code for Reading a 1-Dimensional Array
When replacing Matlab with Python, I wanted to read binary data into a numpy.array, so I used numpy.fromfile to read the data into a 1-dimensional array:
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16)
Some advantages of using numpy.fromfile versus other Python solutions include:
Not having to manually determine the number of items to be read. You can specify them using the count= argument, but it defaults to -1 which indicates reading the entire file.
Being able to specify either an open file object (as I did above with fid) or you can specify a filename. I prefer using an open file object, but if you wanted to use a filename, you could replace the two lines above with:
data_array = numpy.fromfile(inputfilename, numpy.int16)
Matlab Code for a 2-Dimensional Array
Matlab's fread has the ability to read the data into a matrix of form [m, n] instead of just reading it into a column vector. For instance, to read data into a matrix with 2 rows use:
fid = fopen(inputfilename, 'r');
data_array = fread(fid, [2, inf], 'int16');
fclose(fid);
Equivalent Python Code for a 2-Dimensional Array
You can handle this scenario in Python using Numpy's shape and transpose.
import numpy as np
with open(inputfilename, 'rb') as fid:
    data_array = np.fromfile(fid, np.int16).reshape((-1, 2)).T
The -1 tells numpy.reshape to infer the length of the array for that dimension based on the other dimension—the equivalent of Matlab's inf infinity representation.
The .T transposes the array so that it is a 2-dimensional array with the first dimension—the axis—having a length of 2.
From the documentation of fread, it is a function to read binary data. The second argument specifies the size of the output vector, the third one the size/type of the items read.
In order to recreate this in Python, you can use the array module:
f = open(...)
import array
a = array.array("L") # L is the typecode for uint32
a.fromfile(f, 3)
This will read three uint32 values from the file f, which are available in a afterwards. From the documentation of fromfile:
Read n items (as machine values) from the file object f and append them to the end of the array. If less than n items are available, EOFError is raised, but the items that were available are still inserted into the array. f must be a real built-in file object; something else with a read() method won’t do.
Arrays implement the sequence protocol and therefore support the same operations as lists, but you can also use the .tolist() method to create a normal list from the array.
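For instance, continuing from the snippet above, pulling the three values into named variables (matching the Matlab code) is just:
magic_l, magic_h, block_length = a.tolist()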
Really though, I want to know how to replicate [A, count] = fread(fid, 3, 'uint32');
In Matlab, one of fread()'s signatures is fread(fileID, sizeA, precision). This reads in the first sizeA elements (not bytes) of a file, each of a size sufficient for precision. In this case, since you're reading in uint32, each element is of size 32 bits, or 4 bytes.
So, instead, read 12 bytes at once with the file object's read(12) (readline is line-oriented and stops at any newline byte, which can occur anywhere in binary data) and unpack them to get the first three 4-byte elements of the file.
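A minimal sketch with the struct module ('datafile.bin' is a placeholder filename, and '<' assumes little-endian data, which is the usual default on x86):
import struct

with open('datafile.bin', 'rb') as fid:
    magic_l, magic_h, block_length = struct.unpack('<3I', fid.read(12))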
The first part is covered by Torsten's answer... you're going to need array or numarray to do anything with this data anyway.
As for the %08X and the hex2dec stuff, %08X is just the print format for those uint32 numbers (8-digit hex, exactly the same as in Python), and hex2dec('4D445254') is Matlab for 0x4D445254.
Finally, ~= in Matlab is the not-equal comparison; use != in Python (and the | in that condition is a logical OR, i.e. or in Python).
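Putting those pieces together, a rough Python equivalent of the magic check above (variable names taken from the Matlab code) could look like:
print('Magic_L = %08X, Magic_H = %08X, Length = %i' % (magic_l, magic_h, block_length))
correct_magic_l = 0x4D445254   # hex2dec('4D445254')
correct_magic_h = 0x43494741   # hex2dec('43494741')
if magic_l != correct_magic_l or magic_h != correct_magic_h:
    print('** Bad block start magic !')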
