Dump a sparse matrix into a file - python

I have a scipy.sparse.csr matrix and would like to dump it to a CSV file. Is there a way to preserve the sparsity of the matrix and write it to a CSV?

SciPy includes functions that read/write sparse matrices in the MatrixMarket format via the scipy.io module, including mmwrite: http://docs.scipy.org/doc/scipy/reference/generated/scipy.io.mmwrite.html
MatrixMarket is not CSV, but it is close: after a comment line, there is a one-line size header giving the number of rows, columns, and nonzeros, followed by one line per nonzero containing the row index, column index, and value. A simple script that turns the whitespace into commas would give you a CSV.
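As a minimal sketch of that workflow (file names here are just placeholders):
import scipy.sparse as sp
from scipy.io import mmwrite

m = sp.csr_matrix([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
mmwrite('matrix.mtx', m)  # writes the size line plus one "row col value" line per nonzero

# turn the whitespace-separated body into CSV, skipping '%' comment lines
with open('matrix.mtx') as src, open('matrix.csv', 'w') as dst:
    for line in src:
        if line.startswith('%'):
            continue
        dst.write(','.join(line.split()) + '\n')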

Now with Scipy 0.19, it is super easy:
import scipy.sparse as sp
m = sp.csr_matrix([[1,0,0],[0,1,0],[0,0,1]])
sp.save_npz("file_name.npz", m)
Loading the file back into memory:
new_m = sp.load_npz("file_name.npz")

Related

Combine two FITS files by taking a 'pizza wedge' slice?

I would like to combine two FITS files, by taking a slice out of one and inserting it into the other. The slice would be based on an angle measured from the centre pixel, see example image below:
Can this be done using Astropy? There are many questions on combining FITS on the site, but most of these are related to simply adding two files together, rather than combining segments like this.
Here is one recommended approach:
1. Read in your two files
Assuming the data is in an ImageHDU data array:
from astropy.io import fits
# read the numpy arrays out of the files
# assuming they contain ImageHDUs
data1 = fits.getdata('file1.fits')
data2 = fits.getdata('file2.fits')
2. Cut out the sections and put them into a new numpy array
Build up indices1 & indices2 for the desired sections in the new file... A simple numpy index to fill in the missing section into a new numpy array.
After being inspired by https://stackoverflow.com/a/18354475/15531842
The sector_mask function defined in the answer to get indices for each array using angular slices.
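For reference, here is a minimal sketch of such a sector_mask function, adapted from the idea in that linked answer (treat it as an illustration rather than the exact original):
import numpy as np

def sector_mask(shape, centre, radius, angle_range):
    """Boolean mask selecting a circular sector ('pizza wedge') of an image.

    shape: (ny, nx) of the image; centre: (cy, cx) in pixels;
    radius: wedge radius in pixels; angle_range: (start, stop) in degrees.
    """
    y, x = np.ogrid[:shape[0], :shape[1]]
    cy, cx = centre
    tmin, tmax = np.deg2rad(angle_range)

    if tmax < tmin:            # ensure the stop angle is greater than the start angle
        tmax += 2 * np.pi

    r2 = (x - cx) ** 2 + (y - cy) ** 2         # squared distance from the centre
    theta = np.arctan2(x - cx, y - cy) - tmin  # angle relative to the start angle
    theta %= 2 * np.pi                         # wrap into [0, 2*pi)

    return (r2 <= radius ** 2) & (theta <= (tmax - tmin))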
mask = sector_mask(data1.shape, centre=(53,38), radius=100, angle_range=(280,340))
indices1 = ~mask
indices2 = mask
Then these indices can be used to transfer the data into a new array.
import numpy as np
newdata = np.zeros_like(data1)
newdata[indices1] = data1[indices1]
newdata[indices2] = data2[indices2]
If the coordinate system is well known, it may be possible to use astropy's Cutout2D class, although I was not able to figure out how to fully use it; it wasn't clear from the example whether it can do an angular slice. See the astropy example at https://docs.astropy.org/en/stable/nddata/utils.html#d-cutout-using-an-angular-size
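For completeness, a minimal sketch of basic Cutout2D usage (the position, size, and file name are hypothetical; note that Cutout2D produces rectangular cutouts, so it is not a drop-in replacement for the sector mask above):
from astropy import units as u
from astropy.io import fits
from astropy.nddata import Cutout2D
from astropy.wcs import WCS

# hypothetical: a 20 arcsec square cutout centred on pixel (53, 38) of file1
data1 = fits.getdata('file1.fits')
wcs = WCS(fits.getheader('file1.fits'))
cutout = Cutout2D(data1, position=(53, 38), size=20 * u.arcsec, wcs=wcs)
print(cutout.data.shape)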
3a. Then write out the new array as a new file.
If special header information is not needed in the new file, the numpy array with the new image can be written out to a FITS file with one line of astropy code.
# this is an easy way to write a numpy array to FITS file
# no header information is carried over
fits.writeto('file_combined.fits', data=newdata)
3b. Carry the FITS header information over to the new file
If there is a desire to carry over header information, an ImageHDU can be built from the numpy array together with the desired header (an astropy.io.fits.Header object rather than a plain dictionary).
img_hdu = fits.ImageHDU(data=newdata, header=my_header)
hdu_list = fits.HDUList()
hdu_list.append(fits.PrimaryHDU())
hdu_list.append(img_hdu)
hdu_list.writeto('file_combined.fits')

Text transform with sklearn TF-IDF vectorizer generates too big csv file

I have 1000 texts, each with 200-1000 words; the text CSV file is about 10 MB. When I vectorize them with the code below, the output CSV is exceptionally big (2.5 GB). I am not sure what I did wrong. Your help is highly appreciated. Code:
import numpy as np
import pandas as pd
from copy import deepcopy
import glob
from sklearn.feature_extraction.text import TfidfVectorizer
from numpy import savetxt
df = pd.read_csv('data.csv')
#data has two columns: teks and groups
filtered_df = deepcopy(df)
vectorizer = TfidfVectorizer()
vectorizer.fit(filtered_df["teks"])
vector = vectorizer.transform(filtered_df["teks"])
print(vector.shape) # shape (1000, 83000)
savetxt('dataVectorized1.csv', vector.toarray(), delimiter=',')
Sparse matrices (like your vector here) are not supposed to be converted to dense ones (as you do with .toarray()) and saved as CSV files; doing that makes no sense, and invalidates the whole concept of sparse matrices itself. Given that, the big size is not a surprise.
You should seriously consider saving your sparse vector to an appropriate format, e.g. using scipy.sparse:
import scipy.sparse
scipy.sparse.save_npz('dataVectorized1.npz', vector)
See also Save / load scipy sparse csr_matrix in portable data format for possible other options.
If, for any reason, you must stick to a CSV file for storage, you could try compressing the output file by simply using the .gz extension in the file name; from the np.savetxt() documentation:
If the filename ends in .gz, the file is automatically saved in compressed gzip format. loadtxt understands gzipped files transparently.
So, this should do the job:
np.savetxt('dataVectorized1.csv.gz', vector.toarray(), delimiter=',')
However, I would not really recommend this; keep in mind that:
Beyond their convenience for tutorials and introductory exhibitions, CSV files do not really hold any "special" status as inputs to ML tasks, as you might seem to believe.
There is absolutely no reason why the much more efficient .npz file cannot be used as input for further downstream tasks, like classification, visualization, and clustering; on the contrary, its use is very much justified and recommended in similar situations.
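For example, here is a minimal sketch of using the saved .npz directly for classification (the label column name and the choice of classifier are illustrative assumptions):
import pandas as pd
import scipy.sparse
from sklearn.linear_model import LogisticRegression

X = scipy.sparse.load_npz('dataVectorized1.npz')    # sparse TF-IDF matrix saved above
y = pd.read_csv('data.csv')["groups"]               # label column from the question's data
clf = LogisticRegression(max_iter=1000).fit(X, y)   # scikit-learn handles sparse X directly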

Why 6GB csv file is not possible to read whole to memory (64GB) in numpy

I have a CSV file whose size is 6.8 GB and I am not able to read it into a numpy array in memory, although I have 64 GB of RAM.
The CSV file has 10 million lines; each line has 131 fields (a mix of int and float).
I tried to read it to float numpy array
import numpy as np
data = np.genfromtxt('./data.csv', delimiter=';')
it failed due to memory.
when I read just one line and get size
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1)
data.nbytes
I get 1048 bytes
So I would expect 10,000,000 * 1048 bytes ≈ 10.48 GB, which should fit in memory without any problem. Why doesn't it work?
Finally I tried to optimize the array in memory by defining the types:
data = np.genfromtxt('./data.csv', delimiter=';', max_rows=1,
                     dtype="i1,i1,f4,f4,....,i2,f4,f4,f4")
data.nbytes
so I get only 464 bytes per line, which would be roughly 4.6 GB in total, but it is still not possible to load it into memory.
Do you have any idea?
I need to use this array in Keras.
Thank you
genfromtxt is regular Python code that converts the data to a numpy array only as a final step. During this last step, the RAM needs to hold a giant Python list as well as the resulting numpy array, both at the same time. Maybe you could try numpy.fromfile, or the pandas CSV reader. Since you know the type of data per column and the number of lines, you could also preallocate a numpy array yourself and fill it with a simple for-loop; see the sketch below.
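A minimal sketch of that preallocation approach, assuming the 10,000,000 rows of 131 ';'-separated numeric fields described in the question and a single float32 dtype:
import numpy as np

n_rows, n_cols = 10_000_000, 131        # known from the question

# allocate the final array once, so no giant intermediate Python list is kept in RAM
data = np.empty((n_rows, n_cols), dtype=np.float32)

with open('./data.csv') as f:
    for i, line in enumerate(f):
        data[i] = np.array(line.split(';'), dtype=np.float32)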

Efficient way to process CSV file into a numpy array

CSV file may not be clean (lines with inconsistent number of elements), unclean lines would need to be disregarded.
String manipulation is required during processing.
Example input:
20150701 20:00:15.173,0.5019,0.91665
Desired output: float32 (pseudo-date, seconds in the day, f3, f4)
0.150701 72015.173 0.5019 0.91665 (+ the trailing trash floats usually get)
The CSV file is also very big, the numpy array in memory would be expected to take 5-10 GB, CSV file is over 30GB.
Looking for an efficient way to process the CSV file and end up with a numpy array.
Current solution: use the csv module, process line by line, and use a list as a buffer that later gets turned into a numpy array with asarray(). The problem is that during this conversion memory consumption is doubled, and the copying adds execution overhead.
Numpy's genfromtxt and loadtxt don't appear to be able to process the data as desired.
If you know in advance how many rows are in the data, you could dispense with the intermediate list and write directly to the array.
import numpy as np

no_rows = 5
no_columns = 4
a = np.zeros((no_rows, no_columns), dtype=np.float64)

with open('myfile') as f:
    for i, line in enumerate(f):
        a[i, :] = cool_function_that_returns_formatted_data(line)
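For the line format in the question, here is a hypothetical sketch of what that formatting function could look like (the function name comes from the snippet above; the body is only an illustration):
def cool_function_that_returns_formatted_data(line):
    # hypothetical parser for lines like "20150701 20:00:15.173,0.5019,0.91665"
    timestamp, f3, f4 = line.strip().split(',')
    date_str, time_str = timestamp.split(' ')
    pseudo_date = float('0.' + date_str[2:])                  # 20150701 -> 0.150701
    h, m, s = time_str.split(':')
    seconds_in_day = int(h) * 3600 + int(m) * 60 + float(s)   # 20:00:15.173 -> 72015.173
    return [pseudo_date, seconds_in_day, float(f3), float(f4)]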
Did you consider using pandas read_csv (with engine='c')?
I find it one of the best and easiest ways of handling CSV. I worked with a 4 GB file and it worked for me.
import pandas as pd

df = pd.read_csv('abc.csv', engine='c')
print(df.head(10))
I think the i/o capability of pandas is the best way to get data into a numpy array. Specifically, the read_csv method will read into a pandas DataFrame; you can then access the underlying numpy array with the DataFrame's to_numpy() method (the older as_matrix method has been removed from pandas), as in the short example below.
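A minimal sketch of that route, reusing the hypothetical file name from the snippet above:
import pandas as pd

df = pd.read_csv('abc.csv', engine='c')
arr = df.to_numpy(dtype='float32')   # the DataFrame's underlying data as a numpy array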

How to output the numbers in array from Fourier transform to a one single line in Excel file using Python

I want to output the numeric result from image processing to an .xls file, in one line, with the values placed exactly in cells horizontally. Can anybody advise which Python module to use and what code to add? In other words, how do I take the numbers from an array and put them into Excel cells horizontally?
Code fragment:
import glob
import mahotas
import numpy as np
from scipy import fftpack

def fouriertransform(self):  # function for FT computation
    for filename in glob.iglob('*.tif'):
        imgfourier = mahotas.imread(filename)  # read the image
        arrayfourier = np.array([imgfourier])  # make an array
        # Take the fourier transform of the image.
        F1 = fftpack.fft2(imgfourier)
        # Now shift so that low spatial frequencies are in the center.
        F2 = fftpack.fftshift(F1)
        # the 2D power spectrum is:
        psd2D = np.abs(F2) ** 2
        print(psd2D)
        f.write(str(psd2D))  # write to file (f is an output file opened elsewhere)
This should be a comment, but I don't have enough rep.
What other people say
It looks like this is a duplicate of Python Writing a numpy array to a CSV File, which uses numpy.savetxt to generate a csv file (which excel can read)
Example
Numpy is probably the easiest way to go, in my opinion. You could get fancy and use ndarray.flatten (or np.ravel). After that, it is as simple as saving the array with a comma separating each element. Something like this (tested!) is even simpler:
import numpy as np
arr = np.ones((3,3))
np.savetxt("test.csv",arr,delimiter=',',newline=',')
Note this is a small hack, since newlines are treated as commas. This makes "test.csv" look like:
1,1,1,1,1,1,1,1,1, # there are 9 ones there, I promise! 3x3
Note there is a trailing comma. I was able to open this in excel no problem, voila!
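If an actual Excel workbook (rather than a CSV that Excel can open) is required, one option is to write the flattened array into a single row with openpyxl; this is a sketch assuming the openpyxl package and a modern .xlsx file rather than a legacy .xls:
import numpy as np
from openpyxl import Workbook

arr = np.ones((3, 3))

wb = Workbook()
ws = wb.active
ws.append([float(x) for x in arr.flatten()])  # one value per cell, all in row 1
wb.save('test.xlsx')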
