I'm reading data from 20k mat files into an array.
After reading around 13k files, the process ends with a "Killed" message.
It looks like the problem is that too many files are open.
I've tried to find out how to explicitly "close" mat files in Python, but didn't find anything except for savemat, which is not what I need in this case.
How can I explicitly close mat files in python?
import scipy.io

x = []
with open('mat_list.txt', 'r') as f:
    for l in f:
        l = l.replace('\n', '')
        mat = scipy.io.loadmat(l)
        x.append(mat['data'])
You don't need to. loadmat does not keep the file open. If given a file name, it loads the contents of the file into memory, then immediately closes it. You can use a file object as @nils-werner suggested, but you will get no benefit from doing so. You can see this by looking at the source code.
You are most likely running out of memory due to simply having too much data at a time. The first thing I would try is to load all the data into one big numpy array. You know the size of each file, and you know how many files there are, so you can pre-allocate an array of the right size and write the data to slices of that array. This will also tell you right away if this is a problem with your array size.
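For example, a minimal sketch of that pre-allocation approach (the 100 x 50 per-file shape is just a placeholder; use your real dimensions):
import numpy as np
import scipy.io

with open('mat_list.txt') as f:
    files = [line.strip() for line in f]

# Hypothetical per-file shape; replace with the real dimensions of mat['data'].
rows_per_file, n_cols = 100, 50

x = np.empty((len(files) * rows_per_file, n_cols))
for i, fname in enumerate(files):
    mat = scipy.io.loadmat(fname)
    x[i * rows_per_file:(i + 1) * rows_per_file, :] = mat['data']
If the pre-allocation itself fails, you know immediately that the total size is the problem.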
If you are still running out of memory, you will need another solution. A simple solution would be to use dask. This allows you to create something that looks and acts like a numpy array, but lives in a file rather than in memory. This allows you to work with data sets too large to fit into memory. bcolz and blaze offer similar capabilities, although not as seamlessly.
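As a rough sketch of how dask could be applied to your many-small-files case (the per-file shape and dtype here are assumptions):
import dask
import dask.array as da
import scipy.io

def load_data(fname):
    return scipy.io.loadmat(fname)['data']

with open('mat_list.txt') as f:
    files = [line.strip() for line in f]

# The per-file shape (100, 50) and dtype are assumptions; use your real values.
lazy = [da.from_delayed(dask.delayed(load_data)(fname), shape=(100, 50), dtype='float64')
        for fname in files]
x = da.concatenate(lazy, axis=0)  # nothing is loaded until you actually compute on x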
If these are not an option, h5py and pytables allow you to store data sets to files incrementally rather than having to keep the whole thing in memory at once.
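A small h5py sketch of that incremental approach, again assuming a placeholder 100 x 50 per-file shape:
import h5py
import scipy.io

with open('mat_list.txt') as f:
    files = [line.strip() for line in f]

# Assumed per-file shape of (100, 50); substitute the real dimensions and dtype.
with h5py.File('all_data.h5', 'w') as h5:
    dset = h5.create_dataset('data', shape=(len(files) * 100, 50), dtype='float64')
    for i, fname in enumerate(files):
        dset[i * 100:(i + 1) * 100, :] = scipy.io.loadmat(fname)['data']
Only one file's worth of data is ever held in memory at a time.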
Overall, I think this question is a classic example of the XY Problem. It is generally much better to state your symptoms, and ask for help on those symptoms, rather than guessing what the solution is and asking for someone to help you implement the solution.
You can pass an open file handle to scipy.io.loadmat:
import scipy.io

x = []
with open('mat_list.txt', 'r') as f:
    for l in f:
        l = l.replace('\n', '')
        with open(l, 'rb') as matfile:  # .mat files are binary, so open in 'rb' mode
            mat = scipy.io.loadmat(matfile)
        x.append(mat['data'])
leaving the with open() context will then automatically close the file.
I'm trying to create very large .npy files and I'm having a bit of difficulty. For instance, I need to create a (500, 1586, 2048, 3) matrix and save it to an npy file, preferably inside an npz_compressed file. It also needs to be memory efficient so it can be run on low-memory systems. I've tried a few methods, but none have seemed to work so far. I've written and re-written things so many times that I don't have code snippets for everything, but I'll describe the methods as best I can, with code snippets where I can. Also, apologies for bad formatting.
Create an ndarray with all my data in it, then use savez_compressed to export it.
This gets all my data into the array, but it's terrible for memory efficiency. I filled all 8 GB of RAM, plus 5 GB of swap space. I got it to save my file, but it doesn't scale, as my matrix could get significantly larger.
Use " np.memmap('file_name_to_create', mode='w+', shape=(500,1586,2048,3)) " to create the large, initial npy file, then add my data.
This method worked for getting my data in, and it's pretty memory efficient. However, I can no longer use np.load to open the file (I get errors related to pickle, regardless of whether allow_pickle is True or False), which means I can't convert it to the compressed format. I'd be happy with this approach if I could get the result compressed, but I just can't figure it out. I'm trying to avoid using gzip if possible.
Create a (1,1,1,1) zeros array and save it with np.save. Then try opening it with np.memmap with the same size as before.
This runs into the same issues as method 2: I can no longer use np.load to read it back in afterwards.
Create 5 [100,...] npy files with method 1, saving them with np.save. Then read 2 in using np.load(mmap_mode='r+') and merge them into 1 large npy file.
Creating the individual npy files wasn't bad on memory, maybe 1 GB to 1.5 GB. However, I couldn't figure out how to then merge the npy files without actually loading an entire npy file into RAM. I read in other Stack Overflow answers that npy files aren't really designed for this at all. They mentioned it would be better to use a .h5 file for this kind of 'appending' (a rough sketch of that idea follows below).
Those are the main methods that I've used. I'm looking for feedback on whether any of these methods would work, which one would work 'best' for memory efficiency, and maybe some guidance on getting that method to work. I also wouldn't be opposed to moving to .h5 if that would be the best method; I just haven't tried it yet.
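For reference, here's roughly what I imagine the .h5 'appending' approach from method 4 would look like (untested on my end; h5py and the uint8 dtype are assumptions, and the data here is a stand-in):
import h5py
import numpy as np

shape = (500, 1586, 2048, 3)
with h5py.File('big_matrix.h5', 'w') as h5:
    # 'lzf' is h5py's built-in lightweight compressor; drop it if compression isn't wanted.
    dset = h5.create_dataset('data', shape=shape, dtype='uint8',
                             chunks=(1,) + shape[1:], compression='lzf')
    for start in range(0, shape[0], 100):
        stop = min(start + 100, shape[0])
        block = np.zeros((stop - start,) + shape[1:], dtype='uint8')  # stand-in data
        dset[start:stop] = block  # only this block is ever in memory at once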
Try it on Google Colab, which lets you run it on a GPU.
I've recently been working with large matrices. My inputs are stored in the form of 15 GB .npz files, which I am trying to read incrementally, in small batches.
I am familiar with memory mapping, and since numpy also seems to support these kinds of operations, it looked like a perfect solution. However, the problem I am facing is as follows:
I first load the matrix with:
foo = np.load('matrix.npz', mmap_mode="r+")
foo has a single key: data.
When I try to, for example do:
foo['data'][1][1]
numpy seems to endlessly consume the available RAM, almost as if there were no memory mapping at all. Am I doing anything wrong?
My goal would be, for example, to read 30 lines at a time:
for x in np.arange(0, matrix.shape[0], 30):
    batch = matrix[x:(x + 30), :]
    do_something_with(batch)
Thank you!
My guess would be that mmap_mode="r+" is ignored when the file in question is a zipped numpy file. I haven't used numpy in this way, so some of what follows is my best guess. The documentation for load states:
If the file is a .npz file, then a dictionary-like object is returned, containing {filename: array} key-value pairs, one for each file in the archive.
No mention is made of what it does with mmap_mode. And indeed, in the code for loading .npz files, the mmap_mode keyword is never used:
if magic.startswith(_ZIP_PREFIX):
    # zip-file (assume .npz)
    # Transfer file ownership to NpzFile
    tmp = own_fid
    own_fid = False
    return NpzFile(fid, own_fid=tmp, allow_pickle=allow_pickle,
                   pickle_kwargs=pickle_kwargs)
So, your initial guess is indeed correct: numpy uses all of the RAM because no memmapping is occurring. This is a limitation of the implementation of load; since the npz format is an uncompressed zip archive, it should be possible to memmap the variables (unless, of course, your files were created with savez_compressed).
Implementing a load function that memmaps npz would be quite a bit of work though, so you may want to take a look at structured arrays. They provide similar usage (access of fields by keys) and are already compatible with memmapping.
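For what it's worth, one workaround sketch (my addition, not guaranteed; it assumes the archive was written by plain np.savez under the key 'data', so the zip member is named 'data.npy') is to pull the inner .npy out of the archive and memory-map that instead:
import zipfile
import numpy as np

# Extract the inner .npy member from the archive (streamed, not loaded into RAM),
# then memory-map the extracted file. Only works for uncompressed np.savez archives.
with zipfile.ZipFile('matrix.npz') as zf:
    zf.extract('data.npy', path='.')

matrix = np.load('data.npy', mmap_mode='r')
for x in range(0, matrix.shape[0], 30):
    batch = matrix[x:x + 30, :]  # only these rows are read from disk
    # do_something_with(batch)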
I have a speed/efficiency-related question about Python:
I need to write a large number of very large R dataframe-ish files, about 0.5-2 GB in size. This is basically a large tab-separated table, where each line can contain floats, integers and strings.
Normally, I would just put all my data in a numpy array and use np.savetxt to save it, but since there are different data types it can't really be put into one array.
Therefore I have resorted to simply assembling the lines as strings manually, but this is a tad slow. So far I'm doing:
1) Assemble each line as a string
2) Concatenate all lines as single huge string
3) Write string to file
I have several problems with this:
1) The large number of string-concatenations ends up taking a lot of time
2) I run out of RAM keeping the strings in memory
3) ...which in turn leads to more separate file.write commands, which are very slow as well.
So my question is: What is a good routine for this kind of problem? One that balances out speed vs memory-consumption for most efficient string-concatenation and writing to disk.
... or maybe this strategy is simply just bad and I should do something completely different?
Thanks in advance!
Seems like Pandas might be a good tool for this problem. It's pretty easy to get started with pandas, and it deals well with most ways you might need to get data into Python. Pandas handles mixed data (floats, ints, strings) well, and can usually detect the types on its own.
Once you have an (R-like) data frame in pandas, it's pretty straightforward to output the frame to csv.
DataFrame.to_csv(path_or_buf, sep='\t')
There's a bunch of other configuration things you can do to make your tab separated file just right.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
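For example, a minimal sketch (the column names and values are made up):
import pandas as pd

# Illustrative mixed-type frame; replace with your real data.
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['a', 'b', 'c'],
    'score': [0.5, 1.25, 2.0],
})
df.to_csv('table.tsv', sep='\t', index=False)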
Unless you are running into a performance issue, you can probably write to the file line by line. Python internally uses buffering and will likely give you a nice compromise between performance and memory efficiency.
Python buffering is different from OS buffering and you can specify how you want things buffered by setting the buffering argument to open.
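A minimal sketch of the line-by-line approach with an explicit buffer size (the 1 MiB value is an arbitrary choice, and rows stands in for your real data):
rows = [(1, 'a', 0.5), (2, 'b', 1.25)]  # stand-in for the real per-line data

# A larger buffer batches the actual disk writes while each line is built separately.
with open('table.tsv', 'w', buffering=1024 * 1024) as f:
    for row in rows:
        f.write('\t'.join(str(v) for v in row) + '\n')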
I think what you might want to do is create a memory mapped file. Take a look at the following documentation to see how you can do this with numpy:
http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html
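A minimal np.memmap sketch (the shape and dtype are illustrative; this only covers the purely numeric part of your data):
import numpy as np

# Creates a disk-backed array; writes go to the file rather than staying in RAM.
mm = np.memmap('table.dat', dtype='float64', mode='w+', shape=(1_000_000, 10))
mm[0, :] = np.arange(10)
mm.flush()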
I've currently got a project running on PiCloud that involves multiple iterations of an ODE solver. Each iteration produces a NumPy array of about 30 rows and 1500 columns, with each iteration being appended to the bottom of the array of the previous results.
Normally, I'd just let these fairly big arrays be returned by the function, hold them in memory and deal with them all at once. Except PiCloud has a fairly restrictive cap on the size of the data that can be returned by a function, to keep down transmission costs. That would be fine, except it means I'd have to launch thousands of jobs, each running one iteration, with considerable overhead.
It appears the best solution to this is to write the output to a file, and then collect the file using another function they have that doesn't have a transfer limit.
Is my best bet to do this just dumping it into a CSV file? Should I add to the CSV file each iteration, or hold it all in an array until the end and then just write once? Is there something terribly clever I'm missing?
Unless there is a reason for the intermediate files to be human-readable, do not use CSV, as this will inevitably involve a loss of precision.
The most efficient is probably tofile (doc), which is intended for quick dumps of data to disk when you know all of the attributes of the data ahead of time.
For platform-independent, but numpy-specific, saves, you can use save (doc).
Numpy and scipy also have support for various scientific data formats like HDF5 if you need portability.
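For example, a rough sketch using tofile to append one iteration at a time and fromfile to read everything back (the dtype and column count must be known up front; the array here is a stand-in):
import numpy as np

result = np.zeros((30, 1500))  # stand-in for one iteration's output
with open('results.bin', 'ab') as f:
    result.tofile(f)  # append this iteration's block as raw binary

# Later: read all blocks back and restore the 1500-column layout.
all_results = np.fromfile('results.bin', dtype=np.float64).reshape(-1, 1500)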
I would recommend looking at the pickle module. The pickle module allows you to serialize python objects as streams of bytes (e.g., strings). This allows you to write them to a file or send them over a network, and then reinstantiate the objects later.
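A minimal pickle sketch, writing one iteration's array to its own file (the file name is just an example):
import pickle
import numpy as np

result = np.zeros((30, 1500))  # stand-in for one iteration's output
with open('result_0001.pkl', 'wb') as f:
    pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)

with open('result_0001.pkl', 'rb') as f:
    restored = pickle.load(f)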
Try Joblib - Fast compressed persistence
One of the key components of joblib is its ability to persist arbitrary Python objects and read them back very quickly. It is particularly efficient for containers that do their heavy lifting with numpy arrays. The trick to achieving great speed has been to save the numpy arrays in separate files and load them back via memmapping.
Edit:
Newer (2016) blog entry on data persistence in Joblib
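A minimal joblib sketch (the file name is just an example):
import joblib
import numpy as np

result = np.zeros((30, 1500))  # stand-in for one iteration's output
joblib.dump(result, 'result_0001.joblib')  # add compress=3 for compressed output

# mmap_mode='r' memory-maps the embedded numpy array instead of reading it all in.
restored = joblib.load('result_0001.joblib', mmap_mode='r')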
In my python environment, the Rpy and Scipy packages are already installed.
The problem I want to tackle is such:
1) A huge set of financial data is stored in a text file. Loading it into Excel is not possible.
2) I need to sum certain fields and get the totals.
3) I need to show the top 10 rows based on the totals.
Which package (Scipy or Rpy) is best suited for this task?
Could you also provide some pointers (e.g. documentation or online examples) that can help me implement a solution?
Speed is a concern. Ideally, scipy or Rpy would handle large files even when the files are so large that they cannot fit into memory.
Neither Rpy nor Scipy is necessary, although numpy may make it a bit easier.
This problem seems ideally suited to a line-by-line parser.
Simply open the file, read a row into a string, scan the row into an array (see numpy.fromstring), update your running sums and move to the next line.
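A rough sketch of that loop (the file name and tab delimiter are assumptions):
import numpy as np

totals = None
with open('financial_data.txt') as f:  # hypothetical file name
    for line in f:
        row = np.fromstring(line, sep='\t')  # parse one tab-separated row of numbers
        totals = row if totals is None else totals + row
print(totals)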
Python's file I/O doesn't have bad performance, so you can just use the built-in file type directly. You can see what functions are available on it by typing help(file) in the interactive interpreter. Opening a file is part of the core language functionality and doesn't require any import.
Something like:
f = open ("C:\BigScaryFinancialData.txt", "r");
for line in f.readlines():
#line is a string type
#do whatever you want to do on a per-line basis here, for example:
print len(line)
Disclaimer: This is a Python 2 answer. I'm not 100% sure this works in Python 3.
I'll leave it to you to figure out how to show the top 10 rows and find the row sums. This can be done with simple program logic that shouldn't be a problem without any special libraries. Of course, if the rows have some kind of complicated formatting that makes it difficult to parse out the values, you might want to use some kind of module for parsing, re for example (type help(re) into the interactive interpreter).
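For illustration, one single-pass way to keep the ten largest rows is the heapq module (my choice of approach here, not something the text above prescribes; it assumes tab-separated, all-numeric rows and a hypothetical file name):
import heapq

top10 = []  # min-heap of (total, line) pairs, never more than 10 entries
with open('financial_data.txt') as f:
    for line in f:
        total = sum(float(v) for v in line.split('\t'))  # assumes all-numeric rows
        item = (total, line)
        if len(top10) < 10:
            heapq.heappush(top10, item)
        else:
            heapq.heappushpop(top10, item)  # drops the smallest of the current 10

for total, line in sorted(top10, reverse=True):
    print(total, line.rstrip())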
As @gsk3 noted, bigmemory is a great package for this, along with the packages biganalytics and bigtabulate (there are more, but these are worth checking out). There's also ff, though that isn't as easy to use.
Common to both R and Python is support for HDF5 (see the ncdf4 or NetCDF4 packages in R), which makes it very speedy and easy to access massive data sets on disk. Personally, I primarily use bigmemory, though that's R specific. As HDF5 is available in Python and is very, very fast, it's probably going to be your best bet in Python.
How huge is your data? Is it larger than your PC's memory? If it can be loaded into memory, you can use numpy.loadtxt() to load the text data into a numpy array. For example:
import numpy as np

with open("data.csv", "r") as f:  # the Python 2 file(...) builtin is gone; use open()
    title = f.readline()  # skip the title line, if your data has one
    data = np.loadtxt(f, delimiter=",")  # if your data is separated by ","
    print(np.sum(data, axis=0))  # sum along axis 0 to get the total of every column
I don't know anything about Rpy. I do know that SciPy is used to do serious number-crunching with truly large data sets, so it should work for your problem.
As zephyr noted, you may not need either one; if you just need to keep some running sums, you can probably do it in Python. If it is a CSV file or other common file format, check and see if there is a Python module that will parse it for you, and then write a loop that sums the appropriate values.
I'm not sure how to get the top ten rows. Can you gather them on the fly as you go, or do you need to compute the sums and then choose the rows? To gather them you might want to use a dictionary to keep track of the current 10 best rows, and use the keys to store the metric you used to rank them (to make it easy to find and toss out a row if another row supersedes it). If you need to find the rows after the computation is done, slurp all the data into a numpy.array, or else just take a second pass through the file to pull out the ten rows.
Since this has the R tag I'll give some R solutions:
Overview: http://www.r-bloggers.com/r-references-for-handling-big-data/
bigmemory package: http://www.cybaea.net/Blogs/Data/Big-data-for-R.html
XDF format: http://blog.revolutionanalytics.com/2011/03/analyzing-big-data-with-revolution-r-enterprise.html
Hadoop interfaces to R (RHIPE, etc.)