It's probably simple, but I haven't been able to find a solution online...
I'm trying to work with a series of datasets stored as netcdf files. I open each one, read in some key points, then move on to the next file. I am finding that I constantly hit mmap errors and the script slows down as more files are read in. I believe this may be because the netcdf files are not being properly closed by the .close() command.
I've been testing this:
from scipy.io.netcdf import netcdf_file as ncfile
f=ncfile(netcdf_file,mode='r')
f.close()
then if I try
>>>f
<scipy.io.netcdf.netcdf_file object at 0x24d29e10>
and
>>>f.variables['temperature'][:]
array([ 1234.68034431, 1387.43136567, 1528.35794546, ..., 3393.91061952,
3378.2844357 , 3433.06715226])
So it appears the file is still open? What does close() actually do, and how do I know it has worked?
Is there a way to close/clear all open files from python?
Software:
Python 2.7.6, scipy 0.13.2, netcdf 4.0.1
The code for f.close is:
Definition: f.close(self)
Source:
def close(self):
    """Closes the NetCDF file."""
    if not self.fp.closed:
        try:
            self.flush()
        finally:
            self.fp.close()
f.fp is the file object. So
In [451]: f.fp
Out[451]: <open file 'test.cdf', mode 'wb' at 0x939df40>
In [452]: f.close()
In [453]: f.fp
Out[453]: <closed file 'test.cdf', mode 'wb' at 0x939df40>
But I see from playing around with f that I can still create dimensions and variables; f.flush(), however, returns an error.
It does not look like it uses mmap during data writes, only during reads.
def _read_var_array(self):
    ....
    if self.use_mmap:
        mm = mmap(self.fp.fileno(), begin_+a_size, access=ACCESS_READ)
        data = ndarray.__new__(ndarray, shape, dtype=dtype_,
                buffer=mm, offset=begin_, order=0)
    else:
        pos = self.fp.tell()
        self.fp.seek(begin_)
        data = fromstring(self.fp.read(a_size), dtype=dtype_)
        data.shape = shape
        self.fp.seek(pos)
I don't have much experience with mmap. It looks like it sets up a mmap object based on a block of bytes in the file, and uses that as the data buffer for the variable. I don't know what happens to that access if the underlying file is closed. I wouldn't be surprised if there is some sort of mmap error.
If the file is opened with mmap=False, then the whole variable is read into memory, and accessed like a regular numpy array.
mmap : None or bool, optional
    Whether to mmap `filename` when reading. Default is True
    when `filename` is a file name, False when `filename` is a
    file-like object
My guess is that if you open a file without specifying the mmap mode (so mmap is used by default), read a variable from it, and then close the file, it is unsafe to reference that variable and its data later. Any reference that requires loading more data could result in a mmap error.
But if you open the file with mmap=False, you should be able to slice the variable even after closing the file.
I don't see how the mmap for one file or variable could interfere with access to other files and variables, but I'd have to read more on mmap to be sure of that.
And from the netcdf docs:
Note that when netcdf_file is used to open a file with mmap=True (default for read-only), arrays returned by it refer to data directly on the disk. The file should not be closed, and cannot be cleanly closed when asked, if such arrays are alive. You may want to copy data arrays obtained from mmapped Netcdf file if they are to be processed after the file is closed, see the example below.
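To make that concrete, here is a minimal sketch of the safe patterns described above, reusing the file and variable names from the posts ('test.cdf' and 'temperature'): either copy the array before closing, or open with mmap=False so the data lives entirely in memory.

from scipy.io.netcdf import netcdf_file as ncfile

# Pattern 1: copy the data out of the mmapped array before closing.
f = ncfile('test.cdf', mode='r')                      # mmap=True by default for a filename
temperature = f.variables['temperature'][:].copy()    # .copy() detaches the data from the mmap
f.close()                                             # now safe; `temperature` is a normal array

# Pattern 2: disable mmap so the variable is read fully into memory.
f = ncfile('test.cdf', mode='r', mmap=False)
temperature = f.variables['temperature'][:]
f.close()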
Related
csv.reader() doesn't require a file object, nor does open(). Does pyPdf2.PdfFileReader() require a file object because of the complexity of the PDF format, or is there some other reason?
It's just a matter of how the library was written. csv.reader allows any iterable that returns strings (which includes files). open is opening the file, so of course it doesn't take an open file (although it can take an integer pointing at an open file descriptor). Typically, it is better to handle the file separately, usually within a with block so that it is closed properly.
with open('input.pdf', 'rb') as f:
    pass  # do something with the file
pypdf can take a BytesIO stream or a file path as well. I actually recommend passing the file path in most cases as pypdf will then take care of closing the file for you.
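As a quick sketch of both styles (assuming the pre-3.0 PyPDF2 API with PdfFileReader, as named in the question; 'input.pdf' is just a placeholder):

from PyPDF2 import PdfFileReader

# Style 1: manage the file yourself; keep all reading inside the with block.
with open('input.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    print(reader.getNumPages())

# Style 2: pass a path and let the library handle opening/closing the file.
reader = PdfFileReader('input.pdf')
print(reader.getNumPages())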
cPickle.dump(object,file) always dumps at the end of the file. Is there a way to dump at specific position in the file? I expected the following snippet to work
file = open("test","ab")
file.seek(50,0)
cPickle.dump(object, file)
file.close()
However, the above snippet dumps the object at the end of the file (assume file already contains 1000 chars), no matter where I seek the file pointer to.
I think this may be more of a problem with how you open the file than with cPickle.
The ab mode opens the file with the low-level O_APPEND flag, which forces every write to go to the end of the file no matter where you seek beforehand. If you want to write at a specific position, you should try the r+b mode instead.
If this doesn't solve your problem and your objects are not very large, you can still use dumps and write the resulting string yourself:
file = open("test", "r+b")
file.seek(50, 0)
dumped = cPickle.dumps(object)
file.write(dumped)
file.close()
I have a huge file of numbers in binary format, and only certain parts of it needs to be parsed into an array. I looked into numpy.fromfile and open, but they don't have the option to read from location A to location B in the file. Can this be done?
If you're dealing with "huge files", I would not simply read and throw away everything up to the point where you actually need the data.
Instead, file objects in Python have a .seek() method which you can use to jump straight to where you need to start parsing, efficiently bypassing everything before it.
with open('huge_file.dat', 'rb') as f:
    f.seek(1024 * 1024 * 1024)  # skip 1GB
    ...
See also: http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects
If you know about the precise location of the data you are interested in, you could just use the seek(<n bytes>) method on the file object as documented. Just call it once (with the given offset) before you start to read.
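As a concrete sketch (the offset, count, and dtype here are made-up placeholders; np.fromfile's count argument is what lets you stop at "location B" instead of reading the whole file):

import numpy as np

start = 1024 * 1024      # hypothetical byte offset where the wanted data begins
n_values = 1000          # hypothetical number of values to parse
with open('huge_file.dat', 'rb') as f:
    f.seek(start)                                             # jump straight to location A
    data = np.fromfile(f, dtype=np.float64, count=n_values)   # read only as far as location B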
I have a bz2 compressed binary (big endian) file containing an array of data. Uncompressing it with external tools and then reading the file in to Numpy works:
import numpy as np
dim = 3
rows = 1000
cols = 2000
mydata = np.fromfile('myfile.bin').reshape(dim,rows,cols)
However, since there are plenty of other files like this, I cannot extract each one individually beforehand. Thus, I found the bz2 module, which should let me decompress the file directly in Python. However, I get an error message:
dfile = bz2.BZ2File('myfile.bz2').read()
mydata = np.fromfile(dfile).reshape(dim,rows,cols)
>>IOError: first argument must be an open file
Obviously, the BZ2File function does not return a file object. Do you know the correct way to read the compressed file?
BZ2File does return a file-like object (although not an actual file). The problem is that you're calling read() on it:
dfile = bz2.BZ2File('myfile.bz2').read()
This reads the entire file into memory as one big string, which you then pass to fromfile.
Depending on your versions of numpy and python and your platform, reading from a file-like object that isn't an actual file may not work. In that case, you can use the buffer you read in with frombuffer.
So, either this:
dfile = bz2.BZ2File('myfile.bz2')
mydata = np.fromfile(dfile).reshape(dim,rows,cols)
… or this:
dbuf = bz2.BZ2File('myfile.bz2').read()
mydata = np.frombuffer(dbuf).reshape(dim,rows,cols)
(Needless to say, there are a slew of other alternatives that might be better than reading the whole buffer into memory. But if your file isn't too huge, this will work.)
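Putting it together for the data described in the question: since the values are big-endian, you will probably also want an explicit dtype (here '>f8' is an assumption; use whatever the files actually contain):

import bz2
import numpy as np

dim, rows, cols = 3, 1000, 2000
dbuf = bz2.BZ2File('myfile.bz2').read()                       # decompress into memory
mydata = np.frombuffer(dbuf, dtype='>f8').reshape(dim, rows, cols)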
I'm querying a database and archiving the results using Python, and I'm trying to compress the data as I write it to the log files. I'm having some problems with it, though.
My code looks like this:
log_file = codecs.open(archive_file, 'w', 'bz2')
for id, f1, f2, f3 in cursor:
    log_file.write('%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3))
However, my output file has a size of 1,409,780. Running bunzip2 on the file results in a file with a size of 943,634, and running bzip2 on that results in a size of 217,275. In other words, the uncompressed file is significantly smaller than the file compressed using Python's bzip codec. Is there a way to fix this, other than running bzip2 on the command line?
I tried Python's gzip codec (changing the line to codecs.open(archive_file, 'a+', 'zip')) to see if it fixed the problem. I still get large files, but I also get a gzip: archive_file: not in gzip format error when I try to uncompress the file. What's going on there?
EDIT: I originally had the file opened in append mode, not write mode. While this may or may not be a problem, the question still holds if the file's opened in 'w' mode.
As other posters have noted, the issue is that the codecs library doesn't use an incremental encoder to encode the data; instead it encodes every snippet of data fed to the write method as a compressed block. This is horribly inefficient, and just a terrible design decision for a library designed to work with streams.
The ironic thing is that there's a perfectly reasonable incremental bz2 encoder already built into Python. It's not difficult to create a "file-like" class which does the correct thing automatically.
import bz2

class BZ2StreamEncoder(object):
    def __init__(self, filename, mode):
        self.log_file = open(filename, mode)
        self.encoder = bz2.BZ2Compressor()

    def write(self, data):
        self.log_file.write(self.encoder.compress(data))

    def flush(self):
        self.log_file.write(self.encoder.flush())
        self.log_file.flush()

    def close(self):
        self.flush()
        self.log_file.close()
log_file = BZ2StreamEncoder(archive_file, 'ab')
A caveat: In this example, I've opened the file in append mode; appending multiple compressed streams to a single file works perfectly well with bunzip2, but Python itself can't handle it (although there is a patch for it). If you need to read the compressed files you create back into Python, stick to a single stream per file.
The problem seems to be that output is being written on every write(). This causes each line to be compressed in its own bzip block.
I would try building a much larger string (or list of strings, if you are worried about performance) in memory before writing it out to the file. A good size to shoot for would be 900K (or more), as that is the block size that bzip2 uses. A rough sketch of the idea is below.
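This sketch reuses archive_file and cursor from the question and compresses with bz2.BZ2Compressor so everything ends up in a single bz2 stream; the 900K figure is just the target chunk size, not a hard requirement.

import bz2

BLOCK_SIZE = 900 * 1024                       # roughly one bzip2 block
buf = []                                      # collected lines, joined before compressing
buf_len = 0
out = open(archive_file, 'wb')
compressor = bz2.BZ2Compressor()
for id, f1, f2, f3 in cursor:
    line = '%s %s %s %s\n' % (id, f1 or 'NULL', f2 or 'NULL', f3)
    buf.append(line)
    buf_len += len(line)
    if buf_len >= BLOCK_SIZE:                 # hand the compressor a large chunk at once
        out.write(compressor.compress(''.join(buf)))
        buf = []
        buf_len = 0
out.write(compressor.compress(''.join(buf)))  # compress whatever is left over
out.write(compressor.flush())                 # finish the single bz2 stream
out.close()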
The problem is due to your use of append mode, which results in files that contain multiple compressed blocks of data. Look at this example:
>>> import codecs
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("ABCD")
On my system, this produces a file 12 bytes in size. Let's see what it contains:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
Okay, now let's do another write in append mode:
>>> with codecs.open("myfile.zip", "a+", "zip") as f:
...     f.write("EFGH")
The file is now 24 bytes in size, and its contents are:
>>> with codecs.open("myfile.zip", "r", "zip") as f:
...     f.read()
'ABCD'
What's happening here is that unzip expects a single zipped stream. You'll have to check the specs to see what the official behavior is with multiple concatenated streams, but in my experience they process the first one and ignore the rest of the data. That's what Python does.
I expect that bunzip2 is doing the same thing. So in reality your file is compressed, and is much smaller than the data it contains. But when you run it through bunzip2, you're getting back only the first set of records you wrote to it; the rest is discarded.
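If you do end up with such a file and still want all of the data back in Python, one possible workaround (a sketch, not something codecs or BZ2File will do for you) is to decompress each concatenated stream yourself with bz2.BZ2Decompressor:

import bz2

def read_all_bz2_streams(path):
    # Decompress every concatenated bz2 stream in the file, not just the first one.
    with open(path, 'rb') as f:
        data = f.read()
    chunks = []
    while data:
        decomp = bz2.BZ2Decompressor()
        chunks.append(decomp.decompress(data))
        data = decomp.unused_data     # bytes left over after the current stream ended
    return ''.join(chunks)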
I'm not sure how different this is from the codecs approach, but if you use GzipFile from the gzip module you can incrementally append to the file. However, it's not going to compress very well unless you are writing large amounts of data at a time (maybe > 1 KB); this is just the nature of the compression algorithms. If the data you are writing isn't super important (i.e. you can deal with losing it if your process dies), then you could write a buffered GzipFile class wrapping the imported class that writes out larger chunks of data.
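For example, a minimal sketch of that kind of appending ('archive.log.gz' and the data are placeholders); each open in append mode adds another gzip member to the file, which gunzip reads back concatenated:

import gzip

log = gzip.open('archive.log.gz', 'ab')                   # appending adds another gzip member
log.write('a reasonably large chunk of log data\n' * 200)  # bigger chunks compress better
log.close()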