Parsing large compressed XML files in Python

from bz2 import BZ2File
import xml.parsers.expat

file = BZ2File(SOME_FILE_PATH)
p = xml.parsers.expat.ParserCreate()
p.Parse(file)
Here's code that tries to parse an XML file compressed with bz2. Unfortunately it fails with this message:
TypeError: Parse() argument 1 must be string or read-only buffer, not bz2.BZ2File
Is there a way to parse bz2-compressed XML files on the fly?
Note: p.Parse(file.read()) is not an option here. I want to parse a file that is larger than available memory, so I need a stream.

Just use p.ParseFile(file) instead of p.Parse(file).
Parse() takes a string, ParseFile() takes a file handle, and reads the data in as required.
Ref: http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.ParseFile
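A minimal sketch of the streaming approach; the sample file and the handler are illustrative, not from the original question:

```python
import bz2
import xml.parsers.expat

# Create a small sample compressed XML file, standing in for the real data.
with bz2.BZ2File("sample.xml.bz2", "wb") as f:
    f.write(b"<root><item>a</item><item>b</item></root>")

names = []
p = xml.parsers.expat.ParserCreate()
# Record each opening tag name as the parser encounters it.
p.StartElementHandler = lambda name, attrs: names.append(name)

with bz2.BZ2File("sample.xml.bz2") as f:
    p.ParseFile(f)  # reads the file in chunks; never loads it all into memory
```

ParseFile() repeatedly calls read(n) on the object you hand it, and BZ2File decompresses each chunk on demand, so memory use stays bounded regardless of file size.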

Use .read() on the file object to read in the entire file as a string, and then pass that to Parse?
file = BZ2File(SOME_FILE_PATH)
p = xml.parsers.expat.ParserCreate()
p.Parse(file.read())

Can you pass in an mmap()'ed file? That should take care of automatically paging the needed parts of the file in, and avoid memory overflow. Of course, if expat builds a parse tree, it might still run out of memory.
http://docs.python.org/library/mmap.html
Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file.
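To illustrate the quoted point about using re on a memory-mapped file (the file name and pattern are made up for the sketch):

```python
import mmap
import re

# Write a small sample file, standing in for a large XML document on disk.
with open("data.xml", "wb") as f:
    f.write(b"<root><item>hello</item></root>")

with open("data.xml", "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # re searches the mapping directly; pages are faulted in as needed.
    match = re.search(rb"<item>(.*?)</item>", mm)
    found = match.group(1)
    mm.close()
```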

Related

Read PDF tables from memory with Python

I'm trying to read a PDF file extracted from a zip file in memory to get the tables inside the file. Camelot seems a good way to do it, but I'm getting the following error:
AttributeError: '_io.StringIO' object has no attribute 'lower'
Is there some way to read the file and extract the tables with camelot, or should I use another library?
import zipfile
from io import StringIO

import camelot

z = zipfile.ZipFile(self.zip_file)
for file in z.namelist():
    if file.endswith(".pdf"):
        pdf = z.read(file).decode(encoding="latin-1")
        pdf = StringIO(pdf)
        pdf = camelot.read_pdf(pdf, codec='utf-8')
camelot.read_pdf(filepath, ...)
accepts a file path as its first parameter, so it appears to be a bad match for your requirements. Search for another library.
In any case, StringIO(pdf) will return the following:
<_io.StringIO object at 0x000002592DD33E20>
For starters, when you read a file from StringIO, do it by calling the read() method:
pdf = StringIO(pdf)
pdf.read()
That will indeed return the file contents themselves. Next, think about the encoding that the library will accept.
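One workaround, sketched here with only the standard library: extract each PDF member to a real file and hand camelot the path it expects. The zip handling mirrors the question; the camelot call is left commented out because it is a third-party dependency.

```python
import os
import tempfile
import zipfile

def extract_pdfs(zip_path):
    """Extract every .pdf member to a temporary file; return the paths."""
    paths = []
    with zipfile.ZipFile(zip_path) as z:
        for name in z.namelist():
            if name.endswith(".pdf"):
                fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
                with os.fdopen(fd, "wb") as tmp:
                    tmp.write(z.read(name))  # raw bytes, no decoding
                paths.append(tmp_path)
    return paths

# for path in extract_pdfs("reports.zip"):
#     tables = camelot.read_pdf(path)  # camelot gets a real file path
```

Writing the raw bytes avoids the latin-1 decode round trip from the question, which would corrupt the PDF's binary streams.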

How to convert a byte object to a list of tuples in python 3?

I have a list of tuples [(x,y,z),...,] and I want to store this list of tuples in a file. For this I chose a .txt file. I write to the file in the mode "wb" and then I close it. Later, I want to open the file in mode "rb" and convert this byte object back to a list of tuples. How would I go about this without regular expression nonsense? Is there a file type that would allow me to store this data and read it easily that I've overlooked?
The .txt extension is typically not used for binary data, as you seem to intend.
Since your data structure is not known on a byte level, it's not that simple.
If you do know your data (types and length), you could "encode" it as a binary structure with https://docs.python.org/3.4/library/struct.html and write that to a (binary) file.
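A minimal struct sketch, assuming each tuple is three integers; the format string and file name are illustrative:

```python
import struct

data = [(1, 2, 3), (4, 5, 6)]

# Pack each (x, y, z) tuple as three little-endian 32-bit ints (12 bytes).
with open("points.bin", "wb") as f:
    for t in data:
        f.write(struct.pack("<3i", *t))

# Read the fixed-size records back; unpack returns tuples directly.
restored = []
with open("points.bin", "rb") as f:
    while chunk := f.read(12):
        restored.append(struct.unpack("<3i", chunk))
```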
Otherwise, there are many solutions to the problem of writing (structured) data to and reading data from files (that's why there are so many file formats):
Standard library:
https://docs.python.org/3/library/fileformats.html
https://docs.python.org/3/library/persistence.html
https://docs.python.org/3/library/xml.html
https://docs.python.org/3/library/json.html
3rd party:
https://pypi.python.org/pypi/PyYAML
and other modules on https://pypi.python.org/
Related Q&A on Stack Overflow:
How to save data with Python?
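For example, with the standard-library json module; note that JSON has no tuple type, so tuples come back as lists and need converting:

```python
import json

data = [(1, 2, 3), (4, 5, 6)]

with open("points.json", "w") as f:
    json.dump(data, f)  # tuples are serialized as JSON arrays

with open("points.json") as f:
    restored = [tuple(t) for t in json.load(f)]  # lists back to tuples
```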

Why does pyPdf2.PdfFileReader() require a file object as an input?

csv.reader() doesn't require a file object, nor does open(). Does pyPdf2.PdfFileReader() require a file object because of the complexity of the PDF format, or is there some other reason?
It's just a matter of how the library was written. csv.reader accepts any iterable that returns strings (which includes file objects). open is what opens the file in the first place, so of course it doesn't take an open file (although it can take an integer referring to an open file descriptor). Typically, it is better to handle the file separately, usually within a with block so that it is closed properly.
with open('input.pdf', 'rb') as f:
    # do something with the file
pypdf can take a BytesIO stream or a file path as well. I actually recommend passing the file path in most cases as pypdf will then take care of closing the file for you.
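The point about csv.reader taking any iterable of strings can be seen with a plain list in place of a file; the data here is made up:

```python
import csv

# csv.reader takes any iterable of strings; a list works just like a file.
rows = list(csv.reader(["name,age", "alice,30"]))
```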

Importing bz2 compressed binary file as numpy array

I have a bz2 compressed binary (big endian) file containing an array of data. Uncompressing it with external tools and then reading the file in to Numpy works:
import numpy as np
dim = 3
rows = 1000
cols = 2000
mydata = np.fromfile('myfile.bin').reshape(dim,rows,cols)
However, since there are plenty of other files like this, I cannot extract each one individually beforehand. So I turned to the bz2 module in Python, which should be able to decompress the data directly. However, I get an error message:
dfile = bz2.BZ2File('myfile.bz2').read()
mydata = np.fromfile(dfile).reshape(dim,rows,cols)
>>IOError: first argument must be an open file
Obviously, the BZ2File function does not return a file object. Do you know the correct way to read the compressed file?
BZ2File does return a file-like object (although not an actual file). The problem is that you're calling read() on it:
dfile = bz2.BZ2File('myfile.bz2').read()
This reads the entire file into memory as one big string, which you then pass to fromfile.
Depending on your versions of numpy and python and your platform, reading from a file-like object that isn't an actual file may not work. In that case, you can use the buffer you read in with frombuffer.
So, either this:
dfile = bz2.BZ2File('myfile.bz2')
mydata = np.fromfile(dfile).reshape(dim,rows,cols)
… or this:
dbuf = bz2.BZ2File('myfile.bz2').read()
mydata = np.frombuffer(dbuf).reshape(dim,rows,cols)
(Needless to say, there are a slew of other alternatives that might be better than reading the whole buffer into memory. But if your file isn't too huge, this will work.)
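A self-contained sketch of the frombuffer route. The dtype '>f8' is an assumption standing in for the question's big-endian data, and the compressed file is generated here just for the example:

```python
import bz2

import numpy as np

dim, rows, cols = 3, 4, 5

# Build a small big-endian float64 array and bz2-compress it,
# as a stand-in for the real 'myfile.bz2'.
raw = np.arange(dim * rows * cols, dtype=">f8").tobytes()
with bz2.open("myfile.bz2", "wb") as f:
    f.write(raw)

# Decompress into memory, then reinterpret the bytes with the right dtype.
dbuf = bz2.BZ2File("myfile.bz2").read()
mydata = np.frombuffer(dbuf, dtype=">f8").reshape(dim, rows, cols)
```

Note that passing an explicit dtype matters on both paths: the default for fromfile/frombuffer is native float64, which would silently misread big-endian data on a little-endian machine.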

cPickle.load( ) error

I am working with cPickle to convert structured data into a datastream format and pass it to the library. What I have to do is read the file contents from a manually written file named "targetstrings.txt" and convert them into the format that the Netcdf library needs, in the following manner.
Note: targetstrings.txt contains latin characters
op=open("targetstrings.txt",'rb')
targetStrings=cPickle.load(op)
The Netcdf library takes the contents as strings.
While loading the file, it fails with the following error:
cPickle.UnpicklingError: invalid load key, 'A'.
Please tell me how I can rectify this error; I have googled around but did not find an appropriate solution. Any suggestions?
pickle is not for reading/writing generic text files, but to serialize/deserialize Python objects to file. If you want to read text data you should use Python's usual IO functions.
with open('targetstrings.txt', 'r') as f:
    fileContent = f.read()
If, as it seems, the library just wants to have a list of strings, taking each line as a list element, you just have to do:
with open('targetstrings.txt', 'r') as f:
    lines = [l for l in f]
    # now in lines you have the lines read from the file
As stated, pickle is not meant to be used in this way.
If you need to manually edit complex Python objects that are to be read and passed as Python objects to another function, there are plenty of other formats to use, for example XML, JSON, or Python files themselves. Pickle uses a Python-specific protocol that, while not being binary (in version 0 of the protocol) and not changing across Python versions, is not meant for this; it is not even the recommended way to record Python objects for persistence or communication (although it can be used for those purposes).
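For instance, JSON handles the latin characters from the original question without any pickling; the file name mirrors the question, and the strings are made up:

```python
import json

target_strings = ["año", "café", "señor"]  # sample latin-character strings

# Write the list as JSON, keeping the non-ASCII characters readable.
with open("targetstrings.json", "w", encoding="utf-8") as f:
    json.dump(target_strings, f, ensure_ascii=False)

# Load it back; json.load returns the list of strings directly.
with open("targetstrings.json", encoding="utf-8") as f:
    loaded = json.load(f)
```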
