I'm trying to read a PDF file extracted from a zip file in memory to get the tables inside the file. Camelot seems a good way to do it, but I'm getting the following error:
AttributeError: '_io.StringIO' object has no attribute 'lower'
Is there some way to read the file and extract the tables with camelot, or should I use another library?
z = zipfile.ZipFile(self.zip_file)
for file in z.namelist():
    if file.endswith(".pdf"):
        pdf = z.read(file).decode(encoding="latin-1")
        pdf = StringIO(pdf)
        pdf = camelot.read_pdf(pdf, codec='utf-8')
camelot.read_pdf(filepath, ...)
accepts a file path as its first parameter, so it's a poor match for a purely in-memory workflow; you may need another approach or library.
In any case, StringIO(pdf) returns a file-like object, not the contents:
<_io.StringIO object at 0x000002592DD33E20>
To get the contents back out of a StringIO, call its read() method:
pdf = StringIO(pdf)
text = pdf.read()
That returns the file's contents as a string (not bytes; for binary data like a PDF you would want BytesIO). Then think about what input types and encodings the library actually accepts.
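Since camelot wants a path on disk, one common workaround is to write the zip member's raw bytes to a temporary file and hand camelot that path. A minimal sketch (the helper name and the commented camelot call are illustrative, not part of camelot's API):

```python
import os
import tempfile
import zipfile

def extract_pdf_to_temp(zip_file, member):
    """Write one PDF member of a zip archive to a temporary file and
    return its path. Read the raw bytes -- decoding them as text
    (e.g. latin-1) corrupts a PDF."""
    with zipfile.ZipFile(zip_file) as z:
        data = z.read(member)
    fd, tmp_path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(data)
    return tmp_path

# tables = camelot.read_pdf(extract_pdf_to_temp(self.zip_file, "report.pdf"))
```

Remember to delete the temporary file once camelot is done with it.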
I have a 102 GB CSV file exported from MongoDB that I'm trying to upload to Postgres. The file contains ~55 million rows, and I'm using \copy to upload. However, I get a carriage-return error on line 47,867,184:
ERROR: unquoted carriage return found in data
HINT: Use quoted CSV field to represent carriage return.
CONTEXT: COPY reso_facts, line 47867184
To my knowledge, Postgres doesn't allow for skipping bad rows on import. Seems like I need to fix the file. Is there a way to delete a CSV row in-place using Python? I strongly prefer not to write the file to an external hard drive.
I found this elegant solution for txt files:
import os
from mmap import mmap
def removeLine(filename, lineno):
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    p = 0
    for i in range(lineno - 1):
        p = m.find('\n', p) + 1
    q = m.find('\n', p)
    m[p:q] = ' ' * (q - p)
    os.close(f)
But it errors out when fed a CSV: TypeError: a bytes-like object is required, not 'str'. Is there a way to modify the above for CSVs, or is there an alternative method entirely?
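In Python 3 an mmap is a bytes-like object, so both the search pattern and the replacement must be bytes, which is exactly what the TypeError is complaining about. A minimal adaptation (it blanks the row with spaces rather than physically deleting it, so the 102 GB file never has to be rewritten):

```python
import os
from mmap import mmap

def remove_line(filename, lineno):
    """Overwrite line `lineno` (1-based) with spaces, in place.
    mmap exposes bytes in Python 3, so patterns must be bytes too."""
    f = os.open(filename, os.O_RDWR)
    m = mmap(f, 0)
    p = 0
    for _ in range(lineno - 1):
        p = m.find(b"\n", p) + 1   # bytes pattern, not '\n'
    q = m.find(b"\n", p)
    m[p:q] = b" " * (q - p)        # bytes replacement, same length
    m.flush()
    m.close()
    os.close(f)
```

Note that the blanked row becomes a line of spaces, which COPY may still reject for a multi-column table; overwriting the span with a valid placeholder row of the same length is a variant of the same idea. Also, a naive newline count can be fooled by quoted newlines inside CSV fields.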
I am trying to merge two PDFs into one using PyPDF2.
Below is the code that reads the file contents:
output = PyPDF2.PdfFileWriter()
for pdffile in self.files:
    input = PyPDF2.PdfFileReader(pdffile, strict=False)
where self.files is a list of file objects.
I am getting the following error when trying to read one particular PDF file:
TypeError: 'NumberObject' object has no attribute '__getitem__'
When I ran Ghostscript on the PDF, it reported that the file is corrupted; the repaired copy I can read without errors. Is there any way to read the corrupted PDF with PyPDF2 alone, without errors?
Thanks in Advance.
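PyPDF2 itself has no repair pass, so one pragmatic route is to script the Ghostscript repair you already verified and then read the rewritten file. A sketch, assuming the `gs` binary is on PATH; the file names are placeholders:

```python
import subprocess

def repair_cmd(src, dst):
    """Ghostscript command that rewrites `src` as a fresh PDF at `dst`,
    which fixes many structural errors (assumes `gs` is on PATH)."""
    return ["gs", "-o", dst, "-sDEVICE=pdfwrite", src]

def repair_pdf(src, dst):
    # run Ghostscript; check=True raises if the rewrite fails
    subprocess.run(repair_cmd(src, dst), check=True)

# repair_pdf("broken.pdf", "repaired.pdf")
# input = PyPDF2.PdfFileReader("repaired.pdf", strict=False)
```

This adds a dependency on Ghostscript being installed, but it mirrors exactly the manual repair that already worked for you.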
How can I create a word (.docx) document if not found using python and write in it?
I certainly cannot do either of the following:
file = open(file_name, 'r')
file = open(file_name, 'w')
or, to create or append if found:
f = open(file_name, 'a+')
Also I cannot find any related info in python-docx documentation at:
https://python-docx.readthedocs.io/en/latest/
NOTE:
I need to create an automated report via python with text and pie charts, graphs etc.
Probably the safest way to open a new file for writing is using 'xb' mode. 'x' will raise a FileExistsError if the file is already there, so you never clobber an existing document. 'b' is necessary because a Word document is fundamentally a binary file: it's a zip archive with XML and other files inside it. You can't round-trip a zip archive's bytes through a character encoding, so the file must not be opened in text mode.
Document.save accepts streams, so you can pass in a file object opened like that to save your document.
Your work-flow could be something like this:
doc = docx.Document(...)
...
# Make your document
...
with open('outfile.docx', 'xb') as f:
    doc.save(f)
It's a good idea to use with blocks instead of raw open to ensure the file gets closed properly even in case of an error.
In the same way that you can't simply write to a Word file directly, you can't append to it either. The way to "append" is to open the file, load the Document object, modify it, and write it back, overwriting the original content. Since a Word file is a zip archive, appended text most likely wouldn't even land at the end of the XML file it belongs to, much less at the end of the docx file itself:
doc = docx.Document('file_to_append.docx')
...
# Modify the contents of doc
...
doc.save('file_to_append.docx')
Keep in mind that the python-docx library may not support loading some elements, which may end up being permanently discarded when you save the file this way.
Looks like I found an answer:
The important point here was to create a new file, if not found, or
otherwise edit the already present file.
import os
from docx import Document
# checking if the file is already present and creating it if not
if not os.path.isfile('file_name.docx'):
    # creating a blank document
    document = Document()
    # saving the blank document
    document.save('file_name.docx')

# ------------ editing file_name.docx now ------------
# opening the existing document
document = Document('file_name.docx')
# editing it
document.add_heading("hello world", 0)
# saving the document at the end
document.save('file_name.docx')
Further edits/suggestions are welcome.
I'm using the latest GAE default python environment. Both of these give expected results:
isTrue = os.path.exists(path)
numberGreaterThanZero = os.path.getsize(path)
But this:
myStrLen = len(open(path))
Gives this error:
TypeError: object of type 'FakeFile' has no len()
There are no results for that error in Google. Not being able to open files is a real bummer. What am I doing wrong? Why does Python/GAE think my file is fake?
The open function returns an open file, not a string. Open files have no len.
You need to actually read the string from the file, for example with the read method.
contents = open(path).read()
myStrLen = len(contents)
If you don't need the contents, you can also get the file size with os.stat.
myStrLen = os.stat(path).st_size
FakeFile is just GAE's sandboxed implementation of file.
file = BZ2File(SOME_FILE_PATH)
p = xml.parsers.expat.ParserCreate()
p.Parse(file)
Here's code that tries to parse xml file compressed with bz2. Unfortunately it fails with a message:
TypeError: Parse() argument 1 must be string or read-only buffer, not bz2.BZ2File
Is there a way to parse on the fly compressed bz2 xml files?
Note: p.Parse(file.read()) is not an option here. I want to parse a file which is larger than available memory, so I need to have a stream.
Just use p.ParseFile(file) instead of p.Parse(file).
Parse() takes a string, ParseFile() takes a file handle, and reads the data in as required.
Ref: http://docs.python.org/library/pyexpat.html#xml.parsers.expat.xmlparser.ParseFile
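Putting it together: ParseFile pulls chunks from any readable binary file object, so a BZ2File streams the decompressed XML without ever holding it all in memory. A minimal sketch with an element counter (the handler and function name are illustrative):

```python
import bz2
import xml.parsers.expat

def count_start_tags(bz2_path):
    """Stream-parse bz2-compressed XML: ParseFile reads the file
    object in chunks, so memory use stays bounded."""
    count = 0

    def on_start(name, attrs):
        nonlocal count
        count += 1

    p = xml.parsers.expat.ParserCreate()
    p.StartElementHandler = on_start
    with bz2.BZ2File(bz2_path) as f:
        p.ParseFile(f)
    return count
```

Any of expat's handlers (StartElementHandler, CharacterDataHandler, ...) can be attached the same way; the parser fires them as the stream is decompressed.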
Use .read() on the file object to read in the entire file as a string, and then pass that to Parse?
file = BZ2File(SOME_FILE_PATH)
p = xml.parsers.expat.ParserCreate()
p.Parse(file.read())
Can you pass in an mmap()'ed file? That should take care of automatically paging the needed parts of the file in, and avoid memory overflow. Of course, if expat builds a parse tree, it might still run out of memory.
http://docs.python.org/library/mmap.html
Memory-mapped file objects behave like both strings and like file objects. Unlike normal string objects, however, these are mutable. You can use mmap objects in most places where strings are expected; for example, you can use the re module to search through a memory-mapped file.