I want to convert a PDF file to JSON. I was using the PyPDF2 module to read the PDF, but I am unable to read it: it gives me some "\n" characters but no text. The PDF I am using can be retrieved from here:
pdf_to_json.pdf
The code I am using is:
import PyPDF2
file = open("pdf_to_json.pdf", "rb")
pdf = PyPDF2.PdfFileReader(file)
page_one = pdf.getPage(0)
page_one.extractText()
It's returning something like this:
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
DISCLAIMER: The PDF is in Spanish.
Related
So I'm trying something very simple: I just want to read text from a PDF file into a variable, that's it. This is what I'm getting:
Does anyone know a reliable way to just read a PDF into a text file?
Try the following library - pdfplumber:
import pdfplumber
pdf_file = pdfplumber.open('anyfile.pdf')
page = pdf_file.pages[0]
text = page.extract_text()
print(text)
pdf_file.close()
I haven't used PyPDF2 before but pdfplumber seems to work well for me.
import pdftotext
# Load your PDF
with open("docs/doc1.pdf", "rb") as f:
    docs = pdftotext.PDF(f)
    print(docs[0])
This code prints a blank for this specific file; if I change the file, it gives me a result.
I even tried Apache Tika, which also returns None. How can I solve this problem?
One thing I would like to mention here is that the PDF is made up of multiple images.
Here is the file
This is a sample PDF, not the original one, but I want to extract text from a PDF something like this.
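Since the PDF is described as being made of images, there may be no text layer for pdftotext or Tika to read, and OCR would be needed instead. A minimal sketch, assuming the third-party packages pdf2image and pytesseract are installed along with the poppler and tesseract programs they wrap; the function name ocr_pdf and its defaults are made up for illustration:

```python
def ocr_pdf(pdf_path, lang="eng", dpi=300):
    """Render each page of an image-only PDF and run OCR on it.

    Returns a list with one text string per page.
    """
    # Third-party imports live inside the function so this sketch can be
    # loaded even where the packages are not installed yet.
    from pdf2image import convert_from_path  # relies on the poppler utilities
    import pytesseract                       # relies on the tesseract binary

    # convert_from_path renders the PDF into one PIL image per page.
    page_images = convert_from_path(pdf_path, dpi=dpi)

    # image_to_string runs tesseract on a single image and returns its text.
    return [pytesseract.image_to_string(image, lang=lang) for image in page_images]
```

Calling ocr_pdf("docs/doc1.pdf") would then return the recognised text page by page. Accuracy depends heavily on scan quality, so the result should be treated as approximate.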
Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os
with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('Password123')
        print(f"Number of page: {reader.getNumPages()}")
        for i in range(reader.numPages):
            output = PdfFileWriter()
            output.addPage(reader.getPage(i))
            with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
                output.write(outputStream)
            print(outputStream)
            for page in output.pages: # failing here
                print page.extractText() # failing here
The entire program decrypts a large PDF file from one location and splits it into a separate PDF file per page in a new directory; this is working fine. However, after this I would like to convert each page to a raw .txt file in a new directory, i.e. /txt_versions/ (which I'll use later).
Ideally, I can use my current imports, i.e. PyPDF2, without importing or installing more modules. Any thoughts?
You have not described how the last two lines are failing, but extractText does not work well on some PDFs, as its docstring notes:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
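The same check can be approximated in code by looking at what extraction returned. A rough sketch, using only the standard library; the name has_text_layer and the min_chars threshold are arbitrary choices for illustration, not part of PyPDF2:

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if the extracted page strings hold real text, not just whitespace.

    page_texts is a list of strings, e.g. one extractText() result per page.
    min_chars is an arbitrary cut-off chosen for this sketch.
    """
    # "".join(text.split()) drops every whitespace character, including "\n".
    visible = sum(len("".join(text.split())) for text in page_texts)
    return visible >= min_chars


# A page that came back as nothing but newlines counts as having no text.
print(has_text_layer(["\n\n\n\n\n"]))
print(has_text_layer(["Hola, este es un documento de prueba."]))
```

If this returns False for every page, the PDF is probably a scan or a set of images, and no text extractor will help without OCR.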
If you can't get your solution working you'll need to use another package like Tika.
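If extraction does return text for your file, the per-page .txt output could be sketched as below. This assumes reader is an already opened (and decrypted) PdfFileReader like the one in the question, and it takes the text from the reader's pages rather than from the writer. The function name and the file naming scheme are hypothetical:

```python
from pathlib import Path


def save_pages_as_text(reader, out_dir="txt_versions"):
    """Write the extracted text of each page to out_dir and return the file paths.

    reader is expected to offer getNumPages() and getPage(i), as the legacy
    PyPDF2.PdfFileReader used in the question does.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing

    written = []
    for i in range(reader.getNumPages()):
        text = reader.getPage(i).extractText()
        target = out_path / f"document-page{i}.txt"
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written
```

No extra modules are needed beyond PyPDF2 and the standard library, but the quality of each .txt file is still limited by what extractText can recover.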
Can we extract text from a PDF file object collected from a request? For example:
f = request.FILES.get('file', None)
So, from f, can we extract the text of the document the same way we get the text content from a text file object?
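One possible approach, sketched under the assumption that f behaves like a binary file object with a read() method and that a legacy PyPDF2 release (one that still provides PdfFileReader, as used elsewhere on this page) is installed. The helper name is made up:

```python
import io


def extract_text_from_upload(uploaded_file):
    """Return the text of every page of an uploaded PDF as one string."""
    # Imported here so the sketch only needs PyPDF2 when it is actually called.
    import PyPDF2

    # Copy the upload into memory so the reader gets a seekable binary stream.
    stream = io.BytesIO(uploaded_file.read())
    reader = PyPDF2.PdfFileReader(stream)

    pages = (reader.getPage(i).extractText() for i in range(reader.getNumPages()))
    return "\n".join(pages)
```

In a view this would be called as extract_text_from_upload(f) after checking that f is not None. Newer PyPDF2 and pypdf releases rename these calls, so the names may need adjusting for your installed version.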
Try using this library called textract
http://textract.readthedocs.io/en/latest/
It supports a lot of formats including PDF
import textract
text = textract.process("path/to/file.extension")
I have tried using PyPDF2 to extract and parse text from a PDF with the following code segment:
import PyPDF2
import re
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
rawText = pdfReader.getPage(0).extractText()  # getPage needs a page index; 0 is the first page
extractedText = re.split('\n|\t', rawText)
print("Extracted Text: " + str(extractedText) + "\n")
Case 1: When I try to parse the PDF text, it does not come out exactly as it appears in the PDF. For example,
In this case, no line break or newline can be found in either rawText or extractedText, and the result looks like this:
input field, your old automation script will try to submit a form with missing data unless you update it.Another common case is asserting that a specific error message appeared and then updating the error message, which will also break the script.
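For output like this, one rough workaround is to guess where breaks were dropped. The sketch below is only a heuristic and rests on an assumption: a lowercase letter, sentence punctuation, and a capital letter with no space between them mark a lost line break. It will misfire on text where that pattern occurs for other reasons:

```python
import re

# Zero-width match between "it." and "Another": nothing is removed,
# a newline is only inserted at that position.
LOST_BREAK = re.compile(r"(?<=[a-z][.!?])(?=[A-Z])")


def restore_breaks(text):
    """Insert a newline wherever a sentence end runs straight into a capital letter."""
    return LOST_BREAK.sub("\n", text)


sample = "unless you update it.Another common case is asserting"
print(restore_breaks(sample))
```

This cannot recover breaks that fell in the middle of a sentence; for those, a library that exposes text positions is needed.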
Case 2: And for the following case,
it gives this result:
2B. Community Living5710509-112C. Lifelong Learning69116310-122D. Employment5710509-11
which is more difficult to parse, since the individual scores cannot be told apart. Is it possible to parse these scenarios perfectly with PyPDF2 or any other Python library?
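Part of Case 2 can be recovered from the string alone. The sketch below assumes every row starts with an ID shaped like "2B. " (one digit, one capital letter, a dot, a space), which is inferred from the sample and may not hold for the whole table. It splits the rows and separates each label from its run of digits; the helper names are illustrative:

```python
import re

raw = ("2B. Community Living5710509-112C. Lifelong Learning69116310-12"
       "2D. Employment5710509-11")

ROW_ID = re.compile(r"\d[A-Z]\. ")            # assumed shape of a row ID
ROW = re.compile(r"(\d[A-Z]\. \D+?)([\d-]+)")  # label, then fused numbers


def split_rows(text):
    """Cut the fused string at every position where a row ID starts."""
    starts = [m.start() for m in ROW_ID.finditer(text)]
    ends = starts[1:] + [len(text)]
    return [text[s:e] for s, e in zip(starts, ends)]


def split_label(row):
    """Return (label, fused_numbers) for one row, or None if it does not fit."""
    match = ROW.fullmatch(row)
    return (match.group(1), match.group(2)) if match else None


for row in split_rows(raw):
    print(split_label(row))
```

The fused numbers themselves (for example "5710509-11") cannot be split reliably without knowing the column boundaries, because the digits carry no separator. That step needs positional information from the PDF, which plain extractText output does not provide.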