Reading from pdf file to text yields no results - python

So I'm trying something very simple: I just want to read text from a PDF file into a variable, that's it. This is what I'm getting:
Does anyone know a reliable way to just read a PDF into a text file?

Try the pdfplumber library:
import pdfplumber

# The context manager closes the file when the block ends
with pdfplumber.open('anyfile.pdf') as pdf_file:
    page = pdf_file.pages[0]  # first page only
    text = page.extract_text()

print(text)
I haven't used PyPDF2 before but pdfplumber seems to work well for me.

Related

How do I insert an HTML file in the content of a chapter when using ebooklib?

I'm making an EPUB using the EbookLib library, following its documentation. I am trying to set the content of a chapter to the content of an HTML file. The only way I got it to work was by passing a literal HTML string when setting the content.
c1 = epub.EpubHtml(title='Chapter one', file_name='ch1.xhtml', lang='en')
c1.set_content(u'<html><body><h1>Introduction</h1><p>Introduction paragraph.</p></body></html>')
Is it possible to give an HTML file as the content of the chapter?
I've tried things like c1.set_content(file_name='ch1.xhtml') but that didn't work, it only accepts plain HTML.
I figured it out! I'm opening the file, reading it into a variable, and then passing that variable to the set_content function. Posting this so it could be of use to someone in the future.
# Read the chapter's HTML from disk; the with block closes the file
with open('ch1.xhtml', 'r') as file:
    lines = file.read()
c2.set_content(lines)

Reading nothing from a pdf file using PyPDF2

I want to convert a PDF file to JSON format, so I was using the PyPDF2 module to read the PDF. But I am unable to read it. It gives me some "\n" characters but no text. The PDF I am using can be retrieved from here:
pdf_to_json.pdf
The code I am using is:
import PyPDF2
file = open("pdf_to_json.pdf", "rb")
pdf = PyPDF2.PdfFileReader(file)
page_one = pdf.getPage(0)
page_one.extractText()
It's returning something like this:
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
DISCLAIMER: The PDF is in Spanish

Why does pdftotext return blank when the PDF has multiple lines and multiple pages?

import pdftotext

# Load your PDF
with open("docs/doc1.pdf", "rb") as f:
    docs = pdftotext.PDF(f)

print(docs[0])
This code prints a blank result for this specific file; if I change the file, it gives me a result.
I even tried Apache Tika, which also returns None. How can I solve this problem?
One thing I should mention is that the PDF is made up of multiple images.
Here is the file
This is a sample PDF, not the original one, but I want to extract text from the PDF, something like this

Convert pdf files to raw text in new directory

Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os

with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    if reader.isEncrypted:
        reader.decrypt('Password123')
    print(f"Number of page: {reader.getNumPages()}")
    for i in range(reader.numPages):
        output = PdfFileWriter()
        output.addPage(reader.getPage(i))
        with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
            output.write(outputStream)
        print(outputStream)
        for page in output.pages: # failing here
            print page.extractText() # failing here
The program decrypts a large PDF file from one location and splits it into a separate PDF file per page in a new directory; this part works fine. After that, however, I would like to convert each page to a raw .txt file in a new directory, i.e. /txt_versions/ (which I'll use later).
Ideally, I can use my current imports, i.e. PyPDF2, without importing or installing more modules. Any thoughts?
You have not described how the last two lines are failing, but extractText does not work well on some PDFs, as its own docstring says:
def extractText(self):
    """
    Locate all text drawing commands, in the order they are provided in the
    content stream, and extract the text. This works well for some PDF
    files, but poorly for others, depending on the generator used. This will
    be refined in the future. Do not rely on the order of text coming out of
    this function, as it will change if this function is made more
    sophisticated.
    :return: a unicode string object.
    """
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
If you can't get your solution working you'll need to use another package like Tika.
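The check above can also be done in code once you have the extracted strings. The following is only a sketch under a few assumptions: the helper name save_pages_as_text, the page_texts argument (one string per page, from whichever extractor you use), and the file naming are all hypothetical and not part of PyPDF2. It writes one .txt file per page and reports the pages that came back with nothing but whitespace, which usually means the page is an image and would need OCR.

```python
from pathlib import Path


def save_pages_as_text(page_texts, out_dir="txt_versions"):
    """Write one .txt file per page; return indexes of pages with no text.

    page_texts is assumed to be a list with one extracted string per page.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    empty_pages = []
    for i, text in enumerate(page_texts):
        text = text or ""  # guard against an extractor returning None
        if not text.strip():
            # Only whitespace came back: likely a scanned page that needs OCR
            empty_pages.append(i)
        (out_path / f"document-page{i}.txt").write_text(text, encoding="utf-8")
    return empty_pages
```

With the reader from the question, the list could be built as [reader.getPage(i).extractText() for i in range(reader.numPages)]. If every index is returned as empty, the PDF probably contains no text layer at all.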

How to parse text extracted from PDF file with delimiter using Python?

I have tried PyPDF2 to extract and parse text from a PDF using the following code:
import PyPDF2
import re
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
rawText = pdfReader.getPage(0).extractText()  # getPage needs a page index; 0 is the first page
extractedText = re.split('\n|\t', rawText)
print("Extracted Text: " + str(extractedText) + "\n")
Case 1: When I parse the PDF text, the result does not match what appears in the PDF. For example,
In this case, no line break or newline can be found in either rawText or extractedText, and the result looks like this:
input field, your old automation script will try to submit a form with missing data unless you update it.Another common case is asserting that a specific error message appeared and then updating the error message, which will also break the script.
Case 2: For the following case,
it gives this result:
2B. Community Living5710509-112C. Lifelong Learning69116310-122D. Employment5710509-11
which makes it even harder to parse and tell the individual scores apart. Is it possible to parse these cases reliably with PyPDF2 or any other Python library?
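If the text has already been flattened like this, some structure can still be recovered with regular expressions. This is a rough sketch, not a general solution. The function names are hypothetical, and the row pattern assumes every row starts with a code of one digit plus one capital letter followed by ". " (as in "2B. "). The individual scores inside the fused digit run cannot be separated reliably this way, because the column boundaries are already lost; a layout-aware extractor would be needed for that.

```python
import re

# Assumed row shape: "<digit><letter>. <label><fused digits>",
# e.g. "2B. Community Living5710509-11"
ROW = re.compile(r"(\d[A-Z])\. ([A-Za-z ]+?)(\d[\d-]*?)(?=\d[A-Z]\. |$)")


def split_rows(raw_text):
    """Return (code, label, fused_scores) tuples for each row found."""
    return [m.groups() for m in ROW.finditer(raw_text)]


def break_fused_sentences(raw_text):
    """Insert a newline where a sentence end runs straight into a capital letter."""
    return re.sub(r"(?<=[a-z]\.)(?=[A-Z])", "\n", raw_text)
```

For the Case 2 string above, split_rows yields ("2B", "Community Living", "5710509-11") as its first row. break_fused_sentences addresses Case 1 by splitting "update it.Another" into two lines; it is a heuristic and will not restore the original line breaks of the PDF.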
