How to parse text extracted from PDF file with delimiter using Python? - python

I have tried PyPDF2 to extract and parse text from PDF using following code segment;
import PyPDF2
import re
pdfFileObj = open('test.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
rawText = pdfReader.getPage().extractText()
extractedText = re.split('\n|\t', rawText)
print("Extracted Text: " + str(extractedText) + "\n")
Case 1: When I try to parse pdf text, I failed to parse them as exactly as they appear in pdf. For example,
In this case, line break or new line can't be found in both rawText or extractedText and results like below-
input field, your old automation script will try to submit a form with missing data unless you update it.Another common case is asserting that a specific error message appeared and then updating the error message, which will also break the script.
Case 2: And for following case,
It gives result as-
2B. Community Living5710509-112C. Lifelong Learning69116310-122D. Employment5710509-11
which is more difficult to parse and differentiate between these individual scores. Is it possible to parse perfectly these scenario with PyPDF2 or any other Python library?

Related

Reading nothing from a pdf file using PyPDF2

I want to convert a pdf file to json format. So I was using PyPDF2 module to read the pdf. But I am unable to read it. I gives me some "\n" characters but no text. The pdf I am using can be retrieve from here:
pdf_to_json.pdf
The code I am using is:
import PyPDF2
file = open("pdf_to_json.pdf", "rb")
pdf = PyPDF2.PdfFileReader(file)
page_one = pdf.getPage(0)
page_one.extractText()
It's returning something like this:
'\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
DISCLAIMER: The pdf is in spanish

Reading from pdf file to text yields no results

So I'm trying something very simple: I just want to read text from a pdf file in to a variable - that's it. This is what I'm getting:
Does anyone know a reliable way to just read pdf in to a text file?
Try the following library - pdfplumber:
import pdfplumber
pdf_file = pdfplumber.open('anyfile.pdf')
page = pdf_file.pages[0]
text = page.extract_text()
print(text)
pdf_file.close()
I haven't used PyPDF2 before but pdfplumber seems to work well for me.

pdftotext return blank but pdf has multiple lines and multiple pages why?

import pdftotext
# Load your PDF
with open("docs/doc1.pdf", "rb") as f:
docs = pdftotext.PDF(f)
print(docs[0])
this code print blank for this specific file, if i change the file it is giving me result.
I tried even apache Tika. Tika also return None, How to solve this problem?
One thing I would like to mention here is that pdf is made of multiple images
Here is the file
This is sample pdf, not the original one. but i want to extract text from the pdf something like this

Convert pdf files to raw text in new directory

Here is what I'm trying:
import PyPDF2
from PyPDF2 import PdfFileWriter, PdfFileReader
import re
import config
import sys
import os
with open(config.ENCRYPTED_FILE_PATH, mode='rb') as f:
reader = PyPDF2.PdfFileReader(f)
if reader.isEncrypted:
reader.decrypt('Password123')
print(f"Number of page: {reader.getNumPages()}")
for i in range(reader.numPages):
output = PdfFileWriter()
output.addPage(reader.getPage(i))
with open("./pdfs/document-page%s.pdf" % i, "wb") as outputStream:
output.write(outputStream)
print(outputStream)
for page in output.pages: # failing here
print page.extractText() # failing here
The entire program is decrypting a large pdf file from one location, and splitting into a separate pdf file per page in new directory -- this is working fine. However, after this I would like to convert each page to a raw .txt file in a new directory. i.e. /txt_versions/ (for which I'll use later)
Ideally, I can use my current imports, i.e. PyPDF2 without importing/installing more modules/. Any thoughts?
You have not described how the last two lines are failing, but extract text does not function well on some PDFs:
def extractText(self):
"""
Locate all text drawing commands, in the order they are provided in the
content stream, and extract the text. This works well for some PDF
files, but poorly for others, depending on the generator used. This will
be refined in the future. Do not rely on the order of text coming out of
this function, as it will change if this function is made more
sophisticated.
:return: a unicode string object.
"""
One thing to do is to see if there is text in your pdf. Just because you can see words doesn't mean they have been OCR'd or otherwise encoded in the file as text. Attempt to highlight the text in the pdf and copy/paste it into a text file to see what sort of text can even be extracted.
If you can't get your solution working you'll need to use another package like Tika.

Trouble parsing html files (to csv) using ElementTree xpath in python

I am trying to parse a few thousand html files and dump the variables into a csv file (excel spreadsheet). I've come up against several roadblocks--the first one which was (thankfully) solve here, a few days ago. The (hopefully) final roadblock is this: I can not get it to properly parse the file using xpath. Below is a brief explanation, the python code and example of the html code.
The trouble starts here:
for node in tree.iter():
name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
if category =='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
category=node.text
It runs, but does not parse. I do not get any traceback errors.
I think I am misunderstanding the logic of parsing with ElementTree.
There are several headers that are the same--it is therefor difficult to find a unique id/header. Here is an example of the html:
<span class="s1">Business: Give Back to the Community and Save Money
on Equipment, Technology, Promotional Products, and Market<span
class="Apple-converted-space"> </span></span>
For which the xpath is:
/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]
/table/tbody/tr[1]/td[1]/p/span
I would like to scrape the text from this span (among others) and put it in the excel spreadsheet.
You can see an example of a similar page HERE
At any rate, because many spans/headers are no uniquely identified, I think I should use xpath. However, I have yet to be able to figure out how to successfully use xpath commands with ElementTree. In searching the documentation, the answer to this question (as well as the logic) eludes me. I have read up on http://lxml.de/parsing.html as well as on this site and have yet to find something that works.
So far, the code iterates through all the files (in dropbox) nicely. It also creates the csv file and creates the headers (though not in separate columns, only as one line separated by semicolons-- but that should be easy to fix).
In sum, I would like it to parse the text from different lines on in each file (webpage) and dump it into the excel file.
Any input would be greatly appreciated.
The python code:
import xml.etree.ElementTree as ET
import csv, codecs, os
from cStringIO import StringIO
# Note: you need to download and install this..
import unicodecsv
import lxml.html
# TODO: make into command line params (instead of constant)
CSV_FILE='output.csv'
HTML_PATH='/Users/C/data/Folder_NS'
f = open(CSV_FILE, 'wb')
w = unicodecsv.writer(f, encoding='utf-8', delimiter=';')
w.writerow(['file', 'category', 'about', 'title', 'subtitle', 'date', 'bodyarticle'])
# redundant declarations:
category=''
about=''
title=''
subtitle=''
date=''
bodyarticle=''
print "headers created"
allFiles = os.listdir(HTML_PATH)
#with open(CSV_FILE, 'wb') as csvfile:
print "all defined"
for file in allFiles:
#print allFiles
if '.html' in file:
print "in html loop"
tree = lxml.html.parse(HTML_PATH+"/"+file)
print '===================='
print 'Parsing file: '+file
print '===================='
for node in tree.iter():
name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
if category =='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
print 'Category:'
category=node.text
f.close()
14 June 2015 (most recent change); I have just changed this section
for node in tree.iter():
name = node.attrib.get('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
if category =='/html/body/center/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/font':
print 'Category:'
category=node.text
to this:
for node in tree.iter():
row = dict.fromkeys(cols)
Category_name = tree.xpath('/html/body/table/tbody/tr/td/table/tbody/tr[3]/td/table/tbody/tr[2]/td[2]/table/tbody/tr[1]/td[1]/p/span')
row['category'] = Category_name[0].text_content().encode('utf-8')
It still runs, but does not parse.
Try following code:
from lxml import etree
import requests
from StringIO import StringIO
data = requests.get('http://www.usprwire.com/Detailed/Banking_Finance_Investment/Confused.com_reveals_that_Life_Insurance_is_more_than_a_form_of_future_protection_284764.shtml').content
parser = etree.HTMLParser()
root = etree.parse(StringIO(data), parser)
category = root.xpath('//table/td/font/text()')
print category[0]
It uses requests library to download the html code of the page. You can choose whatever method that fits your needs. The important part is the xpath that searches any <table> followed by <td> followed by <font>, and it returns a list with two elements. The second one are blank characters and the first one contains the text.
Run it and yields just the sentence you are looking for:
Banking, Finance & Investment: Confused.com reveals that Life Insurance is more than a form of future protection

Categories