How to use Python to extract text from PDF documents

I have a large number of commercial invoices in PDF format. I need to pick out some information from each one, such as the billing party, the transaction date, and the amount.
In other words, I need to copy this information from each invoice and paste it into an Excel spreadsheet.
The information is always at the same position on every PDF.
Is there a way to have Python pick up this information and store it in an Excel spreadsheet, instead of copying and pasting manually?
Thanks.

To read the PDF text line by line, you can wrap it in a StringIO (getPDFContent is defined at the end of this answer):
from StringIO import StringIO
pdfContent = StringIO(getPDFContent("Billineg.pdf").encode("ascii", "ignore"))
for line in pdfContent:
    print line
Another option is to use pyPdf directly. A small example:
from pyPdf import PdfFileReader
input1 = PdfFileReader(file("Billineg.pdf", "rb"))
# print the title of Billineg.pdf
print "title = %s" % (input1.getDocumentInfo().title)
After extracting the data, you can write it to a CSV file or, for Excel, use xlwt.
The getPDFContent function used above is:
import pyPdf
def getPDFContent(path):
    content = ""
    p = file(path, "rb")
    pdf = pyPdf.PdfFileReader(p)
    # Use the real page count; a hard-coded value fails on shorter documents
    for i in range(0, pdf.getNumPages()):
        content += pdf.getPage(i).extractText() + "\n"
    p.close()
    # Collapse runs of whitespace into single spaces
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
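As a rough sketch of the "write them into csv" step, the snippet below pulls three fields out of the extracted text with regular expressions and writes them as one CSV row, which Excel can open. It is written for Python 3, unlike the pyPdf code above. The labels "Bill To:", "Date:" and "Amount:", the field names, and the sample text are all hypothetical; the patterns would need to match the wording of the real invoices.

```python
import csv
import io
import re

# Hypothetical patterns: adjust them to the wording of your own invoices.
FIELD_PATTERNS = {
    "billing_party": re.compile(r"Bill To:\s*(.*?)\s*Date:"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "amount": re.compile(r"Amount:\s*([\d,]+\.\d{2})"),
}


def extract_fields(text):
    """Return a dict of field name -> matched value ('' when not found)."""
    row = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        row[name] = match.group(1) if match else ""
    return row


def write_rows(rows, out):
    """Write the rows (dicts) as CSV, with a header line, to a file-like object."""
    writer = csv.DictWriter(out, fieldnames=list(FIELD_PATTERNS))
    writer.writeheader()
    writer.writerows(rows)


# Made-up text, in the single-line form that whitespace collapsing produces.
sample = "Bill To: ACME Corp Date: 2015-03-02 Amount: 1,250.00"
buffer = io.StringIO()
write_rows([extract_fields(sample)], buffer)
print(buffer.getvalue())
```

To write a real file instead of the in-memory buffer, pass a file opened with open("invoices.csv", "w", newline="") to write_rows.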

Related

How to summarize pdf file into plain text, and create and place new file on desktop?

I want to automatically turn pdf files into text, and then take that output to save a file on my desktop.
Example:
-- pdf converted text: "HELLO WORLD"
-- save file on desktop on a .txt file with "HELLO WORLD" saved.
I have done:
fp = open('/Users/zain/Desktop', 'pdf_summary')
fp.write(text)
I thought this would save my file on the desktop given the input (text) which I used as the variable to house the converted text.
Full Code:
from PyPDF2 import PdfReader
reader = PdfReader("/Users/zain/Desktop/Week2_POL305_Manfieldetal.pdf")
text = ""
for page in reader.pages:
    text += page.extract_text() + "\n"
print(text)
fp = open('/Users/zain/Desktop', 'pdf_summary')
fp.write(text)
fp.write(text)
A PDF may contain all sorts of things, not only text.
You therefore have to extract the text explicitly, if that is what you want.
With the PyMuPDF package you could do it this way:
import fitz # import pymupdf
import pathlib
doc=fitz.open("input.pdf")
text = "\n".join([page.get_text() for page in doc])
pathlib.Path("input.txt").write_bytes(text.encode()) # supports non ASCII text
This works for me.
from PyPDF2 import PdfReader
# path to the PDF file
reader = PdfReader(r'C:\Users\zain\Desktop\Week2_POL305_Manfieldetal.pdf')
text = ""
for page in reader.pages:
    text += page.extract_text() + '\n'
# path of the output file on the desktop
# you can keep .txt or change it to another extension
# 'a' appends to an existing file; use 'w' to overwrite it
with open(r'C:\Users\zain\Desktop\pdf_summary.txt', 'a') as fp:
    fp.write(text)
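For reference, the open() call in the question passes the Desktop directory as the path and 'pdf_summary' as the mode, whereas open() expects the full file path first and a mode such as "w" second. Here is a minimal sketch of the corrected call. The helper name save_text and the default file name are made up for illustration, and a temporary directory stands in for the desktop.

```python
import tempfile
from pathlib import Path


def save_text(text, directory, filename="pdf_summary.txt"):
    """Write text to directory/filename and return the resulting path."""
    target = Path(directory) / filename
    # The file name is part of the path; the second argument is only the mode.
    with open(target, "w", encoding="utf-8") as fp:
        fp.write(text)
    return target


# Demonstration in a temporary directory. For the real case, pass
# Path.home() / "Desktop" (assuming that folder exists on your system).
with tempfile.TemporaryDirectory() as tmp:
    saved = save_text("HELLO WORLD", tmp)
    print(saved.name, saved.read_text(encoding="utf-8"))
```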

Is it possible to write an image to a CSV file?

Hi everyone, this is my first post here. I would like to know how I can write image files that I scraped from a website to a CSV file. If that is not possible with CSV, how can I write the header, description, time info, and image to, say, a Word file? Here is the code.
Everything works perfectly; I just want to know how to write the images that I downloaded to disk into a CSV or Word file.
Thanks for your help.
import csv
import requests
from bs4 import BeautifulSoup
site_link = requests.get("websitenamehere").text
soup = BeautifulSoup(site_link,"lxml")
read_file = open("blogger.csv","w",encoding="UTF-8")
csv_writer = csv.writer(read_file)
csv_writer.writerow(["Header","links","Publish Time"])
counter = 0
for article in soup.find_all("article"):
    ###Counting lines
    counter += 1
    print(counter)
    #Article Headers
    headers = article.find("a")["title"]
    print(headers)
    #### Links
    links = article.find("a")["href"]
    print(links)
    #### Publish time
    publish_time = article.find("div",class_="mkdf-post-info-date entry-date published updated")
    publish_time = publish_time.a.text.strip()
    print(publish_time)
    ###image links
    images = article.find("img",class_="attachment-full size-full wp-post-image nitro-lazy")["nitro-lazy-src"]
    print(images)
    ###Download Article Pictures to disk
    pic_name = f"{counter}.jpg"
    with open(pic_name, 'wb') as handle:
        response = requests.get(images, stream=True)
        for block in response.iter_content(1024):
            handle.write(block)
    ###CSV Rows
    csv_writer.writerow([headers, links, publish_time])
    print()
read_file.close()
You could convert the image to base64 and write that to the file wherever you need it:
import base64
with open("image.png", "rb") as image_file:
    encoded_string = base64.b64encode(image_file.read())
print(encoded_string.decode('utf-8'))
A csv file is supposed to only contain text fields. Even if the csv module does its best to quote fields to allow almost any character in them, including the separator or a new line, it is not able to process NULL characters that could exist in an image file.
That means that you will have to encode the image bytes if you want to store them in a CSV file. Base64 is a well-known format natively supported by the Python Standard Library. So you could change your code to:
import base64
...
###Download Article Pictures
response = requests.get(images, stream=True)
image = b''.join(block for block in response.iter_content(1024)) # raw image bytes
image = base64.b64encode(image).decode('ascii') # base64-encoded text string
###CSV Rows
csv_writer.writerow([headers, links, publish_time, image])
The image will simply have to be base64-decoded before it is used.
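To illustrate the decoding step, here is a small round trip that uses only the standard library: bytes are base64-encoded, written into a CSV row, read back, and decoded. The byte string is a stand-in rather than a real image, and an in-memory buffer replaces the CSV file on disk.

```python
import base64
import csv
import io

# Stand-in bytes (note the NUL bytes); a real image would come from the download.
image_bytes = b"\x89PNG\r\n\x1a\n\x00\x00fake image data"

# Encode to ASCII text so the value is safe inside a CSV field.
encoded = base64.b64encode(image_bytes).decode("ascii")

buffer = io.StringIO()
csv.writer(buffer).writerow(["header", "link", "time", encoded])

# Read the row back and decode the last column into the original bytes.
buffer.seek(0)
row = next(csv.reader(buffer))
decoded = base64.b64decode(row[3])
print(decoded == image_bytes)
```

Keep in mind that base64 text is roughly a third larger than the raw bytes, so storing the image file name in the CSV and keeping the image on disk may be the lighter option.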

Print to excel first line of each page in pdf file

I am new to Python, with only one script behind me, for searching strings in PDFs. Now I would like to build a script that writes to a new CSV/xlsx file the first line of each page of a given PDF, together with its page number. For now I have the code below, which prints a whole page:
from PyPDF2 import PdfFileReader
pdf_document = "example.pdf"
with open(pdf_document, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    info = pdf.getDocumentInfo()
    pages = pdf.getNumPages()
    print (info)
    print ("number of pages: %i" % pages)
    page1 = pdf.getPage(0)
    print(page1)
    print(page1.extractText())
You can read the PDF file page by page, split each page's text on '\n' (if that is the character that separates lines), then use the csv module to write to a CSV file, as in the script below. Note that if the PDF pages are images (scans, for instance), this code will not be able to extract any text; you need an OCR module to convert the images to text first.
from PyPDF2 import PdfFileReader
import csv
pdf_document = "test.pdf"
with open(pdf_document, "rb") as filehandle:
    pdf = PdfFileReader(filehandle)
    # newline='' stops the csv module from writing blank rows on Windows
    with open('result.csv', 'w', newline='') as csv_file:
        csv_writer = csv.writer(csv_file)
        csv_writer.writerow(['page number', 'first line'])
        for i in range(0, pdf.getNumPages()):
            content = pdf.getPage(i).extractText().split('\n')
            print(content[0])  # prints first line
            print(i + 1)  # prints page number
            print('-------------')
            csv_writer.writerow([i + 1, content[0]])
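One thing to watch with content[0] is that it is an empty string whenever the extracted text of a page starts with a line break. A possible refinement, sketched below without any PDF library, is to take the first non-empty line instead. The helper name first_lines and the sample page texts are made up for illustration.

```python
def first_lines(page_texts):
    """Return (page_number, first_non_empty_line) pairs, numbering pages from 1."""
    rows = []
    for number, text in enumerate(page_texts, start=1):
        # Drop blank lines and surrounding whitespace before picking the first line.
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        rows.append((number, lines[0] if lines else ""))
    return rows


# Made-up page texts: one with a leading blank line, one empty page.
pages = ["\nInvoice 001\nTotal: 10", "", "Appendix\nDetails"]
print(first_lines(pages))
```

The resulting pairs can be passed straight to csv_writer.writerows().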

How to convert whole pdf to text in python

I have to convert a whole PDF to text. I have seen many examples of converting a PDF to text, but only for a particular page.
from PyPDF2 import PdfFileReader
import os
def text_extractor(path):
    with open(os.path.join(path,file), 'rb') as f:
        pdf = PdfFileReader(f)
        ###Here i can specify page but i need to convert whole pdf without specifying pages###
        page = pdf.getPage(0)
        text = page.extractText()
        print(text)
if __name__ == '__main__':
    path="C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
How can I convert the whole PDF file to text without using getPage()?
You may want to use textract as this answer recommends to get the full document if all you want is the text.
If you want to use PyPDF2 then you can first get the number of pages then iterate over each page such as:
from PyPDF2 import PdfFileReader
import os
def text_extractor(path):
    with open(os.path.join(path, file), 'rb') as f:
        pdf = PdfFileReader(f)
        # Loop over every page instead of naming a single one
        text = ""
        for page_num in range(pdf.getNumPages()):
            page = pdf.getPage(page_num)
            text += page.extractText()
        print(text)
if __name__ == '__main__':
    path = "C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
Though you may want to remember which page the text came from in which case you could use a list:
page_text = []
for page_num in range(pdf.getNumPages()): # For each page
    page = pdf.getPage(page_num) # Get that page's reference
    page_text.append(page.extractText()) # Add that page to our array
for page in page_text:
    print(page) # print each page
You could use tika to accomplish this task, but the output needs a little cleaning.
from tika import parser
parse_entire_pdf = parser.from_file('mypdf.pdf', xmlContent=True)
parse_entire_pdf = parse_entire_pdf['content']
print (parse_entire_pdf)
This answer uses PyPDF2 and encode('utf-8') to keep the output per page together.
from PyPDF2 import PdfFileReader
def pdf_text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)
        # Get total pdf page number.
        totalPageNumber = pdf.numPages
        currentPageNumber = 0
        while (currentPageNumber < totalPageNumber):
            page = pdf.getPage(currentPageNumber)
            text = page.extractText()
            # The encoding put each page on a single line.
            # type is <class 'bytes'>
            print(text.encode('utf-8'))
            #################################
            # This outputs the text to a list,
            # but it doesn't keep paragraphs
            # together
            #################################
            # output = text.encode('utf-8')
            # split = str(output, 'utf-8').split('\n')
            # print (split)
            #################################
            # Process next page.
            currentPageNumber += 1
path = 'mypdf.pdf'
pdf_text_extractor(path)
Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":
from pdfreader import SimplePDFViewer, PageDoesNotExist
fd = open(you_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)
plain_text = ""
pdf_markdown = ""
try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
PDF is a page-oriented format, so you will need to deal with the concept of pages.
What perhaps makes it even more difficult is that the text you extract is not guaranteed to come out in the order in which it appears on the page. PDF allows one to say "put this text within a 4x3 box situated 1" from the top, with a 1" left margin", and the next piece of text can then be placed somewhere else on the same page.
Your extractText() function simply gets the extracted text blocks in document order, not presentation order.
Tables are notoriously difficult to extract in a common, meaningful way. You see them as tables; PDF sees them as text blocks placed on the page with little or no relationship to each other.
Still, getPage() and extractText() are good starting points, and if your pages are simply formatted, they may work fine.
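If your library can report where each text block sits on the page (PyMuPDF, for example, can return blocks with bounding boxes), one way to approximate presentation order is to sort the blocks top to bottom and then left to right. The sketch below shows only that sorting step, on made-up (x, y, text) tuples. It assumes that y grows downward and that blocks within 3 units vertically belong to the same line; both assumptions would need adjusting for a real extractor.

```python
def reading_order(blocks, line_tolerance=3.0):
    """Return block texts sorted top-to-bottom, then left-to-right.

    blocks is a list of (x, y, text) tuples with y growing downward.
    Blocks whose y is within line_tolerance of the first block of the
    current line are treated as part of that line.
    """
    lines = []
    for block in sorted(blocks, key=lambda b: b[1]):
        if lines and abs(block[1] - lines[-1][0][1]) <= line_tolerance:
            lines[-1].append(block)
        else:
            lines.append([block])
    # Within each line, order the blocks by their x coordinate.
    return [b[2] for line in lines for b in sorted(line, key=lambda b: b[0])]


# Hypothetical blocks, listed in a document order that differs from the layout.
blocks = [
    (300.0, 50.5, "Date: 2015-03-02"),
    (40.0, 120.0, "Amount: 1,250.00"),
    (40.0, 50.0, "Bill To: ACME Corp"),
]
print(reading_order(blocks))
```

This is a heuristic: multi-column layouts and tables usually need more than a simple sort.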
I found a very simple way to do this.
Follow these steps:
Install PyPDF2: if you use Anaconda, search for Anaconda Prompt and type the following command (you need administrator permission to do this).
pip install PyPDF2
If you're not using Anaconda, you have to install pip and add it to the PATH used by your cmd or terminal.
Python code: the following code shows how to extract the text of the first page of a PDF file:
import PyPDF2
with open("pdf file path here", 'rb') as file_obj:
    pdf_reader = PyPDF2.PdfFileReader(file_obj)
    raw = pdf_reader.getPage(0).extractText()
print(raw)
I used the pdftotext module to get this done easily.
import pdftotext
# Load your PDF
with open("test.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)
# Create a text file and write every page of the PDF to it
with open("test.txt", "w") as file:
    for page in pdf:
        file.write(page)
Link: https://github.com/manojitballav/pdf-text

Reading PDF using PyPDF2 not resulting anything

Here is my code, courtesy of http://code.activestate.com/recipes/511465-pure-python-pdf-to-text-converter/ .
I modified it to use PyPDF2, the successor of pyPdf.
import PyPDF2
def getPDFContent(path):
    content = ""
    # Load PDF into pyPDF
    pdf = PyPDF2.PdfFileReader(file(path, "rb"))
    # Iterate pages
    print "Number of pages is ", pdf.getNumPages()
    for i in range(0, pdf.getNumPages()):
        # Extract text from page and add to content
        content += pdf.getPage(i).extractText() + "\n"
        print (content)
    # Collapse whitespace
    content = " ".join(content.replace(u"\xa0", " ").strip().split())
    return content
print getPDFContent("RL.pdf").encode("ascii", "xmlcharrefreplace")
The file I am reading is here.
http://dmc.kar.nic.in/RL.pdf
All I get is this.
Number of pages is 1
Blank after this.
Is this a problem with the PDF or am I going wrong somewhere?
All help appreciated!
The file turned out to be corrupt.
