How to remove watermark from PDF file using Python's PyPDF2 lib

I wrote some code that extracts the text from a PDF file with Python and the PyPDF2 lib.
The code works well for most docs, but sometimes it returns strange characters. I think that's because the PDF has a watermark over the page, so it doesn't recognise the text:
import requests
from io import BytesIO
import PyPDF2

def pdf_content_extraction(pdf_link):
    all_pdf_content = ''

    # sending requests
    response = requests.get(pdf_link)
    my_raw_data = response.content

    pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'

    # extract text page by page
    with BytesIO(my_raw_data) as data:
        read_pdf = PyPDF2.PdfFileReader(data)

        # looping through each page
        for page in range(read_pdf.getNumPages()):
            page_content = read_pdf.getPage(page).extractText()
            page_content = page_content.replace("\n\n\n", "\n").strip()

            # store data into variable for each page
            pdf_file_text += page_content + '\n\nPAGE ' + str(page + 1) + '/' + str(read_pdf.getNumPages()) + '\n\n\n'

    all_pdf_content += pdf_file_text + "\n\n"
    return all_pdf_content

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
print(pdf_content_extraction(pdf_link))
This is the result that I'm getting:
#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...
My question is: how can I fix this problem?
Is there a way to remove the watermark from the page, or something like that?
I mean, maybe this problem can be fixed some other way; maybe the problem isn't the watermark/logo at all?

The garbled text issue that you're having has nothing to do with the watermark in the document. Your issue is related to the encoding in the document. The German characters within your document should be extractable with PyPDF2, because it uses the latin-1 (iso-8859-1) encoding/decoding model, but that model isn't working with your PDF.
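A quick standard-library sketch of that failure mode (the German sample string here is my own, and this shows generic mojibake rather than this PDF's exact custom encoding):

```python
# German text encoded as UTF-8 bytes...
raw = "Gemeinderatssitzung für Bürger".encode('utf-8')

# ...then decoded as latin-1, the model PyPDF2 assumes: no exception is
# raised, but the multi-byte characters silently turn into mojibake.
print(raw.decode('latin-1'))   # Gemeinderatssitzung fÃ¼r BÃ¼rger

# Decoding with the matching codec recovers the original text.
print(raw.decode('utf-8'))     # Gemeinderatssitzung für Bürger
```

In your PDF the mismatch is worse than a simple wrong codec, which is why no single decode step has recovered the text so far.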
When I look at the underlying info of your PDF I note that it was created using these apps:
'Producer': 'GPL Ghostscript 9.10'
'Creator': 'PDFCreator Version 1.7.3'
When I look at one of the PDFs in this question, also written in German, I note that it was created using different apps:
'/Creator': 'Acrobat PDFMaker 11 für Excel'
'/Producer': 'Adobe PDF Library 11.0'
I can read the second file perfectly with PyPDF2.
When I look at this file from your other question, I noted that it also cannot be read correctly by PyPDF2. That file was created with the same apps as the file from this bounty question:
'Producer': 'GPL Ghostscript 9.10'
'Creator': 'PDFCreator Version 1.7.3'
This is the same file that threw an error when attempting to extract the text using pdfreader.SimplePDFViewer.
I looked at the bugs for Ghostscript and noted that there are some font-related issues in Ghostscript 9.10, which was released in 2015. I also noted that some people mentioned that PDFCreator Version 1.7.3, released in 2018, also had some font embedding issues.
I have been trying to find the correct decoding/encoding sequence, but so far I haven't been able to extract the text correctly.
Here are some of the sequences:
page_content.encode('raw_unicode_escape').decode('ascii', 'xmlcharrefreplace')
# output
\u02d8
\u02c7\u02c6\u02d9\u02dd\u02d9\u02db\u02da\u02d9\u02dc
\u02d8\u02c6!"""\u02c6\u02d8\u02c6!
page_content.encode('ascii', 'xmlcharrefreplace').decode('raw_unicode_escape')
# output
# ˘
ˇˆ˙˝˙˛˚˙˜
˘ˆ!"""ˆ˘ˆ!
I will keep looking for the correct encoding/decoding sequence to use with PyPDF2. It is worth noting that PyPDF2 hasn't been updated since May 18, 2016, and encoding issues are a common problem with the module. Plus, maintenance of this module is dead, hence the ports PyPDF3 and PyPDF4.
I attempted to extract the text from your PDF using PyPDF2, PyPDF3 and PyPDF4. All 3 modules failed to extract the content from the PDF that you provided.
You can definitely extract the content from your document using other Python modules.
Tika
This example uses Tika and BeautifulSoup to extract the content in German from your source document.
import requests
from tika import parser
from io import BytesIO
from bs4 import BeautifulSoup

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)

with BytesIO(response.content) as data:
    parse_pdf = parser.from_buffer(data, xmlContent=True)

    # Parse metadata from the PDF
    metadata = parse_pdf['metadata']

    # Parse the content from the PDF
    content = parse_pdf['content']

    # Convert double newlines into single newlines
    content = content.replace('\n\n', '\n')

    soup = BeautifulSoup(content, "lxml")
    body = soup.find('body')
    for p_tag in body.find_all('p'):
        print(p_tag.text.strip())
pdfminer
This example uses pdfminer to extract the content from your source document.
import requests
from io import BytesIO
from pdfminer.high_level import extract_text

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)

with BytesIO(response.content) as data:
    text = extract_text(data, password='', page_numbers=None, maxpages=0,
                        caching=True, codec='utf-8', laparams=None)
    print(text.replace('\n\n', '\n').strip())

def remove_watermark(wm_text, inputFile, outputFile):
    from PyPDF4 import PdfFileReader, PdfFileWriter
    from PyPDF4.pdf import ContentStream
    from PyPDF4.generic import TextStringObject, NameObject
    from PyPDF4.utils import b_

    with open(inputFile, "rb") as f:
        source = PdfFileReader(f, "rb")
        output = PdfFileWriter()

        for page in range(source.getNumPages()):
            page = source.getPage(page)
            content_object = page["/Contents"].getObject()
            content = ContentStream(content_object, source)

            for operands, operator in content.operations:
                if operator == b_("Tj"):
                    text = operands[0]
                    if isinstance(text, str) and text.startswith(wm_text):
                        operands[0] = TextStringObject('')

            page.__setitem__(NameObject('/Contents'), content)
            output.addPage(page)

        with open(outputFile, "wb") as outputStream:
            output.write(outputStream)

wm_text = 'wm_text'
inputFile = r'input.pdf'
outputFile = r"output.pdf"
remove_watermark(wm_text, inputFile, outputFile)

In contrast to my initial assumption in comments to the question, the issue is not some missing ToUnicode map. I didn't see the URL to the file immediately and, therefore, guessed. Instead, the issue is a very primitively implemented text extraction method.
The PageObject method extractText is documented as follows:
extractText()
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Returns: a unicode string object.
(PyPDF2 1.26.0 documentation, visited 2021-03-15)
So this method extracts the string arguments of text drawing instructions in the content stream, ignoring the encoding information in the respectively current font object. Thus, only text drawn using a font with some ASCII'ish encoding is properly extracted.
As the text in question uses a custom ad-hoc encoding (generated while creating the page, containing the used characters in the order of their first occurrence), that extractText method is unable to extract the text.
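A toy sketch of how such an ad-hoc encoding defeats naive extraction (the mapping scheme and sample word are hypothetical, for illustration only; real fonts map codes to glyph IDs rather than plain integers):

```python
def build_adhoc_encoding(text):
    # Assign each distinct character the next free code, in order of
    # first occurrence -- like an ad-hoc PDF font encoding.
    mapping = {}
    for ch in text:
        if ch not in mapping:
            mapping[ch] = len(mapping) + 1
    return mapping

text = "Gemeindeverwaltung"
mapping = build_adhoc_encoding(text)

# The string bytes stored in the content stream are just these codes...
raw = bytes(mapping[ch] for ch in text)

# ...so decoding them as latin-1 (what extractText effectively does)
# yields control characters, not text. Only the font's reverse mapping
# can recover the original string.
reverse = {code: ch for ch, code in mapping.items()}
print(''.join(reverse[b] for b in raw))   # Gemeindeverwaltung
```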
Proper text extraction methods, on the other hand, can extract the text without issue as tested by Life is complex and documented in his answer.

Related

Python: Import text from HTML or text document into Word

I've been looking at some of the documentation, but all of the work I've seen around docx is primarily directed towards working with text already in a Word document. What I'd like to know is: is there a simple way to take text either from HTML or a text document and import it into a Word document wholesale, with all of the text in the HTML/text document? It doesn't seem to like the string; it's too long.
My understanding of the documentation is that you have to work with text on a paragraph-by-paragraph basis. The task I'd like to do is relatively simple; however, it's beyond my Python skills. I'd like to set up the margins on the Word document, and then import the text into the Word document so that it adheres to the margins I previously specified.
Does anyone have any thoughts? None of the previous posts I have found have been very helpful.
import textwrap
import requests
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Inches

class DocumentWrapper(textwrap.TextWrapper):
    def wrap(self, text):
        split_text = text.split('\n\n')
        lines = [line for para in split_text for line in textwrap.TextWrapper.wrap(self, para)]
        return lines

page = requests.get("http://classics.mit.edu/Aristotle/prior.mb.txt")
soup = BeautifulSoup(page.text, "html.parser")

# we are going to pull in the text wrap extension that we have added.
# The typical width that we want tow
text_wrap_extension = DocumentWrapper(width=82, initial_indent="", fix_sentence_endings=True)
new_string = text_wrap_extension.fill(page.text)

final_document = "Prior_Analytics.txt"
with open(final_document, "w") as f:
    f.writelines(new_string)

document = Document(final_document)

### Specified margin specifications
sections = document.sections
for section in sections:
    section.top_margin = Inches(1.00)
    section.bottom_margin = Inches(1.00)
    section.right_margin = Inches(1.00)
    section.left_margin = Inches(1.00)

document.save(final_document)
The error that I get thrown is below:
docx.opc.exceptions.PackageNotFoundError: Package not found at 'Prior_Analytics.txt'
This error simply means there is no .docx file at the location you specified, so you can modify your code to create the file if it doesn't exist.
final_document = "Prior_Analytics.txt"
with open(final_document, "w+") as f:
    f.writelines(new_string)
You are providing a relative path. How do you know what Python's current working directory is? That's where the relative path you give will start from.
A couple lines of code like this will tell you:
import os
print(os.path.realpath('./'))
Note that:
docx is used to open .docx files
I got it.
document = Document()

sections = document.sections
for section in sections:
    section.top_margin = Inches(2)
    section.bottom_margin = Inches(2)
    section.left_margin = Inches(2)
    section.right_margin = Inches(2)

# add_paragraph accepts text of whatever size.
document.add_paragraph("Add your text here.")

# name of the document goes here, as a string.
document.save("your_document.docx")

How to convert whole pdf to text in python

I have to convert a whole PDF to text. I have seen many places converting PDF to text, but only for a particular page.
from PyPDF2 import PdfFileReader
import os

def text_extractor(path):
    with open(os.path.join(path, file), 'rb') as f:
        pdf = PdfFileReader(f)

        ### Here i can specify page but i need to convert whole pdf without specifying pages ###
        page = pdf.getPage(0)
        text = page.extractText()
        print(text)

if __name__ == '__main__':
    path = "C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
How can I convert a whole PDF file to text without using getPage()?
You may want to use textract as this answer recommends to get the full document if all you want is the text.
If you want to use PyPDF2, then you can first get the number of pages and then iterate over each page, such as:
from PyPDF2 import PdfFileReader
import os

def text_extractor(path):
    with open(os.path.join(path, file), 'rb') as f:
        pdf = PdfFileReader(f)
        text = ""
        for page_num in range(pdf.getNumPages()):
            page = pdf.getPage(page_num)
            text += page.extractText()
        print(text)

if __name__ == '__main__':
    path = "C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
Though you may want to remember which page the text came from, in which case you could use a list:
page_text = []
for page_num in range(pdf.getNumPages()):  # For each page
    page = pdf.getPage(page_num)           # Get that page's reference
    page_text.append(page.extractText())   # Add that page to our array

for page in page_text:
    print(page)  # print each page
You could use tika to accomplish this task, but the output needs a little cleaning.
from tika import parser

parse_entire_pdf = parser.from_file('mypdf.pdf', xmlContent=True)
parse_entire_pdf = parse_entire_pdf['content']
print(parse_entire_pdf)
This answer uses PyPDF2 and encode('utf-8') to keep the output per page together.
from PyPDF2 import PdfFileReader

def pdf_text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)

        # Get total pdf page number.
        totalPageNumber = pdf.numPages

        currentPageNumber = 0
        while currentPageNumber < totalPageNumber:
            page = pdf.getPage(currentPageNumber)
            text = page.extractText()

            # The encoding puts each page on a single line.
            # type is <class 'bytes'>
            print(text.encode('utf-8'))

            #################################
            # This outputs the text to a list,
            # but it doesn't keep paragraphs
            # together
            #################################
            # output = text.encode('utf-8')
            # split = str(output, 'utf-8').split('\n')
            # print(split)
            #################################

            # Process next page.
            currentPageNumber += 1

path = 'mypdf.pdf'
pdf_text_extractor(path)
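As an aside, the reason encode('utf-8') keeps each page on one line is not the encoding itself but how bytes objects print: the repr shows a newline as a two-character escape instead of breaking the line. A minimal illustration:

```python
page_text = "line one\nline two"

# Printing the str breaks at the newline...
print(page_text)

# ...while printing the bytes shows its repr, with the newline as a
# literal backslash-n escape, so the whole page stays on a single line.
print(page_text.encode('utf-8'))   # b'line one\nline two'
```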
Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":
from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""

try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
PDF is a page-oriented format & therefore you'll need to deal with the concept of pages.
What makes it perhaps even more difficult, you're not guaranteed that the text excerpts you're able to extract are extracted in the same order as they are presented on the page: PDF allows one to say "put this text within a 4x3 box situated 1" from the top, with a 1" left margin.", and then I can put the next set of text somewhere else on the same page.
Your extractText() function simply gets the extracted text blocks in document order, not presentation order.
Tables are notoriously difficult to extract in a common, meaningful way. You see them as tables; PDF sees them as text blocks placed on the page with little or no relationship between them.
Still, getPage() and extractText() are good starting points & if you have simply formatted pages, they may work fine.
I found a very simple way to do this.
You have to follow these steps:
Install PyPDF2: if you use Anaconda, search for Anaconda Prompt and type the following command (you need administrator permission to do this):
pip install PyPDF2
If you're not using Anaconda, you have to install pip and add its path to your cmd or terminal.
Python code: the following code shows how to convert a pdf file very easily:
import PyPDF2

with open("pdf file path here", 'rb') as file_obj:
    pdf_reader = PyPDF2.PdfFileReader(file_obj)
    raw = pdf_reader.getPage(0).extractText()
    print(raw)
I just used the pdftotext module to get this done easily.
import pdftotext

# Load your PDF
with open("test.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# creating a text file after iterating through all pages in the pdf
with open("test.txt", "w") as file:
    for page in pdf:
        file.write(page)
Link: https://github.com/manojitballav/pdf-text

How to translate url encoded string python

This code is supposed to download a list of PDFs into a directory:
for pdf in preTag:
    pdfUrl = "https://the-eye.eu/public/Books/Programming/" + pdf.get("href")
    print("Downloading...%s" % pdfUrl)

    # downloading pdf from url
    page = requests.get(pdfUrl)
    page.raise_for_status()

    # saving pdf to new directory
    pdfFile = open(os.path.join(filePath, os.path.basename(pdfUrl)), "wb")
    for chunk in page.iter_content(1000000):
        pdfFile.write(chunk)
    pdfFile.close()
I used os.path.basename() just to make sure the files would actually download. However, I want to know how to change the file name from 3D%20Printing%20Blueprints%20%5BeBook%5D.pdf to something like "3D Printing Blueprints.pdf"
You can use the unquote function (urllib2.unquote in Python 2; urllib.parse.unquote in Python 3):
from urllib.parse import unquote
print(unquote("3D%20Printing%20Blueprints%20%5BeBook%5D.pdf"))  # 3D Printing Blueprints [eBook].pdf
use this:
os.rename("3D%20Printing%20Blueprints%20%5BeBook%5D.pdf", "3D Printing Blueprints.pdf")
you can find more info here
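Putting both steps together with Python 3's urllib.parse.unquote, you can derive a clean filename from the URL before saving, so no rename step is needed; stripping the "[eBook]" tag afterwards is a plain string replace:

```python
import os
from urllib.parse import unquote

pdf_url = ("https://the-eye.eu/public/Books/Programming/"
           "3D%20Printing%20Blueprints%20%5BeBook%5D.pdf")

# Percent-decode the last path segment: %20 -> space, %5B/%5D -> brackets.
file_name = unquote(os.path.basename(pdf_url))
print(file_name)    # 3D Printing Blueprints [eBook].pdf

# Drop the "[eBook]" tag for the final name.
clean_name = file_name.replace(" [eBook]", "")
print(clean_name)   # 3D Printing Blueprints.pdf
```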

PDFMiner conditional extraction of text

So I've just played around with PDFMiner and can now extract text from a PDF and throw it into an html or textfile.
pdf2txt.py -o outputfile.txt -t txt inputfile.pdf
I have then written a simple script to extract all lines containing a certain string:
with open('output.txt', 'r') as searchfile:
    for line in searchfile:
        if 'HELLO' in line:
            print(line)
And now I can use all these strings containing the word HELLO to add to my database, if that is what I wanted.
My question is:
Is that the only way, or can PDFMiner grab conditional stuff before even spitting it out to the txt, html, or even straight into the database?
Well, yes, you can: PDFMiner has an API.
The basic example says:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfpage import PDFTextExtractionNotAllowed
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfdevice import PDFDevice

# Open a PDF file.
fp = open('mypdf.pdf', 'rb')

# Create a PDF parser object associated with the file object.
parser = PDFParser(fp)

# Create a PDF document object that stores the document structure.
# Supply the password for initialization (empty string if none).
document = PDFDocument(parser, password='')

# Check if the document allows text extraction. If not, abort.
if not document.is_extractable:
    raise PDFTextExtractionNotAllowed

# Create a PDF resource manager object that stores shared resources.
rsrcmgr = PDFResourceManager()

# Create a PDF device object.
device = PDFDevice(rsrcmgr)

# Create a PDF interpreter object.
interpreter = PDFPageInterpreter(rsrcmgr, device)

# Process each page contained in the document.
for page in PDFPage.create_pages(document):
    interpreter.process_page(page)
    # do stuff with the page here
and in the loop you should go with
    # receive the LTPage object for the page.
    layout = device.get_result()
and then use the LTTextBox objects. You have to analyse them. There's no full example in the docs, but you may check out the pdf2txt.py source, which will help you find the missing pieces (although it does much more, since it parses options and applies them).
And once you have the code that extracts the text, you can do what you want before saving a file. Including searching for certain parts of text.
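For the conditional part specifically, once the extracted text is in a string you can filter it in memory before anything touches a file or database. A minimal sketch (the extracted_text sample is a hypothetical stand-in for whatever the layout analysis produces):

```python
# Stand-in for text produced by the extraction code above.
extracted_text = "intro line\nHELLO world\nother line\nsay HELLO again"

# Keep only lines matching the condition -- no intermediate txt file.
matches = [line for line in extracted_text.splitlines() if 'HELLO' in line]
print(matches)   # ['HELLO world', 'say HELLO again']
```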
PS: looks like this was, in a way, asked before: How do I use pdfminer as a library, which should be helpful too.

Python django writing ascii characters in coded format in file

I am using Django to generate an abc.tex file.
I am displaying the data in the browser, and I am writing the same data to the tex file like this:
with open("sample.tex") as f:
    t = Template(f.read())

head = ['name', 'class']
c = Context({"head": headers, "table": rowlist})

# Render template
output = t.render(c)
with open("mytable.tex", 'w') as out_f:
    out_f.write(output)
Now in the browser I can see the text as speaker-hearer's, but in the file it comes out as speaker-hearer&#39;s.
How can I fix that?
As far as I know, the browser decodes this data automatically, but the text within the file will be raw, so you are seeing the data "as it is".
Maybe you can use the HTMLParser library to unescape the data generated by Django (output) before writing it to the abc.tex file.
For your sample string:
import HTMLParser

h = HTMLParser.HTMLParser()
s = "speaker-hearer&#39;s"
s = h.unescape(s)
So then it would be just a matter of unescaping your output when you write it to a file, and probably handling the parsing exception.
Source (see step #3)
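Note that the HTMLParser module shown above is from Python 2, and its unescape method was removed in Python 3.9. On Python 3 the same decoding lives in the standard-library html module:

```python
import html

# "&#39;" is the numeric character reference for the apostrophe.
s = "speaker-hearer&#39;s"
print(html.unescape(s))   # speaker-hearer's
```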
