I have a text file with some basic text:
For more information on this topic, go to (http://moreInfo.com)
This tool is available from (https://www.someWebsite.co.uk)
Contacts (https://www.contacts.net)
I would like the URLs to show up as hyperlinks in a QTextBrowser, so that when clicked, the web browser will open and load the website. I have seen this post, which uses an anchor tag along the lines of:
<a href="…">Bar</a>
but as the text file can be edited by anyone (i.e. they might include text which does not provide a web address), I would like it if these addresses, if any, can be automatically hyperlinked before being added to the text browser.
This is how I read the text file:
def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    text_browser.setText(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
Edit:
Made some headway and managed to get the hyperlinks using the following (assuming the links are inside parentheses):
import re

def info(self):
    text_browser = self.dockwidget.text_browser
    file_path = 'path/to/text.txt'
    f = open(file_path, 'r')
    text = f.read()
    urls = re.findall(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', text)
    for x in urls:
        if x in text:
            # wrap each found URL in an anchor tag
            text = text.replace(x, '<a href="' + x + '">' + x + '</a>')
    text_browser.setHtml(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()
However, it all appears in one line and not in the same format as in the text file. How could I solve this?
Matching urls correctly is more complex than your current solution might suggest. For a full breakdown of the issues, see: What is the best regular expression to check if a string is a valid URL?
The other problem is much easier to solve. To preserve newlines, you can use this:
text = '<br>'.join(text.splitlines())
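Putting the two pieces together, a sketch along these lines should work (it reuses the dockwidget and file-path names from the question; the URL pattern is only illustrative, not a robust matcher):

import re

def info(self):
    text_browser = self.dockwidget.text_browser
    with open('path/to/text.txt', 'r') as f:
        text = f.read()
    # Wrap each URL in an anchor tag; the pattern stops at whitespace or a closing parenthesis.
    text = re.sub(r'(https?://[^\s)]+)', r'<a href="\1">\1</a>', text)
    # Preserve the original line breaks when switching to HTML.
    text = '<br>'.join(text.splitlines())
    text_browser.setHtml(text)
    text_browser.setOpenExternalLinks(True)
    self.dockwidget.show()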
I have written code that extracts the text from a PDF file with Python and the PyPDF2 lib.
The code works well for most documents, but sometimes it returns strange characters. I think that's because the PDF has a watermark over the page, so it does not recognise the text:
import requests
from io import StringIO, BytesIO
import PyPDF2

def pdf_content_extraction(pdf_link):
    all_pdf_content = ''

    # sending requests
    response = requests.get(pdf_link)
    my_raw_data = response.content

    pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'

    # extract text page by page
    with BytesIO(my_raw_data) as data:
        read_pdf = PyPDF2.PdfFileReader(data)

        # looping through each page
        for page in range(read_pdf.getNumPages()):
            page_content = read_pdf.getPage(page).extractText()
            page_content = page_content.replace("\n\n\n", "\n").strip()

            # store data into variable for each page
            pdf_file_text += page_content + '\n\nPAGE ' + str(page+1) + '/' + str(read_pdf.getNumPages()) + '\n\n\n'

        all_pdf_content += pdf_file_text + "\n\n"

    return all_pdf_content

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
print(pdf_content_extraction(pdf_link))
This is the result that I'm getting:
#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...
My question is: how can I fix this problem? Is there a way to remove the watermark from the page, or something like that?
I mean, maybe this problem can be fixed in some other way; maybe the problem is not that watermark/logo at all?
The garbled text issue that you're having has nothing to do with the watermark in the document. Your issue seems to be related to the encoding in the document. PyPDF2 should be able to extract the German characters in your document, because it uses the latin-1 (iso-8859-1) encoding/decoding model, but this encoding model isn't working with your PDF.
When I look at the underlying info of your PDF I note that it was created using these apps:
'Producer': 'GPL Ghostscript 9.10'
'Creator': 'PDFCreator Version 1.7.3'
When I look at one of the PDFs in this question also written in German, I note that it was created using different apps:
'/Creator': 'Acrobat PDFMaker 11 für Excel'
'/Producer': 'Adobe PDF Library 11.0'
I can read the second file perfectly with PyPDF2.
When I look at this file from your other question, I noted that it also cannot be read correctly by PyPDF2. This file was created with the same apps as the file from this bounty question.
'Producer': 'GPL Ghostscript 9.10'
'Creator': 'PDFCreator Version 1.7.3'
This is the same file that throws an error when attempting to extract the text using pdfreader.SimplePDFViewer.
I looked at the bugs for Ghostscript and noted that there are some font-related issues in Ghostscript 9.10, which was released in 2015. I also noted that some people mentioned that PDFCreator Version 1.7.3, released in 2018, also had some font-embedding issues.
I have been trying to find the correct decoding/encoding sequence, but so far I haven't been able to extract the text correctly.
Here are some of the sequences:
page_content.encode('raw_unicode_escape').decode('ascii', 'xmlcharrefreplace')
# output
\u02d8
\u02c7\u02c6\u02d9\u02dd\u02d9\u02db\u02da\u02d9\u02dc
\u02d8\u02c6!"""\u02c6\u02d8\u02c6!
page_content.encode('ascii', 'xmlcharrefreplace').decode('raw_unicode_escape')
# output
# ˘
ˇˆ˙˝˙˛˚˙˜
˘ˆ!"""ˆ˘ˆ!
I will keep looking for the correct encoding/decoding sequence to use with PyPDF2. It is worth noting that PyPDF2 hasn't been updated since May 18, 2016, and encoding issues are a common problem with the module. Plus, maintenance of this module is dead, hence the ports PyPDF3 and PyPDF4.
I attempted to extract the text from your PDF using PyPDF2, PyPDF3 and PyPDF4. All 3 modules failed to extract the content from the PDF that you provided.
You can definitely extract the content from your document using other Python modules.
Tika
This example uses Tika and BeautifulSoup to extract the content in German from your source document.
import requests
from tika import parser
from io import BytesIO
from bs4 import BeautifulSoup

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)

with BytesIO(response.content) as data:
    parse_pdf = parser.from_buffer(data, xmlContent=True)

    # Parse metadata from the PDF
    metadata = parse_pdf['metadata']

    # Parse the content from the PDF
    content = parse_pdf['content']

    # Convert double newlines into single newlines
    content = content.replace('\n\n', '\n')

    soup = BeautifulSoup(content, "lxml")
    body = soup.find('body')
    for p_tag in body.find_all('p'):
        print(p_tag.text.strip())
pdfminer
This example uses pdfminer to extract the content from your source document.
import requests
from io import BytesIO
from pdfminer.high_level import extract_text

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)

with BytesIO(response.content) as data:
    text = extract_text(data, password='', page_numbers=None, maxpages=0, caching=True,
                        codec='utf-8', laparams=None)
    print(text.replace('\n\n', '\n').strip())
def remove_watermark(wm_text, inputFile, outputFile):
    from PyPDF4 import PdfFileReader, PdfFileWriter
    from PyPDF4.pdf import ContentStream
    from PyPDF4.generic import TextStringObject, NameObject
    from PyPDF4.utils import b_

    with open(inputFile, "rb") as f:
        source = PdfFileReader(f, "rb")
        output = PdfFileWriter()

        for page in range(source.getNumPages()):
            page = source.getPage(page)
            content_object = page["/Contents"].getObject()
            content = ContentStream(content_object, source)

            for operands, operator in content.operations:
                if operator == b_("Tj"):
                    text = operands[0]
                    if isinstance(text, str) and text.startswith(wm_text):
                        operands[0] = TextStringObject('')

            page.__setitem__(NameObject('/Contents'), content)
            output.addPage(page)

        with open(outputFile, "wb") as outputStream:
            output.write(outputStream)

wm_text = 'wm_text'
inputFile = r'input.pdf'
outputFile = r"output.pdf"
remove_watermark(wm_text, inputFile, outputFile)
In contrast to my initial assumption in comments to the question, the issue is not some missing ToUnicode map. I didn't see the URL to the file immediately and, therefore, guessed. Instead, the issue is a very primitively implemented text extraction method.
The PageObject method extractText is documented as follows:
extractText()
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Returns: a unicode string object.
(PyPDF2 1.26.0 documentation, visited 2021-03-15)
So this method extracts the string arguments of text drawing instructions in the content stream, ignoring the encoding information in the respectively current font object. Thus, only text drawn using a font with some ASCII'ish encoding is properly extracted.
As the text in question uses a custom ad-hoc encoding (generated while creating the page, containing the used characters in the order of their first occurrence), that extractText method is unable to extract the text.
Proper text extraction methods, on the other hand, can extract the text without issue as tested by Life is complex and documented in his answer.
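If you want to see that ad-hoc encoding for yourself, a small inspection sketch with PyPDF2's low-level objects can dump the font dictionaries of a page. This assumes the PDF from the question has been saved locally as 2020.10.12.pdf; the /Encoding entry (often a /Differences array) is where the custom mapping lives:

from PyPDF2 import PdfFileReader

with open('2020.10.12.pdf', 'rb') as f:
    reader = PdfFileReader(f)
    page = reader.getPage(0)
    fonts = page['/Resources']['/Font']
    for name in fonts:
        font = fonts[name]                   # indirect references are resolved on access
        encoding = font.get('/Encoding')
        if encoding is not None:
            encoding = encoding.getObject()  # resolve if it is an indirect reference
        print(name, font.get('/BaseFont'), encoding)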
I've been looking at some of the documentation, but all of the work I've seen around docx is primarily directed towards working with text already in a Word document. What I'd like to know is: is there a simple way to take text from either HTML or a text document and import it into a Word document wholesale, with all of the text in the HTML/text document? It doesn't seem to like the string; it's too long.
My understanding of the documentation is that you have to work with text on a paragraph-by-paragraph basis. The task I'd like to do is relatively simple; however, it's beyond my Python skills. I'd like to set up the margins on the Word document and then import the text into the Word document so that it adheres to the margins I previously specified.
Does anyone have any thoughts? None of the previous posts I have found have been very helpful.
import textwrap
import requests
from bs4 import BeautifulSoup
from docx import Document
from docx.shared import Inches

class DocumentWrapper(textwrap.TextWrapper):

    def wrap(self, text):
        split_text = text.split('\n\n')
        lines = [line for para in split_text for line in textwrap.TextWrapper.wrap(self, para)]
        return lines

page = requests.get("http://classics.mit.edu/Aristotle/prior.mb.txt")
soup = BeautifulSoup(page.text, "html.parser")

# we are going to pull in the text wrap extension that we have added.
# The typical width that we want tow
text_wrap_extension = DocumentWrapper(width=82, initial_indent="", fix_sentence_endings=True)
new_string = text_wrap_extension.fill(page.text)

final_document = "Prior_Analytics.txt"

with open(final_document, "w") as f:
    f.writelines(new_string)

document = Document(final_document)

### Specified margin specifications
sections = document.sections
for section in sections:
    section.top_margin = (Inches(1.00))
    section.bottom_margin = (Inches(1.00))
    section.right_margin = (Inches(1.00))
    section.left_margin = (Inches(1.00))

document.save(final_document)
The error that I get thrown is below:
docx.opc.exceptions.PackageNotFoundError: Package not found at 'Prior_Analytics.txt'
This error simply means there is no .docx file at the location you specified, so you can modify your code to create the file if it doesn't exist.
final_document = "Prior_Analytics.txt"
with open(final_document, "w+") as f:
    f.writelines(new_string)
You are providing a relative path. How do you know what Python's current working directory is? That's where the relative path you give will start from.
A couple lines of code like this will tell you:
import os
print(os.path.realpath('./'))
Note that:
docx is used to open .docx files, not plain-text files such as the Prior_Analytics.txt you are passing to Document()
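As a sketch of that point: instead of passing the .txt path to Document(), read the plain-text file yourself and build a new .docx from it (Prior_Analytics.txt comes from the question's code; the output name is only an example):

from docx import Document
from docx.shared import Inches

document = Document()
for section in document.sections:
    section.top_margin = section.bottom_margin = Inches(1)
    section.left_margin = section.right_margin = Inches(1)

# Read the wrapped text produced earlier and add it paragraph by paragraph.
with open("Prior_Analytics.txt") as f:
    for para in f.read().split("\n\n"):
        document.add_paragraph(para)

document.save("Prior_Analytics.docx")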
I got it.
document = Document()

sections = document.sections
for section in sections:
    section.top_margin = Inches(2)
    section.bottom_margin = Inches(2)
    section.left_margin = Inches(2)
    section.right_margin = Inches(2)

document.add_paragraph(new_string)    # add your text here; add_paragraph accepts text of whatever size
document.save("your_document.docx")   # name of document goes here, as a string
I have to convert a whole PDF to text. I have seen many examples of converting a PDF to text, but only for a particular page.
from PyPDF2 import PdfFileReader
import os

def text_extractor(path):
    with open(os.path.join(path, file), 'rb') as f:
        pdf = PdfFileReader(f)
        ###Here i can specify page but i need to convert whole pdf without specifying pages###
        page = pdf.getPage(0)
        text = page.extractText()
        print(text)

if __name__ == '__main__':
    path = "C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
How can I convert the whole PDF file to text without using getPage()?
You may want to use textract as this answer recommends to get the full document if all you want is the text.
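For example, a minimal sketch of the textract route (assuming textract and its system dependencies are installed; the file name is a placeholder):

import textract

# process() returns the extracted text of the whole document as bytes
text = textract.process("mypdf.pdf")
print(text.decode("utf-8"))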
If you want to use PyPDF2 then you can first get the number of pages then iterate over each page such as:
from PyPDF2 import PdfFileReader
import os

def text_extractor(path):
    with open(os.path.join(path, file), 'rb') as f:
        pdf = PdfFileReader(f)
        ###Here i can specify page but i need to convert whole pdf without specifying pages###
        text = ""
        for page_num in range(pdf.getNumPages()):
            page = pdf.getPage(page_num)
            text += page.extractText()
        print(text)

if __name__ == '__main__':
    path = "C:\\Users\\AAAA\\Desktop\\BB"
    for file in os.listdir(path):
        if not file.endswith(".pdf"):
            continue
        text_extractor(path)
Though you may want to remember which page the text came from in which case you could use a list:
page_text = []

for page_num in range(pdf.getNumPages()):    # For each page
    page = pdf.getPage(page_num)             # Get that page's reference
    page_text.append(page.extractText())     # Add that page to our array

for page in page_text:
    print(page)  # print each page
You could use tika to accomplish this task, but the output needs a little cleaning.
from tika import parser
parse_entire_pdf = parser.from_file('mypdf.pdf', xmlContent=True)
parse_entire_pdf = parse_entire_pdf['content']
print (parse_entire_pdf)
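As a sketch of that "little cleaning" (assuming the same file; here the default plain-text output is used instead of xmlContent=True, and the extra blank lines tika leaves behind are collapsed):

from tika import parser

parsed = parser.from_file('mypdf.pdf')   # plain text instead of XHTML
content = parsed['content'] or ''
# Drop the blank lines tika inserts between text runs.
cleaned = "\n".join(line for line in content.splitlines() if line.strip())
print(cleaned)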
This answer uses PyPDF2 and encode('utf-8') to keep the output per page together.
from PyPDF2 import PdfFileReader

def pdf_text_extractor(path):
    with open(path, 'rb') as f:
        pdf = PdfFileReader(f)

        # Get total pdf page number.
        totalPageNumber = pdf.numPages

        currentPageNumber = 0
        while (currentPageNumber < totalPageNumber):
            page = pdf.getPage(currentPageNumber)
            text = page.extractText()

            # The encoding put each page on a single line.
            # type is <class 'bytes'>
            print(text.encode('utf-8'))

            #################################
            # This outputs the text to a list,
            # but it doesn't keep paragraphs
            # together
            #################################
            # output = text.encode('utf-8')
            # split = str(output, 'utf-8').split('\n')
            # print (split)
            #################################

            # Process next page.
            currentPageNumber += 1

path = 'mypdf.pdf'
pdf_text_extractor(path)
Try pdfreader. You can extract either plain text or decoded text containing "pdf markdown":
from pdfreader import SimplePDFViewer, PageDoesNotExist

fd = open(your_pdf_file_name, "rb")
viewer = SimplePDFViewer(fd)

plain_text = ""
pdf_markdown = ""

try:
    while True:
        viewer.render()
        pdf_markdown += viewer.canvas.text_content
        plain_text += "".join(viewer.canvas.strings)
        viewer.next()
except PageDoesNotExist:
    pass
PDF is a page-oriented format & therefore you'll need to deal with the concept of pages.
What makes it perhaps even more difficult, you're not guaranteed that the text excerpts you're able to extract are extracted in the same order as they are presented on the page: PDF allows one to say "put this text within a 4x3 box situated 1" from the top, with a 1" left margin.", and then I can put the next set of text somewhere else on the same page.
Your extractText() function simply gets the extracted text blocks in document order, not presentation order.
Tables are notoriously difficult to extract in a common, meaningful way... You see them as tables, PDF sees them as text blocks placed on the page with little or no relationship.
Still, getPage() and extractText() are good starting points & if you have simply formatted pages, they may work fine.
I found a very simple way to do this. You have to follow these steps:
Install PyPDF2: if you use Anaconda, search for Anaconda Prompt and type the following command (you need administrator permission for this):
pip install PyPDF2
If you're not using Anaconda, you have to install pip and add its path
to your cmd or terminal.
Python code: the following code shows how to convert a pdf file very easily:
import PyPDF2

with open("pdf file path here", 'rb') as file_obj:
    pdf_reader = PyPDF2.PdfFileReader(file_obj)
    raw = pdf_reader.getPage(0).extractText()
    print(raw)
I just used pdftotext module to get this done easily.
import pdftotext

# Load your PDF
with open("test.pdf", "rb") as f:
    pdf = pdftotext.PDF(f)

# creating a text file after iterating through all pages in the pdf
file = open("test.txt", "w")
for page in pdf:
    file.write(page)
file.close()
Link: https://github.com/manojitballav/pdf-text
What I am trying to accomplish here is basically to have the regex return the match I want, based on the pattern, from a text file that Python has created and written to.
Currently I am getting a TypeError: 'NoneType' object is not iterable error and I am not sure why. If I need to provide more information, let me know.
# Opens Temp file
TrueURL = open("TrueURL_tmp.txt", "w+")

# Reviews Data grabbed from BeautifulSoup and write urls to file
for link in g_data:
    TrueURL.write(link.get("href") + '\n')

# Creates Regex Pattern for TrueURL_tmp
pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')
search_pattern = re.search(pattern, str(TrueURL))

# Uses Regex Pattern against TrueURL_tmp file.
for url in search_pattern:
    print(url)

# Closes and deletes file
TrueURL.close()
os.remove("TrueURL_tmp.txt")
Your search is returning no match because you are doing it on the str representation of the file object not the actual file content.
You are basically searching something like:
<open file 'TrueURL_tmp.txt', mode 'w+' at 0x7f2d86522390>
If you want to search the file content, close the file so the content is definitely written, then reopen it and read the lines, or maybe just do the search inside the for link in g_data: loop.
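For example, a minimal sketch of that close-then-reopen approach (g_data and the pattern are taken from the question):

import re

with open("TrueURL_tmp.txt", "w+") as TrueURL:
    for link in g_data:
        TrueURL.write(link.get("href") + '\n')

# the file is closed here, so its content is flushed to disk
pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')
with open("TrueURL_tmp.txt") as f:
    for url in pattern.findall(f.read()):
        print(url)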
If you actually want to write to a temporary file then use a tempfile:
from tempfile import TemporaryFile

with TemporaryFile() as f:
    for link in g_data:
        f.write(link.get("href") + '\n')
    f.seek(0)
    # Creates Regex Pattern for TrueURL_tmp
    pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')
    search_pattern = re.search(pattern, f.read())
search_pattern is a _sre.SRE_Match object, so you would call group, i.e. print(search_pattern.group()), or maybe you want to use findall:
search_pattern = re.findall(pattern, f.read())

for url in search_pattern:
    print(url)
I still think doing the search before you write anything might be the best approach, and maybe not writing at all, but I am not fully sure what it is you actually want to do, because I don't see how the file fits into what you are doing; concatenating to a string would achieve the same:
pattern = re.compile(r'thread/.*/*apple|thread/.*/*potato')

for link in g_data:
    match = pattern.search(link.get("href"))
    if match:
        print(match.group())
Here is the solution I have found to answer my original question, although Padraic's way is correct and a less painful process.
with TemporaryFile() as f:
    for link in g_data:
        f.write(bytes(link.get("href") + '\n', 'UTF-8'))
    f.seek(0)
    # Creates Regex Pattern for TrueURL_tmp (a bytes pattern, since the temp file is read back as bytes)
    pattern = re.compile(rb'thread/.*/*apple|thread/.*/*potato')
    read = f.read()
    search_pattern = re.findall(pattern, read)

# Uses Regex Pattern against TrueURL_tmp file.
for url in search_pattern:
    print(url.decode('utf-8'))
I'm using the python-docx module to do some edits on a large number of documents. They all contain a header in which I need to replace a number, but every time I do this the document won't open, with an error that the content is unreadable. Does anyone have any ideas as to why this is happening, or sample working code snippets? Thanks.
from docx import *
#document = yourdocument.docx
filename = "NUR-ADM-2001"
relationships = relationshiplist()
document = opendocx("C:/Users/ai/My Documents/Nursing docs/" + filename + ".docx")
docbody = document.xpath('/w:document/w:body',namespaces=nsprefixes)[0]
advReplace(docbody, "NUR-NPM 101", "NUR-NPM 202")
# Create our properties, contenttypes, and other support files
coreprops = coreproperties(title='Nursing Doc', subject='Policies', creator='IA', keywords=['Policy'])
appprops = appproperties()
contenttypes = contenttypes()
websettings = websettings()
wordrelationships = wordrelationships(relationships)
# Save our document
savedocx(document,coreprops,appprops,contenttypes,websettings, wordrelationships,"C:/Users/ai/My Documents/Nursing docs/" + filename + ".docx")
Edit: So it eventually can open the document, but it says some content cannot be displayed and the headers have vanished... thoughts?
I don't know this module, but in general you should not edit a file in place. Open file "A", write file "/tmp/A". Close both files and make sure you have no errors, then move "/tmp/A" to "A". Otherwise you risk clobbering your file if something goes wrong during the write.
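A minimal sketch of that pattern, with the actual document edit left as a placeholder (the file names are only examples):

import os
import shutil
import tempfile

src = "A.docx"

# Write the edited copy next to the original, then swap it in.
fd, tmp_path = tempfile.mkstemp(suffix=".docx",
                                dir=os.path.dirname(os.path.abspath(src)))
os.close(fd)
try:
    # ... write the edited document to tmp_path here ...
    shutil.copymode(src, tmp_path)   # keep the original file's permissions
    os.replace(tmp_path, src)        # atomic rename on the same filesystem
except Exception:
    os.remove(tmp_path)
    raise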