I am using Django to generate the abc.tex file.
I am displaying the data in the browser, and I am writing the same data to a .tex file like this:
from django.template import Template, Context

with open("sample.tex") as f:
    t = Template(f.read())

head = ['name', 'class']
c = Context({"head": head, "table": rowlist})

# Render template
output = t.render(c)

with open("mytable.tex", 'w') as out_f:
    out_f.write(output)
Now in the browser I can see the text as speaker-hearer's, but in the file it is coming out as speaker-hearer&#39;s.
How can I fix that?
As far as I know, the browser decodes this data automatically, but the text within the file will be raw, so you are seeing the data "as is".
Maybe you can use the HTMLParser library to unescape the data generated by Django (output) before writing it to the abc.tex file.
For your sample string:
import HTMLParser

h = HTMLParser.HTMLParser()
s = "speaker-hearer&#39;s"
s = h.unescape(s)  # now "speaker-hearer's"
So then it would be just a matter of unescaping your output when you write it to a file, and probably handling the parsing exception.
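If you are on Python 3, note that the HTMLParser module was renamed html.parser and the unescaping helper now lives in the html module; a minimal sketch of the same fix:

import html

s = "speaker-hearer&#39;s"
print(html.unescape(s))  # speaker-hearer's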
Source (see step #3)
I have written code that extracts the text from a PDF file with Python and the PyPDF2 lib.
The code works well for most docs, but sometimes it returns some strange characters. I think that's because the PDF has a watermark over the page, so it does not recognise the text:
import requests
from io import StringIO, BytesIO
import PyPDF2

def pdf_content_extraction(pdf_link):
    all_pdf_content = ''

    #sending requests
    response = requests.get(pdf_link)
    my_raw_data = response.content

    pdf_file_text = 'PDF File: ' + pdf_link + '\n\n'

    #extract text page by page
    with BytesIO(my_raw_data) as data:
        read_pdf = PyPDF2.PdfFileReader(data)

        #looping through each page
        for page in range(read_pdf.getNumPages()):
            page_content = read_pdf.getPage(page).extractText()
            page_content = page_content.replace("\n\n\n", "\n").strip()

            #store data into variable for each page
            pdf_file_text += page_content + '\n\nPAGE ' + str(page+1) + '/' + str(read_pdf.getNumPages()) + '\n\n\n'

    all_pdf_content += pdf_file_text + "\n\n"
    return all_pdf_content

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
print(pdf_content_extraction(pdf_link))
This is the result that I'm getting:
#$%˘˘
&'(˝˙˝˙)*+"*˜
˜*
,*˜*˜ˆ+-*˘!(
.˜($*%(#%*˜-/
"*
*˜˜0!0˘˘*˜˘˜ˆ
+˜(%
*
*(+%*˜+"*˜'
$*1˜ˆ
...
...
My question is: how can I fix this problem?
Is there a way to remove the watermark from the page, or something like that?
Maybe this problem can be fixed in some other way; maybe the problem is not the watermark/logo at all?
The garbled text issue that you're having has nothing to do with the watermark in the document. Your issue is related to the encoding in the document. The German characters within your document should be extractable with PyPDF2, because it uses the latin-1 (iso-8859-1) encoding/decoding model, but this encoding model isn't working with your PDF.
When I look at the underlying info of your PDF I note that it was created using these apps:
'/Producer': 'GPL Ghostscript 9.10'
'/Creator': 'PDFCreator Version 1.7.3'
When I look at one of the PDFs in this question also written in German, I note that it was created using different apps:
'/Creator': 'Acrobat PDFMaker 11 für Excel'
'/Producer': 'Adobe PDF Library 11.0'
I can read the second file perfectly with PyPDF2.
When I look at this file from your other question, I noted that it also cannot be read correctly by PyPDF2. That file was created with the same apps as the file from this bounty question.
'/Producer': 'GPL Ghostscript 9.10'
'/Creator': 'PDFCreator Version 1.7.3'
This is the same file that throws an error when attempting to extract the text using pdfreader.SimplePDFViewer.
I looked at the bugs for Ghostscript and noted that there are some font-related issues for Ghostscript 9.10, which was released in 2015. I also noted that some people mentioned that PDFCreator Version 1.7.3, released in 2018, also had some font embedding issues.
I have been trying to find the correct decoding/encoding sequence, but so far I haven't been able to extract the text correctly.
Here are some of the sequences:
page_content.encode('raw_unicode_escape').decode('ascii', 'xmlcharrefreplace')
# output
\u02d8
\u02c7\u02c6\u02d9\u02dd\u02d9\u02db\u02da\u02d9\u02dc
\u02d8\u02c6!"""\u02c6\u02d8\u02c6!
page_content.encode('ascii', 'xmlcharrefreplace').decode('raw_unicode_escape')
# output
# ˘
ˇˆ˙˝˙˛˚˙˜
˘ˆ!"""ˆ˘ˆ!
I will keep looking for the correct encoding/decoding sequence to use with PyPDF2. It is worth noting that PyPDF2 hasn't been updated since May 18, 2016, and encoding issues are a common problem with the module. Plus, maintenance of this module is dead, hence the ports PyPDF3 and PyPDF4.
I attempted to extract the text from your PDF using PyPDF2, PyPDF3 and PyPDF4. All 3 modules failed to extract the content from the PDF that you provided.
You can definitely extract the content from your document using other Python modules.
Tika
This example uses Tika and BeautifulSoup to extract the content in German from your source document.
import requests
from tika import parser
from io import BytesIO
from bs4 import BeautifulSoup

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)
with BytesIO(response.content) as data:
    parse_pdf = parser.from_buffer(data, xmlContent=True)

    # Parse metadata from the PDF
    metadata = parse_pdf['metadata']

    # Parse the content from the PDF
    content = parse_pdf['content']

    # Convert double newlines into single newlines
    content = content.replace('\n\n', '\n')

    soup = BeautifulSoup(content, "lxml")
    body = soup.find('body')
    for p_tag in body.find_all('p'):
        print(p_tag.text.strip())
pdfminer
This example uses pdfminer to extract the content from your source document.
import requests
from io import BytesIO
from pdfminer.high_level import extract_text

pdf_link = 'http://www.dielsdorf.ch/dl.php/de/5f867e8255980/2020.10.12.pdf'
response = requests.get(pdf_link)
with BytesIO(response.content) as data:
    text = extract_text(data, password='', page_numbers=None, maxpages=0, caching=True,
                        codec='utf-8', laparams=None)
    print(text.replace('\n\n', '\n').strip())
PyPDF4
Regarding your watermark question, this example uses PyPDF4 to strip watermark text drawn with the Tj operator:

def remove_watermark(wm_text, inputFile, outputFile):
    from PyPDF4 import PdfFileReader, PdfFileWriter
    from PyPDF4.pdf import ContentStream
    from PyPDF4.generic import TextStringObject, NameObject
    from PyPDF4.utils import b_

    with open(inputFile, "rb") as f:
        source = PdfFileReader(f, "rb")
        output = PdfFileWriter()

        for page in range(source.getNumPages()):
            page = source.getPage(page)
            content_object = page["/Contents"].getObject()
            content = ContentStream(content_object, source)

            for operands, operator in content.operations:
                if operator == b_("Tj"):
                    text = operands[0]
                    if isinstance(text, str) and text.startswith(wm_text):
                        operands[0] = TextStringObject('')

            page.__setitem__(NameObject('/Contents'), content)
            output.addPage(page)

        with open(outputFile, "wb") as outputStream:
            output.write(outputStream)

wm_text = 'wm_text'
inputFile = r'input.pdf'
outputFile = r"output.pdf"

remove_watermark(wm_text, inputFile, outputFile)
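Note that this approach only blanks text drawn with the Tj operator whose string starts with wm_text; watermarks drawn as images, or text drawn with TJ arrays, are left untouched, so it may need adapting per document.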
In contrast to my initial assumption in comments to the question, the issue is not some missing ToUnicode map. I didn't see the URL to the file immediately and, therefore, guessed. Instead, the issue is a very primitively implemented text extraction method.
The PageObject method extractText is documented as follows:
extractText()
Locate all text drawing commands, in the order they are provided in the content stream, and extract the text. This works well for some PDF files, but poorly for others, depending on the generator used. This will be refined in the future. Do not rely on the order of text coming out of this function, as it will change if this function is made more sophisticated.
Returns: a unicode string object.
(PyPDF2 1.26.0 documentation, visited 2021-03-15)
So this method extracts the string arguments of text drawing instructions in the content stream, ignoring the encoding information in the respectively current font object. Thus, only text drawn using a font with some ASCII'ish encoding is properly extracted.
As the text in question uses a custom ad-hoc encoding (generated while creating the page, containing the used characters in the order of their first occurrence), that extractText method is unable to extract the text.
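As a toy illustration of the idea (not the actual PDF mechanism): with an ad-hoc encoding, character codes are handed out in order of first appearance, so the raw codes are meaningless without the font's own mapping.

# Toy sketch of an ad-hoc font encoding: codes are assigned in order
# of first occurrence, so they mean nothing without the font object's
# code -> glyph mapping.
text = "Gemeindeverwaltung Dielsdorf"
codes = {}
encoded = [codes.setdefault(ch, len(codes) + 1) for ch in text]
print(encoded)  # e.g. [1, 2, 3, 2, 4, ...]
# A naive extractor that maps these codes through latin-1 gets gibberish.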
Proper text extraction methods, on the other hand, can extract the text without issue as tested by Life is complex and documented in his answer.
I am trying to create a program that utilizes pdfminer to read a DnD Character Sheet (fillable PDF) and put the fill-ins into a dictionary. Upon editing the PDF and running the program again, I get a strange sequence of characters when printing the dictionary items. The code:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
import collections
filename = "Edited_CS.pdf"
fp = open(filename, 'rb')
my_dict = {}
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']
# Checks if PDF file is blank
if isinstance(fields, collections.abc.Sequence) is False:
    print("This Character Sheet is blank. Please submit a filled Character Sheet!")
else:
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        if value is None or str(value)[2:-1] == "":
            value = "b'None'"
        my_dict[str(name)[2:-1]] = str(value)[2:-1]
    for g in list(my_dict.items()):
        print(g)
The output from the unedited PDF file:
('ClassLevel', 'Assassin 1')
('Background', 'Lone Survivor')
('PlayerName', 'None')
('CharacterName', 'Tumas Mitshil')
('Race ', 'Human')
etc...
The output when it was edited (I changed the ClassLevel, etc. completely in the PDF):
('ClassLevel', '\\xfe\\xff\\x00C\\x00l\\x00a\\x00s\\x00s\\x00L\\x00e\\x00v\\x00e\\x00l')
('Background', '\\xfe\\xff\\x00B\\x00a\\x00c\\x00k\\x00g\\x00r\\x00o\\x00u\\x00n\\x00d\\x00r')
('PlayerName', '\\xfe\\xff\\x00P\\x00l\\x00a\\x00y\\x00e\\x00r\\x00N\\x00a\\x00m\\x00e')
('CharacterName', '\\xfe\\xff\\x00T\\x00h\\x00o\\x00m\\x00a\\x00s')
('Race ', '\\xfe\\xff\\x00R\\x00a\\x00c\\x00e')
('Alignment', '\\xfe\\xff\\x00A\\x00l\\x00i\\x00g\\x00n\\x00m\\x00e\\x00n\\x00t')
etc...
I know this is an encoding of some sort, and a few Google searches led me to believe it was UTF-8 encoded, so I attempted to decode the PDF when opening the file:
fp = open(filename, 'rb').read().decode('utf-8')
Unfortunately, I am met with an error:
Traceback (most recent call last):
File "main.py", line 16, in <module>
fp = open(filename, 'rb').read().decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
When I first made the PDF, I used Adobe Acrobat. However, I used Microsoft Edge to edit the file, which resulted in the problem I am facing. Here are the files:
Original File
Edited File
Is there any way to properly decode this? Is there a way to encode the edited PDF so it can be loaded into Python without trouble? And if this is one encoding, are there other forms of encoding, and how would I decode those?
Any help will be greatly appreciated.
You can fix the problem by using Adobe Acrobat Reader DC to edit the form fields. I've edited the form fields of Edited_CS.pdf using it and pdfminer.six returns the expected output.
Probably Microsoft Edge is causing this problem.
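As a workaround sketch: the \xfe\xff prefix in your output is a UTF-16BE byte-order mark, so (assuming the raw field values arrive as bytes) they can be decoded per the PDF text-string convention instead of being sliced with str(value)[2:-1]. This could slot into your existing loop; decode_pdf_text is a hypothetical helper name:

# Sketch: decode PDF text strings by their BOM (an assumption based on
# the \xfe\xff prefix in your output, which marks UTF-16BE).
def decode_pdf_text(raw):
    if isinstance(raw, bytes):
        if raw.startswith(b'\xfe\xff'):
            return raw.decode('utf-16')  # BOM-aware; strips the BOM
        return raw.decode('latin-1')     # fallback for plain byte strings
    return raw

name, value = decode_pdf_text(field.get('T')), decode_pdf_text(field.get('V'))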
After some digging, I was able to find a better solution. Instead of using pdfminer to open the PDF, I used PyPDF2. Somehow, it can read any PDF regardless of encoding, and it has a function that can automatically turn the fillable fields into a proper dictionary. The result is finer, cleaner code:
from PyPDF2 import PdfFileReader
infile = "Edited_CS.pdf"
pdf_reader = PdfFileReader(open(infile, "rb"))
dictionary = pdf_reader.getFormTextFields()
for g in list(dictionary.items()):
    print(g)
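(Note that getFormTextFields only covers text fields; for checkboxes and other widget types, PyPDF2 also offers getFields.)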
Regardless, thank you for all of your answers! :)
I have a script that reads a bunch of JavaScript files into a variable, and then places the contents of those files into placeholders in a Python template. This results in the value of the variable src (described below) being a valid HTML document including scripts.
import re
from string import Template

# Open the source HTML file to get the paths to the JavaScript files
f = open('srcfile.html', 'rU')
src = f.read()
f.close()
js_scripts = re.findall(r'script\ssrc="(.*)"', src)

# Put all of the scripts in a variable
js = ''
for script in js_scripts:
    f = open(script, 'rU')
    js = js + f.read() + '\n'
    f.close()

# Open/read the template
template = open('template.html')
templateSrc = Template(template.read())

# Substitute the scripts for the placeholder variable
src = str(templateSrc.safe_substitute(javascript_content=js))

# Write a Python file containing the string
with open('htmlSource.py', 'w') as f:
    f.write('#-*- coding: utf-8 -*-\n\nhtmlSrc = """' + src + '"""')
If I try to open it up via PyQt5/QtWebKit in Python...
from htmlSource import htmlSrc
webWidget.setHtml(htmlSrc)
...it doesn't load the JS files in the web widget. I just end up with a blank page.
But if I get rid of everything else and just write '"""src"""' to the file, when I open the file up in Chrome, it loads everything as expected. Likewise, it'll also load correctly in the web widget if I read from the file itself:
f = open('htmlSource.py', 'r')
htmlSrc = f.read()
webWidget.setHtml(htmlSrc)
In other words, when I run this script, it produces the Python output file with the variable; then I try to import that variable and pass it to webWidget.setHtml(); but the page doesn't render. But if I use open() and read it as a file, it does.
I suspect there's an encoding issue going on here. But I've tried several variations of encode and decode without any luck. The scripts are all UTF-8.
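For reference, one such variation writes the generated module with an explicit encoding via io.open (a sketch, assuming src holds UTF-8 bytes under Python 2):

import io

# decode the assembled page once, then store it as a unicode literal
with io.open('htmlSource.py', 'w', encoding='utf-8') as f:
    f.write(u'#-*- coding: utf-8 -*-\n\nhtmlSrc = u"""'
            + src.decode('utf-8') + u'"""')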
Any suggestions? Many thanks!
I have a script that regularly reads a text file on a server and overwrites a local copy with its contents. The problem is that the process adds extra carriage returns, plus an extra invisible character after the last character. How do I make an identical copy of the server file?
I use the following to read the file:

for link in links:
    try:
        f = urllib.urlopen(link)
        myfile = f.read()
    except IOError:
        pass
and the following to write it to the local file:

f = open("C:\\localfile.txt", "w")
try:
    f.write(myfile)
except NameError:
    pass
finally:
    f.close()
This is how the file looks on the server:
http://i.imgur.com/rAnUqmJ.jpg
and this is how the file looks locally; there is also an additional invisible character after the last 75:
http://i.imgur.com/xfs3E8D.jpg
I have seen quite a few similar questions, but I am not sure how to get urllib to read in binary.
Any solution, please?
If you want to copy a remote file denoted by a URL to a local file, I would use urllib.urlretrieve:
import urllib
urllib.urlretrieve("http://anysite.co/foo.gz", "foo.gz")
I think urllib is reading binary.
Try changing
f = open("C:\\localfile.txt", "w")
to
f = open("C:\\localfile.txt", "wb")
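A minimal sketch of the whole round trip in binary mode (Python 2 urllib, reusing the names from your snippet):

import urllib

# read the remote file as raw bytes...
myfile = urllib.urlopen(link).read()
# ...and write them back untouched: "wb" stops Windows from
# translating \n into \r\n on write
with open("C:\\localfile.txt", "wb") as f:
    f.write(myfile)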
I'm parsing an xml file using the code below:
from lxml import etree

file_name = input('Enter the file name, including .xml extension: ')
print('Parsing ' + file_name)

parser = etree.XMLParser()
tree = etree.parse(file_name, parser)
root = tree.getroot()
nsmap = {'xmlns': 'urn:tva:metadata:2010'}

with open(file_name + '.log', 'w', encoding='utf-8') as f:
    for info in root.xpath('//xmlns:ProgramInformation', namespaces=nsmap):
        crid = info.get('programId')
        titlex = info.find('.//xmlns:Title', namespaces=nsmap)
        title = titlex.text if titlex is not None else 'Missing'
        synopsis1x = info.find('.//xmlns:Synopsis[1]', namespaces=nsmap)
        synopsis1 = synopsis1x.text if synopsis1x is not None else 'Missing'
        synopsis1 = synopsis1.replace('\r', '').replace('\n', '')
        f.write('{}|{}|{}\n'.format(crid, title, synopsis1))
Let's take an example title of 'Přešité bydlení'. If I print the title whilst parsing the file, it comes out as expected. When I write it out, however, it displays as 'PÅ™eÅ¡ité bydlenÃ'.
I understand that this is to do with encoding (as I was able to change the print command to use UTF-8 and 'corrupt' the output), but I couldn't get the written output to print as I desired. I had a look at the codecs library, but wasn't successful. Having encoding="utf-8" in the XMLParser line didn't make any difference.
How can I configure the written output to be human readable?
I had all sorts of trouble with this before, but the solution is rather simple. There is a chapter on how to read and write unicode to a file in the documentation. This Python talk is also very enlightening for understanding the issue. Unicode can be a pain. It gets a lot easier if you start using Python 3, though.
import codecs
f = codecs.open('test', encoding='utf-8', mode='w+')
f.write(u'\u4500 blah blah blah\n')
f.seek(0)
print repr(f.readline()[:1])
f.close()
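For comparison, in Python 3 the built-in open already handles encodings, so the codecs module isn't needed (a sketch):

# Python 3 sketch: built-in open takes the encoding directly
with open('test', encoding='utf-8', mode='w+') as f:
    f.write('\u4500 blah blah blah\n')
    f.seek(0)
    print(repr(f.readline()[:1]))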
Your code looks OK, so I reckon your input is duff. Assuming you're viewing your output file with a UTF-8 viewer or shell, I suspect that the encoding declared in the <?xml header doesn't match the actual encoding of the file.
This would explain why printing works but not writing to a file. If your shell/IDE is set to "ISO-8859-2" and your input XML is also "ISO-8859-2" then printing is pushing out the raw encoding.
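A quick way to check the declared encoding is lxml's docinfo (a sketch, using the file_name from your script):

# Sketch: see what the <?xml ...?> header claims the encoding is
from lxml import etree

tree = etree.parse(file_name)
print(tree.docinfo.encoding)  # compare against the file's real encoding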