I am trying to create a program that utilizes pdfminer to read a DnD Character Sheet (fillable PDF) and put the fill-ins into a dictionary. Upon editing the PDF and running the program again, I get a strange sequence of characters when printing the dictionary items. The code:
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1
import collections.abc
filename = "Edited_CS.pdf"
fp = open(filename, 'rb')
my_dict = {}
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']
# Checks if PDF file is blank
if isinstance(fields, collections.abc.Sequence) is False:
    print("This Character Sheet is blank. Please submit a filled Character Sheet!")
else:
    for i in fields:
        field = resolve1(i)
        name, value = field.get('T'), field.get('V')
        if value is None or str(value)[2:-1] == "":
            value = "b'None'"
        my_dict[str(name)[2:-1]] = str(value)[2:-1]
for g in list(my_dict.items()):
    print(g)
The output from the unedited PDF file:
('ClassLevel', 'Assassin 1')
('Background', 'Lone Survivor')
('PlayerName', 'None')
('CharacterName', 'Tumas Mitshil')
('Race ', 'Human')
etc...
The output when it was edited (I changed the ClassLevel, etc. completely in the PDF):
('ClassLevel', '\\xfe\\xff\\x00C\\x00l\\x00a\\x00s\\x00s\\x00L\\x00e\\x00v\\x00e\\x00l')
('Background', '\\xfe\\xff\\x00B\\x00a\\x00c\\x00k\\x00g\\x00r\\x00o\\x00u\\x00n\\x00d\\x00r')
('PlayerName', '\\xfe\\xff\\x00P\\x00l\\x00a\\x00y\\x00e\\x00r\\x00N\\x00a\\x00m\\x00e')
('CharacterName', '\\xfe\\xff\\x00T\\x00h\\x00o\\x00m\\x00a\\x00s')
('Race ', '\\xfe\\xff\\x00R\\x00a\\x00c\\x00e')
('Alignment', '\\xfe\\xff\\x00A\\x00l\\x00i\\x00g\\x00n\\x00m\\x00e\\x00n\\x00t')
etc...
I know this is an encoding of some sort, and a few Google searches led me to believe it was UTF-8 encoded, so I attempted to decode the PDF when opening the file:
fp = open(filename, 'rb').read().decode('utf-8')
Unfortunately, I am met with an error:
Traceback (most recent call last):
File "main.py", line 16, in <module>
fp = open(filename, 'rb').read().decode('utf-8')
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
When I first made the PDF, I used Adobe Acrobat. However, I used Microsoft Edge to edit the file, which resulted in the problem I am facing. Here are the files:
Original File
Edited File
Is there any way to properly decode this? Is there a way to encode the edited pdf so it can be loaded into python without trouble? And if this is encoded, are there other forms of encoding, and how would I decode those?
Any help will be greatly appreciated.
You can fix the problem by using Adobe Acrobat Reader DC to edit the form fields. I've edited the form fields of Edited_CS.pdf using it and pdfminer.six returns the expected output.
Probably Microsoft Edge is causing this problem.
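For what it's worth, the b'\xfe\xff' prefix on each value is the UTF-16 big-endian byte order mark, which the PDF spec allows for text strings. As a sketch (the helper name is made up, and this isn't tested against the asker's actual files), the raw field values could be decoded instead of being sliced with str(value)[2:-1]:

```python
def decode_pdf_text(value):
    """Hypothetical helper: decode a raw PDF text-string value.

    Per the PDF spec, a text string starting with the byte order mark
    b'\\xfe\\xff' is UTF-16 big-endian; otherwise it uses a single-byte
    encoding (approximated here with latin-1).
    """
    if isinstance(value, bytes):
        if value.startswith(b'\xfe\xff'):
            # the 'utf-16' codec reads the BOM, picks big-endian, strips it
            return value.decode('utf-16')
        return value.decode('latin-1')
    return value

# Example with a value shaped like the ones in the edited PDF:
print(decode_pdf_text(b'\xfe\xff\x00A\x00s\x00s\x00a\x00s\x00s\x00i\x00n\x00 \x001'))  # Assassin 1
```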
After some digging, I was able to find a better solution. Instead of using pdfminer to open the PDF, I used PyPDF2. Somehow, it can read any PDF regardless of encoding, and it has a function that automatically turns the fillable fields into a proper dictionary. The result is finer, cleaner code:
from PyPDF2 import PdfFileReader
infile = "Edited_CS.pdf"
pdf_reader = PdfFileReader(open(infile, "rb"))
dictionary = pdf_reader.getFormTextFields()
for g in list(dictionary.items()):
print(g)
Regardless, thank you for all of your answers! :)
Related
I've written a script in python using post requests to scrape the json content from a webpage. When I run my script, I get the result in the console as expected. However, I encounter an issue, when I try to write the same in a csv file.
When I try:
with open ("outputContent.csv","w",newline="") as f:
I encounter the following error:
Traceback (most recent call last):
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\all_reviews_grabber.py", line 27, in <module>
writer.writerow([nom,ville,region])
File "C:\Users\WCS\AppData\Local\Programs\Python\Python36-32\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufb02' in position 16: character maps to <undefined>
When I try the following, the script does produce a CSV file full of data:
with open ("outputContent.csv","w",newline="",encoding="utf-8") as f:
But, the csv file contains some illegible contents, as in:
Beijingshì
Xinjiangwéiwúerzìzhìqu
Shà nghaishì
Qingpuqu
Shà nghaishì
Xúhuìqu
Putuóqu
This is my script so far:
import csv
import requests
from bs4 import BeautifulSoup
baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"
with requests.Session() as s:
    req = s.get(baseUrl)
    sauce = BeautifulSoup(req.text, "lxml")
    token = sauce.select_one("input[name='_token']")['value']
    payload = {
        'data': 'country=0&type=0&input_search=',
        '_token': token
    }
    res = s.post(postUrl, data=payload)
    with open("outputContent.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(['nom', 'ville', 'region'])
        for item in res.json():
            nom = item['prospect_nom']
            ville = item['prospect_ville']
            region = item['prospect_region']
            print(nom, ville, region)
            writer.writerow([nom, ville, region])
How can I write the content in the right way in a csv file?
Take a look at this - http://www.pgbovine.net/unicode-python-errors.htm
Check your default encoding in your interpreter:
import sys
sys.stdout.encoding
An old version of Python can also cause this error.
Would using pandas to parse and then write alleviate the issue?
import pandas as pd
import requests
from bs4 import BeautifulSoup

baseUrl = "https://fr-vigneron.gilbertgaillard.com/importer"
postUrl = "https://fr-vigneron.gilbertgaillard.com/importer/ajax"

with requests.Session() as s:
    req = s.get(baseUrl)
    sauce = BeautifulSoup(req.text, "lxml")
    token = sauce.select_one("input[name='_token']")['value']
    payload = {
        'data': 'country=0&type=0&input_search=',
        '_token': token
    }
    res = s.post(postUrl, data=payload)

jsonObj = res.json()
results = pd.DataFrame()
for item in jsonObj:
    nom = item['prospect_nom']
    ville = item['prospect_ville']
    region = item['prospect_region']
    # print(nom, ville, region)
    temp_df = pd.DataFrame([[nom, ville, region]], columns=['nom', 'ville', 'region'])
    results = results.append(temp_df)
results = results.reset_index(drop=True)
results.to_csv("outputContent.csv", index=False)
The code works correctly, as long as the print statement is removed*.
The corrupted data that you are seeing is because you are decoding the file data from cp1252, rather than UTF-8 when you view it.
>>> s = 'Xinjiangwéiwúerzìzhìqu'
>>> encoded = s.encode('utf-8')
>>> encoded.decode('cp1252')
'XinjiangwÃ©iwÃºerzÃ¬zhÃ¬qu'
If you are viewing the data by opening the csv file in Python, ensure that you specify UTF-8 encoding** when you open it:
open('outputContent.csv', 'r', encoding='utf-8'...
If you are opening the file with an application such as Excel, ensure that you specify that the encoding is UTF-8 when opening it.
If you don't specify an encoding the default cp1252 encoding will be used to decode the data in the file, and you will see garbage data.
* print will automatically use the default encoding, so you'll get an exception if it tries to encode characters which can't be encoded as cp1252.
** It may also be worth trying the 'utf-8-sig' encoding, which is a Microsoft-specific version of UTF-8 that inserts a byte-order-mark or BOM (b'\xef\xbb\xbf') at the beginning of encoded strings, but is otherwise identical to UTF-8.
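When mojibake like the above has already made it into text, the damage can often be reversed by undoing the wrong decode, i.e. re-encoding with cp1252 and then decoding as UTF-8. A small illustrative round trip (the sample value is constructed for the demo, not taken from the asker's actual file):

```python
# Text that was UTF-8 on disk but was mistakenly decoded as cp1252:
garbled = 'XinjiangwÃ©iwÃºerzÃ¬zhÃ¬qu'

# Undo the wrong step, then redo the right one:
repaired = garbled.encode('cp1252').decode('utf-8')
print(repaired)  # Xinjiangwéiwúerzìzhìqu
```

This only works when every mis-decoded byte survived the round trip; if characters were replaced or dropped along the way, the original cannot be recovered.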
I have this text.ucs file which I am trying to decode using python.
file = open('text.ucs', 'r')
content = file.read()
print content
My result is
\xf\xe\x002\22
I tried decoding with utf-16 and utf-8:
content.decode('utf-16')
and getting error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\encodings\utf_16.py", line 16, in decode
    return codecs.utf_16_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 32-33: illegal encoding
Please let me know if I am missing anything or if my approach is wrong.
Edit: a screenshot has been requested.
The string is encoded as UTF16-BE (Big Endian), this works:
content.decode("utf-16-be")
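For illustration, the difference between the UTF-16 variants can be seen on a small byte string (a generic sketch, not the asker's actual file data):

```python
text = 'ABC'

be = text.encode('utf-16-be')   # b'\x00A\x00B\x00C' -- big-endian, no BOM
le = text.encode('utf-16-le')   # b'A\x00B\x00C\x00' -- little-endian, no BOM

# Decoding with the matching endianness recovers the text:
assert be.decode('utf-16-be') == 'ABC'

# The plain 'utf-16' codec relies on a byte order mark to pick the
# endianness; without one it has to guess, which is why BOM-less
# big-endian data usually needs an explicit 'utf-16-be'.
with_bom = text.encode('utf-16')        # BOM is prepended automatically
assert with_bom.decode('utf-16') == 'ABC'
```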
As I understand it, you are using Python 2.x, and the encoding parameter was only added to the built-in open() in Python 3. In Python 2 you can use io.open instead, for example:
import io
file = io.open('text.usc', 'r', encoding='utf-8')
content = file.read()
print content
You can specify which encoding to use with the encoding argument:
with open('text.ucs', 'r', encoding='utf-16') as f:
text = f.read()
Your string needs to be decoded with the utf-8 codec; you can do what I did to decode your string:
f = open('text.usc', 'r', encoding='utf-8')
print(f.read())
I am just trying to load this JSON file (with non-ASCII characters) as a Python dictionary with Unicode encoding, but I still get this error:
return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 90: ordinal not in range(128)
JSON file content:
"tooltip": {
    "dxPivotGrid-sortRowBySummary": "Sort\"{0}\"byThisRow",
}
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    for line in f:
        data.append(json.loads(line.encode('utf-8','replace')))
You have several problems as near as I can tell. First is the file encoding. When you open a file without specifying an encoding, it is opened with whatever sys.getfilesystemencoding() returns. Since that may vary (especially on Windows machines), it's a good idea to explicitly use encoding="utf-8" for most JSON files. Because of your error message, I suspect that the file was opened with an ascii encoding.
Next, the file is decoded from UTF-8 into Python strings as it is read by the file object. A line read this way has already been decoded to a string and is ready for json to read. When you do line.encode('utf-8','replace'), you encode the line back into a bytes object, which json.loads (that is, "load string") can't handle.
Finally, "tooltip":{ "navbar":"Operações de grupo"} isn't valid json, but it does look like one line of a pretty-printed json file containing a single json object. My guess is that you should read the entire file as 1 json object.
Putting it all together you get:
import json
with open('/Users/myvb/Desktop/Automation/pt-PT.json', encoding="utf-8") as f:
    data = json.load(f)
From its name, it's possible that this file is encoded as a Windows Portuguese code page. If so, the "cp860" encoding may work better.
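If the encoding really is uncertain, one pragmatic approach is to try a few likely candidates in order. This is only a sketch; `load_json_any` and its candidate list are made up for illustration, not part of the json module:

```python
import json

def load_json_any(path, encodings=('utf-8', 'cp860', 'latin-1')):
    """Try each candidate encoding until one both decodes and parses."""
    for enc in encodings:
        try:
            with open(path, encoding=enc) as f:
                return json.load(f)
        # json.JSONDecodeError subclasses ValueError, so this catches
        # both "wrong bytes" and "decoded to nonsense that isn't JSON"
        except (UnicodeDecodeError, ValueError):
            continue
    raise ValueError('none of the candidate encodings worked for %r' % path)
```

Note that latin-1 never fails to decode, so it acts as a last resort that always produces *something*; put the most likely encodings first.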
I had the same problem, what worked for me was creating a regular expression, and parsing every line from the json file:
import re

REGEXP = '[^A-Za-z0-9\'\:\.\;\-\?\!]+'
new_file_line = re.sub(REGEXP, ' ', old_file_line).strip()
Having a file with content similar to yours I can read the file in one simple shot:
>>> import json
>>> fname = "data.json"
>>> with open(fname) as f:
... data = json.load(f)
...
>>> data
{'tooltip': {'navbar': 'Operações de grupo'}}
You don't need to read each line. You have two options:
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.load(f))
Or, you can load all lines and pass them to the json module:
import sys
import json
data = []
with open('/Users/myvb/Desktop/Automation/pt-PT.json') as f:
    data.append(json.loads(''.join(f.readlines())))
Obviously, the first suggestion is the best.
I am generating an XML file using xml.etree.ElementTree in Python and then writing the generated XML to a file. One of the tags of the XML concerns the system's installed software and holds the details of all the software installed on the system. The XML output on the console which I get on execution of the script is perfect, but when I try to place the output into a file I encounter the following error:
Traceback (most recent call last):
File "C:\xmltry.py", line 65, in <module>
f.write(prettify(top))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 4305: ordinal not in range(128)
Following is the script:
def prettify(elem):
    """Return a pretty-printed XML string for the Element."""
    rough_string = ElementTree.tostring(elem, 'utf-8')
    reparsed = minidom.parseString(rough_string)
    return reparsed.toprettyxml(indent=" ")
##Here starts populating elements inside xml file
top = Element('my_practice_document')
comment = Comment('details')
top.append(comment)
child = SubElement(top, 'my_information')
childs = SubElement(child,'my_name')
childs.text = str(options.my_name)
#Following section is for retrieving list of software installed on the system
import wmi
w = wmi.WMI()
for p in w.Win32_Product():
    if (p.Version is not None) and (p.Caption is not None):
        child = SubElement(top, 'sys_info')
        child.text = p.Caption + " version " + p.Version

## Following portion places the xml output into test.xml file
with open("test.xml", 'w') as f:
    f.write(prettify(top))
When the script is executed I get the Unicode error. I searched on the internet and also tried out the following:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
But this also did not resolve my issue. I want to have all the data which I am getting on the console in the file without missing anything. So, how can I achieve that? Thanks in advance for your assistance.
You need to specify an encoding for your output file; sys.setdefaultencoding doesn't do that for you.
Try
import codecs
with codecs.open("test.xml", 'w', encoding='utf-8') as f:
    f.write(prettify(top))
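Another option is to open the file in binary mode and write explicitly encoded bytes, which sidesteps the default codec entirely. A self-contained sketch of the failing step, with a minimal stand-in for the question's element tree (the sample element and text are assumptions for the demo):

```python
from xml.dom import minidom
from xml.etree.ElementTree import Element, SubElement, tostring

def prettify(elem):
    """Return a pretty-printed XML string for the Element."""
    return minidom.parseString(tostring(elem, 'utf-8')).toprettyxml(indent="  ")

top = Element('my_practice_document')
child = SubElement(top, 'sys_info')
child.text = u'Espa\xf1ol 1.0'   # contains U+00F1, the character that broke ascii

# Binary mode plus an explicit encode avoids any implicit codec:
with open('test.xml', 'wb') as f:
    f.write(prettify(top).encode('utf-8'))
```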
So I ran into a problem this afternoon, I was able to solve it, but I don't quite understand why it worked.
this is related to a problem I had the other week: python check if utf-8 string is uppercase
basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8')  # cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')

words = (u'\u041c\u041e\u0421\u041a\u0412\u0410',  # capital of Russia, all uppercase
         u'R\xc9SUM\xc9',                          # RESUME with accents
         u'R\xe9sum\xe9',                          # Resume with accents
         u'R\xe9SUM\xe9',)                         # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper():  # .isupper won't function on utf8
        title = etree.SubElement(sect, 'title')
        title.text = word
    else:
        item = etree.SubElement(sect, 'item')
        item.text = word

print>>outFile, etree.tostring(root, pretty_print=True, xml_declaration=True, encoding='utf-8')
it fails with the following:
Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile, etree.tostring(root, pretty_print=True, xml_declaration=True, encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)
but if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use
outFile = open('test.xml', 'w') it works perfectly.
So what's happening?
Since encoding='utf-8' is specified in etree.tostring(), is it encoding the output again?
If I keep codecs.open() and remove encoding='utf-8', the file then becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ascii, I presume?
But etree.tostring() is simply being written to stdout, and is then redirected to a file that was created as a UTF-8 file?
Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and want to figure out why my solution works.
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
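That advice, sometimes called the "unicode sandwich", can be shown in a few lines (a generic sketch, not tied to the lxml script above):

```python
# -*- coding: utf-8 -*-
# Decode bytes once at the input boundary, do all processing on unicode,
# and encode once at the output boundary.
raw = u'МОСКВА'.encode('utf-8')   # bytes arriving from a file or socket
text = raw.decode('utf-8')        # 1. decode on input
assert text.isupper()             # 2. process as unicode (isupper now works)
out = text.encode('utf-8')        # 3. encode on output only
```

Mixing the layers (handing encoded bytes to an API that expects unicode, as codecs.open does here) is exactly what triggers the implicit ASCII decode.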
Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
import codecs
from xml.etree import ElementTree as etree
root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')
words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']
for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect, u'title')
        title.text = word
    else:
        item = etree.SubElement(sect, u'item')
        item.text = word
tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')
In addition to MRAB's answer, some lines of code:
import codecs
from lxml import etree
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
# do some other xml building here
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))