I'm trying to make an HTML entities encoder/decoder in Python that behaves similarly to PHP's htmlentities and html_entity_decode. It works normally as a standalone script:
My input:
Lorem ÁÉÍÓÚÇÃOÁáéíóúção ##$%*()[]<>+ 0123456789
python decode.py
Output:
Lorem ÁÉÍÓÚÇÃOÁáéíóúção ##$%*()[]<>+ 0123456789
Now if I run it as an Autokey script I get this error:
Script name: 'html_entity_decode'
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/autokey/service.py", line 454, in execute
exec script.code in scope
File "<string>", line 40, in <module>
File "/usr/local/lib/python2.7/dist-packages/autokey/scripting.py", line 42, in send_keys
self.mediator.send_string(keyString.decode("utf-8"))
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 6-12: ordinal not in range(128)
What am I doing wrong? Here's the script:
import htmlentitydefs
import re

entity_re = re.compile(r'&(%s|#(\d{1,5}|[xX]([\da-fA-F]{1,4})));' % '|'.join(
    htmlentitydefs.name2codepoint.keys()))

def html_entity_decode(s, encoding='utf-8'):
    if not isinstance(s, basestring):
        raise TypeError('argument 1: expected string, %s found'
                        % s.__class__.__name__)

    def entity_2_unichr(matchobj):
        g1, g2, g3 = matchobj.groups()
        if g3 is not None:
            codepoint = int(g3, 16)
        elif g2 is not None:
            codepoint = int(g2)
        else:
            codepoint = htmlentitydefs.name2codepoint[g1]
        return unichr(codepoint)

    if isinstance(s, unicode):
        entity_2_chr = entity_2_unichr
    else:
        entity_2_chr = lambda o: entity_2_unichr(o).encode(encoding,
                                                           'xmlcharrefreplace')

    def silent_entity_replace(matchobj):
        try:
            return entity_2_chr(matchobj)
        except ValueError:
            return matchobj.group(0)

    return entity_re.sub(silent_entity_replace, s)
text = clipboard.get_selection()
text = html_entity_decode(text)
keyboard.send_keys("%s" % text)
I found it in a Gist (https://gist.github.com/607454); I'm not the author.
Looking at the backtrace, the likely problem is that you are passing a unicode string to keyboard.send_keys, which expects a UTF-8 encoded bytestring. autokey then tries to decode your string, which fails because the input is unicode instead of UTF-8 bytes. This looks like a bug in autokey: it should not try to decode strings unless they really are plain (byte)strings.
If this guess is correct, you should be able to work around it by making sure you pass a UTF-8 encoded bytestring to send_keys. Try something like this:
text = clipboard.get_selection()
if isinstance(text, unicode):
    text = text.encode('utf-8')
text = html_entity_decode(text)
assert isinstance(text, str)
keyboard.send_keys(text)
The assert is not needed but is a handy sanity check to make sure html_entity_decode does the right thing.
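For what it's worth, here is a minimal sketch (not part of the original script) of why calling .decode('utf-8') on a unicode object raises a UnicodeEncodeError in Python 2: before it can decode, the interpreter implicitly encodes the unicode string to bytes with the default ascii codec, and that hidden encode is what fails:

# -*- coding: utf-8 -*-
u = u'Lorem ÁÉÍÓÚ'
u.decode('utf-8')   # UnicodeEncodeError: the implicit ascii encode happens first
u.encode('utf-8')   # the correct direction: unicode -> UTF-8 bytestring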
The problem is that the output of:
clipboard.get_selection()
is a unicode string.
To solve the problem, replace:
text = clipboard.get_selection()
with:
text = clipboard.get_selection().encode("utf8")
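Putting it together, the tail of the script becomes (this is the same workaround as the answer above, just inlined):

text = clipboard.get_selection().encode("utf8")  # unicode -> UTF-8 bytestring
text = html_entity_decode(text)
keyboard.send_keys(text)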
I am trying to run the following script, which scans for *.csproj files and checks for project dependencies in Visual Studio solutions, but I am getting the error below. I have already tried all sorts of codecs, encode/decode calls, and u'' combinations, to no avail...
(the diacritics are intended and I plan to keep them).
Traceback (most recent call last):
File "E:\00 GIT\SolutionDependencies.py", line 44, in <module>
references = GetProjectReferences("MiotecGit")
File "E:\00 GIT\SolutionDependencies.py", line 40, in GetProjectReferences
outputline = u'"{}" -> "{}"'.format(projectName, referenceName)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xed in position 19: ordinal not in range(128)
import glob
import os
import fnmatch
import re
import subprocess
import codecs

gvtemplate = """
digraph g {
    rankdir = "LR"
    #####
}
""".strip()

def GetProjectFiles(rootFolder):
    result = []
    for root, dirnames, filenames in os.walk(rootFolder):
        for filename in fnmatch.filter(filenames, "*.csproj"):
            result.append(os.path.join(root, filename))
    return result

def GetProjectName(path):
    result = os.path.splitext(os.path.basename(path))[0]
    return result

def GetProjectReferences(rootFolder):
    result = []
    projectFiles = GetProjectFiles(rootFolder)
    for projectFile in projectFiles:
        projectName = GetProjectName(projectFile)
        with codecs.open(projectFile, 'r', "utf-8") as pfile:
            content = pfile.read()
        references = re.findall("<ProjectReference.*?</ProjectReference>", content, re.DOTALL)
        for reference in references:
            referenceProject = re.search('"([^"]*?)"', reference).group(1)
            referenceName = GetProjectName(referenceProject)
            outputline = u'"{}" -> "{}"'.format(projectName, referenceName)
            result.append(outputline)
    return result

references = GetProjectReferences("MiotecGit")
output = u"\n".join(*references)
with codecs.open("output.gv", "w", 'utf-8') as outputfile:
    outputfile.write(gvtemplate.replace("#####", output))
graphvizpath = glob.glob(r"C:\Program Files*\Graphviz*\bin\dot.*")[0]
command = '{} -Gcharset=latin1 -T pdf -o "output.pdf" "output.gv"'.format(graphvizpath)
subprocess.call(command)
When Python 2.x tries to use a byte string in a Unicode context, it automatically tries to decode the byte string to a Unicode string using the ascii codec. The ascii codec is a safe default, but it often fails for real-world text.
For Windows environments, the mbcs codec will select the code page that Windows uses for 8-bit characters. You can decode the string yourself explicitly:
outputline = u'"{}" -> "{}"'.format(projectName.decode('mbcs'), referenceName.decode('mbcs'))
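As an illustration (the byte value 0xed is taken from the traceback; it is 'í' in the common Windows code page 1252, and the mbcs codec exists only on Windows):

projectName = 'Miot\xedc'                    # byte string containing 0xed
u'"{}"'.format(projectName)                  # implicit ascii decode -> UnicodeDecodeError
u'"{}"'.format(projectName.decode('mbcs'))   # explicit decode succeeds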
I am making an API call and the response has unicode characters. Loading this response into a file throws the following error:
'ascii' codec can't encode character u'\u2019' in position 22462
I've tried all combinations of decode and encode ('utf-8').
Here is the code:
url = "https://%s?start_time=%s&include=metric_sets,users,organizations,groups" % (api_path, start_epoch)
while url != None and url != "null":
    json_filename = "%s/%s.json" % (inbound_folder, start_epoch)
    try:
        resp = requests.get(url,
                            auth=(api_user, api_pwd),
                            headers={'Content-Type': 'application/json'})
    except requests.exceptions.RequestException as e:
        print "|********************************************************|"
        print e
        return "Error: {}".format(e)
        print "|********************************************************|"
        sys.exit(1)
    try:
        total_records_extracted = total_records_extracted + rec_cnt
        jsonfh = open(json_filename, 'w')
        inter = resp.text
        string_e = inter  # .decode('utf-8')
        final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')  # .replace('\\ ',' ')
        encoded_data = final.encode('utf-8')
        cleaned_data = json.loads(encoded_data)
        json.dump(cleaned_data, jsonfh, indent=None)
        jsonfh.close()
    except ValueError as e:
        tb = traceback.format_exc()
        print tb
        print "|********************************************************|"
        print e
        print "|********************************************************|"
        sys.exit(1)
Lots of developers have faced this issue, and many answers suggest using .decode('utf-8') or putting # _*_ coding: utf-8 _*_ at the top of the script.
It is still not helping.
Can someone help me with this issue?
Here is the trace:
Traceback (most recent call last):
File "/Users/SM/PycharmProjects/zendesk/zendesk_tickets_api.py", line 102, in main
cleaned_data = json.loads(encoded_data)
File "/Users/SM/anaconda/lib/python2.7/json/__init__.py", line 339, in loads
return _default_decoder.decode(s)
File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 364, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/Users/SM/anaconda/lib/python2.7/json/decoder.py", line 380, in raw_decode
obj, end = self.scan_once(s, idx)
ValueError: Invalid \escape: line 1 column 2826494 (char 2826493)
|********************************************************|
Invalid \escape: line 1 column 2826494 (char 2826493)
inter = resp.text
string_e = inter#.decode('utf-8')
encoded_data = final.encode('utf-8')
The text property is a Unicode character string, decoded from the original bytes using whatever encoding the Requests module guessed might be in use from the HTTP headers.
You probably don't want that; JSON has its own ideas about what the encoding should be, so you should let the JSON decoder do that by taking the raw response bytes from resp.content and passing them straight to json.loads.
What's more, Requests has a shortcut method to do the same: resp.json().
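A minimal sketch of both options, reusing the variable names from the question:

import json
import requests

resp = requests.get(url, auth=(api_user, api_pwd),
                    headers={'Content-Type': 'application/json'})
cleaned_data = json.loads(resp.content)   # hand the raw bytes to the JSON decoder
cleaned_data = resp.json()                # or use the Requests shortcut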
final = string_e.replace('\\n', ' ').replace('\\t', ' ').replace('\\r', ' ')#.replace('\\ ',' ')
Trying to do this on the JSON-string-literal formatted input is a bad idea: you will miss some valid escapes and incorrectly unescape others. Your actual error has nothing to do with Unicode at all; it's that this replacement is mangling the input. For example, consider the input JSON:
{"message": "Open the file C:\\newfolder\\text.txt"}
after replacement:
{"message": "Open the file C:\ ewfolder\ ext.txt"}
which is clearly not valid JSON.
Instead of trying to operate on the JSON-encoded string, you should let json decode the input and then filter any strings in the structured output. This may involve using a recursive function to walk down into each level of the data looking for strings to filter, e.g.:
def clean(data):
    if isinstance(data, basestring):
        return data.replace('\n', ' ').replace('\t', ' ').replace('\r', ' ')
    if isinstance(data, list):
        return [clean(item) for item in data]
    if isinstance(data, dict):
        return {clean(key): clean(value) for (key, value) in data.items()}
    return data

cleaned_data = clean(resp.json())
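Writing the cleaned structure back out then needs no manual encoding at all, since json.dump escapes non-ASCII characters by default (ensure_ascii is True):

with open(json_filename, 'w') as jsonfh:
    json.dump(cleaned_data, jsonfh, indent=None)   # output is pure ASCII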
My code works perfectly for some PDFs, but others raise this error:
Traceback (most recent call last):
File "con.py", line 24, in <module>
print getPDFContent("abc.pdf")
File "con.py", line 17, in getPDFContent
f.write(a)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u02dd' in position 64: ordinal not in range(128)
My code is:
import pyPdf

def getPDFContent(path):
    content = ""
    pdf = pyPdf.PdfFileReader(file(path, "rb"))
    for i in range(0, pdf.getNumPages()):
        f = open("xxx.txt", 'a')
        content = pdf.getPage(i).extractText() + "\n"
        import string
        c = content.split()
        for a in c:
            f.write(" ")
            f.write(a)
        f.write('\n')
        f.close()
    return content

print getPDFContent("abc.pdf")
Your problem is that when you call f.write() with a unicode string, Python tries to encode it using the ascii codec. Your PDF contains characters that cannot be represented by the ascii codec. Try explicitly encoding your string, e.g.
a = a.encode('utf-8')
f.write(a)
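Alternatively (a sketch, assuming UTF-8 output is acceptable for your text file), open the file with an explicit encoding so that unicode strings can be written directly:

import io
f = io.open("xxx.txt", 'a', encoding='utf-8')
f.write(u" ")
f.write(a)    # 'a' is a unicode string; io.open encodes it on write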
Try
import sys
print getPDFContent("abc.pdf").encode(sys.getfilesystemencoding())
import csv
import io

def openFile(fileName):
    try:
        trainFile = io.open(fileName, "r", encoding="utf-8")
    except IOError as e:
        print ("File could not be opened: {}".format(e))
    else:
        trainData = csv.DictReader(trainFile)
        print trainData
        return trainData

def computeTFIDF(trainData):
    bodyList = []
    print "Inside computeTFIDF"
    for row in trainData:
        for key, value in row.iteritems():
            print key, unicode(value, "utf-8", "ignore")
    print "Done"
    return

if __name__ == "__main__":
    print "Main"
    trainData = openFile("../Data/TrainSample.csv")
    print "File Opened"
    computeTFIDF(trainData)
Error:
Traceback (most recent call last):
File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 62, in <module>
computeTFIDF(trainData)
File "C:\DebSeal\IUB MS Program\IUB Sem III\Facebook Kaggle Comp\Src\facebookChallenge.py", line 42, in computeTFIDF
for row in trainData:
File "C:\Python27\lib\csv.py", line 104, in next
row = self.reader.next()
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201c' in position 215: ordinal not in range(128)
TrainSample.csv is a CSV file with 4 columns (with a header).
OS: Windows 7 64-bit.
Using Python 2.x.
I don't know what is going wrong here. I told it to ignore encoding errors, but it still throws the same error; I think the error is raised before control ever reaches the decoding step.
Can anybody tell me where I am going wrong?
The Python 2 CSV module does not handle Unicode input.
Open the file in binary mode, and decode after parsing it as CSV. This is safe for the UTF-8 codec as newlines, delimiters and quotes all encode to 1 byte.
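A quick check of that claim:

# Every CSV metacharacter survives UTF-8 encoding as a single byte, so the
# byte-oriented csv module can still split rows and fields correctly.
for ch in [u'\n', u'\r', u',', u'"']:
    assert len(ch.encode('utf-8')) == 1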
The csv module documentation includes a UnicodeReader wrapper class in the example section that will do the decoding for you; it is easily adapted to the DictReader class:
import csv

class UnicodeDictReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        self.encoding = encoding
        self.reader = csv.DictReader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return {k: unicode(v, self.encoding) for k, v in row.iteritems()}

    def __iter__(self):
        return self
Use this with the file opened in binary mode:
def openFile(fileName):
    try:
        trainFile = open(fileName, "rb")
    except IOError as e:
        print "File could not be opened: {}".format(e)
    else:
        return UnicodeDictReader(trainFile)
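With that in place, computeTFIDF can drop its unicode() call; a short usage sketch against the question's code:

trainData = openFile("../Data/TrainSample.csv")
for row in trainData:
    for key, value in row.iteritems():
        print key, value    # value is already decoded to unicode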
I can't comment on Martijn's answer, but his solution works for me perfectly after a little upgrade, which I leave here for others:
def next(self):
    row = self.reader.next()
    try:
        d = dict((unicode(k, self.encoding), unicode(v, self.encoding))
                 for k, v in row.iteritems())
    except TypeError:
        d = row
    return d
One thing to note is that Python 2.6 and lower don't support dict comprehensions.
Another is that dict values can be of types that the unicode() function cannot handle, so it's worth catching TypeError in case of a null or a number.
One more thing that drove me crazy: it doesn't work when you open the file with an encoding! Just use a plain open().
I'm using Microsoft's free translation service to translate some Hindi characters to English. They don't provide an API for Python, but I borrowed code from: tinyurl.com/dxh6thr
I'm trying to use the 'Detect' method as described here: tinyurl.com/bxkt3we
The 'hindi.txt' file is saved in unicode charset.
>>> hindi_string = open('hindi.txt').read()
>>> data = { 'text' : hindi_string }
>>> token = msmt.get_access_token(MY_USERID, MY_TOKEN)
>>> request = urllib2.Request('http://api.microsofttranslator.com/v2/Http.svc/Detect?'+urllib.urlencode(data))
>>> request.add_header('Authorization', 'Bearer '+token)
>>> response = urllib2.urlopen(request)
>>> print response.read()
<string xmlns="http://schemas.microsoft.com/2003/10/Serialization/">en</string>
>>>
The response shows that the Translator detected 'en' instead of 'hi' (for Hindi). When I check the type, it shows as str:
>>> type(hindi_string)
<type 'str'>
For reference, here is content of 'hindi.txt':
हाय, कैसे आप आज कर रहे हैं। मैं अच्छी तरह से, आपको धन्यवाद कर रहा हूँ।
I'm not sure if using string.encode or string.decode applies here. If it does, what do I need to encode/decode from/to? What is the best method to pass a Unicode string as a urllib.urlencode argument? How can I ensure that the actual Hindi characters are passed as the argument?
Thank you.
** Additional Information **
I tried using codecs.open() as suggested, but I get the following error:
>>> hindi_new = codecs.open('hindi.txt', encoding='utf-8').read()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\codecs.py", line 671, in read
return self.reader.read(size)
File "C:\Python27\lib\codecs.py", line 477, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte
Here is the repr(hindi_string) output:
>>> repr(hindi_string)
"'\\xff\\xfe9\\t>\\t/\\t,\\x00 \\x00\\x15\\tH\\t8\\tG\\t \\x00\\x06\\t*\\t \\x00
\\x06\\t\\x1c\\t \\x00\\x15\\t0\\t \\x000\\t9\\tG\\t \\x009\\tH\\t\\x02\\td\\t \
\x00.\\tH\\t\\x02\\t \\x00\\x05\\t'"
Your file is utf-16, so you need to decode the content before sending it:
hindi_string = open('hindi.txt').read().decode('utf-16')
data = { 'text' : hindi_string.encode('utf-8') }
...
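The giveaway is the '\xff\xfe' prefix in the repr() output: that is the UTF-16 little-endian byte order mark. A small sketch to confirm it:

import codecs

raw = open('hindi.txt', 'rb').read()
print raw.startswith(codecs.BOM_UTF16_LE)   # True: the file is UTF-16-LE
hindi_string = raw.decode('utf-16')         # decodes and strips the BOM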
You could try opening the file with codecs.open and decoding it as utf-8:
import codecs

with codecs.open('hindi.txt', encoding='utf-8') as f:
    hindi_text = f.read()