I am trying to understand the python parser on how it handles the source code text and how it tokenizes while parsing. I have 3 statements in the same source file , essentially doing the same function.
# -*- coding: latin-1 -*-
print("This is a Unicode List")
eval("\u0070\u0072\u0069\u006E\u0074\u0028\u0022This is a Unicode List\u0022\u0029")
\u0070\u0072\u0069\u006E\u0074\u0028\u0022This is a Unicode List\u0022\u0029
The first 2 lines works as expected however I get a syntax error for line 3
File "...\UnicodeInput.py", line 4
\u0070\u0072\u0069\u006E\u0074\u0028\u0022This is a Unicode List\u0022\u0029
^
SyntaxError: unexpected character after line continuation character
is there a way for me to provide python statements with its equivalent codepoints
No, Python won't allow escape sequences as valid source code, they only make sense inside strings. Backslash is the line continuation character in Python source. The only way you can do what you want is to use eval or generate the python file by using another python script using the escape sequences.
Related
I want to run a Python source file that contains unicode (utf-8) characters in the source. I am aware of the fact that this can be done by adding the comment # -*- coding: utf-8 -*- in the beginning. However, I wish to do it without using this method.
One way I could think of was writing the unicode strings in escaped form. For example,
Edit: Updated Source. Added Unicode comments.
# Printing naïve and 男孩
def fxn():
print 'naïve'
print '男孩'
fxn()
becomes
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
print 'na\xc3\xafve'
print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
I have two questions regarding the above method.
How do I convert the first code snippet, using Python, into its equivalent that
follows it? That is, only unicode sequences should be written in
escaped form.
Is the method foolproof considering only unicode (utf-8) characters are used? Is there something that can go wrong?
Your idea is generally sound but will break in Python 3 and will cause a headache when you manipulating and writing your strings in Python 2.
It's a good idea to use Unicode strings, not regular strings when dealing with non-ASCII.
Instead, you can encode your characters as Unicode (not UTF-8) escape sequences in Unicode strings.
u'na\xefve'
u'\u7537\u5b69'
note the u prefix
Your code is now encoding agnostic.
If you only use byte strings, and save your source file encoded as UTF-8, your byte strings will contain UTF-8-encoded data. No need for the coding statement (although REALLY strange that you don't want to use it...it's just a comment). The coding statement let's Python know the encoding of the source file, so it can decode Unicode strings correctly (u'xxxxx'). If you have no Unicode strings, it doesn't matter.
For your questions, no need to convert to escape codes. If you encode the file as UTF-8, you can use the more readable characters in your byte strings.
FYI, that won't work for Python 3, because byte strings cannot contain non-ASCII in that version.
That said, here's some code that will convert your example as requested. It reads the source assuming it is encoded in UTF-8, then uses a regular expression to locate all non-ASCII characters. It passes them through a conversion function to generate the replacement. This should be safe, since non-ASCII can only be used in string literals and constants in Python 2. Python 3, however, allows non-ASCII in variable names so this wouldn't work there.
import io
import re
def escape(m):
char = m.group(0).encode('utf8')
return ''.join(r'\x{:02x}'.format(ord(b)) for b in char)
with io.open('sample.py',encoding='utf8') as f:
content = f.read()
new_content = re.sub(r'[^\x00-\x7f]',escape,content)
with io.open('sample_new.py','w',encoding='utf8') as f:
f.write(new_content)
Result:
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
print 'na\xc3\xafve'
print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
question 1:
try to use:
print u'naïve'
print u'长者'
question 2:
If you type the sentences by keyboard and Chinese input software, everything should be OK. But if you copy and paste sentence from some web pages, you should consider other encode format such as GBK,GB2312 and GB18030
This snippet of Python 3 should convert your program correctly to work in Python 2.
def convertchar(char): #converts individual characters
if 32<=ord(char)<=126 or char=="\n": return char #if normal character, return it
h=hex(ord(char))[2:]
if ord(char)<256: #if unprintable ASCII
h=" "*(2-len(h))+h
return "\\x"+h
elif ord(char)<65536: #if short unicode
h=" "*(4-len(h))+h
return "\\u"+h
else: #if long unicode
h=" "*(8-len(h))+h
return "\\U"+h
def converttext(text): #converts a chunk of text
newtext=""
for char in text:
newtext+=convertchar(char)
return newtext
def convertfile(oldfilename,newfilename): #converts a file
oldfile=open(oldfilename,"r")
oldtext=oldfile.read()
oldfile.close()
newtext=converttext(oldtext)
newfile=open(newfilename,"w")
newfile.write(newtext)
newfile.close()
convertfile("FILE_TO_BE_CONVERTED","FILE_TO_STORE_OUTPUT")
First a simple remarl: as you are using byte strings in a Python2 script, the # -*- coding: utf-8 -*- has simply no effect. It only helps to convert the source byte string to an unicode string if you had written:
# -*- coding: utf-8 -*-
...
utxt = u'naïve' # source code is the bytestring `na\xc3\xafve'
# but utxt must become the unicode string u'na\xefve'
Simply it might be interpreted by clever editors to automatically use a utf8 charset.
Now for the actual question. Unfortunately, what you are asking for is not really trivial: idenfying in a source file what is in a comment and in a string simply requires a Python parser... And AFAIK, if you use the parser of ast modules you will lose your comments except for docstrings.
But in Python 2, non ASCII characters are only allowed in comments and litteral strings! So you can safely assume that if the source file is a correct Python 2 script containing no litteral unicode string(*), you can safely transform any non ascii character in its Python representation.
A possible Python function reading a raw source file from a file object and writing it after encoding in another file object could be:
def src_encode(infile, outfile):
while True:
c = infile.read(1)
if len(c) < 1: break # stop on end of file
if ord(c) > 127: # transform high characters
c = "\\x{:2x}".format(ord(c))
outfile.write(c)
An nice property is that it works whatever encoding you use, provided the source file is acceptable by a Python interpreter and does not contain high characters in unicode litterals(*), and the converted file will behave exactly the same as the original one...
(*) A problem will arise if you use unicode litterals in an encoding other that Latin1, because the above function will behave as if the file contained the declaration # -*- coding: Latin1 -*-: u'é' will be translated correctly as u'\xe9' if original encoding is latin1 but as u'\xc3\xc9' (not what is expected...) if original encoding is utf8, and I cannot imagine a way to process correctly both litteral byte strings and unicode byte strings without fully parsing the source file...
Is there any way to define a string with accented letters in python?
An extreme example is this one:
message = "ÂÃÄÀÁÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ"
Error:
SyntaxError: Non-UTF-8 code starting with '\xc2'
When souce code contains something else than ASCII, you have to add a line to tell the python interpreter:
#!/usr/bin/env python
# encoding: utf-8
Read more in PEP-0263 for the exact rules how to include the encoding hint in a magic comment.
If you use Python 3.x you can use accented (Unicode) strings without doing anything special. If you are using Python 2.x, use u prefix to denote Unicode:
message = u"ÂÃÄÀÁÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ"
Also remember to include the following line at the top of your script:
# coding=utf-8
PEP-0263 explains this in detail:
To define a source code encoding, a magic comment must
be placed into the source files either as first or second
line in the file, such as:
# coding=<encoding name>
suffixes = {
1: ["ो", "े", "ू", "ु", "ी", "ि", "ा"]}
When I done
message given by IDLE is
Unsupported characters in input
Also not see the proper font in MS-DOS.
What encoding is your source file in?
If it is UTF8, put the comment
# -*- coding: utf-8 -*-
at the top of the file.
If you don't declare encoding in your first or second line in your python source file, then the python interpreter will use ASCII encoding system to decode the characters in the file. As these characters you used couldn't be decoded by ASCII encoding system, errors happended.
The solution is as #RemcoGerlich said. Here is the doc.
The encoding is used for all lexical analysis, in particular to find the end of a string, and to interpret the contents of Unicode literals. String literals are converted to Unicode for syntactical analysis, then converted back to their original encoding before interpretation starts. The encoding declaration must appear on a line of its own.
This seems to be a known bug in the 2.x IDLE console: http://bugs.python.org/issue15809. A fix was made for Python 3.x, but doesn't appear to be backported.
Instead, use an alternative console, such as iPython/Jupyter, or a fully-fledged IDE, such as PyCharm.
Say I have a function:
def NewFunction():
return '£'
I want to print some stuff with a pound sign in front of it and it prints an error when I try to run this program, this error message is displayed:
SyntaxError: Non-ASCII character '\xa3' in file 'blah' but no encoding declared;
see http://www.python.org/peps/pep-0263.html for details
Can anyone inform me how I can include a pound sign in my return function? I'm basically using it in a class and it's within the '__str__' part that the pound sign is included.
I'd recommend reading that PEP the error gives you. The problem is that your code is trying to use the ASCII encoding, but the pound symbol is not an ASCII character. Try using UTF-8 encoding. You can start by putting # -*- coding: utf-8 -*- at the top of your .py file. To get more advanced, you can also define encodings on a string by string basis in your code. However, if you are trying to put the pound sign literal in to your code, you'll need an encoding that supports it for the entire file.
Adding the following two lines at the top of my .py script worked for me (first line was necessary):
#!/usr/bin/env python
# -*- coding: utf-8 -*-
First add the # -*- coding: utf-8 -*- line to the beginning of the file and then use u'foo' for all your non-ASCII unicode data:
def NewFunction():
return u'£'
or use the magic available since Python 2.6 to make it automatic:
from __future__ import unicode_literals
The error message tells you exactly what's wrong. The Python interpreter needs to know the encoding of the non-ASCII character.
If you want to return U+00A3 then you can say
return u'\u00a3'
which represents this character in pure ASCII by way of a Unicode escape sequence. If you want to return a byte string containing the literal byte 0xA3, that's
return b'\xa3'
(where in Python 2 the b is implicit; but explicit is better than implicit).
The linked PEP in the error message instructs you exactly how to tell Python "this file is not pure ASCII; here's the encoding I'm using". If the encoding is UTF-8, that would be
# coding=utf-8
or the Emacs-compatible
# -*- encoding: utf-8 -*-
If you don't know which encoding your editor uses to save this file, examine it with something like a hex editor and some googling. The Stack Overflow character-encoding tag has a tag info page with more information and some troubleshooting tips.
In so many words, outside of the 7-bit ASCII range (0x00-0x7F), Python can't and mustn't guess what string a sequence of bytes represents. https://tripleee.github.io/8bit#a3 shows 21 possible interpretations for the byte 0xA3 and that's only from the legacy 8-bit encodings; but it could also very well be the first byte of a multi-byte encoding. But in fact, I would guess you are actually using Latin-1, so you should have
# coding: latin-1
as the first or second line of your source file. Anyway, without knowledge of which character the byte is supposed to represent, a human would not be able to guess this, either.
A caveat: coding: latin-1 will definitely remove the error message (because there are no byte sequences which are not technically permitted in this encoding), but might produce completely the wrong result when the code is interpreted if the actual encoding is something else. You really have to know the encoding of the file with complete certainty when you declare the encoding.
Adding the following two lines in the script solved the issue for me.
# !/usr/bin/python
# coding=utf-8
Hope it helps !
You're probably trying to run Python 3 file with Python 2 interpreter. Currently (as of 2019), python command defaults to Python 2 when both versions are installed, on Windows and most Linux distributions.
But in case you're indeed working on a Python 2 script, a not yet mentioned on this page solution is to resave the file in UTF-8+BOM encoding, that will add three special bytes to the start of the file, they will explicitly inform the Python interpreter (and your text editor) about the file encoding.
I'm using the following regular expression basically to search for and delete these characters.
invalid_unicode = re.compile(ur'(Û|²|°|±|É|¹|Í)')
My source code in ASCII encoded, and whenever I try to run the script it spits out:
SyntaxError: Non-ASCII character '\xdb' in file ./release.py on line 273, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
If I follow the instructions at the given website, and place utf-8 on the second line encoding, my script doesn't run. Instead it gives me this error:
SyntaxError: (unicode error) 'utf8' codec can't decode byte 0xdb in position 0: unexpected end of data
How do I get this one regular expression running in an ASCII written script that'd be great.
You need to find out what encoding your editor is using, and set that per PEP263; or, make things more stable and portable (though alas perhaps a bit less readable) and use escape sequences in your string literal, i.e., use u'(\xdb|\xb2|\xb0|\xb1|\xc9|\xb9|\xcd)' as the parameter to the re.compile call.
After telling Python that your source file uses UTF-8 encoding, did you actually make sure that your editor is saving the file using UTF-8 encoding? The error you get indicates that your editor is probably not using UTF-8.
What text editor are you using?
\x{c0de}
In a regex will match the Unicode character at code point c0de.
Python uses PCRE, right? (If it doesn't, it's probably \uC0DE instead...)