Python printing special Characters [duplicate] - python

This question already has an answer here:
SyntaxError of Non-ASCII character [duplicate]
(1 answer)
Closed 9 years ago.
In Python how do I print special characters such as √, ∞, ²,³, ≤, ≥, ±, ≠
When I try printing this to the console I get this error:
print("√")
SyntaxError: Non-ASCII character '\xe2' in file /Users/williamfiset/Desktop/MathAid - Python/test.py on line 4, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How do I get around this?

Running this code results in the same SyntaxError you've provided:
chars = ["√", "∞", "²", "³", "≤", "≥", "±", "≠"]
for c in chars:
    print(c)
But if I add # -*- coding: utf-8 -*- at the top of the script:
# -*- coding: utf-8 -*-
chars = ["√", "∞", "²", "³", "≤", "≥", "±", "≠"]
for c in chars:
    print(c)
it will print:
√
∞
²
³
≤
≥
±
≠
Also, see SyntaxError of Non-ASCII character.
Hope that helps.
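As a side note, if you'd rather keep the source file pure ASCII, Unicode escape sequences avoid the problem entirely; and in Python 3, source files default to UTF-8, so no declaration is needed at all. A small sketch in Python 3 syntax:

```python
# Unicode escapes are plain ASCII, so no coding declaration is required
# even in Python 2 (there, prefix the literals with u"...").
symbols = {
    "sqrt": "\u221a",      # √
    "infinity": "\u221e",  # ∞
    "leq": "\u2264",       # ≤
    "neq": "\u2260",       # ≠
}
for name, sym in symbols.items():
    print(name, sym)
```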

Related

Python character encoding returns incorrect value

I'm using Python 2.7.11, and I get a wrong value when getting the decimal value of a character from the extended ASCII table:
# -*- coding: utf-8 -*-
str="è"
print(ord(str[0])) #prints 232 decimal
but the value of this char is 138 decimal
(http://www.asciitable.com/)
When I remove the coding utf-8 line I get this error: SyntaxError: Non-ASCII character '\xe8'
UTF-8 is not extended ASCII. If you check the UTF-8 table here, you will see that 232 is indeed the correct ordinal.
Also, I recommend Joel on Software's article on Unicode and UTF-8.
The character è corresponds to the Unicode code point U+00E8, which is 232 in decimal.
See this reference
The character is contained in extended ASCII; see this question for extended ASCII and Python.
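To make the distinction concrete: ord() on a text character returns its Unicode code point, while the 138 on the extended-ASCII table is a byte value in an old OEM code page (presumably code page 437; that's an assumption about which table asciitable.com shows). A Python 3 sketch:

```python
# ord() on a text string returns the Unicode code point,
# not a code-page byte value.
ch = "è"
print(ord(ch))                  # 232, i.e. U+00E8

# The 138 on the "extended ASCII" table is the byte value of è in the
# old IBM OEM code page 437 (an assumption about which table is meant).
print(ch.encode("cp437")[0])    # 138

# Latin-1 byte values coincide with the Unicode code points below 256.
print(ch.encode("latin-1")[0])  # 232
```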

Printing a UTF char by its non-ASCII code in Python

I want to print a non-ASCII (UTF-8) by its code rather than the character itself using Python 2.7.
For example, I have the following:
# -*- coding: utf-8 -*-
print "…"
and that's OK. However, I want to print '…' using '\xe2', the corresponding code, instead.
Any ideas?
Printing '\xe2\x80\xa6' will give you …
In [36]: print'\xe2\x80\xa6'
…
In [45]: print repr("…")
'\xe2\x80\xa6'
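For reference, the same idea in Python 3: '\xe2\x80\xa6' is really the three UTF-8 bytes of U+2026, so in Python 3 you either decode the bytes explicitly or use a \u escape for the code point:

```python
# '\xe2\x80\xa6' are the three UTF-8 bytes of the ellipsis character.
# In Python 3, decode the bytes explicitly to get the text back.
ellipsis = b"\xe2\x80\xa6".decode("utf-8")
print(ellipsis)              # …

# Or address it by Unicode code point directly.
print("\u2026")              # …
print(ellipsis == "\u2026")  # True
```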

Python 2.5 sub function from regex module not recognizing a pattern

I'm trying to use Python's sub function from the regex module to recognize and change a pattern in a string. Below is my code.
old_string = "afdëhë:dfp"
newString = re.sub(ur'([aeiouäëöüáéíóúàèìò]|ù:|e:|i:|o:|u:|ä:|ë:|ö:|ü:|á:|é:|í:|ó:|ú:|à:|è:|ì:|ò:|ù:)h([aeiouäëöüáéíóúàèìòù])', ur'\1\2', old_string)
So what I'm looking to get after the code is applied is afdëë:dfp (without the h). So I'm trying to match a vowel (sometimes with accents, sometimes with a colon after it) then the h then another vowel (sometimes with accents). So a few examples...
ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë
So I'm trying to remove the h when it is between two vowels, and also when it follows a vowel with a colon after it and precedes another vowel (i.e. a:ha). Any help is greatly appreciated. I've been playing around with this for a while.
A single user-perceived character may consist of multiple Unicode codepoints. Such characters can break u'[abc]'-like regex that sees only codepoints in Python. To workaround it, you could use u'(?:a|b|c)' regex instead. In addition, don't mix bytes and Unicode strings i.e., old_string should be also Unicode.
Applying the last rule fixes your example.
You could write your regex using lookahead/lookbehind assertions:
# -*- coding: utf-8 -*-
import re
from functools import partial
old_string = u"""
ò:ha becomes ò:a
ä:hà becomes ä:à
aha becomes aa
üha becomes üa
ëhë becomes ëë"""
# (?<=a|b|c)(:?)h(?=a|b|c)
chars = u"a e i o u ä ë ö ü á é í ó ú à è ì ò".split()
pattern = u"(?<=%(vowels)s)(:?)h(?=%(vowels)s)" % dict(vowels=u"|".join(chars))
remove_h = partial(re.compile(pattern).sub, ur'\1')
# remove 'h' followed and preceded by vowels
print(remove_h(old_string))
Output
ò:a becomes ò:a
ä:à becomes ä:à
aa becomes aa
üa becomes üa
ëë becomes ëë
For completeness, you could also normalize all Unicode strings in the program using the unicodedata.normalize() function (see the example in the docs to understand why you might need it).
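To see why normalization matters: a user-perceived character such as è can be one code point (U+00E8) or two (e plus a combining grave accent), and a character class matches only single code points. A small illustration, using Python 3 strings:

```python
import re
import unicodedata

composed = "\u00e8"     # è as one code point
decomposed = "e\u0300"  # è as 'e' + combining grave accent

print(len(composed), len(decomposed))      # 1 2

# The class sees only single code points, so it misses the two-code-point form.
print(bool(re.match("[è]$", decomposed)))  # False

# NFC normalization recombines the pair into the single-code-point form.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```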
It is an encoding issue. Different combinations of the file encoding and old_string being non-Unicode behave differently in different Python versions.
For example, your code works fine for Python 2.6 to 2.7 this way (all data below is cp1252-encoded):
# -*- coding: cp1252 -*-
old_string = "afdëhë:dfp"
but fails with SyntaxError: Non-ASCII character '\xeb' if no encoding is specified in the file.
However, those lines fail for Python 2.5 with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
Meanwhile, for all Python versions the substitution fails to remove the h when old_string is not Unicode:
# -*- coding: utf8 -*-
old_string = "afdëhë:dfp"
So you have to provide the correct encoding and define old_string as a Unicode string as well; for example, this one will do:
# -*- coding: cp1252 -*-
old_string = u"afdëhë:dfp"
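For what it's worth, Python 3 turns this silent mismatch into a hard error: applying a str (Unicode) pattern to bytes raises TypeError instead of quietly failing to match. A quick sketch:

```python
import re

# A str (Unicode) pattern applied to Unicode text works as expected.
print(re.sub("ëh", "ë", "afdëhë:dfp"))   # afdëë:dfp

# The same str pattern applied to bytes raises TypeError in Python 3,
# instead of silently failing to match as in Python 2.
try:
    re.sub("ëh", "ë", "afdëhë:dfp".encode("utf-8"))
except TypeError as exc:
    print("TypeError:", exc)
```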

Declaring encoding in Python [duplicate]

This question already has answers here:
Working with UTF-8 encoding in Python source [duplicate]
(2 answers)
Closed 8 years ago.
I want to split a string in python using this code:
means="a ، b ، c"
lst=means.split("،")
but I get this error message:
SyntaxError: Non-ASCII character '\xd8' in file dict.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How do I declare an encoding?
Put:
# -*- coding: UTF-8 -*-
as the first line of the file (or the second line if the first line is a shebang, e.g. on *nix), and save the file as UTF-8.
If you're using Python 2, use Unicode string literals (u"..."), for example:
means = u"a ، b ، c"
lst = means.split(u"،")
If you're using Python 3, string literals are Unicode already (unless marked as bytestrings b"...").
You need to declare an encoding for your file, as documented here and here.
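In Python 3 the original snippet works as-is, since source files default to UTF-8 and string literals are Unicode. A quick check:

```python
# No coding declaration needed in Python 3.
means = "a ، b ، c"
lst = means.split("،")
print(lst)   # ['a ', ' b ', ' c']
```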

How do I define strings read from a file as Unicode? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Character reading from file in Python
I want to strip an input string read from a file of all special characters, except actual letters (even Cyrillic letters shouldn't be stripped). The solution I found manually declares the string as Unicode and compiles the pattern with the re.UNICODE flag, so actual letters from different languages are detected:
# -*- coding: utf-8 -*-
import re
pattern = re.compile(r"[^\w\d]", re.UNICODE)
n_uni = 'ähm whatßs äüöp ×äØü'
uni = u'ähm whatßs äüöp ×äØü'
words = pattern.split(n_uni) #doesn't work
u_words = pattern.split(uni) #works
So if I write the string directly in the source and manually define it as Unicode it gives me the desired output while the non-Unicode string gives me just garbage:
"ähm whatßs äüöp äØü" -> unicode
"hm what s ü p ü" -> non-Unicode, even with some invalid characters
My question is now how do I define the input from a file as Unicode?
Straight from the docs.
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
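In Python 3, the built-in open() accepts an encoding argument directly, so codecs.open is no longer needed. A minimal sketch using a temporary file (the file name is just for illustration):

```python
import os
import tempfile

# Write some UTF-8 text, then read it back as Unicode via the
# encoding argument of the built-in open().
path = os.path.join(tempfile.mkdtemp(), "unicode.txt")
with open(path, "w", encoding="utf-8") as f:
    f.write("ähm whatßs äüöp\n")

with open(path, encoding="utf-8") as f:
    for line in f:
        print(repr(line))
```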
