Declaring encoding in Python [duplicate] - python

This question already has answers here:
Working with UTF-8 encoding in Python source [duplicate]
(2 answers)
Closed 8 years ago.
I want to split a string in python using this code:
means="a ، b ، c"
lst=means.split("،")
but I get this error message:
SyntaxError: Non-ASCII character '\xd8' in file dict.py on line 2, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How do I declare an encoding?

Put:
# -*- coding: UTF-8 -*-
as the first line of the file (or the second line if the first line is a shebang such as #!/usr/bin/env python, as is common on *nix) and save the file as UTF-8.
If you're using Python 2, use Unicode string literals (u"..."), for example:
means = u"a ، b ، c"
lst = means.split(u"،")
If you're using Python 3, string literals are Unicode already (unless marked as bytestrings b"...").
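Putting the pieces together, a minimal sketch of the whole file could look like the following. The name parts and the strip() call are my own additions for illustration; the u prefix is harmless on Python 3.3+ and required on Python 2.

```python
# -*- coding: utf-8 -*-
# The line above tells Python 2 how to decode this source file.
# Python 3 assumes UTF-8 by default, so there it is optional.

means = u"a ، b ، c"

# Split on the Arabic comma, then trim the spaces left around each item.
parts = [part.strip() for part in means.split(u"،")]
print(parts)
```

The file itself still has to be saved as UTF-8 in your editor; the declaration only describes the encoding, it does not convert anything.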

You need to declare an encoding for your file, as documented in PEP 263 (the page linked in the error message).

Related

How to convert unicode string to bytes Python [duplicate]

This question already has answers here:
getting bytes from unicode string in python
(6 answers)
Closed 2 years ago.
I have a string which I get from a function
>>> example = Some_function()
Some_function returns a very long string mixing Unicode and ASCII characters, like 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'
My problem is that when I try to convert this Unicode string to bytes, I get an error saying that \ud919 cannot be encoded as UTF-8. I tried:
>>> further=bytes(example,encoding='utf-8')
Note: I cannot ignore this \ud919. Is there a way to solve this problem, or can I convert 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123' to 'gn1\ud123a\ud123\ud123\ud123\\ud919\ud123\ud123' so that \ud919 is treated as a plain string rather than as Unicode?
... based on the version.
Python 2.x: print type(unicode_string), repr(unicode_string)
Python 3.x: print(type(unicode_string), ascii(unicode_string))
\ud919 is a lone surrogate code point, which the strict UTF-8 codec refuses to encode. Use the surrogatepass error handler:
>>> 'gn1\ud123a\ud123\ud123\ud123\ud919\ud123\ud123'.encode('utf-8', 'surrogatepass')
b'gn1\xed\x84\xa3a\xed\x84\xa3\xed\x84\xa3\xed\x84\xa3\xed\xa4\x99\xed\x84\xa3\xed\x84\xa3'
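For comparison, here is a small Python 3 sketch of the options, using a shortened version of the string from the question. The variable names are mine. The backslashreplace handler is shown only as one possible way to get the "treat \ud919 as simple string" behaviour the question asks about; whether that suits your data is an assumption.

```python
text = 'gn1\ud123a\ud919'  # shortened sample; \ud919 is a lone surrogate

# The strict default refuses to encode the lone surrogate.
try:
    text.encode('utf-8')
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# surrogatepass writes the surrogate as three raw bytes and can read it back.
raw = text.encode('utf-8', 'surrogatepass')
restored = raw.decode('utf-8', 'surrogatepass')

# backslashreplace keeps the surrogate as the literal text \ud919 instead.
escaped = text.encode('utf-8', 'backslashreplace')
```

Note that bytes produced with surrogatepass are not valid UTF-8 for other programs, so they only round-trip if the reader also uses surrogatepass.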

unicode code points to utf-8 python [duplicate]

This question already has answers here:
Python 2.7: How to convert unicode escapes in a string into actual utf-8 characters
(2 answers)
Closed 5 years ago.
I downloaded tweets in Urdu. When I read the CSV file using pandas in Python, the tweet is shown as follows:
Sample tweet text
Unicode code point
I want to convert this into utf-8.
In Python 2, decode the bytes you read from a file into Unicode with .decode('utf-8').
And when you write the tweet data back to a file, encode it to bytes with .encode('utf-8').
Here I am posting an example:
# -*- coding: utf-8 -*-
string1 = "آکاش کمار"
string2 = string1.decode('utf-8')
string3 = string2.encode('utf-8')
print(string3)
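If the CSV cells really contain literal backslash escapes (the characters \, u, 0, 6, ... rather than the letters themselves), a Python 3 sketch along these lines might help. This is an assumption about the file, since the question does not show the raw data, and the sample value cell is made up for illustration.

```python
# Hypothetical cell value: escape sequences stored as plain ASCII text.
cell = "\\u0622\\u06a9\\u0627\\u0634"

# Interpret the escapes to get the real characters (a str in Python 3).
text = cell.encode("ascii").decode("unicode_escape")

# Encode to UTF-8 bytes, for example before writing to a binary file.
utf8_bytes = text.encode("utf-8")
round_trip = utf8_bytes.decode("utf-8")
```

This only works as written when the cell is pure ASCII; a cell that mixes escapes with real non-ASCII characters would need different handling.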

utf-8 encoding and greek characters [duplicate]

This question already has answers here:
Working with UTF-8 encoding in Python source [duplicate]
(2 answers)
How to output a utf-8 string list as it is in python?
(4 answers)
Closed 6 years ago.
While I managed to get all the data that I need and save it to a CSV file, the output I get is in UTF-8 format, which is normal (correct me if I'm wrong).
TBH I've already "played" with the .encode() and .decode() option without any results.
here is my code
brands=[name.text for name in Unibrands]
here is the output
u'Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae'
And this is the desired output
u'Spirulina Ελληνική'
That string is already fine; you're seeing its repr, which escapes certain characters so that the result is safe to copy and paste directly into Python source code (which in Python 2.x means printable ASCII only). For example, \u0395 represents the code point U+0395 GREEK CAPITAL LETTER EPSILON. You see this form because printing a list (or other container) always shows the repr of its contents. If you print the string directly instead, you should see the appropriate glyphs rather than the escaped form:
>>> print(u'Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae')
Spirulina Ελληνική
You could also consider upgrading to a newer Python version; Python 3 no longer escapes these letters in the repr, since it accepts Unicode characters in source files by default.
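To see the difference on Python 3, a quick sketch like this may help. The list brands stands in for the one built in the question, and ascii() is used to reproduce the escaped form that Python 2's repr shows.

```python
brands = ["Spirulina \u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ae"]

# ascii() escapes every non-ASCII character, like repr() did in Python 2.
escaped = ascii(brands[0])

# In Python 3, repr() keeps printable non-ASCII letters as they are.
shown = repr(brands[0])

# str() (what print uses for a single string) has no quotes or escapes.
plain = str(brands[0])
```

In every case the underlying string is the same; only the way it is displayed changes.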

Python printing special Characters [duplicate]

This question already has an answer here:
SyntaxError of Non-ASCII character [duplicate]
(1 answer)
Closed 9 years ago.
In Python how do I print special characters such as √, ∞, ²,³, ≤, ≥, ±, ≠
When I try printing this to the console I get this error:
print("√")
SyntaxError: Non-ASCII character '\xe2' in file /Users/williamfiset/Desktop/MathAid - Python/test.py on line 4, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
How do I get around this?
Running this code results in the same SyntaxError you reported:
chars = ["√", "∞", "²","³", "≤", "≥", "±", "≠"]
for c in chars:
    print(c)
But if I add # -*- coding: utf-8 -*- at the top of the script:
# -*- coding: utf-8 -*-
chars = ["√", "∞", "²","³", "≤", "≥", "±", "≠"]
for c in chars:
    print(c)
it will print:
√
∞
²
³
≤
≥
±
≠
Also, see SyntaxError of Non-ASCII character.
Hope that helps.
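If you cannot add the declaration or control how the file is saved, one possible alternative is to keep the source ASCII-only and spell the characters as \u escapes. This sketch assumes Python 3, where plain string literals are Unicode; the code points are my own lookup of the symbols in the question.

```python
# Code points for: square root, infinity, superscript two and three,
# less-or-equal, greater-or-equal, plus-minus, not-equal.
chars = ["\u221a", "\u221e", "\u00b2", "\u00b3",
         "\u2264", "\u2265", "\u00b1", "\u2260"]

for c in chars:
    print(c)
```

On Python 2 the same idea needs u"..." literals, and the console still has to support the characters for them to display correctly.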

How do I define strings read from a file as Unicode? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Character reading from file in Python
I want to strip all special characters from an input string read from a file, keeping only actual letters (even Cyrillic letters shouldn't be stripped). The solution I found manually declares the string as Unicode and compiles the pattern with the re.UNICODE flag so that letters from different languages are detected.
# -*- coding: utf-8 -*-
import re
pattern = re.compile(r"[^\w\d]", re.UNICODE)
n_uni = 'ähm whatßs äüöp ×äØü'
uni = u'ähm whatßs äüöp ×äØü'
words = pattern.split(n_uni) #doesn't work
u_words = pattern.split(uni) #works
So if I write the string directly in the source and manually define it as Unicode it gives me the desired output while the non-Unicode string gives me just garbage:
"ähm whatßs äüöp äØü" -> unicode
"hm what s ü p ü" -> non-unicode even with some invalid characters
My question is now how do I define the input from a file as Unicode?
Straight from the docs.
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
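For a version that also runs on Python 3, something like the sketch below should work. It uses io.open, which accepts an encoding argument on both Python 2.6+ and Python 3. The temporary file, the sample text (written with escapes so the snippet itself stays ASCII) and the \W+ pattern are my own choices for illustration, not part of the question.

```python
import io
import os
import re
import tempfile

# Same sample text as in the question, spelled with escapes.
sample = u"\u00e4hm what\u00dfs \u00e4\u00fc\u00f6p \u00d7\u00e4\u00d8\u00fc"

# Create a throwaway UTF-8 file to read back.
fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
with io.open(path, "w", encoding="utf-8") as f:
    f.write(sample)

# Reading with an explicit encoding yields Unicode strings directly.
with io.open(path, encoding="utf-8") as f:
    text = f.read()
os.remove(path)

# Split on runs of non-word characters; drop empty pieces.
pattern = re.compile(r"\W+", re.UNICODE)
words = [w for w in pattern.split(text) if w]
```

Because the text is decoded while it is read, the pattern sees real letters and the multiplication sign is the only character treated as a separator besides the spaces.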
