How to generate hebrew strings in python 3? - python

I'm trying to create hebrew strings but get syntax errors. It works in the IDLE shell but not in Pydev.
Here's what I've tried so far:
s = 'מחרוזת בעברית' #works in the shell only
s = u'מחרוזת בעברית' #doesn't work at all
s = unicode("מחרוזת בעברית", "UTF-8") #also doesn't work at all
I get a syntax error: Non-UTF-8 code starting with '\xee'.
What does it mean and what shall I do to create hebrew strings?

Does your source file start with a # -*- coding: utf-8 -*- line? Is your file actually encoded as utf-8 (and not some other encoding)?
It's supposed to work (the first line, other lines are not valid Python 3).

Related

Comparison of Non ASCII only works in IDLE

I'm doing a fairly simple code that transforms European Portuguese input into Brazilian Portuguese -- so there are a lot of accented characters such as á,é,À,ç, etc.
Basically, the goal is to find words in the text from a list and replace them with the BR words from a second list.
Here's the code:
#-*- coding: latin-1 -*-
listapt=["gestão","utilizador","telemóvel"]
listabr=["gerenciamento", "usuário", "celular"]
while True:
#this is all because I need to be able to input multiple lines of text, seems to be working fine
print ("Insert text")
lines = []
while True:
line = raw_input()
if line != "FIM":
lines.append(line)
else:
break
text = '\n'.join(lines)
for word in listapt:
if word in text:
num = listapt.index(word)
wordbr = listabr[num]
print(word + " --> " + wordbr) #just to show what changes were made
text = text.replace(word, wordbr)
print(text)
I run the code on Windows using IDLE and by double-clicking on the .py file.
The code works fine when using IDLE, but does not match and replace characters when double-clicking the .py file.
Here's why the code works as expected in IDLE but not from CMD or by doubleclicking:
Your code is UTF-8 encoded, not latin-1 encoded
IDLE always works in UTF-8 "input/output" mode.
On Windows, CMD/Doubleclicking will use a non-UTF-8 8bit locale.
When your code compares the input to the hardcoded strings it's doing so at a byte level. On IDLE, it's comparing UTF-8 to hardcoded UTF-8. On CMD, it's comparing non-UTF-8 8bit to hardcoded UTF-8 (If you were on a stock MacOS, it would also work).
The way to fix this is to make sure you're comparing "apples with apples". You could do this by converting everything to the same encoding. E.g. Convert the input read to UTF-8 so it matches the hardcoded strings. The better solution is to convert all [byte] strings to Unicode strings (Strings with no encoding). If you were on Python 3, this would be all automatic.
On Python 2.x, you need to do three things:
Prefix all sourcecode strings with u to make them Unicode strings:
listapt=[u"gestão",u"utilizador",u"telemóvel"]
listabr=[u"gerenciamento",u"usuário", u"celula]
...
if line != u"FIM":
Alternatively, add from __future__ import unicode_literals to avoid changing all your code.
Use the correct coding header for the encoding of your file. I suspect your header should read utf-8. E.g.
#-*- coding: utf-8 -*-
Convert the result of raw_input to Unicode. This must be done with the detected encoding of the standard input:
import sys
line = raw_input().decode(sys.stdin.encoding)
By the way, the better way to model list of words to replace it to use a dict. The keys are the original word, the value is the replacement. E.g.
words = { u"telemóvel": u"celula"}
I don't see that problem over here.
Based on your use of raw_input, it seems like you're using Python 2.x
This may be because I'm copypasting off of stack overflow, and have a different dev environment to you.
Try running your script under the latest Python 3 interpreter, as well as removing the "#-*- coding:" line.
This should either hit UnicodeDecodeError issues a lot sooner in your code, or work fine.
The problem you have here is Python 2.x getting confused at some point while trying to translate between byte sequences (what Python 2.x strings contain, eg binary file contents), and human-meaningful text (unicode, eg for things like user informational display of chinese characters), because it makes incorrect assumptions about how human-readable text was encoded into the byte sequence seen in the Python strings.
It's a detail that Python 3 attempts to address a lot better/less ambiguously.
First try executing the code below, it should resolve the issue:
# -*- coding: latin-1 -*-
listapt=[u"gestão",u"utilizador",u"telemóvel"]
listabr=[u"gerenciamento",u"usuário", u"celular"]
lines=[]
line = raw_input()
line = line.decode('latin-1')
if line != "FIM":
lines.append(line)
text = u'\n'.join(lines)
for word in listapt:
if word in text:
print("Hello")
num = listapt.index(word)
print(num)
wordbr = listabr[num]
print(wordbr)

Python Notepad++ encoding error

I am using Python for the first time and am running into an encoding error that I can't seem to get around. Here is the code:
#!/usr/bin/python
#-*- coding: utf -*-
import pandas as pd
a = "C:\Users"
print(a)
When I do this, I get:
File "C:\Users\Public\Documents\Python Scripts\ImportExcel.py", line
5
a = "C:\Users"
^ SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in positio n 2-3: truncated \UXXXXXXXX escape
In Notepad++ I have tried all of the encoding options. Nothing seems to work.
Any suggestions?
Specifically, the problem is that the '\' is an escape character.
If you want to print the string
"C:\Users"
then you have to do it thus:
a = "C:\\Users"
Hope this helps.
The error message suggests you're on a Windows machine, but you're using *nix notation for #!/usr/bin/python. That line should look something like #!C:\Python33\python.exe on a Windows machine, depending on where you've installed Python.
Use this: # -*- coding: utf-8 -*- instead of #-- coding: utf --
You can set the encoding in Notepad++, but you also need to tell Python about it.
In legacy Python (2.7), source code is ASCII unless specified otherwise. In Python 3, source code is UTF-8 unless otherwise specified.
You should use the following as the first or second line of the file to specify the encoding of the source code. The documentation gives:
# -*- coding: <encoding> -*-
This is the format originally from the Emacs editor, but according to PEP263 you can also use:
# vim: set fileencoding=<encoding>:
of even:
# coding=<encoding>
Where <encoding> can be any encoding that Python supports, but utf-8 is generally a good choice for portable code.

Unicode with cp1251 and utf-8 on windows

I am playing around with unicode in python.
So there is a simple script:
# -*- coding: cp1251 -*-
print 'юникод'.decode('cp1251')
print unicode('юникод', 'cp1251')
print unicode('юникод', 'utf-8')
In cmd I've switched encoding to Active code page: 1251.
And there is the output:
СЋРЅРёРєРѕРґ
СЋРЅРёРєРѕРґ
юникод
I am a little bit confused.
Since I've specified encoding to cp1251 I expect that it would be decoded correctly.
But as result there is some trash code points were interpreted.
I am understand that 'юникод' is just a bytes like:
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'.
But there is a way to get correct output in terminal with cp1251?
Should I build byte string manually?
Seems like I misunderstood something.
I think I can understand what happened to you. The last line gave me the hint, that your trash codepoints confirmed. You try to display cp1251 characters but your editor is configured to use utf8.
The # -*- coding: cp1251 -*- is only used by the Python interpretor to convert characters from source python files that are outside of the ASCII range. And anyway it it is only used for unicode litterals because bytes from original source give er... exactly same bytes in byte strings. Some text editors are kind enough to automagically use this line (IDLE editor is), but I'm little confident in that and allways switch manually to the proper encoding when I use gvim for example. Short story: # -*- coding: cp1251 -*- in unused in your code and can only mislead a reader since it it not the actual encoding.
If you want to be sure of what lies in your source, you'd better use explicit escapes. In code page 1251, this word юникод is composed by those characters: '\xfe\xed\xe8\xea\xee\xe4'
If you write this source:
txt = '\xfe\xed\xe8\xea\xee\xe4'
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
and execute it in a console configured to use CP1251 charset, the first three lines will output юникод, and the last one will throw a UnicodeDecodeError exception because the input is no longer valid 'utf8'.
Alternatively, if you find comfortable with you current editor, you could write:
# -*- coding: utf8 -*-
txt = 'юникод'.decode('utf8').encode('cp1251') # or simply txt = u'юникод'.encode('cp1251')
print txt
print txt.decode('cp1251')
print unicode(txt, 'cp1251')
print unicode(txt, 'utf-8')
which should give same results - but now the declared source encoding should be the actual encoding of your python source.
BTW, a Python 3.5 IDLE that natively uses unicode confirmed that:
>>> 'СЋРЅРёРєРѕРґ'.encode('cp1251').decode('utf8')
'юникод'
Your issue is that the encoding declaration is wrong: your editor uses utf-8 character encoding to save the source code. Use # -*- coding: utf-8 -*- to fix it.
>>> u'юникод'
u'\u044e\u043d\u0438\u043a\u043e\u0434'
>>> u'юникод'.encode('utf-8')
'\xd1\x8e\xd0\xbd\xd0\xb8\xd0\xba\xd0\xbe\xd0\xb4'
>>> print _.decode('cp1251') # mojibake due to the wrong encoding
СЋРЅРёРєРѕРґ
>>> print u'юникод'
юникод
Do not use bytestrings ('' literals create bytes object on Python 2) to represent text; use Unicode strings (u'' literals -- unicode type) instead.
If your code uses Unicode strings then a code page that your Windows console uses doesn't matter as long as the chosen font can display the corresponding (non-BMP) characters. See Python, Unicode, and the Windows console
Here's complete code, for reference:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
print(u'юникод')
Note: no .decode(), unicode(). If you are using a literal to create a string; you should use Unicode literals if the string contains text. It is the only option on Python 3 where you can't put non-ascii characters inside a bytes literal and it is a good practice (to use Unicode for text instead of bytestrings) on Python 2 too.
If you are given a bytestring as an input (not literal) by some API then its encoding has nothing to do with the encoding declaration. What specific encoding to use depends on the source of the data.
Just use the following, but ensure you save the source code in the declared encoding. It can be any encoding that supports the characters you want to print. The terminal can be in a different encoding, as long as it also supports the characters you want to print:
#coding:utf8
print u'юникод'
The advantage is that you don't need to know the terminal's encoding. Python will normally1 detect the terminal encoding and encode the print output correctly.
1Unless your terminal is misconfigured.

Why I cannot save file with Chinese characters when using Python 2.7.11 IDLE?

I just downloaded the latest Python 2.7.11 64bit from its official website and installed it to my Windows 10. And I found that if the new IDLE file contains Chinese character, like 你好, then I cannot save the file. If I tried to save it for several times, then the new file crashed and disappeared.
I also installed the latest python-3.5.1-amd64.exe, and it does not have this issue.
How to solve it?
More:
A example code from wiki page, https://zh.wikipedia.org/wiki/%E9%B8%AD%E5%AD%90%E7%B1%BB%E5%9E%8B
If I past the code here, StackOverflow alays warn me: Body cannot contain "I just dow". Why?
Thanks!
More:
I find this config option, but it does not help at all.
IDLE -> Options -> Configure IDLE -> General -> Default Source Encoding: UTF-8
More:
By adding u before the Chinese code, everything will be right, it is great way. Like below:
Without u there, sometimes it will go with corrupted code. Like below:
Python 2.x uses ASCII as default encoding, while Python 3.x uses UTF-8. Just use:
my_string.encode("utf-8")
to convert ascii to utf-8 (or change it to any other encoding you need)
You can also try to put this line on the first line of your code:
# -*- coding: utf-8 -*-
Python 2 uses ASCII as its default encoding for its strings which cannot store Chinese characters. On the other hand, Python 3 uses Unicode encoding for its strings by default which can store Chinese characters.
But that doesn't mean Python 2 cannot use Unicode strings. You just have to encode your strings into Unicode. Here's an example of converting your strings to Unicode strings.
>>> plain_text = "Plain text"
>>> plain_text
'Plain text'
>>> utf8_text = unicode(plain_text, "utf-8")
>>> utf8_txt
u'Plain_text'
The prefix u in the string, utf8_txt, says that it is a Unicode string.
You could also do this.
>>> print u"你好"
>>> 你好
You just have to prepend your string with u to signify that it is a Unicode string.
When using Python 2 on Windows:
For file with Unicode characters to be saved in IDLE, a line
# -*- coding: utf-8 -*-
has to be added in its beginning.
And for Unicode characters to show correctly in console output in Windows, if running a script, saved in a file, in IDLE console or in Windows shell, strings have to be prepended with u:
print u"你好"
print u"Привет"
But in interactive mode, I discovered no need for this with cyrillic.
even in python 3.7 I still experience the same issue, UTF-8 still does the trick

python utf-8 japanese

I have some Japanese words I wish to convert to utf-8, as shown below:
jap_word1 = u'中山'
jap_word2 = u'小倉'
print jap_word1.encode('utf-8') # Doesn't work
print jap_word2.encode('utf-8') # Prints properly
Why is it that one word can be converted properly to utf-8 and printed to show the same characters but not the other?
(I am using python 2.6 on Windows 7 Ultimate)
Lots of things must align to print characters properly:
What encoding is the script saved in?
Do you have a # coding: xxxx statement in your script, where xxxx matches the encoding the file is saved in?
Does your output terminal support your encoding? import sys; print sys.stdout.encoding
a. If not, can you change the console encoding? (chcp command on Windows)
Does the font you are using support the characters?
Saving the script in UTF-8, this works in both PythonWin and IDLE.
# coding: utf-8
jap_word1 = u'中山'
jap_word2 = u'小倉'
print jap_word1
print jap_word2
Interestingly, I got your results with the .encode('utf-8') added to both prints in IDLE, but it worked correctly in Pythonwin, whose default output window supports UTF-8.
Idle is a strange beast. sys.stdout.encoding on my system produces 'cp1252', which doesn't support Asian characters, but it prints the first word wrong and the second one right when printing in UTF-8.
Because your console is not in UTF-8. Run chcp 65001 before running.

Categories