Python codecs module - python

I am trying to load a file saved as UTF-8 into python (ver2.6.6) which contains 14 different languages. I am using the python codecs module to decode the txt file.
import codecs
f = open('C:/temp/list_test.txt', 'r')
for lines in f:
    line = filter_str(lines.decode("utf-8"))
This all works well. I parse the entire file and then want to export 14 different language files. The problem I can't understand occurs in the output step, for which I use the following code:
malangout = codecs.open("C:/temp/'polish.txt",'w','utf-8','surrogateescape')
for item in lang_dic['English']:
    temp = lang_dic[lang1][item]
    malangout.write(temp + '\n')
malangout.close()
Example:
Language: Polish
Expected output: Dziennik zakłóceń
Actual output: Dziennik zak‚óceƒ
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
I have tried many encodings from the Python docs (7.8 codecs). Any information would help at this point.

The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
Well, that's a problem since
In [25]: print(u'Dziennik zak\u201a\xf3ce\u0192')
Dziennik zak‚óceƒ
in contrast to
In [26]: print(u'Dziennik zak\u0142\xf3ce\u0144')
Dziennik zakłóceń
So it looks like the unicode you are storing is incorrect. Are you sure it is correct in C:/temp/list_test.txt? That is, does list_test.txt contain
In [28]: u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')
Out[28]: 'Dziennik zak\xe2\x80\x9a\xc3\xb3ce\xc6\x92'
or
In [27]: u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
Out[27]: 'Dziennik zak\xc5\x82\xc3\xb3ce\xc5\x84'
?
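One way to see exactly which characters differ is to compare the two strings code point by code point. Below is a minimal sketch in Python 3 syntax using the standard unicodedata module; the variable names stored, expected and diff are only for illustration and are not from the question.

```python
import unicodedata

# The string from the question and the string that was expected.
stored = 'Dziennik zak\u201a\xf3ce\u0192'
expected = 'Dziennik zak\u0142\xf3ce\u0144'

# Collect (position, stored name, expected name) for each mismatch.
diff = [
    (i, unicodedata.name(s), unicodedata.name(e))
    for i, (s, e) in enumerate(zip(stored, expected))
    if s != e
]

for position, got, wanted in diff:
    print(position, got, '!=', wanted)
```

If the same mismatches show up right after decoding the input, the damage is probably already in list_test.txt (or introduced by filter_str) rather than in the output code.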
PS. You may want to change
temp + '\n'
to
temp + u'\n'
to make it clear you are adding two unicode objects together to form a unicode object.
The two lines above have the same result in Python2, but in Python3 adding text (str) and bytes together raises a TypeError. Even though '\n' is already unicode in Python3, I think the challenge in transitioning to Python3 will be in changing one's mental attitude toward mixing unicode and byte strings. In Python2 the conversion is silently attempted for you; in Python3 it is disallowed.
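A small Python 3 sketch of that difference (the variable names are illustrative only): concatenating text with bytes fails, while text plus text works.

```python
text = 'Dziennik'            # str (unicode text) in Python 3
raw = '\n'.encode('utf-8')   # bytes

try:
    text + raw               # mixing str and bytes is not allowed
    mixed_ok = True
except TypeError:
    mixed_ok = False

joined = text + '\n'         # str + str works as expected
print(mixed_ok, repr(joined))
```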

Related

Special characters like ç and ã aren't decoded when the text is obtained from a file

I'm learning Python and tried to make a hanging game (literal translation - don't know the real name of the game in English. Sorry.). For those who aren't familiar with this game, the player must discover a secret word by guessing one letter at a time.
In my code, I get a collection of secret words which is imported from a txt file using the following code:
words_bank = open('palavras.txt', 'r')
words = []
for line in words_bank:
    words.append(line.strip().lower())
words_bank.close()
print(words)
The output of print(words) is ['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3'] but if I try print('maçã, açaí, tucumã') in order to check the special characters, everything is printed correctly. Looks like the issue is in the encoding (or decoding... I'm still reading lots of articles about it to really understand) special characters from files.
The content of line 1 of my code is # coding: utf-8 because after some research I found out that I have to specify the Unicode format that is required for the text to be encoded/decoded. Before adding it, I was receiving the following message when running the code:
File "path/forca.py", line 12
SyntaxError: Non-ASCII character '\xc3' in file path/forca.py on line 12, but no encoding declared
Line 12 content: print('maçã, açaí, tucumã')
Things that I've already tried:
add encode='utf-8' as parameter in open('palavras.txt', 'r')
add decode='utf-8' as parameter in open('palavras.txt', 'r')
same as above but with latin1
substitute line 1 content for #coding: latin1
My OS is Ubuntu 20.04 LTS, my IDE is VS Code.
Nothing works!
I don't know what to search for or what to do anymore.
SOLUTION HERE
Thanks to the help given by the friends above, I was able to find out that the real problem was in the combo VS Code extension (Code Runner) + python alias version from Ubuntu 20.04 LTS.
In my setup Code Runner runs code in the Terminal, so apparently when it called python, the alias pointed to Python 2.7.x. To fix this I used this thread to set Python 3 as the default.
It's done! Whenever python is called, both in the terminal and in VS Code with Code Runner, all special characters work just fine.
Thanks, everybody, for your time and your help =)
This only happens when using Python 2.x.
The output is probably because you're printing a list rather than the items in the list.
When calling print(words) (words is a list), Python invokes a special function called repr on the list object. The list builds its representation by calling repr on each element, then joins the results into a neat string visualisation.
In Python 2, repr(string) returns an ASCII representation (with escapes) rather than a version suited to your terminal.
Instead, try:
for x in words:
    print(x)
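Note: the option for open is encoding (this is Python 3; in Python 2 use io.open). E.g.
open('myfile.txt', encoding='utf-8')
You should always, always pass the encoding option to open. Without it, Python 3 uses the locale's preferred encoding: on Linux and Mac that is UTF-8 for most people, while on Windows it is usually an 8-bit code page.
UTF-8 only becomes the default everywhere from Python 3.15 (PEP 686), or earlier if UTF-8 mode is enabled (python -X utf8).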
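As a rough Python 3 illustration of passing the encoding explicitly, the sketch below writes a small word list to a temporary file (standing in for palavras.txt, which is an assumption made here so the example is self-contained) and reads it back.

```python
import os
import tempfile

# Create a small UTF-8 word list standing in for palavras.txt.
path = os.path.join(tempfile.mkdtemp(), 'palavras.txt')
with open(path, 'w', encoding='utf-8') as words_bank:
    words_bank.write('Maçã\nAçaí\nTucumã\n')

# Read it back, stating the encoding explicitly.
with open(path, encoding='utf-8') as words_bank:
    words = [line.strip().lower() for line in words_bank]

# Print the items themselves rather than the list.
for word in words:
    print(word)
```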
See Python 2.x vs 3.x behaviour:
Py2
>>> print ['maçã', 'açaí', 'tucumã']
['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3']
>>> repr('maçã')
"'ma\\xc3\\xa7\\xc3\\xa3'"
>>> print 'maçã'
maçã
Py3
>>> print(['maçã', 'açaí', 'tucumã'])
['maçã', 'açaí', 'tucumã']
>>> repr('maçã')
"'maçã'"

convert unicode data into malayalam using python (\u0d35 format)

I have been doing topic modeling for Malayalam news articles. The topics are generated in unicode format. The output is as follows:
u'0.021*"\u0d2a\u0d3f" + 0.021*"\u0d35\u0d3f\u0d36\u0d4d\u0d35\u0d02\u0d2d\u0d30\u0d28\u0d4d\u0d31\u0d46" + 0.021*"\u0d05\u0d26\u0d4d\u0d26\u0d47\u0d39\u0d02"'
I want to convert this into a readable string. Whenever file operations are involved, the output file just shows the same escaped string. But I want the result to look like:
0.021*"പി" + 0.021*"വിശ്വംഭരന്റെ" + 0.021*"അദ്ദേഹം"
written to a file.
Seems to work fine for me; make sure the terminal you are printing to supports it.
If you want to write it to a file, you probably need to encode it to UTF-8:
with open("some_file","wb") as f:
    f.write(u'0.021*"\u0d2a\u0d3f" + 0.021*"\u0d35\u0d3f\u0d36\u0d4d\u0d35\u0d02\u0d2d\u0d30\u0d28\u0d4d\u0d31\u0d46" + 0.021*"\u0d05\u0d26\u0d4d\u0d26\u0d47\u0d39\u0d02"'.encode("utf-8"))
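For comparison, here is a sketch of the same idea in Python 3, where the file can be opened in text mode with an explicit encoding instead of encoding by hand. The file name topics.txt and the shortened topic string are assumptions for the example.

```python
import os
import tempfile

# Shortened version of the topic string from the question.
topic = ('0.021*"\u0d2a\u0d3f" + '
         '0.021*"\u0d05\u0d26\u0d4d\u0d26\u0d47\u0d39\u0d02"')

# Text mode with an explicit encoding; Python encodes on write.
path = os.path.join(tempfile.mkdtemp(), 'topics.txt')
with open(path, 'w', encoding='utf-8') as f:
    f.write(topic + '\n')

# Read it back to confirm the file holds real characters, not escapes.
with open(path, encoding='utf-8') as f:
    round_trip = f.read().rstrip('\n')
print(round_trip)
```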

Trouble with utf-8 encoding/decoding

I am reading a .csv which is UTF-8 encoded.
I want to create an index and rewrite the csv.
The index consists of the first letter of a word followed by a running number.
Python 2.7.10, Ubuntu Server
#!/usr/bin/env python
# -*- coding: utf-8 -*-
counter = 0
tempDict = {}
with open(modifiedFile, "wb") as newFile:
    with open(originalFile, "r") as file:
        for row in file:
            myList = row.split(",")
            toId = str(myList[0])
            if toId not in tempDict:
                tempDict[toId] = counter
                myId = str(toId[0]) + str(counter)
                myList.append(myId)
                counter += 1
            else:
                myId = str(toId[0]) + str(tempDict[toId])
                myList.append(myId)
            # and then I write everything into the csv
            for i, j in enumerate(myList):
                if i < 6:
                    newFile.write(str(j).strip())
                    newFile.write(",")
                else:
                    newFile.write(str(j).strip())
                    newFile.write("\n")
The problem is the following.
When a word starts with a fancy letter, such as
Č
É
Ā
...
The id I create starts with a ? instead of the first letter of the word.
The strange part is that within the csv I create, the words with the fancy letters are written correctly. There are no ? or other symbols that would indicate a wrong encoding.
Why is that?
By all means, you should not be learning Python 2 unless there is a specific legacy C extension that you need.
Python 3 makes major changes to the unicode/bytes handling that removes (most) implicit behavior and makes errors visible. It's still good practice to use open('filename', encoding='utf-8') since the default encoding is environment- and platform-dependent.
Indeed, running your program in Python 3 should fix it without any changes. But here's where your bug lies:
toId = str(myList[0])
This is a no-op, since myList[0] is already a str.
myId = str(toId[0]) + str(counter)
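This is a bug: toId is a str (byte string) containing UTF-8 data, so toId[0] is the first byte, not the first character. For a letter such as Č that byte is only part of a multi-byte sequence, which is why the id comes out as ?. You never, ever want to slice or index UTF-8 bytes; decode them to unicode first and work with characters.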
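To make the byte-versus-character point concrete, here is a small sketch in Python 3 syntax. The word 'Čapek' is a made-up example value, and raw plays the role of the byte string that Python 2 reads from the file.

```python
word = 'Čapek'                        # hypothetical first CSV field
raw = word.encode('utf-8')            # the bytes as stored in the file

first_byte = raw[:1]                  # what indexing a byte string gives you
first_char = raw.decode('utf-8')[0]   # decode first, then index

print(len(word), len(raw))            # 5 characters, 6 bytes
print(first_byte, first_char)
```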
with open(originalFile, "r") as file:
This is a style error, since it masks the built-in function file.
There are two changes to make this run under Python 2.
Change open(filename, mode) to io.open(filename, mode, encoding='utf-8').
Stop calling str() on strings: once they are unicode, str() actually attempts to encode them (in ASCII!).
But you really should switch to Python 3.
There are a few pieces new to 2.6 and 2.7 that are intended to bridge the gap to 3, and one of them is the io module, which behaves in all the nice new ways: Unicode files and universal newlines.
~$ python2.7 -c 'import io,sys;print(list(io.open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
[u'\xc4\n', u'\xf9\n']
~$ python3 -c 'import sys;print(list(open(sys.argv[1],encoding="u8")))' <(printf $'\xc3\x84\r\n\xc3\xb9\r\n')
['Ä\n', 'ù\n']
This can be useful to write software for both 2 and 3. Again, the encoding argument is optional but on all platforms the default encoding is environment-dependent, so it's good to be specific.
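Here is a hedged Python 3 sketch of the index-building step from the question, working on already-decoded rows. The function name add_ids and the sample rows are invented for the example; in real use the rows would presumably come from csv.reader over a file opened with encoding='utf-8', and empty first fields would need separate handling.

```python
def add_ids(rows):
    """Append an id (first character + running number) to each row."""
    seen = {}
    result = []
    for row in rows:
        key = row[0]
        if key not in seen:
            # The running number is the count of distinct keys so far.
            seen[key] = len(seen)
        result.append(row + [key[0] + str(seen[key])])
    return result


# Made-up sample data; the first field decides the id.
rows = [['Čapek', 'x'], ['Émile', 'y'], ['Čapek', 'z']]
indexed = add_ids(rows)
for row in indexed:
    print(','.join(row))
```

Because the rows are text rather than bytes, key[0] is the whole first letter, so repeated words get the same id and no ? appears.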
In python 2.x strings are by default non-unicode - str() returns a non-unicode string. Use unicode() instead.
Besides, you must open the file using utf-8 encoding through codecs.open() rather than the built-in open().

Why is python converting Kurdish characters into UTF-8 literals?

I was trying to take the content of a text file and map it into a JSON file, but I noticed that Python automatically turned the Kurdish (Sorani) text into UTF-8 literals. Can someone explain why Python does this and how I can prevent the conversion?
You can test it with the code below:
def readText():
    # test.txt contains kurdish sorani characters (an article)
    # Sorani example: ڕۆژتان باش بەڕێزان. من ناوم ڕەنجە.
    with open('test.txt', 'r') as context:
        data = context.readlines()
        return data
print(readText())
I'm running python 2.x on Ubuntu 14.x. Python2.x does this! Python 3.x does not convert it and works just fine.
You are seeing the repr output: readlines returns a list, and a list shows the repr of each element. Once you print the strings themselves you will see the actual str output. This only happens because you are using Python 2:
In [11]: out = readText()
In [12]: print out
['\xda\x95\xdb\x86\xda\x98\xd8\xaa\xd8\xa7\xd9\x86 \xd8\xa8\xd8\xa7\xd8\xb4 \xd8\xa8\xdb\x95\xda\x95\xdb\x8e\xd8\xb2\xd8\xa7\xd9\x86. \xd9\x85\xd9\x86 \xd9\x86\xd8\xa7\xd9\x88\xd9\x85 \xda\x95\xdb\x95\xd9\x86\xd8\xac\xdb\x95. ']
In [13]: print out[0]
ڕۆژتان باش بەڕێزان. من ناوم ڕەنجە.
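Since the stated goal was a JSON file, it may be worth knowing that the standard json module has a similar escaping default. The following Python 3 sketch (the sample lines are taken from the comment in the question) shows json.dumps with and without ensure_ascii=False; both forms decode back to the same text.

```python
import json

lines = ['ڕۆژتان باش بەڕێزان.', 'من ناوم ڕەنجە.']

escaped = json.dumps(lines)                       # default: \uXXXX escapes
readable = json.dumps(lines, ensure_ascii=False)  # keeps the characters

print(escaped)
print(readable)
```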
I'm going to take a stab here and guess that you are reading the output in a terminal of some sort, and when Python writes to the terminal it's trying to display in ASCII.
If you set your PYTHONIOENCODING environment variable to UTF-8 this can sometimes solve the issue - it depends on other variables as well.
So, if you're on a UNIX-like system, try this in your terminal: export PYTHONIOENCODING=UTF-8
Or, for Windows, set PYTHONIOENCODING=UTF-8.
Then, try running your script again and see if you get the correct characters printed.
More information can be found here: How to print UTF-8 Encoded Text to the console in Python3

Is there any function like iconv in Python?

I have some CSV files that I need to convert from shift-jis to utf-8.
Here is my code in PHP, which successfully transcodes them to readable text.
$str = utf8_decode($str);
$str = iconv('shift-jis', 'utf-8'. '//TRANSLIT', $str);
echo $str;
My problem is how to do the same thing in Python.
I don't know PHP, but does this work :
mystring.decode('shift-jis').encode('utf-8') ?
Also I assume the CSV content is from a file. There are a few options for opening a file in python.
with open(myfile, 'rb') as fin
would be the first and you would get data as it is
with open(myfile, 'r') as fin
would be the default file opening
Also, I tried on my computer with a shift-jis text and the following code worked:
with open("shift.txt" , "rb") as fin :
    text = fin.read()
text.decode('shift-jis').encode('utf-8')
result was the following in UTF-8 (without any errors)
' \xe3\x81\xa6 \xe3\x81\xa7 \xe3\x81\xa8'
Ok I validate my solution :)
The first char is indeed the good character: "\xe3\x81\xa6" means "E3 81 A6"
It gives the correct result.
You can try yourself at this URL
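In Python 3 the same conversion can be sketched by letting open do the decoding and encoding. The file names below are placeholders, and the sample file is created first only so the example runs on its own.

```python
import os
import tempfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'shift.txt')       # hypothetical input file
dst = os.path.join(workdir, 'shift_utf8.txt')  # hypothetical output file

# Create a Shift-JIS sample so the sketch is self-contained.
with open(src, 'w', encoding='shift_jis') as fout:
    fout.write(' て で と')

# Decode on read, encode on write.
with open(src, encoding='shift_jis') as fin, \
        open(dst, 'w', encoding='utf-8') as fout:
    for line in fin:
        fout.write(line)

# Inspect the raw UTF-8 bytes of the result.
with open(dst, 'rb') as f:
    converted = f.read()
print(converted)
```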
For when Python's built-in encodings are insufficient, there's an iconv package on PyPI.
pip install iconv
Unfortunately the documentation is nonexistent.
There's also iconv_codecs
pip install iconv_codecs
eg:
>>> import iconv_codecs
>>> iconv_codecs.register('ansi_x3.110-1983')
>>> "foo".encode('ansi_x3.110-1983')
It would be helpful if you could post the string that you are trying to convert, since this error suggests a problem with the input data. Older versions of PHP failed silently on broken input strings, which makes this hard to diagnose.
According to the documentation this might also be due to differences in shift-jis dialects, try using 'shift_jisx0213' or 'shift_jis_2004' instead.
If using another dialect does not work you might get away with asking python to fail silently by using .decode('shift-jis','ignore') or .decode('shift-jis','replace') .
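A short Python 3 sketch of how the error handlers behave. The broken input is constructed artificially here by appending a 0xFF byte, which is an assumption for illustration and not the asker's actual data.

```python
good = 'て'.encode('shift_jis')
broken = good + b'\xff'   # 0xFF appended as a deliberately invalid byte

# The default (strict) handler raises on the bad byte.
try:
    broken.decode('shift_jis')
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

ignored = broken.decode('shift_jis', 'ignore')    # drops the bad byte
replaced = broken.decode('shift_jis', 'replace')  # inserts U+FFFD instead
print(strict_failed, ignored, replaced)
```

Both handlers lose information, so they are best kept as a last resort after trying the other Shift-JIS variants.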
