I have some CSV files need to convert from shift-jis to utf-8.
Here is my code in PHP, which is successful transcode to readable text.
$str = utf8_decode($str);
$str = iconv('shift-jis', 'utf-8'. '//TRANSLIT', $str);
echo $str;
My problem is how to do same thing in Python.
I don't know PHP, but does this work :
mystring.decode('shift-jis').encode('utf-8') ?
Also I assume the CSV content is from a file. There are a few options for opening a file in python.
with open(myfile, 'rb') as fin
would be the first and you would get data as it is
with open(myfile, 'r') as fin
would be the default file opening
Also I tried on my computed with a shift-js text and the following code worked :
with open("shift.txt" , "rb") as fin :
text = fin.read()
text.decode('shift-jis').encode('utf-8')
result was the following in UTF-8 (without any errors)
' \xe3\x81\xa6 \xe3\x81\xa7 \xe3\x81\xa8'
Ok I validate my solution :)
The first char is indeed the good character: "\xe3\x81\xa6" means "E3 81 A6"
It gives the correct result.
You can try yourself at this URL
for when pythons built-in encodings are insufficient there's an iconv at PyPi.
pip install iconv
unfortunately the documentation is nonexistant.
There's also iconv_codecs
pip install iconv_codecs
eg:
>>> import iconv_codecs
>>> iconv_codecs.register('ansi_x3.110-1983')
>>> "foo".encode('ansi_x3.110-1983')
It would be helpful if you could post the string that you are trying to convert since this error suggest some problem with the in-data, older versions of PHP failed silently on broken input strings which makes this hard to diagnose.
According to the documentation this might also be due to differences in shift-jis dialects, try using 'shift_jisx0213' or 'shift_jis_2004' instead.
If using another dialect does not work you might get away with asking python to fail silently by using .decode('shift-jis','ignore') or .decode('shift-jis','replace') .
Related
I'm learning Python and tried to make a hanging game (literal translation - don't know the real name of the game in English. Sorry.). For those who aren't familiar with this game, the player must discover a secret word by guessing one letter at a time.
In my code, I get a collection of secret words which is imported from a txt file using the following code:
words_bank = open('palavras.txt', 'r')
words = []
for line in words_bank:
words.append(line.strip().lower())
words_bank.close()
print(words)
The output of print(words) is ['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3'] but if I try print('maçã, açaí, tucumã') in order to check the special characters, everything is printed correctly. Looks like the issue is in the encoding (or decoding... I'm still reading lots of articles about it to really understand) special characters from files.
The content of line 1 of my code is # coding: utf-8 because after some research I found out that I have to specify the Unicode format that is required for the text to be encoded/decoded. Before adding it, I was receiving the following message when running the code:
File "path/forca.py", line 12
SyntaxError: Non-ASCII character '\xc3' in file path/forca.py on line 12, but no encoding declared
Line 12 content: print('maçã, açaí, tucumã')
Things that I've already tried:
add encode='utf-8' as parameter in open('palavras.txt', 'r')
add decode='utf-8' as parameter in open('palavras.txt', 'r')
same as above but with latin1
substitute line 1 content for #coding: latin1
My OS is Ubuntu 20.04 LTS, my IDE is VS Code.
Nothing works!
I don't know what search and what to do anymore.
SOLUTION HERE
Thanks to the help given by the friends above, I was able to find out that the real problem was in the combo VS Code extension (Code Runner) + python alias version from Ubuntu 20.04 LTS.
Code Runner is set to run codes in Terminal in my situation, so apparently, when it calls for python the alias version was python 2.7.x. To overcome this situation I've used this thread to set python 3 as default.
It's done! Whenever python is called, both in terminal and VS Code with Code Runner, all special characters works just fine.
Thank's everybody for your time and your help =)
This only happens when using Python 2.x.
The error is probably because you're printing a list not printing items in the list.
When calling print(words) (words is a list), Python invokes a special function called repr on the list object. The list then creates a summary representation of the list by calling repr in each child in the list, then creates a neat string visualisation.
repr(string) actually returns an ASCII representation (with escapes) rather than a suitable version for your terminal.
Instead, try:
for x in words:
print(x)
Note. The option for open is encoding. E.g
open('myfile.txt', encoding='utf-8')
You should always, always pass the encoding option to open - Python <=3.8 on Linux and Mac will assume UTF-8 (for most people). Python <=3.8 on Windows will use an 8-bit code page.
Python 3.9 will always use UTF-8
See Python 2.x vs 3.x behaviour:
Py2
>>> print ['maçã', 'açaí', 'tucumã']
['ma\xc3\xa7\xc3\xa3', 'a\xc3\xa7a\xc3\xad', 'tucum\xc3\xa3']
>>> repr('maçã')
"'ma\\xc3\\xa7\\xc3\\xa3'"
>>> print 'maçã'
maçã
Py3
>>> print(['maçã', 'açaí', 'tucumã'])
['maçã', 'açaí', 'tucumã']
>>> repr('maçã')
"'maçã'"
Note: The possible duplicate concerns an older version of Python and this question has already generated unique answers.
I have been working on a script to process Project Gutenberg Texts texts into an internal file format for an application I am developing. In the script I process chapter headings with the re module. This works very well except in one case: the first line. My regex will always fail on the first Chapter marker at the first line if it includes the ^ caret to require the regex match to be at the beginning of the line because the BOM is consumed as the first character. (Example regex: ^Chapter).
What I've discovered is that if I do not include the caret, it won't fail on the first line, and then <feff> is included in the heading after I've processed it. An example:
<h1><feff>Chapter I</h1>
The advice according to this SO question (from which I learned of the BOM) is to fix your script to not consume/corrupt the BOM. Other SO questions talk about decoding the file with a codec but discuss errors I never encounter and do not discuss the syntax for opening a file with the template decoder.
To be clear:
I generally use pipelines of the following format:
cat -s <filename> | <other scripts> | python <scriptname> [options] > <outfile>
And I am opening the file with the following syntax:
import sys
fin = sys.stdin
if '-i' in sys.argv: # For command line option "-i <infile>"
fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt')
for line in fin:
...Processing here...
My question is what is the proper way to handle this? Do I remove the BOM before processing the text? If so, how? Or do I use a decoder on the file before processing it (I am reading from stdin, so how would I accomplish this?)
The files are stored in UTF-8 encoding with DOS endings (\r\n). I convert them in vim to UNIX file format before processing using set ff=unix (I have to do several manual pre-processing tasks before running the script).
As a complement to the existing answer, it is possible to filter the UTF8 BOM from stdin with the codecs module. Simply you must use sys.stdin.buffer to access the underlying byte stream and decode it with a StreamReader
import sys
import codecs
# trick to process sys.stdin with a custom encoding
fin = codecs.getreader('utf_8_sig')(sys.stdin.buffer, errors='replace')
if '-i' in sys.argv: # For command line option "-i <infile>"
fin = open(sys.argv[sys.argv.index('-i') + 1], 'rt',
encoding='utf_8_sig', errors='replace')
for line in fin:
...Processing here...
In Python 3, stdin should be auto-decoded properly, but if it's not working for you (and for Python 2) you need to specify PythonIOEncoding before invoking your script like
PYTHONIOENCODING="UTF-8-SIG" python <scriptname> [options] > <outfile>
Notice that this setting also makes stdout working with UTF-8-SIG, so your <outfile> will maintain the original encoding.
For your -i parameter, just do open(path, 'rt', encoding="UTF-8-SIG")
You really don't need to import codecs or anything to deal with this. As lenz suggested in comments just check for the BOM and throw it out.
for line in input:
if line[0] == "\ufeff":
line = line[1:] # trim the BOM away
# the rest of your code goes here as usual
In Python 3.9 default encoding for standard input seems to be utf-8, at least on Linux:
In [2]: import sys
In [3]: sys.stdin
Out[3]: <_io.TextIOWrapper name='<stdin>' mode='r' encoding='utf-8'>
sys.stdin has the method reconfigure():
sys.stdin.reconfigure("utf-8-sig")
which should be called before any attempt of reading the standard input. This will decode the BOM, which will no longer appear when reading sys.stdin.
Based on a few examples (here, here) I can use psycopg2 to read and hopefully run a SQL file from python (the file is 200 lines, and though runs quickly, isn't something I want to maintain in a python file).
Here is the script to read and run the file:
sql = open("sql/sm_bounds_current.sql", "r").read()
curDest.execute(sql)
However, when I run the script, the following error is thrown:
Error: syntax error at or near "drop"
LINE 1: drop table dpsdata.sm_boundaries_current_dev;
As you can see, the first line in the script is to drop a table, but I'm not sure why the extra characters are being read, and can't seem to find a solution that might set the encoding of the file when reading it.
Found this post dealing with encoding and byte order marks, which is where my problem was.
The solution was to import OPEN from CODECS, which allows for the ENCODING option on OPEN:
import codecs
from codecs import open
sqlfile = "sql/sm_bounds_current.sql"
sql = open(sqlfile, mode='r', encoding='utf-8-sig').read()
curDest.execute(sql)
Seems to work great!
This question is related to this one.
Please read the problem that Chris describes there. I'll narrow it down: there's a CURL error 26 if a filename is utf-8-encoded and contains characters that are not in the range of those supported by non-unicode programs.
Let me explain myself:
local_filename = filename.encode("utf-8")
self.curl.setopt(self.curl.HTTPPOST, [(field, (self.curl.FORM_FILE, local_filename, self.curl.FORM_FILENAME, local_filename))])
I have windows 7 with Russian set as the language for non-unicode programs. If I don't encode filename to utf-8 (and pass filename, not local_filename to pycurl(, everything goes flawlessly if the filename contains either English or Russian chars. But if there is, say, an à, — it throws an error 26. If I pass local_filename (so encoded to UTF-8), even Russian chars aren't allowed.
Could you help, please? Thanks!
This is easy to answer, harder to fix:
pycurl uses libcurl for formposting. libcurl uses plain fopen() to open files for posting. Therefore you need to tell libcurl the exact file name that it should open and read from your local file system.
Decompose this problem into 2 components:
tell pycurl which file to open to read file data
send filename in correct encoding to the server
These may or may not be same encodings.
For 1, use sys.getfilesystemencoding() to convert unicode filename (which you use throughout python code correctly) to a string that pycurl/libcurl can open correctly with fopen(). Use strace (linux) or equivalent windows osx to verify correct file path is being opened by pycurl.
If that totally fails you can always feed file data stream from Python via pycurl.READFUNCTION.
For 2, learn how filename is transmitted during file upload, example. I don't have a good link, all I know it's not trivial, e.g. when it comes to very long file names.
I hacked up your code snippet, I have this, it works against nc -vl 5050 at least.
#!/usr/bin/python
import pycurl
c = pycurl.Curl()
filename = u"example-\N{EURO SIGN}.mp3"
with open(filename, "wb") as f:
f.write("\0\xfffoobar\x07\xff" * 9)
local_filename = filename.encode("utf-8")
c.setopt(pycurl.HTTPPOST, [("xxx", (pycurl.FORM_FILE, local_filename, pycurl.FORM_FILENAME, local_filename))])
c.setopt(pycurl.URL, "http://localhost:5050/")
c.setopt(pycurl.HTTPHEADER, ["Expect:"])
c.perform()
My test doesn't cover the case where encoding is different between OS and HTTP.
Should be enough to get your started though, shouldn't it?
I am trying to load a file saved as UTF-8 into python (ver2.6.6) which contains 14 different languages. I am using the python codecs module to decode the txt file.
import codecs
f = open('C:/temp/list_test.txt', 'r')
for lines in f:
line=filter_str(lines.decode("utf-8")
This all works well. I parse the entire file and then want to export
14 different language files. The problem that I can't understand is the following
I use the following code for output:
malangout = codecs.open("C:/temp/'polish.txt",'w','utf-8','surrogateescape')
for item in lang_dic['English']:
temp = lang_dic[lang1][item]
malangout.write(temp + '\n')
malangout.close()
Example:
Language: Polish
Expected output: Dziennik zakłóceń
Actual output: Dziennik zak‚óceƒ
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
I have tried many encoding from the python docs (7.8 codecs). Any infomation would help at this point.
The string is stored as is:
u'Dziennik zak\u201a\xf3ce\u0192'
Well, that's a problem since
In [25]: print(u'Dziennik zak\u201a\xf3ce\u0192')
Dziennik zak‚óceƒ
in contrast to
In [26]: print(u'Dziennik zak\u0142\xf3ce\u0144')
Dziennik zakłóceń
So it looks like the unicode you are storing is incorrect. Are you sure it is correct in C:/temp/list_test.txt? That is, does list_test.txt contain
In [28]: u'Dziennik zak\u201a\xf3ce\u0192'.encode('utf-8')
Out[28]: 'Dziennik zak\xe2\x80\x9a\xc3\xb3ce\xc6\x92'
or
In [27]: u'Dziennik zak\u0142\xf3ce\u0144'.encode('utf-8')
Out[27]: 'Dziennik zak\xc5\x82\xc3\xb3ce\xc5\x84'
?
PS. You may want to change
temp + '\n'
to
temp + u'\n'
to make it clear you are adding two unicode together to form a unicode.
The two lines above have the same result in Python2, but in Python3 adding a unicode and str together would raise a TypeError. Even though in Python3, '\n' is unicode, I think the challenge in transitioning to Python3 will be in changing one's mental attitude toward mixing unicode and str. In Python2 it is silently attempted for you, in Python3 it is disallowed.