Python not opening Japanese filenames

I've been working on a Python script to open a file with a Unicode name (mostly Japanese) and save it to a randomly generated (non-Unicode) filename on Windows Vista 64-bit, and I'm having issues... It works fine with non-Unicode filenames (even if the content is Unicode), but the second you try to pass a Unicode filename in, it fails.
Here's the code:
try:
    import sys, os
    inpath = sys.argv[1]
    outpath = sys.argv[2]
    filein = open(inpath, "rb")
    contents = filein.read()
    fileSave = open(outpath, "wb")
    fileSave.write(contents)
    fileSave.close()
    testfile = open(outpath + '.test', 'wb')
    testfile.write(inpath)
    testfile.close()
except:
    errlog = open('G:\\log.txt', 'w')
    errlog.write(str(sys.exc_info()))
    errlog.close()
And the error:
(<type 'exceptions.IOError'>, IOError(2, 'No such file or directory'), <traceback object at 0x01092A30>)

You have to convert your inpath to unicode, like this:
inpath = sys.argv[1]
inpath = inpath.decode("UTF-8")
filein = open(inpath, "rb")
I'm guessing you are using Python 2.6, because in Python 3, all strings are unicode by default, so this problem wouldn't happen.
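One caveat with the decode above: on Windows, Python 2 fills sys.argv using the ANSI code page, not UTF-8, so decoding with the filesystem encoding is usually the safer first try. A small sketch of that idea (a general observation, not something the original poster's setup confirms):
import sys

inpath = sys.argv[1]
if isinstance(inpath, str):
    # On Windows this is normally 'mbcs' (the ANSI code page)
    inpath = inpath.decode(sys.getfilesystemencoding())
Note that if the ANSI code page cannot represent the Japanese characters at all, they are already mangled by the time Python sees them; that is the case the recipe linked in the next answer handles.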

My guess is that sys.argv[1] and sys.argv[2] are just byte strings and don't natively support Unicode. You could confirm this by printing them and seeing whether they are the characters you expect. You should also print type(sys.argv[1]) to make sure they are of the correct type.
Where do the command-line parameters come from? Do they come from another program or are you typing them on the command-line? If they come from another program, you could have the other program encode them to UTF-8 and then have your Python program decode them from UTF-8.
Which version of Python are you using?
Edit: here's a robust solution: http://code.activestate.com/recipes/572200/
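For reference, the core of that recipe is fetching the original wide-character command line from the Win32 API instead of relying on sys.argv. A minimal sketch of the idea (Windows only, Python 2; the linked recipe is more complete and better tested):
import ctypes

def win32_unicode_argv():
    # Ask Windows for the original UTF-16 command line and split it,
    # bypassing the lossy ANSI conversion that sys.argv went through.
    GetCommandLineW = ctypes.windll.kernel32.GetCommandLineW
    GetCommandLineW.restype = ctypes.c_wchar_p
    CommandLineToArgvW = ctypes.windll.shell32.CommandLineToArgvW
    CommandLineToArgvW.restype = ctypes.POINTER(ctypes.c_wchar_p)

    argc = ctypes.c_int(0)
    argv = CommandLineToArgvW(GetCommandLineW(), ctypes.byref(argc))
    # This list includes the interpreter and script name as well,
    # so slice off everything before your script's own arguments.
    return [argv[i] for i in xrange(argc.value)]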

Related

broken CJK data when reading ISO-8859-1 file in python

I'm parsing some file that is ISO-8859-1 and has Chinese, Japanese, Korean characters in it.
import os
from os import listdir

cnt = 0
base_path = 'data/'
cwd = os.path.abspath(os.getcwd())
for f in os.listdir(base_path):
    path = cwd + '/' + base_path + f
    cnt = 0
    with open(path, 'r', encoding='ISO-8859-1') as file:
        for line in file:
            print('line {}: {}'.format(cnt, line))
            cnt += 1
The code runs but it prints broken characters. Other stackoverflow questions suggest I use encode and decode. For example, for Korean texts, I tried
file.read().encode('latin1').decode('euc-kr'), but that didn't do anything. I also tried to convert the files into utf-8 using iconv but the characters are still broken in the converted text file.
Any suggestions would be much appreciated.
Sorry, no. ISO-8859-1 cannot contain any Chinese, Japanese, or Korean characters; the code page doesn't support them in the first place.
What your code does is ask Python to assume the file is in ISO-8859-1 encoding and return the characters as Unicode (which is how Python 3 strings are represented). If you do not specify the encoding parameter in open(), the default is your platform's preferred locale encoding (often UTF-8), and the result is still Unicode, i.e. logical characters with no particular encoding attached.
Now the question is how those CJK characters are actually encoded in the file. If you know the answer, you can just put the right encoding parameter in open() and it works right away. Let's say it is EUC-KR as you mentioned; the code should be:
with open(path, 'r', encoding='euc-kr') as file:
    for line in file:
        print('line {}: {}'.format(cnt, line))
        cnt += 1
If you feel frustrated, please take a look at chardet. It should help you detect the encoding from text. Example:
import chardet

with open(path, 'rb') as file:
    rawdata = file.read()
guess = chardet.detect(rawdata)  # e.g. {'encoding': 'EUC-KR', 'confidence': 0.99}
text = rawdata.decode(guess['encoding'])  # decode the raw bytes, not the guess dict

cnt = 0
for line in text.splitlines():
    print('line {}: {}'.format(cnt, line))
    cnt += 1

Change file encoding scheme in Python

I'm trying to open a file using latin-1 encoding in order to produce a file with a different encoding. I get a NameError stating that unicode is not defined. Here is the piece of code I use to do this:
sourceEncoding = "latin-1"
targetEncoding = "utf-8"
source = open(r'C:\Users\chsafouane\Desktop\saf.txt')
target = open(r'C:\Users\chsafouane\Desktop\saf2.txt', "w")
target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
I'm not used to handling files at all, so I don't know whether there is a module I should import to use "unicode".
The fact that you see unicode not defined suggests that you're on Python 3. Here's a code snippet that'll generate a latin1-encoded file, then does what you want to do: slurp the latin1-encoded file and spit out a UTF8-encoded file:
# Generate a latin1-encoded file
txt = u'U+00AxNBSP¡¢£¤¥¦§¨©ª«¬SHY­®¯U+00Bx°±²³´µ¶·¸¹º»¼½¾¿U+00CxÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏU+00DxÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßU+00ExàáâãäåæçèéêëìíîïU+00Fxðñòóôõö÷øùúûüýþÿ'
latin1 = txt.encode('latin1')
with open('example-latin1.txt', 'wb') as fid:
    fid.write(latin1)

# Read in the latin1 file
with open('example-latin1.txt', 'r', encoding='latin1') as fid:
    contents = fid.read()
assert contents == latin1.decode('latin1')  # sanity check

# Spit out a UTF8-encoded file (explicit, since the default encoding depends on the platform locale)
with open('converted-utf8.txt', 'w', encoding='utf8') as fid:
    fid.write(contents)
If you want the output to be something other than UTF8, change the encoding argument of open, e.g.,
with open('converted-utf_32.txt', 'w', encoding='utf_32') as fid:
    fid.write(contents)
The docs have a list of all supported codecs.
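A quick sanity check, reusing the names from the snippet above, if you want to convince yourself about what actually landed on disk:
# Read the raw bytes back and confirm they decode as UTF-8
with open('converted-utf8.txt', 'rb') as fid:
    raw = fid.read()
assert raw.decode('utf8') == contents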

How to make this Python (2.7) script work with unicode filenames?

I have the following script to process filenames with non-latin characters:
import os

filelst = []
allfile = os.listdir(os.getcwd())
for file in allfile:
    if os.path.isfile(file):
        filelst.append(file)

w = open(os.getcwd()+'\\_filelist.txt','w+')
for file in allfile:
    w.write(file)
    w.write("\n")
w.close()
filelist in my folder:
new 1.py
ああっ女神さまっ 小っちゃいって事は便利だねっ.1998.Ep0108.x264.AC3CalChi.avi
ああっ女神さまっ 小っちゃいって事は便利だねっ.1998.Ep0108.x264.AC3CalChi.srt
output in _filelist.txt:
new 1.py
???????? ??????????????.1998.Ep01-08.x264.AC3-CalChi.avi
???????? ??????????????.1998.Ep01-08.x264.AC3-CalChi.srt
You should get the list of files as Unicode strings instead by passing a Unicode file path to listdir. As you're using getcwd, use: os.getcwdu()
Then open your output file with a text encoding wrapper. The io module is the new way to do this (it handles universal newlines correctly).
Putting it all together:
import os
import io

filelst = []
allfile = os.listdir(os.getcwdu())
for file in allfile:
    if os.path.isfile(file):
        filelst.append(file)

w = io.open(os.getcwd()+'\\_filelist.txt', 'w+', encoding="utf-8")
for file in allfile:
    w.write(file)
    w.write(u"\n")  # io text streams expect unicode, not byte strings
w.close()
On Windows and OS X, this will just work, as filename translation is enforced. On Linux, a filename can be in any encoding (or none at all!). Therefore, ensure that whatever is creating your files (avi + srt) is using UTF-8, your terminal is set to UTF-8, and your locale is UTF-8.
You need to open your file with a proper encoding to write unicode to it. You can use the codecs module to open the file:
import codecs

with codecs.open(os.getcwd()+'\\_filelist.txt', 'w+', encoding='your-encoding') as w:
    for file in allfile:
        w.write(file + '\n')
You can use UTF-8, which is a universal encoding, or another suitable encoding depending on your data. Also note that instead of opening the file and closing it manually, you can use a with statement, which closes the file automatically at the end of the block.

Which encoding to use while reading Excel using xlrd

I am trying to read an Excel file using xlrd to write into txt files. Everything is being written fine except for some rows which have Spanish characters like 'Téd'. I can encode those using latin-1 encoding. However the code then fails for other rows which contain an en dash, unicode u'\u2013', which can't be encoded using latin-1. When using UTF-8 the en dashes are written out fine, but 'Téd' comes out as mojibake ('TÃ©d'), which is not acceptable. How do I correct this?
Code below :
#!/usr/bin/python
import xlrd
import csv
import sys

filePath = sys.argv[1]
with xlrd.open_workbook(filePath) as wb:
    shNames = wb.sheet_names()
    for shName in shNames:
        sh = wb.sheet_by_name(shName)
        csvFile = shName + ".csv"
        with open(csvFile, 'wb') as f:
            c = csv.writer(f)
            for row in range(sh.nrows):
                sh_row = []
                cell = ''
                for item in sh.row_values(row):
                    if isinstance(item, float):
                        cell = item
                    else:
                        cell = item.encode('utf-8')
                    sh_row.append(cell)
                    cell = ''
                c.writerow(sh_row)
        print shName + ".csv File Created"
Python's csv module doesn't support Unicode input.
You are correctly encoding your input before writing it -- so you don't need codecs. Just open(csvFile, "wb") (the b is important) and pass that object to the writer:
with open(csvFile, "wb") as f:
writer = csv.writer(f)
writer.writerow([entry.encode("utf-8") for entry in row])
Alternatively, unicodecsv is a drop-in replacement for csv that handles encoding.
You are getting 'Ã©' instead of 'é' because you are mistaking UTF-8 encoded text for latin-1. This is probably because you're encoding twice, once as .encode("utf-8") and once as codecs.open.
By the way, the right way to check the type of an xlrd cell is to compare cell.ctype against the xlrd type constants, e.g. cell.ctype == xlrd.XL_CELL_NUMBER.
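If you go the unicodecsv route mentioned above, a sketch of the whole loop might look like this, reusing the names from the question's code (treat the details as assumptions to verify rather than a tested drop-in):
import xlrd
import unicodecsv

with xlrd.open_workbook(filePath) as wb:
    for shName in wb.sheet_names():
        sh = wb.sheet_by_name(shName)
        with open(shName + ".csv", 'wb') as f:
            writer = unicodecsv.writer(f, encoding='utf-8')
            for rownum in range(sh.nrows):
                # unicodecsv encodes unicode cell values itself,
                # so no manual .encode('utf-8') is needed here
                writer.writerow(sh.row_values(rownum))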

Python: how to convert from Windows 1251 to Unicode?

I'm trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.
#!/usr/bin/env python

import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think that the file may be
    # encoded inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')
    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)
    # now start iterating in our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, then with the continue
            # keyword we will start again with the second encoding
            # from the tuple and so on.... until it succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue
    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)
    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()
How can I do that?
Thank you
import codecs
f = codecs.open(filename, 'r', 'cp1251')
u = f.read() # now the contents have been transformed to a Unicode string
out = codecs.open(output, 'w', 'utf-8')
out.write(u) # and now the contents have been output as UTF-8
Is this what you intend to do?
This is just a guess, since you didn't specify what you mean by "doesn't work".
If the file is being generated properly but appears to contain garbage characters, likely the application you're viewing it with does not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file - the 3 bytes 0xEF,0xBB,0xBF (unencoded).
If you use the codecs module to open the file, it will do the conversion to Unicode for you when you read from the file. E.g.:
import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)
This only makes sense if you're working with the file's data in Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you'll have to specify an actual encoding, since you can't write a file in "Unicode".
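If you do want the BOM mentioned above without writing the three bytes by hand, the 'utf-8-sig' codec prepends it automatically. A minimal sketch of the whole cp1251-to-UTF-8-with-BOM conversion (the filenames here are placeholders):
import codecs

src = codecs.open('input-cp1251.txt', 'r', 'cp1251')
text = src.read()      # decoded to a Unicode string
src.close()

dst = codecs.open('output-utf8.txt', 'w', 'utf-8-sig')
dst.write(text)        # written as UTF-8 with a leading BOM
dst.close()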
