I'm trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.
#!/usr/bin/env python
import os
import sys
import shutil

def convert_to_utf8(filename):
    # gather the encodings you think the file may be
    # encoded in inside a tuple
    encodings = ('windows-1253', 'iso-8859-7', 'macgreek')
    # try to open the file and exit if some IOError occurs
    try:
        f = open(filename, 'r').read()
    except Exception:
        sys.exit(1)
    # now start iterating over our encodings tuple and try to
    # decode the file
    for enc in encodings:
        try:
            # try to decode the file with the first encoding
            # from the tuple.
            # if it succeeds then it will reach break, so we
            # will be out of the loop (something we want on
            # success).
            # the data variable will hold our decoded text
            data = f.decode(enc)
            break
        except Exception:
            # if the first encoding fails, then the continue
            # keyword will start again with the second encoding
            # from the tuple, and so on... until it succeeds.
            # if for some reason it reaches the last encoding of
            # our tuple without success, then exit the program.
            if enc == encodings[-1]:
                sys.exit(1)
            continue
    # now get the absolute path of our filename and append .bak
    # to the end of it (for our backup file)
    fpath = os.path.abspath(filename)
    newfilename = fpath + '.bak'
    # and make our backup file with shutil
    shutil.copy(filename, newfilename)
    # and at last convert it to utf-8
    f = open(filename, 'w')
    try:
        f.write(data.encode('utf-8'))
    except Exception, e:
        print e
    finally:
        f.close()
How can I do that?
Thank you
import codecs
f = codecs.open(filename, 'r', 'cp1251')
u = f.read() # now the contents have been transformed to a Unicode string
out = codecs.open(output, 'w', 'utf-8')
out.write(u) # and now the contents have been output as UTF-8
Is this what you intend to do?
This is just a guess, since you didn't specify what you mean by "doesn't work".
If the file is being generated properly but appears to contain garbage characters, likely the application you're viewing it with does not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file - the 3 bytes 0xEF,0xBB,0xBF (unencoded).
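For example, a minimal sketch (assuming Python 2, with a placeholder filename): the 'utf-8-sig' codec emits those three bytes for you before the first write.
import codecs
# 'utf-8-sig' writes the BOM bytes 0xEF 0xBB 0xBF automatically
f = codecs.open('output.txt', 'w', 'utf-8-sig')
f.write(u'some text')
f.close()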
If you use the codecs module to open the file, it will do the conversion to Unicode for you when you read from the file. E.g.:
import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)
This only makes sense if you're working with the file's data in Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you'll have to specify an actual encoding, since you can't write a file in "Unicode".
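To actually convert a file on disk you would pick concrete encodings on both sides. A minimal sketch, assuming the source really is cp1251 (filenames are placeholders):
import codecs
# decode the cp1251 source into a unicode string
src = codecs.open('input.txt', 'r', 'cp1251')
text = src.read()
src.close()
# write it back out encoded as UTF-8
dst = codecs.open('output.txt', 'w', 'utf-8')
dst.write(text)
dst.close()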
I have a .sql file that I want to read into my Python session (Python 3.9). I'm opening it using the file context manager.
with open('file.sql', 'r') as f:
    text = f.read()
When I print the text, I still get the binary characters, i.e., \xff\xfe\r\x00\n\x00-\x00-..., etc.
I've tried all the arguments such as 'rb', encoding='utf-8', etc., but the results are still binary text. It should be noted that I've used this very same procedure many times before in my code and it has not been a problem.
Did something change in python 3.9?
The first two bytes \xff\xfe look like a BOM (Byte Order Mark), and the table on the Wikipedia BOM page shows that \xff\xfe can mean the encoding is UTF-16-LE.
So you could try:
with open('file.sql', 'r', encoding='utf-16-le') as f:
    text = f.read()
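Note that 'utf-16-le' decodes the BOM itself into a leading '\ufeff' character in the text. As a sketch-level alternative (same placeholder filename), the plain 'utf-16' codec reads the BOM, picks the byte order, and strips it:
with open('file.sql', 'r', encoding='utf-16') as f:
    text = f.read()  # no leading '\ufeff'; the codec consumed the BOM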
EDIT:
There is the module chardet, which you may also try using to detect the encoding.
import chardet

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

info = chardet.detect(data)
print(info['encoding'])

text = data.decode(info['encoding'])
Usually files don't have a BOM, but if they do, then you may try to detect it using the example from unicodebook.readthedocs.io/guess_encoding/check-for-bom-markers
from codecs import BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE

BOMS = (
    (BOM_UTF8, "UTF-8"),
    (BOM_UTF32_BE, "UTF-32-BE"),
    (BOM_UTF32_LE, "UTF-32-LE"),
    (BOM_UTF16_BE, "UTF-16-BE"),
    (BOM_UTF16_LE, "UTF-16-LE"),
)

def check_bom(data):
    return [encoding for bom, encoding in BOMS if data.startswith(bom)]

# ---------

with open('file.sql', 'rb') as f:  # read bytes
    data = f.read()

encoding = check_bom(data)
print(encoding)

if encoding:
    text = data.decode(encoding[0])
else:
    print('unknown encoding')
I'm trying to open a file using latin-1 encoding in order to produce a file with a different encoding. I get a NameError stating that unicode is not defined. Here is the piece of code I use to do this:
sourceEncoding = "latin-1"
targetEncoding = "utf-8"
source = open(r'C:\Users\chsafouane\Desktop\saf.txt')
target = open(r'C:\Users\chsafouane\Desktop\saf2.txt', "w")
target.write(unicode(source.read(), sourceEncoding).encode(targetEncoding))
I'm not used to handling files at all, so I don't know if there is a module I should import to use "unicode".
The fact that you see unicode not defined suggests that you're on Python 3. Here's a code snippet that'll generate a latin1-encoded file, then does what you want to do: slurp the latin1-encoded file and spit out a UTF-8-encoded file.
# Generate a latin1-encoded file
txt = u'U+00AxNBSP¡¢£¤¥¦§¨©ª«¬SHY®¯U+00Bx°±²³´µ¶·¸¹º»¼½¾¿U+00CxÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏU+00DxÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßU+00ExàáâãäåæçèéêëìíîïU+00Fxðñòóôõö÷øùúûüýþÿ'
latin1 = txt.encode('latin1')
with open('example-latin1.txt', 'wb') as fid:
    fid.write(latin1)

# Read in the latin1 file
with open('example-latin1.txt', 'r', encoding='latin1') as fid:
    contents = fid.read()
assert contents == latin1.decode('latin1')  # sanity check

# Spit out a UTF8-encoded file; specify the encoding explicitly
# rather than relying on the platform default
with open('converted-utf8.txt', 'w', encoding='utf8') as fid:
    fid.write(contents)
If you want the output to be something other than UTF-8, change the encoding argument to open, e.g.,
with open('converted-utf_32.txt', 'w', encoding='utf_32') as fid:
    fid.write(contents)
The docs have a list of all supported codecs.
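Applied to the question's files, a minimal Python 3 sketch (paths shortened to placeholders; it assumes the source really is latin-1):
# read as latin-1, write back out as UTF-8
with open('saf.txt', encoding='latin-1') as source, \
     open('saf2.txt', 'w', encoding='utf-8') as target:
    target.write(source.read())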
I am working with a twitter streaming package for python. I am currently using a keyword that is written in unicode to search for tweets containing that word. I am then using python to create a database csv file of the tweets. However, I want to convert the tweets back to Arabic symbols when I save them in the csv.
The errors I am receiving are all similar to "error on_data: the ASCII characters in position ___ are not within the range of 128."
Here is my code:
class listener(StreamListener):

    def on_data(self, data):
        try:
            #print data
            tweet = (str((data.split(',"text":"')[1].split('","source')[0]))).encode('utf-8')
            now = datetime.now()
            tweetsymbols = tweet.encode('utf-8')
            print tweetsymbols
            saveThis = str(now) + ':::' + tweetsymbols.decode('utf-8')
            saveFile = open('rawtwitterdata.csv','a')
            saveFile.write(saveThis)
            saveFile.write('\n')
            saveFile.close()
            return True
Excel requires a Unicode BOM character written to the beginning of a UTF-8 file to view it properly. Without it, Excel assumes "ANSI" encoding, which is OS locale-dependent.
This writes a 3-row, 3-column CSV file with Arabic:
#!python2
#coding:utf8
import io

with io.open('arabic.csv','w',encoding='utf-8-sig') as f:
    s = u'إعلان يونيو وبالرغم تم. المتحدة'
    s = u','.join([s,s,s]) + u'\n'
    f.write(s)
    f.write(s)
    f.write(s)
For your specific example, just make sure to write a BOM character u'\ufeff' as the first character of your file, encoded in UTF-8. In the example above, the 'utf-8-sig' codec ensures a BOM is written.
Also consult this answer, which shows how to wrap the csv module to support Unicode, or get the third party unicodecsv module.
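For instance, a minimal sketch with the third-party unicodecsv package (pip install unicodecsv; the row values here are placeholders, not the asker's actual pipeline):
import unicodecsv
# unicodecsv accepts unicode cell values and encodes each row on write
f = open('rawtwitterdata.csv', 'ab')
writer = unicodecsv.writer(f, encoding='utf-8')
writer.writerow([u'2016-01-01 12:00:00', u'إعلان يونيو وبالرغم تم. المتحدة'])
f.close()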
Here is a snippet to write Arabic text:
# coding=utf-8
import codecs
from datetime import datetime

class listener(object):

    def on_data(self, tweetsymbols):
        # python2
        # tweetsymbols is str
        # tweet = (str((data.split(',"text":"')[1].split('","source')[0]))).encode('utf-8')
        now = datetime.now()
        # work with unicode
        saveThis = unicode(now) + ':::' + tweetsymbols.decode('utf-8')
        try:
            saveFile = codecs.open('rawtwitterdata.csv', 'a', encoding="utf8")
            saveFile.write(saveThis)
            saveFile.write('\n')
        finally:
            saveFile.close()
        return self

listener().on_data("إعلان يونيو وبالرغم تم. المتحدة")
All you need to know about encodings: https://pythonhosted.org/kitchen/unicode-frustrations.html
I have the following script to process filenames with non-latin characters:
import os

filelst = []
allfile = os.listdir(os.getcwd())
for file in allfile:
    if os.path.isfile(file):
        filelst.append(file)

w = open(os.getcwd()+'\\_filelist.txt','w+')
for file in allfile:
    w.write(file)
    w.write("\n")
w.close()
filelist in my folder:
new 1.py
ああっ女神さまっ 小っちゃいって事は便利だねっ.1998.Ep01-08.x264.AC3-CalChi.avi
ああっ女神さまっ 小っちゃいって事は便利だねっ.1998.Ep01-08.x264.AC3-CalChi.srt
output in _filelist.txt:
new 1.py
???????? ??????????????.1998.Ep01-08.x264.AC3-CalChi.avi
???????? ??????????????.1998.Ep01-08.x264.AC3-CalChi.srt
You should get the list of files as Unicode strings instead, by passing a Unicode file path to listdir. As you're using getcwd, use os.getcwdu().
Then open your output file with a text-encoding wrapper. The io module is the new way to do this (it handles universal newlines correctly).
Putting it all together:
import os
import io

filelst = []
allfile = os.listdir(os.getcwdu())
for file in allfile:
    if os.path.isfile(file):
        filelst.append(file)

w = io.open(os.getcwd()+'\\_filelist.txt','w+', encoding="utf-8")
for file in allfile:
    w.write(file)
    w.write("\n")
w.close()
In Windows and OS X, this will just work, as filename translation is enforced. In Linux, a filename can be in any encoding (or none at all!). Therefore, ensure that whatever is creating your files (avi + srt) is using UTF-8, your terminal is set to UTF-8, and your locale is UTF-8.
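A small sketch of the difference (Python 2; the directory is a placeholder): the type of the names listdir returns follows the type of the path you pass in.
import os
# a byte-string path yields byte-string names
print type(os.listdir('.')[0])    # <type 'str'>
# a unicode path yields unicode names (where they can be decoded)
print type(os.listdir(u'.')[0])   # <type 'unicode'>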
You need to open your file with a proper encoding to write unicode into it. You can use the codecs module for opening the file:
import codecs

with codecs.open(os.getcwd()+'\\_filelist.txt','w+',encoding='your-encoding') as w:
    for file in allfile:
        w.write(file + '\n')
You can use UTF-8 as your encoding, which is a universal encoding, or another proper encoding based on your unicode type. Also note that instead of opening the file and closing it manually, you can use the with statement, which will close the file automatically at the end of the block.
I've been working on a Python script to open a file with a Unicode name (mostly Japanese) and save it to a randomly generated (non-Unicode) filename in Windows Vista 64-bit, and I'm having issues... It just doesn't work. It works fine with non-Unicode filenames (even if they have Unicode content), but the second you try to pass a Unicode filename in, it doesn't work.
Here's the code:
try:
    import sys, os
    inpath = sys.argv[1]
    outpath = sys.argv[2]
    filein = open(inpath, "rb")
    contents = filein.read()
    fileSave = open(outpath, "wb")
    fileSave.write(contents)
    fileSave.close()
    testfile = open(outpath + '.test', 'wb')
    testfile.write(inpath)
    testfile.close()
except:
    errlog = open('G:\\log.txt', 'w')
    errlog.write(str(sys.exc_info()))
    errlog.close()
And the error:
(<type 'exceptions.IOError'>, IOError(2, 'No such file or directory'), <traceback object at 0x01092A30>)
You have to convert your inpath to unicode. On Windows, argv is in the ANSI code page rather than UTF-8, so decode with the filesystem encoding, like this:
inpath = sys.argv[1]
inpath = inpath.decode(sys.getfilesystemencoding())
filein = open(inpath, "rb")
I'm guessing you are using Python 2.6, because in Python 3, all strings are unicode by default, so this problem wouldn't happen.
My guess is that sys.argv[1] and sys.argv[2] are just byte strings and don't natively support Unicode. You could confirm this by printing them and seeing if they are the characters you expect. You should also print type(sys.argv[1]) to make sure they are of the correct type.
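A quick diagnostic sketch (Python 2, purely illustrative): repr() makes mis-encoded bytes visible as \x.. escapes instead of mojibake.
import sys
# print each argument's position, type, and raw repr
for i, arg in enumerate(sys.argv[1:], 1):
    print i, type(arg), repr(arg)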
Where do the command-line parameters come from? Do they come from another program or are you typing them on the command-line? If they come from another program, you could have the other program encode them to UTF-8 and then have your Python program decode them from UTF-8.
Which version of Python are you using?
Edit: here's a robust solution: http://code.activestate.com/recipes/572200/