broken CJK data when reading ISO-8859-1 file in python

broken CJK data when reading ISO-8859-1 file in python - python

I'm parsing some file that is ISO-8859-1 and has Chinese, Japanese, Korean characters in it.
import os
from os import listdir
cnt = 0
base_path = 'data/'
cwd = os.path.abspath(os.getcwd())
for f in os.listdir(base_path):
path = cwd + '/' + base_path + f
cnt = 0
with open(path, 'r', encoding='ISO-8859-1') as file:
for line in file:
print('line {}: {}'.format(cnt, line))
cnt +=1
The code runs but it prints broken characters. Other stackoverflow questions suggest I use encode and decode. For example, for Korean texts, I tried
file.read().encode('latin1').decode('euc-kr'), but that didn't do anything. I also tried to convert the files into utf-8 using iconv but the characters are still broken in the converted text file.
Any suggestions would be much appreciated.

Sorry, no. ISO-8859-1 cannot have any Chinese, Japanese, nor Korean characters in it. The code page doesn't support them at the first place.
What you did in the code is to ask Python to assume the file is in ISO-8859-1 encoding and return characters in Unicode (which is how strings are built). If you do not specify the encoding parameter in open(), the default would be assuming UTF-8 encoding use in the file and still return in Unicode, i.e. logical characters without any encoding specified.
Now the question is how are those CJK characters encoded in the file. If you know the answer, you can just put the right encoding parameter in open() and it works right away. Let's say it is EUC-KR as you mentioned, the code should be:
with open(path, 'r', encoding='euc-kr') as file:
for line in file:
print('line {}: {}'.format(cnt, line))
cnt +=1
If you feel frustrated, please take a look at chardet. It should help you detect the encoding from text. Example:
import chardet
with open(path, 'rb') as file:
rawdata = file.read()
guess = chardet.detect(rawdata) # e.g. {'encoding': 'EUC-KR', 'confidence': 0.99}
text = guess.decode(guess['encoding'])
cnt = 0
for line in text.splitlines():
print('line {}: {}'.format(cnt, line))
cnt +=1

Related

How to read binary file to text in python 3.9

I have a .sql files that I want to read into my python session (python 3.9). I'm opening using the file context manager.
with open('file.sql', 'r') as f:
text = f.read()
When I print the text, I still get the binary characters, i.e., \xff\xfe\r\x00\n\x00-\x00-..., etc.
I've tried all the arguments such as 'rb', encoding='utf-8, etc., but the results are still binary text. It should be noted that I've used this very same procedure many times over in my code before and this has not been a problem.
Did something change in python 3.9?

First two bytes \xff\xfe looks like BOM (Byte Order Mark)
and table at Wikipedia page BOM shows that \xff\xfe can means encoding UTF-16-LE
So you could try
with open('file.sql', 'r', encoding='utf-16-le') as f:
EDIT:
There is module chardet which you may also try to use to detect encoding.
import chardet
with open('file.sql', 'rb') as f: # read bytes
data = f.read()
info = chardet.detect(data)
print(info['encoding'])
text = data.decode(info['encoding'])
Usually files don't have BOM but if they have then you may try to detect it using example from unicodebook.readthedocs.io/guess_encoding/check-for-bom-markers
from codecs import BOM_UTF8, BOM_UTF16_BE, BOM_UTF16_LE, BOM_UTF32_BE, BOM_UTF32_LE
BOMS = (
(BOM_UTF8, "UTF-8"),
(BOM_UTF32_BE, "UTF-32-BE"),
(BOM_UTF32_LE, "UTF-32-LE"),
(BOM_UTF16_BE, "UTF-16-BE"),
(BOM_UTF16_LE, "UTF-16-LE"),
)
def check_bom(data):
return [encoding for bom, encoding in BOMS if data.startswith(bom)]
# ---------
with open('file.sql', 'rb') as f: # read bytes
data = f.read()
encoding = check_bom(data)
print(encoding)
if encoding:
text = data.decode(encoding[0])
else:
print('unknown encoding')

Python - remove spaces on the right side

I've text files in folder, but files have data as below:
I don't know how I can remove spaces from right side between CRLF. These are spaces:
33/22-BBB<there is a space to remove>CRLF
import os
root_path = "C:/Users/adm/Desktop/test"
if os.path.exists(root_path):
files = []
for name in os.listdir(root_path):
if os.path.isfile(os.path.join(root_path, name)):
files.append(os.path.join(root_path, name))
for ii in files:
with open(ii) as file:
for line in file:
line = line.rstrip()
if line:
print(line)
file.close()
Does anyone have any idea how to get rid of this?

Those are control characters, change your open() command to:
with open(ii, "r", errors = "ignore") as file:
or
# Bytes mode
with open(ii, "rb") as file:
or
# '\r\n' is CR LF. See link at bottom
with open(ii, "r", newline='\r\n') as file:
Control characters in ASCII

If it is the CRLF characters you would like to remove from each string, you could use line.replace('CRLF', '').

Convert a bunch of files from guessed encoding to UTF-8

I have this Python script that attempts to detect the character encoding of a text file (in this case, C# .cs source files, but they could be any text file) and then convert them from that character encoding and into UTF-8 (without BOM).
While chardet detects the encoding well enough and the script runs without errors, characters like © are encoded into $. So I assume there's something wrong with the script and my understanding of encoding in Python 2. Since converting files from UTF-8-SIG to UTF-8 works, I have a feeling that the problem is the decoding (reading) part and not the encoding (writing) part.
Can anyone tell me what I'm doing wrong? If switching to Python 3 is a solution, I'm all for it, I then just need help figuring out how to convert the script from running on version 2.7 to 3.4. Here's the script:
import os
import glob
import fnmatch
import codecs
from chardet.universaldetector import UniversalDetector
# from http://farmdev.com/talks/unicode/
def to_unicode_or_bust(obj, encoding='utf-8'):
if isinstance(obj, basestring):
if not isinstance(obj, unicode):
obj = unicode(obj, encoding)
return obj
def enforce_unicode():
detector = UniversalDetector()
for root, dirnames, filenames in os.walk('.'):
for filename in fnmatch.filter(filenames, '*.cs'):
detector.reset()
filepath = os.path.join(root, filename)
with open(filepath, 'r') as f:
for line in f:
detector.feed(line)
if detector.done: break
detector.close()
encoding = detector.result['encoding']
if encoding and not encoding == 'UTF-8':
print '%s -> UTF-8 %s' % (encoding.ljust(12), filepath)
with codecs.open(filepath, 'r', encoding=encoding) as f:
content = ''.join(f.readlines())
content = to_unicode_or_bust(content)
with codecs.open(filepath, 'w', encoding='utf-8') as f:
f.write(content)
enforce_unicode()
I have tried to do content = content.decode(encoding).encode('utf-8') before writing the file, but that fails with the following error:
/usr/local/.../lib/python2.7/encodings/utf_8_sig.py:19: UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
if input[:3] == codecs.BOM_UTF8:
Traceback (most recent call last):
File "./enforce-unicode.py", line 48, in <module>
enforce_unicode()
File "./enforce-unicode.py", line 43, in enforce_unicode
content = content.decode(encoding).encode('utf-8')
File "/usr/local/.../lib/python2.7/encodings/utf_8_sig.py", line 22, in decode
(output, consumed) = codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 87: ordinal not in range(128)
Ideas?

chardet simply got the detected codec it wrong, your code is otherwise correct. Character detection is based on statistics, heuristics and plain guesses, it is not a foolproof method.
For example, the Windows 1252 codepage is very close to the Latin-1 codec; files encoded with the one encoding can be decoded without error in the other encoding. Detecting the difference between a control code in the one or a Euro symbol in the other usually takes a human being looking at the result.
I'd record the chardet guesses for each file, if the file turns out to be wrongly re-coded, you need to look at what other codecs could be close. All of the 1250-series codepages look a lot alike.

Why does printing to a utf-8 file fail?

So I ran into a problem this afternoon, I was able to solve it, but I don't quite understand why it worked.
this is related to a problem I had the other week: python check if utf-8 string is uppercase
basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree
outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
words = ( u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
u'R\xc9SUM\xc9', # RESUME with accents
u'R\xe9sum\xe9', # Resume with accents
u'R\xe9SUM\xe9', ) # ReSUMe with accents
for word in words:
print word
if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8
title = etree.SubElement(sect,'title')
title.text = word
else:
item = etree.SubElement(sect,'item')
item.text = word
print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
it fails with the following:
Traceback (most recent call last):
File "./temp.py", line 25, in
print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
File "/usr/lib/python2.7/codecs.py",
line 691, in write
return self.writer.write(data) File "/usr/lib/python2.7/codecs.py",
line 351, in write
data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec
can't decode byte 0xd0 in position 66:
ordinal not in range(128)
but if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use
outFile = open('test.xml', 'w') it works perfectly.
So whats happening??
since encoding='utf-8' is specified in etree.tostring() is it encoding the file again?
if I leave codecs.open() and remove encoding='utf-8' the file then becomes an ascii file. Why? becuase etree.tostring() has a default encoding of ascii I persume?
but etree.tostring() is simply being written to stdout, and is then redirect to a file that was created as a utf-8 file??
is print>> not workings as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? what is going on here. It might be trivial, but I am obviously a bit confused and have a desire to figure out why my solution works,

You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.

Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you using a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
import codecs
from xml.etree import ElementTree as etree
root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')
words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']
for word in words:
print word
if word.isupper():
title = etree.SubElement(sect,u'title')
title.text = word
else:
item = etree.SubElement(sect,u'item')
item.text = word
tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')

In addition to MRABs answer some lines of code:
import codecs
from lxml import etree
root = etree.Element('root')
sect = etree.SubElement(root,'sect')
# do some other xml building here
with codecs.open('test.xml', 'w', encoding='utf-8') as f:
f.write(etree.tostring(root, encoding=unicode))

Python: how to convert from Windows 1251 to Unicode?

I'm trying to convert file content from Windows-1251 (Cyrillic) to Unicode with Python. I found this function, but it doesn't work.
#!/usr/bin/env python
import os
import sys
import shutil
def convert_to_utf8(filename):
# gather the encodings you think that the file may be
# encoded inside a tuple
encodings = ('windows-1253', 'iso-8859-7', 'macgreek')
# try to open the file and exit if some IOError occurs
try:
f = open(filename, 'r').read()
except Exception:
sys.exit(1)
# now start iterating in our encodings tuple and try to
# decode the file
for enc in encodings:
try:
# try to decode the file with the first encoding
# from the tuple.
# if it succeeds then it will reach break, so we
# will be out of the loop (something we want on
# success).
# the data variable will hold our decoded text
data = f.decode(enc)
break
except Exception:
# if the first encoding fail, then with the continue
# keyword will start again with the second encoding
# from the tuple an so on.... until it succeeds.
# if for some reason it reaches the last encoding of
# our tuple without success, then exit the program.
if enc == encodings[-1]:
sys.exit(1)
continue
# now get the absolute path of our filename and append .bak
# to the end of it (for our backup file)
fpath = os.path.abspath(filename)
newfilename = fpath + '.bak'
# and make our backup file with shutil
shutil.copy(filename, newfilename)
# and at last convert it to utf-8
f = open(filename, 'w')
try:
f.write(data.encode('utf-8'))
except Exception, e:
print e
finally:
f.close()
How can I do that?
Thank you

import codecs
f = codecs.open(filename, 'r', 'cp1251')
u = f.read() # now the contents have been transformed to a Unicode string
out = codecs.open(output, 'w', 'utf-8')
out.write(u) # and now the contents have been output as UTF-8
Is this what you intend to do?

This is just a guess, since you didn't specify what you mean by "doesn't work".
If the file is being generated properly but appears to contain garbage characters, likely the application you're viewing it with does not recognize that it contains UTF-8. You need to add a BOM to the beginning of the file - the 3 bytes 0xEF,0xBB,0xBF (unencoded).

If you use the codecs module to open the file, it will do the conversion to Unicode for you when you read from the file. E.g.:
import codecs
f = codecs.open('input.txt', encoding='cp1251')
assert isinstance(f.read(), unicode)
This only makes sense if you're working with the file's data in Python. If you're trying to convert a file from one encoding to another on the filesystem (which is what the script you posted tries to do), you'll have to specify an actual encoding, since you can't write a file in "Unicode".

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

broken CJK data when reading ISO-8859-1 file in python - python

Related

How to read binary file to text in python 3.9

Python - remove spaces on the right side

Convert a bunch of files from guessed encoding to UTF-8

Why does printing to a utf-8 file fail?

Python: how to convert from Windows 1251 to Unicode?

Categories

Resources