I am trying to convert a file that contains some Unicode characters and replace them with normal (ASCII) characters. I am facing a problem with that and get the following error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data
My file looks like this:
ecteDV
ecteBl
agnéto
The code to replace accents is shown below:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re, sys, unicodedata, codecs
f = codecs.open(sys.argv[1], 'r', 'utf-8')
for line in f:
    name = line.lower().strip()
    normal = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore')
    print normal
f.close()
Is there a way I can replace all the accents and normalize the contents of the file?
Consider that your file is perhaps not using UTF-8 as the encoding.
You are reading the file with the UTF-8 codec but decoding fails. Check that your file encoding is really UTF-8.
Note that UTF-8 is one encoding out of many; it doesn't mean 'decode magically to Unicode'.
If you don't yet understand what encodings are (as opposed to what Unicode is, a related but separate concept), you need to do some reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
Try opening the file with the following code, replacing "filename" with your file's name (e.g. sys.argv[1]):
import re, sys, unicodedata, codecs

with codecs.open("filename", 'r', 'utf-8') as f:
    for line in f:
        # do something with each decoded line
        pass
# no explicit f.close() is needed; the with statement closes the file
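If you are not sure which encoding the file actually uses, one rough way to check is the third-party chardet package. This is a minimal sketch, not a guaranteed answer: chardet only guesses, and its guess can be wrong.

import sys
import chardet  # third-party: pip install chardet

with open(sys.argv[1], 'rb') as f:
    raw = f.read()

# detect() returns a guess plus a confidence score,
# e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
print chardet.detect(raw)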
Related
I call open(file, "r") and read some lines in Python. This gives me:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)
If I add 'utf-8', I get:
'utf8' codec can't decode bytes in position 28-29: invalid continuation byte
If I add 'ISO-8859-1', I get no errors but a line is read like this:
2890 ready to try Argh� Fantasy Surfer Carnage� Dane, Marlon & Nat C all out! #fantasysurfer
As you can see, there are some extra characters, which probably come from emojis or something (these are tweets).
What is the best approach to clean these lines up?
I would like to remove all the extraneous elements... I would like the strings to have only numbers, letters, and common symbols ?!>.;, etc...
Note: I don't care about the html entities, since I replace those in another function. I am talking about the weird Argh� Carnage� elements.
In general, these errors are caused by the encoding.
First, ensure that you specified the right encoding in the first line of the Python file:
# -*- coding: utf-8 -*-
Second, you can use the codecs library, specifying the desired encoding:
import codecs
fich_in = codecs.open(filename, 'r', encoding='utf-8')
Third, you can ignore all the wrong characters by passing the 'ignore' error handler when decoding:
TEXT.decode('utf-8', 'ignore')
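Putting it together for the tweets, here is a minimal sketch that strips everything non-ASCII, assuming the file is actually UTF-8 on disk (the filename 'tweets.txt' is a placeholder):

import codecs

with codecs.open('tweets.txt', 'r', encoding='utf-8', errors='ignore') as f:
    for line in f:
        # encode to ASCII, dropping emoji and any other non-ASCII characters
        cleaned = line.encode('ascii', 'ignore')
        print cleaned.strip()

This keeps numbers, letters, and common punctuation and silently discards the rest, so use it only if losing those characters is acceptable.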
Try to decode first and then encode. Note that decode is called on a byte string, not a u"..." literal; in Python 2, calling .decode() on a unicode string implicitly encodes it as ASCII first, which raises exactly these errors:
"text".decode('latin-1').encode('utf-8')
Or try opening the file with codecs:
import codecs
with codecs.open('file', encoding='your-encoding') as f:
    text = f.read()
Your problem is either opening the file with the wrong encoding, or incorrectly identifying the character encoding.
Also, if your text is plain ASCII, you can decode it with either:
'abc'.decode('ascii')
or
unicode('abc', 'ascii')
I have the text "confrères" in a text file with encoded format "ISO-8859-2". I want to encode this value in "UTF-8" in Python.
I used the following code in Python 2.7 to convert it, but the converted value ["confrčres"] is different from the original value ["confrères"].
# -*- coding: utf-8 -*-
import chardet
import codecs
a1=codecs.open('.../test.txt', 'r')
a=a1.read()
b = a.decode(chardet.detect(a)['encoding']).encode('utf8')
a1=codecs.open('.../test_out.txt', 'w').write(b)
Any idea how to get the actual value, UTF-8 encoded, in the output file?
Thanks
If you know the codec used, don't use chardet. Character detection is never foolproof, the library guessed wrong for your file.
Note that ISO-8859-2 is the wrong codec, as that codec cannot even encode the letter è. You have ISO-8859-1 (Latin-1) or Windows codepage 1252 data instead; è in 8859-1 and cp1252 is encoded to 0xE8, and 0xE8 in 8859-2 is č:
>>> print u'confrčres'.encode('iso-8859-2').decode('iso-8859-1')
confrères
Was 8859-2 perhaps the guess chardet made?
You can use the io library to handle decoding and encoding on the fly; it is the same codebase that handles all I/O in Python 3 and has fewer issues than codecs:
import io
from shutil import copyfileobj

with io.open('test.txt', 'r', encoding='iso-8859-1') as inf:
    with io.open('test_out.txt', 'w', encoding='utf8') as outf:
        copyfileobj(inf, outf)
I used shutil.copyfileobj() to handle copying the data across.
Using Python 3.4.2, I want to get a part of a website. According to the meta tags, that website is encoded with iso-8859-1. And I want to write one part (along with other parts) to a CSV file.
However, this part contains an undefined character with the hex value 0x8b. In order to preserve the part as well as possible, I want to write it as-is into the CSV file. However, Python doesn't let me do it.
Here's a minimal example:
import urllib.request
import urllib.parse
import csv
if __name__ == "__main__":
    with open("bytewrite.csv", "w", newline="") as csvfile:
        a = b'\x8b'  # byte literal by urllib.request
        b = a.decode("iso-8859-1")
        w = csv.writer(csvfile)
        w.writerow([b])
And this is the output:
Traceback (most recent call last):
File "D:\Eigene\Dateien\Code\Python\writebyte.py", line 12, in <module>
w.writerow([b])
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 0: character maps to <undefined>
Eventually, I did it manually. It was just copy and paste with Notepad++, and according to a hex editor the value was inserted correctly. But how can I do it with Python 3? Why does Python even care what 0x8b stands for, instead of just writing it to the file?
It further irritates me that according to iso8859_1.py (and also cp1252.py) in C:\Python34\lib\encodings\ the lookup table seems to not interfere:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK
Quoted from csv docs:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv

with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
What is happening is you've decoded to Unicode from iso-8859-1, but getpreferredencoding() returns cp1252 and the Unicode character \x8b is not supported in that encoding.
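You can reproduce the failure directly. A quick sketch (the preferred encoding depends on your system; cp1252 is typical on Western-locale Windows):

import locale

print(locale.getpreferredencoding())  # e.g. 'cp1252' on many Windows systems
'\x8b'.encode('cp1252')               # raises UnicodeEncodeError: character maps to <undefined>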
Corrected minimal example:
import csv

with open('bytewrite.csv', 'w', encoding='iso-8859-1', newline='') as csvfile:
    a = b'\x8b'
    b = a.decode("iso-8859-1")
    w = csv.writer(csvfile)
    w.writerow([b])
Your interpretation of the lookup tables in encodings is not correct. The code you've listed:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK
This tells you two things:
How to map the unicode character '\x8b' to bytes in iso8859-1: it's just a control character.
How to map the unicode character '\u2039' to bytes in cp1252: it's a piece of punctuation, ‹.
This does not tell you how to map the unicode character '\x8b' to bytes in cp1252, which is what you're trying to do.
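Both mappings are easy to check at the interactive prompt (Python 3):

>>> b'\x8b'.decode('iso8859-1')  # byte 0x8B decodes to the control character U+008B
'\x8b'
>>> b'\x8b'.decode('cp1252')     # the same byte decodes to U+2039 in cp1252
'‹'
>>> '\u2039'.encode('cp1252')    # and U+2039 encodes back to byte 0x8B
b'\x8b'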
The root of the problem is that "\x8b" is not a valid iso8859-1 character. Look at the table here:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout
8b is undefined, so it just decodes as a control character. After it's decoded and we're in unicode land, what is 0x8b? This is a little tricky to find out, but it's defined in the unicode database here:
008B;<control>;Cc;0;BN;;;;;N;PARTIAL LINE FORWARD;;;;
Now, does CP1252 have this control character, "PARTIAL LINE FORWARD"?
http://en.wikipedia.org/wiki/Windows-1252#Code_page_layout
No, it does not. So you get an error when trying to encode it in CP1252.
Unfortunately there's no good solution for this. Some ideas:
Guess what encoding the page actually is. It's probably CP1252, not ISO-8859-1, but who knows. It could even contain a mix of encodings, or incorrectly encoded data (mojibake). You can use chardet to guess the encoding, or force this URL to use CP1252 in your program (overriding what the meta tag says), or you could try a series of codecs and take the first one that decodes & encodes successfully.
Fix up the input text or the decoded unicode string using some kind of mapping of problematic characters like this. This will work most of the time, but will fail silently or do something weird if you're trying to "fix up" data where it doesn't make sense.
Do not try to convert from ISO-8859-1 to CP1252, as the two aren't fully compatible with each other. Using UTF-8 for the output would work better, since UTF-8 can encode any Unicode character.
Use an encoding error handler. See this table for a list of handlers. Using xmlcharrefreplace and backslashreplace will preserve the information (but then require you to do extra steps when decoding), while replace and ignore will silently skip over the bad character.
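A quick sketch of what the different error handlers do with this character (Python 3):

s = '\x8b'  # U+008B PARTIAL LINE FORWARD, not encodable in cp1252

print(s.encode('cp1252', 'xmlcharrefreplace'))  # b'&#139;'  -- reversible
print(s.encode('cp1252', 'backslashreplace'))   # b'\\x8b'   -- reversible
print(s.encode('cp1252', 'replace'))            # b'?'       -- lossy
print(s.encode('cp1252', 'ignore'))             # b''        -- lossy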
These types of issues caused by older encodings are really hard to solve, and there is no perfect solution. This is the reason why unicode was invented.
I want to open files depending on the encoding format, therefore I do the following:
import magic
import csv
import codecs  # needed for codecs.open below

i_file = open(filename).read()
mag = magic.Magic(mime_encoding=True)
encoding = mag.from_buffer(i_file)
print "The encoding is", encoding
Once I know the encoding format, I try to open the file using the right one:
with codecs.open(filename, "rb", encoding) as f_obj:
    reader = csv.reader(f_obj)
    for row in reader:
        csvlist.append(row)
However, I get the next error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
when trying to open a CSV file whose encoding is:
The encoding is utf-16le
The funny part comes here: if utf-16le is replaced by utf-16, the UTF-16LE CSV file is read properly, but then plain ASCII CSV files are no longer read correctly.
What am I doing wrong?
Python 2's csv module doesn't support Unicode. Is switching to Python 3 an option? If not, can you convert the input file to UTF-8 first?
From the docs linked above:
The csv module doesn't directly support reading and writing Unicode, but it is 8-bit-clean save (sic!) for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
Quick and dirty example:
with codecs.open(filename, "rb", encoding) as f_obj:
    with codecs.open(filename + "u8", "wb", "utf-8") as utf8:
        utf8.write(f_obj.read())

with codecs.open(filename + "u8", "rb", "utf-8") as f_obj:
    reader = csv.reader(f_obj)
    # etc.
This may be a bit useful to you. Check out the Python 2 documentation:
https://docs.python.org/2/library/csv.html
Especially this section:
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8.
Look at the examples at the bottom of that page!
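For reference, this is roughly the UnicodeReader recipe from those docs, condensed: it re-encodes each input line to UTF-8 so the csv module can parse it, then decodes the parsed fields back to unicode.

import csv
import codecs

class UTF8Recoder:
    # Iterator that reads an encoded stream and re-encodes the input to UTF-8.
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    # A CSV reader which iterates over lines in the CSV file "f",
    # which is encoded in the given encoding.
    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

Because the recoder hands the csv module plain UTF-8 byte strings, this approach works even for NUL-containing encodings like UTF-16.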
I read in some data from a Danish text file, but I can't seem to find a way to decode it.
The original text is "dør" but in the raw text file its stored as "d√∏r"
So I tried the obvious:
InputData = "d√∏r"
print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can I decode this text so the printed message is "dør"?
C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this will give you the text as unicode objects; if everything else in your program is str then you're going to run into compatibility issues. Either convert everything to unicode or (probably easier) re-encode the file into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break if anything is outside the Latin-1 codepage. It might be easier to convert your program to use str in utf-8 encoding internally.
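For completeness, a minimal sketch of the stdout fix mentioned above, assuming your terminal can display UTF-8:

import sys
import codecs

# wrap stdout so unicode objects are encoded as UTF-8 on output
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

print u'd\xf8r'  # prints: dør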