Encoding Decoding Python

I have the text "confrères" in a text file whose encoding is reported as "ISO-8859-2". I want to re-encode this value as "UTF-8" in Python.
I used the following code in Python 2.7 to convert it, but the converted value ["confrčres"] is different from the original value ["confrères"].
# -*- coding: utf-8 -*-
import chardet
import codecs
a1 = codecs.open('.../test.txt', 'r')
a = a1.read()
b = a.decode(chardet.detect(a)['encoding']).encode('utf8')
a1 = codecs.open('.../test_out.txt', 'w').write(b)
Any idea how to get the actual value, but UTF-8 encoded, into the output file?
Thanks

If you know the codec used, don't use chardet. Character detection is never foolproof; the library guessed wrong for your file.
Note that ISO-8859-2 is the wrong codec, as that codec cannot even encode the letter è. You have ISO-8859-1 (Latin-1) or Windows codepage 1252 data instead; è in 8859-1 and cp1252 is encoded to 0xE8, and 0xE8 in 8859-2 is č:
>>> print u'confrčres'.encode('iso-8859-2').decode('iso-8859-1')
confrères
Was 8859-2 perhaps the guess chardet made?
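To see what chardet actually guessed, run it on the raw bytes yourself; a minimal sketch (the reported encoding and confidence depend on chardet's heuristics):
import chardet

with open('test.txt', 'rb') as f:   # raw bytes, no decoding
    raw = f.read()
print chardet.detect(raw)           # a dict like {'encoding': ..., 'confidence': ...}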
You can use the io library to handle decoding and encoding on the fly; it is the same codebase that handles all I/O in Python 3 and has fewer issues than codecs:
import io
from shutil import copyfileobj

with io.open('test.txt', 'r', encoding='iso-8859-1') as inf:
    with io.open('test_out.txt', 'w', encoding='utf8') as outf:
        copyfileobj(inf, outf)
I used shutil.copyfileobj() to handle the copying across of data.

Related

From ansi encoding to utf8 (and hex bytes)

I have some texts encoded in ansi windows codepage. It is known which codepage it is.
The data is stored in text files.
I would like to do the following:
convert them to UTF-8
print the resulting UTF-8 as bytes
I did read the Python encoding guide, but I could not get the answer.
So, take the minimum example here:
import codecs
chinaAnsi = '\xCE\xD2' # 我 in chinese GBK CJK Unified Ideograph-6211
# 0xE6 0x88 0x91 in UTF8
print(chinaAnsi.encode('utf-8').decode('utf-8'))
# results in b'\xc3\x8e\xc3\x92' or ÎÒ
# which is meaningless.
# --> utf-8 representation of \xCE\xD2 in LATIN-1 (windows cp1252)
As can be seen above, the cross-coding goes through my machine's codepage, Windows cp1252, except that my input is in codepage 936.
So how do I deal with ANSI input that is not in my own codepage?
My final desired output from the minimal example would be the string in utf-8 followed by the utf-8 bytes.
我;e68891
The conversion of the string would mimic iconv -f cp936 -t utf-8 theInput > theOutput
Assuming the known codepage is cp936, read the files using that encoding and write them back out as UTF-8. Example:
with open('input.txt', encoding='cp936') as fin:
    data = fin.read()

with open('output.txt', 'w', encoding='utf8') as fout:
    fout.write(data)
Starting with your example bytes:
>>> x = b'\xce\xd2'.decode('cp936') # Gives '我'
>>> print(f"{x};{x.encode('utf8').hex()}")
我;e68891
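Putting it together, a minimal sketch that converts a whole cp936 file and prints each line in your desired text;hex format (input file name taken from your iconv example):
# Read cp936 input and emit each line as "text;utf8hex",
# mimicking `iconv -f cp936 -t utf-8` plus a hex dump.
with open('theInput', encoding='cp936') as fin:
    for line in fin:
        text = line.rstrip('\n')
        print(f"{text};{text.encode('utf8').hex()}")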

'invalid continuation byte' - csv with multiple encodings?

I'm trying to download and parse a U.S. Census csv file in python. I'm getting a recurring error that suggests that there are multiple encodings in the file.
I got the file encoding using
import urllib.request
import io
url = 'https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/metro/totals/csa-est2019-alldata.csv'
urllib.request.urlretrieve(url, 'data/source_files/census/city/2010.csv')
This gives me the file encoding
io.open('data/source_files/census/city/2010.csv')
<_io.TextIOWrapper name='data/source_files/census/city/2010.csv' mode='r' encoding='UTF-8'>
But the encoding doesn't seem to be correct? I tried using chardet.
import chardet

with open('data/source_files/census/city/2010.csv', encoding='UTF-8') as f:
    print(chardet.detect(f.read()))
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 11902: invalid continuation byte
I get a similar error no matter what I try:
df = pd.read_csv('data/source_files/census/city/' + '2010.csv')
import csv
with open("data/source_files/census/city/2010.csv","r") as csvfile:
reader = csv.DictReader(csvfile)
for row in reader:
print(row['CBSA'])
All these approaches are giving me this error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf1 in position 11902: invalid continuation byte
Any advice on how to get around this?
Latin-1 is a single-byte encoding in which every byte value is defined, so decoding with it never raises an error; that makes it a common fallback when UTF-8 fails.
Use this if you get the UTF-8 error:
import pandas as pd
url = "https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/metro/totals/csa-est2019-alldata.csv"
data = pd.read_csv(url , encoding='latin-1')
data.head()
data.head() should then display the first few rows.
The first code block doesn't get the encoding; it just downloads the file.
The second opens the file with an OS-specific default encoding, namely the value of locale.getpreferredencoding(False). UTF-8 was the default on the OS used, and it wasn't correct for the file.
The third opens the file as UTF-8 again, and that is the cause of the failure, not chardet.
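To see which default your own system uses (the result varies by OS and locale settings):
>>> import locale
>>> locale.getpreferredencoding(False)
'UTF-8'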
Use the requests library instead:
>>> import requests
>>> r=requests.get('https://www2.census.gov/programs-surveys/popest/datasets/2010-2019/metro/totals/csa-est2019-alldata.csv')
>>> r
<Response [200]>
>>> r.encoding
'ISO-8859-1'
The correct encoding is ISO-8859-1 also known as latin1. r.text will be the correctly decoded text.
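A minimal sketch to write the decoded text back out as UTF-8 (reusing the path from the question):
with open('data/source_files/census/city/2010.csv', 'w', encoding='utf8') as f:
    f.write(r.text)   # r.text was decoded using r.encoding (ISO-8859-1)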
It looks like that CSV file is not UTF-8 encoded, so you have to pass the correct option to help the file wrapper decode it correctly.
In this case, the CSV is encoded in the Windows ANSI codepage instead of UTF-8:
open("...", encoding="ANSI")

Encoding error even using the right codec

I want to open files depending on the encoding format, therefore I do the following:
import magic
import csv
import codecs

i_file = open(filename).read()
mag = magic.Magic(mime_encoding=True)
encoding = mag.from_buffer(i_file)
print "The encoding is", encoding
Once I know the encoding format, I try to open the file using the right one:
csvlist = []
with codecs.open(filename, "rb", encoding) as f_obj:
    reader = csv.reader(f_obj)
    for row in reader:
        csvlist.append(row)
However, I get the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 0: ordinal not in range(128)
when trying to open a CSV file whose detected encoding is:
The encoding is utf-16le
The funny part comes here. If utf-16le is replaced by utf-16, the UTF-16LE CSV file is read properly. However, plain ASCII CSV files are then not read correctly.
What am I doing wrong?
Python 2's csv module doesn't support Unicode. Is switching to Python 3 an option? If not, can you convert the input file to UTF-8 first?
From the docs linked above:
The csv module doesn’t directly support reading and writing Unicode,
but it is 8-bit-clean save (sic!) for some problems with ASCII NUL
characters. So you can write functions or classes that handle the
encoding and decoding for you as long as you avoid encodings like
UTF-16 that use NULs. UTF-8 is recommended.
Quick and dirty example:
with codecs.open(filename, "rb", encoding) as f_obj:
    with codecs.open(filename + "u8", "wb", "utf-8") as utf8:
        utf8.write(f_obj.read())

with open(filename + "u8", "rb") as f_obj:  # plain bytes: Python 2's csv wants UTF-8-encoded str
    reader = csv.reader(f_obj)
    # etc.
The Python 2 documentation may be useful to you:
https://docs.python.org/2/library/csv.html
Especially this section:
For all other encodings the following UnicodeReader and UnicodeWriter classes can be used. They take an additional encoding parameter in their constructor and make sure that the data passes the real reader or writer encoded as UTF-8:
See the examples at the bottom of that page.
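Roughly, the recipe is used like this (a sketch that assumes you have copied the UnicodeReader class from that page into your script):
# UnicodeReader re-encodes the input to UTF-8 internally, so NUL-heavy
# encodings like UTF-16 become safe for Python 2's csv module.
csvlist = []
with open(filename, "rb") as f:
    reader = UnicodeReader(f, encoding="utf-16")
    for row in reader:
        csvlist.append(row)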

Python unicode normalization issue

I am trying to convert a file that contains some Unicode characters and replace them with normal ASCII characters. I am facing a problem with that and get the following error:
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 1-3: invalid data
My file looks like below:
ecteDV
ecteBl
agnéto
The code to replace accents is shown below:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re, sys, unicodedata, codecs

f = codecs.open(sys.argv[1], 'r', 'utf-8')
for line in f:
    name = line.lower().strip()
    normal = unicodedata.normalize('NFKD', name).encode('ASCII', 'ignore')
    print normal
f.close()
Is there a way I can replace all the accents and normalize the contents of the file?
Consider that your file is perhaps not using UTF-8 as the encoding.
You are reading the file with the UTF-8 codec but decoding fails. Check that your file encoding is really UTF-8.
Note that UTF-8 is an encoding out of many, it doesn't mean 'decode magically to Unicode'.
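If your file turns out to be Latin-1 (common for accented Western European text, but verify it), the same normalization works once you decode with the right codec; a sketch with a placeholder file name:
# Assumes the input really is Latin-1; every byte decodes under that codec,
# so check that the output looks right rather than trusting the absence of errors.
import unicodedata, codecs

f = codecs.open('input.txt', 'r', 'iso-8859-1')
for line in f:
    print unicodedata.normalize('NFKD', line.lower().strip()).encode('ASCII', 'ignore')
f.close()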
If you don't yet understand what encodings are (as opposed to what Unicode is, a related but separate concept), you need to do some reading:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO
Pragmatic Unicode by Ned Batchelder
Try opening the file with the following code, replacing "filename" with your actual file name; the with statement closes the file for you, so no explicit f.close() is needed.
import re, sys, unicodedata, codecs

with codecs.open("filename", 'r', 'utf-8') as f:
    for line in f:
        # process each line here
        pass

Python Decoding Unicode files with 'ÆØÅ'

I read in some data from a Danish text file, but I can't seem to find a way to decode it.
The original text is "dør", but in the raw text file it's stored as "d√∏r".
So I tried the obvious:
InputData = "d√∏r"
print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can I decode this text so the printed message would be "dør"?
C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this will give you the text as unicode objects; if everything else in your program is str then you're going to run into compatibility issues. Either convert everything to unicode or (probably easier) re-encode the file into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break if anything is outside the Latin-1 codepage. It might be easier to convert your program to use str in utf-8 encoding internally.
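A minimal sketch of the suggested stdout fix (Python 2; 'danish.txt' is a placeholder):
import sys, codecs

# Wrap stdout so unicode objects are encoded as UTF-8 on the way out.
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

f = codecs.open('danish.txt', encoding='utf-8')
print f.read()   # prints dør instead of raising UnicodeEncodeError
f.close()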
