Python: File encoding errors

For a few days I've been struggling with an annoying file-encoding problem in my little Python program.
I work a lot with MediaWiki - recently I've been converting documents from .doc to Wikisource.
A document in Microsoft Word format is opened in LibreOffice and then exported to a .txt file in Wikisource format. My program searches for an [[Image:]] tag and replaces it with an image name taken from a list - and that mechanism works really well (big thanks for the help, brjaga!).
When I ran some tests on .txt files I created myself, everything worked just fine, but when I feed it a .txt file with Wikisource content the whole thing is not so funny anymore :D
I got this message from Python:
Traceback (most recent call last):
File "C:\Python33\final.py", line 15, in <module>
s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
File "C:\Python33\lib\encodings\cp1250.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 7389: character maps to <undefined>
And this is my Python code:
li = [
"[[Image:124_BPP_PL_PL_Page_03_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0002.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0003.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0004.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0005.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0006.jpg]]",
"[[Image:124_BPP_PL_PL_Page_03_Image_0007.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0001.jpg]]",
"[[Image:124_BPP_PL_PL_Page_05_Image_0002.jpg]]"
]
with open("C:\\124_BPP_PL_PL.txt") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])

dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w')
for item in li:
    s = s.replace("[[Image:]]", item, 1)
dest.write(s)
dest.close()
OK, so I did some research and found that this is a problem with encoding. So I installed a program Notepad++ and changed the encoding of my .txt file with Wikisource to: UTF-8 and saved it. Then I did some change in my code:
with open("C:\\124_BPP_PL_PL.txt", encoding="utf8") as myfile:
    s = ' '.join([line.replace('\n', '') for line in myfile.readlines()])
But I got this new error message:
Traceback (most recent call last):
File "C:\Python33\final.py", line 22, in <module>
dest.write(s)
File "C:\Python33\lib\encodings\cp1250.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
And I'm really stuck on this one. I thought that if I changed the encoding manually in Notepad++ and then told Python which encoding I had set, everything would be fine.
Please help, Thank You in advance.

When Python 3 opens a text file, it uses the default encoding for your system when trying to decode the file in order to give you full Unicode text (the str type is fully Unicode aware). It does the same when writing out such Unicode text values.
You already solved the input side by specifying an encoding when reading. Do the same when writing: specify a codec that can handle all of Unicode, including the byte order mark (U+FEFF, ZERO WIDTH NO-BREAK SPACE) that Notepad++ placed at the start of your file. UTF-8 is usually a good default choice:
dest = open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8')
You can use the with statement when writing too and save yourself the .close() call:
for item in li:
    s = s.replace("[[Image:]]", item, 1)

with open('C:\\124_BPP_PL_PL_processed.txt', 'w', encoding='utf8') as dest:
    dest.write(s)
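If you would rather keep the BOM out of the processed file entirely, the "utf-8-sig" codec strips a leading U+FEFF on read. A minimal sketch (the file names and contents here are illustrative, not the original paths):

```python
# "utf-8-sig" removes a leading byte order mark when reading, so it is
# never copied into the processed file. Paths and text are illustrative.
with open("document.txt", "w", encoding="utf-8-sig") as f:
    f.write("[[Image:]] and some text")   # saved with a BOM, as Notepad++ does

with open("document.txt", encoding="utf-8-sig") as f:
    s = f.read()                          # BOM already stripped here

with open("processed.txt", "w", encoding="utf8") as f:
    f.write(s.replace("[[Image:]]", "[[Image:example.jpg]]", 1))
```

Writing the output as plain "utf8" then guarantees no stray U+FEFF ends up mid-file if several inputs are concatenated.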

Related

Finding the encoding opening csv file in python

I have trouble understanding how to detect the proper encoding of a CSV file.
I created a small CSV file as a test sample by cutting and pasting some rows from one of the original files I want to process, then saved it as CSV from my local Excel.
My program handles this and similar files without problems, but when I try to open a file sent to me from another computer, the program exits with an error.
The section of the code that opens the file:
with open(file_path, 'r') as f:
    dialect = csv.Sniffer().sniff(f.read(1024))
    f.seek(0)
    reader = csv.DictReader(f, fieldnames=['RUT', 'Nombre', 'Telefono'], dialect=dialect)
    for row in reader:
        numeros.append(row['Telefono'])
The error:
Traceback (most recent call last):
File "C:/Users/.PyCharmEdu3.5/config/scratches/scratch.py", line 22, in <module>
for row in reader:
File "C:\Program Files\Python35\lib\csv.py", line 110, in __next__
row = next(self.reader)
File "C:\Program Files\Python35\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6392: character maps to <undefined>
Process finished with exit code 1
My locale.getpreferredencoding() is 'cp1252'
I did a couple of attempts to guess the encoding:
with open(file_path,'r', encoding='cp1252') as f:
It works with my local generated csv, but not with the ones I'm sent.
with open(file_path,'r', encoding='utf-8') as f:
Doesn't work with any file, but it generates a different error:
Traceback (most recent call last):
File "C:/Users/.PyCharmEdu3.5/config/scratches/scratch.py", line 19, in <module>
dialect = csv.Sniffer().sniff(f.read(1024))
File "C:\Program Files\Python35\lib\codecs.py", line 321, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 1670: invalid continuation byte
Process finished with exit code 1
I also tried adding newline='' to the open(), but it doesn't make a difference.
Following an answer from stackoverflow, I opened the file with notepad, and checked encoding in 'Save as', both my local files and the ones I receive from emails show 'ANSI' as the encoding.
Do I need to figure out the encoding myself, or can Python do that for me? Is there something wrong in my code?
I'm using Python 3.5, and the files were most likely created on computers with a Spanish OS.
Update: I've been doing some more testing. Almost all the CSV files open without problems and the program runs correctly, but there are 2 files that cause an error when I try to open them. If I use Excel or Notepad, these files look normal. I suspect they were created or saved on a computer with an uncommon OS or language.
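One way to attack the guessing problem (a sketch, not an answer from the original thread): try a short list of candidate encodings and use the first one that decodes the raw bytes without error. Note that byte 0x9d from the traceback above is undefined in cp1252, and that latin-1 accepts every byte, so it acts as a last-resort fallback:

```python
def guess_encoding(raw_bytes, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate encoding that decodes raw_bytes cleanly."""
    for enc in candidates:
        try:
            raw_bytes.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

# b'\xf1' is "ñ" in cp1252 but is not valid UTF-8, so cp1252 is picked:
print(guess_encoding(b'\xf1'))   # cp1252
```

This only distinguishes encodings that actually reject some byte sequences; for a statistical guess across many single-byte encodings, a third-party detector would be needed.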

How to avoid automatic ASCII encoding on Python 3?

I'm working on an encryption program in Python 3 right now, but I'm having some problems with ASCII encoding. For example, if I want to write from Python a text file that contains Ϩ (which is chr(1000)), and I do:
a_file = open('chr_ord.txt', 'w')
a_file.write(chr(1000))
a_file.close()
I get:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
File "C:/Comp_Sci/Coding/printRAW.py", line 3, in <module>
a_file.write(chr(1000))
File "C:\WinPython-64bit-3.4.3.4\python-3.4.3.amd64\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u03e8' in position 0: character maps to <undefined>
And if I try:
a_file = open('chr_ord.txt', 'w')
a_file.write(ascii(chr(1000)))
a_file.close()
Python doesn't crash, but the text file contains '\u03e8' instead of the desired Ϩ
Is there any way I can go around this?
The Python 3 way is to use the encoding parameter when opening the file, e.g. to encode the file as UTF-8:
a_file = open('chr_ord.txt', 'w', encoding='utf-8')
The default is your system ANSI code page, which doesn't contain the Ϩ character.
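Round-tripping the character confirms the fix (reusing the question's file name for illustration):

```python
# writing with an explicit UTF-8 encoding avoids the cp1252 default
a_file = open('chr_ord.txt', 'w', encoding='utf-8')
a_file.write(chr(1000))
a_file.close()

# reading back with the same encoding recovers the character
with open('chr_ord.txt', encoding='utf-8') as f:
    print(f.read())   # Ϩ
```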

Python function to turn internationalized domain name from U-Label to A-Label? [duplicate]

I have a long list of domain names which I need to generate some reports on. The list contains some IDN domains, and although I know how to convert them in python on the command line:
>>> domain = u"pfarmerü.com"
>>> domain
u'pfarmer\xfc.com'
>>> domain.encode("idna")
'xn--pfarmer-t2a.com'
>>>
I'm struggling to get it to work with a small script reading data from the text file.
#!/usr/bin/python
import sys

infile = open(sys.argv[1])

for line in infile:
    print line,
    domain = unicode(line.strip())
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
I get the following output:
$ ./idn.py ./test
pfarmer.com
<type 'unicode'>
IDN: pfarmer.com
pfarmerü.com
Traceback (most recent call last):
File "./idn.py", line 9, in <module>
domain = unicode(line.strip())
UnicodeDecodeError: 'ascii' codec can't decode byte 0xfc in position 7: ordinal not in range(128)
I have also tried:
#!/usr/bin/python
import sys
import codecs

infile = codecs.open(sys.argv[1], "r", "utf8")

for line in infile:
    print line,
    domain = line.strip()
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
Which gave me:
$ ./idn.py ./test
Traceback (most recent call last):
File "./idn.py", line 8, in <module>
for line in infile:
File "/usr/lib/python2.6/codecs.py", line 679, in next
return self.reader.next()
File "/usr/lib/python2.6/codecs.py", line 610, in next
line = self.readline()
File "/usr/lib/python2.6/codecs.py", line 525, in readline
data = self.read(readsize, firstline=True)
File "/usr/lib/python2.6/codecs.py", line 472, in read
newchars, decodedbytes = self.decode(data, self.errors)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-5: unsupported Unicode code range
Here is my test data file:
pfarmer.com
pfarmerü.com
I'm very aware of my need to understand unicode now.
Thanks,
Peter
You need to know the encoding in which your file was saved. This would be something like 'utf-8' (which is NOT Unicode), 'iso-8859-1', 'cp1252', or similar.
Then you can do (assuming 'utf-8'):
infile = open(sys.argv[1])

for line in infile:
    print line,
    domain = line.strip().decode('utf-8')
    print type(domain)
    print "IDN:", domain.encode("idna")
    print
Convert encoded strings to unicode with decode. Convert unicode to string with encode. If you try to encode something which is already encoded, python tries to decode first, with the default codec 'ascii' which fails for non-ASCII-values.
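The decode/encode direction is easiest to see in Python 3, where the two types are kept strictly apart (a quick sketch, not part of the original Python 2 answer):

```python
# str.encode: text -> bytes; bytes.decode: bytes -> text
text = 'ü'
data = text.encode('utf-8')
print(data)                   # b'\xc3\xbc'  - two bytes in UTF-8
print(data.decode('utf-8'))   # ü            - round-trip restores the text
```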
Your first example is fine, except that:
domain = unicode(line.strip())
you have to specify a particular encoding here: unicode(line.strip(), 'utf-8'). Otherwise you get the default encoding which for safety is 7-bit ASCII, hence the error. Alternatively you can spell it line.strip().decode('utf-8') as in knitti's example; there is no difference in behaviour between the two syntaxes.
However judging by the error “can't decode byte 0xfc”, I think you haven't actually saved your test file as UTF-8. Presumably this is why the second example, that also looks OK in principle, fails.
Instead it's ISO-8859-1 or the very similar Windows code page 1252. If it's come from a text editor on a Western Windows box it will certainly be the latter; Linux machines use UTF-8 by default instead nowadays. Either make sure to save your file as UTF-8, or read the file using the encoding 'cp1252' instead.
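For what it's worth, in Python 3, where every str is already Unicode, the whole conversion collapses to a single call; the expected output below matches the interactive session in the question:

```python
# the idna codec implements U-label -> A-label conversion directly
domain = "pfarmerü.com"
print(domain.encode("idna"))   # b'xn--pfarmer-t2a.com'
```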

Another python unicode error

I'm getting errors such as
UnicodeEncodeError('ascii', u'\x01\xff \xfeJ a z z', 1, 2, 'ordinal not in range(128)')
I'm also getting sequences such as
u'\x17\x01\xff \xfeA r t B l a k e y'
I recognize \x01\xff\xfe as a BOM, but how do I transform these into the obvious output (Jazz and Art Blakey)?
These are coming from a program that reads music file tags.
I've tried various encodings, such as s.encode('utf8'), and various decodes followed by encodes, without success.
As requested:
from hsaudiotag import auto
inf = 'test.mp3'
song = auto.File(inf)
print song.album, song.artist, song.title, song.genre
Traceback (most recent call last):
  File "audio2.py", line 4, in <module>
    print song.album, song.artist, song.title, song.genre
  File "C:\program files\python27\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xfe' in position 4: character maps to <undefined>
If I change the print statement to
with open('x', 'wb') as f:
    f.write(song.genre)
I get
Traceback (most recent call last):
File "audio2.py", line 6, in <module>
f.write(song.genre)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xff' in position 1:
ordinal not in range(128)
For your actual question, you need to write bytes, not characters, to files. Call:
f.write(song.genre.encode('utf-8'))
and you won't get the error. You can use io.open to get a character stream that you can write to with the encoding done automatically; note that this needs text mode 'w', since an encoding cannot be combined with binary mode 'b':
import io

with io.open('x', 'w', encoding='utf-8') as f:
    f.write(song.genre)
Getting Unicode to the Console can be a matter of some difficulty (under Windows in particular)—see PrintFails.
However, as discussed in the comments, what you've got doesn't look like a working tag value... it looks more like a mangled ID3v2 frame data block, which it might not be possible to recover. I don't know if this is a bug in your tag-reading library or you just have a file with rubbish tags.

UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)

I want to parse my XML document, which I have stored as below:
class XMLdocs(db.Expando):
    id = db.IntegerProperty()
    name = db.StringProperty()
    content = db.BlobProperty()
Now below is my code:
parser = make_parser()
curHandler = BasketBallHandler()
parser.setContentHandler(curHandler)

for q in XMLdocs.all():
    parser.parse(StringIO.StringIO(q.content))
I am getting the error below:
'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
Traceback (most recent call last):
File "/base/python_runtime/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 517, in __call__
handler.post(*groups)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/base_handler.py", line 59, in post
self.handle()
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 168, in handle
scan_aborted = not self.process_entity(entity, ctx)
File "/base/data/home/apps/parsepython/1.348669006354245654/mapreduce/handlers.py", line 233, in process_entity
handler(entity)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 71, in process
parser.parse(StringIO.StringIO(q.content))
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 107, in parse
xmlreader.IncrementalParser.parse(self, source)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/xmlreader.py", line 123, in parse
self.feed(buffer)
File "/base/python_runtime/python_dist/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
self._parser.Parse(data, isFinal)
File "/base/data/home/apps/parsepython/1.348669006354245654/parseXML.py", line 136, in characters
print ch
UnicodeEncodeError: 'ascii' codec can't encode character u'\xef' in position 0: ordinal not in range(128)
The actual best answer for this problem depends on your environment, specifically what encoding your terminal expects.
The quickest one-line solution is to encode everything you print to ASCII, which your terminal is almost certain to accept, while discarding characters that you cannot print:
print ch #fails
print ch.encode('ascii', 'ignore')
The better solution is to change your terminal's encoding to utf-8, and encode everything as utf-8 before printing. You should get in the habit of thinking about your unicode encoding EVERY time you print or read a string.
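The built-in error handlers differ only in what they do with unencodable characters; a quick comparison on an illustrative string:

```python
# '\xfe' (þ) is outside ASCII, so plain .encode('ascii') would raise
s = u'\xfeJazz'
print(s.encode('ascii', 'ignore'))    # b'Jazz'  - unencodable characters dropped
print(s.encode('ascii', 'replace'))   # b'?Jazz' - substituted with '?'
```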
Just putting .encode('utf-8') at the end of the object will do the job in recent versions of Python.
It seems you are hitting a UTF-8 byte order mark (BOM). Try using this unicode string with BOM extracted out:
import codecs
content = unicode(q.content.strip(codecs.BOM_UTF8), 'utf-8')
parser.parse(StringIO.StringIO(content))
I used strip instead of lstrip because in your case you had multiple occurences of BOM, possibly due to concatenated file contents.
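In Python 3 the same BOM stripping can be done on the raw bytes with codecs.BOM_UTF8, or more simply with the 'utf-8-sig' codec; a sketch with made-up XML content standing in for q.content:

```python
import codecs

# bytes as they might come out of the datastore, BOM included
raw = codecs.BOM_UTF8 + b'<root/>'
content = raw.decode('utf-8-sig')   # decodes UTF-8 and drops a leading BOM
print(content)                      # <root/>
```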
This worked for me:
from django.utils.encoding import smart_str
content = smart_str(content)
The problem according to your traceback is the print statement on line 136 of parseXML.py. Unfortunately you didn't see fit to post that part of your code, but I'm going to guess it is just there for debugging. If you change it to:
print repr(ch)
then you should at least see what you are trying to print.
The problem is that you're trying to print a Unicode character to a possibly non-Unicode terminal. You need to encode it with the 'replace' option before printing it, e.g. print ch.encode(sys.stdout.encoding, 'replace').
An easy way to overcome this problem is to set your default encoding to utf8, though note that this reload(sys) trick is Python 2 only and is widely discouraged, since it can mask genuine encoding bugs. An example follows:
import sys
reload(sys)
sys.setdefaultencoding('utf8')
