How to open an ASCII text file gracefully - Python

It is confusing when I open a file with Python. By the way, I'm using Python 3.4.
First, it's a log file (a huge file that is appended to all the time), so converting it with iconv is not possible.
Info 1: file says it is ASCII text.
demo git:master ❯ file 1.log
1.log: ASCII text, with very long lines
Info 2: IPython opens it with a default encoding of 'UTF-8':
In [1]: f = open('1.log')
In [2]: f.encoding
Out[2]: 'UTF-8'
Then:
First, when I open('1.log', encoding='utf-8', mode='r'):
ERROR: 'utf-8' codec can't decode byte 0xb1 in position 6435: invalid start byte
Second, when I open('1.log', encoding='ascii', mode='r'):
ERROR: 'ascii' codec can't decode byte 0xe9 in position 6633: ordinal not in range(128)
How can I gracefully handle this file with every line read?
This is my demo on GitHub.

I tried a few different combinations of encodings and was able to get all the way through the log file by simply changing the encoding in your script to latin1, so open('1.log', encoding='utf-8', mode='r') becomes open('1.log', encoding='latin1', mode='r').
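As a minimal sketch, here is that fix applied to a line-by-line read. latin1 maps every byte value to some character, so this loop can never raise a UnicodeDecodeError (though bytes that were really cp1252 punctuation may come out as odd characters):

with open('1.log', encoding='latin1', mode='r') as f:
    for line in f:
        print(line, end='')  # or hand each decoded line to your own parser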

It's probably Windows CP 1252 or Latin-1. Try opening it with:
open('1.log', mode='rU', encoding='latin-1')

Looks like it's not an ASCII file; that encoding test is often inaccurate. Try chardet, which will detect the encoding for you.
Then
import chardet
# chardet works on bytes, so open the file in binary mode
filepointer = open(self.filename, 'rb')
charset_detected = chardet.detect(filepointer.read())
Keep in mind that this can take a very, very long time. Before you try it, I recommend manually cycling through the obvious encodings first.
Try UTF-16 and UTF-32, then try the Windows encodings. Here is a list of several encodings.
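Here is a sketch of that manual cycle (the function name and the candidate list are illustrative assumptions, not a standard recipe; latin-1 goes last because it accepts any byte sequence and would otherwise mask the others):

def sniff_encoding(path, candidates=('utf-16', 'utf-32', 'cp1252', 'utf-8', 'latin-1')):
    # Return the first candidate that decodes the whole file without error.
    # Note: this reads the entire file per attempt, which is slow on huge logs.
    for enc in candidates:
        try:
            with open(path, encoding=enc) as f:
                f.read()
            return enc
        except UnicodeError:
            continue
    return None

print(sniff_encoding('1.log'))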

Related

UnicodeDecodeError while processing Accented words

I have a Python script which reads a YAML file (it runs on an embedded system). Without accents, the script runs normally on my development machine and on the embedded system, but accented words make it crash with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 6: ordinal not in range(128)
only in the embedded environment.
The YAML sample:
data: ã
The snippet which reads the YAML:
with open(YAML_FILE, 'r') as stream:
    try:
        data = yaml.load(stream)
    except yaml.YAMLError as exc:
        print(exc)
Tried a bunch of solutions without success.
Versions: Python 3.6, PyYAML 3.12
The codec that is reading your bytes has been set to ASCII. This restricts you to byte values between 0 and 127.
The representation of accented characters in Unicode falls outside this range, so you're getting a decoding error.
A UTF-8 codec decodes ASCII as well as UTF-8, because ASCII is a (very small) subset of UTF-8, by design.
If you can change your codec to be a UTF-8 decode, it should work.
In general, you should always specify how you will decode a byte stream to text, otherwise, your stream could be ambiguous.
You can specify the codec that should be used when dumping data with PyYAML, but there is no way to specify your codec in PyYAML when you load. However, PyYAML will handle Unicode as input, and you can explicitly specify which codec to use when opening the file for reading; that codec is then used to return the text (you open the file as a text file with 'r', which is the default for open()).
import yaml

YAML_FILE = 'input.yaml'
with open(YAML_FILE, encoding='utf-8') as stream:
    data = yaml.safe_load(stream)
Please note that you should almost never have to use yaml.load(), which is documented to be unsafe, use yaml.safe_load() instead.
To dump data in the same format you loaded it use:
import sys

# passing encoding= makes PyYAML emit bytes, so write to the binary
# buffer underlying sys.stdout rather than to the text stream itself
yaml.safe_dump(data, sys.stdout.buffer, allow_unicode=True, encoding='utf-8',
               default_flow_style=False)
The default_flow_style=False is needed in order not to get the flow-style curly braces, and allow_unicode is necessary or else you get data: "\xE3" (i.e. escape sequences for Unicode characters).
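With the one-line sample document above, this should round-trip and write data: ã back out rather than the escaped form.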

Ignore UnicodeEncodeError when saving utf8 file

I have a large string of a novel that I downloaded from Project Gutenberg. I am trying to save it to my computer, but I'm getting a UnicodeEncodeError and I don't know how to fix or ignore it.
from urllib import request
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf8
raw = response.read().decode('utf8')
# Save the file
file = open('corpora/canon_texts/' + 'test', 'w')
file.write(raw)
file.close()
This gives me the following error:
UnicodeEncodeError: 'charmap' codec can't encode character '\ufeff' in position 0: character maps to <undefined>
First, I tried to remove the BOM at the beginning of the file:
# We have to get rid of the pesky Byte Order Mark before we save it
raw = raw.replace(u'\ufeff', '')
but I get the same error, just with a different position number:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
If I look in that area I can't find the offending characters, so I don't know what to remove:
raw[7850:7900]
just prints out:
' BALLENA, Spanish.\r\n PEKEE-NUEE-'
which doesn't look like it would be a problem.
So then I tried to skip the bad lines with a try statement:
file = open('corpora/canon_texts/' + 'test', 'w')
try:
    file.write(raw)
except UnicodeEncodeError:
    pass
file.close()
but this skips the entire text, giving me a file of 0 size.
How can I fix this?
EDIT:
A couple of people have noted that '\ufeff' is utf16. I tried switching to utf16:
# Get the text
response = request.urlopen('http://www.gutenberg.org/files/2701/2701-0.txt')
# Decode it using utf16
raw = response.read().decode('utf-16')
But I can't even download the data before I get this error:
UnicodeDecodeError: 'utf-16-le' codec can't decode byte 0x0a in position 1276798: truncated data
SECOND EDIT:
I also tried decoding with utf-8-sig as suggested in u'\ufeff' in Python string because that includes BOM, but then I'm back to this error when I try to save it:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 7863-7864: character maps to <undefined>
Decoding and re-encoding a file just to save it to disk is pointless. Just write out the bytes you have downloaded, and you will have the file on disk:
raw = response.read()
with open('corpora/canon_texts/' + 'test', 'wb') as outfile:
    outfile.write(raw)
This is the only reliable way to write to disk exactly what you downloaded.
Sooner or later you'll want to read the file in and work with it, so let's consider your error. You didn't provide a full stack trace (omitting it is always a bad idea), but your error is during encoding, not decoding; the decoding step succeeded. The error must be arising on the line file.write(raw), which is where the text gets encoded for saving. But to what encoding is it being converted? Nobody knows, because you opened file without specifying an encoding! The encoding you get depends on your location, OS, and probably the tides and weather forecast. In short: specify the encoding.
text = response.read().decode('utf8')
with open('corpora/canon_texts/' + 'test', 'w', encoding="utf-8") as outfile:
    outfile.write(text)
U+FEFF is the UTF-16 byte order mark. Try that instead.
.decode(encoding="utf-8", errors="strict") offers error handling as a built-in feature:
The default for errors is 'strict', meaning that encoding errors raise a UnicodeError. Other possible values are 'ignore', 'replace' and any other name registered via codecs.register_error(), see section Error Handlers.
Probably the safest option is
decode("utf8", errors='backslashreplace')
which will escape encoding errors with a backslash, so you have a record of what failed to decode.
Conveniently, your Moby Dick text contains no backslashes, so it will be quite easy to check what characters are failing to decode.
What is strange about this text is that the website says it is in utf-8, but \ufeff is the BOM for utf-16. Decoding in utf-16, it looks like you're just having trouble with the very last byte 0x0a (a line feed), which can probably safely be dropped with
decode("utf-16", errors='ignore')

How to remove all conflicting characters between latin1 and utf-8 using python?

I call open(file, "r") and read some lines in Python. This gives me:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xef in position 4: ordinal not in range(128)
If I add 'utf-8', I get:
'utf8' codec can't decode bytes in position 28-29: invalid continuation byte
If I add 'ISO-8859-1', I get no errors but a line is read like this:
2890 ready to try Argh� Fantasy Surfer Carnage� Dane, Marlon & Nat C all out! #fantasysurfer
As you can see there are some extra characters, which probably come from emojis or something... (These are tweets)..
What is the best approach to clean these lines up?
I would like to remove all the extraneous elements... I would like the strings to have only numbers, letters, and common symbols ?!>.;, etc...
Note: I don't care about the html entities, since I replace those in another function. I am talking about the weird Argh� Carnage� elements.
In general, these issues are caused by the encoding.
First, ensure that you have specified the right encoding in the first line of the Python file:
# -*- coding: utf-8 -*-
Second, you can use the codecs library, specifying the desired encoding:
import codecs
fich_in = codecs.open(filename, 'r', encoding='utf-8')
Third, you can ignore all the wrong characters using:
TEXT.encode('utf-8', 'ignore').decode('utf-8')
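If the goal really is to keep only numbers, letters, and common symbols, here is a minimal sketch (the clean_line name, the tweets.txt filename, and the exact character class are assumptions; widen the class to suit your data):

import re

def clean_line(line):
    # Keep ASCII letters, digits, whitespace and common punctuation;
    # drop everything else (emoji, mojibake, stray control characters).
    return re.sub(r"[^A-Za-z0-9\s?!>.;,:'@#&()-]", '', line)

with open('tweets.txt', encoding='utf-8', errors='replace') as f:
    for line in f:
        print(clean_line(line), end='')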
Try decoding the raw bytes first and then re-encoding (in Python 3, only bytes have .decode()):
b"text".decode('latin-1').encode('utf-8')
Or try opening the file with codecs:
import codecs
with codecs.open('file', encoding='your codec') as f:
    ...
Your problem is that you are either opening the file in the wrong encoding or incorrectly identifying the character encoding.
Also, if you receive the text as ASCII bytes, decode it with:
b'abc'.decode('ascii')
or, on Python 2:
unicode('abc', 'ascii')

UnicodeEncodeError: 'ascii' codec can't encode characters ordinal not in range(128)

I can't read the word Curaçao from a text file. What am I doing wrong?
I have written a text file that contains the word "Curaçao". The encoding on the editor (vim) is latin1.
This python program reads the file:
import sys
with open('foo.txt', 'r', encoding='latin1') as f:
    print('f:', f.encoding)
    print('stdout:', sys.stdout.encoding)
    for i in f:
        print(i)
And when I run it I get this...
sundev19:/home/jgalloway12/code/wdPhone $ python3 CountryFix.py
f: latin1
stdout: 646
Traceback (most recent call last):
File "CountryFix.py", line 11, in <module>
print(i)
UnicodeEncodeError: 'ascii' codec can't encode character '\xe7' in position 4: ordinal not in range(128)
Here is the file's contents in binary.
0000000: 4375 7261 e761 6f0a Cura.ao.
EDIT: The "real" problem I am trying to solve here is reading an Excel 2010 exported CSV which contains country names.
Fixed the file to be encoded in Latin1. Program now prints locale.
The problem here isn't the file, but the output stream.
For whatever reason, Python has detected your stdout encoding as US-ASCII when you really want something more (utf-8, latin1, etc.).
Your options are:
Trick it into believing a different encoding (on Linux you can do this with LANG=en_US.UTF-8; however, I assume you're on Windows, and I don't recall how to trick Python this way on Windows :)).
Write your response to a file:
with open('output.txt', 'w', encoding='latin1') as f:
    ...
Or write to the stdout bytestream:
import sys
sys.stdout.buffer.write(i.encode('latin1'))
Since you are printing the lines, and Python's print function doesn't use the encoding from the open() call, it tries to encode your string with its default encoding, which is ASCII. So you need to apply a custom encoding to your Unicode text when you want to print it.
You can use the str.encode() method with a proper encoding for printing.
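A sketch of that suggestion applied to the loop from the question (errors='replace' is an assumption; it turns anything ASCII can't represent into '?'):

with open('foo.txt', 'r', encoding='latin1') as f:
    for i in f:
        # Round-trip through ASCII so print() never sees a character
        # that the ASCII-configured stdout cannot encode.
        print(i.encode('ascii', errors='replace').decode('ascii'), end='')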

Python 3 unicode to utf-8 on file

I am trying to parse through a log file, but the file format is always in unicode. My usual process that I would like to automate:
I pull the file up in Notepad
Save As...
change the encoding from Unicode to UTF-8
then run my Python program on it
So this is the process I would like to automate in Python 3.4: pretty much just change the file to UTF-8, or something like open(filename, 'r', encoding='utf-8'), although that exact line was throwing me this error when I tried to call read() on it:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing as UTF-8, so that I don't have to str.encode (or something like that) every time I analyze a string.
Anybody been through this and know which method I should use and how to do it?
EDIT:
In the Python 3 REPL, I did
>>> f = open('file.txt', 'r')
>>> f
<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>
So now my program opens the file with open('file.txt', 'r', encoding='cp1252'). I am running a lot of regexes over this file, though, and they aren't matching (I think because it isn't utf-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom
What Notepad considers Unicode is UTF-16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following when using utf8 to decode the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To convert to UTF-8, you could use:
with open('log.txt', encoding='utf16') as f:
    data = f.read()
with open('utf8.txt', 'w', encoding='utf8') as f:
    f.write(data)
Note that many Windows editors like to see a UTF-8 signature at the beginning of the file, and may assume ANSI instead. ANSI is really the local language locale; on US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
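Since the original complaint was about a large log file, a line-by-line variant of the same conversion avoids holding the whole file in memory (a sketch; swap in encoding='utf-8-sig' on the output per the note above if your editor wants the signature):

with open('log.txt', encoding='utf16') as src, \
        open('utf8.txt', 'w', encoding='utf8') as dst:
    for line in src:
        dst.write(line)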
