I'm attempting to do something very simple: read a file in ASCII or utf-8-sig and save it as utf-8. However, when I run the function below and then run file filename.json on Linux, it always reports the file as ASCII. I have tried using codecs, with no luck either. The only way I can get it to work is if I replace utf-8 with utf-8-sig, BUT then the file has a BOM. I've searched around for solutions, and I found some that remove the leading BOM characters; however, after that is done, the file becomes ASCII again. I have tried everything here: Convert UTF-8 with BOM to UTF-8 with no BOM in Python
def file_converter(file_path):
    s = open(file_path, mode='r', encoding='ascii').read()
    open(file_path, mode='w', encoding='utf-8').write(s)
Files that contain only characters below U+0080 encode to exactly the same bytes in ASCII and UTF-8 (this was one of the compatibility goals of UTF-8). file detects the file as ASCII, and it is, but it is also valid UTF-8 and will decode correctly as UTF-8 (just like any ASCII file will). So nothing at all is wrong.
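You can confirm this quickly; the snippet below is only an illustration and the sample string is made up:

text = '{"name": "example"}'                            # only code points below U+0080
assert text.encode('ascii') == text.encode('utf-8')     # identical bytes on disk

# Only characters at U+0080 and above produce bytes that are UTF-8 but not ASCII.
assert 'café'.encode('utf-8') == b'caf\xc3\xa9'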
Related
I have to process an input text file, which can be in ANSI, and convert it to UTF-8 while doing some processing of the lines read. In Python, that amounts to
with open(input_file_location, 'r', newline='\r\n', encoding='cp1252') as old, open(output_file_location, 'w', encoding='utf_8') as new:
    for line in old:
        modified = ...  # do processing here
        new.write(modified)
However, this works as expected only if the input file really is ANSI (Windows). If the input file was originally UTF-8, the code above runs silently, reads it as if it were ANSI, and the output is not what was expected.
So the question is: how do I handle the case where the existing file is already UTF-8? Either read it as UTF-8, or better, skip the whole of the above processing.
Thanks
UTF-8 is more constraining than CP1252, and both are ASCII-compatible. So you can start by reading the file as UTF-8; if that works you're fine (it's either plain ASCII or valid UTF-8), and if it fails, fall back to CP1252.
Alternatively you could try running chardet on it, but that's not necessarily more reliable: every byte is "valid" in ISO-8859 encodings (of which CP1252 is a derivative), so every file "decodes properly"; it just comes out as garbage.
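A minimal sketch of that fallback (the function name is made up; the newline argument is carried over from the question):

def read_text(path):
    # Try the stricter encoding first; plain ASCII and valid UTF-8 both succeed.
    try:
        with open(path, 'r', newline='\r\n', encoding='utf-8') as f:
            return f.read()
    except UnicodeDecodeError:
        # Bytes >= 0x80 that don't form valid UTF-8 sequences: assume Windows ANSI.
        with open(path, 'r', newline='\r\n', encoding='cp1252') as f:
            return f.read()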
There isn't a guaranteed way to determine the encoding of a file if it isn't known in advance. However, if you are sure that the possibilities are restricted to UTF-8 and cp1252, then the following approach may work (a sketch follows the list):
1. Open the file in binary mode and read the first three bytes. If these bytes are b'\xef\xbb\xbf' then the encoding is extremely likely to be 'utf-8-sig', a Microsoft variant of UTF-8 (unless you have cp1252 files that legitimately begin with "ï»¿"). See the final paragraph of this section of the codecs docs.
2. Otherwise, assume UTF-8. Both UTF-8 and cp1252 decode bytes in the ASCII range (0-127) identically. A lone byte with the high bit set is not valid UTF-8, so if the file is encoded as cp1252 and contains such bytes, a UnicodeDecodeError will be raised.
3. Catch that UnicodeDecodeError and try again with cp1252.
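Putting the three steps together (detect_and_read is a hypothetical name, not from the question):

def detect_and_read(path):
    # Step 1: sniff for the UTF-8 BOM.
    with open(path, 'rb') as f:
        bom = f.read(3)
    encoding = 'utf-8-sig' if bom == b'\xef\xbb\xbf' else 'utf-8'
    # Step 2: try UTF-8 (or utf-8-sig); step 3: fall back to cp1252 on failure.
    try:
        with open(path, 'r', encoding=encoding) as f:
            return f.read(), encoding
    except UnicodeDecodeError:
        with open(path, 'r', encoding='cp1252') as f:
            return f.read(), 'cp1252'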
I have a Python 2.7 script which imports data from CSV files exported from various other sources.
As part of the import process I have a small function that establishes the correct character encoding for the file. I then open the file and loop over the lines using:
with io.open(filename, "r", encoding=file_encoding) as input_file:
    for raw_line in input_file:
        cleaned_line = raw_line.replace('\x00', '').replace(u"\ufeff", "").encode('utf-8')
        # do stuff
The files from this source usually come as UTF-8 (with BOM) and I detect the encoding 'utf-8-sig' and use that to open the file.
The problem I am having is that one of my data sources returns a file that seems to have an encoding error. The rest of the file (about 27k lines of CSV data) is correct, as usual, but one line fails.
The line in question fails with this error (at the for raw_line in input_file line):
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa0 in position 1709: invalid start byte
The line has several non-breaking space characters that are encoded as a single byte with value A0 rather than the two bytes C2 A0.
I am already doing some light cleaning on a line by line basis for other problems as you can see on my "cleaned_line" line at the top of the loop (I dislike doing this per line but with the files I get I haven't found a way to do it better). However, the code fails before I ever even get there.
Is there a correct/nice way to handle this particular issue? I thought I'd nailed the whole encoding issue until this.
You can tell Python to ignore decoding errors, or to replace the faulty bytes with a placeholder character.
Set errors to 'ignore' to ignore the A0 bytes:
with io.open(filename, "r", encoding=file_encoding, errors='ignore') as input_file:
or to 'replace' to replace them with the U+FFFD REPLACEMENT CHARACTER:
with io.open(filename, "r", encoding=file_encoding, errors='replace') as input_file:
UTF-8 is a self-correcting encoding; a byte is either part of a multi-byte code point or can be decoded as ASCII directly, so ignoring un-decodable bytes is relatively safe.
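To see what each handler does, here is a small demonstration on made-up sample bytes containing a bare A0:

raw = b'100\xc2\xa0km,\xa0200 km'            # first NBSP is valid UTF-8, second is a bare A0 byte
print(repr(raw.decode('utf-8', 'ignore')))    # the bad byte is silently dropped
print(repr(raw.decode('utf-8', 'replace')))   # the bad byte becomes U+FFFD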
You can also use the third-party translitcodec package, which registers a 'translit/long' codec that transliterates non-ASCII characters into plain ASCII equivalents; you need to import translitcodec first so that the codec is registered.
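Usage looks roughly like this, assuming translitcodec is installed (the sample text and expected output are illustrative):

import codecs
import translitcodec  # registers the 'translit/...' codecs as a side effect

print(codecs.encode(u'caf\xe9 d\xf8r', 'translit/long'))   # expected to print something like: cafe dor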
I am trying to parse through a log file, but the file format is always in unicode. My usual process that I would like to automate:
I pull file up in notepad
Save as...
change encoding from unicode to UTF-8
Then run python program on it
So this is the process I would like to automate in Python 3.4: pretty much just change the file to UTF-8, or open it with something like open(filename,'r',encoding='utf-8'), although that exact line was throwing me this error when I tried to call read() on it:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing in UTF-8, so that I don't have to str.encode (or something like that) every time I analyze a string.
Anybody been through this and know which method I should use and how to do it?
EDIT:
In the Python 3 REPL, I did
>>> f = open('file.txt','r')
>>> f
<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>
So now my Python code in my program opens the file with open('file.txt','r',encoding='cp1252'). I am running a lot of regexes through this file, though, and they aren't matching (I think because it isn't UTF-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom
What Notepad considers Unicode is UTF-16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following when using utf8 to decode the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To convert to UTF-8, you could use:
with open('log.txt', encoding='utf16') as f:
    data = f.read()
with open('utf8.txt', 'w', encoding='utf8') as f:
    f.write(data)
Note that many Windows editors like a UTF-8 signature at the beginning of the file, or may assume ANSI instead. ANSI really means the local language code page: on US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
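That variant only changes the encoding used when writing (same sketch as above):

with open('utf8.txt', 'w', encoding='utf-8-sig') as f:
    f.write(data)   # prepends the EF BB BF signature that Windows editors expect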
Attempting to write a line to a text file in Python 2.7, and have the following code:
# -*- coding: utf-8 -*-
...
f = open(os.path.join(os.path.dirname(__file__), 'output.txt'), 'w')
f.write('Smith’s BaseBall Cap')  # note the strangely shaped apostrophe
However, in output.txt, I get Smith’s BaseBall Cap, instead. Not sure how to correct this encoding problem? Any protips with this sort of issue?
You have declared your file to be encoded with UTF-8, so your byte-string literal is in UTF-8. The curly apostrophe is U+2019. In UTF-8, this is encoded as three bytes, \xE2\x80\x99. Those three bytes are written to your output file. Then, when you examine the output file, it is interpreted as something other than UTF-8, and you see the three incorrect characters instead.
Interpreted as Windows-1252 (cp1252), those three bytes display as ’.
Your file is a correct UTF-8 file, but you are viewing it incorrectly.
There are a couple possibilities, but the first one to check is that the output file actually contains what you think it does. Are you sure you're not viewing the file with the wrong encoding? Some editors have an option to choose what encoding you're viewing the file in. The editor needs to know the file's encoding, and if it interprets the file as being in some other encoding than UTF-8, it will display the wrong thing even though the contents of the file are correct.
When I run your code (on Python 2.6) I get the correct output in the file. Another thing to try: use the codecs module to open the file for UTF-8 writing: f = codecs.open("file.txt", "w", "utf-8"). Then declare the string as a unicode string with u"Smith’s BaseBall Cap".
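Putting those two suggestions together, a minimal Python 2 sketch looks like this:

# -*- coding: utf-8 -*-
import codecs

# codecs.open encodes the unicode string to UTF-8 as it is written.
f = codecs.open('output.txt', 'w', 'utf-8')
f.write(u'Smith’s BaseBall Cap')   # a unicode literal, not a UTF-8 byte string
f.close()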
I read in some data from a Danish text file. But I can't seem to find a way to decode it.
The original text is "dør" but in the raw text file its stored as "d√∏r"
So i tried the obvious
InputData = "d√∏r"
print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can I decode this text so the printed message would be "dør"?
C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this will give you the text as unicode objects; if everything else in your program is str then you're going to run into compatibility issues. Either convert everything to unicode, or (probably easier) re-encode the text into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break if anything is outside the Latin-1 codepage. It might be easier to convert your program to use str in UTF-8 encoding internally.
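A short Python 2 sketch of the whole round trip (the file name is illustrative):

# -*- coding: utf-8 -*-
import codecs
import sys

# Wrap stdout so unicode output is encoded as UTF-8 instead of the ascii default.
sys.stdout = codecs.getwriter('utf8')(sys.stdout)

with codecs.open('danish.txt', encoding='utf-8') as f:
    for line in f:
        print line.rstrip()   # prints u'dør' correctly instead of 'd√∏r'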