In my application, a user can upload a text file, and I need to read it and construct a JSON object for another API call.
I open the file with
f = open(file, encoding="utf-8")
then get the first word and construct the JSON object.
My problem is that some files (especially those from Microsoft environments) have a BOM at the beginning, so my JSON now contains this character:
{
"word":"\\ufeffMyWord"
}
and of course, the API is not working from this point on.
I am obviously missing something: shouldn't utf-8 remove the BOM? (Because it is not utf-8-sig.)
How can I overcome this?
No, the UTF-8 standard does not define a BOM character. That's because UTF-8 has no byte order ambiguity issue like UTF-16 and UTF-32 do. The Unicode consortium doesn't recommend using U+FEFF at the start of a UTF-8 encoded file, while the IETF actively discourages it if alternatives to specify the codec exist. From the Wikipedia article on BOM usage in UTF-8:
The Unicode Standard permits the BOM in UTF-8, but does not require or recommend its use.
[...]
The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."
The Unicode standard only 'permits' the BOM because it is a regular character, just like any other; it's a zero-width non-breaking space character. As a result, the Unicode consortium recommends it is not removed when decoding, to preserve information (in case it had a different meaning or you wanted to retain compatibility with tools that have come to rely on it).
You have two options:
Strip the BOM from the string after reading it. Note that Python 3's str.strip() without arguments does not treat U+FEFF as whitespace, so name the character explicitly:
text = text.lstrip('\ufeff') # remove the BOM if present
(technically that'll remove any number of zero-width non-breaking space characters, but that is probably what you'd want anyway).
Open the file with the utf-8-sig codec instead. That codec was added to handle such files, explicitly removing the UTF-8 BOM byte sequence from the start if present, before decoding. It also works on files without those bytes.
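A minimal sketch of the second option, assuming Python 3. The helper name first_word_json and the demo file contents are made up for illustration; the "word" key follows the JSON shown in the question.

```python
import json
import os
import tempfile


def first_word_json(path):
    """Return a JSON string holding the first word of a text file.

    'utf-8-sig' drops a leading UTF-8 BOM if one is present and behaves
    like plain 'utf-8' otherwise.
    """
    with open(path, encoding="utf-8-sig") as f:
        words = f.read().split()
    return json.dumps({"word": words[0] if words else ""})


# Demo: write a file that starts with the BOM bytes EF BB BF.
fd, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfMyWord and more text\n")

result = first_word_json(demo_path)
os.remove(demo_path)
print(result)  # {"word": "MyWord"}
```

The same function can be used for files without a BOM, so there is no need to detect the BOM first.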
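The utf-8 codec doesn't remove the BOM (byte order mark). You have to check whether the data starts with a BOM and, if so, remove it yourself. For example, with the file contents read in binary mode into a bytes object (called raw here):
import codecs

if raw.startswith(codecs.BOM_UTF8):
    raw = raw[len(codecs.BOM_UTF8):]  # drop the three BOM bytes
    print("Removed BOM")
else:
    print("No BOM found; process the file as is")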
If I have a string that I want to use in byte form encoded as UTF-8, do I need to encode the variable as a byte variable? Or, since Python is by default encoded as UTF-8, will it just treat the string as UTF-8 byte form in certain contexts without explicit encoding?
For example, I'm working on a project where I have an array of dictionaries that map strings to strings. If I write this array to a file with json.dump and then read it with json.load, the strings are recovered just fine, and I get no error, despite never encoding. This indicates to me that if you're just using UTF-8, you don't actually need to convert to byte form. Am I wrong? If I'm right, is this bad practice nonetheless? Would my example be any different if I were just writing strings without the JSON?
Python has multiple defaults regarding encoding.
In Python 3, the situation is as follows:
The source file encoding is UTF-8 by default. You can override this with a comment in one of the first two lines of the module (# coding: latin-1) if you really have to. It only affects string literals (and variable names).
The encoding parameter of str.encode() and bytes.decode() is UTF-8 too.
But when you open a file with open(), then the default for encoding depends on the circumstances (OS, env variables, Python version, build). You can check its value with locale.getpreferredencoding(). This default is also used when you read from sys.stdin or use print().
So I'd say it's okay to rely on the defaults for the first two cases (it's officially recommended for the first one).
But the third one is tricky: The IO default is UTF-8 on many systems, so you might think that with open(path) as f: will always use UTF-8, because it did so during development, but then you port the script to a different server and suddenly it raises UnicodeErrors or produces gibberish.
It's often not necessary to deal with encoded strings (i.e. bytes objects) for processing text.
Rather, you make sure to have it decoded when reading and encoded when writing/sending the text.
This is done automatically for streams created with open() (unless you specify binary mode 'rb'/'wb').
If you think input/output has to be UTF-8, then you should explicitly specify encoding='utf8' when calling open().
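The defaults described above can be checked with a short sketch like the following (Python 3 assumed; the sample text and the temporary file are only there for the demonstration). The printed default differs between systems, which is exactly why the explicit encoding argument matters.

```python
import locale
import os
import tempfile

# The platform-dependent default that open() falls back to when no
# encoding is given; the value differs between systems.
print("default for open():", locale.getpreferredencoding(False))

text = "\u00b5 \u00b1 \u00f6"  # micro sign, plus-minus sign, o with diaeresis

# str.encode() defaults to UTF-8 regardless of the platform.
assert text.encode() == text.encode("utf-8")

# Being explicit on both sides makes the round trip portable.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, "w", encoding="utf-8") as f:
    f.write(text)
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
with open(path, "rb") as f:
    raw = f.read()  # the bytes that actually ended up on disk
os.remove(path)
```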
All, I'm trying to run a regex over a bunch of science data, converting certain special symbols into ASCII-friendly strings. For example, I want to replace 'µ' (UTF-8 \xc2\xb5) with the string 'micro', and '±' with '+/-'. I cooked up a Python script to do this, which looks like this:
import re
def stripChars(string):
    outString = (re.sub(r'\xc2\xb5+','micro', string)) #Metric 'micro (10^-6)' (Greek 'mu') letter
    outString = (re.sub(r'\xc2\xb1+','+/-', outString)) #Scientific 'Plus-Minus' symbol
    return outString
However, for these two specific characters, I'm getting strange results. I dug into it a bit, and it looks like I'm suffering from the bug described here, in which certain characters come out wrong because they are UTF data being interpreted as Windows-1252 (or ISO 8859-1).
I grepped the relevant data and found the erroneous result there as well (e.g. the 'µ' appears as 'Âµ'). However, elsewhere in the same data set there are records in which the same symbol is displayed correctly. This may be due to a bug in the system which collected the data in the first place. The real weirdness is that my current code seems to catch only the incorrect version, letting the correct one pass through.
In any case, I'm really stuck on how to proceed. I need to be able to come up with a series of regex substitutions which will catch both the correct and incorrect versions of these characters, but the identifier for the correct version is failing in this case.
I must admit, I'm still fairly junior to programming, and anything more than the most basic regex is still like black magic to me. This problem seems a bit more intractable than any I've had to tackle before, and that's why I bring it to here to get some more eyes on it.
Thanks!
If your input data is encoded as UTF-8, your code should work. Here’s a
complete program that works for me. It assumes the input is UTF-8 and
simply operates on the raw bytes, not converting to or from Unicode.
Note that I removed the + from the end of each input regex; that
would accept one or more of the last character, which you probably
didn’t intend.
import re

def stripChars(s):
    s = (re.sub(r'\xc2\xb5', 'micro', s))  # micro
    s = (re.sub(r'\xc2\xb1', '+/-', s))  # plus-or-minus
    return s

f_in = open('data')
f_out = open('output', 'w')
for line in f_in:
    print(type(line))
    line = stripChars(line)
    f_out.write(line)
If your data is encoded some other way (see for example this
question for how to tell), this version will be more useful. You can
specify any encoding for input and output. It decodes to internal
Unicode on reading, acts on that when replacing, then encodes on
writing.
import codecs
import re

encoding_in = 'iso8859-1'
encoding_out = 'ascii'

def stripChars(s):
    s = (re.sub(u'\u00B5', 'micro', s))  # micro
    s = (re.sub(u'\u00B1', '+/-', s))  # plus-or-minus
    return s

f_in = codecs.open('data-8859', 'r', encoding_in)
f_out = codecs.open('output', 'w', encoding_out)
for uline in f_in:
    uline = stripChars(uline)
    f_out.write(uline)
Note that it will raise an exception if it tries to write non-ASCII data
with an ASCII encoding. The easy way to avoid this is to just write
UTF-8, but then you may not notice uncaught characters. You can catch
the exception and do something graceful. Or you can let the program
crash and update it for the character(s) you’re missing.
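As a rough sketch of the "catch the exception" route (Python 3 assumed; the sample text and the ohm sign are invented for the demonstration), the exception object tells you which characters were missed, and an error handler such as backslashreplace keeps them visible in the ASCII output instead of crashing:

```python
# Sample text: micro sign, plus-minus sign, and an ohm sign that the
# replacement step below does not know about.
text = "5 \u00b5m \u00b1 1 \u2126"

replaced = text.replace("\u00b5", "micro").replace("\u00b1", "+/-")

try:
    replaced.encode("ascii")
    leftover = None
except UnicodeEncodeError as exc:
    # exc.start and exc.end delimit the characters that could not be encoded.
    leftover = replaced[exc.start:exc.end]

# Alternative: never raise, but make the uncaught characters easy to spot.
fallback = replaced.encode("ascii", errors="backslashreplace")
print(fallback)  # b'5 microm +/- 1 \\u2126'
```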
OK, since you are using Python 2, you read the file as byte strings, and your code should successfully translate all UTF-8 encoded versions of µ (U+00B5) or ± (U+00B1).
This is consistent with what you later say:
my current code only catches the incorrect version, letting the correct one pass through
This is in fact exactly what should happen. Let us look at µ in detail. µ is u'\u00b5'; it is encoded in UTF-8 as '\xc2\xb5' and in Latin1 or cp1252 as '\xb5'. Since 'Â' is U+00C2, its Latin1 or cp1252 code is 0xc2. That means a µ character correctly encoded in UTF-8 will display as Âµ on a Windows-1252 system. And where it looks correct, that is because it is not UTF-8 encoded but Latin1 encoded.
It looks like you are trying to process a file in which some parts are UTF-8 encoded while others are Latin1 (or cp1252) encoded. You really should try to fix that in the system that collects the data, because it can cause trouble that is hard to recover from.
The good news is that it can be fixed here, because you only want to process two non-ASCII characters: replace the UTF-8 version as you already do, and then replace the Latin1 version in a second pass. The code could be (no need for regexes here):
def stripChars(string):
    outString = string.replace('\xc2\xb5','micro') #Metric 'micro (10^-6)' (Greek 'mu') letter in utf-8
    outString = outString.replace('\xb5','micro') #Metric 'micro (10^-6)' (Greek 'mu') letter in Latin1
    outString = outString.replace('\xc2\xb1','+/-') #Scientific 'Plus-Minus' symbol in utf-8
    outString = outString.replace('\xb1','+/-') #Scientific 'Plus-Minus' symbol in Latin1
    return outString
For reference, Latin1 (a.k.a. ISO-8859-1) maps each byte value to the Unicode code point with the same number, so it covers exactly the characters below U+0100. Windows code page 1252 (cp1252 in Python) is a Windows variant of Latin1 in which some byte values normally unused in Latin1 are assigned to characters with higher code points. For example, € (U+20AC) is encoded as '\x80' in cp1252, while it does not exist at all in Latin1.
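For readers on Python 3, the same two-pass idea can be sketched on bytes objects. This is an illustration rather than a drop-in replacement: the name strip_chars and the sample data are assumptions, and the input is expected to come from a file opened in binary mode.

```python
def strip_chars(data: bytes) -> bytes:
    """Replace micro and plus-minus signs in UTF-8 or Latin-1 form."""
    replacements = [
        (b"\xc2\xb5", b"micro"),  # micro sign, UTF-8
        (b"\xb5", b"micro"),      # micro sign, Latin-1 / cp1252
        (b"\xc2\xb1", b"+/-"),    # plus-minus sign, UTF-8
        (b"\xb1", b"+/-"),        # plus-minus sign, Latin-1 / cp1252
    ]
    for old, new in replacements:
        data = data.replace(old, new)
    return data


# One record encoded as UTF-8 and one as Latin-1, as in the mixed data set.
mixed = "10 \u00b5m".encode("utf-8") + b" and " + "2 \u00b1 1".encode("latin-1")
cleaned = strip_chars(mixed)
print(cleaned)  # b'10 microm and 2 +/- 1'
```

The order matters: the two-byte UTF-8 forms have to be replaced first, otherwise the Latin-1 pass would consume the trailing byte and leave a stray \xc2 behind.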
I have a text file that is filled with Unicode characters written as "\ud83d\udca5", but Python doesn't seem to like them.
But if I replace one with u'\U0001f4a5', which seems to be its Python escape style (according to Charbase), it works.
Is there a way to convert them all into the u"\Uxxxxxxxx" escape format that Python can understand?
Thanks.
You're mixing up Unicode and encoded strings. u'\U0001f4a5' is a Unicode object, Python's internal datatype for handling strings. (In Python 3, the u is optional since now all strings are Unicode objects).
Files, on the other hand, use encodings. UTF-8 is the most common one, but it's just one means of storing a Unicode object in a byte-oriented file or stream. When opening such a file, you need to specify the encoding so Python can translate the bytes into meaningful Unicode objects.
In your case, it seems you need to open the file using the UTF-16 codec instead of UTF-8.
with open("myfile.txt", encoding="utf-16") as f:
    s = f.read()
will give you the proper contents if the codec is in fact UTF-16. If it doesn't look right, try "utf-16-le" or "utf-16-be".
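To see why the escapes in the question point towards UTF-16, the following small sketch (Python 3 assumed) encodes the character and prints its 16-bit code units. They are the same two values, d83d and dca5, that appear in the question; in UTF-8 the character looks quite different.

```python
boom = "\U0001f4a5"  # the character from the question

# Encode as big-endian UTF-16 and split the result into 16-bit code units.
utf16 = boom.encode("utf-16-be")
code_units = [utf16[i:i + 2].hex() for i in range(0, len(utf16), 2)]
print(code_units)  # ['d83d', 'dca5']

# The same character in UTF-8 is a single four-byte sequence.
utf8 = boom.encode("utf-8")
print(utf8)  # b'\xf0\x9f\x92\xa5'
```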
I'm on an OS X machine running Python 2.7. I'm trying to do an os.walk on an SMB share.
for root, dirnames, filenames in os.walk("./test"):
    for filename in filenames:
        print filename
        matchObj = re.match( r".*ö.*",filename,re.UNICODE)
If I use the above code, it works as long as the filenames do not contain umlauts.
In my shell the umlauts are printed fine, but when I copy them into a UTF-8 text editor (in my case Sublime), I get:
screenshot
Expected:
filename.jpeg
filename_ö.jpg
Of course the regex fails with that.
If I hardcode the filename like:
re.match( r".*ö.*",'filename_ö',re.UNICODE)
it works fine.
I tried:
os.walk(u"./test")
filename.decode('utf8')
but gives me:
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0308' in position 10: ordinal not in range(128)
u'\u0308' is the pair of dots above the umlaut.
I guess I'm overlooking something stupid?
Unicode characters can be represented in various forms; there's "ö", but then there's also the possibility to represent that same character using an "o" and separate combining diacritics. OS X generally prefers the separated variant, and your editor doesn't seem to handle that very gracefully, nor do these two separate characters match your regex.
You need to normalize your Unicode data if you require one way or the other in particular. See unicodedata.normalize. You want the NFC normalized form.
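A short sketch of that normalization step, written for Python 3 (the question uses Python 2.7, where the same unicodedata.normalize call applies to unicode objects). The filename is a made-up example.

```python
import re
import unicodedata

# 'o' followed by COMBINING DIAERESIS (U+0308), the decomposed form
# that OS X tends to return for filenames.
decomposed = "filename_o\u0308.jpg"

# NFC composes the pair into the single code point U+00F6.
composed = unicodedata.normalize("NFC", decomposed)

pattern = re.compile(".*\u00f6.*")
print(pattern.match(decomposed))  # None: the regex expects the composed form
print(pattern.match(composed) is not None)  # True
```

Normalizing every filename right after os.walk returns it keeps the rest of the script free of this distinction.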
There are several issues:
The screenshot, as @deceze explained, is due to Unicode normalization. Note: the two forms do not have to look different, e.g. ö (U+00F6) and ö (U+006F U+0308) look the same in my browser.
r".*ö.*" is a bytestring in Python 2 and the value depends on the encoding declaration at the top of your Python source file (something like: # -*- coding: utf-8 -*-) e.g., if the declared encoding is utf-8 then 'ö' bytestring is a sequence of two bytes: '\xc3\xb6'.
There is no way for the regex engine to know the actual encoding that should be used to interpret input bytestrings.
You should not use bytestrings to represent text; use Unicode instead (either use u'' literals or add from __future__ import unicode_literals at the top).
filename.decode('utf8') raises UnicodeEncodeError if you use os.walk(u"./test") because filename is Unicode already. Python 2 tries to encode filename implicitly using the default encoding that is 'ascii'. Do not decode Unicode: drop .decode('utf-8')
btw, the last two issues are impossible in Python 3: r".*ö.*" is a Unicode literal, and you can't create a bytestring with literal non-ascii characters there, and there is no .decode() method (you would get AttributeError if you try to decode Unicode). You could run your script on Python 3, to detect Unicode-related bugs.
I have a column in a spreadsheet whose header contains non-ASCII characters, thus:
'ï»¿Campaign'
If I pop this string into the interpreter, I get:
'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
The string is one of the keys in the rows of a csv.DictReader().
When I try to populate a new dict with the value of this key:
spends['ï»¿Campaign'] = 2
I get:
KeyError: '\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
If I print the value of the keys of row, I can see that it is '\xef\xbb\xbfCampaign'
Obviously then I can just update my program to access this key thus:
spends['\xef\xbb\xbfCampaign']
But is there a "better" way of doing this in Python? Indeed, if the value of this key ever changes to contain other non-ASCII characters, what is an all-encompassing way of handling any non-ASCII characters that may arise?
Your specific problem is the first three bytes of the file, "\xef\xbb\xbf". That's the UTF-8 encoding of the byte order mark, which is often prepended to text files to indicate that they're encoded using UTF-8. You should strip these bytes. See Removing BOM from gzip'ed CSV in Python.
Second, you're decoding with the wrong codec. "ï»¿" is what you get if you decode those bytes using the Windows-1252 character set. That's why the bytes look different if you use these characters in a source file. See the Python 2 Unicode howto.
In general, you should decode a bytestring into Unicode text using the corresponding character encoding as soon as possible on input. And, in reverse, encode Unicode text into a bytestring as late as possible on output. Some APIs such as io.open() can do it implicitly so that your code sees only Unicode.
Unfortunately, the csv module does not support Unicode directly on Python 2. See UnicodeReader and UnicodeWriter in the doc examples. You could create their analog for csv.DictReader or, as an alternative, just pass UTF-8 encoded bytestrings to the csv module.
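For comparison, on Python 3 (an assumption; the question above is about Python 2) the csv module works on text, so the BOM can be dropped while the file is being decoded. In this sketch the Spend column and the sample row are invented for illustration; only the Campaign header comes from the question.

```python
import csv
import os
import tempfile

# Demo file: a CSV export that starts with the UTF-8 BOM bytes.
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfCampaign,Spend\r\nSpring,2\r\n")

# 'utf-8-sig' removes the BOM during decoding; newline="" is what the
# csv documentation asks for when opening files for the csv module.
with open(path, encoding="utf-8-sig", newline="") as f:
    rows = list(csv.DictReader(f))
os.remove(path)

# The header key is now plain 'Campaign', with no BOM prefix.
spends = {row["Campaign"]: int(row["Spend"]) for row in rows}
print(spends)  # {'Spring': 2}
```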