Python Unicode Handling Errors - How To Simply Remove Unicode

Python Unicode Handling Errors - How To Simply Remove Unicode - python

There are literally dozens, maybe even hundreds of questions on this site about unicode handling errors with python. Here is an example of what I am talking about:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 2310: ordinal not in range(128)
A great many of the questions indicate that the OP just wants the offending content to GO AWAY. The responses they receive are uniformly full of mumbo jumbo about codecs, character sets, and all sorts of things that do not address this one basic question:
"I am trying to process a text file with some unicode in it, I could not possibly care any less what this stuff is, it is just noise in the context of the problem I am trying to solve."
So, I have a file with a zillion JSON encoded tweets, I have no interest at all in these special characters, I want them REMOVED from the line.
fh = open('file-full-of-unicode.txt')
for line in fh:
print zap_unicode(line)
Given a variable called 'line', how do I simply print it minus any unicode it might contain?
There, I have repeated the question in several different fashions so it can not be misconstrued - unicode is junk in the context of what I am trying to do, I want to convert it to something innocuous or simply remove it entirely. How is this most easily accomplished?

You can do line.decode('ascii', 'ignore'). This will decode as ASCII everything that it can, ignoring any errors.
However, if you do this, prepare for pain. Unicode exists for a reason. Throwing away parts of your data without even knowing what you're throwing away will almost always cause problems down the road. It's like saying "I don't care if some of the gizmos inside my car engine don't work, I just want to drive it. Just take out anything that doesn't work!" That's all well and good until it explodes.

The problem can't "go away", the code that can't handle unicodes properly must be fixed. There is no trivial way to do this, but the simplest way is to decode the bytes on input to make sure that the application is using text everywhere.

as #BrenBarn suggested the result will be spurious english words...
my suggestion is why not consider removing entire word which contains a unicode character
i.e a word is space+characters+space if characters has a unicode character then remove all character after previous space and before next space.. the result will be much better than removing just unicode characters..
if you just want to remove names (#username) and tags (#tags) you can use a filter ([#,#,..]) instead of brute forcibly searching for all unicode characters..
I have low reputation so can't comment, so made it as an answer.. :-(

Related

Python process a file that contains strange characters

I have a strange text file that I am required to replace any social security number with XXX-XX-XXXX. Great! Simply suck the file in, regex that junk out, and write the file out. Loving life, this will be easy as pie. My acceptance criteria is that I can only change the SSNs the rest of the file must stay exactly the same since it has fixed width columns and even strange characters must be kept for debugging other processes. OK, cool, I got this.
I read the file in:
filehandle = open("text.txt", "r", encoding="UTF-8")
And it gives me some encoding errors like this:
'utf-8' codec can't decode byte 0xd1 in position 6919: invalid continuation byte
I can't figure out the encoding. I've tried chardet and it thinks it's ASCII but I just get a different encoding error. I just need a way to suck this file in, do a simple regex and put it back out. I can put in:
errors="ignore"
And it won't crash but ends up stripping out some of the strange characters which then throws the spacing of the columns off. Here is an example of one of the characters I'm talking about with it's hex (need to use images since I can't copy/paste it here):
The 4E is the 'N' in CHILDREN
The EF BF BD make up the .. stuff
The 53 is the S in CHILDREN
I'm sure this is part of the problem. So, what should I do to simply:
Take the file in, use a regex to simply change \d{3}-\d{2}-\d{4} to XXX-XX-XXXX where the file has some weird characters in it without changing anything else in the file? Thank you all!

You should open your file in binary mode and avoid processing Unicode decoding of UTF-8.
Then use a bytes regular expression to find the social security numbers and replace the found places with relevant bytes.

Converting a weird data type to Str

I apologize in advance as I am not sure how to ask this! Okay so I am attempting to use a twitter API within Python. Here is the snippet of code giving me issues:
trends = twitter.Api.GetTrendsCurrent(api)
print str(trends)
This returns:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-5: ordinal not in range(128)
When I attempt to .encode, the interpreter tells me I cannot encode a Trend object. How do I get around this?

Simple answer:
Use repr, not str. It should always, always work (unless the API itself is broken and that is where the error is being thrown from).
Long answer:
By default, when you cast a Unicode string to a byte str (and vice versa) in Python 2, it will use the ascii encoding by default for the conversion process. This works most of the time, but not always. Thus, nasty edge cases like this are a pain. One of the big reasons for the break in backwards compatibility in Python 3 was to change this behavior.
Use latin1 for testing. It may not be the correct encoding, but it will always (always, always, always) work and give you a jumping off point for debugging this properly so you at least can print something.
trends = twitter.Api.GetTrendsCurrent(api)
print type(trends)
print unicode(trends)
print unicode(trends).encode('latin1')
Or, better yet, when encoding force it to ignore or replace errors:
trends = twitter.Api.GetTrendsCurrent(api)
print type(trends)
print unicode(trends)
print unicode(trends).encode('utf8', 'xmlcharrefreplace')
Chances are, since you are dealing with a web based API, you are dealing with UTF-8 data anyway; it is pretty much the default encoding across the board on the web.

Python decoding of back quotations

I am receiving this issue
" UnicodeEncodeError: 'latin-1' codec can't encode character u'\u201d' "
I'm quite new to working with databases as a whole. Previously, I had been using SQLite3; however, now transitioning/migrating to MySQL, I noticed u'\u201d' and u'\u201c' characters were within some of my text data.
I'm currently making a python script to tackle the migration; however, I'm getting stuck with this codec issue that I previously didn't for see.
So my question is, how do I replace/decode these values so that I can actually store them in MySQL DB?

You don't have a problem decoding these characters; wherever they're coming from, if they're showing up as \u201d (”) and \u201c (“), they're already being properly decoded.
The problem is encoding these characters. If you want to store your strings in Latin-1 columns, they can only contain the 256 characters that exist in Latin-1, and these two are not among them.
So my question is, how do I replace/decode these values so that I can actually store them in MySQL DB?
The obvious solution is to use UTF-8 columns instead of Latin-1 in MySQL. Then this problem wouldn't even exist; any Unicode string can be encoded as UTF-8.
But assuming you can't do that for some reason…
Python comes with built-in support for different error handlers that can help you do something with these characters while encoding them. You just have to decide what "something" that is.
Let's say your string looks like hey “hey” hey. Here's what each error handler would do with it:
s.encode('latin-1', 'ignore'): hey hey hey
s.encode('latin-1', 'replace'): hey ?hey? hey
s.encode('latin-1', 'xmlcharrefreplace'):hey “hey” hey`
s.encode('latin-1', 'backslashreplace'):hey \u201chey\u201d hey`
The first two have the advantage of being somewhat readable, but the disadvantage that you can never recover the original string. If you want that, but want something even more readable, you may want to consider a third-party library like unidecode:
unidecode('hey “hey” hey').encode('latin-1'):hey "hey" hey`
The last two are lossless, but kind of ugly. Although in some contexts they'll look pretty nice—e.g., if you're building an XML document, xmlcharrefreplace (maybe even with 'ascii' instead of 'latin-1') will give you exactly what you want in an XML viewer. There are special-purpose translators for various other use cases (like HTML references, or XML named entities instead of numbered, etc.) if you know what you want.
But in general, you have to make the choice between throwing away information, or "hiding" it in some ugly but recoverable form.

django + unicode constant errors

I built a django site last year that utilises both a dashboard and an API for a client.
They are, on occasion, putting unicode information (usually via a Microsoft keyboard and a single quote character!) into the database.
It's fine to change this one instance for everything, but what I constantly get is something like this error when a new character is added that I haven't "converted":
UnicodeDecodeError at /xx/xxxxx/api/xxx.json
'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
The issue is actually that I need to be able to convert this unicode (from the model) into HTML.
# if a char breaks the system, replace it here (duplicate line)
text = unicode(str(text).replace('\xa3', '£'))
I duplicate this line here, but it just breaks otherwise.
Tearing my hair out because I know this is straight forward and I'm doing something remarkably silly somewhere.
Have searched elsewhere and realised that while my issue is not new, I can't find the answer elsewhere.

I assume that text is unicode (which seems a safe assumption, as \xa3 is the unicode for the £ character).
I'm not sure why you need to encode it at all, seeing as the text will be converted to utf-8 on output in the template, and all browsers are perfectly capable of displaying that. There is likely another point further down the line where something (probably your code, unfortunately) is assuming ASCII, and the implicit conversion is breaking things.
In that case, you could just do this:
text = text.encode('ascii', 'xmlcharrefreplace')
which converts the non-ASCII characters into HTML/XML entities like £.

Tell the JSON-decoder that it shall decode the json-file as unicode. When using the json module directly, this can be done using this code:
json.JSONDecoder(encoding='utf8').decode(
json.JSONEncoder(encoding='utf8').encode('blä'))
If the JSON decoding takes place via some other modules (django, ...) maybe you can pass the information through this other module into the json stuff.

Encoding error in Python with Chinese characters

I'm a beginner having trouble decoding several dozen CSV file with numbers + (Simplified) Chinese characters to UTF-8 in Python 2.7.
I do not know the encoding of the input files so I have tried all the possible encodings I am aware of -- GB18030, UTF-7, UTF-8, UTF-16 & UTF-32 (LE & BE). Also, for good measure, GBK and GB3212, though these should be a subset of GB18030. The UTF ones all stop when they get to the first Chinese characters. The other encodings stop somewhere in the first line except GB18030. I thought this would be the solution because it read through the first few files and decoded them fine. Part of my code, reading line by line, is:
line = line.decode("GB18030")
The first 2 files I tried to decode worked fine. Midway through the third file, Python spits out
UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 168-169: illegal multibyte sequence
In this file, there are about 5 such errors in about a million lines.
I opened the input file in a text editor and checked which characters were giving the decoding errors, and the first few all had Euro signs in a particular column of the CSV files. I am fairly confident these are typos, so I would just like to delete the Euro characters. I would like to examine types of encoding errors one by one; I would like to get rid of all the Euro errors but do not want to just ignore others until I look at them first.
Edit: I used chardet which gave GB2312 as the encoding with .99 confidence for all files. I tried using GB2312 to decode which gave:
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 108-109: illegal multibyte sequence

""" ... GB18030. I thought this would be the solution because it read through the first few files and decoded them fine.""" -- please explain what you mean. To me, there are TWO criteria for a successful decoding: firstly that raw_bytes.decode('some_encoding') didn't fail, secondly that the resultant unicode when displayed makes sense in a particular language. Every file in the universe will pass the first test when decoded with latin1 aka iso_8859_1. Many files in East Asian languages pass the first test with gb18030, because mostly the frequently used characters in Chinese, Japanese, and Korean are encoded using the same blocks of two-byte sequences. How much of the second test have you done?
Don't muck about looking at the data in an IDE or text editor. Look at it in a web browser; they usually make a better job of detecting encodings.
How do you know that it's a Euro character? By looking at the screen of a text editor that's decoding the raw bytes using what encoding? cp1252?
How do you know it contains Chinese characters? Are you sure it's not Japanese? Korean? Where did you get it from?
Chinese files created in Hong Kong, Taiwan, maybe Macao, and other places off the mainland use big5 or big5_hkscs encoding -- try that.
In any case, take Mark's advice and point chardet at it; chardet usually makes a reasonably good job of detecting the encoding used if the file is large enough and correctly encoded Chinese/Japanese/Korean -- however if someone has been hand editing the file in a text editor using a single-byte charset, a few illegal characters may cause the encoding used for the other 99.9% of the characters not to be detected.
You may like to do print repr(line) on say 5 lines from the file and edit the output into your question.
If the file is not confidential, you may like to make it available for download.
Was the file created on Windows? How are you reading it in Python? (show code)
Update after OP comments:
Notepad etc don't attempt to guess the encoding; "ANSI" is the default. You have to tell it what to do. What you are calling the Euro character is the raw byte "\x80" decoded by your editor using the default encoding for your environment -- the usual suspect being "cp1252". Don't use such an editor to edit your file.
Earlier you were talking about the "first few errors". Now you say you have 5 errors total. Please explain.
If the file is indeed almost correct gb18030, you should be able to decode the file line by line, and when you get such an error, trap it, print the error message, extract the byte offsets from the message, print repr(two_bad_bytes), and keep going. I'm very interested in which of the two bytes the \x80 appears. If it doesn't appear at all, the "Euro character" is not part of your problem. Note that \x80 can appear validly in a gb18030 file, but only as the 2nd byte of a 2-byte sequence starting with \x81 to \xfe.
It's a good idea to know what your problem is before you try to fix it. Trying to fix it by bashing it about with Notepad etc in "ANSI" mode is not a good idea.
You have been very coy about how you decided that the results of gb18030 decoding made sense. In particular I would be closely scrutinising the lines where gbk fails but gb18030 "works" -- there must be some extremely rare Chinese characters in there, or maybe some non-Chinese non-ASCII characters ...
Here's a suggestion for a better way to inspect the damage: decode each file with raw_bytes.decode(encoding, 'replace') and write the result (encoded in utf8) to another file. Count the errors by result.count(u'\ufffd'). View the output file with whatever you used to decide that the gb18030 decoding made sense. The U+FFFD character should show up as a white question mark inside a black diamond.
If you decide that the undecodable pieces can be discarded, the easiest way is raw_bytes.decode(encoding, 'ignore')
Update after further information
All those \\ are confusing. It appears that "getting the bytes" involves repr(repr(bytes)) instead of just repr(bytes) ... at the interactive prompt, do either bytes (you'll get an implict repr()), or print repr(bytes) (which won't get the implicit repr())
The blank space: I presume that you mean that '\xf8\xf8'.decode('gb18030') is what you interpret as some kind of full-width space, and that the interpretation is done by visual inspection using some unnameable viewer software. Is that correct?
Actually, '\xf8\xf8'.decode('gb18030') -> u'\e28b'. U+E28B is in the Unicode PUA (Private Use Area). The "blank space" presumably means that the viewer software unsuprisingly doesn't have a glyph for U+E28B in the font it is using.
Perhaps the source of the files is deliberately using the PUA for characters that are not in standard gb18030, or for annotation, or for transmitting pseudosecret info. If so, you will need to resort to the decoding tambourine, an offshoot of recent Russian research reported here.
Alternative: the cp939-HKSCS theory. According to the HK government, HKSCS big5 code FE57 was once mapped to U+E28B but is now mapped to U+28804.
The "euro": You said """Due to the data I can't share the whole line, but what I was calling the euro char is in: \xcb\xbe\x80\x80" [I'm assuming a \ was omitted from the start of that, and the " is literal]. The "euro character", when it appears, is always in the same column that I don't need, so I was hoping to just use "ignore". Unfortunately, since the "euro char" is right next to quotes in the file, sometimes "ignore" gets rid of both the euro character as well [as] quotes, which poses a problem for the csv module to determine columns"""
It would help enormously if you could show the patterns of where these \x80 bytes appear in relation to the quotes and the Chinese characters -- keep it readable by just showing the hex, and hide your confidential data e.g. by using C1 C2 to represent "two bytes which I am sure represent a Chinese character". For example:
C1 C2 C1 C2 cb be 80 80 22 # `\x22` is the quote character
Please supply examples of (1) where the " is not lost by 'replace' or 'ignore' (2) where the quote is lost. In your sole example to date, the " is not lost:
>>> '\xcb\xbe\x80\x80\x22'.decode('gb18030', 'ignore')
u'\u53f8"'
And the offer to send you some debugging code (see example output below) is still open.
>>> import decode_debug as de
>>> def logger(s):
... sys.stderr.write('*** ' + s + '\n')
...
>>> import sys
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'replace', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8\ufffd\ufffd"'
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'ignore', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8"'
>>>
Eureka: -- Probable cause of sometimes losing the quote character --
It appears there is a bug in the gb18030 decoder replace/ignore mechanism: \x80 is not a valid gb18030 lead byte; when it is detected the decoder should attempt to resync with the NEXT byte. However it seems to be ignoring both the \x80 AND the following byte:
>>> '\x80abcd'.decode('gb18030', 'replace')
u'\ufffdbcd' # the 'a' is lost
>>> de.decode_debug('\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffdabcd'
>>> '\x80\x80abcd'.decode('gb18030', 'replace')
u'\ufffdabcd' # the second '\x80' is lost
>>> de.decode_debug('\x80\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80\x80ab') doesn't start with a plausible code sequence
*** input[1:5] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffd\ufffdabcd'
>>>

You might try chardet.

Try this:
codecs.open(file, encoding='gb18030', errors='replace')
Don't forget the parameter errors, you can also set it to 'ignore'.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.