I have some strings that I am pasting into my script as test data. The strings come from emails that contain encoded characters, and they're throwing a SyntaxError. So far, I have not been able to find a solution to this issue. When I print repr(string), I get these strings:
'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
'Total Value for 2nd Load \xe2\x80\x93 approx. $74,300\n'
And this error pops up when I run my script:
SyntaxError: Non-ASCII character '\xe2' in file <filename> on line <line number>, but no
encoding declared; see http://www.python.org/peps/pep-0263.html
When I just print the lines containing the encoded characters I get this:
'Total Value for 2nd Load – approx. $74,300'
The data looks like this when I copy it from the email:
'Total Value for 1st Load – approx. $75,200'
'Total Value for 2nd Load – approx. $74,300'
From my searches, I believe the data is encoded with UTF-8, but I have no idea how to work with it, given that some characters appear encoded while most of them are not (maybe?). I have tried various "solutions" I have found so far, including adding # -*- coding: utf-8 -*- to the top of my script, but then the script just hangs... it doesn't do anything :(
If someone could provide some information on how to deal with this scenario, that would be amazing.
I have tried decoding and encoding using string.encode() and string.decode(), with different encodings based on what I could find on Google, but that hasn't solved the problem.
I would really prefer a Python solution, because the project I'm working on requires people to paste data into a text field in a GUI, and that data will then be processed. Other solutions suggested pasting the data into something like Word or Notepad, saving it as plain text, then doing another copy/paste back from that file. This is a bit much. Does anybody know of a pythonic way of dealing with this issue?
>>> msg = 'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
>>> print msg.decode("utf-8")
Total Value for 1st Load – approx. $75,200
Make sure you use something like IDLE that can support these characters (e.g. a DOS terminal probably will not!).
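Putting the pieces together as a minimal Python 2 sketch: the PEP 263 declaration is what lets the parser accept literals that contain real non-ASCII bytes (in this sketch the bytes are written as \x escapes, so it is shown only for completeness), and .decode("utf-8") turns the byte string into a unicode string with an actual en dash (U+2013). The declaration by itself should not make a script hang; that symptom most likely has a different cause.

# -*- coding: utf-8 -*-
# Python 2: the byte-string literal holds the UTF-8 bytes \xe2\x80\x93 (an en dash)
msg = 'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
text = msg.decode("utf-8")   # now a unicode object containing u'\u2013'
print text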
In my dataframe, I have a variable that should be a number but is currently recognized as a string, with a space after the thousands digit, for example: "5 948.5".
Before I can convert it to a float, I need to remove that space:
d = {'col1': [1,2], 'numbers': [' 4 856.4','5 000.5']}
data = pd.DataFrame(data=d)
data['numbers'] = data['numbers'].str.replace(" ", "")
This works perfectly.
But when I do the exact same thing to my series, nothing happens (no error message, but the spaces remain). Other manipulations to that series work normally.
Any idea of what I can try to understand and fix the problem on my series?
Thanks!
Edit:
I've loaded the data with
pd.read_csv("file.csv", encoding="ISO-8859-1")
Could that be responsible for the immovable spaces? If I don't do that, I get an error when loading: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 2: invalid continuation byte"
I have tried calling read_csv with encoding='latin1' and encoding='cp1252'; the problem remains.
Edit 1.b: it seems to be an issue with the encoding of the space character (thanks @Marat). I downloaded an Excel version of the data and tried to replace all spaces in that column with nothing. It did not work. Removing a few spaces manually did work (but the file is too large to do it this way).
Edit 2: sample data. It really looks like the example I gave above, which works... but it really doesn't. I know nobody can reproduce this on their computer; I am not asking for the solution, but rather for ideas of what could be wrong...
As requested, here is a copy-to-clipboard of my data:
GroupeRA,SecteurMA,StatutMA,TypeMA,fields_ancienMatricule,historizedFields_denomination,Annee,region,fields_codeComiteSubregional,arrondissement,fields_ins,adresse_commune,perequation
Crèches,MASS,Collectif,CREC,632100101,Le Bocage I,2017,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"0,00 "
Crèches,MASS,Collectif,CREC,632100101,Le Bocage I,2018,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"0,00 "
Crèches,MASS,Collectif,CREC,x,Le xyzI,2018,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"1 302,26 "
Crèches,MASS,Collectif,CREC,632100101,Le Bocage I,2018,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"687,56 "
Crèches,MASS,Collectif,CREC,632100101,xyz,2019,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"1 372,91 "
Edit 3: the data is in CSV (though, as mentioned in edit 1.b, I also got the data in XLS and have the same issue; even when opening the XLS directly, "find & replace all" cannot find the spaces, as if Excel did not read them as such).
I used DbVisualizer to extract the data from our database.
Thanks all for your help. It was indeed an issue with the "space" character, which was not a space like the one produced by my keyboard. It got solved with the following SQL expression when extracting the data:
[perequation]= CONVERT(MONEY,REPLACE(REPLACE(ds.computedFields_perequation, CHAR(160),''), ',', '.')),
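For reference, the same cleanup can also be done on the pandas side without changing the SQL extract. This is only a sketch based on the sample above: the column name perequation and the file name are taken from the question, and it assumes the stubborn character really is the non-breaking space U+00A0 (CHAR(160)).

import pandas as pd

data = pd.read_csv("file.csv", encoding="ISO-8859-1")

data['perequation'] = (
    data['perequation']
    .str.replace('\xa0', '', regex=False)   # non-breaking space (CHAR(160) / U+00A0)
    .str.replace(' ', '', regex=False)      # ordinary spaces
    .str.replace(',', '.', regex=False)     # decimal comma -> dot
    .astype(float)
)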
I'm trying to build a corpus from the .txt file found at this link.
I believe the instances of \xad are supposed to be 'soft hyphens', but they do not appear to be read correctly under UTF-8 encoding. I've tried encoding the .txt file as iso8859-15, using the code:
with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r',
          encoding='iso8859-15') as myfile:
    data = myfile.read().replace('\n', '')
data2 = data.split(' ')
This returns an array of 'words', but '\xad' remains attached to many entries in data2. I've tried
data_clean = data.replace('\\xad', '')
and
data_clean = data.replace('\\xad|\\xad\\xad','')
but this doesn't seem to remove the instances of '\xad'. Has anyone run into a similar problem before? Ideally I'd like to encode this data as UTF-8 to avail of the nltk library, but it won't read the file with UTF-8 encoding as I get the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xad in position 471: invalid start byte
Any help would be greatly appreciated!
Additional context: This is a recreational project with the aim of being able to generate stories based on the txt file. Everything I've generated thus far has been permeated with '\xad', which ruins the fun!
Your file almost certainly has actual U+00AD soft-hyphen characters in it.
These are characters that mark places where a word could be split when fitting lines to a page. The idea is that the soft hyphen is invisible if the word doesn't need to be split, but printed the same as a U+2010 normal hyphen if it does.
Since you don't care about rendering this text in a book with nicely flowing text, you're never going to hyphenate anything, so you just want to remove these characters.
The way to do this is not to fiddle with the encoding. Just remove them from the Unicode text, using whichever of these you find most readable:
data = data.replace('\xad', '')
data = data.replace('\u00ad', '')
data = data.replace('\N{SOFT HYPHEN}', '')
Notice the single backslash. We're not replacing a literal backslash, x, a, d, we're replacing a literal soft-hyphen character, that is, the character whose code point is hex 0xad.
You can either do this to the whole file before splitting into words, or do it once per word after splitting.
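Putting that together with the code from the question (keeping the iso8859-15 read, which at least maps the soft-hyphen byte to the soft-hyphen character), a minimal sketch:

with open('Harry Potter 3 - The Prisoner Of Azkaban.txt', 'r',
          encoding='iso8859-15') as myfile:
    data = myfile.read().replace('\n', '')

data = data.replace('\xad', '')   # strip the soft hyphens from the decoded text
data2 = data.split(' ')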
Meanwhile, you seem to be confused about what encodings are and what to do with them:
I've tried encoding the .txt file as iso8859-15
No, you've tried decoding the file as ISO-8859-15. It's not clear why you tried ISO-8859-15 in the first place. But, since the ISO-8859-15 encoding for the character '\xad' is the byte b'\xad', maybe that's correct.
Ideally I'd like to encode this data as UTF-8 to avail of the nltk library
But NLTK doesn't want UTF-8 bytes, it wants Unicode strings. You don't need to encode it for that.
Plus, you're not trying to encode your Unicode text to UTF-8, you're trying to decode your bytes from UTF-8. If that's not what those bytes are… if you're lucky, you'll get an error like this one; if not, you'll get mojibake that you don't notice until you've screwed up a 500GB corpus and thrown away the original data.[1]
[1] UTF-8 is specifically designed so you'll get early errors whenever possible. In this case, reading ISO-8859-15 text with soft hyphens as if it were UTF-8 raises exactly the error you're seeing, but reading UTF-8 text with soft hyphens as if it were ISO-8859-15 will silently succeed, leaving an extra 'Â' character before each soft hyphen. The error is usually more helpful.
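A quick demonstration of that footnote (Python 3):

text = 'soft\u00adhyphen'                 # a string containing a soft hyphen
utf8_bytes = text.encode('utf-8')         # U+00AD becomes the two bytes b'\xc2\xad'
print(utf8_bytes.decode('iso8859-15'))    # silently "succeeds", with a stray 'Â' before the (invisible) soft hyphen

iso_bytes = text.encode('iso8859-15')     # U+00AD becomes the single byte b'\xad'
iso_bytes.decode('utf-8')                 # UnicodeDecodeError: ... can't decode byte 0xad ... invalid start byte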
I have some CSV data of some users' tweets.
In Excel it is displayed like this:
‰ÛÏIt felt like they were my friends and I was living the story with them‰Û #retired #IAN1
I imported this CSV file into Python, and in Python the same tweet appears like this (I am using PuTTY to connect to a server and copied this from PuTTY's screen):
▒▒▒It felt like they were my friends and I was living the story with them▒ #retired #IAN1
I am wondering how to display these emoji characters properly. I am trying to separate all the words in this tweet but I am not sure how I can separate those emoji unicode characters.
In fact, you certainly have a loss of data…
I don't know how you got your CSV file from the users' tweets (you could explain that). But generally, CSV files are encoded in "cp1252" (or "windows-1252"), sometimes in "iso-8859-1". Nowadays, we can also find CSV files encoded in "utf-8".
If your tweets are encoded in "cp1252" or any other 8-bit single-byte character set, the emojis are lost (replaced by "?") or badly converted.
Then, if you open your CSV file in Excel, it will use its default encoding ("cp1252") and load the file with corrupted characters. You can try LibreOffice; it has a dialog box which lets you choose the encoding more easily.
The copy/paste from PuTTY will also convert your characters depending on your console encoding… which is even worse!
If your CSV file uses "utf-8" encoding (or "utf-16", "utf-32"), you have a better chance of preserving the emojis. But there is still a problem: most emojis have a code point greater than U+FFFF (65535 in decimal). For instance, Grinning Face "😀" has the code point U+1F600.
This kind of character is handled awkwardly in Python 2; try this:
# coding: utf8
from __future__ import unicode_literals
emoji = u"😀"
print(u"emoji: " + emoji)
print(u"repr: " + repr(emoji))
print(u"len: {}".format(len(emoji)))
You'll get (if your console allows it):
emoji: 😀
repr: u'\U0001f600'
len: 2
The first line won't print if your console doesn't allow Unicode,
The \U escape sequence is similar to the \u, but expects 8 hex digits, not 4.
Yes, this character has a length of 2!
EDIT: With Python 3, you get:
emoji: 😀
repr: '😀'
len: 1
No escape sequence for repr(),
the length is 1!
What you can do is post a fragment of your CSV file as an attachment; then someone could analyse it…
See also Unicode Literals in Python Source Code in the Python 2.7 documentation.
First of all, you shouldn't work with text copied from a console (let alone from a remote connection), because of formatting differences and how unreliable clipboards are. I'd suggest exporting your CSV and reading it directly.
I'm not quite sure what you are trying to do, but Twitter emojis cannot be displayed in a console since they are basically compressed images. Would you mind explaining your issue further?
I would personally treat the whole string as Unicode, separate each character into a list, then rebuild words based on spaces.
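A rough sketch of that idea (Python 3, assuming the CSV really is UTF-8; the file name tweets.csv and the column name text are placeholders):

import csv

with open("tweets.csv", encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        words = row["text"].split()   # str.split() keeps each emoji intact,
        print(words)                  # since in Python 3 an emoji is a single character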
I'm a beginner having trouble decoding several dozen CSV files with numbers + (Simplified) Chinese characters to UTF-8 in Python 2.7.
I do not know the encoding of the input files, so I have tried all the possible encodings I am aware of: GB18030, UTF-7, UTF-8, UTF-16 and UTF-32 (LE & BE). Also, for good measure, GBK and GB2312, though these should be a subset of GB18030. The UTF ones all stop when they get to the first Chinese character. The other encodings stop somewhere in the first line, except GB18030. I thought this would be the solution because it read through the first few files and decoded them fine. Part of my code, reading line by line, is:
line = line.decode("GB18030")
The first 2 files I tried to decode worked fine. Midway through the third file, Python spits out
UnicodeDecodeError: 'gb18030' codec can't decode bytes in position 168-169: illegal multibyte sequence
In this file, there are about 5 such errors in about a million lines.
I opened the input file in a text editor and checked which characters were giving the decoding errors; the first few all had Euro signs in a particular column of the CSV files. I am fairly confident these are typos, so I would just like to delete the Euro characters. I want to examine the types of encoding errors one by one; I'd like to get rid of all the Euro errors but don't want to simply ignore the others until I've looked at them first.
Edit: I used chardet which gave GB2312 as the encoding with .99 confidence for all files. I tried using GB2312 to decode which gave:
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 108-109: illegal multibyte sequence
""" ... GB18030. I thought this would be the solution because it read through the first few files and decoded them fine.""" -- please explain what you mean. To me, there are TWO criteria for a successful decoding: firstly that raw_bytes.decode('some_encoding') didn't fail, secondly that the resultant unicode when displayed makes sense in a particular language. Every file in the universe will pass the first test when decoded with latin1 aka iso_8859_1. Many files in East Asian languages pass the first test with gb18030, because mostly the frequently used characters in Chinese, Japanese, and Korean are encoded using the same blocks of two-byte sequences. How much of the second test have you done?
Don't muck about looking at the data in an IDE or text editor. Look at it in a web browser; they usually make a better job of detecting encodings.
How do you know that it's a Euro character? By looking at the screen of a text editor that's decoding the raw bytes using what encoding? cp1252?
How do you know it contains Chinese characters? Are you sure it's not Japanese? Korean? Where did you get it from?
Chinese files created in Hong Kong, Taiwan, maybe Macao, and other places off the mainland use big5 or big5_hkscs encoding -- try that.
In any case, take Mark's advice and point chardet at it; chardet usually makes a reasonably good job of detecting the encoding used if the file is large enough and correctly encoded Chinese/Japanese/Korean -- however if someone has been hand editing the file in a text editor using a single-byte charset, a few illegal characters may cause the encoding used for the other 99.9% of the characters not to be detected.
You may like to do print repr(line) on say 5 lines from the file and edit the output into your question.
If the file is not confidential, you may like to make it available for download.
Was the file created on Windows? How are you reading it in Python? (show code)
Update after OP comments:
Notepad etc don't attempt to guess the encoding; "ANSI" is the default. You have to tell it what to do. What you are calling the Euro character is the raw byte "\x80" decoded by your editor using the default encoding for your environment -- the usual suspect being "cp1252". Don't use such an editor to edit your file.
Earlier you were talking about the "first few errors". Now you say you have 5 errors total. Please explain.
If the file is indeed almost correct gb18030, you should be able to decode the file line by line, and when you get such an error, trap it, print the error message, extract the byte offsets from the message, print repr(two_bad_bytes), and keep going. I'm very interested in which of the two bytes the \x80 appears. If it doesn't appear at all, the "Euro character" is not part of your problem. Note that \x80 can appear validly in a gb18030 file, but only as the 2nd byte of a 2-byte sequence starting with \x81 to \xfe.
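A rough sketch of that trap-and-continue loop (Python 2; the file name is a placeholder):

f = open("input.csv", "rb")
for lineno, raw in enumerate(f, 1):
    try:
        text = raw.decode("gb18030")
    except UnicodeDecodeError as e:
        # e.start and e.end are the byte offsets quoted in the error message
        print lineno, e
        print repr(raw[e.start:e.end])
f.close()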
It's a good idea to know what your problem is before you try to fix it. Trying to fix it by bashing it about with Notepad etc in "ANSI" mode is not a good idea.
You have been very coy about how you decided that the results of gb18030 decoding made sense. In particular I would be closely scrutinising the lines where gbk fails but gb18030 "works" -- there must be some extremely rare Chinese characters in there, or maybe some non-Chinese non-ASCII characters ...
Here's a suggestion for a better way to inspect the damage: decode each file with raw_bytes.decode(encoding, 'replace') and write the result (encoded in utf8) to another file. Count the errors by result.count(u'\ufffd'). View the output file with whatever you used to decide that the gb18030 decoding made sense. The U+FFFD character should show up as a white question mark inside a black diamond.
If you decide that the undecodable pieces can be discarded, the easiest way is raw_bytes.decode(encoding, 'ignore')
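And a sketch of that damage-inspection pass (Python 2 again; file names are placeholders):

raw_bytes = open("input.csv", "rb").read()
text = raw_bytes.decode("gb18030", "replace")   # undecodable bytes become U+FFFD
print "replacement characters:", text.count(u"\ufffd")
out = open("check_gb18030.txt", "wb")
out.write(text.encode("utf-8"))                 # view this file to judge whether the decoding makes sense
out.close()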
Update after further information
All those \\ are confusing. It appears that "getting the bytes" involves repr(repr(bytes)) instead of just repr(bytes) ... at the interactive prompt, do either bytes (you'll get an implicit repr()), or print repr(bytes) (which won't get the implicit repr()).
The blank space: I presume that you mean that '\xf8\xf8'.decode('gb18030') is what you interpret as some kind of full-width space, and that the interpretation is done by visual inspection using some unnameable viewer software. Is that correct?
Actually, '\xf8\xf8'.decode('gb18030') -> u'\ue28b'. U+E28B is in the Unicode PUA (Private Use Area). The "blank space" presumably means that the viewer software unsurprisingly doesn't have a glyph for U+E28B in the font it is using.
Perhaps the source of the files is deliberately using the PUA for characters that are not in standard gb18030, or for annotation, or for transmitting pseudosecret info. If so, you will need to resort to the decoding tambourine, an offshoot of recent Russian research reported here.
Alternative: the big5-HKSCS theory. According to the HK government, HKSCS big5 code FE57 was once mapped to U+E28B but is now mapped to U+28804.
The "euro": You said """Due to the data I can't share the whole line, but what I was calling the euro char is in: \xcb\xbe\x80\x80" [I'm assuming a \ was omitted from the start of that, and the " is literal]. The "euro character", when it appears, is always in the same column that I don't need, so I was hoping to just use "ignore". Unfortunately, since the "euro char" is right next to quotes in the file, sometimes "ignore" gets rid of both the euro character as well [as] quotes, which poses a problem for the csv module to determine columns"""
It would help enormously if you could show the patterns of where these \x80 bytes appear in relation to the quotes and the Chinese characters -- keep it readable by just showing the hex, and hide your confidential data e.g. by using C1 C2 to represent "two bytes which I am sure represent a Chinese character". For example:
C1 C2 C1 C2 cb be 80 80 22 # `\x22` is the quote character
Please supply examples of (1) where the " is not lost by 'replace' or 'ignore' (2) where the quote is lost. In your sole example to date, the " is not lost:
>>> '\xcb\xbe\x80\x80\x22'.decode('gb18030', 'ignore')
u'\u53f8"'
And the offer to send you some debugging code (see example output below) is still open.
>>> import decode_debug as de
>>> def logger(s):
... sys.stderr.write('*** ' + s + '\n')
...
>>> import sys
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'replace', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8\ufffd\ufffd"'
>>> de.decode_debug('\xcb\xbe\x80\x80\x22', 'gb18030', 'ignore', logger)
*** input[2:5] ('\x80\x80"') doesn't start with a plausible code sequence
*** input[3:5] ('\x80"') doesn't start with a plausible code sequence
u'\u53f8"'
>>>
Eureka: -- Probable cause of sometimes losing the quote character --
It appears there is a bug in the gb18030 decoder replace/ignore mechanism: \x80 is not a valid gb18030 lead byte; when it is detected the decoder should attempt to resync with the NEXT byte. However it seems to be ignoring both the \x80 AND the following byte:
>>> '\x80abcd'.decode('gb18030', 'replace')
u'\ufffdbcd' # the 'a' is lost
>>> de.decode_debug('\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffdabcd'
>>> '\x80\x80abcd'.decode('gb18030', 'replace')
u'\ufffdabcd' # the second '\x80' is lost
>>> de.decode_debug('\x80\x80abcd', 'gb18030', 'replace', logger)
*** input[0:4] ('\x80\x80ab') doesn't start with a plausible code sequence
*** input[1:5] ('\x80abc') doesn't start with a plausible code sequence
u'\ufffd\ufffdabcd'
>>>
You might try chardet.
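For example (a minimal sketch; the file name is a placeholder):

import chardet

raw = open("input.csv", "rb").read()
print chardet.detect(raw)   # e.g. {'encoding': 'GB2312', 'confidence': 0.99}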
Try this:
codecs.open(file, encoding='gb18030', errors='replace')
Don't forget the errors parameter; you can also set it to 'ignore'.
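A usage sketch (Python 2), iterating over already-decoded unicode lines; process is a placeholder for your own handling:

import codecs

f = codecs.open("input.csv", encoding="gb18030", errors="replace")
for line in f:
    process(line)   # each line is a unicode object; bad bytes show up as U+FFFD
f.close()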
The user entered the word
éclair
into the search box.
Showing results 1 - 10 of about 140 for �air.
Why does it show the weird question mark?
I'm using Django to display it:
Showing results 1 - 10 of about 140 for {{query|safe}}
It's an encoding problem. Most likely your form or the output page is not UTF-8 encoded.
This article is very good reading on the issue: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
You need to check the encoding of
the HTML page where the user input the word
the HTML page you are using to output the word
the multi-byte ability of the functions you use to work with the string (though that probably isn't a problem in Python)
If the search is going to apply to a data base, you will need to check the encoding of the database connection, as well as the encoding of your tables and columns.
This is the result when you interpret data that is not encoded in UTF-8 as UTF-8 encoded.
The decoder takes the first byte of your 'é' (a single byte in an encoding such as Latin-1 or cp1252) as the lead byte of a three-byte UTF-8 sequence, consumes the following bytes, but can't decode the sequence (probably an invalid byte sequence). For this case the REPLACEMENT CHARACTER � (U+FFFD) is shown.
So in your case you just need to make sure your data really is encoded as UTF-8.
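A small demonstration of the mismatch (Python 3; exactly how many characters the replacement swallows depends on the decoder doing the interpreting, but the � comes from the same root cause):

s = "éclair"
latin1_bytes = s.encode("latin-1")   # b'\xe9clair' -- what a non-UTF-8 page would submit
try:
    latin1_bytes.decode("utf-8")     # interpreting those bytes as UTF-8 fails
except UnicodeDecodeError as e:
    print(e)                         # 'utf-8' codec can't decode byte 0xe9 ...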
You are serving the page with the wrong character encoding (charset). Check that you are using the same encoding throughout your whole application (for example UTF-8). This includes:
HTTP headers from web server (Content-Type: text/html;charset=utf-8)
Communication with the database (e.g. SET NAMES 'utf8' for MySQL)
It would also be good to check your browser encoding setting.
I second the responses above. Some other things off the top of my head:
If you're using e.g. a MySQL database, then it could be good to create your database using:
CREATE DATABASE x CHARACTER SET UTF8
You can also check this: http://docs.djangoproject.com/en/dev/ref/settings/#default-charset
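And in Django itself the relevant setting looks like this in settings.py (it already defaults to UTF-8, so it is mostly worth checking rather than changing):

# settings.py
DEFAULT_CHARSET = 'utf-8'   # charset used for HttpResponse when none is specified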