Decoding a byte with latin-1 characters to string with decimal representation

Decoding a byte with latin-1 characters to string with decimal representation - python

I am working on a migration project to upgrade a layer of web server from python 2.7.8 to python 3.6.3 and I have hit a roadblock for some special cases.
When a request is received from a client, payload is transmitted locally using pyzmq which now interacts in bytes in python3 instead of str (as it is in python2).
Now, the payload which I am receiving is encoded using iso-8859-1 (latin-1) scheme and I can easily convert it into string as payload.decode('latin-1') and pass it to next service (svc-save-entity) which expects string argument.
However, the subsequent service 'svc-save-entity' expects latin-1 chars (if present) to be represented in ASCII Character Reference (such as é for é) rather than in Hex (such as \xe9 for é).
I am struggling to find an efficient way to achieve this conversion. Can any python expert guide me here? Essentially I need the definition of a function say decode_tostring():
payload = b'Banco Santander (M\xe9xico)' #payload is in bytes
payload_str = decode_tostring(payload) #function to convert into string
payload_str == 'Banco Santander (México)' #payload_str is a string in ASCII Character Reference
Definition of decode_tostring() please. :)

The encode() and decode() methods accept a parameter called errors which allows you to specify how characters which are not representable in the specified encoding should be handled. The one you're looking for is XML numeric character reference replacement, which is fortunately one of the standard handlers provided in the codecs module.
Now, it's a little complex to actually do the replacement the way you want it, because the operation of replacing non-ASCII characters with their corresponding XML numeric character references happens during encoding, not decoding. After all, encoding is the process that takes in characters and emits bytes, so it's only during encoding that you can tell whether you have a character that is not part of ASCII. The cleanest way I can think of at the moment to get the transformation you want is to decode, re-encode, and re-decode, applying the XML entity reference replacement during the encoding step.
def decode_tostring(payload):
return payload.decode('latin-1').encode('ascii', errors='xmlcharrefreplace').decode('ascii')
I wouldn't be surprised if there is a method somewhere out there that will replace all non-ASCII characters in a string with their XML numeric character refs and give you back a string, and if so, you could use it to replace the encoding and the second decoding. But I don't know of one. The closest I found at the moment was xml.sax.saxutils.escape(), but that only acts on certain specific characters.
This isn't really relevant to your main question, but I did want to clarify one thing: the numeric entities like é are a feature of SGML, HTML, and XML, which are markup languages - a way to represent structured data as text. They have nothing to do with ASCII. A character encoding like ASCII is nothing more than a table of some characters and some byte sequences such that each character in the table is mapped to one byte sequence in the table and vice versa, with a few constraints to make the mapping unambiguous.
If you have a string with characters that are not in a particular encoding's table, you can't encode the string using that encoding. But what you can do is convert the string into a new string by replacing the characters which aren't in the table with sequences of characters that are in the table, and then encode the new string. There are many ways to do the replacement, of which XML numeric entity references are one example. Some of the other error handlers in Python's codecs module represent other approaches to this replacement.

Related

Python 2.7 convert special characters into utf-8 byes

I have strings that I need to replace into an URL for accessing different JSON files. My problem is that some strings have special characters and I need only these as UTF-8 bytes, so I can properly find the JSON tables.
An example:
# I have this string
a = 'code - Brasilândia'
#in the JSON url it appears as
'code%20-%20Brasil%C3%A2ndia'
I managed to get the spaces converted right using urllib.quote(), but it does not convert the special characters as I need them.
print(urllib.quote('code - Brasilândia))
'code%20-%20Brasil%83ndia'
When I substitute this in the URL, I cannot reach the JSON table.
I managed to make this work using u before the string, u'code - Brasilândia', but this did not solve my issue, because the string will ultimately be a user input, and will need to be constantly changed.
I have tried several methods, but I could not get the result I need.
I'm specifically using python 2.7 for this project, and I cannot change it.
Any ideas?

You could try decoding the string as UTF-8, and if it fails, assume that it's Latin-1, or whichever 8-bit encoding you expect.
try:
yourstring.decode('utf-8')
except UnicodeDecodeError:
yourstring = yourstring.decode('latin-1').encode('utf-8')
print(urllib.quote(yourstring))
... provided you can establish the correct encoding; 0x83 seems to correspond to â only in some fairly obscure legacy encodings like code pages 437 and 850 (and those are the least obscure). See also https://tripleee.github.io/8bit/#83
(disclosure: the linked site is mine).
Demo: https://ideone.com/fjX15c

Converting Python 3 bytes object to string when bytes object apparently only contains characters

I'm new to Python 3 and it seems that I can't quite completely grasp unicode and character encoding.
I'm working with the output of another tool that returns the content of an html page as a bytes object. Other tools we use need this output to be in bytes type, but, I'd like to convert the bytes output to a string for some parsing and comparison to other strings. For cases that I'm interested in, printing the output bytes object shows only characters and no \x or \u binary. I'm a little confused on how best to do this and why the methods that create the desired output, actually do work.
I've read elsewhere that .decode() should be used in this context and this does work, but I don't understand why I am decoding an object that is already characters. From what I understand, decoding is intended for binary numbers, for example:
>>> b'\x41'.decode('utf-8')
'A'
In my understanding, all I really want to do is tell Python that an object that's been labeled as a bytes type object is actually a str object. Simply using the str() function on the bytes object also accomplishes this goal, but adds the "b" prefix and adds quotations around the string.
Here are the two solutions I'm working with:
>>> str(b'htmltext')
"b'htmltext'"
>>> b'htmltext'.decode('utf-8')
'htmltext'
Essentially, either of these solutions appears to achieve what I'm looking for, but the decode() obviously seems cleaner and, from what I've read, the recommended method. I'm wondering why decode() works, given that, apparently, I'm not converting binary numbers to characters. Furthermore, is there any reason other than the unappealing "b" and quotation marks in the output that str() would not be a valid solution here?

Don't confuse the developer-friendly representation of the bytes object with the data that is contained in it. You have binary data either way.
The developer representation makes it easy for you to see what is contained by showing anything that just happens to be a valid ASCII codepoint as that ASCII character, rather than the \xhh escape code. It's just easier to read text encoded as ASCII that way, and a lot of the world's text happens to be ASCII encoded.
You'll have a harder time when the data is not within the ASCII range however:
>>> 'Åæøéï'.encode('utf8')
b'\xc3\x85\xc3\xa6\xc3\xb8\xc3\xa9\xc3\xaf'
That's a UTF-8 byte sequence encoding text with accents. The above may be a little bit contrived, but most non-English text will include some non-ASCII text. Even English text can contain em-dashes or fancy quotes, and the b'...' bytes version of that is not nearly as readable as the properly decoded text version:
>>> '“Kragerø” is a town in Norway – in the province of Vestfold'.encode('utf8')
b'\xe2\x80\x9cKrager\xc3\xb8\xe2\x80\x9d is a town in Norway \xe2\x80\x93 in the province of Vestfold'
Note that the b'....' output is the result of using the repr() function on a bytes object; that calls the object.__repr__() method, which has the explicit function of producing a developer-friendly string for you. There is no dedicated object.__str__() method on a bytes object, the __repr__ method is called instead, even when you use the str() function. The proper way to convert a bytes value to a string is to decode (using the correct codec for the data).
Of course, when you have binary data that represents something else, like, say, image data, then keep it as bytes. There is no text to decode there.

Reading unicode characters from file/sqlite database and using it in Python

I have a list of variables with unicode characters, some of them for chemicals like Ozone gas: like 'O\u2083'. All of them are stored in a sqlite database which is read in a Python code to produce O3. However, when I read I get 'O\\u2083'. The sqlite database is created using an csv file that contains the string 'O\u2083' among others. I understand that \u2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be \,u,2,0,8,3). Is there any way to recognize unicode characters in this context? Now my first option to solve it is to create a function to recognize set of characters and replace for unicode characters. Is there anything like this already implemented?

SQLite allows you to read/write Unicode text directly. u'O\u2083' is two characters u'O' and u'\u2083' (your question has a typo: 'u\2083' != '\u2083').
I understand that u\2083 is not being stored in sqlite database as unicode character but as 6 unicode characters (which would be u,\,2,0,8,3)
Don't confuse u'u\2083' and u'\u2083': the latter is a single character while the former is 4-character sequence: u'u', u'\x10' ('\20' is interpreted as octal in Python), u'8', u'3'.
If you save a single Unicode character u'\u2083' into a SQLite database; it is stored as a single Unicode character (the internal representation of Unicode inside the database is irrelevant as long as the abstraction holds).
On Python 2, if there is no from __future__ import unicode_literals at the top of the module then 'abc' string literal creates a bytestring instead of a Unicode string -- in that case both 'u\2083' and '\u2083' are sequences of bytes, not text characters (\uxxxx is not recognized as a unicode escape sequence inside bytestrings).

If you have a byte string (length 7), decode the Unicode escape.
>>> s = 'O\u2083'
>>> len(s)
7
>>> s
'O\\u2083'
>>> print(s)
O\u2083
>>> u = s.decode('unicode-escape')
>>> len(u)
2
>>> u
u'O\u2083'
>>> print(u)
O₃
Caveat: Your console/IDE used to print the character needs to use an encoding that supports the character or you'll get a UnicodeEncodeError when printing. The font must support the symbol as well.

It's important to remember everything is bytes. To pull bytes into something useful to you, you kind of have to know what encoding is used when you pull in data. There are too many ambiguous cases to determine encoding by analyzing the data. When you send data out of your program, it's all back out to bytes again. Depending on whether you're using Python 2.x or 3.x you'll have a very different experience with Unicode and Python.
You can, however attempt encoding and simply do a "replace" on errors. For example the_string.encode("utf-8","replace") will try to encode as utf-8 and will replace problems with a ? -- You could also anticipate problem characters and replace them beforehand, but that gets unmanageable quickly. Take a look at codecs classes for more replacement options.

Replace numeric character references in XML document using Python

I am struggling with the following issue: I have an XML string that contains the following tag and I want to convert this, using cElementTree, to a valid XML document:
<tag>#55296;#57136;#55296;#57149;#55296;#57139;#55296;#57136;#55296;#57151;#55296;
#57154;#55296;#57136;</tag>
but each # sign is preceded by a & sign and hence the output looks like: 𐌰𐌽𐌳𐌰𐌿𐍂𐌰
This is a unicode string and the encoding is UTF-8. I want to discard these numeric character references because they are not legal XML in a valid XML document (see Parser error using Perl XML::DOM module, "reference to invalid character number")
I have tried different regular expression to match these numeric character references. For example, I have tried the following (Python) regex:
RE_NUMERIC_CHARACTER = re.compile('&#[\d{1,5}]+;')
This does work in regular python session but as soon as I use the same regex in my code then it doesn't work, presumably because those numeric characters have been interpreted (and are shown as boxes or question marks).
I have also tried the unescape function from http://effbot.org/zone/re-sub.htm but that does not work either.
Thus: how can I match, using a regular expression in Python, these numeric character references and create a valid XML document?

Eurgh. You've got surrogates (UTF-16 code units in the range D800-DFFF), which some fool has incorrectly encoded individually instead of using a pair of code units for a single character. It would be ideal to replace this mess with what it should look like:
<tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>
Or, just as valid, in literal characters (if you've got a font that can display the Gothic alphabet):
<tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>
Usually, it would be best to do replacement operations like this on parsed text nodes, to avoid messing up non-character-reference sequences in other places like comments or PIs. However of course that's not possible in this case since this isn't really XML at all. You could try to fix it up with a crude regex, though it would be better to find out where the invalid input is coming from and kick the person responsible until they fix it.
>>> def lenient_deccharref(m):
... return unichr(int(m.group(1)))
...
>>> tag= '<tag>𐌰𐌽𐌳𐌰𐌿𐍂𐌰</tag>'
>>> re.sub('&#(\d+);', lenient_deccharref, tag).encode('utf-8')
'<tag>\xf0\x90\x8c\xb0\xf0\x90\x8c\xbd\xf0\x90\x8c\xb3\xf0\x90\x8c\xb0\xf0\x90\x8c\xbf\xf0\x90\x8d\x82\xf0\x90\x8c\xb0</tag>'
This is the correct UTF-8 encoding of 𐌰𐌽𐌳𐌰𐌿𐍂𐌰. The utf-8 codec allows you to encode a sequence of surrogates to correct UTF-8 even on a wide-Unicode platform where the surrogates should not have appeared in the string in the first place.
>>> _.decode('utf-8')
u'<tag>\U00010330\U0001033d\U00010333\U00010330\U0001033f\U00010342\U00010330</tag>'

Sending a List through an URL

I have a list that I need to send through a URL to a third party vendor. I don't know what language they are using.
The list prints out like this:
[u'1', u'6', u'5']
I know that the u encodes the string in utf-8 right? So a couple of questions.
Can I send a list through a URL?
Will the u's show up on the other end when going through the URL?
If so, how do I remove them?
I am not sure what keywords to search to help me out, so any resources would be helpful too.

Can I send a list through a URL?
No. A URL is just text. If you want a way to package structured information in it, you'll have to agree that with the provider you're talking to.
One standard encoding for structure in URLs, that might or might not be what you need, is the use of multiple parameters with the same name in a query string. This format comes from HTML form submissions:
http://www.example.com/script?par=1&par=6&par=5
might be considered to represent a parameter par with a three-item list as its value. Or maybe not, it's up to the receiver to decide. For example in a PHP application you would have had to name the parameter par[] to get it to accept the array value.
I know that the u encodes the string in utf-8 right?
No. a u'...' string is a native Unicode string, where each index represents a whole character and not a byte in any particular encoding. If you want UTF-8 bytes, write u'...'.encode('utf-8') before URL-encoding. UTF-8 is a good default choice, but again: what encoding the receiving side wants is up to that application.
Will the u's show up on the other end when going through the URL?
u is part of the literal representation of the string, just the same as the ' quotes themselves. They are not part of the string value and would not be echoed by print or when joined into other strings, unless you deliberately asked for the literal representation by calling repr.

u'' is not utf-8, its python unicode strings for python 2.x
To send it through url, you need to encode them with utf8 like .encode('utf-8'), and also need to urlencode, and list cannot send it through URL, you need to make it as string.
Basically, you need to do it in following steps
python list -> unicode string -> utf8 string -> url encode -> send it through proper urllib api

Incorrect. unicode literals use Python's internal encoding, decided when it was compiled.
You can't send anything "through" URLs. Pick a protocol instead. And encode before sending, probably to UTF-8.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.