Converting Unicode codepoints into Unicode character using Python 3.3.1

Converting Unicode codepoints into Unicode character using Python 3.3.1 - python

I've this string :
sig=45C482D2486105B02211ED4A0E3163A9F7095E81.4DDB3B3A13C77FE508DCFB7C6CC68957096A406C\u0026type=video%2F3gpp%3B+codecs%3D%22mp4v.20.3%2C+mp4a.40.2%22\u0026quality=small\u
0026itag=17\u0026url=http%3A%2F%2Fr6---sn-cx5h-itql.c.youtube.com%2Fvideoplayback%3Fsource%3Dyoutube%26mt%3D1367776467%26expire%3D1367797699%26itag%3D17%26factor%3D1.25%2
6upn%3DpkX9erXUHx4%26cp%3DU0hVTFdUVV9OU0NONV9PTllHOnhGdTVLUThqUWJW%26key%3Dyt1%26id%3Dab9b0e2f311eaf00%26mv%3Dm%26newshard%3Dyes%26ms%3Dau%26ip%3D49.205.30.138%26sparams%
3Dalgorithm%252Cburst%252Ccp%252Cfactor%252Cid%252Cip%252Cipbits%252Citag%252Csource%252Cupn%252Cexpire%26burst%3D40%26algorithm%3Dthrottle-factor%26ipbits%3D8%26fexp%3D9
17000%252C919366%252C916626%252C902533%252C932000%252C932004%252C906383%252C904479%252C901208%252C925714%252C929119%252C931202%252C900821%252C900823%252C912518%252C911416
%252C930807%252C919373%252C906836%252C926403%252C900824%252C912711%252C929606%252C910075%26sver%3D3\u0026fallback_host=tc.v19.cache2.c.youtube.com
As you can see it contains the both forms:
%xx. For example, %3, %2F etc.
\uxxxx. For example, \u0026
I need to convert them to their unicode character representation. I'm using Python 3.3.1, and urllib.parse.unquote(s) converts only %xx to their unicode character representation. It doesn't, however, convert \uxxxx to their unicode character representation. For example, \u0026 should convert into &.
How can I convert both of them?

Two options:
Choose to interpret it as JSON; that format uses the same escape codes. The input does need to have quotes around it to be seen as a string.
Encode to latin 1 (to preserve bytes), then decode with the unicode_escape codec:
>>> urllib.parse.unquote(sig).encode('latin1').decode('unicode_escape')
'45C482D2486105B02211ED4A0E3163A9F7095E81.4DDB3B3A13C77FE508DCFB7C6CC68957096A406C&type=video/3gpp;+codecs="mp4v.20.3,+mp4a.40.2"&quality=small&itag=17&url=http://r6---sn-cx5h-itql.c.youtube.com/videoplayback?source=youtube&mt=1367776467&expire=1367797699&itag=17&factor=1.25&upn=pkX9erXUHx4&cp=U0hVTFdUVV9OU0NONV9PTllHOnhGdTVLUThqUWJW&key=yt1&id=ab9b0e2f311eaf00&mv=m&newshard=yes&ms=au&ip=49.205.30.138&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&burst=40&algorithm=throttle-factor&ipbits=8&fexp=917000%2C919366%2C916626%2C902533%2C932000%2C932004%2C906383%2C904479%2C901208%2C925714%2C929119%2C931202%2C900821%2C900823%2C912518%2C911416%2C930807%2C919373%2C906836%2C926403%2C900824%2C912711%2C929606%2C910075&sver=3&fallback_host=tc.v19.cache2.c.youtube.com'
This interprets \u escape codes just like it Python would do when reading string literals in Python source code.

If I'm guessing right, this is more or less a URL. The '%xx' encodes a single byte outside the allowed character set. The '\uxxxx' encodes a Unicode codepoint. I believe that it is normal for URLs to encode Unicode characters as UTF-8 and then to encode the bytes outside the allowed charset as '%xx' (which affects all multibyte UTF-8 sequences). This makes it surprising that there are '%xx'-encoded bytes already, because translating the Unicode codepoints will make the conversions irreversible.
Make sure you have tests and that you can verify the actual results, because this seems like it was unsafe. At least I don't fully understand the requirements here.

Related

Python string encode and decode

Encoding in JS means converting a string with special characters to escaped usable string. like : encodeURIComponent would convert spaces to %20 etc to be usable in URIs.
So encoding here means converting to a particular format.
In Python 2.7, I have a string : 奥多比. To convert it into UTF-8 format, however, I need to use decode() function.
Like: "奥多比".decode("utf-8") == u'\u5965\u591a\u6bd4'
I want to understand how the meaning of encode and decode is changing with language. To me essentially I should be doing "奥多比".encode("utf-8")
What am I missing here.

You appear to be confusing Unicode text (represented in Python 2 as the unicode type, indicated by the u prefix on the literal syntax), with one of the standard Unicode encodings, UTF-8.
You are not creating UTF-8, you created a Unicode text object, by decoding from a UTF-8 byte stream.
The byte string literal `"奥多比"' is a sequence of binary data, bytes. You either entered these in a text editor and saved the file as UTF-8 (and told Python to treat your source code as UTF-8 by starting the file with a PEP 263 codec header), or you typed it into the Python interactive prompt in a terminal that was configured to send UTF-8 data.
I strongly urge you to read more about the difference between bytes, codecs and Unicode text. The following links are highly recommended:
Ned Batchelder's Pragmatic Unicode
The Python Unicode HOWTO
Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

In Python v2, it's type str, i.e. sequence of bytes. To convert it to a Unicode string, you need to decode this sequence of bytes using a codec. Simply said, it specifies how should bytes be converted to a sequence of Unicode code points. Look into Unicode HOWTO for more in-depth article on this.

encoding issue. Replace special character

I have a dictionary that looks like this:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Now I'm trying to replace the \xf6 with ö ,
but trying .replace('\xf6', 'ö') returns an error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position
0: ordinal not in range(128)
How can I fix this?

Now encoding is a mine field, and I might be off on this one - please correct me if that's the case.
From what I've gathered over the years is that Python2 assumes ASCII unless you defined a encoding at the top of your script. Mainly because either it's compiled that way or the OS/Terminal uses ASCII as it's primary encoding.
With that said, what you see in your example data:
{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}
Is the ASCII representation of a unicode string. Some how Python needs to tell you there's an ö in there - but it can't with ASCII because ö has no representation in the ASCII table.
But when you try to replace it using:
x.replace('\xf6', 'ö')
You're trying to find a ASCII character/string called \xf6 that is outside of the accepted bytes ranges of ASCII, so that will raise an exception. And you're trying to replace it with another invalid ASCII character and that will cause the same exception.
Hence why you get the "'ascii' codec can't decode byte...' message.
You can do unicode replacements like this:
a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')
This will tell Python to find a unicode string, and replace it with another unicode string.
But the output data will result in the same thing in the example above, because \xf6 is ö in unicode.
What you want to do, is encode your string into something you want to use, for instance - UTF-8:
a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'
And define UTF-8 as your primary encoding by placing this at the top of your code:
#!/usr/bin/python
# -*- coding: UTF-8
This should in theory make your application a little easier to work with.
And you can from then on work with UTF-8 as your base model.
But there's no way that I know of, to convert your representation into a ASCII ö, because there really isn't such a thing. There's just different ways Python will do this encoding magic for you to make you believe it's possible to "just write ö".
In Python3 most of the strings you encounter will either be bytes data or treated a bit differently from Python2. And for the most part it's a lot easier.
There's numerous ways to change the encoding that is not part of the standard praxis. But there are ways to do it.
The closest to "good" praxis, would be the locale:
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
I also had a horrendous solution and approach to this years back, it looked something like this (it was a great bodge for me at the time):
Python - Encoding string - Swedish Letters
tl;dr:
Your code usually assume/use ASCII as it's encoder/decoder.
ö is not a part of ASCII, there for you'll always see \xf6 if you've some how gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown as a Ö because of automatic encoding, if you need to verify if input matches that string, you have to compare them u'ö' == u'ö', if other systems depend on this data, encode it with something they understand .encode('UTF-8'). But replacing \xf6 with ö is the same thing, just that ö doesn't exist in ASCII and you need to do u'ö' - which, will result in the same data at the end.

As you are using German language, you should be aware of non ascii characters. You know whether your system prefers Latin1 (Windows console and some Unixes), UTF8 (most Linux variants), or native unicode (Windows GUI).
If you can process everything as native unicode things are cleaner and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independant of the python source file charset.
If you have to output byte strings of store them in files, you should encode them in UTF8 (can process any unicode character but characters of code above 127 use more than 1 byte) or Latin1 (one byte per character, but only supports unicode code point below 256)
In that case just use an explicit encoding to convert your unicode strings to byte strings:
print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')
should give what you expect.

Concatenation of unicode and byte strings

From what I understand, when concatenating a string and Unicode string, Python will automatically decode the string based on the default encoding and convert to Unicode before concatenating.
I'm assuming something like this if default is 'ascii' (please correct if mistaken):
string -> ASCII hexadecimal bytes -> Unicode hexadecimal bytes -> Unicode string
Wouldn't it be easier and raise less UnicodeDetectionError if, for example, u'a' + 'Ӹ' is converted to u'a' + u'Ӹ' directly before concatenating? Why does the string need to be decoded first? Why does it matter if the string contains non-ASCII characters if it will be converted to Unicode anyway?

Wouldn't it be easier and raise less UnicodeDetectionError if, for example, u'a' + 'Ӹ' is converted to u'a' + u'Ӹ' directly before concatenating?
It could probably do that with literals, but not string characters at runtime. Imagine a string that contains a 'Ӹ' character. How do you think it can be converted to u'Ӹ' in Unicode? IT HAS TO BE DECODED!
Ӹ is Unicode codepoint U+04F8 CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS. 'Ӹ' and u'Ӹ' are not encoded the same way (in fact, I can't even find an 8bit encoding that supports U+04F8), so you can't simply change one to the other directly. A string has to be decoded from its source encoding (ASCII, ISO-8859-1, etc) to an intermediary (ISO 10646, Unicode) that can be represented in the target encoding (UTF-8, UTF-16, UTF-32, etc).
Why does the string need to be decoded first?
Because the two values being concatenated need to be in the same encoding before they can be concatented.
Why does it matter if the string contains non-ASCII characters if it will be converted to Unicode anyway?
Because non-ASCII characters are represented differently in different encodings. Unicode is universal, but other encodings are not. And Python supports hundreds of encodings.
Take the Euro sign (€, Unicode codepoint U+20AC), for example. It does not exist in ASCII and most ISO-8859-X encodings, but it is encoded as byte 0xA4 in ISO-8859-7, -15, and -16, but as byte 0x88 in Windows-1251. But 0xA4 represents different Unicode codepoints in other encodings. It is ¤ (U+00A4 CURRENCY SIGN) in ISO-8859-1, but is Ł (U+0141 CAPITAL LETTER L WITH STROKE) in ISO-8859-2, etc.
So how do you expect Python to convert 0xA4 to Unicode? Should it convert to U+00A4, U+0141, or U+20AC?
So, string encoding matters!
See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Why don't python interpreter use the file coding format for decoding?

The code bellow will cause an UnicodeDecodeError:
#-*- coding:utf-8 -*-
s="中文"
u=u"123"
u=s+u
I know it's because python interpreter is using ascii to decode s.
Why don't python interpreter use the file format(utf-8) for decoding?

Implicit decoding cannot know what source encoding was used. That information is not stored with strings.
All that Python has after importing is a byte string with characters representing bytes in the range 0-255. You could have imported that string from another module, or read it from a file object, etc. The fact that the parser knew what encoding was used for those bytes doesn't even matter for plain byte strings.
As such, it is always better to decode bytes explicitly, rather than rely on the implicit decoding. Either make use a Unicode literal for s as well, or explicitly decode using str.decode()
u = s.decode('utf8') + u

The types of the 2 strings are different - the first is a normal string, second is a unicode string, hence the error.
So, instead of doing s="中文", do as following to get unicode strings for both:
s=u"中文"
u=u"123"
u=s+u

The code works perfectly fine on Python 3.
However, in Python 2, if you do not add a u before a string literal, you are constructing a string of bytes. When one wants to combine a string of bytes and a string of characters, one either has to decode the string of bytes, or encode the string of characters. Python 2.x opted for the former. In order to prevent accidents (for example, someone appending binary data to a user input and thus generating garbage), the Python developers chose ascii as the encoding for that conversion.
You can add a line
from __future__ import unicode_literals
after the #coding declaration so that literals without u or b prefixes are always character and not byte literals.

I don't understand encode and decode in Python (2.7.3)

I tried to understand by myself encode and decode in Python but nothing is really clear for me.
str.encode([encoding,[errors]])
str.decode([encoding,[errors]])
First, I don't understand the need of the "encoding" parameter in these two functions.
What is the output of each function, its encoding? What is the use of the "encoding" parameter in each function? I don't really understand the definition of "bytes string".
I have an important question, is there some way to pass from one encoding to another?
I have read some text on ASN.1 about "octet string", so I wondered whether it was the same as "bytes string".
Thanks for you help.

It's a little more complex in Python 2 (compared to Python 3), since it conflates the concepts of 'string' and 'bytestring' quite a bit, but see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Essentially, what you need to understand is that 'string' and 'character' are abstract concepts that can't be directly represented by a computer. A bytestring is a raw stream of bytes straight from disk (or that can be written straight from disk). encode goes from abstract to concrete (you give it preferably a unicode string, and it gives you back a byte string); decode goes the opposite way.
The encoding is the rule that says 'a' should be represented by the byte 0x61 and 'α' by the two-byte sequence 0xc0\xb1.

My presentation from PyCon, Pragmatic Unicode, or, How Do I Stop The Pain covers all of these details.
Briefly, Unicode strings are sequences of integers called code points, and bytestrings are sequences of bytes. An encoding is a way to represent Unicode code points as a series of bytes. So unicode_string.encode(enc) will return the byte string of the Unicode string encoded with "enc", and byte_string.decode(enc) will return the Unicode string created by decoding the byte string with "enc".

Python 2.x has two types of strings:
str = "byte strings" = a sequence of octets. These are used for both "legacy" character encodings (such as windows-1252 or IBM437) and for raw binary data (such as struct.pack output).
unicode = "Unicode strings" = a sequence of UTF-16 or UTF-32 depending on how Python is built.
This model was changed for Python 3.x:
2.x unicode became 3.x str (and the u prefix was dropped from the literals).
A bytes type was introduced for representing binary data.
A character encoding is a mapping between Unicode strings and byte strings. To convert a Unicode string, to a byte string, use the encode method:
>>> u'\u20AC'.encode('UTF-8')
'\xe2\x82\xac'
To convert the other way, use the decode method:
>>> '\xE2\x82\xAC'.decode('UTF-8')
u'\u20ac'

Yes, a byte string is an octet string. Encoding and decoding happens when inputting / outputting text (from/to the console, files, the network, ...). Your console may use UTF-8 internally, your web server serves latin-1, and certain file formats need strange encodings like Bibtex's accents: fran\c{c}aise. You need to convert from/to them on input/output.
The {en|de}code methods do this. They are often called behind the scenes (for example, print "hello world" encodes the string to whatever your terminal uses).

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.