From what I understand, when concatenating a string and Unicode string, Python will automatically decode the string based on the default encoding and convert to Unicode before concatenating.
I'm assuming something like this if default is 'ascii' (please correct if mistaken):
string -> ASCII hexadecimal bytes -> Unicode hexadecimal bytes -> Unicode string
Wouldn't it be easier, and raise fewer UnicodeDecodeErrors, if, for example, u'a' + 'Ӹ' were converted to u'a' + u'Ӹ' directly before concatenating? Why does the string need to be decoded first? Why does it matter whether the string contains non-ASCII characters if it will be converted to Unicode anyway?
Wouldn't it be easier, and raise fewer UnicodeDecodeErrors, if, for example, u'a' + 'Ӹ' were converted to u'a' + u'Ӹ' directly before concatenating?
It could probably do that for literals, but not for arbitrary byte strings at runtime. Imagine a byte string that contains a 'Ӹ' character: how could it be converted to u'Ӹ'? It has to be decoded, and decoding requires knowing which encoding the bytes are in.
Ӹ is Unicode codepoint U+04F8 CYRILLIC CAPITAL LETTER YERU WITH DIAERESIS. 'Ӹ' and u'Ӹ' are not encoded the same way (in fact, I can't even find an 8-bit encoding that supports U+04F8), so you can't simply change one into the other directly. A string has to be decoded from its source encoding (ASCII, ISO-8859-1, etc.) to an intermediary (ISO 10646, Unicode) that can then be represented in the target encoding (UTF-8, UTF-16, UTF-32, etc.).
Why does the string need to be decoded first?
Because the two values being concatenated need to be in the same representation, Unicode, before they can be concatenated.
Why does it matter if the string contains non-ASCII characters if it will be converted to Unicode anyway?
Because non-ASCII characters are represented differently in different encodings. Unicode is universal, but other encodings are not. And Python supports hundreds of encodings.
Take the Euro sign (€, Unicode codepoint U+20AC), for example. It does not exist in ASCII or in most ISO-8859-X encodings, but it is encoded as byte 0xA4 in ISO-8859-7, -15, and -16, and as byte 0x88 in Windows-1251. Meanwhile, 0xA4 represents different Unicode codepoints in other encodings: it is ¤ (U+00A4 CURRENCY SIGN) in ISO-8859-1, but Є (U+0404 CYRILLIC CAPITAL LETTER UKRAINIAN IE) in ISO-8859-5, and so on.
So how do you expect Python to convert 0xA4 to Unicode? Should it convert to U+00A4, U+0141, or U+20AC?
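A quick way to see the ambiguity is to decode the same byte under different codecs; here's a small Python 3 sketch (in Python 2, the same calls work on a str object):

```python
# The single byte 0xA4 maps to different characters depending on
# which encoding we claim the bytes are in.
raw = b'\xa4'

print(raw.decode('iso-8859-1'))   # ¤  (U+00A4 CURRENCY SIGN)
print(raw.decode('iso-8859-5'))   # Є  (U+0404 UKRAINIAN IE)
print(raw.decode('iso-8859-15'))  # €  (U+20AC EURO SIGN)
```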
So, string encoding matters!
See The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Related
I saved some strings in Microsoft Agenda in Unicode big-endian format (UTF-16BE). I opened the file with the shell command xxd to see the raw bytes and wrote them down. Then I got the value of each Unicode code point with ord(), the Python built-in that takes a one-character Unicode string and returns its code point value. Comparing them character by character, I found they were equal.
But I think a Unicode code point and its UTF-16BE representation are different things: one is a code point; the other is an encoding format. Some of them are equal, but maybe they differ for some characters.
Is the Unicode code point value equal to the UTF-16BE encoding representation for every character?
No, codepoints outside of the Basic Multilingual Plane use two UTF-16 words (so 4 bytes).
For codepoints in the U+0000 to U+D7FF and U+E000 to U+FFFF ranges, the codepoint and UTF-16 encoding map one-to-one.
For codepoints in the range U+10000 to U+10FFFF, two words in the range U+D800 to U+DFFF are used; a lead surrogate from 0xD800 to 0xDBFF and a trail surrogate from 0xDC00 to 0xDFFF.
See the UTF-16 Wikipedia article for the nitty-gritty details.
So most UTF-16 big-endian words, when printed, map directly to Unicode codepoints. For UTF-16 little-endian you just swap the bytes around. For UTF-16 words starting with a byte in the 0xD8 through 0xDF range, you'll have to combine the surrogate pair to recover the actual codepoint.
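In Python 3 you can check both cases directly; this minimal sketch encodes one BMP character and one supplementary character to UTF-16BE:

```python
# BMP character: the UTF-16BE bytes spell out the code point itself.
euro = '\u20ac'
print(euro.encode('utf-16-be').hex())  # 20ac, same as U+20AC

# Supplementary character: encoded as a surrogate pair instead.
emoji = '\U0001f600'
print(emoji.encode('utf-16-be').hex())  # d83dde00, not 0001f600
```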
I have a string that contains printable and unprintable characters, for instance:
'\xe8\x00\x00\x00\x00\x60\xfc\xe8\x89\x00\x00\x00\x60\x89'
What's the most "pythonesque" way to convert this to a bytes object in Python 3, i.e.:
b'\xe8\x00\x00\x00\x00`\xfc\xe8\x89\x00\x00\x00`\x89'
If all your codepoints are within the range U+0000 to U+00FF, you can encode to Latin-1:
inputstring.encode('latin1')
as the first 256 codepoints of Unicode map one-to-one to bytes in the Latin-1 standard.
This is by far the fastest method, but it won't work for any characters in the input string outside that range.
Basically, if you have a Unicode string that contains 'bytes' that should never have been decoded in the first place, encode it to Latin-1 to get the original bytes back.
Demo:
>>> '\xe8\x00\x00\x00\x00\x60\xfc\xe8\x89\x00\x00\x00\x60\x89'.encode('latin1')
b'\xe8\x00\x00\x00\x00`\xfc\xe8\x89\x00\x00\x00`\x89'
I've this string :
sig=45C482D2486105B02211ED4A0E3163A9F7095E81.4DDB3B3A13C77FE508DCFB7C6CC68957096A406C\u0026type=video%2F3gpp%3B+codecs%3D%22mp4v.20.3%2C+mp4a.40.2%22\u0026quality=small\u
0026itag=17\u0026url=http%3A%2F%2Fr6---sn-cx5h-itql.c.youtube.com%2Fvideoplayback%3Fsource%3Dyoutube%26mt%3D1367776467%26expire%3D1367797699%26itag%3D17%26factor%3D1.25%2
6upn%3DpkX9erXUHx4%26cp%3DU0hVTFdUVV9OU0NONV9PTllHOnhGdTVLUThqUWJW%26key%3Dyt1%26id%3Dab9b0e2f311eaf00%26mv%3Dm%26newshard%3Dyes%26ms%3Dau%26ip%3D49.205.30.138%26sparams%
3Dalgorithm%252Cburst%252Ccp%252Cfactor%252Cid%252Cip%252Cipbits%252Citag%252Csource%252Cupn%252Cexpire%26burst%3D40%26algorithm%3Dthrottle-factor%26ipbits%3D8%26fexp%3D9
17000%252C919366%252C916626%252C902533%252C932000%252C932004%252C906383%252C904479%252C901208%252C925714%252C929119%252C931202%252C900821%252C900823%252C912518%252C911416
%252C930807%252C919373%252C906836%252C926403%252C900824%252C912711%252C929606%252C910075%26sver%3D3\u0026fallback_host=tc.v19.cache2.c.youtube.com
As you can see, it contains both forms:
%xx. For example, %3, %2F etc.
\uxxxx. For example, \u0026
I need to convert both to their Unicode character representation. I'm using Python 3.3.1, and urllib.parse.unquote(s) converts only the %xx escapes to their character equivalents. It doesn't, however, handle the \uxxxx escapes; for example, \u0026 should become &.
How can I convert both of them?
Two options:
Choose to interpret it as JSON; that format uses the same escape codes. The input does need to have quotes around it to be seen as a string.
Encode to latin 1 (to preserve bytes), then decode with the unicode_escape codec:
>>> urllib.parse.unquote(sig).encode('latin1').decode('unicode_escape')
'45C482D2486105B02211ED4A0E3163A9F7095E81.4DDB3B3A13C77FE508DCFB7C6CC68957096A406C&type=video/3gpp;+codecs="mp4v.20.3,+mp4a.40.2"&quality=small&itag=17&url=http://r6---sn-cx5h-itql.c.youtube.com/videoplayback?source=youtube&mt=1367776467&expire=1367797699&itag=17&factor=1.25&upn=pkX9erXUHx4&cp=U0hVTFdUVV9OU0NONV9PTllHOnhGdTVLUThqUWJW&key=yt1&id=ab9b0e2f311eaf00&mv=m&newshard=yes&ms=au&ip=49.205.30.138&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&burst=40&algorithm=throttle-factor&ipbits=8&fexp=917000%2C919366%2C916626%2C902533%2C932000%2C932004%2C906383%2C904479%2C901208%2C925714%2C929119%2C931202%2C900821%2C900823%2C912518%2C911416%2C930807%2C919373%2C906836%2C926403%2C900824%2C912711%2C929606%2C910075&sver=3&fallback_host=tc.v19.cache2.c.youtube.com'
This interprets \u escape codes just as Python would when reading string literals in Python source code.
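The JSON route from the first option can be sketched like this, using a short made-up input rather than the full sig string:

```python
import json
import urllib.parse

# Hypothetical input mixing both escape forms.
raw = r'quality=small\u0026type=video%2F3gpp'

step1 = urllib.parse.unquote(raw)   # resolves the %xx escapes
step2 = json.loads('"%s"' % step1)  # resolves the \uXXXX escapes
print(step2)  # quality=small&type=video/3gpp
```

Note the input has to be wrapped in double quotes so json.loads sees it as a JSON string literal.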
If I'm guessing right, this is more or less a URL. The '%xx' encodes a single byte outside the allowed character set. The '\uxxxx' encodes a Unicode codepoint. I believe that it is normal for URLs to encode Unicode characters as UTF-8 and then to encode the bytes outside the allowed charset as '%xx' (which affects all multibyte UTF-8 sequences). This makes it surprising that there are '%xx'-encoded bytes already, because translating the Unicode codepoints will make the conversions irreversible.
Make sure you have tests and can verify the actual results, because this approach seems unsafe. At least, I don't fully understand the requirements here.
This may be a newbie question, but here it goes. I have a large string (167572 bytes) with both ASCII and non-ASCII characters. When I use len() on the string I get the wrong length. It seems that len() doesn't count 0x0A characters. The only way I can get the actual length of the string is with this code:
totalLen = 0
for x in test:
    totalLen += 1
for x in test:
    if x == '\x0a':
        totalLen += 1
print totalLen
What is wrong with len()? Or am I using it wrong?
You are confusing encoded byte strings with Unicode text. In UTF-8, for example, a single character can take up to 4 bytes to encode; in UTF-16, each character is encoded using at least 2 bytes.
A Python 2 str is a sequence of bytes; to get Unicode you have to decode the string with an appropriate codec. If your text is encoded using UTF-8, for example, you can decode it with:
test = test.decode('utf8')
On the other hand, data written to a file is always encoded, so a unicode string of length 10 could take up 20 bytes in a file, if written using the UTF-16 codec.
Most likely you are getting confused by such 'wider' characters, not by whether or not your \n (ASCII 10) characters are counted correctly.
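The Python 3 equivalent makes the distinction explicit, since text (str) and bytes are separate types; a small sketch:

```python
# Character count and byte count diverge as soon as the text
# contains anything outside ASCII.
text = 'na\u00efve\n'        # 'naïve' plus a newline
data = text.encode('utf-8')

print(len(text))  # 6 characters (the \n counts as one)
print(len(data))  # 7 bytes: 'ï' needs two bytes in UTF-8
```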
Please do yourself a favour and read up on Unicode and encodings:
Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The Python Unicode HOWTO.
Could it be that you're expecting it to contain \r\n, i.e. ASCII 13 (carriage return) followed by ASCII 10 (line feed), or that you look at the string once it's been written out to a text file, which adds these?
It's hard to be specific since you don't give a lot of detail, i.e. where the string's data comes from.
I need to test whether a string is Unicode, and then whether it's UTF-8. After that, I need the string's length in bytes, including the BOM, if it ever uses one. How can this be done in Python?
Also, for didactic purposes: what does a byte-list representation of a UTF-8 string look like? I am curious how a UTF-8 string is represented in Python.
Later edit: pprint does that pretty well.
try:
    string.decode('utf-8')
    print "string is UTF-8, length %d bytes" % len(string)
except UnicodeError:
    print "string is not UTF-8"
In Python 2, str is a sequence of bytes and unicode is a sequence of characters. You use str.decode to decode a byte sequence to unicode, and unicode.encode to encode a sequence of characters to str. So for example, u"é" is the unicode string containing the single character U+00E9 and can also be written u"\xe9"; encoding into UTF-8 gives the byte sequence "\xc3\xa9".
In Python 3, this is changed; bytes is a sequence of bytes and str is a sequence of characters.
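A short Python 3 sketch of the same round-trip:

```python
# In Python 3, str holds characters and bytes holds raw data.
s = '\u00e9'            # 'é', a single character
b = s.encode('utf-8')   # two bytes: b'\xc3\xa9'

print(len(s), len(b))   # 1 2
print(b.decode('utf-8') == s)  # True
```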
To Check if Unicode
>>> a = u'F'
>>> isinstance(a, unicode)
True
To Check if it is UTF-8 or ASCII
>>> import chardet
>>> encoding = chardet.detect('AA')
>>> encoding['encoding']
'ascii'
I would definitely recommend Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know about Unicode and Character Sets (No Excuses!), if you haven't already read it.
For Python's Unicode and encoding/decoding machinery, start here. To get the byte-length of a Unicode string encoded in utf-8, you could do:
print len(my_unicode_string.encode('utf-8'))
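If you also want the BOM counted, note that the utf-8-sig codec writes the 3-byte UTF-8 BOM while plain utf-8 does not; a Python 3 sketch:

```python
# utf-8-sig prepends the UTF-8 byte order mark (EF BB BF).
s = 'hi'
print(len(s.encode('utf-8')))      # 2 bytes, no BOM
print(len(s.encode('utf-8-sig')))  # 5 bytes: 3-byte BOM + 2
```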
Your question is tagged python-2.5, but be aware that this changes somewhat in Python 3+.