How to access a string with mixed 1-byte and 2-byte symbols? - python

I read questions and answers for a quiz from a file in UTF-8 encoding, but an answer can consist of 1-byte symbols (English) and 2-byte symbols (Russian) in the same text:
"best car тайота"
I need to show the answer replaced with "*" so it looks like "**** *** ******", to help guess what the answer is. For determining the length I use
len(answer.decode('utf-8'))
But for the next hint, when I want to show some of the symbols, like "b*s* ca* *а*от*", I can access the 1-byte symbols via answer[index], but I can't read the 2-byte symbols this way, and that's why I get "b*s* ca*" without the 2-byte symbols.
Is there a solution for this?

Decode the string to a Unicode value once, and do your replacements in that.
A unicode string object supports the same operations as byte strings; just be careful when mixing byte strings and Unicode strings, as that could trigger an automatic encode or decode (leading to UnicodeEncodeError or UnicodeDecodeError exceptions). Printing a Unicode string should automatically encode the value to match your terminal codec.
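A minimal sketch (Python 2, assuming a UTF-8 file and terminal; the every-third-character reveal rule is made up for illustration):

# -*- coding: utf-8 -*-
answer = "best car тайота"         # byte string as read from the UTF-8 file
uanswer = answer.decode('utf-8')   # decode once; indexing is now per character

# Fully masked hint: one '*' per character, keeping the spaces
masked = u''.join(u' ' if c == u' ' else u'*' for c in uanswer)

# Partially revealed hint: show every third character
hint = u''.join(c if i % 3 == 0 or c == u' ' else u'*'
                for i, c in enumerate(uanswer))

print masked   # **** *** ******
print hint     # b**t *a* т**о**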
You may want to read up on Python and Unicode:
Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO

Python string encode and decode

Encoding in JS means converting a string with special characters into an escaped, usable string; for example, encodeURIComponent converts spaces to %20 etc. so the result can be used in URIs.
So encoding there means converting to a particular format.
In Python 2.7, I have the string 奥多比. To convert it into UTF-8 format, however, I need to use the decode() function.
Like: "奥多比".decode("utf-8") == u'\u5965\u591a\u6bd4'
I want to understand how the meaning of encode and decode changes between languages. To me, essentially, I should be doing "奥多比".encode("utf-8").
What am I missing here?
You appear to be confusing Unicode text (represented in Python 2 as the unicode type, indicated by the u prefix on the literal syntax) with one of the standard Unicode encodings, UTF-8.
You are not creating UTF-8; you created a Unicode text object, by decoding from a UTF-8 byte stream.
The byte string literal "奥多比" is a sequence of binary data, bytes. You either entered these in a text editor and saved the file as UTF-8 (and told Python to treat your source code as UTF-8 by starting the file with a PEP 263 codec header), or you typed it into the Python interactive prompt in a terminal that was configured to send UTF-8 data.
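For illustration, a Python 2 source file saved as UTF-8 (the repr values assume exactly that encoding):

# -*- coding: utf-8 -*-
# ^ the PEP 263 codec header; it tells Python the source bytes are UTF-8.
s = "奥多比"              # str: the raw UTF-8 bytes as saved by the editor
u = s.decode("utf-8")     # unicode: three code points

print repr(s)  # '\xe5\xa5\xa5\xe5\xa4\x9a\xe6\xaf\x94' (9 bytes)
print repr(u)  # u'\u5965\u591a\u6bd4' (3 code points)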
I strongly urge you to read more about the difference between bytes, codecs and Unicode text. The following links are highly recommended:
Ned Batchelder's Pragmatic Unicode
The Python Unicode HOWTO
Joel Spolsky's The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
In Python 2, it's of type str, i.e. a sequence of bytes. To convert it to a Unicode string, you need to decode this sequence of bytes using a codec. Simply said, the codec specifies how bytes should be converted to a sequence of Unicode code points. Look into the Unicode HOWTO for a more in-depth article on this.

Is the Unicode code point value equal to the UTF-16BE representation for every character?

I saved some strings in Microsoft Agenda in Unicode big-endian format (UTF-16BE). When I open the file with the shell command xxd to see the binary values and write them down, then get the code point of each character with ord() (a Python built-in function which takes a one-character Unicode string and returns its code point value) and compare the two, I find that they are equal.
But I would think the Unicode code point value is different from UTF-16BE: one is a code point, the other is an encoding format. Some of them may be equal, but perhaps they differ for some characters.
Is the Unicode code point value equal to the UTF-16BE encoding representation for every character?
No, codepoints outside of the Basic Multilingual Plane use two UTF-16 words (so 4 bytes).
For codepoints in the U+0000 to U+D7FF and U+E000 to U+FFFF ranges, the codepoint and UTF-16 encoding map one-to-one.
For codepoints in the range U+10000 to U+10FFFF, two words in the range U+D800 to U+DFFF are used; a lead surrogate from 0xD800 to 0xDBFF and a trail surrogate from 0xDC00 to 0xDFFF.
See the UTF-16 Wikipedia article for the nitty-gritty details.
So, most UTF-16 big-endian words, when printed, can be mapped directly to Unicode codepoints. For UTF-16 little-endian you just swap the bytes around. For UTF-16 words starting with a 0xD8 through 0xDF byte, you'll have to map the surrogate pair to the actual codepoint.
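The mapping is easy to verify with a little arithmetic (a small sketch; 0x1F600 is just an arbitrary codepoint above U+FFFF):

# Split a supplementary-plane codepoint into its UTF-16 surrogate pair,
# then recombine the pair to recover the original value.
cp = 0x1F600

v = cp - 0x10000                  # 20-bit value
lead = 0xD800 + (v >> 10)         # high surrogate, 0xD800..0xDBFF
trail = 0xDC00 + (v & 0x3FF)      # low surrogate, 0xDC00..0xDFFF
assert (lead, trail) == (0xD83D, 0xDE00)

assert 0x10000 + ((lead - 0xD800) << 10) + (trail - 0xDC00) == cp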

Converting Unicode codepoints into Unicode character using Python 3.3.1

I have this string:
sig=45C482D2486105B02211ED4A0E3163A9F7095E81.4DDB3B3A13C77FE508DCFB7C6CC68957096A406C\u0026type=video%2F3gpp%3B+codecs%3D%22mp4v.20.3%2C+mp4a.40.2%22\u0026quality=small\u0026itag=17\u0026url=http%3A%2F%2Fr6---sn-cx5h-itql.c.youtube.com%2Fvideoplayback%3Fsource%3Dyoutube%26mt%3D1367776467%26expire%3D1367797699%26itag%3D17%26factor%3D1.25%26upn%3DpkX9erXUHx4%26cp%3DU0hVTFdUVV9OU0NONV9PTllHOnhGdTVLUThqUWJW%26key%3Dyt1%26id%3Dab9b0e2f311eaf00%26mv%3Dm%26newshard%3Dyes%26ms%3Dau%26ip%3D49.205.30.138%26sparams%3Dalgorithm%252Cburst%252Ccp%252Cfactor%252Cid%252Cip%252Cipbits%252Citag%252Csource%252Cupn%252Cexpire%26burst%3D40%26algorithm%3Dthrottle-factor%26ipbits%3D8%26fexp%3D917000%252C919366%252C916626%252C902533%252C932000%252C932004%252C906383%252C904479%252C901208%252C925714%252C929119%252C931202%252C900821%252C900823%252C912518%252C911416%252C930807%252C919373%252C906836%252C926403%252C900824%252C912711%252C929606%252C910075%26sver%3D3\u0026fallback_host=tc.v19.cache2.c.youtube.com
As you can see, it contains both forms:
%xx, for example %3B, %2F, etc.
\uxxxx, for example \u0026
I need to convert them to their Unicode character representations. I'm using Python 3.3.1, and urllib.parse.unquote(s) converts only the %xx escapes to their character representation; it doesn't convert the \uxxxx escapes. For example, \u0026 should become &.
How can I convert both of them?
Two options:
Choose to interpret it as JSON; that format uses the same escape codes. The input does need to have quotes around it to be seen as a string (a sketch follows the example below).
Encode to latin 1 (to preserve bytes), then decode with the unicode_escape codec:
>>> urllib.parse.unquote(sig).encode('latin1').decode('unicode_escape')
'45C482D2486105B02211ED4A0E3163A9F7095E81.4DDB3B3A13C77FE508DCFB7C6CC68957096A406C&type=video/3gpp;+codecs="mp4v.20.3,+mp4a.40.2"&quality=small&itag=17&url=http://r6---sn-cx5h-itql.c.youtube.com/videoplayback?source=youtube&mt=1367776467&expire=1367797699&itag=17&factor=1.25&upn=pkX9erXUHx4&cp=U0hVTFdUVV9OU0NONV9PTllHOnhGdTVLUThqUWJW&key=yt1&id=ab9b0e2f311eaf00&mv=m&newshard=yes&ms=au&ip=49.205.30.138&sparams=algorithm%2Cburst%2Ccp%2Cfactor%2Cid%2Cip%2Cipbits%2Citag%2Csource%2Cupn%2Cexpire&burst=40&algorithm=throttle-factor&ipbits=8&fexp=917000%2C919366%2C916626%2C902533%2C932000%2C932004%2C906383%2C904479%2C901208%2C925714%2C929119%2C931202%2C900821%2C900823%2C912518%2C911416%2C930807%2C919373%2C906836%2C926403%2C900824%2C912711%2C929606%2C910075&sver=3&fallback_host=tc.v19.cache2.c.youtube.com'
This interprets \u escape codes just as Python would when reading string literals in Python source code.
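If you take the JSON route instead, a minimal sketch (Python 3; the sig value is abbreviated here, and the input must not contain unescaped " or \ characters for this to parse):

import json
from urllib.parse import unquote

raw = r'sig=45C482D2...\u0026type=video%2F3gpp%3B...\u0026quality=small'
decoded = json.loads('"' + raw + '"')  # JSON turns \u0026 into &
print(unquote(decoded))                # then %2F -> /, %3B -> ; and so on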
If I'm guessing right, this is more or less a URL. The '%xx' form encodes a single byte outside the allowed character set. The '\uxxxx' form encodes a Unicode codepoint. It is normal for URLs to encode Unicode characters as UTF-8 and then to encode the bytes outside the allowed charset as '%xx' (which affects all multibyte UTF-8 sequences). That makes it surprising to see '%xx'-encoded bytes alongside the '\uxxxx' escapes, because translating the Unicode codepoints afterwards makes the conversion irreversible.
Make sure you have tests and that you can verify the actual results, because this approach seems unsafe. At least, I don't fully understand the requirements here.

Large strings and len()

This may be a newbie question, but here goes. I have a large string (167572 bytes) with both ASCII and non-ASCII characters. When I use len() on the string I get the wrong length. It seems that len() doesn't count 0x0A characters. The only way I can get the actual length of the string is with this code:
totalLen = 0
for x in test:
    totalLen += 1
for x in test:
    if x == '\x0a':
        totalLen += 1
print totalLen
What is wrong with len()? Or am I using it wrong?
You are confusing encoded byte strings with Unicode text. In UTF-8, for example, up to 4 bytes are needed to encode a single character, and in UTF-16 each character is encoded using at least 2 bytes.
A Python 2 string is a series of bytes; to get Unicode you have to decode the string with an appropriate codec. If your text is encoded using UTF-8, for example, you can decode it with:
test = test.decode('utf8')
On the other hand, data written to a file is always encoded, so a Unicode string of length 10 could take up 20 bytes in the file if written using the UTF-16 codec.
Most likely you are getting confused by such 'wider' characters, not by whether or not your \n (ASCII 10) characters are counted correctly.
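A quick way to see the difference (Python 2, assuming the data is UTF-8):

# -*- coding: utf-8 -*-
test = 'héllo wörld'             # str: é and ö take two bytes each in UTF-8
print len(test)                  # 13, the number of bytes
print len(test.decode('utf8'))   # 11, the number of characters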
Please do yourself a favour and read up on Unicode and encodings:
Joel on Software: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
The Python Unicode HOWTO.
Could it be that you're expecting it to contain \r\n, i.e. ASCII 13 (carriage return) followed by ASCII 10 (line feed), or that you're looking at the string after it's been written out to a text file, which adds these?
It's hard to be more specific, since you don't give a lot of detail, i.e. where the string's data comes from.

I don't understand encode and decode in Python (2.7.3)

I have tried to understand encode and decode in Python by myself, but nothing is really clear to me.
str.encode([encoding,[errors]])
str.decode([encoding,[errors]])
First, I don't understand the need for the "encoding" parameter in these two functions.
What is the output of each function, and what is its encoding? What is the use of the "encoding" parameter in each function? I don't really understand the definition of "byte string".
I have an important question: is there some way to pass from one encoding to another?
I have read some text on ASN.1 about "octet strings", so I wondered whether that is the same as a "byte string".
Thanks for your help.
It's a little more complex in Python 2 (compared to Python 3), since it conflates the concepts of 'string' and 'bytestring' quite a bit, but see The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets. Essentially, what you need to understand is that 'string' and 'character' are abstract concepts that can't be directly represented by a computer. A bytestring is a raw stream of bytes straight from disk (or that can be written straight from disk). encode goes from abstract to concrete (you give it preferably a unicode string, and it gives you back a byte string); decode goes the opposite way.
The encoding is the rule that says 'a' should be represented by the byte 0x61 and 'α' by the two-byte sequence 0xCE 0xB1.
My presentation from PyCon, Pragmatic Unicode, or, How Do I Stop The Pain covers all of these details.
Briefly, Unicode strings are sequences of integers called code points, and bytestrings are sequences of bytes. An encoding is a way to represent Unicode code points as a series of bytes. So unicode_string.encode(enc) will return the byte string of the Unicode string encoded with "enc", and byte_string.decode(enc) will return the Unicode string created by decoding the byte string with "enc".
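In other words (a tiny sketch; the string is arbitrary):

u = u'caf\u00e9'               # Unicode string: a sequence of code points
b = u.encode('utf-8')          # byte string: 'caf\xc3\xa9'
assert b.decode('utf-8') == u  # the same codec on both sides round-trips

try:
    b.decode('ascii')          # wrong codec: 0xc3 is not valid ASCII
except UnicodeDecodeError:
    pass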
Python 2.x has two types of strings:
str = "byte strings" = a sequence of octets. These are used for both "legacy" character encodings (such as windows-1252 or IBM437) and for raw binary data (such as struct.pack output).
unicode = "Unicode strings" = a sequence of UTF-16 or UTF-32 depending on how Python is built.
This model was changed for Python 3.x:
2.x unicode became 3.x str (and the u prefix was dropped from the literals).
A bytes type was introduced for representing binary data.
A character encoding is a mapping between Unicode strings and byte strings. To convert a Unicode string to a byte string, use the encode method:
>>> u'\u20AC'.encode('UTF-8')
'\xe2\x82\xac'
To convert the other way, use the decode method:
>>> '\xE2\x82\xAC'.decode('UTF-8')
u'\u20ac'
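For comparison, the same round trip in Python 3, where the types are named str and bytes:
>>> '\u20ac'.encode('UTF-8')
b'\xe2\x82\xac'
>>> b'\xe2\x82\xac'.decode('UTF-8')
'€'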
Yes, a byte string is an octet string. Encoding and decoding happen when inputting/outputting text (from/to the console, files, the network, ...). Your console may use UTF-8 internally, your web server may serve latin-1, and certain file formats need strange encodings, like BibTeX's accents: fran\c{c}aise. You need to convert from/to them on input/output.
The {en|de}code methods do this. They are often called behind the scenes (for example, print "hello world" encodes the string to whatever your terminal uses).
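For file I/O you can make the conversion explicit with the codecs module (a small Python 2 sketch; the filename is arbitrary):

import codecs

u = u'fran\u00e7aise'
with codecs.open('out.txt', 'w', encoding='utf-8') as f:
    f.write(u)                    # encoded to UTF-8 bytes on the way out
with codecs.open('out.txt', encoding='utf-8') as f:
    assert f.read() == u          # decoded back to unicode on the way in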
