python different length for same unicode - python

I found something really weird about unicode. In my understanding, if I do u"" + "string", the result is of type unicode, so why are the lengths different?
print len(u''+'New York\u200b')
14
print type(u''+'New York\u200b')
<type 'unicode'>
print len(u'New York\u200b')
9
print type(u'New York\u200b')
<type 'unicode'>
I also tried to get rid of \u200b, which I think is a unicode character:
text = u'New York\u200b'
print text.encode('ascii', errors='ignore')
New York
text = u''+'New York\u200b'
print text.encode('ascii', errors='ignore')
New York\u200b
I also got a different result here, and I am really confused!
By the way, I am using Python 2.7; is it time to change to 3.3? Thanks in advance!

>>> (u''+'New York\u200b').encode('utf-8')
'New York\\u200b'
As you can see, since 'New York\u200b' is not a unicode string, the \u escape doesn't have any special meaning and it is interpreted literally, i.e. as the sequence of ASCII characters \ u 2 0 0 b, hence the string has length 14. The u'' only converts the string to unicode, but it does not cause a re-interpretation of the contents. Putting the u before the literal makes python interpret it as an escape, hence as a single character, hence the string is length 9.
In your second example:
text = u''+'New York\u200b'
print text.encode('ascii', errors='ignore')
New York\u200b
Here the .encode does not modify the characters in the string, it only converts from unicode to str.
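A minimal sketch of that point. It is written to run on Python 3 and should also run on 2.7 (my assumption; the question uses 2.7). The doubled backslash spells out the six literal characters explicitly, and the variable names are only illustrative:

```python
# Unicode string that contains the real ZERO WIDTH SPACE code point.
real = u'New York\u200b'
# Unicode string that contains the six ASCII characters \ u 2 0 0 b.
literal = u'' + 'New York\\u200b'

# 'ignore' drops only the characters that ASCII cannot represent.
real_ascii = real.encode('ascii', 'ignore')
literal_ascii = literal.encode('ascii', 'ignore')

print(repr(real_ascii))     # the zero width space has been dropped
print(repr(literal_ascii))  # nothing dropped: every character was ASCII
```

So 'ignore' has nothing to remove in the second case, because there is no non-ASCII character in that string to begin with.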
It's probably clearer if you print the contents of the two strings
>>> print(u'New York\u200b') # note: \u200b interpreted as unicode character
New York
>>> print(b'New York\u200b'.decode('ascii'))
New York\u200b
Or if you prefer to see an actual unicode representation try with code point 9731:
>>> print(u'New York\u2603')
New York☃
>>> print(b'New York\u2603'.decode('ascii'))
New York\u2603

'New York\u200b' is a non-unicode string of length 14.
(You append it to u'' string, but it itself is not unicode yet.)
u'New York\u200b' is a unicode string of length 9.
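If the data really arrives with a literal backslash sequence in it, one possible repair is the unicode_escape codec. This is only a sketch under the assumption that the input is pure ASCII apart from the escapes (the codec treats any other bytes as Latin-1); the names raw, text and fixed are mine:

```python
raw = b'New York\\u200b'   # 14 bytes: the escape was never interpreted
text = u'New York\u200b'   # 9 code points: \u200b is a single character

# Interpret the backslash escapes that are sitting in the byte string.
fixed = raw.decode('unicode_escape')

print(len(raw), len(text), len(fixed))
print(fixed == text)
```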

Related

python write file dealing with encode

I'm confused. I need HELP!!!
I'm dealing with a file that contains Chinese characters. For instance, let's call it a.TEST, and here is what's inside:
你好 中国 Hello China 1 2 3
You don't need to understand what the Chinese means (actually it's 'hello China').
>>> f=open('wr.TRAIN')
>>> print f.read()
你好 中国 Hello China 1 2 3
>>> f.seek(0)
>>> content = f.readline()
>>> content
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>> print content
你好 中国 Hello China 1 2 3
>>> type(content)
<type 'str'>
>>> isinstance(content,unicode)
False
Here comes the first question: why does the Python shell give me the UTF-8 bytes of content when I just type content, while print content outputs the form that I want to see?
The second question: what's the difference between unicode and str?
Someone told me that encode converts unicode to str, but what I learned from the Unicode HOWTO tells me that encode converts unicode to utf-8.
Not over yet! :)
here is test.py
#!/usr/bin/python
#-*- coding: utf-8 -*-
fr = open('a.TEST')
fw = open('out.TEST','w')
content = fr.readline()
content_list = content.split()
print content
fw.write('{0}'.format(content_list))
fr.close()
fw.close()
Third question: why do the Chinese characters turn into UTF-8 escape codes when I do .split()?
I thought fw.write('{0}'.format(content_list).decode('utf-8')) would work, but it doesn't.
I don't want what's written into out.TEST to be in escaped form; I want it to be exactly the characters as they looked originally (你好). How do I do that?
What is Encoding
A file consists of bytes. You can represent each byte with a number between 0 and 255 (or 0x00 and 0xFF in hexadecimal).
Text is also written as bytes. There is an agreement on the way text is written. That is an encoding. The most basic encoding is ASCII and other encodings are usually based on it. For example, ASCII defines that number 65 (0x41) represents 'A', 66 (0x42) represents 'B' etc.
How are Strings Represented
In python, you can define a string using numeric values:
>>> '\x41\x42\x43'
'ABC'
'\x41\x42\x43' is exactly the same thing as 'ABC'. Python will always represent the string using the more readable textual representation ('ABC').
However, some numeric values are not printable characters, so they will be represented in numeric form:
>>> '\x00\x01\x02\x03\x04'
'\x00\x01\x02\x03\x04'
Other characters have aliases to make your job easier:
>>> '\x0a\x0d\x09'
'\n\r\t'
Different Encodings
The ASCII table defines the meaning of numbers 0-127 and includes only the English alphabet. Numbers 128-255 are undefined. So, other encodings define a meaning for 128-255. Yet others change the meaning of the whole range 0-255.
There are many encodings and they define 128-255 differently.
For example, character 185 (0xB9) is ą in windows-1250 encoding, but it is š in iso-8859-2 encoding.
So, what happens if you print \xb9? It depends on the encoding used in the console. In my case (my console uses cp852 encoding) it is:
>>> print '\xb9'
╣
Because of that ambiguity, string '\xb9' will never be represented as '╣' (nor 'ą'...). That would hide the true value. It will be represented as the numeric value:
>>> '\xb9'
'\xb9'
Also:
>>> '╣'
'\xb9'
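The same ambiguity can be shown without relying on any console: decode the single byte with each code page and look at the code point you get. A small sketch (written for Python 3, should also run on 2.7; the dictionary name decoded is just for illustration):

```python
# One byte, three different code pages, three different characters.
byte = b'\xb9'

decoded = {}
for name in ('cp1250', 'iso-8859-2', 'cp852'):
    decoded[name] = byte.decode(name)
    # Print the code point number so the result does not depend on the console.
    print('{0}: U+{1:04X}'.format(name, ord(decoded[name])))
```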
See also the string from the question in my console:
>>> content = '\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>>
>>> content
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
>>>
>>> print content
ńŻáňąŻ ńŞşňŤŻ Hello China 1 2 3
But what happens if a variable is just entered in the console?
When a variable is entered in the console without print, its representation is printed. It is the same as the following:
>>> print repr(content)
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'
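To make the difference between the representation and the text concrete, here is a sketch that keeps both around as values. It is written for Python 3 (an assumption on my part, the answer itself is about Python 2), and uses a shortened version of the string from the question:

```python
# UTF-8 bytes for two Chinese characters followed by ASCII text.
content = b'\xe4\xbd\xa0\xe5\xa5\xbd Hello\n'

# What the prompt echoes: an escaped, unambiguous spelling of the bytes.
shown = repr(content)

# What the bytes mean once you say which encoding they are in.
text = content.decode('utf-8')

print(shown)
print(len(content), len(text))  # number of bytes vs number of characters
```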
What is Unicode?
The Unicode table aims to define a numeric representation of all characters in the world and more. It can actually do that, because it is not limited to 256 values (its code space runs up to 0x10FFFF, which is over a million code points). This is not an encoding, but a universal mapping of numbers to characters.
For example, Unicode defines that number 353 (0x0161) is the character š. That is always true regardless of your locale and the encodings you use. That character can be stored in files (or memory) in any encoding which supports š.
What is UTF-8?
When encoding a unicode character, one can use any encoding, but not all of them will support all characters.
For example, š (unicode 0x0161) can be encoded in iso-8859-2 as 0xB9, but it cannot be encoded in iso-8859-1 at all.
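A quick way to check this yourself is to try both encodings and catch the error. This is only a sketch (Python 3, should also run on 2.7); the flag name latin1_ok is made up for the example:

```python
s = u'\u0161'  # the character š

# ISO-8859-2 has a slot for this character.
latin2 = s.encode('iso-8859-2')

# ISO-8859-1 does not, so encoding raises UnicodeEncodeError.
try:
    s.encode('iso-8859-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(repr(latin2), latin1_ok)
```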
So, to be able to encode anything, you need an encoding which supports every unicode character. UTF-8 is one of those encodings, but there are others:
>>> u'\u0161'.encode('utf-7')
'+AWE-'
>>> u'\u0161'.encode('utf-8')
'\xc5\xa1'
>>> u'\u0161'.encode('utf-16le')
'a\x01'
>>> u'\u0161'.encode('utf-16be')
'\x01a'
>>> u'\u0161'.encode('utf-32le')
'a\x01\x00\x00'
>>> u'\u0161'.encode('utf-32be')
'\x00\x00\x01a'
The good thing about utf-8 is that the whole ASCII range is unchanged and as long as only ASCII is used, only one byte is used per character:
>>> u'abcdefg'.encode('utf-8')
'abcdefg'
Unicode in Python 2
Important: This is really specific to Python 2. Python 3 is different.
Unlike str objects, which are strings of bytes, unicode objects are strings of unicode characters.
They can be encoded into a str in chosen encoding, or decoded from str in chosen encoding.
A unicode string is specified using u before the opening quote. The characters inside are interpreted using current encoding, or they can be specified in numeric format \uHEX:
>>> u'ABCD'
u'ABCD'
>>>
>>> u'\u0041\u0042\u0043'
u'ABC'
>>> u'šâů'
u'\u0161\xe2\u016f'
And Now the Answers
First Question
Typing content at the prompt prints repr(content).
print content prints content itself.
Second Question
UTF-8 strings are byte strings (str). You get them by encoding the unicode:
>>> u'\u0161'.encode('utf-8')
'\xc5\xa1'
>>> '\xc5\xa1'.decode('utf-8')
u'\u0161'
So yes, encode converts unicode to str. The str can be utf-8, but it does not have to be.
Third Question
A) "Why the chinese character turn into utf-8 code when I do .split()?"
They were utf-8 all the time.
B) "I thought fw.write('{0}'.format(content_list).decode('utf-8')) will work"
content_list is not a string. It is a list. When a list is converted to a string, it is done using its repr, which also does repr of all of the contents.
For example:
>>> 'a \n a \n a'
'a \n a \n a'
>>> print 'a \n a \n a'
a
a
a
>>> print ['a \n a \n a']
['a \n a \n a']
The last print printed repr(list) which contains repr(str).
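So if the goal is to write the words themselves to out.TEST, one option is to join the list back into a single string instead of formatting the list. A sketch, assuming the file is UTF-8 and using io.open so that it behaves the same on Python 2.7 and 3; the temporary directory is only there to keep the example self-contained:

```python
import io
import os
import tempfile

# The line from the question, as the raw UTF-8 bytes read from the file.
line = b'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd Hello China 1 2 3\n'

# Decode first, then split: the list now holds text, not bytes.
words = line.decode('utf-8').split()

# Join the items instead of writing the repr of the list.
joined = u' '.join(words)

path = os.path.join(tempfile.mkdtemp(), 'out.TEST')
with io.open(path, 'w', encoding='utf-8') as fw:
    fw.write(joined)

# Read it back to confirm the characters survived unchanged.
with io.open(path, encoding='utf-8') as fr:
    round_trip = fr.read()
```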
In the beginning, there were just English characters, and people were not satisfied.
Then they wanted to display every character in the world. But there is a problem: one byte can only represent 256 different values. There is simply not enough room to hold them all.
So people decided to use more than one byte per character where needed. UTF-8 is one such scheme; it uses one to four bytes per character.
No matter what characters you write, they are all stored in byte form.
In Python 2, a str is a sequence of bytes, and text decoded from those bytes is a separate type, unicode. UTF-8 is one encoding that maps between the two.
'\xe4\xbd\xa0\xe5\xa5\xbd \xe4\xb8\xad\xe5\x9b\xbd' is the byte form of "你好 中国".
It cannot be displayed as text without an encoding specified.
I suppose you could blame linux/unix. Python has no problem displaying 'utf-8' characters, while 'cat' cannot.

encoding unicode using UTF-8

In Python, if I type
euro = u'\u20AC'
euroUTF8 = euro.encode('utf-8')
print(euroUTF8, type(euroUTF8), len(euroUTF8))
the output is
('\xe2\x82\xac', <type 'str'>, 3)
I have two questions:
1. It looks like euroUTF8 is encoded over 3 bytes, but how do I get its binary representation to see how many bits it contains?
2. What does 'x' in '\xe2\x82\xac' mean? I don't think 'x' is a hex digit. And why are there three '\'?
In Python 2, print is a statement, not a function. You are printing a tuple here. Print the individual elements by removing the (..):
>>> euro = u'\u20AC'
>>> euroUTF8 = euro.encode('utf-8')
>>> print euroUTF8, type(euroUTF8), len(euroUTF8)
€ <type 'str'> 3
Now you get the 3 individual objects written as strings to stdout; my terminal just happens to be configured to interpret anything written to it as UTF-8, so the bytes correctly result in the € Euro symbol being displayed.
The \x<hh> sequences are Python string literal escape sequences (see the reference documentation); they are the default output for the repr() applied to a string with non-ASCII, non-printable bytes in them. You'll see the same thing when echoing the value in an interactive interpreter:
>>> euroUTF8
'\xe2\x82\xac'
>>> euroUTF8[0]
'\xe2'
>>> euroUTF8[1]
'\x82'
>>> euroUTF8[2]
'\xac'
They provide you with ASCII-safe debugging output. The contents of all Python standard library containers use this format; including lists, tuples and dictionaries.
If you want to format to see the bits that make up these values, convert each byte to an integer by using the ord() function, then format the integer as binary:
>>> ' '.join([format(ord(b), '08b') for b in euroUTF8])
'11100010 10000010 10101100'
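On Python 3, iterating over a bytes object already gives integers, so ord() is not needed there. Going through bytearray is one way to get the same integers on both versions; this variant is my own sketch, not part of the answer above:

```python
euro_utf8 = u'\u20ac'.encode('utf-8')

# bytearray yields integers on both Python 2.7 and Python 3.
bits = ' '.join(format(b, '08b') for b in bytearray(euro_utf8))

print(bits)  # 11100010 10000010 10101100
```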
Characters are represented with a different number of bits in each encoding. UTF-8 works in 8-bit units and uses one to four of them per character, so len() of the encoded string already tells you the size: 3 bytes, i.e. 24 bits, for the Euro sign. (If you still want to see the bits, refer to Martijn's answer.)
\x means that the two characters that follow are hexadecimal digits giving the value of one byte. So x itself is not a hex digit that you should convert or read; it marks the value that follows, which is what you are interested in. The backslashes introduce each escape sequence and are not part of the value; there are three of them because there are three bytes.
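To see how the byte count varies from character to character in UTF-8, you can encode a few samples and measure them. The sample characters here are my own choice (a, š, the Euro sign, and an emoji outside the Basic Multilingual Plane):

```python
samples = [u'a', u'\u0161', u'\u20ac', u'\U0001F600']

# Number of bytes each character occupies once encoded as UTF-8.
sizes = [len(ch.encode('utf-8')) for ch in samples]

print(sizes)  # [1, 2, 3, 4]
```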

How can I slice a substring from a unicode string with Python?

I have a unicode string as a result : u'splunk>\xae\uf001'
How can I get the substring 'uf001'
as a simple string in python?
The characters uf001 are not actually present in the string, so you can't just slice them off. You can do
repr(s)[-6:-1]
or
'u' + hex(ord(s[-1]))[2:]
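A slightly more explicit variant of the second option formats the code point with zero padding, so it also gives four digits for small code points. This is a sketch; label is just a name I picked:

```python
s = u'splunk>\xae\uf001'

# ord() gives the code point number; format it as four hex digits.
label = 'u{0:04x}'.format(ord(s[-1]))

print(label)  # uf001
```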
Since you want the actual string (as seen from comments) , just get the last character [-1] index , Example -
>>> a = u'splunk>\xae\uf001'
>>> print(a)
splunk>®ï€
>>> a[-1]
'\uf001'
>>> print(a[-1])
ï€
If you want the unicode representation (\uf001) , then take repr(a[-1]) , Example -
>>> repr(a[-1])
"'\\uf001'"
\uf001 is a single unicode character (not multiple strings) , so you can directly get that character as above.
You see \uf001 because you are looking at the result of repr() on the string. If you print it or use it somewhere else (for files, etc.), it will be the actual U+F001 character.
u'' is how a Unicode string is represented in Python source code. The REPL uses this representation by default to display unicode objects:
>>> u'splunk>\xae\uf001'
u'splunk>\xae\uf001'
>>> print(u'splunk>\xae\uf001')
splunk>®
>>> print(u'splunk>\xae\uf001'[-1])

If your terminal is not configured to display Unicode or if you are on a narrow build (e.g., it is likely for Python 2 on Windows) then the result may be different.
A Unicode string is an immutable sequence of Unicode codepoints in Python. len(u'\uf001') == 1: it does not contain uf001 (5 characters) in it. You could also write the character literally between the quotes (U+F001 is a private-use code point, so it may show up as a blank or a box here; on Python 2 it is necessary to declare the character encoding of your source file if you use non-ascii characters):
>>> u'\uf001' == u''
True
It is just a different way to represent exactly the same Unicode character (a single codepoint in this case).
Note: some user-perceived characters may span several Unicode codepoints e.g.:
>>> import unicodedata
>>> unicodedata.normalize('NFKD', u'ё')
u'\u0435\u0308'
>>> print(unicodedata.normalize('NFKD', u'ё'))
ё

Python issue with incorrectly formated strings that contains \x

At some point our Python script receives a string like this:
In [1]: ab = 'asd\xeffe\ctve \\ \\\\ \\\\\\k\\\\\\'
In [2]: print ab
asd�fe\ctve \ \\ \\\k\\\
The data is damaged: we need the \x escape to be output literally as \x, while \c has no special meaning in a string and thus must be left intact.
So far the closest solution I found is to do something like:
In [1]: ab = 'asd\xeffe\ctve \\ \\\\ \\\\\\k\\\\\\'
In [2]: print ab.encode('string-escape').replace('\\\\', '\\').replace("\\'", "'")
asd\xeffe\ctve \ \\ \\\k\\\
The output is taken from IPython. I assumed that ab is a str, not a unicode string; in the latter case we would have to do something like this:
def escape_string(s):
    if isinstance(s, str):
        s = s.encode('string-escape').replace('\\\\', '\\').replace("\\'", "'")
    elif isinstance(s, unicode):
        s = s.encode('unicode-escape').replace('\\\\', '\\').replace("\\'", "'")
    return s
\xhh is an escape sequence, and \x is seen as the start of this escape.
'\\' is the same as '\x5c'. It is just two different ways to write the backslash character as a Python string literal.
These literal strings: r'\c', '\\c', '\x5cc', '\x5c\x63' are identical str objects in memory.
'\xef' is a single byte (239 as an integer), but r'\xef' (same as '\\xef') is a 4-byte string: '\x5c\x78\x65\x66'.
If s[0] returns '\xef' then it is what s object actually contains. If it is wrong then fix the source of the data.
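The one-byte versus four-byte difference is easy to verify with byte string literals. A short sketch (the br prefix is accepted by both Python 2.7 and Python 3; the names one and four are illustrative):

```python
one = b'\xef'    # escape interpreted: a single byte with value 239
four = br'\xef'  # raw literal: backslash, x, e, f

print(len(one), len(four))

# The raw literal is just another spelling of these four bytes.
same = (four == b'\\xef' == b'\x5c\x78\x65\x66')
print(same)
```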
Note: string-escape also escapes \n and the like:
>>> print u'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''.encode('unicode-escape')
\xef\\c\\\u2603"'\u2603\u2603"'\n\xa0
>>> print b'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''.encode('string-escape')
\xef\\c\\\\N{SNOWMAN}"\'\xe2\x98\x83\\u2603"\'\n\xa0
backslashreplace is used only on characters that cause UnicodeEncodeError:
>>> print u'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''
ï\c\☃"'☃☃"'
>>> print b'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''
�\c\\N{SNOWMAN}"'☃\u2603"'
�
>>> print u'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''.encode('ascii', 'backslashreplace')
\xef\c\\u2603"'\u2603\u2603"'
\xa0
>>> print b'''\xef\c\\\N{SNOWMAN}"'\
... ☃\u2603\"\'\n\xa0'''.decode('latin1').encode('ascii', 'backslashreplace')
\xef\c\\N{SNOWMAN}"'\xe2\x98\x83\u2603"'
\xa0
Backslashes introduce "escape sequences". \x specifically allows you to specify a byte, which is given as two hexadecimal digits after the x. ef are two hexadecimal digits, hence you get no error. Double the backslash to escape it, or use a raw string r"\xeffective".
Edit: While the Python console may show you '\\', this is precisely what you expect. You just say you expect something else because you confuse the string and its representation. It's a string containing a single backslash. If you were to output it with print, you'd see a single backslash.
But the string literal '\' is ill-formed (not closed because \' is an apostrophe, not a backslash and end-of-string-literal), so repr, which formats the results at the interactive shell, does not produce it. Instead it produces a string literal which you could paste into Python source code and get the same string object. For example, len('\\') == 1.
The \x escape sequence introduces a character given by its hex code (a single byte in a Python 2 str), and ef is being interpreted as that hex code. You can sanitize the string by doubling the backslash, or else make it a raw string (r'\xeffective').
>>> r'\xeffective'[0]
'\\'
EDIT: You could convert an existing string using the following hack:
>>> a = '\xeffective'
>>> b = repr(a).strip("'")
>>> b
'\\xeffective'

Why is it when I print something, there is always a unicode next to it? (Python)

[u'Iphones', u'dont', u'receieve', u'messages']
Is there a way to print it without the "u" in front of it?
What you are seeing is the __repr__() representation of the unicode string which includes the u to make it clear. If you don't want the u you could print the object (using __str__) - this works for me:
print [str(x) for x in l]
Probably better is to read up on python unicode and encode using the particular unicode codec you want:
print [x.encode() for x in l]
[edit]: to clarify repr and why the u is there: the goal of repr is to provide a convenient string representation, "to return a string that would yield an object with the same value when passed to eval()". I.e., you can copy and paste the printed output and get the same object (a list of unicode strings).
Python contains string classes for both unicode strings and regular strings. The u before a string indicates that it is a unicode string.
>>> mystrings = [u'Iphones', u'dont', u'receieve', u'messages']
>>> [str(s) for s in mystrings]
['Iphones', 'dont', 'receieve', 'messages']
>>> type(u'Iphones')
<type 'unicode'>
>>> type('Iphones')
<type 'str'>
See http://docs.python.org/library/stdtypes.html#sequence-types-str-unicode-list-tuple-buffer-xrange for more information about the string types available in Python.
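If the aim is simply to show the words without the list syntax at all, another option is to join them and print the resulting string, since printing a string never shows the u prefix or the quotes. A sketch (the separator and the name line are my own choices):

```python
words = [u'Iphones', u'dont', u'receieve', u'messages']

# Printing the list shows the repr of each item; printing a string shows its text.
line = u' '.join(words)

print(line)
```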
