I am trying to read how strings work in Python and am having a tough time deciphering various functionalities. Here's what I understand. Hoping to get corrections and new perspectives as to how to remember these nuances.
Firstly, I know that Unicode evolved to accommodate multiple languages and accents across the world. But how does python store strings?
If I define s = 'hello' what is the encoding in which the string s is stored? Is it Unicode? Or does it store in plain bytes? On doing type(s) I got the answer as <type 'str'>. However, when I did us = unicode(s), us was of the type <type 'unicode'>. Is us a str type or is there actually a unicode type in python?
Also, I know that to store space, I know that we encode strings as bytes using encode() function. So suppose bs = s.encode('utf-8', errors='ignore') will return a bytes object. So, now when I am writing bs to a file, should I open the file in wb mode? I have seen that if opened in w mode, it stores the string in the file as b"<content in s>".
What does decode() function do?(I know, the question is too open-ended.) Is it like, we apply this on a bytes object and this transforms the string into our chosen encoding? Or does it always convert it back to an Unicode sequence? Can any other insights be drawn from the following lines?
>>> s = 'hello'
>>> bobj = bytes(s, 'utf-8')
>>> bobj
'hello'
>>> type(bobj)
<type 'str'>
>>> bobj.decode('ascii')
u'hello'
>>> us = bobj.decode('ascii')
>>> type(us)
<type 'str'>
How does str(object) work? I read that it will try to execute the str() function in the object description. But how differently does this function act on say Unicode strings and regular byte-coded strings?
Thanks in advance.
Important: below python3 behavior is described. While python2 has some conceptual similarities, the exposed behavior would be different.
In a nutshell: due to the unicode support string object in python3 is a higher level abstraction. It's up to the interpreter how to represent it in memory. So, when it comes to serialization (eg. writing string's textual representation to a file), one needs to explicitly encode it to a bytes sequence first, using a specified encoding (eg. UTF-8). The same is true for the bytes to string conversion, i.e. decoding. In python2 same behavior can be achieved using unicode class, while str is rather a synonym to bytes.
While it's not a direct answer to your question, have a look at these examples:
import sys
e = ''
print(len(e)) # 0
print(sys.getsizeof(e)) # 49
a = 'hello'
print(len(a)) # 5
print(sys.getsizeof(a)) # 54
u = 'hello平仮名'
print(len(u)) # 8
print(sys.getsizeof(u)) # 90
print(len(u[1:])) # 7
print(sys.getsizeof(u[1:])) # 88
print(len(u[:-1])) # 7
print(sys.getsizeof(u[:-1])) # 88
print(len(u[:-2])) # 6
print(sys.getsizeof(u[:-2])) # 86
print(len(u[:-3])) # 5
print(sys.getsizeof(u[:-3])) # 54
print(len(u[:-4])) # 4
print(sys.getsizeof(u[:-4])) # 53
j = 'hello😋😋😋'
print(len(j)) # 8
print(sys.getsizeof(j)) # 108
print(len(j[:-1])) # 7
print(sys.getsizeof(j[:-1])) # 104
print(len(j[:-2])) # 6
print(sys.getsizeof(j[:-2])) # 100
Strings are immutable in Python and this gives the interpreter an advantage to decide on a way the string will be encoded during the creation stage. Let's review the numbers from above:
Empty string object has an overhead of 49 bytes.
String with ASCII symbols of length 5 has size 49 + 5. I.e. the encoding uses 1 byte per symbol.
String with mixed (ASCII + non-ASCII) symbols has a higher memory footprint even
though the length is still 8.
The difference of u and u[1:] and at the same time the difference of u and u[:-1] is 90 - 88 = 2 bytes. I.e. the encoding uses 2 bytes per symbol. Even though the prefix of the string can be encoded with 1 byte per symbol. This gives us a huge advantage of having constant time indexing operation on strings, but we pay with an extra memory overhead.
Memory footprint of string j is even higher. It's just because we can't encode all the symbols in it using 2 bytes per symbol, so the interpreter uses 4 bytes per each symbol now.
Ok, keep checking the behavior. We already know, that the interpreter stores strings in even number of bytes per symbol way to give us O(1) access by index. However, we also know that UTF-8 uses variadic length representation of symbols. Let's prove it:
j = 'hello😋😋😋'
b = j.encode('utf8') # b'hello\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b\xf0\x9f\x98\x8b'
print(len(b)) # 17
So, we can see, that the first 5 characters are encoded using 1 byte per symbol while the remaining 3 symbols are encoded using (17 - 5)/3 = 4 bytes per symbol. This also explains why python uses 4 bytes per symbol representation under the hood.
And another way around, when we have a sequence of bytes and decode it to a string, the interpreter will decide on the internal string representation (1, 2, or 4 bytes per symbol) and it's completely opaque to the programmer. The only thing which must be transparent is the encoding of the sequence of bytes. We must tell the interpreter how to treat the bytes. While we should let him decide on the internal representation of string object.
Related
I'm using the following code to pack an integer into an unsigned short as follows,
raw_data = 40
# Pack into little endian
data_packed = struct.pack('<H', raw_data)
Now I'm trying to unpack the result as follows. I use utf-16-le since the data is encoded as little-endian.
def get_bin_str(data):
bin_asc = binascii.hexlify(data)
result = bin(int(bin_asc.decode("utf-16-le"), 16))
trimmed_res = result[2:]
return trimmed_res
print(get_bin_str(data_packed))
Unfortunately, it throws the following error,
result = bin(int(bin_asc.decode("utf-16-le"), 16)) ValueError: invalid
literal for int() with base 16: '㠲〰'
How do I properly decode the bytes in little-endian to binary data properly?
Use unpack to reverse what you packed. The data isn't UTF-encoded so there is no reason to use UTF encodings.
>>> import struct
>>> data_packed = struct.pack('<H', 40)
>>> data_packed.hex() # the two little-endian bytes are 0x28 (40) and 0x00 (0)
2800
>>> data = struct.unpack('<H',data_packed)
>>> data
(40,)
unpack returns a tuple, so index it to get the single value
>>> data = struct.unpack('<H',data_packed)[0]
>>> data
40
To print in binary use string formatting. Either of these work work best. bin() doesn't let you specify the number of binary digits to display and the 0b needs to be removed if not desired.
>>> format(data,'016b')
'0000000000101000'
>>> f'{data:016b}'
'0000000000101000'
You have not said what you are trying to do, so let's assume your goal is to educate yourself. (If you are trying to pack data that will be passed to another program, the only reliable test is to check if the program reads your output correctly.)
Python does not have an "unsigned short" type, so the output of struct.pack() is a byte array. To see what's in it, just print it:
>>> data_packed = struct.pack('<H', 40)
>>> print(data_packed)
b'(\x00'
What's that? Well, the character (, which is decimal 40 in the ascii table, followed by a null byte. If you had used a number that does not map to a printable ascii character, you'd see something less surprising:
>>> struct.pack("<H", 11)
b'\x0b\x00'
Where 0b is 11 in hex, of course. Wait, I specified "little-endian", so why is my number on the left? The answer is, it's not. Python prints the byte string left to right because that's how English is written, but that's irrelevant. If it helps, think of strings as growing upwards: From low memory locations to high memory. The least significant byte comes first, which makes this little-endian.
Anyway, you can also look at the bytes directly:
>>> print(data_packed[0])
40
Yup, it's still there. But what about the bits, you say? For this, use bin() on each of the bytes separately:
>>> bin(data_packed[0])
'0b101000'
>>> bin(data_packed[1])
'0b0'
The two high bits you see are worth 32 and 8. Your number was less than 256, so it fits entirely in the low byte of the short you constructed.
What's wrong with your unpacking code?
Just for fun let's see what your sequence of transformations in get_bin_str was doing.
>>> binascii.hexlify(data_packed)
b'2800'
Um, all right. Not sure why you converted to hex digits, but now you have 4 bytes, not two. (28 is the number 40 written in hex, the 00 is for the null byte.) In the next step, you call decode and tell it that these 4 bytes are actually UTF-16; there's just enough for two unicode characters, let's take a look:
>>> b'2800'.decode("utf-16-le")
'㠲〰'
In the next step Python finally notices that something is wrong, but by then it does not make much difference because you are pretty far away from the number 40 you started with.
To correctly read your data as a UTF-16 character, call decode directly on the byte string you packed.
>>> data_packed.decode("utf-16-le")
'('
>>> ord('(')
40
I use python 2.7 and windows.
I want to convert string'A123456' to bytes: b'\x0A\x12\x34\x56' and then concatenate it with other bytes (b'\xBB') to b'\xbb\x0A\x12\x34\x56'.
That is, I want to obtain b'\xbb\x0A\x12\x34\x56' from string'A123456' and b'\xBB'
This isn't too hard to do with binascii.unhexlify, the only problem you've got is that you want to zero pad your string when it's not an even number of nibbles (unhexlify won't accept a length 7 string).
So first off, it's probably best to make a quick utility function that does that, because doing it efficiently isn't super obvious, and you want a self-documenting name:
def zeropad_even(s):
# Adding one, then stripping low bit leaves even values unchanged, and rounds
# up odd values to next even value
return s.zfill(len(s) + 1 & ~1)
Now all you have to do is use that to fix up your string before unhexlifying it:
>>> from binascii import unhexlify
>>> unhexlify(zeropad_even('A123456'))
'\n\x124V'
>>> _ == b'\x0A\x12\x34\x56'
True
I included that last test just to show that you got the expected result; the repr of str tries to use printable ASCII or short escapes where available, so only the \x12 actually ends up in the repr; \x0A' becomes \n, \x34 is 4 and \x56 is V, but those are all equivalent ways to spell the same bytes.
Short story:
Is Python 3 unicode string lookup O(1) or O(n)?
Long story:
Index lookup of a character in a C char array is constant time O(1) because we can with certainty jump to a contiguous memory location:
const char* mystring = "abcdef";
char its_d = mystring[3];
Its the same as saying:
char its_d = *(mystring + 3);
Because we know that sizeof(char) is 1 as C99, and because of ASCII one character fits in one byte.
Now, in Python 3, now that string literals are unicode strings, we have the following:
>>> mystring = 'ab€cd'
>>> len(mystring)
5
>>> mybytes = mystring.encode('utf-8')
>>> len(mybytes)
7
>>> mybytes
b'ab\xe2\x82\xaccd'
>>> mystring[2]
'€'
>>> mybytes[2]
226
>> ord(mystring[2])
8364
Being UTF-8 encoded, byte 2 is > 127 and thus uses a multibyte representation for the character 3.
I cannot other than conclude that a index lookup in a Python string CANNOT be O(1), because of the multibyte representation of characters? That means that mystring[2] is O(n), and that somehow a on-the-fly interpretation of the memory array is being performed ir order to find the character at index? If that's the case, did I missed some relevant documentation stating this?
I made some very basic benchmark but I cannot infer an O(n) behaviour: https://gist.github.com/carlos-jenkins/e3084a07402ccc25dfd0038c9fe284b5
$ python3 lookups.py
Allocating memory...
Go!
String lookup: 0.513942 ms
Bytes lookup : 0.486462 ms
EDIT: Updated with better example.
UTF-8 is the default source encoding for Python. The internal representation uses fixed-size per-character elements in both Python 2 and Python 3. One of the results is that accessing characters in Python (Unicode) string objects by index has O(1) cost.
The code and results you presented do not demonstrate otherwise. You convert a string to a UTF-8-encoded byte sequence, and we all know that UTF-8 uses variable-length code sequences, but none of that says anything about the internal representation of the original string.
Let's say, I have a string (Unicode if it matters) variable which is less than 100 bytes. I want to create another variable with exactly 100 byte in size which includes this string and is padded with zero or whatever. How would I do it in Python 3?
For assembling packets to go over the network, or for assembling byte-perfect binary files, I suggest using the struct module.
struct — Interpret bytes as packed binary data
Just for the string, you might not need struct, but as soon as you start also packing binary values, struct will make your life much easier.
Depending on your needs, you might be better off with an off-the-shelf network serialization library, such as Protocol Buffers; or you might even just use JSON for the wire format.
Protocol Buffer Basics: Python
PyMOTW - JavaScript Object Notation Serializer
Something like this should work:
st = "具有"
by = bytes(st, "utf-8")
by += b"0" * (100 - len(by))
print(by)
# b'\xe5\x85\xb7\xe6\x9c\x890000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000'
Obligatory addendum since your original post seems to conflate strings with the length of their encoded byte representation: Python unicode explanation
To pad with null bytes you can do it the way they do it in the stdlib base64 module.
some_data = b'foosdsfkl\x05'
null_padded = some_data + bytes(100 - len(some_data))
Here's a roundabout way of doing it:
>>> import sys
>>> a = "a"
>>> sys.getsizeof(a)
22
>>> a = "aa"
>>> sys.getsizeof(a)
23
>>> a = "aaa"
>>> sys.getsizeof(a)
24
So following this, an ASCII string of 100 bytes will need to be 79 characters long
>>> a = "".join(["a" for i in range(79)])
>>> len(a)
79
>>> sys.getsizeof(a)
100
This approach above is a fairly simple way of "calibrating" strings to figure out their lengths. You could automate a script to pad a string out to the appropriate memory size to account for other encodings.
def padder(strng):
TARGETSIZE = 100
padChar = "0"
curSize = sys.getsizeof(strng)
if curSize <= TARGETSIZE:
for i in range(TARGETSIZE - curSize):
strng = padChar + strng
return strng
else:
return strng # Not sure if you need to handle strings that start longer than your target, but you can do that here
In python 3.2, i can change the type of an object easily. For example :
x=0
print(type (x))
x=bytes(0)
print(type (x))
it will give me this :
<class 'int'>
<class 'bytes'>
But, in python 2.7, it seems that i can't use the same way to do it. If i do the same code, it give me this :
<type 'int'>
<type 'str'>
What can i do to change the type into a bytes type?
You are not changing types, you are assigning a different value to a variable.
You are also hitting on one of the fundamental differences between python 2.x and 3.x; grossly simplified the 2.x type unicode has replaced the str type, which itself has been renamed to bytes. It happens to work in your code as more recent versions of Python 2 have added bytes as an alias for str to ease writing code that works under both versions.
In other words, your code is working as expected.
What can i do to change the type into a bytes type?
You can't, there is no such type as 'bytes' in Python 2.7.
From the Python 2.7 documentation (5.6 Sequence Types):
"There are seven sequence types: strings, Unicode strings, lists, tuples, bytearrays, buffers, and xrange objects."
From the Python 3.2 documentation (5.6 Sequence Types):
"There are six sequence types: strings, byte sequences (bytes objects), byte arrays (bytearray objects), lists, tuples, and range objects."
In Python 2.x, bytes is just an alias for str, so everything works as expected. Moreover, you are not changing the type of any objects here – you are merely rebinding the name x to a different object.
May be not exactly what you need, but when I needed to get the decimal value of the byte d8 (it was a byte giving an offset in a file) i did:
a = (data[-1:]) # the variable 'data' holds 60 bytes from a PE file, I needed the last byte
#so now a == '\xd8' , a string
b = str(a.encode('hex')) # which makes b == 'd8' , again a string
c = '0x' + b # c == '0xd8' , again a string
int_value = int(c,16) # giving me my desired offset in decimal: 216
#I hope this can help someone stuck in my situation
Just example to emphasize a procedure of turning regular string into binary string and back:
sb = "a0" # just string with 2 characters representing a byte
ib = int(sb, 16) # integer value (160 decimal)
xsb = chr(ib) # a binary string (equals '\xa0')
Now backwards
back_sb = xsb.encode('hex')
back_sb == sb # returns True