I'm working with an array of bytes extracted from an UDP packet in Python.
The data it's represented like this:
data = [0x00,0x01,0x23,0x84,0xa6]
And when I use bytearray(data) and prints its content the prompt shows me a not hexadecimal digit like x01# or with other data contents the # digit its replace by a \n digit. I don't really know why this happens.
The complete code example
data = [0x00,0x01,0x23,0x84,0xa6]
data1 = bytearray(data)
print(data)
print(data1)
And the print shows
[0, 1, 35, 132, 166]
bytearray(b'\x00\x01#\x84\xa6')
Using bytes(data) the problem is the same.
Your bytearray is represented as a string. When a string is represented for human eyes the characters are displayed according to the current encoding (ASCII, utf-8, etc.). In your current encoding, the character with the value 0x23 is a hash-symbol (#). Only for the bytes which do not have a character representation (0x00, etc.) the hex representation is displayed (e.g. \x00).
So what you see is absolutely correct because you asked (maybe without knowing) for a string representation of your byte array.
If you want to see a hex value for each byte, use data1.hex(). This will create a hex representation for each byte and concatenate all of these. The result will be a string containing only hex digits (0-9 and a-f). This is only useful for printing, in most cases it is not useful for further processing.
In Python3, consider using bytes([0x00, 0x01, ...]) instead. That will produce a bytes object which is more native to the language (e.g. many functions like write(), send(), etc. will accept it as input). It also has a hex() method as described above.
Related
I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the # of UTF-16 code points in a string. (I think in order to do the former, you have to do the latter anyways.)
Sanity check: I am correct that the len() function, when operated on a python string returns the number of code points in it in UTF-8?
I need to do this because the LSP protocol requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.
I can't seem to find how to do this, the only python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python-3, strings are represented as str objects, which are conceptually vectors of unicode codepoints. So the length of a str is the number of Unicode characters it contains, and len("๐") is 1, just as with any other single character. That's independent of the fact that "๐" requires two UTF-16 code units (or four UTF-8 code units).
Python-3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "๐" in UTF-16LE, you would invoke "๐".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UFtF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a unicode string by dividing the length of the encoded bytes object by two: s.encode('utf-16-le')//2.
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with codepoints less than 65536 (216):
def utf16len(c):
"""Returns the length of the single character 'c'
in UTF-16 code units."""
return 1 if ord(c) < 65536 else 2
For counting the bytes, including BOM, len(str.encode("utf-16")) would work. You can use utf-16-le for bytes without BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: No, len(str) in Python checks the number of decoded characters. If a character takes 4 UTF-8 code points, it still counts as 1.
def craft_integration(xintegration_time):
integration_time = xintegration_time
integration_time_str = str(integration_time)
integration_time_str = integration_time_str.encode('utf-8')
integration_time_hex = integration_time_str.hex()
return integration_time_hex
def send_set_integration(xtime):
int_time_hex = decoder_crafter.craft_integration(xtime)
set_hex = "c1c000000000000010001100000000000000000000000004"+int_time_hex+"1400000000000000000000000000000000000000c5c4c3c2"
set_hex = str(set_hex)
print(set_hex)
set_hex = unhexlify(set_hex)
For example, input is '1000'.
That becomes 31303030 with craft_integration().
It is then inserted into the default hex string.
Output is:
c1c000000000000010001100000000000000000000000004313030301400000000000000000000000000000000000000c5c4c3c2
When unhexlify() is used, output is:
b'\xc1\xc0\x00\x00\x00\x00\x00\x00\x10\x00\x11\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x041000\x14\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xc5\xc4\xc3\xc2'
\x041000 is an conjunction of \x04 and 1000 which was the original input value, not the converted value.
Why would this happen?
What you have in fact is simply your desired value being rendered into a form by the default implementation of bytes.__repr__ that you were not expecting to the point that it was unhelpful to what you want.
To start from a more basic level: in Python, any element (well, any "byte", i.e. a group of 8 bits) inside a bytes type are typically being stored as raw digital representation somewhere in a machine as binary. In order to "print" them out onto a console for human consumption it must be turned into a form that may be interpreted by the console such that the correct glyph may be used to represent the underlying value. For many values, such as 0 (or 00000000 in binary), Python would use \x00 to represent that. The \ is the escape character to start an escape sequence, the x that follows signifies that the escape sequence is to be followed by 2 hexadecimal characters, and combining those two characters with the whole sequence would form the representation of that single byte using four characters. Likewise for 255, in binary that would be 11111111, and this same value as part of a bytes type will be encoded as \xff.
Now there are exceptions - if a given value falls inside the ASCII range, and that it in the range of printable characters, the representation will instead be the corresponding ASCII character. So in the case of the hexadecimal 30 (decimal 48), rendering of that as part of a bytes type will show 0 instead of \x30, as 0 is the corresponding printable character.
So for your case, a bytes representation that was printed out in the console in the form of b'\x041000', is not in fact a big \x value, as the \x escape sequence is only applied to exactly two subsequent characters - all following characters (i.e. 1000) are in fact being represented using the printable characters that would otherwise be represented as \x31\x30\x30\x30.
There is another method available to those who don't mind working with the decimal representation of bytes - simply cast the bytes into a bytearray then into a list. We will take two nul bytes (b'\x00\x00') as an example:
>>> list(bytearray(b'\x00\x00'))
[0, 0]
Clearly those two nul bytes will correspond to two zero values. Now try using the confusing b'\x04\x31\x30\x30\x30' which got rendered into b'\x041000':
>>> list(bytearray(b'\x041000'))
[4, 49, 48, 48, 48]
We can note that it was in fact 5 bytes rendered with the corresponding decimal numbers in a list of 5 elements.
It is often easy to get confused with what the actual value is, vs. what is being shown and visualized on the computer console. Unfortunately the tools we use sometimes amplify that confusion, but as programmers we should understand this and seek ways to minimize this for users of our work, as this example shows that not everyone may have the intuition that certain representations of bytes may instead be represented as printable ASCII.
The following represents a binary image extracted from a file (spaces inserted between bytes to make reading easier). File is opened with 'rb' mode.
01 77 33 9F 41 42 43 44 00 11 11 11
In Python 2.7, I read it as a character string and I use ord() to extract the binary values and then I can extract or even search the string for a specific text value (such as the "ABCD" in characters 4-7). The binary bytes can be anything from 0-FF. I've been putting off conversion to python 3 partly because of this.
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
I don't write the file, I just use it, so changing it is not an option.
I've seen lots of examples of using b' and other things to convert fixed strings but I need a way to intermix these values, extracting bytes, 2-byte to 8-byte values as 16-bit to 64-bit words, and extracting/searching for ASCII strings within the larger string.
The byte/character separation in Python 3 seems somewhat inflexible for what I need. I'm sure there's a way to do this I just haven't found an example or an answered question that seems to cover this case.
This is a simplified example, I can't provide real data (it's proprietary) but this illustrates the problem. The real files may be short (<1K) or huge (>100K), containing multiple records of different sizes.
Is there an easy, straightforward way to essentially replicate the functionality I have in Python 2.7?
This is on Windows.
Thanks
I need to be able, in Python 3, to treat a string of bytes as a mixture of binary and ascii (not unicode) values. The format is not fixed, it consists of data structures. For example, the 33 in byte 2 might be a record length that tells me where the start of the next record is. In other words, I can't just say that I know the text string is always in location 4.
Read the file in binary mode, as you are doing. This produces a bytes object, which in 3.x is not the same as a str (as it would be in 2.x).
Interpret the bytes as bytes, as needed, to figure out the general structure of the data. Slicing the bytes produces another bytes as before; indexing produces an int with the numeric value of that single byte (not as before) - no ord required.
When you have determined a subset of the bytes that represent a string (let's say for convenience that you have sliced it out), convert to string using the appropriate encoding: e.g. str(my_bytes, 'ascii'). Note that ASCII will not handle byte values 0x80 through 0xFF; especially with binary-ish legacy file formats, there's a good chance your data is actually something like Latin-1: str(my_bytes, 'iso-8859-1').
search the string for a specific text value
You can search at either the text or the byte level - bytes objects support the in operator, searching for either a subsequence of bytes or a single integer value. Whether it makes more sense to search before or after string conversion will depend on what you are doing.
using b' and other things to convert fixed strings
b'' is just the syntax for a literal bytes object. It's what you'll see if you ask for the repr of what you read from the file. Prefixing a b onto an existing string literal in your code isn't really "converting" anything, but replacing it with the value you should have had in the first place.
2-byte to 8-byte values as 16-bit to 64-bit words
The documentation says it at least as well as I could:
>>> help(int.from_bytes)
Help on built-in function from_bytes:
from_bytes(...) method of builtins.type instance
int.from_bytes(bytes, byteorder, *, signed=False) -> int
Return the integer represented by the given array of bytes.
The bytes argument must be a bytes-like object (e.g. bytes or bytearray).
The byteorder argument determines the byte order used to represent the
integer. If byteorder is 'big', the most significant byte is at the
beginning of the byte array. If byteorder is 'little', the most
significant byte is at the end of the byte array. To request the native
byte order of the host system, use `sys.byteorder' as the byte order value.
The signed keyword-only argument indicates whether two's complement is
used to represent the integer.
When I print a program such as this in Python:
x = b'francis'
The output is b'francis'. If bytes is in 0's and 1's why is it not printing it out?
You seem to be fundamentally confused, in a very common way. The data itself is a distinct concept from its representation, i.e. what you see when you attempt to print it out or otherwise display it. There may be multiple ways to represent the same data. This is just like how if I write 23 (in decimal) or 0x17 (hexadecimal) or 0o27 (octal) or 0b10111 (binary) or twenty-three (English), I am talking about the same number.
At some lower level below Python, everything is bytes, and each byte consists of bits; but it is not correct to say that the bytes "are in" 0s and 1s - just like how it is not correct to say that the number twenty-three "is in" decimal digits (or hexadecimal, octal or binary ones, or in English text characters).
The symbols 0 and 1 are just pictures that we draw on a screen to represent the state of those bits - if we choose to represent them individually. Sometimes, we choose larger groupings, and assign different symbols to various combinations of states. For example, we may interpret multiple bits as a single integer value in binary; or (using Unicode) we might further interpret that number as a "code point" (most of these are text characters; some are control characters, or portions of text characters).
A Python bytes object is a wrapper for a "raw" sequence of bytes. When you display it, Python uses a representation where each byte (grouping of 8 bits) corresponds to one or more symbols: bytes whose corresponding integer value is between thirty-two and one hundred twenty-six (inclusive) are (for historical reasons) represented using individual text characters (following the so-called ASCII encoding), while others are represented with a four-character "escape sequence" beginning with \x and followed by the hexadecimal representation of the number.
From python docs:
bytes and bytearray objects are sequences of integers (between 0 and
255), representing the ASCII value of single bytes.
So they are sequence of integers which represents ASCII values.
For conversion you can use:
import sys
int.from_bytes(b'\x11', byteorder=sys.byteorder) # => 17
bin(int.from_bytes(b'\x11', byteorder=sys.byteorder)) # => '0b10001'
The bytes object was intentionally designed to work like this: the repr uses the corresponding ASCII characters for bytes in the printable ASCII range, well-known backslash escapes for a few special ASCII control characters, and hex backslash escapes for everything else (and the str just is the repr).
The basic idea is that bytes can be used as an immutable array of integers from 0-255, but more often it's used as an immutable array of characters encoded in some ASCII-compatible charset.
In particular, one of the most common uses of bytes is for things like the headers in HTTP, SMTP, and other network protocols. These headers are generally entirely in pure ASCII, or at least pure ASCII keys with some values in pure ASCII and others in an ASCII-compatible charsetโand you generally have to parse the ASCII headers to figure out what charset to use to decode the body. Being able to see those headers are ASCII characters is a lot more useful than just seeing them as a sequence of numbers.
Basically, everything on your computer is eventually represented by 0's and 1's.
The purpose of b-notation isn't as you expected it to be.
I would like to refer you to a great answer that might help you understand what the b-notation is for and how to use it properly:
What does the 'b' character do in front of a string literal?
Good luck.
What I am really doing is creating a BMP file from JPEG using python and it's got some header data which contains info like size, height or width of the image, so basically I want to read a JPEG file, gets it width and height, calculate the new size of a BMP file and store it in the header.
Let's say the new size of the BMP file is 40000 bytes whose hex value is 0x9c40, now as there is 4 byte space to save this in the header, we can write it as 0x00009c40. In BMP header data, LSB is written first and then MSB so I have to write, 0x409c0000 in the file.
My Problems:-
I was able to do this in C but I am totally lost how to do so in Python.
For example, if I have i=40000, and by using str=hex(i)[2:] I got the hex value, now by some coding I was able to add the extra zeros and then reverse the code. Now how to write this '409c0000' data in the file as hex?
The header size is 54 bytes for BMP file, so is there is another way to just store the data in a string like str='00ffcf4f...'(upto 54 bytes) and just convert the whole str at once as hex and write it to file?
My friend told me to use unhexlify from binascii,
by doing unhexlify('fffcff') I get '\xff\xfc\xff' which is what I want but when I try unhexlify('3000') I get '0\x00'` which is not what I want. It is same for any value containing 3, 4, 5, 6 or 7. Is it the right way to do this?
You are not writing hex, you are writing binary data. Hexadecimal is a helpful notation when dealing with binary data, but don't confuse the notation with the value.
Use the struct module to pack integer data into binary structures, the same way C would.
binascii.unhexlify also is a good choice, provided you already have the data in a string using hex notation. The output is correct, but the binary representation only uses hex escapes for bytes outside the printable ASCII range.
Thus fffcff does correctly becomes \xff\xfc\xff, representing 3 bytes in hex escape notation, and 3000 is \x30\x00, but \x30 is the '0' character in ASCII, so the Python representation for that byte simply uses that ASCII character, as that is the most common way to interpret bytes.
Packing the integer value 40000 using struct.pack() as an unsigned integer (little endian) then becomes:
>>> import struct
>>> struct.pack('<I', 40000)
'#\x9c\x00\x00'
where the 40 byte is represented by the ASCII character for that byte, the # glyph.
If this is confusing, you can always create a new hex representation by going the other way and use 0binascii.hexlify() function](https://docs.python.org/2/library/binascii.html#binascii.hexlify) to create a hexadecimal representation for yourself, just to debug the output:
>>> import binascii
>>> binascii.hexlify(struct.pack('<I', 40000))
'409c0000'
and you'll see that the # byte is still the right hex value.