I need to port Perl code that packs a bit string into bytes. In Perl it looks like this:
pack 'B*', '0100001000111110010100101101000010010001'
I don't see an analog of the B* format in Python's struct module. Is there a ready-made solution, so that I don't have to reinvent the wheel?
Honestly, the description is not clear to me, so I can't even see how to implement it myself:
Likewise, the b and B formats pack a string that's that many bits
long. Each such format generates 1 bit of the result. These are
typically followed by a repeat count like B8 or B64.
Each result bit is based on the least-significant bit of the
corresponding input character, i.e., on ord($char)%2. In particular,
characters "0" and "1" generate bits 0 and 1, as do characters "\000"
and "\001".
Starting from the beginning of the input string, each 8-tuple of
characters is converted to 1 character of output. With format b, the
first character of the 8-tuple determines the least-significant bit of
a character; with format B, it determines the most-significant bit of
a character.
If the length of the input string is not evenly divisible
by 8, the remainder is packed as if the input string were padded by
null characters at the end. Similarly during unpacking, "extra" bits
are ignored.
If the input string is longer than needed, remaining
characters are ignored.
A * for the repeat count uses all characters
of the input field. On unpacking, bits are converted to a string of
0s and 1s.
So the string is divided into chunks of 8 characters. If the last chunk is shorter than 8 characters, it is padded at the end with null characters to make 8. Then each chunk becomes a byte.
But I can't understand what the resulting bits are. What do B8 and B64 mean here?
The int object has a to_bytes method:
binary = '0100001000111110010100101101000010010001'
number = int(binary, 2)
print(number.to_bytes((number.bit_length()+7)//8, 'big'))
# b'B>R\xd0\x91'
I'm not sure of the exact Perl semantics, but here's my guess at them:
import struct

def pack_bit_string(bs):
    ret = b''
    while bs:
        chunk, bs = bs[:8], bs[8:]
        # convert to an integer so we can pack it
        i = int(chunk, 2)
        # Handle trailing chunks that are not 8 bits.
        # Note this is an augmented assignment; it can also be read as
        # i = i << (8 - len(chunk))
        i <<= 8 - len(chunk)
        ret += struct.pack('B', i)
    return ret
Comments are inline. If you know things like "the input is less than 64 bits", you can avoid the loop and use Q for struct.pack.
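Putting the two answers together, here is a minimal sketch of what the quoted perldoc text seems to describe. It assumes the input contains only the characters '0' and '1' (Perl itself looks at ord($char) % 2, which this sketch does not reproduce). The name pack_bits and its msb_first parameter are made up for illustration: msb_first=True is meant to mimic B*, and msb_first=False to mimic b*.

```python
def pack_bits(bits, msb_first=True):
    """Pack a string of '0'/'1' characters into bytes, 8 characters per byte."""
    # Pad on the right with '0' so the length is a multiple of 8,
    # like the "padded by null characters at the end" rule in the quote.
    padded = bits + "0" * (-len(bits) % 8)
    out = bytearray()
    for start in range(0, len(padded), 8):
        chunk = padded[start:start + 8]
        if not msb_first:
            # For the b format the first character is the least-significant bit.
            chunk = chunk[::-1]
        out.append(int(chunk, 2))
    return bytes(out)


print(pack_bits('0100001000111110010100101101000010010001'))  # b'B>R\xd0\x91'
```

Unlike the bit_length-based version, this keeps leading zero bytes (sixteen bits starting with eight zeros still give two bytes) and left-aligns a short trailing chunk, so '1' becomes b'\x80'.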
I have a long string that can consist of several sub-strings (not always: sometimes it's one string, sometimes there are 4 sub-strings stuck together). Each one starts with a length byte, for example 4D or 4E. Below is an example long string that consists of 4 sub-strings:
4D44B9096268182113077A95C84005D55FCD9D79476DDA4346C7EF1F4F07D4B46693F51812C8B74E4E44B9097368182113077A340040058D55E7E8D3924C57182F6E07A4D3617E100D1652169668636CB54E44B9096868182113077A37004005705FE9461E85F69A4C8E1B00CE03E6337B8F3D853A51C447B9694E44B9096668182113077AA400400555C9FAADA21F1EC93DBD5B579E4E07DDAF75A45D095E72010DBB
After splitting by pattern, the output SHOULD BE:
4D44B9096268182113077A95C84005D55FCD9D79476DDA4346C7EF1F4F07D4B46693F51812C8B74E
4E44B9097368182113077A340040058D55E7E8D3924C57182F6E07A4D3617E100D1652169668636CB5
4E44B9096868182113077A37004005705FE9461E85F69A4C8E1B00CE03E6337B8F3D853A51C447B969
4E44B9096668182113077AA400400555C9FAADA21F1EC93DBD5B579E4E07DDAF75A45D095E72010DBB
Each long string has an ID (in this case 44B909), and each line has this ID right after the length byte. My original code took the first 6 characters (4D44B9) and split the string on them. That works in 95% of cases, where EACH line has the same length, for example 4D. The problem is that the lines don't always have the same length, as in the string above. Here is my code:
def repeat():
    string = input('Please paste string below:'+'\n')
    code = string[:6]
    print('\n')
    print('SPLITTED:')
    string = string.replace(code, '\n'+'\n'+code)
    print(string)

while True:
    repeat()
When you paste this long string, it isn't split, because the first line starts with 4D and the rest start with 4E. I'd like it to "ignore" (for a moment) the first 2 characters (4D or 4E) and use the next six characters as the split pattern. The output should be the 4 lines above! I changed the code a bit, but I got strange results like the ones below:
44B9096268182113077A95C84005D55FCD9D79476DDA4346C7EF1F4F07D4B46693F51812C8B74E
44B9097368182113077A340040058D55E7E8D3924C57182F6E07A4D3617E100D1652169668636CB54E
44B9096868182113077A37004005705FE9461E85F69A4C8E1B00CE03E6337B8F3D853A51C447B9694E
44B9096668182113077AA400400555C9FAADA21F1EC93DBD5B579E4E07DDAF75A45D095E72010DBB
How can I make this work?
If the first two characters encode the string's length in hex, why do you not use that to decide how much of the string to consume?
However, the prefixes in your example don't seem to match the lengths: 4D is decimal 77 and 4E is decimal 78, but the first record has 78 characters after the prefix and the other three have 80 (each is two characters longer).
For the question about how to split on a slightly variable pattern, a regular expression seems like a good solution.
import re
splitted = re.split(r'4[DE](?=44B909)', string)
In so many words, this says "use 4D or 4E as the delimiter to split on, but only if it's immediately followed by 44B909".
(There will be an empty string before the first value, but that's easy to drop; or change the regex to r'(?<!^)4[DE](?=44B909)'.)
If you don't want to discard anything, include everything in the lookahead:
splitted = re.split(r'(?<!^)(?=4[DE]44B909)', string)
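As a sketch of how the lookahead split could be generalized so the ID is not hard-coded: the helper below (split_records is a made-up name) takes the six characters after the first two as the ID, as the question suggests, and splits before any two hex characters followed by that ID. It assumes Python 3.7 or later, where re.split accepts patterns that match the empty string, and it assumes the ID never shows up inside a record's payload.

```python
import re


def split_records(data):
    """Split concatenated records of the form <2 hex chars><6-char ID><payload>."""
    record_id = data[2:8]  # skip the two length characters, take the next six
    # Zero-width pattern: not at the very start, and followed by
    # two hex characters plus the ID. Nothing is consumed by the split.
    pattern = r'(?<!^)(?=[0-9A-Fa-f]{2}' + re.escape(record_id) + r')'
    return re.split(pattern, data)


print(split_records('4D44B909AA' + '4E44B909BBCC'))
# ['4D44B909AA', '4E44B909BBCC']
```

If the payload can contain the ID by chance, this splits in the wrong place; in that case a reliable length prefix would be the safer thing to parse.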
I'm trying to convert binary to decimal to ASCII. Using this code, I'm able to take a binary input and split it into chunks of 7 bits.
def binary_to_ascii7bits(bstring):
    n = 7
    byte = [bstring[i:i+n] for i in range(0, len(bstring), n)]
    print(byte)
I need to be able to turn each 7-bit substring into a decimal number in order to use the chr function. If I try to turn this list into a string, it prints for example, "['1111000']", but I cannot have the brackets and apostrophes in the string. What can I do to fix this?
First of all, the chr function takes an integer, so each 7-bit substring has to be parsed as a base-2 number first.
Add this list comprehension before the print call:
byte = [chr(int(i, 2)) for i in byte]
This turns each chunk into its character. I think this is what you want.
You can add just one more line (as below) to achieve what you described.
You have int(..., 2) to convert the string representation of a binary number into an integer. Then apply chr to get a character. This procedure is done using (list) comprehension, so that the result is a list of characters. Then use join to make a single string.
text = '1111000' * 10

def binary_to_ascii7bits(bstring):
    n = 7
    byte = [bstring[i:i+n] for i in range(0, len(bstring), n)]
    return ''.join(chr(int(x, 2)) for x in byte)

print(binary_to_ascii7bits(text))  # xxxxxxxxxx
For storage in an Oracle table (whose field lengths are defined in bytes), I need to truncate strings in Python 3 to a maximum length in bytes, bearing in mind that the strings can contain multi-byte UTF-8 characters.
My solution is to concatenate the result string character by character from the original string and check when the result string exceeds the length limit:
def cut_str_to_bytes(s, max_bytes):
    """
    Ensure that a string has not more than max_bytes bytes
    :param s: The string (utf-8 encoded)
    :param max_bytes: Maximal number of bytes
    :return: The cut string
    """
    def len_as_bytes(s):
        return len(s.encode(errors='replace'))

    if len_as_bytes(s) <= max_bytes:
        return s

    res = ""
    for c in s:
        old = res
        res += c
        if len_as_bytes(res) > max_bytes:
            res = old
            break
    return res
This is obviously rather slow. What is an efficient way to do this?
PS: I saw "Truncate a string to a specific number of bytes in Python", but the solution there, sys.getsizeof(), does not give the number of bytes of the string's characters. It gives the size of the whole string object (Python needs some bytes to manage the object), so that does not really help.
It is valid to cut a UTF-8 string anywhere except in the middle of a multibyte character. So, if you want the longest UTF-8 string within a maximum byte length, what you need is to first take the max bytes and then reduce it as long as it has an unfinished character at the end.
Compared to your solution, which goes character by character and re-encodes the growing result at every step, this one just removes up to 3 bytes from the end (because a UTF-8 character is never longer than 4 bytes).
RFC 3629 specifies these as valid UTF-8 byte sequences:
Char. number range | UTF-8 octet sequence
(hexadecimal) | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
So, the simplest way to end up with a valid UTF-8 stream:
if the last byte is 0xxxxxxx, all is fine
otherwise, find the last 11xxxxxx byte within the final 4 bytes and check, based on the table above, whether the character it starts is complete
Therefore, this should work:
def cut_str_to_bytes(s, max_bytes):
    # cut it twice to avoid encoding potentially GBs of `s` just to get e.g. 10 bytes?
    b = s[:max_bytes].encode('utf-8')[:max_bytes]

    if b[-1] & 0b10000000:
        last_11xxxxxx_index = [i for i in range(-1, -5, -1)
                               if b[i] & 0b11000000 == 0b11000000][0]
        # note that last_11xxxxxx_index is negative

        last_11xxxxxx = b[last_11xxxxxx_index]
        if not last_11xxxxxx & 0b00100000:
            last_char_length = 2
        elif not last_11xxxxxx & 0b00010000:
            last_char_length = 3
        elif not last_11xxxxxx & 0b00001000:
            last_char_length = 4

        if last_char_length > -last_11xxxxxx_index:
            # remove the incomplete character
            b = b[:last_11xxxxxx_index]

    return b.decode('utf-8')
Alternatively, you may try decoding the last bytes, rather than doing the low-level stuff, but I'm not sure the code would be simpler that way...
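For comparison, a minimal sketch of the decode-based alternative mentioned above. It relies on the errors='ignore' handler of bytes.decode to drop the incomplete sequence left at the end by the cut. The function name is made up, and max_bytes is assumed to be non-negative.

```python
def cut_str_to_bytes_by_decoding(s, max_bytes):
    """Truncate s so that its UTF-8 encoding fits in max_bytes bytes."""
    # Every character encodes to at least 1 byte, so the first max_bytes
    # characters are always enough to fill max_bytes bytes.
    encoded = s[:max_bytes].encode('utf-8')[:max_bytes]
    # The slice may end in the middle of a multi-byte character;
    # errors='ignore' discards those dangling bytes instead of raising.
    return encoded.decode('utf-8', errors='ignore')


print(cut_str_to_bytes_by_decoding('héllo', 2))  # h
print(cut_str_to_bytes_by_decoding('héllo', 3))  # hé
```

The trade-off is that errors='ignore' would also silently drop any other invalid bytes, which cannot occur here because the bytes come straight from encoding a str.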
Note: The function shown here works for strings which are longer than two characters. A version which also covers the edge cases of shorter strings can be found on GitHub.
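The GitHub version is not reproduced here. As a rough sketch of one way to cover short and empty inputs, the variant below looks at the first byte that would be cut off: if it is a continuation byte (10xxxxxx), the cut falls inside a character and is moved back to that character's start. The function name is hypothetical.

```python
def cut_str_to_bytes_safe(s, max_bytes):
    """Truncate s to at most max_bytes UTF-8 bytes without splitting a character."""
    max_bytes = max(max_bytes, 0)
    # Each character needs at least 1 byte, so no more than
    # max_bytes characters can fit; encode only that prefix.
    prefix = s[:max_bytes]
    encoded = prefix.encode('utf-8')
    if len(encoded) <= max_bytes:
        return prefix

    cut = max_bytes
    # encoded[cut] is the first byte that gets dropped. While it is a
    # continuation byte, the cut point is inside a character: step back.
    while cut > 0 and (encoded[cut] & 0b11000000) == 0b10000000:
        cut -= 1
    return encoded[:cut].decode('utf-8')


print(cut_str_to_bytes_safe('a😀', 4))  # a
```

Because it never indexes from the end of the byte string, this works for inputs of any length, including the empty string.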
Is there a way to add comments into a multiline string, or is it not possible? I'm trying to write data into a csv file from a triple-quote string. I'm adding comments in the string to explain the data. I tried doing this, but Python just assumed that the comment was part of the string.
"""
1,1,2,3,5,8,13 # numbers to the Fibonacci sequence
1,4,9,16,25,36,49 # numbers of the square number sequence
1,1,2,5,14,42,132,429 # numbers in the Catalan number sequence
"""
No, it's not possible to have comments in a string. How would Python know that the hash sign # in your string is supposed to be a comment, and not just a hash sign? It makes a lot more sense to interpret the # character as part of the string than as a comment.
As a workaround, you can make use of automatic string literal concatenation:
(
"1,1,2,3,5,8,13\n" # numbers to the Fibonacci sequence
"1,4,9,16,25,36,49\n" # numbers of the square number sequence
"1,1,2,5,14,42,132,429" # numbers in the Catalan number sequence
)
If you add comments into the string, they become part of the string. If that weren't true, you'd never be able to use a # character in a string, which would be a pretty serious problem.
However, you can post-process the string to remove comments, as long as you know this particular string isn't going to have any other # characters.
For example:
import re

s = """
1,1,2,3,5,8,13 # numbers to the Fibonacci sequence
1,4,9,16,25,36,49 # numbers of the square number sequence
1,1,2,5,14,42,132,429 # numbers in the Catalan number sequence
"""
s = re.sub(r'#.*', '', s)
If you also want to remove trailing whitespace before the #, change the regex to r'\s*#.*'.
If you don't understand what these regexes are matching and how, see regex101 for a nice visualization.
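To see what the comment stripping leaves behind, here is a small sketch that cleans the string and then parses it with the csv module, since the question is about CSV data. The helper name strip_hash_comments is made up, and the pattern uses [ \t]* rather than \s* so that it only eats spaces and tabs before the #, never a newline. It still assumes no # appears in the data itself.

```python
import csv
import re


def strip_hash_comments(text):
    """Remove '#' through end of line, plus any spaces or tabs before it."""
    return re.sub(r'[ \t]*#.*', '', text)


raw = """
1,1,2,3,5,8,13 # numbers to the Fibonacci sequence
1,4,9,16,25,36,49 # numbers of the square number sequence
1,1,2,5,14,42,132,429 # numbers in the Catalan number sequence
"""

# strip() drops the blank first line and the trailing newline
cleaned = strip_hash_comments(raw).strip()
rows = list(csv.reader(cleaned.splitlines()))
print(rows[0])  # ['1', '1', '2', '3', '5', '8', '13']
```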
If you plan to do this many times in the same program, you can even use a trick similar to the popular D = textwrap.dedent idiom:
import functools
import re

C = functools.partial(re.sub, r'#.*', '')
And now:
s = C("""
1,1,2,3,5,8,13 # numbers to the Fibonacci sequence
1,4,9,16,25,36,49 # numbers of the square number sequence
1,1,2,5,14,42,132,429 # numbers in the Catalan number sequence
""")
I want to create a string 5 bytes long that will be sent through a socket. I want to send 255 packets, incrementing the third byte each time. How can I do that?
Something like this code:
i=0
while True:
    a="\x3f\x4f"+hex(i)+"\x0D\x0A"
    socket.send(a)
    i=i+1
The problem is that this code inserts the text 0x0 (bytes 30 78 30) instead of the single byte 00 in the first iteration, for example.
Thank you
I think you're a bit confused here.
\x3f is a single character (the same character as ?).
If i is, say, 63 (hex 3F), you don't want to add the separate characters \, x, 3, and f to the string; you want to add the single character \x3f. Likewise, if it's 0 (hex 00), you don't want to add the separate characters \, x, 0, and 0; you want to add the single character \x00.
That's exactly what the chr function is for:
Return a string of one character whose ASCII code is the integer i. For example, chr(97) returns the string 'a'…
By contrast, the function hex will:
[c]onvert an integer number (of any size) to a lowercase hexadecimal string prefixed with “0x”…
So, hex(97) returns the four-character string '0x61'.
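The chr quote above describes Python 2, where strings are byte strings. In Python 3, sockets send bytes objects, so the same idea can be sketched with bytes([i]), which builds a single byte from an integer in the range 0 to 255. The helper names below are made up for illustration, and the count of 255 packets is taken from the question.

```python
def build_packet(i):
    """Return the 5-byte packet 3F 4F <i> 0D 0A."""
    if not 0 <= i <= 255:
        raise ValueError("the third byte must be in the range 0..255")
    # bytes([i]) is one raw byte with value i, not the text of hex(i)
    return b"\x3f\x4f" + bytes([i]) + b"\x0d\x0a"


def build_packets(count=255):
    """Return `count` packets whose third byte runs 0, 1, 2, ..."""
    return [build_packet(i) for i in range(count)]


print(build_packet(0))  # b'?O\x00\r\n'
```

Each packet could then be passed to sendall on a connected socket object inside a for loop over build_packets().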