Split array byte string in Python - python

I'm trying to split a string of bytes like this:
'\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'
into something like this:
'\xf0\x9f\x98\x84', '\xf0\x9f\x98\x83', etc.
However, the split() method returns me something like this:
'xf0', 'x9f' 'x98' etc.
I tried split(" "), but it does not seem to work. How do I achieve the above mentioned?

str.split(' ') or even just str.split() (split on arbitrary-width whitespace) works just fine on your input:
sample = '\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'
parts = sample.split()
Demo:
>>> sample = '\xf0\x9f\x98\x84 \xf0\x9f\x98\x83 \xf0\x9f\x98\x80 \xf0\x9f\x98\x8a \xe2\x98\xba \xf0\x9f\x98\x89 \xf0\x9f\x98\x8d \xf0\x9f\x98\x98 \xf0\x9f\x98\x9a \xf0\x9f\x98\x97 \xf0\x9f\x98\x99 \xf0\x9f\x98\x9c \xf0\x9f\x98\x9d \xf0\x9f\x98\x9b \xf0\x9f\x98\x81 \xf0\x9f\x98\x82 \xf0\x9f\x98\x85 \xf0\x9f\x98\x86 \xf0\x9f\x98\x8b \xf0\x9f\x98\x8e \xf0\x9f\x98\xac \xf0\x9f\x98\x87'
>>> sample.split()
['\xf0\x9f\x98\x84', '\xf0\x9f\x98\x83', '\xf0\x9f\x98\x80', '\xf0\x9f\x98\x8a', '\xe2\x98\xba', '\xf0\x9f\x98\x89', '\xf0\x9f\x98\x8d', '\xf0\x9f\x98\x98', '\xf0\x9f\x98\x9a', '\xf0\x9f\x98\x97', '\xf0\x9f\x98\x99', '\xf0\x9f\x98\x9c', '\xf0\x9f\x98\x9d', '\xf0\x9f\x98\x9b', '\xf0\x9f\x98\x81', '\xf0\x9f\x98\x82', '\xf0\x9f\x98\x85', '\xf0\x9f\x98\x86', '\xf0\x9f\x98\x8b', '\xf0\x9f\x98\x8e', '\xf0\x9f\x98\xac', '\xf0\x9f\x98\x87']
However, if this is binary data, you need to be careful that there are no \x20 bytes in those 4-byte values. It might be better to just produce chunks of 5 bytes from this, then remove the last byte:
for i in range(0, len(sample), 5):
chunk = sample[i:i + 4] # ignore the 5th byte, a space
Demo:
>>> for i in range(0, len(sample), 5):
... chunk = sample[i:i + 4] # ignore the 5th byte, a space
... print chunk.decode('utf8')
... if i == 20: break
...
😄
😃
😀
😊
# On browsers that support it, those are various smiling emoji

Related

How to add incoming characters from UART to a string?

How do I add up incoming individual characters over UART to form a string?
For exmaple:
The characters from UART are printed in following format:
\x02
1
2
3
4
5
6
7
\X03
\x02
a
b
c
d
e
f
g
\x03
and I would like output to be something like:
1234567
abcdefg
I have tried this so far:
#!/usr/bin/env python
import time
import serial
ser = serial.Serial('/dev/ttyUSB0',38400)
txt = ""
ser.flushInput()
ser.flushOutput()
while 1:
bytesToRead = ser.inWaiting()
data_raw = ser.read(1)
while 1:
if data_raw !='\x02' or data_raw !='\x03':
txt += data_raw
elif data_raw == '\x03':
break
print txt
Any ideas on how to do that? I am getting no output using this.
First of all, you don't need to call inWaiting: read will block until data is available unless you explicitly set a read timeout. Secondly, if you do insist on using it, keep in mind that the function inWaiting has been replaced by the property in_waiting.
The string /x03 is a 4-character string containing all printable characters. The string \x03, on the other hand, contains only a single non printable character with ASCII code 3. Backslash is the escape character in Python strings. \x followed by two digits is an ASCII character code. Please use backslashes where they belong. This is the immediate reason you are not seeing output: the four character string can never appear in a one-character read.
That being out of the way, the most important thing to remember is to clear your buffer when you encounter a terminator character. Let's say you want to use the inefficient method of adding to your string in place. When you reach \x03, you should print txt and just reset it back to '' instead of breaking out of the loop. A better way might be to use bytearray, which is a mutable sequence. Keep in mind also that read returns bytes, not strings in Python 3.x. This means that you should be decoding the result if you want text: txt = txt.decode('ascii').
I would suggest a further improvement, and create an infinite generator function to split your steam into strings. You could use that generator to print the strings or do anything else you wanted with them:
def getstrings(port):
buf = bytearray()
while True:
b = port.read(1)
if b == b'\x02':
del buf[:]
elif b == b'\x03':
yield buf.decode('ascii')
else:
buf.append(b)
for item in getstring(Serial(...)):
print(item)
Here's one way that I think would help you a bit
l = []
while 1:
data_raw = ser.read(1)
if data_raw !='/x02' or data_raw !='/x03':
l.append(data_raw)
elif data_raw == '/x03':
txt = "-".join(l)
print txt
I begin by creating an empty list and each time you receive a new raw_data. you append it to the list. Once you reach your ending character, you create your string and print it.
I took the liberty of removing one of the loop here to give you a simpler approach. The code will not break by itself at the end (just add a break after the print if you want to). It will print out the result whenever you reach the ending character and then wait for a next data stream to start.
If you want to see an intermediate result, you can print out each data_raw and could add right after a print of the currently joined list.
Be sure to set a timeout value to None when you open your port. That way, you will simply wait to receive one bit, process it and then return to read one bit at a time.
ser = serial.Serial(port='/dev/ttyUSB0', baudrate=38400, timeout=None)
http://pyserial.readthedocs.io/en/latest/pyserial_api.html for mor information on that.

Read Null terminated string in python

I'm trying to read a null terminated string but i'm having issues when unpacking a char and putting it together with a string.
This is the code:
def readString(f):
str = ''
while True:
char = readChar(f)
str = str.join(char)
if (hex(ord(char))) == '0x0':
break
return str
def readChar(f):
char = unpack('c',f.read(1))[0]
return char
Now this is giving me this error:
TypeError: sequence item 0: expected str instance, int found
I'm also trying the following:
char = unpack('c',f.read(1)).decode("ascii")
But it throws me:
AttributeError: 'tuple' object has no attribute 'decode'
I don't even know how to read the chars and add it to the string, Is there any proper way to do this?
Here's a version that (ab)uses __iter__'s lesser-known "sentinel" argument:
with open('file.txt', 'rb') as f:
val = ''.join(iter(lambda: f.read(1).decode('ascii'), '\x00'))
How about:
myString = myNullTerminatedString.split("\x00")[0]
For example:
myNullTerminatedString = "hello world\x00\x00\x00\x00\x00\x00"
myString = myNullTerminatedString.split("\x00")[0]
print(myString) # "hello world"
This works by splitting the string on the null character. Since the string should terminate at the first null character, we simply grab the first item in the list after splitting. split will return a list of one item if the delimiter doesn't exist, so it still works even if there's no null terminator at all.
It also will work with byte strings:
myByteString = b'hello world\x00'
myStr = myByteString.split(b'\x00')[0].decode('ascii') # "hello world" as normal string
If you're reading from a file, you can do a relatively larger read - estimate how much you'll need to read to find your null string. This is a lot faster than reading byte-by-byte. For example:
resultingStr = ''
while True:
buf = f.read(512)
resultingStr += buf
if len(buf)==0: break
if (b"\x00" in resultingStr):
extraBytes = resultingStr.index(b"\x00")
resultingStr = resultingStr.split(b"\x00")[0]
break
# now "resultingStr" contains the string
f.seek(0 - extraBytes,1) # seek backwards by the number of bytes, now the pointer will be on the null byte in the file
# or f.seek(1 - extraBytes,1) to skip the null byte in the file
(edit version 2, added extra way at the end)
Maybe there are some libraries out there that can help you with this, but as I don't know about them lets attack the problem at hand with what we know.
In python 2 bytes and string are basically the same thing, that change in python 3 where string is what in py2 is unicode and bytes is its own separate type, which mean that you don't need to define a read char if you are in py2 as no extra work is required, so I don't think you need that unpack function for this particular case, with that in mind lets define the new readString
def readString(myfile):
chars = []
while True:
c = myfile.read(1)
if c == chr(0):
return "".join(chars)
chars.append(c)
just like with your code I read a character one at the time but I instead save them in a list, the reason is that string are immutable so doing str+=char result in unnecessary copies; and when I find the null character return the join string. And chr is the inverse of ord, it will give you the character given its ascii value. This will exclude the null character, if its needed just move the appending...
Now lets test it with your sample file
for instance lets try to read "Sword_Wea_Dummy" from it
with open("sword.blendscn","rb") as archi:
#lets simulate that some prior processing was made by
#moving the pointer of the file
archi.seek(6)
string=readString(archi)
print "string repr:", repr(string)
print "string:", string
print ""
#and the rest of the file is there waiting to be processed
print "rest of the file: ", repr(archi.read())
and this is the output
string repr: 'Sword_Wea_Dummy'
string: Sword_Wea_Dummy
rest of the file: '\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6=\x00\x00\x00\x00\xeaQ8?\x9e\x8d\x874$-i\xb3\x00\x00\x00\x00\x9b\xc6\xaa2K\x15\xc6=;\xa66?\x00\x00\x00\x00\xb8\x88\xbf#\x0e\xf3\xb1#ITuB\x00\x00\x80?\xcd\xcc\xcc=\x00\x00\x00\x00\xcd\xccL>'
other tests
>>> with open("sword.blendscn","rb") as archi:
print readString(archi)
print readString(archi)
print readString(archi)
sword
Sword_Wea_Dummy
ÍÌÌ=p=Š4:¦6¿JÆ=
>>> with open("sword.blendscn","rb") as archi:
print repr(readString(archi))
print repr(readString(archi))
print repr(readString(archi))
'sword'
'Sword_Wea_Dummy'
'\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6='
>>>
Now that I think about it, you mention that the data portion is of fixed size, if that is true for all files and the structure on all of them is as follow
[unknow size data][know size data]
then that is a pattern we can exploit, we only need to know the size of the file and we can get both part smoothly as follow
import os
def getDataPair(filename,knowSize):
size = os.path.getsize(filename)
with open(filename, "rb") as archi:
unknown = archi.read(size-knowSize)
know = archi.read()
return unknown, know
and by knowing the size of the data portion, its use is simple (which I get by playing with the prior example)
>>> strins_data, data = getDataPair("sword.blendscn", 80)
>>> string_data, data = getDataPair("sword.blendscn", 80)
>>> string_data
'sword\x00Sword_Wea_Dummy\x00'
>>> data
'\xcd\xcc\xcc=p=\x8a4:\xa66\xbfJ\x15\xc6=\x00\x00\x00\x00\xeaQ8?\x9e\x8d\x874$-i\xb3\x00\x00\x00\x00\x9b\xc6\xaa2K\x15\xc6=;\xa66?\x00\x00\x00\x00\xb8\x88\xbf#\x0e\xf3\xb1#ITuB\x00\x00\x80?\xcd\xcc\xcc=\x00\x00\x00\x00\xcd\xccL>'
>>> string_data.split(chr(0))
['sword', 'Sword_Wea_Dummy', '']
>>>
Now to get each string a simple split will suffice and you can pass the rest of the file contained in data to the appropriated function to be processed
Doing file I/O one character at a time is horribly slow.
Instead use readline0, now on pypi: https://pypi.org/project/readline0/ . Or something like it.
In 3.x, there's a "newline" argument to open, but it doesn't appear to be as flexible as readline0.
Here is my implementation:
import struct
def read_null_str(f):
r_str = ""
while 1:
back_offset = f.tell()
try:
r_char = struct.unpack("c", f.read(1))[0].decode("utf8")
except:
f.seek(back_offset)
temp_char = struct.unpack("<H", f.read(2))[0]
r_char = chr(temp_char)
if ord(r_char) == 0:
return r_str
else:
r_str += r_char

Python Read Certain Number of Bytes After Character

I'm dealing with a character separated hex file, where each field has a particular start code. I've opened the file as 'rb', but I was wondering, after I get the index of the startcode using .find, how do I read a certain number of bytes from this position?
This is how I am loading the file and what I am attempting to do
with open(someFile, 'rb') as fileData:
startIndex = fileData.find('(G')
data = fileData[startIndex:7]
where 7 is the number of bytes I want to read from the index returned by the find function. I am using python 2.7.3
You can get the position of a substring in a bytestring under python2.7 like this:
>>> with open('student.txt', 'rb') as f:
... data = f.read()
...
>>> data # holds the French word for student: élève
'\xc3\xa9l\xc3\xa8ve\n'
>>> len(data) # this shows we are dealing with bytes here, because "élève\n" would be 6 characters long, had it been properly decoded!
8
>>> len(data.decode('utf-8'))
6
>>> data.find('\xa8') # continue with the bytestring...
4
>>> bytes_to_read = 3
>>> data[4:4+bytes_to_read]
'\xa8ve'
You can look for the special characters, and for compatibility with Python3k, it's better if you prepend the character with a b, indicating these are bytes (in Python2.x, it will work without though):
>>> data.find(b'è') # in python2.x this works too (unfortunately, because it has lead to a lot of confusion): data.find('è')
3
>>> bytes_to_read = 3
>>> pos = data.find(b'è')
>>> data[pos:pos+bytes_to_read] # when you use the syntax 'n:m', it will read bytes in a bytestring
'\xc3\xa8v'
>>>

Python: converting hex values, stored as string, to hex data

(Answer found. Close the topic)
I'm trying to convert hex values, stored as string, in to hex data.
I have:
data_input = 'AB688FB2509AA9D85C239B5DE16DD557D6477DEC23AF86F2AABD6D3B3E278FF9'
I need:
data_output = '\xAB\x68\x8F\xB2\x50\x9A\xA9\xD8\x5C\x23\x9B\x5D\xE1\x6D\xD5\x57\xD6\x47\x7D\xEC\x23\xAF\x86\xF2\xAA\xBD\x6D\x3B\x3E\x27\x8F\xF9'
I was trying data_input.decode('hex'), binascii.unhexlify(data_input) but all they return:
"\xabh\x8f\xb2P\x9a\xa9\xd8\\#\x9b]\xe1m\xd5W\xd6G}\xec#\xaf\x86\xf2\xaa\xbdm;>'\x8f\xf9"
What should I write to receive all bytes in '\xFF' view?
updating:
I need representation in '\xFF' view to write this data to a file (I'm opening file with 'wb') as:
«hЏІPљ©Ш\#›]бmХWЦG}м#Ї†тЄЅm;>'Џщ
update2
Sorry for bothering. An answer lies under my nose all the time:
data_output = data_input.decode('hex')
write_file(filename, data_output) #just opens a file 'wb', ant write a data in it
gives the same result as I need
I like chopping strings into fixed-width chunks using re.findall
print '\\x' + '\\x'.join(re.findall('.{2}', data_input))
If you want to actually convert the string into a list of ints, you can do that like this:
data = [int(x, 16) for x in re.findall('.{2}', data_input)]
It's an inefficient solution, but there's always:
flag = True
data_output = ''
for char in data_input:
if flag:
buffer = char
flag = False
else:
data_output = data_output + '\\x' + buffer + char
flag = True
EDIT HOPEFULLY THE LAST: Who knew I could mess up in so many different ways on that simple a loop? Should actually run now...
>>> int('0x10AFCC', 16)
1093580
>>> hex(1093580)
'0x10afcc'
So prepend your string with '0x' then do the above

How to find null byte in a string in Python?

I'm having an issue parsing data after reading a file. What I'm doing is reading a binary file in and need to create a list of attributes from the read file all of the data in the file is terminated with a null byte. What I'm trying to do is find every instance of a null byte terminated attribute.
Essentially taking a string like
Health\x00experience\x00charactername\x00
and storing it in a list.
The real issue is I need to keep the null bytes in tact, I just need to be able to find each instance of a null byte and store the data that precedes it.
Python doesn't treat NUL bytes as anything special; they're no different from spaces or commas. So, split() works fine:
>>> my_string = "Health\x00experience\x00charactername\x00"
>>> my_string.split('\x00')
['Health', 'experience', 'charactername', '']
Note that split is treating \x00 as a separator, not a terminator, so we get an extra empty string at the end. If that's a problem, you can just slice it off:
>>> my_string.split('\x00')[:-1]
['Health', 'experience', 'charactername']
While it boils down to using split('\x00') a convenience wrapper might be nice.
def readlines(f, bufsize):
buf = ""
data = True
while data:
data = f.read(bufsize)
buf += data
lines = buf.split('\x00')
buf = lines.pop()
for line in lines:
yield line + '\x00'
yield buf + '\x00'
then you can do something like
with open('myfile', 'rb') as f:
mylist = [item for item in readlines(f, 524288)]
This has the added benefit of not needing to load the entire contents into memory before splitting the text.
To check if string has NULL byte, simply use in operator, for example:
if b'\x00' in data:
To find the position of it, use find() which would return the lowest index in the string where substring sub is found. Then use optional arguments start and end for slice notation.
Split on null bytes; .split() returns a list:
>> print("Health\x00experience\x00charactername\x00".split("\x00"))
['Health', 'experience', 'charactername', '']
If you know the data always ends with a null byte, you can slice the list to chop off the last empty string (like result_list[:-1]).

Categories