How to capture all characters in binary string without python interpreting it - python

Here is how I reproduce the problem:
Create a log file called 'temp.log' and paste this line into it
DEBUG: packetReceived '\x61\x62\x63'
I want to have a script which will read the line from the log file and decode the binary string part ('\x61\x62\x63'). For the decoding, I am using struct, so:
struct.unpack('BBB', '\x61\x62\x63')
Should give me
(97, 98, 99)
Here is the script which I am using
import re
import struct
import sys

f = open(sys.argv[1], 'r')
for line in f:
    print line
    packet = re.compile(r"packetReceived \'(.*)\'").search(line).group(1)
    # packet is the string r'\x61\x62\x63'
    # (note: assert(len(packet), 12) would test a non-empty tuple and always pass)
    assert len(packet) == 12
    # this works ok (returns (97, 98, 99))
    struct.unpack('BBB', '\x61\x62\x63')
    # this fails because packet is really '\\x61\\x62\\x63' (12 characters, not 3)
    struct.unpack('BBB', packet)
I run the script using temp.log as the argument to the script.
Hopefully the comments highlight my problem. How can I get the variable packet to be interpreted as '\x61\x62\x63' ??
ASIDE: On the first edit of this question, I assumed that reading the line from the file was the same as this:
line = "DEBUG: packetReceived '\x61\x62\x63'"
which made packet == 'abc'
however it is actually the same as this (using rawstring)
line = r"DEBUG: packetReceived '\x61\x62\x63'"

Python doesn't interpret strings that you pass to regular expressions. The escape sequences were most likely interpreted earlier, when you defined variable line. This works correctly for example:
line = r"DEBUG: packetReceived '\x61\x62\x63'"
print re.compile(r"packetReceived '(.*)'").search(line).group(1)
It prints \x61\x62\x63.

>>> re.compile(r"packetReceived '(.*)'").search(r"DEBUG: packetReceived '\x61\x62\x63'").group(1)
'\\x61\\x62\\x63'
Nope, that line is not where your problem lies.

As described in your question, packet is equal to '\x61\x62\x63'. Its len is 12 bytes, neither 15 nor 3 bytes.
What confuses you, is that ipython (which I understand you are using) and the python interpreter display values using the repr() call, which tries to format values as they would be in your code. Since backslashes are special in Python string constants, repr() displays them duplicated, as they would be in Python code.
This might be of help:
for char in packet:
    print("%5d %2s %2r" % (ord(char), char, char))
Count your characters and see how they are printed. First column displays the ordinal value of the character, second column has the character itself, third column has the repr of the character.
EDIT
Change the last line:
struct.unpack('BBB', packet)
to:
struct.unpack('BBB', packet.decode('string_escape'))
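The 'string_escape' codec only exists in Python 2. As a hedged sketch of one possible Python 3 equivalent (not something the original answer covers), the escaped text can be run through the 'unicode_escape' codec and then turned back into bytes with latin-1, which maps code points 0-255 to the same byte values:

```python
import re
import struct

# Same content as the line in temp.log; a raw string keeps the backslashes.
line = r"DEBUG: packetReceived '\x61\x62\x63'"

escaped = re.search(r"packetReceived '(.*)'", line).group(1)  # 12 characters

# Interpret the \xNN escapes, then convert the result to real bytes.
packet = escaped.encode("ascii").decode("unicode_escape").encode("latin-1")

values = struct.unpack("BBB", packet)
print(values)  # (97, 98, 99)
```

This assumes the log contains only ASCII text and \xNN style escapes.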

If you're sure you are receiving twelve characters and not just three represented as twelve, it may be just the printing of the string that is causing you grief.
Compare:
>>> print '\x61\x62\x63'
abc
>>> print r'\x61\x62\x63'
\x61\x62\x63
My 50c is on you actually receiving three characters and them being printed like this:
>>> print ''.join('\\x%02x' % ord(c) for c in 'abc')
\x61\x62\x63


Unicode to ASCII string in Python 2

Trying to make a wall display of current MET data for a selected airport.
This is my first use of a Raspberry Pi 3 and of Python.
The plan is to read from a net data provider and show selected data on a LCD display.
The LCD library seems to work only in Python2. Json data seems to be easier to handle in Python3.
This question python json loads unicode probably addresses my problem, but I do not understand what to do.
So, what shall I do to my code?
Minimal example demonstrating my problem:
#!/usr/bin/python
import I2C_LCD_driver
import urllib2
import urllib
import json
mylcd = I2C_LCD_driver.lcd()
mylcd.lcd_clear()
url = 'https://avwx.rest/api/metar/ESSB'
request = urllib2.Request(url)
response = urllib2.urlopen(request).read()
data = json.loads(response)
str1 = data["Altimeter"], data["Units"]["Altimeter"]
mylcd.lcd_display_string(str1, 1, 0)
The error is as follows:
$python Minimal.py
Traceback (most recent call last):
  File "Minimal.py", line 18, in <module>
    mylcd.lcd_display_string(str1, 1, 0)
  File "/home/pi/I2C_LCD_driver.py", line 161, in lcd_display_string
    self.lcd_write(ord(char), Rs)
TypeError: ord() expected a character, but string of length 4 found
It's a little bit hard to tell without seeing the inside of mylcd.lcd_display_string(), but I think the problem is here:
str1 = data["Altimeter"], data["Units"]["Altimeter"]
I suspect that you want str1 to contain something of type string, with a value like "132 metres". Try adding a print statement just after, so that you can see what str1 contains.
str1 = data["Altimeter"], data["Units"]["Altimeter"]
print( "str1 is: {0}, of type {1}.".format(str1, type(str1)) )
I think you will see a result like:
str1 is: ('foo', 'bar'), of type <type 'tuple'>.
The mention of "type tuple", the parentheses, and the comma indicate that str1 is not a string.
The problem is that the comma in the assignment to str1 does not do concatenation, which is perhaps what you are expecting. It joins the two values into a tuple. To concatenate, a flexible and sufficient method is to use the str.format() method:
str1 = "{0} {1}".format(data["Altimeter"], data["Units"]["Altimeter"])
print( "str1 is: {0}, of type {1}.".format(str1, type(str1)) )
Then I expect you will see a result like:
str1 is: 132 metres, of type <type 'str'>.
That value of type "str" should work fine with mylcd.lcd_display_string().
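A small self-contained sketch of the difference, written for Python 3. The data dictionary below is a made-up stand-in for the JSON response; its values are assumptions, not real API output:

```python
# Hypothetical sample shaped like the fields used in the question.
data = {"Altimeter": "1013", "Units": {"Altimeter": "hPa"}}

# A comma on the right-hand side builds a tuple of two strings.
as_tuple = data["Altimeter"], data["Units"]["Altimeter"]

# str.format() builds one string, which is what a display routine
# that calls ord() on each character needs.
as_text = "{0} {1}".format(data["Altimeter"], data["Units"]["Altimeter"])

print(type(as_tuple).__name__, as_tuple)
print(type(as_text).__name__, as_text)
```

Iterating over the tuple yields whole strings such as "1013", which explains the "string of length 4" in the traceback.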
You are passing in a tuple, not a single string:
str1 = data["Altimeter"], data["Units"]["Altimeter"]
mylcd.lcd_display_string() expects a single string instead. Perhaps you meant to concatenate the two strings:
str1 = data["Altimeter"] + data["Units"]["Altimeter"]
Python 2 will implicitly encode Unicode strings to ASCII byte strings where a byte string is expected; if your data is not ASCII encodable you'd have to encode your data yourself, with a suitable codec, one that maps your Unicode codepoints to the right bytes for your LCD display (I believe you can have different character sets stored in ROM). This likely would involve a manual translation table as non-ASCII LCD ROM character sets seldom correspond with standard codecs.
If you just want to force the string to contain ASCII characters only, encode with the error handler set to 'ignore':
str1 = (data["Altimeter"] + data["Units"]["Altimeter"]).encode('ASCII', 'ignore')
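To illustrate what the 'ignore' error handler does, here is a short Python 3 sketch with a hypothetical reading that contains a non-ASCII degree sign. In Python 3, encode() returns bytes, so the sketch decodes back to str at the end:

```python
# Hypothetical text with one character outside ASCII.
reading = u"22\u00b0C"

# Characters that cannot be encoded are silently dropped.
ascii_bytes = reading.encode("ascii", "ignore")
ascii_text = ascii_bytes.decode("ascii")

print(ascii_bytes)
print(ascii_text)
```

Dropping characters loses information, so a manual translation table may be preferable when the display's character set has a matching glyph.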

Attempting to debug hex instruction, but python clears my console?

I'm writing a driver and am concatenating some hex instructions based on a few conditionals. Up until this point, all instructions have worked as intended.
A new instruction I was working on isn't working as intended, so I attempted to print out the instruction after concatenation and before execution to see what was wrong.
msg = '\xc2%s%s' % ('\x1b\x63', '07')
assert self.dev.ctrl_transfer(0x21, 9, 0x0300, 0, msg) == len(msg)
print(msg)
When I print it after concatenation it clears the console and prints '07' and then continues the rest of the driver execution. I'm able to print and execute every other instruction I've concatenated, such as the following, without issue.
msg = '\xc2%s%s' % ('\x1b\x72', '07')
Does anyone have an idea why this is happening? Does the '\x63' byte tell python to do something I'm unaware of? It should just be concatenated to the rest of the instruction, followed by the '\x07' byte. Note, that if I include the '\x' before the '07' (unlike my code above) it still does the same thing, it just doesn't print '07', it leaves a blank line.
Thanks!
The character '\x63' is the same character as 'c' (and a half-dozen other ways to spell it). The letter c doesn't mean anything special to Python.
The character '\x1b' right before the c is Escape. That doesn't mean anything special to Python either, but it probably does to your terminal. Most terminals use "escape sequences" that start with Escape and end with a letter to do things like scroll up, change the main text color, or clear the screen. On VT100-compatible terminals, Escape followed by c is the "reset to initial state" sequence, which is why your screen is cleared.
If this is getting in the way of an interactive debugging session, you may want to consider printing the repr of the string rather than the string itself. The easiest way to do that is to not even use print:
>>> msg = b'\x1b\x63'
>>> msg
b'\x1bc'
>>> print(repr(msg))
b'\x1bc'
Notice that either way, it includes the b and the quotes, and that it hex-escapes all non-printable bytes. And it works basically the same with Unicode strings instead of byte strings:
>>> msg = '\x1b\x63'
>>> msg
'\x1bc'
>>> print(repr(msg))
'\x1bc'
If you're using Python 2.x, you'll have u prefixes instead of none on the Unicode ones, and no prefixes instead of b on the bytes, but basically the same.
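As a hedged sketch for debugging driver messages without triggering terminal escape sequences (Python 3, with a message assembled from hypothetical byte values), both repr() and bytes.hex() give output that is safe to print:

```python
# Hypothetical instruction: prefix byte, ESC + 'c', then a 0x07 byte.
msg = b"\xc2" + b"\x1b\x63" + b"\x07"

safe_repr = repr(msg)   # escapes non-printable bytes
safe_hex = msg.hex()    # two hex digits per byte

print(safe_repr)
print(safe_hex)
```

Note that the string '07' in the question is the two characters '0' and '7' (bytes 0x30 and 0x37), not the single byte '\x07'; a hex dump makes that kind of difference easy to spot.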

python 3 - reading a file within zipped archive places 'b' character at start of each line

In the code below, I always get a strange output that places b before every line. Just the letter b.
E.g. a sample output is like this:
[b'2017-06-01,15:19:57,']
The script itself is this:
from zipfile import ZipFile
with ZipFile('myarchive.zip','r') as myzip:
    with myzip.open('logs/logfile1.txt') as myfile:
        next(myfile)
        print(myfile.readlines())
The archive has a single folder in it called "logs" and inside logs there are several text files, each with lines below an empty first line (hence the next(myfile)).
It places the b before the data, no matter which file I try to read. If there are multiple lines in a file it outputs something like this:
[b'2017-06-01,15:06:28,start session: \n', b'2017-06-01,15:06:36,stop session']
Why is it placing the pesky b there?
In Python 3.x there is a distinction between strings and bytes data. When representing bytes as strings Python adds b prefix to denote that. If you want to treat your bytes as strings, you first need to decode them into a string:
your_string = your_bytes.decode("utf-8")
Of course, the codec you'll use depends on how your string was encoded into bytes in the first place.
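A self-contained sketch of decoding each line as it is read. It builds a small archive in memory so it can run anywhere; the member name matches the question, while the UTF-8 encoding is an assumption about how the log files were written:

```python
import io
from zipfile import ZipFile

# Build a throwaway archive in memory that mimics the one in the question.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr(
        "logs/logfile1.txt",
        "\n2017-06-01,15:06:28,start session: \n2017-06-01,15:06:36,stop session",
    )
buffer.seek(0)

with ZipFile(buffer, "r") as myzip:
    with myzip.open("logs/logfile1.txt") as myfile:
        next(myfile)  # skip the empty first line
        # Each item read from the archive is bytes; decode it to get str.
        lines = [raw_line.decode("utf-8") for raw_line in myfile]

print(lines)
```

The printed list now contains str objects, so no b prefix appears.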
Because zip is a binary format, reading from it gives bytes instead of str.
You can convert using bytes.decode()
for example
>>> byte_string = b'2017-06-01,15:06:28,start session: \n'
>>> byte_string.decode()
'2017-06-01,15:06:28,start session: \n'
will give you the desired str.
In Python 3, (from the documentation) bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.
The b is just part of how print displays a bytes object (its repr). Formatting the list with "%s" does not remove it, because the list still contains bytes. To output the lines without the prefix, decode each one first:
print([line.decode("utf-8") for line in myfile.readlines()])

Python: How to process pasted text from the clipboard?

I'm processing a string like this:
scrpt = "\tFrame\tX pixels\tY pixels\r\n\t2\t615.5\t334.5\r\n\t3\t615.885\t334.136\r\n\t4\t615.937\t334.087\r\n\t5\t615.917\t334.106\r\n\t6\t615.892\t334.129\r\n\t7\t615.905\t334.117\r\n\t8\t615.767\t334.246\r\n\t9\t615.546\t334.456\r\n\t10\t615.352\t334.643\r\n\r\n"
import csv
from io import StringIO  # on Python 2, use: from StringIO import StringIO
infile = StringIO(scrpt)
#pretend infile was just a regular file...
r = csv.DictReader(infile, dialect=csv.Sniffer().sniff(infile.read(1000)))
infile.seek(0)
Frame, Xco, Yco = [],[],[]
for row in r:
    Frame.append(row['Frame'])
    Xco.append(row['X pixels'])
    Yco.append(row['Y pixels'])
This works fine. I get the string variable 'scrpt' sorted nicely into the variables 'Frame', 'Xco', and 'Yco'
Now if I do this:
print(scrpt)
I see things neatly laid out in tabbed columns like this:
Frame X pixels Y pixels
2 615.5 334.5
3 615.885 334.136
4 615.937 334.087
5 615.917 334.106
6 615.892 334.129
7 615.905 334.117
8 615.767 334.246
9 615.546 334.456
10 615.352 334.643
But if I have the same string pasted from the clipboard and try to process it it doesn't work.
In this case, if I print it like this:
print(scrpt)
I see:
\tFrame\tX pixels\tY pixels\r\n\t2\t615.5\t334.5\r\n\t3\t615.885\t334.136\r\n\t4\t615.937\t334.087\r\n\t5\t615.917\t334.106\r\n\t6\t615.892\t334.129\r\n\t7\t615.905\t334.117\r\n\t8\t615.767\t334.246\r\n\t9\t615.546\t334.456\r\n\t10\t615.352\t334.643\r\n\r\n
Then when I go to process it the csv module won't sort it out.
What am I doing wrong?
It looks like I'm using the same data in both cases but something is different.
My guess is that your clipboard has literal backslash and t characters, not tab characters. For example, if you just copy from the first line of your source, that's exactly what you'll get.
In other words, it's as if you did this:
scrpt = r"\tFrame\tX pixels\tY pixels\r\n\t2\t615.5\t334.5\r\n\t3\t615.885\t334.136\r\n\t4\t615.937\t334.087\r\n\t5\t615.917\t334.106\r\n\t6\t615.892\t334.129\r\n\t7\t615.905\t334.117\r\n\t8\t615.767\t334.246\r\n\t9\t615.546\t334.456\r\n\t10\t615.352\t334.643\r\n\r\n"
… or, equivalently:
scrpt = "\\tFrame\\tX pixels\\tY pixels\\r\\n\\t2\\t615.5\\t334.5\\r\\n\\t3\\t615.885\\t334.136\\r\\n\\t4\\t615.937\\t334.087\\r\\n\\t5\\t615.917\\t334.106\\r\\n\\t6\\t615.892\\t334.129\\r\\n\\t7\\t615.905\\t334.117\\r\\n\\t8\\t615.767\\t334.246\\r\\n\\t9\\t615.546\\t334.456\\r\\n\\t10\\t615.352\\t334.643\\r\\n\\r\\n"
If that's the problem, the fix is pretty easy:
scrpt = scrpt.decode('string_escape')
Or, in 3.x (where you can't call decode on a str):
scrpt = codecs.decode(scrpt, 'unicode_escape')
The unicode_escape codec is described in the list of Standard Encodings in the codecs module. It's defined as:
Produce a string that is suitable as Unicode literal in Python source code
In other words, if you encode with this codec, it will replace each non-printing Unicode character with an escape sequence that you can type into your source code. If you've got a tab character, it'll replace that with a backslash character and a t.
You want to do the exact reverse of that: you've got a string you copied out of source code, with source-code-style escape sequences, and you want to interpret it the same way the Python interpreter does. So, you just decode with the same codec. If you've got a backslash followed by a t, it'll replace them with a tab character.
It's worth playing with this in the interactive interpreter (remember to keep the repr and str representations straight while doing so!) until you get it.
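Putting the pieces together, here is a minimal Python 3 sketch. The pasted variable is a shortened, hypothetical stand-in for the clipboard text, and the tab delimiter is passed explicitly instead of being sniffed, to keep the example small:

```python
import csv
import io

# What the clipboard holds: literal backslash-t and backslash-r-n sequences.
pasted = r"\tFrame\tX pixels\tY pixels\r\n\t2\t615.5\t334.5\r\n\t3\t615.885\t334.136\r\n"

# Interpret the escape sequences the way the Python parser would.
decoded = pasted.encode("ascii").decode("unicode_escape")

# newline="" leaves the \r\n line endings for the csv module to handle.
reader = csv.DictReader(io.StringIO(decoded, newline=""), delimiter="\t")
rows = list(reader)

frames = [row["Frame"] for row in rows]
x_values = [row["X pixels"] for row in rows]
print(frames, x_values)
```

The encode("ascii") step assumes the pasted text is pure ASCII; non-ASCII input would need different handling.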

How do I split a multi-languages line in Python and get the Unicode hex value?

I try to split this kind of lines in Python:
aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"
This line contains Hebrew, simplified Chinese and English.
If I have a tuple T for example, I would like to get the tuple to be T= (Hebrew string, English string, Chinese string).
The problem is that I can't figure out how to get the Unicode value of the Chinese or the Hebrew letters. Both these lines don't work:
print ((unicode("释","utf-8")).encode("utf-8"))
print ((unicode("א","utf-8")).encode("utf-8"))
And I get this error:
SyntaxError: Non-ASCII character '\xe9' in file split_or.py on line 9, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
In Python 2, you need to open the file specifying an encoding like this:
import codecs
f = codecs.open("myfile.txt","r",encoding="utf-8")
In Python 3, you can just add the encoding option to any open() calls.
This will guarantee that the file is correctly decoded. Note that this doesn't mean your print calls will work properly, that depends on many things (see for example http://www.pycs.net/users/0000323/stories/14.html and that's just a start); it's better to either use a proper debugger, or output to a file (which will again be opened with codecs.open() ).
To get the actual codepoint (i.e. integer "value"), you can use the built-in ord():
>>> ord(u"£")
163
if you know the ranges for different languages, that's all you need. See this page or this page for the ranges.
Otherwise, you might want to use unicodedata to look up stuff, like the bidirectional category:
>>> unicodedata.bidirectional(u"£")
ET # 'E'uropean 'T'erminator
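As a rough sketch of using code point ranges to tell the scripts apart (Python 3; the script_of helper and the two ranges it checks are simplifying assumptions, not a complete classification):

```python
import unicodedata

def script_of(ch):
    """Guess a script from the code point; the ranges are deliberately coarse."""
    code = ord(ch)
    if 0x0590 <= code <= 0x05FF:      # Hebrew block
        return "hebrew"
    if 0x4E00 <= code <= 0x9FFF:      # CJK Unified Ideographs block
        return "chinese"
    if code < 128 and ch.isalpha():   # basic Latin letters
        return "latin"
    return "other"

for ch in "释א":
    # Code point in hex, bidirectional category, and the guessed script.
    print("U+%04X" % ord(ch), unicodedata.bidirectional(ch), script_of(ch))

# Split a line into words and label each word by its first character.
line = "aiburenshi 爱不忍释 לא"
labels = [(word, script_of(word[0])) for word in line.split()]
print(labels)
```

Labelling each word by its first character is only a heuristic; mixed-script words or punctuation would need more care.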
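In Python 2, Unicode string constants need to be prefaced with the "u" character, and the source file needs an encoding declaration such as # -*- coding: utf-8 -*- on its first or second line (this is what the SyntaxError is asking for). A u"..." literal is already Unicode, so it must not be passed through unicode() again:
print (u"释".encode("utf-8"))
print (u"א".encode("utf-8"))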
In Python 3, string constants are Unicode by default.
