Python: hexadecimal regular expression question

I want to parse the output of a serial monitoring program called Docklight (which I highly recommend).
It outputs 'hexadecimal' strings: sequences of two capital hex digits, each followed by a space. The corresponding regular expression is ([0-9A-F]{2} )+ - for example: '05 03 DA 4B 3F '.
When the program detects particular sequences of characters, it places comments in the 'hexadecimal' string. For example:
'05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
Comments are strings of the format ' .+ ' (a sequence of characters preceded by a space and followed by a space).
I want to get rid of the comments. For example, the 'hexadecimal' string above, once filtered, would be:
'05 03 04 01 0A 03 08 0B BD AF 0D 0A '
How do I go about doing this with a regular expression?

You could try re.findall():
>>> a='05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
>>> re.findall(r"\b[0-9A-F]{2}\b", a)
['05', '03', '04', '01', '0A', '03', '08', '0B', 'BD', 'AF', '0D', '0A']
The \b in the regular expression matches a "word boundary".
Of course, your input is ambiguous if the serial monitor inserts something like THIS BE THE HEADER.
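If you want the filtered string back rather than a list, you can rejoin the matches; a small sketch building on the findall() call above:

```python
import re

a = '05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
pairs = re.findall(r"\b[0-9A-F]{2}\b", a)
filtered = " ".join(pairs) + " "  # restore the trailing-space format Docklight uses
print(filtered)
```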

Using your regex, made non-capturing (when the pattern contains a group, re.findall returns the group's captures rather than the whole match):
hexa = '(?:[0-9A-F]{2} )+'
"".join(re.findall(hexa, line))
(Like the other findall answers, this can still pick up hex-looking fragments such as the AD inside PAYLOAD.)

It might be easier to find all the hexadecimal numbers, assuming the inserted strings won't contain a match:
>>> data = '05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
>>> import re
>>> pattern = re.compile("[0-9A-F]{2} ")
>>> "".join(pattern.findall(data))
'05 03 04 01 0A 03 08 0B BD AF AD 0D 0A '
(Note the stray AD picked up from inside PAYLOAD - exactly the case where the assumption fails.)
Otherwise you could remove each inserted string explicitly, using lookarounds to decide where it starts and ends:
>>> data = '05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
>>> re.sub(r" (?![0-9A-F]{2}\b).*?(?=( [0-9A-F]{2} )|$)", "", data)
'05 03 04 01 0A 03 08 0B BD AF 0D 0A'
The negative lookahead skips real hexadecimal pairs, and the lazy match with a lookahead works out where the inserted string ends: either at a hexadecimal pair surrounded by spaces or at the end of the source string.

How about a solution that actually uses regex negation? ;)
result = re.sub(r" (?![0-9A-F]{2}\b)[^ ]+", "", subject)

While you already received answers that find all the hexadecimal numbers, here's a direct regular expression that finds all text that does not look like a hexadecimal number (assuming that's two digits/letters in the 0-9 and A-F range, upper or lower case, followed by a space).
Something like this (sorry, I'm not a pythoneer, but you get the idea):
newstring = re.sub(r"[^ ]+(?<![0-9A-Fa-f ]{2})(?<!^.)", "", yourstring)
It works by "looking back". It finds every consecutive non-space substring, then negatively looks back with (?<!...): "if the previous two characters were not a hex number, then succeed". The second lookbehind, (?<!^.), prevents incorrectly matching the first character of the string. (Python's re module requires each lookbehind to be fixed-width, so the two conditions are written as separate lookbehinds rather than one alternation.)
Edit
As suggested by Alan Moore, here's the same idea with a lookahead expression instead:
newstring = re.sub(r"\b(?![0-9A-Fa-f]{2}\b)[^ ]+ ?", "", yourstring)

Why regexp? More pythonic for me is (fixed to test for hex digits rather than decimal digits):
command = '05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
print ' '.join(com for com in command.split()
               if len(com) == 2 and all(c.upper() in '0123456789ABCDEF' for c in com))
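Under Python 3 (where print is a function), the same filter can be written with a set comparison against the hex digits - a sketch, equivalent in behaviour:

```python
HEX_DIGITS = set('0123456789ABCDEF')

command = '05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
# keep only two-character tokens made entirely of hex digits
kept = ' '.join(tok for tok in command.split()
                if len(tok) == 2 and set(tok.upper()) <= HEX_DIGITS)
print(kept)
```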

Related

Check for blanks at specified positions in a string

I have the following problem, which I have solved in a very long-winded way, and I would like to know if there is a better approach. I have the following string structure:
text = 01 ARA 22 - 02 GAG 23
But due to processing, sometimes the spaces are not added properly and it may look like this:
text = 04 GOR23- 02 OER 23
text = 04 ORO 21-02 RRO 24
text = 04 DRE25- 12 RIS21
When they should look as follows:
text = 04 GOR 23 - 02 OER 23
text = 04 ORO 21 - 02 RRO 24
text = 04 DRE 25 - 12 RIS 21
To add the space at those specific positions, I basically check whether a space already exists at that position of the string, and add one if it does not.
Is there another, more efficient way to do this in Python?
I appreciate any advice.
You can use a regex to capture each of the components in the text, and then replace any missing spaces with a space:
import re
regex = re.compile(r'(\d{2})\s*([A-Z]{3})\s*(\d{2})\s*-\s*(\d{2})\s*([A-Z]{3})\s*(\d{2})')
text = ['04 GOR23- 02 OER 23',
        '04 ORO 21-02 RRO 24',
        '04 DRE25- 12 RIS21']
[regex.sub(r'\1 \2 \3 - \4 \5 \6', t) for t in text]
Output:
['04 GOR 23 - 02 OER 23',
'04 ORO 21 - 02 RRO 24',
'04 DRE 25 - 12 RIS 21']
Here is another way to do so:
data = '04 GOR23- 02 OER 23'
new_data = "".join(f" {char}" if i in [2, 5, 7, 8, 10, 13] else char
                   for i, char in enumerate(data.replace(" ", "")))
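If the layout is guaranteed to be two digits, three letters, two digits on each side of the dash (an assumption based on the examples), you can also strip every space and re-slice at fixed offsets - a sketch with a hypothetical normalize helper:

```python
def normalize(text):
    # remove all spaces, then rebuild the string at the known offsets
    c = text.replace(" ", "")  # e.g. '04GOR23-02OER23'
    return f"{c[0:2]} {c[2:5]} {c[5:7]} - {c[8:10]} {c[10:13]} {c[13:15]}"

for t in ['04 GOR23- 02 OER 23', '04 ORO 21-02 RRO 24', '04 DRE25- 12 RIS21']:
    print(normalize(t))
```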

How to parse a string which may or may not contain newlines in python?

I have a text dataset like this, separated by commas:
TABLE DATA IS ALPHANUMERIC
TABLE IS UNSORTED
VALUES ARE ( A S Q M N C H )
SEARCH IS LINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E )
SEARCH IS NONLINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P
17 T 08 K 09 E )
SEARCH IS LINEAR
I have to parse the data to form a pandas dataframe containing columns like is_alphanumeric, sorted/unsorted and values.
I have separated the file on the comma delimiter and am running a for loop through each item:
value_name = []
vname = re.search(r'VALUES ARE \( (.*) \)', line)
value_name.append(vname.group(1).replace("' '", "''"))
but this regex only fetches values that sit on a single line. I am unable to fetch values that are spread over multiple lines - one data item here has its values on two lines, and there can be three as well. How do I fetch them in that case, and how do I remove the newline character and the extra spaces?
You can supply flags to the regex methods that influence how matching is done - I supply re.DOTALL so '.' may also match newlines - and adjust the pattern for your values a bit:
- split at ','
- extract into a dictionary via regex
- split + strip + join every data line that contains '\n', to remove newlines and leading spaces
- put the results into a DataFrame
text = """TABLE DATA IS ALPHANUMERIC
TABLE IS UNSORTED
VALUES ARE ( A S Q M N C H )
SEARCH IS LINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E )
SEARCH IS NONLINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P
17 T 08 K 09 E )
SEARCH IS LINEAR"""
Code:
import re
import pandas as pd
# map a regexpattern to the columnname in the df to store its info
patterns = {"(IS ALPHANUMERIC)": "alphanum",
            "(IS SORTED)": "sorting",
            "(IS UNSORTED)": "sorting",
            "(IS NONLINEAR)": "search",
            "(IS LINEAR)": "search",
            r"VALUES ARE \((.+?)\)": "values"}
df = pd.DataFrame(columns=["alphanum", "sorting", "search", "values"])
for i, parts in enumerate(text.split(",")):
    found = {"alphanum": None, "sorting": None, "search": None, "values": None}
    for pattern, heading in patterns.items():
        match = re.search(pattern, parts, flags=re.DOTALL)
        if match:
            found[heading] = ' '.join(map(str.strip, match[1].split("\n")))
    df.loc[i] = found
print(df)
Output:
alphanum sorting search values
0 IS ALPHANUMERIC IS UNSORTED IS LINEAR A S Q M N C H
1 IS ALPHANUMERIC IS SORTED IS NONLINEAR 0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E
2 IS ALPHANUMERIC IS SORTED IS LINEAR 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P 17 T...
@PatrickArtner provided the regex solution; I just want to add a solution using string methods.
dataset = """TABLE DATA IS ALPHANUMERIC
TABLE IS UNSORTED
VALUES ARE ( A S Q M N C H )
SEARCH IS LINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E )
SEARCH IS NONLINEAR
,
TABLE DATA IS ALPHANUMERIC
TABLE IS SORTED
VALUES ARE ( 02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P
  17 T 08 K 09 E )
SEARCH IS LINEAR"""
def parse_chunk(chunk):
    data = {}
    headers = {'TABLE DATA IS ': 'DATA TYPE', 'TABLE IS ': 'IS_SORTED',
               'SEARCH IS ': 'SEARCH'}
    for line in chunk:
        if line.startswith('VALUES ARE ( '):
            data['VALUES'] = [line[13:].strip(' )')]
        elif line.startswith(' '):
            data['VALUES'].append(line.strip(' )'))
        else:
            for prefix, key in headers.items():
                if line.startswith(prefix):
                    data[key] = line[len(prefix):]
                    break
    data['VALUES'] = ' '.join(data['VALUES'])  # combine all value chunks
    return data
import pandas as pd
data = [parse_chunk(chunk.splitlines()) for chunk in dataset.split('\n,\n')]
print(data)
df = pd.DataFrame(data)
print(df)
output
[{'DATA TYPE': 'ALPHANUMERIC', 'IS_SORTED': 'UNSORTED', 'VALUES': 'A S Q M N C H', 'SEARCH': 'LINEAR'}, {'DATA TYPE': 'ALPHANUMERIC', 'IS_SORTED': 'SORTED', 'VALUES': "0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E", 'SEARCH': 'NONLINEAR'}, {'DATA TYPE': 'ALPHANUMERIC', 'IS_SORTED': 'SORTED', 'VALUES': "02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P 17 T 08 K 09 E", 'SEARCH': 'LINEAR'}]
      DATA TYPE IS_SORTED                                           VALUES     SEARCH
0  ALPHANUMERIC  UNSORTED                                    A S Q M N C H     LINEAR
1  ALPHANUMERIC    SORTED  0A M 0B S 0A D 01 ' ' 04 N 05 P 07 T 08 K 09 E  NONLINEAR
2  ALPHANUMERIC    SORTED  02 M 0f S 0A M 0B S 0A D 01 ' ' 0D N 05 P 17...     LINEAR

Dynamic Two Dimensional Array Creation Python

I have a long string of repeating two hexadecimal characters separated by a space read in from a file that I would like to store into a two dimensional (array) list for processing later. The string is in the form:
file_content = "00 18 00 19 F0 0F 1A 80 FF C7 E8 11 7F 52 7D 00 F0 0D F0 0C 0B FF"
Each substring that needs indexing begins with "00" and ends with "FF". There are no instances of "FF" mid-string, but instances of "00" can occur mid-string, which makes this tricky. I would like to store each one of these events at its own index in the list. For example:
event_list = [['00', '18', '00', '19', 'F0', '0F', '1A', '80', 'FF'], ['C7', 'E8', '11', '7F', '52', '7D', '00', 'F0', '0D', 'F0', '0C', '0B', 'FF'], ...]
If I understand this correctly, you're splitting it up based on the 'FF's present in the string, so you could probably get away with something like:
event_list = [('%sFF' % x).strip().split(' ') for x in file_content.split('FF')[:-1]]
This splits your original string on the 'FF's present, then loops through the split parts and appends 'FF' back onto each. It then splits each part on the space character, generating the inner lists and creating the 2D array you require in one line :)
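The same split can be written as a single regex, relying on the same assumption that 'FF' never occurs mid-event: a lazy repetition of byte pairs that stops at the first 'FF'. A sketch:

```python
import re

file_content = "00 18 00 19 F0 0F 1A 80 FF C7 E8 11 7F 52 7D 00 F0 0D F0 0C 0B FF"
# each match is the shortest run of "XX " pairs ending in a literal FF
events = re.findall(r"(?:[0-9A-F]{2} )+?FF", file_content)
event_list = [e.split() for e in events]
print(event_list)
```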

Converting an array of integers to unicode

I have an array arr = [7, 10, 11, 5].
I actually want these values converted to hex and sent to a function. I know I can convert the values individually to hex with hex(arr[0]), hex(arr[1]), etc. What I have been doing so far is converting the array manually to
unicode, as arr = [u'07 0a 0b 05'], and then sending that to the function, which serves my purpose. But it is too impractical when I have to send hundreds of unicode strings in arr, as in
arr = [u'a3 06 06 01 00 01 01 00',
u'a3 06 06 02 00 01 02 00',
u'a3 06 06 03 00 01 03 00',
u'a3 06 06 04 00 01 04 00',
u'a3 06 06 05 00 01 05 00',
u'a3 06 06 06 00 01 06 00',
u'a3 06 06 07 00 01 07 00']
I am sending these one by one (arr[i]) in a loop. Is there a way to dynamically convert an array of integers to unicode strings like these, so they can be sent to an external function that readily accepts them?
Is ' '.join(map(hex,arr)) what you are looking for?
More specifically:
import numpy as np
a = np.array([[1, 2, 3], [6, 11, 23]])
def strip_hexstring(s):
    return s[2:].rstrip('L')  # drop the '0x' prefix (and the 'L' suffix of Python 2 longs)
def format_line(line):
    return u'{}'.format(' '.join(map(strip_hexstring, map(hex, line))))
arr = [format_line(a[i, :]) for i in range(a.shape[0])]
You can pad with zeros like in your example, if necessary.
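If zero-padded pairs like u'07 0a 0b 05' are the goal, a format spec is simpler than stripping hex() output, since '{:02x}' handles both the 0x prefix and the padding. A sketch using the question's values:

```python
arr = [7, 10, 11, 5]
line = u' '.join(u'{:02x}'.format(v) for v in arr)
print(line)  # 07 0a 0b 05
```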

Chrome corrupting binary file download by converting to UTF-8

I've been assigned to maintain an application written with Flask. Currently I'm trying to add a feature that allows users to download a pre-generated Excel file; however, whenever I try to send it, my browser appears to re-encode the file as UTF-8, which inserts multibyte sequences and corrupts the file.
File downloaded with wget:
(venv) luke@ubuntu:~$ hexdump -C wget.xlsx | head -n 2
00000000 50 4b 03 04 14 00 00 00 08 00 06 06 fb 4a 1f 23 |PK...........J.#|
00000010 cf 03 c0 00 00 00 13 02 00 00 0b 00 00 00 5f 72 |.............._r|
The file downloaded with Chrome (notice the EF BF BD sequences?)
(venv) luke@ubuntu:~$ hexdump -C chrome.xlsx | head -n 2
00000000 50 4b 03 04 14 00 00 00 08 00 ef bf bd 03 ef bf |PK..............|
00000010 bd 4a 1f 23 ef bf bd 03 ef bf bd 00 00 00 13 02 |.J.#............|
Does anyone know how I could fix this? This is the code I'm using:
data = b'PK\x03\x04\x14\x00\x00\x00\x08\x00}\x0c\xfbJ\x1f#\xcf\x03\xc0\x00\x00\x00\x13\x02\x00\x00\x0b\x00\x00\x00'
send_file(BytesIO(data), attachment_filename="x.xlsx", as_attachment=True)
Related issue: Encoding problems trying to send a binary file using flask_restful
Chrome was expecting to receive UTF-8 encoded text and found bytes that couldn't be interpreted as a valid UTF-8 encoding of a character - which is normal, because your file is binary. So it replaced those invalid bytes with EF BF BD, the UTF-8 encoding of the Unicode replacement character. The Content-Type header you send is probably text/something. Maybe try something like Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
