Python Regular Expression Extract Chunk of Data From Binary File - python

I have a binary file from which I need to extract a few chunks of data using a Python regular expression.
Specifically, I need to extract the runs of non-null bytes that sit between runs of null bytes.
For example this is the main character set:
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56
The regex should extract below character sets from above master set:
\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32 and
\x56\x65\x00\x35\x56
One thing is important: only a run of more than five consecutive null bytes should be treated as a separator; shorter runs of nulls must be kept inside the surrounding non-null chunk. As you can see in the example, a few null bytes are present inside the extracted character sets.
If this doesn't make sense, please let me know and I will try to explain it better.
Thanks in advance.

You could split on \x00{5,}
This matches five or more zero bytes. It's the delimiter you specified.
In Perl, it looks something like this:
Perl test case
$strLangs = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56";
# Remove leading zeros (5 or more)
$strLangs =~ s/^\x00{5,}//;
# Split on 5 or more zeros
@Alllangs = split /\x00{5,}/, $strLangs;
# Print each language's characters
foreach $lang (@Alllangs)
{
    print "<";
    for ( split //, $lang ) {
        printf( "%x,", ord($_) );
    }
    print ">\n";
}
Output >>
<ff,fe,fe,0,0,23,41,>
<41,49,57,0,0,0,0,32,41,49,57,0,0,0,0,32,>
<56,65,0,35,56,>
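The same split-on-`\x00{5,}` approach translates directly to Python; here is a minimal sketch (Python 3, working on a bytes literal built from the question's sample):

```python
import re

# Sample from the question, as a bytes literal
data = (b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41'
        b'\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41'
        b'\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
        b'\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56')

# Strip leading/trailing nulls, then split on runs of 5 or more null bytes
chunks = re.split(rb'\x00{5,}', data.strip(b'\x00'))
for chunk in chunks:
    print(chunk.hex())
# fffefe00002341
# 41495700000000324149570000000032
# 5665003556
```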

You can use split and lstrip with a list comprehension:
s='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
sp=s.split('\x00\x00\x00\x00\x00')
print [i.lstrip('\x00') for i in sp if i != ""]
Output:
['\xff\xfe\xfe\x00\x00#A', 'AIW\x00\x00\x00\x002AIW\x00\x00\x00\x002', 'Ve\x005V']
Split the entire data on five consecutive null bytes.
Then, for each element in the list, strip any null bytes left at the start (this handles a variable number of leading nulls).

Here's how to do it in Python. I had to str.strip() off the leading and trailing nulls so that re.split() wouldn't include an extra empty string at the beginning of the list of results.
import re
data = ('\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41'
'\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41'
'\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
chunks = re.split(r'\000{6,}', data.strip('\x00'))
# display results
print ',\n'.join(''.join('\\x'+ch.encode('hex_codec') for ch in chunk)
for chunk in chunks),
Output:
\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32,
\x56\x65\x00\x35\x56

Related

ANTLR4 Literal for HEX string containing new line

I was trying to parse an SVF file for JTAG and I ran into this problem:
I have to parse a hexadecimal value that can contain spaces and new lines, but I also need plain numbers (without spaces).
I have line comments, and whitespace is irrelevant, so I used these lexer rules:
COMMENT : ('!' | '//') .*? '\n' -> skip ;
WS : [ \t\r\n]+ -> skip ;
The numbers and hex definitions are:
hexLiteral : HEX | NUM ;
NUM : [0-9]+ ;
HEX : [0-9a-f]+ ;
This works if the input has no new lines or spaces in hex strings, eg:
hexBlock returns [val: str] : '(' hexLiteral ')' {print($hexLiteral.text)}
Running it over (0af3) does the job, but I also need to match and extract strings like (0a3f 10 e2), returning 0a3f10e2.
My first idea was to use:
hexLiteral : (HEX | NUM) hexLiteral? ;
But parsing the block then fails with mismatched input '10' expecting ')'.
You are trying to make two opposite things work together:
You want to ignore whitespace, and I guess you use it to separate tokens in your language.
You also want whitespace inside some of your tokens.
Instead of trying to make your grammar accept every whitespace/digit combination, I recommend collecting the individual parts as normal numbers; then, in the semantic phase after the parse run, you can examine your parse tree and join the tokens that are supposed to form a single unit.
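In other words, once the lexer has delivered the HEX/NUM pieces as separate tokens, the semantic phase only has to concatenate their texts. A minimal sketch (the token list here is hypothetical, assumed to have been collected from the parse tree):

```python
# Hypothetical token texts collected from the parse tree for "(0a3f 10 e2)"
tokens = ['0a3f', '10', 'e2']

# Whitespace was skipped at the lexer level, so joining the token texts
# reconstructs the single hex value the grammar cannot express directly.
hex_value = ''.join(tokens)
print(hex_value)  # 0a3f10e2
```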

How to convert a multiline string into a list of lines?

In Sikuli I get a multiline string from the clipboard like this...
Names = App.getClipboard();
So Name =
#corazona
#Pebleo00
#cofriasd
«paflio
and I have used this regex to delete the first character if it is not in the \x00-\x7F hex range, is a non-word character, or is a digit:
import re
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", Names)
So now Names =
corazona
Pebleo00
cofriasd
paflio
But, I am having trouble with the second regex that converts "Names" into the items of a sequence. I would like to convert "Names" into...
'corazona', 'Pebleo00', 'cofriasd', 'paflio'
or
'corazona', 'Pebleo00', 'cofriasd', 'paflio',
So Sikuli can then recognize it as a list (I've found that Sikuli recognizes it even with a trailing comma and space) by using...
NamesAsList = eval(Names)
How could I do this in Python? Is it necessary to use a regex, or is there another way?
I have already done this using .NET regex; I just don't know how to do it in Python, and I have googled it with no result.
This is how I did it using .NET regex:
Text to find:
(.*[^$])(\r\n|\z)
Replace with:
'$1',%" "%
Thanks in advance.
A couple of one-liners. Your question isn't completely clear, but I am assuming you want to split the given string on newlines and then build a list of strings, removing the first character of each when it's not alphanumeric. Here's how I'd go about it:
import re
r = re.compile(r'^[^a-zA-Z0-9]')  # match a leading character that's not alphanumeric
s = '#abc\ndef\nghi'
l = [r.sub('', x) for x in s.split()]
# join this list with comma (if that's required else you got the list already)
','.join(l)
Hope that's what you want.
If Names is a string before you "convert" it, in which each name is separated by a new line ('\n'), then this will work:
NamesAsList = Names.split('\n')
See this question for other options.
You could use splitlines()
import re
clipBoard = App.getClipboard();
Names = re.sub(r"(?m)^([^\x00-\x7F]+|\W|\d)", "", clipBoard)
# Replace the end of a line with a comma.
singleNames = ', '.join(Names.splitlines())
print(singleNames)
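Worth noting: if the end goal is a Python list, the eval() round-trip is unnecessary, since splitlines() already returns one. A minimal sketch:

```python
names = "corazona\nPebleo00\ncofriasd\npaflio"

# splitlines() yields the list directly; no string-building or eval() needed
names_as_list = names.splitlines()
print(names_as_list)  # ['corazona', 'Pebleo00', 'cofriasd', 'paflio']
```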

Suggestion for python regex and selecting columns [duplicate]

This question already has answers here:
Split string on whitespace in Python [duplicate]
(4 answers)
Closed 8 years ago.
In a file with 3, 4 or X columns separated by spaces (not a constant amount of space, but multiple spaces on each line), how can I select the first 2 columns of each row with a regex?
My files consist of: IP [SPACES] Subnet_Mask [SPACES] NEXT_HOP_IP [NEW LINE]
All rows use that format. How can I extract only the first two columns (IP & subnet mask)?
Here is an example on which to try your regex:
10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224
Don't look to the specific IPs. I know the second column is not formed of valid address masks. It's just an example.
I already tried:
(?P<IP_ADD>\s*[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3})(?P<space>\s*)(?P<MASK>[1-9][0-9]{1,2}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}(\s+|\D*))
But it doesn't quite work...
With a regular expression:
If you want to get the 2 first columns, whatever they contain, and whatever amount of space separates them, you can use \S (matches anything but whitespaces) and \s (matches whitespaces only) to achieve that:
import re
lines = """
10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224
"""
regex = re.compile(r'^(\S+)\s+(\S+)', re.MULTILINE)
regex.findall(lines)
Result:
[('10.97.96.0', '10.97.97.128'),
 ('47.73.4.128', '47.73.7.6'),
 ('47.73.15.0', '47.73.40.0'),
 ('85.205.9.164', '85.205.14.44'),
 ('172.17.103.8', '172.17.103.48'),
 ('172.17.103.96', '172.17.103.100'),
 ('172.17.103.140', '172.17.104.44'),
 ('172.17.105.32', '172.17.105.220')]
Without a regular expression
If you didn't want to use a regex, and still be able to handle multiple spaces, you could also do:
while '  ' in lines:  # notice the two-spaces string
    lines = lines.replace('  ', ' ')
columns = [line.split(' ')[:2] for line in lines.split('\n') if line]
Pros and cons:
The advantage of using a regex is that it would also parse the data properly if separators include tabulations, which wouldn't be the case with the 2nd solution.
On the other hand, regular expressions require more computing than a simple string splitting, which could make a difference on very large data sets.
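The tab point is easy to verify; a small sketch (the sample line is made up):

```python
import re

# Made-up row whose columns are separated by a tab and by runs of spaces
line = "10.97.96.0\t10.97.97.128   47.73.1.0"

# \s+ covers tabs as well as spaces, so the regex still finds both columns
pairs = re.findall(r'(\S+)\s+(\S+)', line)
print(pairs[0])  # ('10.97.96.0', '10.97.97.128')
```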
One liner it is:
[s.split()[:2] for s in string.split('\n')]
Example
string = """10.97.96.0 10.97.97.128 47.73.1.0
47.73.4.128 47.73.7.6 47.73.8.0
47.73.15.0 47.73.40.0 47.73.41.0
85.205.9.164 85.205.14.44 172.17.103.0
172.17.103.8 172.17.103.48 172.17.103.56
172.17.103.96 172.17.103.100 172.17.103.136
172.17.103.140 172.17.104.44 172.17.105.28
172.17.105.32 172.17.105.220 172.17.105.224"""
print [s.split()[:2] for s in string.split('\n')]
Outputs
[['10.97.96.0', '10.97.97.128'],
 ['47.73.4.128', '47.73.7.6'],
 ['47.73.15.0', '47.73.40.0'],
 ['85.205.9.164', '85.205.14.44'],
 ['172.17.103.8', '172.17.103.48'],
 ['172.17.103.96', '172.17.103.100'],
 ['172.17.103.140', '172.17.104.44'],
 ['172.17.105.32', '172.17.105.220']]
Since you need "some sort of one-liner", there are many ways that do not involve Python.
Maybe:
| awk '{print $1,$2}'
with anything that produces your input on stdout.
Edited to perform space match with any number of spaces.
You can accomplish this with Python regular expressions, as an option, if you know it's going to be the first two space-separated values.
A regex cheat sheet will also help you find shortcuts; specific token classes like words, spaces, and numbers have these little shortcuts.
import re
line = "10.97.96.0 10.97.97.128 47.73.1.0"
result = re.split(r"\s+", line)[0:2]
result
['10.97.96.0', '10.97.97.128']

Jython output formats, adding symbol at N:th character

I have a problem that is probably very easy to solve. I have a script that takes numbers from various places, does math with them, and then prints the results as strings.
This is a sample
type("c", KEY_CTRL)
LeInput = Env.getClipboard().strip() #Takes stuff from clipboard
LeInput = LeInput.replace("-","") #Quick replace
Variable = int(LeInput) + 5 #Simple math operation
StringOut = str(Variable) #Converts it to string
popup(StringOut) #shows result for the amazed user
But what I want to do is add the "-" signs again, as in XXXX-XX-XX, and I have no idea how to do this with regex etc. The only solution I have is dividing by 10^N to split the value into smaller and smaller integers. For example:
int 543442/100 = 5434, giving the first string the number 5434; then repeat the process until I have split it enough times to get 5434-42 or whatever.
So how do I insert a symbol at the N:th character?
OK, so here is the Jython solution based on the answer from Tenub
import re
strOut = re.sub(r'^(\d{4})(.{2})(.{2})', r'\1-\2-\3', strIn)
This can be worth noting when doing Regex with Jython:
The solution is to use Python's raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
Here is a working example
http://regex101.com/r/oN2wF1
In that case you could do a replace with the following:
(\d{4})(\d{2})(\d+)
to
$1-$2-$3
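If regex feels heavy, plain slicing can insert a symbol at the N:th character just as well; a minimal sketch (the helper name is made up):

```python
def insert_at(s, n, symbol):
    """Insert symbol before the n:th character (0-based) of s."""
    return s[:n] + symbol + s[n:]

value = "20140101"
value = insert_at(value, 4, "-")   # "2014-0101"
value = insert_at(value, 7, "-")   # "2014-01-01"
print(value)
```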

Python: Regex question / CSV parsing / Psycopg nested arrays

I'm having trouble parsing nested arrays returned by Psycopg2. The DB I'm working with returns records that can have nested arrays as values; Psycopg only parses the outer array of such values.
My first approach was splitting the string on commas, but then I ran into the problem that sometimes a string within the result also contains commas, which renders the entire approach unusable.
My next attempt was using a regex to find the "components" within the string, but then I noticed I wasn't able to detect numbers (since numbers can also occur within strings).
Currently, this is my code:
import re
text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
r = re.compile('\".*?\"|[\w]{8}-[\w]{4}-[\w]{4}-[\w]{4}-[\w]{12}|^\d*[0-9](|.\d*[0-9]|,\d*[0-9])?$')
result = r.search(text)
if result:
    result = result.groups()
The result of this should be:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e', 'Marc, Dirk en Koen', 398547, 85.5, -9.2, '62fe6393-00f7-418d-b0b3-7116f6d5cf10']
Since I would like to have this functionality generic, I cannot be certain of the order of arguments. I only know that the types that are supported are strings, uuid's, (signed) integers and (signed) decimals.
Am I using a wrong approach? Or can anyone point me in the right direction?
Thanks in advance!
Python's native csv lib should do a good job here. Have you tried it already?
http://docs.python.org/library/csv.html
From your sample, it looks something like ^{(?:(?:([^},"']+|"[^"]+"|'[^']+')(?:,|}))+(?<=})|})$ to me. That's not perfect since it would allow "{foo,bar}baz}", but it could be fixed if that matters to you.
If you can do ASSERTIONS, this will get you on the right track.
This problem is too extensive to be done in a single regex. You are trying to validate and parse at the same time in a global match, but your intended result requires sub-processing after the match. For that reason, it's better to write a simpler global parser, then iterate over the results for validation and fixup (yes, you have fixup stipulated in your example).
The two main parsing regex's are these:
strips the delimiter quotes too and only $2 contains data; use in a while loop, global context
/(?!}$)(?:^{?|,)\s*("|)(.*?)\1\s*(?=,|}$)/
my preferred one; does not strip quotes, only captures $1; can be used to capture into an array or in a while loop, global context
/(?!}$)(?:^{?|,)\s*(".*?"|.*?)\s*(?=,|}$)/
This is an example of post processing (in Perl) with a documented regex: (edit: fix append trailing ,)
use strict; use warnings;

my $str = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}';

my $rx = qr/ (?!}$) (?:^{?|,) \s* ( ".*?" | .*?) \s* (?=,|}$) /x;

my $rxExpanded = qr/
    (?!}$)           # ASSERT ahead: NOT a } plus end
    (?:^{?|,)        # Boundary: start of string plus { OR comma
    \s*              # 0 or more whitespace
    ( ".*?" | .*?)   # Capture "Quoted" or non-quoted data
    \s*              # 0 or more whitespace
    (?=,|}$)         # Boundary ASSERT ahead: comma OR } plus end
/x;

my ($newstring, $success) = ('[', 0);
for my $field ($str =~ /$rx/g)
{
    my $tmp = $field;
    $success = 1;
    if ( $tmp =~ s/^"|"$//g || $tmp =~ /(?:[a-f0-9]+-){3,}/ ) {
        $tmp = "'$tmp'";
    }
    $newstring .= "$tmp,";
}
if ( $success ) {
    $newstring =~ s/,$//;
    $newstring .= ']';
    print $newstring, "\n";
}
else {
    print "Invalid string!\n";
}
Output:
['2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e','Marc, Dirk en Koen',398547,85.5,-9.2,'62fe6393-00f7-418d-b0b3-7116f6d5cf10']
It seemed that the CSV approach was the easiest to implement:
def parsePsycopgSQLArray(input):
    import csv
    import cStringIO

    input = input.strip("{")
    input = input.strip("}")
    buffer = cStringIO.StringIO(input)
    reader = csv.reader(buffer, delimiter=',', quotechar='"')
    return reader.next()  # There can only be one row

if __name__ == "__main__":
    text = '{2f5e5fef-1e8c-43a2-9a11-3a39b2cbb45e,"Marc, Dirk en Koen",398547,85.5,-9.2, 62fe6393-00f7-418d-b0b3-7116f6d5cf10}'
    result = parsePsycopgSQLArray(text)
    print result
Thanks for the responses, they were most helpful!
Improved upon Dirk's answer. This handles escape characters better, as well as the empty-array case, and uses one less strip call:
def restore_str_array(val):
    """
    Converts a postgres formatted string array (as a string) to python
    :param val: postgres string array
    :return: python array with values as strings
    """
    val = val.strip("{}")
    if not val:
        return []
    reader = csv.reader(StringIO(val), delimiter=',', quotechar='"', escapechar='\\')
    return reader.next()
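Under Python 3 the same idea needs io.StringIO and the next() built-in instead of reader.next(); a minimal sketch (the function name here is made up):

```python
import csv
import io

def restore_str_array_py3(val):
    """Convert a postgres string-array literal into a list of strings."""
    val = val.strip("{}")
    if not val:
        return []
    # csv handles quoted fields, so embedded commas survive the split
    reader = csv.reader(io.StringIO(val), delimiter=',', quotechar='"',
                        escapechar='\\')
    return next(reader)

print(restore_str_array_py3('{a,"b, c",d}'))  # ['a', 'b, c', 'd']
```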
