I was trying to parse an SVF file for JTAG and ran into this problem:
I have to parse a hexadecimal value that can contain spaces and newlines, but I also need plain numbers (without spaces).
I have line comments, and whitespace is irrelevant, so I used these lexical rules:
COMMENT : ('!' | '//') .*? '\n' -> skip ;
WS : [ \t\r\n]+ -> skip ;
The numbers and hex definitions are:
hexLiteral : HEX | NUM ;
NUM : [0-9]+ ;
HEX : [0-9a-f]+ ;
This works if the input has no newlines or spaces in hex strings, e.g.:
hexBlock returns [val: str] : '(' hexLiteral ')' {print($hexLiteral.text)} ;
Running it over (0af3) does the job.
But I also need to match and extract strings like (0a3f 10 e2), returning 0a3f10e2.
My first idea was to use:
hexLiteral : (HEX | NUM) hexLiteral? ;
But the parsing for the block results in mismatched input '10' expecting ')'.
You are trying to make two opposing things work together:
You want to ignore whitespace, and I guess you use it to separate tokens in your language.
You also want whitespace inside some of your tokens.
Instead of trying to make your grammar accept every kind of whitespace/digit combination, I recommend collecting the individual parts as normal numbers. Then, in the semantic phase after the parse run, you can examine your parse tree and join all tokens that are supposed to form a single unit.
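The joining step described above can be sketched in plain Python, independent of any particular parser runtime. This is only an illustration under stated assumptions: `join_hex_parts` is a hypothetical helper, and its input is assumed to be the text of the HEX/NUM tokens found between the parentheses, collected in order while walking the parse tree.

```python
import string

HEX_DIGITS = set(string.hexdigits)

def join_hex_parts(token_texts):
    """Concatenate the token texts of one hex block into a single hex string.

    token_texts is assumed to be an ordered iterable of strings, one per
    HEX/NUM token (hypothetical input; collecting it depends on your parser).
    """
    joined = ''.join(token_texts)
    # Reject empty input and anything that is not purely hexadecimal
    if not joined or not set(joined) <= HEX_DIGITS:
        raise ValueError('not a hexadecimal value: %r' % joined)
    return joined

print(join_hex_parts(['0a3f', '10', 'e2']))  # 0a3f10e2
```

Because the whitespace never reaches the parser, the lexer rules can stay as they are; only the post-processing changes.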
Using regex in Python 3.6.3, I am trying to extract scientific-notation numbers associated with a specific start text and end text. From the following sample data:
Not_the_data : REAL[10] (randomtext := doesntapply) := [1.00000000e+000,-2.00000000e000,3.00000000e+000,4.00000000e+000,5.00000000e+000,6.00000000e+000
,7.00000000e+000,8.00000000e-000,9.00000000e+000,1.00000000e+001,1.10000000e+001];
This_data : REAL[2,27] (RADIX := Float) := [3.45982254e-001,9.80374157e-001,8.29904616e-001,1.57800000e+002,4.48320538e-001,6.20533180e+001
,1.80081348e+003,-8.93283653e+000,5.25826037e-001,2.16974407e-001,1.17304848e+002,6.82604387e-002
,3.76116596e-002,6.82604387e-002,3.76116596e-002];
Not_it_either : REAL[72] (randomtext := doesntapply) := [0.00000000e+000,-0.00000000e000,0.00000000e+000,0.00000000e+000,0.00000000e+000,0.00000000e+000];
I would want only the data in the "This_data" set:
['3.45982254e-001','9.80374157e-001','8.29904616e-001','1.57800000e+002','4.48320538e-001','6.20533180e+001','1.80081348e+003','-8.93283653e+000','5.25826037e-001','2.16974407e-001','1.17304848e+002','6.82604387e-002','3.76116596e-002','6.82604387e-002','3.76116596e-002']
If I don't use the lookaround functions I can get all the numbers that match the scientific notation easily like this:
values = re.findall('(-?[0-9]+.[0-9]+e[+-][0-9]+)',_DATA_,re.DOTALL|re.MULTILINE)
But as soon as I add a lookbehind:
values = re.findall('(?<=This_data).*?(-?[0-9]+.[0-9]+e[+-][0-9]+)+',_DATA_,re.DOTALL|re.MULTILINE)
all but the first number in the desired set drop off. I have attempted multiple iterations of this using positive and negative lookahead and lookbehind on debugex to no avail.
My source file is 50k+ lines and the data set desired is 10-11k lines. Ideally I would like to capture my data set in one read through of my file.
How can I correctly use a lookahead or lookbehind function to limit my data capture to numbers that meet the format but only from the desired "This_Data" set?
Any help is appreciated!
You might have an easier time parsing the file one line at a time, skipping lines that don't meet the criteria. It looks like each record ends with a semicolon, so you can use that to decide when to stop parsing.
import re
PARSING = False
out = []
with open('path/to/file.data') as fp:
    for line in fp:
        # start collecting at the first line of the wanted record
        if line.startswith('This_data'):
            PARSING = True
        if PARSING:
            # extend (not append) so that out stays one flat list of numbers
            out.extend(re.findall(r'(-?[0-9]+\.[0-9]+e[+-][0-9]+)', line))
            # check if the line contains a semicolon to stop parsing
            if ';' in line:
                PARSING = False
# return the results:
out
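If the whole file fits in memory, a two-step regex is another option: first isolate the record, then pull the numbers out of it. The lookbehind in the question can only be satisfied at one position (right after "This_data"), which is why only the first number survives. The sketch below assumes each record starts at the beginning of a line and ends at the first semicolon; `extract_block_values` is a hypothetical helper name.

```python
import re

# Same number format as in the question, with the decimal point escaped
NUMBER = re.compile(r'-?[0-9]+\.[0-9]+e[+-][0-9]+')

def extract_block_values(text, name):
    """Return all scientific-notation numbers in the record called `name`."""
    # A record runs from its name at the start of a line to the first ';'
    block = re.search(r'^' + re.escape(name) + r'\b[^;]*;', text, re.MULTILINE)
    if block is None:
        return []
    return NUMBER.findall(block.group(0))

sample = (
    "Not_the_data : REAL[2] (randomtext := doesntapply) := [1.00000000e+000\n"
    ",2.00000000e+000];\n"
    "This_data : REAL[2,27] (RADIX := Float) := [3.45982254e-001,9.80374157e-001\n"
    ",-8.93283653e+000];\n"
    "Not_it_either : REAL[1] (randomtext := doesntapply) := [0.00000000e+000];\n"
)
print(extract_block_values(sample, 'This_data'))
```

The dimensions in `REAL[2,27]` are not picked up because they have no decimal point or exponent.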
I have a FASTA file with a header that includes the sequence name and length
>1 9081 bp
gcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga
I need to remove everything after the name "1" and tried doing that in python by:
newfile.write(oldfile.replace("bp",""))
This removes "bp" but I still have the numbers now.
>1 9081
gcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga
How do I specify "any characters followed by bp" so that it is replaced with nothing? I tried ***bp or ---bp or ...bp but those don't work.
Thanks!
Radwa
You should use a regular expression for this purpose.
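Try this (assuming your sequence name may be longer than one character and may contain both digits and letters):
import re
# Match only header lines (starting with '>'); without re.DOTALL, '.' stops at the end of the line, so the sequence is kept
regex = re.compile(r'^(>\w+).*$', re.MULTILINE)
print(regex.sub(r'\1', '>1 9081 bp\ngcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga' ))
print(regex.sub(r'\1', '>s12d 9081 bp\ngcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga' ))
Output:
>1
gcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga
>s12d
gcgcccgaacagggacttgaaagcgaaagagaaaccagagaagctctctcgacgcagga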
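For a whole file, the same clean-up can be done line by line without a regex. This is a minimal sketch, assuming header lines start with '>' and the name is everything up to the first whitespace; `strip_fasta_descriptions` is a hypothetical helper, and the file names in the usage note are placeholders.

```python
def strip_fasta_descriptions(lines):
    """Yield FASTA lines with each header reduced to '>name'."""
    for line in lines:
        if line.startswith('>'):
            # Keep only the first whitespace-separated field of the header
            yield line.split(None, 1)[0] + '\n'
        else:
            # Sequence lines pass through unchanged
            yield line

example = ['>1 9081 bp\n', 'gcgcccgaacagggacttgaaagcgaaag\n']
print(''.join(strip_fasta_descriptions(example)), end='')
```

To rewrite a file, open the input and output (for example `old.fasta` and `new.fasta`) and call `newfile.writelines(strip_fasta_descriptions(oldfile))`.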
I have a py3 string that includes escaped UTF-8 sequences, such as "Company\\ffffffc2\\ffffffae", which I would like to convert to the correct UTF-8 string (which in the example would be "Company®", since the escaped sequence is c2 ae). I've tried
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace(
"\\\\ffffff", "\\x"), "ascii").decode("utf-8"))
result: Company\xc2\xae
print (bytes("Company\\\\ffffffc2\\\\ffffffae".replace (
"\\\\ffffff", "\\x"), "ascii").decode("unicode_escape"))
result: CompanyÂ®
(wrong, since the characters are treated separately, but they should be treated together).
If I do
print (b"Company\xc2\xae".decode("utf-8"))
It gives the correct result.
Company®
How can I achieve that programmatically (i.e. starting from a py3 str)?
A simple solution is:
import ast
test_in = "Company\\\\ffffffc2\\\\ffffffae"
test_out = ast.literal_eval("b'''" + test_in.replace('\\\\ffffff','\\x') + "'''").decode('utf-8')
print(test_out)
However it will fail if there is a triple quote ''' in the input string itself.
The following code does not have this problem, but it is not as simple as the first one.
In the first step the string is split on a regular expression. Because the pattern has a capturing group, the result alternates: items at even indices (0, 2, ...) are ASCII parts, e.g. "Company", and items at odd indices are single escaped UTF-8 bytes, e.g. "\\\\ffffffc2". Each substring is converted to bytes according to its meaning in the input string. Finally all parts are joined together and decoded from bytes to a string.
import re
REGEXP = re.compile(r'(\\\\ffffff[0-9a-f]{2})', flags=re.I)
def convert(estr):
    def split(estr):
        for i, substr in enumerate(REGEXP.split(estr)):
            if i % 2:
                yield bytes.fromhex(substr[-2:])
            elif substr:
                yield bytes(substr, 'ascii')
    return b''.join(split(estr)).decode('utf-8')
test_in = "Company\\\\ffffffc2\\\\ffffffae"
print(convert(test_in))
The code could be optimized. Ascii parts do not need encode/decode and consecutive hex codes should be concatenated.
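One way to apply both optimizations is to let `re.sub` replace each run of consecutive escapes in one go and leave everything else untouched. This is a sketch rather than a tuned implementation; `convert_runs` and the two pattern names are hypothetical, and the input is assumed to use the same doubled-backslash escapes as the test string above. A run that is not valid UTF-8 raises `UnicodeDecodeError`.

```python
import re

# One or more consecutive escapes, each of the form \\ffffffXX
ESCAPED_RUN = re.compile(r'(?:\\\\ffffff[0-9a-f]{2})+', flags=re.I)
SINGLE_ESCAPE = re.compile(r'\\\\ffffff([0-9a-f]{2})', flags=re.I)

def decode_run(match):
    # Collect the two hex digits of every escape in the run, then decode once
    hex_digits = ''.join(SINGLE_ESCAPE.findall(match.group(0)))
    return bytes.fromhex(hex_digits).decode('utf-8')

def convert_runs(estr):
    # Text outside the escaped runs is never encoded or decoded
    return ESCAPED_RUN.sub(decode_run, estr)

print(convert_runs("Company\\\\ffffffc2\\\\ffffffae"))  # Company®
```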
I have a problem that probably is very easy to solve. I have a script that takes numbers from various places does math with them and then prints the results as strings.
This is a sample
type("c", KEY_CTRL)
LeInput = Env.getClipboard().strip() #Takes stuff from clipboard
LeInput = LeInput.replace("-","") #Quick replace
Variable = int(LeInput) + 5 #Simple math operation
StringOut = str(Variable) #Converts it to string
popup(StringOut) #shows result for the amazed user
But what I want to do is to add the "-" signs again as per XXXX-XX-XX, but I have no idea how to do this with regex etc. The only solution I have is dividing by 10^N to split it into smaller and smaller integers. As an example:
int 543442/100 = 5434, giving the first string the number 5434, and then repeating the process until I have split it enough times to get my 5434-42 or whatever.
So how do I insert any symbol at the Nth character?
OK, so here is the Jython solution based on the answer from Tenub
import re
strOut = re.sub(r'^(\d{4})(.{2})(.{2})', r'\1-\2-\3', strIn)
This can be worth noting when doing Regex with Jython:
The solution is to use Python's raw string notation for regular
expression patterns; backslashes are not handled in any special way in
a string literal prefixed with 'r'. So r"\n" is a two-character
string containing '\' and 'n', while "\n" is a one-character string
containing a newline. Usually patterns will be expressed in Python
code using this raw string notation.
Here is a working example
http://regex101.com/r/oN2wF1
In that case you could do a replace with the following:
(\d{4})(\d{2})(\d+)
to
$1-$2-$3
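In Python (and, as far as the `re` module is concerned, Jython) the replacement uses `\1` rather than `$1`. The sketch below shows that substitution next to plain slicing, which answers the "insert a symbol at the Nth character" question directly. Both function names are hypothetical, and the input is assumed to be a string of at least seven digits.

```python
import re

def add_dashes(digits):
    # XXXX-XX-XX: four digits, two digits, then the remaining digits
    return re.sub(r'^(\d{4})(\d{2})(\d+)$', r'\1-\2-\3', digits)

def add_dashes_by_slicing(digits):
    # Same result without a regex: cut at positions 4 and 6
    return '-'.join((digits[:4], digits[4:6], digits[6:]))

# Mirror the script from the question: strip dashes, add 5, put dashes back
value = int('2014-01-01'.replace('-', '')) + 5
print(add_dashes(str(value)))             # 2014-01-06
print(add_dashes_by_slicing(str(value)))  # 2014-01-06
```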
I have a binary file. From that file I need to extract a few chunks of data using a Python regular expression.
I need to extract the sets of non-null characters that lie between sets of null characters.
For example, this is the main character set:
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56
The regex should extract below character sets from above master set:
\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32 and
\x56\x65\x00\x35\x56
One thing is important: a run of null bytes should be treated as a separator only if it contains more than 5 consecutive null bytes; otherwise the null bytes should be included in the extracted character set. As you can see in the given example, a few null characters are also present in the extracted character sets.
If this doesn't make sense, please let me know and I will try to explain it better.
Thanks in Advance,
You could split on \x00{5,}
This matches five or more null bytes, which is the delimiter you specified.
In Perl, it's something like this:
Perl test case
$strLangs = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56";
# Remove leading zero's (5 or more)
$strLangs =~ s/^\x00{5,}//;
# Split on 5 or more 0's
@Alllangs = split /\x00{5,}/, $strLangs;
# Print the characters of each chunk
foreach $lang (@Alllangs)
{
    print "<";
    for ( split //, $lang ) {
        printf( "%x,", ord($_));
    }
    print ">\n";
}
Output >>
<ff,fe,fe,0,0,23,41,>
<41,49,57,0,0,0,0,32,41,49,57,0,0,0,0,32,>
<56,65,0,35,56,>
You can use split and lstrip with list comprehension as:
s='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
sp=s.split('\x00\x00\x00\x00\x00')
print [i.lstrip('\x00') for i in sp if i != ""]  # Python 2 print statement
Output:
['\xff\xfe\xfe\x00\x00#A', 'AIW\x00\x00\x00\x002AIW\x00\x00\x00\x002', 'Ve\x005V']
Split the entire data on runs of 5 nul values.
Then, for each element in the list, strip any nuls it starts with (this handles a variable number of leftover nuls at the start).
Here's how to do it in Python. I had to str.strip() off any leading and trailing nulls to prevent an extra empty string from appearing at the beginning of the list of results returned from re.split().
import re
data = ('\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41'
        '\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41'
        '\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
        '\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
        '\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
chunks = re.split(r'\000{6,}', data.strip('\x00'))
# display results
print ',\n'.join(''.join('\\x'+ch.encode('hex_codec') for ch in chunk)
                 for chunk in chunks),
Output:
\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32,
\x56\x65\x00\x35\x56
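The snippets above use Python 2 syntax. On Python 3, where binary file contents are `bytes`, the same idea could look roughly like this. It is a sketch under the question's stated rule (more than 5 nulls, i.e. 6 or more, form a separator); `split_on_null_runs` and its `min_run` parameter are hypothetical names.

```python
import re

def split_on_null_runs(data, min_run=6):
    """Split bytes on runs of at least `min_run` null bytes."""
    pattern = re.compile(rb'\x00{%d,}' % min_run)
    # Strip outer nulls first so no empty chunk appears at either end
    return [chunk for chunk in pattern.split(data.strip(b'\x00')) if chunk]

data = (b'\x00' * 10 + b'\xff\xfe\xfe\x00\x00\x23\x41'
        + b'\x00' * 8 + b'\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32'
        + b'\x00' * 15 + b'\x56\x65\x00\x35\x56')

for chunk in split_on_null_runs(data):
    # Show every byte as \xNN, like the output in the question
    print(''.join('\\x%02x' % byte for byte in chunk))
```

Shorter runs of nulls inside a chunk, such as the four nulls in the second chunk, are kept because they do not reach `min_run`.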