Using regex in Python 3.6.3, I am trying to extract scientific-notation numbers associated with a specific start text and end text. From the following sample data:
Not_the_data : REAL[10] (randomtext := doesntapply) := [1.00000000e+000,-2.00000000e000,3.00000000e+000,4.00000000e+000,5.00000000e+000,6.00000000e+000
,7.00000000e+000,8.00000000e-000,9.00000000e+000,1.00000000e+001,1.10000000e+001];
This_data : REAL[2,27] (RADIX := Float) := [3.45982254e-001,9.80374157e-001,8.29904616e-001,1.57800000e+002,4.48320538e-001,6.20533180e+001
,1.80081348e+003,-8.93283653e+000,5.25826037e-001,2.16974407e-001,1.17304848e+002,6.82604387e-002
,3.76116596e-002,6.82604387e-002,3.76116596e-002];
Not_it_either : REAL[72] (randomtext := doesntapply) := [0.00000000e+000,-0.00000000e000,0.00000000e+000,0.00000000e+000,0.00000000e+000,0.00000000e+000];
I would want only the data in the "This_data" set:
['3.45982254e-001','9.80374157e-001','8.29904616e-001','1.57800000e+002','4.48320538e-001','6.20533180e+001','1.80081348e+003','-8.93283653e+000','5.25826037e-001','2.16974407e-001','1.17304848e+002','6.82604387e-002','3.76116596e-002','6.82604387e-002','3.76116596e-002']
If I don't use the lookaround functions I can get all the numbers that match the scientific notation easily like this:
values = re.findall('(-?[0-9]+.[0-9]+e[+-][0-9]+)',_DATA_,re.DOTALL|re.MULTILINE)
But as soon as I add a lookahead function:
values = re.findall('(?<=This_data).*?(-?[0-9]+.[0-9]+e[+-][0-9]+)+',_DATA_,re.DOTALL|re.MULTILINE)
all but the first number in the desired set drop off. I have attempted multiple iterations of this using positive and negative lookahead and lookbehind on Debuggex, to no avail.
My source file is 50k+ lines and the data set desired is 10-11k lines. Ideally I would like to capture my data set in one read through of my file.
How can I correctly use a lookahead or lookbehind function to limit my data capture to numbers that meet the format but only from the desired "This_Data" set?
Any help is appreciated!
You might have an easier time parsing the file one line at a time, skipping lines that don't meet the criteria. It looks like each line ends with a semicolon, so you can use that as a way to break the parsing.
import re

PARSING = False
out = []
with open('path/to/file.data') as fp:
    for line in fp:
        if line.startswith('This_data'):
            PARSING = True
        if PARSING:
            out.extend(re.findall(r'(-?[0-9]+\.[0-9]+e[+-][0-9]+)', line))
            # a semicolon marks the end of the data set, so stop parsing
            if ';' in line:
                PARSING = False
# out now holds the matched numbers
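If you would rather do it in a single pass over the whole string, a two-step regex approach sidesteps the lookaround pitfall: re.findall only reports the final capture of a repeated group, so a single pattern cannot return every number in one match. Instead, first isolate the This_data block (everything up to its terminating semicolon), then run the number pattern on that slice alone. A minimal sketch, using a made-up _DATA_ string shaped like the sample above:

```python
import re

# Hypothetical data shaped like the question's sample
_DATA_ = (
    "Not_the_data : REAL[10] := [1.00000000e+000,-2.00000000e+000];\n"
    "This_data : REAL[2,27] (RADIX := Float) := [3.45982254e-001,\n"
    "9.80374157e-001,1.57800000e+002];\n"
    "Not_it_either : REAL[72] := [0.00000000e+000];"
)

# Step 1: grab the This_data block up to its closing semicolon
block = re.search(r'This_data.*?;', _DATA_, re.DOTALL).group()

# Step 2: findall with no lookarounds, applied only to that slice
values = re.findall(r'-?[0-9]+\.[0-9]+e[+-][0-9]+', block)
print(values)  # ['3.45982254e-001', '9.80374157e-001', '1.57800000e+002']
```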
I am trying to parse an input txt file where I want to check whether each line starts with one of several possible keywords, and process it accordingly. The input text is SQL-like, and I am checking multiple keywords; if one is found, I check the pattern of the token next to it. I got some help earlier and was able to parse some of it, but when I tried to add the other possible keywords, some of the lines were not parsed and I get a ValueError instead. I checked the re documentation and understood how to check whether a line starts with one of multiple strings, but I still end up with a ValueError. Can anyone take a look and correct me?
Use case:
So in each line I am expecting to see one of these at the start: CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, INSERT INTO. I do this using @flakes' help from a previous post:
import re, os, sys

matcher = re.compile(r"^[(CREATE(?: OR REPLACE)?) | END | (INPUT (?: FROM)) | (INSERT(?:INTO))] (\S+) (\S+).*$")

with open('input.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")
But nothing is printed out. I checked the pattern at https://regex101.com, and I end up with this ValueError:
ValueError: not enough values to unpack (expected 3, got 2)
I used this input file for parsing with re.
Objective
Basically, I need to check whether each line starts with CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, or INSERT INTO; if yes, then print the next word and check the pattern of the word after that. Can anyone point out what went wrong with the pattern I composed?
If you always want 3 groups returned, you can make the last group capture optional non-whitespace characters, as that value is not always present in the example data.
You should use a single capture group for all the variants at the start of the string, and put the optional parts in non-capture groups.
Note that you should not put the intended groupings in square brackets, as that changes the meaning to a character class matching the individual characters listed in it.
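A quick illustration of that character-class effect, on a made-up line:

```python
import re

# Inside [...] every character is just a member of a set, so the intended
# alternation collapses into a class of single characters
char_class = re.compile(r"^[(CREATE(?: OR REPLACE)?)|END]")
group_alt = re.compile(r"^(CREATE(?: OR REPLACE)?|END)")

print(char_class.match("END FLOW").group())   # 'E'  (one character from the class)
print(group_alt.match("END FLOW").group(1))   # 'END' (the whole keyword)
```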
The updated pattern:
^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)
See the capture groups in the regex demo
import re

matcher = re.compile(r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)")

with open('input.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")
Output
component='CREATE',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi'
component='CREATE',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE OR REPLACE',component_type='SOURCE', component_name='comp_src_vat4_sac_rec_cln1_tst_noi'
component='END',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='STREAM', component_name='vat4_sac_rec_cln1_tst_noi_text_clean_stream'
component='CREATE',component_type='STREAM', component_name='CQ_STREAM_vat4_sac_rec_cln1_tst_noi'
component='CREATE OR REPLACE',component_type='TARGET', component_name='comp_tgt_vat4_sac_rec_cln1_tst_noi'
component='INPUT FROM',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream;', component_name=''
component='CREATE OR REPLACE',component_type='CQ', component_name='CQ_vat4_sac_rec_cln1_tst_noi'
component='INSERT INTO',component_type='CQ_STREAM_vat4_sac_rec_cln1_tst_noi', component_name=''
component='CREATE OR REPLACE',component_type='OPEN', component_name='PROCESSOR'
component='INSERT INTO',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream', component_name=''
component='END',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='END',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi;'
I have an output string like this:
read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec
And I want to just extract one of the numerical values for computation, say iops. I'm processing it like this:
if 'read ' in key:
    my_read_iops = value.split(",")[2].split("=")[1]
    result['test_details']['read'] = my_read_iops
But there are slight inconsistencies with some of the strings I'm reading in and my code is getting super complicated and verbose. So instead of manually counting the number of commas vs "=" chars, what's a better way to handle this?
You can use the regular expression \s* to handle inconsistent spacing; it matches zero or more whitespace characters:
import re
s = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
for m in re.finditer(r'\s*(?P<name>\w*)\s*=\s*(?P<value>[\w/]*)\s*', s):
    print(m.group('name'), m.group('value'))
# io 131220KB
# bw 14016KB/s
# iops 3504
# runt 9362msec
Using named groups, you can construct the pattern string from a list of column names:
names = ['io', 'bw', 'iops', 'runt']
name_val_pat = r'\s*{name}\s*=\s*(?P<{group_name}>[\w/]*)\s*'
pattern = ','.join([name_val_pat.format(name=name, group_name=name) for name in names])
# '\s*io\s*=\s*(?P<io>[\w/]*)\s*,\s*bw\s*=\s*(?P<bw>[\w/]*)\s*,\s*iops\s*=\s*(?P<iops>[\w/]*)\s*,\s*runt\s*=\s*(?P<runt>[\w/]*)\s*'
match = re.search(pattern, s)
data_dict = {name: match.group(name) for name in names}
print(data_dict)
# {'io': '131220KB', 'bw': '14016KB/s', 'runt': '9362msec', 'iops': '3504'}
This way, you only need to change names, keeping the order correct.
If I were you, I'd use a regex (regular expression) as the first choice.
import re
s= "read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec"
re.search(r"iops=(\d+)",s).group(1)
This code finds the string pattern that starts with 'iops=' and continues with a number of at least 1 digit, then extracts the target string (3504) using the round brackets (a capture group).
You can find more information about regex at
https://docs.python.org/3.6/library/re.html#module-re
Regex is a powerful language for complex pattern matching with simple syntax.
from re import match

string = 'read : io=131220KB, bw=14016KB/s, iops=3504, runt= 9362msec'
iops = match(r'.+(iops=)([0-9]+)', string).group(2)
print(iops)
# '3504'
I have a binary file, and I need to extract a few chunks of data from it using a Python regular expression: the runs of non-null bytes that sit between runs of null bytes.
For example this is the main character set:
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56
The regex should extract below character sets from above master set:
\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32 and
\x56\x65\x00\x35\x56
One thing is important: only when there are more than 5 consecutive null bytes should that run of nulls be treated as a separator; otherwise the null bytes belong to the non-null chunk. As you can see in the example, a few null bytes are present inside the extracted character sets.
If this doesn't make sense, let me know and I will try to explain it better.
Thanks in advance,
You could split on \x00{5,}
This is 5 or more zero bytes, the delimiter you specified.
In Perl, it's something like this:
Perl test case
$strLangs = "\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56";
# Remove leading zeros (5 or more)
$strLangs =~ s/^\x00{5,}//;

# Split on 5 or more 0's
@Alllangs = split /\x00{5,}/, $strLangs;

# Print each language's characters
foreach $lang (@Alllangs)
{
    print "<";
    for ( split //, $lang ) {
        printf( "%x,", ord($_) );
    }
    print ">\n";
}
Output >>
<ff,fe,fe,0,0,23,41,>
<41,49,57,0,0,0,0,32,41,49,57,0,0,0,0,32,>
<56,65,0,35,56,>
You can use split and lstrip with a list comprehension:
s='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
sp=s.split('\x00\x00\x00\x00\x00')
print [i.lstrip('\x00\\') for i in sp if i != ""]
Output:
['\xff\xfe\xfe\x00\x00#A', 'AIW\x00\x00\x00\x002AIW\x00\x00\x00\x002', 'Ve\x005V']
Split the entire data on 5 null bytes; then, for each element in the list, strip any leading nulls (this handles a variable number of nulls at the start).
Here's how to do it in Python. I had to str.strip() off the leading and trailing nulls to prevent re.split() from including an extra empty string at the beginning of the list of results.
import re
data = ('\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41'
'\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41'
'\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
'\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56'
'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00')
chunks = re.split(r'\000{6,}', data.strip('\x00'))
# display results
print ',\n'.join(''.join('\\x'+ch.encode('hex_codec') for ch in chunk)
for chunk in chunks),
Output:
\xff\xfe\xfe\x00\x00\x23\x41,
\x41\x49\x57\x00\x00\x00\x00\x32\x41\x49\x57\x00\x00\x00\x00\x32,
\x56\x65\x00\x35\x56
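For what it's worth, on Python 3 the same split works on bytes objects directly; the pattern and the data must both be bytes. A sketch with the sample data:

```python
import re

# The sample data as a bytes literal
data = (b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xff\xfe\xfe\x00\x00\x23\x41'
        b'\x00\x00\x00\x00\x00\x00\x00\x00\x41\x49\x57\x00\x00\x00\x00\x32\x41'
        b'\x49\x57\x00\x00\x00\x00\x32\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
        b'\x00\x00\x00\x00\x00\x56\x65\x00\x35\x56')

# Runs of more than five NUL bytes act as separators; strip the
# leading/trailing NULs first so no empty chunks appear
chunks = re.split(b'\x00{6,}', data.strip(b'\x00'))
print(chunks)
```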
I want to replace all occurrences of integers which are greater than 2147483647 and are followed by ^^<int> with the first 3 digits of the number. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
The way I have implemented it is by scanning the data line by line. If I find a number greater than 2147483647, I replace it with its first 3 digits. However, I don't know how I should check that the next part of the string is ^^<int>.
What I want to do is: for numbers greater than 2147483647, e.g. 25500000000, replace them with the first 3 digits of the number. Since my data is 1 terabyte in size, a faster solution is much appreciated.
Use the re module to construct a regular expression:
regex = r"""
( # Capture in group #1
"[\w\s]+" # Three sequences of quoted letters and white space characters
\s+ # followed by one or more white space characters
"[\w\s]+"
\s+
"[\w\s]+"
\s+
)
"(\d{10,})" # Match a quoted set of at least 10 integers into group #2
(^^\s+\.\s+) # Match by two circumflex characters, whitespace and a period
# into group #3
(.*) # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (since re.RegexObject.sub takes a callback) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # str.replace needs strings, so use the matched text, not the int
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data you will probably not want to do this in memory; you'll want to open the file, iterate over it, replace the data line by line, and write it back out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
         open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall('\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall('\d+"\^\^', line):
            if int(found[:-3]) > 2147483647:
                line = line.replace(found, found[:3])
        outfile.write(line)
Because of the inner for-loop, this has the potential to be inefficient. However, I can't think of a better regex at the moment, so this should at least get you started.
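A tighter variant of the same idea, doing the size check and the replacement in one re.sub callback; this sketch assumes the number is always quoted and immediately followed by ^^<int>:

```python
import re

# Hypothetical single line shaped like the question's data
line = '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'

def shorten(m):
    num = m.group(1)
    # keep only the first three digits when the value overflows a signed 32-bit int
    if int(num) > 2147483647:
        return '"%s"^^<int>' % num[:3]
    return m.group(0)

fixed = re.sub(r'"(\d+)"\^\^<int>', shorten, line)
print(fixed)  # "Ask a Question" <at> "255"^^<int> <stack_overflow> .
```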
I need to process lines having a syntax similar to markdown http://daringfireball.net/projects/markdown/syntax, where header lines in my case are something like:
=== a sample header ===
===== a deeper header =====
and I need to change their depth, i.e. reduce it (or increase it) so:
== a sample header ==
==== a deeper header ====
My small knowledge of Python regexes is not enough to understand how to replace a number n of '=' signs with (n-1) '=' signs.
You could use backreferences and two negative lookarounds to find two corresponding sets of = characters.
output = re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', input)
That will also work if you have a longer string that contains multiple headers (and will change all of them).
What does the regex do?
(?<!=) # make sure there is no preceding =
= # match a literal =
( # start capturing group 1
=+ # match one or more =
) # end capturing group 1
( # start capturing group 2
.*? # match zero or more characters, but as few as possible (due to ?)
) # end capturing group 2
= # match a =
\1 # match exactly what was matched with group 1 (i.e. the same amount of =)
(?!=) # make sure there is no trailing =
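Applied to the two sample headers, that substitution trims one = from each side:

```python
import re

text = "=== a sample header ===\n===== a deeper header ====="
fixed = re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', text)
print(fixed)
# == a sample header ==
# ==== a deeper header ====
```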
No need for regexes. I would go very simple and direct:
import sys

for line in sys.stdin:
    trimmed = line.strip()
    if len(trimmed) >= 2 and trimmed[0] == '=' and trimmed[-1] == '=':
        print(trimmed[1:-1])
    else:
        print(line.rstrip())
The initial strip is useful because in Markdown people sometimes leave blank spaces at the end of a line (and maybe the beginning). Adjust accordingly to meet your requirements.
I think it can be as simple as replacing '=(=+)' with \1. Is there any reason for not doing so?
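That does work for the simple cases, since each run of = characters loses exactly one; a quick sketch (the caveat being that it also shortens = runs that are not headers, so only use it if the input contains nothing but header lines):

```python
import re

headers = ['=== a sample header ===', '===== a deeper header =====']
# each run of two or more '=' is replaced by the same run minus one
trimmed = [re.sub(r'=(=+)', r'\1', h) for h in headers]
print(trimmed)  # ['== a sample header ==', '==== a deeper header ====']
```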
How about a simple solution?
lines = ['=== a sample header ===', '===== a deeper header =====']
new_lines = []
for line in lines:
    if line.startswith('==') and line.endswith('=='):
        new_lines.append(line[1:-1])
results:
['== a sample header ==', '==== a deeper header ====']
or in one line:
new_lines = [line[1:-1] for line in lines if line.startswith('==') and line.endswith('==')]
The logic here is that if a line starts and ends with '==', it must have at least two '=' on each side, so when we trim one character from each side we are left with at least one '=' on each side.
This will work as long as each line starts and ends with its run of '=' characters; if you are using these as headers they will, provided you strip the newlines off.
For either the first or the second header, you can just use string replace like this:
s = "=== a sample header ==="
s = s.replace("= ", " ")
s = s.replace(" =", " ")
(Note that str.replace returns a new string, so assign the result back.) You can deal with the deeper header the same way.
BTW: you could also use the sub function of the re module, but it's not necessary.