How to parse lines that start with multiple possible keywords in Python?

I am trying to parse an input text file, checking whether each line starts with one of several possible keywords and incrementing accordingly. The input is SQL-like text; when a keyword is found, I then want to check the pattern of the words that follow it. I got some help previously and was able to parse some of the lines, but when I tried to add other possible keywords, some lines were not parsed and I got a ValueError instead. I read the re documentation and understood how to check whether a line starts with one of multiple strings, but I still end up with the error. Can anyone take a look and correct me?
Use case:
Each line is expected to start with one of: CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, INSERT INTO. I attempted this using #flakes' help from a previous post:
import re, os, sys

matcher = re.compile(r"^[(CREATE(?: OR REPLACE)?) | END | (INPUT (?: FROM)) | (INSERT(?:INTO))] (\S+) (\S+).*$")
with open('input.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")
But nothing is printed out. I checked the pattern at https://regex101.com, but I still end up with this ValueError:
ValueError: not enough values to unpack (expected 3, got 2)
I used this input file for parsing using re.
Objective
Basically, I need to check whether each line starts with CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, or INSERT INTO; if it does, print the next word and check the pattern of the word after that. Can anyone point out what went wrong with the pattern I composed?

If you always want to return 3 groups, you can make the last group match optional non-whitespace characters, as that value is not always present in the example data.
You should use a single capture group for all the variants at the start of the string, and put the optional parts in a non-capture group.
Note that you should not put the intended groupings in square brackets, as that changes the meaning to a character class, which matches any one of the individual characters listed inside it.
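A quick sketch of the difference (illustrative strings only): inside square brackets the letters and the alternation bar are all just members of a character class, so the pattern matches single characters rather than whole keywords:

```python
import re

# [CREATE|END] is a character class: it matches any ONE of the
# characters C, R, E, A, T, |, N, D -- not the words themselves.
print(re.match(r"[CREATE|END]", "E"))      # matches the single letter "E"
print(re.match(r"[CREATE|END]", "INPUT"))  # None: "I" is not in the class

# A group with alternation matches whole words:
print(re.match(r"(CREATE|END)", "END"))    # matches the word "END"
```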
The updated pattern:
^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)
See the capture groups in the regex demo
import re

matcher = re.compile(r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)")
with open('input.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")
Output
component='CREATE',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi'
component='CREATE',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE OR REPLACE',component_type='SOURCE', component_name='comp_src_vat4_sac_rec_cln1_tst_noi'
component='END',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='STREAM', component_name='vat4_sac_rec_cln1_tst_noi_text_clean_stream'
component='CREATE',component_type='STREAM', component_name='CQ_STREAM_vat4_sac_rec_cln1_tst_noi'
component='CREATE OR REPLACE',component_type='TARGET', component_name='comp_tgt_vat4_sac_rec_cln1_tst_noi'
component='INPUT FROM',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream;', component_name=''
component='CREATE OR REPLACE',component_type='CQ', component_name='CQ_vat4_sac_rec_cln1_tst_noi'
component='INSERT INTO',component_type='CQ_STREAM_vat4_sac_rec_cln1_tst_noi', component_name=''
component='CREATE OR REPLACE',component_type='OPEN', component_name='PROCESSOR'
component='INSERT INTO',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream', component_name=''
component='END',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='END',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi;'
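Note that the captured `component_name` values keep the trailing semicolon from the SQL-like source (e.g. `flow_src_vat4_sac_rec_cln1_tst_noi;`). If that is unwanted, one option is a small post-processing step, sketched here on one of the captured names:

```python
# Strip a trailing statement terminator from a captured name, if present.
name = "flow_src_vat4_sac_rec_cln1_tst_noi;"
clean = name.rstrip(";")
print(clean)  # flow_src_vat4_sac_rec_cln1_tst_noi
```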

Related

how to avoid regex matching "Revert "Revert"

I have the following code, which matches both line1 and line2. I only want to match line2, not line1; can anyone provide guidance on how to do this?
import re

line1 = '''Revert "Revert <change://problem/47614601> [tech feature][V3 Driver] Last data path activity timestamp update is required for feature""'''
line2 = '''Revert <change://problem/47614601> [tech feature][V3 Driver] Last data path activity timestamp update is required for feature"'''
if re.findall(".*?(?:Revert|revert)\s*\S*(?:change:\/\/problem\/)(\d{8})", line2):
    match = re.findall(".*?(?:Revert|revert)\s*\S*(?:change:\/\/problem\/)(\d{8})", line2)
    print "Revert radar match%s" % match
    revert_radar = True
print revert_radar
Something like this should do what you want:
>>> regex = "(?!:(?:R|r)evert.*)(?:Revert|revert)\s*\S*(?:change:\/\/problem\/)(\d{8})"
>>> re.match(regex, line1) is None
True
>>> re.match(regex, line2).groups()
('47614601',)
Negative lookbehind: the word NOT preceded by same-word, space, double quote:
r'''(?<![Rr]evert ")[Rr]evert\s<change:[/][/]problem[/]\d{8}.*"'''
Negative lookahead: the pattern you want NOT followed by a doubled double quote:
r'''[Rr]evert\s<change:[/][/]problem[/]\d{8}.*?\w"(?!")'''
This doesn't work if the unwanted line immediately precedes the wanted line without an intervening newline character.
It does work if the unwanted line follows the wanted line.
If you are looking at individual lines, look for the pattern at the start of the string; of the three, this is the least work and the most efficient for the regex engine:
r'''^[Rr]evert\s<change:[/][/]problem[/]\d{8}.*$'''
If the line you are searching for is embedded in a long string but there is a newline character before that line, you can use the previous pattern with the multiline flag.
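For instance, with both lines joined into one string, the anchored pattern plus re.MULTILINE picks out only the wanted line (a minimal sketch using the two lines from the question):

```python
import re

text = (
    'Revert "Revert <change://problem/47614601> [tech feature][V3 Driver] Last data path activity timestamp update is required for feature""\n'
    'Revert <change://problem/47614601> [tech feature][V3 Driver] Last data path activity timestamp update is required for feature"'
)

# With re.MULTILINE, ^ matches at the start of every line, so the pattern
# only fires where a line itself begins with the Revert <change://...> form.
pattern = r'^[Rr]evert\s<change:[/][/]problem[/]\d{8}.*$'
for m in re.finditer(pattern, text, re.MULTILINE):
    print(m.group())
```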

Regex Extract specific data between specific strings in python

Using regex in Python 3.6.3, I am trying to extract scientific-notation numbers associated with a specific start text and end text. From the following sample data:
Not_the_data : REAL[10] (randomtext := doesntapply) := [1.00000000e+000,-2.00000000e000,3.00000000e+000,4.00000000e+000,5.00000000e+000,6.00000000e+000
,7.00000000e+000,8.00000000e-000,9.00000000e+000,1.00000000e+001,1.10000000e+001];
This_data : REAL[2,27] (RADIX := Float) := [3.45982254e-001,9.80374157e-001,8.29904616e-001,1.57800000e+002,4.48320538e-001,6.20533180e+001
,1.80081348e+003,-8.93283653e+000,5.25826037e-001,2.16974407e-001,1.17304848e+002,6.82604387e-002
,3.76116596e-002,6.82604387e-002,3.76116596e-002];
Not_it_either : REAL[72] (randomtext := doesntapply) := [0.00000000e+000,-0.00000000e000,0.00000000e+000,0.00000000e+000,0.00000000e+000,0.00000000e+000];
I would want only the data in the "This_data" set:
['3.45982254e-001','9.80374157e-001','8.29904616e-001','1.57800000e+002','4.48320538e-001','6.20533180e+001','1.80081348e+003','-8.93283653e+000','5.25826037e-001','2.16974407e-001','1.17304848e+002','6.82604387e-002','3.76116596e-002','6.82604387e-002','3.76116596e-002']
If I don't use the lookaround functions I can get all the numbers that match the scientific notation easily like this:
values = re.findall('(-?[0-9]+.[0-9]+e[+-][0-9]+)',_DATA_,re.DOTALL|re.MULTILINE)
But as soon as I add a lookaround:
values = re.findall('(?<=This_data).*?(-?[0-9]+.[0-9]+e[+-][0-9]+)+',_DATA_,re.DOTALL|re.MULTILINE)
all but the first number in the desired set drop off. I have attempted multiple iterations of this using positive and negative lookahead and lookbehind on Debuggex, to no avail.
My source file is 50k+ lines and the data set desired is 10-11k lines. Ideally I would like to capture my data set in one read through of my file.
How can I correctly use a lookahead or lookbehind function to limit my data capture to numbers that meet the format but only from the desired "This_Data" set?
Any help is appreciated!
You might have an easier time parsing the file one line at a time, skipping lines that don't meet the criteria. It looks like each line ends with a semicolon, so you can use that as a way to break the parsing.
import re

PARSING = False
out = []
with open('path/to/file.data') as fp:
    for line in fp:
        if line.startswith('This_data'):
            PARSING = True
        if PARSING:
            # extend (not append) so `out` stays a flat list of numbers
            out.extend(re.findall(r'(-?[0-9]+\.[0-9]+e[+-][0-9]+)', line))
            # check if the line contains a semicolon to stop parsing
            if ';' in line:
                PARSING = False
# the results:
out
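An alternative that keeps the one-pass spirit of the question: first capture the whole `This_data` block up to its terminating semicolon, then run the number pattern over just that block. A sketch, where the shortened `_DATA_` string stands in for the real file contents:

```python
import re

_DATA_ = """Not_the_data : REAL[10] (randomtext := doesntapply) := [1.00000000e+000,-2.00000000e000];
This_data : REAL[2,27] (RADIX := Float) := [3.45982254e-001,9.80374157e-001
,1.80081348e+003,-8.93283653e+000];
Not_it_either : REAL[72] (randomtext := doesntapply) := [0.00000000e+000];"""

# Step 1: grab everything from "This_data" up to the first closing semicolon.
block = re.search(r'This_data.*?;', _DATA_, re.DOTALL).group()

# Step 2: pull the scientific-notation numbers out of that block only
# (note the escaped dot, so "." is matched literally).
values = re.findall(r'-?[0-9]+\.[0-9]+e[+-][0-9]+', block)
print(values)
```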

extract a line that matches a string with IP address

I am working in Python to find lines that match a particular pattern containing an IP address.
f = open('file.txt','r')
for line in f:
    if line.find("N2:42.61.0.69"):
        print line
The pattern that I am trying to match is "node number":"IP address", where
"node number" is the alphabet 'N' followed by a 'number'. Such as N23, N456, N98765, etc.
I used the pattern N2:42\.61\.0\.69, but it didn't yield any result.
Most of the examples talk about a regex to match a particular pattern, such as "^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$" for an IP address.
But here I want to match a particular string with IP address.
Thanks for the help!
The function find():
Returns the lowest index in s where the substring sub is found such that sub is wholly contained in s[start:end]. Return -1 on failure. Defaults for start and end and interpretation of negative values is the same as for slices.
So if you want to find lines containing the given string( "N2:42.61.0.69" in your case), the condition should be:
if line.find("N2:42.61.0.69") != -1:
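A minimal sketch of why the explicit comparison matters: `find` returns 0 for a match at the very start of the line (which is falsy) and -1 for no match (which is truthy), so a bare `if line.find(...)` does the opposite of what you might expect in both of those cases:

```python
line = "N2:42.61.0.69 is up"

# Found at index 0 -- 0 is falsy, so `if line.find(...)` would skip this line.
print(line.find("N2:42.61.0.69"))  # 0
# Not found -- -1 is truthy, so `if line.find(...)` would wrongly fire.
print(line.find("N9:1.2.3.4"))     # -1
```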
If I understand your question, the value of the node may change, but the value of the IP address is known and fixed. Since the dot has a special meaning in regular expressions, you must place a backslash before.
import re

REGEX = re.compile(r"^N\d+:42\.61\.0\.69$")
f = open('file.txt','r')
for line in f:
    line = line.strip()
    if REGEX.match(line):
        print line
f.close()
Notice the use of a raw string (r"..."); without it, all the backslashes would need to appear twice (\\). Also, the ^ at the beginning and the $ at the end are used to match the whole line. If other text can appear in the lines, you can remove them and use search instead of match. The strip() is used to remove end-of-line characters (\n for example).
When used with the following data (contents of file.txt)
N1:42.61.0.69
N2:10.0.0.1
N123456:42.61.0.69
it prints only the first and third lines
Are you looking for this:
(?:^|\s*)N\d+:(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]).){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(?:$|\s+)
REGEX 101 DEMO
In Python 2.7, to use the pattern in your example, you could:
import re

p = '(?:^|\s*)N\d+:(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9]).){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(?:$|\s+)'
with open('file.txt', 'r') as f:
    print [line.strip() for line in f.readlines() if re.findall(p, line)]
If you want to process file.txt a line at a time, the following should work:
import re

with open('file.txt') as f_input:
    for line in f_input:
        match = re.search(r'(.*?(N\d+):42\.61\.0\.69 .*?)', line)
        if match:
            line, node_number = match.groups()
            print "{} found in {}".format(node_number, line)
Given file.txt contains the following:
link L518523: N1:42.61.0.69 N248066
non matching line
link L518533: N2:42.61.0.69 N248066
link L518553: N3:43.61.0.69 N248066
link L518553: N4:42.61.0.69 N248066
You would get the following output:
N1 found in link L518523: N1:42.61.0.69
N2 found in link L518533: N2:42.61.0.69
N4 found in link L518553: N4:42.61.0.69

How to substitute specific patterns in python

I want to replace all occurrences of integers that are greater than 2147483647 and are followed by ^^<int> with the first 3 digits of the number. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
The way I have implemented it is by scanning the data line by line. If I find numbers greater than 2147483647, I replace them with their first 3 digits. However, I don't know how to check that the next part of the string is ^^<int>.
What I want to do is: for numbers greater than 2147483647, e.g. 25500000000, replace them with the first 3 digits of the number. Since my data is 1 terabyte in size, a faster solution is much appreciated.
Use the re module to construct a regular expression:
regex = r"""
( # Capture in group #1
"[\w\s]+" # Three sequences of quoted letters and white space characters
\s+ # followed by one or more white space characters
"[\w\s]+"
\s+
"[\w\s]+"
\s+
)
"(\d{10,})" # Match a quoted set of at least 10 integers into group #2
(\^\^\s+\.\s+) # Match two (escaped) circumflex characters, whitespace and a
               # period into group #3
(.*) # Followed by anything at all into group #4
"""
import re

COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (since re.RegexObject.sub takes a callback) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # replace the digit string (not the int) with its first 3 digits
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data you will probably not want to do this in memory. You'll want to open the file and then iterate over it, replacing the data line by line and writing it back out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
         open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall('\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            # found looks like '25500000000"^^'; drop the trailing '"^^'
            # before converting, and keep it in the replacement so the
            # markup around the number survives
            if int(found[:-3]) > 2147483647:
                line = line.replace(found, found[:-3][:3] + '"^^')
        outfile.write(line)
Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should get you started, at the very least.
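One way to avoid the inner loop is re.sub with a replacement function, so each match is tested and rewritten in a single pass. A sketch on one sample line (the `shorten` name is my own):

```python
import re

def shorten(m):
    # m.group(1) is the digit run; keep the '"^^' suffix intact.
    digits = m.group(1)
    return digits[:3] + '"^^' if int(digits) > 2147483647 else m.group(0)

line = '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'
print(re.sub(r'(\d+)"\^\^', shorten, line))
# "Ask a Question" <at> "255"^^<int> <stack_overflow> .
```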

Python - trying to capture the middle of a line, regex or split

I have a text file with some names and emails and other stuff. I want to capture email addresses.
I don't know whether this is a split or regex problem.
Here are some sample lines:
[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79
I want to be able to do a loop that prints all the email addresses.
Thanks.
I'd use a regex:
import re

data = '''[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79'''
group_matcher = re.compile(r'\[(.*?)\]([^\[]+)')
for line in data.split('\n'):
    o = dict(group_matcher.findall(line))
    print o['email']
\[ is literally [.
(.*?) is a non-greedy capturing group. It "expands" to capture the text.
\] is literally ]
( is the beginning of a capturing group.
[^\[] matches anything but a [.
+ repeats the last pattern any number of times.
) closes the capturing group.
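To see what those pieces produce on one of the sample lines: findall yields (tag, value) pairs, which dict() turns into a lookup table (Python 3 print shown; the pairs-to-dict step is the same either way):

```python
import re

group_matcher = re.compile(r'\[(.*?)\]([^\[]+)')
line = '[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81'

# Each match is a (tag, value) tuple; values keep any trailing space
# before the next "[", so strip() them when looking one up.
pairs = group_matcher.findall(line)
print(pairs)
print(dict(pairs)['email'].strip())  # bill.billy#hotmail.com
```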
for line in lines:
    print line.split("]")[2].split(" ")[0]
You can pass substrings to split, not just single characters, so:
email = line.partition('[email]')[-1].partition('[')[0].rstrip()
This has an advantage over the simple split solutions that it will work on fields that can have spaces in the value, on lines that have things in a different order (even if they have [email] as the last field), etc.
To generalize it:
def get_field(line, field):
    return line.partition('[{}]'.format(field))[-1].partition('[')[0].rstrip()
However, I think it's still more complicated than the regex solution. Plus, it can only search for exactly one field at a time, instead of all fields at once (without making it even more complicated). To get two fields, you'll end up parsing each line twice, like this:
for line in data.splitlines():
    print '''{} "babysat" Dan O'Brien on {}'''.format(get_field(line, 'name'),
                                                      get_field(line, 'dob'))
(I may have misinterpreted the DOB field, of course.)
You can split by space and then search for the element that starts with [email]:
line = '[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81'
items = line.split()
for item in items:
    if item.startswith('[email]'):
        print item.replace('[email]', '', 1)
Say you have a file with lines:
import re

f = open("logfile", "r")
data = f.read()
for line in data.split("\n"):
    match = re.search(r"email\](?P<id>.*)\[dob", line)
    if match:
        # either store or print the emails as you like
        print match.group('id').strip(), "\n"
That's all. (Try it! For Python 3 and above, remember print is a function, so adjust the print lines accordingly.)
The output from your sample data:
bill.billy#hotmail.com
mark.hilly#hotmail.com
gill.silly#hotmail.com
>>>
