how to avoid regex matching "Revert "Revert" - python

I have the following code, which matches both line1 and line2. I want to match ONLY line2, not line1. Can anyone provide guidance on how to do this?
import re
line1 = '''Revert "Revert <change://problem/47614601> [tech feature][V3 Driver] Last data path activity timestamp update is required for feature""'''
line2 = '''Revert <change://problem/47614601> [tech feature][V3 Driver] Last data path activity timestamp update is required for feature"'''
if re.findall(".*?(?:Revert|revert)\s*\S*(?:change:\/\/problem\/)(\d{8})", line2):
    match = re.findall(".*?(?:Revert|revert)\s*\S*(?:change:\/\/problem\/)(\d{8})", line2)
    print "Revert radar match%s"%match
    revert_radar = True
    print revert_radar

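Something like this should do what you want. Note that re.match only matches at the start of the string, and that anchoring is what rejects line1 here; the leading (?!:...) group is a negative lookahead for a literal colon and has no effect on these two inputs:
>>> regex = r"(?!:(?:R|r)evert.*)(?:Revert|revert)\s*\S*(?:change:\/\/problem\/)(\d{8})"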
>>> re.match(regex, line1) is None
True
>>> re.match(regex, line2).groups()
('47614601',)

Negative lookbehind: the word is NOT preceded by the same word, a space, and a double quote
r'''(?<![Rr]evert ")[Rr]evert\s<change:[/][/]problem[/]\d{8}.*"'''
Negative look ahead: pattern you want NOT followed by a double doublequote
r'''[Rr]evert\s<change:[/][/]problem[/]\d{8}.*?\w"(?!")'''
This doesn't work if the unwanted line immediately precedes the wanted line without an intervening newline character.
It does work if the unwanted line follows the wanted line.
If you are looking at individual lines, look for the pattern at the start of the string. Of the three, this is the least work and the most efficient for the regex engine:
r'''^[Rr]evert\s<change:[/][/]problem[/]\d{8}.*$'''
If the line you are searching for is embedded in a long string, but there is a newline character before that line, you can use the previous pattern with the multiline flag (re.MULTILINE).
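As a quick check of the three patterns above, here is a minimal sketch that runs each one against line1 and line2 from the question. The `patterns` dict and the other variable names are only for illustration and are not part of the original answer.

```python
import re

line1 = '''Revert "Revert <change://problem/47614601> [tech feature][V3 Driver] Last data path activity timestamp update is required for feature""'''
line2 = '''Revert <change://problem/47614601> [tech feature][V3 Driver] Last data path activity timestamp update is required for feature"'''

# The three patterns from the answer, keyed by hypothetical labels.
patterns = {
    "lookbehind": r'''(?<![Rr]evert ")[Rr]evert\s<change:[/][/]problem[/]\d{8}.*"''',
    "lookahead": r'''[Rr]evert\s<change:[/][/]problem[/]\d{8}.*?\w"(?!")''',
    "anchored": r'''^[Rr]evert\s<change:[/][/]problem[/]\d{8}.*$''',
}

# For each pattern: (matches line1?, matches line2?)
results = {
    name: (bool(re.search(p, line1)), bool(re.search(p, line2)))
    for name, p in patterns.items()
}
for name, outcome in results.items():
    print(name, outcome)

# With re.MULTILINE, ^ and $ also match at line boundaries inside a longer string.
both = line1 + "\n" + line2
multiline_hit = re.search(patterns["anchored"], both, re.MULTILINE)
print(multiline_hit.group(0) == line2)
```

Each pattern should reject line1 and accept line2; the anchored one is the simplest to reason about when every record sits on its own line.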

Related

how to parse line start with multiple key words in python?

I am trying to parse an input text file, checking whether each line starts with one of several keywords and processing it incrementally. The input is SQL-like text; when a keyword is found, I want to check the pattern that follows it. I got some help and was able to parse part of it, but when I added more keywords, some lines were not parsed and I got a ValueError instead. I read the re documentation and understand how to check whether a line starts with one of multiple strings, but I still end up with a ValueError. Could anyone take a look and correct me?
Use case:
I expect each line to start with one of these: CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, INSERT INTO. I do this with @flakes' help from a previous post:
import re, os, sys
matcher = re.compile(r"^[(CREATE(?: OR REPLACE)?) | END | (INPUT (?: FROM)) | (INSERT(?:INTO))] (\S+) (\S+).*$")
with open('input.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")
But nothing is printed. I checked this pattern at https://regex101.com, and I end up with this ValueError:
ValueError: not enough values to unpack (expected 3, got 2)
I used this input file for parsing using re.
Objective
Basically, I need to check whether each line starts with CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, or INSERT INTO. If it does, I want to print the next word and check the pattern of the word after that. Can anyone point out what is wrong with the pattern I composed?
If you always want to return 3 groups, you can make the last group capture optional non-whitespace characters, as that value is not always present in the example data.
You should use a single capture group for all the variants at the start of the string, and put the optional parts in a non-capturing group.
Note that you should not put the intended groupings in square brackets, as that changes the meaning to a character class matching any one of the characters listed in it.
The updated pattern:
^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)
See the capture groups in the regex demo
import re
matcher = re.compile(r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)")
with open('input.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")
Output
component='CREATE',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi'
component='CREATE',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE OR REPLACE',component_type='SOURCE', component_name='comp_src_vat4_sac_rec_cln1_tst_noi'
component='END',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='STREAM', component_name='vat4_sac_rec_cln1_tst_noi_text_clean_stream'
component='CREATE',component_type='STREAM', component_name='CQ_STREAM_vat4_sac_rec_cln1_tst_noi'
component='CREATE OR REPLACE',component_type='TARGET', component_name='comp_tgt_vat4_sac_rec_cln1_tst_noi'
component='INPUT FROM',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream;', component_name=''
component='CREATE OR REPLACE',component_type='CQ', component_name='CQ_vat4_sac_rec_cln1_tst_noi'
component='INSERT INTO',component_type='CQ_STREAM_vat4_sac_rec_cln1_tst_noi', component_name=''
component='CREATE OR REPLACE',component_type='OPEN', component_name='PROCESSOR'
component='INSERT INTO',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream', component_name=''
component='END',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='END',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi;'
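To see why the original pattern raised the ValueError, it can help to compare the number of capture groups in both patterns. The sketch below does that and then runs the updated pattern on a few made-up lines; the `samples` list is hypothetical and only modelled on the output above, not taken from the asker's input file.

```python
import re

# Pattern from the question: the square brackets form a single character class,
# so only the two (\S+) groups are real capture groups.
broken = re.compile(r"^[(CREATE(?: OR REPLACE)?) | END | (INPUT (?: FROM)) | (INSERT(?:INTO))] (\S+) (\S+).*$")

# Updated pattern from the answer: one keyword group plus two value groups.
fixed = re.compile(r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)")

print(broken.groups)  # capture groups in the question's pattern
print(fixed.groups)   # capture groups in the updated pattern

# Hypothetical sample lines, for illustration only.
samples = [
    "CREATE application demo_app",
    "CREATE OR REPLACE SOURCE demo_source",
    "INPUT FROM demo_stream;",
    "END FLOW demo_flow;",
]
parsed = [fixed.match(s).groups() for s in samples]
for row in parsed:
    print(row)
```

Because the third group uses `\S*`, it is always present in `match.groups()` and is simply an empty string when the line has no third word, so unpacking into three names is safe.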

extract a line that matches a string with IP address

I am using Python to find a line that matches a particular pattern containing an IP address.
f = open('file.txt','r')
for line in f:
    if line.find("N2:42.61.0.69"):
        print line
The pattern that I am trying to match is "node number":"IP address", where
"node number" is the letter 'N' followed by a number, such as N23, N456, N98765, etc.
I used the pattern N2:42\.61\.0\.69, but it didn't yield any result.
Most of the examples talk about a regex that matches a general pattern, such as "^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$" for an IP address.
But here I want to match a particular string with IP address.
Thanks for the help!
The function find():
Returns the lowest index in s where the substring sub is found such that sub is wholly contained in s[start:end]. Return -1 on failure. Defaults for start and end and interpretation of negative values is the same as for slices.
So if you want to find lines containing the given string ("N2:42.61.0.69" in your case), the condition should be:
if line.find("N2:42.61.0.69") != -1:
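The reason the bare `if line.find(...)` test misbehaves is that `find()` returns an index, not a boolean. A small sketch with two made-up lines (the sample strings are assumptions, not from the asker's file):

```python
target = "N2:42.61.0.69"

# Hypothetical sample lines, for illustration only.
line_at_start = "N2:42.61.0.69 up"
line_without = "N7:10.0.0.1 up"

idx_hit = line_at_start.find(target)   # index of the match
idx_miss = line_without.find(target)   # -1 when the substring is absent

# Index 0 is falsy and -1 is truthy, so using the raw result as a condition
# gives the opposite of what was intended for these two lines.
print(idx_hit, bool(idx_hit))
print(idx_miss, bool(idx_miss))

# The `in` operator expresses the same test directly.
print(target in line_at_start, target in line_without)
```

Comparing against -1, or using `in`, avoids relying on the truthiness of an index.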
If I understand your question, the value of the node may change, but the value of the IP address is known and fixed. Since the dot has a special meaning in regular expressions, you must place a backslash before each one.
import re
REGEX = re.compile( r"^N\d+:42\.61\.0\.69$" )
f = open('file.txt','r')
for line in f:
    line = line.strip()
    if REGEX.match(line):
        print line
f.close()
Notice the use of a raw string (r"..."); without it, all the backslashes would need to appear twice (\\). Also, the ^ at the beginning and the $ at the end are used to match the whole line. If other text can appear in the lines, you can remove them and use search instead of match. The strip() call removes surrounding whitespace, including end-of-line characters (\n for example).
When used with the following data (contents of file.txt)
N1:42.61.0.69
N2:10.0.0.1
N123456:42.61.0.69
it prints only the first and third lines.
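The same check can be reproduced without a file. This sketch applies the anchored pattern to the three sample lines listed above, held in a list (`data` and `matching` are illustrative names):

```python
import re

REGEX = re.compile(r"^N\d+:42\.61\.0\.69$")

# The three sample lines from the example file.txt contents.
data = ["N1:42.61.0.69\n", "N2:10.0.0.1\n", "N123456:42.61.0.69\n"]

# Strip the newline first so that $ lines up with the end of the address.
matching = [line.strip() for line in data if REGEX.match(line.strip())]
print(matching)
```

Escaping the dots matters: an unescaped `.` would also accept any other character in those positions.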
Are you looking for this:
(?:^|\s*)N\d+:(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(?:$|\s+)
In Python 2.7, to use the pattern in your example, you could:
import re
p = r'(?:^|\s*)N\d+:(?:(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])\.){3}(?:25[0-5]|2[0-4][0-9]|1[0-9][0-9]|[1-9][0-9]|[0-9])(?:$|\s+)'
with open('file.txt', 'r') as f:
    print [line.strip() for line in f.readlines() if re.findall(p, line) ]
If you want to process file.txt a line at a time, the following should work:
import re
with open('file.txt') as f_input:
    for line in f_input:
        match = re.search(r'(.*?(N\d+):42\.61\.0\.69 .*?)', line)
        if match:
            line, node_number = match.groups()
            print "{} found in {}".format(node_number, line)
Given file.txt contains the following:
link L518523: N1:42.61.0.69 N248066
non matching line
link L518533: N2:42.61.0.69 N248066
link L518553: N3:43.61.0.69 N248066
link L518553: N4:42.61.0.69 N248066
You would get the following output:
N1 found in link L518523: N1:42.61.0.69
N2 found in link L518533: N2:42.61.0.69
N4 found in link L518553: N4:42.61.0.69

How to substitute specific patterns in python

I want to replace all occurrences of integers which are greater than 2147483647 and are followed by ^^<int> by the first 3 digits of the numbers. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
My current implementation scans the data line by line. If I find a number greater than 2147483647, I replace it with its first 3 digits. However, I don't know how to check that the next part of the string is ^^<int>.
What I want to do is: for numbers greater than 2147483647 e.g. 25500000000, I want to replace them with the first 3 digits of the number. Since my data is 1 Terabyte in size, a faster solution is much appreciated.
Use the re module to construct a regular expression:
import re
regex = r"""
    (               # Capture in group #1
        "[\w\s]+"   # Three sequences of quoted letters and white space characters
        \s+         # followed by one or more white space characters
        "[\w\s]+"
        \s+
        "[\w\s]+"
        \s+
    )
    "(\d{10,})"     # Match a quoted run of at least 10 digits into group #2
    (\^\^\s+\.\s+)  # Match two circumflex characters, whitespace and a period
                    # into group #3
    (.*)            # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (a compiled pattern's sub method accepts a function as the replacement) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # str.replace needs the digits as text, not the parsed integer
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data, you will probably not want to do this in memory. You'll want to open the file and then iterate over it, replacing the data line by line and writing it back out to another file (there are undoubtedly ways to speed this up, but they will make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
            open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall('\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            if int(found[:-3]) > 2147483647:
                # keep the trailing "^^ that is part of the match
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)
Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should get you started, at the very least.
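Another way to tie the replacement to the ^^<int> suffix is a lookahead combined with a replacement function. This is only a sketch, under the assumption that the numbers always appear as "digits"^^<int> exactly as in the sample data; `INT_MAX`, `PATTERN`, `shorten` and `fix_line` are illustrative names.

```python
import re

INT_MAX = 2147483647

# A quoted run of digits, only when it is immediately followed by ^^<int>.
# The lookahead checks the suffix without consuming it.
PATTERN = re.compile(r'"(\d+)"(?=\^\^<int>)')

def shorten(match):
    """Return the quoted number cut to 3 digits if it exceeds INT_MAX."""
    digits = match.group(1)
    if int(digits) > INT_MAX:
        return '"' + digits[:3] + '"'
    return match.group(0)

def fix_line(line):
    return PATTERN.sub(shorten, line)

sample = [
    '<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".',
    '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .',
    '<basic> "language" "89028899" <html>.',
]
fixed = [fix_line(s) for s in sample]
for s in fixed:
    print(s)
```

For a very large file, `fix_line` could be applied inside a line-by-line read and write loop like the ones shown above, so that only one line is held in memory at a time.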

python substitute a substring with one character less

I need to process lines having a syntax similar to markdown http://daringfireball.net/projects/markdown/syntax, where header lines in my case are something like:
=== a sample header ===
===== a deeper header =====
and I need to change their depth, i.e. reduce it (or increase it) so:
== a sample header ==
==== a deeper header ====
My limited knowledge of Python regexes is not enough to understand how to replace a number n of '=' signs with (n-1) '=' signs.
You could use backreferences and two negative lookarounds to find two corresponding sets of = characters.
output = re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', input)
That will also work if you have a longer string that contains multiple headers (and will change all of them).
What does the regex do?
(?<!=) # make sure there is no preceding =
= # match a literal =
( # start capturing group 1
=+ # match one or more =
) # end capturing group 1
( # start capturing group 2
.*? # match zero or more characters, but as few as possible (due to ?)
) # end capturing group 2
= # match a =
\1 # match exactly what was matched with group 1 (i.e. the same amount of =)
(?!=) # make sure there is no trailing =
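A short sketch of the substitution applied to the two headers from the question, joined into one string to show that several headers are handled in a single call (`HEADER`, `shallower` and `text` are illustrative names):

```python
import re

# Same pattern as in the answer: one '=' is matched outside group 1 on each
# side, so the replacement built from the groups has one '=' fewer per side.
HEADER = re.compile(r'(?<!=)=(=+)(.*?)=\1(?!=)')

def shallower(text):
    return HEADER.sub(r'\1\2\1', text)

text = "=== a sample header ===\n===== a deeper header ====="
result = shallower(text)
print(result)
```

Since `.` does not match a newline by default, each header is matched on its own line.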
No need for regexes. I would go very simple and direct:
import sys
for line in sys.stdin:
    trimmed = line.strip()
    if len(trimmed) >= 2 and trimmed[0] == '=' and trimmed[-1] == '=':
        print(trimmed[1:-1])
    else:
        print(line.rstrip())
The initial strip is useful because in Markdown people sometimes leave blank spaces at the end of a line (and maybe the beginning). Adjust accordingly to meet your requirements.
Here is a live demo.
I think it can be as simple as replacing '=(=+)' with \1 .
Is there any reason for not doing so?
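One possible reason, sketched below: the simple replacement shortens every run of two or more '=' characters, including runs inside the header text, whereas the backreference version only touches the matching runs at both ends. The header containing "==" is a made-up example.

```python
import re

def simple(text):
    # Replace every run of two or more '=' with a run one character shorter.
    return re.sub(r'=(=+)', r'\1', text)

def paired(text):
    # Only shorten a leading run and the equally long trailing run.
    return re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', text)

plain = "=== a sample header ==="
tricky = "=== a == b ==="  # hypothetical header that contains '==' in its text

print(simple(plain), "|", paired(plain))
print(simple(tricky), "|", paired(tricky))
```

For headers whose text never contains consecutive '=' characters, both approaches give the same result.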
How about a simple solution?
lines = ['=== a sample header ===', '===== a deeper header =====']
new_lines = []
for line in lines:
    if line.startswith('==') and line.endswith('=='):
        new_lines.append(line[1:-1])
Results:
['== a sample header ==', '==== a deeper header ====']
Or in one line:
new_lines = [line[1:-1] for line in lines if line.startswith('==') and line.endswith('==')]
The logic here is that if a line starts and ends with '==', it has at least two '=' on each side, so after trimming one character from each side we are left with at least one '=' on each side.
This will work as long as each line starts and ends with its run of '=' characters, which headers will do once you strip the newlines off.
For either the first header or the second header, you can just use string replace like this (strings are immutable, so the result has to be assigned back):
s = "=== a sample header ==="
s = s.replace("= ", " ")
s = s.replace(" =", " ")
You can also deal with the second header like this.
By the way, you can also use the sub function of the re module, but it's not necessary.

python regular expression to match strings

I want to parse a string, such as:
package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'
uses-permission:'android.permission.WRITE_APN_SETTINGS'
uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'
uses-permission:'android.permission.ACCESS_NETWORK_STATE'
I want to get:
string1: jp.tjkapp.droid1lwp
string2: 1.1
Because there are multiple uses-permission lines, I want to get the permissions as a list containing:
WRITE_APN_SETTINGS, RECEIVE_BOOT_COMPLETED and ACCESS_NETWORK_STATE.
Could you help me write the python regular expression to get the strings I want?
Thanks.
Assuming the code block you provided is one long string, here stored in a variable called input_string:
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
versionName = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r'(?<=android\.permission\.)[A-Z_]+(?=\')', input_string)
Explanation:
name
(?<=name\=\'): a lookbehind that checks the text just before the main string, in order to return only strings that are preceded by name='. The \ in front of = and ' serve to escape them so that the regex knows we're talking about the literal characters and not a regex command. name=' is not returned as part of the result; we just know that the results we get are all preceded by it.
[\w\.]+?: This is the main string we're searching for. \w means any alphanumeric character or underscore. \. is an escaped period, so the regex knows we mean . and not the regex command represented by an unescaped period. Putting these in [] means we're okay with anything we've put in the brackets, so we're saying that we'll accept any alphanumeric character, _, or a period. The + afterwards means at least one of the previous thing, meaning at least one (but possibly more) of [\w\.]. Finally, the ? means don't be greedy: we're telling the regex to get the smallest possible group that meets these specifications, since + could otherwise go on for an unlimited number of repeats of anything matched by [\w\.].
(?=\'): a lookahead that checks the text just after the main string, in order to return only strings that are followed by '. The \ is also an escape, since otherwise regex or Python's string parsing might misinterpret '. This final ' is not returned with our results; we just know that in the original string, it followed any result we do end up getting.
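Putting the three expressions together on the sample input from the question gives a quick, runnable check. In this sketch `input_string` is assumed to hold the four lines joined with newlines, and the result variables are named for illustration.

```python
import re

input_string = (
    "package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'\n"
    "uses-permission:'android.permission.WRITE_APN_SETTINGS'\n"
    "uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'\n"
    "uses-permission:'android.permission.ACCESS_NETWORK_STATE'\n"
)

# Lookbehind and lookahead pin down the context without including it in the result.
name = re.search(r"(?<=name\=\')[\w\.]+?(?=\')", input_string).group(0)
version_name = re.search(r"(?<=versionName\=\')\d+?\.\d+?(?=\')", input_string).group(0)
permissions = re.findall(r"(?<=android\.permission\.)[A-Z_]+(?=\')", input_string)

print(name)
print(version_name)
print(permissions)
```

The lookbehind for name=' is case-sensitive, so it does not match inside versionName='.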
You can do this without regex by reading the file content line by line.
>>> def split_string(s):
...     if s.startswith('package'):
...         return [i.split('=')[1] for i in s.split() if "=" in i]
...     elif s.startswith('uses-permission'):
...         return s.split('.')[-1]
...
>>> split_string("package: name='jp.tjkapp.droid1lwp' versionCode='2' versionName='1.1'")
["'jp.tjkapp.droid1lwp'", "'2'", "'1.1'"]
>>> split_string("uses-permission:'android.permission.WRITE_APN_SETTINGS'")
"WRITE_APN_SETTINGS'"
>>> split_string("uses-permission:'android.permission.RECEIVE_BOOT_COMPLETED'")
"RECEIVE_BOOT_COMPLETED'"
>>> split_string("uses-permission:'android.permission.ACCESS_NETWORK_STATE'")
"ACCESS_NETWORK_STATE'"
>>>
Here is some example code:
#!/usr/bin/env python
inputFile = open("test.txt", "r").readlines()
for line in inputFile:
    if line.startswith("package"):
        words = line.split()
        string1 = words[1].split("=")[1].replace("'","")
        string2 = words[3].split("=")[1].replace("'","")
The test.txt file contains the input data you mentioned earlier.
