How to get a capture group that doesn't always exist? - python

I have a regex something like
(\d\d\d)(\d\d\d)(\.\d\d){0,1}
When it matches I can easily get the first two groups, but how do I check whether the third occurred 0 or 1 times?
Another minor question: in (\.\d\d) I only care about the \d\d part. Is there a way to tell the regex that \.\d\d needs to appear 0 or 1 times, but that I want to capture only the \d\d part?
This is based on the problem of parsing an hhmmss string that has an optional decimal part for the seconds (so it becomes hhmmss.ss). I put \d\d\d in the question so that it is clear which \d\d I'm talking about.

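import re

value = "123456.33"
# (?:...) groups without capturing, so only the inner (\d\d) becomes group 3.
regex = re.search(r"^(\d\d\d)(\d\d\d)(?:\.(\d\d)){0,1}$", value)
if regex:
    print(regex.group(1))
    print(regex.group(2))
    # An optional group that did not take part in the match is None.
    if regex.group(3) is not None:
        print(regex.group(3))
    else:
        print("3rd group not found")
else:
    print("value doesn't match regex")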
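A minimal Python 3 sketch of the same idea applied to the hhmmss case mentioned in the question. The function name parse_hhmmss and the two-digit field widths are assumptions made for illustration, and ? is used as the shorthand for {0,1}.

```python
import re

# Hypothetical pattern: hh, mm, ss plus an optional two-digit fraction.
HHMMSS = re.compile(r"^(\d\d)(\d\d)(\d\d)(?:\.(\d\d))?$")

def parse_hhmmss(value):
    """Return (hh, mm, ss, fraction) as strings; fraction is None when absent."""
    match = HHMMSS.match(value)
    if match is None:
        return None
    # groups() reports None for an optional group that did not participate.
    return match.groups()

print(parse_hhmmss("123456.33"))
print(parse_hhmmss("123456"))
print(parse_hhmmss("12345"))
```

Checking the last element against None is then enough to tell whether the fractional part was present.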

Related

Replace a substring with defined region and follow up variable region in Python

I have a seemingly simple problem that for the life of me is just outside my reach of understanding. What I mean by that is that I can come up with many complex ways to attempt this, but there must be an easy way.
What I am trying to do is find and replace a substring in a string, but the catch is that it is based on a mix of a defined region and then variable regions based on length.
Here is an example:
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC' and I want to replace AATCGATCGTA with <span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>
So in this example the first part will always be constant AATCGA and will be used to locate the region to replace. This is then followed by a "spacer", in this case a single character but could be more than one and needs to be specified, and finally the last bit that will follow the "tail", in this case four characters, but could also be more or less. A set-up in this case would be:
to_find = 'AATCGA'
spacer = 'T' #Variable based on number and not on the character
tail = 'CGTA' #Variable based on number and not on the character
With this information I need to do something like:
new_seq = sequence.replace(f'{to_find}{len(spacer)}{len(tail)}', f'<span color="blue">{to_find}</span><span>{spacer}</span><span color="green">{tail}</span>')
print(new_seq)
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
But the spacer could be 3 characters from the end of to_find and it may vary, the same with the tail section. Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start.
Any help would be much appreciated!
I'm not quite sure I understand you fully. Nevertheless, you don't seem to be too far off. Just use regex.
import re
sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
expected_new_seq = '<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC'
to_find = 'AATCGA'
spacer = 'T' # Variable based on number and not on the character
tail = 'CGTA' # Variable based on number and not on the character
# In this case, the pattern is (AATCGA)(.{1})(.{4})
# It matches "AATCGA" that is followed by 1 character and then 4 characters.
# AATCGA is captured in group 1, then the next unknown character is captured
# in group 2, and the next 4 unknown characters are captured in group 3
# (the brackets create capturing groups).
pattern = f'({to_find})(.{{{len(spacer)}}})(.{{{len(tail)}}})'
# \1 refers to capture group 1 (to_find), \2 refers to capture group 2 (spacer),
# and \3 refers to capture group 3 (tail).
# This no longer needs to be an f-string. But making it a raw string means we
# don't need to escape the backslashes
repl = r'<span color="blue">\1</span><span>\2</span><span color="green">\3</span>'
new_seq = re.sub(pattern, repl, sequence)
print(new_seq)
print(new_seq == expected_new_seq)
Output:
<span color="blue">AATCGA</span><span>T</span><span color="green">CGTA</span>TATCTGCGTAGACTCTGTGCATGC
True
Have a play around with it here (also includes interactive explanation): https://regex101.com/r/2mshrI/1
You also wrote: "Also, it could be in reverse where the to_find is on the right hand side and then the tail is in the start."
How do you know when to replace in reverse instead of forward? After all, all you're doing is matching a short string followed or preceded by n characters. I imagine you'd get matches in both directions, so which replacement do you carry out? Please provide more examples: longer input with expected output.
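One way to package the forward-direction replacement from this answer as a reusable function. This is only a sketch: the name highlight and its parameters are made up for illustration, and re.escape is added on the assumption that to_find might one day contain regex metacharacters.

```python
import re

def highlight(sequence, to_find, spacer_len, tail_len):
    """Wrap to_find, the next spacer_len characters and the following
    tail_len characters in <span> tags (hypothetical helper)."""
    # Builds e.g. (AATCGA)(.{1})(.{4}); doubled braces are literal braces.
    pattern = re.compile(
        "({})(.{{{}}})(.{{{}}})".format(re.escape(to_find), spacer_len, tail_len)
    )
    repl = (r'<span color="blue">\1</span>'
            r'<span>\2</span>'
            r'<span color="green">\3</span>')
    return pattern.sub(repl, sequence)

sequence = 'AATCGATCGTATATCTGCGTAGACTCTGTGCATGC'
print(highlight(sequence, 'AATCGA', 1, 4))
```

Passing the lengths as numbers rather than sample strings reflects the question's remark that the spacer and tail are defined by their size and not by their characters.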

best way to find substring using regex in python 3

I am trying to find the best way to extract a specific substring as a key-value pair using re, for strings of the following form:
some_string-variable_length/some_no_variable_digit/some_no1_variable_digit/some_string1/some_string2
e.g. aba/101/11111/cde/xyz or aaa/111/1119/cde/xzx or ada/21111/5/cxe/yyz
Everything here is variable, and what I am looking for is key-value pairs like the following:
cde: 2, as there are two entries for cde
cxe: 1, as there is only one cxe
Note: everything is variable here except /, i.e. cde or cxe or some other string will always appear directly after the third / in each case
input: aba/101/11111/cde/xyz/blabla
output: cde:xyz/blabla
input: aaa/111/1119/cde/xzx/blabla
output: cde:xzx/blabla
input: aahjdsga/11231/1119/gfts/sjhgdshg/blabla
output: gfts:sjhgdshg/blabla
If you notice here, my key is always the first string after 3rd / and value is always the substring after key
Here are a couple of solutions based on your description that "key is always the first string after 3rd / and value is always the substring after key". The first uses str.split with a maxsplit of 4 to collect everything after the fourth / into the value. The second uses regex to extract the two parts:
inp = ['aba/101/11111/cde/xyz/blabla',
       'aaa/111/1119/cde/xzx/blabla',
       'aahjdsga/11231/1119/gfts/sjhgdshg/blabla'
       ]

# Solution 1: str.split with maxsplit=4
for s in inp:
    parts = s.split('/', 4)
    key = parts[3]
    value = parts[4]
    print(f'{key}:{value}')

# Solution 2: regex
import re
for s in inp:
    m = re.match(r'^(?:[^/]*/){3}([^/]*)/(.*)$', s)
    if m is not None:
        key = m.group(1)
        value = m.group(2)
        print(f'{key}:{value}')
For both pieces of code the output is
cde:xyz/blabla
cde:xzx/blabla
gfts:sjhgdshg/blabla
Others have already posted various regexes. A broader question: is this problem best solved using a regex? Depending on how the data is formatted overall, it may be better parsed using:
the .split('/') method on the string; or
csv.reader(..., delimiter='/') or csv.DictReader(..., delimiter='/') in the csv module.
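A rough sketch of the csv.reader option, assuming the records arrive one per line. The in-memory io.StringIO buffer stands in for a real file, and the length check is an added assumption about how to treat short lines.

```python
import csv
import io

# Sample lines from the question, held in memory instead of a file.
data = io.StringIO(
    "aba/101/11111/cde/xyz/blabla\n"
    "aaa/111/1119/cde/xzx/blabla\n"
    "aahjdsga/11231/1119/gfts/sjhgdshg/blabla\n"
)

pairs = []
for row in csv.reader(data, delimiter='/'):
    if len(row) < 5:
        continue  # skip lines without both a key and a value
    key = row[3]
    value = '/'.join(row[4:])  # re-join everything after the key
    pairs.append((key, value))

for key, value in pairs:
    print(f'{key}:{value}')
```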
Try (?<!\S)[^\s/]*(?:/[^\s/]*){2}/([^\s/]*)
Updated per the comment, to also capture the value:
(?<!\S)[^\s/]*(?:/[^\s/]*){2}/([^\s/]*)(?:/(\S*))?
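A short sketch of how the second pattern might be used to produce the per-key counts asked for at the start of the question. It assumes the entries are separated by whitespace inside one string; the variable names are illustrative.

```python
import re
from collections import Counter

text = "aba/101/11111/cde/xyz aaa/111/1119/cde/xzx ada/21111/5/cxe/yyz"
pattern = re.compile(r'(?<!\S)[^\s/]*(?:/[^\s/]*){2}/([^\s/]*)(?:/(\S*))?')

# findall returns one (key, value) tuple per match because of the two groups.
matches = pattern.findall(text)
counts = Counter(key for key, _ in matches)

print(matches)
print(dict(counts))
```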

Using a single regex to search for 2 criteria in python

I have a function
import re

def extract_pid(log_line):
    regex = PROBLEM  # placeholder: this is the pattern being asked about
    result = re.search(regex, log_line)
    if result is None:
        return None
    return "{} ({})".format(result[1], result[2])
The intended outcome of this function is to return the PID between [ ] together with the corresponding uppercase text. For example:
logline = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
From the above string I'd expect my function to return 12345 (ERROR).
I have two criteria to meet in this function, \[\d+\] and [A-Z]{2,}. If I test each regex individually I get the expected outcome.
My question is: how do I specify both regexes in the same pattern and output them as shown in the function above? I cannot find a simple "use this for AND" anywhere in the documentation. I found "| for OR", but I need both criteria to be matched.
I understand this could be done in two functions and joined together, but I've been tasked with doing this in a single function.
Use two capturing groups:
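regex = r'\[(\d+)\].*?([A-Z]{2,})'
The .*? means "match any characters in between, as few as possible". The brackets sit outside the first group so that only the digits are captured, which gives 12345 rather than [12345]. If you can, replace .*? with ": " (a colon followed by a space), assuming that's what's always there.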
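Putting it together, a minimal sketch of the complete function from the question. The brackets are kept outside the first group here, on the assumption that the expected output should contain only the digits.

```python
import re

def extract_pid(log_line):
    # Group 1: digits inside [ ]; group 2: first run of 2+ uppercase letters after it.
    regex = r'\[(\d+)\].*?([A-Z]{2,})'
    result = re.search(regex, log_line)
    if result is None:
        return None
    return "{} ({})".format(result[1], result[2])

logline = "July 31 07:51:48 mycomputer bad_process[12345]: ERROR Performing package upgrade"
print(extract_pid(logline))  # 12345 (ERROR)
```

The order of the two parts in the pattern is what expresses the "AND": both must be present, in that sequence, for the search to succeed.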

New to regex -- unexpected results in for loop

I'm not sure if this is a problem in my understanding of regex modules, or a silly mistake I'm making in my for loop.
I have a list of numbers that look like this:
4; 94
3; 92
1; 53
etc.
I made a regex pattern to match just the last two digits of the string:
'^.*\s([0-9]+)$'
This works when I take each element of the list 1 at a time.
However when I try and make a for loop
for i in xData:
    if re.findall(r'^.*\s([0-9]+)$', i):
        print(i)
The output is simply the entire string instead of just the last two digits.
I'm sure I'm missing something very simple here but if someone could point me in the right direction that would be great. Thanks.
You are printing the whole string, i. If you wanted to print the output of re.findall(), then store the result and print that result:
for i in xData:
    results = re.findall(r'^.*\s([0-9]+)$', i)
    if results:
        print(results)
I don't think that re.findall() is the right method here, since your lines contain just the one set of digits. Use re.search() to get a match object, and if the match object is not None, take the first group data:
for i in xData:
    match = re.search(r'^.*\s([0-9]+)$', i)
    if match:
        print(match.group(1))
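For reference, a self-contained Python 3 sketch of the re.search approach. The contents of xData are taken from the sample values in the question, and collecting the results in a list is an illustrative choice.

```python
import re

xData = ["4; 94", "3; 92", "1; 53"]  # sample values from the question
pattern = re.compile(r'^.*\s([0-9]+)$')

last_numbers = []
for item in xData:
    match = pattern.search(item)
    if match:
        # group(1) is only the captured digits, not the whole line.
        last_numbers.append(match.group(1))

print(last_numbers)  # ['94', '92', '53']
```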
I might be missing something here, but if all you're looking to do is get the last 2 characters, could you use the below?
for i in xData:
    print(i[-2:])

regular expressions to extract phone numbers

I am new to regular expressions and I am trying to write a pattern for phone numbers, in order to identify them and be able to extract them. My question can be summarized by the following simple example:
First I try to identify whether the string contains something like (+34), which should be optional:
prefixsrch = re.compile(r'(\(?\+34\)?)?')
which I test on the following string in the following way:
line0 = "(+34)"
print(prefixsrch.findall(line0))
which yields the result:
['(+34)','']
My first question is: why does it find two occurrences of the pattern? I guess that this is related to the fact that the prefix is optional, but I do not completely understand it. Anyway, now for my main question.
If we do a similar thing, searching for a pattern of 9 digits:
numsrch = re.compile(r'\d{9}')
line1 = "971756754"
print(numsrch.findall(line1))
yields something like:
['971756754']
which is fine. Now what I want to do is identify a 9-digit number, optionally preceded by (+34). So to my understanding I should do something like:
phonesrch = re.compile(r'(\(?\+34\)?)?\d{9}')
If I test it in the following strings...
line0 = "(+34)971756754"
line1 = "971756754"
print(phonesrch.findall(line0))
print(phonesrch.findall(line1))
this is, to my surprise, what I get:
['(+34)']
['']
What I was expecting to get is ['(+34)971756754'] and ['971756754']. Does anybody have any insight into this? Thank you very much in advance.
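Your capturing group is in the wrong place. When a pattern contains a capturing group, re.findall returns the text of that group rather than the whole match. Make the country code a non-capturing group and put the entire expression in the capturing group: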
>>> line0 = "(+34)971756754"
>>> line1 = "971756754"
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line0)
['(+34)971756754']
>>> re.findall(r'((?:\(?\+34\)?)?\d{9})',line1)
['971756754']
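As for your first question (why does it find two occurrences of the pattern?):
This is because the trailing ? makes the whole group optional (0 or 1 repetitions), so an empty string is also a valid match. After matching (+34), findall finds one more, empty, match at the end of the string.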
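A small sketch contrasting the two patterns, written for Python 3. The variable names are illustrative; the finditer variant at the end is an alternative that is not part of the answer above, shown because group(0) always holds the whole match regardless of grouping.

```python
import re

# One capturing group: findall returns only that group's text.
with_group = re.compile(r'(\(?\+34\)?)?\d{9}')
# Non-capturing prefix: findall returns the whole match.
without_group = re.compile(r'(?:\(?\+34\)?)?\d{9}')

for line in ("(+34)971756754", "971756754"):
    print(with_group.findall(line), without_group.findall(line))

# finditer sidesteps the issue: group(0) is always the whole match.
whole = [m.group(0) for m in with_group.finditer("(+34)971756754")]
print(whole)
```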
