Given this code (Python 3.6):
>>> import re
>>> a = re.search(r'\(.+?\)$', '(canary) (wharf)')
>>> a
<_sre.SRE_Match object; span=(0, 16), match='(canary) (wharf)'>
>>>
Why doesn't re stop searching at the first parenthesis closure?
The expected output is None. The search should detect that there is not an end of line after (canary), but it doesn't.
Edit: If there is only ONE word between parens, it should match; if there is more than one, it shouldn't match at all.
Any help would be hugely appreciated.
The lazy flag isn't being ignored.
You get a match on the entire string because .+? means: match any character one or more times, as few times as possible, but expand as needed until the rest of the pattern matches. If the regex were \([^)]+?\)$, it would have matched only the last (wharf), because the negated class [^)] cannot match a ).
Or, if the regex were \(.+?\), it would have matched (canary) and (wharf) separately, which shows that it is being lazy.
\(.+?\)$ matches everything because you make it match everything until the end of the line.
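A quick way to see the difference is to run the three patterns from this answer against the same string. This is only a minimal sketch; the variable names are made up for illustration.

```python
import re

text = '(canary) (wharf)'

# Lazy .+? must still let the trailing \)$ succeed, so it expands to the end.
lazy_anchored = re.search(r'\(.+?\)$', text).group()

# [^)] cannot cross a closing paren, so only the last group can reach $.
negated_anchored = re.search(r'\([^)]+?\)$', text).group()

# Without $, the lazy quantifier stops at the first closing paren each time.
lazy_unanchored = re.findall(r'\(.+?\)', text)

print(lazy_anchored)     # (canary) (wharf)
print(negated_anchored)  # (wharf)
print(lazy_unanchored)   # ['(canary)', '(wharf)']
```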
If you want to ensure that there is only one group in parentheses in the entire string, we can do that with our "no-parentheses-regex" from above and force the start of the string to match the start of your regex.
^\([^)]+?\)$
Try it: https://regex101.com/r/Ts9JeF/1
Explanation:
^\(: Match a literal ( at the start of the string
[^)]+?: Match anything but ), as many times as needed
\)$: Match a literal ) at the end of the line.
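As a rough check, the anchored pattern can be tried on both kinds of input. The name single_group is only illustrative.

```python
import re

# Anchored at both ends, with a body that cannot contain ")".
single_group = re.compile(r'^\([^)]+?\)$')

one = single_group.search('(canary)')
two = single_group.search('(canary) (wharf)')

print(one)  # a match object covering the whole string
print(two)  # None
```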
Or, if you want to allow other words before and after the one in parentheses, but nothing in parentheses, do this:
^[^()]*?\([^)]+?\)[^()]*$
Try it: https://regex101.com/r/Ts9JeF/3
Explanation:
^[^()]*?: At the start of the string, match anything but parentheses zero or more times.
\([^)]+?\): Very similar to our previous regex
[^()]*$: Match zero or more non-parentheses characters until the end of the string.
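A small sketch of how this second pattern behaves; the sample strings are invented for illustration.

```python
import re

# Exactly one parenthesised group, with optional paren-free text around it.
one_group_anywhere = re.compile(r'^[^()]*?\([^)]+?\)[^()]*$')

samples = ['(canary)', 'at (canary) now', '(canary) (wharf)', 'no parens']
results = {s: bool(one_group_anywhere.search(s)) for s in samples}

for sample, matched in results.items():
    print(repr(sample), matched)
```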
The non-greedy qualifier makes it match the shortest repeat; in this case the shortest successful repeat is the entire string. It doesn't "not match the )" because you didn't tell it to do so.
You can think of the engine doing something like this (using the simplified string '(a) (b)'):
start at position 0
'(' matches (, proceed to position 1
'a' matches ., proceed to position 2
(non-greedy) ')' matches ), proceed to position 3
(non-greedy) end of string does not match $ => backtrack to position 2
')' matches . proceed to position 3
(non-greedy) ' ' does not match )
' ' matches . proceed to position 4
(non-greedy) '(' does not match )
'(' matches . proceed to position 5
(non-greedy) 'b' does not match )
'b' matches . proceed to position 6
(non-greedy) ')' matches )
(non-greedy) $ matches end of string => DONE!
Try this regex on for size:
r'\([^)]+\)$'
Here a left paren is matched, followed by a nonzero number of non-right-paren characters, followed by a right paren and the end of the string.
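A short sketch of what that pattern returns on the original input. Note that it still matches when the string contains several groups (it simply returns the last one), so it may or may not be what the edited question asks for.

```python
import re

last_group = re.compile(r'\([^)]+\)$')

m = last_group.search('(canary) (wharf)')
matched_text = m.group()
matched_span = m.span()

print(matched_text, matched_span)  # (wharf) (9, 16)
```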
I am trying to extract everything up to and including the first 5 characters/digits after the last hyphen.
Here is the example
String -- X008-TGa19-ER751QF7
Output -- X008-TGa19-ER751
String -- X002-KF13-ER782cPU80
Output -- X002-KF13-ER782
My attempt -- I could manage to take element from the last -- (\w+)[^-.]*$
But now, how do I take the first 5 and then return the entire value as the output, as shown in the example?
You can optionally repeat a - and 1+ word chars from the start of the string. Then match the last - and match 5 word chars.
^\w+(?:-\w+)*-\w{5}
^ Start of string
\w+ Match 1+ word chars
(?:-\w+)* Optionally repeat - and 1+ word chars
-\w{5} Match - and 5 word chars
import re
regex = r"^\w+(?:-\w+)*-\w{5}"
s = ("X008-TGa19-ER751QF7\n"
"X002-KF13-ER782cPU80")
print(re.findall(regex, s, re.MULTILINE))
Output
['X008-TGa19-ER751', 'X002-KF13-ER782']
Note that \w can also match _.
If there can also be other characters in the string, then to get the first 5 letters or digits (excluding _) after the last hyphen, you can match word characters without the underscore using a negated character class, [^\W_].
Repeat that 5 times while asserting that there are no more hyphens to the right.
^.*-[^\W_]{5}(?=[^-]*$)
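A minimal sketch of that pattern in use. The third sample string (with _ and . in the earlier segments) is invented here to show that other characters are tolerated before the last hyphen.

```python
import re

# Greedy .* backtracks to the last hyphen; the lookahead rejects any
# position that still has a hyphen somewhere to its right.
pattern = re.compile(r'^.*-[^\W_]{5}(?=[^-]*$)')

samples = [
    'X008-TGa19-ER751QF7',
    'X002-KF13-ER782cPU80',
    'X_01-K.F13-ER782cPU80',  # hypothetical input, not from the question
]
results = [pattern.match(s).group() for s in samples]

print(results)
# ['X008-TGa19-ER751', 'X002-KF13-ER782', 'X_01-K.F13-ER782']
```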
(\w+-\w+-\w{5}) seems to capture what you're asking for.
Example:
https://regex101.com/r/PcPSim/1
If you are open for non-regex solution, you can use this which is based on splitting, slicing and joining the strings:
>>> my_str = "X008-TGa19-ER751QF7"
>>> '-'.join(s[:5] for s in my_str.split('-'))
'X008-TGa19-ER751'
Here I am splitting the string based on hyphen -, slicing the string to get at max five chars per sub-string, and joining it back using str.join() to get the string in your desired format.
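One thing to be aware of is that the join-based version truncates every segment, not just the last one. If only the part after the last hyphen should be cut, a variation using str.rpartition() might look like the sketch below. The helper names and the 'LONGPREFIX-...' input are hypothetical.

```python
def truncate_all(value, width=5):
    # Approach from the answer: every hyphen-separated part is cut to `width`.
    return '-'.join(part[:width] for part in value.split('-'))


def truncate_last(value, width=5):
    # Variation: split once at the last hyphen and cut only the tail.
    head, sep, tail = value.rpartition('-')
    return head + sep + tail[:width]


print(truncate_all('X008-TGa19-ER751QF7'))         # X008-TGa19-ER751
print(truncate_last('X008-TGa19-ER751QF7'))        # X008-TGa19-ER751
print(truncate_all('LONGPREFIX-TGa19-ER751QF7'))   # LONGP-TGa19-ER751
print(truncate_last('LONGPREFIX-TGa19-ER751QF7'))  # LONGPREFIX-TGa19-ER751
```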
^(.*-[^-]{5})[^-]*$
Capture group 1 is what you need
https://regex101.com/r/SYz9i5/1
Explanation
^(.*-[^-]{5})[^-]*$
^ Start of line
( Capture group 1 start
.* Any number of any character
- hyphen
[^-]{5} 5 non-hyphen characters
) Capture group 1 end
[^-]* Any number of non-hyphen characters
$ End of line
Another simpler one is
^(.*-.{5}).*$
This should be quite straight-forward.
This makes use of the greedy behaviour of the first .*, which will try to match as much as possible, so the - will be the last one that has at least 5 characters following it.
https://regex101.com/r/CFqgeF/1/
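Both patterns from this answer can be checked against the two sample strings from the question. This is just a sketch; strict and simple are illustrative names.

```python
import re

strict = re.compile(r'^(.*-[^-]{5})[^-]*$')
simple = re.compile(r'^(.*-.{5}).*$')

samples = ['X008-TGa19-ER751QF7', 'X002-KF13-ER782cPU80']

# Capture group 1 holds the wanted prefix in both patterns.
strict_results = [strict.match(s).group(1) for s in samples]
simple_results = [simple.match(s).group(1) for s in samples]

print(strict_results)  # ['X008-TGa19-ER751', 'X002-KF13-ER782']
print(simple_results)  # ['X008-TGa19-ER751', 'X002-KF13-ER782']
```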
I've got a string like this: s = "Hello this is Helloworld #helloworld #hiworld #nihaoworld ". The idea is to catch all the hashtags; however, each hashtag needs to have a boundary around it, e.g. the hashtag in something like "Hello this is helloworld#helloworld" shouldn't be captured.
I want to generate the following result: ["#helloworld","#hiworld","#nihaoworld"]
I've got the following python code
import re
print(re.findall(r'(?:^|\s+)(#[a-z]{1,})(?:\s+|$)', s))
The result I got is ["#helloworld","#nihaoworld"], with the middle word missing.
I don't think you really need a regular expression for this, you can just use:
s.strip().split()
However, if you do want to use a regex, you could just use (?:^|\s)(#\w+):
>>> import re
>>> s = " #helloworld #hiworld #nihaoworld "
>>> re.findall(r'(?:^|\s)(#\w+)', s)
['#helloworld', '#hiworld', '#nihaoworld']
Explanation
Non-capturing group (?:^|\s)
1st Alternative ^
^ asserts position at start of the string
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])
1st Capturing Group (#\w+)
# matches the character # literally (case sensitive)
\w+ matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
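The middle hashtag goes missing in the question's pattern because the trailing (?:\s+|$) consumes the space that the next match would need as its leading boundary. One way around that, sketched here as an assumption rather than as part of the answer above, is to use zero-width lookarounds, which check the boundary on both sides without consuming it.

```python
import re

s = "Hello this is Helloworld #helloworld #hiworld #nihaoworld "

# Question's pattern: the trailing \s+ eats the separator before "#hiworld".
consuming = re.findall(r'(?:^|\s+)(#[a-z]{1,})(?:\s+|$)', s)

# Lookarounds: "not preceded by a non-space" and "not followed by a non-space".
lookaround = re.findall(r'(?<!\S)#\w+(?!\S)', s)

# A hashtag glued to the previous word is rejected by the lookbehind.
glued = re.findall(r'(?<!\S)#\w+(?!\S)', "Hello this is helloworld#helloworld")

print(consuming)   # ['#helloworld', '#nihaoworld']
print(lookaround)  # ['#helloworld', '#hiworld', '#nihaoworld']
print(glued)       # []
```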
Probably a regex question (forgive my broken english).
I need to identify a sub string that starts with a certain value.
For example, take the following string:
"Select 1 from user.table1 inner join user.table2..."
I need to extract all the words that start with "user" and end with a blank space. So, after applying this unknown regex to the above string, it would produce the following result:
table1
table2
I tried to use the "re.findall" function, but couldn't find a way to specify the start and end patterns.
So, how can I extract the substrings using a starting pattern?
Try Positive Lookbehind :
import re
pattern=r'(?<=user\.)(\w+)?\s'
string_1="Select 1 from user.table1 inner join user.table2 ..."
match=re.findall(pattern,string_1)
print(match)
output:
['table1', 'table2']
regex information:
(?<=user\.)(\w+)?\s
`Positive Lookbehind` `(?<=user\.)`
Assert that the Regex below matches
user matches the characters user literally (case sensitive)
\. matches the character . literally (case sensitive)
1st Capturing Group (\w+)?
? Quantifier — Matches between zero and one times, as many times as possible, giving back as needed (greedy)
\w+ matches any word character (equal to [a-zA-Z0-9_])
If that pattern doesn't work try this : (?<=user\.)\w+
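The difference between the two patterns shows up on the string exactly as written in the question, where user.table2 is followed by "..." rather than a space. A minimal sketch:

```python
import re

sql = "Select 1 from user.table1 inner join user.table2..."

# Requires a whitespace character right after the name, so table2 is missed.
with_space = re.findall(r'(?<=user\.)(\w+)?\s', sql)

# Lookbehind only: \w+ stops by itself at the first non-word character.
lookbehind_only = re.findall(r'(?<=user\.)\w+', sql)

print(with_space)       # ['table1']
print(lookbehind_only)  # ['table1', 'table2']
```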
You can try it like this:
re.findall(r'\buser\.(..*?)\b',
"Select 1 from user.table1 inner join user.table2...")
This will return:
['table1', 'table2']
I have a list of fasta sequences, each of which look like this:
>>> sequence_list[0]
'gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)
I'd like to be able to extract the gene names from each of the fasta entries in my list, but I'm having difficulty finding the right regular expression. I thought this one would work: "^/(.+/),$". Start with a parenthesis, then any number of any character, then end with a parenthesis followed by a comma. Unfortunately, this returns None:
test = re.search(r"^/(.+/),$", sequence_list[0])
print(test)
Can someone point out the error in this regex?
Without any capturing groups,
>>> import re
>>> str = """
... gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
... ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
... CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)"""
>>> m = re.findall(r'(?<=\().*?(?=\),)', str)
>>> m
['Ndufa10']
It matches the words inside parentheses only when the closing parenthesis is followed by a comma.
Explanation:
(?<=\() In regex, (?<=pattern) is called a lookbehind. It asserts that the text immediately before the current position matches the pattern inside the lookbehind. In our case that pattern is \(, which means a literal (.
.*?(?=\),) .*? matches any character zero or more times; the ? after the * makes the match reluctant, so it finds the shortest match. The lookahead (?=\),) requires that the matched characters be followed by ),
You need to escape the parentheses:
>>> re.findall(r'\([^)]*\),', txt)
['(Ndufa10),']
Can someone point out the error in this regex? r"^/(.+/),$"
regex escape character is \ not / (do not confuse with python escape character which is also \, but is not needed when using raw strings)
=> r"^\(.+\),$"
^ and $ match start/end of the input string, not what you want to output
=> r"\(.+\),"
you need to match "any" characters up to the 1st occurrence of ), and not to the last one, so you need the lazy operator +?
=> r"\(.+?\),"
in case gene names could not contain ) character, you can use a faster regex that avoids backtracking
=> r"\([^)]+\),"
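The last two steps can behave differently when the header contains more than one parenthesised term. The sketch below assumes the header is a single line (the line breaks in the question look like terminal wrapping) and trims off the sequence data. On that input the lazy version starts at the first ( and expands until it finds ),, while the negated class cannot cross a ).

```python
import re

# Assumed single-line header, reconstructed from the question's wrapped output.
header = ('gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase '
          '(ubiquinone) 1 alpha subcomplex 10 (Ndufa10), mRNA')

# Lazy: begins at "(ubiquinone" and keeps expanding until the first "),".
lazy = re.findall(r"\(.+?\),", header)

# Negated class, with a capture group added so only the name is returned.
negated = re.findall(r"\(([^)]+)\),", header)

print(lazy)     # ['(ubiquinone) 1 alpha subcomplex 10 (Ndufa10),']
print(negated)  # ['Ndufa10']
```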
Below is the Python regular expression. What does the ?: mean in it? What does the expression do overall? How does it match a MAC address such as "00:07:32:12:ac:de:ef"?
re.compile(([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}), string)
(?:...) is a set of non-capturing grouping parentheses.
Normally, when you write (...) in a regex, it 'captures' the matched material. When you use the non-capturing version, it doesn't capture.
You can get at the various parts matched by the regex using the methods in the re package after the regex matches against a particular string.
How does this regular expression match MAC address "00:07:32:12:ac:de:ef"?
That's a different question from what you initially asked. However, the regex part is:
([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})
The outer most pair of parentheses are capturing parentheses; what they surround will be available when you use the regex against a string successfully.
The [\dA-Fa-f]{2} part matches a digit (\d) or one of the hexadecimal letters (A-F, a-f), in a pair ({2}). It is followed by a non-capturing group whose matched material is a colon or dash (: or -) followed by another pair of hex digits, with the whole group repeated exactly 5 times.
import re
# The pattern must be a (raw) string literal.
p = re.compile(r'([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})')
m = p.match("00:07:32:12:ac:de:ef")
if m:
    print(m.group(1))
The last line should print the string "00:07:32:12:ac:de" because that is the first set of 6 pairs of hex digits (out of the seven pairs in total in the string). In fact, the outer grouping parentheses are redundant; if they were omitted, m.group(0) would work (it works even with them). If you need to match 7 pairs, change the 5 into a 6. If you need to reject strings with a different number of pairs, put anchors into the regex:
p = re.compile(r'^([\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5})$')
The caret ^ matches the start of string; the dollar $ matches the end of string. With the 5, that would not match your sample string. With 6 in place of 5, it would match your string.
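To make the effect of the anchors and the repeat count concrete, here is a small sketch that tries the unanchored pattern and the two anchored variants on the sample string. The variable names are illustrative only.

```python
import re

sample = "00:07:32:12:ac:de:ef"  # seven pairs, as in the question

unanchored = re.compile(r'[\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}')
six_pairs = re.compile(r'^[\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){5}$')
seven_pairs = re.compile(r'^[\dA-Fa-f]{2}(?:[:-][\dA-Fa-f]{2}){6}$')

prefix = unanchored.match(sample).group(0)
six_result = six_pairs.match(sample)
seven_result = seven_pairs.match(sample)

print(prefix)        # 00:07:32:12:ac:de
print(six_result)    # None, because a seventh pair follows
print(seven_result)  # a match object for the whole string
```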
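Using ?: as in (?:...) makes the group non-capturing: it still groups the sub-pattern, but the text it matches is not stored, so it cannot be referenced as a backreference (for example, in a replacement string) or retrieved with group().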
Your RegEx means
r"""
( # Match the regular expression below and capture its match into backreference number 1
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
(?: # Match the regular expression below
[:-] # Match a single character present in the list below
# The character “:”
# The character “-”
[\dA-Fa-f] # Match a single character present in the list below
# A single digit 0..9
# A character in the range between “A” and “F”
# A character in the range between “a” and “f”
{2} # Exactly 2 times
){5} # Exactly 5 times
)
"""
Hope this helps.
It does not change the search process. But it affects the retrieval of the group after the match has been found.
For example:
Text:
text = 'John Wick'
pattern to find:
regex = re.compile(r'John(?:\sWick)') # here we are looking for 'John' and also for a group (space + Wick). the ?: makes this group unretrievable.
When we print the match - nothing changes:
<re.Match object; span=(0, 9), match='John Wick'>
But if you try to manually address the group with (?:) syntax:
res = regex.finditer(text)
for i in res:
    print(i)
    print(i.group(1))  # here we are trying to retrieve (?:\sWick) group
it gives us an error:
IndexError: no such group
Also, look:
Python docs:
(?:...)
A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.
the link to the re page in docs:
https://docs.python.org/3/library/re.html
(?:...) means a non-capturing group. The group will not be captured.
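The difference is easy to observe with re.findall(), which returns the captured groups when the pattern has any and the whole match otherwise. A minimal sketch, reusing the 'John Wick' text from the answer above:

```python
import re

text = 'John Wick'

# With a capturing group, findall returns what the group captured.
capturing = re.findall(r'John(\sWick)', text)

# With a non-capturing group, findall returns the whole match instead.
non_capturing = re.findall(r'John(?:\sWick)', text)

# The compiled pattern reports how many capturing groups it has.
capturing_count = re.compile(r'John(\sWick)').groups
non_capturing_count = re.compile(r'John(?:\sWick)').groups

print(capturing)            # [' Wick']
print(non_capturing)        # ['John Wick']
print(capturing_count)      # 1
print(non_capturing_count)  # 0
```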