Get sentence after pattern with regex python - python

In my string (example adapted from this tutorial) I want to get everything up to the first . that follows the generic (year). pattern:
str = 'purple alice#google.com, (2002).blah monkey. (1991).#abc.com blah dishwasher'
I think I'm almost there with my code but not quite yet:
test = re.findall(r'[\(\d\d\d\d\).-]+([^.]*)', str)
... which returns: ['com, (2002)', 'blah monkey', ' (1991)', '#abc', 'com blah dishwasher']
The desired output is:
['blah monkey', '#abc']
In other words, I want to find everything that is between the year pattern and the next dot.

If you want to get everything between (year). and the first . you can use this:
\(\d{4}\)\.([^.]*)
See Live Demo.
And explanation here:
"\(\d{4}\)\.([^.]*)"g
\( matches the character ( literally
\d{4} match a digit [0-9]
Quantifier: {4} Exactly 4 times
\) matches the character ) literally
\. matches the character . literally
1st Capturing group ([^.]*)
[^.]* match a single character not present in the list below
Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
. the literal character .
g modifier: global. All matches (don't return on first match)
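As a quick check, here is a minimal sketch that runs this pattern through Python's re.findall on the string from the question. The names text, pattern and result are only illustrative.

```python
import re

text = 'purple alice#google.com, (2002).blah monkey. (1991).#abc.com blah dishwasher'

# \(\d{4}\)\. anchors on "(year)." and the group captures everything up to the next dot.
pattern = re.compile(r'\(\d{4}\)\.([^.]*)')

# With exactly one capturing group, findall returns only the captured text.
result = pattern.findall(text)
print(result)  # ['blah monkey', '#abc']
```

Python has no g modifier; findall (or finditer) plays that role by returning all non-overlapping matches.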

This should do the trick
print(re.findall(r'\(\d{4}\)\.([^\.]+)', str))
['blah monkey', '#abc']

You are using [...] in the wrong way. Try with \(\d{4}\)\.([^.]*)\.:
>>> s = 'purple alice#google.com, (2002).blah monkey. (1991).#abc.com blah dishwasher'
>>> re.findall(r'\(\d{4}\)\.([^.]*)\.', s)
['blah monkey', '#abc']
For reference, [...] specifies a character class. By using [\(\d\d\d\d\).-] you were saying: any one of the characters 0123456789().-
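The difference between a character class and a sequence can be seen in a small sketch. The sample string below is made up for illustration and is not from the question.

```python
import re

sample = '(2002).blah 19-91'

# Inside [...] every item is an alternative for ONE character;
# order and repetition inside the brackets do not matter.
as_class = re.findall(r'[\(\d\d\d\d\).-]+', sample)

# Outside [...] the same tokens form a sequence that must appear in this order.
as_sequence = re.findall(r'\(\d\d\d\d\)\.', sample)

print(as_class)     # ['(2002).', '19-91']
print(as_sequence)  # ['(2002).']
```

The class version happily matches '19-91' because each of its characters is in the set, even though it looks nothing like a year in parentheses.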

Related

Can re ignore a lazy quantifier?

Given this code (Python 3.6):
>>> import re
>>> a = re.search(r'\(.+?\)$', '(canary) (wharf)')
>>> a
<_sre.SRE_Match object; span=(0, 16), match='(canary) (wharf)'>
>>>
Why doesn't re stop searching at the first closing parenthesis?
The expected output is None. The search should detect that there is not an end of line after (canary), but it doesn't.
Edit: If there is only ONE word between parens, it should match; if there is more than one, it shouldn't match at all.
Any help would be hugely appreciated.
The lazy flag isn't being ignored.
You get a match on the entire string because .+? means: match any character one or more times, as few times as possible, but keep expanding until the rest of the pattern matches. If the regex were \([^)]+?\)$ it would have matched only the last (wharf), because the character class [^)] stops +? from matching ).
Or if the regex was \(.+?\), it would have matched the (canary) and the (wharf), which shows that it's being lazy.
\(.+?\)$ matches everything because you make it match everything until the end of the line.
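The three cases described above can be compared side by side. This is a small sketch using the string from the question; the variable names are only illustrative.

```python
import re

s = '(canary) (wharf)'

# Lazy, but anchored with $: .+? must keep expanding until ) is followed by the end.
anchored_lazy = re.search(r'\(.+?\)$', s).group()

# Lazy without $: each match stops at the first closing parenthesis.
unanchored_lazy = re.findall(r'\(.+?\)', s)

# [^)] cannot cross a closing parenthesis, so only the last group can reach $.
no_close_paren = re.search(r'\([^)]+?\)$', s).group()

print(anchored_lazy)    # (canary) (wharf)
print(unanchored_lazy)  # ['(canary)', '(wharf)']
print(no_close_paren)   # (wharf)
```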
If you want to ensure that there is only one group in parentheses in the entire string, we can do that with our "no-parentheses-regex" from above and force the start of the string to match the start of your regex.
^\([^)]+?\)$
Try it: https://regex101.com/r/Ts9JeF/1
Explanation:
^\(: Match a literal ( at the start of the string
[^)]+?: Match anything but ), as many times as needed
\)$: Match a literal ) at the end of the line.
Or, if you want to allow other words before and after the one in parentheses, but nothing in parentheses, do this:
^[^()]*?\([^)]+?\)[^()]*$
Try it: https://regex101.com/r/Ts9JeF/3
Explanation:
^[^()]*?: At the start of the string, match anything but parentheses zero or more times.
\([^)]+?\): Very similar to our previous regex
[^()]*$: Match zero or more non-parentheses characters until the end of the string.
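To see how the two anchored patterns behave, here is a minimal sketch. The third sample string is an assumption added for illustration; the other two follow the question.

```python
import re

only_group = re.compile(r'^\([^)]+?\)$')
one_group_with_text = re.compile(r'^[^()]*?\([^)]+?\)[^()]*$')

samples = ['(canary)', '(canary) (wharf)', 'see (canary) here']

# True where the pattern matches, False where search returns None.
strict = [bool(only_group.search(t)) for t in samples]
relaxed = [bool(one_group_with_text.search(t)) for t in samples]

print(strict)   # [True, False, False]
print(relaxed)  # [True, False, True]
```

Both patterns reject the two-group string, which is the behaviour the question asked for.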
The non-greedy qualifier makes it match the shortest repeat. In this case the shortest successful repeat is the entire string. It doesn't "not match the )" because you didn't tell it to do so.
You can think of the engine doing something like this (using the simplified string '(a) (b)'):
start at position 0
'(' matches (, proceed to position 1
'a' matches ., proceed to position 2
(non-greedy) ')' matches ), proceed to position 3
(non-greedy) $ does not match at position 3 (not the end of the string) => backtrack to position 2
')' matches . proceed to position 3
(non-greedy) ' ' does not match )
' ' matches . proceed to position 4
(non-greedy) '(' does not match )
'(' matches . proceed to position 5
(non-greedy) 'b' does not match )
'b' matches . proceed to position 6
(non-greedy) ')' matches )
(non-greedy) $ matches end of string => DONE!
try this regex on for size:
r'\([^)]+\)$'
Here a left paren is matched, followed by one or more non-right-paren characters, followed by a right paren and the end of the string.

How to capture the word with space around without capturing the space?

I've got a string like this: s = "Hello this is Helloworld #helloworld #hiworld #nihaoworld " The idea is to catch all the hashtags, but each hashtag needs to have a boundary (whitespace or the start/end of the string) around it. For example, nothing should be captured from "Hello this is helloworld#helloworld".
I want to generate the following result: ["#helloworld","#hiworld","#nihaoworld"]
I've got the following python code
import re
print(re.findall(r'(?:^|\s+)(#[a-z]{1,})(?:\s+|$)', s))
The result I got is ["#helloworld","#nihaoworld"] with the middle word missing
I don't think you really need a regular expression for this, you can just use:
s.strip().split()
However, if you do want to use a regex, you could just use (?:^|\s)(#\w+):
>>> import re
>>> s = " #helloworld #hiworld #nihaoworld "
>>> re.findall(r'(?:^|\s)(#\w+)', s)
['#helloworld', '#hiworld', '#nihaoworld']
Explanation
Non-capturing group (?:^|\s)
1st Alternative ^
^ asserts position at start of the string
2nd Alternative \s
\s matches any whitespace character (equal to [\r\n\t\f\v ])
1st Capturing Group (#\w+)
# matches the character # literally (case sensitive)
\w+ matches any word character (equal to [a-zA-Z0-9_])
+ Quantifier — Matches between one and unlimited times, as many times as possible, giving back as needed (greedy)
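The middle hashtag goes missing in the original attempt because the trailing \s+ consumes the space that the next match would need as its leading boundary. One possible alternative, not taken from the answer above, is to check both boundaries with lookarounds, which do not consume any characters. A minimal sketch:

```python
import re

s = "Hello this is Helloworld #helloworld #hiworld #nihaoworld "

# Original attempt: the trailing \s+ eats the separator before '#hiworld'.
consuming = re.findall(r'(?:^|\s+)(#[a-z]{1,})(?:\s+|$)', s)

# (?<!\S) means "not preceded by a non-space character" and (?!\S) means
# "not followed by a non-space character"; neither consumes the separator.
lookaround = re.findall(r'(?<!\S)#\w+(?!\S)', s)

# A hashtag glued to a preceding word is rejected by the lookbehind.
glued = re.findall(r'(?<!\S)#\w+(?!\S)', "Hello this is helloworld#helloworld")

print(consuming)   # ['#helloworld', '#nihaoworld']
print(lookaround)  # ['#helloworld', '#hiworld', '#nihaoworld']
print(glued)       # []
```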

Referencing previous group possible within the same regex?

I am trying to perform a regex in Python. I want to match on a file path that does not have a domain extension and additionally, I only want to get those file paths that have 20 characters max after the last '\' of the file path. For example, given the data:
c:\users\docs\cmd.exe
c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
c:\users\docs\files\target
I would want to match on 'target', and not the other two lines. It should be noted that in my current situation, using the re module or python operations is not an option, as this regex is fed into the program (which uses re.match() ), so I have to do this within a regex string.
I have two regexes:
^([^.]+)$ will match the last 2 lines
([^\\]{,20}$) will match 'cmd.exe' and 'target'
How can I combine these two into one regex? I tried backreferencing (?P=, etc), but couldn't get it to work. Is this even possible?
How about \\([^\\.]{1,20})(?:$|\n)? It seems to work for me.
\\ is escaped literal backslash.
( start of capture group.
[^\\.] match anything except literal backslash or literal dot character
{1,20} match class 1-20 times, as many times as possible (greedy).
) end the capture group.
(?: starts a non-capturing group
$ match the end of the string.
| is the 'or' operator for this group
\n matches a line-feed or newline character (ASCII 10)
) end of non-capturing group
To create this, I used https://regex101.com/#python which is a very good resource in my opinion because it explains every part of the regex and neatly shows the captured groups in real time.
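A small sketch of how this pattern behaves on the three sample paths. Because the pattern starts with a backslash, it only works with re.search; re.match anchors at the start of the string. Since the question says the host program calls re.match(), the sketch also includes a hypothetical combined pattern (not from the answer) that uses a lookahead to require "no dot anywhere" and then checks the length of the last path component.

```python
import re

paths = [
    r'c:\users\docs\cmd.exe',
    r'c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf',
    r'c:\users\docs\files\target',
]

# The pattern from the answer, applied with search.
answer_pattern = re.compile(r'\\([^\\.]{1,20})(?:$|\n)')
found = [m.group(1) if m else None for m in map(answer_pattern.search, paths)]

# With match it cannot succeed, because the paths start with 'c', not a backslash.
match_on_target = answer_pattern.match(paths[2])

# Hypothetical variant for re.match: (?=[^.]+$) asserts the whole string has no dot,
# then .*\\ skips to the last backslash and the group takes 1 to 20 trailing characters.
combined = re.compile(r'(?=[^.]+$).*\\([^\\]{1,20})$')
matched = [m.group(1) if m else None for m in map(combined.match, paths)]

print(found)            # [None, None, 'target']
print(match_on_target)  # None
print(matched)          # [None, None, 'target']
```

A lookahead is a general way to require two conditions at once, since it tests the string without consuming it.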
>>> s = r"""c:\users\docs\cmd.exe
... c:\users\docs\files\ewyrkfdisadfasdfaffsfdasfsafsdf
... c:\users\docs\files\target""".split('\n')
>>> [re.match(r'.*\\([^.]{,20})$', x) for x in s]
[None, None, <_sre.SRE_Match object at 0x7f6ad9631558>]
also
>>> [re.findall(r'.*\\([^.]{,20})$', x) for x in s]
[[], [], ['target']]
This means:
.*\\ - grab everything up to and including the last \
([^.]{,20}) - make sure there are no . in the remaining up to 20 characters
$ - end of line
The () around the middle group indicate that it should be the group returned as the match

Regex for match parentheses in Python

I have a list of fasta sequences, each of which look like this:
>>> sequence_list[0]
'gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)
I'd like to be able to extract the gene names from each of the fasta entries in my list, but I'm having difficulty finding the right regular expression. I thought this one would work: "^/(.+/),$". Start with a parenthesis, then any number of any character, then end with a parenthesis followed by a comma. Unfortunately, this returns None:
test = re.search(r"^/(.+/),$", sequence_list[0])
print(test)
Can someone point out the error in this regex?
Without any capturing groups,
>>> import re
>>> str = """
... gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase (ubiquinone) 1 alp
... ha subcomplex 10 (Ndufa10), mRNAGCCGGCGCAGACGGCGAAGTCATGGCCTTGAGGTTGCTGAGACTCGTC
... CCGGCGTCGGCTCCCGCGCGCGGCCTCGCGGCCGGAGCCCAGCGCGTGGG (etc)"""
>>> m = re.findall(r'(?<=\().*?(?=\),)', str)
>>> m
['Ndufa10']
It matches only the text inside parentheses, and only when the closing parenthesis is followed by a comma.
Explanation:
(?<=\() In regex, (?<=pattern) is called a lookbehind. It asserts that the text immediately before the current position matches the pattern inside the lookbehind, without including it in the match. In our case the pattern inside the lookbehind is \(, which means a literal (.
.*?(?=\),) .*? matches any character (except a newline) zero or more times. The ? after the * makes the match reluctant, so it takes the shortest match. The lookahead (?=\),) requires that the matched characters be followed by ),
you need to escape parenthesis:
>>> re.findall(r'\([^)]*\),', txt)
['(Ndufa10),']
Can someone point out the error in this regex? r"^/(.+/),$"
regex escape character is \ not / (do not confuse with python escape character which is also \, but is not needed when using raw strings)
=> r"^\(.+\),$"
^ and $ match start/end of the input string, not what you want to output
=> r"\(.+\),"
you need to match "any" characters up to the 1st occurrence of ), not to the last one, so you need the lazy operator +?
=> r"\(.+?\),"
in case gene names could not contain ) character, you can use a faster regex that avoids backtracking
=> r"\([^)]+\),"
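The last two steps differ in practice when the header contains an earlier parenthesised word. The sketch below assumes the FASTA header is a single line with no embedded newline (the wrapped display in the question looks like terminal wrapping) and uses a shortened header without the sequence data.

```python
import re

# Shortened header, assumed to be one line.
header = ('gi|13195623|ref|NM_024197.1| Mus musculus NADH dehydrogenase '
          '(ubiquinone) 1 alpha subcomplex 10 (Ndufa10), mRNA')

# Lazy dot: the match starts at the FIRST '(' and expands until it finds '),'.
lazy = re.findall(r'\(.+?\),', header)

# Negated class: the match cannot cross a ')', so only '(Ndufa10),' qualifies.
# The capturing group returns the name without the parentheses.
negated = re.findall(r'\(([^)]+)\),', header)

print(lazy)     # ['(ubiquinone) 1 alpha subcomplex 10 (Ndufa10),']
print(negated)  # ['Ndufa10']
```

Lazy quantifiers shorten the end of a match, not its start, so the negated class is the safer choice here.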

Regex to extract top level domain from email address

From email address like
xxx#site.co.uk
xxx#site.uk
xxx#site.me.uk
I want to write a regex which should return 'uk' in all the cases.
I have tried
'+#([^.]+)\..+'
which gives only the domain name. I have tried using
'[^/.]+$'
but it gives an error.
The regex to extract what you are asking for is:
\.([^.\n\s]*)$ with /gm modifiers
explanation:
\. matches the character . literally
1st Capturing group ([^.\n\s]*)
[^.\n\s]* match a single character not present in the list below
Quantifier: Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
. the literal character .
\n matches a line-feed (newline) character (ASCII 10)
\s match any white space character [\r\n\t\f ]
$ assert position at end of a line
m modifier: multi-line. Causes ^ and $ to match the begin/end of each line (not only begin/end of string)
g modifier: global. All matches
for your input example, it will be:
import re
data = 'xxx#site.co.uk\nxxx#site.uk\nxxx#site.me.uk'  # the sample addresses, one per line
m = re.compile(r'\.([^.\n\s]*)$', re.M)
f = m.findall(data)
print(f)
output:
['uk', 'uk', 'uk']
hope this helps.
As myemail#com is a valid address, you can use:
#.*?([^.]+)$
You don't need regex. This would always give you 'uk' in your examples:
>>> url = 'foo#site.co.uk'
>>> url.split('.')[-1]
'uk'
Wouldn't simply .*\.(\w+) work?
Can add more validations for "#" to the regular expression if needed.
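A minimal sketch comparing this pattern with a variant that also handles a dotless address such as myemail#com. The second pattern and the helper first_group_or_none are assumptions added for illustration, and # stands in for the usual separator exactly as in the question.

```python
import re

addresses = ['xxx#site.co.uk', 'xxx#site.uk', 'xxx#site.me.uk', 'myemail#com']

# Greedy .* runs to the end of the string, then backtracks to the last dot.
dotted = re.compile(r'.*\.(\w+)')

# Hypothetical variant: the final run of characters that are not '.', '#' or whitespace.
last_label = re.compile(r'([^.#\s]+)$')


def first_group_or_none(match):
    """Return group 1 of a match object, or None when there was no match."""
    return match.group(1) if match else None


with_dot = [first_group_or_none(dotted.match(a)) for a in addresses]
any_label = [first_group_or_none(last_label.search(a)) for a in addresses]

print(with_dot)   # ['uk', 'uk', 'uk', None]
print(any_label)  # ['uk', 'uk', 'uk', 'com']
```

The first pattern needs at least one dot, so it returns nothing for myemail#com; whether that matters depends on the input.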
