Regex that does not contain a substring after some point - python

I want a regex that doesn't match a string if contains the word page, and match if it's not contain.
^https?.+/(event|news)/.+(?!page).+$ this is the regex I'm currently using, so I want it to not match with, e.g. https://www.foosite.com/news/foopath/page/10, but it does. Where did I made a mistake?
The double .+ expressions should imply that there should be some string around the page string, and (?!page) should imply there must not be a string like page between them. What's wrong with this expression? Thanks, and sorry for poor grammar.

Your problem is that .+(?!page).+ will match foopath/page/10 because the first .+ match can end at the 1 in 10, and the second can match from there until $. Instead, just assert there is no combination of characters plus the word page after (event|news)/:
^https?.+/(event|news)/(?!.*page)
Demo on regex101
If you want more than just a match/nomatch decision, you can capture the entire matching string with this regex:
^https?.+/(event|news)/(?!.*page).*$
Demo on regex101

You might be looking for
^https?.+/(event|news)/(?:(?!page).)+$
See a demo on regex101.com.

Matching is usually way easier in regex than excluding.
I would rather match your excluded words and invert the logic on the if-clause.
if(!re.match(...

Related

Regex negative lookahead in python [duplicate]

I am trying to search for all occurrences of "Tom" which are not followed by "Thumb".
I have tried to look for
Tom ^((?!Thumb).)*$
but I still get the lines that match to Tom Thumb.
You don't say what flavor of regex you're using, but this should work in general:
Tom(?!\s+Thumb)
In case you are not looking for whole words, you can use the following regex:
Tom(?!.*Thumb)
If there are more words to check after a wanted match, you may use
Tom(?!.*(?:Thumb|Finger|more words here))
Tom(?!.*Thumb)(?!.*Finger)(?!.*more words here)
To make . match line breaks please refer to How do I match any character across multiple lines in a regular expression?
See this regex demo
If you are looking for whole words (i.e. a whole word Tom should only be matched if there is no whole word Thumb further to the right of it), use
\bTom\b(?!.*\bThumb\b)
See another regex demo
Note that:
\b - matches a leading/trailing word boundary
(?!.*Thumb) - is a negative lookahead that fails the match if there are any 0+ characters (depending on the engine including/excluding linebreak symbols) followed with Thumb.
Tom(?!\s+Thumb) is what you search for.

Python: Regex to search for a "Mozilla" but ignore the match if the string also includes "iPhone" [duplicate]

I am trying to search for all occurrences of "Tom" which are not followed by "Thumb".
I have tried to look for
Tom ^((?!Thumb).)*$
but I still get the lines that match to Tom Thumb.
You don't say what flavor of regex you're using, but this should work in general:
Tom(?!\s+Thumb)
In case you are not looking for whole words, you can use the following regex:
Tom(?!.*Thumb)
If there are more words to check after a wanted match, you may use
Tom(?!.*(?:Thumb|Finger|more words here))
Tom(?!.*Thumb)(?!.*Finger)(?!.*more words here)
To make . match line breaks please refer to How do I match any character across multiple lines in a regular expression?
See this regex demo
If you are looking for whole words (i.e. a whole word Tom should only be matched if there is no whole word Thumb further to the right of it), use
\bTom\b(?!.*\bThumb\b)
See another regex demo
Note that:
\b - matches a leading/trailing word boundary
(?!.*Thumb) - is a negative lookahead that fails the match if there are any 0+ characters (depending on the engine including/excluding linebreak symbols) followed with Thumb.
Tom(?!\s+Thumb) is what you search for.

Python Regex for Clinical Trials Fields

I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n") only matches the single line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed with a non-whitespace char.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
See the Python demo and a regex demo.
You want a lookahead. You also might want it to be more flexible as far as how many newlines / what newline characters. You might try this:
import re
r = re.compile(r"""(\r\n|\r|\n)+(?=\S)""")
l = r.split(text)
though this does seem to insert \r\n characters into the list... Hmm.

extract string betwen two strings in pandas

I have a text column that looks like:
http://start.blabla.com/landing/fb603?&mkw...
I want to extract "start.blabla.com"
which is always between:
http://
and:
/landing/
namely:
start.blabla.com
I do:
df.col.str.extract('http://*?\/landing')
But it doesn't work.
What am I doing wrong?
Your regex matches http:/, then 0+ / symbols as few as possible and then /landing.
You need to match and capture the characters (The extract method accepts a regular expression with at least one capture group.) after http:// other than /, 1 or more times. It can be done with
http://([^/]+)/landing
^^^^^^^
where [^/]+ is a negated character class that matches 1+ occurrences of characters other than /.
See the regex demo
Just to answer a question you didn't ask, if you wanted to extract several portions of the string into separate columns, you'd do it this way:
df.col.str.extract('http://(?P<Site>.*?)/landing/(?P<RestUrl>.*)')
You'd get something along the lines of:
Site RestUrl
0 start.blabla.com fb603?&mkw...
To understand how this regex (and any other regex for that matter) is constructed I suggest you take a look at the excellent site regex101. I constructed a snippet where you can see the above regex in action here.

Regex is not matching in the way that I want to

Hi I'm new to regexes.
I have a string that I want to match any number of A-Z a-z 0-9 - and _
I've tried the following in python however it always matches, even the empty space. Can someone tell me why that is?
re.match(r'[A-Za-z0-9_-]+', 'gfds9 41.-=,434')
Your regex matches one or more of those characters. Your text starts with one or more of those characters, hence it matches. If you want it to only match those characters then you have to match them from the beginning to the end of the text.
re.match(r'^[A-Za-z0-9_-]+$', 'gfds9 41.-=,434')
Try the alternative for it maybe it will work for you:
[\w-]+
EDIT:
Although the initial regex you provided also works for me.

Categories