Regex match only if word count between 1-50 - python

So I have this code:
(r'\[quote\](.+?)\[/quote\]')
What I want to do is to change the regex so it only matches if the text within [quote] [/quote] is between 1-50 words.
Is there any easy way to do this?
Edit: Removed confusing html code in the regex example. I am NOT trying to match HTML.

Sure there is, depending on how you define a "word."
I would do so separately from regex, but if you want to use regex, you could probably do:
r"\[quote\](.+?\s){1,49}\[/quote\]"
That will match between 2 and 50 words (since it demands a trailing \s, it can't match ONE)
Crud, that also won't match the LAST word, and since . matches whitespace too, .+? doesn't really hold each repetition to one word. Let's use \S+ for a word instead:
r"\[quote\](\S+(?:\s+\S+){0,49})\[/quote\]"
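The answer above mentions counting words separately from the regex. A minimal sketch of that approach, assuming a "word" is any whitespace-separated token (the function name and the min_words/max_words parameters are made up for illustration):

```python
import re

# The pattern from the question, unchanged: it only extracts the quoted text.
QUOTE = re.compile(r"\[quote\](.+?)\[/quote\]")

def quotes_within_limit(text, min_words=1, max_words=50):
    """Return quote bodies whose whitespace-separated word count is in range."""
    kept = []
    for match in QUOTE.finditer(text):
        body = match.group(1)
        # str.split() with no arguments splits on any run of whitespace.
        if min_words <= len(body.split()) <= max_words:
            kept.append(body)
    return kept

short = "[quote]just three words[/quote]"
long_quote = "[quote]" + " ".join(["word"] * 51) + "[/quote]"
print(quotes_within_limit(short + " " + long_quote))  # ['just three words']
```

Keeping the count outside the pattern means the definition of a "word" can change without touching the regex.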

This is a definite misuse of regexes for a lot of reasons, not the least of which is the problem of matching [X]HTML, as @Hyperboreus noted, but if you really insist, you could do something along the lines of ([a-zA-Z0-9]+\s){1,49}.
For the record, I don't recommend this.

Use regex to remove a substring that matches a beginning of a substring through the following comma

I haven't found any helpful Regex tools to help me figure this complicated pattern out.
I have the following string:
Myfirstname Mylastname, Department of Mydepartment, Mytitle, The University of Me; 4-1-1, Hong,Bunk, Tokyo 113-8655, Japan E-mail:my.email@example.jp, Tel:00-00-222-1171, Fax:00-00-225-3386
I am trying to learn enough Regex patterns to remove the substrings one at a time:
E-mail:my.email@example.jp
Tel:00-00-222-1171
Fax:00-00-225-3386
So I think the correct pattern would be to remove a given word (ie., "E-mail", "Tel") all the way through the following comma.
Is this type of dynamic pattern possible in regex?
I am performing the match in Python, however, I don't think that would matter too much.
Also, I know the data string looks comma separated, and it is. However there is no guarantee of preserving the order of those fields. That's why I'm trying to use a Regex match.
How about this regex:
<YOUR_WORD>.*?(?=(,|($)))
Explanation:
It looks for the word specified in <YOUR_WORD> placeholder
It looks for any kind of character afterwards
The search stops when it hits one of the two options:
It finds the character ,
It finds an end of the line
So:
E-mail.*?(?=(,|($)))
Will result in:
E-mail:my.email@example.jp
And
Fax.*?(?=(,|($)))
Will result in:
Fax:00-00-225-3386
If it misses any edge cases, I would like to know, along with whether handling them affects performance or is even necessary.
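To actually remove a field rather than just find it, the same pattern can probably be passed to re.sub. A rough sketch, assuming a shortened sample string and a hypothetical helper named remove_field:

```python
import re

def remove_field(text, label):
    """Delete `label` and everything after it up to the next comma or the end."""
    # re.escape guards against labels that contain regex metacharacters.
    pattern = re.escape(label) + r".*?(?=,|$)"
    return re.sub(pattern, "", text)

# Shortened, made-up variant of the string from the question.
sample = "Tokyo 113-8655, Japan, Tel:00-00-222-1171, Fax:00-00-225-3386"

print(remove_field(sample, "Tel"))  # Tokyo 113-8655, Japan, , Fax:00-00-225-3386
print(remove_field(sample, "Fax"))  # Tokyo 113-8655, Japan, Tel:00-00-222-1171, 
```

Because the lookahead does not consume the comma, the separator is left behind, so some cleanup of stray ", " sequences may still be needed.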

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that, because of the parentheses in (941), the regex treats 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but first I need help finding the 2nd (941) so that I can then apply the 3rd step to it.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following regexes: (?:...), (941){1}, and escaping the parentheses with \ to make them literal, like this: \(941\), with no useful results. Maybe I am using them wrong.
I just wanted to know whether this can be done using regex. I thought it might also be useful for others or for future viewers.
Thanks!
Assuming:
You want to avoid matching only digits;
Want to match a substring made of word-characters (thus including possible digits);
Try to escape the variable and use it in the regular expression through f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
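As a small illustration of why the escaping matters (a sketch reusing the string from the question), the unescaped variable is read as a capture group around 941, while the escaped one matches the literal parentheses:

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# Unescaped: the parentheses form a group, so findall returns the group contents.
as_group = re.findall(num, s)
# Escaped: the parentheses are literal characters in the pattern.
as_literal = re.findall(re.escape(num), s)

print(as_group)        # ['941', '941']
print(as_literal)      # ['(941)', '(941)']
print(re.escape(num))  # \(941\)
```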

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes character sets, () denotes logical groupings.
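The difference can be seen in a short sketch (written in Python here, although the question mentions JavaScript; the test strings are made up):

```python
import re

# A character set: matches ONE character out of w, d, |, o, r, q.
char_class = re.compile(r"[wd|word|qw]=")
# A group with alternation: matches one of the whole words.
group = re.compile(r"(?:wd|word|qw)=")

print(char_class.search("foo=bar").group())   # o=
print(group.search("foo=bar"))                # None
print(group.search("s?word=regex").group())   # word=
```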
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except newlines... to match a literal dot, you have to escape it with a backslash (like \.)
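A quick sketch of the unescaped-dot problem, again in Python and with made-up URLs; the "strict" pattern only escapes the dot and is not meant as a complete URL matcher:

```python
import re

loose = re.compile(r".*baidu.com.*[/?].*(?:wd|word|qw)=")
strict = re.compile(r".*baidu\.com.*[/?].*(?:wd|word|qw)=")

url = "http://www.baidu.com/s?word=regex"
lookalike = "http://www.baiduXcom/s?wd=regex"

print(bool(loose.match(url)))         # True
print(bool(loose.match(lookalike)))   # True, because . matches the X
print(bool(strict.match(url)))        # True
print(bool(strict.match(lookalike)))  # False
```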
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?

Python Regex for Clinical Trials Fields

I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n") only matches the single line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed with a non-whitespace char.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
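A small sketch of the lookahead split on a made-up document (the field names and contents below are invented, not taken from the linked file):

```python
import re

# Hypothetical miniature document: one field contains an indented paragraph.
text = (
    "Date: 2000-01-01\n\n"
    "URL: http://example.org\n\n"
    "Summary:\n\n"
    "  indented paragraph\n\n"
    "Keywords: a, b"
)

# The lookahead checks for a non-whitespace character without consuming it,
# so every field keeps its first character.
fields = re.split(r"\n{2,}(?=\S)", text)
for field in fields:
    print(repr(field))
```

The blank line before the indented paragraph is followed by a space, so no split happens there and the paragraph stays inside its field.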
See the Python demo and a regex demo.
You want a lookahead. You also might want it to be more flexible as far as how many newlines / what newline characters. You might try this:
import re
r = re.compile(r"""(\r\n|\r|\n)+(?=\S)""")
l = r.split(text)
though this does seem to insert \r\n characters into the list... Hmm.
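The stray newline entries most likely come from the capturing group: re.split includes the text of any capturing groups in its result. A sketch on a made-up string, showing that a non-capturing group (?:...) avoids this:

```python
import re

text = "A: 1\r\n\r\nB: 2\n\nC: 3"  # made-up sample with mixed line endings

# Capturing group: the last repetition of the group is inserted into the list.
capturing = re.split(r"(\r\n|\r|\n)+(?=\S)", text)
# Non-capturing group: only the pieces between separators are returned.
non_capturing = re.split(r"(?:\r\n|\r|\n)+(?=\S)", text)

print(capturing)      # ['A: 1', '\r\n', 'B: 2', '\n', 'C: 3']
print(non_capturing)  # ['A: 1', 'B: 2', 'C: 3']
```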

Python regex with look behind and alternatives

I want a regular expression that finds the text "wrapped" between "HEAD" or "HEADa" and "HEAD". That is, the text may start with HEAD or HEADa as its first word, and the following "heads" are all of type HEAD.
HEAD\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....
HEADa\n\n text...text...HEAD \n\n text....text HEAD\n\n text....text .....
I want to capture only the text in between the "heads", so I have a regex with lookbehind and lookahead expressions looking for my "heads". I have the following regex:
var = "HEADa", "HEAD"
my_pat = re.compile(r"(?<=^\b"+var[0]+r"|"+var[1]+r"\b) \w*\s\s(.*?)(?=\b"+var[1] +r"\b)",re.DOTALL|re.MULTILINE)
However, when I try to execute this regex, I am getting an error message saying that I cannot have variable length in the look behind expression. What is wrong with this regex?
Currently, the first part of your regex looks like this:
(?<=^\bHEADa|HEAD\b)
You have two alternatives; one matches five characters and the other matches four, and that's why you get the error. Some regex flavors will let you do that even though they say they don't allow variable-length lookbehinds, but not Python. You could break it up into two lookbehinds, like this:
(?:(?<=^HEADa\b)|(?<=\bHEAD\b))
...but you probably don't need lookbehinds for this anyway. Try this instead:
(?:^HEADa|\bHEAD)\b
Whatever gets matched by the (.*?) later on will still be available through group #1. If you really need the whole of the text between the delimiters, you can capture that in group #1, and that other group will become #2 (or you can use named groups, and not have to keep track of the numbers).
Generally speaking, lookbehind should never be your first resort. It may seem like the obvious tool for the job, but you're usually better off doing a straight match and extracting the part you want with a capturing group. And that's true of all flavors, not just Python; just because you can do more with lookbehinds in other flavors doesn't mean you should.
BTW, you may have noticed that I redistributed your word boundaries; I think this is what you really intended.
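A rough sketch of both points, on a made-up sample text. The \s* after the head is a simplification of the question's " \w*\s\s", so treat the full pattern as an assumption rather than the asker's exact regex:

```python
import re

text = "HEADa\n\nfirst body HEAD \n\nsecond body HEAD\n\nthird body"

# The original lookbehind has alternatives of different widths (5 vs. 4),
# which Python's re module rejects at compile time.
try:
    re.compile(r"(?<=^\bHEADa|HEAD\b)")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False
print(lookbehind_ok)  # False

# Straight match of the head, with the wanted text in capturing group 1.
pattern = re.compile(
    r"(?:^HEADa|\bHEAD)\b\s*(.*?)(?=\bHEAD\b)",
    re.DOTALL | re.MULTILINE,
)
bodies = pattern.findall(text)
print(bodies)  # ['first body ', 'second body ']
```

As in the question's regex, the text after the last HEAD is not captured, because the lookahead requires another HEAD to follow.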
