Regex in python. How to simplify my example? - python

I'm new at Python programming. Right now I'm struggling with simplifying my existing code. Here is the exercise: develop a pattern which will match the telephone number in format (xxx) xxx-xx-xx AND xxx-xxx-xx-xx. What I've come up with so far:
patt = "\(?\d{3}\)?\s?-?\d{3}-\d{2}-\d{2}"
It works perfectly. But the problem is obvious: if I have an optional pattern, say, "(specific-patter-ffddff445%$#%)--ds", before some kind of fixed pattern, I will have to put a "?" after EVERY symbol in the optional pattern. How can I group all the symbols and use just one "?"?

So what you have matches all kinds of incorrect formats. For example:
012)345-67-89
(012 345-67-89
What you want is alternation, which regexes provide: https://docs.python.org/3.4/library/re.html#regular-expression-syntax
Something like this would be preferable:
patt = r'(?:\(\d{3}\) |\d{3}-)\d{3}-\d{2}-\d{2}'
This matches either "(xxx) " or "xxx-" as a prefix to "xxx-xx-xx", and it does not match either of the malformed strings listed above.
? should only be used in the event that what it operates on is truly optional.
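As a rough sketch of the pattern above in use (the sample strings and the names samples, results, and optional_prefix are made up for illustration), fullmatch shows which inputs are accepted. The last lines show the general way to make a multi-character prefix optional with a single "?": wrap it in a non-capturing group first.

```python
import re

# Alternation: either "(xxx) " or "xxx-" must precede "xxx-xx-xx".
patt = re.compile(r'(?:\(\d{3}\) |\d{3}-)\d{3}-\d{2}-\d{2}')

# Two well-formed numbers followed by the two malformed ones from the answer.
samples = ['(012) 345-67-89', '012-345-67-89', '012)345-67-89', '(012 345-67-89']
results = {s: bool(patt.fullmatch(s)) for s in samples}
for s, ok in results.items():
    print(f'{s!r}: {ok}')

# One "?" for a whole optional prefix: group it, then quantify the group.
# The prefix "abc-" is a hypothetical stand-in for any longer optional text.
optional_prefix = re.compile(r'(?:abc-)?\d{3}')
with_prefix = bool(optional_prefix.fullmatch('abc-123'))
without_prefix = bool(optional_prefix.fullmatch('123'))
print(with_prefix, without_prefix)
```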

Related

Exact search of a string that has parenthesis using regex

I am new to regexes.
I have the following string : \n(941)\n364\nShackle\n(941)\nRivet\n105\nTop
Out of this string, I want to extract Rivet and I already have (941) as a string in a variable.
My thought process was like this:
Find all the (941)s
filter the results by checking if the string after (941) is followed by \n, followed by a word, and ending with \n
I made a regex for the 2nd part: \n[\w\s\'\d\-\/\.]+$\n.
The problem I am facing is that because of the parenthesis in (941) the regex is taking 941 as a group. In the 3rd step the regex may be wrong, which I can fix later, but 1st I needed help in finding the 2nd (941) so then I can apply the 3rd step on that.
PS.
I know I can use python string methods like find and then loop over the searches, but I wanted to see if this can be done directly using regex only.
I have tried the following: (?:...), (941){1}, and escaping the parentheses with \ to make them literal, like this: \(941\), with no useful results. Maybe I am using them wrong.
Just wanted to know if this can be done using regex. Thought it might be useful for others too, or a good share for future viewers.
Thanks!
Assuming that:
You want to avoid matching lines made only of digits;
You want to match a substring made of word characters (thus including possible digits);
Try escaping the variable and interpolating it into the regular expression with an f-string:
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
var1 = '(941)'
var2 = re.escape(var1)
m = re.findall(fr'{var2}\n(?!\d+\n)(\w+)', s)[0]
print(m)
Prints:
Rivet
If you have text in a variable that should be matched exactly, use re.escape() to escape it when substituting into the regexp.
import re
s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'
re.findall(rf'(?<=\n{re.escape(num)}\n)[\w\s\'\d\-\/\.]+(?=\n)', s)
This puts (941)\n in a lookbehind, so it's not included in the match. This avoids a problem with the \n at the end of one match overlapping with the \n at the beginning of the next.
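A minimal sketch of how re.escape and a lookbehind can be combined to reach "Rivet". Two things here are assumptions rather than part of the answers above: the character class [^\n]+ (chosen so that a match cannot run across a newline) and the final digit filter.

```python
import re

s = '\n(941)\n364\nShackle\n(941)\nRivet\n105\nTop'
num = '(941)'

# re.escape turns the parentheses into literals instead of a capture group.
escaped = re.escape(num)
print(escaped)

# The lookbehind requires a "(941)" line directly before the match
# but keeps that line out of the matched text.
following = re.findall(rf'(?<=\n{escaped}\n)[^\n]+', s)
print(following)

# Assumed extra step: drop lines that consist only of digits.
words = [line for line in following if not line.isdigit()]
print(words)
```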

Python regex match all sentences include either wordA or wordB [duplicate]

I'm creating a javascript regex to match queries in a search engine string. I am having a problem with alternation. I have the following regex:
.*baidu.com.*[/?].*wd{1}=
I want to be able to match strings that have the string 'word' or 'qw' in addition to 'wd', but everything I try is unsuccessful. I thought I would be able to do something like the following:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
but it does not seem to work.
Replace [wd|word|qw] with (wd|word|qw) or (?:wd|word|qw).
[] denotes a character set; () denotes a logical grouping.
Your expression:
.*baidu.com.*[/?].*[wd|word|qw]{1}=
does need a few changes, including [wd|word|qw] to (wd|word|qw) and getting rid of the redundant {1}, like so:
.*baidu.com.*[/?].*(wd|word|qw)=
But you also need to understand that the first part of your expression (.*baidu.com.*[/?].*) will match baidu.com hello what spelling/handle????????? or hbaidu-com/ or even something like lkas----jhdf lkja$##!3hdsfbaidugcomlaksjhdf.[($?lakshf, because the dot (.) matches any character except a newline. To match a literal dot, you have to escape it with a backslash (like \.).
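To make the difference concrete, here is a small sketch using Python's re module (the question is about JavaScript, but these constructs behave the same way there). The sample strings and the URL are made up for illustration.

```python
import re

# A character class matches ONE character out of the set {w, d, |, o, r, q}.
char_class = re.compile(r'[wd|word|qw]=')
# A group with alternation matches one of the whole words.
group = re.compile(r'(?:wd|word|qw)=')

class_hit = char_class.search('word=').group()
group_hit = group.search('word=').group()
print(class_hit, group_hit)

# An unescaped dot matches any character; an escaped dot only a literal dot.
loose = re.search(r'baidu.com', 'baidugcom') is not None
strict = re.search(r'baidu\.com', 'baidugcom') is not None
print(loose, strict)

# The full expression with both fixes, tried on a hypothetical URL.
full = re.compile(r'.*baidu\.com.*[/?].*(?:wd|word|qw)=')
url_ok = full.match('http://www.baidu.com/s?wd=python') is not None
print(url_ok)
```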
There are several approaches you could take to match things in a URL, but we could help you more if you tell us what you are trying to do or accomplish - perhaps regex is not the best solution or (EDIT) only part of the best solution?

Regex - If not match then match this - Python

I apologise for the amount of text, but I cannot wrap my head around this and I would like to make my problem clear.
I am currently attempting to create a regex expression to find the end of a website/email link in order to then process the rest of the address. I have decided to look for the ending of the address (eg. '.com', '.org', '.net'); however, I am having difficulty in two areas when dealing with this. (I have chosen this method as it is the best fit for the current project)
Firstly I am trying to get around accidentally hindering users typing words with these keywords within them (eg. '"org"anisation', 'try this "or g"o to'). How I have tackled this is, as an example, the regex:
org(?!\w) - To skip the match if there are letters directly after the keyword.
The secondary problem is finding extra parts of an address (eg. 'www.website."org".uk') which would not be matched. To combat this, as an example, I have used the regex:
org((\W*|\.|dot)\w\w) - In an attempt to find the first two letters after the keyword, as most extensions are only two letters.
The Main Problem:
In order to prevent both of the above situations I have used the regex akin to:
org(.|dot)\w\w|(?!\w)
However, I am not as well versed in regex as I would like to be, and I understand that this would not produce correct results. I know there is a form of 'If this then that' within regex, but I just can't seem to understand the online documentation I have found on the subject.
If possible would someone be able to explain how I may go about creating a system to say:
IF: NOT org(\w)
ELSE IF: org(.|dot)
THEN: MATCH org(.|dot)\w\w
ELSE: MATCH org
I would really appreciate any help on the matter, this has been on my mind for a while now. I would just like to see it through, but I just do not possess the required knowledge.
Edit:
Test cases the Regex would need to pass (Specifically for the 'org' regex for these examples):
(I have marked matches in square brackets '[ ]', and I have marked possible matches to be disregarded with '< >' )
"Hello, please come and check out my website: www.website.[org]"
"I have just uploaded a new game at games.[org.uk]"
"If you would like quote please email me at email#email.[org.ru]"
"I have just made a new <org>anisation website at website.[org], please get in contact at name.name#email.[org.us]"
"For more info check info.[org] <or g>o to info.[org.uk]"
I hope this allows for a better insight to what the Regex needs to do.
The following regex:
(?i)(?<=\.)org(?:\.[a-z]{2})?\b
should do the work for you.
demo:
https://regex101.com/r/8F9qbQ/2/
explanations:
(?i) enables case-insensitive matching (.ORG or .org)
(?<=\.) requires a . immediately before org, to avoid matches when org is actually part of a word
org matches ORG or org
(?:...)? is a non-capturing group that can appear 0 or 1 times
\.[a-z]{2} is a dot followed by exactly 2 letters (either case, because of (?i))
\b is a word boundary constraint
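A quick way to check this regex against the test cases from the question is to run it with re.findall. This is only a sketch; the dictionary name cases and the list found are assumptions, and the sentences are copied from the question with the bracket markers removed.

```python
import re

pattern = re.compile(r'(?i)(?<=\.)org(?:\.[a-z]{2})?\b')

# Each sentence maps to the matches the question marked with [ ].
cases = {
    'Hello, please come and check out my website: www.website.org': ['org'],
    'I have just uploaded a new game at games.org.uk': ['org.uk'],
    'I have just made a new organisation website at website.org, '
    'please get in contact at name.name#email.org.us': ['org', 'org.us'],
    'For more info check info.org or go to info.org.uk': ['org', 'org.uk'],
}

found = {text: pattern.findall(text) for text in cases}
for text, matches in found.items():
    print(matches, '<-', text)
```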
There are simpler ways to catch any website, but assuming that you need exactly the behaviour IF: NOT org(\w) ELSE IF: org(.|dot) THEN: MATCH org(.|dot)\w\w ELSE: MATCH org, you can use:
org(?!\w)(\.\w\w)?
It will match:
"org.uk" of www.domain.org.uk
"org" of www.domain.org
But will not match www.domain.orgzz and orgzz
Explanation:
The org(?!\w) part matches org when it is not followed by a word character (a letter, digit, or underscore). It matches the org in "org" and in "org." but not in "orgzz".
Then, once org has matched, (\.\w\w)? tries to match a dot and two more word characters. The ? quantifier makes that group optional, so it matches ".uk" when it is present but does not require it.
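The behaviour described above can be checked with a short sketch like the following. The helper name first_match is hypothetical, and the inputs are the examples listed in this answer.

```python
import re

pattern = re.compile(r'org(?!\w)(\.\w\w)?')

def first_match(text):
    """Return the first matched substring, or None when nothing matches."""
    m = pattern.search(text)
    return m.group(0) if m else None

with_suffix = first_match('www.domain.org.uk')
plain = first_match('www.domain.org')
rejected = first_match('www.domain.orgzz')
print(with_suffix, plain, rejected)
```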
I made a little regex that captures a website as long as it starts with 'www.' that is followed by some characters with a following '.'.
import re
matcher = re.compile(r'(www\.\S*\.\S*)')  # matches any website with layout www.whatever
string = 'the sky is very blue www.harvard.edu.co see nothing else triggers it, www, org'
match = matcher.search(string).group(1)
print(match)
# output
# www.harvard.edu.co
Now you can tighten this up as needed to avoid false positives.

How do I create a regex with regular variable and some fixed text in Python?

In my code I only want to fetch a variable name from a C file when it is used in an if condition.
The following is my regex snippet:
fieldMatch = re.findall(itemFieldList[i]+"=", codeline, re.IGNORECASE);
With this I can find the variable itemFieldList[i] in the file.
But when I try to add the if, as shown below, nothing is extracted, even though the variable exists in an if condition in the C code.
fieldMatch = re.findall(("^(\w+)if+[(](\w+)("+itemFieldList[i]+")="), codeline, re.IGNORECASE|re.MULTILINE);
Can anyone suggest how to write a regex for this scenario?
Sample Input :
IF(WORK.env_flow_ind=="R")
OR
IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")
here itemFieldList[i] = WORK.env_flow_ind
I don't have enough reputation to make this a comment, which it should be and I can't say that I fully understand the question. But to point out a few things:
First, if this is about adding variables to your regex, then you should be using string templates to make it more understandable for us and your future self.
"^{}".format(variable)
Doing that will allow you to create a dynamic regex that searches for what you want.
Secondly, I don't think that is your problem. I think that your regex is malformed. I don't know what exactly you are trying to search for but I recommend reading the python regex documentation and testing your regex on a resource like regex101 to make sure that you're capturing what you intend to. From what I can see you are a bit confused about groups. When you put parenthesis around a pattern you are identifying it as a group. You were on the right track trying to exclude the parenthesis in your search by surrounding it with square brackets but it's simpler and cleaner to escape them.
If you are trying to capture this statement:
if(someCondition == fries)
and you want to extract the keyword fries, a valid pattern is:
(?=if\((?:[\w=\s])+(fries)\))
Since you want this to be dynamic, you would replace the string fries with your string template, and you'll get code that ends up something like this:
p = re.compile(r"(?=if\((?:[\w=\s])+({})\))".format(search), re.IGNORECASE)
p.findall(string)
Regex101 does a better job of breaking down my regex than I ever will.
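Put together as something runnable, the idea might look like the sketch below. The helper name build_if_pattern is hypothetical, and wrapping the search term in re.escape is an addition not present in the answer; it keeps characters such as "." in the term from being read as regex syntax.

```python
import re

def build_if_pattern(search):
    """Build the lookahead pattern from the answer for a given search term."""
    # re.escape is an assumed safety step, not part of the original answer.
    template = r"(?=if\((?:[\w=\s])+({})\))"
    return re.compile(template.format(re.escape(search)), re.IGNORECASE)

p = build_if_pattern('fries')
found = p.findall('if(someCondition == fries)')
print(found)
```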
You can build the regex pattern as:
pattern = r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b"
This will look for the word “if” in lowercase (pass re.IGNORECASE to also accept “IF”), then optionally any spaces, then an opening parenthesis, then optionally any characters, and then your search term, with word boundaries at its beginning and end.
So if variablename is "WORK.env_flow_ind", then re.findall(pattern, textfile) will match the following lines:
if(blabla & WORK.env_flow_ind == "a")
if (WORK.env_flow_ind == "b")
if(WORK.env_flow_ind == "b")
if( WORK.env_flow_ind == "b")
and these won't match:
if (WORK.env_bla == "c")
if (WORK.env_flow_ind2 == "d")
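Applied to the sample input from the question, which writes IF in uppercase, the same pattern needs re.IGNORECASE. The following is a sketch under that assumption; the list name lines and the result name matching are made up, and the last two lines are the non-matching cases from this answer.

```python
import re

variablename = 'WORK.env_flow_ind'
pattern = re.compile(
    r"\bif\b\s*\(.*?\b" + re.escape(variablename) + r"\b",
    re.IGNORECASE,  # assumed: the question's sample input uses "IF"
)

lines = [
    'IF(WORK.env_flow_ind=="R")',
    'IF( WORK.qa_flow_ind=="Q" OR WORK.env_flow_ind=="R")',
    'if (WORK.env_bla == "c")',
    'if (WORK.env_flow_ind2 == "d")',
]

# Keep only the lines whose if-condition mentions the exact variable name.
matching = [line for line in lines if pattern.search(line)]
for line in matching:
    print(line)
```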

How do I extract definitions from a html file?

I'm trying to practice with regular expressions by extracting function definitions from Python's standard library built-in functions page. What I do have so far is that the definitions are generally printed between <dd><p> and </dd></dl>. When I try
import re
fhand = open('functions.html').read()
deflst = re.findall(r'<dd><p>([\D3]+)</dd></dl>', fhand)
it doesn't actually stop at </dd></dl>. This is probably something very silly that I'm missing here, but I've been really having a hard time trying to figure this one out.
Regular expressions are evaluated left to right, in a sense. So in your regular expression,
r'<dd><p>([\D3]+)</dd></dl>'
the regex engine will first look for a <dd><p>, then it will look at each of the following characters in turn, checking each for whether it's a nondigit or 3, and if so, add it to the match. It turns out that all the characters in </dd></dl> are in the class "nondigit or 3", so all of them get added to the portion matched by [\D3]+, and the engine dutifully keeps going. It will only stop when it finds a character that is a digit other than 3, and then go on and "notice" the rest of the regex (the </dd></dl>).
To fix this, you can use the reluctant quantifier like so:
r'<dd><p>([\D3]+?)</dd></dl>'
(note the added ?) which means the regex engine should be conservative in how much it adds to the match. Instead of trying to "gobble" as many characters as possible, it will now try to match the [\D3]+? to just one character and then go on and see if the rest of the regex matches, and if not it will try to match [\D3]+? with just two characters, and so on.
Basically, [\D3]+ matches the longest possible string of [\D3]'s that it can while still letting the full regex match, whereas [\D3]+? matches the shortest possible string of [\D3]'s that it can while still letting the full regex match.
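The difference is easy to see on a tiny input. In this sketch the HTML string is a made-up two-entry stand-in for the real page, not its actual markup.

```python
import re

# Hypothetical miniature of the page: two definitions, no digits anywhere.
html = '<dl><dd><p>abs(x)</dd></dl><dl><dd><p>all(iterable)</dd></dl>'

# Greedy: [\D3]+ runs to the end, then backs off to the LAST </dd></dl>.
greedy = re.findall(r'<dd><p>([\D3]+)</dd></dl>', html)
# Reluctant: [\D3]+? grows one character at a time, stopping at the FIRST </dd></dl>.
lazy = re.findall(r'<dd><p>([\D3]+?)</dd></dl>', html)

print(greedy)
print(lazy)
```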
Of course one shouldn't really be using regular expressions to parse HTML in "the real world", but if you just want to practice regular expressions, this is probably as good a text sample as any.
By default all quantifiers are greedy which means they want to match as many characters as possible. You can use ? after quantifier to make it lazy which matches as few characters as possible. \d+? matches at least one digit, but as few as possible.
Try r'<dd><p>([\D3]+?)</dd></dl>'
