regular expression - partially match - python

My aim is to find matches in a text where not all of the pieces are always present.
I am trying to collect the phone number, the e-mail address and the website of venues from a web site. Only some venues have all three pieces of information available; most have only one or two. I tried to write some code, but it works only if all three pieces are present. Could someone tell me what is wrong?
grouped = re.compile(r'col-right[\s\S]*?' +
                     r'Tel[\s\S]*?([0-9]{0,4}-?[0-9]{3,7}-?[0-9]{0,4}-?[0-9]{0,4})' +
                     r'[\s\S]*?href="http://([\w\W]*?)"' +
                     r'[\s\S]*?href="mailto:([\s\S]*?)">[\s\S]*?</div>')
for match in re.finditer(grouped, text):
    print(match.group(1))
    print(match.group(2))
    print(match.group(3))
Also, the digits in the phone numbers are separated by "-", but sometimes there is a space between the "-" and the next set of digits. How can I allow for this space in the pattern, given that it is only occasionally present?

Your logic is good, but it needs a little work.
First of all, you need the phone number. Write a regex for it and put it in a group: (regex)*. The group is marked with ( ) and * means that it may be present zero or more times.
Write the next regex and add it to another group, (emailRegex)*, and then the third group, (website)*.
Instead of * you could also use ?, which means once or not at all (as far as I can see, you already used ?).
Now, putting it all together, simply join them with any characters in between:
(group1)?.*(emailRegex)?.*(website)*
group1 matches the phone number, followed by any characters, then the email, followed by any characters, then the website. If one of them is missing, there is no problem at all.
Email regex example (probably not the most complete one):
([a-zA-Z_]+[a-zA-Z_.0-9-]*@[a-zA-Z0-9]+\.[a-z]+)?
It works like this: the email should start with a letter or an underscore _, followed by any mix of lower/upper case letters, digits, underscores, dots (.) or hyphens, followed by @ and letters or digits, followed by a dot (notice that I used \. to escape the special "any character" meaning of the dot), and at the end at least one letter.
This works for email@mail.com.
Putting the entire regex in parentheses makes it a group, and the ? after it means the group should appear once or not at all. Between groups, you add .* meaning that there can be any characters between the phone number, the email and the website.
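A different way to get the "each piece is optional" behaviour is to isolate each venue block first and then search for every field separately inside it, so that a missing field simply comes back as None. This is only a sketch under assumptions: the sample markup, the helper first_group and the field patterns are hypothetical, since the real page source is not shown in the question. The phone pattern uses "- ?" to allow the occasional space after a dash.

```python
import re

# Hypothetical sample; the real page markup is not shown in the question.
text = '''
<div class="col-right">Tel: 06-1234567 <a href="http://example.com">site</a>
<a href="mailto:info@example.com">mail</a></div>
<div class="col-right">Tel: 020- 7654321</div>
'''

# One venue block: from "col-right" to the first closing </div> (assumed layout).
block_re = re.compile(r'col-right[\s\S]*?</div>')
# Digit groups separated by "-", with an optional space after each dash.
phone_re = re.compile(r'Tel\D*?(\d+(?:- ?\d+)*)')
site_re = re.compile(r'href="http://([^"]*)"')
mail_re = re.compile(r'href="mailto:([^"]*)"')


def first_group(pattern, chunk):
    """Return group 1 of the first match in chunk, or None if there is none."""
    m = pattern.search(chunk)
    return m.group(1) if m else None


venues = []
for block in block_re.finditer(text):
    chunk = block.group(0)
    venues.append((first_group(phone_re, chunk),
                   first_group(site_re, chunk),
                   first_group(mail_re, chunk)))

for phone, site, mail in venues:
    print(phone, site, mail)
```

Searching per field avoids the situation where a greedy .* between optional groups swallows the text and leaves the groups empty.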

Related

How to check if the whole input string (real numbers separated by a space) matches a regex in Python?

I have an input string consisting of a sequence of real numbers separated by a single space. It is also acceptable for the string to contain only one real number (no spaces). My goal is to check whether the string structure matches the following (in this order):
optional (0/1): minus (-)
1/more digits
optional (1+): a period and 1/more digits
optional (0+): a group consisting of a space and the first group (the first three bullet points)
It should describe the string completely. If not, it should print an error message and exit.
My current regular expression is ^(-?\d+(\.?\d)*)( \1)*$ which I thought would be okay, but even the first group doesn't match all the real numbers individually. And I need it to check the string from the beginning to the end, including the spaces.
My code for this function looks like this:
import re

def structure_check(string):
    structure = r"^(-?\d+(\.?\d)*)( \1)*$"
    if re.match(structure, string):
        return "OK"
    else:
        print("Input error")
        exit()
It should accept strings like: 15 35 -45 8 -2.3 4564.18 56 etc., but it doesn't respond to changes in the input at all (it never matches). It shouldn't match if there are too many spaces, an incorrectly placed . or -, or any characters other than digits, periods, dashes (-) and spaces.
I could also do this with just the first group while iterating over a list created by splitting the input string on spaces, but I would prefer to check it according to my main goal: that way I wouldn't have to split the input in the validation function, and I would save some lines of code by checking the input all together (e.g. for excess spaces or unsupported characters, which I would otherwise have to check separately).
Sorry if I missed any answered questions, I couldn't find any appropriate for my problem in Python. If you know about any, feel free to link them, please. And thank you, I am a beginner and started learning regex for a project just about yesterday.
You can use:
^((?:[+-]?\d+(?:[.]\d+)?)(?:[ \t]|$))*$
I added + to the optional sign. If you only want to match with no sign or -, just remove that from the optional character class.
You could also use an unrolled version to prevent matching a space at the end.
^-?\d+(?:\.\d+)?(?: -?\d+(?:\.\d+)?)*$
The backreference \1 will match exactly what is matched in group 1 and for your pattern will match for example 123 123 123
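A small sketch of the difference, using the asker's pattern and the unrolled pattern above. The names BACKREF and UNROLLED are made up for this illustration, and structure_check here returns a boolean instead of printing and exiting, which is a simplification rather than the asker's exact function.

```python
import re

# The asker's pattern: \1 repeats the exact text captured by group 1.
BACKREF = re.compile(r"^(-?\d+(\.?\d)*)( \1)*$")
# The unrolled pattern: the number sub-pattern is written out a second time.
UNROLLED = re.compile(r"^-?\d+(?:\.\d+)?(?: -?\d+(?:\.\d+)?)*$")


def structure_check(string):
    """Return True if the whole string is a space-separated list of numbers."""
    return UNROLLED.fullmatch(string) is not None


print(bool(BACKREF.match("15 15 15")))                   # True: every item repeats group 1
print(bool(BACKREF.match("15 35 -45")))                  # False: 35 is not the text of group 1
print(structure_check("15 35 -45 8 -2.3 4564.18 56"))    # True
print(structure_check("15  35"))                         # False: two spaces
print(structure_check("1.2.3"))                          # False: two decimal points
```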
If you want to repeat the group itself rather than its matched text, you could recurse into the first group with (?1), using the regex module from PyPI:
^(-?\d+(?:\.\d+)?)(?: (?1))*$
The problem is in your regexp, specifically in the ( \1)* part.
Spelled out, this means: a space followed by the exact string that was matched by group 1, zero or more times.
Thus, your regexp will match for the following, for example:
15 15 15
-5.3 -5.3 -5.3 -5.3
And so on.
To fix the regexp, I would replace the group reference with the actual group, like so:
^(-?\d+(\.?\d)*)( -?\d+(\.?\d)*)*$
I would also point out that this regexp allows numbers with multiple decimal points (e.g. 1.2.3 passes); however, I'm not sure whether that's intended.
In JavaScript you can use the regex method .test. The same regex should work in Python.
let ok = /^(([+\-]?\d+(\.\d+)?)( |$))+$/.test("15 35 -45 8 -2.3 4564.18 56");
console.log(ok);
Explanation: with (\.\d+)? you make the whole fractional group optional. Each number can be followed by a space or by the end of the string, ( |$). The pattern is repeated throughout the string, so I wrapped the entire expression in a group. Put ^ at the beginning of the regex and $ at the end to force it to check the string completely.

How to make regex that matches a number with commas for every three digits?

I am a beginner in Python and in regular expressions, and I am trying to work through an exercise that goes like this:
How would you write a regex that matches a number with commas for
every three digits? It must match the following:
'42'
'1,234'
'6,368,745'
but not the following:
'12,34,567' (which has only two digits between the commas)
'1234' (which lacks commas)
I thought it would be easy, but I've already spent several hours on it and still don't have the right answer. Even the answer given in the book for this exercise doesn't work at all (the pattern in the book is ^\d{1,3}(,\d{3})*$).
Thank you in advance!
The answer in your book seems correct to me. It also works on the test cases you have given.
(^\d{1,3}(,\d{3})*$)
The '^' symbol anchors the match at the start of the line. \d{1,3} says that there should be at least one digit but not more than three, so
1234,123
will not match.
(,\d{3})*$
This expression says that, up to the end of the line, there may be any number of groups consisting of one comma followed by three digits.
Maybe the answer you are looking for is this:
(^\d+(,\d{3})*$)
It matches a number with commas for every three digits without limiting the leading group to three digits. Note that it therefore also accepts '1234', which the exercise says should not match.
You can go with this (which is a slightly improved version of what the book specifies):
^\d{1,3}(?:,\d{3})*$
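As a quick check, the pattern can be run against the cases listed in the exercise. This is a minimal sketch; the names BOOK_PATTERN and is_comma_number are invented for the illustration.

```python
import re

# The book's pattern with a non-capturing group, as suggested above.
BOOK_PATTERN = re.compile(r"^\d{1,3}(?:,\d{3})*$")

should_match = ["42", "1,234", "6,368,745"]
should_not_match = ["12,34,567", "1234"]


def is_comma_number(s):
    """Return True if s is a number with a comma before every three digits."""
    return BOOK_PATTERN.match(s) is not None


for s in should_match + should_not_match:
    print(repr(s), is_comma_number(s))
```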
I got it to work by putting the stuff between the caret and the dollar in parentheses, like so: re.compile(r'^(\d{1,3}(,\d{3})*)$')
but I find this regex pretty useless, because you can't use it to find these numbers in a document: the whole string has to consist of exactly this pattern from beginning to end.
# This program validates input against the regular expression for this scenario.
# Any properly formatted number (with commas) will match.
# Parsing through a document for this regex is beyond my capability at this time.
import re

print('Type a number with commas')
sentence = input()
pattern = re.compile(r'\d{1,3}(,\d{3})*')
matches = pattern.match(sentence)
# match() returns None when nothing matches at the start, so check that first.
if matches is None or matches.group(0) != sentence:
    # The input value does NOT match the pattern.
    print('Does Not Match the Regular Expression!')
else:
    # The whole input matched, so state verification.
    print(matches.group(0) + ' matches the pattern.')
The simple answer is:
^\d{1,2}(,\d{3})*$
^\d{1,2} - the string should start with a digit, and this matches 1 or 2 digits.
(,\d{3})*$ - once a ',' is passed, it requires 3 digits.
It works for all the scenarios listed in the book, although it rejects a three-digit leading group such as '123,456'.
Test your scenarios on https://pythex.org/
I also went down the rabbit hole trying to write a regex that solves the question in the book. The question does not assume that each line is such a number: there might be multiple such numbers in the same line, and there might be some kind of quotation marks around each number (as in the question text). The solution provided in the book, on the other hand, does make those assumptions: (^\d{1,3}(,\d{3})*$)
I tried to use the question text as input and ended up with the following pattern, which is way too complicated:
r'''(
(?:(?<=\s)|(?<=[\'"])|(?<=^))
\d{1,3}
(?:,\d{3})*
(?:(?=\s)|(?=[\'"])|(?=$))
)'''
(?:(?<=\s)|(?<=[\'"])|(?<=^)) is a non-capturing group that allows
the number to start after \s characters, ', ", or the start of the text.
(?:,\d{3})* is a non-capturing group to avoid capturing, for example, 123 in 12,123.
(?:(?=\s)|(?=[\'"])|(?=$)) is a non-capturing group that allows
the number to end before \s characters, ', ", or the end of the text (no newline case).
Obviously you could extend the list of allowed characters around the number.
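For illustration, here is how that pattern might be applied with re.VERBOSE so the line breaks inside the triple-quoted string are ignored. The sample sentence is a made-up paraphrase of the exercise text, and the name NUMBER_IN_TEXT is invented for this sketch.

```python
import re

# The pattern above, compiled in verbose mode so whitespace in it is ignored.
NUMBER_IN_TEXT = re.compile(r'''(
    (?:(?<=\s)|(?<=[\'"])|(?<=^))   # preceded by whitespace, a quote, or the start
    \d{1,3}                         # leading group of one to three digits
    (?:,\d{3})*                     # any number of ",ddd" groups
    (?:(?=\s)|(?=[\'"])|(?=$))      # followed by whitespace, a quote, or the end
    )''', re.VERBOSE)

# Hypothetical sample based on the wording of the exercise.
sample = "It must match '42', '1,234' and '6,368,745' but not '12,34,567' or '1234'."

found = NUMBER_IN_TEXT.findall(sample)
print(found)
```

Because the pattern contains a single capturing group, findall returns a plain list of the matched numbers.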

Regex make a match must end with a number

I'm trying to make a regex that captures a certain word, which can have any number of spaces after it, followed directly by a number.
Example
Created 100 images
I tried
/(created\s*\d*)/
which matches what I want but also matches
'Created '
^No digits are after the word.
I want it to require a digit as well.
Change \d* to \d+. * means zero or more, where + means one or more.
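A short Python sketch of the difference. The question uses a JavaScript-style literal, so the translation to re.compile and the re.IGNORECASE flag (needed for the lowercase pattern to match "Created") are assumptions of this example.

```python
import re

# \d* allows zero digits, \d+ requires at least one.
zero_or_more = re.compile(r"(created\s*\d*)", re.IGNORECASE)
one_or_more = re.compile(r"(created\s*\d+)", re.IGNORECASE)

with_number = "Created 100 images"
without_number = "Created "

print(zero_or_more.search(without_number))         # a match object: digits are optional
print(one_or_more.search(without_number))          # None: a digit is required
print(one_or_more.search(with_number).group(1))    # Created 100
```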

Python regex: Matching a URL

I have some confusion regarding the pattern matching in the following expression. I tried to look up online but couldn't find an understandable solution:
imgurUrlPattern = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
What exactly are the parentheses doing? I understood everything up to the first asterisk, but I can't figure out what is happening after that.
Regular expressions can be represented as graphs to understand their operation. A parallel connection between nodes indicates that a part is optional, a serial connection indicates that it is mandatory, and a loop indicates repetition over the same node.
(http://i.imgur.com/(.*))(\?.*)?
So this starts with an imgur URL http://i.imgur.com/(.*) (mandatory) containing any characters until an optional '?' is encountered, followed by any characters after the '?'. Notice that '?' has been escaped so that it loses its usual meaning.
(http://i.imgur.com/(.*))(\?.*)?
The first capturing group (http://i.imgur.com/(.*)) means that the string should start with http://i.imgur.com/ followed by any number of characters (.*) (this is a poor regex, you shouldn't do it this way). (.*) is also the second capturing group.
The third capturing group (\?.*) means that this part of the string must start with ? and then contain any number of any characters, as above.
The last ? means that the last capturing group is optional.
EDIT:
These groups can then be used as:
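import re
p = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
m = p.match('http://i.imgur.com/abc.jpg')  # hypothetical URL; match() returns None for a string like 'ab'
m.group(0)  # the whole match
m.group(2)  # the part captured by (.*)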
To improve the regex, you should restrict the engine to the characters you need, like:
(http://i.imgur.com/([A-Za-z0-9\-]+))(\?[^/]+)?
[A-Za-z0-9\-]+ limits the match to alphanumeric characters and hyphens
[^/] excludes /
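To see what each group actually captures, here is a small sketch. The URL is a hypothetical example, and the second pattern, which swaps (.*) for ([^?]*) and escapes the dots, is a variation introduced for this illustration rather than the pattern from the answer. With the original pattern the greedy (.*) also consumes the query string, so the optional third group stays empty.

```python
import re

# The pattern from the question.
original = re.compile(r'(http://i.imgur.com/(.*))(\?.*)?')
# Variation (an assumption of this sketch): stop the second group at the first '?'.
restricted = re.compile(r'(http://i\.imgur\.com/([^?]*))(\?.*)?')

url = 'http://i.imgur.com/abc.jpg?1'  # hypothetical URL

m1 = original.match(url)
m2 = restricted.match(url)

print(m1.groups())  # ('http://i.imgur.com/abc.jpg?1', 'abc.jpg?1', None)
print(m2.groups())  # ('http://i.imgur.com/abc.jpg', 'abc.jpg', '?1')
```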
The (.*) means any character repeated any number of times; the (\?.*)? is meant to match the query string of a URL, as in this example (an imgur search for "cat"):
http://imgur.com/search?q=cat
This URL only illustrates the shape, since its host is imgur.com rather than i.imgur.com. The http://imgur.com/search part corresponds to (http://i.imgur.com/(.*)), with search corresponding to the (.*) section, and ?q=cat corresponds to the (\?.*)? part. The ? at the end of the regex means optional, so there might or might not be a query string. There is no query string in the URL http://www.imgur.com. The parentheses are used for grouping. We want to group (http://i.imgur.com/(.*)) as one thing because it matches the URL, and there is another group within it that matches the page you are requesting (this is (.*)). We want to group (\?.*)? because it matches the query string.

How to match emails with specific rules

How do I achieve the following with a regex:
Match if string doesn't start with a certain character
Match if there are no two ","'s or any other characters
Match if the string has double ", even if they are not adjacent
Using Python.
Currently I am attempting to match email addresses with these rules included. The current pattern I have is
pattern = '^([A-Z0-9._-\"]|\"[!\,;]\"){1-127}+@[^-][A-Z0-9.-]{3-256}+\.[A-Z]{2,4}[^-]$'
But I am confused with how to implement these rules.
Being more specific:
I want a pattern that matches an email address consisting of 2 parts (name, domain).
The name part should be no longer than 128 characters and should come before the @. It should consist of a-z0-9 characters and also ., _, -, ". The name can't have two adjacent dots.
If the name has a " then it should be paired with another ". The name can contain the characters !;, if they are between paired ".
The domain name should be no longer than 256 and no shorter than 3 characters, and its parts should be separated by a dot. The domain name can't begin or end with -.
This information is given to help you understand what I want, the main question is about three rules I stated in the top. I will gladly appreciate it if you tell me how to achieve them.
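The three rules at the top could be sketched with lookaheads and a paired-quote pattern, as below. This is only an illustration under assumptions: the "certain character" is taken to be -, the second rule is read as "no two adjacent dots", and the names RULES and obeys_rules are invented here. It does not validate a full email address.

```python
import re

RULES = re.compile(
    r'^(?!-)'                    # rule 1: must not start with "-"
    r'(?!.*\.\.)'                # rule 2: no two adjacent dots anywhere
    r'(?:[^"]*"[^"]*")*[^"]*$'   # rule 3: double quotes only occur in pairs
)


def obeys_rules(s):
    """Return True if s satisfies all three rules sketched above."""
    return RULES.match(s) is not None


for candidate in ['john.doe', '-john', 'john..doe', 'jo"hn', 'jo"h;n"']:
    print(repr(candidate), obeys_rules(candidate))
```

Each rule is a separate piece of the pattern, so one of them can be dropped or changed without touching the others.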
I am confused about your question. Your title says comma separated list, but then you talk about email addresses. There is a widely circulated regex for emails that is usually described as following RFC 5322:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
