Dive into python gives an amazing little tutorial on creating a regular expression for phone numbers: http://diveintopython3.ep.io/regular-expressions.html#phonenumbers
The final version comes out to look like:
phone_re = re.compile(r'(\d{3})\D*(\d{3})\D*(\d{4})\D*(\d*)$', re.VERBOSE)
This works fine for almost all examples I can come up with, however I found a pretty big failure that I can't seem to fix.
If a group of 3 digits comes before the phone number it works fine. IE:
"500 dollars off, call 123-456-7891"
If a group of 3 digits comes after the phone number it fails. IE:
"Call 123-456-7891 for a discount of up to 500"
Any ideas on a fix that would work for both examples?
The (\d*)$ requires that the string you're matching against end with digit characters (the $ signifies "end of line"). Try removing the $ if you're matching against a larger string where the phone number may not be at the end of the line.
Here's your original, with some spaces (use re.VERBOSE, or remove the spaces):
(\d{3}) \D* (\d{3}) \D* (\d{4}) \D* (\d*)
The \D* will match anything that's not a digit, including words. Maybe you should try this:
(\d{3}) \W* (\d{3}) \W* (\d{4}) \W* (\d*)
The \W* matches anything that's not a word. It will match (222) - 222 - 2222. However, it will not match if there is a letter between the numbers, as in (222) x 222 - 2222. The last part of the match (\d*) appears to be looking for an extension. These can be formatted in a variety of ways—I suggest you either drop it or refine it based on how you expect your data to look. And, like Amber says, you should probably drop the $.
Related
I need help to complete a regex pattern. I need a pattern to match a range of numbers including unit.
Examples:
The car drives 50,5 - 80 km/10min on the road.
The car drives 50,5 - 80 km / 10min on the road.
The car drives 40,5-80 km/h on the road.
The car drives 30-50 km/h on the road.
The car drives 40 - 60.8 km/ h on the road.
The car drives 40.90-60,8 km/h on the road.
I need to match the entire ranges. Good would also be (?:km/10min|km / 10min|km/h|km/ h) to simplify this part so that this does not have to be listed multiple times. So also here the blanks taken into account.
([,.\d]+)\s*(?:km/10min|km / 10min|km/h|km/ h)
https://regex101.com/r/Ey792V/1
Currently, unfortunately, only the first number is matched. Thanks in advance for the help.
You could make the pattern a bit more specific and optionally match whitespace chars instead of hard coding all the possible spaces variations
\b\d+(?:[.,]\d+)?(?:\s*-\s*\d+(?:[.,]\d+)?)?\s*km\s*/\s*(?:h|10min)\b
Explanation
\b A word boundary
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
(?: Non capture group
\s*-\s* Match - between optional whitespace chars
\d+(?:[.,]\d+)? Match 1+ digits with an optional decimal part
)? Close the non capture group and make it optional
\s*km\s*/\s* Match km/ surrounded with optional whitespace chars to match different variations
(?:h|10min) Match either h or 10min (Or use \d+min to match 1+ digits)
\b A word boundary
See a regex demo.
Your question is not entirely clear as you framed it in terms of examples. To be precise you need to state the question in words, then use the examples for illustration. To take one example, the question does not make clear whether
"The car drives 40,5- 80 km /h on the road."
is to be matched.
Expressing a question in words is not always easy but it is a skill that you need to acquire in order write clear code specifications. A by-product is that it makes the code easier to write, as that amounts to merely translating the words into code.
Let's give it a try.
Match a string comprised by six successive substrings:
One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
A hyphen, optionally preceded and/or followed by a space.
One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
The literal " km".
A forward slash, optionally preceded and/or followed by a space.
The literal "h" or one or more digits followed by "min", followed by a word boundary.
I cannot be sure that this is what you want but you should be able to easily modify these requirements as necessary.
Now let's translate these requirements into a regular expression.
1. One or more digits that are not preceded by a comma or period, optionally followed by a comma, hyphen or period, which, if present, is followed by one or more digits.
(?<![,.])\d+(?:[,.-]\d+)?
(?<![,.]) is a negative lookbehind. It is needed to avoid matching, for example, the indicated part of the following string.
"The car drives 1,500.5 - 80 km/10min on the road."
^^^^^^^^^^^^^^^
2. A hyphen, optionally preceded and/or followed by a space.
?- ?
(The first question mark is preceded by a space.)
3. One or more digits, optionally followed by a comma or period, the comma or period, if present, being followed by one or more digits.
\d+(?:[,.]\d+)?
4. The literal " km".
km
5. A forward slash, optionally preceded and/or followed by a space.
?\/ ?
(The first question mark is preceded by a space.)
6. The literal "h" or one or more digits followed by "min", followed by a word boundary.
(?:h|\d+min)\b
Now we can simply join these pieces to form the regular expression.
\d+(?:[,.-]\d+)? ?- ?\d+(?:[,.]\d+)?km ?\/ ?(?:h|\d+min)\b
Demo
\d.+(?:h\b|min\b|s\b)
Would also work. Demo
I'm trying to find a simple (not perfect) pattern to recognise French numbers in a French text. French numbers use comma for the Anglo-Saxon decimal, and use dot or space for the thousand separator. \u00A0 is non-breaking space, also often used in French documents for the thousand separator.
So my first attempt is:
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d', flags=re.UNICODE)
... but the trouble is that this doesn't then match a single digit.
But if I do this
number_pattern = re.compile(r'\d[\d\., \u00A0]*\d?', flags=re.UNICODE)
it then picks up trailing space (or NBS) characters (or for that matter a trailing comma or full stop).
The thing is, the pattern must both START and END with a digit, but it is possible that these may be the SAME character.
How might I achieve this? I considered a two-stage process where you try to see whether this is in fact a single-digit number... but that in itself is not trivial: if followed by a space, NBS, comma or dot, you then have to see whether the character after that, if there is one, is or is not a digit.
Obviously I'm hoping to find a solution which involves only one regex: if there is only one regex, it is then possible to do something like:
doubled_dollars_plain_text = plain_text.replace('$', '$$')
substituted_plain_text = re.sub(number_pattern, '$number', doubled_dollars_plain_text)
... having to use a two-stage process would make this much more lengthy and fiddly.
Edit
I tried to see whether I could implement ThierryLathuille's idea, so I tried:
re.compile(r'(\d(?:[\d\., \u00A0]*\d)?)', flags=re.UNICODE)
... this seems to work pretty well. Unlike JvdV's solution it doesn't attempt to check that thousand separators are followed by 3 digits, and for that matter you could have a succession of commas and spaces in the middle and it would still pass, which is quite problematic when you have a list of numbers separated by ", ". But it's acceptable for certain purposes... until something more sophisticated can be found.
I wonder if there's a way of saying "any non-digit in this pattern must be on its own" (i.e. must be bracketed between two digits)?
What about:
\d{1,3}(?:[\s.]?\d{3})*(?:,\d+)?(?!\d)
See an online demo
\d{1,3} - 1-3 digits.
(?: - Open 1st non-capture group:
[\s.]? - An optional whitespace or literal dot. Note that with unicode \s should match \p{Z} to include the non-breaking whitespace.
\d{3} - Three digits.
)* - Close 1st non-capture group and match 0+ times.
(?:,\d+)? - A 2nd optional non-capture group to match a comma followed by at least 1 digit.
(?!\d) - A negative lookahead to prevent trailing digits.
Very much inspired by JvdV's answer, I suggest this:
number_pattern = re.compile(r'(\d{1,3}(?:(?:[. \u00A0])?\d{3})*(?:,\d+)?(?!\d))', flags=re.UNICODE)
... this makes the thousand separator optional, and also makes thousand groups optional. It restricts the thousand-separator to 3 possible characters: dot, space and NBS, which is necessary for French numbers as found in practice.
PS I just found today that in fact Swiss French-speakers appear sometimes to use an apostrophe (of which there are several candidates in the vastness of Unicode) as a thousand separator.
I am often faced with patterns where the part which is interesting is delimited by a specific character, the rest does not matter. A typical example:
/dev/sda1 472437724 231650856 216764652 52% /
I would like to extract 52 (which can also be 9, or 100 - so 1 to 3 digits) by saying "match anything, then when you get to % (which is unique in that line), see before for the matches to extract".
I tried to code this as .*(\d*)%.* but the group is not matched:
.* match anything, any number of times
% ... until you get to the litteral % (the \d is also matched by .* but my understanding is that once % is matched, the regex engine will work backwards, since it now has an "anchor" on which to analyze what was before -- please tell if this reasoning is incorrect, thank you)
(\d*) ... and now before that % you had a (\d*) to match and group
.* ... and the rest does not matter (match everything)
Your regex does not work because . matches too much, and the group matches too little. The group \d* can basically match nothing because of the * quantifier, leaving everything matched by the ..
And your description of .* is somewhat incorrect. It actually matches everything until the end, and moves backwards until the thing after it ((\d*).*) matches. For more info, see here.
In fact, I think your text can be matched simply by:
(\d{1,3})%
And getting group 1.
The logic of "keep looking until you find..." is kind of baked into the regex engine, so you don't need to explicitly say .* unless you want it in the match. In this case you just want the number before the % right?
If you are just looking to extract just the number then I would use:
import re
pattern = r"\d*(?=%)"
string = "/dev/sda1 472437724 231650856 216764652 52% /"
returnedMatches = re.findall(pattern, string)
The regex expression does a positive look ahead for the special character
In your pattern this part .* matches until the end of the string. Then it backtracks giving up as least as possible till it can match 0+ times a digit and a %.
The % is matched because matching 0+ digits is ok. Then you match again .* till the end of the string. There is a capturing group, only it is empty.
What you might do is add a word boundary or a space before the digits:
.* (\d{1,3})%.* or .*\b(\d{1,3})%.*
Regex demo 1 Or regex demo 2
Note that using .* (greedy) you will get the last instance of the digits and the % sign.
If you would make it non greedy, you would match the first occurrence:
.*?(\d{1,3})%.*
Regex demo
By default regex matches as greedily as possible. The initial .* in your regex sequence is matching everything up to the %:
"/dev/sda1 472437724 231650856 216764652 52"
This is acceptable for the regex, because it just chooses to have the next pattern, (\d*), match 0 characters.
In this scenario a couple of options could work for you. I would most recommend to use the previous spaces to define a sequence which "starts with a single space, contains any number of digits in the middle, and ends with a percentage symbol":
' (\d*)%'
Try this:
.*(\b\d{1,3}(?=\%)).*
demo
I am trying to clean up text for use in a machine learning application. Basically these are specification documents that are "semi-structured" and I am trying to remove the section number that is messing with NLTK sent_tokenize() function.
Here is a sample of the text I am working with:
and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3
...
(b)
until thirty-five days after the time fixed for receiving this tender,
whichever first occurs.
2.4
AGREEMENT
Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.
I am trying to remove all the section breaks (ex. 2.3.3, 2.4, (b)), but not the date numbers.
Here is the regex I have so far: [0-9]*\.[0-9]|[0-9]\.
Unfortunately it matches part of the date in the last paragraph (2019. turns into 201) and I really dont know how to fix this being a non-expert at regex.
Thanks for any help!
You may try replacing the following pattern with empty string
((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', input)
print(output)
This pattern works by matching a section number as \d+(?:\.\d+)*, but only if it appears as the start of a line. It also matches letter section headers as \([a-z]+\).
To your specific case, I think \n[\d+\.]+|\n\(\w\) should works. The \n helps to diferentiate the section.
The pattern you tried [0-9]*\.[0-9]|[0-9]\. is not anchored and will match 0+ digits, a dot and single digit or | a single digit and a dot
It does not take the match between parenthesis into account.
Assuming that the section breaks are at the start of the string and perhaps might be preceded with spaces or tabs, you could update your pattern with the alternation to:
^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))
^ Start of string
[\t ]* Match 0+ times a space or tab
(?: Non capturing group
\d+(?:\.\d+)+ Match 1+ digits and repeat 1+ times a dot and 1+ digits to match at least a single dot to match 2.3.3 or 2.4
|
\([a-z]+\) Match 1+ times a-z between parenthesis
) Close non capturing group
Regex demo | Python demo
For example using re.MULTILINE whers s is your string:
pattern = r"^(?:\d+(?:\.\d+)+|\([a-z]+\))"
result = re.sub(pattern, "", s, 0, re.MULTILINE)
I am struggling to reject matches for words separated by newline character.
Here's the test string:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
Red
Abcd
DDDD
Rules for regex:
1) Reject a word if it's followed by comma. Therefore, we will drop Catto.
2) Only select words that begin with a capital letter. Hence, and etc. will be dropped
3) If the word is followed by a carriage return (i.e. it is the first name, then ignore it).
Here's my attempt: \b([A-Z][a-z]+)\s(?!\n)
Explanation:
\b #start at a word boundary
([A-Z][a-z]+) #start with A-Z followed by a-z
\s #Last name must be followed by a space character
(?!\n) #The word shouldn't be followed by newline char i.e. ignore first names.
There are two problems with my regex.
1) Andrew is matched as Andre. I am unsure why w is missed. I have also observed that w of Andrew is not missed if I change the bottom portion of the sample text to remove all characters including and after w of Andrew. i.e. sample text would look like:
Cardoza Fred
Catto, Philipa
Duncan, Jean
Jerry Smith
and
but
and
Andrew
The output should be:
Cardoza
Jerry
You might ask: Why should Andrew be rejected? This is because of two reasons: a) Andrew is not followed by space. b) There is no first_name "space" last_name combination.
2) The first names are getting selected using my regex. How do I ignore first names?
I researched SO. It seems there is similar thread ignoring newline character in regex match, but the answer doesn't talk about ignoring \r.
This problem is adapted from Watt's Begining Regex book. I have spent close to 1 hour on this problem without any success. Any explanation will be greatly appreciated. I am using python's re module.
Here's regex101 for reference.
Andre (and not the trailing w) is being matched in your regex because the last token is negative lookahead for \n, and just before that is an optional space. So, Andrew<end of line> fails due to being at the end of the line, so the engine backtracks to Andre, which succeeds.
Maybe the optional quantifier in \s? in your regex101 was a typo, but it would probably be easier to start from scratch. If you want to find the initial names that are followed by a space and then another name, then you can use
^[A-Z][a-z]+(?= [A-Z][a-z]+$)
with the m flag:
https://regex101.com/r/kqeMcH/5
The m flag allows for ^ to match the beginning of a line, and $ to match the end of the line - easier than messing with looking for \ns. (Without the m flag, ^ will only match the beginning of the string, while $ will similarly only match the end of the string)
That is, start with repeated alphabetical characters, then lookahead for a space and more alphabetical characters, followed by the end of the line. Using positive lookahead will be a lot easier than negative lookahead for newlines and such.
Note that literal spaces are a bit more reliable in a regex than \s, because \s matches any whitespace character, including newlines. If you're looking for literal spaces, better to use a literal space.
To use flags in Python regex, either use the flags=, or define the flags at the beginning of the pattern, eg
pattern = r'(?m)^[a-z]+(?= [A-Z][a-z]+$)'