I am trying to match Australian phone numbers. The numbers can start with 0, +61, or 61, followed by 2, 3, 4, 5, 7, or 8, and then by an 8-digit number.
txt = "My phone number is 0412345677 or +61412345677 or 61412345677"
find_ph = re.find_all(r'(0|\+61|61)[234578]\d{8}', text)
find_ph
returns
['0', '61']
But I want it to return
['0412345677', '+61412345677' or '61412345677']
Can you please point me in the right direction?
>>> pattern = r'((?:0|\+61|61)[234578]\d{8})'
>>> find_ph = re.findall(pattern, txt)
>>> print(find_ph)
['0412345677', '+61412345677', '61412345677']
The problem was that the parentheses around just the prefix formed a capturing group. When a pattern contains a capturing group, findall returns only the text captured by that group, even though the rest of the pattern still has to match. (Incidentally, it's findall, not find_all, and your string was in the variable txt, not text.)
Instead, make the prefix a non-capturing group with (?:0|\+61|61). Now the outer group captures the whole of the string that matches the entire pattern.
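To make the difference concrete, here is a minimal sketch that reuses the txt string from the question and runs findall once with a capturing prefix group and once with a non-capturing one. The variable names captured and whole are only illustrative.

```python
import re

txt = "My phone number is 0412345677 or +61412345677 or 61412345677"

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r'(0|\+61|61)[234578]\d{8}', txt)

# With a non-capturing group (?:...), findall returns each whole match.
whole = re.findall(r'(?:0|\+61|61)[234578]\d{8}', txt)

print(captured)  # only the prefixes
print(whole)     # the full phone numbers
```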
You can use a non-capturing group:
Regex Demo
import re
re.findall(r"(?:0|\+61|61)\d+", txt)
['0412345677', '+61412345677', '61412345677']
One Solution
re.findall(r'(?:0|61|\+61)[2345678]\d{8}', txt)
# ['0412345677', '+61412345677', '61412345677']
Explanation
(?:0|61|\+61) Non-capturing group for 0, 61 or +61
(?:0|61|\+61)[2345678] followed by one digit other than 0, 1, or 9 (this set also admits 6; use [234578] to follow the question exactly)
\d{8} followed by 8 digits
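The same breakdown can be kept next to the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows comments. This is only a sketch: the name AU_PHONE is made up for illustration, and the digit set is the question's [234578] rather than the [2345678] used in the answer above.

```python
import re

# Hypothetical name; the digit set follows the question (no 6).
AU_PHONE = re.compile(r"""
    (?:0|61|\+61)   # non-capturing prefix: 0, 61 or +61
    [234578]        # one digit from the allowed set
    \d{8}           # followed by 8 more digits
""", re.VERBOSE)

txt = "My phone number is 0412345677 or +61412345677 or 61412345677"
numbers = AU_PHONE.findall(txt)
print(numbers)
```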
Related
I'm trying to create a regex pattern to match account ids following certain rules. This matching will occur within a python script using the re library, but I believe the question is mostly just a regex in general issue.
The account ids adhere to the following rules:
Must be exactly 6 characters long
The letters and numbers do not have to be unique
AND
3 uppercase letters followed by 3 numbers
OR
Up to 6 numbers followed by an amount of letters that bring the length of the id to 6
So, the following would be 'valid' account ids:
ABC123
123456
12345A
1234AB
123ABC
12ABCD
1ABCDE
AAA111
And the following would be 'invalid' account ids
ABCDEF
ABCDE1
ABCD12
AB1234
A12345
ABCDEFG
1234567
1
12
123
1234
12345
I can match the 3 letters followed by 3 numbers very simply, but I'm having trouble understanding how to write a regex to varyingly match an amount of letters such that if x = number of numbers in string, then y = number of letters in string = 6 - x.
I suspect that using lookaheads might help solve this problem, but I'm still new to regex and don't have an amazing grasp on how to use them correctly.
I have the following regex right now, which uses positive lookaheads to check if the string starts with a number or letter, and applies different matching rules accordingly:
((?=^[0-9])[0-9]{1,6}[A-Z]{0,5}$)|((?=^[A-Z])[A-Z]{3}[0-9]{3}$)
This works to match the 'valid' account ids listed above, however it also matches the following strings which should be invalid:
1
12
123
1234
12345
How can I change the first capturing group ((?=^[0-9])[0-9]{1,6}[A-Z]{0,5}$) to know how many letters to match based on how many numbers begin the string, if that's possible?
You could write the pattern as:
^(?=[A-Z\d]{6}$)(?:[A-Z]{3}\d{3}|\d+[A-Z]*)$
Explanation
^ Start of string
(?=[A-Z\d]{6}$) Positive lookahead, assert 6 chars A-Z or digits till the end of the string
(?: Non capture group for the alternatives
[A-Z]{3}\d{3} Match 3 chars A-Z and 3 digits
| Or
\d+[A-Z]* Match 1+ digits and optional chars A-Z
) Close the non capture group
$ End of string
Regex demo
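As a quick check, the pattern can be run against the valid and invalid ids listed in the question. This is a small sketch; ACCOUNT_ID, matched_valid and matched_invalid are names chosen here for illustration.

```python
import re

ACCOUNT_ID = re.compile(r'^(?=[A-Z\d]{6}$)(?:[A-Z]{3}\d{3}|\d+[A-Z]*)$')

valid = ["ABC123", "123456", "12345A", "1234AB",
         "123ABC", "12ABCD", "1ABCDE", "AAA111"]
invalid = ["ABCDEF", "ABCDE1", "ABCD12", "AB1234", "A12345",
           "ABCDEFG", "1234567", "1", "12", "123", "1234", "12345"]

# The lookahead enforces the total length of 6; the alternation
# enforces the layout (3 letters + 3 digits, or digits then letters).
matched_valid = [s for s in valid if ACCOUNT_ID.match(s)]
matched_invalid = [s for s in invalid if ACCOUNT_ID.match(s)]

print(matched_valid)
print(matched_invalid)
```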
I am unsure how to modify your regex to ensure that the overall username length is 6 characters. However, it would be extremely easy to check that in python.
import re

def check_username(name):
    # Check the overall length first, then the character layout.
    if len(name) == 6:
        if re.search(r"((?=^[0-9])[0-9]{1,6}[A-Z]{0,5}$)|((?=^[A-Z])[A-Z]{3}[0-9]{3}$)", name) is not None:
            return True
    return False
Hopefully this is helpful to you!
I'm trying to get the first number (int and float) after a specific pattern:
strings = ["Building 38 House 10",
"Building : 10.5 house 900"]
for x in strings:
    print(<rule>)
Wanted result:
'38'
'10.5'
I tried:
for x in strings:
    print(re.findall(r"(?<=Building).+\d+", x))
    print(re.findall(r"(?<=Building).+(\d+.?\d+)", x))
[' 38 House 10']
['10']
[' : 10.5 house 900']
['00']
But I'm missing something.
You could use a capture group:
\bBuilding[\s:]+(\d+(?:\.\d+)?)\b
Explanation
\bBuilding Match the word Building
[\s:]+ Match 1+ whitespace chars or colons
(\d+(?:\.\d+)?) Capture group 1, match 1+ digits with an optional decimal part
\b A word boundary
Regex demo
import re
strings = ["Building 38 House 10",
"Building : 10.5 house 900"]
pattern = r"\bBuilding[\s:]+(\d+(?:\.\d+)?)"
for x in strings:
    m = re.search(pattern, x)
    if m:
        print(m.group(1))
Output
38
10.5
An idea to use \D (negated \d) to match any non-digits in between and capture the number:
Building\D*\b([\d.]+)
See this demo at regex101 or Python demo at tio.run
Just to mention, use word boundaries \b around Building to match the full word.
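A minimal sketch of this \D idea in Python, with the word boundaries around Building added as suggested. Note that [\d.]+ is loose: it would also accept something like 1.2.3 or a trailing dot, so this assumes the numbers in the input are well formed.

```python
import re

strings = ["Building 38 House 10",
           "Building : 10.5 house 900"]

# \D* skips any non-digits between the word and the number.
pattern = re.compile(r"\bBuilding\b\D*\b([\d.]+)")

results = []
for x in strings:
    m = pattern.search(x)
    results.append(m.group(1) if m else None)

print(results)
```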
re.findall(r"(?<![a-zA-Z:])[-+]?\d*\.?\d+", x)
This will find all numbers in the given string.
If you want the first number only you can access it simply through indexing:
re.findall(r"(?<![a-zA-Z:])[-+]?\d*\.?\d+", x)[0]
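Indexing with [0] raises an IndexError when the string contains no number at all. A hedged alternative, using the same pattern with re.search, is sketched below; first_number is a hypothetical helper name, and note that this pattern does not look for the word Building at all.

```python
import re

NUMBER = re.compile(r"(?<![a-zA-Z:])[-+]?\d*\.?\d+")

def first_number(text):
    """Return the first number in text as a string, or None if there is none."""
    m = NUMBER.search(text)
    return m.group() if m else None

print(first_number("Building 38 House 10"))
print(first_number("Building : 10.5 house 900"))
print(first_number("no digits here"))
```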
I am having trouble matching a pattern of this format: p#.g.com where # is not a 1 or a 2. For instance, if the pattern is p1.g.com, I don't need to match. If it is p2.g.com, I don't need to match.
But if it is any other number, such as p3.g.com or p29.g.com, then I need to match.
My current pattern is r"(?P<url>p([^1,2])\.g\.com)", but this fails if the pattern is p##.g.com, basically any two digit number it fails on. There is no upper limit on the #, so it could be a 3 or 999 or anything in between.
I also tried r"(?P<url>p([^1,2])\d+\.g\.com)" but that does not match any number beginning with a 1 or a 2. For instance 11 or 23 are not matched, which I do want matched.
Try this regex:
p(?:[03-9]|\d{2,})\.g\.com
Demo
Explanation:
Matches character p
Start of non-capturing group
Match one of:
The digits 0 or 3-9
Any number with two or more digits (10 or higher)
Matches character .g.com
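A short sketch that checks this pattern against the hosts mentioned in the question. It wraps the pattern in the url named group from the question and uses fullmatch so that nothing may precede or follow the host; both choices are assumptions made for the example. One caveat: \d{2,} also accepts leading zeros, so p01.g.com would match.

```python
import re

URL = re.compile(r"(?P<url>p(?:[03-9]|\d{2,})\.g\.com)")

hosts = ["p1.g.com", "p2.g.com", "p3.g.com", "p11.g.com",
         "p23.g.com", "p29.g.com", "p999.g.com"]

# Keep only the hosts whose number is not exactly 1 or 2.
matched = [h for h in hosts if URL.fullmatch(h)]
print(matched)
```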
I have the following string, while the first letters can differ and can also be sometimes two, sometimes three or four.
PR191030.213101.ABD
I want to extract the 191030 and convert that to a valid date.
My current attempt is:
filename_without_ending.split(".")[0][-6:]
This solution is not valid, since the format might differ from time to time. Sometimes the name looks like this:
PZA191030_392001_USB
The only REAL pattern is really the first six numbers.
How do I do this?
Thank you!
You could get the first 6 digits using a pattern and a capturing group
^[A-Z]{2,4}(\d{6})\.
^ Start of string
[A-Z]{2,4} Match 2, 3 or 4 uppercase chars
( Capture group 1
\d{6} Match 6 digits
)\. Close group and match trailing dot
Regex demo | Python demo
For example
import re
regex = r"^[A-Z]{2,4}(\d{6})\."
test_str = "PR191030.213101.ABD"
matches = re.search(regex, test_str)
if matches:
    print(matches.group(1))
Output
191030
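The question also asks to turn the digits into a date. The sketch below assumes the six digits are laid out as YYMMDD, which the question does not state, and drops the trailing \. from the pattern so that names like PZA191030_392001_USB are handled too. extract_date and PREFIX_DATE are hypothetical names.

```python
import re
from datetime import datetime

# Same idea as above, minus the trailing \. so that an underscore
# after the digits is accepted as well.
PREFIX_DATE = re.compile(r"^[A-Z]{2,4}(\d{6})")

def extract_date(filename):
    """Return the leading 6-digit stamp as a date, or None if absent.

    Assumes the digits are YYMMDD; strptime raises ValueError
    if they do not form a real calendar date.
    """
    m = PREFIX_DATE.match(filename)
    if m is None:
        return None
    return datetime.strptime(m.group(1), "%y%m%d").date()

print(extract_date("PR191030.213101.ABD"))
print(extract_date("PZA191030_392001_USB"))
```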
You can do:
a = 'PR191030.213101.ABD'
int(''.join([c for c in a if c.isdigit()][:6]))
Output:
191030
This can also be done by:
filename_without_ending.split(".")[0][2:]
This slices the first dot-separated part from the 3rd character to the end, so it only works when the prefix is exactly two letters long.
Since the leading letters can differ, we have to ignore the letters and extract the digits.
Using the re module (for regular expressions), apply a regex pattern to the string; it returns the part of the string that matches.
\d matches a digit [0-9], and the + operator matches one or more of them.
findall() finds all occurrences of the pattern in the given string, while search() finds only the first occurrence.
import re
# Named "name" rather than "str" to avoid shadowing the built-in type.
name = "PR191030.213101.ABD"
print(re.findall(r"\d+", name)[0])
print(re.search(r"\d+", name).group())
If I had a sentence that has an age and a time :
import re
text = "I am 21 and work at 3:30"
answer= re.findall(r'\b\d{2}\b', text)
print(answer)
The issue is that it gives me not only the 21, but 30 (since it looks for 2 digits). How do I avoid this so it will only count the numbers and not the non-alphanumeric characters that leads to the issue? I tried to use [0-99] instead of the {} braces but that didn't seem to help.
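Using \s\d{2}\s will give you only 2-digit combinations with whitespace around them (before and after). Note that each match then includes that whitespace, and a number at the very start or end of the string is missed.
Or, if you want to match without requiring trailing whitespace: \s\d{2}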
That's because : is a non-word character, so \b finds a word boundary between : and 3. In regex terms, a word for \b is a run of \w characters (\w+).
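A small sketch showing this behaviour on the sentence from the question (with_boundaries is just an illustrative name):

```python
import re

text = "I am 21 and work at 3:30"

# ':' is not a word character, so there is a word boundary
# between ':' and '3', and the "30" of the time is matched too.
with_boundaries = re.findall(r'\b\d{2}\b', text)
print(with_boundaries)
```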
You can check for digits with space or start/end of input line around:
(?:^|\s)(\d{2})(?:\s|$)
Example:
In [85]: text = "I am 21 and work at 3:30"
...: re.findall(r'(?:^|\s)(\d{2})(?:\s|$)', text)
Out[85]: ['21']
You can use a negative lookbehind (?<!) and a negative lookahead (?!) to isolate and capture only 2 (two) digits.
Regex: (?<!\S)\d{2}(?!\S)
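As a rough sketch, this lookaround pattern can be tried on the sentence from the question plus a few edge cases. Because lookarounds consume nothing, the matches contain only the digits. TWO_DIGITS and the extra sample strings are assumptions for the example.

```python
import re

# (?<!\S): not preceded by a non-whitespace character
# (?!\S):  not followed by a non-whitespace character
TWO_DIGITS = re.compile(r'(?<!\S)\d{2}(?!\S)')

text = "I am 21 and work at 3:30"
ages = TWO_DIGITS.findall(text)
print(ages)

# A number at the very start or end of the string is matched as well.
print(TWO_DIGITS.findall("21 am I"))
print(TWO_DIGITS.findall("I am 21"))
print(TWO_DIGITS.findall("12345"))
```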
You can use the following regex:
^\d{2}$|(?<=\s)\d{2}(?=\s)|(?<=\s)\d{2}$|^\d{2}(?=\s)
that matches the 21 in each of the following strings that contains one (it matches nothing in abc 12:23 or 12345):
I am 21 and work at 3:30
21
abc 12:23
12345
I am 21
21 am I
demo: https://regex101.com/r/gP1KSf/1
Explanations:
^\d{2}$ match 2 digits only string or
(?<=\s)\d{2}(?=\s) 2 digits surrounded by space class char or
(?<=\s)\d{2}$ 2 digits at the end of the string and preceded by a space class char or
^\d{2}(?=\s) 2 digits at the beginning of the string and followed by a space class char
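The four alternatives can be checked against the sample strings listed above with a short loop. The sketch also compares the results with the shorter pattern (?<!\S)\d{2}(?!\S) from the other answer; the names FOUR_WAY, SHORT and results are illustrative.

```python
import re

FOUR_WAY = re.compile(
    r'^\d{2}$|(?<=\s)\d{2}(?=\s)|(?<=\s)\d{2}$|^\d{2}(?=\s)')
SHORT = re.compile(r'(?<!\S)\d{2}(?!\S)')

samples = ["I am 21 and work at 3:30", "21", "abc 12:23",
           "12345", "I am 21", "21 am I"]

# Map each sample to the list of 2-digit numbers found in it.
results = {s: FOUR_WAY.findall(s) for s in samples}
for s, found in results.items():
    print(repr(s), found)

# On these samples both patterns find the same numbers.
same = all(FOUR_WAY.findall(s) == SHORT.findall(s) for s in samples)
print(same)
```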