Python regex Get first element after specific string - python

I'm trying to get the first number (int and float) after a specific pattern:
strings = ["Building 38 House 10",
           "Building : 10.5 house 900"]
for x in strings:
    print(<rule>)
Wanted result:
'38'
'10.5'
I tried:
for x in strings:
    print(re.findall(r"(?<=Building).+\d+", x))
    print(re.findall(r"(?<=Building).+(\d+.?\d+)", x))
[' 38 House 10']
['10']
[' : 10.5 house 900']
['00']
But I'm missing something.

You could use a capture group:
\bBuilding[\s:]+(\d+(?:\.\d+)?)\b
Explanation
\bBuilding Match the word Building
[\s:]+ Match 1+ whitespace chars or colons
(\d+(?:\.\d+)?) Capture group 1, match 1+ digits with an optional decimal part
\b A word boundary
Regex demo
import re
strings = ["Building 38 House 10",
           "Building : 10.5 house 900"]
pattern = r"\bBuilding[\s:]+(\d+(?:\.\d+)?)"
for x in strings:
    m = re.search(pattern, x)
    if m:
        print(m.group(1))
Output
38
10.5

An idea to use \D (negated \d) to match any non-digits in between and capture the number:
Building\D*\b([\d.]+)
See this demo at regex101 or Python demo at tio.run
Just to mention, use word boundaries \b around Building to match the full word.
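A minimal sketch of the \D idea with the word boundaries added, run on the two strings from the question. The helper name first_number_after is made up for illustration. Note that [\d.]+ is permissive and would also accept something like 1.2.3, so treat this as a starting point rather than a strict number parser.

```python
import re

# \bBuilding\b : the whole word "Building"
# \D*          : any non-digits in between (spaces, colons, ...)
# \b([\d.]+)   : capture the first run of digits and dots
PATTERN = re.compile(r"\bBuilding\b\D*\b([\d.]+)")


def first_number_after(text):
    """Return the first number after 'Building' as a string, or None."""
    m = PATTERN.search(text)
    return m.group(1) if m else None


strings = ["Building 38 House 10",
           "Building : 10.5 house 900"]
results = [first_number_after(x) for x in strings]
print(results)
```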

re.findall(r"(?<![a-zA-Z:])[-+]?\d*\.?\d+", x)
This will find all numbers in the given string.
If you want the first number only you can access it simply through indexing:
re.findall(r"(?<![a-zA-Z:])[-+]?\d*\.?\d+", x)[0]
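One caveat with indexing: [0] raises IndexError when the string contains no number. A small sketch that uses re.search with the same pattern and returns None instead; the function name first_number is only illustrative. Keep in mind that this pattern is not tied to the word Building, it simply takes the first number in the string.

```python
import re

# Same pattern as above: an optionally signed int or float that is not
# glued to a preceding letter or colon.
NUMBER = re.compile(r"(?<![a-zA-Z:])[-+]?\d*\.?\d+")


def first_number(text):
    """Return the first number in text as a string, or None if there is none."""
    m = NUMBER.search(text)
    return m.group() if m else None


for x in ["Building 38 House 10", "Building : 10.5 house 900", "no digits here"]:
    print(first_number(x))
```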

Related

How to split a string in different words

I want to split the string: "3quartos2suítes3banheiros126m²"
in this format using python:
3 quartos
2 suítes
3 banheiros
126m²
Is there a built-in function i can use? How can I do this?
You can do this using regular expressions, specifically re.findall()
import re
s = "3quartos2suítes3banheiros126m²"
matches = re.findall(r"[\d,]+[^\d]+", s)
gives a list containing:
['3quartos', '2suítes', '3banheiros', '126m²']
Regex explanation (Regex101):
[\d,]+ : Match a digit, or a comma one or more times
[^\d]+ : Match a non-digit one or more times
Then, add a space after the digits using re.sub():
result = []
for m in matches:
    result.append(re.sub(r"([\d,]+)", r"\1 ", m))
which makes result =
['3 quartos', '2 suítes', '3 banheiros', '126 m²']
This adds a space between 126 and m², but that can't be helped.
Explanation:
Pattern :
r"([\d,]+)" : Match a digit or a comma one or more times, capture this match as a group
Replace with:
r"\1 " : The first captured group, followed by a space
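The two steps can probably be folded into one pass by capturing the number and the text after it as separate groups. This is only a sketch of that variation, not the answer's original code; the variable names pairs, number and label are illustrative.

```python
import re

s = "3quartos2suítes3banheiros126m²"

# With two capture groups, findall returns (number, label) tuples.
pairs = re.findall(r"([\d,]+)(\D+)", s)

# Join each pair with a single space.
result = [f"{number} {label}" for number, label in pairs]
print(result)
```

As with the two-step version, 126 and m² end up separated by a space.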

Extract Australian phone numbers from string

I am trying to match Australian phone numbers. The numbers can start with 0, +61 or 61, followed by 2, 3, 4, 5, 7 or 8, and then an 8-digit number.
txt = "My phone number is 0412345677 or +61412345677 or 61412345677"
find_ph = re.find_all(r'(0|\+61|61)[234578]\d{8}', text)
find_ph
returns
['0', '61']
But I want it to return
['0412345677', '+61412345677' or '61412345677']
Can you please point me in the right direction?
>>> pattern = r'((?:0|\+61|61)[234578]\d{8})'
>>> find_ph = re.findall(pattern, txt)
>>> print(find_ph)
['0412345677', '+61412345677', '61412345677']
The problem you had was that the parentheses around just the prefix part were telling the findall function to capture only those characters, while still matching all the rest. (Incidentally, it's findall, not find_all, and your string was in the variable txt, not text.)
Instead, make that a non-capturing group with (?:0|\+61|61). Now you capture the whole of the string that matches the entire pattern.
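To make the difference concrete, here is a small side-by-side sketch using the string from the question. The variable names prefixes, numbers and also_numbers are illustrative only.

```python
import re

txt = "My phone number is 0412345677 or +61412345677 or 61412345677"

# Capturing group around the prefix: findall returns only the prefix
# of each match, not the full number.
prefixes = re.findall(r"(0|\+61|61)[234578]\d{8}", txt)

# Non-capturing group: findall returns the whole match.
numbers = re.findall(r"(?:0|\+61|61)[234578]\d{8}", txt)

# finditer avoids the issue entirely: group() is always the whole match,
# whether or not the pattern contains capturing groups.
also_numbers = [m.group() for m in re.finditer(r"(0|\+61|61)[234578]\d{8}", txt)]

print(prefixes)
print(numbers)
```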
You can use a non-capturing group:
Regex Demo
import re
re.findall(r"(?:0|\+61|61)\d+", txt)
['0412345677', '+61412345677', '61412345677']
One Solution
re.findall(r'(?:0|61|\+61)[2345678]\d{8}', txt)
# ['0412345677', '+61412345677', '61412345677']
Explanation
(?:0|61|\+61) Non-capturing group for 0, 61 or +61
[2345678] followed by one digit other than 0, 1 or 9
\d{8} followed by 8 digits

Python regexp: exclude specific pattern from sub

Given a string like this: aa5f5 aa5f5, I am trying to split the tokens where a non-digit meets a digit, like this:
re.sub(r'([^\d])(\d{1})', r'\1 \2', 'aa5f5 aa5f5')
Out: aa 5f 5 aa 5f 5
Now I want to prevent tokens with a specific prefix character ($) from being split: for $aa5f5 aa5f5, the desired output is $aa5f5 aa 5f 5
The problem is that I only came up with this ugly loop:
sentence = '$aa5f5 aa5f5'
new_sentence = []
for s in sentence.split():
    if s.startswith('$'):
        new_sentence.append(s)
    else:
        new_sentence.append(re.sub(r'([^\d])(\d{1})', r'\1 \2', s))
print(' '.join(new_sentence)) # $aa5f5 aa 5f 5
But I could not find a way to do this with a single-line regexp. I need help with this, thank you.
You may use
new_sentence = re.sub(r'(\$\S*)|(?<=\D)\d', lambda x: x.group(1) if x.group(1) else rf' {x.group()}', sentence)
See the Python demo.
Here, (\$\S*)|(?<=\D)\d matches either $ followed by any 0+ non-whitespace characters (with (\$\S*) capturing the value in Group 1), or a digit that is preceded by a non-digit char (see the (?<=\D)\d pattern part).
If Group 1 matched, it is pasted back as is (see x.group(1) if x.group(1) in the replacement), else, the space is inserted before the matched digit (see else rf' {x.group()}').
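The same replacement logic written out with a named callback instead of a lambda, which may be easier to read. This is a sketch of the approach described above; the names TOKEN, _replace and split_digits are made up for illustration.

```python
import re

# Group 1: a "$"-prefixed token, kept untouched.
# Otherwise: a digit preceded by a non-digit, which gets a space before it.
TOKEN = re.compile(r"(\$\S*)|(?<=\D)\d")


def _replace(match):
    # A protected "$..." token is pasted back as is.
    if match.group(1):
        return match.group(1)
    # Otherwise the match is a single digit: insert a space in front of it.
    return " " + match.group()


def split_digits(sentence):
    return TOKEN.sub(_replace, sentence)


print(split_digits("$aa5f5 aa5f5"))
```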
With the PyPI regex module, you can do it more simply:
import regex
sentence = '$aa5f5 aa5f5'
print( regex.sub(r'(?<!\$\S*)(?<=\D)(\d)', r' \1', sentence) )
See this online Python demo.
The (?<!\$\S*)(?<=\D)(\d) pattern matches and captures into Group 1 any digit ((\d)) that is preceded with a non-digit ((?<=\D)) and not preceded with $ and then any 0+ non-whitespace chars ((?<!\$\S*)).
This is not really a job for a single regular expression. Even if it can be done, the regex will be complex and hard to understand, and a new developer joining your team will not understand it right away. It's better to keep it the way you already wrote it. For the regex part, the following code will probably do the splitting correctly:
' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
>>> s = "aa5f5 aa5f53r12"
>>> ' '.join(map(str.strip, re.findall(r'\d+|\D+', s)))
'aa 5 f 5 aa 5 f 53 r 12'

regex two group matches everything until pattern

I have the following examples:
Tortillas Bolsa 2a 1kg 4118
Tortillinas 50p 1 31Kg TAB TR 46113
Bollos BK 4in 36p 1635g SL 131
Super Pan Bco Ajonjoli 680g SP WON 100
Pan Blanco Bimbo Rendidor 567g BIM 49973
Gansito ME 5p 250g MTA MLA 49860
Where I want to keep everything before the number but I also don't want the two uppercase letter word example: ME, BK. I'm using ^((\D*).*?) [^A-Z]{2,3}
The expected result should be
Tortillas Bolsa
Tortillinas
Bollos
Super Pan Bco Ajonjoli
Pan Blanco Bimbo Rendidor
Gansito
With the regex I'm using I'm still getting the two capital letter words Bollos BK and Gansito ME
Pre-compile a regex pattern with a lookahead (explained below) and call the compiled pattern's match method inside a list comprehension:
>>> import re
>>> p = re.compile(r'\D+?(?=\s*([A-Z]{2})?\s*\d)')
>>> [p.match(x).group() for x in data]
[
'Tortillas Bolsa',
'Tortillinas',
'Bollos',
'Super Pan Bco Ajonjoli',
'Pan Blanco Bimbo Rendidor',
'Gansito'
]
Here, data is your list of strings.
Details
\D+? # anything that isn't a digit (non-greedy)
(?= # regex-lookahead
\s* # zero or more wsp chars
([A-Z]{2})? # two optional uppercase letters
\s*
\d # digit
)
In the event of any string not containing the pattern you're looking for, the list comprehension will error out (with an AttributeError), since re.match returns None in that instance. You can then employ a loop and test the value of re.match before extracting the matched portion.
matches = []
for x in data:
    m = p.match(x)
    if m:
        matches.append(m.group())
Or, if you want a placeholder None when there's no match:
matches = []
for x in data:
    m = p.match(x)
    matches.append(m.group() if m else None)
My 2 cents
^.*?(?=\s[\d]|\s[A-Z]{2,})
https://regex101.com/r/7xD7DS/1/
You may use the lookahead feature:
I_WANT = r'(.+?)' # This is what you want
I_DO_NOT_WANT = r'\s(?:[0-9]|(?:[A-Z]{2,3}\s))' # Stop-patterns
RE = '{}(?={})'.format(I_WANT, I_DO_NOT_WANT) # Combine the parts
[re.findall(RE, x)[0] for x in test_strings]
#['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli',
# 'Pan Blanco Bimbo Rendidor', 'Gansito']
Supposing that:
All the words you want to match in your capture group start with an uppercase letter
The rest of each word contains only lowercase letters
Words are separated by a single space
...you can use the following regular expressions:
Using Unicode character properties (Python's built-in re module does not support \p{...}; the third-party regex module does):
^((\p{Lu}\p{Ll}+ )+)
> Try this regex on regex101.
Without Unicode support:
^(([A-Z][a-z]+ )+)
> Try this regex on regex101.
I suggest splitting on the first two uppercase letter word or a digit and grab the first item:
r = re.compile(r'\b[A-Z]{2}\b|\d')
[r.split(item)[0].strip() for item in my_list]
# => ['Tortillas Bolsa', 'Tortillinas', 'Bollos', 'Super Pan Bco Ajonjoli', 'Pan Blanco Bimbo Rendidor', 'Gansito']
See the Python demo
Pattern details
\b[A-Z]{2}\b - a whole (since \b are word boundaries) two uppercase ASCII letter word
| - or
\d - a digit.
With .strip(), all trailing and leading whitespace will get trimmed.
A slight variation for a re.sub:
re.sub(r'\s*(?:\b[A-Z]{2}\b|\d).*', '', s)
See the regex demo
Details
\s* - 0+ whitespace chars
(?:\b[A-Z]{2}\b|\d) - either a two uppercase letter word or a digit
.* - the rest of the line.
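For reference, a runnable sketch of the re.sub variation applied to the six sample lines from the question. The names data, STOP and names are illustrative.

```python
import re

data = [
    "Tortillas Bolsa 2a 1kg 4118",
    "Tortillinas 50p 1 31Kg TAB TR 46113",
    "Bollos BK 4in 36p 1635g SL 131",
    "Super Pan Bco Ajonjoli 680g SP WON 100",
    "Pan Blanco Bimbo Rendidor 567g BIM 49973",
    "Gansito ME 5p 250g MTA MLA 49860",
]

# Drop everything from the first two-uppercase-letter word or digit onward,
# including the whitespace right before it.
STOP = re.compile(r"\s*(?:\b[A-Z]{2}\b|\d).*")

names = [STOP.sub("", s) for s in data]
for name in names:
    print(name)
```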

Python Regular Expression looking for two digits only

If I had a sentence that has an age and a time :
import re
text = "I am 21 and work at 3:30"
answer= re.findall(r'\b\d{2}\b', text)
print(answer)
The issue is that it gives me not only the 21, but 30 (since it looks for 2 digits). How do I avoid this so it will only count the numbers and not the non-alphanumeric characters that leads to the issue? I tried to use [0-99] instead of the {} braces but that didn't seem to help.
Using \s\d{2}\s will give you only 2 digit combinations with spaces around them (before and after).
Or if you want to match without trailing whitespace: \s\d{2}
That's because : is a non-word character, so \b matches the empty string at the word boundary between : and 30. In regex terms, a word for \b is \w+.
You can check for digits with space or start/end of input line around:
(?:^|\s)(\d{2})(?:\s|$)
Example:
In [85]: text = "I am 21 and work at 3:30"
...: re.findall(r'(?:^|\s)(\d{2})(?:\s|$)', text)
Out[85]: ['21']
You can use a negative lookbehind and a negative lookahead, (?<!...) and (?!...), to isolate and capture only 2 (two) digits.
Regex: (?<!\S)\d{2}(?!\S)
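A short sketch comparing the original \b version with the whitespace-lookaround version on a few sample strings. The sample list and the names WORD_BOUNDARY, TWO_DIGITS and results are illustrative.

```python
import re

# \b treats ":" as a boundary, so the 30 in 3:30 is matched as well.
WORD_BOUNDARY = re.compile(r"\b\d{2}\b")

# (?<!\S) : not preceded by a non-whitespace char (start of string or whitespace)
# (?!\S)  : not followed by a non-whitespace char (end of string or whitespace)
TWO_DIGITS = re.compile(r"(?<!\S)\d{2}(?!\S)")

samples = [
    "I am 21 and work at 3:30",
    "21",
    "abc 12:23",
    "12345",
    "I am 21",
    "21 am I",
]

results = {text: TWO_DIGITS.findall(text) for text in samples}
for text, found in results.items():
    print(repr(text), found)
```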
You can use the following regex:
^\d{2}$|(?<=\s)\d{2}(?=\s)|(?<=\s)\d{2}$|^\d{2}(?=\s)
that will match every standalone 21 in the following strings (and nothing in abc 12:23 or 12345):
I am 21 and work at 3:30
21
abc 12:23
12345
I am 21
21 am I
demo: https://regex101.com/r/gP1KSf/1
Explanations:
^\d{2}$ match 2 digits only string or
(?<=\s)\d{2}(?=\s) 2 digits surrounded by space class char or
(?<=\s)\d{2}$ 2 digits at the end of the string and preceded by a space class char
^\d{2}(?=\s) 2 digits at the beginning of the string and followed by a space class char
