Unable to capture certain phone numbers with different pattern - python

What should be the appropriate regular expression to capture all the phone numbers listed below? I tried with one and it partially does the work. However, I would like to get them all. Thanks for any suggestion or help.
Here are the numbers along with my script I tried with:
import re
content='''
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
+1 416 555 9292
'''
for phone in re.findall(r'\+?1?\s?\(?\d*\)?[\s-]\d*[\s-]\d*',content):
    print(phone)
The result I'm getting is:
415
-555-1234
650-555-2345
555-3456
202
555 4567
4035555678
1 416 555
9292
+1 416 555 9292

I suggest making some parts of the regex obligatory (such as the digit patterns, by replacing * with +); otherwise it may match meaningless parts of the text. Also note that \s matches any whitespace, including line breaks, while you most probably want each match to stay on a single line.
You might try
\+?1? ?(?:\(?\d+\)?)?(?:[ -]?\d+){1,2}
See the regex demo
Details
\+? - an optional plus
1? - an optional 1
(space)? - an optional space
(?:\(?\d+\)?)? - an optional sequence of a (, then 1+ digits and then an optional )
(?:[ -]?\d+){1,2} - 1 or 2 occurrences of:
[ -]? - an optional space or -
\d+ - 1+ digits
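
As a quick check of the breakdown above, here is a minimal sketch that runs the suggested pattern over the sample numbers from the question. The names PHONE_RE and matches are only illustrative. Note that the pattern is deliberately loose and will also match other digit runs in free text.

```python
import re

# Pattern from the answer above: only a literal space or "-" may separate
# the digit groups, so a match cannot run across a line break.
PHONE_RE = re.compile(r'\+?1? ?(?:\(?\d+\)?)?(?:[ -]?\d+){1,2}')

content = '''
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
+1 416 555 9292
'''

matches = PHONE_RE.findall(content)
for phone in matches:
    print(phone)
```

Because the pattern contains no capturing groups, re.findall returns each whole match as a string, one per input line here.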

I think this regex will work in your case:
import re
content = '''
415-555-1234
650-555-2345
(416)555-3456
202 555 4567
4035555678
1 416 555 9292
+1 416 555 9292
'''
for phone in re.findall(r'(([+]?\d\s\d?)?\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})', content):
    print(phone[0])  # findall returns tuples here; element 0 is the whole number

Related

Regex to unify a format of phone numbers in Python

I'm trying a regex to match a phone like +34(prefix), single space, followed by 9 digits that may or may not be separated by spaces.
+34 886 24 68 98
+34 980 202 157
I would need a regex to work with these two example cases.
I tried ^(\+34)\s([ *]|[0-9]{9}), but it does not work.
Ultimately I'd like to normalize every phone to +34 (the prefix), a single space, then the 9 digits, whichever of these cases is given. I'm using re.sub() for that, but I'm not sure how.
+34 886 24 68 98 -> ?
+34 980 202 157 -> ?
+34 846082423 -> `^(\+34)\s(\d{9})$`
+34920459596 -> `^(\+34)(\d{9})$`
import re
from faker import Faker
from faker.providers import BaseProvider
#fake = Faker("es_ES")
class CustomProvider(BaseProvider):
    def phone(self):
        #phone = fake.phone_number()
        phone = "+34812345678"
        return re.sub(r'^(\+34)(\d{9})$', r'\1 \2', phone)
You can try:
^\+34\s*(?:\d\s*){9}$
^ - beginning of the string
\+34\s* - match +34 followed by any number of spaces
(?:\d\s*){9} - match number followed by any number of spaces 9 times
$ - end of string
Regex demo.
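The pattern above only validates the number. A minimal sketch of how it might be combined with re.sub to produce the unified format the question asks for is shown below. The function name normalize and the extra capturing group around the nine digits are assumptions added for illustration, not part of the original answer.

```python
import re

# Same idea as the pattern above, with a capturing group around the
# nine digits (and any spaces between them) so they can be cleaned up.
PHONE_RE = re.compile(r'^\+34\s*((?:\d\s*){9})$')

def normalize(phone):
    """Return '+34 NNNNNNNNN', or None if the input does not fit the pattern."""
    m = PHONE_RE.match(phone)
    if m is None:
        return None
    digits = re.sub(r'\s+', '', m.group(1))  # drop the spaces between digits
    return f"+34 {digits}"

for raw in ["+34 886 24 68 98", "+34 980 202 157", "+34 846082423", "+34920459596"]:
    print(raw, "->", normalize(raw))
```

Returning None for non-matching input is one possible design choice; inside a Faker provider you might prefer to return the original string unchanged.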
Here's a simple approach: use regex to get the plus sign and all the numbers into an array (one char per element), then use other list and string manipulation operations to format it the way you like.
import re
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
pattern = r'[+\d]'
m1 = re.findall(pattern, p1)
m2 = re.findall(pattern, p2)
m1_str = f"{''.join(m1[:3])} {''.join(m1[3:])}"
m2_str = f"{''.join(m2[:3])} {''.join(m2[3:])}"
print(m1_str) # +34 886246898
print(m2_str) # +34 980202157
Or removing spaces using string replacement instead of regex:
p1 = "+34 886 24 68 98"
p2 = "+34 980 202 157"
p1_compact = p1.replace(' ', '')
p2_compact = p2.replace(' ', '')
p1_str = f"{p1_compact[:3]} {p1_compact[3:]}"
p2_str = f"{p2_compact[:3]} {p2_compact[3:]}"
print(p1_str) # +34 886246898
print(p2_str) # +34 980202157
I would capture the numbers like this: r"(\+34(?:\s?\d){9})".
This allows optional whitespace before each of the nine digits. The non-capturing group (?:...) lets \s?\d repeat without each digit being captured as a group of its own.
import re
nums = """
Number 1: +34 886 24 68 98
Number 2: +34 980 202 157
Number 3: +34812345678
"""
number_re = re.compile(r"(\+34(?:\s?\d){9})")
for match in number_re.findall(nums):
    print(match)
+34 886 24 68 98
+34 980 202 157
+34812345678

Regular expression to find a sequence of numbers before multiple patterns, into a new column (Python, Pandas)

Here is my sample data:
import pandas as pd
import re
cars = pd.DataFrame({'Engine Information': {0: 'Honda 2.4L 4 cylinder 190 hp 162 ft-lbs',
1: 'Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs',
2: 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs',
3: 'MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs',
4: 'Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV',
5: 'GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs'},
'HP': {0: None, 1: None, 2: None, 3: None, 4: None, 5: None}})
Here is my desired output:
I have created a new column called 'HP' where I want to extract the horsepower figure from the original column ('Engine Information')
Here is the code I have tried to do this:
cars['HP'] = cars['Engine Information'].apply(lambda x: re.match(r'\\d+(?=\\shp|hp)', str(x)))
The idea is to match the pattern 'a sequence of digits that comes before either "hp" or " hp"'. This is because some of the cells have no space between the number and 'hp', as shown in my example.
I'm sure the regex is correct, because I have successfully done a similar process in R. However, I have tried functions such as str.extract, re.findall, re.search and re.match, and they either raise errors or return 'None' values (as shown in the sample). So here I am a bit lost.
Thanks!
You can use str.extract:
cars['HP'] = cars['Engine Information'].str.extract(r'(\d+)\s*hp\b', flags=re.I)
Details
(\d+)\s*hp\b - matches and captures into Group 1 one or more digits, then just matches 0 or more whitespaces (\s*) and hp (in a case insensitive way due to flags=re.I) as a whole word (since \b marks a word boundary)
str.extract only returns the captured value if there is a capturing group in the pattern, so the hp and whitespaces are not part of the result.
Python demo results:
>>> cars
Engine Information HP
0 Honda 2.4L 4 cylinder 190 hp 162 ft-lbs 190
1 Aston Martin 4.7L 8 cylinder 420 hp 346 ft-lbs 420
2 Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs 390
3 MINI 1.6L 4 Cylinder 118 hp 114 ft-lbs 118
4 Ford 5.0L 8 Cylinder 360hp 380 ft-lbs FFV 360
5 GMC 6.0L 8 Cylinder 352 hp 382 ft-lbs 352
There are several problems:
re.match just looks at the beginning of your string, use re.search if your pattern may appear anywhere
don't double the backslashes if you use a raw string, i.e. write either '\\d hp' or r'\d hp' - raw strings exist precisely so that you can avoid the extra escaping
Return the matched group. You just search but do not yield the group found. re.search(rex, string) gives you a complex object (a match object) from this you can extract all groups, e.g. re.search(rex, string)[0]
you have to wrap the access in a separate function because you have to check if there was any match before accessing the group. If you don't do that, an exception may stop the apply process right in the middle
apply is slow; use pandas vectorized functions like extract: cars['Engine Information'].str.extract(r'(\d+) ?hp')
Your approach should work with this:
def match_horsepower(s):
    m = re.search(r'(\d+) ?hp', s)
    return int(m[1]) if m else None
cars['HP'] = cars['Engine Information'].apply(match_horsepower)
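The first two problems in the list above can be seen on a single sample row without pandas. This is only a small illustration. The variable names are made up, and the lookahead is written as (?=\s*hp), a slight simplification of the asker's pattern.

```python
import re

row = 'Dodge 5.7L 8 Cylinder 390hp 407 ft-lbs'

# re.match only tries at position 0, where the row starts with "Dodge".
at_start = re.match(r'\d+(?=\s*hp)', row)

# re.search scans the whole string and finds the digits before "hp".
anywhere = re.search(r'\d+(?=\s*hp)', row)

# Doubled backslashes inside a raw string look for a literal backslash,
# which the row does not contain.
double_escaped = re.search(r'\\d+(?=\\shp|hp)', row)

print(at_start)           # None
print(anywhere.group(0))  # 390
print(double_escaped)     # None
```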
This will get the numeric value just before hp, with or without (single or multiple) spaces.
r'\d+(?=\s+hp|hp)'
You can verify Regex Here: https://regex101.com/r/pXySxm/1

Python Regex remove space b/w a Bracket and Number

Python, I have a string like this, Input:
IBNR 13,123 1,234 ( 556 ) ( 2,355 ) 934
Required output- :
Either remove the space b/w the bracket and number
IBNR 13,123 1,234 (556) (2,355) 934
OR Remove the brackets:
IBNR 13,123 1,234 556 2,355 934
I have tried this:
re.sub('(?<=\d)+ (?=\\))','',text1)
This solves for right hand side, need help with left side.
You could use
import re
data = """IBNR 13,123 1,234 ( 556 ) ( 2,355 ) 934 """
def replacer(m):
    return f"({m.group(1).strip()})"
data = re.sub(r'\(([^()]+)\)', replacer, data)
print(data)
# IBNR 13,123 1,234 (556) (2,355) 934
Or remove the parentheses altogether:
data = re.sub(r'[()]+', '', data)
# IBNR 13,123 1,234 556 2,355 934
As @JvdV points out, you might better use
re.sub(r'\(\s*(\S+)\s*\)', r'\1', data)
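The asker's own lookaround attempt can also be extended to cover both sides in one call. The following is a sketch of that alternative, not the method from the answer above: it deletes whitespace that directly follows an opening bracket or directly precedes a closing one. The name tightened is only illustrative.

```python
import re

text1 = "IBNR 13,123 1,234 ( 556 ) ( 2,355 ) 934"

# Left branch: whitespace preceded by "(".
# Right branch: whitespace followed by ")".
tightened = re.sub(r'(?<=\()\s+|\s+(?=\))', '', text1)
print(tightened)  # IBNR 13,123 1,234 (556) (2,355) 934
```

The space between ") (" is kept because it is neither preceded by "(" nor followed by ")".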
Escape the brackets with this pattern:
(\w+\s+\d+,\d+\s+\d+,\d+\s+)\((\s+\d+\s+)\)(\s+)\((\s+\d+,\d+\s)\)(\s+\d+)
See the results, including substitutions:
https://regex101.com/r/ch6Jge/1
I rarely use the lookahead at all, but I think it does what you want.
re.sub(r'\(\s(\d+(?:\,\d+)*)\s\)', r'\1', text1)

How to extract first floating numbers appearing after a word?

I'm trying to build an application for text extraction use case but I was not able to extract exact price from it.
I have a text like this,
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
I want to extract the price that appearing after the word total using regex but I was only able to extract all floating numbers. Also do note some-times you may also see words such as sub total but I only need price that appears after the word total. Also sometimes after total there may occur other words as well. So Regex should match word total and extract the floating numbers that appears next to it.
Any help is appreciated.
This is what I've tried,
re.findall("\d+\.\d+", string1) # this returns all floating numbers.
You can try
(?<=\\nTotal)\:?\D+([\d\.]+)
Demo
You could try this, should work for the example and the other restrictions you mentioned
import re
result = re.search(r'Total\n\$(\d+\.\d+)', string1)
result.group(1) # 191.44
result = re.search(r'Total:\n.+\n(\d+\.\d+)', string2)
result.group(1) # 54.50
EDIT: If you want only one expression for both, you could try
result = re.search(r'\nTotal:?(\n\D+)*\n\$?(\d+\.\d+)', string1) # or string2
result.group(2)
You could use a negative lookbehind to prevent sub from appearing before total, word boundaries to prevent the words from being part of a larger word, and a capturing group to capture the price.
(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))
In parts:
(?<!\bsub ) Negative lookbehind, assert what is on the left is not the word sub and a space
\btotal\b Match total between word boundaries to prevent it being part of a larger word
\D* Match 0+ times any char that is not a digit
( Capture group 1
\d+(?:\.\d+) Match 1+ digits followed by a decimal part (append ? to the inner group to make the decimal part optional)
) Close group
Regex demo | Python demo
For example
import re
regex = r"(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+))"
string1 = 'Friscos #8603\n8100 E. Orchard Road\nGreenwood Village, Colorado 80111\n2013-11-02\nTable 00\nGuest\n1 Oysters 1/2 Shell #1\n1 Crab Cake\n1 Filet 1602 Bone In\n1 Ribeye 22oz Bone In\n1 Asparagus\n1 Potato Au Gratin\n$17.00\n$19.00\n$66.00\n$53.00\n$12.00\n$11.50\nSub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
string2 = 'Berghotel\nGrosse Scheidegg\n3818 Grindelwald\nFamilie R. Müller\nRech. Nr. 4572\nBar\n30.07.2007/13:29:17\nTisch 7/01\nNM\n#ರ\n2xLatte Macchiato à 4.50 CHF\n1xGloki\nà 5.00 CHF\n1xSchweinschnitzel à 22.00 CHF\n1xChässpätzli à 18.50 CHF\n#ರ #ರ #1ರ\n5.00\n22.00\n18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n54.50 CHF:\n3.85\nEntspricht in Euro 36.33 EUR\nEs bediente Sie: Ursula\nMwSt Nr. : 430 234\nTel.: 033 853 67 16\nFax.: 033 853 67 19\nE-mail: grossescheidegg#bluewin.ch\n'
print(re.findall(regex, string1, re.IGNORECASE))
print(re.findall(regex, string2, re.IGNORECASE))
Output
['191.44']
['54.50']
If what precedes the price should be a dollar sign or the text CHF, you might use an alternation (?:\$|CHF)\s* that matches either of the values, followed by 0+ whitespace chars:
(?<!\bsub )\btotal\b\D*(?:\$|CHF)\s*(\d+(?:\.\d+))
Regex demo
Something like this might do the trick:
(?<!sub )total.*?(\d+.\d+)
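Make sure to ignore case (re.IGNORECASE) and to pass re.DOTALL as well, since .*? has to cross the line break between Total and the price. Escaping the dot, as in \d+\.\d+, is also safer.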
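The approaches above can be wrapped in a small helper that returns the total as a number. This is a sketch under a few assumptions: the function name extract_total is hypothetical, the two receipts are shortened excerpts of string1 and string2, and the decimal part is made optional with a trailing ?.

```python
import re

# Negative lookbehind rejects "Sub Total"; \D* skips ":", newlines,
# "$" or "CHF" between the word and the amount.
TOTAL_RE = re.compile(r'(?<!\bsub )\btotal\b\D*(\d+(?:\.\d+)?)', re.IGNORECASE)

def extract_total(text):
    """Return the first amount after a standalone 'total', or None."""
    m = TOTAL_RE.search(text)
    return float(m.group(1)) if m else None

# Shortened excerpts of the receipts from the question.
receipt_usd = 'Sub Total\nTax\n$178.50\n$12.94\nTotal\n$191.44\n'
receipt_chf = '18.50\nTotal:\nCHF\n54.50\nIncl. 7.6% MwSt\n'

print(extract_total(receipt_usd))  # 191.44
print(extract_total(receipt_chf))  # 54.5
```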

Python Regex match all occurrences of decimal pattern followed by another pattern

I've done lots of searching, including this SO post, which almost worked for me.
I'm working with a huge string, trying to capture the groups of four digits that appear after a series of decimal patterns AND before an alphanumeric word.
There are other four digit number groups that don't qualify since they have words or other number patterns before them.
EDIT: my string is not multiline, it is just shown here for visual convenience.
For example:
>> my_string = """BEAVER COUNTY 001 0000
1010 BEAVER
2010 BEAVER COUNTY SCH DIST
0.008504
...(more decimals)
0.008508
4010 COUNTY SPECIAL SERVICE DIST NO.1 <---capture this 4010
4040 BEAVER COUNTY
8005 GREENVILLE SOLAR
0.004258
0.008348
...(more decimals)
0.008238
4060 SPECIAL SERVICE DISTRICT NO 7 <---capture this 4060
"""
The ideal re.findall should return:
['4010','4060']
Here are patterns I've tried that are lacking:
re.findall(r'(?=(\d\.\d{6}\s+)(\s+\d{4}\s))', my_string)
# also tried
re.findall("(\s+\d{4}\s+)(?:(?!^\d+\.\d+)[\s\S])*", my_string)
# which gets me a little closer but I'm still not getting what I need.
Thanks in advance!
SINGLE LINE STRING APPROACH:
Just match the float number right before the 4 standalone digits:
r'\d+\.\d+\s+(\d{4})\b'
See this regex demo
Python demo:
import re
p = re.compile(r'\d+\.\d+\s+(\d{4})\b')
s = "BEAVER COUNTY 001 0000 1010 BEAVER 2010 BEAVER COUNTY SCH DIST 0.008504 0.008508 4010 COUNTY SPECIAL SERVICE DIST NO.1 4040 BEAVER COUNTY 8005 GREENVILLE SOLAR 0.004258 0.008348 0.008238 4060 SPECIAL SERVICE DISTRICT NO 7"
print(p.findall(s))
# => ['4010', '4060']
ORIGINAL ANSWER: MULTILINE STRING
You may use a regex that will check for a float value on the previous line and then captures the standalone 4 digits on the next line:
re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.M)
See regex demo here
Pattern explanation:
^ - start of a line (as re.M is used)
\d+\.\d+ - 1+ digits, . and again 1 or more digits
* - zero or more spaces (replace with [^\S\r\n] to only match horizontal whitespace)
[\r\n]+ - 1 or more LF or CR symbols (to only restrict to 1 linebreak, replace with (?:\r?\n|\r))
(\d{4})\b - Group 1 returned by the re.findall matching 4 digits followed with a word boundary (a non-digit, non-letter, non-_).
Python demo:
import re
p = re.compile(r'^\d+\.\d+ *[\r\n]+(\d{4})\b', re.MULTILINE)
s = "BEAVER COUNTY 001 0000 \n1010 BEAVER \n2010 BEAVER COUNTY SCH DIST \n0.008504 \n...(more decimals)\n0.008508 \n4010 COUNTY SPECIAL SERVICE DIST NO.1 <---capture this 4010\n4040 BEAVER COUNTY \n8005 GREENVILLE SOLAR\n0.004258 \n0.008348 \n...(more decimals)\n0.008238 \n4060 SPECIAL SERVICE DISTRICT NO 7 <---capture this 4060"
print(p.findall(s)) # => ['4010', '4060']
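Since \s also matches line breaks, the single-line pattern should behave the same whether or not the text contains newlines. A small sketch on a shortened, made-up excerpt of the data (the names multi and single are illustrative):

```python
import re

# A decimal number, whitespace (spaces or line breaks), then four
# standalone digits, which are captured.
CODE_AFTER_DECIMAL = re.compile(r'\d+\.\d+\s+(\d{4})\b')

multi = (
    "2010 BEAVER COUNTY SCH DIST\n"
    "0.008504\n"
    "0.008508\n"
    "4010 COUNTY SPECIAL SERVICE DIST NO.1\n"
    "4040 BEAVER COUNTY\n"
    "0.008238\n"
    "4060 SPECIAL SERVICE DISTRICT NO 7\n"
)
single = " ".join(multi.split())  # same text collapsed onto one line

print(CODE_AFTER_DECIMAL.findall(multi))
print(CODE_AFTER_DECIMAL.findall(single))
```

The ^-anchored multiline variant is stricter, because it additionally requires the decimal to start its own line.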
This will help you:
"((\d+\.\d+)\s+)+(\d+)\s?(?=\w+)"gm
Use capture group 3 (\3) to get the four-digit number.
Demo and explanation
Try this pattern:
re.compile(r'(\d+[.]\d+)+\s+(?P<cap>\d{4})\s+\w+')
I wrote a little code and checked against it and it works.
import re
p=re.compile(r'(\d+[.]\d+)+\s+(?P<cap>\d{4})\s+\w+')
my_string = """BEAVER COUNTY 001 0000
1010 BEAVER
2010 BEAVER COUNTY SCH DIST
0.008504
...(more decimals)
0.008508
4010 COUNTY SPECIAL SERVICE DIST NO.1 <---capture this 4010
4040 BEAVER COUNTY
8005 GREENVILLE SOLAR
0.004258
0.008348
...(more decimals)
0.008238
4060 SPECIAL SERVICE DISTRICT NO 7 <---capture this 4060
"""
s=my_string.replace("\n", " ")
match=p.finditer(s)
for m in match:
    print(m.group('cap'))
