I want to extract a phone number from text. I can extract it when all the digits are on a single line, but when some of the digits continue on the next line my regex does not work.
Here is my text:
I will be out of the office. Please send me an email and text my mobile +45
20 32 40 08 if any urgency.
In the text above, +45 is on the first line and 20 32 40 08 is on the second. I am unable to extract the phone number when it is split across lines like this; when all the digits are on one line it works fine.
Here is my regex:
reg_phonestyle = re.compile(r'(\d{2}[-\/\.\ \s]??\d{2}[-\/\.\ \s]??\d{2}[-\/\.\ \s]??\d{2}[-\/\.\ \s]??\d{2}|\(\d{3}\)\s*\d{3}[-\/\.\ \s]??\d{4}|\d{3}[-\/\.\ \s]??\d{4})')
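The number can be matched across the line break because \s matches a newline as well as a space. (The re.MULTILINE flag used below only changes how ^ and $ behave, so it is harmless here but not required.)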
Given your example I propose the following solution:
import re
input_str = '''
I will be out of the office. Please send me an email and text my mobile +45
20 32 40 08 if any urgency.
'''
phone_reg = re.compile(r"([0-9]{2,4}[-.\s]{,1}){5}", re.MULTILINE)
print(phone_reg.search(input_str).group(0))
This regexp finds 5 groups, each consisting of 2 to 4 digits followed by 0 or 1 separator characters (hyphen, dot, or whitespace).
Hope this helps
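A slightly stricter variant, as a minimal sketch: it assumes the number is always an optional "+", a two-digit country code, and four two-digit groups (an assumption taken from the single example in the question, not a general rule). The names text and phone_pattern are illustrative.

```python
import re

text = (
    "I will be out of the office. Please send me an email and text my mobile +45\n"
    "20 32 40 08 if any urgency."
)

# Optional "+", a two-digit country code, then four two-digit groups.
# \s* between groups also matches "\n", so the line break is not a problem.
phone_pattern = re.compile(r"\+?\d{2}(?:\s*\d{2}){4}")

match = phone_pattern.search(text)
# Collapse the newline and any repeated spaces into single spaces.
phone = " ".join(match.group(0).split()) if match else None
print(phone)
```

Normalizing the whitespace after matching keeps the pattern itself simple and gives the same result whether or not the number was wrapped.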
This is my way of getting the phone number. I would actually like more examples to verify my regex.
import re
strs = '''
I will be out of the office. Please send me an email and text my mobile +45
20 32 40 08 if any urgency.
'''
phone = re.compile(r"(?<=mobile\s)(.?[0-9]|\s)+", re.S)
print(" ".join(phone.search(strs).group(0).split()))  # collapse newlines and repeated spaces
I have a footer extracted out using regex from a PDF. The footer example is as below
footer_text = 'company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 2 Copyright 2001-2019 some relevant text here'
I want to find this string throughout my extracted text and replace it with a space, since I don't need it in the middle of the text. However, the footer contains the page number, which changes each time, so a simple find and replace does not work. I am able to find the page number using
result = re.search(r"\s[\d]\s", footer_text)
But I don't know how to make this 2 match any number during my find and replace. Any pointers?
Assuming that the footer text contains something that matches r'\s\d+\s' (I am allowing for page numbers >= 10), you first want to create a regex by replacing the page number with the regex that matches it:
# the backslashes in the replacement are doubled so that re.sub emits \s\d+\s literally
regex = re.sub(r'\\ \d+\\ ', r'\\s\\d+\\s', re.escape(footer_text))
Now you can match any footer regardless of page number. The code then is:
>>> import re
...
... footer_text = 'company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 11 Copyright 2001-2019some relevant text here'
...
... regex = re.sub(r'\\ \d+\\ ', r'\\s\\d+\\s', re.escape(footer_text))
... replacement = ' ' # a single space (should this instead be '' for an empty string?)
...
... some_text = "abc" + footer_text + "def"
... print(regex)
... print(some_text)
... print(re.sub(regex, replacement, some_text))
...
company\ name\.\ \(ABC\)\ Q1\s\d+\sHere\ is\ some\ text\ 01\-Jan\-2019\ 1\-888\-1234567\ www\.company\.com\s\d+\sCopyright\ 2001\-2019some\ relevant\ text\ here
abccompany name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 11 Copyright 2001-2019some relevant text heredef
abc def
For simpler copying:
import re
footer_text = 'company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 1-888-1234567 www.company.com 11 Copyright 2001-2019some relevant text here'
regex = re.sub(r'\\ \d+\\ ', r'\\s\\d+\\s', re.escape(footer_text))
replacement = ' ' # a single space (should this instead be '' for an empty string?)
some_text = "abc" + footer_text + "def"
print(regex)
print(some_text)
print(re.sub(regex, replacement, some_text))
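An alternative sketch that avoids post-processing the escaped string: split the footer on the standalone numbers first, escape only the literal pieces, and join them with the number pattern. The helper name footer_pattern and the page number 37 are made up for illustration. As in the answer above, every standalone number (including the year 2020) becomes a wildcard.

```python
import re

def footer_pattern(footer_text):
    """Build a regex that matches footer_text with any number in place of each standalone digit run."""
    number = r"\s\d+\s"
    # Pieces of the footer that must match literally.
    literal_parts = re.split(number, footer_text)
    # str.join does no escape processing, so \s and \d stay intact.
    return re.compile(number.join(re.escape(part) for part in literal_parts))

footer_text = ('company name. (ABC) Q1 2020 Here is some text 01-Jan-2019 '
               '1-888-1234567 www.company.com 2 Copyright 2001-2019 some relevant text here')
pattern = footer_pattern(footer_text)

# Simulate the same footer appearing on a different page.
page = "abc" + footer_text.replace(" 2 Copyright", " 37 Copyright") + "def"
cleaned = pattern.sub(" ", page)
print(cleaned)
```

Because the number pattern is never passed through a re.sub replacement template, no backslash doubling is needed.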
I need to find a phone number in a given paragraph text, with the conditions as below.
The word Phone/Ph/tel/telephone should exist in the sentence where the phone number is present.
For ex: (consider the below paragraph.)
This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
As you can see this paragraph has a phone number signified, and it has the word "Phone" in the sentence (31 characters before the phone number).
So I would like to detect this as a phone number if and only if one of the words Phone/Ph/tel/telephone appears within 50 characters before or after the phone number.
I tried using lookarounds in the regex, but it did not work.
import re
phno = re.compile(r'(?<=Ph\s)(?<=Phone\s)(?<=tel\s)telephone(?<=telephone\s)\b([0-9]{3}[-][0-9]{3}[-][0-9]{4})\b',re.MULTILINE)
data = "This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
l = phno.findall(data)
print(l)
I am getting an empty list [] as output because the word 'Phone' is not detected by the regex (I need it to be detected within 50 characters before or after the phone number).
import re
data = """This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 999-123-4567 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
And 555-555-1212 is my telephone."""
phno = re.compile(r'\b(?:phone|ph|tel|telephone)\b.{0,49}\b(\d{3}[-]\d{3}[-]\d{4})\b|\b(\d{3}[-]\d{3}[-]\d{4})\b.{0,49}\b(?:phone|ph|tel|telephone)\b', flags=re.I)
phones = [m.group(1) if m.group(1) else m.group(2) for m in phno.finditer(data)]
print(phones)
Prints:
['999-888-7894', '555-555-1212']
See demo
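The same requirement can also be expressed in two steps, which may be easier to adjust: find every candidate number, then look for a keyword in a fixed window on either side. This is only a sketch; the function name phones_near_keyword and the window parameter are hypothetical, and the number format is assumed to be ddd-ddd-dddd as in the question.

```python
import re

PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")
KEYWORD = re.compile(r"\b(?:phone|ph|tel|telephone)\b", re.I)

def phones_near_keyword(text, window=50):
    """Return numbers that have a keyword within `window` characters before or after them."""
    found = []
    for m in PHONE.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        if KEYWORD.search(before) or KEYWORD.search(after):
            found.append(m.group(0))
    return found

data = "This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
print(phones_near_keyword(data))
```

One caveat of slicing: a keyword cut in half at the window edge (for example "telephone" truncated to "phone") can still count as a match, so treat the window as approximate.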
Assuming you only want to detect hyphen-separated US phone numbers containing area codes, you could use the following regex pattern with re.findall:
\b\d{3}-\d{3}-\d{4}\b
Script:
import re
sentence = "This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
numbers = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', sentence)
print(numbers)
This prints:
['999-888-7894']
I need help grabbing just K334-76A9 from this string:
b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n
Please help, I have tried so many things but none have worked.
Sorry if my question is bad :/
If you want to find the format xxxx-xxxx, no matter what string you have you can do it like this:
import re
b = '\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
splitString = b.split()
r = re.compile('.{4}-.{4}')
for string in splitString:
    if r.match(string):
        print(string)
Output:
K334-76A9
Here's code that grabs everything after "Serial Number is " up to the next whitespace character.
import re
data = b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
pat = re.compile(r"Serial Number is ([^\s]+)")
match = pat.search(data.decode("ASCII"))
if match:
    print(match.group(1))
Result:
K334-76A9
You can adjust the regular expression per your needs. Regular expressions are Da Bomb! This one's really simple, but you can do amazingly complex things with them.
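If you would rather not decode the whole output first, a bytes pattern can be applied to the bytes object directly and only the captured group decoded. A minimal sketch, assuming the serial number contains only ASCII characters; the names serial_pattern and serial are illustrative.

```python
import re

data = b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'

# A bytes pattern (rb"...") is required when searching a bytes object.
serial_pattern = re.compile(rb"Serial Number is (\S+)")

match = serial_pattern.search(data)
# Decode only the captured group, not the whole output.
serial = match.group(1).decode("ascii") if match else None
print(serial)
```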
I'm trying to extract data from a few large textfiles containing entries about people. The problem is, though, I cannot control the way the data comes to me.
It is usually in a format like this:
LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012
Firstname Lastname 2001 Some text that I don't care about
Lastname, Firstname blah blah ... January 25, 2012 ...
Currently, I am using a huge regex that splits all kindaCamelcase words, all words that have a month name tacked onto the end, and a lot of special cases for names. Then I use more regex to extract a lot of combinations for the name and date.
This seems sub-optimal.
Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?
I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like its OOP style, but I'm not sure if I'm wasting my time.
Ideally, I'd like to do something like this to train a parser (with many input/output pairs):
training_data = (
    'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
    ['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)
Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.
I ended up implementing a somewhat-complicated series of exhaustive regexes that encompassed every possible use case using text-based "filters" that were substituted with the appropriate regexes when the parser loaded.
If anyone's interested in the code, I'll edit it into this answer.
Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:
class Replacer(object):
    def __call__(self, match):
        group = match.group(0)
        if group[1:].lower().endswith('_nm'):
            return '(?:' + Matcher(group).regex[1:]
        else:
            return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]
Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:
class Matcher(object):
    # months and months_short (lists of month names) are defined elsewhere in the full source
    name_component = r"([A-Z][A-Za-z|'|\-]+|[A-Z][a-z]{2,})"
    name_component_upper = r"([A-Z][A-Z|'|\-]+|[A-Z]{2,})"
    year = r'(1[89][0-9]{2}|20[0-9]{2})'
    year_upper = year
    age = r'([1-9][0-9]|1[01][0-9])'
    age_upper = age
    ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
    ordinal_upper = ordinal
    date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
    date_upper = date
    matchers = [
        'name_component',
        'year',
        'age',
        'ordinal',
        'date',
    ]
    def __init__(self, match=''):
        capitalized = '_upper' if match.isupper() else ''
        match = match.lower()[1:]
        if match.endswith('_instant'):
            match = match[:-8]
        if match in self.matchers:
            self.regex = getattr(self, match + capitalized)
        elif len(match) == 1:
            pass  # the body of this branch is missing from the post
        elif 'year' in match:
            self.regex = getattr(self, 'year')
        else:
            self.regex = getattr(self, 'name_component' + capitalized)
Finally, there's the generic Pattern object:
import re
class Pattern(object):
    def __init__(self, text='', escape=None):
        self.text = text
        self.matchers = []
        escape = not self.text.startswith('!') if escape is None else False
        if escape:
            self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
        else:
            self.regex = self.text[1:]
        self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))
        self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
        self.regex = re.sub(r'\s+', r'\\s+', self.regex)
    def search(self, text):
        return re.search(self.regex, text)
    def findall(self, text, max_depth=1.0):
        results = []
        length = float(len(text))
        for result in re.finditer(self.regex, text):
            if result.start() / length < max_depth:
                results.extend(result.groups())
        return results
    def match(self, text):
        # build a list (not a lazy map object) so that "no matches" is an empty list on Python 3 too
        return [(m.groupdict(), m.start()) for m in re.finditer(self.regex, text)]
It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:
$LASTNAME, $FirstName $I. said on $date
Into a compiled regex with named capturing groups.
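To make the idea concrete, here is a much smaller, self-contained sketch of the same technique: replace each $placeholder in a template with a named capturing group and turn literal whitespace into \s+. This is not the code above; compile_template, FIELD_PATTERNS, and the simplified date and name patterns are all assumptions for illustration, and single-letter placeholders such as $I are not handled.

```python
import re

PLACEHOLDER = re.compile(r"\$([A-Za-z_][A-Za-z0-9_]*)")
# Simplified field patterns (assumed); anything not listed is treated as a name.
FIELD_PATTERNS = {"date": r"[A-Z][a-z]+ \d{1,2}, \d{4}"}
DEFAULT_PATTERN = r"[A-Z][A-Za-z'\-]+"

def escape_literal(text):
    """Escape literal template text, letting any whitespace run match \\s+."""
    return r"\s+".join(re.escape(piece) for piece in re.split(r"\s+", text))

def compile_template(template):
    """Turn a template such as '$LASTNAME, $FirstName said on $date' into a compiled regex."""
    parts = []
    position = 0
    for placeholder in PLACEHOLDER.finditer(template):
        parts.append(escape_literal(template[position:placeholder.start()]))
        name = placeholder.group(1)
        field = FIELD_PATTERNS.get(name.lower(), DEFAULT_PATTERN)
        parts.append("(?P<%s>%s)" % (name, field))
        position = placeholder.end()
    parts.append(escape_literal(template[position:]))
    return re.compile("".join(parts))

regex = compile_template("$LASTNAME, $FirstName said on $date")
fields = regex.search("SMITH, John said on January 25, 2012").groupdict()
print(fields)
```

Each placeholder name becomes the group name, so the match can be read back as a dictionary with groupdict().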
I had a similar problem, mainly caused by exporting data from Microsoft Office 2010: the result had two consecutive words joined together at somewhat regular intervals. The domain is morphological processing, much like a spelling checker. You can jump to a machine learning solution or create a heuristic solution like I did.
The easy solution is to assume that the newly formed word is a combination of proper names (each with its first character capitalized).
A second, additional solution is to have a dictionary of valid words and try a set of partition locations that generate two (or at least one) valid words. Another problem may arise when one of them is a proper name, which by definition is out of vocabulary in that dictionary. Perhaps word-length statistics could be used to identify whether a word is mistakenly formed or actually legitimate.
In my case, this is part of the manual correction of a large text corpus (human-in-the-loop verification), and the only things that can be automated are the selection of probably malformed words and the recommendation of their corrections.
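The dictionary-based heuristic can be sketched in a few lines. The function name split_joined_word and the tiny vocabulary are placeholders; a real word list would be far larger, and this version requires both halves to be known words.

```python
def split_joined_word(word, vocabulary):
    """Return every (left, right) split of `word` where both halves are known words."""
    candidates = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left.lower() in vocabulary and right.lower() in vocabulary:
            candidates.append((left, right))
    return candidates

# Toy vocabulary for illustration only.
vocabulary = {"text", "here", "why", "is", "this"}
print(split_joined_word("textHere", vocabulary))
```

A word that returns no candidates is either legitimate or contains an out-of-vocabulary part such as a proper name, which is where a human reviewer comes in.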
Regarding the concatenated words, you can split them using a tokenizer:
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.
For example:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
is tokenized into:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
OpenNLP has a "learnable tokenizer" that you can train. If that doesn't work, you can try the answers to: Detect most likely words from text without spaces / combined words.
When splitting is done, you can eliminate the punctuation and pass it to a NER system such as CoreNLP:
Johnson John Doe Maybe a Nickname Why is this text here January 25 2012
which outputs:
Tokens
Id  Word      Lemma     Char begin  Char end  POS  NER     Normalized NER
1   Johnson   Johnson   0           7         NNP  PERSON
2   John      John      8           12        NNP  PERSON
3   Doe       Doe       13          16        NNP  PERSON
4   Maybe     maybe     17          22        RB   O
5   a         a         23          24        DT   O
6   Nickname  nickname  25          33        NN   MISC
7   Why       why       34          37        WRB  MISC
8   is        be        38          40        VBZ  O
9   this      this      41          45        DT   O
10  text      text      46          50        NN   O
11  here      here      51          55        RB   O
12  January   January   56          63        NNP  DATE    2012-01-25
13  25        25        64          66        CD   DATE    2012-01-25
14  2012      2012      67          71        CD   DATE    2012-01-25
One part of your problem: "all words that have a month name tacked onto the end,"
If as appears to be the case you have a date in the format Monthname 1-or-2-digit-day-number, yyyy at the end of the string, you should use a regex to munch that off first. Then you have a now much simpler job on the remainder of the input string.
Note: Otherwise you could run into problems with given names which are also month names e.g. April, May, June, August. Also March is a surname which could be used as a "middle name" e.g. SMITH, John March.
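A minimal sketch of stripping the trailing date first, assuming English full month names and the "Monthname d, yyyy" layout described above (an optional comma after the month is also allowed, since the sample data shows one). The names TRAILING_DATE and split_trailing_date are illustrative.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
# No \b before the month: in the dirty data it may be glued to the previous word.
TRAILING_DATE = re.compile(r"(?:%s),? \d{1,2}, \d{4}\s*$" % MONTHS)

def split_trailing_date(line):
    """Return (rest_of_line, date) if the line ends with a date, else (line, None)."""
    m = TRAILING_DATE.search(line)
    if m is None:
        return line, None
    return line[:m.start()], m.group(0).strip()

line = "LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012"
print(split_trailing_date(line))
```

Because the pattern requires a day and a year after the month, a name such as "SMITH, John March" is left alone.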
Your use of the "last/first/middle" terminology is "interesting". There are potential problems if your data includes non-Anglo names like these:
Mao Zedong aka Mao Ze Dong aka Mao Tse Tung
Sima Qian aka Ssu-ma Ch'ien
Saddam Hussein Abd al-Majid al-Tikriti
Noda Yoshihiko
Kossuth Lajos
José Luis Rodríguez Zapatero
Pedro Manuel Mamede Passos Coelho
Sukarno
A few pointers, to get you started:
for date parsing, you could start with a couple of regexes, and then you could use chronic or jChronic
for names, these OpenNlp models should work
As for training a machine learning model yourself, this is not so straightforward, especially regarding training data (work effort)...
The following Python script allows me to scrape email addresses from a given file using regular expressions.
How could I add to this so that I can also get phone numbers? Say, if it was either a 7-digit or a 10-digit number (with area code), and also account for parentheses?
My current script can be found below:
import os
import re
# filename variables
filename = 'file.txt'
newfilename = 'result.txt'
# read the file
if os.path.exists(filename):
    data = open(filename, 'r')
    bulkemails = data.read()
else:
    print("File not found.")
    raise SystemExit
# regex = something@whatever.xxx
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
    emails += str(x) + "\n"
# function to write file
def writefile():
    f = open(newfilename, 'w')
    f.write(emails)
    f.close()
    print("File written.")
Regex for phone numbers:
(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})
Another regex for phone numbers:
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?
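As a rough sketch of how a phone regex can sit next to the email regex in one script: the phone pattern below is the first, simpler one from this answer, while the email pattern is a deliberately simplified assumption (not the asker's original), and the function name scrape is hypothetical.

```python
import re

# Simplified email pattern (assumption): local part, "@", domain with at least one dot.
EMAIL = re.compile(r"\b[\w.]+@[\w.]+\.\w+\b")
# 10-digit, parenthesized area code, or 7-digit numbers with -, . or whitespace separators.
PHONE = re.compile(r"\d{3}[-.\s]\d{3}[-.\s]\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\d{3}[-.\s]\d{4}")

def scrape(text):
    """Return (emails, phones) found in text."""
    return EMAIL.findall(text), PHONE.findall(text)

sample = "Contact bob@example.com or (555) 123-4567, alt 555-9876."
emails, phones = scrape(sample)
print(emails)
print(phones)
```

Without a capturing group in the pattern, findall returns the whole match for each hit, which is convenient when writing results to a file one per line.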
If you are interested in learning regex, you could take a stab at writing it yourself. It's not quite as hard as it's made out to be. Sites like RegexPal allow you to enter some test data, then write and test a regular expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find them (with brackets, area codes, etc.), grab a regex cheatsheet and see how far you can get. If nothing else, it will help in reading other people's expressions.
Edit:
Here is a modified version of your Regex, which should also match 7 and 10-digit phone numbers that lack any hyphens, spaces or dots. I added question marks after the character classes (the []s), which makes anything within them optional. I tested it in RegexPal, but as I'm still learning Regex, I'm not sure that it's perfect. Give it a try.
(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})
It matched the following values in RegexPal:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
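The same check can be reproduced in Python instead of RegexPal. This sketch uses re.fullmatch so that each sample must match in its entirety; the names PHONE, samples, and all_match are illustrative.

```python
import re

PHONE = re.compile(
    r"(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})"
)

samples = [
    "000-000-0000", "000 000 0000", "000.000.0000",
    "(000)000-0000", "(000)000 0000", "(000)000.0000",
    "(000) 000-0000", "(000) 000 0000", "(000) 000.0000",
    "000-0000", "000 0000", "000.0000",
    "0000000", "0000000000", "(000)0000000",
]

# fullmatch succeeds only if the entire string is a phone number.
all_match = all(PHONE.fullmatch(s) for s in samples)
print(all_match)
```

Keeping the samples in a list like this makes it easy to add new formats, or strings that should not match, as the regex evolves.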
This is the process of building a phone number scraping regex.
First, we need to match an area code (3 digits), a trunk (3 digits), and an extension (4 digits):
reg = re.compile(r"\d{3}\d{3}\d{4}")
Now, we want to capture the matched phone number, so we add parentheses around the parts that we're interested in capturing (all of it):
reg = re.compile(r"(\d{3}\d{3}\d{4})")
The area code, trunk, and extension might be separated by up to 3 characters that are not digits (such as the case when spaces are used along with the hyphen/dot delimiter):
reg = re.compile(r"(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")
Now, the phone number might actually start with a ( character (if the area code is enclosed in parentheses):
reg = re.compile(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now that whole phone number is likely embedded in a bunch of other text:
reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now, that other text might include newlines:
reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
Enjoy!
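A short usage sketch with made-up numbers. When findall or finditer is used, the surrounding .*? parts are not needed, so this version keeps only the core of the pattern; the names phone_pattern and numbers are illustrative.

```python
import re

# Core of the pattern built above, without the surrounding .*? parts.
phone_pattern = re.compile(r"\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}")

text = "Call (555) 123-4567 or 555.987.6543 today,\nor try 555\n246-8101."
numbers = phone_pattern.findall(text)
print(numbers)
```

Since \D matches a newline, the third number is found even though it is split across two lines.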
I personally stop here, but if you really want to be sure that only spaces, hyphens, and dots are used as delimiters then you could try the following (untested):
reg = re.compile(r".*?(\(?\d{3}\)? ?[\.-]? ?\d{3} ?[\.-]? ?\d{4}).*?", re.S)
I think this regex is very simple for parsing phone numbers
re.findall(r"[(][\d]{3}[)][ ]?[\d]{3}-[\d]{4}", lines)
The following completes the answers above. This regex is also able to detect a country code:
((?:\+\d{2}[-\.\s]??|\d{4}[-\.\s]??)?(?:\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}))
It can detect the samples below:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
# Detect phone numbers with country code
+00 000 000 0000
+00.000.000.0000
+00-000-000-0000
+000000000000
0000 0000000000
0000-000-000-0000
00000000000000
+00 (000)000 0000
0000 (000)000-0000
0000(000)000-0000
Updated as of 03.05.2022:
I fixed some issues in the phone number detection regex above; you can find the result at the link below. Extend the regex to include more country codes.
https://regex101.com/r/6Qcrk1/1
For Spanish phone numbers I use this with considerable success:
re.findall( r'[697]\d{1,2}.\d{2,3}.\d{2,3}.\d{0,2}',str)
You can check http://regex.inginf.units.it/. Given some training data and targets, it constructs an appropriate regex for you. It is not always perfect (check the F-score). Let's try it with 15 examples:
re.findall("\w\d \w\w \w\w \w\w \w\d|(?<=[^\d][^_][^_] )[^_]\d[^ ]\d[^ ][^ ]+|(?<= [^<]\w\w \w\w[^:]\w[^_][^ ][^,][^_] )(?: *[^<]\d+)+",
"""Lorem ipsum © 04-42-00-00-00 dolor 1901 sit amet, consectetur +33 (0)4 42 00 00 00 adipisicing elit. 2016 Sapiente dicta fugit fugiat hic 04 42 00 00 00 aliquam itaque 04.42.00.00.00 facere, 13205 number: 100 000 000 00013 soluta. 4 Totam id dolores!""")
returns ['04 42 00 00 00', '04.42.00.00.00', '04-42-00-00-00', '50498,']
Add more examples to gain precision.
Since nobody has posted this regex yet, I will. This is what I use to find phone numbers. It matches all regular phone number formats you see in the United States. I did not need this regex to match international numbers so I didn't make adjustments to regex for that purpose.
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"
Use this pattern if you want simple phone numbers with no characters in between to match. An example of this would be: "4441234567".
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
# search for a phone number using regex in Python
# form the regex according to your expected output
# with this you can get a single mobile number
import re
phoneRegex = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
Mobile = phoneRegex.search("my number is 123-456-6789")
print(Mobile.group())
Output: 123-456-6789
phoneRegex1 = re.compile(r"(\d\d\d-)?\d\d\d-\d\d\d\d")
Mobile1 = phoneRegex1.search("my number is 123-456-6789")
print(Mobile1.group())
Output: 123-456-6789
Mobile1 = phoneRegex1.search("my number is 456-6789")
print(Mobile1.group())
Output: 456-6789
While these are simple solutions they are all incorrect for North America. The problem lies in the fact that area-code and exchange numbers cannot start with a zero or a one.
r"(\(?[2-9]\d{2}\)?[ -])?[2-9]\d{2}-\d{4}"
would be the correct way to parse a 7 or 10-digit phone number.
(202) 555-4111
(202)-555-4111
202-555-4111
555-4111
will all parse correctly.
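A quick way to check this, as a sketch: the pattern is written with a single backslash before each parenthesis (inside a raw string, a doubled backslash would match a literal backslash instead), and the names NANP, samples, and results are illustrative.

```python
import re

# Area code and exchange must start with 2-9; the area code is optional.
NANP = re.compile(r"(\(?[2-9]\d{2}\)?[ -])?[2-9]\d{2}-\d{4}")

samples = ["(202) 555-4111", "(202)-555-4111", "202-555-4111", "555-4111"]
results = {s: bool(NANP.fullmatch(s)) for s in samples}
print(results)

# An exchange starting with 1 is rejected.
rejected = NANP.fullmatch("155-4111") is None
print(rejected)
```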
Use this code to find the number like "416-676-4560"
doc=browser.page_source
phones=re.findall(r'[\d]{3}-[\d]{3}-[\d]{4}',doc)