The following Python script lets me scrape email addresses from a given file using regular expressions.
How could I extend it so that it also captures phone numbers? Say, either 7-digit numbers or 10-digit numbers (with area code), while also accounting for parentheses?
My current script is below:
import os
import re

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'
# read the file
if os.path.exists(filename):
    with open(filename, 'r') as data:
        bulkemails = data.read()
else:
    print("File not found.")
    raise SystemExit
# regex = something@whatever.xxx
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
    emails += str(x) + "\n"
# function to write file
def writefile():
    with open(newfilename, 'w') as f:
        f.write(emails)
    print("File written.")
Regex for phone numbers:
(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})
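To plug a pattern like this into the script from the question, one option is to compile it next to the email regex and call findall in the same way. This is a minimal sketch, assuming the file contents are already in a string; phone_re and sample_text are illustrative names that do not appear in the original script.

```python
import re

# First phone pattern from this answer: 10-digit, parenthesized area code, or 7-digit.
phone_re = re.compile(
    r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}'
    r'|\d{3}[-\.\s]\d{4})'
)

# Hypothetical sample standing in for the contents of file.txt.
sample_text = "Call 555-123-4567 or (555) 987-6543, or locally 555-0199."

# The pattern has a single capture group, so findall returns one string per match.
phones = phone_re.findall(sample_text)
print("\n".join(phones))
```

The resulting list can be joined with newlines and written out the same way the script writes the emails string.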
Another regex for phone numbers:
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?
If you are interested in learning regex, you could take a stab at writing it yourself. It's not quite as hard as it's made out to be. Sites like RegexPal let you enter some test data, then write and test a regular expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find them (with brackets, area codes, etc.), grab a regex cheatsheet and see how far you can get. If nothing else, it will help in reading other people's expressions.
Edit:
Here is a modified version of your regex, which should also match 7- and 10-digit phone numbers that lack any hyphens, spaces or dots. I added question marks after the character classes (the []s), which makes each separator optional. I tested it in RegexPal, but as I'm still learning regex, I'm not sure that it's perfect. Give it a try.
(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})
It matched the following values in RegexPal:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
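The same check can be reproduced in Python rather than RegexPal. The sketch below runs the modified pattern against the samples listed above with fullmatch; modified_re, samples and unmatched are illustrative names. It only checks that the listed formats are accepted and says nothing about false positives, such as digits picked out of a longer run.

```python
import re

# Modified pattern from this answer, with optional separators.
modified_re = re.compile(
    r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}'
    r'|\d{3}[-\.\s]??\d{4})'
)

samples = [
    "000-000-0000", "000 000 0000", "000.000.0000",
    "(000)000-0000", "(000)000 0000", "(000)000.0000",
    "(000) 000-0000", "(000) 000 0000", "(000) 000.0000",
    "000-0000", "000 0000", "000.0000",
    "0000000", "0000000000", "(000)0000000",
]

# Collect every sample that the pattern does not match in full.
unmatched = [s for s in samples if not modified_re.fullmatch(s)]
print(unmatched)
```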
This is the process of building a phone number scraping regex.
First, we need to match an area code (3 digits), a trunk (3 digits), and an extension (4 digits):
reg = re.compile("\d{3}\d{3}\d{4}")
Now, we want to capture the matched phone number, so we add parentheses around the part that we're interested in capturing (all of it):
reg = re.compile("(\d{3}\d{3}\d{4})")
The area code, trunk, and extension might be separated by up to 3 characters that are not digits (such as the case when spaces are used along with the hyphen/dot delimiter):
reg = re.compile("(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")
Now, the phone number might actually start with a ( character (if the area code is enclosed in parentheses):
reg = re.compile("(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now that whole phone number is likely embedded in a bunch of other text:
reg = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now, that other text might include newlines:
reg = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
Enjoy!
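As a quick check of the final pattern, the sketch below runs it over a two-line sample. It also compares it with the bare core pattern, on the assumption that the results will be collected with findall, which already scans the whole string and so does not need the surrounding .*? or re.S. The sample text and variable names are made up for illustration.

```python
import re

# Hypothetical two-line input.
text = "Office: (212) 555-0123\nCell: 646.555.0188"

# Final pattern from the steps above, wrapped in .*? and compiled with re.S.
wrapped = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
# Same core pattern without the wrapper.
bare = re.compile(r"\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}")

wrapped_hits = wrapped.findall(text)  # returns the contents of the capture group
bare_hits = bare.findall(text)        # returns the whole match
print(wrapped_hits)
print(bare_hits)
```

Both calls should produce the same list here; the wrapper matters mainly when the pattern is used with match, which anchors at the start of the string.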
I personally stop here, but if you really want to be sure that only spaces, hyphens, and dots are used as delimiters then you could try the following (untested):
reg = re.compile(r".*?(\(?\d{3}\)? ?[.-]? ?\d{3} ?[.-]? ?\d{4}).*?", re.S)
I think this regex is very simple for parsing phone numbers
re.findall("[(][\d]{3}[)][ ]?[\d]{3}-[\d]{4}", lines)
Below is a completion of the answers above. This regex can also detect a country code:
((?:\+\d{2}[-\.\s]??|\d{4}[-\.\s]??)?(?:\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}))
It can detect the samples below:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
# Detect phone numbers with country code
+00 000 000 0000
+00.000.000.0000
+00-000-000-0000
+000000000000
0000 0000000000
0000-000-000-0000
00000000000000
+00 (000)000 0000
0000 (000)000-0000
0000(000)000-0000
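To see the country-code prefix at work on running text rather than isolated samples, here is a small sketch. The sentence and the names intl_re and numbers are invented for illustration; the pattern is the one given above, split across lines for readability.

```python
import re

# Country-code pattern from this answer: optional "+NN" or "NNNN" prefix,
# followed by one of the three national formats.
intl_re = re.compile(
    r'((?:\+\d{2}[-\.\s]??|\d{4}[-\.\s]??)?'
    r'(?:\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}'
    r'|\d{3}[-\.\s]??\d{4}))'
)

# Hypothetical input with one "+NN" prefix and one "NNNN" prefix.
text = "Reach us at +44 020 795 1234 or 0049-171-555-0123."

numbers = intl_re.findall(text)
print(numbers)
```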
Updated as of 03.05.2022:
I fixed some issues in the phone number detection regex above; you can find the fixed version at the link below. Extend the regex to include more country codes.
https://regex101.com/r/6Qcrk1/1
For Spanish phone numbers I use this with considerable success:
re.findall( r'[697]\d{1,2}.\d{2,3}.\d{2,3}.\d{0,2}',str)
You can check http://regex.inginf.units.it/. Given some training data and target matches, it constructs an appropriate regex for you. It is not always perfect (check the F-score). Let's try it with 15 examples:
re.findall("\w\d \w\w \w\w \w\w \w\d|(?<=[^\d][^_][^_] )[^_]\d[^ ]\d[^ ][^ ]+|(?<= [^<]\w\w \w\w[^:]\w[^_][^ ][^,][^_] )(?: *[^<]\d+)+",
"""Lorem ipsum © 04-42-00-00-00 dolor 1901 sit amet, consectetur +33 (0)4 42 00 00 00 adipisicing elit. 2016 Sapiente dicta fugit fugiat hic 04 42 00 00 00 aliquam itaque 04.42.00.00.00 facere, 13205 number: 100 000 000 00013 soluta. 4 Totam id dolores!""")
returns ['04 42 00 00 00', '04.42.00.00.00', '04-42-00-00-00', '50498,']
Add more examples to gain precision.
Since nobody has posted this regex yet, I will. This is what I use to find phone numbers. It matches all regular phone number formats you see in the United States. I did not need this regex to match international numbers so I didn't make adjustments to regex for that purpose.
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"
Use this pattern if you want simple phone numbers with no characters in between to match. An example of this would be: "4441234567".
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
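The difference between the two patterns shows up when they are run on the same input. A short sketch, with a made-up sample string and illustrative variable names:

```python
import re

# Separators required between the digit groups.
strict_re = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")
# Separators optional, so undelimited numbers also match.
lenient_re = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

text = "4441234567, (444) 123-4567, 444.123.4567"

strict_hits = strict_re.findall(text)
lenient_hits = lenient_re.findall(text)
print(strict_hits)
print(lenient_hits)
```

Only the lenient pattern picks up the undelimited "4441234567".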
import re

# search for a phone number using a regex in Python
# form the regex according to the output you need
# with this you get a single mobile number
phoneRegex = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
Mobile = phoneRegex.search("my number is 123-456-6789")
print(Mobile.group())
Output: 123-456-6789
phoneRegex1 = re.compile(r"(\d\d\d-)?\d\d\d-\d\d\d\d")
Mobile1 = phoneRegex1.search("my number is 123-456-6789")
print(Mobile1.group())
Output: 123-456-6789
Mobile1 = phoneRegex1.search("my number is 456-6789")
print(Mobile1.group())
Output: 456-6789
While these are simple solutions, they are all incorrect for North America. The problem is that area codes and exchange numbers cannot start with a zero or a one.
r"(\(?[2-9]\d{2}\)?[ -])?[2-9]\d{2}-\d{4}"
would be the correct way to parse a 7 or 10-digit phone number.
(202) 555-4111
(202)-555-4111
202-555-4111
555-4111
will all parse correctly.
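A short sketch to check those four samples, plus a few numbers that should be rejected. It assumes the pattern is written in a raw string with single backslashes before the parentheses; nanp_re and the invalid samples are illustrative additions.

```python
import re

# Area code and exchange must each start with a digit from 2 to 9.
nanp_re = re.compile(r"(\(?[2-9]\d{2}\)?[ -])?[2-9]\d{2}-\d{4}")

valid = ["(202) 555-4111", "(202)-555-4111", "202-555-4111", "555-4111"]
# Hypothetical bad inputs: leading 1 or 0 in the exchange or area code.
invalid = ["155-4111", "(102) 555-4111", "202-055-4111"]

accepted = [s for s in valid if nanp_re.fullmatch(s)]
rejected = [s for s in invalid if not nanp_re.fullmatch(s)]
print(accepted)
print(rejected)
```

Note that fullmatch is used deliberately: with search, an input such as "(102) 555-4111" would still yield the 7-digit tail "555-4111", because the area-code group is optional.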
Use this code to find numbers like "416-676-4560":
doc = browser.page_source
phones = re.findall(r'[\d]{3}-[\d]{3}-[\d]{4}', doc)
If I have text like this:
CARBON 1569
1.00% IRON 234
99% CARBON, 1% IRON 181
98.2% CARBON 1% ZINC 181
99% CARBON#1% IRON 141
ASD CARBON 2% IRON RANDOMWORD 23
Let's say I want to retain only the element names and percentage values (which include digits, the decimal point and the percent sign). I can run a regex substitution to do so. I tried plenty of combinations of things like (CARBON|IRON|ZINC), which replaces all occurrences of element names, and [^0-9.\%]+, which retains all percentage values.
But I can't figure out how to combine these such that I retain both the percentage values and element names. Any help would be appreciated.
EDIT: The spaces would also need to be retained for the output to make sense. All unnecessary characters can be replaced by spaces. The expected output would be
CARBON 1569
1.00% IRON 234
99% CARBON 1% IRON 181
98.2% CARBON 1% ZINC 181
99% CARBON 1% IRON 141
CARBON 2% IRON 23
You may use this regex to match your desired text:
\b(CARBON\b|IRON\b|ZINC\b|\d+(?:\.\d+)?(?:%|\b))|\S
And replace it with '\1 ' (this will add trailing spaces to the input lines).
RegEx Detail:
\b: Word boundary
(: Start capture group
CARBON\b: Match CARBON followed by word boundary
|: OR
IRON\b: Match IRON followed by word boundary
|: OR
ZINC\b: Match ZINC followed by word boundary
|: OR
\d+(?:\.\d+)?: Match an integer or float number
(?:%|\b): Followed by % or word boundary
): End capture group
|: OR
\S: Match a non-whitespace character
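In Python, the substitution described above could be sketched as follows. Two assumptions are worth flagging: an unmatched group in the replacement is treated as an empty string only on Python 3.5 and later, and the final whitespace normalization with split and join is an extra step added here to collapse the spaces the replacement leaves behind. The function name keep_elements is illustrative.

```python
import re

pattern = re.compile(r'\b(CARBON\b|IRON\b|ZINC\b|\d+(?:\.\d+)?(?:%|\b))|\S')

lines = [
    "CARBON 1569",
    "1.00% IRON 234",
    "99% CARBON, 1% IRON 181",
    "98.2% CARBON 1% ZINC 181",
    "99% CARBON#1% IRON 141",
    "ASD CARBON 2% IRON RANDOMWORD 23",
]

def keep_elements(line):
    # Wanted tokens are re-emitted via group 1; any other non-space
    # character leaves group 1 unmatched and becomes a single space.
    replaced = pattern.sub(r'\1 ', line)
    # Collapse the runs of spaces left behind (extra step, not in the answer).
    return " ".join(replaced.split())

cleaned = [keep_elements(line) for line in lines]
print("\n".join(cleaned))
```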
To simplify, you may start with this, as per your requirement:
\b(?!CARBON|ZINC|IRON)[a-zA-Z#]+
Then you may have to post-process the result (for example, replacing # with a blank), as per your comments.
You can try replacing all the words except:
* Element names
* Numbers
* Percentage.
To achieve this you can use negative lookahead:
(?!CARBON|IRON|ZINC|(\d+\.\d+\%)|\d+)\b[a-zA-Z#]+
This might be a silly question but I can't find a nice way to solve it.
I want to capture numbers in strings that contain a whitespace between every group of 3 digits, for example "45 000 €".
I can capture the numbers easily with a regex, but I can't manage to remove the whitespace directly, i.e. I get "45 000" instead of "45000".
import re
digits = re.findall(r"(\d+\s?\d*)", "Salary between 35 000 € and 45 000 €")
print(digits)
Returns :
['35 000', '45 000']
While I directly want:
['35000', '45000']
Of course, afterwards I could use a list comprehension to remove the whitespace from every number, but shouldn't there be a more direct solution with regex? I tried playing around with non-capturing groups and lookarounds, but with no success (either the whitespace stays, or the numbers are split in two).
Thanks for your help.
This expression will likely do that:
(?<=\d)\s+(?=\d)
Apply it with re.sub, then perform a simple re.findall.
import re
test_str = "Salary between 35 000 € and 45 000 € 35 000 000 0 0 0 €"
print(re.findall(r"(\d+)", re.sub(r"(?<=\d)\s+(?=\d)", "", test_str)))
Output
['35000', '45000', '35000000000']
The expression is explained on the top right panel of this demo if you wish to explore/simplify/modify it.
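As the last item in that output shows, removing whitespace between any two digits also merges stray digits into the preceding number. A stricter variant is sketched below, assuming that only a single whitespace character followed by exactly three digits counts as a thousands separator. This is an alternative technique, not the approach of the answer above, and it does fall back on a list comprehension.

```python
import re

test_str = "Salary between 35 000 € and 45 000 € 35 000 000 0 0 0 €"

# Match 1-3 leading digits followed by one or more " ddd" groups,
# or fall back to a plain run of digits.
grouped = re.findall(r"\d{1,3}(?:\s\d{3})+|\d+", test_str)

# Strip the separators inside each matched number.
digits = [re.sub(r"\s", "", number) for number in grouped]
print(digits)
```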
I want to extract phone numbers from text. I am able to extract a phone number when all of its digits are on a single line, but when some digits are on the next line, the regex does not work.
Here is my text:
I will be out of the office. Please send me an email and text my mobile +45
20 32 40 08 if any urgency.
In the text above, +45 is on the first line and 20 32 40 08 is on the second line. I am unable to extract the phone number from text like this. When all the digits are on a single line, it works fine.
Here is my regex:
reg_phonestyle = re.compile(r'(\d{2}[-\/\.\ \s]??\d{2}[-\/\.\ \s]??\d{2}[-\/\.\ \s]??\d{2}[-\/\.\ \s]??\d{2}|\(\d{3}\)\s*\d{3}[-\/\.\ \s]??\d{4}|\d{3}[-\/\.\ \s]??\d{4})')
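You can compile the pattern with the re.MULTILINE flag. Strictly speaking, that flag only changes how ^ and $ behave; the pattern below crosses the line break because \s also matches a newline.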
Given your example I propose the following solution:
import re
input_str = '''
I will be out of the office. Please send me an email and text my mobile +45
20 32 40 08 if any urgency.
'''
phone_reg = re.compile("([0-9]{2,4}[-.\s]{,1}){5}", re.MULTILINE)
print(phone_reg.search(input_str).group(0))
This regexp finds 5 groups, each made of 2 to 4 digits followed by 0 or 1 separator characters (hyphen, dot, or whitespace).
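To illustrate why the match can span two lines, the sketch below runs the same pattern with and without re.MULTILINE and then normalizes the whitespace in the result. The names with_flag, without_flag and normalized are illustrative, and the pattern is written as a raw string here.

```python
import re

input_str = '''
I will be out of the office. Please send me an email and text my mobile +45
20 32 40 08 if any urgency.
'''

pattern = r"([0-9]{2,4}[-.\s]{,1}){5}"

# \s matches "\n", so the flag makes no difference for this pattern.
with_flag = re.search(pattern, input_str, re.MULTILINE).group(0)
without_flag = re.search(pattern, input_str).group(0)

# Collapse the embedded newline and the trailing space.
normalized = " ".join(without_flag.split())
print(normalized)
```

The leading "+" is not part of the match, since the pattern starts at the first digit.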
Hope this helps
This is my way of getting the phone number. I would actually like more examples to verify my regex.
import re
strs = '''
I will be out of the office. Please send me an email and text my mobile +45
20 32 40 08 if any urgency.
'''
phone = re.compile("(?<=mobile\s)(.?[0-9]|\s)+", re.S)
print( " ".join(phone.search(strs).group(0).split()) ) # remove \n and space and etc.
I want to rename a long list of file names to make them more searchable. The names were auto-generated, so there are some odd spacing issues. I wrote a little Python script that does most of what I want, but I don't want to remove whitespace between words. For instance, I have two names:
0 130 — HG — 1500 — 12" (Page 1 of 2)
01 30 — HD LOW POINT DRAIN
They should read :
0130-HG-1500-12"
0130-HD LOW POINT DRAIN
My code so far :
import os
import re
for filename in os.listdir("."):
    if not filename.endswith(".py"):
        os.replace(filename, re.sub("[(].*?[)]", "",  # Remove anything between ()
                   "".join(filename.split()           # Remove any whitespace
                   ).replace("—", "-")))              # Replace em dash with hyphen
Everything is working, except that I can't figure out how to avoid stripping the whitespace between words.
If by "words" you mean "strings made up of letters" then
re.sub('((?<=[^a-zA-Z]) | (?=[^a-zA-Z]))', '', filename)
will do the trick. In plain language, that would be "replace every space that is either after or before a non-letter character with nothing". Output:
In [24]: re.sub('((?<=[^A-Z]) | (?=[^A-Z]))', '', '01 30 — HD LOW POINT DRAIN')
Out[24]: '0130—HD LOW POINT DRAIN'
In [25]: re.sub('((?<=[^A-Z]) | (?=[^A-Z]))', '', '0 130 — HG — 1500 — 12"')
Out[25]: '0130—HG—1500—12"'
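Putting the pieces together, the cleanup could be sketched as one helper that is applied to a name before renaming. The function name clean_name and the final strip are assumptions added here; the three steps come from the question's script and the substitution above.

```python
import re

def clean_name(name):
    # Remove anything between parentheses, e.g. "(Page 1 of 2)".
    name = re.sub(r"\(.*?\)", "", name)
    # Replace em dashes with plain hyphens.
    name = name.replace("—", "-")
    # Remove spaces that touch a non-letter on either side,
    # keeping spaces between two letters (i.e. between words).
    name = re.sub(r"(?<=[^a-zA-Z]) | (?=[^a-zA-Z])", "", name)
    return name.strip()

first = clean_name('0 130 — HG — 1500 — 12" (Page 1 of 2)')
second = clean_name("01 30 — HD LOW POINT DRAIN")
print(first)
print(second)
```

In the renaming loop, the result of clean_name(filename) would take the place of the nested re.sub expression passed to os.replace.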
Below are the UK phone numbers I need to fetch from a text file:
07791523634
07910221698
But it only prints 0779152363 and 0791022169, skipping the 11th digit.
It also produces unnecessary values like ('').
Ex: '', '07800 854536'
Below is the regex I've used:
phnsrch = re.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{5}|\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|/^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$/|')
I need help fetching the complete 11-digit numbers without any unnecessary symbols.
Finally figured out the solution for matching the UK numbers below:
07540858798
0113 2644489
02074 735 217
07512 850433
01942 896007
01915222200
01582 492734
07548 021 475
020 8563 7296
07791523634
re.compile(r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})')
Thanks to those who helped me with this issue.
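For anyone who wants to re-check that pattern, the sketch below runs it against the ten sample numbers with fullmatch. The names uk_re, uk_samples and missed are illustrative, and the check covers only the listed formats.

```python
import re

uk_re = re.compile(
    r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}'
    r'|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}'
    r'|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})'
)

uk_samples = [
    "07540858798", "0113 2644489", "02074 735 217", "07512 850433",
    "01942 896007", "01915222200", "01582 492734", "07548 021 475",
    "020 8563 7296", "07791523634",
]

# Any sample that is not matched in full ends up in this list.
missed = [s for s in uk_samples if not uk_re.fullmatch(s)]
print(missed)
```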
I think your regex is too long and could be simpler. Try this regex instead:
^(07\d{8,12}|447\d{7,11})$