Python regex for UK number - python

These are the UK phone numbers I need to fetch from a text file:
07791523634
07910221698
But my code only prints 0779152363 and 0791022169, dropping the 11th digit.
It also produces unnecessary empty values (''),
for example: '', '07800 854536'
Below is the regex I've used:
phnsrch = re.compile(r'\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{5}|\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|/^(?:(?:\(?(?:0(?:0|11)\)?[\s-]?\(?|\+)44\)?[\s-]?(?:\(?0\)?[\s-]?)?)|(?:\(?0))(?:(?:\d{5}\)?[\s-]?\d{4,5})|(?:\d{4}\)?[\s-]?(?:\d{5}|\d{3}[\s-]?\d{3}))|(?:\d{3}\)?[\s-]?\d{3}[\s-]?\d{3,4})|(?:\d{2}\)?[\s-]?\d{4}[\s-]?\d{4}))(?:[\s-]?(?:x|ext\.?|\#)\d{3,4})?$/|')
I need help fetching the complete 11-digit numbers without any unnecessary symbols.

Finally figured out the solution for matching the UK numbers below:
07540858798
0113 2644489
02074 735 217
07512 850433
01942 896007
01915222200
01582 492734
07548 021 475
020 8563 7296
07791523634
re.compile(r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})')
Thanks to those who helped me with this issue.
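A quick way to check a pattern like the one above is to run it against each sample separately with re.fullmatch, so that \s cannot bridge two adjacent lines. This is only a sketch; the names uk_pattern, samples and unmatched are illustrative and not from the original post.

```python
import re

# Pattern copied from the answer above, split over lines for readability.
uk_pattern = re.compile(
    r'\d{3}[-\.\s]??\d{4}[-\.\s]??\d{4}'
    r'|\d{5}[-\.\s]??\d{3}[-\.\s]??\d{3}'
    r'|(?:\d{4}\)?[\s-]?\d{3}[\s-]?\d{4})'
)

# The ten numbers listed in the answer.
samples = [
    "07540858798", "0113 2644489", "02074 735 217", "07512 850433",
    "01942 896007", "01915222200", "01582 492734", "07548 021 475",
    "020 8563 7296", "07791523634",
]

# Collect every sample the pattern fails to match in full.
unmatched = [s for s in samples if not uk_pattern.fullmatch(s)]
print(unmatched)
```

An empty list means every sample was matched in full, including the 11th digit.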

I think your regex is longer than it needs to be and can be simplified. Try this one instead:
^(07\d{8,12}|447\d{7,11})$

Related

Catastrophic backtracking error with any single character or number?

First of all, I know the title is not as objective as it should be. I don't understand why the error below occurs with the Python flavor on the regex101 website.
To explain what I'm trying to do: I need to match any number after "item", followed by everything up to "consumo estimado".
Regex:
^item\s*(\d{0,})(.*?)consumo
Example text:
ITEM 1 – AGULHA DE PUNÇÃO
Agulha de punção 18 ga x 70 mm
Consumo Estimado Anual: 284
Ampla Participação
ITEM 2 - CATETER ANGIOGRAFICO PIGTAIL
Cateter angiográfico diagnóstico pigtail 5f x 100 cm
Consumo Estimado Anual: 210
Ampla Participação
ITEM 3 – Próteses Vasculares Dracon Reta 80 Cm
PROTESES VASCULARES ANELADA - Enxerto vascular reto constituído
em politetrafluoretileno (PTFE) extrudado e expandido construído com
suporte externo anelado que aumentam a resistência mecânica.
Tamanho
aproximado 8mm (diâmetro) x 70 -80 cm (comprimento)
Consumo Estimado Anual: 34
Ampla Participação
But as soon as I add a space and any further character after "consumo", I get a "catastrophic backtracking" error.
Example regexes that produce the error:
^item\s*(\d{0,})(.*?)consumo e
^item\s*(\d{0,})(.*?)consumo 1
The solution was to use .*? to capture everything between "consumo" and "estimado", which worked properly.
^item\s*(\d{0,})(.*?)consumo.*?estimado
Why does this error occur? I couldn't find any explanation for it.
I already have a solution for the problem, but I would like to know why the error happened.
https://regex101.com/r/uqm7ra/1
Edit 1:
As suggested, I have added the link to the current saved regex with the problem.
Edit 2:
As suggested, I also have tried to follow the "meta" when asking for anything here in Stack Overflow. Thanks for the advice!
I hope the question is better now.
\d{0,} looks iffy, the regex engine will retry with fewer and fewer digits which can be catastrophic. Anchor it with (\D.*?)?consumo to prevent that.
Also, if you want a number, you mean {1,} (or the more idiomatic and brief +; similarly, {0,} is customarily written *).
^item\s*(\d+)(\D.*?)?consumo
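As a rough check, the suggested pattern can be run in Python against the first two items of the sample text. The flags are an assumption (the saved regex101 settings are not shown in the question): IGNORECASE because the pattern is lowercase, MULTILINE so ^ matches at each line start, and DOTALL so the lazy group can cross line breaks.

```python
import re

text = """ITEM 1 – AGULHA DE PUNÇÃO
Agulha de punção 18 ga x 70 mm
Consumo Estimado Anual: 284
Ampla Participação
ITEM 2 - CATETER ANGIOGRAFICO PIGTAIL
Cateter angiográfico diagnóstico pigtail 5f x 100 cm
Consumo Estimado Anual: 210
Ampla Participação
"""

# Assumed flags, see the note above; the pattern itself is from the answer.
pattern = re.compile(r'^item\s*(\d+)(\D.*?)?consumo', re.I | re.M | re.S)

# Group 1 holds the item number of each match.
item_numbers = [m.group(1) for m in pattern.finditer(text)]
print(item_numbers)
```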

Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after that.
After that, I would like to skip 2 words and extract all the words just before 'Paid' in a single group (I will read more about groups, but for now I need a single group that I can later split to get the individual words).
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex command? Also, is there a good regex tutorial where I can read up on?
For now I have created one match with 2 groups, i.e. group 1 for the amount and group 2 for all the words (including the trailing "to ").
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
# your_input_string holds the statement text shown in the question
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)
for found in output:
    print(found)
# ('-75.76', 'ASIA DIRECT to ')
# ('-400.00', 'PTE AXS to ')
Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, a space, characters, a space, then capture everything and end with "Paid". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

Select string which contains punctuation

I'm trying to remove the titles from a set of professors' names,
such as Dr.Eng, Dr.rer.nat, M.S., Dr., S.Si and so on. Basically any string that contains more than one dot.
This is an example list after I split the names and titles on ",":
2 [CHOTIMAH, Dr., M.S., RINTO ANUGRAHA NQZ, S...
3 [HARSOJO, S.U., M.Sc., Dr., SUDARMAJI, S.S...
4 [IKHSAN SETIAWAN, S.Si., M.Si., ARI SETIAWAN...
5 [EKO SULISTYA, Dr., M.Si., YOSEF ROBERTUS UT...
6 [SUNARTA, Drs., M.S., WAGINI R., Drs., M.S.]
7 [BAMBANG MURDAKA EKA JATI, Drs., M.S., KAMSU...
8 [AHMAD KUSUMA ATMAJA, S.Si., M.Sc., Dr.Eng....
9 [MOH. ALI JOKO WASONO, M.S., Dr.]
I have tried r'\S*[^\w\s]\S' but it returned
CHOTIMAH, INTO ANUGRAHA NQZ, .
HARSOJO, UDARMAJI, i.
IKHSAN SETIAWAN, RI SETIAWAN, ng.
EKO SULISTYA, OSEF ROBERTUS UTOMO, Dr.
SUNARTA, AGINI .
BAMBANG MURDAKA EKA JATI, AMSUL ABRAHA, Prof.
AHMAD KUSUMA ATMAJA, ITRAYANA, Dr.
MOH. ALI JOKO WASONO, Dr.
Some professors' names are abbreviated with a trailing dot, e.g. MOHAMMAD to MOH., and I don't want those to be removed.
Any help is appreciated!
\w{0,}\.(\w{0,}\.)? This regex matches a word of any length followed by a period, optionally followed by another word of any length and a period. This captures Dr., M.S., etc. I'm pretty sure that's what you're asking for; if not, let me know.
In the future you can use regexr.com to easily test regex matches. Also, you've tagged this post with Python and Pandas, but those aren't really relevant tags. Please either include more code to make the tags relevant or avoid using irrelevant tags.
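Since the names have already been split on ",", another option is to filter whole tokens instead of matching inside them. The sketch below is a different technique from the regex above and rests on an assumption: a title is a token with no internal whitespace and at least one dot, while a name such as "MOH. ALI JOKO WASONO" contains spaces. The names TITLE and strip_titles are hypothetical.

```python
import re

# Assumption: a title has no internal whitespace and contains a dot,
# e.g. "Dr.", "M.S.", "Dr.Eng.".
TITLE = re.compile(r'\S*\.\S*')

def strip_titles(tokens):
    """Return the tokens that do not look like academic titles."""
    return [t.strip() for t in tokens if not TITLE.fullmatch(t.strip())]

row = "MOH. ALI JOKO WASONO, M.S., Dr.".split(",")
print(strip_titles(row))
```

A title written with spaces (for example "Dr. rer. nat.") or a single-word name ending in a dot would be misclassified, so the heuristic needs adjusting to the real data.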

Find USA phone numbers in python script

The following Python script allows me to scrape email addresses from a given file using regular expressions.
How could I extend it so that it also extracts phone numbers? Say, either 7-digit or 10-digit numbers (with area code), also accounting for parentheses.
My current script can be found below:
import os
import re

# filename variables
filename = 'file.txt'
newfilename = 'result.txt'
# read the file
if os.path.exists(filename):
    data = open(filename, 'r')
    bulkemails = data.read()
    data.close()
else:
    print("File not found.")
    raise SystemExit
# regex = something@whatever.xxx
r = re.compile(r'(\b[\w.]+@+[\w.]+.+[\w.]\b)')
results = r.findall(bulkemails)
emails = ""
for x in results:
    emails += str(x) + "\n"
# function to write file
def writefile():
    f = open(newfilename, 'w')
    f.write(emails)
    f.close()
    print("File written.")
Regex for phone numbers:
(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})
Another regex for phone numbers:
(?:(?:\+?1\s*(?:[.-]\s*)?)?(?:\(\s*([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9])\s*\)|([2-9]1[02-9]|[2-9][02-8]1|[2-9][02-8][02-9]))\s*(?:[.-]\s*)?)?([2-9]1[02-9]|[2-9][02-9]1|[2-9][02-9]{2})\s*(?:[.-]\s*)?([0-9]{4})(?:\s*(?:#|x\.?|ext\.?|extension)\s*(\d+))?
If you are interested in learning regex, you could take a stab at writing it yourself. It's not quite as hard as it's made out to be. Sites like RegexPal allow you to enter some test data, then write and test a regular expression against that data. Using RegexPal, try adding some phone numbers in the various formats you expect to find them (with brackets, area codes, etc.), grab a regex cheatsheet and see how far you can get. If nothing else, it will help in reading other people's expressions.
Edit:
Here is a modified version of your regex, which should also match 7- and 10-digit phone numbers that lack any hyphens, spaces or dots. I added question marks after the character classes (the []s), which makes the separator they match optional. I tested it in RegexPal, but as I'm still learning regex, I'm not sure that it's perfect. Give it a try.
(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4})
It matched the following values in RegexPal:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
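The same check can be reproduced in Python rather than RegexPal. This is a small sketch; the variable names are illustrative, and re.fullmatch is used so that a partial match does not count as a pass.

```python
import re

# Pattern copied from the answer above, split over lines for readability.
phone = re.compile(
    r'(\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}'
    r'|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}'
    r'|\d{3}[-\.\s]??\d{4})'
)

samples = [
    "000-000-0000", "000 000 0000", "000.000.0000",
    "(000)000-0000", "(000)000 0000", "(000)000.0000",
    "(000) 000-0000", "(000) 000 0000", "(000) 000.0000",
    "000-0000", "000 0000", "000.0000",
    "0000000", "0000000000", "(000)0000000",
]

# Any sample that is not matched in full ends up in this list.
failures = [s for s in samples if phone.fullmatch(s) is None]
print(failures)
```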
This is the process of building a phone number scraping regex.
First, we need to match an area code (3 digits), a trunk (3 digits), and an extension (4 digits):
reg = re.compile("\d{3}\d{3}\d{4}")
Now, we want to capture the matched phone number, so we add parentheses around the parts that we're interested in capturing (all of it):
reg = re.compile("(\d{3}\d{3}\d{4})")
The area code, trunk, and extension might be separated by up to 3 characters that are not digits (such as the case when spaces are used along with the hyphen/dot delimiter):
reg = re.compile("(\d{3}\D{0,3}\d{3}\D{0,3}\d{4})")
Now, the phone number might actually start with a ( character (if the area code is enclosed in parentheses):
reg = re.compile("(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now that whole phone number is likely embedded in a bunch of other text:
reg = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
Now, that other text might include newlines:
reg = re.compile(".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)
Enjoy!
I personally stop here, but if you really want to be sure that only spaces, hyphens, and dots are used as delimiters then you could try the following (untested):
reg = re.compile(r".*?(\(?\d{3}\)? ?[\.-]? ?\d{3} ?[\.-]? ?\d{4}).*?", re.S)
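To see what the re.S version from the walkthrough returns, here is a short usage sketch. The sample text is made up for illustration, and the pattern is written as a raw string so the backslashes reach the regex engine unchanged.

```python
import re

# Pattern from the walkthrough step that adds re.S.
reg = re.compile(r".*?(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", re.S)

# Made-up sample text spanning two lines (not from the original post).
text = "Call (555) 123-4567\nor 555.987.6543 today"

# findall returns the contents of the single capturing group per match.
found = reg.findall(text)
print(found)
```

Because \D{0,3} accepts any non-digit, this version also matches numbers separated by letters or punctuation, which is the trade-off the stricter variant tries to avoid.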
I think this regex is very simple for parsing phone numbers
re.findall("[(][\d]{3}[)][ ]?[\d]{3}-[\d]{4}", lines)
Below is a completion of the answers above. This regex can also detect a country code:
((?:\+\d{2}[-\.\s]??|\d{4}[-\.\s]??)?(?:\d{3}[-\.\s]??\d{3}[-\.\s]??\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]??\d{4}|\d{3}[-\.\s]??\d{4}))
It can detect the samples below:
000-000-0000
000 000 0000
000.000.0000
(000)000-0000
(000)000 0000
(000)000.0000
(000) 000-0000
(000) 000 0000
(000) 000.0000
000-0000
000 0000
000.0000
0000000
0000000000
(000)0000000
# Detect phone numbers with country code
+00 000 000 0000
+00.000.000.0000
+00-000-000-0000
+000000000000
0000 0000000000
0000-000-000-0000
00000000000000
+00 (000)000 0000
0000 (000)000-0000
0000(000)000-0000
Updated as of 03.05.2022:
I fixed some issues in the phone number detection regex above; you can find it at the link below. Extend the regex to include more country codes.
https://regex101.com/r/6Qcrk1/1
For Spanish phone numbers I use this with considerable success:
re.findall(r'[697]\d{1,2}.\d{2,3}.\d{2,3}.\d{0,2}', text)
You can check : http://regex.inginf.units.it/. With some training data and target, it constructs you an appropriate regex. It is not always perfect (check F-score). Let's try it with 15 examples :
re.findall("\w\d \w\w \w\w \w\w \w\d|(?<=[^\d][^_][^_] )[^_]\d[^ ]\d[^ ][^ ]+|(?<= [^<]\w\w \w\w[^:]\w[^_][^ ][^,][^_] )(?: *[^<]\d+)+",
"""Lorem ipsum © 04-42-00-00-00 dolor 1901 sit amet, consectetur +33 (0)4 42 00 00 00 adipisicing elit. 2016 Sapiente dicta fugit fugiat hic 04 42 00 00 00 aliquam itaque 04.42.00.00.00 facere, 13205 number: 100 000 000 00013 soluta. 4 Totam id dolores!""")
returns ['04 42 00 00 00', '04.42.00.00.00', '04-42-00-00-00', '50498,']
Add more examples to gain precision.
Since nobody has posted this regex yet, I will. This is what I use to find phone numbers. It matches all regular phone number formats you see in the United States. I did not need this regex to match international numbers so I didn't make adjustments to regex for that purpose.
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}"
Use this pattern if you want simple phone numbers with no characters in between to match. An example of this would be: "4441234567".
phone_number_regex_pattern = r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}"
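The difference between the two patterns can be seen side by side. This is a minimal sketch; the names strict and loose and the sample numbers are illustrative.

```python
import re

# Separators required between the digit groups.
strict = re.compile(r"\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{4}")
# Separators optional, so a bare run of ten digits also matches.
loose = re.compile(r"\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}")

for candidate in ["(444) 123-4567", "444.123.4567", "4441234567"]:
    # Print whether each pattern finds a number in the candidate string.
    print(candidate, bool(strict.search(candidate)), bool(loose.search(candidate)))
```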
# search for a phone number using regex in Python
# form the regex according to your expected output
# with this you can get a single mobile number
import re
phoneRegex = re.compile(r"\d\d\d-\d\d\d-\d\d\d\d")
Mobile = phoneRegex.search("my number is 123-456-6789")
print(Mobile.group())
Output: 123-456-6789
phoneRegex1 = re.compile(r"(\d\d\d-)?\d\d\d-\d\d\d\d")
Mobile1 = phoneRegex1.search("my number is 123-456-6789")
print(Mobile1.group())
Output: 123-456-6789
Mobile1 = phoneRegex1.search("my number is 456-6789")
print(Mobile1.group())
Output: 456-6789
While these are simple solutions they are all incorrect for North America. The problem lies in the fact that area-code and exchange numbers cannot start with a zero or a one.
r"(\(?[2-9]\d{2}\)?[ -])?[2-9]\d{2}-\d{4}"
would be the correct way to parse a 7 or 10-digit phone number.
(202) 555-4111
(202)-555-4111
202-555-4111
555-4111
will all parse correctly.
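A small check of this rule, as a sketch: the pattern below escapes the literal parentheses with a single backslash, which is what a Python raw string needs, and the names nanp, accepted and rejected are illustrative. The two rejected samples are made up to show the leading 0/1 restriction.

```python
import re

# Area code and exchange must each start with a digit from 2 to 9.
nanp = re.compile(r"(\(?[2-9]\d{2}\)?[ -])?[2-9]\d{2}-\d{4}")

samples = ["(202) 555-4111", "(202)-555-4111", "202-555-4111", "555-4111"]
accepted = [s for s in samples if nanp.fullmatch(s)]

# Exchanges starting with 0 or 1 should not match.
rejected = [s for s in ["055-4111", "155-4111"] if nanp.fullmatch(s) is None]
print(accepted, rejected)
```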
Use this code to find the number like "416-676-4560"
doc=browser.page_source
phones=re.findall(r'[\d]{3}-[\d]{3}-[\d]{4}',doc)

Parsing fixed-format data embedded in HTML in python

I am using google's appengine api
from google.appengine.api import urlfetch
to fetch a webpage. The result of
result = urlfetch.fetch("http://www.example.com/index.html")
is a string of the html content (in result.content). The problem is the data that I want to parse is not really in HTML form, so I don't think using a python HTML parser will work for me. I need to parse all of the plain text in the body of the html document. The only problem is that urlfetch returns a single string of the entire HTML document, removing all newlines and extra spaces.
EDIT:
Okay, I tried fetching a different URL and apparently urlfetch does not strip the newlines, it was the original webpage I was trying to parse that served the HTML file that way...
END EDIT
If the document is something like this:
<html><head></head><body>
AAA 123 888 2008-10-30 ABC
BBB 987 332 2009-01-02 JSE
...
A4A 288 AAA
</body></html>
result.content will be this, after urlfetch fetches it:
'<html><head></head><body>AAA 123 888 2008-10-30 ABCBBB 987 2009-01-02 JSE...A4A 288 AAA</body></html>'
Using an HTML parser will not help me with the data between the body tags, so I was going to use regular expressions to parse my data, but as you can see the last part of one line gets combined with the first part of the next line, and I don't know how to split it. I tried
result.content.split('\n')
and
result.content.split('\r')
but the resulting list was all just 1 element. I don't see any options in google's urlfetch function to not remove newlines.
Any ideas how I can parse this data? Maybe I need to fetch it differently?
Thanks in advance!
I understand that the format of the document is the one you have posted. In that case, I agree that a parser like Beautiful Soup may not be a good solution.
I assume that you are already getting the interesting data (between the BODY tags) with a regular expression like
import re
data = re.findall(r'<body>([^<]*)</body>', result.content)[0]
then, it should be as easy as:
start = 0
end = 5
while end < len(data):
    print(data[start:end])
    start = end
    end = end + 5
print(data[start:])
(note: I did not check this code against boundary cases, and I do expect it to fail. It is only here to show the generic idea)
Only suggestion I can think of is to parse it as if it has fixed width columns. Newlines are not taken into consideration for HTML.
If you have control of the source data, put it into a text file rather than HTML.
Once you have the body text as a single, long string, you can break it up as follows.
This presumes that each record is 26 characters.
body= "AAA 123 888 2008-10-30 ABCBBB 987 2009-01-02 JSE...A4A 288 AAA"
for i in range(0, len(body), 26):
    line = body[i:i+26]
    # parse the line
EDIT: Reading comprehension is a desirable thing. I missed the bit about the lines being run together with no separator between them, which would kinda be the whole point of this, wouldn't it? So, nevermind my answer, it's not actually relevant.
If you know that each line is 5 space-separated columns, then (once you've stripped out the html) you could do something like (untested):
def generate_lines(datastring):
    while datastring:
        splitresult = datastring.split(' ', 5)
        # a sixth element exists only if more data follows this record
        if len(splitresult) > 5:
            datastring = splitresult[5]
        else:
            datastring = None
        yield splitresult[:5]
for line in generate_lines(data):
    process_data_line(line)
Of course, you can change the split character and number of columns as needed (possibly even passing them into the generator function as additional parameters), and add error handling as appropriate.
Further suggestions for splitting the string s into 26-character blocks:
As a list:
>>> [s[x:x+26] for x in range(0, len(s), 26)]
['AAA 123 888 2008-10-30 ABC',
'BBB 987 2009-01-02 JSE',
'A4A 288 AAA']
As a generator:
>>> for line in (s[x:x+26] for x in range(0, len(s), 26)): print line
AAA 123 888 2008-10-30 ABC
BBB 987 2009-01-02 JSE
A4A 288 AAA
Replace range() with xrange() in Python 2.x if s is very long.
