Regex: match one pattern and exclude another pattern - python

I have a regular expression that matches the phone numbers:
import re
phones = re.findall(r'[+(]?[0-9][0-9 \-()]{8,}[0-9]', text)
It shows good accuracy in a large raw text dataset.
But sometimes it matches unwanted results (ranges of years and random IDs).
Ranges of years:
'2012 - 2017'
'(2011 - 2013'
'1999 02224'
'2019 2010-2015'
'2018-2018 (5'
'2004 -2009'
'1) 2005-2006'
'2011 2020'
Random ids:
'5 5 5 5'
'100032479008252'
'100006711277302'
I have ideas on how to solve these problems.
Limit the total number of digits to 12 digits.
Limit the total number of characters to 16 characters.
Remove the ranges of years (19**|20** - 19**|20**).
But I do not know how to implement these ideas and make them as exceptions in my regular expression.
Some examples that a regular expression should catch are presented below:
380-956-425979
+38(097)877-43-88
+38(050) 284-24-20
(097) 261-60-52
380-956-425979
(068)1850063
0975533222

I suggest writing a separate pattern for each phone structure. I'm not sure what structures your phone numbers can take, but this matches your examples:
import re
test = '''380-956-425979
+38(097)877-43-88
+38(050) 284-24-20
(097) 261-60-52
380-956-425979
(068)1850063
0975533222'''
solution = test.split("\n")
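# Raw strings keep the backslashes intact for the regex engine.
p1 = r"\+?\d{3}-\d{3}-\d{6}"  # 380-956-425979
p2 = r"\+?(?:\d{2})?\(\d{3}\) ?\d{3}-\d{2}-\d{2}"  # +38(097)877-43-88, (097) 261-60-52
p3 = r"\+?(?:\(\d{3}\)|\d{3})\d{7}"  # (068)1850063, 0975533222
result = re.findall(f'{p1}|{p2}|{p3}', test)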
print(solution)
print(result)

You could do it in Python directly:
import re
if re.match("condition", "teststring") and not re.match("not-condition", "teststring"):
    print("Match!")
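The three ideas from the question (a digit limit, a length limit, and dropping year ranges) can be sketched as a filter applied after the original pattern. This is only a sketch: find_phones, max_digits and max_chars are hypothetical names, and the length limit is set to 18 rather than 16 as an assumption, because '+38(050) 284-24-20' from the wanted examples is 18 characters long.

```python
import re

# The asker's original candidate pattern, unchanged.
CANDIDATE_RE = re.compile(r'[+(]?[0-9][0-9 \-()]{8,}[0-9]')

# A hyphenated year range such as "2012 - 2017" or "2004 -2009" (19xx or 20xx).
YEAR_RANGE_RE = re.compile(r'(?:19|20)\d{2}\s*-\s*(?:19|20)\d{2}')


def find_phones(text, max_digits=12, max_chars=18):
    """Return candidates that pass the digit, length, and year-range checks."""
    phones = []
    for candidate in CANDIDATE_RE.findall(text):
        digit_count = sum(ch.isdigit() for ch in candidate)
        if digit_count > max_digits or len(candidate) > max_chars:
            continue  # too long for a phone number, e.g. a 15-digit ID
        if YEAR_RANGE_RE.search(candidate):
            continue  # looks like a range of years
        phones.append(candidate)
    return phones


text = "Call +38(097)877-43-88 or 0975533222. Worked 2012 - 2017. ID 100032479008252."
print(find_phones(text))
```

Candidates without a hyphen, such as '1999 02224', still pass these three checks, so a minimum digit count may be worth adding for real data.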

Related

Given a string, extract all the necessary information about the person

In my homework, I need to extract the first name, last name, ID code, phone number, date of birth and address of a person from a given string using Regex. The order of the parameters always remains the same. Each parameter requires a separate pattern.
Requirements are as follows:
Both first and last names always begin with a capital letter followed by at least one lowercase letter.
ID code is always 11 characters long and consists only of numbers.
The phone number itself is a combination of 7-8 numbers. The phone number might be separated from the area code with a whitespace, but not necessarily. It is also possible that there is no area code at all.
Date of birth is formatted as dd-MM-YYYY
Address is everything else that remains.
I got the following patterns for each parameter:
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
first_name_pattern = r"^[A-Z][a-z]+"
last_name_pattern = r"[A-Z][a-z]+(?=[0-9])"
id_code_pattern = r"\d{11}(?=\+)"
phone_number_pattern = r"\+\d{3}?\s*\d{7,8}"
date_pattern = r"\d{1,2}\-\d{1,2}\-\d{1,4}"
address_pattern = r"[A-Z][a-z]*\s.*$"
first_name_match = re.findall(first_name_pattern, str1)
last_name_match = re.findall(last_name_pattern, str1)
id_code_match = re.findall(id_code_pattern, str1)
phone_number_match = re.findall(phone_number_pattern, str1)
date_match = re.findall(date_pattern, str1)
address_match = re.findall(address_pattern, str1)
So, given "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti", I get ['Heino'] ['Plekk'] ['69712047623'] ['+37256887364'] ['12-09-2020'] ['Tartu mnt 183,Tallinn,16881,Eesti'], which suits me perfectly.
The problem starts when the area code is missing: id_code_pattern can no longer find the ID code because of the (?=\+) lookahead. If I add |\d{11} as an alternative, there is another problem, because now it finds both the ID code and the phone number (69712047623 and 37256887364). I also do not understand how to improve phone_number_pattern so that it finds only the 7 or 8 digits of the phone number.
A single expression with some well-crafted capture groups will help you immensely:
import re
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
pattern = r"^(?P<first_name>[A-Z][a-z]+)(?P<last_name>[A-Z][a-z]+)(?P<id_code>\d{11})(?P<phone>(?:\+\d{3})?\s*\d{7,8})(?P<dob>\d{1,2}\-\d{1,2}\-\d{1,4})(?P<address>.*)$"
print(re.match(pattern, str1).groupdict())
Result:
{'first_name': 'Heino', 'last_name': 'Plekk', 'id_code': '69712047623', 'phone': '+37256887364', 'dob': '12-09-2020', 'address': 'Tartu mnt 183,Tallinn,16881,Eesti'}
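A possible variation, sketched under assumptions not stated in the answer: the area code gets its own optional group (the name area_code is hypothetical), and the date is tightened to exactly dd-MM-YYYY as the requirements describe. With the looser \d{1,2} date, a 7-digit phone number can swallow the first digit of the day; the fixed-width date forces the regex to backtrack to 7 digits.

```python
import re

PERSON_RE = re.compile(
    r"^(?P<first_name>[A-Z][a-z]+)"
    r"(?P<last_name>[A-Z][a-z]+)"
    r"(?P<id_code>\d{11})"
    r"(?P<area_code>\+\d{3})?\s*"   # optional, e.g. +372
    r"(?P<phone>\d{7,8})"
    r"(?P<dob>\d{2}-\d{2}-\d{4})"   # strictly dd-MM-YYYY
    r"(?P<address>.*)$"
)


def parse_person(record):
    """Return the named groups as a dict, or None if the record does not match."""
    match = PERSON_RE.match(record)
    return match.groupdict() if match else None


with_area = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
without_area = "HeinoPlekk69712047623568873612-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
print(parse_person(with_area))
print(parse_person(without_area))
```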

Extract date from a string with a lot of numbers

There seem to be quite a few ways to extract datetimes in various formats from a string. But there seems to be an issue when the string contains many numbers and symbols.
Here is an example:
t = 'Annual Transmission Revenue Requirements and Rates Transmission Owner (Transmission Zone) Annual Transmission Revenue Requirement Network Integration Transmission Service Rate ($/MW-Year) AE (AECO) $136,632,319 $53,775 AEP (AEP) $1,295,660,732 $59,818.14 AP (APS) $128,000,000 $17,895 ATSI (ATSI) $659,094,666 $54,689.39 BC (BGE) $230,595,535 $35,762 ComEd, Rochelle (CE) $702,431,433 $34,515.60 Dayton (DAY) $40,100,000 $13,295.76 Duke (DEOK) $121,250,903 $24,077 Duquesne (DLCO) $139,341,808 $51,954.44 Dominion (DOM) $1,031,382,000 $52,457.21 DPL, ODEC (DPL) $163,224,128 $42,812 East Kentucky Power Cooperative (EKPC) $83,267,903 $24,441 MAIT (METED, PENELEC) $150,858,703 $26,069.39 JCPL $135,000,000 $23,597.27 PE (PECO) $155,439,100 $19,093 PPL, AECoop, UGI (PPL) $435,349,329 $58,865 PEPCO, SMECO (PEPCO) $190,876,083 $31,304.21 PS (PSEG) $1,248,819,352 $130,535.22 Rockland (RECO) $17,724,263 $44,799 TrAILCo $226,652,117.80 n/a Effective June 1, 2018 '
import datefinder
m = datefinder.find_dates(t)
for match in m:
    print(match)
Is there a way to smoothly extract the date? I can resort to re for specific formats if no better way exists. Judging by its GitHub repository, datefinder seems to have been abandoned a year ago.
Although I don't know exactly how your dates are formatted, here's a regex solution that works with dates separated by '/'. It should work whether the months and days are written as a single digit or with a leading zero.
If your dates are separated by hyphens instead, replace the 9th and 18th characters of the regex with a hyphen instead of /. (If using the second print statement, replace the 12th and 31st characters.)
Edit: Added the second print statement with a stricter regex. That's probably the better way to go.
import re
mystring = r'joasidj9238nlsd93901/01/2021oijweo8939n'
print(re.findall(r'\d{1,2}\/\d{1,2}\/\d{2,4}', mystring)) # This would probably work in most cases
print(re.findall(r'[0-1]{0,2}\/[0-3]{0,1}\d{0,1}\/\d{2,4}', mystring)) # This one is probably a better solution. (More protection against weirdness.)
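Instead of editing the separator by hand, one pattern can accept either '/' or '-' and use a backreference to require the same separator in both places. This is a sketch, not part of the answer above; find_numeric_dates is a hypothetical helper, and restricting the year to 2 or 4 digits is an assumption.

```python
import re

# Group 2 captures the separator; \2 requires the same one again.
# (?!\d) stops the year from being cut out of a longer digit run.
DATE_RE = re.compile(r'(\d{1,2})([/-])(\d{1,2})\2(\d{4}|\d{2})(?!\d)')


def find_numeric_dates(text):
    """Return every date-like substring with a consistent separator."""
    return [m.group(0) for m in DATE_RE.finditer(text)]


print(find_numeric_dates('joasidj9238nlsd93901/01/2021oijweo8939n'))
print(find_numeric_dates('due 5-3-21 or 01/02-2020'))
```

A mixed form such as 01/02-2020 is skipped because the second separator does not equal the first.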
Edit #2: Here's a way to do it with the month name spelled out (in full, or 3-character abbreviation), followed by day, followed by comma, followed by a 2 or 4 digit year.
import re
mystring = r'Jan 1, 2020'
print(re.findall(r'(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\s+\d{1,2}\,\s+\d{2,4}',mystring))
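To go from matched text to an actual date object, the same month-name idea can be combined with the standard datetime module. The sketch below is one way to do it, with assumptions: find_spelled_dates and the MONTHS table are hypothetical names, only 4-digit years are accepted, and impossible dates are silently skipped.

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
MONTH_NUMBER = {name[:3]: number for number, name in enumerate(MONTHS, start=1)}

# Build "Jan(?:uary)?"-style alternatives; "May" has no longer form.
MONTH_ALTERNATIVES = "|".join(
    name if len(name) == 3 else f"{name[:3]}(?:{name[3:]})?" for name in MONTHS
)
DATE_RE = re.compile(rf"\b({MONTH_ALTERNATIVES})\s+(\d{{1,2}}),\s+(\d{{4}})\b")


def find_spelled_dates(text):
    """Return date objects for 'Month D, YYYY' occurrences in text."""
    dates = []
    for month, day, year in DATE_RE.findall(text):
        try:
            dates.append(date(int(year), MONTH_NUMBER[month[:3]], int(day)))
        except ValueError:
            pass  # day out of range for that month, e.g. "Feb 31, 2020"
    return dates


print(find_spelled_dates("TrAILCo $226,652,117.80 n/a Effective June 1, 2018 "))
```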

Regex to detect a phone number

I need to find a phone number in a given paragraph text, with the conditions as below.
The word Phone/Ph/tel/telephone should exist in the sentence where the phone number is present.
For ex: (consider the below paragraph.)
This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
As you can see this paragraph has a phone number signified, and it has the word "Phone" in the sentence (31 characters before the phone number).
So I would like to detect this as a phone number if and only if the words Phone/Ph/tel/telephone appear within 50 characters before or after the phone number.
I tried using lookarounds in the regex, but it did not work.
import re
phno = re.compile(r'(?<=Ph\s)(?<=Phone\s)(?<=tel\s)telephone(?<=telephone\s)\b([0-9]{3}[-][0-9]{3}[-][0-9]{4})\b',re.MULTILINE)
data = "This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
l = phno.findall(data)
print(l)
I get an empty list [] as output because the word 'Phone' is not detected by the regex (I need it to be detected within 50 characters before or after the phone number).
import re
data = """This is my phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script.
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx 999-123-4567 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
And 555-555-1212 is my telephone."""
phno = re.compile(r'\b(?:telephone|phone|tel|ph)\b.{0,49}\b(\d{3}[-]\d{3}[-]\d{4})\b|\b(\d{3}[-]\d{3}[-]\d{4})\b.{0,49}\b(?:telephone|phone|tel|ph)\b', flags=re.I)
phones = [m.group(1) if m.group(1) else m.group(2) for m in phno.finditer(data)]
print(phones)
Prints:
['999-888-7894', '555-555-1212']
Assuming you only want to detect hyphen-separated US phone numbers containing area codes, you could use the following regex pattern with re.findall:
\b\d{3}-\d{3}-\d{4}\b
Script:
import re
sentence = "This is my Phone number and I am 25 years old, 999-888-7894 and I am looking for a regex script."
numbers = re.findall(r'\b\d{3}-\d{3}-\d{4}\b', sentence)
print(numbers)
This prints:
['999-888-7894']
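The "keyword within 50 characters" rule can also be checked in plain Python after finding each number, which avoids one long alternation. This is a sketch with hypothetical names (phones_near_keyword, window). Unlike the single-regex answer above, the window here also spans line breaks, which is an assumption about what is wanted.

```python
import re

PHONE_RE = re.compile(r'\b\d{3}-\d{3}-\d{4}\b')
KEYWORD_RE = re.compile(r'\b(?:telephone|phone|tel|ph)\b', re.IGNORECASE)


def phones_near_keyword(text, window=50):
    """Return phone numbers that have a keyword within `window` characters."""
    found = []
    for m in PHONE_RE.finditer(text):
        before = text[max(0, m.start() - window):m.start()]
        after = text[m.end():m.end() + window]
        if KEYWORD_RE.search(before) or KEYWORD_RE.search(after):
            found.append(m.group())
    return found


data = ("This is my phone number and I am 25 years old, 999-888-7894.\n"
        + "x" * 80 + " 999-123-4567 " + "x" * 80
        + "\nAnd 555-555-1212 is my telephone.")
print(phones_near_keyword(data))
```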

How do I grab specific text in between other text?

I need help grabbing just K334-76A9 from this string:
b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
Please help, I have tried so many things but none have worked.
Sorry if my question is bad :/
If you want to find the format xxxx-xxxx, no matter what string you have you can do it like this:
import re
b = '\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
# Split on whitespace, then test each token against the xxxx-xxxx shape.
r = re.compile(r'.{4}-.{4}')
for string in b.split():
    if r.match(string):
        print(string)
Output:
K334-76A9
Here's code that grabs everything after "Serial Number is " up to the next whitespace character.
import re
data = b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n Volume in drive C has no label.\r\n Volume Serial Number is K334-76A9\r\n'
pat = re.compile(r"Serial Number is ([^\s]+)")
match = pat.search(data.decode("ASCII"))
if match:
    print(match.group(1))
Result:
K334-76A9
You can adjust the regular expression per your needs. Regular expressions are Da Bomb! This one's really simple, but you can do amazingly complex things with them.
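If the data stays as bytes, a bytes pattern can search it without decoding first. The sketch below combines both answers; extract_serial is a hypothetical helper, and the xxxx-xxxx shape made of digits and uppercase letters is an assumption based on the single example in the question.

```python
import re

# rb"..." is a bytes pattern, so it can search bytes directly.
SERIAL_RE = re.compile(rb"Volume Serial Number is ([0-9A-Z]{4}-[0-9A-Z]{4})")


def extract_serial(output):
    """Return the serial number as str, or None if it is not present."""
    match = SERIAL_RE.search(output)
    return match.group(1).decode("ascii") if match else None


raw = (b'\x0cWelcome, Pepo \r\nToday is Mon 04/29/2019 \r\n\r\n'
       b' Volume in drive C has no label.\r\n'
       b' Volume Serial Number is K334-76A9\r\n')
print(extract_serial(raw))  # K334-76A9
```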

Python: fastest way to re.findall twice?

I like regular expressions. I often find myself using multiple regex statements to narrow in on the value I need when trying to get a substring from a large block of text.
So far, my approach has been the following:
Use resultOfRegex1 = re.findall(firstRegex, myString) for my first regex
Check to see that resultOfRegex1[0] exists
Use resultOfRegex2 = re.findall(secondRegex, resultOfRegex1[0]) for my second regex
Check to see that resultOfRegex2[0] exists, and print that value
But I feel like this is much more verbose and costly than it has to be. Is there an easier/faster way to match one regex and then match another regex based on the result of the first?
The whole point of groups is to allow extraction of subgroups from an overall match.
For example, instead of doing two searches in the following fashion:
>>> import re
>>> s = 'The winning team scored 15 points and used only 2 timeouts'
>>> score_clause = re.search(r'scored \d+ point', s).group(0)
>>> re.search(r'\d+', score_clause).group(0)
'15'
Do a single search with a sub-group:
>>> re.search(r'scored (\d+) point', s).group(1)
'15'
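The question also mentions checking that a result exists before using it. With re.search that check is a test for None, since calling .group() on a failed search raises AttributeError. A minimal sketch, with extract_score as a hypothetical helper name:

```python
import re


def extract_score(text):
    """Return the score as an int, or None when the clause is absent."""
    match = re.search(r'scored (\d+) point', text)
    return int(match.group(1)) if match else None


print(extract_score('The winning team scored 15 points and used only 2 timeouts'))
print(extract_score('No score was recorded'))
```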
One other thought: if you want to make decisions about whether to continue a findall-style search based on the first match, a reasonable choice would be to use re.finditer and extract values as needed:
>>> game_results = '''\
10 point victory: 1 in first period, 6 in second period, 3 in third period.
5 point victory: 0 in first period, 5 in second period, 0 in third period.
12 point victory: 5 in first period, 3 in second period, 4 in third period.
7 point victory: 3 in first period, 0 in second period, 4 in third period.
'''.splitlines()
>>> # Show period-by-period scores for games won by 8 or more points
>>> for game_result in game_results:
        it = re.finditer(r'\d+', game_result)
        if int(next(it).group(0)) >= 8:
            print('Big win:', [int(mo.group(0)) for mo in it])
Big win: [1, 6, 3]
Big win: [5, 3, 4]
