Python Regex for Phone Numbers is acting strangely - python

I've developed a Python Regex that pulls phone numbers from text around 90% of the time. However, there are sometimes weird anomalies. My code is as follows:
phone_pattern = re.compile(r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})')
df['phone'] = df['text'].apply(lambda x: phone_pattern.findall(x))
df['phone']=df['phone'].apply(lambda y: '' if len(y)==0 else y)
df['phone'] = df['phone'].apply(', '.join)
This code extracts the phone numbers and appends a new column called "phone." If there are multiple numbers, they are separated by a comma.
The following text, however, generates a weird output:
university of blah school of blah blah blah (jane doe doe) 1234567890 1234 miller Dr E233 MILLER DR blah blah fl zipcode in the morning or maybe Monday.
The output my current code gives me is:
890 1234
Rather than the desired actual number of:
1234567890
This happens on a few examples. I've tried editing the regex, but it only makes it worse. Any help would be appreciated. Also, I think this question is useful, because many of the phone regexes offered on Stack Overflow haven't worked for me.

You may use
(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b
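Note that a \b word boundary is added before the first and third alternatives only; the second one starts with \(, which matches a literal ( and needs no word boundary check. There is a word boundary at the end, too. Without a leading word boundary, a match can start in the middle of a longer digit run, which is how 890 1234 was picked out of 1234567890 1234. In addition, the [-.\s] delimiters in the first alternative are made optional: the ? quantifier makes each of them match 1 or 0 times.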
In Pandas, just use
rx = r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'
df['phone'] = df['text'].str.findall(rx).apply(', '.join)
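As a quick sanity check, the sketch below runs both patterns over the sample text from the question using only the re module (no pandas). The variable names and the second test string with formatted numbers are made up for illustration.

```python
import re

# Sample text from the question.
text = ("university of blah school of blah blah blah (jane doe doe) 1234567890 "
        "1234 miller Dr E233 MILLER DR blah blah fl zipcode in the morning "
        "or maybe Monday.")

# Original pattern: no word boundaries, delimiters are mandatory.
old_rx = re.compile(r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})')
# Pattern from the answer: word boundaries added, delimiters optional in the first alternative.
new_rx = re.compile(r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b')

old_matches = old_rx.findall(text)
new_matches = new_rx.findall(text)
print(old_matches)  # ['890 1234']
print(new_matches)  # ['1234567890']

# Hypothetical extra input: formatted numbers are still recognized.
formatted = new_rx.findall("call 555-123-4567 or (555) 123-4567")
print(formatted)  # ['555-123-4567', '(555) 123-4567']
```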

Related

How can I capture all sentences in a file with the format of (name): (sentence)\n(name):

I have files of transcripts where the format is
(name): (sentence)\n (<-- There can be multiples of this pattern)
(name): (sentence)\n
(sentence)\n
and so on. I need all of the sentences. So far I have gotten it to work by hard-coding the names in the file, but I need it to be generic.
utterances = re.findall(r'(?:CALLER: |\nCALLER:\nCRO: |\nCALLER:\nOPERATOR: |\nCALLER:\nRECORDER: |RECORDER: |CRO: |OPERATOR: )(.*?)(?:CALLER: |RECORDER : |CRO: |OPERATOR: |\nCALLER:\n)', raw_calls, re.DOTALL)
Python 3.6 using re. Or if anyone knows how to do this using spacy, that would be a great help, thanks.
I want to just grab the \n after an empty statement, and put it in its own string. And I suppose I will just have to grab the tape information given at the end of this, for example, since I can't think of a way to distinguish if the line is part of someone's speech or not.
Also sometimes, there's more than one word between start of line and colon.
Mock data:
CRO: How far are you from the World Trade Center, how many blocks, about? Three or
four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
You can use a lookahead that stops each match at the next occurrence of the same pattern, that is, a name at the beginning of a line followed by a colon:
s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)
This outputs:
[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
('CALLER', ''),
('CRO', "You're welcome. Thank you.\n"),
('OPERATOR', 'Bye.\n'),
('CRO', 'Bye.\n'),
('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
('OPERATOR NEWELL', 'blah blah.\n'),
('GUY IN DESK', 'I speak words!')]
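If only the sentences are needed, the (name, text) pairs can be post-processed. The sketch below is one possible way to do that, assuming that collapsing line breaks inside an utterance is acceptable; the transcript is a shortened, made-up variant of the mock data, and the names turn_rx, turns and sentences are illustrative only.

```python
import re

# Shortened, made-up variant of the mock data.
transcript = (
    "CRO: How far are you? Three or\n"
    "four blocks?\n"
    "CALLER:\n"
    "OPERATOR NEWELL: blah blah.\n"
    "GUY IN DESK: I speak words!"
)

# Same pattern as above: a name at the start of a line, a colon, then
# everything up to the next "name:" at the start of a line (or the end).
turn_rx = re.compile(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)',
                     re.MULTILINE | re.DOTALL)

# Collapse runs of whitespace (including newlines) into single spaces.
turns = [(name, ' '.join(text.split())) for name, text in turn_rx.findall(transcript)]

# Keep only the non-empty utterances.
sentences = [text for _, text in turns if text]
print(sentences)
```

Note that a continuation line that itself starts with text followed by a colon would be treated as a new speaker; the pattern cannot tell the two cases apart.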
You never gave us mock data, so I used the following for testing purposes:
name1: Here is a sentence.
name2: Here is another stuff: sentence
which happens to have two lines
name3: Blah.
We can try matching using the following pattern:
^\S+:\s+((?:(?!^\S+:).)+)
This can be explained as:
^\S+:\s+ matches the name, followed by a colon, followed by one or more whitespace characters
((?:(?!^\S+:).)+) then matches and captures everything up until the next name
Note that this handles the edge case of the final sentence: after the last name the negative lookahead never fails, so all remaining content is captured. Also note that \S+ only matches single-word names; a name such as OPERATOR NEWELL would need something like [^:\n]+ instead.
Code sample:
import re
line = "name1: Here is a sentence.\nname2: Here is another stuff: sentence\nwhich happens to have two lines\nname3: Blah."
matches = re.findall(r'^\S+:\s+((?:(?!^\S+:).)+)', line, flags=re.DOTALL|re.MULTILINE)
print(matches)
['Here is a sentence.\n', 'Here is another stuff: sentence\nwhich happens to have two lines\n', 'Blah.']

Using python regex with backreference matches

I have a question about regexes with backreferences.
I need to find repeated characters in a string. I tried the regex (\w)\1{1,}, but it only captures consecutive repeats. I'm stuck on how to improve it so that it captures all repeated characters. Some examples:
import re
str = 'capitals'
re.search(r'(\w)\1{1,}', str)
Output None
import re
str = 'butterfly'
re.search(r'(\w)\1{1,}', str)
<_sre.SRE_Match object; span=(2, 4), match='tt'>
I would use r'(\w).*\1' so that any repeated character matches, even if there are other characters or spaces in between.
However, this won't work for strings whose repeated characters overlap, like abcdabcd: only the first group (a) is reported, because the match for a consumes the other repeated characters (b, c, d) that lie inside it.
Check the demo: https://regex101.com/r/m5UfAe/1
So an alternative (and depending on your needs) is to sort the string analyzed:
import re
str = 'abcdabcde'
re.findall(r'(\w).*\1', ''.join(sorted(str)))
which returns the list of repeated characters: ['a', 'b', 'c', 'd']
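An alternative that avoids sorting is to put the backreference inside a lookahead, so that nothing after the captured character is consumed. This is a sketch of a different technique from the one above, not part of the original answer; the function name repeated_chars is made up, and it assumes single-line input, since . does not match a newline by default.

```python
import re

def repeated_chars(s):
    """Return each word character of s that occurs again later in s,
    in order of first appearance."""
    # (\w) captures one character; the lookahead (?=.*\1) only checks that
    # the same character appears somewhere later, without consuming it.
    hits = re.findall(r'(\w)(?=.*\1)', s)
    # A character occurring three or more times is reported more than once,
    # so de-duplicate while keeping the order.
    return list(dict.fromkeys(hits))

print(repeated_chars('capitals'))   # ['a']
print(repeated_chars('butterfly'))  # ['t']
print(repeated_chars('abcdabcd'))   # ['a', 'b', 'c', 'd']
```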
Hopefully the code below helps you understand backreferences in Python regular expressions.
The sample string text holds two sets of information:
Employee basic info:
starts with #employeename and ends with employeename
e.g.: #daniel dxc chennai 45000 male daniel
Employee designation:
starts with %employeename, then the designation, and ends with employeename%
e.g.: %daniel python developer daniel%
import re
# sample input; each record stays on one line, because . does not match a newline without re.DOTALL
text = """
#daniel dxc chennai 45000 male daniel #henry infosys bengaluru 29000 male hobby-swimming henry
#raja zoho chennai 37000 male raja #ramu infosys bengaluru 99000 male hobby-badminton ramu
%daniel python developer daniel% %henry database admin henry%
%raja Testing lead raja% %ramu Manager ramu%
"""
# group 1 is the employee name (\w+); \1 requires the same name again at the end
basic_info = re.findall(r'#+(\w+)(.*?)\1', text)
print(basic_info)
# group 1 is (%) and group 2 is the employee name (\w+); \2\1 requires "name%" at the end
designation = re.findall(r'(%)+(\w+)(.*?)\2\1', text)
print(designation)
# drop the captured % and keep only (name, designation)
designation = [(name, role) for _, name, role in designation]
print(designation)

How do I delimit my input by this capture group?

For this regular expression:
(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]
I want the input string to be split at the whitespace character matched by the captured (\s).
However, when I run this:
import re
p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+(\s)[A-Z0-9]')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
re.split(p, test_str)
It seems to split the string at the regions given by [.?!]+ and [A-Z0-9] (thus incorrectly omitting them) and leaves \s in the results.
To clarify:
Input: he paid a lot for it. Did he mind
Received Output: ['he paid a lot for it','\s','id he mind']
Expected Output: ['he paid a lot for it.','Did he mind']
You need to remove the capturing group from around (\s) and put the last character class into a look-ahead to exclude it from the match:
# the capturing group around \s is gone, and [A-Z0-9] now sits inside a lookahead
p = re.compile(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.split(test_str))
Any capturing group in a regex pattern creates an additional element in the list returned by re.split.
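The effect of the capturing group on re.split can be seen on the short input from the question. The sketch below compares three patterns; the third one, which moves the punctuation into a lookbehind so that only the whitespace is consumed, is a hypothetical variant and not part of the answer. All variable names are illustrative.

```python
import re

sample = "he paid a lot for it. Did he mind"

# Capturing group: the captured whitespace comes back as its own element,
# while the punctuation and the capital letter are consumed by the split.
with_group = re.split(r'[.?!]+(\s)[A-Z0-9]', sample)
print(with_group)  # ['he paid a lot for it', ' ', 'id he mind']

# No group, capital letter in a lookahead: only punctuation + whitespace
# are consumed, so the next sentence keeps its first letter.
no_group = re.split(r'(?<!Mr|Dr|Ms|Jr|Sr)[.?!]+\s(?=[A-Z0-9])', sample)
print(no_group)  # ['he paid a lot for it', 'Did he mind']

# Hypothetical variant: punctuation in a lookbehind, so only the whitespace
# is consumed and the punctuation stays with its sentence.
keep_rx = re.compile(r'(?<!(?:Mr|Dr|Ms|Jr|Sr)\.)(?<=[.?!])\s(?=[A-Z0-9])')
keep_punct = keep_rx.split(sample)
print(keep_punct)  # ['he paid a lot for it.', 'Did he mind']

# Titles are still protected by the negative lookbehind.
titled = keep_rx.split("Mr. Smith left. He returned")
print(titled)  # ['Mr. Smith left.', 'He returned']
```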
To force the punctuation to appear inside the "sentences", you can use this matching regex with re.findall:
import re
p = re.compile(r'\s*((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+)')
test_str = "Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it. Did he mind? Adam Jones Jr. thinks he didn't. In any case, this isn't true... Well, with a probability of .9 it isn't.23 is the ish. My name is! Why wouldn't you... this is.\nAndrew"
print(p.findall(test_str))
Results:
['Mr. Smith bought cheapsite.com for 1.5 million dollars i.e. he paid a lot for it.', 'Did he mind?', "Adam Jones Jr. thinks he didn't.", "In any case, this isn't true...", "Well, with a probability of .9 it isn't.23 is the ish.", 'My name is!', "Why wouldn't you... this is.", 'Andrew']
The regex follows the rules in your original pattern:
\s* - matches 0 or more whitespace characters, which are left out of the result
((?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!]|[^.!?]+) - 2 alternatives that are captured and returned by re.findall:
(?:(?:Mr|Dr|Ms|Jr|Sr)\.|\.(?!\s+[A-Z0-9])|[^.!?])*[.?!] - 0 or more sequences of the following, ending with a ., ? or !:
(?:Mr|Dr|Ms|Jr|Sr)\. - abbreviated titles
\.(?!\s+[A-Z0-9]) - a dot not followed by 1 or more whitespace characters and then an uppercase letter or digit
[^.!?] - any character but a ., !, and ?
or...
[^.!?]+ - one or more characters other than ., !, and ?

Why doesn't this regular expression work in all cases?

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile(r'\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile(r'NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing the surname because his state name contains two words: SOUTH CAROLINA.
Changing your second regex to this should help:
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
which is an optional, non-capturing group matching a whitespace character followed by one or more word characters (letters, digits, or underscores).
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
EDIT:
If you want one master regex to capture and split everything properly after you have removed the NOs and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3, and the name in $4.
Each capturing group works as follows:
(#[\w]+?\s)
This matches a # sign followed by at least one word character, but as few as possible, up to a whitespace character.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. This only works because of the next piece, which MUST have two words.
((?:[\w]+?\s){2})
This matches and captures exactly two words, each defined as a minimal run of word characters followed by a whitespace character.
To remove the NO markers and the dashes:
text = re.sub(r' (NO|-+)(?= |$)', '', text)
And to capture everything:
re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', text)
Or all at once:
re.findall(r'(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))', text)
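The patterns above use $ without re.MULTILINE, so they appear to assume the entries sit on a single line separated by spaces. If the file keeps one entry per line, as in the question's listing, a line-anchored pattern may be simpler. The sketch below is a different pattern written under that assumption, not the one from the answers; entry_rx, entries and no_votes are made-up names.

```python
import re

# A few lines copied from the question; one entry per line is an assumption.
text = """#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee"""

# handle, state (upper-case words), one or more dashes, name, optional NO.
# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
entry_rx = re.compile(r'^(#\w+) ([A-Z ]+[A-Z]) -+ (.+?)(?: (NO))?$', re.MULTILINE)

# Each tuple is (handle, state, name, vote); vote is '' when NO is absent.
entries = entry_rx.findall(text)

# Keep only the 'NO' votes, without the marker or the dashes.
no_votes = [(handle, state, name) for handle, state, name, vote in entries if vote == 'NO']
for row in no_votes:
    print(row)
```

Because the state group only allows upper-case letters and spaces, two-word states such as SOUTH CAROLINA need no special handling.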

Simple python re lookahead help

I have three sample twiki names:
names = [ "JohnDoe", "JaneMcAdams", "BillyBobThorton" ]
I want to get the following back:
* John Doe
* Jane McAdams
* BillyBob Thorton
Now I have this which busts them apart on the cap (That's a good thing).
re.findall('[A-Z][^A-Z]*', name)
How do I ignore "Mc" as a split?
Thanks!!
I would recommend against using a regex here. I doubt Mc is the only name particle you need to match. Have you thought about Mac, O, Van, Von, De?
I suggest splitting them as you currently do and building the first name and last name manually.
Bonus regex:
re.findall('(?:Mc|Mac|O|Van|Von|De)?[A-Z][^A-Z]*', name)
But Van, Von, De should be separated with a space.
Note: If you only want to match McSomething, use the short version (?:Mc)?[A-Z][^A-Z]*.
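Building the first and last name manually could look like the sketch below. It uses the short Mc pattern and assumes that the last capitalized chunk is the surname and everything before it is the first name; the function name split_twiki_name is made up for illustration.

```python
import re

def split_twiki_name(name):
    """Split a twiki name into 'First Last', keeping a leading Mc with the surname."""
    # Split on capitals, but let an optional "Mc" stay attached to the chunk after it.
    parts = re.findall(r'(?:Mc)?[A-Z][^A-Z]*', name)
    if len(parts) < 2:
        return name  # nothing to split
    # Assumption: the last chunk is the surname, the rest is the first name.
    return ''.join(parts[:-1]) + ' ' + parts[-1]

names = ["JohnDoe", "JaneMcAdams", "BillyBobThorton"]
for name in names:
    print(split_twiki_name(name))
# John Doe
# Jane McAdams
# BillyBob Thorton
```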
