Simple python re lookahead help - python

I have three sample twiki names:
names = [ "JohnDoe", "JaneMcAdams", "BillyBobThorton" ]
I want to get the following back:
* John Doe
* Jane McAdams
* BillyBob Thorton
Now I have this which busts them apart on the cap (That's a good thing).
re.findall('[A-Z][^A-Z]*', name)
How do I ignore "Mc" as a split?
Thanks!!

I would recommend against using a regex here. I doubt Mc is the only name particle you need to match. Did you think about Mac, O, Van, Von, De?
I suggest splitting them the way you say you currently do, and then building the first name and last name manually.
Bonus. Regex:
re.findall('(?:Mc|Mac|O|Van|Von|De)?[A-Z][^A-Z]*', name)
But Van, Von, De should be separated with a space.
Note: If you only want to match McSomething, use the short version (?:Mc)?[A-Z][^A-Z]*.
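As a rough sketch of the "split, then build the names manually" suggestion: the snippet below uses the short (?:Mc)? variant and joins everything before the last piece into the first name. The helper name split_wiki_name and the rule "the last piece is the surname" are assumptions made for illustration, not something stated in the question.

```python
import re

# Optional "Mc" prefix, then one capital letter and the non-capitals after it.
PIECE = re.compile(r'(?:Mc)?[A-Z][^A-Z]*')

def split_wiki_name(name):
    """Return 'First Last', assuming the final piece is the surname."""
    pieces = PIECE.findall(name)
    if len(pieces) < 2:
        return name  # nothing to split
    return ''.join(pieces[:-1]) + ' ' + pieces[-1]

for name in ["JohnDoe", "JaneMcAdams", "BillyBobThorton"]:
    print(split_wiki_name(name))
```

Other particles (Mac, O, Van, ...) would need their own handling, as noted above.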


Python Regex for Phone Numbers is acting strangely

I've developed a Python Regex that pulls phone numbers from text around 90% of the time. However, there are sometimes weird anomalies. My code is as follows:
phone_pattern = re.compile(r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})')
df['phone'] = df['text'].apply(lambda x: phone_pattern.findall(x))
df['phone']=df['phone'].apply(lambda y: '' if len(y)==0 else y)
df['phone'] = df['phone'].apply(', '.join)
This code extracts the phone numbers and appends a new column called "phone." If there are multiple numbers, they are separated by a comma.
The following text, however, generates a weird output:
university of blah school of blah blah blah (jane doe doe) 1234567890 1234 miller Dr E233 MILLER DR blah blah fl zipcode in the morning or maybe Monday.
The output my current code gives me is:
890 1234
Rather than the desired actual number of:
1234567890
This happens on a few examples. I've tried editing the regex, but it only makes it worse. Any help would be appreciated. Also, I think this question is useful, because a lot of the phone regex offered on Stackoverflow haven't worked for me.
You may use
(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b
Note that a \b word boundary is added before the first and third alternatives only; the second one starts with \(, which matches a literal ( and needs no word boundary check. There is a word boundary at the end, too. In addition, the [-.\s] delimiters in the first alternative are made optional: the ? quantifier makes each of them match 1 or 0 times.
In Pandas, just use
rx = r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b'
df['phone'] = df['text'].str.findall(rx).apply(', '.join)
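To see the difference without a DataFrame, here is a small sketch that runs both patterns on the problem text from the question using plain re. The variable names (old_rx, new_rx, text) are only for illustration.

```python
import re

text = ("university of blah school of blah blah blah (jane doe doe) "
        "1234567890 1234 miller Dr E233 MILLER DR blah blah fl zipcode")

# Pattern from the question: no word boundaries, separators required.
old_rx = re.compile(
    r'(\d{3}[-\.\s]\d{3}[-\.\s]\d{4}|\(\d{3}\)\s*\d{3}[-\.\s]\d{4}|\d{3}[-\.\s]\d{4})')
# Pattern from the answer: word boundaries, optional separators in alternative 1.
new_rx = re.compile(
    r'(\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}|\(\d{3}\)\s*\d{3}[-.\s]\d{4}|\b\d{3}[-.\s]\d{4})\b')

old_hits = old_rx.findall(text)  # picks up the tail of the number plus "1234"
new_hits = new_rx.findall(text)  # picks up the whole ten-digit number
print(old_hits)
print(new_hits)
```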

How can I capture all sentences in a file with the format of (name): (sentence)\n(name):

I have files of transcripts where the format is
(name): (sentence)\n (<-- There can be multiples of this pattern)
(name): (sentence)\n
(sentence)\n
and so on. I need all of the sentences. So far I have gotten it to work by hard-coding the names in the file, but I need it to be generic.
utterances = re.findall(r'(?:CALLER: |\nCALLER:\nCRO: |\nCALLER:\nOPERATOR: |\nCALLER:\nRECORDER: |RECORDER: |CRO: |OPERATOR: )(.*?)(?:CALLER: |RECORDER : |CRO: |OPERATOR: |\nCALLER:\n)', raw_calls, re.DOTALL)
Python 3.6 using re. Or if anyone knows how to do this using spacy, that would be a great help, thanks.
I want to just grab the \n after an empty statement, and put it in its own string. And I suppose I will just have to grab the tape information given at the end of this, for example, since I can't think of a way to distinguish if the line is part of someone's speech or not.
Also sometimes, there's more than one word between start of line and colon.
Mock data:
CRO: How far are you from the World Trade Center, how many blocks, about? Three or
four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
You can use a lookahead expression that looks for the same pattern of a name at the beginning of a line and is followed by a colon:
s = '''CRO: How far are you from the World Trade Center, how many blocks, about? Three or four blocks?
63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01
CALLER:
CRO: You're welcome. Thank you.
OPERATOR: Bye.
CRO: Bye.
RECORDER: The preceding portion of tape concludes at 0913 hours, 36 seconds.
This tape will continue on side B.
OPERATOR NEWELL: blah blah.
GUY IN DESK: I speak words!'''
import re
from pprint import pprint
pprint(re.findall(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)', s, flags=re.MULTILINE | re.DOTALL), width=200)
This outputs:
[('CRO', 'How far are you from the World Trade Center, how many blocks, about? Three or four blocks?\n63FDNY 911 Calls Transcript - EMS - Part 1 9-11-01\n'),
('CALLER', ''),
('CRO', "You're welcome. Thank you.\n"),
('OPERATOR', 'Bye.\n'),
('CRO', 'Bye.\n'),
('RECORDER', 'The preceding portion of tape concludes at 0913 hours, 36 seconds.\nThis tape will continue on side B.\n'),
('OPERATOR NEWELL', 'blah blah.\n'),
('GUY IN DESK', 'I speak words!')]
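If you only need tidy (name, sentence) pairs, a minimal sketch on top of the same pattern could collapse the line breaks inside each turn. The function name utterances and the whitespace normalisation are my additions, not part of the answer above.

```python
import re

# Same pattern as above: a name at line start, a colon, then a lazy body that
# stops right before the next "name:" line or at the end of the text.
TURN = re.compile(r'^([^:\n]+):\s*(.*?)(?=^[^:\n]+?:|\Z)',
                  flags=re.MULTILINE | re.DOTALL)

def utterances(text):
    """Return (speaker, sentence) pairs with internal whitespace collapsed."""
    return [(name, ' '.join(body.split())) for name, body in TURN.findall(text)]

sample = "CRO: Bye.\nCALLER:\nOPERATOR NEWELL: blah\nblah.\n"
print(utterances(sample))
```

A continuation line that happens to contain a colon would still be read as a new speaker, so this relies on the transcript layout shown in the mock data.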
You never gave us mock data, so I used the following for testing purposes:
name1: Here is a sentence.
name2: Here is another stuff: sentence
which happens to have two lines
name3: Blah.
We can try matching using the following pattern:
^\S+:\s+((?:(?!^\S+:).)+)
This can be explained as:
^\S+:\s+ matches the name, followed by a colon, followed by one or more whitespace characters
((?:(?!^\S+:).)+) then matches and captures everything up to the next name
Note that this handles the edge case of the final sentence: no further name follows it, so the negative lookahead never blocks the match and all remaining content is captured.
Code sample:
import re
line = "name1: Here is a sentence.\nname2: Here is another stuff: sentence\nwhich happens to have two lines\nname3: Blah."
matches = re.findall(r'^\S+:\s+((?:(?!^\S+:).)+)', line, flags=re.DOTALL|re.MULTILINE)
print(matches)
Output:
['Here is a sentence.\n', 'Here is another stuff: sentence\nwhich happens to have two lines\n', 'Blah.']
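The question mentions that a name can be more than one word (OPERATOR NEWELL) and that a turn can be empty (CALLER:). A hedged variation on the pattern above swaps \S+ for [^:\n]+, captures the name as well, and lets the body be empty. The names TURN and parse_turns are assumptions for this sketch.

```python
import re

# Tempered pattern: consume one character at a time, but only while the
# current position is not the start of another "name:" line.
TURN = re.compile(r'^([^:\n]+):[ \t]*((?:(?!^[^:\n]+:).)*)',
                  flags=re.MULTILINE | re.DOTALL)

def parse_turns(text):
    """Return (name, sentence) pairs; surrounding whitespace is stripped."""
    return [(name, body.strip()) for name, body in TURN.findall(text)]

line = ("name1: Here is a sentence.\nOPERATOR NEWELL: two\nlines\n"
        "CALLER:\nname3: Blah.")
print(parse_turns(line))
```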

Parsing file name with RegEx - Python

I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print(" ".join(y[:y.index('1080p')-1]))
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
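Wrapped up as a function, the fixed-layout idea might look like the sketch below. The name title_before_marker and the choice to return None when the layout does not fit are assumptions for illustration.

```python
def title_before_marker(filename, marker='1080p'):
    """Return the title, assuming the layout title.year.marker.rest."""
    fields = filename.split('.')
    if marker not in fields:
        return None  # the layout assumption does not hold
    idx = fields.index(marker)
    if idx < 1:
        return None  # no room for a year field before the marker
    # Keep everything before the year, which sits just before the marker.
    return ' '.join(fields[:idx - 1])

print(title_before_marker('Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY'))
print(title_before_marker('Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'))
```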
The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything):
.*(?=.\d{4}.*\d{3,}p)
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub. See the demo:
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
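The substitution leaves trailing spaces where the year and the rest of the name used to be, so a per-name helper would probably want a strip() as well. A minimal sketch, with clean_title as a made-up name:

```python
import re

# Alternative 1: a dot that still has a 19xx/20xx year somewhere after it.
# Alternative 2: the year itself and everything after it on the line.
YEAR_AND_REST = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$',
                           re.MULTILINE)

def clean_title(filename):
    """Replace dots before the year with spaces and drop the year onwards."""
    return YEAR_AND_REST.sub(' ', filename).strip()

for raw in ["Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY",
            "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG",
            "Iron Man 3"]:
    print(clean_title(raw))
```

A title that itself contains a number looking like a year (19xx or 20xx) would be cut at that number, so this only fits names where the first such number is the release year.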
Assuming there is always a four-digit year or a four-digit resolution in the movie's file name, a simple solution is to replace the unwanted parts matched by:
"(?:\.|\d{4,4}.+$)"
with a blank, and strip() the result afterwards.
For example:
import re
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
Output:
Star Wars Episode 4 A New Hope
Interstellar

Parsing name and degree?

I'm trying to parse a string containing a name and a degree. I have a long list of these. Some contain no degrees, some contain one, and some contain multiple.
Example strings:
Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
As far as I can tell, the degrees come in the following patterns:
x.x.
x.x.x.
x.x.xx.
x.xx.
xx.x.
x.xxx.
two caps (ex: 'MA')
How would I parse this?
I'm new to regex and breaking down this problem has proved very time-consuming. I've been using this post and tried split = re.split('\s+|([.])',s) and split = re.split('\s+|\.',s) but these still split on the first space.
I have thought, in response to the first comment, about the degree designations. I've been trying to make a regex that recognizes 'x.x' and then a wildcard afterwards because there are several patterns within the degrees which look like this: x.x(something):
x.x.
x.x.x.
x.x.xx.
and then I'd have a few more to classify.
Alternatively, classifying the name might be easier?
Or even listing the degrees in a collection and searching for them?
{'M.A.T.','Ph.D.','MA','J.D.','Ed.M.', 'M.A.', 'M.B.A.', 'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.'}
Try to change your "Jr.", "Sr.", ... replacing them with something like this: "Jr~", "Sr~", ...
This is the regular expression for doing that:
/ (Jr|Sr)\. / $1~ /g
You obtain this string:
Sam da Man J.D.
Green Eggs Jr~ Ed.M.
Argle Bargle Sr~ MA
Cersei Lannister M.A. Ph.D.
Now you can easily capture degrees with this regular expression:
/ (MA|RN|([A-Z][a-z]?[a-z]?\.)+) /g
You can use this:
'[ ](MA|RN|([A-Z][a-z]?[a-z]?\.){2,3})'
It does not match a word that has only one dot (such as Jr. or Sr.).
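One thing to watch with this pattern in Python: it contains nested capturing groups, so re.findall returns tuples rather than plain strings. A minimal sketch with the inner group made non-capturing, and [a-z]{0,2} written in place of [a-z]?[a-z]? (which should be equivalent), might look like this. The names DEGREE_SHAPE and samples are only for illustration.

```python
import re

# A space, then either a bare two-letter degree or 2-3 dotted chunks such as
# "Ph." + "D." ; a single dotted chunk like "Jr." is not enough to match.
DEGREE_SHAPE = re.compile(r' (MA|RN|(?:[A-Z][a-z]{0,2}\.){2,3})')

samples = ["Sam da Man J.D.",
           "Green Eggs Jr. Ed.M.",
           "Argle Bargle Sr. MA",
           "Cersei Lannister M.A. Ph.D."]
found = [DEGREE_SHAPE.findall(s) for s in samples]
print(found)
```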
I think the best approach is either creating a list or regex of specific degrees you're looking for, instead of trying to define patterns like x.x. that will match several different degrees. A pattern like this is too general, and may match many other values in free text (in this case, people's initials).
import re
s = """Sam da Man J.D.
Green Eggs Jr. Ed.M.
Argle Bargle Sr. MA
Cersei Lannister M.A. Ph.D.
Albus Dumbledore M.A.T.
"""
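# Dots are escaped so that each one matches a literal '.' rather than any character.
pattern = r"M\.A\.T\.|Ph\.D\.|MA|J\.D\.|Ed\.M\.|M\.A\.|M\.B\.A\.|Ed\.S\.|M\.Div\.|M\.Ed\.|RN|B\.S\.Ed\."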
degrees = re.findall(pattern, s, re.MULTILINE)
print(degrees)
Output:
['J.D.', 'Ed.M.', 'MA', 'M.A.', 'Ph.D.', 'M.A.T.']
If you're looking to get the names that appear between the degrees in a block of text like the one above, you can use re.split.
names = re.split(pattern, s)
names = [n.strip() for n in names if n.strip()]
print(names)
Output:
['Sam da Man', 'Green Eggs Jr.', 'Argle Bargle Sr.', 'Cersei Lannister', 'Albus Dumbledore']
Note that I had to strip the remaining strings and remove empty strings from the results to capture just the names. Doing that operation on the result allows the regex to be much simpler.
Note also that this can still fail when a specific degree could also be someone's initials, (e.g., J.D. Salinger). You may need to make adjustments or other allowances based on your real data.
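If the list of degrees lives in a collection, as suggested in the question, the pattern can be built from it instead of typed by hand. The sketch below is one way to do that under a few assumptions of mine: re.escape handles the dots, longer degrees are tried first, and the lookarounds (?<!\S) and (?!\S) require whitespace or a string edge around each degree. The names DEGREES, DEGREE_RE and split_name_and_degrees are hypothetical.

```python
import re

DEGREES = {'M.A.T.', 'Ph.D.', 'MA', 'J.D.', 'Ed.M.', 'M.A.', 'M.B.A.',
           'Ed.S.', 'M.Div.', 'M.Ed.', 'RN', 'B.S.Ed.'}

# Escape each degree and try longer ones first (so M.A.T. wins over M.A.).
alternation = '|'.join(re.escape(d) for d in sorted(DEGREES, key=len, reverse=True))
DEGREE_RE = re.compile(r'(?<!\S)(?:' + alternation + r')(?!\S)')

def split_name_and_degrees(line):
    """Return (name, [degrees]) for one 'name plus degrees' string."""
    degrees = DEGREE_RE.findall(line)
    name = ' '.join(DEGREE_RE.sub(' ', line).split())
    return name, degrees

for line in ["Sam da Man J.D.", "Green Eggs Jr. Ed.M.",
             "Argle Bargle Sr. MA", "Cersei Lannister M.A. Ph.D."]:
    print(split_name_and_degrees(line))
```

This still treats initials such as the J.D. in J.D. Salinger as a degree, so the caveat above applies here too.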

Python, Regular Expression Postcode search

I am trying to use regular expressions to find a UK postcode within a string.
I have got the regular expression working inside RegexBuddy, see below:
\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b
I have a bunch of addresses and want to grab the postcode from them, example below:
123 Some Road Name Town, City County PA23 6NH
How would I go about this in Python? I am aware of the re module for Python but I am struggling to get it working.
Cheers
Eef
Repeating your address three times with the postcodes PA23 6NH, PA2 6NH and PA2Q 6NH as a test for your pattern, and running the regex from Wikipedia alongside yours, the code is:
import re
s = ("123 Some Road Name\nTown, City\nCounty\nPA23 6NH\n123 Some Road Name\nTown, City\n"
     "County\nPA2 6NH\n123 Some Road Name\nTown, City\nCounty\nPA2Q 6NH")
# custom
print(re.findall(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b', s))
# regex from http://en.wikipedia.org/wiki/UK_postcodes#Validation
print(re.findall(r'[A-Z]{1,2}[0-9R][0-9A-Z]? [0-9][A-Z]{2}', s))
the result is
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
['PA23 6NH', 'PA2 6NH', 'PA2Q 6NH']
Both regexes give the same result.
Try
import re
re.findall(r"[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}", x)  # x is the address string
You don't need the \b.
#!/usr/bin/env python
import re
ADDRESS="""123 Some Road Name
Town, City
County
PA23 6NH"""
reobj = re.compile(r'(\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b)')
matchobj = reobj.search(ADDRESS)
if matchobj:
    print(matchobj.group(1))
Example output:
[user@host]$ python uk_postcode.py
PA23 6NH
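For a whole list of addresses, a small Python 3 helper along these lines may be handy. It is only a sketch: the name find_postcodes is made up, and it assumes the postcodes are already upper case with a single space in the middle, as in the example address.

```python
import re

# The pattern from the question, compiled once. The \b anchors keep it from
# matching inside a longer run of letters and digits.
POSTCODE_RE = re.compile(r'\b[A-Z]{1,2}[0-9][A-Z0-9]? [0-9][ABD-HJLNP-UW-Z]{2}\b')

def find_postcodes(text):
    """Return every postcode-shaped substring found in text."""
    return POSTCODE_RE.findall(text)

addresses = ["123 Some Road Name Town, City County PA23 6NH",
             "No postcode in this one"]
for address in addresses:
    print(find_postcodes(address))
```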
