Python Regular Expression to Identify City Names Out Of Strings - python

Using regular expressions in Python 3.4, how would I extract the city names from the text below?
replacement windows in seattle wa
basement remodeling houston texas
siding contractor new york ny
windows in elk grove village
Sometimes the city name is preceded by \sin\s (the word 'in' surrounded by whitespace), sometimes it isn't. Sometimes it is preceded by a general word like 'windows', 'remodeling', ... anything. Sometimes there is no full state name or state abbreviation at the end.
Is there a single regular expression that can handle all of the above cases?
Here's what I've tried so far but it only captures 'seattle'.
import re
l = ['replacement windows in seattle wa',
     'basement remodeling houston texas',
     'siding contractor new york ny',
     'windows in elk grove village']
for i in l:
    m = re.search(r'(?<=\sin\s)(.+)(?=\s(wa|texas|ny))', i)
    if m:  # only the first string matches; the others return None
        print(m.group(1))

What you are after is not possible with regular expressions alone. A regular expression needs a consistent string pattern to work with, and in your case the pattern either does not exist or can take a myriad of forms.
What you could do instead is split each string into words and look them up in a search-efficient data structure, such as a set of known city names.
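The lookup idea above can be sketched as follows. This is only a minimal illustration: the KNOWN_CITIES set and the find_city helper are hypothetical names, and a real list of cities would have to come from your own data. Because city names can span several words, the sketch checks runs of up to three words, longest first.

```python
# Hypothetical set of known city names (lowercase); replace with real data.
KNOWN_CITIES = {"seattle", "houston", "new york", "elk grove village"}


def find_city(text, cities=KNOWN_CITIES, max_words=3):
    """Return the longest run of words in `text` that is a known city, or None."""
    words = text.lower().split()
    # Try longer candidates first so 'elk grove village' beats any shorter name.
    for size in range(min(max_words, len(words)), 0, -1):
        for start in range(len(words) - size + 1):
            candidate = " ".join(words[start:start + size])
            if candidate in cities:  # set membership is O(1) on average
                return candidate
    return None


queries = ['replacement windows in seattle wa',
           'basement remodeling houston texas',
           'siding contractor new york ny',
           'windows in elk grove village']
found = [find_city(q) for q in queries]
print(found)
```

The set makes each membership test cheap, so the cost is dominated by the small number of word runs per string.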

import re
# 'new york' is written as one word here because (\w+) captures a single word
l = ['replacement windows in seattle wa',
     'basement remodeling houston texas',
     'siding contractor newyork ny',
     'windows in elk grove village']
# re.VERBOSE makes the spaces around the alternatives insignificant
p = re.compile(r"(\w+)\s(?:(wa | texas | ny | village))", re.VERBOSE)
for words in l:
    print(p.search(words).expand(r"\g<1> <-- the code is --> \g<2>"))

Related

Using regex, how can we remove the period '.' from prefixes like Mr. and Mrs. but not from the ends of sentences in a long paragraph?

Language: Python. For instance, if I use remove1 = re.sub('\.(?!$)', '', text), it removes all periods, not just the ones after prefixes. Can anyone help, please? Take the text below as an example.
Mr. and Mrs. Jackson live up the street from us. However, Mrs. Jackson's son lives in the street parallel to us.
You can capture what you want to keep, and match the dot that you want to replace.
\b(Mrs?)\.
In the replacement use group 1 like \1
import re
pattern = r"\b(Mrs?)\."
s = ("Mr. and Mrs. Jackson live up the street from us. However, Mrs. Jackson's son lives in the street parallel to us.\n")
result = re.sub(pattern, r"\1", s)
print(result)
Output
Mr and Mrs Jackson live up the street from us. However, Mrs Jackson's son lives in the street parallel to us.
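If other abbreviated titles turn up in the text, the same capture-and-replace idea can be extended. The following is a minimal sketch under that assumption: the TITLES tuple and the strip_title_periods helper are hypothetical and not part of the original answer.

```python
import re

# Hypothetical list of titles; extend it to suit your text.
TITLES = ("Mr", "Mrs", "Ms", "Dr", "Prof")

# Capture the title in group 1 and match (but do not capture) the dot after it.
TITLE_DOT = re.compile(r"\b(" + "|".join(map(re.escape, TITLES)) + r")\.")


def strip_title_periods(text):
    """Remove the period after a known title; sentence-ending periods stay."""
    return TITLE_DOT.sub(r"\1", text)


sample = "Dr. Lee met Mr. and Mrs. Jackson. Ms. Jackson left early."
cleaned = strip_title_periods(sample)
print(cleaned)
```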

regular expressions (regex) save parts of sentence

New to python and regular expressions, I have been trying to find a way that I can parse a sentence so that I can take parts of it and assign them to their own variables.
An example sentence is: Laura Compton, a Stock Broker from Los Angeles, California
My objective is to have: name = "Laura Compton" ( this one is the easy one, I can target the anchor link no problem), position = "Stock Broker", city = Los Angeles, state = California
All of the sentences I need to iterate over follow the same pattern: the name is always in the anchor tag, and the position always follows the comma after the closing anchor. Sometimes it uses "a" or "an", so I would like to strip those off. The city and state always follow the word "from".
You can use named groups within patterns to capture substrings, which makes referring to them easier and the code doing so slightly more readable:
import re
data = ['Laura Compton, a Stock Broker from Los Angeles, California',
'Miles Miller, a Soccer Player from Seattle, Washington']
pattern = (r'^(?P<name>[^,]+)\, an? (?P<position>.+) from '
           r'(?P<city>[^,]+)\, +(?P<state>.+)')
FIELDS = 'name', 'position', 'city', 'state'
for sentence in data:
    matches = re.search(pattern, sentence)
    name, position, city, state = matches.group(*FIELDS)
    print(', '.join([name, position, city, state]))
Output produced from sample data:
Laura Compton, Stock Broker, Los Angeles, California
Miles Miller, Soccer Player, Seattle, Washington
A.M. Kuchling wrote a good tutorial titled Regular Expression HOWTO that you ought to check out.
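Since the goal is to assign each part to its own variable, the named groups can also be pulled out in one step with the match object's groupdict() method. This is a minimal sketch: the parse_person helper is a hypothetical name, and the pattern is a lightly simplified variant of the one above (no escaped commas, anchored at the end).

```python
import re

PATTERN = re.compile(
    r"^(?P<name>[^,]+), an? (?P<position>.+) from "
    r"(?P<city>[^,]+), +(?P<state>.+)$"
)


def parse_person(sentence):
    """Return a dict of the named groups, or None if the sentence does not match."""
    match = PATTERN.search(sentence)
    return match.groupdict() if match else None


record = parse_person("Laura Compton, a Stock Broker from Los Angeles, California")
print(record)
```

Returning None for a non-matching sentence avoids the AttributeError that calling .group() on a failed match would raise.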
You can try this:
import re
s = "Laura Compton, a Stock Broker from Los Angeles, California"
new_s = re.findall(r'^[a-zA-Z\s]+|(?<=a\s)[a-zA-Z\s]+(?=from)|(?<=an\s)[a-zA-Z\s]+(?=from)|(?<=from\s)[a-zA-Z\s]+(?=,)|(?<=,\s)[a-zA-Z\s]+$', s)
headers = ['name', 'title', 'city', 'state']
data = {a:b for a, b in zip(headers, new_s)}
Output:
{'city': 'Los Angeles', 'state': 'California', 'name': 'Laura Compton', 'title': 'Stock Broker '}

Regex - Matching repeating pattern

I have the following string which contains a repeating pattern of text followed by parentheses with an ID number.
The New York Yankees (12980261666)\n\nRedsox (1901659429)\nMets (NYC)
(21135721896)\nKansas City Royals (they are 7-1) (222497247812331)\n\n
other team (618006)\n
I'm struggling to write a regex that would return:
The New York Yankees (12980261666)
Redsox (1901659429)
Mets (NYC) (21135721896)
Kansas City Royals (they are 7-1) (222497247812331)
other team (618006)
The newline characters could be removed later with string.replace('\n', '').
Use a negated character class to achieve this.
String pat="([^\\n])"
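In Python, one way to get the requested output is to treat each entry as "any text up to and including a parenthesized, digits-only ID". The sketch below rests on two assumptions that the question does not confirm: the \n sequences in the sample are real newline characters, and an ID never contains anything but digits.

```python
import re

text = ("The New York Yankees (12980261666)\n\nRedsox (1901659429)\nMets (NYC)\n"
        "(21135721896)\nKansas City Royals (they are 7-1) (222497247812331)\n\n"
        "other team (618006)\n")

# Lazily match any text (newlines included, via DOTALL) up to the first
# parenthesized group that holds only digits. '(NYC)' and '(they are 7-1)'
# are skipped because they are not all digits.
ENTRY = re.compile(r"\s*(.+?\(\d+\))", re.DOTALL)

# Collapse internal newlines and repeated spaces into single spaces.
entries = [" ".join(raw.split()) for raw in ENTRY.findall(text)]

for entry in entries:
    print(entry)
```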

Regex won't capture past \n

I've been trying to clean some data with the code below, but my regex won't go past the \n. I don't understand why, because I thought .* should capture everything.
table = POSITIONS AND APPOINTMENTS 2006 present Fellow, University of Colorado at Denver Health Sciences Center, Native Elder Research Center, American Indian and Alaska Native Program, Denver, CO \n2002 present Assistant Professor, Department of Development Sociology, Cornell \n University, Ithaca, NY \n \n1999 2001
output = table.encode('ascii', errors='ignore').strip()
pat = r'POSITIONS.*'.format(endword)
print pat
regex = re.compile(pat)
if regex.search(output):
    print regex.findall(output)
    pieces.append(regex.findall(output))
The above returns:
['POSITIONS AND APPOINTMENTS 2006 present Fellow, University of Colorado at Denver Health Sciences Center, Native Elder Research Center, American Indian and Alaska Native Program, Denver, CO ']
A dot (.) does not match a newline unless you specify the re.DOTALL (or re.S) flag.
>>> import re
>>> re.search('.', '\n')
>>> re.search('.', '\n', flags=re.DOTALL)
<_sre.SRE_Match object at 0x0000000002AB8100>
In your case, compile the pattern with that flag:
regex = re.compile(pat, flags=re.DOTALL)
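The difference is easy to see side by side. This is a small Python 3 sketch using a shortened, made-up stand-in for the table text from the question:

```python
import re

# Shortened stand-in for the text in the question (assumed, not the full data).
text = ("POSITIONS AND APPOINTMENTS 2006 present Fellow \n"
        "2002 present Assistant Professor \n1999 2001")

# Without DOTALL, '.*' stops at the first newline.
without_flag = re.findall(r"POSITIONS.*", text)

# With DOTALL, '.' also matches '\n', so '.*' runs to the end of the string.
with_flag = re.findall(r"POSITIONS.*", text, flags=re.DOTALL)

print(without_flag)
print(with_flag)
```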

Return a list of words on a line but ignore certain whitespace

Say I have the line:
235Carling  Robert  140 Simpson Ave  Toronto  Ont M6T9H1416/247-2538416/889-6178
You see each collection of characters there? I want each one to represent a column in a data file. The problem I'm having is with the "Street Address" column.
for i in master_file:
    # returns a list of the words, splitting at whitespace
    columns = i.split()
The problem, though, is that this will split 140 Simpson Ave into three "words". Is there a method I can use to say "only separate if the words are surrounded by a certain amount of whitespace" or something similar?
If you have tabs, this is pretty trivial, but if you're just looking for places where there's more than one space, you can use Python's re.split function to do this:
import re
re.split(r'\s{2,}', '235Carling  Robert  140 Simpson Ave  Toronto  Ont M6T9H1416/247-2538416/889-6178')
['235Carling', 'Robert', '140 Simpson Ave', 'Toronto', 'Ont M6T9H1416/247-2538416/889-6178']
Where \s{2,} just matches any series of 2 or more whitespace characters.
If the characters between your columns are actually tabs, you can avoid the regex altogether:
test = '235Carling\tRobert\t140 Simpson Ave\tToronto\tOnt M6T9H1416/247-2538416/889-6178'
test.split('\t')
['235Carling', 'Robert', '140 Simpson Ave', 'Toronto', 'Ont M6T9H1416/247-2538416/889-6178']
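Applied to the loop from the question, the re.split approach might look like the sketch below. It assumes the columns are separated by two or more spaces; master_file here is a hypothetical in-memory stand-in (io.StringIO) for the real file object.

```python
import io
import re

# Hypothetical stand-in for the real file; columns are separated by 2+ spaces.
master_file = io.StringIO(
    "235Carling  Robert  140 Simpson Ave  Toronto  "
    "Ont M6T9H1416/247-2538416/889-6178\n"
)

# Compile once, since the same pattern is reused for every line.
COLUMN_GAP = re.compile(r"\s{2,}")

rows = []
for line in master_file:
    line = line.strip()  # drop the trailing newline before splitting
    if line:             # skip blank lines
        rows.append(COLUMN_GAP.split(line))

print(rows[0])
```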
