Address formatting using regex - add state before zip code - python

I have an address formatted like this:
street address, town zip
I need to add the state abbreviation before the zip, which is always 5 digits.
I think I should use regex to do something like below, but I don't know how to finish it:
instr = "123 street st, anytown 12345"
state = 'CA'
outstr = re.sub(r'(???)(/\b\d{5}\b/g)', r'\1state\2', instr)
My question is what to put in the ??? and whether I used the state variable correctly in outstr. Also, did I get the zip regex correct?

You can also use rsplit to do that:
instr = "123 street st, anytown 12345"
state = 'CA'
address, zip_code = instr.rsplit(' ', 1) # ['123 street st, anytown', '12345']
print('%s %s %s' % (address, state, zip_code))
# Output: 123 street st, anytown CA 12345
From the str.rsplit documentation:
str.rsplit([sep[, maxsplit]])
Return a list of the words in the
string, using sep as the delimiter string. If maxsplit is given, at
most maxsplit splits are done, the rightmost ones.

You can't put the variable state straight into the replacement string; re.sub would insert the literal text "state". Use Python string formatting to reference the variable.
Keep the regex simple and assume the data are simple. If the ZIP always appears at the end of the string, just match from the end using $.
Let me try:
import re
instr = "123 street st, anytown 12345"
# Always strip the trailing spaces to avoid surprises
instr = instr.rstrip()
state = 'CA'
# Assume the ZIP has no trailing space and is in the last position
search_pattern = r"(\d{5})$"
# Format the replacement; the search is anchored at the end, so group 1 is the ZIP
replace_str = r"{mystate} \g<1>".format(mystate=state)
outstr = re.sub(search_pattern, replace_str, instr)
# outstr is now "123 street st, anytown CA 12345"
@Forge's example is lean and clean. However, you need to be careful about data quality when using str.rsplit(). For example:
# If the town and ZIP code are stuck together
instr = "123 street st, anytown12345"
# or there are trailing spaces
instr = "123 street st, anytown 12345 "
A more robust approach is to strip the string and use a regex anchored at the end, as in my code. Always think ahead about input data quality; some code fails on real data even after passing its unit tests.
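As a hedged sketch of the same idea, both problem inputs mentioned above (town and ZIP stuck together, trailing spaces) can be handled by letting the pattern absorb optional whitespace around the ZIP. The helper name add_state is hypothetical, and the sketch assumes the ZIP is a run of exactly five digits at the end of the string.

```python
import re

def add_state(address, state):
    """Insert `state` before a trailing 5-digit ZIP code (hypothetical helper)."""
    # \s* before and after the ZIP absorbs missing, repeated, or trailing spaces.
    # (?<!\d) stops the pattern from matching the tail of a longer number.
    pattern = r"\s*(?<!\d)(\d{5})\s*$"
    # A function replacement inserts `state` literally, so any backslashes
    # in it are never interpreted as group references.
    return re.sub(pattern, lambda m: " {} {}".format(state, m.group(1)), address)

print(add_state("123 street st, anytown 12345", "CA"))
print(add_state("123 street st, anytown12345", "CA"))
print(add_state("123 street st, anytown 12345   ", "CA"))
```

All three calls print the same normalised address. If no five-digit ZIP is found at the end, the input is returned unchanged.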

Related

Replace a word in an address string with dictionary value using for-loop

I have an address 2300 S SUPER TEMPLE PL which I expect to get 2300 S SUPER TEMPLE PLACE as a result after spelling out the PL to PLACE. I have a dictionary of abbreviated street names:
st_abbr = {'DR': 'DRIVE',
'RD': 'ROAD',
'BLVD':'BOULEVARD',
'ST':'STREET',
'STE':'SUITE',
'APTS':'APARTMENTS',
'APT':'APARTMENT',
'CT':'COURT',
'LN' : 'LANE',
'AVE':'AVENUE',
'CIR':'CIRCLE',
'PKWY': 'PARKWAY',
'HWY': 'HIGHWAY',
'SQ':'SQUARE',
'BR':'BRIDGE',
'LK':'LAKE',
'MT':'MOUNT',
'MTN':'MOUNTAIN',
'PL':'PLACE',
'RTE':'ROUTE',
'TR':'TRAIL'}
With a for-loop, I would like to replace each abbreviation in the address with its spelled-out form. My idea was to loop over each word in the address, hence the address.split(), and, if a word matches one of the keys in the dictionary, replace it with the spelled-out word.
for key in st_abbr.keys():
    if key in address.split():
        address = address.replace(key, st_abbr[key])
print(address)
It works on the abbreviated street names, but this is what I get: 2300 S SUPER TEMPLACEE PLACE. It also replaced the PL within 'TEMPLE' with PLACE, which gave me 'TEMPLACEE'. I am trying to modify the for loop so that it only replaces a word from address.split() when it exactly matches one of the dict.keys(). I would like guidance on how to achieve that.
Use a comprehension:
addr = '2300 S SUPER TEMPLE PL'
new_addr = ' '.join(st_abbr.get(c, c) for c in addr.split())
print(new_addr)
# Output
2300 S SUPER TEMPLE PLACE
Can you shed some light on the concept behind .get(c, c) in the context of my problem?
# Equivalent code
' '.join(st_abbr[c] if c in st_abbr else c for c in addr.split())
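To make the .get(c, c) idiom concrete, here is a minimal sketch using a two-entry subset of the dictionary. The helper name expand is hypothetical; dict.get(key, default) itself is standard Python.

```python
st_abbr = {'PL': 'PLACE', 'ST': 'STREET'}  # small subset, for illustration only

# dict.get(key, default) returns the mapped value if the key exists,
# otherwise the default. Passing the word itself as the default
# leaves unknown words unchanged.
print(st_abbr.get('PL', 'PL'))          # key found: the spelled-out form
print(st_abbr.get('TEMPLE', 'TEMPLE'))  # key missing: the word itself

def expand(addr, table):
    """Spell out whole-word abbreviations only (hypothetical helper)."""
    return ' '.join(table.get(word, word) for word in addr.split())

print(expand('2300 S SUPER TEMPLE PL', st_abbr))
```

Because the lookup is done per whole word, the PL inside TEMPLE is never touched.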
Not sure whether it's the best idea or not, but regex usually can be helpful in these cases:
import re
def getValue(value):
    before = value.group(1)
    name = value.group("name")
    after = value.group(3)
    if name in st_abbr:
        return before + st_abbr[name] + after
    else:
        return before + name + after
myString = "2300 S SUPER TEMPLE PL"
re.sub(r"(^|\s)+(?P<name>[A-Z]{2,4})($|\s)", getValue, myString)
Output
2300 S SUPER TEMPLE PLACE
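A different regex formulation, not the one used in the answer above, is to build a single alternation of the dictionary keys wrapped in \b word boundaries. This is only a sketch: the names pattern and expand_abbr are hypothetical, and it uses a three-entry subset of the dictionary.

```python
import re

st_abbr = {'ST': 'STREET', 'STE': 'SUITE', 'PL': 'PLACE'}  # subset, for illustration

# Builds \b(?:ST|STE|PL)\b ; re.escape guards against special characters in keys.
pattern = re.compile(r'\b(?:' + '|'.join(map(re.escape, st_abbr)) + r')\b')

def expand_abbr(addr):
    """Replace whole-word abbreviations with their spelled-out form."""
    return pattern.sub(lambda m: st_abbr[m.group(0)], addr)

print(expand_abbr('2300 S SUPER TEMPLE PL'))
print(expand_abbr('12 MAIN ST STE 4'))
```

The word boundaries mean PL cannot match inside TEMPLE, and ST cannot match the first two letters of STE.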

Too many values to unpack (expected 2) while splitting string

I am looking to split strings at "(". This works fine if there is only one "(" character in the string. However, if there is more than one, it raises a ValueError: too many values to unpack.
data = 'The National Bank (US) (Bank)'
I've tried the below code:
name, inst = data.split("(")
Desired output:
name = 'The National Bank (US)'
inst = '(Bank)'
Your split method is splitting the input on both ( characters, giving you the result:
["The National Bank ", "US) ", "Bank)"]
You are then attempting to unpack this list of three values into two variables, name and inst. This is what the error "Too many values to unpack" means.
You can restrict the number of splits to be made using the second parameter to split, but this will give you the wrong result as well.
You actually want to split from the right of the string, on the first space character. You can do that with rsplit:
data = 'The National Bank (US) (Bank)'
name, inst = data.rsplit(' ', 1)
name and inst will now be set as you expect.
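The difference between limiting split() and using rsplit() can be seen in a short sketch (the variable names left and right are only for illustration):

```python
data = 'The National Bank (US) (Bank)'

# Limiting split() to one split cuts at the FIRST "(", which is the wrong place.
left, right = data.split('(', 1)
print(repr(left), repr(right))

# rsplit() counts splits from the right, so a single split on a space
# separates only the final parenthesised part.
name, inst = data.rsplit(' ', 1)
print(repr(name), repr(inst))
```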
This is the expected behavior of split(). When you split a string that contains n separators, you get n+1 strings in return, e.g.:
l = '1,2,3,4'.split(',')
print(l)
print(type(l), len(l))
You can use the rsplit with the maxsplit parameter like this, although you have to append the leading ( to your inst string:
>>> name, inst = data.rsplit("(", maxsplit=1)
>>> name
'The National Bank (US) '
>>> inst
'Bank)'
You may be able to get a little cleaner results by doing the same thing but passing a blank space as the delimiter:
>>> name, inst = data.rsplit(" ", maxsplit=1)
>>> name
'The National Bank (US)'
>>> inst
'(Bank)'

Find a string within a string using pattern matching in python

I'd like to use part of a string ('project') that is returned from an API. The string looks like this:
{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
I'd like to store the 'LS003942_EP... ' part in a new variable called foldername. I thought a good way would be to use a regex to find the text after Title. Here's my code:
orders = api.get_all(view='Folder', fields='Project Title', maxRecords=1)
for new in orders:
    print ("Found 1 new project")
    print (new['fields'])
    project = (new['fields'])
    s = re.search('Title(.+?)', result)
    if s:
        foldername = s.group(1)
        print(foldername)
This gives me an error -
TypeError: expected string or bytes-like object.
I'm hoping for foldername = 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'
You can use ast.literal_eval to safely evaluate a string containing a Python literal:
import ast
s = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
print(ast.literal_eval(s)['Project Title'])
# LS003942_EP - 5 Random Road, Sunny Place, SA 5000
It seems (to me) that you have a dictionary and not a string. In that case, you can try:
s = {'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
print(s['Project Title'])
If you have time, take a look at dictionaries.
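Since it is not clear from the question whether the API returns a dict or the string form of one, a small sketch can cover both cases. The helper name get_project_title is hypothetical; it combines plain dictionary access with ast.literal_eval for the string case.

```python
import ast

def get_project_title(fields):
    """Return the 'Project Title' value from a dict or from its string form."""
    if isinstance(fields, str):
        # literal_eval only evaluates Python literals, never arbitrary code
        fields = ast.literal_eval(fields)
    return fields['Project Title']

as_dict = {'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
as_text = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
print(get_project_title(as_dict))
print(get_project_title(as_text))
```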
I don't think you need a regex here:
string = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
foldername = string[string.index(":") + 3: len(string) - 2]
Essentially, I'm finding the position of the first colon and adding 3 to skip the colon, the space, and the opening apostrophe, which gives the starting index of your foldername. Then I slice from that index up to, but not including, the closing apostrophe (the second-to-last character).
However, if your string is always going to be in the form of a valid Python dict, you could simply do foldername = list(eval(string).values())[0]. Here, I'm treating your string as a dict and getting the first value from it, which is your desired foldername. But, as @AKX notes in the comments, eval() isn't safe, as somebody could pass malicious code as a string. Unless you're sure that your input strings won't contain code (which is unlikely), it's best to use ast.literal_eval(), as it only evaluates literals.
But, as @MaximilianPeters notes in the comments, your response looks like JSON, so you could parse it with json.loads(). Note that JSON requires double-quoted strings, so this applies to a raw JSON response, not to the single-quoted dict representation printed above.
You could try this pattern: (?<='Project Title': )[^}]+.
Explanation: it uses a positive lookbehind to ensure that the match occurs right after 'Project Title': and then matches until } is encountered: [^}]+. Note that the match includes the apostrophes around the value.
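Applied in Python, the pattern might be used as in the sketch below. It assumes the input is the string form shown in the question; the strip("'") step is an addition to remove the apostrophes that the pattern leaves around the value.

```python
import re

s = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"

# Python lookbehinds must be fixed-width, which this literal prefix is.
match = re.search(r"(?<='Project Title': )[^}]+", s)
if match:
    # The raw match still carries the surrounding apostrophes, so strip them.
    foldername = match.group(0).strip("'")
    print(foldername)
```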

Ignoring Multiple Whitespace Characters in a MongoDB Query

I have a MongoDB query that searches for addresses. The problem is that if a user accidentally adds an extra whitespace character, the query will not find the address. For example, if the user types "123  Fakeville St" (two spaces after 123) instead of "123 Fakeville St", the query will not return any results.
Is there a simple way to deal with this issue, perhaps using $regex? I guess the extra space would need to be ignored between the house number (123) and the street name (Fakeville). My query is set up like this:
@app.route('/getInfo', methods=['GET'])
def getInfo():
    address = request.args.get("a")
    addressCollection = myDB["addresses"]
    addressJSON = []
    regex = "^" + address
    for address in addressCollection.find({'Address': {'$regex':regex,'$options':'i'} },{"Address":1,"_id":0}).limit(3):
        addressJSON.append({"Address":address["Address"]})
    return jsonify(addresses=addressJSON)
Clean up the query before sending it off:
>>> import re
>>> re.sub(r'\s+', ' ', '123     abc')
'123 abc'
>>> re.sub(r'\s+', ' ', '123  abc   def    ghi')
'123 abc def ghi'
You'll probably want to make sure that the data in your database is similarly normalised. Also consider similar strategies for things like punctuation.
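If the stored data cannot be normalised, one possible sketch is to build the prefix pattern so that it tolerates any amount of whitespace between words. The helper names normalize_spaces and build_prefix_regex are hypothetical, and the sketch assumes the resulting string is passed as the $regex value in the query and that the escaping produced by Python's re.escape is acceptable to MongoDB's regex engine for simple address text.

```python
import re

def normalize_spaces(text):
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return re.sub(r'\s+', ' ', text).strip()

def build_prefix_regex(user_input):
    """Build a '^...' pattern that tolerates extra whitespace between words."""
    words = normalize_spaces(user_input).split(' ')
    # re.escape keeps characters such as '.' or '(' in the input literal
    return '^' + r'\s+'.join(re.escape(word) for word in words)

pattern = build_prefix_regex('123   Fakeville  St')
print(pattern)
# Checked here with Python's own engine, ignoring case like '$options': 'i'
print(bool(re.search(pattern, '123 Fakeville   St', re.IGNORECASE)))
```

Escaping the user input also prevents characters in the query string from being interpreted as regex syntax.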
In fact, using a regex for this seems overly strict, as well as reinventing the wheel. Consider using a proper search engine such as Lucene or Elasticsearch.
An alternative approach that avoids regex is to use MongoDB text indexes. By adding a text index on the field, you can perform text searches using the $text operator.
For example:
db.coll.find(
{ $text:{$search:"123 Fakeville St"}},
{ score: { $meta: "textScore" } } )
.sort( { score: { $meta: "textScore" } } ).limit(1)
This should work for entries such as "123 Fakeville St.", "123 fakeville street", etc., as long as the important parts of the address make it in.
See the MongoDB documentation on $text behaviour for more information.

Why doesn't this regular expression work in all cases?

I have a text file containing entries like this:
#markwarner VIRGINIA - Mark Warner
#senatorleahy VERMONT - Patrick Leahy NO
#senatorsanders VERMONT - Bernie Sanders
#orrinhatch UTAH - Orrin Hatch NO
#jimdemint SOUTH CAROLINA - Jim DeMint NO
#senmikelee UTAH -- Mike Lee
#kaybaileyhutch TEXAS - Kay Hutchison
#johncornyn TEXAS - John Cornyn
#senalexander TENNESSEE - Lamar Alexander
I have written the following to remove the 'NO' and the dashes using regular expressions:
import re
politicians = open('testfile.txt')
text = politicians.read()
# Grab the 'no' votes
# Should be 11 entries
regex = re.compile(r'(no\s#[\w+\d+\.]*\s\w+\s?\w+?\s?\W+\s\w+\s?\w+)', re.I)
no = regex.findall(text)
## Make the list a string
newlist = ' '.join(no)
## Replace the dashes in the string with a space
deldash = re.compile('\s-*\s')
a = deldash.sub(' ', newlist)
# Delete 'NO' in the string
delno = re.compile('NO\s')
b = delno.sub('', a)
# make the string into a list
# problem with #jimdemint SOUTH CAROLINA Jim DeMint
regex2 = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
lst1 = regex2.findall(b)
for i in lst1:
    print(i)
When I run the code, it captures the twitter handle, state and full names other than the surname of Jim DeMint. I have stated that I want to ignore case for the regex.
Any ideas? Why is the expression not capturing this surname?
It's missing the surname because his state name contains two words: SOUTH CAROLINA.
Changing your second regex to this should help:
(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)
I added
(?:\s\w+)?
which is an optional, non-capturing group matching a space followed by one or more word characters (letters, digits, or underscores).
http://regexr.com?31fv5 shows that it properly matches the input with the NOs and dashes stripped
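The effect of the added group can be checked with a short sketch. The sample string b is an assumption: two entries from the question, joined by a space, with the NOs and dashes already stripped.

```python
import re

# Assumed sample input, already cleaned of 'NO' and dashes
b = "#markwarner VIRGINIA Mark Warner #jimdemint SOUTH CAROLINA Jim DeMint"

# The question's second regex, and the same regex with (?:\s\w+)? appended
original = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+)', re.I)
fixed = re.compile(r'(#[\w\d\.]*\s[\w\d\.]*\s?[\w\d\.]\s?[\w\d\.]*?\s+?\w+(?:\s\w+)?)', re.I)

print(original.findall(b))  # the DeMint entry stops after "Jim"
print(fixed.findall(b))     # the DeMint entry now includes the surname
```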
EDIT:
If you want one master regex to capture and split everything properly, after you remove the NOs and dashes, use
((#[\w]+?\s)((?:(?:[\w]+?)\s){1,2})((?:[\w]+?\s){2}))
Which you can play with here: http://regexr.com?31fvk
The full match is available in $1, the Twitter handle in $2, the state in $3, and the name in $4.
Each capturing group works as follows:
(#[\w]+?\s)
This matches a # sign followed by at least one word character, but as few as possible, up to a space.
((?:(?:[\w]+?)\s){1,2})
This matches and captures one or two words, which should be the state. This only works because of the next piece, which MUST have two words.
((?:[\w]+?\s){2})
This matches and captures exactly two words, where a word is defined as as few word characters as possible followed by a space.
text=re.sub(' (NO|-+)(?= |$)','',text)
And to capture everything:
re.findall('(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))',text)
Or all at once:
re.findall('(#\w+) ([A-Z ]+[A-Z])(?: NO| -+)? (.+?(?= #|$))',text)
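As a sketch of the two-step version (clean first, then capture), the snippet below runs both patterns on a small sample. The sample text is an assumption: a few entries from the question joined by single spaces on one line, which is the layout the (?= #|$) lookahead relies on.

```python
import re

# Assumed sample: entries joined by single spaces on one line
text = ("#markwarner VIRGINIA - Mark Warner "
        "#senatorleahy VERMONT - Patrick Leahy NO "
        "#jimdemint SOUTH CAROLINA - Jim DeMint NO "
        "#senmikelee UTAH -- Mike Lee")

# Step 1: drop ' NO' and runs of dashes that are followed by a space or the end
cleaned = re.sub(r' (NO|-+)(?= |$)', '', text)

# Step 2: capture handle, all-caps state (one or more words), and name
rows = re.findall(r'(#\w+) ([A-Z ]+[A-Z]) (.+?(?= #|$))', cleaned)
for handle, state, name in rows:
    print(handle, state, name, sep=' | ')
```

The state group backtracks off the capital initial of the first name because the following character must be a space, which is why SOUTH CAROLINA is kept together while the name is left intact.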
