Input: 1 10 avenue
Desired Output: 1 10th avenue
As you can see above I have given an example of an input, as well as the desired output that I would like. Essentially I need to look for instances where there is a number followed by a certain pattern (avenue, street, etc). I have a list which contains all of the patterns and it's called patterns.
If that number does not have "th" after it, I would like to add "th". Simply adding "th" is fine, because other portions of my code will correct it to either "st", "nd", "rd" if necessary.
Examples:
1 10th avenue OK
1 10 avenue NOT OK, TH SHOULD BE ADDED!
I have implemented a working solution, which is this:
def Add_Th(address):
try:
address = address.split(' ')
except AttributeError:
pass
for pattern in patterns:
try:
location = address.index(pattern) - 1
number_location = address[location]
except (ValueError, IndexError):
continue
if 'th' not in number_location:
new = number_location + 'th'
address[location] = new
address = ' '.join(address)
return address
I would like to convert this implementation to regex, as this solution seems a bit messy to me, and occasionally causes some issues. I am not the best with regex, so if anyone could steer me in the right direction that would be greatly appreciated!
Here is my current attempt at the regex implementation:
def add_th(address):
find_num = re.compile(r'(?P<number>[\d]{1,2}(' + "|".join(patterns + ')(?P<following>.*)')
check_th = find_num.search(address)
if check_th is not None:
if re.match(r'(th)', check_th.group('following')):
return address
else:
# this is where I would add th. I know I should use re.sub, i'm just not too sure
# how I would do it
else:
return address
I do not have a lot of experience with regex, so please let me know if any of the work I've done is incorrect, as well as what would be the best way to add "th" to the appropriate spot.
Thanks.
Just one way, finding the positions behind a digit and ahead of one of those pattern words and placing 'th' into them:
>>> address = '1 10 avenue 3 33 street'
>>> patterns = ['avenue', 'street']
>>>
>>> import re
>>> pattern = re.compile(r'(?<=\d)(?= ({}))'.format('|'.join(patterns)))
>>> pattern.sub('th', address)
'1 10th avenue 3 33th street'
Related
In my homework, I need to extract the first name, last name, ID code, phone number, date of birth and address of a person from a given string using Regex. The order of the parameters always remains the same. Each parameter requires a separate pattern.
Requirements are as follows:
Both first and last names always begin with a capital letter followed by at least one lowercase letter.
ID code is always 11 characters long and consists only of numbers.
The phone number itself is a combination of 7-8 numbers. The phone number might be separated from the area code with a whitespace, but not necessarily. It is also possible that there is no area code at all.
Date of birth is formatted as dd-MM-YYYY
Address is everything else that remains.
I got the following patterns for each parameter:
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
first_name_pattern = r"^[A-Z][a-z]+"
last_name_pattern = r"[A-z][a-z]+(?=[0-9])"
id_code_pattern = r"\d{11}(?=\+)"
phone_number_pattern = r"\+\d{3}?\s*\d{7,8}"
date_pattern = r"\d{1,2}\-\d{1,2}\-\d{1,4}"
address_pattern = r"[A-Z][a-z]*\s.*$"
first_name_match = re.findall(first_name_pattern, str1)
last_name_match = re.findall(last_name_pattern, str1)
id_code_match = re.findall(id_code_pattern, str1)
phone_number_match = re.findall(phone_number_pattern, str1)
date_match = re.findall(date_pattern, str1)
address_match = re.findall(address_pattern, str1)
So, given "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti", I get ['Heino'] ['Plekk'] ['69712047623'] ['+372 56887364' ] ['12-09-2020'] ['Tartu mnt 183,Tallinn,16881,Eesti'], which suits me perfectly.
The problem starts when the area code is missing, because now id_code_pattern can't find the id code because of (?=\+), and if one tries to use |\d{11} (or) there is another problem because now it finds both id code and phone number (69712047623 and 37256887364). And how to improve phone_number_pattern so that it finds only 7 or 8 digits of the phone number, I do not understand.
A single expression with some well-crafted capture groups will help you immensely:
import re
str1 = "HeinoPlekk69712047623+3725688736412-09-2020Tartu mnt 183,Tallinn,16881,Eesti"
pattern = r"^(?P<first_name>[A-Z][a-z]+)(?P<last_name>[A-Z][a-z]+)(?P<id_code>\d{11})(?P<phone>(?:\+\d{3})?\s*\d{7,8})(?P<dob>\d{1,2}\-\d{1,2}\-\d{1,4})(?P<address>.*)$"
print(re.match(pattern, str1).groupdict())
Repl.it | regex101
Result:
{'first_name': 'Heino', 'last_name': 'Plekk', 'id_code': '69712047623', 'phone': '+37256887364', 'dob': '12-09-2020', 'address': 'Tartu mnt 183,Tallinn,16881,Eesti'}
I have an address 2300 S SUPER TEMPLE PL which I expect to get 2300 S SUPER TEMPLE PLACE as a result after spelling out the PL to PLACE. I have a dictionary of abbreviated street names:
st_abbr = {'DR': 'DRIVE',
'RD': 'ROAD',
'BLVD':'BOULEVARD',
'ST':'STREET',
'STE':'SUITE',
'APTS':'APARTMENTS',
'APT':'APARTMENT',
'CT':'COURT',
'LN' : 'LANE',
'AVE':'AVENUE',
'CIR':'CIRCLE',
'PKWY': 'PARKWAY',
'HWY': 'HIGHWAY',
'SQ':'SQUARE',
'BR':'BRIDGE',
'LK':'LAKE',
'MT':'MOUNT',
'MTN':'MOUNTAIN',
'PL':'PLACE',
'RTE':'ROUTE',
'TR':'TRAIL'}
with a for-loop, I would like to replace the key in address be spelled out. What I thought I should do is loop through each word in the address, thus I have the address.split(), and if the split match one of the keys in the dictionary, to replace that with a spelled out word.
for key in st_abbr.keys():
if key in address.split():
address = address.replace(key, st_abbr[key])
print(address)
It works perfectly on abbreviated street names but this is what I get 2300 S SUPER TEMPLACEE PLACE. It also replaced the PL within 'TEMPLE' with PLACE, thus it gave me 'TEMPLACEE'. I am trying to modify the for loop to only replace the abbreviated street if the street.split() is the exact match of the dict.keys(). I would like guidance on how to achieve that.
Use a comprehension:
addr = '2300 S SUPER TEMPLE PL'
new_addr = ' '.join(st_abbr.get(c, c) for c in addr.split())
print(new_addr)
# Output
2300 S SUPER TEMPLE PLACE
Can you shed a light the concept behind the .get(c,c) in the context of my problem?
# Equivalent code
' '.join(st_abbr[c] if c in st_abbr else c for c in addr.split())
Not sure whether it's the best idea or not, but regex usually can be helpful in these cases:
import re
def getValue(value):
before = value.group(1)
name = value.group("name")
after = value.group(3)
if name in st_abbr:
return before + st_abbr[name] + after
else:
return before + name + after
myString = "2300 S SUPER TEMPLE PL"
re.sub("(^|\s)+(?P<name>[A-Z]{2,4})($|\s)", getValue,myString)
Output
2300 S SUPER TEMPLE PLACE
I'd like to use part of a string ('project') that is returned from an API. The string looks like this:
{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
I'd like to store the 'LS003942_EP... ' part in a new variable called foldername. I'm thought a good way would be to use a regex to find the text after Title. Here's my code:
orders = api.get_all(view='Folder', fields='Project Title', maxRecords=1)
for new in orders:
print ("Found 1 new project")
print (new['fields'])
project = (new['fields'])
s = re.search('Title(.+?)', result)
if s:
foldername = s.group(1)
print(foldername)
This gives me an error -
TypeError: expected string or bytes-like object.
I'm hoping for foldername = 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'
You can use ast.literal_eval to safely evaluate a string containing a Python literal:
import ast
s = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
print(ast.literal_eval(s)['Project Title'])
# LS003942_EP - 5 Random Road, Sunny Place, SA 5000
It seems (to me) that you have a dictionary and not string. Considering this case, you may try:
s = {'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
print(s['Project Title'])
If you have time, take a look at dictionaries.
I don't think you need a regex here:
string = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
foldername = string[string.index(":") + 2: len(string)-1]
Essentially, I'm finding the position of the first colon, then adding 2 to get the starting index of your foldername (which would be the apostrophe), and then I use index slicing and slice everything from the index to the second-last character (the last apostrophe).
However, if your string is always going to be in the form of a valid python dict, you could simply do foldername = (eval(string).values)[0]. Here, I'm treating your string as a dict and am getting the first value from it, which is your desired foldername. But, as #AKX notes in the comments, eval() isn't safe as somebody could pass malicious code as a string. Unless you're sure that your input strings won't contain code (which is unlikely), it's best to use ast.literal_eval() as it only evaluates literals.
But, as #MaximilianPeters notes in the comments, your response looks like a valid JSON, so you could easily parse it using json.parse().
You could try this pattern: (?<='Project Title': )[^}]+.
Explanation: it uses positive lookbehind to assure, that match will occure after 'Project Title':. Then it matches until } is encountered: [^}]+.
Demo
I’m trying to search a long string of characters for a country name. The country name is sometimes more than one word, such as Costa Rica.
Here is my code:
eol = len(CountryList)
for c in range(0, eol):
country = str(CountryList[c])
countrymatch = re.search(country, fullsampledata)
if countrymatch:
...
fullsampledata is a long string with all the data in one line. I’m trying to parse out the country by cycling thru a list of valid country names. If country is only one word, such as ‘Holland’, it finds it. However, if country is two or more words, ‘Costa Rica’, it doesn’t find it. Why?
You can search for a substring in a string using the .find() function as follows
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
fullsampledata.find("Morocco")
-1
fullsampledata.index("Costa Rica")
17
So you can make your if statement as follows
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
country = "Costa Rica"
if fullsampledata.index(country) != -1:
# Found
pass
else:
# Not Found
pass
In [1]: long_string = 'asdfsadfCosta Ricaasdkj asdfsd asdjas USA alsj'
In [2]: 'Costa Rica' in long_string
Out[2]: True
You don't have your code properly shown and I'm a little too lazy to parse it. Hope this helps.
I'm trying to get the "real" name of a movie from its name when you download it.
So for instance, I have
Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY
and would like to get
Star Wars Episode 4 A New Hope
So I'm using this regex:
.*?\d{1}?[ .a-zA-Z]*
which works fine, but only for a movie with a number, as in 'Iron Man 3' for example.
I'd like to be able to get movies like 'Interstellar' from
Interstellar.2014.1080p.BluRay.H264.AAC-RARBG
and I currently get
Interstellar 2
I tried several ways, and spent quite a lot of time on it already, but figured it wouldn't hurt asking you guys if you had any suggestion/idea/tip on how to do it...
Thanks a lot!
Given your examples and assuming you always download in 1080p (or know that field's value):
x = 'Interstellar.2014.1080p.BluRay.H264.AAC-RARBG'
y = x.split('.')
print " ".join(y[:y.index('1080p')-1])
Forget the regex (for now anyway!) and work with the fixed field layout. Find a field you know (1080p) and remove the information you don't want (the year). Recombine the results and you get "Interstellar" and "Star Wars Episode 4 A New Hope".
The following regex would work (assuming the format is something like moviename.year.1080p.anything or moviename.year.720p.anything:
.*(?=.\d{4}.*\d{3,}p)
Regex example (try the unit tests to see the regex in action)
Explanation:
\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$
Try this with re.sub.See demo.
https://regex101.com/r/hR7tH4/10
import re
p = re.compile(r'\.(?=.*?(?:19|20)\d{2}\b)|(?:19|20)\d{2}\b.*$', re.MULTILINE)
test_str = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY\nInterstellar.2014.1080p.BluRay.H264.AAC-RARBG\nIron Man 3"
subst = " "
result = re.sub(p, subst, test_str)
Assuming, there is always a four-digit-year, or a four-digit-resolution notation within the movie's file name, a simple solution replaces the not-wanted parts as this:
"(?:\.|\d{4,4}.+$)"
by a blank, strip()'ing them afterwards ...
For example:
test1 = "Star.Wars.Episode.4.A.New.Hope.1977.1080p.BrRip.x264.BOKUTOX.YIFY"
test2 = "Interstellar.2014.1080p.BluRay.H264.AAC-RARBG"
res1 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test1).strip()
res2 = re.sub(r"(?:\.|\d{4,4}.+$)",' ',test2).strip()
print(res1, res2, sep='\n')
>>> Star Wars Episode 4 A New Hope
>>> Interstellar