Find a string within a string using pattern matching in Python

I'd like to use part of a string ('project') that is returned from an API. The string looks like this:
{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
I'd like to store the 'LS003942_EP...' part in a new variable called foldername. I thought a good way would be to use a regex to find the text after Title. Here's my code:
orders = api.get_all(view='Folder', fields='Project Title', maxRecords=1)
for new in orders:
    print("Found 1 new project")
    print(new['fields'])
    project = new['fields']
    s = re.search('Title(.+?)', project)
    if s:
        foldername = s.group(1)
        print(foldername)
This gives me an error -
TypeError: expected string or bytes-like object.
I'm hoping for foldername = 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'

You can use ast.literal_eval to safely evaluate a string containing a Python literal:
import ast
s = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
print(ast.literal_eval(s)['Project Title'])
# LS003942_EP - 5 Random Road, Sunny Place, SA 5000
It seems (to me) that you have a dictionary and not string. Considering this case, you may try:
s = {'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
print(s['Project Title'])
If you have time, take a look at dictionaries.
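A minimal sketch that covers both cases, assuming the API may hand back either a dict or the string form of one. The helper name extract_title is hypothetical and not part of any API; it is only here to show the two paths side by side.

```python
import ast


def extract_title(fields, key="Project Title"):
    """Return fields[key]; `fields` may be a dict or the string form of one."""
    if isinstance(fields, str):
        # literal_eval only accepts Python literals, so it cannot run arbitrary code
        fields = ast.literal_eval(fields)
    return fields[key]


as_dict = {'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}
as_text = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"

foldername = extract_title(as_dict)
print(foldername)
print(extract_title(as_text) == foldername)
```

In the question's loop this would probably be called as extract_title(new['fields']), assuming new['fields'] has the shape shown above.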

I don't think you need a regex here:
string = "{'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}"
foldername = string[string.index(":") + 3:-2]
Essentially, I'm finding the position of the first colon, then adding 3 to skip the colon, the space and the opening apostrophe, which gives the starting index of your foldername. Then I slice from that index up to, but not including, the second-last character (the closing apostrophe), so the trailing '} is dropped.
However, if your string is always going to be in the form of a valid Python dict, you could simply do foldername = list(eval(string).values())[0]. Here, I'm treating your string as a dict and getting the first value from it, which is your desired foldername. But, as @AKX notes in the comments, eval() isn't safe, since somebody could pass malicious code as a string. Unless you're sure that your input strings won't contain code (which is unlikely), it's best to use ast.literal_eval(), as it only evaluates literals.
Also, as @MaximilianPeters notes in the comments, your response looks like JSON, so you could parse it with json.loads(). Note that this only works if the raw response uses double quotes; the single-quoted string shown above is a Python dict representation, not valid JSON.
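If the raw response really is JSON, the standard json module can parse it; in Python the call is json.loads. The payload below is an assumption: it is the question's sample rewritten with double quotes, since JSON does not accept single-quoted strings.

```python
import json

# Assumed raw payload: the question's sample with JSON-style double quotes
raw = '{"Project Title": "LS003942_EP - 5 Random Road, Sunny Place, SA 5000"}'

data = json.loads(raw)  # parse the JSON text into a Python dict
foldername = data["Project Title"]
print(foldername)
```

With the single-quoted string from the question, json.loads raises json.JSONDecodeError, so ast.literal_eval is the better fit for that form.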

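You could try this pattern: (?<='Project Title': )[^}]+
Explanation: it uses a positive lookbehind to ensure that the match occurs right after 'Project Title': (including the trailing space). It then matches everything up to the closing }: [^}]+. Note that the match still includes the surrounding quotes, and that the pattern has to be applied to a string, for example str(project).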
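A short sketch of how this pattern could be applied with re.search, assuming the value arrives as a dict that is first converted with str(). Stripping the quotes afterwards is an extra step added here, because the pattern itself keeps them.

```python
import re

project = {'Project Title': 'LS003942_EP - 5 Random Road, Sunny Place, SA 5000'}

# re.search needs a string, so use the dict's text representation
text = str(project)

match = re.search(r"(?<='Project Title': )[^}]+", text)
# The raw match still carries the surrounding apostrophes, so strip them
foldername = match.group(0).strip("'") if match else None
print(foldername)
```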

Related

Replace a word in an address string with dictionary value using for-loop

I have an address, 2300 S SUPER TEMPLE PL, and I expect to get 2300 S SUPER TEMPLE PLACE after spelling out PL as PLACE. I have a dictionary of abbreviated street names:
st_abbr = {'DR': 'DRIVE',
'RD': 'ROAD',
'BLVD':'BOULEVARD',
'ST':'STREET',
'STE':'SUITE',
'APTS':'APARTMENTS',
'APT':'APARTMENT',
'CT':'COURT',
'LN' : 'LANE',
'AVE':'AVENUE',
'CIR':'CIRCLE',
'PKWY': 'PARKWAY',
'HWY': 'HIGHWAY',
'SQ':'SQUARE',
'BR':'BRIDGE',
'LK':'LAKE',
'MT':'MOUNT',
'MTN':'MOUNTAIN',
'PL':'PLACE',
'RTE':'ROUTE',
'TR':'TRAIL'}
With a for-loop, I would like to replace each abbreviation in the address with its spelled-out form. My idea was to loop through each word in the address, hence address.split(), and if a word matches one of the keys in the dictionary, replace it with the spelled-out word.
for key in st_abbr.keys():
    if key in address.split():
        address = address.replace(key, st_abbr[key])
print(address)
It works on the abbreviated street name, but this is what I get: 2300 S SUPER TEMPLACEE PLACE. It also replaced the PL within 'TEMPLE' with PLACE, which gave me 'TEMPLACEE'. I am trying to modify the for loop so that it only replaces a word if it is an exact match for one of the dict.keys(). I would like guidance on how to achieve that.
Use a comprehension:
addr = '2300 S SUPER TEMPLE PL'
new_addr = ' '.join(st_abbr.get(c, c) for c in addr.split())
print(new_addr)
# Output
2300 S SUPER TEMPLE PLACE
Can you shed some light on the concept behind .get(c, c) in the context of my problem?
# Equivalent code
' '.join(st_abbr[c] if c in st_abbr else c for c in addr.split())
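To make the .get(c, c) idea concrete, here is a small sketch. dict.get(key, default) returns the stored value when the key exists and the default otherwise, so passing the word itself as the default leaves unknown words unchanged. The two-entry table is shortened for illustration only.

```python
st_abbr = {'PL': 'PLACE', 'RD': 'ROAD'}  # shortened table, for illustration only

known = st_abbr.get('PL', 'PL')            # key exists, so the stored value comes back
unknown = st_abbr.get('TEMPLE', 'TEMPLE')  # key missing, so the default (the word itself) comes back
no_default = st_abbr.get('TEMPLE')         # without a default, a missing key gives None

expanded = ' '.join(st_abbr.get(word, word) for word in '2300 S SUPER TEMPLE PL'.split())
print(known, unknown, no_default)
print(expanded)
```

Because the lookup is done per whole word, the PL inside TEMPLE is never touched.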
Not sure whether it's the best idea or not, but regex usually can be helpful in these cases:
import re

def getValue(value):
    # Rebuild the match, expanding the word if it is a known abbreviation
    before = value.group(1)
    name = value.group("name")
    after = value.group(3)
    if name in st_abbr:
        return before + st_abbr[name] + after
    else:
        return before + name + after

myString = "2300 S SUPER TEMPLE PL"
re.sub(r"(^|\s)+(?P<name>[A-Z]{2,4})($|\s)", getValue, myString)
Output
2300 S SUPER TEMPLE PLACE
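A different regex variant, offered as a sketch rather than as the approach above: build one alternation from the dictionary keys and anchor it with \b word boundaries, so an abbreviation only matches as a whole word. The function name expand_abbreviations and the shortened table are assumptions made for this example.

```python
import re

# Shortened table for illustration; the full st_abbr dict would work the same way
st_abbr = {'ST': 'STREET', 'STE': 'SUITE', 'APT': 'APARTMENT', 'PL': 'PLACE'}

# \b on both sides means PL cannot match inside TEMPLE, and ST cannot match inside STE
abbr_re = re.compile(r'\b(?:' + '|'.join(re.escape(key) for key in st_abbr) + r')\b')


def expand_abbreviations(address):
    """Replace every whole-word abbreviation with its spelled-out form."""
    return abbr_re.sub(lambda m: st_abbr[m.group(0)], address)


print(expand_abbreviations('2300 S SUPER TEMPLE PL'))
print(expand_abbreviations('10 MAIN ST STE 4'))
```

Unlike the pattern with explicit whitespace groups, word boundaries do not consume the spaces between words, so two abbreviations in a row are both replaced.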

Need to search a string for a "two word" pattern in python

I’m trying to search a long string of characters for a country name. The country name is sometimes more than one word, such as Costa Rica.
Here is my code:
eol = len(CountryList)
for c in range(0, eol):
    country = str(CountryList[c])
    countrymatch = re.search(country, fullsampledata)
    if countrymatch:
        ...
fullsampledata is a long string with all the data in one line. I’m trying to parse out the country by cycling thru a list of valid country names. If country is only one word, such as ‘Holland’, it finds it. However, if country is two or more words, ‘Costa Rica’, it doesn’t find it. Why?
You can search for a substring in a string using the .find() method as follows:
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
fullsampledata.find("Morocco")
-1
fullsampledata.index("Costa Rica")
17
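So you can write your if statement as follows. Note that .find() returns -1 when the substring is missing, whereas .index() raises a ValueError instead:
fullsampledata = "hwfekfwekjfnkwfehCosta Ricakwjfkwfekfekfw"
country = "Costa Rica"
if fullsampledata.find(country) != -1:
    # Found
    pass
else:
    # Not Found
    pass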
In [1]: long_string = 'asdfsadfCosta Ricaasdkj asdfsd asdjas USA alsj'
In [2]: 'Costa Rica' in long_string
Out[2]: True
You don't have your code properly shown and I'm a little too lazy to parse it. Hope this helps.
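The question does not show the data, so the cause is a guess. One possibility is that the words in the data are separated by something other than a single space (two spaces, a tab, a line break), in which case both re.search(country, ...) and a plain substring test fail. A minimal sketch under that assumption, with a hypothetical helper name find_countries and made-up sample data:

```python
import re


def find_countries(text, countries):
    """Return the countries that occur in text, tolerating any whitespace between words."""
    found = []
    for country in countries:
        # Escape each word (names may contain regex metacharacters),
        # then allow any run of whitespace between the words
        pattern = r'\s+'.join(re.escape(word) for word in country.split())
        if re.search(pattern, text):
            found.append(country)
    return found


# Made-up sample: note the two spaces inside "Costa  Rica"
sample = 'id42 Holland ... Costa  Rica ... end'
matches = find_countries(sample, ['Holland', 'Costa Rica', 'Morocco'])
print(matches)
print('Costa Rica' in sample)
```

If the data does contain a single ordinary space, the simpler 'Costa Rica' in fullsampledata test is enough.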

Delimiters in splitting string in python

I know this question has been asked a few times, but what I'm asking is not how to do it, but which delimiter should be used.
So I have a very long string and I want to split it into words. The result is not what I wanted, so I thought to add another delimiter.
The problem is there are words like vs. and U.S. in the string. If I use . as a delimiter, I will get vs but U.S. becomes U and S. This is not what I wanted.
Another example, there are words brainf*ck *7 F***ing x*x+y*y works* f*k in the string. If I use * as a delimiter, the result will be very messy (brainf*ck becomes brainf and ck, F***ing becomes F and ing, and so on)
The ' delimiter has the same problem (don't, 'starting out', what's, do's, dont's).
- = + ( ) also cause some minor problems, but I can handle those delimiters. The problem is with . * and '.
Does anyone have any idea how to tackle this problem?
What about using re:
import re
text = 'U.S. vs. brainf*ck *7 F***ing x*x+y*y works* f*k'
get = re.split(r'\s', text)
# ['U.S.', 'vs.', 'brainf*ck', '*7', 'F***ing', 'x*x+y*y', 'works*', 'f*k']
#Example
print(get[0]) # U.S.
print(get[1]) # vs.
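One detail worth knowing when splitting on whitespace only: re.split(r'\s', ...) splits at every single whitespace character, so consecutive spaces produce empty strings, while str.split() with no arguments treats any run of whitespace as one separator. A small sketch, where the sample text with a double space is made up for illustration:

```python
import re

text = 'U.S.  vs. brainf*ck'  # made-up sample with two spaces after "U.S."

per_char = re.split(r'\s', text)    # splits at each whitespace character
per_run = re.split(r'\s+', text)    # splits at each run of whitespace
plain = text.split()                # same idea without a regex

print(per_char)
print(per_run)
print(plain)
```

In all three cases the characters . * and ' stay inside the words, because only whitespace is used as a delimiter.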

Add letters to string conditionally

Input: 1 10 avenue
Desired Output: 1 10th avenue
As you can see above I have given an example of an input, as well as the desired output that I would like. Essentially I need to look for instances where there is a number followed by a certain pattern (avenue, street, etc). I have a list which contains all of the patterns and it's called patterns.
If that number does not have "th" after it, I would like to add "th". Simply adding "th" is fine, because other portions of my code will correct it to either "st", "nd", "rd" if necessary.
Examples:
1 10th avenue OK
1 10 avenue NOT OK, TH SHOULD BE ADDED!
I have implemented a working solution, which is this:
def Add_Th(address):
    try:
        address = address.split(' ')
    except AttributeError:
        pass
    for pattern in patterns:
        try:
            location = address.index(pattern) - 1
            number_location = address[location]
        except (ValueError, IndexError):
            continue
        if 'th' not in number_location:
            new = number_location + 'th'
            address[location] = new
    address = ' '.join(address)
    return address
I would like to convert this implementation to regex, as this solution seems a bit messy to me, and occasionally causes some issues. I am not the best with regex, so if anyone could steer me in the right direction that would be greatly appreciated!
Here is my current attempt at the regex implementation:
def add_th(address):
    find_num = re.compile(r'(?P<number>[\d]{1,2}(' + "|".join(patterns + ')(?P<following>.*)')
    check_th = find_num.search(address)
    if check_th is not None:
        if re.match(r'(th)', check_th.group('following')):
            return address
        else:
            # this is where I would add th. I know I should use re.sub, i'm just not too sure
            # how I would do it
    else:
        return address
I do not have a lot of experience with regex, so please let me know if any of the work I've done is incorrect, as well as what would be the best way to add "th" to the appropriate spot.
Thanks.
Just one way, finding the positions behind a digit and ahead of one of those pattern words and placing 'th' into them:
>>> address = '1 10 avenue 3 33 street'
>>> patterns = ['avenue', 'street']
>>>
>>> import re
>>> pattern = re.compile(r'(?<=\d)(?= ({}))'.format('|'.join(patterns)))
>>> pattern.sub('th', address)
'1 10th avenue 3 33th street'
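The same lookaround idea can be wrapped in a function, sketched here under a few assumptions: the name add_th and the patterns parameter are chosen for this example, and re.escape plus the trailing \b are additions that guard against special characters in the pattern words and against matching only the start of a longer word.

```python
import re


def add_th(address, patterns):
    """Insert 'th' after a number that is directly followed by one of the pattern words."""
    alternation = '|'.join(re.escape(p) for p in patterns)
    # (?<=\d) : the position must come right after a digit
    # (?= ...) : and right before a space plus one of the pattern words
    regex = re.compile(r'(?<=\d)(?= (?:{})\b)'.format(alternation))
    return regex.sub('th', address)


patterns = ['avenue', 'street']
print(add_th('1 10 avenue', patterns))
print(add_th('1 10th avenue', patterns))
```

A number that already ends in "th" is left alone, because the position before the space then follows the letter h rather than a digit.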

Remove items in string paragraph if they belong to a list of strings?

import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
remove_char
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
obama_4427_replace = obama_4427_str.replace(remove_char,'')
print(obama_4427_replace)
Using BeautifulSoup, I scraped one of Obama's speeches off of the above website. Now, I need to replace some residual HTML in an efficient manner. I've stored a list of elements I'd like to eliminate in remove_char. I'm trying to write a simple for statement, but am getting the error: TypeError: expected a character object buffer. It's a beginner question, I know, but how can I get around this?
Since you are already using BeautifulSoup, you can use obama_4427_div.text directly instead of str(obama_4427_div) to get the text. The text you get will not contain any residual HTML elements.
Example -
>>> obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
For completeness, for removing elements from a string, I would create a list of elements to remove (like the remove_char list you have created) and then we can do str.replace() on the string for each element in the list. Example -
obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
for char in remove_char:
    obama_4427_str = obama_4427_str.replace(char,'')
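As an alternative to looping, the list can be turned into a single regex alternation so the string is scanned once. This is a sketch, not part of the answer above; the function name strip_substrings and the sample HTML fragment are made up for illustration.

```python
import re


def strip_substrings(text, substrings):
    """Remove every occurrence of each item in substrings from text."""
    # Escape the items so characters like / or . are matched literally;
    # try longer items first in case one item is a prefix of another
    ordered = sorted(substrings, key=len, reverse=True)
    pattern = '|'.join(re.escape(s) for s in ordered)
    return re.sub(pattern, '', text)


remove_char = ['<br/>', '</p>', '</div>', '<h2>', '</h2>', '<p>']
sample = '<h2>Transcript</h2><p>Thank you.<br/>God bless.</p>'
cleaned = strip_substrings(sample, remove_char)
print(cleaned)
```

The original TypeError came from passing the whole list to str.replace(), which expects a single string as its first argument; both the loop and this sketch avoid that.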
