I have a situation where a user submits an address and I have to match the user's input against my keys. I can do the join once the address has its street-type suffix stripped, i.e. I want to convert:
COVERED WAGON TRAIL
CHISHOLM TRAIL
LAKE TRAIL
CHESTNUT ST
LINCOLN STREET
to:
COVERED WAGON
CHISHOLM
LAKE
CHESTNUT
LINCOLN
However, I can't work out how to write the code so that it replaces only the last word.
I get:
LINCOLN
CHESTNUT
CHISHOLM
LAKEAIL
CHISHOLMAIL
COVERED WAGONL
I've tried verbose regexes, re.sub, and anchoring with $.
import re
target = '''
LINCOLN STREET
CHESTNUT ST
CHISHOLM TR
LAKE TRAIL
CHISHOLM TRAIL
COVERED WAGON TRL
'''
rdict = {
    ' ST': '',
    ' STREET': '',
    ' TR': '',
    ' TRL': '',
}
robj = re.compile('|'.join(rdict.keys()))
# leftover from an earlier attempt with rsplit:
# re.sub(' TRL', '', target.rsplit(' ', 1)[0]), target
result = robj.sub(lambda m: rdict[m.group(0)], target)
print result
Use re.sub with a $ anchor and the re.M flag:
target = '''
LINCOLN STREET
CHESTNUT ST
CHISHOLM TR
LAKE TRAIL
CHISHOLM TRAIL
COVERED WAGON TRL
'''
import re
print re.sub(r'\s+(STREET|ST|TRAIL|TRL|TR)\s*$', '', target, flags=re.M)
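Note that the order of alternatives in the pattern matters: regex alternation takes the first branch that matches, so STREET must come before ST and TRAIL before TR, otherwise 'LAKE TRAIL' loses only ' TR' and becomes 'LAKEAIL'. If the suffix list grows, sorting by length avoids hand-ordering mistakes; a small sketch (the suffix list here is just an example):

```python
import re

# Example suffix list (an assumption; extend for your data). Sorting by
# length, longest first, guarantees STREET is tried before ST and TRAIL
# before TR, since alternation takes the first alternative that matches.
suffixes = ['ST', 'STREET', 'TR', 'TRL', 'TRAIL']
pattern = re.compile(
    r'\s+(' + '|'.join(sorted(suffixes, key=len, reverse=True)) + r')\s*$',
    flags=re.M)

target = 'LINCOLN STREET\nLAKE TRAIL\nCOVERED WAGON TRL'
print(pattern.sub('', target))  # LINCOLN / LAKE / COVERED WAGON, one per line
```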
If you store your string in this format:
target = '''
LINCOLN STREET
CHESTNUT ST
CHISHOLM TR
LAKE TRAIL
CHISHOLM TRAIL
COVERED WAGON TRL
'''
There is no need to use regex:
>>> print '\n'.join([x.rsplit(None, 1)[0] for x in target.strip().split('\n')])
LINCOLN
CHESTNUT
CHISHOLM
LAKE
CHISHOLM
COVERED WAGON
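One caveat with rsplit: it drops the last word unconditionally, so an address that has no suffix would lose part of its name. A variation that only strips known suffixes, sketched with an example suffix set:

```python
# SUFFIXES is an illustrative assumption; extend it to match your data.
SUFFIXES = {'ST', 'STREET', 'TR', 'TRL', 'TRAIL'}

def strip_suffix(address):
    # rpartition splits on the last space; if there is none,
    # head is '' and last is the whole string.
    head, _, last = address.rpartition(' ')
    return head if last in SUFFIXES else address

for line in ['LINCOLN STREET', 'LAKE TRAIL', 'COVERED WAGON', 'LAKE']:
    print(strip_suffix(line))
```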
I want to scrape the job title, location, and job description from Indeed (1st page only) using Regex, and store the results to a data frame. Here is the link: https://www.indeed.com/jobs?q=data+scientist&l=California
I have completed the task using BeautifulSoup and it worked fine:
from urllib.request import urlopen
from bs4 import BeautifulSoup as BS
import pandas as pd
url = 'https://www.indeed.com/jobs?q=data+scientist&l=California'
htmlfile = urlopen(url)
soup = BS(htmlfile,'html.parser')
companies = []
locations = []
summaries = []
company = soup.findAll('span', attrs={'class':'company'})
for c in company:
    companies.append(c.text.replace("\n", ""))
location = soup.findAll(class_='location accessible-contrast-color-location')
for l in location:
    locations.append(l.text)
summary = soup.findAll('div', attrs={'class':'summary'})
for s in summary:
    summaries.append(s.text.replace("\n", ""))
jobs_df = pd.DataFrame({'Company': companies, 'Location': locations, 'Summary': summaries})
jobs_df
Result from BS:
Company Location Summary
0 Cisco Careers San Jose, CA Work on massive structured, unstru...
1 AllyO Palo Alto, CA Extensive knowledge of scientific ...
2 Driven Brands Benicia, CA 94510 Develop scalable statistical, mach...
3 eBay Inc. San Jose, CA These problems require deep analys...
4 Disney Streaming Services San Francisco, CA Deep knowledge of machine learning...
5 Trimark Associates, Inc. Sacramento, CA The primary focus is in applying d...
But when I tried to use the same tags in Regex it failed.
import urllib.request, urllib.parse, urllib.error
import re
import pandas as pd
url = 'https://www.indeed.com/jobs?q=data+scientist&l=California'
text = urllib.request.urlopen(url).read().decode()
companies = []
locations = []
summaries = []
company = re.findall('<span class="company">(.*?)</span>', text)
for c in company:
    companies.append(str(c))
location = re.findall('<div class="location accessible-contrast-color-location">(.*?)</div>', text)
for l in location:
    locations.append(str(l))
summary = re.findall('<div class="summary">(.*?)</div>', text)
for s in summary:
    summaries.append(str(s))
print(companies)
print(locations)
print(summaries)
There was an error saying the lengths of the lists don't match, so I checked the individual lists. It turned out the contents could not be fetched. This is what I got from the above:
[]
['Palo Alto, CA', 'Sunnyvale, CA', 'San Francisco, CA', 'South San Francisco, CA 94080', 'Pleasanton, CA 94566', 'Aliso Viejo, CA', 'Sacramento, CA', 'Benicia, CA 94510', 'San Bruno, CA']
[]
What did I do wrong?
. matches any character except newline. In the HTML code, there are newlines as well.
So you need to use re.DOTALL as flags option in the re.findall like below:
company = re.findall('<span class="company">(.*?)</span>', text, flags=re.DOTALL)
This alone will not give you only the names; instead you will get everything inside the span element you are selecting, descendants included. So you need a second pass that extracts just the part you want.
for c in company:
    # select only the company name, discarding the rest of the anchor tag
    name = re.findall('<a.*>(.*)</a>', c, flags=re.DOTALL)
    for n in name:
        # clean up by removing newlines and surrounding spaces
        companies.append(str(n.strip()))
print(companies)
Output:
['Driven Brands', 'Southern California Edison', 'Paypal', "Children's Hospital Los Angeles", 'Cisco Careers', 'University of California, Santa Cruz', 'Beyond Limits', 'Shutterfly', 'Walmart', 'Trimark Associates, Inc.']
For location and summary there are no further HTML tags inside, only text, so re.DOTALL plus stripping the text will do the job; there is no need for a second for loop and a second findall.
. will match any character except line terminators. The content you are trying to get spans new lines (\n), so you need to match anything, including line terminators.
You'll want to do: company = re.findall('<span class="company">(.*?)</span>', text, re.DOTALL)
But this will also require a little cleanup afterwards.
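Both answers come down to the same point, which can be seen on a toy snippet (the HTML below is made up for illustration):

```python
import re

# Toy HTML (hypothetical) with the target text split across lines, the
# way server-rendered markup often is.
html = '<span class="company">\n  <a href="/x">\n    Acme Corp\n  </a>\n</span>'

# Without re.DOTALL, '.' stops at the newline and nothing matches.
print(re.findall('<span class="company">(.*?)</span>', html))  # []

# With re.DOTALL, '.' also matches newlines, so the whole inner block is
# captured; strip the tags and surrounding whitespace afterwards.
inner = re.findall('<span class="company">(.*?)</span>', html, flags=re.DOTALL)
names = [re.sub('<[^>]+>', '', block).strip() for block in inner]
print(names)  # ['Acme Corp']
```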
I've written some code in Python to scrape some movie names and some additional information related to those movies. What I have written so far works fine if I print the two items separately, as in print(movie) in the middle of my script and print(addinfo) at the bottom.
However, when I try to print both of them together at the bottom, I only get the movie names that also have additional information (the additional information is retrieved from a link attached to each movie name, and most movie names do not have that extra link).
For example, if there are 5 movie names of which only three have additional links, then when I print them together I get only those three names with their additional information, whereas I expect all 5 movie names printed, including the ones without extra information. How can I fix this? Thanks in advance. I think the site address and HTML details are irrelevant since the code itself works, but I'm pasting the full code for your consideration.
Script I've tried with:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
URL = "https://in.bookmyshow.com/vizag/movies"
res = requests.get(URL)
soup = BeautifulSoup(res.text, 'lxml')
for item in soup.select(".card-container"):
    movie = item.select_one(".__movie-name").text.strip()
    print(movie)  # I do not wish to print here; I expect to print movie and addinfo together
    blink = item.select_one(".book-button a")
    if blink:
        req = requests.get(urljoin(URL, blink['href']))
        soup = BeautifulSoup(req.text, "lxml")
        addinfo = ' '.join([item.select_one(".__venue-name").text.strip() for item in soup.select(".listing-info")])
        print(movie, addinfo)  # printing both together here only shows movies that have additional info
Code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
URL = "https://in.bookmyshow.com/vizag/movies"
res = requests.get(URL)
soup = BeautifulSoup(res.text, 'html.parser')
for item in soup.select(".card-container"):
    addinfo = ''
    movie = item.select_one(".__movie-name").text.strip()
    blink = item.select_one(".book-button a")
    if blink:
        req = requests.get(urljoin(URL, blink['href']))
        soup = BeautifulSoup(req.text, "lxml")
        addinfo = ' '.join([item.select_one(".__venue-name").text.strip() for item in soup.select(".listing-info")])
    print(movie, addinfo)
Output:
Tholi Prema Gokul A/C DTS: Vizag
Howrah Bridge INOX: CMR Central, Maddilapalem INOX: Varun Beach, Beach Road INOX: Vizag Chitralaya Mall Satyam A/C Dts: Gopalapatnam V Max: Vizag
Chalo Ganesh A/C Dts: Tagarapuvalasa INOX: CMR Central, Maddilapalem INOX: Varun Beach, Beach Road INOX: Vizag Chitralaya Mall Mukta A2 Cinemas: Vizag Central, Vizag Mohini Mini: Gajuwaka Mohini 70mm Dolby Atmos: Gajuwaka Narasimha a/c Dts: Gopalapatnam Sri Lakshmi Narasimha Picture Palace: Vizag Sri Venkateshwara Screen 1: Vizag Sarat Theater - 4K Dolby Atmos: Vizag
Touch Chesi Chudu INOX: CMR Central, Maddilapalem INOX: Varun Beach, Beach Road INOX: Vizag Chitralaya Mall Mukta A2 Cinemas: Vizag Central, Vizag Raja Cine Max 2K A/c Dts: Kothavalasa Sharada 4K: Vizag Sri Rama Picture Palace: Vizag Tata Picture Palace A/c Dts: Tagarapuvalasa V Max: Vizag
Bhaagamathie INOX: CMR Central, Maddilapalem INOX: Varun Beach, Beach Road INOX: Vizag Chitralaya Mall Jagadamba 4k: Vizag Kinnera Cinema: Maddilapalem Mukta A2 Cinemas: Vizag Central, Vizag Sri Ramulamma Theatre, Thagarapuvalasa: Vizag Sri Lakshmi Narasimha Picture Palace: Vizag Shankara A/C Dts: Gopalapatnam Sri Jaya A/c Dts: Kothavalasa
Padmaavat
Gang Gokul A/C DTS: Vizag Sri Parameswari Picture Palace: Kancharapalem
Jai Simha Mourya Theatre: Gopalapatnam Sree Leela Mahal: Vizag Saptagiri Theatre: Chittivalasa
Maze Runner: The Death Cure INOX: Varun Beach, Beach Road INOX: Vizag Chitralaya Mall Ramadevi 4K: Vizag
Jumanji: Welcome To The Jungle INOX: Vizag Chitralaya Mall
Hey Jude INOX: Varun Beach, Beach Road
Green Apple
Sollividava
Tagaru
Savarakathi
KEE
Prema Baraha
Befaam
Shadow
Rosapoo
Aapla Manus
Kalakalappu 2
Kumari 21 F
Karu
Kirrak Party
Gayatri
Inttelligent
KEY
Downup The Exit 796
Pad Man
The Boy and The World
The 15:17 to Paris
Leera The Soulmates
Aiyaary
Kanam
If you make use of an else block, another approach could look like this:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
URL = "https://in.bookmyshow.com/vizag/movies"
res = requests.get(URL)
soup = BeautifulSoup(res.text, 'lxml')
for item in soup.select(".card-container"):
    movie = item.select_one(".__movie-name").text.strip()
    blink = item.select_one(".book-button a")
    if blink:
        req = requests.get(urljoin(URL, blink['href']))
        soup = BeautifulSoup(req.text, "lxml")
        addinfo = ' '.join([item.select_one(".__venue-name").text.strip() for item in soup.select(".listing-info")])
        print(movie, addinfo)
    else:
        print(movie)
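Both fixes rely on the same idea: every iteration must reach a print, either by giving addinfo a default before the conditional or by handling the no-link case in an else. The pattern in isolation, with made-up data:

```python
# Hypothetical (movie, extra) pairs standing in for the scraped data;
# None marks a movie card with no booking link.
movies = [('Padmaavat', None), ('Gang', 'Gokul A/C DTS: Vizag')]

for name, extra in movies:
    addinfo = ''         # default, so movies without a link still print
    if extra:
        addinfo = extra  # overwritten only when the link exists
    print(name, addinfo)
```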
I am having trouble appending data into a list as I iterate through the following:
import urllib
import urllib.request
from bs4 import BeautifulSoup
import pandas
def make_soup(url):
    thepage = urllib.request.urlopen(url)
    thepage.addheaders = [('User-Agent', 'Mozilla/5.0')]
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

soup = make_soup('https://www.wellstar.org/locations/pages/default.aspx')
locationdata = []
for table in soup.findAll('table', class_='s4-wpTopTable'):
    for name in table.findAll('div', 'PurpleBackgroundHeading'):
        name = name.get_text(strip=True)
    for loc_type in table.findAll('h3', class_='WebFont SpotBodyGreen'):
        loc_type = loc_type.get_text()
    for address in table.findAll('div', class_=['WS_Location_Address', 'WS_Location_Adddress']):
        address = address.get_text(strip=True, separator=' ')
        locationdata.append([name, loc_type, address])
df = pandas.DataFrame(columns=['name', 'loc_type', 'address'], data=locationdata)
print(df)
The produced dataframe includes all of the unique addresses, but each one is paired with only the last name found in the table.
For example, even though 'WellStar Windy Hill Hospital' is the last hospital within the hospital category/type, it appears as the name for all hospitals. If possible, I'd prefer a list.append solution, as I have several more similar steps to go through to finalize this project.
This is occurring because you're looping through all the names and loc_types before you're appending to locationdata.
You can instead do:
import itertools as it
from pprint import pprint as pp

for table in soup.findAll('table', class_='s4-wpTopTable'):
    names = [name.get_text(strip=True)
             for name in table.findAll('div', 'PurpleBackgroundHeading')]
    loc_types = [loc_type.get_text()
                 for loc_type in table.findAll('h3', class_='WebFont SpotBodyGreen')]
    addresses = [address.get_text(strip=True, separator=' ')
                 for address in table.findAll('div', class_=['WS_Location_Address',
                                                             'WS_Location_Adddress'])]
    for name, loc_type, address in it.izip_longest(names, loc_types, addresses):
        locationdata.append([name, loc_type, address])
Result:
>>> pp(locationdata)
[[u'WellStar Urgent Care in Acworth',
u'WellStar Urgent Care Centers',
u'4550 Cobb Parkway NW Suite 101 Acworth, GA 30101 770-917-8140'],
[u'WellStar Urgent Care in Kennesaw',
None,
u'3805 Cherokee Street Kennesaw, GA 30144 770-426-5665'],
[u'WellStar Urgent Care in Marietta - Delk Road',
None,
u'2890 Delk Road Marietta, GA 30067 770-955-8620'],
[u'WellStar Urgent Care in Marietta - East Cobb',
None,
u'3747 Roswell Road Ne Suite 107 Marietta, GA 30062 470-956-0150'],
[u'WellStar Urgent Care in Marietta - Kennestone',
None,
u'818 Church Street Suite 100 Marietta, GA 30060 770-590-4190'],
[u'WellStar Urgent Care in Marietta - Sandy Plains Road',
None,
u'3600 Sandy Plains Road Marietta, GA 30066 770-977-4547'],
[u'WellStar Urgent Care in Smyrna',
None,
u'4480 North Cooper Lake Road SE Suite 100 Smryna, GA 30082 770-333-1300'],
[u'WellStar Urgent Care in Woodstock',
None,
u'120 Stonebridge Parkway Suite 310 Woodstock, GA 30189 678-494-2500']]
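On Python 3 the same function is itertools.zip_longest (izip_longest is the Python 2 name). A minimal illustration of the padding behaviour, with shortened, made-up values:

```python
from itertools import zip_longest  # izip_longest on Python 2

# Shorter sequences are padded with the fillvalue (None by default), so
# every address still produces a row even when the table has only one
# name and one loc_type.
names = ['Urgent Care in Acworth']
loc_types = ['Urgent Care Centers']
addresses = ['4550 Cobb Parkway NW', '3805 Cherokee Street']

for row in zip_longest(names, loc_types, addresses):
    print(row)
```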
I am trying to retrieve 3 columns (NFL Team, Player Name, College Team) from the following Wikipedia page. I am new to Python and have been trying to use BeautifulSoup to get this done. I only need the columns that belong to QBs, but I haven't even been able to get all the columns regardless of position. This is what I have so far; it outputs nothing and I'm not entirely sure why. I believe it is due to the a tags, but I don't know what to change. Any help would be greatly appreciated.
import urllib2
from bs4 import BeautifulSoup

wiki = "http://en.wikipedia.org/wiki/2008_NFL_draft"
header = {'User-Agent': 'Mozilla/5.0'} #Needed to prevent 403 error on Wikipedia
req = urllib2.Request(wiki,headers=header)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page)
rnd = ""
pick = ""
NFL = ""
player = ""
pos = ""
college = ""
conf = ""
notes = ""
table = soup.find("table", { "class" : "wikitable sortable" })
#print table
#output = open('output.csv','w')
for row in table.findAll("tr"):
    cells = row.findAll("href")
    print "---"
    print cells.text
    print "---"
    #For each "tr", assign each "td" to a variable.
    #if len(cells) > 1:
    #    NFL = cells[1].find(text=True)
    #    player = cells[2].find(text=True)
    #    pos = cells[3].find(text=True)
    #    college = cells[4].find(text=True)
    #    write_to_file = player + " " + NFL + " " + college + " " + pos
    #    print write_to_file
    #    output.write(write_to_file)
#output.close()
I know a lot of it is commented out; that's because I was trying to find where the breakdown was.
Here is what I would do:
find the Player Selections paragraph
get the next wikitable using find_next_sibling()
find all tr tags inside
for every row, find td and th tags and get the desired cells by index
Here is the code:
filter_position = 'QB'
player_selections = soup.find('span', id='Player_selections').parent
for row in player_selections.find_next_sibling('table', class_='wikitable').find_all('tr')[1:]:
    cells = row.find_all(['td', 'th'])
    try:
        nfl_team, name, position, college = cells[3].text, cells[4].text, cells[5].text, cells[6].text
    except IndexError:
        continue
    if position != filter_position:
        continue
    print nfl_team, name, position, college
And here is the output (only quarterbacks are filtered):
Atlanta Falcons Ryan, MattMatt Ryan† QB Boston College
Baltimore Ravens Flacco, JoeJoe Flacco QB Delaware
Green Bay Packers Brohm, BrianBrian Brohm QB Louisville
Miami Dolphins Henne, ChadChad Henne QB Michigan
New England Patriots O'Connell, KevinKevin O'Connell QB San Diego State
Minnesota Vikings Booty, John DavidJohn David Booty QB USC
Pittsburgh Steelers Dixon, DennisDennis Dixon QB Oregon
Tampa Bay Buccaneers Johnson, JoshJosh Johnson QB San Diego
New York Jets Ainge, ErikErik Ainge QB Tennessee
Washington Redskins Brennan, ColtColt Brennan QB Hawaiʻi
New York Giants Woodson, Andre'Andre' Woodson QB Kentucky
Green Bay Packers Flynn, MattMatt Flynn QB LSU
Houston Texans Brink, AlexAlex Brink QB Washington State
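The try/except IndexError in the answer is what skips the header-like rows. The same idea on hypothetical, already-parsed rows (the cell indices below are made up and depend on the actual table layout):

```python
# Hypothetical rows: a full draft-pick row has 7 cells, while section
# headers are short and raise IndexError on unpacking, so they're skipped.
rows = [
    ['1', '1', 'Miami Dolphins', 'Jake Long', 'OT', 'Michigan', 'Big Ten'],
    ['Round 2'],  # header-like row with too few cells
    ['1', '3', 'Atlanta Falcons', 'Matt Ryan', 'QB', 'Boston College', 'ACC'],
]

filter_position = 'QB'
picks = []
for cells in rows:
    try:
        nfl_team, name, position, college = cells[2], cells[3], cells[4], cells[5]
    except IndexError:
        continue  # not a player row
    if position != filter_position:
        continue  # keep quarterbacks only
    picks.append((nfl_team, name, college))
print(picks)
```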
Hi, I am trying to extract information into a list of plain text, but I can't find a way to remove the escape characters.
I am very new to Python and programming in general, and I haven't been able to solve this.
This is my code:
import urllib
import re
from bs4 import BeautifulSoup
x = 1
while x < 2:
    url = "http://search.insing.com/ts/food-drink/bars-pubs/bars-pubs?page=" + str(x)
    htmlfile = urllib.urlopen(url).read()
    soup = BeautifulSoup(htmlfile.decode('utf-8', 'ignore'))
    reshtml = soup.find("div", "results").find_all("h3")
    reslist = []
    for item in reshtml:
        res = item.get_text()
        reslist.append(res)
    print reslist
    x += 1
It seems like you're really after the text in the anchor there. Consider changing
reshtml = soup.find("div", "results").find_all("h3")
to:
reshtml = [h3.a for h3 in soup.find("div", "results").find_all("h3")]
also change:
reslist.append(res)
to:
reslist.append(' '.join(res.split()))
here is what I get after changing:
[u'Parco Caffe', u'AdstraGold Microbrewery & Bistro Bar',
u'Alkaff Mansion Ristorante', u'The Fat Cat Bistro', u'Gravity Bar',
u'The Wine Company (Evans Road)', u'Serenity Spanish Bar & Restaurant (VivoCity)',
u'The New Harbour Cafe & Bar', u'Indian Times', u'Sunset Bay Beach Bar',
u'Friends # Jelita', u'Talk Cock Sing Song # Thomson',
u'En Japanese Dining Bar (UE Square)', u'Magma German Wine Bistro',
u"Tam Kah Shark's Fin", u'Senso Ristorante & Bar',
u'Hard Rock Cafe (HPL House)', u'St. James Power Station',
u'The St. James', u'Brotzeit German Bier Bar & Restaurant (Vivocity)']
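The ' '.join(res.split()) idiom works because str.split() with no argument splits on any run of whitespace (spaces, tabs, \r, \n) and discards leading and trailing whitespace, so joining with single spaces normalizes everything in one step. For example:

```python
# One of the raw strings from the question, cleaned in a single step.
raw = '\n\r\n        The Wine Company\r\n        (Evans Road)\r\n    \n'
clean = ' '.join(raw.split())
print(clean)  # The Wine Company (Evans Road)
```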
The current output looks like this:
[u'\n\r\n Parco Caffe\n',
u'\n\r\n AdstraGold Microbrewery & Bistro Bar\n',
u'\n\r\n Alkaff Mansion Ristorante\n',
u'\n\r\n The Fat Cat Bistro\n',
u'\n\r\n Gravity Bar\n',
u'\n\r\n The Wine Company\r\n (Evans Road)\r\n \n',
u'\n\r\n Serenity Spanish Bar & Restaurant\r\n (VivoCity)\r\n \n',
u'\n\r\n The New Harbour Cafe & Bar\n',
u'\n\r\n Indian Times\n',
u'\n\r\n Sunset Bay Beach Bar\n',
u'\n\r\n Friends # Jelita\n',
u'\n\r\n Talk Cock Sing Song # Thomson\n',
u'\n\r\n En Japanese Dining Bar\r\n (UE Square)\r\n \n',
u'\n\r\n Magma German Wine Bistro\n',
u"\n\r\n Tam Kah Shark's Fin\n",
u'\n\r\n Senso Ristorante & Bar\n',
u'\n\r\n Hard Rock Cafe\r\n (HPL House)\r\n \n',
u'\n\r\n St. James Power Station \n',
u'\n\r\n The St. James\n',
u'\n\r\n Brotzeit German Bier Bar & Restaurant\r\n (Vivocity)\r\n \n']
Adding these lines before the print:
reslist = [y.replace('\n','').replace('\r','') for y in reslist]
reslist = [y.strip() for y in reslist]
Gives me this output:
[u'Alkaff Mansion Ristorante',
u'Parco Caffe',
u'AdstraGold Microbrewery & Bistro Bar',
u'Gravity Bar',
u'The Fat Cat Bistro',
u'The Wine Company (Evans Road)',
u'Serenity Spanish Bar & Restaurant (VivoCity)',
u'The New Harbour Cafe & Bar',
u'Indian Times',
u'Sunset Bay Beach Bar',
u'Friends # Jelita',
u'Talk Cock Sing Song # Thomson',
u'En Japanese Dining Bar (UE Square)',
u'Magma German Wine Bistro',
u"Tam Kah Shark's Fin",
u'Senso Ristorante & Bar',
u'Hard Rock Cafe (HPL House)',
u'St. James Power Station',
u'The St. James',
u'Brotzeit German Bier Bar & Restaurant (Vivocity)']
Is that what you are looking for?
Guy's answer is much better, and more BeautifulSoup-specific.