Web scraping table filtering results - python

I'm using Python to web scrape a table of data found here. Specifically, I want to pull the business name, URL, owner's name, street, city, and phone. After being run through Beautiful Soup and split, the code to filter appears as:
['\\\', \\\' href="?listingid=9758&profileid=217Y3Q544Y&action=uweb&url=http%3a%2f%2fwww.jpspa.com" target="_BLANK"', "Johnson Price Sprinkle PA', '/a", "', '/b", "', '/td", "', '/tr", "', '/table", "', '/td", "', '/tr", '', 'tr class="GeneralBody"', '', 'td bgcolor="#808080" height="1"', '', 'img border="0" height="1" src="images/dot_clear.gif" width="1"/', "', '/td", "', '/tr", "', '/table", "', '/td", "', '/tr", '', 'tr class="GeneralBody"', '', 'td align="left" valign="top" width="90%"', 'Maria Pilos', "', '', '79 Woodfin Place, Suite 300", "', '', 'Asheville, NC 28801", "', '', '", 'b', "Phone:', '/b", ' **(828) 254-2374**', "', '', '", 'b', "Fax:', '/b", " (828) 252-9994', '\', \'", '\\\', \\\' href="DirectoryEmailForm.aspx?listingid=9758"', "Send Email', '/a", "', '/td", '', 'td align="right" rowspan="3" valign="top" width="10%"', '', 'span style="font-size: 8pt"', '\\\', \\\' href="?, '!--..End Listing--", '', "/td']
I bolded the values I want to return and identified their positions in the array. The filtering code is below: temp_array is the list above, temp_count tracks the position in the array, and business_listings is the list I append matched values to. Basically, when temp_count equals the position of a wanted value, that value is appended.
temp_count = 0
for i in temp_array:
    if temp_count == 0:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 2:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 19:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 19:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 20:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 23:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 27:
        business_listings.append(i)
        temp_count += 1
    elif temp_count == 42:
        business_listings.append(i)
        temp_count += 1
    else:
        count += 1
The output is as follows:
['\\\', \\\' href="?listingid=9758&profileid=2B713K5Z48&action=uweb&url=http%3a%2f%2fwww.jpspa.com" target="_BLANK"']
It only filters the first two values, or won't filter anything at all.
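A likely culprit: the final else branch increments count instead of temp_count, so the position counter stops advancing at the first index with no matching elif, and nothing past the first match is ever appended. A minimal sketch of the same index-based filtering using enumerate and a set of target positions (the positions are the ones hard-coded in the code above; the sample data is made up), which avoids maintaining the counter by hand:

```python
# Positions to keep, taken from the if/elif chain in the original code
wanted = {0, 2, 19, 20, 23, 27, 42}

def filter_positions(temp_array, wanted):
    """Return the elements of temp_array whose index is in wanted."""
    return [item for i, item in enumerate(temp_array) if i in wanted]

# Example with placeholder data standing in for the split HTML fragments
sample = [f'field{i}' for i in range(50)]
print(filter_positions(sample, wanted))
```

enumerate supplies the index for free, so there is no counter to get out of sync with the loop.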

This script will print information about various businesses:
import requests
from bs4 import BeautifulSoup

url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    business_url = b.a['href'] if b.a else '-'
    owner = b.find_next('td', width="90%").contents[0]

    addr, current = [], owner.find_next(text=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(text=True)
    addr = '\n'.join(addr)
    phone = current.find_next(text=True).strip()

    print('Business Name :', business_name)
    print('Business URL :', business_url)
    print('Owner :', owner)
    print('Phone :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)
Prints:
Business Name : Johnson Price Sprinkle PA
Business URL : ?listingid=9758&profileid=2D7R3B5E4N&action=uweb&url=http%3a%2f%2fwww.jpspa.com
Owner : Maria Pilos
Phone : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL : ?listingid=9656&profileid=549S620J3J&action=uweb&url=http%3a%2f%2fwww.lbnoelcpa.com%2f
Owner : Ms. Leah Noel
Phone : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Worley, Woodbery, & Associates, PA
Business URL : ?listingid=9661&profileid=3L7R304J8X&action=uweb&url=http%3a%2f%2fwww.worleycpa.com%2f
Owner : Mr. David Worley
Phone : (828) 271-7997
Address:
7 Orchard Street, Ste. 202
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Peridot Consulting, Inc.
Business URL : ?listingid=14005&profileid=7L724E5W7E&action=uweb&url=http%3a%2f%2fwww.PeridotConsultingInc.com
Owner : John Michael Kledis
Phone : (828) 242-6971
Address:
PO Box 8904
Asheville, NC 28804
--------------------------------------------------------------------------------
Business Name : DHG
Business URL : ?listingid=9579&profileid=25711D625I&action=uweb&url=http%3a%2f%2fwww.dhgllp.com%2f
Owner : Adrienne Bernardi
Phone : (828) 254-2254
Address:
PO Box 3049
Asheville, NC 28802
--------------------------------------------------------------------------------
Business Name : Gould Killian CPA Group, P.A.
Business URL : ?listingid=9659&profileid=2P7X216Y66&action=uweb&url=http%3a%2f%2fwww.gk-cpa.com
Owner : Ed Towson
Phone : (828) 258-0363
Address:
100 Coxe Avenue
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Michelle Tracz CPA, CFE, PLLC
Business URL : ?listingid=12921&profileid=610C8H3I7N&action=uweb&url=http%3a%2f%2fwww.michelletraczcpa.com
Owner : Michelle Tracz
Phone : (828) 280-2530
Address:
1238 Hendersonville Rd.
Asheville, NC 28803
--------------------------------------------------------------------------------
Business Name : Burleson & Earley, P.A.
Business URL : ?listingid=10436&profileid=57132N5P9C&action=uweb&url=http%3a%2f%2fwww.burlesonearley.com%2f
Owner : Bronwyn Burleson, CPA
Phone : (828) 251-2846
Address:
902 Sand Hill Road
Asheville, NC 28806
--------------------------------------------------------------------------------
Business Name : Carol L. King & Associates, P.A.
Business URL : ?listingid=10439&profileid=2Z8C7I0B4X&action=uweb&url=http%3a%2f%2fwww.clkcpa.com
Owner : Carol King
Phone : (828) 258-2323
Address:
40 North French Broad Avenue
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Goldsmith Molis & Gray
Business URL : ?listingid=12638&profileid=6C8D2C7F55&action=uweb&url=http%3a%2f%2fwww.gmg-cpa.com
Owner : Allen Gray
Phone : (828) 281-3161
Address:
32 Orange St.
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Corliss & Solomon, PLLC
Business URL : ?listingid=12407&profileid=6T7Y798S1R&action=uweb&url=http%3a%2f%2fwww.candspllc.com
Owner : Slater Solomon
Phone : (828) 236-0206
Address:
242 Charlotte St., Suite 1
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Mountain BizWorks
Business URL : ?listingid=12733&profileid=2L9E9G6A1S&action=uweb&url=http%3a%2f%2fwww.mountainbizworks.org
Owner : Matthew Raker
Phone : (828) 253-2834
Address:
153 South Lexington Ave.
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : LeBlanc CPA Limited
Business URL : -
Owner : Leslie LeBlanc
Phone : (828) 225-4940
Address:
218 Broadway
Asheville, NC 28801-2347
--------------------------------------------------------------------------------
Business Name : Bolick & Associates, PA, CPA's
Business URL : -
Owner : Alan E Bolick, CPA
Phone : (828) 253-4692
Address:
Central Office Park Suite 104
56 Central Avenue
Asheville, NC 28801
--------------------------------------------------------------------------------
EDIT: To parse URLs:
import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote

url = 'https://web.ashevillechamber.org/cwt/external/wcpages/wcdirectory/Directory.aspx?CategoryID=1242&Title=Accounting++and++Bookkeeping&AdKeyword=Accounting++and++Bookkeeping'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for b in soup.select('td[bgcolor="#E6E6E6"] b'):
    business_name = b.text
    business_url = b.a['href'] if b.a else '-'
    owner = b.find_next('td', width="90%").contents[0]

    addr, current = [], owner.find_next(text=True)
    while not current.find_parent('b'):
        addr.append(current.strip())
        current = current.find_next(text=True)
    addr = '\n'.join(addr)
    phone = current.find_next(text=True).strip()

    print('Business Name :', business_name)
    print('Business URL :', unquote(business_url).rsplit('=', maxsplit=1)[-1])
    print('Owner :', owner)
    print('Phone :', phone)
    print('Address:')
    print(addr)
    print('-' * 80)
Prints:
Business Name : Johnson Price Sprinkle PA
Business URL : http://www.jpspa.com
Owner : Maria Pilos
Phone : (828) 254-2374
Address:
79 Woodfin Place, Suite 300
Asheville, NC 28801
--------------------------------------------------------------------------------
Business Name : Leah B. Noel, CPA, PC
Business URL : http://www.lbnoelcpa.com/
Owner : Ms. Leah Noel
Phone : 828-333-4529
Address:
14 S. Pack Square #503
Asheville, NC 28801
--------------------------------------------------------------------------------
...and so on.
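If the redirect links ever change shape, urllib.parse.parse_qs is a sturdier way to pull out the url query parameter than splitting on '='. A small sketch using one of the href values from the output above:

```python
from urllib.parse import parse_qs, urlparse

# One of the href values scraped above
href = '?listingid=9758&profileid=217Y3Q544Y&action=uweb&url=http%3a%2f%2fwww.jpspa.com'

# parse_qs splits the query string into a dict of lists and decodes
# the percent-escapes (%3a -> ':', %2f -> '/') in one step
params = parse_qs(urlparse(href).query)
print(params['url'][0])  # http://www.jpspa.com
```

This keeps working even if the url parameter moves to a different position in the query string.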

Related

how to extract the text from the div tag using BeautifulSoup and python

I am trying to extract the text inside a div tag using the BeautifulSoup package in Python.
For example, I want to extract the text inside the <p></p> tag,
and the text inside <dt> and <dd>.
When I run the code, the system crashes and displays the error below:
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input> in <module>
     60 # article_body = s.find('div', {'class' :'card-content t-small bt p20'}).text
     61 text_info = s.find_all("div",{"class":"card-content is-spaced"})
---> 62 text_desc = text_info.find('div', attrs={'class':'card-content t-small bt p20'}).getText(strip=True)
     63
     64 print(f"{date_published} {title}\n\n{text_desc}\n", "-" * 80)

f:\aienv\lib\site-packages\bs4\element.py in __getattr__(self, key)
   2172         """Raise a helpful exception to explain a common code fix."""
   2173         raise AttributeError(
-> 2174             "ResultSet object has no attribute '%s'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?" % key
   2175         )

AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
HTML:
<div class="card-content t-small bt p20" style="max-height:50vh" data-viewsize='{"d":{"height": {"max": 1}}, "offset":"JobSearch.jobViewSize"}'>
<h2 class="h6">Job Description</h2>
<p>The Executive Chef has full knowledge and capability of managing the general operations of the kitchen, specialty outlets kitchen including Stewarding.</p>
<h2 class="h6 p10t">Skills</h2>
<p>• Provide, develop, train and maintain a professional workforce• Excellent in English both in oral and written.• Computer knowledge is required and good in correspondences and reports writing.</p>
<h2 class="h6 p10t">Job Details</h2>
<dl class="dlist is-spaced is-fitted t-small m0">
<div>
<dt>Job Location</dt>
<dd> Al Olaya, Riyadh , Saudi Arabia </dd>
</div>
<div>
<dt>Company Industry</dt>
<dd>Food & Beverage Production; Entertainment; Catering, Food Service, & Restaurant</dd>
</div>
<div>
<dt>Company Type</dt>
<dd>Employer (Private Sector)</dd>
</div>
<div>
<dt>Job Role</dt>
<dd>Hospitality and Tourism</dd>
</div>
<div>
<dt>Employment Type</dt>
<dd>Unspecified</dd>
</div>
<div>
<dt>Monthly Salary Range</dt>
<dd>$4,000 - $5,000</dd>
</div>
<div>
<dt>Number of Vacancies</dt>
<dd>1</dd>
</div>
</dl>
<h2 class="h6 p10t">Preferred Candidate</h2>
<dl class="dlist is-spaced is-fitted t-small m0">
<div>
<dt>Career Level</dt>
<dd>Management</dd>
</div>
<div>
<dt>Years of Experience</dt>
<dd>Min: 10 Max: 20</dd>
</div>
<div>
<dt>Residence Location</dt>
<dd> Riyadh, Saudi Arabia ; Algeria; Bahrain; Comoros; Djibouti; Egypt; Iraq; Jordan; Kuwait; Lebanon; Libya; Mauritania; Morocco; Oman; Palestine; Qatar; Saudi Arabia; Somalia; Sudan; Syria; Tunisia; United Arab Emirates; Yemen</dd>
</div>
<div>
<dt>Gender</dt>
<dd>Male</dd>
</div>
<div>
<dt>Age</dt>
<dd>Min: 26 Max: 55</dd>
</div>
</dl>
</div>
================================================
code:
import time
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content,
    "lxml"
)

links = []
for a in soup.select("h2.m0.t-regular a"):
    if a['href'] not in links:
        links.append("https://www.bayt.com" + a['href'])

for link in links:
    s = BeautifulSoup(requests.get(link).content, "lxml")
    text_info = s.find_all("div", {"class": "card-content is-spaced"})
    text_desc = text_info.find('div', attrs={'class': 'card-content t-small bt p20'}).getText(strip=True)
    print(f"{date_published} {title}\n\n{text_desc}\n", "-" * 80)
You are calling find_all() and then using the result directly; you need to loop over it (for text in text_info:) and extract the information inside the loop. If you only want the first div, use find() instead of find_all().
Hope that helps!
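To make that concrete, here is a minimal sketch of looping over the ResultSet. The HTML fragment is made up for illustration; only the class names come from the question:

```python
from bs4 import BeautifulSoup

html = '''
<div class="card-content is-spaced">
  <div class="card-content t-small bt p20"><p>First job description</p></div>
</div>
<div class="card-content is-spaced">
  <div class="card-content t-small bt p20"><p>Second job description</p></div>
</div>
'''
s = BeautifulSoup(html, 'html.parser')

# find_all() returns a list-like ResultSet; iterate it and call find()
# on each individual Tag instead of on the ResultSet itself
text_info = s.find_all('div', {'class': 'card-content is-spaced'})
for block in text_info:
    desc = block.find('div', attrs={'class': 'card-content t-small bt p20'})
    print(desc.get_text(strip=True))
```

Each block is a single Tag, so find() works on it without raising the AttributeError above.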
To get the job description and the other details, use the following CSS selectors.
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    requests.get("https://www.bayt.com/en/international/jobs/executive-chef-jobs/").content, "lxml")

links = []
for a in soup.select("h2.m0.t-regular a"):
    if a['href'] not in links:
        links.append("https://www.bayt.com" + a['href'])

for link in links:
    print(link)
    s = BeautifulSoup(requests.get(link).content, "lxml")
    jobdesc = s.select_one("div[class='card-content is-spaced'] p")
    print(jobdesc.text)
    alldt = [dt.text for dt in s.select("div[class='card-content is-spaced'] dt")]
    print(alldt)
    alldd = [dd.text for dd in s.select("div[class='card-content is-spaced'] dd")]
    print(alldd)
    print("-" * 80)
Console Output:
https://www.bayt.com/en/qatar/jobs/executive-chef-4276199/
The ideal candidate is a seasoned chef with a background in fine dining. You will run an efficient kitchen by consistently looking to improve the menu, producing quality food, and working closely with rthe other staffs in the overall food and beverage operations of the palace.
['Job Location', 'Company Industry', 'Company Type', 'Job Role', 'Employment Type', 'Monthly Salary Range', 'Number of Vacancies', 'Career Level', 'Years of Experience', 'Residence Location', 'Gender', 'Nationality', 'Degree', 'Age']
[' Doha, Qatar ', 'Food & Beverage Production', 'Employer (Private Sector)', 'Management', 'Contractor', 'Unspecified', '2', 'Senior Executive', 'Min: 5', 'India; Lebanon', 'Male', 'Bahrain; Kuwait; Oman; Qatar; Saudi Arabia; United Arab Emirates', 'Certification / diploma', 'Min: 36']
--------------------------------------------------------------------------------
https://www.bayt.com/en/saudi-arabia/jobs/executive-chef-for-5-star-hotel-4274940/
The Executive Chef has full knowledge and capability of managing the general operations of the kitchen, specialty outlets kitchen including Stewarding. Responsibility includes food preparations that are used for banqueting, conferences, outside events, and catering. Basically ensures the culinary dishes are of high-quality prepared and served to enhance the guest experience. Monitors local competitors and compare their operations with the Food & Beverage Preparation enable to modify and develop a popular menu as needed so they remain effective for the purpose of the restaurants and other establishments. Also performs many administrative tasks including kitchen item requisition, ordering supplies, and maintain the highest professional food quality, hygiene, and sanitation standards.
['Job Location', 'Company Industry', 'Company Type', 'Job Role', 'Employment Type', 'Monthly Salary Range', 'Number of Vacancies', 'Career Level', 'Years of Experience', 'Residence Location', 'Gender', 'Age']
[' Al Olaya, Riyadh , Saudi Arabia ', 'Food & Beverage Production; Entertainment; Catering, Food Service, & Restaurant', 'Employer (Private Sector)', 'Hospitality and Tourism', 'Unspecified', '$4,000 - $5,000', '1', 'Management', 'Min: 10 Max: 20', ' Riyadh,Saudi Arabia ; Algeria; Bahrain; Comoros; Djibouti; Egypt; Iraq; Jordan; Kuwait; Lebanon; Libya; Mauritania; Morocco; Oman; Palestine; Qatar; Saudi Arabia; Somalia; Sudan; Syria; Tunisia; United Arab Emirates; Yemen', 'Male', 'Min: 26 Max: 55']
--------------------------------------------------------------------------------
https://www.bayt.com/en/saudi-arabia/jobs/executive-chef-4273678/
['Job Location', 'Company Industry', 'Company Type', 'Job Role', 'Employment Type', 'Monthly Salary Range', 'Number of Vacancies', 'Career Level', 'Residence Location']
[' Riyadh, Saudi Arabia ', 'Hospitality & Accomodation', 'Employer (Private Sector)', 'Hospitality and Tourism', 'Unspecified', 'Unspecified', 'Unspecified', 'Management', 'Saudi Arabia']
--------------------------------------------------------------------------------
https://www.bayt.com/en/other/jobs/executive-chef-4-58272955/
Unit Description: Artisan Restaurant Collection has a great Executive Chef 4 (resource lasting up-to6 months)opportunity in the Los Angeles area of California for a new piece of business. The Artisan Restaurant Collection was imagined and created in California by a market need for local sustainable, chef driven, farm to fork food created with love. The Executive Chef 4 will have total culinary responsibilities including the supervision ofhourly staff with a focus on amazing fresh food for this location. The Ideal candidate must have
['Job Location', 'Company Industry', 'Company Type', 'Job Role', 'Employment Type', 'Monthly Salary Range', 'Number of Vacancies']
['Other', 'Other Business Support Services', 'Unspecified', 'Hospitality and Tourism', 'Full Time Employee', 'Unspecified', 'Unspecified']
--------------------------------------------------------------------------------
https://www.bayt.com/en/other/jobs/executive-chef-3-58273086/
Unit Description: Artisan Restaurant Collection has a great Executive Chef 3 opportunity in San Jose, California for a new business venture. The Artisan Restaurant Collection was imagined and created in California by a market need for local sustainable, chef driven, farm to fork food created with love. The Executive Chef 3 will have total culinary responsibilities including the supervision ofhourly staff with a focus on amazing Asian food for this location. The Ideal candidate must have
['Job Location', 'Company Industry', 'Company Type', 'Job Role', 'Employment Type', 'Monthly Salary Range', 'Number of Vacancies']
['Other', 'Other Business Support Services', 'Unspecified', 'Hospitality and Tourism', 'Full Time Employee', 'Unspecified', 'Unspecified']
--------------------------------------------------------------------------------
...and so on.

Python: extract data from a text file using regex

I want to extract the firm name (Samsung India Electronics Pvt. Ltd.) from my text file; it appears on the line after "Firm Name". I have extracted some data with the code below, but I am not able to extract the firm name because I am new to Python and regex.
import re

hand = open(r'C:\Users\sachin.s\Downloads\wordFile_Billing_PrintDocument_7528cc93-3644-4e38-a7b3-10f721fa2049.txt')
copy = False
for line in hand:
    line = line.rstrip()
    if re.search(r'Order Number\S*: [0-9.]+', line):
        print(line)
    if re.search(r'Invoice No\S*: [0-9.]+', line):
        print(line)
    if re.search(r'Invoice Date\S*: [0-9.]+', line):
        print(line)
    if re.search(r'PO No\S*: [0-9.]+', line):
        print(line)
Firm Name: Address:
Samsung India Electronics Pvt. Ltd.
Regd Office: 6th Floor, DLF Centre, Sansad Marg, New Delhi-110001
SAMSUNG INDIA ELECTRONICS PVT LTD, MEDCHAL MANDAL HYDERABAD
RANGA REDDY DISTRICT HYDERABAD TELANGANA 501401
Phone: 1234567
Fax No:
Branch: S5S2 - [SIEL]HYDERABAD
Order Number: 1403543436
Currency: INR
Invoice No: 36S2I0030874
Invoice Date: 15.12.2018
PI No: 5929947652
Use regex:
import re
data = """
Firm Name: Address:
Samsung India Electronics Pvt. Ltd.
Regd Office: 6th Floor, DLF Centre, Sansad Marg, New Delhi-110001
SAMSUNG INDIA ELECTRONICS PVT LTD, MEDCHAL MANDAL HYDERABAD
RANGA REDDY DISTRICT HYDERABAD TELANGANA 501401 Phone: 1234567 Fax No: Branch: S5S2 - [SIEL]HYDERABAD
Order Number: 1403543436
Currency: INR
Invoice No: 36S2I0030874
Invoice Date: 15.12.2018
PI No: 5929947652
"""
result = re.findall('Address:(.*)Regd', data, re.MULTILINE|re.DOTALL)[0]
Output:
Samsung India Electronics Pvt. Ltd.
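An alternative that doesn't depend on the "Regd" line as an end anchor: since the firm name sits on the line right after the "Firm Name:" header, you can capture just that next line. A sketch against the same sample text:

```python
import re

data = """
Firm Name: Address:
Samsung India Electronics Pvt. Ltd.
Regd Office: 6th Floor, DLF Centre, Sansad Marg, New Delhi-110001
Order Number: 1403543436
"""

# Capture the first line that follows the "Firm Name:" header
match = re.search(r'Firm Name:.*\n\s*(.+)', data)
firm_name = match.group(1).strip()
print(firm_name)  # Samsung India Electronics Pvt. Ltd.
```

This keeps working even if the registered-office line is worded differently from file to file.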

Parsing through HTML in a dictionary

I'm trying to pull table data from the following website: https://msih.bgu.ac.il/md-program/residency-placements/
While there are no table tags, I found that the common tag wrapping each individual segment of the table is div class="accord-con".
I made a dictionary where the keys are the graduation years (i.e. 2019, 2018, etc.) and the values are the HTML from each div class="accord-con".
I'm stuck on how to parse the HTML held in the dictionary. My goal is to have separate lists of the specialty, hospital, and location for each year.
Below is my working code:
import numpy as np
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd

sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

headers = soup.find_all('div', class_={'accord-head'})
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])

rez_classes = soup.find_all('div', class_={'accord-con'})
data_dict = dict(zip(grad_yr_list, rez_classes))
Here is a sample of what my dictionary looks like:
{'2019': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Aventura Hospital, Aventura, Fl</li></ul><h4>Family Medicine</h4><ul><li>Louisiana State University School of Medicine, New Orleans, LA</li><li>UT St Thomas Hospitals, Murfreesboro, TN</li><li>Sea Mar Community Health Center, Seattle, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>St Joseph Hospital, Denver, CO </li></ul><h4>Obstetrics-Gynecology</h4><ul><li>Jersey City Medical Center, Jersey City, NJ</li><li>New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY</li></ul><h4>Pediatrics</h4><ul><li>St Louis Children’s Hospital, St Louis, MO</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>St Christopher’s Hospital, Philadelphia, PA</li></ul><h4>Surgery</h4><ul><li>Mountain Area Health Education Center, Asheville, NC</li></ul><p></p></div>,
'2018': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>NYU School of Medicine, New York, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Kent Hospital, Warwick, Rhode Island</li><li>University of Connecticut School of Medicine, Farmington, CT</li><li>University of Texas Health Science Center at San Antonio, San Antonio, TX</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Family Medicine</h4><ul><li>University of Kansas Medical Center, Wichita, KS</li><li>Ellis Hospital, Schenectady, NY</li><li>Harrison Medical Center, Seattle, WA</li><li>St Francis Hospital, Wilmington, DE </li><li>University of Virginia, Charlottesville, VA</li><li>Valley Medical Center, Renton, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>Virginia Commonwealth University Health Systems, Richmond, VA</li><li>University of Chicago Medical Center, Chicago, IL</li></ul><h4>Obstetrics-Gynecology</h4><ul><li>St Francis Hospital, Hartford, CT</li></ul><h4>Pediatrics</h4><ul><li>Case Western University Hospitals Cleveland Medical Center, Cleveland, OH</li><li>Jersey Shore University Medical Center, Neptune City, NJ</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>University of Virginia, Charlottesville, VA</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Preliminary Medicine Neurology</h4><ul><li>Howard University Hospital, Washington, DC</li></ul><h4>Preliminary Medicine Radiology</h4><ul><li>Maimonides Medical Center, Bronx, NY</li></ul><h4>Preliminary Medicine Surgery</h4><ul><li>Providence Park Hospital, Southfield, MI</li></ul><h4>Psychiatry</h4><ul><li>University of Maryland Medical Center, Baltimore, MI</li></ul><p></p></div>,
My ultimate goal is to pull this data into a pandas dataframe with the following columns: grad year, specialty, hospital, location
Your code is quite close to the end result. Once you have paired the years with the student placement data, simply apply an extraction function to the latter:
from bs4 import BeautifulSoup as soup
import re
from selenium import webdriver

_d = webdriver.Chrome('/path/to/chromedriver')
_d.get('https://msih.bgu.ac.il/md-program/residency-placements/')
d = soup(_d.page_source, 'html.parser')

def placement(block):
    r = block.find_all(re.compile('ul|h4'))
    return {r[i].text: [b.text for b in r[i+1].find_all('li')] for i in range(0, len(r)-1, 2)}

result = {i.h2.text: placement(i) for i in d.find_all('div', {'class': 'accord-head'})}
print(result['Class of 2019'])
Output:
{'Anesthesiology': ['University at Buffalo School of Medicine, Buffalo, NY'], 'Emergency Medicine': ['Aventura Hospital, Aventura, Fl'], 'Family Medicine': ['Louisiana State University School of Medicine, New Orleans, LA', 'UT St Thomas Hospitals, Murfreesboro, TN', 'Sea Mar Community Health Center, Seattle, WA'], 'Internal Medicine': ['Oregon Health and Science University, Portland, OR', 'St Joseph Hospital, Denver, CO\xa0'], 'Obstetrics-Gynecology': ['Jersey City Medical Center, Jersey City, NJ', 'New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY'], 'Pediatrics': ['St Louis Children’s Hospital, St Louis, MO', 'University of Maryland Medical Center, Baltimore, MD', 'St Christopher’s Hospital, Philadelphia, PA'], 'Surgery': ['Mountain Area Health Education Center, Asheville, NC']}
Note: I ended up using Selenium because, for me, the HTML returned by requests.get did not include the rendered student placement data.
You have a dictionary of BS elements ('bs4.element.Tag'), so you don't have to parse them again.
You can call find(), find_all(), etc. on them directly:
for key, value in data_dict.items():
    print(type(value), key, value.find('h4').text)
Result
<class 'bs4.element.Tag'> 2019 Anesthesiology
<class 'bs4.element.Tag'> 2018 Anesthesiology
<class 'bs4.element.Tag'> 2017 Anesthesiology
<class 'bs4.element.Tag'> 2016 Emergency Medicine
<class 'bs4.element.Tag'> 2015 Emergency Medicine
<class 'bs4.element.Tag'> 2014 Anesthesiology
<class 'bs4.element.Tag'> 2013 Anesthesiology
<class 'bs4.element.Tag'> 2012 Emergency Medicine
<class 'bs4.element.Tag'> 2011 Emergency Medicine
<class 'bs4.element.Tag'> 2010 Dermatology
<class 'bs4.element.Tag'> 2009 Emergency Medicine
<class 'bs4.element.Tag'> 2008 Family Medicine
<class 'bs4.element.Tag'> 2007 Anesthesiology
<class 'bs4.element.Tag'> 2006 Triple Board (Pediatrics/Adult Psychiatry/Child Psychiatry)
<class 'bs4.element.Tag'> 2005 Family Medicine
<class 'bs4.element.Tag'> 2004 Anesthesiology
<class 'bs4.element.Tag'> 2003 Emergency Medicine
<class 'bs4.element.Tag'> 2002 Family Medicine
Full code:
import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

headers = soup.find_all('div', class_={'accord-head'})
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])

rez_classes = soup.find_all('div', class_={'accord-con'})
data_dict = dict(zip(grad_yr_list, rez_classes))

for key, value in data_dict.items():
    print(type(value), key, value.find('h4').text)
You can go to pandas once you get the soup, then parse the necessary information
df = pd.DataFrame(soup)
df['grad_year'] = df[0].map(lambda x: x.text[-4:])
df['specialty'] = df[1].map(lambda x: [i.text for i in x.find_all('h4')])
df['hospital'] = df[1].map(lambda x: [i.text for i in x.find_all('li')])
df['location'] = df[1].map(lambda x: [''.join(i.text.split(',')[1:]) for i in x.find_all('li')])
You will have to do some pandas magic after that
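As one possible version of that pandas magic, assuming a result dict shaped like the {year: {specialty: [placements]}} structure from the Selenium answer above (the two sample entries here are taken from its output), you can flatten it into one row per placement:

```python
import pandas as pd

# Assumed shape: {grad_year: {specialty: [placement strings]}}
result = {
    '2019': {
        'Anesthesiology': ['University at Buffalo School of Medicine, Buffalo, NY'],
        'Surgery': ['Mountain Area Health Education Center, Asheville, NC'],
    }
}

rows = []
for year, specialties in result.items():
    for specialty, placements in specialties.items():
        for placement in placements:
            # "Hospital, City, ST" -> split the hospital off the location
            hospital, _, location = placement.partition(', ')
            rows.append({'grad_year': year, 'specialty': specialty,
                         'hospital': hospital, 'location': location})

df = pd.DataFrame(rows, columns=['grad_year', 'specialty', 'hospital', 'location'])
print(df)
```

One row per placement keeps the frame tidy for later groupby or filtering; the comma-based split is a guess that assumes every entry follows the "Hospital, City, ST" pattern.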
I don't know pandas. The following code can get the data in the table. I don't know if this will meet your needs.
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

url = 'https://msih.bgu.ac.il/md-program/residency-placements/'
response = requests.get(url)
doc = SimplifiedDoc(response.text)

divs = doc.getElementsByClass('accord-head')
datas = {}
for div in divs:
    grad_year = div.h2.text[-4:]
    rez_classe = div.getElementByClass('accord-con')
    h4s = rez_classe.h4s  # get the h4 elements
    for h4 in h4s:
        if not h4.next:
            continue
        lis = h4.next.lis
        specialty = h4.text
        hospital = [li.text for li in lis]
        datas[grad_year] = {'specialty': specialty, 'hospital': hospital}

for data in datas:
    print(data, datas[data])

Python: html table content

I am trying to scrape this website but I keep getting an error when I try to print out just the content of the table.
soup = BeautifulSoup(urllib2.urlopen('http://clinicaltrials.gov/show/NCT01718158').read())
print soup('table')[6].prettify()

for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    print tds[0].string, tds[1].string
IndexError                                Traceback (most recent call last)
<ipython-input-70-da84e74ab3b1> in <module>()
      1 for row in soup('table')[6].findAll('tr'):
      2     tds = row('td')
----> 3     print tds[0].string, tds[1].string
      4
IndexError: list index out of range
The table has a header row, with <th> header elements rather than <td> cells. Your code assumes there will always be <td> elements in each row, and that fails for the first row.
You could skip the row with not enough <td> elements:
for row in soup('table')[6].findAll('tr'):
    tds = row('td')
    if len(tds) < 2:
        continue
    print tds[0].string, tds[1].string
at which point you get output:
>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, tds[1].string
...
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: None
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: None
The last row contains text interspersed with <br/> elements; you could use the element.strings generator to extract all the strings and join them with newlines; I'd strip each string first, though:
>>> for row in soup('table')[6].findAll('tr'):
...     tds = row('td')
...     if len(tds) < 2:
...         continue
...     print tds[0].string, '\n'.join(filter(unicode.strip, tds[1].strings))
...
Responsible Party: Bristol-Myers Squibb
ClinicalTrials.gov Identifier: NCT01718158
History of Changes
Other Study ID Numbers: AI452-021, 2011‐005409‐65
Study First Received: October 29, 2012
Last Updated: November 7, 2014
Health Authority: United States: Institutional Review Board
United States: Food and Drug Administration
Argentina: Administracion Nacional de Medicamentos, Alimentos y Tecnologia Medica
France: Afssaps - Agence française de sécurité sanitaire des produits de santé (Saint-Denis)
Germany: Federal Institute for Drugs and Medical Devices
Germany: Ministry of Health
Israel: Israeli Health Ministry Pharmaceutical Administration
Israel: Ministry of Health
Italy: Ministry of Health
Italy: National Bioethics Committee
Italy: National Institute of Health
Italy: National Monitoring Centre for Clinical Trials - Ministry of Health
Italy: The Italian Medicines Agency
Japan: Pharmaceuticals and Medical Devices Agency
Japan: Ministry of Health, Labor and Welfare
Korea: Food and Drug Administration
Poland: National Institute of Medicines
Poland: Ministry of Health
Poland: Ministry of Science and Higher Education
Poland: Office for Registration of Medicinal Products, Medical Devices and Biocidal Products
Russia: FSI Scientific Center of Expertise of Medical Application
Russia: Ethics Committee
Russia: Ministry of Health of the Russian Federation
Spain: Spanish Agency of Medicines
Taiwan: Department of Health
Taiwan: National Bureau of Controlled Drugs
United Kingdom: Medicines and Healthcare Products Regulatory Agency

regex match and replace multiple patterns

I have a situation where a user submits an address and I have to map the user's input to my keys. The join works if I use the address without its suffix, e.g. turning:
COVERED WAGON TRAIL
CHISHOLM TRAIL
LAKE TRAIL
CHESTNUT ST
LINCOLN STREET
to:
COVERED WAGON
CHISHOLM
LAKE
CHESTNUT
LINCOLN
However, I can't work out how to write this code so that it replaces only the last word.
I get:
LINCOLN
CHESTNUT
CHISHOLM
LAKEAIL
CHISHOLMAIL
COVERED WAGONL
I've tried regex verbose, re.sub and $.
import re

target = '''
LINCOLN STREET
CHESTNUT ST
CHISHOLM TR
LAKE TRAIL
CHISHOLM TRAIL
COVERED WAGON TRL
'''

rdict = {
    ' ST': '',
    ' STREET': '',
    ' TR': '',
    ' TRL': '',
}

robj = re.compile('|'.join(rdict.keys()))
re.sub(' TRL', '', target.rsplit(' ', 1)[0]), target
result = robj.sub(lambda m: rdict[m.group(0)], target)
print result
Use re.sub with $.
target = '''
LINCOLN STREET
CHESTNUT ST
CHISHOLM TR
LAKE TRAIL
CHISHOLM TRAIL
COVERED WAGON TRL
'''
import re
print re.sub(r'\s+(STREET|ST|TRAIL|TRL|TR)\s*$', '', target, flags=re.M)
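The $ anchor is what matters here: the garbled output in the question (LAKEAIL, CHISHOLMAIL) comes from an unanchored alternation, where ' TR' matches inside ' TRAIL' and only that slice gets removed. A small demonstration:

```python
import re

# Unanchored alternation: ' TR' matches inside ' TRAIL', leaving debris behind
print(re.sub(' ST| STREET| TR| TRL', '', 'LAKE TRAIL'))   # LAKEAIL

# Anchored to end of line: the whole suffix must match, or nothing is removed
print(re.sub(r'\s+(STREET|ST|TRAIL|TRL|TR)\s*$', '', 'LAKE TRAIL'))  # LAKE
```

With the anchor, the regex engine backtracks past the short alternatives (ST, TR) whenever they leave unmatched characters before the end of the line.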
If you store your string in this format:
target = '''
LINCOLN STREET
CHESTNUT ST
CHISHOLM TR
LAKE TRAIL
CHISHOLM TRAIL
COVERED WAGON TRL
'''
There is no need to use regex:
>>> print '\n'.join([x.rsplit(None, 1)[0] for x in target.strip().split('\n')])
LINCOLN
CHESTNUT
CHISHOLM
LAKE
CHISHOLM
COVERED WAGON
