I'm creating a web scraper to pull company names from a chamber of commerce directory website.
I'm using BeautifulSoup. The page and soup objects appear to be working, but when I scrape the HTML content, an empty list is returned when it should be filled with the directory names on the page.
Web page trying to scrape: https://www.austinchamber.com/directory
Here is the HTML:
<div>
    <ul class="item-list item-list--small">
        <li>
            <div class='item-content'>
                <div class='item-description'>
                    <h5 class='h5'>Women Helping Women LLC</h5>
Here is the Python code:
import requests
from bs4 import BeautifulSoup

def pageRequest(url):
    page = requests.get(url)
    return page

def htmlSoup(page):
    soup = BeautifulSoup(page.content, "html.parser")
    return soup

def getNames(soup):
    name = soup.find_all('h5', class_='h5')
    return name

page = pageRequest("https://www.austinchamber.com/directory")
soup = htmlSoup(page)
name = getNames(soup)
for n in name:
    print(n)
The data is loaded dynamically via Ajax. To get the data, you can use this script:
import json
import requests

url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'

for page in range(1, 10):
    print('Page {}..'.format(page))
    data = requests.get(url.format(page=page)).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    for d in data['data']:
        print(d['title'])
Prints:
...
Indeed
Austin Telco Federal Credit Union - Taos
Green Bank
Seton Medical Center Austin
Austin Telco Federal Credit Union - Jollyville
Page 42..
Texas State SBDC - San Marcos Office
PlainsCapital Bank - Motor Bank
University of Texas - Thompson Conference Center
Lamb's Tire & Automotive Centers - #2 Research & Braker
AT&T Labs
Prosperity Bank - Rollingwood
Kerbey Lane Cafe - Central
Lamb's Tire & Automotive Centers - #9 Bee Caves
Seton Medical Center Hays
PlainsCapital Bank - North Austin
Ellis & Salazar Body Shop
Lamb's Tire & Automotive Centers - #6 Lake Creek
Rudy's Country Store and BarBQ
...
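If you don't know how many pages there are in advance, you can keep requesting until the API stops returning items. This is a sketch that assumes the endpoint keeps the same response shape and returns an empty data list past the last page:
import requests

url = 'https://www.austinchamber.com/api/v1/directory?filter[categories]=&filter[show]=all&page={page}&limit=24'

page = 1
while True:
    data = requests.get(url.format(page=page)).json()
    if not data.get('data'):  # assumption: empty list past the last page
        break
    for d in data['data']:
        print(d['title'])
    page += 1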
I want to extract the firm name (Samsung India Electronics Pvt. Ltd.) from my text file; it appears on the line after "Firm Name". I have extracted some data with my code, but I am not able to extract the firm name because I am new to Python and Python regex.
import re

hand = open(r'C:\Users\sachin.s\Downloads\wordFile_Billing_PrintDocument_7528cc93-3644-4e38-a7b3-10f721fa2049.txt')
copy = False
for line in hand:
    line = line.rstrip()
    if re.search(r'Order Number\S*: [0-9.]+', line):
        print(line)
    if re.search(r'Invoice No\S*: [0-9.]+', line):
        print(line)
    if re.search(r'Invoice Date\S*: [0-9.]+', line):
        print(line)
    if re.search(r'PO No\S*: [0-9.]+', line):
        print(line)
Firm Name: Address:
Samsung India Electronics Pvt. Ltd.
Regd Office: 6th Floor, DLF Centre, Sansad Marg, New Delhi-110001
SAMSUNG INDIA ELECTRONICS PVT LTD, MEDCHAL MANDAL HYDERABAD
RANGA REDDY DISTRICT HYDERABAD TELANGANA 501401
Phone: 1234567
Fax No:
Branch: S5S2 - [SIEL]HYDERABAD
Order Number: 1403543436
Currency: INR
Invoice No: 36S2I0030874
Invoice Date: 15.12.2018
PI No: 5929947652
Use regex:
import re
data = """
Firm Name: Address:
Samsung India Electronics Pvt. Ltd.
Regd Office: 6th Floor, DLF Centre, Sansad Marg, New Delhi-110001
SAMSUNG INDIA ELECTRONICS PVT LTD, MEDCHAL MANDAL HYDERABAD
RANGA REDDY DISTRICT HYDERABAD TELANGANA 501401 Phone: 1234567 Fax No: Branch: S5S2 - [SIEL]HYDERABAD
Order Number: 1403543436
Currency: INR
Invoice No: 36S2I0030874
Invoice Date: 15.12.2018
PI No: 5929947652
"""
result = re.findall(r'Address:(.*)Regd', data, re.DOTALL)[0].strip()
print(result)
Output:
Samsung India Electronics Pvt. Ltd.
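Alternatively, you can stay with the original line-by-line loop and just remember when you have seen the header. This is a sketch that assumes the firm name always sits on the line directly after "Firm Name: Address:":
import re

hand = open(r'C:\Users\sachin.s\Downloads\wordFile_Billing_PrintDocument_7528cc93-3644-4e38-a7b3-10f721fa2049.txt')
grab_next = False
for line in hand:
    line = line.rstrip()
    if grab_next:
        print(line)  # the firm name
        grab_next = False
    if re.search(r'Firm Name\S*:', line):
        grab_next = True  # assumption: the name is on the very next line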
I'm trying to pull table data from the following website: https://msih.bgu.ac.il/md-program/residency-placements/
While there are no table tags, I found that the common tag for pulling individual segments of the table is div class="accord-con".
I made a dictionary where the keys are the graduation years (i.e. 2019, 2018, etc.) and the values are the HTML from each div class="accord-con".
I'm stuck and don't know how to parse the HTML within the dictionary. My goal is to have separate lists of the specialty, hospital, and location for each year. I don't know how to move forward.
Below is my working code:
import bs4 as bs
import urllib.request
import pandas as pd

sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

headers = soup.find_all('div', class_='accord-head')
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])

rez_classes = soup.find_all('div', class_='accord-con')
data_dict = dict(zip(grad_yr_list, rez_classes))
Here is a sample of what my dictionary looks like:
{'2019': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Aventura Hospital, Aventura, Fl</li></ul><h4>Family Medicine</h4><ul><li>Louisiana State University School of Medicine, New Orleans, LA</li><li>UT St Thomas Hospitals, Murfreesboro, TN</li><li>Sea Mar Community Health Center, Seattle, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>St Joseph Hospital, Denver, CO </li></ul><h4>Obstetrics-Gynecology</h4><ul><li>Jersey City Medical Center, Jersey City, NJ</li><li>New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY</li></ul><h4>Pediatrics</h4><ul><li>St Louis Children’s Hospital, St Louis, MO</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>St Christopher’s Hospital, Philadelphia, PA</li></ul><h4>Surgery</h4><ul><li>Mountain Area Health Education Center, Asheville, NC</li></ul><p></p></div>,
'2018': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>NYU School of Medicine, New York, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Kent Hospital, Warwick, Rhode Island</li><li>University of Connecticut School of Medicine, Farmington, CT</li><li>University of Texas Health Science Center at San Antonio, San Antonio, TX</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Family Medicine</h4><ul><li>University of Kansas Medical Center, Wichita, KS</li><li>Ellis Hospital, Schenectady, NY</li><li>Harrison Medical Center, Seattle, WA</li><li>St Francis Hospital, Wilmington, DE </li><li>University of Virginia, Charlottesville, VA</li><li>Valley Medical Center, Renton, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>Virginia Commonwealth University Health Systems, Richmond, VA</li><li>University of Chicago Medical Center, Chicago, IL</li></ul><h4>Obstetrics-Gynecology</h4><ul><li>St Francis Hospital, Hartford, CT</li></ul><h4>Pediatrics</h4><ul><li>Case Western University Hospitals Cleveland Medical Center, Cleveland, OH</li><li>Jersey Shore University Medical Center, Neptune City, NJ</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>University of Virginia, Charlottesville, VA</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Preliminary Medicine Neurology</h4><ul><li>Howard University Hospital, Washington, DC</li></ul><h4>Preliminary Medicine Radiology</h4><ul><li>Maimonides Medical Center, Bronx, NY</li></ul><h4>Preliminary Medicine Surgery</h4><ul><li>Providence Park Hospital, Southfield, MI</li></ul><h4>Psychiatry</h4><ul><li>University of Maryland Medical Center, Baltimore, MI</li></ul><p></p></div>,
My ultimate goal is to pull this data into a pandas dataframe with the following columns: grad year, specialty, hospital, location
Your code is quite close to the end result. Once you have paired the years with the student placement data, simply apply an extraction function to the latter:
from bs4 import BeautifulSoup as soup
import re
from selenium import webdriver

_d = webdriver.Chrome('/path/to/chromedriver')
_d.get('https://msih.bgu.ac.il/md-program/residency-placements/')
d = soup(_d.page_source, 'html.parser')

def placement(block):
    r = block.find_all(re.compile('ul|h4'))
    return {r[i].text: [b.text for b in r[i+1].find_all('li')] for i in range(0, len(r)-1, 2)}

result = {i.h2.text: placement(i) for i in d.find_all('div', {'class': 'accord-head'})}
print(result['Class of 2019'])
Output:
{'Anesthesiology': ['University at Buffalo School of Medicine, Buffalo, NY'], 'Emergency Medicine': ['Aventura Hospital, Aventura, Fl'], 'Family Medicine': ['Louisiana State University School of Medicine, New Orleans, LA', 'UT St Thomas Hospitals, Murfreesboro, TN', 'Sea Mar Community Health Center, Seattle, WA'], 'Internal Medicine': ['Oregon Health and Science University, Portland, OR', 'St Joseph Hospital, Denver, CO\xa0'], 'Obstetrics-Gynecology': ['Jersey City Medical Center, Jersey City, NJ', 'New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY'], 'Pediatrics': ['St Louis Children’s Hospital, St Louis, MO', 'University of Maryland Medical Center, Baltimore, MD', 'St Christopher’s Hospital, Philadelphia, PA'], 'Surgery': ['Mountain Area Health Education Center, Asheville, NC']}
Note: I ended up using selenium because, for me, the HTML response returned by requests.get did not include the rendered student placement data.
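From there, getting to your target columns (grad year, specialty, hospital, location) is just a flattening pass over result. A sketch, assuming the last two comma-separated fields of each entry are the city and state:
import pandas as pd

rows = []
for year, specialties in result.items():
    for specialty, entries in specialties.items():
        for entry in entries:
            parts = [p.strip() for p in entry.split(',')]
            rows.append({
                'grad year': year[-4:],  # 'Class of 2019' -> '2019'
                'specialty': specialty,
                'hospital': ', '.join(parts[:-2]) or entry,  # assumption: last two fields are city, state
                'location': ', '.join(parts[-2:]),
            })

df = pd.DataFrame(rows)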
You have a dictionary of BeautifulSoup elements ('bs4.element.Tag'), so you don't have to parse them again.
You can directly use find(), find_all(), etc.:
for key, value in data_dict.items():
    print(type(value), key, value.find('h4').text)
Result
<class 'bs4.element.Tag'> 2019 Anesthesiology
<class 'bs4.element.Tag'> 2018 Anesthesiology
<class 'bs4.element.Tag'> 2017 Anesthesiology
<class 'bs4.element.Tag'> 2016 Emergency Medicine
<class 'bs4.element.Tag'> 2015 Emergency Medicine
<class 'bs4.element.Tag'> 2014 Anesthesiology
<class 'bs4.element.Tag'> 2013 Anesthesiology
<class 'bs4.element.Tag'> 2012 Emergency Medicine
<class 'bs4.element.Tag'> 2011 Emergency Medicine
<class 'bs4.element.Tag'> 2010 Dermatology
<class 'bs4.element.Tag'> 2009 Emergency Medicine
<class 'bs4.element.Tag'> 2008 Family Medicine
<class 'bs4.element.Tag'> 2007 Anesthesiology
<class 'bs4.element.Tag'> 2006 Triple Board (Pediatrics/Adult Psychiatry/Child Psychiatry)
<class 'bs4.element.Tag'> 2005 Family Medicine
<class 'bs4.element.Tag'> 2004 Anesthesiology
<class 'bs4.element.Tag'> 2003 Emergency Medicine
<class 'bs4.element.Tag'> 2002 Family Medicine
Full code:
import urllib.request
import bs4 as bs

sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

headers = soup.find_all('div', class_='accord-head')
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])

rez_classes = soup.find_all('div', class_='accord-con')
data_dict = dict(zip(grad_yr_list, rez_classes))

for key, value in data_dict.items():
    print(type(value), key, value.find('h4').text)
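If you need more than the first h4 per year, you can pair each specialty heading with the ul that follows it. A sketch using find_next_sibling(), assuming the h4/ul alternation shown in the sample HTML holds throughout:
for year, block in data_dict.items():
    for h4 in block.find_all('h4'):
        ul = h4.find_next_sibling('ul')
        if ul is None:
            continue
        hospitals = [li.get_text(strip=True) for li in ul.find_all('li')]
        print(year, h4.text, hospitals)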
You can go to pandas once you have the soup, then parse out the necessary information:
df = pd.DataFrame(list(zip(headers, rez_classes)))  # pair each accord-head div with its accord-con div
df['grad_year'] = df[0].map(lambda x: x.text[-4:])
df['specialty'] = df[1].map(lambda x: [i.text for i in x.find_all('h4')])
df['hospital'] = df[1].map(lambda x: [i.text for i in x.find_all('li')])
df['location'] = df[1].map(lambda x: [''.join(i.text.split(',')[1:]) for i in x.find_all('li')])
You will have to do some pandas magic after that
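For instance, a sketch of that magic: hospital and location have one entry per li, so they can be exploded together (multi-column explode needs pandas >= 1.3); specialty has a different length per row and is better handled with the h4/ul pairing shown earlier:
flat = df[['grad_year', 'hospital', 'location']].explode(['hospital', 'location'])
print(flat.head())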
I don't know pandas, but the following code can get the data from the table. I don't know if it will meet your needs.
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

url = 'https://msih.bgu.ac.il/md-program/residency-placements/'
response = requests.get(url)
doc = SimplifiedDoc(response.text)

divs = doc.getElementsByClass('accord-head')
datas = {}
for div in divs:
    grad_year = div.h2.text[-4:]
    rez_classe = div.getElementByClass('accord-con')
    h4s = rez_classe.h4s  # get all h4 elements
    rows = []
    for h4 in h4s:
        if not h4.next:
            continue
        lis = h4.next.lis
        specialty = h4.text
        hospital = [li.text for li in lis]
        rows.append({'specialty': specialty, 'hospital': hospital})
    datas[grad_year] = rows  # append per specialty so the last one doesn't overwrite the rest

for data in datas:
    print(data, datas[data])
I am trying to pick out the part of the body from "The Lalit Kala Akademi Scholarship 2017 - 2018 from the...." to "Email: lka@lalitkala.gov.in; lalitkala1954@yahoo.in Website: lalitkala.gov.in".
But my output contains many "\n" and "\t". I guess it's happening due to the AdWords scripts in between. Any idea how to solve this?
import scrapy

class MySpider(scrapy.Spider):
    name = "test"
    start_urls = [
        'http://www.indiaeducation.net/scholarships/lalit-kala-akademi-scholarship.aspx',
    ]

    def parse(self, response):
        for scholarships in response.xpath('//*[@id="wrapper"]'):
            yield {
                'text': scholarships.xpath('//*[@id="artBody"]/text()').extract(),
            }
Something like this?
>>> u" ".join(line.strip() for line in response.xpath('//div[#id="artBody"]//*[not(self::div)][not(self::script)]/text()').extract())
u'The Lalit Kala Akademi Scholarship 2017 - 2018 from the National Academy of Art, Delhi is awarded to learners who have passion for the visual arts. The scholarship is given by Lalit Kala Akademi, an Indian governmental institution for promotion and innovation of the visual arts. Presently 40 scholarships are offered by the National Academy of Art, however, this number may vary depending upon availability of resources. Scholarship Through this scholarship the artists are given a work space to improve their skills and to develop new ideas within their field of visual art. Visual artists ( in disciplines such as graphic, sculpture, painting, ceramics), art historians and art critics may apply. The scholarship is worth Rs. 10,000 per month for a one-year period. Important dates Application available: available online now at lalitkala.gov.in Last date for the filled application to reach the Akademi at New Delhi: 25 May, 2015 Eligibility criteria The Lalit Kala Akademi , New Delhi is looking for young rising artists in the visual arts, as well as art historians and art critics in the age group of 21-35 years Application procedure Prospective candidates should fill out the application form and include a Rs. 100 entry fee through postal order/ bank draft payable to Secretary, Lalit Kala Akademi drawn on New Delhi branch . A forum of senior artists decides who will be selected for the award. Selection is partly based on a published article or photographs of exhibits. The candidates awarded Lalit Kala Akademy Scholarship or National Academy of Art Scholarship need to work with regional centres of the Akademi listed below: Bhubaneswar Guwahati Kolkata Lucknow Delhi Shimla Contact details Lalit Kala Akademi Rabindra Bhavan, 35, Ferozeshah Road , New Delhi-110001 Telephone: 011 - 23009200 Fax : 011 - 23009292 Email: lka@lalitkala.gov.in; lalitkala1954@yahoo.in Website: lalitkala.gov.in'
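If you want that cleanup inside the spider itself rather than in the shell, a sketch of the same selector in parse():
def parse(self, response):
    parts = response.xpath(
        '//div[@id="artBody"]//*[not(self::div)][not(self::script)]/text()'
    ).extract()
    yield {'text': ' '.join(p.strip() for p in parts if p.strip())}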
I'm trying to parse a txt file from EDGAR; however, different filing types come in different report formats even though they are all txt files. I have no problem using BeautifulSoup to parse XML reports, but I came across this type of report:
<SEC-DOCUMENT>0001047469-13-001017.txt : 20130214
<SEC-HEADER>0001047469-13-001017.hdr.sgml : 20130214
<ACCEPTANCE-DATETIME>20130214060031
ACCESSION NUMBER: 0001047469-13-001017
CONFORMED SUBMISSION TYPE: 13F-HR
PUBLIC DOCUMENT COUNT: 1
CONFORMED PERIOD OF REPORT: 20121231
FILED AS OF DATE: 20130214
DATE AS OF CHANGE: 20130214
EFFECTIVENESS DATE: 20130214
FILER:
COMPANY DATA:
COMPANY CONFORMED NAME: BILL & MELINDA GATES FOUNDATION TRUST
CENTRAL INDEX KEY: 0001166559
IRS NUMBER: 911663695
STATE OF INCORPORATION: WA
FISCAL YEAR END: 1231
FILING VALUES:
FORM TYPE: 13F-HR
SEC ACT: 1934 Act
SEC FILE NUMBER: 028-10098
FILM NUMBER: 13605999
BUSINESS ADDRESS:
STREET 1: 2365 CARILLON POINT
CITY: KIRKLAND
STATE: WA
ZIP: 98033
BUSINESS PHONE: 4258897900
MAIL ADDRESS:
STREET 1: 2365 CARILLON POINT
CITY: KIRKLAND
STATE: WA
ZIP: 98033
FORMER COMPANY:
FORMER CONFORMED NAME: GATES BILL & MELINDA FOUNDATION
DATE OF NAME CHANGE: 20020205
</SEC-HEADER>
<DOCUMENT>
<TYPE>13F-HR
<SEQUENCE>1
<FILENAME>a2212666z13f-hr.txt
<DESCRIPTION>13F-HR
<TEXT>
<Page>
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
WASHINGTON, D.C. 20549
FORM 13F
FORM 13F COVER PAGE
Report for the Calendar Year or Quarter Ended: December 31, 2012
-----------------------
Check Here if Amendment / /; Amendment Number:
---------
This Amendment (Check only one.): / / is a restatement.
/ / adds new holdings entries.
Institutional Investment Manager Filing this Report:
Name: Bill & Melinda Gates Foundation Trust
-------------------------------------
Address: 2365 Carillon Point
-------------------------------------
Kirkland, WA 98033
-------------------------------------
Form 13F File Number: 28-10098
---------------------
The institutional investment manager filing this report and the person by whom
it is signed hereby represent that the person signing the report is authorized
to submit it, that all information contained herein is true, correct and
complete, and that it is understood that all required items, statements,
schedules, lists, and tables, are considered integral parts of this form.
Person Signing this Report on Behalf of Reporting Manager:
Name: Michael Larson
-------------------------------
Title: Authorized Agent
-------------------------------
Phone: (425) 889-7900
-------------------------------
Signature, Place, and Date of Signing:
/s/ Michael Larson Kirkland, Washington February 14, 2013
------------------------------- -------------------- -----------------
[Signature] [City, State] [Date]
Report Type (Check only one.):
/X/ 13F HOLDINGS REPORT. (Check here if all holdings of this reporting
manager are reported in this report.)
/ / 13F NOTICE. (Check here if no holdings reported are in this report,
and all holdings are reported by other reporting manager(s).)
/ / 13F COMBINATION REPORT. (Check here if a portion of the holdings for this
reporting manager are reported in this report and a portion are reported by
other reporting manager(s).)
<Page>
FORM 13F SUMMARY PAGE
Report Summary:
Number of Other Included Managers: 0
--------------------
Form 13F Information Table Entry Total: 26
--------------------
Form 13F Information Table Value Total: $ 16,788,719
--------------------
(thousands)
List of Other Included Managers:
Provide a numbered list of the name(s) and Form 13F file number(s) of all
institutional investment managers with respect to which this report is filed,
other than the manager filing this report.
NONE
2
<Page>
FORM 13 INFORMATION TABLE
As of December 31, 2012
<Table>
<Caption>
VOTING AUTHORITY
VALUE SHRS OR SH/ PUT/ INVESTMENT OTHER ----------------------
NAME OF ISSUER TITLE OF CLASS CUSIP (x$1000) PRN AMOUNT PRN CALL DISCRETION MANAGERS SOLE SHARED NONE
---------------------------- ---------------- --------- ---------- ------------ --- ---- ---------- -------- ---------- ------ ----
<S> <C> <C> <C> <C> <C> <C> <C> <C> <C> <C> <C>
AUTOLIV INC COM 052800109 8,329 123,600 SH SOLE 123,600
AUTONATION INC COM 05329W102 75,379 1,898,716 SH SOLE 1,898,716
BERKSHIRE HATHAWAY INC DEL CL B NEW 084670702 7,811,199 87,081,373 SH SOLE 87,081,373
BP PLC SPONSORED ADR 055622104 297,018 7,133,000 SH SOLE 7,133,000
CANADIAN NATL RY CO COM 136375102 779,358 8,563,437 SH SOLE 8,563,437
CATERPILLAR INC DEL COM 149123101 919,168 10,260,857 SH SOLE 10,260,857
COCA COLA CO COM 191216100 1,232,573 34,002,000 SH SOLE 34,002,000
COCA COLA FEMSA SAB DE CV SPON ADR REP L 191241108 926,242 6,214,719 SH SOLE 6,214,719
CROWN CASTLE INTL CORP COM 228227104 384,822 5,332,900 SH SOLE 5,332,900
DIAMOND FOODS INC COM 252603105 6,031 441,163 SH SOLE 441,163
ECOLAB INC COM 278865100 313,946 4,366,425 SH SOLE 4,366,425
EXXON MOBIL CORP COM 30231G102 661,576 7,643,858 SH SOLE 7,643,858
FEDEX CORP COM 31428X106 277,453 3,024,999 SH SOLE 3,024,999
FOMENTO ECONOMICO MEXICANO SPON ADR UNITS 344419106 21,953 218,000 SH SOLE 218,000
GRUPO TELEVISA SA SPON ADR REP ORD 40049J206 448,647 16,879,103 SH SOLE 16,879,103
LIBERTY GLOBAL INC COM SER A 530555101 133,508 2,119,515 SH SOLE 2,119,515
LIBERTY GLOBAL INC COM SER C 530555309 41,507 706,507 SH SOLE 706,507
MCDONALDS CORP COM 580135101 870,853 9,872,500 SH SOLE 9,872,500
ORBOTECH LTD ORD M75253100 6,973 823,300 SH SOLE 823,300
PROCTER & GAMBLE CO COM 742718109 101,835 1,500,000 SH SOLE 1,500,000
REPUBLIC SVCS INC COM 760759100 39,596 1,350,000 SH SOLE 1,350,000
SIGNET JEWELERS LIMITED SHS G81276100 9,993 187,130 SH SOLE 187,130
TOYOTA MOTOR CORP SP ADR REP2COM 892331307 14,295 153,300 SH SOLE 153,300
WAL-MART STORES INC COM 931142103 757,558 11,103,000 SH SOLE 11,103,000
WASTE MGMT INC COM 94106L109 628,700 18,633,672 SH SOLE 18,633,672
WILLIS GROUP HOLDINGS PUBLIC SHS G96666105 20,209 602,700 SH SOLE 602,700
---------- ------------
16,788,719 240,235,774
</Table>
</TEXT>
</DOCUMENT>
</SEC-DOCUMENT>
As you can see, this file is just a plain txt file with custom tags.
My question is: how do I target the text within a specific tag? For example, I only need the text inside the TEXT tag in the file above.
You can select the TEXT tags and then work on that content:
from bs4 import BeautifulSoup

soup = BeautifulSoup(open("/yourfile.html"), "html.parser")
text_tags = soup.find_all('text')
for text in text_tags:
    print(text.get_text())
    # work from here
Note: I used html.parser, which lowercases tag names, so the TEXT tag is matched as 'text'. You may need to change to an XML parser if that suits your needs better.
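Since the file is SGML with unclosed custom tags, you can also skip the parser entirely and slice the block out with a regex. A sketch (the filename is a placeholder):
import re

with open('/yourfile.txt') as f:  # placeholder path
    raw = f.read()

# everything between <TEXT> and </TEXT>, across newlines
match = re.search(r'<TEXT>(.*?)</TEXT>', raw, re.DOTALL | re.IGNORECASE)
if match:
    print(match.group(1))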