I am using BeautifulSoup4 to do some HTML scraping.
I am trying to extract important information such as the title, metadata, paragraphs, and list items.
I can extract the paragraphs like so:
import urllib.request
from bs4 import BeautifulSoup

def main():
    response = urllib.request.urlopen('https://ecir2019.org/industry-day/')
    html = response.read()
    soup = BeautifulSoup(html, features="html.parser")
    text = [e.get_text() for e in soup.find_all('p')]
    article = '\n'.join(text)
    print(article)

main()
But if the page has bullet points in the body text, grabbing them also pulls in the navigation bar, i.e. if I change p to li or ul the output includes the navigation links too.
For example what I want to get as output is:
The Industry Day's objectives are three-fold:
The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.
The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.
Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.
What I actually get:
The Industry Day's objectives are three-fold:
The tags in the HTML Source:
<p>The Industry Day's objectives are three-fold:</p>
<ol>
<li>The first objective is to present the state of the art in search and search-related areas, delivered as keynote talks by influential technical leaders from the search industry.</li>
<li>The second objective of the Industry Day is the presentation of interesting, novel and innovative ideas related to information retrieval.</li>
<li>Finally, we are looking forward to a highly-interactive discussion involving both industry and academia.</li>
</ol>
You can use the CSS Or (grouping) selector syntax, i.e. a comma-separated selector list, so you can select the li elements as well.
import requests
from bs4 import BeautifulSoup
url = 'https://ecir2019.org/industry-day/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('p, ol li')]
print(items)
Just that section:
import requests
from bs4 import BeautifulSoup
url = 'https://ecir2019.org/industry-day/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('.kg-card-markdown p:nth-of-type(2), .kg-card-markdown p:nth-of-type(2) + ol li')]
print(items)
The page appears to have changed, so I am using a cached version (this will only work until the cache is updated). You can limit results to the post body with an additional class selector:
import requests
from bs4 import BeautifulSoup
url = 'http://webcache.googleusercontent.com/search?q=cache:https://ecir2019.org/industry-day'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
items = [item.text for item in soup.select('.post-body p, .post-body ol li, .post-body ul li')]
print(items)
Related
I'm trying to scrape a random site and get all the text with a certain class off of a page.
from bs4 import BeautifulSoup
import requests
sources = ['https://cnn.com']
for source in sources:
    page = requests.get(source)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find_all("div", class_='cd_content')
    for result in results:
        title = result.find('span', class_="cd__headline-text vid-left-enabled")
        print(title)
From what I found online, this should work but for some reason, it can't find anything and results is empty. Any help is greatly appreciated.
Upon inspecting the network calls, you can see that the page is loaded dynamically by sending a GET request to:
https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl
The HTML is available within the html key of the JSON response:
import requests
from bs4 import BeautifulSoup
URL = "https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl"
response = requests.get(URL).json()["html"]
soup = BeautifulSoup(response, "html.parser")
for tag in soup.find_all(class_="cd__headline-text vid-left-enabled"):
    print(tag.text)
Output (truncated):
This is the first Covid-19 vaccine in the US authorized for use in younger teens and adolescents
When the US could see Covid cases and deaths plummet
'Truly, madly, deeply false': Keilar fact-checks Ron Johnson's vaccine claim
These are the states with the highest and lowest vaccination rates
I am just developing a scraper with Python.
I want to scrape some text from a homepage, and I wrote code like this to get the specific data, but it returns nothing.
This is the part of the HTML I want to scrape:
<div class="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom ui-accordion-content-active" id="ui-id-94" aria-labelledby="ui-id-93" role="tabpanel" aria-hidden="false" style="display: block; height: 210px;">
<p>
Computer and Information Systems (Post-Baccalaureate Diploma)
Computing Studies and Information Systems (Diploma)
Data Analytics (Post-Degree Diploma)
Data and Analytics
Emerging Technology (Post-Degree Diploma)
Information and Communication Technology (Post-Degree Diploma)
Web and Mobile Computing
</p>
I want to get the program names. I wrote code like this, but it returns an empty list.
from bs4 import BeautifulSoup
import requests
import os
import re
import sys
URL = "https://www.douglascollege.ca/programs-courses/catalogue/programs"
r = requests.get(URL, headers = self.requestHeaders())
soup = BeautifulSoup(r.text, "html.parser")
test = soup.find_all("a", class_='ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom ui-accordion-content-active')
print(test)
What is the problem?
Your call to soup.find_all() searches for "a" elements with the classes ui-accordion-content, ui-helper-reset, etc., but none of the "a" elements have those classes. Try removing the class part.
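A minimal sketch of that suggestion, reusing the question's URL with a plain requests call (the headers helper from the original snippet is omitted). Note that, as the other answers point out, the page is rendered with JavaScript, so this may still come back empty:

import requests
from bs4 import BeautifulSoup

URL = "https://www.douglascollege.ca/programs-courses/catalogue/programs"
r = requests.get(URL)
soup = BeautifulSoup(r.text, "html.parser")

# Keep the tag name, drop the class filter: collect every <a> and inspect its text.
links = soup.find_all("a")
print([a.get_text(strip=True) for a in links])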
First problem: this page uses JavaScript, and requests/BeautifulSoup can't run JavaScript. You may need Selenium to control a web browser, which can run JavaScript and give you the full HTML, which you can then search with Selenium itself or with BeautifulSoup.
Second problem: you have to search for a div with these classes, and then inside that div search for a elements without these classes.
BTW: to control the browser you will also need a driver for Firefox or Chrome.
Code:
import selenium.webdriver
from bs4 import BeautifulSoup

url = "https://www.douglascollege.ca/programs-courses/catalogue/programs"

driver = selenium.webdriver.Firefox()  # needs geckodriver available on PATH
driver.get(url)

# Parse the fully rendered page, then find the accordion panels by class
# and the plain <a> tags inside each one.
soup = BeautifulSoup(driver.page_source, "html.parser")

all_div = soup.find_all("div", class_='ui-accordion-content')
for div in all_div:
    all_items = div.find_all("a")
    for item in all_items:
        print(item.text)
Part of the result:
Basic Occupational Education - Electronics and General Assembly
Basic Occupational Education - Food Services
Basic Occupational Education - Retail and Business Services
Child and Youth Care (Bachelor of Arts)
Child and Youth Care (Diploma)
Classroom and Community Support (Certificate)
Classroom and Community Support (Diploma)
Education Assistance and Inclusion (Certificate)
Early Childhood Education (Certificate)
Early Childhood Education (Diploma)
Early Childhood Education: Infant/Toddler (Post-Basic Certificate)
Early Childhood Education: Special Needs - Inclusive Practices (Post-Basic Certificate)
Employment Supports Specialty
Therapeutic Recreation (Bachelor)
Therapeutic Recreation (Diploma)
Accounting (Bachelor of Business Administration)
Accounting (Certificate)
EDIT: The same without BeautifulSoup using only Selenium
import selenium.webdriver

url = "https://www.douglascollege.ca/programs-courses/catalogue/programs"

driver = selenium.webdriver.Firefox()
driver.get(url)

all_div = driver.find_elements_by_xpath('//div[contains(@class, "ui-accordion-content")]')
for div in all_div:
    all_items = div.find_elements_by_tag_name("a")
    for item in all_items:
        print(item.get_attribute('textContent'))
        #print(item.text)  # doesn't work for hidden elements
I could be mistaken, but it looks like the page you are trying to scrape relies on JavaScript, which means BeautifulSoup won't do the job on its own. When I simplify the code to return the whole soup, it should show all of the HTML that comes back. So the following:
from bs4 import BeautifulSoup
import requests
import os
import re
import sys
URL = "https://www.douglascollege.ca/programs-courses/catalogue/programs"
r = requests.get(URL)
coverpage = r.content
soup = BeautifulSoup(coverpage, 'html5lib')
print(soup)
produces
<html><head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr/><center>Microsoft-Azure-Application-Gateway/v2</center>
</body></html>
That's why you aren't getting any <a>'s because there aren't any in the soup.
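As a quick sanity check before parsing, you can inspect the response status code; a blocked request shows up there long before any missing <a>'s do. A minimal sketch, assuming the same URL:

import requests

URL = "https://www.douglascollege.ca/programs-courses/catalogue/programs"
r = requests.get(URL)

print(r.status_code)   # e.g. 403 when the gateway rejects the request
r.raise_for_status()   # raises requests.exceptions.HTTPError on any 4xx/5xx response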
If the URL is changed to something else like:
URL = "https://www.tutorialspoint.com/gensim/gensim_creating_lda_mallet_model.htm"
then printing the soup returns the page's HTML, and there are <a>'s to get.
Viewing the source of the page you are trying to scrape turns up this line
<script src="/-/media/A1FA8497F6534B7D915442DEC3FA6541.ashx?636948345000000000"></script><script src="/-/media/ACA0B6DEC2124962B48341E8092B8B4D.ashx?636948345010000000"></script><script src="/-/media/68BA4C1C2A0D494F97E7CD7D5ECE72B0.ashx?637036665710000000"></script>
<!-- Javascripts goes between here -->
Along with several other mentions of JavaScript on the page. As discussed in this question, you might try Selenium rather than BeautifulSoup. Good luck.
BeautifulSoup is not finding the div tag 'pl-price js-pl-price'. I see it in the inspect element; however, when I run my code, it returns None. The div 'product-details' is also in the HTML and it is found, but the div tag 'pl-price js-pl-price' can't be found by BeautifulSoup. Why is that?
My Code:
import urllib2, sys, requests
from bs4 import BeautifulSoup
site = "https://www.lowes.com/pl/Refrigerators-Appliances/4294857973?goToProdList=true"
hdr = {'User-Agent': 'Mozilla/5.0'}
req = urllib2.Request(site,headers=hdr)
page = urllib2.urlopen(req)
soup = BeautifulSoup(page, 'html.parser')
details = soup.find('div', attrs={'class': 'product-details'})
name = details.find('p', attrs={'class': 'h6 product-title js-product-title met-product-title v-spacing-large ellipsis-three-line art-plp-itemDescription'})
price = soup.find('div', attrs={'class': "product-pricing"})
actual_price = price.find('div', attrs={'class': "pl-price js-pl-price"})
print actual_price
HTML FROM WEBSITE:
<div class="product-details">
<div class="product-pricing">
<div class="pl-price js-pl-price" tabindex="-1">
<!-- Start of Product Family Pricing -->
<!-- Map price and savings through date present for product family -->
RESULTS:
scrape_products.py
None
If you look at the results of the price search, you can see in the HTML the following text:
"Since Lowes.com is national in scope, we check inventory at your local store first in an effort to fulfill your order more quickly. You may find product or pricing that differ from that of your local store, but we make every effort to minimize those differences so you can get exactly what you want at the best possible price."
This would lead me to believe that the pricing information isn't loading immediately, and thus isn't loaded into the HTML that is parsed by BeautifulSoup. You should try a headless browser solution with Selenium.
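A minimal sketch of that approach with headless Chrome (the exact option and driver setup vary a bit between Selenium versions; chromedriver is assumed to be on PATH):

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")

driver = webdriver.Chrome(options=options)  # assumes chromedriver is on PATH
driver.get("https://www.lowes.com/pl/Refrigerators-Appliances/4294857973?goToProdList=true")

# Parse the fully rendered page source instead of the raw server response.
soup = BeautifulSoup(driver.page_source, "html.parser")
price = soup.find("div", attrs={"class": "pl-price js-pl-price"})
print(price.get_text(strip=True) if price else "price block not rendered yet")

driver.quit()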
I wrote a script in Python to pull out particular paragraphs, but I end up getting all of the text on the page. I want to scrape the paragraphs inside a div whose id varies from page to page, e.g.
<div id="content-body-123123">
and this id varies for different pages. How can I identify this particular tag and pull out only the paragraphs inside it?
import requests
from bs4 import BeautifulSoup as bs

url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
html = page.content
soup = bs(html, 'html.parser')
for tag in soup.find_all('p'):
    print(tag.text + '\n')
Try this. The change of id number should not affect your result:
from bs4 import BeautifulSoup
import requests
url = 'http://www.thehindu.com/opinion/op-ed/Does-Beijing-really-want-to-ldquobreak-uprdquo-India/article16875298.ece'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
for content in soup.select("[id^='content-body-'] p"):
    print(content.text)
I have a webpage of popular articles which I want to scrape for each quoted webpage's hyperlink and the title of the article it's displaying.
The desired output of my script is a CSV file which lists each title and the article content in one line. So if there are 50 articles on this webpage, I want one file with 50 lines and 100 data points.
My problem here is that the article titles and their hyperlinks are contained in an SVG container, which is throwing me off. I've utilized BeautifulSoup for web scraping before but am not sure how to select each article's title and hyperlink. Any and all help is much appreciated.
import requests
from bs4 import BeautifulSoup
import re
res = requests.get('http://fundersandfounders.com/what-internet-thinks-based-on-media/')
res.raise_for_status()
playFile = open('top_articles.html', 'wb')
for chunk in res.iter_content(100000):
    playFile.write(chunk)
playFile.close()

f = open('top_articles.html')
soup = BeautifulSoup(f, 'html.parser')
links = soup.select('p')  # I know this is where I'm messing up, but I'm not sure which selector to use, so the paragraph selector is a placeholder
print(links)
I am aware that this is in effect a two step project: the current version of my script doesn't iterate through the list of all the hyperlinks whose actual content I'm going to be scraping. That's a second step which I can execute easily on my own, however if anyone would like to write that bit too, kudos to you.
You should do it in two steps:
parse the HTML and extract the link to the svg
download the svg page, parse it with BeautifulSoup, and extract the "bubbles"
Implementation:
from urllib.parse import urljoin  # Python 3

import requests
from bs4 import BeautifulSoup

base_url = 'http://fundersandfounders.com/what-internet-thinks-based-on-media/'

with requests.Session() as session:
    # extract the link to the svg
    res = session.get(base_url)
    soup = BeautifulSoup(res.content, 'html.parser')
    svg = soup.select_one("object.svg-content")
    svg_link = urljoin(base_url, svg["data"])

    # download and parse the svg
    res = session.get(svg_link)
    soup = BeautifulSoup(res.content, 'html.parser')
    for article in soup.select("#bubbles .bgroup"):
        title, resource = [item.get_text(strip=True, separator=" ") for item in article.select("a text")]
        print("Title: '%s'; Resource: '%s'." % (title, resource))
Prints article titles and resources:
Title: 'CNET'; Resource: 'Android Apps That Extend Battery Life'.
Title: '5-Years-Old Shoots Sister'; Resource: 'CNN'.
Title: 'Samsung Galaxy Note II'; Resource: 'Engaget'.
...
Title: 'Predicting If a Couple Stays Together'; Resource: 'The Atlantic Magazine'.
Title: 'Why Doctors Die Differently'; Resource: 'The Wall Street Journal'.
Title: 'The Ideal Nap Length'; Resource: 'Lifehacker'.
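If you want the title/resource pairs written to a CSV file as described in the question, a minimal sketch with the standard-library csv module (it assumes the pairs have been collected into a rows list inside the loop above):

import csv

# rows is assumed to be a list of (title, resource) tuples gathered in the loop above.
rows = [("CNET", "Android Apps That Extend Battery Life")]  # example data from the output

with open("top_articles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "resource"])  # header row
    writer.writerows(rows)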