I have a webpage of popular articles which I want to scrape for each quoted webpage's hyperlink and the title of the article it's displaying.
The desired output of my script is a CSV file which lists each title and the article content in one line. So if there are 50 articles on this webpage, I want one file with 50 lines and 100 data points.
My problem here is that the article titles and their hyperlinks are contained in an SVG container, which is throwing me off. I've utilized BeautifulSoup for web scraping before but am not sure how to select each article's title and hyperlink. Any and all help is much appreciated.
import requests
from bs4 import BeautifulSoup
import re
res = requests.get('http://fundersandfounders.com/what-internet-thinks-based-on-media/')
res.raise_for_status()
with open('top_articles.html', 'wb') as playFile:
    for chunk in res.iter_content(100000):
        playFile.write(chunk)
f = open('top_articles.html')
soup = BeautifulSoup(f, 'html.parser')
links = soup.select('p')  # I know this is where I'm going wrong; 'p' is only a placeholder until I know which selector to use
print(links)
I am aware that this is in effect a two-step project: the current version of my script doesn't iterate through the list of hyperlinks whose content I'm going to scrape. That's a second step which I can easily do on my own; however, if anyone would like to write that bit too, kudos to you.
You should do it in two steps:
1. Parse the HTML and extract the link to the SVG.
2. Download the SVG page, parse it with BeautifulSoup, and extract the "bubbles".
Implementation:
from urllib.parse import urljoin  # Python 3
import requests
from bs4 import BeautifulSoup
base_url = 'http://fundersandfounders.com/what-internet-thinks-based-on-media/'
with requests.Session() as session:
    # extract the link to svg
    res = session.get(base_url)
    soup = BeautifulSoup(res.content, 'html.parser')
    svg = soup.select_one("object.svg-content")
    svg_link = urljoin(base_url, svg["data"])
    # download and parse svg
    res = session.get(svg_link)
    soup = BeautifulSoup(res.content, 'html.parser')
    for article in soup.select("#bubbles .bgroup"):
        title, resource = [item.get_text(strip=True, separator=" ") for item in article.select("a text")]
        print("Title: '%s'; Resource: '%s'." % (title, resource))
Prints article titles and resources:
Title: 'CNET'; Resource: 'Android Apps That Extend Battery Life'.
Title: '5-Years-Old Shoots Sister'; Resource: 'CNN'.
Title: 'Samsung Galaxy Note II'; Resource: 'Engaget'.
...
Title: 'Predicting If a Couple Stays Together'; Resource: 'The Atlantic Magazine'.
Title: 'Why Doctors Die Differently'; Resource: 'The Wall Street Journal'.
Title: 'The Ideal Nap Length'; Resource: 'Lifehacker'.
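To get from the printed pairs to the CSV file the question asks for, the standard csv module should be enough. This is a minimal sketch, assuming the (title, resource) pairs have already been collected into a list; `write_articles_csv` is a made-up helper name, and the two sample rows are copied from the output above. It writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io


def write_articles_csv(rows, handle):
    """Write (title, resource) pairs to an open text file, one article per line."""
    writer = csv.writer(handle)
    writer.writerow(["title", "resource"])  # header row
    writer.writerows(rows)


# Two rows taken from the printed output above.
articles = [
    ("Why Doctors Die Differently", "The Wall Street Journal"),
    ("The Ideal Nap Length", "Lifehacker"),
]

buffer = io.StringIO()
write_articles_csv(articles, buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with `open('top_articles.csv', 'w', newline='', encoding='utf-8')` instead of the buffer (the file name is only an example); `newline=''` lets the csv module control line endings itself.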
I am new to BeautifulSoup, and I'm trying to extract data from the following website.
https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx
I am trying to extract the availability of the hospital beds information (along with the detailed breakup) after choosing a particular district and also with the 'With available bed only' option selected.
Should I choose the table, the td, the tbody, or the div class for this instance?
My current code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
locations= soup.find('div', {'class': 'col-lg-12 col-md-12 col-sm-12'})
print(locations)
This only prints blank output.
I have also tried selecting tbody and table, but still could not get it to work.
Any help would be greatly appreciated!
EDIT: Trying to find a certain element returns []. The code -
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
location = soup.find_all('h5')
print(location)
It is probably a dynamic website: the page loads or updates its content with JavaScript after the initial HTML is delivered, so what requests and bs4 retrieve is not what you see in the browser.
For such dynamic pages you should use Selenium and combine it with bs4.
https://selenium-python.readthedocs.io/index.html
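A rough sketch of that combination, under stated assumptions: Selenium and a matching Chrome driver are installed, and the `h5` tag from the question's edit is what appears once the page has rendered (not verified here). `fetch_rendered_html` and `heading_texts` are hypothetical helper names. Choosing a district or ticking the 'With available bed only' option would need further Selenium interaction that this sketch does not cover.

```python
from bs4 import BeautifulSoup


def fetch_rendered_html(url, css_selector, timeout=10):
    """Load `url` in a real browser and return the HTML after JavaScript has run.

    Selenium is imported inside the function so the parsing helper below
    can be used without it.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Wait until at least one matching element exists in the rendered DOM.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, css_selector))
        )
        return driver.page_source
    finally:
        driver.quit()


def heading_texts(html, tag="h5"):
    """Return the stripped text of every `tag` element in `html`."""
    soup = BeautifulSoup(html, "html.parser")
    return [element.get_text(strip=True) for element in soup.find_all(tag)]


# Intended use (needs a browser and network access, so it is left commented out):
# html = fetch_rendered_html(
#     "https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx",
#     "h5",
# )
# print(heading_texts(html))
```

Keeping the parsing in its own function means it can be tried on a saved copy of the rendered page before a browser is involved at all.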
I want to create a web scraper that identifies the headings on a web page and the text related to each of them. Can anyone help with how that can be done?
Demo Image
For example, here in the image attached, "Prerequisites" is the heading and the text below is "corresponding text".
You should use Python and BeautifulSoup, a library made for web scraping.
For a given URL, you fetch the page with requests and parse its content as follows:
import requests
from bs4 import BeautifulSoup
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
Once you have the soup object, you can find all headings as follows:
headings = list()
for i in range(1, 7):
    # <h1> to <h6>
    headings.extend(soup.find_all(f'h{i}'))
headings now contains all the headings from h1 to h6, grouped by level rather than in document order. To extract the text of the whole page, proceed as follows:
text_content = soup.text
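Since `soup.text` returns all the text on the page at once, pairing each heading with the text below it takes one more step. The following is only a sketch: it assumes the text belonging to a heading sits in sibling elements that follow it at the same nesting level, which is not true of every page. `heading_sections` is a made-up helper name and the sample HTML is invented around the "Prerequisites" heading from the question.

```python
import re

from bs4 import BeautifulSoup

HEADING = re.compile(r"^h[1-6]$")


def heading_sections(html):
    """Return (heading, text) pairs in document order."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for heading in soup.find_all(HEADING):
        parts = []
        # Collect following sibling tags until the next heading starts.
        for sibling in heading.find_next_siblings():
            if HEADING.match(sibling.name):
                break
            parts.append(sibling.get_text(" ", strip=True))
        sections.append((heading.get_text(strip=True), " ".join(parts)))
    return sections


sample_html = (
    "<h2>Prerequisites</h2><p>Python 3</p><p>pip</p>"
    "<h2>Install</h2><p>Run the installer</p>"
)
for title, text in heading_sections(sample_html):
    print(title, "->", text)
```

Passing a compiled regular expression to `find_all` also keeps the headings in document order, which the level-by-level loop above does not.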
I'm having some serious issues trying to extract the titles from a webpage. I've done this before on some other sites, but this one seems to be an issue because of the JavaScript.
The test link is "https://www.thomasnet.com/products/adhesives-393009-1.html"
The first title I want extracted is "Toagosei America, Inc."
Here is my code:
import requests
from bs4 import BeautifulSoup
url = ("https://www.thomasnet.com/products/adhesives-393009-1.html")
r = requests.get(url).content
soup = BeautifulSoup(r, "html.parser")
print(soup.get_text())
Now if I run it like this, with get_text, I can find the titles in the result; however, as soon as I change it to find_all or find, the titles are lost. I can't find them using the web browser's inspect tool, because it's all JS generated.
Any advice would be greatly appreciated.
You have to specify what to find; in this case, <h2> to get the first title:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thomasnet.com/products/adhesives-393009-1.html'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
first_title = soup.find('h2')
print(first_title.text)
Prints:
Toagosei America, Inc.
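For reference, `find` returns only the first match (or None when nothing matches), while `find_all` returns every match as a list. A small offline sketch of the difference; the markup below is invented for illustration and is not the real structure of the thomasnet page, and the second company name is a placeholder.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: only the first title comes from the question.
sample_html = """
<div class="supplier"><h2><a href="/profile/1">Toagosei America, Inc.</a></h2></div>
<div class="supplier"><h2><a href="/profile/2">Example Adhesives Co.</a></h2></div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

first_title = soup.find("h2")  # a single Tag, or None if there is no <h2>
all_titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]

if first_title is not None:
    print(first_title.get_text(strip=True))
print(all_titles)
```

Checking for None before reading `.text` avoids an AttributeError on pages where the tag is missing.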
So I'm making a scraper with bs4 that scrapes this userscripts website, but I'm running into an issue where I can't remove whitespace. Nothing I've tried works. Can someone help me?
from bs4 import BeautifulSoup
import requests
import os
url = "https://openuserjs.org"
source = requests.get(url)
soup = BeautifulSoup(source.text,'lxml')
os.system('cls')
for Titles in soup.findAll("a", {"class": "tr-link-a"}):
    print(Titles.text.replace("Microsoft is aquiring GitHub", "").replace("TOS Changes", "").replace("Google Authentication Deprecation 2.0", "").replace("Server Maintenance", "").replace("rawgit.com Deprecation and EOL", ""))
To get the titles without the announcements, try the CSS selector below.
for Titles in soup.select("a.tr-link-a>b"):
    print(Titles.text.strip())
Output:
TopAndDownButtonsEverywhere
Anti-Adblock Killer | Reek
YouTube Center
EasyVideoDownload
AdsBypasser
Endless Google
YouTube +
Shadow Selection
bongacamsKillAds
Google View Image
Youtube - Restore Classic
Webcomic Reader
Shiki Rating
Warez-BB +
cinemapress
Google Hit Hider by Domain (Search Filter / Block Sites)
Chaturbate Clean
google cache comeback
translate.google tooltip
Amazon Smile Redirect
oujs - JsBeautify
IMDb 'My Movies' enhancer
EX-百度云盘
Wide Github
DuckDuckGo Extended
If you want to use findAll() instead, try this:
for Titles in soup.findAll("a", {"class": "tr-link-a"}):
    if Titles.find('b'):
        print(Titles.find('b').text.strip())
Code:
from bs4 import BeautifulSoup
import requests
import os
url = "https://openuserjs.org"
source = requests.get(url)
soup = BeautifulSoup(source.text,'lxml')
for Titles in soup.findAll("a", {"class": "tr-link-a"}):
    if Titles.find('b'):
        print(Titles.find('b').text.strip())
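Both variants rely on the script titles being wrapped in a <b> element inside the link while the announcement links are not. The sketch below shows the two approaches side by side on invented markup (the real openuserjs.org HTML may differ), so it can be run without network access.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an announcement link without <b>, two script links with <b>.
sample_html = """
<a class="tr-link-a" href="/announcements/1">Server Maintenance</a>
<a class="tr-link-a" href="/scripts/1"><b> YouTube Center </b></a>
<a class="tr-link-a" href="/scripts/2"><b>AdsBypasser</b></a>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# CSS child combinator: <b> elements that are direct children of a.tr-link-a.
via_select = [b.get_text(strip=True) for b in soup.select("a.tr-link-a > b")]

# Same result with find_all, skipping links that contain no <b>.
via_find_all = [
    a.find("b").get_text(strip=True)
    for a in soup.find_all("a", class_="tr-link-a")
    if a.find("b") is not None
]

print(via_select)
print(via_find_all)
```

`get_text(strip=True)` removes the surrounding whitespace, so no chain of `replace` calls is needed.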
I've pieced together a script which scrapes various pages of products on a product search page and collects the title, price, and link to the full description of each product. It was developed using a loop that adds +i to the page number (www.example.com/search/laptops?page=(1+i)) until the request stopped returning a 200 status.
The product title contains the link to the actual product's full description. I would now like to "visit" that link and do the main data scrape from within the full description of the product.
I have an array built for the links extracted from the product search page - I'm guessing running off this would be a good starting block.
How would I go about extracting the HTML from the links within the array (ie. visit the individual product page and take the actual product data and not just the summary from the products search page)?
Here are the current results I'm getting in CSV format:
Link,Title,Price
example.com/laptop/product1,laptop,£400
example.com/laptop/product2,laptop,£400
example.com/laptop/product3,laptop,£400
example.com/laptop/product4,laptop,£400
example.com/laptop/product5,laptop,£400
First collect the links from all the pages. Then iterate over that list and get whatever information you need from the individual pages. I have only retrieved the specification values here; you can extract whatever values you want.
from bs4 import BeautifulSoup
import requests
all_links=[]
url="https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"
for page in range(1,3):
    res=requests.get(url.format(page)).text
    soup=BeautifulSoup(res,'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])
print(len(all_links))
for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    if soup.select_one('div.gunDetails'):
        print(soup.select_one('div.gunDetails').text)
The output for each page looks like this:
Specifications
Make:Schultz & Larsen
Model:VICTORY GRADE 2 SPIRAL-FLUTED
Licence:Firearm
Orient.:Right Handed
Barrel:23"
Stock:14"
Weight:7lb.6oz.
Origin:Other
Circa:2017
Cased:Makers-Plastic
Serial #:DK-V11321/P20119
Stock #:190912/002
Condition:Used
Specifications
Make:Howa
Model:1500 MINI ACTION [ 1-7'' ] MDT ORYX CHASSIS
Licence:Firearm
Orient.:Right Handed
Barrel:16"
Stock:13 ½"
Weight:7lb.15oz.
Origin:Other
Circa:2019
Cased:Makers-Plastic
Serial #:B550411
Stock #:190905/002
Condition:New
Specifications
Make:Weihrauch
Model:HW 35
Licence:No Licence
Orient.:Right Handed
Scope:Simmons 3-9x40
Total weight:9lb.3oz.
Origin:German
Circa:1979
Serial #:746753
Stock #:190906/004
Condition:Used
If you want to fetch the title and price from each link, try this:
from bs4 import BeautifulSoup
import requests
all_links=[]
url="https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"
for page in range(1,3):
    res=requests.get(url.format(page)).text
    soup=BeautifulSoup(res,'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])
print(len(all_links))
for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    if soup.select_one('h1[itemprop="name"]'):
        print("Title:" + soup.select_one('h1[itemprop="name"]').text)
        print("Price:" + soup.select_one('p.price').text)
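The same idea can be split into a parsing step and a visiting step, which makes it easier to try out without hitting the site. This is a sketch rather than a drop-in replacement: `parse_product` and `scrape_products` are made-up helper names, the selectors are the ones used above, and the pages in `fake_pages` are invented stand-ins for real responses.

```python
from bs4 import BeautifulSoup


def parse_product(html):
    """Pull the title and price out of one product page; None for missing fields."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.select_one('h1[itemprop="name"]')
    price = soup.select_one("p.price")
    return {
        "title": title.get_text(strip=True) if title is not None else None,
        "price": price.get_text(strip=True) if price is not None else None,
    }


def scrape_products(links, fetch):
    """Visit every link and return one dict per page.

    `fetch` is any callable that maps a URL to an HTML string, for example
    lambda url: requests.get(url).text
    """
    results = []
    for link in links:
        record = parse_product(fetch(link))
        record["link"] = link
        results.append(record)
    return results


# Stand-in for the network so the sketch runs offline (hypothetical pages).
fake_pages = {
    "example.com/laptop/product1": '<h1 itemprop="name">laptop</h1><p class="price">£400</p>',
    "example.com/laptop/product2": "<h1>page without the expected markup</h1>",
}
rows = scrape_products(list(fake_pages), fake_pages.get)
print(len(rows), "pages scraped")
```

The resulting list of dicts can be handed to `csv.DictWriter` to extend the existing Link/Title/Price file with whatever extra fields the product pages provide.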
Just extract the part of the product title that is a URL.
Then fetch it:
import requests
url = "https://example.com/laptop/product1"  # placeholder: use the URL extracted above
res = requests.get(url)
res.content
Then, using the beautifulsoup package, do:
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')
and keep iterating, treating the parsed HTML as a tree. For an introduction to requests and beautifulsoup, see: https://www.dataquest.io/blog/web-scraping-tutorial-python/
Hope this helps. I'm not sure I understood your question correctly, but everything here can be done with the urllib2 / requests / beautifulSoup / json / xml Python libraries when it comes to web scraping and parsing.