I've pieced together a script which scrapes various pages of products on a product search page and collects the title/price/link to the full description of each product. It was developed with a loop that adds +i to the page parameter of each URL (www.example.com/search/laptops?page=(1+i)) until the request no longer returned a 200 status.
The product title contains the link to the product's full description. I would now like to "visit" that link and do the main data scrape from within the full description of the product.
I have an array built from the links extracted from the product search page; I'm guessing iterating over this array would be a good starting point.
How would I go about extracting the HTML from the links within the array (i.e. visiting each individual product page and taking the actual product data rather than just the summary from the product search page)?
Here are the current results I'm getting in CSV format:
Link Title Price
example.com/laptop/product1 laptop £400
example.com/laptop/product2 laptop £400
example.com/laptop/product3 laptop £400
example.com/laptop/product4 laptop £400
example.com/laptop/product5 laptop £400
First get all the product page links, then iterate over that list and get whatever info you need from the individual pages. I have only retrieved the specification values here; you can pull whatever values you want.
from bs4 import BeautifulSoup
import requests

all_links = []
url = "https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"

# step 1: collect every product link from the first two search-result pages
for page in range(1, 3):
    res = requests.get(url.format(page)).text
    soup = BeautifulSoup(res, 'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])
print(len(all_links))

# step 2: visit each product page and print its specifications block
for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    if soup.select_one('div.gunDetails'):
        print(soup.select_one('div.gunDetails').text)
The output from each page would look like this:
Specifications
Make:Schultz & Larsen
Model:VICTORY GRADE 2 SPIRAL-FLUTED
Licence:Firearm
Orient.:Right Handed
Barrel:23"
Stock:14"
Weight:7lb.6oz.
Origin:Other
Circa:2017
Cased:Makers-Plastic
Serial #:DK-V11321/P20119
Stock #:190912/002
Condition:Used
Specifications
Make:Howa
Model:1500 MINI ACTION [ 1-7'' ] MDT ORYX CHASSIS
Licence:Firearm
Orient.:Right Handed
Barrel:16"
Stock:13 ½"
Weight:7lb.15oz.
Origin:Other
Circa:2019
Cased:Makers-Plastic
Serial #:B550411
Stock #:190905/002
Condition:New
Specifications
Make:Weihrauch
Model:HW 35
Licence:No Licence
Orient.:Right Handed
Scope:Simmons 3-9x40
Total weight:9lb.3oz.
Origin:German
Circa:1979
Serial #:746753
Stock #:190906/004
Condition:Used
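If you'd rather have those specifications as structured data than raw text, you can split each "Key:Value" line on its first colon. A small sketch, assuming soup is the parsed product page from the loop above and the output keeps the format shown:

# turn the "Key:Value" specification lines into a dict
details = soup.select_one('div.gunDetails')
specs = {}
if details:
    for line in details.text.splitlines():
        if ':' in line:
            key, value = line.split(':', 1)  # split on the first colon only
            specs[key.strip()] = value.strip()
print(specs)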
If you want to fetch the title and price from each link, try this:
from bs4 import BeautifulSoup
import requests

all_links = []
url = "https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"
for page in range(1, 3):
    res = requests.get(url.format(page)).text
    soup = BeautifulSoup(res, 'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])
print(len(all_links))

# visit each product page and pull the title and price
for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    if soup.select_one('h1[itemprop="name"]'):
        print("Title:" + soup.select_one('h1[itemprop="name"]').text)
        print("Price:" + soup.select_one('p.price').text)
Just extract the part of the string which is a URL from the product title, then do:
import requests
res = requests.get(<url-extracted-above>)
res.content
then, using the BeautifulSoup package, do:
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')
and keep iterating, traversing this HTML as a tree. You may refer to this easy-to-follow tutorial on requests and BeautifulSoup: https://www.dataquest.io/blog/web-scraping-tutorial-python/
Hope this helps. I'm not sure I got your question right, but anything here can be done with the urllib2 / requests / BeautifulSoup / json / xml Python libraries when it comes to web scraping/parsing.
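Putting those pieces together for your case, here is a minimal sketch. product_links stands in for the array you already built, and div.description is a placeholder selector you'd replace after inspecting a real product page:

import requests
from bs4 import BeautifulSoup

product_links = [
    'https://www.example.com/laptop/product1',  # ...your extracted links
]

for link in product_links:
    res = requests.get(link)
    res.raise_for_status()
    soup = BeautifulSoup(res.content, 'html.parser')
    description = soup.select_one('div.description')  # placeholder selector
    if description:
        print(description.get_text(strip=True))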
Related
I am new to BeautifulSoup, and I'm trying to extract data from the following website.
https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx
I am trying to extract the hospital bed availability information (along with the detailed breakup) after choosing a particular district and with the 'With available bed only' option selected.
Should I choose the table, the td, the tbody, or the div class for this instance?
My current code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
locations = soup.find('div', {'class': 'col-lg-12 col-md-12 col-sm-12'})
print(locations)
This only prints out a blank output.
I have also tried using tbody and table, but still could not work it out.
Any help would be greatly appreciated!
EDIT: Trying to find a certain element returns []. The code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx').text
soup = BeautifulSoup(html_text, 'lxml')
location = soup.find_all('h5')
print(location)
It is probably a dynamic website, meaning that when you use bs4 to retrieve data you don't get what you see in the browser, because the page updates or loads its content after the initial HTML load.
For these dynamic webpages you should use Selenium and combine it with bs4.
https://selenium-python.readthedocs.io/index.html
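A minimal sketch of that combination, assuming Chrome with a matching chromedriver available on PATH; the fixed sleep and the h5 target are just illustrations:

# Selenium renders the JavaScript, bs4 parses the resulting HTML;
# the 10-second sleep is a crude wait -- WebDriverWait is the more robust option
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('https://excise.wb.gov.in/CHMS/Public/Page/CHMS_Public_Hospital_Bed_Availability.aspx')
time.sleep(10)  # give the page time to load its dynamic content

soup = BeautifulSoup(driver.page_source, 'lxml')
driver.quit()

print(soup.find_all('h5'))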
I'm trying to scrape a random site and get all the text with a certain class off of a page.
from bs4 import BeautifulSoup
import requests
sources = ['https://cnn.com']
for source in sources:
    page = requests.get(source)
    soup = BeautifulSoup(page.content, 'html.parser')
    results = soup.find_all("div", class_='cd_content')
    for result in results:
        title = result.find('span', class_="cd__headline-text vid-left-enabled")
        print(title)
From what I found online, this should work but for some reason, it can't find anything and results is empty. Any help is greatly appreciated.
Upon inspecting the network calls, you see that the page is loaded dynamically via sending a GET request to:
https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl
The HTML is available within the html key of the JSON response:
import requests
from bs4 import BeautifulSoup
URL = "https://www.cnn.com/data/ocs/section/index.html:homepage1-zone-1/views/zones/common/zone-manager.izl"
response = requests.get(URL).json()["html"]
soup = BeautifulSoup(response, "html.parser")
for tag in soup.find_all(class_="cd__headline-text vid-left-enabled"):
    print(tag.text)
Output (truncated):
This is the first Covid-19 vaccine in the US authorized for use in younger teens and adolescents
When the US could see Covid cases and deaths plummet
'Truly, madly, deeply false': Keilar fact-checks Ron Johnson's vaccine claim
These are the states with the highest and lowest vaccination rates
My code successfully scrapes the tags with the table class from https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false
However, there are multiple pages available at the site above, and I would like to scrape all the codes on every page (the first column of the table on each page).
For example, with the URL above, when I click the link to "2" the overall URL does NOT change. I am also not able to find a hidden link for each page; however, I can see all the tables for every page under the page source.
It seems quite similar to this: Scrape multiple pages with BeautifulSoup and Python
However, I cannot find the source for the page number under the Network tab.
How can my code be changed to scrape data from all the available listed pages?
My code that works for page 1 only:
import bs4 as bs
import pickle
import requests

def save_hkex_tickers():
    resp = requests.get('https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false')
    soup = bs.BeautifulSoup(resp.text, "lxml")
    table = soup.find('table', {'class': 'greygeneraltxt'})
    tickers = []
    for row in table.findAll('tr')[2:]:
        ticker = row.findAll('td')[1].text
        tickers.append(ticker)
    print(tickers)
    return tickers

save_hkex_tickers()
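Since the tables for every page are already in the source, one approach worth trying is selecting every matching table with find_all instead of stopping at the first. A sketch built on the code above, assuming each page's table shares the greygeneraltxt class:

# find_all grabs every table with this class (one per page),
# where find only returned the first
import bs4 as bs
import requests

resp = requests.get('https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false')
soup = bs.BeautifulSoup(resp.text, "lxml")

tickers = []
for table in soup.find_all('table', {'class': 'greygeneraltxt'}):
    for row in table.findAll('tr')[2:]:
        cells = row.findAll('td')
        if len(cells) > 1:
            tickers.append(cells[1].text)
print(tickers)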
I am trying to scrape a website, but I've encountered a problem: the HTML I get from Python differs from what I see in the browser inspector. I get this with http://edition.cnn.com/election/results/states/arizona/house/01 while trying to scrape election results. I used this script to check the HTML of the webpage, and the classes I need, like section-wrapper, are missing.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Does anyone know what the problem is?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site uses JavaScript to fetch its data; you can check the URL above.
You can find this URL in Chrome dev tools; there are many requests listed, so check them out:
Chrome >> F12 >> Network tab >> F5 (refresh page) >> double-click the .json URL >> open in a new tab
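A minimal sketch of requesting that JSON endpoint directly (the structure of the payload isn't shown here, so print it first and drill in from there):

import requests

url = 'http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json'
data = requests.get(url).json()
print(data)  # inspect the keys to find the county-level results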
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, 'html.parser')

# you can try all sorts of tags here; I used class "ad" and class "ec-placeholder"
g_data = soup.find_all("div", {"class": "ec-placeholder"})
h_data = soup.find_all("div", {"class": "ad"})

for item in g_data:
    print(item)
# for item in h_data:
#     print(item)
I have a webpage of popular articles which I want to scrape for each quoted webpage's hyperlink and the title of the article it's displaying.
The desired output of my script is a CSV file which lists each title and the article content in one line. So if there are 50 articles on this webpage, I want one file with 50 lines and 100 data points.
My problem here is that the article titles and their hyperlinks are contained in an SVG container, which is throwing me off. I've utilized BeautifulSoup for web scraping before but am not sure how to select each article's title and hyperlink. Any and all help is much appreciated.
import requests
from bs4 import BeautifulSoup
import re
res = requests.get('http://fundersandfounders.com/what-internet-thinks-based-on-media/')
res.raise_for_status()
playFile = open('top_articles.html', 'wb')
for chunk in res.iter_content(100000):
    playFile.write(chunk)
playFile.close()

f = open('top_articles.html')
soup = BeautifulSoup(f, 'html.parser')
links = soup.select('p')  # I know this is where I'm messing up, but I'm not sure which selector to use, so 'p' is a placeholder
print(links)
I am aware that this is in effect a two step project: the current version of my script doesn't iterate through the list of all the hyperlinks whose actual content I'm going to be scraping. That's a second step which I can execute easily on my own, however if anyone would like to write that bit too, kudos to you.
You should do it in two steps:
parse the HTML and extract the link to the SVG
download the SVG page, parse it with BeautifulSoup, and extract the "bubbles"
Implementation:
from urllib.parse import urljoin  # Python 3

import requests
from bs4 import BeautifulSoup

base_url = 'http://fundersandfounders.com/what-internet-thinks-based-on-media/'
with requests.Session() as session:
    # extract the link to the svg
    res = session.get(base_url)
    soup = BeautifulSoup(res.content, 'html.parser')
    svg = soup.select_one("object.svg-content")
    svg_link = urljoin(base_url, svg["data"])

    # download and parse the svg
    res = session.get(svg_link)
    soup = BeautifulSoup(res.content, 'html.parser')
    for article in soup.select("#bubbles .bgroup"):
        title, resource = [item.get_text(strip=True, separator=" ") for item in article.select("a text")]
        print("Title: '%s'; Resource: '%s'." % (title, resource))
Prints article titles and resources:
Title: 'CNET'; Resource: 'Android Apps That Extend Battery Life'.
Title: '5-Years-Old Shoots Sister'; Resource: 'CNN'.
Title: 'Samsung Galaxy Note II'; Resource: 'Engaget'.
...
Title: 'Predicting If a Couple Stays Together'; Resource: 'The Atlantic Magazine'.
Title: 'Why Doctors Die Differently'; Resource: 'The Wall Street Journal'.
Title: 'The Ideal Nap Length'; Resource: 'Lifehacker'.