Scrape multiple page IDs with BeautifulSoup and Python

My code successfully scrapes the table class tags from https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false
However, there are multiple pages available at the site above in which I would like to be able to scrape all the codes in each page. (The first column of the table in each page)
For example, with the URL above, when I click the link to "2" the overall URL does NOT change. I am also not able to find a hidden link for each page; however, I can see the tables for every page in the page source.
It seems quite similar to this: Scrape multiple pages with BeautifulSoup and Python
However, I cannot find a request for the page number under the Network tab.
How can my code be changed to scrape data from all the available listed pages?
My code that works for page 1 only:
import bs4 as bs
import pickle
import requests

def save_hkex_tickers():
    resp = requests.get('https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false')
    soup = bs.BeautifulSoup(resp.text, "lxml")
    table = soup.find('table', {'class': 'greygeneraltxt'})
    tickers = []
    for row in table.findAll('tr')[2:]:
        ticker = row.findAll('td')[1].text
        tickers.append(ticker)
    print(tickers)
    return tickers

save_hkex_tickers()
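Since all the tables appear in the page source, one possible adjustment - a minimal sketch, assuming every per-page table uses the same greygeneraltxt class, which is not confirmed here - is to iterate over every matching table instead of only the first:

import bs4 as bs
import requests

def save_hkex_tickers_all():
    # Sketch only: assumes each paginated table shares the 'greygeneraltxt' class
    resp = requests.get('https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false')
    soup = bs.BeautifulSoup(resp.text, "lxml")
    tickers = []
    for table in soup.find_all('table', {'class': 'greygeneraltxt'}):
        for row in table.find_all('tr')[2:]:
            cells = row.find_all('td')
            if len(cells) > 1:                    # skip spacer/heading rows
                tickers.append(cells[1].text.strip())
    return tickers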

Related

Can't access table and table elements using bs4

So I am trying to scrape the following webpage: https://www.omscentral.com/
The main table there is my item of interest. I want to scrape the table and all of its content. When I inspect the page, the table is in a table tag, so I figured it would be easy to access with the code below.
import requests
from bs4 import BeautifulSoup

url = 'https://www.omscentral.com/'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
soup.find_all('table')
However, that code only returns the table header. I saw a similar example here, but the solution of switching the parser did not work.
When I look at the soup object itself, it seems that requests does not expand the table and only captures the header. Not too sure what to do here - any advice would be much appreciated!
The content is stored in a script tag and rendered dynamically, so you have to extract the data from there.
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']
To display in DataFrame simply use:
pd.DataFrame(data)
Example
import requests, json
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'https://www.omscentral.com/'

soup = BeautifulSoup(requests.get(url, headers=headers).text, 'html.parser')
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']

for item in data:
    print(item['name'], item.get('officialURL'))
Output
Introduction to Information Security https://omscs.gatech.edu/cs-6035-introduction-to-information-security
Computing for Good https://omscs.gatech.edu/cs-6150-computing-good
Introduction to Operating Systems https://omscs.gatech.edu/cs-6200-introduction-operating-systems
Advanced Operating Systems https://omscs.gatech.edu/cs-6210-advanced-operating-systems
Secure Computer Systems https://omscs.gatech.edu/cs-6238-secure-computer-systems
Computer Networks https://omscs.gatech.edu/cs-6250-computer-networks
...
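As a usage note, the pd.DataFrame(data) step mentioned above can be combined with the same extraction - a sketch assuming pandas is installed and that the 'name' and 'officialURL' keys used in the loop are present:

import requests, json
import pandas as pd
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}
soup = BeautifulSoup(requests.get('https://www.omscentral.com/', headers=headers).text, 'html.parser')
data = json.loads(soup.select_one('#__NEXT_DATA__').text)['props']['pageProps']['courses']

# Each course dict becomes one row; columns come from the dict keys
df = pd.DataFrame(data)
print(df[['name', 'officialURL']].head())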

How to get options data with BeautifulSoup?

I am trying to scrape data from option contract pages on Yahoo Finance using BS4. Below is my testing link and the code I wrote to get the info I want.
import requests
from bs4 import BeautifulSoup

def get_option_volume(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    value = soup.findAll(attrs={"data-reactid": "84"})
    for i in value:
        print(i)

get_option_volume("https://finance.yahoo.com/quote/AAPL200221C00320000?p=AAPL200221C00320000")
When I look at the HTML for the page I see that the data I want is in a tag with the attribute data-reactid="84" ...
but when I run the code, value comes back empty. Where am I going wrong? I looked at the documentation and other Stack Overflow posts but cannot figure out where my program fails. It works for other random websites but not this one.
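One quick diagnostic (a sketch, not a confirmed fix) is to check whether the attribute appears in the raw HTML at all; Yahoo Finance renders much of the page with JavaScript, so an element visible in the browser inspector may simply not be in the response that requests receives:

import requests

url = "https://finance.yahoo.com/quote/AAPL200221C00320000?p=AAPL200221C00320000"
# A browser-like User-Agent sometimes changes what Yahoo serves
resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

# If this prints 0, the element the inspector shows is built client-side
# and is not present in the static HTML that requests downloads.
print(resp.text.count('data-reactid'))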

Scraping HTML from array of links

I've pieced together a script which scrapes various pages of products on a product search page and collects the title/price/link to the full description of each product. It was developed with a loop that adds +i to each page URL (www.example.com/search/laptops?page=(1+i)) until the pages stop returning a 200 status.
The product title contains the link to the product's full description - I would now like to "visit" that link and do the main data scrape from within the full description of the product.
I have an array built of the links extracted from the product search page - I'm guessing working off this would be a good starting point.
How would I go about extracting the HTML from the links within the array (ie. visit the individual product page and take the actual product data and not just the summary from the products search page)?
Here are the current results I'm getting in CSV format:
Link Title Price
example.com/laptop/product1 laptop £400
example.com/laptop/product2 laptop £400
example.com/laptop/product3 laptop £400
example.com/laptop/product4 laptop £400
example.com/laptop/product5 laptop £400
First get the links from all pages, then iterate over that list and get whatever info you need from the individual pages. I have only retrieved the specification values here; you can pull whatever values you want.
from bs4 import BeautifulSoup
import requests

all_links = []
url = "https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"
for page in range(1, 3):
    res = requests.get(url.format(page)).text
    soup = BeautifulSoup(res, 'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])

print(len(all_links))

for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    if soup.select_one('div.gunDetails'):
        print(soup.select_one('div.gunDetails').text)
The output from each page would look like this:
Specifications
Make:Schultz & Larsen
Model:VICTORY GRADE 2 SPIRAL-FLUTED
Licence:Firearm
Orient.:Right Handed
Barrel:23"
Stock:14"
Weight:7lb.6oz.
Origin:Other
Circa:2017
Cased:Makers-Plastic
Serial #:DK-V11321/P20119
Stock #:190912/002
Condition:Used
Specifications
Make:Howa
Model:1500 MINI ACTION [ 1-7'' ] MDT ORYX CHASSIS
Licence:Firearm
Orient.:Right Handed
Barrel:16"
Stock:13 ½"
Weight:7lb.15oz.
Origin:Other
Circa:2019
Cased:Makers-Plastic
Serial #:B550411
Stock #:190905/002
Condition:New
Specifications
Make:Weihrauch
Model:HW 35
Licence:No Licence
Orient.:Right Handed
Scope:Simmons 3-9x40
Total weight:9lb.3oz.
Origin:German
Circa:1979
Serial #:746753
Stock #:190906/004
Condition:Used
If you want to fetch the title and price from each link, try this:
from bs4 import BeautifulSoup
import requests

all_links = []
url = "https://www.guntrader.uk/dealers/street/ivythorn-sporting/guns?page={}"
for page in range(1, 3):
    res = requests.get(url.format(page)).text
    soup = BeautifulSoup(res, 'html.parser')
    for link in soup.select('a[href*="/dealers/street"]'):
        all_links.append("https://www.guntrader.uk" + link['href'])

print(len(all_links))

for a_link in all_links:
    res = requests.get(a_link).text
    soup = BeautifulSoup(res, 'html.parser')
    if soup.select_one('h1[itemprop="name"]'):
        print("Title:" + soup.select_one('h1[itemprop="name"]').text)
        print("Price:" + soup.select_one('p.price').text)
Just extract the part of the string that is a URL from the product title.
Then do:
import requests
res = requests.get(<url-extracted-above->)
res.content
Then, using the beautifulsoup package, do:
from bs4 import BeautifulSoup
soup = BeautifulSoup(res.content, 'html.parser')
and keep iterating, treating this HTML as a tree you can navigate. You may refer to this link on requests and BeautifulSoup: https://www.dataquest.io/blog/web-scraping-tutorial-python/
Hope this helps; not sure if I got your question right, but anything here can be done with the urllib2 / requests / BeautifulSoup / json / xml Python libraries when it comes to web scraping and parsing.

Scraping with Python. Can't get wanted data

I am trying to scrape a website, but I encountered a problem. The HTML I get from Python differs from what I see in Chrome's inspector. This happens with http://edition.cnn.com/election/results/states/arizona/house/01, where I tried to scrape election results. I used the script below to check the HTML of the webpage and noticed that they differ: the classes I need, like section-wrapper, are not there.
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Anyone knows what is the problem ?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site uses JavaScript to fetch its data; you can check the URL above.
You can find this URL in Chrome dev tools; there are many such links, so check them out.
Chrome >> F12 >> Network tab >> F5 (refresh page) >> double-click the .json URL >> it opens in a new tab
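A minimal sketch of reading that JSON endpoint directly (assuming it is still served; the structure of the payload is not shown in this answer and would need to be inspected) might be:

import requests

url = 'http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json'
resp = requests.get(url)
data = resp.json()  # parse the JSON payload instead of scraping the rendered HTML

# Inspect the top-level structure to find the result fields you need
print(type(data), list(data) if isinstance(data, dict) else len(data))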
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, 'html.parser')

# you can try all sorts of tags here; I used class "ec-placeholder" and class "ad"
g_data = soup.find_all("div", {"class": "ec-placeholder"})
h_data = soup.find_all("div", {"class": "ad"})

for item in g_data:
    print(item)
# for item in h_data:
#     print(item)

Website Scraping Specific Forms

For an extra curricular school project, I'm learning how to scrape a website. As you can see by the code below, I am able to scrape a form called, 'elqFormRow' off of one page.
How would one go about scraping all occurrences of 'elqFormRow' across the whole website? I'd like to collect the URL where each form was found into a list, but I'm running into trouble because I don't know how, lol.
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')

for div in soup.find_all('div', class_='elqFormRow'):
    print(div.text.strip())
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:
import bs4 as bs
import requests

domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'

# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.findAll('a', href=True)

urls = []
for tag in tags:
    if domain in tag['href']:
        urls.append(tag['href'])
urls.append(initial_url)
print(urls)

# function to grab your info
def scrape_desired_info(url):
    out = []
    text = requests.get(url).text
    soup = bs.BeautifulSoup(text, 'lxml')
    for div in soup.find_all('div', class_='elqFormRow'):
        out.append(div.text.strip())
    return out

info = [scrape_desired_info(url) for url in urls if domain in url]
urllib stinks; use requests. If you need to go multiple levels down in the site, put the URL-finding section in a function and call it X times, where X is the number of levels of links you want to traverse (a rough sketch of this idea appears at the end of this answer).
Scrape responsibly. Try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also not put in the question the page you want to scrape.
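A rough sketch of that multi-level idea, assuming the site's links are absolute URLs (as the code above assumes) and reusing the same domain and initial_url; the depth parameter and crawling loop are illustrative, not the answer's exact method:

import bs4 as bs
import requests

def find_links(url, domain):
    # collect same-domain links from a single page
    soup = bs.BeautifulSoup(requests.get(url).text, 'lxml')
    return [tag['href'] for tag in soup.findAll('a', href=True) if domain in tag['href']]

def crawl(initial_url, domain, depth=2):
    seen = {initial_url}
    frontier = [initial_url]
    for _ in range(depth):                 # one pass per level of links
        next_frontier = []
        for url in frontier:
            for link in find_links(url, domain):
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return seen

# usage: urls = crawl(initial_url, "engage.hpe.com"), then feed into scrape_desired_info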
