Web scraping data from pages without a consistent layout using BeautifulSoup - Python

I'm trying to scrape the stats from Bulbapedia.
I want to get this table from each page.
The table isn't in a fixed place on the page, and sometimes there are multiple copies of it.
I want my script to look for the table on the page and, if it finds one, return that element tag and ignore the other ones.
Here are some pages with the table in different places:
page 1
page 2
page 3
I just want to select the table element, and then I will extract the data I need.

When working with wiki pages that don't have specific ids or classes, what you really want to do is find some specific characteristic that distinguishes the elements you care about from everything else.
In your case, if we analyze all three pages, the stat table always contains an <a> tag whose href is /wiki/Statistic.
Therefore, to find this specific table, you have two options:
find each table that contains an <a> tag with href equal to /wiki/Statistic
find the parent table of each link whose href equals /wiki/Statistic
Here is an example of code:
from bs4 import BeautifulSoup
import requests

pages = [
    'https://bulbapedia.bulbagarden.net/wiki/Charmander_(Pokémon)',
    'https://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pokémon)',
    'https://bulbapedia.bulbagarden.net/wiki/Eternatus_(Pokémon)'
]

for page in pages:
    response = requests.get(page)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Option 1: every table that contains a link to /wiki/Statistic
    stat_tables = [table for table in soup.find_all('table')
                   if table.find('a', href='/wiki/Statistic') is not None]
    # OR
    # Option 2: the parent table of every link to /wiki/Statistic
    stat_tables = [a.find_parent('table') for a in soup.find_all('a', href='/wiki/Statistic')]

    for table in stat_tables:
        # Parse table
        ...
Since you said you just want to select the table, I'm leaving the parsing part to you :)
However, if you have any questions, please feel free to ask.
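If it helps to get started, here is a minimal parsing sketch that could replace the # Parse table placeholder inside the loop above. It only prints the text of each row's cells; the actual layout of Bulbapedia's stat rows is an assumption here, so adjust the indexing once you see the real cell contents.
    # Minimal sketch (assumed layout): print the text of every cell in each row
    for table in stat_tables:
        for row in table.find_all('tr'):
            cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
            if cells:
                print(cells)  # e.g. a stat name followed by its value(s)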

Related

How do I further narrow my search with BeautifulSoup?

So I'm just learning how to scrape websites for data, and I have one I want to scrape to populate a database to do something with for practice. The site doesn't have all the information posted in one big page I can scrape; instead, the information is broken up into multiple "sets", with each set having its own page/collection of the data I want to scrape. All the "sets" are listed on a single page, though, with each set listed alongside the link to its individual page. I figured my best bet would be to scrape the "sets" page for their URLs, and then request each "set" page to collect the data I'm trying to get. Checking the HTML, each set is listed in a container, with the URL being the first thing listed within each section, like this:
<td class="flexbox">
<a href="url_i_need">
<more stuff i don't need>
</td>
<repeats_as_above_for_next_set>
What I've tried is:
response = requests.get('site_url')
content = response.content
soup = BeautifulSoup(content, 'html.parser')
data = soup.find_all('td', 'flexbox')
This seems to do the trick for scraping each of the td sections, but nothing I try lets me narrow the data down further to just the portion I need. After narrowing my search down to the general section I care about, how do I scrape the URL from each of those sections?
You can loop over data and, for each item, find the nested a tag:
for item in data:
    link = item.find('a')
    url = link['href']
Btw, is this line correct?
soup = BeautifulSoup.find_all(content, 'html.parser')
The standard way is this one:
soup = BeautifulSoup(content, 'html.parser')
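Putting both pieces together, a rough sketch of the full step might look like the following. Note that 'site_url' is just the placeholder from your question, and whether the hrefs are relative or absolute is an assumption, which is why urljoin is used.
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'site_url'  # placeholder from the question, not a real address
response = requests.get(base_url)
soup = BeautifulSoup(response.content, 'html.parser')

set_urls = []
for item in soup.find_all('td', 'flexbox'):
    link = item.find('a')
    if link is not None and link.get('href'):
        # urljoin handles both relative and absolute hrefs
        set_urls.append(urljoin(base_url, link['href']))

# then request each "set" page and scrape whatever you need from it
for set_url in set_urls:
    set_soup = BeautifulSoup(requests.get(set_url).content, 'html.parser')
    # ... parse the set page here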

How do I use Bs4 to pull similar information but from different places in DOM hierarchy?

I'm trying to scrape information from a series of pages like these two:
https://www.nysenate.gov/legislation/bills/2019/s240
https://www.nysenate.gov/legislation/bills/2019/s8450
What I want to do is build a scraper that can pull down the text of "See Assembly Version of this Bill". In the two links listed above, the classes are the same, but on one page it's the only instance of that class, while on the other it's the third.
I'm trying to make something like this work:
assembly_version = soup.select_one(".bill-amendment-detail content active > dd")
print(assembly_version)
But I keep getting None
Any thoughts?
import requests
from bs4 import BeautifulSoup

url = "https://www.nysenate.gov/legislation/bills/2019/s11"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
assembly_version = soup.find(class_="c-block c-bill-section c-bill--details").find("a").text.strip()
print(assembly_version)
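Because the block isn't always in the same position and may not exist on every page, a slightly more defensive variant could look like the sketch below. The shorter partial-class selector (.c-bill--details) is an assumption on my part, so verify it against the pages you are scraping.
import requests
from bs4 import BeautifulSoup

urls = [
    "https://www.nysenate.gov/legislation/bills/2019/s240",
    "https://www.nysenate.gov/legislation/bills/2019/s8450",
]

for url in urls:
    soup = BeautifulSoup(requests.get(url).content, "html.parser")
    # select_one returns None instead of raising when the block is missing
    link = soup.select_one(".c-bill--details a")
    if link is not None:
        print(url, "->", link.get_text(strip=True))
    else:
        print(url, "-> no assembly version link found")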

Scrape multiple page ids with BeautifulSoup and Python

My code successfully scrapes the table class tags from https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false
However, there are multiple pages available at the site above, and I would like to scrape all the codes on each page (the first column of the table on each page).
For example, with the url above, when I click the link to "2" the overall url does NOT change. I am also not able to find a hidden link for each page; however, I am able to see all the tables for every page in the page source.
It seems quite similar to this: Scrape multiple pages with BeautifulSoup and Python
However, I cannot find the request for the page number under the Network tab.
How can my code be changed to scrape data from all the available listed pages?
My code that works for page 1 only:
import bs4 as bs
import pickle
import requests

def save_hkex_tickers():
    resp = requests.get('https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false')
    soup = bs.BeautifulSoup(resp.text, "lxml")
    table = soup.find('table', {'class': 'greygeneraltxt'})
    tickers = []
    for row in table.findAll('tr')[2:]:
        ticker = row.findAll('td')[1].text
        tickers.append(ticker)
    print(tickers)
    return tickers

save_hkex_tickers()
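There is no guarantee this works for your site, but as a rough sketch: since you say every page's table is already visible in the page source, you could try looping over all tables with the greygeneraltxt class instead of only the first one. Whether the other pages' rows really are in that single HTML response (and share the same header layout) is an assumption you would need to confirm.
import bs4 as bs
import requests

resp = requests.get('https://www.hsi.com.hk/HSI-Net/HSI-Net?cmd=tab&pageId=en.indexes.hscis.hsci.constituents&expire=false&lang=en&tabs.current=en.indexes.hscis.hsci.overview_des%5Een.indexes.hscis.hsci.constituents&retry=false')
soup = bs.BeautifulSoup(resp.text, "lxml")

# Sketch only: assumes every page's 'greygeneraltxt' table is in this one response
tickers = []
for table in soup.find_all('table', {'class': 'greygeneraltxt'}):
    for row in table.find_all('tr')[2:]:  # [2:] skips header rows, as in your code
        cells = row.find_all('td')
        if len(cells) > 1:
            tickers.append(cells[1].text)
print(tickers)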

Web Scraping with Python - Looping over city names, clicking, and getting the value of interest

This is my first time with Python and web scraping. I have been looking around and am still unable to figure out what I need to do.
Below is a screenshot of the elements that I've inspected via Chrome.
As you can see, it is from the dropdown 'Apartments'.
My 1st step is to get the list of cities from the dropdown.
My 2nd step is then, from the given city list, to go to each of them (...url.../Brantford/ for example).
My 3rd step is then, given the available apartments, to click each of the available apartments to get the price range for each bedroom type.
Currently, I am JUST trying to loop through the cities in the first step, and it's not working.
Could you also point me to a good forum, article, or tutorial that's suitable for a beginner like me to read and learn from? I'd really like to get good at this so that I may give back to society one day.
Thank you!
import requests
from bs4 import BeautifulSoup
url = 'http://www.homestead.ca/apartments-for-rent/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,'lxml')
dropdown_list = soup.find(".child-pages dropdown-menu a href")
print (dropdown_list.prettify())
You can access the elements by the class and a child "a" node, then read the "href" attribute and prepend the domain name.
import requests
from bs4 import BeautifulSoup

url = 'http://www.homestead.ca/apartments-for-rent/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')

dropdown_list = soup.select(".primary .child-pages a")

links = ['http://www.homestead.ca' + x['href'] for x in dropdown_list]
print(links)

city_names = [x.text for x in dropdown_list]
print(city_names)

result = []
for link in links:
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    ...
    result.append(...)
Explanation:
soup.select(".primary .child-pages a")
Using a CSS selector, I select the "a" nodes that are descendants of a node with the class "child-pages", which is itself a descendant of the node with the class "primary". There were two nodes with the class "child-pages", and I filtered for the one that was under the node with the "primary" class.
[x.text for x in dropdown_list]
This is a list comprehension in Python. It means that I take every element of dropdown_list, keep only the text attribute of each one, and return the results as a list.
You can then iterate over the links and append the data to a list (here "result").
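For illustration only, here is a tiny self-contained example (the HTML is made up) showing why the .primary prefix matters: it keeps only the links under the intended .child-pages block.
from bs4 import BeautifulSoup

# Made-up HTML: two ".child-pages" blocks, only one of them inside ".primary"
html = """
<div class="primary">
  <ul class="child-pages"><li><a href="/city-a/">City A</a></li></ul>
</div>
<div class="footer">
  <ul class="child-pages"><li><a href="/city-b/">City B</a></li></ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

print([a['href'] for a in soup.select(".child-pages a")])           # ['/city-a/', '/city-b/']
print([a['href'] for a in soup.select(".primary .child-pages a")])  # ['/city-a/']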
I found this introduction to BeautifulSoup pretty good, even without going through all of its links: http://programminghistorian.org/lessons/intro-to-beautiful-soup
I would also recommend reading a book. For example this one: Web Scraping with Python: Collecting Data from the Modern Web

Navigation with BeautifulSoup

I am slightly confused about how to use BeautifulSoup to navigate the HTML tree.
import requests
from bs4 import BeautifulSoup

url = 'http://examplewebsite.com'
source = requests.get(url)
content = source.content
soup = BeautifulSoup(content, "html.parser")

# Now I navigate the soup
for a in soup.findAll('a'):
    print(a.get("href"))
Is there a way to find only particular href by the labels? For example, all the href's I want are called by a certain name, e.g. price in an online catalog.
The href links I want are all in a certain location within the webpage, inside a particular element of the page with a certain class. Can I access only these links?
How can I scrape the contents within each href link and save into a file format?
With BeautifulSoup, that's all doable and simple.
(1) Is there a way to find only particular href by the labels? For
example, all the href's I want are called by a certain name, e.g.
price in an online catalog.
Say, all the links you need have price in the text - you can use a text argument:
soup.find_all("a", text="price") # text equals to 'price' exactly
soup.find_all("a", text=lambda text: text and "price" in text) # 'price' is inside the text
Yes, you may use functions and many other kinds of objects to filter elements, such as compiled regular expressions:
import re
soup.find_all("a", text=re.compile(r"^[pP]rice"))
If price is somewhere in the "href" attribute, you can use the following CSS selectors:
soup.select("a[href*=price]") # href contains 'price'
soup.select("a[href^=price]") # href starts with 'price'
soup.select("a[href$=price]") # href ends with 'price'
or, via find_all():
soup.find_all("a", href=lambda href: href and "price" in href)
(2) The href links I want are all in a certain location within the
webpage, inside a particular element of the page with a certain class. Can I
access only these links?
Sure, locate the appropriate container and call find_all() or other searching methods:
container = soup.find("div", class_="container")
for link in container.select("a[href*=price]"):
    print(link["href"])
Or, you may write your CSS selector so that it searches for links inside a specific element with the desired attribute or attribute values. For example, here we are searching for a elements that have href attributes and are located inside a div element with the container class:
soup.select("div.container a[href]")
(3) How can I scrape the contents within each href link and save into
a file format?
If I understand correctly, you need to get appropriate links, follow them and save the source code of the pages locally into HTML files. There are multiple options to choose from depending on your requirements (for instance, speed may be critical. Or, it's just a one-time task and you don't care about performance).
If you stay with requests, the code would be of a blocking nature: you'll extract a link, follow it, save the page source, and then proceed to the next one. The main downside is that it would be slow (depending on, for starters, how many links there are). Sample code to get you going:
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://examplewebsite.com'

with requests.Session() as session:  # maintaining a web-scraping session
    soup = BeautifulSoup(session.get(base_url).content, "html.parser")
    for link in soup.select("div.container a[href]"):
        full_link = urljoin(base_url, link["href"])
        title = link.get_text(strip=True)
        with open(title + ".html", "w") as f:
            f.write(session.get(full_link).text)
You may look into grequests or Scrapy to solve that part.
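If you would rather not pull in grequests or Scrapy just for this, here is a rough sketch of the same idea using the standard library's concurrent.futures thread pool. The selector and the file-naming scheme are carried over from the snippet above, so they remain assumptions about your actual page.
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

base_url = 'http://examplewebsite.com'

def download(target):
    # target is a (full_link, title) pair; save the page source to <title>.html
    full_link, title = target
    response = requests.get(full_link)
    with open(title + ".html", "w") as f:
        f.write(response.text)

soup = BeautifulSoup(requests.get(base_url).content, "html.parser")
targets = [(urljoin(base_url, a["href"]), a.get_text(strip=True))
           for a in soup.select("div.container a[href]")]

# Fetch several pages at once instead of one after another
with ThreadPoolExecutor(max_workers=5) as executor:
    list(executor.map(download, targets))  # list() so any exceptions surface here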
