Navigation with BeautifulSoup - python

I am slightly confused about how to use BeautifulSoup to navigate the HTML tree.
import requests
from bs4 import BeautifulSoup
url = 'http://examplewebsite.com'
source = requests.get(url)
content = source.content
soup = BeautifulSoup(source.content, "html.parser")
# Now I navigate the soup
for a in soup.find_all('a'):
    print(a.get("href"))
Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are labeled with a certain name, e.g. price in an online catalog.
The href links I want are all in a certain location within the webpage, within the page's <body> and a certain <div>. Can I access only these links?
How can I scrape the contents within each href link and save them to a file?

With BeautifulSoup, that's all doable and simple.
(1) Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are labeled with a certain name, e.g. price in an online catalog.
Say, all the links you need have price in the text - you can use a text argument:
soup.find_all("a", text="price") # text equals to 'price' exactly
soup.find_all("a", text=lambda text: text and "price" in text) # 'price' is inside the text
Yes, you may use functions and many other kinds of objects to filter elements - for example, compiled regular expressions:
import re
soup.find_all("a", text=re.compile(r"^[pP]rice"))
If price is somewhere in the href attribute itself, you can use the following CSS selectors:
soup.select("a[href*=price]") # href contains 'price'
soup.select("a[href^=price]") # href starts with 'price'
soup.select("a[href$=price]") # href ends with 'price'
or, via find_all():
soup.find_all("a", href=lambda href: href and "price" in href)
(2) The href links I want are all in a certain location within the webpage, within the page's <body> and a certain <div>. Can I access only these links?
Sure, locate the appropriate container and call find_all() or other searching methods:
container = soup.find("div", class_="container")
for link in container.select("a[href*=price]"):
    print(link["href"])
Or, you may write your CSS selector so that it searches for links inside a specific element that has the desired attribute or attribute values. For example, here we are searching for a elements having href attributes, located inside a div element having the container class:
soup.select("div.container a[href]")
(3) How can I scrape the contents within each href link and save them to a file?
If I understand correctly, you need to get the appropriate links, follow them, and save the source code of the pages locally into HTML files. There are multiple options to choose from depending on your requirements (for instance, speed may be critical, or it may be a one-time task where you don't care about performance).
If you stay with requests, the code will be blocking in nature: you'll extract a link, follow it, save the page source, and then proceed to the next one. The main downside is that it will be slow (depending, for starters, on how many links there are). Sample code to get you going:
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin
from bs4 import BeautifulSoup
import requests

base_url = 'http://examplewebsite.com'
with requests.Session() as session:  # maintaining a web-scraping session
    soup = BeautifulSoup(session.get(base_url).content, "html.parser")
    for link in soup.select("div.container a[href]"):
        full_link = urljoin(base_url, link["href"])  # resolve relative links
        title = link.get_text(strip=True)  # use the link text as a file name
        with open(title + ".html", "wb") as f:  # "wb": response .content is bytes
            f.write(session.get(full_link).content)
You may look into grequests or Scrapy to solve that part.
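For illustration, here is a minimal sketch of the concurrent variant with grequests (the URL list and the numbered file names are placeholders for this example):
import grequests

# assume full_links holds the absolute links collected as above
full_links = ["http://examplewebsite.com/page1", "http://examplewebsite.com/page2"]
pending = (grequests.get(link) for link in full_links)
for index, response in enumerate(grequests.map(pending)):  # sends the requests concurrently
    if response is not None:  # failed requests come back as None
        with open("page_{}.html".format(index), "wb") as f:
            f.write(response.content)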

Related

Treating a list of items like a single item error: how to find links within each 'link' within string already scraped

I am writing Python code to scrape the PDFs of meetings off this website: https://www.gmcameetings.co.uk
The pdf links are within links, which are also within links. I have the first set of links off the page above; then I need to scrape the links within the new urls.
When I do this I get the following error:
AttributeError: ResultSet object has no attribute 'find_all'. You're
probably treating a list of items like a single item. Did you call
find_all() when you meant to call find()?
This is my code so far which is all fine and checked in jupyter notebook:
# importing libraries and defining
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page, 'lxml')
# creating folder to store pdfs - if not present, create a separate folder
folder_location = r'E:\Internship\WORK'
# getting all meeting hrefs off url
meeting_links = soup.find_all('a', href=True)  # href=True (boolean), not the string 'TRUE'
for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/') > 1:
        print("Meeting!")
This is the line that then receives the error:
second_links = meeting_links.find_all('a', href='TRUE')
I have tried find() as Python suggests, but that doesn't work either. I understand that it can't treat meeting_links as a single item.
So basically: how do you search for links within each element of the new variable (meeting_links)?
I already have code to get the pdfs once I have the second set of urls, and that seems to work fine, but I obviously need to get these first.
Hopefully this makes sense and I've explained it OK - I only properly started using Python on Monday, so I'm a complete beginner.
To get all the meeting links, try:
from bs4 import BeautifulSoup as bs
import requests
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# scrape to find all links
all_links = soup.find_all('a', href=True)
# loop through links to keep those containing '/meetings/'
meeting_links = []
for link in all_links:
    href = link['href']
    if '/meetings/' in href:
        meeting_links.append(href)
print(meeting_links)
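From there, to search for links within each element of meeting_links (the part that raised your error), request each URL and build a new soup for it, instead of calling find_all() on the ResultSet itself. A sketch, assuming the collected hrefs are absolute URLs (otherwise resolve them with urljoin first):
# follow each meeting link and collect the links found on that page
second_links = []
for meeting_url in meeting_links:
    meeting_soup = bs(requests.get(meeting_url).text, 'lxml')
    for sub_link in meeting_soup.find_all('a', href=True):
        second_links.append(sub_link['href'])
print(second_links)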
The .find() method suggested in the error message is specific to Beautiful Soup objects (and it only works on a single element, not on a whole ResultSet). To find a substring within a string, just use native Python: 'a' in 'abcd'.
Hope that helps!

Extract links from html page using BeautifulSoup

I need to extract some articles from the Biography website.
So, from this page http://www.biography.com/people I need all the sublinks.
for example:
/people/ryan-seacrest-21095899
/people/edgar-allan-poe-9443160
but I have two problems:
1- When I try to find all <a> tags, I can't find the hrefs that I need.
import urllib2
from BeautifulSoup import BeautifulSoup
url = "http://www.biography.com/people"
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)
links = soup.findAll('a')  # BS3 uses findAll; in bs4 this would be find_all
for link in links:
    print(link)
2- There is a "see more" button, so how can I get the links for all the people on the website, not just those that appear on the first page?
The site you show uses Angular, and part of the content is generated with JS. BeautifulSoup does not execute JS. You need to use http://selenium-python.readthedocs.io/ or a similar instrument. Alternatively, you can dig into the AJAX GET (or maybe POST) request the page makes in the browser's network tab and fetch the data through it directly.
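For illustration, a minimal sketch of the Selenium route (the fixed sleep and the href prefix filter are assumptions; the live page may need an explicit wait, and clicking "see more" repeatedly would need extra driver interaction on top of this):
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("http://www.biography.com/people")
time.sleep(5)  # crude wait for the JS-rendered content to load
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()
# keep only the /people/ sublinks the question asks about
for link in soup.select('a[href^="/people/"]'):
    print(link["href"])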

Scraping with Python. Can't get wanted data

I am trying to scrape a website, but I have encountered a problem: when I try to scrape the data, the HTML seems to differ between what I see in Google's inspector and what I get from Python. This is with http://edition.cnn.com/election/results/states/arizona/house/01 where I tried to scrape the election results. I used this script to check the HTML of the webpage, and I noticed that they are different - there are none of the classes that I need, like section-wrapper.
page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Does anyone know what the problem is?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site uses JavaScript to fetch its data; you can check the URL above.
You can find this URL in Chrome dev tools - there are many requests there, so check them out:
Chrome >> F12 >> Network tab >> F5 (refresh page) >> double-click the .json url >> open in a new tab
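For instance, you can request that JSON endpoint directly instead of parsing the HTML (a sketch; inspect the structure of the response before relying on any particular keys):
import requests

json_url = "http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json"
data = requests.get(json_url).json()  # parse the JSON payload directly
# look at the top-level structure before digging for the results
print(type(data), data.keys() if isinstance(data, dict) else len(data))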
import requests
from bs4 import BeautifulSoup

page = requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
# you can try all sorts of tags here; I used class "ad" and class "ec-placeholder"
g_data = soup.find_all("div", {"class": "ec-placeholder"})
h_data = soup.find_all("div", {"class": "ad"})
for item in g_data:
    print(item)
# for item in h_data:
#     print(item)

Extract Link URL After Specified Element with Python and Beautifulsoup4

I'm trying to extract a link from a page with Python and the Beautiful Soup library, but I'm stuck. The link is on the following page, in the sidebar area, directly underneath the h4 subtitle "Original Source:":
http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php
I've managed to isolate the link (mostly), but I'm unsure of how to further advance my targeting to actually extract the link. Here's my code so far:
import requests
from bs4 import BeautifulSoup
url = "http://www.eurekalert.org/pub_releases/2016-06/uonc-euc062016.php"
data = requests.get(url)
soup = BeautifulSoup(data.text, 'lxml')
source_url = soup.find('section', class_='widget hidden-print').find('div', class_='widget-content').findAll('a')[-1]
print(source_url)
I am currently getting the full html of the last element that I've isolated, when I'm trying to simply get the link. Of note, this is the only link on the page I'm trying to get.
You're looking for the link, which is in the href HTML attribute. source_url is a bs4.element.Tag, which has a get method:
source_url.get('href')
You almost got it!!
SOLUTION 1:
You just have to read the .text attribute of the Tag you've assigned to source_url (this works here because the link's visible text happens to be the URL itself).
So instead of:
print(source_url)
You should use:
print(source_url.text)
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense
SOLUTION 2:
You should call source_url.get('href') to get only the href attribute of the element you found - this is the more robust option, since it reads the link destination rather than the link text.
print(source_url.get('href'))
Output:
http://news.unchealthcare.org/news/2016/june/e-cigarette-use-can-alter-hundreds-of-genes-involved-in-airway-immune-defense

How to use Beautiful soup to return destination from HTML anchor tags

I am using Python 2 and Beautiful Soup to parse HTML retrieved using the requests module:
import requests
from bs4 import BeautifulSoup
site = requests.get("http://www.stackoverflow.com/")
HTML = site.text
links = BeautifulSoup(HTML).find_all('a')
This returns a list whose items look like <a href="...">Navigate</a>.
The content of the href attribute for each anchor tag can be in several forms: for example, it could be a javascript call on the page, it could be a relative address to a page on the same domain (/next/one/file.php), or it could be an absolute web address (http://www.stackoverflow.com/).
Using BeautifulSoup, is it possible to return the web addresses of both the relative and absolute addresses in one list, excluding all javascript calls and such, leaving only navigable links?
From the BS docs:
One common task is extracting all the URLs found within a page’s <a> tags:
for link in soup.find_all('a'):
    print(link.get('href'))
You can filter out the href="javascript:whatever()" cases like this:
hrefs = []
for link in soup.find_all('a'):
    if link.has_attr('href') and not link['href'].lower().startswith('javascript:'):
        hrefs.append(link['href'])
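To also turn the relative addresses into full URLs in the same pass, resolve each href against the base URL; urljoin leaves absolute URLs untouched. A sketch (on Python 2 the import comes from urlparse instead):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

base_url = "http://www.stackoverflow.com/"
soup = BeautifulSoup(requests.get(base_url).text, "html.parser")
navigable = []
for link in soup.find_all('a'):
    if link.has_attr('href') and not link['href'].lower().startswith('javascript:'):
        navigable.append(urljoin(base_url, link['href']))  # relative paths become absolute here
print(navigable)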