Access attributes with BeautifulSoup and print - Python

I'd like to scrape a site to find all title attributes inside h2 tags:
<h2 class="1">Titanic_Caprio</h2>
Using this code, I'm accessing the entire h2 tag:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.example.it"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
links = soup.findAll('h2')
print "".join([str(x) for x in links] )
Using findAll('h2', attrs = {'title'}) gives no results. What am I doing wrong? And how can I print the whole list of titles to a file?

The problem is that title is not an attribute of the h2 tag, but of a tag nested inside it. So you must first search for <h2> tags, and then for sub-tags having a title attribute:
titles = []
h2_list = soup.findAll('h2')
for h2 in h2_list:
    titles.extend(h2.findAll(lambda x: x.has_attr('title')))
It works because BeautifulSoup can use functions as search filters.
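To also answer the second part of the question, here is a minimal sketch that collects the title values themselves and writes them to a file (output.txt is an assumed filename, not from the question):
titles = []
for h2 in soup.findAll('h2'):
    for tag in h2.findAll(lambda x: x.has_attr('title')):
        titles.append(tag['title'])  # the attribute value, not the whole tag

# write one title per line to an assumed output file
with open('output.txt', 'w') as f:
    f.write('\n'.join(titles))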

You need to pass key-value pairs in attrs:
findAll('h2', attrs={"key": "value"})
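For the example markup in the question, that would look like this (the class value "1" comes from the HTML snippet above):
soup.findAll('h2', attrs={'class': '1'})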

Related

Which element to use in Selenium?

I want to find "Moderat" in <p class="text-spread-level">Moderat</p>
I have tried with id, name, XPath, and link text.
Would you like to try this?
from bs4 import BeautifulSoup
import requests

sentences = []
res = requests.get(url)  # assign your url to a variable first
soup = BeautifulSoup(res.text, "lxml")
tag_list = soup.select("p.text-spread-level")
for tag in tag_list:
    sentences.append(tag.text)
print(sentences)
Find the element by class name and get the text.
el = driver.find_element_by_class_name('text-spread-level')
val = el.text
print(val)
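For context, a self-contained sketch with the Selenium boilerplate around it (a sketch only: it assumes chromedriver is on your PATH, the URL is a placeholder, and it uses the same Selenium 3-style API as the answer; newer Selenium versions use driver.find_element(By.CLASS_NAME, ...)):
from selenium import webdriver

driver = webdriver.Chrome()           # assumes chromedriver is installed
driver.get('http://www.example.com')  # placeholder URL, not from the question
el = driver.find_element_by_class_name('text-spread-level')
print(el.text)                        # should print "Moderat" on the real page
driver.quit()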

Scraping href not working with Python

I have copied versions of this very code, and every time I copy it line by line it isn't working right. I am more than frustrated and can't seem to figure out where it is going wrong. What I am trying to do is go to a website and scrape the different ratings pages, which are labelled A, B, C... etc. Then I go to each site to pull the total number of pages they are using. I am trying to scrape the <span class='letter-pages' href='/ratings/A/1' and so on. What am I doing wrong?
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
hrefs = []
ratings = []
ks = []
pages_scrape = []
for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])
for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)
    # elif good_ratings.startswith('/401k'):
    #     ks.append(url[:-9] + good_ratings)
del ratings[0]
del ratings[27:]
print(ratings)
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find('span', class_='letter-pages'):
        # Not working here
        pages_scrape.append(href.attrs['href'])
        # Will print all the anchor tags with hrefs if I remove the above line.
        print(href)
You are trying to get the href prematurely: you are extracting the attribute directly from a span tag that has nested a tags, rather than from the a tags themselves.
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    for a in span.find_all('a'):
        href = a.get('href')
        pages_scrape.append(href)
I didn't test this on all pages, but it worked for the first one. You pointed out that on some of the pages the content wasn't getting scraped, which is due to the span search returning None. To get around this you can do something like:
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if span:
        for a in span.find_all('a'):
            href = a.get('href')
            pages_scrape.append(href)
            print(href)
    else:
        # page is a Response object, so use the URL string instead
        print('span.letter-pages not found on ' + each_rating)
Depending on your use case you might want to do something different, but this will indicate to you which pages don't match your scraping model and need to be manually investigated.
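For instance, you could collect the misses in a list instead of printing them as you go (a sketch; failed_pages is a name introduced here, not from the original code):
failed_pages = []  # hypothetical list of URLs that did not match the model
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if span:
        for a in span.find_all('a'):
            pages_scrape.append(a.get('href'))
    else:
        failed_pages.append(each_rating)
print(failed_pages)  # inspect these URLs manually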
You probably meant to do find_all instead of find -- so change
for href in soup.find('span', class_='letter-pages'):
to
for href in soup.find_all('span', class_='letter-pages'):
You want to be iterating over a list of tags, not a single tag. find gives you a single Tag object; when you iterate over a single tag, you iterate over its children (Tag and NavigableString objects). find_all gives you the list of tag objects you want.
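A small sketch showing the difference, using the markup pattern from the question:
from bs4 import BeautifulSoup

html = "<span class='letter-pages'><a href='/ratings/A/1'>1</a></span>"
soup = BeautifulSoup(html, 'html.parser')

print(type(soup.find('span')))      # <class 'bs4.element.Tag'> -- one tag
print(type(soup.find_all('span')))  # <class 'bs4.element.ResultSet'> -- a list of tags

# iterating the single tag walks its children, not a list of spans:
for child in soup.find('span'):
    print(type(child))  # here the child is the <a> Tag; text nodes would be NavigableStrings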

BeautifulSoup to find unofficial HTML tags/attributes

In my job we are using tags that we have created. One of the tags, called can-edit, looks like this in the code (for example):
<h1 can-edit="banner top text" class="mainText">some text</h1>
<h2 can-edit="banner bottom text" class="bottomText">some text</h2>
It could be inside any tag (img, p, h1, h2, div...).
What I wish to get is all the can-edit values within a page; for example, with the HTML above:
['banner top text', 'banner bottom text']
I've tried:
soup = BeautifulSoup(html, "html.parser")
can_edits = soup.find_all("can-edit")
But it is not finding any.
The reason this does not work is that find_all("can-edit") looks for a tag with the name can-edit, i.e. <can-edit ...>, and no such tag exists here; can-edit is an attribute, not a tag.
You can use the find_all function of the soup to find all tags with a certain attribute. For example:
soup.find_all(attrs={'can-edit': True})
So here we use the attrs parameter and pass it a filter that says we want tags that have a can-edit attribute. This gives us a list of tags with a can-edit attribute (regardless of the value). If we then want the value of that attribute, we can take the ['can-edit'] item of each tag, so we can write a list comprehension:
all_can_edit_attrs = [tag['can-edit']
                      for tag in soup.find_all(attrs={'can-edit': True})]
Or a full working version:
from bs4 import BeautifulSoup

s = """<h1 can-edit="banner top text" class="mainText">some text</h1>
<h2 can-edit="banner bottom text" class="bottomText">some text</h2>"""
soup = BeautifulSoup(s, 'lxml')
all_can_edit_attrs = [tag['can-edit']
                      for tag in soup.find_all(attrs={'can-edit': True})]

Getting a specific div tag with Beautiful Soup

Usually I would just find the div by a class name, but it's not unique. The only unique thing this div tag has is the word "data-sc-replace" right after div. This is a shortened example of the source code:
<div data-sc-replace data-sc-slot="1234" class = "inlineblock" data-sc-params="{'magnet': 'magnet:?......'extension': 'epub', 'stream': '' }"></div>
How would I go about targeting "data-sc-replace" if it's not attached to a class or an id?
This is the code I have
import requests
from bs4 import BeautifulSoup
url_to_scrape = "http://example.com"
r = requests.get(url_to_scrape)
soup = BeautifulSoup(r.text, "html5lib")
list = soup.findAll('div', {'class':'inlineblock'})
print(list)
# list = soup.findAll("div", "data-sc-params")
# list = soup.find('data-sc-replace')
# list = soup.find('data-sc-params')
# list = soup.find('div', {'class':'inlineblock'}, 'data-sc-params')
Use CSS selectors. This finds all divs with a data-sc-replace attribute:
result = soup.select('div[data-sc-replace]')
That distinctive mark seems to be an HTML attribute without a value. So try this:
soup.find('div', attrs = {'data-sc-replace': ''})
# or use find_all() to get all such div containers
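Once you have the div, its other attributes can be read the usual way (a sketch; the attribute names come from the snippet in the question):
div = soup.find('div', attrs={'data-sc-replace': ''})
if div:
    print(div.get('data-sc-slot'))    # '1234'
    print(div.get('data-sc-params'))  # the string containing the magnet link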

BeautifulSoup to retrieve the href list

Thanks for your attention!
I'm trying to retrieve the hrefs of products in a search result.
For example, this page:
However, when I narrow down to the product-image class, the retrieved hrefs are image links....
Can anyone solve that? Thanks in advance!
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5'
content = urllib2.urlopen(url).read()
content = preprocess_yelp_page(content)  # the asker's own helper function
soup = BeautifulSoup(content)
content = soup.findAll('div', {'class': 'content dynamic'})
draft = str(content)
soup = BeautifulSoup(draft)
items = soup.findAll('div', {'class': 'cell_section1'})
draft = str(items)
soup = BeautifulSoup(draft)
content = soup.findAll('div', {'class': 'product-image'})
draft = str(content)
soup = BeautifulSoup(draft)
You don't need to load the content of each found tag into BeautifulSoup over and over again.
Use a CSS selector to get all product links (the a tag directly under a div with class="product-image"):
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.homedepot.com/b/Husky/N-5yc1vZrd/Ntk-All/Ntt-chest%2Band%2Bcabinet?Ntx=mode+matchall&NCNI-5'
soup = BeautifulSoup(urllib2.urlopen(url))
for link in soup.select('div.product-image > a:nth-of-type(1)'):
    print link.get('href')
Prints:
http://www.homedepot.com/p/Husky-41-in-16-Drawer-Tool-Chest-and-Cabinet-Set-HOTC4016B1QES/205080371
http://www.homedepot.com/p/Husky-26-in-6-Drawer-Chest-and-Cabinet-Combo-Black-C-296BF16/203420937
http://www.homedepot.com/p/Husky-52-in-18-Drawer-Tool-Chest-and-Cabinet-Set-Black-HOTC5218B1QES/204825971
http://www.homedepot.com/p/Husky-26-in-4-Drawer-All-Black-Tool-Cabinet-H4TR2R/204648170
...
The div.product-image > a:nth-of-type(1) CSS selector matches the first a tag directly under every div with class product-image.
To save the links into a list, use a list comprehension:
links = [link.get('href') for link in soup.select('div.product-image > a:nth-of-type(1)')]
