Targeting the third list item with Beautiful Soup - Python

I'm scraping a website with Beautiful Soup and am having trouble targeting an item in a span tag nested within an li tag. The website I'm trying to scrape uses the same classes for every list item, which makes it harder. The HTML looks something like this:
<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>
My first thought was to target it using nth-of-type(). I found a similar question here, but it hasn't helped. I've been playing with it for a while now, and my code basically looks like this:
import requests
from bs4 import BeautifulSoup
url = "url of website I'm scraping"
headers = {User-Agent Header}
for page in range(1):
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, features="lxml")
    scrape = soup.find_all('div', class_='even_bigger_container_not_included_in_html_above')
    for item in scrape:
        condition = soup.find('li:nth-of-type(2)', 'span:nth-of-type(1)').text
        print(condition)
Any help is greatly appreciated!

To use a CSS Selector, use the select() method, not find().
So to get the third <li>, use li:nth-of-type(3) as a CSS Selector:
from bs4 import BeautifulSoup
html = """<div class="bigger-container">
<div class="smaller-container">
<ul class="ulclass">
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
<li>
<span class="description"></span>
<span class="item">**This is the only tag I want to scrape**</span>
</li>
<li>
<span class="description"></span>
<span class="item"></span>
</li>
</ul>"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one("li:nth-of-type(3)").get_text(strip=True))
Output:
**This is the only tag I want to scrape**
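If the description span also contains text, calling get_text() on the whole <li> would return both spans. A minimal sketch of narrowing the selector to the span itself follows; the sample markup and its text values are made up for illustration and are not from the site in the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: same structure as the question, with placeholder text.
html = """<ul class="ulclass">
<li><span class="description">a</span><span class="item">first</span></li>
<li><span class="description">b</span><span class="item">second</span></li>
<li><span class="description">c</span><span class="item">third</span></li>
<li><span class="description">d</span><span class="item">fourth</span></li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")

# Select only the span with class "item" inside the third <li>.
target = soup.select_one("ul.ulclass > li:nth-of-type(3) > span.item")
third_item = target.get_text(strip=True) if target is not None else None
print(third_item)

# Equivalent positional lookup: index into the list of <li> tags.
items = soup.select("ul.ulclass > li")
by_index = items[2].select_one("span.item").get_text(strip=True)
print(by_index)
```

select_one() returns None when nothing matches, so the None check avoids an AttributeError on pages where the list is shorter than expected.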

Related

How to find ahref link inside a <li>

<div class="body">
<ul class = "graph">
<li>
<a href = "Address one"> Text1
</a>
</li>
<li>
<a href = "Address two"> Text2
</a>
</li>
<li>
<a href = "Address three"> Text3
</a>
</li>
</ul>
</div>
I am doing a web scraping project and am having trouble extracting the href links above.
Right now I have:
from bs4 import BeautifulSoup as soup
import requests
page = requests.get(url)
content = soup(page.content, "html.parser")
I tried using the find_all('a') and get('href') functions, but they don't seem to work in this situation.
Hope this helps:
for x in content.find_all('li'):
    href = x.find('a').get('href')
    print(href)
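The loop above raises an AttributeError if any <li> has no <a> inside it, because find() returns None in that case. One possible alternative, sketched here with made-up list items added to show the edge cases, is a CSS selector that only matches anchors that actually carry an href attribute:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the last two items are added to show the edge cases.
html = """<div class="body">
<ul class="graph">
<li><a href="Address one"> Text1</a></li>
<li><a href="Address two"> Text2</a></li>
<li><a> no href here</a></li>
<li>no anchor at all</li>
</ul>
</div>"""

content = BeautifulSoup(html, "html.parser")

# a[href] matches only <a> tags that have an href attribute.
hrefs = [a["href"] for a in content.select("ul.graph li a[href]")]
print(hrefs)
```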

Select specific tag on BS4 Python

I have the following HTML
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
I use this code to get the data
linksize = soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
    for size_available in size.find_all('a', {'class':['product-size__option']}):
        productsize.append(size_available.text.strip())
But it gets both tags, since they share the same class (product-size__option). How can I get only the information I need?
Thanks
The data you don't want has the additional CSS class product-size__option--no-stock. You can skip those elements by checking that the class is absent: if 'product-size__option--no-stock' not in size_available.attrs['class']
For example:
from bs4 import BeautifulSoup
html = '''<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>'''
soup = BeautifulSoup(html, 'html.parser')
linksize = soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
    for size_available in size.find_all('a', {'class':['product-size__option']}):
        if 'product-size__option--no-stock' not in size_available.attrs['class']:
            productsize.append(size_available.text.strip())
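The same filter can probably be expressed in a single CSS selector with :not(), which Beautiful Soup's select() supports. This is a sketch on trimmed-down markup (the onclick and data attributes are omitted for brevity), not a drop-in replacement for the code above:

```python
from bs4 import BeautifulSoup

# Trimmed-down version of the markup from the question.
html = """<li class="product-size__option-wrapper">
<a class="product-size__option">I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a class="product-size__option product-size__option--no-stock">I DONT WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a class="product-size__option">I WANT THIS</a>
</li>"""

soup = BeautifulSoup(html, "html.parser")

# Match option links that do not also carry the no-stock modifier class.
selector = (
    "li.product-size__option-wrapper "
    "a.product-size__option:not(.product-size__option--no-stock)"
)
productsize = [a.get_text(strip=True) for a in soup.select(selector)]
print(productsize)
```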

How can I crawl web data that is not in tags (class name is the same)

Sorry, I have asked a question like this before, but I still have a problem with data that is not inside a tag.
This case is slightly different from the question I asked earlier
(How can I crawl web data that is not in tags)
<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>
I only need
I am here
The string is a child of the main div of type NavigableString, so you can loop through div.children and filter based on the type of the node:
from bs4 import BeautifulSoup, NavigableString
[x.strip() for x in soup.find("div", {'id': 'main-content'}).children if isinstance(x, NavigableString) and x.strip()]
# ['I am here']
Data:
soup = BeautifulSoup("""<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>""", "html.parser")
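As a self-contained variation on the same idea, find_all() can be asked for strings only and told not to descend into child tags. This sketch assumes the markup from the question with the whitespace inside the spans trimmed:

```python
from bs4 import BeautifulSoup

html = """<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">author</span>
<span class="article-meta-value">Jorden</span>
</div>
<div class="metaline">
<span class="article-meta-tag">board</span>
<span class="article-meta-value">NBA</span>
</div>
I am here
</div>"""

soup = BeautifulSoup(html, "html.parser")
main = soup.find("div", id="main-content")

# string=True matches text nodes; recursive=False limits the search to
# direct children, so text inside the nested metaline divs is ignored.
direct_text = [
    s.strip()
    for s in main.find_all(string=True, recursive=False)
    if s.strip()  # drop whitespace-only nodes between the child tags
]
print(direct_text)
```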
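soup = BeautifulSoup(that_html, "html.parser")
div_tag = soup.div
# Caution: .string is None when a tag has more than one child, as the main div does here
required_string = div_tag.string
Go through the Beautiful Soup documentation for details.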

Python : Extract HTML content

Is there any way to get the "Data to be extracted" content from the following HTML, using BeautifulSoup or any other library?
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
Thanks in advance for any help !! :)
There are certainly multiple options. For starters, you can find the p element with class="class_label" and get the next p sibling:
from bs4 import BeautifulSoup
data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>"Data to be extracted"</p>
</li>
</ul>
</div>
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.find('p', class_='class_label').find_next_sibling('p').text)
Or, using a CSS selector:
soup.select('div ul.main li p.class_label + p')[0].text
Or, relying on the User Name text:
soup.find(string='User Name').parent.find_next_sibling('p').text
Or, relying on the p element's position inside the li tag:
soup.select('div ul.main li p')[1].text
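When a page has several label/value pairs like this, the sibling approach can be wrapped in a small helper. The function name value_for_label and the sample rows below are hypothetical, introduced only to illustrate the idea:

```python
from bs4 import BeautifulSoup

# Hypothetical data: two label/value rows in the same layout as the question.
data = """
<div>
<ul class="main class">
<li>
<p class="class_label">User Name</p>
<p>alice</p>
</li>
<li>
<p class="class_label">Email</p>
<p>alice@example.com</p>
</li>
</ul>
</div>
"""


def value_for_label(soup, label):
    """Return the text of the <p> that follows the label <p>, or None."""
    # Hypothetical helper: finds the label by class and exact text.
    label_tag = soup.find("p", class_="class_label", string=label)
    if label_tag is None:
        return None
    value_tag = label_tag.find_next_sibling("p")
    return value_tag.get_text(strip=True) if value_tag is not None else None


soup = BeautifulSoup(data, "html.parser")
print(value_for_label(soup, "User Name"))
print(value_for_label(soup, "Email"))
print(value_for_label(soup, "Phone"))
```

Returning None for a missing label keeps the caller from hitting an AttributeError on pages where a field is absent.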

Analyze and edit links in html code with BeautifulSoup

I have a fragment of an HTML page. I need to find all outgoing links in it and replace them with the marker <can_be_link>.
The code below does almost everything I want, but it fails on links that span several lines (rather than one) when those lines start with tabs (in my example, the link to http://bad.com).
How can I solve this correctly?
# -*- coding: utf-8 -*-
import BeautifulSoup
import re
if __name__=="__main__":
    body = """
good link
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
"""
    metka_link = '<can_be_link>'
    soup = BeautifulSoup.BeautifulSoup(body)
    hrefs = soup.findAll(name = 'a', attrs = { 'href': re.compile('\.*') })
    repl = {}
    for t in hrefs:
        line = str(t)
        # print '\n'*2, line
        if not t.has_key('href'):
            continue
        href = t['href'].lower()
        if href.find('http') == 0 or href.find('//') == 0:
            body = body.replace(line, metka_link)
    print body
The result is
<can_be_link>
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
But the desired result must be
<can_be_link>
<ul>
<li class="FOLLOW">
<can_be_link>
</li>
</ul>
Use the replace_with() method. From the documentation:
PageElement.replace_with() removes a tag or string from the tree, and
replaces it with the tag or string of your choice.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
body = """
good link
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
"""
soup = BeautifulSoup(body, 'html.parser')
links = soup.find_all('a')
for link in links:
    link.replace_with('<can_be_link>')
print(soup.prettify(formatter=None))
prints:
<can_be_link>
<ul>
<li class="FOLLOW">
<can_be_link>
</li>
</ul>
Note the import statement: use Beautiful Soup 4, since Beautiful Soup 3 is no longer being developed and Beautiful Soup 4 is recommended for all new projects.
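The question asks to replace only outgoing links, so the replace_with() approach can be combined with the href check from the original code. The following is a sketch under that assumption; the sample markup, the example.com and example.org URLs, and the MARKER name are placeholders chosen for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: two outgoing links and one relative link.
body = """
<a href="http://example.com">good link</a>
<ul>
<li class="FOLLOW">
<a href="//example.org/page" target="_blank">
<em></em>
<span><strong class="FOLLOW-text">Follow On</strong></span>
</a>
</li>
<li><a href="/local/page">internal</a></li>
</ul>
"""

MARKER = "<can_be_link>"

soup = BeautifulSoup(body, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute.
for link in soup.find_all("a", href=True):
    href = link["href"].lower()
    # Same test as the question: absolute or protocol-relative URLs.
    if href.startswith(("http", "//")):
        link.replace_with(MARKER)

# formatter=None writes the marker without escaping < and >.
result = soup.decode(formatter=None)
print(result)
```

Because the replacement happens on the parsed tree rather than on the raw string, it does not matter how many lines the link spans or how those lines are indented.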
