I am trying to scrape a webpage which uses markup like <li id="size_name_1" ...>, <li id="size_name_2" ...>, <li id="size_name_a" ...>. Is there a way to find size_name_NUMBER, such as
response.xpath('//*[@id="size_name_\d+"]')
In other words, I want to use a regex in the id search. Note that I use Scrapy.
You could do this with CSS selectors instead, by using a regex to grab the appropriate ids first. I note you are using Scrapy, but the same principle should apply.
from bs4 import BeautifulSoup
import re

html = '''
<html>
<head></head>
<body>
<li id="size_name_1" > me </li>
<li id="size_name_2" > and me </li>
<li id="size_name_a" > but not me :-(</li>
</body>
</html>
'''

# collect only the ids that match size_name_<number>
p = re.compile(r'id="(size_name_\d+)"')
ids = p.findall(html)

soup = BeautifulSoup(html, 'lxml')
for i in ids:
    print(soup.select_one(f'li[id="{i}"]'))
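Since the question uses Scrapy, a minimal sketch of doing the match directly in the selector, relying on the EXSLT regex support (re:test()) that Scrapy's selectors expose, might look like this (the spider name and URL are hypothetical placeholders):
import scrapy

class SizeSpider(scrapy.Spider):
    # hypothetical spider name and URL - adjust to your project
    name = 'sizes'
    start_urls = ['https://example.com/your-page']

    def parse(self, response):
        # re:test() comes from the EXSLT regex extensions that Scrapy selectors support
        for li in response.xpath(r'//li[re:test(@id, "^size_name_\d+$")]'):
            yield {'id': li.attrib['id'], 'text': li.xpath('normalize-space()').get()}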
I was learning web scraping, and the 'li' tag is not showing when I run soup.findAll
Here's the html:
<label>
<input type="checkbox">
<ul class="dropdown-content">
<li>
<a href=stuff</a>
</li>
</ul>
</label>
I tried:
soup = BeautifulSoup(r.content,'html5lib')
dropdown = soup.findAll('ul', {'class':'dropdown-content'})
print(dropdown)
And it only shows:
[<ul class="dropdown-content"></ul>]
Any help will do. Thanks!
In this command: dropdown = soup.findAll('ul', {'class':'dropdown-content'}) you search for the ul with the dropdown-content class. Search for the list items inside it instead:
dropdown = soup.find('ul').findAll('li')
Your selection per se is okay for finding the <ul>, but it may not contain any <li>, because I assume these elements are generated dynamically by JavaScript. To validate this, the question should be improved and the URL of the website should be provided.
If the content is provided dynamically, one approach could be to work with Selenium, which renders the website like a browser and can return the "full" DOM.
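A minimal sketch of that approach, assuming Chrome and a placeholder URL (both hypothetical here), could look like this:
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()                    # any Selenium-supported browser works
driver.get('https://example.com/your-page')    # placeholder URL - replace with the real page
html = driver.page_source                      # the DOM after JavaScript has run
driver.quit()

soup = BeautifulSoup(html, 'html5lib')
dropdown = soup.find('ul', class_='dropdown-content')
print(dropdown.find_all('li') if dropdown else 'no dropdown-content <ul> found')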
Note: In new code use find_all() instead of the old syntax findAll().
Example
The HTML in your example is broken, but your code works if there are any <li> elements inside the <ul> in your soup.
from bs4 import BeautifulSoup
html = '''
<label>
<input type="checkbox">
<ul class="dropdown-content">
<li>
</li>
</ul>
</label>
'''
soup = BeautifulSoup(html,'html5lib')
dropdown = soup.find_all('ul', {'class':'dropdown-content'})
print(dropdown)
Output
[<ul class="dropdown-content">
<li>
</li>
</ul>]
I have the following HTML
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
I use this code to get the data
linksize = soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
    for size_available in size.find_all('a', {'class': ['product-size__option']}):
        productsize.append(size_available.text.strip())
But it gets both tags, since they share the same class (product-size__option). How can I get only the information I need?
Thanks
The data you don't want has the CSS class product-size__option--no-stock. You can check that the element does not contain this class with the following test: if 'product-size__option--no-stock' not in size_available.attrs['class']
For example:
from bs4 import BeautifulSoup
html = '''<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428006" class="product-size__option">
I WANT THIS</a>
</li>
<li class="product-size__option-wrapper">
<a onclick="ACC.productDetail.getNewProductSize(this)" data-option-code="000000000196428007" class="product-size__option product-size__option--no-stock">
I DONT WANT THIS</a>
</li>'''
soup = BeautifulSoup(html, 'html.parser')
linksize = soup.find_all('li', class_='product-size__option-wrapper')
productsize = []
for size in linksize:
    for size_available in size.find_all('a', {'class': ['product-size__option']}):
        if 'product-size__option--no-stock' not in size_available.attrs['class']:
            productsize.append(size_available.text.strip())
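Alternatively, assuming a BeautifulSoup version recent enough that select() is backed by Soup Sieve (4.7+), the same filter can be written as a CSS :not() selector:
# keep only in-stock options: product-size__option without the no-stock modifier class
productsize = [a.text.strip()
               for a in soup.select('a.product-size__option:not(.product-size__option--no-stock)')]
print(productsize)  # ['I WANT THIS']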
I need help in Python. I need to find some values in this HTML:
<li class="hide" style="display: list-item;">
<div class="name">Name</div>
<span class="value">TEST TEST</span>
</li>
These words: Name, TEST TEST.
Try using the find method in bs4
Ex:
from bs4 import BeautifulSoup
html = """<li class="hide" style="display: list-item;">
<div class="name">Name</div>
<span class="value">TEST TEST</span>
</li>"""
soup = BeautifulSoup(html, "html.parser")
print( soup.find("li", class_="hide").text.strip() )
Output:
Name
TEST TEST
After you find the required element, use .text to extract the string.
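If you need the two values separately rather than the combined text, a small sketch targeting the inner elements by the classes shown above could be:
label = soup.find("div", class_="name").text.strip()    # 'Name'
value = soup.find("span", class_="value").text.strip()  # 'TEST TEST'
print(label, value)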
My intention is to scrape the names of the top-selling products on Ali-Express.
I'm using the Requests library alongside Beautiful Soup to accomplish this.
# imports for BeautifulSoup, requests and pprint
from bs4 import BeautifulSoup as bs
import requests as req
import pprint as pp

url = "https://bestselling.aliexpress.com/en?spm=2114.11010108.21.3.qyEJ5m"
soup = bs(req.get(url).text, 'html.parser')
# pp.pprint(soup)  # verify that the page has been found

all_items = soup.find_all('li', class_='top10-item')
pp.pprint(all_items)
# []
However, this returns an empty list, indicating that soup.find_all() did not find any tags fitting those criteria.
Inspect Element in Chrome displays the list items as expected (screenshot not reproduced here).
However, in the page source the <ul class="top10-items"> contains a script, which seems to iterate through each list item (I'm not familiar with HTML).
<div class="container">
<div class="top10-header"><span class="title">TOP SELLING</span> <span class="sub-title">This week's most popular products</span></div>
<ul class="top10-items loading" id="bestselling-top10">
</ul>
<script class="X-template-top10" type="text/mustache-template">
{{#topList}}
<li class="top10-item">
<div class="rank-orders">
<span class="rank">{{rank}}</span>
<span class="orders">{{productOrderNum}}</span>
</div>
<div class="img-wrap">
<a href="{{productDetailUrl}}" target="_blank">
<img src="{{productImgUrl}}" alt="{{productName}}">
</a>
</div>
<a class="item-desc" href="{{productDetailUrl}}" target="_blank">{{productName}}</a>
<p class="item-price">
<span class="price">US ${{productMinPrice}}</span>
<span class="uint">/ {{productUnitType}}</span>
</p>
</li>
{{/topList}}</script>
</div>
</div>
So this probably explains why soup.find_all() doesn't find the "li" tag.
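A quick check (a minimal sketch against the soup object above; the class name is taken from the source snippet) shows the template is reachable but only contains placeholders:
# the template script exists, but holds only {{productName}}-style placeholders, no real data
template = soup.find('script', class_='X-template-top10')
print(template.string if template else 'template not found')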
My question is: how can I extract the item names from the script using Beautiful Soup?
I have part of an HTML page. I have to find all outgoing links in it and replace them with the mark <can_be_link>.
The following code does almost all of what I want, but it fails on links that span several lines (not just one) and whose lines start with tabs (in my example this is the link to http://bad.com).
How can I solve this issue correctly?
# -*- coding: utf-8 -*-
import BeautifulSoup
import re

if __name__ == "__main__":
    body = """
<a href="http://good.com">good link</a>
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
"""
    metka_link = '<can_be_link>'

    soup = BeautifulSoup.BeautifulSoup(body)
    hrefs = soup.findAll(name='a', attrs={'href': re.compile('\.*')})
    repl = {}
    for t in hrefs:
        line = str(t)
        # print '\n'*2, line
        if not t.has_key('href'):
            continue
        href = t['href'].lower()
        if href.find('http') == 0 or href.find('//') == 0:
            body = body.replace(line, metka_link)
    print body
The result is
<can_be_link>
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
But the desired result must be
<can_be_link>
<ul>
<li class="FOLLOW">
<can_be_link>
</li>
</ul>
Use the replace_with() method:
PageElement.replace_with() removes a tag or string from the tree, and
replaces it with the tag or string of your choice
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
body = """
<a href="http://good.com">good link</a>
<ul>
<li class="FOLLOW">
<a href="http://bad.com" target="_blank">
<em></em>
<span>
<strong class="FOLLOW-text">Follow On</strong>
<strong class="FOLLOW-logo"></strong>
</span>
</a>
</li>
</ul>
"""
soup = BeautifulSoup(body, 'html.parser')
links = soup.find_all('a')
for link in links:
    link.replace_with('<can_be_link>')

print(soup.prettify(formatter=None))
prints:
<can_be_link>
<ul>
<li class="FOLLOW">
<can_be_link>
</li>
</ul>
Note the import statement: use the 4th version of Beautiful Soup, since Beautiful Soup 3 is no longer being developed and Beautiful Soup 4 is recommended for all new projects.