I can't select values in a list. I used find_element_by_class_name() to open the menu, but when I try to select an <li>, I get an error saying the element has no click() function.
Here is the code:
click_menu = driver.find_element_by_class_name("periodSelector")
click_menu[1].click()
Here is the HTML that I am trying to parse:
<div data-period-selector="" data-period="periodFilter">
<div class="periodSelectorContainer">
<div class="btn-group periodSelector">
<button class="flat-btn dropdown-toggle periodToggle ng-binding" data-toggle="dropdown"> 20/02/2021 - 22/03/2021 <span class="dropdown-arrow"></span> </button>
<ul class="dropdown-menu">
<li>
<a href="javascript:void(0);" class="new-financ" ng-click="selectToday()"><i></i>
<span class="pull-left">Hoje</span>
<span class="pull-right"></span>
</a>
</li>
<li>
<a href="javascript:void(0);" class="new-financ" ng-click="selectThisWeek()"><i>
</li>
There are multiple class names, so you have to use a CSS selector.
click_menu = driver.find_element_by_css_selector("button.flat-btn.dropdown-toggle.periodToggle.ng-binding")
click_menu.click()
This clicks the first li tag:
driver.find_element_by_xpath("//ul[@class='dropdown-menu']/li[1]").click()
periodSelector is a class on a DIV
<div class="btn-group periodSelector">
I'm assuming that you need to click on the BUTTON
<button class="flat-btn dropdown-toggle periodToggle ng-binding" data-toggle="dropdown">
Most of those classes seem generic (probably not unique) but I'm guessing that periodToggle might be unique given the date range. Try
driver.find_element_by_css_selector("button.periodToggle").click()
NOTE:
You have an error in your code. You are using .find_element_by_class_name() (singular) but have array notation on the next line, click_menu[1]. In this case, you can just use click_menu.click(). You'd only need the array notation if you were using .find_elements_by_*() (note the plural, elements).
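The locators suggested in these answers can be sanity-checked offline before running them in a browser. The following is a minimal sketch that assumes lxml is installed and uses a closed-off copy of the posted markup; the text of the second <li> is a placeholder, since the original snippet is truncated there. It also shows why matching on the full class attribute fails for the container div, which carries two class names.

```python
from lxml import html

# Closed-off copy of the posted markup; the second <li> text is a placeholder.
snippet = """
<div class="periodSelectorContainer">
  <div class="btn-group periodSelector">
    <button class="flat-btn dropdown-toggle periodToggle ng-binding" data-toggle="dropdown">20/02/2021 - 22/03/2021</button>
    <ul class="dropdown-menu">
      <li><a href="javascript:void(0);" class="new-financ"><span class="pull-left">Hoje</span></a></li>
      <li><a href="javascript:void(0);" class="new-financ"><span class="pull-left">placeholder</span></a></li>
    </ul>
  </div>
</div>
"""
root = html.fromstring(snippet)

# An exact @class comparison fails when the attribute holds several class names.
exact = root.xpath("//div[@class='periodSelector']")

# Token-aware test for one class name among several.
buttons = root.xpath(
    "//button[contains(concat(' ', normalize-space(@class), ' '), ' periodToggle ')]"
)

# The <ul> has a single class, so the exact comparison works there.
first_item = root.xpath("//ul[@class='dropdown-menu']/li[1]")

print(len(exact), len(buttons), first_item[0].text_content().strip())
```

Note that xpath() always returns a list here, which mirrors the plural find_elements_* calls: indexing is only meaningful on the list, not on a single element.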
I'm trying to parse HTML using BeautifulSoup (with the lxml parser).
On nested tags I'm getting repeated text.
I've tried going through and only counting tags that have no children, but then I lose data.
given:
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments</span>
</li>
</ul>
</div>
and running:
soup = BeautifulSoup(file_info, features = "lxml")
soup.prettify().encode("utf-8")
for tag in soup.find_all(True):
    if check_text(tag.text):  # false on empty string / all numbers
        print(tag.text)
I get "to post comments" 4 times.
Is there a beautifulsoup way of just getting the result once?
Given an input like
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments1</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments2</span>
</li>
</ul>
</div>
<div class="links">
<ul class="links inline">
<li class="comment_forbidden first last">
<span> to post comments3</span>
</li>
</ul>
</div>
You could do something like
[x.span.string for x in soup.find_all("li", class_="comment_forbidden first last")]
which would give
[' to post comments1', ' to post comments2', ' to post comments3']
find_all() finds all the <li> tags with the class comment_forbidden first last, and the text of each one's <span> child is obtained through its string attribute.
For anyone struggling with this, try swapping out the parser. I switched to html5lib and I no longer have repetitions. It is a costlier parser though, so may cause performance issues.
soup = BeautifulSoup(html, "html5lib")
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
You can use find() instead of find_all() to get only the first match, so the result appears once.
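The repetition in the question comes from tag.text, which concatenates the text of all descendants, so every ancestor of the <span> reports the same string. A minimal sketch of one way around it is to walk the text nodes themselves with stripped_strings. It assumes bs4 is installed and uses the built-in html.parser rather than lxml, so that no wrapping <html> and <body> tags are added to the fragment.

```python
from bs4 import BeautifulSoup

html = """
<div class="links">
  <ul class="links inline">
    <li class="comment_forbidden first last">
      <span> to post comments</span>
    </li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Each tag's text includes the text of its descendants, so div, ul, li and
# span all yield the same string.
repeated = [tag.get_text(strip=True) for tag in soup.find_all(True)]

# stripped_strings iterates over the text nodes, so each string appears once.
unique = list(soup.stripped_strings)

print(repeated)
print(unique)
```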
So I'm attempting to clean some HTML. I've got the following function:
def clean_html(self, html):
    replaced_html = html.decode('utf-8').replace('<', ' <')
    tree = etree.HTML(replaced_html)
    etree.strip_elements(tree, 'script', 'style', 'img', 'noscript', 'svg')
    for el in tree.xpath('//*[@style]'):
        el.attrib.pop('style')
    for el in tree.xpath('//*[@class]'):
        el.attrib.pop('class')
    for el in tree.xpath('//*[@id]'):
        el.attrib.pop('id')
    etree.strip_tags(tree, etree.Comment)
    return etree.tostring(tree, encoding='unicode', method='html')
I'm also looking to remove all data attributes, e.g.
<li data-direction="ltr" data-listposition="center" data-data-id="dataItem-ifz7cqbs" data-state="menu idle link notMobile">sky</li>
But the attributes are unknown to me (above is just an example).
So I want to transform the above into just <li>sky</li>, and this should run on every element on the page.
In my code above I'm able to remove fixed attributes like id and class, but I'm not sure how to handle the dynamic data-* attributes. Possibly regex?
EDIT
I should clarify a bit about the input. My example above shows the use of <li> tags. But the actual input is the entire html of a page so it would be something like:
<html>
<ul>
<li data-i="sdfdsf">something</li>
<li data-i="dsfd">something</li>
</ul>
<p data-para="cvcv">content</p>
<div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp35za1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black LinkedIn Icon","uri":"6ea5b4a88f0b4f91945b40499aa0af00.png","width":200,"height":200,"alt":"Black LinkedIn Icon","link":{"type":"ExternalLink","id":"dataItem-ig84dp5v","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.linkedin.com/in/beth-liu-aba2b487?trk=hp-identity-name","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.pinterest.com/agencyb/" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ijxtrrjj","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Pinterest Icon","uri":"8f6f59264a094af0b46e9f6c77dff83e.png","width":200,"height":200,"alt":"Black Pinterest Icon","link":{"type":"ExternalLink","id":"dataItem-ikg674xm","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"https://www.pinterest.com/agencyb/","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="http://www.twitter.com/lubecka" target="_blank" > <div data-image-info='{"imageData":{"type":"Image","id":"dataItem-ifp3554u","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"2.0","isHidden":false},"title":"Black Twitter Icon","uri":"c7d035ba85f6486680c2facedecdcf4d.png","description":"","width":200,"height":200,"alt":"Black Twitter Icon","link":{"type":"ExternalLink","id":"dataItem-ifp3554u1","metaData":{"pageId":"masterPage","isPreset":false,"schemaVersion":"1.0","isHidden":false},"url":"http://www.twitter.com/lubecka","target":"_blank"}},"displayMode":"fill"}' > </div> </a> </li> <li> <a href="https://www.instagram.com/" target="_blank">
</html>
Assuming that the names of the "data attributes" always start with "data-", you can remove them like this:
for el in tree.xpath("//*"):
    # iterate over a copy of the attribute names, since attributes are removed in the loop
    for attr in list(el.attrib):
        if attr.startswith("data-"):
            el.attrib.pop(attr)
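As a self-contained sketch of this approach (the function name strip_data_attributes and the sample markup are illustrative assumptions, not part of the original code), the loop can be wrapped in a helper and checked against a small input:

```python
from lxml import etree


def strip_data_attributes(tree):
    """Remove every attribute whose name starts with 'data-' (hypothetical helper)."""
    # iter("*") yields elements only, skipping comments and processing instructions.
    for el in tree.iter("*"):
        # Copy the attribute names first because the mapping is modified in the loop.
        for name in list(el.attrib):
            if name.startswith("data-"):
                del el.attrib[name]
    return tree


tree = etree.HTML(
    '<ul><li data-i="sdfdsf" id="x">something</li></ul>'
    '<p data-para="cvcv">content</p>'
)
strip_data_attributes(tree)
result = etree.tostring(tree, encoding="unicode", method="html")
print(result)
```

Other attributes such as id are left alone, so this step can run alongside the existing id, class and style removal in clean_html.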
You can clear the attributes like this:
import re
def strip_attribute(data):
    p = re.compile('data-[^=]*="[^"]*"')
    print(p)
    return p.sub('', data)
print(strip_attribute('with attribute'))
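Note that the pattern above only matches double-quoted values, while the page in the question also uses single-quoted ones (data-image-info='...'). A hedged variant that covers double-quoted, single-quoted, unquoted and valueless attributes might look like the sketch below; the names DATA_ATTR and strip_data_attributes are illustrative. Regex-based cleanup remains fragile on HTML: it would also remove text such as " data-x" appearing in ordinary content, so a parser-based approach is usually safer.

```python
import re

# Leading whitespace, a data-* name, then an optional value in one of three forms.
DATA_ATTR = re.compile(
    r"""\s+data-[\w-]+(?:\s*=\s*(?:"[^"]*"|'[^']*'|[^\s>]+))?"""
)


def strip_data_attributes(markup):
    """Remove data-* attributes from a markup string (hypothetical helper)."""
    return DATA_ATTR.sub("", markup)


sample = '<li data-i="sdfdsf" class="x">something</li>'
cleaned = strip_data_attributes(sample)
print(cleaned)
```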
Maybe this is what you're looking for:
from lxml import etree
code = """
<html>
<ul>
<li data-i="sdfdsf">something</li>
<li data-i="dsfd">something</li>
</ul>
<p data-para="cvcv">content</p>
</html>
"""
xml = etree.XML(code)
elements = list(xml.iter())
for element in elements:
    # element.text is None for elements with no leading text, so guard before stripping
    if element.text and element.text.strip():
        print('<'+element.tag+'>'+element.text+'</'+element.tag+'>')
Output:
<li>something</li>
<li>something</li>
<p>content</p>
I have HTML that looks roughly like this (shortened):
<div id="activities" class="ListItems">
<h2>Standards</h2>
<ul>
<li>
<a class="Title" href="http://www.google.com" >Guidelines on management</a>
<div class="Info">
<p>
text
</p>
<p class="Date">Status: Under development</p>
</div>
</li>
</ul>
</div>
<div class="DocList">
<h3>Reports</h3>
<p class="SupLink">+ <a href="http://www.google.com/test" >View More</a></p>
<ul>
<li class="pdf">
<a class="Title" href="document.pdf" target="_blank" >Document</a>
<span class="Size">
[1,542.3KB]
</span>
<div class="Info">
<p>
text <a href="http://www.google.com" >Read more</a>
</p>
<p class="Date">
14/03/2018
</p>
</div>
</li>
</ul>
</div>
I am trying to select the value in 'href=' under 'a class="Title"' by using this code:
def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
But I get two results: the one under 'div class="DocList"' is also returned.
I am trying to change my xpath expressions so that I would only look within the node but I cannot get it to work.
Could someone please help me understand how to "search" within a specific node. I have gone through multiple xpath documentations but I cannot seem to figure it out.
By starting with //, you are already selecting all the a elements in the document.
To search within a specific div, select that div first with // and then use //a to look anywhere inside it:
//div[@class="ListItems"]//a[@class="Title"]
for node in tree.xpath('//div[@class="ListItems"]//a[@class="Title"]'):
    url2.append(node.get("href"))
Try this XPath expression, which selects the div with a specific id and then searches all of its descendants:
'//div[@id="activities"]//a[@class="Title"]'
So:
def sub_path02(url):
    page = requests.get(url)
    tree = html.fromstring(page.content)
    url2 = []
    for node in tree.xpath('//div[@id="activities"]//a[@class="Title"]'):
        url2.append(node.get("href"))
    return url2
Note:
It is generally better to select by id than by class, because an id should be unique within a page, whereas a class can be repeated any number of times. (In practice, badly written pages sometimes reuse the same id.)
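To see the scoping in action without fetching a page, here is a minimal sketch run against a trimmed copy of the posted HTML (the wrapping <div> is added only so the fragment has a single root). It also shows the relative form .//, which searches inside an element that has already been selected.

```python
from lxml import html

snippet = """
<div>
  <div id="activities" class="ListItems">
    <ul><li><a class="Title" href="http://www.google.com">Guidelines on management</a></li></ul>
  </div>
  <div class="DocList">
    <ul><li class="pdf"><a class="Title" href="document.pdf" target="_blank">Document</a></li></ul>
  </div>
</div>
"""
tree = html.fromstring(snippet)

# Unscoped: matches the Title links in both divs.
all_links = tree.xpath('//a[@class="Title"]/@href')

# Scoped by the parent's id in a single expression.
scoped = tree.xpath('//div[@id="activities"]//a[@class="Title"]/@href')

# Same result in two steps: select the container, then search relative to it.
container = tree.xpath('//div[@id="activities"]')[0]
relative = container.xpath('.//a[@class="Title"]/@href')

print(all_links, scoped, relative)
```

The leading dot matters in the two-step form: container.xpath('//a') would search the whole document again, not just the container.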
I need to use XPath with lxml in Python 2.6 to extract two text items:
-Name One Type 1 Description 1
-Name Two Type 2 Description 2
I've tried using the following XPath: '//*[@id="results"]/li/div/p/child::*/text()'
However, this gives me only the following text:
-Name One Type 1
-Name Two Type 2
Any suggestions on the correct Xpath to use?
<div id="container">
<ol id="results">
<li class="mod1" data-li-position="0">
<img src="image001.jpg">
<div class="bd">
<h3>
Category 1
</h3>
<p class="description">
<strong class="highlight">Name One</strong>
<strong class="highlight">Type 1</strong>
Description 1
</p>
</div>
</li>
<li class="mod2" data-li-position="1">
<img src="image002.jpg">
<div class="bd">
<h3>
Category 2
</h3>
<p class="description">
<strong class="highlight">Name Two</strong>
Description 2
<strong class="highlight">Type 2</strong>
</p>
</div>
</li>
This last part of your XPath:
...../p/child::*/text()
selects only text nodes that are children of the child elements of <p>. That is why you missed, for example, Description 1, which is a direct child of <p>. You can change that part as follows:
...../p//text()
The XPath above selects all text nodes that are descendants of <p>, in other words, all text nodes anywhere within <p>.
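The difference can be checked with a short sketch on a trimmed copy of the first list item. It compares a path that only reaches text inside the child elements of <p> (p/child::*/text()) with p//text(); the clean helper is an illustrative assumption that drops whitespace-only nodes. It was written against a current lxml and Python 3, not Python 2.6.

```python
from lxml import html

snippet = """
<ol id="results">
  <li class="mod1" data-li-position="0">
    <div class="bd">
      <p class="description">
        <strong class="highlight">Name One</strong>
        <strong class="highlight">Type 1</strong>
        Description 1
      </p>
    </div>
  </li>
</ol>
"""
tree = html.fromstring(snippet)


def clean(nodes):
    """Strip each text node and drop the ones that are only whitespace."""
    return [n.strip() for n in nodes if n.strip()]


# Text nodes inside the child elements of <p> only (the <strong> contents).
nested_only = clean(tree.xpath('//*[@id="results"]/li/div/p/child::*/text()'))

# Every text node anywhere below <p>, including its own direct text.
all_text = clean(tree.xpath('//*[@id="results"]/li/div/p//text()'))

# One joined string per <p>, with whitespace collapsed.
per_item = [
    " ".join(p.text_content().split())
    for p in tree.xpath('//*[@id="results"]/li/div/p')
]

print(nested_only, all_text, per_item)
```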