I've been trying to extract a portion of text from some HTML elements using XPath, but it seems I'm going wrong somewhere, which is why I can't make it work.
Html elements:
htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""
I would like to dig out International using XPath. I know I could succeed with .next_sibling if I wanted to extract the same text via a CSS selector, but I'm not interested in going that route.
That said If I try like this I can get the same using xpath:
tree.xpath("//*[#class='content']/p/following::text()")[0]
But the above expression is not what I'm after, because I can't use it within Selenium WebDriver if I stick to driver.find_element_by_xpath().
The only way that I'm interested in is like the following but it is not working:
"//*[#class='content']/p/following::*"
Real-life example:
from lxml.html import fromstring
htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""
tree = fromstring(htmlelem)
item = tree.xpath("//*[@class='content']/p/following::text()")[0].strip()
elem = tree.xpath("//*[@class='content']/p/following::*")[0].text
print(elem)
In the above example, I can successfully print item but not elem. However, I would like to modify the expression used within elem.
How can I make it work so that I can use the same XPath within the lxml library and within Selenium?
Since the OP was looking for a solution that extracts text sitting outside the element the XPath targets, the following should do that, albeit in a somewhat awkward manner:
tree.xpath("//*[#class='content']")[0][0].tail
Output:
International
The need for this approach is a result of the way lxml parses the html code:
tree.xpath("//*[@class='content']") results in a list of length 1.
The first (and only) element in that list - tree.xpath("//*[@class='content']")[0] - is an lxml.html.HtmlElement, which itself can be treated as a list and also has length 1.
The desired output hides in the tail of the first (and only) element of that lxml.html.HtmlElement...
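For reference, a minimal runnable sketch of this .tail approach, using the markup from the question:

from lxml.html import fromstring

htmlelem = """
<div class="content">
<p>Type of cuisine: </p>International
</div>
"""

tree = fromstring(htmlelem)
# The text that follows </p> is stored on the <p> element itself, in .tail
p = tree.xpath("//*[@class='content']/p")[0]
print(p.tail.strip())  # International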
Related
I am trying to extract data from multiple pages of search results where the HTML in question looks like so:
<ul>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
<li class="Card___StyledLi4-ulg8ho-7 jmevwM">...</li>
</ul>
I want to extract the text from the "li" tags, so I have:
text_data = WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.XPATH,'Card___StyledLi4-ulg8ho-7.jmevwM')))
print(text_data.text)
to wait for and target the "li" items, but I get a "TimeoutException" error.
However, if I try to locate a single "li" item using an XPath under the same conditions, the data is returned, which leads me to question whether I am inputting the class correctly.
Can anyone tell me what I'm doing wrong? Please let me know if there is any further information you'd like me to provide.
I believe the XPath for these list items would be //li[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] (or //*[@class="Card___StyledLi4-ulg8ho-7 jmevwM"] if you want all elements with that class rather than just li tags). You can take a look at this cheatsheet and this tutorial for further rules and examples of XPath.
You can also just use CSS Selectors like (By.CSS_SELECTOR, '.Card___StyledLi4-ulg8ho-7.jmevwM') in this case.
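A sketch of how the corrected wait could look with the XPath above; this is a minimal sketch assuming the usual Selenium imports, with the driver setup and URL as placeholders:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()        # assumption: any WebDriver will do
driver.get("https://example.com")  # placeholder for the search-results page

# The condition returns a *list* of elements, so iterate over it
# instead of calling .text on the list itself
items = WebDriverWait(driver, 10).until(
    EC.visibility_of_all_elements_located(
        (By.XPATH, '//li[@class="Card___StyledLi4-ulg8ho-7 jmevwM"]'))
)
for li in items:
    print(li.text)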
You have mentioned the wrong locator type; it should be CSS_SELECTOR, and you also need a dot '.' in front of the element's property, because it is a 'class':
text_data = WebDriverWait(driver,10).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR,'.Card___StyledLi4-ulg8ho-7.jmevwM')))
Given the HTML elements below, I have been unsuccessful using BeautifulSoup's soup.select to obtain, in list format, only the first child after each div class="wrap-25PNPwRV" (i.e. -11.94M and 2.30M).
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>−11.94M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>2.30M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
The above is just two examples from the HTML I'm attempting to scrape; the source code lies within a dynamically rendered JavaScript table, and there are many more div elements on the page, with many more div class="wrap-25PNPwRV" inside that table.
I currently have the below code, which allows me to scrape all the contents within div class="wrap-25PNPwRV":
data_list = [elem.get_text() for elem in soup.select("div.wrap-25PNPwRV")]
Output:
['-11.94M', '-119.94%', '2.30M', '-80.17%']
However, I would like to use soup.select to yield the desired output:
['-11.94M', '2.30M']
I tried following this guide https://www.crummy.com/software/BeautifulSoup/bs4/doc/ but have been unsuccessful in applying it to my code above.
Please note that if soup.select cannot perform the above, I am happy to use an alternative, provided it generates the same list format/output.
You can use the :nth-of-type CSS selector:
data_list = [elem.get_text() for elem in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")]
I'd suggest not using the .wrap-25PNPwRV class. It seems randomly generated and will almost certainly change in the future.
Instead, select the <div> element which has another element with class="change..." as a sibling. For example:
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
Prints:
['−11.94M', '2.30M']
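For reference, a self-contained sketch that runs both suggested selectors against the snippet from the question (this assumes bs4 with its bundled soupsieve backend, which :nth-of-type and :has() rely on):

from bs4 import BeautifulSoup

html = """
<div class="value-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>−11.94M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−119.94%</div></div></div>
<div class="value-25PNPwRV additional-25PNPwRV">
<div class="wrap-25PNPwRV">
<div>2.30M</div>
<div class="change-25PNPwRV negative-25PNPwRV">−80.17%</div></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# First div child of each wrapper:
print([e.get_text() for e in soup.select(".wrap-25PNPwRV div:nth-of-type(1)")])
# Class-agnostic: any div whose next sibling's class starts with "change":
print([t.text.strip() for t in soup.select('div:has(+ [class^="change"])')])
# Both print ['−11.94M', '2.30M']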
I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image: [image]
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that if you separate the class name with spaces, the css selector will understand you are looking for a descendant element with that tag name. So the correct approach is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you only select the div, the selector will return the text directly inside the div, but not the text in its children. Since there is also a strong element as a child of p, I would suggest using a generalist approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and its descendants.
It's relevant to point out that extracting text with CSS selectors (the ::text pseudo-element) is a Scrapy extension of the standard selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
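A quick way to verify both expressions without running a full spider is scrapy's standalone Selector; the markup below is an assumed reconstruction, since the question only shows the class names:

from scrapy.selector import Selector

# Hypothetical markup matching the structure described in the question
html = """
<div class="home-hero-blurb no-select">
<p>First paragraph with <strong>bold</strong> text.</p>
<p>Second paragraph.</p>
</div>
"""

sel = Selector(text=html)
print(sel.css("div.home-hero-blurb.no-select *::text").getall())
print(sel.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall())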
I'm trying to access text from elements that have different xpaths but very predictable href schemes across multiple pages in a web database. Here are some examples:
<a href="/mathscinet/search/mscdoc.html?code=65J22,(35R30,47A52,65J20,65R30,90C30)">
65J22 (35R30 47A52 65J20 65R30 90C30) </a>
In this example I would want to extract "65J22 (35R30 47A52 65J20 65R30 90C30)"
<a href="/mathscinet/search/mscdoc.html?code=05C80,(05C15)">
05C80 (05C15) </a>
In this example I would want to extract "05C80 (05C15)". My web scraper would not be able to search by xpath directly due to the xpaths of my desired elements changing between pages, so I am looking for a more roundabout approach.
My main idea is to use the fact that every href contains "/mathscinet/search/mscdoc.html?code=". Selenium can't directly search for hrefs, but I was thinking of doing something similar to this C# implementation:
Driver.Instance.FindElement(By.XPath("//a[contains(@href, 'long')]"))
To port this over to Python, the only analogous method I could think of would be to use the in operator, but I am not sure how the syntax would work when everything is nested in a find_element_by_xpath call. How would I bring all of these ideas together to obtain my desired text?
driver.find_element_by_xpath("//a['/mathscinet/search/mscdoc.html?code=' in @href]").text
If I understand you right, you want to locate all elements that have the same partial href. You can use this:
elements = driver.find_elements_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]")
for element in elements:
    print(element.text)
or if you want to locate one element:
driver.find_element_by_xpath("//a[contains(#href, '/mathscinet/search/mscdoc.html')]").text
This will give a list of all elements located.
As per the HTML you have shared, @AndreiSuvorkov's answer would possibly cater to your current requirement. Perhaps you can get much more granular and construct an optimized XPath by:
Using starts-with instead of contains
Including the ?code= part of the @href attribute
Your effective code block will be:
all_elements = driver.find_elements_by_xpath("//a[starts-with(@href,'/mathscinet/search/mscdoc.html?code=')]")
for elem in all_elements:
    print(elem.get_attribute("innerHTML"))
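Note that recent Selenium releases drop the find_element(s)_by_* helpers, so a sketch of the same starts-with query with the current API would look like this (driver setup and URL are placeholders):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()        # assumption: any WebDriver will do
driver.get("https://example.com")  # placeholder for the MathSciNet page

# starts-with anchors the match to the beginning of @href, including ?code=
for elem in driver.find_elements(
        By.XPATH, "//a[starts-with(@href,'/mathscinet/search/mscdoc.html?code=')]"):
    print(elem.text)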
I have a problem with the XPath function of Python's lxml. A minimal example is the following python code:
from lxml import html, etree
text = """
<p class="goal">
<strong>Goal</strong> <br />
<ul><li>test</li></ul>
</p>
"""
tree = html.fromstring(text)
thesis_goal = tree.xpath('//p[@class="goal"]')[0]
print(etree.tostring(thesis_goal).decode())
Running the code produces
<p class="goal">
<strong>Goal</strong> <br/>
</p>
As you can see, the entire <ul> block is lost. This also means that it is not possible to address the <ul> with an XPath along the lines of //p[@class="goal"]/ul, as the <ul> is not counted as a child of the <p>.
Is this a bug or a feature of lxml, and if it is the latter, how can I get access to the entire contents of the <p>? The thing is embedded in a larger website, and it is not guaranteed that there will even be a <ul> tag (there may be another <p> inside, or anything else, for that matter).
Update: Updated title after answer was received to make finding this question easier for people with the same problem.
ul elements (or more generally flow content) are not allowed inside p elements (which can only contain phrasing content). Therefore lxml.html parses text as
In [45]: print(html.tostring(tree))
<div><p class="goal">
<strong>Goal</strong> <br>
</p><ul><li>test</li></ul>
</div>
The ul follows the p element. So you could find the ul element using the XPath
In [47]: print(html.tostring(tree.xpath('//p[@class="goal"]/following::ul')[0]))
<ul><li>test</li></ul>
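Outside IPython, the same recovery behaviour can be seen in a plain script built from the snippet in the question:

from lxml import html

text = """
<p class="goal">
<strong>Goal</strong> <br />
<ul><li>test</li></ul>
</p>
"""

tree = html.fromstring(text)
# The HTML parser has moved the <ul> out of the <p>, so the child axis
# finds nothing, while the following:: axis still reaches it:
print(tree.xpath('//p[@class="goal"]/ul'))                  # []
print(html.tostring(tree.xpath('//p[@class="goal"]/following::ul')[0]))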
@unutbu has the correct answer. Your HTML is not valid, and the HTML parser will produce unexpected results. As it's said in the lxml docs:
The support for parsing broken HTML depends entirely on libxml2's
recovery algorithm. It is not the fault of lxml if you find documents
that are so heavily broken that the parser cannot handle them. There
is also no guarantee that the resulting tree will contain all data
from the original document. The parser may have to drop seriously
broken parts when struggling to keep parsing. Especially misplaced
meta tags can suffer from this, which may lead to encoding problems.
Depending on what you are trying to achieve, you can fall back to the XML parser:
# Changing html to etree here will produce the behaviour you expect
tree = etree.fromstring(text)
or move to a more advanced website-parsing package such as BeautifulSoup4.
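For illustration, a minimal sketch of that XML-parser fallback, using the text snippet from the question (note that the XML parser performs no HTML-specific fix-ups, so the input must be well-formed):

from lxml import etree

text = """
<p class="goal">
<strong>Goal</strong> <br />
<ul><li>test</li></ul>
</p>
"""

# The XML parser keeps the <ul> inside the <p>, so the child axis works:
tree = etree.fromstring(text)
ul = tree.xpath('//p[@class="goal"]/ul')[0]
print(etree.tostring(ul))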