I have an XML file like the one below. How can I select only the tag whose href attribute ends with parent, like the third element below?
Selecting it by position, like
elem = tree.findall('{*}CustomProperty')[2]
does not work, because some documents might have only one href ending in parent, others 5-10, and the third element might not have such an href at all.
I would prefer XPath, but I cannot figure out how to tell XPath to match the end of an attribute value.
XPath is not a must, though; I will be glad to use any approach that fits my purpose.
So how can I get the CustomProperty element whose href attribute ends with the word parent?
<CustomProperty href="urn:1653267:643562dafewq:cs:46wey5ge:234566">urn:1653267:643562dafewq:cs:46wey5ge:234566:ss</CustomProperty>
<CustomProperty href="urn:1653267:643562dafewq:cs:46wey5ge:234566">urn:1653267:643562dafewq:cs:46wey5ge:234566:ss</CustomProperty>
<CustomProperty href="urn:1653267:643562dafewq:cs:46wey5ge:234566:parent">urn:1653267:643562dafewq:cs:46wey5ge:234566:ss</CustomProperty>
Thank you in advance for help
Try using the contains function to find the element whose href attribute contains the word parent:
//*[contains(@href, 'parent')]
or, if you are sure that the text "parent" is at the end of the value, you can use ends-with (an XPath 2.0 function):
//*[ends-with(@href, 'parent')]
Does
//CustomProperty[contains(@href, 'parent') and substring-after(@href, 'parent') = '']
cater to your requirements? One issue with this suggestion is that it fails for href attributes in which parent occurs more than once, because substring-after splits at the first occurrence.
If your XPath processor supports XPath 2.0, use aberna's suggestion.
Remember to replace the '//' axis with specific paths wherever possible, for performance reasons.
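Since the question says XPath is not a must, here is a minimal sketch that does the suffix check in Python instead, using only the standard library. It assumes the three elements sit under a hypothetical <root> wrapper (not shown in the question) and that Python 3.8+ is available for the `{*}` namespace wildcard; `find_by_href_suffix` is an illustrative helper name, not part of any library.

```python
import xml.etree.ElementTree as ET

# Sample data from the question, wrapped in a hypothetical <root> element
# so that it parses as a single document.
XML = """<root>
<CustomProperty href="urn:1653267:643562dafewq:cs:46wey5ge:234566">urn:1653267:643562dafewq:cs:46wey5ge:234566:ss</CustomProperty>
<CustomProperty href="urn:1653267:643562dafewq:cs:46wey5ge:234566">urn:1653267:643562dafewq:cs:46wey5ge:234566:ss</CustomProperty>
<CustomProperty href="urn:1653267:643562dafewq:cs:46wey5ge:234566:parent">urn:1653267:643562dafewq:cs:46wey5ge:234566:ss</CustomProperty>
</root>"""


def find_by_href_suffix(root, suffix):
    """Return CustomProperty elements (any namespace) whose href ends with suffix."""
    return [
        el
        for el in root.findall(".//{*}CustomProperty")
        if el.get("href", "").endswith(suffix)
    ]


root = ET.fromstring(XML)
matches = find_by_href_suffix(root, "parent")

# For an XPath 1.0 processor, the usual way to emulate ends-with is:
#   //CustomProperty[substring(@href,
#       string-length(@href) - string-length('parent') + 1) = 'parent']
# ElementTree's limited XPath subset does not support it, so it is not run here.
```

Filtering in Python avoids both the XPath 2.0 requirement of ends-with and the repeated-occurrence problem of the substring-after variant.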
In an HTML page there is this line:
<td data-sort="funny" class="coin-name tw-text-right" style="min-width: 60px;">
and I can find it using this XPath:
//tbody/tr/td[5]
But I am only interested in putting the "funny" value into a variable. Keep in mind that the word "funny" changes all the time, so I need to find it and store it in a variable. How do I extract this changing text?
Thank you for helping ;-)
I am not sure it will work 100%, but here is one potential solution:
1. If you open up that tag, you will find that the first child's second child (refer to the image in the solution) has a unique id attribute.
2. Then you can use that unique attribute and work your way up to the parent tag with the data-sort attribute, using child-to-parent traversal in XPath. [Refer to the image; it explains the same approach written above][1]
[1]: https://i.stack.imgur.com/9Dc2k.png
3. Once you have uniquely identified the td tag, you can use getAttribute() and store its value.
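The last step, reading the attribute once the td is located, can be sketched as follows. This is a standard-library stand-in rather than the Selenium call mentioned above: the surrounding table markup is a made-up, well-formed fragment, and only the td line comes from the question. With Selenium's Python bindings the corresponding call on a located element would be get_attribute("data-sort").

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed table fragment; only the fifth <td> is taken
# from the question, the other cells are placeholders.
HTML = """<table><tbody><tr>
  <td>1</td><td>2</td><td>3</td><td>4</td>
  <td data-sort="funny" class="coin-name tw-text-right" style="min-width: 60px;">text</td>
</tr></tbody></table>"""

root = ET.fromstring(HTML)

# Same idea as //tbody/tr/td[5], written relative to the <table> root.
cell = root.find("./tbody/tr/td[5]")

# Read the attribute value, whatever it currently is, into a variable.
data_sort = cell.get("data-sort") if cell is not None else None
```

Because the value is read from the attribute rather than matched against, it does not matter that "funny" changes between page loads.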
I am scraping this webpage and while trying to extract text from one element, I am hitting a dead end.
So the element in question is shown below in the image -
The text in this element is within the <p> tags inside the <div>. I tried extracting the text in the scrapy shell using the following code - response.css("div.home-hero-blurb no-select::text").getall(). I received an empty list as the result.
Alternatively, if I try going a bit further and reference the <p> tags individually, I can get the text. Why does this happen? Isn't the <div> a parent element and shouldn't my code extract the text?
Note - I wanted to use the div because I thought that'll help me get both the <p> tags in one query.
I can see two issues here.
The first is that if you separate the class names with a space, the CSS selector is interpreted as looking for a descendant element with that name. So the correct selector is "div.home-hero-blurb.no-select::text" instead of "div.home-hero-blurb no-select::text".
The second issue is that the text you want is inside a p element that is a child of that div. If you select only the div, the selector will return the text directly inside the div, but not the text in its children. Since there is also a strong element as a child of p, I would suggest a more general approach like:
response.css("div.home-hero-blurb.no-select *::text").getall()
This should return all text from the div and its descendants.
It's relevant to point out that extracting text with ::text is an extension to the standard CSS selectors. Scrapy mentions this here.
Edit
If you were to use XPath, this would be the equivalent expression:
response.xpath('//div[@class="home-hero-blurb no-select"]//text()').getall()
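To illustrate the difference between a div's own text and the text of its descendants, here is a small sketch using the standard library instead of Scrapy. The markup is a hypothetical fragment modelled on the description above (a div containing p elements, one of them with a strong child), not the real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the div has no text of its own, only child elements.
HTML = (
    '<div class="home-hero-blurb no-select">'
    "<p>First <strong>bold</strong> line.</p>"
    "<p>Second line.</p>"
    "</div>"
)

div = ET.fromstring(HTML)

# Text located directly inside the div, before its first child.
direct_text = div.text

# Text of the div and all of its descendants, in document order.
all_text_nodes = list(div.itertext())
```

Here direct_text is None because all the text lives inside the p and strong elements, which is the same reason a selector that targets only the div's own text returns nothing useful.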
Regex is definitely giving me a headache. Every time I move one step ahead, I have a feeling that I am taking two steps back!
I am trying to extract the class attribute of the last tag before the one containing any first name.
I randomly found that website which I thought would be a good example to practice. I am trying to write a general rule! Nothing specifically applied to that website.
The only assumption is that I know what the first name is and that it is contained in a tag (div, span, h1, ...) with a certain class.
Here are my regex attempts:
re.findall(r'(?:class="(.+)".+){2}.*' + val, source) #'source' is the source code of the page
re.findall(r'(?:class="(.+)".*class=)+' + val, source) #'val' a name that I know is in the page
Any explanations on what is wrong or on what to do to succeed in my task would be highly appreciated !
Thanks a lot and stay safe.
Here is the solution that I found. Assuming any keyword, you want to retrieve the text of the preceding element.
First find the class of the tag containing your keyword:
elt = driver.find_element_by_xpath("//*[contains(text(),'{}')]".format(keyword))
keyword_class = elt.get_attribute('class')
Next, you can find the parent or preceding siblings using XPath.
# Find the preceding siblings of the elements with that class and print their text
xpath = "//*[@class='{}']//preceding-sibling::*".format(keyword_class)
pre_siblings = driver.find_elements_by_xpath(xpath)
for sibling in pre_siblings:
    print(sibling.text)
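The same two steps (find the class of the element containing the keyword, then collect its preceding siblings) can be sketched without a browser using the standard library. The markup, the class names, and the `preceding_siblings` helper are all hypothetical. Note that this sketch returns only the direct preceding siblings of the matched element, whereas the XPath above, with `//` before the axis, also covers preceding siblings of its descendants.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup; the class names are made up for illustration.
HTML = """<div>
  <span class="title">Dr.</span>
  <span class="role">Engineer</span>
  <span class="first-name">Alice</span>
  <span class="last-name">Smith</span>
</div>"""


def preceding_siblings(root, target):
    """Return the siblings that come before target under the same parent."""
    # ElementTree elements do not know their parent, so build a lookup first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    parent = parent_of.get(target)
    if parent is None:
        return []
    siblings = list(parent)
    return siblings[: siblings.index(target)]


root = ET.fromstring(HTML)
keyword = "Alice"

# Step 1: first element whose own text contains the keyword, and its class.
target = next(el for el in root.iter() if el.text and keyword in el.text)
keyword_class = target.get("class")

# Step 2: text of the elements that precede it.
preceding_texts = [sib.text for sib in preceding_siblings(root, target)]
```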
How can I locate both texts within the same XPath?
I used this, but it does not work:
//div[contains(text(),"Review") and contains(text(),"received"]
The texts belong to two different tags. You can look for an element with the "Review" text that has a child with the "received" text:
//div[contains(text(),"Review") and div[contains(text(),"received")]]
Take care, you are missing the closing ')':
//div[contains(text(),"Review") and contains(text(),"received")]
But this is not a good XPath, because "received" is in an inner element.
Try this; .//* means any descendant element (you can use ./div in the second contains instead):
//div[.//*[contains(text(),"Review")] and .//*[contains(text(),"received")]]
or
//div[contains(text(),"Review") and .//*[contains(text(),"received")]]
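To make the logic of the last expression concrete, here is a rough standard-library sketch of the same test: a div whose own text contains "Review" and which has some descendant whose own text contains "received". The markup and helper names are hypothetical. One deliberate difference: in XPath 1.0, contains(text(), ...) looks only at the first text node, while this sketch checks every direct text node.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with one matching and two non-matching outer divs.
HTML = """<body>
  <div id="a">Review <div>received</div></div>
  <div id="b">Review only</div>
  <div id="c"><span>Review</span><span>received</span></div>
</body>"""


def own_text_nodes(el):
    """Text nodes that are direct children of el: el.text plus children's tails."""
    return [el.text or ""] + [child.tail or "" for child in el]


def own_text_contains(el, word):
    return any(word in text for text in own_text_nodes(el))


def descendant_text_contains(el, word):
    return any(own_text_contains(d, word) for d in el.iter() if d is not el)


root = ET.fromstring(HTML)
matching_ids = [
    div.get("id")
    for div in root.iter("div")
    if own_text_contains(div, "Review") and descendant_text_contains(div, "received")
]
```

The div with id "c" is not selected because "Review" sits in a child span rather than in the div's own text; the first expression above, with .//* in both conditions, is the one that would cover that case.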
I am trying to find the input element with the name statusid_103408 and with text() Draft.
Here is the XPath I am using; I am not sure where I am going wrong:
//input[@name='statusid_103408' and contains(text(), 'Draft')]
The reason this XPath does not work is that the text "Draft" is not actually part of the input element. It is contained in the li element that is its parent. Therefore, your search returns no results.
I suggest using only the name in your XPath search (if it is unique). If you definitely need the text in your search, you can match on the li element's text first and then find your input, like so:
//li[text()='Draft']/input[@name='statusid_103408']
Use the value attribute and it will work, because the value is unique; the text is not inside the input tag!
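A minimal standard-library sketch of both ideas: match the li by its text and then pick the input inside it, or look the input up directly by its value attribute. The list markup and the value attributes ("1", "2") are assumptions, since the real page is not shown, and the text comparison here strips whitespace, which the XPath text()='Draft' test does not.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the label text belongs to the <li>, not to the <input>.
HTML = """<ul>
  <li><input type="radio" name="statusid_103408" value="1"/>Draft</li>
  <li><input type="radio" name="statusid_103408" value="2"/>Final</li>
</ul>"""

root = ET.fromstring(HTML)


def li_has_text(li, wanted):
    """True if any text node directly inside li equals wanted (ignoring whitespace)."""
    nodes = [li.text] + [child.tail for child in li]
    return any((node or "").strip() == wanted for node in nodes)


# Approach 1: find the li by its text, then the input inside it.
by_text = None
for li in root.iter("li"):
    if li_has_text(li, "Draft"):
        by_text = li.find("input[@name='statusid_103408']")
        break

# Approach 2: select the input directly by its (assumed unique) value.
by_value = root.find(".//input[@value='1']")
```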