Python Scrapy: XPath text() fails to extract text separated by <br /> - python

I am using scrapy shell and trying to get the text part of the following element
<div class="CCCCC">AAA<br />BBB<br />CCC<br />DDD</div>
By using
response.xpath('//div[@class="CCCCC"]')[0].extract()
I got the whole element, including the tags:
<div class="CCCCC">AAA<br>BBB<br>CCC<br>DDD</div>
but using
response.xpath('//div[@class="CCCCC"]/text()')[0].extract()
I got only 'AAA' instead of the 'AAA<br>BBB<br>CCC<br>DDD' that I expected.
Is this behavior of text() correct?

The behaviour is correct.
response.xpath('//div[@class="CCCCC"]/text()')
will give [AAA, BBB, CCC, DDD] as a list, but your code is
response.xpath('//div[@class="CCCCC"]/text()')[0].extract()
Note that you ask for the first element of the list with [0]. That's why you only get AAA.
If you remove the [0], you will get all four elements.
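
To see why text() yields four separate nodes, here is a minimal sketch that uses the standard library's xml.etree.ElementTree instead of Scrapy (an assumption made so the snippet runs without a crawler). Each <br /> ends one text node and starts the next, so the div holds four pieces of text rather than one.

```python
import xml.etree.ElementTree as ET

# The same markup as in the question, parsed as XML for illustration only.
div = ET.fromstring('<div class="CCCCC">AAA<br />BBB<br />CCC<br />DDD</div>')

# div.text holds only the text that comes before the first child element.
first_text = div.text

# itertext() walks every text node in document order.
all_texts = list(div.itertext())

# Joining the pieces rebuilds the visible text, one line per <br />.
joined = "\n".join(all_texts)

print(first_text)
print(all_texts)
print(joined)
```

In Scrapy itself, calling .extract() on the selector list without [0] returns the corresponding list of strings, which can be joined the same way.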

Please avoid using "[0].extract()" in Scrapy; it raises an IndexError ("list index out of range") when nothing matches.
Use response.xpath('//div[@class="CCCCC"]/text()').extract_first() instead; it returns None rather than raising when there is no first element.
For further details, see the Scrapy Selector documentation.
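
The difference can be sketched with plain Python lists, since the result of a selector query is indexed like a list. The helper first_or_default below is hypothetical and only imitates the idea behind extract_first(): return the first item, or a default when there is none.

```python
def first_or_default(items, default=None):
    """Return the first item, or `default` when the list is empty.

    Hypothetical helper that imitates the idea behind extract_first().
    """
    return items[0] if items else default


matches = ["AAA", "BBB", "CCC", "DDD"]  # stand-in for a query that matched
no_matches = []                         # stand-in for a query that matched nothing

print(first_or_default(matches))
print(first_or_default(no_matches))

# Indexing an empty result is what triggers the error mentioned above.
index_error = None
try:
    no_matches[0]
except IndexError as exc:
    index_error = exc
    print("IndexError:", exc)
```

Scrapy's extract_first() also accepts a default argument, for example extract_first(default=''), if None is not a convenient fallback.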

Related

InvalidSelectorException Error while trying to get text from div class in Selenium Python

I'm trying to get text using Selenium WebDriver and here is my code. Please note that I don't want to use XPath, because in my case the ID gets changed on every relaunch of the web page.
My code:
driver.find_element_by_class_name("05uR6d").text
HTML:
<div class="O5uR6d">to fasten stuff</div>
Error:
selenium.common.exceptions.InvalidSelectorException: Message: invalid selector: An invalid or illegal selector was specified (Session info: chrome=88.0.4324.150)
Error is specific to the line of code I mentioned above.
How can I fix this?
Use this xpath:
driver.find_element_by_xpath("//div[contains(text(),'to fasten stuff')]")
Or this CSS:
driver.find_element_by_css_selector(".O5uR6d")
If neither works, improve your question by adding more of the HTML you are looking at.
It can be done in multiple ways; let me try to explain most of them.
Get element by class name.
This is the easiest way to get an element by class name. You can simply do:
driver.find_element_by_class_name('foo')
Get Element by xpath
This one is a bit trickier. You can build the XPath on the class name, title, id, or whatever attribute stays the same. It also works if there is text inside your div. For example:
driver.find_element_by_xpath("//tagname[@attribute='value']")
or in your case:
driver.find_element_by_xpath("//div[@class='O5uR6d']")
or you can do something like @vitaliis said:
driver.find_element_by_xpath("//div[contains(text(),'to fasten stuff')]")
You can read more about xpath and how to find it on this link
Get Elements by ID:
You can also get the element by id, if it has one that is static:
driver.find_element_by_id('baz')
Get Elements by Name:
Get Elements by name using the following syntax:
driver.find_element_by_name('bazz')
Using CSS Selectors:
You can also use CSS selectors to find elements. Consider the following tag, which has some attributes:
<p class="content">Site content goes here.</p>
You can get this element by:
driver.find_element_by_css_selector('p.content')
You can read more about it over here

How to find a specific html element with python selenium

So I have this element in html :
<input data-v-72dea36a="" type="text" name="tradelink" placeholder="Enter your code" style="margin-right: 25px;">
How can I select exactly this one? I tried:
find_element_by_name('tradelink')
There is another element before this one with the same name...it doesn't have an id or class name...
If there are just 2 elements with the attribute name="tradelink", and assuming you need the second one, you can use:
find_elements_by_name('tradelink')[1]
Another way is to use an XPath that matches the placeholder value:
find_element_by_xpath("//input[@placeholder='Enter your code']")
Notes:
Notice the plural in find_elementS in the first example.
Selenium Docs - Locating Elements
For more than 2 elements, you may have to change the index, e.g. find_elements_by_name('tradelink')[2]
Right-click the element in the browser's inspector and select "Copy full XPath". As in the answer above, paste that XPath into the parentheses of find_element_by_xpath().
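
The two locator ideas in this thread (taking the second match by index, or matching on a distinguishing attribute) can be tried offline. The sketch below uses the standard library's xml.etree.ElementTree as a stand-in for the browser, and the form markup is a hypothetical reduction of the page, not the real one.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: two inputs share name="tradelink".
page = ET.fromstring(
    "<form>"
    '<input type="text" name="tradelink" placeholder="Search" />'
    '<input type="text" name="tradelink" placeholder="Enter your code" />'
    "</form>"
)

# Counterpart of find_elements_by_name('tradelink')[1]: second match by index.
by_index = page.findall(".//input[@name='tradelink']")[1]

# Counterpart of the placeholder XPath: match on a distinguishing attribute.
by_placeholder = page.find(".//input[@placeholder='Enter your code']")

print(by_index.get("placeholder"))
print(by_index is by_placeholder)
```

Matching on the placeholder is usually less fragile than the index, because it keeps working if another input with the same name is added before the target.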

What to do about dynamically changing XPaths in Python using Selenium?

I have an XPath like this:
//*[@id="jobs-search-box-keyword-id-ember968"]
The number 968 changes after every reload.
The rest of the string remains constant.
How do I locate an element whose XPath keeps changing?
You can use a partial id with contains():
//*[contains(@id, "jobs-search-box-keyword-id-ember")]
You can try using starts-with() as below:
//*[starts-with(@id,'jobs-search-box-keyword-id-ember')]
The details provided are insufficient to give an accurate answer. Still, you can follow the references below.
In //*[@id="jobs-search-box-keyword-id-ember968"] the trailing number 968 keeps changing. If you change it to //*[starts-with(@id,'jobs-search-box-keyword-id-ember')], more than one element may share the partial id jobs-search-box-keyword-id-ember. In that case the locator matches the first such element, which may not be the one you expect.
Use the tag name. Let's say the element is an input tag whose id is jobs-search-box-keyword-id-ember968:
Xpath - //input[starts-with(@id,'jobs-search-box-keyword-id-ember')]
CSS - input[id^='jobs-search-box-keyword-id-ember']
Use the relevant parent element to make this more specific, e.g. if the element is inside the parent tag <div class="container">:
Xpath - //div[@class='container']//input[starts-with(@id,'jobs-search-box-keyword-id-ember')]
CSS - div.container input[id^='jobs-search-box-keyword-id-ember']
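
The progression from a broad prefix match to a tag-specific and then parent-scoped locator can be sketched as a small string builder. The function locators_for_id_prefix is a hypothetical helper, not part of Selenium, and it assumes the prefix and class name contain no quote characters.

```python
def locators_for_id_prefix(prefix, tag=None, parent_class=None):
    """Return (xpath, css) locators for ids that start with `prefix`.

    Hypothetical helper: `tag` and `parent_class` are optional ways to
    narrow the match when several ids share the same prefix.
    """
    xpath = f"//{tag or '*'}[starts-with(@id,'{prefix}')]"
    css = f"{tag or ''}[id^='{prefix}']"
    if parent_class:
        # Scope the match to descendants of <div class="parent_class">.
        xpath = f"//div[@class='{parent_class}']{xpath}"
        css = f"div.{parent_class} {css}"
    return xpath, css


PREFIX = "jobs-search-box-keyword-id-ember"

broad_xpath, broad_css = locators_for_id_prefix(PREFIX)
input_xpath, input_css = locators_for_id_prefix(PREFIX, tag="input")
scoped_xpath, scoped_css = locators_for_id_prefix(
    PREFIX, tag="input", parent_class="container"
)

for locator in (broad_xpath, input_xpath, scoped_xpath, scoped_css):
    print(locator)
```

The resulting strings can be passed to the XPath or CSS lookup calls used elsewhere in this thread; the more specific variants reduce the chance of hitting an unrelated element that shares the prefix.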
This worked for me:
Locator:
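JOBS_SEARCH_BOX_XPATH = "//*[contains(@id,'jobs-search-box-keyword-id-ember')]"
Code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, JOBS_SEARCH_BOX_XPATH)))  # wait up to 10 s
element.send_keys("SDET")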

Finding an element by partial href (Python Selenium)

I'm trying to access text from elements that have different xpaths but very predictable href schemes across multiple pages in a web database. Here are some examples:
<a href="/mathscinet/search/mscdoc.html?code=65J22,(35R30,47A52,65J20,65R30,90C30)">
65J22 (35R30 47A52 65J20 65R30 90C30) </a>
In this example I would want to extract "65J22 (35R30 47A52 65J20 65R30 90C30)"
<a href="/mathscinet/search/mscdoc.html?code=05C80,(05C15)">
05C80 (05C15) </a>
In this example I would want to extract "05C80 (05C15)". My web scraper would not be able to search by xpath directly due to the xpaths of my desired elements changing between pages, so I am looking for a more roundabout approach.
My main idea is to use the fact that every href contains "/mathscinet/search/mscdoc.html?code=". Selenium can't directly search for hrefs, but I was thinking of doing something similar to this C# implementation:
Driver.Instance.FindElement(By.XPath("//a[contains(@href, 'long')]"))
To port this over to python, the only analogous method I could think of would be to use the in operator, but I am not sure how the syntax will work when everything is nested in a find_element_by_xpath. How would I bring all of these ideas together to obtain my desired text?
driver.find_element_by_xpath("//a['/mathscinet/search/mscdoc.html?code=' in @href]").text
If I understand correctly, you want to locate all elements that share the same partial href. You can use this:
elements = driver.find_elements_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]")
for element in elements:
    print(element.text)
This gives you a list of all matching elements.
Or, if you want to locate only one element:
driver.find_element_by_xpath("//a[contains(@href, '/mathscinet/search/mscdoc.html')]").text
As per the HTML you have shared, @AndreiSuvorkov's answer would likely meet your current requirement. You can make the XPath more specific by:
Using starts-with instead of contains
Including the ?code= part of the @href attribute
Your effective code block will be:
all_elements = driver.find_elements_by_xpath("//a[starts-with(@href,'/mathscinet/search/mscdoc.html?code=')]")
for elem in all_elements:
    print(elem.get_attribute("innerHTML"))
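
The same starts-with filter can also be applied to static HTML, for example a page source that has already been saved as a string. This is a minimal sketch using the standard library's html.parser rather than Selenium; the class name MscLinkText and the unrelated link in the sample are assumptions added for illustration.

```python
from html.parser import HTMLParser

HREF_PREFIX = "/mathscinet/search/mscdoc.html?code="


class MscLinkText(HTMLParser):
    """Collect the text of every <a> whose href starts with HREF_PREFIX."""

    def __init__(self):
        super().__init__()
        self.results = []
        self._buffer = None  # text chunks while inside a matching <a>

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.startswith(HREF_PREFIX):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._buffer is not None:
            # Collapse runs of whitespace and trim the ends.
            self.results.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


sample_html = """
<a href="/mathscinet/search/mscdoc.html?code=65J22,(35R30,47A52,65J20,65R30,90C30)">
65J22 (35R30 47A52 65J20 65R30 90C30) </a>
<a href="/other/page.html">unrelated link</a>
<a href="/mathscinet/search/mscdoc.html?code=05C80,(05C15)">
05C80 (05C15) </a>
"""

parser = MscLinkText()
parser.feed(sample_html)
parser.close()
print(parser.results)
```

Only the two links whose href begins with the prefix are kept; the unrelated link is skipped, which is the same selection the starts-with XPath performs in the browser.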

Python Selenium find element by link text contains a string with wildcard or regex

I have a HTML snippet like this:
<span class="line S_line1">
评论
<em>1</em>
</span>
The thing is that the number in <em>1</em> is not predictable and is sometimes omitted. I want to find this element with
driver.find_element_by_link_text(u'评论*')
But it didn't work. Is there a way to do that with a wildcard or regex?
driver.find_element_by_partial_link_text(u'评论')
You can use partial_link_text. This way you can find a link whose content changes by using the part that always stays constant.
