<div class="inner-article">
<a style="height:150px;" href="this is a link"><img width="150" height="150" src="this is an image" alt="K1 88ahiwyu"></a>
<h1><a class="name-link" href="/shop/jackets/pegroxdya/dao7kdzej">title</a></h1>
<p><a class="name-link" href="/shop/jackets/pegroxdya/dao7kdzej">subtitle</a></p>
</div>
Hello!
I need to find an XPath to select the "div" with class="inner-article" by the title and subtitle of its two "a" children. The website I want to operate on has a lot of these inner-articles, and I need to find a specific one given only a title and a subtitle.
E.G.: The website has an inner article with the title "Company® Leather Work Jacket" and a subtitle with its color "Silver".
Now I need to be able to find the "div" element even if I only have the keywords "Work Jacket" for the title and "Silver" for the subtitle.
This is what I came up with already:
e1 = driver.find_element_by_xpath("//*[text()[contains(.,'" + kw + "')]]")
kw is a string containing the keywords for the title. If I print it out, it correctly returns the "a" element, and clicking on it works too, but it's not specific enough: there are more objects whose titles also contain these keywords. That's why I also need the subtitle, which always contains the color (here referred to as the string clr):
e2 = driver.find_element_by_xpath("//*[text()[contains(.,'" + clr + "')]]")
This also works and clicks on the subtitle correctly, but the color alone would also match multiple objects on the website.
That's why I need to find the "div" element with keywords for the title and the color for the subtitle.
I've tried this but it doesn't work:
e1 = driver.find_element_by_xpath("//*[text()[contains(.,'" + kw + "') and contains(.,'" + clr + "')]]")
Try
driver.find_element_by_xpath("//div[h1/a[contains(text(),'" + kw + "')] and p/a[contains(text(),'" + clr + "')]]")
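As a quick offline sanity check, the same combined XPath can be run against the sample markup with lxml (the kw and clr values below are placeholders standing in for your keyword strings):

```python
# Offline check of the combined XPath using lxml
# (kw and clr are placeholder keyword strings).
from lxml import html

sample = '''
<div class="inner-article">
<h1><a class="name-link" href="/shop/jackets/pegroxdya/dao7kdzej">Company Leather Work Jacket</a></h1>
<p><a class="name-link" href="/shop/jackets/pegroxdya/dao7kdzej">Silver</a></p>
</div>
'''

kw, clr = "Work Jacket", "Silver"
tree = html.fromstring(sample)
divs = tree.xpath(
    "//div[h1/a[contains(text(),'" + kw + "')]"
    " and p/a[contains(text(),'" + clr + "')]]"
)
print(len(divs), divs[0].get("class"))  # 1 inner-article
```

The predicate keeps only divs whose h1 link contains the title keywords AND whose p link contains the color, which is what makes the match unique.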
You can learn more XPath grammar; refer to this link.
In your case, you can use an XPath like this:
("//*[text()[contains(.,'" + kw + "')]]/parent::div[1]")
I am trying to search into a table for a specific value (Document ID) and then press a button that is next to that column (Retire). I should add here that the 'Retire' button is only visible once the mouse is hovered over, but I have built that into my code which I'll share further down.
So for example:
My Document ID would be 0900766b8001b6a3, and I would want to click the button called 'Retire'. The issue I'm having is pulling the XPaths for the retire buttons, this needs to be dynamic. I got it working for some Document IDs that had a common link, for example:
A700000007201082 Xpath = //*[@id="retire-7201082"] (you can see the commonality here, with the Document ID ending in the same digits as the XPath number 7201082). Whereas in the first example, the XPath for '0900766b8001b6a3' = //*[@id="retire-251642"]; you can see the retire number here is completely unrelated to the Document ID, and therefore the XPath is hard to build manually.
Here is my code:
before_XPath = "//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr["
aftertd_XPath_1 = "]/td[1]"
aftertd_XPath_2 = "]/td[2]"
aftertd_XPath_3 = "]/td[3]"
before_XPath_1 = "//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr[1]/th["
before_XPath_2 = "//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr[2]/td["
aftertd_XPath = "]/td["
after_XPath = "]"
aftertr_XPath = "]"
search_text = "0900766b8001af05"
time.sleep(10)
num_rows = len(driver.find_elements_by_xpath("//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr"))
num_columns = len(driver.find_elements_by_xpath("//*[@class='wp-list-table widefat fixed striped table-view-list pages']/tbody/tr[2]/td"))
elem_found = False
for t_row in range(2, (num_rows + 1)):
    for t_column in range(1, (num_columns + 1)):
        FinalXPath = before_XPath + str(t_row) + aftertd_XPath + str(t_column) + aftertr_XPath
        cell_text = driver.find_element_by_xpath(FinalXPath).text
        if cell_text.casefold() == search_text.casefold():
            print("Search Text " + search_text + " is present at row " + str(t_row) + " and column " + str(t_column))
            elem_found = True
            achains = ActionChains(driver)
            move_to = driver.find_element_by_xpath("/html/body/div[1]/div[2]/div[2]/div[1]/div[3]/form[1]/table/tbody/tr[" + str(t_row) + "]/td[2]")
            achains.move_to_element(move_to).perform()
            retire_xpath = driver.find_element_by_xpath("//*[@id='retire-" + str(search_text[-7:]) + "']")
            time.sleep(6)
            driver.execute_script("arguments[0].click();", move_to)
            time.sleep(6)
            driver.switch_to.alert.accept()
            break
if elem_found == False:
    print("Search Text " + search_text + " not found")
This particular bit of code lets me handle any Document IDs such as 'A700000007201082' as I can just cut off the part I need and build it into an XPath:
retire_xpath = driver.find_element_by_xpath("//*[@id='retire-" + str(search_text[-7:]) + "']")
I've tried to replicate the above for the Doc IDs starting with 09007, but I can't find how to pull that unique number as it isn't anywhere accessible in the table.
I am wondering if there's something I can do to build it the same way I have above or perhaps focus on the index? Any advice is much appreciated, thanks.
EDIT:
This is the HTML code for the RETIRE button for Document ID: 0900766b8001b6a3
<span class="retire"><button id="retire-251642" data-document-id="251642" rel="bookmark" aria-label="Retire this document" class="rs-retire-link">Retire</button></span>
You can see the retire button id is completely different to the Document ID. Here is some HTML code just above it which I think could be useful:
<div class="hidden" id="inline_251642">
<div class="post_title">General Purpose Test Kit Lead</div><div class="post_name">0900766b8001b6a3</div>
<div class="post_author">4</div>
<div class="comment_status">closed</div>
<div class="ping_status">closed</div>
<div class="_status">publish</div>
<div class="jj">30</div>
<div class="mm">03</div>
<div class="aa">2001</div>
<div class="hh">15</div>
<div class="mn">43</div>
<div class="ss">03</div>
<div class="post_password"></div><div class="post_parent">0</div><div class="page_template">default</div><div class="tags_input" id="rs-language-code_251642">de, en, fr, it</div><div class="tags_input" id="rs-current-state_251642">public</div><div class="tags_input" id="rs-doc-class-code_251642">rs_instruction_sheet</div><div class="tags_input" id="rs-restricted-countries_251642"></div></div>
Would it be possible to call on the div class "post_name", as this has the correct Doc ID, and then press the RETIRE button for that specific Doc ID?
Thank you.
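Yes; here is a sketch of that post_name idea, assuming the hidden inline_* div always appears as in the edit above (checked offline with lxml rather than Selenium, so the locators are assumptions): locate the div whose post_name child equals the Document ID, read its id, and turn it into the retire button id.

```python
# Sketch: derive the retire button id from the hidden inline_* div
# whose post_name child holds the Document ID (structure assumed
# from the HTML shown in the edit above).
from lxml import html

sample = '''
<div class="hidden" id="inline_251642">
  <div class="post_title">General Purpose Test Kit Lead</div>
  <div class="post_name">0900766b8001b6a3</div>
</div>
'''

doc_id = "0900766b8001b6a3"
tree = html.fromstring(sample)
# Find the wrapper div that has a post_name child equal to the Doc ID
inline_div = tree.xpath("//div[div[@class='post_name']='" + doc_id + "']")[0]
# id="inline_251642" -> "retire-251642"
retire_id = "retire-" + inline_div.get("id").replace("inline_", "")
print(retire_id)  # retire-251642
```

In Selenium the same XPath could be used with driver.find_element_by_xpath, and retire_id then plugged into an "//button[@id='...']" locator before the hover-and-click steps.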
I have this sample-html:
<div class="classname1">
"This is text inside of"
<b>"a subtag"</b>
"I would like to select."
<br>
"More text I don't need"
</br>
(more br and b tags on the same level)
</div>
The result should be a list containing:
["This is text inside of a subtag I would like to select."]
I tried:
response.xpath('//div[@class="classname1"]//text()[1]').getall()
but this gives me only the first part "This is text inside".
There are two challenges:
Sometimes there is no b tag
There is even more text after the desired section that should be excluded
Maybe a loop?
If anyone has an approach it would be really helpful.
What about this (using More text I don't need as a stopword)?
parts = []
for text in response.xpath('//div[@class="classname1"]//text()').getall():
    if "More text I don't need" in text:
        break
    parts.append(text)
result = ' '.join(parts)
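The stopword loop itself can be sanity-checked without Scrapy by substituting a plain list for the .getall() result:

```python
# Stand-alone check of the stopword loop: texts stands in for
# response.xpath('//div[@class="classname1"]//text()').getall().
texts = [
    "This is text inside of",
    "a subtag",
    "I would like to select.",
    "More text I don't need",
]

parts = []
for text in texts:
    if "More text I don't need" in text:
        break  # stop collecting once the stopword appears
    parts.append(text)

result = ' '.join(parts)
print(result)  # This is text inside of a subtag I would like to select.
```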
UPDATE: For example, if you need to extract all the text before "Ort: ":
def parse(self, response):
    for card_node in response.xpath('//div[@class="col-md-8 col-sm-12 card-place-container"]'):
        parts = []
        for text in card_node.xpath('.//text()').getall():
            if 'Ort: ' in text:
                break
            parts.append(text)
        before_ort = '\n'.join(parts)
        print(before_ort)
Use the descendant-or-self XPath axis in combination with the position() function, as below:
response.xpath('//div[@class="classname1"]/descendant-or-self::*/text()[position() < 3]').getall()
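A quick offline check of this expression with lxml (note that position() < 3 keeps only the first two text-node children of each element, which happens to fit this sample but would need adjusting if more leading text nodes appear):

```python
# Offline check of the position()-based XPath using lxml.
from lxml import html

sample = '''
<div class="classname1">
This is text inside of
<b>a subtag</b>
I would like to select.
<br>
More text I don't need
</div>
'''

tree = html.fromstring(sample)
# For each element, keep only its first two child text nodes:
# this drops the third text child of the div ("More text I don't need").
texts = tree.xpath(
    '//div[@class="classname1"]/descendant-or-self::*/text()[position() < 3]'
)
print([t.strip() for t in texts])
```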
I'm having trouble with Beautiful Soup. I started learning it today but can't manage to find a way to fix my issue.
I want to get only one link each time, plus what is written in the h1 and p.
article_name_list = soup.find(class_='turbolink_scroller')
# find all articles in the div
article_name_list_items = article_name_list.find_all('article')
# loop to print them all out
for article_name in article_name_list_items:
    names = article_name.find('h1')
    color = article_name.find('p')
    print(names)
    print(color)
the output is:
<h1><a class="name-link" href="/shop/jackets/gw1diqgyr/km21a8hnc">Gonz Logo Coaches Jacket </a></h1>
<p><a class="name-link" href="/shop/jackets/gw1diqgyr/km21a8hnc">Red</a></p>
I would like to get in output :
href="blablabla"
Gonz Logo Coaches Jacket Red
and put each into a variable (if possible), like link = "blablabla" and name = "gonz logo ...", or three variables with the color in another one.
EDIT: here is what the page looks like:
<div class="turbolink_scroller" id="container" style="opacity: 1;">
<article>
<div class="inner-article">
<a style="height:150px;" href="/shop/jackets/h21snm5ld/jick90fel">
<img width="150" height="150" src="//assets.supremenewyork.com/146917/vi/MCHFhUqvN0w.jpg" alt="Mchfhuqvn0w">
<div class="sold_out_tag" style="">sold out</div>
</a>
<h1><a class="name-link" href="/shop/jackets/h21snm5ld/jick90fel">NY Tapestry Denim Chore Coat</a></h1>
<p><a class="name-link" href="/shop/jackets/h21snm5ld/jick90fel">Maroon</a></p>
</div>
</article>
<article></article>
<article></article>
<article></article>
</div>
EDIT 2: problem resolved (thank you)
here is the solution for others:
article_name_list = soup.find(class_='turbolink_scroller')
# find all articles in the div
article_name_list_items = article_name_list.find_all('article')
# loop to print them all out
for article_name in article_name_list_items:
    link = article_name.find('h1').find('a').get('href')
    names = article_name.find('h1').find('a').get_text()
    color = article_name.find('p').find('a').get_text()
    print(names)
    print(color)
    print(link)
thank you all for your answers.
I assume you're looking to put each of those into individual lists.
name_list = []
link_list = []
color_list = []
for article_name in article_name_list_items:
    names = article_name.find('h1').find('a', class_='name-link').get_text()
    links = article_name.find('p').find('a', class_='name-link').get('href')
    colors = article_name.find('p').find('a', class_='name-link').get_text()
    name_list.append(names)
    link_list.append(links)
    color_list.append(colors)
Not exactly sure what article_name_list_items looks like, but names will get you the text of the <h1> element, links will get you the href of the <a> inside the <p> element, and colors will get you the text of the <p> element.
You could also opt to include all the elements in a list of lists; initialize a new list list_of_all and replace the three list appends with the single append in the second line:
list_of_all = []
list_of_all.append([names, links, colors])
I believe you are very close. However, you should tell us a little more about the structure of the page: are all articles structured in the same h1>a, p>a structure?
Assuming this structure, the following should work:
link = article_name.find('h1').find('a').get('href')
color = article_name.find('p').find('a').get_text()
I have a source file like this:
<div class="l_post j_l_post l_post_bright " ...>
<div class="lzl_cnt">
...
<span class="lzl_content_main">
text1
<a class="at j_user_card" username="...">
username
</a>
text3
</span>
</div>
...
</div>
And I want to get text3. Currently, I tried this (I am at <div class="lzl_cnt">):
driver.find_element(By.XPATH, './/span[@class="lzl_content_main"]/text()[1]')
but I got
"Message: invalid selector: The result of the xpath expression ".//span[@class="lzl_content_main"]/text()[1]" is: [object Text]. It should be an element"
Is there a way to get "text3"?
I should make it clearer: the above HTML is part of a bigger structure, and I selected it out with the following Python code:
for sel in driver.find_elements_by_css_selector('div.l_post.j_l_post.l_post_bright'):
    for i in sel.find_elements_by_xpath('.//div[@class="lzl_cnt"]'):
        # user1 = i.find_element_by_xpath('.//a[@class="at j_user_card "]').text
        try: user2 = i.find_element_by_xpath('.//span[@class="lzl_content_main"]/a[@username]').text
        except: user2 = ""
        text3 = ???
        print(user2, text3)
In Selenium you cannot use an XPath that returns attributes or text nodes, so the /text() syntax is not allowed. If you want to get a specific child text node only, instead of the complete text content (returned by the text property), you can execute JavaScript.
You can apply the code below to get the required text node:
...
try: user2 = i.find_element_by_xpath('.//span[@class="lzl_content_main"]/a[@username]').text
except: user2 = ""
span = i.find_element_by_xpath('.//span[@class="lzl_content_main"]')
reply = driver.execute_script('return arguments[0].lastChild.textContent;', span)
You might also need to do reply = reply.strip() to get rid of leading and trailing spaces.
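If you want to verify the "last text node" idea without a browser, the XPath text()[last()] over the same markup is the equivalent of the JavaScript lastChild approach (lxml is used here only as an offline stand-in; Selenium itself still can't return text nodes):

```python
# Offline stand-in for the JavaScript lastChild.textContent trick:
# text()[last()] selects the final text-node child of the span.
from lxml import html

sample = '''
<span class="lzl_content_main">
text1
<a class="at j_user_card" username="u">username</a>
text3
</span>
'''

tree = html.fromstring(sample)
reply = tree.xpath('//span[@class="lzl_content_main"]/text()[last()]')[0]
print(reply.strip())  # text3
```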
Yes:
//div[@class='lzl_cnt']
And then you should use .text on that element.
Except your span isn't closed, so I'm assuming it closes before the div.
Here is a solution for you:
List<WebElement> list = driver.findElements(By.tagName("span"));
for (WebElement el : list) {
    String desiredText = el.getAttribute("innerHTML");
    if (desiredText.equalsIgnoreCase("text3")) {
        System.out.println("desired text found");
        break;
    }
}
Please try the above code and let me know your feedback.
I have a div table where each row has two cells/columns.
The second cell/column sometimes has plain text (<div class="something">Text</div>), while sometimes the text is wrapped in an "a" tag: <div class="something"><a href="...">Text</a></div>.
Now, I have no problem in getting everything but the linked text. I can also get the linked text separately, but I don't know how to get everything at once, so I get three columns of data:
1. first column text,
2. second column text no matter if it is linked or not,
3. the link, if it exists
The code that extracts everything not linked and works is:
times = scrapy.Selector(response).xpath('//div[contains(concat(" ", normalize-space(@class), " "), " time ")]/text()').extract()
titles = scrapy.Selector(response).xpath('//div[contains(concat(" ", normalize-space(@class), " "), " name ")]/text()').extract()
for time, title in zip(times, titles):
    print(time.strip(), title.strip())
I can get the linked items only with
ltitles = scrapy.Selector(response).xpath('//div[contains(concat(" ", normalize-space(@class), " "), " name ")]/a/text()').extract()
for ltitle in ltitles:
    print(ltitle.strip())
But don't know how to combine the "query" to get everything together.
Here's a sample HTML:
<div class="programRow rowOdd">
<div class="time ColorVesti">
22:55
</div>
<div class="name">
Dnevnik
</div>
</div>
<div class="programRow rowEven">
<div class="time ColorOstalo">
23:15
</div>
<div class="name">
<a class="recnik" href="/page/tv/sr/story/20/rts-1/2434373/kulturni-dnevnik.html" rel="/ajax/storyToolTip.jsp?id=2434373">Kulturni dnevnik</a>
</div>
</div>
Sample output (the one I cannot get):
22:55, Dnevnik, []
23:15, Kulturni dnevnik, /page/tv/sr/story/20/rts-1/2434373/kulturni-dnevnik.html
I either get the first two columns (without the linked text) or just the linked text with the code samples above.
If I understand you correctly, you should probably just iterate through the program nodes and create an item on every cycle. Also, there's the XPath shortcut //text(), which captures all the text under a node and its children.
Try something like:
programs = response.xpath("//div[contains(@class,'programRow')]")
for program in programs:
    item = dict()
    item['name'] = program.xpath(".//div[contains(@class,'name')]//text()").extract_first()
    item['link'] = program.xpath(".//div[contains(@class,'name')]/a/@href").extract_first()
    item['time'] = program.xpath(".//div[contains(@class,'time')]//text()").extract_first()
    yield item
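The per-row approach can be checked offline against the sample HTML with lxml (with the assumption that extract_first() is translated to its plain-XPath equivalents, and items are collected in a list instead of yielded):

```python
# Offline check of the per-row extraction with lxml, using the
# sample HTML from the question (items collected in a list).
from lxml import html

sample = '''
<div class="programRow rowOdd">
  <div class="time ColorVesti">22:55</div>
  <div class="name">Dnevnik</div>
</div>
<div class="programRow rowEven">
  <div class="time ColorOstalo">23:15</div>
  <div class="name">
    <a class="recnik" href="/page/tv/sr/story/20/rts-1/2434373/kulturni-dnevnik.html">Kulturni dnevnik</a>
  </div>
</div>
'''

tree = html.fromstring('<div>' + sample + '</div>')
items = []
for program in tree.xpath("//div[contains(@class,'programRow')]"):
    item = {
        # normalize-space collapses the surrounding whitespace
        'time': program.xpath("normalize-space(.//div[contains(@class,'time')])"),
        'name': program.xpath("normalize-space(.//div[contains(@class,'name')])"),
        # the href is only present when the name is linked
        'link': (program.xpath(".//div[contains(@class,'name')]/a/@href") or [None])[0],
    }
    items.append(item)

for item in items:
    print(item['time'], item['name'], item['link'])
```

For the unlinked row the link comes back as None, which matches the desired three-column output where the third column may be empty.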