How to download all images under one specific node in Scrapy?

I am using Scrapy and want to download all the images under one node. For example, here is the web page:
<div class="A">
<div class="A1">
<div class="A2">
<img original="a1.png"></img>
</div>
</div>
<div class="A3">
<img original="a2.png"></img>
</div>
</div>
<div class="B">
<div class="B1">
<div class="B2">
<img original="b1.png"></img>
</div>
</div>
<div class="B3">
<img original="b2.png"></img>
</div>
</div>
I want to download all the images under class="A" but not class="B", so I need to find their original URLs. Under class="A" there are a1.png and a2.png, which sit at different levels (the nesting depth is not known in advance).
Is there a way to target one attribute under a specific node using XPath (something like //div[@class="A"]/**/img)?
Or is there a solution built into Scrapy?

Use the following XPath to select all img nodes that are descendants of a div node with class A:
//div[@class="A"]//img
The // matches descendants at any depth, as opposed to /, which matches only direct children.
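
As a rough, dependency-free check of the expression above, the sketch below runs it against the question's markup with the standard library's ElementTree, which stands in for Scrapy here. The `<root>` wrapper and the `image_urls` helper are assumptions made for illustration. In a Scrapy callback the equivalent call would be `response.xpath('//div[@class="A"]//img/@original').getall()`.

```python
import xml.etree.ElementTree as ET

# Markup from the question, wrapped in a hypothetical <root> element
# so that it parses as one well-formed XML document.
PAGE = """
<root>
  <div class="A">
    <div class="A1"><div class="A2"><img original="a1.png"></img></div></div>
    <div class="A3"><img original="a2.png"></img></div>
  </div>
  <div class="B">
    <div class="B1"><div class="B2"><img original="b1.png"></img></div></div>
    <div class="B3"><img original="b2.png"></img></div>
  </div>
</root>
"""


def image_urls(markup, css_class):
    """Return the 'original' attribute of every img below div[@class=css_class]."""
    root = ET.fromstring(markup)
    # The double slash before img matches images at any depth below the div.
    imgs = root.findall(f".//div[@class='{css_class}']//img")
    return [img.get("original") for img in imgs]


urls_a = image_urls(PAGE, "A")
print(urls_a)
```

The resulting list contains relative URLs. To actually download them in Scrapy, one option is to make them absolute with `response.urljoin` and yield them in an item's `image_urls` field for the built-in ImagesPipeline.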

Related

Unable to locate element with Python Selenium find_element_by_xpath

I have HTML like this:
<div class="A">
<div class="B"></div>
<div class="B">
<div class="C"></div>
<div class="C">
<p class="D"> Element 1 </p>
<div class="C"></div>
</div>
</div>
<div class="A">
<div class="B"></div>
<div class="B">
<div class="C"></div>
<div class="C">
<p class="D"> Element 2 </p>
<div class="C"></div>
</div>
</div>
(This is an example; there are more elements with class "A".)
I want to extract the text "Element 2" with Python Selenium.
I have tried many things, but the result is always the same: No such element: Unable to locate element...
For example, I tried:
elem = driver.find_element_by_xpath("//div[@class='A:last-child']/p[@class='D']").text
with the same result.

Try this:
"(//div[@class='A']//p)[2]"
This collects every p element beneath a div with class "A" and then takes the second match.

Try this XPath:
"(//div[@class='A']//p)[last()]"
The main problem with your XPath, I think, is that the single slash before the p element looks only for direct children of the div. You want the double slash, which finds any descendant.

For this structure, the XPath
(//div[@class="A"]//p[@class="D"])[2]
should work if you want the second match, or
(//div[@class="A"]//p[@class="D"])[last()]
if you want the last one.
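
The difference between / and // can be checked without a browser. The sketch below is an illustration under assumptions, not the Selenium call itself. It uses the standard library's ElementTree on a simplified, well-formed version of the question's markup, with closing tags added and a hypothetical `<root>` wrapper. ElementTree's XPath subset does not cover the parenthesised `(...)[last()]` form, so the last match is taken by indexing the result list.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the question's markup.
PAGE = """
<root>
  <div class="A">
    <div class="B"><div class="C"><p class="D"> Element 1 </p></div></div>
  </div>
  <div class="A">
    <div class="B"><div class="C"><p class="D"> Element 2 </p></div></div>
  </div>
</root>
"""
root = ET.fromstring(PAGE)

# Single slash: p must be a direct child of div.A, so nothing matches.
direct = root.findall("./div[@class='A']/p[@class='D']")

# Double slash: p may sit at any depth below div.A.
nested = root.findall("./div[@class='A']//p[@class='D']")

# Take the last match by list index instead of (...)[last()].
last_text = nested[-1].text.strip()
print(len(direct), len(nested), last_text)
```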

Python read forms in webpage

I fetched a web page whose HTML has the following form:
<div class="cart">
<div class="cart-title">
<img src="https://ug3.technion.ac.il/rishum/img/regCourses.png" width="50" height="50" alt="My Courses">
המקצועות שלי
</div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
104134
</div>
<div class="course-name">
אלגברה מודרנית ח
</div>
<div class="course-points">
2.5 נק'
</div>
<div class="entry-group">
קבוצה 11
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG104134" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>12</option><option>13</option><option>21</option><option>22</option><option>23</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
234118
</div>
<div class="course-name">
ארגון ותכנות המחשב
</div>
<div class="course-points">
3 נק'
</div>
<div class="entry-group">
קבוצה 22
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG234118" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>11</option><option>12</option><option>13</option><option>14</option><option>21</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div>
How can I read the course numbers, which appear in blue in my image?
Here is an example of how a course number appears in the page:
<div class="course-number">
104134
</div>
In this example I want to read 104134.

First, I'd advise using BeautifulSoup to parse the HTML. Off the top of my head, you can then look up the div tags with that class name like this:
import requests
from bs4 import BeautifulSoup
url = "<your-target>"  # placeholder for the page you want to read
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
numbers = [i.a.text for i in soup.find_all('div', attrs={"class": "course-number"})]
I haven't tested this, but even if it doesn't work as is, it should point you toward a solution. Check BeautifulSoup's documentation for more information.
Note that if i has no a tag, i.a is None and the list comprehension raises an AttributeError. In the HTML as posted the number sits directly inside the div, in which case i.get_text(strip=True) would be the call to use. If you don't trust the site's structure to stay the same, use a regular for loop with a try-except, or handle that case some other way.
Also beware that this collects every div tag with class course-number. If you want only a subset, either filter further or first traverse the HTML tree down to the root of your target content.
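
For comparison, here is a dependency-free sketch of the same extraction that tolerates both layouts: the number directly inside the div, or wrapped in a nested a tag. It swaps BeautifulSoup for the standard library's `html.parser`, so it is an alternative to the answer's approach and not a drop-in equivalent. The `CourseNumberParser` class, the `course_numbers` helper and the sample markup (including the `href="#"` link and the "Course A" name) are assumptions made for illustration.

```python
from html.parser import HTMLParser


class CourseNumberParser(HTMLParser):
    """Collect the text of every <div class="course-number">."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # > 0 while inside a course-number div
        self.parts = []     # text fragments of the current div
        self.numbers = []   # finished results

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth:
            self.depth += 1  # nested div inside a course-number div
        elif "course-number" in (dict(attrs).get("class") or "").split():
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "div" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.numbers.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def course_numbers(html):
    parser = CourseNumberParser()
    parser.feed(html)
    parser.close()
    return parser.numbers


# Hypothetical sample covering both layouts.
SAMPLE = """
<div class="cart-entry">
<div class="course-number">
104134
</div>
<div class="course-name">Course A</div>
</div>
<div class="cart-entry">
<div class="course-number"><a href="#">234118</a></div>
</div>
"""
numbers = course_numbers(SAMPLE)
print(numbers)
```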

Python: find all certain tag by using selenium

HTML:
<div id="searchResult">
<div class="buySearchResultContent">
<div class="buySearchResultContentImg">
<a href="carinfo-333285.php">
<img src="carpics/9400180056/290x200/20180305101502854_4567823.jpg" srcset="carpics/9400180056/290x200/20180305101502854_9098765.jpg 290w, carpics/9400180056/435x300/20180305101502854_00000.jpg 435w , carpics/9400180056/720x520/20180305101502854_00001.jpg 720w" sizes="(min-width: 992px) 75vw, 90vw" alt="auto">
</a>
</div>
<div class="buySearchResultContentImg">
<a href="carinfo-333286.php">
<img src="carpics/9400180056/290x200/20180305101502854_4567824.jpg" srcset="carpics/9400180056/290x200/20180305101502854_9098766.jpg 290w, carpics/9400180056/435x300/20180305101502854_00000.jpg 436w , carpics/9400180056/720x520/20180305101502854_00001.jpg 721w" sizes="(min-width: 992px) 75vw, 90vw" alt="auto">
</a>
</div>
</div>
</div>
I am trying to extract both hrefs, but with my code I can only extract the first one.
Code:
driver.find_element_by_css_selector("buySearchResultContentImg > div").get_attribute("href")

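Try the code below to get a list of the href values. find_elements_by_css_selector (plural) returns every match rather than only the first, and the selector .buySearchResultContentImg>a targets the a elements that carry the href: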
links = [link.get_attribute("href") for link in driver.find_elements_by_css_selector(".buySearchResultContentImg>a")]

Selenium - driver.find_element_by_css_selector can't find the element (python)

I have a problem using find_element_by_css_selector to get the "Select" element (an a href).
I tried the methods below, but neither of them worked:
driver.find_element_by_css_selector("div.plan.right > a.select.").click()
driver.find_element_by_xpath("//div[@class='plan right']/div[2]/a/select").click()
Could anyone kindly give me some suggestions? Thanks!
<div class="choose_plan">
<h1>Sign up now for <strong>UNLIMITED</strong> access <br/>to all </h1>
<div class="plans">
<div class="plan left">
<div class="head">
<p>MONTHLY</p>
</div>
<div class="body">
<p>annually</p>
</div>
<hr />
SELECT
</div>
<div class="plan right">
<img alt="Popular-right" class="popular" src="/assetse8.png" />
<div class="head">
<p>14</p>
</div>
<div class="body">
<p>Unlimited</p>
</div>
<hr />
SELECT
</div>
</div>
</div>

I know you already have an answer but there's a simple alternative that may be helpful in the future. From the HTML you provided, it looks like the data-planId attribute is unique for each A tag. You can take advantage of that using the code below.
driver.find_element_by_css_selector("a[data-planId='31']")
See a CSS selector reference for more info.

It would help to have well-formed HTML, as line 15 (<div class="choose_plan">) appears to be unclosed. The solution below was tested in an online XPath tester with that line removed but the rest of the HTML as shown.
driver.find_element_by_xpath("//div[@class='plan right']/a").click()
yields the following:
Element='SELECT'

I would try to make it simple:
driver.find_element_by_css_selector("div.right a.select")
Or:
driver.find_elements_by_link_text("SELECT")[-1]
This gets the last a element whose link text is SELECT.

handling deeply nested tags in xpath

Please help me!
I don't know how to select a deeply nested tag in order to get the text inside it.
Could someone show how to do it in a single line with an XPath query, and explain the answer?
Below is some HTML. How can I extract "Hello world!", or whatever text is inside those tags?
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div>
<div class="deep">
<span>
<strong class="select">Hello world!</strong>
</span>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
</div>

Since you asked for the text, I assume the node you'd like to match is the strong tag (the only one with content).
If the document is guaranteed to contain only one <strong> tag and the level of nesting is irrelevant, the simplest XPath would be:
//strong/text()
To also match on the class specifically:
//strong[@class="select"]/text()
// searches from the document root at any depth, and @ selects an attribute.
http://www.b624.net/modelare-software-uml-si-xml/laboratoare-an-3-is/xpath-cheat-sheet
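
A quick way to try the class-based match is sketched below with the standard library's ElementTree on a shortened version of the question's markup (fewer wrapper divs, which is an assumption made for brevity). ElementTree's XPath subset has no text() step, so the sketch matches the element and reads its `.text`. With a full XPath engine such as the one behind Scrapy selectors or lxml, the `//strong[@class="select"]/text()` expression can be used as written.

```python
import xml.etree.ElementTree as ET

# Shortened version of the question's markup: fewer wrapper divs, same idea.
PAGE = """
<div><div><div><div>
  <div class="deep">
    <span>
      <strong class="select">Hello world!</strong>
    </span>
  </div>
</div></div></div></div>
"""
root = ET.fromstring(PAGE)

# .// searches at any depth; [@class='select'] filters on the attribute.
node = root.find(".//strong[@class='select']")
text = node.text
print(text)
```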
