I need to get an image src from the Xpath - python

I'm using Python to get some elements from a web page.
I need to get the src path from an img tag using its XPath.
The XPath I'm trying to get data from is the following:
//*[@id="product_24793"]/a/figure/img
I've already tried these two forms of the XPath:
//*[@id="product_24793"]/a/figure/img/@src
//*[@id="product_24793"]/@src
The code I have already tried is the following:
imgsrc = driver.find_elements_by_xpath('//*[@id="product_24793"]/a/figure/img')
for ele in imgsrc:
    print(ele.text)
    path = ele.text
I'd like to have the src path as a result.
Thanks.

Use get_attribute() to fetch the src value.
imgsrc = driver.find_elements_by_xpath('//*[@id="product_24793"]/a/figure/img[@class="top"]')
for ele in imgsrc:
    print(ele.get_attribute('src'))
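To see why .text comes back empty while the attribute lookup works, here is a minimal sketch that runs without a browser. It uses Python's standard-library xml.etree.ElementTree on a made-up, well-formed fragment (the markup and the image URL are assumptions, not the real page), with a path of the same shape as the one in the question:

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the markup described in the question.
FRAGMENT = """
<div>
  <div id="product_24793">
    <a href="/product/24793">
      <figure>
        <img class="top" src="/images/product_24793.jpg" />
      </figure>
    </a>
  </div>
</div>
"""

root = ET.fromstring(FRAGMENT)

# ElementTree supports a subset of XPath; this path selects the <img> element.
images = root.findall(".//*[@id='product_24793']/a/figure/img")

for img in images:
    # An <img> has no text content (ElementTree reports None) ...
    print(repr(img.text))
    # ... while the URL lives in the src attribute.
    print(img.get("src"))

sources = [img.get("src") for img in images]
```

In Selenium the corresponding attribute lookup is ele.get_attribute('src'), as in the answer above.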

Related

Download HTML code from element with XPath - Python Selenium

I need to download the HTML code of an element using XPath.
XPath:
//*[@id=":nn"]/div[1]
How can I download this HTML code?
You can use the 'innerHTML' attribute after you get the element using the XPath. It would look something like:
element = driver.find_element_by_xpath("//*[@id=':nn']/div[1]")
content = element.get_attribute('innerHTML')
If by downloading you mean getting the innerHTML, then:
html = driver.find_element_by_xpath("//*[@id=':nn']/div[1]").get_attribute('innerHTML')
print(html)
Update:
import pathlib
html = driver.find_element_by_xpath("//*[@id=':nn']/div[1]").get_attribute('innerHTML')
pathlib.Path("output.txt").write_text(f"Purchase Amount: {html}")
or
html = driver.find_element_by_xpath("//*[@id=':nn']/div[1]").get_attribute('innerHTML')
# the with block closes the file even if write() raises
with open("sample.html", "w") as file:
    file.write(html)
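If the page contains non-ASCII characters, writing with the platform's default encoding can fail or garble the output. A minimal sketch, assuming html is the string returned by get_attribute('innerHTML'); the helper name save_html and the sample string are made up for illustration:

```python
import tempfile
from pathlib import Path


def save_html(html: str, path: Path) -> Path:
    """Write the HTML string to path as UTF-8 and return the path."""
    path.write_text(html, encoding="utf-8")
    return path


# Stand-in for the value returned by get_attribute('innerHTML').
sample_html = "<div><span>Café: 12 €</span></div>"

# A temporary directory keeps the sketch from touching the working directory.
out_dir = Path(tempfile.mkdtemp())
saved = save_html(sample_html, out_dir / "sample.html")

# Read the file back with the same encoding to confirm nothing was lost.
round_trip = saved.read_text(encoding="utf-8")
print(round_trip)
```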

Unable to extract full URL from @href using Scrapy

I am trying to extract the url of a product from amazon.in. The href-attribute inside the a-tag from the source looks like this:
href="/Parachute-Coconut-Oil-600-Free/dp/B081WSB91C/ref=sr_1_49?dchild=1&fpw=pantry&fst=as%3Aoff&qid=1588693187&s=pantry&sr=8-49&srs=9574332031&swrs=789D2F4EC1B25821250A55BFCB953F03"
What Scrapy is extracting is:
/Parachute-Coconut-Oil-Bottle-600ml/dp/B071FB2ZVT?dchild=1
I used the following xpath:
//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href
This is the website I am trying to scrape:
https://www.amazon.in/s?i=pantry&srs=9574332031&bbn=9735693031&rh=n%3A9735693031&dc&page=2&fst=as%3Aoff&qid=1588056650&swrs=789D2F4EC1B25821250A55BFCB953F03&ref=sr_pg_2
How can I extract the expected url with Scrapy?
That is known as a relative URL. To get the full URL, join it with the base URL. I don't know what your code looks like, but try something like this. Note that the href already starts with a slash, so the base must not end with one:
half_url = response.xpath('//div[@class="a-section a-spacing-none a-spacing-top-small"]//a[@class="a-link-normal a-text-normal"]/@href').extract_first()
full_url = 'https://www.amazon.in' + half_url
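String concatenation is easy to get wrong around slashes. A minimal sketch using the standard library's urllib.parse.urljoin, which resolves a relative href against the URL of the page it came from. The href is the shortened value from the question; the base URL is a trimmed version of the search page and is only an assumption about what the spider requested. Inside a Scrapy callback, response.urljoin(half_url) performs the same resolution against the response's own URL.

```python
from urllib.parse import urljoin

# Trimmed search-page URL; any page on the same site works as a base.
base_url = "https://www.amazon.in/s?i=pantry&page=2"
half_url = "/Parachute-Coconut-Oil-Bottle-600ml/dp/B071FB2ZVT?dchild=1"

# A leading slash makes the href replace the base's path and query.
full_url = urljoin(base_url, half_url)
print(full_url)

# An href that is already absolute is returned unchanged.
already_absolute = urljoin(base_url, "https://example.com/item")
print(already_absolute)
```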

Get dynamically generated content with python Selenium

This question has been asked before, but I've searched and tried and still can't get it to work. I'm a beginner when it comes to Selenium.
Have a look at: https://finance.yahoo.com/quote/FB
I'm trying to web scrape the "Recommended Rating", which in this case at the time of writing is 2. I've tried:
driver.get('https://finance.yahoo.com/quote/FB')
time.sleep(10)
rating = driver.find_element_by_css_selector('#Col2-4-QuoteModule-Proxy > div > section > div > div > div')
print(rating.text)
...which doesn't give me an error, but doesn't print any text either. I've also tried with xpath, class_name, etc. Instead I tried:
source = driver.page_source
print(source)
This doesn't work either, I'm just getting the actual source without the dynamically generated content. When I click "View Source" in Chrome, it's not there. I tried saving the webpage in chrome. Didn't work.
Then I discovered that if I save the entire webpage, including images and css-files and everything, the source code is different from the one where I just save the HTML.
The HTML-file I get when I save the entire webpage using Chrome DOES contain the information that I need, and at first I was thinking about using pyautogui to just Ctrl + S every webpage, but there must be another way.
The information that I need is obviously there, in the HTML code, but how do I get it without downloading the entire web page?
Try this to get the HTML after JavaScript has generated the dynamic content:
driver.execute_script("return document.body.innerHTML")
See similar question:
Running javascript in Selenium using Python
The CSS selector, div.rating-text, is working just fine and is unique on the page. Returning .text will give you the value you are looking for.
First, you need to wait for the element to be clickable, then make sure you scroll down to the element before getting the rating. Try
element.location_once_scrolled_into_view
element.text
EDIT:
Use the following XPath selector:
'//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]'
Then you will have (note that an XPath selector needs find_element_by_xpath, not find_element_by_css_selector):
rating = driver.find_element_by_xpath('//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]')
To extract the value of the slider, use
val = rating.get_attribute("aria-label")
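A fixed time.sleep(10) either waits too long or not long enough. The idea behind "wait for the element" is to poll until a condition holds or a timeout expires; Selenium ships this as WebDriverWait. The sketch below shows only the polling idea in plain Python so that it runs without a browser. The names wait_until and fake_rating_lookup are hypothetical and are not Selenium APIs.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once timeout seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Stand-in for a lookup that returns "" until the page has rendered the value.
attempts = {"count": 0}


def fake_rating_lookup():
    attempts["count"] += 1
    return "2" if attempts["count"] >= 3 else ""


rating_text = wait_until(fake_rating_lookup, timeout=5.0, poll_interval=0.01)
print(rating_text)
```

With a real driver, the condition would wrap the element lookup from the answer above and return its text once it is non-empty.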
The script below answers a different question but somehow I think this is what you are after.
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)
# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))
import pandas
pandas.DataFrame(data).to_csv("AAA.csv", header=False)

How to get the link from a clickable element in selenium python

I want to download CSV files from a website. This is why I use the click() command from Selenium.
Elements have the following structure
code
csvList = browser.find_elements_by_class_name("csv")
for l in csvList:
    if 'error' not in l.text and 'after' not in l.text:
        # get link here
        l.click()
Question
My question is: how can I get the download link from the element before I download it? The link is the one the black arrow points to in the picture.
When I use l.get_attribute('href') it gives me None.
For each element l in csvList, get the parent element by xpath and then get that element's href:
csvList = browser.find_elements_by_class_name("csv")
for l in csvList:
    if 'error' not in l.text and 'after' not in l.text:
        # ".." selects the parent of the current element
        currentLink = l.find_element_by_xpath("..")
        href = currentLink.get_attribute("href")
Note: If you do a .click() in this loop and the link takes you to a new page, you will get a StaleElementReferenceException for each click after the first. In that case, extract each href and save it to a collection. Then navigate to each href (URL) in the collection.
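The note above can be sketched as a two-pass approach: first read every href while the elements are still attached to the page, then navigate. The helper below relies only on .text, find_element_by_xpath("..") and get_attribute("href"), as used in the answer. The names collect_links and FakeElement are hypothetical; the stand-in class exists only so the sketch runs without a browser.

```python
class FakeElement:
    """Minimal stand-in for a Selenium WebElement (illustration only)."""

    def __init__(self, text="", href=None, parent=None):
        self.text = text
        self._href = href
        self._parent = parent

    def find_element_by_xpath(self, xpath):
        # Only the parent step ".." is modelled here.
        assert xpath == ".."
        return self._parent

    def get_attribute(self, name):
        return self._href if name == "href" else None


def collect_links(elements, skip_words=("error", "after")):
    """Return the parent link's href for each element without a skip word."""
    links = []
    for element in elements:
        if any(word in element.text for word in skip_words):
            continue
        href = element.find_element_by_xpath("..").get_attribute("href")
        if href:
            links.append(href)
    return links


csv_list = [
    FakeElement("data.csv", parent=FakeElement(href="https://example.com/data.csv")),
    FakeElement("error.csv", parent=FakeElement(href="https://example.com/error.csv")),
    FakeElement("more.csv", parent=FakeElement(href="https://example.com/more.csv")),
]

links = collect_links(csv_list)
print(links)
```

With a real browser, a second loop such as `for url in links: browser.get(url)` would then visit each link without touching elements from a page that has already been left.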
The div does not have the href attribute. Its parent, the "a" tag, does. I would use XPath to select the a elements that have a div child with class csv:
browser.find_elements_by_xpath("//a[div[@class='csv']]")

Getting src link of XKCD image?

I am trying to get the src (URL) of the main image from the xkcd.com website. I am using the following code, but it returns something like session="2f69dd2e-b377-4d1f-9779-16dad1965b81", element="{ca4e825a-88d4-48d3-a564-783f9f976c6b}"
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://xkcd.com')
assert 'xkcd' in browser.title
idlink= browser.find_element_by_id("comic")
#link = idlink.get_attribute("src") ## print link prints null
print(idlink)
Using the XPath method returns the same thing.
browser.find_element_by_id returns a WebElement, and that object is what you are printing.
In addition, the URL you want is in a child element of idlink. Try
idlink = browser.find_element_by_css_selector("#comic > img")
print(idlink.get_attribute("src"))
idlink is now the img element whose parent has the ID comic.
The URL is in src, so that is the attribute we want.
Building off the answer here
You need to:
Select the img tag (you're currently selecting the div)
Get the contents of the src attribute of the img tag
img_tag = browser.find_element_by_xpath("//div[@id='comic']/img")
print(img_tag.get_attribute("src"))
The above should print the URL of the image
More techniques for locating elements using selenium's python bindings are available here
For more on using XPath with Selenium, see this tutorial
