Download HTML code from element with XPath - Python Selenium

I need to download the HTML code from an element using XPath.
XPath:
//*[@id=":nn"]/div[1]
A screenshot of the element is attached.
How can I download this HTML code?

You can use the 'innerHTML' attribute after you get the element using the XPath. It would look something like:
element = driver.find_element_by_xpath('//*[@id=":nn"]/div[1]')
content = element.get_attribute('innerHTML')
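If you also need the element's own tag rather than just its children, the 'outerHTML' attribute can be read the same way (a small variation, not part of the original answer):
# innerHTML returns only the markup inside the element; outerHTML includes the element's own tag
inner = element.get_attribute('innerHTML')
outer = element.get_attribute('outerHTML')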

I think by downloading you mean getting the innerHTML; if so:
html = driver.find_element_by_xpath("//*[@id=':nn']/div[1]").get_attribute('innerHTML')
print(html)
Update:
import pathlib
html = driver.find_element_by_xpath("//*[@id=':nn']/div[1]").get_attribute('innerHTML')
pathlib.Path("output.txt").write_text(html)
or
html = driver.find_element_by_xpath("//*[@id=':nn']/div[1]").get_attribute('innerHTML')
with open("sample.html", "w") as file:
    file.write(html)

Related

How to get the href link of the latest version only in Selenium Python

I want the link to the latest version of the CSV. When a new version is released, my program should pick up the latest href link.
Expected output: https://www.nucc.org/images/stories/CSV/nucc_taxonomy_201.csv
home_page1 = "https://www.nucc.org/index.php/code-sets-mainmenu-41/provider-taxonomy-mainmenu-40/csv-mainmenu-57"
driver = webdriver.Chrome("xx\\xx\\chromedriver.exe")
driver.get(home_page1)
elements = driver.find_elements_by_css_selector("li a")
for link in elements:
    print(link.get_attribute('href'))
Using some regex combined with BeautifulSoup:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(driver.page_source, "html.parser")
#Find the tag containing the text 'current version'
current_version_tag = soup.find('p',text=re.compile('current version'))
#The download link will be the first link after it
download_link = "https://www.nucc.org" + current_version_tag.find_next("a").get('href')
Output
'https://www.nucc.org/images/stories/CSV/nucc_taxonomy_201.csv'
I've looked at the website and it seems you can do this with Selenium's
find_element_by_xpath.
I've attached a photo.
So the right XPath is:
//*[@id="main"]/div[2]/div/div/div[3]/div/ul[1]/li/a
Please check whether it still works when the new version comes out.
You can then look for it with this command:
login_form = driver.find_element_by_xpath('//*[@id="main"]/div[2]/div/div/div[3]/div/ul[1]/li/a')
Let me know if this was helpful.

Get dynamically generated content with python Selenium

This question has been asked before, but I've searched and tried and still can't get it to work. I'm a beginner when it comes to Selenium.
Have a look at: https://finance.yahoo.com/quote/FB
I'm trying to web scrape the "Recommended Rating", which in this case at the time of writing is 2. I've tried:
driver.get('https://finance.yahoo.com/quote/FB')
time.sleep(10)
rating = driver.find_element_by_css_selector('#Col2-4-QuoteModule-Proxy > div > section > div > div > div')
print(rating.text)
...which doesn't give me an error, but doesn't print any text either. I've also tried with xpath, class_name, etc. Instead I tried:
source = driver.page_source
print(source)
This doesn't work either; I'm just getting the raw source without the dynamically generated content. When I click "View Source" in Chrome, it's not there. I tried saving the web page in Chrome. That didn't work.
Then I discovered that if I save the entire webpage, including images and css-files and everything, the source code is different from the one where I just save the HTML.
The HTML-file I get when I save the entire webpage using Chrome DOES contain the information that I need, and at first I was thinking about using pyautogui to just Ctrl + S every webpage, but there must be another way.
The information that I need is obviously there, in the HTML code, but how do I get it without downloading the entire web page?
Try this to get the page body after the dynamically generated content (JavaScript) has been executed:
driver.execute_script("return document.body.innerHTML")
See similar question:
Running javascript in Selenium using Python
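The rendered markup returned by execute_script can then be parsed like any other HTML, for example with BeautifulSoup (a sketch; the div.rating-text selector is taken from the answer below and may change if Yahoo updates the page):
from bs4 import BeautifulSoup
# Markup of the rendered DOM, including JavaScript-generated content
html = driver.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(html, "html.parser")
rating = soup.select_one("div.rating-text")
if rating is not None:
    print(rating.text)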
The CSS selector, div.rating-text, is working just fine and is unique on the page. Returning .text will give you the value you are looking for.
First, you need to wait for the element to be clickable, then make sure you scroll down to the element before getting the rating. Try
element.location_once_scrolled_into_view
element.text
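Putting that together, a minimal sketch using an explicit wait instead of time.sleep (the selector and timeout are assumptions):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get('https://finance.yahoo.com/quote/FB')
# Wait until the rating element is clickable rather than sleeping a fixed time
element = WebDriverWait(driver, 20).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, 'div.rating-text'))
)
# Scroll the element into view, then read its text
element.location_once_scrolled_into_view
print(element.text)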
EDIT:
Use the following XPath selector:
'//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]'
Then you will have:
rating = driver.find_element_by_xpath('//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]')
To extract the value of the slider, use
val = rating.get_attribute("aria-label")
The script below answers a different question but somehow I think this is what you are after.
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)
# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))
import pandas
pandas.DataFrame(data).to_csv("AAA.csv", header=False)

Python: xpath unable to find element

I want to get the text for the featured companies from a link. I inspected it and got the XPath, but Selenium is unable to locate the element. The links always change, but they all end with listedcompany.com.
The text I want to scrape is highlighted in the screenshot.
import time
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://www.shareinvestor.com/my")
time.sleep(20)
browser.find_element_by_xpath("//*[@href='http://salcon.listedcompany.com']")
The error is
selenium.common.exceptions.NoSuchElementException: Message: u'Unable to locate element: {"method":"xpath","selector":"//*[@href=\'http://salcon.listedcompany.com\']"}' ; Stacktrace:
I want to get the text for those companies
If you need the text below the Featured Companies tab, you can use this code:
import requests
from parsel import Selector
url = 'https://www.shareinvestor.com/my'
r = requests.get(url)
sel = Selector(r.text)
all_text = sel.xpath('//div[#class="sic_scrollPane" and a[img]]')
for ind, text in enumerate(all_text, start=1):
    text = ''.join(text.xpath('p//text()').extract())
    print(ind, text)
It gets you all the text from that tab without using Selenium.
Note: I use the Parsel library, built on top of lxml, but you can use bs4 or lxml instead (a bs4 sketch follows).
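A rough BeautifulSoup equivalent, assuming the same sic_scrollPane markup (it skips the a[img] filter from the XPath above):
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.shareinvestor.com/my')
soup = BeautifulSoup(r.text, 'html.parser')
# Each featured-company block is a div with the sic_scrollPane class; its description sits in <p> tags
for ind, block in enumerate(soup.select('div.sic_scrollPane'), start=1):
    text = ''.join(p.get_text() for p in block.find_all('p'))
    print(ind, text)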
Try the "//a[contains(@href, 'listedcompany.com')]" XPath to match all links whose href attribute contains "listedcompany.com", as below:
import time
from selenium import webdriver
browser = webdriver.Firefox()
browser.get("https://www.shareinvestor.com/my")
time.sleep(20)
link_text_list = [link.text for link in browser.find_elements_by_xpath("//a[contains(@href, 'listedcompany.com')]") if link.text]

Scraping hidden product details on a webpage using Selenium

Sorry I am a Selenium noob and have done a lot of reading but am still having trouble getting the product price (£0.55) from this page:
https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628. Product details are not visible when parsing the html using bs4. Using Selenium I can get a string of the entire page and can see the price in there (using the following code). I should be able to extract the price from this somehow but would prefer a less hacky solution.
browser = webdriver.Firefox(executable_path=r'C:\Users\Paul\geckodriver.exe')
browser.get('https://groceries.asda.com/product/tinned-tomatoes/asda-smart-price-chopped-tomatoes-in-tomato-juice/19560')
content = browser.page_source
If I run something like this:
elem = driver.find_element_by_id("bodyContainerTemplate")
print(elem)
It just returns: selenium.webdriver.firefox.webelement.FirefoxWebElement (session="df23fae6-e99c-403c-a992-a1adf1cb8010", element="6d9aac0b-2e98-4bb5-b8af-fcbe443af906")
The price is the text associated with this element: p class="prod-price" but I cannot seem to get this working. How should I go about getting this text (the product price)?
The type of elem is WebElement. If you need to extract the text value of a web element, you can use the code below:
elem = driver.find_element_by_class_name("prod-price-inner")
print(elem.text)
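Because the product details are injected by JavaScript, an explicit wait is safer than assuming the element is already present. A minimal sketch using the same class name (the timeout is an assumption):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 seconds for the price element to appear in the DOM
elem = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, 'prod-price-inner'))
)
print(elem.text)  # e.g. £0.55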
Try this solution; it works with Selenium and BeautifulSoup:
from bs4 import BeautifulSoup
from selenium import webdriver
url='https://groceries.asda.com/product/spaghetti-tagliatelle/asda-spaghetti/36628'
driver = webdriver.PhantomJS()
driver.get(url)
data = driver.page_source
soup = BeautifulSoup(data, 'html.parser')
ele = soup.find('span',{'class':'prod-price-inner'})
print(ele.text)
driver.quit()
It will print:
£0.55
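Note that newer Selenium releases have dropped PhantomJS support, so a headless Firefox is an alternative for the same page_source approach (a sketch, assuming a recent Selenium version):
from selenium import webdriver

# Run Firefox without a visible window
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)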

Getting src link of XKCD image?

I am trying to get the src (URL) of the main image on the xkcd.com website. I am using the following code, but it returns something like session="2f69dd2e-b377-4d1f-9779-16dad1965b81", element="{ca4e825a-88d4-48d3-a564-783f9f976c6b}"
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://xkcd.com')
assert 'xkcd' in browser.title
idlink= browser.find_element_by_id("comic")
#link = idlink.get_attribute("src") ## print link prints null
print(idlink)
Using the XPath method also returns the same as above.
browser.find_element_by_id returns a web element, and that is what you print.
In addition, the URL you want is in a child element of idlink. Try:
idlink = browser.find_element_by_css_selector("#comic > img")
print(idlink.get_attribute("src"))
idlink is now the web element for the img tag whose parent has the comic ID.
The URL is in src, so we want that attribute.
Building off the answer here
You need to:
Select the img tag (you're currently selecting the div)
Get the contents of the source attribute of the img tag
img_tag = browser.find_element_by_xpath("//div[@id='comic']/img")
print(img_tag.get_attribute("src"))
The above should print the URL of the image
More techniques for locating elements using selenium's python bindings are available here
For more on using XPath with Selenium, see this tutorial
