I am trying to get the src URL of the main image from the xkcd.com website. I am using the following code, but it returns something like session="2f69dd2e-b377-4d1f-9779-16dad1965b81", element="{ca4e825a-88d4-48d3-a564-783f9f976c6b}":
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
browser = webdriver.Firefox()
browser.get('http://xkcd.com')
assert 'xkcd' in browser.title
idlink = browser.find_element_by_id("comic")
#link = idlink.get_attribute("src") ## print link prints null
print idlink
Using the XPath method also returns the same thing as above.
browser.find_element_by_id returns a web element, and that is what you are printing.
In addition, the attribute you want is on a child element of idlink. Try:
idlink = browser.find_element_by_css_selector("#comic > img")
print idlink.get_attribute("src")
idlink is now the web element for the img tag whose parent has the ID comic.
The URL is in the src attribute, so that is the attribute we want.
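Note that newer Selenium releases (4.x) deprecated and then removed the find_element_by_* helpers, so on a current install the same lookup would be written with a By locator. A minimal sketch of the equivalent call:

from selenium.webdriver.common.by import By

idlink = browser.find_element(By.CSS_SELECTOR, "#comic > img")
print(idlink.get_attribute("src"))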
Building off the answer here
You need to:
Select the img tag (you're currently selecting the div)
Get the contents of the src attribute of the img tag
img_tag = browser.find_element_by_xpath("//div[@id='comic']/img")
print img_tag.get_attribute("src")
The above should print the URL of the image.
More techniques for locating elements using Selenium's Python bindings are available here.
For more on using XPath with Selenium, see this tutorial.
I have used the Selenium driver to crawl through many site pages. Every time I get a new page I append the HTML to a variable called All_APP_Pages, so that one variable holds the HTML for many pages. I did not post that code because it is long and not relevant to the issue. Python lists All_APP_Pages as being of type bytes.
from lxml import html
from lxml import etree
import xml.etree.ElementTree as ET
from selenium.webdriver.common.by import By
dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
print(link)
Once all pages have been scanned, I need to get the link from this XPath:
"//tr[.//span[contains(.,'Product Data Solutions (ABC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
The XPath listed here works. However, it only works with the Selenium driver if the driver is on the page where the link exists. That is why all pages are in one variable, since I don't know which page the link will be on. The print shows this result:
[<Element a at 0x1c39dea1180>]
How do I get the value from link so I can check whether it is correct?
You need to iterate the list and get the href value:
dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
hrefs = [l.attrib["href"] for l in link]
print(hrefs)
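If the goal is just to check the value, you can then compare the extracted hrefs against the link you expect. A small sketch, where expected_href is a placeholder for the real link:

expected_href = "/some/expected/path"  # placeholder, substitute the real link
for href in hrefs:
    if href == expected_href:
        print("found the expected link:", href)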
I'm using Python to get some elements from a web page. I need to get the src path from an image tag using its XPath.
The XPath I'm trying to get data from is the following:
//*[#id="product_24793"]/a/figure/img
I've already tried using these two formats for the XPath:
//*[#id="product_24793"]/a/figure/img/#src
//*[#id="product_24793"]/#src
Also, the code that I have already tried is the following:
imgsrc = driver.find_elements_by_xpath('//*[@id="product_24793"]/a/figure/img')
for ele in imgsrc:
    print(ele.text)
    path = ele.text
I'd like to have the src path as a result
Thanks.
Use get_attribute() to fetch the src value. An XPath in Selenium has to select an element rather than an attribute, which is why the /@src variants won't work.
imgsrc = driver.find_elements_by_xpath('//*[@id="product_24793"]/a/figure/img[@class="top"]')
for ele in imgsrc:
    print(ele.get_attribute('src'))
I have been searching across the site in the hope of finding an answer; however, every question I view doesn't have heavily nested HTML code like the page I am trying to scrape.
I am really hoping someone will spot my obvious error. I have the following code, which is pulling the category headers but annoyingly not the href that goes with each one. When run, the code currently returns None for all the hrefs, but I cannot decipher why. I think it may be because I am targeting the wrong element, tag, or class in the HTML, but I cannot correctly identify which one it should be.
from selenium import webdriver
import time
# The website to scrape
url = "https://www.jtinsight.com/JTIRA/JTIRA.aspx#!/full-category-list"
# Creating the WebDriver object using the ChromeDriver
driver = webdriver.Chrome()
# Directing the driver to the defined url
driver.get(url)
# driver.implicitly_wait(5)
time.sleep(1)
# Locate the categories
categories = driver.find_elements_by_xpath('//div[#class="subCatEntry ng-scope"]')
# Print out all categories on current page
num_page_items = len(categories)
print(num_page_items)
for headers in range(num_page_items):
    print(categories[headers].text)
for elem in categories:
    print(elem.get_attribute("a.divLink[href='*']"))
# Clean up (close browser once task is completed)
time.sleep(1)
driver.close()
I would really appreciate if anyone can point out my error.
Try the code below.
for elem in categories:
    print(elem.find_element_by_css_selector("a.divLink").get_attribute('href'))
You are passing a CSS selector to the get_attribute method. That won't work; you have to provide the attribute name only. If the web element elem has an attribute named href, it will print the value of that attribute.
First, get the anchor <a> element. All the subcategory anchors have the class divLink. To get the anchor elements, try this:
categories = driver.find_elements_by_class_name('divLink')
Second, print the attribute value by passing the attribute name to get_attribute. Try this:
print(elem.get_attribute("href"))
This way you'll be able to print all the href values.
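Putting both fixes together, a minimal sketch of the whole loop, using an explicit wait instead of a fixed time.sleep (the 10-second timeout is an arbitrary choice):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.jtinsight.com/JTIRA/JTIRA.aspx#!/full-category-list")
# Wait until the subcategory anchors are present rather than sleeping
anchors = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, "divLink")))
for anchor in anchors:
    print(anchor.get_attribute("href"))
driver.quit()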
This question has been asked before, but I've searched and tried and still can't get it to work. I'm a beginner when it comes to Selenium.
Have a look at: https://finance.yahoo.com/quote/FB
I'm trying to web scrape the "Recommended Rating", which in this case at the time of writing is 2. I've tried:
driver.get('https://finance.yahoo.com/quote/FB')
time.sleep(10)
rating = driver.find_element_by_css_selector('#Col2-4-QuoteModule-Proxy > div > section > div > div > div')
print(rating.text)
...which doesn't give me an error, but doesn't print any text either. I've also tried with xpath, class_name, etc. Instead I tried:
source = driver.page_source
print(source)
This doesn't work either: I'm just getting the actual source without the dynamically generated content. When I click "View Source" in Chrome, it's not there. I tried saving the webpage in Chrome; that didn't work.
Then I discovered that if I save the entire webpage, including images, CSS files, and everything, the source code is different from the one I get when I just save the HTML.
The HTML file I get when I save the entire webpage using Chrome DOES contain the information that I need, and at first I was thinking about using pyautogui to just Ctrl+S every webpage, but there must be another way.
The information that I need is obviously there in the HTML code, but how do I get it without downloading the entire web page?
Try this to get the page source after the dynamically generated content (JavaScript) has been executed:
driver.execute_script("return document.body.innerHTML")
See similar question:
Running javascript in Selenium using Python
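The string returned by that call can then be parsed offline, for example with BeautifulSoup. A sketch, assuming the div.rating-text selector mentioned in the next answer is the one you are after:

from bs4 import BeautifulSoup

rendered = driver.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(rendered, "html.parser")
rating = soup.find("div", class_="rating-text")
if rating is not None:
    print(rating.get_text())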
The CSS selector, div.rating-text, is working just fine and is unique on the page. Returning .text will give you the value you are looking for.
First, you need to wait for the element to be clickable, then make sure you scroll down to the element before getting the rating. Try
element.location_once_scrolled_into_view
element.text
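Put together, a minimal sketch of that approach, using the div.rating-text selector from above and an arbitrary 10-second timeout:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver.get("https://finance.yahoo.com/quote/FB")
# Wait until the rating element is clickable, then scroll to it
element = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "div.rating-text")))
element.location_once_scrolled_into_view
print(element.text)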
EDIT:
Use the following XPath selector:
'//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]'
Then you will have:
rating = driver.find_element_by_xpath('//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]')
To extract the value of the slider, use
val = rating.get_attribute("aria-label")
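Since Yahoo generates that long class string, the exact match is fragile. A slightly more robust variant (a sketch that swaps the exact class match for contains()):

rating = driver.find_element_by_xpath(
    '//a[@data-test="recommendation-rating-header"]'
    '//following-sibling::div//div[contains(@class, "rating-text")]')
print(rating.get_attribute("aria-label"))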
The script below answers a different question but somehow I think this is what you are after.
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)
# sort rows to maintain original order
data.sort(key=lambda x: int(x[0]))
import pandas
pandas.DataFrame(data).to_csv("AAA.csv", header=False)
I am trying to scrape the list of followings for a given instagram user. This requires using Selenium to navigate to the user's Instagram page and then clicking "following". However, I cannot seem to click the "following" button with Selenium.
driver = webdriver.Chrome()
url = 'https://www.instagram.com/beforeeesunrise/'
driver.get(url)
driver.find_element_by_xpath('//*[@id="react-root"]/section/main/article/header/div[2]/ul/li[3]/a').click()
However, this results in a NoSuchElementException. I copied the XPath from the HTML and tried using the class name, partial link text, and full link text, but I cannot seem to get this to work! I've also made sure that the above XPath includes the element with a "click" event listener.
UPDATE: By logging in I was able to get the above information. However (!), now I cannot get the resulting list of followings. When I click the button with the driver, the HTML does not include the information in the pop-up dialog that you see on Instagram. My goal is to get all of the users that the given username is following.
Make sure you are using the correct XPath.
Use the following link to get perfect XPaths for accessing web elements, and then try:
Selenium Command
Hope this helps to solve the problem!
Try a different XPath. I've verified this is unique on the page.
driver.find_element_by_xpath("//a[contains(.,'following')]")
Providing rich element-finding functionality for web scraping is not Selenium's main goal, so a better option is to delegate that task to a dedicated tool like BeautifulSoup. After we find what we're looking for, we can ask Selenium to interact with the element.
The bridge between Selenium and BeautifulSoup is the amazing function below, which I found here. It takes a single BeautifulSoup element and generates a unique XPath that we can use with Selenium.
import re
from selenium import webdriver
from bs4 import BeautifulSoup as bs
import itertools

def xpath_soup(element):
    """
    Generate the XPath of a soup element.
    :param element: bs4 text or node
    :return: xpath as string
    """
    components = []
    child = element if element.name else element.parent
    for parent in child.parents:
        # type parent: bs4.element.Tag
        previous = itertools.islice(parent.children, 0, parent.contents.index(child))
        xpath_tag = child.name
        xpath_index = sum(1 for i in previous if i.name == xpath_tag) + 1
        components.append(xpath_tag if xpath_index == 1 else '%s[%d]' % (xpath_tag, xpath_index))
        child = parent
    components.reverse()
    return '/%s' % '/'.join(components)
driver = webdriver.Chrome(executable_path=YOUR_CHROMEDRIVER_PATH)
driver.get(url = 'https://www.instagram.com/beforeeesunrise/')
source = driver.page_source
soup = bs(source, 'html.parser')
button = soup.find('button', text=re.compile(r'Follow'))
xpath_for_the_button = xpath_soup(button)
elm = driver.find_element_by_xpath(xpath_for_the_button)
elm.click()
...and it works!
(But you need to write some code to log in with an account.)
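A minimal login sketch for that; the username and password field names are assumptions about Instagram's login form and may change:

import time

driver.get('https://www.instagram.com/accounts/login/')
time.sleep(3)  # crude wait for the login form to render
# NOTE: the field names below are assumptions about Instagram's form
driver.find_element_by_name('username').send_keys('YOUR_USERNAME')
driver.find_element_by_name('password').send_keys('YOUR_PASSWORD')
driver.find_element_by_xpath("//button[@type='submit']").click()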