I am trying to scrape NASDAQ's website for real-time stock quotes. When I use Chrome developer tools, I can see the span I want to target is (for example, with Alphabet as of writing this) <span class="symbol-page-header__pricing-price">$2952.77</span>. I want to extract the $2952.77. My Python code is:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)
def get_last_price(ticker):
    driver.get(f"https://www.nasdaq.com/market-activity/stocks/{ticker}")
    price = driver.find_element(By.CLASS_NAME, "symbol-page-header__pricing-last-price")
    print(price.get_attribute('text'))
    # p = price.get_attribute('innerHTML')

get_last_price('googl')
The above code prints 'None'. If you uncomment the line defining p and print its output, it shows that Selenium thinks the span is empty:
<span class="symbol-page-header__pricing-price"></span>
I don't understand why this is happening. My thought is that the price is probably being rendered dynamically with JavaScript, but I thought handling that was an advantage of Selenium as opposed to, say, BeautifulSoup... so there shouldn't be an issue, right?
If you look into the HTML DOM of NASDAQ's Coinbase Global page, your Locator Strategy matches two nodes, and one of them is an empty duplicate you don't want.
Solution
To print the price information you can use the following Locator Strategy:
Using XPATH and text attribute:
driver.get("https://www.nasdaq.com/market-activity/stocks/coin")
print(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='symbol-page-header__pricing-price' and text()]"))).text)
Using XPATH and get_attribute("innerHTML"):
driver.get("https://www.nasdaq.com/market-activity/stocks/coin")
print(WebDriverWait(driver, 20).until(EC.element_to_be_clickable((By.XPATH, "//span[@class='symbol-page-header__pricing-price' and text()]"))).get_attribute("innerHTML"))
Console Output:
$263.91
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
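Put together, a minimal runnable sketch of the above (reusing the ChromeDriver setup from the question; the class name and the text() trick come from this thread) might look like:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get("https://www.nasdaq.com/market-activity/stocks/coin")

# Only the populated node has non-empty text(), so this XPath skips the empty duplicate
price = WebDriverWait(driver, 20).until(EC.element_to_be_clickable(
    (By.XPATH, "//span[@class='symbol-page-header__pricing-price' and text()]")))
print(price.text)

driver.quit()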
There are 2 nodes with the class symbol-page-header__pricing-price. The node that you want is under
<div class="symbol-page-header__pricing-details symbol-page-header__pricing-details--current symbol-page-header__pricing-details--decrease"></div>
So, you need to get inside this div first to ensure you scrape the right one.
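For example, a hedged sketch (class names taken from the snippets above) that scopes the search to the current-price container first:

from selenium.webdriver.common.by import By

# Restrict the match to the span inside the 'current' pricing-details div
price = driver.find_element(
    By.XPATH,
    "//div[contains(@class, 'symbol-page-header__pricing-details--current')]"
    "//span[@class='symbol-page-header__pricing-price']")
print(price.text)

Combined with the WebDriverWait shown earlier, this avoids both the empty duplicate and the timing issue.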
Anyway, I'd recommend using BeautifulSoup to parse the HTML once you've finished interacting with the dynamic website via Selenium. This will save you time and memory. There is no need to keep the browser running, so it is better to terminate it (i.e. driver.close()) and let BeautifulSoup explore the static HTML text.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
import time

s = Service(ChromeDriverManager().install())
driver = webdriver.Chrome(service=s)

def get_last_price(ticker):
    driver.get(f"https://www.nasdaq.com/market-activity/stocks/{ticker}")
    time.sleep(1)
    # Grab the rendered HTML before closing the browser
    source = driver.page_source
    driver.close()
    soup = BeautifulSoup(source, "lxml")
    header = soup.find('div', attrs={'class': 'symbol-page-header__pricing-details symbol-page-header__pricing-details--current symbol-page-header__pricing-details--decrease'})
    price = header.find('span', attrs={'class': 'symbol-page-header__pricing-price'})
    print(price)
    print(price.text)

get_last_price('googl')
Output:
>>> <span class="symbol-page-header__pricing-price">$2952.77</span>
>>> $2952.77
I am trying to scrape some tennis statistics starting from 01-01-2019.
For this I try to scrape the following webpage with selenium: https://www.sofascore.com/de/tennis/2019-01-01
When I click on the first match manually the container on the right side changes and shows the statistics.
This is what I want to access automatically.
When I try to click on the element with selenium it redirects me to another page.
Can anyone tell me why it is not just showing the same content as by manually clicking and how I can solve this issue?
Here is my code:
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait as wait
from selenium.webdriver.support import expected_conditions as EC
import time
options = Options()
options.binary_location = "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe"
browser = webdriver.Chrome(chrome_options = options)
url = 'https://www.sofascore.com/de/tennis/2019-01-01'
browser.get(url)
browser.maximize_window()
xpath = '/html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div'
browser.find_element_by_xpath(xpath).click()
time.sleep(2)
browser.close()
You can use the below xpath :
//div[contains(@class, 'Col-pm5mcz-')]//descendant::div[contains(@class, 'styles__StyledWidget-')]
and get the innerHTML of that using get_attribute method
Code :
url = "https://www.sofascore.com/de/tennis/2019-01-01"
driver.get(url)
xpath = '/html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div'
driver.find_element_by_xpath(xpath).click()
sleep(2)
details = driver.find_element_by_xpath("//div[contains(#class, 'Col-pm5mcz-')]//descendant::div[contains(#class, 'styles__StyledWidget-')]").get_attribute('innerHTML')
print(details)
The XPath that you are using is an absolute XPath: /html/body/div[1]/main/div/div[2]/div/div[3]/div[2]/div/div/div/div/div[2]/a/div
Try to replace it with a relative XPath.
See if this works
tableRows = driver.find_elements_by_xpath(".//div[@class='ReactVirtualized__Grid ReactVirtualized__List']//following::div/a[contains(@class,'EventCellstyles__Link')]")
for e in tableRows:
    e.click()
    # You can add an explicit wait here for the statistics section to load
    driver.find_element_by_xpath(".//a[text()='Statistiken']").click()
I need to scrape some information from a dynamically changing html. The website in question is :
https://www.mitartlending.com/featuredartworks. Here, when you click on a given image and hover your mouse over the enlarged image, a text overlay pops up. I am trying to scrape that text. After trying to do this with BeautifulSoup, I decided that I am probably going to have to use Selenium. How would you go about solving this problem? So far, I have:
from selenium import webdriver
driver = webdriver.Chrome('/Users/Abramo/SeleniumDrivers/chromedriver')
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver.get('https://www.mitartlending.com/featuredartworks')
driver.implicitly_wait(3)
my_element = driver.find_element_by_xpath(f'/html/body/div[5]/div[2]/div/main/section/div/div/div/div[3]/div/div/div/div[1]/div/a/img')
my_element.click()
copy_from = driver.find_element_by_class_name('sqs-lightbox-meta overlay-description-visible')
my_next_button = driver.find_element_by_class_name('sqs-lightbox-next')
The data is all there within attributes. You just need to extract the appropriate ones. No need for the overhead of selenium.
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://www.mitartlending.com/featuredartworks')
soup = bs(r.content, 'lxml')
results = {i['data-title']:' '.join(bs(i['data-description'], 'lxml').text.split('\n')) for i in soup.select('.margin-wrapper > a')}
print(results)
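If the dictionary comprehension is hard to read, here is an equivalent expanded sketch (same selector and data-* attributes as above):

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.mitartlending.com/featuredartworks')
soup = bs(r.content, 'lxml')

results = {}
for a in soup.select('.margin-wrapper > a'):
    # The title and description live in data-* attributes on each anchor
    title = a['data-title']
    # data-description holds HTML, so parse it and flatten the newlines
    description = ' '.join(bs(a['data-description'], 'lxml').text.split('\n'))
    results[title] = description
print(results)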
You can locate any of those images by
images = driver.find_elements_by_xpath('//img[contains(@class, "thumb-image loaded")]')
So, for example, to click on the second image:
images[1].click()
To hover over the element you can do this:
from selenium.webdriver.common.action_chains import ActionChains
hover = ActionChains(driver).move_to_element(images[1])
hover.perform()
Now, once the text has appeared, you can locate and get it with
text = driver.find_element_by_xpath('(//img[contains(@class, "thumb-image loaded")])[2]/..//p').text
The same can be done for any other image there.
Altogether the code will look like:
from selenium.webdriver.common.action_chains import ActionChains
import time

images = driver.find_elements_by_xpath('//img[contains(@class, "thumb-image loaded")]')
images[1].click()
time.sleep(2)
hover = ActionChains(driver).move_to_element(images[1])
hover.perform()
time.sleep(2)
text = driver.find_elements_by_xpath('(//img[contains(@class, "thumb-image loaded")])[2]/..//p')
for t in text:
    print(t.text)
I added sleeps just to keep it simple, although it's preferable to use expected-conditions waits instead.
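For instance, a sketch of the same flow using explicit waits instead of sleeps (the XPaths are the same assumptions about the page structure as above):

from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)

# Wait until the thumbnails are present instead of sleeping
images = wait.until(EC.presence_of_all_elements_located(
    (By.XPATH, '//img[contains(@class, "thumb-image loaded")]')))
images[1].click()

# Wait for the enlarged image, then hover to trigger the overlay
enlarged = wait.until(EC.visibility_of_element_located(
    (By.XPATH, '(//img[contains(@class, "thumb-image loaded")])[2]')))
ActionChains(driver).move_to_element(enlarged).perform()

# Wait for the overlay text to become visible and print it
paragraphs = wait.until(EC.visibility_of_all_elements_located(
    (By.XPATH, '(//img[contains(@class, "thumb-image loaded")])[2]/..//p')))
for p in paragraphs:
    print(p.text)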
I am using Selenium to scrape the contents from the App Store: https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830
I tried to extract the text field "As subject matter experts, our team is very engaging..."
I tried to find elements by class
review_ratings = driver.find_elements_by_class_name('we-truncate we-truncate--multi-line we-truncate--interactive ember-view we-customer-review__body')
review_ratingsList = []
for e in review_ratings:
    review_ratingsList.append(e.get_attribute('innerHTML'))
review_ratings
But it returns an empty list [].
Is anything wrong with the code? Or is there a better solution?
Using Requests and Beautiful Soup:
import requests
from bs4 import BeautifulSoup
url = 'https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830'
res = requests.get(url)
soup = BeautifulSoup(res.text,'lxml')
item = soup.select_one("blockquote > p").text
print(item)
Output:
As subject matter experts, our team is very engaging and focused on our near and long term financial health!
You can use WebDriverWait to wait for the visibility of an element and then get its text. Make sure you choose a good Selenium locator.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#...
wait = WebDriverWait(driver, 5)
review_ratings = wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, ".we-customer-review")))
for review_rating in review_ratings:
    stars = review_rating.find_element_by_css_selector(".we-star-rating").get_attribute("aria-label")
    title = review_rating.find_element_by_css_selector("h3").text
    review = review_rating.find_element_by_css_selector("p").text
    print(stars, title, review)
Mix Selenium with Beautiful Soup.
Using WebDriver:
from bs4 import BeautifulSoup
from selenium import webdriver
browser = webdriver.Chrome()
url = "https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830"
browser.get(url)
innerHTML = browser.execute_script("return document.body.innerHTML")
bs = BeautifulSoup(innerHTML, 'html.parser')
bs.blockquote.p.text
Output:
Out[22]: 'As subject matter experts, our team is very engaging and focused on our near and long term financial health!'
Use WebDriverWait, wait for presence_of_all_elements_located, and use the following CSS selector:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://apps.apple.com/us/app/bank-of-america-private-bank/id1096813830")
review_ratings = WebDriverWait(driver, 20).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, '.we-customer-review__body p[dir="ltr"]')))
review_ratingsList = []
for e in review_ratings:
    review_ratingsList.append(e.get_attribute('innerHTML'))
print(review_ratingsList)
Output:
['As subject matter experts, our team is very engaging and focused on our near and long term financial health!', 'Very much seems to be an unfinished app. Can’t find secure message alert. Or any alerts for that matter. Most of my client team is missing from the “send to” list. I have other functions very useful, when away from my computer.']
I want to extract the tag names (hashtags) from the explore page in twitter using selenium on python3. But there are no special tags or classes or even ids to be able to locate them and save them.
Is there a way that I can extract them even if they change without having to edit my code every time?
I think the following code will take me to the explore page using the link text. But I can not use the same method to locate the tags as they change every now and then.
explore = driver.find_element_by_link_text("Explore")
I want to be able to locate the tags and save them into a list so I can use that list in my work later on.
This is the html code for on of the tags:
<span class="r-18u37iz"><span dir="ltr" class="css-901oao css-16my406 r-1qd0xha r-ad9z0x r-bcqeeo r-qvutc0">#ARSBUR</span></span>
The classes are not unique and they are used in other elements of the page, so I can not use them.
Is there a way to locate the '#' mark so I can get only the text that includes it?
To extract the hashtags from the explore page on Twitter, i.e. https://twitter.com/explorer?lang=en, using Selenium on Python 3, you have to induce WebDriverWait for visibility_of_all_elements_located(), and you can use either of the following Locator Strategies:
Using CSS_SELECTOR:
driver.get("https://twitter.com/explorer?lang=en")
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "a[href^='/hashtag']>span.trend-name")))])
Using XPATH:
driver.get("https://twitter.com/explorer?lang=en")
print([my_elem.get_attribute("innerHTML") for my_elem in WebDriverWait(driver, 5).until(EC.visibility_of_all_elements_located((By.XPATH, "//a[starts-with(@href, '/hashtag')]/span[contains(@class, 'trend-name')]")))])
Note: you have to add the following imports:
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
Console Output:
['#MCITOT', '#WorldSupportsKashmir', '#MCIvsTOT', '#11YearsOFViratism', '#ManCity']
You could dump the page source into BeautifulSoup 4.7.1+ and use :contains along with the class. Your classes appear different from the ones I see, but I am making an assumption about the url.
N.B. On the page there can be other # entries under a different class, which would make the selector ".trend-name, .twitter-hashtag".
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
d = webdriver.Chrome(r'path\chromedriver.exe')
d.get('https://twitter.com/explorer?lang=en')
WebDriverWait(d,5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".trend-name")))
soup = bs(d.page_source, 'lxml')
hashtag_trends = [i.text for i in soup.select('.trend-name:contains("#")')]
print(hashtag_trends)
Or, with Selenium only, test whether .text begins with #:
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
d = webdriver.Chrome(r'path\chromedriver.exe')
d.get('https://twitter.com/explorer?lang=en')
hashtag_trends = [i.text for i in
                  WebDriverWait(d, 5).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".trend-name")))
                  if i.text.startswith('#')]
To locate a trending topic you can use XPath:
driver.find_element(By.XPATH, '(//*[contains(@class, "trend-name")])[1]').text
driver.find_element(By.XPATH, '(//*[contains(@class, "trend-name")])[1]').click()
You can count the elements with:
len_locator = driver.find_elements(By.XPATH, '//*[contains(@class, "trend-name")]')
print(len(len_locator))
Or, if you only want locators that start with #, you can use:
driver.find_element(By.XPATH, '(//*[@dir="ltr" and starts-with(text(), "#")])[1]').text
driver.find_element(By.XPATH, '(//*[@dir="ltr" and starts-with(text(), "#")])[1]').click()
You can count the elements with:
len_locator = driver.find_elements(By.XPATH, '//*[@dir="ltr" and starts-with(text(), "#")]')
print(len(len_locator))
These locate the first trending topic; if you want the second and so on, replace [1] with [2], etc. Use iteration to grab them all.
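For example, a small sketch (assuming the same class names as above, and a driver that is already on the explore page) that collects every hashtag into a list:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait for the trend elements, then keep only the ones that are hashtags
trends = WebDriverWait(driver, 5).until(
    EC.presence_of_all_elements_located((By.XPATH, '//*[contains(@class, "trend-name")]')))
hashtags = [t.text for t in trends if t.text.startswith('#')]
print(hashtags)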
I would like to scrape a web page which was opened by Selenium from a different webpage.
I entered a search term into a website using Selenium and this landed me in a new page. My aim is to create soup out of this new page. But, the soup is getting created out of the previous page where I entered my search term. Help please!
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
inputElement.send_keys(Keys.ENTER)
driver.wait.until(staleness_of('txtStock')
source = driver.page_source
soup = BeautifulSoup(source)
You need to know the exact company name for your search. After using send_keys, you tried to check for the staleness of an element; I did not understand how that statement was supposed to work. I added a WebDriverWait for an element of the new page.
The following works for me regarding the Selenium part, up to getting the page source:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries Ltd.')
inputElement.send_keys(Keys.ENTER)
company = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'lblCompany')))
source = driver.page_source
You should add exception handling.
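For instance, a minimal sketch wrapping the wait above (TimeoutException is what WebDriverWait raises on failure; this assumes the driver and imports from the snippet above):

from selenium.common.exceptions import TimeoutException

try:
    company = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'lblCompany')))
    source = driver.page_source
except TimeoutException:
    # The results page never rendered the company label in time
    source = None
    print("Timed out waiting for the company page to load")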
@Jens Dibbern has given a working solution. But it is not necessary that the exact name of the company be given in the search. What happens is that when you type a non-exact name, a drop-down pops up.
I have observed that until this drop-down is present, the enter key does not work. You can check this by going to the site, pasting the name and, without waiting, pressing the enter key as fast as possible. Nothing happens.
You can instead wait for this drop-down to be visible and then send the enter key. This also works perfectly. Note that this will end up selecting the first item in the drop-down if more than one is present.
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('http://www.ratestar.in/')
inputElement = driver.find_element_by_css_selector("#txtStock")
inputElement.send_keys('GM Breweries')
drop_down = driver.find_element_by_css_selector("#listPlacementStock")
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, '#listPlacementStock:not([style*="display: none"])')))
inputElement.send_keys(Keys.ENTER)
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[#id="CompanyLink"]')))
source = driver.page_source
soup = BeautifulSoup(source,'html.parser')
print(soup)
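Once the soup is built, you can pull out just the piece you need instead of printing the whole document. For example (the CompanyLink id is taken from the wait above; anything else on the page would need its own selector):

# Extract the company link element that the wait above ensured is present
company = soup.find(id='CompanyLink')
if company is not None:
    print(company.get_text(strip=True))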