I have been using the Selenium WebDriver with Python to try to log in to this website: https://app.chatra.io/
To do this, I did the following in Python:
from selenium import webdriver
import bs4 as bs
driver = webdriver.Chrome()
driver.get('https://app.chatra.io/')
I then attempt to parse the page with Beautiful Soup:
html = driver.execute_script('return document.documentElement.outerHTML')
soup = bs.BeautifulSoup(html, 'html.parser')
print(soup.prettify)
The main issue is that the page never fully loads. When I load the page in a browser myself, everything is fine; however, when the Selenium WebDriver tries to load it, it seemingly stops halfway.
Any idea why? Any ideas on how to fix it or where to look to learn?
First of all, the issue is also reproducible for me in the latest Chrome (with chromedriver 2.34, also currently the latest); I am not yet sure what is happening. As a workaround, Firefox worked perfectly for me.
I would also add an extra step between driver.get() and the HTML parsing: an explicit wait, so the page can properly load until the desired condition becomes true:
import bs4 as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('https://app.chatra.io/')
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "signin-email")))
html = driver.execute_script('return document.documentElement.outerHTML')
soup = bs.BeautifulSoup(html, 'html.parser')
print(soup.prettify())
Note that you also need to call prettify(); it's a method, so your original print(soup.prettify) only printed the bound method itself.
There are several aspects to the issue you are facing, as below:
Since you are trying to use BeautifulSoup: if you fetch the page with urlopen from urllib.request, the error says it all:
urllib.error.HTTPError: HTTP Error 403: Forbidden
This means plain urllib.request traffic is detected and HTTP Error 403: Forbidden is raised, so using webdriver from selenium makes sense.
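For reference, a minimal sketch of the kind of plain urlopen call that triggers this (the server's behaviour may of course have changed since):
from urllib.request import urlopen

# The default urllib user-agent is detected server-side and the request
# is rejected with HTTP Error 403: Forbidden.
html = urlopen('https://app.chatra.io/').read()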
Next, with ChromeDriver and Chrome, the website initially opens and renders, but ChromeDriver (being a WebDriver) is soon detected, and the contents of the <head> and <body> tags are never delivered. You see only this minimal markup:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" class="supports cssfilters flexwrap chrome webkit win hover web"></html>
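If you still want to experiment with Chrome, here is a hedged sketch of commonly suggested ChromeOptions that reduce naive WebDriver detection; whether they actually help against this particular site is untested here:
from selenium import webdriver

options = webdriver.ChromeOptions()
# Commonly suggested tweaks to make automation less obvious; no guarantee
# they defeat this site's detection.
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
driver = webdriver.Chrome(options=options)
driver.get('https://app.chatra.io/')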
Finally, with GeckoDriver and Firefox Quantum, the website opens and renders properly, as follows:
Code Block:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://app.chatra.io/')
html = driver.execute_script('return document.documentElement.outerHTML')
pagesoup = soup(html, "html.parser")
print(pagesoup)
Console Output:
<html class="supports cssfilters flexwrap firefox gecko win hover web"><head>
<link class="" href="https://app.chatra.io/b281cc6b75916e26b334b5a05913e3eb18fd3a4d.css?meteor_css_resource=true&_g_app_v_=51" rel="stylesheet" type="text/css"/>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, viewport-fit=cover" name="viewport"/>
.
.
.
<em>··· Chatra</em>
.
.
.
</div></body></html>
Adding prettify() to the soup extraction:
Code Block:
from selenium import webdriver
from bs4 import BeautifulSoup as soup
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://app.chatra.io/')
html = driver.execute_script('return document.documentElement.outerHTML')
pagesoup = soup(html, "html.parser")
print(pagesoup.prettify())
Console Output:
<html class="supports cssfilters flexwrap firefox gecko win hover web"><head>
<link class="" href="https://app.chatra.io/b281cc6b75916e26b334b5a05913e3eb18fd3a4d.css?meteor_css_resource=true&_g_app_v_=51" rel="stylesheet" type="text/css"/>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, viewport-fit=cover" name="viewport"/>
.
.
.
<em>··· Chatra</em>
.
.
.
</div></body></html>
You can even use Selenium's page_source attribute, as follows:
Code Block:
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://app.chatra.io/')
print(driver.page_source)
Console Output:
<html class="supports cssfilters flexwrap firefox gecko win hover web">
<head>
<link rel="stylesheet" type="text/css" class="" href="https://app.chatra.io/b281cc6b75916e26b334b5a05913e3eb18fd3a4d.css?meteor_css_resource=true&_g_app_v_=51">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, viewport-fit=cover">
<!-- platform specific stuff -->
<meta name="msapplication-tap-highlight" content="no">
<meta name="apple-mobile-web-app-capable" content="yes">
<!-- favicon -->
<link rel="shortcut icon" href="/static/favicon.ico">
<!-- win8 tile -->
<meta name="msapplication-TileImage" content="/static/win-tile.png">
<meta name="msapplication-TileColor" content="#ffffff">
<meta name="application-name" content="Chatra">
<!-- apple touch icon -->
<!--<link rel="apple-touch-icon" sizes="256x256" href="/static/?????.png">-->
<title>··· Chatra</title>
<style>
body {
background: #f6f5f7
}
</style>
<style type="text/css"></style>
</head>
<body>
<script async="" src="https://www.google-analytics.com/analytics.js"></script>
<script type="text/javascript" src="/meteor_runtime_config.js"></script>
<script type="text/javascript" src="https://app.chatra.io/9153feecdc706adbf2c71253473a6aa62c803e45.js?meteor_js_resource=true&_g_app_v_=51"></script>
<div class="body body-layout">
<div class="body-layout__main main-layout">
<aside class="main-layout__left-sidebar">
<div class="left-sidebar-layout">
</div>
</aside>
<div class="main-layout__content">
<div class="content-layout">
<main class="content-layout__main is-no-fades js-popover-boundry js-main">
<div class="center loading loading--light">
<div class="content-padding nothing">
<em>··· Chatra</em>
</div>
</div>
</main>
</div>
</div>
</div>
</div>
</body>
</html>
Related
I'm very new to web scraping and have run into an issue: I'm trying to scrape the World Football Elo Ratings webpage (https://www.eloratings.net/) for a data science project I'm working on, but I'm not getting the nested HTML elements, only the "top level", as shown below:
<!DOCTYPE html>
<html lang="en"><head><title>World Football Elo Ratings</title>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="Ratings for national football teams based on the Elo rating system." name="description"/>
<meta content="football, ratings, Elo, rankings, national, international, soccer, teams" name="keywords"/>
<meta content="width=device-width, initial-scale=1.0, minimum-scale=1.0, maximum-scale=1.0, user-scalable=no" name="viewport"/>
<link href="scripts/slick.grid.css" rel="stylesheet" type="text/css"/>
<link href="scripts/dygraph.css" rel="stylesheet" type="text/css"/>
<script src="scripts/dygraph.js" type="text/javascript"></script>
<script src="scripts/jquery.js" type="text/javascript"></script>
<script src="scripts/slick.core.js" type="text/javascript"></script>
<script src="scripts/slick.grid.js" type="text/javascript"></script>
<script src="scripts/cldr.js" type="text/javascript"></script>
<script src="scripts/event.js" type="text/javascript"></script>
<script src="scripts/supplemental.js" type="text/javascript"></script>
<script src="scripts/globalize.js" type="text/javascript"></script>
<script src="scripts/number.js" type="text/javascript"></script>
<script src="scripts/date.js" type="text/javascript"></script>
<script src="scripts/ratings.js" type="text/javascript"></script>
<link href="scripts/css.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div class="main" id="main">
<h1 class="mainheader" id="mainheader"></h1>
<div class="topnav" id="topnav"></div>
<h3 class="subheader" id="subheader"></h3>
<div class="maindiv" id="maindiv"></div>
</div>
<div class="mainmenu" id="mainmenu"></div>
<div class="mainloader">
<div class="loadheader" id="loadheader">World Football Elo Ratings</div>
</div>
</body>
</html>
And here is my code so far:
import requests
from bs4 import BeautifulSoup
import pprint
response = requests.get('https://www.eloratings.net/')
soupObject = BeautifulSoup(response.text, 'html.parser')
pprint.pprint(soupObject)
My initial thought is that JavaScript is being used to generate the majority of the HTML, but I am unsure if this is the case, or how to solve it if it is.
Any advice would be greatly appreciated.
You are right: the table is generated by JavaScript, so bs4 won't be able to find it.
If you look at the Network tab, you'll see a request to this URL:
https://www.eloratings.net/World.tsv?_=1670338063316
This returns a World.tsv file which contains the table data.
This can be parsed using the csv module, as described in: How to parse tsv file with python?
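For example, a minimal sketch (assuming the endpoint still serves plain tab-separated rows and needs no special headers):
import csv
import io

import requests

# Fetch the TSV that the page's JavaScript loads, then parse it with the
# csv module. The column layout is best confirmed by inspecting the first
# few rows yourself.
response = requests.get('https://www.eloratings.net/World.tsv')
reader = csv.reader(io.StringIO(response.text), delimiter='\t')
for row in reader:
    print(row)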
I am relatively new to Stack Overflow, and in fact yours is the first question I am going to try to offer advice on!
I am not too sure what you are looking to do, i.e. are you trying to get each country and its stats, or are you simply looking for the order of the rankings?
I have in the past done something similar using Selenium.
I loaded up the webpage you are looking to scrape and tried to figure out how I would do it.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
import time
fireFoxOptions = Options()
fireFoxOptions.headless = True
driver = webdriver.Firefox(options=fireFoxOptions)
driver.get("https://www.eloratings.net/")
time.sleep(10)  # crude wait for the JavaScript-rendered grid to finish loading
stats = []
for num in range(1, 240):
    # There is no selector shared by every row, but the nth-child index
    # increments row by row, so build each selector from the loop counter.
    div_name = f"div.ui-widget-content:nth-child({num})"
    elements = driver.find_elements(By.CSS_SELECTOR, div_name)
    stats.extend(elements)  # extend keeps stats a flat list of elements
print(stats)
This little bit of code runs Firefox in headless mode (no GUI) and gets all the div elements that match the CSS selector. Unfortunately, there wasn't a common CSS selector name shared by all the elements, yet they did follow a pattern of just changing the number in the parentheses, so a simple for loop can collect all of them. From here, if you wanted to get each link, for instance, you would do something like:
for stat in stats:
    link = stat.get_attribute("href")
Then you could iterate through those links and follow them to each team's page.
****** My Final Solution ******
Thank you to all the individuals who helped me resolve this; below is the implementation I am using:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get('https://www.eloratings.net/')
teamData = driver.find_elements(By.CLASS_NAME, 'ui-widget-content')
From this, if for example you do:
print(teamData[0].text)
Output will be (at the time of writing):
1
Brazil
2150
4
1999
0
+1
1030
364
331
335
657
162
211
2237
914
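Since each element's .text comes back as one newline-separated block, a short hedged post-processing sketch (the ordering of rank, team name, rating and the remaining stats is an assumption to verify against the rendered table):
for team in teamData[:5]:
    fields = team.text.splitlines()  # one field per line, as in the output above
    print(fields)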
I'd like to get all the source code shown in the Elements panel of Chrome DevTools.
I tried the following code, but its output does not match what the Elements panel shows.
body = driver.execute_cdp_cmd("DOM.getOuterHTML", {"backendNodeId": 1})
print(body)
Is it possible to get all the source code with CDP, and if so, how?
I know there are other ways to scrape the source code, but I'd like to know how to get the source code shown in the Elements panel of DevTools (F12).
EDIT: See CDP solution at the end
Assuming by "f12 source code" you mean "the current DOM, after it has been manipulated by JS or anything else, as opposed to the original source code".
so, consider the following html page:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Hi</title>
<script>
document.addEventListener("DOMContentLoaded", function(){
setTimeout(function(){
document.getElementById("test").innerHTML+=" World!"
}, 3000)
});
</script>
</head>
<body>
<h1 id="test">Hello</h1>
</body>
</html>
3 seconds after page load, the h1 will contain "Hello World!"
And that is exactly what we see when running the following code:
from selenium import webdriver
from time import sleep
driver = webdriver.Chrome()
driver.get("http://localhost:8000/") # replace with your page
sleep(6) # probably replace with smarter logic
html = driver.execute_script("return document.documentElement.outerHTML")
print (html)
That outputs:
<html lang="en"><head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Hi</title>
<script>
document.addEventListener("DOMContentLoaded", function(){
setTimeout(function(){
document.getElementById("test").innerHTML+=" World!"
}, 3000)
});
</script>
</head>
<body>
<h1 id="test">Hello World!</h1>
</body></html>
EDIT, using CDP instead:
The behavior you're describing is odd, but okay, let's find a different solution.
It seems there's limited support for CDP in Selenium 4 (so far) in Python: as of now (May 2022) there is no driver.getDevTools() in Python, only in Java and JS (Node), as far as I can tell.
Anyway, I'm not even sure that would have helped us.
Raw CDP will suffice for now:
from selenium import webdriver
from time import sleep
# webdriver.remote.webdriver.import_cdp()
driver = webdriver.Chrome()
driver.get("http://localhost:8000/")
sleep(6)
doc = driver.execute_cdp_cmd(cmd="DOM.getDocument",cmd_args={})
doc_root_node_id = doc["root"]["nodeId"]
result = driver.execute_cdp_cmd(cmd="DOM.getOuterHTML",cmd_args={"nodeId":doc_root_node_id})
print (result['outerHTML'])
prints:
<!DOCTYPE html><html lang="en"><head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Hi</title>
<script>
document.addEventListener("DOMContentLoaded", function(){
setTimeout(function(){
document.getElementById("test").innerHTML+=" World!"
}, 3000)
});
</script>
</head>
<body>
<h1 id="test">Hello World!</h1>
</body></html>
I'm trying to pull the price from this site.
I tried with BeautifulSoup first, then opened the page with a Selenium WebDriver browser, but got this response:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<link rel="shortcut icon" href="about:blank">
</head>
<body>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/j.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/f.js"></script>
<script src="/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint/script/kpf.js?url=/149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint&token=9d98d39f-e497-2d15-7332-7e21738bd6e2"></script>
</body>
</html>
This is my Python code:
from selenium import webdriver
dove_coles_url = "https://shop.coles.com.au/a/churchill-centre/product/dove-antiperspirant-deodorant-invisible-dry"
PATH = "C:\\chromedriver.exe"
driver = webdriver.Chrome(PATH)
driver.delete_all_cookies()
driver.get(dove_coles_url)
Thanks in advance.
Using your browser's developer tools, in the Network tab, you can see this request being made:
https://shop.coles.com.au/search/resources/store/20509/productview/bySeoUrlKeyword/dove-antiperspirant-deodorant-invisible-dry?catalogId=17056
Opening it, you'll see that it contains all the data for this product in JSON.
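A hedged sketch of calling that endpoint directly with requests; whether it still accepts anonymous requests, and where exactly the price sits in the JSON, are assumptions to verify in your own session:
import requests

# Hypothetical direct call to the product-view endpoint seen in the
# Network tab; inspect the returned dict to locate the pricing fields.
url = ('https://shop.coles.com.au/search/resources/store/20509/productview/'
       'bySeoUrlKeyword/dove-antiperspirant-deodorant-invisible-dry?catalogId=17056')
response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
print(response.json())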
I'm trying to extract a PDF from a site that uses the native Google Chrome PDF viewer to open the PDF in the first place; its content type is application/pdf. The issue is that the site URLs I get aren't actually links to the PDF, but rather to a .zul page where the JS will load or fetch the PDF.
Here's my download code below:
from selenium import webdriver

def download_pdf(url, idx, save_dir):
    options = webdriver.ChromeOptions()
    # Disable the built-in PDF viewer so PDFs download instead of rendering.
    profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
               "download.default_directory": save_dir}
    options.add_experimental_option("prefs", profile)
    driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)
    driver.get(url)
The problem I'm encountering with the above code is that I get the following readout from driver.page_source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="-1" />
<title>Document Viewer</title>
<link rel="stylesheet" type="text/css" href="/eSMARTContracts/zkau/web/9776a7f0/zul/css/zk.wcs;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1"/>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zk.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zul.lang.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<!-- ZK 6.0.2 EE 2012072410 -->
</head>
<body>
<div id="j4AP_" class="z-temp"></div>
<script class="z-runonce" type="text/javascript">zk.pi=1;zkmx(
[0,'j4AP_',{dt:'z_2m1',cu:'/eSMARTContracts;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',uu:'/eSMARTContracts/zkau;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',ru:'/service/dpsweb/ViewDPSWeb.zul'},[
['zul.wnd.Window','j4AP0',{$$onSize:false,$$onMaximize:false,$$onOpen:false,$$onMinimize:false,$$onZIndex:false,$onClose:true,$$onMove:false,width:'100%',height:'100%',prolog:'\
'},[]]]]);
</script>
<noscript>
<div class="noscript"><p>Sorry, JavaScript must be enabled.<br/>Change your browser options, then try again.</p></div>
</noscript>
</body>
</html>
EDIT: Included the link
I'm a big fan of Stack Overflow and typically find solutions to my problems through this website. However, the following problem has bothered me for so long that it forced me to create an account here and ask directly:
I'm trying to scrape this link: https://permid.org/1-21475776041. What I want are the rows "TRCS Asset Class" and "Currency".
For starters, I'm using this code:
from bs4 import BeautifulSoup
import urllib2
url = 'https://permid.org/1-21475776041'
req = urllib2.urlopen(url)
raw = req.read()
soup = BeautifulSoup(raw)
print soup.prettify()
The html code returned (see below) is different from what you can see in your browser upon clicking the link:
<!DOCTYPE html>
<!--[if lt IE 7]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" ng-app="tmsMdaasApp">
<!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="max-age=0,no-cache" http-equiv="Cache-Control"/>
<base href="/"/>
<title ng-bind="PageTitle">
Thomson Reuters | PermID
</title>
<meta content="" name="description"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="#ff8000" name="theme-color"/>
<!-- Place favicon.ico and apple-touch-icon.png in the root directory -->
<link href="app/vendor.daf96efe.css" rel="stylesheet"/>
<link href="app/app.1405210f.css" rel="stylesheet"/>
<link href="favicon.ico" rel="icon"/>
<!-- Typekit -->
<script src="//use.typekit.net/gnw2rmh.js">
</script>
<script>
try{Typekit.load({async:true});}catch(e){}
</script>
<!-- // Typekit -->
<!-- Google Tag Manager Data Layer -->
<!--<script>
analyticsEvent = function() {};
analyticsSocial = function() {};
analyticsForm = function() {};
dataLayer = [];
</script>-->
<!-- // Google Tag Manager Data Layer -->
</head>
<body class="theme-grey" id="top" ng-esc="">
<!--[if lt IE 7]>
<p class="browserupgrade">You are using an <strong>outdated</strong> browser. Please upgrade your browser to improve your experience.</p>
<![endif]-->
<!-- Add your site or application content here -->
<navbar class="tms-navbar">
</navbar>
<div id="body" role="main" ui-view="">
</div>
<div id="footer-wrapper" ng-show="!params.elementsToHide">
<footer id="main-footer">
</footer>
</div>
<!--[if lt IE 9]>
<script src="bower_components/es5-shim/es5-shim.js"></script>
<script src="bower_components/json3/lib/json3.min.js"></script>
<![endif]-->
<script src="app/vendor.8cc12370.js">
</script>
<script src="app/app.6e5f6ce8.js">
</script>
</body>
</html>
Does anyone know what I'm missing here and how I could get it to work?
Thanks, Teemu Risikko: a comment (albeit not the solution) on the website you linked got me on the right path.
In case someone else bumps into the same problem, here is my solution: I'm getting the data via requests and not via traditional "scraping" (e.g. BeautifulSoup or lxml).
1. Navigate to the website using Google Chrome.
2. Right-click on the page and select "Inspect".
3. In the top navigation bar, select "Network".
4. Limit the network monitor to "XHR".
5. One of the entries (marked with an arrow) shows the link that can be used with the requests library.
import requests
url = 'https://permid.org/api/mdaas/getEntityById/21475776041'
headers = {'X-AG-Access-Token': YOUR_ACCESS_TOKEN}
r = requests.get(url, headers=headers)
r.json()
Which gets me this:
{u'Asset Class': [u'Units'],
u'Asset Class URL': [u'https://permid.org/1-302043'],
u'Currency': [u'CAD'],
u'Currency URL': [u'https://permid.org/1-500140'],
u'Exchange': [u'TOR'],
u'IsQuoteOf.mdaas': [{u'Is Quote Of': [u'Convertible Debentures Income Units'],
u'URL': [u'https://permid.org/1-21475768667'],
u'quoteOfInstrument': [u'21475768667'],
u'quoteOfInstrument URL': [u'https://permid.org/1-21475768667']}],
u'Mic': [u'XTSE'],
u'PERM ID': [u'21475776041'],
u'Quote Name': [u'CONVERTIBLE DEBENTURES INCOME UNT'],
u'Quote Type': [u'equity'],
u'RIC': [u'OCV_u.TO'],
u'Ticker': [u'OCV.UN'],
u'entityType': [u'Quote']}
Using the default user-agent with a lot of pages will give you a different-looking page, because the server thinks it is dealing with an outdated browser. This is what your output is telling you.
Reference on changing user-agents
Though this may be your problem, it does not exactly answer the question about getting dynamically applied changes on a webpage. To get the dynamically changed data, you need to emulate the JavaScript requests that the page makes on load; if you make the requests that the JavaScript is making, you will get the data that the JavaScript is getting.
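As a hedged illustration of the user-agent point (the UA string below is only an example of a modern desktop browser):
import requests

# Present a modern desktop User-Agent so the server does not serve its
# outdated-browser fallback page. Example string; any current browser UA works.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/120.0.0.0 Safari/537.36'
}
response = requests.get('https://permid.org/1-21475776041', headers=headers)
print(response.status_code)
For the dynamically loaded data itself, you would still emulate the underlying API call, as the accepted approach above does.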