Please note: this question remains open, because the suggested "answer" still gives the same output and does not explain why JS isn't running on that page or why Selenium can't extract the content.
I'm trying to read the page source of http://147.235.97.36/ (an HP printer), which is rendered by JS.
So I wrote:
driver.get(url)
wait_for_page(driver)
source = driver.page_source
print(source)
but in the printed source I see:
<p>JavaScript is required to access this website.</p>
<p>Please enable JavaScript or use a browser that supports JavaScript.</p>
and some of the content isn't there, so I changed my code to:
driver.get(url)
wait_for_page(driver)
source = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
print(source)
Still the same output. Can you help me understand what the problem is here?
Here is my init_driver function:
import os

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service


def init_driver():
    # --Initialize Driver--#
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Run in Background
    if os.name == 'nt':
        chrome_options.add_argument('--disable-gpu')  # Windows workaround
    prefs = {"profile.default_content_settings.images": 2,
             "profile.managed_default_content_settings.images": 2}  # Disable Loading of Images
    chrome_options.add_experimental_option("prefs", prefs)
    chrome_options.add_argument('--ignore-ssl-errors=yes')
    chrome_options.add_argument('--ignore-certificate-errors')
    chrome_options.add_argument("--window-size=1920,1080")  # Standard Window Size
    # pageLoadStrategy is a WebDriver capability, not a Chrome command-line switch
    chrome_options.page_load_strategy = "normal"
    driver = None
    try:
        driver = webdriver.Chrome(options=chrome_options, service=Service('./chromedriver'))
        driver.set_page_load_timeout(REQUEST_TIMEOUT)
    except Exception as e:
        log_warning(str(e))
    return driver
You can add a few arguments to avoid getting detected and print the page source as follows:
Code Block:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service

options = Options()
options.add_argument("start-maximized")
# A second add_experimental_option call with the same key replaces the first,
# so both switches are listed together
options.add_experimental_option("excludeSwitches", ["enable-automation", "enable-logging"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('--disable-blink-features=AutomationControlled')
s = Service('C:\\BrowserDrivers\\chromedriver.exe')
driver = webdriver.Chrome(service=s, options=options)
driver.get("http://147.235.97.36/")
print(driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML"))
Console Output:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<link href="/framework/Unified.css" rel="stylesheet" type="text/css">
<script type="text/javascript">
frameWorkObj = {};
frameWorkObj.pkg = "ews";
</script>
<script src="/framework/Unified.js" type="text/javascript"></script>
</head>
<body class="theme-gray">
<iframe src="/framework/cookie/client/cookie.html" style="display: none;"></iframe>
<div id="pgm-overall-container">
<div id="pgm-left-pane-bkground"></div>
<div id="pgm-banner"></div>
<div id="pgm-search-div" class="gui-hidden"></div>
<div id="pgm-top-pane"></div>
<div id="pgm-container-div">
<div id="pgm-left-pane"></div>
<div id="pgm-container" class="clear-fix">
<div id="pgm-title-div" class="gui-hidden"></div>
<div id="contentPane" class="contentPane"></div>
</div>
</div>
<div id="pgm-footer"></div>
</div> <!-- #pgm-overall-container -->
<div id="pgm-theatre-staging-div"></div>
<script type="text/javascript">
// frame buster
if(top != self)
top.location.replace(self.location.href);
</script>
<noscript>
<div id="pgm-no-js-text">
<p>JavaScript is required to access this website.</p>
<p>Please enable JavaScript or use a browser that supports JavaScript.</p>
</div>
</noscript>
<div id="ui-datepicker-div" style="display: none;" tabindex="0"></div></body>
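Note where the "JavaScript is required" text sits in this output: inside the <noscript> element. A browser keeps <noscript> in the DOM even when scripting is enabled, so finding that text in page_source or innerHTML does not by itself mean JavaScript failed to run. Below is a minimal sketch, using only the standard library, that collects the text found outside <noscript>, <script> and <style>. The class and function names are illustrative and not from the original post, and the sample is a shortened copy of the output above.

```python
from html.parser import HTMLParser

SKIPPED_TAGS = ("noscript", "script", "style")


class TextOutsideNoscript(HTMLParser):
    """Collect text nodes, ignoring anything inside noscript/script/style."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def text_outside_noscript(html):
    """Return the non-empty text chunks that are not inside a skipped tag."""
    parser = TextOutsideNoscript()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = """
<body class="theme-gray">
<div id="contentPane" class="contentPane"></div>
<noscript>
<div id="pgm-no-js-text">
<p>JavaScript is required to access this website.</p>
</div>
</noscript>
</body>
"""

print("JavaScript is required" in sample)
print(text_outside_noscript(sample))
```

For the sample, the message is present in the raw markup but nothing is left once the <noscript> block is ignored, which suggests the containers such as contentPane were still empty when the source was captured. If that is the case on the real page, waiting for content to appear inside contentPane before reading the source may help; that is an assumption about this particular page, not something the output confirms.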
I have a Django webapp displaying a form. One of the fields is a FileField, defined via the Django model of the form:
From models.py:
class Document(models.Model):
    ...
    description = models.CharField(max_length=100, default="")
    document = models.FileField(upload_to="documents/", max_length=500)
The document file_field has an onchange ajax function attached that will parse the uploaded filename, check some database stuff depending on it, and populate other fields on the html-page with the results.
From forms.py:
class DocumentForm(forms.ModelForm):
    class Meta:
        model = Document
        fields = ("document",)

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.fields["customer"] = forms.CharField(initial="", required=True)
        self.fields["output_profile"] = forms.CharField(initial="", required=True)
        self.fields["document"].widget.attrs[
            "onchange"
        ] = "checkFileFunction(this.value, '/ajax/check_file/')"
From urls.py:
urlpatterns = [
    #...
    path("ajax/check_file/", views.check_file, name="ajax_check_file")
]
From views.py:
def check_file(request):
    full_data = {"my_errors": []}
    my_path = pathlib.Path(request.GET.get("file_path").replace("\\", os.sep))
    # parse customer ID from file_path
    # get data of customer from db
    # assemble everything into full_data
    return JsonResponse(full_data)
This is the full html page as displayed (copied from Chrome => show source and cleaned up the indentation & whitespaces some):
<!doctype html>
<html>
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.1.3/css/bootstrap.min.css" integrity="sha384-MCw98/SFnGE8fJT3GXwEOngsV7Zt27NXFoaoApmYm81iuXoPkFOJwJ8ERdknLPMO" crossorigin="anonymous">
<link href="/static/css/main.css" rel="stylesheet" type="text/css">
<link rel="stylesheet" type="text/css" href="https://cdnjs.cloudflare.com/ajax/libs/jqueryui/1.12.1/themes/base/jquery-ui.css"/>
<link rel="stylesheet" type="text/css" href="https://cdn.datatables.net/1.10.20/css/dataTables.jqueryui.css"/>
<script src="/static/js/jquery.js"></script>
<script src="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/js/bootstrap.min.js" integrity="sha384-wfSDF2E50Y2D1uUdj0O3uMBJnjuUD4Ih7YwaYd1iqfktj0Uod8GCExl3Og8ifwB6" crossorigin="anonymous"></script>
<script type="text/javascript" src="https://cdn.datatables.net/1.10.20/js/jquery.dataTables.js"></script>
<script type="text/javascript" src="https://cdn.datatables.net/1.10.20/js/dataTables.jqueryui.js"></script>
<title>
Convert RES FILE
</title>
</head>
<body>
<header id="header">
<section class="top_menu_left">
Login
<a> | </a>
Logout
<a> | </a>
Edit User
<a> | </a>
Register
</section>
<section class="top_menu_right">
About Us
<a> | </a>
Contact Us
<a> | </a>
Submit an issue
<a> | </a>
Documentation
<a> | </a>
Home
</section>
<div id="topbanner" >
<img src="/static/banner_small.png" alt="" width="100%" height="150"/>
</div>
</header>
<aside id="leftsidebar">
<section class="nav_account">
<h4>Submit a New Request</h3>
<ul>
<li>Get Typing Results</li>
<li>Compare Typing Results</li>
<li>Convert Typing Format</li>
</ul>
</section>
<section class="nav_tools">
<h4>View Your Requests</h3>
<ul>
<li>View My Submissions</li>
</ul>
</section>
</aside>
<section id="main">
<p> </p>
<h2>Convert Typing Results to Format of Choice</h2>
<p> </p>
<h3>Upload a file to our database</h3>
<form method="post" enctype="multipart/form-data">
<input type="hidden" name="csrfmiddlewaretoken" value="qqIKcsAynuE35MQ37dvjF5XeIyfcEbHb3wjtgygGZaigQReNxHLQewoDKcEb8Roj">
<div id="div_id_document" class="form-group">
<label for="id_document" class=" requiredField">
Document<span class="asteriskField">*</span>
</label>
<div class="">
<input type="file" name="document" onchange="checkFileFunction(this.value, '/ajax/check_file/')" class="clearablefileinput form-control-file" required id="id_document">
</div>
</div>
<input type="hidden" id="id_description" name="description" value="">
<p> </p>
<div id="div_id_customer" class="form-group">
<label for="id_customer" class=" requiredField">
Customer<span class="asteriskField">*</span>
</label>
<div class="">
<input type="text" name="customer" readonly class="textinput textInput form-control" required id="id_customer">
</div>
</div>
<div id="div_id_output_profile" class="form-group">
<label for="id_output_profile" class=" requiredField">
Output profile<span class="asteriskField">*</span>
</label>
<div class="">
<input type="text" name="output_profile" readonly class="textinput textInput form-control" required id="id_output_profile">
</div>
</div>
<div class="form-group">
<div id="div_id_notify_me" class="form-check">
<input type="checkbox" name="notify_me" style="width:15px;height:15px;" class="checkboxinput form-check-input" id="id_notify_me">
<label for="id_notify_me" class="form-check-label">
Notify me
</label>
</div>
</div>
<p>
<button class="linkbutton" type="submit" id="submit_btn">Convert</button>
<button id="create-book" class="linkbutton" type="button" name="button" style="float: right;">Create an Output Profile</button>
</p>
</form>
<div class="modal fade" tabindex="-1" role="dialog" id="modal">
<div class="modal-dialog modal-lg" role="document">
<div class="modal-content" id="form-modal-content">
</div>
</div>
</div>
<div class="modal fade" tabindex="-1" role="dialog" id="modal-check-res-file">
<div class="modal-dialog" role="document">
<div class="modal-content" id="form-modal-content-check-res-file">
</div>
</div>
</div>
<script>
var formAjaxSubmit = function(form, modal) {
$(form).on('submit', function (e) {
e.preventDefault();
console.log("submitting...");
var my_val = $("#id_profile_name").val();
var this_val = $("#confirm_save").val();
var res = this_val.split(",");
var this_val_contains_my_val = (res.indexOf(my_val) > -1);
if (this_val_contains_my_val === true) {
var conf = confirm("Are you sure want to overwrite an exsisting profile?");
}else {var conf = true;};
if (conf === true) {
$.ajax({
type: $(this).attr('method'),
url: "/new_customer_profile/",
data: $(this).serialize(),
success: function (xhr, ajaxOptions, thrownError) {
if ( $(xhr).find('.invalid-feedback').length > 0 ) {
$(modal).find('.modal-content').html(xhr);
formAjaxSubmit(form, modal);
} else {
$(modal).find('.modal-content').html(xhr);
}
},
error: function (xhr, ajaxOptions, thrownError) {
}
});
};
});
};
$('#create-book').click(function() {
console.log("hhallo");
$('#form-modal-content').load('/new_customer_profile/', function () {
var iam_alive = document.getElementById("modal");
// check if iam_alive is defined, this is required if a session expired -> in that case the modal is lost and it would redirect to an almost empty page.
if (iam_alive) {
$('#modal').modal('toggle');
formAjaxSubmit('#form-modal-body form', '#modal');
}
// if not iam_alive: redirect to login page
else {
window.location.replace('/accounts/login/');
}
});
});
$('#check-res-file').click(function() {
console.log("hhallo hier unten jetzt");
$('#form-modal-content-check-res-file').load('/check_res_file/', function () {
$('#modal-check-res-file').modal('toggle');
//formAjaxSubmit('#form-modal-body form', '#modal');
});
});
</script>
<script type="text/javascript">
$(document).ready(function() {
var mycell = document.getElementById("create-book");
mycell.style.display = "none";
});
</script>
<script>
function checkFileFunction(myfile, url) {
$.ajax({ // initialize an AJAX request
url: url, // set the url of the request (= localhost:8000/hr/ajax/load-cities/)
data: {"file_path": myfile},
dataType: 'json',
success: function (x) {
if (x.my_errors.length == 0) {
$('#id_customer').val(x.customer_name);
$('#id_output_profile').val(x.customer_profile);
$('#id_description').val(x.customer_file);
}else{
$('#id_customer').val("");
$('#id_customer').val("");
$('#id_output_profile').val("");
alert(x.my_errors);
var showme = function myFunction() {
var mycell = document.getElementById("create-book");
mycell.style.display = "block";
};
showme();
}
},
});
}
</script>
</section>
</body>
</html>
Now, I'm trying to test this with pytest via Selenium.
I can send the file path to the field via send_keys(). However, the onchange event seems not to be triggered. (It does work fine when I select the file manually.)
file_field = self.driver.find_element(By.NAME, "document")
file_field.clear()
file_field.send_keys(str(path/to/myfile))
This will register the file fine and it will be uploaded, but the onchange function never happens.
I have searched and it seems others also have encountered the problem of send_keys not triggering the onchange event. But I have not been able to implement any of the suggested solutions in my Python code. (I have not written the Django code for this app, I'm just the tester and my grasp on Django and javascript is not very good, yet. My native programming language is Python.)
The only solution I understood how to implement was sending a TAB or ENTER afterwards (file_field.send_keys(Keys.TAB)) to change the focus, but that triggers an
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: File not found
(The file I entered does exist and the path is fine: I can successfully call .exists() on it.)
Simply selecting a different element after send_keys to shift the focus (i.e., customer_field.click()) does not trigger the onchange function of file_field, either.
How can I trigger an onchange event via Selenium from Python? Or otherwise make sure it is triggered?
First of all, your onchange specification is a bit kludgy and would preferably be specified as:
<input type="file" name="document" onchange="checkFileFunction(this.value, '/ajax/check_file/');">
I am using Selenium with the latest version of Chrome and its ChromeDriver under Windows 10 and have no problems with the onchange event firing. This can be demonstrated with the following HTML document. If the onchange event fires, it creates a new div element with id 'result' that contains the path of the selected file:
File test.html
<!doctype html>
<html>
<head>
<title>Test</title>
<meta name=viewport content="width=device-width,initial-scale=1">
<meta charset="utf-8">
<script>
function checkFileFunction(value)
{
const div = document.createElement('div');
div.setAttribute('id', 'result');
const content = document.createTextNode(value);
div.appendChild(content);
document.body.appendChild(div);
}
</script>
</head>
<body>
<input type="file" name="document" onchange="checkFileFunction(this.value);">
</body>
</html>
Next we have this simple Selenium program that sends a file path to the file input element and then waits for up to 3 seconds (with the call to driver.implicitly_wait(3)) for an element with an id value of 'result' to be found on the current page and then prints out the text value. This element will only exist if the onchange event occurs:
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    # Wait up to 3 seconds for an element to appear
    driver.implicitly_wait(3)
    driver.get('http://localhost/test.html')
    # The find_element_by_* helpers are gone from newer Selenium 4 releases;
    # find_element(By.NAME, ...) is the current form
    file_field = driver.find_element(By.NAME, "document")
    file_field.clear()
    file_field.send_keys(r'C:\Util\chromedriver_win32.zip')
    result = driver.find_element(By.ID, 'result')
    print(result.text)
finally:
    driver.quit()
Prints:
C:\Util\chromedriver_win32.zip
Now, if your driver is different and that is the reason why the onchange event is not occurring, and you do not wish to or cannot switch to the latest ChromeDriver, then you can manually execute the function specified by the onchange argument. In this version of the HTML file, I have not specified the onchange argument, to simulate the situation where specifying it has no effect:
File test.html Version 2
<!doctype html>
<html>
<head>
<title>Test</title>
<meta name=viewport content="width=device-width,initial-scale=1">
<meta charset="utf-8">
<script>
function checkFileFunction(value)
{
const div = document.createElement('div');
div.setAttribute('id', 'result');
const content = document.createTextNode(value);
div.appendChild(content);
document.body.appendChild(div);
}
</script>
</head>
<body>
<input type="file" name="document">
</body>
</html>
And the new Selenium code:
from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument("headless")
options.add_experimental_option('excludeSwitches', ['enable-logging'])
driver = webdriver.Chrome(options=options)
try:
    # Wait up to 3 seconds for an element to appear
    driver.implicitly_wait(3)
    driver.get('http://localhost/test.html')
    file_field = driver.find_element(By.NAME, "document")
    file_field.clear()
    file_path = r'C:\Util\chromedriver_win32.zip'
    file_field.send_keys(file_path)
    # Force execution of the onchange event function.
    # The path is passed as a script argument: interpolating it into the
    # script text would lose its backslashes inside the JS string literal.
    driver.execute_script("checkFileFunction(arguments[0]);", file_path)
    result = driver.find_element(By.ID, 'result')
    print(result.text)
finally:
    driver.quit()
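If calling the handler function by name is inconvenient, another option is to dispatch a synthetic change event on the input, so that whatever handler is attached to it runs. This is not part of the approach above, only a sketch under the assumption that the page reacts to a standard DOM change event; fire_change and CHANGE_EVENT_JS are hypothetical names. The helper takes any object with an execute_script method, so it needs no Selenium import of its own.

```python
# JavaScript run in the page: arguments[0] is the element passed from Python.
CHANGE_EVENT_JS = (
    "arguments[0].dispatchEvent(new Event('change', {bubbles: true}));"
)


def fire_change(driver, element):
    """Dispatch a bubbling 'change' event on element through the driver."""
    driver.execute_script(CHANGE_EVENT_JS, element)
```

In a test this would be called right after send_keys, for example fire_change(driver, file_field). An inline onchange attribute is an ordinary event handler, so it should run for a dispatched event in the same way as for a user-initiated one.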
Update
I guess you missed <script src="/static/js/jquery.js"> in the <head> section, which appears to be jQuery. But I would have thought that, with this script tag in the <head> section, jQuery would have been loaded by the time Selenium found the file element. So I confess that your getting a JavaScript error ($ is not defined) is somewhat surprising. I can only suggest that you try loading it manually, as detailed in the code below.
Here again are the three things you should try, in order, moving on to the next approach only if the previous one does not work:
1. Give jQuery time to load before sending keystrokes, to eliminate the $ is not defined error.
2. Manually load jQuery before sending keystrokes.
3. Manually execute checkFileFunction.
# 1. Give jQuery time to load before sending keystrokes:
import time
time.sleep(3)
# 2. if the above sleeping does not work,
# remove the above call to sleep and manually load jQuery.
# Specify where the home page would be loaded from:
document_path = '/home/usr/account/public_html' # for example
jQuery_path = document_path + '/static/js/jQuery.js'
with open(jQuery_path, 'r') as f:
    jQuery_js = f.read()
self.driver.execute_script(jQuery_js)
# Send keystrokes:
file_path = str(path/to/myfile)
file_field.send_keys(file_path)
# 3. Execute the following if the onchange doesn't fire by itself
# (the path goes in as a script argument so that backslashes survive):
self.driver.execute_script("checkFileFunction(arguments[0], '/ajax/check_file/');", file_path)
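One thing to watch when calling a page function through execute_script: a Windows path contains backslashes, and inside a JavaScript string literal a sequence such as \U simply collapses to U. Passing the value as a script argument (arguments[0]) avoids the issue. If the script text has to be built as a string, json.dumps produces a properly escaped literal. A small sketch, where js_call is a hypothetical helper name:

```python
import json


def js_call(function_name, *args):
    """Build a JavaScript call expression with JSON-escaped arguments."""
    rendered = ", ".join(json.dumps(arg) for arg in args)
    return f"{function_name}({rendered});"


script = js_call(
    "checkFileFunction",
    r"C:\Util\chromedriver_win32.zip",
    "/ajax/check_file/",
)
print(script)
```

The resulting string can be handed to execute_script as is; the backslashes are doubled in the script text, so the page function receives the original path.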
As it turns out, the actual problem was that my manual tests were done on the Django app served via python manage.py runserver. This invokes some hidden Django magic, including locating and serving the static files (css, jQuery.js etc.) under the hood.
I have now learned that, in order to serve a Django app on a proper server, one first needs to call python manage.py collectstatic. This generates a static folder in the parent directory, which contains all the static files, including an explicit jQuery.js.
Then, when Selenium is run, the static folder and the jQuery.js file in it are found, and everything works as expected, including onchange.
So the problem was that this parent static folder was missing, which I never noticed because serving the website via python manage.py runserver doesn't need it.
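A cheap guard against this kind of surprise is to check, before the Selenium tests start, that the files the page loads from /static/ actually exist in the collected folder. The sketch below is only an illustration: missing_static_files is a hypothetical helper, the relative paths are taken from the page source above, and the temporary directory stands in for the real static folder.

```python
import tempfile
from pathlib import Path


def missing_static_files(static_root, relative_paths):
    """Return the entries of relative_paths that are not files under static_root."""
    root = Path(static_root)
    return [rel for rel in relative_paths if not (root / rel).is_file()]


# Demo against a throwaway directory standing in for the collected static folder.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "js").mkdir()
    (Path(tmp) / "js" / "jquery.js").write_text("// placeholder")
    result = missing_static_files(tmp, ["js/jquery.js", "css/main.css"])

print(result)
```

In a pytest setup, a fixture could call this against the project's static folder and fail early with a hint to run python manage.py collectstatic.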
I am looking to scrape prices for different products from Metro's online grocery store. To do this, I need to set a particular store as a "favourite" so that Metro knows which products to show. I'm currently using Selenium to automate this part and return the cookies after selecting a particular store. However, I am still getting 403 errors when passing the cookies to requests.get, despite the fact that I can access other pages on Metro's website.
import requests
import time
from user_agent import generate_user_agent
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium_stealth import stealth
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

user_agent = generate_user_agent(navigator="chrome")
header = {"User-Agent": user_agent}
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)


def getMetroCookies(store_url):
    # Service and find_element(By.XPATH, ...) replace executable_path and
    # find_element_by_xpath, which newer Selenium 4 releases no longer accept
    service = Service("C:/Users/XXXX/Documents/chrome_driver/chromedriver.exe")
    browser = webdriver.Chrome(options=options, service=service)
    browser.delete_all_cookies()
    stealth(browser,
            languages=["en-US", "en"],
            vendor="Google Inc.",
            platform="Win32",
            webgl_vendor="Intel Inc.",
            renderer="Intel Iris OpenGL Engine",
            fix_hairline=True,
            )
    browser.get(store_url)
    time.sleep(1.5)
    cookie_button = browser.find_element(By.XPATH, "/html/body/div[4]/div/div[3]/button")
    cookie_button.click()
    WebDriverWait(browser, 10).until(EC.invisibility_of_element_located((By.XPATH, "/html/body/div[4]/div/div[3]/button")))
    store_button = browser.find_element(By.XPATH, "/html/body/div[1]/div[2]/div[1]/div[2]/div[3]/div/div/div/div[1]/div/div[3]/button")
    time.sleep(1)
    store_button.click()
    time.sleep(3)
    driver_cookies = browser.get_cookies()
    # Flatten Selenium's list of cookie dicts into the name -> value mapping requests expects
    cookies = {c['name']: c['value'] for c in driver_cookies}
    browser.close()
    return cookies


store_url = "https://www.metro.ca/en/find-a-grocery/164"
cookies = getMetroCookies(store_url)
base_url = "https://www.metro.ca/en/online-grocery/search?filter="
search_item = "chicken"
search_url = base_url + search_item
page = requests.get(search_url, headers=header, cookies=cookies)
content = BeautifulSoup(page.text, 'html.parser')
This gives me a 403 error along with the following page content.
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en-US"> <!--<![endif]-->
<head>
<title>Attention Required! | Cloudflare</title>
<meta id="captcha-bypass" name="captcha-bypass"/>
<meta charset="utf-8"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=Edge,chrome=1" http-equiv="X-UA-Compatible"/>
<meta content="noindex, nofollow" name="robots"/>
<meta content="width=device-width,initial-scale=1" name="viewport"/>
<link href="/cdn-cgi/styles/cf.errors.css" id="cf_styles-css" media="screen,projection" rel="stylesheet" type="text/css"/>
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">body{margin:0;padding:0}</style>
<!--[if gte IE 10]><!-->
<script>
if (!navigator.cookieEnabled) {
window.addEventListener('DOMContentLoaded', function () {
var cookieEl = document.getElementById('cookie-alert');
cookieEl.style.display = 'block';
})
}
</script>
<!--<![endif]-->
<script type="text/javascript">
//<![CDATA[
(function(){
window._cf_chl_opt={
cvId: "2",
cType: "interactive",
cNounce: "94024",
cRay: "6657c0090c70ecee",
cHash: "f2ab1c66a7c7fb9",
cFPWv: "g",
cTTimeMs: "4000",
cLt: "n",
cRq: {
ru: "aHR0cHM6Ly93d3cubWV0cm8uY2EvZW4vb25saW5lLWdyb2Nlcnkvc2VhcmNoP2ZpbHRlcj1jaGlja2Vu",
ra: "TW96aWxsYS81LjAgKFdpbmRvd3MgTlQgMTAuMDsgV09XNjQpIEFwcGxlV2ViS2l0LzUzNy4zNiAoS0hUTUwsIGxpa2UgR2Vja28pIENocm9tZS81NC4wLjI4ODIuNzQgU2FmYXJpLzUzNy4zNg==",
rm: "R0VU",
d: "EI8002UISMNZV4/wX/5oFZrkU66iZFrjnbrNYKgh3Ttb0AlT4tTYpyyzbKdGR4wfseBSZjcF8rJrwqdQEMKdIRBqLQjf0JlIowEseVWSf0dY03uEBGR+076Co1cm3pAeU83GN1kzFNq/sMe832Ng4oWK/pCpJ6XdIvbGWpk1l8Qtrwbi/hVtj3R1BXapeIgGrJRGlUjcsa72BbNFXOb97CsKqFb+6xMTSO9D/nTxFlouAqHyvbrkTG+CeGvImNQTqu9AVSsZiibNCRQ9C/IlNzCwn0tEvnJ6dZ6WA5RaS4riPmOdbpVGDcS2hIOjIfeGK4/Xj0dho0VkraSq+NPcFTfs18YuqtvQq/h7+V7uST5whKYXu1DM5F1TwPLbzM3KpB/KeYlad+JgxDcOaz1k0H/t52rfMhz8PYAjNvn7SwUXSJMRDeQavS6428g8IWtveqSUj4gnn6d4wGdTTNRpqnUm+m9SJARft2IjidMpvvBtUUzZe6srQs4JPZ9XzjfH+X/kMWgQT3X2pZVDrZZC9Od7P+sqyXPKoFNuZRPrWP15XogncIKTjt5MJLQUV42MGcaGlQ5w1PAvLNGGyNeMFG8wCfhuc/vLzodD+DP3bgIi7tjx8d5zhP3jMPAsUPxAxcJpZkBtuMBuKDNQO50dYHD2wwyOhx9HMcqHWCssMWN4qUzYKOth1KNlg0zlA/qzry1csYQqILH1F1b9O5QypPa2OA5gGmJNhar8svffekU9CXsqgtHDphJgEwsqrP1qSZzQ6wq1s5McDp6pPKijdPGbBrK4q2pxbJaVHu0lRn58gStP6HGEY8BLV/kEpygG27T4Vq4dp4uWLZDKw2oxk8ezrOIgv/lq7yXkZmhZs1GzHd4XWVXJvZ5dTI3rT1zrXMOTpInw4RWXULnazZn3HofZYOm0mUJvsofwzjaG88A=",
t: "MTYyNDcyNDI5Mi4wMTcwMDA=",
m: "SuNqM4NyxmnA1WU+nYefP0zkF5LxO+2HK+JlYjzu4dw=",
i1: "Z/V7+yIdblkqF9PRfarDwA==",
i2: "iMe97FeUtyqejNZ6Ziyc8w==",
zh: "/vdKLh0CrKHrnBUka1HcvI1mkhoFozUewI640Q15E4c=",
uh: "wSvBDgWWw4CCletn46YSZpWn4A/qjMkCb4uV9eAjmfA=",
hh: "56bTGUAA35o0NPPIwaihW3gLWiRsmO2PeArMwpTuU9E=",
}
};
}());
//]]>
</script>
<style type="text/css">
#cf-wrapper #spinner {width:69px; margin: auto;}
#cf-wrapper #cf-please-wait{text-align:center}
.attribution {margin-top: 32px;}
.bubbles { background-color: #f58220; width:20px; height: 20px; margin:2px; border-radius:100%; display:inline-block; }
#cf-wrapper #challenge-form { padding-top:25px; padding-bottom:25px; }
#cf-hcaptcha-container { text-align:center;}
#cf-hcaptcha-container iframe { display: inline-block;}
@keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
#cf-wrapper #cf-bubbles { width:69px; }
@-webkit-keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
#cf-bubbles > .bubbles { animation: fader 1.6s infinite;}
#cf-bubbles > .bubbles:nth-child(2) { animation-delay: .2s;}
#cf-bubbles > .bubbles:nth-child(3) { animation-delay: .4s;}
</style>
</head>
<body>
<div id="cf-wrapper">
<div class="cf-alert cf-alert-error cf-cookie-error" data-translate="enable_cookies" id="cookie-alert">Please enable cookies.</div>
<div class="cf-error-details-wrapper" id="cf-error-details">
<div class="cf-wrapper cf-header cf-error-overview">
<h1 data-translate="challenge_headline">One more step</h1>
<h2 class="cf-subheadline"><span data-translate="complete_sec_check">Please complete the security check to access</span> www.metro.ca</h2>
</div>
<div class="cf-section cf-highlight cf-captcha-container">
<div class="cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<div class="cf-highlight-inverse cf-form-stacked">
<form action="/en/online-grocery/search?filter=chicken&__cf_chl_captcha_tk__=0641319015a45358b1db60468c92bf88af4a70ea-1624724292-0-ATxUvClOko_GDrF_ejLwzZX-kuPRpoh1BFTlbPpgnM7UZS0tt0LcTa6u0ksaDrdsCuFkwxbyL7QYwbUeX6srjGPdlhXjLsQNqAH5sr4WHG8JX55aU2kRJzjzY9HulNoXyr6MuhmU1HzLv1ZvLss4X5hP-lABtnHTc5waDyQNzn3zxVHYetOu-uA7COqv76by9yx8dhQAWX0pT8cgjYQ2QwRLhrAw49GqhCux2EluSfziYo-Zqncf4uDyMe0Pb7Hb1csz2l9E_L26erOLQTrM_U2c1sYY0T-4ofJdQNEVLFA7e1FkGspeuGaFRRNmcXhCNPB7YKEiHlkROpAr2nxQeepJuefHBMdzbixJRXE5glhNCX9XXJ5nbpo8OzLY7pnMrJgaW6_YucjLh0fJs4c0bfBHAHZLWQeGxvcG7_AeM3zY6MIXngvnXg64GyrpxmYfADy_znyKmVlTCvVwdc8VEBZo27I4iGoqhJWaG0E1Q0Dw9a6dTU7bOWCSpoaxSNUmNkuwL5VsBAk3paSDIwYaewFLHijU-PUdeGw9hcLFsNbD95qUGlVEHZsdUMg176NYJ1VyZho1MMbNj8bVVC2kDKyZOu1IqcMe0TTqVwV5p9j_zZU6ODLXhn_d2VFULBMQTs9eIZUIz3j6uMZdEYV2o53P421SCx-MPPD5rALfYHdTRmSBDCLeW7gUG5-UvnWh87p87HJH__7plEmoJhFkW8crBpUeKBhwt7JQR_huvqOW" class="challenge-form interactive-form" enctype="application/x-www-form-urlencoded" id="challenge-form" method="POST">
<div id="cf-please-wait">
<div id="spinner">
<div id="cf-bubbles">
<div class="bubbles"></div>
<div class="bubbles"></div>
<div class="bubbles"></div>
</div>
</div>
<p data-translate="please_wait" id="cf-spinner-please-wait">Please stand by, while we are checking your browser...</p>
<p data-translate="redirecting" id="cf-spinner-redirecting" style="display:none">Redirecting...</p>
</div>
<input name="r" type="hidden" value="86274ebf891ca5903cedef6f5476291f7a3f2375-1624724292-0-AYyBXOiOkLO6sEJbZAhxnborgsqm+9Myz1E+TgVNFE0OKQcJs14P/RNNa9jSf5uTx9Eo4AxksOkzMWys+5Roo/xz2LZWFQybup/QSTYAEX6Oz5WVB05OtClBu7NY+EMGVabeM1OM2Q3cc1qgrnOH4h4UWw/tFTEYmY0tOXDYpe93zmxREYOxBU/vxCsLtda3YAAodT9qhQyO7oiTEgWMNC595Rjao12av3f+TtLrX41QyH/qiSfKJYRQf616Yvk7IEzTwc/n8ZvMc8wnGm5j+9lM0bzc6kRGoCfHVj1r0eJJxEV9aF15A+pKYuIzupkw/QOT8rUZE6UtL3yGB9UYVwqmcvtvIIO4ILPVnQV8fXxnvnXpvCVXKr0PgxPF4p8Drl1Vb95PdVn7ZvQ0jr6xGiqhbPFu2//9mwUnSjBQt8SXfei/Zq/Z0uL0TD/513/bBF1Jp/QojGEJjVGfs1Wo4L+usEpn9O0Z6gWaZXPfQgqTwiO9uboO+Z9V8pdeK3egZMneaXfCjhwgrNzmTilR90jQlGbsMOXhUokOQaxqhJ/khmBgnu5UfJ3OFxG5e2zQylxXkK88T38DE7DysMBuXE3wv10Pf4Dl6lEPMYbXqUB1Vp/hT+ShzNvaG1wpRQD6XA1WIzKNVINAg9QIffi30ojuKxldogRE0rpTAzZfgzRN8kiXFsxwQfMfTYMvdtoJsEbBP6CrvsNNNOmzN9exuM5WSbj+UXSa4/ZSlkHp0SVEJZOccYYYT6C9kAV0srfDmysEkDpfYQcap8AhFh7Ub8pYA9CedTD31+ghxXqlBphJj6zAQQfJyawwyFv4dwctjYJxduR3p6yG/7fyhTh8/B7U47sR30cP2mRA2sUMRLAYrLp2bd8yiz+jxsuoD6JxikOSYLTOl9e862isXFOg4RSspNB1RqCtb/154pnoP3bRghEl6vTSpj6dSH8GUxBjSQPWxbZuSwSMGTHHPxevAZFSpct5SNv05aU6rFvwPcna2h6UwjqcZOenMDr53xh2NVjHVWIcsDMXDReVncZyb28PIBqmBOdx4Fui/4KNXJuM1kuk5SMnN9zc1H9ZKgJnpXKIuvGFqS+Ifb56RCH/XWaoVPOG6tMwbitv8I0BkDvCSWBFXIgHw2tDuS3i/CxrELTCK6QURZZdFgZQKITtC/FvsxnDvPmnaON0dxzhJufdiBGiRCJAsLNJUkE1RfJCg8pApT+REE6IvrKb1r/1DIjETBWN0ntGE/J8fzZXOXaJDmmX2ZxWfBQCJ66RisEmJTzwRU9ModQMXfUeXxYx30IZN4H1BML6G/qzRiwN4mgO1aTeG0ic1Pv8NGdLlWP66gxvlxVTdNTuo1GTR8zyBs0AIwP3ohZrUH2KBH/r/NCVmKbxW7jswpl2kK9dzcPQi48TgeOyV8080BDzWkDOZj1agmycaGFobNAMdFhZhCSfYg+6+Y6rHba2CXKi2IioAGLh9/iMOvTGlMRZsw/dSd//ihW+otU033+sCxNjv/xK7RyAicZMk1MVDCDYbaEwzy6MAluXTpSSto5MHUDBWb+qwDlQYVqbkU5TO5ivbbBWpq1+8YFeq5zcrBfU9r+8ttj3qR8MpLcIAF18q9Ll1rE02opU/J6cCMNoFRmBecQZLmcSFoDWmS5n0nca51/KYQdJJDEpq63RKfc7KrizwX0lHfM+vwW4P3zYlGjXRdCjrNIf9Oae8nZpcB6itAjhzeu4n+gx24EHQmdNeg8AJ3B519bjqCA+aYooSfUzgUrNF+YDQBbI7Nq/sErOM6RanUuFaoMS3jnNCS4tP3TWdjJraHEY53wpBg+oqXpsJzdhfesM/KjNpxBxX9OT9v4vXyi8xzDPJB0EiZ8I6OihO7odnVW+gUdFLr7aMCtPsx5LbTFwLvE8E
STtdCfXWKSGB0GmQdJe5KmGsGQ1pxQEiVW0KUw+PCzvudBYngQF4N+UQcthlmt2pRt73ULhy2abmRa+JHOLWdvOgOQASxb8DW1k/htFRdj6FFmLygYJx+NBMD0kcQO32768SuU1S18wh2Vi3b3LXZrjpc/tPfbvADi2BVyMiEfs3cKtLwZjK2mrEONy5xq1BAL0UzLJCZCpVDc8IoxIQ3LpTxOAoQ9sw92LQdfvq/CyhMF8sAhMxQamvsWklrv5seJlNWvoNlvgfeaNxI/ugceoW9IwiZCb26d5ySpiySIgANeZwV//k5eGECYr8gLB37o+dGblgHjr+onK4UG2nHLAkIbhXBI1ZAlfE4f6YyruB2Z/35lxayZkRE/YYXJrYtpYJRU/ssl7S0VGY8SPh7aRdx8N9sw+F3XKQ63Y2pxO1KAm/Xf1CElhz86alEXlAdA24LZRz8cVcuHvk9mKM2j/YmUlYX+1uF2Zul+101PVpuvCypZtAa0nhlGTiB+st00ohFe6HmhK6d2T4UWISX6JiubywIJ0oLEF4hzecd1hB0/2Vdpl5Z9y/jhuOxPWceYGhriP3JYP9cS+MFbC36wOkF7hYpsdg9NEgFIDLFxzSYeEFkPeIuE13M1hwZHjjW8Zf6REdiPnQrDZHAKRDldWwzwBrs36guuJ4AiNju+Mx8Lr8wB6Krcd1+HriLm4uUFVM2DLeuusRkrSojUWkdWc2dpBrkLZ0tQw7wa6ZXVRt1nsWr5/ApEuzcC1+BaCGNdl1UzNd3NnGlPDYtYPFNPsuyUJIWjUTcB0rk/CfFP6JLoROVSP2l3WFVbktqw3m+mcwa6bw7Aew7YU4N/O2yJ8ab8a13/tV01Dfi61AIdKB2APWbVNRGxinXj++7fTKmqLB4B7usJC9EYqYbqq7ntAnjV9b1jI3iut/E6qDPZ58j9021JY4k2dfKY3Ry6GbIPAhvd/aKcN5Y6x79KItMsijXvAhBSILkbOwGQXccjo7lEIeh8Z1M+e3X0j2B811qcNjCvJDeMYb57+7jVkCzuL3ICADL1IjHGftYzjBPhPwl2UiZ4qD7tqc7Q2/Ol7BgYsIuddbNV72tof1/akffCEltbezCynu7P0hoDFCjDJAPmv4hGLFZZrLCU69jxLGYKU/ol6l8EEQA=="/>
<input name="cf_captcha_kind" type="hidden" value="h"/>
<input name="vc" type="hidden" value="cb7d9f733e82b2a322f24468dd51d0a0"/>
<noscript class="cf-captcha-info" id="cf-captcha-bookmark">
<h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
</noscript>
<div class="cookie-warning" data-translate="turn_on_cookies" id="no-cookie-warning" style="display:none">
<p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
</div>
<script type="text/javascript">
//<![CDATA[
var a = function() {try{return !!window.addEventListener} catch(e) {return !1} },
b = function(b, c) {a() ? document.addEventListener("DOMContentLoaded", b, c) : document.attachEvent("onreadystatechange", b)};
b(function(){
var cookiesEnabled=(navigator.cookieEnabled)? true : false;
if(!cookiesEnabled){
var q = document.getElementById('no-cookie-warning');q.style.display = 'block';
}
});
//]]>
</script>
<div id="trk_captcha_js" style="background-image:url('/cdn-cgi/images/trace/captcha/nojs/h/transparent.gif?ray=6657c0090c70ecee')"></div>
</form>
<script type="text/javascript">
//<![CDATA[
(function(){
var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
var trkjs = isIE ? new Image() : document.createElement('img');
trkjs.setAttribute("src", "/cdn-cgi/images/trace/captcha/js/transparent.gif?ray=6657c0090c70ecee");
trkjs.id = "trk_captcha_js";
trkjs.setAttribute("alt", "");
document.body.appendChild(trkjs);
var cpo=document.createElement('script');
cpo.type='text/javascript';
cpo.src="/cdn-cgi/challenge-platform/h/g/orchestrate/captcha/v1?ray=6657c0090c70ecee";
document.getElementsByTagName('head')[0].appendChild(cpo);
}());
//]]>
</script>
</div>
</div>
<div class="cf-column">
<div class="cf-screenshot-container">
<span class="cf-no-screenshot"></span>
</div>
</div>
</div>
</div>
</div>
<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="why_captcha_headline">Why do I have to complete a CAPTCHA?</h2>
<p data-translate="why_captcha_detail">Completing the CAPTCHA proves you are a human and gives you temporary access to the web property.</p>
</div>
<div class="cf-column">
<h2 data-translate="resolve_captcha_headline">What can I do to prevent this in the future?</h2>
<p data-translate="resolve_captcha_antivirus">If you are on a personal connection, like at home, you can run an anti-virus scan on your device to make sure it is not infected with malware.</p>
<p data-translate="resolve_captcha_network">If you are at an office or shared network, you can ask the network administrator to run a scan across the network looking for misconfigured or infected devices.</p>
<p data-translate="resolve_captcha_privacy_pass"> Another way to prevent getting this page in the future is to use Privacy Pass. You may need to download version 2.0 now from the Chrome Web Store.</p>
</div>
</div>
</div>
<div class="cf-error-footer cf-wrapper w-240 lg:w-full py-10 sm:py-4 sm:px-8 mx-auto text-center sm:text-left border-solid border-0 border-t border-gray-300">
<p class="text-13">
<span class="cf-footer-item sm:block sm:mb-1">Cloudflare Ray ID: <strong class="font-semibold">6657c0090c70ecee</strong></span>
<span class="cf-footer-separator sm:hidden">•</span>
<span class="cf-footer-item sm:block sm:mb-1"><span>Your IP</span>: 2607:fa49:3801:a800:6901:b6b5:6c3a:ec5</span>
<span class="cf-footer-separator sm:hidden">•</span>
<span class="cf-footer-item sm:block sm:mb-1"><span>Performance & security by</span> Cloudflare</span>
</p>
</div><!-- /.error-footer -->
</div>
</div>
<script type="text/javascript">
window._cf_translation = {};
</script>
</body>
</html>
My guess is that I'm doing something wrong when extracting the cookies as I am able to access pretty much any part of Metro's website using requests, but I'm pretty new to this so I'm not entirely sure. Any help would be much appreciated!
The website uses Cloudflare, which blocks requests that arrive without browser interaction. When you send a request without a real browser (no JavaScript), Cloudflare serves a CAPTCHA to check whether you are a bot. You can use Selenium to scrape the information from the website.
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
link = 'https://www.metro.ca/en'
chrome_driver = 'C:/Users/XXXX/Documents/chrome_driver/chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver)
driver.implicitly_wait(10)
driver.get(link)
cookie = [f"{c['name']}={c['value']};" for c in driver.get_cookies()]
cookie = ' '.join([elem for elem in cookie])
search = driver.find_element_by_css_selector('#header--search--input')
search.send_keys("chicken")
submitButton = driver.find_element_by_css_selector("#header--search--button")
submitButton.click()
driver.implicitly_wait(10)
content = BeautifulSoup(driver.page_source, 'html.parser')
print(content)
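The Cookie string assembled from driver.get_cookies() above can also be built by a small helper, which makes it easier to reuse in the requests variant. This is only a sketch: build_cookie_header and sample_cookies are hypothetical names, and the sample data merely stands in for what driver.get_cookies() would return.

```python
def build_cookie_header(cookies):
    """Join Selenium-style cookie dicts into a single Cookie header value."""
    # Each cookie is assumed to be a dict with at least 'name' and 'value',
    # which is the shape driver.get_cookies() returns.
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Hypothetical sample data standing in for driver.get_cookies()
sample_cookies = [
    {"name": "session", "value": "abc123"},
    {"name": "lang", "value": "en"},
]
cookie_header = build_cookie_header(sample_cookies)
print(cookie_header)  # session=abc123; lang=en
```

With a live driver the call would be build_cookie_header(driver.get_cookies()), and the result can be passed as the 'Cookie' header value.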
Using requests
from selenium import webdriver
from bs4 import BeautifulSoup
import requests
link = 'https://www.metro.ca/en'
chrome_driver = 'C:/Users/XXXX/Documents/chrome_driver/chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver)
driver.implicitly_wait(10)
driver.get(link)
cookie = [f"{c['name']}={c['value']};" for c in driver.get_cookies()]
cookie = ' '.join([elem for elem in cookie])
def using_request():
    header = {
        'Host': 'www.metro.ca',
        'Connection': 'close',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'Client-Version': 'web version 2.0',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36',
        'Origin': 'https://www.metro.ca/en',
        'Referer': 'https://www.metro.ca/en',
        'Accept-Encoding': 'gzip, deflate',
        'Cookie': f"{cookie}"
    }
    search_item = "chicken"
    base_url = f"https://www.metro.ca/en/search?filter={search_item}&freeText=true"
    page = requests.get(base_url, headers=header)
    content = BeautifulSoup(page.text, 'html.parser')
    print(content)
using_request()
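Before parsing a response, it may help to check whether the Cloudflare challenge page came back instead of the real content. The sketch below is a rough heuristic, not an official detection method: the marker strings are taken from the challenge page shown in the question's output, and looks_like_cloudflare_challenge is a hypothetical helper name.

```python
# Markers copied from the challenge page in the question's output;
# treating them as reliable indicators is an assumption.
CHALLENGE_MARKERS = (
    "cf-captcha",
    "Cloudflare Ray ID",
    "cdn-cgi/challenge-platform",
)


def looks_like_cloudflare_challenge(html):
    """Return True if the HTML contains any known challenge-page marker."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


challenge_html = '<noscript class="cf-captcha-info" id="cf-captcha-bookmark"></noscript>'
normal_html = "<html><body><h1>Search results</h1></body></html>"
print(looks_like_cloudflare_challenge(challenge_html))  # True
print(looks_like_cloudflare_challenge(normal_html))     # False
```

Calling it on page.text (or driver.page_source) before handing the HTML to BeautifulSoup would show early that the cookies were not accepted.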
I'm trying to navigate a website with Selenium
I searched Google and found advice that adding a user-agent would solve it, but it didn't.
http://coupang.com/
from selenium.webdriver.chrome.options import Options
from selenium import webdriver
import time
options = Options()
options = webdriver.ChromeOptions()
# options.add_argument('headless')
options.add_argument('window-size=1920x1080')
options.add_argument('lang=ko_KR')
options.add_argument("--disable-gpu")
options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5")
options.add_argument("accept=text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8")
options.add_argument("accept-charset=cp1254,ISO-8859-9,utf-8;q=0.7,*;q=0.3")
options.add_argument("accept-encoding=gzip,deflate,sdch")
options.add_argument("accept-language=tr,tr-TR,en-US,en;q=0.8")
driver = webdriver.Chrome('d:/temp/chromedriver.exe',options=options)
TEST_URL = 'https://login.coupang.com/login/login.pang?rtnUrl=https%3A%2F%2Fwww.coupang.com%2Fnp%2Fpost%2Flogin%3Fr%3Dhttps%253A%252F%252Fwww.coupang.com%252F'
driver.get(TEST_URL)
time.sleep(5)
driver.implicitly_wait(3)
elem_login = driver.find_element_by_id("login-email-input")
elem_login.clear()
elem_login.send_keys("id")
time.sleep(3)
elem_login = driver.find_element_by_id("login-password-input")
elem_login.clear()
elem_login.send_keys("pw")
time.sleep(3)
xpath = "/html/body/div[1]/div/div/form/div[5]/button"
driver.find_element_by_xpath(xpath).click()
driver.implicitly_wait(3)
print(driver.page_source)
Can you try adding headers like this and tell me if it works?
headers = {
"user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
"accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"accept-charset": "cp1254,ISO-8859-9,utf-8;q=0.7,*;q=0.3",
"accept-encoding": "gzip,deflate,sdch",
"accept-language": "tr,tr-TR,en-US,en;q=0.8",
}
It isn't clear under which circumstances you are facing Access Denied. However, I was able to access the webpage http://coupang.com/ as follows:
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
# Pass both switches in one call; a second call with the same key would overwrite the first
options.add_experimental_option("excludeSwitches", ["enable-logging", "enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
options.add_argument('lang=ko_KR')
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get('https://www.coupang.com/')
print(driver.page_source)
Console Output:
<!--[if lte IE 9]>
<div id="browserSupportWrap">
<div class="bs-wrap">
<p class="bs-message">고객님의 브라우저에서는 쿠팡이 정상 동작하지 않습니다.<br />
인터넷 익스플로러 업데이트, 크롬 또는 파이어폭스 브라우저를 설치하세요.</p>
<ul class="bs-browser-download">
<li class="ie">인터넷 익스플로러<br /> <em>업데이트하기</em></li>
<li class="chrome">크롬<br /> <em>설치하기</em></li>
<li class="firefox">파이어폭스<br /> <em> 설치하기</em></li>
</ul>
</div>
</div>
<![endif]-->
<div id="container" class="renewal home srp-sync srp-sync-brand">
.
</script>
<!-- Facebook Pixel Code -->
<script>
!function(f,b,e,v,n,t,s){if(f.fbq)return;n=f.fbq=function(){n.callMethod?n.callMethod.apply(n,arguments):n.queue.push(arguments)};if(!f._fbq)f._fbq=n;
n.push=n;n.loaded=!0;n.version='2.0';n.queue=[];t=b.createElement(e);t.async=!0;
t.src=v;s=b.getElementsByTagName(e)[0];s.parentNode.insertBefore(t,s)}(window,
document,'script','https://connect.facebook.net/en_US/fbevents.js');
fbq('init', '652323801535981');
fbq('track', 'PageView');
</script>
<noscript><img height="1" width="1" style="display:none" src="https://www.facebook.com/tr?id=652323801535981&ev=PageView&noscript=1"/></noscript>
<!-- End Facebook Pixel Code -->
<script type="text/javascript" src="//asset2.coupangcdn.com/customjs/criteo/5.6.1/ld.min.js" async="true"></script>
<noscript><img src="https://www.coupang.com/akam/11/pixel_3401c526?a=dD1kMDI3YTFiY2NmYTZiMDg3ZDE3ZWRkNzc3MDI5ZDhhNzNiYzM4ZDkxJmpzPW9mZg==" style="visibility: hidden; position: absolute; left: -999px; top: -999px;" /></noscript>
<iframe height="0" width="0" title="Criteo DIS iframe" style="display: none;"></iframe></body></html>
I am working on an Intranet with nested frames, and am unable to access a child frame.
The HTML source:
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>VIS</title>
<link rel="shortcut icon" href="https://bbbbb/ma1/imagenes/iconos/favicon.ico">
</head>
<frameset rows="51,*" frameborder="no" scrolling="no" border="0">
<frame id="cabecera" name="cabecera" src="./blablabla.html" scrolling="no" border="3">
<frameset id="frame2" name="frame2" cols="180,*,0" frameborder="no" border="1">
<frame id="menu" name="menu" src="./blablabla_files/Menu.html" marginwidth="5" scrolling="auto" frameborder="3">
Buscar
<frame id="contenido" name="contenido" src="./blablabla_files/saved_resource.html" marginwidth="5" marginheight="5">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=windows-1252">
<title>BUSCAr</title>
</head>
<frameset name="principal" rows="220,*" frameborder="NO">
<frame name="Formulario" src="./BusquedaSimple.html" scrolling="AUTO" noresize="noresize">
<input id="year" name="year" size="4" maxlength="4" value="" onchange="javascript:Orden();" onfocus="this.value='2018';this.select();" type="text">
<frame name="Busqueda" src="./saved_resource(2).html" scrolling="AUTO">
</frameset>
<noframes>
<body>
<p>soporte a tramas.</p>
</body>
</noframes>
</html>
<frame name="frameblank" marginwidth="0" scrolling="no" src="./blablabla_files/saved_resource(1).html">
</frameset>
<noframes>
<P>Para ver esta página.</P>
</noframes>
</frameset>
</html>
I locate the button "Buscar" inside of frame "menu" with:
driver.switch_to_default_content()
driver.switch_to_frame(driver.find_element_by_css_selector("html frameset frameset#frame2 frame#menu"))
btn_buscar = driver.find_element_by_css_selector("#div_menu > table:nth-child(10) > tbody > tr > td:nth-child(2) > span > a")
btn_buscar.click()
I've tried this code to locate the input id="year" inside frame="Formulario":
driver.switch_to_default_content()
try: driver.switch_to_frame(driver.switch_to_frame(driver.find_element_by_css_selector("html frameset frameset#frame2 frame#contenido frameset#principal frame#Formulario")))
print("Ok cabecera -> contenido")
except:
print("cabecera not found")
or
driver.switch_to_frame(driver.switch_to_xpath("//*[#id='year"]"))
but they don't work.
Can you help me?
Thanks!
To reach the required frame you need to switch successively through all of its ancestor frames:
driver.switch_to.frame("cabecera")
driver.switch_to.frame("menu")
btn_buscar = driver.find_element_by_link_text("Buscar")
btn_buscar.click()
Also note that a WebDriver instance has no switch_to_xpath() method, and that switch_to_frame() and switch_to_default_content() are deprecated, so you'd better use switch_to.frame() and switch_to.default_content() instead.
Assuming your program has focus on the top-level browsing context, to locate and click the button with the text Buscar you need to switch through all the parent frames, using WebDriverWait together with the appropriate expected_conditions. You can use the following code block:
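from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "cabecera")))
WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it((By.ID, "menu")))
WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.LINK_TEXT, "Buscar"))).click()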
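For the input with id="year", the same idea of walking down the frame hierarchy can be wrapped in a small helper. This is a sketch under assumptions: switch_to_frame_path is a hypothetical name, and the path "contenido" then "Formulario" is read from the HTML posted in the question, so it may differ on the live page.

```python
def switch_to_frame_path(driver, frame_path):
    """Reset to the top-level document, then descend through each frame in order."""
    driver.switch_to.default_content()
    for frame in frame_path:
        # switch_to.frame() accepts a frame's id or name (or an index / WebElement)
        driver.switch_to.frame(frame)


# Usage with a real Selenium driver (not executed here):
#   from selenium.webdriver.common.by import By
#   switch_to_frame_path(driver, ["contenido", "Formulario"])
#   year_input = driver.find_element(By.ID, "year")
```

Starting from default_content() each time keeps the helper independent of whatever frame the driver was focused on before the call.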
I have been using the Selenium WebDriver with Python in an attempt to log in to this website (login page: https://app.chatra.io/).
To do this I did the following in python:
from selenium import webdriver
import bs4 as bs
driver = webdriver.Chrome()
driver.get('https://app.chatra.io/')
I then go on to make an attempt at parsing using Beautiful Soup:
html = driver.execute_script('return document.documentElement.outerHTML')
soup = bs.BeautifulSoup(html, 'html.parser')
print(soup.prettify)
The main issue is that the page never fully loads. When I load the page in a browser on my own, all is fine. However when the selenium webdriver tries to load it, it just seemingly stops halfway.
Any idea why? Any ideas on how to fix it or where to look to learn?
First of all, the issue is also reproducible for me in the latest Chrome (with chromedriver 2.34 - also currently latest) - not yet sure what's happening at the moment. Workaround: Firefox worked for me perfectly.
I would also add an extra step between driver.get() and the HTML parsing: an explicit wait that lets the page load until the desired condition is true:
import bs4 as bs
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()
driver.get('https://app.chatra.io/')
wait = WebDriverWait(driver, 10)
wait.until(EC.visibility_of_element_located((By.ID, "signin-email")))
html = driver.execute_script('return document.documentElement.outerHTML')
soup = bs.BeautifulSoup(html, 'html.parser')
print(soup.prettify())
Note that you also needed to call prettify() - it's a method.
There are several aspects to the issue you are facing:
Since you want to parse the page with BeautifulSoup, you might try urlopen from urllib.request first, but the error says it all:
urllib.error.HTTPError: HTTP Error 403: Forbidden
This means urllib.request is being detected and HTTP Error 403: Forbidden is raised, so using webdriver from Selenium makes sense.
Next, when you use ChromeDriver and Chrome, the website initially opens and renders. But soon ChromeDriver is detected as a WebDriver, and the <head> and <body> tags are no longer available to it. You see only this minimal markup:
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" class="supports cssfilters flexwrap chrome webkit win hover web"></html>
Finally, when you use GeckoDriver and Firefox Quantum, the website opens and renders properly, as follows:
Code Block :
from selenium import webdriver
from bs4 import BeautifulSoup as soup
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://app.chatra.io/')
html = driver.execute_script('return document.documentElement.outerHTML')
pagesoup = soup(html, "html.parser")
print(pagesoup)
Console Output :
<html class="supports cssfilters flexwrap firefox gecko win hover web"><head>
<link class="" href="https://app.chatra.io/b281cc6b75916e26b334b5a05913e3eb18fd3a4d.css?meteor_css_resource=true&_g_app_v_=51" rel="stylesheet" type="text/css"/>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, viewport-fit=cover" name="viewport"/>
.
.
.
<em>··· Chatra</em>
.
.
.
</div></body></html>
Adding prettify to the soup extraction :
Code Block :
from selenium import webdriver
from bs4 import BeautifulSoup as soup
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://app.chatra.io/')
html = driver.execute_script('return document.documentElement.outerHTML')
pagesoup = soup(html, "html.parser")
print(pagesoup.prettify)
Console Output :
<bound method Tag.prettify of <html class="supports cssfilters flexwrap firefox gecko win hover web"><head>
<link class="" href="https://app.chatra.io/b281cc6b75916e26b334b5a05913e3eb18fd3a4d.css?meteor_css_resource=true&_g_app_v_=51" rel="stylesheet" type="text/css"/>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, viewport-fit=cover" name="viewport"/>
.
.
.
<em>··· Chatra</em>
.
.
.
</div></body></html>>
You can even use Selenium's page_source property as follows :
Code Block :
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://app.chatra.io/')
print(driver.page_source)
Console Output :
<html class="supports cssfilters flexwrap firefox gecko win hover web">
<head>
<link rel="stylesheet" type="text/css" class="" href="https://app.chatra.io/b281cc6b75916e26b334b5a05913e3eb18fd3a4d.css?meteor_css_resource=true&_g_app_v_=51">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1, minimum-scale=1, user-scalable=no, viewport-fit=cover">
<!-- platform specific stuff -->
<meta name="msapplication-tap-highlight" content="no">
<meta name="apple-mobile-web-app-capable" content="yes">
<!-- favicon -->
<link rel="shortcut icon" href="/static/favicon.ico">
<!-- win8 tile -->
<meta name="msapplication-TileImage" content="/static/win-tile.png">
<meta name="msapplication-TileColor" content="#ffffff">
<meta name="application-name" content="Chatra">
<!-- apple touch icon -->
<!--<link rel="apple-touch-icon" sizes="256x256" href="/static/?????.png">-->
<title>··· Chatra</title>
<style>
body {
background: #f6f5f7
}
</style>
<style type="text/css"></style>
</head>
<body>
<script async="" src="https://www.google-analytics.com/analytics.js"></script>
<script type="text/javascript" src="/meteor_runtime_config.js"></script>
<script type="text/javascript" src="https://app.chatra.io/9153feecdc706adbf2c71253473a6aa62c803e45.js?meteor_js_resource=true&_g_app_v_=51"></script>
<div class="body body-layout">
<div class="body-layout__main main-layout">
<aside class="main-layout__left-sidebar">
<div class="left-sidebar-layout">
</div>
</aside>
<div class="main-layout__content">
<div class="content-layout">
<main class="content-layout__main is-no-fades js-popover-boundry js-main">
<div class="center loading loading--light">
<div class="content-padding nothing">
<em>··· Chatra</em>
</div>
</div>
</main>
</div>
</div>
</div>
</div>
</body>
</html>