python requests problem: cloudflare error message "enable cookies" - python

I was planning on creating a basic web scraper for the site Sneakersnstuff.com, but my efforts were stopped early by an error. When I request the URL https://www.sneakersnstuff.com/, instead of receiving the HTML of the website, or even the entrance captcha, I get a Cloudflare page with the error message "enable cookies". Both my code and the response are shown below.
import requests
import cfscrape
session = requests.session()
response = session.get('https://www.sneakersnstuff.com/')
print(response.text)  # the response body; this is the HTML shown below
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-US">
<!--<![endif]-->
<head>
<title>Access denied | www.sneakersnstuff.com used Cloudflare to restrict access</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css"
media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">
body {
margin: 0;
padding: 0
}
</style>
<!--[if gte IE 10]><!-->
<script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script>
<!--<![endif]-->
<!--[if gte IE 10]><!-->
<script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script>
<!--<![endif]-->
</head>
<body>
<div id="cf-wrapper">
<div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please
enable cookies.</div>
<div id="cf-error-details" class="cf-error-details-wrapper">
<div class="cf-wrapper cf-header cf-error-overview">
<h1>
<span class="cf-error-type" data-translate="error">Error</span>
<span class="cf-error-code">1020</span>
<small class="heading-ray-id">Ray ID: 578133293d83e0d6 • 2020-03-22 16:13:25 UTC</small>
</h1>
<h2 class="cf-subheadline">Access denied</h2>
</div><!-- /.header -->
<section></section><!-- spacer -->
<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="what_happened">What happened?</h2>
<p>This website is using a security service to protect itself from online attacks.</p>
</div>
</div>
</div><!-- /.section -->
<div class="cf-error-footer cf-wrapper">
<p>
<span class="cf-footer-item">Cloudflare Ray ID: <strong>578133293d83e0d6</strong></span>
<span class="cf-footer-separator">•</span>
<span class="cf-footer-item"><span>Your IP</span>: 96.241.108.243</span>
<span class="cf-footer-separator">•</span>
<span class="cf-footer-item"><span>Performance & security by</span> <a
href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link"
target="_blank">Cloudflare</a></span>
</p>
</div><!-- /.error-footer -->
</div><!-- /#cf-error-details -->
</div><!-- /#cf-wrapper -->
<script type="text/javascript">
window._cf_translation = {};
</script>
</body>
</html>
I have attempted using a library recommended by many, called cfscrape, to no avail.

Adding Browser/User-Agent Filtering to cloudscraper did the trick for me.
import cloudscraper
from bs4 import BeautifulSoup
# Adding Browser / User-Agent Filtering should help ie.
# will give you only desktop firefox User-Agents on Windows
scraper = cloudscraper.create_scraper(browser={'browser': 'firefox','platform': 'windows','mobile': False})
html = scraper.get("https://www.sneakersnstuff.com/").content
soup = BeautifulSoup(html, 'html.parser')
print(soup)

import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper()
html = scraper.get("https://www.sneakersnstuff.com/").content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Output:
cloudscraper.exceptions.CloudflareReCaptchaProvider: Cloudflare reCaptcha detected, unfortunately you haven't loaded an anti reCaptcha provider correctly via the 'recaptcha' parameter.
Next Step ?
3rd Party reCaptcha Solvers
Description
cloudscraper currently supports the following 3rd party reCaptcha solvers, should you require them.
anticaptcha
deathbycaptcha
2captcha
9kw
return_response

Related

Flask python - POST not working 400 bad request

I cannot access the text box data in the back end. I want to use the text box value in the back end, but it says 400 Bad Request. Please help me with this; I cannot see where my code went wrong.
python
from flask import Flask, render_template, url_for, request, redirect
from flask_sqlalchemy import SQLAlchemy
from datetime import datetime
import tweepy
app = Flask(__name__)
app.config['SQLALCHEMY_DATABASE_URI'] = 'sqlite:///test.db'
db = SQLAlchemy(app)
@app.route('/')
def index():
    return render_template('home.html')
@app.route('/getlivedata', methods=['POST', 'GET'])
def stream():
    if request.method == "POST":
        rows = request.form['numtweets']
    else:
        return render_template('home.html')
if __name__ == "__main__":
    app.run(debug=True)
html
<!DOCTYPE html>
<html lang="en" >
<head>
<meta charset="UTF-8">
<title>CodePen - Navigation PageDesign/Lesson</title>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<meta name="description" content="sara mazal lessons">
<meta name="keywords" content="HTML, CSS, JavaScript, mazal, icons">
<meta name="author" content="Sara Mazal">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="preconnect" href="https://fonts.gstatic.com" crossorigin>
<link href="https://fonts.googleapis.com/css2?family=Josefin+Sans:wght@200;300;400;500&family=Raleway:wght@100;200;300;400;500&family=Roboto:wght@300;400;700&display=swap" rel="stylesheet">
<link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.8.2/css/all.css" integrity="sha384-oS3vJWv+0UjzBfQzYUhtDYW+Pj2yciDJxpsK1OYPAYjqT085Qq/1cq5FLXAZQ7Ay" crossorigin="anonymous" />
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet">
<script src="https://cdnjs.cloudflare.com/ajax/libs/particlesjs/2.2.3/particles.min.js"></script><link rel="stylesheet" href="https://cdnjs.cloudflare.com/ajax/libs/normalize/5.0.0/normalize.min.css">
<link rel="stylesheet" href="{{ url_for('static',filename='css/style.css') }}">
</head>
<body>
<!-- partial:index.partial.html -->
<section class="nav">
<h1>LIVE TWITTER DATA ANALYSIS</h1>
<h3 class="span loader">
<span class="m">S</spam><span class="m">E</spam><span class="m">N</spam><span class="m">T</spam><span class="m">I</spam><span class="m">M</spam><span class="m">E</spam><span class="m">N</spam><span class="m">T</spam><span class="m">A</spam><span class="m">L</spam><span class="m"> </span><span class="m">A</spam><span class="m">N</spam><span class="m">D</spam><span class="m"> </span><span class="m">C</spam><span class="m">A</spam><span class="m">T</spam><span class="m">E</spam><span class="m">G</spam><span class="m">O</spam><span class="m">R</spam><span class="m">I</spam><span class="m">C</spam><span class="m">A</spam><span class="m">L</spam> </h3>
<div class="nav-container"><a class="nav-tab" href="#tab-pwa">PWA</a><a class="nav-tab" href="#tab-graphql">GraphQL</a><a class="nav-tab" href="#tab-next">NEXT</a><a class="nav-tab" href="#tab-typescript">TYPESCRIPT</a><a class="nav-tab" href="#tab-deno">DENO</a><span class="nav-tab-slider"></span></div>
</section>
<form action="#" method="post">
<main class="main">
<section class="slider" id="tab-pwa">
<h1>PWA</h1>
<input type="text" name="numtweets">
<h3>GetLiveData</h3>
<h2>the best of both worlds...</h2>
</section>
<section class="slider" id="tab-graphql">
<h1>GraphQL</h1>
<h2>a query language for APIs</h2>
</section>
<section class="slider" id="tab-next">
<h1>NEXT</h1>
<h2>framework for Production</h2>
</section>
<section class="slider" id="tab-typescript">
<h1>TYPESCRIPT</h1>
<h2>giving you better tooling at any scale</h2>
</section>
<section class="slider" id="tab-deno">
<h1>DENO</h1>
<h2>a modern runtime</h2>
</section>
</main>
</form>
<canvas class="background"></canvas>
<!-- partial -->
<script src='https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.0/jquery.min.js'></script><script src="{{ url_for('static',filename='js/script.js') }}"></script>
</body>
</html>
Change your form action to "/getlivedata" so the data is posted there. Also, it seems that you are using the "GetLiveData" heading as the way to go to the URL. Use a submit button instead.
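As a rough illustration of that advice, here is a minimal sketch, assuming the form is served from an inline template instead of home.html. The `HOME_TEMPLATE` string and the response text are hypothetical stand-ins, not the asker's actual files; only the `/getlivedata` route and the `numtweets` field name come from the question.

```python
from flask import Flask, render_template_string, request
from markupsafe import escape

app = Flask(__name__)

# Hypothetical stand-in for home.html: the form posts to /getlivedata
# and uses a real submit button instead of a heading.
HOME_TEMPLATE = """
<form action="/getlivedata" method="post">
  <input type="text" name="numtweets">
  <button type="submit">Get live data</button>
</form>
"""


@app.route("/")
def index():
    return render_template_string(HOME_TEMPLATE)


@app.route("/getlivedata", methods=["GET", "POST"])
def stream():
    if request.method == "POST":
        # request.form["numtweets"] answers with 400 Bad Request when the
        # field is missing; .get() lets the view decide what to do instead.
        rows = request.form.get("numtweets", "")
        return f"Requested {escape(rows)} tweets"
    return render_template_string(HOME_TEMPLATE)
```

Note that the POST branch returns a response; a view that falls through without returning anything raises an error in Flask.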

Download PDF with chrome plugin in python selenium

I'm trying to extract a PDF from this site, which uses the native Google Chrome PDF viewer to open the PDF in the first place; its content type is application/pdf. The issue is that the site URLs I get aren't actually links to the PDF but rather to a .zul page where the JS loads or fetches the PDF.
Here's my download code below:
from selenium import webdriver
def download_pdf(url, idx, save_dir):
    options = webdriver.ChromeOptions()
    # Try to disable the built-in PDF viewer and set the download directory
    profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
               "download.default_directory": save_dir}
    options.add_experimental_option("prefs", profile)
    driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)
    driver.get(url)
The problem I'm encountering with the above code is that I get the following output from driver.page_source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="-1" />
<title>Document Viewer</title>
<link rel="stylesheet" type="text/css" href="/eSMARTContracts/zkau/web/9776a7f0/zul/css/zk.wcs;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1"/>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zk.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zul.lang.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<!-- ZK 6.0.2 EE 2012072410 -->
</head>
<body>
<div id="j4AP_" class="z-temp"></div>
<script class="z-runonce" type="text/javascript">zk.pi=1;zkmx(
[0,'j4AP_',{dt:'z_2m1',cu:'/eSMARTContracts;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',uu:'/eSMARTContracts/zkau;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',ru:'/service/dpsweb/ViewDPSWeb.zul'},[
['zul.wnd.Window','j4AP0',{$$onSize:false,$$onMaximize:false,$$onOpen:false,$$onMinimize:false,$$onZIndex:false,$onClose:true,$$onMove:false,width:'100%',height:'100%',prolog:'\
'},[]]]]);
</script>
<noscript>
<div class="noscript"><p>Sorry, JavaScript must be enabled.<br/>Change your browser options, then try again.</p></div>
</noscript>
</body>
</html>
EDIT: Included the link

how to copy all the code of a URL with python

I want to copy all the code of an URL (http://modelseed.org/biochem/reactions/rxn00001) using Python 3.6, but I can only copy part of the code, and I don't know why.
So far, I tried with "requests" module
import requests
page = requests.get("http://modelseed.org/biochem/reactions/rxn00001")
print(page.content)
and "urllib"
import urllib.request
site = urllib.request.urlopen("http://modelseed.org/biochem/reactions/rxn00001")
print(site.read())
The part of the code with the "Reaction Details" information, such as "Name", "ID" and "Abbreviation", is missing, but it is visible if I inspect the page in Chrome's developer tools.
The code I'm able to download using the two codes above is:
<!DOCTYPE html>
<html lang="en" ng-app="ModelSEED">
<head>
<base href="/"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="initial-scale=1, maximum-scale=1, user-scalable=no" name="viewport">
<meta content="The ModelSEED is a resource for the reconstruction, exploration, comparison, and analysis of metabolic models." name="description"/>
<link href="/img/ModelSEED-favicon.png?v=2.0" rel="shortcut icon"/>
<meta content="nconrad" name="author"/>
<title>
ModelSEED
</title>
<link href="components/angular-material/angular-material.css" rel="stylesheet"/>
<link href="components/bootstrap/dist/css/bootstrap.min.css" rel="stylesheet"/>
<!-- to be removed -->
<link href="components/font-awesome/css/font-awesome.min.css" rel="stylesheet"/>
<link href="icomoon/style.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<link href="http://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css"/>
<link href="build/style.css" rel="stylesheet"/>
<!--<script src="https://cdn.socket.io/socket.io-1.3.7.js"></script>-->
<script src="build/site.js">
</script>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
</meta>
</head>
<body>
<div style="height: 100%;" ui-view="">
</div>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-67412611-1', 'auto');
ga('send', 'pageview');
</script>
</body>
</html>
Does anyone have a hint as to why the code between <div style="height: 100%;" ui-view=""> and </div> (just after <body> and before <script>) is not downloaded?
Thank you.
It's being inserted by JavaScript, so neither requests nor urllib will find it. You need a browser for this; try Selenium or PhantomJS.
Something like:
from selenium import webdriver
url = "http://modelseed.org/biochem/reactions/rxn00001"
driver = webdriver.Chrome('./chromedriver')
driver.get(url)
print(driver.page_source)  # HTML after the JavaScript has run
Try getting this url instead: https://www.patricbrc.org/api/model_reaction/?http_accept=application/json&eq(id,rxn00001)
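One way to confirm that the HTML returned by requests or urllib is only an empty client-side shell is to check whether the element carrying the `ui-view` attribute has any text in it. The following is a minimal sketch using only the standard library; the class and function names (`UiViewProbe`, `is_empty_shell`) are made up for illustration, and the check assumes an AngularJS-style page like the one in the question.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class UiViewProbe(HTMLParser):
    """Collects the text inside the first element that has a ui-view attribute."""

    def __init__(self):
        super().__init__()
        self.depth = 0       # nesting depth inside the ui-view element (0 = outside)
        self.found = False   # whether a ui-view element was seen at all
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth += 1
        elif not self.found and any(name == "ui-view" for name, _ in attrs):
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def is_empty_shell(html):
    """Return True if the page has a ui-view container with no text inside it."""
    probe = UiViewProbe()
    probe.feed(html)
    return probe.found and not "".join(probe.text).strip()
```

If `is_empty_shell(page.text)` is true, the content is being filled in by JavaScript, and a browser driver or the underlying JSON endpoint is needed.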

Error as I'm switching to peewee for flask app. 'peewee.IntegerField object' has no attribute 'flags'

I began switching from standard SQL in my Flask app to peewee and am getting a weird bug that I can't find any information about. My endpoints are working fine, but when I go to the landing page I get "jinja2.exceptions.UndefinedError: 'peewee.IntegerField object' has no attribute 'flags'".
This seems like some odd interaction between WTForms and peewee, but I can't find similar issues. Thanks in advance.
Note everything is in one file
My Models:
class pipelineForm(FlaskForm):
    pipeline = IntegerField('Pipeline ID')
class Process(Model):
    pipeline_id = IntegerField()
    process_name = CharField(null=True)
    log = CharField(null=True)
    exit_code = IntegerField()
    started = CharField(null=True)
    finsihed = CharField(null=True)
    class Meta:
        database = db
End Point for Landing Page:
@app.route('/bloodhound', methods=['GET','POST'])
def index():
    form = pipelineForm()
    print(form.errors)
    if form.validate_on_submit():
        print(str(form.pipeline.data))
        return redirect(url_for('.display', pipelineId=form.pipeline.data))#'<h1>' + str(form.pipeline.data) + '</h1>'
    return render_template('index.html',form=form)
Landing Page:
{% extends "bootstrap/base.html" %} {%import "bootstrap/wtf.html" as wtf%} {% block content %}
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->
<meta name="description" content="">
<meta name="author" content="">
<link rel="icon" href="../../favicon.ico">
<title>Narrow Jumbotron Template for Bootstrap</title>
<!-- Bootstrap core CSS -->
<link href="../../dist/css/bootstrap.min.css" rel="stylesheet">
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<link href="../../assets/css/ie10-viewport-bug-workaround.css" rel="stylesheet">
<!-- Custom styles for this template -->
<link href="jumbotron-narrow.css" rel="stylesheet">
<!-- Just for debugging purposes. Don't actually copy these 2 lines! -->
<!--[if lt IE 9]><script src="../../assets/js/ie8-responsive-file-warning.js"></script><![endif]-->
<script src="../../assets/js/ie-emulation-modes-warning.js"></script>
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.3/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<nav class="navbar navbar-inverse">
<div class="container-fluid">
<div class="navbar-header">
<a class="navbar-brand" href="/bloodhound">Bloodhound</a>
</div>
<ul class="nav navbar-nav">
<li class="active">Home</li>
<li>Performance</li>
</ul>
</div>
</nav>
<body>
<div class="container">
<div class="jumbotron">
<h1>Welcome to Bloodhound!</h1>
<p class="lead">Enter a pipeline Id to get diagnostic information.</p>
<form class="input" method="POST" action="/bloodhound">
<div class="input-group" style="width: 300px;">
{{form.hidden_tag()}} {{wtf.form_field(form.pipeline)}}
<span class="input-group-btn" style="vertical-align: bottom;">
<button class="btn btn-default" type="submit" >Go!</button>
</span>
</div>
</form>
</div>
<footer class="footer">
<p>© 2017 MITRE.</p>
</footer>
</div>
<!-- /container -->
<!-- IE10 viewport hack for Surface/desktop Windows 8 bug -->
<script src="../../assets/js/ie10-viewport-bug-workaround.js"></script>
</body>
</html>
{% endblock %}
Full Stack Trace
Traceback (most recent call last):
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/app.py", line 1997, in __call__
    return self.wsgi_app(environ, start_response)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/app.py", line 1985, in wsgi_app
    response = self.handle_exception(e)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/app.py", line 1540, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/app.py", line 1982, in wsgi_app
    response = self.full_dispatch_request()
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/app.py", line 1614, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/app.py", line 1517, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/_compat.py", line 33, in reraise
    raise value
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/app.py", line 1612, in full_dispatch_request
    rv = self.dispatch_request()
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/app.py", line 1598, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/patrick/process_tracker/api/tracker_api.py", line 201, in index
    return render_template('index.html',form=form)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/templating.py", line 134, in render_template
    context, ctx.app)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask/templating.py", line 116, in _render
    rv = template.render(context)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/jinja2/environment.py", line 1008, in render
    return self.environment.handle_exception(exc_info, True)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/jinja2/environment.py", line 780, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/jinja2/_compat.py", line 37, in reraise
    raise value.with_traceback(tb)
  File "/home/patrick/process_tracker/api/templates/index.html", line 1, in top-level template code
    {% extends "bootstrap/base.html" %} {%import "bootstrap/wtf.html" as wtf%} {% block content %}
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask_bootstrap/templates/bootstrap/base.html", line 1, in top-level template code
    {% block doc -%}
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask_bootstrap/templates/bootstrap/base.html", line 4, in block "doc"
    {%- block html %}
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask_bootstrap/templates/bootstrap/base.html", line 20, in block "html"
    {% block body -%}
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask_bootstrap/templates/bootstrap/base.html", line 23, in block "body"
    {% block content -%}
  File "/home/patrick/process_tracker/api/templates/index.html", line 58, in block "content"
    <!-- {{form.hidden_tag()}} {{wtf.form_field(form.pipeline)}} -->
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/jinja2/runtime.py", line 553, in _invoke
    rv = self._func(*arguments)
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/flask_bootstrap/templates/bootstrap/wtf.html", line 36, in template
    {% if field.flags.required and not required in kwargs %}
  File "/home/patrick/enviroments/venv/lib/python3.5/site-packages/jinja2/environment.py", line 430, in getattr
    return getattr(obj, attribute)
According to your error message, this field
class pipelineForm(FlaskForm):
    pipeline = IntegerField('Pipeline ID')
is of type peewee.IntegerField, and you want it to be the IntegerField type of WTForms. If you have both classes in the same file, you need to qualify the names:
import peewee
from wtforms import fields
class pipelineForm(FlaskForm):
    pipeline = fields.IntegerField('Pipeline ID')
class Process(peewee.Model):
    pipeline_id = peewee.IntegerField() # and so on
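The underlying problem is name shadowing: when two modules export the same name and both are imported with `from ... import`, the later import silently wins. The sketch below shows the effect with two standard-library modules, `json` and `pickle`, standing in for WTForms and peewee (an assumption made so the example runs without either package installed).

```python
from json import dumps      # first import binds the name `dumps`
from pickle import dumps    # second import silently rebinds it

# `dumps` now refers to pickle.dumps, so this returns bytes, not a JSON string.
shadowed_result = dumps({"pipeline": 1})

# Importing the modules themselves keeps both functions reachable.
import json
import pickle

json_result = json.dumps({"pipeline": 1})       # a JSON string
pickle_result = pickle.dumps({"pipeline": 1})   # pickled bytes
```

The same reasoning applies to `IntegerField`: whichever library was imported last decides what the bare name means, so qualifying it with the module removes the ambiguity.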

Python scraping of dynamic content (visual different from html source code)

I'm a big fan of stackoverflow and typically find solutions to my problems through this website. However, the following problem has bothered me for so long that it forced me to create an account here and ask directly:
I'm trying to scrape this link: https://permid.org/1-21475776041. What I want are the rows "TRCS Asset Class" and "Currency".
For starters, I'm using this code:
from bs4 import BeautifulSoup
import urllib2
url = 'https://permid.org/1-21475776041'
req = urllib2.urlopen(url)
raw = req.read()
soup = BeautifulSoup(raw)
print soup.prettify()
The html code returned (see below) is different from what you can see in your browser upon clicking the link:
<!DOCTYPE html>
<!--[if lt IE 7]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" ng-app="tmsMdaasApp">
<!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="max-age=0,no-cache" http-equiv="Cache-Control"/>
<base href="/"/>
<title ng-bind="PageTitle">
Thomson Reuters | PermID
</title>
<meta content="" name="description"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="#ff8000" name="theme-color"/>
<!-- Place favicon.ico and apple-touch-icon.png in the root directory -->
<link href="app/vendor.daf96efe.css" rel="stylesheet"/>
<link href="app/app.1405210f.css" rel="stylesheet"/>
<link href="favicon.ico" rel="icon"/>
<!-- Typekit -->
<script src="//use.typekit.net/gnw2rmh.js">
</script>
<script>
try{Typekit.load({async:true});}catch(e){}
</script>
<!-- // Typekit -->
<!-- Google Tag Manager Data Layer -->
<!--<script>
analyticsEvent = function() {};
analyticsSocial = function() {};
analyticsForm = function() {};
dataLayer = [];
</script>-->
<!-- // Google Tag Manager Data Layer -->
</head>
<body class="theme-grey" id="top" ng-esc="">
<!--[if lt IE 7]>
<p class="browserupgrade">You are using an <strong>outdated</strong> browser. Please upgrade your browser to improve your experience.</p>
<![endif]-->
<!-- Add your site or application content here -->
<navbar class="tms-navbar">
</navbar>
<div id="body" role="main" ui-view="">
</div>
<div id="footer-wrapper" ng-show="!params.elementsToHide">
<footer id="main-footer">
</footer>
</div>
<!--[if lt IE 9]>
<script src="bower_components/es5-shim/es5-shim.js"></script>
<script src="bower_components/json3/lib/json3.min.js"></script>
<![endif]-->
<script src="app/vendor.8cc12370.js">
</script>
<script src="app/app.6e5f6ce8.js">
</script>
</body>
</html>
Does anyone know what I'm missing here and how I could get it to work?
Thanks, Teemu Risikko - a comment (albeit not the solution) of the website you linked got me on the right path.
In case someone else is bumping into the same problem, here is my solution: I'm getting the data via requests and not via traditional "scraping" (e.g. BeautifulSoup or lxml).
Navigate to the website using Google Chrome.
Right-click on the website and select "Inspect".
On the top navigation bar select "Network".
Limit network monitor to "XHR".
One of the entries (marked with an arrow) shows the link that can be used with the requests library.
import requests
url = 'https://permid.org/api/mdaas/getEntityById/21475776041'
headers = {'X-AG-Access-Token': YOUR_ACCESS_TOKEN}
r = requests.get(url, headers=headers)
r.json()
Which gets me this:
{u'Asset Class': [u'Units'],
u'Asset Class URL': [u'https://permid.org/1-302043'],
u'Currency': [u'CAD'],
u'Currency URL': [u'https://permid.org/1-500140'],
u'Exchange': [u'TOR'],
u'IsQuoteOf.mdaas': [{u'Is Quote Of': [u'Convertible Debentures Income Units'],
u'URL': [u'https://permid.org/1-21475768667'],
u'quoteOfInstrument': [u'21475768667'],
u'quoteOfInstrument URL': [u'https://permid.org/1-21475768667']}],
u'Mic': [u'XTSE'],
u'PERM ID': [u'21475776041'],
u'Quote Name': [u'CONVERTIBLE DEBENTURES INCOME UNT'],
u'Quote Type': [u'equity'],
u'RIC': [u'OCV_u.TO'],
u'Ticker': [u'OCV.UN'],
u'entityType': [u'Quote']}
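Every value in that response is wrapped in a list, so pulling out the two fields from the question takes one more small step. A minimal sketch in Python 3, assuming the response has the shape shown above; `first_value` is a hypothetical helper and `sample` is a trimmed copy of the output, not a live API result.

```python
def first_value(payload, key, default=None):
    """Return the first item of the list stored under `key`, or `default`."""
    values = payload.get(key) or []
    return values[0] if values else default


# Trimmed copy of the response shown above (Python 3 strings, no u'' prefix).
sample = {
    "Asset Class": ["Units"],
    "Currency": ["CAD"],
    "Ticker": ["OCV.UN"],
}

asset_class = first_value(sample, "Asset Class")
currency = first_value(sample, "Currency")
missing = first_value(sample, "No Such Field", default="n/a")
```

With a real response, `sample` would be replaced by `r.json()`.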
Using the default user-agent with a lot of pages will give you a different looking page because it is using an outdated user-agent. This is what your output is telling you.
Reference on Changing user-agents
Though this may be your problem, it does not exactly answer the question about getting dynamically applied changes on a webpage. To get the dynamically changed data, you need to emulate the JavaScript requests that the page makes on load. If you make the same requests that the JavaScript makes, you will get the same data that the JavaScript gets.
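Both points, setting a browser-like User-Agent and calling the endpoint the page's JavaScript calls, can be combined in one request. The sketch below only builds the request with the standard library and does not send it. The endpoint URL and the `X-AG-Access-Token` header name are taken from the answer above; the User-Agent string, the `Accept` header, and the `build_api_request` helper are assumptions for illustration.

```python
import urllib.request

API_BASE = "https://permid.org/api/mdaas/getEntityById/"


def build_api_request(entity_id, access_token,
                      user_agent="Mozilla/5.0 (X11; Linux x86_64)"):
    """Build (but do not send) a GET request that mimics the page's XHR call."""
    headers = {
        "X-AG-Access-Token": access_token,
        "User-Agent": user_agent,      # replaces the default Python-urllib agent
        "Accept": "application/json",
    }
    return urllib.request.Request(API_BASE + entity_id, headers=headers)


req = build_api_request("21475776041", "YOUR_ACCESS_TOKEN")
```

Passing `req` to `urllib.request.urlopen` would perform the actual call; a valid access token is needed for that to succeed.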
