experiencing difficulty parsing a web page - python

I had downloaded a web page and saved it in the html format. I wanted to parse and obtain the values of "fullname", "memberHeadline", "numberOfConnections", etc. I tried using BeautifulSoup in Python, but it is not working. I also tried
>>> import json
>>> encoded_data = json.loads(f)
Traceback (most recent call last):
File "<pyshell#14>", line 1, in <module>
encoded_data = json.loads(f)
File "C:\Python27\lib\json\__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "C:\Python27\lib\json\decoder.py", line 365, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "C:\Python27\lib\json\decoder.py", line 383, in raw_decode
raise ValueError("No JSON object could be decoded")
ValueError: No JSON object could be decoded
I am not clear on what the format of the file is. copied below is the content of the file.
<!DOCTYPE html>
<!--[if lt IE 7]>
<html lang="en" class="ie ie6 lte9 lte8 lte7 os-win">
<![endif]-->
<!--[if IE 7]>
<html lang="en" class="ie ie7 lte9 lte8 lte7 os-win">
<![endif]-->
<!--[if IE 8]>
<html lang="en" class="ie ie8 lte9 lte8 os-win">
<![endif]-->
<!--[if IE 9]>
<html lang="en" class="ie ie9 lte9 os-win">
<![endif]-->
<!--[if gt IE 9]>
<html lang="en" class="os-win">
<![endif]-->
<!--[if !IE]><!-->
<html lang="en" class="os-win">
<!--<![endif]-->
<head>
<meta name="lnkd-track-json-lib" content="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=2jds9coeh4w78ed9wblscv68v-eo3jgzogk6v7maxgg86f4u27d&fc=2">
<meta name="lnkd-track-lib" content="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=eo3jgzogk6v7maxgg86f4u27d&fc=2">
<meta name="treeID" content="yGlqHfV7FxMQvJqjACsAAA==">
<meta name="appName" content="profile">
<meta name="lnkd-track-error" content="/lite/ua/error?csrfToken=ajax%3A1584468784299534813&goback=%2Enpv_131506997_*1_*1_NAME*4SEARCH_9ikF_*1_en*4US_*1_*1_*1_123452511375704499972_1_63_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1">
<script src="http://static.licdn.com:80/scds/common/u/lib/fizzy/fz-1.3.3-min.js" type="text/javascript"></script>
<script type="text/javascript">
fs.config({
"failureRedirect": "http://www.linkedin.com/nhome/",
"uniEscape": true,
"xhrHeaders": {
"X-FS-Origin-Request": "/profile/view?id=131506997&authType=NAME_SEARCH&authToken=9ikF&locale=en_US&srchid=123452511375704499972&srchindex=1&srchtotal=63&trk=vsrp_people_res_name&trkInfo=VSRPsearchId%3A123452511375704499972%2CVSRPtargetId%3A131506997%2CVSRPcmpt%3Aprimary",
"X-FS-Page-Id": "nprofile-view"
}
});
</script>
<script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=8swqmpmjehppqzovz8zzfvv9g-aef5jooigi7oiyblwlouo8z90-7tqheyb1qchwa8dejl8nvz7zd-10q339fub5b718xk0pv9lzhpl&fc=2"></script>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=9">
<meta name="pageImpressionID" content="3b47eb21-1db4-42c1-9d00-43b2918c4099">
<meta name="pageKey" content="nprofile_v2_view_fs">
<meta name="analyticsURL" content="/analytics/noauthtracker">
<link rel="openid.server" href="https://www.linkedin.com/uas/openid/authorize">
<link rel="apple-touch-icon-precomposed" href="/img/icon/apple-touch-icon.png">
<link rel="stylesheet" type="text/css" href="http://s.c.lnkd.licdn.com/scds/concat/common/css?h=3bifs78lai5i0ndyj1ew7316e-c8kkvmvykvq2ncgxoqb13d2by-cphg8n6ehozk6lgpbb36za2ap-2it1to3q1pt5evainys9ta07p-4uu2pkz5u0jch61r2nhpyyrn8-7poavrvxlvh0irzkbnoyoginp-4om4nn3a2z730xs82d78xj3be-3t9ar1pajet97hzt9uou74qbb-ct4kfyj4tquup0bvqhttvymms-58ujm6g9r0a3ok6mpq7cs25gn-9zbbsrdszts09by60it4vuo3q-8ti9u6z5f55pestwbmte40d9-5730os2tf3iaiql5c8fukzd2u-cxff0g818hf3ks7dzixd4lqcq-6ramlbadr9lh7v5r7vuc6t4ld-e5frmcn40t833k1adjvwkoyjq&fc=2">
<script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=dfoaudjrk6rbf82f45bz5crwi-62og8s54488owngg0s7escdit-c8ha6zrgpgcni7poa5ctye7il-3ufb745s29q1ovtbq6htt6rwh-51dv6schthjydhvcv6rxvospp-e9rsfv7b5gx0bk0tln31dx3sq-2r5gveucqe4lsolc3n0oljsn1-8v2hz0euzy8m1tk5d6tfrn6j-d4jr8g8vadmx9i5wz2z6ck3pb-1wfd86vm2f60y6uu5isrw94q-ddqoqtn6xcqi8i0y3tsrdl533-81wr8bey9cjjn6rhvbv530cap-6iw3fvg61uute6gxy89acxi5d-5sxnyeselbctwpry658s2lkew-4c6mz6u5rinti47gswwanj74j-5eec9p1vamr86uabn13sngx92-6oxtrh5eu6olunu39xzqgp10i-cdr0psywot2inbx54hmajga3p-anaxa6l712w7m4gp8089vyb5m-3h7320kwlnqtbngzm67z2annq-80vg9koywz84zoon9sjflbru0-8cwe3ciy81r59l0q3usztbt2r-1na957r12xyfe317uma5pn9mc-57sur4cj634ll9tk38imgvc6g-8v6o0480wy5u6j7f3sh92hzxo-9puf8y7tgjvse2oqtgkdb4wcj-c9pibx8dlmicbwjh48g12z6bl-12xp8e6pputw80p9fcpzyy9m0-3xjyji4eyuzpbppt3cssr1oko-34nej6plgotmo4hbnvjthteuu-4nw8tqsdbe61ig2l9faf3qdi9&fc=2"></script>
<script type="text/javascript">
LI.define('UrlPackage');
LI.UrlPackage.containerCore = ["http://s.c.lnkd.licdn.com/scds/concat/common/js?h=d7z5zqt26qe7ht91f8494hqx5&fc=2"]
[0];
</script>
<script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/common/u/js/scds-hashes.js"></script>
<script type="text/javascript">
LI.JSContentBasePath = "http://s.c.lnkd.licdn.com/scds/concat/common/js?v=build-2000_1_28557-prod&fc=2";
LI.CSSContentBasePath = "http://s.c.lnkd.licdn.com/scds/concat/common/css?v=build-2000_1_28557-prod&fc=2";
LI.injectRelayHtmlUrl = "http://s.c.lnkd.licdn.com/scds/common/u/lib/inject/0.4.2/relay.html";
LI.injectRelaySwfUrl = "http://s.c.lnkd.licdn.com/scds/common/u/lib/inject/0.4.2/relay.swf";
LI.comboBaseUrl = "http://s.c.lnkd.licdn.com/scds/concat/common/css?v=build-2000_1_28557-prod&fc=2";
LI.staticUrlHashEnabled = "true";
</script>
<script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=c19zsujfl1pg46iqy33ubhqc5-eu97293ov7e2qciy9zn7fyg55-4n3akxbgwkyp3he1eeb136xxq-aq8gt7g4x1o11fxmypuv7vfkb-8kh6sn7nciobs2crbqunav09q-6m96aslgoubdqpnadnimrxsuk&fc=2"></script>
<script type="text/javascript">
document.cookie = 'lang="v=2&lang=en-us"; domain=linkedin.com; version=0; path=/;';
</script>
<title>xyz abc | LinkedIn</title>
<link rel="stylesheet" type="text/css" href="http://s.c.lnkd.licdn.com/scds/concat/common/css?h=449dtfu96optpu75y189filyn-4lrmst05cep59hjxopm4xrj84-depvqaeschv5p2381431jub3f-ae142xp1b9qwrvanr8q32v931&fc=2">
<!--[if gte IE 9]>
<link rel="shortcut icon" type="image/ico" href="http://s.c.lnkd.licdn.com/scds/common/u/img/favicon_ie9.ico">
<![endif]-->
<!--[if (!IE)|(lt IE 9)]>
<link rel="shortcut icon" type="image/ico" href="http://s.c.lnkd.licdn.com/scds/common/u/img/favicon_v3.ico">
<![endif]-->
<script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=6rei9ktvfprzc38327x3gt0u3-c61ck8yq8xgf9ji3h55bmaux8-e7bh8bocljccs3kjl5f03uw8j&fc=2"></script>
<script type="text/javascript" src="http://s.c.lnkd.licdn.com/scds/concat/common/js?h=bqzk7a8ih1jk3hy30mnqy9jxb-39gmm0e77xjrikdpkxed7aywi-d228wbwzysn60azozcfg7gzoa&fc=2"></script>"ind_lookup":"Financial Services","isShared":false,"logoId":"/p/3/000/10d/1c1/3af8941.png"},{"link_biz":"/company/axis\u002dmutual\u002dfund?trk=prof\u002dfollowing\u002dcompany\u002dlogo","universalName":"axis\u002dmutual\u002dfund","id":565614,"logo":"http://m.c.lnkd.licdn.com/media/p/2/000/036/30a/1bf5734.png","canonicalName":"Axis Mutual Fund","biz_follow":"/company/follow/submit?id=565614&csrfToken=ajax%3A1584468784299534813&goback=%2Enpv_131506997_*1_*1_NAME*4SEARCH_9ikF_*1_en*4US_*1_*1_*1_123452511375704499972_1_63_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1","ind_lookup":"Financial Services","isShared":false,"logoId":"/p/2/000/036/30a/1bf5734.png"},{"link_biz":"/company/uti\u002dmf?trk=prof\u002dfollowing\u002dcompany\u002dlogo","universalName":"uti\u002dmf","id":467803,"logo":"http://m.c.lnkd.licdn.com/media/p/3/000/02f/205/35eec6b.png","canonicalName":"UTI MF","biz_follow":"/company/follow/submit?id=467803&csrfToken=ajax%3A1584468784299534813&goback=%2Enpv_131506997_*1_*1_NAME*4SEARCH_9ikF_*1_en*4US_*1_*1_*1_123452511375704499972_1_63_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1","ind_lookup":"Financial Services","isShared":false,"logoId":"/p/3/000/02f/205/35eec6b.png"},{"link_biz":"/company/investment\u002ddata\u002dservices?trk=prof\u002dfollowing\u002dcompany\u002dlogo","universalName":"investment\u002ddata\u002dservices","id":92139,"logo":"http://m.c.lnkd.licdn.com/media/p/3/000/0ad/3d0/32f42f2.png","canonicalName":"Investment Data Services","biz_follow":"/company/follow/submit?id=92139&csrfToken=ajax%3A1584468784299534813&goback=%2Enpv_131506997_*1_*1_NAME*4SEARCH_9ikF_*1_en*4US_*1_*1_*1_123452511375704499972_1_63_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1_*1","ind_lookup":"Financial Services","isShared":false,"logoId":"/p/3/000/0ad/3d0/32f42f2.png"}],"i18n_news":"News","lix_profile_showChannels":"control","i18n_unfollow":"Unfollow","isFollowing":true}},"BasicInfo":{"empty":{},"upsell":{"deferImg":true,"visible":true},"basic_info":{"showTopCardDetail":true,"visible":true,"phoneticname":"","i18n__Industry":"Industry","industry_pivot":"/search?search=&industry=43&sortCriteria=R&keepFacets=true&trk=prof\u002d0\u002dovw\u002dindustry","find_others_region":"Find other members in Mumbai Area, India","headline_highlight":"Executive at L&T Mutual Fund","i18n__find_others_in_industry":"Find other members in this industry","i18n_Edit":"Edit","location_highlight":"<strong class=\ "highlight\">Mumbai Area, India</strong>","deferImg":true,"industry_highlight":"Financial Services","industryID":43,"memberHeadline":"Executive at L&T Mutual Fund","i18n__Location":"Location","memberID":131506997,"location_pivot":"/search?search=&sortCriteria=R&keepFacets=true&facet_G=in%3A7150&trk=prof\u002d0\u002dovw\u002dlocation","fmt_location":"Mumbai Area, India","fullname":"xyz abc"}}`

This is an HTML file. You can tell because it starts with <!DOCTYPE html> (in fact, that means it's HTML5).
The reason you can't load it with json is because it's not json. The exception you get is expected, and your code should likely handle it, because in general passing a file of the wrong type is the sort of thing that happens.
You might like to use lxml, possibly with the beautifulsoup backend, to parse this.

The header is the doctype for HTML5 (this feature is similar of XHTML). Different of XHML, you don't really need to close some tags but in this case it seems more like a "hidden iframe streaming" technique, where the page is getting loading forever.
About the parsing, I would recommend the answers in about "parsing malformed html" ->
How to parse malformed HTML in python, using standard libraries
Regards
Luan

Related

Python/Pyscript Erro with folium

i'm trying to make a simple code with pyScript and the folium library, but I'am continually getting this error
[pyscript/base] PythonError: Traceback (most recent call last):
File "/lib/python3.10/asyncio/futures.py", line 201, in result
raise self._exception
File "/lib/python3.10/asyncio/tasks.py", line 232, in __step
result = coro.send(None)
File "/lib/python3.10/site-packages/_pyodide/_base.py", line 506, in eval_code_async
await CodeRunner(
File "/lib/python3.10/site-packages/_pyodide/_base.py", line 357, in run_async
coroutine = eval(self.code, globals, locals)
File "<exec>", line 3, in <module>
TypeError: 'module' object is not callable
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<script defer src="https://pyscript.net/latest/pyscript.js"></script>
<link rel="stylesheet" href="https://pyscript.net/latest/pyscript.css" />
<py-env>
- folium
</py-env>
<title>pyscipt test</title>
</head>
<body>
<div id="map" style="width: 100%; height: 100%"></div>
<py-script output="map">
import folium as fpl
m = fpl.map(location=[-6.2238, 106.8193], zoom_start=10)
print(m)
</py-script>
</body>
</html>
Your code needs some update and corrections. Try this modified code.
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<script defer src="https://pyscript.net/latest/pyscript.js"></script>
<link rel="stylesheet" href="https://pyscript.net/latest/pyscript.css" />
<py-config>
packages = [
"folium",
]
</py-config>
<title>pyscript test: Success!</title>
</head>
<body>
<div id="map" style="width: 100%; height: 100%"></div>
<py-script output="map">
import folium as fpl
m = fpl.Map(location=[-6.2238, 106.8193], zoom_start=10)
fpl.LayerControl().add_to(m)
m
</py-script>
</body>
</html>
Some warning from pyscript site:-
Please be advised that PyScript is very alpha and under heavy
development. There are many known issues, from usability to loading
times, and you should expect things to change often. We encourage
people to play and explore with PyScript, but at this time we do not
recommend using it for production.

Download PDF with chrome plugin in python selenium

I'm trying to extract a PDF from this site that uses the native Google Chrome pdf viewer tool to open the pdf in the first place, it's content type is /application/pdf. The issue is that the site URLs that I get aren't actually links to the PDF but rather to a .zul site where the js will load the pdf, or fetch it.
Here's my download code below:
def download_pdf(url, idx, save_dir):
options = webdriver.ChromeOptions()
profile = {"plugins.plugins_list": [{"enabled":False,"name":"Chrome PDF Viewer"}],
"download.default_directory" : save_dir}
options.add_experimental_option("prefs",profile)
driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)
driver.get(url)
The problem that Im encountering with the above code is that I get the following readout from driver.source_page:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="-1" />
<title>Document Viewer</title>
<link rel="stylesheet" type="text/css" href="/eSMARTContracts/zkau/web/9776a7f0/zul/css/zk.wcs;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1"/>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zk.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zul.lang.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<!-- ZK 6.0.2 EE 2012072410 -->
</head>
<body>
<div id="j4AP_" class="z-temp"></div>
<script class="z-runonce" type="text/javascript">zk.pi=1;zkmx(
[0,'j4AP_',{dt:'z_2m1',cu:'/eSMARTContracts;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',uu:'/eSMARTContracts/zkau;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',ru:'/service/dpsweb/ViewDPSWeb.zul'},[
['zul.wnd.Window','j4AP0',{$$onSize:false,$$onMaximize:false,$$onOpen:false,$$onMinimize:false,$$onZIndex:false,$onClose:true,$$onMove:false,width:'100%',height:'100%',prolog:'\
'},[]]]]);
</script>
<noscript>
<div class="noscript"><p>Sorry, JavaScript must be enabled.<br/>Change your browser options, then try again.</p></div>
</noscript>
</body>
</html>
EDIT: Included the link

jquery is not laoding in python flask app

Hello i dont know why the jquery is not loading in my flask app ? even though i have downloaded the jquery and add in js folder
<!DOCTYPE html>
<html lang="eng">
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<link href="{{url_for('static', filename = 'css/bootstrap.min.css')}}" rel="stylesheet">
<link rel="shortcut icon" href="{{url_for('static', filename = 'myicon.ico' )}}">
<script type=text/javascript src="{{url_for('static', filename='js/jquery.min.js') }}"></script>
<script type="text/javascript" src="{{url_for('static', filename = 'js/bootstrap.min.js')}}"></script>
type=text/javascript should be type="text/javascript" in line referencing jQuery.

how to copy all the code of a URL with python

I want to copy all the code of an URL (http://modelseed.org/biochem/reactions/rxn00001) using Python 3.6, but I can only copy part of the code, and I don't know why.
So far, I tried with "requests" module
import requests
page = requests.get("http://modelseed.org/biochem/reactions/rxn00001")
print(page.content)
and "urllib"
import urllib.request
site = urllib.request.urlopen("http://modelseed.org/biochem/reactions/rxn00001")
print(site.read())
The part of the code with information of the "Reaction Details", like "Name", "ID" and "Abbreviation" are missing, but they are visible if I inspect the code on the developer bar of Chrome.
The code I'm able to download using the two codes above is:
<!DOCTYPE html>
<html lang="en" ng-app="ModelSEED">
<head>
<base href="/"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="initial-scale=1, maximum-scale=1, user-scalable=no" name="viewport">
<meta content="The ModelSEED is a resource for the reconstruction, exploration, comparison, and analysis of metabolic models." name="description"/>
<link href="/img/ModelSEED-favicon.png?v=2.0" rel="shortcut icon"/>
<meta content="nconrad" name="author"/>
<title>
ModelSEED
</title>
<link href="components/angular-material/angular-material.css" rel="stylesheet"/>
<link href="components/bootstrap/dist/css/bootstrap.min.css" rel="stylesheet"/>
<!-- to be removed -->
<link href="components/font-awesome/css/font-awesome.min.css" rel="stylesheet"/>
<link href="icomoon/style.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<link href="http://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css"/>
<link href="build/style.css" rel="stylesheet"/>
<!--<script src="https://cdn.socket.io/socket.io-1.3.7.js"></script>-->
<script src="build/site.js">
</script>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
</meta>
</head>
<body>
<div style="height: 100%;" ui-view="">
</div>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-67412611-1', 'auto');
ga('send', 'pageview');
</script>
</body>
</html>
Anyone has any hint why the code between < div style="height: 100%;" ui-view="" > and (just after < body > and before < script >) is not downloaded?
Thank you.
It's being inserted by a javascript script, therefore, either requests nor urllib would find it, you would need to use a browser for this, you should try with selenium or PhantomJS
something like:
from selenium import webdriver
driver = webdriver.Chrome('./chromedriver')
driver.get(url)
driver.page_source
Try getting this url instead: https://www.patricbrc.org/api/model_reaction/?http_accept=application/json&eq(id,rxn00001)

Python scraping of dynamic content (visual different from html source code)

I'm a big fan of stackoverflow and typically find solutions to my problems through this website. However, the following problem has bothered me for so long that it forced me to create an account here and ask directly:
I'm trying to scape this link: https://permid.org/1-21475776041 What i want is the row "TRCS Asset Class" and "Currency".
For starters, I'm using this code:
from bs4 import BeautifulSoup
import urllib2
url = 'https://permid.org/1-21475776041'
req = urllib2.urlopen(url)
raw = req.read()
soup = BeautifulSoup(raw)
print soup.prettify()
The html code returned (see below) is different from what you can see in your browser upon clicking the link:
<!DOCTYPE html>
<!--[if lt IE 7]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" ng-app="tmsMdaasApp">
<!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="max-age=0,no-cache" http-equiv="Cache-Control"/>
<base href="/"/>
<title ng-bind="PageTitle">
Thomson Reuters | PermID
</title>
<meta content="" name="description"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="#ff8000" name="theme-color"/>
<!-- Place favicon.ico and apple-touch-icon.png in the root directory -->
<link href="app/vendor.daf96efe.css" rel="stylesheet"/>
<link href="app/app.1405210f.css" rel="stylesheet"/>
<link href="favicon.ico" rel="icon"/>
<!-- Typekit -->
<script src="//use.typekit.net/gnw2rmh.js">
</script>
<script>
try{Typekit.load({async:true});}catch(e){}
</script>
<!-- // Typekit -->
<!-- Google Tag Manager Data Layer -->
<!--<script>
analyticsEvent = function() {};
analyticsSocial = function() {};
analyticsForm = function() {};
dataLayer = [];
</script>-->
<!-- // Google Tag Manager Data Layer -->
</head>
<body class="theme-grey" id="top" ng-esc="">
<!--[if lt IE 7]>
<p class="browserupgrade">You are using an <strong>outdated</strong> browser. Please upgrade your browser to improve your experience.</p>
<![endif]-->
<!-- Add your site or application content here -->
<navbar class="tms-navbar">
</navbar>
<div id="body" role="main" ui-view="">
</div>
<div id="footer-wrapper" ng-show="!params.elementsToHide">
<footer id="main-footer">
</footer>
</div>
<!--[if lt IE 9]>
<script src="bower_components/es5-shim/es5-shim.js"></script>
<script src="bower_components/json3/lib/json3.min.js"></script>
<![endif]-->
<script src="app/vendor.8cc12370.js">
</script>
<script src="app/app.6e5f6ce8.js">
</script>
</body>
</html>
Does anyone know what I'm missing here and how I could get it to work?
Thanks, Teemu Risikko - a comment (albeit not the solution) of the website you linked got me on the right path.
In case someone else is bumping into the same problem, here is my solution: I'm getting the data via requests and not via traditional "scraping" (e.g. BeautifulSoup or lxml).
Navigate to the website using Google Chrome.
Right-click on the website and select "Inspect".
On the top navigation bar select "Network".
Limit network monitor to "XHR".
One of the entries (market with an arrow) shows the link that can be used with the requests library.
import requests
url = 'https://permid.org/api/mdaas/getEntityById/21475776041'
headers = {'X-AG-Access-Token': YOUR_ACCESS_TOKEN}
r = requests.get(url, headers=headers)
r.json()
Which gets me this:
{u'Asset Class': [u'Units'],
u'Asset Class URL': [u'https://permid.org/1-302043'],
u'Currency': [u'CAD'],
u'Currency URL': [u'https://permid.org/1-500140'],
u'Exchange': [u'TOR'],
u'IsQuoteOf.mdaas': [{u'Is Quote Of': [u'Convertible Debentures Income Units'],
u'URL': [u'https://permid.org/1-21475768667'],
u'quoteOfInstrument': [u'21475768667'],
u'quoteOfInstrument URL': [u'https://permid.org/1-21475768667']}],
u'Mic': [u'XTSE'],
u'PERM ID': [u'21475776041'],
u'Quote Name': [u'CONVERTIBLE DEBENTURES INCOME UNT'],
u'Quote Type': [u'equity'],
u'RIC': [u'OCV_u.TO'],
u'Ticker': [u'OCV.UN'],
u'entityType': [u'Quote']}
Using the default user-agent with a lot of pages will give you a different looking page because it is using an outdated user-agent. This is what your output is telling you.
Reference on Changing user-agents
Thought this may be your problem, it does not exactly answer the question about getting dynamically applied changes on a webpage. To get the dynamically changed data you need to emulate the javascript requests that the page is making on load. If you make the requests that the javascript is making you will get the data that the javascript is getting.

Categories