Data/Contents are missing when using Python Selenium

I am testing how to use Selenium in Python, and I can successfully open a page with the code below on Ubuntu 16.04:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
firefox_options = Options()
firefox_options.binary_location = '/usr/bin/firefox'
driver = webdriver.Firefox(executable_path='/home/myname/geckodriver', firefox_options=firefox_options)
driver.get('https://www.toutiao.com')
However, some data/content is missing compared with opening this page ('https://www.toutiao.com') manually.
My Firefox version is 72.0.2 and my geckodriver version is 0.26.0. Could anybody help me with this issue, please? Thanks in advance!

I took your code, simplified the script, and on execution I encountered the same issue, i.e. data/content is missing compared with opening the page manually:
Code Block:
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'C:\Utility\BrowserDrivers\geckodriver.exe')
driver.get('https://www.toutiao.com')
print(driver.page_source)
Console Output:
<html><head><style class="vjs-styles-defaults">
.video-js {
width: 300px;
height: 150px;
}
.vjs-fluid {
padding-top: 56.25%
}
</style><meta charset="utf-8"><title>????</title><meta http-equiv="x-dns-prefetch-control" content="on"><meta name="renderer" content="webkit"><link rel="dns-prefetch" href="//s3.pstatp.com/"><link rel="dns-prefetch" href="//s3a.pstatp.com/"><link rel="dns-prefetch" href="//s3b.pstatp.com"><link rel="dns-prefetch" href="//p1.pstatp.com/"><link rel="dns-prefetch" href="//p3.pstatp.com/"><meta http-equiv="Content-Security-Policy" content="upgrade-insecure-requests"><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1,minimum-scale=1,user-scalable=no,minimal-ui"><meta name="360-site-verification" content="b96e1758dfc9156a410a4fb9520c5956"><meta name="360_ssp_verify" content="2ae4ad39552c45425bddb738efda3dbb"><meta name="google-site-verification" content="3PYTTW0s7IAfkReV8wAECfjIdKY-bQeSkVTyJNZpBKE"><meta name="shenma-site-verification" content="34c05607e2a9430ad4249ed48faaf7cb_1432711730"><meta name="baidu_union_verify" content="b88dd3920f970845bad8ad9f90d687f7"><meta name="domain_verify" content="pmrgi33nmfuw4ir2ej2g65lunfqw6ltdn5wselbcm52wszbchirdqyztge3tenrsgq3dknjume2tayrvmqytemlfmiydimddgu4gcnzcfqrhi2lnmvjwc5tfei5dcnbwhazdcobuhe2dqobrpu"><meta name="keywords" content="????,??,???,????,??????"><meta name="description" content="«????»(www.toutiao.com)????????????????,?????????????????,?????????????,??????????????????????"><link rel="alternate" media="only screen and (max-width: 640px)" href="//m.toutiao.com/"><link rel="shortcut icon" href="//s3a.pstatp.com/toutiao/resource/ntoutiao_web/static/image/favicon_5995b44.ico" type="image/x-icon"><link rel="stylesheet" href="//s3.pstatp.com/toutiao/player/dist/pc_vue2.css" media="screen" title="no title"><!--[if lt IE 9]>
<p>?????????,??????</p>
.
.
.
<script>var imgUrl = '/c/9ubkblw9out4h9t6ya05r7h0uu7q2u341jhsdh7l4r4yphpuxlqgdm/';</script><script>tac='i+2gv2ch1tigds!i$1dmgs"yZl!%s"l"u&kLs#l l#vr*charCodeAtx0[!cb^i$1em7b*0d#>>>s j\uffeel s#0,<8~z|\x7f#QGNCJF[\\^D\\KFYSk~^WSZhg,(lfi~ah`{md"inb|1d<,%Dscafgd"in,8[xtm}nLzNEGQMKAdGG^NTY\x1ckgd"inb<b|1d<g,&TboLr{m,(\x02)!jx-2n&vr$testxg,%#tug{mn ,%vrfkbm[!cb|'</script><script type="text/javascript" crossorigin="anonymous" src="//s3b.pstatp.com/toutiao/static/js/vendor.63b66d4280309ac2fb48.js"></script><script type="text/javascript" crossorigin="anonymous" src="//s3a.pstatp.com/toutiao/static/js/page/index_node/index.e6afc60a3a3f653cfdba.js"></script><script type="text/javascript" crossorigin="anonymous" src="//s3b.pstatp.com/toutiao/static/js/ttstatistics.a083f6cd9b1a9a970725.js"></script><script src="//s3.pstatp.com/inapp/lib/raven.js" crossorigin="anonymous"></script><script>;(function(window) {
// sentry
window.Raven && Raven.config('//key#m.toutiao.com/log/sentry/v2/96', {
whitelistUrls: [/pstatp\.com/],
shouldSendCallback: function(data) {
var ua = navigator && navigator.userAgent;
var isDeviceOK = !/Mobile|Linux/i.test(navigator.userAgent);
return isDeviceOK;
},
tags: {
bid: 'toutiao_pc',
pid: 'index_new'
},
autoBreadcrumbs: {
'xhr': false,
'console': true,
'dom': true,
'location': true
}
}).install();
})(window);</script><script>document.getElementsByTagName('body')[0].addEventListener('click', function(e) {
var target = e.target,
ga_event,
ga_category,
ga_label,
ga_value;
while(target && target.nodeName.toUpperCase() !== 'BODY') {
ga_event = target.getAttribute('ga_event');
ga_category = target.getAttribute('ga_category') || '/';
ga_label = target.getAttribute('ga_label') || '';
ga_value = target.getAttribute('ga_value') || 1;
ga_event && window.ttAnalysis && ttAnalysis.send('event', { ev: ga_event });
target = target.parentNode;
}
});</script><script src="https://xxbg.snssdk.com/websdk/v1/getInfo?q=YOsueEs6CjZquUQrQwttBa2p27c%2FmJBGcEmZKypwf%2Fh%2B%2FFzCVrIwzk9L3bo%2FZb2O8gVTNaA4L2Bk10qWfZ2s94e6qe8KRXlOEjnI%2FrONB4jQynV3bfJ9exD2E4QPsgydRGjRLlDXE9uYD7HU3IZ%2FOU2MJG2vMgfNU55%2FmsOAlVSrPQH2wo4Eor0lgghKHjRi28vVvBdKY7JT4gG7S7ThRFD2YBIc%2Fs4JYViQu1Ll1Bg5Xn5bKuD6jZRz3AzfFqzSOWguO6vUbzL0wBc4mpa22mdpmAXIvUNWtjg5MUfXh9rfWI0ti7saL%2B0r4%2BaBdN5y4lrmxAcQZq2oeAKl4WjOeJsN%2BePpYmisoxTzdBZ6TL8IGE0E7ZUUlFlPGyUWhU3E4IRbtbCCd0QdVaJajiSOIhg9cImqTZYI56kIao1yVnV%2Bxu4%2BhaC1kHu5xsk49%2BX%2FNdwGcel%2BlOUzagkE5s8X6jEswA7jzW%2ByD6%2FusfkNyyx8WOWCJmZlTGQ4SNQr%2FQHvmK2QscQ7KnTvKVqjedUd7IFcvyTyYz3iFFrmRkOMRN9042sLiQwerXsn0f%2Fc%2Bh46PNdeU1S6BsFKq%2BZhMDxw1vI2Y1C%2Fa0RBdZC%2BGZq%2BkbNaoVotfvslg05ahevHTainlZR9DHEiWawFBJbTwjMeYrmo4NZiL5eNBUvslFn%2BDPHk%2F6Oj0Nbb89Rx8Ihi2pRH04voRog9848H2o2LR9gx0N0i0o6%3D&callback=_8712_1581940674310"></script></body></html>
Analysis
While inspecting the DOM tree of the webpage you will find that some of the <link> and <script> tags refer to resources whose paths contain the keyword dist. As an example:
<link rel="stylesheet" href="//s3.pstatp.com/toutiao/player/dist/pc_vue2.css" media="screen" title="no title">
<script src="//unpkg.pstatp.com/byted/sec_sdk_build/1.1.12/dist/captcha.js"></script>
//s3a.pstatp.com/toutiao/picc_mig/dist/img.min.js
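I take this as an indication that the website is protected by the bot-management service provider Distil Networks, and that the navigation driven by GeckoDriver gets detected and subsequently blocked. (dist is also a common name for a build-output directory, so the paths alone are suggestive rather than conclusive.)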
Distil
As per the article There Really Is Something About Distil.it...:
Distil protects sites against automatic content scraping bots by observing site behavior and identifying patterns peculiar to scrapers. When Distil identifies a malicious bot on one site, it creates a blacklisted behavioral profile that is deployed to all its customers. Something like a bot firewall, Distil detects patterns and reacts.
Further,
"One pattern with Selenium was automating the theft of Web content", Distil CEO Rami Essaid said in an interview last week. "Even though they can create new bots, we figured out a way to identify Selenium, the tool they're using, so we're blocking Selenium no matter how many times they iterate on that bot. We're doing that now with Python and a lot of different technologies. Once we see a pattern emerge from one type of bot, then we work to reverse engineer the technology they use and identify it as malicious".
Reference
You can find a few detailed discussions in:
Is there a way to use Selenium WebDriver without informing the document that it is controlled by WebDriver?
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection
Akamai Bot Manager detects WebDriver driven Chrome Browsing Context
Is there a version of selenium webdriver that is not detectable?

Related

Web Scraping Blocked by Robots Meta Directives

I am working on a web scraper to access scheduling data from a website. Our company has full access to this website and data via login credentials. With dynamic site navigation required, I am using Selenium for automated data scraping, Python, and BeautifulSoup to work with the HTML structure. With all variables defined, I have the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import lxml.html as lh
opt = Options()
opt.headless = True
# Headless driver (the variant that gets blocked)
driver = webdriver.Chrome(options=opt, executable_path='<path to chromedriver.exe>')
# Regular, non-headless driver (the variant that works); use instead of the line above
# driver = webdriver.Chrome('<path to chromedriver.exe>')
driver.get('<website login page URL>?username=' + username + '&password=' + password)
driver.get('<url of website page with data>?start_date=' + start_date + '&end_date=' + end_date + '&type=Excel')
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)
The result of the print(soup) is as follows:
<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="initial-scale=1.0" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
</head>
<body> ... irrelevant ... </body></html>
Up front: I do not have much knowledge of robots directives or HTTP requests. My questions are:
When I run a headless driver as above, the scrape is blocked by robots. When I run a regular, non-headless driver where an automated browser opens, the scrape is successful. Why is this the case?
What is the best method to get around this? The scrape is legal and non-exploitive as we practically have full access to the data we are scraping (we are a registered client). Will using the requests library solve this problem? Are there other methods of running headless web drivers that won't get blocked? Is there some parameter I can change that prevents the block?
How do I see the robots.txt file of a website?
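On the last question: robots.txt is conventionally served from the site root (for example https://example.com/robots.txt), so it can be opened directly in a browser. As a minimal sketch, the standard library's urllib.robotparser can also read the rules programmatically. The sample rules, the example.com URLs and the helper name parse_robots below are made up for illustration and are not taken from the site in question.

```python
from urllib import robotparser

# Made-up rules for illustration; a real file is served at <site root>/robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def parse_robots(text):
    """Parse robots.txt text and return a RobotFileParser holding its rules."""
    parser = robotparser.RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = parse_robots(SAMPLE_ROBOTS_TXT)
# The first matching rule decides: /private/ is disallowed, everything else is allowed.
print(rules.can_fetch("*", "https://example.com/private/report"))
print(rules.can_fetch("*", "https://example.com/schedule"))
```

To read a live file instead, RobotFileParser also offers set_url() followed by read(). Note that robots.txt and the ROBOTS meta tag are instructions to crawlers; they do not by themselves block a browser, so the page shown above is more likely a block page served by the site.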

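You can use the following code to hide the webdriver flag (run it after the driver has been created):
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
Also, add these to your ChromeDriver options (options here is the Options instance, called opt in the question):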
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option('useAutomationExtension', False)
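Put together, the pieces above might look like the sketch below. It is a sketch under assumptions, not a guaranteed fix: the helper names build_chrome_options and start_driver are hypothetical, chromedriver is assumed to be discoverable by Selenium, and a site may still detect automation by other means. One deliberate change from the snippet above: execute_script only alters the document that is currently loaded, so the sketch registers the same JavaScript through Chrome's DevTools command Page.addScriptToEvaluateOnNewDocument, which runs it in each document loaded afterwards.

```python
# Flags and options taken from the answer above.
CHROME_ARGUMENTS = ["--disable-blink-features=AutomationControlled"]
EXPERIMENTAL_OPTIONS = {"useAutomationExtension": False}
HIDE_WEBDRIVER_JS = (
    "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"
)


def build_chrome_options(headless=True):
    """Hypothetical helper: collect the options above into a ChromeOptions object."""
    from selenium import webdriver  # imported here so the constants work without Selenium

    options = webdriver.ChromeOptions()
    if headless:
        options.add_argument("--headless")
    for argument in CHROME_ARGUMENTS:
        options.add_argument(argument)
    for name, value in EXPERIMENTAL_OPTIONS.items():
        options.add_experimental_option(name, value)
    return options


def start_driver(headless=True):
    """Hypothetical helper: start Chrome and hide navigator.webdriver in new documents."""
    from selenium import webdriver

    # Assumes Selenium can locate chromedriver (on PATH or via Selenium Manager).
    driver = webdriver.Chrome(options=build_chrome_options(headless))
    # Register the script before navigating so it runs in every document loaded later.
    driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": HIDE_WEBDRIVER_JS}
    )
    return driver
```

Usage would be driver = start_driver() followed by the driver.get(...) calls from the question. Headless Chrome also reports "HeadlessChrome" in its default user agent, which is one plausible reason a site treats headless and regular runs differently.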

Selenium firewall issue "The requested URL was rejected.[...]" [duplicate]

I did several hours of research and asked a bunch of people on Fiverr, none of whom could solve a specific problem I have.
I installed Selenium and tried to access a Website. Unfortunately the site won't allow a specific request and doesn't load the site at all. However, if I try to access the website with my "normal" Chrome Browser, it works fine.
I tried several things such as:
Different IPs
Deleting Cookies
Incognito Mode
Adding different UserAgents
Hiding features which might reveal that a Webdriver is being used
Nothing helped.
Here is a Screenshot of the Error I'm receiving:
And here is the very simple script I'm using:
# coding: utf8
from selenium import webdriver
url = 'https://registrierung.gmx.net/'
# Open ChromeDriver
driver = webdriver.Chrome()
# Open URL
driver.get(url)
If anyone has a solution for that I would highly appreciate it.
I'm also willing to give a big tip if someone could help me out here.
Thanks a lot!
Stay healthy everyone.

I took your code, added a couple of arguments, and executed the test. Here are the observations:
Code Block:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://registrierung.gmx.net/")
print(driver.page_source)
Console Output:
<html style="" class=" adownload applicationcache blobconstructor blob-constructor borderimage borderradius boxshadow boxsizing canvas canvastext checked classlist contenteditable no-contentsecuritypolicy no-contextmenu cors cssanimations csscalc csscolumns cssfilters cssgradients cssmask csspointerevents cssreflections cssremunit cssresize csstransforms3d csstransforms csstransitions cssvhunit cssvmaxunit cssvminunit cssvwunit dataset details deviceorientation displaytable display-table draganddrop fileinput filereader filesystem flexbox fullscreen geolocation getusermedia hashchange history hsla indexeddb inlinesvg json lastchild localstorage no-mathml mediaqueries meter multiplebgs notification objectfit object-fit opacity pagevisibility performance postmessage progressbar no-regions requestanimationframe raf rgba ruby scriptasync scriptdefer sharedworkers siblinggeneral smil no-strictmode no-stylescoped supports svg svgfilters textshadow no-time no-touchevents typedarrays userselect webaudio webgl websockets websqldatabase webworkers datalistelem video svgasimg datauri no-csshyphens"><head>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<meta http-equiv="CacheControl" content="no-cache">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="shortcut icon" href="data:;base64,iVBORw0KGgo=">
<script type="text/javascript">
(function(){
window["bobcmn"] = "10111111111010200000005200000005200000006200000001249d50ae8200000096200000000200000002300000000300000000300000006/TSPD/300000008TSPD_10130000000cTSPD_101_DID300000005https3000000b0082f871fb6ab200097a0a5b9e04f342a8fdfa6e9e63434256f3f63e9b3885e118fdacf66cc0a382208ea9dc3b70a28002d902f95eb5ac2e5d23ffe409bb24b4c57f9cb8e1a5db4bcad517230d966c75d327f561cc49e16f4300000002TS200000000200000000";
.
.
<script type="text/javascript" src="/TSPD/082f871fb6ab20009afc88ee053e87fea57bf47d9659e73d0ea3c46c77969984660358739f3d19d0?type=11"></script>
<script type="text/javascript">
(function(){
window["blobfp"] = "01010101b00400000100e803000000000d4200623938653464333234383463633839323030356632343563393735363433343663666464633135393536643461353031366131633362353762643466626238663337210068747470733a2f2f72652e73656375726974792e66356161732e636f6d2f72652f0700545350445f3734";window["slobfp"] = "08c3194e510b10009a08af8b7ee6860a22b5726420e697e4";
})();
</script>
<script type="text/javascript" src="/TSPD/082f871fb6ab20009afc88ee053e87fea57bf47d9659e73d0ea3c46c77969984660358739f3d19d0?type=12"></script>
<noscript>Please enable JavaScript to view the page content.<br/>Your support ID is: 11993951574422772310.</noscript>
</head><body>
<style>canvas {display:none;}</style><canvas width="800" height="600"></canvas></body></html>
Conclusion
From the page source it is quite clear that the Google Chrome browsing context initiated by Selenium-driven ChromeDriver gets detected and the navigation is blocked.
I could have dug deeper and provided more insights, but surprisingly I am now unable to access the webpage even manually. Possibly my IP is blacklisted now. Once my IP gets whitelisted I will provide more details.
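A rough way to check for this in a script is to look for strings that the challenge page above contains. The sketch below is only a heuristic: the helper name find_block_markers is hypothetical, and the marker list is copied from the console output above rather than from any documented signature, so it is neither complete nor stable.

```python
# Markers copied from the page source above; illustrative, not exhaustive.
BLOCK_MARKERS = (
    "/TSPD/",
    'window["bobcmn"]',
    "Please enable JavaScript to view the page content",
)


def find_block_markers(page_source):
    """Return the challenge-page markers that occur in page_source, in list order."""
    return [marker for marker in BLOCK_MARKERS if marker in page_source]


# A shortened stand-in for the real page source.
sample = (
    '<script src="/TSPD/example?type=11"></script>'
    "<noscript>Please enable JavaScript to view the page content.</noscript>"
)
print(find_block_markers(sample))
```

With a live driver the call would be find_block_markers(driver.page_source); an empty list means none of these particular markers were found, not that the page is guaranteed to be the real one.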
References
You can find a couple of relevant detailed discussions in:
Can a website detect when you are using selenium with chromedriver?
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection

Selenium and PhantomJS : webpage thinks Javascript is disabled

I am running a small script that mostly checks small things on a website. Today I came across a situation I have never seen before: the webpage I am visiting thinks JavaScript is disabled. This only happens in PhantomJS; it works fine in ChromeDriver. I have even tried changing the driver's headers to ones similar to Chrome's, but still no luck. Is there any way to get this page to work in PhantomJS without having to use ChromeDriver and PyVirtualDisplay? I am running the code on Ubuntu Server and would rather not spend the extra system resources they need. I have also tried running driver.save_screenshot(), but it returns a blank image since no page content is being displayed.
simple code to reproduce the problem:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
service_args = [
'--ignore-ssl-errors=true',
'--ssl-protocol=any'
]
capabilities = dict(DesiredCapabilities.PHANTOMJS)
capabilities['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (X11; Linux x86_64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/60.0.3112.113 Safari/537.36')
driver = webdriver.PhantomJS(desired_capabilities=capabilities, service_args=service_args)
driver.get(EDIT: URL REMOVED)
print(driver.page_source)
html response:
<!DOCTYPE html><html class=" no-blobworkers adownload applicationcache no-audiodata no-webaudio no-audio no-lowbattery no-batteryapi no-battery-api blobconstructor blob-constructor canvas todataurljpeg todataurlpng no-todataurlwebp canvastext contenteditable no-contentsecuritypolicy no-contextmenu cookies cors cssanimations backgroundcliptext bgpositionshorthand bgpositionxy bgrepeatround bgrepeatspace backgroundsize bgsizecover borderimage borderradius boxshadow boxsizing csscalc checked csscolumns cubicbezierrange displayrunin display-runin displaytable display-table ellipsis cssfilters flexbox flexboxlegacy no-flexboxtweener fontface generatedcontent cssgradients hsla lastchild cssmask mediaqueries multiplebgs no-objectfit no-object-fit opacity no-overflowscrolling csspointerevents csspositionsticky no-csspseudoanimations csstransitions no-csspseudotransitions cssreflections regions cssremunit cssresize rgba cssscrollbar shapes siblinggeneral subpixelfont no-supports textshadow csstransforms csstransforms3d userselect cssvhunit cssvmaxunit cssvminunit cssvwunit no-wrapflow no-customprotocolhandler no-dart dataview classlist no-createelementattrs no-createelement-attrs dataset no-microdata draganddrop datalistelem details outputelem progressbar meter ruby no-time no-texttrackapi no-track no-emoji no-strictmode no-contains no-devicemotion no-deviceorientation filereader no-filesystem fileinput formattribute no-localizednumber placeholder no-speechinput no-formvalidation fullscreen gamepads no-geolocation hashchange history no-ie8compat sandbox seamless srcdoc indexeddb json olreversed no-mathml no-lowbandwidth eventsource xhr2 xhrresponsetypearraybuffer xhrresponsetypeblob xhrresponsetypedocument no-xhrresponsetypejson xhrresponsetypetext xhrresponsetype notification pagevisibility performance no-pointerevents no-pointerlock postmessage no-quotamanagement requestanimationframe raf scriptasync scriptdefer localstorage sessionstorage websqldatabase no-stylescoped 
svgclippaths svgfilters inlinesvg smil svg touchevents typedarrays unicode no-userdata no-vibrate no-video no-webintents no-webgl no-getusermedia no-peerconnection websocketsbinary websockets no-framed sharedworkers webworkers no-dataworkers no-exiforientation no-apng no-webplossless no-webp svgasimg datauri" style=""><head>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<meta http-equiv="CacheControl" content="no-cache">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="shortcut icon" href="data:;base64,iVBORw0KGgo=">
<script>
(function(){
var securemsg;
var dosl7_common;
window["bobcmn"] = "111111101010102000000052000000052000000002324c533c200000096300000000300000000300000006/TSPD/300000008TSPD_101300000005https200000000200000000";
window.cML=!!window.cML;try{(function(){try{var jj,Jj,Lj=1,sj=1,Sj=1,ij=1;for(var Ij=0;Ij<Jj;++Ij)Lj+=2,sj+=2,Sj+=2,ij+=3;jj=Lj+sj+Sj+ij;window.JO===jj&&(window.JO=++jj)}catch(OJ){window.JO=jj}var zJ=!0;function ZJ(J){J&&(zJ=!1,document.cookie="brav=ad");return zJ}function iJ(){}ZJ(window[iJ.name]===iJ);ZJ("function"!==typeof ie9rgb4);ZJ(/\x3c/.test(function(){return"\x3c"})&!/x3d/.test(function(){return"'x3'+'d';"}));
var IJ=window.attachEvent||/mobi/i.test(window["\x6e\x61vi\x67a\x74\x6f\x72"]["\x75\x73e\x72A\x67\x65\x6et"]),ol=+new Date+6E5,_l,Il,jL=setTimeout,JL=IJ?3E4:6E3;function LL(){if(!document.querySelector)return!0;var J=+new Date,O=J>ol;if(O)return ZJ(!1);O=Il&&_l+JL<J;O=ZJ(O);_l=J;Il||(Il=!0,jL(function(){Il=!1},1));return O}LL();var OL=[17795081,27611931586,1558153217];
function zL(J){J="string"===typeof J?J:J.toString(36);var O=window[J];if(!O.toString)return;var Z=""+O;window[J]=function(J,Z){Il=!1;return O(J,Z)};window[J].toString=function(){return Z}}for(var sL=0;sL<OL.length;++sL)zL(OL[sL]);ZJ(!1!==window.cML);
(function(){var J=-1,J={_:++J,oZ:"false"[J],J:++J,Il:"false"[J],oj:++J,L0:"[object Object]"[J],IL:(J[J]+"")[J],JL:++J,iL:"true"[J],Jj:++J,Zj:++J,OZ:"[object Object]"[J],S:++J,ij:++J,oLj:++J,LLj:++J};try{J.il=(J.il=J+"")[J.Zj]+(J.Ol=J.il[J.J])+(J.LZ=(J.ol+"")[J.J])+(!J+"")[J.JL]+(J.zl=J.il[J.S])+(J.ol="true"[J.J])+(J.Jo="true"[J.oj])+J.il[J.Zj]+J.zl+J.Ol+J.ol,J.LZ=J.ol+"true"[J.JL]+J.zl+J.Jo+J.ol+J.LZ,J.ol=J._[J.il][J.il],J.ol(J.ol(J.LZ+'"\\'+J.J+J.Zj+J.J+J.oZ+"\\"+J.Jj+J._+"("+J.zl+"\\"+J.J+J.ij+
J.J+"\\"+J.J+J.S+J._+J.iL+J.Ol+J.oZ+"\\"+J.Jj+J._+"\\"+J.J+J.S+J.ij+"\\"+J.J+J.Zj+J.J+"\\"+J.J+J.Zj+J.S+J.IL+J.Ol+"\\"+J.J+J.S+J.ij+"['\\"+J.J+J.S+J._+J.Il+"\\"+J.J+J.ij+J.J+"false"[J.oj]+J.Ol+J.Il+J.IL+"']\\"+J.Jj+J._+"===\\"+J.Jj+J._+"'\\"+J.J+J.S+J.JL+J.zl+"\\"+J.J+J.S+J.oj+"\\"+J.J+J.Zj+J.J+"\\"+J.J+J.Zj+J.S+"\\"+J.J+J.Jj+J.ij+"')\\"+J.Jj+J._+"{\\"+J.J+J.oj+"\\"+J.J+J.J+"\\"+J.J+J.S+J.S+J.Il+"\\"+J.J+J.S+J.oj+"\\"+J.Jj+J._+J.iL+J.IL+"\\"+J.J+J.S+J.S+J.OZ+"\\"+J.J+J.ij+J.J+J.Jo+"\\"+J.J+J.Zj+J.oj+
"\\"+J.J+J.Zj+J.JL+"\\"+J.J+J.S+J._+"\\"+J.Jj+J._+"=\\"+J.Jj+J._+"\\"+J.J+J.S+J.ij+"\\"+J.J+J.Zj+J.J+"\\"+J.J+J.Zj+J.S+J.IL+J.Ol+"\\"+J.J+J.S+J.ij+"['\\"+J.J+J.S+J._+J.Il+"\\"+J.J+J.ij+J.J+"false"[J.oj]+J.Ol+J.Il+J.IL+"'].\\"+J.J+J.S+J.oj+J.iL+"\\"+J.J+J.S+J._+"false"[J.oj]+J.Il+J.OZ+J.iL+"(/.{"+J.J+","+J.Jj+"}/\\"+J.J+J.Jj+J.ij+",\\"+J.Jj+J._+J.oZ+J.Jo+"\\"+J.J+J.Zj+J.S+J.OZ+J.zl+"\\"+J.J+J.Zj+J.J+J.Ol+"\\"+J.J+J.Zj+J.S+"\\"+J.Jj+J._+"(\\"+J.J+J.ij+J._+")\\"+J.Jj+J._+"{\\"+J.J+J.oj+"\\"+J.J+J.J+
"\\"+J.J+J.J+"\\"+J.J+J.J+"\\"+J.J+J.S+J.oj+J.iL+J.zl+J.Jo+"\\"+J.J+J.S+J.oj+"\\"+J.J+J.Zj+J.S+"\\"+J.Jj+J._+"(\\"+J.J+J.ij+J._+"\\"+J.Jj+J._+"+\\"+J.Jj+J._+"\\"+J.J+J.ij+J._+").\\"+J.J+J.S+J.JL+J.Jo+J.L0+"\\"+J.J+J.S+J.JL+J.zl+"\\"+J.J+J.S+J.oj+"("+J.oj+",\\"+J.Jj+J._+J.Jj+")\\"+J.J+J.oj+"\\"+J.J+J.J+"\\"+J.J+J.J+"});\\"+J.J+J.oj+"}\\"+J.J+J.oj+'"')())()}catch(O){J%=5}})();var SL=82;window.SZ={IZ:"0895a966bc0180002d019416d74a2e28d1f538ef3103146592d8c25a73dda892c7c585714f95500ba8b6beac1b79be4a3d61e8b7a80de2ffe8aa17af5acaa722530af851815bcaab86168951dee7b2ac8413c027a687d99e48318f014124304bb906e86573dd8e328c3b24cadaf832eea48f8634b3c6e9a0f49eee5235a376e326e984f99d888c10"};function l(J){return 812>J}
function L(J){var O=arguments.length,Z=[];for(var S=1;S<O;++S)Z.push(arguments[S]-J);return String.fromCharCode.apply(String,Z)}function z(J,O){J+=O;return J.toString(36)}(function(J){J||setTimeout(function(){if(!LL())return;var J=setTimeout(function(){},250);for(var Z=0;Z<=J;++Z)clearTimeout(Z);LL()},500)})(zJ);})();}catch(x){document.cookie='brav=oex'+x;}finally{ie9rgb4=void(0);};function ie9rgb4(a,b){return a>>b>>0};
})();
</script>
<script type="text/javascript" src="/TSPD/08e841a5c5ab20002a3554b194594e5f3375d2f994ac4de334932487e4817509e84bbe3658582b13?type=9"></script>
<noscript>Please enable JavaScript to view the page content.</noscript>
</head><body>
</body></html>
EDIT: Yes, I understand that PhantomJS is old, but it's what we have to use. My question is about getting something to work in PhantomJS, not about what alternatives are available. All of our servers run Ubuntu Server, including all server images, so we have to use headless browsing. Virtual Displays, such as PyVirtualDisplay and any other Xvfb routing method are too heavy on system resources. All of our codebase uses PhantomJS so as of right now I have to use it. As well, we use proxies with username and password authentication which Chrome has not supported, so the Headless Chrome option is out.
I also just tested this code with headless Chrome, and it does not work either.
python code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path='/path/to/chromedriver', chrome_options=chrome_options)
driver.get("https://www.EDIT-REMOVED.com")
print(driver.page_source)
driver.quit()
html response:
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>

Can not find out the source of data I need when crawling website

I am writing a web crawler in Python. I ran into a problem while trying to find the source of the data I need.
The site I am crawling is: https://www.whoscored.com/Regions/252/Tournaments/2/England-Premier-League, and the data I want is as below:
I can find this data by browsing the page source after the page has been fully loaded by Firefox:
DataStore.prime('standings', { stageId:15151, idx:0, field: 'overall'}, [[15151,32,'Manchester United',1,5,4,1,0,16,2,14,13,1,3,3,0,0,10,0,10,9,7,2,1,1,0,6,2,4,4,[[0,1190179,4,0,2,252,'England',2,'Premier League','2017/2018',32,29,'Manchester United','West Ham','Manchester United','West Ham',4,0,'w'] ......
I thought this data would be requested through AJAX, but I detected no such request in the web console.
Then I simulated the browser's behaviour (setting headers and cookies) and requested the HTML page, which returned:
<html>
<head>
<META NAME="robots" CONTENT="noindex,nofollow">
<script src="/_Incapsula_Resource?SWJIYLWA=2977d8d74f63d7f8fedbea018b7a1d05">
</script>
<script>
(function() {
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B7661722073746174757......";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
</script></head>
<body>
<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>
</body></html>
I created an .html file with the content above and opened it with Firefox, but it seems that the script was not executed. Now I don't know what to do. I need some help, thanks!
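The inline script in that response stores JavaScript as a hex string in b and converts it two hex digits at a time into character codes before passing the result to eval. As a rough sketch, the visible part of that string can be decoded in Python to see how the hidden script begins. Only the truncated prefix shown above is used (the rest is elided in the post), and the helper name decode_hex_prefix is hypothetical.

```python
# Truncated prefix exactly as shown in the response above; the rest is elided there.
HEX_PREFIX = (
    "7472797B766172207868723B76617220743D6E6577204461746528292E"
    "67657454696D6528293B7661722073746174757"
)


def decode_hex_prefix(hex_text):
    """Decode pairs of hex digits into characters, ignoring a trailing odd digit."""
    usable = hex_text[: len(hex_text) - len(hex_text) % 2]
    return bytes.fromhex(usable).decode("latin-1")


print(decode_hex_prefix(HEX_PREFIX))
# try{var xhr;var t=new Date().getTime();var statu
```

The decoded fragment starts by declaring a variable named xhr and taking a timestamp, which suggests the page expects a real browser to run the script and report back to the site before the content is served. The script URL /_Incapsula_Resource is site-relative, so a copy of the HTML opened from a local file cannot load it, which would be consistent with the script appearing not to run.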

Looking for solution for an automatically scrolling information board

Right now, my company has about 40 information boards scattered throughout the buildings with information relevant to each area. Each unit is a small Linux-based device that is programmed to launch an RDP session, log in with a user name and password, pull up the appropriate PowerPoint presentation and start playing it. The boards go down every 4 hours for about 5 minutes, copy over a new version of a presentation (if applicable) and restart.
We now have "demands" for live data. Unfortunately I believe PowerPoint will no longer be an option, as we use the PowerPoint viewer, which does not support plugins. I wanted to use Google Slides, but we also have the restriction that we cannot use a public-facing service like Google Drive, so there goes that idea.
I was thinking of some kind of way to launch a web browser and have it rotate through a list of specified webpages (perhaps stored in a txt or csv file). I found a way to launch Firefox and have it autologin to OBIEE via python:
#source: http://obitool.blogspot.com/2012/12/automatic-login-script-for-obiee-11g_12.html
import unittest
from selenium import webdriver
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile
# Hardcoding this information is usually not a good Idea!
user = '' # Put your user name here
password = '' # Put your password here
serverName = '' # Put Host name here
class OBIEE11G(unittest.TestCase):
    def setUp(self):
        # Create a new profile
        self.fp = webdriver.FirefoxProfile()
        self.fp.set_preference("browser.download.folderList",2)
        self.fp.set_preference("browser.download.manager.showWhenStarting",False)
        # Associate the profile with the Firefox selenium session
        self.driver = webdriver.Firefox(firefox_profile=self.fp)
        self.driver.implicitly_wait(2)
        # Build the Analytics url and save it for the future
        self.base_url = "http://" + serverName + ":9704/analytics"
    def login(self):
        # Retrieve the driver variables created in setUp
        driver = self.driver
        # Go to the login page
        driver.get(self.base_url + "/")
        # The 11G login page has the following elements on it
        driver.find_element_by_id("sawlogonuser").clear()
        driver.find_element_by_id("sawlogonuser").send_keys(user)
        driver.find_element_by_id("sawlogonpwd").clear()
        driver.find_element_by_id("sawlogonpwd").send_keys(password)
        driver.find_element_by_id("idlogon").click()
    def test_OBIEE11G(self):
        self.login()
        #
if __name__ == "__main__":
    unittest.main()
If I can use this, I would just need a way to rotate to a new webpage every 30 seconds. Any ideas / recommendations?

You could put a simple javascript snippet on each page that waits a specified time then redirects to the new page. This has the advantage of simple implementation, however it may be annoying to maintain this over many html files.
The other option is to write your copy in a markdown file, then have a single html page that rotates through a list of files and renders and displays the markdown. You would then update the data by rewriting the markdown files. It wouldn't be exactly live, but if 30 second resolution is ok you can get away with it. Something like this for the client code:
HTML
<!DOCTYPE html>
<html lang="en">
<head>
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Message Board</title>
<link rel="stylesheet" type="text/css" href="/css/style.css">
</head>
<body>
<div id="content" class="container"></div>
<!-- jQuery -->
<script src="//code.jquery.com/jquery-2.1.4.min.js" defer></script>
<!-- markdown compiler https://github.com/chjj/marked/ -->
<script src="/js/marked.min.js" defer></script>
<script src="/js/main.js" defer></script>
</body>
</html>
And the javascript
// main.js
// the markdown files
var sites = ["http://mysites.com/site1.md", "http://mysites.com/site2.md", "http://mysites.com/site3.md", "http://mysites.com/site4.md"]
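function render(sites) {
  window.scrollTo(0,0);
  //get first element and rotate list of sites
  var file = sites.shift();
  sites.push(file);
  $.ajax({
    url:file,
    success: function(data) { $("#content").html(marked(data)); },
    cache: false
  });
  // pass a function to setTimeout; writing render(sites) here would call it
  // immediately and recurse without ever waiting
  setTimeout(function() { render(sites); }, 30000);
}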
// start the loop
render(sites);
You can then use any method you would like to write out the markdown files.
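If you would rather keep the Selenium approach from the question, the rotation itself can be sketched in Python as below. This is a minimal sketch under assumptions: rotate_pages, its parameters and the example URLs are hypothetical names, and the function that opens a page is passed in, so with a real driver it would be driver.get.

```python
import itertools
import time


def rotate_pages(urls, open_url, interval_seconds=30, max_pages=None, sleep=time.sleep):
    """Open each URL in turn, pausing between pages.

    Runs forever when max_pages is None; otherwise stops after max_pages pages.
    Returns the number of pages opened.
    """
    shown = 0
    for url in itertools.cycle(urls):
        if max_pages is not None and shown >= max_pages:
            break
        open_url(url)
        shown += 1
        sleep(interval_seconds)
    return shown


# Demonstration with a stand-in for driver.get and no real waiting.
visited = []
rotate_pages(
    ["http://board.example/page1", "http://board.example/page2"],
    visited.append,
    max_pages=5,
    sleep=lambda seconds: None,
)
print(visited)
```

On a board this would be called as rotate_pages(urls, driver.get) after logging in, with urls read from the text or CSV file mentioned in the question, for example one URL per line.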
