Web Scraping Blocked by Robots Meta Directives

Web Scraping Blocked by Robots Meta Directives - python

I am working on a web scraper to access scheduling data from a website. Our company has full access to this website and data via login credentials. With dynamic site navigation required, I am using Selenium for automated data scraping, Python, and BeautifulSoup to work with the HTML structure. With all variables defined, I have the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import lxml.html as lh
opt = Options()
opt.headless = True
driver = webdriver.Chrome(options=opt, executable_path=<path to chromedriver.exe>)
driver = webdriver.Chrome(<path to chromedriver.exe>)
driver.get(<website login page URL>?username=' + username + '&password=' + password)
driver.get(<url of website page with data>?start_date=' + start_date + '&end_date=' + end_date +'&type=Excel')
soup = BeautifulSoup(driver.page_source, 'lxml')
print(soup)
The result of the print(soup) is as follows:
<html style="height:100%">
<head>
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="telephone=no" name="format-detection"/>
<meta content="initial-scale=1.0" name="viewport"/>
<meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
</head>
<body> ... irrelevant ... </body></html>
Before any questions, I do not have much knowledge regarding robot or HTTP requests. My questions are:
When I run a headless driver as above, the scrape is blocked by robots. When I run a regular, non-headless driver where an automated browser opens, the scrape is successful. Why is this the case?
What is the best method to get around this? The scrape is legal and non-exploitive as we practically have full access to the data we are scraping (we are a registered client). Will using the requests library solve this problem? Are there other methods of running headless web drivers that won't get blocked? Is there some parameter I can change that prevents the block?
How do I see the robots.txt file of a website?

you can use the following code to hide the webdriver
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
also, add this to your chromedriver options
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_experimental_option('useAutomationExtension', False)

Related

Splinter Iframe form filling

i am using splinter to do some automatic testing, i am stuck in filling a file input in an iframe using splinter with python.
This is that actual html form of the iframe
<iframe><html lang="en-US"><head>
<title>Website</title>
<meta charset="utf-8"></head><body><form> <input type="file" name="artwork" id="file-upload" accept=".png,.jpg"></form></body></html></iframe>
This is the actual python code i did
from splinter import Browser
browser = Browser()
browser.driver.maximize_window()
browser.driver.implicitly_wait(10)
browser.visit('https://website.com')
with browser.get_iframe(1) as iframe:
iframe.attach_file('artwork', 'C:\\Users\\design\\upload.png')
After execution i have this error
splinter.exceptions.ElementDoesNotExist: no elements could be found with name "artwork"
Can you help me i really don't know why it's not working

Selenium firewall issue "The requested URL was rejected.[...]" [duplicate]

I did several hours of research and asked a bunch of people on fiverr who all couldn't solve a a specific problem I have.
I installed Selenium and tried to access a Website. Unfortunately the site won't allow a specific request and doesn't load the site at all. However, if I try to access the website with my "normal" Chrome Browser, it works fine.
I tried several things such as:
Different IP's
Deleting Cookies
Incognito Mode
Adding different UserAgents
Hiding features which might reveal that a Webdriver is being used
Nothing helped.
Here is a Screenshot of the Error I'm receiving:
And here is the very simple script I'm using:
# coding: utf8
from selenium import webdriver
url = 'https://registrierung.gmx.net/'
# Open ChromeDriver
driver = webdriver.Chrome();
# Open URL
driver.get(url)
If anyone has a solution for that I would highly appreciate it.
I'm also willing to give a big tip if someone could help me out here.
Thanks a lot!
Stay healthy everyone.

I took your code modified with a couple of arguments and executed the test. Here are the observations:
Code Block:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://registrierung.gmx.net/")
print(driver.page_source)
Console Output:
<html style="" class=" adownload applicationcache blobconstructor blob-constructor borderimage borderradius boxshadow boxsizing canvas canvastext checked classlist contenteditable no-contentsecuritypolicy no-contextmenu cors cssanimations csscalc csscolumns cssfilters cssgradients cssmask csspointerevents cssreflections cssremunit cssresize csstransforms3d csstransforms csstransitions cssvhunit cssvmaxunit cssvminunit cssvwunit dataset details deviceorientation displaytable display-table draganddrop fileinput filereader filesystem flexbox fullscreen geolocation getusermedia hashchange history hsla indexeddb inlinesvg json lastchild localstorage no-mathml mediaqueries meter multiplebgs notification objectfit object-fit opacity pagevisibility performance postmessage progressbar no-regions requestanimationframe raf rgba ruby scriptasync scriptdefer sharedworkers siblinggeneral smil no-strictmode no-stylescoped supports svg svgfilters textshadow no-time no-touchevents typedarrays userselect webaudio webgl websockets websqldatabase webworkers datalistelem video svgasimg datauri no-csshyphens"><head>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<meta http-equiv="CacheControl" content="no-cache">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="shortcut icon" href="data:;base64,iVBORw0KGgo=">
<script type="text/javascript">
(function(){
window["bobcmn"] = "10111111111010200000005200000005200000006200000001249d50ae8200000096200000000200000002300000000300000000300000006/TSPD/300000008TSPD_10130000000cTSPD_101_DID300000005https3000000b0082f871fb6ab200097a0a5b9e04f342a8fdfa6e9e63434256f3f63e9b3885e118fdacf66cc0a382208ea9dc3b70a28002d902f95eb5ac2e5d23ffe409bb24b4c57f9cb8e1a5db4bcad517230d966c75d327f561cc49e16f4300000002TS200000000200000000";
.
.
<script type="text/javascript" src="/TSPD/082f871fb6ab20009afc88ee053e87fea57bf47d9659e73d0ea3c46c77969984660358739f3d19d0?type=11"></script>
<script type="text/javascript">
(function(){
window["blobfp"] = "01010101b00400000100e803000000000d4200623938653464333234383463633839323030356632343563393735363433343663666464633135393536643461353031366131633362353762643466626238663337210068747470733a2f2f72652e73656375726974792e66356161732e636f6d2f72652f0700545350445f3734";window["slobfp"] = "08c3194e510b10009a08af8b7ee6860a22b5726420e697e4";
})();
</script>
<script type="text/javascript" src="/TSPD/082f871fb6ab20009afc88ee053e87fea57bf47d9659e73d0ea3c46c77969984660358739f3d19d0?type=12"></script>
<noscript>Please enable JavaScript to view the page content.<br/>Your support ID is: 11993951574422772310.</noscript>
</head><body>
<style>canvas {display:none;}</style><canvas width="800" height="600"></canvas></body></html>
Browser Snapshot:
Conclusion
From the Page Source it's quite clear that Selenium driven ChromeDriver initiated google-chrome Browsing Context gets detected and the navigation is blocked.
I could have dug deeper and provide some more insights but suprisingly now even manually I am unable to access the webpage. Possibly my IP is black-listed now. Once my IP gets whitelisted I will provide more details.
References
You can find a couple of relevant detailed discussions in:
Can a website detect when you are using selenium with chromedriver?
Selenium webdriver: Modifying navigator.webdriver flag to prevent selenium detection

How to Extract the data from html

I was trying to use beautifulsoup4 with python to scrape a certain website. However, when I tried to see contents from the URL, it only gives me a header part and doesn't give me a body part that I want to use.
URL = "url"
URL_page = requests.get(URL)
print(URL_page.text)
this gives me
<!DOCTYPE html>
<html>
<head>
"Contents of Header"
</head>
<body>
<div id='root'></div>
</body>
</html>
there should be contents inside the body tag but it shows nothing.
the original html of this web page is looks like
<html xmlns:wb="http://open.weibo.com/wb" style>
▶<head...</head> ← ONLY GIVES ME THIS
▶<body data-loaded="true">...</body> ← I NEED THIS PART
</html>

It's hard to provide a working answer without a working URL, but your question does provide some clues.
For one, you say you receive this in the response from a GET:
<body>
But then you see this in a web browser:
<body data-loaded="true">
This suggests that the page has JavaScript code running that continues loading and constructing the page after the initial page has been loaded.
There's no way using requests or bs4 or something of the sort to get around that. You could check what request follows the initial page load that has the actual content (it may be another piece of html, some json, etc.) and use that request to get the content instead. If you want to try that, try opening the developer tools in a good browser and look at the network tab while the page is loading, you'll see all the requests and one of them may contain the content you're after.
But if you need the html after rendering, as rendered by the script, you can try using a JavaScript capable browser from Python, like Chrome driven through the Selenium Chrome webdriver:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get("https://your.url/here")
elem = driver.find_element_by_tag_name('body')
print(elem.text)
Note that you'll need to install Selenium and need to get a copy of the appropriate driver, like chromedriver.exe. Add it to your virtual environment:
install selenium pip install selenium
install the appropriate browser driver, for example ChromeDriver, from here: https://sites.google.com/a/chromium.org/chromedriver/home(drop the executable in your script folder)

No idea what exactly you are after or want as an output. But you can access the json response from ajax:
import pandas as pd
import requests
url='https://www.pixiv.net/ajax/user/14792128/profile/all?lang=en'
jsonData = requests.get(url).json()
data = jsonData['body']['mangaSeries']
df = pd.DataFrame(data)

I think, you should use 'user-agent'.you can try it:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT x.y; Win64; x64; rv:10.0) Gecko/20100101 Firefox/10.0 '}
url = "https://www.pixiv.net/en/users/14792128"
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'lxml')
print(soup.prettify())

Selenium and PhantomJS : webpage thinks Javascript is disabled

I am running a small script that is mostly checking small things on a website. Today I've come across a really interesting situation I've never seen before, which is the webpage i'm going to thinks Javascript is disabled. This is only happening in PhantomJS, but works fine in Chromedriver. I've even tried changing the driver's headers to ones similiar to Chrome, but still no luck. Is there anyway to get this page to work in PhantomJS without having to use ChromeDriver and PyVirtualDisplay? i'm running the code on Ubuntu Server and would rather not use the extra system resources of having to use them. I've also tried running driver.save_screenshot(), but it's returning a blank image since there is no content of the page being displayed.
simple code to reproduce the problem:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
service_args = [
'--ignore-ssl-errors=true',
'--ssl-protocol=any'
]
capabilities = dict(DesiredCapabilities.PHANTOMJS)
capabilities['phantomjs.page.settings.userAgent'] = ('Mozilla/5.0 (X11; Linux x86_64) '
'AppleWebKit/537.36 (KHTML, like Gecko) '
'Chrome/60.0.3112.113 Safari/537.36)')
driver = webdriver.PhantomJS(desired_capabilities=capabilities, service_args=service_args)
driver.get(EDIT: URL REMOVED)
print driver.page_source
html response:
<!DOCTYPE html><html class=" no-blobworkers adownload applicationcache no-audiodata no-webaudio no-audio no-lowbattery no-batteryapi no-battery-api blobconstructor blob-constructor canvas todataurljpeg todataurlpng no-todataurlwebp canvastext contenteditable no-contentsecuritypolicy no-contextmenu cookies cors cssanimations backgroundcliptext bgpositionshorthand bgpositionxy bgrepeatround bgrepeatspace backgroundsize bgsizecover borderimage borderradius boxshadow boxsizing csscalc checked csscolumns cubicbezierrange displayrunin display-runin displaytable display-table ellipsis cssfilters flexbox flexboxlegacy no-flexboxtweener fontface generatedcontent cssgradients hsla lastchild cssmask mediaqueries multiplebgs no-objectfit no-object-fit opacity no-overflowscrolling csspointerevents csspositionsticky no-csspseudoanimations csstransitions no-csspseudotransitions cssreflections regions cssremunit cssresize rgba cssscrollbar shapes siblinggeneral subpixelfont no-supports textshadow csstransforms csstransforms3d userselect cssvhunit cssvmaxunit cssvminunit cssvwunit no-wrapflow no-customprotocolhandler no-dart dataview classlist no-createelementattrs no-createelement-attrs dataset no-microdata draganddrop datalistelem details outputelem progressbar meter ruby no-time no-texttrackapi no-track no-emoji no-strictmode no-contains no-devicemotion no-deviceorientation filereader no-filesystem fileinput formattribute no-localizednumber placeholder no-speechinput no-formvalidation fullscreen gamepads no-geolocation hashchange history no-ie8compat sandbox seamless srcdoc indexeddb json olreversed no-mathml no-lowbandwidth eventsource xhr2 xhrresponsetypearraybuffer xhrresponsetypeblob xhrresponsetypedocument no-xhrresponsetypejson xhrresponsetypetext xhrresponsetype notification pagevisibility performance no-pointerevents no-pointerlock postmessage no-quotamanagement requestanimationframe raf scriptasync scriptdefer localstorage sessionstorage websqldatabase no-stylescoped svgclippaths svgfilters inlinesvg smil svg touchevents typedarrays unicode no-userdata no-vibrate no-video no-webintents no-webgl no-getusermedia no-peerconnection websocketsbinary websockets no-framed sharedworkers webworkers no-dataworkers no-exiforientation no-apng no-webplossless no-webp svgasimg datauri" style=""><head>
<meta http-equiv="Pragma" content="no-cache">
<meta http-equiv="Expires" content="-1">
<meta http-equiv="CacheControl" content="no-cache">
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<link rel="shortcut icon" href="data:;base64,iVBORw0KGgo=">
<script>
(function(){
var securemsg;
var dosl7_common;
window["bobcmn"] = "111111101010102000000052000000052000000002324c533c200000096300000000300000000300000006/TSPD/300000008TSPD_101300000005https200000000200000000";
window.cML=!!window.cML;try{(function(){try{var jj,Jj,Lj=1,sj=1,Sj=1,ij=1;for(var Ij=0;Ij<Jj;++Ij)Lj+=2,sj+=2,Sj+=2,ij+=3;jj=Lj+sj+Sj+ij;window.JO===jj&&(window.JO=++jj)}catch(OJ){window.JO=jj}var zJ=!0;function ZJ(J){J&&(zJ=!1,document.cookie="brav=ad");return zJ}function iJ(){}ZJ(window[iJ.name]===iJ);ZJ("function"!==typeof ie9rgb4);ZJ(/\x3c/.test(function(){return"\x3c"})&!/x3d/.test(function(){return"'x3'+'d';"}));
var IJ=window.attachEvent||/mobi/i.test(window["\x6e\x61vi\x67a\x74\x6f\x72"]["\x75\x73e\x72A\x67\x65\x6et"]),ol=+new Date+6E5,_l,Il,jL=setTimeout,JL=IJ?3E4:6E3;function LL(){if(!document.querySelector)return!0;var J=+new Date,O=J>ol;if(O)return ZJ(!1);O=Il&&_l+JL<J;O=ZJ(O);_l=J;Il||(Il=!0,jL(function(){Il=!1},1));return O}LL();var OL=[17795081,27611931586,1558153217];
function zL(J){J="string"===typeof J?J:J.toString(36);var O=window[J];if(!O.toString)return;var Z=""+O;window[J]=function(J,Z){Il=!1;return O(J,Z)};window[J].toString=function(){return Z}}for(var sL=0;sL<OL.length;++sL)zL(OL[sL]);ZJ(!1!==window.cML);
(function(){var J=-1,J={_:++J,oZ:"false"[J],J:++J,Il:"false"[J],oj:++J,L0:"[object Object]"[J],IL:(J[J]+"")[J],JL:++J,iL:"true"[J],Jj:++J,Zj:++J,OZ:"[object Object]"[J],S:++J,ij:++J,oLj:++J,LLj:++J};try{J.il=(J.il=J+"")[J.Zj]+(J.Ol=J.il[J.J])+(J.LZ=(J.ol+"")[J.J])+(!J+"")[J.JL]+(J.zl=J.il[J.S])+(J.ol="true"[J.J])+(J.Jo="true"[J.oj])+J.il[J.Zj]+J.zl+J.Ol+J.ol,J.LZ=J.ol+"true"[J.JL]+J.zl+J.Jo+J.ol+J.LZ,J.ol=J._[J.il][J.il],J.ol(J.ol(J.LZ+'"\\'+J.J+J.Zj+J.J+J.oZ+"\\"+J.Jj+J._+"("+J.zl+"\\"+J.J+J.ij+
J.J+"\\"+J.J+J.S+J._+J.iL+J.Ol+J.oZ+"\\"+J.Jj+J._+"\\"+J.J+J.S+J.ij+"\\"+J.J+J.Zj+J.J+"\\"+J.J+J.Zj+J.S+J.IL+J.Ol+"\\"+J.J+J.S+J.ij+"['\\"+J.J+J.S+J._+J.Il+"\\"+J.J+J.ij+J.J+"false"[J.oj]+J.Ol+J.Il+J.IL+"']\\"+J.Jj+J._+"===\\"+J.Jj+J._+"'\\"+J.J+J.S+J.JL+J.zl+"\\"+J.J+J.S+J.oj+"\\"+J.J+J.Zj+J.J+"\\"+J.J+J.Zj+J.S+"\\"+J.J+J.Jj+J.ij+"')\\"+J.Jj+J._+"{\\"+J.J+J.oj+"\\"+J.J+J.J+"\\"+J.J+J.S+J.S+J.Il+"\\"+J.J+J.S+J.oj+"\\"+J.Jj+J._+J.iL+J.IL+"\\"+J.J+J.S+J.S+J.OZ+"\\"+J.J+J.ij+J.J+J.Jo+"\\"+J.J+J.Zj+J.oj+
"\\"+J.J+J.Zj+J.JL+"\\"+J.J+J.S+J._+"\\"+J.Jj+J._+"=\\"+J.Jj+J._+"\\"+J.J+J.S+J.ij+"\\"+J.J+J.Zj+J.J+"\\"+J.J+J.Zj+J.S+J.IL+J.Ol+"\\"+J.J+J.S+J.ij+"['\\"+J.J+J.S+J._+J.Il+"\\"+J.J+J.ij+J.J+"false"[J.oj]+J.Ol+J.Il+J.IL+"'].\\"+J.J+J.S+J.oj+J.iL+"\\"+J.J+J.S+J._+"false"[J.oj]+J.Il+J.OZ+J.iL+"(/.{"+J.J+","+J.Jj+"}/\\"+J.J+J.Jj+J.ij+",\\"+J.Jj+J._+J.oZ+J.Jo+"\\"+J.J+J.Zj+J.S+J.OZ+J.zl+"\\"+J.J+J.Zj+J.J+J.Ol+"\\"+J.J+J.Zj+J.S+"\\"+J.Jj+J._+"(\\"+J.J+J.ij+J._+")\\"+J.Jj+J._+"{\\"+J.J+J.oj+"\\"+J.J+J.J+
"\\"+J.J+J.J+"\\"+J.J+J.J+"\\"+J.J+J.S+J.oj+J.iL+J.zl+J.Jo+"\\"+J.J+J.S+J.oj+"\\"+J.J+J.Zj+J.S+"\\"+J.Jj+J._+"(\\"+J.J+J.ij+J._+"\\"+J.Jj+J._+"+\\"+J.Jj+J._+"\\"+J.J+J.ij+J._+").\\"+J.J+J.S+J.JL+J.Jo+J.L0+"\\"+J.J+J.S+J.JL+J.zl+"\\"+J.J+J.S+J.oj+"("+J.oj+",\\"+J.Jj+J._+J.Jj+")\\"+J.J+J.oj+"\\"+J.J+J.J+"\\"+J.J+J.J+"});\\"+J.J+J.oj+"}\\"+J.J+J.oj+'"')())()}catch(O){J%=5}})();var SL=82;window.SZ={IZ:"0895a966bc0180002d019416d74a2e28d1f538ef3103146592d8c25a73dda892c7c585714f95500ba8b6beac1b79be4a3d61e8b7a80de2ffe8aa17af5acaa722530af851815bcaab86168951dee7b2ac8413c027a687d99e48318f014124304bb906e86573dd8e328c3b24cadaf832eea48f8634b3c6e9a0f49eee5235a376e326e984f99d888c10"};function l(J){return 812>J}
function L(J){var O=arguments.length,Z=[];for(var S=1;S<O;++S)Z.push(arguments[S]-J);return String.fromCharCode.apply(String,Z)}function z(J,O){J+=O;return J.toString(36)}(function(J){J||setTimeout(function(){if(!LL())return;var J=setTimeout(function(){},250);for(var Z=0;Z<=J;++Z)clearTimeout(Z);LL()},500)})(zJ);})();}catch(x){document.cookie='brav=oex'+x;}finally{ie9rgb4=void(0);};function ie9rgb4(a,b){return a>>b>>0};
})();
</script>
<script type="text/javascript" src="/TSPD/08e841a5c5ab20002a3554b194594e5f3375d2f994ac4de334932487e4817509e84bbe3658582b13?type=9"></script>
<noscript>Please enable JavaScript to view the page content.</noscript>
</head><body>
</body></html>
EDIT: Yes, I understand that PhantomJS is old, but it's what we have to use. My question is about getting something to work in PhantomJS, not about what alternatives are available. All of our servers run Ubuntu Server, including all server images, so we have to use headless browsing. Virtual Displays, such as PyVirtualDisplay and any other Xvfb routing method are too heavy on system resources. All of our codebase uses PhantomJS so as of right now I have to use it. As well, we use proxies with username and password authentication which Chrome has not supported, so the Headless Chrome option is out.
As well, i just tested this code with Headless Chrome and it also is not working.
python code:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
driver = webdriver.Chrome(executable_path='/path/to/chromedriver', chrome_options=chrome_options)
driver.get("https://www.EDIT-REMOVED.com")
print driver.page_source
driver.quit()
html response:
<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body></body></html>

Why am I unable to download a zip file with selenium?

I am trying to use selenium in order to try to download a testfile from a html webpage. Here is the complete html page I have been using as test object:
<!DOCTYPE html>
<html>
<head>
<title>Testpage</title>
</head>
<body>
<div id="page">
Example Download link:
Download this testzip
</div>
</body>
</html>
which I put in the current working directory, along with some example zip file renamed to testzip.zip.
My selenium code looks as follows:
profile = webdriver.FirefoxProfile()
profile.set_preference("browser.download.dir", "/tmp")
profile.set_preference("browser.download.folderList", 2)
profile.set_preference("browser.download.manager.showWhenStarting", False )
profile.set_preference("pdfjs.disabled", True ) profile.set_preference("browser.helperApps.neverAsk.saveToDisk","application/zip")
profile.set_preference("plugin.disable_full_page_plugin_for_types", "application/zip")
browser = webdriver.Firefox(profile)
browser.get('file:///home/path/to/html/testweb.html')
browser.find_element_by_xpath('//a[contains(text(), "Download this testzip")]').click()
However, if I run the test (with nosetest for example), a browser is being opened, but after that nothing happens. No error message and no download, it just seems to 'hang'.
Any idea on how to fix this?

You are not setting up a real web server. You just have a html page but not a server to serve static files. You need to at least setup a server first.
But if your question is just related to download files, you can just use some international web site to test. It will work.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Web Scraping Blocked by Robots Meta Directives - python

Related

Splinter Iframe form filling

Selenium firewall issue "The requested URL was rejected.[...]" [duplicate]

How to Extract the data from html

Selenium and PhantomJS : webpage thinks Javascript is disabled

Why am I unable to download a zip file with selenium?

Categories

Resources