If you go to the site, you'll notice that there is an age confirmation window which I want to bypass with Scrapy. I couldn't manage that, so I had to move on to Selenium WebDriver, and now I'm using
driver.find_element_by_xpath('xpath').click()
to bypass the age confirmation window. Honestly, I don't want to go with Selenium WebDriver because of how slow it is. Is there any way to bypass that window?
I searched a lot on Stack Overflow and Google
but didn't find any answer that resolves my problem. If you have any link or idea for resolving it with Scrapy, that'd be appreciated. A single helpful comment will be up-voted!
To expand on Chillie's answer.
The age verification is irrelevant here. The data you are looking for is loaded via an AJAX request:
See the related question Can scrapy be used to scrape dynamic content from websites that are using AJAX? to understand how such requests work.
You need to figure out how the https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1&x-algolia-application-id=NS5BWTAI8M&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c URL works and how you can retrieve it in Scrapy.
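As a rough illustration of that approach (the index name and query payload below are assumptions; copy the real POST body from the browser's Network tab), a Scrapy spider could call the Algolia endpoint directly:

import json
import scrapy

ALGOLIA_URL = (
    "https://ns5bwtai8m-dsn.algolia.net/1/indexes/*/queries"
    "?x-algolia-agent=Algolia%20for%20vanilla%20JavaScript%203.19.1"
    "&x-algolia-application-id=NS5BWTAI8M"
    "&x-algolia-api-key=e676b05f3844d3adf54a29732af6e43c"
)

class ProductsSpider(scrapy.Spider):
    name = "products"

    def start_requests(self):
        # Placeholder payload: the real index name and params must be taken
        # from the request the page itself sends.
        payload = {"requests": [{"indexName": "products", "params": "query="}]}
        yield scrapy.Request(
            ALGOLIA_URL,
            method="POST",
            body=json.dumps(payload),
            headers={"Content-Type": "application/json"},
            callback=self.parse_products,
        )

    def parse_products(self, response):
        data = json.loads(response.text)
        # Each "hit" is one record returned by Algolia.
        for hit in data.get("results", [{}])[0].get("hits", []):
            yield hit

That returns the data as JSON directly, so there is nothing to click at all.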
But the age verification "window" is just a div that gets hidden when you press the button, not a real separate window:
<div class="age-check-modal" id="age-check-modal">
You can use the browser's Network tab in developer tools to see that no new info is uploaded or sent when you press the button. So everything is already loaded when you request a page. The "popup" is not even a popup, just an element whose display is changed to none when you click the button.
So Scrapy doesn't really care what's meant to be displayed, as long as all the HTML is loaded. If the elements are loaded, they are accessible. Or have you seen some information being unavailable without pressing the button?
You should inspect the HTML more to see what each website does; this might make your scraping tasks easier.
Edit: After inspecting the original html you can see the following:
<div class="products-list">
    <div class="products-container-block">
        <div class="products-container">
            <div id="hits" class="row">
            </div>
        </div>
    </div>
</div>
You can also see a lot of JS script tags.
The browser's element inspector, by contrast, shows the #hits div populated with product markup and ::before pseudo-elements. The ::before part gives away that this was manipulated by JS, as you cannot do this with simple CSS. See Granitosaurus' answer for details on this.
What this means is that you need to somehow execute the JS code on those pages. So you either need a JS-rendering solution that works with Scrapy, or just use Selenium, as many do and as you already have.
Related
I'm using Python with Requests and BeautifulSoup to parse the pages, and everything worked well until, on one of the pages, buttons appeared that have a PostBack function instead of a link.
Buttons have this structure:
<a onclick="PostBack('FollowLink','2');return false;" href="#">Continue</a>
I have no idea how to navigate to the next page, since the main URL remains unchanged.
You have two options. One is to manually inspect the JavaScript, see what the PostBack function does, and then simulate it. The other is to switch to something like Selenium, where you run an instance of Chrome that interprets the JavaScript for you. The first option would be less work.
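A rough sketch of the first option, assuming PostBack ends up doing a standard ASP.NET postback (the URL and the exact form field names are assumptions; check what the browser actually sends in the Network tab when you click the button):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
page = session.get("https://example.com/listing")  # placeholder URL
soup = BeautifulSoup(page.text, "html.parser")

# ASP.NET pages typically require these hidden fields to be echoed back.
form_data = {
    "__VIEWSTATE": soup.find(id="__VIEWSTATE")["value"],
    "__EVENTVALIDATION": soup.find(id="__EVENTVALIDATION")["value"],
    "__EVENTTARGET": "FollowLink",   # guessed from PostBack('FollowLink','2')
    "__EVENTARGUMENT": "2",
}
next_page = session.post("https://example.com/listing", data=form_data)
soup = BeautifulSoup(next_page.text, "html.parser")  # the next page, hopefully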
I'm a retired programmer but new to Scrapy. Actually, this is my first Python project, so I could be doing something wrong.
I brought up Scrapy under Anaconda and started a shell with:
scrapy shell "https://sailing-channels.com/by-subscribers"
Looks like everything is working fine and I can get some queries to work.
Here is my problem:
When I enter:
response.css('body').extract()
I get:['<body><noscript>If you\'re seeing this message, that means <strong>JavaScript has been disabled on your browser</strong>, please <strong>enable JS</strong> to make this app work.</noscript><div id="app"></div><script src="//apis.google.com/js/platform.js" async></script><script>!function(e,a,n,t,g,c,i){e.GoogleAnalyticsObject="ga",e.ga=e.ga||function(){(e.ga.q=e.ga.q||[]).push(arguments)},e.ga.l=1*new Date,c=a.createElement(n),i=a.getElementsByTagName(n)[0],c.async=1,c.src="//www.google-analytics.com/analytics.js",i.parentNode.insertBefore(c,i)}(window,document,"script"),ga("create","UA-15981085-17","auto"),ga("require","linkid"),ga("set","anonymizeIp",!0),ga("send","pageview")</script><script type="application/ld+json">{\n\t\t\t"#context": "http://schema.org",\n\t\t\t"#type": "Organization",\n\t\t\t"name": "Sailing Channels"\n\t\t\t"url": "https://www.sailing-channels.com",\n\t\t\t"logo": "https://sailing-channels.com/img/banner.png",\n\t\t\t"sameAs" : [\n\t\t\t\t"https://www.facebook.com/sailingchannels",\n\t\t\t\t"https://twitter.com/sailchannels"\n\t\t\t]\n\t }</script><script type="text/javascript" src="https://cdn.sailing-channels.com/1.15.9/main.1dad65fcb7a507930e1f.js"></script></body>']
My problem is that I expected a lot more. When I inspect the page in Chrome I see many more div sections inside <div id="app"></div>.
Could someone shed some light on what I'm doing wrong? I want to scrape the channel name, subscriber count, and views.
Thanks
Understandable. It is because the site renders the data through another script while the page is loading.
In a normal Scrapy setup, dynamically loaded content doesn't appear. To scrape that data you can use Selenium:
selenium-with-scrapy-for-dynamic-page
Alternatively, you can use Splash to handle JavaScript-rendered content: handling-javascript-in-scrapy-with-splash
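A minimal sketch of the Splash route (this assumes the scrapy-splash package is installed, a Splash instance is running, and the usual SPLASH_URL / downloader-middleware settings are configured; the CSS selector is a placeholder you would replace after inspecting the rendered HTML):

import scrapy
from scrapy_splash import SplashRequest

class ChannelsSpider(scrapy.Spider):
    name = "channels"

    def start_requests(self):
        # Splash renders the page in a headless browser and returns the
        # post-JavaScript HTML, so <div id="app"> is no longer empty.
        yield SplashRequest(
            "https://sailing-channels.com/by-subscribers",
            callback=self.parse,
            args={"wait": 2},  # give the JS a moment to populate the page
        )

    def parse(self, response):
        # Placeholder selector: inspect the rendered page for the real one.
        for channel in response.css("#app .channel"):
            yield {"name": channel.css("::text").get()}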
I unfortunately am not able to post code to reproduce this problem, since it involves signing into a site that is not public, but my question is more general than a code problem. Essentially, driver.page_source does not match what shows up in the browser it is driving. This is not an issue with elements not loading fully, because I am testing this while executing code line by line in a Python terminal. I am looking at the page source in the browser after right-clicking and going to "view page source", but if I print driver.page_source or attempt to find_element_by_[...], it shows slightly different code with entire elements missing. Here is the HTML in question:
<nav role="navigation" class="utility-nav__wrapper--right">
<input id="hdn_partyId" value="1965629" type="hidden">
<input id="hdn_firstName" value="CHARLES" type="hidden">
<input id="hdn_sessionId" value="uHxQhlARvzA7N16uh+KJAdNFIcY6D8f9ornqoPQ" type="hidden">
<input id="hdn_cmsAlertRequest" type="hidden" value="Biennial Plus">
<ul class="h-list h-list--middle">
[...]
</ul>
I need all 4 of the input elements; however, the hdn_partyId and hdn_sessionId elements do not appear in Selenium's .page_source, and if I try to get them with .find_element_by_[...] I get a NoSuchElementException.
I even ran a check to find and list all input elements, and these 2 do not show up.
Does anyone have any idea why Selenium would not provide the same content as directly looking at the browser it is driving?
EDIT: To clarify... I am driving Chrome with ChromeDriver through Selenium. This is not an issue with the page not fully loading. As I mentioned, I am running this manually line by line in a Python terminal, not executing a script. So the browser pops up, loads the page, logs in, and then I manually check the browser's page source and see the element; then I print driver.page_source and it's not there, and if I run session_id = driver.find_element_by_id('hdn_sessionId') I get a NoSuchElementException. There are also no frames at all in the page, nor any additional windows.
A coworker of mine has figured out the issue and a workaround. Essentially, after the page is done loading, it runs a javascript command that cleans up the DOM. What the "view page source" in the browser shows is not what the current state is. So running print driver.page_source or using any form of driver.find_element_by_[...] is pulling from the newest and freshest page data, while the browser's "view page source" only shows what was provided when the page first loaded. If you start 'inspecting' the page in Chrome, you will see the HTML is different than what the browser says the "page source" is. After reverse engineering the Javascript, we are able to run partyid = driver.execute_script('return accountdata.$partyId.val();') and get what was originally assigned. I hope this is enough info to help other people who may run into this issue in the future.
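For anyone who hits the same mismatch, a small sketch of the difference (the accountdata object is specific to the site in question; the general point is that Selenium sees the live DOM, not what "view page source" shows):

# The live DOM as Selenium sees it (may differ from "view page source"):
live_dom = driver.execute_script("return document.documentElement.outerHTML")

# If a JS object still holds the original value, ask for it directly
# (accountdata.$partyId is specific to that site):
party_id = driver.execute_script("return accountdata.$partyId.val();")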
Try it like this and you will get the source code. The "view-source:" keyword can differ according to your browser; this one is for Chrome:
driver.get("view-source:"+url)
sourcecode=driver.find_element_by_tag_name('body').text
If you locate the 'body' of the page and then use get_attribute('innerHTML'), you can access everything on the page.
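For example (a sketch using the same old-style find_element_by_* API as the rest of this thread):

body_html = driver.find_element_by_tag_name('body').get_attribute('innerHTML')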
Quite often when using Selenium, waiting does the trick without needing a lot of extra code (i.e. giving the full DOM a few seconds to load). In the example below, the HTML that was gathered reflects what you see when you 'inspect' the page, as opposed to 'view source', which displays the pre-JS DOM:
from time import sleep
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(ChromeDriverManager().install())
driver.get(url)
sleep(10)  # give the page's JavaScript time to finish building the DOM
HTML = driver.page_source
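A fixed sleep works, but it always waits the full 10 seconds. If you know an element that only appears once the JS has run (the id used below is just a placeholder), an explicit wait returns as soon as it shows up:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 10 s for an element that the JS renders; "app" is a placeholder id.
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, "app")))
HTML = driver.page_source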
Question: yikyak.com returns some sort of "browser not supported" landing page when I try to view the source code in Chrome (even for the page I'm logged in on) or when I write it out to the Python terminal. Why is this, and what can I do to get around it?
Edit for clarification: I'm using the Chrome webdriver. I can navigate around the Yik Yak website by clicking on it just fine. But whenever I try to see what HTML is on the page, I get an HTML page for a "browser not supported" page.
Background: I'm trying to access yikyak.com with selenium for python to download yaks and do fun things with them. I know fairly little about web programming.
Thanks!
Secondary, less important question: If you're already here, are there particularly great free resources for a super-quick intro to the authentication knowledge I need to store logins and things like that, so I can use my logged-in account? That would be awesome.
I figured it out. I was being dumb. I saved the HTML off as a file and opened that file with Chrome, and it displayed the normal page. I just didn't recognize that it was a normal page when looking at it directly. Thanks, all 15 people, for your time.
So I'm scraping Omegle, trying to get the number of users online.
This is the HTML code:
<div id="onlinecount">
<strong>
30,000+
</strong>
</div>
Now I would presume that using lxml the XPath would be //div[@id="onlinecount"] to scrape any text within that div. I want to get the number from the <strong> tags, but when I try to scrape this, I just end up with an empty list.
Here's my relevant code:
print "\n Grabbing users online now from",self.website
site = requests.get(self.website)
tree = html.fromstring(site.text)
users = tree.xpath('//div[@id="onlinecount"]')
Note that the self.website variable is just http://www.omegle.com
Any ideas what I'm doing wrong? Note I can scrape other parts just not the number of online users.
I ended up using a different set of code which I learned from a friend.
Here's my full code for anyone interested.
http://pastebin.com/u1kTLZtJ
When you send a GET request to "http://www.omegle.com" using the requests Python module, what I observed is that there is no "onlinecount" in site.text. The reason is that this part gets rendered by JavaScript. You should use a library that is able to execute the JavaScript and give you the final HTML source as rendered in a browser. One such third-party library is Selenium, http://selenium-python.readthedocs.org/. The only downside is that it opens a real web browser.
Below is working code using Selenium:
from selenium import webdriver

browser = webdriver.Firefox()
browser.get("http://www.omegle.com")
element = browser.find_element_by_id("onlinecount")
onlinecount = element.find_element_by_tag_name("strong")
print(onlinecount.text)  # e.g. "30,000+"
You can also use a GET request on http://front1.omegle.com/status,
which will return the count of online users and other details in JSON form.
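A quick sketch of that approach (the exact field names in the JSON are an assumption; print the whole response once to see what the endpoint really returns):

import requests

status = requests.get("http://front1.omegle.com/status").json()
print(status)               # inspect the structure first
print(status.get("count"))  # the online-user count, if the field is named "count"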
I have done a bit of looking at this, and that particular part of the page is not filled in by the HTML itself but by JavaScript.
Here is the source (this is what the requests library is returning in your program):
<div id="onlinecount"></div>
<script>
if (IS_MOBILE) {
$('sharebuttons').dispose();
$('onlinecount').dispose();
}
</script>
</div>
As you can see, in lxml's eyes the onlinecount div is empty; there is nothing but a script next to it, and the count never appears in the raw HTML.
I agree with Praveen.
If you want to avoid launching a visible browser, you could use PhantomJS, which also has a Selenium driver:
http://phantomjs.org/
PhantomJS is a headless WebKit scriptable with a JavaScript API
Instead of Selenium scripts, you could also write PhantomJS JS scripts (but I assume you prefer to stay in the Python environment ;))
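A minimal sketch of that route with the Python bindings (this assumes the phantomjs binary is on your PATH and an older Selenium release, since newer versions have dropped the PhantomJS driver):

from selenium import webdriver

# Headless WebKit: no visible browser window is opened.
driver = webdriver.PhantomJS()
driver.get("http://www.omegle.com")
print(driver.find_element_by_id("onlinecount").text)
driver.quit()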