Is there a way to reliably crawl (Python 2.7) a website and determine whether it is responsive, and thus will adapt to mobile devices, or not? I want to build a crawler that screens multiple websites and stores the ones which are not mobile friendly.
I've heard of methods like:
- finding websites without a "viewport" meta tag in their HTML code
- looking through CSS media queries
Do you know a way to accurately detect non-responsive websites? Any suggestion would be very much appreciated.
If a web page doesn't have a viewport meta tag in its head, it's unlikely to be responsive, i.e. you would look for
<head>
...
<meta name="viewport" content="SOMETHING">
...
</head>
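For what it's worth, here is a minimal Python sketch of that check; I'm assuming the requests and Beautiful Soup libraries (the question only mentions Python 2.7), and the site list is just a placeholder:

import requests
from bs4 import BeautifulSoup

def looks_responsive(url):
    """Heuristic: a page whose head contains a viewport meta tag is probably responsive."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("meta", attrs={"name": "viewport"}) is not None

# Placeholder usage: keep only the sites that look non mobile friendly.
sites = ["http://example.com"]
non_responsive = [site for site in sites if not looks_responsive(site)]
print(non_responsive)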
Two cases you might want to consider that would fail the above test.
If a site uses device detection to route users to, say, a mobile or desktop site based on their device, then although neither site may be responsive, this wouldn't be obvious to an end user and their experience could still be good.
Another case I can think of is a page that uses javascript to detect viewport size and uses the result to modify the page or the page's stylesheet.
On balance I think searching for a viewport meta tag would give good results, though don't expect it to be perfect.
I hope this helps!
I have deployed my web app (which right now is just some templates with JS and CSS) to AWS Elastic Beanstalk and the website is up and running.
I used a predefined theme for the layout of the website that includes responsive design.
On a desktop computer the website looks fine, even if I manually resize the window. At every resolution it looks great! Every icon and image loads perfectly.
Now if I look at the website on my smartphone (mobile device) it starts "screwing" up. It adds a white bar below the start page, and on different browsers it mixes some divs up completely, rendering them at the top of the page instead of the bottom.
Some icons are also not displayed correctly.
I tried different browsers on the desktop PC and everything was fine. On the smartphone I have problems in all browsers.
I checked the website (www.meyn-computer-lotse.de) on http://www.mobilephoneemulator.com/ and there it is also displayed perfectly.
I just cannot find my mistake...
Now I kinda have two questions:
1. Is there an easy way on a mobile device to look up the CSS rules for the website? On the desktop PC I can easily look up the applied rules.
2. Do I have to specify something in Django or AWS Elastic Beanstalk for a mobile device?
Any help is appreciated.
Instead of trying to look up the applied CSS rules on your phone, you can press F12 in your desktop browser. In Chrome you can toggle the device toolbar and choose between different devices on which to view your website.
To answer your second question: to make sure the pages are responsive you can use something like Bootstrap (https://getbootstrap.com). Bootstrap takes care of the JavaScript and CSS to make sure everything is displayed correctly, though sometimes you might have to choose a different layout because Bootstrap has some limitations.
You can also make a distinction between desktop and mobile phone users to serve them a different template optimized for the device they use. Here you have a few different options (a small sketch of the general idea follows the list):
Use django-mobile: django-mobile provides a simple way to detect mobile browsers and gives you tools at hand to render different templates to deliver a mobile version of your site to the user.
Use MobileESP: MobileESP has a simple API for detecting mobile devices. The API returns a simple Boolean result: TRUE if the device is the specified type, or FALSE if it isn't. For example, want to know if your visitor is on an iPhone, Android or Windows Phone device? Simply use the method DetectTierIphone().
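Just to illustrate the template-switching idea independently of either library, here is a rough, hand-rolled Django view that inspects the User-Agent header and picks a different template. The template names and the keyword pattern are made up for this example, and a real detection library such as django-mobile or MobileESP will be far more thorough:

import re

from django.shortcuts import render

# Very rough keyword check; real libraries use much larger device databases.
MOBILE_UA_RE = re.compile(r"Mobile|Android|iPhone|iPad|Windows Phone", re.I)

def home(request):
    user_agent = request.META.get("HTTP_USER_AGENT", "")
    if MOBILE_UA_RE.search(user_agent):
        # Hypothetical mobile-optimized template.
        return render(request, "home_mobile.html")
    # Default desktop template.
    return render(request, "home.html")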
I hope you find this information useful.
So, maybe I'm being paranoid.
I'm scraping my Facebook timeline for a hobby project using PhantomJS. Basically, I wrote a program that finds all of my ads by querying the page for the text Sponsored with XPath inside of Phantom's page.evaluate block. The text was being displayed as the innerHTML of HTML a elements.
Things were working great for a few days and it was finding tons of ads.
Then it stopped returning any results.
When I logged into Facebook manually to inspect the elements again, I found that the word Sponsored was now appearing on the page in an ::after pseudo-element with the CSS property content: sponsored. This means that an XPath query for the text no longer yields any results. No joke, Facebook seems to have changed the way they render this word after it was scraped for a couple of days.
Paranoid. I told you.
So, I offer this question to the community of JavaScript, web-scraping, and PhantomJS developers out there. What the heck is going on? Can Facebook know what my PhantomJS program is doing inside the page.evaluate block?
If so, how? Would my Phantom commands appear in a keylogger script embedded in the page, for instance?
What are some of your theories?
It is perfectly possible to detect PhantomJS even if the user agent is spoofed.
There are plenty of little ways in which it differs from other browsers, among others:
Wrong order of headers
Lack of media plugins and latest JS capabilities
PhantomJS-specific methods, like window.callPhantom
PhantomJS name in the stack trace
and many others.
Please refer to this excellent article and presentation linked there for details: https://blog.shapesecurity.com/2015/01/22/detecting-phantomjs-based-visitors/
Maybe Puppeteer would be a better fit for your needs, as it is based on a real, cutting-edge Chromium browser.
I am trying to scrape a website using Python and Beautiful Soup. I found that on some sites, the image links, although visible in the browser, cannot be seen in the source code. However, when using Chrome Inspect or Fiddler, I can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to also load the generated content within Python? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
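Selenium is not on that list, but it is another common way to let the JavaScript run in a real browser and then hand the rendered HTML to Beautiful Soup. A minimal sketch (the URL is a placeholder, and the driver choice depends on what you have installed):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Firefox()    # or PhantomJS/Chrome, whatever is available
driver.get("http://example.com/page-with-js-content")

# page_source holds the DOM *after* the page's scripts have run.
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.find("div", id="cntnt"))

driver.quit()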
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that also builds the DOM and loads and runs the scripts like a regular browser would. The Wikipedia article and some other pages on the net have lists of those and their capabilities.
Keep in mind when choosing that some previously major products in this space are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XMLHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
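To make that concrete, here is a small sketch using the requests library; the endpoint URL, query parameters and JSON keys are purely hypothetical stand-ins for whatever you find in your network log:

import requests

# Pretend to be a normal browser; some servers also check for the AJAX header.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "X-Requested-With": "XMLHttpRequest",
}

# Hypothetical AJAX endpoint copied from the browser's network log.
response = requests.get("http://example.com/api/items",
                        params={"page": 1},
                        headers=headers)
response.raise_for_status()

data = response.json()                 # such endpoints usually return JSON
for item in data.get("items", []):
    print(item)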
Is there a way, using some library or method, to scrape a webpage in real time as a user navigates it manually? Most scrapers I know of, such as Python mechanize, create a browser object that emulates a browser - of course this is not what I am looking for, since if I have a browser open, it will be different from the one mechanize creates.
In case there is no general solution: my problem is that I want to scrape elements from an HTML5 game to make an intelligent agent of sorts. I won't go into more detail, but I suspect that if others try to do the same in the future (or any real-time scraping alongside a real user), a solution to this could be useful for them as well.
Thanks in advance!
Depending on what your use-case is, you could set up a SOCKS proxy or some other form of proxy and configure it to log all traffic, then instruct your browser to use it. You'd then scrape that log somehow.
Similarly, if you have control over your router, you could configure capture and logging there, e.g. using tcpdump. This wouldn't decrypt encrypted traffic, of course.
If you are working with just one browser, there may be a way to instruct it to do something at each action via a custom browser plugin, but I'd have to guess you'd be running into security model issues a lot.
The problem with an HTML5 game is that typically most of its "navigation" is done using a lot of JavaScript. The JavaScript is typically doing a lot: manipulating the DOM, triggering requests for new content to fit into the DOM, etc.
Because of this you might be better off looking into OS-level or browser-level scripting services that can "drive" keyboard and mouse events, take screenshots, or possibly even take a snapshot of the current page DOM and query it.
You might investigate browser automation and testing frameworks like Selenium for this.
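For instance, a rough Selenium sketch of that idea in Python might look like the following; the URL, element ids and key presses are placeholders, and a real game would need game-specific selectors and timing:

import time

from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.get("http://example.com/html5-game")      # placeholder URL

# Fire a keyboard event at the page, as a player would.
ActionChains(driver).send_keys(Keys.ARROW_RIGHT).perform()
time.sleep(1)                                    # crude wait for the game to react

# Snapshot the current DOM and read some state out of it (placeholder id).
print("current score:", driver.find_element_by_id("score").text)
driver.save_screenshot("game.png")

driver.quit()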
I am not sure if this would work in your situation, but it is possible to create a simple web browser using PyQt that will work with HTML5, and from this it might be possible to capture what is going on when a live user plays the game.
I have used PyQt for a simple browser window (for a completely different application) and it seems to handle simple, sample HTML5 games. How one would delve into the details of what is going on in the game is a question for PyQt experts, not me.
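For reference, here is a bare-bones sketch of such a PyQt browser window. I'm assuming PyQt4 with the old QtWebKit module (newer PyQt releases ship QtWebEngine instead, with a different API), and the URL is a placeholder:

import sys

from PyQt4.QtCore import QUrl
from PyQt4.QtGui import QApplication
from PyQt4.QtWebKit import QWebView

class MiniBrowser(QWebView):
    """A tiny browser window that dumps the rendered DOM once the page loads."""

    def __init__(self, url):
        super(MiniBrowser, self).__init__()
        self.loadFinished.connect(self.on_load_finished)
        self.load(QUrl(url))
        self.show()

    def on_load_finished(self, ok):
        # The DOM as rendered so far, after any JavaScript has run.
        html = self.page().mainFrame().toHtml()
        print("%d characters of rendered HTML" % len(html))

if __name__ == "__main__":
    app = QApplication(sys.argv)
    browser = MiniBrowser("http://example.com/html5-game")
    sys.exit(app.exec_())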
Preferably using Python (currently using urllib and BeautifulSoup), given a URL.
For example I'm trying to scrape the main photo on this page: http://www.marcjacobs.com/marc-jacobs/womens/bags-and-accessories/c3122001/the-single#?p=1&s=12
In Firefox, Tools > Page Info > Media lists all the visible images, including the link to the image I want to grab (http://imagesec.mj.ctscdn.com/image/336/504/6ace6aac-c049-4d7e-9465-c19b5cd8e4ac.jpg).
Two interrelated problems:
If I do a View Source, the image path retrieved from the Firefox tool is not found in the HTML document... Is there any way I can retrieve this path without going through Firefox Page Info? Perhaps through either Python and/or JavaScript/jQuery?
I'm trying to get the photo of the product in "Orange", and I notice the page always loads the black color by default.
A working example is probably Google Shopping: if you type the name of this product and select a color, the image shows up in the correct color (from the exact same page) in the search results.
Basically, I want to be able to scrape color and style/variation specific images from mostly shopping sites.
Selecting the right color seems more complicated; in that case I'll settle for just the main product image in black for now...
So far I've tried selecting images based on img height attributes, and also tried to read dimensions when there are no height/width attributes... but it occurred to me there has to be a better way.
This can be a bit complex, but most of the solutions that work in this particular situation are pretty much the same.
First, let me tell you why using Beautiful Soup or lxml won't work. You need to retrieve some information which is available only after you click on the orange bag thumbnail, right? That content is loaded using JavaScript, so the orange bag image won't be available to Beautiful Soup and friends (because they don't run JavaScript, nor can they see elements that are absent from the parsed tree).
So that is a dead end.
However, there are other screen-scraping tools like Selenium or PhantomJS. I have tested both and they work great. They basically drive a browser, so they are obviously capable of handling JavaScript. I don't know whether you need to scrape this automatically from your server or want to start the scraping process at will. With Selenium (after you tell it what page you want to open, what element you want to click, etc.), you will see your browser doing all that stuff by itself. There are other options available, such as using a headless browser. In my opinion, it's very powerful, but it can get quite complex to get working.
A far easier solution is using PhantomJS. It's similar to Selenium although, as its name indicates, you give the instructions via JavaScript (which can be a bit more comfortable since you're already dealing with web elements). I recommend you use CasperJS: it eases the process of defining a full navigation scenario and provides useful high-level functions, methods & syntactic sugar for doing common tasks...
Let me give you some feel for what it looks like:
var casper = require('casper').create({
    verbose: true
});

casper.start('yourwebpage');              // load the webpage

casper.then(function(){                   // after loading...
    var value = this.evaluate(function(){ // get me some element's value
        return document.getElementById('yourelement').value;
    });
    this.echo('element value: ' + value);
});

casper.then(function(){                   // after that, click on this other element
    this.click('#id_of_other_element');
});

casper.wait(7000);                        // wait for some processing... this can be quite
                                          // useful if you need to wait a few seconds in
                                          // order to retrieve your orange bag later

casper.run();                             // actually runs the whole thing
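You would then save the script to a file (say scrape.js, a name made up here) and run it from the command line with the casperjs executable.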
There you have most of the things you need to accomplish your task.
By the way, let me remind you that you usually need to ask for permission to retrieve that kind of content.
Hope that helps.