Scraping a web page as you manually navigate - python

Is there a way, using some library or method, to scrape a webpage in real time as a user navigates it manually? Most scrapers I know of such as python mechanize create a browser object that emulates a browser - of course this is not what I am looking for since if I have a browser open, it will be different than the one mechanize creates.
If there is no solution: my underlying problem is that I want to scrape elements from an HTML5 game to make an intelligent agent of sorts. I won't go into more detail, but I suspect if others are trying to do the same in the future (or any real time scraping with a real user), a solution to this could be useful for them as well.
Thanks in advance!

Depending on what your use-case is, you could set up a SOCKS proxy or some other form of proxy and configure it to log all traffic, then instruct your browser to use it. You'd then scrape that log somehow.
Similarly, if you have control over your router, you could configure capture and logging there, e.g. using tcpdump. This wouldn't decrypt encrypted traffic, of course.
If you are working with just one browser, there may be a way to instruct it to do something at each action via a custom browser plugin, but I'd have to guess you'd be running into security model issues a lot.
The problem with an HTML5 game is that most of its "navigation" is typically done in Javascript. The Javascript is doing a lot: manipulating the DOM, triggering requests for new content to fit into the DOM, and so on.
Because of this you might be better off looking into OS-level or browser-level scripting services that can "drive" keyboard and mouse events, take screenshots, or possibly even take a snapshot of the current page DOM and query it.
You might investigate browser automation and testing frameworks like Selenium for this.
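Once you have a snapshot of the current page DOM as an HTML string (with Selenium, for example, `driver.page_source` returns one), querying it needs nothing beyond the standard library. The following is a minimal sketch, not a definitive implementation; the class name `score` and the sample markup are hypothetical stand-ins for whatever the game actually renders.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementTextCollector(HTMLParser):
    """Collect the text inside every element carrying a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.depth = 0       # > 0 while we are inside a matching element
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth or self.class_name in classes:
            if self.depth == 0:
                self.results.append("")   # start a new text bucket
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS:
            return
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.results[-1] += data


def extract_text(html, class_name):
    """Return the text of each element whose class list contains class_name."""
    collector = ElementTextCollector(class_name)
    collector.feed(html)
    return collector.results


# Hypothetical snapshot; in practice this string would come from the browser.
snapshot = '<div><span class="score">42</span><p class="hud score">Lives: <b>3</b></p></div>'
print(extract_text(snapshot, "score"))
```

Polling the snapshot in a loop while the user plays would give the agent a running view of the page, at the cost of re-parsing the whole document each time.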

I am not sure if this would work in your situation but it is possible to create a simple web browser using PyQt which will work with HTML5 and from this it might be possible to capture what is going on when a live user plays the game.
I have used PyQt for a simple browser window (for a completely different application) and it seems to handle simple, sample HTML5 games. How one would delve into the details of what is going on the game is a question for PyQt experts, not me.

Related

Python-3.x-Selenium: Changing the used driver while staying logged in on a website

I'm currently testing a website with python-selenium and it works pretty well so far. I'm using webdriver.Firefox() because it makes the development process much easier if you can see what the testing program actually does. However, the tests are very slow. At one point, the program has to click on 30 items to add them to a list, which takes roughly 40 seconds because the browser is responding so awfully slowly. So after googling how to make selenium faster I've thought about using a headless browser instead, for example webdriver.PhantomJS().
However, the problem is, that the website requires a login including a captcha at the beginning. Right now I enter the captcha manually in the Firefox-Browser. When switching to a headless browser, I cannot do this anymore.
So my idea was to open the website in Firefox, log in and solve the captcha manually. Then I somehow continue the session in headless PhantomJS, which allows me to run the code quickly. So basically it is about changing the used driver mid-code.
I know that a driver is completely clean when created. So if I create a new driver after logging in in Firefox, I'd be logged out in the other driver. So I guess I'd have to transfer some session-information between the two drivers.
Could this somehow work? If yes, how can I do it? To be honest I do not know a lot about the actual functionality of webhooks, cookies and storing the "logged-in" information in general. So how would you guys handle this problem?
Looking forward to hearing your answers,
Tobias
Note: I already asked a similar question, which got marked as a duplicate of this one. However, the other question discusses how to reconnect to the browser after quitting the script. This is not what I am intending to do. I want to change the used driver mid-script while staying logged in on the website. So I deleted my old question and created this new, more fitting one. I hope it is okay like that.
The real solution to this is to have your development team add a test mode (not available on Production) where the Captcha solution is either provided somewhere in the page code, or the Captcha is bypassed.
Your proposed solution does not sound like it would work, and having a manual step defeats the purpose of automation. Automation that requires manual steps to be taken will be abandoned.
The website "recognizes" the user via cookies: small pieces of data the browser sends in an HTTP header with each request, so the website knows that the user is authenticated, has this or that permission, etc.
Fortunately Selenium provides functions for cookie manipulation, so all you need to do is read the cookies from Firefox using the WebDriver.get_cookies() method and then add them to PhantomJS via the WebDriver.add_cookie() method.
# Copy every cookie from the logged-in Firefox session into PhantomJS.
firefoxCookies = firefoxDriver.get_cookies()
for cookie in firefoxCookies:
    phantomJSDriver.add_cookie(cookie)
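One detail worth spelling out: Selenium only accepts a cookie if the target browser is already on a page of the domain the cookie belongs to, so the second driver has to load the site before the cookies are added and reload it afterwards. The helper below is a minimal sketch of that sequence; `transfer_cookies` and the `url` argument are hypothetical names, and whether the site accepts the copied session (it may also check other things, such as the user agent) is an assumption you would need to verify.

```python
def transfer_cookies(source_driver, target_driver, url):
    """Copy all cookies from one Selenium driver to another.

    Both arguments are expected to be WebDriver instances; only the
    get(), get_cookies() and add_cookie() methods are used.
    """
    # The target must be on the cookie's domain before add_cookie() is called.
    target_driver.get(url)

    cookies = source_driver.get_cookies()
    for cookie in cookies:
        target_driver.add_cookie(cookie)

    # Reload so the page is requested again, this time with the cookies attached.
    target_driver.get(url)
    return len(cookies)
```

Usage would look like `transfer_cookies(firefoxDriver, phantomJSDriver, "https://example.com/")` after the manual login, where the URL is a placeholder for the site under test.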

Selenium: What functions would fire request?

I am new to Selenium and web applications. Please bear with me for a second if my question seems way too obvious. Here is my story.
I have written a scraper in Python that uses Selenium 2.0 WebDriver to crawl AJAX web pages. One of the biggest challenges (and a matter of ethics) is that I do not want to burn down the website's server. Therefore I need a way to monitor the number of requests my webdriver is firing on each page parsed.
I have done some google-searches. It seems like only selenium-RC provides such a functionality. However, I do not want to rewrite my code just for this reason. As a compromise, I decided to limit the rate of method calls that potentially lead to the headless browser firing requests to the server.
In the script, I have the following kind of method calls:
driver.find_element_by_XXXX()
driver.execute_script()
webElement.get_attribute()
webElement.text
I use the second function to scroll to the bottom of the window and get the AJAX content, like the following:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
Based on my intuition, only the second function will trigger request firing, since others seem like parsing existing html content.
Is my intuition wrong?
Many many thanks
Perhaps I should elaborate more. I am automating a process of crawling on a website in Python. There is a substantial amount of work done, and the script is running without large bugs.
My colleagues, however, reminded me that if in the process of crawling a page I made too many requests for the AJAX list within a short time, I may get banned by the server. This is why I started looking for a way to monitor the number of requests I am firing from my headless PhantomJS browser in script.
Since I cannot find a way to monitor the number of requests in script, I made the compromise I mentioned above.
"Therefore I need a way to monitor the number of requests my webdriver is firing on each page parsed"
As far as I know, the number of requests is depending on the webpage's design, i.e. the resources used by the webpage and the requests made by Javascript/AJAX. Webdriver will open a browser and load the webpage just like a normal user.
In Chrome, you can check the requests and responses using Developer Tools panel. You can refer to this post. The current UI design of Developer Tools is different but the basic functions are still the same. Alternatively, you can also use the Firebug plugin in Firefox.
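The same information Developer Tools shows can often be read from inside the script as well. In browsers that implement the Resource Timing API, `window.performance.getEntriesByType('resource')` lists the resources the page has fetched so far, and Selenium's `execute_script()` can return that list to Python. This is a sketch under assumptions: `requested_urls` is a hypothetical helper, the browser may cap how many entries it buffers, and it is not certain that an older headless browser such as PhantomJS supports the API at all.

```python
# JavaScript run inside the page: return the URL of every resource entry.
RESOURCE_SCRIPT = (
    "return window.performance.getEntriesByType('resource')"
    ".map(function (entry) { return entry.name; });"
)


def requested_urls(driver):
    """Return the URLs of resources the current page has loaded so far.

    `driver` is expected to be a Selenium WebDriver; only execute_script()
    is used. An empty list is returned if the script yields nothing.
    """
    return list(driver.execute_script(RESOURCE_SCRIPT) or [])


def count_requests(driver):
    """Number of resource requests recorded for the current page."""
    return len(requested_urls(driver))
```

Calling `count_requests(driver)` before and after a scroll gives a rough idea of how many requests that scroll triggered.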
Updated:
Another method to check the requests and responses is by using Wireshark. Please refer to these Wireshark filters.
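The compromise described in the question, limiting how often request-triggering calls such as the scroll script are made, can be sketched with a small throttle object. This is one possible approach rather than an established recipe; the `Throttle` class and the two-second interval in the usage note are illustrative choices, not values taken from the question.

```python
import time


class Throttle:
    """Ensure at least `min_interval` seconds pass between successive calls."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock      # injectable so the class can be tested without waiting
        self.sleep = sleep
        self.last = None        # time of the previous call, None before the first

    def wait(self):
        """Block until the minimum interval since the previous call has elapsed."""
        now = self.clock()
        if self.last is not None:
            remaining = self.min_interval - (now - self.last)
            if remaining > 0:
                self.sleep(remaining)
                now = self.clock()
        self.last = now
```

In the scraper, one would create `throttle = Throttle(2.0)` once and call `throttle.wait()` immediately before each `driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")`.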

is there a way to capture network calls a site makes using python?

I've looked at urllib(2), mechanize, and Beautiful Soup in hopes of finding something that captures network calls such as pixel/beacon fires from a page. Unfortunately I'm not very familiar with any of them, and also not very clear on how to go about my search.
I'd like to use Python to run through a series of web URLs and capture each one's network calls, aka pixel fires. Would anyone know of a means or library I can start from in order to accomplish this?
I looked into web scraping, but I don't want the HTML; instead I believe I'm looking for the GET requests the site makes.
If I understand what you want, you want to log what requests a browser makes when displaying a page, in respect of many pages.
Your options are to script a browser using python (See: http://wiki.python.org/moin/WebBrowserProgramming), or script the browser using javascript, and output your results in some way (I suggest JSON, over a request or to a file), and analyse them in python.
You'll probably find it easier to do the scripting in javascript, honestly.
Another possibility if you have access to the Firefox web browser is to install Firebug, a powerful debugging tool that gives you the option to display all network traffic from a web page in the browser console. In order to transfer the output from the console to a file you will need to install the ConsoleExport plugin for Firebug.
You will now be able to capture all the traffic from a web page to a file which you can then parse with Python.
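As an illustration of the parsing step, here is a minimal sketch that assumes the traffic was saved in the HAR (HTTP Archive) JSON format, which many browser network tools can export. That is an assumption: the output of ConsoleExport may be structured differently and would need its own parser. The function name and sample data are hypothetical.

```python
import json


def get_request_urls(har_text, method="GET"):
    """Return the URLs of all requests in a HAR document that use `method`."""
    har = json.loads(har_text)
    # A HAR file stores one entry per request under log.entries.
    entries = har.get("log", {}).get("entries", [])
    return [
        entry["request"]["url"]
        for entry in entries
        if entry["request"]["method"].upper() == method.upper()
    ]


# Hypothetical HAR content with one page load, one pixel and one POST.
sample_har = json.dumps({
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "http://example.com/"}},
            {"request": {"method": "GET", "url": "http://tracker.example.com/pixel.gif?id=1"}},
            {"request": {"method": "POST", "url": "http://example.com/api"}},
        ]
    }
})
print(get_request_urls(sample_har))
```

Filtering the returned URLs by host name would then separate the pixel and beacon calls from the page's own resources.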

Django: Detect client browser support of HTML5

Is there any module out there that could be used by my Django site to tell whether the client browser supports HTML5 and what features are supported?
Sadly no. This is something you'll need client-side JavaScript to do, especially something like http://modernizr.com/
One way to do it would be to run modernizr and send results to back end.
If you were really optimistic, you could build a list of User-Agents and decide based on that. But good luck keeping track of which things work in which versions of Chrome and Firefox.
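On the server side, receiving the Modernizr results could look roughly like the sketch below. It assumes the page posts a JSON object mapping feature names to booleans; that payload shape is an assumption, not something Modernizr or Django prescribes, and `parse_feature_report` is a hypothetical helper. In a Django view it would be called with `request.body`.

```python
import json


def parse_feature_report(body):
    """Parse a JSON feature report into a {feature_name: bool} dict.

    Anything that is not valid JSON, not an object, or not a boolean
    value is ignored, so a malformed client report cannot break the view.
    """
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        return {}
    if not isinstance(data, dict):
        return {}
    return {str(name): value for name, value in data.items()
            if isinstance(value, bool)}


# Example payload as the client-side script might send it (hypothetical).
print(parse_feature_report(b'{"canvas": true, "websockets": false}'))
```

The resulting dict could be stored in the user's session so that later requests can branch on it without asking the browser again.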

How to scrape the images from this javascript website?

link text
This is a link from a digital book library. There are forward and backward buttons to see the next and previous pages. I want to download these pictures automatically. I once used urllib in Python, but the website banned it soon after. I just want to download this book for study purposes, so can anyone recommend some programming tools, such as web spiders, that can simulate the process of turning pages and get the pictures automatically? Thanks!
That site uses Javascript, so you can't easily scrape it with Python. Two suggestions:
Work out what requests are being made when clicking the next button. You can do this with a tool like firebug. You might then find you can scrape it without processing any JS.
Use a tool such as Selenium which allows for browser scripting which lets you "execute" the JS.
As for the site blocking you, there are two ways to reduce the chance of being blocked:
Change your user-agent to that of a common browser, e.g. Firefox.
Add random delays between accessing the next image, so that you appear more human-like.
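Both suggestions can be sketched with the standard library alone. The user-agent string and the delay range below are illustrative assumptions, not values the site is known to accept, and the image URLs would have to come from working out the requests as described above.

```python
import random
import time
import urllib.request

# Example browser-like user-agent string (an assumption; any common one works).
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=USER_AGENT):
    """Create a request that identifies itself with a browser user-agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def random_delay(min_seconds=2.0, max_seconds=6.0):
    """Pick a random pause length so requests are not evenly spaced."""
    return random.uniform(min_seconds, max_seconds)


def download_all(urls):
    """Fetch each URL in turn, pausing a random amount between requests."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(build_request(url)) as response:
            pages.append(response.read())
        time.sleep(random_delay())
    return pages
```

`download_all` returns the raw bytes of each response, which can then be written to files one image per page.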
wget is an excellent web spider
http://linux.die.net/man/1/wget
You need a real browser to work with this (kind of) site. Selenium is one option, but it is more geared towards web testing. For web scraping iMacros is really nice. I had a quick test and it works well with iMacros for Firefox/IE.
Chris
