Python Web-based Bot

I am trying to write a Python-based Web Bot that can read and interpret an HTML page, then execute an onClick function and receive the resulting new HTML page. I can already read the HTML page and I can determine the functions to be called by the onClick command, but I have no idea how to execute those functions or how to receive the resulting HTML code.
Any ideas?

The only tool in Python for JavaScript that I am aware of is python-spidermonkey. I have never used it, though.
With Jython you could (ab)use HttpUnit.
Edit: forgot that you can use Scrapy. It supports JavaScript through SpiderMonkey, and you can even use Firefox for crawling the web.
Edit 2: Recently, I find myself using browser automation more and more for such tasks, thanks to some excellent libraries. QtWebKit offers full access to a WebKit browser, which can be used from Python through language bindings (PySide or PyQt). There seem to be similar libraries and bindings for Gtk+, which I haven't tried. The Selenium WebDriver API also works great and has an active community.

Well, obviously Python won't interpret the JavaScript for you (though there may be modules out there that can). I suppose you need to convert the JavaScript instructions into equivalent transformations in Python.
I suppose ElementTree or BeautifulSoup would be good starting points to interpret the HTML structure.
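If BeautifulSoup isn't available, the standard library alone can pull the onClick handlers out of the page; a minimal sketch (the sample HTML is made up for illustration):

```python
from html.parser import HTMLParser

class OnClickFinder(HTMLParser):
    """Collects the value of every onclick attribute in the document."""
    def __init__(self):
        super().__init__()
        self.handlers = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        for name, value in attrs:
            if name == "onclick":
                self.handlers.append((tag, value))

page = '<html><body><a href="#" onclick="loadPage(2)">next</a></body></html>'
finder = OnClickFinder()
finder.feed(page)
print(finder.handlers)  # [('a', 'loadPage(2)')]
```

From there you still need to work out what each handler actually does; the parser only tells you which functions are wired to which elements.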

To execute JavaScript, you need to do much of what a full web browser does, except for the rendering. In particular, you need a JavaScript interpreter, in addition to the Python interpreter.
One starting point might be python-spidermonkey. Depending on the specific JavaScript, you might have to provide a DOM API to SpiderMonkey, in addition to providing an XMLHttpRequest implementation.

You can try to leverage V8:
V8 is Google's open source, high performance JavaScript engine. It is written in C++ and is used in Google Chrome, Google's open source browser.
Calling it from Python may not be straightforward without a framework to provide the DOM.
Pyjamas has an experimental project, Pyjamas Desktop, providing V8 integration for JavaScript execution.
PyV8 is an experimental set of Python bindings for V8 and a Python-to-JavaScript compiler.

For the browser part of this you might want to look into Mechanize, which is essentially a web browser implemented as a Python library: http://pypi.python.org/pypi/mechanize/0.1.11
But as mentioned, the text in onClick is JavaScript, and you'll need SpiderMonkey for that.
If you can add generic support for SpiderMonkey to Mechanize, I'm sure many people would be extremely happy. ;)
Mechanize may be overkill; maybe you just want to find specific parts of the HTML, and both lxml and BeautifulSoup work well for that.

Why don't you just sniff what gets sent after the onClick event and replicate that with your bot?
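For instance, if the browser's network tab shows the onClick handler firing a POST, replicating it with the standard library might look like this (the endpoint and parameters are invented for illustration; use whatever your sniffer shows):

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint observed while sniffing the onClick request.
url = "http://example.com/ajax/next_page"
payload = urllib.parse.urlencode({"page": "2"}).encode("ascii")

req = urllib.request.Request(url, data=payload, headers={
    "X-Requested-With": "XMLHttpRequest",  # many sites check this on AJAX calls
})

print(req.get_method())  # POST, because a data body was supplied
# html = urllib.request.urlopen(req).read()  # uncomment to actually send it
```

The resulting bytes are the new HTML (or JSON) the page would have swapped in, without ever running the JavaScript yourself.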

For web automation, you can look into the webbot library. It makes automation simple and painless.
webbot works even for webpages with dynamically changing ids and class names, and has more methods and features than Selenium or Mechanize.
Here's a snippet of webbot
from webbot import Browser
web = Browser()
web.go_to('google.com')
web.click('Sign in')
web.type('mymail@gmail.com', into='Email')
web.click('NEXT', tag='span')
web.type('mypassword', into='Password', id='passwordFieldId')  # specific selection
web.click('NEXT', tag='span')  # you are logged in ^_^
Docs are at: https://webbot.readthedocs.io

Related

While web scraping a website in Python I don't get the expected response, just a script tag containing a few lines of code

I'm trying to scrape data from a site in Python. The payload is right and everything works, but the response, which would normally be the HTML source of the page, is instead just a script tag with some error written in it. See the response I get, enclosed:
b'<script language="JavaScript">\nerr = "";\nlargeur = 1024;\nif (screen.width>largeur) { document.location.href="accueil.php?" +err;\t}\nelse { document.location.href="m.accueil.php?largeur=" +screen.width +\'&\' +err;\t}\n</script>'
Information:
After looking at the site, it seems to use Google Analytics. I don't really know what that is, but maybe because of the preview thing it can't load the page, since I'm not accessing it with a browser.
What tool are you using to web scrape? Tools like Beautiful Soup parse pre-loaded HTML content. If a website uses client-side rendering and JavaScript to load content, HTML parsers often will not work.
You can instead use an automated browser that interacts with a website just as a regular user would. These automated browsers can operate with or without a GUI. When run without a GUI (also known as headless browsers), they take less time and fewer resources than when run with one. Here's a fairly exhaustive list of headless browsers you can use. Note that not all are compatible with Python.
As Buran mentioned in the comments, Selenium is an option. Selenium is very well documented and has a large community following, so it's easy to find helpful articles or tutorials. It's a multi-driver, so it can run different browsers (Firefox, Chrome, etc.), both headless and with a GUI.
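As an aside, the script tag in the posted response isn't really an error: it is a client-side redirect based on screen width. For reference, extracting its redirect targets with the standard library could look like this:

```python
import re

# The exact bytes from the question: the "error" is really a JS redirect.
body = b'<script language="JavaScript">\nerr = "";\nlargeur = 1024;\nif (screen.width>largeur) { document.location.href="accueil.php?" +err;\t}\nelse { document.location.href="m.accueil.php?largeur=" +screen.width +\'&\' +err;\t}\n</script>'

# Pull out the redirect targets; a desktop-sized screen goes to the first one.
targets = re.findall(rb'document\.location\.href="([^"]+)"', body)
print(targets)  # [b'accueil.php?', b'm.accueil.php?largeur=']
```

Requesting `accueil.php` (relative to the same base URL) with the original session may then return the real page, though an automated browser remains the more general solution.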

Run a web search through python?

I'm trying to run web searches using a Python script. I know how to make it work for most sites, such as using the requests library to get "url + query arguments".
I'm trying to run searches on wappalyzer.com, but when you run a search its URL doesn't change. I also tried inspecting the HTML to figure out where the search takes place, so that I could use Beautiful Soup to change the HTML and run it, but to no avail. I'm really new to web scraping, so I would love the help.
The URL does not change because the search works with JavaScript and asynchronous requests. The easiest way to automate such a task is to execute the JavaScript and interact with the page programmatically (often easier than reverse-engineering the requests the client makes, unless a public API is available).
You could use Selenium with Python, which is pretty easy to use, or any automation framework that executes JavaScript by running a web driver (GeckoDriver, ChromeDriver, PhantomJS).
With Selenium, you will be able to program your scraper pretty easily by selecting the search field (using CSS selectors or XPath, for example), entering a value, and submitting the search. You will then be able to dump the whole page or the specific parts you need.

Can anyone clarify some options for Python Web automation

I'm trying to make a simple script in python that will scan a tweet for a link and then visit that link.
I'm having trouble determining which direction to go from here. From what I've researched, it seems that I can use Selenium or Mechanize, which can be used for browser automation. Would using these be considered web scraping?
Or
I can learn one of the Twitter APIs, the Requests library, and Pyjamas (which converts Python code to JavaScript), so I can make a simple script and load it as a Google Chrome/Firefox extension.
Which would be the better option to take?
There are many different ways to go when doing web automation. Since you're doing stuff with Twitter, you could try the Twitter API. If you're doing any other task, there are more options.
Selenium is very useful when you need to click buttons or enter values in forms. The only drawback is that it opens a separate browser window.
Mechanize, unlike Selenium, does not open a browser window and is also good for manipulating buttons and forms. It might need a few more lines to get the job done.
Urllib/Urllib2 is what I use. Some people find it a bit hard at first, but once you know what you're doing, it is very quick and gets the job done. Plus you can do things with cookies and proxies. It is a built-in library, so there is no need to download anything.
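In Python 3 terms (where urllib and urllib2 were merged into urllib.request), a cookie-and-proxy-aware setup might be sketched like this; the proxy address is made up for the example:

```python
import urllib.request
from http.cookiejar import CookieJar

# An opener that keeps cookies between requests and routes through a proxy.
# Drop the ProxyHandler if you don't need one.
jar = CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar),
    urllib.request.ProxyHandler({"http": "http://127.0.0.1:8080"}),
)
opener.addheaders = [("User-Agent", "my-bot/0.1")]

# html = opener.open("http://example.com/").read()  # cookies persist across calls
```

Every request made through this opener reuses the cookie jar, which is what keeps a login session alive between page fetches.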
Requests is just as good as urllib, but I don't have a lot of experience with it. You can do things like add headers. It's a very good library.
Once you get the page you want, I recommend you use BeautifulSoup to parse out the data you want.
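A minimal BeautifulSoup sketch of that last step, assuming the bs4 package is installed (the HTML sample is invented):

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = """
<div class="tweet">
  <p>Check this out: <a href="http://example.com/article">link</a></p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
links = [a["href"] for a in soup.find_all("a")]
print(links)  # ['http://example.com/article']
```

The same `find_all` / attribute-access pattern works on a full page fetched with urllib or Requests.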
I hope this leads you in the right direction for web automation.
I am not an expert in web scraping, but I have had some experience with both Mechanize and Selenium. I think in your case either Mechanize or Selenium will suit your needs well, but also spend some time looking into these Python libraries: Beautiful Soup, urllib and urllib2.
In my humble opinion, I would recommend Mechanize over Selenium in your case, because Selenium is not as lightweight as Mechanize. Selenium is used for emulating a real web browser, so you can actually perform click actions.
There are some drawbacks to Mechanize. You will find that Mechanize gives you a hard time when you try to click an input-type button. Also, Mechanize doesn't understand JavaScript, so many times I have had to mimic what the JavaScript does in my own Python code.
Last piece of advice: if you somehow decide to pick Selenium over Mechanize in the future, use a headless browser like PhantomJS rather than Chrome or Firefox to reduce Selenium's computation time. Hope this helps, and good luck.
For
web automation: "webbot"
web scraping: "scrapy"
webbot works even for webpages with dynamically changing ids and class names, and has more methods and features than Selenium or Mechanize.
Here's a snippet of webbot
from webbot import Browser
web = Browser()
web.go_to('google.com')
web.click('Sign in')
web.type('mymail@gmail.com', into='Email')
web.click('NEXT', tag='span')
web.type('mypassword', into='Password', id='passwordFieldId')  # specific selection
web.click('NEXT', tag='span')  # you are logged in ^_^
For web scraping, Scrapy seems to be the best framework.
It is very well documented and easy to use.

Python and webbrowser form fill

Hello, how can I make changes in my web browser with Python, like filling forms and pressing Submit?
What libs should I use? And maybe some of you have some examples?
Using urllib does not make any changes in the opened browser for me.
Urllib is not intended to do anything with your browser, but rather to fetch contents from URLs.
To fill in forms and that kind of thing, have a look at mechanize; to scrape webpages, consider using pyquery.
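If mechanize isn't an option, the form-filling part can also be sketched with the standard library: parse the form's fields (hidden ones included), fill in your values, and encode the body to POST. The form markup and field names below are a made-up example:

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

# A typical login form; real pages often carry hidden fields (CSRF tokens,
# session ids) that must be echoed back for the submit to be accepted.
form_html = '''
<form action="/login" method="post">
  <input type="hidden" name="csrf" value="abc123">
  <input type="text" name="user">
  <input type="password" name="pass">
</form>
'''

class FormFields(HTMLParser):
    """Collects the name/value pairs of the form's input elements."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            d = dict(attrs)
            if "name" in d:
                self.fields[d["name"]] = d.get("value") or ""

parser = FormFields()
parser.feed(form_html)
parser.fields.update({"user": "alice", "pass": "s3cret"})

body = urlencode(parser.fields).encode("ascii")
print(body)  # b'csrf=abc123&user=alice&pass=s3cret'
```

POSTing `body` to the form's `action` URL (with a cookie-aware opener) is what "pressing Submit" amounts to on the wire; it won't change anything in an already-open browser window.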
Selenium is great for this. It's a browser automation tool that you can use to launch a browser (any major browser, or a 'headless' one), navigate to a URL, and interact with the page.
It's used primarily for testing web code against multiple browsers, but it is also very useful for 'scraping' pages and automating mundane tasks.
Here are the Python docs: http://selenium-python.readthedocs.org/en/latest/index.html

Dynamically Alter HTML Source

I am curious whether there might be a way to dynamically alter the source of a web page automatically.
For instance, I know the Firebug plugin for Firefox allows you to modify the source and see the reaction in real time. So, say I want to log in via a particular form. Could I alter this dynamic source with a login name and password and enter the website in question via some automated script? If not, are there any potential alternatives to this approach that may fare better?
Thanks.
If you want something that can automate the IE browser, what I can recommend to you is Watir and WatiN. Watir is developed in Ruby, while WatiN is developed in C#. They are both quite powerful, more than enough to meet your requirements.
If you have to use Python scripts, then I would recommend C# + WatiN + IronPython. You can write Python scripts that call WatiN's DLL. Please note that IronPython is not the same as Python; it is based on Microsoft's .NET framework. Currently I don't know of any pure Python product that can do the same as WatiN and Watir.
If you want to login to a website automatically you don't need to edit the source, you need to interact with the webserver. Try curl and use it to submit login details and fetch the resulting web page.
For Firefox automation, I recommend Chickenfoot. It can meet your needs:
"alter this dynamic source with a login name and password and enter the website in question via some automated script"
But Chickenfoot only supports up to Firefox 3. If you want to support the newest version of Firefox, you might have to get the source code and compile it yourself.
If you can use JavaScript, try:
document.write("HTML CODE HERE");
But if you need it in Python, I think you can use replace() on the HTML source code.
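A sketch of that replace idea in Python (the input markup and the values are invented for illustration):

```python
# Patch the fetched HTML source before re-serving or re-parsing it.
source = '<input type="text" name="login" value="">'
patched = source.replace('name="login" value=""',
                         'name="login" value="myuser"')
print(patched)  # <input type="text" name="login" value="myuser">
```

Note that this only changes your local copy of the source; to actually log in, the filled-in values still have to be submitted to the server.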
