I'm trying to make a simple script in Python that will scan a tweet for a link and then visit that link.
I'm having trouble determining which direction to go from here. From what I've researched, it seems I could use Selenium or Mechanize, both of which can be used for browser automation. Would using these be considered web scraping?
Or
I could learn one of the Twitter APIs, the Requests library, and Pyjamas (which converts Python code to JavaScript) so I can make a simple script and load it as a Google Chrome/Firefox extension.
Which would be the better option to take?
There are many different ways to go when doing web automation. Since you're working with Twitter, you could try the Twitter API. For other tasks, there are more options.
Selenium is very useful when you need to click buttons or enter values in forms. The only drawback is that it opens a separate browser window.
Mechanize, unlike Selenium, does not open a browser window and is also good for manipulating buttons and forms. It might need a few more lines to get the job done.
Urllib/Urllib2 is what I use. Some people find it a bit hard at first, but once you know what you're doing, it is very quick and gets the job done. Plus you can do things with cookies and proxies. It is a built-in library, so there is no need to download anything.
Requests is just as good as urllib, but I don't have a lot of experience with it. You can do things like add headers. It's a very good library.
Once you get the page you want, I recommend you use BeautifulSoup to parse out the data you want.
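For instance, here is a minimal sketch using Requests and BeautifulSoup together; the URL is just a placeholder, and the idea is simply to fetch a page and pull out its links:
import requests
from bs4 import BeautifulSoup

# Fetch the page (placeholder URL) and hand the HTML to BeautifulSoup
response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# Collect the href of every link on the page
links = [a.get("href") for a in soup.find_all("a") if a.get("href")]
print(links)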
I hope this leads you in the right direction for web automation.
I am not an expert in web scraping, but I have some experience with both Mechanize and Selenium. I think in your case either Mechanize or Selenium will suit your needs well, but also spend some time looking into these Python libraries: Beautiful Soup, urllib and urllib2.
In my humble opinion, I would recommend Mechanize over Selenium in your case, because Selenium is not as lightweight as Mechanize. Selenium emulates a real web browser, so you can actually perform click actions.
There are some drawbacks to Mechanize. You will find that Mechanize gives you a hard time when you try to click a button-type input. Also, Mechanize doesn't understand JavaScript, so many times I have had to mimic what the JavaScript is doing in my own Python code.
One last piece of advice: if you do decide to pick Selenium over Mechanize in the future, use a headless browser like PhantomJS rather than Chrome or Firefox to reduce Selenium's computation time. Hope this helps and good luck.
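If you do end up with Selenium, here is a minimal sketch of running it headlessly. Note that recent Selenium releases have dropped PhantomJS support, so this sketch assumes headless Firefox instead:
from selenium import webdriver

# Run Firefox without opening a visible window
options = webdriver.FirefoxOptions()
options.add_argument("--headless")

driver = webdriver.Firefox(options=options)
driver.get("https://example.com")  # placeholder URL
print(driver.title)
driver.quit()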
For
web automation: "webbot"
web scraping: "scrapy"
webbot works even for web pages with dynamically changing IDs and class names, and it has more methods and features than Selenium and Mechanize.
Here's a snippet of webbot:
from webbot import Browser
web = Browser()
web.go_to('google.com')
web.click('Sign in')
web.type('mymail@gmail.com', into='Email')
web.click('NEXT', tag='span')
web.type('mypassword', into='Password', id='passwordFieldId')  # specific selection
web.click('NEXT', tag='span')  # you are logged in ^_^
For web scraping, Scrapy seems to be the best framework.
It is very well documented and easy to use.
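As a rough illustration, a minimal Scrapy spider could look like this (the URL and the item fields are placeholders):
import scrapy

class LinkSpider(scrapy.Spider):
    name = "links"
    start_urls = ["https://example.com"]  # placeholder URL

    def parse(self, response):
        # Yield one item per link found on the page
        for href in response.css("a::attr(href)").getall():
            yield {"link": href}
You can run a standalone spider like this with scrapy runspider spider.py.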
I am new to Selenium and web applications. Please bear with me for a second if my question seems way too obvious. Here is my story.
I have written a scraper in Python that uses the Selenium 2.0 WebDriver to crawl AJAX web pages. One of the biggest challenges (and an ethical concern) is that I do not want to burn down the website's server. Therefore I need a way to monitor the number of requests my webdriver is firing on each page parsed.
I have done some Google searches. It seems like only Selenium RC provides such functionality. However, I do not want to rewrite my code just for this reason. As a compromise, I decided to limit the rate of method calls that potentially lead to the headless browser firing requests to the server.
In the script, I have the following kind of method calls:
driver.find_element_by_XXXX()
driver.execute_script()
webElement.get_attribute()
webElement.text
I use the second function to scroll to the bottom of the window and get the AJAX content, like the following:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
Based on my intuition, only the second call will trigger requests, since the others appear to just parse existing HTML content.
Is my intuition wrong?
Many many thanks
Perhaps I should elaborate more. I am automating a crawling process on a website in Python. A substantial amount of work has already been done, and the script runs without major bugs.
My colleagues, however, reminded me that if, in the process of crawling a page, I make too many requests for the AJAX list within a short time, I may get banned by the server. This is why I started looking for a way to monitor the number of requests I am firing from my headless PhantomJS browser in the script.
Since I cannot find a way to monitor the number of requests in script, I made the compromise I mentioned above.
"Therefore I need a way to monitor the number of requests my webdriver is firing on each page parsed"
As far as I know, the number of requests depends on the webpage's design, i.e. the resources used by the page and the requests made by JavaScript/AJAX. WebDriver opens a browser and loads the webpage just like a normal user would.
In Chrome, you can check the requests and responses using the Developer Tools panel. You can refer to this post. The current UI design of Developer Tools is different, but the basic functions are still the same. Alternatively, you can also use the Firebug plugin in Firefox.
Updated:
Another method to check the requests and responses is by using Wireshark. Please refer to these Wireshark filters.
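If you really want to count requests from inside the script itself, one possible approach (an assumption on my part, requiring Chrome/chromedriver rather than PhantomJS) is to enable Chrome's performance log through Selenium and count the Network.requestWillBeSent entries:
import json
from selenium import webdriver

# Ask Chrome to record its DevTools performance log (assumes Selenium 4 with chromedriver)
options = webdriver.ChromeOptions()
options.set_capability("goog:loggingPrefs", {"performance": "ALL"})

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # placeholder URL

# Count the network requests recorded while the page loaded
entries = driver.get_log("performance")
requests_sent = sum(
    1 for entry in entries
    if json.loads(entry["message"])["message"]["method"] == "Network.requestWillBeSent"
)
print(requests_sent)
driver.quit()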
I am familiar with using BeautifulSoup and urllib2 to scrape data from a webpage. However, what if a parameter needs to be entered into the page before the result I want to scrape is returned?
I'm trying to obtain the geographic distance between two addresses using this website: http://www.freemaptools.com/how-far-is-it-between.htm
I want to be able to go to the page, enter two addresses, click "Show", and then extract the "Distance as the Crow Flies" and "Distance by Land Transport" values and save them to a dictionary.
Is there any way to input data into a webpage using Python?
Take a look at tools like mechanize or scrape:
http://pypi.python.org/pypi/mechanize
http://stockrt.github.com/p/emulating-a-browser-in-python-with-mechanize/
http://www.ibm.com/developerworks/linux/library/l-python-mechanize-beautiful-soup/
http://zesty.ca/scrape/
Packt Publishing has an article on that matter, too:
http://www.packtpub.com/article/web-scraping-with-python
Yes! Try mechanize for this kind of Web screen-scraping task.
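A minimal sketch with mechanize might look like the following; the form index and field names are assumptions rather than details taken from the actual site, and if the page builds its results with JavaScript, mechanize will not see them:
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # skip robots.txt handling; use responsibly
br.open("http://www.freemaptools.com/how-far-is-it-between.htm")

br.select_form(nr=0)             # assumes the first form on the page is the right one
br["address1"] = "Florida, USA"  # hypothetical field names
br["address2"] = "New York, USA"
response = br.submit()
html = response.read()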
I think you can also use PySide/PyQt, because they include the QtWebKit browser core. You can control the browser to open pages, simulate human actions (fill, click, ...), and then scrape data from the pages. FMiner works this way; it's a web scraping tool I developed with PySide.
Or you can try PhantomJS; it's an easy way to control a browser, but note that it's JavaScript, not Python.
In addition to the answers already given, you could simply make a request against that page directly. Using your browser, you can inspect the network activity (under Tools/Web Developer tools) when you interact with the page. E.g. http://www.freemaptools.com/ajax/getaandb.php?a=Florida_Usa&b=New%20York_Usa&c=6052 is the request that returns the results you are expecting. Request that URL and scrape the fields you want. IMHO, direct page requests are much faster than screen scraping (on a case-to-case basis).
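A minimal sketch of that idea with the Requests library; the parameters mirror the URL above, but the format of the reply is an assumption, so inspect the actual response in your browser first:
import requests

# Hit the AJAX endpoint observed in the browser's network panel
params = {"a": "Florida_Usa", "b": "New York_Usa", "c": "6052"}
response = requests.get("http://www.freemaptools.com/ajax/getaandb.php", params=params)

# Print the raw reply and parse whatever it actually contains
print(response.status_code)
print(response.text)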
But of course, you could always do screen scraping/browser simulation also (Mechanize, Splinter) and use headless browsers (PhantomJS, etc.) or the browser driver of the browser you want to use.
The question may already have been resolved, but you can use Selenium WebDriver for this purpose. A web page can be interacted with programmatically, and all the operations can be performed as if a human user were accessing the page.
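A rough sketch of what that could look like for the page in question; the element IDs here are hypothetical placeholders, so inspect the page to find the real ones:
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
driver.get("http://www.freemaptools.com/how-far-is-it-between.htm")

# The element IDs below are hypothetical, not taken from the real page
driver.find_element(By.ID, "fromInput").send_keys("Florida, USA")
driver.find_element(By.ID, "toInput").send_keys("New York, USA")
driver.find_element(By.ID, "showButton").click()

print(driver.find_element(By.ID, "crowFliesResult").text)
driver.quit()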
Hello, how can I make changes in my web browser with Python, like filling in forms and pressing Submit?
What libraries should I use? Does anyone have some examples?
Using urllib does not make any changes in the opened browser for me.
urllib is not intended to do anything with your browser, but rather to fetch content from URLs.
To fill in forms and do that kind of thing, have a look at mechanize; to scrape the resulting pages, consider using pyquery.
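For the scraping half, a small pyquery sketch (the HTML and selector are made up for illustration):
from pyquery import PyQuery as pq

# A made-up fragment standing in for a real page
html = '<div><a class="result" href="/page">Result</a></div>'
doc = pq(html)

# Select elements with CSS selectors, much like jQuery
link = doc("a.result")
print(link.text())        # Result
print(link.attr("href"))  # /page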
Selenium is great for this. It's a browser automation tool that you can use to launch a browser (any major browser or a 'headless' one), navigate to a url, and interact with the page.
It's used primarily for testing web code against multiple browsers, but is also very useful for 'scraping' pages and automating mundane tasks.
Here are the python docs: http://selenium-python.readthedocs.org/en/latest/index.html
link text
This is a link from a digital book library. There are forward and backward buttons to see the next and previous page. I want to download these pictures automatically. I once used urllib in Python, but the website banned it soon after. I just want to download this book for study purposes, so can anyone recommend some programming tools, such as web spiders, that can simulate the process of turning pages and fetch the pictures automatically? Thanks!
That site uses JavaScript, so you can't easily scrape it with Python alone. Two suggestions:
Work out what requests are being made when you click the next button. You can do this with a tool like Firebug. You might then find you can scrape it without processing any JS.
Use a tool such as Selenium, which allows for browser scripting and lets you "execute" the JS.
As for the site blocking you, there are two ways to reduce the chance of being blocked:
Change your user-agent to that of a common browser, e.g. Firefox.
Add random delays between fetching each image, so that you appear more human-like (both ideas are sketched below).
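A minimal sketch of both ideas using the Requests library; the URL pattern is a placeholder for wherever the page images actually live:
import random
import time
import requests

# Pretend to be a common browser rather than the default Python user agent
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0"}

for page in range(1, 11):
    # Placeholder URL pattern; substitute the real image URLs you discover
    url = "https://example.com/book/page_%d.jpg" % page
    image = requests.get(url, headers=headers)
    with open("page_%d.jpg" % page, "wb") as f:
        f.write(image.content)

    # Random delay so the access pattern looks more human
    time.sleep(random.uniform(2, 6))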
wget is an excellent web spider
http://linux.die.net/man/1/wget
You need a real browser to work with this kind of site. Selenium is one option, but it is geared more towards web testing. For web scraping, iMacros is really nice. I did a quick test and it works well with iMacros for Firefox/IE.
Chris
I am trying to write a Python-based Web Bot that can read and interpret an HTML page, then execute an onClick function and receive the resulting new HTML page. I can already read the HTML page and I can determine the functions to be called by the onClick command, but I have no idea how to execute those functions or how to receive the resulting HTML code.
Any ideas?
The only Python tool for JavaScript that I am aware of is python-spidermonkey. I have never used it, though.
With Jython you could (ab-)use HttpUnit.
Edit: I forgot that you can use Scrapy. It supports JavaScript through SpiderMonkey, and you can even use Firefox for crawling the web.
Edit 2: Recently, I find myself using browser automation more and more for such tasks thanks to some excellent libraries. QtWebKit offers full access to a WebKit browser, which can be used in Python thanks to language bindings (PySide or PyQt). There seem to be similar libraries and bindings for Gtk+ which I haven't tried. Selenium WebDriver API also works great and has an active community.
Well, obviously Python won't interpret the JS for you (though there may be modules out there that can). I suppose you need to convert the JS instructions into equivalent transformations in Python.
I suppose ElementTree or BeautifulSoup would be good starting points to interpret the HTML structure.
To execute JavaScript, you need to do much of what a full web browser does, except for the rendering. In particular, you need a JavaScript interpreter, in addition to the Python interpreter.
One starting point might be python-spidermonkey. Depending on the specific JavaScript, you might have to provide a good DOM API to the spidermonkey, in addition to providing an XmlHttpRequest implementation.
You can try to leverage V8,
V8 is Google's open source, high performance JavaScript engine. It is written in C++ and is used in Google Chrome, Google's open source browser.
Calling it from Python may not be straightforward, without a framework to provide the DOM.
Pyjamas has an experimental project, Pyjamas Desktop, providing V8 integration for Javascript execution.
PyV8 is an experimental set of Python bindings for V8 and a Python-to-JavaScript compiler.
For the browser part of this, you might want to look into Mechanize, which is basically a web browser implemented as a Python library. http://pypi.python.org/pypi/mechanize/0.1.11
But as mentioned, the text in onClick is JavaScript, and you'll need spidermonkey for that.
If you can make a generic support for spidermonkey in mechanize, I'm sure many people would be extremely happy. ;)
Mechanize may be overkill; maybe you just want to find specific parts of the HTML, in which case lxml and BeautifulSoup both work well.
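For instance, a small lxml sketch that pulls the onClick handlers out of a page (the HTML fragment here is made up):
from lxml import html

# A made-up fragment standing in for the real page
page = '<html><body><a onclick="loadNext()">Next</a></body></html>'
tree = html.fromstring(page)

# Find every element that carries an onclick attribute and print it
for element in tree.xpath("//*[@onclick]"):
    print(element.text, element.get("onclick"))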
Why don't you just sniff what gets sent after the onclick event and replicate that with your bot?
For web automation, you can look into the "webbot" library. It makes automation damn simple and pain-free.
webbot works even for web pages with dynamically changing IDs and class names, and it has more methods and features than Selenium and Mechanize.
Here's a snippet of webbot:
from webbot import Browser
web = Browser()
web.go_to('google.com')
web.click('Sign in')
web.type('mymail@gmail.com', into='Email')
web.click('NEXT', tag='span')
web.type('mypassword', into='Password', id='passwordFieldId')  # specific selection
web.click('NEXT', tag='span')  # you are logged in ^_^
Docs are at: https://webbot.readthedocs.io