Hello, how can I make changes in my web browser with Python, like filling in forms and pressing Submit?
Which libraries should I use? Does anyone have some examples?
Using urllib does not make any changes in the opened browser for me.
Urllib is not intended to do anything with your browser; it fetches content from URLs.
To fill in forms and do that kind of thing, have a look at mechanize; to scrape the pages, consider using pyquery.
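For instance, a tiny pyquery sketch (the URL is a placeholder):

from pyquery import PyQuery as pq

d = pq(url="http://example.com")  # fetch and parse the page in one step
print(d("title").text())          # jQuery-style selection of the <title> element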
Selenium is great for this. It's a browser automation tool that you can use to launch a browser (any major browser or a 'headless' one), navigate to a URL, and interact with the page.
It's used primarily for testing web code against multiple browsers, but is also very useful for 'scraping' pages and automating mundane tasks.
Here are the Python docs: http://selenium-python.readthedocs.org/en/latest/index.html
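As a minimal sketch of filling and submitting a form (the URL and the input name 'q' are placeholders, not a real site):

from selenium import webdriver

driver = webdriver.Firefox()              # launches a real browser window
driver.get("http://example.com/form")     # placeholder URL
field = driver.find_element_by_name("q")  # locate the text input
field.send_keys("hello world")            # type into it
field.submit()                            # submit the enclosing form
print(driver.title)                       # inspect the resulting page
driver.quit()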
Related
I want to build a Python script to submit a form on a website, such as a form to automatically publish an item on a site like eBay.
Is it possible to do this with BeautifulSoup, or is that only for parsing websites?
Is it possible to do it with Selenium, but quickly, without actually opening the browser?
Are there any other ways to do it?
Look at the requests library. Also, check out the Chrome debugger toolbar to see the requests fly by. There is also a utility called Postman, where you can "design" queries and then generate code in many different flavors (including Python's requests library).
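For instance, once you've watched the form's fields go by in the debugger, replaying the POST with requests might look like this (the URL and field names are made up for illustration):

import requests

payload = {"title": "My item", "price": "9.99"}  # form fields seen in the debugger
response = requests.post("http://example.com/submit", data=payload)
print(response.status_code)
print(response.text[:200])                       # first part of the reply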
BeautifulSoup is for parsing HTML.
You can use Selenium with PhantomJS to do this without the browser opening. You have to use the Keys portion of Selenium to send data to the form being submitted. It is also worth noting that this method will not work if there are CAPTCHAs on the form.
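A rough sketch of that approach; the URL and the field name 'q' are placeholders:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.PhantomJS()            # headless: no browser window opens
driver.get("http://example.com/search")   # placeholder URL
field = driver.find_element_by_name("q")  # locate the form field
field.send_keys("my query")               # send data to it
field.send_keys(Keys.RETURN)              # Keys.RETURN submits the form
print(driver.page_source[:200])
driver.quit()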
The mechanize library can fill and submit forms.
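A minimal mechanize sketch; the URL and field names below are hypothetical placeholders:

import mechanize

br = mechanize.Browser()
br.open("http://example.com/login")  # placeholder URL
br.select_form(nr=0)                 # pick the first form on the page
br["username"] = "myuser"            # fill fields by their HTML name
br["password"] = "mypassword"
response = br.submit()               # submit and get the resulting page
print(response.read()[:200])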
I'm trying to make a simple script in Python that will scan a tweet for a link and then visit that link.
I'm having trouble determining which direction to go from here. From what I've researched, it seems that I can use Selenium or Mechanize, which can be used for browser automation. Would using these be considered web scraping?
Or
I can learn one of the Twitter APIs, the Requests library, and Pyjamas (which converts Python code to JavaScript) so I can make a simple script and load it into Google Chrome's/Firefox's extensions.
Which would be the better option to take?
There are many different ways to go when doing web automation. Since you're doing stuff with Twitter, you could try the Twitter API. If you're doing any other task, there are more options.
Selenium is very useful when you need to click buttons or enter values in forms. The only drawback is that it opens a separate browser window.
Mechanize, unlike Selenium, does not open a browser window and is also good for manipulating buttons and forms. It might need a few more lines to get the job done.
Urllib/Urllib2 is what I use. Some people find it a bit hard at first, but once you know what you're doing, it is very quick and gets the job done. Plus you can do things with cookies and proxies. It is a built-in library, so there is no need to download anything.
Requests is just as good as urllib, but I don't have a lot of experience with it. You can do things like add headers. It's a very good library.
Once you get the page you want, I recommend you use BeautifulSoup to parse out the data you want.
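For example, a small sketch that fetches a page with urllib2 and pulls out every link with BeautifulSoup (the URL is a placeholder):

import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen("http://example.com").read()  # fetch the raw HTML
soup = BeautifulSoup(html, "html.parser")            # parse it
for link in soup.find_all("a"):                      # every anchor tag on the page
    print(link.get("href"))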
I hope this leads you in the right direction for web automation.
I am not an expert in web scraping, but I have some experience with both Mechanize and Selenium. I think in your case either Mechanize or Selenium will suit your needs well, but also spend some time looking into these Python libraries: Beautiful Soup, urllib, and urllib2.
In my humble opinion, I would recommend Mechanize over Selenium in your case, because Selenium is not as lightweight compared to Mechanize. Selenium emulates a real web browser, so you can actually perform a 'click' action.
There are some drawbacks to Mechanize. You will find that Mechanize gives you a hard time when you try to click a button of type input. Also, Mechanize doesn't understand JavaScript, so many times I have had to mimic what the JavaScript is doing in my own Python code.
One last piece of advice: if you somehow decide to pick Selenium over Mechanize in the future, use a headless browser like PhantomJS, rather than Chrome or Firefox, to reduce Selenium's computation time. Hope this helps, and good luck.
For:
Web automation: webbot
Web scraping: Scrapy
webbot works even for webpages with dynamically changing IDs and class names, and it has more methods and features than Selenium and Mechanize.
Here's a snippet of webbot:
from webbot import Browser
web = Browser()
web.go_to('google.com')
web.click('Sign in')
web.type('mymail@gmail.com', into='Email')
web.click('NEXT', tag='span')
web.type('mypassword', into='Password', id='passwordFieldId')  # specific selection
web.click('NEXT', tag='span')  # you are logged in ^_^
For web scraping, Scrapy seems to be the best framework.
It is very well documented and easy to use.
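As a taste, here is a minimal spider against Scrapy's own example site, quotes.toscrape.com:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com"]

    def parse(self, response):
        # CSS selectors pull the quote text out of each block
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").extract_first()}

Save it as quotes_spider.py and run it with: scrapy runspider quotes_spider.py -o quotes.json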
I have a task wherein I need to submit a form to a website, but they don't provide any API. I am currently using webdriver and have faced many problems because of the asynchronous nature of the interaction between my code and the browser. I am looking for a lightweight, reliable library/tool with which I can do all the tasks a user does with a browser.
CasperJS is one option that can do the job, but I am more familiar with Python, and Scrapy has a larger developer community compared to CasperJS.
"Navigation utility without browser, light weight and fail-proof" is one related question.
In short, the answer is no.
Scrapy can't render JavaScript, but a browser can.
You can use Selenium.
If you are sure you want to use Scrapy and there is JavaScript that you need to run, you can use one of the following (a sketch of the first combination follows below):
Scrapy with Selenium
Scrapy with GTK/WebKit/jswebkit
Scrapy with webdrivers
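As a rough illustration of the first option, a minimal sketch that lets Selenium (with headless PhantomJS) render each page and hands the resulting HTML to Scrapy's selectors; the URL and the h1 selector are placeholders:

import scrapy
from selenium import webdriver

class JsSpider(scrapy.Spider):
    name = "js_spider"
    start_urls = ["http://example.com"]  # hypothetical target

    def __init__(self, *args, **kwargs):
        super(JsSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.PhantomJS()  # headless: no browser window

    def parse(self, response):
        self.driver.get(response.url)  # let the browser execute the JavaScript
        rendered = scrapy.Selector(text=self.driver.page_source)
        for title in rendered.css("h1::text").extract():
            yield {"title": title}

    def closed(self, reason):
        self.driver.quit()  # shut the browser down when the spider finishes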
If you like CasperJS but want to stick with Python, you should have a look at Ghost.py
I'm trying to write a server process that will let you enter a URL, then every 30 minutes ping that URL and capture it as an image. Is this possible with a combination of something like cURL, urllib2, and PIL?
Curl, urllib2, etc., grab the HTML code for a web page. But a page doesn't look like anything on its own. Instead, a browser uses that code and renders a web page according to its own internal rules of how that code should be used. And, of course, each browser renders the page slightly differently.
In other words, you can't take a snapshot of a page without having a web browser to generate the page to take the snapshot of.
If you're feeling very ambitious, you can create your own custom, scriptable page renderer by using the rendering engine from the browser of your choice -- they all make the rendering engine a separate component that you can work with separately. IE's is called "Trident", Firefox's is called "Gecko", Chrome's is "WebKit", etc.
Otherwise you'll want to just do some sort of UI scripting, like you might do with iOpus or Selenium. Selenium is scriptable with Python, so that's one for you right there.
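A rough sketch of the Selenium route, where a real browser does the rendering and save_screenshot writes the result as a PNG (the URL is a placeholder):

import time
from selenium import webdriver

driver = webdriver.Firefox()
while True:
    driver.get("http://example.com")        # placeholder URL
    driver.save_screenshot("capture.png")   # PNG of the rendered page
    time.sleep(30 * 60)                     # wait 30 minutes before the next capture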
EDIT
Here you go. That looks pretty simple.
The ImageGrab module in PIL can be used to take a screenshot on Windows. However, you can't do this purely using cURL, urllib2, and PIL, because you will have to render the web site. The easiest approach would probably be to open the website in a browser and grab a screenshot.
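A minimal sketch of the screen-grab approach; note that ImageGrab captures the screen itself, so the page must already be rendered in an open browser window:

from PIL import ImageGrab

img = ImageGrab.grab()        # full-screen screenshot (Windows)
img.save("screenshot.png")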
This is a link from a digital book library. There are forward and backward buttons to see the next and previous pages. I want to download these pictures automatically. I once used urllib in Python, but the website banned it soon after. I just want to download this book for study purposes, so can anyone recommend some programming tools, such as web spiders, that can simulate the process of turning pages and get the pictures automatically? Thanks!
That site uses JavaScript, so you can't easily scrape it with Python. Two suggestions:
Work out what requests are being made when clicking the next button. You can do this with a tool like Firebug. You might then find you can scrape it without processing any JS.
Use a tool such as Selenium, which allows for browser scripting and lets you "execute" the JS.
As for the site blocking you, there are two ways to reduce the chance of being blocked (a sketch of both follows below):
Change your user-agent to that of a common browser, e.g. Firefox.
Add random delays between accessing the next image, so that you appear more human-like.
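A rough sketch of both ideas using urllib2; the URL pattern, page range, and user-agent string are made-up placeholders:

import random
import time
import urllib2

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; rv:40.0) Gecko/20100101 Firefox/40.0"}
for page in range(1, 6):                               # hypothetical page range
    url = "http://example.com/book/page%d.jpg" % page  # hypothetical URL pattern
    req = urllib2.Request(url, headers=headers)        # spoof a common browser
    with open("page%d.jpg" % page, "wb") as f:
        f.write(urllib2.urlopen(req).read())           # save the image
    time.sleep(random.uniform(2.0, 8.0))               # human-like random delay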
wget is an excellent web spider
http://linux.die.net/man/1/wget
You need a real browser to work with this kind of site. Selenium is one option, but it is more geared towards web testing. For web scraping, iMacros is really nice. I did a quick test and it works well with iMacros for Firefox/IE.
Chris