How to automate web scraper selector rules? - python

I am planning to automate web scraper selector rules. Currently we use Selenium to scrape web pages, and it is a universal problem that websites tend to change their DOM structure. Is there a way to automate the process so that, if a particular rule breaks, the rules are restructured based on the new DOM structure? I know it's extremely difficult when the DOM structure changes completely, but is there a way to identify and fix leaf-level changes?
As far as I know there is no Python framework that offers a direct solution to this. Is there any Python library I could use to help me with this?
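As a sketch of the kind of "fallback rule" behaviour being asked about (the URL, field names and selectors below are purely hypothetical examples), one simple pattern is to keep several candidate selectors per field and fall back to the next one when the primary stops matching:

```python
# Sketch of a "fallback selector" pattern -- all selectors and the URL here are
# hypothetical examples; a real scraper would load its rules from configuration.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.common.exceptions import NoSuchElementException

# Ordered candidate selectors per field: the first one that matches wins.
SELECTOR_RULES = {
    "price": ["span.price", "div.product-price > span", "span[itemprop='price']"],
}

def extract_field(driver, field):
    """Try each candidate CSS selector in order and report which one matched."""
    for css in SELECTOR_RULES[field]:
        try:
            value = driver.find_element(By.CSS_SELECTOR, css).text
            return css, value
        except NoSuchElementException:
            continue  # leaf-level change: fall through to the next candidate rule
    raise LookupError(f"no selector for '{field}' matched; rules need updating")

driver = webdriver.Chrome()
driver.get("https://example.com/product")  # placeholder URL
matched_selector, price = extract_field(driver, "price")
print(matched_selector, price)
driver.quit()
```

This does not rebuild the rules automatically, but logging which fallback matched (or that none did) at least identifies leaf-level changes so the rules can be reviewed.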

Related

Web scraping an array with multiple pages

I am trying to scrape this table from Etherscan.io: https://etherscan.io/tokens?ps=100&p=1
Since I am not that familiar with XPath, I have chosen to use the Chrome extension Webscrap.
The problem being the table is contained in several pages.
Is there a way of scraping all the pages at once without mapping each page one by one?
I could also try to do it with Python directly; I know there are some pretty good libraries out there.
Is that do-able? Would it take me a long time to learn (I know very little about HTML and XPath)?
If so, what would be the easiest and quickest library to learn?
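For reference, here is a hedged sketch of the Python approach (requests plus BeautifulSoup) that simply loops over the p= page parameter. The selectors are assumptions, and Etherscan may rate-limit or block scripted requests, so this only illustrates the pagination pattern:

```python
# Sketch: loop over the paginated URL and collect table rows.
# Assumes the data sits in an ordinary HTML <table>; adjust selectors as needed.
import requests
from bs4 import BeautifulSoup

rows = []
for page in range(1, 4):  # first 3 pages as an example
    url = f"https://etherscan.io/tokens?ps=100&p={page}"
    resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    for tr in soup.select("table tr"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if cells:
            rows.append(cells)

print(len(rows), "rows collected")
```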

Are selenium or other web scraper tools mandatory for scraping data from Chrome into a Python script?

So I wanted to scrape a website's data. I have used Selenium in my Python script to scrape the data. But I noticed that in the Network section of Google Chrome's Inspect, Chrome records the XmlHttpRequests and reveals the JSON/XML files the website loads. So I was wondering: can I use this data directly in my Python script, since Selenium is quite heavyweight and needs more bandwidth? Do Selenium or other web scraper tools have to be used as a medium to communicate with the browser? If not, please give some information about scraping data for my Python file using Chrome itself.
Definitely! Check out the requests module.
From there you can access the page source, and using data from it you can access the different aspects separately. Here are the things to consider though:
Pros:
Faster, with less to download; for things like AJAX requests it is far more efficient.
Does not require a graphical UI like Selenium.
More precise; you get exactly what you need.
The ability to set headers/cookies/etc. before making requests.
Images may be downloaded separately, with no obligation to download any of them.
Allows as many sessions as you want to be opened in parallel, each with different options (proxies, no cookies, consistent cookies, custom headers, blocked redirects, etc.) without affecting the others.
Cons:
Much harder to get into than Selenium; requires at least minimal knowledge of HTTP's GET and POST, and a library like re or BeautifulSoup to extract data.
For pages with JavaScript-generated data, depending on how the JavaScript is implemented (or obfuscated), extracting the wanted data is always possible but can be extremely difficult.
Conclusion:
I suggest you definitely learn requests and use it for most cases; however, if the JavaScript gets too complicated, switch to Selenium for an easier solution. Look for some tutorials online, and then check the official page for an overview of what you've learned.
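As a minimal illustration of the approach (the endpoint, parameters and headers below are made up; you would copy the real ones from the XHR entry in Chrome's Network tab):

```python
# Sketch: call the JSON endpoint found in the Network tab directly,
# instead of driving a full browser. URL, params and headers are placeholders.
import requests

url = "https://example.com/api/items"      # copy from the XHR entry in DevTools
params = {"page": 1, "per_page": 100}      # copy the query string parameters
headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
    # Some sites also require cookies or an auth token copied from the request.
}

resp = requests.get(url, params=params, headers=headers, timeout=30)
resp.raise_for_status()
data = resp.json()  # parsed JSON, no HTML parsing and no browser needed
print(data)
```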

Browsing/parsing html pages in python

I'm trying to put together a little collection of plugins that I need in order to interact with HTML pages. What I need ranges from simple browsing and interacting with buttons or links of a web page (as in "write some text in this textbox and press this button") to parsing an HTML page and sending custom GET/POST messages to the server.
I am using Python 3, and so far I have Requests for simple webpage loading and custom GET and POST messages, BeautifulSoup for parsing the HTML tree, and I'm thinking of trying out Mechanize for simple web page interactions.
Are there any other libraries out there that are similar to the three I am using so far? Is there some sort of gathering place where all Python libraries hang out? Because I sometimes find it difficult to find what I am looking for.
The set of tools/libraries for web scraping really depends on multiple factors: purpose, complexity of the page(s) you want to crawl, speed, limitations, etc.
Here's a list of tools that are popular in the Python web-scraping world nowadays:
selenium
Scrapy
splinter
ghost.py
requests (and grequests)
mechanize
There are also HTML parsers out there, these are the most popular:
BeautifulSoup
lxml
Scrapy is probably the best thing created for web scraping in Python. It's really a web-scraping framework that makes everything easy and straightforward; Scrapy provides everything you can imagine for web crawling.
Note: if there is a lot of AJAX and JS involved in loading and forming the page, you would need a real browser to deal with it. This is where selenium helps - it drives a real browser, allowing you to interact with it with the help of a WebDriver.
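To give a feel for what Scrapy looks like, here is a minimal spider sketch (the start URL and selectors are placeholders); it can be run with "scrapy runspider spider.py -o out.json":

```python
# spider.py -- minimal Scrapy spider sketch; start_urls and selectors are placeholders.
import scrapy

class LinksSpider(scrapy.Spider):
    name = "links"
    start_urls = ["https://example.com/"]

    def parse(self, response):
        # Yield every link on the page as an item.
        for href in response.css("a::attr(href)").getall():
            yield {"url": response.urljoin(href)}
        # Follow pagination if a "next" link exists.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```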
Also see:
Web scraping with Python
Headless Selenium Testing with Python and PhantomJS
HTML Scraping
Python web scraping resource
Parsing HTML using Python
Hope that helps.

How to search for specific links (which may be present in a PDF file) in a website and crawl those links for other information?

I have a task to complete. I need to make a web crawler kind of application. What I need to do is pass a URL to my application. This URL is the website of a government agency, and it also has links to other individual agencies which are approved by this government agency. I need to go to those links and get some information from each site about that agency. I hope I'm making myself clear. Now I have to make this application generic, which means I can't hard-code it for just one website (government agency). It needs to work so that, for any URL given to it, it checks the page, gets all the links, and proceeds. On some websites these links are present in PDFs, and on some they are on a page.
I have to use Python for this, and I don't know how to approach it. I spent some time on this using BeautifulSoup, but that requires a lot of parsing. Other options are Scrapy or twill. Honestly, I am new to Python and don't know which one is better for this task. Can anyone help me select the right tool and the right approach to solve this problem? Thanks in advance.
There is plenty of information out there about building web scrapers with Python. Python is a great tool for the job.
There are also tons of posts about web scrapers on this website if you search for them.
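As a starting point, here is a hedged sketch of the HTML side of that task with requests and BeautifulSoup (the start URL is a placeholder; links buried inside PDFs would need an extra step with a PDF library):

```python
# Sketch: fetch a page and collect the absolute URLs of every link on it.
# The start URL is a placeholder; real crawling should also respect robots.txt.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

start_url = "https://example.gov/agencies"   # placeholder government-agency page
resp = requests.get(start_url, timeout=30)
resp.raise_for_status()
soup = BeautifulSoup(resp.text, "html.parser")

links = [urljoin(start_url, a["href"]) for a in soup.find_all("a", href=True)]
pdf_links = [u for u in links if u.lower().endswith(".pdf")]

print(len(links), "links found,", len(pdf_links), "of them PDFs")
# Each link (or each PDF, after extracting its links with a PDF library)
# can then be fetched and parsed the same way.
```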

Best way to automate creation routine using Python

Usually, every day I have to create emails, save them into a file, and then use them to register accounts, etc. It's really boring and takes a good amount of time, and I don't want to waste it.
I want to automate this process. As I understand it, this is called a "bot". What it should do is go through a few websites, click some buttons, scrape needed information, store the collected information, and fill in some forms. Is it possible to do this with Python? If yes, what's the most compact way to do it?
Python's selenium bindings are a great way to automate browser sessions, scrape page data, fill out forms, click buttons, etc. Selenium allows the page JS to run, then parses the DOM for you and makes it available as Python objects through a well-documented API:
http://selenium-python.readthedocs.org/en/latest/
Much, much easier than trying to parse it yourself.
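A short sketch of what that looks like with the Selenium bindings (the URL and element locators are placeholders; newer Selenium versions use the By locator style shown here):

```python
# Sketch: open a page, type into a textbox and click a button with Selenium.
# URL and locators are placeholders; adjust them to the real form.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()                   # needs a matching chromedriver
driver.get("https://example.com/signup")      # placeholder URL

driver.find_element(By.NAME, "email").send_keys("user@example.com")
driver.find_element(By.NAME, "password").send_keys("s3cret")
driver.find_element(By.ID, "submit").click()  # placeholder button id

print(driver.title)                           # e.g. confirm the next page loaded
driver.quit()
```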
The Scrapy python module should meet your needs:
http://scrapy.org/
