Browse links recursively using selenium - python

I'd like to know if it is possible to browse all links on a site (including the parent links and sublinks) using Python Selenium (example: yahoo.com):
fetch all links on the homepage,
open each one of them,
then open all the links in those sublinks, down to three or four levels.
I'm using Selenium with Python.
Thanks
Ala'a

You want "web-scraping" software like Scrapy and possibly Beautifulsoup4 - the first is used to build a program called a "spider" which "crawls" through web pages, extracting structured data from them, and following certain (or all) links in them. BS4 is also for extracting data from web pages, and combined with libraries like requests can be used to build your own spider, though at this point something like Scrapy is probably more relevant to what you need.
There are numerous tutorials and examples out there to help you - just start with the google search I linked above.
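For the recursive-link use case in the question, a Scrapy spider can be surprisingly short. Below is a minimal sketch; the spider name, the Yahoo start URL and the depth limit of three are placeholder choices for illustration, not a definitive implementation:

import scrapy
from scrapy.linkextractors import LinkExtractor

class LinkSpider(scrapy.Spider):
    name = "link_spider"                      # hypothetical spider name
    start_urls = ["https://www.yahoo.com"]    # replace with your own site
    custom_settings = {"DEPTH_LIMIT": 3}      # stop following links after three levels

    def parse(self, response):
        # Record the page that was just visited.
        yield {"url": response.url}
        # Follow every link found on the page; Scrapy handles
        # de-duplication and depth tracking for us.
        for link in LinkExtractor().extract_links(response):
            yield response.follow(link, callback=self.parse)

Saved as link_spider.py, this could be run with scrapy runspider link_spider.py -o links.json.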

Sure, it is possible, but you have to instruct Selenium to visit these links one by one, since you are working within a single browser.
If the pages do not have their links rendered by JavaScript in the browser, it would be much more efficient to fetch them with direct HTTP requests and process them that way; in that case I would recommend using requests. However, even with requests it is up to your code to locate all the URLs on a page and follow up by fetching those pages.
There may also be other Python packages specialized for this kind of task, but I have no real experience with them.
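If you do stay with Selenium, a rough sketch of visiting the links one by one in a single browser might look like the following (breadth-first, with an assumed depth limit of three and a Chrome driver; not a definitive implementation):

from selenium import webdriver
from selenium.webdriver.common.by import By

MAX_DEPTH = 3
driver = webdriver.Chrome()              # assumes chromedriver is on your PATH
visited = set()
queue = [("https://www.yahoo.com", 0)]   # (url, depth)

while queue:
    url, depth = queue.pop(0)
    if url in visited or depth > MAX_DEPTH:
        continue
    visited.add(url)
    driver.get(url)
    # Collect all hrefs before navigating away, because the element
    # references become stale after the next driver.get().
    hrefs = [a.get_attribute("href")
             for a in driver.find_elements(By.TAG_NAME, "a")]
    for href in hrefs:
        if href and href.startswith("http"):
            queue.append((href, depth + 1))

driver.quit()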

Related

How to scrape a website and all its directories from the one link?

Sorry if this is not a valid question; I personally feel it kind of borders on the edge.
Assuming the website involved has given full permission
How could I download the ENTIRE contents (HTML) of that website using a Python data scraper? By entire contents I mean not only the current page you are on, but any other directory that branches off that main website. E.g.
Using the link:
https://www.dogs.com
could I pull info from:
https://www.dogs.com/about-us
and any other directory attached to the "https://www.dogs.com/"
(I have no idea if dogs.com is a real website or not, just an example)
I have already made a scraper that will pull info from a certain link (nothing further than that), but I want to improve it further so I don't have to supply heaps of links. I understand I could use an API, but if this is possible I would rather do it this way. Cheers!
While there is Scrapy to do it professionally, you can use requests to fetch the URL data and bs4 to parse the HTML and look into it. It's also easier for a beginner, I guess.
Whichever way you go, you need a starting point; then you just follow the links on the page, and then the links within those pages.
You might need to check whether each URL links to another website or stays within the targeted one. Find the pages one by one and scrape them, as in the sketch below.
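A minimal sketch of that approach with requests and bs4, using the dogs.com example from the question as the starting point and restricting the crawl to that one domain:

from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

start_url = "https://www.dogs.com"          # the example site from the question
domain = urlparse(start_url).netloc

to_visit = [start_url]
seen = set()

while to_visit:
    url = to_visit.pop()
    if url in seen:
        continue
    seen.add(url)
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    soup = BeautifulSoup(html, "html.parser")
    # ... scrape whatever you need from `soup` here ...
    for a in soup.find_all("a", href=True):
        link = urljoin(url, a["href"])          # resolve relative links
        if urlparse(link).netloc == domain:     # stay on the same site
            to_visit.append(link)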

How to read an HTML page that takes some time to load? [duplicate]

I am trying to scrape a website using Python and Beautiful Soup. I found that on some sites, the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content within Python as well? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you instead need a headless browser that also generates the DOM and loads and runs the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some formerly major products of this kind are now abandoned.
TRY THIS FIRST!
Perhaps the data really is embedded in the JavaScript itself and all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XmlHTTPRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
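For illustration, re-creating such an XmlHTTPRequest with requests could look roughly like this; the endpoint, query parameter and header values below are invented and would have to be replaced with whatever shows up in your network log:

import requests

headers = {
    # Some servers reject requests without a browser-like User-Agent.
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "X-Requested-With": "XMLHttpRequest",
}

resp = requests.get(
    "https://example.com/api/content",      # hypothetical endpoint from the network log
    params={"page": 1},                     # hypothetical query parameters
    headers=headers,
)
resp.raise_for_status()
data = resp.json()   # often the payload is JSON, usable without any HTML parsing
print(data)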

Download files after opening a webpage using Python

I have opened a webpage ('http://example.com/protected_page.php') using Python's requests library.
from requests import session
payload = {
    'action': 'login',
    'username': USERNAME,
    'password': PASSWORD
}
with session() as c:
    c.post('http://example.com/login.php', data=payload)
    response = c.get('http://example.com/protected_page.php')
Now there are around 15 links on that page to download files.
I wish to download files from only 2 links(say, linkA and linkB).
How can I specify this in my code, so that the 2 files get downloaded when I run my code?
Can you please give more information about these links?
Are linkA and linkB always the same links?
If yes, then you can use:
r = requests.get(linkA, stream=True)
If the URLs are not the same all the time, then maybe you can find another way, for instance using the order of the links, e.g. if linkA and linkB are always the first and the second link on the page.
Another way is to use a unique class name or id from the page. But it would be better if you could provide us more information.
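Assuming linkA and linkB really are fixed URLs, a sketch that reuses the login session from the question and saves both files could look like this (the file URLs are placeholders):

from requests import session

payload = {
    'action': 'login',
    'username': USERNAME,
    'password': PASSWORD
}
# Hypothetical file URLs; substitute the real linkA and linkB.
links = [
    'http://example.com/files/linkA.pdf',
    'http://example.com/files/linkB.pdf',
]

with session() as c:
    c.post('http://example.com/login.php', data=payload)
    for url in links:
        r = c.get(url, stream=True)
        r.raise_for_status()
        filename = url.rsplit('/', 1)[-1]
        with open(filename, 'wb') as f:
            # Stream the response in chunks so large files don't sit in memory.
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)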
In fact, what you are referring to is more precisely called web scraping, in which one can scrape specific content from a given website:
Web scraping is a computer software technique of extracting
information from websites. This technique mostly focuses on the
transformation of unstructured data (HTML format) on the web into
structured data (database or spreadsheet).
Without knowing the HTML structure of the page, it is not possible to give you a snippet of code for exactly what you are looking for. But here I can advise you on some of the ways in which you can scrape your site.
1. Non-programming way:
For those of you, who need a non-programming way to extract
information out of web pages, you can also look at import.io . It
provides a GUI driven interface to perform all basic web scraping
operations.
2. Programmers way:
You may find many libraries that perform a given function in Python. Hence, it is necessary to find the best library to use. I prefer BeautifulSoup, since it is easy and intuitive to work with. Precisely, you use two Python modules for scraping data:
Urllib2: It is a Python module which can be used for fetching URLs. It defines functions and classes to help with URL actions (basic
and digest authentication, redirections, cookies, etc). For more
detail refer to the documentation page.
BeautifulSoup: It is an incredible tool for pulling out information
from a webpage. You can use it to extract tables, lists, paragraphs, and
you can also apply filters to extract information from web pages. The
latest available version is BeautifulSoup 4. You can look at the
installation instructions in its documentation page.
BeautifulSoup does not fetch the web page for us. That's why we need to use urllib2 in combination with the BeautifulSoup library.
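A minimal sketch of that fetch-then-parse combination (urllib.request is the Python 3 counterpart of urllib2; the URL is a placeholder):

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://example.com/')          # fetch the page with urllib
soup = BeautifulSoup(html, 'html.parser')      # hand the markup to BeautifulSoup

# Example filters: print every link, then every table on the page.
for a in soup.find_all('a', href=True):
    print(a['href'], a.get_text(strip=True))
for table in soup.find_all('table'):
    print(table.get_text(' ', strip=True)[:80])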
Python has several other options for HTML scraping in addition to BeautifulSoup. Here are some others:
mechanize
scrapemark
scrapy

Webscraping Financial Data from Morningstar

I am trying to scrape data from the morningstar website below:
http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US
I am currently trying to do just IBM but hope eventually to be able to type in the ticker of another company and do the same with that one. My code so far is below:
import requests, os, bs4, string
url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US';
fin_tbl = ()
page = requests.get(url)
c = page.content
soup = bs4.BeautifulSoup(c, "html.parser")
summary = soup.find("div", {"class":"r_bodywrap"})
tables = summary.find_all('table')
print(tables[0])
The problem I am experiencing at the moment is that, unlike with a simpler webpage I have scraped, the program can't seem to locate any tables, even though I can see them in the HTML for the page.
In researching this problem the closest stackoverflow question is below:
Python webscraping - NoneObeject Failure - broken HTML?
In that one they explained that Morningstar's tables are dynamically loaded, and they used some JSON code I am unfamiliar with to somehow generate a different web link which managed to scrape the data, but I don't understand where it came from.
Scraping some modern web pages is a real problem, particularly pages generated by single-page applications (where the content is maintained by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response).
The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.
It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
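A rough sketch of that approach for the page in the question: let Selenium render the page (including the dynamically loaded tables), then hand the resulting HTML to the same BeautifulSoup code. The fixed five-second sleep is a crude stand-in for a proper wait condition, and a Chrome driver is assumed:

import time
import bs4
from selenium import webdriver

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'

driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)                     # give the JavaScript time to build the tables
html = driver.page_source         # the DOM after the scripts have run
driver.quit()

soup = bs4.BeautifulSoup(html, "html.parser")
summary = soup.find("div", {"class": "r_bodywrap"})
tables = summary.find_all("table") if summary else []
print(len(tables))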
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML which can change every day.
So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.
Should the search come up empty, a usually fruitful approach is to investigate what AJAX calls the page makes to retrieve data and then issue them directly. This can be done via the browser's developer tools, in the "Network" tab, where each request can be inspected in detail in a very friendly UI.
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for nodejs/phantomjs: ScraperJS. It is very easy to use: it injects jQuery into the scraped page and you can extract data with jQuery selectors.

How to crawl specific ASP.NET pages using Python?

I want to crawl an ASP.NET website, but the URLs are all the same. How can I crawl specific pages using Python?
here is the website I want to crawl:
http://www.fveconstruction.ch/index.htm
(I am using beautifulsoup, urllib and python 3)
What information should I get to distinguish one page from another?
If the target website is just a single-page application, it can't be crawled directly. As a workaround you can observe the requests (GET, POST, etc.) that are actually made when you manually navigate through the website and have the crawler issue those. Or teach your crawler to execute JavaScript, at least the JavaScript on the target website.
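For a classic ASP.NET postback site, replaying those requests usually means reading the hidden state fields out of the page and posting them back together with the control that triggers the navigation. A hedged sketch follows; the control name "ctl00$MainMenu$Page2" is invented, so take the real field and control names from your browser's network tab:

import requests
from bs4 import BeautifulSoup

url = "http://www.fveconstruction.ch/index.htm"   # the site from the question

with requests.Session() as s:
    page = s.get(url)
    soup = BeautifulSoup(page.text, "html.parser")

    # Collect the hidden state fields (typically __VIEWSTATE, __EVENTVALIDATION, ...).
    form_data = {
        field["name"]: field.get("value", "")
        for field in soup.find_all("input", {"type": "hidden"})
        if field.has_attr("name")
    }
    form_data["__EVENTTARGET"] = "ctl00$MainMenu$Page2"   # hypothetical control name
    form_data["__EVENTARGUMENT"] = ""

    # Replay the postback that the page would normally send for that control.
    next_page = s.post(url, data=form_data)
    print(next_page.status_code, len(next_page.text))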
It's the website that needs to change to be easily crawlable: it needs to provide a reasonable non-AJAX version of every page that needs to be indexed, or links to every page that needs to be indexed. Or use something like what pushState does in AngularJS.
