I'm trying to put together a small collection of libraries that I need in order to interact with HTML pages. What I need ranges from simple browsing and interacting with buttons or links of a web page (as in "write some text in this textbox and press this button") to parsing an HTML page and sending custom GET/POST requests to the server.
I am using Python 3. So far I have Requests for simple web page loading and custom GET and POST requests,
and BeautifulSoup for parsing the HTML tree, and I'm thinking of trying out Mechanize for simple web page interactions.
Are there any other libraries out there that are similar to the three I am using so far? Is there some sort of gathering place where all Python libraries hang out? Because I sometimes find it difficult to find what I am looking for.
The set of tools/libraries for web scraping really depends on multiple factors: purpose, complexity of the page(s) you want to crawl, speed, limitations, etc.
Here's a list of tools that are popular in the Python web-scraping world nowadays:
selenium
Scrapy
splinter
ghost.py
requests (and grequests)
mechanize
There are also HTML parsers out there; these are the most popular:
BeautifulSoup
lxml
Scrapy is probably the best thing created for web scraping in Python. It is a full web-scraping framework that makes the job easy and straightforward, and it provides everything you could want for web crawling.
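To give a feel for it, here is a minimal sketch of a Scrapy spider; the start URL and the CSS selectors are placeholders, not taken from any real site:

# minimal Scrapy spider sketch (run with: scrapy runspider spider.py -o items.json)
import scrapy

class ExampleSpider(scrapy.Spider):
    name = "example"
    start_urls = ["http://example.com/"]              # placeholder URL

    def parse(self, response):
        # extract items from the current page with CSS selectors
        for row in response.css("table tr"):
            yield {"cells": row.css("td::text").getall()}
        # follow "next" links and parse them with the same callback
        for href in response.css("a.next::attr(href)").getall():
            yield response.follow(href, callback=self.parse)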
Note: if a lot of AJAX and JavaScript is involved in loading and building the page, you will need a real browser to deal with it. This is where selenium helps: it drives a real browser, letting you interact with it through a WebDriver.
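A bare-bones selenium sketch of that idea (assuming Firefox and geckodriver are installed; the URL and selector are placeholders):

# load a page in a real browser and read the JavaScript-rendered DOM
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/")                                # placeholder URL
element = driver.find_element_by_css_selector("div.content")     # placeholder selector
print(element.text)
driver.quit()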
Also see:
Web scraping with Python
Headless Selenium Testing with Python and PhantomJS
HTML Scraping
Python web scraping resource
Parsing HTML using Python
Hope that helps.
I'm trying to scrape data from a site in Python. The payload is right and everything works, but the response from the site, which would normally be the source code of the HTML page, is instead just a script tag with some error code written in it. See the response I get enclosed:
b'<script language="JavaScript">\nerr = "";\nlargeur = 1024;\nif (screen.width>largeur) { document.location.href="accueil.php?" +err;\t}\nelse { document.location.href="m.accueil.php?largeur=" +screen.width +\'&\' +err;\t}\n</script>'
Information:
After looking at the site, it seems that it uses Google Analytics. I don't really know much about it, but maybe that is why it can't load the page, since I'm not accessing it with a browser.
What tool are you using to scrape? Tools like Beautiful Soup parse already-loaded HTML content. If a website uses client-side rendering and JavaScript to load content, HTML parsers often will not work.
You can instead use an automated browser that interacts with a website just as a regular user would. These automated browsers can operate with or without a GUI. Run without a GUI (also known as a headless browser), they take less time and fewer resources than when run with a GUI. Here's a fairly exhaustive list of headless browsers you can use. Note that not all are compatible with Python.
As Buran mentioned in the comments, Selenium is an option. Selenium is very well documented and has a large community following, so it's easy to find helpful articles or tutorials. It is a multi-driver, so it can run different types of browsers (Firefox, Chrome, etc.), both headless and with a GUI.
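As a rough sketch (assuming chromedriver is installed; the URL below stands in for the real site), fetching the JavaScript-rendered page with headless Chrome could look like this:

# fetch the page with headless Chrome so the JavaScript redirect actually runs
from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument("--window-size=1280,1024")    # satisfies the screen.width check
driver = webdriver.Chrome(options=options)
driver.get("http://example.com/accueil.php")       # placeholder for the real site
html = driver.page_source                          # the DOM after the redirect has run
print(html)
driver.quit()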
I'm trying to run web searches using a Python script. I know how to make it work for most sites, such as using the requests library to get "url+query arguments".
I'm trying to run searches on wappalyzer.com. But when you run a search, its URL doesn't change. I also tried inspecting the HTML to figure out where the search is taking place, so that I could use Beautiful Soup to change the HTML and run it, but to no avail. I'm really new to web scraping, so I would love the help.
The URL does not change because the search works with JavaScript and asynchronous requests. The easiest way to automate such a task is to execute the JavaScript and interact with the page programmatically (often easier than reverse engineering the requests the client makes, unless a public API is available).
You could use selenium with Python, which is pretty easy to use, or any automation framework that executes JavaScript by running a web driver (gecko, chrome, phantomjs).
With selenium, you will be able to program your scraper pretty easily: select the search field (using CSS selectors or XPath, for example), input a value and validate the search. You will then be able to dump the whole page or the specific parts you need.
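A hedged sketch of that flow (the CSS selectors below are guesses; inspect the real page to find the actual ones):

# type a query into the search box, submit it, and read the results
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

driver = webdriver.Firefox()
driver.implicitly_wait(10)                         # wait up to 10 s for elements to appear
driver.get("https://www.wappalyzer.com/")
search_box = driver.find_element_by_css_selector("input[type='search']")   # placeholder selector
search_box.send_keys("example.com")
search_box.send_keys(Keys.RETURN)                  # validate the search
results = driver.find_element_by_css_selector("div.results")               # placeholder selector
print(results.text)
driver.quit()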
I am trying to scrape data from the Morningstar website below:
http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US
I am currently trying to do just IBM, but I hope to eventually be able to type in the ticker of another company and do the same with that one. My code so far is below:
import requests, bs4

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'
page = requests.get(url)                                  # fetch the raw HTML
soup = bs4.BeautifulSoup(page.content, "html.parser")
summary = soup.find("div", {"class": "r_bodywrap"})       # wrapper that should hold the ratio tables
tables = summary.find_all('table')                        # comes back empty: the tables are loaded dynamically
print(tables[0])
The problem I am experiencing at the moment is that, unlike with a simpler web page I have scraped, the program can't seem to locate any tables, even though I can see them in the HTML for the page.
In researching this problem, the closest Stack Overflow question is below:
Python webscraping - NoneObeject Failure - broken HTML?
In that question they explained that Morningstar's tables are dynamically loaded, and they used some JSON code I am unfamiliar with to somehow generate a different web link which managed to scrape the data, but I don't understand where it came from.
Scraping some modern web pages is a real problem, particularly pages generated by single-page applications (where the content is built by AJAX calls and DOM modification rather than delivered as ready-to-go HTML in a single server response).
The best way I have found to access such content is to use the Selenium web testing environment to have a browser load the page under the control of my program, then extract the page contents from Selenium for scraping. There are other environments that will execute the scripts and modify the DOM appropriately, but I haven't used any of them.
It's not as difficult as it sounds, but it will take you a little jiggering around to get there.
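A rough sketch of that approach for the page above (assuming Firefox and geckodriver are installed; the r_bodywrap selector comes from your own snippet):

# let Selenium render the page, then hand the finished HTML to BeautifulSoup
import bs4
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = 'http://financials.morningstar.com/ratios/r.html?t=IBM&region=USA&culture=en_US'
driver = webdriver.Firefox()
driver.get(url)
# wait until the AJAX-loaded tables are actually present in the DOM
WebDriverWait(driver, 15).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.r_bodywrap table")))
soup = bs4.BeautifulSoup(driver.page_source, "html.parser")
tables = soup.find("div", {"class": "r_bodywrap"}).find_all("table")
print(tables[0])
driver.quit()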
Web scraping can be greatly simplified when the site offers an API, be it officially supported or just an unofficial hack. Even the hack is better than trying to fiddle with the HTML, which can change every day.
So a search for morningstar api might be fruitful. And, in fact, some friendly Gister has already worked this out for you.
If the search turns up nothing, a usually fruitful approach is to investigate what AJAX calls the page makes to retrieve data and then issue them directly. This can be done via the browser's developer tools, in the "Network" tab, where each request can be inspected in detail in a very friendly UI.
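As a sketch of that "issue the AJAX call directly" idea (the endpoint and parameters below are placeholders; substitute whatever you find in the Network tab):

# replay an AJAX request found in the browser's Network tab
import requests

ajax_url = "http://example.com/api/financials"      # placeholder endpoint
params = {"t": "IBM", "region": "USA"}               # placeholder parameters
response = requests.get(ajax_url, params=params,
                        headers={"User-Agent": "Mozilla/5.0"})
data = response.json()                               # such endpoints usually return JSON
print(data)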
I've found scraping dynamic sites to be a lot easier with JavaScript than with Python + Selenium. There is a great module for nodejs/phantomjs: ScraperJS. It is very easy to use: it injects jQuery into the scraped page and you can extract data with jQuery selectors.
I'd like to scrape content from a site which apparently uses JavaScript to generate its tables (the site is oddsportal.com).
I see that Scrapy can't load dynamic content. I read that selenium could handle it, but I'm planning to run this on a web server.
Is there a way I can parse this site, or get the dynamic request and parse it using Scrapy?
For example, I'd like to import the full table from this page, with the headers, match names and odds:
http://www.oddsportal.com/matches/handball/
From what I understand, you have a constraint that you don't have a real display. You can still go with selenium: there is the headless PhantomJS browser that can be automated, there is the option of working in a virtual display, and you can use a remote selenium server or docker-selenium.
There are multiple examples of how to combine selenium and scrapy, for instance:
selenium with scrapy for dynamic page
Scrapy with Selenium crawling but not scraping
Also, check whether the scrapy-splash middleware would be enough for your use case.
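For reference, a minimal scrapy-splash spider sketch (it needs a running Splash instance plus the SPLASH_URL and middleware settings from the scrapy-splash documentation; the table selectors are placeholders):

# Splash renders the JavaScript before Scrapy sees the response
import scrapy
from scrapy_splash import SplashRequest

class OddsSpider(scrapy.Spider):
    name = "odds"

    def start_requests(self):
        yield SplashRequest("http://www.oddsportal.com/matches/handball/",
                            self.parse, args={"wait": 2})

    def parse(self, response):
        for row in response.css("table tr"):          # placeholder selectors
            yield {"cells": row.css("td::text").getall()}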
For sites with dynamic content loaded through AJAX and JavaScript, I have used PhantomJS. It doesn't require opening a browser because it is itself a fully scriptable, headless web browser. PhantomJS is fast and includes native support for various web standards such as DOM handling, CSS selectors, JSON and Canvas.
If you aren't a JavaScript ninja, you should look at CasperJS, which is built on top of PhantomJS. It eases the process of defining a full navigation scenario and provides useful high-level functions.
Here is an example of how CasperJS works:
CasperJs and Jquery with chained Selects
I'd like to know whether it is possible to browse all links in a site (including the parent links and sublinks) using Python selenium (example: yahoo.com):
fetch all links on the homepage,
open each one of them,
open all the links in the sublinks, down to three or four levels.
I'm using selenium on python.
Thanks
Ala'a
You want "web-scraping" software like Scrapy and possibly Beautifulsoup4 - the first is used to build a program called a "spider" which "crawls" through web pages, extracting structured data from them, and following certain (or all) links in them. BS4 is also for extracting data from web pages, and combined with libraries like requests can be used to build your own spider, though at this point something like Scrapy is probably more relevant to what you need.
There are numerous tutorials and examples out there to help you - just start with the google search I linked above.
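As a sketch of what such a spider could look like (the depth limit and the item fields are arbitrary illustrative choices):

# a CrawlSpider that follows links on the same domain, a few levels deep
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class SiteSpider(CrawlSpider):
    name = "site"
    allowed_domains = ["yahoo.com"]
    start_urls = ["https://www.yahoo.com/"]
    custom_settings = {"DEPTH_LIMIT": 3}              # stop after three levels of links

    rules = (Rule(LinkExtractor(), callback="parse_item", follow=True),)

    def parse_item(self, response):
        yield {"url": response.url, "title": response.css("title::text").get()}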
Sure it is possible, but you have to instruct selenium to enter these links one by one as you are working within one browser.
In case the pages do not have their links rendered by JavaScript in the browser, it would be much more efficient to fetch these pages by direct HTTP request and process them this way. For this I would recommend using requests. However, even with requests it is up to your code to locate all the URLs in a page and follow up by fetching those pages.
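A small illustration of that requests-based approach, crawling links breadth-first down to a fixed depth (the depth and the start URL are arbitrary choices):

# fetch a page, collect its links, and repeat down to max_depth levels
import requests, bs4
from urllib.parse import urljoin

def crawl(start_url, max_depth=3):
    seen = set()
    frontier = [(start_url, 0)]
    while frontier:
        url, depth = frontier.pop(0)
        if url in seen or depth > max_depth:
            continue
        seen.add(url)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                                  # skip pages that fail to load
        soup = bs4.BeautifulSoup(html, "html.parser")
        for a in soup.find_all("a", href=True):
            frontier.append((urljoin(url, a["href"]), depth + 1))
    return seen

print(len(crawl("https://www.yahoo.com/")))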
There might also be other Python packages that specialize in this kind of task, but I cannot speak from real experience here.