I'm trying to run web searches from a Python script. I know how to make this work for most sites, for example by using the requests library to fetch "url + query arguments".
I'm trying to run searches on wappalyzer.com, but when you run a search there, the URL doesn't change. I also tried inspecting the HTML to figure out where the search takes place, so that I could use Beautiful Soup to change the HTML and run it, but to no avail. I'm really new to web scraping, so I would love some help.
The URL does not change because the search is performed by JavaScript through asynchronous requests. The easiest way to automate such a task is to execute the JavaScript and interact with the page programmatically. This is often easier than reverse engineering the requests the client makes, unless a public API is available.
You could use Selenium with Python, which is pretty easy to use, or any automation framework that executes JavaScript by driving a browser through a web driver (Gecko, Chrome, PhantomJS).
With Selenium, you can program your scraper fairly easily: select the search field (using CSS selectors or XPath, for example), enter a value, and submit the search. You can then dump the whole page or just the parts you need.
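As a rough sketch of those steps (load the page, select the search field, enter a value, submit, dump the page), something like the following could work. The CSS selectors and the `demo` function are hypothetical placeholders, not taken from wappalyzer.com. The two constants are the plain string values behind Selenium's `By.CSS_SELECTOR` and `Keys.ENTER`, written out so that the helper accepts any driver-like object.

```python
import time

CSS = "css selector"  # string value of selenium's By.CSS_SELECTOR
ENTER = "\ue007"      # code point of selenium's Keys.ENTER


def run_search(driver, url, field_selector, query, result_selector, timeout=10.0):
    """Type `query` into the search field and return the page HTML.

    All selectors are assumptions; find the real ones with the
    browser's inspector.
    """
    driver.get(url)
    field = driver.find_element(CSS, field_selector)
    field.send_keys(query)
    field.send_keys(ENTER)

    # Results arrive asynchronously, so poll until something matches
    # `result_selector` or the timeout expires.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if driver.find_elements(CSS, result_selector):
            break
        time.sleep(0.2)
    return driver.page_source


def demo():
    """Hypothetical usage; needs the selenium package and Firefox."""
    from selenium import webdriver

    driver = webdriver.Firefox()
    try:
        return run_search(
            driver,
            "https://www.wappalyzer.com/",
            "input[type='search']",  # assumed selector
            "example.com",
            ".search-results",       # assumed selector
        )
    finally:
        driver.quit()
```

Passing the driver in as an argument keeps the search logic independent of which browser is used, so the same function works with Firefox, Chrome, or a headless setup.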
Related
I'm trying to scrape data from a site in Python. The payload is right and everything works, but the response, which would normally be the HTML source of the page, contains just a script tag with what looks like an error in it. Here is the response I get:
b'<script language="JavaScript">\nerr = "";\nlargeur = 1024;\nif (screen.width>largeur) { document.location.href="accueil.php?" +err;\t}\nelse { document.location.href="m.accueil.php?largeur=" +screen.width +\'&\' +err;\t}\n</script>'
Additional information:
After looking at the site, it seems that it uses Google Analytics. I don't really know what that is, but maybe, because of it, the page can't load since I'm not accessing it through a browser.
What tool are you using to scrape the site? Tools like Beautiful Soup parse HTML that has already been loaded. If a website uses client-side rendering and JavaScript to load content, HTML parsers often will not see that content.
You can instead use an automated browser that interacts with a website just as a regular user would. These automated browsers can run with or without a GUI. Run without a GUI (also known as a headless browser), they take less time and fewer resources than with one. Here's a fairly exhaustive list of headless browsers you can use. Note that not all are compatible with Python.
As Buran mentioned in the comments, Selenium is an option. Selenium is very well documented and has a large community, so it's easy to find helpful articles and tutorials. It supports multiple drivers, so it can run different browsers (Firefox, Chrome, etc.), both headless and with a GUI.
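For the specific response shown in the question, a full browser may not be necessary: the returned script only redirects to `accueil.php` or `m.accueil.php` depending on screen width. A minimal sketch of an alternative to the headless-browser approach (not what the answer above proposes) is to read the redirect target out of the script and request that page directly. The function name and the base URL mentioned afterwards are hypothetical.

```python
import re
from urllib.parse import urljoin

# Matches assignments such as: document.location.href="accueil.php?"
_REDIRECT = re.compile(r'document\.location\.href\s*=\s*"([^"]+)"')


def js_redirect_target(body, base_url):
    """Return the absolute URL of the first JavaScript redirect, or None.

    `body` is the raw bytes of the response. The first assignment in the
    script from the question is the desktop branch (screen.width > largeur).
    """
    text = body.decode("utf-8", errors="replace")
    match = _REDIRECT.search(text)
    if match is None:
        return None
    # The script appends `err`, which is empty in the response shown,
    # so the trailing "?" carries no query and can be dropped.
    relative = match.group(1).rstrip("?")
    return urljoin(base_url, relative)
```

With a hypothetical base URL of `https://example.com/site/index.php`, the response from the question resolves to `https://example.com/site/accueil.php`, which can then be requested with the same session that sent the payload.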
I'm practicing parsing web pages with Python. What I do is:
ans = requests.get(link)
Then I use re to extract some information from the HTML, which is stored in
ans.content
The problem I ran into is that some sites use scripts that are executed automatically in a browser, but not when I download the page using requests. For example, instead of getting a page with information, I get something like
scripts_to_get_info.run()
in the HTML code.
A browser is installed on my computer, and so is the program I wrote, so in theory there should be a way to run this script from my Python code and get the information to parse.
Is it possible? Any suggestions?
(The idea that this is doable came from the fact that when I tried to inspect the page in Google, I saw the real HTML without any of the extra scripts.)
I am trying to scrape a website using Python and Beautiful Soup. I found that on some sites the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content in Python as well? I am using the regular urllib in Python, and I am able to get the source, but without the generated part.
I am not a web developer, so I am not able to describe the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a number of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. To obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you need a headless browser that also builds the DOM and loads and runs the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of these and their capabilities.
Keep in mind when choosing that some previously major headless browsers are now abandoned.
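Before reaching for a headless browser, it can help to confirm that the content really is generated client-side. The sketch below uses only the standard library to check whether an element (here the `cntnt` div from the question) is empty in the HTML the server returned. The class and function names are illustrative, not part of any library.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ElementProbe(HTMLParser):
    """Records child tags and text found inside the element with a given id."""

    def __init__(self, element_id):
        super().__init__()
        self.element_id = element_id
        self.found = False
        self.depth = 0      # greater than 0 while inside the target element
        self.children = 0   # number of tags seen inside the target
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.children += 1
            if tag not in VOID_TAGS:
                self.depth += 1
        elif not self.found and dict(attrs).get("id") == self.element_id:
            self.found = True
            if tag not in VOID_TAGS:
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag not in VOID_TAGS:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data)


def is_empty_in_source(html, element_id):
    """True if the element exists in the raw HTML but has no children or text."""
    probe = ElementProbe(element_id)
    probe.feed(html)
    probe.close()
    return probe.found and not probe.children and not "".join(probe.text).strip()
```

If this returns `True` for the downloaded source while the browser's inspector shows content inside the same element, the content is being filled in by scripts after the page loads.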
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, and all this JavaScript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on a network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XMLHttpRequests. The data you need should be somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
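Re-creating such a request might look like the sketch below. It uses only the standard library (the requests library would work just as well). The endpoint, parameter names, and User-Agent string are placeholders to be replaced with whatever the network logger shows for the real site.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# Example browser-like User-Agent; copy the one your own browser sends.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_ajax_request(endpoint, params, user_agent=BROWSER_UA):
    """Build a GET request that mimics the browser's XMLHttpRequest."""
    url = endpoint + "?" + urlencode(params)
    headers = {"User-Agent": user_agent, "Accept": "application/json"}
    return Request(url, headers=headers)


def fetch_json(request, timeout=10):
    """Send the request and decode the JSON body of the response."""
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `fetch_json(build_ajax_request("https://example.com/api/search", {"q": "python"}))` would return the decoded JSON if that hypothetical endpoint existed. If the site sends the data with POST or requires cookies, the request has to be adapted to match what the logger recorded.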
I'm trying to put together a small collection of plugins that I need in order to interact with HTML pages. What I need ranges from simple browsing and interacting with the buttons or links of a web page (as in "write some text in this textbox and press this button") to parsing an HTML page and sending custom GET/POST messages to the server.
I am using Python 3, and so far I have Requests for simple page loading and custom GET and POST messages,
BeautifulSoup for parsing the HTML tree, and I'm thinking of trying out Mechanize for simple web page interactions.
Are there any other libraries out there that are similar to the 3 I am using so far? Is there some sort of gathering place where all Python libraries hang out? I sometimes find it difficult to find what I am looking for.
The set of tools/libraries for web scraping really depends on multiple factors: purpose, complexity of the page(s) you want to crawl, speed, limitations, etc.
Here's a list of tools that are popular in the Python web-scraping world nowadays:
selenium
Scrapy
splinter
ghost.py
requests (and grequests)
mechanize
There are also HTML parsers; these are the most popular:
BeautifulSoup
lxml
Scrapy is probably the best thing that has been created for web scraping in Python. It's a full web-scraping framework that makes crawling easy and straightforward, and it provides nearly everything you might need for web crawling.
Note: if a lot of AJAX and JS is involved in loading and building the page, you will need a real browser to deal with it. This is where selenium helps: it drives a real browser and lets you interact with it through a WebDriver.
Also see:
Web scraping with Python
Headless Selenium Testing with Python and PhantomJS
HTML Scraping
Python web scraping resource
Parsing HTML using Python
Hope that helps.
I've looked at urllib(2), mechanize, and Beautiful Soup in the hope of finding something that captures network calls, such as pixel/beacon fires, from a page. Unfortunately, I'm not very familiar with any of them, and I'm also not clear on how to go about my search.
I'd like to use Python to run through a series of web URLs and capture each one's network calls, aka pixel fires. Would anyone know of a means or library I can start from in order to accomplish this?
I looked into web scraping, but I don't want the HTML; instead, I believe I'm looking for the GET requests the site makes.
If I understand correctly, you want to log the requests a browser makes when displaying a page, for many pages.
Your options are to script a browser using Python (see http://wiki.python.org/moin/WebBrowserProgramming), or to script the browser using JavaScript, output your results in some way (I suggest JSON, sent over a request or written to a file), and analyse them in Python.
You'll probably find it easier to do the scripting in JavaScript, honestly.
Another possibility, if you have access to the Firefox web browser, is to install Firebug, a powerful debugging tool that can display all network traffic from a web page in the browser console. To transfer the output from the console to a file, you will need to install the ConsoleExport plugin for Firebug.
You will now be able to capture all the traffic from a web page to a file which you can then parse with Python.
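Once the traffic is in a file, the parsing step might look like the sketch below. It assumes the capture was saved in HAR (HTTP Archive) format, a JSON layout that browser developer tools can export, rather than in ConsoleExport's own output format, which is not covered here. The function names are illustrative.

```python
import json
from urllib.parse import urlsplit


def load_har(path):
    """Read a HAR capture from disk (assumes a UTF-8 encoded JSON file)."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def requests_from_har(har, method="GET"):
    """Return the URLs of all requests in the capture that use `method`."""
    entries = har.get("log", {}).get("entries", [])
    return [
        entry["request"]["url"]
        for entry in entries
        if entry["request"]["method"].upper() == method
    ]


def group_by_host(urls):
    """Group URLs by host, which makes third-party pixel hosts easy to spot."""
    hosts = {}
    for url in urls:
        hosts.setdefault(urlsplit(url).netloc, []).append(url)
    return hosts
```

Calling `group_by_host(requests_from_har(load_har("capture.har")))` on a hypothetical `capture.har` gives a dictionary keyed by host, so requests to tracking or beacon domains stand apart from requests to the page's own host.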