I want to get all the Advisory IDs and CVE IDs from this page:
https://psirt.global.sonicwall.com/vuln-list
My earlier approach was to extract links and IDs from the page source (I have followed this approach with other vendors, such as Google Chrome updates and Mozilla updates). But here the data is not in the source code: I can see it when I inspect the page in the browser, but not when I view the source.
I tried logging the network traffic and searching it for a piece of the data. It looks like the page requests https://psirtapi.global.sonicwall.com/api/v1/vulnsummary/?srch=&vulnerable_products=&ord=-advisory_id to get the data you're looking for, and the server returns it in the response. You can then parse it.
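A minimal sketch of that approach in Python, using only the standard library. The endpoint is the one seen in the traffic log. Everything about the response is an assumption: that it is JSON, that the records sit in a list (possibly wrapped under a key such as `results`), and that the fields are called `advisory_id` and `cve`. Inspect the real response and adjust the names.

```python
import json
import urllib.request

API_URL = ("https://psirtapi.global.sonicwall.com/api/v1/vulnsummary/"
           "?srch=&vulnerable_products=&ord=-advisory_id")


def fetch_json(url=API_URL, timeout=30):
    """Download the URL and decode the body as JSON."""
    req = urllib.request.Request(url, headers={"Accept": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def extract_ids(records, id_key="advisory_id", cve_key="cve"):
    """Return (advisory id, CVE id) pairs from a list of dicts.

    The key names are assumptions; a missing key yields None.
    """
    return [(rec.get(id_key), rec.get(cve_key)) for rec in records]


def main():
    data = fetch_json()
    # Assumption: the payload is either a bare list of records or a dict
    # that wraps the list under "results".
    records = data if isinstance(data, list) else data.get("results", [])
    for advisory_id, cve_id in extract_ids(records):
        print(advisory_id, cve_id)
```

Calling `main()` performs the actual request; `extract_ids` can be tried on any list of dicts without touching the network.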
Related
My question is more on the "concept" side, as I don't have any code to show yet. I've basically got access to an API Explorer for a website, but the information retrieved when I put a specific url in the API Explorer is not the same as the html information I'd get if I opened a webpage with the same url and "inspected" the elements. I'm honestly lost on how to retrieve the data I need, as they are only present in the API Explorer and aren't accessible via web scraping.
Here is an example to show you what I mean:
API Explorer link: https://platform.worldcat.org/api-explorer/apis/worldcatidentities/identity/Read,
and the specific url to request is: http://www.worldcat.org/identities/lccn-n80126307/
If I put the url (http://www.worldcat.org/identities/lccn-n80126307/) myself and "inspect element", this piece of information:
does not have all the same data as:
For example, the language count, audLevel, oclcnum and many others do not exist in the html version but are in the API Explorer, and for other authors the genres count exists only in the API Explorer.
I realize that one is in xml and the other in html, so is that why the data is not the same in both versions? And whatever the reason, what can I do to retrieve the data present only in the API Explorer (such as genres count, audLevel, oclcnum, etc.)?
Any insight would be really helpful.
It's not unusual for a site not to show all the data that's in the underlying json/xml. That data often holds interesting content that isn't displayed anywhere on the site.
In this case the server gives you what you ask for. If you're going for the data using Python, all you really have to do is specify in your request headers what you're after. If you don't do that on this site, you get the html-stuff.
If you do it like this, you'll get the xml data you're interested in:
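import requests
import xml.dom.minidom
url = 'https://www.worldcat.org/identities/lccn-n80126307/'
# the Accept header asks the server for something other than the default html page
r = requests.get(url, headers={'Accept': 'application/json'})
# a couple of lines for pretty-printing the xml
# (the result is named `dom` so it does not shadow the imported `xml` package)
dom = xml.dom.minidom.parseString(r.text)
pretty_xml_as_string = dom.toprettyxml()
print(pretty_xml_as_string)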
Then all you have to do is extract the content you're after. That can be done in many ways. Let me know if this helps you.
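One of those ways is `xml.etree.ElementTree` from the standard library. The sketch below runs on a made-up fragment, because the real response layout is not shown here: the element names `oclcnum` and `audLevel` come from the question, but the nesting and the values are purely illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up fragment for illustration only; the real document is larger and
# its structure may differ.
SAMPLE_XML = """
<Identity>
  <audLevel>0.5</audLevel>
  <by>
    <citation><oclcnum>111</oclcnum></citation>
    <citation><oclcnum>222</oclcnum></citation>
  </by>
</Identity>
""".strip()


def find_all_text(xml_text, tag):
    """Return the text of every element named `tag`, at any depth."""
    root = ET.fromstring(xml_text)
    return [el.text for el in root.iter(tag)]


print(find_all_text(SAMPLE_XML, "oclcnum"))
print(find_all_text(SAMPLE_XML, "audLevel"))
```

If the real document declares an XML namespace, the tag passed to `iter` has to include it in the `{namespace}tag` form.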
I am trying to scrape a web site using Python and Beautiful Soup. I found that on some sites the image links, although visible in the browser, cannot be seen in the source code. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS code generated within this div. Is there a way to load the generated content as well from Python? I am using the regular urllib in Python and I am able to get the source, but without the generated part.
I am not a web developer, hence I am not able to express the behaviour in better terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. To obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get any content generated by JavaScript logic, you need a headless browser that builds the DOM and loads and runs the scripts like a regular browser would. The Wikipedia article and some other pages on the Net have lists of those and their capabilities.
Keep in mind when choosing that some of the previously major ones are now abandoned.
TRY THIS FIRST!
Perhaps the data technically could be in the JavaScript itself, and all this JavaScript engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an ajax request. If you can get your program to simulate that, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work though. I suggest turning on your network traffic logger (such as "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any/all XMLHttpRequests. The data you need should be found somewhere in one of these responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
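A rough sketch of re-creating such a request with the standard library. The User-Agent string is just an example of a browser-like value, and the assumption that the response body is UTF-8 JSON has to be checked against what the traffic logger shows.

```python
import json
import urllib.request

# Example browser-like headers; the exact values are an assumption and can
# be copied from the request shown in your own traffic logger instead.
BROWSER_HEADERS = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:115.0) "
                  "Gecko/20100101 Firefox/115.0",
    "Accept": "application/json",
}


def build_browser_like_request(url, extra_headers=None):
    """Build a GET request that carries browser-like headers."""
    headers = dict(BROWSER_HEADERS)
    headers.update(extra_headers or {})
    return urllib.request.Request(url, headers=headers)


def fetch_xhr_json(url, extra_headers=None, timeout=30):
    """Send the request and decode the body, assuming it is UTF-8 JSON."""
    req = build_browser_like_request(url, extra_headers)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Pass the URL of the XMLHttpRequest you found in the logger to `fetch_xhr_json`; cookies or a Referer header can be supplied through `extra_headers` if the server turns out to need them.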
In my line of work, I often need to look at campaign disclosure reports for my state from ethics.ga.gov. However, the state system is one of the shittiest webapps I've ever dealt with.
It only provides contribution data per report. There are six reports per election cycle. And to add insult to injury, the system is slow. Not only are you having to download a shit ton of files, you have to wait a good minute for the damn thing to generate.
This is like an obvious opportunity to automate the process. What I had planned on doing is writing a program where I can input a URL of the page that links to all disclosure reports, and it will download all the contribution reports.
For a given candidate, I would input a link to this page - http://media.ethics.ga.gov/Search/Campaign/Campaign_Name.aspx?NameID=5753&FilerID=C2009000086&Type=candidate (the view report links are in the dropdown list titled "campaign contribution reports"). I then plan on following each of those links to the report page, following that link to the contributions page, and downloading the csv file. Once I have the csv file, (I think) the project comes under the scope of my coding ability.
The problem I am stuck on right now is that I can't figure out how to follow the view report links. The system is written in ASP. The links call a javascript postback function with a call of the sort "View Report". ctl02 is the identifier of the control. It appears that the information to map that control identifier to the url I need (in this case http://media.ethics.ga.gov/search/Campaign/Campaign_ReportOptions.aspx?NameID=5753&FilerID=C2009000086&CDRID=85776) is embedded in an encrypted __VIEWSTATE field.
I installed the Firebug debugger to try and get data that way. While I am very new to Firebug, all I could find is that in the net tab it shows a GET request to the URL that I need.
Obviously, somehow my browser is getting the next page, which means it should be automatable, but I am now at a loss. I've been working this up in python because I'm really starting to like it, but everything's negotiable. I am doing this on a mac (with full gnu environment), and would prefer to keep working in the environment I am familiar with, but I do have a windows xp vm with visual c++ '10 if I have to go that route.
What do y'all think?
Turns out the data wasn't in the encrypted __VIEWSTATE at all. There was a POST operation that Firebug was clearing on a redirect (despite having it set not to clear things.) I ran it with the Chrome dev console, and I was able to capture the POST data and replicate the POST operation in my application. That got me the URL I was looking for.
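For anyone repeating this, replicating such a POST usually means sending back the page's hidden form fields together with the postback target. The following is only a sketch under assumptions: `__EVENTTARGET` and `__EVENTARGUMENT` are the standard ASP.NET WebForms postback fields, but the exact set of fields this server expects, and the full control identifier to pass as `event_target`, should be copied from the POST captured in the dev console.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode
import urllib.request


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_body(page_html, event_target, event_argument=""):
    """Build the url-encoded POST body for a WebForms-style postback."""
    parser = HiddenFieldParser()
    parser.feed(page_html)
    fields = parser.fields          # includes __VIEWSTATE if present
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("ascii")


def post_back(url, page_html, event_target, timeout=60):
    """POST the form back; return the final URL (after redirects) and body."""
    body = build_postback_body(page_html, event_target)
    req = urllib.request.Request(url, data=body)  # data= makes it a POST
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.geturl(), resp.read()
```

Here `page_html` is the HTML of the page that contains the "View Report" links, fetched beforehand with a normal GET.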
Thanks to everyone that looked at this!
Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.
I'd like to get all the above notes using a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or tumblr Api).
Some extensive Googling did not produce any items relating to notes' extraction in Tumblr.
Can anyone point me in the right direction on which tool will enable me to do that?
Unfortunately, it looks like the Tumblr API has some limitations (lack of meta information about reblogs, notes limited to 50), so you can't get all the notes.
It is also forbidden to do page scraping according to the Terms of Service.
"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"
Source:
https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc
Without JS you get separate pages that only contain the notes. For the mentioned blog post the first page would be:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
Following pages are linked at the bottom, e.g.:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358403506
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358383221
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358377013
…
(See my answer on how to find the next URL in a’s onclick attribute.)
Now you could use various tools to download/parse the data.
The following wget command should download all notes pages for that post:
wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
Like Fabio implies, it is better to use the API.
If for whatever reason you cannot, then the tools you use will depend on what you want to do with the data in the posts.
for a data dump: urllib will return a string of the page you want
looking for a specific section in the html: lxml is pretty good
looking for something in unruly html: definitely beautifulsoup
looking for a specific item in a section: beautifulsoup, lxml, text parsing is what you need.
need to put the data in a database/file: use scrapy
Tumblr's url scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, etc... until you get to the end of the posts and the server just does not return any data anymore.
So if you are going to brute force your way to scraping, you can easily tell your script to dump all the data on your hard drive until, say, the contents tag is empty.
One last word of advice: please remember to put a short pause between requests in your script (for example time.sleep(1) in Python, which takes seconds, not milliseconds), because otherwise you could put some stress on Tumblr's servers.
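The loop described above might look like this. It is only a sketch: `fetch_page` is a hypothetical callable you supply (for example a thin wrapper around `urllib.request.urlopen`), the numbered url pattern follows the description above, and the one-second default delay and the page cap are arbitrary choices.

```python
import time


def crawl_pages(base_url, fetch_page, delay_seconds=1.0, max_pages=1000):
    """Fetch base_url/1, base_url/2, ... until a page comes back empty."""
    pages = []
    for n in range(1, max_pages + 1):
        content = fetch_page(f"{base_url}/{n}")
        if not content:            # empty response: no more posts
            break
        pages.append(content)
        time.sleep(delay_seconds)  # pause so the server is not hammered
    return pages
```

Passing the fetch function in as an argument keeps the loop easy to try out with a fake function before pointing it at the real site.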
how to load all notes on tumblr? also covers the topic, but unor's response (above) does it very well.
I'm a little new to web crawlers and such, though I've been programming for a year already. So please bear with me as I try to explain my problem here.
I'm parsing info from Yahoo! News, and I've managed to get most of what I want, but there's a little portion that has stumped me.
For example: http://news.yahoo.com/record-nm-blaze-test-forest-management-225730172.html
I want to get the numbers beside the thumbs up and thumbs down icons in the comments. When I use "Inspect Element" in my Chrome browser, I can clearly see the things that I have to look for - namely, an em tag under the div class 'ugccmt-rate'. However, I'm not able to find this in my python program. In trying to track down the root of the problem, I clicked to view source of the page, and it seems that this tag is not there. Do you guys know how I should approach this problem? Does this have something to do with the javascript on the page that displays the info only after it runs? I'd appreciate some pointers in the right direction.
Thanks.
The page is being generated via JavaScript.
Check if there is a mobile version of the website first. If not, check for any APIs or RSS/Atom feeds. If there's nothing else, you'll either have to manually figure out what the JavaScript is loading and from where, or use Selenium to automate a browser that renders the JavaScript for you for parsing.
Using the Web Console in Firefox you can pretty easily see what requests the page is actually making as it runs its scripts, and figure out what URI returns the data you want. Then you can request that URI directly in your Python script and tease the data out of it. It is probably in a format that Python already has a library to parse, such as JSON.
Yahoo! may have some stuff on their server side to try to prevent you from accessing these data files in a script, such as checking the browser (user-agent header), cookies, or referrer. These can all be faked with enough perseverance, but you should take their existence as a sign that you should tread lightly. (They may also limit the number of requests you can make in a given time period, which is impossible to get around.)