I am trying to access our site's Web usage statistics through the Google Analytics API. I downloaded the Python client code from here:
http://code.google.com/p/gdata-python-client/
Under the samples/analytics folder there is data_feed_demo.py. I ran it, but the code seems to want a table ID, and it is not clear from the docs where this comes from. On the Web, some suggest using the profile ID, others say to look at some URL from the GA admin pages. I tried various sections of such URLs from the GA tool, but the code was not able to get any data. Any ideas?
There was an answer here
https://groups.google.com/forum/?fromgroups#!topic/google-analytics-data-export-api/SdprtYcBLP4
When I logged in to GA, the URL on the main page was something like
https://www.google.com/analytics/web/?et=#dashboard...a[xxx]w[xxx]p[xxx]/
I took the [xxx] out of p[xxx] and gave it to the sample script as ga:[xxx]. This worked. The funny thing is I remember taking p values out of the URL before, but I guess I was not on the main page. Anyhow, that is the answer.
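In code, the transformation is a one-liner. Here is a tiny sketch of pulling the profile ID out of the admin URL and prefixing it with ga: (the URL below is a made-up placeholder following the a[xxx]w[xxx]p[xxx] pattern described above):

# Extract the p-value from the GA admin URL and build the table ID the
# sample script expects. The URL is a placeholder, not a real account.
import re

admin_url = "https://www.google.com/analytics/web/?et=#dashboard/a11111111w22222222p33333333/"
profile_id = re.search(r"p(\d+)", admin_url).group(1)
table_id = "ga:" + profile_id  # e.g. "ga:33333333" -- pass this to data_feed_demo.py
print(table_id)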
https://developers.google.com/analytics/solutions/articles/hello-analytics-api
This is the best way to get started with the GA API.
So I am building a Django web app, and I want to allow users to search for their desired cryptocurrency and then return its price. I plan on getting the price from Coinbase or some other site that already presents this information. How would I go about this? I figure I would have to write the script that gets the price under views.py. What would be the best approach? Can I add a web scraping script that already does this to Django, or would I have to connect, say, Coinbase's API to my Django project? If so, how do I do this?
If you're looking at using an API from a service to get these prices, then Requests is something you can look at.
If you're looking at scraping the data from a page, then you'll probably want to look at BeautifulSoup or Scrapy, or, one step further, Selenium.
As for where you call it, that's up to you. If it's data you're always going to need, then you could look at running your script as a task or worker so you always have an up-to-date price. Otherwise you could trigger the script on demand and wait for the response to come back. There are drawbacks to both of these, and I'm guessing that if the site doesn't provide a managed API endpoint for the info you need, they will probably block your requests if you make too many of them.
But that's a starter for ten.
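To make the API route concrete, here is a minimal sketch of a Django view that proxies a public spot-price endpoint using Requests. The Coinbase endpoint path, the response shape, and the symbol query parameter are assumptions on my part, so check Coinbase's current API docs before relying on them:

# views.py -- a minimal sketch, assuming Coinbase's public v2 spot-price
# endpoint (https://api.coinbase.com/v2/prices/<PAIR>/spot); verify the path
# and the response shape against Coinbase's current documentation.
import requests
from django.http import JsonResponse

def crypto_price(request):
    symbol = request.GET.get("symbol", "BTC")  # e.g. /price/?symbol=ETH
    url = "https://api.coinbase.com/v2/prices/{}-USD/spot".format(symbol.upper())
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        return JsonResponse({"error": "price lookup failed"}, status=502)
    data = resp.json()["data"]
    return JsonResponse({"currency": data["base"], "price": data["amount"]})

You would then map the view in urls.py (for example path("price/", views.crypto_price)) and call it from your search form.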
I want to get all the Advisory IDs and CVE IDs from this page:
https://psirt.global.sonicwall.com/vuln-list
My earlier approach was to extract links and IDs from the page source (I have followed this approach with other vendors, such as Google Chrome and Mozilla updates). But here I cannot see any data in the page source: the data is visible in the browser's inspect mode, yet it does not appear when I view the source.
I logged the traffic and searched for that piece of data: it seems the page requests https://psirtapi.global.sonicwall.com/api/v1/vulnsummary/?srch=&vulnerable_products=&ord=-advisory_id, which returns the data you're looking for in its response. You can then parse it.
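A rough sketch of fetching and parsing that endpoint with Requests follows; the JSON field names ("advisory_id", "cve") and the possible "results" wrapper are guesses, so check the actual response in your browser's network tab and adjust the keys:

# A rough sketch, not a verified client: the field names and the top-level
# structure of the JSON are assumptions -- inspect the real response first.
import requests

URL = ("https://psirtapi.global.sonicwall.com/api/v1/vulnsummary/"
       "?srch=&vulnerable_products=&ord=-advisory_id")

data = requests.get(URL, timeout=30).json()
records = data.get("results", data) if isinstance(data, dict) else data

for item in records:
    print(item.get("advisory_id"), item.get("cve"))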
Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.
I'd like to get all the above notes using a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or the Tumblr API).
Some extensive Googling did not turn up anything about extracting notes from Tumblr.
Can anyone point me in the right direction on which tool will enable me to do that?
Unfortunately, it looks like the Tumblr API has some limitations (it lacks meta information about reblogs, and notes are limited to 50), so you can't get all the notes.
It is also forbidden to do page scraping according to the Terms of Service.
"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"
Source:
https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc
Without JS you get separate pages that only contain the notes. For the mentioned blog post the first page would be:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
Following pages are linked at the bottom, e.g.:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358403506
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358383221
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358377013
…
(See my answer on how to find the next URL in the a element's onclick attribute.)
Now you could use various tools to download/parse the data.
The following wget command should download all notes pages for that post:
wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
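If you would rather stay in Python, here is a rough sketch that walks the notes pages with Requests and BeautifulSoup. The ol.notes selector and the onclick-based next-page link are assumptions about Tumblr's old notes markup, so inspect the page and adjust:

# A rough sketch, assuming the classic notes markup: an ol.notes list plus a
# "more notes" link whose onclick contains the next /notes/... URL. The
# markup may have changed since, so verify against the actual page.
import re
import time
import urllib.parse

import requests
from bs4 import BeautifulSoup

url = "http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy"
notes = []

while url:
    soup = BeautifulSoup(requests.get(url, timeout=30).text, "html.parser")
    notes.extend(li.get_text(" ", strip=True) for li in soup.select("ol.notes li"))

    # The link to the next page hides in an onclick attribute, not an href.
    more = soup.find("a", onclick=re.compile(r"/notes/"))
    match = re.search(r"(/notes/[^'\s()]+)", more["onclick"]) if more else None
    url = urllib.parse.urljoin(url, match.group(1)) if match else None
    time.sleep(1)  # be gentle with Tumblr's servers

print(len(notes), "notes collected")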
Like Fabio implies, it is better to use the API.
If for whatever reason you cannot, then the tools you use will depend on what you want to do with the data in the posts.
For a data dump: urllib will return a string of the page you want.
Looking for a specific section in the HTML: lxml is pretty good.
Looking for something in unruly HTML: definitely BeautifulSoup.
Looking for a specific item in a section: BeautifulSoup, lxml, and text parsing are what you need.
Need to put the data in a database or file: use Scrapy.
Tumblr's URL scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, etc., until you get to the end of the posts and the server just does not return any data anymore.
So if you are going to brute-force your way through scraping, you can easily tell your script to dump all the data to your hard drive until, say, the contents tag is empty.
One last word of advice: please remember to put a small sleep in your script (a second or so between requests), because otherwise you could put some stress on Tumblr's servers.
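For the brute-force approach described above, a rough sketch (with the small sleep built in) might look like this; the /page/N scheme and the post-container selectors are guesses that depend on the blog's theme:

# A rough brute-force sketch: walk /page/1, /page/2, ... until the server
# stops returning posts. The "article" / "div.post" selectors are guesses and
# depend entirely on the blog's theme markup.
import time

import requests
from bs4 import BeautifulSoup

base = "http://ronbarak.tumblr.com/page/{}"
page = 1
while True:
    resp = requests.get(base.format(page), timeout=30)
    soup = BeautifulSoup(resp.text, "html.parser")
    posts = soup.select("article") or soup.select("div.post")
    if resp.status_code != 200 or not posts:
        break  # nothing came back: we've walked off the end of the blog
    for post in posts:
        print(page, post.get_text(" ", strip=True)[:80])
    page += 1
    time.sleep(1)  # the small sleep, so we don't stress Tumblr's servers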
The question "how to load all notes on tumblr?" also covers this topic, but unor's answer (above) handles it very well.
I've looked at a lot of questions and libraries and didn't find exactly what I wanted. Here's the thing: I'm developing an application in Python for a user to get all sorts of things from their social network accounts. I'm having trouble with Facebook. I would like, if possible, a step-by-step tutorial on the code and libraries to use to get a user's information, from posts to photo information (using the user's login information, and how to do that, because I've had a lot of problems with authentication).
Thank you
I strongly encourage you to use Facebook's own APIs.
First of all, check out the documentation on Facebook's Graph API: https://developers.facebook.com/docs/reference/api/. If you are not familiar with JSON, DO read a tutorial on it (for instance http://secretgeek.net/json_3mins.asp).
Once you grasp the concepts, start using this API. For Python, there are several alternatives:
facebook/python-sdk https://github.com/facebook/python-sdk
pyFaceGraph https://github.com/iplatform/pyFaceGraph/
It is also semi-trivial to write a simple HTTP client that uses the Graph API directly.
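For example, a minimal sketch of that approach might look like the following; it assumes you already have a valid user access token (from Facebook's OAuth flow or the Graph API Explorer), and the field list is illustrative, since what you can read depends on the permissions granted and the API version:

# A minimal sketch of calling the Graph API over plain HTTP (Python 3).
# ACCESS_TOKEN is a placeholder you must obtain yourself, and the fields
# requested are examples only.
import json
import urllib.request

ACCESS_TOKEN = "YOUR_ACCESS_TOKEN"
url = ("https://graph.facebook.com/me"
       "?fields=id,name,posts,photos"
       "&access_token=" + ACCESS_TOKEN)

with urllib.request.urlopen(url) as resp:
    profile = json.loads(resp.read().decode("utf-8"))

print(profile.get("name"))
for post in profile.get("posts", {}).get("data", []):
    print(post.get("created_time"), post.get("message"))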
I would suggest you check out the Python libraries, try the examples in their documentation, and see whether they work and do what you need.
Only as a last resort would I write a scraper and try to extract the data with screen scraping (it is much more painful and breaks more easily).
I have not used this with Facebook, but in the past, when I had to scrape a site that required a login, I used Mechanize to handle the login and scraping, and Beautiful Soup to parse the resulting HTML.
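For reference, the general Mechanize + Beautiful Soup pattern looks roughly like this; the URLs, form index, and field names are placeholders that will differ per site (and a modern Facebook login generally won't work this way):

# A rough sketch of the Mechanize + Beautiful Soup pattern for a plain HTML
# login form. example.com, the form index, and the field names are all
# placeholders -- adjust them to the site you are actually scraping.
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()
br.set_handle_robots(False)
br.open("https://example.com/login")

br.select_form(nr=0)               # pick the first form on the page
br["username"] = "me@example.com"  # field names depend on the login form
br["password"] = "secret"
br.submit()

# Fetch a page behind the login and parse it with Beautiful Soup.
html = br.open("https://example.com/account").read()
soup = BeautifulSoup(html, "html.parser")
print(soup.title.string if soup.title else "no title")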
I'm trying to scrape a page on YouTube with Python which has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has extensive APIs already in place to give you access to just about any data you could possibly want.
Take a look at the YouTube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
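To illustrate that urllib + ElementTree pattern, here is a minimal sketch (in Python 3) against YouTube's public Atom feed for a channel rather than the full Data API; the channel ID is a placeholder:

# A minimal sketch of the urllib + ElementTree pattern. The URL is YouTube's
# public Atom feed for a channel (CHANNEL_ID is a placeholder), not the full
# Data API, but the XML parsing approach is the same.
import urllib.request
import xml.etree.ElementTree as ET

CHANNEL_ID = "UCxxxxxxxxxxxxxxxxxxxxxx"  # placeholder -- use a real channel id
url = "https://www.youtube.com/feeds/videos.xml?channel_id=" + CHANNEL_ID

with urllib.request.urlopen(url) as resp:
    root = ET.parse(resp).getroot()

ns = {"atom": "http://www.w3.org/2005/Atom"}
for entry in root.findall("atom:entry", ns):
    print(entry.find("atom:title", ns).text)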
The main problem is that you're violating the TOS (terms of service) of the YouTube site. YouTube's engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it -- technically, your best bets are python-spidermonkey and Selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug in Firefox, turn on the Net panel in Firebug, and click the desired link on YouTube. Now see what happens and which pages are requested, and find the ones responsible for the AJAX part of the page. You can then use urllib or Mechanize to fetch that link. If you CAN pull the same content this way, then you have what you are looking for; just parse the content. If you CAN'T pull the content this way, that suggests the requested page might be looking at user login credentials, session info, or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy. I would suggest you always follow the simple path first. Good luck, and happy "responsible" scraping! :)
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the Scrapy framework. It provides extensive support for crawling and scraping websites and can use python-spidermonkey under the hood to access JavaScript links.
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.