When I upload a PDF to Google Docs (using Python's gdata library), I get a link to the document:
>>> e.GetAlternateLink().href
Out[14]: 'http://docs.google.com/a/my.dom.ain/fileview?id=<very-long-doc-id>&hl=en'
Unfortunately, using that link in an IFRAME doesn't work for me, because the PDF viewer redirects to itself and breaks out of the IFRAME.
Looking for a solution, I found this: http://googlesystem.blogspot.com/2009/09/embeddable-google-document-viewer.html - which looks very nice, but I can't find a way to use it with a document uploaded to Google Docs. Does anybody know how to do that, or whether it's possible at all?
Just for the record - I haven't found any way to force the "internal" Google PDF viewer not to break out of the iframe. And as I mentioned in the question, I found this nice standalone viewer: https://googlesystem.blogspot.com/2009/09/embeddable-google-document-viewer.html, which can be used like this:
<iframe src="https://docs.google.com/gview?url=http://infolab.stanford.edu/pub/papers/google.pdf&embedded=true" style="width:600px; height:500px;" frameborder="0"></iframe>
-- but in order to use it you have to publish your PDF to the outside world. That wouldn't be a bad solution, because a published document gets a unique id that is probably harder to guess than the password to a Google Docs account. Unfortunately, even with the latest Google Docs API (version 3), there seems to be no way to publish a PDF programmatically.
In the end, I went for a mix of Google's standalone PDF viewer and another web service that lets me upload and publish PDFs programmatically. A somewhat half-baked solution, but it has worked well so far.
To embed PDF files from your Google Docs account in your website, use the code below:
<iframe src="http://docs.google.com/gview?a=v&pid=explorer&chrome=false&api=true&embedded=true&srcid=<id of your pdf>&hl=en&embedded=true" style="width:600px; height:500px;" frameborder="0"></iframe>
Try this!
Same as other answers above...
<iframe src="https://docs.google.com/gview?url={magical url that works}"></iframe>
except the magical url that works is https://drive.google.com/uc?id=<docId>&embedded=true.
Google Drive/Docs provides a bunch of different urls:
https://drive.google.com/open?id=<docId> Share link.
https://docs.google.com/document/d/<docId>/edit Open in Google Drive.
https://docs.google.com/document/d/<docId>/view Same as 'edit' above. I think.
https://docs.google.com/document/d/<docId>/pub?embedded=true For embedding in iframe if you File -> Publish to the web...
https://drive.google.com/uc?export=download&id=<docId> Direct download link.
I stumbled across this solution after a bunch of trial and error with different links. Hope this helps!
Embedding Google Docs in iframes via the viewer is problematic in IE8 if the document isn't already cached, and it just isn't equal to Scribd's much better facility, which lets you make a simple HTML page with the document embedded via their supplied object code. I then use that page as the source file for my iframe. It shows a print button (and also a full-screen button) right in the embedded frame. Much friendlier and more reliable for the page's visitors.
The following worked for me:
<iframe src="https://drive.google.com/viewerng/viewer?url=url_of_pdf?pid=explorer&efh=false&a=v&chrome=false&embedded=true" embedded=true></iframe>
I spent an hour on this; the snippet below worked:
Example:
<iframe src={`https://docs.google.com/gview?url=${encodeURIComponent('http://infolab.stanford.edu/pub/papers/google.pdf')}&embedded=true`}></iframe>
Note that encodeURIComponent was needed.
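For anyone building that same URL from Python (as in the original question), the closest equivalent of encodeURIComponent is urllib.quote with an empty safe set; a minimal sketch:
import urllib
pdf_url = 'http://infolab.stanford.edu/pub/papers/google.pdf'
# quote with safe='' percent-encodes ':' and '/' as well, like encodeURIComponent does.
embed_src = 'https://docs.google.com/gview?url=%s&embedded=true' % urllib.quote(pdf_url, safe='')
print(embed_src)
The resulting string goes straight into the iframe's src attribute.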
I am trying to scrape a web site using Python and Beautiful Soup. I noticed that on some sites the image links, although visible in the browser, cannot be seen in the page source. However, using Chrome Inspect or Fiddler, we can see the corresponding code.
What I see in the source code is:
<div id="cntnt"></div>
But in Chrome Inspect, I can see a whole bunch of HTML/CSS generated within this div. Is there a way to load that generated content from Python as well? I am using the regular urllib in Python and I can get the source, but without the generated part.
I am not a web developer, so I may not be expressing the behaviour in the best terms. Please feel free to ask for clarification if my question seems vague!
You need a JavaScript engine to parse and run the JavaScript code inside the page.
There are a bunch of headless browsers that can help you:
http://code.google.com/p/spynner/
http://phantomjs.org/
http://zombie.labnotes.org/
http://github.com/ryanpetrello/python-zombie
http://jeanphix.me/Ghost.py/
http://webscraping.com/blog/Scraping-JavaScript-webpages-with-webkit/
The content of the website may be generated after load via JavaScript. In order to obtain the generated content via Python, refer to this answer.
A regular scraper gets just the HTML document. To get content generated by JavaScript logic, you instead need a headless browser that also builds the DOM and loads and runs the scripts, like a regular browser would. The Wikipedia article and some other pages on the net have lists of such browsers and their capabilities.
Keep in mind when choosing that some formerly major products in this space are now abandoned.
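As one concrete illustration of the headless-browser route (Selenium driving Firefox here is my own example, not one of the tools listed above), a minimal sketch:
from selenium import webdriver
from bs4 import BeautifulSoup

# Let a real browser engine run the page's JavaScript and build the DOM.
driver = webdriver.Firefox()
driver.get('http://example.com/page-with-generated-content')  # placeholder URL

# page_source now includes the JavaScript-generated markup as well.
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find('div', id='cntnt'))

driver.quit()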
TRY THIS FIRST!
Perhaps the data technically could be embedded in the JavaScript itself, in which case all this JavaScript-engine business is needed. (Some GREAT links here!)
But from experience, my first guess is that the JS is pulling the data in via an AJAX request. If you can get your program to simulate that request, you'll probably get everything you need handed right to you without any tedious parsing/executing/scraping involved!
It will take a little detective work, though. I suggest turning on your network traffic logger (such as the "Web Developer Toolbar" in Firefox) and then visiting the site. Focus your attention on any and all XmlHTTPRequests. The data you need should be found somewhere in one of those responses, probably in the middle of some JSON text.
Now, see if you can re-create that request and get the data directly. (NOTE: You may have to set the User-Agent of your request so the server thinks you're a "real" web browser.)
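As a sketch of that replay step with urllib2 (the endpoint, headers, and the JSON assumption below are placeholders; copy the real values out of your network log):
import json
import urllib2

# Placeholder URL; use the XmlHTTPRequest URL you saw in the network log.
request = urllib2.Request('http://example.com/ajax/data?page=1')
request.add_header('User-Agent', 'Mozilla/5.0')            # pretend to be a real browser
request.add_header('X-Requested-With', 'XMLHttpRequest')   # many sites check for this

response = urllib2.urlopen(request)
data = json.loads(response.read())  # assuming the endpoint returns JSON
print(data)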
I am trying to include a YouTube video on a website that I'm developing using GAE and Python.
I know I should use this in my HTML:
<iframe width="420" height="345"
src="http://www.youtube.com/watch?v=MYSVMgRr6pw">
</iframe>
but I am also guessing I have to make some changes in the app.yaml file, and I can't figure out how to amend my app.yaml correctly. Currently I can only see a square box and no video. Here is a link to a web page with the video: http://www.firstpiproject.appspot.com/learninglinux
Thanks
I believe, per http://www.w3schools.com/html/html_youtube.asp, that the canonical form is something like, and I quote:
<iframe width="420" height="315"
src="http://www.youtube.com/embed/XGSy3_Czz8k">
</iframe>
Note the slightly different format for the src= URL, with .../embed/ -- your page has src="http://www.youtube.com/watch?v=hBvaB8aAp1I&feature=youtu.be", which is a somewhat-different format.
I don't think this has anything to do with App Engine, python, app.yaml, and the like -- it's all about what, exactly, you put in that src= parameter of the iframe you serve as part of your HTML page. Try the w3schools-recommended format with .../embed/... and let us know!
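If you want to do that URL conversion in Python on the server side instead of by hand, a small sketch (watch_to_embed is just an illustrative name, not an existing helper):
import urlparse

def watch_to_embed(watch_url):
    # Pull the v= parameter out of a .../watch?v=... URL and build the /embed/ form.
    query = urlparse.urlparse(watch_url).query
    video_id = urlparse.parse_qs(query)['v'][0]
    return 'http://www.youtube.com/embed/%s' % video_id

print(watch_to_embed('http://www.youtube.com/watch?v=MYSVMgRr6pw'))
# -> http://www.youtube.com/embed/MYSVMgRr6pw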
Say I look at the following Tumblr post: http://ronbarak.tumblr.com/post/40692813…
It (currently) has 292 notes.
I'd like to get all the above notes using a Python script (e.g., via urllib2, BeautifulSoup, simplejson, or tumblr Api).
Some extensive Googling did not turn up anything about extracting notes from Tumblr.
Can anyone point me in the right direction on which tool will enable me to do that?
Unfortunately, it looks like the Tumblr API has some limitations (lack of meta information about reblogs, notes limited to 50), so you can't get all the notes.
It is also forbidden to do page scraping according to the Terms of Service.
"You may not do any of the following while accessing or using the Services: (...) scrape the Services, and particularly scrape Content (as defined below) from the Services, without Tumblr's express prior written consent;"
Source:
https://groups.google.com/forum/?fromgroups=#!topic/tumblr-api/ktfMIdJCOmc
Without JS you get separate pages that only contain the notes. For the mentioned blog post the first page would be:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
Following pages are linked at the bottom, e.g.:
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358403506
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358383221
http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy?from_c=1358377013
…
(See my answer on how to find the next URL in the a element's onclick attribute.)
Now you could use various tools to download/parse the data.
The following wget command should download all notes pages for that post:
wget --recursive --domains=ronbarak.tumblr.com --include-directories=notes http://ronbarak.tumblr.com/notes/40692813320/4Y70Zzacy
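If you'd rather stay in Python than shell out to wget, a rough equivalent with urllib2 could look like the sketch below; the way it locates the next ?from_c=... link is only a guess at the markup (see the answer linked above about the onclick attribute for the real details):
import re
import time
import urllib2

base = 'http://ronbarak.tumblr.com'
url = base + '/notes/40692813320/4Y70Zzacy'
seen, pages = set(), []
while url and url not in seen:
    seen.add(url)
    html = urllib2.urlopen(url).read()
    pages.append(html)
    # The link to the next batch of notes carries a ?from_c=... URL (see the
    # examples above); this regex is only a guess at how it appears in the markup.
    match = re.search(r'/notes/40692813320/4Y70Zzacy\?from_c=\d+', html)
    url = base + match.group(0) if match else None
    time.sleep(1)  # be gentle with Tumblr's servers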
Like Fabio implies, it is better to use the API.
If for whatever reasons you cannot, then the tools you will use will depend on what you want to do with the data in the posts.
for a data dump: urllib will return a string of the page you want
looking for a specific section in the html: lxml is pretty good
looking for something in unruly html: definitely beautifulsoup
looking for a specific item in a section: beautifulsoup, lxml, text parsing is what you need.
need to put the data in a database/file: use scrapy
Tumblr's URL scheme is simple: url/scheme/1, url/scheme/2, url/scheme/3, etc., until you get to the end of the posts and the server just doesn't return any data anymore.
So if you are going to brute-force your way through the scraping, you can easily tell your script to dump all the data onto your hard drive until, say, the content tag is empty.
One last word of advice: please remember to put a small sleep(1000) in your script, because otherwise you could put some stress on Tumblr's servers.
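A rough sketch of that brute-force loop in Python (the blog URL, the /page/N path, and the "no more posts" check are all placeholders you would adapt to the blog and theme in question):
import time
import urllib2

page = 1
while True:
    # Placeholder blog URL and pagination path, for illustration only.
    html = urllib2.urlopen('http://example.tumblr.com/page/%d' % page).read()
    # Stop once the page no longer contains any posts; the marker you test for
    # depends on the blog's theme, so treat this check as a placeholder too.
    if 'class="post"' not in html:
        break
    open('page_%d.html' % page, 'w').write(html)
    page += 1
    time.sleep(1)  # pause between requests so you don't stress Tumblr's servers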
"how to load all notes on tumblr?" also covers the topic, but unor's response (above) handles it very well.
I'm trying to scrape a page on YouTube with Python which has a lot of AJAX in it.
I have to call the JavaScript each time to get the info, but I'm not really sure how to go about it. I'm using the urllib2 module to open URLs. Any help would be appreciated.
YouTube (and everything else Google makes) has EXTENSIVE APIs already in place for giving you access to just about any and all data you could possibly want.
Take a look at The Youtube Data API for more information.
I use urllib to make the API requests and ElementTree to parse the returned XML.
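For illustration, a minimal version of that pattern; note that the gdata.youtube.com feed URL below is the old GData-style endpoint this answer's era implies, so treat it as an assumption and check the current API docs:
import urllib2
from xml.etree import ElementTree

# Old GData-style search feed; assumed here for illustration only.
feed = urllib2.urlopen('http://gdata.youtube.com/feeds/api/videos?q=python&max-results=5')
tree = ElementTree.parse(feed)

# The entries live in the Atom namespace.
ns = '{http://www.w3.org/2005/Atom}'
for entry in tree.findall(ns + 'entry'):
    print(entry.find(ns + 'title').text)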
The main problem is that you're violating the TOS (terms of service) of the YouTube site. YouTube engineers and lawyers will do their professional best to track you down and make an example of you if you persist. If you're happy with that prospect, then on your head be it -- technically, your best bets are python-spidermonkey and selenium. I wanted to put the technical hints on record in case anybody in the future has needs like the ones your question's title indicates, without the legal issues you clearly have if you continue in this particular endeavor.
Here is how I would do it: install Firebug on Firefox, then turn on the Net panel in Firebug and click on the desired link on YouTube. Now see what happens and which pages are requested. Find the ones that are responsible for the AJAX part of the page. Now you can use urllib or Mechanize to fetch that link. If you CAN pull the same content this way, then you have what you are looking for and just need to parse it. If you CAN'T pull the content this way, that suggests the requested page might be looking at user login credentials, session info or other header fields such as HTTP_REFERER, etc. Then you might want to look at something more extensive like Scrapy. I would suggest that you always follow the simple path first. Good luck and happy "responsible" scraping! :)
As suggested, you should use the YouTube API to access the data made available legitimately.
Regarding the general question of scraping AJAX, you might want to consider the scrapy framework. It provides extensive support for crawling and scraping web sites and uses python-spidermonkey under the hood to access javascript links.
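If you go the Scrapy route, a minimal spider is only a few lines; everything below (the URL, the CSS selector, the field name) is a placeholder sketch rather than anything YouTube-specific:
import scrapy

class AjaxSpider(scrapy.Spider):
    name = 'ajax_demo'
    # Placeholder start URL; point this at the listing page you actually want.
    start_urls = ['http://example.com/listing']

    def parse(self, response):
        # Placeholder selector; adapt it to the markup you are scraping.
        for title in response.css('h3.title::text').getall():
            yield {'title': title}
You would save that as, say, ajax_spider.py and run it with scrapy runspider ajax_spider.py.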
You could sniff the network traffic with something like Wireshark, then replay the HTTP calls via a scraping framework that is robust enough to deal with AJAX, such as Scrapy.
I want to be able to download a page and all of its associated resources (images, style sheets, script files, etc) using Python. I am (somewhat) familiar with urllib2 and know how to download individual urls, but before I go and start hacking at BeautifulSoup + urllib2 I wanted to be sure that there wasn't already a Python equivalent to "wget --page-requisites http://www.google.com".
Specifically I am interested in gathering statistical information about how long it takes to download an entire web page, including all resources.
Thanks
Mark
Websucker? See http://effbot.org/zone/websucker.htm
websucker.py doesn't handle CSS @import links. HTTrack.com is not Python (it's C/C++), but it's a good, maintained utility for downloading a website for offline browsing.
http://www.mail-archive.com/python-bugs-list@python.org/msg13523.html
[issue1124] Webchecker not parsing css "@import url"
Guido> This is essentially unsupported and unmaintained example code. Feel free to submit a patch though!
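Coming back to the original question: failing a ready-made Python equivalent of wget --page-requisites, a rough urllib2 + BeautifulSoup sketch of the timing measurement described above might look like this (very simplified: no CSS @import handling, no caching, and the tag choices are assumptions):
import time
import urllib2
import urlparse
from bs4 import BeautifulSoup

start = time.time()
page_url = 'http://www.google.com'
html = urllib2.urlopen(page_url).read()

# Collect the obvious page requisites: images, scripts and stylesheets.
soup = BeautifulSoup(html, 'html.parser')
resources = [tag.get('src') or tag.get('href')
             for tag in soup.find_all(['img', 'script', 'link'])]

for res in filter(None, resources):
    try:
        urllib2.urlopen(urlparse.urljoin(page_url, res)).read()
    except urllib2.URLError:
        pass  # ignore resources that fail to download

print('Total time: %.2f seconds' % (time.time() - start))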