My final goal is import some data from Google Site pages.
I'm trying to use gdata-python-client (v2.0.17) to download a specific Content Feed:
self.client = gdata.sites.client.SitesClient(source=SOURCE_APP_NAME)
self.client.client_login(USERNAME, PASSWORD, source=SOURCE_APP_NAME, service=self.client.auth_service)
self.client.site = SITE
self.client.domain = DOMAIN
uri = '%s?path=%s' % (self.client.MakeContentFeedUri(), '[PAGE PATH]')
feed = self.client.GetContentFeed(uri=uri)
entry = feed.entry[0]
...
Resulted entry.content has a page content in xhtml format. But this tree doesn't content any plan text data from a page. Only html page struct and links.
For example my test page has
<div>Some text</div>
ContentFeed entry has only div node with text=None.
I have debugged gdata-python-client request/response and checked resolved data from server in raw buffer - any plan text data in content. Hence it is a Google API bug.
May be there is some workaround? May be i can use some common request parameter? What's going wrong here?
This code works for me against a Google Apps domain and gdata 2.0.17:
import atom.data
import gdata.sites.client
import gdata.sites.data
client = gdata.sites.client.SitesClient(source='yourCo-yourAppName-v1', site='examplesite', domain='example.com')
client.ClientLogin('admin#example.com', 'examplepassword', client.source);
uri = '%s?path=%s' % (client.MakeContentFeedUri(), '/home')
feed = client.GetContentFeed(uri=uri)
entry = feed.entry[0]
print entry
Given, it's pretty much identical to yours, but it might help you prove or disprove something. Good luck!
Related
I'd like to download the entire revision history of a single article on Wikipedia, but am running into a roadblock.
It is very easy to download an entire Wikipedia article, or to grab pieces of its history using the Special:Export URL parameters:
curl -d "" 'https://en.wikipedia.org/w/index.php?title=Special:Export&pages=Stack_Overflow&limit=1000&offset=1' -o "StackOverflow.xml"
And of course I can download the entire site including all versions of every article from here, but that's many terabytes and way more data than I need.
Is there a pre-built method for doing this? (Seems like there must be.)
The example above only gets information about the revisions, not the actual contents themselves. Here's a short python script that downloads the full content and metadata history data of a page into individual json files:
import mwclient
import json
import time
site = mwclient.Site('en.wikipedia.org')
page = site.pages['Wikipedia']
for i, (info, content) in enumerate(zip(page.revisions(), page.revisions(prop='content'))):
info['timestamp'] = time.strftime("%Y-%m-%dT%H:%M:%S", info['timestamp'])
print(i, info['timestamp'])
open("%s.json" % info['timestamp'], "w").write(json.dumps(
{ 'info': info,
'content': content}, indent=4))
Wandering around aimlessly looking for clues to another question I have myself — my way of saying I know nothing substantial about this topic! — I just came upon this a moment after reading your question: http://mwclient.readthedocs.io/en/latest/reference/page.html. Have a look for the revisions method.
EDIT: I also see http://mwclient.readthedocs.io/en/latest/user/page-ops.html#listing-page-revisions.
Sample code using the mwclient module:
#!/usr/bin/env python3
import logging, mwclient, pickle, os
from mwclient import Site
from mwclient.page import Page
logging.root.setLevel(logging.DEBUG)
logging.debug('getting page...')
env_page = os.getenv("MEDIAWIKI_PAGE")
page_name = env_page is not None and env_page or 'Stack Overflow'
page_name = Page.normalize_title(env_page)
site = Site('en.wikipedia.org') # https by default. change w/`scheme=`
page = site.pages[page_name]
logging.debug('extracting revisions (may take a really long time, depending on the page)...')
revisions = []
for i, revision in enumerate(page.revisions()):
revisions.append(revision)
logging.debug('saving to file...')
with open('{}Revisions.mediawiki.pkl'.format(page_name), 'wb+') as f:
pickle.dump(revisions, f, protocol=0) # protocol allows backwards compatibility between machines
The following is the xml from remote URL
<SHOUTCASTSERVER>
<CURRENTLISTENERS>0</CURRENTLISTENERS>
<PEAKLISTENERS>0</PEAKLISTENERS>
<MAXLISTENERS>100</MAXLISTENERS>
<UNIQUELISTENERS>0</UNIQUELISTENERS>
<AVERAGETIME>0</AVERAGETIME>
<SERVERGENRE>variety</SERVERGENRE>
<SERVERGENRE2/>
<SERVERGENRE3/>
<SERVERGENRE4/>
<SERVERGENRE5/>
<SERVERURL>http://localhost/</SERVERURL>
<SERVERTITLE>Wicked Radio WIKD/WPOS</SERVERTITLE>
<SONGTITLE>Unknown - Haxor Radio Show 08</SONGTITLE>
<STREAMHITS>0</STREAMHITS>
<STREAMSTATUS>1</STREAMSTATUS>
<BACKUPSTATUS>0</BACKUPSTATUS>
<STREAMLISTED>0</STREAMLISTED>
<STREAMLISTEDERROR>200</STREAMLISTEDERROR>
<STREAMPATH>/stream</STREAMPATH>
<STREAMUPTIME>448632</STREAMUPTIME>
<BITRATE>128</BITRATE>
<CONTENT>audio/mpeg</CONTENT>
<VERSION>2.4.7.256 (posix(linux x64))</VERSION>
</SHOUTCASTSERVER>
All I am trying to do is store the contents of the element <SONGTITLE> store it so I can post to IRC using a bot that I have.
import urllib2
from lxml import etree
url = "http://142.4.217.133:9203/stats?sid=1&mode=viewxml&page=0"
fp = urllib2.urlopen(url)
doc = etree.parse(fp)
fp.close()
for record in doc.xpath('//SONGTITLE'):
for x in record.xpath("./subfield/text()"):
print "\t", x
That is what I have so far; not sure what I am doing wrong here. I am quite new to python but the IRC bot works and does some other utility type things I just want to add this as a feature to it.
You don't need to include ./subfield/:
for x in record.xpath("text()"):
Output:
Unknown - Haxor Radio Show 08
I have checked Google Search API's and it seems that they have not released any API for searching "Images". So, I was wondering if there exists a python script/library through which I can automate the "search by image feature".
This was annoying enough to figure out that I thought I'd throw a comment on the first python-related stackoverflow result for "script google image search". The most annoying part of all this is setting up your proper application and custom search engine (CSE) in Google's web UI, but once you have your api key and CSE, define them in your environment and do something like:
#!/usr/bin/env python
# save top 10 google image search results to current directory
# https://developers.google.com/custom-search/json-api/v1/using_rest
import requests
import os
import sys
import re
import shutil
url = 'https://www.googleapis.com/customsearch/v1?key={}&cx={}&searchType=image&q={}'
apiKey = os.environ['GOOGLE_IMAGE_APIKEY']
cx = os.environ['GOOGLE_CSE_ID']
q = sys.argv[1]
i = 1
for result in requests.get(url.format(apiKey, cx, q)).json()['items']:
link = result['link']
image = requests.get(link, stream=True)
if image.status_code == 200:
m = re.search(r'[^\.]+$', link)
filename = './{}-{}.{}'.format(q, i, m.group())
with open(filename, 'wb') as f:
image.raw.decode_content = True
shutil.copyfileobj(image.raw, f)
i += 1
There is no API available but you are can parse the page and imitate the browser, but I don't know how much data you need to parse because google may limit or block access.
You can imitate the browser by simply using urllib and setting correct headers, but if you think parsing complex web-pages may be difficult from python, you can directly use a headless browser like phontomjs, inside a browser it is trivial to get correct elements using javascript/DOM
Note before trying all this check google's TOS
You can try this:
https://developers.google.com/image-search/v1/jsondevguide#json_snippets_python
It's deprecated, but seems to work.
I keep my images in the DB as blobs:
class MyClass(db.Model):
icon=db.BlobProperty()
Now, I want to send the blob to my HTML like this :
lets say I have myClass as an instance of MyClass
result = """<div img_attr=%s> Bla Bla </div>""" % (myClass.icon)
Some how it doesn't work. Do you have any idea?
You cannot just dump raw image data into your html page. You need to do this in two pieces:
In your html, you need to refer to an image file:
result = "<div>"
"<img src='{0}' />"
"</div>"
.format(MYSITE + MYIMAGEDIR + myClass.name)
Your browser reads the html page, finds out you want to include an image, and goes looking for the image file to include - so it asks your site for something like http://www.myexample.com/images/icon01.jpg
Now, separately, you respond to this second request with the image content, as #anand has shown.
Your code suggests that you are working on Google application engine with Django.
You just need to query the image in your view and return it as http response.
image = myclassObject.icon
response = HttpResponse(image)
response['Content-Type'] = 'image/jpg'
return response
The value stored in the the data store, and returned by appengine with a db.BlobProperty is not the actual bytes of the blob, but rather a BlobKey that is used to reference it. There are two ways you can use that key. You can create a BlobReader to load the bytes of the blob from the BlobStore into your app, or you can craft a response with ServeHandler.send_blob to transfer those bytes to the client.
Doing the second one in Django is a bit of a headache, because ServeHandler doesn't really fit in well with the Django request processing stack. Here's a view that will do it for you without too much trouble:
def get_image_data(request, key, filename):
"serve original uploaded image"
# verify the users' ability to get the requested image
key = urllib.unquote(key)
img = _load_metadata(request, key)
blob = img.data;
blobkey = blob.key()
# and tell google to serve it
response = http.HttpResponse(
content='',
content_type=blob.content_type)
response['X-AppEngine-BlobKey'] = str(blobkey)
return response
I have some JavaScript from a 3rd party vendor that is initiating an image request. I would like to figure out the URI of this image request.
I can load the page in my browser, and then monitor "Live HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.
My intuition is that it might be possible using python + qtwebkit, but perhaps there is a better way.
To clarify: I might have this (overly simplified code).
<script>
suffix = magicNumberFunctionIDontHaveAccessTo();
url = "http://foobar.com/function?parameter=" + suffix
img = document.createElement('img'); img.src=url; document.all.body.appendChild(img);
</script>
Then once the page is loaded, I can go figure out the url by sniffing the packets. But I can't just figure it out from the source, because I can't predict the outcome of magicNumberFunction...().
Any help would be muchly appreciated!
Thank you.
The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. By using Rhino, it can evaluate JavaScript and likely be used to extract that URL out.
That said, if you can't get that working, try out Selenium RC and use the captureNetworkTraffic command (which requires the Selenium instant be started with an option of captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. Then you can parse that content and get what you want.
Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.
I would pick any one of the many http proxy servers written in Python -- probably one of the simplest ones at the very top of the list -- and tweak it to record all URLs requested (as well as proxy-serve them) e.g. appending them to a text file -- without loss of generality, call that text file 'XXX.txt'.
Now all you need is a script that: starts the proxy server in question; starts Firefox (or whatever) on your main desired URL with the proxy in question set as your proxy (see e.g. this SO question for how), though I'm sure other browsers would work just as well; waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds); reads XXX.txt to extract only the URLs you care about and record them wherever you wish; turns down the proxy and Firefox processes.
I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on qtwebkit, selenium, or other "automation kits".
Use Firebug Firefox plugin. It will show you all requests in real time and you can even debug the JS in your Browser or run it step-by-step.
Ultimately, I did it in python, using Selenium-RC. This solution requires the python files for selenium-rc, and you need to start the java server ("java -jar selenium-server.jar")
from selenium import selenium
import unittest
import lxml.html
class TestMyDomain(unittest.TestCase):
def setUp(self):
self.selenium = selenium("localhost", \
4444, "*firefox", "http://www.MyDomain.com")
self.selenium.start()
def test_mydomain(self):
htmldoc = open('site-list.html').read()
url_list = [link for (element, attribute,link,pos) in lxml.html.iterlinks(htmldoc)]
for url in url_list:
try:
sel = self.selenium
sel.open(url)
sel.select_window("null")
js_code = '''
myDomainWindow = this.browserbot.getUserWindow();
for(obj in myDomainWindow) {
/* This code grabs the OMNITURE tracking pixel img */
if ((obj.substring(0,4) == 's_i_') && (myDomainWindow[obj].src)) {
var ret = myDomainWindow[obj].src;
}
}
ret;
'''
omniture_url = sel.get_eval(js_code) #parse&process this however you want
except Exception, e:
print 'We ran into an error: %s' % (e,)
self.assertEqual("expectedValue", observedValue)
def tearDown(self):
self.selenium.stop()
if __name__ == "__main__":
unittest.main()
Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your page?
If it is loaded in your page, then this may be a dirty hack (substitute document.body for whatever element is considered):
var ac = document.body.appendChild;
var sources = [];
document.body.appendChild = function(child) {
if (/^img$/i.test(child.tagName)) {
sources.push(child.getAttribute('src'));
}
ac(child);
}