How to clear cookies in WebKit? - python

I'm currently working with PyWebKitGtk in Python (http://live.gnome.org/PyWebKitGtk). I would like to clear all cookies in my own little browser. I found the interesting method webkit.HTTPResponse.clearCookies(), but I have no idea how to get my hands on an instance of the HTTPResponse object :/
I would rather not use JavaScript for this task.

If you look at the current state of the bindings on GitHub, you'll see PyWebKitGTK doesn't yet provide quite what you want; it looks like there's no mapping for the HTTPResponse type. Unfortunately, I think JavaScript or a proxy are your only options right now.
EDIT:
...unless, of course, you want it really badly and stay up into the night learning ctypes. In that case, you can do magic. To clear all the browser's cookies, try this:
import gtk, webkit, ctypes

libwebkit = ctypes.CDLL('libwebkit-1.0.so')
libgobject = ctypes.CDLL('libgobject-2.0.so')
libsoup = ctypes.CDLL('libsoup-2.4.so')

v = webkit.WebView()
#do whatever it is you do with WebView...
....

#get the cookie jar from the default session
#(assumes one session and one cookie jar)
session = libwebkit.webkit_get_default_session()
generic_cookiejar_type = libgobject.g_type_from_name('SoupCookieJar')
cookiejar = libsoup.soup_session_get_feature(session, generic_cookiejar_type)

#build a callback to delete cookies
DEL_COOKIE_FUNC = ctypes.CFUNCTYPE(None, ctypes.c_void_p)
def del_cookie(cookie):
    libsoup.soup_cookie_jar_delete_cookie(cookiejar, cookie)

#run the callback on all the cookies
cookie_list = libsoup.soup_cookie_jar_all_cookies(cookiejar)
libsoup.g_slist_foreach(cookie_list, DEL_COOKIE_FUNC(del_cookie), None)
EDIT:
Just started needing this myself, and while it's the right idea, it needed work. Instead, try this; the function type and cookie jar access are fixed.
#get the default session and add a new cookie jar to it
session = libwebkit.webkit_get_default_session()
cookiejar = libsoup.soup_cookie_jar_new()
#uncomment the line below for a persistent jar instead
#cookiejar = libsoup.soup_cookie_jar_text_new('/path/to/your/cookies.txt', False)
libsoup.soup_session_add_feature(session, cookiejar)

#build a callback to delete cookies
DEL_COOKIE_FUNC = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p, ctypes.c_void_p)
def del_cookie(cookie, userdata):
    libsoup.soup_cookie_jar_delete_cookie(cookiejar, cookie)
    return 0

#run the callback on all the cookies
cookie_list = libsoup.soup_cookie_jar_all_cookies(cookiejar)
libsoup.g_slist_foreach(cookie_list, DEL_COOKIE_FUNC(del_cookie), None)
Note that you should only do this before using the WebView, or perhaps in WebKit callbacks; otherwise you will run into threading issues above and beyond those usually associated with GTK programming.
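Putting the pieces together, a minimal end-to-end sketch (untested; same assumptions as above about library names and a single default session) that sets up the jar before the WebView loads anything:
import gtk, webkit, ctypes

libwebkit = ctypes.CDLL('libwebkit-1.0.so')
libsoup = ctypes.CDLL('libsoup-2.4.so')

view = webkit.WebView()

#configure the default session before any page is loaded
session = libwebkit.webkit_get_default_session()
cookiejar = libsoup.soup_cookie_jar_new()
libsoup.soup_session_add_feature(session, cookiejar)

window = gtk.Window()
window.connect('delete-event', lambda w, e: gtk.main_quit())
window.add(view)
window.show_all()

view.open('http://example.com')
gtk.main()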

Related

python django soaplib response with classmodel issue

I run a SOAP server in Django.
Is it possible to create a SOAP method that returns a soaplib ClassModel instance without the <{method name}Response><{method name}Result> tags?
For example, here is part of my SOAP server code:
# -*- coding: cp1254 -*-
from soaplib.core.service import rpc, DefinitionBase, soap
from soaplib.core.model.primitive import String, Integer, Boolean
from soaplib.core.model.clazz import Array, ClassModel
from soaplib.core import Application
from soaplib.core.server.wsgi import Application as WSGIApplication
from soaplib.core.model.binary import Attachment

class documentResponse(ClassModel):
    __namespace__ = ""
    msg = String
    hash = String

class MyService(DefinitionBase):
    __service_interface__ = "MyService"
    __port_types__ = ["MyServicePortType"]

    @soap(String, Attachment, String, _returns=documentResponse, _faults=(MyServiceFaultMessage,), _port_type="MyServicePortType")
    def sendDocument(self, fileName, binaryData, hash):
        binaryData.file_name = fileName
        binaryData.save_to_file()
        resp = documentResponse()
        resp.msg = "Saved"
        resp.hash = hash
        return resp
and it responds like this:
<senv:Body>
  <tns:sendDocumentResponse>
    <tns:sendDocumentResult>
      <hash>14a95636ddcf022fa2593c69af1a02f6</hash>
      <msg>Saved</msg>
    </tns:sendDocumentResult>
  </tns:sendDocumentResponse>
</senv:Body>
But I need a response like this:
<senv:Body>
  <ns3:documentResponse>
    <hash>A694EFB083E81568A66B96FC90EEBACE</hash>
    <msg>Saved</msg>
  </ns3:documentResponse>
</senv:Body>
What kind of configuration do I need in order to get the second response shown above?
Thanks in advance.
I haven't used Python's soaplib yet, but I had the same problem while using the .NET SOAP libraries. Just for reference, in .NET this is done using the following decorator:
[SoapDocumentMethod(ParameterStyle=SoapParameterStyle.Bare)]
I've looked in the soaplib source, but it seems it doesn't have a similar decorator. The closest thing I've found is the _style property. As seen from the code at https://github.com/soaplib/soaplib/blob/master/src/soaplib/core/service.py#L124, when using
@soap(..., _style='document')
it doesn't append the %sResult tag, but I haven't tested this. Just try it and see if it works the way you want.
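For example, applied to the method from the question, that would look something like this (untested; based only on the _style argument seen in the soaplib source):
@soap(String, Attachment, String,
      _returns=documentResponse,
      _style='document',
      _port_type="MyServicePortType")
def sendDocument(self, fileName, binaryData, hash):
    ...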
If it doesn't work, but you still want to get this kind of response, look at Spyne:
http://spyne.io/docs/2.10/reference/decorator.html
It is a fork of soaplib (I think) and has the _soap_body_style='bare' argument, which I believe is what you want.

what is the best way to make my folders invisible / restricted in twistd?

A few days ago I started learning Twisted in Python,
and this is how I built my web server:
from twisted.application import internet, service
from twisted.web import static, server, script
from twisted.web.resource import Resource
import os

class NotFound(Resource):
    isLeaf = True
    def render(self, request):
        return "Sorry... the page you're requesting is not found / forbidden"

class myStaticFile(static.File):
    def directoryListing(self):
        return self.childNotFound

#root = static.File(os.getcwd() + "/www")
root = myStaticFile(os.getcwd() + "/www")
root.indexNames = ['index.py']
root.ignoreExt(".py")
root.processors = {'.py': script.ResourceScript}
root.childNotFound = NotFound()

application = service.Application('web')
sc = service.IServiceCollection(application)
i = internet.TCPServer(8080, server.Site(root)) ##UndefinedVariable
i.setServiceParent(sc)
In my code, I subclass twisted.web.static.File and override directoryListing,
so when a user tries to access my resource folder (http://localhost:8080/resource/ or http://localhost:8080/resource/css), it returns a NotFound page,
but they can still open/read http://localhost:8080/resource/css/style.css.
It works...
What I want to know is: is this the correct way to do it?
Is there another, 'perfect' way?
I was looking for a config option that disables directory listing, like root.dirListing=False, but no luck...
Yes, that's a reasonable way to do it. You can also use twisted.web.resource.NoResource or twisted.web.resource.ForbiddenResource instead of defining your own NotFound.
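For instance, a minimal sketch of the same subclass using the stock error resources (assuming a reasonably recent Twisted; untested):
from twisted.web import static
from twisted.web.resource import NoResource, ForbiddenResource

class myStaticFile(static.File):
    def directoryListing(self):
        # serve a stock 404 page instead of listing the directory;
        # use ForbiddenResource() here if a 403 fits better
        return NoResource()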

python webkit webview remember cookies?

I have written a short Python script that opens Google Music in a WebView window. However, I can't seem to find anything about getting WebKit to use cookies, so I don't have to log in every time I start it up.
Here's what I have:
#!/usr/bin/env python
import gtk, webkit
import ctypes

libgobject = ctypes.CDLL('/usr/lib/i386-linux-gnu/libgobject-2.0.so.0')
libsoup = ctypes.CDLL('/usr/lib/libsoup-2.4.so.1')
libwebkit = ctypes.CDLL('/usr/lib/libwebkitgtk-1.0.so.0')

proxy_uri = libsoup.soup_uri_new('http://tcdproxy.tcd.ie:8080')  #proxy uri
session = libwebkit.webkit_get_default_session()
libgobject.g_object_set(session, "proxy-uri", proxy_uri, None)

w = gtk.Window()
w.connect("destroy", w.destroy)
w.set_size_request(1000, 600)
w.connect('delete-event', lambda w, event: gtk.main_quit())

s = gtk.ScrolledWindow()
v = webkit.WebView()
s.add(v)
w.add(s)
w.show_all()

v.open('http://music.google.com')
gtk.main()
Any help on this would be greatly appreciated,
thanks,
Richard
Worked it out, but it required learning more ctypes than I wanted -_-. Try this; I required different library paths etc. than you, so I'll just paste what's relevant.
#remove all cookiejars
generic_cookiejar_type = libgobject.g_type_from_name('SoupCookieJar')
libsoup.soup_session_remove_feature_by_type(session, generic_cookiejar_type)
#and replace with a new persistent jar
cookiejar = libsoup.soup_cookie_jar_text_new('/path/to/your/cookies.txt',False)
libsoup.soup_session_add_feature(session, cookiejar)
The code's pretty self-explanatory. There's also a SoupCookieJarSqlite that you might prefer, though I'm sure the text file would be easier for development.
EDIT: actually, the cookie jar removal doesn't seem to be doing anything, so the appropriate snippet is:
#add a new persistent cookie jar
cookiejar = libsoup.soup_cookie_jar_text_new('/path/to/your/cookies.txt',False)
libsoup.soup_session_add_feature(session, cookiejar)
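If you would rather try the SoupCookieJarSqlite variant mentioned above, the call would presumably look like this (an untested assumption; on older libsoup releases soup_cookie_jar_sqlite_new is exported by libsoup-gnome-2.4 rather than libsoup-2.4, so adjust the library path to your system):
#hypothetical: SQLite-backed jar instead of the text-file jar
libsoup_gnome = ctypes.CDLL('/usr/lib/libsoup-gnome-2.4.so.1')
cookiejar = libsoup_gnome.soup_cookie_jar_sqlite_new('/path/to/your/cookies.sqlite', False)
libsoup.soup_session_add_feature(session, cookiejar)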
I know it's an old question, and I had been looking for the answer all over the place. I finally came up with it on my own after some trial and error; hope this helps others.
This is basically the same answer as Matt's, just using GObject introspection (GIR), which feels more Pythonic.
from gi.repository import Soup, WebKit
cookiejar = Soup.CookieJarText.new("<Your cookie path>", False)
cookiejar.set_accept_policy(Soup.CookieJarAcceptPolicy.ALWAYS)
session = WebKit.get_default_session()
session.add_feature(cookiejar)
In the latest version, i.e. GTK WebKit2 4.0, this has to be done in the following way:
import gi
gi.require_version('Soup', '2.4')
gi.require_version('WebKit2', '4.0')
from gi.repository import Soup
from gi.repository import WebKit2
browser = WebKit2.WebView()
website_data_manager = browser.get_website_data_manager()
cookie_manager = website_data_manager.get_cookie_manager()
cookie_manager.set_persistent_storage('PATH_TO_YOUR/cookie.txt', WebKit2.CookiePersistentStorage.TEXT)
cookie_manager.set_accept_policy(Soup.CookieJarAcceptPolicy.ALWAYS)

Captchas in Scrapy

I'm working on a Scrapy app, where I'm trying to log in to a site with a form that uses a captcha (it's not spam). I am using ImagesPipeline to download the captcha, and I am printing it to the screen for the user to solve. So far so good.
My question is how can I restart the spider to submit the captcha/form information? Right now my spider requests the captcha page, then returns an Item containing the image_url of the captcha. This is then processed/downloaded by the ImagesPipeline and displayed to the user. I'm unclear how I can resume the spider's progress and pass the solved captcha and the same session to the spider, as I believe the spider has to return the item (i.e. quit) before the ImagesPipeline goes to work.
I've looked through the docs and examples, but I haven't found any that make it clear how to make this happen.
This is how you might get it to work inside the spider.
self.crawler.engine.pause()
process_my_captcha()
self.crawler.engine.unpause()
Once you get the request, pause the engine, display the image, read the info from the user, and resume the crawl by submitting a POST request for login.
I'd be interested to know if the approach works for your case.
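For example, a rough sketch of that flow inside a parse callback might look like this (solve_captcha_interactively and after_login are hypothetical placeholders; untested):
from scrapy.http import FormRequest

def parse_login_page(self, response):
    # pause the engine while a human solves the captcha
    self.crawler.engine.pause()
    captcha_text = solve_captcha_interactively(response)  # hypothetical helper
    self.crawler.engine.unpause()
    # resume the crawl by submitting the login form with the solved captcha
    return FormRequest.from_response(
        response,
        formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha_text},
        callback=self.after_login)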
I would not create an Item or use the ImagesPipeline.
import urllib
import os
import subprocess
...

def start_requests(self):
    request = Request("http://webpagewithcaptchalogin.com/", callback=self.fill_login_form)
    return [request]

def fill_login_form(self, response):
    x = HtmlXPathSelector(response)
    img_src = x.select("//img/@src").extract()

    # delete the old captcha file and use urllib to write the new one to disk
    os.remove("c:\\captcha.jpg")
    urllib.urlretrieve(img_src[0], "c:\\captcha.jpg")

    # I use a program here to show the jpg (actually send it somewhere)
    captcha = subprocess.check_output(r".\external_utility_solving_captcha.exe")

    # OR just get the input from the user from stdin
    captcha = raw_input("put captcha in manually>")

    # this performs the request and calls process_home_page with the response
    # (this way you can chain pages from start_requests() to parse())
    return [FormRequest.from_response(response, formnumber=0, formdata={'user': 'xxx', 'pass': 'xxx', 'captcha': captcha}, callback=self.process_home_page)]

def process_home_page(self, response):
    # check if you logged in etc. etc.
    ...
What I do here is import urllib.urlretrieve(url) (to store the image), os.remove(file) (to delete the previous image), and subprocess.check_output (to call an external command-line utility to solve the captcha). The whole Scrapy infrastructure is not used in this "hack", because solving a captcha like this is always a hack.
That whole calling-an-external-subprocess thing could have been done more nicely, but this works.
On some sites it's not possible to save the captcha image, and you have to call up the page in a browser, call a screen-capture utility, and crop at an exact location to "cut out" the captcha. Now that is screen scraping.

Using Python/Selenium/Best Tool For The Job to get URI of image requests generated through JavaScript?

I have some JavaScript from a 3rd-party vendor that initiates an image request. I would like to figure out the URI of this image request.
I can load the page in my browser, and then monitor "Live HTTP Headers" or "Tamper Data" in order to figure out the image request URI, but I would prefer to create a command line process to do this.
My intuition is that it might be possible using Python + QtWebKit, but perhaps there is a better way.
To clarify: I might have this (overly simplified) code:
<script>
  suffix = magicNumberFunctionIDontHaveAccessTo();
  url = "http://foobar.com/function?parameter=" + suffix;
  img = document.createElement('img');
  img.src = url;
  document.body.appendChild(img);
</script>
Then once the page is loaded, I can figure out the url by sniffing the packets. But I can't just figure it out from the source, because I can't predict the outcome of magicNumberFunction...().
Any help would be much appreciated!
Thank you.
The simplest thing to do might be to use something like HtmlUnit and skip a real browser entirely. Because it uses Rhino, it can evaluate JavaScript and can likely be used to extract that URL.
That said, if you can't get that working, try Selenium RC and use the captureNetworkTraffic command (which requires the Selenium instance to be started with the option captureNetworkTraffic=true). This will launch Firefox with a proxy configured and then let you pull the request info back out as JSON/XML/plain text. Then you can parse that content and get what you want.
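A rough sketch using the old Selenium RC Python client (untested; it assumes your client is recent enough to accept browser configuration options in start() and to expose capture_network_traffic()):
from selenium import selenium

sel = selenium("localhost", 4444, "*firefox", "http://example.com")
sel.start("captureNetworkTraffic=true")  # enable traffic capture at startup
sel.open("/page-with-the-image-request")
traffic = sel.capture_network_traffic("json")  # "xml" and "plain" should also work
sel.stop()
# parse `traffic` for the image URL you care about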
Try out the instant test tool that my company offers. If the data you're looking for is in our results (after you click View Details), you'll be able to get it from Selenium. I know, since I wrote the captureNetworkTraffic API for Selenium for my company, BrowserMob.
I would pick one of the many HTTP proxy servers written in Python (probably one of the simplest ones at the very top of the list) and tweak it to record all requested URLs (as well as proxy-serve them), e.g. by appending them to a text file; without loss of generality, call that text file XXX.txt.
Now all you need is a script that: starts the proxy server in question; starts Firefox (or whatever) on your main desired URL, with the proxy in question set as your proxy (see e.g. this SO question for how), though I'm sure other browsers would work just as well; waits a bit (e.g. until the proxy's XXX.txt file has not been altered for more than N seconds); reads XXX.txt to extract only the URLs you care about and records them wherever you wish; and shuts down the proxy and Firefox processes.
I think this will be much faster to put in place and make work correctly, for your specific requirements, than any more general solution based on qtwebkit, selenium, or other "automation kits".
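Putting that recipe into a script might look roughly like this (proxy.py, its --log flag, and the preconfigured Firefox proxy profile are all hypothetical placeholders; untested):
import os, subprocess, time

LOG = "XXX.txt"
open(LOG, "w").close()  # start with an empty log file

# proxy.py is hypothetical: a stock Python HTTP proxy tweaked to append
# every requested URL to LOG
proxy = subprocess.Popen(["python", "proxy.py", "--log", LOG])
# Firefox profile assumed to be preconfigured to use that proxy
browser = subprocess.Popen(["firefox", "http://example.com"])

# wait until the log has been quiet for N seconds
N = 10
while time.time() - os.path.getmtime(LOG) < N:
    time.sleep(1)

with open(LOG) as f:
    urls = [line.strip() for line in f if "foobar.com/function" in line]

browser.terminate()
proxy.terminate()
print urls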
Use the Firebug Firefox plugin. It will show you all requests in real time, and you can even debug the JS in your browser or run it step by step.
Ultimately, I did it in Python using Selenium RC. This solution requires the Python files for Selenium RC, and you need to start the Java server ("java -jar selenium-server.jar").
from selenium import selenium
import unittest
import lxml.html

class TestMyDomain(unittest.TestCase):
    def setUp(self):
        self.selenium = selenium("localhost", 4444, "*firefox", "http://www.MyDomain.com")
        self.selenium.start()

    def test_mydomain(self):
        htmldoc = open('site-list.html').read()
        url_list = [link for (element, attribute, link, pos) in lxml.html.iterlinks(htmldoc)]
        for url in url_list:
            try:
                sel = self.selenium
                sel.open(url)
                sel.select_window("null")
                js_code = '''
                    myDomainWindow = this.browserbot.getUserWindow();
                    for(obj in myDomainWindow) {
                        /* This code grabs the OMNITURE tracking pixel img */
                        if ((obj.substring(0,4) == 's_i_') && (myDomainWindow[obj].src)) {
                            var ret = myDomainWindow[obj].src;
                        }
                    }
                    ret;
                '''
                omniture_url = sel.get_eval(js_code)  # parse & process this however you want
            except Exception, e:
                print 'We ran into an error: %s' % (e,)

        self.assertEqual("expectedValue", observedValue)

    def tearDown(self):
        self.selenium.stop()

if __name__ == "__main__":
    unittest.main()
Why can't you just read suffix, or url for that matter? Is the image loaded in an iframe or in your own page?
If it is loaded in your page, then this may be a dirty hack (substitute document.body for whatever element is used as the parent):
var ac = document.body.appendChild;
var sources = [];
document.body.appendChild = function(child) {
    if (/^img$/i.test(child.tagName)) {
        sources.push(child.getAttribute('src'));
    }
    // call the original appendChild with the correct `this`
    ac.call(document.body, child);
};
