Python WebKit WebView: remember cookies?

I have written a short Python script that opens Google Music in a WebView window. However, I can't seem to find anything about getting WebKit to use cookies, so I have to log in every time I start it up.
Here's what I have:
#!/usr/bin/env python
import gtk, webkit
import ctypes
libgobject = ctypes.CDLL('/usr/lib/i386-linux-gnu/libgobject-2.0.so.0')
libsoup = ctypes.CDLL('/usr/lib/libsoup-2.4.so.1')
libwebkit = ctypes.CDLL('/usr/lib/libwebkitgtk-1.0.so.0')
proxy_uri = libsoup.soup_uri_new('http://tcdproxy.tcd.ie:8080') #proxy URI
session = libwebkit.webkit_get_default_session()
libgobject.g_object_set(session, "proxy-uri", proxy_uri, None)
w = gtk.Window()
w.connect("destroy",w.destroy)
w.set_size_request(1000,600)
w.connect('delete-event', lambda w, event: gtk.main_quit())
s = gtk.ScrolledWindow()
v = webkit.WebView()
s.add(v)
w.add(s)
w.show_all()
v.open('http://music.google.com')
gtk.main()
Any help on this would be greatly appreciated,
thanks,
Richard

Worked it out, but it required learning more ctypes than I wanted -_-. Try this (I required different library paths, etc. than you, so I'll just paste what's relevant):
#remove all cookiejars
generic_cookiejar_type = libgobject.g_type_from_name('SoupCookieJar')
libsoup.soup_session_remove_feature_by_type(session, generic_cookiejar_type)
#and replace with a new persistent jar
cookiejar = libsoup.soup_cookie_jar_text_new('/path/to/your/cookies.txt',False)
libsoup.soup_session_add_feature(session, cookiejar)
The code's pretty self-explanatory. There's also a SoupCookieJarSqlite that you might prefer, though I'm sure the text file is easier for development.
EDIT: actually, the cookie jar removal doesn't seem to be doing anything, so the appropriate snippet is
#add a new persistent cookie jar
cookiejar = libsoup.soup_cookie_jar_text_new('/path/to/your/cookies.txt',False)
libsoup.soup_session_add_feature(session, cookiejar)
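One ctypes caveat worth adding (my note, not part of the original answer): on 64-bit systems you should declare restype for functions that return pointers, otherwise ctypes truncates them to C ints and calls like the above can crash. A sketch, reusing the session from the question:
import ctypes
libsoup = ctypes.CDLL('/usr/lib/libsoup-2.4.so.1')
# declare pointer-returning/pointer-taking signatures so 64-bit
# pointers are not truncated to c_int (ctypes' default)
libsoup.soup_cookie_jar_text_new.restype = ctypes.c_void_p
libsoup.soup_session_add_feature.argtypes = [ctypes.c_void_p, ctypes.c_void_p]
cookiejar = libsoup.soup_cookie_jar_text_new('/path/to/your/cookies.txt', False)
libsoup.soup_session_add_feature(session, cookiejar)  # session as in the question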

I know it's an old question, and I had been looking for the answer all over the place. I finally came up with it on my own after some trial and error. Hope this helps others.
This is basically the same answer as Matt's, just using GObject introspection (GIR), which feels more Pythonic.
from gi.repository import Soup, WebKit

cookiejar = Soup.CookieJarText.new("<Your cookie path>", False)
cookiejar.set_accept_policy(Soup.CookieJarAcceptPolicy.ALWAYS)
session = WebKit.get_default_session()
session.add_feature(cookiejar)
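For completeness, a minimal self-contained sketch of how this slots into a whole browser window; the Gtk 3 / WebKit 3.0 version pins are my assumption and may differ on your distribution:
import gi
gi.require_version('Gtk', '3.0')
gi.require_version('WebKit', '3.0')
gi.require_version('Soup', '2.4')
from gi.repository import Gtk, WebKit, Soup

# persistent cookie jar, as above
cookiejar = Soup.CookieJarText.new('/path/to/cookies.txt', False)
cookiejar.set_accept_policy(Soup.CookieJarAcceptPolicy.ALWAYS)
session = WebKit.get_default_session()
session.add_feature(cookiejar)

window = Gtk.Window()
window.connect('delete-event', Gtk.main_quit)
window.set_size_request(1000, 600)
scroller = Gtk.ScrolledWindow()
view = WebKit.WebView()
scroller.add(view)
window.add(scroller)
window.show_all()
view.load_uri('http://music.google.com')
Gtk.main()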

In the latest version, i.e. WebKit2 GTK 4.0, this has to be done in the following way:
import gi
gi.require_version('WebKit2', '4.0')
from gi.repository import WebKit2

browser = WebKit2.WebView()
website_data_manager = browser.get_website_data_manager()
cookie_manager = website_data_manager.get_cookie_manager()
cookie_manager.set_persistent_storage('PATH_TO_YOUR/cookie.txt', WebKit2.CookiePersistentStorage.TEXT)
# WebKit2's cookie manager takes its own accept-policy enum, not libsoup's
cookie_manager.set_accept_policy(WebKit2.CookieAcceptPolicy.ALWAYS)
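To see it work end to end, the view from the snippet above just needs a window around it; a short sketch (the Gtk 3 pairing is my assumption):
import gi
gi.require_version('Gtk', '3.0')
from gi.repository import Gtk

window = Gtk.Window()
window.connect('delete-event', Gtk.main_quit)
window.set_size_request(1000, 600)
window.add(browser)  # the WebKit2.WebView created above
window.show_all()
browser.load_uri('http://music.google.com')
Gtk.main()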

Related

Python URLLib does not work with PyQt + Multiprocessing

A simple code as such:
import urllib2
import requests
from PyQt4 import QtCore
import multiprocessing
import time
data = (
['a', '2'],
)
def mp_worker((inputs, the_time)):
r = requests.get('http://www.gpsbasecamp.com/national-parks')
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
def mp_handler():
p = multiprocessing.Pool(2)
p.map(mp_worker, data)
if __name__ == '__main__':
mp_handler()
Basically, if I import PyQt4 and I make a urllib request (I believe urllib is used under the hood by almost all web-extraction libraries, such as BeautifulSoup, Requests, or PyQuery), it crashes with a cryptic log on my Mac.
This is exactly true. It always fails on Mac; I wasted days trying to fix it, and honestly there is no real fix as of now. The best workaround is to use threads instead of processes, and it will work like a charm.
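A minimal sketch of that workaround (my addition, keeping the question's Python 2 style): multiprocessing.pool.ThreadPool has the same map() interface as Pool but uses threads, so nothing is forked and the macOS crash is avoided.
from multiprocessing.pool import ThreadPool
import requests

data = (
    ['a', '2'],
)

def mp_worker((inputs, the_time)):
    # same fetch as in the question, now safe because no fork() happens
    return requests.get('http://www.gpsbasecamp.com/national-parks').status_code

if __name__ == '__main__':
    pool = ThreadPool(2)
    print pool.map(mp_worker, data)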
By the way -
r = requests.get('http://www.gpsbasecamp.com/national-parks')
and
request = urllib2.Request("http://www.gpsbasecamp.com/national-parks")
response = urllib2.urlopen(request)
do the same thing. Why are you doing it twice?
This may be due to _scproxy.get_proxies() not being fork-safe on macOS.
This is raised here https://bugs.python.org/issue33725#msg329926
_scproxy has been known to be problematic for some time, see for instance Issue31818. That issue also gives a simple workaround: setting urllib's "no_proxy" environment variable to "*" will prevent the calls to the System Configuration framework.
This is something urllib may be attempting to do when multiprocessing forks, causing the failure.
The workaround is to set the environment variable no_proxy to *.
E.g. export no_proxy='*'
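Or, equivalently, from inside the script itself, before any pool is created (a sketch of the same workaround):
import os
os.environ['no_proxy'] = '*'  # bypass _scproxy's System Configuration lookup

import multiprocessing
# ... create your Pool as usual after this point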

How can I connect to a website using twill inside of google app engine's python sandbox?

This allows me to connect to a website with Python on my computer:
from twill.commands import go, show, showforms, formclear, fv, submit
from bs4 import BeautifulSoup as bs
go('http://www.pge.com')
showforms()
This gets me to hello world on Google App Engine, with the twill and BeautifulSoup imports working:
import webapp2
import sys
sys.path.insert(0, 'libs')
from twill.commands import go, show, showforms, formclear, fv, submit
from bs4 import BeautifulSoup as bs

class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello, World! I love dog food.')

application = webapp2.WSGIApplication([
    ('/', MainPage),
], debug=True)
Now, after this, I try to connect to a website using twill and fail.
Where can I call go() to connect to a website?
If I add it before class MainPage(webapp2.RequestHandler):, it hangs and I don't get to hello world.
If I add it inside the MainPage class on the first line, as either getit = go('http://www.pge.com') or just go('http://www.pge.com'), it also hangs and I don't get to hello world.
If I add it inside def get(self):, I get:
Internal Server Error
The server has either erred or is incapable of performing the
requested operation.
and a bunch of stuff about twill and mechanize.py, followed by
File "..../twill/utils.py", line 275, in run_tidy
process = subprocess.Popen(_tidy_cmd, stdin=subprocess.PIPE,
AttributeError: 'module' object has no attribute 'Popen'
Am I somehow missing some other dependencies, like mechanize.py? Or is there something else I need to be doing?
This partially solved my issue, but other problems cropped up later, particularly in filling out forms, submitting them, and navigating further into the website.
import webapp2
import sys
sys.path.insert(0, 'libs')
from twill.commands import go, show, showforms, formclear, fv, submit, config
from twill.browser import *
from bs4 import BeautifulSoup as bs

class MainPage(webapp2.RequestHandler):
    config('use_tidy', '0')

    def get(self):
        go('http://www.pge.com')
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello, World! I love dog food.')
        self.response.write(show())
Tidy is disabled this way (tidy does not play well with the Google sandbox), and I am able to show() the webpage via self.response.write(show()).
As a side note: be careful about using TextEdit to edit .py files when working with Google App Engine. I was getting all sorts of weird non-ASCII character errors. I added # -*- coding: utf-8 -*- to the first line of the Python file, which sort of helped, and switched to using PyCharm's IDE, which really helped.
twill is giving me all sorts of issues inside Google App Engine's sandbox that I am not getting on my own system. I can finally get the HTML for one webpage, but I can't submit forms the way I did so simply on my system. showforms() isn't even showing the forms, maybe because I disabled tidy and the HTML is not being parsed properly?
I think one way to move forward here is to get tidy to work inside twill. Obviously roadblocks are being hit "under the hood", and they are hard for me to see.
It seems like twill is a high-level abstraction and maybe not a good fit for Google App Engine right now. Next I will try switching to mechanize.py, or look for another sandbox, maybe Amazon's?

Python script for "Google search by image"

I have checked the Google Search APIs, and it seems they have not released any API for searching images. So I was wondering whether there exists a Python script/library through which I can automate the "search by image" feature.
This was annoying enough to figure out that I thought I'd throw a comment on the first Python-related Stack Overflow result for "script google image search". The most annoying part of all this is setting up your application and custom search engine (CSE) properly in Google's web UI, but once you have your API key and CSE, define them in your environment and do something like:
#!/usr/bin/env python
# save top 10 google image search results to current directory
# https://developers.google.com/custom-search/json-api/v1/using_rest
import requests
import os
import sys
import re
import shutil

url = 'https://www.googleapis.com/customsearch/v1?key={}&cx={}&searchType=image&q={}'
apiKey = os.environ['GOOGLE_IMAGE_APIKEY']
cx = os.environ['GOOGLE_CSE_ID']
q = sys.argv[1]

i = 1
for result in requests.get(url.format(apiKey, cx, q)).json()['items']:
    link = result['link']
    image = requests.get(link, stream=True)
    if image.status_code == 200:
        m = re.search(r'[^\.]+$', link)
        filename = './{}-{}.{}'.format(q, i, m.group())
        with open(filename, 'wb') as f:
            image.raw.decode_content = True
            shutil.copyfileobj(image.raw, f)
        i += 1
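One fragile spot (my observation, not part of the original answer): the regex grabs everything after the last dot, so a link with a query string produces a junk extension. A sketch of a safer variant that derives the extension from the URL path instead (the '.jpg' fallback is my assumption):
import os
from urlparse import urlparse  # urllib.parse on Python 3

ext = os.path.splitext(urlparse(link).path)[1] or '.jpg'
filename = './{}-{}{}'.format(q, i, ext)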
There is no API available, but you can parse the page and imitate the browser. I don't know how much data you need to parse, because Google may rate-limit or block access.
You can imitate the browser simply by using urllib and setting the correct headers. If you think parsing complex web pages from Python would be difficult, you can drive a headless browser like PhantomJS instead; inside a browser it is trivial to get the right elements using JavaScript/DOM.
Note: before trying any of this, check Google's TOS.
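As a rough sketch of the urllib suggestion (the query parameters here are illustrative, not a documented API):
import urllib2

req = urllib2.Request('https://www.google.com/search?q=cats&tbm=isch',
                      headers={'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64)'})
html = urllib2.urlopen(req).read()  # parse this HTML at your own TOS-checked risk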
You can try this:
https://developers.google.com/image-search/v1/jsondevguide#json_snippets_python
It's deprecated, but seems to work.

How to clear cookies in WebKit?

I'm currently working with PyWebKitGtk in Python (http://live.gnome.org/PyWebKitGtk). I would like to clear all cookies in my own little browser. I found the interesting method webkit.HTTPResponse.clearCookies(), but I have no idea how to get my hands on an instance of the HTTPResponse object. :/
I would rather not use JavaScript for this task.
If you look at the current state of the bindings on GitHub, you'll see PyWebKitGTK doesn't yet provide quite what you want: it looks like there's no mapping for the HTTPResponse type. Unfortunately, I think JavaScript or a proxy are your only options right now.
EDIT:
...unless, of course, you want it real bad and stay up into the night learning ctypes. In which case, you can do magic. To clear all the browser's cookies, try this.
import gtk, webkit, ctypes
libwebkit = ctypes.CDLL('libwebkit-1.0.so')
libgobject = ctypes.CDLL('libgobject-2.0.so')
libsoup = ctypes.CDLL('libsoup-2.4.so')
v = webkit.WebView()
#do whatever it is you do with WebView...
....
#get the cookiejar from the default session
#(assumes one session and one cookiejar)
session = libwebkit.webkit_get_default_session()
generic_cookiejar_type = libgobject.g_type_from_name('SoupCookieJar')
cookiejar = libsoup.soup_session_get_feature(session, generic_cookiejar_type)
#build a callback to delete cookies
DEL_COOKIE_FUNC = ctypes.CFUNCTYPE(None, ctypes.c_void_p)
def del_cookie(cookie):
libsoup.soup_cookie_jar_delete_cookie(cookiejar, cookie)
#run the callback on all the cookies
cookie_list = libsoup.soup_cookie_jar_all_cookies(cookiejar)
libsoup.g_slist_foreach(cookie_list, DEL_COOKIE_FUNC(del_cookie), None)
EDIT:
Just started needing this myself, and while the above is the right idea, it needed work. Try this instead: the function type and cookie jar access are fixed.
#get the default session, as before
session = libwebkit.webkit_get_default_session()
#add a new cookie jar
cookiejar = libsoup.soup_cookie_jar_new()
#uncomment the below line for a persistent jar instead
#cookiejar = libsoup.soup_cookie_jar_text_new('/path/to/your/cookies.txt',False)
libsoup.soup_session_add_feature(session, cookiejar)
#build a callback to delete cookies
DEL_COOKIE_FUNC = ctypes.CFUNCTYPE(ctypes.c_int, ctypes.c_void_p, ctypes.c_void_p)
def del_cookie(cookie, userdata):
libsoup.soup_cookie_jar_delete_cookie(cookiejar, cookie)
return 0
#run the callback on all the cookies
cookie_list = libsoup.soup_cookie_jar_all_cookies(cookiejar)
libsoup.g_slist_foreach(cookie_list, DEL_COOKIE_FUNC(del_cookie), None)
Note that you should only do this before using the WebView, or maybe in WebKit callbacks, or you will have threading issues above and beyond those usually associated with GTK programming.
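If ctypes feels too heavy, the same idea can be expressed through GObject introspection; a sketch, assuming WebKit 1.0-era bindings (the method names are libsoup's, but the binding availability is my assumption):
from gi.repository import Soup, WebKit

session = WebKit.get_default_session()
cookiejar = session.get_feature(Soup.CookieJar)
# delete every cookie currently in the jar
for cookie in cookiejar.all_cookies():
    cookiejar.delete_cookie(cookie)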

Flash-Selenium and Python

I want to try Flash-Selenium with the Python driver, but I have some concerns about the available Python extension: it seems aged, and there is no example of how to use it... Is anybody using it? Any examples of how to use it?
Example taken from FlashSelenium page:
from com.thoughtworks.selenium.FlashSelenium import FlashSelenium
from com.thoughtworks.selenium.selenium import selenium
url = "http://flashselenium.t35.com/colors.html"
browserType = "*firefox"
selenium = selenium("localhost", 4444, browserType, url)
selenium.start()
selenium.open(url)
flashApp = FlashSelenium(selenium, "coloredSquare")
flashApp.percent_loaded()
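percent_loaded() wraps the Flash movie's PercentLoaded() call, so a common pattern is to poll it before driving the movie; a small sketch (my addition; the integer return value is an assumption based on the Flash API):
import time
# wait until the movie reports it is fully loaded
while int(flashApp.percent_loaded()) < 100:
    time.sleep(1)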
