Most straightforward way to cache geocoding data - python

I am using geopy to get lat/long coordinates for a list of addresses. All the documentation points to limiting server queries by caching (many questions here, in fact), but few actually give practical solutions.
What is the best way to accomplish this?
This is for a self-contained data processing job I'm working on ... no app platform involved. Just trying to cut down on server queries as I run through data that I will have seen before (very likely, in my case).
My code looks like this:
from geopy import geocoders

def geocode( address ):
    # address ~= "175 5th Avenue NYC"
    g = geocoders.GoogleV3()
    cache = addressCached( address )
    if ( cache != False ):
        # We have seen this exact address before,
        # return the saved location
        return cache
    # Otherwise, get a new location from geocoder
    location = g.geocode( address )
    saveToCache( address, location )
    return location

def addressCached( address ):
    # What does this look like?

def saveToCache( address, location ):
    # What does this look like?

How exactly you want to implement your cache is really dependent on what platform your Python code will be running on.
You want a pretty persistent "cache" since addresses' locations are not going to change often:-), so a database (used in a key-value mode) seems best.
So in many cases I'd pick sqlite3, an excellent, very lightweight SQL engine that's part of the Python standard library. Unless perhaps I preferred e.g. a MySQL instance that I need to have running anyway; one advantage might be that this would allow multiple applications running on different nodes to share the "cache" -- other DBs, both SQL and non, would be good for the latter, depending on your constraints and preferences.
But if I was e.g. running on Google App Engine, then I'd be using the datastore it includes, instead. Unless I had specific reasons to want to share the "cache" among multiple disparate applications, in which case I might consider alternatives such as google cloud sql and google storage, as well as another alternative yet consisting of a dedicated "cache server" GAE app of my own serving RESTful results (maybe w/endpoints?). The choice is, again!, very, very dependent on your constraints and preferences (latency, queries-per-second sizing, etc, etc).
So please clarify what platform you are on, and what other constraints and preferences you have for your databasey "cache", and then the very simple code to implement that can easily be shown. But showing half a dozen different possibilities before you clarify would not be very productive.
Added: since the comments suggest sqlite3 may be acceptable, and there are a few important details best shown in code (such as, how to serialize and deserialize an instance of geopy.location.Location into/from a sqlite3 blob -- similar issues may well arise with other underlying databases, and the solutions are similar), I decided a solution example may be best shown in code. So, as the "geo cache" is clearly best implemented as its own module, I wrote the following simple geocache.py...:
import geopy
import pickle
import sqlite3

class Cache(object):
    def __init__(self, fn='cache.db'):
        self.conn = conn = sqlite3.connect(fn)
        cur = conn.cursor()
        cur.execute('CREATE TABLE IF NOT EXISTS '
                    'Geo ( '
                    'address STRING PRIMARY KEY, '
                    'location BLOB '
                    ')')
        conn.commit()

    def address_cached(self, address):
        cur = self.conn.cursor()
        cur.execute('SELECT location FROM Geo WHERE address=?', (address,))
        res = cur.fetchone()
        if res is None: return False
        return pickle.loads(res[0])

    def save_to_cache(self, address, location):
        cur = self.conn.cursor()
        cur.execute('INSERT INTO Geo(address, location) VALUES(?, ?)',
                    (address, sqlite3.Binary(pickle.dumps(location, -1))))
        self.conn.commit()

if __name__ == '__main__':
    # run a small test in this case
    import pprint

    cache = Cache('test.db')
    address = '1 Murphy St, Sunnyvale, CA'
    location = cache.address_cached(address)
    if location:
        print('was cached: {}\n{}'.format(location, pprint.pformat(location.raw)))
    else:
        print('was not cached, looking up and caching now')
        g = geopy.geocoders.GoogleV3()
        location = g.geocode(address)
        print('found as: {}\n{}'.format(location, pprint.pformat(location.raw)))
        cache.save_to_cache(address, location)
        print('... and now cached.')
I hope the ideas illustrated here are clear enough -- there are alternatives on each design choice, but I've tried to keep things simple (in particular, I'm using a simple example-cum-mini-test when this module is run directly, in lieu of a proper suite of unit-tests...).
For the bit about serializing to/from blobs, I've chosen pickle with the "highest protocol" (-1): cPickle of course would be just as good in Python 2 (and faster:-), but these days I try to write code that's equally good as Python 2 or 3, unless I have specific reasons to do otherwise:-). And of course I'm using a different filename, test.db, for the sqlite database used in the test, so you can wipe it out with no qualms to test some variation, while the default filename meant to be used in "production" code stays intact (it is quite a dubious design choice to use a filename that's relative -- meaning "in the current directory" -- but the appropriate way to decide where to place such a file is quite platform dependent, and I didn't want to get into such esoterica here:-).
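To tie this back to the geocode() function in the question, here is a minimal sketch of how it might use the module above (assuming geocache.py is importable and a Google geocoder can be constructed as in the question; names such as _cache are just illustrative):
import geopy
from geocache import Cache

_cache = Cache()                       # uses the default cache.db
_geocoder = geopy.geocoders.GoogleV3()

def geocode(address):
    location = _cache.address_cached(address)
    if location is not False:
        return location                # cache hit: no server query needed
    location = _geocoder.geocode(address)
    _cache.save_to_cache(address, location)
    return location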
If any other question is left, please ask (perhaps best on a separate new question since this answer has already grown so big!-).

How about creating a list or dict in which all geocoded addresses are stored? Then you could simply check.
if address in cached:
    # skip

This cache will live from the moment the module is loaded and won't be saved after you finish using this module. You'll probably want to save it into a file with pickle or into a database and load it next time you load the module.
from geopy import geocoders

cache = {}

def geocode( address ):
    # address ~= "175 5th Avenue NYC"
    g = geocoders.GoogleV3()
    cached_location = addressCached( address )
    if cached_location is not None:
        # We have seen this exact address before,
        # return the saved location
        return cached_location
    # Otherwise, get a new location from geocoder
    location = g.geocode( address )
    saveToCache( address, location )
    return location

def addressCached( address ):
    global cache
    if address in cache:
        return cache[address]
    return None

def saveToCache( address, location ):
    global cache
    cache[address] = location
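As noted above, this dict only lives for the lifetime of the process. A minimal sketch of persisting it between runs with pickle (the file name cache.pkl is just an example):
import os
import pickle

CACHE_FILE = 'cache.pkl'  # example file name

def load_cache():
    # Load a previously saved cache, or start with an empty dict.
    if os.path.exists(CACHE_FILE):
        with open(CACHE_FILE, 'rb') as f:
            return pickle.load(f)
    return {}

def dump_cache(cache):
    # Write the whole cache back to disk before the program exits.
    with open(CACHE_FILE, 'wb') as f:
        pickle.dump(cache, f, -1)

cache = load_cache()
# ... geocode and fill the cache as above, then:
# dump_cache(cache)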

Here is a simple implementation that uses the python shelve package for transparent and persistent caching:
import geopy
import shelve
import time

class CachedGeocoder:
    def __init__(self, source = "Nominatim", geocache = "geocache.db"):
        self.geocoder = getattr(geopy.geocoders, source)()
        self.db = shelve.open(geocache, writeback = True)
        self.ts = time.time()+1.1

    def geocode(self, address):
        if address not in self.db:
            time.sleep(max(1 - (time.time() - self.ts), 0))
            self.ts = time.time()
            self.db[address] = self.geocoder.geocode(address)
        return self.db[address]

geocoder = CachedGeocoder()
print(geocoder.geocode("San Francisco, USA"))
It stores a timestamp to ensure that requests are not issued more frequently than once per second (which is a requirement for Nominatim). One weakness is that it doesn't deal with timed-out responses from Nominatim.
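One way to paper over that weakness, sketched under the assumption that geopy's GeocoderTimedOut exception is what gets raised on a timeout (the retry count and sleep are arbitrary):
import time
from geopy.exc import GeocoderTimedOut

def geocode_with_retry(geocoder, address, retries=3):
    # Retry a few times on timeouts, pausing a second between attempts.
    for attempt in range(retries):
        try:
            return geocoder.geocode(address)
        except GeocoderTimedOut:
            if attempt == retries - 1:
                raise
            time.sleep(1)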

The easiest way to cache geocoding requests is probably to use requests-cache:
import geocoder
import requests_cache
requests_cache.install_cache('geocoder_cache')
g = geocoder.osm('Mountain View, CA') # <-- This request will go to OpenStreetMap server.
print(g.latlng)
# [37.3893889, -122.0832101]
g = geocoder.osm('Mountain View, CA') # <-- This request should be cached, and return immediately
print(g.latlng)
# [37.3893889, -122.0832101]
requests_cache.uninstall_cache()
Just for debugging purposes, you could check if requests are indeed cached:
import geocoder
import requests_cache
def debug(response):
    print(type(response))
    return True
requests_cache.install_cache('geocoder_cache2', filter_fn=debug)
g = geocoder.osm('Mountain View, CA')
# <class 'requests.models.Response'>
# <class 'requests.models.Response'>
g = geocoder.osm('Mountain View, CA')
# <class 'requests_cache.models.response.CachedResponse'>
requests_cache.uninstall_cache()
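If you want cached responses to age out eventually, install_cache also accepts an expire_after argument in seconds; a small sketch (the one-week figure is arbitrary):
import requests_cache

# Cache responses for one week; expired entries are refetched on the next request.
requests_cache.install_cache('geocoder_cache', expire_after=604800)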

Related

Datastore delay on creating entities with put()

I am developing an application using the Cloud Datastore Emulator (2.1.0) and the google-cloud-ndb Python library (1.6).
I find that there is an intermittent delay on entities being retrievable via a query.
For example, if I create an entity like this:
my_entity = MyEntity(foo='bar')
my_entity.put()
get_my_entity = MyEntity.query().filter(MyEntity.foo == 'bar').get()
print(get_my_entity.foo)
it will fail intermittently because the get() method returns None.
This only happens on about 1 in 10 calls.
To demonstrate, I've created this script (also available with ready to run docker-compose setup on GitHub):
import random

from google.cloud import ndb
from google.auth.credentials import AnonymousCredentials

client = ndb.Client(
    credentials=AnonymousCredentials(),
    project='local-dev',
)

class SampleModel(ndb.Model):
    """Sample model."""
    some_val = ndb.StringProperty()

for x in range(1, 1000):
    print(f'Attempt {x}')
    with client.context():
        random_text = str(random.randint(0, 9999999999))
        new_model = SampleModel(some_val=random_text)
        new_model.put()
        retrieved_model = SampleModel.query().filter(
            SampleModel.some_val == random_text
        ).get()
        print(f'Model Text: {retrieved_model.some_val}')
What would be the correct way to avoid this intermittent failure? Is there a way to ensure the entity is always available after the put() call?
Update
I can confirm that this is only an issue with the datastore emulator. When testing on App Engine with Firestore in Datastore mode, entities are available immediately after calling put().
The issue turned out to be related to the emulator trying to replicate eventual consistency.
Unlike relational databases, Datastore does not guarantee that the data will be available immediately after it's written. This is because there are often replication and indexing delays.
For things like unit tests, this can be resolved by passing --consistency=1.0 to the datastore start command as documented here.
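An alternative workaround, independent of the emulator flag: fetch the entity back by its key instead of by a property query, since key lookups are strongly consistent (the query-based read is what is subject to eventual consistency). A minimal sketch against the question's model:
with client.context():
    new_model = SampleModel(some_val='hello')
    key = new_model.put()        # put() returns the entity's key
    retrieved_model = key.get()  # key lookups are strongly consistent
    print(retrieved_model.some_val)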

Using libsecret I can't get to the label of an unlocked item

I'm working on a little program that uses libsecret. This program should be able to create a Secret.Service...
from gi.repository import Secret
service = Secret.Service.get_sync(Secret.ServiceFlags.LOAD_COLLECTIONS)
... get a specific Collection from that Service...
# 2 is the index of a custom collection I created, not the default one.
collection = service.get_collections()[2]
... and then list all the Items inside that Collection, this by just printing their labels.
# I'm just printing a single label here for testing, I'd need all of course.
print(collection.get_items()[0].get_label())
An important detail is that the Collection may initially be locked, and so I need to include code that checks for that possibility, and tries to unlock the Collection.
# The unlock method returns a tuple (number_of_objs_unlocked, [list_of_objs])
collection = service.unlock_sync([collection])[1][0]
This is important because the code I currently have can do all I need when the Collection is initially unlocked. However, if the Collection is initially locked, even after I unlock it I can't get the labels from the Items inside. What I can do is disconnect() the Service, recreate the Service again, get the now unlocked Collection, and this way I am able to read the label on each Item. Another interesting detail is that, after the labels are read once, I no longer require the Service reconnection to access them. This seems quite inelegant, so I started looking for a different solution.
I realized that the Collection inherits from Gio.DBusProxy, and this class caches the data from the object it accesses. So I'm assuming that is the problem: I'm not updating the cache. This is strange, though, because the documentation states that Gio.DBusProxy should be able to detect changes on the original object, but that's not happening.
Now I don't know how to update the cache on that class. I've taken a look at some seahorse (another application that uses libsecret) Vala code, which I wasn't able to completely decipher (I can't code Vala), but it mentioned the Object.emit() method; I'm still not sure how I could use that method to achieve my goal. From the documentation (https://lazka.github.io/pgi-docs/Secret-1/#) I found another promising method, Object.notify(), which seems to be able to send notifications of changes that would enable cache updates, but I also haven't been able to use it properly yet.
I also posted on the gnome-keyring mailing list about this...
https://mail.gnome.org/archives/gnome-keyring-list/2015-November/msg00000.html
... with no answer so far, and found a bugzilla report on gnome.org that mentions this issue...
https://bugzilla.gnome.org/show_bug.cgi?id=747359
... with no solution so far (7 months) either.
So if someone could shine some light on this problem, that would be great. Otherwise some inelegant code will unfortunately find its way into my little program.
Edit-0:
Here is some code to replicate the issue in Python3.
This snippet creates a collection 'test_col', with one item 'test_item', and locks the collection. Note libsecret will prompt you for the password you want for this new collection:
#!/usr/bin/env python3

from gi import require_version
require_version('Secret', '1')
from gi.repository import Secret

# Create schema
args = ['com.idlecore.test.schema']
args += [Secret.SchemaFlags.NONE]
args += [{'service': Secret.SchemaAttributeType.STRING,
          'username': Secret.SchemaAttributeType.STRING}]
schema = Secret.Schema.new(*args)

# Create 'test_col' collection
flags = Secret.CollectionCreateFlags.COLLECTION_CREATE_NONE
collection = Secret.Collection.create_sync(None, 'test_col', None, flags, None)

# Create item 'test_item' inside collection 'test_col'
attributes = {'service': 'stackoverflow', 'username': 'xor'}
password = 'password123'
value = Secret.Value(password, len(password), 'text/plain')
flags = Secret.ItemCreateFlags.NONE
Secret.Item.create_sync(collection, schema, attributes,
                        'test_item', value, flags, None)

# Lock collection
service = collection.get_service()
service.lock_sync([collection])
Then we need to restart the gnome-keyring-daemon, you can just logout and back in or use the command line:
gnome-keyring-daemon --replace
This will setup your keyring so we can try opening a collection that is initially locked. We can do that with this code snippet. Note that you will be prompted once again for the password you set previously:
#!/usr/bin/env python3

from gi import require_version
require_version('Secret', '1')
from gi.repository import Secret

# Get the service
service = Secret.Service.get_sync(Secret.ServiceFlags.LOAD_COLLECTIONS)

# Find the correct collection
for c in service.get_collections():
    if c.get_label() == 'test_col':
        collection = c
        break

# Unlock the collection and show the item label, note that it's empty.
collection = service.unlock_sync([collection])[1][0]
print('Item Label:', collection.get_items()[0].get_label())

# Do the same thing again, and it works.
# It's necessary to disconnect the service to clear the cache,
# otherwise we keep getting the same empty label.
service.disconnect()

# Get the service
service = Secret.Service.get_sync(Secret.ServiceFlags.LOAD_COLLECTIONS)

# Find the correct collection
for c in service.get_collections():
    if c.get_label() == 'test_col':
        collection = c
        break

# No need to unlock again, just show the item label
print('Item Label:', collection.get_items()[0].get_label())
This code attempts to read the item label twice: once the normal way, which fails (you should see an empty string), and then using a workaround that disconnects the service and reconnects again.
I came across this question while trying to update a script I use to retrieve passwords from my desktop on my laptop, and vice versa.
The clue was in the documentation for Secret.ServiceFlags—there are two:
OPEN_SESSION = 2
establish a session for transfer of secrets while initializing the Secret.Service
LOAD_COLLECTIONS = 4
load collections while initializing the Secret.Service
I think for a Service that both loads collections and allows transfer of secrets (including item labels) from those collections, we need to use both flags.
The following code (similar to your mailing list post, but without a temporary collection set up for debugging) seems to work. It gives me the label of an item:
from gi.repository import Secret
service = Secret.Service.get_sync(Secret.ServiceFlags.OPEN_SESSION |
                                  Secret.ServiceFlags.LOAD_COLLECTIONS)
collections = service.get_collections()
unlocked_collection = service.unlock_sync([collections[0]], None)[1][0]
unlocked_collection.get_items()[0].get_label()
I have been doing this
print(collection.get_locked())
if collection.get_locked():
    service.unlock_sync(collection)
Don't know if it is going to work, though, because I have never hit a case where I have something that is locked. If you have a piece of sample code where I can create a locked instance of a collection, then maybe I can help.

latency with group in pymongo in tests

Good Day.
I have faced the following issue using pymongo==2.1.1 in Python 2.7 with mongo 2.4.8.
I have tried to find a solution using Google and Stack Overflow but failed.
What's the issue?
I have the following function:
from pymongo import Connection
from bson.code import Code

def read(groupped_by=None):
    reducer = Code("""
        function(obj, prev){
            prev.count++;
        }
        """)
    client = Connection('localhost', 27017)
    db = client.urlstats_database
    results = db.http_requests.group(key={k: 1 for k in groupped_by},
                                     condition={},
                                     initial={"count": 0},
                                     reduce=reducer)
    groupped_by = list(groupped_by) + ['count']
    result = [tuple(res[col] for col in groupped_by) for res in results]
    return sorted(result)
Then I am trying to write test for this function
class UrlstatsViewsTestCase(TestCase):
    test_data = {'data%s' % i: 'data%s' % i for i in range(6)}

    def test_one_criterium(self):
        client = Connection('localhost', 27017)
        db = client.urlstats_database
        for column in self.test_data:
            db.http_requests.remove()
            db.http_requests.insert(self.test_data)
            response = read([column])
            self.assertEqual(response, [(self.test_data[column], 1)])
This test sometimes fails, as I understand it, because of latency. As far as I can see, the response contains data that has not been cleaned up.
If I add a delay after the remove, the test passes all the time.
Is there any proper way to test such functionality?
Thanks in Advance.
A few questions regarding your environment / code:
What version of pymongo are you using?
If you are using any of the newer versions that have MongoClient, is there any specific reason you are using Connection instead of MongoClient?
The reason I ask the second question is because Connection provides fire-and-forget behaviour for the operations that you are doing, while MongoClient works by default in safe mode and is also the preferred approach since mongodb 2.2+.
The behaviour that you see is very much consistent with using Connection instead of MongoClient. While using Connection, your remove is sent to the server, and the moment it is sent from the client side, your program execution moves on to the next step, which is to add new entries. Depending on latency and how long the remove operation takes to complete, these are going to conflict, as you have already noticed in your test case.
Can you change to use MongoClient and see if that helps you with your test code?
Additional Ref: pymongo: MongoClient or Connection
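For reference, the suggested change is small; a sketch of the test setup using MongoClient (available in newer pymongo releases) instead of Connection:
from pymongo import MongoClient

# MongoClient acknowledges writes by default, so the remove and insert below
# have completed on the server before the next statement runs.
client = MongoClient('localhost', 27017)
db = client.urlstats_database
db.http_requests.remove()
db.http_requests.insert({'data0': 'data0'})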
Thanks All.
There is no MongoClient class in the version of pymongo I use, so I was forced to find out what exactly differs.
As soon as I upgrade to 2.2+ I will test whether everything is OK with MongoClient. But as for the Connection class, one can use write concern to control this latency.
In older versions one should create the connection with the corresponding arguments.
I have tried these two: journal=True, safe=True (the journal write concern can't be used in non-safe mode).
j or journal: Block until write operations have been committed to the journal. Ignored if the server is running without journaling. Implies safe=True.
I think this makes performance worse, but for automated tests this should be OK.
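Concretely, based on the Connection documentation quoted above, that would look like the sketch below; treat the exact keyword arguments as an assumption if your pymongo version differs:
from pymongo import Connection

# safe=True waits for server acknowledgement; j=True additionally waits for
# the journal commit (and implies safe=True, per the quoted docs).
client = Connection('localhost', 27017, safe=True, j=True)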

Python Linux route table lookup

I posted Python find first network hop about trying to find the first hop, and the more I thought about it, the easier it seemed to just process the routing table in Python. I'm not a programmer, I don't know what I'm doing. :p
This is what I came up with. The first issue I noticed is that the loopback interface doesn't show up in the /proc/net/route file - so evaluating 127.0.0.0/8 will give you the default route... for my application, that doesn't matter.
Anything else major I'm overlooking? Is parsing ip route get <ip> still a better idea?
import re
import struct
import socket

'''
Read all the routes into a list. Most specific first.
# eth0 000219AC 04001EAC 0003 0 0 0 00FFFFFF ...
'''
def _RtTable():
    _rt = []
    rt_m = re.compile(r'^[a-z0-9]*\W([0-9A-F]{8})\W([0-9A-F]{8})[\W0-9]*([0-9A-F]{8})')
    rt = open('/proc/net/route', 'r')
    for line in rt.read().split('\n'):
        if rt_m.match(line):
            _rt.append(rt_m.findall(line)[0])
    rt.close()
    return _rt

'''
Create a temp ip (tip) that is the entered ip with the host
section stripped off. Matching to routes in order,
the first match should be the most specific.
If we get 0.0.0.0 as the next hop, the network is likely(?)
directly attached - the entered IP is the next (only) hop
'''
def FindGw(ip):
    int_ip = struct.unpack("I", socket.inet_aton(ip))[0]
    for entry in _RtTable():
        tip = int_ip & int(entry[2], 16)
        if tip == int(entry[0], 16):
            gw_s = socket.inet_ntoa(struct.pack("I", int(entry[1], 16)))
            if gw_s == '0.0.0.0':
                return ip
            else:
                return gw_s

if __name__ == '__main__':
    import sys
    print FindGw(sys.argv[1])
In the man page of the proc file system it is given that:
/proc/net
Various net pseudo-files, all of which give the status of some part of the networking layer. These files contain ASCII structures and are, therefore, readable with cat(1).
However, the standard netstat(8) suite provides much cleaner access to these files.
Just rely on the tools designed for those purposes. Use netstat, traceroute or any other standard tool. Wrap those commands cleanly using subprocess module and get the information for what you are looking for.
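If you do go the tool-wrapping route, a minimal sketch of wrapping ip route get with subprocess (it assumes the usual iproute2 output format, e.g. "8.8.8.8 via 192.168.1.1 dev eth0 ..."):
import subprocess

def find_gw(ip):
    # Ask the kernel for the route and pull the token after "via", if any.
    out = subprocess.check_output(['ip', 'route', 'get', ip]).decode()
    fields = out.split()
    if 'via' in fields:
        return fields[fields.index('via') + 1]
    return ip  # no "via": directly attached, so the target itself is the next hop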
With pyroute2.IPRoute, get the next hop on the way to some distant host, here — 8.8.8.8:
from pyroute2 import IPRoute

with IPRoute() as ipr:
    print(ipr.route('get', dst='8.8.8.8'))

What Python way would you suggest to check whois database records?

I'm trying to get a webservice up and running that actually requires to check whois databases. What I'm doing right now is ugly and I'd like to avoid it as much as I can: I call gwhois command and parse its output. Ugly.
I did some searching to try to find a pythonic way to do this task. Generally I got pretty much nothing - this old discussion list link has a way to check if a domain exists. Not quite what I was looking for... But still, it was the best answer Google gave me - everything else is just a bunch of unanswered questions.
Have any of you succeeded in getting some method up and running? I'd very much appreciate some tips, or should I just do it the open-source way, sit down and code something by myself? :)
Found this question in the process of my own search for a python whois library.
Don't know that I agree with cdleary's answer that using a library which wraps
a command is always the best way to go - but I can see his reasons why he said this.
Pro: the cmd-line whois handles all the hard work (socket calls, parsing, etc.)
Cons: not portable; the module may not work depending on the underlying whois command.
Slower, since it runs a command (and most likely a shell) in addition to the whois command.
Affected if not UNIX (Windows), a different UNIX, an older UNIX, or
an older whois command.
I am looking for a whois module that can handle whois IP lookups and I am not interested in coding my own whois client.
Here are the modules that I (lightly) tried out and more information about it:
UPDATE from 2022: I would search pypi for newer whois libraries. NOTE - some APIs are specific to online paid services
pywhoisapi:
Home: http://code.google.com/p/pywhoisapi/
Last Updated: 2011
Design: REST client accessing ARIN whois REST service
Pros: Able to handle IP address lookups
Cons: Able to pull information from whois servers of other RIRs?
BulkWhois
Home: http://pypi.python.org/pypi/BulkWhois/0.2.1
Last Updated: fork https://github.com/deontpearson/BulkWhois in 2020
Design: telnet client accessing whois telnet query interface from RIR(?)
Pros: Able to handle IP address lookups
Cons: Able to pull information from whois servers of other RIRs?
pywhois:
Home: http://code.google.com/p/pywhois/
Last Updated: 2010 (no forks found)
Design: REST client accessing RRID whois services
Pros: Accesses many RRIDs; has python 3.x branch
Cons: does not seem to handle IP address lookups
python-whois (wraps whois command):
Home: http://code.google.com/p/python-whois/
Last Updated: 2022-11 https://github.com/DannyCork/python-whois
Design: wraps "whois" command
Cons: does not seem to handle IP address lookups
whois:
Home: https://github.com/richardpenman/whois (https://pypi.org/project/python-whois/)
Last Updated: 2022-12
Design: port of whois.c from Apple
Cons: TODO
whoisclient - fork of python-whois
Home: http://gitorious.org/python-whois
Last Updated: (home website no longer valid)
Design: wraps "whois" command
Depends on: IPy.py
Cons: does not seem to handle IP address lookups
Update: I ended up using pywhoisapi for the reverse IP lookups that I was doing
Look at this:
http://code.google.com/p/pywhois/
pywhois - Python module for retrieving WHOIS information of domains
Goal:
- Create a simple importable Python module which will produce parsed WHOIS data for a given domain.
- Able to extract data for all the popular TLDs (com, org, net, ...)
- Query a WHOIS server directly instead of going through an intermediate web service like many others do.
- Works with Python 2.4+ and no external dependencies
Example:
>>> import pywhois
>>> w = pywhois.whois('google.com')
>>> w.expiration_date
['14-sep-2011']
>>> w.emails
['contact-admin@google.com',
'dns-admin@google.com',
'dns-admin@google.com',
'dns-admin@google.com']
>>> print w
...
There's nothing wrong with using a command line utility to do what you want. If you put a nice wrapper around the service, you can implement the internals however you want! For example:
class Whois(object):
    _whois_by_query_cache = {}

    def __init__(self, query):
        """Initializes the instance variables to defaults. See :meth:`lookup`
        for details on how to submit the query."""
        self.query = query
        self.domain = None
        # ... other fields.

    def lookup(self):
        """Submits the `whois` query and stores results internally."""
        # ... implementation
Now, whether or not you roll your own using urllib, wrap around a command line utility (like you're doing), or import a third party library and use that (like you're saying), this interface stays the same.
This approach is generally not considered ugly at all -- sometimes command utilities do what you want and you should be able to leverage them. If speed ends up being a bottleneck, your abstraction makes the process of switching to a native Python implementation transparent to your client code.
Practicality beats purity -- that's what's Pythonic. :)
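For instance, lookup() could shell out to the system whois and stash the raw text; a hedged sketch (the raw_text attribute and the lack of parsing are illustrative choices, not an established API):
import subprocess

class Whois(object):
    def __init__(self, query):
        self.query = query
        self.raw_text = None  # illustrative field: unparsed server response

    def lookup(self):
        # Wrap the command-line client; parsing raw_text is left to the caller.
        self.raw_text = subprocess.check_output(
            ['whois', self.query]).decode('utf-8', errors='replace')
        return self.raw_text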
Here is the whois client re-implemented in Python:
http://code.activestate.com/recipes/577364-whois-client/
I don't know if gwhois does something special with the server output; however, you can plainly connect to the whois server on port whois (43), send your query, read all the data in the reply and parse them. To make life a little easier, you could use the telnetlib.Telnet class (even if the whois protocol is much simpler than the telnet protocol) instead of plain sockets.
The tricky parts:
which whois server will you ask? RIPE, ARIN, APNIC, LACNIC, AFRINIC, JPNIC, VERIO, etc.? LACNIC could be a useful fallback, since they tend to reply with useful data to requests outside of their domain.
what are the exact options and arguments for each whois server? some offer help, others don't. In general, plain domain names work without any special options.
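As a concrete illustration of the plain-socket approach described above, a minimal query function; whois.verisign-grs.com is just one example server (for .com domains), and the timeout is arbitrary:
import socket

def whois_query(domain, server='whois.verisign-grs.com', port=43):
    # Open a TCP connection to the whois port, send the query, read the full reply.
    with socket.create_connection((server, port), timeout=10) as sock:
        sock.sendall((domain + '\r\n').encode('ascii'))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b''.join(chunks).decode('utf-8', errors='replace')

print(whois_query('example.com'))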
Another way to do it is to use the urllib2 module to parse some other page's whois service (many sites like that exist). But that seems like even more of a hack than what you do now, and would give you a dependency on whatever whois site you chose, which is bad.
I hate to say it, but unless you want to re-implement whois in your program (which would be re-inventing the wheel), running whois on the OS and parsing the output (ie what you are doing now) seems like the right way to do it.
Parsing another webpage wouldn't be as bad (assuming their HTML wouldn't be very bad), but it would actually tie me to them - if they're down, I'm down :)
Actually I found some old project on SourceForge: rwhois.py. What scares me a bit is that their last update is from 2003. But it might be a good place to start reimplementing what I do right now... Well, I felt obliged to post the link to this project anyway, just for further reference.
Here is a ready-to-use solution that works for me; written for Python 3.1 (when backporting to Py2.x, take special care of the bytes / Unicode text distinctions). Your single point of access is the method DRWHO.whois(), which expects a domain name to be passed in; it will then try to resolve the name using the provider configured as DRWHO.whois_providers[ '*' ] (a more complete solution could differentiate providers according to the top level domain). DRWHO.whois() will return a dictionary with a single entry text, which contains the response text sent back by the WHOIS server. Again, a more complete solution would then try and parse the text (which must be done separately for each provider, as there is no standard format) and return a more structured format (e.g., set a flag available which specifies whether or not the domain looks available). Have fun!
##########################################################################
import asyncore as _sys_asyncore
from asyncore import loop as _sys_asyncore_loop
import socket as _sys_socket


##########################################################################
class _Whois_request( _sys_asyncore.dispatcher_with_send, object ):
    # simple whois requester
    # original code by Frederik Lundh
    #-----------------------------------------------------------------------
    whoisPort = 43
    #-----------------------------------------------------------------------
    def __init__(self, consumer, host, provider ):
        _sys_asyncore.dispatcher_with_send.__init__(self)
        self.consumer = consumer
        self.query = host
        self.create_socket( _sys_socket.AF_INET, _sys_socket.SOCK_STREAM )
        self.connect( ( provider, self.whoisPort, ) )
    #-----------------------------------------------------------------------
    def handle_connect(self):
        self.send( bytes( '%s\r\n' % ( self.query, ), 'utf-8' ) )
    #-----------------------------------------------------------------------
    def handle_expt(self):
        self.close() # connection failed, shutdown
        self.consumer.abort()
    #-----------------------------------------------------------------------
    def handle_read(self):
        # get data from server
        self.consumer.feed( self.recv( 2048 ) )
    #-----------------------------------------------------------------------
    def handle_close(self):
        self.close()
        self.consumer.close()


##########################################################################
class _Whois_consumer( object ):
    # original code by Frederik Lundh
    #-----------------------------------------------------------------------
    def __init__( self, host, provider, result ):
        self.texts_as_bytes = []
        self.host = host
        self.provider = provider
        self.result = result
    #-----------------------------------------------------------------------
    def feed( self, text ):
        self.texts_as_bytes.append( text.strip() )
    #-----------------------------------------------------------------------
    def abort(self):
        del self.texts_as_bytes[:]
        self.finalize()
    #-----------------------------------------------------------------------
    def close(self):
        self.finalize()
    #-----------------------------------------------------------------------
    def finalize( self ):
        # join bytestrings and decode them (with a guessed encoding):
        text_as_bytes = b'\n'.join( self.texts_as_bytes )
        self.result[ 'text' ] = text_as_bytes.decode( 'utf-8' )


##########################################################################
class DRWHO:
    #-----------------------------------------------------------------------
    whois_providers = {
        '~isa': 'DRWHO/whois-providers',
        '*':    'whois.opensrs.net', }
    #-----------------------------------------------------------------------
    def whois( self, domain ):
        R = {}
        provider = self._get_whois_provider( '*' )
        self._fetch_whois( provider, domain, R )
        return R
    #-----------------------------------------------------------------------
    def _get_whois_provider( self, top_level_domain ):
        providers = self.whois_providers
        R = providers.get( top_level_domain, None )
        if R is None:
            R = providers[ '*' ]
        return R
    #-----------------------------------------------------------------------
    def _fetch_whois( self, provider, domain, pod ):
        #.....................................................................
        consumer = _Whois_consumer( domain, provider, pod )
        request = _Whois_request( consumer, domain, provider )
        #.....................................................................
        _sys_asyncore_loop() # loops until requests have been processed

#=========================================================================
DRWHO = DRWHO()

domain = 'example.com'
whois = DRWHO.whois( domain )
print( whois[ 'text' ] )
import socket
socket.gethostbyname_ex('url.com')
If it returns a gaierror, you know it's not registered with any DNS.
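In code, that check might look like this; note it only shows whether the name currently resolves in DNS, which is weaker evidence than a whois lookup:
import socket

def resolves_in_dns(name):
    # A gaierror means the name does not currently resolve in DNS.
    try:
        socket.gethostbyname_ex(name)
        return True
    except socket.gaierror:
        return False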
