How can I decouple the two components of my Python application?

I'm trying to learn Python development, and I have been reading about architectural patterns and code design, as I want to stop hacking and really develop. I am implementing a web crawler, and I know it has a problematic structure, as you'll see, but I don't know how to fix it.
The crawlers will return a list of operations that insert data into a MongoDB instance.
This is the general structure of my application:
Spiders/
    crawlers.py
    connections.py
    utils.py
    __init__.py
crawlers.py implements a Crawler base class, and each specific crawler inherits from it. Each Crawler has an attribute table_name and a method crawl.
In connections.py, I implemented a pymongo driver to connect to the DB. Its write method expects a crawler as a parameter. Now here comes the tricky part: crawler2 depends on the results of crawler1, so I end up with something like this:
from pymongo import InsertOne

class Crawler1(Crawler):
    def __init__(self):
        super().__init__('Crawler 1', 'table_A')

    def crawl(self):
        ...  # crawl, then return a list of InsertOne operations

class Crawler2(Crawler):
    def __init__(self):
        super().__init__('Crawler 2', 'table_B')

    def crawl(self, list_of_codes):
        ...  # crawl the list of codes/links, then return a list of InsertOne operations
Then, in connections.py, I create a class that expects a crawler:
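from pymongo import MongoClient

class MongoDriver:
    def __init__(self):
        self.db = MongoClient(...)

    def write(self, crawler, **kwargs):
        # the driver calls crawl() itself, so it must know the crawler's interface
        self.db[crawler.table_name].bulk_write(crawler.crawl(**kwargs))

    def get_list_of_codes(self):
        query = {}
        return [x['field'] for x in self.db.find(query)]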
And so, here comes the biggest problem (I think there are many others, some of which I can barely grasp and some that I'm still totally blind to): the implementation of my connections needs context from the crawler! For example:
mongo_driver = MongoDriver()
crawler1 = Crawler1()
crawler2 = Crawler2()
mongo_driver.write(crawler1)
mongo_driver.write(crawler2, list_of_codes=mongo_driver.get_list_of_codes())
How would one go about solving it? And what else is particularly worrisome in this construct? Thanks for the feedback!

Problem 1: MongoDriver knows too much about your crawlers. You should separate the driver from crawler1 and crawler2. I'm not sure what your crawl function returns, but I assume it is a list of objects of type A.
You could use an object such as CrawlerService to manage the dependency between MongoDriver and Crawler. This will separate the driver's write responsibility from the crawler's responsibility to crawl. The service will also manage the order of operations which in some cases may be considered good enough.
from typing import List

class Repository:
    def write(self, for_table: str, objects: 'List[A]'):
        self.db[for_table].bulk_write(objects)

class CrawlerService:
    def __init__(self, repository: Repository, crawlers: List[Crawler]):
        ...  # keep repository and crawlers on self

    def crawl(self):
        crawler1, crawler2 = self.crawlers
        # the service, not the driver, decides what is written and in which order
        result = self.repository.write(crawler1.table_name, crawler1.crawl())
        ...  # work with crawler2 and result
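To make the service idea concrete, here is a minimal runnable sketch. It is only an illustration under assumptions: InMemoryRepository stands in for the MongoDB-backed repository, and StubCrawler1, StubCrawler2 and the 'code' field are hypothetical names that do not come from the question.

```python
from typing import Dict, List


class InMemoryRepository:
    """Hypothetical stand-in for the MongoDB-backed Repository."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[dict]] = {}

    def write(self, for_table: str, objects: List[dict]) -> None:
        self.tables.setdefault(for_table, []).extend(objects)

    def find(self, for_table: str) -> List[dict]:
        return list(self.tables.get(for_table, []))


class StubCrawler1:
    table_name = 'table_A'

    def crawl(self) -> List[dict]:
        # Pretend these records were scraped from the web.
        return [{'code': 1}, {'code': 2}]


class StubCrawler2:
    table_name = 'table_B'

    def crawl(self, list_of_codes: List[int]) -> List[dict]:
        return [{'code': c, 'detail': 'page for %d' % c} for c in list_of_codes]


class CrawlerService:
    """Owns the order of operations; crawlers and repository never see each other."""

    def __init__(self, repository, crawler1, crawler2) -> None:
        self.repository = repository
        self.crawler1 = crawler1
        self.crawler2 = crawler2

    def crawl(self) -> None:
        # Step 1: store the first crawler's results.
        self.repository.write(self.crawler1.table_name, self.crawler1.crawl())
        # Step 2: read the codes back and hand them to the second crawler.
        codes = [row['code'] for row in self.repository.find(self.crawler1.table_name)]
        self.repository.write(self.crawler2.table_name, self.crawler2.crawl(codes))


repository = InMemoryRepository()
CrawlerService(repository, StubCrawler1(), StubCrawler2()).crawl()
```

With this split, the repository only knows table names and records, and the crawlers only know how to crawl. Swapping the in-memory repository for a real one should not require touching the crawlers.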
Problem 2: Crawler1 and Crawler2 are pretty much the same; they differ only in what the crawl function does. Following the DRY principle, you can extract the crawling algorithms into strategy objects and have a single Crawler depend on one of them through composition.
from abc import ABC, abstractmethod
from typing import List

class CrawlStrategy(ABC):
    @abstractmethod
    def crawl(self) -> 'List[A]':
        pass

class CrawlStrategyA(CrawlStrategy):
    def crawl(self) -> 'List[A]':
        ...

class CrawlStrategyB(CrawlStrategy):
    def __init__(self, codes: List[int]):
        self.__codes = codes

    def crawl(self) -> 'List[A]':
        ...

class Crawler(ABC):
    def __init__(self, name: str, strategy: 'CrawlStrategy'):
        self.__name = name
        self.__strategy = strategy

    def crawl(self) -> 'List[A]':
        # delegate the actual crawling to the injected strategy
        return self.__strategy.crawl()
By doing this the structure of Crawler (e.g. table name etc.) exists in only one place, and you can extend it later on.
Problem 3: From here, you have multiple approaches to improve the overall design. You can remove the CrawlerService by creating new strategies which depend on your database connection. To express that one strategy depends on another (e.g. crawler1 produces results for crawler2), you can compose two strategies together, such as:
class StrategyA(Strategy):
    def __init__(self, other: Strategy, database: DB):
        self.__other = other
        self.__db = database

    def crawl(self) -> 'List[A]':
        result = self.__other.crawl()
        self.__db.write(result)
        xs = self.__db.find(...)
        # do something with xs
        ...
Of course, this is a simplified example, but it removes the need for a single mediator between your database connection and your crawlers, and it provides more flexibility. The overall design is also easier to test: you only have to unit test the strategy objects, and because the database connection is injected (DI), you can easily replace it with a mock.
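As a rough illustration of that testability claim, the sketch below composes two strategies and injects a fake database. SeedStrategy, FollowUpStrategy, FakeDatabase and the record fields are hypothetical names chosen for the example, not part of the original design.

```python
from abc import ABC, abstractmethod
from typing import List


class CrawlStrategy(ABC):
    @abstractmethod
    def crawl(self) -> List[dict]:
        """Return the records produced by this crawl."""


class SeedStrategy(CrawlStrategy):
    # Hypothetical first stage: produces the codes to follow up on.
    def crawl(self) -> List[dict]:
        return [{'code': 10}, {'code': 20}]


class FollowUpStrategy(CrawlStrategy):
    # Hypothetical second stage: runs the first stage, stores its result,
    # then crawls whatever the database reports back.
    def __init__(self, other: CrawlStrategy, database) -> None:
        self._other = other
        self._db = database

    def crawl(self) -> List[dict]:
        self._db.write(self._other.crawl())
        return [{'code': row['code'], 'visited': True} for row in self._db.find()]


class FakeDatabase:
    # Test double injected in place of a real connection.
    def __init__(self) -> None:
        self.rows: List[dict] = []

    def write(self, rows: List[dict]) -> None:
        self.rows.extend(rows)

    def find(self) -> List[dict]:
        return list(self.rows)


fake_db = FakeDatabase()
result = FollowUpStrategy(SeedStrategy(), fake_db).crawl()
```

Because FollowUpStrategy only relies on write and find, any object with those two methods can be injected, which is what keeps the unit test free of a running MongoDB.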
From this point, the next steps for improving the overall design depend a lot on the implementation's complexity and on how much flexibility you need in general.
PS: The Strategy pattern is not the only option. Depending on how many crawlers you have and how they are structured, the Decorator pattern may be a better fit.

Related

Implementing luigi dynamic graph configuration

I am new to Luigi; I came across it while designing a pipeline for our ML efforts. Though it wasn't fitted to my particular use case, it had so many extra features that I decided to make it fit.
Basically, I was looking for a way to persist a custom-built pipeline, so that its results are repeatable and it is easier to deploy. After reading most of the online tutorials, I tried to implement my serialization using the existing luigi.cfg configuration and command-line mechanisms. That might have sufficed for the tasks' parameters, but it provided no way of serializing the DAG connectivity of my pipeline. So I decided to write a WrapperTask that receives a JSON config file, creates all the task instances, and connects all the input and output channels of the Luigi tasks (that is, does all the plumbing).
I hereby enclose a small test program for your scrutiny:
import random
import luigi
import time
import os

class TaskNode(luigi.Task):
    i = luigi.IntParameter()  # node ID

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.required = []

    def set_required(self, required=None):
        self.required = required  # set the dependencies
        return self

    def requires(self):
        return self.required

    def output(self):
        return luigi.LocalTarget('{0}{1}.txt'.format(self.__class__.__name__, self.i))

    def run(self):
        with self.output().open('w') as outfile:
            outfile.write('inside {0}{1}\n'.format(self.__class__.__name__, self.i))
        self.process()

    def process(self):
        raise NotImplementedError(self.__class__.__name__ + " must implement this method")

class FastNode(TaskNode):
    def process(self):
        time.sleep(1)

class SlowNode(TaskNode):
    def process(self):
        time.sleep(2)

# This WrapperTask builds all the nodes
class All(luigi.WrapperTask):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        num_nodes = 513
        classes = TaskNode.__subclasses__()
        self.nodes = []
        for i in reversed(range(num_nodes)):
            cls = random.choice(classes)
            dependencies = random.sample(self.nodes, (num_nodes - i) // 35)
            obj = cls(i=i)
            if dependencies:
                obj.set_required(required=dependencies)
            else:
                obj.set_required(required=None)
            # delete existing output causing a build all
            if obj.output().exists():
                obj.output().remove()
            self.nodes.append(obj)

    def requires(self):
        return self.nodes

if __name__ == '__main__':
    luigi.run()
So, as stated in the question's title, this focuses on dynamic dependencies. It generates a 513-node dependency DAG with a connectivity probability of p=1/35. It also defines the All class (as in "make all") as a WrapperTask that requires all nodes to be built before it is considered done. (I have a version that connects it only to the heads of the connected DAG components, but I didn't want to overcomplicate things.)
Is there a more standard (Luigic) way of implementing this? Note especially the not-so-pretty complication with TaskNode's __init__ and set_required methods. I only did it this way because receiving parameters in __init__ somehow clashes with the way Luigi registers parameters. I also tried several other ways, but this was the most decent one that worked.
If there isn't a standard way, I'd still love to hear any insights you have on my approach before I finish implementing the framework.
I answered a similar question yesterday with a demo, which I based almost entirely on the example in the docs. In the docs, assigning dynamic dependencies by yielding tasks seems to be the preferred way.
luigi.Config and dynamic dependencies can probably give you a pipeline of almost infinite flexibility. The docs also describe a dummy Task that calls multiple dependency chains, which could give you even more control.
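Independently of Luigi, the JSON-driven wiring described in the question can be prototyped in plain Python. The sketch below is not Luigi code: the config format, the Node classes and both helper functions are assumptions made for illustration. It creates every node first, wires dependencies in a second pass, and then computes a build order.

```python
import json

# Hypothetical config format: each node has an id, a class name,
# and the ids of the nodes it requires.
CONFIG = '''
[
  {"id": 0, "cls": "FastNode", "requires": []},
  {"id": 1, "cls": "SlowNode", "requires": [0]},
  {"id": 2, "cls": "FastNode", "requires": [0, 1]}
]
'''


class Node:
    def __init__(self, i):
        self.i = i
        self.required = []

    def __repr__(self):
        return '{0}{1}'.format(self.__class__.__name__, self.i)


class FastNode(Node):
    pass


class SlowNode(Node):
    pass


def build_graph(config_text):
    """Create every node first, then wire dependencies in a second pass."""
    registry = {cls.__name__: cls for cls in Node.__subclasses__()}
    specs = json.loads(config_text)
    nodes = {spec['id']: registry[spec['cls']](spec['id']) for spec in specs}
    for spec in specs:
        nodes[spec['id']].required = [nodes[dep] for dep in spec['requires']]
    return nodes


def build_order(nodes):
    """Depth-first ordering: each node appears after its dependencies.

    This sketch assumes the config is acyclic and does not detect cycles.
    """
    order, seen = [], set()

    def visit(node):
        if node.i in seen:
            return
        seen.add(node.i)
        for dep in node.required:
            visit(dep)
        order.append(node)

    for node in nodes.values():
        visit(node)
    return order


nodes = build_graph(CONFIG)
order = [repr(n) for n in build_order(nodes)]
```

The two-pass construction is the part that carries over: since all instances exist before any wiring happens, the config may list nodes in any order.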

Can an association class be implemented in Python?

I have just started learning software development, and I am modelling my system in a UML class diagram. I am unsure how I would implement this in code.
To keep things simple, let’s assume the following example:
There is a Room class and a Guest class with the association Room(0..*)-Guest(0..*), and an association class RoomBooking, which contains the booking details. How would I model this in Python if my system needs to see all room bookings made by a particular guest?
Most Python applications developed from a UML design are backed by a relational database, usually via an ORM. In which case your design is pretty trivial: your RoomBooking is a table in the database, and the way you look up all RoomBooking objects for a given Guest is just an ORM query. Keeping it vague rather than using a particular ORM syntax, something like this:
bookings = RoomBooking.select(Guest=guest)
With an RDBMS but no ORM, it's not much different. Something like this:
sql = 'SELECT Room, Guest, Charge, Paid FROM RoomBooking WHERE Guest = ?'
cur = db.execute(sql, (guest.id,))
bookings = [RoomBooking(*row) for row in cur]
And this points to what you'd do if you're not using a RDBMS: any relation that would be stored as a table with a foreign key is instead stored as some kind of dict in memory.
For example, you might have a dict mapping guests to sets of room bookings:
bookings = guest_booking[guest]
Or, alternatively, if you don't have a huge number of hotels, you might have this mapping implicit, with each hotel having a 1-to-1 mapping of guests to bookings:
bookings = [hotel.bookings[guest] for hotel in hotels]
Since you're starting off with UML, you're probably thinking in strict OO terms, so you'll want to encapsulate this dict in some class, behind some mutator and accessor methods, so you can ensure that you don't accidentally break any invariants.
There are a few obvious places to put it—a BookingManager object makes sense for the guest-to-set-of-bookings mapping, and the Hotel itself is such an obvious place for the per-hotel-guest-to-booking that I used it without thinking above.
But another place to put it, which is closer to the ORM design, is in a class attribute on the RoomBooking type, accessed by classmethods. This also allows you to extend things if you later need to, e.g., look things up by hotel—you'd then put two dicts as class attributes, and ensure that a single method always updates both of them, so you know they're always consistent.
So, let's look at that:
import collections

class RoomBooking:
    guest_mapping = collections.defaultdict(set)
    hotel_mapping = collections.defaultdict(set)

    def __init__(self, guest, room):
        self.guest, self.room = guest, room

    @classmethod
    def find_by_guest(cls, guest):
        return cls.guest_mapping[guest]

    @classmethod
    def find_by_hotel(cls, hotel):
        return cls.hotel_mapping[hotel]

    @classmethod
    def add_booking(cls, guest, room):
        booking = cls(guest, room)
        # update both mappings together so they stay consistent
        cls.guest_mapping[guest].add(booking)
        cls.hotel_mapping[room.hotel].add(booking)
Of course your Hotel instance probably needs to add the booking as well, so it can raise an exception if two different bookings cover the same room on overlapping dates, whether that happens in RoomBooking.add_booking, or in some higher-level function that calls both Hotel.add_booking and RoomBooking.add_booking.
And if this is multi-threaded (which seems like a good possibility, given that you're heading this far down the Java-inspired design path), you'll need a big lock, or a series of fine-grained locks, around the whole transaction.
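A minimal sketch of the "one big lock" option follows. It simplifies the class above by passing the hotel directly instead of a room, and the book_many helper, guest names and hotel name are made up for the demonstration.

```python
import collections
import threading


class RoomBooking:
    guest_mapping = collections.defaultdict(set)
    hotel_mapping = collections.defaultdict(set)
    _lock = threading.Lock()  # one coarse lock guarding both mappings

    def __init__(self, guest, hotel):
        self.guest, self.hotel = guest, hotel

    @classmethod
    def add_booking(cls, guest, hotel):
        booking = cls(guest, hotel)
        # Both mappings change inside the same critical section,
        # so no reader holding the lock can see them out of sync.
        with cls._lock:
            cls.guest_mapping[guest].add(booking)
            cls.hotel_mapping[hotel].add(booking)
        return booking


def book_many(guest, count):
    # Hypothetical workload: one guest making many bookings.
    for _ in range(count):
        RoomBooking.add_booking(guest, 'Grand Hotel')


threads = [
    threading.Thread(target=book_many, args=('guest-%d' % n, 100))
    for n in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A single lock is the simplest choice and is usually fine for a small in-memory store. Finer-grained locks only pay off if contention on the mappings becomes measurable.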
For persistence, you probably want to store these mappings along with the public objects. But for a small enough data set, or for a server that rarely restarts, it might be simpler to just persist the public objects, and rebuild the mappings at load time by doing a bunch of add_booking calls as part of the load process.
If you want to make it even more ORM-style, you can have a single find method that takes keyword arguments and manually executes a "query plan" in a trivial way:
    @classmethod
    def find(cls, guest=None, hotel=None):
        if guest is None and hotel is None:
            return {booking for bookings in cls.guest_mapping.values()
                    for booking in bookings}
        elif hotel is None:
            return cls.guest_mapping[guest]
        elif guest is None:
            return cls.hotel_mapping[hotel]
        else:
            return {booking for booking in cls.guest_mapping[guest]
                    if booking.room.hotel == hotel}
But this is already pushing things to the point where you might want to go back and ask whether you were right to not use an ORM in the first place. If that sounds ridiculously heavy duty for your simple toy app, take a look at sqlite3 for the database (which comes with Python, and which takes less work to use than coming up with a way to pickle or JSON-serialize all your data for persistence) and SQLAlchemy for the ORM. There's not much of a learning curve, and not much runtime overhead or coding-time boilerplate.
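For a sense of how little code the sqlite3 route takes, here is a small sketch using an in-memory database. The table layout, column names and sample rows are assumptions for illustration only; the point is that the association class becomes a table with two foreign keys.

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.executescript('''
    CREATE TABLE Guest (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE Room (id INTEGER PRIMARY KEY, number TEXT NOT NULL);
    -- The association class: one row per booking, keyed to both sides.
    CREATE TABLE RoomBooking (
        room_id INTEGER NOT NULL REFERENCES Room(id),
        guest_id INTEGER NOT NULL REFERENCES Guest(id),
        check_in TEXT NOT NULL
    );
''')
db.executemany('INSERT INTO Guest (id, name) VALUES (?, ?)',
               [(1, 'Joe'), (2, 'Ann')])
db.executemany('INSERT INTO Room (id, number) VALUES (?, ?)',
               [(1, '101'), (2, '102')])
db.executemany(
    'INSERT INTO RoomBooking (room_id, guest_id, check_in) VALUES (?, ?, ?)',
    [(1, 1, '2024-01-05'), (2, 1, '2024-02-10'), (1, 2, '2024-03-01')],
)


def bookings_for_guest(connection, guest_id):
    """Return (room number, check-in date) pairs for one guest."""
    sql = '''
        SELECT Room.number, RoomBooking.check_in
        FROM RoomBooking JOIN Room ON Room.id = RoomBooking.room_id
        WHERE RoomBooking.guest_id = ?
        ORDER BY RoomBooking.check_in
    '''
    # Note the one-element tuple: (guest_id,) rather than (guest_id).
    return connection.execute(sql, (guest_id,)).fetchall()


joe_bookings = bookings_for_guest(db, 1)
```

Replacing ':memory:' with a file path gives persistence without any extra serialization code.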
Sure, you can implement it in Python, but there is no single way to do it. Quite often you have a database layer where the association class is a table with two foreign keys (in your case, to the primary keys of Room and Guest). To search, you would just write the corresponding SQL query. If you want to cache this table in memory, you could code it like this (or similarly) with an associative array:
from collections import defaultdict

class Room():
    def __init__(self, num):
        self.room_number = num

    def key(self):
        return str(self.room_number)

class Guest():
    def __init__(self, name):
        self.name = name

    def key(self):
        return self.name

def nested_dict(n, type):
    if n == 1:
        return defaultdict(type)
    else:
        return defaultdict(lambda: nested_dict(n-1, type))

room_booking = nested_dict(2, str)

class Room_Booking():
    def __init__(self, date):
        self.date = date

room1 = Room(1)
guest1 = Guest("Joe")
room_booking[room1.key()][guest1.key()] = Room_Booking("some date")
print(room_booking[room1.key()][guest1.key()])

Calling mapPartitions on a class method in python

I'm fairly new to Python and to Spark but let me see if I can explain what I am trying to do.
I have a bunch of different types of pages that I want to process. I created a base class for all the common attributes of those pages and then have a page specific class inherit from the base class. The idea being that the spark runner will be able to do the exact thing for all pages by changing just the page type when called.
Runner
def CreatePage(pageType):
    if pageType == "Foo":
        return PageFoo(pageType)
    elif pageType == "Bar":
        return PageBar(pageType)

def Main(pageType):
    page = CreatePage(pageType)
    pageList_rdd = sc.parallelize(page.GetPageList())
    return pageList_rdd.mapPartitions(lambda pageNumber: CreatePage(pageType).ProcessPage(pageNumber))

print(Main("Foo"))
PageBaseClass.py
class PageBase(object):
    def __init__(self, pageType):
        self.pageType = pageType
        self.dbConnection = None

    def GetDBConnection(self):
        if self.dbConnection is None:
            # Set up a db connection so we can share this amongst all nodes.
            self.dbConnection = DataUtils.MySQL.GetDBConnection()
        return self.dbConnection

    def ProcessPage(self, pageNumber):
        raise NotImplementedError()
PageFoo.py
from PageBaseClass import PageBase

class PageFoo(PageBase):
    def __init__(self, pageType):
        super(PageFoo, self).__init__(pageType)
        self.dbConnection = self.GetDBConnection()

    def ProcessPage(self, pageNumber):
        result = self.dbConnection.cursor("SELECT SOMETHING")
        # other processing
There is a lot of other page-specific functionality that I am omitting for brevity, but the idea is that I'd like to keep all the logic of how to process a page in its page class, and be able to share resources like a db connection and an S3 bucket.
I know that the way that I have it right now, it is creating a new Page object for every item in the rdd. Is there a way to do this so that it is only creating the one object? Is there a better pattern for this? Thanks!
A few suggestions:
Don't create connections directly. Use a connection pool instead, and make sure that connections are automatically closed on timeout. Since each executor uses a separate process, setting the pool size to one is just fine.
Use the Borg pattern to store the pool, and adjust your code to use it to retrieve connections.
You won't be able to share connections between nodes, or even within a single node (see the comment about separate processes above). The best guarantee you can get is a single connection per partition (or per several partitions, with interpreter reuse).
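A minimal sketch of the Borg idea, without Spark or a real database: ConnectionHolder, fake_connect and process_partition are hypothetical names, and a single lazily created connection stands in for a pool of size one.

```python
class ConnectionHolder:
    """Borg: every instance shares one __dict__, so the state is per process."""

    _shared_state = {}

    def __init__(self, connect):
        self.__dict__ = self._shared_state
        # `connect` is an assumed zero-argument factory; only the first
        # instance created in a process registers it.
        if 'connect' not in self.__dict__:
            self.connect = connect
            self.connection = None

    def get(self):
        # Open the connection lazily, on first use in this process.
        if self.connection is None:
            self.connection = self.connect()
        return self.connection


opened = []


def fake_connect():
    # Stand-in for a real driver call; records every connection opened.
    opened.append(object())
    return opened[-1]


def process_partition(page_numbers):
    # The kind of function one would hand to mapPartitions: the holder is
    # looked up once per partition, not once per element.
    connection = ConnectionHolder(fake_connect).get()
    for number in page_numbers:
        yield (number, id(connection))


first = list(process_partition([1, 2]))
second = list(process_partition([3]))
```

Here two "partitions" processed in the same interpreter reuse one connection. In a real cluster each worker process would build its own shared state, which matches the per-process limit described above.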

How do I use TDD to create a database representation of existing objects?

I have used TDD to develop a set of classes in Python. These objects contain data fields, functions and links to each other. Everything functionally works like I want.
Eventually all of this should be stored in a database, to be used in a Django web application.
I have sketched some possible database schemas to hold the same information, but I feel this is a "sudden big leap" compared to the traditional TDD way of developing the rest of the application.
So, now I wonder, which tests should I write to force me to store these objects in a database in a step-by-step TDD way?
Making this question a bit more concrete, the classes are currently like this:
import numpy

class Connector(object):
    def __init__(self, title=None):
        self.value = None
        self.valid = False
        self.title = title
    ...

class Element(object):
    def __init__(self, title=None):
        self.title = title
        self.input_connectors = []
        self.output_connectors = []
        self.number_of_runs = 0

    def run(self):
        ...
        self.number_of_runs += 1

class Average(Element):
    def __init__(self, title=None):
        super(Average, self).__init__(title=title)
        self.src = Connector("source")
        self.avg = Connector("average")
        self.input_connectors.append(self.src)
        self.output_connectors.append(self.avg)

    def run(self):
        super(Average, self).run()
        self.avg.set_value(numpy.average(self.src.value))
I realize some of the data should be in the database, while processing functions should not. I think there should be a table which represents the details of the different "types / subclasses " of Element, while also one which stores actual instances. But, as I said, I don't see how to get there using TDD.
First, ask yourself if you will be testing your code or the Django ORM. Storing and reading will be fine most of the time.
The things you will need to test are validating the data and any properties that are not model fields. I think you should end up with a good schema through writing tests on the next layer up from the database.
Also, use South or some other means of migration to reduce the cost of schema change. That will give you some peace of mind.
If you have several levels of tests (e.g. integration tests), it makes sense to check that the database configuration is intact. Most of the tests do not need to hit the database; you can avoid that by mocking the model or some database operations (at least save()). Having said that, you can check database writes and reads with this simple test:
def test_db_access(self):
    input = Element(title='foo')
    input.save()
    output = Element.objects.get(title='foo')
    self.assertEqual(input, output)
Mocking save:
def save(obj):
    obj.id = 1
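One way to plug such a fake save into a test is unittest.mock.patch.object. The sketch below is an assumption-laden illustration: the Element class here is a plain-Python stand-in rather than a Django model, and its save() raises on purpose to show that the real method is never reached.

```python
import unittest
from unittest import mock


class Element(object):
    # Hypothetical stand-in for a Django model; the real save() would
    # hit the database.
    def __init__(self, title=None):
        self.title = title
        self.id = None

    def save(self):
        raise RuntimeError('database access is not available in unit tests')


def fake_save(obj):
    # Mimic what the database would do on insert: assign a primary key.
    obj.id = 1


class ElementTest(unittest.TestCase):
    def test_save_assigns_id_without_database(self):
        # Swap Element.save for the fake only inside this block.
        with mock.patch.object(Element, 'save', fake_save):
            element = Element(title='foo')
            element.save()
        self.assertEqual(element.id, 1)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ElementTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
```

The patch is undone when the with block exits, so other tests still see the original save().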

Python - passing object references?

I am learning Python (coming from a .NET background) and developing an app which interacts with a web service.
The web service is flat, in that it has numerous calls some of which are related to sessions e.g. logging on etc, whereas other calls are related to retrieving/setting business data.
To accompany the webservice, there are a couple of python classes which wrap all the calls. I am looking to develop a client on top of that class but give the client more OO structure.
The design of my own app was to have a Session-type class which would be responsible for logging on/maintaining the connection etc , but itself would be injected into a Business-type class which is responsible for making all the business calls.
So the stack is something like
WebService (Soap)
WebServiceWrapper (Python)
Session (Python)
Business (Python)
Here's a sample of my code (I've renamed some methods to try and make stuff more explicit)
from webServiceWrapper import webServiceAPI

class Session():
    def __init__(self, user, password):
        self._api = webServiceAPI()
        self.login = self._api.login(user, password)

    def webServiceCalls(self):
        return self._api()

class Business():
    def __init__(self, service):
        self._service=service

    def getBusinessData(self):
        return self._service.get_business_data()
and my unit test
class exchange(unittest.TestCase):
    def setUp(self):
        self.service = Session("username","password")
        self._business = Business(self.service.webServiceCalls())

    def testBusinessReturnsData(self):
        self.assertFalse(self._business.getBusinessData()==None)
The unit test fails on
return self._api()
saying that the underlying class is not callable
TypeError: 'webServiceAPI' is not callable
My first q is, is that the python way? Is the OO thinking which underpins app development with static languages good for dynamic as well? (That's probably quite a big q!)
My second q is that, if this kind of architecture is ok, what am I doing wrong (I guess in terms of passing references to objects in this way)?
Many thx
S
If webServiceAPI is an object, just remove the parentheses, like this:
return self._api
You already created an instance of the object in the constructor.
Maybe add the definition of WebserviceAPI to the question, I can only guess at the moment.
I don't see anything that is wrong or un-Pythonic here. Pythonistas often point out the differences from static languages like Java or C#, but many real-world applications mostly use a simple static class design in Python, too.
I guess that webServiceAPI is not a class, thus it can't be called.
If you are using Python 2.x, always inherit from the object type; otherwise you'll get a “classic class” (a relic from ancient times that is kept for backward compatibility).
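Putting those suggestions together, a corrected version might look like the sketch below. Since the definition of webServiceAPI is not shown in the question, StubWebServiceAPI and its return values are invented placeholders, and the optional api parameter is an added assumption that makes the class easy to test.

```python
class StubWebServiceAPI(object):
    # Hypothetical replacement for webServiceAPI, whose real definition
    # is not shown in the question.
    def login(self, user, password):
        return True

    def get_business_data(self):
        return {'orders': 3}


class Session(object):
    def __init__(self, user, password, api=None):
        # Allow the API object to be injected; fall back to the stub here.
        self._api = api if api is not None else StubWebServiceAPI()
        self.login = self._api.login(user, password)

    def webServiceCalls(self):
        # Return the existing instance; calling it (self._api()) is what
        # raised the TypeError.
        return self._api


class Business(object):
    def __init__(self, service):
        self._service = service

    def getBusinessData(self):
        return self._service.get_business_data()


session = Session('username', 'password')
business = Business(session.webServiceCalls())
data = business.getBusinessData()
```

Injecting the API object also means the unit test no longer needs a live web service.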
