Implementing luigi dynamic graph configuration - python

I am new to Luigi; I came across it while designing a pipeline for our ML efforts. Although it wasn't a perfect fit for my particular use case, it had so many extra features that I decided to make it fit.
Basically, what I was looking for was a way to persist a custom-built pipeline, so that its results are repeatable and easier to deploy. After reading most of the online tutorials, I tried to implement the serialization using the existing luigi.cfg configuration and command-line mechanisms. That might have sufficed for the tasks' parameters, but it provided no way of serializing the DAG connectivity of my pipeline. So I decided to have a WrapperTask that receives a JSON config file, creates all the task instances, and connects all the input/output channels of the Luigi tasks (does all the plumbing).
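For illustration, a minimal sketch of the JSON-driven wiring described above; the config format and the BuildFromConfig class are hypothetical, and it reuses the TaskNode/set_required helpers from the test program below:

import json

import luigi


class BuildFromConfig(luigi.WrapperTask):
    """Hypothetical wrapper that builds a DAG of TaskNode subclasses from JSON."""
    config_path = luigi.Parameter()

    def requires(self):
        # Assumed config format:
        # {"nodes": [{"id": 0, "class": "FastNode", "deps": []},
        #            {"id": 1, "class": "SlowNode", "deps": [0]}]}
        with open(self.config_path) as f:
            spec = json.load(f)
        classes = {cls.__name__: cls for cls in TaskNode.__subclasses__()}
        nodes = {}
        for node in spec['nodes']:
            task = classes[node['class']](i=node['id'])
            task.set_required([nodes[d] for d in node['deps']] or None)
            nodes[node['id']] = task
        return list(nodes.values())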
I hereby enclose a small test program for your scrutiny:
import random
import luigi
import time
import os


class TaskNode(luigi.Task):
    i = luigi.IntParameter()  # node ID

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.required = []

    def set_required(self, required=None):
        self.required = required  # set the dependencies
        return self

    def requires(self):
        return self.required

    def output(self):
        return luigi.LocalTarget('{0}{1}.txt'.format(self.__class__.__name__, self.i))

    def run(self):
        with self.output().open('w') as outfile:
            outfile.write('inside {0}{1}\n'.format(self.__class__.__name__, self.i))
        self.process()

    def process(self):
        raise NotImplementedError(self.__class__.__name__ + " must implement this method")


class FastNode(TaskNode):
    def process(self):
        time.sleep(1)


class SlowNode(TaskNode):
    def process(self):
        time.sleep(2)


# This WrapperTask builds all the nodes
class All(luigi.WrapperTask):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        num_nodes = 513
        classes = TaskNode.__subclasses__()
        self.nodes = []
        for i in reversed(range(num_nodes)):
            cls = random.choice(classes)
            dependencies = random.sample(self.nodes, (num_nodes - i) // 35)
            obj = cls(i=i)
            if dependencies:
                obj.set_required(required=dependencies)
            else:
                obj.set_required(required=None)
            # delete existing output causing a build all
            if obj.output().exists():
                obj.output().remove()
            self.nodes.append(obj)

    def requires(self):
        return self.nodes


if __name__ == '__main__':
    luigi.run()
So, basically, as stated in the question's title, this focuses on the dynamic dependencies: it generates a 513-node dependency DAG with p = 1/35 connectivity probability. It also defines the All (as in make all) class as a WrapperTask that requires all nodes to be built for it to be considered done (I have a version that only connects it to the heads of connected DAG components, but I didn't want to over-complicate things).
Is there a more standard (Luigic) way of implementing this? Especially note the not-so-pretty complication with the TaskNode __init__ and set_required methods. I only did it this way because receiving parameters in the __init__ method clashes somehow with the way Luigi registers parameters; I also tried several other ways, but this was basically the most decent one that worked.
If there isn't a standard way, I'd still love to hear any insights you have on the way I plan to go before I finish implementing the framework.

I answered a similar question yesterday with a demo, based almost entirely on the example in the docs. In the docs, assigning dynamic dependencies by yielding tasks seems to be the preferred way.
luigi.Config and dynamic dependencies can probably give you a pipeline of almost infinite flexibility. They also describe here a dummy Task that triggers multiple dependency chains, which could give you even more control.
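For reference, a minimal sketch of that yielding idiom, with a hypothetical ProcessNode task that chains to its predecessor; this is not your DAG, just the mechanism:

import luigi


class ProcessNode(luigi.Task):
    i = luigi.IntParameter()

    def output(self):
        return luigi.LocalTarget('ProcessNode{}.txt'.format(self.i))

    def run(self):
        # Dynamic dependency: Luigi suspends this task, schedules the yielded
        # requirement, and resumes run() once it is complete.
        if self.i > 0:
            yield ProcessNode(i=self.i - 1)
        with self.output().open('w') as outfile:
            outfile.write('inside ProcessNode{}\n'.format(self.i))


if __name__ == '__main__':
    luigi.build([ProcessNode(i=5)], local_scheduler=True)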

Related

How can I de-couple the two components of my python application

I'm trying to learn Python development, and I have been reading on topics of architectural patterns and code design, as I want to stop hacking and really develop. I am implementing a web crawler, and I know it has a problematic structure, as you'll see, but I don't know how to fix it.
The crawlers return a list of actions to insert data into a MongoDB instance.
This is the general structure of my application:
Spiders/
    crawlers.py
    connections.py
    utils.py
    __init__.py
crawlers.py implements a Crawler base class, and each specific crawler inherits from it. Each Crawler has a table_name attribute and a crawl method.
In connections.py, I implemented a pymongo driver to connect to the DB. It expects a crawler as a parameter to its write method. Now here comes the tricky part... crawler2 depends on the results of crawler1, so I end up with something like this:
from pymongo import InsertOne


class Crawler1(Crawler):
    def __init__(self):
        super().__init__('Crawler 1', 'table_A')

    def crawl(self):
        ...  # returns a list of InsertOne operations


class Crawler2(Crawler):
    def __init__(self):
        super().__init__('Crawler 2', 'table_B')

    def crawl(self, list_of_codes):
        ...  # returns a list of InsertOne operations, after crawling the list of codes/links
and then, in my connections, I create a class that expects a crawler.
from pymongo import MongoClient


class MongoDriver:
    def __init__(self):
        self.db = MongoClient(...)

    def write(self, crawler, **kwargs):
        self.db[crawler.table_name].bulk_write(crawler.crawl(**kwargs))

    def get_list_of_codes(self):
        query = {}
        return [x['field'] for x in self.db['table_A'].find(query)]
and so, here comes the (biggest) problem (because I think there are many others, some of which I can barely grasp and others I'm still totally blind to): the implementation of my connections needs context about the crawler! For example:
mongo_driver = MongoDriver()
crawler1 = Crawler1()
crawler2 = Crawler2()
mongo_driver.write(crawler1)
mongo_driver.write(crawler2, list_of_codes=mongo_driver.get_list_of_codes())
How would one go about solving this? And what else is particularly worrisome in this construct? Thanks for the feedback!
Problem 1: MongoDriver knows too much about your crawlers. You should separate the driver from crawler1 and crawler2. I'm not sure what your crawl function returns, but I assume it is a list of objects of type A.
You could use an object such as CrawlerService to manage the dependency between MongoDriver and Crawler. This will separate the driver's write responsibility from the crawler's responsibility to crawl. The service will also manage the order of operations which in some cases may be considered good enough.
class Repository:
    def write(self, for_table: str, objects: 'List[A]'):
        self.db[for_table].bulk_write(objects)


class CrawlerService:
    def __init__(self, repository: Repository, crawlers: 'List[Crawler]'):
        ...

    def crawl(self):
        crawler1, crawler2 = self.crawlers
        result = [self.repository.write(x) for x in crawler1.crawl()]
        ...  # work with crawler2 and result
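A hypothetical wiring of the pieces above (the constructor bodies are elided in the sketch, so the arguments shown here are assumptions):

repository = Repository(MongoClient(...))
service = CrawlerService(repository, [Crawler1(), Crawler2()])
service.crawl()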
Problem 2: Crawler1 and Crawler2 are pretty much the same; they only differ in how the crawl function is called. Considering the DRY principle, you can separate the crawling algorithms into objects such as strategies and have one Crawler depend on them (via composition).
from abc import ABC, abstractmethod
from typing import List


class CrawlStrategy(ABC):
    @abstractmethod
    def crawl(self) -> 'List[A]':
        pass


class CrawlStrategyA(CrawlStrategy):
    def crawl(self) -> 'List[A]':
        ...


class CrawlStrategyB(CrawlStrategy):
    def __init__(self, codes: List[int]):
        self.__codes = codes

    def crawl(self) -> 'List[A]':
        ...


class Crawler(ABC):
    def __init__(self, name: str, strategy: 'CrawlStrategy'):
        self.__name = name
        self.__strategy = strategy

    def crawl(self) -> 'List[A]':
        return self.__strategy.crawl()
By doing this, the structure of Crawler (e.g. table name etc.) exists in only one place, and you can extend it later on.
Problem 3: From here, you have multiple approaches to improve the overall design. You can remove the CrawlerService by creating new strategies that depend on your database connection. To express that one strategy depends on another (e.g. crawler1 produces results for crawler2), you can compose two strategies, for example:
class StrategyA(CrawlStrategy):
    def __init__(self, other: CrawlStrategy, database: 'DB'):
        self.__other = other
        self.__db = database

    def crawl(self) -> 'List[A]':
        result = self.__other.crawl()
        self.__db.write(result)
        xs = self.__db.find(...)
        ...  # do something with xs
Of course, this is a simplified example, but it removes the need for a single mediator between your database connection and your crawlers, and it provides more flexibility. The overall design is also easier to test, because all you have to do is unit-test the strategy objects (and you can easily mock the database connection for dependency injection).
From this point, the next steps for improving the overall design depend a lot on the implementation's complexity and on how much flexibility you need in general.
PS: You can also try alternatives to the Strategy pattern; depending on how many crawlers you have and their general structure, the Decorator pattern (sketched below) may be a better fit.
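As a rough illustration of that Decorator idea, here is a wrapper that adds behaviour around any crawler while exposing the same interface (table_name and crawl) as in the question; the logging concern is purely illustrative:

class LoggingCrawler:
    """Wraps any crawler and logs how many operations it produced."""

    def __init__(self, wrapped: 'Crawler'):
        self.__wrapped = wrapped
        self.table_name = wrapped.table_name  # preserve the wrapped interface

    def crawl(self, **kwargs) -> 'List[A]':
        ops = self.__wrapped.crawl(**kwargs)
        print('{}: {} operations'.format(self.table_name, len(ops)))
        return ops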

Mocking elasticsearch-py calls

I'm writing a CLI to interact with elasticsearch using the elasticsearch-py library. I'm trying to mock elasticsearch-py functions in order to test my functions without calling my real cluster.
I read this question and this one but I still don't understand.
main.py
Escli inherits from cliff's App class
class Escli(App):
    _es = elasticsearch5.Elasticsearch()
settings.py
from escli.main import Escli
class Settings:
    def get(self, sections):
        raise NotImplementedError()


class ClusterSettings(Settings):
    def get(self, setting, persistency='transient'):
        settings = Escli._es.cluster \
            .get_settings(include_defaults=True, flat_settings=True) \
            .get(persistency) \
            .get(setting)
        return settings
settings_test.py
from unittest import TestCase
from unittest.mock import patch

import escli.settings


class TestClusterSettings(TestCase):
    def setUp(self):
        self.patcher = patch('elasticsearch5.Elasticsearch')
        self.MockClass = self.patcher.start()

    def test_get(self):
        # Note: this is an empty dict to show my point;
        # it will contain child dicts to allow my .get(persistency).get(setting)
        self.MockClass.return_value.cluster.get_settings.return_value = {}
        cluster_settings = escli.settings.ClusterSettings()
        ret = cluster_settings.get('cluster.routing.allocation.node_concurrent_recoveries',
                                   persistency='transient')
        # ret should contain a subset of my dict defined above
I want Escli._es.cluster.get_settings() to return what I want (a dict object) so that it doesn't make the real HTTP call, but it keeps doing it.
What I know:
In order to mock an instance method I have to do something like
MagicMockObject.return_value.InstanceMethodName.return_value = ...
I cannot patch Escli._es.cluster.get_settings directly, because Python would try to import Escli as a module, which cannot work. So I'm patching the whole library.
I desperately tried to put some return_value everywhere but I cannot understand why I can't mock that thing properly.
You should be mocking with respect to where you are testing. Based on the example provided, this means that the Escli class you are using in the settings.py module needs to be mocked with respect to settings.py. So, more practically, your patch call would look like this inside setUp instead:
self.patcher = patch('escli.settings.Escli')
With this, you are now mocking what you want in the right place based on how your tests are running.
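For example, here is a minimal sketch of the adjusted test, assuming the same module layout as in the question (escli/settings.py importing Escli from escli.main); the settings values are made up:

from unittest import TestCase
from unittest.mock import patch

import escli.settings


class TestClusterSettings(TestCase):
    def setUp(self):
        # Patch Escli where settings.py looks it up, not where it is defined.
        self.patcher = patch('escli.settings.Escli')
        self.MockEscli = self.patcher.start()
        self.addCleanup(self.patcher.stop)

    def test_get(self):
        self.MockEscli._es.cluster.get_settings.return_value = {
            'transient': {'cluster.routing.allocation.node_concurrent_recoveries': '2'},
        }
        ret = escli.settings.ClusterSettings().get(
            'cluster.routing.allocation.node_concurrent_recoveries',
            persistency='transient')
        self.assertEqual(ret, '2')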
Furthermore, to add more robustness to your testing, you might want to consider speccing for the Elasticsearch instance you are creating in order to validate that you are in fact calling valid methods that correlate to Elasticsearch. With that in mind, you can do something like this, instead:
self.patcher = patch('escli.settings.Escli', Mock(Elasticsearch))
To read a bit more about what exactly is meant by spec, check the patch section in the documentation.
As a final note, if you are interested in exploring the great world of pytest, there is a pytest-elasticsearch plugin created to assist with this.

Load local (unserializable) objects on workers

I'm trying to use Dataflow in conjunction with Tensorflow for predictions. Those predictions happen on the workers, and I'm currently loading the model in start_bundle(), like here:
class PredictDoFn(beam.DoFn):
    def start_bundle(self):
        self.model = load_model_from_file()

    def process(self, element):
        ...
My current problem is that even if I process 1000 elements, the start_bundle() function is called multiple times (at least 10), not once per worker as I'd hoped. This slows down the pipeline significantly, because the model needs to be loaded many times and it takes 30 seconds each time.
Is there any way to load the model on the workers at initialisation, rather than every time in start_bundle()?
Thanks in advance!
Dimitri
The easiest thing would be to add an if self.model is None: self.model = load_model_from_file() check, though this may not reduce the number of times your model is reloaded.
This is because DoFn instances are not currently reused across bundles, which means that your model will be forgotten after every work item is executed.
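A minimal sketch of that check, assuming the load_model_from_file helper from the question and a hypothetical predict call on the model:

import apache_beam as beam


class PredictDoFn(beam.DoFn):
    def __init__(self):
        super().__init__()
        self.model = None

    def start_bundle(self):
        # Only reload if this DoFn instance has not loaded the model yet;
        # as noted above, this only pays off when the instance is reused.
        if self.model is None:
            self.model = load_model_from_file()

    def process(self, element):
        yield self.model.predict(element)  # hypothetical prediction call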
You could also create a global variable where you keep the model. This would reduce the amount of reloads, but it would be really unorthodox (though it may solve your use case).
A global variable approach should work something like this:
import apache_beam as beam

my_model = None  # module-level cache, shared by all DoFn instances in the process


class MyModelDoFn(beam.DoFn):
    def process(self, elem):
        global my_model
        if my_model is None:
            my_model = load_model_from_file()
        yield my_model.apply_to(elem)
An approach that relies on a thread-local variable would look like this. Note that it will load the model once per thread, so the number of times your model is loaded depends on the runner implementation (it will work on Dataflow):
import threading

import apache_beam as beam


class MyModelDoFn(beam.DoFn):
    _thread_local = threading.local()

    @property
    def model(self):
        model = getattr(MyModelDoFn._thread_local, 'model', None)
        if not model:
            MyModelDoFn._thread_local.model = load_model_from_file()
        return MyModelDoFn._thread_local.model

    def process(self, elem):
        yield self.model.apply_to(elem)
I guess you can load the model from the start_bundle call as well.
Note: this approach is very unorthodox and is not guaranteed to keep working in newer versions or on all runners.

Calling mapPartitions on a class method in python

I'm fairly new to Python and to Spark, but let me see if I can explain what I am trying to do.
I have a bunch of different types of pages that I want to process. I created a base class for all the common attributes of those pages and then have a page-specific class inherit from the base class. The idea is that the Spark runner will be able to do the exact same thing for all pages by changing just the page type when called.
Runner
def CreatePage(pageType):
    if pageType == "Foo":
        return PageFoo(pageType)
    elif pageType == "Bar":
        return PageBar(pageType)


def Main(pageType):
    page = CreatePage(pageType)
    pageList_rdd = sc.parallelize(page.GetPageList())
    return pageList_rdd.mapPartitions(lambda pageNumbers: CreatePage(pageType).ProcessPage(pageNumbers))


print(Main("Foo"))
PageBaseClass.py
class PageBase(object):
    def __init__(self, pageType):
        self.pageType = None
        self.dbConnection = None

    def GetDBConnection(self):
        if self.dbConnection is None:
            # Set up a db connection so we can share this amongst all nodes.
            self.dbConnection = DataUtils.MySQL.GetDBConnection()
        return self.dbConnection

    def ProcessPage(self, pageNumber):
        raise NotImplementedError()
PageFoo.py
class PageFoo(PageBase):
    def __init__(self, pageType):
        super(PageFoo, self).__init__(pageType)
        self.pageType = pageType
        self.dbConnection = self.GetDBConnection()

    def ProcessPage(self, pageNumber):
        result = self.dbConnection.cursor("SELECT SOMETHING")
        # other processing
There is a lot of other page-specific functionality that I am omitting for brevity, but the idea is that I'd like to keep all the logic of how to process a page in the page class, and be able to share resources like the db connection and an S3 bucket.
I know that, the way I have it right now, it creates a new Page object for every item in the RDD. Is there a way to do this so that it only creates one object? Is there a better pattern for this? Thanks!
A few suggestions:
Don't create connections directly. Use a connection pool (since each executor uses a separate process, setting the pool size to one is just fine) and make sure that connections are automatically closed on timeout.
Use the Borg pattern to store the pool, and adjust your code to retrieve connections from it (see the sketch after these suggestions).
You won't be able to share connections between nodes, or even within a single node (see the comment about separate processes). The best guarantee you can get is a single connection per partition (or a number of partitions with interpreter reuse).
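A rough sketch of that Borg-pattern holder, assuming a pooling factory such as the hypothetical DataUtils.MySQL.GetConnectionPool (substitute whatever pooling API you actually use):

class ConnectionPoolBorg(object):
    """Borg: every instance shares state, so the pool exists once per process."""
    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state
        if not hasattr(self, 'pool'):
            # Hypothetical factory; one executor = one process = pool of size 1.
            self.pool = DataUtils.MySQL.GetConnectionPool(size=1)

    def get_connection(self):
        return self.pool.get_connection()


# PageBase.GetDBConnection would then become something like:
#     def GetDBConnection(self):
#         return ConnectionPoolBorg().get_connection()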

Mapreduce on GAE Python - Cause ReducePipeline to issue callback on finalize?

I'd like to execute a custom callback function once a mapreduce job has finalized/completed.
The only useful references I found for this problem are a somewhat outdated Google site and a related, but again seemingly outdated, Stack Overflow question.
Both those sources assume that I use control.start_map to kick off MapReduce jobs and rely on the fact that start_map takes a keyword argument mapreduce_parameters, in which one can specify a done_callback argument giving the URL that should be called on completion. However, I'm using a different method (afaik the more recent, preferred one), in which a custom pipeline's run method yields a MapReduce pipeline:
yield mapreduce_pipeline.MapreducePipeline(
    "word_count",
    "main.word_count_map",
    "main.word_count_reduce",
    "mapreduce.input_readers.BlobstoreZipInputReader",
    "mapreduce.output_writers.BlobstoreOutputWriter",
    mapper_params={
        "blob_key": blobkey,
    },
    reducer_params={
        "mime_type": "text/plain",
    },
    shards=16)
The signature of MapreducePipeline doesn't allow for a mapreduce_parameters argument. The only place where I can see the callback cropping up in the source is mapper_pipeline.MapperPipeline.run, but it seems to be used internally only.
So, is there a way to get that callback parameter in there?
If not, does someone have good ideas on where and how to extend the library to provide such functionality?
I set up my Mapreduce pipeline paradigm to look a little like the following:
class MRRecalculateSupportsPipeline(base_handler.PipelineBase):
    def run(self, user_key):
        # ...
        yield mapreduce_pipeline.MapreducePipeline(
            'user_recalculate_supports',
            'myapp.mapreduces.user_recalculate_supports_map',
            'myapp.mapreduces.user_recalculate_supports_reduce',
            'mapreduce.input_readers.DatastoreInputReader',
            output_writer_spec=None,
            mapper_params={"""..."""})
If you would like to capture the completion of this pipeline you have two options.
A) Use pipeline.After to run a completion pipeline after the MR pipeline completes.
pipe_future = yield mapreduce_pipeline.MapreducePipeline(
    'user_recalculate_supports',
    'myapp.mapreduces.user_recalculate_supports_map',
    'myapp.mapreduces.user_recalculate_supports_reduce',
    'mapreduce.input_readers.DatastoreInputReader',
    output_writer_spec=None,
    mapper_params={"""..."""})

with pipeline.After(pipe_future):
    # this could be a mapreduce pipeline, or any pipeline using the same
    # base_handler.PipelineBase parent class.
    yield CalcCompletePipeline(...)
B) Use the finalized method of the top-level pipeline to handle completion. Personally, I'd stick with option A, because you can trace the path in /_ah/*/status?root= view.
class EmailNewReleasePipeline(base_handler.PipelineBase):
    """Email followers about a new release"""

    # TODO: product_key is the name of the parameter, but it's built for albums ...
    def run(self, product_key, testing=False):
        # Send those emails ...
        yield mapreduce_pipeline.MapreducePipeline(...)

    def finalized(self):
        """Save product as launched"""
        ...
        product.launched = True
        product.put()
Here are the docs on the finalization of a pipeline.
At least as a low-investment workaround for this issue, you can simply yield another Map/MapReduce pipeline that does the desired postprocessing.
E.g.:
class MainPipeline(base_handler.PipelineBase):
    def run(self):
        mapper_params = { ... }
        reducer_params = { ... }
        yield mapreduce_pipeline.MapreducePipeline(
            ...,
            mapper_params=mapper_params,
            reducer_params=reducer_params)
        yield PostprocessPipeline(reducer_params)


class PostprocessPipeline(base_handler.PipelineBase):
    def run(self, reducer_params):
        do_some_postprocessing(reducer_params)
That workaround doesn't have access to the MapReduce state, which I suppose could somehow be retrieved from the pipeline ID, but it's not yet obvious to me how. So you'll have to set another flag/memcache/datastore entry to check whether the pipeline completed successfully (if that's relevant to the postprocessing).
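For instance, the completion flag mentioned above could be as simple as a memcache entry written at the end of the postprocessing pipeline (a datastore entity would work the same way); the job_id argument and key prefix here are hypothetical:

from google.appengine.api import memcache


class PostprocessPipeline(base_handler.PipelineBase):
    def run(self, reducer_params, job_id):
        do_some_postprocessing(reducer_params)
        # Record that the whole chain, including postprocessing, completed.
        # The 'mr_done:' prefix and job_id are illustrative only.
        memcache.set('mr_done:%s' % job_id, True)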
