Calling mapPartitions on a class method in Python

I'm fairly new to Python and to Spark, but let me see if I can explain what I am trying to do.
I have a bunch of different types of pages that I want to process. I created a base class for all the common attributes of those pages, and each page-specific class inherits from that base class. The idea is that the Spark runner can do the exact same thing for all pages by changing just the page type when it is called.
Runner
def CreatePage(pageType):
    if pageType == "Foo":
        return PageFoo(pageType)
    elif pageType == "Bar":
        return PageBar(pageType)

def Main(pageType):
    page = CreatePage(pageType)
    pageList_rdd = sc.parallelize(page.GetPageList())
    return pageList_rdd.mapPartitions(lambda pageNumbers: CreatePage(pageType).ProcessPage(pageNumbers))

print(Main("Foo"))
PageBaseClass.py
class PageBase(object):
    def __init__(self, pageType):
        self.pageType = pageType
        self.dbConnection = None

    def GetDBConnection(self):
        if self.dbConnection is None:
            # Set up a db connection so we can share this amongst all nodes.
            self.dbConnection = DataUtils.MySQL.GetDBConnection()
        return self.dbConnection

    def ProcessPage(self, pageNumbers):
        raise NotImplementedError()
PageFoo.py
class PageFoo(PageBase):
    def __init__(self, pageType):
        super(PageFoo, self).__init__(pageType)
        self.dbConnection = self.GetDBConnection()

    def ProcessPage(self, pageNumbers):
        result = self.dbConnection.cursor("SELECT SOMETHING")
        # other processing
There is a lot of other page-specific functionality that I am omitting for brevity, but the idea is that I'd like to keep all the logic for how to process a page in the page class, and be able to share resources like the db connection and an s3 bucket.
I know that the way I have it right now, it is creating a new Page object for every item in the rdd. Is there a way to do this so that it only creates one object? Is there a better pattern for this? Thanks!

A few suggestions:
Don't create connections directly. Use a connection pool instead (since each executor uses a separate process, setting the pool size to one is just fine) and make sure that connections are automatically closed after a timeout.
Use the Borg pattern to store the pool and adjust your code to retrieve connections from it; a sketch follows these suggestions.
You won't be able to share connections between nodes, or even within a single node (see the comment about separate processes). The best guarantee you can get is a single connection per partition (or across a number of partitions when the interpreter is reused).
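A minimal sketch of what the first two suggestions could look like (the pool factory GetConnectionPool is hypothetical; the question only shows DataUtils.MySQL.GetDBConnection):
class ConnectionPool(object):
    # Borg pattern: every instance shares the same state, so each executor
    # process keeps a single pool no matter how many Page objects it builds.
    _shared_state = {}

    def __init__(self):
        self.__dict__ = self._shared_state

    def get_connection(self):
        if not hasattr(self, "_pool"):
            # Hypothetical pool factory; a pool size of one per process is enough.
            self._pool = DataUtils.MySQL.GetConnectionPool(size=1)
        return self._pool.get_connection()


class PageBase(object):
    def __init__(self, pageType):
        self.pageType = pageType

    def GetDBConnection(self):
        # Every page instance asks the shared Borg pool for a connection
        # rather than opening its own.
        return ConnectionPool().get_connection()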

Related

Are there drawbacks to executing code in an SQLAlchemy managed session, and if so, why?

I have seen different "patterns" for handling this case, so I am wondering whether one has any drawbacks compared to the other.
So let's assume that we wish to create a new object of class MyClass and add it to the database. We can do the following:
class MyClass:
    pass

def builder_method_for_myclass():
    # A lot of code here..
    return MyClass()

my_object = builder_method_for_myclass()

with db.managed_session() as s:
    s.add(my_object)
which seems to keep the session open only while adding the new object. But I have also seen cases where the entire builder method is called and executed within the managed session, like so:
class MyClass:
    pass

def builder_method_for_myclass():
    # A lot of code here..
    return MyClass()

with db.managed_session() as s:
    my_object = builder_method_for_myclass()
    s.add(my_object)
Are there any downsides to either of these methods, and if so, what are they? I can't find anything specific about this in the documentation.
When you build objects that depend on objects fetched from a session, you have to be in a session. So a factory function can only execute outside a session in the simplest cases. Usually you have to pass the session around or make it available on a thread local.
For example, in this case, to build a product I need to fetch the product category from the database into the session. So my product factory function depends on the session instance. The new product is created and added to the same session that the category is in. An implicit commit should also occur when the session ends, i.e. when the context manager completes.
def build_product(session, category_name):
    category = session.query(ProductCategory).where(
        ProductCategory.name == category_name).first()
    return Product(category=category)

with db.managed_session() as s:
    my_product = build_product(s, "clothing")
    s.add(my_product)

How can I decouple the two components of my Python application

I'm trying to learn Python development, and I have been reading on topics of architectural patterns and code design, as I want to stop hacking and really develop. I am implementing a web crawler, and I know it has a problematic structure as you'll see, but I don't know how to fix it.
The crawlers will return a list of actions to insert data into a MongoDB instance.
This is the general structure of my application:
Spiders
    crawlers.py
    connections.py
    utils.py
    __init__.py
crawlers.py implements a Crawler base class, and each specific crawler inherits from it. Each Crawler has a table_name attribute and a crawl method.
In connections.py, I implemented a pymongo driver to connect to the DB. It expects a crawler as a parameter to its write method. Now here comes the tricky part... crawler2 depends on the results of crawler1, so I end up with something like this:
from pymongo import InsertOne

class crawler1(Crawler):
    def __init__(self):
        super().__init__('Crawler 1', 'table_A')

    def crawl(self):
        ...  # returns a list of InsertOne operations

class crawler2(Crawler):
    def __init__(self):
        super().__init__('Crawler 2', 'table_B')

    def crawl(self, list_of_codes):
        ...  # returns a list of InsertOne operations after crawling the list of codes/links
and then, in my connections, I create a class that expects a crawler.
class MongoDriver:
    def __init__(self):
        self.db = MongoClient(...)

    def write(self, crawler, **kwargs):
        self.db[crawler.table_name].bulk_write(crawler.crawl(**kwargs))

    def get_list_of_codes(self):
        query = {}
        return [x['field'] for x in self.db.find(query)]
And so, here comes the (biggest) problem (because I think there are many others, some of which I can barely grasp, and others that I'm still totally blind to): the implementation of my connections needs context about the crawler! For example:
mongo_driver = MongoDriver()
crawler1 = Crawler1()
crawler2 = Crawler2()
mongo_driver.write(crawler1)
mongo_driver.write(crawler2, list_of_codes=mongo_driver.get_list_of_codes())
How would one go about solving this? And what else is particularly worrisome in this construct? Thanks for the feedback!
Problem 1: MongoDriver knows too much about your crawlers. You should separate the driver from crawler1 and crawler2. I'm not sure what your crawl function returns, but I assume it is a list of objects of type A.
You could use an object such as CrawlerService to manage the dependency between MongoDriver and Crawler. This will separate the driver's write responsibility from the crawler's responsibility to crawl. The service will also manage the order of operations which in some cases may be considered good enough.
class Repository:
    def write(self, for_table: str, objects: 'List[A]'):
        self.db[for_table].bulk_write(objects)

class CrawlerService:
    def __init__(self, repository: Repository, crawlers: List[Crawler]):
        self.repository = repository
        self.crawlers = crawlers

    def crawl(self):
        crawler1, crawler2 = self.crawlers
        result = [self.repository.write(x) for x in crawler1.crawl()]
        ...  # work with crawler2 and result
Problem 2: Crawler1 and Crawler2 are pretty much the same; they only differ in how we call the crawl function. Considering the DRY principle, you can separate the crawling algorithms into objects such as strategies and have one Crawler depend on them (by composition).
from abc import ABC, abstractmethod
from typing import List

class CrawlStrategy(ABC):
    @abstractmethod
    def crawl(self) -> List[A]:
        pass

class CrawlStrategyA(CrawlStrategy):
    def crawl(self) -> List[A]:
        ...

class CrawlStrategyB(CrawlStrategy):
    def __init__(self, codes: List[int]):
        self.__codes = codes

    def crawl(self) -> List[A]:
        ...

class Crawler(ABC):
    def __init__(self, name: str, strategy: 'CrawlStrategy'):
        self.__name = name
        self.__strategy = strategy

    def crawl(self) -> List[A]:
        return self.__strategy.crawl()
By doing this the structure of Crawler (e.g. table name etc.) exists in only one place, and you can extend it later on.
Problem 3: From here, you have multiple approaches to improve the overall design. You can remove the CrawlerService by creating new strategies that depend on your database connection. To express that one strategy depends on another (e.g. crawler1 produces results for crawler2), you can compose two strategies together, such as:
class StrategyA(CrawlStrategy):
    def __init__(self, other: CrawlStrategy, database: DB):
        self.__other = other
        self.__db = database

    def crawl(self) -> 'List[A]':
        result = self.__other.crawl()
        self.__db.write(result)
        xs = self.__db.find(...)
        # do something with xs
        ...
Of course, this is a simplified example, but it removes the need for a single mediator between your database connection and your crawlers, and it provides more flexibility. Also, the overall design is easier to test, because all you have to do is unit test the strategy objects (and you can mock the database connection with ease for dependency injection). A minimal test sketch follows.
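For instance, a rough test sketch under these assumptions (FakeDB and the wiring are illustrative, and the assertions depend on what crawl actually returns):
class FakeDB:
    def __init__(self, rows):
        self.rows = rows
        self.written = []

    def write(self, result):
        self.written.extend(result)

    def find(self, query):
        return self.rows


def test_strategy_a_writes_upstream_results():
    upstream = CrawlStrategyA()                     # or any stub returning a known List[A]
    strategy = StrategyA(other=upstream, database=FakeDB(rows=[]))
    strategy.crawl()
    # assert on FakeDB.written and on the value returned by crawl()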
From this point, the next steps for improving the overall design depend a lot on the implementation's complexity and on how much flexibility you need in general.
PS: You can also try alternatives to the Strategy pattern; depending on how many crawlers you have and their general structure, the Decorator pattern may be a better fit.

Initializing state on dask-distributed workers

I am trying to do something like
resource = MyResource()

def fn(x):
    something = do_something(x, resource)
    return something

client = Client()
results = client.map(fn, data)
The issue is that resource is not serializable and is expensive to construct.
Therefore, I would like to construct it once on each worker and have it available for use by fn.
How do I do this?
Or is there some other way to make resource available on all workers?
You can always construct a lazy resource, something like
class GiveAResource:
    resource = [None]

    def get_resource(self):
        if self.resource[0] is None:
            self.resource[0] = MyResource()
        return self.resource[0]
An instance of this will serialise between processes fine, so you can include it as an input to any function to be executed on workers, and then calling .get_resource() on it will get your local expensive resource (which will get remade on any worker which appears later on).
This class would be best defined in a module rather than dynamic code.
There is no locking here, so if several threads ask for the resource at the same time when it has not been needed so far, you will get redundant work.
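For example, a rough usage sketch (do_something, MyResource and data are placeholders from the question):
from dask.distributed import Client

holder = GiveAResource()            # cheap to serialise; MyResource is only built on a worker

def fn(x, holder):
    resource = holder.get_resource()    # constructed once per worker process, then cached
    return do_something(x, resource)

client = Client()
futures = client.map(fn, data, holder=holder)
results = client.gather(futures)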

How do I use TDD to create a database representation of existing objects?

I have used TDD to develop a set of classes in Python. These objects contain data fields, functions and links to each other. Everything functionally works like I want.
Eventually all of this should be stored in a database, to be used in a Django web application.
I have sketched some possible database schemas to hold the same information, but I feel this is a "sudden big leap" compared to the traditional TDD way of developing the rest of the application.
So, now I wonder, which tests should I write to force me to store these objects in a database in a step-by-step TDD way?
Making this question a bit more concrete, the classes are currently like this:
class Connector(object):
    def __init__(self, title=None):
        self.value = None
        self.valid = False
        self.title = title
    ...

class Element(object):
    def __init__(self, title=None):
        self.title = title
        self.input_connectors = []
        self.output_connectors = []
        self.number_of_runs = 0

    def run(self):
        ...
        self.number_of_runs += 1

class Average(Element):
    def __init__(self, title=None):
        super(Average, self).__init__(title=title)
        self.src = Connector("source")
        self.avg = Connector("average")
        self.input_connectors.append(self.src)
        self.output_connectors.append(self.avg)

    def run(self):
        super(Average, self).run()
        self.avg.set_value(numpy.average(self.src.value))
I realize some of the data should be in the database, while the processing functions should not. I think there should be a table which represents the details of the different "types / subclasses" of Element, and also one which stores actual instances. But, as I said, I don't see how to get there using TDD.
First, ask yourself if you will be testing your code or the Django ORM. Storing and reading will be fine most of the time.
The things you will need to test are data validation and any properties that are not model fields. I think you will end up with a good schema by writing tests on the next layer up from the database.
Also, use South or some other migration tool to reduce the cost of schema changes. That will give you some peace of mind.
If you have several levels of tests (e.g. integration tests), it makes sense to check that the database configuration is intact. Most of the tests do not need to hit the database; you can achieve this by mocking the model or some database operations (at least save()). Having said that, you can check database writes and reads with this simple test:
def test_db_access(self):
    input = Element(title='foo')
    input.save()
    output = Element.objects.get(title='foo')
    self.assertEqual(input, output)
Mocking save:
def save(obj):
    obj.id = 1
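A rough sketch of wiring that stub in with unittest.mock (the test name and assertion are illustrative):
from unittest import mock

def test_processing_without_db(self):
    with mock.patch.object(Element, 'save', new=save):
        element = Element(title='foo')
        element.save()                  # calls the stub above instead of hitting the database
        self.assertEqual(element.id, 1)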

Initializing cherrypy.session early

I love CherryPy's API for sessions, except for one detail. Instead of saying cherrypy.session["spam"] I'd like to be able to just say session["spam"].
Unfortunately, I can't simply have a global from cherrypy import session in one of my modules, because the cherrypy.session object isn't created until the first time a page request is made. Is there some way to get CherryPy to initialize its session object immediately instead of on the first page request?
I have two ugly alternatives if the answer is no:
First, I can do something like this
from threading import Thread
from time import sleep

def import_session():
    global session
    while not hasattr(cherrypy, "session"):
        sleep(0.1)
    session = cherrypy.session

Thread(target=import_session).start()
This feels like a big kludge, but I really hate writing cherrypy.session["spam"] every time, so to me it's worth it.
My second solution is to do something like
class SessionKludge:
    def __getitem__(self, name):
        return cherrypy.session[name]

    def __setitem__(self, name, val):
        cherrypy.session[name] = val

session = SessionKludge()
but this feels like an even bigger kludge and I'd need to do more work to implement the other dictionary functions such as .get
So I'd definitely prefer a simple way to initialize the object myself. Does anyone know how to do this?
For CherryPy 3.1, you would need to find the right subclass of Session, run its 'setup' classmethod, and then set cherrypy.session to a ThreadLocalProxy. That all happens in cherrypy.lib.sessions.init, in the following chunks:
# Find the storage class and call setup (first time only).
storage_class = storage_type.title() + 'Session'
storage_class = globals()[storage_class]
if not hasattr(cherrypy, "session"):
    if hasattr(storage_class, "setup"):
        storage_class.setup(**kwargs)

# Create cherrypy.session which will proxy to cherrypy.serving.session
if not hasattr(cherrypy, "session"):
    cherrypy.session = cherrypy._ThreadLocalProxy('session')
Reducing (replace FileSession with the subclass you want):
FileSession.setup(**kwargs)
cherrypy.session = cherrypy._ThreadLocalProxy('session')
The "kwargs" consist of "timeout", "clean_freq", and any subclass-specific entries from tools.sessions.* config.
