Passing class functions to PySpark RDD

I have a class named some_class in a Python file here:
/some-folder/app/bin/file.py
I am importing it into my code here:
/some-folder2/app/code/file2.py
with:
import sys
sys.path.append('/some-folder/app/bin')
from file import some_class
clss = some_class()
I want to use this class's method some_function inside a Spark map:
sc.parallelize(some_data_iterator).map(lambda x: clss.some_function(x))
This gives me an error:
No module named file
clss.some_function works when I call it normally, outside the map function, but not inside PySpark's RDD. I think this has something to do with PySpark, but I have no idea where I am going wrong.
I tried broadcasting the class instance, and it still didn't work.

All Python dependencies have to be either present on the search path of the worker nodes or distributed manually using the SparkContext.addPyFile method, so something like this should do the trick:
sc.addPyFile("/some-folder/app/bin/file.py")
It will copy the file to all the workers and place it in the working directory.
On a side note, please don't use file as a module name, even if it is only an example. Shadowing built-in names in Python is not a good idea.
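
The failure mode can be reproduced without a cluster. The sketch below is only an analogy: it uses the standard pickle module rather than Spark's own serializer, and the module name my_helpers and class SomeClass are hypothetical. It shows that a pickled instance only records where its class lives, so the receiving side must be able to import that module, which is roughly what addPyFile arranges on each worker.

```python
import importlib
import os
import pickle
import sys
import tempfile

# Hypothetical module and class, written to a temporary directory.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "my_helpers.py"), "w") as f:
    f.write(
        "class SomeClass(object):\n"
        "    def some_function(self, x):\n"
        "        return x * 2\n"
    )

# "Driver" side: the directory is on sys.path, so the import works.
sys.path.append(module_dir)
from my_helpers import SomeClass

payload = pickle.dumps(SomeClass())  # stores a reference to my_helpers.SomeClass

# "Worker" side: simulate a process that cannot find the module.
sys.path.remove(module_dir)
del sys.modules["my_helpers"]
importlib.invalidate_caches()
try:
    pickle.loads(payload)
    worker_error = None
except ImportError as exc:
    worker_error = str(exc)  # mentions the missing module name

# Once the module is findable again, the same payload loads fine.
sys.path.append(module_dir)
importlib.invalidate_caches()
restored = pickle.loads(payload)
result = restored.some_function(21)
```

In a real job, the equivalent of the last step is calling sc.addPyFile before running the map, as in the answer above.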

Related

Does importing a specific class from a file instead of the full file matter?

Most of the tutorials and books about Django or Flask import specific classes from files instead of importing the whole file.
For example, importing the DataRequired validator from wtforms.validators is done via from wtforms.validators import DataRequired instead of importing it via import wtforms.validators as valids and then accessing DataRequired with valids.DataRequired.
My question is: is there a reason for this?
I thought of something like avoiding loading a whole module, as a computation/memory optimization (is it really relevant?). Or is it simply to make the code more readable?

"My question is: is there a reason for this?"
from module_or_package import something is the canonical Pythonic idiom (when you only want to import something into your current namespace, of course).
Also, import module_or_package.something only works if module_or_package is a package and something is a submodule. It raises ImportError (No module named something) if something is a function, class or any other object defined in module_or_package, as can be seen in the stdlib with os.path (which is a submodule of the os package) vs datetime.date (which is a class defined in the datetime module):
>>> import os.path as p
>>> p
<module 'posixpath' from '/home/bruno/.virtualenvs/blook/lib/python2.7/posixpath.pyc'>
vs
>>> import datetime.date as d
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named date
"I thought of something like avoiding loading a whole module, as a computation/memory optimization (is it really relevant?)"
Totally irrelevant - importing a given name from a module requires importing the whole module. Actually, this:
from module_or_package import something
is only syntactic sugar for
import module_or_package
something = module_or_package.something
del module_or_package
EDIT: You mention in a comment that
"Right, but importing the whole module means loading it into memory, which can be a reason for importing only a submodule/class"
so it seems I failed to make the point clear: in Python, you cannot "import only a submodule/class", period.
In Python, import, class and def are all executable statements (and actually just syntactic sugar for operations you can do 'manually' with functions and classes). Importing a module actually consists of executing all the code at the module's top level (which will instantiate function and class objects) and creating a module object (an instance of the module type) whose attributes will be all the names defined at the top level via import, def and class statements or via explicit assignment. It's only when all this has been done that you can access any name defined in the module, and this is why, as I stated above,
from module import obj
is only syntactic sugar for
import module
obj = module.obj
del module
But (unless you do something stupid like defining a terabyte-huge dict or list in your module) this doesn't actually take much time or eat much RAM, and a module is only effectively executed once per process, the first time it's imported. It is then cached in sys.modules, so subsequent imports only fetch it from the cache.
Also, unless you actively prevent it, Python will cache the compiled version of the module (the .pyc files) and only recompile it if the .pyc is missing or older than the source .py file.
With regard to packages and submodules, importing a submodule will also execute the package's __init__.py and build a module instance from it (in other words, at runtime, a package is also a module). Package initializers are canonically rather short, and actually quite often empty, for what it's worth.
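
A small sketch can make the two claims above concrete, assuming Python 3 and a hypothetical throwaway module named noisy_mod: importing a single name still runs the whole module, and the module's top level runs only once per process.

```python
import os
import sys
import tempfile

# Hypothetical module written to a temporary directory.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "noisy_mod.py"), "w") as f:
    f.write(
        "marker = object()\n"          # a fresh object every time the top level runs
        "other = 'never asked for'\n"
        "def obj():\n"
        "    return 'hello'\n"
    )
sys.path.insert(0, module_dir)

# Ask for one name only.
from noisy_mod import obj

# The whole module was executed and cached anyway.
whole_module = sys.modules["noisy_mod"]
other_was_defined = hasattr(whole_module, "other")
first_marker = whole_module.marker

# Further imports hit the sys.modules cache instead of re-running the file.
from noisy_mod import obj as obj_again
import noisy_mod

same_function = obj_again is obj
executed_once = noisy_mod.marker is first_marker
```

If the top level had run a second time, marker would be a different object and executed_once would be False.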

It depends; in the tutorial, that was probably done for readability.
Usually, if you use most of the classes in a file, you import the file. If the file contains many classes but you only need a few, just import those.
It's a matter of both readability and optimization.

Using a module inside another module

I have a .py file containing some functions. One of the functions requires Python's csv module. Let's call it foo.
Here is the thing: if I enter the Python shell, import the csv module, write the definition of foo and use it, everything runs fine.
The problem comes when I try to import foo from a custom module. If I enter the Python shell, import the csv module, import the module where foo is located and try to use it, it returns an error stating that 'csv' has not been defined (it behaves as if the csv module had not been imported).
I'm wondering if I'm missing some kind of scope behaviour related to imports.
How can I enable foo to use the csv module or any other module it requires?
Thank you in advance.

By importing it in the file that defines the foo function.
The foo function doesn't know to look in the dictionary containing the globals you use in the REPL (where you have imported csv). It looks in the globals of its own module (there are other steps here, of course), and if it doesn't find the name there you'll get a NameError.
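
Here is a minimal sketch of that lookup rule. The module names broken_mod and fixed_mod are hypothetical, and the modules are built with exec only to keep the example in one file; a real module would be an ordinary .py file.

```python
import types

import csv  # imported here only, like in the REPL session

# Source of foo, without any csv import of its own.
FOO_SOURCE = (
    "import io\n"
    "def foo(row):\n"
    "    buf = io.StringIO()\n"
    "    csv.writer(buf).writerow(row)\n"
    "    return buf.getvalue()\n"
)

def load(name, source):
    """Build a module object whose globals are separate from ours."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    return module

# foo looks for csv in broken_mod's globals, not in the caller's.
broken = load("broken_mod", FOO_SOURCE)
try:
    broken.foo(["a", "b"])
    broken_error = None
except NameError as exc:
    broken_error = str(exc)

# Importing csv inside the defining module fixes it.
fixed = load("fixed_mod", "import csv\n" + FOO_SOURCE)
fixed_output = fixed.foo(["a", "b"])
```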

Hacking python's import statement

I'm building a Python module for a fairly specific purpose. What I'd like to do with this is get more functionality behind importing things from it.
I'd like to have a setup by which saying from my_module import foo would run a function and pass the string "foo". This function would return the object that should be imported.
For example, maybe I want to make a cloud-based import system. I'd like to store community scripts in the cloud, and then download them when a user tries to import them.
Maybe I use the code from cloud import test_module. This would check a cache to decide whether test_module had been downloaded. If so, it would return that module. If not, it would download the module before importing it.
How can I accomplish something like this in Python, by which a dynamic range of submodules could be seamlessly imported from the cloud?

Full-featured support for what you ask probably requires a bunch of complicated code using importlib and hooking into various parts of the import machinery. However, a more limited solution can be implemented with just a single custom class that pretends to be a module.
When you import a module, Python first checks in the sys.modules dictionary to see if the module is a key. If so, it returns the value associated with the key. It does this regardless of what the value is, so you can put any kind of object in sys.modules and Python will treat it like a module. A module's code can even replace its own entry in sys.modules, and the replacement will be used even the first time it is imported!
So, to implement your fancy module that downloads other modules on demand, replace the module itself with an instance of a custom class, and write that class a __getattr__ or __getattribute__ method that does the work you want.
Here's a trivial example module that returns a string for any attribute you look for in it. The string will always be the same as the requested attribute name. In your code, you'd want to do your fancy web-cache lookups and downloading, and then return the fetched module object instead of just returning a string.
class FakeModule(object):
    def __getattribute__(self, name):
        return name

import sys
sys.modules[__name__] = FakeModule()
On my system I've saved that as fakemodule.py. Now if I do from fakemodule import foo, I get foo with the value 'foo' in my local namespace.
Note that this only works for one level deep imports. If you do from fakemodule.subpackage import name it will not work because there's no fakemodule.subpackage entry in sys.modules.
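
As a hedged variation on the same idea, the sketch below uses __getattr__ (which only runs when normal lookup fails) so that fetched values can be cached on the instance. It assumes Python 3; the module name lazy_cloud, the class LazyModule, and the "fetched:" strings are hypothetical stand-ins for a real cache lookup and download.

```python
import os
import sys
import tempfile

LAZY_SOURCE = '''
import sys
import types

class LazyModule(types.ModuleType):
    """Stand-in module that produces attributes on first access."""

    def __init__(self, name):
        super().__init__(name)
        self.fetch_log = []

    def __getattr__(self, name):
        # Refuse dunder names so the import machinery's own probes
        # (for example __path__) see an ordinary non-package module.
        if name.startswith("__"):
            raise AttributeError(name)
        self.fetch_log.append(name)   # placeholder for "download it"
        value = "fetched:" + name
        setattr(self, name, value)    # cache, so __getattr__ is skipped next time
        return value

sys.modules[__name__] = LazyModule(__name__)
'''

# Write the hypothetical module to a temporary directory and import from it.
module_dir = tempfile.mkdtemp()
with open(os.path.join(module_dir, "lazy_cloud.py"), "w") as f:
    f.write(LAZY_SOURCE)
sys.path.insert(0, module_dir)

from lazy_cloud import test_module            # triggers the fetch
from lazy_cloud import test_module as again   # served from the cache

fetch_log = sys.modules["lazy_cloud"].fetch_log
```

Like the original, this only handles one level of import; from lazy_cloud.subpackage import name would still fail.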

Dynamic module imports from external function, (or - editing globals() outside of module), in Python

I have a project in which I want to repeatedly change code in a class and then run other modules to test the changes (verification). Currently, after each edit I have to reload the code and the testing modules which run it, and then run the test. I want to reduce this cycle to one line. Moreover, I will later want to test different classes, so I want to be able to receive the name of the tested class as a parameter, meaning I need dynamic imports.
I wrote a function for clean imports of any module, and it seems to work:
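# reload is a builtin in Python 2; in Python 3, import it from importlib too.
from importlib import import_module

def build_module_clean(module_string, attr_strings):
    module = import_module(module_string)
    module = reload(module)
    for f in attr_strings:
        globals()[f] = getattr(module, f)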
Now, in the name of cleanliness, I want to keep this function in a wrapper module (which will contain the one-liner I want to rebuild and test all the code each time), and run it from the various modules, i.e. among the import statements of my ModelChecker module I would place the line
from wrapper import build_module_clean
build_module_clean('test_class_module',['test_class_name'])
However, when I do this, it seems the test class is added to the globals of the wrapper module, but not to those of the ModelChecker module (attempting to access globals()['test_class_name'] in ModelChecker raises a KeyError). I have tried passing globals or globals() as further parameters to build_module_clean, but globals is a function (so the test module is still loaded into the wrapper's globals), and passing and then using globals() gives the error
TypeError: 'builtin_function_or_method' object does not support item assignment
So I need some way to edit one module's globals() from another module.
Alternatively, (ideally?) I would like to import the test_class module in the wrapper, in a manner that would make it visible to all the modules that use it (e.g. ModelChecker). How can I do that?

Your function should look like:
def build_module_clean(globals, module_string, attr_strings):
    module = import_module(module_string)
    module = reload(module)
    globals[module_string] = module
    for f in attr_strings:
        globals[f] = getattr(module, f)
and call it like so:
build_module_clean(globals(), 'test_class_module', ['test_class_name'])
Explanation:
Calling globals() in the function call (build_module_clean(globals(), ...)) grabs the module's __dict__ while still in the correct module and passes that to your function.
The function is then able to (re)assign the names to the newly loaded module and its current attributes.
Note that I also (re)assigned the newly-loaded module itself to the globals (you may not want that part).
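
For reference, here is a sketch of the same function under Python 3, where reload lives in importlib. To keep it runnable anywhere, it reloads the stdlib module colorsys instead of the test module, and a plain dict named caller_namespace (a hypothetical name) stands in for the dict the calling module would get from globals().

```python
import sys
from importlib import import_module, reload

def build_module_clean(namespace, module_string, attr_strings):
    """Reload a module and bind it, plus selected attributes, into namespace."""
    module = reload(import_module(module_string))
    namespace[module_string] = module
    for attr in attr_strings:
        namespace[attr] = getattr(module, attr)

# The caller supplies its own namespace; in real use this is globals().
caller_namespace = {}
build_module_clean(caller_namespace, "colorsys", ["rgb_to_hsv"])

bound_module = caller_namespace["colorsys"]
bound_function = caller_namespace["rgb_to_hsv"]
```

Passing the namespace explicitly keeps the wrapper free of any knowledge about who is calling it, which is why the same function works from every module that uses it.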

How can I modify modules and packages while keeping the original intact?

I have a program written in Python that uses the PyPDF2 package to scrape a batch of PDF files. These PDFs aren't in the greatest shape, so in order for my program to run, I need to modify the pdf.py file located within the package library, as recommended by this website:
https://cheonhyangzhang.wordpress.com/2015/03/31/python-pdffilereader-pdfreaderror-eof-marker-not-found/
Is there a way I can implement this change to the file while keeping the original file intact? I've tried creating a child class of PdfFileReader class and modifying the 'read' method as prescribed by my link above, however, I've found that that leads to several import dependency issues that I'd like to avoid.
Is there an easier way to do this?

I would recommend copying the pdf.py file into your script directory and renaming it to mypdf.py. You can then modify the copy as you please without affecting the original. You can import the module using
import mypdf
I have done something similar for shutil.py as the default buffer size is too small in Windows for transferring large files.

You can add (or redefine) a method of a class using setattr() like this (where the class has been defined inline rather than imported, purely for illustration):
class Class(object):
    pass

def func(self, some_other_argument):
    return some_other_argument

setattr(Class, 'func', func)

if __name__ == '__main__':
    c = Class()
    print(c.func(42))  # -> 42
