Optimal way to add functionality to classes - PySpark - python

A while ago I was looking for how to rename several columns at once on a PySpark DataFrame and came across something like the following:
import pyspark

def rename_sdf(df, mapper={}, **kwargs_mapper):
    # Do something
    # return something

pyspark.sql.dataframe.DataFrame.rename = rename_sdf
I am interested in that last bit where a method is added to the pyspark.DataFrame class through an assignment statement.
The thing is, I am creating a GitHub repo to store all my functions and ETLs, and I thought that if I could apply the logic shown above it would be super easy to just create an __init__.py module where I attach all my functions like:
from funcs import *
pyspark.sql.dataframe.DataFrame.func1 = func1
pyspark.sql.dataframe.DataFrame.func2 = func2
.
.
.
pyspark.sql.dataframe.DataFrame.funcN = funcN
I guess my question is:
Is this useful? Is it good for performance? Is it wrong? Is it un-Pythonic?

That can definitely have its uses in certain scenarios. I would recommend putting this code into a function so the user must explicitly call it.
import pyspark
import funcs

def wrap_pyspark_dataframe():
    pyspark.sql.dataframe.DataFrame.func1 = funcs.func1
    pyspark.sql.dataframe.DataFrame.func2 = funcs.func2
    ...
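For reference, here is a hedged sketch of what the full pattern might look like end to end. The body of rename_sdf is my own guess at an implementation (built on the standard DataFrame.withColumnRenamed method), not the original snippet's code; wrap_pyspark_dataframe is the explicit registration function suggested above.

from pyspark.sql import DataFrame

def rename_sdf(df, mapper=None, **kwargs_mapper):
    # Return a new DataFrame with columns renamed according to the mapping.
    # (Illustrative implementation, not the one from the original snippet.)
    mapping = {**(mapper or {}), **kwargs_mapper}
    for old_name, new_name in mapping.items():
        df = df.withColumnRenamed(old_name, new_name)
    return df

def wrap_pyspark_dataframe():
    # Called explicitly by the user, so the monkey-patching is opt-in.
    DataFrame.rename = rename_sdf

# Usage (assuming `sdf` is an existing DataFrame):
# wrap_pyspark_dataframe()
# sdf = sdf.rename({"old_col": "new_col"}, other_old="other_new")

Keeping the registration behind an explicit call also makes it obvious in client code where the extra methods come from.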

Related

Re-referencing a large number of functions in python

I have a file functional.py which defines a number of useful functions. For each function, I want to create an alias that, when called, will give a reference to the function. Something like this:
foo/functional.py
def fun1(a):
    return a

def fun2(a):
    return a + 1
...
foo/__init__.py
from inspect import getmembers, isfunction
from . import functional

for (name, fun) in getmembers(functional, isfunction):
    dun = lambda f=fun: f
    globals()[name] = dun
>>> bar.fun1()(1)
1
>>> bar.fun2()(1)
2
I can get the functions from functional.py using inspect and dynamically define a new set of functions that are fit for my purpose.
But why, you might ask... I am using the configuration manager Hydra, where one can instantiate objects by specifying their fully qualified name. I want to use the functions in functional.py in the config and have Hydra pass a reference to the function when creating an object that uses it (more details can be found in the Hydra documentation).
There are many functions and I don't want to write them all out... People have pointed out in similar questions that modifying globals() for this purpose is bad practice. My use case is fairly constrained - documentation-wise there is a one-to-one mapping (but obviously an IDE won't be able to resolve it).
Basically, I am wondering if there is a better way to do it!
Is your question related to this feature request and in particular to this comment?
FYI: In Hydra 1.1, instantiate fully supports positional arguments so I think you should be able to call functools.partial directly without redefining it.
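To make that comment concrete, here is an untested sketch of the Hydra 1.1 route. It relies on Hydra's _partial_ flag (my suggestion, not something stated in the comment) so that instantiate hands back a functools.partial wrapping the function rather than calling it, which is effectively the function reference the question asks for. It assumes the foo.functional layout from the question.

from hydra.utils import instantiate

# Untested sketch, Hydra >= 1.1: the target is the function itself and
# _partial_ makes instantiate return functools.partial(fun1), i.e. a callable
# reference, instead of invoking fun1 immediately.
cfg = {"_target_": "foo.functional.fun1", "_partial_": True}
fun1_ref = instantiate(cfg)
print(fun1_ref(1))  # -> 1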

Cannot import a function from a module

Basically I have 3 modules that all communicate with each other and import each other's functions. I'm trying to import a function from my shigui.py module, which creates a GUI for the program. Now I have a function that gets the values of user entries in the GUI and I want to pass them to the other module. I'm trying to pass the function below:
def valueget():
    keywords = kw.get()
    delay = dlay.get()
    category = catg.get()
All imports go fine, up until I try to import this function with
from shigui import valueget in another module that would use the values. In fact, I can't import any function from this file into any other module. I should also add that they are in the same directory. I'd appreciate any help on this matter.
Well, I am not entirely sure of what imports what, but here is what I can tell you. Python can sometimes allow for circular dependencies. However, it depends on what the layout of your dependencies is. First and foremost, I would say see if there is any way you can avoid this happening (restructuring your code, etc.). If it is unavoidable then there is one thing you can try. When Python imports modules, it does so in order of code execution. This means that if you have a definition before an import, you can sometimes access the definition in the first module by importing that first module in the second module. Let me give an example. Consider you have two modules, A and B.
A:
def someFunc():
    # use B's functionality from before B's import of A
    pass

import B
B:
def otherFunc():
    # use A's functionality from before A's import of B
    pass

import A
In a situation like that, Python will allow this. However, everything after the imports is not always fair game so be careful. You can read up on Python's module system more if you want to know why this works.
Helpful, but not complete link: https://docs.python.org/3/tutorial/modules.html
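Another common way to sidestep the cycle, shown below as a sketch, is to defer the import until the function that needs it actually runs; by that point shigui has finished loading even if it imports this module at the top. The module name other_module.py and the function read_gui_values are placeholders, not names from the question.

# other_module.py (placeholder name)
def read_gui_values():
    # Importing at call time rather than at module load time breaks the
    # circular dependency: shigui is fully initialised before this runs.
    from shigui import valueget
    valueget()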

Using class as functions and "global" variables container: bad design?

I spent the last few months rewriting a new version of my Python algorithm from scratch. One of my goals was to write perfectly documented code, easy to read and understand for "anyone".
In the same project folder I put a lot of different modules, and each module contains a class. I use classes as containers for functions and related variables; that way a class holds all the functions for a specific task, for example writing all the output results of the algorithm to Excel files.
Here is an example:
Algorithm.py
import os
import pandas as pd
import numpy as np
from Observer import Observer

def main(hdf_path):
    for hdf_file in os.listdir(hdf_path):
        filename = str(hdf_file.replace('.hdf', '.xlsx'))
        Observer.create_workbook(filename)
        dataframe = pd.read_hdf(hdf_file)
        years_array = dataframe.index.levels[0].values
        for year in years_array:
            year_mean = np.mean(dataframe.loc[year].values)
            Observer.mean_values = np.append(Observer.mean_values, year_mean)
        Observer.export_results()

if __name__ == "__main__":
    hdf_path = 'bla/bla/bla/'
    main(hdf_path)
Observer.py
import numpy as np
import openpyxl

class Observer:
    workbook = None
    workbookname = None
    mean_values = np.array([])

    def create_workbook(filename):
        Observer.workbook = openpyxl.Workbook()
        Observer.workbookname = filename
        # do other things

    def save_workbook():
        Observer.workbook.save('results_path' + Observer.workbookname)

    def export_results():
        # print Observer.mean_values values in different workbook cells
        # export result on a specific sheet
I hope you can understand from this simple example how I use classes in my project. For every class I define a lot of variables (workbook, for example) and I access them from other modules as if they were global variables. That way I can easily reach them from anywhere and I don't need to pass them to functions explicitly, because I can simply write Classname.varname.
My question is: is it bad design? Will it create some problems or performance slowdown?
Thanks for your help.
My question is: is it bad design?
Yes.
I can simply write Classname.varname.
You are creating very strong coupling between classes when you enforce calling Classname.varname. The class that accesses this variable is now strongly coupled with Classname. This prevents you from changing the behavior in an OOP way by passing different parameters, and it will complicate testing of the class, since you will be unable to mock Classname and use the mock instead of the "real" class.
This will result in code duplication when you try to run two pieces of very similar code in two parts of your app which vary only in these parameters. You will end up creating two almost identical classes, one using the Workbook class and the other using the Notepad class.
And remember the vicious cycle:

Hard to test code -> Fear of refactor -> Sloppy code
        ^                                    |
        |                                    |
        -------------------------------------
Using proper objects, with the ability to mock them (and dependency injection), is going to guarantee your code is easily testable, and the rest will follow.
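To illustrate the alternative this answer is pointing at, here is a rough sketch of the same Observer written as an ordinary instantiable class that gets passed in. Names like ExcelObserver and results_path are illustrative, not from the original code.

import numpy as np
import openpyxl

class ExcelObserver:
    # Each instance owns its own workbook and accumulated results,
    # instead of sharing class-level "globals".
    def __init__(self, results_path):
        self.results_path = results_path
        self.workbook = openpyxl.Workbook()
        self.mean_values = np.array([])

    def add_mean(self, value):
        self.mean_values = np.append(self.mean_values, value)

    def save(self, filename):
        self.workbook.save(self.results_path + filename)

def main(hdf_path, observer):
    # `observer` is injected, so a test can pass a fake or mock object
    # with the same methods instead of touching real Excel files.
    ...

In a test you would call main with a lightweight fake observer, and two variants of the pipeline can share the same code simply by passing different observer objects.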

executing python code from string loaded into a module

I found the following code snippet that I can't seem to make work for my scenario (or any scenario at all):
def load(code):
    # Delete all local variables
    globals()['code'] = code
    del locals()['code']
    # Run the code
    exec(globals()['code'])
    # Delete any global variables we've added
    del globals()['load']
    del globals()['code']
    # Copy k so we can use it
    if 'k' in locals():
        globals()['k'] = locals()['k']
        del locals()['k']
    # Copy the rest of the variables
    for k in locals().keys():
        globals()[k] = locals()[k]
I created a file called "dynamic_module" and put this code in it, then tried to use it to execute the following code, which is a placeholder for a dynamically created string I would like to execute.
import random
import datetime

class MyClass(object):
    def main(self, a, b):
        r = random.Random(datetime.datetime.now().microsecond)
        a = r.randint(a, b)
        return a
Then I tried executing the following:
import dynamic_module
dynamic_module.load(code_string)
return_value = dynamic_module.MyClass().main(1,100)
When this runs it should return a random number between 1 and 100. However, I can't seem to get the initial snippet I found to work for even the simplest of code strings. I think part of my confusion in doing this is that I may misunderstand how globals and locals work and therefore how to properly fix the problems I'm encountering. I need the code string to use its own imports and variables and not have access to the ones where it is being run from, which is the reason I am going through this somewhat over-complicated method.
You should not be using the code you found. It has several big problems, not least that most of it doesn't actually do anything (locals() is a proxy, deleting from it has no effect on the actual locals, it puts any code you execute into the same shared globals, etc.).
Use the accepted answer in that post instead; recast as a function, it becomes:
import imp

def load_module_from_string(code, name='dynamic_module'):
    module = imp.new_module(name)
    exec(code, module.__dict__)
    return module
then just use that:
dynamic_module = load_module_from_string(code_string)
return_value = dynamic_module.MyClass().main(1, 100)
The function produces a new, clean module object.
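As a side note, the imp module is deprecated in modern Python (and removed in 3.12); a roughly equivalent sketch using the standard types module would be:

import types

def load_module_from_string(code, name='dynamic_module'):
    # types.ModuleType creates a fresh, empty module object; executing the
    # code string against its __dict__ keeps its imports and variables
    # isolated from the caller's globals.
    module = types.ModuleType(name)
    exec(code, module.__dict__)
    return module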
In general, this is not how you should dynamically import and use external modules. You should be using __import__ within your function to do this. Here's a simple example that worked for me:
import numpy as np

plt = __import__('matplotlib.pyplot', fromlist=['plt'])
plt.plot(np.arange(5), np.arange(5))
plt.show()
I imagine that for your specific application (loading from code string) it would be much easier to save the dynamically generated code string to a file (in a folder containing an __init__.py file) and then to call it using __import__. Then you could access all variables and functions of the code as parts of the imported module.
Unless I'm missing something?
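For completeness, the documented high-level wrapper around __import__ is importlib.import_module, which avoids the fromlist quirk; a minimal equivalent of the example above:

import importlib
import numpy as np

# import_module returns the submodule itself, no fromlist tricks needed
plt = importlib.import_module('matplotlib.pyplot')
plt.plot(np.arange(5), np.arange(5))
plt.show()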

Python namespace elements visibility (vs proper package structure)

Python3
I tried to find an answer but failed. First I'll present the snippet, then I'll explain why I wanted to do it this way and what I wanted to achieve. Maybe it'll turn out that this approach is "the bad one".
Hence this semi-double topic: first I'd like to know why this snippet isn't working, and second, I'd like to know whether this approach is right.
So:
class Namespace:
    def some_function():
        pass

    class SomeClass:
        fcnt = some_function
This won't work due to:
NameError: name 'some_function' is not defined
What I want to achieve is code and file structure readability.
The above example is a snippet of the kind I use (not this exact one, but it looks like this) in a Pyramid project.
My project tree looks like this:
my_project
├── models
│   ├── __init__.py
│   └── some_model.py
├── schemas
│   ├── __init__.py
│   ├── some_schema.py
│   └── some_other_schema.py
...
├── views
│   ├── __init__.py
│   └── some_view.py
└── __init__.py
What I wanted to achieve is clean schema/model/view importing.
The some_schema.py file contains class SomeSchema, and some_other_schema.py contains class SomeOtherSchema.
With the above snippet I can write:
from my_project.schemas.some_schema import Schema
and use it like Schema.SomeSchema()
I've got a little bit lost with packages and imports. How could one make a clean structure (one schema per file) and still be able to use a Schema namespace? (In C++ I'd just put each of those classes in a Schema namespace; that's why I did this in the snippet above. But! What works in C++ maybe shouldn't be used in Python, right?)
Thanks for answer in advance.
EDIT:
OK, I've done some testing (I thought I had already done it, but apparently not...).
Using from my_project.schemas.some_schema import Schema together with from my_project.schemas.some_other_schema import Schema causes the second import to shadow the first one. So while after the first import I could use x = Schema.SomeSchema(), after the second import I could not, because the class Schema gets overridden. Right, so as Erik said, classes aren't namespaces. GOT IT!
In my very first snippet, yes, I should've used fcnt = Namespace.some_function. What's weird is that it works. I have the same statement in my Pyramid code, with one difference: some_function has a @colander.deferred decorator. In fact it looks like this:
class Schema:
    @colander.deferred
    def deferred_some_function(node, kw):
        something = kw.get("something", [])
        return deform.widget.SelectWidget(values=something,
                                          multiple=True)

    class SomeSchema(colander.MappingSchema):
        somethings = colander.SchemaNode(colander.Set(),
                                         widget=Schema.deferred_some_function)
And I get NameError: name 'Schema' is not defined
Getting back to package format. With this:
### another/file.py
from foo.bar.schema import SomeSchema
# do something with SomeSchema:
smth = SomeSchema()
smth.fcnt()
I would have to make one module, foo/bar/schema.py, in which I'd have to put all my SomeXSchema classes. And if I have lots of them, then there's the unreadability glitch I wanted to get rid of by splitting the SomeXSchema classes - one per file. Can I accomplish this somehow? I want to call this class, for example, User. And here's the THING. Maybe I'm doing it wrong? I'd like to have a class named User in the schema namespace and a class named User in the model namespace. Shouldn't I? Maybe I ought to use a prefix, like class SchemaUser and class ModelUser? I wanted to avoid that by using modules/packages.
If I used import foo.bar.schema, then I'd have to use it like x = foo.bar.schema.User(), right? There is no way to use it like x = schema.User()? Sorry, I just got stuck. Maybe I need a little break to take a fresh look?
ANOTHER EDIT (FOR POINT 3 ONLY)
I did some more research. The answer here would be to make it like this:
## file: myproject/schemas/__init__.py
from .some_schema import SomeSchema
from .some_other_schema import SomeOtherSchema
then usage would be like this:
## some file using it
import myproject.schemas as schema
s1 = schema.SomeSchema()
s2 = schema.SomeOtherSchema()
Would it be lege artis?
If anyone thinks that topic should be changed - go ahead, give me something more meaningful, I'd appreciate it.
You are swimming upstream by trying to do what you are trying to do.
Classes are meant for defining new data types, not as a means to group related parts of code together. Modules are perfectly suited for that, and I presume you know that well because of the "(vs proper package structure)" part in the question title.
Modules can also be imported as objects, so to achieve what you want:
### foo/bar/schema.py
def some_function():
    pass

class SomeSchema:
    fcnt = some_function

### another/file.py
from foo.bar import schema

# do something with SomeSchema:
smth = schema.SomeSchema()
smth.fcnt()
...although it's also typical to import classes directly into the scope like this (i.e. being able to refer to SomeSchema after the import as opposed to schema.SomeSchema):
### another/file.py
from foo.bar.schema import SomeSchema
# do something with SomeSchema:
smth = SomeSchema()
smth.fcnt()
(Also note that module names should be lowercase as suggested by PEP8 and only class names should use PascalCase)
This, by the way, applies to programming in general, not just Python. There are a few languages such as Java and C# which require that functions be declared inside of classes as statics because they disallow writing of code outside of classes for some weird reason, but even these languages have modules/proper namespaces for structuring your code; i.e. classes are not normally put inside other classes (they sometimes are, but for wholly different reasons/goals than yours).
So basically "class" means a "type" or "a set of objects having similar behavior"; once you ignore that principle/definition, you're writing bad code by definition.
PS. if you are using Python 2.x, you should be inheriting your classes from object so as to get new-style classes.
PPS. in any case, even technically speaking, what you are trying to do won't work cleanly in Python:
class fake_namespace:
    def some_function():
        pass

    class RealClass:
        some_function  # <-- that name is not even visible here;
                       #     you'd have to use fake_namespace.some_function instead
...and this is the reason for the exception you reported getting: NameError: name 'some_function' is not defined.
EDIT AS PER YOUR EDITS:
I'm not really sure why you're making it so complicated; also some of your statements are false:
If I'd use : import foo.bar.schema then I'd have to use it like x = foo.bar.schema.User right?
No. Please learn how Python modules work.
I'd like to have class named User in Schema namespace and class named User in Model namespace. Shouldn't I? Maybe I ought to use prefix? Like class SchemaUser and class ModelUser
Please note that namespaces, a.k.a. modules, should be lowercase, not PascalCase.
An if I have lots of them, then there's the unreadabilty glitch which I wanted to get rid off by splitting SomeXSchema - one per file. Can I accomplish this somehow?
Yes; you can put your classes in individual submodules, e.g. schema/class1.py, schema/class2.py, etc.; then you can "collect" them in schema/__init__.py so that you can import them directly from schema:
# schema/__init__.py
from .class1 import Class1
from .class2 import Class2

__all__ = ["Class1", "Class2"]  # optional
General note: you can name your schema modules differently, e.g. schema1, schema2, etc; then you could just use them like this:
from somewhere import schema1
from somewhere_else import schema2
s1_user = schema1.User()
s2_user = schema2.User()
# etc
For more information on how Python modules work, refer to http://docs.python.org/2/tutorial/modules.html
Naming and binding
You can read the Python reference on naming and binding to understand how Python namespaces work.
A scope defines the visibility of a name within a block. If a local variable is defined in a block, its scope includes that block. If the definition occurs in a function block, the scope extends to any blocks contained within the defining one, unless a contained block introduces a different binding for the name. The scope of names defined in a class block is limited to the class block; it does not extend to the code blocks of methods - this includes generator expressions since they are implemented using a function scope.
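A tiny illustration of that last rule (the names below are made up for this example): a name defined in the class block is not visible inside method bodies unless you qualify it.

class C:
    x = 1

    def get_x_broken(self):
        return x      # NameError at call time: 'x' is not in scope here

    def get_x(self):
        return C.x    # qualify through the class (or use self.x)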
BTW, using globals() and locals() can help when debugging variable binding.
The User Problem
You can try this instead:
from model import User as modelUser
from foo.bar.schema import User as schemaUser
