Structuring ocean modelling code in Python

I am starting to use Python for numerical simulation, and in particular I am starting from this project to build my own, which will be more complicated since I will have to try a lot of different methods and configurations. I work full time on Fortran90 and Matlab codes, and those two languages are my "mother tongue". In both of them you are free to structure the code however you want, and I am trying to mimic that freedom because in my field (computational oceanography) things get complicated quickly. See as an example the code I work with daily, NEMO (here the main page, here the source code). The NEMO source code is conveniently divided into folders, each of which contains the modules and routines for a specific task (e.g. the domain discretisation routines are in the folder DOM, the vertical physics in ZDF, the lateral physics in LDF, and so on), because the processes involved (physical or purely mathematical) are completely different.
What I am trying to build is this:

/shallow_water_model
    create_conf.py          (creates a new subdirectory in /cfgs with a given name, like "caspian_sea" or "mediterranean_sea", and copies the content of the folder /src inside this new subdirectory to create a new configuration)
    /cfgs
        /caspian_sea        (example configuration)
        /mediterranean_sea  (example configuration)
    /src
        swm_main.py         (initializes a dictionary and calls the functions)
        swm_param.py        (fills the dictionary)
        /domain
            swm_grid.py     (creates a numerical grid)
        /dynamics
            swm_adv.py      (creates the advection matrix)
            swm_dif.py      (creates the diffusion matrix)
        /solver
            swm_rk4.py      (time stepping with Runge-Kutta 4)
            swm_pc.py       (time stepping with predictor-corrector)
        /IO
            swm_input.py    (handles netCDF input)
            sim_output.py   (handles netCDF output)
The script create_conf.py has the following structure; it is supposed to take a string from the terminal, create a folder with that name, and copy all the files and subdirectories of the /src folder into it, so one can then put all the input files of the configuration there and, if needed, modify the source code to create an ad-hoc version for that configuration. This duplication of the source code is common in the ocean modelling community, because two different configurations (like the Mediterranean Sea and the Caspian Sea) may differ not only in the input files (topography, coastlines, etc.) but also in the modelling itself, meaning the modifications you need to make to the source code for each configuration might be substantial. (Most ocean models let you put your own modified source files in specific folders and are instructed to overwrite the corresponding files at compilation. My code is going to be simple enough to just duplicate the source code.)
import os, sys
import shutil

def create_conf(conf_name="new_config"):
    cfg_dir = os.getcwd() + "/cfgs/"
    # Check if configuration exists
    try:
        os.makedirs(cfg_dir + conf_name)
        print("Configuration " + conf_name + " correctly created")
    except FileExistsError:
        # directory already exists
        # Handles overwriting, duplicates or stop
        # make a copy of "/src" into the new folder
        return

# This is supposed to be used directly from the terminal
if __name__ == '__main__':
    filename = sys.argv[1]
    create_conf(filename)
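For what it's worth, the copy step the comments describe could be filled in with shutil.copytree from the standard library. A minimal sketch, assuming the script is run from the /shallow_water_model root and choosing to simply stop if the configuration already exists:

import os, sys
import shutil

def create_conf(conf_name="new_config"):
    src_dir = os.path.join(os.getcwd(), "src")
    cfg_dir = os.path.join(os.getcwd(), "cfgs", conf_name)
    if os.path.isdir(cfg_dir):
        # configuration already exists: stop rather than overwrite
        print("Configuration " + conf_name + " already exists")
        return
    # copy the whole /src tree (subfolders included) into the new configuration
    shutil.copytree(src_dir, cfg_dir)
    print("Configuration " + conf_name + " correctly created")

if __name__ == '__main__':
    create_conf(sys.argv[1])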
The script swm_main.py can be thought of as a list of calls to the necessary routines, depending on the kind of processes you want to take into account, just like:
import numpy as np
from DOM.swm_domain import set_grid
from swm_param import set_param, set_timestep, set_viscosity
# initialize dictionary (i.e. structure) containing all the parameters of the run
global param
param = dict()
# define the parameters (i.e. call swm_param.py)
set_param(param)
# Create the grid
set_grid(param)
The two routines called here just take particular fields of param and assign them values, like:
import numpy as np
import os

def set_param(param):
    param['nx'] = 32    # number of grid points in x-direction
    param['ny'] = 32    # number of grid points in y-direction
    return param
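The other setters imported in swm_main.py (set_timestep, set_viscosity) are not shown in the question, but they would presumably follow the same pattern; a sketch with placeholder values:

def set_timestep(param):
    param['dt'] = 10.0        # time step in seconds (placeholder value)
    return param

def set_viscosity(param):
    param['nu'] = 1.0e-4      # horizontal eddy viscosity in m^2/s (placeholder value)
    return param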
Now, the main topic of discussion is how to achieve this kind of structure in Python. I almost always find source code that is either monolithic (all routines in the same file) or a sequence of files in the same folder. I want to have some better organisation, but the solution I found by browsing fills every subfolder of /src with a __pycache__ folder and requires me to put an __init__.py file in each folder. I don't know why, but these two things make me think there is something sloppy about this approach. Moreover, I need to import modules (like numpy) in every file, and I was wondering whether that is efficient or not.
What do you think would be the best way to achieve this structure while keeping it as simple as possible?
Thanks for your help

As I understand the actual question here is:
the solution I found browsing fills every subfolder in /src with a folder __pycache__ and I need to put a __init__.py file in each folder... this makes me think there is something sloppy in this approach.
There is nothing sloppy or unpythonic about making your code into packages. In order to be able to import from .py files in a directory, one of two conditions has to be satisfied:
the directory must be in your sys.path, or
the directory must be a package, and that package must be a sub-directory of some directory in your sys.path (or a sub-directory of a package which is a sub-directory of some directory in your sys.path)
The first approach is generally hacky in code (although often appropriate in tests) and involves modifying sys.path to add every directory you want. It is hacky because the whole point of putting your code inside a package is that the package structure encodes some natural division in the source: e.g. a package modeller is conceptually distinct from a package quickgui, and each could be used independently of the other in different programs.
The easiest[1] way to make a directory into a package is to place an __init__.py in it. The file should contain anything which belongs conceptually at the package level, i.e. not in modules. It may be appropriate to leave it empty, but it's often a good idea to import the public functions/classes/vars from your modules, so you can do from mypkg import thing rather than from mypkg.module import thing. Packages should be conceptually complete, which normally means you should be able (in theory) to use them from multiple places. Sometimes you don't want a separate package: you just want a naming convention, like gui_tools.py, gui_constants.py, model_tools.py, model_constants.py, etc. The __pycache__ folder is simply Python caching the bytecode to make future imports faster: you can relocate or suppress it, but the simplest thing is to add *__pycache__* to your .gitignore and forget about it.
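Applied to the layout in the question, the domain package might look like this (a sketch; the re-exported name follows the question's swm_grid.py):

# src/domain/__init__.py
from .swm_grid import set_grid      # re-export the public API at the package level

# src/swm_main.py (run from src/, so the domain package is importable)
from domain import set_grid         # instead of: from domain.swm_grid import set_grid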
Lastly, since you come from very different languages:
lots of Python code written by scientists (rather than programmers) is quite unpythonic IMHO. Billion-line-long single Python files are not good style[2]. Python prefers readability, always: call things derived_model, not dm1. If you do that you may well find you don't need as many dirs as you thought.
importing the same module in every file is a trivial cost: Python only actually imports it once, and every other import just binds another name to the entry in sys.modules (see the snippet after this list). Always import explicitly.
in general stop worrying about performance in python. Write your code as clearly as possible, then profile it if you need to, and find what is slow. Python is so high level that micro-optimisations learned in compiled languages will probably backfire.
lastly, and this is mostly personal, don't give folders/modules names in CAPITALS. FORTRAN might encourage that, and it was written on machines which often didn't have case sensitivity for filenames, but we no longer have those constraints. In Python we reserve capitals for constants, so I find it plain weird when I have to modify or execute something in capitals. Likewise 'DOM' made me think of the Document Object Model, which is probably not what you mean here.
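To see why repeated imports are cheap: after the first import the module lives in sys.modules, and any further import is just another name bound to that same object. A quick check:

import sys
import numpy as np          # first import: the module is loaded and cached
import numpy as np_again    # later imports: just another name for the cached module

assert np is np_again
assert sys.modules['numpy'] is np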
References
[1] Python does have implicit namespace packages but you are still better off with explicit packages to signal your intention to make a package (and to avoid various importing problems).
[2] See pep8 for some more conventions on how you structure things. I would also recommend looking at some decent general-purpose libraries to see how they do things: they tend to be written by mainstream programmers who focus on writing clean, maintainable code, rather than by scientists who focus on solving highly specific (and frequently very complicated) problems.

Related

Can a Python module be multiple files?

For years, I've known that the very definition of a Python module is as a separate file. In fact, even the official documentation states that "a module is a file containing Python definitions and statements". Yet, this online tutorial from people who seem pretty knowledgeable states that "a module usually corresponds to a single file". Where does the "usually" come from? Can a Python module consist of multiple files?
Not really.
Don't read too much into the phrasing of one short throwaway sentence, in a much larger blog post that concerns packaging and packages, both of which are by nature multi-file.
Imports do not make modules multifile
By the logic that modules are multi-file because of imports, almost any Python module is multi-file, unless the imports come from the module's own subtree, which makes no real discernible difference to code using the module. That notion of subtree imports, by the way, is relevant... to Python packages.
__module__, the attribute found on classes and functions, also maps to one file, as determined by import path.
The usefulness of expanding the definition of modules that way seems… limited, and risks confusion. Let imports be imports and modules be modules (i.e. files).
But that's like, my personal opinion.
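A quick illustration of that one-to-one mapping, using the standard library's os module:

import os

# a module object maps back to exactly one file
print(os.__file__)            # e.g. /usr/lib/python3.x/os.py

# functions and classes record the module (and hence the file) they were defined in
print(os.walk.__module__)     # 'os'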
Let's go all language lawyer on it
And refer to the Python tutorial. I figure it will be talking about modules at some point and will be much more careful in its wording than a blog post primarily concerned with another subject.
6. Modules
To support this, Python has a way to put definitions in a file and use them in a script or in an interactive instance of the interpreter. Such a file is called a module; definitions from a module can be imported into other modules or into the main module (the collection of variables that you have access to in a script executed at the top level and in calculator mode).
A module is a file containing Python definitions and statements. The file name is the module name with the suffix .py appended. Within a module, the module’s name (as a string) is available as the value of the global variable __name__.
p.s. OK, what about calling it a file, instead of a module, then?
That supposes that you store Python code in a file system. But you could have an exotic environment that stores it in a database instead (or embeds it in a larger C/Rust executable?). So "module" seems better understood as a "contiguous chunk of Python code". Usually that's a file, but having a separate term allows for flexibility, without changing anything in the core concepts.
Yup, a Python module can include more than one file. Basically, you would have one file for the main code of the module you are writing, and in that main file import some other tools you can use.
For example, you can have the file my_splitter_module.py, in which you have, say, a function that takes a list of integers and splits it in half, creating two lists. Now say you want to multiply all the numbers in the first half with each other ([1, 2, 3] -> 1 * 2 * 3), but sum the numbers in the other half ([1, 2, 3] -> 1 + 2 + 3). To keep the code tidy, you decide to write two more functions: one that takes a list and multiplies its items, and another that sums them.
Of course, you could write the two functions in the same my_splitter_module.py file, but in other situations, when you have big files with big classes, etc., you might prefer to create files like multiply_list.py and sum_list.py and then import them into my_splitter_module.py.
In the end, you would import my_splitter_module.py into your main.py file, and in doing so you would also be importing the multiply_list.py and sum_list.py files.
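A minimal sketch of that layout (the file names come from the answer above; the function names are made up here):

# multiply_list.py
from functools import reduce

def multiply_items(numbers):
    return reduce(lambda a, b: a * b, numbers, 1)

# sum_list.py
def sum_items(numbers):
    return sum(numbers)

# my_splitter_module.py
from multiply_list import multiply_items
from sum_list import sum_items

def split_and_combine(numbers):
    half = len(numbers) // 2
    return multiply_items(numbers[:half]), sum_items(numbers[half:])

# main.py
from my_splitter_module import split_and_combine
print(split_and_combine([1, 2, 3, 4, 5, 6]))   # (1*2*3, 4+5+6) -> (6, 15)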
Yes, sure.
If you create a folder named mylib in a directory that is on sys.path (or in the same directory as your script), you can then use import mylib.
Make sure to put an __init__.py in the folder, and in it import everything you need from the other files, because with import lib the variables, functions, etc. are only picked up from what __init__.py imports.
For example:
project -+- lib -+- __init__.py
         |       +- bar.py
         |       +- bar2.py
         |
         +- foo.py
__init__.py :

from .bar import test, random    # relative imports, since bar.py and bar2.py live inside the package
from .bar2 import sample

foo.py :

import lib

print(lib.test)    # names imported in lib/__init__.py are reached through the package
lib.sample()
Hope it helps.

Import from parent directory for a test sub-directory without using packaging, Python 2.7

TL;DR
For a fixed and unchangeable non-package directory structure like this:
some_dir/
    mod.py
    test/
        test_mod.py
        example_data.txt
what is a non-package way to enable test_mod.py to import from mod.py?
I am restricted to using Python 2.7 in this case.
I want to write a few tests for the functions in mod.py. I want to create a new directory test that sits alongside mod.py and inside it there is one test file test/test_mod.py which should import the functions from mod.py and test them.
Because of the well-known limitations of relative imports, which rely on package-based naming, you can't do this in a straightforward way. Yet all the advice on the topic suggests building the script as a package and then using relative imports, which is impossible for my use case.
In my case, it is not allowable for mod.py to be built as a package and I cannot require users of mod.py to install it. They are instead free to merely check the file out from version control and just begin using it however they wish, and I am not able to change that circumstance.
Given this, what is a way to provide a simple, straightforward test directory?
Note: not just a test file that sits alongside mod.py, but an actual test directory since there will be other assets like test data that come with it, and the sub-directory organization is critical.
I apologize if this is a duplicate, but out of the dozen or so permutations of this question I've seen in my research before posting, I haven't seen a single one that addresses how to do this. They all say to use packaging, which is not a permissible option for my case.
Based on @mgilson's comment, I added a file import_helper.py to the test directory.
some_dir/
    mod.py
    test/
        test_mod.py
        import_helper.py
        example_data.txt
Here is the content of import_helper.py:
import sys as _sys
import os.path as _ospath
import inspect as _inspect
from contextlib import contextmanager as _contextmanager

@_contextmanager
def enable_parent_import():
    path_appended = False
    try:
        current_file = _inspect.getfile(_inspect.currentframe())
        current_directory = _ospath.dirname(_ospath.abspath(current_file))
        parent_directory = _ospath.dirname(current_directory)
        _sys.path.insert(0, parent_directory)
        path_appended = True
        yield
    finally:
        if path_appended:
            _sys.path.pop(0)
and then in the import section of test_mod.py, prior to an attempt to import mod.py, I have added:
import unittest
from import_helper import enable_parent_import

with enable_parent_import():
    from mod import some_mod_function_to_test
It is unfortunate to have to manually fiddle with sys.path, but writing it as a context manager helps a little, and it restores sys.path to the state it was in before the parent-directory modification.
In order for this solution to scale across multiple instances of this problem (say tomorrow I am asked to write a widget.py module for some unrelated tasks and it also cannot be distributed as a package), I have to replicate my helper function and ensure a copy of it is distributed with any tests, or I have to write that small utility as a package, ensure it gets globally installed across my user base, and then maintain it going forward.
When you manage a lot of Python code internally at a company, one dysfunctional code distribution mode that often occurs is that "installing" some Python code becomes equivalent to checking the new version out from version control.
Since the code is often extremely localized and specific to a small set of tasks for a small subset of a larger team, maintaining the overhead for sharing code via packaging (even if it is a better idea in general) will simply never happen.
As a result, I feel the use case I describe above is extremely common in real-world Python, and it would be nice if some import tools added this functionality for modifying sys.path, with some sensible default choices (like adding the parent directory) made very easy.
That way you could rely on this at least being part of the standard library, and not needing to roll your own code and ensure it's either shipped with your tests or installed across your user base.
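For comparison, the same effect can be achieved without the helper module by adjusting sys.path inline at the top of test_mod.py (a sketch assuming the directory layout above; unlike the context manager, it does not restore sys.path afterwards):

# test/test_mod.py
import os
import sys

# put some_dir/ (the parent of test/) at the front of the module search path
sys.path.insert(0, os.path.dirname(os.path.dirname(os.path.abspath(__file__))))

from mod import some_mod_function_to_test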

Python 3 NOT updating variables in a nested package (”recursive” relative imports used)

I’m refactoring a large procedural program (implemented as many files in one folder) and using packages to group the files into an object oriented structure. This application uses tKinter (probable red herring) and is being developed using PyDev on Eclipse Kepler (on & for Win7).
It does use some classes but converting the package structure (see below) into classes is NOT the preferred solution (unless it’s the only reasonable way to get what I want).
At the bottom of a 4 level package nest I’ve defined a constant (“conA”), a function (“funcB”), and a variable (“varC”). Going from the bottom up, the __init__.py file of every level (inside the (nested) folder that implements the package) contains:
from .levelbelowModuleName import conA
from .levelbelowModuleName import funcB
from .levelbelowModuleName import varC
so that the “recursive” imports make the “level4” entities visible by their “level4” names at all levels.
In the higher level packages “read only” references to all entities do not cause warning/error messages. No problem so far.
But when trying to update “varC “ inside a higher level package I get two warnings: 1) an “unused import: varC” warning at the top of the file, and, 2) inside the updating function (which has a “global varC” statement) an “unused variable: varC” warning at the “varC =” line. OK so far since the two “varC”s aren’t being associated (but see nasty side issue below).
I thought (inferred from Chapter 23/24 of Lutz’s “Learning Python” book ) that imported names - at all levels - would refer to the same object (address) - and therefore that updating a variable that resides in a “nephew” (brothers child) package would work. Elevating “varC” to the nearest common ancestor package (immediate parent of the problem package, grandparent of the “nephew”) seems to be a valid workaround (it eliminates the warnings) – but doing this destroys the purpose of the object-oriented package structure.
Converting to absolute imports did not help. Renaming/aliasing “varC” (by using “as” on the imports) did not help.
BTW, The update line used in the higher level module is “varC = X.getTable()”; this returns a matrix (list of lists) of tKinter IntVars from a custom class.
The nasty side issue: a “read only” reference to “varC” anywhere in the problem file, e.g. “print(varC)” whether at the top of the file or inside the function, will eliminate both warnings, hiding the problem.
Is there a way of having my cake and eating it too? Is there a “non-class” way of having “varC” reside at level4 and still be updateable by the higher level packages? Or is using a common ancestor the only simple-to-understand ”non-class” approach that will work?
P.S. Of the related questions that were suggested when entering this question, none seem to apply. A similar, but simpler (not nested) question is:
How to change a module variable from another module?
Added 2015-03-27:
Here are two Eclipse screenshots of the actual files. (For those not familiar with Eclipse, __init__.py appears as a tab that has the package name.) The first shot shows the recursive imports (from bottom to top). The second shows the function, the "unused import" (yellow triangle) warning for "SiteSpecificReports". The line highlighted in blue is where the "unused variable" warning was (it has mysteriously disappeared).
Ignore everything to do with LongitudeReports, it's essentially a clone of MessageCountReports. For clarity, I've hidden all code that is not relevant to the problem (e.g. tKinter calls). Ignore the red "X" on the file names; all are tKinter "init time" type mismatches that disappear when the code is run (as per comment above "SiteSpecificReports" in __init__.py for MessageCountReports).
In the module hierarchy the problem file is highlighted in grey. "A_Mainline.py" is the execute point, everything below that is the code being refactored (some has already been moved into the packages above that file). Finally, excluding CZQX, all subpackages beneath "SiteSpecific" are placeholders and only contain an empty __init__.py file.
Updated 2015-10-23
The point behind all of this was to keep file sizes to a reasonable level by splitting each module into several source files.
The accepted answer (and its included link) provided the clue I needed. My problem was that when I refactored a file/module into several subfiles (each containing the variable definitions and the functions that modified them), I was thinking of each subfile as a class-like "black box" object instead of, somewhat more correctly, a simple "insert this file into the higher level file" command (like an editor "paste" command).
My thinking was that the recursive imports would, in effect, recursively promote the addresses of the lower level variables into the higher level "__init__.py" namespaces, thus making those variables visible to all other subfiles of the module (which would only reference those variables), and allowing me to have my cake (localized definitions) and eat it too (variables available at the topmost level). This approach has worked for me in other compiled languages, notably Ada 83.
When creating subfiles it seems that for a variable to be visible to other subfiles you need to define it in the topmost file instead of the bottommost one - which defeats the "object-ish" approach I was trying to use. Kind of a bummer that this approach doesn't work since the location of the variables makes it awkward to reuse a subfile in other modules. Converting each file to a class should accomplish what I was after - but that seems really pointless when all you need is the "insert this code block here" effect.
Anyway, the important thing is that everything's working now.
You're missing the distinction between names and values in python. Names live in namespaces and point to values. When you say:
from .blah import Foo
you're creating a new name in your current namespace that points to the same value that Foo points to in the blah namespace. If after that you say:
Foo = 1
That changes your local namespace's Foo to point to a 1, but it does nothing to blah.Foo -- you would have to explicitly say blah.Foo = 1 to change the name Foo that lives in blah.
This blog post is a good read to clarify.
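To make the distinction concrete, here is a small self-contained demonstration, with types.SimpleNamespace standing in for the imported module:

import types

# stand-in for a module named "blah" that defines Foo
blah = types.SimpleNamespace(Foo=0)

# "from blah import Foo" essentially does this: bind a local name to the same value
Foo = blah.Foo

Foo = 1                 # rebinds only the local name...
print(blah.Foo)         # ...prints 0: the name inside "blah" is untouched

blah.Foo = 1            # assigning through the owning namespace...
print(blah.Foo)         # ...prints 1: now blah's Foo points to the new value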

Where to Store Borrowed Python Code?

Recently, I have been working on a Python project with the usual directory structure, and I have received help from someone else who has given me a code snippet (a single function definition, about 30 lines long) which I would like to import into my code. What is the most appropriate directory/location in a Python project to store borrowed code of this size? Is it best to store the snippet in an entirely separate module and import it from there?
I generally find it easiest to put such code in a separate file, because for clarity you don't want more than one different copyright/licensing term to apply within a single file. So in Python this does indeed mean a separate module. Then the file can contain whatever attribution and other legal boilerplate you need.
As long as your file headers don't accidentally claim copyright on something to which you do not own the copyright, I don't think it's actually a legal problem to mix externally-licensed or public domain code into files you mostly own. I may be wrong, though, which is why I normally avoid giving myself reason to think about it. A comment saying "this is external code from the following source with the following license:" may well be clearer than dividing code into different files that naturally wouldn't be. So I do occasionally do that.
I don't see any definite need for a separate directory (or package) per separate external source. If that's already part of your project structure (that is, it already uses external libraries by incorporating their source) then I suppose you might as well continue the trend.
I usually place scripts I copy off the internet in a folder/package called borrowed, so I know that all of the code there is stuff I didn't write myself.
That is, if it's something more substantial than a one or two-liner demonstrating how something works.
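A sketch of how that might look in practice (the file name, source, and licence fields here are purely illustrative placeholders):

# borrowed/external_helpers.py  (illustrative name)
#
# Keeping borrowed code in its own module keeps its attribution and licence
# terms separate from the rest of the project.
#
# Source:  <URL the snippet came from>
# Author:  <original author>
# Licence: <the snippet's licence, reproduced or referenced here>

def borrowed_function(values):
    """The ~30-line helper received from someone else would live here."""
    return sorted(values)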

Populating Factory using Metaclasses in Python

Obviously, registering classes in Python is a major use-case for metaclasses. In this case, I've got a serialization module that currently uses dynamic imports to create classes and I'd prefer to replace that with a factory pattern.
So basically, it does this:
data = ...  # (generic object built from the serial data)
moduleName = data.getModule()
className = data.getClass()
aModule = __import__(moduleName)
aClass = getattr(aModule, className)
But I want it to do this:
data = ...  # (generic object built from the serial data)
classKey = data.getFactoryKey()
aClass = factory.getClass(classKey)
However, there's a hitch: if I make the factory rely on metaclasses, the factory only learns about the existence of classes after their modules are imported (i.e., they're registered at module import time). So to populate the factory, I'd have to either:
manually import all related modules (which would really defeat the purpose of having metaclasses automatically register things...) or
automatically import everything in the whole project (which strikes me as incredibly clunky and ham-fisted).
Out of these options, just registering the classes directly into a factory seems like the best option. Has anyone found a better solution that I'm just not seeing? One option might be to automatically generate the imports required in the factory module by traversing the project files, but unless you do that with a commit-hook, you run the risk of your factory getting out of date.
Update:
I have posted a self-answer, to close this off. If anyone knows a good way to traverse all Python modules across nested subpackages in a way that will never hit a cycle, I will gladly accept that answer rather than this one. The main problem I see happening is:
\A.py (import Sub.S2)
\Sub\S1.py (import A)
\Sub\S2.py
\Sub\S3.py (import Sub.S2)
When you try to import S3, it first needs to import Main (otherwise it won't know what a Sub is). At that point, it tries to import A. While there, the __init__.py is called, and tries to register A. At this point, A tries to import S1. Since the __init__.py in Sub is hit, it tries to import S1, S2, and S3. However, S1 wants to import A (which does not yet exist, as it is in the process of being imported)! So that import fails. You can switch how the traversal occurs (i.e., depth first rather than breadth first), but you hit the same issues.
Any insight on a good traversal approach for this would be very helpful. A two-stage approach can probably solve it (i.e., traverse to get all module references, then import as a flat batch). However, I am not quite sure of the best way to handle the final stage (i.e., to know when you are done traversing and then import everything).
My big restriction is that I do not want to have a super-package to deal with (i.e., an extra directory under Sub and A). If I had that, it could kick off traversal, but everything would need to import relative to that for no good reason (i.e., all imports longer by an extra directory). Thus far, adding a special function call to sitecustomize.py seems like my only option (I set the root directory for the package development in that file anyway).
The solution I found was to do all imports on the package relative to a particular base directory, and to have special __init__.py functions in all of the packages that might contain modules with classes I'd want registered. So basically, if you import any module, it first has to import the base directory and then proceeds to walk every package (i.e., folder) that has a similar __init__.py file.
The downside of this approach is that the same modules are sometimes imported multiple times, which is annoying if anyone leaves code with side effects in a module import. However, that's bad either way. Unfortunately, some major packages (cough, cough: Flask) have serious complaints with IDLE if you do this (IDLE just restarts, rather than doing anything). The other downside is that because modules import each other, sometimes it attempts to import a module that it is already in the process of importing (an easily caught error, but one I'm still trying to stamp out). It's not ideal, but it does get the job done. Additional details on the more specific issue are attached, and if anyone can offer a better answer, I will gladly accept it.
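For reference, here is a minimal sketch of the metaclass-based registration the question starts from (Python 3 syntax; the class names, the factory_key attribute, and the registry dict are all illustrative, not taken from the original code):

factory_registry = {}

class RegisteringMeta(type):
    """Register every class that declares a factory_key as soon as it is defined."""
    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        key = namespace.get("factory_key")
        if key is not None:
            factory_registry[key] = cls

class Serializable(metaclass=RegisteringMeta):
    factory_key = None            # the base class itself is not registered

class Widget(Serializable):
    factory_key = "widget"        # registered the moment this module is imported

def get_class(key):
    return factory_registry[key]

# The catch discussed above: Widget only appears in the registry once the
# module defining it has actually been imported somewhere.
assert get_class("widget") is Widget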
