I am building a program to run several different analyses on a dataset. The different kinds of analysis are each represented by a different kind of analysis tool object (e.g. "AnalysisType1" and "AnalysisType2"). The analysis tools share many of the same parameters. The program is operated from a GUI, in which all the parameters are set by the user. What I'm trying to figure out is the most elegant way to share the parameters between all the components of the program. Options I can think of include:
1. Keep all the parameters in the GUI, and pass them to each analysis tool when it is executed.
2. Keep parameters in each of the tools, and update the parameters in all the tools every time they are changed in the GUI. Then they are ready to go whenever an analysis is executed.
3. Create a ParameterSet object that holds all the parameters for all the components. Give a reference to this ParameterSet object to every component that needs it, and update its parameters whenever they are changed in the GUI.
I've already tried #1, followed by #2, and as the complexity is growing, I'm considering moving to #3. Are there any reasons not to take this approach?
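For concreteness, option 3 would look roughly like this minimal sketch (ParameterSet and the parameter names are placeholders, not my real code):

    # One ParameterSet instance is shared by reference; the GUI writes to it
    # and every tool reads the current values when it runs.
    class ParameterSet:
        def __init__(self):
            self.threshold = 0.5       # placeholder parameters
            self.window_size = 100

    class AnalysisType1:
        def __init__(self, params):
            self.params = params       # shared reference, never copied

        def run(self, dataset):
            return [x for x in dataset if x > self.params.threshold]

    params = ParameterSet()
    tool = AnalysisType1(params)
    params.threshold = 0.8             # a GUI callback updates the shared object
    result = tool.run([0.2, 0.9, 1.5]) # the tool sees the new value automatically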
How about creating a parent class for all the analysis types that has the common attributes (maybe static/class-level) and methods?
That way, when you implement a new AnalysisType, you inherit all the parameters and can change them in a single place.
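A minimal sketch of that idea, with the shared parameters kept as class attributes on a common base (all names here are illustrative):

    # Shared parameters live on the base class, so changing them in one place
    # is seen by every analysis type.
    class AnalysisBase:
        threshold = 0.5                # shared defaults
        window_size = 100

        @classmethod
        def set_parameters(cls, **kwargs):
            for name, value in kwargs.items():
                setattr(cls, name, value)

    class AnalysisType1(AnalysisBase):
        def run(self, dataset):
            return [x for x in dataset if x > self.threshold]

    class AnalysisType2(AnalysisBase):
        def run(self, dataset):
            return sum(dataset[:self.window_size])

    AnalysisBase.set_parameters(threshold=0.8)   # GUI change, seen by both types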
I'm working with a large existing Python code base, which has an internal graph model, with nodes and edges being regular Python classes. I'd like to optimize the memory footprint by converting these to slotted classes -- currently, the memory usage is creating severe issues.
I think using slots may help, as there are only a few dozen classes but hundreds of thousands of instances of these classes making up the graph model.
To that end, I have a couple of questions:
How do I get the overall application memory usage? I'm using psutil.Process().memory_info().rss - is that the preferred option?
How do I know which specific classes to focus on when adding slots? Ideally I'd like a tool or report that can show number of instances x memory per instance for all user-defined classes. I have been trying out Pympler, but that requires adding tracking code for each class individually.
In both of the above, I'd like to know if there are better approaches that I may have missed.
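For reference, a crude report along these lines can be assembled from psutil plus the standard library (the module-name filter below is just a placeholder for however the model classes are identified):

    # Process RSS via psutil, plus instance counts and shallow sizes per class
    # via gc/sys. sys.getsizeof() ignores referenced objects, so the byte totals
    # are only a lower bound for picking __slots__ candidates.
    import gc
    import sys
    from collections import Counter

    import psutil

    print("RSS bytes:", psutil.Process().memory_info().rss)

    counts = Counter()
    shallow_bytes = Counter()
    for obj in gc.get_objects():
        cls = type(obj)
        if getattr(cls, "__module__", "").startswith("mygraphmodel"):  # placeholder package name
            counts[cls.__qualname__] += 1
            shallow_bytes[cls.__qualname__] += sys.getsizeof(obj)

    for name, n in counts.most_common(20):
        print(name, n, "instances,", shallow_bytes[name], "bytes shallow")

(Pympler's muppy/summary modules can reportedly produce a similar whole-heap summary without per-class trackers.)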
I have an object that I created with Blender's "pixelate" (advanced) object function; this produced what looked like thousands of duplicates of a single cube.
Having exported and then re-imported it, I ended up with a single object consisting of some 18,000 cubes.
This has had different materials added to many of the cubes.
The aim is to split the object into "layers" of all the cubes that are at the same height, while retaining their materials
I have tried a number of things like boolean operations, but that has been prohibitively slow and hasn't always kept the materials.
In addition there are some 70+ layers, so manually creating the layers might be somewhat tedious....
Ideally I'd like to write some kind of script that would filter out one layer at a time and export them (with materials) so they can be rendered as 2D images...
The Python documentation for Blender initially seems somewhat opaque, probably due to the very large size of the API (where do you start!).
Can anyone help with at least some of the steps I might need to write this script? I'm having problems gaining any kind of traction.
In the end I used a partially automated, partially manual method to do what I wanted. Thanks to #keltar I found the Info window and the commands I needed, picking one up from the Python commands in the menu popup.
>>> def dolayer(name):
...     bpy.ops.mesh.select_linked(delimit={'SEAM'})
...     bpy.ops.mesh.separate(type='SELECTED')
...     bpy.data.objects[name].hide = True
...
>>> dolayer('Cube.018')
>>> dolayer('Cube.019')
I selected the next layer with box select, being sure to turn off "limit selection to visible"! Then I simply provide the dolayer function with what will be the new name for the split object (this lets you hide it); the up cursor key is your friend here!
The minimal amount of automation made it practical to separate out 72 layers into separate objects; this allows me to hide different layers and show only the ones I want for each step of the build....
I had completely missed the Info window, which should make scripting very much more accessible!
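For anyone wanting to take it further, here is a rough sketch of scripting the rendering step too (untested; it assumes the separated layer objects are all named 'Cube.*' and uses the same 2.7x-era .hide API as above, plus a made-up output path):

    # Hide everything, then show and render one separated layer object at a time.
    import bpy

    layer_objs = sorted(
        [obj for obj in bpy.data.objects
         if obj.type == 'MESH' and obj.name.startswith('Cube')],   # assumed naming
        key=lambda o: o.name)

    for i, layer in enumerate(layer_objs):
        for obj in layer_objs:
            obj.hide = obj.hide_render = (obj is not layer)   # show only this layer
        bpy.context.scene.render.filepath = "//layer_%03d.png" % i   # assumed path
        bpy.ops.render.render(write_still=True)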
We're trying to assess the feasibility of this idea:
We have a pretty deep stack of HasTraits objects in a modeling program. For example, if we are modeling two materials, we could access various attributes on these with:
Layer.Material1.Shell.index_of_refraction
Layer.Material5.Medium.index_of_refraction
We've used this code for simulations, where we merely increment the values of a trait. For example, we could run a simulation where the index_of_refraction of one of these materials varies from 1.3 to 1.6 over 10 iterations. It actually is working quite nicely.
The problem is in selecting the desired traits for the simulation. Users aren't going to know all of these trait variable names, so we wanted to present a hierarchical/tree view of the entire trait structure of the program. For the above two traits, it might look like:
Layer
  - Material1
    - Shell
      - index_of_refraction
  - Material5
    - Medium
      - index_of_refraction
Etc...
I know that traitsui supports TreeEditors, but are there any examples of building a TreeEditor based on the inspection of a HasTraits stack like this? What is the most straightforward way to get the Stack of traits from an object? Essentially, is this idea feasible or should I go back to the drawing board?
Thanks
The ValueEditor does this. You can take a look at how it configures the TreeEditor to do this here:
https://github.com/enthought/traitsui/blob/master/traitsui/value_tree.py
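A minimal sketch of using it, assuming `layer` is the root HasTraits object from your question (the TraitBrowser wrapper name is just illustrative):

    # ValueEditor builds the trait tree automatically; no manual TreeEditor setup.
    from traits.api import HasTraits, Instance
    from traitsui.api import Item, ValueEditor, View

    class TraitBrowser(HasTraits):
        root = Instance(HasTraits)

        traits_view = View(
            Item('root', editor=ValueEditor(), show_label=False),
            title='Trait tree',
            resizable=True,
        )

    # TraitBrowser(root=layer).configure_traits()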
Followup Discussion
Robert, imagine I had a custom TreeEditor. It doesn't seem to let me use it directly:
Item('myitem', editor=TreeEditor())
I get:
traits.trait_errors.TraitError: The 'adapter' trait of an ITreeNodeAdapterBridge instance must be an implementor of, or can be adapted to implement, ITreeNode or None, but a value of [<pame.gensim.LayerSimulation object at 0x7fb623bf0830>] <class 'traits.trait_handlers.TraitListObject'> was specified.
I've tried this with _ValueTree, ValueTree, value_tree_editor, value_tree_editor_with_root, _ValueEditor and ValueEditor.
The only one that works is ValueEditor; therefore, even though I can understand how to subclass TraitsNode, it doesn't seem like it's going to work unless I hook everything up through an EditorFactory. I.e. the behavior we want to customize is all the way down in TreeEditor, and that's buried under _ValueEditor, ValueEditor, EditorFactory, etc...
Does this make any sense?
I have a completed program that does the following:
1) Reads formatted data (a sequence of numbers and associated labels) from a serial port in real time.
2) Does minor manipulations to the data.
3) Plots the data in real time in a GUI I wrote using PyQt.
4) Updates data stats in the GUI.
5) Allows post-analysis of the data after collection is stopped.
There are two dialogs (separate classes) that are called from within the main window in order to select certain preferences in plotting and statistics.
My question is the following: Right now my data is read in and declared as several global variables that are appended to as data comes in 20x per second or so - a 2d list of values for the numerical values and 1d lists for the various associated text values. Would it be better to create a class in which to store data and its various attributes, and then to use instances of this data class to make everything else happen - like the plotting of the data and the statistics associated with it?
I have a hunch that the answer is yes, but I need a bit of guidance on how to make this happen if it is the best way forward. For instance, would every single datum be a new instance of the data class? Would I then pass them one by one or as a list of instances to the other classes and to methods? How should the passing most elegantly be done?
If I'm not being specific enough, please let me know what other information would help me get a good answer.
A reasonably good rule of thumb is that if what you are doing needs more than 20 lines of code, it is worth considering an object-oriented design rather than global variables, and if you get to 100 lines you should already be using classes. The purists will probably say never to use globals, but IMHO for a simple linear script that is probably overkill.
Be warned that you will probably get a lot of answers expressing horror that you are not doing so already.
There are some really good (and some of them free) books that introduce object-oriented programming in Python; a quick Google search should provide the help you need.
Added Comments to the answer to preserve them:
So at 741 lines, I'll take that as a yes to OOP:) So specifically on the data class. Is it correct to create a new instance of the data class 20x per second as data strings come in, or is it more appropriate to append to some data list of an existing instance of the class? Or is there no clear preference either way? – TimoB
I would append/extend your existing instance. – seth
I think I see the light now. I can instantiate the data class when the "start data" button is pressed, and append to that instance in the subsequent thread that does the serial reading. THANKS! – TimoB
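A minimal sketch of the approach settled on in those comments (the class and method names are made up, not from the original program):

    # One DataRecord instance is created when acquisition starts; the serial
    # reader thread appends each incoming sample to it, and the plotting and
    # statistics code read from the same instance instead of globals.
    import threading

    class DataRecord:
        def __init__(self):
            self.values = []           # 2-D list: one row of numbers per sample
            self.labels = []           # 1-D list of associated text labels
            self._lock = threading.Lock()

        def add_sample(self, numbers, label):
            with self._lock:
                self.values.append(list(numbers))
                self.labels.append(label)

        def latest(self, n=100):
            with self._lock:
                return self.values[-n:], self.labels[-n:]

    # In the GUI: record = DataRecord() when "start data" is pressed, then pass
    # `record` to the reader thread, the plot widget and the dialogs.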
I have a scientific data management problem which seems general, but I can't find an existing solution or even a description of it, which I have long puzzled over. I am about to embark on a major rewrite (python) but I thought I'd cast about one last time for existing solutions, so I can scrap my own and get back to the biology, or at least learn some appropriate language for better googling.
The problem:
I have expensive (hours to days to calculate) and big (GBs) data attributes that are typically built as transformations of one or more other data attributes. I need to keep track of exactly how this data is built so I can reuse it as input for another transformation if it fits the problem (built with the right specification values) or construct new data as needed. Although it shouldn't matter, I typically start with 'value-added', somewhat heterogeneous molecular biology info, for example genomes with genes and proteins annotated by other processes by other researchers. I need to combine and compare these data to make my own inferences. A number of intermediate steps are often required, and these can be expensive. In addition, the end results can become the input for additional transformations. All of these transformations can be done in multiple ways: restricting with different initial data (e.g. using different organisms), by using different parameter values in the same inferences, or by using different inference models, etc. The analyses change frequently and build on others in unplanned ways. I need to know what data I have (what parameters or specifications fully define it), both so I can reuse it if appropriate, as well as for general scientific integrity.
My efforts in general:
I design my python classes with the problem of description in mind. All data attributes built by a class object are described by a single set of parameter values. I call these defining parameters or specifications the 'def_specs', and these def_specs with their values the 'shape' of the data atts. The entire global parameter state for the process might be quite large (eg a hundred parameters), but the data atts provided by any one class require only a small number of these, at least directly. The goal is to check whether previously built data atts are appropriate by testing if their shape is a subset of the global parameter state.
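In code, the subset test itself is simple; a toy illustration with made-up parameter names:

    def shape_is_subset(shape, global_state):
        """True if every def_spec in `shape` has the same value in the global state."""
        return all(global_state.get(key) == value for key, value in shape.items())

    shape = {"organism": "E. coli", "model": "hmm_v2"}
    state = {"organism": "E. coli", "model": "hmm_v2", "e_value_cutoff": 1e-5}
    assert shape_is_subset(shape, state)    # the cached data att can be reused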
Within a class it is easy to find the needed def_specs that define the shape by examining the code. The rub arises when a module needs a data att from another module. These data atts will have their own shape, perhaps passed as args by the calling object, but more often filtered from the global parameter state. The calling class should be augmented with the shape of its dependencies in order to maintain a complete description of its data atts.
In theory this could be done manually by examining the dependency graph, but this graph can get deep, and there are many modules, which I am constantly changing and adding, and ... I'm too lazy and careless to do it by hand.
So, the program dynamically discovers the complete shape of the data atts by tracking calls to other classes' attributes and pushing their shape back up to the caller(s) through a managed stack of __get__ calls. As I rewrite I find that I need to strictly control attribute access to my builder classes to prevent arbitrary info from influencing the data atts. Fortunately Python makes this easy with descriptors.
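A stripped-down sketch of how such a descriptor and build stack could fit together (all names are illustrative, and the real version also has to pull def_specs from the global parameter state):

    import contextlib

    _build_stack = []              # shapes of data atts currently under construction

    class Shaped:
        """Data descriptor that propagates a data att's shape to its callers."""

        def __set_name__(self, owner, name):
            self.name = name

        def __set__(self, obj, value_and_shape):
            # store the value together with the def_specs that define it
            obj.__dict__[self.name] = value_and_shape

        def __get__(self, obj, objtype=None):
            if obj is None:
                return self
            value, shape = obj.__dict__[self.name]
            if _build_stack:                    # a caller is building something
                _build_stack[-1].update(shape)  # push our shape up to it
            return value

    @contextlib.contextmanager
    def building(shape):
        """Collect the shape of every tracked attribute read inside the block."""
        _build_stack.append(shape)
        try:
            yield shape
        finally:
            _build_stack.pop()

A builder then wraps the construction of each new data att in "with building(new_shape):" and ends up with the combined shape of everything it read along the way.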
I store the shape of the data atts in a db so that I can query whether appropriate data (i.e. its shape is a subset of the current parameter state) already exists. In my rewrite I am moving from mysql via the great SQLAlchemy to an object db (ZODB or couchdb?) as the table for each class has to be altered when additional def_specs are discovered, which is a pain, and because some of the def_specs are python lists or dicts, which are a pain to translate to sql.
I don't think this data management can be separated from my data transformation code because of the need for strict attribute control, though I am trying to do so as much as possible. I can use existing classes by wrapping them with a class that provides their def_specs as class attributes, and db management via descriptors, but these classes are terminal in that no further discovery of additional dependency shape can take place.
If the data management cannot easily be separated from the data construction, I guess it is unlikely that there is an out of the box solution but a thousand specific ones. Perhaps there is an applicable pattern? I'd appreciate any hints at how to go about looking or better describing the problem. To me it seems a general issue, though managing deeply layered data is perhaps at odds with the prevailing winds of the web.
I don't have specific python-related suggestions for you, but here are a few thoughts:
You're encountering a common challenge in bioinformatics. The data is large, heterogeneous, and comes in constantly changing formats as new technologies are introduced. My advice is to not overthink your pipelines, as they're likely to be changing tomorrow. Choose a few well defined file formats, and massage incoming data into those formats as often as possible. In my experience, it's also usually best to have loosely coupled tools that do one thing well, so that you can chain them together for different analyses quickly.
You might also consider taking a version of this question over to the bioinformatics stack exchange at http://biostar.stackexchange.com/
ZODB was not designed to handle massive data; it is aimed at web-based applications, and in any case it is a flat-file based database.
I recommend trying PyTables, a Python library for handling HDF5 files, a format used in astronomy and physics to store the results of big calculations and simulations. It can be used as a hierarchical database and also has an efficient way to pickle Python objects. Incidentally, the author of PyTables has explained that ZODB was too slow for what he needed to do, and I can confirm that. If you are interested in HDF5, there is also another library, h5py.
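A minimal sketch of that idea (the file layout, group name and def_specs attribute are illustrative, not a prescribed schema):

    # Store each expensive derived array under a hierarchical path and attach
    # its defining parameters as a node attribute so it can be found and reused.
    import numpy as np
    import tables

    with tables.open_file("results.h5", mode="a") as h5:
        grp = h5.create_group("/", "alignment_v1")
        scores = h5.create_array(grp, "scores", np.random.rand(1000, 1000))
        scores.attrs.def_specs = {"organism": "E. coli", "model": "hmm_v2"}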
As a tool for managing and versioning the different calculations you run, you can try Sumatra, which is something like an extension to git/trac but designed for simulations.
You should ask this question on Biostar; you will find better answers there.