python `asyncio` event loops: how to integrate other, foreign loops?

At the moment I'm struggling a bit with Python asyncio, and with event loops in general. It's probably a rather uncommon experiment, though: I'm trying to find out whether I could implement my own event loop (i.e. subclassing asyncio.AbstractEventLoop or similar) that allows me to 'plug' other main loops into it (e.g. the native main loop of GTK/GLib or of another UI toolkit).
I could then instantiate and run my loop and use it as usual, e.g. with async/await syntax. Beyond that, I could add and remove "other main loops" to it, and it would process those as well. For example, I could add the GLib loop to it, so I can use async functions in my GTK project. Maybe even other loops alongside it.
So, this surely needs some glue code for each kind of "other loop", which implements an interface that I have to define and takes care of processing that particular loop when I add it to "my loop". I want this interface to be versatile, i.e. it should be possible to plug in not only GLib's loop but all kinds of other ones.
I'm not sure what that interface should actually look like, or how it should interact with my loop implementation. Is there a common pattern or idea for integrating main loops that works for GLib and lots of other ones?
It should also be resource efficient. Otherwise I could just have a while True loop inside run_forever which constantly checks for tasks to execute (and executes them), and constantly calls a particular method of my "other loop" interface, say ForeignLoop.process(self), which could then e.g. call gtk_loop.get_context().iteration(False) for GTK. This would keep one CPU core constantly busy.
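Just to make that naive idea concrete, here is a rough sketch of what I mean; ForeignLoop and its process() method are placeholder names for the interface I would have to define, not anything that exists:

class ForeignLoop:
    def process(self):
        """Run one non-blocking iteration of the wrapped main loop."""
        raise NotImplementedError

class GLibForeignLoop(ForeignLoop):
    """Glue code for GLib; assumes a PyGObject GLib.MainLoop is passed in."""
    def __init__(self, glib_main_loop):
        self._context = glib_main_loop.get_context()

    def process(self):
        self._context.iteration(False)  # False = do not block

# Inside my loop's run_forever(), the naive version would boil down to:
#     while not stopping:
#         run_ready_asyncio_callbacks()   # placeholder for asyncio's own work
#         for foreign in self._foreign_loops:
#             foreign.process()
# which is exactly the "one CPU core constantly busy" problem.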
So my questions are: Are there better ways to implement my loop idea? Do you think it is possible (without an insane bunch of code, which maybe is even hard to maintain)?
There are already some projects out there that have at least GLib loop integration: there is https://github.com/beeware/gbulb and https://github.com/python-trio/trio, but it will take me ages to understand what they do, since they do a lot of other things as well. There is also https://github.com/jhenstridge/asyncio-glib. That one is much more compact and looks interesting. Unfortunately, I don't fully understand it either so far; it does some things I cannot find much documentation about. What is its basic mechanism? It looks like it works with UNIX select (as the whole default event loop implementation does), but how is that wired to GLib's main loop? And is that a common approach or a very GTK-specific trick?
It turned out to be important to note: my question is not whether the idea itself is useful or not. Unless there is a very significant reason to consider it not useful, at least. :)

Related

TDD in Python - should we test helper functions?

A bit of a theoretical question that comes up with Python, since we can access almost anything we want, even if it is prefixed with an underscore to signal that it is "private".
def main_function():
    _helper_function()
    ...
    _other_helper_function()
Doing it with TDD, you follow the Red-Green-Refactor cycle. A test looks like this now:
def test_main_function_for_something_only_helper_function_does():
    # tedious setup
    ...
    main_function()
    assert something
The problem is that my main_function has so many setup steps that I've decided to test the helper functions for those specific cases:
from main_package import _helper_function

def test_helper_function_works_for_this_specific_input():
    # no tedious setup
    ...
    _helper_function(some_input)
    assert ...  # the helper function does exactly what I expect
But this seems to be a bad practice. Should I even "know" about any inner/helper functions?
I refactored the main function to be more readable by moving parts out into these helper functions. So I've rewritten the tests to actually test these smaller parts, and created another test that checks the main function indeed calls them. This also seems counter-productive.
On the other hand, I dislike the idea of a lot of lingering inner/helper functions with no dedicated unit tests, only happy-path-like ones for the main function. I guess if I had covered the original function before the refactoring, my old tests would have been good enough.
Also, if the main function breaks, this would mean many additional tests for the helpers break too.
What is the better practice to follow?
The problem is that my main_function has so many setup steps that I've decided to test the helper functions for those specific cases
Excellent, that's exactly what's supposed to happen (the tests "driving" you to decompose the whole into smaller pieces that are easier to test).
Should I even "know" about any inner/helper functions?
Tradeoffs.
Yes, part of the point of modules is that they afford information hiding, allowing you to later change how the code does something without impacting clients, including test clients.
But also there are benefits to testing the internal modules directly; test design becomes simpler, with less coupling to irrelevant details. Fewer tests are coupled to each decision, which means that the blast radius is smaller when you need to change one of them.
My usual thinking goes like this: I should know that there are testable inner modules, and I can know that an outer module behaves like it is coupled to an inner module, but I shouldn't necessarily know that the outer module is coupled to the inner module.
assert X.result(A,B) == Y.sort([C,D,E])
If you squint at this, you'll see that it implies that X.result and Y.sort have some common requirement today, but it doesn't necessarily promise that X.result calls Y.sort.
So I've rewritten tests to actually test these smaller parts and created another test that the main function indeed calls them. This also seems counter-productive.
A works, and B works, and C works, and now here you are writing a test for f(A,B,C).... yeah, things go sideways.
The desired outcome of TDD is "Clean code that works" (Jeffries); and the truth of things is that you can get clean code that works without writing every test in the world.
Tests are most important in code where faults are most probable - straight line code where we are just wiring things together doesn't benefit nearly as much from the red-green-refactor cycle as code that has a lot of conditionals and branching.
There are two ways of constructing a software design: One way is to make it so simple that there are obviously no deficiencies, and the other way is to make it so complicated that there are no obvious deficiencies. (C. A. R. Hoare)
For sections of code that are "so simple that there are obviously no deficiencies", a suite of automated programmer tests is not a great investment. Get two people to perform a manual review, and sign off on it.
Too many private/helper functions are often a sign of missing abstraction.
Maybe you should consider applying the 'Extract class' refactoring. It would resolve your confusion, as the private members end up becoming public members of the extracted class.
Please note, I am not suggesting you create a class for every private member, but rather that you play with the model a bit to find a better design.
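As a rough illustration of what 'Extract class' can look like here (all names below are made up for the example, not taken from your code): the helpers become public methods of a small collaborator that can be unit-tested directly, while main_function shrinks to wiring.

class ReportBuilder:
    """Former helper functions, now public and directly testable."""
    def gather_input(self, raw):
        return raw.strip().split(',')

    def format_rows(self, rows):
        return [row.upper() for row in rows]

def main_function(raw):
    builder = ReportBuilder()
    return builder.format_rows(builder.gather_input(raw))

def test_gather_input_splits_on_commas():
    # No tedious setup needed; the extracted class is tested on its own.
    assert ReportBuilder().gather_input(' a,b ') == ['a', 'b']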

Most Performant Way To Do Imports

From a performance point of view (time or memory) is it better to do:
import pandas as pd
or
from pandas import DataFrame, TimeSeries
Does the best choice depend on how many classes I'm importing from the package?
Similarly, I've seen people do things like:
def foo(bar):
    from numpy import array
Why would I ever want to do an import inside a function or method definition? Wouldn't this mean that import is being performed every time that the function is called? Or is this just to avoid namespace collisions?
This is micro-optimising, and you should not worry about this.
Modules are loaded once per Python process. All code that then imports the module only needs to bind a name to the module or to objects defined in the module. That binding is extremely cheap.
Moreover, the top-level code in your module only runs once too, so the binding takes place just once. An import in a function does the binding each time the function is run, but again, this is so cheap as to be negligible.
Importing in a function makes a difference for two reasons: it won't put that name in the global namespace for the module (so no namespace pollution), and because the name is now local, using that name is slightly faster than using a global.
If you want to improve performance, focus on code that is being repeated many, many times. Importing is not it.
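A small illustration of that point (treat the numbers as machine-dependent; this is just a sketch): the module object is created once and cached in sys.modules, so a function-level import afterwards is only a cache lookup plus a name binding.

import sys
import timeit

def use_function_level_import():
    from math import sqrt  # after the first call, just a cache lookup and bind
    return sqrt(2)

use_function_level_import()
print('math' in sys.modules)  # True: the module stays loaded for the whole process
print(timeit.timeit(use_function_level_import, number=100000))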
Answering the more general question of when to import: imports are dependencies. They are code that may or may not exist, and that is required for the functioning of the program. It is therefore a very good idea to import that code as soon as possible, to prevent dumb errors from cropping up in the middle of execution.
This is particularly true as PyPy becomes more popular, where the import might exist but not be usable under PyPy. Far better to fail early than potentially hours into the execution of the code.
As for "import pandas as pd" vs "from pandas import DataFrame, TimeSeries", this question has multiple concerns (as all questions do), with some far more important than others. There's the question of namespace, there's the question of readability, and there's the question of performance. Performance, as Martjin states, should contribute to about 0.0001% of the decision. Readability should contribute about 90%. Namespace only 10%, as it can be mitigated so easily.
Personally, in my opinion, both import X as Y and from X import Y are bad practice, because explicit is better than implicit. You don't want to be on line 2000 trying to remember which package calculate_mean comes from because it isn't referenced anywhere else in the code. When I first started using numpy I was copy/pasting code from the internet, and couldn't figure out why I couldn't pip install np. This obviously isn't a problem if you already know that "np" is Python for "numpy", but it's a stupid and pointless confusion for the 3 letters it saves. It came from numpy. Use numpy.
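In other words, a tiny made-up example of the style being argued for:

import numpy

def summarize(values):
    # The origin of mean() stays obvious even far away from the import line.
    return numpy.mean(values)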
There is an advantage to importing a module inside of a function that hasn't been mentioned yet: doing so gives you some control over when the module is loaded. In fact, even though @J.J's answer recommends importing all modules as early as possible, this control allows you to postpone loading a module.
Why would you want to do that? Well, while it doesn't improve the actual performance of your program, doing so can improve the perceived performance, and by virtue of this, the user experience:
In part, users perceive whether your app is fast or slow based on how long it takes to start up.
MSDN: Best practices for your app's startup performance
Loading every module at the beginning of your main script can take some time. For example, one of my apps uses the Qt framework, Pandas, Numpy, and Matplotlib. If all these modules are imported right at the beginning of the app, the appearance of the user interface is delayed by several seconds. Users don't like to wait, and they are likely to perceive your app as generally slow because of this wait.
But if for example Matplotlib is imported only from within those functions that are called whenever the user issues a plot command, the startup time is notably reduced. The user doesn't perceive your app to be that sluggish anymore, which may result in a better user experience.
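A sketch of that deferred-import pattern (the function name and the use of matplotlib are illustrative, not taken from the app described above):

def show_plot(values):
    # Imported only when the user actually asks for a plot, so the GUI can
    # start without paying matplotlib's import cost up front.
    import matplotlib.pyplot as plt
    plt.plot(values)
    plt.show()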

What is the difference between random.normalvariate() and random.gauss() in python?

What is the difference between random.normalvariate() and random.gauss()?
They take the same parameters and return the same value, performing essentially the same function.
I understand from a previous answer that random.gauss() is not thread-safe, but what does this mean in this context? Why should a programmer care about this? Alternatively posed: why were both a thread-safe and a non-thread-safe version included in Python's random module?
This is an interesting question. In general, the best way to know the difference between two python implementations is to inspect the code yourself:
import inspect, random
str_gauss = inspect.getsource(random.gauss)
str_nv=inspect.getsource(random.normalvariate)
and then print each of the strings to see how the sources differ. A quick look at the code shows not only that they behave differently with respect to threading, but also that the algorithms are not the same; for example, normalvariate uses something called the Kinderman and Monahan method, as per the following comments in str_nv:
# Uses Kinderman and Monahan method. Reference: Kinderman,
# A.J. and Monahan, J.F., "Computer generation of random
# variables using the ratio of uniform deviates", ACM Trans
# Math Software, 3, (1977), pp257-260.
Thread-safe pieces of code must account for possible race conditions during execution. This introduces overhead as a result of synchronization schemes like mutexes, semaphores, etc.
However, if you are writing code that does not need to be re-entrant, no race conditions normally arise, which essentially means that you can write code that executes a bit faster. I guess this is why random.gauss() was introduced, since the Python docs say it's faster than the thread-safe version.
I'm not entirely sure about this, but the Python documentation says that random.gauss is slightly faster, so if you're OK with it being non-thread-safe, you can go a little faster.
In a multi-threaded program, calling random.gauss twice in very quick succession can cause its internal code to run in two threads at once, potentially before the first call has had a chance to return. The internal state the function keeps between calls may not be updated before the second call uses it, which can lead to incorrect output (for example, two threads receiving the same value).
Successive calls to random.normalvariate do not share state in this way, so they stay correct even when made concurrently.
The advantage of random.gauss is therefore that it is faster, but it may produce erroneous output when called from multiple threads without protection.
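If thread safety actually matters in your program, a common workaround (a hedged sketch; the helper name is made up) is to give each thread its own random.Random instance, which is one of the options the documentation suggests:

import random
import threading

_local = threading.local()

def thread_local_gauss(mu, sigma):
    # Each thread gets its own generator, so gauss() never shares its cached
    # state between threads.
    rng = getattr(_local, 'rng', None)
    if rng is None:
        rng = _local.rng = random.Random()
    return rng.gauss(mu, sigma)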

PyQt4: Modularizing/Scaling my GUI components?

I'm designing a (hopefully) simple GUI application using PyQt4 that I'm trying to make scalable. In brief, the user inputs some basic information and sends it into one of n queues (implementing waiting lists). Each of these n queues (QTableViews) is identical, and each has controls to pop from, delete from and rearrange its queue. These, along with some labels etc., form a 'module'. Currently my application is hardcoded to 4 queue modules, so there are elements named btn_table1_pop, btn_table2_pop... etc.; four copies of every single module widget. This is obviously not very good UI design if you always assume your clients have exactly four people that need waiting lists! I'd like to be able to easily modify this program so 8 people could use it, or 3 people could use it without a chunk of useless screen real estate!
The really naive solution is duplicating the code for each module, but this is really messy, unmaintainable, and binds my application to exactly four queues. A better thought would be to write functions for each button that set an index and call a function implementing the common logic, but I'm still hardcoded to 4, because the branch logic and the calling functions still have to take the names of the elements into account. If there were a way to 'vectorize' the names of the elements so that I could, for example, write
btn_table[index]_pop.setEnabled(False)
...I could eliminate this branch logic and really condense my code. But I'm way too new to Python/PyQt to know 1) if this is even possible, or 2) how to go about it, or whether this is even the way to go.
Thanks again, SO.
In case anyone is interested, I was able to get it working with dummybutton = getattr(self, 'btn_table{}'.format(i)) and then calling the button's methods on dummybutton.
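Spelled out a bit more (a sketch rather than my actual code; the widget names, the queue count and the dummy button class are stand-ins for the real PyQt4 objects):

N_QUEUES = 4

class DummyButton(object):
    def setEnabled(self, enabled):
        print('setEnabled({})'.format(enabled))

class QueueControls(object):
    def __init__(self):
        # In the real app these attributes are the buttons created by Qt Designer.
        for i in range(1, N_QUEUES + 1):
            setattr(self, 'btn_table{}_pop'.format(i), DummyButton())

    def set_pop_enabled(self, index, enabled):
        # Look the button up by its generated name instead of hardcoding
        # btn_table1_pop, btn_table2_pop, ...
        button = getattr(self, 'btn_table{}_pop'.format(index))
        button.setEnabled(enabled)

controls = QueueControls()
controls.set_pop_enabled(2, False)  # prints setEnabled(False)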

What's the best way to record the type of every variable assignment in a Python program?

Python is so dynamic that it's not always clear what's going on in a large program, and looking at a tiny bit of source code does not always help. To make matters worse, editors tend to have poor support for navigating to the definitions of tokens or import statements in a Python file.
One way to compensate might be to write a special profiler that, instead of timing the program, would record the runtime types and paths of objects of the program and expose this data to the editor.
This might be implemented with sys.settrace(), which sets a callback for each line of code and is how pdb is implemented, or by using the ast module and an import hook to instrument the code; or is there a better strategy? How would you write something like this without making it impossibly slow, and without running afoul of extreme dynamism, e.g. side effects on property access?
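To make the sys.settrace() idea concrete, here is a minimal sketch of what I mean; the tracer and the type_log structure are made up for illustration and ignore all the performance and dynamism concerns above:

import sys

type_log = {}

def tracer(frame, event, arg):
    # Called for every executed line of traced code; record the current type
    # of each local variable in that frame.
    if event == 'line':
        for name, value in frame.f_locals.items():
            key = (frame.f_code.co_filename, frame.f_code.co_name, name)
            type_log.setdefault(key, set()).add(type(value).__name__)
    return tracer

def demo():
    x = 1
    x = 'now a string'
    return x

sys.settrace(tracer)
demo()
sys.settrace(None)
print(type_log)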
I don't think you can help making it slow, but it should be possible to detect the address of each variable when you encounter a STORE_FAST, STORE_NAME, or other STORE_* opcode.
Whether or not this has been done before, I do not know.
If you need debugging, look at pdb; it will allow you to step through your code and inspect any variables.
import pdb

def test():
    print(1)
    pdb.set_trace()  # you will drop into an interactive debugger here
    print(2)
What if you monkey-patched object's class or another prototypical object?
This might not be the easiest if you're not using new-style classes.
You might want to check out PyChecker's code - it does (I think) what you are looking to do.
Pythoscope does something very similar to what you describe: it uses a combination of static information in the form of an AST and dynamic information gathered through sys.settrace.
BTW, if you have problems refactoring your project, give Pythoscope a try.
