See the topic, but I am especially interested what is the functional difference between tf.contrib.framework.variable() and tf.get_variable()? The documentation for tf.contrib.framework is not very informative.
If you look at the source code, variable() seems to be just a wrapper around get_variable(). The only additional thing it does is set the "collections" parameter to all variables in graph, if it wasn't set.
contrib.framework is more or less just a bunch of utility functions.
Related
I can't seem to find the code for numpy argmax.
The source link in the docs lead me to here, which doesn't have any actual code.
I went through every function that mentions argmax using the github search tool and still no luck. I'm sure I'm missing something.
Can someone lead me in the right direction?
Thanks
Numpy is written in C. It uses a template engine that parsing some comments to generate many versions of the same generic function (typically for many different types). This tool is very helpful to generate fast code since the C language does not provide (proper) templates unlike C++ for example. However, it also make the code more cryptic than necessary since the name of the function is often generated. For example, generic functions names can look like #TYPE#_#OP# where #TYPE# and #OP# are two macros that can take different values each. On top of all of this, the CPython binding also make the code more complex since C functions have to be wrapped to be called from a CPython code with complex arrays (possibly with a high amount of dimensions and custom user types) and CPython arguments to decode.
_PyArray_ArgMinMaxCommon is a quite good entry point but it is only a wrapping function and not the main computing one. It is only useful if you plan to change the prototype of the Numpy function from Python.
The main computational function can be found here. The comment just above the function is the one used to generate the variants of the functions (eg. CDOUBLE_argmax). Note that there are some alternative specific implementation for alternative type below the main one like OBJECT_argmax since CPython objects and strings must be computed a bit differently. Thank you for contributing to Numpy.
As mentioned in the comments, you'll likely find what you are searching in the C code implementation (here under _PyArray_ArgMinMaxCommon). The code itself can be very convoluted, so if your intent was to open an issue on numpy with a broad idea, I would do it on the page you linked anyway.
I'm using numpy v1.18.2 in some simulations, and using inbuilt functions such as np.unique, np.diff and np.interp. I'm using these functions on standard objects, i.e lists or numpy arrays.
When I checked with cProfile, I saw that these functions make a call to an built-in method numpy.core._multiarray_umath.implement_array_function and that this method accounts for 32.5% of my runtime! To my understanding this is a wrapper that performs some checks to make sure the that the arguments passed to the function are compatible with the function.
I have two questions:
Is this function (implement_array_function) actually taking up so much time or is it actually the operations I'm doing (np.unique, np.diff, np.interp) that is actually taking up all this time? That is, am I misinterpreting the cProfile output? I was confused by the hierarchical output of snakeviz. Please see snakeviz output here and details for the function here.
Is there any way to disable it/bypass it, since the inputs need not be checked each time as the arguments I pass to these numpy functions are already controlled in my code? I am hoping that this will give me a performance improvement.
I already saw this question (what is numpy.core._multiarray_umath.implement_array_function and why it costs lots of time?), but I was not able to understand what exactly the function is or does. I also tried to understand NEP 18, but couldn't make out how to exactly solve the issue. Please fill in any gaps in my knowledge and correct any misunderstandings.
Also I'd appreciate if someone can explain this to me like I'm 5 (r/explainlikeimfive/) instead of assuming developer level knowledge of python.
All the information below is taken from NEP 18.
Is this function (implement_array_function) actually taking up so much time or is it actually the operations I'm doing (np.unique, np.diff, np.interp) that is actually taking up all this time?
As #hpaulj correctly mentioned in the comment, the overhead of the dispatcher adds 2-3 microseconds to each numpy function call. This will probably be shorten to 0.5-1 microseconds once it is implemented in C. See here.
Is there any way to disable it/bypass it
Yes, from NumPy 1.17, you can set the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION to 0 (before importing numpy) and this will disable the use of implement_array_function (See here). Something like
import os
os.environ['NUMPY_EXPERIMENTAL_ARRAY_FUNCTION'] = '0'
import numpy as np
However, disabling it probably would not give you any notable performance improvement as the overhead of it is just few microseconds, and this will be the default in later numpy version too.
I am trying to figure what is the right way to plot pandas DataFrames as, there seem to be multiple working syntaxes coexisting. I know Pandas is still developing so my question is which of the methods below is the most future proof?
Let's say I have DataFrame df I could plot it as a histogram using following pandas API calls.
df.plot(kind='hist')
df.plot.hist()
df.hist()
Looking at the documentation options 1, 2 seem to be pretty much the same thing in which case I prefer df.plot.hist() as I get auto-complete with the plot name. 'hist' is still pretty easy to spell as a string, but 'candlestick_ohlc' for example is pretty easy to typo...
What gets me confused is the 3th option. It does not have all the options of the first 2 and API is different. Is that one some legacy thing or the actual right way of doing things?
The recommended method is plot._plot_type this is to avoid the ambiguity in kwarg params and to aid in tab-completion see here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0170-plot.
The .hist method still works as a legacy support, I don't believe there are plans to remove this but it's recommended to use plot.hist for future compatibility.
Additionally it simplifies the api somewhat as it was a bit problematic to use kind=graph_type to specify the graphy type and ensure the params were correct for each graphy type, the kwargs for plot._plottype are specified here: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting which should cover all the args in hist
I've always considered df.hist() to be the graphical equivalent to df.describe(): a quick way of getting an overview over the distribution of numeric data in a data frame. As this is indeed useful, and also used by a few people as far as I know, I'd be surprised if it became deprecated in a future version.
In contrast, I understand the df.plot method to be intended for actual data visualization, i.e. the preferred method if you want to tease a specific bit of information out of your data. Consequently, there are more arguments that you can use to modify the plot so that it fits your purpose, whereas with df.hist(), you can get useful distributional plots even with the default settings.
Thus, to answer your question: as I see it, both functions serve different purposes, both can be useful depending on your needs, and both should be future-safe.
Okay so i am currently working on an inhouse statistics package for python, its mainly geared towards a combination of working with arcgis geoprocessor, for modeling comparasion and tools.
Anyways, so i have a single class, that calculates statistics. Lets just call it Stats. Now my Stats class, is getting to the point of being very large. It uses statistics calculated by other statistics, to calculate other statistics sets, etc etc. This leads to alot of private variables, that are kept simply to prevent recalculation. however there is certain ones, while used quite frequintly they are often only used by one or two key subsections of functionality. (e.g. summation of matrix diagonals, and probabilities). However its starting to become a major eyeesore, and i feel as if i am doing this terribly wrong.
So is this bad?
I was recommended by a coworker, to simply start putting core and common functionality togther, in the main class, then simply having capsules, that take a reference to the main class, and simply do what ever functionality they need to within themselves. E.g. for calculating accuracy of model predictions, i would create a capsule, who simply takes a reference to the parent, and it will offload all of the calculations needed, for model predictions.
Is something like this really a good idea? Is there a better way? Right now i have over a dozen different sub statistics that are dumped to a text file to make a smallish report. The code base is growing, and i would just love it if i could start splitting up more and more of my python classes. I am just not sure really what the best way about doing stuff like this is.
Why not create a class for each statistic you need to compute and when of the statistics requires other, just pass an instance of the latter to the computing method? However, there is little known about your code and required functionalities. Maybe you could describe in a broader fashion, what kind of statistics you need calculate and how they depend on each other?
Anyway, if I had to count certain statistics, I would instantly turn to creating separate class for each of them. I did once, when I was writing code statistics library for python. Every statistic, like how many times class is inherited or how often function was called, was a separate class. This way each of them was simple, however I didn't need to use any of them in the other.
I can think of a couple of solutions. One would be to simply store values in an array with an enum like so:
StatisticType = enum('AveragePerDay','MedianPerDay'...)
Another would be to use a inheritance like so:
class StatisticBase
....
class AveragePerDay ( StatisticBase )
...
class MedianPerDay ( StatisticBase )
...
There is no hard and fast rule on "too many", however a guideline is that if the list of fields, properties, and methods when collapsed, is longer than a single screen full, it's probably too big.
It's a common anti-pattern for a class to become "too fat" (have too much functionality and related state), and while this is commonly observed about "base classes" (whence the "fat base class" monicker for the anti-pattern), it can really happen without any inheritance involved.
Many design patterns (DPs for short_ can help you re-factor your code to whittle down the large, untestable, unmaintainable "fat class" to a nice package of cooperating classes (which can be used through "Facade" DPs for simplicity): consider, for example, State, Strategy, Memento, Proxy.
You could attack this problem directly, but I think, especially since you mention in a comment that you're looking at it as a general class design topic, it may offer you a good opportunity to dig into the very useful field of design patterns, and especially "refactoring to patterns" (Fowler's book by that title is excellent, though it doesn't touch on Python-specific issues).
Specifically, I believe you'll be focusing mostly on a few Structural and Behavioral patterns (since I don't think you have much need for Creational ones for this use case, except maybe "lazy initialization" of some of your expensive-to-compute state that's only needed in certain cases -- see this wikipedia entry for a pretty exhaustive listing of DPs, with classification and links for further explanations of each).
Since you are asking about best practices you might want to check out pylint (http://www.logilab.org/857). It has many good suggestions about code style including ones relating to how many private variables in a class.
How bad is it to redefine a class method from another, third-party module, in Python?
In fact, users can create NumPy matrices that contain numbers with uncertainty; ideally, I would like their code to run unmodified (compared to when the code manipulates float matrices); in particular, it would be great if the inverse of matrix m could still be obtained with m.I, despite the fact that m.I has to be calculated with my own code (the original I method does not work, in general).
How bad is it to redefine numpy.matrix.I? For one thing, it does tamper with third-party code, which I don't like, as it may not be robust (what if other modules do the same?…). Another problem is that the new numpy.matrix.I is a wrapper that involves a small overhead when the original numpy.matrix.I can actually be applied in order to obtain the inverse matrix.
Is subclassing NumPy matrices and only changing their I method better? this would force users to update their code and create matrices of numbers with uncertainty with m = matrix_with_uncert(…) (instead of keeping using numpy.matrix(…), as for a matrix of floats), but maybe this is an inconvenience that should be accepted for the sake of robustness? Matrix inversions could still be performed with m.I, which is good… On the other hand, it would be nice if users could build all their matrices (of floats or of numbers with uncertainties) with numpy.matrix() directly, without having to bother checking for data types.
Any comment, or additional approach would be welcome!
Subclassing (which does involve overriding, as the term is generally used) is generally much preferable to "monkey-patching" (stuffing altered methods into existing classes or modules), even when the latter is available (built-in types, meaning ones implemented in C, can protect themselves against monkey-patching, and most of them do).
For example, if your functionality relies on monkey-patching, it will break and stop upgrades if at any time the class you're monkey patching gets upgraded to be implemented in C (for speed or specifically to defend against monkey patching). Maintainers of third party packages hate monkey-patching because it means they get bogus bug reports from hapless users who (unbeknownst to them) are in fact using a buggy monkey-patch which breaks the third party package, where the latter (unless broken monkey-wise) is flawless. You've already remarked on the possible performance hit.
Conceptually, a "matrix of numbers with uncertainty" is a different concept from a "matrix of numbers". Subclassing expresses this cleanly, monkey-patching tries to hide it. That's really the root of what's wrong with monkey-patching in general: a covert channel operating through global, hidden means, without clarity and transparency. All the many practical problems descend in a sense from this root conceptual problem.
I strongly urge you to reject monkey-patching in favor of clean solutions such as subclassing.
In general, it's perfectly acceptable to override methods that are ...
Intentionally permit overrides
In a way they document (satisfying LSP won't hurt)
If both conditions are met, then overriding should be safe.
Depends on what you mean with "redefine". Obviously you can use your own version of it, no problem at all. Also you can redefine it by subclassing if it's a method.
You can also make a new method, and patch it into the class, a practice known as monkey_patching. Like so:
from amodule import aclass
def newfunction(self, param):
do_something()
aclass.oldfunction = newfunction
This will make all instances of aclass use your new function instead of the old one, including instances in any "fourth-party" modules. This works and is highly useful, but it's regarded as very ugly and a last resort option. This is because there is nothing in the aclass code to suggest that you have overridden the method, so it's hard to debug. And even worse it gets when two modules monkeypatch the same thing. Then you really get confused.