Numpy argmax source

Numpy argmax source - python

I can't seem to find the code for numpy argmax.
The source link in the docs lead me to here, which doesn't have any actual code.
I went through every function that mentions argmax using the github search tool and still no luck. I'm sure I'm missing something.
Can someone lead me in the right direction?
Thanks

Numpy is written in C. It uses a template engine that parsing some comments to generate many versions of the same generic function (typically for many different types). This tool is very helpful to generate fast code since the C language does not provide (proper) templates unlike C++ for example. However, it also make the code more cryptic than necessary since the name of the function is often generated. For example, generic functions names can look like #TYPE#_#OP# where #TYPE# and #OP# are two macros that can take different values each. On top of all of this, the CPython binding also make the code more complex since C functions have to be wrapped to be called from a CPython code with complex arrays (possibly with a high amount of dimensions and custom user types) and CPython arguments to decode.
_PyArray_ArgMinMaxCommon is a quite good entry point but it is only a wrapping function and not the main computing one. It is only useful if you plan to change the prototype of the Numpy function from Python.
The main computational function can be found here. The comment just above the function is the one used to generate the variants of the functions (eg. CDOUBLE_argmax). Note that there are some alternative specific implementation for alternative type below the main one like OBJECT_argmax since CPython objects and strings must be computed a bit differently. Thank you for contributing to Numpy.

As mentioned in the comments, you'll likely find what you are searching in the C code implementation (here under _PyArray_ArgMinMaxCommon). The code itself can be very convoluted, so if your intent was to open an issue on numpy with a broad idea, I would do it on the page you linked anyway.

Related

Is `built-in method numpy.core._multiarray_umath.implement_array_function` a performance bottleneck?

I'm using numpy v1.18.2 in some simulations, and using inbuilt functions such as np.unique, np.diff and np.interp. I'm using these functions on standard objects, i.e lists or numpy arrays.
When I checked with cProfile, I saw that these functions make a call to an built-in method numpy.core._multiarray_umath.implement_array_function and that this method accounts for 32.5% of my runtime! To my understanding this is a wrapper that performs some checks to make sure the that the arguments passed to the function are compatible with the function.
I have two questions:
Is this function (implement_array_function) actually taking up so much time or is it actually the operations I'm doing (np.unique, np.diff, np.interp) that is actually taking up all this time? That is, am I misinterpreting the cProfile output? I was confused by the hierarchical output of snakeviz. Please see snakeviz output here and details for the function here.
Is there any way to disable it/bypass it, since the inputs need not be checked each time as the arguments I pass to these numpy functions are already controlled in my code? I am hoping that this will give me a performance improvement.
I already saw this question (what is numpy.core._multiarray_umath.implement_array_function and why it costs lots of time?), but I was not able to understand what exactly the function is or does. I also tried to understand NEP 18, but couldn't make out how to exactly solve the issue. Please fill in any gaps in my knowledge and correct any misunderstandings.
Also I'd appreciate if someone can explain this to me like I'm 5 (r/explainlikeimfive/) instead of assuming developer level knowledge of python.

All the information below is taken from NEP 18.
Is this function (implement_array_function) actually taking up so much time or is it actually the operations I'm doing (np.unique, np.diff, np.interp) that is actually taking up all this time?
As #hpaulj correctly mentioned in the comment, the overhead of the dispatcher adds 2-3 microseconds to each numpy function call. This will probably be shorten to 0.5-1 microseconds once it is implemented in C. See here.
Is there any way to disable it/bypass it
Yes, from NumPy 1.17, you can set the environment variable NUMPY_EXPERIMENTAL_ARRAY_FUNCTION to 0 (before importing numpy) and this will disable the use of implement_array_function (See here). Something like
import os
os.environ['NUMPY_EXPERIMENTAL_ARRAY_FUNCTION'] = '0'
import numpy as np
However, disabling it probably would not give you any notable performance improvement as the overhead of it is just few microseconds, and this will be the default in later numpy version too.

Pyramids and Oblique Cones in MEEP

Apologies if this is not the right place for this question.
I've recently started using MIT's MEEP software (Python3, on Linux). I am quite new to it and would like to mostly use it for photovoltaics projects. Somewhat common shapes that show up here are "inverted pyramid" and slanted (oblique) cone structures. Creating shapes in MEEP seems to generally be done with the GeometricObject class, but they don't seem to directly support either of these structures. Is there any way around this or is my only real option simulating these structures by stacking small Block objects?
As described in my own "answer" posted, it's not too difficult to just define these geometric objects myself, write a function to check if it's inside the object, and return the appropriate material. How would I go about converting this to a MEEP GeometricObject, instead of converting that to a material_func as I've done?

No responses, so I thought I'd post my hacky way around it. There are two solutions: First is as mentioned in the question, just stacking MEEP's Block object. The other approach I did was define my own class Pyramid, which works basically the same way as described here. Then, I convert a list of my class objects and MEEP's shape object to a function that takes a vector and returns a material, and this is fed as material_func in MEEP's Simulation object. So far, it seems to work, hence I'm posting it as an answer. However, It substantially slows down subpixel averaging (and maybe the rest of the simulation, though I haven't done an actual analysis), so I'm not very happy with it.
I'm not sure which is "better" but the second method does feel more precise, insofar that you have pyramids, not just a stack of Blocks.

Why couldn't Julia superset python?

The Julia Language syntax looks very similar to python, while the concept of a class (if one should address it as such a thing) is more what you use in C. There were many reasons why the creators decided on the difference with respect to the OOP. Still would it have been so hard (in comparison to create Julia in first place which is impressive) to find some canonical way to interpret python to Julia and thus get a hold of all the python libraries?

Yes. The design of Python makes it fundamentally difficult to optimize at compile-time (i.e. before you run the code). It is simply false that Julia is fast BECAUSE of its JIT. Rather, Julia is designed with its type system and multiple dispatch in mind so that way the compiler can know all of the necessary details to compile "the same code you would have written in C". That's what makes it fast: the type system. It makes a few trade-offs that allow it to, in "type-stable" functions, fully deduce what the types of every variable is, know what the memory layout of the type should be (including parametric types, so Vector{Float64} has a memory layout which is determined by the type and its parameter which inlines Float64 values like a NumPy array, except this is generalized in a way that your own struct types get the same efficiency), and compile a version of the code specifically for the types which are seen.
There are many ways where this is at odds with Python. For example, if the number of fields in a struct could change, then the memory layout could not be determined and thus these optimizations cannot occur at "compile-time". Julia was painstakingly designed to make sure that it would have type inferrability, and it uses that to generate code which is fully typed and remove all runtime checks (in type-stable functions. When a function is not type-stable, the types of the variables become dynamic rather than static and it slows down to Python-like speeds). In this sense, Julia actually isn't even optimized yet: all of its performance comes "for free" given the design of its type system. Python/MATLAB/R has to try really hard to optimize at runtime because it doesn't have the capability to do these deductions. In fact, those languages are "better optimized" right now in terms of runtime optimizations, but no one has really worked on runtime optimizations in Julia yet because in most performance sensitive cases you can get it all at compile time.
So then, what about Numba? Numba tries to take the route that Julia takes but with Python code by limiting what can be done so that way it can get type-stable code and compile that efficiently. However, this means a few things. First of all, it's not compatible with all Python codes or libraries. But more importantly, since Python is not a language built around its type system, the tools for controlling the code at the level of types is much reduced. So Numba doesn't have parametric vectors and generic codes which auto-specialize via multiple dispatch because these aren't features of the language. But that also means that it cannot make full use of the design, which limits how much it can do. It can handle the "use only floating point array" stuff just fine, but you can see it gets limited if you want one code to produce efficient code for "any number type, even ones I don't know about". However, by design, Julia does this automatically.
So at the core, Julia and Python are extremely different languages. It can be hard to see because Julia's syntax is close to Python's syntax, but they do not work the same at all.
This is a short summary of what I have described in a few blog posts. These go into more detail and show you how Julia is actually generating efficient code, how it gives you a generic "Python looking style" but doing so with full inferrability all the way down, and what the tradeoffs are.
How type-stability plus multiple dispatch gives performance:
http://ucidatascienceinitiative.github.io/IntroToJulia/Html/WhyJulia
http://www.stochasticlifestyle.com/7-julia-gotchas-handle/
How the type system allows for highly performant generic designs
http://www.stochasticlifestyle.com/type-dispatch-design-post-object-oriented-programming-julia/

Automatic CudaMat conversion in Python

I'm looking into speeding up my python code, which is all matrix math, using some form of CUDA. Currently my code is using Python and Numpy, so it seems like it shouldn't be too difficult to rewrite it using something like either PyCUDA or CudaMat.
However, on my first attempt using CudaMat, I realized I had to rearrange a lot of the equations in order to keep the operations all on the GPU. This included the creation of many temporary variables so I could store the results of the operations.
I understand why this is necessary, but it makes what were once easy to read equations into somewhat of a mess that difficult to inspect for correctness. Additionally, I would like to be able to easily modify the equations later on, which isn't in their converted form.
The package Theano manages to do this by first creating a symbolic representation of the operations, then compiling them to CUDA. However, after trying Theano out for a bit, I was frustrated by how opaque everything was. For example, just getting the actual value for myvar.shape[0] is made difficult since the tree doesn't get evaluated until much later. I would also much prefer less of a framework in which my code much conform to a library that acts invisibly in the place of Numpy.
Thus, what I would really like is something much simpler. I don't want automatic differentiation (there are other packages like OpenOpt that can do that if I require it), or optimization of the tree, but just a conversion from standard Numpy notation to CudaMat/PyCUDA/somethingCUDA. In fact, I want to be able to have it evaluate to just Numpy without any CUDA code for testing.
I'm currently considering writing this myself, but before even consider such a venture, I wanted to see if anyone else knows of similar projects or a good starting place. The only other project I know that might be close to this is SymPy, but I don't know how easy it would be to adapt to this purpose.
My current idea would be to create an array class that looked like a Numpy.array class. It's only function would be to build a tree. At any time, that symbolic array class could be converted to a Numpy array class and be evaluated (there would also be a one-to-one parity). Alternatively, the array class could be traversed and have CudaMat commands be generated. If optimizations are required they can be done at that stage (e.g. re-ordering of operations, creation of temporary variables, etc.) without getting in the way of inspecting what's going on.
Any thoughts/comments/etc. on this would be greatly appreciated!
Update
A usage case may look something like (where sym is the theoretical module), where we might be doing something such as calculating the gradient:
W = sym.array(np.rand(size=(numVisible, numHidden)))
delta_o = -(x - z)
delta_h = sym.dot(delta_o, W)*h*(1.0-h)
grad_W = sym.dot(X.T, delta_h)
In this case, grad_W would actually just be a tree containing the operations that needed to be done. If you wanted to evaluate the expression normally (i.e. via Numpy) you could do:
npGrad_W = grad_W.asNumpy()
which would just execute the Numpy commands that the tree represents. If on the other hand, you wanted to use CUDA, you would do:
cudaGrad_W = grad_W.asCUDA()
which would convert the tree into expressions that can executed via CUDA (this could happen in a couple of different ways).
That way it should be trivial to: (1) test grad_W.asNumpy() == grad_W.asCUDA(), and (2) convert your pre-existing code to use CUDA.

Have you looked at the GPUArray portion of PyCUDA?
http://documen.tician.de/pycuda/array.html
While I haven't used it myself, it seems like it would be what you're looking for. In particular, check out the "Single-pass Custom Expression Evaluation" section near the bottom of that page.

I need to speed up a function. Should I use cython, ctypes, or something else?

I'm having a lot of fun learning Python by writing a genetic programming type of application.
I've had some great advice from Torsten Marek, Paul Hankin and Alex Martelli on this site.
The program has 4 main functions:
generate (randomly) an expression tree.
evaluate the fitness of the tree
crossbreed
mutate
As all of generate, crossbreed and mutate call 'evaluate the fitness'. it is the busiest function and is the primary bottleneck speedwise.
As is the nature of genetic algorithms, it has to search an immense solution space so the faster the better. I want to speed up each of these functions. I'll start with the fitness evaluator. My question is what is the best way to do this. I've been looking into cython, ctypes and 'linking and embedding'. They are all new to me and quite beyond me at the moment but I look forward to learning one and eventually all of them.
The 'fitness function' needs to compare the value of the expression tree to the value of the target expression. So it will consist of a postfix evaluator which will read the tree in a postfix order. I have all the code in python.
I need advice on which I should learn and use now: cython, ctypes or linking and embedding.
Thank you.

Ignore everyone elses' answer for now. The first thing you should learn to use is the profiler. Python comes with a profile/cProfile; you should learn how to read the results and analyze where the real bottlenecks is. The goal of optimization is three-fold: reduce the time spent on each call, reduce the number of calls to be made, and reduce memory usage to reduce disk thrashing.
The first goal is relatively easy. The profiler will show you the most time-consuming functions and you can go straight to that function to optimize it.
The second and third goal is harder since this means you need to change the algorithm to reduce the need to make so much calls. Find the functions that have high number of calls and try to find ways to reduce the need to call them. Utilize the built-in collections, they're very well optimized.
If you're doing a lot of number and array processing, you should take a look at pandas, Numpy/Scipy, gmpy third party modules; they're well optimised C libraries for processing arrays/tabular data.
Another thing you want to try is PyPy. PyPy can JIT recompile and do much more advanced optimisation than CPython, and it'll work without the need to change your python code. Though well optimised code targeting CPython can look quite different from well optimised code targeting PyPy.
Next to try is Cython. Cython is a slightly different language than Python, in fact Cython is actually best described as C with typed Python-like syntax.
For parts of your code that is in very tight loops that you can no longer optimize using any other ways, you may want to rewrite it as C extension. Python has a very good support for extending with C. In PyPy, the best way to extend PyPy is with cffi.

Cython is the quickest to get the job done, either by writing your algorithm directly in Cython, or by writing it in C and bind it to python with Cython.
My advice: learn Cython.

Another great option is boost::python which lets you easily wrap C or C++.
Of these possibilities though, since you have python code already written, cython is probably a good thing to try first. Perhaps you won't have to rewrite any code to get a speedup.

Try to work your fitness function so that it will support memoization. This will replace all calls that are duplicates of previous calls with a quick dict lookup.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.