I am trying to figure out the right way to plot pandas DataFrames, as there seem to be multiple working syntaxes coexisting. I know pandas is still developing, so my question is: which of the methods below is the most future-proof?
Let's say I have a DataFrame df. I could plot it as a histogram using any of the following pandas API calls.
1. df.plot(kind='hist')
2. df.plot.hist()
3. df.hist()
Looking at the documentation, options 1 and 2 seem to be pretty much the same thing, in which case I prefer df.plot.hist() as I get auto-completion on the plot name. 'hist' is still pretty easy to spell as a string, but 'candlestick_ohlc', for example, is pretty easy to typo...
What confuses me is the third option. It does not have all the options of the first two, and its API is different. Is that one some legacy thing, or the actual right way of doing things?
The recommended method is plot.<plot_type>; this avoids ambiguity in the kwarg params and aids tab-completion, see here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0170-plot.
The .hist method still works for legacy support; I don't believe there are plans to remove it, but it's recommended to use plot.hist for future compatibility.
Additionally, it simplifies the API somewhat: it was a bit problematic to use kind=graph_type to specify the graph type and ensure the params were correct for each graph type. The kwargs for plot.<plot_type> are specified here: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting, which should cover all the args in hist.
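For example, a minimal sketch of the two equivalent spellings (bins is one of the hist-specific kwargs):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randn(1000)})

# Equivalent calls; the accessor form tab-completes, and its signature
# documents the hist-specific kwargs such as bins.
df.plot(kind='hist', bins=30)
df.plot.hist(bins=30)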
I've always considered df.hist() to be the graphical equivalent of df.describe(): a quick way of getting an overview of the distribution of numeric data in a data frame. As this is indeed useful, and also used by a few people as far as I know, I'd be surprised if it became deprecated in a future version.
In contrast, I understand the df.plot method to be intended for actual data visualization, i.e. the preferred method if you want to tease a specific bit of information out of your data. Consequently, there are more arguments that you can use to modify the plot so that it fits your purpose, whereas with df.hist(), you can get useful distributional plots even with the default settings.
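To make the contrast concrete (a minimal sketch): df.hist() draws one subplot per numeric column, while df.plot.hist() overlays everything on a single Axes.

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randn(100), "b": np.random.randn(100)})

df.hist()                # one subplot per numeric column: the quick overview
df.plot.hist(alpha=0.5)  # both columns overlaid on a single Axes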
Thus, to answer your question: as I see it, both functions serve different purposes, both can be useful depending on your needs, and both should be future-safe.
I can't seem to find the code for numpy argmax.
The source link in the docs leads me here, which doesn't have any actual code.
I went through every function that mentions argmax using the GitHub search tool and still no luck. I'm sure I'm missing something.
Can someone lead me in the right direction?
Thanks
NumPy is written in C. It uses a template engine that parses special comments to generate many versions of the same generic function (typically one for each supported type). This tool is very helpful for generating fast code, since the C language does not provide (proper) templates, unlike C++ for example. However, it also makes the code more cryptic than necessary, since the names of the functions are often generated. For example, generic function names can look like #TYPE#_#OP#, where #TYPE# and #OP# are two macros that can each take different values. On top of all of this, the CPython binding also makes the code more complex, since the C functions have to be wrapped so they can be called from CPython with complex arrays (possibly with a high number of dimensions and custom user types) and have their CPython arguments decoded.
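As a toy illustration of the idea (this is not NumPy's actual generator, which lives in its build tooling and processes .c.src files; the placeholder syntax here is only illustrative), one generic definition gets stamped out once per type:

# Toy sketch of comment-driven template expansion.
template = "static int @TYPE@_argmax(@ctype@ *ip, npy_intp n, npy_intp *max_ind)"
for TYPE, ctype in [("FLOAT", "npy_float"), ("DOUBLE", "npy_double"), ("LONG", "npy_long")]:
    print(template.replace("@TYPE@", TYPE).replace("@ctype@", ctype))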
_PyArray_ArgMinMaxCommon is quite a good entry point, but it is only a wrapping function and not the main computing one. It is only useful if you plan to change the prototype of the NumPy function from Python.
The main computational function can be found here. The comment just above the function is the one used to generate the variants of the function (e.g. CDOUBLE_argmax). Note that there are some alternative type-specific implementations below the main one, like OBJECT_argmax, since CPython objects and strings must be handled a bit differently. Thank you for contributing to NumPy.
As mentioned in the comments, you'll likely find what you are searching for in the C implementation (here, under _PyArray_ArgMinMaxCommon). The code itself can be very convoluted, so if your intent was to open an issue on NumPy with a broad idea, I would do it on the page you linked anyway.
I am a newbie reading Uncle Bob's Clean Code Book.
It is indeed good practice to limit the number of function arguments to as few as possible. But I still come across many functions offered by libraries that require a bunch of arguments. For example, in Python's pandas, there is a function with 9 arguments:
DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=<object object>, observed=False, dropna=True)
(And this function also violates the advice about flag arguments)
It seems that such cases are much rarer in the Python standard library, but I still managed to find one with 4 arguments:
re.split(pattern, string, maxsplit=0, flags=0)
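For reference, both optional arguments of re.split in action:

import re

re.split(r",\s*", "a, b, c", maxsplit=1)           # ['a', 'b, c']
re.split(r"[a-z]+", "0A3b9", flags=re.IGNORECASE)  # ['0', '3', '9']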
I understand that this is just a suggestion rather than a silver bullet, but is it applicable to cases like the ones above?
Uncle Bob does not mention a hard limit on the number of arguments that would make your code smell, but I would consider 9 arguments too many.
Today's IDEs are much better at supporting the readability of the code; nevertheless, refactoring stays tricky, especially with a large number of identically typed arguments.
The suggested solution is to encapsulate the arguments in a single struct/object (depending on your language). In the given case, this could be a GroupingStrategy:
strategy = GroupingStrategy()
strategy.by = "Foo"
strategy.axis = 0
strategy.sorted = True
DataFrame.groupby(strategy)
All not mentioned attributes will be assigned with the respective default values.
You could then also convert it to a fluent API:
DataFrame.groupby(GroupingStrategy.by("Foo").axis(0).sorted())
Or keep some of the arguments, if this feels better:
DataFrame.groupby("Foo", GroupingStrategy.default())
The first point to note is that all those arguments to groupby are relevant. You can reduce the number of arguments by having different versions of groupby but that doesn't help much when the arguments can be applied independently of each other, as is the case here. The same logic would apply to re.split.
It's true that integer "flag" arguments can be dodgy from a maintenance point of view: what happens if you want to change a flag value in your code? You have to hunt through and manually fix each case. The traditional approach is to use enums, which map numbers to words (e.g. a Day enum would have Day.Sun = 0, Day.Mon = 1, etc.). In compiled languages like C++ or C#, this gives you the speed of using integers under the hood but the readability of using labels/words in your code. However, enums in Python are slow.
One rule that I think applies to any source code is to avoid "magic numbers", i.e. numbers which appear directly in the source code. The enum is one solution. Another solution is to have constant variables representing the different flag settings. Python sort of supports constants (uppercase variable names in constant.py which you then import); however, they are constant only by convention, and you can actually change their value :(
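In Python, the standard-library enum module covers this, trading a little speed for readability:

from enum import IntEnum

class Day(IntEnum):
    SUN = 0
    MON = 1
    SAT = 6

def is_weekend(day: Day) -> bool:
    # Day.SUN reads as a word in the source but still compares as the int 0.
    return day in (Day.SAT, Day.SUN)

print(is_weekend(Day.MON))  # False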
What would be the easiest way to label the x and y axes of a FiPy viewer? I specifically use fipy.viewers.matplotlibViewer.matplotlib2DGridContourViewer.Matplotlib2DGridContourViewer. I tried giving it the parameter xlabel="some string", but that obviously didn't work because the class doesn't have such a parameter. Then I saw that it can take a matplotlib Axes object as a parameter, but to create one I would have to have an existing matplotlib.figure.Figure object, and so everything is intertwining and I'm lost in documentation with such a seemingly simple goal. Is there some way to add labels simply by giving the Viewer a specific parameter?
Also I looked at the custom Viewer creation examples but I'm not that good in Python, so I would appreciate a detailed solution if I have to take that route.
The MatplotlibViewer classes have an .axes property, which you can use, e.g.,
import fipy as fp

viewer = fp.MatplotlibViewer(...)
viewer.axes.set_xlabel("some string")
No, this isn't documented anywhere.
The axes= argument is for cases where you have a complicated/custom presentation that you want FiPy to render into, but, as you've discovered, it's a lot of overhead if you're not already doing that.
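For completeness, a sketch of that axes= route: an untested outline, assuming a trivial mesh and variable, with the class path taken from the question (double-check the keywords against your FiPy version):

import fipy as fp
import matplotlib.pyplot as plt
from fipy.viewers.matplotlibViewer.matplotlib2DGridContourViewer import Matplotlib2DGridContourViewer

mesh = fp.Grid2D(nx=10, ny=10)
phi = fp.CellVariable(mesh=mesh, value=mesh.x)  # something non-constant to contour

# Build and label the Axes yourself, then hand it to the viewer.
fig, ax = plt.subplots()
ax.set_xlabel("x (m)")
ax.set_ylabel("y (m)")
viewer = Matplotlib2DGridContourViewer(vars=phi, axes=ax)
viewer.plot()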
Apologies if this is not the right place for this question.
I've recently started using MIT's MEEP software (Python3, on Linux). I am quite new to it and would like to mostly use it for photovoltaics projects. Somewhat common shapes that show up here are "inverted pyramid" and slanted (oblique) cone structures. Creating shapes in MEEP seems to generally be done with the GeometricObject class, but they don't seem to directly support either of these structures. Is there any way around this or is my only real option simulating these structures by stacking small Block objects?
As described in my own "answer" posted below, it's not too difficult to define these geometric objects myself, write a function that checks whether a point is inside the object, and return the appropriate material. How would I go about converting this to a MEEP GeometricObject, instead of converting it to a material_func as I've done?
No responses, so I thought I'd post my hacky way around it. There are two solutions. The first is, as mentioned in the question, stacking MEEP's Block objects. The other approach is to define my own class Pyramid, which works basically the same way as described here. Then I convert a list of my class objects and MEEP's shape objects into a function that takes a vector and returns a material, and this is fed as material_func to MEEP's Simulation object. So far it seems to work, hence I'm posting it as an answer. However, it substantially slows down subpixel averaging (and maybe the rest of the simulation, though I haven't done an actual analysis), so I'm not very happy with it.
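An illustrative sketch of that approach (Pyramid, contains() and make_material_func() are my own names, not MEEP API; shown for an upright square-based pyramid, and an inverted one just flips the z test):

import meep as mp

class Pyramid:
    def __init__(self, base_center, base_size, height, material):
        self.base_center = base_center  # mp.Vector3 at the middle of the base
        self.base_size = base_size      # full width of the square base
        self.height = height            # apex sits `height` above the base
        self.material = material

    def contains(self, p):
        z = p.z - self.base_center.z
        if not 0 <= z <= self.height:
            return False
        # The half-width shrinks linearly from the base to the apex.
        half = 0.5 * self.base_size * (1 - z / self.height)
        return (abs(p.x - self.base_center.x) <= half
                and abs(p.y - self.base_center.y) <= half)

def make_material_func(shapes, background=mp.air):
    # First shape containing the point wins; otherwise the background.
    def material_func(p):
        for shape in shapes:
            if shape.contains(p):
                return shape.material
        return background
    return material_func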
I'm not sure which is "better", but the second method does feel more precise, insofar as you have actual pyramids, not just a stack of Blocks.
From the documentation I can tell that tf.scatter_update and tf.scatter_nd_update differ, for example, in the way they handle their arguments.
However, it's not clear to me when it's appropriate to use one over the other. (and similarly for the other scatter(_nd)_foo operations.)
https://www.tensorflow.org/api_docs/python/tf/scatter_update
https://www.tensorflow.org/api_docs/python/tf/scatter_nd_update
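For concreteness, a minimal sketch of the indexing difference (TF 1.x-style ops, matching the linked docs):

import tensorflow as tf

v = tf.Variable([[1, 2], [3, 4], [5, 6]])

# scatter_update: indices select whole slices along the first dimension.
tf.scatter_update(v, [0], [[9, 9]])     # row 0 becomes [9, 9]

# scatter_nd_update: each index is a full coordinate tuple, so individual
# elements (or sub-slices at any depth) can be addressed.
tf.scatter_nd_update(v, [[2, 1]], [7])  # element v[2, 1] becomes 7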