Finding median with pandas transform - python

I needed to find the median for a pandas dataframe and used a piece of code from this previous SO answer: How I do find median using pandas on a dataset?.
I used the following code from that answer:
data['metric_median'] = data.groupby('Segment')['Metric'].transform('median')
It seemed to work well, so I'm happy about that, but I had a question: how is it that the transform method took the argument 'median' without any prior specification? I've been reading the documentation for transform but didn't find any mention of using it to find a median.
Basically, the fact that .transform('median') worked seems like magic to me, and while I have no problem with magic and fancy myself a young Tony Wonder, I'm curious about how it works.

I'd recommend diving into the source code to see exactly why this works (and I'm mobile so I'll be terse).
When you pass the argument 'median' to transform, pandas converts it behind the scenes via getattr to the appropriate method, then behaves as if you had passed it a function.
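A minimal sketch of that idea on toy data (this mirrors the string-to-method dispatch; pandas' actual internals also route common strings to optimized implementations):
import pandas as pd

data = pd.DataFrame({'Segment': ['a', 'a', 'b'],
                     'Metric': [1.0, 3.0, 5.0]})
g = data.groupby('Segment')['Metric']

# The string is looked up as a method name, roughly like this:
method = getattr(g, 'median')     # string -> bound groupby method
print(method())                   # per-group medians: a -> 2.0, b -> 5.0

# So these two calls produce the same broadcast result:
print(g.transform('median').tolist())              # [2.0, 2.0, 5.0]
print(g.transform(lambda s: s.median()).tolist())  # [2.0, 2.0, 5.0]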

Numpy argmax source

I can't seem to find the code for numpy argmax.
The source link in the docs led me here, which doesn't have any actual code.
I went through every function that mentions argmax using the GitHub search tool and still had no luck. I'm sure I'm missing something.
Can someone lead me in the right direction?
Thanks
Numpy is written in C. It uses a template engine that parses special comments to generate many versions of the same generic function (typically one for each supported type). This tool is very helpful for generating fast code, since the C language does not provide (proper) templates, unlike C++ for example. However, it also makes the code more cryptic than necessary, since function names are often generated. For example, generic function names can look like #TYPE#_#OP#, where #TYPE# and #OP# are two macros that can each take different values. On top of all of this, the CPython binding also makes the code more complex, since C functions have to be wrapped to be callable from CPython code with complex arrays (possibly with a large number of dimensions and custom user types) and CPython arguments to decode.
_PyArray_ArgMinMaxCommon is quite a good entry point, but it is only a wrapping function and not the main computing one. It is only useful if you plan to change the prototype of the Numpy function from Python.
The main computational function can be found here. The comment just above the function is the one used to generate the variants of the function (e.g. CDOUBLE_argmax). Note that there are some alternative, type-specific implementations below the main one, like OBJECT_argmax, since CPython objects and strings must be handled a bit differently. Thank you for contributing to Numpy.
As mentioned in the comments, you'll likely find what you are searching for in the C code implementation (here, under _PyArray_ArgMinMaxCommon). The code itself can be very convoluted, so if your intent was to open an issue on numpy with a broad idea, I would do it on the page you linked anyway.

Performance curiosity on pandas series `any`, `max`, `sum` vs python builtins

Seeking performance in Python applications that use pandas/numpy usually means preferring the pandas/numpy implementations over hand-written code such as explicit loops. This might be a bad introduction to my question, but had I not timed it myself, I would have expected the versions using the Series methods to run faster than the Python builtins. Since that's not the case, I evidently built a false intuition from this example, but I have not yet found the reason. So the question is: why do the Python builtins perform better here than the methods applied to the Series (am I missing something else)?
Pandas has its own functions, which are very different from Python's built-in functions; if you call Series.max() you are in fact calling nanops._nanminmax(), which is added via the IndexOpsMixin, instead of builtins.max().
Each behaves differently and thus has different performance characteristics.
Similarly for the rest of the methods. If you are curious, check the source code for the Series class and the classes it inherits from for the exact differences between the builtin functions and Pandas' implementations.
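To see this yourself, here is a minimal timing sketch (the sizes are arbitrary illustrations): for a small Series, the fixed per-call overhead of the pandas machinery (NaN handling, dtype dispatch) dominates, while for a large Series the vectorized pandas method pulls far ahead of the iterating builtin.
import timeit
import pandas as pd

small = pd.Series(range(100))
large = pd.Series(range(1_000_000))

print(timeit.timeit(lambda: small.max(), number=10_000))  # pandas method, small input
print(timeit.timeit(lambda: max(small), number=10_000))   # builtin, often faster here
print(timeit.timeit(lambda: large.max(), number=100))     # pandas method, large input
print(timeit.timeit(lambda: max(large), number=100))      # builtin, much slower here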
am I missing something else?
You assumed that you will always get the same result, which is not true. An example showing different output for the builtin sum and pandas.Series.sum:
import pandas as pd

s = pd.Series([1.0, 2.0, float("nan")])
print(s.sum())  # pandas' Series.sum skips NaN by default
print(sum(s))   # the builtin sum propagates NaN
output
3.0
nan
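The discrepancy comes from pandas' skipna=True default on Series.sum; asking pandas not to skip NaN reproduces the builtin result:
print(s.sum(skipna=False))  # nan -- same as the builtin sum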

Python Pandas: Dataframe.loc[] vs dataframe[] behavior difference?

I'm completely new to Pandas, but I was given some code that uses it, and I came across a line that looked like the following:
df.loc[bool_array]['column_name']=0
Looking at that code, I would expect that if I ran df.loc[bool_array]['column_name'] to retrieve the values for the column after running said code, the resulting values would all be zero. In fact, however, this was not the case: the values remained unchanged. To actually change the values, I had to remove the .loc, like so:
df[bool_array]['column_name']=0
whereupon displaying the values did, in fact, show 0 as expected, regardless of whether I used .loc or not.
Everything I can find about .loc seems to indicate that it should behave essentially the same as simply doing [] when given a boolean array like this. What difference am I missing that explains this behavioral discrepancy?
EDIT: As pointed out by @QuangHoang in the comments, the following recommended code DOES work:
df.loc[bool_array, 'column_name']=0
which simplifies the question to "why doesn't the original code work"?
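To make the difference concrete, here is a minimal reproduction on toy data (depending on your pandas version, the chained form may emit a SettingWithCopyWarning):
import pandas as pd

df = pd.DataFrame({'column_name': [1, 2, 3]})
bool_array = pd.Series([False, True, True])

df.loc[bool_array]['column_name'] = 0   # chained: writes into a temporary copy
print(df['column_name'].tolist())       # [1, 2, 3] -- original unchanged

df.loc[bool_array, 'column_name'] = 0   # one indexing operation: writes into df
print(df['column_name'].tolist())       # [1, 0, 0]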

Which is the most future-proof way to plot pandas dataframes?

I am trying to figure out the right way to plot pandas DataFrames, as there seem to be multiple working syntaxes coexisting. I know Pandas is still developing, so my question is: which of the methods below is the most future-proof?
Let's say I have a DataFrame df. I could plot it as a histogram using any of the following pandas API calls:
df.plot(kind='hist')
df.plot.hist()
df.hist()
Looking at the documentation, options 1 and 2 seem to be pretty much the same thing, in which case I prefer df.plot.hist() as I get auto-complete on the plot name. 'hist' is still pretty easy to spell as a string, but 'candlestick_ohlc', for example, is pretty easy to typo...
What confuses me is the third option. It does not have all the options of the first two, and its API is different. Is that one some legacy thing, or the actual right way of doing things?
The recommended method is plot.<plot_type>; this is to avoid the ambiguity of kwarg params and to aid tab-completion, see here: http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#whatsnew-0170-plot.
The .hist method still works as legacy support; I don't believe there are plans to remove it, but it's recommended to use plot.hist for future compatibility.
Additionally, it simplifies the API somewhat, as it was a bit problematic to use kind=graph_type to specify the graph type and ensure the params were correct for each graph type. The kwargs for plot.<plot_type> are specified here: http://pandas.pydata.org/pandas-docs/stable/api.html#api-dataframe-plotting, which should cover all the args in hist.
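For example, both spellings accept the same histogram kwargs (toy data; bins is forwarded to matplotlib):
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.random.randn(1000)})
df.plot(kind='hist', bins=30)  # older kind= spelling
df.plot.hist(bins=30)          # equivalent, tab-completable spelling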
I've always considered df.hist() to be the graphical equivalent of df.describe(): a quick way of getting an overview of the distribution of numeric data in a data frame. As this is indeed useful, and also used by a few people as far as I know, I'd be surprised if it became deprecated in a future version.
In contrast, I understand the df.plot method to be intended for actual data visualization, i.e. the preferred method if you want to tease a specific bit of information out of your data. Consequently, there are more arguments that you can use to modify the plot so that it fits your purpose, whereas with df.hist(), you can get useful distributional plots even with the default settings.
Thus, to answer your question: as I see it, both functions serve different purposes, both can be useful depending on your needs, and both should be future-safe.

Python (Pandas) : When to use replace vs. map vs. transform?

I'm trying to clearly understand for which type of data transformation the following functions in pandas should be used:
replace
map
transform
Can anybody provide some clear examples so I can better understand them?
Many thanks :)
As far as I understand, replace is used when working on missing values, transform is used while doing groupby operations, and map is used to change the values of a Series or index.
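A short illustration of all three on made-up data:
import pandas as pd

s = pd.Series(['a', 'b', None])
print(s.replace({None: 'missing'}))  # replace: substitute specific values

print(s.map({'a': 1, 'b': 2}))       # map: elementwise mapping; unmatched values become NaN

df = pd.DataFrame({'g': ['x', 'x', 'y'], 'v': [1, 2, 3]})
print(df.groupby('g')['v'].transform('mean'))  # transform: groupwise result, same length as input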
