So .loc and .iloc are not your typical functions. They somehow use [ and ] to surround their arguments so that it looks like normal array indexing. However, I have never seen this in another library (that I can think of; maybe numpy has something like this that I'm blanking on), and I have no idea how it technically works or is defined in the Python code.
Are the brackets in this case just syntactic sugar for a function call? If so, how would one make an arbitrary function use brackets instead of parentheses? Otherwise, what is special about their use/definition in Pandas?
Note: The first part of this answer is a direct adaptation of my answer to this other question, that was answered before this question was reopened. I expand on the "why" in the second part.
So .loc and .iloc are not your typical functions
Indeed, they are not functions at all. I'll make examples with loc; iloc is analogous (it uses different internal classes).
The simplest way to find out what loc actually is:
import pandas as pd
df = pd.DataFrame()
print(df.loc.__class__)
which prints
<class 'pandas.core.indexing._LocIndexer'>
This tells us that df.loc is an instance of a _LocIndexer class. The syntax loc[...] derives from the fact that _LocIndexer defines __getitem__ and __setitem__*, which are the methods Python calls whenever you use the square-bracket syntax.
So yes, brackets are, technically, syntactic sugar for a function call, just not the function you thought it was. (There are of course many reasons why Python is designed this way; I won't go into the details here because 1) I am not sufficiently expert to provide an exhaustive answer and 2) there are a lot of better resources on the web about this topic.)
*Technically, it's its base class _LocationIndexer that defines those methods, I'm simplifying a bit here
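To see the mechanism in isolation, here is a minimal sketch (the Doubler class is made up for illustration) of an object that hooks the square-bracket syntax the same way _LocIndexer does:

class Doubler:
    def __getitem__(self, key):
        # called for obj[key]
        print(f"__getitem__({key!r})")
        return key * 2

    def __setitem__(self, key, value):
        # called for obj[key] = value
        print(f"__setitem__({key!r}, {value!r})")

d = Doubler()
print(d[3])   # prints __getitem__(3), then 6
d["a"] = 10   # prints __setitem__('a', 10)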
Why does Pandas use square brackets with .loc and .iloc?
I'm entering speculation territory here, because I couldn't find any document explicitly discussing design choices in Pandas. However, there are at least two good reasons I see for choosing square brackets.
The first, and most important, reason is: you simply can't do with a function call everything you can do with the square-bracket notation, because assigning to a function call is a syntax error in Python:
# contrived example to show this can't work
a = []
def f():
global a
return a
f().append(1) # OK
f() = dict() # SyntaxError: cannot assign to function call
Using round brackets for a "function" call invokes the underlying __call__ method (note that any class that defines __call__ is callable, so "function" call is a loose term: Python doesn't care whether something is actually a function or just behaves like one).
Using square brackets, instead, calls either __getitem__ or __setitem__ depending on where the expression appears (__setitem__ if it's on the left of an assignment operator, __getitem__ in any other case). There is no way to mimic this behaviour with a function call: you'd need a separate setter method to modify the data in the DataFrame, and it still wouldn't be allowed in an assignment operation:
# imaginary method-based alternative to the square bracket notation:
my_data = df.get_loc(my_index)
df.set_loc(my_index, my_data*2)
This example brings me to the second reason: consistency. You can access elements of a DataFrame via square brackets:
something = df['a']
df['b'] = 2*something
when using loc you're still trying to refer to some items in the DataFrame, so it's more consistent to use the same syntax instead of asking the user to use some getter and setter functions (it's also, I believe, "more pythonic", but that's a fuzzy concept I'd rather stay away from).
Underneath the covers, both are using the __setitem__ and __getitem__ functions.
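You can verify this directly in pandas:

import pandas as pd

df = pd.DataFrame({"a": [1, 2]})
print(df.loc[0])              # the usual square-bracket syntax
print(df.loc.__getitem__(0))  # the equivalent explicit call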
Related
Problem
I have a function make_pipeline that accepts an arbitrary number of functions, which it then calls to perform sequential data transformation. The resulting call chain performs transformations on a pandas.DataFrame. Some, but not all, of the functions it may call need to operate on a sub-array of the DataFrame. I have written multiple selector functions. However, at present each member function of the chain has to be explicitly given the user-selected selector/filter function. This is VERY error-prone, and accessibility is very important, as the end code is aimed at non-specialists (possibly with no Python/programming knowledge), so it must be "batteries included". This entire project is written in a functional style (that's what's always worked for me).
Sample Code
filter_func = simple_filter()
# The API looks like this
make_pipeline(
load_data("somepath", header = [1,0]),
transform1(arg1,arg2),
transform2(arg1,arg2, data_filter = filter_func),# This function needs access to user-defined filter function
transform3(arg1,arg2, data_filter = filter_func),# This function needs access to user-defined filter function
transform4(arg1,arg2),
)
Expected API
filter_func = simple_filter()
# The API looks like this
make_pipeline(
load_data("somepath", header = [1,0]),
transform1(arg1,arg2),
transform2(arg1,arg2),
transform3(arg1,arg2),
transform4(arg1,arg2),
)
Attempted
I thought that if the data_filter alias is available in the caller's namespace, it would also become available (something similar to a closure) to all functions the caller calls. This seems to happen with some toy examples, but it won't work in this case (it raises an UnboundLocalError).
What's a good way to make a function defined in one place available to certain interested functions in the call chain? I'm trying to avoid global.
Notes/Clarification
I've had problems with OOP and mutable states in the past, and functional programming has worked quite well. Hence I've set a goal for myself to NOT use classes (to the extent that Python enables me to anyways). So no classes.
I should have probably clarified this initially: in the pipeline, the output of every function is a DataFrame, and the input of every function (except load_data, obviously) is a DataFrame. The functions are decorated with a wrapper that calls functools.partial, because we want the user to supply the args to each function without executing it. The actual execution is done by a for loop in make_pipeline.
Each function accepts df: pandas.DataFrame plus all arguments that are specific to that function. The statement seen above, transform1(arg1, arg2, ...), actually calls the decorated transform1, which returns functools.partial(transform1, arg1, arg2, ...), which now has a signature like transform1(df: pandas.DataFrame).
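For illustration, here is a minimal sketch of what such a wrapper could look like (pipeline_step and the keyword-argument convention are my own assumptions, not the actual project code):

import functools
import pandas as pd

def pipeline_step(func):
    # Hypothetical decorator: calling the decorated step with its own
    # arguments returns a partial that only awaits the DataFrame.
    @functools.wraps(func)
    def configure(*args, **kwargs):
        return functools.partial(func, *args, **kwargs)
    return configure

@pipeline_step
def transform1(df: pd.DataFrame, scale: int = 1) -> pd.DataFrame:
    return df * scale

step = transform1(scale=2)                  # nothing runs yet
result = step(pd.DataFrame({"a": [1, 2]}))  # executes transform1(df, scale=2)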
load_dataframe is just a convenience function to load the initial DataFrame so that all other functions can begin operating on it. It just felt more intuitive to users to have it be part of the chain rather than a separate call.
The problem is this: I need a way for a filter function to be initialized (called) in only one place, such that every function in the call chain that needs access to the filter function gets it without it being explicitly passed as an argument to said function. If you're wondering why, it's because I feel that end users will find the explicit passing unintuitive and arbitrary: some functions need it, some don't. I'm also pretty certain that they will make all kinds of errors, like passing different filters, forgetting it sometimes, etc.
(Update) I've also tried inspect.signature() in make_pipeline to check if each function accepts a data_filter argument and pass it on. However, this raises an incorrect-function-signature error for some unclear reason (likely because of the decorators/partial calls). If signature could return the non-partial function signature, this would solve the issue, but I couldn't find much info in the docs.
Turns out it was pretty easy. The solution is inspect.signature.
import inspect
from typing import Any, Callable, Optional

def make_pipeline(*args, data_filter: Optional[Callable[..., Any]] = None):
    d = args[0]()  # call the load_data partial to get the initial DataFrame
    for arg in args[1:]:
        # only pass the filter to steps whose signature accepts it
        if "data_filter" in inspect.signature(arg).parameters:
            d = arg(d, data_filter=data_filter)
        else:
            d = arg(d)
    return d
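For reference, inspect.signature follows functools.partial objects and reports only the remaining (unbound) parameters, which is why the membership test above works. A quick check with an illustrative function:

import functools
import inspect

def transform2(df, arg1, arg2, data_filter=None):
    return df

step = functools.partial(transform2, arg1=1, arg2=2)
print("data_filter" in inspect.signature(step).parameters)  # True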
Leaving this here mostly for reference because I think this is a mini design pattern. I've also seen function.__closure__ mentioned on an unrelated subject; that may also work, but would likely be more complicated.
Consider the following code:
num = 1 + 1j
print(num.imag)
As opposed to
word = "hey"
print(word.islower())
One requires parentheses and the other doesn't. I know that in Python, when we refer to a function without parentheses, we get back only a reference to the function, but that doesn't really answer it. So does imag return a reference? Because it seems it does get evaluated and returns the imaginary part.
imag is not a method. It's simply a number-valued attribute.
islower is a method. In order to call the method, you put parentheses after the name.
num.imag is not a function, it's an attribute. To call a function you need the parentheses, or the __call__ method.
Attributes (e.g. imag) are like variables inside the object so you don't use parentheses to access them. Methods (e.g. islower()) are like functions inside the object so they do require parentheses to accept zero or more parameters and perform some work.
Objects can also have 'properties' that are special functions that behave like attributes (i.e. no parentheses) but can perform calculation or additional work when they are referenced or assigned.
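A small sketch contrasting the three kinds of access (the Circle class is made up for illustration):

class Circle:
    def __init__(self, radius):
        self.radius = radius      # attribute: a plain stored value

    def scaled(self, factor):     # method: requires parentheses to call
        return Circle(self.radius * factor)

    @property
    def area(self):               # property: computed on access, no parentheses
        return 3.14159 * self.radius ** 2

c = Circle(2)
print(c.radius)            # 2 (attribute)
print(c.scaled(3).radius)  # 6 (method call)
print(c.area)              # 12.56636 (property, accessed like an attribute)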
My question is basically the title; in this case I use .lower() as the function.
So:
>>> s = 'A'
>>> s.lower()
'a'
>>> str.lower(s)
'a'
Which one is better, or more preferred? (I think it should be the first one because it's easier, but I still don't know.)
If you're just calling it directly, using the bound method s.lower() is definitely preferred. It says what you mean (object: do this!), it's shorter, it's more obvious to the reader, and it's even faster with builtin types like str (but about the same speed with types you create yourself).
The difference is when you need to store the method, or pass it around as a callback, etc., instead of calling it right away. In that case, it's almost always obvious what you want to do:
If you're going to want to call the method on a string you haven't seen yet, or on a whole bunch of different strings, you want to store/pass/whatever the unbound method, str.lower.
If you want to call the method on a string you have sitting around, especially if you want to call it over and over on that same string, you want to store/pass/whatever the bound method, s.lower.
For the case of str.lower, it's pretty easy to think of uses for the unbound method, like:
lower_strings = map(str.lower, strings)
Or, shamelessly stealing from Dair's answer, because it's a better example than mine:
sorted_strings = sorted(strings, key=str.lower)
But it's pretty hard to imagine why you'd ever want to call the bound method s.lower over and over again.
But consider these other cases with different methods:
You're building a GUI, and you want a certain onclick method to get called on your window object every time a button is clicked. Obviously, you always want the method to get called on this particular window object, the one that owns the button, so you'll use the bound method self.onclick as your button callback, not the unbound method Window.onclick.
You're looking for unique values, so you keep a set, values_seen, of all the values you've seen so far. values_seen.add might be a useful thing to store or pass to some other function, but set.add wouldn't do you any good.
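For instance, that second case looks something like this:

values_seen = set()
remember = values_seen.add   # bound method: it remembers which set it belongs to
for value in [1, 2, 2, 3]:
    remember(value)
print(values_seen)           # {1, 2, 3}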
There's also a third case you didn't mention:
lambda x: x.lower()
That's just like str.lower, in that you can pass it to map or save it for later, but it's more complicated and slower. So, why would you ever do it?
Imagine you need a function that calls lower on anything that "duck types" as a string, not just actual strings. For example, b'AbC'.lower() is a perfectly valid thing to do. So, if you have a list of either strings or bytes but you're not sure which, and you want to lowercase them all, you can't map(str.lower, xs) or map(bytes.lower, xs), but you can map(lambda x: x.lower(), xs).
Although for that particular case you can—and usually should—just use a comprehension:
(x.lower() for x in xs)
.lower is more general than str.lower, since str.lower only works for actual strings. There are cases, such as bytes.lower, where calling .lower on the object is the only one of the two that works. Hence, unless you have a performance case (or a type-checking case), I would advise toward the more general option.
str.lower can still be convenient, as it can be passed into functions like map, and as the key parameter for sort along with other related functions.
The second one is easier if you want a one-time use, but generally, the first option is preferred.
If you plan to use the string at all in your following lines of code (which is probably true for 99% of code), you would almost definitely want a variable that you can refer to.
I realize this may be a bit broad, but I thought this was an interesting question that I haven't really seen an answer to. It may be hidden in the Python documentation somewhere, but as I'm new to Python, I haven't gone through all of it yet.
So: are there any general rules about things that we cannot assign to variables? Everything in Python is an object, and we can use variables for the typical standard purposes: storing strings, integers, lists, aliasing other variables, holding references to classes, etc. If we're clever, we can even do something along the lines of the below (off the top of my head), wherever that may be useful:
var = lambda: some_function()
or storing the result of a comparison to clean the code up, such as:
var = some_value < some_value ...
So, that being said I've never come across anything that I couldn't store as a variable if I really wanted to, and was wondering if there really are any limitations?
You can't store syntactical constructs in a variable. For example, you can't do
command = break
while condition:
if other_condition:
command
or
operator = +
three = 1 operator 2
You can't really store expressions and statements as objects in Python.
Sure, you can wrap an expression in a lambda, and you can wrap a series of statements in a code object or callable, but you can't easily manipulate them. For instance, changing all instances of addition to multiplication is not readily possible.
To some extent, this can be worked around with the ast module, which provides for parsing Python code into abstract syntax trees. You can then manipulate the trees, instead of the code itself, and pass it to compile() to turn it back into a code object.
However, this is a form of indirection, compensating for a feature Python itself lacks. ast can't really compare to the anything-goes flexibility of (say) Lisp macros.
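For instance, a rough sketch of the addition-to-multiplication rewrite mentioned above, done at the AST level:

import ast

class AddToMult(ast.NodeTransformer):
    def visit_BinOp(self, node):
        self.generic_visit(node)   # rewrite nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()   # turn every + into *
        return node

tree = ast.parse("1 + 2 + 3", mode="eval")
tree = ast.fix_missing_locations(AddToMult().visit(tree))
print(eval(compile(tree, "<ast>", "eval")))  # 6, i.e. 1 * 2 * 3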
According to the Language Reference, the right hand side of an assignment statement can be an 'expression list' or a 'yield expression'. An expression list is a comma-separated list of one or more expressions. You need to follow this through several more tokens to come up with anything concrete, but ultimately you can find that an 'expression' is any number of objects (literals or variable names, or the result of applying a unary operator such as not, ~ or - to a nested expression_list) chained together by any binary operator (such as the arithmetic, comparison or bitwise operators, or logical and and or) or the ternary a if condition else b.
You can also note in other parts of the language reference that an 'expression' is exactly something you can use as an argument to a function, or as the first part (before the for) of a list comprehension or generator expression.
This is a fairly broad definition - in fact, it amounts to "anything Python resolves to an object". But it does leave out a few things - for example, you can't directly store the less-than operator < in a variable, since it isn't a valid expression by itself (it has to be between two other expressions) and you have to put it in a function that uses it instead. Similarly, most of the Python keywords aren't expressions (the exceptions are True, False and None, which are all canonical names for certain objects).
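The usual workaround is the standard operator module, which exposes each operator as an ordinary function object that you can store:

import operator

less_than = operator.lt   # a function standing in for "<"
print(less_than(1, 2))    # True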
Note especially that functions are also objects, and hence the name of a function (without calling it) is a valid expression. This means that your example:
var = lambda: some_function()
can be written as:
var = some_function
By definition, a variable is something which can vary, or change. In its broadest sense, a variable is no more than a way of referring to a location in memory in your given program. Another way to think of a variable is as a container to place your information in.
Unlike popular statically typed languages, Python does not require variable declarations. You can place pretty much anything in a variable so long as you can come up with a name for it. Furthermore, in addition to the value of a variable being capable of changing, its type often can as well.
To address your question, I would say the limitations on a variable in Python relate only to a few basic necessary attributes:
A name
A scope
A value
(Usually) a type
As a result, things like operators (+ or * for instance) cannot be stored in a variable as they do not meet these basic requirements, and in general you cannot store expressions themselves as variables (unless you're wrapping them in a lambda expression).
As mentioned by Kevin, it's also worth noting that it is possible to sort of store an operator in a variable using the operator module. However, even doing so, you cannot perform the kinds of manipulations that a variable is otherwise subject to, as really you are just making a value assignment. An example of the operator module:
import operator

operations = {"+": operator.add,
              "-": operator.sub}

operator_variable_string = input('Give me an operator: ')
operator_function = operations[operator_variable_string]
result = operator_function(8, 4)  # 12 for "+", 4 for "-"
example:
a_list = [1, 2, 3]
a_list.len() # doesn't work
len(a_list) # works
Python being (very) object-oriented, I don't understand why the len function isn't inherited by the object as a method. Plus, I keep trying the wrong solution, since it appears to be the logical one to me.
Guido's explanation is here:
First of all, I chose len(x) over x.len() for HCI reasons (def __len__() came much later). There are two intertwined reasons actually, both HCI:
(a) For some operations, prefix notation just reads better than postfix — prefix (and infix!) operations have a long tradition in mathematics which likes notations where the visuals help the mathematician thinking about a problem. Compare the ease with which we rewrite a formula like x*(a+b) into x*a + x*b to the clumsiness of doing the same thing using a raw OO notation.
(b) When I read code that says len(x) I know that it is asking for the length of something. This tells me two things: the result is an integer, and the argument is some kind of container. To the contrary, when I read x.len(), I have to already know that x is some kind of container implementing an interface or inheriting from a class that has a standard len(). Witness the confusion we occasionally have when a class that is not implementing a mapping has a get() or keys() method, or something that isn’t a file has a write() method.
Saying the same thing in another way, I see ‘len‘ as a built-in operation. I’d hate to lose that. /…/
The short answer: 1) backwards compatibility and 2) there's not enough of a difference for it to really matter. For a more detailed explanation, read on.
The idiomatic Python approach to such operations is special methods which aren't intended to be called directly. For example, to make x + y work for your own class, you write a __add__ method. To make sure that int(spam) properly converts your custom class, write a __int__ method. To make sure that len(foo) does something sensible, write a __len__ method.
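For example, a minimal sketch of those hooks on a made-up class:

class Basket:
    def __init__(self, items):
        self.items = list(items)

    def __add__(self, other):   # makes basket + basket work
        return Basket(self.items + other.items)

    def __int__(self):          # makes int(basket) work
        return len(self.items)

    def __len__(self):          # makes len(basket) work
        return len(self.items)

b = Basket(["egg", "spam"]) + Basket(["ham"])
print(len(b), int(b))  # 3 3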
This is how things have always been with Python, and I think it makes a lot of sense for some things. In particular, this seems like a sensible way to implement operator overloading. As for the rest, different languages disagree; in Ruby you'd convert something to an integer by calling spam.to_i directly instead of saying int(spam).
You're right that Python is an extremely object-oriented language and that having to call an external function on an object to get its length seems odd. On the other hand, len(silly_walks) isn't any more onerous than silly_walks.len(), and Guido has said that he actually prefers it (http://mail.python.org/pipermail/python-3000/2006-November/004643.html).
It just isn't.
You can, however, do:
>>> [1,2,3].__len__()
3
Adding a __len__() method to a class is what makes the len() magic work.
This way fits in better with the rest of the language. The convention in Python is that you add __foo__ special methods to objects to make them have certain capabilities (rather than, e.g., deriving from a specific base class). For example, an object is
callable if it has a __call__ method,
iterable if it has an __iter__ method,
accessible with [] if it has __getitem__ and __setitem__,
...
One of these special methods is __len__ which makes it have a length accessible with len().
Maybe you're looking for __len__. If that method exists, then len(a) calls it:
>>> class Spam:
... def __len__(self): return 3
...
>>> s = Spam()
>>> len(s)
3
Well, there actually is a length method; it is just hidden:
>>> a_list = [1, 2, 3]
>>> a_list.__len__()
3
The len() built-in function appears to be simply a wrapper for a call to the object's hidden __len__() method.
I'm not sure why they decided to implement things this way, though.
There is some good info at the link below on why certain things are functions and others are methods. It does indeed cause some inconsistencies in the language.
http://mail.python.org/pipermail/python-dev/2008-January/076612.html