Unexpected difference between loc and ix - python

I've noticed a strange difference between loc and ix when subsetting a DataFrame in Pandas.
import pandas as pd
# Create a dataframe
df = pd.DataFrame({'id':[10,9,5,6,8], 'x1':[10.0,12.3,13.4,11.9,7.6], 'x2':['a','a','b','c','c']})
df.set_index('id', inplace=True)
df
x1 x2
id
10 10.0 a
9 12.3 a
5 13.4 b
6 11.9 c
8 7.6 c
df.loc[[10, 9, 7]] # 7 does not exist in the index so a NaN row is returned
df.loc[[7]] # KeyError: 'None of [[7]] are in the [index]'
df.ix[[7]] # 7 does not exist in the index so a NaN row is returned
Why does df.loc[[7]] throw an error while df.ix[[7]] returns a row with NaN? Is this a bug? If not, why are loc and ix designed this way?
(Note I'm using Pandas 0.17.1 on Python 3.5.1)

As @shanmuga says, this is (at least for loc) the intended and documented behaviour, and not a bug.
The documentation on loc/selection by label, gives the rules on this (http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label ):
At least 1 of the labels for which you ask, must be in the index or a KeyError will be raised!
This means that using loc with a single label (e.g. df.loc[[7]]) will raise an error if this label is not in the index, but using it with a list of labels (e.g. df.loc[[7,8,9]]) will not raise an error as long as at least one of those labels is in the index.
For ix I am less sure, and this is not clearly documented I think. But in any case, ix is much more permissive and has a lot of edge cases (fallback to integer position etc), and is rather a rabbit hole. But in general, ix will always return a result indexed with the provided labels (so does not check if the labels are in the index as loc does), unless it falls back to integer position indexing.
In most cases it is advised to use loc/iloc
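On newer pandas releases the picture is simpler: .ix was removed in pandas 1.0, and .loc raises a KeyError as soon as any label in the list is missing. A minimal sketch of two explicit alternatives, assuming the DataFrame from the question (the names wanted, with_missing and existing_only are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'id': [10, 9, 5, 6, 8],
                   'x1': [10.0, 12.3, 13.4, 11.9, 7.6],
                   'x2': ['a', 'a', 'b', 'c', 'c']}).set_index('id')

wanted = [10, 9, 7]  # 7 is not in the index

# reindex returns one row per requested label; missing labels become NaN rows
with_missing = df.reindex(wanted)

# Alternatively, keep only the labels that exist before handing them to loc
existing_only = df.loc[[label for label in wanted if label in df.index]]
```

reindex reproduces the "NaN row" result the question describes, while filtering the labels first avoids the KeyError without inventing rows.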

I think this behavior is intended, not a bug.
Although I couldn't find any official documentation, I found a comment by jreback from 21 Mar 2014 on a GitHub issue indicating this.
ix can very subtly give wrong results (use an index of say even numbers)
you can use whatever function you want; ix is still there, but it doesn't provide the guarantees that loc provides, namely that it won't interpret a number as a location
As for why it is designed so
As mentioned in docs
.ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
In my opinion, raising a KeyError would be ambiguous as to whether it came from the index or from an integer position. Instead, ix returns NaN when given a list.
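The "index of even numbers" remark can be illustrated without ix, which is no longer available in current pandas. A small sketch with a made-up Series, showing how the same integer means different things as a label and as a position:

```python
import pandas as pd

# Hypothetical data: an integer index made of even numbers
s = pd.Series(['a', 'b', 'c'], index=[0, 2, 4])

by_label = s.loc[2]      # the element whose label is 2
by_position = s.iloc[2]  # the third element, whatever its label

print(by_label, by_position)
```

Here by_label is 'b' and by_position is 'c'. An indexer that silently switches between the two interpretations can return a wrong value without raising, which is the guarantee loc and iloc provide by never switching.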

Related

Why do I need to slice my series for my function to work? [duplicate]

I'm confused about the rules Pandas uses when deciding that a selection from a dataframe is a copy of the original dataframe, or a view on the original.
If I have, for example,
df = pd.DataFrame(np.random.randn(8,8), columns=list('ABCDEFGH'), index=range(1,9))
I understand that a query returns a copy so that something like
foo = df.query('2 < index <= 5')
foo.loc[:,'E'] = 40
will have no effect on the original dataframe, df. I also understand that scalar or named slices return a view, so that assignments to these, such as
df.iloc[3] = 70
or
df.ix[1,'B':'E'] = 222
will change df. But I'm lost when it comes to more complicated cases. For example,
df[df.C <= df.B] = 7654321
changes df, but
df[df.C <= df.B].ix[:,'B':'E']
does not.
Is there a simple rule that Pandas is using that I'm just missing? What's going on in these specific cases; and in particular, how do I change all values (or a subset of values) in a dataframe that satisfy a particular query (as I'm attempting to do in the last example above)?
Note: This is not the same as this question; and I have read the documentation, but am not enlightened by it. I've also read through the "Related" questions on this topic, but I'm still missing the simple rule Pandas is using, and how I'd apply it to — for example — modify the values (or a subset of values) in a dataframe that satisfy a particular query.
Here are the rules; later ones override earlier ones:
All operations generate a copy
If inplace=True is provided, it will modify in-place; only some operations support this
An indexer that sets, e.g. .loc/.iloc/.iat/.at will set inplace.
An indexer that gets on a single-dtyped object is almost always a view (depending on the memory layout it may not be, which is why this is not reliable). This is mainly for efficiency. (The example from above uses .query; this will always return a copy, as it is evaluated by numexpr.)
An indexer that gets on a multiple-dtyped object is always a copy.
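Since the view-or-copy outcome of a "get" is not reliable, one way to stay out of trouble is to ask for a copy explicitly whenever a selection will be modified on its own. A minimal sketch, assuming deterministic data in place of the random frame from the question (the explicit .copy() is an addition here, not part of the rules above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(64, dtype=float).reshape(8, 8),
                  columns=list('ABCDEFGH'), index=range(1, 9))

# The query result is detached from df; .copy() makes that intent explicit
foo = df.query('2 < index <= 5').copy()
foo.loc[:, 'E'] = 40

print(foo['E'].tolist())
print(df['E'].tolist())  # the original column is untouched
```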
Your example of chained indexing
df[df.C <= df.B].loc[:,'B':'E']
is not guaranteed to work (and thus you should never do this).
Instead do:
df.loc[df.C <= df.B, 'B':'E']
as this is faster and will always work
Chained indexing is two separate Python operations and thus cannot be reliably intercepted by pandas (you will oftentimes get a SettingWithCopyWarning, but that is not 100% detectable either). The dev docs, which you pointed to, offer a much fuller explanation.
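To make the single-call form concrete, here is a short sketch that assigns to a subset of columns for the rows matching a condition. The data is made up so that exactly one row satisfies C <= B; the value -1.0 is arbitrary:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(64, dtype=float).reshape(8, 8),
                  columns=list('ABCDEFGH'), index=range(1, 9))
df.loc[3, 'C'] = 0.0  # force C <= B for the row labelled 3 only

mask = df['C'] <= df['B']

# One indexing call: row selection and column slice together
df.loc[mask, 'B':'E'] = -1.0

print(df.loc[3].tolist())
```

Because rows and columns are selected in a single .loc call, the assignment goes straight to df rather than to an intermediate object.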
Here is something funny:
u = df
v = df.loc[:, :]
w = df.iloc[:,:]
z = df.iloc[0:, ]
The first three seem to be all references of df, but the last one is not!

How to avoid Implicit fix done by python with UserWarning after Masking Data

When I mask my set of data with another, I get a UserWarning: Boolean Series key will be reindexed to match DataFrame index. How would I avoid this? Python will automatically reindex it, but the header for that column is blank and I cannot seem to rename it so that I can reference that column in my code. I would also prefer not to rely on this implicit correction.
I have tried to rename the columns manually in two ways pd.DataFrame.columns() or pd.DataFrame.rename(). For some reason I either get an error that it was expecting 3 elements rather than 4 or the empty column index that was added will not be renamed.
# select data and filter it which results in the error which fixes the dataframe but leaves the column name empty
stickData = data[['Time','Pitch Stick Position(IN)','Roll Stick Position (IN)']]
filteredData = stickData[contactData['CONTACT'] == 1]
# moving forward from the error I tried using rename which does not error but also does nothing
filteredData.rename(index={0:'Index'})
# I also tried this
filteredData.rename(index={'':'Old_Index'})
# I even went and tried to add the names of the dataframe like so which resulted in ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements
filteredData.columns = ['Old_Index','Time','Pitch Stick Position(IN)','Roll Stick Position (IN)']
The current dataframe of filteredData.head() looks like this after the implicit correction from python:
Index Time Pitch Stick Position(IN) Roll Stick Position (IN)
0 1421 240:19:06:40.200 0.007263 -0.028500
1 1422 240:19:06:40.400 0.022327 0.139893
2 1423 240:19:06:40.600 -0.016409 0.540756
3 1424 240:19:06:40.800 -0.199329 0.279971
4 1425 240:19:06:41.000 0.013719 -0.018069
But I would like to display with Old_index labeled and more so without relying on the implicit correction:
Index Old_index Time Pitch Stick Position(IN) Roll Stick Position (IN)
1 1421 240:19:06:40.200 0.007263 -0.028500
2 1422 240:19:06:40.400 0.022327 0.139893
3 1423 240:19:06:40.600 -0.016409 0.540756
4 1424 240:19:06:40.800 -0.199329 0.279971
5 1425 240:19:06:41.000 0.013719 -0.018069
There are a few errors in your code:
Don't use chained indexing. Use loc / iloc accessors instead.
Assign back to variables when using methods that don't operate in place.
In general, don't use Boolean indexers derived from other dataframes. If you can guarantee row alignment, then extract the NumPy array representation via pd.Series.values.
For example, this would work, assuming the rows in contactData align with the rows in filteredData
cols = ['Time','Pitch Stick Position(IN)','Roll Stick Position (IN)']
filteredData = data.loc[(contactData['CONTACT'] == 1).values, cols]\
                   .rename(index={0:'Index'})
Notice we can chain methods such as loc and rename instead of explicitly assigning back to filteredData each time.
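A toy version of the positional-mask idea, with invented data standing in for data and contactData (the column name 'Pitch' and all values are hypothetical). The two frames deliberately have different index labels, so only row order lines up:

```python
import pandas as pd

# Hypothetical stand-ins for the question's frames
data = pd.DataFrame({'Time': ['t0', 't1', 't2', 't3'],
                     'Pitch': [0.1, 0.2, 0.3, 0.4]},
                    index=[1421, 1422, 1423, 1424])
contactData = pd.DataFrame({'CONTACT': [1, 0, 1, 1]})  # index 0..3

cols = ['Time', 'Pitch']

# Converting the mask to a plain array drops its index, so no reindexing happens
mask = (contactData['CONTACT'] == 1).to_numpy()
filteredData = data.loc[mask, cols]

print(filteredData)
```

This is only safe when the rows of both frames are known to be in the same order, as the answer points out.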
can you try:
filteredData = stickData[contactData['CONTACT'] == 1].reset_index().rename(columns={'index': 'Old_index'})
or put this piece somewhere; I don't have your sample data, so I can't test it out:
.reset_index().rename(columns={'index': 'Old_index'})
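A small self-contained check of the reset_index plus rename idea, using invented values since the original data is not available. It relies on reset_index naming the new column 'index' when the index itself has no name:

```python
import pandas as pd

# Hypothetical filtered frame whose index holds the original row labels
filtered = pd.DataFrame({'Time': ['240:19:06:40.200', '240:19:06:40.400'],
                         'Roll': [-0.0285, 0.139893]},
                        index=[1421, 1422])

# Move the unnamed index into a regular column, then give it a usable name
labeled = filtered.reset_index().rename(columns={'index': 'Old_index'})

print(labeled.columns.tolist())
```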

Getting a scalar by integer location and column label (mixed indexing)

To get a scalar at integer location 0 and column label 'A' in a data frame df, I do chained indexing: df.iloc[0]['A']. This works, but pandas documentation says that chained indexing should be avoided.
An alternative I could come up with is df.iat[0, df.columns.get_loc('A')], which is just too much typing compared with the chained indexing. Is there a shorter way to do this?
Note: .ix indexer is deprecated.
Example:
df=pd.DataFrame({'A':[10,20,30,40]}, index=[3,2,1,0])
A
------
3 10
2 20
1 30
0 40
The scalar at integer location 0 in column A is 10 and not 40:
df.iat[0, df.columns.get_loc('A')]
Output: 10
You can refer to this other post on loc, iloc, at, iat
To answer your question directly:
This is called mixed type indexing. You want to access one dimension by position and the other by label.
To solve this problem, we need to translate either:
the position into a label then use loc (or at) for label indexing.
the label into a position then use iloc (or iat) for position indexing.
Using loc
We get the label at the 0 position
df.loc[df.index[0], 'A']
Or
df.at[df.index[0], 'A']
Or
df.get_value(df.index[0], 'A')
Using iloc
We get the position of the label using pd.Index.get_loc
df.iloc[0, df.columns.get_loc('A')]
Or
df.iat[0, df.columns.get_loc('A')]
Or
df.get_value(0, df.columns.get_loc('A'), takable=True)
I also included examples of using pd.DataFrame.get_value
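For readers on current pandas, where get_value is no longer available (it was deprecated in 0.21 and removed in 1.0), the remaining four spellings can be checked side by side. A minimal sketch using the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({'A': [10, 20, 30, 40]}, index=[3, 2, 1, 0])

# Translate the position into a label, then index by label
via_loc = df.loc[df.index[0], 'A']
via_at = df.at[df.index[0], 'A']

# Translate the label into a position, then index by position
col_pos = df.columns.get_loc('A')
via_iloc = df.iloc[0, col_pos]
via_iat = df.iat[0, col_pos]

print(via_loc, via_at, via_iloc, via_iat)
```

All four return the value in the first row (10), not the row labelled 0 (40).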
Here's Pandas Official Guide on doing Indexing with both Integer and Label. Hope this is helpful.

What is Pandas doing here that my indexes [0] and [1] refer to the same value?

I have a dataframe with these indices and values:
df[df.columns[0]]
1 example
2 example1
3 example2
When I access df[df.columns[0]][2], I get "example1". Makes sense. That's how indices work.
When I access df[df.columns[0]][0], however, I get "example", and I get "example" when I access df[df.columns[0]][1] as well. So for
df[df.columns[0]][0]
df[df.columns[0]][1]
I get "example".
Strangely, I can delete "row" 0, and the result is that 1 is deleted:
gf = df.drop(df.index[[0]])
gf
exampleDF
2 example1
3 example2
But when I delete row 1, then
2 example1
is deleted, as opposed to example.
This is a bit confusing to me; are there inconsistent standards in Pandas regarding row indices, or am I missing something / made an error?
You are probably causing pandas to switch between .iloc (index based) and .loc (label based) indexing.
All arrays in Python are 0 indexed. And I notice that indexes in your DataFrame are starting from 1. So when you run df[df.columns[0]][0], pandas realizes that there is no index named 0, and falls back to .iloc, which locates things by array indexing. Therefore it returns what it finds at the first location of the array, which is 'example'.
When you run df[df.columns[0]][1], however, pandas realizes that there is an index label 1, and uses .loc, which returns what it finds at that label, which again happens to be 'example'.
When you delete the first row, your DataFrame does not have index labels 0 and 1. So when you go to locate elements at those places in the way you are, it does not return None to you, but instead falls back on array based indexing and returns elements from the 0th and 1st places in the array.
To force pandas to use one of the two indexing techniques, use .iloc or .loc. .loc is label based, and will raise KeyError if you try df[df.columns[0]].loc[0]. .iloc is index based and will return 'example' when you try df[df.columns[0]].iloc[0].
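The difference between the two explicit indexers can be seen on a Series shaped like the column in the question. A short sketch (the flag label_zero_exists is just an illustrative name):

```python
import pandas as pd

# Same shape as the question's column: labels start at 1
s = pd.Series(['example', 'example1', 'example2'], index=[1, 2, 3])

first_by_position = s.iloc[0]  # position 0, regardless of labels
first_by_label = s.loc[1]      # the row whose label is 1

# Label 0 does not exist, so .loc refuses instead of guessing
try:
    s.loc[0]
    label_zero_exists = True
except KeyError:
    label_zero_exists = False

print(first_by_position, first_by_label, label_zero_exists)
```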
Additional note
These commands are bad practice: df[col_label].iloc[row_index]; df[col_label].loc[row_label].
Please use df.loc[row_label, col_label]; or df.iloc[row_index, col_index]; or df.ix[row_label_or_index, col_label_or_index]
See Different Choices for Indexing for more information.

Overriding a pandas DataFrame column with dictionary values, where the dictionary keys match a non-index column?

I have a DataFrame df, and a dict d, like so:
>>> df
a b
0 5 10
1 6 11
2 7 12
3 8 13
4 9 14
>>> d = {6: 22, 8: 26}
For every (key, val) in the dictionary, I'd like to find the row where column a matches the key, and override its b column with the value. For example, in this particular case, the value of b in row 1 will change to 22, and its value on row 3 will change to 26.
How should I do that?
Assuming it would be OK to propagate the new values to all rows where column a matches (in the event there were duplicates in column a) then:
for a_val, b_val in d.items():
    df['b'][df.a==a_val] = b_val
or to avoid chaining assignment operations:
for a_val, b_val in d.items():
    df.loc[df.a==a_val, 'b'] = b_val
Note that to use loc you must be working with Pandas 0.11 or newer. For older versions, you may be able to use .ix to prevent the chained assignment.
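Put together with the data from the question, the loc-based loop can be run as a complete script. This is a sketch of that one approach, written with dict.items() so it runs on Python 3:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 6, 7, 8, 9], 'b': [10, 11, 12, 13, 14]})
d = {6: 22, 8: 26}

# For each key, overwrite b in every row where a matches that key
for a_val, b_val in d.items():
    df.loc[df['a'] == a_val, 'b'] = b_val

print(df)
```

Rows 1 and 3 end up with b equal to 22 and 26, and the other rows keep their original values.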
@Jeff pointed to this link which discusses a phenomenon that I had already mentioned in this comment. Note that this is not an issue of correctness, since reversing the order of access has a predictable effect. You can see this easily, e.g. below:
In [102]: id(df[df.a==5]['b'])
Out[102]: 113795992
In [103]: id(df['b'][df.a==5])
Out[103]: 113725760
If you get the column first and then assign based on indexes into that column, the changes affect that column. And since the column is part of the DataFrame, the changes affect the DataFrame. If you index a set of rows first, you're now no longer talking about the same DataFrame, so getting the column from the filtered object won't give you a view of the original column.
@Jeff suggests that this makes it "incorrect", whereas my view is that this is the obvious and expected behavior. In the special case when you have a mixed data type column and there is some type promotion/demotion going on that would prevent Pandas from writing a value into the column, then you might have a correctness issue with this. But given that loc is not available until Pandas 0.11, I think it's still fair to point out how to do it with chained assignment, rather than pretending that loc is the only thing that could possibly ever be the correct choice.
If any one can provide more definitive reasons to think it is "incorrect" (as opposed to just not preferring this stylistically), please contribute and I will try to make a more thorough write-up about the various pitfalls.
