Getting a scalar by integer location and column label (mixed indexing)

To get a scalar at integer location 0 and column label 'A' in a data frame df, I do chained indexing: df.iloc[0]['A']. This works, but pandas documentation says that chained indexing should be avoided.
An alternative I could come up with is df.iat[0, df.columns.get_loc('A')], which is just too much typing compared with the chained indexing. Is there a shorter way to do this?
Note: .ix indexer is deprecated.
Example:
df = pd.DataFrame({'A': [10, 20, 30, 40]}, index=[3, 2, 1, 0])
    A
3  10
2  20
1  30
0  40
The scalar at integer location 0 in column A is 10 and not 40:
df.iat[0, df.columns.get_loc('A')]
Output: 10

You can refer to this other post on loc, iloc, at, iat
To answer your question directly:
This is called mixed type indexing. You want to access one dimension by position and the other by label.
To solve this problem, we need to translate either:
the position into a label then use loc (or at) for label indexing.
the label into a position then use iloc (or iat) for position indexing.
Using loc
We get the label at the 0 position
df.loc[df.index[0], 'A']
Or
df.at[df.index[0], 'A']
Or
df.get_value(df.index[0], 'A')
Using iloc
We get the position of the label using pd.Index.get_loc
df.iloc[0, df.columns.get_loc('A')]
Or
df.iat[0, df.columns.get_loc('A')]
Or
df.get_value(0, df.columns.get_loc('A'), takable=True)
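I also included examples using pd.DataFrame.get_value. Note that get_value was deprecated in pandas 0.21 and removed in 1.0, so prefer at or iat in current code.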
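The two translation strategies above can be put side by side in a minimal sketch. It reuses the frame from the question; the helper name scalar_at is hypothetical and exists only to shorten the iat form.

```python
import pandas as pd

# Frame from the question: the index labels run in reverse order.
df = pd.DataFrame({'A': [10, 20, 30, 40]}, index=[3, 2, 1, 0])


def scalar_at(frame, row_pos, col_label):
    """Hypothetical helper: row by integer position, column by label."""
    return frame.iat[row_pos, frame.columns.get_loc(col_label)]


# Strategy 1: translate the position into a label, then index by label.
by_label = df.at[df.index[0], 'A']

# Strategy 2: translate the label into a position, then index by position.
by_position = scalar_at(df, 0, 'A')

print(by_label, by_position)
```

Both lookups return the value stored in the first row, regardless of what that row's label happens to be. This assumes the column labels are unique, so that get_loc returns a single integer.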

Here's Pandas Official Guide on doing Indexing with both Integer and Label. Hope this is helpful.

Related

Selecting by index -1 in a df column / time series throws error

Let's assume we have a simple dataframe like this:
df = pd.DataFrame({'col1':[1,2,3], 'col2':[10,20,30]})
Then I can select elements like this
df.col2[0] or df.col2[1]
But if I want to select the last element with df.col2[-1] it results in the error message:
KeyError: -1
I know that there are workarounds for that. I could do, for example, df.col2[len(df)-1] or df.iloc[-1,1]. But why isn't the much simpler version, indexing directly by -1, allowed? Am I maybe missing another simple way to select by -1? Thanks

The index labels of your DataFrame are [0, 1, 2]. Your code df.col2[1] is equivalent to using loc, as in df['col2'].loc[1] (or df.col2.loc[1]). Your index does not contain the label -1, which is why you get the KeyError.
For positional indexing you need iloc (which works on a pandas Series as well as on a DataFrame), so you could write df['col2'].iloc[-1] (or df.col2.iloc[-1]).
As you can see, you can combine label-based ('col2') and position-based (-1) indexing. You don't have to pick only one of them, as df.iloc[-1,1] does, or as df.col2[len(df)-1] does (which is equivalent to df.loc[len(df)-1,'col2']).
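As a small sketch of the options discussed in this answer, using the frame from the question (the variable names are only illustrative):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [10, 20, 30]})

# Column by label, then last row by position on the resulting Series.
last_via_series = df['col2'].iloc[-1]

# Purely positional on the frame: the column label is translated to a position.
last_via_frame = df.iloc[-1, df.columns.get_loc('col2')]

# Purely label based: the last position is translated to its index label.
last_via_label = df.loc[df.index[-1], 'col2']

print(last_via_series, last_via_frame, last_via_label)
```

All three expressions pick the last element of col2 without relying on the index labels being 0, 1, 2.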

How to avoid Implicit fix done by python with UserWarning after Masking Data

When I mask my set of data with another, it shows a UserWarning: Boolean Series key will be reindexed to match DataFrame index. How would I avoid this? Python automatically reindexes it, but the header for that column is blank and I cannot seem to rename it so that I can reference that column in my code. I would also prefer not to rely on this implicit correction.
I have tried to rename the columns manually in two ways pd.DataFrame.columns() or pd.DataFrame.rename(). For some reason I either get an error that it was expecting 3 elements rather than 4 or the empty column index that was added will not be renamed.
# select data and filter it which results in the error which fixes the dataframe but leaves the column name empty
stickData = data[['Time','Pitch Stick Position(IN)','Roll Stick Position (IN)']]
filteredData = stickData[contactData['CONTACT'] == 1]
# moving forward from the error I tried using rename which does not error but also does nothing
filteredData.rename(index={0:'Index'})
# I also tried this
filteredData.rename(index={'':'Old_Index'})
# I even went and tried to add the names of the dataframe like so which resulted in ValueError: Length mismatch: Expected axis has 3 elements, new values have 4 elements
filteredData.columns = ['Old_Index','Time','Pitch Stick Position(IN)','Roll Stick Position (IN)']
The current dataframe of filteredData.head() looks like this after the implicit correction from python:
Index        Time              Pitch Stick Position(IN)  Roll Stick Position (IN)
0      1421  240:19:06:40.200                  0.007263                 -0.028500
1      1422  240:19:06:40.400                  0.022327                  0.139893
2      1423  240:19:06:40.600                 -0.016409                  0.540756
3      1424  240:19:06:40.800                 -0.199329                  0.279971
4      1425  240:19:06:41.000                  0.013719                 -0.018069
But I would like to display with Old_index labeled and more so without relying on the implicit correction:
Index  Old_index  Time              Pitch Stick Position(IN)  Roll Stick Position (IN)
1           1421  240:19:06:40.200                  0.007263                 -0.028500
2           1422  240:19:06:40.400                  0.022327                  0.139893
3           1423  240:19:06:40.600                 -0.016409                  0.540756
4           1424  240:19:06:40.800                 -0.199329                  0.279971
5           1425  240:19:06:41.000                  0.013719                 -0.018069

There are a few errors in your code:
Don't use chained indexing. Use loc / iloc accessors instead.
Assign back to variables when using methods that don't operate in place.
In general, don't use Boolean indexers derived from other dataframes. If you can guarantee row alignment, then extract the NumPy array representation via pd.Series.values.
For example, this would work, assuming the rows in contactData align with the rows in filteredData
cols = ['Time','Pitch Stick Position(IN)','Roll Stick Position (IN)']
filteredData = data.loc[(contactData['CONTACT'] == 1).values, cols]\
.rename(index={0:'Index'})
Notice we can chain methods such as loc and rename instead of explicitly assigning back to filteredData each time.
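A self-contained sketch of the same idea on toy data may help. The two small frames below are hypothetical stand-ins for data and contactData, with shortened column names. They have the same number of rows but different index labels, which is the situation where a Boolean Series would be realigned.

```python
import pandas as pd

# Hypothetical stand-in for `data`, with a non-default index.
data = pd.DataFrame(
    {'Time': ['40.200', '40.400', '40.600'], 'Pitch': [0.007, 0.022, -0.016]},
    index=[1421, 1422, 1423],
)
# Hypothetical stand-in for `contactData`, with the default index 0, 1, 2.
contactData = pd.DataFrame({'CONTACT': [1, 0, 1]})

# .values drops the index, so the mask is applied purely by row position.
mask = (contactData['CONTACT'] == 1).values
filteredData = data.loc[mask, ['Time', 'Pitch']]

print(filteredData)
```

Because the mask is a plain NumPy array, pandas has nothing to reindex. The row order of the two frames must therefore already match.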

Can you try:
filteredData = stickData[contactData['CONTACT'] == 1].reset_index().rename(columns={'index': 'Old_index'})
Or put this piece somewhere in your chain (I don't have your sample data, so I can't test it):
.reset_index().rename(columns={'index': 'Old_index'})
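Since the answer could not be tested against the real data, here is a minimal sketch on a made-up frame. The toy values and shortened column names are assumptions, and it relies on the index being unnamed so that reset_index() creates a column called 'index'.

```python
import pandas as pd

# Hypothetical filtered frame whose index still carries the original row labels.
filteredData = pd.DataFrame(
    {'Time': ['40.200', '40.400'], 'Pitch': [0.007, 0.022]},
    index=[1421, 1422],
)

# reset_index() moves the unnamed index into a column named 'index',
# which rename() then relabels as 'Old_index'.
labelled = filteredData.reset_index().rename(columns={'index': 'Old_index'})

print(labelled)
```

The result gets a fresh 0-based index and keeps the old labels in a regular, named column.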

Shrink pandas Df by deleting rows through modulo

I need to reduce (or select), for example, the rows whose index is a multiple of 4.
I have a 2MS dataframe and I want less data for a future plot, so the idea is to work with 1/4 of the data, leaving only the rows with index 4 - 8 - 16 - 20 - 4*n (or maybe the same but with 5*n).
If someone has any idea, I will be grateful.

You can use the iloc function, which takes a row/column slice.
From the docs
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the
axis), but may also be used with a boolean array.
So you could write df.iloc[::4, :]
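A short sketch of the slice on a toy frame (the column name value is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'value': range(20)})

# Every fourth row by position, starting from the first row.
every_fourth = df.iloc[::4, :]

# Same step, but starting at position 4 to match the 4, 8, ... pattern.
multiples_from_4 = df.iloc[4::4, :]

print(list(every_fourth.index), list(multiples_from_4.index))
```

The slice works on positions, so it thins the data by the same factor whatever the index labels are. Changing the step to 5 gives the 5*n variant.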

What is Pandas doing here that my indexes [0] and [1] refer to the same value?

I have a dataframe with these indices and values:
df[df.columns[0]]
1 example
2 example1
3 example2
When I access df[df.columns[0]][2], I get "example1". Makes sense. That's how indices work.
When I access df[df.columns[0]][0], however, I get "example", and I get "example" when I access df[df.columns[0]][1] as well. So for
df[df.columns[0]][0]
df[df.columns[0]][1]
I get "example".
Strangely, I can delete "row" 0, and the result is that 1 is deleted:
gf = df.drop(df.index[[0]])
gf
exampleDF
2 example1
3 example2
But when I delete row 1, then
2 example1
is deleted, as opposed to example.
This is a bit confusing to me; are there inconsistent standards in Pandas regarding row indices, or am I missing something / made an error?

You are probably causing pandas to switch between .iloc (position based) and .loc (label based) indexing.
All arrays in Python are 0-indexed, and I notice that the index labels in your DataFrame start from 1. So when you run df[df.columns[0]][0], pandas realizes that there is no index label 0 and falls back to .iloc, which locates things by array position. Therefore it returns what it finds at the first position of the array, which is 'example'.
When you run df[df.columns[0]][1], however, pandas realizes that there is an index label 1 and uses .loc, which returns what it finds at that label, which again happens to be 'example'.
When you delete the first row, your DataFrame no longer has the index labels 0 and 1. So when you try to locate elements at those places the way you do, it does not return None, but instead falls back on array-based indexing and returns the elements at the 0th and 1st positions in the array.
To force pandas to use one of the two indexing techniques, use .iloc or .loc explicitly. .loc is label based and will raise a KeyError if you try df[df.columns[0]].loc[0]. .iloc is position based and will return 'example' when you try df[df.columns[0]].iloc[0].
Additional note
These commands are bad practice: df[col_label].iloc[row_index]; df[col_label].loc[row_label].
Please use df.loc[row_label, col_label] or df.iloc[row_index, col_index]. df.ix[row_label_or_index, col_label_or_index] accepted both kinds of key, but .ix is deprecated.
See Different Choices for Indexing for more information.
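A minimal sketch of the explicit forms, on a frame shaped like the one in the question (the column name exampleDF is taken from the question's output):

```python
import pandas as pd

df = pd.DataFrame(
    {'exampleDF': ['example', 'example1', 'example2']},
    index=[1, 2, 3],
)

first_by_position = df.iloc[0, 0]          # first row, first column
first_by_label = df.loc[1, 'exampleDF']    # the row labelled 1

# Label lookup is strict: no row is labelled 0, so .loc raises KeyError.
try:
    df.loc[0, 'exampleDF']
    missing_label_raises = False
except KeyError:
    missing_label_raises = True

print(first_by_position, first_by_label, missing_label_raises)
```

With .loc and .iloc spelled out, each lookup is always label based or always position based, so the result no longer depends on which labels exist.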

Unexpected difference between loc and ix

I've noticed a strange difference between loc and ix when subsetting a DataFrame in Pandas.
import pandas as pd
# Create a dataframe
df = pd.DataFrame({'id':[10,9,5,6,8], 'x1':[10.0,12.3,13.4,11.9,7.6], 'x2':['a','a','b','c','c']})
df.set_index('id', inplace=True)
df
x1 x2
id
10 10.0 a
9 12.3 a
5 13.4 b
6 11.9 c
8 7.6 c
df.loc[[10, 9, 7]] # 7 does not exist in the index so a NaN row is returned
df.loc[[7]] # KeyError: 'None of [[7]] are in the [index]'
df.ix[[7]] # 7 does not exist in the index so a NaN row is returned
Why does df.loc[[7]] throw an error while df.ix[[7]] returns a row with NaN? Is this a bug? If not, why are loc and ix designed this way?
(Note I'm using Pandas 0.17.1 on Python 3.5.1)

As @shanmuga says, this is (at least for loc) the intended and documented behaviour, and not a bug.
The documentation on loc/selection by label, gives the rules on this (http://pandas.pydata.org/pandas-docs/stable/indexing.html#selection-by-label ):
At least 1 of the labels for which you ask, must be in the index or a KeyError will be raised!
This means that using loc with a single label (e.g. df.loc[[7]]) will raise an error if this label is not in the index, whereas using it with a list of labels (e.g. df.loc[[7,8,9]]) will not raise an error as long as at least one of those labels is in the index.
For ix I am less sure, and this is not clearly documented I think. But in any case, ix is much more permissive and has a lot of edge cases (fallback to integer position etc), and is rather a rabbit hole. But in general, ix will always return a result indexed with the provided labels (so does not check if the labels are in the index as loc does), unless it falls back to integer position indexing.
In most cases it is advised to use loc/iloc
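The loc behaviour described above is that of the pandas version in the question (0.17.1). As far as I know, later releases raise a KeyError from loc whenever any label in the list is missing. A sketch that does not depend on that detail uses reindex, which is documented to return NaN rows for labels that are absent. The frame is the one from the question.

```python
import pandas as pd

df = pd.DataFrame(
    {'x1': [10.0, 12.3, 13.4, 11.9, 7.6], 'x2': ['a', 'a', 'b', 'c', 'c']},
    index=pd.Index([10, 9, 5, 6, 8], name='id'),
)

# Label 7 is not in the index, so reindex fills its row with NaN.
with_missing = df.reindex([10, 9, 7])

print(with_missing)
```

This makes the "missing labels become NaN rows" intent explicit rather than relying on how a particular indexer treats missing labels.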

I think this behavior is intended, not a bug.
Although I couldn't find any official documentation, I found a comment by jreback from 21 Mar 2014 on a GitHub issue indicating this.
ix can very subtly give wrong results (use an index of say even numbers)
you can use whatever function you want; ix is still there, but it doesn't provide the guarantees that loc provides, namely that it won't interpret a number as a location
As for why it is designed so
As mentioned in docs
.ix supports mixed integer and label based access. It is primarily label based, but will fall back to integer positional access unless the corresponding axis is of integer type.
In my opinion, raising a KeyError would be ambiguous as to whether it came from the index or from the integer position. Instead, ix returns NaN when given a list.
