pandas dataframe where clause with dot versus brackets column selection - python

I have a regular DataFrame with a string-type (object) column. When I try to filter on the column using the equivalent of a WHERE clause, I get a KeyError when I use dot notation. With bracket notation, all is well.
I am referring to these two statements:
df[df.colA == 'blah']
df[df['colA'] == 'blah']
The first gives the equivalent of
KeyError: False
I am not posting a full example because I cannot reproduce the issue on a bespoke DataFrame built for illustration: when I try, both notations yield the same result.
So I am asking whether there is a difference between the two, and why.

The dot notation is just a convenient shortcut for accessing columns vs. the standard brackets. Notably, it doesn't work when the column name is something like sum that is already a DataFrame attribute or method, because attribute lookup resolves the method first. My bet is that the column name in your real example is running into that issue, so it works fine with bracket selection but is otherwise testing whether a bound method is equal to 'blah'. That comparison evaluates to the scalar False, and indexing with df[False] is what raises KeyError: False.
Quick example below:
In [67]: df = pd.DataFrame(np.arange(10).reshape(5,2), columns=["number", "sum"])
In [68]: df
Out[68]:
   number  sum
0       0    1
1       2    3
2       4    5
3       6    7
4       8    9
In [69]: df.number == 0
Out[69]:
0 True
1 False
2 False
3 False
4 False
Name: number, dtype: bool
In [70]: df.sum == 0
Out[70]: False
In [71]: df['sum'] == 0
Out[71]:
0 False
1 False
2 False
3 False
4 False
Name: sum, dtype: bool
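A quick way to confirm which kind of collision you are hitting is to check whether the dot attribute is callable while the same name also appears in df.columns. A minimal sketch along those lines:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(10).reshape(5, 2), columns=["number", "sum"])

# Attribute lookup wins: df.sum resolves to the DataFrame.sum method,
# not the column, so comparisons against it silently produce a scalar.
print(callable(df.sum))      # True: it's the bound DataFrame.sum method
print("sum" in df.columns)   # True: the column exists, but needs df["sum"]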

Related

A faster method than "for" to scan a DataFrame - Python

I'm looking for a way (ideally using a built-in pandas function) to scan a column of a DataFrame, comparing its own values across different indices.
Here is an example using a for loop. I have a dataframe with a single column col_1, and I want to create a column col_2 with True/False in this way:
df["col_2"] = "False"
N=5
for idx in range(0,len(df)-N):
for i in range (idx+1,idx+N+1):
if(df["col_1"].iloc[idx]==df["col_1"].iloc[i]):
df["col_2"].iloc[idx]=True
What I'm trying to do is compare the value of col_1 at the i-th index with the next N indices.
I'd like to do the same operation without a for loop. I've already tried shift and df.loc, but the computational time is similar.
Have you tried doing something like
df["col_1_shifted"] = df["col_1"].shift(N)
df["col_2"] = (df["col_1"] == df["col_1_shifted"])
As suggested by @QuangHoang in the comments, duplicated() works nicely for this. Looking more carefully at your double loop, you seem to want to flag all duplicates except the last occurrence, which is done by changing the keep argument from its default 'first' to 'last':
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
Example:
df = pd.DataFrame(np.random.randint(0, 5, 10), columns=['col_1'])
newdf = df.assign(col_2=df.duplicated(subset='col_1', keep='last'))
>>> newdf
   col_1  col_2
0      2  False
1      0   True
2      1   True
3      0   True
4      0  False
5      3  False
6      1   True
7      1  False
8      4   True
9      4  False
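If you need the exact N-window semantics of the original double loop (flag a row when any of the next N rows holds the same value) rather than the duplicated() behaviour, here is a minimal vectorized sketch using shifted comparisons; N and the column name follow the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({"col_1": np.random.randint(0, 5, 10)})
N = 5
# Compare each row against the rows 1..N positions ahead: shift(-i)
# pulls the value at index idx+i down to idx, and NaN comparisons are False.
# Unlike the loop, this also checks partial windows near the end of the frame.
matches = [df["col_1"].eq(df["col_1"].shift(-i)) for i in range(1, N + 1)]
df["col_2"] = np.logical_or.reduce(matches)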

What is the difference between .any() and .any(1)?

I have come across the .any() method several times, and I have used it quite a few times to check whether a particular string is contained in a dataframe. In that case it returns an array/dataframe (depending on how I wish to structure it) of Trues and Falses, depending on whether the string matches the values of the cells. I also found the .any(1) method, but I am not sure how or in which cases I should use it.
.any(1) is the same as .any(axis=1), which means look row-wise instead of per column.
With this sample dataframe:
   x1  x2  x3
0   1   1   0
1   0   0   0
2   1   0   0
See the different outcomes:
import pandas as pd
df = pd.read_csv('bool.csv')
print(df.any())
>>>
x1 True
x2 True
x3 False
dtype: bool
So .any() checks whether any value in a column is True.
print(df.any(1))
>>>
0 True
1 False
2 True
dtype: bool
So .any(1) checks whether any value in a row is True.
The documentation is fairly self-explanatory; however, for the sake of the question:
any() is a Series and DataFrame method. It checks whether any value in the caller object (DataFrame or Series) is non-zero and returns True if so. If all values are 0, it returns False.
Note: NaN values are not treated as 0; with the default skipna=True they are simply ignored.
Example DataFrame:
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
   A  B  C
0  1  0  0
1  2  2  0
>>> df.any()
A True
B True
C False
dtype: bool
>>> df.any(axis='columns')
0 True
1 True
dtype: bool
The same result, since axis=1 is equivalent to axis='columns' (reduce across the columns, i.e. per row):
>>> df.any(axis=1)
0 True
1 True
dtype: bool
any() is True if at least one value is True, and False if all values are False.
Here is a nice blog post about any() and all() by Guido van Rossum.
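Since the question mentions checking whether a string is contained somewhere in a dataframe, here is a minimal sketch (column names and data are made up for illustration) that builds a per-cell boolean mask and then reduces it row-wise with any(axis=1):
import pandas as pd

df = pd.DataFrame({"a": ["foo", "bar"], "b": ["baz", "qux"]})
# Per-cell substring test, then any(axis=1): True for rows where at
# least one cell contains the substring.
mask = df.apply(lambda col: col.str.contains("foo"))
print(df[mask.any(axis=1)])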

Issue With Checking If Number Contained in Index Pandas

So I have a dataframe, let's call it df1, that looks like the following.
Index    ID
1        90
2       508
3       692
4       944
5      1172
6      1998
7      2022
Now if I call (508 == df1['ID']).any() it returns True, as it should. But if I have another dataframe, df2, that looks like the following:
Index  Num
1       83
2      508
3      912
and I want to check whether the Nums are contained in the IDs from df1, using iloc returns a "len() of unsized object" error. This is the exact code I've used:
(df2.iloc[1][0] == df1['ID']).any()
which returns the error mentioned above. I've also tried assigning df2.iloc[1][0] to a variable, which didn't work, and calling int() on that variable, which also didn't work. Can anyone provide some insight on this?
Try turning it around.
(df1['ID'] == df2.iloc[1][0]).any()
True
This is happening as a result of how == is handled for the objects being passed to it.
In this case the first object is of type:
type(df2.iloc[1][0])
numpy.int64
and the second is of type:
pandas.core.series.Series
== (that is, __eq__) doesn't handle that combination well.
However, this works too:
(int(df2.iloc[1][0]) == df1['ID']).any()
Or:
(int(df2.iloc[1, 0]) == df1['ID']).any()
This also works:
(df1['ID'] == df2.iloc[1][0]).any()
Something like this checks, element-wise, which values of df1's ID column appear in df2's Num column:
>>> df1.ID.isin(df2.Num)
Index
1 False
2 True
3 False
4 False
5 False
6 False
7 False
Name: ID, dtype: bool
or:
>>> df2.Num.isin(df1.ID)
Index
1 False
2 True
3 False
Name: Num, dtype: bool
Or, if you just want to see the matching numbers by index location, keeping NaN where there is no match:
>>> df2.Num.where(df2.Num.isin(df1.ID))
Index
1      NaN
2    508.0
3      NaN
Name: Num, dtype: float64
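For a single scalar lookup, there is also a minimal membership test that sidesteps the comparison issue entirely (a sketch, reusing the frames from the question):
# .values gives a NumPy array, and `in` on an array performs an
# element-wise equality check under the hood.
int(df2.iloc[1, 0]) in df1['ID'].values   # True, since 508 is among the IDs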

How do I delete a row in Pandas dataframe when a specific column contains a value that signals to me that the row should be deleted?

A very simple question, everyone, but it is nearly impossible to find answers to basic questions in the official documentation.
I have a dataframe object in Pandas that has rows and columns.
One of the columns, named "CBSM", contains 'Y'/'N' flag values. I need to delete all rows from the dataframe where the value of the CBSM column is "Y".
I see that there is a method called dataframe.drop()
labels, axis, and level are three of the parameters that the drop() method takes, and I have no clue what values to give them to delete rows in the fashion I described above. I have a feeling the drop() method is not the right way to do what I want.
Please advise, thanks.
The technique you want is called boolean indexing.
You can try loc with str.contains:
df.loc[~df['CBSM'].str.contains('Y')]
Sample:
print(df)
   A CBSM  L
0  1    Y  4
1  1    N  6
2  2    N  3
print(df['CBSM'].str.contains('Y'))
0     True
1    False
2    False
Name: CBSM, dtype: bool
# inverted boolean Series
print(~df['CBSM'].str.contains('Y'))
0    False
1     True
2     True
Name: CBSM, dtype: bool
print(df.loc[~df['CBSM'].str.contains('Y')])
   A CBSM  L
1  1    N  6
2  2    N  3
Or:
print(df.loc[~(df['CBSM'] == 'Y')])
   A CBSM  L
1  1    N  6
2  2    N  3
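Since the question asks about drop() specifically: you can also pass the matching row labels to drop(), although the boolean indexing above is the more idiomatic route. A minimal sketch:
# Select the rows to delete, take their index labels, and drop them.
df = df.drop(df[df['CBSM'] == 'Y'].index)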

How do I determine if the id is unique?

What code should I type in an IPython notebook to determine whether the values in the ID column of a CSV file are unique?
I have tried searching online, but to no avail.
Probably the simplest is to compare the length of the df against the length of the unique values:
len(df) == len(df['ID'].unique())
will yield True or False
Also you could call drop_duplicates():
len(df) == len(df['ID'].drop_duplicates())
Also nunique:
len(df) == df['ID'].nunique()
Example:
In [6]:
df = pd.DataFrame({'a':[0,1,1,2,3,4]})
df
Out[6]:
   a
0  0
1  1
2  1
3  2
4  3
5  4
In [7]:
len(df) == df['a'].nunique()
Out[7]:
False
Another method is to invert the boolean Series returned from duplicated and pass it to np.all, which returns True only if all values are True. For this sample data we get a single False value, hence it yields False:
In [11]:
np.all(~df['a'].duplicated())
Out[11]:
False
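For completeness, pandas also exposes this check directly as the Series.is_unique attribute, which is True only when no value repeats:
df['a'].is_unique   # False for the sample above, since 1 appears twice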
