Python Pandas: Boolean indexing on multiple columns [duplicate] - python

This question already has answers here:
selecting across multiple columns with pandas
(3 answers)
Closed 9 years ago.
Despite there being at least two good tutorials on how to index a DataFrame in Python's pandas library, I still can't work out an elegant way of SELECTing on more than one column.
>>> d = pd.DataFrame({'x':[1, 2, 3, 4, 5], 'y':[4, 5, 6, 7, 8]})
>>> d
x y
0 1 4
1 2 5
2 3 6
3 4 7
4 5 8
>>> d[d['x']>2] # This works fine
x y
2 3 6
3 4 7
4 5 8
>>> d[d['x']>2 & d['y']>7] # I had expected this to work, but it doesn't
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I have found (what I think is) a rather inelegant way of doing it, like this
>>> d[d['x']>2][d['y']>7]
But it's not pretty, and it scores fairly low for readability (I think).
Is there a better, more Python-tastic way?

It is an operator precedence issue.
You should add extra parentheses to make your multi-condition test work:
d[(d['x']>2) & (d['y']>7)]
This section of the tutorial you mentioned shows an example with several boolean conditions, and the parentheses are used there.
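To see why the parentheses matter: `&` binds more tightly than `>`, so without them the expression is parsed as `d['x'] > (2 & d['y']) > 7`. A short sketch of the working form, plus `DataFrame.query` as an alternative that sidesteps the precedence issue:

```python
import pandas as pd

d = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [4, 5, 6, 7, 8]})

# Parenthesize each comparison so & combines two boolean Series.
result = d[(d['x'] > 2) & (d['y'] > 7)]
print(result)
#    x  y
# 4  5  8

# query() parses the expression itself, so plain 'and' is allowed here.
result_q = d.query('x > 2 and y > 7')
```

Both forms select the same rows; `query` is arguably more readable when there are many conditions.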

There may still be a better way, but note that joining the conditions with a plain and raises the same ambiguous-truth-value error. query avoids the issue:
In [56]: d.query('x > 2 and y > 7')
Out[56]:
x y
4 5 8
works.

Related

Using numpy.where to calculate new pandas column, with multiple conditions

I have a problem with regards as to how to appropriately code this condition. I'm currently creating a new pandas column in my dataframe, new_column, which performs a subtraction on the values in column test, based on what index of the data we are at. I'm currently using this code to get it to subtract a different value every 4 times:
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
data['new_column'] = np.where(data.index%4,
data['test']-subtraction_value,
data['test']-subtraction_value_2)
print(data['new_column'])
[6,1,2,1,-5,0,-1,3,4,6]
However, I now wish to get it performing the higher subtraction on the first two positions in the column, and then 3 subtractions with the original value, another two with the higher subtraction value, 3 small subtractions, and so forth. I thought I could do it this way, with an | condition in my np.where statement:
data['new_column'] = np.where((data.index%4) | (data.index%5),
data['test']-subtraction_value,
data['test']-subtraction_value_2)
However, this didn't work, and I feel my maths may be slightly off. My desired output would look like this:
print(data['new_column'])
[6,-2,2,1,-2,-3,-4,3,7,6]
As you can see, this slightly shifts the pattern. Can I still use numpy.where() here, or do I have to take a new approach? Any help would be greatly appreciated!
As mentioned in the comment section, the output should equal
[6,-2,2,1,-2,-3,-4,2,7,6] instead of [6,-2,2,1,-2,-3,-4,3,7,6] according to your logic. Given that, you can do the following:
import pandas as pd
import numpy as np
from itertools import chain
subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test":[12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})
index_pos_large_subtraction = list(chain.from_iterable((data.index[i], data.index[i+1]) for i in range(0, len(data)-1, 5)))
data['new_column'] = np.where(~data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value, data['test']-subtraction_value_2)
# The next line is equivalent to the previous one
# data['new_column'] = np.where(data.index.isin(index_pos_large_subtraction), data['test']-subtraction_value_2, data['test']-subtraction_value)
---------------------------------------------
test new_column
0 12 6
1 4 -2
2 5 2
3 4 1
4 1 -2
5 3 -3
6 2 -4
7 5 2
8 10 7
9 9 6
---------------------------------------------
As you can see, np.where works fine. Your masking condition is the problem and needs to be adjusted, you are not selecting rows according to your logic.
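If the intended pattern is simply "two large subtractions, then three small ones, repeating every five rows", the mask can also be written as a single modulo test, which avoids building the index list. A sketch, assuming that repeating block of five is the actual requirement:

```python
import numpy as np
import pandas as pd

subtraction_value = 3
subtraction_value_2 = 6
data = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9]})

# Positions 0 and 1 of every block of five get the larger subtraction.
mask = data.index % 5 < 2
data['new_column'] = np.where(mask,
                              data['test'] - subtraction_value_2,
                              data['test'] - subtraction_value)
print(data['new_column'].tolist())
# [6, -2, 2, 1, -2, -3, -4, 2, 7, 6]
```

This reproduces the corrected output from the answer above with one vectorized condition.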

How to find elements that are in first pandas Data frame and not in second, and viceversa. python [duplicate]

This question already has answers here:
set difference for pandas
(12 answers)
Closed 11 months ago.
I have two data frames.
first_dataframe
id
9
8
6
5
7
4
second_dataframe
id
6
4
1
5
2
3
Note: My dataframe has many columns, but I need to compare only based on ID
I need to find:
ids that are in first dataframe and not in second [7,8,9]
ids that are in second dataframe and not in first [1,2,3]
I have searched for an answer, but all the solutions I've found don't seem to work for me, because they look for differences based on index.
Use set subtraction:
inDF1_notinDF2 = set(df1['id']) - set(df2['id']) # Removes all items that are in df2 from df1
inDF2_notinDF1 = set(df2['id']) - set(df1['id']) # Removes all items that are in df1 from df2
Output:
>>> inDF1_notinDF2
{7, 8, 9}
>>> inDF2_notinDF1
{1, 2, 3}
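Since the question mentions the dataframes have many other columns, a pandas-native variant with `isin` keeps whole rows rather than just the ids. A sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [9, 8, 6, 5, 7, 4]})
df2 = pd.DataFrame({'id': [6, 4, 1, 5, 2, 3]})

# Rows of df1 whose id does not appear in df2, and vice versa,
# preserving all other columns of each frame.
only_in_df1 = df1[~df1['id'].isin(df2['id'])]
only_in_df2 = df2[~df2['id'].isin(df1['id'])]

print(sorted(only_in_df1['id']))  # [7, 8, 9]
print(sorted(only_in_df2['id']))  # [1, 2, 3]
```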

how to subset by fixed column and row by boolean in pandas? [duplicate]

This question already has answers here:
How to deal with SettingWithCopyWarning in Pandas
(20 answers)
Closed 3 years ago.
I am coming from an R background and need elementary help with pandas.
if I have a dataframe like this
df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))
I want to subset dataframe to select a fixed column and select a row by a boolean.
For example
df.iloc[df.2 > 4][2]
then I want to set the value for the subset cell to equal a value.
something like
df.iloc[df.2 > 4][2] = 7
It seems valid to me; however, it seems pandas works with booleans more strictly than R does.
Here you want .loc
df.loc[df[2] > 4,2]
1 6
Name: 2, dtype: int64
df.loc[df[2] > 4,2]=7
df
0 1 2
0 1 2 3
1 4 5 7
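The chained form `df[df[2] > 4][2] = 7` is exactly what SettingWithCopyWarning warns about: the first subscript builds an intermediate object, so the assignment may land on a copy and silently leave `df` unchanged. A sketch contrasting the two:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6]]))

# Chained indexing: assignment may hit a temporary copy, not df itself.
# df[df[2] > 4][2] = 7   # triggers SettingWithCopyWarning

# Single .loc call: row mask and column label in one indexing step,
# so the write goes directly into df.
df.loc[df[2] > 4, 2] = 7
print(df)
#    0  1  2
# 0  1  2  3
# 1  4  5  7
```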

How to index column uniquely in Python using Pandas? [duplicate]

This question already has answers here:
Pandas DENSE RANK
(4 answers)
Closed 4 years ago.
I am trying to generate a unique index column in my dataset.
I have a column in my dataset as follows:
665678, 665678, 665678, 665682, 665682, 665682, 665690, 665690
And I would like to generate a separately indexed column looking like this:
1, 1, 1, 2, 2, 2, 3, 3
I came across the post How to index columns uniquely? that describes exactly what I am trying to do. But since the solutions are described for R, I wanted to know how I can implement the same in Python using Pandas.
Thanks
Use -
df.groupby('col').ngroup()+1
Output
0 1
1 1
2 1
3 2
4 2
5 2
6 3
7 3
dtype: int64
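The same dense labelling can also be produced with `pd.factorize`, which numbers groups in order of first appearance rather than sorted key order (the two agree when the column is already sorted, as here). A sketch using the values from the question:

```python
import pandas as pd

df = pd.DataFrame({'col': [665678, 665678, 665678, 665682,
                           665682, 665682, 665690, 665690]})

# ngroup() numbers groups by sorted key order; factorize() numbers
# them by first appearance. Both start at 0, hence the +1.
df['group_id'] = df.groupby('col').ngroup() + 1
df['group_id_alt'] = pd.factorize(df['col'])[0] + 1

print(df['group_id'].tolist())  # [1, 1, 1, 2, 2, 2, 3, 3]
```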

pandas series sub-setting by index

Here is my example:
import pandas as pd
df = pd.DataFrame({'col_1':[1,5,6,77,9],'col_2':[6,2,4,2,5]})
df.index = [8,9,10,11,12]
This sub-setting is by row order:
df.col_1[2:5]
returns
10 6
11 77
12 9
Name: col_1, dtype: int64
while this subsetting is by index label and does not work:
df.col_1[2]
returns:
KeyError: 2
I find this very confusing and am curious: what is the reason behind it?
Your statements are ambiguous, so it is best to explicitly define what you want.
df.col_1[2:5] works like df.col_1.iloc[2:5], using integer location.
Whereas df.col_1[2] works like df.col_1.loc[2], using index label location; since there is no index labelled 2, you get the KeyError.
Hence it is best to state whether you are using integer location with .iloc or index label location with .loc.
See Pandas Indexing docs.
Let's assume this is the initial DataFrame:
df = pd.DataFrame(
{
'col_1':[1, 5, 6, 77, 9],
'col_2':[6, 2, 4, 2, 5]
},
index=list('abcde')
)
df
Out:
col_1 col_2
a 1 6
b 5 2
c 6 4
d 77 2
e 9 5
The index consists of strings so it is generally obvious what you are trying to do:
df['col_1']['b'] You passed a string so you are probably trying to access by label. It returns 5.
df['col_1'][1] You passed an integer so you are probably trying to access by position. It returns 5.
Same deal with slices: df['col_1']['b':'d'] uses labels and df['col_1'][1:4] uses positions.
When the index is also integer, nothing is obvious anymore.
df = pd.DataFrame(
{
'col_1':[1, 5, 6, 77, 9],
'col_2':[6, 2, 4, 2, 5]
},
index=[8, 9, 10, 11, 12]
)
df
Out:
col_1 col_2
8 1 6
9 5 2
10 6 4
11 77 2
12 9 5
Let's say you type df['col_1'][8]. Are you trying to access by label or by position? What if it was a slice? Nobody knows. At this point, pandas chooses one of them based on typical usage. It is in the end a Series, and what distinguishes a Series from an array is its labels, so the choice for df['col_1'][8] is labels.
Slicing with labels is not that common, so pandas is being smart here and uses positions when you pass a slice. Is it inconsistent? Yes. Should you avoid it? Yes. This is the main reason ix was deprecated.
Explicit is better than implicit, so use either iloc or loc when there is room for ambiguity. loc will raise a KeyError if you ask for a label that does not exist, and iloc will raise an IndexError if you ask for a position that is out of bounds.
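A short sketch of the two accessors on the integer-indexed Series from the question, showing which lookups succeed and which fail:

```python
import pandas as pd

s = pd.Series([1, 5, 6, 77, 9], index=[8, 9, 10, 11, 12], name='col_1')

print(s.loc[8])              # 1 -- the value at label 8
print(s.iloc[0])             # 1 -- the value at position 0 (same element)
print(s.iloc[2:5].tolist())  # [6, 77, 9] -- positions 2 through 4

# s.loc[2]   would raise KeyError: there is no label 2
# s.iloc[8]  would raise IndexError: there is no position 8
```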
