Pandas: how to check whether the values of specific columns are the same - python

I have a data frame with 5 columns called 'A', 'B', 'C', 'D' and 'E'. I want to filter the data frame to the rows where the values of columns 'A', 'C' and 'E' are equal.
I have done the following:
OutputDF = DF[DF['A']==DF['C']==DF['E']]
It gives the following error:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

You can select the columns with a list, compare them all against the first column of the list with DataFrame.eq, and then test whether all values in each row are True with DataFrame.all:
print (df)
A B C D E
0 1 2 3 4 5
1 1 2 1 4 1
2 2 2 2 4 2
L = ['A','C','E']
df = df[df[L].eq(df[L[0]], axis=0).all(axis=1)]
print (df)
A B C D E
1 1 2 1 4 1
2 2 2 2 4 2

To address why this happens:
import pandas as pd
DF = pd.DataFrame({"A": [1, 2],
"C": [1, 2],
"E": [1, 2],
})
OutputDF = DF[DF['A']==DF['C']==DF['E']]
#ValueError: The truth value of a Series is ambiguous.
The issue is that, because of how Python chains comparison operators, DF['A']==DF['C']==DF['E'] is interpreted as
(DF['A']==DF['C']) and (DF['C']==DF['E'])
Essentially, we are attempting a boolean and between two Series, which raises the error. A Series holds multiple values, while and expects a single truth value on each side of the operator, so it is ambiguous how to reduce each Series to a single value.
To write the condition correctly, use the element-wise & operator instead (the parentheses are important, because & binds more tightly than ==):
OutputDF = DF[(DF['A']==DF['C']) & (DF['C']==DF['E'])]
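As a quick check that the & form and the eq/all form select the same rows, here is a minimal sketch using the three-row sample frame from the first answer. The names mask_and, mask_eq and cols are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 2], "B": [2, 2, 2], "C": [3, 1, 2],
                   "D": [4, 4, 4], "E": [5, 1, 2]})

# Element-wise AND of two boolean Series; the parentheses are required
# because & binds more tightly than ==.
mask_and = (df["A"] == df["C"]) & (df["C"] == df["E"])

# Compare every listed column against the first one, row by row.
cols = ["A", "C", "E"]
mask_eq = df[cols].eq(df[cols[0]], axis=0).all(axis=1)

print(mask_and.tolist() == mask_eq.tolist())  # both masks agree
print(df[mask_and])                           # rows 1 and 2
```

The eq/all form scales to any number of columns by extending the list, while the & form needs one extra comparison per column.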

Related

How to build a new column in Pandas from a Conditional (New Column should output strings)

I'm trying to create a column in pandas using a conditional to create a qualitative observation.
For example, if the data frame looks like this:
Distance
1 1
2 5
3 40
4 15
I want to create a new column (let's call it df['length']) which is an observation on the distances.
For example:
if df[Distance] = 1:
print('Short')
I want 'Short' to be input into the new column for each row that fits the conditional.
Or for example:
if df[Distance] > 10:
print('Long')
I want each row that fits the conditional in the new column to be 'Long'.
How would I go about doing this?
I'm trying to write it into a function. This is what I have now:
def trip_distance(row):
    df = pd.read_csv('taxi_january_standard_rate.csv')
    if df['trip_distance'] > 50 :
        return "Long"
and then I try and use that to populate a new column:
df['trip_length'] = df.apply(trip_distance , axis=1)
but it doesn't seem to work. It's giving me an error:
('The truth value of a Series is ambiguous. Use a.empty, a.bool(),
a.item(), a.any() or a.all().', 'occurred at index 0')
Basically, I'm trying to give 5 Qualitative descriptions to a column in a taxicab data set, where for each distance greater than a certain value, I describe it as 'Long' or if it is close to the mean, I describe it as 'Average', etc.
You need np.where:
import numpy as np
df['Length']=np.where(df['Distance']>10,'Long','Short')
If you want multiple conditions, go with @sacul's solution and use np.select:
df['length'] = np.select([df.Distance < 2, df.Distance > 10], ['short', 'long'], 'average')
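The question asks for several qualitative labels, so a slightly longer np.select call may help. This is a sketch only: the cut-off values and the label names below are placeholders, not taken from the question's taxi data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Distance": [1, 5, 40, 15]})

# Conditions are checked in order; the first one that matches wins.
# The thresholds are placeholder values chosen for illustration.
conditions = [
    df["Distance"] < 2,
    df["Distance"] < 10,
    df["Distance"] < 20,
]
labels = ["Short", "Average", "Long"]

# Rows matching none of the conditions receive the default label.
df["length"] = np.select(conditions, labels, default="Very long")
print(df)
```

Because the first matching condition wins, ordering the thresholds from smallest to largest avoids having to write both bounds of every range.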
>>> l = [1, 5, 40, 15]
>>> df = pd.DataFrame(l,columns=['Distannce'])
>>> df
Distannce
0 1
1 5
2 40
3 15
>>> df['length'] = np.nan
>>> df['length'][df['Distannce'] > 10] = 'Long'
>>> df
Distannce length
0 1 NaN
1 5 NaN
2 40 Long
3 15 Long
>>> df['length'][df['Distannce'] == 1] = 'Short'
>>> df
Distannce length
0 1 Short
1 5 NaN
2 40 Long
3 15 Long
>>>
Let me know if it helps, also please mark as answer if it works for you.
Alternatively you could do:
df.loc[df['Distance'] > 10, 'length'] = 'Long'
df.loc[df['Distance'] == 1, 'length'] = 'Short'
Output:
Distance length
0 1 Short
1 5 NaN
2 40 Long
3 15 Long
You can fill the remaining NaN values with whatever value you want using fillna.
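Putting the .loc assignments and fillna together might look like the following sketch. The fallback label "Average" is an assumption borrowed from the wording of the question.

```python
import pandas as pd

df = pd.DataFrame({"Distance": [1, 5, 40, 15]})

# Label only the rows that match each condition; the rest stay NaN.
df.loc[df["Distance"] > 10, "length"] = "Long"
df.loc[df["Distance"] == 1, "length"] = "Short"

# Replace the remaining NaN values with a fallback label (assumed here).
df["length"] = df["length"].fillna("Average")
print(df)
```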

Key Error: 0 while searching in pandas dataframe [duplicate]

I have the dataframe shown below. I need to get the scalar value of column B that corresponds to a given value of A (which is a variable in my script). I'm trying .loc, but it returns a Series instead of a scalar value. How do I get the scalar value?
>>> x = pd.DataFrame({'A' : [0,1,2], 'B' : [4,5,6]})
>>> x
A B
0 0 4
1 1 5
2 2 6
>>> x.loc[x['A'] == 2]['B']
2 6
Name: B, dtype: int64
>>> type(x.loc[x['A'] == 2]['B'])
<class 'pandas.core.series.Series'>
First of all, you're better off accessing both the row and column indices from the .loc:
x.loc[x['A'] == 2, 'B']
Second, you can always get at the underlying NumPy array using .values on a Series or DataFrame:
In : x.loc[x['A'] == 2, 'B'].values[0]
Out: 6
Finally, if you're not interested in the original question's "conditional indexing", there are also specific accessors designed to get a single scalar value from a DataFrame: dataframe.at[index, column] or dataframe.iat[i, j] (these are similar to .loc[] and .iloc[] but designed for quick access to a single value).
Elaborating on @ShineZhang's comment:
x.set_index('A').at[2, 'B']
6
pd.__version__
u'0.22.0'
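The options above can be compared side by side. This is a sketch on the question's frame; the variable names are illustrative. Note that matches[0] would look up the label 0 rather than the first position, which is one way to end up with the KeyError: 0 in the title.

```python
import pandas as pd

x = pd.DataFrame({"A": [0, 1, 2], "B": [4, 5, 6]})

matches = x.loc[x["A"] == 2, "B"]   # still a Series, here of length 1

first = matches.iloc[0]             # first match by position, not by label
only = matches.item()               # raises ValueError unless exactly one match
by_label = x.set_index("A").at[2, "B"]  # assumes the values in A are unique

print(first, only, by_label)
```

Which one fits best depends on whether several matches are possible: .iloc[0] silently takes the first, while .item() fails loudly if the match is not unique.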

Forcing pandas .iloc to return a single-row dataframe?

For programming purposes, I want .iloc to consistently return a data frame, even when the result has only one row. How can I accomplish this?
Currently, .iloc returns a Series when the result only has one row. Example:
In [1]: df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
In [2]: df
Out[2]:
a b
0 1 3
1 2 4
In [3]: type(df.iloc[0, :])
Out[3]: pandas.core.series.Series
This behavior is poor for 2 reasons:
- Depending on the number of chosen rows, .iloc can return either a Series or a DataFrame, forcing me to check for this manually in my code.
- .loc, on the other hand, always returns a DataFrame, making pandas inconsistent within itself (wrong info, as pointed out in the comments).
For R users, this can be accomplished with drop = FALSE, or by using tidyverse's tibble, which always returns a data frame by default.
Use double brackets,
df.iloc[[0]]
Output:
a b
0 1 3
print(type(df.iloc[[0]]))
<class 'pandas.core.frame.DataFrame'>
Short for df.iloc[[0],:]
Accessing row(s) by label: loc
# Setup
df = pd.DataFrame({'X': [1, 2, 3], 'Y':[4, 5, 6]}, index=['a', 'b', 'c'])
df
X Y
a 1 4
b 2 5
c 3 6
To get a DataFrame instead of a Series, pass a list of indices of length 1,
df.loc[['a']]
# Same as
df.loc[['a'], :] # selects all columns
X Y
a 1 4
To select multiple specific rows, use
df.loc[['a', 'c']]
X Y
a 1 4
c 3 6
To select a contiguous range of rows, use
df.loc['b':'c']
X Y
b 2 5
c 3 6
Access row(s) by position: iloc
Specify a list of indices of length 1,
i = 1
df.iloc[[i]]
X Y
b 2 5
Or, specify a slice of length 1:
df.iloc[i:i+1]
X Y
b 2 5
To select multiple rows or a contiguous slice you'd use a similar syntax as with loc.
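To see the difference between a scalar selector and a list or slice selector in one place, here is a small sketch. The helper row_as_frame is hypothetical and only wraps the list-of-one idiom shown above.

```python
import pandas as pd

df = pd.DataFrame({"X": [1, 2, 3], "Y": [4, 5, 6]}, index=["a", "b", "c"])

def row_as_frame(frame, i):
    """Return row i (by position) as a one-row DataFrame (hypothetical helper)."""
    return frame.iloc[[i]]

print(type(df.iloc[1]).__name__)           # Series
print(type(row_as_frame(df, 1)).__name__)  # DataFrame
print(type(df.iloc[1:2]).__name__)         # DataFrame
print(type(df.loc[["b"]]).__name__)        # DataFrame
```

The rule of thumb is that a scalar selector drops a dimension, while a list or slice keeps it, for both loc and iloc.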
The double-bracket approach doesn't always work for me (e.g. when I use a conditional to select a timestamped row with loc).
You can, however, just add to_frame() to your operation.
>>> df = pd.DataFrame({'a':[1,2], 'b':[3,4]})
>>> df2 = df.iloc[0, :].to_frame().transpose()
>>> type(df2)
<class 'pandas.core.frame.DataFrame'>
Use one of the options below:
df1 = df.iloc[[0],:]
#type(df1)
df1
or
df1 = df.iloc[0:1,:]
#type(df1)
df1
To extract a single row from a DataFrame as a DataFrame, use:
df_name.iloc[index,:].to_frame().transpose()
single_Sample1=df.iloc[7:10]
single_Sample1

Python pandas - select by row

I am trying to select rows in a pandas data frame based on its values matching those of another data frame. Crucially, I only want to match values in rows, not throughout the whole series. For example:
df1 = pd.DataFrame({'a':[1, 2, 3], 'b':[4, 5, 6]})
df2 = pd.DataFrame({'a':[3, 2, 1], 'b':[4, 5, 6]})
I want to select rows where both 'a' and 'b' values from df1 match any row in df2. I have tried:
df1[(df1['a'].isin(df2['a'])) & (df1['b'].isin(df2['b']))]
This of course returns all rows, as all the values are present in df2 at some point, but not necessarily in the same row. How can I limit this so the values tested for 'b' are only those in rows where the value of 'a' was found? So with the example above, I am expecting only row index 1 ([2, 5]) to be returned.
Note that data frames may be of different shapes, and contain multiple matching rows.
Similar to this post, here's one using broadcasting -
df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
The idea is:
1) Use np.all for the "both" part in "both 'a' and 'b' values".
2) Use np.any for the "any" part in "from df1 match any row in df2".
3) Use broadcasting to do all of this in a vectorized fashion by extending dimensions with None/np.newaxis.
Sample run -
In [41]: df1
Out[41]:
a b
0 1 4
1 2 5
2 3 6
In [42]: df2 # Modified to add another row : [1,4] for variety
Out[42]:
a b
0 3 4
1 2 5
2 1 6
3 1 4
In [43]: df1[(df1.values == df2.values[:,None]).all(-1).any(0)]
Out[43]:
a b
0 1 4
1 2 5
Use NumPy broadcasting to build a table showing which rows of df1 match which rows of df2:
pd.DataFrame((df1.values[:, None] == df2.values).all(2),
pd.Index(df1.index, name='df1'),
pd.Index(df2.index, name='df2'))
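The shapes involved in the broadcasting approach can be spelled out step by step. This is a sketch on the sample data from the answer above (df2 with the extra row [1, 4]); the intermediate names eq, pairwise and mask are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"a": [3, 2, 1, 1], "b": [4, 5, 6, 4]})

# (3, 1, 2) compared with (4, 2) broadcasts to (3, 4, 2):
# eq[i, j, k] is True when column k of df1 row i equals that of df2 row j.
eq = df1.to_numpy()[:, None] == df2.to_numpy()

pairwise = eq.all(axis=2)    # (3, 4): does df1 row i fully match df2 row j?
mask = pairwise.any(axis=1)  # (3,): does df1 row i match any df2 row?

print(df1[mask])
```

The intermediate array holds one element per pair of rows per column, so memory use grows with the product of the two row counts.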

Why does pandas dataframe indexing change axis depending on index type?

When you index into a pandas DataFrame using a list of ints, it returns columns.
E.g. df[[0, 1, 2]] returns the first three columns.
Why does indexing with a boolean vector return rows?
E.g. df[[True, False, True]] returns the first and third rows (and errors out if there aren't 3 rows).
Why? Shouldn't it return the first and third columns?
Thanks!
Because if you use:
df[[True, False, True]]
it is treated as boolean indexing with the mask:
[True, False, True]
Sample:
df = pd.DataFrame({'A':[1,2,3],
'B':[4,5,6],
'C':[7,8,9]})
print (df)
A B C
0 1 4 7
1 2 5 8
2 3 6 9
print (df[[True, False, True]])
A B C
0 1 4 7
2 3 6 9
The boolean mask is the same as:
print (df.B != 5)
0 True
1 False
2 True
Name: B, dtype: bool
print (df[df.B != 5])
A B C
0 1 4 7
2 3 6 9
There are very specific slicing accessors to target rows and columns in specific ways:
- Mixed Position and Label Based Selection
- Select by position
- Selection by Label
loc[], at[], and get_value() take row and column labels and return the appropriate slice (get_value() was deprecated and has been removed as of pandas 1.0; use at[] instead).
iloc[] and iat[] take row and column positions and return the appropriate slice.
What you are seeing is the result of pandas trying to infer what you are trying to do. As you have noticed, this is inconsistent at times. In fact, it is more pronounced than just what you've highlighted... but I won't go into that now.
See also the pandas docs:
"However, when an axis is integer based, ONLY label based access and not positional access is supported. Thus, in such cases, it’s usually better to be explicit and use .iloc or .loc."
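Being explicit removes the guesswork about which axis and which kind of key is meant. Here is a minimal sketch, assuming a frame whose integer column labels deliberately differ from the positions; the frame and variable names are made up for illustration.

```python
import pandas as pd

# Integer column labels that deliberately differ from the positions 0, 1, 2.
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=[10, 20, 30])

by_label = df.loc[:, [10, 30]]         # columns labelled 10 and 30
by_position = df.iloc[:, [0, 2]]       # first and third columns
rows = df.loc[[True, False, True]]     # boolean mask over the rows
cols = df.loc[:, [True, False, True]]  # boolean mask over the columns

print(by_label.equals(by_position))
print(rows.index.tolist(), cols.columns.tolist())
```

With .loc and .iloc, the position of the key inside the brackets decides the axis, so the same boolean list can target either rows or columns.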
