pandas isin comparison to multiple columns, not including index - python

I'm trying to test whether a dictionary of key-value pairs is contained in a DataFrame with columns having the same names as the dictionary.
example:
df1 = pd.DataFrame({'A': [2,8,4,9,6], 'B': [7,1,8,3,5], 'C': [8,4,9,1,6], 'D': [7,8,9,1,2], 'E': [3,8,4,9,6]})
df1
A B C D E
0 2 7 8 7 3
1 8 1 4 8 8
2 4 8 9 9 4
3 9 3 1 1 9
4 6 5 6 2 6
d = {'A': 9, 'B': 3, 'C': 1, 'D': 1, 'E': 9}
df2 = pd.DataFrame([d])
df2
A B C D E
0 9 3 1 1 9
What I want is a statement that returns True if the entire row of values in df2 is matched anywhere in df1. I've tried passing d and df2 to the .isin values parameter:
df1.isin(d)
results in an error.
TypeError: only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'int'
while using df2 returns all False.
df1.isin(df2)
A B C D E
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
I played around with the last example in the pandas.DataFrame.isin doc, and realized my test with df2 fails because the index doesn't match (3 in df1 versus 0 in df2).
Is there an easy way to do this with isin that ignores the index, or some other method that doesn't involve stringing together five equality tests?

Is this what you expect?
>>> df1.eq(df2.values).all(axis=1).any()
True
You can also use d directly:
>>> df1.eq(d).all(axis=1).any()
True

Related

how to change suffix on new df column of df when merging iteratively

I have a temp df and a a dflst.
the temp has as columns the unique col names from a dataframes in a dflst .
The dflst has a dynamic len, my problem arrises when len(dflst)>=4.
aLL DFs (temp and all the ones in dflst) have columns with true/false values and a p column with numbers
code to recreate data:
#making temp df
var_cols=['a', 'b', 'c', 'd']
temp = pd.DataFrame(list(itertools.product([False, True], repeat=len(var_cols))), columns=var_cols)
#makinf dflst
df0=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'b']))), columns=['a', 'b'])
df0['p']= np.random.randint(1, 5, df0.shape[0])
df1=pd.DataFrame(list(itertools.product([False, True], repeat=len(['c', 'd']))), columns=['c', 'd'])
df1['p']= np.random.randint(1, 5, df1.shape[0])
df2=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'c', ]))), columns=['a', 'c'])
df2['p']= np.random.randint(1, 5, df2.shape[0])
df3=pd.DataFrame(list(itertools.product([False, True], repeat=len(['d']))), columns=['d'])
df3['p']= np.random.randint(1, 5, df3.shape[0])
dflst=[df0, df1, df2, df3]
I want to merge the dfs in dflst, so that the 'p'col values from dfs in dflst into temp df, in the rows with compatible values between the two .
I am currently doing it with pd.merge as follows:
for df in dflst:
temp = temp.merge(df, on=list(df)[:-1], how='right')
but this results to a df that has same names for different columns, when dflst has 4 or more dfs.. I understand that that is due to suffix of merge. but it creates problems with column indexing.
How can I have unique names on the new columns added to temp iteratively?
I don't fully understand what you want but IIUC:
for i, df in enumerate(dflst):
temp = temp.merge(df.rename(columns={'p': f'p{i}'}),
on=df.columns[:-1].tolist(), how='right')
print(temp)
# Output:
a b c d p0 p1 p2 p3
0 False False False False 4 2 2 1
1 False True False False 3 2 2 1
2 False False True False 4 3 4 1
3 False True True False 3 3 4 1
4 True False False False 3 2 2 1
5 True True False False 3 2 2 1
6 True False True False 3 3 1 1
7 True True True False 3 3 1 1
8 False False False True 4 4 2 3
9 False True False True 3 4 2 3
10 False False True True 4 1 4 3
11 False True True True 3 1 4 3
12 True False False True 3 4 2 3
13 True True False True 3 4 2 3
14 True False True True 3 1 1 3
15 True True True True 3 1 1 3

Compare string to previous and consecutive row to check if alphabetical order

I need to check if a column in a dataframe is in alphabetical order comparing only the two adjacent values.
idx
Col1
0
A
1
A
2
B
3
A
4
A
5
B
6
B
7
C
or:
import pandas as pd
df = pd.DataFrame(['A','A','B','A','A','B','B','C'], columns=['Col1'])
Now only row 2 is out of order.
I'd like to do something like:
df['InOrder'] = df['Col1'].rolling(2).apply(lambda x: x[0] >= x[1] >= x[2])
But rolling only works for numerical values.
I've also tried:
df['InOrder'] = df['Col1'] >= df['Col1'].shift(1) >= df['Col1'].shift(2)
But I get
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This is what I expect to get:
idx
Col1
InOrder
0
A
True
1
A
True
2
B
False
3
A
True
4
A
True
5
B
True
6
B
True
7
C
True
P.S.: Since I have other columns and I need to keep the data in current row order.
one idea with Series.rank and np.diff, replace missing values anf compare by Series.ge for great or equal:
df['InOrder'] = df['Col1'].rank(method='dense').rolling(2).apply(lambda x: np.diff(x)).fillna(0).ge(0)
Or similar like #wwnde solution:
df['InOrder'] = df['Col1'].rank(method='dense').diff().fillna(0).ge(0)
print (df)
Col1 InOrder
0 A True
1 A True
2 B True
3 C True
4 B False
5 D True
6 D True
7 E True
EDIT: If need match up to 1 value is possible use:
df['InOrder'] = df['Col1'].rank(method='dense').diff().shift(-1).fillna(0).isin([0,1])
print (df)
Col1 InOrder
0 A True
1 A True
2 B False
3 A True
4 A True
5 B True
6 B True
7 C True
df['InOrder'] = df['Col1'].rank(method='dense').diff(-1).fillna(0).isin([0,-1])
print (df)
Col1 InOrder
0 A True
1 A True
2 B False
3 A True
4 A True
5 B True
6 B True
7 C True
Convert them into numerals using astype category. Find the consecutive differences and anything less than 0, make it false. Code below
df['InOrder']=df.Col1.astype('category').cat.codes.diff(1).fillna(0).ge(0)
Col1 InOrder
0 A True
1 A True
2 B True
3 C True
4 B False
5 D True
6 D True
7 E True

(Python) Selecting rows containing a string in ANY column?

I am trying to iterate through a dataframe and return the rows that contain a string "x" in any column.
This is what I have been trying
for col in df:
rows = df[df[col].str.contains(searchTerm, case = False, na = False)]
However, it only returns up to 2 rows if I search for a string I know exists there and in more rows.
How do I make sure it is searching every row of every column?
Edit: My end goal is to get the row and column of the cell containing the string searchTerm
Welcome!
Agree with all the comments. It's generally best practice to find a way to accomplish what you want in Pandas/Numpy without iterating over rows/columns.
If the objective is to "find rows where any column contains the value 'x'), life is a lot easier than you think.
Below is some data:
import pandas as pd
df = pd.DataFrame({
'a': range(10),
'b': ['x', 'b', 'c', 'd', 'x', 'f', 'g', 'h', 'i', 'x'],
'c': [False, False, True, True, True, False, False, True, True, True],
'd': [1, 'x', 3, 4, 5, 6, 7, 8, 'x', 10]
})
print(df)
a b c d
0 0 x False 1
1 1 b False x
2 2 c True 3
3 3 d True 4
4 4 x True 5
5 5 f False 6
6 6 g False 7
7 7 h True 8
8 8 i True x
9 9 x True 10
So clearly rows 0, 1, 4, 8 and 9 should be included.
If we just do df == 'x', pandas broadcasts the comparison across the whole dataframe:
df == 'x'
a b c d
0 False True False False
1 False False False True
2 False False False False
3 False False False False
4 False True False False
5 False False False False
6 False False False False
7 False False False False
8 False False False True
9 False True False False
But pandas also has the handy .any method, to check for True in any dimension. So if we want to check across all columns, we want dimension 1:
rows = (df == 'x').any(axis=1)
print(rows)
0 True
1 True
2 False
3 False
4 True
5 False
6 False
7 False
8 True
9 True
Note that if you want your solution to be truly case sensitive like what you're using with the .str method, you might need something more like:
rows = (df.applymap(lambda x: str(x).lower() == 'x')).any(axis=1)
The correct rows are flagged without any looping. And you get a series back that can be used for indexing the original dataframe:
df.loc[rows]
a b c d
0 0 x False 1
1 1 b False x
4 4 x True 5
8 8 i True x
9 9 x True 10

Identifying closest value in a column for each filter using Pandas

I have a data frame with categories and values. I need to find the value in each category closest to a value. I think I'm close but I can't really get the right output when applying the results of argsort to the original dataframe.
For example, if the input was defined in the code below the output should have only (a, 1, True), (b, 2, True), (c, 2, True) and all other isClosest Values should be False.
If multiple values are closest then it should be the first value listed marked.
Here is the code I have which works but I can't get it to reapply to the dataframe correctly. I would love some pointers.
df = pd.DataFrame()
df['category'] = ['a', 'b', 'b', 'b', 'c', 'a', 'b', 'c', 'c', 'a']
df['values'] = [1, 2, 3, 4, 5, 4, 3, 2, 1, 0]
df['isClosest'] = False
uniqueCategories = df['category'].unique()
for c in uniqueCategories:
filteredCategories = df[df['category']==c]
sortargs = (filteredCategories['value']-2.0).abs().argsort()
#how to use sortargs so that we set column in df isClosest=True if its the closest value in each category to 2.0?
You can create a column of absolute differences:
df['dif'] = (df['values'] - 2).abs()
df
Out:
category values dif
0 a 1 1
1 b 2 0
2 b 3 1
3 b 4 2
4 c 5 3
5 a 4 2
6 b 3 1
7 c 2 0
8 c 1 1
9 a 0 2
And then use groupby.transform to check whether the minimum value of each group is equal to the difference you calculated:
df['is_closest'] = df.groupby('category')['dif'].transform('min') == df['dif']
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
df.groupby('category')['dif'].idxmin() would also give you the indices of the closest values for each category. You can use that for mapping too.
For selection:
df.loc[df.groupby('category')['dif'].idxmin()]
Out:
category values dif
0 a 1 1
1 b 2 0
7 c 2 0
For assignment:
df['is_closest'] = False
df.loc[df.groupby('category')['dif'].idxmin(), 'is_closest'] = True
df
Out:
category values dif is_closest
0 a 1 1 True
1 b 2 0 True
2 b 3 1 False
3 b 4 2 False
4 c 5 3 False
5 a 4 2 False
6 b 3 1 False
7 c 2 0 True
8 c 1 1 False
9 a 0 2 False
The difference between these approaches is that if you check equality against the difference, you would get True for all rows in case of ties. However, with idxmin it will return True for the first occurrence (only one for each group).
Solution with DataFrameGroupBy.idxmin - get indexes of minimal values per group and then assign boolean mask by Index.isin to column isClosest:
idx = (df['values'] - 2).abs().groupby([df['category']]).idxmin()
print (idx)
category
a 0
b 1
c 7
Name: values, dtype: int64
df['isClosest'] = df.index.isin(idx)
print (df)
category values isClosest
0 a 1 True
1 b 2 True
2 b 3 False
3 b 4 False
4 c 5 False
5 a 4 False
6 b 3 False
7 c 2 True
8 c 1 False
9 a 0 False

All rows within a given column must match, for all columns

I have a Pandas DataFrame of data in which all rows within a given column must match:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
'B': [2,2,2,2,2,2,2,2,2,2],
'C': [3,3,3,3,3,3,3,3,3,3],
'D': [4,4,4,4,4,4,4,4,4,4],
'E': [5,5,5,5,5,5,5,5,5,5]})
In [10]: df
Out[10]:
A B C D E
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
I would like a quick way to know if there is an variance anywhere in the DataFrame. At this point, I don't need to know which values have varied, since I will be going in to handle those later. I just need a quick way to know if the DataFrame needs further attention or if I can ignore it and move on to the next one.
I can check any given column using
(df.loc[:,'A'] != df.loc[0,'A']).any()
but my Pandas knowledge limits me to iterating through the columns (I understand iteration is frowned upon in Pandas) to compare all of them:
A B C D E
0 1 2 3 4 5
1 1 2 9 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
for col in df.columns:
if (df.loc[:,col] != df.loc[0,col]).any():
print("Found a fail in col %s" % col)
break
Out: Found a fail in col C
Is there an elegant way to return a boolean if any row within any column of a dataframe does not match all the values in the column... possibly without iteration?
Given your example dataframe:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
'B': [2,2,2,2,2,2,2,2,2,2],
'C': [3,3,3,3,3,3,3,3,3,3],
'D': [4,4,4,4,4,4,4,4,4,4],
'E': [5,5,5,5,5,5,5,5,5,5]})
You can use the following:
df.apply(pd.Series.nunique) > 1
Which gives you:
A False
B False
C False
D False
E False
dtype: bool
If we then force a couple of errors:
df.loc[3, 'C'] = 0
df.loc[5, 'B'] = 20
You then get:
A False
B True
C True
D False
E False
dtype: bool
You can compare the entire DataFrame to the first row like this:
In [11]: df.eq(df.iloc[0], axis='columns')
Out[11]:
A B C D E
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True
5 True True True True True
6 True True True True True
7 True True True True True
8 True True True True True
9 True True True True True
then test if all values are true:
In [13]: df.eq(df.iloc[0], axis='columns').all()
Out[13]:
A True
B True
C True
D True
E True
dtype: bool
In [14]: df.eq(df.iloc[0], axis='columns').all().all()
Out[14]: True
You can use apply to loop through columns and check if all the elements in the column are the same:
df.apply(lambda col: (col != col[0]).any())
# A False
# B False
# C False
# D False
# E False
# dtype: bool

Categories