I have a Pandas DataFrame of data in which all rows within a given column must match:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
                   'B': [2,2,2,2,2,2,2,2,2,2],
                   'C': [3,3,3,3,3,3,3,3,3,3],
                   'D': [4,4,4,4,4,4,4,4,4,4],
                   'E': [5,5,5,5,5,5,5,5,5,5]})
In [10]: df
Out[10]:
A B C D E
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
I would like a quick way to know if there is a variance anywhere in the DataFrame. At this point, I don't need to know which values have varied, since I will go in and handle those later. I just need a quick way to know if the DataFrame needs further attention or if I can ignore it and move on to the next one.
I can check any given column using
(df.loc[:,'A'] != df.loc[0,'A']).any()
but my Pandas knowledge limits me to iterating through the columns (I understand iteration is frowned upon in Pandas) to compare all of them:
A B C D E
0 1 2 3 4 5
1 1 2 9 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
for col in df.columns:
    if (df.loc[:, col] != df.loc[0, col]).any():
        print("Found a fail in col %s" % col)
        break
Out: Found a fail in col C
Is there an elegant way to return a boolean if any row within any column of a dataframe does not match all the values in the column... possibly without iteration?
Given your example dataframe:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
                   'B': [2,2,2,2,2,2,2,2,2,2],
                   'C': [3,3,3,3,3,3,3,3,3,3],
                   'D': [4,4,4,4,4,4,4,4,4,4],
                   'E': [5,5,5,5,5,5,5,5,5,5]})
You can use the following:
df.apply(pd.Series.nunique) > 1
Which gives you:
A False
B False
C False
D False
E False
dtype: bool
If we then force a couple of errors:
df.loc[3, 'C'] = 0
df.loc[5, 'B'] = 20
You then get:
A False
B True
C True
D False
E False
dtype: bool
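If you want to collapse the per-column result into the single boolean the question asks for, one small follow-up (using DataFrame.nunique, the built-in equivalent of the apply above) is:
# True if any column holds more than one distinct value
needs_attention = (df.nunique() > 1).any()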
You can compare the entire DataFrame to the first row like this:
In [11]: df.eq(df.iloc[0], axis='columns')
Out[11]:
A B C D E
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True
5 True True True True True
6 True True True True True
7 True True True True True
8 True True True True True
9 True True True True True
then test if all values are true:
In [13]: df.eq(df.iloc[0], axis='columns').all()
Out[13]:
A True
B True
C True
D True
E True
dtype: bool
In [14]: df.eq(df.iloc[0], axis='columns').all().all()
Out[14]: True
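One caveat worth noting: NaN never compares equal to NaN, so a column whose repeated value is NaN would be reported as a mismatch by this equality test. A minimal illustration:
import numpy as np
# eq() yields False for every comparison involving NaN, even NaN vs NaN
pd.Series([np.nan, np.nan]).eq(np.nan).all()   # False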
You can use apply to loop through the columns and flag any column whose elements differ from its first value:
df.apply(lambda col: (col != col[0]).any())
# A False
# B False
# C False
# D False
# E False
# dtype: bool
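Note that col[0] relies on the index containing the label 0; a slightly safer variant of the same idea uses positional indexing, so it also works when the index is not the default RangeIndex:
# iloc[0] selects the first element by position rather than by label
df.apply(lambda col: (col != col.iloc[0]).any())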
I have a temp df and a dflst.
temp has as its columns the unique column names from the dataframes in dflst.
dflst has a dynamic length; my problem arises when len(dflst) >= 4.
All DFs (temp and all the ones in dflst) have columns with True/False values and a 'p' column with numbers.
code to recreate data:
import itertools
import numpy as np
import pandas as pd

# making temp df
var_cols = ['a', 'b', 'c', 'd']
temp = pd.DataFrame(list(itertools.product([False, True], repeat=len(var_cols))), columns=var_cols)
# making dflst
df0=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'b']))), columns=['a', 'b'])
df0['p']= np.random.randint(1, 5, df0.shape[0])
df1=pd.DataFrame(list(itertools.product([False, True], repeat=len(['c', 'd']))), columns=['c', 'd'])
df1['p']= np.random.randint(1, 5, df1.shape[0])
df2=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'c', ]))), columns=['a', 'c'])
df2['p']= np.random.randint(1, 5, df2.shape[0])
df3=pd.DataFrame(list(itertools.product([False, True], repeat=len(['d']))), columns=['d'])
df3['p']= np.random.randint(1, 5, df3.shape[0])
dflst=[df0, df1, df2, df3]
I want to merge the dfs in dflst into temp, so that the 'p' column values from each df in dflst land in the rows of temp with compatible values between the two.
I am currently doing it with pd.merge as follows:
for df in dflst:
    temp = temp.merge(df, on=list(df)[:-1], how='right')
but this results in a df that has the same name for different columns when dflst has 4 or more dfs. I understand that this is due to merge's suffixes, but it creates problems with column indexing.
How can I have unique names on the new columns added to temp iteratively?
I don't fully understand what you want but IIUC:
for i, df in enumerate(dflst):
    temp = temp.merge(df.rename(columns={'p': f'p{i}'}),
                      on=df.columns[:-1].tolist(), how='right')
print(temp)
# Output:
a b c d p0 p1 p2 p3
0 False False False False 4 2 2 1
1 False True False False 3 2 2 1
2 False False True False 4 3 4 1
3 False True True False 3 3 4 1
4 True False False False 3 2 2 1
5 True True False False 3 2 2 1
6 True False True False 3 3 1 1
7 True True True False 3 3 1 1
8 False False False True 4 4 2 3
9 False True False True 3 4 2 3
10 False False True True 4 1 4 3
11 False True True True 3 1 4 3
12 True False False True 3 4 2 3
13 True True False True 3 4 2 3
14 True False True True 3 1 1 3
15 True True True True 3 1 1 3
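As a side note, renaming 'p' before each merge is what avoids pandas' automatic _x/_y suffix collisions. If you prefer to avoid the explicit loop, the same merges can be folded with functools.reduce; a minimal sketch under the same assumptions:
from functools import reduce

# Fold the merges over dflst; each df contributes its own p{i} column
temp = reduce(
    lambda left, pair: left.merge(
        pair[1].rename(columns={'p': f'p{pair[0]}'}),
        on=pair[1].columns[:-1].tolist(), how='right'),
    enumerate(dflst), temp)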
I'm trying to test whether a dictionary of key-value pairs is contained in a DataFrame whose columns have the same names as the dictionary's keys.
example:
df1 = pd.DataFrame({'A': [2,8,4,9,6], 'B': [7,1,8,3,5], 'C': [8,4,9,1,6], 'D': [7,8,9,1,2], 'E': [3,8,4,9,6]})
df1
A B C D E
0 2 7 8 7 3
1 8 1 4 8 8
2 4 8 9 9 4
3 9 3 1 1 9
4 6 5 6 2 6
d = {'A': 9, 'B': 3, 'C': 1, 'D': 1, 'E': 9}
df2 = pd.DataFrame([d])
df2
A B C D E
0 9 3 1 1 9
What I want is a statement that returns True if the entire row of values in df2 is matched anywhere in df1. I've tried passing d and df2 to the .isin values parameter:
df1.isin(d)
results in an error.
TypeError: only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'int'
while using df2 returns all False.
df1.isin(df2)
A B C D E
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
I played around with the last example in the pandas.DataFrame.isin doc, and realized my test with df2 fails because the index doesn't match (3 in df1 versus 0 in df2).
Is there an easy way to do this with isin that ignores the index, or some other method that doesn't involve stringing together five equality tests?
Is this what you expect?
>>> df1.eq(df2.values).all(axis=1).any()
True
You can also use d directly:
>>> df1.eq(d).all(axis=1).any()
True
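Another option in the same spirit is a merge-based membership test: merge defaults to an inner join on the shared columns, so a non-empty result means df2's row occurs somewhere in df1, with the index ignored entirely. A minimal sketch:
# An inner merge keeps only rows present in both frames
not df1.merge(df2).empty   # True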
I need to check if a column in a dataframe is in alphabetical order comparing only the two adjacent values.
idx  Col1
0    A
1    A
2    B
3    A
4    A
5    B
6    B
7    C
or:
import pandas as pd
df = pd.DataFrame(['A','A','B','A','A','B','B','C'], columns=['Col1'])
Now only row 2 is out of order.
I'd like to do something like:
df['InOrder'] = df['Col1'].rolling(2).apply(lambda x: x[0] >= x[1] >= x[2])
But rolling only works for numerical values.
I've also tried:
df['InOrder'] = df['Col1'] >= df['Col1'].shift(1) >= df['Col1'].shift(2)
But I get
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This is what I expect to get:
idx
Col1
InOrder
0
A
True
1
A
True
2
B
False
3
A
True
4
A
True
5
B
True
6
B
True
7
C
True
P.S.: I have other columns, so I need to keep the data in the current row order.
One idea with Series.rank and np.diff: replace missing values and compare with Series.ge (greater than or equal):
df['InOrder'] = df['Col1'].rank(method='dense').rolling(2).apply(lambda x: np.diff(x)).fillna(0).ge(0)
Or, similar to @wwnde's solution:
df['InOrder'] = df['Col1'].rank(method='dense').diff().fillna(0).ge(0)
print (df)
Col1 InOrder
0 A True
1 A True
2 B True
3 C True
4 B False
5 D True
6 D True
7 E True
EDIT: If you need to allow a rank difference of up to 1 (to match the expected output), you can use:
df['InOrder'] = df['Col1'].rank(method='dense').diff().shift(-1).fillna(0).isin([0,1])
print (df)
Col1 InOrder
0 A True
1 A True
2 B False
3 A True
4 A True
5 B True
6 B True
7 C True
Or:
df['InOrder'] = df['Col1'].rank(method='dense').diff(-1).fillna(0).isin([0,-1])
print (df)
Col1 InOrder
0 A True
1 A True
2 B False
3 A True
4 A True
5 B True
6 B True
7 C True
Convert the values into numerals using astype('category'). Take the consecutive differences; anything less than 0 becomes False. Code below:
df['InOrder']=df.Col1.astype('category').cat.codes.diff(1).fillna(0).ge(0)
Col1 InOrder
0 A True
1 A True
2 B True
3 C True
4 B False
5 D True
6 D True
7 E True
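If the column holds plain strings, a simpler sketch with the same backward-looking semantics as this answer compares each value directly with its predecessor, since string comparison in pandas is element-wise; fillna('') handles the leading NaN that shift introduces:
# '' sorts before every letter, so the first row evaluates True
df['InOrder'] = df['Col1'] >= df['Col1'].shift(1).fillna('')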
I am attempting to create a matrix of 1s where every 2nd column's value is greater than the previous column's value, and 0s where it is less. When I use np.where it just flattens the result; I want to keep the first column, the last column, and the original shape.
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
newdf = pd.DataFrame()
for x in df.columns[1::2]:
    if bool(df.iloc[:, df.columns.get_loc(x)] <=
            df.iloc[:, df.columns.get_loc(x) - 1]):
        newdf.append(1)
    else:
        newdf.append(0)
This question was a little vague, but I will answer a question that I think gets at the heart of what you are asking:
Say you start with a matrix:
df1 = pd.DataFrame(np.random.randn(8, 4),columns=['A', 'B', 'C', 'D'])
Which creates:
A B C D
0 2.464130 0.796172 -1.406528 0.332499
1 -0.370764 -0.185119 -0.514149 0.158218
2 -2.164707 0.888354 0.214550 1.334445
3 2.019189 0.910855 0.582508 -0.861778
4 1.574337 -1.063037 0.771726 -0.196721
5 1.091648 0.407703 0.406509 -1.052855
6 -1.587963 -1.730850 0.168353 -0.899848
7 0.225723 0.042629 2.152307 -1.086585
Now you can use DataFrame.shift() to shift the entire matrix, and then check the resulting columns item by item in one step. For example:
df1.shift(-1)
Creates:
A B C D
0 -0.370764 -0.185119 -0.514149 0.158218
1 -2.164707 0.888354 0.214550 1.334445
2 2.019189 0.910855 0.582508 -0.861778
3 1.574337 -1.063037 0.771726 -0.196721
4 1.091648 0.407703 0.406509 -1.052855
5 -1.587963 -1.730850 0.168353 -0.899848
6 0.225723 0.042629 2.152307 -1.086585
7 NaN NaN NaN NaN
And now you can compare the shifted frame against the original to build a new boolean matrix, like so:
df2 = df1.shift(-1) > df1
which returns:
A B C D
0 False False True False
1 False True True True
2 True True True False
3 False False True True
4 False True False False
5 False False False True
6 True True True False
7 False False False False
To complete your question, we convert the True/False values to 1/0 like this:
df2 = df2.applymap(lambda x: 1 if x == True else 0)
Which returns:
A B C D
0 0 0 1 0
1 0 1 1 1
2 1 1 1 0
3 0 0 1 1
4 0 1 0 0
5 0 0 0 1
6 1 1 1 0
7 0 0 0 0
In one line:
df2 = (df1.shift(-1)>df1).replace({True:1,False:0})
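Since booleans cast directly to integers, an equivalent and slightly leaner sketch of the same one-liner is:
# astype(int) turns True/False into 1/0 without applymap or replace
df2 = (df1.shift(-1) > df1).astype(int)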
So I have a very big data set, and I need to create a function that checks the values in the same row across multiple columns; obviously the values I want to check differ for each column.
Then, if all the given columns pass their value checks, I want to return something and add a new column to the DF to use as a flag for later filtering.
I think you need to compare with eq and use all to check whether all values are True:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[1,5,6],
                   'C':[1,8,9],
                   'D':[1,3,5],
                   'E':[1,3,6],
                   'F':[7,4,3]})
print (df)
A B C D E F
0 1 1 1 1 1 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#check whether columns B,C,E match column A
cols = ['B','C','E']
print (df[cols].eq(df.A, axis=0))
B C E
0 True True True
1 False False False
2 False False False
print (df[cols].eq(df.A, axis=0).all(axis=1))
0 True
1 False
2 False
dtype: bool
df['col'] = df[cols].eq(df.A, axis=0).all(axis=1)
print (df)
A B C D E F col
0 1 1 1 1 1 7 True
1 2 5 8 3 3 4 False
2 3 6 9 5 6 3 False
EDIT by comment:
You need to create a boolean mask with & (and), | (or) or ~ (not):
print ((df.A == 1) & (df.B > 167) & (df.B <=200))
0 False
1 False
2 False
dtype: bool
df['col'] = (df.A == 1) & (df.B > 167) & (df.B <=200)
print (df)
A B C D E F col
0 1 1 1 1 1 7 False
1 2 5 8 3 3 4 False
2 3 6 9 5 6 3 False
Please try using masks. For example:
mask = df['a'] == 4
df = df[mask]
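Tying the earlier answers together, here is a minimal sketch (reusing the sample df and cols from above) that builds a combined mask from several per-column conditions, stores it as a flag column, and filters on it:
# Flag rows where B, C and E all equal A, then keep only those rows
cols = ['B', 'C', 'E']
df['flag'] = df[cols].eq(df.A, axis=0).all(axis=1)
filtered = df[df['flag']]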