I have a Pandas DataFrame of data in which all rows within a given column must match:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
                   'B': [2,2,2,2,2,2,2,2,2,2],
                   'C': [3,3,3,3,3,3,3,3,3,3],
                   'D': [4,4,4,4,4,4,4,4,4,4],
                   'E': [5,5,5,5,5,5,5,5,5,5]})
In [10]: df
Out[10]:
A B C D E
0 1 2 3 4 5
1 1 2 3 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
I would like a quick way to know if there is a variance anywhere in the DataFrame. At this point, I don't need to know which values have varied, since I will go in and handle those later. I just need a quick way to know if the DataFrame needs further attention or if I can ignore it and move on to the next one.
I can check any given column using
(df.loc[:,'A'] != df.loc[0,'A']).any()
but my Pandas knowledge limits me to iterating through the columns (I understand iteration is frowned upon in Pandas) to compare all of them:
A B C D E
0 1 2 3 4 5
1 1 2 9 4 5
2 1 2 3 4 5
...
6 1 2 3 4 5
7 1 2 3 4 5
8 1 2 3 4 5
9 1 2 3 4 5
for col in df.columns:
    if (df.loc[:, col] != df.loc[0, col]).any():
        print("Found a fail in col %s" % col)
        break
Out: Found a fail in col C
Is there an elegant way to return a boolean if any row within any column of a dataframe does not match all the values in the column... possibly without iteration?
Given your example dataframe:
df = pd.DataFrame({'A': [1,1,1,1,1,1,1,1,1,1],
                   'B': [2,2,2,2,2,2,2,2,2,2],
                   'C': [3,3,3,3,3,3,3,3,3,3],
                   'D': [4,4,4,4,4,4,4,4,4,4],
                   'E': [5,5,5,5,5,5,5,5,5,5]})
You can use the following:
df.apply(pd.Series.nunique) > 1
Which gives you:
A False
B False
C False
D False
E False
dtype: bool
If we then force a couple of errors:
df.loc[3, 'C'] = 0
df.loc[5, 'B'] = 20
You then get:
A False
B True
C True
D False
E False
dtype: bool
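If you want to collapse the per-column result into the single boolean the question asks for, one small follow-up (using DataFrame.nunique, the built-in equivalent of the apply above) is:
# True if any column holds more than one distinct value
needs_attention = (df.nunique() > 1).any()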
You can compare the entire DataFrame to the first row like this:
In [11]: df.eq(df.iloc[0], axis='columns')
Out[11]:
A B C D E
0 True True True True True
1 True True True True True
2 True True True True True
3 True True True True True
4 True True True True True
5 True True True True True
6 True True True True True
7 True True True True True
8 True True True True True
9 True True True True True
then test if all values are true:
In [13]: df.eq(df.iloc[0], axis='columns').all()
Out[13]:
A True
B True
C True
D True
E True
dtype: bool
In [14]: df.eq(df.iloc[0], axis='columns').all().all()
Out[14]: True
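One caveat worth noting: NaN never compares equal to NaN, so a column whose repeated value is NaN would be reported as a mismatch by this equality test. A minimal illustration:
import numpy as np
# eq() yields False for every comparison involving NaN, even NaN vs NaN
pd.Series([np.nan, np.nan]).eq(np.nan).all()   # False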
You can use apply to loop through the columns and flag any column whose elements differ from its first value:
df.apply(lambda col: (col != col[0]).any())
# A False
# B False
# C False
# D False
# E False
# dtype: bool
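Note that col[0] relies on the index containing the label 0; a slightly safer variant of the same idea uses positional indexing, so it also works when the index is not the default RangeIndex:
# iloc[0] selects the first element by position rather than by label
df.apply(lambda col: (col != col.iloc[0]).any())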
I have a temp df and a dflst.
temp has as its columns the unique column names from the dataframes in dflst.
dflst has a dynamic length; my problem arises when len(dflst) >= 4.
All DFs (temp and all the ones in dflst) have columns with True/False values and a 'p' column with numbers.
code to recreate data:
import itertools
import numpy as np
import pandas as pd

# making temp df
var_cols = ['a', 'b', 'c', 'd']
temp = pd.DataFrame(list(itertools.product([False, True], repeat=len(var_cols))), columns=var_cols)
# making dflst
df0=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'b']))), columns=['a', 'b'])
df0['p']= np.random.randint(1, 5, df0.shape[0])
df1=pd.DataFrame(list(itertools.product([False, True], repeat=len(['c', 'd']))), columns=['c', 'd'])
df1['p']= np.random.randint(1, 5, df1.shape[0])
df2=pd.DataFrame(list(itertools.product([False, True], repeat=len(['a', 'c', ]))), columns=['a', 'c'])
df2['p']= np.random.randint(1, 5, df2.shape[0])
df3=pd.DataFrame(list(itertools.product([False, True], repeat=len(['d']))), columns=['d'])
df3['p']= np.random.randint(1, 5, df3.shape[0])
dflst=[df0, df1, df2, df3]
I want to merge the dfs in dflst into temp, so that the 'p' column values from each df in dflst land in the rows of temp with compatible values between the two.
I am currently doing it with pd.merge as follows:
for df in dflst:
    temp = temp.merge(df, on=list(df)[:-1], how='right')
but this results in a df that has the same name for different columns when dflst has 4 or more dfs. I understand that this is due to merge's suffixes, but it creates problems with column indexing.
How can I have unique names on the new columns added to temp iteratively?
I don't fully understand what you want but IIUC:
for i, df in enumerate(dflst):
    temp = temp.merge(df.rename(columns={'p': f'p{i}'}),
                      on=df.columns[:-1].tolist(), how='right')
print(temp)
# Output:
a b c d p0 p1 p2 p3
0 False False False False 4 2 2 1
1 False True False False 3 2 2 1
2 False False True False 4 3 4 1
3 False True True False 3 3 4 1
4 True False False False 3 2 2 1
5 True True False False 3 2 2 1
6 True False True False 3 3 1 1
7 True True True False 3 3 1 1
8 False False False True 4 4 2 3
9 False True False True 3 4 2 3
10 False False True True 4 1 4 3
11 False True True True 3 1 4 3
12 True False False True 3 4 2 3
13 True True False True 3 4 2 3
14 True False True True 3 1 1 3
15 True True True True 3 1 1 3
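As a side note, renaming 'p' before each merge is what avoids pandas' automatic _x/_y suffix collisions. If you prefer to avoid the explicit loop, the same merges can be folded with functools.reduce; a minimal sketch under the same assumptions:
from functools import reduce

# Fold the merges over dflst; each df contributes its own p{i} column
temp = reduce(
    lambda left, pair: left.merge(
        pair[1].rename(columns={'p': f'p{pair[0]}'}),
        on=pair[1].columns[:-1].tolist(), how='right'),
    enumerate(dflst), temp)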
I'm trying to test whether a dictionary of key-value pairs is contained in a DataFrame whose columns have the same names as the dictionary's keys.
example:
df1 = pd.DataFrame({'A': [2,8,4,9,6], 'B': [7,1,8,3,5], 'C': [8,4,9,1,6], 'D': [7,8,9,1,2], 'E': [3,8,4,9,6]})
df1
A B C D E
0 2 7 8 7 3
1 8 1 4 8 8
2 4 8 9 9 4
3 9 3 1 1 9
4 6 5 6 2 6
d = {'A': 9, 'B': 3, 'C': 1, 'D': 1, 'E': 9}
df2 = pd.DataFrame([d])
df2
A B C D E
0 9 3 1 1 9
What I want is a statement that returns True if the entire row of values in df2 is matched anywhere in df1. I've tried passing d and df2 to the .isin values parameter:
df1.isin(d)
results in an error.
TypeError: only list-like or dict-like objects are allowed to be passed to DataFrame.isin(), you passed a 'int'
while using df2 returns all False.
df1.isin(df2)
A B C D E
0 False False False False False
1 False False False False False
2 False False False False False
3 False False False False False
4 False False False False False
I played around with the last example in the pandas.DataFrame.isin doc, and realized my test with df2 fails because the index doesn't match (3 in df1 versus 0 in df2).
Is there an easy way to do this with isin that ignores the index, or some other method that doesn't involve stringing together five equality tests?
Is this what you expect?
>>> df1.eq(df2.values).all(axis=1).any()
True
You can also use d directly:
>>> df1.eq(d).all(axis=1).any()
True
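Another option in the same spirit is a merge-based membership test: merge defaults to an inner join on the shared columns, so a non-empty result means df2's row occurs somewhere in df1, with the index ignored entirely. A minimal sketch:
# An inner merge keeps only rows present in both frames
not df1.merge(df2).empty   # True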
I need to check if a column in a dataframe is in alphabetical order comparing only the two adjacent values.
idx  Col1
0    A
1    A
2    B
3    A
4    A
5    B
6    B
7    C
or:
import pandas as pd
df = pd.DataFrame(['A','A','B','A','A','B','B','C'], columns=['Col1'])
Now only row 2 is out of order.
I'd like to do something like:
df['InOrder'] = df['Col1'].rolling(2).apply(lambda x: x[0] >= x[1] >= x[2])
But rolling only works for numerical values.
I've also tried:
df['InOrder'] = df['Col1'] >= df['Col1'].shift(1) >= df['Col1'].shift(2)
But I get
The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
This is what I expect to get:
idx
Col1
InOrder
0
A
True
1
A
True
2
B
False
3
A
True
4
A
True
5
B
True
6
B
True
7
C
True
P.S.: I have other columns, so I need to keep the data in the current row order.
One idea with Series.rank and np.diff: replace missing values and compare with Series.ge (greater than or equal):
df['InOrder'] = df['Col1'].rank(method='dense').rolling(2).apply(lambda x: np.diff(x)).fillna(0).ge(0)
Or, similar to @wwnde's solution:
df['InOrder'] = df['Col1'].rank(method='dense').diff().fillna(0).ge(0)
print (df)
Col1 InOrder
0 A True
1 A True
2 B True
3 C True
4 B False
5 D True
6 D True
7 E True
EDIT: If you need to allow a rank difference of up to 1 (to match the expected output), you can use:
df['InOrder'] = df['Col1'].rank(method='dense').diff().shift(-1).fillna(0).isin([0,1])
print (df)
Col1 InOrder
0 A True
1 A True
2 B False
3 A True
4 A True
5 B True
6 B True
7 C True
Or:
df['InOrder'] = df['Col1'].rank(method='dense').diff(-1).fillna(0).isin([0,-1])
print (df)
Col1 InOrder
0 A True
1 A True
2 B False
3 A True
4 A True
5 B True
6 B True
7 C True
Convert the values into numerals using astype('category'). Take the consecutive differences; anything less than 0 becomes False. Code below:
df['InOrder']=df.Col1.astype('category').cat.codes.diff(1).fillna(0).ge(0)
Col1 InOrder
0 A True
1 A True
2 B True
3 C True
4 B False
5 D True
6 D True
7 E True
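If the column holds plain strings, a simpler sketch with the same backward-looking semantics as this answer compares each value directly with its predecessor, since string comparison in pandas is element-wise; fillna('') handles the leading NaN that shift introduces:
# '' sorts before every letter, so the first row evaluates True
df['InOrder'] = df['Col1'] >= df['Col1'].shift(1).fillna('')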
I am attempting to create a matrix of 1s where every 2nd column's value is greater than the previous column's value, and 0s where it is less. When I use np.where it just flattens the result; I want to keep the first column, the last column, and the original shape.
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
newdf = pd.DataFrame()
for x in df.columns[1::2]:
    if bool(df.iloc[:, df.columns.get_loc(x)] <=
            df.iloc[:, df.columns.get_loc(x) - 1]):
        newdf.append(1)
    else:
        newdf.append(0)
This question was a little vague, but I will answer a question that I think gets at the heart of what you are asking:
Say you start with a matrix:
df1 = pd.DataFrame(np.random.randn(8, 4),columns=['A', 'B', 'C', 'D'])
Which creates:
A B C D
0 2.464130 0.796172 -1.406528 0.332499
1 -0.370764 -0.185119 -0.514149 0.158218
2 -2.164707 0.888354 0.214550 1.334445
3 2.019189 0.910855 0.582508 -0.861778
4 1.574337 -1.063037 0.771726 -0.196721
5 1.091648 0.407703 0.406509 -1.052855
6 -1.587963 -1.730850 0.168353 -0.899848
7 0.225723 0.042629 2.152307 -1.086585
Now you can use DataFrame.shift() to shift the entire matrix, and then check the resulting columns item by item in one step. For example:
df1.shift(-1)
Creates:
A B C D
0 -0.370764 -0.185119 -0.514149 0.158218
1 -2.164707 0.888354 0.214550 1.334445
2 2.019189 0.910855 0.582508 -0.861778
3 1.574337 -1.063037 0.771726 -0.196721
4 1.091648 0.407703 0.406509 -1.052855
5 -1.587963 -1.730850 0.168353 -0.899848
6 0.225723 0.042629 2.152307 -1.086585
7 NaN NaN NaN NaN
And now you can compare the shifted frame against the original to build a new boolean matrix, like so:
df2 = df1.shift(-1) > df1
which returns:
A B C D
0 False False True False
1 False True True True
2 True True True False
3 False False True True
4 False True False False
5 False False False True
6 True True True False
7 False False False False
To complete your question, we convert the True/False values to 1/0 like this:
df2 = df2.applymap(lambda x: 1 if x == True else 0)
Which returns:
A B C D
0 0 0 1 0
1 0 1 1 1
2 1 1 1 0
3 0 0 1 1
4 0 1 0 0
5 0 0 0 1
6 1 1 1 0
7 0 0 0 0
In one line:
df2 = (df1.shift(-1)>df1).replace({True:1,False:0})
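Since booleans cast directly to integers, an equivalent and slightly leaner sketch of the same one-liner is:
# astype(int) turns True/False into 1/0 without applymap or replace
df2 = (df1.shift(-1) > df1).astype(int)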
So I have a very big data set, and I need to create a function that checks the values in the same row across multiple columns; obviously the values I want to check differ for each column.
Then, if all the given columns pass their value checks, I want to return something and add a new column to the DF to use as a flag for later filtering.
I think you need to compare with eq and use all to check whether all values are True:
df = pd.DataFrame({'A':[1,2,3],
                   'B':[1,5,6],
                   'C':[1,8,9],
                   'D':[1,3,5],
                   'E':[1,3,6],
                   'F':[7,4,3]})
print (df)
A B C D E F
0 1 1 1 1 1 7
1 2 5 8 3 3 4
2 3 6 9 5 6 3
#check whether columns B,C,E match column A
cols = ['B','C','E']
print (df[cols].eq(df.A, axis=0))
B C E
0 True True True
1 False False False
2 False False False
print (df[cols].eq(df.A, axis=0).all(axis=1))
0 True
1 False
2 False
dtype: bool
df['col'] = df[cols].eq(df.A, axis=0).all(axis=1)
print (df)
A B C D E F col
0 1 1 1 1 1 7 True
1 2 5 8 3 3 4 False
2 3 6 9 5 6 3 False
EDIT by comment:
You need to create a boolean mask with & (and), | (or) or ~ (not):
print ((df.A == 1) & (df.B > 167) & (df.B <=200))
0 False
1 False
2 False
dtype: bool
df['col'] = (df.A == 1) & (df.B > 167) & (df.B <=200)
print (df)
A B C D E F col
0 1 1 1 1 1 7 False
1 2 5 8 3 3 4 False
2 3 6 9 5 6 3 False
Please try using masks. For example:
mask = df['a'] == 4
df = df[mask]
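Tying the earlier answers together, here is a minimal sketch (reusing the sample df and cols from above) that builds a combined mask from several per-column conditions, stores it as a flag column, and filters on it:
# Flag rows where B, C and E all equal A, then keep only those rows
cols = ['B', 'C', 'E']
df['flag'] = df[cols].eq(df.A, axis=0).all(axis=1)
filtered = df[df['flag']]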