Element-wise Comparison of Two Pandas Dataframes

I am trying to compare two columns in pandas. I know I can do:
# either using Pandas' equals()
df1[col].equals(df2[col])
# or this
df1[col] == df2[col]
However, what I am looking for is to compare these columns element-wise and, when they do not match, print out both values. I have tried:
if df1[col] != df2[col]:
    print(df1[col])
    print(df2[col])
where I get the error 'The truth value of a Series is ambiguous'.
I believe this is because the comparison produces a Series of boolean values rather than a single boolean, which causes the ambiguity. I also tried various forms of for loops, which did not resolve the issue.
Can anyone point me to how I should go about doing what I described?
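(For context: df1[col] != df2[col] yields a boolean Series, and Python cannot coerce a whole Series to a single True/False inside an if, hence the ambiguity error. Reducing the Series with .any() resolves it; a minimal sketch, assuming df1, df2, and col as in the question:)
mask = df1[col] != df2[col]
if mask.any():  # reduce the boolean Series to one truth value
    print(df1.loc[mask, col])
    print(df2.loc[mask, col])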

This might work for you:
import pandas as pd
df1 = pd.DataFrame({'col1': [1, 2, 3, 4, 5]})
df2 = pd.DataFrame({'col1': [1, 2, 9, 4, 7]})
if not df2[df2['col1'] != df1['col1']].empty:
    print(df1[df1['col1'] != df2['col1']])
    print(df2[df2['col1'] != df1['col1']])
Output:
col1
2 3
4 5
col1
2 9
4 7

You need to get hold of the index where the column values do not match. Once you have that index, you can query the individual DataFrames to get the values.
Please try the following and see if it helps:
for ind in df1.loc[df1['col1'] != df2['col1']].index:
    x = df1.loc[df1.index == ind, 'col1'].values[0]
    y = df2.loc[df2.index == ind, 'col1'].values[0]
    print(x, y)
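A vectorized variant of the same idea, sketched with the df1/df2 from the answer above: build the mismatch mask once and line the two columns up side by side.
import pandas as pd

mask = df1['col1'] != df2['col1']  # element-wise mismatch mask
print(pd.DataFrame({'df1': df1.loc[mask, 'col1'],
                    'df2': df2.loc[mask, 'col1']}))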

Solution
Try this. You could use any of the following one-line solutions.
# Option-1
df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
# Option-2
df.loc[df[col1]!=df[col2], [col1, col2]]
Logic:
Option-1: We use pandas.DataFrame.apply() to evaluate the target columns row by row; the resulting boolean mask is passed to df.loc[mask, [col1, col2]], which returns the required set of rows where col1 != col2.
Option-2: We build the same boolean mask directly with df[col1] != df[col2]; the rest of the logic is the same as Option-1. This vectorized form is typically much faster than apply().
Dummy Data
I made the dummy data such that columns 'a' and 'c' differ at indices 2, 6, and 8. Thus, we want only those rows returned by the solution.
import numpy as np
import pandas as pd
a = np.arange(10)
c = a.copy()
c[[2,6,8]] = [0,20,40]
df = pd.DataFrame({'a': a, 'b': a**2, 'c': c})
print(df)
Output:
a b c
0 0 0 0
1 1 1 1
2 2 4 0
3 3 9 3
4 4 16 4
5 5 25 5
6 6 36 20
7 7 49 7
8 8 64 40
9 9 81 9
Applying the solution to the dummy data
We see that the solution proposed returns the result as expected.
col1, col2 = 'a', 'c'
result = df.loc[df.apply(lambda row: row[col1] != row[col2], axis=1), [col1, col2]]
print(result)
Output:
a c
2 2 0
6 6 20
8 8 40
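As a side note, on pandas 1.1+ the built-in Series.compare reports the same differing pairs directly (columns labelled self and other):
print(df['a'].compare(df['c']))
#    self  other
# 2     2      0
# 6     6     20
# 8     8     40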

Related

For each item in list L, find all of the corresponding items in a dataframe

I'm looking for a fast solution to this Python problem:
- 'For each item in list L, find all of the corresponding items in a dataframe column (df['col1']).'
The catch is that both L and df['col1'] may contain duplicate values, and all duplicates should be returned.
For example:
L = [1,4,1]
d = {'col1': [1,2,3,4,1,4,4], 'col2': ['a','b','c','d','e','f','g']}
df = pd.DataFrame(data=d)
The desired output would be a new DataFrame where df['col1'] contains the values:
[1,1,1,1,4,4,4]
and rows are duplicated accordingly. Note that 1 appears 4 times (twice in L * twice in df)
I have found that the obvious solutions like .isin() don't work because they drop duplicates.
A list comprehension does work, but it is too slow for my real-life problem, where len(df) = 16 million and len(L) = 150,000:
idx = [y for x in L for y in df.index[df['col1'].values == x]]
res = df.loc[idx].reset_index(drop=True)
This is basically just a problem of comparing two lists (with a bit of dataframe indexing difficulty tacked on), and a clever and very fast solution by Mad Physicist almost works for this, except that duplicates in L are dropped (it returns [1, 4, 1, 4, 4] in the example above; i.e., it finds the duplicates in df but ignores the duplicates in L).
train = np.array([...]) # my df['col1']
keep = np.array([...]) # my list L
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
I'd be grateful for any ideas.
Initial data:
L = [1,4,1]
df = pd.DataFrame({'col':[1,2,3,4,1,4,4] })
You can create a dataframe from L:
df2 = pd.DataFrame({'col':L})
and merge it with the initial dataframe:
result = df.merge(df2, how='inner', on='col')
print(result)
Result:
col
0 1
1 1
2 1
3 1
4 4
5 4
6 4
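The same trick carries over to the original frame with its col2 column; a sketch using the question's data (merge duplicates each df row once per matching entry in L, so the extra columns come along for free):
import pandas as pd

L = [1, 4, 1]
df = pd.DataFrame({'col1': [1, 2, 3, 4, 1, 4, 4],
                   'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
result = df.merge(pd.DataFrame({'col1': L}), on='col1')  # inner join is the default
print(result)  # 7 rows: four 1s (2 in df x 2 in L) and three 4s, each with its col2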
IIUC try:
L = [1,4,1]
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0)
(Not sure how you want the indexes; the above returns them in a somewhat raw form.)
Output:
0 1
4 1
3 4
5 4
6 4
0 1
4 1
Name: col, dtype: int64
Reindexed:
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0).reset_index(drop=True)
#output:
0 1
1 1
2 4
3 4
4 4
5 1
6 1
Name: col, dtype: int64

How to split dataframe based on a column-names row

I have one Excel file; the dataframe has 20 rows. After a few rows there is another row of column names, and I want to divide the dataframe based on that column-names row.
Here is an example:
x
0
1
2
3
4
x
23
34
5
6
Expected output is:
df1
x
0
1
2
3
4
df2
x
23
34
5
6
Considering your column name is col, you can first group the dataframe by taking a cumulative sum over the rows where the value equals 'x' (df['col'].eq('x').cumsum()). Then, for each group, build a dataframe whose values come from the second row onward and whose column names come from the group's first row, via df.iloc[], and save them in a dictionary:
d = {f'df{i}': pd.DataFrame(g.iloc[1:].values, columns=g.iloc[0].values)
     for i, g in df.groupby(df['col'].eq('x').cumsum())}
print(d['df1'])
x
0 0
1 1
2 2
3 3
4 4
print(d['df2'])
x
0 23
1 34
2 5
3 6
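One caveat: because the repeated header rows were mixed into the data, the split frames typically come out with object dtype. A small cleanup sketch, assuming the data rows are numeric:
# convert every column of every split frame back to numbers
d = {name: frame.apply(pd.to_numeric) for name, frame in d.items()}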
Use df.index[df['x'] == 'x'] to look for the row index where the column name appears again.
Then split the dataframe into two based on the index found:
df = pd.DataFrame(columns=['x'], data=[[0], [1], [2], [3], [4], ['x'], [23], [34], [5], [6]])
split_at = df.index[df['x'] == 'x'].tolist()[0]  # first row where the header repeats
df1 = df.iloc[:split_at]
df2 = df.iloc[split_at + 1:]
You didn't say whether this is just a sample of your dataset; if the data really is this small, you can build the frames directly:
import pandas as pd
df1 = pd.DataFrame({'df1': ['x', 0, 1, 2, 3, 4]})
df2 = pd.DataFrame({'df2': ['x', 23, 34, 5, 6]})
display(df1, df2)

Is there any way to replace values with different values in the same row if they meet a certain condition using a for loop?

I need to replace the values of a certain cell with values from another cell if a certain condition is met.
for r in df:
    if df['col1'] > 1:
        df['col2']
    else:
I am hoping for every value in column 1 to be replaced with its respective value in column 2 when the value in column 1 is greater than 1.
No need to loop through the entire dataframe.
idx = df['col1'] > 1
df.loc[idx, 'col1'] = df.loc[idx, 'col2']
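Equivalently, here is a one-line sketch with numpy.where:
import numpy as np

# take col2 where the condition holds, keep col1 otherwise
df['col1'] = np.where(df['col1'] > 1, df['col2'], df['col1'])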
Using a for loop (note that assigning to the `row` yielded by iterrows() does not write back to the dataframe, so assign through df.at instead):
for i, row in df.iterrows():
    if row['col1'] > 1:
        df.at[i, 'col1'] = row['col2']  # .at writes back to df; mutating `row` would not
    elif other_condition:  # placeholder for a further condition
        pass  # put assignment here
    else:
        pass  # put assignment here
Here is an example
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 4, 6]})
print(df)
print('----------')
# the condition here is A^2 == B
df.loc[df['A'] * df['A'] == df['B'], 'A'] = df['B']
print(df)
output
A B
0 1 4
1 2 4
2 3 6
----------
A B
0 1 4
1 4 4
2 3 6

pandas dataframe drop columns by number of nan

I have a dataframe with some columns containing NaN. I'd like to drop the columns with a certain number of NaN. For example, in the following code, I'd like to drop any column with 2 or more NaN. In this case, column 'C' will be dropped and only 'A' and 'B' will be kept. How can I implement it?
import pandas as pd
import numpy as np
dff = pd.DataFrame(np.random.randn(10,3), columns=list('ABC'))
dff.iloc[3,0] = np.nan
dff.iloc[6,1] = np.nan
dff.iloc[5:8,2] = np.nan
print(dff)
There is a thresh param for dropna: it is the minimum number of non-NaN values a column needs in order to be kept. Pass the length of your df minus the number of NaN values you tolerate:
In [13]:
dff.dropna(thresh=len(dff) - 2, axis=1)
Out[13]:
A B
0 0.517199 -0.806304
1 -0.643074 0.229602
2 0.656728 0.535155
3 NaN -0.162345
4 -0.309663 -0.783539
5 1.244725 -0.274514
6 -0.254232 NaN
7 -1.242430 0.228660
8 -0.311874 -0.448886
9 -0.984453 -0.755416
So the above drops any column with fewer than len(dff) - 2 non-NaN values (i.e. the length of the df, in rows, minus 2). Strictly speaking, thresh=len(dff) - 2 still keeps a column with exactly 2 NaN; to drop columns with 2 or more NaN, as asked, use thresh=len(dff) - 1.
You can use a conditional list comprehension:
>>> dff[[c for c in dff if dff[c].isnull().sum() < 2]]
A B
0 -0.819004 0.919190
1 0.922164 0.088111
2 0.188150 0.847099
3 NaN -0.053563
4 1.327250 -0.376076
5 3.724980 0.292757
6 -0.319342 NaN
7 -1.051529 0.389843
8 -0.805542 -0.018347
9 -0.816261 -1.627026
Here is a possible solution:
s = dff.isnull().sum()  # count the number of NaN in each column
print(s)
A 1
B 1
C 3
dtype: int64
for col in dff:
    if s[col] >= 2:
        del dff[col]
Or
for c in dff:
    if sum(dff[c].isnull()) >= 2:
        dff.drop(c, axis=1, inplace=True)
I recommend the drop method. This is an alternative solution:
dff.drop(dff.columns[dff.isnull().sum() >= 2], axis=1)
Say you have to drop columns having more than 70% null values:
data.drop(data.columns[100 * data.isnull().sum() / len(data.index) > 70], axis=1)
You can do this through another approach as well, as below, for dropping columns having a certain number of NA values:
df = df.drop( columns= [x for x in df if df[x].isna().sum() > 5 ])
For dropping columns having a certain percentage of NA values:
df = df.drop(columns= [x for x in df if round((df[x].isna().sum()/len(df)*100),2) > 20 ])

How to print rows if values appear in any column of pandas dataframe

I would like to print all rows of a dataframe where I find the value '-' in any of the columns. Can someone please explain a way that is better than those described below?
This Q&A already explains how to do so by using boolean indexing but each column needs to be declared separately:
print(df.loc[df['A'].isin(['-']) | df['B'].isin(['-']) | df['C'].isin(['-'])])
I tried the following but I get the error 'Cannot index with multidimensional key':
df.loc[df[df.columns.values].isin(['-'])]
So I used this code, but I'm not happy with the separate printing for each column tested, because it is harder to work with and can print the same row more than once:
import pandas as pd
d = {'A': [1, 2, 3], 'B': [4, '-', 6], 'C': [7, 8, '-']}
df = pd.DataFrame(d)
for i in range(len(d.keys())):
    temp = df.loc[df.iloc[:, i].isin(['-'])]
    if temp.shape[0] > 0:
        print(temp)
Output looks like this:
A B C
1 2 - 8
[1 rows x 3 columns]
A B C
2 3 6 -
[1 rows x 3 columns]
Thanks for your advice.
Alternatively, you could do something like df[df.isin(["-"]).any(axis=1)], e.g.
>>> df = pd.DataFrame({'A': [1,2,3], 'B': ['-','-',6], 'C': [7,8,9]})
>>> df.isin(["-"]).any(axis=1)
0 True
1 True
2 False
dtype: bool
>>> df[df.isin(["-"]).any(axis=1)]
A B C
0 1 - 7
1 2 - 8
(Note I changed the frame a bit so I wouldn't get the axes wrong.)
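The same row mask can also be spelled with DataFrame.eq, which reads naturally when testing a single value:
df[df.eq('-').any(axis=1)]  # equivalent to the isin(['-']) mask above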
You can do:
>>> idx = df.apply(lambda ts: any(ts == '-'), axis=1)
>>> df[idx]
A B C
1 2 - 8
2 3 6 -
or
lambda ts: '-' in ts.values
Note that `in` looks into a Series' index, not its values, so you need .values.
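A quick illustration of that pitfall, as a sketch with made-up data:
s = pd.Series(['a', '-'], index=[10, 11])
print('-' in s)         # False: `in` checks the index labels
print('-' in s.values)  # True: checks the actual values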
