Sample data is below. The input for File A and File B is given, and the expected output format is also given. Can someone help me with this?
I'd also be curious to see a clever/pythonic solution to this. My "ugly" solution iterating over index is as follows:
dfa and dfb are the two dataframes, with columns named as in the example.
dfa = pd.DataFrame({'c1':['v','f','h','m','s','d'],'c2':['100','110','235','999','333','39'],'c3':['tech','jjj',None,'iii','mnp','lf'],'c4':['hhh','scb','kkk','lop','sos','kdk']})
dfb = pd.DataFrame({'c1':['v','h','m','f','L','s'],'c2':['100','235','999','110','777','333'],'c3':['tech',None,'iii','jkl','9kdf','mnp1'],'c4':['hhh','mckkk','lok','scb','ooo','sos1']})
Now let's create lists of indexes to identify the rows that don't match between dfa and dfb
dfa, dfb = dfa.set_index(['c1','c2']), dfb.set_index(['c1','c2'])
mismatch3, mismatch4 = [],[]
for i in dfa.index:
    if i in dfb.index:
        if dfa.loc[i,'c3'] != dfb.loc[i,'c3']:
            mismatch3.append(i)
        if dfa.loc[i,'c4'] != dfb.loc[i,'c4']:
            mismatch4.append(i)
mismatch = list(set(mismatch3+mismatch4))
Now that this is done, we want to rename dfb, perform the join operation on the mismatched indexes, and add the "status" columns based on mismatch3 and mismatch4.
dfb = dfb.rename(index=str, columns={'c3':'b_c3','c4':'b_c4'})
df = dfa.loc[mismatch].join(dfb)
df['c3_status'] = 'match'
df['c4_status'] = 'match'
df.loc[mismatch3, 'c3_status'] = 'mismatch'
df.loc[mismatch4, 'c4_status'] = 'mismatch'
Finally, let's get those columns in the right order :)
result = df[['c3','b_c3','c3_status','c4','b_c4','c4_status']]
Once again, I'd love to see a prettier solution. I hope this helps!
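For what it's worth, here is a rough sketch of a more vectorized alternative, using a single merge on the original (un-indexed) dfa and dfb. This is my own variation, and I haven't checked edge cases such as keys present in only one frame:

import numpy as np
import pandas as pd

# align the two frames on (c1, c2) in one merge; '_b' marks dfb's columns
merged = dfa.merge(dfb, on=['c1','c2'], suffixes=('', '_b'))

# flag each comparison column (note: None == None counts as a match here)
merged['c3_status'] = np.where(merged['c3'] == merged['c3_b'], 'match', 'mismatch')
merged['c4_status'] = np.where(merged['c4'] == merged['c4_b'], 'match', 'mismatch')

# keep only rows where at least one column disagrees
mask = (merged['c3_status'] == 'mismatch') | (merged['c4_status'] == 'mismatch')
result = merged.loc[mask, ['c1','c2','c3','c3_b','c3_status','c4','c4_b','c4_status']]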
Here are four lines of code that may do what you are looking for:
columns_to_compare = ['c2', 'c3']
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
dfb['Combo1'] = dfb[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
[i for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
Explanation
Assume that you want to see what dfb rows are not in dfa, for columns c2 and c3.
To do this, consider the following approach:
Create a column "Combo" in dfa, where each row of "Combo" holds a comma-separated string of the values from the chosen comparison columns for that row:
dfa['Combo'] = dfa[columns_to_compare].apply(lambda x: ', '.join(x[x.notnull()]), axis = 1)
c1 c2 c3 c4 Combo
0 v 100 tech hhh 100, tech
1 f 110 jjj scb 110, jjj
2 h 235 None kkk 235
3 m 999 iii lop 999, iii
4 s 333 mnp sos 333, mnp
5 d 39 lf kdk 39, lf
Apply the same logic to dfb
c1 c2 c3 c4 Combo1
0 v 100 tech hhh 100, tech
1 h 235 None mckkk 235
2 m 999 iii lok 999, iii
3 f 110 jkl scb 110, jkl
4 L 777 9kdf ooo 777, 9kdf
5 s 333 mnp1 sos1 333, mnp1
Create a list containing the required indices from dfb:
[i for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
or to show the actual row values (not indices):
[[x] for i,x in enumerate(dfb['Combo1'].tolist()) if x not in dfa['Combo'].tolist()]
Row Index Result
[3, 4, 5]
Row Value Result
[['110, jkl'], ['777, 9kdf'], ['333, mnp1']]
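As an aside, the same "rows of dfb not in dfa" question can be answered without building helper strings, e.g. with a left merge and its indicator flag. A sketch, assuming (as in this sample) that the (c2, c3) pairs are unique within dfa:

# left-merge dfb against dfa on the comparison columns; '_merge' marks the misses
merged = dfb.merge(dfa, on=['c2', 'c3'], how='left', indicator=True)
missing_idx = merged.index[merged['_merge'] == 'left_only'].tolist()
# missing_idx -> [3, 4, 5]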
I want to add a new column a3 to my dataframe df: if a string in column "b" contains a string from column "b2" of dataframe df2, the new column a3 should take the corresponding value from a2 of df2.
first dataframe df:
d = {'a': [100, 300], 'b': ["abc", "dfg"]}
df = pd.DataFrame(data=d, index=[1, 2])
print(df)
a b
1 100 abc
2 300 dfg
second dataframe df2:
d2 = {'a2': ["L1", "L2", "L3"], 'b2': ["bc", "op", "fg"]}
df2 = pd.DataFrame(data=d2, index=[1, 2, 3])
print(df2)
a2 b2
1 L1 bc
2 L2 op
3 L3 fg
The output should look like this:
print(df)
a b a3
1 100 abc L1
2 300 dfg L3
I tried a nested for loop, which did not work.
for i in df.b:
    for ii in df2.b2:
        for iii in df2.a3:
            if ii in i:
                df["a3"] = iii
You need to test all combinations. You could still take advantage of pandas' vectorized str.contains:
common = (pd.DataFrame({x: df['b'].str.contains(x) for x in df2['b2']})
          .replace({False: pd.NA})
          .stack()
          .reset_index(level=1, name='b2')['level_1'].rename('b2')
          )
# 1 bc
# 2 fg
# Name: b2, dtype: object
df.join(common).merge(df2, on='b2')
Output:
a b b2 a2
0 100 abc bc L1
1 300 dfg fg L3
You can half fix your logic as follows:
for i in df.b:
    for ii, iii in zip(df2.b2, df2.a2):
        if ii in i:
            df["a3"] = iii
However, the final line df["a3"] = iii assigns iii to every row, so you just end up with the last matching value of iii from the loop in every row:
a b a3
1 100 abc L3
2 300 dfg L3
You will get many 'correct' options, but one that is closest to your attempt is perhaps:
new_column = [None] * len(df)  # create a list of Nones, the same 'height' as df
for i, b in enumerate(df.b):
    for a2, b2 in zip(df2.a2, df2.b2):
        if b2 in b:
            new_column[i] = a2
            break  # stop at the first match and move on to the next 'row' in df
df["a3"] = new_column
A difference from your attempt is that this builds new_column separately and then adds it to your dataframe afterwards. Where there is no match, you will be left with None. Where there are multiple matches, you will get the first (top) match; you could remove the break line to instead get the last (bottom) match.
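If you prefer it compact, the same "first match or None" logic fits in a single comprehension using next() with a default; this is an equivalent sketch, not from the answer above:

# take the first a2 whose b2 is a substring of b, else None
df["a3"] = [next((a2 for a2, b2 in zip(df2.a2, df2.b2) if b2 in b), None)
            for b in df.b]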
Among a lot of approaches, you can use a list comprehension:
df["a2"] = [df2.iloc[i]["a2"] for y in df.b for i,x in enumerate(df2.b2) if x in y]
df
Output:
     a    b  a2
1  100  abc  L1
2  300  dfg  L3
Note that it shouldn't be d2 = {'a2': [10, 30, 25], 'b2': ["bc", "op", "fg"]}; rather, it should be d2 = {'a2': ["L1", "L2", "L3"], 'b2': ["bc", "op", "fg"]}.
I have the following dataframe
column1 column2
0 Paul xx
1 John aa
2 Paul gg
3 John xx
4 John bb
5 George gg
6 Paul gg
7 john xx
.
n Jonathan ff
I want to have the information of each person in one row. On the same row I want to have the index but in another column. So I want a dataframe like this:
column1 column2 column3
0 Paul 0,2,6 xx, gg, gg
1 John 1,3,4,7 aa, xx, bb, xx
5 George 5 gg
.
.
.
n Jonathan n ff
In order to make the above dataframe, I execute:
df2 = df.reset_index().groupby('column1').agg(list).reset_index()
ix = pd.Index(df2['index'].str.get(0)).rename(None)
df3 = df2.set_index(ix).sort_index()
df3
Which returns:
column1 index column2
0 Paul [0, 2, 6] [xx, gg, gg]
1 John [1, 3, 4, 7] [aa, xx, bb, xx]
5 George [5] [gg]
After that, I delete column1 and index.
To have the values of column2 as plain strings rather than lists, I execute:
def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(x))
    return df3
dfb=transform_list(df3)
df3.head()
which returns:
column2
0 xx, gg, gg
1 aa, xx, bb, xx
5 gg
Now what I want is to keep only the unique values in each row, so my final dataframe will be:
column2
0 xx, gg
1 aa, xx, bb
5 gg
Any ideas?
As long as the order of the elements in your output doesn't matter, you can redefine your function as follows:
def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(set(x)))
    return df3
A set contains only unique elements, so converting the list x to a set discards any duplicates. Sets are unordered, however, so you may get unintended results if order matters.
If order does matter, you can use the version
def transform_list(df3):
    df3['column2'] = df3['column2'].apply(lambda x: ','.join(list(dict.fromkeys(x))))
    return df3
This creates a dictionary (which preserves insertion order) with keys taken from your initial list x, and since keys cannot be defined more than once, we end up with only the unique elements. Converting back to a list takes the keys from the dictionary, and the rest of the workflow can continue as needed without alteration.
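A quick illustration of the difference, using one of the sample rows:

row = ['aa', 'xx', 'bb', 'xx']
','.join(set(row))            # e.g. 'bb,xx,aa' -- order is arbitrary
','.join(dict.fromkeys(row))  # 'aa,xx,bb' -- first-seen order preserved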
You could transform each list to a set and then back to a list, to eliminate duplicate entries. This could be done within your lambda function:
import pandas as pd

df = pd.DataFrame({'column2': [['xx', 'gg', 'gg'],
                               ['aa', 'xx', 'bb', 'xx'],
                               ['gg']]},
                  index=[0, 1, 5])
df['column2'] = df.column2.apply(lambda x: ', '.join(list(set(x))))
df
column2
0 gg, xx
1 bb, xx, aa
5 gg
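If you would rather deduplicate while keeping first-seen order, pd.unique also returns values in order of appearance; a small variation on the apply above, applied to the original list-valued column:

# pd.unique keeps the order in which values first appear
df['column2'] = df.column2.apply(lambda x: ', '.join(pd.unique(x)))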
This is the data frame that I want to search on and get back the matching row number.
'A' and 'AB' are completely different things.
df2 = pd.DataFrame(np.array(['A','B','AC','AD','NAN','XX','BC','SLK','AC','AD','NAN','XU','BB','FG','XZ','XY','AD','NAN','NF','XY','AB','AC','AD','NAN','XY','LK','AC','AC','AD','NAN','KH','BC','GF','BC','AD']).reshape(5,7),columns=['a','b','c','d','e','f','g'])
a b c d e f g
0 A B AC AD NAN XX BC
1 SLK AC AD NAN XU BB FG
2 XZ XY AD NAN NF XY AB
3 AC AD NAN XY LK AC AC
4 AD NAN KH BC GF BC AD
The strings I will be searching for come from this smaller data frame, where each row has to be searched as an AND (both strings present in the same row), returning the matching row indexes of df2.
df = pd.DataFrame(np.array(['A','B','C','D','AA','AB','AC','AD','NAN','BB','BC','AD']).reshape(6,2),columns=['a1','b1'])
a1 b1
0 A B # present in the first row of df2
1 C D # not present in any row of df2
2 AA AB # not present in any row of df2
3 AC AD # present in rows 0, 1 and 3 of df2
4 NAN BB # present in the second row of df2
5 BC AD # present in rows 0 and 4 of df2
AND part
Desired output [0,1,3,4]
import pandas as pd
import numpy as np

index1 = df.index  # the rows of df
terms = []
React = []
for i in range(len(index1)):  # loop over each row of the df dataframe
    terms = df.iloc[i]  # get row i
    terms[i] = terms.values.tolist()  # convert it to a list
    print(terms[i])  # to check each row
    for term in terms[i]:  # search for each string in the row
        print(term)
        results = pd.DataFrame()
        if results.empty:
            results = df2.isin([term])
        else:
            results |= df2.isin([term])
    results['count'] = results.sum(axis=1)
    print(results['count'])
    print(results[results['count'] == len(terms[i])].index.tolist())
    React = results[results['count'] == len(terms[i])].index.tolist()
React
I am getting TypeError: unhashable type: 'list' on results = df2.isin([term]).
For OR it should be easy, but I have to exclude the AND matches that are already accounted for in the first section.
React2=df2.isin([X]).any(1).index.tolist()
React2
This may not be exactly the output you expected, but it collects the df2 indexes for the AND condition on a df row-by-row basis. Does this meet the intent of your question?
output = []
for i in range(len(df)):
    tmp = []
    for k in range(len(df2)):
        d = df2.loc[k].isin(df.loc[i, ['a1']])
        f = df2.loc[k].isin(df.loc[i, ['b1']])
        d = d.tolist()
        f = f.tolist()
        if sum(d) >= 1 and sum(f) >= 1:
            tmp.append(k)
    output.append(tmp)
output
[[0], [], [], [0, 1, 3], [1], [0, 4]]
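A hedged, more vectorized variant of the same per-row AND test (my own sketch, using isin over the whole frame instead of row-by-row loops):

# for each (a1, b1) pair, keep the df2 rows that contain both values somewhere
matches = [df2.index[df2.isin([a]).any(axis=1) & df2.isin([b]).any(axis=1)].tolist()
           for a, b in zip(df.a1, df.b1)]
# matches -> [[0], [], [], [0, 1, 3], [1], [0, 4]]

# the desired union for the AND part:
sorted(set().union(*matches))  # [0, 1, 3, 4]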
I'm looking for a fast solution to this Python problem:
For each item in list L, find all of the corresponding items in a dataframe column (`df['col1']`). The catch is that both L and df['col1'] may contain duplicate values, and all duplicates should be returned.
For example:
L = [1,4,1]
d = {'col1': [1,2,3,4,1,4,4], 'col2': ['a','b','c','d','e','f','g']}
df = pd.DataFrame(data=d)
The desired output would be a new DataFrame where df['col1'] contains the values:
[1,1,1,1,4,4,4]
and rows are duplicated accordingly. Note that 1 appears 4 times (twice in L * twice in df)
I have found that the obvious solutions like .isin() don't work because they drop duplicates.
A list comprehension does work, but it is too slow for my real-life problem, where len(df) = 16 million and len(L) = 150,000:
idx = [y for x in L for y in df.index[df['col1'].values == x]]
res = df.loc[idx].reset_index(drop=True)
This is basically just a problem of comparing two lists (with a bit of dataframe indexing difficulty tacked on), and a clever and very fast solution by Mad Physicist almost works for this, except that duplicates in L are dropped (it returns [1, 4, 1, 4, 4] in the example above; i.e., it finds the duplicates in df but ignores the duplicates in L).
train = np.array([...]) # my df['col1']
keep = np.array([...]) # my list L
keep.sort()
ind = np.searchsorted(keep, train, side='left')
ind[ind == keep.size] -= 1
train_keep = train[keep[ind] == train]
I'd be grateful for any ideas.
Initial data:
L = [1,4,1]
df = pd.DataFrame({'col':[1,2,3,4,1,4,4] })
You can create dataframe from L
df2 = pd.DataFrame({'col':L})
and merge it with the initial dataframe:
result = df.merge(df2, how='inner', on='col')
print(result)
Result:
col
0 1
1 1
2 1
3 1
4 4
5 4
6 4
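Applied to the original two-column frame from the question, the same inner merge keeps col2 aligned with the duplicated rows. A sketch (note that pandas does not guarantee row order for inner merges on duplicate keys):

import pandas as pd

L = [1, 4, 1]
df = pd.DataFrame({'col1': [1, 2, 3, 4, 1, 4, 4],
                   'col2': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})

# every occurrence in L multiplies every matching row of df
result = df.merge(pd.DataFrame({'col1': L}), on='col1')
# col1 contains 1 four times (2 in L x 2 in df) and 4 three times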
IIUC try:
L = [1,4,1]
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0)
(Not sure how you want the indexes; the above returns a somewhat raw format.)
Output:
0 1
4 1
3 4
5 4
6 4
0 1
4 1
Name: col, dtype: int64
Reindexed:
pd.concat([df.loc[df['col'].eq(el), 'col'] for el in L], axis=0).reset_index(drop=True)
#output:
0 1
1 1
2 4
3 4
4 4
5 1
6 1
Name: col, dtype: int64
I extracted multiple dataframes from an Excel sheet by passing coordinates (start and end). I used the function below to extract the data according to the coordinates, but when I convert the result into a dataframe, I am not sure where the indexes that show up as columns are coming from. I want to remove these index columns and make the second row the header. This is my dataframe:
0 1 2 3 4 5 6
Cols/Rows A A2 B B2 C C2
0 A 50 50 150 150 200 200
1 B 200 200 250 300 300 300
2 C 350 500 400 400 450 450
def extract_dataframes(sheet):
    ws = sheet['pivots']
    cordinates = [('A1', 'M8'), ('A10', 'Q17'), ('A19', 'M34'), ('A36', 'Q51')]
    multi_dfs_list = []
    for i in cordinates:
        data_rows = []
        for row in ws[i[0]:i[1]]:
            data_cols = []
            for cell in row:
                data_cols.append(cell.value)
            data_rows.append(data_cols)
        multi_dfs_list.append(data_rows)
    multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}
    return multi_dfs
I tried to delete the index, but it did not work. Note: when I inspect the columns:
>>> multi_dfs[0].columns # first dataframe
RangeIndex(start=0, stop=13, step=1)
Change
multi_dfs = {i: pd.DataFrame(df) for i, df in enumerate(multi_dfs_list)}
to
multi_dfs = {i: pd.DataFrame(df[1:], columns=df[0]) for i, df in enumerate(multi_dfs_list)}
From the Docs,
columns : Index or array-like
Column labels to use for resulting frame. Will default to RangeIndex (0, 1, 2, …, n) if no column labels are provided
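A minimal sketch of what that change does, using a shortened version of the extracted rows:

import pandas as pd

rows = [['Cols/Rows', 'A', 'A2', 'B'],
        ['A', 50, 50, 150],
        ['B', 200, 200, 250]]

# the first extracted row becomes the header instead of a RangeIndex
pd.DataFrame(rows[1:], columns=rows[0])
#   Cols/Rows    A   A2    B
# 0         A   50   50  150
# 1         B  200  200  250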
I think you need:
df = pd.read_excel(file, skiprows=1)
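With the default header=0, skiprows=1 makes the sheet's second row the header; header=1 is an equivalent spelling of the same thing:

# both lines use the sheet's second row as the header ('file' is the workbook path, as above)
df = pd.read_excel(file, skiprows=1)
df = pd.read_excel(file, header=1)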