Pandas : select row where column A does not begin with column B - python

I want to select all those rows in a data frame where column A (string) does not begin with column B(string) .I used
df[not df['A'].str.startswith(df['B']) ]
but it is showing not a boolean value.
eg :
A B
1. abcdef abc
2. ab cd
3. ef g
Then the required output is:
eg :
A B
1. ab cd
2. ef g
Please help.

You can do this:
In [250]: data={'A': ['abcdef', 'ab', 'ed'],
...: 'B': ['abc', 'cd','g']}
...: df = pd.DataFrame(data)
In [251]: df
Out[251]:
A B
0 abcdef abc
1 ab cd
2 ed g
In [248]: df[~df.A.str.contains('|'.join(df.B))]
Out[248]:
A B
1 ab cd
2 ed g

import pandas as pd
data={'A': ['abcdef', 'ab', 'ed'],
'B': ['abc', 'cd','g']}
df = pd.DataFrame(data)
df[df.apply(lambda x: x.B not in x.A, axis=1)]
It gives you the exact result you want.

Related

Subtract one text column from the other using pandas

I want to remove the text that within one column from the other column vectorially. Meaning, without using loop or apply.
I found this solution that no longer works old solution link.
Input:
pd.DataFrame({'A': ['ABC', 'ABC'], 'B': ['A', 'B']})
A B
0 ABC A
1 ABC B
Desired output:
0 BC
1 AC
Use a list comprehension:
df['C'] = [a.replace(b, '') for a,b in zip(df['A'], df['B'])]
Output:
A B C
0 ABC A BC
1 ABC B AC
If you want a Series:
out = pd.Series([a.replace(b, '') for a,b in zip(df['A'], df['B'])], index=df.index)
Output:
0 BC
1 AC
dtype: object

Pandas: Sort before aggregate within a group

I have the following Pandas dataframe:
A B C
A A Test1
A A Test2
A A XYZ
A B BA
A B AB
B A AA
I want to group this dataset twice: First by A and B to concate the group within C and afterwards only on A to get the groups defined solely by column A. The result looks like this:
A A Test1,Test2,XYZ
A B AB, BA
B A AA
And the final result should be:
A A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B B,A:(AA)
Concatenating itself works, however the sorting does not seem work.
Can anyone help me with this problem?
Kind regards.
Using groupby + join
s1=df.groupby(['A','B']).C.apply(','.join)
s1
Out[421]:
A B
A A Test1,Test2,XYZ
B BA,AB
B A AA
Name: C, dtype: object
s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]:
A
A {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B {('B', 'A'): 'AA'}
dtype: object
First sort_values by 3 columns, then groupby with join first, then join A with B columns and last groupby for dictionary per groups:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if only 3 columns DataFrame
#df1 = df.sort_values().groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A Test1,Test2,XYZ A,A
1 A B AB,BA A,B
2 B A AA B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1 B {'B,A': 'AA'}
If need tuples only change first part of code:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A (Test1, Test2, XYZ) A,A
1 A B (AB, BA) A,B
2 B A (AA,) B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1 B {'B,A': ('AA',)}

Remove one dataframe from another with Pandas

I have two dataframes of different size (df1 nad df2). I would like to remove from df1 all the rows which are stored within df2.
So if I have df2 equals to:
A B
0 wer 6
1 tyu 7
And df1 equals to:
A B C
0 qwe 5 a
1 wer 6 s
2 wer 6 d
3 rty 9 f
4 tyu 7 g
5 tyu 7 h
6 tyu 7 j
7 iop 1 k
The final result should be like so:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
I was able to achieve my goal by using a for loop but I would like to know if there is a better and more elegant and efficient way to perform such operation.
Here is the code I wrote in case you need it:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
for i, row in df2.iterrows():
df1 = df1[(df1['A']!=row['A']) & (df1['B']!=row['B'])].reset_index(drop=True)
Use merge with outer join with filter by query, last remove helper column by drop:
df = pd.merge(df1, df2, on=['A','B'], how='outer', indicator=True)
.query("_merge != 'both'")
.drop('_merge', axis=1)
.reset_index(drop=True)
print (df)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
The cleanest way I found was to use drop from pandas using the index of the dataframe you want to drop:
df1.drop(df2.index, axis=0,inplace=True)
You can use np.in1d to check if any row in df1 exists in df2. And then use it as a reversed mask to select rows from df1.
df1[~df1[['A','B']].apply(lambda x: np.in1d(x,df2).all(),axis=1)]\
.reset_index(drop=True)
Out[115]:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
pandas has a method called isin, however this relies on unique indices. We can define a lambda function to create columns we can use in this from the existing 'A' and 'B' of df1 and df2. We then negate this (as we want the values not in df2) and reset the index:
import pandas as pd
df1 = pd.DataFrame({'A' : ['qwe', 'wer', 'wer', 'rty', 'tyu', 'tyu', 'tyu', 'iop'],
'B' : [ 5, 6, 6, 9, 7, 7, 7, 1],
'C' : ['a' , 's', 'd', 'f', 'g', 'h', 'j', 'k']})
df2 = pd.DataFrame({'A' : ['wer', 'tyu'],
'B' : [ 6, 7]})
unique_ind = lambda df: df['A'].astype(str) + '_' + df['B'].astype(str)
print df1[~unique_ind(df1).isin(unique_ind(df2))].reset_index(drop=True)
printing:
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
I think the cleanest way can be:
We have base dataframe D and want to remove a subset D1. Let the output be D2
D2 = pd.DataFrame(D, index = set(D.index).difference(set(D1.index))).reset_index()
I find this other alternative useful too:
pd.concat([df1,df2], axis=0, ignore_index=True).drop_duplicates(subset=["A","B"],keep=False, ignore_index=True)
A B C
0 qwe 5 a
1 rty 9 f
2 iop 1 k
keep=False drops both duplicates.
It doesn't require to put all the equal columns between the two df, so I find that a bit easier.

how to match two dataFrame in python

I have two dataFrame in Python.
The first one is df1:
'ID' 'B'
AA 10
BB 20
CC 30
DD 40
The second one is df2:
'ID' 'C' 'D'
BB 30 0
DD 35 0
What I want to get finally is like df3:
'ID' 'C' 'D'
BB 30 20
DD 35 40
how to reach this goal?
my code is:
for i in df.ID
if len(df2.ID[df2.ID==i]):
df2.D[df2.ID==i]=df1.B[df2.ID==i]
but it doesn't work.
So first of all, I've interpreted the question differently, since your description is rather ambiguous. Mine boils down to this:
df1 is this data structure:
ID B <- column names
AA 10
BB 20
CC 30
DD 40
df2 is this data structure:
ID C D <- column names
BB 30 0
DD 35 0
Dataframes have a merge option, if you wanted to merge based on index the following code would work:
import pandas as pd
df1 = pd.DataFrame(
[
['AA', 10],
['BB', 20],
['CC', 30],
['DD', 40],
],
columns=['ID','B'],
)
df2 = pd.DataFrame(
[
['BB', 30, 0],
['DD', 35, 0],
], columns=['ID', 'C', 'D']
)
df3 = pd.merge(df1, df2, on='ID')
Now df3 only contains rows with ID's in both df1 and df2:
ID B C D <- column names
BB 20 30 0
DD 40 35 0
Now you were trying to remove D, and fill it in with column B, a.k.a
ID C D
BB 30 20
DD 35 40
Something that can be done with these simple steps:
df3 = pd.merge(df1, df2, on='ID') # merge them
df3.D = df3['B'] # set D to B's values
del df3['B'] # remove B from df3
Or to summarize:
def match(df1, df2):
df3 = pd.merge(df1, df2, on='ID') # merge them
df3.D = df3['B'] # set D to B's values
del df3['B'] # remove B from df3
return df3
Following code will replace zero in df1 with value df2
df1=pd.DataFrame(['A','B',0,4,6],columns=['x'])
df2=pd.DataFrame(['A','X',3,0,5],columns=['x'])
df3=df1[df1!=0].fillna(df2)

How can I check the ID of a pandas data frame in another data frame in Python?

Hello I have the following Data Frame:
df =
ID Value
a 45
b 3
c 10
And another dataframe with the numeric ID of each value
df1 =
ID ID_n
a 3
b 35
c 0
d 7
e 1
I would like to have a new column in df with the numeric ID, so:
df =
ID Value ID_n
a 45 3
b 3 35
c 10 0
Thanks
Use pandas merge:
import pandas as pd
df1 = pd.DataFrame({
'ID': ['a', 'b', 'c'],
'Value': [45, 3, 10]
})
df2 = pd.DataFrame({
'ID': ['a', 'b', 'c', 'd', 'e'],
'ID_n': [3, 35, 0, 7, 1],
})
df1.set_index(['ID'], drop=False, inplace=True)
df2.set_index(['ID'], drop=False, inplace=True)
print pd.merge(df1, df2, on="ID", how='left')
output:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You could use join(),
In [14]: df1.join(df2)
Out[14]:
Value ID_n
ID
a 45 3
b 3 35
c 10 0
If you want index to be numeric you could reset_index(),
In [17]: df1.join(df2).reset_index()
Out[17]:
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
You can do this in a single operation. join works on the index, which you don't appear to have set. Just set the index to ID, join df after also setting its index to ID, and then reset your index to return your original dataframe with the new column added.
>>> df.set_index('ID').join(df1.set_index('ID')).reset_index()
ID Value ID_n
0 a 45 3
1 b 3 35
2 c 10 0
Also, because you don't do an inplace set_index on df1, its structure remains the same (i.e. you don't change its indexing).

Categories