Subtract one text column from the other using pandas - python

I want to remove the text in one column from the text in another column in a vectorized way, meaning without using a loop or apply.
I found an older solution, but it no longer works.
Input:
pd.DataFrame({'A': ['ABC', 'ABC'], 'B': ['A', 'B']})
A B
0 ABC A
1 ABC B
Desired output:
0 BC
1 AC

Use a list comprehension:
df['C'] = [a.replace(b, '') for a,b in zip(df['A'], df['B'])]
Output:
A B C
0 ABC A BC
1 ABC B AC
If you want a Series:
out = pd.Series([a.replace(b, '') for a,b in zip(df['A'], df['B'])], index=df.index)
Output:
0 BC
1 AC
dtype: object
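Two details of the list comprehension are worth checking on real data: str.replace removes every occurrence of the substring, and it fails on missing values. The sketch below assumes a small wrapper, subtract (a hypothetical name, not part of pandas), that passes non-string values through unchanged and can optionally remove only the first occurrence.

```python
import pandas as pd

# Sample data with a repeated letter and a missing value (illustrative only)
df = pd.DataFrame({'A': ['ABCA', 'ABC', None], 'B': ['A', 'B', 'C']})

def subtract(a, b, first_only=False):
    """Hypothetical helper: remove b from a, leaving non-strings untouched."""
    if not isinstance(a, str) or not isinstance(b, str):
        return a
    # str.replace with a count of 1 removes only the first occurrence
    return a.replace(b, '', 1) if first_only else a.replace(b, '')

df['all'] = [subtract(a, b) for a, b in zip(df['A'], df['B'])]
df['first'] = [subtract(a, b, first_only=True) for a, b in zip(df['A'], df['B'])]
print(df)
```

For 'ABCA' and 'A', removing all occurrences gives 'BC' while removing only the first gives 'BCA'.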

Related

Pandas : select row where column A does not begin with column B

I want to select all rows in a data frame where column A (string) does not begin with the value in column B (string). I used
df[not df['A'].str.startswith(df['B']) ]
but it fails with an error about the value not being a boolean.
eg :
A B
1. abcdef abc
2. ab cd
3. ef g
Then the required output is:
eg :
A B
1. ab cd
2. ef g
Please help.
You can do this:
In [250]: data={'A': ['abcdef', 'ab', 'ed'],
...: 'B': ['abc', 'cd','g']}
...: df = pd.DataFrame(data)
In [251]: df
Out[251]:
A B
0 abcdef abc
1 ab cd
2 ed g
In [248]: df[~df.A.str.contains('|'.join(df.B))]
Out[248]:
A B
1 ab cd
2 ed g
import pandas as pd
data={'A': ['abcdef', 'ab', 'ed'],
'B': ['abc', 'cd','g']}
df = pd.DataFrame(data)
df[df.apply(lambda x: x.B not in x.A, axis=1)]
It gives you the exact result you want.
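Note that both snippets above test whether B occurs anywhere in A, which is not quite the same as "A begins with B". A minimal sketch of a row-wise prefix test, assuming both columns contain only strings, could look like this. The fourth row is an added illustration where the two tests disagree.

```python
import pandas as pd

df = pd.DataFrame({'A': ['abcdef', 'ab', 'ef', 'xabc'],
                   'B': ['abc', 'cd', 'g', 'abc']})

# Compare each A with the B from the same row
starts = [a.startswith(b) for a, b in zip(df['A'], df['B'])]
mask = pd.Series(starts, index=df.index)

# Use ~ (not the Python keyword `not`) to invert a boolean Series
result = df[~mask]
print(result)
```

The row 'xabc' / 'abc' is kept here because 'xabc' contains 'abc' but does not start with it. A containment test would drop that row.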

how to groupby and join multiple rows from multiple columns at a time?

I want to know how to group by a single column and join the strings of several other columns across the rows of each group.
Here's an example dataframe:
df = pd.DataFrame(np.array([['a', 'a', 'b', 'b'], [1, 1, 2, 2],
['k', 'l', 'm', 'n']]).T,
columns=['a', 'b', 'c'])
print(df)
a b c
0 a 1 k
1 a 1 l
2 b 2 m
3 b 2 n
I've tried something like,
df.groupby(['b', 'a'])['c'].apply(','.join).reset_index()
b a c
0 1 a k,l
1 2 b m,n
But that is not my required output,
Desired output:
a b c
0 1 a,a k,l
1 2 b,b m,n
How can I achieve this? I need a scalable solution because I'm dealing with millions of rows.
I think you need to group by the b column only and then, if necessary, select the list of columns to aggregate with GroupBy.agg (note the double brackets, since selecting several columns requires a list):
df1 = df.groupby('b')[['a','c']].agg(','.join).reset_index()
#alternative if want join all columns without b
#df1 = df.groupby('b').agg(','.join).reset_index()
print (df1)
b a c
0 1 a,a k,l
1 2 b,b m,n
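The example above works because every column in the sample frame holds strings. As a sketch under the assumption that some of the columns to join are numeric (the column n below is hypothetical and not in the question), the values can be cast to str first so that ','.join does not fail.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'b', 'b'],
                   'b': [1, 1, 2, 2],
                   'c': ['k', 'l', 'm', 'n'],
                   'n': [10, 20, 30, 40]})  # hypothetical numeric column

cols = ['a', 'c', 'n']  # columns to join within each group

df1 = (df.astype({col: str for col in cols})  # ','.join needs strings
         .groupby('b')[cols]
         .agg(','.join)
         .reset_index())
print(df1)
```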

Renaming columns on slice of dataframe not performing as expected

I was trying to clean up column names in a dataframe but only a part of the columns.
It doesn't work when trying to replace column names on a slice of the dataframe somehow, why is that?
Let's say we have the following dataframe.
Note: copyable code to reproduce the data is at the bottom.
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
I want to clean up the column names (expected output):
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Approach 1:
I can get the clean column names like this:
df.iloc[:, 1:].columns.str[:4]
Index(['ColA', 'ColB', 'ColC'], dtype='object')
Or
Approach 2:
s = df.iloc[:, 1:].columns
[col[:4] for col in s]
['ColA', 'ColB', 'ColC']
But when I try to overwrite the column names, nothing happens:
df.iloc[:, 1:].columns = df.iloc[:, 1:].columns.str[:4]
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Same for the second approach:
s = df.iloc[:, 1:].columns
cols = [col[:4] for col in s]
df.iloc[:, 1:].columns = cols
Value ColAfjkj ColBhuqwa ColCouiqw
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
This does work, but you have to manually concat the name of the first column, which is not ideal:
df.columns = ['Value'] + df.iloc[:, 1:].columns.str[:4].tolist()
Value ColA ColB ColC
0 1 a e i
1 2 b f j
2 3 c g k
3 4 d h l
Is there an easier way to achieve this? Am I missing something?
Dataframe for reproduction:
df = pd.DataFrame({'Value':[1,2,3,4],
'ColAfjkj':['a', 'b', 'c', 'd'],
'ColBhuqwa':['e', 'f', 'g', 'h'],
'ColCouiqw':['i', 'j', 'k', 'l']})
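This is because df.iloc[:, 1:] returns a new DataFrame, so assigning to its columns attribute renames that temporary object and leaves df untouched. You also cannot change part of df.columns in place, because pandas' Index is immutable. If you check the documentation for class pandas.Index, you'll see that it is defined as: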
Immutable ndarray implementing an ordered, sliceable set
So in order to modify it you'll have to create a new list of column names, for instance with:
df.columns = [df.columns[0]] + list(df.iloc[:, 1:].columns.str[:4])
Another option is to use rename with a dictionary containing the columns to replace:
df.rename(columns=dict(zip(df.columns[1:], df.columns[1:].str[:4])))
To overwrite column names you can use the .rename() method.
So, it will look like:
df.rename(columns={'ColAfjkj':'ColA',
'ColBhuqwa':'ColB',
'ColCouiqw':'ColC'}
, inplace=True)
More info regarding rename here in docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html
I had this problem as well and came up with this solution:
First, create a mask of the columns you want to rename
mask = df.iloc[:,1:4].columns
Then, use list comprehension and a conditional to rename just the columns you want
df.columns = [x if x not in mask else x[:4] for x in df.columns]
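DataFrame.rename also accepts a function for columns, which avoids building the list or dictionary by hand. The sketch below assumes the columns to shorten can be recognized by the 'Col' prefix, which is true for the sample data but is an assumption about the real data.

```python
import pandas as pd

df = pd.DataFrame({'Value': [1, 2, 3, 4],
                   'ColAfjkj': ['a', 'b', 'c', 'd'],
                   'ColBhuqwa': ['e', 'f', 'g', 'h'],
                   'ColCouiqw': ['i', 'j', 'k', 'l']})

# Truncate only the names that start with 'Col'; leave the rest as they are
renamed = df.rename(columns=lambda c: c[:4] if c.startswith('Col') else c)
print(renamed.columns.tolist())
```

rename returns a new DataFrame, so df itself keeps its original column names unless the result is assigned back.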

Pandas: Sort before aggregate within a group

I have the following Pandas dataframe:
A B C
A A Test1
A A Test2
A A XYZ
A B BA
A B AB
B A AA
I want to group this dataset twice: first by A and B to concatenate the values of C within each group, and afterwards only by A to get the groups defined solely by column A. The result looks like this:
A A Test1,Test2,XYZ
A B AB, BA
B A AA
And the final result should be:
A A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B B,A:(AA)
The concatenation itself works, however the sorting does not seem to work.
Can anyone help me with this problem?
Kind regards.
Using groupby + join
s1=df.groupby(['A','B']).C.apply(','.join)
s1
Out[421]:
A B
A A Test1,Test2,XYZ
B BA,AB
B A AA
Name: C, dtype: object
s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]:
A
A {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B {('B', 'A'): 'AA'}
dtype: object
First use sort_values on all 3 columns, then groupby with join, then join the A and B columns, and finally groupby again to build a dictionary per group:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if the DataFrame has only these 3 columns
#df1 = df.sort_values(list(df.columns)).groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A Test1,Test2,XYZ A,A
1 A B AB,BA A,B
2 B A AA B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1 B {'B,A': 'AA'}
If you need tuples, change only the first part of the code:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A (Test1, Test2, XYZ) A,A
1 A B (AB, BA) A,B
2 B A (AA,) B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1 B {'B,A': ('AA',)}
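The answers above return dictionaries per group. If the goal is the flat string layout shown in the question, a minimal sketch along the same lines (sort, join within A and B, then join within A) might be the following. The column name label is introduced here only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'A', 'A', 'A', 'B'],
                   'B': ['A', 'A', 'A', 'B', 'B', 'A'],
                   'C': ['Test1', 'Test2', 'XYZ', 'BA', 'AB', 'AA']})

# Sort first so the values of C are joined in order within each (A, B) group
inner = (df.sort_values(['A', 'B', 'C'])
           .groupby(['A', 'B'])['C']
           .agg(','.join)
           .reset_index())

# Build "A,B:(values)" for each inner group, then join those labels per A
inner['label'] = inner['A'] + ',' + inner['B'] + ':(' + inner['C'] + ')'
out = inner.groupby('A')['label'].agg(', '.join)
print(out)
```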

String "contains"-slicing on Pandas MultiIndex

How can I slice a MultiIndex by its string content? I.e. whether that particular index contains a certain string?
In [12]: df = pd.DataFrame({'a': ['a', 'ab', 'b'],
'c': ['d', 'd', 'd'],
'val': [1, 2 , 3]}).set_index(['a', 'c'])
In [13]: df
Out[13]:
val
a c
a d 1
ab d 2
b d 3
In [15]: df.xs('a', level='a', drop_level=False)
Out[15]:
val
a c
a d 1
In[16]: df.xs(contains('a'), level='a', drop_level=False)
Expected output:
Out[16]:
a c
a d 1
ab d 2
Obviously that last bit is not possible.
How can this be done elegantly?
Can you do it case-insensitive in some way?
Use boolean indexing with get_level_values and str.contains:
print (df.index.get_level_values('a'))
Index(['a', 'ab', 'b'], dtype='object', name='a')
print (df.index.get_level_values('a').str.contains('a'))
[ True True False]
df1 = df[df.index.get_level_values('a').str.contains('a', case=False)]
print (df1)
val
a c
a d 1
ab d 2
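Put together as a self-contained sketch, the boolean-indexing approach looks like this. The sample data is changed slightly ('Ab' instead of 'ab') purely to show the effect of case=False.

```python
import pandas as pd

df = pd.DataFrame({'a': ['a', 'Ab', 'b'],
                   'c': ['d', 'd', 'd'],
                   'val': [1, 2, 3]}).set_index(['a', 'c'])

# Boolean array: True where level 'a' contains the letter, ignoring case
mask = df.index.get_level_values('a').str.contains('a', case=False)

result = df[mask]
print(result)
```

Because the mask is built from a single level, the other index levels play no part in the match.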
Another method is to use query:
The DataFrame.index and DataFrame.columns attributes of the DataFrame
instance are placed in the query namespace by default, which allows
you to treat both the index and columns of the frame as a column in
the frame.
>>> df.query('a.str.contains("a")')
val
a c
a d 1
ab d 2
which IMO is a little more readable and succinct than
>>> df[df.index.get_level_values('a').str.contains('a')]
When the index level is not critical, an alternative is to use df.filter(...) with a regular expression, which is handy when exploring data by either column or row. For example, this gives the same answer with a bit less code (it needs import re):
df.filter(regex=re.compile('A',re.I),axis=0)
However, this filters on all index levels, so df.filter(regex=re.compile('D',re.I),axis=0) will also match against index level "c".
