Pandas split on character and remove trailing values - python

I am trying to remove residual data that begins with a '[' while keeping the value that comes before it.
import pandas as pd
df=pd.DataFrame({'foo':['a','b[b7','c']})
print(df)
becomes:
    foo
0     a
1  b[b7
2     c
would like to have
  foo
0   a
1   b
2   c
Any recommendations?

df.foo=df.foo.str[0]
df
Out[212]:
foo
0 a
1 b
2 c
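Note that str[0] keeps only the first character of each value, which works here because everything before the '[' is a single character. With a hypothetical multi-character value you would need to split on the bracket instead (see the next answer); a quick sketch of the difference:
import pandas as pd
s = pd.Series(['foo[b7'])
print(s.str[0])                  # first character only -> 'f'
print(s.str.split('[').str[0])   # everything before the '[' -> 'foo'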

I assume you're looking for str.split + str[0] -
df
foo
0 test
1 foo[b7
2 ba[r
df.foo.str.split('[').str[0]
0 test
1 foo
2 ba
Name: foo, dtype: object
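To write the result back into the frame, assign it to the column; passing n=1 stops splitting after the first '[' (a minimal sketch using the same frame):
df['foo'] = df['foo'].str.split('[', n=1).str[0]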

import pandas as pd
df = pd.DataFrame({'foo': ['a', 'b[b7', 'c']})
df["foo"] = df["foo"].str.replace(r"(\[.*)", "", regex=True)
Here is the explanation from https://regex101.com/:
1st Capturing Group (\[.*)
\[ matches the character [ literally (case sensitive)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
What this means is that it will look for a '['. If it finds one, it removes the '[' and all characters after it.
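For completeness, running the snippet above on the sample frame produces the desired output:
print(df)
  foo
0   a
1   b
2   c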

import pandas as pd
df = pd.DataFrame({'foo':[x.split('[')[0] for x in ['a','b[b7','c']]})
print(df)

Related

negative lookbehind when filtering pandas columns

Consider this simple example
import pandas as pd
df = pd.DataFrame({'good_one': [1, 2, 3],
                   'bad_one': [1, 2, 3]})
Out[7]:
good_one bad_one
0 1 1
1 2 2
2 3 3
In this artificial example I would like to filter the columns that DO NOT start with bad. I can use a regex condition on the pandas columns using .filter(). However, I am not able to make it work with a negative lookbehind.
See here
df.filter(regex = 'one')
Out[8]:
good_one bad_one
0 1 1
1 2 2
2 3 3
but now
df.filter(regex = '(?<!bad).*')
Out[9]:
good_one bad_one
0 1 1
1 2 2
2 3 3
does not do anything. Am I missing something?
Thanks
Solution if you need to remove column names starting with bad:
df = pd.DataFrame({'good_one': [1, 2, 3],
                   'not_bad_one': [1, 2, 3],
                   'bad_one': [1, 2, 3]})
#https://stackoverflow.com/a/5334825/2901002
df1 = df.filter(regex=r'^(?!bad).*$')
print (df1)
good_one not_bad_one
0 1 1
1 2 2
2 3 3
^ asserts position at start of a line
Negative Lookahead (?!bad)
Assert that the Regex below does not match
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Solution for removing all columns whose names contain the bad substring:
df2 = df.filter(regex=r'^(?!.*bad).*$')
print (df2)
good_one
0 1
1 2
2 3
^ asserts position at start of a line
Negative Lookahead (?!.*bad)
Assert that the Regex below does not match
. matches any character
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
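As to why the original attempt appears to do nothing: df.filter applies the pattern with re.search, so it can match anywhere in the label, and at the very start of 'bad_one' there is nothing behind the cursor, so the negative lookbehind (?<!bad) succeeds and .* then matches the whole name. A minimal check with re (sketched outside pandas) makes this visible:
import re
print(re.search(r'(?<!bad).*', 'bad_one'))   # matches 'bad_one', so the column is kept
print(re.search(r'^(?!bad).*$', 'bad_one'))  # None, so the column is dropped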

How to filter pd.Dataframe based on strings and special characters?

Here is what I have:
import re
import pandas as pd
d = {'ID': [1, 2, 3, 4, 5], 'Desc': ['0*1***HHCM', 'HC:83*20', 'HC:5*2CASL', 'DM*72\nCAS*', 'HC:564*CAS*5']}
df = pd.DataFrame(data=d)
df
Output:
ID Desc
0 1 0*1***HHCM
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
I need to filter the dataframe by column "Desc", if it contains "CAS" or "HC" that are not surrounded by letters or digits.
Here is what I tried:
new_df = df[df['Desc'].str.match(r'[^A-Za-z0-9]CAS[^A-Za-z0-9]|[^A-Za-z0-9]HC[^A-Za-z0-9]') == True]
It returns an empty dataframe.
I want it to return the following:
ID Desc
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
Another thing: since row 3 has "\nCAS", where "\n" is a line separator, will it be treated as a letter before "CAS"?
Please help.
Try this:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=re.M)]
# If you don't want to import re you can just use flags=8:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=8)]
Result:
ID Desc
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
To answer your other question: as long as \n is passed as an actual escape sequence (not in a raw string), it is parsed as a newline character rather than the letter n. i.e.:
r'\n' -> '\\n' (a backslash character followed by the letter n)
'\n'  -> '\n'  (a single newline character)
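A quick way to check this in Python itself, just inspecting the two literals:
print(len('\n'), repr('\n'))     # 1 '\n'   - a single newline character
print(len(r'\n'), repr(r'\n'))   # 2 '\\n'  - a backslash followed by the letter n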
For further explanation on the regex, please see Regex101 demo: https://regex101.com/r/FNBgPV/2
You can try this; it only checks for digits and letters before CAS and HC, but you can easily modify it to check after them as well:
print(df[~df['Desc'].str.contains('([0-9a-zA-Z]+CAS*)|([0-9a-zA-Z]+HC*)', regex=True)])
ID Desc
1 2 HC:83*20
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5

Splitting string multiple times and return the result as new DataFrame

I am trying to split a pandas column repeatedly. I want to find the string in between two delimiters, however many times they occur. For example, let's say I have the pandas column from the input below:
import numpy as np
import pandas as pd
data=np.array([["'abc'ad32kn'def'dfannasfl[]12a'ghi'"],
["'jk'adf%#d1asn'lm'dfas923231sassda"],
["'nop'ad&#*-0'qrs'd2&*#^#!!sda'tuv'dasdj_23'w'823a&#'xyz'adfa"]])
df = pd.DataFrame({'Practice Column': data.ravel()})
print(df)
I would then like to split these strings by the opening and closing quotes '...' and take what is inside, so the final output contains only the quoted substrings (abc, def, ghi from the first row, and so on).
Can someone help me out?
Thanks.
Let's use extractall here:
df['Practice Column'].str.extractall(r"'(.*?)'").unstack(1)[0].fillna('')
match 0 1 2 3 4
0 abc def ghi
1 jk lm
2 nop qrs tuv w xyz
The pattern '(.*?)' finds all instances of strings within the single quotes. More info -
' # Match opening quote
( # Open capture group
.*? # Non-greedy match for anything
) # End of capture group
' # Match closing quote
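For reference, a sketch of the intermediate step: str.extractall returns one row per match, indexed by (row, match), and unstack(1) then pivots the match level into the columns shown above.
df['Practice Column'].str.extractall(r"'(.*?)'")
           0
  match
0 0      abc
  1      def
  2      ghi
1 0       jk
  1       lm
2 0      nop
  1      qrs
  2      tuv
  3        w
  4      xyz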
To merge this back with df, you can either use join:
v = df.join(df['Practice Column']
            .str.extractall(r"'(.*?)'").unstack(1)[0].fillna(''))
Or, assign "Practice Column" back:
v = df['Practice Column'].str.extractall(r"'(.*?)'").unstack(1)[0].fillna('')
v.insert(0, 'Practice Column', df['Practice Column'])
print(v)
match Practice Column 0 1 2 3 4
0 'abc'ad32kn'def'dfannasfl[]12a'ghi' abc def ghi
1 'jk'adf%#d1asn'lm'dfas923231sassda jk lm
2 'nop'ad&#*-0'qrs'd2&*#^#!!sda'tuv'dasdj_23'w'8... nop qrs tuv w xyz
Another solution with a list comprehension (for performance).
import re
p = re.compile("'(.*?)'")
pd.DataFrame([
    p.findall(s) for s in df['Practice Column']]).fillna('')
0 1 2 3 4
0 abc def ghi
1 jk lm
2 nop qrs tuv w xyz
This won't work if there are NaNs, so here's a modified version of the solution above. You will need to drop the NaNs first.
pd.DataFrame([
    p.findall(s) for s in df['Practice Column'].dropna()]
).fillna('')
0 1 2 3 4
0 abc def ghi
1 jk lm
2 nop qrs tuv w xyz

Filtering out rows with non-alphanumeric characters

I am trying to get a DataFrame from an existing DataFrame containing only the rows where the values in a certain column (whose values are strings) do not contain a certain character.
i.e. if the character we don't want is a '(':
Original dataframe:
some_col my_column
0 1 some
1 2 word
2 3 hello(
New dataframe:
some_col my_column
0 1 some
1 2 word
I have tried df.loc['(' not in df['my_column']], but this does not work since df['my_column'] is a Series object.
I have also tried: df.loc[not df.my_column.str.contains('(')], which also does not work.
You're looking for str.isalpha:
df[df.my_column.str.isalpha()]
some_col my_column
0 1 some
1 2 word
A similar method is str.isalnum, if you want to retain letters and digits.
If you want to retain word characters (letters, digits, underscore) and whitespace, filter out everything else:
df[~df.my_column.str.contains(r'[^\w\s]')]
some_col my_column
0 1 some
1 2 word
Lastly, if you are looking to remove punctuation as a whole, I've written a Q&A here which might be a useful read: Fast punctuation removal with pandas
If you are looking to filter out just that character:
negation of str.contains
Escape the open paren. Some characters can be interpreted as special regex characters. You can escape them with a backslash.
df[~df.my_column.str.contains(r'\(')]
some_col my_column
0 1 some
1 2 word
str.match all non-open-paren
By the way, this is a bad idea! Checking the whole string with a regex just to exclude one character is gross.
df[df.my_column.str.match(r'^[^\(]*$')]
some_col my_column
0 1 some
1 2 word
Comprehension using in
df[['(' not in x for x in df.my_column]]
some_col my_column
0 1 some
1 2 word

How to replace an entire cell with NaN on pandas DataFrame

I want to replace the entire cell that contains the word as circled in the picture with blanks or NaN. However when I try to replace for example '1.25 Dividend' it turned out as '1.25 NaN'. I want to return the whole cell as 'NaN'. Any idea how to work on this?
Option 1
Use a regular expression in your replace:
import numpy as np
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means that it will interpret the problem as a regular expression one. You still need an appropriate pattern. The '^' says to start at the beginning of the string. '^.*' matches all characters from the beginning of the string. '$' says to end the match with the end of the string. '.*$' matches all characters up to the end of the string. Finally, '^.*Dividend.*$' matches all characters from the beginning, has 'Dividend' somewhere in the middle, then any characters after it. Then replace this whole thing with np.nan
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
Pass a lambda to applymap that identifies whether a cell contains 'Dividend':
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
If you want to replace every string cell, not just the ones containing 'Dividend', you can coerce the whole frame to numeric:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
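Applied to the example frame above, the result (sketched) is:
   0    1
0  1  NaN
1  3  4.0
2  5  NaN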
I would use applymap like this
df.applymap(lambda x: 'NaN' if (type(x) is str and 'Dividend' in x) else x)
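Note that this puts the string 'NaN' into the cell rather than a real missing value. If you want pandas to treat the cell as missing, use np.nan instead; a minimal variation on the same idea:
import numpy as np
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)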
