I am trying to remove residual data that starts with a '[', while keeping the value that comes before it.
import pandas as pd
df=pd.DataFrame({'foo':['a','b[b7','c']})
print(df)
becomes:
0 a
1 b[b7
2 c
would like to have
0 a
1 b
2 c
Any recommendations?
df.foo = df.foo.str[0]  # keeps only the first character, which works here because each wanted value is a single letter
df
Out[212]:
foo
0 a
1 b
2 c
I assume you're looking for str.split + str[0] -
df
foo
0 test
1 foo[b7
2 ba[r
df.foo.str.split('[').str[0]
0 test
1 foo
2 ba
Name: foo, dtype: object
import pandas as pd
df = pd.DataFrame({'foo': ['a', 'b[b7', 'c']})
df["foo"] = df["foo"].str.replace(r"(\[.*)", "", regex=True)  # regex=True so the pattern is treated as a regular expression
Here is the https://regex101.com/ explanation
1st Capturing Group (\[.*)
\[ matches the character [ literally (case sensitive)
. matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
What this means is that it will look for a '['. If it finds one, it will remove the '[' and all characters after it.
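For illustration, here is a quick check of that greedy behaviour on a value with more than one '[' (the extra value 'd[e[f' is made up, not from the question):
import pandas as pd
# 'd[e[f' is a hypothetical extra value showing that the greedy .* removes
# everything from the first '[' onward, later '[' characters included.
s = pd.Series(['a', 'b[b7', 'd[e[f'])
print(s.str.replace(r"(\[.*)", "", regex=True).tolist())
# ['a', 'b', 'd']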
import pandas as pd
df = pd.DataFrame({'foo':[x.split('[')[0] for x in ['a','b[b7','c']]})
print(df)
Consider this simple example
import pandas as pd
df = pd.DataFrame({'good_one' : [1,2,3],
                   'bad_one' : [1,2,3]})
Out[7]:
good_one bad_one
0 1 1
1 2 2
2 3 3
In this artificial example I would like to filter the columns that DO NOT start with bad. I can use a regex condition on the pandas columns using .filter(). However, I am not able to make it work with a negative lookbehind.
See here
df.filter(regex = 'one')
Out[8]:
good_one bad_one
0 1 1
1 2 2
2 3 3
but now
df.filter(regex = '(?<!bad).*')
Out[9]:
good_one bad_one
0 1 1
1 2 2
2 3 3
does not do anything. Am I missing something?
Thanks
Solution if you need to remove column names starting with bad:
df = pd.DataFrame({'good_one' : [1,2,3],
                   'not_bad_one' : [1,2,3],
                   'bad_one' : [1,2,3]})
#https://stackoverflow.com/a/5334825/2901002
df1 = df.filter(regex=r'^(?!bad).*$')
print (df1)
good_one not_bad_one
0 1 1
1 2 2
2 3 3
^ asserts position at start of a line
Negative Lookahead (?!bad)
Assert that the Regex below does not match
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Solution for removing all columns with a bad substring:
df2 = df.filter(regex=r'^(?!.*bad).*$')
print (df2)
good_one
0 1
1 2
2 3
^ asserts position at start of a line
Negative Lookahead (?!.*bad)
Assert that the Regex below does not match
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
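As to why the original (?<!bad).* attempt matched every column: at the very start of a name there is nothing behind the cursor, so the negative lookbehind trivially succeeds and .* then consumes the whole name, 'bad_one' included. A quick check with the re module (a small sketch, not part of the answer above):
import re
# The lookbehind succeeds at position 0 because nothing precedes it,
# so even 'bad_one' is matched in full.
print(re.search(r'(?<!bad).*', 'bad_one'))    # <re.Match object; span=(0, 7), match='bad_one'>
# An anchored negative lookahead checks the characters ahead of the start,
# which is what "does not start with bad" actually needs.
print(re.search(r'^(?!bad).*$', 'bad_one'))   # None
print(re.search(r'^(?!bad).*$', 'good_one'))  # <re.Match object; span=(0, 8), match='good_one'>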
Here is what I have:
import re
import pandas as pd
d = {'ID': [1, 2, 3, 4, 5], 'Desc': ['0*1***HHCM', 'HC:83*20', 'HC:5*2CASL', 'DM*72\nCAS*', 'HC:564*CAS*5']}
df = pd.DataFrame(data=d)
df
Output:
ID Desc
0 1 0*1***HHCM
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
I need to filter the dataframe by column "Desc", if it contains "CAS" or "HC" that are not surrounded by letters or digits.
Here is what I tried:
new_df = df[df['Desc'].str.match(r'[^A-Za-z0-9]CAS[^A-Za-z0-9]|[^A-Za-z0-9]HC[^A-Za-z0-9]') == True]
It returns an empty dataframe.
I want it to return the following:
ID Desc
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
Another thing: since 3rd row has "\nCas", where "\n" is a line separator, will it treat it as a letter before "CAS"?
Please help.
Try this:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=re.M)]
# If you don't want to import re you can just use flags=8:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=8)]
Result:
ID Desc
1 2 HC:83*20
2 3 HC:5*2CASL
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
To answer your other question: as long as the \n is passed correctly, it will be parsed as a newline character rather than as the letter 'n'. That is:
r'\n' -> '\\n' (a backslash character followed by an 'n' character)
'\n'  -> '\n'  (a single newline character)
For further explanation on the regex, please see Regex101 demo: https://regex101.com/r/FNBgPV/2
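Two side notes, not from the answer above: str.match only returned an empty frame because it anchors the pattern at the start of each string (and [^A-Za-z0-9] requires an actual character on each side), and the capturing groups in the pattern above make pandas emit a "match groups" UserWarning. A variant with non-capturing groups avoids that warning:
import re
import pandas as pd
d = {'ID': [1, 2, 3, 4, 5],
     'Desc': ['0*1***HHCM', 'HC:83*20', 'HC:5*2CASL', 'DM*72\nCAS*', 'HC:564*CAS*5']}
df = pd.DataFrame(data=d)
# Same logic as above, but with (?:...) groups so pandas does not warn
# about match groups being used only for filtering.
print(df.loc[df['Desc'].str.contains(r'(?:\W|^)(?:HC|CAS)(?:\W|$)', flags=re.M)])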
You can try this; it only checks the numbers and letters before CAS and HC, but you can easily modify it to check after them as well:
print(df[~df['Desc'].str.contains('([0-9a-zA-Z]+CAS*)|([0-9a-zA-Z]+HC*)', regex=True)])
ID Desc
1 2 HC:83*20
3 4 DM*72\nCAS*
4 5 HC:564*CAS*5
I am trying to split a pandas column repeatedly. I want to find the string in between two strings, indefinitely. For example, let's say I have the pandas column from the input below:
import numpy as np
import pandas as pd
data = np.array([["'abc'ad32kn'def'dfannasfl[]12a'ghi'"],
                 ["'jk'adf%#d1asn'lm'dfas923231sassda"],
                 ["'nop'ad&#*-0'qrs'd2&*#^#!!sda'tuv'dasdj_23'w'823a&#'xyz'adfa"]])
df = pd.DataFrame({'Practice Column': data.ravel()})
print(df)
I would then like to split these strings by the opening and closing quotes '...' and take what is inside, so that each quoted value ends up in its own column.
Can someone help me out?
Thanks.
Let's use extractall here:
df['Practice Column'].str.extractall(r"'(.*?)'").unstack(1)[0].fillna('')
match 0 1 2 3 4
0 abc def ghi
1 jk lm
2 nop qrs tuv w xyz
The pattern '(.*?)' finds all instances of strings within the single quotes. More info -
' # Match opening quote
( # Open capture group
.*? # Non-greedy match for anything
) # End of capture group
' # Match closing quote
To merge this back with df, you can either use join:
v = df.join(df['Practice Column']
            .str.extractall(r"'(.*?)'").unstack(1)[0].fillna(''))
Or, assign "Practice Column" back:
v = df['Practice Column'].str.extractall(r"'(.*?)'").unstack(1)[0].fillna('')
v.insert(0, 'Practice Column', df['Practice Column'])
print(v)
match Practice Column 0 1 2 3 4
0 'abc'ad32kn'def'dfannasfl[]12a'ghi' abc def ghi
1 'jk'adf%#d1asn'lm'dfas923231sassda jk lm
2 'nop'ad&#*-0'qrs'd2&*#^#!!sda'tuv'dasdj_23'w'8... nop qrs tuv w xyz
Another solution with a list comprehension (for performance).
import re
p = re.compile("'(.*?)'")
pd.DataFrame([
    p.findall(s) for s in df['Practice Column']]).fillna('')
0 1 2 3 4
0 abc def ghi
1 jk lm
2 nop qrs tuv w xyz
This won't work if there are NaNs, so here's a modified version of the solution above. You will need to drop the NaNs first.
pd.DataFrame([
    p.findall(s) for s in df['Practice Column'].dropna()]
).fillna('')
0 1 2 3 4
0 abc def ghi
1 jk lm
2 nop qrs tuv w xyz
I am trying to get a DataFrame from an existing DataFrame containing only the rows where values in a certain column (whose values are strings) do not contain a certain character.
i.e. if the character we don't want is a '('
Original dataframe:
some_col my_column
0 1 some
1 2 word
2 3 hello(
New dataframe:
some_col my_column
0 1 some
1 2 word
I have tried df.loc['(' not in df['my_column']], but this does not work since df['my_column'] is a Series object.
I have also tried: df.loc[not df.my_column.str.contains('(')], which also does not work.
You're looking for str.isalpha:
df[df.my_column.str.isalpha()]
some_col my_column
0 1 some
1 2 word
A similar method is str.isalnum, if you want to retain letters and digits.
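A quick illustration of the difference, using a hypothetical extra value 'w0rd' that is not in the original frame:
import pandas as pd
s = pd.Series(['some', 'word', 'hello(', 'w0rd'])   # 'w0rd' is made up for the demo
print(s[s.str.isalpha()].tolist())   # ['some', 'word']          -> letters only
print(s[s.str.isalnum()].tolist())   # ['some', 'word', 'w0rd']  -> letters and digits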
If you want to allow word characters (letters, digits, underscore) and whitespace, use
df[~df.my_column.str.contains(r'[^\w\s]')]
some_col my_column
0 1 some
1 2 word
Lastly, if you are looking to remove punctuation as a whole, I've written a Q&A here which might be a useful read: Fast punctuation removal with pandas
If you are looking to filter out just that character:
negation of str.contains
Escape the open paren. Some characters can be interpreted as special regex characters. You can escape them with a backslash.
df[~df.my_column.str.contains(r'\(')]
some_col my_column
0 1 some
1 2 word
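If you would rather not deal with regex escaping at all, str.contains also accepts regex=False for a plain literal match (an alternative, not part of the answer above):
import pandas as pd
df = pd.DataFrame({'some_col': [1, 2, 3], 'my_column': ['some', 'word', 'hello(']})
# Treat '(' as a literal character instead of escaping it in a regex.
print(df[~df.my_column.str.contains('(', regex=False)])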
str.match all non-open-paren
By the way, this is a bad idea! Using a regex to verify that the whole string contains no occurrence of a single character is overkill.
df[df.my_column.str.match(r'^[^\(]*$')]
some_col my_column
0 1 some
1 2 word
Comprehension using in
df[['(' not in x for x in df.my_column]]
some_col my_column
0 1 some
1 2 word
I want to replace the entire cell that contains the word 'Dividend' with blanks or NaN. However, when I try to replace, for example, '1.25 Dividend', it turns out as '1.25 NaN'. I want to return the whole cell as NaN. Any idea how to work on this?
Option 1
Use a regular expression in your replace
import numpy as np
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means that it will interpret the problem as a regular expression one. You still need an appropriate pattern. The '^' says to start at the beginning of the string. '^.*' matches all characters from the beginning of the string. '$' says to end the match with the end of the string. '.*$' matches all characters up to the end of the string. Finally, '^.*Dividend.*$' matches all characters from the beginning, has 'Dividend' somewhere in the middle, then any characters after it. Then replace this whole thing with np.nan
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
I pass a lambda to applymap that identifies whether a cell has 'Dividend' in it.
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
Replace all strings:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
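Applied to the example df above, this coerces every non-numeric cell to NaN, so it removes the 'Dividend' strings but would also blank out any other text; a small self-contained check:
import pandas as pd
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
# Every cell that cannot be parsed as a number becomes NaN.
print(df.apply(lambda x: pd.to_numeric(x, errors='coerce')))
#    0    1
# 0  1  NaN
# 1  3  4.0
# 2  5  NaN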
I would use applymap like this
import numpy as np
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)