negative lookbehind when filtering pandas columns - python

Consider this simple example:
import pandas as pd
df = pd.DataFrame({'good_one': [1, 2, 3],
                   'bad_one': [1, 2, 3]})
Out[7]:
   good_one  bad_one
0         1        1
1         2        2
2         3        3
In this artificial example I would like to filter the columns that DO NOT start with bad. I can use a regex condition on the pandas columns using .filter(). However, I am not able to make it work with a negative lookbehind.
See here:
df.filter(regex='one')
Out[8]:
   good_one  bad_one
0         1        1
1         2        2
2         3        3
but now
df.filter(regex='(?<!bad).*')
Out[9]:
   good_one  bad_one
0         1        1
1         2        2
2         3        3
does not do anything. Am I missing something?
Thanks

Solution if you need to remove column names starting with bad. The lookbehind attempt matches everything because .filter uses re.search and (?<!bad) succeeds at position 0 of every name (there is nothing behind the start of the string), so a start-anchored negative lookahead is needed instead:
df = pd.DataFrame({'good_one': [1, 2, 3],
                   'not_bad_one': [1, 2, 3],
                   'bad_one': [1, 2, 3]})
# https://stackoverflow.com/a/5334825/2901002
df1 = df.filter(regex=r'^(?!bad).*$')
print(df1)
   good_one  not_bad_one
0         1            1
1         2            2
2         3            3
^ asserts position at start of a line
Negative Lookahead (?!bad)
Assert that the Regex below does not match
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Solution to remove all columns containing the bad substring:
df2 = df.filter(regex=r'^(?!.*bad).*$')
print(df2)
   good_one
0         1
1         2
2         3
^ asserts position at start of a line
Negative Lookahead (?!.*bad)
Assert that the Regex below does not match
. matches any character
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
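As a non-regex alternative (a sketch, not part of the original answer), the same filtering can be done directly on the column index with vectorized string methods:
import pandas as pd

df = pd.DataFrame({'good_one': [1, 2, 3],
                   'not_bad_one': [1, 2, 3],
                   'bad_one': [1, 2, 3]})

# keep columns whose names do not start with 'bad'
df1 = df.loc[:, ~df.columns.str.startswith('bad')]

# keep columns whose names do not contain 'bad' anywhere
df2 = df.loc[:, ~df.columns.str.contains('bad')]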

Related

Python - count successive leading digits on a pandas row string without counting non successive digits

I need to create a new column that counts the number of leading 0s; however, I am getting errors trying to do so.
I extracted data from Mongo based on the following regex [\^0[0]*[1-9][0-9]*\] and saved it to a CSV file. This is all "Sequences" that start with a 0.
df['Sequence'].str.count('0')
and
df['Sequence'].str.count('0[0]*[1-9][0-9]')
give the results below. As you can see, both of the count calls also count non-leading 0s, i.e. simply the total number of 0s.
    Sequence  0s
0  012312312   1
1  024624624   1
2  036901357   2
3  002486248   2
4  045074305   3
5  080666140   3
I also tried writing it using loops, which worked when testing, but when using it on the data frame I encounter the following: IndexError: string index out of range
results = []
count = 0
index = 0
for item in df['Sequence']:
    count = 0
    index = 0
    while (item[index] == "0"):
        count = count + 1
        index = index + 1
    results.append(count)
df['0s'] = results
df
In short: if I can get 2 for the string 001230 instead of 3, I could save the results in a column to do my stats on.
You can use extract with the ^(0*) regex to match only the leading zeros. Then use str.len to get the length.
df['0s'] = df['sequence'].str.extract('^(0*)', expand=False).str.len()
Example input:
df = pd.DataFrame({'sequence': ['12040', '01230', '00010', '00120']})
Output:
  sequence  0s
0    12040   0
1    01230   1
2    00010   3
3    00120   2
You can use this regex:
'^0+'
The ^ means: match only if the pattern starts at the beginning of the string.
The + means: match the preceding token if it occurs at least once, i.e. one or more times.
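A minimal sketch of wiring this pattern into pandas (my addition, not from the original answer); note that 0+ finds no match when a string has no leading zeros, so the resulting NaNs need filling:
df['0s'] = (df['sequence'].str.extract('(^0+)', expand=False)
              .str.len()
              .fillna(0)
              .astype(int))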
IIUC, you want to count the number of leading 0s, right? Take advantage of the fact that leading 0s disappear when a numeric string is converted from str to int. Here's one solution:
df['leading 0s'] = df['Sequence'].str.len() - df['Sequence'].astype(int).astype(str).str.len()
Output:
    Sequence  leading 0s
0  012312312           1
1  024624624           1
2  036901357           1
3  002486248           2
4  045074305           1
5  080666140           1
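As a plain-Python illustration of the trick (a sketch, not part of the original answer):
s = '002486248'
# int() drops leading zeros: int('002486248') -> 2486248
print(len(s) - len(str(int(s))))  # 2
# caveat: an all-zero string like '000' collapses to '0',
# so this approach would report 2 rather than 3 for it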
Try str.findall:
df['0s'] = df['Sequence'].str.findall('^0*').str[0].str.len()
print(df)
# Output:
    Sequence  0s
0  012312312   1
1  024624624   1
2  036901357   1
3  002486248   2
4  045074305   1
5  080666140   1

Remove leading zeroes pandas

For example, I have a data frame like this:
import pandas as pd
nums = {'amount': ['0324', 'S123', '0010', None, '0030', 'SA40', 'SA24']}
df = pd.DataFrame(nums)
And I need to remove all leading zeroes and replace Nones with zeros.
I did it with loops, but for large frames it is not fast enough.
I'd like to rewrite it using vectorized operations.
You can try str.replace:
df['amount'].str.replace(r'^(0+)', '', regex=True).fillna('0')
0     324
1    S123
2      10
3       0
4      30
5    SA40
6    SA24
Name: amount, dtype: object
df['amount'] = df['amount'].str.lstrip('0').fillna(value='0')
There is already a nice answer from @Epsi95; still, you can also try a character set with regex:
>>> df['amount'].str.replace(r'^[0]*', '', regex=True).fillna('0')
0     324
1    S123
2      10
3       0
4      30
5    SA40
6    SA24
Explanation:
^[0]*
^ asserts position at start of a line
Match a single character present in the list below [0]
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
Step by step:
Remove all leading zeros:
Use str.lstrip, which returns a copy of the string with leading characters removed (based on the string argument passed).
Here,
df['amount'] = df['amount'].str.lstrip('0')
For more, see https://www.programiz.com/python-programming/methods/string/lstrip
Replace None with zeros:
Use fillna, which works with other missing values than None as well.
Here,
df['amount'].fillna(value='0')
And for more: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.fillna.html
Result in one line:
df['amount'] = df['amount'].str.lstrip('0').fillna(value='0')
If you need to ensure a single 0 or the last 0 is not removed, you can use:
df['amount'] = df['amount'].str.replace(r'^(0+)(?!$)', '', regex=True).fillna('0')
The (?!$) ensures the matched substring (the leading zeroes) does not include the last character, effectively keeping the final 0.
Demo
Input Data
nums = {'amount': ['0324', 'S123', '0010', None, '0030', 'SA40', 'SA24', '0', '000']}
df = pd.DataFrame(nums)
  amount
0   0324
1   S123
2   0010
3   None
4   0030
5   SA40
6   SA24
7      0   <== Added a single 0 here
8    000   <== Added a sequence of all 0's here
Output
print(df)
  amount
0    324
1   S123
2     10
3      0
4     30
5   SA40
6   SA24
7      0   <== Single 0 is not removed
8      0   <== Last 0 is kept

How to filter pd.Dataframe based on strings and special characters?

Here is what I have:
import re
import pandas as pd
d = {'ID': [1, 2, 3, 4, 5], 'Desc': ['0*1***HHCM', 'HC:83*20', 'HC:5*2CASL', 'DM*72\nCAS*', 'HC:564*CAS*5']}
df = pd.DataFrame(data=d)
df
Output:
   ID          Desc
0   1    0*1***HHCM
1   2      HC:83*20
2   3    HC:5*2CASL
3   4   DM*72\nCAS*
4   5  HC:564*CAS*5
I need to filter the dataframe by column "Desc", if it contains "CAS" or "HC" that are not surrounded by letters or digits.
Here is what I tried:
new_df = df[df['Desc'].str.match(r'[^A-Za-z0-9]CAS[^A-Za-z0-9]|[^A-Za-z0-9]HC[^A-Za-z0-9]') == True]
It returns an empty dataframe.
I want it to return the following:
   ID          Desc
1   2      HC:83*20
2   3    HC:5*2CASL
3   4   DM*72\nCAS*
4   5  HC:564*CAS*5
Another thing: since row 3 has "\nCAS", where "\n" is a line separator, will it be treated as a letter before "CAS"?
Please help.
Try this:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=re.M)]
# If you don't want to import re you can just use flags=8:
df.loc[df['Desc'].str.contains(r'(\W|^)(HC|CAS)(\W|$)', flags=8)]
Result:
   ID          Desc
1   2      HC:83*20
2   3    HC:5*2CASL
3   4   DM*72\nCAS*
4   5  HC:564*CAS*5
To answer your other question: as long as \n is passed correctly, it will be parsed as a newline character instead of an alphanumeric character n. i.e.:
r'\n' -> '\\n' (backslash character + n character)
'\n' -> '\n' (newline character)
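A quick interpreter check of the difference (illustrative sketch):
print(len(r'\n'))  # 2 -> backslash + n, two separate characters
print(len('\n'))   # 1 -> a single newline character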
For further explanation on the regex, please see Regex101 demo: https://regex101.com/r/FNBgPV/2
You can try this; it checks only for numbers and letters before CAS and HC, but you can easily modify it to check after them as well:
print(df[~df['Desc'].str.contains('([0-9a-zA-Z]+CAS*)|([0-9a-zA-Z]+HC*)', regex=True)])
   ID          Desc
1   2      HC:83*20
3   4   DM*72\nCAS*
4   5  HC:564*CAS*5

Pandas split on character and remove trailing values

I am trying to remove residual data that is identified with a '[' while keeping the first value.
import pandas as pd
df = pd.DataFrame({'foo': ['a', 'b[b7', 'c']})
print(df)
becomes:
    foo
0     a
1  b[b7
2     c
I would like to have:
  foo
0   a
1   b
2   c
Any recommendations?
df.foo = df.foo.str[0]
df
Out[212]:
  foo
0   a
1   b
2   c
I assume you're looking for str.split + str[0]:
df
      foo
0    test
1  foo[b7
2    ba[r
df.foo.str.split('[').str[0]
0    test
1     foo
2      ba
Name: foo, dtype: object
import pandas as pd
df = pd.DataFrame({'foo': ['a', 'b[b7', 'c']})
df["foo"] = df["foo"].str.replace(r"(\[.*)", "", regex=True)
Here is the https://regex101.com/ explanation
1st Capturing Group (\[.*)
\[ matches the character [ literally (case sensitive)
.* matches any character (except for line terminators)
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
What this means is that it will look for a [. If it finds one, it will remove the [ and all characters after it.
import pandas as pd
df = pd.DataFrame({'foo': [x.split('[')[0] for x in ['a', 'b[b7', 'c']]})
print(df)

How to replace an entire cell with NaN on pandas DataFrame

I want to replace the entire cell that contains a given word (here, 'Dividend') with blanks or NaN. However, when I try to replace for example '1.25 Dividend', it turns out as '1.25 NaN'. I want to return the whole cell as NaN. Any idea how to work on this?
Option 1
Use a regular expression in your replace
import numpy as np
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means that it will interpret the problem as a regular expression one. You still need an appropriate pattern. The '^' says to start at the beginning of the string. '^.*' matches all characters from the beginning of the string. '$' says to end the match with the end of the string. '.*$' matches all characters up to the end of the string. Finally, '^.*Dividend.*$' matches all characters from the beginning, has 'Dividend' somewhere in the middle, then any characters after it. Then replace this whole thing with np.nan
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
   0           1
0  1  2 Dividend
1  3           4
2  5  6 Dividend
then the proposed solution yields
   0    1
0  1  NaN
1  3  4.0
2  5  NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
Pass a lambda to applymap that identifies whether a cell has 'Dividend' in it:
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
   0    1
0  1  NaN
1  3    4
2  5  NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
   0    1
0  1  NaN
1  3    4
2  5  NaN
Replace all strings (coercing every cell that is not parseable as a number to NaN):
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
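Applied to the example frame above, a sketch of the expected behavior:
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
print(df.apply(lambda x: pd.to_numeric(x, errors='coerce')))
#    0    1
# 0  1  NaN
# 1  3  4.0
# 2  5  NaN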
I would use applymap like this
df.applymap(lambda x: 'NaN' if (type(x) is str and 'Dividend' in x) else x)
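Note that this last snippet inserts the literal string 'NaN' rather than a true missing value; a variant using np.nan (my adjustment, not part of the original answer):
import numpy as np
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)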
