I want to replace every cell that contains the word circled in the picture ('Dividend' in the examples below) with a blank or NaN. However, when I try to replace, for example, '1.25 Dividend', it comes out as '1.25 NaN'. I want the whole cell returned as NaN. Any idea how to do this?
Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
(Using regex=True) means the replacement is interpreted as a regular expression problem. You still need an appropriate pattern. '^' anchors the match at the beginning of the string, so '^.*' matches all characters from the start. '$' anchors the match at the end, so '.*$' matches all characters up to the end. Putting it together, '^.*Dividend.*$' matches a whole string that has 'Dividend' somewhere in it, and that entire match is then replaced with np.nan.
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
Pass a lambda to applymap that flags whether a cell has 'Dividend' in it:
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
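Side note: in recent pandas versions (2.1 and later) applymap is deprecated in favour of DataFrame.map. A sketch of the same idea with the newer API (assuming pandas >= 2.1):
# identical element-wise check, just using DataFrame.map instead of applymap
df.mask(df.map(lambda s: 'Dividend' in s if isinstance(s, str) else False))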
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN
To replace all string cells (every non-numeric cell becomes NaN):
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
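A minimal self-contained sketch of that coercion approach, using the same toy frame as above (note it blanks every string cell, not only the ones containing 'Dividend'):
import pandas as pd

df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
# errors='coerce' turns anything that cannot be parsed as a number into NaN
print(df.apply(lambda x: pd.to_numeric(x, errors='coerce')))
#    0    1
# 0  1  NaN
# 1  3  4.0
# 2  5  NaN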
I would use applymap like this (note that 'NaN' here is a literal string, not a missing value)
df.applymap(lambda x: 'NaN' if (type(x) is str and 'Dividend' in x) else x)
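A one-line variant of the same idea (assuming numpy is imported as np) that produces a real missing value rather than the literal string 'NaN':
df.applymap(lambda x: np.nan if isinstance(x, str) and 'Dividend' in x else x)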
Related
Consider this simple example
import pandas as pd
df = pd.DataFrame({'good_one' : [1,2,3],
'bad_one' : [1,2,3]})
Out[7]:
good_one bad_one
0 1 1
1 2 2
2 3 3
In this artificial example I would like to filter the columns that DO NOT start with bad. I can use a regex condition on the pandas columns using .filter(). However, I am not able to make it work with a negative lookbehind.
See here
df.filter(regex = 'one')
Out[8]:
good_one bad_one
0 1 1
1 2 2
2 3 3
but now
df.filter(regex = '(?<!bad).*')
Out[9]:
good_one bad_one
0 1 1
1 2 2
2 3 3
does not do anything. Am I missing something?
Thanks
Solution if you need to remove columns whose names start with bad:
df = pd.DataFrame({'good_one' : [1,2,3],
'not_bad_one' : [1,2,3],
'bad_one' : [1,2,3]})
#https://stackoverflow.com/a/5334825/2901002
df1 = df.filter(regex=r'^(?!bad).*$')
print (df1)
good_one not_bad_one
0 1 1
1 2 2
2 3 3
^ asserts position at the start of the string
(?!bad) is a negative lookahead: assert that the characters immediately following the start are not 'bad'
.* matches any character, between zero and unlimited times, as many times as possible (greedy)
$ asserts position at the end of the string
Solution to remove all columns containing the substring bad:
df2 = df.filter(regex=r'^(?!.*bad).*$')
print (df2)
good_one
0 1
1 2
2 3
^ asserts position at the start of the string
(?!.*bad) is a negative lookahead: assert that 'bad' does not occur anywhere in the rest of the string (the leading .* lets it appear at any position)
.* matches any character, between zero and unlimited times, as many times as possible (greedy)
$ asserts position at the end of the string
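For completeness, a short sketch of why the original (?<!bad).* attempt kept every column: filter searches each column name for the pattern, and at position 0 of 'bad_one' nothing precedes the cursor, so the negative lookbehind trivially succeeds and .* matches the whole name. The lookahead version rejects it:
import re

# the lookbehind only inspects characters *before* the current position; at the
# start of the string there are none, so the assertion passes and .* matches
print(re.search(r'(?<!bad).*', 'bad_one'))      # matches 'bad_one'
print(re.search(r'^(?!.*bad).*$', 'bad_one'))   # None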
I have a dataframe and I want to change some elements of a column based on a condition.
In particular given this column:
... VALUE ....
0
"1076A"
12
9
"KKK0139"
5
I want to obtain this:
... VALUE ....
0
"1076A"
12
9
"0139"
5
The 'VALUE' column contains both strings and numbers. When I find a particular substring in a string value, I want to keep the same value without that substring.
I have tried:
1) df['VALUE'] = np.where(df['VALUE'].str.contains('KKK', na=False), df['VALUE'].str[3:], df['VALUE'])
2) df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df['VALUE'].str[3:]
But both attempts return an IndexError: invalid index to scalar variable.
Any advice?
As the column contains both numeric (non-string) and string values, you cannot use .str.replace(), which handles strings only and converts non-string elements to NaN. Use .replace() instead.
Here, you can use:
df['VALUE'] = df['VALUE'].replace(r'KKK', '', regex=True)
Input:
data = {'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]}
df = pd.DataFrame(data)
Result:
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you use .str.replace() instead, you will get the following. Note the NaN results for the numeric (non-string) values:
0 NaN
1 1076A
2 NaN
3 NaN
4 0139
5 NaN
Name: VALUE, dtype: object
In general, if you want to remove a leading alphabetic substring, you can use:
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
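For instance, a quick sketch on the sample data: the anchored pattern strips the 'KKK' prefix but leaves purely numeric cells and '1076A' (whose letters are not leading) untouched:
import pandas as pd

df = pd.DataFrame({'VALUE': [0, '1076A', 12, 9, 'KKK0139', 5]})
# ^[A-Za-z]+ only matches letters at the very start of a string value
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
print(df['VALUE'].tolist())   # [0, '1076A', 12, 9, '0139', 5]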
Alternatively, if every value in the column were a string, .str.replace() alone would give:
>>> df['VALUE'].str.replace(r'KKK', '')
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
Your second solution fails because you also need to apply the row selector to the right side of your assignment.
df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'].str[3:]
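A slightly tidier sketch of that fix, computing the boolean mask once (same df as in the question):
mask = df['VALUE'].str.contains('KKK', na=False)
# apply the row selector on both sides so the two pieces line up
df.loc[mask, 'VALUE'] = df.loc[mask, 'VALUE'].str[3:]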
Looking at your sample data, if 'K' is the only problem, just replace it with an empty string:
df['VALUE'].str.replace('K', '')
0 0
1 "1076A"
2 12
3 9
4 "0139"
5 5
Name: text, dtype: object
If you want to do it only for specific occurrences or positions of 'K', you can do that as well, as in the sketch below.
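For example, a small sketch that strips only a leading run of K's (using .replace with regex=True rather than .str.replace, so non-string rows survive); a K anywhere else in the value is left alone:
# ^K+ anchors the match to the beginning of the string
df['VALUE'] = df['VALUE'].replace(r'^K+', '', regex=True)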
I have data similar to the following:
df = pd.DataFrame({'pagePath':['/my/retour/details/n8hWu7iWtuRXzSvDvCAUZRAlPda6LM/',
'/my/orders/details/151726/',
'/my/retours/retourmethod/']})
print(df)
pagePath
0 /my/retour/details/n8hWu7iWtuRXzSvDvCAUZRAlPda...
1 /my/orders/details/151726/
2 /my/retours/retourmethod/
What I want to do is cut the string off right after 'details/' (keeping everything up to and including it).
Expected output
pagePath
0 /my/retour/details/
1 /my/orders/details/
2 /my/retours/retourmethod/
The following works, but it's slow:
df['pagePath'] = np.where(df.pagePath.str.contains('details'),
df.pagePath.apply(lambda x: x[0:x.find('details')+8]),
df.pagePath)
print(df)
pagePath
0 /my/retour/details/
1 /my/orders/details/
2 /my/retours/retourmethod/
I tried regex, but could only get it to work excluding 'details':
df['pagePath'] = np.where(df.pagePath.str.contains('details'),
df.pagePath.str.extract('(.+?(?=details))'),
df.pagePath)
print(df)
pagePath
0 /my/retour/
1 /my/orders/
2 NaN
Plus, the regex version returns NaN when the row does not contain 'details'.
So I feel there's an easier and more elegant way to do this. How would I write a regex to solve my problem? Or is my solution already sufficient?
All you need to do is provide a fallback alternative in the regex for when there is no 'details': the first branch (.+?details/?) matches lazily up to and including 'details' plus an optional slash, and the second branch (.*) falls back to the whole string:
>>> df.pagePath.str.extract('(.+?details/?|.*)')
0
0 /my/retour/details/
1 /my/orders/details/
2 /my/retours/retourmethod/
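Put together as a self-contained sketch (using the sample frame from the question), with the result written back to the column:
import pandas as pd

df = pd.DataFrame({'pagePath': ['/my/retour/details/n8hWu7iWtuRXzSvDvCAUZRAlPda6LM/',
                                '/my/orders/details/151726/',
                                '/my/retours/retourmethod/']})
# first branch: lazily match up to and including 'details' plus an optional '/';
# second branch: fall back to the whole string when 'details' is absent
df['pagePath'] = df.pagePath.str.extract(r'(.+?details/?|.*)')[0]
print(df)
#                     pagePath
# 0        /my/retour/details/
# 1        /my/orders/details/
# 2  /my/retours/retourmethod/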
You could also try str.extract combined with fillna:
('/'+df.pagePath.str.extract('/(.*)details')+'details')[0].fillna(df.pagePath)
Out[130]:
0 /my/retour/details
1 /my/orders/details
2 /my/retours/retourmethod/
Name: 0, dtype: object
I have a pandas dataframe column (Data Type) which I want to split into three columns
target_table_df = LoadS_A [['Attribute Name',
'Data Type',
'Primary Key Indicator']]
Example input (target_table_df)
Attribute Name Data Type Primary Key Indicator
0 ACC_LIM DECIMAL(18,4) False
1 ACC_NO NUMBER(11,0) False
2 ACC_OPEN_DT DATE False
3 ACCB DECIMAL(18,4) False
4 ACDB DECIMAL(18,4) False
5 AGRMNT_ID NUMBER(11,0) True
6 BRNCH_NUM NUMBER(11,0) False
7 CLRD_BAL DECIMAL(18,4) False
8 CR_INT_ACRD_GRSS DECIMAL(18,4) False
9 CR_INT_ACRD_NET DECIMAL(18,4) False
I aim to:
Reassign 'Data Type' to the text preceding the parenthesis
[..if a parenthesis exists in 'Data Type']:
Create a new column 'Precision' and assign it the first comma-separated value
Create a new column 'Scale' and assign it the second comma-separated value
Intended output would therefore become:
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date
3 decimal 18 4
4 decimal 18 4
5 number 11 0
I have tried in anger to achieve this, but I'm new to dataframes and can't work out whether I should iterate over all rows or whether there is a way to apply this to all values in the dataframe.
Any help much appreciated
Use target_table_df['Data Type'].str.extract(pattern)
You'll need to assign pattern to be a regular expression that captures each of the components you're looking for.
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
([^\(]+) says grab as many non-open-parenthesis characters as you can, up to the first open parenthesis.
\(([^,]*) says to grab the first set of non-comma characters after an open parenthesis and stop at the comma.
,(.*)\) says to grab the rest of the characters between the comma and the close parenthesis.
(\(([^,]*),(.*)\))? says the whole parenthesis thing may not even happen, grab it if you can.
Solution
everything together looks like this:
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
s = target_table_df['Data Type']
df = s.str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]
# Formatting to get it how you wanted
df.columns = ['Data Type', 'Precision', 'Scale']
df.index.name = None
print(df)
I put a .iloc[:, [0, 2, 3]] at the end because the pattern I used grabs the whole parenthesis in column 1 and I wanted to skip it. Leave it off and see.
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date NaN NaN
3 decimal 18 4
4 decimal 18 4
5 number 11 0
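A variant sketch using named capture groups, which avoids the positional .iloc step (the group names here are my own, and the data-type text is kept as-is rather than lower-cased):
import pandas as pd

s = pd.Series(['DECIMAL(18,4)', 'NUMBER(11,0)', 'DATE'])
# named groups become the column names of the extracted frame directly
pattern = r'(?P<DataType>[^\(]+)(?:\((?P<Precision>[^,]*),(?P<Scale>.*)\))?'
print(s.str.extract(pattern, expand=True))
#   DataType Precision Scale
# 0  DECIMAL        18     4
# 1   NUMBER        11     0
# 2     DATE       NaN   NaN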
I have a sparse pandas DataFrame/Series with values that look like variations of "AB1234:12, CD5678:34, EF3456:56". Something to the effect of
"AB1234:12, CD5678:34, EF3456:56"
"AB1234:12, CD5678:34"
NaN
"GH5678:34, EF3456:56"
"OH56:34"
Which I'd like to convert into
["AB1234","CD5678", "EF3456"]
["AB1234","CD5678"]
NaN
["GH5678","EF3456"]
["OH56"]
This kind of "double delineation" has been proving difficult. I know we can A = df["columnName"].str.split(",") however I've run across a couple of problems including that .split(", ") doesnt seem to work and '.split(",")' leaves whitespace. Also, that iterating through the generated A and splitting seems to be interpreting my new lists as 'floats'. Although that last one might be a technical difficulty with ipython - I'm trying to work out that problem as well.
Is there a way to split on two types of separators instead of just one? If not, how do you loop over the inner lists?
//Edit: changed the apostrophes to commas - that was just my dyslexia kicking in
You nearly had it; note that you can use a regular expression to split more generally:
In [11]: s2
Out[11]:
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 NaN
3 GH5678:34, EF3456:56
4 OH56:34
dtype: object
In [12]: s2.str.split(", ")
Out[12]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
In [13]: s2.str.split(r"\s*,\s*")
Out[13]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
This version also removes any spaces before or after the comma.
Here is your DataFrame
>>> df
A
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 None
3 GH5678:34, EF3456:56
4 OH56:34
And now I use split and replace to split by ', ' and remove all ':'
>>> df.A = [i.replace(':','').split(", ") if isinstance(i,str) else i for i in df.A]
>>> df.A
0 [AB123412, CD567834, EF345656]
1 [AB123412, CD567834]
2 None
3 [GH567834, EF345656]
4 [OH5634]
Name: A
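If the goal is the exact output from the question (the IDs only, with the ':NN' suffixes dropped), here is one more sketch - it assumes every token is uppercase letters followed by digits, which is my reading of the sample data rather than something stated explicitly:
import pandas as pd

s = pd.Series(['AB1234:12, CD5678:34, EF3456:56',
               'AB1234:12, CD5678:34',
               None,
               'GH5678:34, EF3456:56',
               'OH56:34'])
# findall collects every letters+digits token into a list per row; missing rows stay NaN
print(s.str.findall(r'[A-Z]+\d+'))
# 0    [AB1234, CD5678, EF3456]
# 1            [AB1234, CD5678]
# 2                         NaN
# 3            [GH5678, EF3456]
# 4                      [OH56]
# dtype: object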