How to replace an entire cell with NaN on pandas DataFrame - python

I want to replace the entire cell that contains the word circled in the picture with blanks or NaN. However, when I try to replace, for example, '1.25 Dividend', it turns into '1.25 NaN'. I want the whole cell to become NaN. Any idea how to do this?

Option 1
Use a regular expression in your replace
df.replace('^.*Dividend.*$', np.nan, regex=True)
From comments
Passing regex=True means the pattern is interpreted as a regular expression. You still need an appropriate pattern: '^' anchors the match at the beginning of the string, so '^.*' matches all characters from the start; '$' anchors it at the end, so '.*$' matches all characters up to the end. Altogether, '^.*Dividend.*$' matches any string with 'Dividend' somewhere in it (any characters before, any characters after), and the whole match is replaced with np.nan.
Consider the dataframe df
df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
df
0 1
0 1 2 Dividend
1 3 4
2 5 6 Dividend
then the proposed solution yields
0 1
0 1 NaN
1 3 4.0
2 5 NaN
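For copy-paste, a self-contained sketch of Option 1 using the same sample frame:
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
# Any cell whose string contains 'Dividend' becomes NaN; other cells are untouched.
cleaned = df.replace('^.*Dividend.*$', np.nan, regex=True)
print(cleaned)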
Option 2
Another alternative is to use pd.DataFrame.mask in conjunction with applymap.
Pass a lambda to applymap that identifies whether a cell contains 'Dividend':
df.mask(df.applymap(lambda s: 'Dividend' in s if isinstance(s, str) else False))
0 1
0 1 NaN
1 3 4
2 5 NaN
Option 3
Similar in concept but using stack/unstack + pd.Series.str.contains
df.mask(df.stack().astype(str).str.contains('Dividend').unstack())
0 1
0 1 NaN
1 3 4
2 5 NaN

To replace all strings (not just the cells containing 'Dividend') with NaN, coerce everything to numeric:
df.apply(lambda x: pd.to_numeric(x, errors='coerce'))
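For example, applied to the sample frame from Option 1 (a sketch; note it also wipes out strings that do not contain 'Dividend'):
import pandas as pd

df = pd.DataFrame([[1, '2 Dividend'], [3, 4], [5, '6 Dividend']])
# errors='coerce' turns anything that cannot be parsed as a number into NaN
print(df.apply(lambda x: pd.to_numeric(x, errors='coerce')))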

I would use applymap like this
df.applymap(lambda x: np.nan if (isinstance(x, str) and 'Dividend' in x) else x)

Related

negative lookbehind when filtering pandas columns

Consider this simple example
import pandas as pd
df = pd.DataFrame({'good_one' : [1,2,3],
                   'bad_one' : [1,2,3]})
Out[7]:
good_one bad_one
0 1 1
1 2 2
2 3 3
In this artificial example I would like to filter the columns that DO NOT start with bad. I can use a regex condition on the pandas columns using .filter(). However, I am not able to make it work with a negative lookbehind.
See here
df.filter(regex = 'one')
Out[8]:
good_one bad_one
0 1 1
1 2 2
2 3 3
but now
df.filter(regex = '(?<!bad).*')
Out[9]:
good_one bad_one
0 1 1
1 2 2
2 3 3
does not do anything. Am I missing something?
Thanks
Solution if you need to remove column names starting with bad:
df = pd.DataFrame({'good_one' : [1,2,3],
                   'not_bad_one' : [1,2,3],
                   'bad_one' : [1,2,3]})
#https://stackoverflow.com/a/5334825/2901002
df1 = df.filter(regex=r'^(?!bad).*$')
print (df1)
good_one not_bad_one
0 1 1
1 2 2
2 3 3
^ asserts position at start of a line
Negative Lookahead (?!bad)
Assert that the Regex below does not match
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
Solution to remove all columns containing the substring bad:
df2 = df.filter(regex=r'^(?!.*bad).*$')
print (df2)
good_one
0 1
1 2
2 3
^ asserts position at start of a line
Negative Lookahead (?!.*bad)
Assert that the Regex below does not match
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
bad matches the characters bad literally
. matches any character
* matches the previous token between zero and unlimited times, as many times as possible, giving back as needed (greedy)
$ asserts position at the end of a line
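For completeness, the reason the plain lookbehind pattern matched every column: df.filter(regex=...) keeps a label whenever re.search finds a match anywhere in it, and a negative lookbehind at the very start of a string always succeeds because nothing precedes it. A minimal sketch:
import re

# The negative lookbehind succeeds at position 0 (nothing precedes the start
# of the string), so '.*' matches the whole name and 'bad_one' is kept.
print(re.search(r'(?<!bad).*', 'bad_one').group())    # bad_one

# The anchored negative lookahead rejects names that start with 'bad'.
print(re.search(r'^(?!bad).*$', 'bad_one'))           # None
print(re.search(r'^(?!bad).*$', 'good_one').group())  # good_one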

Manipulate Dataframe Series

I have a dataframe and I want to change some element of a column based on a condition.
In particular given this column:
... VALUE ....
0
"1076A"
12
9
"KKK0139"
5
I want to obtain this:
... VALUE ....
0
"1076A"
12
9
"0139"
5
In the 'VALUE' column there are both strings and numbers; when I find a particular substring in a string value, I want to keep the same value without that substring.
I have tried:
1) df['VALUE'] = np.where(df['VALUE'].str.contains('KKK', na=False), df['VALUE'].str[3:], df['VALUE'])
2) df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df['VALUE'].str[3:]
But both attempts return an IndexError: invalid index to scalar variable.
Any advice?
As the column contains both numeric (non-string) and string values, you cannot use .str.replace(), which handles strings only; use .replace() instead. Otherwise, the non-string elements are converted to NaN by str.replace().
Here, you can use:
df['VALUE'] = df['VALUE'].replace(r'KKK', '', regex=True)
Input:
data = {'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]}
df = pd.DataFrame(data)
Result:
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
If you use .str.replace(), you will get:
Note that the numeric (non-string) values come out as NaN:
0 NaN
1 1076A
2 NaN
3 NaN
4 0139
5 NaN
Name: VALUE, dtype: object
In general, if you want to remove a leading alphabetic substring, you can use:
df['VALUE'] = df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
>>> df['VALUE'].replace(r'^[A-Za-z]+', '', regex=True)
0 0
1 1076A
2 12
3 9
4 0139
5 5
Name: VALUE, dtype: object
Your second solution fails because you also need to apply the row selector to the right side of your assignment.
df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'] = df.loc[df['VALUE'].str.contains('KKK', na=False), 'VALUE'].str[3:]
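A sketch of the corrected assignment, computing the mask once so both sides stay aligned:
import pandas as pd

df = pd.DataFrame({'VALUE': [0, "1076A", 12, 9, "KKK0139", 5]})
mask = df['VALUE'].str.contains('KKK', na=False)
# Apply the same row selector on both sides of the assignment
df.loc[mask, 'VALUE'] = df.loc[mask, 'VALUE'].str[3:]
print(df['VALUE'].tolist())   # [0, '1076A', 12, 9, '0139', 5]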
Looking at your sample data, if 'K' is the only problem character, just replace it with an empty string:
df['VALUE'].str.replace('K', '')
0 0
1 "1076A"
2 12
3 9
4 "0139"
5 5
Name: VALUE, dtype: object
If you want to do it only for specific occurrences or positions of 'K', you can do that as well.
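For instance, assuming the column holds strings, str.replace takes an n argument that limits the number of replacements (a sketch):
# n=1 removes only the first 'K' in each string value
df['VALUE'].str.replace('K', '', n=1)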

Cut string in dataframe column until certain string but including

I have data similar to the following:
df = pd.DataFrame({'pagePath':['/my/retour/details/n8hWu7iWtuRXzSvDvCAUZRAlPda6LM/',
                               '/my/orders/details/151726/',
                               '/my/retours/retourmethod/']})
print(df)
pagePath
0 /my/retour/details/n8hWu7iWtuRXzSvDvCAUZRAlPda...
1 /my/orders/details/151726/
2 /my/retours/retourmethod/
What I want to do is cut the string up to and including 'details'.
Expected output
pagePath
0 /my/retour/details/
1 /my/orders/details/
2 /my/retours/retourmethod/
The following works, but it's slow:
df['pagePath'] = np.where(df.pagePath.str.contains('details'),
                          df.pagePath.apply(lambda x: x[0:x.find('details')+8]),
                          df.pagePath)
print(df)
pagePath
0 /my/retour/details/
1 /my/orders/details/
2 /my/retours/retourmethod/
I tried regex, but could only get it to work excluding 'details':
df['pagePath'] = np.where(df.pagePath.str.contains('details'),
                          df.pagePath.str.extract('(.+?(?=details))'),
                          df.pagePath)
print(df)
pagePath
0 /my/retour/
1 /my/orders/
2 NaN
Also, the regex approach returns NaN when the row does not contain 'details'.
So I feel there's an easier and more elegant way to do this. How would I write a regex to solve my problem? Or is my solution already sufficient?
All you need to do is provide a fallback in the regex for when there is no 'details':
>>> df.pagePath.str.extract('(.+?details/?|.*)')
0
0 /my/retour/details/
1 /my/orders/details/
2 /my/retours/retourmethod/
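To write the result back into the column, a minimal sketch (with a single capture group, expand=False returns a Series instead of a DataFrame):
df['pagePath'] = df.pagePath.str.extract(r'(.+?details/?|.*)', expand=False)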
You could also try str.extract:
('/'+df.pagePath.str.extract('/(.*)details')+'details')[0].fillna(df.pagePath)
Out[130]:
0 /my/retour/details
1 /my/orders/details
2 /my/retours/retourmethod/
Name: 0, dtype: object

Python parse dataframe element

I have a pandas dataframe column (Data Type) which I want to split into three columns
target_table_df = LoadS_A[['Attribute Name',
                           'Data Type',
                           'Primary Key Indicator']]
Example input (target_table_df)
Attribute Name Data Type Primary Key Indicator
0 ACC_LIM DECIMAL(18,4) False
1 ACC_NO NUMBER(11,0) False
2 ACC_OPEN_DT DATE False
3 ACCB DECIMAL(18,4) False
4 ACDB DECIMAL(18,4) False
5 AGRMNT_ID NUMBER(11,0) True
6 BRNCH_NUM NUMBER(11,0) False
7 CLRD_BAL DECIMAL(18,4) False
8 CR_INT_ACRD_GRSS DECIMAL(18,4) False
9 CR_INT_ACRD_NET DECIMAL(18,4) False
I aim to:
Reassign 'Data Type' to the text preceding the parenthesis
[..if a parenthesis exists in 'Data Type']:
Create a new column 'Precision' and assign it the first comma-separated value
Create a new column 'Scale' and assign it the second comma-separated value
Intended output would therefore become:
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date
3 decimal 18 4
4 decimal 18 4
5 number 11 0
I have tried in anger to achieve this, but I'm new to dataframes and can't work out whether I should iterate over all rows or whether there is a way to apply this to all values in the dataframe.
Any help much appreciated
Use target_table_df['Data Type'].str.extract(pattern)
You'll need to assign pattern to be a regular expression that captures each of the components you're looking for.
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
([^\(]+) says to grab as many non-open-parenthesis characters as you can, up to the first open parenthesis.
\(([^,]*) says to grab the first set of non-comma characters after an open parenthesis and stop at the comma.
,(.*)\) says to grab the rest of the characters between the comma and the close parenthesis.
(\(([^,]*),(.*)\))? says the whole parenthesized part may not even be there; grab it if you can.
Solution
Everything together looks like this:
pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
s = target_table_df['Data Type']
df = s.str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]
# Formatting to get it how you wanted
df.columns = ['Data Type', 'Precision', 'Scale']
df.index.name = None
print(df)
I put .iloc[:, [0, 2, 3]] at the end because the pattern captures the whole parenthesized part in column 1 and I wanted to skip it. Leave it off and see.
Data Type Precision Scale
0 decimal 18 4
1 number 11 0
2 date NaN NaN
3 decimal 18 4
4 decimal 18 4
5 number 11 0
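Put together as a self-contained sketch (the sample Series here is a hypothetical stand-in for target_table_df['Data Type']; the .str.lower() call is only needed if you want the lower-case type names shown above):
import pandas as pd

s = pd.Series(['DECIMAL(18,4)', 'NUMBER(11,0)', 'DATE'])

pattern = r'([^\(]+)(\(([^,]*),(.*)\))?'
out = s.str.extract(pattern, expand=True).iloc[:, [0, 2, 3]]
out.columns = ['Data Type', 'Precision', 'Scale']
# Lower-case the type name if you want 'decimal' rather than 'DECIMAL'
out['Data Type'] = out['Data Type'].str.lower()
print(out)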

Delineate twice through a dataframe in pandas

I have a sparse pandas DataFrame/Series with values that look like variations of "AB1234:12, CD5678:34, EF3456:56". Something to the effect of
"AB1234:12, CD5678:34, EF3456:56"
"AB1234:12, CD5678:34"
NaN
"GH5678:34, EF3456:56"
"OH56:34"
Which I'd like to convert into
["AB1234","CD5678", "EF3456"]
["AB1234","CD5678"]
NaN
["GH5678","EF3456"]
["OH56"]
This kind of "double delineation" has been proving difficult. I know we can A = df["columnName"].str.split(",") however I've run across a couple of problems including that .split(", ") doesnt seem to work and '.split(",")' leaves whitespace. Also, that iterating through the generated A and splitting seems to be interpreting my new lists as 'floats'. Although that last one might be a technical difficulty with ipython - I'm trying to work out that problem as well.
Is there a way to delineate on two types of separators - instead of just one? If not, how do you perform the loop to iterate over the inner list?
//Edit: changed the apostrophes to commas - that was just my dyslexia kicking in
You nearly had it, note you can use a regular expression to split more generally:
In [11]: s2
Out[11]:
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 NaN
3 GH5678:34, EF3456:56
4 OH56:34
dtype: object
In [12]: s2.str.split(", '")
Out[12]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
In [13]: s2.str.split("\s*,\s*'")
Out[13]:
0 [AB1234:12, CD5678:34, EF3456:56]
1 [AB1234:12, CD5678:34]
2 NaN
3 [GH5678:34, EF3456:56]
4 [OH56:34]
dtype: object
Where this removes any spaces before or after a comma.
Here is your DataFrame
>>> df
A
0 AB1234:12, CD5678:34, EF3456:56
1 AB1234:12, CD5678:34
2 None
3 GH5678:34, EF3456:56
4 OH56:34
And now I use split and replace to split by ', ' and remove all ':'
>>> df.A = [i.replace(':','').split(", ") if isinstance(i,str) else i for i in df.A]
>>> df.A
0 [AB123412, CD567834, EF345656]
1 [AB123412, CD567834]
2 None
3 [GH567834, EF345656]
4 [OH5634]
Name: A
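For the exact output asked for in the question (dropping the ':NN' suffix entirely), a minimal sketch that combines both split steps, assuming the codes themselves never contain commas or colons:
import numpy as np
import pandas as pd

s = pd.Series(["AB1234:12, CD5678:34, EF3456:56",
               "AB1234:12, CD5678:34",
               np.nan,
               "GH5678:34, EF3456:56",
               "OH56:34"])

# First split on commas (with optional surrounding whitespace), then keep
# only the part before the colon in each piece; NaN rows pass through.
out = s.str.split(r"\s*,\s*").apply(
    lambda parts: [p.split(":")[0] for p in parts] if isinstance(parts, list) else parts)
print(out)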
