Find and replace substrings in a Pandas dataframe, ignoring case - python

df.replace('Number', 'NewWord', regex=True)
How do I replace Number, number, or NUMBER with NewWord?

Same as you'd do with the standard regex, using the i flag.
df = df.replace('(?i)Number', 'NewWord', regex=True)
Granted, df.replace is limiting in the sense that flags must be passed as part of the regex string (rather than as separate arguments). If this were str.replace, you could have used case=False or flags=re.IGNORECASE.
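For completeness, a minimal sketch of that flags variant (the single-column frame here is just an assumption for illustration):
import re
import pandas as pd

df = pd.DataFrame({'col': ['a Number here', 'NUMBER two']})
# flags accepts standard re flags when the pattern is treated as a regex
df['col'] = df['col'].str.replace('number', 'NewWord', flags=re.IGNORECASE, regex=True)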

Simply use case=False in str.replace.
Example:
df = pd.DataFrame({'col':['this is a Number', 'and another NuMBer', 'number']})
>>> df
col
0 this is a Number
1 and another NuMBer
2 number
df['col'] = df['col'].str.replace('Number', 'NewWord', case=False)
>>> df
col
0 this is a NewWord
1 and another NewWord
2 NewWord
[Edit]: If you are looking for your substring in multiple columns, you can select all columns with object dtype and apply the above solution to them. Example:
>>> df
col col2 col3
0 this is a Number numbernumbernumber 1
1 and another NuMBer x 2
2 number y 3
str_columns = df.select_dtypes('object').columns
df[str_columns] = (df[str_columns]
.apply(lambda x: x.str.replace('Number', 'NewWord', case=False)))
>>> df
col col2 col3
0 this is a NewWord NewWordNewWordNewWord 1
1 and another NewWord x 2
2 NewWord y 3

Brutish. This only works if the whole string is either 'Number' or 'NUMBER'. It will not replace those within a larger string. And of course, it is limited to just those two words.
df.replace(['Number', 'NUMBER'], 'NewWord')
More Brute Force
If it wasn't obvious enough, this is far inferior to @coldspeed's answer
import re
df.applymap(lambda x: re.sub('number', 'NewWord', x, flags=re.IGNORECASE))
Or, with a cue from @coldspeed's answer:
df.applymap(lambda x: re.sub('(?i)number', 'NewWord', x))

This solution will work if the text you're looking to convert is in specific columns of your dataframe:
df['COL_n'] = df['COL_n'].str.lower()
df['COL_n'] = df['COL_n'].replace('number', 'NewWord', regex=True)
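One hedged design note: .str.lower() overwrites the original casing of the whole value. If only the matched word should change, a case-insensitive regex replace on that column keeps the rest of the text intact, for example:
import pandas as pd

df = pd.DataFrame({'COL_n': ['This has a Number', 'NUMBER two']})
# (?i) makes only this pattern case-insensitive; the other characters keep their case
df['COL_n'] = df['COL_n'].str.replace(r'(?i)number', 'NewWord', regex=True)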


Convert pandas series strings to numbers

The following series contains results as stringified lists whose values are either PASS or FAIL.
Input:
result
"['PASS','FAIL']"
"['PASS','FAIL','PASS','FAIL']"
"['FAIL','FAIL']"
Output:
result
1
1
0
If any row has at least one PASS value, return 1; otherwise return 0.
If the values are lists, use the in operator:
df['result'] = [int('PASS' in x) for x in df['result']]
#alternative solution
df['result'] = df['result'].apply(lambda x: 'PASS' in x).astype(int)
If they are strings, use Series.str.contains:
df['result'] = df['result'].str.contains('PASS').astype(int)
A simple and fast approach: use a regex with str.contains.
# if you want a robust check
df['result'] = df['result'].str.contains(r'\bPASS\b').astype(int)
# or if you're sure there are only PASS/FAIL
df['result'] = df['result'].str.contains('PASS').astype(int)
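If the values really are stringified lists (as shown in the question) and you want to work with actual lists, a hedged option is to parse them first with ast.literal_eval and then check membership:
import ast
import pandas as pd

df = pd.DataFrame({'result': ["['PASS','FAIL']",
                              "['PASS','FAIL','PASS','FAIL']",
                              "['FAIL','FAIL']"]})
# parse each stringified list into a real list, then flag rows containing 'PASS'
parsed = df['result'].apply(ast.literal_eval)
df['result'] = parsed.apply(lambda lst: int('PASS' in lst))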

How to extract pandas dataframe's rows whose indices are strings and contain a substring from a known list [duplicate]

I have a pandas DataFrame with a column of string values. I need to select rows based on partial string matches.
Something like this idiom:
re.search(pattern, cell_in_question)
returning a boolean. I am familiar with the syntax of df[df['A'] == "hello world"] but can't seem to find a way to do the same with a partial string match, say 'hello'.
Vectorized string methods (i.e. Series.str) let you do the following:
df[df['A'].str.contains("hello")]
This is available in pandas 0.8.1 and up.
I am using pandas 0.14.1 on macos in ipython notebook. I tried the proposed line above:
df[df["A"].str.contains("Hello|Britain")]
and got an error:
cannot index with vector containing NA / NaN values
but it worked perfectly when an "==True" condition was added, like this:
df[df['A'].str.contains("Hello|Britain")==True]
How do I select by partial string from a pandas DataFrame?
This post is meant for readers who want to
search for a substring in a string column (the simplest case) as in df1[df1['col'].str.contains(r'foo(?!$)')]
search for multiple substrings (similar to isin), e.g., with df4[df4['col'].str.contains(r'foo|baz')]
match a whole word from text (e.g., "blue" should match "the sky is blue" but not "bluejay"), e.g., with df3[df3['col'].str.contains(r'\bblue\b')]
match multiple whole words
Understand the reason behind "ValueError: cannot index with vector containing NA / NaN values" and correct it with str.contains('pattern', na=False)
...and would like to know more about what methods should be preferred over others.
(P.S.: I've seen a lot of questions on similar topics, I thought it would be good to leave this here.)
Friendly disclaimer: this post is long.
Basic Substring Search
# setup
df1 = pd.DataFrame({'col': ['foo', 'foobar', 'bar', 'baz']})
df1
col
0 foo
1 foobar
2 bar
3 baz
str.contains can be used to perform either substring searches or regex-based searches. The search defaults to regex-based unless you explicitly disable it.
Here is an example of regex-based search,
# find rows in `df1` which contain "foo" followed by something
df1[df1['col'].str.contains(r'foo(?!$)')]
col
1 foobar
Sometimes regex search is not required, so specify regex=False to disable it.
#select all rows containing "foo"
df1[df1['col'].str.contains('foo', regex=False)]
# same as df1[df1['col'].str.contains('foo')] but faster.
col
0 foo
1 foobar
Performance-wise, regex search is slower than substring search:
df2 = pd.concat([df1] * 1000, ignore_index=True)
%timeit df2[df2['col'].str.contains('foo')]
%timeit df2[df2['col'].str.contains('foo', regex=False)]
6.31 ms ± 126 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.8 ms ± 241 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Avoid using regex-based search if you don't need it.
Addressing ValueErrors
Sometimes, performing a substring search and filtering on the result will result in
ValueError: cannot index with vector containing NA / NaN values
This is usually because of mixed data or NaNs in your object column,
s = pd.Series(['foo', 'foobar', np.nan, 'bar', 'baz', 123])
s.str.contains('foo|bar')
0 True
1 True
2 NaN
3 True
4 False
5 NaN
dtype: object
s[s.str.contains('foo|bar')]
# ---------------------------------------------------------------------------
# ValueError Traceback (most recent call last)
Anything that is not a string cannot have string methods applied on it, so the result is NaN (naturally). In this case, specify na=False to ignore non-string data,
s.str.contains('foo|bar', na=False)
0 True
1 True
2 False
3 True
4 False
5 False
dtype: bool
How do I apply this to multiple columns at once?
The answer is in the question. Use DataFrame.apply:
# the default axis=0 applies the lambda to each column in turn
df.apply(lambda col: col.str.contains('foo|bar', na=False))
A B
0 True True
1 True False
2 False True
3 True False
4 False False
5 False False
All of the solutions below can be "applied" to multiple columns using the column-wise apply method (which is OK in my book, as long as you don't have too many columns).
If you have a DataFrame with mixed columns and want to select only the object/string columns, take a look at select_dtypes.
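A minimal sketch of that combination, using a hypothetical mixed-dtype frame:
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'foobar', 'baz'],
                   'B': ['bar', 'xyz', 'foo'],
                   'n': [1, 2, 3]})
# restrict the search to the object (string) columns, then apply column-wise
str_cols = df.select_dtypes(include='object').columns
mask = df[str_cols].apply(lambda col: col.str.contains('foo|bar', na=False))
df[mask.any(axis=1)]   # rows where any string column matched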
Multiple Substring Search
This is most easily achieved through a regex search using the regex OR pipe.
# Slightly modified example.
df4 = pd.DataFrame({'col': ['foo abc', 'foobar xyz', 'bar32', 'baz 45']})
df4
col
0 foo abc
1 foobar xyz
2 bar32
3 baz 45
df4[df4['col'].str.contains(r'foo|baz')]
col
0 foo abc
1 foobar xyz
3 baz 45
You can also create a list of terms, then join them:
terms = ['foo', 'baz']
df4[df4['col'].str.contains('|'.join(terms))]
col
0 foo abc
1 foobar xyz
3 baz 45
Sometimes, it is wise to escape your terms in case they have characters that can be interpreted as regex metacharacters. If your terms contain any of the following characters...
. ^ $ * + ? { } [ ] \ | ( )
Then, you'll need to use re.escape to escape them:
import re
df4[df4['col'].str.contains('|'.join(map(re.escape, terms)))]
col
0 foo abc
1 foobar xyz
3 baz 45
re.escape has the effect of escaping the special characters so they're treated literally.
re.escape(r'.foo^')
# '\\.foo\\^'
Matching Entire Word(s)
By default, the substring search searches for the specified substring/pattern regardless of whether it is full word or not. To only match full words, we will need to make use of regular expressions here—in particular, our pattern will need to specify word boundaries (\b).
For example,
df3 = pd.DataFrame({'col': ['the sky is blue', 'bluejay by the window']})
df3
col
0 the sky is blue
1 bluejay by the window
Now consider,
df3[df3['col'].str.contains('blue')]
col
0 the sky is blue
1 bluejay by the window
versus
df3[df3['col'].str.contains(r'\bblue\b')]
col
0 the sky is blue
Multiple Whole Word Search
Similar to the above, except we add a word boundary (\b) to the joined pattern.
p = r'\b(?:{})\b'.format('|'.join(map(re.escape, terms)))
df4[df4['col'].str.contains(p)]
col
0 foo abc
3 baz 45
Where p looks like this,
p
# '\\b(?:foo|baz)\\b'
A Great Alternative: Use List Comprehensions!
Because you can! And you should! They are usually a little bit faster than string methods, because string methods are hard to vectorise and usually have loopy implementations.
Instead of,
df1[df1['col'].str.contains('foo', regex=False)]
Use the in operator inside a list comp,
df1[['foo' in x for x in df1['col']]]
col
0 foo
1 foobar
Instead of,
regex_pattern = r'foo(?!$)'
df1[df1['col'].str.contains(regex_pattern)]
Use re.compile (to cache your regex) + Pattern.search inside a list comp,
p = re.compile(regex_pattern, flags=re.IGNORECASE)
df1[[bool(p.search(x)) for x in df1['col']]]
col
1 foobar
If "col" has NaNs, then instead of
df1[df1['col'].str.contains(regex_pattern, na=False)]
Use,
def try_search(p, x):
try:
return bool(p.search(x))
except TypeError:
return False
p = re.compile(regex_pattern)
df1[[try_search(p, x) for x in df1['col']]]
col
1 foobar
More Options for Partial String Matching: np.char.find, np.vectorize, DataFrame.query.
In addition to str.contains and list comprehensions, you can also use the following alternatives.
np.char.find
Supports substring searches (read: no regex) only.
df4[np.char.find(df4['col'].values.astype(str), 'foo') > -1]
col
0 foo abc
1 foobar xyz
np.vectorize
This is a wrapper around a loop, but with less overhead than most pandas str methods.
f = np.vectorize(lambda haystack, needle: needle in haystack)
f(df1['col'], 'foo')
# array([ True, True, False, False])
df1[f(df1['col'], 'foo')]
col
0 foo
1 foobar
Regex solutions possible:
regex_pattern = r'foo(?!$)'
p = re.compile(regex_pattern)
f = np.vectorize(lambda x: pd.notna(x) and bool(p.search(x)))
df1[f(df1['col'])]
col
1 foobar
DataFrame.query
Supports string methods through the python engine. This offers no visible performance benefits, but is nonetheless useful to know if you need to dynamically generate your queries.
df1.query('col.str.contains("foo")', engine='python')
col
0 foo
1 foobar
More information on query and eval family of methods can be found at Dynamically evaluate an expression from a formula in Pandas.
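For instance, a hedged sketch of building such a query dynamically (the column name and term here are just placeholder assumptions; df1 is the frame from above):
col, term = 'col', 'foo'                  # hypothetical column name and search term
query_str = f'{col}.str.contains(@term)'  # @term pulls the local variable into the query
df1.query(query_str, engine='python')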
Recommended Usage Precedence
(First) str.contains, for its simplicity and ease of handling NaNs and mixed data
List comprehensions, for their performance (especially if your data is purely strings)
np.vectorize
(Last) df.query
If anyone wonders how to solve a related problem, "Select column by partial string":
Use:
df.filter(like='hello') # select columns which contain the word hello
And to select rows by partial string matching, pass axis=0 to filter:
# selects rows which contain the word hello in their index label
df.filter(like='hello', axis=0)
Quick note: if you want to do selection based on a partial string contained in the index, try the following:
df['stridx']=df.index
df[df['stridx'].str.contains("Hello|Britain")]
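Alternatively (assuming the index holds strings), the same check can be done on the index directly, without a helper column:
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3]},
                  index=['Hello there', 'Britain calling', 'elsewhere'])
# the Index has its own .str accessor, so no extra column is needed
df[df.index.str.contains("Hello|Britain")]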
Should you need to do a case insensitive search for a string in a pandas dataframe column:
df[df['A'].str.contains("hello", case=False)]
Say you have the following DataFrame:
>>> df = pd.DataFrame([['hello', 'hello world'], ['abcd', 'defg']], columns=['a','b'])
>>> df
a b
0 hello hello world
1 abcd defg
You can always use the in operator in a lambda expression to create your filter.
>>> df.apply(lambda x: x['a'] in x['b'], axis=1)
0 True
1 False
dtype: bool
The trick here is to use the axis=1 option in the apply to pass elements to the lambda function row by row, as opposed to column by column.
You can try treating the values as strings first:
df[df['A'].astype(str).str.contains("Hello|Britain")]
Suppose we have a column named "ENTITY" in the dataframe df. We can filter df so that we keep only the rows whose "ENTITY" column does not contain "DM", using a mask as follows:
mask = df['ENTITY'].str.contains('DM')
df = df.loc[~(mask)].copy(deep=True)
Here's what I ended up doing for partial string matches. If anyone has a more efficient way of doing this please let me know.
import re
import pandas as pd

def stringSearchColumn_DataFrame(df, colName, regex):
    newdf = pd.DataFrame()
    # items() yields (index, value) pairs; iteritems() is deprecated
    for idx, record in df[colName].items():
        if re.search(regex, record):
            newdf = pd.concat([df[df[colName] == record], newdf], ignore_index=True)
    return newdf
Using contains didn't work well for my string with special characters. Find worked though.
df[df['A'].str.find("hello") != -1]
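If the special characters are the issue, another hedged option is to turn off regex matching so the pattern is taken literally:
import pandas as pd

df = pd.DataFrame({'A': ['say hello (world)', 'goodbye']})
# regex=False treats the pattern as a literal string, so the parentheses need no escaping
df[df['A'].str.contains("hello (world)", regex=False)]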
A more generalised example - if looking for parts of a word OR specific words in a string:
df = pd.DataFrame([('cat andhat', 1000.0), ('hat', 2000000.0), ('the small dog', 1000.0), ('fog', 330000.0),('pet', 330000.0)], columns=['col1', 'col2'])
Specific parts of sentence or word:
searchfor = '.*cat.*hat.*|.*the.*dog.*'
Create a column showing the affected rows (you can always filter them out as necessary):
df["TrueFalse"]=df['col1'].str.contains(searchfor, regex=True)
col1 col2 TrueFalse
0 cat andhat 1000.0 True
1 hat 2000000.0 False
2 the small dog 1000.0 True
3 fog 330000.0 False
4 pet 330000.0 False
Maybe you want to search for some text in all columns of the Pandas dataframe, and not just in the subset of them. In this case, the following code will help.
df[df.apply(lambda row: row.astype(str).str.contains('String To Find').any(), axis=1)]
Warning. This method is relatively slow, albeit convenient.
Somewhat similar to @cs95's answer, but here you don't need to specify an engine:
df.query('A.str.contains("hello").values')
There are earlier answers that accomplish the requested feature; anyway, I would like to show the most general way:
df.filter(regex=".*STRING_YOU_LOOK_FOR.*")
This lets you get the column you are looking for, however its name is written.
(Obviously, you have to write the proper regex for each case.)
My 2c worth:
I did the following:
sale_method = pd.DataFrame(model_data['Sale Method'].str.upper())
sale_method['sale_classification'] = np.where(
    sale_method['Sale Method'].isin(['PRIVATE']),
    'private',
    np.where(sale_method['Sale Method'].str.contains('AUCTION'),
             'auction',
             'other')
)

How can I efficiently and idiomatically filter rows of PandasDF based on multiple StringMethods on a single column?

I have a Pandas DataFrame df with many columns, of which one is:
col
---
abc:kk__LL-z12-1234-5678-kk__z
def:kk_A_LL-z12-1234-5678-kk_ss_z
abc:kk_AAA_LL-z12-5678-5678-keek_st_z
abc:kk_AA_LL-xx-xxs-4rt-z12-2345-5678-ek__x
...
I am trying to fetch all records where col starts with abc: and has the first -num- between '1234' and '2345' (inclusive; a string comparison works because the -num- parts are exactly 4 digits each).
In the case above, I'd return
col
---
abc:kk__LL-z12-1234-5678-kk__z
abc:kk_AA_LL-z12-2345-5678-ek__x
...
My current (working, I think) solution looks like:
df = df[df['col'].str.startswith('abc:')]
df = df[df['col'].str.extract(r'.*-(\d+)-(\d+)-.*')[0].ge('1234')]
df = df[df['col'].str.extract(r'.*-(\d+)-(\d+)-.*')[0].le('2345')]
What is a more idiomatic and efficient way to do this in Pandas?
Complex string operations are not as efficient as numeric calculations. So the following approach might be more efficient:
m1 = df['col'].str.startswith('abc')
m2 = pd.to_numeric(df['col'].str.split('-').str[2]).between(1234, 2345)
dfn = df[m1&m2]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
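One hedged caveat: pd.to_numeric raises if any row has a non-numeric token at that split position. Passing errors='coerce' turns those tokens into NaN, so such rows simply fail the between check instead of breaking the whole expression (keeping the answer's assumption that the first number is the third '-'-separated token):
m1 = df['col'].str.startswith('abc')
# errors='coerce' maps unparsable tokens to NaN, which fails .between()
m2 = pd.to_numeric(df['col'].str.split('-').str[2], errors='coerce').between(1234, 2345)
dfn = df[m1 & m2]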
One way would be to use a regexp and an apply function. I find it easier to play with the regexp in a separate function than to crowd the pandas expression.
import pandas as pd
import re
def filter_rows(string):
    z = re.match(r"abc:.*-(\d+)-(\d+)-.*", string)
    if z:
        return 1234 <= int(z.groups()[0]) <= 2345
    else:
        return False
Then use the defined function to select rows
df.loc[df['col'].apply(filter_rows)]
col
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x
Another play on regex:
# string starts with abc, greedy search,
# then look for either 1234- or 2345-,
# then a 4-digit number and whatever else after
pattern = r'(^abc.*(?<=1234-|2345-)\d{4}.*)'
df.col.str.extract(pattern).dropna()
0
0 abc:kk__LL-z12-1234-5678-kk__z
3 abc:kk_AA_LL-z12-2345-5678-ek__x

Select two pandas columns that have the same digit in the column name

I am cleaning up some data and the raw dataset has entries as ['Field1.1', 'Field2.1', 'Field1.2', 'Field2.2']. For the dataset, either 'Field1' XOR 'Field2' will have a non-empty string. I'd like to create a single field 'Field.1' that will extract the non-empty string from 'Field1.1' XOR 'Field2.1' and place it in 'Field.1'. Similarly, I'd like to do this for 'Field1.2' and 'Field2.2' as 'Field.2'.
I am not sure how to select matching fields, i.e. 'X.1' with 'Y.1' and 'X.2' with 'Y.2', in order to do this.
My logic is that once I can select the correct pairs I can simply use a concat statement to add them and thereby extract the non-empty string for later use. If this logic is incorrect, or there is a better way that does not rely on concatenation to extract the non-empty string, please let me know.
If the logic is sound, please explain how I might do this extraction given the indexing problem.
To be clearer, I want to go from:
df = pd.DataFrame({'field1.1': ['string1',''], 'field2.1':['','string2'],
'field1.2': ['string3',''], 'field2.2':['','string4']})
df
Out[1]:
  field1.1 field2.1 field1.2 field2.2
0  string1           string3
1           string2            string4
to:
df2 = pd.DataFrame({'field.1': ['string1','string3'], 'field.2':['string2','string4']})
df2
Out[2]:
field.1 field.2
0 string1 string2
1 string3 string4
You can use wide_to_long, bfill, and then pivot back:
(pd.wide_to_long(df.where(df.ne('')).reset_index(),
stubnames=['Field1','Field2'],
i='index',
j='group',
sep='.')
.bfill(axis=1)
.reset_index()
.pivot(values='Field1',index='index',columns='group')
)
Sample data:
df = pd.DataFrame([
['a','','b',''],
['c','','','d'],
['','e','','f'],
['','g','h','']],
columns=['Field1.1', 'Field2.1', 'Field1.2', 'Field2.2'])
group 1 2
index
0 a b
1 c d
2 e f
3 g h
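As a follow-up, here is a minimal sketch of the pair-wise concatenation idea from the question itself, under the assumption that exactly one column of each pair is non-empty per row (column names as in the question):
import pandas as pd

df = pd.DataFrame({'field1.1': ['string1', ''], 'field2.1': ['', 'string2'],
                   'field1.2': ['string3', ''], 'field2.2': ['', 'string4']})
# exactly one column of each pair is non-empty, so concatenating the pair
# recovers the non-empty value for each suffix group
suffixes = sorted({c.split('.')[-1] for c in df.columns})
out = pd.DataFrame({f'field.{j}': df[f'field1.{j}'] + df[f'field2.{j}']
                    for j in suffixes})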

Is there a way in pandas to filter rows in a column that are contained in a string

There is a way to check if the string in the column contains another string:
df["column"].str.contains("mystring")
But I'm looking for the opposite, the column string to be contained in another string, without doing an apply function which I guess is slower than the vectorised .contains:
df["column"].apply(lambda x: x in "mystring", axis=1)
Update with data:
mystring = "abc"
df = pd.DataFrame({"A": ["ab", "az"]})
df
A
0 ab
1 az
I would like to show only "ab" because it is contained in mystring.
Only one option (jpp had it) - iterate with a list comprehension:
df[[r in mystring for r in df.A]]
A
0 ab
Or,
pd.DataFrame([r for r in df.A if r in mystring], columns=['A'])
A
0 ab
pd.Series([i in mystring for i in df.A])
Output:
0 True
1 False
dtype: bool
