Ignoring NaNs with str.contains - python

I want to find rows that contain a string, like so:
DF[DF.col.str.contains("foo")]
However, this fails because some elements are NaN:
ValueError: cannot index with vector containing NA / NaN values
So I resort to the obfuscated
DF[DF.col.notnull()][DF.col.dropna().str.contains("foo")]
Is there a better way?

There's a flag for that:
In [11]: df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])
In [12]: df.a.str.contains("foo")
Out[12]:
0 True
1 True
2 False
3 NaN
Name: a, dtype: object
In [13]: df.a.str.contains("foo", na=False)
Out[13]:
0 True
1 True
2 False
3 False
Name: a, dtype: bool
See the str.contains docs:
na : default NaN, fill value for missing values.
So you can do the following:
In [21]: df.loc[df.a.str.contains("foo", na=False)]
Out[21]:
a
0 foo1
1 foo2

In addition to the answers above, for a column whose name is not a single word (so attribute access like df.col won't work), you may use:
df[df['Product ID'].str.contains("foo") == True]
Hope this helps.

df[df.col.str.contains("foo").fillna(False)]

I'm not 100% sure why (I actually came here to search for the answer), but this also works, and it doesn't require replacing all the NaN values.
import pandas as pd
import numpy as np
df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])
newdf = df.loc[df['a'].str.contains('foo') == True]
Works with or without .loc.
I have no idea why this works; as I understand it, when you index with brackets pandas evaluates whatever is inside them as either True or False, so I can't tell why making the expression inside the brackets 'extra boolean' has any effect at all.
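A plausible explanation (my own reading, not from the original post): str.contains returns NaN for missing entries, and the comparison == True maps those NaN entries to False, so the mask becomes a plain boolean Series that pandas can index with. A minimal sketch:
import pandas as pd
import numpy as np

df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])

mask = df['a'].str.contains('foo')    # object dtype: True, True, False, NaN
coerced = mask == True                # NaN == True evaluates to False
print(coerced.dtype)                  # bool
print(df[coerced])                    # only the foo1 and foo2 rows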

You can also use query method to query the columns of a DataFrame with a boolean expression as follows:
df.query("a.str.contains('foo', na=False)")
Or more interestingly, which I think is more readable:
df.query("a.str.contains('foo')==True")
Note you might not get performance improvement, but it is more readable (arguably).
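One caveat worth noting (from experience, not part of the original answer): depending on your pandas version and whether numexpr is installed, .str methods inside query may require the Python engine:
import pandas as pd
import numpy as np

df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])

# engine='python' avoids the numexpr engine, which may not support .str methods
print(df.query("a.str.contains('foo', na=False)", engine="python"))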

You can also pass a regex pattern:
DF[DF.col.str.contains(pat='(foo)', regex=True)]
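Presumably you would still want to combine the pattern with na=False so missing values drop out of the mask; a sketch under that assumption:
import pandas as pd
import numpy as np

DF = pd.DataFrame({'col': ['foo bar', 'baz', np.nan]})

# regex alternation plus na=False so NaN rows become False in the mask
print(DF[DF.col.str.contains(pat='foo|baz', regex=True, na=False)])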

Related

Find cell with value LIKE string in Pandas [duplicate]

This works (using Pandas 12 dev)
table2=table[table['SUBDIVISION'] =='INVERNESS']
Then I realized I needed to select the field using "starts with", since I was missing a bunch of rows.
So, per the pandas docs, as near as I could follow, I tried
criteria = table['SUBDIVISION'].map(lambda x: x.startswith('INVERNESS'))
table2 = table[criteria]
And got AttributeError: 'float' object has no attribute 'startswith'
So I tried an alternate syntax with the same result
table[[x.startswith('INVERNESS') for x in table['SUBDIVISION']]]
Reference http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
Section 4: List comprehensions and map method of Series can also be used to produce more complex criteria:
What am I missing?
You can use the vectorised str.startswith Series method to get more consistent results:
In [11]: s = pd.Series(['a', 'ab', 'c', 11, np.nan])
In [12]: s
Out[12]:
0 a
1 ab
2 c
3 11
4 NaN
dtype: object
In [13]: s.str.startswith('a', na=False)
Out[13]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
and the boolean indexing will work just fine (I prefer to use loc, but it works just the same without):
In [14]: s.loc[s.str.startswith('a', na=False)]
Out[14]:
0 a
1 ab
dtype: object
It looks like at least one of the elements in your Series/column is a float, which doesn't have a startswith method, hence the AttributeError; the list comprehension should raise the same error...
To retrieve all the rows that start with the required string:
dataFrameOut = dataFrame[dataFrame['column name'].str.match('string')]
To retrieve all the rows that contain the required string:
dataFrameOut = dataFrame[dataFrame['column name'].str.contains('string')]
Using startswith for a particular column value
df = df.loc[df["SUBDIVISION"].str.startswith('INVERNESS', na=False)]
You can use apply to easily apply any string matching function to your column elementwise.
table2=table[table['SUBDIVISION'].apply(lambda x: x.startswith('INVERNESS'))]
This assumes that your "SUBDIVISION" column is of the correct type (string).
Edit: fixed missing parenthesis
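If the column can contain NaN (which is a float), one option is to guard the lambda so that non-strings simply map to False; this is my own sketch, not part of the original answer:
import pandas as pd
import numpy as np

table = pd.DataFrame({'SUBDIVISION': ['INVERNESS A', 'BROADMOOR', np.nan]})

# only call startswith on actual strings; NaN maps to False
table2 = table[table['SUBDIVISION'].apply(
    lambda x: isinstance(x, str) and x.startswith('INVERNESS'))]
print(table2)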
This can also be achieved using query:
table.query('SUBDIVISION.str.startswith("INVERNESS").values')

Lazy evaluate Pandas dataframe filters

I'm observing a behavior that's weird to me. Can anyone tell me how I can define a filter once and re-use it throughout my code?
>>> df = pd.DataFrame([1,2,3], columns=['A'])
>>> my_filter = df.A == 2
>>> df.loc[1] = 5
>>> df[my_filter]
A
1 5
I expect my_filter to return an empty result, since none of the values in column A are equal to 2 any more.
I'm thinking about making a function that returns the filter and re-using that, but is there a more pythonic, as well as pandas-idiomatic, way of doing this?
def get_my_filter(df):
    return df.A == 2

df[get_my_filter(df)]
# change df
df[get_my_filter(df)]
Masks are not dynamic; they keep the values they had at the moment you defined them. So if you still need to change the dataframe value, you should swap lines 2 and 3 (change the dataframe first, then build the mask). That would work.
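In other words (my reading of the suggestion), modify the dataframe first and build the mask afterwards, so it reflects the current values:
import pandas as pd

df = pd.DataFrame([1, 2, 3], columns=['A'])
df.loc[1] = 5              # change the data first
my_filter = df.A == 2      # then build the mask
print(df[my_filter])       # empty, as expected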
You applied the filter before changing the data, so changing a value in the row afterwards won't help.
df = pd.DataFrame([1,2,3], columns=['A'])
my_filter = df.A == 2
print(my_filter)
'''
0    False
1     True
2    False
Name: A, dtype: bool
'''
As you can see, it returns a Series. If you change the data after this point, the mask will not update, because it represents the first version of the df. But you can define the filter as a string instead: you can achieve what you want by evaluating that string filter inside the eval() function.
df = pd.DataFrame([1,2,3], columns=['A'])
my_filter = 'df.A == 2'
df.loc[1] = 5
df[eval(my_filter)]
'''
Out[205]:
Empty DataFrame
Columns: [A]
Index: []
'''
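The function idea from the question works just as well, as long as you pass the (possibly modified) DataFrame in each time it is evaluated; a small sketch:
import pandas as pd

def get_my_filter(df):
    return df.A == 2

df = pd.DataFrame([1, 2, 3], columns=['A'])
df.loc[1] = 5
print(df[get_my_filter(df)])   # re-evaluated against the current data: empty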

Numpy/Pandas Series begins with operator? Does it exist?

I am trying to create a series in my dataframe (sdbfile) whose values are based on several nested conditional statements using elements from sdbfile dataframe. The series reins_code is filled with string values.
The statement below works; however, I need to test whether 'reins_code' begins with 'R' rather than equals a specific 'R#':
sdbfile['product'] = np.where(sdbfile.reins_code == 'R2', 'HiredPlant','Trad')
It doesn't like the string function startswith() since it's a pandas Series?
Can anybody help please? I have waded through the documentation but cannot see a reference to this problem...
Use the vectorised str.startswith to return a boolean mask:
In [6]:
df = pd.DataFrame({'a':['R1asda','R2asdsa','foo']})
df
Out[6]:
a
0 R1asda
1 R2asdsa
2 foo
In [8]:
df['a'].str.startswith('R2')
Out[8]:
0 False
1 True
2 False
Name: a, dtype: bool
In [9]:
df[df['a'].str.startswith('R2')]
Out[9]:
a
1 R2asdsa
Use the pandas str attribute. http://pandas.pydata.org/pandas-docs/stable/text.html
Series and Index are equipped with a set of string processing methods
that make it easy to operate on each element of the array. Perhaps
most importantly, these methods exclude missing/NA values
automatically. These are accessed via the str attribute and generally
have names matching the equivalent (scalar) built-in string methods:
sdbfile['product'] = np.where(sdbfile.reins_code.str[0] == 'R', 'HiredPlant','Trad')
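Combining the two answers (a sketch of my own, assuming sdbfile has a string reins_code column): the vectorised str.startswith also works directly inside np.where and copes with prefixes longer than one character:
import pandas as pd
import numpy as np

sdbfile = pd.DataFrame({'reins_code': ['R2', 'R15', 'T9']})

# na=False guards against missing values in reins_code
sdbfile['product'] = np.where(
    sdbfile.reins_code.str.startswith('R', na=False), 'HiredPlant', 'Trad')
print(sdbfile)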
