pandas select from Dataframe using startswith - python

This works (using Pandas 12 dev)
table2=table[table['SUBDIVISION'] =='INVERNESS']
Then I realized I needed to select the field using "starts with", since I was missing a bunch of rows.
So per the Pandas doc as near as I could follow I tried
criteria = table['SUBDIVISION'].map(lambda x: x.startswith('INVERNESS'))
table2 = table[criteria]
And got AttributeError: 'float' object has no attribute 'startswith'
So I tried an alternate syntax with the same result
table[[x.startswith('INVERNESS') for x in table['SUBDIVISION']]]
Reference http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
Section 4: List comprehensions and map method of Series can also be used to produce more complex criteria:
What am I missing?

You can use the str.startswith Series method to give more consistent results:
In [11]: s = pd.Series(['a', 'ab', 'c', 11, np.nan])
In [12]: s
Out[12]:
0 a
1 ab
2 c
3 11
4 NaN
dtype: object
In [13]: s.str.startswith('a', na=False)
Out[13]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
and the boolean indexing will work just fine (I prefer to use loc, but it works just the same without):
In [14]: s.loc[s.str.startswith('a', na=False)]
Out[14]:
0 a
1 ab
dtype: object
It looks like at least one of the elements in your Series/column is a float (most likely NaN), which doesn't have a startswith method, hence the AttributeError. The list comprehension raises the same error for the same reason.
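As a rough illustration of the point above, the sketch below rebuilds the failure on a small made-up table (the rows are invented; only the SUBDIVISION column name comes from the question). It assumes the missing entry is stored as NaN, which is a float, and shows that the vectorised accessor with na=False avoids the error.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in the column, stored as NaN.
table = pd.DataFrame(
    {"SUBDIVISION": ["INVERNESS", "INVERNESS POINT", "OTHER", np.nan]}
)

# The missing entry comes back as a float, which has no startswith method.
missing = table["SUBDIVISION"].iloc[3]

try:
    table["SUBDIVISION"].map(lambda x: x.startswith("INVERNESS"))
    map_failed = False
except AttributeError:
    map_failed = True

# The vectorised accessor handles the missing value; na=False maps it to False.
mask = table["SUBDIVISION"].str.startswith("INVERNESS", na=False)
table2 = table[mask]
print(table2)
```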

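To retrieve all the rows that start with the required string (note that str.match treats its argument as a regular expression anchored at the start of each value):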
dataFrameOut = dataFrame[dataFrame['column name'].str.match('string')]
To retrieve all the rows that contain the required string anywhere in the value:
dataFrameOut = dataFrame[dataFrame['column name'].str.contains('string')]
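The difference between these methods is easiest to see when the search text contains a regex metacharacter. This is a small sketch on invented values, not taken from the answer above; it assumes you want literal matching and uses re.escape or regex=False to get it.

```python
import re

import pandas as pd

s = pd.Series(["a.b", "axb", "b.a"])

# str.match anchors a regular expression at the start of each value,
# so the "." in the pattern matches any character.
regex_mask = s.str.match("a.")

# str.startswith compares literal text only.
literal_mask = s.str.startswith("a.")

# Escaping the pattern makes str.match behave like a literal prefix test.
escaped_mask = s.str.match(re.escape("a."))

# str.contains searches anywhere in the value; regex=False keeps it literal.
contains_mask = s.str.contains(".a", regex=False)
```

If the prefix is plain text, str.startswith is usually the least surprising choice, because no escaping is needed.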

Using startswith for a particular column value
df = df.loc[df["SUBDIVISION"].str.startswith('INVERNESS', na=False)]

You can use apply to easily apply any string matching function to your column elementwise.
table2=table[table['SUBDIVISION'].apply(lambda x: x.startswith('INVERNESS'))]
This assumes that every value in your "SUBDIVISION" column is a string. If the column contains NaN or numbers, the lambda raises the same AttributeError as in the question.
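If you do want to stay with apply, one possible way to make it tolerate non-string values is to check the type first. This is only a sketch: the helper name starts_with_inverness and the sample rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with a missing value and a number mixed in.
table = pd.DataFrame({"SUBDIVISION": ["INVERNESS EAST", "GLEN", np.nan, 42]})


def starts_with_inverness(value):
    """Return True only for strings that begin with 'INVERNESS'."""
    return isinstance(value, str) and value.startswith("INVERNESS")


# apply calls the helper once per element and returns a boolean Series.
table2 = table[table["SUBDIVISION"].apply(starts_with_inverness)]
print(table2)
```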

This can also be achieved using query:
table.query('SUBDIVISION.str.startswith("INVERNESS").values')
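A variant of the query approach that also copes with missing values might look like the following. The data is invented, and passing engine="python" is an assumption on my part: it is meant to keep the expression evaluated by plain Python rather than numexpr, and may not be required on your pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical data; only the column name comes from the question.
table = pd.DataFrame(
    {"SUBDIVISION": ["INVERNESS", "INVERNESS HILLS", "GLEN", np.nan]}
)

# na=False turns the missing value into False so the mask is purely boolean.
result = table.query(
    'SUBDIVISION.str.startswith("INVERNESS", na=False)', engine="python"
)
print(result)
```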

Related

Numpy/Pandas Series begins with operator? Does it exist?

I am trying to create a series in my dataframe (sdbfile) whose values are based on several nested conditional statements using elements from sdbfile dataframe. The series reins_code is filled with string values.
The statement below works, but I need it to test whether 'reins_code' begins with 'R' rather than whether it equals a specific 'R#' value:
sdbfile['product'] = np.where(sdbfile.reins_code == 'R2', 'HiredPlant','Trad')
It doesn't like the string function startswith(), presumably because reins_code is a Series rather than a single string?
Can anybody help, please? I have waded through the documentation but cannot see a reference to this problem.
Use the vectorised str.startswith to return a boolean mask:
In [6]:
df = pd.DataFrame({'a':['R1asda','R2asdsa','foo']})
df
Out[6]:
a
0 R1asda
1 R2asdsa
2 foo
In [8]:
df['a'].str.startswith('R2')
Out[8]:
0 False
1 True
2 False
Name: a, dtype: bool
In [9]:
df[df['a'].str.startswith('R2')]
Out[9]:
a
1 R2asdsa
Use the pandas str attribute. http://pandas.pydata.org/pandas-docs/stable/text.html
Series and Index are equipped with a set of string processing methods
that make it easy to operate on each element of the array. Perhaps
most importantly, these methods exclude missing/NA values
automatically. These are accessed via the str attribute and generally
have names matching the equivalent (scalar) built-in string methods:
sdbfile['product'] = np.where(sdbfile.reins_code.str[0] == 'R', 'HiredPlant','Trad')
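Putting the two answers together, a minimal sketch of the np.where call with a vectorised prefix test could look like this. The sample codes are invented, and na=False is added on the assumption that missing codes should fall into the 'Trad' branch.

```python
import numpy as np
import pandas as pd

# Hypothetical reins_code values, including a missing one.
sdbfile = pd.DataFrame({"reins_code": ["R2", "R10", "T1", np.nan]})

# Boolean mask: True where the code starts with "R", False otherwise
# (missing values become False because of na=False).
is_r = sdbfile["reins_code"].str.startswith("R", na=False)

sdbfile["product"] = np.where(is_r, "HiredPlant", "Trad")
print(sdbfile)
```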

Extracting just a string element from a pandas dataframe

Okay, so say I have a pandas dataframe x, and I'm interested in extracting a value from it:
> x.loc[bar==foo]['variable_im_interested_in']
Let's say that returns the following, of type pandas.core.series.Series:
24 Boss
Name: ep_wb_ph_brand, dtype: object
But all I want is the string 'Boss'. Wrapping the first line of code in str() doesn't help either, I just get:
'24 Boss\nName: ep_wb_ph_brand, dtype: object'
How do I just extract the string?
Based on your comments, this code is returning a length-1 pandas Series:
x.loc[bar==foo]['variable_im_interested_in']
If you assign this value to a variable, then you can just access the 0th element to get what you're looking for:
my_value_as_series = x.loc[bar==foo]['variable_im_interested_in']
# Label-based lookup: this assumes the index label is 0, but from your
# example it might be 24 instead.
plain_value = my_value_as_series[0]
# Likewise, .loc needs the actual index label, not necessarily 0.
# (.ix was removed in pandas 1.0; use .loc or .iloc instead.)
also_plain_value = my_value_as_series.loc[0]
# This one works with zero, since `values` is a plain ndarray indexed by position.
plain_value_too = my_value_as_series.values[0]
You don't have to assign to a variable to do this, so you could just write x.loc[bar==foo]['variable_im_interested_in'][0] (or similar for the other options), but cramming more and more accessor and fancy indexing syntax onto a single expression is usually a bad idea.
Also note that you can directly index the column of interest inside of the call to loc:
x.loc[bar==foo, 'variable_im_interested_in'][24]
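For reference, here is a small self-contained sketch of the same idea using accessors that do not depend on guessing the index label. The frame and the condition (x.index == 24) are stand-ins for the question's x and bar==foo, which are not shown in full.

```python
import pandas as pd

# Hypothetical frame; the row of interest has index label 24.
x = pd.DataFrame({"ep_wb_ph_brand": ["Acme", "Boss"]}, index=[7, 24])

# A boolean condition plus a column label gives a length-1 Series.
selected = x.loc[x.index == 24, "ep_wb_ph_brand"]

by_position = selected.iloc[0]  # first element, whatever its label is
by_label = selected.loc[24]     # lookup by the actual index label
only_value = selected.item()    # raises ValueError unless exactly one element
```

item() is a reasonable choice when you expect exactly one match and would rather get an error than silently pick the first of several.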
Code to get the last value of an array (run in a Jupyter notebook, noted with the >s):
> import pandas
> df = pandas.DataFrame(data=['a', 'b', 'c'], columns=['name'])
> df
name
0 a
1 b
2 c
> df.tail(1)['name'].values[0]
'c'
You could also call the split method on the string representation, although this depends on the printed layout of the Series:
>>> s = '24 Boss\nName: ep_wb_ph_brand, dtype: object'
>>> s.split()[1]
'Boss'

Ignoring NaNs with str.contains

I want to find rows that contain a string, like so:
DF[DF.col.str.contains("foo")]
However, this fails because some elements are NaN:
ValueError: cannot index with vector containing NA / NaN values
So I resort to the obfuscated
DF[DF.col.notnull()][DF.col.dropna().str.contains("foo")]
Is there a better way?
There's a flag for that:
In [11]: df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])
In [12]: df.a.str.contains("foo")
Out[12]:
0 True
1 True
2 False
3 NaN
Name: a, dtype: object
In [13]: df.a.str.contains("foo", na=False)
Out[13]:
0 True
1 True
2 False
3 False
Name: a, dtype: bool
See the str.contains docs:
na : default NaN, fill value for missing values.
So you can do the following:
In [21]: df.loc[df.a.str.contains("foo", na=False)]
Out[21]:
a
0 foo1
1 foo2
In addition to the above answers: for column names that are not a single word (for example, names containing a space), use bracket indexing instead of attribute access:
df[df['Product ID'].str.contains("foo") == True]
Hope this helps.
df[df.col.str.contains("foo").fillna(False)]
I'm not 100% on why (actually came here to search for the answer), but this also works, and doesn't require replacing all nan values.
import pandas as pd
import numpy as np
df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=['a'])
newdf = df.loc[df['a'].str.contains('foo') == True]
Works with or without .loc.
I have no idea why this works, as I understand it when you're indexing with brackets pandas evaluates whatever's inside the bracket as either True or False. I can't tell why making the phrase inside the brackets 'extra boolean' has any effect at all.
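A likely explanation, sketched below on the same sample frame: the elementwise comparison NaN == True evaluates to False, so comparing the result of str.contains with True leaves a mask that holds only True and False, which is what boolean indexing needs. The intermediate names raw and compared are mine, and the exact dtype of raw may differ between pandas versions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([["foo1"], ["foo2"], ["bar"], [np.nan]], columns=["a"])

# Result of the string test; on many pandas versions the last entry is NaN.
raw = df["a"].str.contains("foo")

# Comparing with True is elementwise, and NaN == True is False,
# so the missing entry ends up as False in the mask.
compared = raw == True  # noqa: E712 (comparison to True is intentional here)

# The mask matches what na=False produces directly.
same_as_flag = compared.equals(df["a"].str.contains("foo", na=False))

newdf = df[compared]
```

Passing na=False states the intent more directly than == True, so it is probably the clearer of the two spellings.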
You can also use query method to query the columns of a DataFrame with a boolean expression as follows:
df.query("a.str.contains('foo', na=False)")
Or more interestingly, which I think is more readable:
df.query("a.str.contains('foo')==True")
Note you might not get performance improvement, but it is more readable (arguably).
You can also pass a regular expression pattern; na=False is still needed if the column contains NaN:
DF[DF.col.str.contains(pat='foo', regex=True, na=False)]
