I am trying to create a series in my dataframe (sdbfile) whose values are based on several nested conditional statements using elements from sdbfile dataframe. The series reins_code is filled with string values.
The statement below works however I need to configure to say if 'reins_code' begins' with 'R' rather than == a specific 'R#'
sdbfile['product'] = np.where(sdbfile.reins_code == 'R2', 'HiredPlant','Trad')
It doesn't like the string function startswith() as its a np.series?
Can anybody help please? Have waded through the documentation but cannot see a reference to this problem.......
Use the vectorised str.startswith to return a boolean mask:
In [6]:
df = pd.DataFrame({'a':['R1asda','R2asdsa','foo']})
df
Out[6]:
a
0 R1asda
1 R2asdsa
2 foo
In [8]:
df['a'].str.startswith('R2')
Out[8]:
0 False
1 True
2 False
Name: a, dtype: bool
In [9]:
df[df['a'].str.startswith('R2')]
Out[9]:
a
1 R2asdsa
Use the pandas str attribute. http://pandas.pydata.org/pandas-docs/stable/text.html
Series and Index are equipped with a set of string processing methods
that make it easy to operate on each element of the array. Perhaps
most importantly, these methods exclude missing/NA values
automatically. These are accessed via the str attribute and generally
have names matching the equivalent (scalar) built-in string methods:
sdbfile['product'] = np.where(sdbfile.reins_code.str[0] == 'R', 'HiredPlant','Trad')
Related
This works (using Pandas 12 dev)
table2=table[table['SUBDIVISION'] =='INVERNESS']
Then I realized I needed to select the field using "starts with" Since I was missing a bunch.
So per the Pandas doc as near as I could follow I tried
criteria = table['SUBDIVISION'].map(lambda x: x.startswith('INVERNESS'))
table2 = table[criteria]
And got AttributeError: 'float' object has no attribute 'startswith'
So I tried an alternate syntax with the same result
table[[x.startswith('INVERNESS') for x in table['SUBDIVISION']]]
Reference http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
Section 4: List comprehensions and map method of Series can also be used to produce more complex criteria:
What am I missing?
You can use the str.startswith DataFrame method to give more consistent results:
In [11]: s = pd.Series(['a', 'ab', 'c', 11, np.nan])
In [12]: s
Out[12]:
0 a
1 ab
2 c
3 11
4 NaN
dtype: object
In [13]: s.str.startswith('a', na=False)
Out[13]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
and the boolean indexing will work just fine (I prefer to use loc, but it works just the same without):
In [14]: s.loc[s.str.startswith('a', na=False)]
Out[14]:
0 a
1 ab
dtype: object
.
It looks least one of your elements in the Series/column is a float, which doesn't have a startswith method hence the AttributeError, the list comprehension should raise the same error...
To retrieve all the rows which startwith required string
dataFrameOut = dataFrame[dataFrame['column name'].str.match('string')]
To retrieve all the rows which contains required string
dataFrameOut = dataFrame[dataFrame['column name'].str.contains('string')]
Using startswith for a particular column value
df = df.loc[df["SUBDIVISION"].str.startswith('INVERNESS', na=False)]
You can use apply to easily apply any string matching function to your column elementwise.
table2=table[table['SUBDIVISION'].apply(lambda x: x.startswith('INVERNESS'))]
this assuming that your "SUBDIVISION" column is of the correct type (string)
Edit: fixed missing parenthesis
This can also be achieved using query:
table.query('SUBDIVISION.str.startswith("INVERNESS").values')
Fairly new to python. This seems to be a really simple question but I can't find any information about it.
I have a list of strings, and for each string I want to check whether it is present in a dataframe (actually in a particular column of the dataframe. Not whether a substring is present, but the whole exact string.
So my dataframe is something like the following:
A=pd.DataFrame(["ancestry","time","history"])
I should simply be able to use the "string in dataframe" method, as in
"time" in A
This returns False however.
If I run
"time" == A.iloc[1]
it returns "True", but annoyingly as part of a series, and this depends on knowing where in the dataframe the corresponding string is.
Is there some way I can just use the string in df method, to easily find out whether the strings in my list are in the dataframe?
Add .to_numpy() to the end:
'time' in A.to_numpy()
As you've noticed, the x in pandas.DataFrame syntax doesn't produce the result you want. But .to_numpy() transforms the dataframe into a Numpy array, and x in numpy.array works as you expect.
The way to deal with this is to compare the whole dataframe with "time". That will return a mask where each value of the DF is True if it was time, False otherwise. Then, you can use .any() to check if there are any True values:
>>> A = pd.DataFrame(["ancestry","time","history"])
>>> A
0
0 ancestry
1 time
2 history
>>> A == "time" # or A.eq("time")
0
0 False
1 True
2 False
>>> (A == "time").any()
0 True
dtype: bool
Notice in the above output, (A == "time").any() returns a Series where each entry is a column and whether or not that column contained time. If you want to check the entire dataframe (across all columns), call .any() twice:
>>> (A == "time").any().any()
True
I believe (myseries==mystr).any() will do what you ask. The special __contains__ method of DataFrames (which informs behavior of in) checks whether your string is a column of the DataFrame, e.g.
>>> A = pd.DataFrame({"c": [0,1,2], "d": [3,4,5]})
>>> 'c' in A
True
>>> 0 in A
False
I would slightly modify your dataframe and use .str.contains for checking where the string is present in your series.
df=pd.DataFrame()
df['A']=pd.Series(["ancestry","time","history"])
df['A'].str.contains("time")
This works (using Pandas 12 dev)
table2=table[table['SUBDIVISION'] =='INVERNESS']
Then I realized I needed to select the field using "starts with" Since I was missing a bunch.
So per the Pandas doc as near as I could follow I tried
criteria = table['SUBDIVISION'].map(lambda x: x.startswith('INVERNESS'))
table2 = table[criteria]
And got AttributeError: 'float' object has no attribute 'startswith'
So I tried an alternate syntax with the same result
table[[x.startswith('INVERNESS') for x in table['SUBDIVISION']]]
Reference http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
Section 4: List comprehensions and map method of Series can also be used to produce more complex criteria:
What am I missing?
You can use the str.startswith DataFrame method to give more consistent results:
In [11]: s = pd.Series(['a', 'ab', 'c', 11, np.nan])
In [12]: s
Out[12]:
0 a
1 ab
2 c
3 11
4 NaN
dtype: object
In [13]: s.str.startswith('a', na=False)
Out[13]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
and the boolean indexing will work just fine (I prefer to use loc, but it works just the same without):
In [14]: s.loc[s.str.startswith('a', na=False)]
Out[14]:
0 a
1 ab
dtype: object
.
It looks least one of your elements in the Series/column is a float, which doesn't have a startswith method hence the AttributeError, the list comprehension should raise the same error...
To retrieve all the rows which startwith required string
dataFrameOut = dataFrame[dataFrame['column name'].str.match('string')]
To retrieve all the rows which contains required string
dataFrameOut = dataFrame[dataFrame['column name'].str.contains('string')]
Using startswith for a particular column value
df = df.loc[df["SUBDIVISION"].str.startswith('INVERNESS', na=False)]
You can use apply to easily apply any string matching function to your column elementwise.
table2=table[table['SUBDIVISION'].apply(lambda x: x.startswith('INVERNESS'))]
this assuming that your "SUBDIVISION" column is of the correct type (string)
Edit: fixed missing parenthesis
This can also be achieved using query:
table.query('SUBDIVISION.str.startswith("INVERNESS").values')
Okay, so say I have a pandas dataframe x, and I'm interested in extracting a value from it:
> x.loc[bar==foo]['variable_im_interested_in']
Let's say that returns the following, of type pandas.core.series.Series:
24 Boss
Name: ep_wb_ph_brand, dtype: object
But all I want is the string 'Boss'. Wrapping the first line of code in str() doesn't help either, I just get:
'24 Boss\nName: ep_wb_ph_brand, dtype: object'
How do I just extract the string?
Based on your comments, this code is returning a length-1 pandas Series:
x.loc[bar==foo]['variable_im_interested_in']
If you assign this value to a variable, then you can just access the 0th element to get what you're looking for:
my_value_as_series = x.loc[bar==foo]['variable_im_interested_in']
# Assumes the index to get is number 0, but from your example, it might
# be 24 instead.
plain_value = my_value_as_series[0]
# Likewise, this needs the actual index value, not necessarily 0.
also_plain_value = my_value_as_series.ix[0]
# This one works with zero, since `values` is a new ndarray.
plain_value_too = my_value_as_series.values[0]
You don't have to assign to a variable to do this, so you could just write x.loc[bar==foo]['variable_im_interested_in'][0] (or similar for the other options), but cramming more and more accessor and fancy indexing syntax onto a single expression is usually a bad idea.
Also note that you can directly index the column of interest inside of the call to loc:
x.loc[bar==foo, 'variable_im_interested_in'][24]
Code to get the last value of an array (run in a Jupyter notebook, noted with the >s):
> import pandas
> df = pandas.DataFrame(data=['a', 'b', 'c'], columns=['name'])
> df
name
0 a
1 b
2 c
> df.tail(1)['name'].values[0]
'c'
You could use string.split function.
>>> s = '24 Boss\nName: ep_wb_ph_brand, dtype: object'
>>> s.split()[1]
'Boss'
This works (using Pandas 12 dev)
table2=table[table['SUBDIVISION'] =='INVERNESS']
Then I realized I needed to select the field using "starts with" Since I was missing a bunch.
So per the Pandas doc as near as I could follow I tried
criteria = table['SUBDIVISION'].map(lambda x: x.startswith('INVERNESS'))
table2 = table[criteria]
And got AttributeError: 'float' object has no attribute 'startswith'
So I tried an alternate syntax with the same result
table[[x.startswith('INVERNESS') for x in table['SUBDIVISION']]]
Reference http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
Section 4: List comprehensions and map method of Series can also be used to produce more complex criteria:
What am I missing?
You can use the str.startswith DataFrame method to give more consistent results:
In [11]: s = pd.Series(['a', 'ab', 'c', 11, np.nan])
In [12]: s
Out[12]:
0 a
1 ab
2 c
3 11
4 NaN
dtype: object
In [13]: s.str.startswith('a', na=False)
Out[13]:
0 True
1 True
2 False
3 False
4 False
dtype: bool
and the boolean indexing will work just fine (I prefer to use loc, but it works just the same without):
In [14]: s.loc[s.str.startswith('a', na=False)]
Out[14]:
0 a
1 ab
dtype: object
.
It looks least one of your elements in the Series/column is a float, which doesn't have a startswith method hence the AttributeError, the list comprehension should raise the same error...
To retrieve all the rows which startwith required string
dataFrameOut = dataFrame[dataFrame['column name'].str.match('string')]
To retrieve all the rows which contains required string
dataFrameOut = dataFrame[dataFrame['column name'].str.contains('string')]
Using startswith for a particular column value
df = df.loc[df["SUBDIVISION"].str.startswith('INVERNESS', na=False)]
You can use apply to easily apply any string matching function to your column elementwise.
table2=table[table['SUBDIVISION'].apply(lambda x: x.startswith('INVERNESS'))]
this assuming that your "SUBDIVISION" column is of the correct type (string)
Edit: fixed missing parenthesis
This can also be achieved using query:
table.query('SUBDIVISION.str.startswith("INVERNESS").values')