Extracting just a string element from a pandas dataframe - python

Okay, so say I have a pandas dataframe x, and I'm interested in extracting a value from it:
> x.loc[bar==foo]['variable_im_interested_in']
Let's say that returns the following, of type pandas.core.series.Series:
24 Boss
Name: ep_wb_ph_brand, dtype: object
But all I want is the string 'Boss'. Wrapping the first line of code in str() doesn't help either, I just get:
'24 Boss\nName: ep_wb_ph_brand, dtype: object'
How do I just extract the string?

Based on your comments, this code is returning a length-1 pandas Series:
x.loc[bar==foo]['variable_im_interested_in']
If you assign this value to a variable, then you can just access the 0th element to get what you're looking for:
my_value_as_series = x.loc[bar==foo]['variable_im_interested_in']
# Label-based lookup: assumes the index label to get is 0, but from
# your example, it might be 24 instead.
plain_value = my_value_as_series[0]
# Position-based lookup: .iloc[0] works for a length-1 Series no matter
# what the index label is. (The older .ix accessor is deprecated and
# has been removed from pandas.)
also_plain_value = my_value_as_series.iloc[0]
# This one works with zero, since `values` is a new ndarray.
plain_value_too = my_value_as_series.values[0]
You don't have to assign to a variable to do this, so you could just write x.loc[bar==foo]['variable_im_interested_in'][0] (or similar for the other options), but cramming more and more accessor and fancy indexing syntax onto a single expression is usually a bad idea.
Also note that you can directly index the column of interest inside of the call to loc:
x.loc[bar==foo, 'variable_im_interested_in'][24]
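A minimal, self-contained sketch of the options above (the dataframe here is a made-up stand-in, since the real x, bar, and foo aren't shown in the question):

```python
import pandas as pd

# Hypothetical dataframe mimicking the question's shape: one matching
# row whose index label is 24.
x = pd.DataFrame({'ep_wb_ph_brand': ['Hugo', 'Boss']}, index=[23, 24])
mask = x['ep_wb_ph_brand'] == 'Boss'

series = x.loc[mask, 'ep_wb_ph_brand']  # length-1 Series, index label 24

by_label = series[24]            # label-based lookup
by_position = series.iloc[0]     # position-based lookup
from_ndarray = series.values[0]  # via the underlying ndarray

print(by_label, by_position, from_ndarray)  # Boss Boss Boss
```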

Code to get the last value of an array (run in a Jupyter notebook, noted with the >s):
> import pandas
> df = pandas.DataFrame(data=['a', 'b', 'c'], columns=['name'])
> df
name
0 a
1 b
2 c
> df.tail(1)['name'].values[0]
'c'

You could also just split the printed representation with the string split method (a hack, but it works for this exact output):
>>> s = '24 Boss\nName: ep_wb_ph_brand, dtype: object'
>>> s.split()[1]
'Boss'

Related

Why does assignment on a non-subsetted series overwrite the entire series but assignment on a df-subsetted series overwrites by-value?

I was completing an exercise on string substitution within a df.
I'll simplify the code to illustrate the point:
test2 = pd.DataFrame({'column1':['a','a','b','b']})
Say I want to change every 'b' to an 'a'. My initial inclination was to use .str.replace(), which could work. But I noticed an example elsewhere (without explanation) that I could instead just write:
test2.loc[test2['column1'] == 'b', 'column1'] = 'a'
which leaves column1 as a series of all a's.
This was a bit curious to me because if I take the type of
test2.loc[test2['column1'] == 'b', 'column1']
I get <class 'pandas.core.series.Series'>
But if I were to create an identical starter-series from scratch:
test1 = pd.Series(['a','a','b','b'])
I get the same object type. Yet if I write
test1 = 'a'
that would convert the entire series object into the one-letter string, 'a'.
I know that a workaround could be to subset the non-subsetted series itself, i.e.,
test1[test1 == 'b'] = 'a'
which yields the same 4-element series of 'a'.
But I don't totally follow the technical reason for when/why you can simply assign a value to a series and have it replace respective values in the series, instead of overwrite the series altogether. So clarification on this specific point would be appreciated!
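For what it's worth, the distinction being asked about can be seen directly in plain Python: a bare `name = value` rebinds the name and runs no pandas code at all, while `obj[key] = value` calls the object's `__setitem__`, which is what lets pandas overwrite matching values inside the existing Series. A minimal sketch (the second name `s` is added purely to observe the mutation):

```python
import pandas as pd

test1 = pd.Series(['a', 'a', 'b', 'b'])
s = test1  # a second name for the same Series object

# Subscripted assignment: Python calls test1.__setitem__(mask, 'a'),
# so pandas replaces the matching values inside the existing object.
test1[test1 == 'b'] = 'a'
print(s.tolist())  # ['a', 'a', 'a', 'a'] -- the object was mutated

# Bare-name assignment: no pandas code runs; the name test1 is simply
# rebound to the string 'a', while the Series lives on under s.
test1 = 'a'
print(test1, s.tolist())
```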

python csv problem- want to use variable in loc code

I imported my csv file and I want to use variables in my loc code, like:
a = 0; b = 1
df.loc[int(a+2),int(2*b-2)]
but it raises an error.
How do I fix it?
If you want to get the value at position (a+2, 2*b-2), use .iloc instead.
.loc selects data by index label, not by position.
try this:
df.iloc[a+2,2*b-2]
pandas docs
It seems that you want to use a slice of int numbers to select a piece of data. In pandas.DataFrame, you can use df.loc and df.iloc for indexing.
Allowed inputs of df.loc are:
A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer position along the index).
A list or array of labels, e.g. ['a', 'b', 'c'].
A slice object with labels, e.g. 'a':'f'.
A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
An alignable boolean Series.
An alignable Index.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above)
Allowed inputs of df.iloc are:
An integer, e.g. 5.
A list or array of integers, e.g. [4, 3, 0].
A slice object with ints, e.g. 1:7.
A boolean array.
A callable function with one argument (the calling Series or DataFrame) and that returns valid output for indexing (one of the above).
Note, though, that both accessors accept two inputs, one per axis: df.iloc[a+2, 2*b-2] selects by row and column position, while df.loc[int(a+2), int(2*b-2)] treats both values as labels. The likely cause of your error is that 2*b-2 is not an actual column label in the dataframe read from your CSV (column labels there are usually strings), so positional selection with df.iloc[a+2, 2*b-2] is the right tool.
For more details, you can see the documents:
https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
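A quick check of the two accessors on a toy dataframe (column names here are invented, since the original CSV isn't shown); note that each accepts a row input and a column input:

```python
import pandas as pd

df = pd.DataFrame([[10, 11], [20, 21], [30, 31]], columns=['x', 'y'])

a, b = 0, 1
# .iloc takes integer positions for both axes: row a+2, column 2*b-2.
cell = df.iloc[a + 2, 2 * b - 2]
print(cell)  # 30

# .loc takes labels: the row labels happen to be the default integers
# 0..2, but the column label is the string 'x', not the integer 0.
same_cell = df.loc[a + 2, 'x']
print(same_cell)  # 30
```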

How do i extract certain values in a row satisfying a condition in a dataframe?

I have a DataFrame looking something like this -
Now, how do I extract all the elements in row A having a value greater than 2?
In the above case it would be the value 2.706850
I did something like this:
df.loc['A'] > 2
But that gave me a series containing boolean values instead.
What should I do to get 2.706850 as the output?
Recommended solution
You can index the dataframe with the conditional expression and the series label itself:
df.loc[df.loc['A'] > 2, 'A']
Old answer, not recommended
Avoid this approach: it relies on chained indexing, which can trigger SettingWithCopyWarning and misbehave if you later try to assign through the result.
You just need to index back into the series with your boolean mask as follows:
>>> df.loc['A'][df.loc['A'] > 2]
F 2.706850
Name: A, dtype: float64
Both of the following were tried and work:
df.loc['A',df.loc['A']>2]
df.loc['A'][df.loc['A']>2]
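To make the row-then-column loc pattern concrete, here is a small reproducible version (the column names and second row are invented, since the original dataframe isn't shown):

```python
import pandas as pd

df = pd.DataFrame({'F': [2.706850, 0.5], 'G': [1.0, 3.2]},
                  index=['A', 'B'])

# Build a boolean mask over row 'A', then select from that same row
# in a single loc call: row label first, column mask second.
result = df.loc['A', df.loc['A'] > 2]
print(result)  # F    2.70685
```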

Append to Series in python/pandas not working

I am trying to append values to a pandas Series obtained by finding the difference between the nth and nth + 1 element:
q = pd.Series([])
while i < len(other array):
    diff = some int value
    a = pd.Series([diff], ignore_index=True)
    q.append(a)
    i += 1
The output I get is:
Series([], dtype: float64)
Why am I not getting an array with all the appended values?
--
P.S. This is a data science question where I have to find state with the most counties by searching through a dataframe. I am using the index values where one state ends and the next one begins (the values in the array that I am using to find the difference) to determine how many counties are in that state. If anyone knows how to solve this problem better than I am above, please let me know!
The append method doesn't work in-place. Instead, it returns a new Series object. So it should be:
q = q.append(a)
Hope it helps!
The Series.append documentation states that it will "append rows of other to the end of caller, returning a new object".
The examples are a little confusing as it appears to show it working but if you look closely you'll notice they are using interactive python which prints the result of the last call (the new object) rather than showing the original object.
The result of calling append is actually a brand new Series.
In your example you would need to assign q each time to the new object returned by .append:
q = pd.Series([])
while i < len(other array):
    diff = some int value
    a = pd.Series([diff])
    # change of code here: assign the result back to q (and note that
    # ignore_index belongs to append, not to the Series constructor)
    q = q.append(a, ignore_index=True)
    i += 1
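Also worth knowing: Series.append was deprecated in pandas 1.4 and removed in 2.0. The same computation is simpler and faster if you collect plain values in a list and build the Series once at the end (the array below is an illustrative stand-in for the asker's index array):

```python
import pandas as pd

other_array = [0, 3, 7, 12]  # stand-in for the real index positions

# Difference between each element and the next, gathered in a list;
# the Series is constructed once, outside the loop.
diffs = [other_array[i + 1] - other_array[i]
         for i in range(len(other_array) - 1)]
q = pd.Series(diffs)
print(q.tolist())  # [3, 4, 5]
```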

Numpy/Pandas Series begins with operator? Does it exist?

I am trying to create a series in my dataframe (sdbfile) whose values are based on several nested conditional statements using elements from sdbfile dataframe. The series reins_code is filled with string values.
The statement below works, however I need to configure it to test whether 'reins_code' begins with 'R' rather than == a specific 'R#':
sdbfile['product'] = np.where(sdbfile.reins_code == 'R2', 'HiredPlant','Trad')
It doesn't like the string function startswith(), as it's a pandas Series.
Can anybody help please? I have waded through the documentation but cannot see a reference to this problem.
Use the vectorised str.startswith to return a boolean mask:
In [6]:
df = pd.DataFrame({'a':['R1asda','R2asdsa','foo']})
df
Out[6]:
a
0 R1asda
1 R2asdsa
2 foo
In [8]:
df['a'].str.startswith('R2')
Out[8]:
0 False
1 True
2 False
Name: a, dtype: bool
In [9]:
df[df['a'].str.startswith('R2')]
Out[9]:
a
1 R2asdsa
Use the pandas str attribute. http://pandas.pydata.org/pandas-docs/stable/text.html
Series and Index are equipped with a set of string processing methods
that make it easy to operate on each element of the array. Perhaps
most importantly, these methods exclude missing/NA values
automatically. These are accessed via the str attribute and generally
have names matching the equivalent (scalar) built-in string methods:
sdbfile['product'] = np.where(sdbfile.reins_code.str[0] == 'R', 'HiredPlant','Trad')
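Putting the pieces together, the vectorised str.startswith slots straight into the np.where call (the dataframe below is a made-up stand-in for sdbfile):

```python
import numpy as np
import pandas as pd

sdbfile = pd.DataFrame({'reins_code': ['R2', 'R7', 'X1']})

# True for every code that begins with 'R', not just the exact 'R2'.
sdbfile['product'] = np.where(
    sdbfile.reins_code.str.startswith('R'), 'HiredPlant', 'Trad')
print(sdbfile['product'].tolist())  # ['HiredPlant', 'HiredPlant', 'Trad']
```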
