I have constructed a condition that extracts exactly one row from my data frame:
d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]
Now I would like to take a value from a particular column:
val = d2['col_name']
But as a result, I get a data frame that contains one row and one column (i.e., one cell). That is not what I need: I need one value (one float number). How can I do that in pandas?
If you have a DataFrame with only one row, then access the first (only) row as a Series using iloc, and then the value using the column name:
In [3]: sub_df
Out[3]:
A B
2 -0.133653 -0.030854
In [4]: sub_df.iloc[0]
Out[4]:
A -0.133653
B -0.030854
Name: 2, dtype: float64
In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493
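A minor variation (not from the original answer) that skips building the intermediate row Series by selecting the column first:
In [6]: sub_df['A'].iloc[0]   # column first, then position
Out[6]: -0.13365288513107493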
These are fast access methods for scalars:
In [15]: df = pandas.DataFrame(numpy.random.randn(5, 3), columns=list('ABC'))
In [16]: df
Out[16]:
A B C
0 -0.074172 -0.090626 0.038272
1 -0.128545 0.762088 -0.714816
2 0.201498 -0.734963 0.558397
3 1.563307 -1.186415 0.848246
4 0.205171 0.962514 0.037709
In [17]: df.iat[0, 0]
Out[17]: -0.074171888537611502
In [18]: df.at[0, 'A']
Out[18]: -0.074171888537611502
You can turn your 1x1 dataframe into a NumPy array, then access the first and only value of that array:
val = d2['col_name'].values[0]
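In newer pandas, .to_numpy() is the recommended spelling for getting the underlying array; the same idea, as a sketch:
val = d2['col_name'].to_numpy()[0]   # same result as .values[0]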
Most answers use iloc, which is good for selection by position.
If you need selection by label, loc is more convenient.
For getting a value explicitly (equivalent to the deprecated df.get_value('a', 'A')):
# This is also equivalent to df1.at['a','A']
In [55]: df1.loc['a', 'A']
Out[55]: 0.13200317033032932
It doesn't need to be complicated:
val = df.loc[df.wd==1, 'col_name'].values[0]
I needed the value of one cell, selected by column and index names.
This solution worked for me:
original_conversion_frequency.loc[1,:].values[0]
It looks like this changed between pandas 0.10.1 and 0.13.1.
I upgraded from 0.10.1 to 0.13.1; before, iloc was not available.
Now with 0.13.1, iloc[0]['label'] returns a single-value Series rather than a scalar.
Like this:
lastprice = stock.iloc[-1]['Close']
Output:
date
2014-02-26    118.2
Name: Close, dtype: float64
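If you run into this, a hedged workaround is to select the column first and use iat, which always returns a scalar (assuming the same stock frame as above):
lastprice = stock['Close'].iat[-1]   # iat returns a scalar, never a Series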
The quickest and easiest options I have found are the following (501 represents the row index):
df.at[501, 'column_name']
df.get_value(501, 'column_name')
(Note: get_value was deprecated in later pandas releases and eventually removed; df.at covers the same use.)
In later versions, you can fix it by simply doing:
val = float(d2['col_name'].iloc[0])
df_gdp.columns
Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code',
u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967',
u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975',
u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983',
u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991',
u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999',
u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007',
u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015',
u'2016'],
dtype='object')
df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]
8100000000000.0
I am not sure if this is a good practice, but I noticed I can also get just the value by casting the Series as float (note that recent pandas versions deprecate calling float() on a one-element Series in favour of .item()).
E.g.,
rate
3 0.042679
Name: Unemployment_rate, dtype: float64
float(rate)
0.0426789
I've run across this when using dataframes with MultiIndexes and found squeeze useful.
From the documentation:
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
# Example for a dataframe with MultiIndex
> import pandas as pd
> df = pd.DataFrame(
[
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
],
index=pd.MultiIndex.from_tuples( [('i', 1), ('ii', 2), ('iii', 3)] ),
columns=pd.MultiIndex.from_tuples( [('A', 'a'), ('B', 'b'), ('C', 'c')] )
)
> df
        A  B  C
        a  b  c
i   1   1  2  3
ii  2   4  5  6
iii 3   7  8  9
> df.loc['ii', 'B']
b
2 5
> df.loc['ii', 'B'].squeeze()
5
Note that while df.at[] also works (if you aren't needing to use conditionals), you then still, as far as I know, need to specify all levels of the MultiIndex.
Example:
> df.at[('ii', 2), ('B', 'b')]
5
I have a dataframe with a six-level index and two-level columns, so only having to specify the outer level is quite helpful.
For pandas 0.10, where iloc is unavailable, filter a DF and get the first row data for the column VALUE:
df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0],'VALUE')
If more than one row matches the filter, this obtains the first row's value. An exception will be raised if the filter results in an empty data frame.
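On modern pandas, where get_value has been removed, the same recipe can be written positionally (a sketch under the same assumptions):
df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt['VALUE'].iloc[0]   # first matching value as a scalar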
Converting it to an integer worked for me (this assumes the selection holds a single value):
int(sub_df.iloc[0])
Using .item() returns a scalar (not a Series), and it only works if there is a single element selected. It's much safer than .values[0] which will return the first element regardless of how many are selected.
>>> df = pd.DataFrame({'a': [1,2,2], 'b': [4,5,6]})
>>> df[df['a'] == 1]['a'] # Returns a Series
0 1
Name: a, dtype: int64
>>> df[df['a'] == 1]['a'].item()
1
>>> df2 = df[df['a'] == 2]
>>> df2['b']
1 5
2 6
Name: b, dtype: int64
>>> df2['b'].values[0]
5
>>> df2['b'].item()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/base.py", line 331, in item
raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar
To get the full row's value as JSON (instead of a Series):
row = df.iloc[0]
Use the to_json method like below:
row.to_json()
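For instance (a minimal sketch with a made-up frame; a Series serializes with orient='index' by default):
>>> df = pd.DataFrame({'a': [1], 'b': [2]})
>>> df.iloc[0].to_json()
'{"a":1,"b":2}'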
So I have read in a CSV file as a pandas dataframe. But when I group it, the Year column is shifted down by one, so when I try to pull Years out into a NumPy array, it gives a KeyError: 'Year'.
Is there a way to get the array to find the years, or a way to shift that first column up by one?
I have found a way to shift a dataframe column up by one, but I need to shift the grouping, not the dataframe.
I also tried turning the new grouping into a new dataframe so that I can shift the year column up, but haven't been successful.
Year is the name of the index.
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df
Out[12]:
A B
0 1 2
1 3 4
In [13]: df.index.name = "foo"
In [14]: df
Out[14]:
A B
foo
0 1 2
1 3 4
Pull out the index with .index:
In [15]: df.index
Out[15]: Int64Index([0, 1], dtype='int64', name='foo')
In [16]: df.index.values
Out[16]: array([0, 1])
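If you would rather have Year back as a regular column, reset_index turns the named index into a column (a sketch using the frame above):
In [17]: df.reset_index()
Out[17]:
   foo  A  B
0    0  1  2
1    1  3  4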
I have a seemingly easy task: a dataframe with two columns, A and B. If a value in B is larger than the value in A, replace it with the value of A. I used to do this with df.B[df.B > df.A] = df.A, but a recent upgrade of pandas started giving a SettingWithCopyWarning when encountering this chained assignment, and the official documentation recommends using .loc.
Okay, I said, and did it through df.loc[df.B > df.A, 'B'] = df.A and it all works fine, unless column B has all values of NaN. Then something weird happens:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 -9223372036854775808
2 3 -9223372036854775808
Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 4
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 NaN
1 2 2
2 3 NaN
But if none of B's elements satisfies the condition, all NaNs get replaced with -9223372036854775808:
In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 1
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 1
2 3 -9223372036854775808
Is this a bug or a feature? How should I have done this replacement?
Thank you!
This is a bug, fixed here.
Since pandas allows basically anything to be set on the right-hand-side of an expression in loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:
df.loc[lhs, column] = rhs
where rhs could be a list, array, or scalar, and lhs could be a slice, tuple, scalar, or array,
and a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs. (This is a bit complicated.) For example, say you don't set all of the elements on the lhs and it was integer; then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer, then it needs to be coerced BACK to integer.
In this particular case, the lhs is an array, so we would normally try to coerce the lhs to the type of the rhs, but this case degenerates if we have an unsafe conversion (int -> float)
Suffice to say this was a missing edge case.
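Until the fix lands, one way to express the replacement that sidesteps the issue (a sketch, not the only option) is Series.mask, which substitutes values where the condition holds; NaN comparisons evaluate to False, so the NaNs are left untouched:
df['B'] = df['B'].mask(df.B > df.A, df.A)   # NaN > x is False, so NaNs are kept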
I have a pandas Series with an integer index which I've sorted (by value). How do I access values by position in this Series?
For example:
s_original = pd.Series({0: -0.000213, 1: 0.00031399999999999999, 2: -0.00024899999999999998, 3: -2.6999999999999999e-05, 4: 0.000122})
s_sorted = np.sort(s_original)
In [3]: s_original
Out[3]:
0 -0.000213
1 0.000314
2 -0.000249
3 -0.000027
4 0.000122
In [4]: s_sorted
Out[4]:
2 -0.000249
0 -0.000213
3 -0.000027
4 0.000122
1 0.000314
In [5]: s_sorted[3]
Out[5]: -2.6999999999999999e-05
But I would like to get the value 0.000122, i.e., the item in position 3. How can I do this?
Replace the line
s_sorted = np.sort(s_original)
with
s_sorted = pd.Series(np.sort(s_original), index=s_original.index)
This will sort the values, but keep the index.
EDIT:
To get the fourth value in the sorted Series (np.sort returns a NumPy array):
np.sort(s_original)[3]
You can use iget to retrieve by position (in fact, this method was created especially to overcome this ambiguity):
In [1]: s = pd.Series([0, 2, 1])
In [2]: s.sort()
In [3]: s
Out[3]:
0 0
2 1
1 2
In [4]: s.iget(1)
Out[4]: 1
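For reference, iget (along with Series.sort) was deprecated and removed in later pandas versions; iloc is the modern positional equivalent, as a sketch:
In [5]: s = pd.Series([0, 2, 1]).sort_values()
In [6]: s.iloc[1]   # selection by position
Out[6]: 1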
The behaviour of .ix with an integer index is noted in the pandas "gotchas":
In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .ix.
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
Note: this would work if you were using a non-integer index, where .ix is not ambiguous.
For example:
In [11]: s1 = pd.Series([0, 2, 1], list('abc'))
In [12]: s1
Out[12]:
a 0
b 2
c 1
In [13]: s1.sort()
In [14]: s1
Out[14]:
a 0
c 1
b 2
In [15]: s1.ix[1]
Out[15]: 1
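.ix was itself deprecated and removed in later pandas versions; the unambiguous modern spellings are .loc for labels and .iloc for positions. A sketch with the sorted Series above:
In [16]: s1.iloc[1]   # by position
Out[16]: 1
In [17]: s1.loc['c']  # the same element, by label
Out[17]: 1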
I think a reverse/negative dataframe.drop functionality would be a very useful tool.
Has anybody found a way to overcome this?
Generally, I find myself using boolean indexing and the tilde operator to obtain the inverse of a selection, rather than df.drop(), though the same concept applies to df.drop when boolean indexing is used to form the array of labels to drop. Hope that helps.
In [44]: df
Out[44]:
A B
0 0.642010 0.116227
1 0.848426 0.710739
2 0.563803 0.416422
In [45]: cond = (df.A > .6) & (df.B > .3)
In [46]: df[cond]
Out[46]:
A B
1 0.848426 0.710739
In [47]: df[~cond]
Out[47]:
A B
0 0.642010 0.116227
2 0.563803 0.416422
If I understand you right, you can get this effect just by indexing with an "isin" on the index:
>>> df
A B C
0 0.754956 -0.597896 0.245254
1 -0.987808 0.162506 -0.131674
2 -1.064639 -2.193629 1.814078
3 -0.483950 -1.290789 1.776827
4 -0.191055 -0.461204 0.412220
>>> df[df.index.isin([0, 2, 3])] # Drop rows whose label is not in the set [0, 2, 3]
A B C
0 0.754956 -0.597896 0.245254
2 -1.064639 -2.193629 1.814078
3 -0.483950 -1.290789 1.776827
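And negating the condition with ~ gives the reverse-drop behaviour asked about (equivalent here to df.drop([0, 2, 3])):
>>> df[~df.index.isin([0, 2, 3])]  # Keep only rows whose label is NOT in the set
          A         B         C
1 -0.987808  0.162506 -0.131674
4 -0.191055 -0.461204  0.412220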