I have constructed a condition that extracts exactly one row from my data frame:
d2 = df[(df['l_ext']==l_ext) & (df['item']==item) & (df['wn']==wn) & (df['wd']==1)]
Now I would like to take a value from a particular column:
val = d2['col_name']
But as a result, I get a data frame that contains one row and one column (i.e., one cell). That is not what I need: I need one value (one float number). How can I do that in pandas?
If you have a DataFrame with only one row, then access the first (only) row as a Series using iloc, and then the value using the column name:
In [3]: sub_df
Out[3]:
A B
2 -0.133653 -0.030854
In [4]: sub_df.iloc[0]
Out[4]:
A -0.133653
B -0.030854
Name: 2, dtype: float64
In [5]: sub_df.iloc[0]['A']
Out[5]: -0.13365288513107493
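A minor variation (not from the original answer) that skips building the intermediate row Series by selecting the column first:
In [6]: sub_df['A'].iloc[0]   # column first, then position
Out[6]: -0.13365288513107493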
These are fast access methods for scalars:
In [15]: df = pandas.DataFrame(numpy.random.randn(5, 3), columns=list('ABC'))
In [16]: df
Out[16]:
A B C
0 -0.074172 -0.090626 0.038272
1 -0.128545 0.762088 -0.714816
2 0.201498 -0.734963 0.558397
3 1.563307 -1.186415 0.848246
4 0.205171 0.962514 0.037709
In [17]: df.iat[0, 0]
Out[17]: -0.074171888537611502
In [18]: df.at[0, 'A']
Out[18]: -0.074171888537611502
You can turn your 1x1 dataframe into a NumPy array, then access the first and only value of that array:
val = d2['col_name'].values[0]
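In newer pandas, .to_numpy() is the recommended spelling for getting the underlying array; the same idea, as a sketch:
val = d2['col_name'].to_numpy()[0]   # same result as .values[0]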
Most answers use iloc, which is good for selection by position.
If you need selection by label, loc is more convenient.
For getting a value explicitly (equivalent to the deprecated df.get_value('a', 'A')):
# This is also equivalent to df1.at['a','A']
In [55]: df1.loc['a', 'A']
Out[55]: 0.13200317033032932
It doesn't need to be complicated:
val = df.loc[df.wd==1, 'col_name'].values[0]
I needed the value of one cell, selected by column and index names.
This solution worked for me:
original_conversion_frequency.loc[1,:].values[0]
It looks like this changed between pandas 0.10.1 and 0.13.1.
I upgraded from 0.10.1 to 0.13.1; before, iloc was not available.
Now with 0.13.1, iloc[0]['label'] returns a single-value Series rather than a scalar.
Like this:
lastprice = stock.iloc[-1]['Close']
Output:
date
2014-02-26    118.2
Name: Close, dtype: float64
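If you run into this, a hedged workaround is to select the column first and use iat, which always returns a scalar (assuming the same stock frame as above):
lastprice = stock['Close'].iat[-1]   # iat returns a scalar, never a Series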
The quickest and easiest options I have found are the following (501 represents the row index):
df.at[501, 'column_name']
df.get_value(501, 'column_name')
(Note: get_value was deprecated in later pandas releases and eventually removed; df.at covers the same use.)
In later versions, you can fix it by simply doing:
val = float(d2['col_name'].iloc[0])
df_gdp.columns
Index([u'Country', u'Country Code', u'Indicator Name', u'Indicator Code',
u'1960', u'1961', u'1962', u'1963', u'1964', u'1965', u'1966', u'1967',
u'1968', u'1969', u'1970', u'1971', u'1972', u'1973', u'1974', u'1975',
u'1976', u'1977', u'1978', u'1979', u'1980', u'1981', u'1982', u'1983',
u'1984', u'1985', u'1986', u'1987', u'1988', u'1989', u'1990', u'1991',
u'1992', u'1993', u'1994', u'1995', u'1996', u'1997', u'1998', u'1999',
u'2000', u'2001', u'2002', u'2003', u'2004', u'2005', u'2006', u'2007',
u'2008', u'2009', u'2010', u'2011', u'2012', u'2013', u'2014', u'2015',
u'2016'],
dtype='object')
df_gdp[df_gdp["Country Code"] == "USA"]["1996"].values[0]
8100000000000.0
I am not sure if this is a good practice, but I noticed I can also get just the value by casting the Series as float (note that recent pandas versions deprecate calling float() on a one-element Series in favour of .item()).
E.g.,
rate
3 0.042679
Name: Unemployment_rate, dtype: float64
float(rate)
0.0426789
I've run across this when using dataframes with MultiIndexes and found squeeze useful.
From the documentation:
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar.
DataFrames with a single column or a single row are squeezed to a
Series. Otherwise the object is unchanged.
# Example for a dataframe with MultiIndex
> import pandas as pd
> df = pd.DataFrame(
[
[1, 2, 3],
[4, 5, 6],
[7, 8, 9]
],
index=pd.MultiIndex.from_tuples( [('i', 1), ('ii', 2), ('iii', 3)] ),
columns=pd.MultiIndex.from_tuples( [('A', 'a'), ('B', 'b'), ('C', 'c')] )
)
> df
        A  B  C
        a  b  c
i   1   1  2  3
ii  2   4  5  6
iii 3   7  8  9
> df.loc['ii', 'B']
b
2 5
> df.loc['ii', 'B'].squeeze()
5
Note that while df.at[] also works (if you aren't needing to use conditionals), you then still, as far as I know, need to specify all levels of the MultiIndex.
Example:
> df.at[('ii', 2), ('B', 'b')]
5
I have a dataframe with a six-level index and two-level columns, so only having to specify the outer level is quite helpful.
For pandas 0.10, where iloc is unavailable, filter a DF and get the first row data for the column VALUE:
df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt.get_value(df_filt.index[0],'VALUE')
If more than one row matches the filter, this obtains the first row's value. An exception will be raised if the filter results in an empty data frame.
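On modern pandas, where get_value has been removed, the same recipe can be written positionally (a sketch under the same assumptions):
df_filt = df[(df['C1'] == C1val) & (df['C2'] == C2val)]
result = df_filt['VALUE'].iloc[0]   # first matching value as a scalar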
Converting it to an integer worked for me (this assumes the selection holds a single value):
int(sub_df.iloc[0])
Using .item() returns a scalar (not a Series), and it only works if there is a single element selected. It's much safer than .values[0] which will return the first element regardless of how many are selected.
>>> df = pd.DataFrame({'a': [1,2,2], 'b': [4,5,6]})
>>> df[df['a'] == 1]['a'] # Returns a Series
0 1
Name: a, dtype: int64
>>> df[df['a'] == 1]['a'].item()
1
>>> df2 = df[df['a'] == 2]
>>> df2['b']
1 5
2 6
Name: b, dtype: int64
>>> df2['b'].values[0]
5
>>> df2['b'].item()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/python3/dist-packages/pandas/core/base.py", line 331, in item
raise ValueError("can only convert an array of size 1 to a Python scalar")
ValueError: can only convert an array of size 1 to a Python scalar
To get the full row's value as JSON (instead of a Series):
row = df.iloc[0]
Use the to_json method like below:
row.to_json()
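For instance (a minimal sketch with a made-up frame; a Series serializes with orient='index' by default):
>>> df = pd.DataFrame({'a': [1], 'b': [2]})
>>> df.iloc[0].to_json()
'{"a":1,"b":2}'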
So I have read in a CSV file as a pandas dataframe. But when I group it, the Year column is shifted down by one, so when I try to pull Years out into a NumPy array, it gives a KeyError: 'Year'.
Is there a way to get the array to find the years, or a way to shift that first column up by one?
I have found a way to shift a dataframe column up by one, but I need to shift the grouping, not the dataframe.
I also tried turning the new grouping into a new dataframe so that I can shift the year column up, but haven't been successful.
Year is the name of the index.
In [11]: df = pd.DataFrame([[1, 2], [3, 4]], columns=["A", "B"])
In [12]: df
Out[12]:
A B
0 1 2
1 3 4
In [13]: df.index.name = "foo"
In [14]: df
Out[14]:
A B
foo
0 1 2
1 3 4
Pull out the index with .index:
In [15]: df.index
Out[15]: Int64Index([0, 1], dtype='int64', name='foo')
In [16]: df.index.values
Out[16]: array([0, 1])
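If you would rather have Year back as a regular column, reset_index turns the named index into a column (a sketch using the frame above):
In [17]: df.reset_index()
Out[17]:
   foo  A  B
0    0  1  2
1    1  3  4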
I have a seemingly easy task: a dataframe with two columns, A and B. If a value in B is larger than the value in A, replace it with the value of A. I used to do this with df.B[df.B > df.A] = df.A, but a recent upgrade of pandas started giving a SettingWithCopyWarning when encountering this chained assignment, and the official documentation recommends using .loc.
Okay, I said, and did it through df.loc[df.B > df.A, 'B'] = df.A and it all works fine, unless column B has all values of NaN. Then something weird happens:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 -9223372036854775808
2 3 -9223372036854775808
Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 4
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 NaN
1 2 2
2 3 NaN
But if none of B's elements satisfies the condition, all NaNs get replaced with -9223372036854775808:
In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 1
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 1
2 3 -9223372036854775808
Is this a bug or a feature? How should I have done this replacement?
Thank you!
This is a bug, fixed here.
Since pandas allows basically anything to be set on the right-hand-side of an expression in loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:
df.loc[lhs, column] = rhs
where rhs could be a list, array, or scalar, and lhs could be a slice, tuple, scalar, or array,
and a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs. (This is a bit complicated.) For example, say you don't set all of the elements on the lhs and it was integer; then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer, then it needs to be coerced BACK to integer.
In this particular case, the lhs is an array, so we would normally try to coerce the lhs to the type of the rhs, but this case degenerates if we have an unsafe conversion (int -> float)
Suffice to say this was a missing edge case.
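Until the fix lands, one way to express the replacement that sidesteps the issue (a sketch, not the only option) is Series.mask, which substitutes values where the condition holds; NaN comparisons evaluate to False, so the NaNs are left untouched:
df['B'] = df['B'].mask(df.B > df.A, df.A)   # NaN > x is False, so NaNs are kept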
I have a pandas Series with an integer index which I've sorted (by value). How do I access values by position in this Series?
For example:
s_original = pd.Series({0: -0.000213, 1: 0.00031399999999999999, 2: -0.00024899999999999998, 3: -2.6999999999999999e-05, 4: 0.000122})
s_sorted = np.sort(s_original)
In [3]: s_original
Out[3]:
0 -0.000213
1 0.000314
2 -0.000249
3 -0.000027
4 0.000122
In [4]: s_sorted
Out[4]:
2 -0.000249
0 -0.000213
3 -0.000027
4 0.000122
1 0.000314
In [5]: s_sorted[3]
Out[5]: -2.6999999999999999e-05
But I would like to get the value 0.000122, i.e., the item in position 3. How can I do this?
Replace the line
s_sorted = np.sort(s_original)
with
s_sorted = pd.Series(np.sort(s_original), index=s_original.index)
This will sort the values, but keep the index.
EDIT:
To get the fourth value in the sorted Series (np.sort returns a NumPy array):
np.sort(s_original)[3]
You can use iget to retrieve by position (in fact, this method was created especially to overcome this ambiguity):
In [1]: s = pd.Series([0, 2, 1])
In [2]: s.sort()
In [3]: s
Out[3]:
0 0
2 1
1 2
In [4]: s.iget(1)
Out[4]: 1
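For reference, iget (along with Series.sort) was deprecated and removed in later pandas versions; iloc is the modern positional equivalent, as a sketch:
In [5]: s = pd.Series([0, 2, 1]).sort_values()
In [6]: s.iloc[1]   # selection by position
Out[6]: 1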
The behaviour of .ix with an integer index is noted in the pandas "gotchas":
In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .ix.
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
Note: this would work if you were using a non-integer index, where .ix is not ambiguous.
For example:
In [11]: s1 = pd.Series([0, 2, 1], list('abc'))
In [12]: s1
Out[12]:
a 0
b 2
c 1
In [13]: s1.sort()
In [14]: s1
Out[14]:
a 0
c 1
b 2
In [15]: s1.ix[1]
Out[15]: 1
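.ix was itself deprecated and removed in later pandas versions; the unambiguous modern spellings are .loc for labels and .iloc for positions. A sketch with the sorted Series above:
In [16]: s1.iloc[1]   # by position
Out[16]: 1
In [17]: s1.loc['c']  # the same element, by label
Out[17]: 1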
I think a reverse/negative dataframe.drop functionality would be a very useful tool.
Has anybody found a way to overcome this?
Generally, I find myself using boolean indexing and the tilde operator to obtain the inverse of a selection, rather than df.drop(), though the same concept applies to df.drop when boolean indexing is used to form the array of labels to drop. Hope that helps.
In [44]: df
Out[44]:
A B
0 0.642010 0.116227
1 0.848426 0.710739
2 0.563803 0.416422
In [45]: cond = (df.A > .6) & (df.B > .3)
In [46]: df[cond]
Out[46]:
A B
1 0.848426 0.710739
In [47]: df[~cond]
Out[47]:
A B
0 0.642010 0.116227
2 0.563803 0.416422
If I understand you right, you can get this effect just by indexing with an "isin" on the index:
>>> df
A B C
0 0.754956 -0.597896 0.245254
1 -0.987808 0.162506 -0.131674
2 -1.064639 -2.193629 1.814078
3 -0.483950 -1.290789 1.776827
4 -0.191055 -0.461204 0.412220
>>> df[df.index.isin([0, 2, 3])] # Drop rows whose label is not in the set [0, 2, 3]
A B C
0 0.754956 -0.597896 0.245254
2 -1.064639 -2.193629 1.814078
3 -0.483950 -1.290789 1.776827
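And negating the condition with ~ gives the reverse-drop behaviour asked about (equivalent here to df.drop([0, 2, 3])):
>>> df[~df.index.isin([0, 2, 3])]  # Keep only rows whose label is NOT in the set
          A         B         C
1 -0.987808  0.162506 -0.131674
4 -0.191055 -0.461204  0.412220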