Replace values in a dataframe column based on condition - python

I have a seemingly easy task: a dataframe with 2 columns, A and B. If a value in B is larger than the value in A, replace it with the value of A. I used to do this with df.B[df.B > df.A] = df.A, but a recent upgrade of pandas started giving a SettingWithCopyWarning for this chained assignment. The official documentation recommends using .loc.
Okay, I said, and did it with df.loc[df.B > df.A, 'B'] = df.A, and it all works fine unless column B contains only NaN values. Then something weird happens:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 -9223372036854775808
2 3 -9223372036854775808
Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 4
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 NaN
1 2 2
2 3 NaN
But if none of B's elements satisfies the condition, then all NaNs get replaced with -9223372036854775808:
In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 1
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 1
2 3 -9223372036854775808
Is this a bug or a feature? How should I have done this replacement?
Thank you!

This is a bug, fixed here.
Since pandas allows basically anything to be set on the right-hand side of an expression in loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:
df.loc[lhs, column] = rhs
where rhs could be a list, array, or scalar, and lhs could be a slice, tuple, scalar, or array.
There is also a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs (this is a bit complicated). For example, say you don't set all of the elements on the lhs and the column was integer: then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer, it needs to be coerced BACK to integer.
In this particular case, the lhs is an array, so we would normally try to coerce the lhs to the dtype of the rhs, but this case degenerates if we have an unsafe conversion (int -> float).
Suffice to say this was a missing edge case.
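As a practical workaround (a sketch of my own, not taken from the answer above), you can sidestep the .loc edge case entirely by building the new column with Series.mask, which replaces values where the condition holds and leaves everything else, including the NaNs, untouched:

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [np.nan, np.nan, np.nan]})

# Replace B with A wherever B > A; comparisons involving NaN are False,
# so rows with NaN in B simply keep their NaN.
df['B'] = df['B'].mask(df['B'] > df['A'], df['A'])

print(df)
# A keeps [1, 2, 3] and B stays all NaN, with no integer-coercion surprise.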

Related

Boolean Indexing along the row axis of a DataFrame in pandas

a = [ [1,2,3,4,5], [6,np.nan,8,np.nan,10]]
df = pd.DataFrame(a, columns=['a', 'b', 'c', 'd', 'e'], index=['foo', 'bar'])
In [5]: df
Out[5]:
a b c d e
foo 1 2.0 3 4.0 5
bar 6 NaN 8 NaN 10
I understand how normal boolean indexing works; for example, if I want to select the rows that have c > 3, I would write df[df.c > 3]. However, what if I want to do that along the row axis? Say I want only the columns where the value in row 'bar' is NaN.
I would have assumed that the following should do it, due to the similarity of df['a'] and df.loc['bar']:
df.loc[df.loc['bar'].isnull()]
But it doesn't, and neither does results[results.loc['hl'].isnull()]; both give the same error: *** pandas.core.indexing.IndexingError: Unalignable boolean Series key provided
So how would I do it?
IIUC you want to use the boolean mask to mask the columns:
In [135]:
df[df.columns[df.loc['bar'].isnull()]]
Out[135]:
b d
foo 2.0 4.0
bar NaN NaN
Or you can use ix and convert the Series to a NumPy array:
In [138]:
df.ix[:,df.loc['bar'].isnull().values]
Out[138]:
b d
foo 2.0 4.0
bar NaN NaN
The problem here is that the boolean series returned is a mask on the columns:
In [136]:
df.loc['bar'].isnull()
Out[136]:
a False
b True
c False
d True
e False
Name: bar, dtype: bool
but your row index contains none of these column values as labels, hence the error. So you need to apply the mask against the columns, or you can pass a NumPy array to mask the columns with ix.
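As a side note (my addition, assuming a reasonably recent pandas rather than the version in the question), the column selector of .loc also accepts a boolean Series aligned on the column labels, which avoids the deprecated ix entirely:

import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4, 5], [6, np.nan, 8, np.nan, 10]],
                  columns=['a', 'b', 'c', 'd', 'e'], index=['foo', 'bar'])

# All rows, but only the columns where row 'bar' is NaN; the boolean Series
# is indexed by the column labels, so .loc aligns it along the columns axis.
print(df.loc[:, df.loc['bar'].isnull()])
#        b    d
# foo  2.0  4.0
# bar  NaN  NaN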

Filter rows of pandas dataframe whose values are lower than 0

I have a pandas dataframe like this
df = pd.DataFrame(data=[[21, 1], [32, -4], [-4, 14], [3, 17], [-7, np.nan]], columns=['a', 'b'])
df
I want to remove all rows with negative values in a list of columns, while keeping rows with NaN.
In my example there are only 2 columns, but I have more in my dataset, so I can't do it one by one.
If you want to apply it to all columns, do df[df > 0] with dropna():
>>> df[df > 0].dropna()
a b
0 21 1
3 3 17
If you know which columns to apply it to, then do it for only those cols with df[df[cols] > 0]:
>>> cols = ['b']
>>> df[cols] = df[df[cols] > 0][cols]
>>> df.dropna()
a b
0 21 1
2 -4 14
3 3 17
I've found you can simplify the answer by just doing this:
>>> cols = ['b']
>>> df = df[df[cols] > 0]
dropna() is not an in-place method, so you have to store the result.
>>> df = df.dropna()
I was looking for a solution to this that doesn't change the dtype (which will happen if NaNs are mixed in with ints, as in the answers that use dropna). Since the questioner already had a NaN in their data, that may not be an issue for them. I went with this solution, which preserves the int64 dtype. Here it is with my sample data:
df = pd.DataFrame(data={'a':[0, 1, 2], 'b': [-1,0,1], 'c': [-2, -1, 0]})
columns = ['b', 'c']
filter_ = (df[columns] >= 0).all(axis=1)
df[filter_]
a b c
2 2 1 0
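If you also need the original requirement of keeping rows whose only problem is a NaN, one option (a sketch of mine, not from the answers above) is to OR the comparison with an isna() check before reducing with all(axis=1):

import numpy as np
import pandas as pd

df = pd.DataFrame(data=[[21, 1], [32, -4], [-4, 14], [3, 17], [-7, np.nan]],
                  columns=['a', 'b'])
cols = ['a', 'b']

# Keep a row if every checked column is either non-negative or NaN.
keep = ((df[cols] >= 0) | df[cols].isna()).all(axis=1)
print(df[keep])
# Rows 0 and 3 survive; a row like [5, NaN] would also be kept.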

Filtering rows from pandas dataframe using concatenated strings

I have a pandas dataframe plus a pandas series of identifiers, and would like to filter the rows from the dataframe that correspond to the identifiers in the series. To get the identifiers from the dataframe, I need to concatenate its first two columns. I have tried various things to filter, but none seem to work so far. Here is what I have tried:
1) I tried adding a column of booleans to the data frame, being true if that row corresponds to one of the identifiers, and false otherwise (hoping to be able to do filtering afterwards using the new column):
df["isInAcids"] = (df["AcNo"] + df["Sortcode"]) in acids
where
acids
is the series containing the identifiers.
However, this gives me a
TypeError: unhashable type
2) I tried filtering using the apply function:
df[df.apply(lambda x: x["AcNo"] + x["Sortcode"] in acids, axis = 1)]
This doesn't give me an error, but the length of the data frame remains unchanged, so it doesn't appear to filter anything.
3) I have added a new column, containing the concatenated strings/identifiers, and then try to filter afterwards (see Filter dataframe rows if value in column is in a set list of values):
df["ACIDS"] = df["AcNo"] + df["Sortcode"]
df[df["ACIDS"].isin(acids)]
But again, the dataframe doesn't change.
I hope this makes sense...
Any suggestions where I might be going wrong?
Thanks,
Anne
I think you're asking for something like the following:
In [1]: other_ids = pd.Series(['a', 'b', 'c', 'c'])
In [2]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'c', 'f']})
In [3]: df
Out[3]:
ids vals
0 a 1
1 b 2
2 c 3
3 f 4
In [4]: other_ids
Out[4]:
0 a
1 b
2 c
3 c
dtype: object
In this case, the series other_ids would be like your series acids. We want to select just those rows of df whose id is in the series other_ids. To do that we'll use the Series method .isin() on the ids column.
In [5]: df.ids.isin(other_ids)
Out[5]:
0 True
1 True
2 True
3 False
Name: ids, dtype: bool
This gives a column of bools that we can index into:
In [6]: df[df.ids.isin(other_ids)]
Out[6]:
ids vals
0 a 1
1 b 2
2 c 3
This is close to what you're doing with your 3rd attempt. Once you post a sample of your dataframe I can edit this answer, if it doesn't work already.
Reading a bit more, you may be having trouble because you have two columns in df that are your ids? Dataframe doesn't have an isin method, but we can get around that with something like:
In [26]: df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'f'],
'ids2': ['e', 'f', 'c', 'f']})
In [27]: df
Out[27]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
3 f f 4
In [28]: df.ids.isin(other_ids) + df.ids2.isin(other_ids)
Out[28]:
0 True
1 True
2 True
3 False
dtype: bool
True is like 1 and False is like 0, so adding the two boolean series from the two isin() calls acts like an OR operation. Then, as before, we can index with this boolean series:
In [29]: new = df.ix[df.ids.isin(other_ids) + df.ids2.isin(other_ids)]
In [30]: new
Out[30]:
ids ids2 vals
0 a e 1
1 b f 2
2 f c 3
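Applied back to the question's own column names (AcNo, Sortcode and the series acids; the data below is made up), the third attempt is sound once the concatenated key and the series hold the same dtype, a mismatch (e.g. numbers vs. strings) being the usual reason isin silently matches nothing:

import pandas as pd

# Hypothetical data shaped like the question.
df = pd.DataFrame({'AcNo': ['12', '34', '56'],
                   'Sortcode': ['AB', 'CD', 'EF'],
                   'vals': [1, 2, 3]})
acids = pd.Series(['12AB', '56EF'])

# Cast both columns to str before concatenating so the key and `acids`
# compare as the same dtype, then keep only the matching rows.
key = df['AcNo'].astype(str) + df['Sortcode'].astype(str)
print(df[key.isin(acids)])
#   AcNo Sortcode  vals
# 0   12       AB     1
# 2   56       EF     3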

Access value by location in sorted pandas series with integer index

I have a pandas Series with an integer index which I've sorted (by value); how do I access values by position in this Series?
For example:
s_original = pd.Series({0: -0.000213, 1: 0.00031399999999999999, 2: -0.00024899999999999998, 3: -2.6999999999999999e-05, 4: 0.000122})
s_sorted = np.sort(s_original)
In [3]: s_original
Out[3]:
0 -0.000213
1 0.000314
2 -0.000249
3 -0.000027
4 0.000122
In [4]: s_sorted
Out[4]:
2 -0.000249
0 -0.000213
3 -0.000027
4 0.000122
1 0.000314
In [5]: s_sorted[3]
Out[5]: -2.6999999999999999e-05
But I would like to get the value 0.000122, i.e. the item at position 3.
How can I do this?
Replace the line
s_sorted = np.sort(s_original)
with
s_sorted = pd.Series(np.sort(s_original), index=s_original.index)
This will sort the values, but keep the index.
EDIT:
To get the fourth value in the sorted Series:
np.sort(s_original).values[3]
You can use iget to retrieve by position:
(In fact, this method was created especially to overcome this ambiguity.)
In [1]: s = pd.Series([0, 2, 1])
In [2]: s.sort()
In [3]: s
Out[3]:
0 0
2 1
1 2
In [4]: s.iget(1)
Out[4]: 1
The behaviour of .ix with an integer index is noted in the pandas "gotchas":
In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .ix.
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
Note: this would work if you were using a non-integer index, where .ix is not ambiguous.
For example:
In [11]: s1 = pd.Series([0, 2, 1], list('abc'))
In [12]: s1
Out[12]:
a 0
b 2
c 1
In [13]: s1.sort()
In [14]: s1
Out[14]:
a 0
c 1
b 2
In [15]: s1.ix[1]
Out[15]: 1
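For completeness (my addition, assuming current pandas where Series.sort() and iget have been removed), the same idea today is sort_values followed by positional indexing with .iloc:

import pandas as pd

s_original = pd.Series({0: -0.000213, 1: 0.000314, 2: -0.000249,
                        3: -0.000027, 4: 0.000122})

# Sort by value, then pick the item at position 3 (the fourth-smallest).
s_sorted = s_original.sort_values()
print(s_sorted.iloc[3])   # 0.000122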

pyPandas functionality request: reverse/negative df.drop

I think a reverse/negative dataframe.drop functionality would be a very useful tool.
Has anybody found a way to overcome this?
Generally, I find myself using boolean indexing and the tilde operator when obtaining the inverse of a selection, rather than df.drop(). The same concept applies to df.drop when boolean indexing is used to form the array of labels to drop. Hope that helps.
In [44]: df
Out[44]:
A B
0 0.642010 0.116227
1 0.848426 0.710739
2 0.563803 0.416422
In [45]: cond = (df.A > .6) & (df.B > .3)
In [46]: df[cond]
Out[46]:
A B
1 0.848426 0.710739
In [47]: df[~cond]
Out[47]:
A B
0 0.642010 0.116227
2 0.563803 0.416422
If I understand you right, you can get this effect just by indexing with an "isin" on the index:
>>> df
A B C
0 0.754956 -0.597896 0.245254
1 -0.987808 0.162506 -0.131674
2 -1.064639 -2.193629 1.814078
3 -0.483950 -1.290789 1.776827
4 -0.191055 -0.461204 0.412220
>>> df[df.index.isin([0, 2, 3])] # Drop rows whose label is not in the set [0, 2, 3]
A B C
0 0.754956 -0.597896 0.245254
2 -1.064639 -2.193629 1.814078
3 -0.483950 -1.290789 1.776827
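Putting the two answers together (a small sketch with made-up data), index.isin plus the tilde gives you both directions of the operation:

import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(5, 10)})
labels = [0, 2, 3]

kept = df[df.index.isin(labels)]      # "reverse drop": keep only these labels
dropped = df[~df.index.isin(labels)]  # same rows as df.drop(labels)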
