I think a reverse/negative DataFrame.drop functionality would be a very useful tool.
Has anybody come up with a workaround for this?
Generally, I find myself using boolean indexing and the tilde operator to obtain the inverse of a selection, rather than df.drop(); the same concept applies to df.drop when boolean indexing is used to form the array of labels to drop (see the sketch after the example below). Hope that helps.
In [44]: df
Out[44]:
A B
0 0.642010 0.116227
1 0.848426 0.710739
2 0.563803 0.416422
In [45]: cond = (df.A > .6) & (df.B > .3)
In [46]: df[cond]
Out[46]:
A B
1 0.848426 0.710739
In [47]: df[~cond]
Out[47]:
A B
0 0.642010 0.116227
2 0.563803 0.416422
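If you'd rather express it through df.drop itself, the boolean mask can supply the labels to drop. A minimal sketch of that equivalence, reusing the df and cond from the transcript above:
# Dropping the labels where cond holds is the same as keeping ~cond.
dropped = df.drop(df[cond].index)
assert dropped.equals(df[~cond])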
If I understand you right, you can get this effect just by indexing with an "isin" on the index:
>>> df
A B C
0 0.754956 -0.597896 0.245254
1 -0.987808 0.162506 -0.131674
2 -1.064639 -2.193629 1.814078
3 -0.483950 -1.290789 1.776827
4 -0.191055 -0.461204 0.412220
>>> df[df.index.isin([0, 2, 3])]  # keep rows whose label is in [0, 2, 3], i.e. drop all the others
A B C
0 0.754956 -0.597896 0.245254
2 -1.064639 -2.193629 1.814078
3 -0.483950 -1.290789 1.776827
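Negating the mask gives the reverse drop the question asks about: remove exactly those labels and keep the rest. Reusing the df above:
>>> df[~df.index.isin([0, 2, 3])]
          A         B         C
1 -0.987808  0.162506 -0.131674
4 -0.191055 -0.461204  0.412220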
I have a DataFrame df:
A B
a 2 2
b 3 1
c 1 3
I want to create a new column based on the following criteria:
if row A == B: 0
if row A > B: 1
if row A < B: -1
so given the above table, it should be:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
For typical if else cases I do np.where(df.A > df.B, 1, -1), does pandas provide a special syntax for solving my problem with one step (without the necessity of creating 3 new columns and then combining the result)?
To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val
Then apply it to your dataframe passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Of course, this is not vectorized, so performance may suffer when scaled to a large number of records. Still, I think it is much more readable, especially coming from a SAS background.
Edit
Here is the vectorized version
df['C'] = np.where(
    df['A'] == df['B'], 0, np.where(
        df['A'] > df['B'], 1, -1))
Another way is to solve it with indexing:
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1
The first line of code reads: where column A equals column B, create column C and set it to 0.
For this particular relationship, you could use np.sign:
>>> df["C"] = np.sign(df.A - df.B)
>>> df
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
When you have multiple if conditions, numpy.select is the way to go:
In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]
In [4100]: df['C'] = np.select(conditions, choices)
In [4101]: df
Out[4101]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
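As a side note, np.select also accepts a default for rows that match none of the conditions, so the equality branch could be folded into it; a small sketch of that variant:
df['C'] = np.select([df.A.gt(df.B), df.A.lt(df.B)], [1, -1], default=0)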
Let's say the above is your original dataframe and you want to add a new column, 'elderly'. If age is greater than or equal to 50, we mark the person as elderly="yes", otherwise "no".
Step 1: get the indexes of rows whose age is greater than or equal to 50:
row_indexes = df[df['age'] >= 50].index
Step 2: using .loc we can assign a new value to the column:
df.loc[row_indexes, 'elderly'] = "yes"
The same for ages below 50:
row_indexes = df[df['age'] < 50].index
df.loc[row_indexes, 'elderly'] = "no"
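If you prefer a one-step version, the same flag can be built with np.where; a minimal sketch, assuming the 'age' column above:
import numpy as np
df['elderly'] = np.where(df['age'] >= 50, 'yes', 'no')  # vectorized yes/no flag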
You can use the method mask:
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)
I am trying to make the last two rows of my dataframe df the first two rows, with the previous first row becoming the third row after the shift. It's because I just added the rows [3, 0.3232, 0, 0, 2, 0.500] and [6, 0.3232, 0, 0, 2, 0.500]; however, these get added to the end of df and hence become the last two rows, when I want them to be the first two. I was just wondering how to do this.
df = df.T
df[0] = [3,0.3232, 0, 0, 2,0.500]
df[1] = [6,0.3232, 0, 0, 2,0.500]
df = df.T
df = df.reset_index()
You can just call reindex and pass the new desired order:
In [14]:
df = pd.DataFrame({'a':['a','b','c']})
df
Out[14]:
a
0 a
1 b
2 c
In [16]:
df.reindex([1,2,0])
Out[16]:
a
1 b
2 c
0 a
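For the original goal (last two rows first), you don't have to spell the order out by hand; a small sketch, assuming a default integer index, that rotates the existing labels with np.roll and feeds them to reindex:
import numpy as np
df.reindex(np.roll(df.index, 2))  # the last 2 labels move to the front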
EDIT
Another method would be to use np.roll; note that this returns an np.array, so we have to explicitly select the columns from the df to overwrite them:
In [30]:
df = pd.DataFrame({'a':['a','b','c'], 'b':np.arange(3)})
df
Out[30]:
a b
0 a 0
1 b 1
2 c 2
In [42]:
df[df.columns] = np.roll(df, shift=-1, axis=0)
df
Out[42]:
a b
0 b 1
1 c 2
2 a 0
The axis=0 param seems to be necessary, otherwise the column order is not preserved:
In [44]:
df[df.columns] = np.roll(df, shift=-1)
df
Out[44]:
a b
0 0 b
1 1 c
2 2 a
Unless I'm missing something, the easiest solution is just to add the new rows to the beginning in the first place:
import numpy as np
import pandas as pd

existing_rows = pd.DataFrame(np.random.randn(4, 3))
new_rows = pd.DataFrame(np.random.randn(2, 3))
new_rows.append(existing_rows)
0 1 2
0 0.406690 -0.699925 0.449278
1 1.729282 0.387896 0.652381
0 0.091711 1.634247 0.749282
1 1.354132 -0.180248 -1.880638
2 -0.151871 -1.266152 0.333071
3 1.351072 -0.421404 -0.951583
If you really want to switch rows you can do as EdChum suggests. Another way is like this:
df.iloc[-2:].append(df.iloc[:-2])
I think this is slightly simpler than np.roll as suggested by EdChum, but numpy is generally faster, so I'd use np.roll if you care about speed. (Some quick tests on 1,000x3 data suggest np.roll is about 3x to 4x faster than append.)
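For reference, DataFrame.append was removed in pandas 2.0, so on current versions the same shuffle can be written with pd.concat:
import pandas as pd
df = pd.concat([df.iloc[-2:], df.iloc[:-2]])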
I have a seemingly easy task: a DataFrame with two columns, A and B. If a value in B is larger than the value in A, replace it with the value from A. I used to do this with df.B[df.B > df.A] = df.A, but a recent upgrade of pandas started giving a SettingWithCopyWarning when encountering this chained assignment. The official documentation recommends using .loc.
Okay, I said, and did it through df.loc[df.B > df.A, 'B'] = df.A, and it all works fine, unless column B consists entirely of NaN values. Then something weird happens:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, np.NaN, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 NaN
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 -9223372036854775808
2 3 -9223372036854775808
Now, if even one of B's elements satisfies the condition (larger than A), then it all works fine:
In [1]: df = pd.DataFrame({'A': [1, 2, 3],'B': [np.NaN, 4, np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 4
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 NaN
1 2 2
2 3 NaN
But if none of B's elements satisfies the condition, all the NaNs get replaced with -9223372036854775808:
In [1]: df = pd.DataFrame({'A':[1,2,3],'B':[np.NaN,1,np.NaN]})
In [2]: df
Out[2]:
A B
0 1 NaN
1 2 1
2 3 NaN
In [3]: df.loc[df.B > df.A, 'B'] = df.A
In [4]: df
Out[4]:
A B
0 1 -9223372036854775808
1 2 1
2 3 -9223372036854775808
Is this a bug or a feature? How should I have done this replacement?
Thank you!
This is a bug, fixed here.
Since pandas allows basically anything to be set on the right-hand side of an expression in .loc, there are probably 10+ cases that need to be disambiguated. To give you an idea:
df.loc[lhs, column] = rhs
where rhs could be a list, array, or scalar, and lhs could be a slice, tuple, scalar, or array.
There is a small subset of cases where the resulting dtype of the column needs to be inferred / set according to the rhs. (This is a bit complicated.) For example, say you don't set all of the elements on the lhs and it was integer; then you need to coerce to float. But if you did set all of the elements AND the rhs was an integer, then it needs to be coerced BACK to integer.
In this particular case, the lhs is an array, so we would normally try to coerce the lhs to the type of the rhs, but this case degenerates if we have an unsafe conversion (int -> float).
Suffice to say, this was a missing edge case.
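Until you are on a version with the fix, one way to sidestep the unsafe cast entirely is Series.mask, since NaN > x evaluates to False and all-NaN rows are simply left alone; a minimal sketch:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [np.nan, np.nan, np.nan]})
df['B'] = df['B'].mask(df['B'] > df['A'], df['A'])  # NaN rows stay NaN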
I have a pandas Series with an integer index which I've sorted (by value); how do I access values by position in this Series?
For example:
s_original = pd.Series({0: -0.000213, 1: 0.00031399999999999999, 2: -0.00024899999999999998, 3: -2.6999999999999999e-05, 4: 0.000122})
s_sorted = np.sort(s_original)
In [3]: s_original
Out[3]:
0 -0.000213
1 0.000314
2 -0.000249
3 -0.000027
4 0.000122
In [4]: s_sorted
Out[4]:
2 -0.000249
0 -0.000213
3 -0.000027
4 0.000122
1 0.000314
In [5]: s_sorted[3]
Out[5]: -2.6999999999999999e-05
But I would like to get the value 0.000122 i.e. the item in position 3.
How can I do this?
Replace the line
s_sorted = np.sort(s_original)
with
s_sorted = pd.Series(np.sort(s_original), index=s_original.index)
This will sort the values, but keep the original index.
EDIT:
To get the fourth value in the sorted Series:
np.sort(s_original).values[3]
You can use iget to retrieve by position:
(In fact, this method was created especially to overcome this ambiguity.)
In [1]: s = pd.Series([0, 2, 1])
In [2]: s.sort()
In [3]: s
Out[3]:
0 0
2 1
1 2
In [4]: s.iget(1)
Out[4]: 1
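(For readers on current pandas: iget was later deprecated and removed; the positional accessor today is .iloc, e.g. s.iloc[1] gives the same result.)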
The behaviour of .ix with an integer index is noted in the pandas "gotchas":
In pandas, our general viewpoint is that labels matter more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the standard tools like .ix.
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the API change was made to stop “falling back” on position-based indexing).
Note: this would work if you were using a non-integer index, where .ix is not ambiguous.
For example:
In [11]: s1 = pd.Series([0, 2, 1], list('abc'))
In [12]: s1
Out[12]:
a 0
b 2
c 1
In [13]: s1.sort()
In [14]: s1
Out[14]:
a 0
c 1
b 2
In [15]: s1.ix[1]
Out[15]: 1
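Worth noting for readers on current pandas: .ix has since been deprecated and removed, so the unambiguous spellings today would be, as a sketch against the sorted s1 above:
s1.loc['c']   # label-based lookup  -> 1
s1.iloc[1]    # position-based lookup -> 1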