I have a dataframe like this:
df = pd.DataFrame(columns=['Dog', 'Small', 'Adult'])
df.Dog = ['Poodle', 'Shepard', 'Bird dog','St.Bernard']
df.Small = [1,1,0,0]
df.Adult = 0
That will look like this:
Dog Small Adult
0 Poodle 1 0
1 Shepard 1 0
2 Bird dog 0 0
3 St.Bernard 0 0
Then I would like to change one column based on another. I can do that:
df.loc[df.Small == 0, 'Adult'] = 1
However, I only want to do so for the first three rows.
I can select the first three rows:
df.iloc[0:3]
But if I try to change values on the first three rows:
df.iloc[0:3, df.Small == 0, 'Adult'] = 1
I get an error.
I also get an error if I merge the two:
df.iloc[0:3].loc[df.Small == 0, 'Adult'] = 1
It tells me that I am trying to set a value on a copy of a slice.
How should I do this correctly?
You could include the row range as another condition in your .loc selection (for the general case, I'll explicitly include the lower bound of 0):
df.loc[(df.Small == 0) & (0 <= df.index) & (df.index <= 2), 'Adult'] = 1
Another option is to turn the index into a Series so you can use pd.Series.between (which is inclusive on both ends by default):
df.loc[(df.Small == 0) & (df.index.to_series().between(0, 2)), 'Adult'] = 1
Note that conditions on the index only target the first three rows if the index is sorted so that labels match positions. Alternatively, you can select the rows positionally first and build the label index from them:
first3 = df.iloc[:3]
ind = first3.index[first3.Small == 0]
df.loc[ind, 'Adult'] = 1
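A quick check on the sample frame: only row 2 (Bird dog) is both within the first three rows and has Small == 0, so only its Adult flag flips:
print(df)
#           Dog  Small  Adult
# 0      Poodle      1      0
# 1     Shepard      1      0
# 2    Bird dog      0      1
# 3  St.Bernard      0      0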
Our problem statement looks like:
np.random.seed(0)
df = pd.DataFrame({'Close': np.random.uniform(0, 100, size=10)})
This is sample data; the actual data is a company's stock prices.
Close change
0 54.881350 NaN
1 71.518937 16.637586
2 60.276338 -11.242599
3 54.488318 -5.788019
4 42.365480 -12.122838
We have assigned a threshold with a range of (0, 1).
First, the change between index 1 and index 2 is compared with the threshold value:
if the result is positive and greater than the threshold, assign 1
if the result is negative and less than the threshold, assign -1
if the result is within the threshold range, assign 0
The same is done for indices 2 & 3, and then indices 3 & 4.
Now say the results are as below; the final result is decided by majority voting:
index 1&2 index 2&3 index 3&4 Majority of voting
1 0 1 1
Exception: if the results are 1, 0, -1, then the result is 0.
The final result from the majority vote is then assigned to a new column at index 0, and so on.
EXPECTED RESULT (example):
Close change Result
0 54.881350 NaN 0
1 71.518937 16.637586 1
2 60.276338 -11.242599 -1
3 54.488318 -5.788019 1
4 42.365480 -12.122838 -1
I tried a few times, but couldn't figure out how to implement this.
np.select is what you are looking for:
lbound, ubound = 0, 1
change = df["Close"].diff()
df["Change"] = change
df["Result"] = np.select(
    [
        # The exceptions. Floats are not exact: if a change is "close enough"
        # to 1, 0 or -1, we treat it as equal to that value.
        np.isclose(change, 1) | np.isclose(change, 0) | np.isclose(change, -1),
        # The other conditions
        (change > 0) & (change > ubound),
        (change < 0) & (change < lbound),
        change.between(lbound, ubound),
    ],
    [0, 1, -1, 0],
)
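The snippet above produces the per-pair signals but not the majority-voting step from the question. Here is a minimal sketch of that step, assuming (the question is ambiguous on alignment) that pair "i & i+1" contributes the sign of the diff at row i+1, and that row r is voted on by the three following pairs:

import numpy as np
import pandas as pd

np.random.seed(0)
df = pd.DataFrame({"Close": np.random.uniform(0, 100, size=10)})

lbound, ubound = 0, 1
change = df["Close"].diff()

# per-pair signal in {-1, 0, 1}
signal = pd.Series(np.select([change > ubound, change < lbound], [1, -1], 0),
                   index=df.index)

# Row r is voted on by the signals at rows r+2, r+3 and r+4 (pairs r+1&r+2,
# r+2&r+3, r+3&r+4). A sum of three values in {-1, 0, 1} is positive exactly
# when 1 wins the vote, negative when -1 wins, and the 1/0/-1 exception case
# sums to 0, as the question requires.
votes = signal.rolling(3).sum().shift(-4)
df["Result"] = np.sign(votes).fillna(0).astype(int)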
I want to create an indicator column where several columns can be greater than 0, but one column has to be 0 at all times, e.g.:
df['indicator'] = np.where(( (df['01'] > 0) | (df['02']> 0) | (df['03']> 0) | (df['04']> 0)
& (df['spend'] == 0 )), 1, 0)
I want this flag to be 1 if any of columns 01 to 04 is greater than 0, else 0. But while any of these is > 0, the spend column must be 0 in all cases; e.g. if 01 and 02 are > 0 then spend must be 0, and so on.
However, using the above logic I end up with cases where spend is > 0. What am I missing?
Personally, when working with multiple conditions in a data frame, I use masks. What you are missing is operator precedence: & binds more tightly than |, so in your expression the spend condition only guards the last comparison, (df['04'] > 0), rather than the whole group. Named masks sidestep the parenthesization problem:
col_1_idx = df['01'] > 0
col_2_idx = df['02'] > 0
col_3_idx = df['03'] > 0
col_4_idx = df['04'] > 0
or_col_idx = col_1_idx | col_2_idx | col_3_idx | col_4_idx
spend_idx = df['spend'] == 0
df['indicator'] = np.where(or_col_idx & spend_idx, 1, 0)
IIUC, this can be simplified to:
df['indicator'] = (df[['01','02','03','04']].gt(0).any(axis=1) & df['spend'].eq(0)).astype(int)
I use .gt(), .lt(), .eq(), .le(), etc. a lot to avoid the parenthesization issues we run into with & and |.
You really don't need np.where when your desired output is essentially a numeric Boolean.
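To make the precedence point concrete, here is a minimal sketch with made-up data (column names as in the question):

import numpy as np
import pandas as pd

# hypothetical data: row 0 has spend > 0 and should NOT be flagged
df = pd.DataFrame({'01': [1, 0], '02': [0, 2], '03': [0, 0],
                   '04': [0, 0], 'spend': [5, 0]})

# how the original expression actually groups: spend == 0 only
# guards the df['04'] comparison
buggy = np.where((df['01'] > 0) | (df['02'] > 0) | (df['03'] > 0)
                 | ((df['04'] > 0) & (df['spend'] == 0)), 1, 0)

# the intended logic: the whole "any column > 0" group AND spend == 0
fixed = np.where(df[['01', '02', '03', '04']].gt(0).any(axis=1)
                 & df['spend'].eq(0), 1, 0)

print(buggy)  # [1 1] -- row 0 is wrongly flagged despite spend == 5
print(fixed)  # [0 1]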
How do I set the values of a pandas dataframe slice, where the rows are chosen by a boolean expression and the columns are chosen by position?
I have done it in the following way so far:
>>> vals = [5,7]
>>> df = pd.DataFrame({'a':[1,2,3,4], 'b':[5,5,7,7]})
>>> df
a b
0 1 5
1 2 5
2 3 7
3 4 7
>>> df.iloc[:,1][df.iloc[:,1] == vals[0]] = 0
>>> df
a b
0 1 0
1 2 0
2 3 7
3 4 7
This works as expected on this small sample, but gives me the following warning on my real-life dataframe:
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
What is the recommended way to achieve this?
Use DataFrame.columns and DataFrame.loc:
col = df.columns[1]
df.loc[df.loc[:,col] == vals[0], col] = 0
One way is to use the column label from df.columns together with loc (label-based indexing):
df.loc[df.iloc[:, 1] == vals[0], df.columns[1]] = 0
Another way is to use np.where with iloc (integer-position indexing); np.where returns a tuple of arrays giving the positions where the condition is True, so [0] extracts the row positions:
df.iloc[np.where(df.iloc[:, 1] == vals[0])[0], 1] = 0
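Equivalently (a small sketch, not from the original answer), np.flatnonzero gives the row positions directly:
rows = np.flatnonzero(df.iloc[:, 1] == vals[0])
df.iloc[rows, 1] = 0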
Note that combining loc and iloc by chaining, e.g. df.loc[df.iloc[:, 1] == vals[0]].iloc[:, 1] = 0, runs into the same copy-of-a-slice problem as the original code: the first call may return a copy, so the assignment can be silently lost. The boolean mask and the column selection must go into a single indexing call, as in the answers above.
I'm a longtime SAS user trying to get into pandas. I'd like to set a column's value based on a variety of if conditions. I think I can do it using nested np.where calls, but thought I'd check whether there's a more elegant solution. For instance, if I set a left bound and a right bound and want a column of strings saying whether x is left of, between, or right of those boundaries, what is the best way to do it? Basically: if x < lbound return "left", else if lbound < x < rbound return "middle", else if x > rbound return "right".
df
lbound rbound x
0 -1 1 0
1 5 7 1
2 0 1 2
I can check for one condition using np.where:
df['area'] = np.where(df['x']>df['rbound'],'right','somewhere else')
But I'm not sure what to do if I want to check multiple if/else-if conditions in a single statement.
Output should be:
df
lbound rbound x area
0 -1 1 0 middle
1 5 7 1 left
2 0 1 2 right
Option 1
You can use nested np.where statements. For example:
df['area'] = np.where(df['x'] > df['rbound'], 'right',
                      np.where(df['x'] < df['lbound'], 'left',
                               'somewhere else'))
Option 2
You can use the .loc accessor to assign to specific ranges. Note you will have to create the new column before use; we take this opportunity to set the default, which may be overwritten later.
df['area'] = 'somewhere else'
df.loc[df['x'] > df['rbound'], 'area'] = 'right'
df.loc[df['x'] < df['lbound'], 'area'] = 'left'
Explanation
These are both valid alternatives with comparable performance. The calculations are vectorised in both instances. My preference is for Option 2 as it seems more readable. If there are a large number of nested criteria, np.where may be more convenient.
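If you want to check the "comparable performance" claim on your own data, a rough sketch along these lines works; no numbers are claimed here, since they vary by machine and frame size:

import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'lbound': -1, 'rbound': 1, 'x': np.random.randn(100_000)})

def option1():
    return np.where(df['x'] > df['rbound'], 'right',
                    np.where(df['x'] < df['lbound'], 'left', 'somewhere else'))

def option2():
    area = pd.Series('somewhere else', index=df.index)
    area.loc[df['x'] > df['rbound']] = 'right'
    area.loc[df['x'] < df['lbound']] = 'left'
    return area

print(timeit.timeit(option1, number=100))
print(timeit.timeit(option2, number=100))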
You can use np.select instead of np.where:
cond = [df['x'].between(df['lbound'], df['rbound']),
        df['x'] < df['lbound'],
        df['x'] > df['rbound']]
output = ['middle', 'left', 'right']
df['area'] = np.select(cond, output, default=np.nan)
lbound rbound x area
0 -1 1 0 middle
1 5 7 1 left
2 0 1 2 right
I am filtering rows in a dataframe by values in two columns.
For some reason the OR operator behaves like I would expect AND operator to behave and vice versa.
My test code:
df = pd.DataFrame({'a': range(5), 'b': range(5) })
# let's insert some -1 values
df['a'][1] = -1
df['b'][1] = -1
df['a'][3] = -1
df['b'][4] = -1
df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a != -1) | (df.b != -1)]
print(pd.concat([df, df1, df2], axis=1,
keys = [ 'original df', 'using AND (&)', 'using OR (|)',]))
And the result:
original df using AND (&) using OR (|)
a b a b a b
0 0 0 0 0 0 0
1 -1 -1 NaN NaN NaN NaN
2 2 2 2 2 2 2
3 -1 3 NaN NaN -1 3
4 4 -1 NaN NaN 4 -1
[5 rows x 6 columns]
As you can see, the AND operator drops every row in which at least one value equals -1. On the other hand, the OR operator requires both values to be equal to -1 to drop them. I would expect exactly the opposite result. Could anyone explain this behavior?
I am using pandas 0.13.1.
"As you can see, the AND operator drops every row in which at least one value equals -1. On the other hand, the OR operator requires both values to be equal to -1 to drop them."
That's right. Remember that you're writing the condition in terms of what you want to keep, not in terms of what you want to drop. For df1:
df1 = df[(df.a != -1) & (df.b != -1)]
You're saying "keep the rows in which df.a isn't -1 and df.b isn't -1", which is the same as dropping every row in which at least one value is -1.
For df2:
df2 = df[(df.a != -1) | (df.b != -1)]
You're saying "keep the rows in which either df.a or df.b is not -1", which is the same as dropping rows where both values are -1.
PS: chained access like df['a'][1] = -1 can get you into trouble. It's better to get into the habit of using .loc and .iloc.
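For instance, the -1 insertions at the top could be written with .loc instead (a small illustrative rewrite of the setup code):
df.loc[1, ['a', 'b']] = -1
df.loc[3, 'a'] = -1
df.loc[4, 'b'] = -1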
Late answer, but you can also use query(), e.g. to reproduce df1 above:
df_filtered = df.query('a != -1 & b != -1')
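query() can also reference local Python variables with @, which fits the -1 filter here (a small sketch):
bad = -1
df_filtered = df.query('a != @bad & b != @bad')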
A little mathematical logic theory here:
"NOT a AND NOT b" is the same as "NOT (a OR b)", so:
"a != -1 AND b != -1" is equivalent to "NOT (a == -1 OR b == -1)", which is the complement of "(a == -1 OR b == -1)".
So if you want the two filters to be exact opposites, df1 and df2 should be as below:
df1 = df[(df.a != -1) & (df.b != -1)]
df2 = df[(df.a == -1) | (df.b == -1)]
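A quick sanity check that the two masks really are complements:
mask_and = (df.a != -1) & (df.b != -1)
mask_or = (df.a == -1) | (df.b == -1)
(~mask_and).equals(mask_or)  # True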
You can try the following:
df1 = df[(df['a'] != -1) & (df['b'] != -1)]
By De Morgan's laws, (i) the negation of a union is the intersection of the negations, and (ii) the negation of an intersection is the union of the negations, i.e.,
NOT (A OR B) <=> NOT A AND NOT B
NOT (A AND B) <=> NOT A OR NOT B
If the aim is to
drop every row in which at least one value equals -1
you can either use the AND operator to identify the rows to keep or the OR operator to identify the rows to drop.
# select rows where both a and b values are not equal to -1
df2_0 = df[df['a'].ne(-1) & df['b'].ne(-1)]
# index of rows where at least one of a or b equals -1
idx = df.index[df.eval('a == -1 or b == -1')]
# drop `idx` rows
df2_1 = df.drop(idx)
df2_0.equals(df2_1) # True
On the other hand, if the aim is to
drop every row in which both values equal -1
you do the exact opposite: either use the OR operator to identify the rows to keep or the AND operator to identify the rows to drop.
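For completeness, a sketch of that second case, mirroring the snippet above:
# keep rows where at least one of a or b is not -1
df3_0 = df[df['a'].ne(-1) | df['b'].ne(-1)]

# equivalently, drop rows where both a and b equal -1
idx = df.index[df.eval('a == -1 and b == -1')]
df3_1 = df.drop(idx)

df3_0.equals(df3_1)  # True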