Set value of pandas data frame on conditional - python

I can't find a similar question for this query. I have a pandas DataFrame where I want to use two of the columns to build a conditional and, if it's true, replace the values in one of those columns.
For example, one of my columns is 'itemname' and the other is 'value'. The 'itemname' may be repeated many times. For each 'itemname', I want to check whether all rows with that name have value 0, and if so, replace their 'value' with 100.
I know this should be simple, however I cannot get my head around it.
Just to make it clearer, here is my DataFrame:
itemname value
0 a 0
1 b 100
2 c 0
3 a 0
3 b 75
3 c 90
I would like my statement to change this data frame to
itemname value
0 a 100
1 b 100
2 c 0
3 a 100
3 b 75
3 c 90
Hope that makes sense. I checked whether someone else had asked something similar and couldn't find anything for this case.

Using transform with any:
df.loc[~df.groupby('itemname').value.transform('any'), 'value'] = 100
Using numpy.where:
s = ~df.groupby('itemname').value.transform('any')
df.assign(value=np.where(s, 100, df.value))
Using addition and multiplication:
s = ~df.groupby('itemname').value.transform('any')
df.assign(value=df.value + (100 * s))
All three produce the correct output; however, the np.where and addition/multiplication solutions use assign, so they don't modify the DataFrame in place:
itemname value
0 a 100
1 b 100
2 c 0
3 a 100
3 b 75
3 c 90
Explanation
~df.groupby('itemname').value.transform('any')
0 True
1 False
2 False
3 True
3 False
3 False
Name: value, dtype: bool
Since 0 is a falsy value, we can use any and negate the result to find groups where all values are equal to 0.
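Putting the first approach together as a self-contained sketch (the DataFrame construction here is mine, mirroring the example above):
import numpy as np
import pandas as pd

df = pd.DataFrame({'itemname': ['a', 'b', 'c', 'a', 'b', 'c'],
                   'value': [0, 100, 0, 0, 75, 90]})

# groups where no value is truthy (i.e. all zeros), via negated any()
mask = ~df.groupby('itemname').value.transform('any')
df.loc[mask, 'value'] = 100
print(df)  # the 'a' rows become 100; 'b' and 'c' keep their values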

You can use GroupBy + transform to create a mask. Then assign via pd.DataFrame.loc and Boolean indexing:
mask = df.groupby('itemname')['value'].transform(lambda x: x.eq(0).all())
df.loc[mask.astype(bool), 'value'] = 100
print(df)
itemname value
0 a 100
1 b 100
2 c 0
3 a 100
3 b 75
3 c 90
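For reference, the intermediate mask on the sample data looks like this (a quick sketch; transform with a lambda may come back as object dtype depending on the pandas version, which is why astype(bool) is applied above):
mask = df.groupby('itemname')['value'].transform(lambda x: x.eq(0).all())
print(mask)
# 0     True
# 1    False
# 2    False
# 3     True
# 3    False
# 3    False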

If all your values are positive or 0, you could use transform with sum and check whether the group sum is 0:
m = (df.groupby('itemname').transform('sum') == 0)['value']
df.loc[m, 'value'] = 100
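The caveat matters: with negative values a group can sum to 0 without every value being 0. A tiny sketch of that failure case (the data here is invented for illustration):
import pandas as pd

df2 = pd.DataFrame({'itemname': ['a', 'a'], 'value': [-5, 5]})
m = (df2.groupby('itemname').transform('sum') == 0)['value']
# m is True for both rows even though neither value is 0,
# so both would wrongly be set to 100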

Related

How to apply a formula to a Pandas DataFrame using a rolling window as well as a mean value across this window?

I am trying to implement a small algorithm that creates a new column in my DataFrame depending on whether a certain condition on another column exceeds a threshold or not. The formula looks like this:
df.loc[:, 'new_column'] = 0
df.loc[sum(abs(np.diff(df.loc[:, 'old_column']) / df.loc[:, 'old_column'].mean())) > threshold, 'new_column'] = 1
However now I don't want to apply this formula to the whole height of the DataFrame, but rather would like to apply a rolling window, i.e. the mean value calculated in the formula rolls through the rows of the DataFrame. I found this page in the documentation but don't know how I can apply this for a formula like in my case. How could I do something like this?
Using column names a and b instead of old_column and new_column:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=10), columns=['a'])
window = 3
val = df['a'].diff().abs() / df['a'].rolling(window, min_periods=1).mean()
threshold = 1
condition = (val > threshold)
df['b'] = 0
df.loc[condition, 'b'] = 1
Example results:
a b
0 1 0
1 7 1
2 6 0
3 1 1
4 8 1
5 2 1
6 3 0
7 1 0
8 0 0
9 8 1
Pay close attention to NaN values in the intermediate results. diff() returns NaN in the first row, unlike np.diff(), which returns an array one element shorter than the input.
And rolling().mean() will return NaN values depending on your min_periods parameter.
The final result contains no NaN values because (val > threshold) is always False for NaN inputs.
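If this pattern is needed more than once, it could be wrapped in a small helper. A sketch under the same assumptions (the function name and defaults here are mine, not from the question):
import numpy as np
import pandas as pd

def flag_relative_jumps(s, window=3, threshold=1.0):
    # 1 where |diff| divided by the rolling mean exceeds the threshold, else 0
    # (NaN comparisons are False, so the first rows come out as 0)
    val = s.diff().abs() / s.rolling(window, min_periods=1).mean()
    return (val > threshold).astype(int)

df = pd.DataFrame({'a': np.random.randint(10, size=10)})
df['b'] = flag_relative_jumps(df['a'])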

How to create a new column based on a condition in another column

In pandas, How can I create a new column B based on a column A in df, such that:
B=1 if A_(i+1)-A_(i) > 5 or A_(i) <= 10
B=0 if A_(i+1)-A_(i) <= 5
However, the first B_i value is always one
Example:
 A   B
 5   1 (the first B_i)
12   1
14   0
22   1
20   0
33   1
Use diff, compare to your value with le, invert, and convert from boolean to int:
N = 5
df['B'] = (~df['A'].diff().le(N)).astype(int)
NB: the le(5) comparison combined with inversion gives 1 for the first value, since diff produces NaN there, NaN.le(5) is False, and the inversion turns it into True.
output:
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1
Updated answer: simply combine a second condition with OR (|):
df['B'] = (~df['A'].diff().le(5)|df['A'].lt(10)).astype(int)
output: same as above with the provided data
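Put together as a runnable sketch (the DataFrame construction is mine, using the example data):
import pandas as pd

df = pd.DataFrame({'A': [5, 12, 14, 22, 20, 33]})
df['B'] = (~df['A'].diff().le(5) | df['A'].lt(10)).astype(int)
print(df)  # B is 1, 1, 0, 1, 0, 1 as above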
I was a little confused by your row numbering, because we should have a missing value in the last row rather than the first if we compute B_i from the condition A_(i+1)-A_(i) (the first row has both A_(i) and A_(i+1), while the last row is missing A_(i+1)).
Anyway, based on your example, I assumed that we are computing B_(i+1).
import pandas as pd
df = pd.DataFrame(columns=["A"],data=[5,12,14,22,20,33])
df['shifted_A'] = df['A'].shift(1)  # This row can be removed - it was added only to show how shift works on the final dataframe
df['B'] = ''
df.loc[((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10), 'B'] = 1  # Update rows that fulfill either condition with 1
df.loc[(df['A'] - df['A'].shift(1)) <= 5, 'B'] = 0  # Update rows that fulfill the condition with 0
df.loc[df.index == 0, 'B'] = 1  # Update the first row of column B
print(df)
That prints:
A shifted_A B
0 5 NaN 1
1 12 5.0 1
2 14 12.0 0
3 22 14.0 1
4 20 22.0 0
5 33 20.0 1
I am not sure if it is the fastest way, but I guess it is one of the easier ones to understand.
A little explanation:
df.loc[mask, columnname] = newvalue lets us update the value in a given column wherever the condition (mask) is fulfilled.
((df['A'] - df['A'].shift(1)) > 5) + (df['A'].shift(1) <= 10)
Each condition here returns True or False. If we add them, the result is True if either is True (which is simply OR). If we need AND instead, we can multiply the conditions.
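A quick sketch of that boolean arithmetic on numpy-backed boolean Series (the explicit | and & operators are the more idiomatic spelling):
import pandas as pd

x = pd.Series([True, False, True])
y = pd.Series([True, True, False])
print(x + y)  # element-wise OR:  True, True, True
print(x * y)  # element-wise AND: True, False, False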
Use Series.diff, fill the first missing value with N so that the Series.ge (greater or equal) comparison is True there, then convert to int:
N = 5
df['B'] = (df.A.diff().fillna(N).ge(N) | df.A.lt(10)).astype(int)
print (df)
A B
0 5 1
1 12 1
2 14 0
3 22 1
4 20 0
5 33 1

How to drop conflicted rows in Dataframe?

I have a classification task, which means that conflicts harm performance, i.e. rows with the same feature but different labels.
idx feature label
0 a 0
1 a 1
2 b 0
3 c 1
4 a 0
5 b 0
How could I get the formatted dataframe below?
idx feature label
2 b 0
3 c 1
5 b 0
DataFrame.duplicated() only outputs the duplicated rows, and it seems that logical operations between df["feature"].duplicated() and df.duplicated() do not return the results I want.
I think you need the rows with only one unique value per group - so use GroupBy.transform with DataFrameGroupBy.nunique, compare with 1, and filter in boolean indexing:
df = df[df.groupby('feature')['label'].transform('nunique').eq(1)]
print (df)
idx feature label
2 2 b 0
3 3 c 1
5 5 b 0
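For clarity, the intermediate transform looks like this on the sample data (a sketch; the DataFrame construction is mine):
import pandas as pd

df = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5],
                   'feature': ['a', 'a', 'b', 'c', 'a', 'b'],
                   'label': [0, 1, 0, 1, 0, 0]})
# feature 'a' has two distinct labels (0 and 1), 'b' and 'c' have one,
# so the transform yields 2, 2, 1, 1, 2, 1 and only rows 2, 3, 5 survive
print(df.groupby('feature')['label'].transform('nunique'))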

Use groupby in dataframe to perform data filtering and element-wise subtraction

I have a dataframe composed by the following table:
A B C D
A1 5 3 4
A1 8 1 0
A2 1 1 0
A2 1 9 1
A2 1 3 1
A3 0 4 7
...
I need to group the data according to the 'A' label, then check whether the sum of the 'B' column for each label is larger than 10. If it is, perform an operation that subtracts 'C' and 'D'. Finally, I need to drop all rows belonging to those 'A' labels for which the sum condition does not hold. I am trying to use the groupby method, but I am not sure this is the right way to go. So far I have grouped everything with df.groupby('A')['B'].sum() to get the sums per grouped label in order to check the aforementioned condition. But then how do I apply the subtraction between columns C and D and also drop the irrelevant rows?
Use GroupBy.transform with sum to get a new Series filled with the aggregate values, keep the rows whose group sum is greater than 10 using Series.gt in boolean indexing, and then subtract the columns:
df = df[df.groupby('A')['B'].transform('sum').gt(10)].copy()
df['E'] = df['C'].sub(df['D'])
print (df)
A B C D E
0 A1 5 3 4 -1
1 A1 8 1 0 1
A similar idea if you need to keep the sum column:
df['sum'] = df.groupby('A')['B'].transform('sum')
df['E'] = df['C'].sub(df['D'])
df = df[df['sum'].gt(10)].copy()
print (df)
A B C D sum E
0 A1 5 3 4 13 -1
1 A1 8 1 0 13 1
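As a self-contained sketch with the sample rows from the question (only the A1/A2/A3 rows shown above are included):
import pandas as pd

df = pd.DataFrame({'A': ['A1', 'A1', 'A2', 'A2', 'A2', 'A3'],
                   'B': [5, 8, 1, 1, 1, 0],
                   'C': [3, 1, 1, 9, 3, 4],
                   'D': [4, 0, 0, 1, 1, 7]})
df = df[df.groupby('A')['B'].transform('sum').gt(10)].copy()  # A1 sums to 13, the rest drop out
df['E'] = df['C'].sub(df['D'])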

Pandas groupby data and do calculation

I have a dataframe that looks like below, and I have reordered it depending on the value of column B.
a = df.sort_values(['B', 'A'], ascending=[True, False])
#This is my df
A,B
a,2
b,3
c,4
d,5
d,6
d,7
d,9
Then I'd like to calculate the difference between consecutive elements in column B when column A is the same. But if column A contains only a single data point, the result should be zero.
So firstly I used groupby() to do so.
b = a['B'].groupby(a['A'])
Then I stuck here, I know I can use lambda x: abs(x[i] - x[i+1]) or even apply() function to finish the calculation. But I still fail to get it done.
Can anyone give me a tip or suggestion?
# What I want to see in the result
A,B
a,0
b,0
c,0
d,0 # 5 minus 5
d,1 # 6 minus 5
d,1 # 7 minus 6
d,2 # 9 minus 7
In both the one-member and multi-member group cases, taking the diff will produce a NaN for the first value, which we can fillna with 0:
>>> df["B"] = df.groupby("A")["B"].diff().fillna(0)
>>> df
A B
0 a 0
1 b 0
2 c 0
3 d 0
4 d 1
5 d 1
6 d 2
This assumes there aren't NaNs already present that you want to preserve. We could still make that work if we needed to.
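One way to make that work (a sketch, not from the answer): remember where the original NaNs were and restore them afterwards.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['a', 'd', 'd'], 'B': [1.0, np.nan, 7.0]})
already_nan = df['B'].isna()                  # remember original NaNs
df['B'] = df.groupby('A')['B'].diff().fillna(0)
df.loc[already_nan, 'B'] = np.nan             # restore them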
You can also do that, with A as the index:
df.groupby(level="A").B.diff().fillna(0)
A
a 0
b 0
c 0
d 0
d 1
d 1
d 2
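A usage sketch for the level-based version (the set_index call is mine, since A is a column in the sample data):
import pandas as pd

df = pd.DataFrame({'A': list('abcdddd'), 'B': [2, 3, 4, 5, 6, 7, 9]})
print(df.set_index('A').groupby(level='A').B.diff().fillna(0))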
