df.where necessary condition and a secondary condition - python

The wording of the title may be confusing, but I will explain in the code. Say I have a dataframe df:
In [1]: import pandas as pd
df = pd.DataFrame([[20, 20], [20, 0], [0, 20], [0, 0]], columns=['a', 'b'])
df
Out[1]:
a b
0 20 20
1 20 0
2 0 20
3 0 0
Now I want to create a new dataframe "df_new" based on 2 conditions, for example:
If 'a' is greater than 10, then check 'b'. If 'b' is greater than 5, fill values with NaN or cut out data (doesn't matter). If 'b' is less than 5, return the data.
If 'a' is less than 10, return the data regardless of the value of 'b'.
Here's my attempt with df.where -- it does not return what I would like.
In [2]: df_new = df.where((df['a'] < 10) & (df['b'] < 5))
df_new
Out[2]:
a b
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 0.0 0.0
This is how I would like df_new to look:
Out[3]:
a b
0 NaN NaN
1 20.0 0.0
2 0.0 20.0
3 0.0 0.0
I know df.where is doing exactly what I told it to do, but I am not sure how to check the 'b' value depending on the 'a' value with df.where -- I am trying to avoid a loop since my actual dataframe is quite large.

Just use this condition (df.a < 10) | (df.b < 5):
df[(df.a < 10) | (df.b < 5)]
a b
1 20 0
2 0 20
3 0 0
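If you want to keep the shape of the frame and get NaN in the filtered-out rows, as in the desired output above, the same mask also works with df.where:
df_new = df.where((df.a < 10) | (df.b < 5))
df_new
a b
0 NaN NaN
1 20.0 0.0
2 0.0 20.0
3 0.0 0.0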

Related

Subtract with value in previous row to create a new column by subject

Using python and this data set https://raw.githubusercontent.com/yadatree/AL/main/AK4.csv I would like to create a new column that, for each subject, starts with 0 (in the first row) and then contains the SCALE value of the current row minus the SCALE value of the previous row (row 2 minus row 1, row 3 minus row 2, and so on).
However, if this produces a negative value, the output should be 0.
Edit: Thank you for the response. That worked perfectly. The only remaining issue is that I'd like to start again with each subject (SUBJECT column). The number of rows per subject is not fixed, so something that checks the SUBJECT column and restarts from 0 would be ideal.
You can use .shift(1) to create a new column with the values moved down from the previous row - then both values are in the same row and you can subtract the columns.
Afterwards you can select all negative results and assign zero.
import pandas as pd

data = {
    'A': [1, 3, 2, 5, 1],
}
df = pd.DataFrame(data)

# move the previous row's value into the same row
df['previous'] = df['A'].shift(1)
df['result'] = df['A'] - df['previous']
print(df)

# the same without the helper column:
#df['result'] = df['A'] - df['A'].shift(1)
#print(df)

# replace negative results with zero
df.loc[df['result'] < 0, 'result'] = 0
print(df)
Result:
A previous result
0 1 NaN NaN
1 3 1.0 2.0
2 2 3.0 -1.0
3 5 2.0 3.0
4 1 5.0 -4.0
A previous result
0 1 NaN NaN
1 3 1.0 2.0
2 2 3.0 0.0
3 5 2.0 3.0
4 1 5.0 0.0
EDIT:
If you use df['result'] = df['A'] - df['A'].shift(1) then you get the result column without creating the previous column.
And if you use .shift(1, fill_value=0) then it puts 0 instead of NaN in the first row.
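With those two changes the whole computation fits in one line; .clip(lower=0) is equivalent to selecting the negative results and assigning zero:
df['result'] = (df['A'] - df['A'].shift(1, fill_value=0)).clip(lower=0)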
EDIT:
You can use groupby("SUBJECT") to group by subject and then put 0 in the first row of every group.
import pandas as pd

data = {
    'S': ['A', 'A', 'A', 'B', 'B', 'B'],
    'A': [1, 3, 2, 1, 5, 1],
}
df = pd.DataFrame(data)

df['result'] = df['A'] - df['A'].shift(1, fill_value=0)
print(df)

df.loc[df['result'] < 0, 'result'] = 0
print(df)

# find the first index of every group and set its result to 0
all_groups = df.groupby('S')
first_index = all_groups.apply(lambda grp: grp.index[0])
df.loc[first_index, 'result'] = 0
print(df)
Results:
S A result
0 A 1 1
1 A 3 2
2 A 2 -1
3 B 1 -1
4 B 5 4
5 B 1 -4
S A result
0 A 1 1
1 A 3 2
2 A 2 0
3 B 1 0
4 B 5 4
5 B 1 0
S A result
0 A 1 0
1 A 3 2
2 A 2 0
3 B 1 0
4 B 5 4
5 B 1 0
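Putting it together for the data set from the question - a sketch, assuming the CSV has the SUBJECT and SCALE columns described there: groupby(...)['SCALE'].diff() computes the row-to-row difference within each subject and returns NaN for the first row of every group, so fillna(0) handles the restart and clip(lower=0) handles the negatives:
import pandas as pd

df = pd.read_csv('https://raw.githubusercontent.com/yadatree/AL/main/AK4.csv')

# difference to the previous row within each subject;
# the first row of every subject becomes NaN
df['result'] = df.groupby('SUBJECT')['SCALE'].diff()

# NaN (first row per subject) -> 0, negative differences -> 0
df['result'] = df['result'].fillna(0).clip(lower=0)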

Select one column of data frame per row using value stored in another column and modify it

Given an example dataframe df:
import numpy as np
import pandas as pd

df = pd.DataFrame.from_dict({'a': [np.nan, 10, 20],
                             'b': [np.nan, np.nan, 20],
                             'c': [40, np.nan, np.nan],
                             'col': ['b', 'c', 'a']})
a b c col
0 NaN NaN 40.0 b
1 10.0 NaN NaN c
2 20.0 20.0 NaN a
I want to fillna(0) in each row of df, but only in the column whose name is stored in df.col.
The output should be:
a b c col
0 NaN 0.0 40.0 b
1 10.0 NaN 0.0 c
2 20.0 20.0 NaN a
I've found an iterative solution:
for index, row in df.iterrows():
    df.loc[index, row.col] = np.nanmax([df.loc[index, row.col], 0])
but it is slow and hard to use in chained operations (e.g. with assign).
Is there another, maybe vectorized, way (or an apply version) of getting the desired result?
Method 1: using stack and unstack
df = df.set_index('col').stack(dropna=False)
# level 0 of the index holds the value from 'col', level 1 the original column name
m1 = df.index.get_level_values(0) == df.index.get_level_values(1)
m2 = df.isna()
df.loc[m1 & m2] = 0
df = df.unstack(level=1).reset_index()
Method 2: melt and pivot
We can melt the dataframe, then find the rows where we need to fill the NaN. Finally we pivot the dataframe back into the format we want:
m = df.melt(id_vars='col')
m['value'] = np.where(
    m['col'] == m['variable'],
    m['value'].fillna(0),
    m['value']
)
df = m.pivot_table(
    index='col',
    columns='variable',
    values='value'
).reindex(df['col'])
df = df.reset_index().rename_axis(columns=None)
Output
col a b c
0 b NaN 0.00 40.00
1 c 10.00 NaN 0.00
2 a 20.00 20.00 NaN
You can use .transform() with np.where:
print(
    df.transform(
        lambda x: np.where(x.isna() & (x.index == x["col"]), 0, x),
        axis=1,
    )
)
Prints:
a b c col
0 NaN 0 40 b
1 10 NaN 0 c
2 20 20 NaN a
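A plain NumPy alternative - a sketch assuming the value columns are exactly ['a', 'b', 'c'] as in the example - looks up each row's target column with get_indexer and fills only the NaN cells in one vectorized step:
import numpy as np

cols = ['a', 'b', 'c']
arr = df[cols].to_numpy()
rows = np.arange(len(df))
# position of each row's target column within cols
col_idx = df[cols].columns.get_indexer(df['col'])
# fill only the targeted cells that are currently NaN
hit = np.isnan(arr[rows, col_idx])
arr[rows[hit], col_idx[hit]] = 0
df[cols] = arr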

.agg Sum Converting NaN to 0

I am trying to bin a Pandas DataFrame into three-day windows. I have two columns, A and B, which I want to sum within each window. The code I wrote for the task,
df = df.groupby(df.index // 3).agg({'A': 'sum', 'B': 'sum'})
converts NaN values to zero when summing, but I would like them to remain NaN, since my data has actual non-NaN zero values.
For example if I had this df:
df = pd.DataFrame([
    [np.nan, np.nan],
    [np.nan, 3],
    [np.nan, np.nan],
    [2, 0],
    [4, 0],
    [0, 0]
], columns=['A', 'B'])
Index A B
0 NaN NaN
1 NaN 3
2 NaN NaN
3 2 0
4 4 0
5 0 0
I would like the new df to be:
Index A B
0 NaN 3
1 6 0
But my current code outputs:
Index A B
0 0 3
1 6 0
df.groupby(df.index // 3)[['A', 'B']].sum(min_count=1)
With min_count=1, sum returns NaN for a window that contains no non-NaN values (instead of 0), while windows with at least one real value - including zeros - are summed normally. That gives exactly the sample output above.
Another option:
df.groupby(df.index // 3).agg({'A': lambda x: x.sum(skipna=False),
                               'B': lambda x: x.sum(skipna=True)})
Try with this code (note that skipna=False makes a window's sum NaN as soon as the window contains any NaN):
df.groupby(df.index // 3).agg({'A': lambda x: x.sum(skipna=False),
                               'B': lambda x: x.sum(skipna=False)})
Out[282]:
A B
0 NaN NaN
1 6.0 0.0
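For reference, a minimal reproduction with the example frame from the question; sum(min_count=1) keeps NaN only for the all-NaN window while still summing real zeros:
import numpy as np
import pandas as pd

df = pd.DataFrame([
    [np.nan, np.nan],
    [np.nan, 3],
    [np.nan, np.nan],
    [2, 0],
    [4, 0],
    [0, 0]
], columns=['A', 'B'])

print(df.groupby(df.index // 3)[['A', 'B']].sum(min_count=1))
#      A    B
# 0  NaN  3.0
# 1  6.0  0.0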

Fill few missing values in python

I want to fill missing values of a specific column only if a condition is met.
e.g.
A    B
NaN  0
NaN  0
0    0
NaN  1
NaN  1
...  ...
In the above case I want to fill the NaN values in column A only when the corresponding value in column B is 0. The other NaN values in A should not change.
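For reference, the example frame implied by the table can be built like this (the answers below use 3 as an arbitrary fill value):
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan, 0, np.nan, np.nan],
                   'B': [0, 0, 0, 1, 1]})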
Use mask with fillna:
df['A'] = df['A'].mask(df['B'] == 0, df['A'].fillna(3))
Alternatives with loc, numpy.where:
df.loc[df['B'] == 0, 'A'] = df['A'].fillna(3)
df['A'] = np.where(df['B'] == 0, df['A'].fillna(3), df['A'])
print (df)
A B
0 3.0 0
1 3.0 0
2 0.0 0
3 NaN 1
4 NaN 1
np.where is a quick and simple solution, but note the parentheses around the comparison - & binds tighter than == in Python:
In [47]: df['A'] = np.where(np.isnan(df['A']) & (df['B'] == 0), 3, df['A'])
In [48]: df
Out[48]:
A B
0 3.0 0
1 3.0 0
2 0.0 0
3 NaN 1
4 NaN 1
You could use a loop over all elements, something like this:
for i in range(len(A)):
    if numpy.isnan(A[i]) and B[i] == 0:
        A[i] = value
There are nicer ways to implement these loops, but I don't know what structures you are using.

Drop rows in pandas dataframe based on columns value

I have a dataframe like this:
import numpy as np
import pandas as pd

cols = ['a', 'b']
df = pd.DataFrame([[np.nan, -1, np.nan, 34],
                   [-32, 1, -4, np.nan],
                   [4, 5, 41, 14],
                   [3, np.nan, 1, np.nan]],
                  columns=['a', 'b', 'c', 'd'])
I want to retrieve the rows where the columns 'a' and 'b' are non-negative, but a missing value in 'a' or 'b' should not cause the row to be dropped.
The result should be
a b c d
2 4 5 41 14
3 3 NaN 1 NaN
I've tried this but it doesn't give the expected result.
df[(df[cols]>0).all(axis=1) | df[cols].isnull().any(axis=1)]
IIUC, you actually want
>>> df[((df[cols] > 0) | df[cols].isnull()).all(axis=1)]
a b c d
2 4 5 41 14
3 3 NaN 1 NaN
Right now you're getting "if they're all positive" or "any are null". You want "if they're all (positive or null)". (Replace > 0 with >=0 for nonnegativity.)
And since comparisons with NaN always evaluate to False, NaN never matches <= 0 either, so we can simplify by flipping the condition and use something like
>>> df[~(df[cols] <= 0).any(axis=1)]
a b c d
2 4 5 41 14
3 3 NaN 1 NaN
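For strict non-negativity (keeping zeros in 'a' and 'b') the same idea reads naturally with ge and isna - a small sketch on the frame above:
keep = (df[cols].ge(0) | df[cols].isna()).all(axis=1)
print(df[keep])
#      a    b     c     d
# 2  4.0  5.0  41.0  14.0
# 3  3.0  NaN   1.0   NaN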
