How to fill null values in a column based on the value in another column?
A  B
0  5
1  NaN
1  6
0  NaN
For the null values in B: if the corresponding value in A is 0, then fill with the previous value. I want it to look like this:
A  B
0  5
1  NaN
1  6
0  6
numpy.where + isnull + ffill
import numpy as np

df.assign(
    B=np.where(df.A.eq(0) & df.B.isnull(), df.B.ffill(), df.B)
)
A B
0 0 5.0
1 1 NaN
2 1 6.0
3 0 6.0
Another way, using loc:
df.loc[df['A'].eq(0), 'B'] = df['B'].ffill()
   A    B
0  0  5.0
1  1  NaN
2  1  6.0
3  0  6.0
A faster way (compared to the previous ones):
import pandas as pd
import numpy as np
df = pd.DataFrame({'A':[0,1,1,0], 'B':[5,np.nan,6,np.nan]})
df.B = np.where(df.A==0, df.B.ffill(), df.B)
and you get:
A B
0 0 5.0
1 1 NaN
2 1 6.0
3 0 6.0
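The three answers above agree; a minimal self-contained sketch of the masked-ffill idea (frame reconstructed from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 1, 0], 'B': [5, np.nan, 6, np.nan]})

# Forward-fill B only where A == 0 and B is null; all other rows keep their value
mask = df.A.eq(0) & df.B.isnull()
df['B'] = np.where(mask, df.B.ffill(), df.B)

print(df)
#    A    B
# 0  0  5.0
# 1  1  NaN
# 2  1  6.0
# 3  0  6.0
```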
This is my data, and I want to find the min value of the selected columns (a, b, c, d) in each row, then calculate the difference between that and dd. I need to ignore 0 in rows; for example, in the first row I need to find 8.
need to ignore 0 in rows
Then just replace the zeros with NaN; consider the following simple example:
import numpy as np
import pandas as pd
df = pd.DataFrame({"A": [1, 2, 0], "B": [3, 5, 7], "C": [7, 0, 7]})
df["minvalue"] = df.replace(0, np.nan).apply("min", axis=1)
print(df)
gives output
A B C minvalue
0 1 3 7 1.0
1 2 5 0 2.0
2 0 7 7 7.0
You can use pandas.apply with axis=1: convert the columns ['a','b','c','d'] of each row to a Series, replace 0 with +inf, and take the min. At the end, compute the difference between that min and column 'dd'.
import numpy as np

df['min_dd'] = df.apply(lambda row: min(pd.Series(row[['a','b','c','d']]).replace(0, np.inf)) - row['dd'], axis=1)
print(df)
a b c d dd min_dd
0 0 15 0 8 6 2.0 # min_without_zero : 8 , dd : 6 -> 8-6=2
1 2 0 5 3 2 0.0 # min_without_zero : 2 , dd : 2 -> 2-2=0
2 5 3 3 0 2 1.0 # 3 - 2
3 0 2 3 4 2 0.0 # 2 - 2
You can try
cols = ['a','b','c','d']
df['res'] = df[cols][df[cols].ne(0)].min(axis=1) - df['dd']
print(df)
a b c d dd res
0 0 15 0 8 6 2.0
1 2 0 5 3 2 0.0
2 5 3 3 0 2 1.0
3 2 3 4 4 2 0.0
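Both approaches can be compared on the same frame; a small sketch (data reconstructed from the table just above):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0, 2, 5, 2], 'b': [15, 0, 3, 3],
                   'c': [0, 5, 3, 4], 'd': [8, 3, 0, 4],
                   'dd': [6, 2, 2, 2]})

cols = ['a', 'b', 'c', 'd']

# Vectorised version: mask out zeros, take the row-wise min, subtract dd
res_mask = df[cols][df[cols].ne(0)].min(axis=1) - df['dd']

# apply version: replace 0 with +inf so it never wins the min
res_apply = df.apply(
    lambda row: min(row[cols].replace(0, np.inf)) - row['dd'], axis=1)

print(res_mask.tolist())   # [2.0, 0.0, 1.0, 0.0]
```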
EDIT: Upon request I provide an example that is closer to the real data I am working with.
So I have a table data that looks something like
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.265421 -0.623274 0.041326
4 -2.325031 -0.218792 -1.245911
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.042513 -0.128535
1 1.366463 -0.665195 0.35151
2 0.90347 0.094012 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.009618 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
(think: collection of time series) and a second table valid_range
start stop
run
0 1 3
1 2 5
For each run I want to drop all rows that do not satisfy start≤step≤stop.
I tried the following (table generating code at the end)
for idx in valid_range.index:
    slc = data.loc[idx]
    start, stop = valid_range.loc[idx]
    cond = (start <= slc.index) & (slc.index <= stop)
    data.loc[idx] = data.loc[idx][cond]
However, this results in:
value0 value1 value2
run step
0 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
1 0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
I also tried data.loc[idx].drop(slc[cond].index, inplace=True) but it didn't have any effect...
Generating code for table
import numpy as np
from pandas import DataFrame, MultiIndex, Index

rng = np.random.default_rng(0)
valid_range = DataFrame({"start": [1, 2], "stop": [3, 5]}, index=Index(range(2), name="run"))
midx = MultiIndex(levels=[[], []], codes=[[], []], names=["run", "step"])
data = DataFrame(columns=[f"value{k}" for k in range(3)], index=midx)
for run in range(2):
    for step in range(6):
        data.loc[(run, step), :] = rng.normal(size=3)
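As an aside, the same frame can be built without the nested loop via MultiIndex.from_product; the generator draws the same 36 numbers in the same order, so the values coincide (a sketch):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
midx = pd.MultiIndex.from_product([range(2), range(6)], names=["run", "step"])
data = pd.DataFrame(rng.normal(size=(12, 3)), index=midx,
                    columns=[f"value{k}" for k in range(3)])
print(data.head(2))
```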
First, merge data and valid_range based on 'run', using the merge method:
>>> data
value0 value1 value2
run step
0 0 0.12573 -0.132105 0.640423
1 0.1049 -0.535669 0.361595
2 1.304 0.947081 -0.703735
3 -1.26542 -0.623274 0.041326
4 -2.32503 -0.218792 -1.24591
5 -0.732267 -0.544259 -0.3163
1 0 0.411631 1.04251 -0.128535
1 1.36646 -0.665195 0.35151
2 0.90347 0.0940123 -0.743499
3 -0.921725 -0.457726 0.220195
4 -1.00962 -0.209176 -0.159225
5 0.540846 0.214659 0.355373
>>> valid_range
start stop
run
0 1 3
1 2 5
>>> merged = data.reset_index().merge(valid_range, how='left', on='run')
>>> merged
run step value0 value1 value2 start stop
0 0 0 0.12573 -0.132105 0.640423 1 3
1 0 1 0.1049 -0.535669 0.361595 1 3
2 0 2 1.304 0.947081 -0.703735 1 3
3 0 3 -1.26542 -0.623274 0.041326 1 3
4 0 4 -2.32503 -0.218792 -1.24591 1 3
5 0 5 -0.732267 -0.544259 -0.3163 1 3
6 1 0 0.411631 1.04251 -0.128535 2 5
7 1 1 1.36646 -0.665195 0.35151 2 5
8 1 2 0.90347 0.0940123 -0.743499 2 5
9 1 3 -0.921725 -0.457726 0.220195 2 5
10 1 4 -1.00962 -0.209176 -0.159225 2 5
11 1 5 0.540846 0.214659 0.355373 2 5
Then select the rows which satisfy the condition using eval, and use the resulting boolean array to mask data:
>>> cond = merged.eval('start <= step <= stop').to_numpy()
>>> data[cond]
            value0    value1    value2
run step
0   1       0.1049 -0.535669  0.361595
    2        1.304  0.947081 -0.703735
    3     -1.26542 -0.623274  0.041326
1   2      0.90347 0.0940123 -0.743499
    3    -0.921725 -0.457726  0.220195
    4     -1.00962 -0.209176 -0.159225
    5     0.540846  0.214659  0.355373
Or if you want, here is a similar approach using query
res = (
data.reset_index()
.merge(valid_range, on='run', how='left')
    .query('start <= step <= stop')
.drop(columns=['start','stop'])
.set_index(['run', 'step'])
)
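An alternative that stays on the MultiIndex, with no merge: broadcast start/stop along the 'run' level and compare with the 'step' level. A sketch, with the frames rebuilt inline:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
valid_range = pd.DataFrame({"start": [1, 2], "stop": [3, 5]},
                           index=pd.Index(range(2), name="run"))
midx = pd.MultiIndex.from_product([range(2), range(6)], names=["run", "step"])
data = pd.DataFrame(rng.normal(size=(12, 3)), index=midx,
                    columns=[f"value{k}" for k in range(3)])

# Align start/stop to data's rows via the 'run' level, then compare to 'step'
run = data.index.get_level_values("run")
step = data.index.get_level_values("step")
start = valid_range["start"].reindex(run).to_numpy()
stop = valid_range["stop"].reindex(run).to_numpy()

res = data[(start <= step) & (step <= stop)]
print(res.index.tolist())
# run 0 keeps steps 1..3, run 1 keeps steps 2..5
```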
I would use groupby, like this:
(df.groupby(level=0)
.apply(lambda x: x[x['small']>1])
.reset_index(level=0, drop=True) # remove duplicate index
)
which gives:
big small
animal animal attribute
cow cow speed 30.0 20.0
weight 250.0 150.0
falcon falcon speed 320.0 250.0
lama lama speed 45.0 30.0
weight 200.0 100.0
I have this dataset, which contains some NaN values:
df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana',np.NaN,'Mia','Mae',np.NaN], "Count":[10,3,np.NaN,8,5,2]})
df
Id Name Count
0 1 Eve 10.0
1 2 Diana 3.0
2 3 NaN NaN
3 4 Mia 8.0
4 5 Mae 5.0
5 6 NaN 2.0
I want to flag whether each value is NaN (0) or not (1), creating two new columns. I have tried this:
df_clean = df
df_clean[['Name_flag','Count_flag']] = df_clean[['Name','Count']].apply(lambda x: 0 if x == np.NaN else 1, axis = 1)
But it raises "The truth value of a Series is ambiguous". I want to avoid redundancy, but I see there is a mistake in my logic. Could you please help me with this?
The expected table is:
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 1 1
1 2 Diana 3.0 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.0 1 1
4 5 Mae 5.0 1 1
5 6 NaN 2.0 0 1
Multiply the boolean mask by 1 (note: the expected table flags non-null as 1, so use notna):
df[['Name_flag','Count_flag']] = df[['Name', 'Count']].notna() * 1
>>> df
   Id   Name  Count  Name_flag  Count_flag
0   1    Eve   10.0          1           1
1   2  Diana    3.0          1           1
2   3    NaN    NaN          0           0
3   4    Mia    8.0          1           1
4   5    Mae    5.0          1           1
5   6    NaN    2.0          0           1
Regarding the error "The truth value of a Series is ambiguous": with apply over axis=1 each call receives a whole Series, so you cannot return a scalar 0 or 1 from it. Use applymap instead to apply a function elementwise. Comparing to NaN is still not straightforward (NaN != NaN), so one workaround is:
df[['Name','Count']].applymap(lambda x: str(x) != 'nan') * 1
We can use notna and convert the boolean to int:
df[["Name_flag", "Count_flag"]] = df[["Name", "Count"]].notna().astype(int)
   Id   Name  Count  Name_flag  Count_flag
0   1    Eve  10.00          1           1
1   2  Diana   3.00          1           1
2   3    NaN    NaN          0           0
3   4    Mia   8.00          1           1
4   5    Mae   5.00          1           1
5   6    NaN   2.00          0           1
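Either way, the mapping the question asks for is non-null → 1, null → 0, which is exactly notna; a compact check (frame reconstructed from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3, 4, 5, 6],
                   'Name': ['Eve', 'Diana', np.nan, 'Mia', 'Mae', np.nan],
                   'Count': [10, 3, np.nan, 8, 5, 2]})

# notna gives True for present values; cast to int for the 1/0 flags
df[['Name_flag', 'Count_flag']] = df[['Name', 'Count']].notna().astype(int)

print(df[['Name_flag', 'Count_flag']].to_numpy().tolist())
# [[1, 1], [1, 1], [0, 0], [1, 1], [1, 1], [0, 1]]
```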
Suppose I have a df like below:
A B C
null 0 null
null 4 null
5 6 null
0 0 0
Now, I want to fill column C based on columns A and B, with this condition:
only if there is a null in column A against a 0 in column B should column C be null; in all other cases, copy column B to column C. This means I want my df to look like this:
A B C
null 0 null
null 4 4
5 6 6
0 0 0
How can I achieve this in pandas?
Any help would be appreciated, as I am quite new to Python and pandas.
Use numpy.where with conditions chained by & for bitwise AND:
import numpy as np
m1 = df.A.isna()
m2 = df.B.eq(0)
df['C'] = np.where(m1 & m2, np.nan, df.B)
print(df)
A B C
0 NaN 0 NaN
1 NaN 4 4.0
2 5.0 6 6.0
3 0.0 0 0.0
Use Series.fillna + Series.mask:
df['C'] = df['C'].fillna(df['B'].mask(df['B'].eq(0)))
print(df)
A B C
0 NaN 0 NaN
1 NaN 4 4.0
2 5.0 6 6.0
3 0.0 0 0.0
or using Series.where:
df['C'] = df['B'].mask(df['B'].eq(0)).where(df['C'].isnull(), df['C'])
print(df)
A B C
0 NaN 0 NaN
1 NaN 4 4.0
2 5.0 6 6.0
3 0.0 0 0.0
Using fillna and checking if A + B > 0, if so then fill C with B using loc:
mask = df['A'].fillna(0) + df['B'] > 0
df.loc[mask, 'C'] = df['B']
A B C
0 NaN 0 NaN
1 NaN 4 4.0
2 5.0 6 6.0
3 0.0 0 0.0
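All of the approaches above produce the same C; a self-contained check using the np.where version (frame reconstructed from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [np.nan, np.nan, 5, 0],
                   'B': [0, 4, 6, 0]})

# C is null only where A is null AND B == 0; otherwise C = B
df['C'] = np.where(df['A'].isna() & df['B'].eq(0), np.nan, df['B'])

print(df)
#      A  B    C
# 0  NaN  0  NaN
# 1  NaN  4  4.0
# 2  5.0  6  6.0
# 3  0.0  0  0.0
```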
I would like to discard all cells that contain a value below a given threshold: not only entire rows or columns, but individual cells.
I tried the code below, where all values in each cell should be at least 3, but it doesn't work:
df[(df >= 3).any(axis=1)]
Example
import pandas as pd
my_dict = {'A':[1,5,6,2],'B':[9,9,1,2],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 2 2 5
I want to keep only the cells that are at least 3.
If you want "all values in each cell should be at least 3"
df[df < 3] = 3
df
A B C
0 3 9 3
1 5 9 3
2 6 3 3
3 3 3 5
If you want "to keep only the cells that are at least 3"
df = df[df >= 3]
df
A B C
0 NaN 9.0 NaN
1 5.0 9.0 NaN
2 6.0 NaN 3.0
3 3.0 3.0 5.0
You can check whether each value is >= 3, then drop all rows containing a NaN value:
df[df >= 3].dropna()
DEMO:
import pandas as pd
my_dict = {'A':[1,5,6,3],'B':[9,9,1,3],'C':[1,1,3,5]}
df = pd.DataFrame(my_dict)
df
A B C
0 1 9 1
1 5 9 1
2 6 1 3
3 3 3 5
df = df[df >= 3].dropna().reset_index(drop=True)
df
A B C
0 3.0 3.0 5.0
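Note that dropna() with the default how='any' drops a row as soon as a single cell falls below the threshold; how='all' would keep rows with at least one surviving cell. A sketch contrasting the two on the frame above:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 5, 6, 3], 'B': [9, 9, 1, 3], 'C': [1, 1, 3, 5]})

masked = df[df >= 3]  # cells below 3 become NaN

only_complete = masked.dropna().reset_index(drop=True)      # every cell >= 3
any_kept = masked.dropna(how='all').reset_index(drop=True)  # at least one cell >= 3

print(only_complete.shape)  # (1, 3)
print(any_kept.shape)       # (4, 3)
```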