Minimum value for each row in pandas dataframe - python

I'm trying to compute the minimum for each row in a pandas DataFrame.
I would like to add a column that holds the row-wise minimum, ignoring "NaN" and "WD".
For example
A B C D
1 3 2 WD
3 WD NaN 2
should give me a new column like
Min
1
2
I tried df.where(df > 0).min(axis=1)
and df.where(df != "NaN").min(axis=1) without success

Convert the values to numeric, turning non-numeric entries into NaN via errors='coerce' in to_numeric applied with DataFrame.apply, so min can then be used:
df['Min'] = df.apply(pd.to_numeric, errors='coerce').min(axis=1)
print (df)
A B C D Min
0 1 3 2.0 WD 1.0
1 3 WD NaN 2 2.0
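For reference, a minimal sketch reproducing the example above (constructing the frame by hand is an assumption about how the data was loaded). The question's attempts fail because the missing values are float NaN rather than the string "NaN", so df != "NaN" masks nothing, and df > 0 cannot compare strings like "WD" with numbers:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 3],
                   'B': [3, 'WD'],
                   'C': [2.0, np.nan],
                   'D': ['WD', 2]})
# coerce non-numeric cells ('WD') to NaN, then min skips NaN by default
df['Min'] = df.apply(pd.to_numeric, errors='coerce').min(axis=1)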

Note that skipna expects a boolean, not a list of values to skip, and min cannot ignore arbitrary strings like "WD" on its own; coerce the non-numeric entries to NaN first, then skipna=True (the default) ignores them:
df['min'] = df.apply(pd.to_numeric, errors='coerce').min(axis=1, skipna=True)

Related

How to set conditions in a data frame and then average the values and replace them?

I have a DataFrame.
a b
0 0.5 1
1 2#3 4
2 1 4#4
I want to check each row's value in each column for the presence of "#". If it is present, the value needs to be split, the parts averaged, and the result written back as the new value. Values without "#" do not have to go through any of these operations. My result should be:
a b
0 0.5 1
1 2.5 4
2 1 4
2#3 needs to be split into 2 and 3, and the average of these values taken.
You could stack + str.split + explode + astype(float) to create a MultiIndex Series of dtype float out of df. Then groupby the index, find mean and unstack to build back the DataFrame:
out = (df.stack().str.split('#').explode().astype(float)
         .groupby(level=[0, 1]).mean().unstack())
Output:
a b
0 0.5 1.0
1 2.5 4.0
2 1.0 4.0
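To make the chain easier to follow, here is a sketch of the intermediate steps, assuming the frame holds strings as in the question:
import pandas as pd

df = pd.DataFrame({'a': ['0.5', '2#3', '1'], 'b': ['1', '4', '4#4']})
s = df.stack()               # MultiIndex Series: (row, column) -> value
s = s.str.split('#')         # each value becomes a list of tokens
s = s.explode()              # one row per token, index labels duplicated
s = s.astype(float)          # numeric so mean() can average
out = s.groupby(level=[0, 1]).mean().unstack()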
Although it is not an optimal solution, one way could be as follows:
import pandas as pd

for col in df.columns:
    for idx, i in enumerate(df[col]):
        if '#' in str(i):
            # split on '#', convert the parts to numbers and average them
            temp = [int(j) for j in str(i).split('#')]
            avg = sum(temp) / len(temp)
            df.loc[idx, col] = avg
print(df)

Pandas: Replace missing dataframe values / conditional calculation: fillna

I want to calculate a column in a pandas DataFrame, but some rows contain missing values. For those missing values, I want to use a different algorithm. Let's say:
If column B contains a value, then subtract A from B
If column B does not contain a value, then subtract A from C
import pandas as pd
df = pd.DataFrame({'a':[1,2,3,4], 'b':[1,1,None,1],'c':[2,2,2,2]})
df['calc'] = df['b']-df['a']
results in:
print(df)
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 NaN
3 4 1.0 2 -3.0
Approach 1: fill the NaN rows using .where:
df['calc'].where(df['b'].isnull()) = df['c']-df['a']
which results in SyntaxError: cannot assign to function call.
Approach 2: fill the NaN rows using .iterrows():
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c']-row['a']
        print(i)
    else:
        i = row['b']-row['a']
        print(i)
is executed without errors and the calculation is correct; these i values are printed to the console:
0.0
-1.0
-1.0
-3.0
but the values are not written into df['calc']; the dataframe remains as is:
print(df['calc'])
0 0.0
1 -1.0
2 NaN
3 -3.0
What is the correct way of overwriting the NaN values?
Finally, I stumbled over .fillna:
df['calc'] = df['calc'].fillna( df['c']-df['a'] )
gets the job done! Can anyone explain what is wrong with the above two approaches...?
Approach 2:
You are assigning the result to the local variable i, but this won't modify your original dataframe; write it back with df.loc:
for index, row in df.iterrows():
    i = df['calc'].iloc[index]
    if pd.isnull(row['b']):
        i = row['c']-row['a']
        print(i)
    else:
        i = row['b']-row['a']
        print(i)
    df.loc[index,'calc'] = i  # <------------- here
Also, don't use iterrows() if you can avoid it; it is slow.
Approach 1:
The pandas where() method checks a condition and keeps the values where it holds; by default, positions not satisfying the condition are filled with NaN (or with the second argument, if one is given).
It should be:
df['calc'] = df['calc'].where(df['b'].isnull(), df['c']-df['a'])
but this keeps calc only where b is null and overwrites everything else, which is the opposite of what you want. Invert the condition:
df['calc'] = df['calc'].where(~df['b'].isnull(), df['c']-df['a'])
OR (with import numpy as np):
df['calc'] = np.where(df['b'].isnull(), df['c']-df['a'], df['calc'])
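A quick check of the corrected versions on the question's frame (a sketch; both forms produce the same result):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [1, 1, None, 1], 'c': [2, 2, 2, 2]})
df['calc'] = df['b'] - df['a']
# overwrite only the rows where b is missing with c - a
df['calc'] = np.where(df['b'].isnull(), df['c'] - df['a'], df['calc'])
print(df)
#    a    b  c  calc
# 0  1  1.0  2   0.0
# 1  2  1.0  2  -1.0
# 2  3  NaN  2  -1.0
# 3  4  1.0  2  -3.0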
Instead of computing b - a and then c - a separately, you can first fill the NaN values in column b with the values from column c, then subtract column a:
df['calc'] = df['b'].fillna(df['c']) - df['a']
a b c calc
0 1 1.0 2 0.0
1 2 1.0 2 -1.0
2 3 NaN 2 -1.0
3 4 1.0 2 -3.0

Max element from more than 2 columns of Dataframes in Pandas

I have the following dataframe as "w":
A B
0 Alex Benedict
1 John NaN
I want to find the maximum of these 2 columns and store it in the "A" column.
I used the following method:
w["A"] = w[['A','B']].max(axis=1)
A B
0 NaN Benedict
1 NaN NaN
I don't want this output of "NaN" in the "A" column. How should I get rid of this?
It is possible with max per row after removing missing values:
w['A'] = w[['A','B']].apply(lambda x: x.dropna().max(), axis=1)
print (w)
A B
0 Benedict Benedict
1 John NaN
The nanmax() function of numpy does the job for numeric data when applied per row (note axis=1; for the string columns in this example, the dropna approach above is more suitable):
import numpy as np
w['A'] = w[['A', 'B']].apply(np.nanmax, axis=1)
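A small numeric sketch (hypothetical values) where nanmax applies:
import numpy as np
import pandas as pd

w = pd.DataFrame({'A': [1.0, 5.0], 'B': [3.0, np.nan]})
w['A'] = w[['A', 'B']].apply(np.nanmax, axis=1)
print(w)
#      A    B
# 0  3.0  3.0
# 1  5.0  NaN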

Count all NaNs in a pandas DataFrame

I'm trying to count the NaN elements (data type class 'numpy.float64') in a pandas Series (data type class 'pandas.core.series.Series') to know how many there are.
This is for counting null values in a pandas Series:
import pandas as pd
oc=pd.read_csv(csv_file)
oc.count("NaN")
I expected the output of oc.count("NaN") to be 7, but it shows 'Level NaN must be same as name (None)'.
The argument to count isn't what you want counted (it's actually the axis name or index).
You're looking for df.isna().values.sum() (to count NaNs across the entire DataFrame), or len(df) - df['column'].count() (to count NaNs in a specific column).
You can use either of the following if your Series.dtype is float64 (with import numpy as np):
oc.isin([np.nan]).sum()
oc.isna().sum()
If your Series is of mixed data-type you can use the following:
oc.isin([np.nan, 'NaN']).sum()
oc.size: returns the total element count of the dataframe, including NaN
oc.count().sum(): returns the total element count of the dataframe, excluding NaN
Therefore, another way to count the number of NaNs in a dataframe is to subtract the two:
NaN_count = oc.size - oc.count().sum()
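A quick check on a hypothetical frame with two NaNs:
import numpy as np
import pandas as pd

oc = pd.DataFrame({'x': [1.0, np.nan], 'y': [np.nan, 2.0]})
print(oc.size - oc.count().sum())  # 4 total cells - 2 non-NaN cells = 2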
Just for fun, you can do either
df.isnull().sum().sum()
or
len(df)*len(df.columns) - len(df.stack())
(stack() drops NaN by default, so the stacked length counts only the non-missing cells).
If your dataframe looks like this:
aa = pd.DataFrame(np.array([[1, 2, np.nan], [3, np.nan, 5], [8, 7, 6],
                            [np.nan, np.nan, 0]]), columns=['a', 'b', 'c'])
a b c
0 1.0 2.0 NaN
1 3.0 NaN 5.0
2 8.0 7.0 6.0
3 NaN NaN 0.0
To count NaN by column, you can try this:
aa.isnull().sum()
a 1
b 2
c 1
For the total count of NaN:
aa.isnull().values.sum()
4

New nulls when assigning series to dataframe column

I can't figure out why new null values are popping up after assigning a dataframe column as a series that doesn't have any nulls originally. Here's an example:
df.date_col.shape returns (100000,)
df.date_col.isnull().sum() returns 0
I then create a new series of the same size with:
new_series = pd.Series([int(d[:4]) for d in df.date_col])
new_series.shape returns (100000,)
new_series.isnull().sum() returns 0
But then if I try to assign this new series to the original column:
df.date_col = new_series
df.date_col.isnull().sum() returns 6328
Would someone please tell me what might be going on here?
IIUC, your index is not contiguous. When you create the pd.Series, it auto-assigns an index from 0 to len(s)-1, and DataFrame assignment aligns on the index, so the index mismatch creates the NaNs:
df=pd.DataFrame({'col':[1,2,3]},index=[1,2,3])
s=pd.Series([d*2 for d in df.col])
df['New']=s
df
Out[170]:
col New
1 1 4.0
2 2 6.0
3 3 NaN
df['New2']=s.values
df
Out[172]:
col New New2
1 1 4.0 2
2 2 6.0 4
3 3 NaN 6
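Following that explanation, a sketch of the fix for the original case (df and date_col are the names from the question): build the new series with the frame's own index, or assign raw values, so alignment cannot introduce NaNs:
# option 1: keep the original index so assignment aligns one-to-one
df.date_col = pd.Series([int(d[:4]) for d in df.date_col], index=df.index)

# option 2: bypass index alignment by assigning a plain list
df.date_col = [int(d[:4]) for d in df.date_col]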
