How to sum NaN in numpy? - python

I have to sum two values obtained by np.average, as in:
for i in x:
    a1 = np.average(function1(i))
    a2 = np.average(function2(i))
    plt.plot(i, a1 + a2, 'o')
But np.average may return NaN, so only points for which both a1 and a2 are available get plotted.
How can I use zero instead of NaN so the sum is plotted for all points?
I tried to find a function in numpy to do so, but numpy.nan_to_num is for arrays.

You can use numpy like this:
import numpy as np
a = [1, 2, np.nan]
a_sum = np.nansum(a)
a_mean = np.nanmean(a)
print('a = ', a) # [1, 2, nan]
print("a_sum = {}".format(a_sum)) # 3.0
print("a_mean = {}".format(a_mean)) # 1.5

You can also use:
clean_x = x[~np.isnan(x)]
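Applied to the loop in the question, note that numpy.nan_to_num also accepts scalars (it maps NaN to 0.0), so a minimal sketch could look like this (x, function1, and function2 are the question's own names):
import numpy as np
import matplotlib.pyplot as plt
for i in x:
    a1 = np.nan_to_num(np.average(function1(i)))  # NaN becomes 0.0
    a2 = np.nan_to_num(np.average(function2(i)))
    plt.plot(i, a1 + a2, 'o')
Alternatively, np.nansum([a1, a2]) treats NaN entries as zero when summing the two averages.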

Related

Using np.where to return the mean of df rows based on criteria

Let's suppose I have the following code:
import pandas as pd
import numpy as np
flag = pd.DataFrame({'flag': [ [], ['red'], ['red, green'], ['red, blue'], ['blue'] ]})
colors_values = pd.DataFrame({'red': [1, 1, 1, 1, 1], 'green': [2, 2, 2, 2, 2], 'blue': [4, 4, 4, 4, 4]})
I have a 1D df called 'flag' in which each row contains a list of colors (red, green, blue), and another df 'colors_values' with these color names as columns. They have the same number of rows.
My goal is to use np.where to return the mean of the values for each row of 'colors_values' based on 'flag'. The output would be something like this:
0 NaN
1 1.0
2 1.5
3 2.5
4 4.0
If there is a better/faster way to do it instead of using np.where, I'd like to know.
Pandas merge is pretty fast; if you allow a bit of ramp-up time, you can do a merge/groupby (this assumes each list in flag['flag'] holds individual color names, e.g. ['red', 'green']; comma-joined strings like 'red, green' would need to be split first):
df_flag = flag.explode('flag').reset_index()
df_colors = colors_values.reset_index().melt(ignore_index=False, var_name='flag').reset_index()
df_flag = df_flag.merge(df_colors, on=['index', 'flag'], how='left')
df_grouped = df_flag.groupby(['index'])['value'].mean()
Fast solution
from sklearn.preprocessing import MultiLabelBinarizer
# encode the colors into indicator variables
# (assumes each row of flag['flag'] is a list of individual color names)
mask = MultiLabelBinarizer().fit_transform(flag['flag'])
# mask the color values where the indicator is zero, then take the row-wise mean
result = colors_values.sort_index(axis=1).mask(mask == 0).mean(axis=1)
Result
0 NaN
1 1.0
2 1.5
3 2.5
4 4.0
dtype: float64
You can match the color names between the two dataframes row by row with apply, as shown below:
means = colors_values.apply(lambda x: x[flag['flag'].iloc[x.name]].mean(), axis=1)
0 NaN
1 1.0
2 1.5
3 2.5
4 4.0
You could use str.get_dummies() and multiply by the colors_values df:
(flag['flag']
.str[0]
.str.get_dummies(sep=', ')
.mul(colors_values)
.where(lambda x: x.ne(0))
.mean(axis=1))
Output:
0 NaN
1 1.0
2 1.5
3 2.5
4 4.0
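For completeness, a plain-pandas sketch (a list comprehension, so not vectorized) that splits the comma-joined strings in the question's flag lists directly; this is just an illustrative alternative, not a faster one:
import numpy as np
import pandas as pd
means = pd.Series([
    colors_values.loc[i, f[0].split(', ')].mean() if f else np.nan
    for i, f in enumerate(flag['flag'])
])
# 0 NaN, 1 1.0, 2 1.5, 3 2.5, 4 4.0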

Assign consecutive values to a DataFrame from a numpy array based on a condition

The task seems easy, but I've been googling and experimenting for hours without any result. I can easily assign a 'static' value in such a case, or assign values between two columns of the same DataFrame (of the same length, of course), but I'm stuck with this situation.
I need to assign consecutive values from a numpy array to a pandas DataFrame column based on a condition, when the sizes of the DataFrame and the numpy array differ.
Here is the example:
import pandas as pd
import numpy as np
if __name__ == "__main__":
    df = pd.DataFrame([np.nan, 1, np.nan, 1, np.nan, 1, np.nan])
    arr = np.array([4, 5, 6])
    i = iter(arr)
    df[0] = np.where(df[0] == 1, next(i), np.nan)
    print(df)
The result is:
0
0 NaN
1 4.0
2 NaN
3 4.0
4 NaN
5 4.0
6 NaN
But I need the result where consecutive numbers from the numpy array are put in the DataFrame, like:
0
0 NaN
1 4.0
2 NaN
3 5.0
4 NaN
5 6.0
6 NaN
I appreciate any help.
It's not the most efficient way, but it will do the job. (The problem in your attempt is that np.where evaluates next(i) just once, before the call, so every matching row receives the same value 4.)
import pandas as pd
import numpy as np
def util(it, row):
    ele = next(it, None)
    return ele if ele is not None else row
df = pd.DataFrame([np.nan, 1, np.nan, 1, np.nan, 1, np.nan])
arr = np.array([4, 5, 6])
it = iter(arr)
df[0] = np.array(list(map(lambda r: util(it, r) if r == 1.0 else np.nan, df[0])))
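A vectorized alternative sketch: assign through a boolean mask, which consumes the array values in order (this assumes arr has at least as many elements as there are matching rows):
import pandas as pd
import numpy as np
df = pd.DataFrame([np.nan, 1, np.nan, 1, np.nan, 1, np.nan])
arr = np.array([4, 5, 6])
mask = df[0] == 1                   # rows that should receive values
df.loc[mask, 0] = arr[:mask.sum()]  # 4.0, 5.0, 6.0 land on rows 1, 3, 5
print(df)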

Counting the number of pandas.DataFrame rows for each column

What I want to do
I would like to count the number of rows that satisfy certain conditions, with a separate count for each column.
import numpy as np
import pandas as pd
## Sample DataFrame
data = [[1, 2], [0, 3], [np.nan, np.nan], [1, -1]]
index = ['i1', 'i2', 'i3', 'i4']
columns = ['c1', 'c2']
df = pd.DataFrame(data, index=index, columns=columns)
print(df)
## Output
# c1 c2
# i1 1.0 2.0
# i2 0.0 3.0
# i3 NaN NaN
# i4 1.0 -1.0
## Question 1: Count non-NaN values
## Expected result
# [3, 3]
## Question 2: Count non-zero numerical values
## Expected result
# [2, 3]
Note: Data types of results are not important. They can be list, pandas.Series, pandas.DataFrame etc. (I can convert data types anyway.)
What I have checked
## For Question 1
print(df[df['c1'].apply(lambda x: not pd.isna(x))].count())
## For Question 2
print(df[df['c1'] != 0].count())
Obviously these two print calls only handle column c1. It's easy to check the columns one by one, but I would like to know if there is a way to calculate the counts for all columns at once.
Environment
Python 3.10.5
pandas 1.4.3
You don't need to iterate over your data with apply; you can get both results in a vectorized fashion:
print(df.notna().sum().to_list()) # [3, 3]
print((df.ne(0) & df.notna()).sum().to_list()) # [2, 3]
Note that I have assumed that "Question 2: Count non-zero numerical values" also excludes NaN values; otherwise you would get [3, 4].
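To see where the [3, 4] comes from, note that NaN != 0 evaluates to True, so NaN cells get counted as non-zero:
print(df.ne(0).sum().to_list()) # [3, 4]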
You were close, I think! To answer your first question (counting the non-NaN values):
>>> df.apply(lambda x: x.notna().sum(), axis=0)
c1 3
c2 3
dtype: int64
Change axis to 1 to apply this operation to each row instead.
To answer your second question, this comes from an already answered question on SO:
>>> df.astype(bool).sum(axis=0)
c1 3
c2 4
dtype: int64
In the same way, you can change axis to 1 if you want. (Note that astype(bool) turns NaN into True, which is why c2 counts 4 here.) Hope it helps!

How do I use np.nanmin when comparing one column of a pandas DataFrame with an integer?

import pandas as pd
import numpy as np
a = np.array([[1, 2], [3, np.nan]])
np.nanmin(a, axis=0)
array([1., 2.])
I want to use the same logic on pandas DataFrame columns, comparing each value of a column with an integer.
use case:
MC_cond = df['MODEL'].isin(["MC"])
df_lgd_type = df['LGD_TYPE'].isin(["FIXED"])
df_without_lgd_type = ~(df_lgd_type)
x = np.nanmin((1, df.loc[MC_cond & df_without_lgd_type, 'A']
               + df.loc[MC_cond & df_without_lgd_type, 'B']))
That is, comparing the sum of column A and column B with 1.
This should do the trick even without np.nanmin. I hope I've understood everything correctly from your sparse description.
I'm assuming you also want to replace the NaN values that are left after summation, so we fill those with 1 and then clip all values to a maximum of 1.
a = df.loc[MC_cond & df_without_lgd_type, 'A']
b = df.loc[MC_cond & df_without_lgd_type, 'B']
x = (a + b).fillna(1).clip(upper=1)
Example:
df = pd.DataFrame({
'A': [-1, np.nan, 2, 3, 4],
'B': [-4, 5, np.nan, 7, -8]
})
(df.A + df.B).fillna(1).clip(upper=1)
# Output:
# 0 -5.0
# 1 1.0
# 2 1.0
# 3 1.0
# 4 -4.0
# dtype: float64
In case you don't want NaN values in one column leading to the row sum being NaN too, just fill them beforehand:
x = (a.fillna(0) + b.fillna(0)).fillna(1).clip(upper=1)
Just for completeness, this would be a pure numpy solution resembling your approach:
a = df.loc[MC_cond & df_without_lgd_type, 'A'].to_numpy()
b = df.loc[MC_cond & df_without_lgd_type, 'B'].to_numpy()
# optionally fill NaNs with 0
# a = np.nan_to_num(a)
# b = np.nan_to_num(b)
s = a + b
x = np.nanmin(np.stack([s, np.ones_like(s)]), axis=0)
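A quick sanity check of that numpy version against the example frame above (treating all rows as selected): np.nanmin ignores the NaN entries of s, so those positions fall back to 1, matching the pandas output:
s = (df['A'] + df['B']).to_numpy() # [-5., nan, nan, 10., -4.]
x = np.nanmin(np.stack([s, np.ones_like(s)]), axis=0)
# x -> [-5., 1., 1., 1., -4.]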

Read and write an array at the same time in Python

I want to recalculate column "a" of a given dataframe df, but my way of doing it does not fill the new, calculated values in over the old ones.
import pandas as pd
import numpy as np
from numpy.random import randn
df = pd.DataFrame(randn(100))
df["a"] = np.nan
df["b"] = randn()
df.a[0] = 0.5
df.a = df.a.shift(1) * df.b
Do you have any ideas how I can solve that?
I want to calculate "a" depending on its previous value and "b":
a     b
0.5   2    # starting value, set with df.a[0] = 0.5; there is no prior a, so no calculation is performed
1.5   3    # a = previous a * b = 0.5 * 3
15    10   # a = previous a * b = 1.5 * 10
45    3    # a = previous a * b = 15 * 3
The problem is that the calculation is not performed as intended: its results do not overwrite the previously set values.
How about this?
df = pd.DataFrame({'a': [None] * 4, 'b': [2, 3, 10, 3]})
df.loc[0, 'a'] = 0.5
df.loc[1:, 'a'] = (df.b.shift(-1).cumprod() * df.at[0, 'a'])[:-1].values
>>> df
a b
0 0.5 2
1 1.5 3
2 15 10
3 45 3
You can do it using a simple loop (.ix has been removed from pandas; .loc works with the default integer index):
for i in df.index[1:]:
    df.loc[i, 'a'] = df.loc[i, 'b'] * df.loc[i - 1, 'a']
If anyone knows a vectorized way I'd be interested to see it.
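One vectorized sketch of this particular recurrence: since a[i] = a[0] * b[1] * ... * b[i], a cumulative product does the job (using the same example frame as above):
import pandas as pd
df = pd.DataFrame({'b': [2, 3, 10, 3]})
factors = df['b'].copy()
factors.iloc[0] = 1                # the first row contributes no factor
df['a'] = 0.5 * factors.cumprod()  # a = [0.5, 1.5, 15.0, 45.0]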
