import numpy as np
import pandas as pd
GFG_dict = { '2019': [10,20,30,40],'2020': [20,30,40,50],'2021': [30, np.NaN, np.NaN, 60],'2022':[40,50,60,70]}
gfg = pd.DataFrame(GFG_dict)
gfg['mean1'] = gfg.mean(axis = 1)
gfg['mean2'] = gfg.loc[:,'2019':'2022'].sum(axis = 1)/4
gfg
I want to get the average as in column mean2.
How do I get the same result using .mean(axis=1)?
When you compute mean2 you take the sum (which skips NaN values) and then always divide by 4 (the count including NaNs).
You can achieve this by doing a fillna(0) and then calculating the mean:
gfg['mean1'] = gfg.fillna(0).mean(axis = 1)
NOTE: fillna(0) here is only applied for the mean calculation; it does not fill the NaN values in your dataframe, so your dataframe is not modified.
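A quick check (restricting the calculation to the year columns so the new mean columns are not pulled into it) confirms the two agree:
# recompute mean1 over the year columns only, with NaNs treated as 0
gfg['mean1'] = gfg.loc[:, '2019':'2022'].fillna(0).mean(axis=1)
gfg[['mean1', 'mean2']]
# row 1: (20 + 0 + 30 + 50) / 4 = 25.0, identical to mean2's sum-divided-by-4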
Related
I have a data frame where every column has either True or NaN in each row. I am looking for a way to filter the 10 columns with the most occurrences of True.
import pandas as pd
import numpy as np
import random
num_rows = 50
num_cols = 10
df = pd.DataFrame({f'col{i}': [random.choice([True, np.NaN]) for j in range(num_rows)] for i in range(num_cols)})
# count() counts the non-NaN entries, i.e. the number of True values per column here
top10_cols = df.count().nlargest(10).index  # the 10 columns with the most True values
df[top10_cols]
I would like to know how can one plot only dots which are above some threshold in pandas scatter?
Let's assume I have a dataframe like this: I had two datasets and calculated the difference between them (column "Difference"); if the number in dataset1 was higher I marked the row True, and if the number in dataset2 was higher I marked it False (column "ARG"); finally there is the atom number ("ATOM"):
import pandas as pd
import numpy as np
d = {'Difference': [1.095842, 1.295069, 1.021345, np.nan, 1.773725], 'ARG': [True, False, True, np.nan, False], 'ATOM': [1, 3, 5, 7, 9]}
df=pd.DataFrame(d)
df
Difference ARG ATOM
0 1.095842 True 1
1 1.295069 False 3
2 1.021345 True 5
3 NaN NaN 7
4 1.773725 False 9
Then I plot the DataFrame as a scatter where the dots are colored based on whether ARG is True or False.
df = df.dropna(axis=0)
df.plot(x='ATOM', y='Difference', kind='scatter', color=df.ARG.map({True: 'orange', False: 'blue'}))
But what if I am only interested in plotting dots (ATOMs) whose "Difference" value is above some threshold, e.g. difference >= 1.15?
Can I somehow drop all rows where the value in column "Difference" doesn't meet the required threshold?
I tried
df = df[df.Difference >= 0.15]
But it returns error: '>=' not supported between instances of 'method' and 'float'
Thank you for your suggestions.
The answer is boolean slicing.
First you address the column you want to use as the threshold (df['Difference']) and build a logical vector from it, then you slice the frame with that vector (df[lg]):
lg = df['Difference'] >= 0.15 # logical vector
df = df[lg]
So in a minimum working example, you get:
import pandas as pd
import numpy as np
# create dummy DataFrame
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=['Col 1','Col 2','Col 4','Col 5'])
# thresholding
lg = df['Col 1'] >= 42
# slicing
df[lg]
len(df)      # 100
len(df[lg])  # e.g. 56 (varies with the random data)
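Applied back to the scatter example from the question (a minimal sketch, reusing the ATOM/Difference frame and the 1.15 threshold mentioned there):
# keep only the rows above the threshold, then plot as before
plot_df = df.dropna(axis=0)
plot_df = plot_df[plot_df['Difference'] >= 1.15]
plot_df.plot(x='ATOM', y='Difference', kind='scatter',
             color=plot_df.ARG.map({True: 'orange', False: 'blue'}))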
I have a dataframe with 2 columns in python. I want to enter the dataframe with one column and obtain the value of the 2nd column. Sometimes the values can be exact, but they can also be values between 2 rows.
I have this example dataframe:
x y
0 0 0
1 10 100
2 20 200
I want to find the value of y if I check the dataframe with the value of x. For example, if I enter in the dataframe with the value of 10, I obtain the value of 100. But if I check with 15, I need to interpolate between the two values of y. Is there any function to do it?
numpy.interp is probably the simplest way here for linear interpolation:
def interpolate(xval, df, xcol, ycol):
    # return the linearly interpolated y value at xval, where df[xcol] holds the
    # x coordinates (expected to be sorted) and df[ycol] the y coordinates
    return np.interp(xval, df[xcol], df[ycol])
With your example data it gives:
>>> interpolate(10, df, 'x', 'y')
100.0
>>> interpolate(15, df, 'x', 'y')
150.0
You can even directly do:
>>> np.interp([10, 15], df.x, df.y)
array([100., 150.])
You can have a look at the interpolate method provided by the pandas module (doc), but I'm not sure that answers your question.
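For what it's worth, here is a minimal sketch of how pandas' own interpolate could be used for this (assuming the same x/y frame; the query point 15 is first inserted into the index as a NaN row):
# interpolate via the index, using pandas only
s = df.set_index('x')['y']
s = s.reindex(s.index.union([15]))  # insert the query point as a NaN row
s = s.interpolate(method='index')   # linear interpolation weighted by index spacing
s.loc[15]                           # 150.0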
You can do it with interp1d from the scipy module (scipy.interpolate). Several types of interpolation are possible: ‘linear’, ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’... You can find the full list on the (doc page).
The interpolation process can be summarised in three steps:
Split your data into missing and non-missing values. I use isna (doc).
Create the interpolation function using the data without missing values. I use interp1d (doc).
Interpolate (predict the missing values). Just call the function found in step 2 on the missing data (column x).
Here is the code:
# Import modules
import pandas as pd
import numpy as np
from scipy.interpolate import interp1d
# Data
df = pd.DataFrame(
[[0, 0],
[10, 100],
[11, np.NaN],
[15, np.NaN],
[17, np.NaN],
[20, 200]],
columns=["x", "y"])
print(df)
# x y
# 0 0 0.0
# 1 10 100.0
# 2 11 NaN
# 3 15 NaN
# 4 17 NaN
# 5 20 200.0
# Split data in training (not NaN values) and missing (NaN values)
missing = df.isna().any(axis=1)
df_training = df[~missing]
df_missing = df[missing].reset_index(drop=True)
# Create the function that interpolates the missing values (from our training values)
f = interp1d(df_training.x, df_training.y)
# Interpolate the missing values
df_missing["y"] = f(df_missing.x)
print(df_missing)
# x y
# 0 11 110.0
# 1 15 150.0
# 2 17 170.0
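As a short follow-up (same df, missing mask, and f as above), the interpolated values can also be written straight back into the original frame:
# fill the NaN rows of the original frame in place
df.loc[missing, "y"] = f(df.loc[missing, "x"])
print(df)
#     x      y
# 0   0    0.0
# 1  10  100.0
# 2  11  110.0
# 3  15  150.0
# 4  17  170.0
# 5  20  200.0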
You can find other works on the topic at this link.
How could I randomly introduce NaN values into my dataset for each column, taking into account the null values already present in my starting data?
I want to have, for example, 20% NaN values per column.
For example:
If my dataset has 3 columns, "A", "B" and "C", each with its own existing NaN rate, how do I randomly introduce NaN values into each column to reach 20% per column:
A: 10% nan
B: 15% nan
C: 8% nan
For the moment I tried this code, but it degrades my dataset too much and I don't think it is the right way:
df = df.mask(np.random.choice([True, False], size=df.shape, p=[.20,.80]))
I am not sure what you mean by the last part ("degrades too much"), but here is a rough way to do it.
import numpy as np
import pandas as pd
A = pd.Series(np.arange(99))
# Original missing rate (for illustration)
nanidx = A.sample(frac=0.1).index
A[nanidx] = np.NaN
###
# Complementing to 20%
# Original ratio
ori_rat = A.isna().mean()
# Adjusting for the dataframe without missing values
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
nanidx2 = A.dropna().sample(frac=add_miss_rat).index
A[nanidx2] = np.NaN
A.isna().mean()
Obviously, it will not always be exactly 20%...
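If hitting the target matters more, a small variant (a sketch, reusing the Series A from above) computes the exact number of additional NaNs to inject:
# hit the 20% target as exactly as integer counts allow
target = int(round(0.2 * len(A)))          # desired total number of NaNs
n_extra = max(target - A.isna().sum(), 0)  # NaNs still needed to reach the target
A[A.dropna().sample(n=n_extra).index] = np.nan
A.isna().mean()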
Update
Applying it for the whole dataframe
for col in df:
    ori_rat = df[col].isna().mean()
    if ori_rat >= 0.2: continue
    add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
    vals_to_nan = df[col].dropna().sample(frac=add_miss_rat).index
    df.loc[vals_to_nan, col] = np.NaN
Update 2
I made a correction to also take into account the effect of dropping NaN values when calculating the ratio.
Unless you have a giant DataFrame and speed is a concern, the easy-peasy way to do it is by iteration.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'A':list(range(100)),'B':list(range(100)),'C':list(range(100))})
#before adding nan
print(df.head(10))
nan_percent = {'A':0.10, 'B':0.15, 'C':0.08}
for col in df:
    for i, row_value in df[col].items():  # iteritems() is deprecated in newer pandas
        if random.random() <= nan_percent[col]:
            df.loc[i, col] = np.nan
#after adding nan
print(df.head(10))
Here is a way to get as close to 20% nan in each column as possible:
def input_nan(x, pct):
    # number of additional NaNs needed to bring this column up to pct
    n = int(len(x) * (pct - x.isna().mean()))
    # choose positions only among entries that are not already NaN
    idxs = np.random.choice(len(x), max(n, 0), replace=False, p=x.notna()/x.notna().sum())
    x.iloc[idxs] = np.nan

df.apply(input_nan, pct=.2)
It first takes the difference between the NaN percentage you want and the percentage of NaN values already in your dataset. Then it multiplies that by the length of the column, which gives you how many NaN values you want to put in (n). Then it uses np.random.choice, which randomly chooses n indexes that don't already have NaN values in them.
Example:
df = pd.DataFrame({'y':np.random.randn(10), 'x1':np.random.randn(10), 'x2':np.random.randn(10)})
df.y.iloc[1]=np.nan
df.y.iloc[8]=np.nan
df.x2.iloc[5]=np.nan
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 0.289559
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 0.180651 NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 0.475805 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
df.apply(input_nan, pct=.2)
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 NaN
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 NaN NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 NaN 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
I have applied it to the whole dataset, but you can apply it to any column you want. For example, if you wanted 15% NaNs in columns y and x1, you could call df[['y','x1']].apply(input_nan, pct=.15)
I guess I am a little late to the party but if someone needs a solution that's faster and takes the percentage value into account when introducing null values, here's the code:
nan_percent = {'A':0.15, 'B':0.05, 'C':0.23}
for col, perc in nan_percent.items():
    df['null'] = np.random.choice([0, 1], size=df.shape[0], p=[1-perc, perc])
    df.loc[df['null'] == 1, col] = np.nan
df.drop(columns=['null'], inplace=True)
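A slightly leaner variant of the same idea (a sketch, assuming the same nan_percent dict) skips the helper column entirely by building a boolean mask directly:
# same idea without the temporary 'null' column
for col, perc in nan_percent.items():
    mask = np.random.random(df.shape[0]) < perc  # True with probability perc
    df.loc[mask, col] = np.nan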
I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.
Initially I tried a for-loop over each value of the dataframe, which was taking too much time.
Then I used data_new = data.subtract(data), which was meant to subtract all the values of the dataframe from itself so that I could make all the non-null values 0.
But an error occurred as the dataframe had multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise, and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull and cast the boolean result to int with astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
a b
0 NaN 1.0
1 4.0 NaN
2 NaN 3.0
print (df.notnull())
a b
0 False True
1 True False
2 False True
print ((df.notnull()).astype('int'))
a b
0 0 1
1 1 0
2 0 1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to have a source for a column populated via an operation on another.
e.g. if df['col1'] is the existing column
df['col2'] = df['col1'].apply(lambda x: 0 if pd.isnull(x) else 1)
where col2 is the new column. This should also work if col1 has string entries.
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import datetime as dt
# create dataframe with randomly placed NaN's
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)
trials = np.arange(100)
d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print((dt.datetime.now() - d1).total_seconds() / trials.size)
# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()
d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1
print((dt.datetime.now() - d1).total_seconds() / trials.size)
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
The code above does not work for me, but the code below does:
df[~df.isnull()] = 1 # not nan
df[df.isnull()] = 0 # nan
This was with pandas 0.25.3.
And if you want to just change values in specific columns, you may need to create a temp dataframe and assign it to the columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col]
tmp[tmp.isnull()] = 'xxx'
df[change_col] = tmp
Try this one:
df.notnull().mul(1)
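As a quick check on the small a/b sample from the earlier answer, this gives the same 0/1 frame as notnull().astype(int):
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1, np.nan, 3]})
df.notnull().mul(1)
#    a  b
# 0  0  1
# 1  1  0
# 2  0  1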
Here is a suggestion for a particular column: if a row in that column is NaN, replace it with 0; if a value is present, replace it with 1.
The line below will change the NaN entries in your column to 0:
df.YourColumnName.fillna(0, inplace=True)
Now the rest (the non-NaN part) can be replaced with 1 by the code below:
df["YourColumnName"] = df["YourColumnName"].apply(lambda x: 1 if x != 0 else 0)
Note that this assumes the column has no genuine 0 values. The same can be applied to the whole dataframe by not specifying the column name.
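For reference, a minimal sketch of that two-step approach over the whole dataframe (same caveat about genuine zeros applies):
# two-step version for every column at once (assumes no column contains real 0 values)
df = df.fillna(0)                                  # NaN -> 0
df = df.apply(lambda col: (col != 0).astype(int))  # everything else -> 1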
Use: df.fillna(0)
to fill NaN with 0.
Generally there are two steps: substitute all non-NaN values and then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line will replace all non-NaN values with 1.
dataframe.fillna(0) - this line will replace all NaNs with 0.
Side note: if you take a look at the pandas documentation, .where replaces all values where the condition is False - this is the important point. That is why we use the inversion ~dataframe.notna() as the mask by which .where() will replace values.
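Chained together, that becomes a one-liner (a short usage sketch, assuming dataframe is the frame in question):
# non-NaN -> 1 (where the mask is False), then NaN -> 0
result = dataframe.where(~dataframe.notna(), 1).fillna(0)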