Randomly introduce NaN values in pandas dataframe - python

How can I randomly introduce NaN values into my dataset, column by column, taking into account the null values already present in my starting data?
I want to end up with, for example, 20% NaN values per column.
For example:
If my dataset has 3 columns "A", "B" and "C", each with its own existing NaN rate, how do I randomly add NaN values per column to reach 20% in each:
A: 10% nan
B: 15% nan
C: 8% nan
So far I have tried the code below, but it degrades my dataset too much and I don't think it is the right way:
df = df.mask(np.random.choice([True, False], size=df.shape, p=[.20,.80]))

I am not sure what you mean by the last part ("degrades too much"), but here is a rough way to do it.
import numpy as np
import pandas as pd
A = pd.Series(np.arange(99))
# Original missing rate (for illustration)
nanidx = A.sample(frac=0.1).index
A[nanidx] = np.NaN
###
# Complementing to 20%
# Original ratio
ori_rat = A.isna().mean()
# Adjusting for the dataframe without missing values
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
nanidx2 = A.dropna().sample(frac=add_miss_rat).index
A[nanidx2] = np.NaN
A.isna().mean()
Obviously, it will not always be exactly 20%...
Update
Applying it for the whole dataframe
for col in df:
    ori_rat = df[col].isna().mean()
    if ori_rat >= 0.2: continue
    add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
    vals_to_nan = df[col].dropna().sample(frac=add_miss_rat).index
    df.loc[vals_to_nan, col] = np.NaN
Update 2
I made a correction to also take into account the effect of dropping NaN values when calculating the ratio.
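A quick sanity check of that correction, as a sketch: apply the adjusted fraction to a toy Series that starts at roughly 10% missing and confirm it lands near the 20% target (the figures are taken from the example above, and random_state is only there to make the run repeatable).
import numpy as np
import pandas as pd

s = pd.Series(np.arange(1000), dtype=float)
s[s.sample(frac=0.1, random_state=0).index] = np.nan    # start at ~10% missing

ori_rat = s.isna().mean()
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)          # fraction of the remaining values to blank out
s[s.dropna().sample(frac=add_miss_rat, random_state=0).index] = np.nan

print(s.isna().mean())                                  # close to 0.20 (exact only if the counts divide evenly)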

Unless you have a giant DataFrame and speed is a concern, the easy-peasy way to do it is by iteration.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'A':list(range(100)),'B':list(range(100)),'C':list(range(100))})
#before adding nan
print(df.head(10))
nan_percent = {'A':0.10, 'B':0.15, 'C':0.08}
for col in df:
    for i, row_value in df[col].items():   # iteritems() was removed in pandas 2.0
        if random.random() <= nan_percent[col]:
            df.loc[i, col] = np.nan        # .loc avoids chained-assignment issues
#after adding nan
print(df.head(10))

Here is a way to get as close to 20% nan in each column as possible:
def input_nan(x, pct=0.2):
    # number of extra NaNs needed to reach the target share
    n = int(len(x) * (pct - x.isna().mean()))
    # sample only from positions that are not already NaN
    idxs = np.random.choice(len(x), max(n, 0), replace=False, p=x.notna() / x.notna().sum())
    x.iloc[idxs] = np.nan
df.apply(input_nan, pct=.2)
It first takes the difference between the NaN percentage you want and the percentage of NaN values already in your dataset. It then multiplies that by the length of the column, which gives the number of NaN values to put in (n). Finally it uses np.random.choice, which randomly chooses n indexes that don't already hold NaN values.
Example:
df = pd.DataFrame({'y':np.random.randn(10), 'x1':np.random.randn(10), 'x2':np.random.randn(10)})
df.y.iloc[1]=np.nan
df.y.iloc[8]=np.nan
df.x2.iloc[5]=np.nan
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 0.289559
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 0.180651 NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 0.475805 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
df.apply(input_nan)
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 NaN
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 NaN NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 NaN 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
I have applied it to the whole dataset, but you can apply it to any column you want. For example, if you wanted 15% NaNs in columns y and x1, you could call df[['y','x1']].apply(input_nan, pct=.15).

I guess I am a little late to the party but if someone needs a solution that's faster and takes the percentage value into account when introducing null values, here's the code:
nan_percent = {'A':0.15, 'B':0.05, 'C':0.23}
for col, perc in nan_percent.items():
    df['null'] = np.random.choice([0, 1], size=df.shape[0], p=[1-perc, perc])
    df.loc[df['null'] == 1, col] = np.nan
df.drop(columns=['null'], inplace=True)
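A variant of the same idea that skips the temporary 'null' column: draw the random mask directly and assign through .loc. This is only a sketch, assuming df and nan_percent are the same objects as above.
import numpy as np

nan_percent = {'A': 0.15, 'B': 0.05, 'C': 0.23}
for col, perc in nan_percent.items():
    mask = np.random.random(len(df)) < perc   # True with probability perc, row by row
    df.loc[mask, col] = np.nan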

Related

What is an efficient way to collect computed results in Python and turn them into a dataframe for analysis?

I am doing some computation on a dataset using loops. Based on a random event, I compute some float numbers, which means I don't know in advance how many floats I am going to retrieve. I want to save these results in some kind of list and then store them in a dataframe column, one set per iteration of my loop, so I can compare them: each iteration produces a "list" of results that should be registered in a df column.
example:
for y in range(1, 10):
    for x in range(1, 100):
        if x > random_number and x < y:
            result = 2 * x
I want to save all the results in dataframe columns, one per (x, y) combination. For example, the results for x=1, y=2 go in one column, then x=2, y=2 in another, etc. The result lists are not all the same size, so I guess I'll use fillna.
Now I know that I can create an empty dataframe with max index and then fill it result by result, but I think there's a better way to do it!
Thanks in advance.
You want to take advantage of the efficiency that numpy and pandas give you. If you use numpy.where, you can set the value to nan when the if statement is False, and otherwise you can execute your formula:
import numpy as np
import pandas as pd
np.random.seed(0) # so you can reproduce my result, you can remove this in practice
x = list(range(10))
y = list(range(1, 11))
random_nums = 10 * np.random.random(10)
df = pd.DataFrame({'x' : x, 'y': y})
# the first argument is your if condition
df['new_col'] = np.where((df['x'] > random_nums) & (df['x'] < df['y']), 2*df['x'], np.nan)
print(df)
Here, random_nums generates an entire np.ndarray of random numbers to compare with. This gives
x y new_col
0 0 1 NaN
1 1 2 NaN
2 2 3 NaN
3 3 4 NaN
4 4 5 NaN
5 5 6 NaN
6 6 7 12.0
7 7 8 NaN
8 8 9 NaN
9 9 10 18.0
This is especially faster if your formula (here, 2*x) is relatively quick to compute.
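If you also want one column of results per (x, y) combination, as described in the question, the same np.where call extends to a full grid; a sketch, where the uniform random numbers stand in for the question's unspecified "random number".
import numpy as np
import pandas as pd

np.random.seed(0)
# every (y, x) pair from the nested loops, as one long frame
idx = pd.MultiIndex.from_product([range(1, 10), range(1, 100)], names=['y', 'x'])
grid = idx.to_frame(index=False)
random_nums = 100 * np.random.random(len(grid))

grid['result'] = np.where((grid['x'] > random_nums) & (grid['x'] < grid['y']),
                          2 * grid['x'], np.nan)
# pivot so each y becomes a column of results indexed by x
wide = grid.set_index(['x', 'y'])['result'].unstack('y')
print(wide.head())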

Interpolate values in one column of a dataframe (python)

I have a dataframe with three columns (timestamp, temperature and waterlevel).
What I want to do is to replace all NaN values in the waterlevel column with interpolated values. For example:
The waterlevel value is always decreasing till it is 0. Therefore, the waterlevel cannot be negative. Also, if the waterlevel is staying the same, the interpolated values should also be the same. Ideally, the stepsize between the interpolated values (within two available waterlevel values) should be the same.
What I have tried so far was:
df['waterlevel'].interpolate(method ='linear', limit_direction ='backward') # backwards because the waterlevel value is always decreasing.
This does not work: after executing this line, every NaN value turns into 0 with the parameter 'forward' and stays NaN with the parameter 'backward'.
and
df = df['waterlevel'].assign(InterpolateLinear=df.target.interpolate(method='linear'))
Any suggestions on how to solve this?
I assume NaN here is the np.nan object.
import pandas as pd
import numpy as np
df = pd.DataFrame({"waterlevel": ['A',np.nan,np.nan,'D'],"interpolated values":['Ai','Bi','Ci','D']})
print(df)
df.loc[df['waterlevel'].isnull(),'waterlevel'] = df['interpolated values']
print(df)
O/P:
waterlevel interpolated values
0 A Ai
1 NaN Bi
2 NaN Ci
3 D D
waterlevel interpolated values
0 A Ai
1 Bi Bi
2 Ci Ci
3 D D
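For the interpolation itself, a minimal sketch of what pandas' own interpolate can do for the waterlevel column, assuming the NaNs sit between valid readings: linear interpolation gives equal steps between two known values, and clip enforces that the level never goes negative.
import numpy as np
import pandas as pd

df = pd.DataFrame({'waterlevel': [10.0, np.nan, np.nan, 4.0, np.nan, 0.0]})
df['waterlevel'] = (df['waterlevel']
                    .interpolate(method='linear', limit_direction='both')
                    .clip(lower=0))
print(df)   # 10.0, 8.0, 6.0, 4.0, 2.0, 0.0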

Forming an array from column values based on a different column value condition

I have the following dataset.
Column 1  Column 2
1         10
2         20
3         30
4         40
5         50
6         60
7         70
8         80
9         90
10        100
My aim is to replace the values in Column 2 with the average of two successive values whenever the value in Column 1 is less than 5 or greater than 6. To be precise, in the <5 range the Column 1 values are 1, 2, 3 and 4 and the corresponding Column 2 values are 10, 20, 30, 40, so I want the averages (10,20)=15 and (30,40)=35. I want to do the same in the >6 range, taking the averages of the corresponding Column 2 values: (70,80)=75 and (90,100)=95. I will not average the Column 2 values where Column 1 falls outside those two ranges (Column 1 values 5 and 6, Column 2 values 50 and 60), and finally I want to create an array of Column 2 values based on these three conditions.
I tried the following approach:
import numpy as np
import pandas as pd
data= pd.read_table('/Users/Hrihaan/Desktop/Data.txt', dtype=float, header=None, sep='\s+').values
t=data[:,0]
df = pd.DataFrame({"x":t, "y":data[:,1]})
x=np.where(t<=4,data[:,1],np.nan)
x_1=np.nanmean(x.reshape(-1, 2), axis=1)
y=np.where((df.x>4)&(df.x<7), df.y,np.nan)
z=np.where(t>6,data[:,1],np.nan)
z_1=np.nanmean(z.reshape(-1, 2), axis=1)
A=np.concatenate((x_1,y,z_1), axis=0)
print(A)
I am getting the following output:
[ 15. 35. nan nan nan nan nan nan nan 50. 60. nan nan nan nan
nan nan nan 75. 95.]
My expected output is:
[ 15. 35. 50. 60. 75. 95.]
Any help on how to get around the np.nan in my code would be really helpful.
I genuinely have difficulty seeing your bigger concept here. For your very specific problem, this would work:
import pandas as pd
#read your file
data= pd.read_table('test.txt', dtype = float, delim_whitespace = True, names = ["x", "y"])
#define rows you want to exclude
exclude_rows = set([5, 6])
#create new column with rolling mean of two rows
data["mean"] = data["y"].rolling(2).mean()
#overwrite rolling mean, when row should be excluded from calculating the average
data["mean"][data["x"].isin(exclude_rows)] = data["y"]
#filter data
A = data["mean"][(data["x"].isin(exclude_rows)) | (data["x"] % 2 == 0)]
But what would your expected output be if you wanted to exclude, for instance, x = 4 and 6? Then you would have several singular values for which you haven't given any instruction about what should happen to them when averaging.
This does what you want
import numpy as np

a = np.vstack((np.arange(1, 11), np.arange(10, 110, 10))).T   # column 1: 1..10, column 2: 10..100
b = (a[:-1, 1] + a[1:, 1]) / 2                                # means of successive column-2 values
indL = np.argmax(a[:, 0] > 5) - 1                             # last row with column 1 <= 5
indH = np.argmax(a[:, 0] > 6)                                 # first row with column 1 > 6
out = np.hstack((b[:indL:2], a[indL:indH, 1], b[indH::2]))    # pairwise means, untouched middle, pairwise means

Python pandas shift dataframe with time index value

I am quite new to Python and am struggling with shift in pandas.
I am comparing two sets of data, but they need to be aligned first. To align them, I only need to shift one set's index values.
Reference data: Data to be shifted:
acc acc
index index
1480681219**96**0000000 1 1480681220**04**0000000 8
1480681220**00**0000000 2 1480681220**08**0000000 9
1480681220**04**0000000 3 1480681220**12**0000000 7
1480681220**08**0000000 4 1480681220**16**0000000 10
1480681220**12**0000000 5 1480681220**20**0000000 6
(The bold editing option did not seem to work, but I wanted to highlight those parts of the indexes)
I would like to shift my data frame by the amount of extra time given. Please note, the time is in nanoseconds. I realized that something like df.shift(2) shifts my data 2 places, but I would like to shift my data by -80000000 nanoseconds, which in this case is 2 places:
Input:
acc
index
1480681220040000000 8
1480681220080000000 9
1480681220120000000 7
1480681220160000000 10
1480681220200000000 6
Desired output:
acc
index
1480681219960000000 8
1480681220000000000 9
1480681220040000000 7
1480681220080000000 10
1480681220120000000 6
1480681220160000000 NaN
1480681220200000000 NaN
This is a smaller scale of my code:
class device_data(object):
    def __init__(self):
        _index = [1480681220040000000,
                  1480681220080000000,
                  1480681220120000000,
                  1480681220160000000,
                  1480681220200000000]
        self.df = pd.DataFrame({'acc': [8, 9, 7, 10, 6], 'index': _index})
        self.df = self.df.set_index('index')

if __name__ == '__main__':
    extratime = np.int64(-40000000)
    session = dict()
    session[2] = {'testnumber': '401',
                  'devicename': 'peanut'}
    session[2]['data_in_device_class'] = device_data()
    print session[2]['data_in_device_class'].df
    if hasattr(session[2]['data_in_device_class'], 'df'):
        session[2]['data_in_device_class'].df = session[2]['data_in_device_class'].df.shift(int(round(extratime)))
    else:
        pass
    print session[2]['data_in_device_class'].df
When I ran the original code, it gave me this error: OverflowError: Python int too large to convert to C long
I used extratime = np.int64(extratime) to solve that problem, though I notice it is not really needed with the scaled-down version of my code.
My question still stands: how can I use shift to move my index by a value amount rather than by the number of places it needs to move?
Thank you
First you want to shift your index by the desired amount and then reindex. To make things easier I take a copy here, shift its index, and reindex on the union of the shifted index and the original index to introduce the NaN rows:
In [232]:
df1 = df.copy()
df1.index -= 80000000
df1.reindex(df1.index.union(df.index))
Out[232]:
acc
index
1480681219960000000 8.0
1480681220000000000 9.0
1480681220040000000 7.0
1480681220080000000 10.0
1480681220120000000 6.0
1480681220160000000 NaN
1480681220200000000 NaN
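If the index is converted to real timestamps, shift itself can move the index by a time offset via its freq argument; a hedged sketch on the same numbers (note that shift(freq=...) only relabels the index, so the trailing NaN rows still come from the reindex step).
import pandas as pd

idx = pd.to_datetime([1480681220040000000, 1480681220080000000,
                      1480681220120000000, 1480681220160000000,
                      1480681220200000000])                    # integers interpreted as nanoseconds
df = pd.DataFrame({'acc': [8, 9, 7, 10, 6]}, index=idx)

shifted = df.shift(freq=pd.Timedelta(-80000000, unit='ns'))    # move the labels, not the data
result = shifted.reindex(shifted.index.union(df.index))        # reintroduce the trailing NaN rows
print(result)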
IIUC:
You can just reassign your index as itself plus the extra time.
Consider the dataframe df as an example
df = pd.DataFrame(np.arange(100).reshape(5, -1))
df
I can "shift" the entire dataframe down like this
df.index = df.index + 5
df
Let me know if this is on the mark. Otherwise, I'll delete it.

How to replace all non-NaN entries of a dataframe with 1 and all NaN with 0

I have a dataframe with 71 columns and 30597 rows. I want to replace all non-nan entries with 1 and the nan values with 0.
Initially I tried a for-loop over each value of the dataframe, which was taking too much time.
Then I used data_new = data.subtract(data), which was meant to subtract all the values of the dataframe from itself so that I could make all the non-null values 0.
But an error occurred as the dataframe had multiple string entries.
You can take the return value of df.notnull(), which is False where the DataFrame contains NaN and True otherwise and cast it to integer, giving you 0 where the DataFrame is NaN and 1 otherwise:
newdf = df.notnull().astype('int')
If you really want to write into your original DataFrame, this will work:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
Use notnull and cast the boolean result to int with astype:
print ((df.notnull()).astype('int'))
Sample:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1,np.nan,3]})
print (df)
a b
0 NaN 1.0
1 4.0 NaN
2 NaN 3.0
print (df.notnull())
a b
0 False True
1 True False
2 False True
print ((df.notnull()).astype('int'))
a b
0 0 1
1 1 0
2 0 1
I'd advise making a new column rather than just replacing. You can always delete the previous column if necessary, but it's always helpful to have a source for a column populated via an operation on another.
e.g. if df['col1'] is the existing column
df['col2'] = df['col1'].apply(lambda x: 1 if not pd.isnull(x) else 0)
where col2 is the new column. This also works if col1 has string entries.
I do a lot of data analysis and am interested in finding new/faster methods of carrying out operations. I had never come across jezrael's method, so I was curious to compare it with my usual method (i.e. replace by indexing). NOTE: This is not an answer to the OP's question, rather it is an illustration of the efficiency of jezrael's method. Since this is NOT an answer I will remove this post if people do not find it useful (and after being downvoted into oblivion!). Just leave a comment if you think I should remove it.
I created a moderately sized dataframe and did multiple replacements using both the df.notnull().astype(int) method and simple indexing (how I would normally do this). It turns out that the latter is slower by approximately five times. Just an fyi for anyone doing larger-scale replacements.
from __future__ import division, print_function
import numpy as np
import pandas as pd
import datetime as dt
# create dataframe with randomly place NaN's
data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)
trials = np.arange(100)
d1 = dt.datetime.now()
for r in trials:
    new_df = df.notnull().astype(int)
print( (dt.datetime.now()-d1).total_seconds()/trials.size )
# create a dummy copy of df. I use a dummy copy here to prevent biasing the
# time trial with dataframe copies/creations within the upcoming loop
df_dummy = df.copy()
d1 = dt.datetime.now()
for r in trials:
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull()==False] = 1
print( (dt.datetime.now()-d1).total_seconds()/trials.size )
This yields times of 0.142 s and 0.685 s respectively. It is clear who the winner is.
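As a hedged aside on the measurement itself, the standard-library timeit module repeats each call and lets you take the best run, which tends to be less noisy than averaging datetime.now() deltas; a sketch of the same comparison.
import timeit
import numpy as np
import pandas as pd

data = np.ones((100, 100))
data.ravel()[np.random.choice(data.size, data.size // 10, replace=False)] = np.nan
df = pd.DataFrame(data=data)
df_dummy = df.copy()

def by_notnull():
    return df.notnull().astype(int)

def by_indexing():
    df_dummy[df.isnull()] = 0
    df_dummy[df.isnull() == False] = 1

print(min(timeit.repeat(by_notnull, number=100, repeat=3)))
print(min(timeit.repeat(by_indexing, number=100, repeat=3)))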
There is a method .fillna() on DataFrames which does what you need. For example:
df = df.fillna(0) # Replace all NaN values with zero, returning the modified DataFrame
or
df.fillna(0, inplace=True) # Replace all NaN values with zero, updating the DataFrame directly
Regarding fmarc's answer:
df.loc[~df.isnull()] = 1 # not nan
df.loc[df.isnull()] = 0 # nan
The code above does not work for me; the version below does.
df[~df.isnull()] = 1 # not nan
df[df.isnull()] = 0 # nan
This is with pandas 0.25.3.
And if you want to just change values in specific columns, you may need to create a temp dataframe and assign it to the columns of the original dataframe:
change_col = ['a', 'b']
tmp = df[change_col]
tmp[tmp.isnull()]='xxx'
df[change_col]=tmp
Try this one:
df.notnull().mul(1)
Here is a suggestion for a particular column: if a row in that column is NaN, replace it with 0, and if it holds a value, replace it with 1.
The line below will replace the NaNs in your column with 0:
df.YourColumnName.fillna(0,inplace=True)
Now the rest (the originally non-NaN part) is replaced with 1 by the code below:
df["YourColumnName"]=df["YourColumnName"].apply(lambda x: 1 if x!=0 else 0)
The same can be applied to the whole dataframe by not specifying the column name.
Use: df.fillna(0)
to fill NaN with 0.
Generally there are two steps: substitute all non-NaN values, and then substitute all NaN values.
dataframe.where(~dataframe.notna(), 1) - this line replaces all non-NaN values with 1.
dataframe.fillna(0) - this line replaces all NaNs with 0.
Side note: if you look at the pandas documentation, .where replaces the values where the condition is False - that is the important point. That is why we invert the mask with ~dataframe.notna(), so that .where() replaces exactly the non-NaN values.
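Putting the two steps together on a small frame, as a sketch of the chain described above:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [np.nan, 4, np.nan], 'b': [1, np.nan, 3]})
result = df.where(~df.notna(), 1).fillna(0)   # non-NaN -> 1 first, then NaN -> 0
print(result)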
