Interpolate values in one column of a dataframe (python)

I have a dataframe with three columns (timestamp, temperature and waterlevel).
What I want to do is replace all NaN values in the waterlevel column with interpolated values. For example:
The waterlevel is always decreasing until it reaches 0, so it cannot be negative. Also, if the waterlevel stays the same, the interpolated values should stay the same as well. Ideally, the step size between the interpolated values (within two available waterlevel values) should be constant.
What I have tried so far was:
df['waterlevel'].interpolate(method='linear', limit_direction='backward')  # backwards because the waterlevel value is always decreasing
This does not work. After executing this line, every NaN value turns into 0 with limit_direction='forward' and stays NaN with limit_direction='backward'.
and
df = df['waterlevel'].assign(InterpolateLinear=df.target.interpolate(method='linear'))
Any suggestions on how to solve this?

I assume NaN here is the np.nan object.
import pandas as pd
import numpy as np
df = pd.DataFrame({"waterlevel": ['A',np.nan,np.nan,'D'],"interpolated values":['Ai','Bi','Ci','D']})
print(df)
df.loc[df['waterlevel'].isnull(),'waterlevel'] = df['interpolated values']
print(df)
O/P:
waterlevel interpolated values
0 A Ai
1 NaN Bi
2 NaN Ci
3 D D
waterlevel interpolated values
0 A Ai
1 Bi Bi
2 Ci Ci
3 D D
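For the numeric interpolation in the original question, note that interpolate() returns a new Series rather than modifying the column in place, so the result has to be assigned back. A minimal sketch (the column name is taken from the question; linear interpolation already keeps the step size constant between two known values, and clip() guards against negative results):
df['waterlevel'] = df['waterlevel'].interpolate(method='linear').clip(lower=0)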

Related

What is an efficient way to collect computed results in Python and turn them into a dataframe for analysis?

I am doing some computation on a dataset using loops. Based on a random event, I compute some float numbers (meaning I don't know in advance how many floats I am going to get). I want to save these numbers (results) in some kind of list and then store them in a dataframe column. I want to keep the results for each iteration of my loop in a column so I can compare them; that is, each iteration will produce a "list" of results that gets registered in a df column.
example:
for y in range(1, 10):
    for x in range(1, 100):
        if x > random_number and x < y:   # random_number stands for the random draw
            result = 2 * x
I want to save all the results in dataframe columns, one per (x, y) combination. For example, the results for x=1, y=2 in one column, then x=2, y=2 in another column, etc. The results are not all the same size, so I guess I'll use fillna.
Now I know that I can create an empty dataframe with max index and then fill it result by result, but I think there's a better way to do it!
Thanks in advance.
You want to take advantage of the efficiency that numpy and pandas give you. If you use numpy.where, you can set the value to nan when the if statement is False, and otherwise you can execute your formula:
import numpy as np
import pandas as pd
np.random.seed(0) # so you can reproduce my result, you can remove this in practice
x = list(range(10))
y = list(range(1, 11))
random_nums = 10 * np.random.random(10)
df = pd.DataFrame({'x' : x, 'y': y})
# the first argument is your if condition
df['new_col'] = np.where((df['x'] > random_nums) & (df['x'] < df['y']), 2*df['x'], np.nan)
print(df)
Here, random_nums generates an entire np.ndarray of random numbers to compare with. This gives
x y new_col
0 0 1 NaN
1 1 2 NaN
2 2 3 NaN
3 3 4 NaN
4 4 5 NaN
5 5 6 NaN
6 6 7 12.0
7 7 8 NaN
8 8 9 NaN
9 9 10 18.0
This is especially faster if your formula (here, 2*x) is relatively quick to compute.
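If you need a result for every (x, y) combination from the question's nested loops rather than just pairwise rows, the same np.where idea applies to a flattened grid. A rough sketch (the ranges come from the question's loops; the grid and column names here are just illustrative):
import numpy as np
import pandas as pd
xs, ys = np.meshgrid(np.arange(1, 100), np.arange(1, 10))
grid = pd.DataFrame({'x': xs.ravel(), 'y': ys.ravel()})  # one row per (x, y) combination
random_nums = 10 * np.random.random(len(grid))           # one random draw per row
grid['result'] = np.where((grid['x'] > random_nums) & (grid['x'] < grid['y']), 2 * grid['x'], np.nan)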

In pandas: Interpolate between two rows such that the sum of interpolated values is equal to the second row

I am looking for a way to interpolate between two values (A and G) such that the sum of the interpolated values is equal to the second value (G), preferably while the distances between the interpolated values are linear/equally-sized.
What I got is:
Label  Value
A      0
B      NaN
C      NaN
D      NaN
E      NaN
F      NaN
G      10
... and I want to get to this:
Label  Value
A      0
B      2
C      2
D      2
E      2
F      2
G      10
The function pandas.interpolate unfortunately does not allow for this. I could manually create sections in these columns using something like numpy.linspace, but this appears to be a rather makeshift solution and not particularly efficient for larger tables where the indices that require interpolation are irregularly scattered across rows.
What is the most efficient way to do this in Python?
I don't know if this is the most efficient way but it works for any number of breaks, including none, using only numpy and pandas:
df['break'] = np.where(df['Value'].notnull(), 1, 0)
df['group'] = df['break'].shift().fillna(0).cumsum()
df['Value'] = df.groupby('group').Value.apply(lambda x: x.fillna( x.max() / (len(x)-1) ) )
You will get a couple of warnings from the underlying numpy calculations due to NaNs and zeroes but the replacement still works.
RuntimeWarning: invalid value encountered in double_scalars
RuntimeWarning: divide by zero encountered in double_scalars
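A quick check on the question's table (a small sketch that uses transform instead of apply, which keeps the result aligned with the original index across pandas versions):
import numpy as np
import pandas as pd
df = pd.DataFrame({'Label': list('ABCDEFG'),
                   'Value': [0, np.nan, np.nan, np.nan, np.nan, np.nan, 10]})
df['break'] = np.where(df['Value'].notnull(), 1, 0)
df['group'] = df['break'].shift().fillna(0).cumsum()
# spread each group's known value evenly over its NaN rows
df['Value'] = df.groupby('group')['Value'].transform(lambda x: x.fillna(x.max() / (len(x) - 1)))
print(df['Value'].tolist())   # [0.0, 2.0, 2.0, 2.0, 2.0, 2.0, 10.0]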

Fill missing data with random values from categorical column - Python

I'm working on a hotel booking dataset. Within the data frame, there's a discrete numerical column called ‘agent’ that has 13.7% missing values. My intuition is to just drop the rows of missing values, but considering the number of missing values is not that small, now I want to use the Random Sampling Imputation to replace them proportionally with the existing categorical variables.
My code is:
new_agent = hotel['agent'].dropna()
agent_2 = hotel['agent'].fillna(lambda x: random.choice(new_agent,inplace=True))
results
The first 3 rows were NaN but are now replaced with <function <lambda> at 0x7ffa2c53d700>. Is there something wrong with my code, maybe in the lambda syntax?
UPDATE:
Thanks to ti7, who helped me solve the problem:
new_agent = hotel['agent'].dropna()  # get a series of just the available values
n_null = hotel['agent'].isnull().sum()  # number of missing entries
new_agent.sample(n_null, replace=True).values  # sample it with repetition and get values
hotel.loc[hotel['agent'].isnull(), 'agent'] = new_agent.sample(n_null, replace=True).values  # fill and replace
.fillna() is naively assigning your function to the missing values. It can do this because functions are really objects!
You probably want to generate a new Series of random values drawn from your current Series (you know the shape you need by subtracting the lengths) and use it for the missing values.
get a Series of just the available values (.dropna())
.sample() it with repetition (replace=True) to a new Series of the same length as the missing entries (df["agent"].isna().sum())
get the .values (this is a flat numpy array)
filter the column and assign
quick code
df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
df["agent"].isna().sum(), # get the same number of values as are missing
replace=True # repeat values
).values # throw out the index
demo
>>> import pandas as pd
>>> df = pd.DataFrame({'agent': [1,2, None, None, 10], 'b': [3,4,5,6,7]})
>>> df
agent b
0 1.0 3
1 2.0 4
2 NaN 5
3 NaN 6
4 10.0 7
>>> df["agent"].isna().sum()
2
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 1.])
>>> df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True).values
array([2., 2.])
>>> df.loc[df["agent"].isna(), "agent"] = df["agent"].dropna().sample(
... df["agent"].isna().sum(),
... replace=True
... ).values
>>> df
agent b
0 1.0 3
1 2.0 4
2 10.0 5
3 2.0 6
4 10.0 7
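If you need the imputation to be reproducible, .sample() also accepts a random_state argument, for example:
df["agent"].dropna().sample(df["agent"].isna().sum(), replace=True, random_state=0).values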

Randomly introduce NaN values in pandas dataframe

How can I randomly introduce NaN values into my dataset for each column, taking into account the null values already in my starting data?
I want to have, for example, 20% NaN values per column.
For example:
If my dataset has 3 columns "A", "B" and "C", each with an existing NaN rate, how do I randomly introduce NaN values per column to reach 20% per column:
A: 10% nan
B: 15% nan
C: 8% nan
For the moment I have tried this code, but it degrades my dataset too much and I don't think it is the right way:
df = df.mask(np.random.choice([True, False], size=df.shape, p=[.20,.80]))
I am not sure what you mean by the last part ("degrades too much"), but here is a rough way to do it.
import numpy as np
import pandas as pd
A = pd.Series(np.arange(99))
# Original missing rate (for illustration)
nanidx = A.sample(frac=0.1).index
A[nanidx] = np.NaN
###
# Complementing to 20%
# Original ratio
ori_rat = A.isna().mean()
# Adjusting for the dataframe without missing values
add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
nanidx2 = A.dropna().sample(frac=add_miss_rat).index
A[nanidx2] = np.NaN
A.isna().mean()
Obviously, it will not always be exactly 20%...
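To see why the adjustment is needed: if 10% of the column is already NaN, then add_miss_rat = (0.2 - 0.1) / (1 - 0.1) ≈ 0.111, and sampling 11.1% of the remaining 90% non-missing values converts roughly another 10% of the total to NaN, bringing the column to about 20% overall.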
Update
Applying it to the whole dataframe:
for col in df:
    ori_rat = df[col].isna().mean()
    if ori_rat >= 0.2:
        continue
    add_miss_rat = (0.2 - ori_rat) / (1 - ori_rat)
    vals_to_nan = df[col].dropna().sample(frac=add_miss_rat).index
    df.loc[vals_to_nan, col] = np.NaN
Update 2
I made a correction to also take into account the effect of dropping NaN values when calculating the ratio.
Unless you have a giant DataFrame and speed is a concern, the easy-peasy way to do it is by iteration.
import pandas as pd
import numpy as np
import random
df = pd.DataFrame({'A':list(range(100)),'B':list(range(100)),'C':list(range(100))})
#before adding nan
print(df.head(10))
nan_percent = {'A':0.10, 'B':0.15, 'C':0.08}
for col in df:
    for i, row_value in df[col].items():  # iteritems() was removed in pandas 2.0; items() is equivalent
        if random.random() <= nan_percent[col]:
            df.loc[i, col] = np.nan
#after adding nan
print(df.head(10))
Here is a way to get as close to 20% nan in each column as possible:
def input_nan(x, pct):
    n = int(len(x) * (pct - x.isna().mean()))
    idxs = np.random.choice(len(x), max(n, 0), replace=False, p=x.notna() / x.notna().sum())
    x.iloc[idxs] = np.nan

df.apply(input_nan, pct=.2)
It first takes the difference between the NaN percentage you want and the percentage of NaN values already in your dataset. Then it multiplies that by the length of the column, which gives you how many NaN values to put in (n). Then it uses np.random.choice, which randomly chooses n indexes that don't already hold NaN values.
Example:
df = pd.DataFrame({'y':np.random.randn(10), 'x1':np.random.randn(10), 'x2':np.random.randn(10)})
df.y.iloc[1]=np.nan
df.y.iloc[8]=np.nan
df.x2.iloc[5]=np.nan
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 0.289559
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 0.180651 NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 0.475805 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
df.apply(input_nan, pct=.2)
# y x1 x2
# 0 2.635094 0.800756 -1.107315
# 1 NaN 0.055017 0.018097
# 2 0.673101 -1.053402 1.525036
# 3 0.246505 0.005297 NaN
# 4 0.883769 1.172079 0.551917
# 5 -1.964255 NaN NaN
# 6 -0.247067 0.431622 -0.846953
# 7 0.603750 NaN 0.524619
# 8 NaN -0.452400 -0.191480
# 9 -0.583601 -0.446071 0.029515
I have applied it to the whole dataset, but you can apply it to any column you want. For example, if you wanted 15% NaNs in columns y and x1, you could call df[['y','x1']].apply(input_nan, pct=.15)
I guess I am a little late to the party but if someone needs a solution that's faster and takes the percentage value into account when introducing null values, here's the code:
nan_percent = {'A':0.15, 'B':0.05, 'C':0.23}
for col, perc in nan_percent.items():
    df['null'] = np.random.choice([0, 1], size=df.shape[0], p=[1-perc, perc])
    df.loc[df['null'] == 1, col] = np.nan
df.drop(columns=['null'], inplace=True)
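The same idea also works without the temporary 'null' column, by building a boolean mask directly (a small variant of the above, not from the original answer; it assumes the same nan_percent dict):
for col, perc in nan_percent.items():
    mask = np.random.random(len(df)) < perc
    df.loc[mask, col] = np.nan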

Forming an array from column values based on a different column value condition

I have the following dataset.
Column 1   Column 2
1          10
2          20
3          30
4          40
5          50
6          60
7          70
8          80
9          90
10         100
My aim is to replace the values in Column 2 with the average of two successive values whenever the value in Column 1 is less than 5 or greater than 6. To be precise, in the <5 range the Column 1 values are 1, 2, 3 and 4 and the corresponding Column 2 values are 10, 20, 30, 40, so I want to take the averages (10,20)=15 and (30,40)=35. I want to do the same in the >6 range for Column 1: take the averages of the corresponding Column 2 values, (70,80)=75 and (90,100)=95. I will not average the Column 2 values where Column 1 does not fall in those two ranges (Column 1 values 5 and 6, with Column 2 values 50 and 60), and finally I want to create an array of Column 2 values based on these three conditions.
I tried the following approach:
import numpy as np
import pandas as pd
data= pd.read_table('/Users/Hrihaan/Desktop/Data.txt', dtype=float, header=None, sep='\s+').values
t=data[:,0]
df = pd.DataFrame({"x":t, "y":data[:,1]})
x=np.where(t<=4,data[:,1],np.nan)
x_1=np.nanmean(x.reshape(-1, 2), axis=1)
y=np.where((df.x>4)&(df.x<7), df.y,np.nan)
z=np.where(t>6,data[:,1],np.nan)
z_1=np.nanmean(z.reshape(-1, 2), axis=1)
A=np.concatenate((x_1,y,z_1), axis=0)
print(A)
I am getting the following output:
[ 15. 35. nan nan nan nan nan nan nan 50. 60. nan nan nan nan
nan nan nan 75. 95.]
My expected output is:
[ 15. 35. 50. 60. 75. 95.]
Any help on how to get around the np.nan in my code would be really helpful.
I genuinely have difficulty seeing your bigger concept here. For your very specific problem, this would work:
import pandas as pd
#read your file
data= pd.read_table('test.txt', dtype = float, delim_whitespace = True, names = ["x", "y"])
#define rows you want to exclude
exclude_rows = set([5, 6])
#create new column with rolling mean of two rows
data["mean"] = data["y"].rolling(2).mean()
#overwrite rolling mean, when row should be excluded from calculating the average
data["mean"][data["x"].isin(exclude_rows)] = data["y"]
#filter data
A = data["mean"][(data["x"].isin(exclude_rows)) | (data["x"] % 2 == 0)]
But what is your expected output if you want to exclude, for instance, x = 4 and 6? Then you have several singular values for which you didn't give any instruction on what should happen to them in the process of averaging.
This does what you want
a = np.vstack((np.arange(1, 11), np.arange(10, 110, 10))).T   # Column 1 and Column 2 from the question
b = (a[:-1, 1] + a[1:, 1]) / 2                                # averages of successive Column 2 values
indL = np.argmax(a[:, 0] > 5) - 1                             # last index with Column 1 <= 5
indH = np.argmax(a[:, 0] > 6)                                 # first index with Column 1 > 6
out = np.hstack((b[:indL:2], a[indL:indH, 1], b[indH::2]))    # pair averages, untouched middle, pair averages
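Running this gives out = array([15., 35., 50., 60., 75., 95.]), which matches the expected output.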
