I have a three-column dataset formatted as follows:
t_stamp,Xval,Ytval
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
How can we predict the current value of Y (the true value) from the last 5 data points of Xval using a random forest classifier model from sklearn in Python? That is, taking [0,0,1,2,3] from the Xval column as input, I want to predict the 5th-row value of Ytval. With a simple rolling OLS regression model we can do it as follows, but I want to do it with a random forest model.
import pandas as pd
df = pd.read_csv('data_pred.csv')
# Note: pd.stats.ols.MovingOLS only exists in old pandas releases (it was removed in pandas 0.20).
model = pd.stats.ols.MovingOLS(y=df.Ytval, x=df[['Xval']],
                               window_type='rolling', window=5, intercept=True)
You can build the rolling input yourself by reshaping your data so that each of the last 5 values of X becomes its own feature:
import pandas as pd
from io import StringIO
from sklearn.ensemble import RandomForestRegressor
data = StringIO("""t_stamp,Xval,Ytval
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10""")
df = pd.read_csv(data)
for i in range(1,6):
    df['Xval_t'+str(i)] = df['Xval'].shift(i)
Which yields df:
t_stamp Xval Ytval Xval_t1 Xval_t2 Xval_t3 Xval_t4 Xval_t5
0.000543 0 10 NaN NaN NaN NaN NaN
0.000575 0 10 0.0 NaN NaN NaN NaN
0.041324 1 10 0.0 0.0 NaN NaN NaN
0.041331 2 10 1.0 0.0 0.0 NaN NaN
0.041336 3 10 2.0 1.0 0.0 0.0 NaN
0.041340 4 10 3.0 2.0 1.0 0.0 0.0
0.041345 5 10 4.0 3.0 2.0 1.0 0.0
0.041350 6 10 5.0 4.0 3.0 2.0 1.0
0.041354 7 10 6.0 5.0 4.0 3.0 2.0
Of course, you need to decide on how to handle the NaNs. I just drop them for demonstration purposes.
df.dropna(inplace=True)
X = df[['Xval', 'Xval_t1', 'Xval_t2', 'Xval_t3', 'Xval_t4', 'Xval_t5']].values
y = df['Ytval'].values
reg = RandomForestRegressor()
reg.fit(X,y)
print(reg.predict(X))
Result:
[ 10. 10. 10. 10.]
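The question asked for exactly the last five values including the current one (for example [0,0,1,2,3] for the 5th row), whereas the code above uses the current value plus five lags. As a minimal sketch of the five-value variant, assuming NumPy 1.20 or later for `sliding_window_view` and with `X_win`, `y_win` and `window` being illustrative names only:

```python
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view
from sklearn.ensemble import RandomForestRegressor

df = pd.DataFrame({
    "Xval": [0, 0, 1, 2, 3, 4, 5, 6, 7],
    "Ytval": [10] * 9,
})

window = 5  # number of consecutive Xval values used as features

# Each row of X_win holds `window` consecutive Xval values, oldest first.
# copy() turns the read-only view into an ordinary writable array.
X_win = sliding_window_view(df["Xval"].to_numpy(), window).copy()

# The target for each window is the Ytval on the window's last row.
y_win = df["Ytval"].to_numpy()[window - 1:]

reg = RandomForestRegressor(n_estimators=50, random_state=0)
reg.fit(X_win, y_win)
pred = reg.predict(X_win)
print(X_win[0], pred)
```

This avoids the NaN rows altogether, because only complete windows are generated. Predicting on the training rows, as here and in the answer, only demonstrates the mechanics and says nothing about out-of-sample accuracy.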
What is the most pandastic way to forward fill with ascending logic (without iterating over the rows)?
input:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['test'] = np.nan,np.nan,1,np.nan,np.nan,3,np.nan,np.nan,2,np.nan,6,np.nan,np.nan
df['desired_output'] = np.nan,np.nan,1,1,1,3,3,3,3,3,6,6,6
print (df)
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
In the 'test' column, the number of consecutive NaN's is random.
In the 'desired_output' column, I am trying to forward fill with ascending values only. Also, when a lower value is encountered (row 8, value = 2.0 above), it is overwritten with the current higher value.
Can anyone help? Thanks in advance.
You can combine cummax to select the cumulative maximum value and ffill to replace the NaNs:
df['desired_output'] = df['test'].cummax().ffill()
output:
test desired_output
0 NaN NaN
1 NaN NaN
2 1.0 1.0
3 NaN 1.0
4 NaN 1.0
5 3.0 3.0
6 NaN 3.0
7 NaN 3.0
8 2.0 3.0
9 NaN 3.0
10 6.0 6.0
11 NaN 6.0
12 NaN 6.0
intermediate Series:
df['test'].cummax()
0 NaN
1 NaN
2 1.0
3 NaN
4 NaN
5 3.0
6 NaN
7 NaN
8 3.0
9 NaN
10 6.0
11 NaN
12 NaN
Name: test, dtype: float64
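As a quick check of the two steps, here is a small self-contained sketch that rebuilds the example and compares the result with the expected column. The names `expected`, `running_max`, `result` and `alt` are only illustrative; the sketch also shows that, for this data, forward filling first and taking the cumulative maximum afterwards gives the same values.

```python
import numpy as np
import pandas as pd

nan = np.nan
test = pd.Series([nan, nan, 1, nan, nan, 3, nan, nan, 2, nan, 6, nan, nan], name="test")
expected = pd.Series([nan, nan, 1, 1, 1, 3, 3, 3, 3, 3, 6, 6, 6], name="test")

# Step 1: running maximum; NaN positions stay NaN but do not reset the maximum.
running_max = test.cummax()

# Step 2: carry the last running maximum forward into the NaN gaps.
result = running_max.ffill()

# Same idea in the opposite order: fill first, then take the running maximum.
alt = test.ffill().cummax()

print(result.equals(expected), alt.equals(expected))
```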
I have a DataFrame that looks like this:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
a b
0 1.0 4.0
1 2.0 2.0
2 NaN 3.0
3 1.0 NaN
4 NaN NaN
5 NaN 1.0
6 4.0 5.0
7 2.0 NaN
8 3.0 5.0
9 NaN 8.0
I want to dynamically replace the nan values. I have tried (df.ffill()+df.bfill())/2, but that does not yield the desired output, as it applies the fill value to the whole column at once rather than dynamically. I have also tried interpolate, but it doesn't work well for non-linear data.
I have seen this answer but did not fully understand it and not sure if it would work.
Update on the computation of the values
I want every nan value to be the mean of the previous and next non nan value. In case there are more than 1 nan value in sequence, I want to replace one at a time and then compute the mean e.g., in case there is 1, np.nan, np.nan, 4, I first want the mean of 1 and 4 (2.5) for the first nan value - obtaining 1,2.5,np.nan,4 - and then the second nan will be the mean of 2.5 and 4, getting to 1,2.5,3.25,4
The desired output is
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
Inspired by @ye olde noobe's answer (thanks to him!):
I've optimized it to make it ≃ 100x faster (times comparison below):
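def custom_fillna(s: pd.Series):
    # Assumes a default RangeIndex, so labels and positions coincide.
    for i in range(len(s)):
        if pd.isna(s[i]):
            prev_idx = s[:i].last_valid_index()
            next_idx = s[i:].first_valid_index()
            # Fall back to 0 when there is no valid neighbour on that side.
            last_valid_number = s[prev_idx] if prev_idx is not None else 0
            next_valid_number = s[next_idx] if next_idx is not None else 0
            s[i] = (last_valid_number + next_valid_number) / 2
    return s
df['a'] = custom_fillna(df['a'])  # assign back so the result does not depend on in-place mutation
df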
Times comparison:
Maybe not the most optimized, but it works (note: from your example, I assume that if there is no valid value before or after a NaN, like the last row on column a, 0 is used as a replacement):
import numpy as np
import pandas as pd
def fill_dynamically(s: pd.Series):
    for i in range(len(s)):
        s[i] = (
            (0 if s[i:].first_valid_index() is None else s[i:][s[i:].first_valid_index()]) +
            (0 if s[:i+1].last_valid_index() is None else s[:i+1][s[:i+1].last_valid_index()])
        ) / 2
    return s  # return the filled Series so that df.apply can collect the result
Use like this for the full dataframe:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
                   'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
                  })
df = df.apply(fill_dynamically)
df after applying:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
If you have other columns and don't want to apply it to the whole dataframe, you can of course use it on a single column, like this:
df = pd.DataFrame({'a':[1,2,np.nan,1,np.nan,np.nan,4,2,3,np.nan],
'b':[4,2,3,np.nan,np.nan,1,5,np.nan,5,8]
})
fill_dynamically(df['a'])
In this case, df looks like this:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 NaN
4 2.50 NaN
5 3.25 1.0
6 4.00 5.0
7 2.00 NaN
8 3.00 5.0
9 1.50 8.0
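Both answers above re-slice the Series on every iteration. As a hedged alternative, here is a minimal single-pass sketch of the same rule: each NaN becomes the mean of the previous (possibly already filled) value and the next valid value. The function name `fill_mean_of_neighbours` and the `default` parameter are made up for illustration; the fallback of 0 for a missing neighbour follows the assumption stated in the answer above.

```python
import numpy as np
import pandas as pd

def fill_mean_of_neighbours(s: pd.Series, default: float = 0.0) -> pd.Series:
    """Return a copy of s where each NaN is the mean of the previous value
    (already filled, if it was NaN) and the next originally valid value."""
    # Next valid value for every position, looked up once instead of per row.
    nxt = s.bfill().fillna(default).to_numpy(dtype=float)
    vals = s.to_numpy(dtype=float).copy()
    prev = default  # used when a NaN has no value before it
    for i in range(len(vals)):
        if np.isnan(vals[i]):
            vals[i] = (prev + nxt[i]) / 2
        prev = vals[i]
    return pd.Series(vals, index=s.index, name=s.name)

df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
filled = df.apply(fill_mean_of_neighbours)
print(filled)
```

Because the function works on a copy and returns a new Series, it does not rely on modifying `df` in place, and it works positionally regardless of the index labels.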
I have the following data frame, to which I want to apply bfill as follows:
amount  percentage
NaN
1.0     20
2.0     10
NaN
NaN
NaN
NaN
3.0     50
4.0     10
NaN
5.0     10
I want to bfill the NaNs in the amount column according to the percentage column, i.e., if the corresponding percentage is 50, then fill only 50% of the NaNs before the number (partial fill). E.g., the amount 3.0 has a percentage of 50, so out of the 4 NaN entries before it, only 50% (the 2 closest to it) are to be backfilled.
proposed output:
amount  percentage
NaN
1.0     20
2.0     10
NaN
NaN
3.0
3.0
3.0     50
4.0     10
NaN
5.0     10
Please help.
Create groups according to NaNs
df['group_id'] = df.amount.where(df.amount.isna(), 1).cumsum().bfill()
Create a filling function
def custom_fill(x):
    # Calculate number of rows to be filled
    max_fill_rows = math.floor(x.iloc[-1, 1] * (x.shape[0] - 1) / 100)
    # Fill only if number of rows to fill is not zero
    return x.bfill(limit=max_fill_rows) if max_fill_rows else x
Fill the DataFrame
df.groupby('group_id').apply(custom_fill)
Output
amount percentage group_id
0 NaN NaN 1.0
1 1.0 20.0 1.0
2 2.0 10.0 2.0
3 NaN NaN 3.0
4 NaN NaN 3.0
5 3.0 50.0 3.0
6 3.0 50.0 3.0
7 3.0 50.0 3.0
8 4.0 10.0 4.0
9 NaN NaN 5.0
10 5.0 10.0 5.0
PS: Don't forget to import the required libraries
import math
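The steps above can also be put together in one runnable sketch. To keep it independent of how `groupby.apply` treats the grouping column in different pandas versions, this variant loops over the groups and backfills only the amount column. The names `partial_bfill`, `pieces` and `amount_filled` are hypothetical, and the sketch assumes every group ends with exactly one valid amount that has a valid percentage.

```python
import math
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame({
    "amount":     [nan, 1.0, 2.0, nan, nan, nan, nan, 3.0, 4.0, nan, 5.0],
    "percentage": [nan, 20, 10, nan, nan, nan, nan, 50, 10, nan, 10],
})

# Each run of NaNs gets the same group id as the valid value that follows it.
group_id = df["amount"].where(df["amount"].isna(), 1).cumsum().bfill()

def partial_bfill(group: pd.DataFrame) -> pd.Series:
    # Assumption: the last row of the group holds the only valid amount.
    n_missing = len(group) - 1
    limit = math.floor(group["percentage"].iloc[-1] * n_missing / 100)
    # bfill rejects limit=0, so return the amounts untouched in that case.
    return group["amount"].bfill(limit=limit) if limit > 0 else group["amount"]

pieces = [partial_bfill(g) for _, g in df.groupby(group_id)]
df["amount_filled"] = pd.concat(pieces).sort_index()
print(df)
```

Unlike backfilling the whole group, this leaves the percentage column as it was, which is closer to the proposed output in the question.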
I have a dataframe with energy use data. To post-process the data, I need to be sure I only go forward with reliable energy uses.
One of the steps is making sure the values within a dataframe row are not identical, because this indicates an error in the database (energy use for households is hardly ever identical over the years, except for zero energy use due to renewable energy installations).
The question is as follows, on a simple example df:
The dataframe can contain empty cells (np.nan).
If 2 or more values in a row are identical, then keep one of the identical values and set the rest to np.nan, except if the identical values are zeros.
In the example below, the duplicates in rows 2 and 4 are replaced with np.nan, but the last row is left unchanged because the identical values are zeros.
Does anyone know how to go from the initial df to the desired df? The code below works except for the zero condition: identical zeros should not be changed to np.nan (see the last row in df).
Initial df:
y_2010 y_2011 y_2012
4.0 6.0 3.0
2.0 7.0 7.0
9.0 NaN NaN
3.0 3.0 3.0
2.0 4.0 6.0
0.0 0.0 NaN
Desired df:
y_2010 y_2011 y_2012
4.0 6.0 3.0
2.0 7.0 NaN
9.0 NaN NaN
3.0 NaN NaN
2.0 4.0 6.0
0.0 0.0 NaN
Tried code:
import pandas as pd
import numpy as np
df = pd.DataFrame({"y_2010": [4,2,9,3,2,0],
"y_2011": [6,7,np.nan,3,4,0],
"y_2012": [3,7,np.nan,3,6,np.nan]})
print(df)
mask = df.apply(pd.Series.duplicated, 1)
df = df.mask(mask, np.nan)
print(df)
y_2010 y_2011 y_2012
4.0 6.0 3.0
2.0 7.0 NaN
9.0 NaN NaN
3.0 NaN NaN
2.0 4.0 6.0
0.0 NaN NaN -> 0 changed to NaN and I don't want that
Let us try adding a zero check:
df = df.mask(df.apply(pd.Series.duplicated, 1) & df.ne(0))
y_2010 y_2011 y_2012
0 4.0 6.0 3.0
1 2.0 7.0 NaN
2 9.0 NaN NaN
3 3.0 NaN NaN
4 2.0 4.0 6.0
5 0.0 0.0 NaN
You can try:
df.apply(lambda x: x.mask(x.duplicated()&x.ne(0)), axis=1)
Output:
y_2010 y_2011 y_2012
0 4.0 6.0 3.0
1 2.0 7.0 NaN
2 9.0 NaN NaN
3 3.0 NaN NaN
4 2.0 4.0 6.0
5 0.0 0.0 NaN
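To make the logic of both answers easier to follow, here is a small self-contained sketch that builds the two boolean masks separately before combining them. The names `is_repeat`, `is_nonzero` and `cleaned` are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"y_2010": [4, 2, 9, 3, 2, 0],
                   "y_2011": [6, 7, np.nan, 3, 4, 0],
                   "y_2012": [3, 7, np.nan, 3, 6, np.nan]})

# True where a value already appeared earlier in the same row.
is_repeat = df.apply(lambda row: row.duplicated(), axis=1)

# True for everything except exact zeros (NaN also counts as "not equal to 0").
is_nonzero = df.ne(0)

# Blank out repeated values, but never repeated zeros.
cleaned = df.mask(is_repeat & is_nonzero)
print(cleaned)
```

A repeated NaN is also flagged by `duplicated`, but masking a cell that is already NaN changes nothing, so no special handling is needed for empty cells.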
I got an example pandas dataframe like this:
a b
0 6.0 0.6
1 1.0 0.3
2 3.0 0.8
3 5.0 0.1
4 7.0 0.4
5 2.0 0.2
6 0.0 0.9
7 4.0 0.7
8 8.0 0.0
9 9.0 0.5
I want to add a new column, linear, to the dataframe, holding the linear regression output from fitting b against a. So far I have:
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# .to_numpy() replaces .as_matrix(), which was removed in pandas 1.0
X = df['a'].to_numpy().reshape(-1, 1)
y = df['b'].to_numpy().reshape(-1, 1)
reg.fit(X, y)
reg.predict(X)  # This gives the linear regression outcome for the whole column
Now I want to do the linear regression incrementally on series a, so the first entry of linear will be b[0], the second will be b[0]/a[0]*a[1], the third will be the linear regression outcome of the first two entries, and so on. I have no clue how to do that with pandas, except by iterating through all the entries. Is there a better way?
You can use expanding with some custom apply functions. Interesting way to do LR...
from io import StringIO
import pandas as pd
import numpy as np
df = pd.read_table(StringIO(""" a b
0 6.0 0.6
1 1.0 0.3
2 3.0 0.8
3 5.0 0.1
4 7.0 0.4
5 2.0 0.2
6 0.0 0.9
7 4.0 0.7
8 8.0 0.0
9 9.0 0.5
10 10.0 0.4
11 11.0 0.35
12 12.0 0.3
13 13.0 0.28
14 14.0 0.27
15 15.0 0.22"""), sep=r'\s+')
df = df.sort_values(by='a')
ax = df.plot(x='a',y='b',kind='scatter')
m, b = np.polyfit(df['a'],df['b'],1)
lin_reg = lambda x, m, b : m*x + b
df['lin'] = lin_reg(df['a'], m, b)
def make_m(x):
    # slope of the fit on the first len(x) rows (df is already sorted by 'a')
    y = df['b'].iloc[0:len(x)]
    return np.polyfit(x, y, 1)[0]
def make_b(x):
    # intercept of the same fit
    y = df['b'].iloc[0:len(x)]
    return np.polyfit(x, y, 1)[1]
df['new'] = df['a'].expanding().apply(make_m, raw=True)*df['a'] + df['a'].expanding().apply(make_b, raw=True)
# df = df.sort_values(by='a')
ax.plot(df.a,df.lin)
ax.plot(df.a,df.new)
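The expanding approach above reaches into the global `df` from inside the apply callbacks and fits each prefix twice. As a rough alternative, here is a minimal sketch with a plain loop over growing prefixes. The function name `expanding_linear_fit` and the `min_points` parameter are hypothetical. Like the answer above, it fits on all rows up to and including the current one, which differs slightly from the question's wording (a fit on the previous rows only).

```python
import numpy as np
import pandas as pd

def expanding_linear_fit(a, b, min_points=2):
    """For each row n, fit b ~ slope * a + intercept on rows 0..n (inclusive)
    and return the fitted value at row n. Rows with fewer than `min_points`
    observations are left as NaN."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    out = np.full(len(a), np.nan)
    for n in range(min_points, len(a) + 1):
        slope, intercept = np.polyfit(a[:n], b[:n], 1)
        out[n - 1] = slope * a[n - 1] + intercept
    return out

df = pd.DataFrame({"a": [6.0, 1.0, 3.0, 5.0, 7.0, 2.0, 0.0, 4.0, 8.0, 9.0],
                   "b": [0.6, 0.3, 0.8, 0.1, 0.4, 0.2, 0.9, 0.7, 0.0, 0.5]})
df["linear"] = expanding_linear_fit(df["a"], df["b"])
print(df)
```

The first row stays NaN because a line cannot be fitted through a single point; it can be set to `b[0]` afterwards if the question's convention is wanted.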