I want to perform OLS regression using Python's statsmodels package, but my dataset has NaNs in it. I know I can use the missing='drop' option when fitting the OLS model, but then some of the results (fitted values or residuals) have a different length than the original y variable.
I have the following code as an example:
import numpy as np
import statsmodels.api as sm
yvars = np.array([1.0, 6.0, 3.0, 2.0, 8.0, 4.0, 5.0, 2.0, np.nan, 3.0])
xvars = np.array(
    [
        [1.0, 8.0],
        [8.0, np.nan],
        [np.nan, 3.0],
        [3.0, 6.0],
        [5.0, 3.0],
        [2.0, 7.0],
        [1.0, 3.0],
        [2.0, 2.0],
        [7.0, 9.0],
        [3.0, 1.0],
    ]
)
res = sm.OLS(yvars, sm.add_constant(xvars), missing='drop').fit()
res.resid
The result is as follows:
array([-0.71907958, -1.9012464 , 1.78811122, 1.18983701, 2.63854267,
-1.45254075, -1.54362416])
My question: the result is an array of length 7 (after dropping NaNs), but the length of yvars is 10. What if I want the residuals returned with the same length as yvars, with NaN in every position where there is at least one NaN in either yvars or xvars?
Basically, the result I want to get is:
array([-0.71907958,         nan,         nan, -1.9012464 ,  1.78811122,
        1.18983701,  2.63854267, -1.45254075,         nan, -1.54362416])
That's too difficult to implement in statsmodels, so users need to handle it themselves.
The results attributes like fittedvalues and resid refer only to the actual sample used in the fit.
The predict method of the results instance preserves NaNs in the exog array passed to it, but other methods and attributes do not.
results.predict(xvars_all)
One workaround:
Use a pandas DataFrame for the data.
Then, AFAIR, resid and fittedvalues on the results instance are pandas Series with the appropriate index.
That index can then be used to align them back to the original index or DataFrame.
That's what the predict method does.
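A minimal sketch of this workaround, reusing yvars and xvars from the question; it assumes (as stated above) that resid comes back as a Series indexed by the rows that survived the drop, so reindexing against the full index restores the original length with NaNs in the dropped positions:

# Minimal sketch of the pandas workaround (assumes resid is returned as a
# Series indexed by the observations kept after missing='drop').
import numpy as np
import pandas as pd
import statsmodels.api as sm

y = pd.Series(yvars)
X = pd.DataFrame(sm.add_constant(xvars), columns=['const', 'x1', 'x2'])

res = sm.OLS(y, X, missing='drop').fit()

# Reindexing against the full original index fills the dropped rows with NaN.
resid_full = res.resid.reindex(y.index)
print(resid_full.values)

The same reindex step should work for res.fittedvalues.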
Related
Suppose I have a Pandas data frame structured similarly to the following:
import numpy as np
import pandas as pd

data = {
    'A': [5.0, np.nan, 1.0],
    'B': [7.0, np.nan, np.nan],
    'C': [9.0, 2.0, 6.0],
    'D': [np.nan, 4.0, 9.0],
    'E': [np.nan, 6.0, np.nan],
    'F': [np.nan, np.nan, np.nan],
    'G': [np.nan, np.nan, 8.0]
}
df = pd.DataFrame(data, index=['11', '22', '33'])
From each row, I would like to extract the longest continuous block of non-null values and append them to a list.
So the following values from these rows:
row11: [5,7,9]
row22: [2,4,6]
row33: [6,9]
Giving me a list of values:
[5.0, 7.0, 9.0, 2.0, 4.0, 6.0, 6.0, 9.0]
My current approach uses iterrows(), first_valid_index(), and last_valid_index():
mylist = []
for i, r in df.iterrows():
    start = r.first_valid_index()
    end = r.last_valid_index()
    mylist.extend(r[start:end].values)
This works fine when the valid values are blocked together, as in row11 and row22. However, my approach falls down when values are interspersed with nulls, as in row33. In that case, my approach extracts the entire row, because both the first and last index contain non-null values. My solution (incorrectly) outputs a final list of:
[5.0, 7.0, 9.0, 2.0, 4.0, 6.0, 1.0, nan, 6.0, 9.0, nan, nan, 8.0]
I have the following questions:
1.) How can I combat the error I'm facing in the example of row33?
2.) Is there a more efficient approach than using iterrows()? My actual data has many thousands of rows. While it isn't necessarily too slow, I'm always wary of resorting to iteration when using Pandas.
One option is to use a groupby to get the stretches of non-NA values and max to keep the longest:
def get_longest(s):
    m = s.isna()
    return max(s[~m].groupby(m.cumsum()),
               key=lambda x: len(x[1]))[1].dropna().tolist()

out = df.apply(get_longest, axis=1)
Output:
11 [5.0, 7.0, 9.0]
22 [2.0, 4.0, 6.0]
33 [6.0, 9.0]
dtype: object
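To get the single flat list shown in the question, the per-row lists in out (or in the result of the answer below) can be chained together; a small usage sketch:

from itertools import chain

# flatten the Series of lists into one list
mylist = list(chain.from_iterable(out))
# [5.0, 7.0, 9.0, 2.0, 4.0, 6.0, 6.0, 9.0]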
With the numpy.ma.masked_invalid and numpy.ma.clump_unmasked functions you can split a row into contiguous slices of non-NaN values and select the slice with the largest length:
res = df.apply(lambda x: x[max(np.ma.clump_unmasked(np.ma.masked_invalid(x.values)),
                               key=lambda sl: sl.stop - sl.start)].tolist(),
               axis=1)
11 [5.0, 7.0, 9.0]
22 [2.0, 4.0, 6.0]
33 [6.0, 9.0]
traj0
Out[52]:
state action reward
0 [1.0, 4.0, 6.0] 3.0 4.0
1 [4.0, 6.0, 11.0] 4.0 5.0
2 [6.0, 7.0, 3.0] 3.0 22.0
3 [3.0, 3.0, 2.0] 1.0 10.0
4 [2.0, 9.0, 5.0] 2.0 2.0
Suppose I have a pandas DataFrame that looks like this, where the entries of the state column are 3-element numpy arrays.
How can I query for the row whose state is np.array([3.0, 3.0, 2.0])?
I know traj0.query("state == '[3.0,3.0,2.0]'") works, but I don't want to hardcode the array value in my query.
I'm looking for something like
x = np.array([3.0,3.0,2.0])
traj0.query('state ==' + x)
=============
This is not a duplicate: my previous question, pandas query with a column consisting of array entries, only covered the case where each array held a single value. Here I'm asking about arrays with multiple values.
You can do this with df.loc and a lambda function using numpy.array_equal:
x = [1., 4., 6.]
traj0.loc[traj0.state.apply(lambda a: np.array_equal(a, x))]
Basically this checks each element of the state column for equivalence to x and returns only those rows where the column matches.
Example
df = pd.DataFrame(data={'state': [[1., 4., 6.], [4., 5., 6.]],
                        'value': [5, 6]})
print(df.loc[df.state.apply(lambda a: np.array_equal(a, x))])
state value
0 [1.0, 4.0, 6.0] 5
import numpy as np
import pandas as pd
df = pd.DataFrame([[np.array([1.0, 4.0, 6.0]), 3.0, 4.0],
                   [np.array([4.0, 6.0, 11.0]), 4.0, 5.0],
                   [np.array([6.0, 7.0, 3.0]), 3.0, 22.0],
                   [np.array([3.0, 3.0, 2.0]), 1.0, 10.0],
                   [np.array([2.0, 9.0, 5.0]), 2.0, 2.0]],
                  columns=['state', 'action', 'reward'])

x = str(np.array([3.0, 3.0, 2.0]))
df[df.state.astype(str) == x]

# to use pd.query
df['state_str'] = df.state.astype(str)
df.query("state_str == '{}'".format(x))
Output
state action reward
3 [3.0, 3.0, 2.0] 1.0 10.0
Best not to use pd.DataFrame.query here. You can perform a vectorised comparison and then use Boolean indexing:
x = [3, 3, 2]
mask = (np.array(df['state'].values.tolist()) == x).all(1)
res = df[mask]
print(res)
state action reward
3 [3.0, 3.0, 2.0] 1.0 10.0
In general, you shouldn't store lists or arrays within a Pandas series. This is inefficient and removes the possibility of direct vectorised operations. Here, we've had to convert to a NumPy array explicitly for a simple comparison.
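As a hedged illustration of that point, one possible layout is to expand the arrays into ordinary numeric columns (the names s0, s1, s2 below are made up for this sketch), after which the match is a plain vectorised comparison:

# Sketch only: split the 3-element state arrays into separate numeric columns
# so comparisons stay vectorised (s0, s1, s2 are hypothetical column names).
state_cols = pd.DataFrame(df['state'].tolist(),
                          columns=['s0', 's1', 's2'],
                          index=df.index)
flat = pd.concat([state_cols, df[['action', 'reward']]], axis=1)

x = [3.0, 3.0, 2.0]
res = flat[(flat[['s0', 's1', 's2']] == x).all(axis=1)]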
I've got a question on identifying patterns within an array. I'm working with the following array:
A = [1.0, 1.1, 9.0, 9.2, 0.9, 9.1, 1.0, 1.0, 1.2, 9.2, 8.9, 1.1]
Now, this array is clearly made of elements clustered around ~1 and elements clustered around ~9.
Is there a way to separate these clusters? I.e., to get to something like:
a_1 = [1.0, 1.1, 0.9, 1.0, 1.0, 1.2, 1.1] # elements around ~1
a_2 = [9.0, 9.2, 9.1, 9.2, 8.9] # elements around ~9
Thanks a lot. Best.
You can do that by checking, for each element, which center it is closer to: 1 or 9:
a_1 = [i for i in A if abs(i-1)<=abs(i-9)]
a_2 = [i for i in A if abs(i-1)>abs(i-9)]
But of course this is not a general solution for clustering. It only works in this case because you know the cluster centers (1 and 9).
If you don't know the cluster centers, I think you should use a clustering algorithm like K-Means.
Here is a simple K-Means implementation (with k=2 and a limit of 100 iterations). You don't need to know the cluster centers; they are picked randomly at first.
from random import randint

A = [1.0, 1.1, 9.0, 9.2, 0.9, 9.1, 1.0, 1.0, 1.2, 9.2, 8.9, 1.1]

# pick two random elements as the initial cluster centers
x = A[randint(0, len(A) - 1)]
y = A[randint(0, len(A) - 1)]

for _ in range(100):
    # assign each element to the closer center
    a_1 = [i for i in A if abs(i - x) <= abs(i - y)]
    a_2 = [i for i in A if abs(i - x) > abs(i - y)]
    if not a_2:  # the two centers coincided; re-seed one of them
        y = A[randint(0, len(A) - 1)]
        continue
    print(x, y)
    # move each center to the mean of its cluster
    x = sum(a_1) / len(a_1)
    y = sum(a_2) / len(a_2)

print(a_1)
print(a_2)
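If scikit-learn is available, the same two-way split can be done with its KMeans estimator instead of the hand-rolled loop above; a small sketch (k=2, parameter values chosen arbitrarily):

# Sketch: cluster the 1-D values with scikit-learn's KMeans (k=2).
import numpy as np
from sklearn.cluster import KMeans

A = [1.0, 1.1, 9.0, 9.2, 0.9, 9.1, 1.0, 1.0, 1.2, 9.2, 8.9, 1.1]
X = np.array(A).reshape(-1, 1)  # KMeans expects a 2-D array

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# split the original values by the label of the first element
a_1 = [v for v, lab in zip(A, labels) if lab == labels[0]]
a_2 = [v for v, lab in zip(A, labels) if lab != labels[0]]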
I have a tensor like this:
[[1.0, 2.0]
 [2.5, 3.0]]
and a tensor like this:
[1, 2]
I know that [1.0, 2.0] corresponds to 1 and [2.5, 3.0] corresponds to 2.
How can I get the value from the second tensor when I have a row like one in the first tensor, but with small deviations in some values?
For example: [[1.1, 2.00009]] -> [1]
Say I have this input data:
[[1.0, 1.5, 2.0],
[2.0, 1.5, 2.0],
[1.0, 2.0, 3.0],
[1.0, 1.5, 3.0]]
And this output data:
[100, 100, 80, 60]
Assuming the input values have no predetermined correlation, how can I use scikit-learn to use this data to estimate an output value from a set of input values?
You can use any scikit-learn regression model to predict the output for a new set of features. Please go through the scikit-learn documentation. To get you started:
In [1]: from sklearn.linear_model import LinearRegression
In [2]: xtr = [[1.0, 1.5, 2.0],
...: [2.0, 1.5, 2.0],
...: [1.0, 2.0, 3.0],
...: [1.0, 1.5, 3.0]]
In [3]: ytr = [100, 100, 80, 60]
In [4]: lr = LinearRegression()
In [5]: lr.fit(xtr, ytr)
Out[5]: LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)
In [6]: lr.predict([[1.0, 2.5, 1.5]])
Out[6]: array([ 160.])
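Since any scikit-learn regressor follows the same fit/predict pattern, here is a hedged sketch swapping in a different model, reusing xtr and ytr from above (RandomForestRegressor and its parameters are just an illustrative choice):

# Sketch: the same data fitted with another scikit-learn regressor.
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(xtr, ytr)
print(rf.predict([[1.0, 2.5, 1.5]]))  # prediction from the alternative model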