I have a predictive modelling problem and hopefully someone has time to help me. The starting data is shown below: s1-s3 are sensor measurements and RUL is my target value.
Data structure:
id period s1 s2 s3 RUL
1 1 510.23 643.43 1585.29 6
1 2 512.34 644.89 1586.12 5
1 3 514.65 645.11 1587.99 4
1 4 512.98 647.59 1588.45 3
1 5 516.34 649.04 1590.65 2
1 6 518.12 652.62 1593.09 1
2 1 509.77 640.61 1584.91 9
2 2 510.26 642.06 1586.00 8
2 3 511.95 643.62 1588.09 7
2 4 513.51 646.51 1589.45 6
2 5 512.17 648.06 1589.54 5
2 6 515.56 646.11 1586.22 4
2 7 518.78 649.34 1586.96 3
2 8 519.90 650.30 1588.95 2
2 9 521.05 651.39 1591.34 1
3 1 501.11 653.99 1580.45 8
3 2 511.45 643.23 1584.09 7
3 3 505.45 643.78 1586.11 6
3 4 504.45 643.43 1588.34 5
3 5 506.45 643.71 1589.89 4
3 6 511.45 643.33 1591.21 3
3 7 516.45 643.61 1592.42 2
3 8 518.45 643.05 1596.77 1
Target:
My goal is to predict the remaining useful life (RUL) on unseen data. In this case I have only one type of machine with different ids (that means one type and three different physical systems). For the prediction the id doesn't matter, because it's the same machine type. Furthermore, I want to add new features: the moving averages of s1, s2 and s3. So I have to add three new columns named a1, a2 and a3.
For instance, a1 should look like:
a1
NaN
NaN
512.41
513.32
514.66
515.81
NaN
NaN
510.66
511.91
512.54
513.75
515.50
518.08
519.91
NaN
NaN
506.00
507.12
505.45
507.45
511.45
515.45
The next problem is that I can't work with NaN, because it's a string. How can I ignore or handle it for a1, a2 and a3?
The next question is: how can I use regression models like Random Forest and bagged decision trees with a train_test_split to predict the RUL of unseen data? (Of course I need more data; this example just shows the structure.) s1, s2 and s3 are my inputs and RUL is the output.
Furthermore, I want to evaluate the model with Mean Absolute Error, Mean Squared Error and R².
Finally, I want to use grid search for hyperparameter tuning.
Thank you in advance. I know what I want to do, but I'm not able to realize it in Python. Complete code would be perfect.
The standard way of solving this problem is through imputation. scikit-learn has a built-in class for imputation. The documentation is here: http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html
There are 3 strategies for replacing the NaNs:
1) replacing them with the mean of the column
2) replacing them with the mode (most frequent value) of the column
3) replacing them with the median of the column
An example of usage would look like this:
from sklearn.preprocessing import Imputer
imp = Imputer(strategy='mean', axis=0)  # axis=0 replaces NaNs with the column mean
a1 = imp.fit_transform(a1)  # call fit_transform on the instance; it expects 2D input such as df[['a1']]
There are also usage examples available here: http://scikit-learn.org/stable/modules/preprocessing.html#imputation
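The leading NaNs in a1-a3 come purely from the incomplete rolling window, so instead of imputing them you can also simply drop those rows. Below is a rough end-to-end sketch of the rest of the question (moving averages per id, train/test split, Random Forest and bagged trees, the three metrics, and a grid search). The window size of 3 matches the example a1 column; the split ratio and parameter grid are placeholder choices, and df is assumed to hold the table from the question:
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Moving averages per machine id, so windows never cross machine boundaries.
for col in ['s1', 's2', 's3']:
    df['a' + col[1]] = df.groupby('id')[col].transform(
        lambda s: s.rolling(window=3).mean())

# Drop the rows whose window is still incomplete (the NaNs).
df = df.dropna()

X = df[['s1', 's2', 's3', 'a1', 'a2', 'a3']]
y = df['RUL']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit and evaluate both model types.
for model in (RandomForestRegressor(n_estimators=100, random_state=42),
              BaggingRegressor(n_estimators=100, random_state=42)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(type(model).__name__,
          mean_absolute_error(y_test, pred),
          mean_squared_error(y_test, pred),
          r2_score(y_test, pred))

# Grid search over a few Random Forest hyperparameters (example grid only).
param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 5, 10]}
grid = GridSearchCV(RandomForestRegressor(random_state=42), param_grid,
                    scoring='neg_mean_absolute_error', cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_)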
Note: Contrived example. Please don't hate on forecasting and I don't need advice on it. This is strictly a Pandas how-to question.
Example - One Solution
I have two different sized DataFrames, one representing sales and one representing a forecast.
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
The forecast needs to be aligned with the latest sales, i.e. with the end of the list of sales numbers [5, 6, 7, 5]. Other times, I might want it at other locations (please don't ask why, I just need it this way).
This works:
df = pd.concat([sales, forecast], ignore_index=True, axis=1)
df.columns = ['sales', 'forecast'] # Not necessary, making next command pretty
df.forecast = df.forecast.shift(len(sales) - len(forecast))
This gives me the desired outcome (the table shown at the end of the answer below).
Question
What I want to know is: Can I concatenate to the end of the sales data without performing the additional shift (the last command)? I'd like to do this in one step instead of two. concat or something similar is fine, but I'd like to skip the shift.
I'm not hung up on having two lines of code. That's okay. I want a solution with the maximum possible performance. My application is sensitive to every millisecond we throw at it on account of huge volumes.
Not sure if it is much faster, but you could do:
sales = pd.DataFrame({'sales':[5,3,5,6,4,4,5,6,7,5]})
forecast = pd.DataFrame({'forecast':[5,5.5,6,5]})
forecast.index = sales.index[-forecast.shape[0]:]
which gives
forecast
6 5.0
7 5.5
8 6.0
9 5.0
and then simply
pd.concat([sales, forecast], axis=1)
yielding the desired outcome:
sales forecast
0 5 NaN
1 3 NaN
2 5 NaN
3 6 NaN
4 4 NaN
5 4 NaN
6 5 5.0
7 6 5.5
8 7 6.0
9 5 5.0
A one-line solution using the same idea, as mentioned by @Dark in the comments, would be:
pd.concat([sales, forecast.set_axis(sales.index[-len(forecast):], inplace=False)], axis=1)
giving the same output.
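If every millisecond matters, it is probably best to time both variants on realistically sized data. A rough sketch with timeit (the sizes here are made up):
import timeit
import numpy as np
import pandas as pd

sales = pd.DataFrame({'sales': np.random.randint(0, 10, 100000)})
forecast = pd.DataFrame({'forecast': np.random.rand(1000)})

def concat_then_shift():
    df = pd.concat([sales, forecast], ignore_index=True, axis=1)
    df.columns = ['sales', 'forecast']
    df.forecast = df.forecast.shift(len(sales) - len(forecast))
    return df

def reindex_then_concat():
    f = forecast.copy()
    f.index = sales.index[-len(forecast):]  # align with the tail of sales
    return pd.concat([sales, f], axis=1)

print(timeit.timeit(concat_then_shift, number=100))
print(timeit.timeit(reindex_then_concat, number=100))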
I'm working on a model which will predict a number from others' opinions. For this I will use LinearRegression from sklearn.
For example, I have 5 agents from which I collect their latest values over time, one row per iteration. Until an agent inserts their first value, the data contains NaN. The data looks something like this:
a1 a2 a3 a4 a5 target
1 nan nan nan nan 3 4.5
2 4 nan nan nan 3 4.5
3 4 5 nan nan 3 4.5
4 4 5 5 nan 3 4.5
5 4 5 5 4 3 4.5
6 5 5 5 4 3 4.5
So in each iteration/change I want to predict the end number. As we know, linear regression doesn't allow NaNs in the data, so I replace them with 0, which I assume doesn't ruin the answer, because the formula of linear regression is: result = a1*w1 + a2*w2 + ... + an*wn + c.
The questions I have at the moment:
Does my solution somehow affect the fit? Is there a better solution for my problem? Should I train my model only on full data and then use it with my current solution?
Setting NaNs to 0 and training a linear regression to find coefficients for each of the variables can be fine, depending on the use case.
Why?
You are essentially training the model and telling it that, for many rows, the variables a1, a2, etc. contribute nothing (whenever the value is NaN and set to 0).
If the NaNs are there because the data has not been filled in yet, then setting them to 0 and training your model is wrong. It's better to train your model after all the data has been entered (at least for all the agents who have entered some data); that model can later be used to predict for new agents. Otherwise, your coefficients will be overfit to the 0s (NaNs) if many agents have not yet entered their data.
Given that the end target is a continuous variable, linear regression is a good approach to go with.
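A minimal sketch of the "train on complete rows" suggestion, assuming the data sits in a DataFrame df with the columns from the question:
import pandas as pd
from sklearn.linear_model import LinearRegression

# Keep only rows where every agent has already entered a value.
complete = df.dropna()
X = complete[['a1', 'a2', 'a3', 'a4', 'a5']]
y = complete['target']

model = LinearRegression().fit(X, y)

# At prediction time the same rule applies: either wait for full rows,
# or fill the gaps deliberately (e.g. with column means) and accept the bias.
print(model.predict(X))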
I want to scale the numerical values (similar to R's scale function) within different groups.
Note: when I talk about scaling, I am referring to this metric:
(x-group_mean)/group_std
Dataset (for demonstration the ideas) for example:
advertiser_id value
10 11
10 22
10 2424
11 34
11 342342
.....
Desirable results:
advertiser_id scaled_value
10 -0.58
10 -0.57
10 1.15
11 -0.707
11 0.707
.....
Referring to this link: implementing R scale function in pandas in Python?, I used the scale function defined there and want to apply it like this:
dt.groupby("advertiser_id").apply(scale)
but get an error:
ValueError: Shape of passed values is (2, 15770), indices imply (2, 23375)
My original dataset has 15770 rows, but I don't see why the scale function should in my case map a single value to more than one result.
I would appreciate if you can give me some sample code or some suggestions into how to modify it, thanks!
First, np.std behaves differently from most other languages in that its delta degrees of freedom (ddof) defaults to 0, whereas R uses 1. Therefore:
In [9]:
print df
advertiser_id value
0 10 11
1 10 22
2 10 2424
3 11 34
4 11 342342
In [10]:
print df.groupby('advertiser_id').transform(lambda x: (x-np.mean(x))/np.std(x, ddof=1))
value
0 -0.581303
1 -0.573389
2 1.154691
3 -0.707107
4 0.707107
This matches the R result.
Second, if any of your groups (by advertiser_id) happens to contain just one item, its sample standard deviation (ddof=1) is undefined and you will get NaN. Check whether you get NaN for this reason; R would return NaN in this case as well.
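If you prefer a named function over the lambda, a small sketch (the scaled_value column name just matches the desired output from the question):
import numpy as np
import pandas as pd

def scale(x):
    # Sample standard deviation (ddof=1), matching R's scale().
    return (x - np.mean(x)) / np.std(x, ddof=1)

df['scaled_value'] = df.groupby('advertiser_id')['value'].transform(scale)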
I have a DataFrame with 2 columns. I need to know at what point the number of questions has increased.
In [19]: status
Out[19]:
seconds questions
0 751479 9005591
1 751539 9207129
2 751599 9208994
3 751659 9210429
4 751719 9211944
5 751779 9213287
6 751839 9214916
7 751899 9215924
8 751959 9216676
9 752019 9217533
I need the percentage change of the 'questions' column and then to sort on it. This does not work:
status.pct_change('questions').sort('questions').head()
Any suggestions?
Try this way instead:
>>> status['change'] = status.questions.pct_change()
>>> status.sort_values('change', ascending=False)
questions seconds change
0 9005591 751479 NaN
1 9207129 751539 0.022379
2 9208994 751599 0.000203
6 9214916 751839 0.000177
4 9211944 751719 0.000164
3 9210429 751659 0.000156
5 9213287 751779 0.000146
7 9215924 751899 0.000109
9 9217533 752019 0.000093
8 9216676 751959 0.000082
pct_change can be performed on Series as well as DataFrames and accepts an integer argument for the number of periods you want to calculate the change over (the default is 1).
I've also assumed that you want to sort on the 'change' column with the greatest percentage changes showing first...
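For example, to measure the change over two rows instead of one (just an illustration of the periods argument):
# Percentage change relative to the value two rows earlier.
status['change_2'] = status.questions.pct_change(periods=2)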
I have an enormous timeseries of functions stored in a pandas dataframe in an HDF5 store and I want to make plots of a certain transform of every function in the timeseries. Since the number of plots is so large, and plotting them takes so long, I've used fork() and numpy.array_split() to break the indices up and run several plots in parallel.
Doing things this way means that every process has a copy of the whole timeseries. Since the total amount of memory I use limits how many processes I can run, I would like each process to store only its own chunk of the dataframe.
How can I split up a pandas dataframe?
np.array_split works pretty well for this use case.
In [40]: df = DataFrame(np.random.randn(5,10))
In [41]: df
Out[41]:
0 1 2 3 4 5 6 7 8 9
0 -1.998163 -1.973708 0.461369 -0.575661 0.862534 -1.326168 1.164199 -1.004121 1.236323 -0.339586
1 -0.591188 -0.162782 0.043923 0.101241 0.120330 -1.201497 -0.108959 -0.033221 0.145400 -0.324831
2 0.114842 0.200597 2.792904 0.769636 -0.698700 -0.544161 0.838117 -0.013527 -0.623317 -1.461193
3 1.309628 -0.444961 0.323008 -1.409978 -0.697961 0.132321 -2.851494 1.233421 -1.540319 1.107052
4 0.436368 0.627954 -0.942830 0.448113 -0.030464 0.764961 -0.241905 -0.620992 1.238171 -0.127617
np.array_split returns a list of 3 DataFrames here, so the loop below just pretty-prints each chunk.
In [43]: for dfs in np.array_split(df,3,axis=1):
....: print dfs, "\n"
....:
0 1 2 3
0 -1.998163 -1.973708 0.461369 -0.575661
1 -0.591188 -0.162782 0.043923 0.101241
2 0.114842 0.200597 2.792904 0.769636
3 1.309628 -0.444961 0.323008 -1.409978
4 0.436368 0.627954 -0.942830 0.448113
4 5 6
0 0.862534 -1.326168 1.164199
1 0.120330 -1.201497 -0.108959
2 -0.698700 -0.544161 0.838117
3 -0.697961 0.132321 -2.851494
4 -0.030464 0.764961 -0.241905
7 8 9
0 -1.004121 1.236323 -0.339586
1 -0.033221 0.145400 -0.324831
2 -0.013527 -0.623317 -1.461193
3 1.233421 -1.540319 1.107052
4 -0.620992 1.238171 -0.127617
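Since the goal is to give each forked process its own slice of rows, the same call with axis=0 (the default) splits along the index instead. A minimal sketch, where process() is a hypothetical stand-in for the per-chunk plotting work:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 10))

def process(chunk):
    # Placeholder for the real per-chunk plotting work.
    print(chunk.shape)

# Three row-wise chunks; each worker only needs to hold one of them.
for chunk in np.array_split(df, 3):
    process(chunk)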