There is an industrial fault diagnosis scenario. It is a binary classification problem on time series. When a fault occurs, the label changes from zero to one; the data from one machine is shown below:
| time | feature | label |
| ---- | ------- | ----- |
| 1    | 26      | 0     |
| 2    | 29      | 1     |
| 3    | 30      | 1     |
| 4    | 20      | 0     |
The problem is that faults don't happen frequently, so I need to select a sufficient number of time-series slices for training.
So I want to ask how I should organize these data: should I treat them as one time series, or is there a better choice? How should I organize the data, and what machine learning method should I use for fault diagnosis?
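One common way to organize such data is to cut the long recording into fixed-length sliding windows and label a window as faulty if it overlaps a fault; overlapping windows also yield more training slices from the rare fault regions. A minimal sketch (the window and step sizes here are illustrative, not from the question):

```python
import numpy as np

def make_windows(features, labels, window=50, step=10):
    """Cut one long series into overlapping fixed-length slices.

    A slice is labeled 1 (faulty) if any sample inside it is faulty;
    the overlap oversamples the rare fault regions.
    """
    X, y = [], []
    for start in range(0, len(features) - window + 1, step):
        end = start + window
        X.append(features[start:end])
        y.append(int(labels[start:end].max() == 1))
    return np.array(X), np.array(y)

# Toy series: one fault between t=100 and t=120
feats = np.random.randn(300)
labs = np.zeros(300, dtype=int)
labs[100:120] = 1
X, y = make_windows(feats, labs)
print(X.shape, int(y.sum()))  # (26, 50) 6 -> 26 windows, 6 of them faulty
```

Each window can then be fed to a classifier, e.g. a random forest on summary statistics per window, or a 1D CNN/LSTM on the raw slices.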
What's the correct way to undertake PCA on complex-valued data?
I see that this solution exists.
Are there any python packages that have implemented a complex-PCA?
So far I have just broken my data into real and imaginary parts and performed PCA as if they were real. For example:
| sw   | fw   | mw   |
| ---- | ---- | ---- |
| 4+4i | 3+2i | 1-1i |
would become:
| swreal | swimag | fwreal | fwimag | mwreal | mwimag |
| ------ | ------ | ------ | ------ | ------ | ------ |
| 4      | 4      | 3      | 2      | 1      | -1     |
My PCA ends up looking like this:
I want to pursue a complex PCA, but I'm not sure how I would represent the result. If it were a 2D plot, is the only option something similar to the above? And if so, would it look any different?
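I don't know of a dedicated complex-PCA package, but plain NumPy can run PCA directly on complex data, since `np.linalg.eigh` accepts a complex Hermitian matrix (the covariance of centered complex data). A minimal sketch, not a drop-in replacement for any particular library API:

```python
import numpy as np

def complex_pca(X, n_components=2):
    """PCA on complex data via the Hermitian covariance matrix.

    X: (n_samples, n_features) complex array.
    Returns (scores, components, eigenvalues), largest eigenvalues first.
    """
    Xc = X - X.mean(axis=0)                    # center each feature
    cov = Xc.conj().T @ Xc / (len(X) - 1)      # Hermitian covariance
    vals, vecs = np.linalg.eigh(cov)           # real eigenvalues, ascending
    order = np.argsort(vals)[::-1][:n_components]
    return Xc @ vecs[:, order], vecs[:, order], vals[order]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) + 1j * rng.normal(size=(100, 3))
scores, comps, ev = complex_pca(X)
print(scores.shape)  # (100, 2); the scores themselves are complex
```

As for plotting: each principal component score is itself complex, so one natural 2D view is the real part of PC1 against its imaginary part, which is genuinely different from the real/imaginary split above (where PC1 and PC2 are separate real axes).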
I have a database that consists of 10,049,972 rows x 19 columns. I was using Isolation Forest to detect outliers: I created an extra column that marks outliers as -1, dropped all rows where that column was -1, then removed the column.
My question is: Do I need to do train, test and validate for isolation forest to work? Also can someone please confirm if my code is valid?
Here is my code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import IsolationForest
df = pd.read_csv('D:\\Project\\database\\4-Final\\Final After.csv', low_memory=True)
iForest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42, max_samples=200)
iForest.fit(df.values)                      # fit on all 19 columns, not a flattened single column
df['anomaly'] = iForest.predict(df.values)  # -1 marks an outlier
df = df[df['anomaly'] != -1].drop(columns='anomaly')
df.to_csv('D:\\Project\\database\\4-Final\\IF TEST.csv', index=False)
Thank you.
My question is do I need to do test train and validate for isolation forest to work?
You want to detect outliers in just this batch file, right? In this case, your solution may be ok, but in most cases, you must split.
But please, try to understand when you would need to do the split.
To explain this, let's enter into a real case scenario.
Let's suppose you are trying to predict the anomalous behaviour of different engines. You create a model using the data available in your database up to today, and start predicting on incoming data. The incoming data may well differ from the data used to train, right? Then how can you simulate this situation while configuring your model? Using train-test-validate and evaluating with the right metrics.
Edit: Let me add an example. I'll try to make it super simple.
If your engine database data is:
+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
| 1 | 0 | 0.25 |
| 2 | 0 | 0.40 |
| 3 | 1 | 0.16 |
| 4 | 1 | 0.30 |
| 5 | 0 | 5.3 | <- anomaly
| 6 | 1 | 14.4 | <- anomaly
| 7 | 0 | 16.30 | <- anomaly
+----+-------------+--------------+
and you use it all to train the model, then the model will train on the three anomalous values, right? The algorithm will build the forest using these 3 anomalous values, so it can be easier for the model to predict them.
Now, what would happen with this production data:
+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
| 8 | 1 | 3.25 | <- anomaly
| 9 | 1 | 4.40 | <- anomaly
| 10 | 0 | 2.16 |
+----+-------------+--------------+
You pass it to your model, and it says the points are not anomalous but normal data, because it thinks your "threshold" is for values bigger than 5.
This "threshold" is a product of the algorithm's hyperparameters; with another configuration the model might have predicted these values as anomalous, but you are not testing the model's generalization.
So how can you improve this configuration? Splitting the data that you have available at that moment. Instead of training with all the database data, you could have trained with only a part of it and use the other part to test, for example use this part as train data:
+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
| 1 | 0 | 0.25 |
| 2 | 0 | 0.40 |
| 3 | 1 | 0.16 |
| 4 | 1 | 0.30 |
| 7 | 0 | 16.30 | <- anomaly
+----+-------------+--------------+
And this as test data:
+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
| 5 | 0 | 5.3 | <- anomaly
| 6 | 1 | 14.4 | <- anomaly
+----+-------------+--------------+
And find a combination of hyperparameters that makes the algorithm predict the test data correctly. Does this ensure that future predictions will be perfect? No, it does not, but it is not the same as just fitting the data without evaluating how well the model generalizes.
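A minimal sketch of this split with scikit-learn, using the engine values from the tables above (the hyperparameter values are illustrative, not tuned):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Train rows 1-4 and 7: (engine_type, engine_value)
X_train = np.array([[0, 0.25], [0, 0.40], [1, 0.16], [1, 0.30], [0, 16.30]])
# Held-out test rows 5 and 6, both known anomalies
X_test = np.array([[0, 5.3], [1, 14.4]])

iforest = IsolationForest(n_estimators=100, contamination=0.2, random_state=42)
iforest.fit(X_train)

# predict() returns -1 for anomaly, 1 for normal; tune the
# hyperparameters until the held-out anomalies are flagged as -1
print(iforest.predict(X_test))
```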
Also can someone please confirm if my code is valid?
Yes, but let me add a recommendation, changing this:
iForest.fit(df.values)
df['anomaly'] = iForest.predict(df.values)
To this:
df['anomaly'] = iForest.fit_predict(df.values)
Also, if you are using a recent pandas version, prefer:
df['anomaly'] = iForest.fit_predict(df.to_numpy())
I have a list (converted to a list after reading a .feature file) with:
Given Device unit of measure is set to value "<uom>"
And Device is set to value "Disabled"
And Device is set to value "<time>"
Examples:
| time | uom |
| 1 | kpa |
| 2 | kpa |
| 3 | kpa |
| 4 | kpa |
| 5 | kpa |
| 10 | kpa |
| 15 | kpa |
| 30 | kpa |
| 60 | kpa |
| 90 | kpa |
I am trying to convert it to:
Iteration 1:
Given Device unit of measure is set to value "kpa"
And Device is set to value "Disabled"
And Device is set to value "1"
Iteration 2:
Given Device unit of measure is set to value "kpa"
And Device is set to value "Disabled"
And Device is set to value "2"
and so on... There should be 10 iterations at the end, with each value substituted.
I know when running behave tests, behave does this for you, but I am trying to get these steps stored in a database for future reference.
Question is how can I convert the gherkin steps to the iterations I have shown above, using python?
Thanks for your help!
Found a solution. Basically I broke it apart piece by piece:
parsing the scenario/scenario outline,
parsing the examples,
and piecing the steps back together.
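Those pieces can be sketched in plain Python, assuming the outline steps and the Examples table have already been read out of the .feature file (no behave internals are used here):

```python
steps = [
    'Given Device unit of measure is set to value "<uom>"',
    'And Device is set to value "Disabled"',
    'And Device is set to value "<time>"',
]

# Parsed Examples table: header row plus value rows
header = ["time", "uom"]
rows = [["1", "kpa"], ["2", "kpa"], ["3", "kpa"]]

def expand(step, values):
    """Substitute every "<name>" placeholder with the row's value."""
    for name, value in values.items():
        step = step.replace(f"<{name}>", value)
    return step

iterations = [
    [expand(step, dict(zip(header, row))) for step in steps]
    for row in rows
]

print(iterations[0][0])  # Given Device unit of measure is set to value "kpa"
print(iterations[1][2])  # And Device is set to value "2"
```

Each inner list is one iteration's worth of concrete steps, ready to be stored in the database.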
I'm a beginner programmer looking for help with the Simple Moving Average (SMA). I'm working with two-column files, where the first column is time and the second is a value. Both the time intervals and the values are irregular. Usually the files are not big, but the process collects data for a long time. At the end, files look similar to this:
+-----------+-------+
| Time | Value |
+-----------+-------+
| 10 | 3 |
| 1345 | 50 |
| 1390 | 4 |
| 2902 | 10 |
| 34057 | 13 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
After the whole process, the number of rows is around 60k-100k.
Then I try to "smooth" the data with some time window. For this purpose I use an SMA. [AWK_method]
awk -v size="$timewindow" '{mod=NR%size; if(NR<=size){count++}else{sum-=array[mod]}; sum+=$2; array[mod]=$2; print sum/count}' file.dat
To make the SMA work with a predefined $timewindow, I first expand the data onto a linear time axis, filling the gaps with zeros. Then I run the script with different $timewindow values and observe the results.
+-----------+-------+
| Time | Value |
+-----------+-------+
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| (...) | |
| 10 | 3 |
| 11 | 0 |
| 12 | 0 |
| (...) | |
| 1343 | 0 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
For small data sets this was relatively comfortable, but now it is quite time-consuming, and the generated files are getting too big. I'm also familiar with Gnuplot, but doing an SMA there is hell...
So here are my questions:
Is it possible to change the awk solution so it doesn't need the zero-filled data?
Do you recommend any other solution using bash?
I have also considered learning Python, because after 6 months of learning bash I have run into its limitations. Would I be able to solve this in Python without creating huge intermediate files?
I'd be glad for any form of help or advice.
Best regards!
[AWK_method] http://www.commandlinefu.com/commands/view/2319/awk-perform-a-rolling-average-on-a-column-of-data
Since you included a python tag, check out traces:
http://traces.readthedocs.io/en/latest/
Here are some other insights:
Moving average for time series with unequal intervals
http://www.eckner.com/research.html
https://stats.stackexchange.com/questions/28528/moving-average-of-irregular-time-series-data-using-r
https://en.wikipedia.org/wiki/Unevenly_spaced_time_series
Key phrase, in bold, for more research:
In statistics, signal processing, and econometrics, an **unevenly (or unequally or irregularly) spaced time series** is a sequence of observation time and value pairs (tn, Xn) with strictly increasing observation times. As opposed to equally spaced time series, the spacing of observation times is not constant.
awk '{Q=$2-last;if(Q>0){while(Q>1){print "| "++i" | 0 |";Q--};print;last=$2;next};last=$2;print}' Input_file
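If you do go the Python route from the question, the zero-filling can be avoided entirely: keep the raw (time, value) pairs and average over a trailing time window with two pointers. A minimal sketch (timestamps assumed strictly increasing, as in the files above):

```python
def time_window_sma(times, values, window):
    """Moving average over a trailing time window on irregular samples.

    For each point i, averages the values whose timestamps lie in
    (times[i] - window, times[i]]; no zero-filled grid is needed.
    Runs in O(n) because each sample enters and leaves the window once.
    """
    out = []
    left = 0          # index of the oldest sample still in the window
    running = 0.0     # running sum of values inside the window
    for i, (t, v) in enumerate(zip(times, values)):
        running += v
        while times[left] <= t - window:   # evict samples that fell out
            running -= values[left]
            left += 1
        out.append(running / (i - left + 1))
    return out

times = [10, 1345, 1390, 2902, 34057]
values = [3, 50, 4, 10, 13]
print(time_window_sma(times, values, window=2000))
# [3.0, 26.5, 19.0, 21.33..., 13.0]
```

This reads the 60k-100k rows once and never materializes the huge zero-padded file.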