How to process multiple time series with machine learning/deep learning methods (fault diagnosis) - python

This is an industrial fault diagnosis scenario. It is a binary classification problem on time series. When a fault occurs, the data from one machine looks like the table below: the label changes from zero to one.
| time | feature |label|
| -------- | -------------- | -------------- |
| 1 | 26 |0|
| 2 |29 |1|
| 3 | 30 |1|
| 4 | 20 |0|
The problem is that faults do not happen frequently, so I need to select a sufficient number of time-series slices for training.
So I want to ask how I should organize these data: should I treat them as one time series, or are there other options? How should I organize the data, and which machine learning method should I use for fault diagnosis?
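One common way to organise such data is to cut each machine's recording into fixed-length sliding windows and label each window. The following is only a sketch of that idea, not an answer taken from this thread: the function name `make_windows`, the window width, and the choice to label a window by its last time step are all assumptions made for illustration.

```python
import numpy as np

def make_windows(features, labels, width, step=1):
    """Cut one recording into overlapping windows of `width` time steps.

    Assumption: each window takes the label of its last time step.
    """
    X, y = [], []
    for start in range(0, len(features) - width + 1, step):
        end = start + width
        X.append(features[start:end])
        y.append(labels[end - 1])
    return np.asarray(X), np.asarray(y)

# The small table from the question, as one recording from one machine.
feature = [26, 29, 30, 20]
label = [0, 1, 1, 0]

X, y = make_windows(feature, label, width=2)
print(X.tolist())  # [[26, 29], [29, 30], [30, 20]]
print(y.tolist())  # [1, 1, 0]
```

Windows from several machines can then be stacked into one training set, which keeps recordings separate while still giving a classifier many labelled examples.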

Related

Which methodology of programming technique could I use to solve the workflow optimization with constraints?

So there is a problem about how to maximize the productivity of the production line if there are many constraints.
Below is the table of the productivity of each worker and in which step they can produce.
The constraints are as follows:
Each product must go through these 6 procedures sequentially (1 to 2 to 3 to 4 to 5 to 6), and each worker is only capable of processing certain steps. All the products will start from Building A, and after completing all the steps, they can be in either building for shipment. Each worker can only process 1 product at a time and is not allowed to run different procedures concurrently. It is assumed that the product is always available to start at Building X.
The transportation time within the same building is assumed to be negligible. However, cross building transportation time is 25 mins. The truck of a maximum capacity of 5, can only be at either building at any point in time.
| Worker | Procedure 1 time/min | Procedure 2 time/min | Procedure 3 time/min | Procedure 4 time/min | Procedure 5 time/min | Procedure 6 time/min |
| -------- | -------- |-------- |-------- |-------- |-------- |-------- |
| a | 5 | | 10 | | | |
| b | | 15 | | | | 10 |
| c | | 15 | | | 10 | |
| d | 5 | | | 15 | | |
| e | 5 | |5 | | 15 | |
| f | | | | 10 | | 10 |
The objective is to find the maximum throughput (the total number of products produced) within 168 hours. You will also need to be able to list every step that each product went through during the process.
I have tried to split the question into two parts:
First, the workers produce the products normally (I have to list every single step by hand, but I am still not sure this is the best way to optimise the result). Then, at some point in time, the last stage is to assume that all the workers are in an equilibrium state for each procedure, and that each procedure produces the same amount of products at the same time. (The idea is to assume that all the workers, as well as the truck, are working all the time to maximise productivity.) I have tried to solve the second part using linear programming and got results, but I cannot get the specific steps that lead to the optimised result using this methodology.
Now I am not sure which methodology I could use to solve this problem. Can someone give me any suggestions, please? I would really appreciate it.

Python, extracting features from time series (TSFRESH package or what can I use?)

I need some help for feature extraction in time series, maybe using the TSFRESH package.
I have about 5000 CSV files, and each one of them is a single time series (they may differ in length). The CSV time series format is pretty straightforward:
Example of a CSV-Time-Series file:
| Date | Value |
| ------ | ----- |
| 1/1/1904 01:00:00,000000 | 1,464844E-3 |
| 1/1/1904 01:00:01,000000 | 1,953125E-3 |
| 1/1/1904 01:00:02,000000 | 4,882813E-4 |
| 1/1/1904 01:00:03,000000 | -2,441406E-3 |
| 1/1/1904 01:00:04,000000 | -9,765625E-4 |
| ... | ... |
Along with these CSV files, I also have a metadata file (in a CSV format), where each row refers to one of those 5000 CSV-time-series, and reports more general information about that time series such as the energy, etc.
Example of the metadata-CSV file:
| Path of the CSV-timeseries | Label | Energy | Penetration | Porosity |
| ------ | ----- | ------ | ----- | ----- |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
The most important column is the "Label" one since it reports if a CSV-time-series was labeled as:
Good
Bad
I should also consider the energy, penetration, and porosity columns since those values have a big role in the labeling of the time series. (I already tried a decision tree by looking at only the features, now I would like to analyze the time series to extract knowledge)
I intend to extract features from the time series such that I can understand what are the features that make one time series be labeled as "Good" or "Bad".
How can I do this with TSFRESH?
Are there other ways to do this?
Could you show me how to do it? Thank you :)
I'm doing something similar currently and this example jupyter notebook from github helped me.
The basic process is in short:
Bring the time series into an acceptable format, see the tsfresh documentation for more information
Extract features from the time series using X = extract_features(...)
Select relevant features using X_filtered = select_features(X, y), with y being your label, good or bad being e.g. 1 and 0.
Put the selected features into a classifier, also shown in the jupyter notebook.
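The steps above can be sketched roughly as follows. This is a minimal sketch under assumptions, not the notebook's code: the helper names `to_long_format` and `extract_relevant_features`, the column names `id`, `time` and `value`, and the toy data are all made up for illustration. tsfresh is imported inside the second function so that the formatting helper can be tried without tsfresh installed.

```python
import pandas as pd

def to_long_format(series_by_id):
    """Stack {series id: list of values} into one long table (id, time, value)."""
    frames = [
        pd.DataFrame({
            "id": [sid] * len(vals),
            "time": list(range(len(vals))),  # position within the series
            "value": list(vals),
        })
        for sid, vals in series_by_id.items()
    ]
    return pd.concat(frames, ignore_index=True)

def extract_relevant_features(long_df, y):
    """Extract tsfresh features per series, then keep those relevant to y.

    y is assumed to be a pandas Series indexed by series id (e.g. 1 = Good, 0 = Bad).
    """
    from tsfresh import extract_features, select_features
    from tsfresh.utilities.dataframe_functions import impute

    X = extract_features(long_df, column_id="id", column_sort="time")
    impute(X)  # replaces NaN/inf values in place
    return select_features(X, y)

# Toy data standing in for the CSV files; series may differ in length.
long_df = to_long_format({"a": [1.0, 2.0, 3.0], "b": [4.0, 5.0]})
print(long_df)
```

With real data, each CSV file would be read into one entry of the dictionary, and the "Label" column of the metadata file would supply `y`.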

Is there any way to specifically optimize a single output from a neural network in tensorflow?

For example, if I had a neural network that was playing draughts/checkers and attempted to make an invalid move, is there a way to specifically optimize that particular output?
---------------------------------------
8 | | bM | | bM | | bM | | bM |
---------------------------------------
7 | bM | | bM | | bM | | bM | |
---------------------------------------
6 | | bM | | bM | | bM | | bM |
---------------------------------------
5 | | | | | | | | |
---------------------------------------
4 | | | | | | | | |
---------------------------------------
3 | wM | | wM | | wM | | wM | |
---------------------------------------
2 | | wM | | wM | | wM | | wM |
---------------------------------------
1 | wM | | wM | | wM | | wM | |
---------------------------------------
A B C D E F G H
Suppose the board looks like this, and there is an output neuron for every possible move of a draughts piece (a movement of up to 2 in any direction), so 64 * 8 output neurons. Suppose also that the highest-probability output is neuron 8 (or any other invalid output), which would be something like B1C2 (B1 being the starting position and C2 the ending position).
Is there a way, if the output of the neural network is already a probability distribution, to update the network so that this particular output is 0 and all the other outputs are updated and normalized?
I've tried looking at examples of neural nets that train on the MNIST data set with AdamOptimizer, but couldn't find anything that changes only one particular output rather than the whole output layer.
Thanks for any help!
For this specific example, you're better off restructuring your network to only include moves that could potentially be valid. B1C2 will never be a valid move, so don't let that be a part of your network.
For moves that could potentially be valid but aren't actually valid, such as B2C3 (not valid for the first turn but valid after moving the piece currently on C3), you can write a custom activation function, but it will probably be easier to just adjust the output.
You can write a function that sets each invalid move to zero and then divides all the other answers by (1 - sum of invalid move predictions). Note that this assumes that you are already using softmax as your last activation function.
Edit based on follow up question below:
You can write one function that takes the board state and predictions as input and returns the predictions with invalid moves set to zero and the rest of the predictions normalized.
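A minimal sketch of such a function, using NumPy. The name `mask_invalid_moves` is made up, and the function takes a boolean mask of valid moves rather than the board itself; computing that mask from the board state is game-specific and is assumed to happen elsewhere.

```python
import numpy as np

def mask_invalid_moves(probs, valid_mask):
    """Zero out invalid moves and renormalize the remaining probabilities.

    Dividing by the sum of the valid entries is the same as dividing by
    (1 - sum of invalid entries) when `probs` already sums to 1.
    """
    probs = np.asarray(probs, dtype=float)
    valid_mask = np.asarray(valid_mask, dtype=bool)
    masked = np.where(valid_mask, probs, 0.0)
    total = masked.sum()
    if total == 0.0:
        raise ValueError("no valid move has non-zero probability")
    return masked / total

# Three candidate moves; the first one is invalid on this board.
adjusted = mask_invalid_moves([0.5, 0.3, 0.2], [False, True, True])
print(adjusted)  # approximately [0.  0.6 0.4]
```

This only changes the prediction that is used to pick a move; the network weights are untouched.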
If, instead of modifying the end result, you would rather have the network learn which moves are invalid, that can be handled by your loss function. For instance, if you are doing deep Q learning, then you would add a heavy penalty to the score for invalid moves.

Simple moving average for random related time values

I'm a beginner programmer looking for help with a Simple Moving Average (SMA). I'm working with two-column files, where the first column is the time and the second is the value. Both the time intervals and the values are random. The files are usually not big, but the process collects data for a long time. In the end the files look similar to this:
+-----------+-------+
| Time | Value |
+-----------+-------+
| 10 | 3 |
| 1345 | 50 |
| 1390 | 4 |
| 2902 | 10 |
| 34057 | 13 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
After the whole process, the number of rows is around 60k-100k.
Then I try to "smooth" the data with some time window. For this purpose I'm using SMA. [AWK_method]
awk -v size="$timewindow" '{mod=NR%size; if(NR<=size){count++}else{sum-=array[mod]};sum+=$1;array[mod]=$1;print sum/count}' file.dat
To make the SMA work properly with a predefined $timewindow, I create a linearly incrementing time column and fill the missing rows with zeros. Next, I run the script with different $timewindow values and observe the results.
+-----------+-------+
| Time | Value |
+-----------+-------+
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| (...) | |
| 10 | 3 |
| 11 | 0 |
| 12 | 0 |
| (...) | |
| 1343 | 0 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
For small data this was relatively comfortable, but now it is quite time-consuming, and the created files are starting to get too big. I'm also familiar with Gnuplot, but SMA there is hell...
So here are my questions:
Is it possible to change the awk solution to bypass filling data with zeros?
Do you recommend any other solution using bash?
I have also considered learning python because, after 6 months of learning bash, I have got to know its limitations. Will I be able to solve this in python without creating big data?
I'll be glad for any form of help or advice.
Best regards!
[AWK_method] http://www.commandlinefu.com/commands/view/2319/awk-perform-a-rolling-average-on-a-column-of-data
You included a python tag, check out traces:
http://traces.readthedocs.io/en/latest/
Here are some other insights:
Moving average for time series with not-equal intervals
http://www.eckner.com/research.html
https://stats.stackexchange.com/questions/28528/moving-average-of-irregular-time-series-data-using-r
https://en.wikipedia.org/wiki/Unevenly_spaced_time_series
key phrase in bold for more research:
In statistics, signal processing, and econometrics, an unevenly (or unequally or irregularly) spaced time series is a sequence of observation time and value pairs (tn, Xn) with strictly increasing observation times. As opposed to equally spaced time series, the spacing of observation times is not constant.
awk '{Q=$2-last;if(Q>0){while(Q>1){print "| "++i" | 0 |";Q--};print;last=$2;next};last=$2;print}' Input_file
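In Python, the zero-filling step can be avoided by averaging over a time window directly. The following is a minimal sketch in plain Python, with a made-up function name `time_window_sma`. It assumes the times are sorted in increasing order and that the window covers the interval (t - window, t]. By default it averages only the samples that fall inside the window; with `zero_fill=True` it divides by the window length instead, which is what averaging over zero-filled rows amounts to once the window is full.

```python
def time_window_sma(times, values, window, zero_fill=False):
    """Moving average over the last `window` time units, no zero rows needed."""
    out = []
    start = 0       # index of the oldest sample still inside the window
    running = 0.0   # sum of the values currently inside the window
    for i, (t, v) in enumerate(zip(times, values)):
        running += v
        # Drop samples that are `window` or more time units older than t.
        while times[start] <= t - window:
            running -= values[start]
            start += 1
        count = i - start + 1
        out.append(running / (window if zero_fill else count))
    return out

# First rows of the sample file from the question.
times = [10, 1345, 1390, 2902, 34057]
values = [3, 50, 4, 10, 13]

smoothed = time_window_sma(times, values, window=100)
print(smoothed)  # [3.0, 50.0, 27.0, 10.0, 13.0]
```

Each sample enters and leaves the running sum once, so the cost grows with the number of rows in the file rather than with the length of the time axis.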

SciPy Optimization algorithm

I need to solve an optimization task with Python.
The task is following:
A factory produces desks, chairs, bureaus and cupboards. Two types of boards can be used to produce them. The factory has 1500 m of the first type and 1000 m of the second. The factory has 800 employees. What should the factory produce, and how much of it, to get the maximum profit?
The input values are following:
| Resource \ Product | Desk | Chair | Bureau | Cupboard |
|--------------|------|-------|--------|----------|
| Board 1 type | 5 | 1 | 9 | 12 |
| Board 2 type | 2 | 3 | 4 | 1 |
| Employees | 3 | 2 | 5 | 10 |
| Profit | 12 | 5 | 15 | 10 |
Unfortunately I don't have any experience in solving optimization tasks, so I don't even know where to start. What I did:
I found SciPy's optimization package, which is supposed to solve this type of problem.
I have some idea about the input and output of my function. The input should be the amount of each type of product, and the output should be the profit. But the choice of resources (boards, employees) might also differ, and this affects the algorithm implementation.
Could you please give me at least some direction on where to go? Thank you!
EDIT:
Basically @Balzola is right: this is a job for the simplex algorithm. The task can be solved with scipy.optimize.linprog, which uses simplex under the hood.
Typical https://en.wikipedia.org/wiki/Simplex_algorithm
Looks like scipy can do it:
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#nelder-mead-simplex-algorithm-method-nelder-mead
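A minimal sketch of the task with scipy.optimize.linprog, the linear programming solver mentioned in the edit above (the Nelder-Mead method in the linked tutorial section is a different, derivative-free optimizer that happens to share the "simplex" name). The model is an assumption about how to read the table: each resource row is treated as a "less than or equal" constraint, and fractional product quantities are allowed. linprog minimizes, so the profit is negated.

```python
from scipy.optimize import linprog

# Profit per desk, chair, bureau, cupboard (negated: linprog minimizes).
c = [-12, -5, -15, -10]

# Resource use per unit of each product, one row per resource.
A_ub = [
    [5, 1, 9, 12],   # board type 1
    [2, 3, 4, 1],    # board type 2
    [3, 2, 5, 10],   # employees
]
b_ub = [1500, 1000, 800]  # available amount of each resource

# Quantities cannot be negative; no upper bound other than the resources.
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4)

print(res.success)  # True when the solver found an optimum
print(res.x)        # quantity of each product
print(-res.fun)     # maximum profit
```

If whole units are required, the fractional result would need rounding or an integer programming solver, which this sketch does not cover.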
