Simple moving average for randomly spaced time values - Python

I'm a beginner programmer looking for help with the Simple Moving Average (SMA). I'm working with column files, where the first column is the time and the second is a value. The time intervals are random and so are the values. The files are usually not big, but the process collects data for a long time. At the end a file looks similar to this:
+-----------+-------+
| Time      | Value |
+-----------+-------+
| 10        | 3     |
| 1345      | 50    |
| 1390      | 4     |
| 2902      | 10    |
| 34057     | 13    |
| (...)     |       |
| 898975456 | 10    |
+-----------+-------+
After the whole process, the number of rows is around 60k-100k.
Then I'm trying to "smooth" the data over some time window. For this purpose I'm using an SMA [AWK_method]:
awk -v size="$timewindow" '{mod=NR%size; if(NR<=size){count++}else{sum-=array[mod]}; sum+=$1; array[mod]=$1; print sum/count}' file.dat
To make the SMA work properly with a predefined $timewindow, I fill in the missing time steps with zero-value rows. Next, I run the script with different $timewindow values and observe the results.
+-----------+-------+
| Time      | Value |
+-----------+-------+
| 1         | 0     |
| 2         | 0     |
| 3         | 0     |
| (...)     |       |
| 10        | 3     |
| 11        | 0     |
| 12        | 0     |
| (...)     |       |
| 1343      | 0     |
| (...)     |       |
| 898975456 | 10    |
+-----------+-------+
For small data sets this was relatively comfortable, but now it is quite time-consuming and the generated files are getting too big. I'm also familiar with gnuplot, but doing an SMA there is hell...
So here are my questions:
Is it possible to change the awk solution so that it bypasses filling the data with zeros?
Do you recommend any other solution using bash?
I have also considered learning Python because, after 6 months of learning bash, I have come to know its limitations. Would I be able to solve this in Python without creating big intermediate files?
I'll be glad for any form of help or advice.
Best regards!
[AWK_method] http://www.commandlinefu.com/commands/view/2319/awk-perform-a-rolling-average-on-a-column-of-data

Since you included a python tag, check out traces:
http://traces.readthedocs.io/en/latest/
Here are some other insights:
Moving average for time series with unequal intervals
http://www.eckner.com/research.html
https://stats.stackexchange.com/questions/28528/moving-average-of-irregular-time-series-data-using-r
https://en.wikipedia.org/wiki/Unevenly_spaced_time_series
A key phrase for more research:
In statistics, signal processing, and econometrics, an unevenly (or unequally or irregularly) spaced time series is a sequence of observation time and value pairs (tn, Xn) with strictly increasing observation times. As opposed to equally spaced time series, the spacing of observation times is not constant.
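To illustrate that last point, here is a minimal Python sketch (an assumption, not code from the links above) that computes the SMA over a time window directly on the irregular samples, so no zero-filled rows are needed. It assumes the file has two whitespace-separated columns, time and value:

from collections import deque

def sma_time_window(samples, window):
    """Average of all samples whose time lies within `window` of the current time."""
    buf = deque()               # (time, value) pairs currently inside the window
    total = 0.0
    out = []
    for t, v in samples:
        buf.append((t, v))
        total += v
        while buf and buf[0][0] < t - window:   # drop samples that fell out of the window
            total -= buf.popleft()[1]
        out.append((t, total / len(buf)))
    return out

with open("file.dat") as f:
    data = [(float(t), float(v)) for t, v in (line.split() for line in f if line.strip())]

for t, avg in sma_time_window(data, window=1000):   # window in the same units as the time column
    print(t, avg)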

Another posted approach generates the zero-filled rows with awk instead of by hand:
awk '{Q=$2-last;if(Q>0){while(Q>1){print "| "++i" | 0 |";Q--};print;last=$2;next};last=$2;print}' Input_file

Related

Which programming methodology could I use to solve this workflow optimization with constraints?

So the problem is how to maximize the productivity of a production line under many constraints.
Below is a table of each worker's processing times and which steps they can perform.
The constraints are:
Each product is required to go through these 6 procedures sequentially (1 to 2 to 3 to 4 to 5 to 6), and each worker is only capable of processing certain steps. All the products start from Building A, and after completing all the steps they can be in either building for shipment. Each worker can only process 1 product at a time and is not allowed to run different procedures concurrently. It is assumed that the product is always available to start at Building X.
The transportation time within the same building is assumed to be negligible. However, the cross-building transportation time is 25 minutes. The truck, with a maximum capacity of 5, can only be at one building at any point in time.
| Worker | Procedure 1 time/min | Procedure 2 time/min | Procedure 3 time/min | Procedure 4 time/min | Procedure 5 time/min | Procedure 6 time/min |
| -------- | -------- |-------- |-------- |-------- |-------- |-------- |
| a | 5 | | 10 | | | |
| b | | 15 | | | | 10 |
| c | | 15 | | | 10 | |
| d | 5 | | | 15 | | |
| e | 5 | | 5 | | 15 | |
| f | | | | 10 | | 10 |
The objective is to find the maximum throughput (the total number of products produced) within 168 hours. You also need to be able to list out every step that each product went through during the process.
I have tried to split the question into two parts:
Firstly, the workers produce the products normally (I have to list out every single step by hand, and I am still not sure this is the best way to optimise the results). Then, for the last stage, I assume that all the workers reach an equilibrium state in which each worker keeps doing one procedure and each procedure produces the same amount of products at the same time. (The idea is to assume that all the workers, as well as the truck, are working all the time, so as to maximise productivity.) I have tried to solve this second part using linear programming and got results, but with this methodology I cannot recover the specific steps that achieve the optimum.
Now I am not sure which methodology I could use to solve this problem. Can someone give me any suggestions, please? I really appreciate it.
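For what it's worth, here is a minimal sketch of the linear-programming relaxation described in the second part (using SciPy is an assumption; the model ignores sequencing, transport and the truck, so it only gives an upper bound on the throughput, not a schedule):

import numpy as np
from scipy.optimize import linprog

MINUTES = 168 * 60

# times[worker][procedure] in minutes; None = this worker cannot do that step
times = {
    'a': [5, None, 10, None, None, None],
    'b': [None, 15, None, None, None, 10],
    'c': [None, 15, None, None, 10, None],
    'd': [5, None, None, 15, None, None],
    'e': [5, None, 5, None, 15, None],
    'f': [None, None, None, 10, None, 10],
}

# Variables: minutes x[w, p] each worker spends on each procedure they can do,
# plus T = number of products completed at every procedure.
pairs = [(w, p) for w, row in times.items() for p, t in enumerate(row) if t is not None]
n = len(pairs) + 1
c = np.zeros(n)
c[-1] = -1.0                       # maximize T  <=>  minimize -T

A_ub, b_ub = [], []
for w in times:                    # each worker has at most 168 h available
    row = np.zeros(n)
    for i, (w2, _) in enumerate(pairs):
        if w2 == w:
            row[i] = 1.0
    A_ub.append(row)
    b_ub.append(MINUTES)
for p in range(6):                 # every procedure must be performed T times
    row = np.zeros(n)
    for i, (w, p2) in enumerate(pairs):
        if p2 == p:
            row[i] = -1.0 / times[w][p]   # minutes / (minutes per unit) = units produced
    row[-1] = 1.0                         # T - units_produced(p) <= 0
    A_ub.append(row)
    b_ub.append(0.0)

res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=[(0, None)] * n)
print("Upper bound on products in 168 h:", res.x[-1])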

How to process multiple time series with machine learning / deep learning methods (fault diagnosis)

There is an industrial fault diagnosis scenario. This is a binary classification problem on time series. When a fault occurs, the data from one machine looks like the table below: the label changes from zero to one.
| time | feature | label |
| ---- | ------- | ----- |
| 1    | 26      | 0     |
| 2    | 29      | 1     |
| 3    | 30      | 1     |
| 4    | 20      | 0     |
The question is, the fault doesn't happen frequently, so I need to select a sufficient number of slices of the time series for training.
Thus I want to ask how I should organize these data: should I treat them as one time series, or is there another choice? How should I organize these data, and what machine learning method should I use for fault diagnosis?
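For illustration only (not from the question; the column names and window sizes are assumptions), one common way to organize such data is to cut each machine's recording into fixed-length sliding windows and label a window as faulty if it contains any fault sample:

import numpy as np
import pandas as pd

def make_windows(df, window=50, step=10):
    """Slice a (time, feature, label) DataFrame into overlapping training windows."""
    values = df['feature'].to_numpy()
    labels = df['label'].to_numpy()
    X, y = [], []
    for start in range(0, len(df) - window + 1, step):
        X.append(values[start:start + window])
        y.append(int(labels[start:start + window].any()))  # 1 if the window contains a fault
    return np.array(X), np.array(y)

# Example: windows from one machine's recording; repeat per machine and stack the results.
# df = pd.read_csv("machine_01.csv")        # hypothetical file name
# X, y = make_windows(df, window=50, step=10)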

Python, extracting features from time series (TSFRESH package, or what can I use?)

I need some help with feature extraction from time series, maybe using the TSFRESH package.
I have circa 5000 CSV files, and each one of them is a single time series (they may differ in length). The CSV time series are pretty straightforward:
Example of a CSV-Time-Series file:
| Date | Value |
| ------ | ----- |
| 1/1/1904 01:00:00,000000 | 1,464844E-3 |
| 1/1/1904 01:00:01,000000 | 1,953125E-3 |
| 1/1/1904 01:00:02,000000 | 4,882813E-4 |
| 1/1/1904 01:00:03,000000 | -2,441406E-3 |
| 1/1/1904 01:00:04,000000 | -9,765625E-4 |
| ... | ... |
Along with these CSV files, I also have a metadata file (in a CSV format), where each row refers to one of those 5000 CSV-time-series, and reports more general information about that time series such as the energy, etc.
Example of the metadata-CSV file:
| Path of the CSV-timeseries | Label | Energy | Penetration | Porosity |
| -------------------------- | ----- | ------ | ----------- | -------- |
| ...                        | ...   | ...    | ...         | ...      |
| ...                        | ...   | ...    | ...         | ...      |
| ...                        | ...   | ...    | ...         | ...      |
The most important column is the "Label" one since it reports if a CSV-time-series was labeled as:
Good
Bad
I should also consider the energy, penetration, and porosity columns since those values have a big role in the labeling of the time series. (I already tried a decision tree by looking at only the features, now I would like to analyze the time series to extract knowledge)
I intend to extract features from the time series so that I can understand which features make a time series be labeled as "Good" or "Bad".
How can I do this with TSFRESH?
Are there other ways to do this?
Could you show me how to do it? Thank you :)
I'm currently doing something similar, and this example Jupyter notebook from GitHub helped me.
The basic process, in short:
1. Bring the time series into an acceptable format; see the tsfresh documentation for more information.
2. Extract features from the time series using X = extract_features(...).
3. Select relevant features using X_filtered = select_features(X, y), with y being your label, good or bad being e.g. 1 and 0.
4. Put the selected features into a classifier, as also shown in the Jupyter notebook.
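A hedged sketch of those four steps (the file layout, column names and CSV parsing details are assumptions; adjust them to your data):

import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute

meta = pd.read_csv("metadata.csv")   # hypothetical file name

# 1. Stack all series into tsfresh's "long" format: one frame with an id column.
frames = []
for i, row in meta.iterrows():
    ts = pd.read_csv(row["Path of the CSV-timeseries"], names=["Date", "Value"])
    ts["id"] = i
    frames.append(ts)
long_df = pd.concat(frames, ignore_index=True)

# 2. Extract features, one row of features per time series.
X = extract_features(long_df, column_id="id", column_sort="Date", column_value="Value")
impute(X)                            # tsfresh needs NaN-free features before selection

# 3. Keep only the features that are relevant for the label (Good=1, Bad=0).
y = (meta["Label"] == "Good").astype(int)
X_filtered = select_features(X, y)

# 4. X_filtered (plus Energy/Penetration/Porosity) can now be fed to a classifier.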

Minimum number of rows where the column sum is as close to N as possible, dealing with non-integers

I've got a DataFrame that looks something like this in Pandas.
| clip_id | duration |
|---------:|-----------:|
| 0050 | 3.085 |
| 0019 | 3.125 |
| 0001 | 3.265 |
...
| 0010 | 4.47 |
| 0024 | 4.48 |
| 0034 | 4.49 |
| 0004 | 4.515 |
...
| 0008 | 6.795 |
| 0034 | 6.99 |
| 0026 | 6.995 |
...
| 0004 | 9.005 |
| 0024 | 9.185 |
| 0048 | 9.265 |
| 0029 | 10.055 |
| 0001 | 10.255 |
| 0006 | 10.85 |
I've trimmed the table here using ellipses, but the number of rows is usually between 30 and 100. Also, I have sorted the table by the duration column.
My goal is to find the minimum number of clips such that their sum is lazily above some value N. In other words, if N=25, picking up the bottom three rows is not a good enough solution, as the sum there would be 31.16 and there exists a greedier/lazier solution which is closer to 25.
It's been a super long time since I've taken an algorithms / data-structures class but I was sure there was a Heap-related solution to this problem. I also haven't done dynamic programming in Python before but perhaps there's a solution which involves DP. Looking around at other solved questions on StackOverflow, the most voted answers always assume that
(A) you're dealing with integers, OR
(B) you'll be able to find an exact sum
But that won't be the case for what I'm trying to do. Ideally if I can do this while the data is stored in a Pandas DataFrame, then it'll be easy to return the clip_id values for those resultant rows.
Appreciate any and all help I can get on this front!
Edit: So thinking about the problem more, what makes it tough is that there are two competing goals: I want the fewest number of rows, but I also want the sum barely above N if possible. So between the two goals, I would say being as close to N is the more important condition. So for example, if increasing the number of rows by 2 can bring the total closer to >= N, then that would be more preferred.
Is the solution below something like what you meant? Do you want to get the rows that sum to more than N, or do you want to exclude them? Depending on your answer, I can update the code.
def min_num_rows(data, threshold):
    # data is sorted by 'duration' ascending, so data[-i:] are the i largest clips;
    # stop at the smallest i whose total duration exceeds the threshold
    for i in range(1, len(data) + 1):
        if data[-i:]['duration'].sum() > threshold:
            break
    return data[-i:]

df = df.sort_values('duration', ascending=True)
above = min_num_rows(df, 25)
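As a separate, hedged sketch of the "closest above N" goal from the edit (the function name and scaling are assumptions): scale the durations to integer milliseconds, run a subset-sum style dynamic program that records the fewest rows for every reachable total, and take the smallest reachable total at or above N.

import math

def fewest_rows_closest_above(df, N, scale=1000):
    """Rows whose duration sum is the smallest reachable value >= N; for that sum, fewest rows."""
    durs = (df['duration'] * scale).round().astype(int).tolist()
    target = int(math.ceil(N * scale))
    if sum(durs) < target:
        return df                            # even all rows together stay below N
    cap = target + max(durs)                 # sums beyond this can never be optimal
    best = {0: (0, ())}                      # reachable sum -> (row count, row positions)
    for idx, d in enumerate(durs):
        for s, (cnt, rows) in list(best.items()):   # snapshot so each row is used at most once
            t = s + d
            if t <= cap and (t not in best or best[t][0] > cnt + 1):
                best[t] = (cnt + 1, rows + (idx,))
    s = min(s for s in best if s >= target)          # closest achievable sum at or above N
    return df.iloc[list(best[s][1])]

# above = fewest_rows_closest_above(df, 25)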

Improving MySQL read time, MySQLdb

I have a table with more than a million records with the following structure:
mysql> SELECT * FROM Measurement;
+----------------+---------+-----------------+------+------+
| Time_stamp     | Channel | SSID            | CQI  | SNR  |
+----------------+---------+-----------------+------+------+
| 03_14_14_30_14 | 7       | open            | 40   | -70  |
| 03_14_14_30_14 | 7       | roam            | 31   | -79  |
| 03_14_14_30_14 | 8       | open2           | 28   | -82  |
| 03_14_14_30_15 | 8       | roam2           | 29   | -81  | ...
I am reading data from this table into Python for plotting. The problem is that the MySQL reads are too slow, and it is taking me hours to get the plots even after using MySQLdb.cursors.SSCursor (as suggested by a few in this forum) to speed up the task.
import MySQLdb as mdb
import MySQLdb.cursors

con = mdb.connect('localhost', 'testuser', 'conti', 'My_Freqs',
                  cursorclass=MySQLdb.cursors.SSCursor)
cursor = con.cursor()
cursor.execute("SELECT Time_stamp FROM Measurement")
for row in cursor:
    # ... do processing ...
Will normalizing the table help speed up the task? If so, how should I normalize it?
P.S: Here is the result for EXPLAIN
+------------+--------------+------+-----+---------+-------+
| Field      | Type         | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| Time_stamp | varchar(128) | YES  |     | NULL    |       |
| Channel    | int(11)      | YES  |     | NULL    |       |
| SSID       | varchar(128) | YES  |     | NULL    |       |
| CQI        | int(11)      | YES  |     | NULL    |       |
| SNR        | float        | YES  |     | NULL    |       |
+------------+--------------+------+-----+---------+-------+
The problem is probably that you are looping over the cursor instead of dumping out all the data at once and then processing it. You should be able to dump out a couple million rows in a few seconds. Try something like:
cursor.execute("select Time_stamp FROM Measurement")
data = cusror.fetchall()
for row in data:
#do some stuff...
Well, since you're saying the whole table has to be read, I guess you can't do much about it. It has more than 1 million records... you're not going to optimize much on the database side.
How much time does it take you to process just one record? Maybe you could try optimizing that part. But even if you got down to 1 millisecond per record, it would still take you about half an hour to process the full table. You're dealing with a lot of data.
Maybe run multiple plotting jobs in parallel? With the same metrics as above, dividing your data in 6 equal-sized jobs would (theoretically) give you the plots in 5 minutes.
Do your plots have to be fine-grained? You could look for ways to ignore certain values in the data, and generate a complete plot only when the user needs it (wild speculation here, I really have no idea what your plots look like)
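To make the parallel-jobs suggestion concrete, here is a rough sketch (connection details copied from the question; the chunk size, worker count and LIMIT/OFFSET slicing are assumptions, and OFFSET gets slow on large tables without an index):

from multiprocessing import Pool
import MySQLdb as mdb

CHUNK = 200000          # rows per job (assumption)
TOTAL = 1200000         # rough row count (assumption)

def process_chunk(offset):
    # each worker opens its own connection and reads one slice of the table
    con = mdb.connect('localhost', 'testuser', 'conti', 'My_Freqs')
    cur = con.cursor()
    cur.execute("SELECT Time_stamp FROM Measurement LIMIT %d OFFSET %d" % (CHUNK, offset))
    rows = cur.fetchall()
    con.close()
    # ... do the per-row processing here and return a partial result ...
    return len(rows)

if __name__ == "__main__":
    with Pool(6) as pool:                          # 6 parallel jobs, as in the answer above
        results = pool.map(process_chunk, range(0, TOTAL, CHUNK))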
