Python, extracting features from time series (TSFRESH package or what can I use?) - python

I need some help for feature extraction in time series, maybe using the TSFRESH package.
I have about 5000 CSV files, and each one of them is a single time series (they may differ in length). The CSV time series are pretty straightforward:
Example of a CSV time-series file:
| Date | Value |
| ------ | ----- |
| 1/1/1904 01:00:00,000000 | 1,464844E-3 |
| 1/1/1904 01:00:01,000000 | 1,953125E-3 |
| 1/1/1904 01:00:02,000000 | 4,882813E-4 |
| 1/1/1904 01:00:03,000000 | -2,441406E-3 |
| 1/1/1904 01:00:04,000000 | -9,765625E-4 |
| ... | ... |
Along with these CSV files, I also have a metadata file (in CSV format), where each row refers to one of those 5000 time series and reports more general information about it, such as the energy, etc.
Example of the metadata-CSV file:
| Path of the CSV-timeseries | Label | Energy | Penetration | Porosity |
| --------------------------- | ----- | ------ | ----------- | -------- |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
| ... | ... | ... | ... | ... |
The most important column is the "Label" one since it reports if a CSV-time-series was labeled as:
Good
Bad
I should also consider the energy, penetration, and porosity columns, since those values play a big role in the labeling of the time series. (I already tried a decision tree using only those metadata features; now I would like to analyze the time series themselves to extract knowledge.)
I intend to extract features from the time series so that I can understand which features make a time series be labeled as "Good" or "Bad".
How can I do this with TSFRESH?
Are there other ways to do this?
Could you show me how to do it? Thank you :)

I'm currently doing something similar, and this example Jupyter notebook from GitHub helped me.
In short, the basic process is:
Bring the time series into an acceptable format; see the tsfresh documentation for more information.
Extract features from the time series using X = extract_features(...).
Select relevant features using X_filtered = select_features(X, y), with y being your label (e.g. Good = 1 and Bad = 0).
Put the selected features into a classifier, as also shown in the Jupyter notebook.
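A minimal sketch of those four steps (not the notebook itself; it assumes the 5000 CSVs have already been concatenated into one long-format DataFrame with an id column naming the source file, and that the metadata has a matching id column; all file and column names below are assumptions):
import pandas as pd
from tsfresh import extract_features, select_features
from tsfresh.utilities.dataframe_functions import impute
from sklearn.tree import DecisionTreeClassifier

# long format: one row per observation, "id" says which CSV the row came from
# (file names and column names are assumptions, not tsfresh requirements)
timeseries = pd.read_csv("all_series_long.csv")   # columns: id, time, value
metadata = pd.read_csv("metadata.csv")            # columns: id, Label, Energy, ...

# target: Good -> 1, Bad -> 0, indexed by the same ids as the time series
y = (metadata.set_index("id")["Label"] == "Good").astype(int)

X = extract_features(timeseries, column_id="id", column_sort="time",
                     column_value="value")
impute(X)                              # replace NaN/inf so select_features can run
X_filtered = select_features(X, y)     # keep only features relevant to the label

clf = DecisionTreeClassifier().fit(X_filtered, y)  # or any other classifier
Inspecting X_filtered.columns (or the classifier's feature importances) then tells you which extracted features separate "Good" from "Bad" series.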

Related

How to process multiple time series with machine-learning/deep-learning methods (fault diagnosis)

There is an industrial fault-diagnosis scenario. This is a binary classification problem concerning time series. When a fault occurs, the data from one machine look like the table below; the label changes from zero to one:
| time | feature | label |
| ---- | ------- | ----- |
| 1    | 26      | 0     |
| 2    | 29      | 1     |
| 3    | 30      | 1     |
| 4    | 20      | 0     |
The problem is that faults don't happen frequently, so I need to select a sufficient number of slices of the time series for training.
So I want to ask how I should organize these data: should I treat them as one time series, or is there another choice? How should I organize the data, and what machine-learning method should I use for fault diagnosis?
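There is no answer attached here, but as a hedged sketch of the "slices" idea mentioned above: a common way to organize such data is to cut each machine's recording into fixed-length windows and label a window as faulty if any sample inside it is labelled 1 (the window length, step, labelling rule, and file name below are assumptions, not something from the question):
import numpy as np
import pandas as pd

def make_windows(df, window=50, step=25):
    """Slice one machine's series into overlapping windows (sizes are arbitrary)."""
    values = df["feature"].to_numpy()
    labels = df["label"].to_numpy()
    X, y = [], []
    for start in range(0, len(df) - window + 1, step):
        X.append(values[start:start + window])
        # a window counts as a fault example if any sample in it is labelled 1
        y.append(int(labels[start:start + window].max()))
    return np.array(X), np.array(y)

# hypothetical usage: one CSV per machine with time, feature, label columns
df = pd.read_csv("machine_01.csv")
X, y = make_windows(df)
Each window then becomes one training example; since faults are rare, class weights or oversampling of the fault windows is a common next step before fitting a classifier.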

Update a column w.r.t. values in other columns via regex match in dataframes

I have a data frame with more than 1,000,000 rows and 15 columns.
I have to create new columns and assign values to them based on the string values in the other columns, matching them either with a regex or with an exact character match.
For example, there is a column called File path. I have to create a feature column whose values are assigned by taking a folder path (full or partial) and matching it against the file path.
I thought about iterating with a for loop, but that takes a lot of time, and with pandas I think iterating would consume even more time if the number of looping components increases in the future.
Is there an efficient way to do this type of operation in pandas?
Please help me with this.
Example:
I have a df as:
| ID | File |
| -------- | -------------- |
| 1 | SWE_Toot |
| 2 | SWE_Thun |
| 3 | IDH_Toet |
| 4 | SDF_Then |
| 5 | SWE_Toot |
| 6 | SWE_Thun |
| 7 | SEH_Toot |
| 8 | SFD_Thun |
I will get the components in other tables, like this first one:
| ID | File |
| -------- | -------------- |
| Software | */SWE_Toot/*.h |
| |*/IDH_Toet/*.c |
| |*/SFD_Toto/*.c |
and a second one:
| ID | File |
| -------- | -------------- |
| Wire | */SDF_Then/*.h |
| |*/SFD_Thun/*.c |
| |*/SFD_Toto/*.c |
etc.; in total around 1,000,000 files and 278 components are received.
I want the result as:
| ID | File |Component|
| -------- | -------------- |---------|
| 1 | SWE_Toot |Software |
| 2 | SWE_Thun |Other |
| 3 | IDH_Toet |Software |
| 4 | SDF_Then |Wire |
| 5 | SWE_Toto |Various |
| 6 | SWE_Thun |Other |
| 7 | SEH_Toto |Various |
| 8 | SFD_Thun |Wire |
Other - filled in last, once all the fields and regexes have been checked and the file does not belong to any component.
Various - the file may belong to more than one component (or we can store the list of components it belongs to).
I was able to read the component tables and create the regexes, but to create the component column I would have to write for loops over all 278 components and loop over the same table for each component.
Is there an easier way to do this with pandas, given that the data will be very large?
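No answer is shown here, but a hedged sketch of one vectorised option: instead of iterating over the million rows, loop once over the 278 components and let pandas' str.contains test each component's combined regex against the whole File column (the pattern dictionary below is illustrative only, built from the substrings in the example tables):
import pandas as pd

# df has the ID and File columns from the first table
component_patterns = {
    "Software": "SWE_Toot|IDH_Toet|SFD_Toto",   # one combined regex per component
    "Wire": "SDF_Then|SFD_Thun|SFD_Toto",
}

# one boolean column per component: does the file match that component's regex?
matches = pd.DataFrame(
    {name: df["File"].str.contains(pat, regex=True)
     for name, pat in component_patterns.items()}
)

hits = matches.sum(axis=1)
df["Component"] = "Other"                                # matched no component
df.loc[hits == 1, "Component"] = matches.idxmax(axis=1)  # exactly one match
df.loc[hits > 1, "Component"] = "Various"                # matched several components
This still loops 278 times, but each pass is a vectorised scan of the column rather than a Python-level loop over a million rows, which is usually fast enough.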

How to detect statements made about entities using NLP

I want to research using NLP to detect negative/non-constructive comments, i.e. those that frequently arise when discussing politics online. I am curious to know, given a sentence like this:
You're a liberal dweeb. Clinton is ruining the US with her inappropriate behavior as president.
whether it's possible not only to deduce the entities (you, Clinton) using NER but also to get a tree of the statements made about each entity:
you
├── liberal
└── dweeb

Clinton
├── ruining US
└── has inappropriate behavior as pres.
Is something like this possible with NLP?
A constituency parser or dependency parser, possibly plus some kind of semantic analysis to give you more info about named and non-named entities, may be what you're looking for. Try pasting some example sentences into http://corenlp.run/ or into http://demo.ark.cs.cmu.edu/parse, which applies a dependency parse and a semantic parse, to see if it's the type of thing you're looking for.
Yes, what you are looking for is certainly possible with NLP. There are two approaches you should research further.
1) The faster approach in terms of coding, but requiring investment in labelling and annotating training data, is to use the "relation extractor" feature of an NLP framework like Stanford NLP, spaCy, etc. However, you will have to do some customisation and training on top of the default models. There is a sample blog article about doing this with NLTK, but you should look for a more recent article if you pursue this route.
2) The slower approach in terms of coding, but not requiring data labelling and annotation, is the one mentioned by @Gabriel: run a dependency parser and an NER pipeline on your sentences and then use a set of manual rules in your code to extract the relations.
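As a small, hedged illustration of the second approach (not code from either answer; it uses spaCy, which the answer only names, and prints raw NER and dependency output rather than the tidy tree drawn in the question):
import spacy

nlp = spacy.load("en_core_web_sm")   # small English model, installed separately
doc = nlp("You're a liberal dweeb. Clinton is ruining the US "
          "with her inappropriate behavior as president.")

# named entities found by the NER component
for ent in doc.ents:
    print(ent.text, ent.label_)

# dependency arcs: what is attached to what
for token in doc:
    print(f"{token.text:<15} {token.dep_:<10} head = {token.head.text}")
Turning these arcs into per-entity statement trees still needs hand-written rules (e.g. collecting the modifiers and predicates attached to each entity's token), as the answer points out.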

Simple moving average for random related time values

I'm a beginner programmer looking for help with the Simple Moving Average (SMA). I'm working with column files, where the first column is the time and the second is the value. The time intervals are random, and so are the values. Usually the files are not big, but the process collects data for a long time. In the end the files look similar to this:
+-----------+-------+
| Time | Value |
+-----------+-------+
| 10 | 3 |
| 1345 | 50 |
| 1390 | 4 |
| 2902 | 10 |
| 34057 | 13 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
After the whole process, the number of rows is around 60k-100k.
Then I try to "smooth" the data over some time window. For this purpose I'm using an SMA. [AWK_method]
awk -v size="$timewindow" '{mod=NR%size; if(NR<=size){count++}else{sum-=array[mod]};sum+=$1;array[mod]=$1;print sum/count}' file.dat
To make the SMA work properly with a predefined $timewindow, I create a linearly incrementing time column and fill the missing time points with zeros. Next, I run the script with different $timewindow values and observe the results.
+-----------+-------+
| Time | Value |
+-----------+-------+
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| (...) | |
| 10 | 3 |
| 11 | 0 |
| 12 | 0 |
| (...) | |
| 1343 | 0 |
| (...) | |
| 898975456 | 10 |
+-----------+-------+
For small data this was relatively comfortable, but now it is quite time-devouring, and the created files are starting to get too big. I'm also familiar with Gnuplot, but computing an SMA there is hell...
So here are my questions:
Is it possible to change the awk solution to avoid filling the data with zeros?
Do you recommend any other solution using bash?
I have also considered learning Python, because after 6 months of learning bash I have come to know its limitations. Will I be able to solve this in Python without creating big files?
I'll be glad for any form of help or advice.
Best regards!
[AWK_method] http://www.commandlinefu.com/commands/view/2319/awk-perform-a-rolling-average-on-a-column-of-data
You included a python tag, check out traces:
http://traces.readthedocs.io/en/latest/
Here are some other insights:
Moving average for time series with unequal intervals
http://www.eckner.com/research.html
https://stats.stackexchange.com/questions/28528/moving-average-of-irregular-time-series-data-using-r
https://en.wikipedia.org/wiki/Unevenly_spaced_time_series
Key phrase for more research, from Wikipedia:
In statistics, signal processing, and econometrics, an unevenly (or unequally or irregularly) spaced time series is a sequence of observation time and value pairs (tn, Xn) with strictly increasing observation times. As opposed to equally spaced time series, the spacing of observation times is not constant.
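If you do switch to Python, a hedged alternative to the zero-filling (not taken from the links above) is pandas' time-based rolling window, which averages over a real time span rather than a fixed number of rows, so the irregular gaps never need to be padded (the 1000-second window and the two-column whitespace file format are assumptions):
import pandas as pd

# assumed input: two whitespace-separated columns, time in seconds and value, no header
df = pd.read_csv("file.dat", sep=r"\s+", names=["Time", "Value"])

# interpret the first column as elapsed seconds and use it as the index
df.index = pd.to_timedelta(df["Time"], unit="s")

# trailing moving average over a real time window (here 1000 s), no zero padding
df["SMA"] = df["Value"].rolling("1000s").mean()
print(df)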
awk '{Q=$2-last;if(Q>0){while(Q>1){print "| "++i" | 0 |";Q--};print;last=$2;next};last=$2;print}' Input_file

Improving MySQL read time, MySQLdb

I have a table with more than a million records with the following structure:
mysql> SELECT * FROM Measurement;
+----------------+---------+-----------------+------+------+
| Time_stamp | Channel | SSID | CQI | SNR |
+----------------+---------+-----------------+------+------+
| 03_14_14_30_14 | 7 | open | 40 | -70 |
| 03_14_14_30_14 | 7 | roam | 31 | -79 |
| 03_14_14_30_14 | 8 | open2 | 28 | -82 |
| 03_14_14_30_15 | 8 | roam2 | 29 | -81 |....
I am reading data from this table into Python for plotting. The problem is that the MySQL reads are too slow, and it is taking me hours to get the plots even after using MySQLdb.cursors.SSCursor (as suggested by a few in this forum) to speed up the task.
import MySQLdb as mdb
import MySQLdb.cursors

con = mdb.connect('localhost', 'testuser', 'conti', 'My_Freqs',
                  cursorclass=MySQLdb.cursors.SSCursor)
cursor = con.cursor()
cursor.execute("SELECT Time_stamp FROM Measurement")
for row in cursor:
    # ... do processing ...
Will normalizing the table help speed up the task? If so, how should I normalize it?
P.S.: Here is the result of EXPLAIN:
+------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------+--------------+------+-----+---------+-------+
| Time_stamp | varchar(128) | YES | | NULL | |
| Channel | int(11) | YES | | NULL | |
| SSID | varchar(128) | YES | | NULL | |
| CQI | int(11) | YES | | NULL | |
| SNR | float | YES | | NULL | |
+------------+--------------+------+-----+---------+-------+
The problem is probably that you are looping over the cursor instead of just dumping out all the data at once and then processing it. You should be able to dump out a couple million rows in a few seconds. Try something like:
cursor.execute("select Time_stamp FROM Measurement")
data = cusror.fetchall()
for row in data:
#do some stuff...
Well, since you're saying the whole table has to be read, I guess you can't do much about it. It has more than 1 million records... you're not going to optimize much on the database side.
How much time does it take you to process just one record? Maybe you could try optimizing that part. But even if you got down to 1 millisecond per record, it would still take you about half an hour to process the full table. You're dealing with a lot of data.
Maybe run multiple plotting jobs in parallel? With the same numbers as above, dividing your data into 6 equal-sized jobs would (theoretically) give you the plots in 5 minutes.
Do your plots have to be fine-grained? You could look for ways to ignore certain values in the data, and generate a complete plot only when the user needs it (wild speculation here, I really have no idea what your plots look like)
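A rough sketch of the parallel idea, under loud assumptions (the worker count, the LIMIT/OFFSET split, and the row total are all illustrative; in practice you would split on an indexed column instead of OFFSET):
import multiprocessing as mp
import MySQLdb as mdb

N_JOBS = 6
TOTAL_ROWS = 1_000_000   # or run SELECT COUNT(*) first

def process_chunk(job_id):
    # each worker opens its own connection and reads only its slice of rows
    con = mdb.connect('localhost', 'testuser', 'conti', 'My_Freqs')
    cur = con.cursor()
    chunk = TOTAL_ROWS // N_JOBS
    # note: OFFSET on a big unindexed table is slow and the row order is not
    # guaranteed; splitting on an indexed Time_stamp range would be more robust
    cur.execute("SELECT Time_stamp FROM Measurement LIMIT %s OFFSET %s",
                (chunk, job_id * chunk))
    rows = cur.fetchall()
    con.close()
    # ... do the processing / plotting for this slice ...
    return len(rows)

if __name__ == "__main__":
    with mp.Pool(N_JOBS) as pool:
        print(pool.map(process_chunk, range(N_JOBS)))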
