PCA on complex-valued data - python

What's the correct way to undertake PCA on complex-valued data?
I see that this solution exists.
Are there any python packages that have implemented a complex-PCA?
So far I have just broken my data into real and imaginary parts and performed PCA as if they were real. For example:
| sw   | fw   | mw   |
| 4+4i | 3+2i | 1-1i |
would become:
| swreal | swimag | fwreal | fwimag | mwreal | mwimag |
| 4      | 4      | 3      | 2      | 1      | -1     |
My PCA ends up looking like this:
I want to pursue a complex PCA, but I'm not even sure how I would represent the result. If it were a 2D plot, is the only option something similar to the above, and if so, would it look any different?
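For what it's worth, a complex PCA can be done directly with numpy by diagonalizing the Hermitian covariance matrix. A minimal sketch, assuming a small hypothetical complex data matrix X (rows are samples, columns are the sw/fw/mw features; only the first row comes from the example above):
import numpy as np

# Hypothetical complex data matrix: rows are samples, columns are features.
X = np.array([[4 + 4j, 3 + 2j, 1 - 1j],
              [2 - 1j, 0 + 5j, 2 + 2j],
              [1 + 0j, 1 - 3j, 4 + 1j]])

Xc = X - X.mean(axis=0)                  # centre each feature
cov = Xc.conj().T @ Xc / (len(X) - 1)    # Hermitian covariance matrix
evals, evecs = np.linalg.eigh(cov)       # real eigenvalues, complex eigenvectors

order = np.argsort(evals)[::-1]          # sort by explained variance (eigh is ascending)
scores = Xc @ evecs[:, order]            # complex principal component scores
The scores are themselves complex, so one common way to plot them is the real part of the first component against its imaginary part, or the magnitudes of the first two components; in general this will not coincide with the split-into-real-and-imaginary version, because each complex component mixes the real and imaginary parts of all features.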


How to explore features of strings in Python

I have a table with relative abundances of strings collected elsewhere. I also have a list of features that are associated with the strings. What is the best way to check each string for each feature and sum the relative abundances?
Example Input Table:
+───────────+───────────+
| String    | Abundance |
+───────────+───────────+
| abcdef    | 12        |
| cdefgh    | 15        |
| fghijk    | 36        |
| jklmnoabc | 37        |
+───────────+───────────+
Example String Features:
cdef, abc, jk
Example Output
+─────────+───────────────+
| Feature | Abundance (%) |
+─────────+───────────────+
| cdef    | 27            |
| abc     | 59            |
| jk      | 73            |
+─────────+───────────────+
Any help would be greatly appreciated!
The answer is to go through the list of string features for each string you have and use Python's in operator.
This will check whether your feature occurs in the string you apply it to.
You then want to accumulate abundance and associate it to your feature.
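A minimal sketch of that approach, using plain lists and a dict as stand-ins for your table (the variable names are made up for illustration):
rows = [("abcdef", 12), ("cdefgh", 15), ("fghijk", 36), ("jklmnoabc", 37)]
features = ["cdef", "abc", "jk"]

abundance = {}
for feature in features:
    # Sum the abundance of every string that contains this feature.
    abundance[feature] = sum(value for string, value in rows if feature in string)

print(abundance)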
Did you try a loop with a regex? Either way, you have to go through both lists (inputs and features); I don't think there is a particular algorithm that would significantly accelerate the process.
Here is what I'm thinking of:
import re

for feature in features:
    p = re.compile(re.escape(feature.string))  # escape in case a feature contains regex metacharacters
    feature.abundance = 0
    for input in inputs:
        m = p.search(input.string)  # search, not match: the feature may appear anywhere in the string
        if m:  # if not None
            feature.abundance += input.abundance
With that, you will have all your stuff in your features list.

Do I need to split the data for isolation forest?

I have a database that consists of 10049972 rows x 19 columns. I was using Isolation Forest to detect outliers: I created an extra column that flags outliers as -1, dropped all rows flagged as -1, and then removed the column.
My question is: Do I need to do train, test and validate for isolation forest to work? Also can someone please confirm if my code is valid?
Here is my code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.ensemble import IsolationForest
df = pd.read_csv('D:\\Project\\database\\4-Final\\Final After.csv',low_memory=True)
iForest = IsolationForest(n_estimators=100, contamination=0.1 , random_state=42, max_samples=200)
iForest.fit(df.values.reshape(-1,1))
pred = iForest.predict(df.values.reshape(-1,1))
pred=df['anomaly']
df=df.drop(df['anomaly'==-1],inplace=True)
df.to_csv('D:\\Project\\database\\4-Final\\IF TEST.csv', index=False)
Thank you.
My question is do I need to do test train and validate for isolation forest to work?
You want to detect outliers in just this batch file, right? In that case your solution may be OK, but in most cases you must split.
Please try to understand when you would need to do the split.
To explain this, let's enter into a real case scenario.
Let's suppose you are trying to predict the anomalous behaviour of different engines. You create a model using the data available in your database until "today", and start predicting incoming data. It may be possible that the predicted data is not equal to the data used to train, right? Then, how can you simulate this situation when you are configuring your model? Using train-test-validate and evaluating with the right metrics.
Edit: Let me add an example. I'll try to make it super simple.
If your engine database data is:
+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
| 1  | 0           | 0.25         |
| 2  | 0           | 0.40         |
| 3  | 1           | 0.16         |
| 4  | 1           | 0.30         |
| 5  | 0           | 5.3          | <- anomaly
| 6  | 1           | 14.4         | <- anomaly
| 7  | 0           | 16.30        | <- anomaly
+----+-------------+--------------+
and you use it all to train the model, then the model will learn from the three anomalous values too, right? The algorithm will build the forest using these 3 anomalous values, so it may be easier for the model to predict exactly them.
Now, what would happen with this production data:
+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
| 8  | 1           | 3.25         | <- anomaly
| 9  | 1           | 4.40         | <- anomaly
| 10 | 0           | 2.16         |
+----+-------------+--------------+
You pass it to your model, and it says the points are not anomalous but normal data, because it has learned that your "threshold" is roughly values bigger than 5.
This "threshold" is a product of the algorithm's hyperparameters; with a different configuration the model might have predicted these values as anomalous, but you are not testing how well the model generalizes.
So how can you improve this configuration? By splitting the data that you have available at that moment. Instead of training with all the database data, you could have trained with only a part of it and used the other part to test. For example, use this part as training data:
+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
| 1  | 0           | 0.25         |
| 2  | 0           | 0.40         |
| 3  | 1           | 0.16         |
| 4  | 1           | 0.30         |
| 7  | 0           | 16.30        | <- anomaly
+----+-------------+--------------+
And this as test data:
+----+-------------+--------------+
| id | engine_type | engine_value |
+----+-------------+--------------+
| 5  | 0           | 5.3          | <- anomaly
| 6  | 1           | 14.4         | <- anomaly
+----+-------------+--------------+
Then look for a combination of hyperparameters that makes the algorithm predict the test data correctly. Does this ensure that future predictions will be perfect? No, it does not, but it is not the same as just fitting the data without evaluating how well the model generalizes.
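A minimal sketch of such a split with scikit-learn (assuming, as in your code, that df holds only numeric feature columns):
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest

# Hold part of the data back so you can check how the model generalizes.
train, test = train_test_split(df, test_size=0.3, random_state=42)

iForest = IsolationForest(n_estimators=100, contamination=0.1,
                          random_state=42, max_samples=200)
iForest.fit(train)

# -1 marks predicted anomalies, 1 marks normal points.
test_pred = iForest.predict(test)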
Also can someone please confirm if my code is valid?
Yes, but let me add a recommendation, changing this:
iForest.fit(df.values.reshape(-1,1))
pred = iForest.predict(df.values.reshape(-1,1))
pred=df['anomaly']
To this:
df['anomaly'] = iForest.fit_predict(df.values.reshape(-1,1))
Also, if you are using the new pandas version, use:
df['anomaly'] = iForest.fit_predict(df.to_numpy().reshape(-1,1))
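One more note, on the last part of your code: df=df.drop(df['anomaly'==-1],inplace=True) will leave df set to None, because drop(..., inplace=True) returns None, and 'anomaly'==-1 is just the boolean False rather than a row filter. A boolean mask is the usual way to keep only the non-anomalous rows, for example:
# Keep only the rows not flagged as anomalies, then drop the helper column.
df = df[df['anomaly'] != -1].drop(columns=['anomaly'])
df.to_csv('D:\\Project\\database\\4-Final\\IF TEST.csv', index=False)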

Is there any way to specifically optimize a single output from a neural network in tensorflow?

For example, if I had a neural network that was playing draughts/checkers and attempted to make an invalid move, is there a way to specifically optimize that particular output?
  -----------------------------------------
8 |    | bM |    | bM |    | bM |    | bM |
  -----------------------------------------
7 | bM |    | bM |    | bM |    | bM |    |
  -----------------------------------------
6 |    | bM |    | bM |    | bM |    | bM |
  -----------------------------------------
5 |    |    |    |    |    |    |    |    |
  -----------------------------------------
4 |    |    |    |    |    |    |    |    |
  -----------------------------------------
3 | wM |    | wM |    | wM |    | wM |    |
  -----------------------------------------
2 |    | wM |    | wM |    | wM |    | wM |
  -----------------------------------------
1 | wM |    | wM |    | wM |    | wM |    |
  -----------------------------------------
    A    B    C    D    E    F    G    H
Suppose the board looks like the one above, and there is an output neuron for every possible move of a draughts piece (up to a movement of 2 squares in any direction), so 64 * 8 output neurons. Now suppose the highest-probability output is neuron 8 (or any other invalid output), which would correspond to something like B1C2 (B1 being the starting position and C2 the ending position).
Is there a way, if the output of the neural network is already a probability distribution, to update the network so that this particular output is 0 and all the other outputs are updated and normalized?
I've tried looking at examples of neural nets that train on the MNIST data set with AdamOptimizer, but couldn't find anything that only changes one particular output rather than the whole output layer.
Thanks for any help!
For this specific example, you're better off restructuring your network to only include moves that could potentially be valid. B1C2 will never be a valid move, so don't let that be a part of your network.
For moves that could potentially be valid but aren't actually valid, such as B2C3 (not valid for the first turn but valid after moving the piece currently on C3), you can write a custom activation function, but it will probably be easier to just adjust the output.
You can write a function that sets each invalid move to zero and then divides all the other outputs by (1 - sum of invalid move predictions). Note that this assumes you are already using softmax as your last activation function.
Edit based on follow up question below:
You can write one function that takes the board state and predictions as input and returns the predictions with invalid moves set to zero and the rest of the predictions normalized.
If instead of modifying the end result you rather have the network learn which moves are invalid, that can be handled by your loss function. For instance, if you are doing deep Q learning then you would add a heavy penalty to the score for invalid moves.
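A minimal numpy sketch of that post-processing step (the boolean valid-move mask is assumed to come from your own game logic):
import numpy as np

def mask_invalid_moves(predictions, valid_mask):
    # predictions: softmax output of the network, shape (n_moves,)
    # valid_mask:  boolean array, True where the move is legal in the current board state
    masked = np.where(valid_mask, predictions, 0.0)
    total = masked.sum()          # equals 1 - sum of invalid-move predictions
    if total == 0:                # edge case: all probability mass was on invalid moves
        return masked
    return masked / total         # renormalized distribution over valid moves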

Simple moving average for random related time values

I'm a beginner programmer looking for help with a Simple Moving Average (SMA). I'm working with two-column files, where the first column is time and the second is a value. Both the time intervals and the values are irregular. Usually the files are not big, but the process collects data for a long time. At the end the files look similar to this:
+-----------+-------+
| Time      | Value |
+-----------+-------+
| 10        | 3     |
| 1345      | 50    |
| 1390      | 4     |
| 2902      | 10    |
| 34057     | 13    |
| (...)     |       |
| 898975456 | 10    |
+-----------+-------+
After the whole process the number of rows is around 60k-100k.
Then I'm trying to "smooth" the data with some time window. For this purpose I'm using an SMA. [AWK_method]
awk 'BEGIN{size=$timewindow} {mod=NR%size; if(NR<=size){count++}else{sum-=array[mod]};sum+=$1;array[mod]=$1;print sum/count}' file.dat
To achieve proper working of the SMA with a predefined $timewindow, I pad the data to a linearly incrementing time column, filling the missing values with zeros. Next, I run the script with different $timewindow values and observe the results.
+-----------+-------+
| Time      | Value |
+-----------+-------+
| 1         | 0     |
| 2         | 0     |
| 3         | 0     |
| (...)     |       |
| 10        | 3     |
| 11        | 0     |
| 12        | 0     |
| (...)     |       |
| 1343      | 0     |
| (...)     |       |
| 898975456 | 10    |
+-----------+-------+
For small data it was relatively comfortable, but now it is quite time-consuming, and the created files are starting to be too big. I'm also familiar with Gnuplot, but doing an SMA there is hell...
So here are my questions:
Is it possible to change the awk solution to bypass filling the data with zeros?
Do you recommend any other solution using bash?
I have also considered learning python, because after 6 months of learning bash I have come to know its limitations. Will I be able to solve this in python without creating big data?
I'll be glad of any form of help or advice.
Best regards!
[AWK_method] http://www.commandlinefu.com/commands/view/2319/awk-perform-a-rolling-average-on-a-column-of-data
You included a python tag; check out traces:
http://traces.readthedocs.io/en/latest/
Here are some other insights:
Moving average for time series with unequal intervals
http://www.eckner.com/research.html
https://stats.stackexchange.com/questions/28528/moving-average-of-irregular-time-series-data-using-r
https://en.wikipedia.org/wiki/Unevenly_spaced_time_series
The key phrase for more research is "unevenly spaced time series":
In statistics, signal processing, and econometrics, an unevenly (or unequally or irregularly) spaced time series is a sequence of observation time and value pairs (tn, Xn) with strictly increasing observation times. As opposed to equally spaced time series, the spacing of observation times is not constant.
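Since you mention python: pandas can compute a time-based rolling mean directly on irregular timestamps, so the zero-filling step disappears entirely. A minimal sketch, assuming your timestamps can be treated as seconds and a hypothetical 1000-unit window:
import pandas as pd

df = pd.read_csv('file.dat', sep=r'\s+', names=['time', 'value'])

# Treat the integer timestamps as seconds so pandas can use a time-based window.
df.index = pd.to_timedelta(df['time'], unit='s')

# Rolling mean over a 1000-second window: only the samples that actually fall
# inside each window are averaged, so no zero-filled rows are needed.
df['sma'] = df['value'].rolling('1000s').mean()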
awk '{Q=$2-last;if(Q>0){while(Q>1){print "| "++i" | 0 |";Q--};print;last=$2;next};last=$2;print}' Input_file

SciPy Optimization algorithm

I need to solve an optimization task with Python.
The task is the following:
A factory produces desks, chairs, bureaus and cupboards. Two types of boards can be used to produce them. The factory has 1500 m of the first type and 1000 m of the second, and 800 employees. What should the factory produce, and in what quantities, to receive the maximum profit?
The input values are following:
| Products     | Desk | Chair | Bureau | Cupboard |
|--------------|------|-------|--------|----------|
| Board 1 type |    5 |     1 |      9 |       12 |
| Board 2 type |    2 |     3 |      4 |        1 |
| Employees    |    3 |     2 |      5 |       10 |
| Profit       |   12 |     5 |     15 |       10 |
Unfortunately I don't have any experience in solving optimization tasks, so I don't even know where to start. What I did:
I found the SciPy optimization package, which is supposed to solve this type of problem.
I have some idea about the input and output of my function: the input should be the amount of each type of product, and the output should be the profit. But the choice of resources (boards, employees) might also differ, and this affects the algorithm implementation.
Could you please give me at least any direction where to go? Thank you!
EDIT:
Basically @Balzola is right: it's the simplex algorithm. The task can be solved with scipy.optimize.linprog, which uses simplex under the hood.
A typical case for the simplex algorithm: https://en.wikipedia.org/wiki/Simplex_algorithm
Looks like scipy can do it:
https://docs.scipy.org/doc/scipy/reference/tutorial/optimize.html#nelder-mead-simplex-algorithm-method-nelder-mead
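For reference, a minimal scipy.optimize.linprog sketch of the problem above (linprog minimizes, so the profits are negated; production amounts are assumed to be non-negative and continuous):
from scipy.optimize import linprog

# Profit per desk, chair, bureau, cupboard, negated because linprog minimizes.
c = [-12, -5, -15, -10]

# Resource usage per unit of each product.
A_ub = [
    [5, 1, 9, 12],   # board type 1, at most 1500 m available
    [2, 3, 4, 1],    # board type 2, at most 1000 m available
    [3, 2, 5, 10],   # employees, at most 800 available
]
b_ub = [1500, 1000, 800]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4)
print(res.x)     # optimal quantity of each product
print(-res.fun)  # maximum profit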
