Given:
Ideal graph - Depicts the expected reading my machine should have.
Actual graph - Depicts the actual reading my machine had at that instance.
X-axis: Force (N) from the machine
Y-axis: Time (s)
Both graphs were created using the pyplot library in Python.
What I need to do:
I need to compare the graph in its three phases: initialization (machine starts applying force), constant phase (constant force), and end phase (machine stops applying force), and analyze how close each phase of the actual reading was to the ideal case (as a percentage). The analysis would allow me to conclude how the machine performed in those three phases for the actual reading taken. I would need to do this for each reading, taken every 50 s.
Hurdle:
Now, the two graphs were not created from the same number of data points: the ideal graph was created with 100 points, while the actual graph was created using 30,000+ points. So I am not able to compare the graphs point by point.
Idea:
Would it be wise to save the graph of the actual reading as a PNG and compare it with the image of the ideal-case graph?
Please give me some idea or solution to tackle this problem.
It's a bit late but I'm going to answer anyway:
I don't think resorting to a comparison of images is wise in this case, no.
What you probably want is to interpolate additional points between the 100 points on the 'Ideal graph' to match the 30,000+ points in the 'Actual graph'.
The example on 1-D Interpolation in the scipy.interpolate docs seems to be exactly what you need.
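Roughly along these lines, for illustration only (a minimal sketch with made-up arrays and assumed phase boundaries, since no real data was posted):

```python
import numpy as np
from scipy.interpolate import interp1d

# Made-up stand-ins for the real data: ~100 ideal points and 30,000
# measured points over one 50 s reading.
ideal_t = np.linspace(0, 50, 100)
ideal_f = np.concatenate([np.linspace(0, 10, 20),   # initialization ramp
                          np.full(60, 10.0),        # constant phase
                          np.linspace(10, 0, 20)])  # end phase
actual_t = np.linspace(0, 50, 30000)
actual_f = np.interp(actual_t, ideal_t, ideal_f) + np.random.normal(0, 0.2, actual_t.size)

# Resample the 100-point ideal curve onto the 30,000 actual timestamps.
ideal_on_actual = interp1d(ideal_t, ideal_f, kind='linear')(actual_t)

# Per-phase closeness as a rough percentage; the phase boundaries here
# (10 s and 40 s) are assumptions and would come from your own data.
phases = {'initialization': actual_t < 10,
          'constant': (actual_t >= 10) & (actual_t < 40),
          'end': actual_t >= 40}
for name, mask in phases.items():
    mean_abs_err = np.abs(actual_f[mask] - ideal_on_actual[mask]).mean()
    closeness = 100 * (1 - mean_abs_err / ideal_on_actual[mask].mean())
    print(f"{name}: {closeness:.1f}% of ideal")
```

Once both curves share the same time base, you can use whatever per-phase error metric suits you; the percentage above is only one possibility.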
If you need further assistance (such as working code), you will have to provide a Minimal, Complete, and Verifiable Example for us to work with.
I am working on an injection molding machine to analyze process data for some parameters to determine what parameters are related or if they are, which ones are important to determine a change in the condition of the machine.
I have plotted the correlation matrix heatmap for the parameters, from which I can see some positive and negative correlations between different parameters. I have attached a picture, Heat Map for selected parameters. The problem I am facing now is that some parameters may or may not be related theoretically, and some of the related data (from the figure) have completely different units.
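For reference, the heatmap was produced roughly along these lines (a simplified sketch; the column names are placeholders, not the actual machine parameters):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Placeholder process data; the real values come from the injection
# molding machine's log.
df = pd.DataFrame({
    "injection_pressure": [80, 82, 85, 90, 88],
    "melt_temperature":   [220, 225, 223, 230, 228],
    "cycle_time":         [30, 29, 31, 28, 29],
})

# The correlation coefficient is unitless, so parameters with different
# physical units still end up on the same [-1, 1] scale.
corr = df.corr()
sns.heatmap(corr, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
plt.show()
```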
I want to do further analysis of this data. Please suggest a path for what I should do next: should I proceed towards PCA, regression analysis, or something else?
PS: I wanted to share my data as well, but I don't know how or where to upload it.
Thank you in advance
A picture is worth a thousand words, so:
My input is the matrix on the left, and what I need to find is the sets of nodes that are at most one step away from each other (not diagonally). A node that is more than one up/down/left/right step away would be in a separate set.
So my plan was to run a BFS from every node I find, return the set it traversed, and remove that set from the original, iterating this process until I'm done. But then I had the wild idea of looking for graph analysis tools - and I found NetworkX. Is there an easy way (an algorithm?) to achieve this without manually writing a BFS and traversing the whole matrix?
Thanks
What you are trying to do is search for "connected components", and NetworkX itself has a method for doing exactly that, as can be seen in the first example on this documentation page and as others have already pointed out in the comments.
Reading your question, it seems that your nodes are on a discrete grid, and the notion of connectedness you describe is the same as the one used for the pixels of an image.
Connected-components algorithms are available both for graphs and for images. If performance is important in your case, I would suggest you go for the image version of connected components.
This comes from the fact that images (grids of pixels) are a specific class of graphs, so connected-components algorithms that deal with grids of nodes are built knowing the topology of the graph itself (i.e. the graph is planar and the maximum vertex degree is four). A general graph algorithm has to be able to work on arbitrary graphs (i.e. they may be non-planar, with multiple edges between some nodes), so it has to do more work because it can't assume much about the properties of the input graph.
Since connected components can be found in graphs in linear time, I am not saying the image version would be orders of magnitude faster; there will only be a constant factor between the two.
For this reason you should also take into account which data structure holds your input data and how much time will be spent creating the input structures required by each version of the algorithm.
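For comparison, here is a small sketch of both routes on a toy binary grid (assuming the input can be represented as a 2-D NumPy array of 0/1 values):

```python
import numpy as np
import networkx as nx
from scipy import ndimage

grid = np.array([[1, 1, 0, 0],
                 [0, 1, 0, 1],
                 [0, 0, 0, 1],
                 [1, 0, 0, 1]])

# Graph route: one node per filled cell, edges between 4-neighbours,
# then NetworkX's connected_components.
g = nx.Graph()
rows, cols = grid.shape
for r in range(rows):
    for c in range(cols):
        if grid[r, c]:
            g.add_node((r, c))
            for dr, dc in ((1, 0), (0, 1)):   # down and right neighbours
                nr, nc = r + dr, c + dc
                if nr < rows and nc < cols and grid[nr, nc]:
                    g.add_edge((r, c), (nr, nc))
print(list(nx.connected_components(g)))

# Image route: scipy.ndimage.label uses 4-connectivity by default and
# returns an array of component labels plus the number of components.
labels, n_components = ndimage.label(grid)
print(labels)
print(n_components)
```

Both give the same grouping; which one is faster in practice depends on how the matrix is already stored, as noted above.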
My idea is that, instead of using the DFT/FFT to recognise guitar chords or notes, I can attempt a different approach. Here is how I plan to do it (in Python):
Record myself playing a large range of notes and chords and store them as a large dataset.
After this, I would create another recording of myself playing a single note or chord which needs to be recognised.
In a similar manner to how I would compare two datasets using Spearman's rank correlation coefficient, I could compare the recording to each file in the dataset and see which note or chord is most similar (a rough sketch of this step is shown below).
For my situation, I aim for this calculation to occur as I play, so no preprocessing would be involved. To do this I would need to calibrate the background noise volume, so that I could distinguish each note/chord from the others.
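To illustrate the comparison step, it might look something like this (a sketch only; the file names, the dataset, and the use of raw samples rather than any spectral features are all assumptions):

```python
import numpy as np
from scipy.io import wavfile
from scipy.stats import spearmanr

def load_mono(path, length):
    """Load a WAV file, keep one channel, and trim/zero-pad it to a
    fixed length so the two signals being compared have equal size."""
    rate, data = wavfile.read(path)
    if data.ndim > 1:
        data = data[:, 0]
    data = data[:length]
    return np.pad(data, (0, length - data.size)).astype(float)

N = 44100  # compare one second of audio (assumed sample rate)
recording = load_mono("unknown_note.wav", N)        # hypothetical input
dataset = {"E2": "e2.wav", "A2": "a2.wav"}          # hypothetical dataset files

scores = {}
for name, path in dataset.items():
    rho, _ = spearmanr(recording, load_mono(path, N))
    scores[name] = rho

print(max(scores, key=scores.get))  # note with the highest rank correlation
```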
Diagram to explain the concept:
To help explain my idea, I have created a simple image which has a dataset of two notes.
Imgur link of the diagram
My question is how viable this would be, and if it is feasible, how would I conduct it in Python?
Thanks,
Aj.
I have a text file with about 8.5 million data points in the form:
Company 87178481
Company 893489
Company 2345788
[...]
I want to use Python to create a connection graph to see what the network between companies looks like. From the above sample, two companies would share an edge if the value in the second column is the same (clarification from/for Hooked).
I've been using the NetworkX package and have been able to generate a network for a few thousand points, but it's not making it through the full 8.5 million-node text file. I ran it and left for about 15 hours, and when I came back, the cursor in the shell was still blinking, but there was no output graph.
Is it safe to assume that it was still running? Is there a better/faster/easier approach to graph millions of points?
If you have 1000K points of data, you'll need some way of looking at the broad picture. Depending on what you are looking for exactly, if you can assign a "distance" between companies (say number of connections apart) you can visualize relationships (or clustering) via a Dendrogram.
Scipy does clustering:
http://docs.scipy.org/doc/scipy/reference/cluster.hierarchy.html#module-scipy.cluster.hierarchy
and has a function to turn them into dendrograms for visualization:
http://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.dendrogram.html#scipy.cluster.hierarchy.dendrogram
An example for a shortest path distance function via networkx:
http://networkx.lanl.gov/reference/generated/networkx.algorithms.shortest_paths.generic.shortest_path.html#networkx.algorithms.shortest_paths.generic.shortest_path
Ultimately you'll have to decide how you want to weight the distance between two companies (vertices) in your graph.
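A small sketch of that pipeline, using shortest-path hop count as the distance between companies (the graph and names below are made up):

```python
import numpy as np
import networkx as nx
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform
import matplotlib.pyplot as plt

# Toy company graph; in practice the edges come from the shared-value file.
g = nx.Graph([("A", "B"), ("B", "C"), ("C", "D"), ("E", "F")])
nodes = sorted(g.nodes())
n = len(nodes)

# Pairwise distance = shortest-path hop count; disconnected pairs get a
# large sentinel value so they end up in separate branches.
dist = np.full((n, n), 10.0)
np.fill_diagonal(dist, 0.0)
lengths = dict(nx.all_pairs_shortest_path_length(g))
for i, u in enumerate(nodes):
    for j, v in enumerate(nodes):
        if v in lengths[u]:
            dist[i, j] = lengths[u][v]

# Hierarchical clustering on the condensed distance matrix, then plot.
Z = linkage(squareform(dist), method="average")
dendrogram(Z, labels=nodes)
plt.show()
```

For 8.5 million rows you would not build a dense distance matrix like this; the sketch only shows how the networkx and scipy pieces fit together.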
You have too many data points, and if you did visualize the network it wouldn't make any sense. You need ways to 1) reduce the number of companies by removing those that are less important/less connected, and 2) summarize the graph somehow and then visualize it.
To reduce the size of the data, it might be better to create the network independently (using your own code to create an edge list of companies). This way you can reduce the size of your graph (by removing singletons, for example, which may be many).
For summarization I recommend running a clustering or a community detection algorithm. This can be done very fast even for very large networks. Use the "fastgreedy" method in the igraph package: http://igraph.sourceforge.net/doc/R/fastgreedy.community.html
(There is a faster algorithm available online as well, by Blondel et al.: http://perso.uclouvain.be/vincent.blondel/publications/08BG.pdf - I know their code is available online somewhere.)
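In Python, a rough sketch of that community-detection step (assuming python-igraph is installed and the edge list has already been built) might look like:

```python
import igraph as ig

# Toy edge list of company indices; in practice this comes from your own
# edge-list construction step described above.
edges = [(0, 1), (1, 2), (2, 0), (3, 4), (4, 5)]
g = ig.Graph(edges=edges)

# Fast greedy modularity optimisation; returns a dendrogram that can be
# cut into communities.
dendrogram = g.community_fastgreedy()
clusters = dendrogram.as_clustering()
print(clusters.membership)  # community label for each vertex
```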
I have some points that I need to classify. Given the collection of these points, I need to say which other (known) distribution they match best. For example, given the points in the top left distribution, my algorithm would have to say whether they are a better match to the 2nd, 3rd, or 4th distribution. (Here the bottom-left would be correct due to the similar orientations)
I have some background in Machine Learning, but I am no expert. I was thinking of using Gaussian Mixture Models, or perhaps Hidden Markov Models (as I have previously classified signatures with these - a similar problem).
I would appreciate any help as to which approach to use for this problem. As background information, I am working with OpenCV and Python, so I would most likely not have to implement the chosen algorithm from scratch, I just want a pointer to know which algorithms would be applicable to this problem.
Disclaimer: I originally wanted to post this on the Mathematics section of StackExchange, but I lacked the necessary reputation to post images. I felt that my point could not be made clear without showing some images, so I posted it here instead. I believe that it is still relevant to Computer Vision and Machine Learning, as it will eventually be used for object identification.
EDIT:
I read and considered some of the answers given below, and would now like to add some new information. My main reason for not wanting to model these distributions as a single Gaussian is that eventually I will also have to be able to discriminate between distributions. That is, there might be two different and separate distributions representing two different objects, and then my algorithm should be aware that only one of the two distributions represents the object that we are interested in.
I think this depends on where exactly the data comes from and what sort of assumptions you would like to make as to its distribution. The points above can easily be drawn even from a single Gaussian distribution, in which case the estimation of parameters for each one and then the selection of the closest match are pretty simple.
Alternatively you could go for the discriminative option, i.e. calculate whatever statistics you think may be helpful in determining the class a set of points belongs to and perform classification using SVM or something similar. This can be viewed as embedding these samples (sets of 2d points) in a higher-dimensional space to get a single vector.
Also, if the data is actually as simple as in this example, you could just do principal component analysis and match by the first eigenvector.
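As a rough illustration of the PCA route (a sketch, not the poster's code; the point sets are assumed to be N x 2 NumPy arrays):

```python
import numpy as np

def first_principal_axis(points):
    # Unit eigenvector of the covariance matrix with the largest
    # eigenvalue, i.e. the dominant orientation of the point cloud.
    centred = points - points.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(centred, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

def orientation_similarity(query, candidate):
    # Absolute cosine between the dominant axes (the sign of an
    # eigenvector is arbitrary, so take the absolute value).
    return abs(np.dot(first_principal_axis(query), first_principal_axis(candidate)))

# Example: an elongated query cloud matches the candidate with the
# same orientation more closely than the rotated one.
rng = np.random.default_rng(0)
query  = rng.normal(size=(200, 2)) @ np.diag([3.0, 1.0])
cand_a = rng.normal(size=(200, 2)) @ np.diag([3.0, 1.0])   # same orientation
cand_b = rng.normal(size=(200, 2)) @ np.diag([1.0, 3.0])   # rotated 90 degrees
print(orientation_similarity(query, cand_a), orientation_similarity(query, cand_b))
```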
You should just fit the distributions to the data, determine the chi^2 deviation for each one, and look at an F-test. See for instance these notes on model fitting.
You might also want to consider non-parametric techniques (e.g. multivariate kernel density estimation on each of your new data sets) in order to compare the statistics or distances of the estimated distributions. In Python, scipy.stats.gaussian_kde (in the scipy.stats.kde module) is such an implementation in SciPy.
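For instance, a comparison based on kernel density estimates might be sketched like this (the point sets are assumed to be N x 2 NumPy arrays; names are placeholders):

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_log_likelihood(reference_points, query_points):
    # Fit a 2-D kernel density estimate to the reference set and return
    # the mean log-density of the query set under it; gaussian_kde
    # expects data with shape (dims, n_samples), hence the transposes.
    kde = gaussian_kde(reference_points.T)
    return np.log(kde(query_points.T)).mean()

# The candidate distribution giving the highest mean log-density for the
# new points would be the closest match under this criterion.
rng = np.random.default_rng(1)
reference = rng.normal(size=(500, 2))
query = rng.normal(size=(100, 2))
print(kde_log_likelihood(reference, query))
```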