Organizing records into classes - Python

I'm planning to develop a genetic algorithm for a series of acceleration records, searching for the optimum match with a target.
At this point my data is array-like, with a unique ID in the first column, X, Y, Z component info in the second, time in the third, etc.
In other words, each record has several "attributes". Do you think it would be beneficial to create a (records) class, given that I will want to do a semi-complicated process with it as a next step?
Thanks

I would say yes. Basically I want to:
1) Take the unique set of data.
2) Filter it so that just a subset is considered (a filter parameter can be the time of recording, for example).
3) Use a genetic algorithm on the filtered data to match a target on average.
Step 3 is irrelevant to the post; I just wanted to give the big picture to make my question clearer.
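For illustration, here is a minimal sketch of what such a class might look like (the attribute names, the use of a dataclass and the filter on recording time are assumptions, not taken from your data):

from dataclasses import dataclass
import numpy as np

@dataclass
class AccelerationRecord:
    # Hypothetical attribute names, for illustration only.
    record_id: str
    component: str            # 'X', 'Y' or 'Z'
    recording_time: float     # e.g. time of recording, usable as a filter parameter
    time: np.ndarray          # time axis of the record
    acceleration: np.ndarray  # acceleration samples

def filter_records(records, t_min, t_max):
    # Step 2: keep only the records whose recording time falls in [t_min, t_max];
    # the filtered subset is what the genetic algorithm (step 3) would then work on.
    return [r for r in records if t_min <= r.recording_time <= t_max]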


Within-Subject Permutation Testing Python

I'm trying to do permutation testing on a within-subject design, and with stats not being my strong point, after lots of desperate googling I'm still confused.
I have data from 36 subjects, and each subject's data is processed by 6 different methods. I have a metric (say SNR) for how well each method performs (essentially a 36x6 matrix).
The data violates the conditions for parametric testing (not normal, and not homogeneous variance between groups), and rather than using non-parametric testing, we want to use permutation testing.
I want to see if my approach makes sense...
Initially:
Perform an rmANOVA on the data, save the F-value as F-actual.
Shuffle the data between the columns (methods) randomly, but with the constraint that each value must stay in the row associated with its original subject (any tips on how to perform this are appreciated; see the sketch after this list).
After each shuffle (permutation), recompute the F-value and save to an array of possible F-values.
Check what proportion of the permuted F-values in that array is at least as extreme as F-actual (this gives the permutation p-value).
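One possible way to implement the row-constrained shuffle and the permutation loop, sketched under the assumptions that the data sit in a 36x6 NumPy array and that pingouin's rm_anova is used to compute the F-value (NaN handling is left out here):

import numpy as np
import pandas as pd
import pingouin as pg

def rm_anova_f(data):
    # Build the long-format DataFrame expected by pingouin's rm_anova.
    n_subjects, n_methods = data.shape
    df = pd.DataFrame({
        "subject": np.repeat(np.arange(n_subjects), n_methods),
        "method": np.tile([f"m{k}" for k in range(n_methods)], n_subjects),
        "value": data.ravel(),
    })
    return pg.rm_anova(data=df, dv="value", within="method", subject="subject")["F"].iloc[0]

rng = np.random.default_rng(0)
data = rng.normal(size=(36, 6))          # placeholder for the real 36x6 SNR matrix
f_actual = rm_anova_f(data)
# rng.permuted(..., axis=1) shuffles each subject's row independently,
# so values never leave their original subject.
f_perm = np.array([rm_anova_f(rng.permuted(data, axis=1)) for _ in range(1000)])
p_value = np.mean(f_perm >= f_actual)    # proportion of permuted F at least as extreme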
Post-Hoc Testing:
Perform pairwise t-tests on the data, save the associated T-statistic as T-actual for each pairing.
Shuffle the data between the columns (methods) randomly, but with the constraint that each value must stay in the row associated with its original subject (the same shuffling as above).
After each shuffle (permutation), recompute the T-stat and save to an array of possible T-values for each pairing.
After n permutations, check for each pairing what proportion of the permuted T-values is at least as extreme as the actual T-statistic.
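For the pairwise post-hoc step, a within-row shuffle between just two columns amounts to randomly swapping the two values for each subject (equivalently, flipping the sign of the paired difference). A minimal sketch using scipy's paired t-test, with subjects that lack a value for either method dropped from that pairing (cf. the NaN note further down):

from itertools import combinations
import numpy as np
from scipy.stats import ttest_rel

def pairwise_permutation_p(data, n_perm=5000, seed=0):
    rng = np.random.default_rng(seed)
    p_values = {}
    for a, b in combinations(range(data.shape[1]), 2):
        pair = data[:, [a, b]]
        pair = pair[~np.isnan(pair).any(axis=1)]         # drop subjects missing either method
        t_actual = ttest_rel(pair[:, 0], pair[:, 1]).statistic
        t_perm = np.empty(n_perm)
        for k in range(n_perm):
            swap = rng.integers(0, 2, size=len(pair)).astype(bool)
            permuted = pair.copy()
            permuted[swap] = permuted[swap][:, ::-1]      # swap the two methods for random subjects
            t_perm[k] = ttest_rel(permuted[:, 0], permuted[:, 1]).statistic
        p_values[(a, b)] = np.mean(np.abs(t_perm) >= np.abs(t_actual))
    return p_values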
I've currently been working in Python with pingouin, but I appreciate this may be easier to do in R, so I am open to migrating if that is the case. Any advice on whether this approach even makes sense, and how to perform it if so, is greatly appreciated!
Also just to note - the method needs to be capable of dealing with NaN/None values for certain methods and subjects (so for example subject 1 for method 1 may be blank, but there are relevant values for all other methods).
Thank you.

Difference between k-means clustering and creating groups by brute force approach

I have a task to find similar parts based on numeric dimensions (diameters, thickness) and categorical dimensions (material, heat treatment, etc.). I have a list of 1 million parts. My approach as a programmer is to put all parts on a list, pop off the first part and use it as a new "cluster", and compare the rest of the parts on the list against it based on the dimensions. When a part on the list matches the categorical dimensions exactly and the numerical dimensions within 5 percent, I add that part to the cluster and remove it from the initial list. Once all parts on the list have been compared with the initial cluster part's dimensions, I pop the next part off the list and start again, populating clusters until no parts remain on the original list. This is a programmatic approach. I am not sure if this is the most efficient way of categorizing parts into "clusters" or if k-means clustering would be a better approach.
Define "better".
What you do seems to be related to "leader" clustering, but that is a very primitive form of clustering that will usually not yield competitive results. With 1 million points, however, your choices are limited, and k-means does not handle categorical data well.
But until you decide what is 'better', there probably is nothing 'wrong' with your greedy approach.
An obvious optimization would be to first split all the data based on the categorical attributes (since you expect them to match exactly). That requires just one pass over the data set and a hash table. If the resulting groups are small enough, you could try k-means (but how would you choose k?) or DBSCAN (probably using the same threshold you already have) on each group.
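A rough sketch of that idea, assuming the parts sit in a pandas DataFrame with hypothetical column names (the categorical grouping is exact; the DBSCAN eps value stands in for the 5 percent tolerance and would need tuning):

import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Hypothetical schema, for illustration only.
categorical_cols = ["material", "heat_treatment"]
numeric_cols = ["diameter", "thickness"]

def cluster_parts(parts: pd.DataFrame) -> pd.DataFrame:
    out = []
    # One pass to split the data on the categorical attributes (the "hash table" step).
    for _, group in parts.groupby(categorical_cols):
        X = StandardScaler().fit_transform(group[numeric_cols])
        db = DBSCAN(eps=0.05, min_samples=2).fit(X)     # eps is a stand-in for the 5% threshold
        out.append(group.assign(cluster=db.labels_))    # -1 marks parts matching nothing
    return pd.concat(out)                               # cluster ids are unique only per group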

Calculating similarity measure between millions of documents

I have millions of documents (close to 100 million). Each document has fields such as skills, hobbies, certification and education. I want to find the similarity between each pair of documents, along with a score.
Below is an example of the data.
skills   hobbies        certification   education
Java     fishing        PMP             MS
Python   reading novel  SCM             BS
C#       video game     PMP             B.Tech.
C++      fishing        PMP             MS
What I want is the similarity between the first row and all other rows, the similarity between the second row and all other rows, and so on. So every document should be compared against every other document to get the similarity scores.
The purpose is that I query my database to get people based on skills. In addition to that, I now want people who do not have the skills but are a close match to the people who do. For example, if I wanted to get data for people with Java skills, the first row would appear, and the last row would appear as well, since it is similar to the first row based on the similarity score.
Challenge: My primary challenge is to compute a similarity score for each document against every other document, as you can see from the pseudocode below. How can I do this faster? Is there a different way to do this with this pseudocode, or is there another computational (hardware/algorithm) approach to do it faster?
document = all_document_in_db
for i in document:
    for j in document:
        if i != j:
            compute_similarity(i, j)
One way to speed this up would be to ensure you don't calculate the similarity both ways: your current pseudocode will compare i to j and j to i. Instead of iterating j over the whole collection, iterate over document[i+1:] (treating i as an index), i.e. only the entries after i. This will reduce your calls to compute_similarity by half.
The most suitable data structure for this kind of comparison would be an adjacency matrix. This will be an n * n matrix (n is the number of members in your data set), where matrix[i][j] is the similarity between members i and j. You can populate this matrix fully while still only half-iterating over j, by just simultaneously assigning matrix[i][j] and matrix[j][i] with one call to compute_similarity.
Beyond this, I can't think of any way to speed up this process; you will need to make at least n * (n - 1) / 2 calls to compute_similarity. Think of it like a handshake problem; if every member must be compared to ('shake hands with') every other member at least once, then the lower bound is n * (n - 1) / 2. But I welcome other input!
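A minimal sketch of that half-iteration with a symmetric matrix (compute_similarity is whatever scoring function you already have; note that an n x n matrix is only feasible for a small n, not for 100 million documents):

import numpy as np

def similarity_matrix(documents, compute_similarity):
    n = len(documents)
    matrix = np.ones((n, n))                  # self-similarity on the diagonal
    for i in range(n):
        for j in range(i + 1, n):             # only entries after i: n * (n - 1) / 2 calls
            score = compute_similarity(documents[i], documents[j])
            matrix[i, j] = score              # fill both halves with a single call
            matrix[j, i] = score
    return matrix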
I think what you want is some sort of clustering algorithm. You think of each row of your data as giving a point in a multi-dimensional space. You then want to look for other 'points' that are nearby. Not all the dimensions of your data will produce good clusters so you want to analyze your data for which dimensions will be significant for generation of clusters and reduce the complexity of looking for similar records by mapping to a lower dimension of the data. scikit-learn has some good routines for dimensional analysis and clustering as well as some of the best documentation for helping you to decide which routines to apply to your data. For actually doing the analysis I think you might do well to purchase cloud time with AWS or Google AppEngine. I believe both can give you access to Hadoop clusters with Anaconda (which includes scikit-learn) available on the nodes. Detailed instructions on either of these topics (clustering, cloud computing) are beyond a simple answer. When you get stuck post another question.
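For instance, a rough sketch along those lines with scikit-learn (the one-hot encoding, the number of reduced dimensions and the number of clusters are all assumptions to be tuned on the real data):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import MiniBatchKMeans

df = pd.DataFrame({                                    # toy stand-in for the real table
    "skills": ["Java", "Python", "C#", "C++"],
    "hobbies": ["fishing", "reading novel", "video game", "fishing"],
    "certification": ["PMP", "SCM", "PMP", "PMP"],
    "education": ["MS", "BS", "B.Tech.", "MS"],
})

X = OneHotEncoder().fit_transform(df)                  # sparse document-by-feature matrix
X_reduced = TruncatedSVD(n_components=2).fit_transform(X)     # map to a lower dimension
labels = MiniBatchKMeans(n_clusters=2, random_state=0).fit_predict(X_reduced)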
With 100 million documents, you need on the order of 5 * 10^15 pairwise comparisons. No, you cannot do this in Python.
The most feasible solution (aside from using a supercomputer) is to calculate the similarity scores in C/C++.
Read the whole database and enumerate each skill, hobby, certification, and education. This operation takes linear time, assuming that your index look-ups are "smart" and take constant time.
Create a C/C++ struct with four numeric fields: skill, hobby, certification, and education.
Run a nested loop that subtracts each struct from all other structs fieldwise and uses bit-level arithmetic to assess the similarity.
Save the results into a file and make them available to the Python program, if necessary.
Actually, I believe you need to compute a matrix representation of the documents and only call compute_similarity once. This will invoke a vectorized implementation of the algorithm on all pairs of rows of features in the X matrix (the first parameter, assuming scikit-learn). You'll be surprised by the performance. If the attempt to calculate this in one call exceeds your RAM, you can try chunking.
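A rough sketch of that idea, assuming one-hot features and scikit-learn's cosine_similarity as the vectorized similarity (the chunk size and the choice of similarity measure are assumptions):

from sklearn.preprocessing import OneHotEncoder
from sklearn.metrics.pairwise import cosine_similarity

def chunked_similarity(rows, chunk_size=10_000):
    # rows: array-like of (skills, hobbies, certification, education) tuples
    X = OneHotEncoder().fit_transform(rows)             # sparse feature matrix
    for start in range(0, X.shape[0], chunk_size):
        chunk = X[start:start + chunk_size]
        # One vectorized call scores this chunk against every document at once.
        yield start, cosine_similarity(chunk, X)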

Test differently binned data sets

I am trying to test how a periodic data set behaves with respect to the same data set folded with the period (that is, the average profile). More specifically, I want to test if the single profiles are consistent with the average one.
I am reading about a number of tests available in Python, especially the two-sample Kolmogorov-Smirnov statistic and the chi-square test.
However, my data are real data, and binned of course.
Therefore, as is often the case with real data, my data have gaps. This means that very often the number of bins in the single profiles is smaller than the number of bins of the "model" (the folded/average profile).
This means that I can't use those tests straightforwardly (because the two arrays have different numbers of elements), but probably need to:
1) do some transformation, or any other operation, that allows me to compare the distributions;
2) Also, converting the average profile into a continuous model would be a nice solution;
3) Proceed with different statistical instruments which I am not aware of.
But I don't know how to move on in any of these cases, so I would need help in finding a way for (1) or (2) (perhaps both!), or a hint about the third option.
EDIT: the data are a light curve, that is photon counts versus time.
The data are from a periodic astronomical source, that is, they repeat their pattern (profile) every given period. I can fold the data with the period and obtain an average profile, and I want to use this average profile as the model against which each single profile is tested.
Thanks!
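One possible route for option (2), sketched under the assumption that each single profile is a set of (phase, counts) bins and that the average profile can be interpolated onto those phases (the Poisson chi-square below is an illustration, not necessarily the right statistic for your counts):

import numpy as np
from scipy.interpolate import interp1d
from scipy.stats import chi2

def profile_chi2(phase_avg, counts_avg, phase_single, counts_single):
    # Turn the folded/average profile into a continuous model by interpolation,
    # then evaluate it only at the phases where the single profile actually has bins.
    model = interp1d(phase_avg, counts_avg, kind="linear", fill_value="extrapolate")
    expected = model(phase_single)
    # For Poisson counts, the expected counts serve as the variance in each bin.
    chisq = np.sum((counts_single - expected) ** 2 / expected)
    dof = len(counts_single) - 1
    return chisq, chi2.sf(chisq, dof)   # statistic and its survival-function p-value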

Group detection in large data sets in Python

I am a newbie in Python and have been trying my hand at different problems which introduce me to different modules and functionalities (I find it a good way of learning).
I have googled around a lot but haven't found anything close to a solution to the problem.
I have a large data set of Facebook posts from various groups that use Facebook as a medium to broadcast information.
I want to make groups out of these posts which are content-wise same.
For example, one of the posts is "xyz.com is selling free domains. Go register at xyz.com"
and another is "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost."
These are similar, as they both ask the reader to go to the group's website and register.
P.S.: Just a clarification: if either of the links had been abc.com, the posts wouldn't have been similar.
Priority goes to the source and then to the action (the action being registering here).
Is there a simple way to do it in python? (a module maybe?)
I know it requires some sort of clustering algorithm (correct me if I am wrong); my question is: can Python make this job easier for me somehow? Some module or anything?
Any help is much appreciated!
Assuming you have a function called geturls that takes a string and returns a list of urls contained within, I would do it like this:
from collections import defaultdict

groups = defaultdict(list)
for post in facebook_posts:
    for url in geturls(post):
        groups[url].append(post)
That greatly depends on your definition of "content-wise same". A straightforward approach is to use a so-called Term Frequency - Inverse Document Frequency (TF-IDF) model.
Simply put, make a long list of all words in all your posts, filter out stop-words (articles, determiners, etc.), and for each document (= post) count how often each term occurs, multiplying that by the importance of the term (the inverse document frequency, calculated as the log of the total number of documents divided by the number of documents in which the term occurs). This way, words which are very rare will be more important than common words.
You end up with a huge table in which every document (still, we're talking about group posts here) is represented by a (very sparse) vector of terms. Now you have a metric for comparing documents. As your documents are very short, only a few terms will be significantly high, so similar documents might be the ones where the same term achieved the highest score (i.e. the highest component of the document vectors is the same), or maybe where the Euclidean distance between the three highest values is below some parameter. That sounds very complicated, but (of course) there's a module for that.
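For example, a minimal sketch with scikit-learn's TfidfVectorizer and cosine similarity (the 0.5 threshold is an arbitrary assumption, and this alone does not enforce the "same URL" priority, which the URL-grouping answer above handles):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

posts = [
    "xyz.com is selling free domains. Go register at xyz.com",
    "Everyone needs to register again at xyz.com. Due to server failure, all data has been lost.",
]

vectors = TfidfVectorizer(stop_words="english").fit_transform(posts)   # sparse term vectors
similarity = cosine_similarity(vectors)                                # pairwise scores in [0, 1]
similar_pairs = [(i, j)
                 for i in range(len(posts))
                 for j in range(i + 1, len(posts))
                 if similarity[i, j] > 0.5]                            # assumed threshold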
