I am very inexperienced with machine learning, but I would like to learn, and to improve my skills I am currently trying to apply what I have learned to one of my own research data sets.
I have a dataset with 77 rows and 308 columns. Every row corresponds to a sample. 305 of the 308 columns give information about concentrations, one column tells whether the sample belongs to group A, B, C, or D, one column tells whether it is an X or Y sample, and one column tells whether the output is successful or not. I would like to determine which concentrations significantly impact the output, taking into account the variation between the groups and sample types. I have tried multiple things (feature selection, classification, etc.), but so far I do not get the desired results.
My question is therefore whether people have suggestions/tips/ideas on how I could tackle this problem, taking into account that the dataset is relatively small and that only 15 out of the 77 samples have 'not successful' as output?
Calculate the correlation of each feature with the output and sort it. After sorting, take the top 10-15 categories/features; a rough sketch is below.
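For illustration, here is a minimal sketch of that idea in Python. The file name and the column names (`successful`, `group`, `sample_type`) are assumptions, not from the question. With a 0/1 outcome, the per-column Pearson correlation is the point-biserial correlation.

```python
import pandas as pd

# Hypothetical file and column names based on the question's description.
df = pd.read_csv("concentrations.csv")
y = (df["successful"] == "yes").astype(int)                    # binary outcome (assumed encoding)
X = df.drop(columns=["successful", "group", "sample_type"])    # the 305 concentration columns

# Correlation of each concentration with the 0/1 outcome (point-biserial),
# sorted by absolute value; keep the top 15.
corr = X.apply(lambda col: col.corr(y)).abs().sort_values(ascending=False)
print(corr.head(15))
```

Note that a plain correlation ranking ignores the group and sample-type structure mentioned in the question; it is only a first screening step.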
I have a dataset with 25 columns and 1000+ rows. This dataset contains dummy information about interns. We want to make squads of these interns; suppose we want each squad to have 10 members.
Based on the similarities between interns, we want to form squads and assign a squad number to each intern. The factors will be the columns we have in the dataset, such as timezone, the language they speak, which team they want to work in, etc.
These are the columns:
["Name","Squad_Num","Prefered_Lang","Interested_Grp","Age","City","Country","Region","Timezone",
"Occupation","Degree","Prev_Took_Courses","Intern_Experience","Product_Management","Digital_Marketing",
"Market_Research","Digital_Illustration","Product_Design","Prodcut_Developement","Growth_Marketing",
"Leading_Groups","Internship_News","Cohort_Product_Marketing","Cohort_Product_Design",
"Cohort_Product_Development","Cohort_Product_Growth","Hours_Per_Week"]
Here are a bunch of clustering algos for you to play around with.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Since this is unsupervised learning, you kind of have to fiddle around with different algorithms and see which one performs to your liking; there is no accuracy, precision, R², etc., to tell you how well the model is performing.
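As a rough illustration (not taken from the notebook above), here is a minimal sketch that one-hot-encodes a few of the intern columns, clusters them with KMeans, and uses the silhouette score as an internal quality check. The file name, the chosen columns, and the range of cluster counts are all assumptions, and KMeans does not guarantee squads of exactly 10 members, so a rebalancing step would still be needed.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.compose import ColumnTransformer
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical file; column choices follow the list above.
df = pd.read_csv("interns.csv")
categorical = ["Prefered_Lang", "Interested_Grp", "Region", "Timezone"]
numeric = ["Age", "Hours_Per_Week"]

pre = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ("num", StandardScaler(), numeric),
])
X = pre.fit_transform(df[categorical + numeric])

# Try several cluster counts and keep the one with the best silhouette score.
best_k, best_score = 2, -1.0
for k in range(2, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

df["Squad_Num"] = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```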
I spent a lot of time trying to find good answers for my question. When running machine learning binary classification with multiple (say two) text inputs, how do I impute the missing values in the two input features?
I'm just using a simple example to clarify my question. Suppose I'm trying to classify whether each news article falls into the 'politics' section or not (binary classification). The two input features are 'article contents' and 'title of an article', both of which consist of text. To address the task in general, I need to pre-process those two inputs and vectorize each of them (using CountVectorizer, TF-IDF, etc.). Then you would concatenate the two vectorized matrices into one and choose whatever binary classification method you like for the later analysis.
Here my question is: how do you impute missing values when the inputs are texts? I know there are several ways to do this, such as imputing the mean value and so on, but that is only a simple task when an input is numeric, such as a person's age or income.
To summarize my questions again here:
Is there any way to impute the missing values when inputs are texts?
One way is to drop any row that has missing values, but I want to keep such rows because one input value might be missing while the other is not (just like the Article_ID B and C cases in the example below).
Here is a very simplified example of a data set for clarification. Note that this is a fake data set I came up with just to provide an example.
Article_ID Politics(class) Contents(input 1) Title(input 2)
A Yes The justices heard... Supreme Court Seems...
B Yes N/A U.S. to Begin Offering...
C No The announcement comes as... N/A
D Yes The two countries said... Despite Tensions, U.S. ...
E No Movie streaming service is... Two more seasons renewed...
If you are concatenating two inputs, then here are some choices you can use in place of the missing texts (a sketch follows the list):
a zero vector, or
constant * ones(N), i.e. a scaled vector of ones, where the constant can be something small like 1e-4 or 1/N, with N the vocabulary (vector) length.
Whatever vector you use for the missing texts, it is the model's responsibility to learn how much weight to give it in the final prediction.
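Here is a minimal sketch of the zero-vector option, assuming a scikit-learn pipeline like the one described in the question: replacing a missing text with an empty string makes TF-IDF produce an all-zero row for that input. The tiny dataframe only mirrors the example table above.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical data mirroring the example table above.
df = pd.DataFrame({
    "Contents": ["The justices heard...", None, "The announcement comes as...",
                 "The two countries said...", "Movie streaming service is..."],
    "Title": ["Supreme Court Seems...", "U.S. to Begin Offering...", None,
              "Despite Tensions, U.S. ...", "Two more seasons renewed..."],
    "Politics": [1, 1, 0, 1, 0],
})

# Replace missing text with an empty string; TF-IDF then produces an all-zero
# row for that input, which is the "zero vector" choice described above.
contents = df["Contents"].fillna("")
titles = df["Title"].fillna("")

X = hstack([TfidfVectorizer().fit_transform(contents),
            TfidfVectorizer().fit_transform(titles)])

clf = LogisticRegression().fit(X, df["Politics"])
```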
I am trying to solve a machine learning task but have encountered some problems. Any tips would be greatly appreciated. One of my questions is, how do you create a correlation matrix for 2 dataframes (data for 2 labels) of different sizes, to see if you can combine them into one.
Here is the full text of the task:
This dataset is composed of 1100 samples with 30 features each. The first column is the sample id. The second column in the dataset represents the label. There are 4 possible values for the labels. The remaining columns are numeric features.
Notice that the classes are unbalanced: some labels are more frequent than others. You need to decide whether to take this into account, and if so how.
Compare the performance of a Support-Vector Machine (implemented by sklearn.svm.LinearSVC) with that of a RandomForest (implemented by sklearn.ensemble.ExtraTreesClassifier). Try to optimize both algorithms' parameters and determine which one is best for this dataset. At the end of the analysis, you should have chosen an algorithm and its optimal set of parameters.
I have tried to make a correlation matrix for the rows with the less frequent labels, but I am not convinced it is reliable.
I made two new dataframes from the rows that have labels 1 and 2. There are 100-150 entries for each of those two labels, compared to about 400 for labels 0 and 3. I wanted to check whether there is a high correlation between the data labeled 1 and 2 to see if I could combine them, but I don't know if this is the right approach. I tried to make the dataframes the same size by appending zeros to the smaller one and then computed a correlation matrix for both datasets together. Is this a correct approach?
Your question and approach are not clear. Could you edit the question to include the problem statement and a few rows of the data you have been given?
If you want to visualize your data set, plot it in 2, 3, or 4 dimensions.
There are many plotting tools, such as 3D scatter plots, pair plots, histograms, and many more. Use them to better understand your data; a rough sketch is below.
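For example, a pair plot colored by label can show whether the two rarer classes actually overlap. A minimal sketch, where the file name is hypothetical and the assumption that the label sits in the second column comes from the task description:

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical file name; per the task, the second column is the label
# and the remaining columns are numeric features.
df = pd.read_csv("task_data.csv")
label_col = df.columns[1]

# Pair plot of a handful of features, colored by label, to eyeball whether
# classes 1 and 2 really overlap before deciding to merge them.
subset = df[[label_col] + list(df.columns[2:6])]
sns.pairplot(subset, hue=label_col, diag_kind="hist")
plt.show()
```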
I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way of preprocessing this column is before feeding it to a scikit-learn RandomForestClassifier. Generally, I would one-hot-encode categorical features, but there are so many values here that it would radically increase my data's dimensionality. I have 12,000 rows of data and only about 10 other features, so I might be OK though.
The alternatives are to leave the column with float values, convert the float values to int values, or convert the floats to pandas' categorical objects.
Any tips are much appreciated.
org_id does not seem to be a feature that brings any information for the classification; you should drop this column and not pass it to the classifier.
In a classifier you only want to pass features that are discriminative for the task you are trying to perform: here, the elements that can impact retention or churn. The ID of a company does not bring any valuable information in this context; therefore it should not be used.
Edit following OP's comment:
Before going further, let's state something: given the number of samples (12,000) and the relative simplicity of the model, you can easily make multiple attempts with different configurations of features.
So, as a baseline, I would do as I said before and drop this feature altogether. This gives you a baseline score, i.e., a score you can compare your other combinations of features against.
I think it costs nothing to try one-hot-encoding org_id; whatever result you observe will add to your experience and knowledge of how a Random Forest behaves in such cases. As you only have about 10 other features, the Boolean features is_org_id_1, is_org_id_2, ... will vastly outnumber them, and the classification results may be heavily influenced by these features. A sketch of this comparison is below.
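A minimal sketch of the baseline-versus-one-hot comparison, assuming the data sits in a CSV, the target column is called retained, and the remaining features are already numeric (all of these are assumptions):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical file and target column names.
df = pd.read_csv("customers.csv")
y = df["retained"]

# Baseline: drop org_id entirely.
X_base = df.drop(columns=["retained", "org_id"])
base_score = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X_base, y, cv=5).mean()

# Experiment: one-hot-encode org_id (adds ~417 Boolean columns).
X_ohe = pd.get_dummies(df.drop(columns=["retained"]), columns=["org_id"])
ohe_score = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0), X_ohe, y, cv=5).mean()

print(f"baseline (no org_id): {base_score:.3f}  one-hot org_id: {ohe_score:.3f}")
```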
Then I would try to reduce the number of Boolean features by finding new features that can "describe" these 400+ organizations. For instance, if they are all US organizations, their state (which would be ~50 features), their number of users (a single numerical feature), or their years of existence (another single numerical feature). Note that these are only examples to illustrate the process of creating new features; only someone who knows the full problem can design them in a smart way.
Also, I would find it interesting if, once you solve your problem, you came back here and wrote another answer to your question, as I believe many people run into such problems when working with real data :)
I have 2 dataframes.
Each dataframe contains 64 columns with each column containing 256 values.
I need to compare these 2 dataframes for statistical significance.
I know only basics of statistics.
What I have done is calculate a p-value for every column of each dataframe.
Then I compare the p-value of each column of the 1st dataframe to the p-value of the corresponding column of the 2nd dataframe.
For example: the p-value of the 1st column of the 1st dataframe to the p-value of the 1st column of the 2nd dataframe.
Then I say which columns are significantly different between the 2 dataframes.
Is there a better way to do this?
I use Python.
To be honest, the way you are doing it is not the way it is meant to be done. Let me highlight some points that you should always keep in mind when conducting such analyses:
1.) Hypothesis first
I strongly suggest avoiding testing everything against everything. This kind of exploratory data analysis is likely to produce some significant results, but it is also likely that you end up with a multiple comparisons problem.
In simple terms: you run so many tests that the chance of seeing something as significant which in fact is not is greatly increased (see also Type I and Type II errors).
2.) The p-value isn't all the magic
Saying that you calculated the p-value for all columns doesn't tell us which test you used. The p-value is just a "tool" from mathematical statistics that is used by a lot of tests (e.g. correlation, t-test, ANOVA, regression, etc.). A significant p-value indicates that the difference/relationship you observed is statistically relevant (i.e. a systematic rather than a random effect).
3.) Consider sample and effect size
Depending on which test you are using, the p-value is sensitive to the sample size you have. The greater your sample size, the more likely you are to find a significant effect. For instance, if you compare two groups with 1 million observations each, even the smallest differences (which might be random artifacts) can come out significant. It is therefore important to also look at the effect size, which tells you how large the observed effect really is (e.g. r for correlations, Cohen's d for t-tests, partial eta squared for ANOVAs, etc.). A minimal sketch of a column-wise comparison reporting both is shown below.
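For instance, if the intent is simply to compare each column of the 1st dataframe with the same column of the 2nd, one common approach (an assumption here, since the original test was not specified) is an independent two-sample t-test per column, reported together with Cohen's d and a correction for the 64 comparisons. A minimal sketch, assuming roughly normal data:

```python
import numpy as np
import pandas as pd
from scipy import stats

def cohens_d(a, b):
    """Effect size for the difference between two independent samples."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Hypothetical: df1 and df2 stand in for the two dataframes, 64 columns x 256 rows each.
df1 = pd.DataFrame(np.random.randn(256, 64))
df2 = pd.DataFrame(np.random.randn(256, 64))

results = []
for col in df1.columns:
    t, p = stats.ttest_ind(df1[col], df2[col])
    results.append({"column": col, "p": p, "d": cohens_d(df1[col], df2[col])})

res = pd.DataFrame(results)
# Bonferroni correction for the 64 comparisons (see the multiple comparisons point above).
res["significant"] = res["p"] < 0.05 / len(res)
print(res.sort_values("p").head())
```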
SUMMARY
So, if you want to get some real help here, I suggest posting some code and specifying more concretely (1) what your research question is, (2) which tests you used, and (3) what your code and output look like.