I have a dataset with 25 columns and 1000+ rows. It contains dummy information about interns. We want to group these interns into squads, say 10 members per squad.
We want to form the squads based on the similarities between the interns and assign a squad number to each one. The factors will be the columns we have in the dataset, such as timezone, the language they speak, which team they want to work in, etc.
These are the columns:
["Name","Squad_Num","Prefered_Lang","Interested_Grp","Age","City","Country","Region","Timezone",
"Occupation","Degree","Prev_Took_Courses","Intern_Experience","Product_Management","Digital_Marketing",
"Market_Research","Digital_Illustration","Product_Design","Prodcut_Developement","Growth_Marketing",
"Leading_Groups","Internship_News","Cohort_Product_Marketing","Cohort_Product_Design",
"Cohort_Product_Development","Cohort_Product_Growth","Hours_Per_Week"]
Here are a bunch of clustering algos for you to play around with.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Since this is unsupervised learning, you kind of have to fiddle around with different algos and see which one performs to your liking; there is no accuracy, precision, R², etc., to tell you how well the machine is performing.
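As a starting point, here is a minimal sketch of one of those algorithms (KMeans) applied to this squad-assignment problem. The data below is synthetic and the preprocessing choices (which columns to one-hot encode) are assumptions for illustration, not taken from the real dataset:

```python
# Sketch: cluster interns into squads of ~10 with KMeans.
# The DataFrame here is a small synthetic stand-in for the real 1000-row dataset.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import OneHotEncoder

np.random.seed(0)
df = pd.DataFrame({
    "Prefered_Lang": np.random.choice(["English", "Spanish", "Hindi"], 100),
    "Timezone": np.random.choice(["UTC-5", "UTC", "UTC+5"], 100),
    "Hours_Per_Week": np.random.randint(5, 40, 100),
})

# One-hot encode the categorical factors, keep numeric columns as-is
enc = OneHotEncoder()
X_cat = enc.fit_transform(df[["Prefered_Lang", "Timezone"]]).toarray()
X = np.hstack([X_cat, df[["Hours_Per_Week"]].to_numpy()])

# Aim for squads of ~10 members: number of clusters = rows / 10
n_squads = len(df) // 10
df["Squad_Num"] = KMeans(n_clusters=n_squads, n_init=10, random_state=0).fit_predict(X)
print(df["Squad_Num"].value_counts())
```

Note that plain KMeans does not guarantee squads of exactly 10 members; if equal sizes are a hard requirement, you would need a post-processing step or a constrained clustering variant.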
I have got a couple of questions.
If the dataset has some categorical variables that need one-hot encoding, sometimes there are so many categories within each column that one-hot encoding increases the total number of columns to 200 or more. How do I deal with this?
What is a good number of total columns to go into model training? Are 50 columns or 20 columns fine, or is it totally dependent on the dataset?
Generally it's good to have a lot of examples. This number of columns is not too many.
I recommend instead focusing on:
Deleting missing or irrelevant data, which may lead to wrong output after all the transformations.
Providing a significant number of records, which magnifies the learning effect.
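On the first question (too many one-hot columns), here is a small sketch of two common tactics: lumping rare categories into an "other" bucket before encoding, and frequency encoding, which replaces the column with a single numeric one. The data is made up:

```python
# Sketch: two ways to tame a high-cardinality categorical column
# before one-hot encoding blows up the column count. Data is made up.
import pandas as pd

s = pd.Series(["NY", "NY", "LA", "SF", "SF", "SF", "Austin", "Boise"])

# 1) Lump rare categories (fewer than 2 occurrences) into "other",
#    then one-hot encode: only 3 dummy columns instead of 5.
counts = s.value_counts()
lumped = s.where(s.isin(counts[counts >= 2].index), "other")
dummies = pd.get_dummies(lumped)

# 2) Frequency encoding: replace each category by how often it occurs,
#    giving a single numeric column regardless of cardinality.
freq = s.map(s.value_counts(normalize=True))
print(dummies.shape, freq.tolist())
```

Recent scikit-learn versions also expose this kind of rare-category grouping directly via `OneHotEncoder`'s `min_frequency` parameter.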
I have a dataframe with a large number of rows (several hundred thousand) and several columns that show industry classification for a company, while the eighth column is the output and shows the company type, e.g. Corporate or Bank or Asset Manager or Government etc.
Unfortunately, the industry classification is not consistent 100% of the time and is not finite, i.e. there are too many permutations of the industry classification columns to map them all manually in one pass. If I mapped, say, 1k rows with the correct Output column, how could I employ machine learning with Python to predict the Output column based on my trained sample data? Please see the attached image, which makes it clearer.
[Image: part of the dataset]
You are trying to predict the company type based on a couple of columns? That is very hard; there are a lot of companies working on exactly that. The best you can do is to collect a lot of data from different sources, match them, and then try scikit-learn, probably a decision tree classifier to start.
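A minimal sketch of that suggestion: train a `DecisionTreeClassifier` on the manually mapped rows and predict the Output column for the rest. The column names and values here are hypothetical, since the real schema is only shown in the image:

```python
# Sketch: fit a decision tree on the ~1k manually labelled rows,
# then predict Output for unlabelled rows. Columns are hypothetical.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

labelled = pd.DataFrame({
    "ind_class_1": ["Banking", "Tech", "Banking", "Gov"],
    "ind_class_2": ["Retail", "Software", "Invest", "Federal"],
    "Output": ["Bank", "Corporate", "Bank", "Government"],
})
unlabelled = pd.DataFrame({
    "ind_class_1": ["Tech", "Gov"],
    "ind_class_2": ["Software", "Federal"],
})

features = ["ind_class_1", "ind_class_2"]
X = pd.get_dummies(labelled[features])
clf = DecisionTreeClassifier(random_state=0).fit(X, labelled["Output"])

# Align the unlabelled rows to the same dummy columns before predicting
X_new = pd.get_dummies(unlabelled[features]).reindex(columns=X.columns, fill_value=0)
print(clf.predict(X_new))
```

The `reindex` step matters in practice: categories seen only in the training rows must still appear (as all-zero columns) in the rows you predict on.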
I am trying to build an AI that composes possible winning bets. However, I don't know how I should approach this kind of AI. I've made simple AIs that can detect the difference between humans and animals, for example, but this one is a lot more complex.
Which AI model should I use? I don't think linear regression or K-Nearest Neighbours will work in this situation. I'm trying to experiment with neural networks, but I don't have any experience with them.
To make things a little bit more clear.
I have a MongoDB with fixtures, leagues, countries and predictions
A fixture contains two teams, a league id, and some more values
A league contains a country and some more values
A country is just an ID with for example a flag in SVG format
A prediction is a collection of different markets* with their probability
market = a way to place a bet for example: home wins, away wins, both teams score
I also have a collection that contains information about which league's predictions are most accurate.
How would I go about creating this AI? All the data is in a very decent form; I just don't know how to begin. For example, what AI model should I use, and what inputs? Also, how would I go about saving the AI model and training it with new data that enters the MongoDB? (I have multiple cron jobs inserting data into the MongoDB.)
Note:
The AI should compose bets containing X number of fixtures
Because there is no "Right" way to do this, I'll tell you the most generic way.
The first thing you want to figure out is the target of the model:
The label/target that you want to classify is the market. For simplicity, I suggest using -1 for home win, 0 for tie, and 1 for away win.
Data cleaning: remove outliers, complete/interpolate missing values, etc.
Feature extraction:
Convert categorical values using one-hot encoding.
Standardise the values of numeric features to the range 0 to 1.
Remove all of the non-relevant features: those that have very low entropy over the whole dataset or very high entropy within each of the labels.
Try to extract logical features from the raw data that might help the classifier distinguish between the classes.
Select features using (for example) mutual information gain.
Try a simple model such as Naive Bayes first; if you have more time, you can use an SVM model. And remember the no-free-lunch theorem, and also that less is more: always prefer simple features and models.
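The steps above can be sketched as a single scikit-learn pipeline. The data and column names below are invented for illustration; in practice the features would come from the MongoDB fixtures:

```python
# Sketch of the steps above: one-hot encode categoricals, scale numerics
# to 0-1, select features by mutual information, then fit Naive Bayes.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "league": rng.choice(["EPL", "LaLiga"], 200),
    "home_goal_avg": rng.uniform(0, 3, 200),
    "away_goal_avg": rng.uniform(0, 3, 200),
})
y = rng.choice([-1, 0, 1], 200)  # -1 home win, 0 tie, 1 away win

pre = ColumnTransformer(
    [
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["league"]),
        ("num", MinMaxScaler(), ["home_goal_avg", "away_goal_avg"]),
    ],
    sparse_threshold=0.0,  # keep output dense for GaussianNB
)
model = Pipeline([
    ("pre", pre),
    ("select", SelectKBest(mutual_info_classif, k=3)),
    ("nb", GaussianNB()),
])
model.fit(df, y)
print(model.predict(df.head()))
```

With random labels this model cannot actually learn anything; the point is only the shape of the pipeline, which you can refit as new fixtures arrive and persist with `joblib.dump`.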
I am very inexperienced when it comes to machine learning, but I would like to learn and in order to improve my skills I am currently trying to apply the things I have learned on one of my own research data sets.
I have a dataset with 77 rows and 308 columns. Every row corresponds to a sample. 305 of the 308 columns give information about concentrations; one column tells whether the sample belongs to group A, B, C or D; one column tells whether it is an X or Y sample; and one column tells whether the output is successful or not. I would like to determine which concentrations significantly impact the output, taking into account the variation between the groups and sample types. I have tried multiple things (feature selection, classification, etc.) but so far I do not get the desired output.
My question is therefore whether people have suggestions/tips/ideas about how I could tackle this problem, taking into account that the dataset is relatively small and that only 15 out of the 77 samples have 'not successful' as output.
Calculate the correlation of each feature with the output and sort it. After sorting, take the top 10-15 features.
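A minimal sketch of that suggestion, on synthetic data standing in for the 305 concentration columns (the real dataset would have 77 rows and far more features):

```python
# Sketch: rank features by absolute correlation with the binary outcome
# and keep the strongest ones. Data is synthetic for illustration.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(77, 20)),
                  columns=[f"conc_{i}" for i in range(20)])
# Outcome driven mainly by conc_0, a bit by conc_1, plus noise
df["successful"] = (df["conc_0"] + 0.5 * df["conc_1"]
                    + rng.normal(scale=0.5, size=77) > 0).astype(int)

corr = df.drop(columns="successful").corrwith(df["successful"]).abs()
top = corr.sort_values(ascending=False).head(10)
print(top)
```

Keep in mind that with only 77 samples, correlations are noisy; cross-validating any downstream model is essential.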
I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way of preprocessing this column is before feeding it to a scikit-learn RandomForestClassifier. Generally, I would one-hot encode categorical features, but there are so many values here that it would radically increase my data's dimensionality. I have 12,000 rows of data, so I might be OK though, and only about 10 other features.
The alternatives are to leave the column with float values, convert the float values to int values, or convert the floats to pandas' categorical objects.
Any tips are much appreciated.
org_id does not seem to be a feature that brings any info for the classification, you should drop this value and not pass it into the classifier.
In a classifier you only want to pass features that are discriminative for the task that you are trying to perform: here the elements that can impact the retention or churn. The ID of a company does not bring any valuable information in this context therefore it should not be used.
Edit following OP's comment:
Before going further let's state something: with respect to the number of samples (12000) and the relative simplicity of the model, one can make multiple attempts to try different configurations of features easily.
So, as a baseline, I would do as I said before and drop this feature altogether. That gives you a baseline score, i.e., a score you can compare your other combinations of features against.
I think it costs nothing to try one-hot encoding org_id; whichever result you observe is going to add to your experience and knowledge of how the Random Forest behaves in such cases. As you only have about 10 other features, the Boolean features is_org_id_1, is_org_id_2, ... will vastly outnumber them, and the classification results may be heavily influenced by these features.
Then I would try to reduce the number of Boolean features by finding new features that can "describe" these 400+ organizations. For instance, if they are only US organizations, their state (which is ~50 features), or their number of users (a single numerical feature), or their years of existence (another single numerical feature). Note that these are only examples to illustrate the process of creating new features; only someone who knows the full problem can design these features in a smart way.
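To make the "baseline vs. alternative" comparison concrete, here is a sketch using synthetic data and a frequency-encoded stand-in for org_id (the descriptive features above, like state or user count, would slot in the same way):

```python
# Sketch: compare dropping org_id (baseline) against replacing it with a
# frequency encoding, via cross-validation. Data is synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "org_id": rng.integers(0, 417, 2000).astype(float),
    "tenure_months": rng.integers(1, 60, 2000),
    "support_tickets": rng.integers(0, 20, 2000),
})
y = rng.integers(0, 2, 2000)  # retained or churned

clf = RandomForestClassifier(n_estimators=50, random_state=0)

# Baseline: drop org_id entirely
base = cross_val_score(clf, df.drop(columns="org_id"), y, cv=3).mean()

# Alternative: replace org_id with how common that org is in the data
df["org_freq"] = df["org_id"].map(df["org_id"].value_counts(normalize=True))
alt = cross_val_score(clf, df.drop(columns="org_id"), y, cv=3).mean()
print(f"baseline={base:.3f}  with org frequency={alt:.3f}")
```

On this random data the two scores will both hover around 0.5; on the real data, a consistent gap between them is what tells you whether the org information carries signal.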
Also, I would find it interesting if, once you solve your problem, you came back here and wrote another answer to your question, as I believe many people run into such problems when working with real data :)