Considering two columns in analysis - Python

I have a dataframe with three columns: rating, price, and currency. For example:
import pandas as pd

df = pd.DataFrame({
    'ratingvalue': ['5.0', '4.5', '2.0'],
    'pricerange': ['10000000', '899', '200'],
    'pricecurrency': ['45', '15', '20']
})
# the numbers in pricecurrency represent the currency, like EUR, USD, ...
I'm working on a prediction model that should predict the rating, and we all know that when considering prices, we have to take the currency into account.
How can I take two columns as independent variables when creating the classifier?

I feel like the question lacks a little detail, because even a simple model like LinearRegression can of course take multiple features into account. So it's just a matter of passing both features into the model.
In case your question was about how to make those features more useful, I'd suggest the following:
If you have a limited number of currencies, just hardcode the conversion rates and convert all prices to a single currency.
If the number of currencies is large, or you are not sure which code is which, you could use them as features directly, but then I'd suggest encoding them as one-hot features, since their numeric ordering might otherwise influence the predictions. For example, use OneHotEncoder to generate the features (see the sketch below).
Again, I don't know many details of the task, but if it's not simple, you'll need more features than these to get a good prediction, so maybe consider adding some other columns as well.
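For illustration, here's a minimal sketch of the one-hot option, using the column names from the question. LinearRegression is just a placeholder; since the rating is numeric, a regressor may actually fit better than a classifier:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'ratingvalue': [5.0, 4.5, 2.0],
    'pricerange': [10000000.0, 899.0, 200.0],
    'pricecurrency': ['45', '15', '20'],
})

# one-hot encode the currency code; pass the numeric price through as-is
preprocess = ColumnTransformer(
    [('currency', OneHotEncoder(handle_unknown='ignore'), ['pricecurrency'])],
    remainder='passthrough',
)

model = make_pipeline(preprocess, LinearRegression())
model.fit(df[['pricerange', 'pricecurrency']], df['ratingvalue'])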

Related

Group similar data and assign number to each group Python pandas

I have a dataset with 25 columns and 1000+ rows, containing dummy information about interns. We want to group these interns into squads of, say, 10 members each.
Squads should be formed based on the interns' similarities, and each intern assigned a squad number. The factors will be the columns we have in the dataset: Timezone, the language they speak, which team they want to work in, etc.
These are the columns:
["Name","Squad_Num","Prefered_Lang","Interested_Grp","Age","City","Country","Region","Timezone",
"Occupation","Degree","Prev_Took_Courses","Intern_Experience","Product_Management","Digital_Marketing",
"Market_Research","Digital_Illustration","Product_Design","Prodcut_Developement","Growth_Marketing",
"Leading_Groups","Internship_News","Cohort_Product_Marketing","Cohort_Product_Design",
"Cohort_Product_Development","Cohort_Product_Growth","Hours_Per_Week"]
Here are a bunch of clustering algorithms for you to play around with:
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
Since this is unsupervised learning, you kind of have to fiddle around with different algorithms and see which one performs to your liking; there is no accuracy, precision, R², etc., to let you know how well the machine is performing.
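As a minimal sketch of one such algorithm, here's KMeans on one-hot-encoded columns. The file name and the choice of similarity columns are assumptions; also note that KMeans will not guarantee squads of exactly 10 members, it only groups similar rows together:

import pandas as pd
from sklearn.cluster import KMeans

df = pd.read_csv('interns.csv')  # hypothetical file with the columns above

# one-hot encode a few of the similarity factors (pick the ones that matter)
features = pd.get_dummies(df[['Timezone', 'Prefered_Lang', 'Interested_Grp']])

# aim for roughly len(df) / 10 clusters
n_squads = max(1, len(df) // 10)
kmeans = KMeans(n_clusters=n_squads, n_init=10, random_state=0)
df['Squad_Num'] = kmeans.fit_predict(features)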

Exclude values existing in a list that contains words like

I have a list of merchant categories:
[
'General Contractors–Residential and Commercial',
'Air Conditioning, Heating and Plumbing Contractors',
'Electrical Contractors',
....,
'Insulation, Masonry, Plastering, Stonework and Tile Setting Contractors'
]
I want to exclude merchants from my dataframe if df['merchant_category'].str.contains() any of these merchant categories.
However, I cannot guarantee that the value in my dataframe has the long name given in the list of merchant categories. It could be that my dataframe value is just 'air conditioning'.
As such, df = df[~df['merchant_category'].isin(list_of_merchant_category)] will not work.
If you can collect a long list of positive examples (categories you definitely want to keep) and negative examples (categories you definitely want to exclude), you could try to train a text classifier on that data.
It would then be able to look at new texts and make a reasonable guess as to whether you want them included or excluded, based on their similarity to your examples.
So, as you're working in Python, I suggest you look for online tutorials and examples of "binary text classification" using Scikit-Learn.
While there's a bewildering variety of possible approaches both to representing/vectorizing your text and to learning classifications from those vectors, you may have success with some very simple ones commonly used in intro examples. For example, you could represent your textual categories with bag-of-words and/or character-n-gram (word-fragment) representations, then try NaiveBayes or SVC classifiers (and others if you need to experiment for possibly-better results).
Some of these will even report a sort of 'confidence' in their predictions, so you could potentially accept the strong predictions but highlight the weak ones for human review. When a human then looks at, and definitively rules on, a new 'category' string (because it was highlighted as an iffy prediction, or noticed as an error), you can then improve the overall system by:
adding that string to the known set that is automatically included/excluded based on an exact literal comparison
re-training the system, so that it has a better chance of getting other new, similar strings correct
(I know this is a very high-level answer, but once you've worked through some attempts based on other intro tutorials and hit issues with your data, you'll be able to ask more specific questions here on SO to get over any specific issues.)
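To make that concrete, here's a minimal sketch of such a binary text classifier using character n-grams and Naive Bayes; the training examples are made up and far too few for real use:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

keep = ['Grocery Stores', 'Restaurants']                  # label 1: include
exclude = ['Electrical Contractors', 'air conditioning']  # label 0: exclude

texts = keep + exclude
labels = [1] * len(keep) + [0] * len(exclude)

# character n-grams cope well with partial names like 'air conditioning'
clf = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
    MultinomialNB(),
)
clf.fit(texts, labels)

# predict_proba is the 'confidence' mentioned above: probabilities near
# 0.5 are the weak predictions worth routing to human review
print(clf.predict_proba(['Plumbing Contractors']))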

How to handle columns like 'country' and 'age groups' while making a prediction model in python?

I am quite new to machine learning, and while working on this specific dataframe, I found it difficult to handle important columns like age group and country.
Here is a link to the data-set I am using:
https://www.kaggle.com/russellyates88/suicide-rates-overview-1985-to-2016
For more precise predictions, the columns 'country' and 'age group' are quite important. But I keep getting errors like:
could not convert string to float: '15-24 years'
And similar for the country column.
What could I do to make them suitable for the model?
These are "categorical" attributes of your machine learning model. Typically categorical attributes are assigned an integer value so that the ML model can handle them. This is a major topic in machine learning, so all I can do is suggest you read up on categorical data. Perhaps this link or one similar will give you a start.
The data you are talking about is categorical.
Basically, the data you have in your dataset is mostly ordinal (numeric) or categorical.
I would recommend you handle this by converting the categorical variables to dummy codes.
For example, assume you have a dataframe like the one below:
Id, Country
1,  US
2,  UK
3,  Germany
Converting this to dummy codes will give you:
Id, US, UK, Germany
1,  1,  0,  0
2,  0,  1,  0
3,  0,  0,  1
There are multiple packages that convert categorical data to dummy codes; pandas has get_dummies built in. The resulting dataframe can then be used to train your model.
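A minimal sketch with get_dummies, reproducing the example above:

import pandas as pd

df = pd.DataFrame({'Id': [1, 2, 3], 'Country': ['US', 'UK', 'Germany']})
dummies = pd.get_dummies(df, columns=['Country'], dtype=int)
print(dummies)
#    Id  Country_Germany  Country_UK  Country_US
# 0   1                0           0           1
# 1   2                0           1           0
# 2   3                1           0           0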

Handling Categorical Data with Many Values in sklearn

I am trying to predict customer retention with a variety of features.
One of these is org_id which represents the organization the customer belongs to. It is currently a float column with numbers ranging from 0.0 to 416.0 and 417 unique values.
I am wondering what the best way is to preprocess this column before feeding it to a scikit-learn RandomForestClassifier. Generally I would one-hot-encode categorical features, but there are so many values here that it would radically increase my data's dimensionality. I have 12,000 rows of data and only about 10 other features, so I might be OK though.
The alternatives are to leave the column with float values, convert the float values to int values, or convert the floats to pandas' categorical objects.
Any tips are much appreciated.
org_id does not seem to be a feature that brings any information for the classification; you should drop this column and not pass it into the classifier.
In a classifier you only want to pass features that are discriminative for the task you are trying to perform: here, the elements that can impact retention or churn. The ID of a company does not bring any valuable information in this context, therefore it should not be used.
Edit following OP's comment:
Before going further, let's state something: given the number of samples (12,000) and the relative simplicity of the model, you can easily make multiple attempts with different configurations of features.
So, as a baseline, I would do as I said before and drop this feature altogether. That gives you a baseline score, i.e., a score you can compare your other feature combinations against.
It costs nothing to try one-hot-encoding org_id; whichever result you observe will add to your experience and knowledge of how the Random Forest behaves in such cases. As you only have about 10 other features, the Boolean features is_org_id_1, is_org_id_2, ... will heavily outnumber them, and the classification results may be strongly influenced by these features (a sketch of this comparison follows below).
Then I would try to reduce the number of Boolean features by finding new features that can "describe" these 400+ organizations: for instance, if they are all US organizations, their state (~50 one-hot features), their number of users (a single numerical feature), or their years of existence (another single numerical feature). Note that these are only examples to illustrate the process of creating new features; only someone who knows the full problem can design them in a smart way.
Also, I would find it interesting if, once you solve your problem, you came back here and wrote another answer to your question, as I believe many people run into such problems when working with real data :)
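To make the baseline-vs-one-hot comparison concrete, here's a minimal sketch; the file name, the target column 'retained', and the cross-validation setup are assumptions:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

df = pd.read_csv('customers.csv')   # hypothetical file
y = df['retained']                  # hypothetical target column
X_base = df.drop(columns=['retained', 'org_id'])

# baseline: org_id dropped entirely
baseline = cross_val_score(
    RandomForestClassifier(random_state=0), X_base, y, cv=5).mean()

# variant: org_id one-hot encoded (treated as a category, not a float)
X_org = X_base.join(pd.get_dummies(df['org_id'].astype(int), prefix='org'))
variant = cross_val_score(
    RandomForestClassifier(random_state=0), X_org, y, cv=5).mean()

print(f'baseline={baseline:.3f}  with org_id one-hot={variant:.3f}')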

One hot encoding - data stored in a 1d array

I have a 1-dimensional array that I am using to store a categorical feature of my dataset, like so (each data instance belongs to many categories, and the categories are separated by commas):
Administration Oral ,Aged ,Area Under Curve ,Cholinergic Antagonists/adverse effects/*pharmacokinetics/therapeutic use ,Circadian Rhythm/physiology ,Cross-Over Studies ,Delayed-Action Preparations ,Dose-Response Relationship Drug ,Drug Administration Schedule ,Female ,Humans ,Mandelic Acids/adverse effects/blood/*pharmacokinetics/therapeutic use ,Metabolic Clearance Rate ,Middle Aged ,Urinary Incontinence/drug therapy ,Xerostomia/chemically induced ,
Adult ,Anti-Ulcer Agents/metabolism ,Antihypertensive Agents/metabolism ,Benzhydryl Compounds/administration & dosage/blood/*pharmacology ,Caffeine/*metabolism ,Central Nervous System Stimulants/metabolism ,Cresols/administration & dosage/blood/*pharmacology ,Cross-Over Studies ,Cytochromes/*pharmacology ,Debrisoquin/*metabolism ,Drug Interactions ,Humans ,Male ,Muscarinic Antagonists/pharmacology ,Omeprazole/*metabolism ,*Phenylpropanolamine ,Polymorphism Genetic ,Tolterodine Tartrate ,Urinary Bladder Diseases/drug therapy ,
...
...
Each element of the array represents the categories a data instance belongs to. I need to use one-hot encoding so I can use these as a feature to train my algorithm. I understand this can be achieved using scikit-learn, but I am unsure how to implement it. (There are ~150 possible categories and around 1,000 data instances.)
I recommend you use the get_dummies method in pandas for this. The interface is a little nicer, especially if you're already using pandas to store your data. The sklearn implementations are a bit more involved: if you do decide to go the sklearn route, you will need either OneHotEncoder or LabelBinarizer, both of which require you to first convert your categories to integer values, which you can accomplish with LabelEncoder.
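Since each row holds several comma-separated categories, pandas' Series.str.get_dummies is one concrete way to build the 0/1 matrix; a minimal sketch with sample strings shortened from the question:

import pandas as pd

s = pd.Series([
    'Administration Oral ,Aged ,Cross-Over Studies ,Female ,Humans',
    'Adult ,Cross-Over Studies ,Humans ,Male',
])

# normalize the ' ,' separators, then build one 0/1 column per category
one_hot = s.str.replace(' ,', ',', regex=False).str.get_dummies(sep=',')
print(one_hot.columns.tolist())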
