I have a 1-dimensional array which I am using to store a categorical feature of my dataset, like so (each data instance belongs to many categories, and the categories are separated by commas):
Administration Oral ,Aged ,Area Under Curve ,Cholinergic Antagonists/adverse effects/*pharmacokinetics/therapeutic use ,Circadian Rhythm/physiology ,Cross-Over Studies ,Delayed-Action Preparations ,Dose-Response Relationship Drug ,Drug Administration Schedule ,Female ,Humans ,Mandelic Acids/adverse effects/blood/*pharmacokinetics/therapeutic use ,Metabolic Clearance Rate ,Middle Aged ,Urinary Incontinence/drug therapy ,Xerostomia/chemically induced ,
Adult ,Anti-Ulcer Agents/metabolism ,Antihypertensive Agents/metabolism ,Benzhydryl Compounds/administration & dosage/blood/*pharmacology ,Caffeine/*metabolism ,Central Nervous System Stimulants/metabolism ,Cresols/administration & dosage/blood/*pharmacology ,Cross-Over Studies ,Cytochromes/*pharmacology ,Debrisoquin/*metabolism ,Drug Interactions ,Humans ,Male ,Muscarinic Antagonists/pharmacology ,Omeprazole/*metabolism ,*Phenylpropanolamine ,Polymorphism Genetic ,Tolterodine Tartrate ,Urinary Bladder Diseases/drug therapy ,
...
...
Each element of the array represents the categories a data instance belongs to. I need to use one-hot encoding so I can use these as a feature to train my algorithm. I understand this can be achieved using scikit-learn, however I am unsure how to implement it. (There are ~150 possible categories and around 1,000 data instances.)
I recommend you use the get_dummies method in pandas for this. The interface is a little nicer, especially if you're already using pandas to store your data. The sklearn implementations are a bit more involved. If you do decide to go the sklearn route, you will need to use either OneHotEncoder or LabelBinarizer. Both would require you to first convert your categories to integer values, which you can accomplish with LabelEncoder.
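A minimal sketch of the pandas route for comma-separated multi-category strings like yours; the example strings are shortened and the variable name is hypothetical:

```python
import pandas as pd

# Hypothetical Series holding the comma-separated category strings (shortened here)
categories = pd.Series([
    "Administration Oral ,Aged ,Cross-Over Studies ,Humans ,",
    "Adult ,Cross-Over Studies ,Humans ,Male ,",
])

# Normalise the separators (trim whitespace, drop empty trailing entries),
# then build one 0/1 indicator column per distinct category.
normalised = (categories.str.split(",")
              .apply(lambda cats: "|".join(c.strip() for c in cats if c.strip())))
one_hot = normalised.str.get_dummies(sep="|")

print(one_hot.shape)  # (n_instances, n_distinct_categories), ~1,000 x ~150 on your data
```

On your real data this produces a roughly 1,000 x 150 indicator matrix that can be fed directly to a scikit-learn estimator.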
I am trying to work on an AI that composes possible winning bets. However, I don't know how to approach this kind of AI. I've made simple AIs that can, for example, detect the difference between humans and animals, but this one is a lot more complex.
Which AI model should I use? I don't think linear regression or K-Nearest Neighbours will work in this situation. I'm trying to experiment with neural networks, but I don't have any experience with them.
To make things a little clearer:
I have a MongoDB with fixtures, leagues, countries and predictions
A fixture contains two teams, a league id, and some more values
A league contains a country and some more values
A country is just an ID with, for example, a flag in SVG format
A prediction is a collection of different markets* with their probability
market = a way to place a bet, for example: home wins, away wins, both teams score
I also have a collection that contains information about which league's predictions are most accurate.
How would I go about creating this AI? All the data is in a very decent form; I just don't know how to begin. For example, what AI model should I use, and what inputs? Also, how would I go about saving the AI model and training it with new data that enters the MongoDB? (I have multiple cron jobs inserting data into the MongoDB.)
Note:
The AI should compose bets containing X number of fixtures
Because there is no "right" way to do this, I'll describe the most generic approach.
The first thing you want to figure out is the target of the model:
The label/target that you want to classify is the market. For simplicity, I suggest you use -1 for home, 0 for tie, and 1 for away.
data cleaning: remove outliers, complete/interpolate missing values, etc.
feature extraction:
convert categorical values using one-hot encoding.
scale the values of numeric features to the range 0 to 1.
remove all non-relevant features: those with very low entropy over the whole dataset or very high entropy within each of the labels.
try to extract logical features from the raw data that might help the classifier distinguish between the classes.
select features using (for example) mutual information gain.
try using a simple model such as Naive Bayes; if you have more time, you can use an SVM model. And remember: the no free lunch theorem, and also less is more - always prefer simple features and models. A minimal sketch of these steps follows below.
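A hedged sketch of those steps with scikit-learn, assuming a pandas DataFrame with made-up column names (home_team and away_team as categoricals, home_form and away_form as numerics) and a result column coded -1/0/1 as suggested above:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split

# Hypothetical fixture data pulled out of MongoDB into a DataFrame
df = pd.DataFrame({
    "home_team": ["A", "B", "C", "A"],
    "away_team": ["B", "C", "A", "C"],
    "home_form": [0.7, 0.4, 0.5, 0.8],   # made-up numeric feature
    "away_form": [0.3, 0.6, 0.5, 0.2],
    "result":    [-1, 1, 0, -1],          # -1 home, 0 tie, 1 away
})

X, y = df.drop(columns="result"), df["result"]

# One-hot encode the categoricals, scale the numerics to [0, 1]
preprocess = ColumnTransformer(
    [("cats", OneHotEncoder(handle_unknown="ignore"), ["home_team", "away_team"]),
     ("nums", MinMaxScaler(), ["home_form", "away_form"])],
    sparse_threshold=0.0,  # keep the output dense for GaussianNB
)

model = Pipeline([("prep", preprocess), ("clf", GaussianNB())])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model.fit(X_train, y_train)
print(model.predict(X_test))
```

The fitted Pipeline object can be pickled and reloaded, then refit periodically as your cron jobs add new fixtures.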
I have two columns with high-cardinality categorical values: one column (area_id) has 21,878 unique values and the other (page_entry) has 800 unique values. I am building a predictive ML model to predict the hits on a webpage.
column information:
area_id: all the locations that were visited during the session (contains the location code numbers of different areas of the webpage).
page_entry: describes the landing page of the session.
How can I convert these two columns into numerical form, apart from one-hot encoding?
Thank you.
One approach could be to group your categorical levels into smaller buckets using business rules. In your case, for the feature area_id you could simply group them based on their geographical location: say, all area_ids from a single district (or, for that matter, any other level of aggregation) would be replaced by a single id. Similarly, for page_entry you could group similar pages based on some attribute such as the nature of the web page, e.g. sports, travel, etc. In this way you could significantly reduce the number of dimensions of your variables.
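A minimal sketch of that bucketing idea in pandas, assuming you can build lookups from area_id to district and from page_entry to a page category (the mappings and values below are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "area_id":    [101, 102, 305, 402, 101],
    "page_entry": ["sports_home", "sports_scores", "travel_deals", "news_home", "sports_home"],
})

# Hypothetical business-rule lookups: area_id -> district, page_entry -> page category
area_to_district = {101: "district_A", 102: "district_A", 305: "district_B", 402: "district_C"}
page_to_category = {"sports_home": "sports", "sports_scores": "sports",
                    "travel_deals": "travel", "news_home": "news"}

df["district"] = df["area_id"].map(area_to_district)
df["page_category"] = df["page_entry"].map(page_to_category)

# One-hot encoding the grouped columns now produces far fewer dimensions
grouped_dummies = pd.get_dummies(df[["district", "page_category"]])
print(grouped_dummies.head())
```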
I'm working with data which consists of numerical and categorical features, where each input consists of a variable-sized group of the features.
For example: predict the price of a house using features about each room in the house, where each house could have a different number of rooms. The features could be size in meters, type (e.g. living room/bathroom/bedroom), color, floor...
Some of the categorical features have high cardinality, and I may be using many features.
I'd want to use the features from n rooms to predict the price for each house.
How would I structure my inputs/nn model to receive variable-sized groups of inputs?
I thought of using one-hot encoding, but then I'd end up with large input vectors and I'd lose the connections between the features for each room.
I also thought of using embeddings, but I'm not sure what the best way is to combine the features/samples to properly input all the data without losing any info about which features come from which samples etc.
As the article linked below suggests, you've got one of three routes to choose from:
Ordinal Encoding, which I think is not the right fit for your example.
One-Hot Encoding, which you've already ruled out, and rightly so.
Difference Encoding, which I think is somewhat suited, as there are master bedrooms, minor ones, guest ones, and children's ones. So try that angle (see the sketch after the link below).
Link to the beautiful article
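If you go the difference-encoding route, one option is the BackwardDifferenceEncoder from the third-party category_encoders package; whether it fits your exact setup is an assumption on my part, but a rough sketch looks like this:

```python
import pandas as pd
import category_encoders as ce   # pip install category_encoders

# Hypothetical per-room data
rooms = pd.DataFrame({
    "room_type": ["master bedroom", "guest bedroom", "bathroom", "living room"],
    "size_m2":   [25.0, 14.0, 6.0, 30.0],
})

# Contrast-code the room type; other columns are passed through unchanged
encoder = ce.BackwardDifferenceEncoder(cols=["room_type"])
encoded = encoder.fit_transform(rooms)
print(encoded.head())   # room_type replaced by difference-coded columns
```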
Happy coding :)
I am working with a medical data set that contains many variables with discrete outputs. For example: type of anesthesia, infection site, diabetes y/n. To deal with this, I have just been converting them into multiple columns of ones and zeros and then removing one column to make sure there is no direct correlation between them, but I was wondering if there is a more efficient way of doing this.
It depends on the purpose of the transformation. Converting categories to numerical labels may not make sense if the ordinal representation does not correspond to the logic of the categories. In this case, the "one-hot" encoding approach you have adopted is the best way to go, if (as I surmise from your post) the intention is to use the generated variables as the input to some sort of regression model. You can achieve what you are looking to do using pandas.get_dummies.
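A short sketch of exactly that workflow, with made-up column names; drop_first=True handles the "remove one column" step for you:

```python
import pandas as pd

# Hypothetical slice of the medical data set
df = pd.DataFrame({
    "anesthesia_type": ["general", "local", "regional", "general"],
    "diabetes":        ["y", "n", "n", "y"],
})

# One 0/1 indicator column per level, dropping the first level of each
# variable to avoid the perfect collinearity you describe.
dummies = pd.get_dummies(df, columns=["anesthesia_type", "diabetes"], drop_first=True)
print(dummies.head())
```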
Scikit-learn has fairly user-friendly python modules for machine learning.
I am trying to train an SVM tagger for Natural Language Processing (NLP) where my labels and input data are words and annotations. E.g. for Part-Of-Speech tagging, rather than using double/integer data as input tuples, [[1,2], [2,0]], my tuples will look like this: [['word','NOUN'], ['young', 'adjective']].
Can anyone give an example of how I can use the SVM with string tuples? The tutorial/documentation given here is for integer/double inputs: http://scikit-learn.org/stable/modules/svm.html
Most machine learning algorithms process input samples that are vectors of floats, such that a small (often Euclidean) distance between a pair of samples means that the two samples are similar in a way that is relevant for the problem at hand.
It is the responsibility of the machine learning practitioner to find a good set of float features to encode. This encoding is domain-specific, hence there is no general way to build that representation out of the raw data that would work across all application domains (various NLP tasks, computer vision, transaction log analysis...). This part of the machine learning modeling work is called feature extraction. When it involves a lot of manual work, it is often referred to as feature engineering.
Now for your specific problem: POS tags of a window of words around a word of interest in a sentence (e.g. for sequence tagging such as named entity detection) can be encoded appropriately by using the DictVectorizer feature extraction helper class of scikit-learn.
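A minimal sketch of that idea, where each token is described by a dict of string-valued window features; the feature names and tiny training set here are purely illustrative, not a prescribed scheme:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.svm import LinearSVC

# One dict of window features per token; values are strings, keys are feature names
X_dicts = [
    {"word": "the",   "prev_tag": "<s>", "next_word": "young"},
    {"word": "young", "prev_tag": "DET", "next_word": "cat"},
    {"word": "cat",   "prev_tag": "ADJ", "next_word": "sleeps"},
]
y = ["DET", "ADJ", "NOUN"]

vec = DictVectorizer()
X = vec.fit_transform(X_dicts)   # string-valued features become one-hot columns automatically

clf = LinearSVC()
clf.fit(X, y)
print(clf.predict(vec.transform([{"word": "dog", "prev_tag": "ADJ", "next_word": "runs"}])))
```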
This is not so much a scikit or python question, but more of a general issue with SVMs.
Data instances in SVMs must be represented as vectors of scalars of sorts, typically real numbers. Categorical attributes must therefore first be mapped to some numeric values before they can be included in SVMs.
Some categorical attributes lend themselves more naturally/logically to being mapped onto some scale (some loose "metric"). For example, a (1, 2, 3, 5) mapping for a Priority field with values of ('no rush', 'standard delivery', 'Urgent' and 'Most Urgent') may make sense. Another example may be colors, which can be mapped to 3 dimensions, one each for their Red, Green and Blue components, etc.
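As a small illustration of those two mappings (the field names and the exact numeric values are just the ones from the examples above, not a standard):

```python
# Ordered categorical: map onto a loose numeric scale
priority_scale = {"no rush": 1, "standard delivery": 2, "Urgent": 3, "Most Urgent": 5}
priority_feature = priority_scale["Urgent"]           # -> 3

# Color: map onto three dimensions, one per Red/Green/Blue component (0-255 here)
color_rgb = {"red": (255, 0, 0), "teal": (0, 128, 128), "white": (255, 255, 255)}
color_features = color_rgb["teal"]                     # -> (0, 128, 128)
```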
Other attributes don't have a semantic that allows even an approximate logical mapping onto a scale; the various values for these attributes must then be assigned an arbitrary numeric value on one (or possibly several) dimension(s) of the SVM. Understandably, if an SVM has many of these arbitrary "non-metric" dimensions, it can be less efficient at properly classifying items, because the distance computations and clustering logic implicit in the working of SVMs are less semantically related.
This observation doesn't mean that SVMs cannot be used at all when the items include non-numeric or non-"metric" dimensions, but it is certainly a reminder that feature selection and feature mapping are very sensitive parameters of classifiers in general and SVMs in particular.
In the particular case of POS-tagging, I'm afraid I'm stumped at the moment on which attributes of the labelled corpus to use and on how to map these to numeric values. I know that SVMTool can produce very efficient POS-taggers using SVMs, and several scholarly papers also describe taggers based on SVMs. However, I'm more familiar with the other approaches to tagging (e.g. with HMMs or Maximum Entropy).