Text Input and Numerical Output issue for Exam Schedule Predictor - python

I'm working on a model that predicts the exam date for a given course and term. The input is the term and the course name, and the output is the date. I've finished the data cleaning and preprocessing step, but I can't wrap my head around how to build a model whose input is two strings and whose output is two numbers (the day and month of the exam). One approach I thought of is to encode the course names and write the term as a binary list, i.e. input: encoded(course), [0,0,1]; output: day, month; and then feed that to a regression model.
I hope someone who's more experienced could tell me a better approach.

Before I start answering your question:
/rant
I know this sounds dumb and doesn't really help your question, but why are you using Neural Networks for this?!
To me, this seems like the classical case of "everybody uses ML/AI in their area, so now I have to, too!" (which is completely not true) /rant over
For string-like inputs, there are several encoding methods; choosing the right one depends on your specific task. As you have a very "simple" (and predictable) input, i.e. you know in advance that there will be no new/unseen course titles during testing/inference and you do not need contextual/semantic information, you can resort to something like scikit-learn's LabelEncoder, which turns each distinct string into an integer class ID. (For input features, scikit-learn's documentation points to OrdinalEncoder or OneHotEncoder, since LabelEncoder is intended for target labels.)
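To make the encoding idea concrete, here is a minimal plain-Python sketch of what an integer label encoding plus a binary term list (as in the question) could look like. It only mimics the idea and is not scikit-learn's implementation; the course names and terms are made up for illustration.

```python
def fit_index(values):
    """Map each distinct string to an integer, in sorted order."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


def one_hot(value, index):
    """Binary list with a single 1 at the position assigned to `value`."""
    vec = [0] * len(index)
    vec[index[value]] = 1
    return vec


# Hypothetical training data, not taken from the question.
courses = ["MATH101", "PHYS201", "MATH101", "CS150"]
terms = ["fall", "winter", "fall", "spring"]

course_index = fit_index(courses)  # integer ID per course name
term_index = fit_index(terms)      # position in the binary term list

# One row per exam: [course_id, term one-hot ...]
rows = [[course_index[c]] + one_hot(t, term_index) for c, t in zip(courses, terms)]
```

Note that a single integer column implies an ordering between courses that does not really exist, which is one reason one-hot (dummy) columns are often preferred for linear models.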
Alternatively, you could throw a more heavyweight encoding structure at the problem, one that embeds the values in a matrix. Most DL frameworks offer some form of built-in function for this, which basically requires you to pass a unique index for each input value, and actively learns a k-dimensional embedding vector for it. Intuitively, the dimensions of these embeddings correspond to semantic or topical directions. If you have, for example, 3-dimensional embeddings, the first dimension could represent "social sciences course", the second "technical course", and the third "seminar".
Of course, this is just a simplification of it, but helps imagining how it works.
For the output, predicting a specific date is actually a really good question. As I have personally never predicted dates myself, I can only recommend tips by other users. A nice answer on dates (as input) is given here.
If you can sacrifice a little bit of accuracy in the result, predicting the calendar week in which the exam takes place might be a good idea. Otherwise, you could simply treat the day and month as two regressed values, but you might end up with invalid combinations (e.g. negative days or months, or something like "31st February").
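As a rough sketch of both options, the helpers below (all names are my own, not from any library) either clamp a regressed (month, day) pair to the nearest valid date, or work with a single day-of-year target that can always be mapped back to a real date. It assumes the year is known from the term.

```python
import calendar
from datetime import date, timedelta


def clamp_to_valid_date(year, month_pred, day_pred):
    """Round two regressed values and force them into a valid calendar date."""
    month = min(max(round(month_pred), 1), 12)
    last_day = calendar.monthrange(year, month)[1]  # number of days in that month
    day = min(max(round(day_pred), 1), last_day)
    return date(year, month, day)


def date_to_day_of_year(d):
    """Single regression target: 1 for 1 January, 365 or 366 for 31 December."""
    return d.timetuple().tm_yday


def day_of_year_to_date(year, day_pred):
    """Map a (possibly out-of-range) day-of-year prediction back to a date."""
    days_in_year = 366 if calendar.isleap(year) else 365
    n = min(max(round(day_pred), 1), days_in_year)
    return date(year, 1, 1) + timedelta(days=n - 1)


# A prediction of "30.7 February" is pulled back to the last day of February.
fixed = clamp_to_valid_date(2023, 2.2, 30.7)
```

The single-target variant avoids invalid day/month combinations by construction; a calendar-week target would work the same way with a coarser unit.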
Depending on how much high-quality training data you have, results might vary quite heavily. Lastly, I would again recommend that you reconsider whether you actually need a neural network for this task, or whether there are simpler methods to do this.

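Create dummy variables (one-hot encode the course and term strings), or use a tree-based model such as a random forest, which copes well with categorical inputs and can produce numerical output. Note that scikit-learn's random forest estimators still expect the strings to be encoded as numbers first.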

How to determine most impactful input variables in a dataset?

I have a neural network program that takes in input variables and output variables and uses forecasted data to predict what the output variables should be. After running this program, I get an output vector. Let's say, for example, that my input matrix is 100 rows by 10 columns and my output is a vector with 100 values. How do I determine which of my 10 variables (columns) had the most impact on my output?
I've done a correlation analysis between each of my variables (columns) and my output and created a list of the highest correlation between each variable and output, but I'm wondering if there is a better way to go about this.
If what you're after is model selection (here, choosing which features to keep), it's not as simple as studying the correlation of your features with your target. For an in-depth, well-explained look at model selection, I'd recommend you read chapter 7 of The Elements of Statistical Learning. If what you're looking for is how to explain your network, then you're in for a treat as well, and I'd recommend reading this article for starters, though I won't go into the matter myself.
Naive approaches to model selection:
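There are a number of ways to do this.
The naïve way is to estimate all possible models, i.e. every combination of features. With 10 features that is 2^10 - 1 = 1023 non-empty subsets, so you would have to train the network over a thousand times, and the count doubles with every feature you add, which quickly becomes computationally infeasible.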
Another way is to take a variable you think is a good predictor and train the model only on that variable. Compute the error on the training data. Add another variable at random, retrain the model and recompute the error on the training data. If the error drops, keep the variable; otherwise discard it. Keep going for all features.
A third approach is the opposite. Start with training the model on all features and sequentially drop variables (a less naïve approach would be to drop variables you intuitively think have little explanatory power), compute the error on training data and compare to know if you keep the feature or not.
There are a million ways of going about this. I've outlined three of the simplest, but again, you can go really deep into this subject and find all kinds of different information (which is why I highly recommend you read that chapter :) ).
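The second approach above (adding one feature at a time and keeping it only if the error drops) can be sketched roughly as follows. The `evaluate` callback stands in for "retrain the model on this subset and return its error"; the `toy_error` function and feature names are purely illustrative assumptions, with a small per-feature penalty mimicking what you would typically see on held-out data.

```python
def forward_select(features, evaluate):
    """Greedy forward selection: try features in order, keep those that lower the error."""
    selected = []
    best_error = float("inf")
    for feature in features:
        candidate = selected + [feature]
        error = evaluate(candidate)  # in practice: retrain and measure the error
        if error < best_error:
            selected, best_error = candidate, error
    return selected, best_error


def toy_error(subset):
    """Hypothetical stand-in for model training: only x1 and x3 carry signal."""
    useful = {"x1": 0.5, "x3": 0.3}
    return 1.0 - sum(useful.get(f, 0.0) for f in subset) + 0.01 * len(subset)


chosen, err = forward_select(["x1", "x2", "x3", "x4"], toy_error)
```

The third approach (backward elimination) is the mirror image: start from the full list and drop a feature whenever removing it does not make the error worse.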

custom binary algorithm and neural network

I would like to understand machine learning techniques better. I have read and watched a bunch of material on Python, sklearn and supervised feed-forward nets, but I am still struggling to see how I can apply all this to my project and where to start. Maybe it is a little too ambitious for now.
I have the following algorithm, which generates nice patterns as binary inputs in a CSV file. The goal is to predict the next row.
The simplified logic of this algorithm is that the prediction for the next line (the top line being the most recent one) would be 0,0,1,1,1,0, and the one after that would become either 0,0,0,1,1,0 or go back to its previous step 0,1,1,1,0. However, you can see the model is slightly more complex and noisy, which is why I would like to introduce some machine learning here. I am aware that to get a reliable prediction I will need to introduce other relevant inputs afterwards.
Would someone please help me to get started and stand on my feet here?
I don't like throwing this out here without being able to provide a single piece of code, but I am confused as to where to start.
Should I pass as input each (line-1) as vectors and then the associated output would be the top line? Should I build the array manually with all my dataset?
I guess I have to use the sigmoid function, and Python seems the most common way to do this, but for the synapses (or weights) I understand I also need to provide a constant; should this be 1?
Finally, assuming I want this to run continuously, what would be required?
Could you please share readings or simplified tasks that could help me increase my knowledge of all this?
Many thanks.

Using logistic regression for a multiple touch response model (python/pandas)?

I have a bunch of contact data listing which members were contacted with which offer, which summarizes to something like this:
To make sense of it (and to make it more scalable) I was considering creating dummy variables for each offer and then using a logistic model to see how different offers impact performance:
Before I embark too far on this journey, I wanted to get some input on whether this is a sensible way to approach it (I have started playing around and got a model output, but haven't dug into it yet). Someone suggested I use linear regression instead, but I'm not really sure about the approach for that in this case.
What I'm hoping to get are coefficients that are interpretable, so I can see that mailing the 50% off offer in the 3rd mailing is not as impactful as the $25 giftcard etc., and then do this at scale (lots of mailings with lots of different offers) to draw some conclusions about the impact of timing of different offers.
My concern is that I will end up with a fairly sparse matrix where only some of the many possible combinations are represented, and I'm unsure what problems may arise from this. I've taken some online courses in ML but am new to it, and this is one of my first chances to work directly with it, so I'm hoping I can create something useful out of this. I have access to lots and lots of data; it's just a matter of getting something basic out that can show some value. Maybe there's already some work on this or even some kind of library I can use?
Thanks for any help!
If your target variable is binary (1 or 0) as in the second chart, then a classification model is appropriate. Logistic regression is a good first option; you could also use a tree-based model like a decision tree classifier or a random forest.
Creating dummy variables is a good move; you could also convert the discounts to numerical values if you want to keep them in a single column. However, this may not work so well for a linear model like logistic regression, as the relationship will probably not be linear.
If you wanted to model the first chart directly, you could use a linear regression to predict the conversion rate. I'm not sure what the difference is in doing this; it's actually something I've been wondering about for a while, and you've motivated me to post a question on stats.stackexchange.com.
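To illustrate the dummy-variable plus logistic regression idea, here is a small dependency-free sketch. The offers, counts and conversion outcomes are invented, and the hand-rolled gradient descent is only there to keep the example self-contained; in practice you would more likely reach for an existing library implementation.

```python
import math


def make_dummies(values, baseline):
    """One 0/1 column per level except the baseline, which is absorbed by the intercept."""
    levels = sorted(set(values) - {baseline})
    rows = [[1.0 if v == level else 0.0 for level in levels] for v in values]
    return levels, rows


def fit_logistic(rows, labels, lr=0.5, epochs=2000):
    """Plain batch gradient descent on the logistic log-loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            z = bias + sum(w * xi for w, xi in zip(weights, x))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted conversion probability
            grad_b += p - y
            for j, xi in enumerate(x):
                grad_w[j] += (p - y) * xi
        bias -= lr * grad_b / n
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
    return bias, weights


# Hypothetical contact data: 10 members per offer, 1 = converted.
offers = ["none"] * 10 + ["50_off"] * 10 + ["giftcard"] * 10
converted = [1] * 2 + [0] * 8 + [1] * 4 + [0] * 6 + [1] * 7 + [0] * 3

levels, X = make_dummies(offers, baseline="none")
bias, weights = fit_logistic(X, converted)
odds_ratios = {level: math.exp(w) for level, w in zip(levels, weights)}
```

Each coefficient is a log-odds shift relative to the baseline ("none" here), so `exp(coefficient)` reads as an odds ratio; that is the kind of interpretable comparison between offers the question asks for.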

Text classification in Python based on large dict of string:string

I have a dataset that would be equivalent to a dict of 5 millions key-values, both strings.
Each key is unique but there are only a couple hundreds of different values.
Keys are not natural words but technical references. The values are "families", grouping similar technical references. Similar is meant in the sense of "having similar regex", "including similar characters", or some sort of pattern.
Example of key-values:
ADSF33344 : G1112
AWDX45603 : G1112
D99991111222 : X3334
E98881188393 : X3334
A30-00005-01 : B0007
B45-00234-07A : B0007
F50-01120-06 : B0007
The final goal is to feed an algorithm with a list of new references (never seen before) and the algorithm would return a suggested family for each reference, ideally together with a percentage of confidence, based on what it learned from the dataset.
The suggested family can only come from the existing families found in the dataset. No need to "invent" new family name.
I'm not familiar with machine learning so I don't really know where to start. I saw some solutions through Sklearn or TextBlob and I understand that I'm looking for a classifier algorithm but every tutorial is oriented toward analysis of large texts.
Somehow, I don't find how to handle my problem, although it seems to be a "simpler" problem than analysing newspaper articles in natural language...
Could you point me to sources or tutorials that could help me?
Make a training dataset, and train a classifier. Most classifiers work on the values of a set of features that you define yourself. (The kind of features depends on the classifier; in some cases they are numeric quantities, in other cases true/false, in others they can take several discrete values.) You provide the features and the classifier decides how important each feature is, and how to interpret their combinations.
By way of a tutorial you can look at chapter 6 of the NLTK book. The example task, the classification of names into male and female, is structurally very close to yours: Based on the form of short strings (names), classify them into categories (genders).
You will translate each part number into a dictionary of features. Since you don't show us the real data, nobody can give you concrete suggestions, but you should definitely make general-purpose features as in the book, and in addition you should make a feature out of every clue, strong or weak, that you are aware of. If supplier IDs differ in length, make a length feature. If the presence (or number or position) of hyphens is a clue, make that into a feature. If some suppliers' parts use a lot of zeros, ditto. Then make additional features for anything else that might be useful, e.g. "first three letters". Once you have a working system, experiment with different feature sets and different classifier engines and algorithms until you get acceptable performance.
To get good results with new data, don't forget to split up your training data into training, testing and evaluation subsets. You could use all this with any classifier, but the NLTK's Naive Bayes classifier is pretty quick to train so you could start with that. (Note that the features can be discrete values, e.g. first_letter can be the actual letter; you don't need to stick to boolean features.)
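A feature extractor along the lines described above might look like the sketch below. The feature names are my own choices, and the sample references are the ones from the question; which clues actually matter depends on the real data.

```python
def part_features(ref):
    """Turn one technical reference into a dictionary of simple surface features."""
    return {
        "length": len(ref),
        "hyphens": ref.count("-"),
        "zeros": ref.count("0"),
        "digits": sum(ch.isdigit() for ch in ref),
        "first_letter": ref[:1],
        "first_three": ref[:3],
    }


# A few key/value pairs from the question.
labelled = [
    ("ADSF33344", "G1112"),
    ("D99991111222", "X3334"),
    ("A30-00005-01", "B0007"),
    ("F50-01120-06", "B0007"),
]

# List of (feature dict, label) pairs.
train_set = [(part_features(ref), family) for ref, family in labelled]
```

A list of (feature dict, label) pairs like `train_set` is the input shape used in the NLTK book's name-classification example, so it should be usable with NLTK's Naive Bayes classifier as a starting point.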

Machine Learning in Python - Get the best possible feature-combination for a label

My Question is as follows:
I know a little bit about ML in Python (using NLTK), and it works OK so far. I can get predictions given certain features. But I want to know: is there a way to display the best features to achieve a label? I mean the direct opposite of what I've been doing so far (putting in all the circumstances and getting a label for them).
I try to make my question clear via an example:
Let's say I have a database with Soccer games.
The Labels are e.g. 'Win', 'Loss', 'Draw'.
The Features are e.g. 'Windspeed', 'Rain or not', 'Daytime', 'Fouls committed' etc.
Now I want to know: Under which circumstances will a Team achieve a Win, Loss or Draw? Basically I want to get back something like this:
Best conditions for Win: Windspeed=0, No Rain, Afternoon, Fouls=0 etc
Best conditions for Loss: ...
Is there a way to achieve this?
My paint skills aren't the best!
All I know is theory, so well you'll have to look for the code..
If you have only 1 case(The best for "x" situations) the diagram becomes something like (It won't be 2-D, but something like this):
Green (Win), Orange(Draw), Red(Lose)
Now if you want to predict whether the team wins, loses or draws, you have (at least) 2 models to classify:
A linear classifier: the separator is the perpendicular bisector of the line joining the 2 points:
K-nearest-neighbours: it is done just by calculating the distance from all the points, and classifying the point as the same as the closest..
So, for example, if you have a new data, and have to classify it, here's how:
We have a new point, with certain attributes..
We classify it by seeing/calculating which side of the line the point falls on (or by seeing how far it is from our benchmark situations).
Note: You will have to give some weightage to each factor, for more accuracy..
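Here is a rough sketch of the nearest-neighbour idea with per-feature weights, as described above. The benchmark situations, the feature layout and the weights are all invented for illustration.

```python
import math


def weighted_distance(a, b, weights):
    """Euclidean distance where each feature's squared difference is scaled by a weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def classify(point, examples, weights):
    """Return the label of the closest known example (1-nearest-neighbour)."""
    _, label = min(examples, key=lambda ex: weighted_distance(point, ex[0], weights))
    return label


# Hypothetical benchmark situations: (windspeed in km/h, rain 0/1, fouls committed)
examples = [
    ((0.0, 0.0, 2.0), "Win"),
    ((20.0, 1.0, 10.0), "Draw"),
    ((40.0, 1.0, 20.0), "Loss"),
]
weights = (1.0, 1.0, 1.0)  # raise a weight to make that factor count more

prediction = classify((5.0, 0.0, 4.0), examples, weights)
```

With real data, features on very different scales (wind speed versus a 0/1 rain flag) would usually be rescaled first, otherwise the largest-valued feature dominates the distance.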
You could compute how well each feature separates the classes via feature weighting. The most common method for feature selection (and therefore feature weighting) in text classification is chi-squared (χ²). This measure will tell you which features are better. Based on this information you can analyse the specific values that are best for every case. I hope this helps.
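As a minimal illustration, the classic chi-squared statistic for a feature-versus-label contingency table can be computed as below. The counts are made up; a larger value suggests the feature and the label are less likely to be independent.

```python
def chi_square(table):
    """Pearson chi-squared statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if feature and label were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical counts. Rows: feature value; columns: [Win, not Win].
rain_table = [[10, 30],     # rain
              [40, 20]]     # no rain
daytime_table = [[20, 20],  # afternoon
                 [30, 30]]  # evening

rain_score = chi_square(rain_table)
daytime_score = chi_square(daytime_table)
```

In this toy data, rain gets a high score while daytime scores zero, so rain would be ranked as the more informative feature for "Win".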
Regards,
Not sure if you have to do this in python, but if not, I would suggest Weka. If you're unfamiliar with it, here's a link to a set of tutorials: https://www.youtube.com/watch?v=gd5HwYYOz2U
Basically, you'd just need to write a program to extract your features and labels and then output a .arff file. Once you've generated a .arff file, you can feed this to Weka and run myriad different classifiers on it to figure out what model best fits your data. If necessary, you can then program this model to operate on your data. Weka has plenty of ways to analyze your results and to graphically display said results. It's truly amazing.
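For reference, a tiny writer for the .arff text format could look like this sketch. The relation name, attributes and rows are hypothetical, and it skips details such as quoting values that contain spaces or commas.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes: list of (name, spec), where spec is the string "numeric"
    or a list of allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, spec in attributes:
        if spec == "numeric":
            lines.append(f"@attribute {name} numeric")
        else:
            values = ",".join(spec)
            lines.append(f"@attribute {name} {{{values}}}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical soccer data; the last attribute is the class label.
attributes = [
    ("windspeed", "numeric"),
    ("rain", ["yes", "no"]),
    ("result", ["win", "loss", "draw"]),
]
rows = [(0, "no", "win"), (25, "yes", "loss")]

arff_text = to_arff("soccer", attributes, rows)
```

Writing `arff_text` to a file such as `games.arff` (name is arbitrary) should give something Weka can open directly.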
