This may be a stupid question, but I am new to ML and can't seem to find a clear answer.
I have implemented an ML algorithm in a Python web app.
Right now I am storing the data the algorithm uses in an offline CSV file, and every time the algorithm is run it analyzes all of the data (one new piece of data gets added each time the algorithm is used).
Apologies if I am being too vague, but I am wondering how one should generally go about implementing the data and algorithm properly so that:
The data isn't stored in a CSV (Do I simply store it in a database like I would with any other type of data?)
Some form of preprocessing is used so that the ML algorithm doesn't have to analyze the same data repeatedly each time it is used (or does it have to given that one new piece of data is added every time the algorithm is used?).
The data isn't stored in a CSV (Do I simply store it in a database like I would with any other type of data?)
You can store the data in whatever format you like; a database is a perfectly reasonable choice.
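For instance, a minimal sketch using SQLite (which ships with Python; the observations table and its columns are made up for illustration):

import sqlite3

# Hypothetical schema: one row per data point the algorithm consumes.
conn = sqlite3.connect("app_data.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS observations (feature1 REAL, feature2 REAL, label INTEGER)"
)

# Append the new data point instead of rewriting a CSV.
conn.execute("INSERT INTO observations VALUES (?, ?, ?)", (0.3, 1.7, 1))
conn.commit()

# Load everything back when the model needs it.
rows = conn.execute("SELECT feature1, feature2, label FROM observations").fetchall()
conn.close()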
Some form of preprocessing is used so that the ML algorithm doesn't have to analyze the same data repeatedly each time it is used (or does it have to given that one new piece of data is added every time the algorithm is used?).
This depends very much on what algorithm you use. Some algorithms can easily be implemented to learn in an incremental manner. For example, Linear/Logistic Regression implemented with Stochastic Gradient Descent could easily just run a quick update on every new instance as it gets added. For other algorithms, full re-trains are the only option (though you could of course elect not to always do them over and over again for every new instance; you could, for example, simply re-train once per day at a set point in time).
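As a sketch of the incremental option (assuming scikit-learn and a binary-classification setup, neither of which is given in the question), SGDClassifier.partial_fit can fold in each new instance without a full re-train:

import numpy as np
from sklearn.linear_model import SGDClassifier

# Train once on the data accumulated so far.
X_history = np.array([[0.2, 1.1], [0.9, 0.4], [0.5, 0.5]])
y_history = np.array([0, 1, 0])

model = SGDClassifier(loss="log_loss")  # logistic regression via SGD ("log" on older scikit-learn)
model.partial_fit(X_history, y_history, classes=np.array([0, 1]))

# Later, when one new data point arrives, update instead of re-training.
x_new = np.array([[0.7, 0.3]])
y_new = np.array([1])
model.partial_fit(x_new, y_new)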
I want to map failed examples back to identifying metadata like name, id, etc., so I can look more closely at them. The easiest way I can think of to do this would be to leave the id field in the feature set when I call the fit function. However, I don't want the model to train on these metadata fields. Is there any way to fit a model while ignoring some features? Or is there some better way to map failed examples back to their identifying metadata?
First of all, you should be looking at the "failed examples" in your test, not in your training dataset. I'm going to assume that is what you want to do - but it works the same way for training data also. The question becomes, how to set up the data set so that you can trace back individual data points that the model doesn't perform well on.
I'm also going to assume that your data is in a dataframe. Let's say you have the columns [feature1, feature2, id]. Then whatever shuffling and splitting into train/test/validation data you do, you do on the full data frame - features and metadata get moved together.
Finally, you pass df[[feature1, feature2]] to your model. Now your feature data and your full data are indexed in exactly the same way. After identifying a data point the model does not work well on, you can get its id and other metadata by looking at the original dataframe at the same index.
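A minimal sketch of that idea (assuming pandas and scikit-learn; the column names, values and model are made up for illustration):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "feature1": [0.1, 0.4, 0.8, 0.3],
    "feature2": [1.2, 0.7, 0.5, 0.9],
    "id":       ["a1", "b2", "c3", "d4"],   # metadata, never passed to fit()
    "label":    [0, 1, 1, 0],
})

# Shuffle/split the full frame so features and metadata stay aligned by index.
train_df, test_df = train_test_split(df, test_size=0.5, random_state=0, stratify=df["label"])

features = ["feature1", "feature2"]
model = LogisticRegression().fit(train_df[features], train_df["label"])

# Find misclassified test rows, then read their metadata off the same rows.
preds = model.predict(test_df[features])
failed = test_df[preds != test_df["label"]]
print(failed["id"].tolist())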
I am currently working on my master's thesis. My topic is the development of a duplicate checking system in ABAP for a customer in the SAP environment. Up to now it works as follows: if the checking system cannot tell for certain whether an incoming data record is a duplicate or not, a clerk intervenes and makes the final decision. The clerk is to be "replaced" by a machine learning model located on a Python server, so that the clerk intervenes less in the process and increasingly becomes an "AI trainer". The communication between the duplicate checking system and the ML model is done via a REST API. The ML model should decide whether incoming data records are duplicates or not.
My first problem is that I don't have any training data to create an ML model with. The second problem is that I still don't know exactly what structure my training data should have. It might be possible to get real data records from the client, but for various reasons this is quite unlikely. Is there a way to generate "synthetic data" to create an ML model? What could it look like for my application? Which tools could I use to make my work a little easier?
Best regards
You can't.
If you don't have any real-world data, nor a record of how humans classified it, then you cannot train an ML system to classify real-world data.
What you can do is train a system with data you generated and classified in a way you believe to be similar to what the program might encounter in the wild. This would allow you to try your methodology and see if it works at all. But if you want to create an ML model which is actually useful in the real world, you need to repeat the training process with real-world data.
There is a lot of variability in what your data could be. You need to narrow down what it will actually look like.
You can always create random data:
>>> import numpy as np
>>> N = 5
>>> np.random.random((N, N))  # N x N array of random floats in [0, 1)
Create a whole bunch of random arrays, then copy one of them, and test whether your system catches the duplicate.
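A minimal sketch of that idea (purely illustrative; real customer records would need a fuzzier notion of "duplicate" than exact equality):

import numpy as np

rng = np.random.default_rng(42)

# Generate synthetic "records" as random feature vectors.
records = rng.random((100, 5))

# Plant a duplicate by copying an existing record to the end.
records = np.vstack([records, records[17]])

# Naive duplicate check: count rows that appear more than once.
_, counts = np.unique(records, axis=0, return_counts=True)
print("duplicate rows found:", int((counts > 1).sum()))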
For example, I have email sending logs like this:
day_of_week | time    | cta        | is_opened | is_clicked
1           | 10:00AM | CLICK HERE | True      | False
7           | 07:30PM | BUY NOW    | False     | False
...
I want to write a program to see "best performing day and time to send emails".
This example only covers the sending day/time; I want to be able to add extra parameters (like CTA, sender name, etc.) when I need them.
Is machine learning the best way to do this? (I have no experience with ML.) I'm experienced with Python and I think I could use TensorFlow for it.
ps: These are marketing emails that we send to our members, not spam or malware.
There are two ways to frame your problem:
Given day, time, etc., predict whether the email will be opened/clicked or not.
Given CTA, etc., predict the best day and time to send the email.
For the first case, you can use a neural net, or any other classifier, to predict whether the email will be opened/clicked.
For the second case, which I assume is yours, you may look at multivariate regression, because the two variables you need to predict (day_of_week, time) may not be handled separately (e.g. by creating two models and predicting day_of_week and time independently); you need to predict both variables simultaneously. But you need to clean your data first, so it only contains opened/clicked emails.
And of course, you can implement either approach using TensorFlow.
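As a sketch of the first framing (using scikit-learn rather than TensorFlow purely to keep it short; the tiny in-line log and the feature encoding are assumptions):

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

log = pd.DataFrame({
    "day_of_week": [1, 7, 3, 1],
    "time":        ["10:00AM", "07:30PM", "09:00AM", "06:00PM"],
    "cta":         ["CLICK HERE", "BUY NOW", "CLICK HERE", "BUY NOW"],
    "is_opened":   [True, False, True, False],
})

# Turn send time into a numeric hour and one-hot encode the CTA text.
log["hour"] = pd.to_datetime(log["time"], format="%I:%M%p").dt.hour
X = pd.get_dummies(log[["day_of_week", "hour", "cta"]], columns=["cta"])
y = log["is_opened"]

model = RandomForestClassifier(random_state=0).fit(X, y)

# Score one candidate slot; in practice you would scan a grid of
# day/hour combinations and pick the highest predicted open probability.
candidate = pd.DataFrame([{"day_of_week": 2, "hour": 10, "cta_BUY NOW": 0, "cta_CLICK HERE": 1}])
print(model.predict_proba(candidate[X.columns])[0])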
I have a dataset of IT operations tickets with fields like Ticket No, Description, Category, Sub Category, Priority, etc.
What I need to do is use the available data (except the ticket number) to predict the ticket priority. Sample data is shown below.
Number | Priority | Created_on | Description           | Category    | Sub Category
719515 | MEDIUM   | 05-01-2016 | MedWay 3rd Lucene.... | Server      | Change
720317 | MEDIUM   | 07-01-2016 | DI - Medway 13146409  | Application | Incident
720447 | MEDIUM   | 08-01-2016 | DI QLD Chermside....  | Application | Medway
Please guide me on this.
Answering without more information is a bit tough, and this is more of a context question than a code question. But here is the logic I would use to start evaluating this problem. Keep in mind it might involve writing a few separate scripts, each performing part of the task.
Try breaking the problem up into smaller pieces. You cannot do an analysis without all the data, so start by creating the data.
You already have the category and sub category: make a list of all the unique factors in each and create a set of weights for each based on your system and business needs. As you make sub category weights, keep in mind how they will interact with categories (+/- as well as magnitude).
Write a script to read the description and count all the non-trivial words. Create some kind of classification for words to help you build lists that will inform the model alongside the categories and sub categories.
Is the value an error message, or machine name, or some other code or type of problem you can extract using key words?
How are all the word groupings meaningful?
How would they contribute to making a decision?
Think about the categories when you decide these things.
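A minimal sketch of the word-counting step (the stop-word list, tokenization and example descriptions below are placeholders):

import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "on", "for", "is"}

descriptions = [
    "MedWay 3rd Lucene index rebuild failed on server",
    "DI - Medway 13146409 application error on login",
]

counts = Counter()
for text in descriptions:
    # Lowercase, split on non-letters, drop trivial and very short words.
    words = re.findall(r"[a-z]+", text.lower())
    counts.update(w for w in words if w not in STOP_WORDS and len(w) > 2)

print(counts.most_common(10))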
Then, with all of the parts, decide on a model, then build, test and refine it. I know there is no code in this, but the problem-solving part of data science happens outside of code most of the time.
You need to come up with the code yourself. If you get stuck, post an edit and we can help.
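For instance, one common way to wire the pieces together once you have the data (a sketch assuming scikit-learn; the tiny in-line table and its priority labels are invented so the snippet runs):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

tickets = pd.DataFrame({
    "Description":  ["MedWay 3rd Lucene index rebuild", "DI - Medway 13146409", "DI QLD Chermside outage"],
    "Category":     ["Server", "Application", "Application"],
    "Sub Category": ["Change", "Incident", "Medway"],
    "Priority":     ["MEDIUM", "MEDIUM", "HIGH"],
})

# Text goes through TF-IDF, categorical fields get one-hot encoded.
preprocess = ColumnTransformer([
    ("text", TfidfVectorizer(), "Description"),
    ("cats", OneHotEncoder(handle_unknown="ignore"), ["Category", "Sub Category"]),
])

model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
model.fit(tickets.drop(columns="Priority"), tickets["Priority"])

print(model.predict(pd.DataFrame([{
    "Description": "Medway server index error",
    "Category": "Server",
    "Sub Category": "Incident",
}])))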
I'm fairly new to data science and only started using Python roughly two months ago. I've been trying to do a Kaggle competition for fun (catsVsDogs) to try to learn things along the way. However, I'm stuck at the very first step. The problem is that the training set contains about 25000 .jpg images of cats and dogs and the whole directory is approximately 800 MB in size. Whenever I try to load the directory into Python and save all the images in a matrix (say we have 100 images of size (300, 200); I would like to save them in a matrix of size 100 x 60000), I get either a memory error or the system just stops processing. I'm using Canopy on a Mac.
I've been trying to read a lot on the internet to find out how people deal with these big images, but it has been a week and I still couldn't find any good source. I would highly appreciate it if somebody helped me out or could point me to a link that describes this situation.
Here's the link to the Kaggle competition (you can see there are no prizes involved; it's just for the sake of learning):
https://www.kaggle.com/c/dogs-vs-cats/data
The question is: how do I manage to load this big dataset into Python using Canopy and start training a neural network? Or, more generally, how do I deal with big datasets on a single computer without memory errors?
I would recommend making an index of the items you wish to read (a directory listing). Next, read just the first item, train using just that item, remove that item from memory, move on to the next item, and repeat. You shouldn't need to have more than a few images in memory at any given time.
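A minimal sketch of that pattern (assuming Pillow and NumPy are installed and the unzipped training images live in a hypothetical train/ directory; the actual training step is left as a placeholder):

import os
import numpy as np
from PIL import Image

IMAGE_DIR = "train"  # assumption: the unzipped Kaggle training images live here

# The index of items to read, instead of loading everything up front.
filenames = sorted(f for f in os.listdir(IMAGE_DIR) if f.endswith(".jpg"))

for name in filenames:
    # Load one image, turn it into a flat feature vector, then let it go.
    with Image.open(os.path.join(IMAGE_DIR, name)) as img:
        x = np.asarray(img.convert("L").resize((200, 300)), dtype=np.float32).ravel() / 255.0

    label = 1 if name.startswith("dog") else 0  # filenames look like dog.0.jpg / cat.0.jpg

    # model.partial_fit(x.reshape(1, -1), [label], classes=[0, 1])  # e.g. an incremental learner
    # x is overwritten on the next iteration, so only one image sits in memory at a time.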