I want to map failed examples back to identifying metadata like name, id, etc., so I can look more closely at them. The easiest way I can think of to do this would be to leave the id field in the feature set when I call the fit function. However, I don't want the model to train on these metadata fields. Is there any way to fit a model while ignoring some features? Or is there some better way to map failed examples back to their identifying metadata?
First of all, you should be looking at the "failed examples" in your test set, not in your training set. I'm going to assume that is what you want to do, but it works the same way for training data as well. The question then becomes how to set up the dataset so that you can trace back the individual data points the model performs poorly on.
I'm also going to assume that your data is in a dataframe. Say you have the columns [feature1, feature2, id]. Whatever shuffling and splitting into train/test/validation sets you do, you do on the full dataframe, so features and metadata move together.
Finally, you pass df[[feature1, feature2]] to your model. Now your feature data and your full data are indexed in exactly the same way. After identifying a data point the model does not work well on, you can get its id and other metadata by looking at the same index in the original dataframe.
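A minimal sketch of that workflow, assuming a CSV with hypothetical columns feature1, feature2, id, a label column target, and a scikit-learn classifier standing in for whatever model you actually use:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    df = pd.read_csv("data.csv")          # assumed file; id stays alongside the features
    features = ["feature1", "feature2"]   # only these columns are given to the model

    train_df, test_df = train_test_split(df, test_size=0.2, random_state=0)

    model = LogisticRegression()
    model.fit(train_df[features], train_df["target"])

    # Predict on the test split; the DataFrame index is preserved.
    preds = model.predict(test_df[features])
    failed = test_df[preds != test_df["target"]]

    # "failed" still carries id and any other metadata columns.
    print(failed[["id", "target"]])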
I have a dataset composed of several large CSV files. Their total size is larger than the RAM of the machine on which the training is executed.
I need to train an ML model from scikit-learn, TensorFlow, or PyTorch (think SVR, not deep learning). I need to use the whole dataset, which is impossible to load at once. Any recommendations on how to overcome this?
I have been in this situation before, and my suggestion would be to take a step back and look at the problem again.
Does your model absolutely need all of the data at once, or can it be trained in batches? It's also possible that the model you are using can be trained in batches, but the library you are using does not support it. In that case, either find a library that does support batches or, if such a library does not exist (unlikely), "reinvent the wheel" yourself, i.e., implement the model from scratch with batch support. However, as your question mentions, you need to use a model from scikit-learn, TensorFlow, or PyTorch. If you want to stick with those libraries, there are techniques such as those that Alexey Larionov and I'mahdi mentioned in comments to your question for PyTorch and TensorFlow.
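As a rough sketch of the batch route in scikit-learn: SVR itself has no partial_fit, so SGDRegressor stands in here as an incremental alternative, and the file layout and column names are assumptions:

    import glob
    import pandas as pd
    from sklearn.linear_model import SGDRegressor

    model = SGDRegressor()  # supports incremental fitting via partial_fit

    for path in glob.glob("data/*.csv"):          # hypothetical folder of large CSVs
        # Read each file in chunks small enough to fit in RAM.
        for chunk in pd.read_csv(path, chunksize=100_000):
            X = chunk.drop(columns=["target"])    # assumed numeric feature columns
            y = chunk["target"]                   # assumed label column
            model.partial_fit(X, y)

Keep in mind that SGD-based models are sensitive to feature scaling, so in practice you would also fit a scaler incrementally before the model.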
Is all of your data actually relevant? Once I found that a whole subset of my data was useless for the problem I was trying to solve; another time I found that it was only marginally helpful. Dimensionality reduction, numerosity reduction, and statistical modeling may be your friends here. Here is a link to the Wikipedia page about data reduction:
https://en.wikipedia.org/wiki/Data_reduction
Not only will data reduction reduce the amount of memory you need, it can also improve your model. Bad data in means bad data out.
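If dimensionality reduction is an option, it can also be done out of core. A sketch with scikit-learn's IncrementalPCA, assuming the same chunked CSV layout and hypothetical column names as above:

    import pandas as pd
    from sklearn.decomposition import IncrementalPCA

    ipca = IncrementalPCA(n_components=10)

    # First pass: fit the projection batch by batch.
    for chunk in pd.read_csv("data/features.csv", chunksize=100_000):
        ipca.partial_fit(chunk.drop(columns=["target"]))

    # Second pass: transform each batch to the reduced representation.
    for chunk in pd.read_csv("data/features.csv", chunksize=100_000):
        reduced = ipca.transform(chunk.drop(columns=["target"]))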
I have a few lists of movement tracking data, which look something like this:
I want to create a list of outputs where I mark these large spikes, essentially flagging that there is movement at that point.
I applied a rolling standard deviation to the data with a window size of two and got this result:
Now I can see the spikes that mark the points of interest, but I am not sure how to detect them in code. Is there a statistical tool to measure these spikes that I can use to flag them?
There are several approaches that you can use for an anomaly detection task.
The choice depends on your data.
If you want to use a statistical approach, you can use measures like the z-score or the interquartile range (IQR).
Here you can find a tutorial for these measures.
Here, instead, you can find another tutorial for a statistical approach that uses mean and variance.
Last but not least, I also suggest you check how to use a control chart, because in some cases it is enough.
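For illustration, a z-score or IQR rule applied to the rolling standard deviation could flag the spikes. This is only a sketch: it assumes the signal is a 1-D array and that the thresholds (3 standard deviations, 1.5 × IQR) would be tuned to your data:

    import numpy as np
    import pandas as pd

    signal = np.loadtxt("tracking.csv")               # hypothetical input file
    rolling_std = pd.Series(signal).rolling(2).std()  # the rolling std you already computed

    # z-score rule: flag points far from the mean of the rolling std.
    z = (rolling_std - rolling_std.mean()) / rolling_std.std()
    is_spike = z.abs() > 3

    # IQR rule: flag points above the upper fence.
    q1, q3 = rolling_std.quantile([0.25, 0.75])
    iqr = q3 - q1
    is_spike_iqr = rolling_std > q3 + 1.5 * iqr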
This may be a stupid question, but I am new to ML and can't seem to find a clear answer.
I have implemented an ML algorithm in a Python web app.
Right now I am storing the data that the algorithm uses in an offline CSV file, and every time the algorithm is run, it analyzes all of the data (one new piece of data gets added each time the algorithm is used).
Apologies if I am being too vague, but I am wondering how one should generally go about implementing the data and algorithm properly so that:
The data isn't stored in a CSV (Do I simply store it in a database like I would with any other type of data?)
Some form of preprocessing is used so that the ML algorithm doesn't have to analyze the same data repeatedly each time it is used (or does it have to given that one new piece of data is added every time the algorithm is used?).
The data isn't stored in a CSV (Do I simply store it in a database like I would with any other type of data?)
You can store the data in whatever format you like.
Some form of preprocessing is used so that the ML algorithm doesn't have to analyze the same data repeatedly each time it is used (or does it have to given that one new piece of data is added every time the algorithm is used?).
This depends very much on what algorithm you use. Some algorithms can easily be implemented to learn in an incremental manner. For example, linear/logistic regression implemented with stochastic gradient descent can simply run a quick update on every new instance as it gets added. For other algorithms, full re-trains are the only option (though you need not do one for every single new instance; you could, for example, simply re-train once per day at a set time).
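A sketch of the incremental option, assuming a scikit-learn classifier that supports partial_fit and that each use of the web app yields one new labelled example (all names here are made up):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    model = SGDClassifier()
    classes = np.array([0, 1])  # every possible label must be declared for partial_fit

    def on_new_example(x, y):
        """Fold a single new instance into the model instead of re-training from scratch."""
        model.partial_fit(np.asarray(x).reshape(1, -1), [y], classes=classes)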
I have a dataset of IT operations tickets with fields like Ticket No, Description, Category, SubCategory, Priority, etc.
What I need to do is use the available data (except the ticket number) to predict the ticket priority. Sample data is shown below.
Number   Priority  Created_on  Description             Category     Sub Category
719515   MEDIUM    05-01-2016  MedWay 3rd Lucene....   Server       Change
720317   MEDIUM    07-01-2016  DI - Medway 13146409    Application  Incident
720447   MEDIUM    08-01-2016  DI QLD Chermside....    Application  Medway
Please guide me on this.
Answering without more detail is a bit tough, and this is more of a context question than a code question. But here is the logic I would use to start evaluating this problem. Keep in mind it might involve writing a few separate scripts, each performing part of the task.
Try breaking the problem up into smaller pieces. You cannot do an analysis without all the data, so start by creating the data.
You already have the category and sub category; make a list of all the unique values in each and create a set of weights for each based on your system and business needs. As you make the sub category weights, keep in mind how they will interact with the categories (+/- as well as magnitude).
Write a script to read the descriptions and count all the non-trivial words. Create some kind of classification for the words to help you build lists that will inform the model along with the categories and sub categories.
Is the value an error message, a machine name, or some other code or type of problem you can extract using key words?
How are all the word groupings meaningful?
How would they contribute to making a decision?
Think about the categories when you decide these things.
Then, with all of the parts in place, decide on a model, then build, test, and refine. I know there is no code in this, but the problem-solving part of data science happens outside of code most of the time.
You need to come up with the code yourself. If you get stuck, post an edit and we can help.
Simply put, what are the preferred practices for writing larger Python applications that use pandas DataFrames as their primary method of data representation?
I often find myself struggling with inconsistencies in DataFrames; sometimes invariants leak through in the data, datatypes are not what you expect, etc.
I'm wondering what the best practices are for writing larger, stable applications in pandas. I want to take advantage of the array representation of the data for speed, but I also want to make sure there is a clean way to define the "bounds" of a DataFrame and what it should contain. Things I am considering:
Assertions on receiving a dataframe from a caller.
Forcing a dataframe parameter to have specific dtypes.
Defining a dataframe "type" based upon the columns it has.
Opportunities for OOP at the DataFrame level
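For instance, is something along these lines reasonable (a rough sketch with made-up column names and dtypes), or is there a more idiomatic approach?

    import pandas as pd

    EXPECTED = {"user_id": "int64", "amount": "float64", "quantity": "int64"}

    def validate_orders(df: pd.DataFrame) -> pd.DataFrame:
        """Assert the frame has the expected columns and coerce the dtypes."""
        missing = set(EXPECTED) - set(df.columns)
        assert not missing, f"missing columns: {missing}"
        return df.astype(EXPECTED)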
Also, sorry for the vague nature of this. I'm starting on a project, and I want to ask this question before I get too far off course. I've been burned in the past by not enforcing enough structure when it comes to DataFrames.