Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
I got started with naive numerical prediction. Here is the training data
https://gist.github.com/karimkhanp/75d6d5f9c4fbaaaaffe8258073d00a75
Test data
https://gist.github.com/karimkhanp/0f93ecf5fe8ec5fccc8a7f360a6c3950
I wrote some basic scikit-learn code to train and test:
import pandas as pd
import pylab as pl
from sklearn import datasets
from sklearn import metrics, linear_model
from sklearn.linear_model import LogisticRegression, LinearRegression, BayesianRidge, OrthogonalMatchingPursuitCV, SGDRegressor
from datetime import datetime, date, timedelta


class NumericPrediction(object):
    def __init__(self):
        pass

    def dataPrediction(self):
        # Load the training data and turn each date into an integer timestamp (ns since epoch)
        Train = pd.read_csv("data_scientist_assignment.tsv", sep='\t', parse_dates=['date'])
        Train_visualize = Train  # note: same object as Train, not a copy
        Train['timestamp'] = Train.date.values.astype('int64')
        Train_visualize['date'] = Train['timestamp']
        print(Train.describe())

        # Features used for the regression
        x1 = ["timestamp", "hr_of_day"]

        # Load the test data and build the same timestamp feature
        test = pd.read_csv("test.tsv", sep='\t', parse_dates=['date'])
        test['timestamp'] = test.date.values.astype('int64')

        model = LinearRegression()
        model.fit(Train[x1], Train["vals"])
        # print(model)
        # print(model.score(Train[x1], Train["vals"]))
        print(model.predict(test[x1]))

        Train.hist()
        pl.show()


if __name__ == '__main__':
    NumericPrediction().dataPrediction()
But the accuracy is very low, because the approach is very naive. Any suggestions for improving it (in terms of algorithm, example, reference, or library)?
For starters, your 'test' set doesn't look right. Please check it.
Secondly, your model is doomed to fail. Plot your data - what do you see? Clearly we have a seasonality here, while linear regression assumes that observations are independent. It's important to observe that you are dealing here with time series.
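To make the seasonality point concrete, here is a minimal sketch on synthetic hourly data (an assumption, not the data from the question). It compares the question's raw features with a cyclical encoding of the hour of day. The sin/cos encoding is just an illustration I picked for staying inside scikit-learn; it is not a dedicated time series model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): two weeks of hourly values with a daily cycle plus noise.
rng = np.random.RandomState(0)
hours = np.arange(24 * 14)
hr_of_day = hours % 24
vals = 10 + 5 * np.sin(2 * np.pi * hr_of_day / 24) + rng.normal(0, 0.5, hours.size)

# Naive features, similar to the question: raw time index and hour of day.
X_naive = np.column_stack([hours, hr_of_day])

# Cyclical features: map the hour onto a circle so that 23:00 and 00:00 end up close.
angle = 2 * np.pi * hr_of_day / 24
X_cyclic = np.column_stack([np.sin(angle), np.cos(angle)])

# Train on the first ten days, evaluate on the remaining four.
train, test = slice(0, 24 * 10), slice(24 * 10, None)
naive_model = LinearRegression().fit(X_naive[train], vals[train])
cyclic_model = LinearRegression().fit(X_cyclic[train], vals[train])

r2_naive = naive_model.score(X_naive[test], vals[test])
r2_cyclic = cyclic_model.score(X_cyclic[test], vals[test])
print("R^2 with raw features:     ", r2_naive)
print("R^2 with cyclical features:", r2_cyclic)
```

On data like this, the straight line through the raw hour cannot follow the daily cycle, while the encoded features can. Real data will usually need more than this, but plotting and comparing scores this way is a cheap first check.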
The R language is excellent when it comes to time series, with advanced forecasting packages such as bsts. Still, Python will be just as good here, and the pandas module is going to serve you well. Mind that you might not necessarily need machine learning at all: check ARMA and ARIMA models. Bayesian structural time series are also excellent.
Here is a very good article that guides you through basics of dealing with time series data.
Closed 3 years ago.
I am working on a text classification problem. I have a huge amount of data, and when I try to fit it to a machine learning model I get a memory error. Is there any way to fit the data in parts to avoid the memory error?
Additional information
I am using the LinearSVC model.
I have training data of 1.1 million rows.
I have vectorized the text data using tf-idf.
The shape of the vectorized data that has to be fitted into the model is (1121063, 4235687).
Or is there any other way around this problem?
Unfortunately, I don't have any reproducible code for the same.
Thanks in advance.
The simple answer is not to use what I assume is the scikit-learn implementation of LinearSVC, and instead to use an algorithm/implementation that allows training in batches. The most common of these are neural networks, but several other algorithms exist. In scikit-learn, look for classifiers with the partial_fit method, which allows you to fit your classifier in batches. See e.g. this list
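A minimal sketch of batch training with partial_fit, under a few assumptions: the toy texts and labels are made up, and I swapped tf-idf for HashingVectorizer because it is stateless and can transform each batch without first seeing the whole corpus. SGDClassifier with the hinge loss is a linear SVM trained by stochastic gradient descent, so it is close in spirit to LinearSVC, not identical.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy corpus (assumption); in practice each batch would be read from disk.
texts = [
    "cheap pills buy now", "meeting moved to monday",
    "win money fast", "please review the attached report",
    "limited offer buy cheap", "lunch on friday with the team",
]
labels = np.array([1, 0, 1, 0, 1, 0])

# Stateless vectorizer: no fit step, fixed number of output columns.
vectorizer = HashingVectorizer(n_features=2 ** 18)
clf = SGDClassifier(loss="hinge", random_state=0)

# partial_fit needs the full set of classes up front, since a batch may miss some.
classes = np.unique(labels)

batch_size = 2
for start in range(0, len(texts), batch_size):
    X_batch = vectorizer.transform(texts[start:start + batch_size])
    y_batch = labels[start:start + batch_size]
    clf.partial_fit(X_batch, y_batch, classes=classes)

predictions = clf.predict(vectorizer.transform(texts))
print(predictions)
```

Only one batch is held in memory at a time, so the batch size can be tuned to whatever fits. Several passes over the data are usually needed for a good fit.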
You could also try what the documentation of sklearn.svm.SVC suggests (the second part; the first is using LinearSVC, which you already did):
For large datasets consider using sklearn.svm.LinearSVC or sklearn.linear_model.SGDClassifier instead, possibly after a sklearn.kernel_approximation.Nystroem transformer.
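If you check SGDClassifier(), you can set the parameter warm_start=True so that, as you iterate through your dataset, each call to fit starts from the previous solution instead of losing its state:
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier(warm_start=True)
# batches: any iterable of (X_batch, y_batch) chunks of your data
for X_batch, y_batch in batches:
    clf.fit(X_batch, y_batch)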
Additionally, you could reduce the dimension of your dataset by removing some words from your TF-IDF model. Check the max_df and min_df parameters: they remove words whose document frequency is higher than max_df or lower than min_df. Each can be given as a proportion of documents (a float) or as an absolute count (an integer).
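A small sketch of how max_df and min_df shrink the vocabulary. The four documents are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (assumption). Document frequencies:
# the=4, cat=2, sat=2, flew=2, dog=1, bird=1, away=1
docs = [
    "the cat sat",
    "the dog sat",
    "the bird flew",
    "the cat flew away",
]

# min_df=2   -> drop words that appear in fewer than 2 documents (integer = count)
# max_df=0.8 -> drop words that appear in more than 80% of documents (float = proportion)
vectorizer = TfidfVectorizer(min_df=2, max_df=0.8)
X = vectorizer.fit_transform(docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['cat', 'flew', 'sat']
print(X.shape)     # (4, 3)
```

Here "the" is dropped for being too common, and "dog", "bird" and "away" for being too rare. On a corpus with millions of distinct tokens, even min_df=2 often removes a large share of the columns, since many tokens occur only once.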
Closed 3 years ago.
Probably I'm missing something obvious: when I detrend my time series target data, my model performs way better. That's great. However, I'm trying to forecast an entire cycle and the trend is important. Is there a way to reconstitute the trend while keeping these better scores, or am I shooting myself in the foot by removing the trend in the first place?
The mean absolute error with the trend intact is on the order of 0.001-0.003; with the trend removed, the scores are around 0.0001.
Please provide more information.
What kind of model do you use?
Can you give an example of the time series e.g. pd.Series(data=[100,110,120,130,140])?
Have you checked for overfitting, meaning your model performs well on your current dataset but performs really poorly once new data comes in?
Does your time series really have a trend, or does it more or less move sideways (plot-wise speaking)?
You can also combine different models. For example, a linear model might be a good choice for modelling the trend. Once you have implemented the linear trend model, you can add another model that tries to predict where the linear trend model is wrong. So essentially you could add a random forest that predicts the residuals of the linear model.
Once you have both models, you can simply sum their predictions: the linear one for the general trend and the random forest for the seasonality.
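A minimal sketch of the trend-plus-residual idea on a synthetic series. The data, the period of 20 steps and the choice of "position within the cycle" as the forest's feature are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic series (assumption): linear trend + seasonal cycle + noise.
rng = np.random.RandomState(0)
t = np.arange(200)
period = 20
y = 0.5 * t + 10 * np.sin(2 * np.pi * t / period) + rng.normal(0, 0.5, t.size)

X_trend = t.reshape(-1, 1)              # feature for the trend model
X_season = (t % period).reshape(-1, 1)  # feature for the residual model

# Chronological split: never shuffle a time series.
train, test = slice(0, 160), slice(160, None)

# Step 1: fit the linear trend.
trend_model = LinearRegression().fit(X_trend[train], y[train])

# Step 2: fit a random forest on what the trend model gets wrong.
residuals = y[train] - trend_model.predict(X_trend[train])
residual_model = RandomForestRegressor(n_estimators=100, random_state=0)
residual_model.fit(X_season[train], residuals)

# Step 3: the forecast is the sum of both predictions, so the trend is added back.
trend_pred = trend_model.predict(X_trend[test])
combined_pred = trend_pred + residual_model.predict(X_season[test])

mae_trend = mean_absolute_error(y[test], trend_pred)
mae_combined = mean_absolute_error(y[test], combined_pred)
print("MAE, trend only:      ", mae_trend)
print("MAE, trend + residual:", mae_combined)
```

This also addresses the detrending question: the second model only ever sees detrended values, and the trend is reconstituted by adding the first model's prediction back. Scores should be compared on this summed forecast, not on the detrended target.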
You can also look into models which recognize seasonality by nature, such as ARIMA models for example.
Closed 5 years ago.
I have extracted unstructured textual data from approximately 3000 documents and I am attempting to use this data to classify these documents.
However, even after removing stopwords & punctuation and lemmatizing the data, the count vectorization produces more than 64000 features.
A lot of these features contain unnecessary tokens like random numbers and text in different languages.
The libraries I have used are:
tokenization: Punkt (NLTK)
pos tagging: Penn Treebank (NLTK)
lemmatization: WordNet(NLTK)
vectorization: CountVectorizer (sk-learn)
Can anyone suggest how I can reduce the number of features for training my classifier?
You have two options here, which can be combined:
Change your tokenization with stronger rules using regex to remove numbers or other tokens you are not interested in.
Use feature selection to keep a subset of your features that are relevant for the classification. Here is a demo sample of code to keep 50% of the features in data:
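from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectPercentile
from sklearn.feature_selection import chi2

# Iris stands in for your document-term matrix; chi2 requires non-negative features such as counts
iris = load_iris()
X, y = iris.data, iris.target
# Keep the 50% of features with the highest chi-squared score against the labels
selector = SelectPercentile(score_func=chi2, percentile=50)
X_reduced = selector.fit_transform(X, y)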
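For the first option (stricter tokenization), here is a rough sketch using CountVectorizer's token_pattern parameter. The regex and the two sample documents are assumptions; the pattern keeps only tokens made of at least three ASCII letters, which drops numbers and most non-English words, so adapt it to what you actually want to keep.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Sample documents (assumption) containing numbers, a date and a non-English word.
docs = [
    "Invoice 12345 paid on 2020-01-01",
    "Payment received für invoice 987",
]

# Keep only whole tokens consisting of 3 or more ASCII letters.
vectorizer = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{3,}\b")
X = vectorizer.fit_transform(docs)

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)  # ['invoice', 'paid', 'payment', 'received']
```

If you already tokenize and lemmatize with NLTK, the same filter can be applied to your token list before vectorizing. Either way it can be combined with the feature selection shown above.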
Closed 5 years ago.
How do I apply scikit-learn to a numpy array with 4 columns each representing a different attribute?
Basically, I'm wanting to teach it how to recognize a healthy patient from these 4 characteristics and then see if it can identify an abnormal one.
Thanks in advance!
A pipeline usually has the following steps:
Define a classifier/ regressor
from sklearn import svm
clf = svm.SVC(gamma=0.001, C=100.)
Fit the data
clf.fit(X_train,y_train)
Here X_train will be your four feature columns and y_train will be the labels indicating whether each patient is healthy.
Predict on new data
y_pred = clf.predict(X_test)
This tutorial is a great starting point for getting a basic idea of the pipeline.
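Putting the three steps together, here is a minimal end-to-end sketch. The patient data is synthetic (an assumption): 100 "healthy" and 100 "abnormal" rows with four numeric attributes each, labelled 0 and 1.

```python
import numpy as np
from sklearn import svm
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a NumPy array with 4 columns per patient.
rng = np.random.RandomState(0)
healthy = rng.normal(loc=0.0, scale=1.0, size=(100, 4))
abnormal = rng.normal(loc=3.0, scale=1.0, size=(100, 4))
X = np.vstack([healthy, abnormal])
y = np.array([0] * 100 + [1] * 100)  # 0 = healthy, 1 = abnormal

# Hold out a quarter of the patients to check the model on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 1. Define the classifier, 2. fit it, 3. predict on new data.
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

accuracy = (y_pred == y_test).mean()
print("Accuracy on held-out patients:", accuracy)
```

This assumes you have labelled examples of both healthy and abnormal patients. If you only have healthy ones, a classifier like this will not work and you would need an outlier detection approach instead.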
Look into the pandas package which allows you to import CSV files into a dataframe. pandas is supported by scikit-learn.
Closed 4 years ago.
I'm wondering if it is possible to include scikit-learn outlier detections like isolation forests in scikit-learn's pipelines?
So the problem here is that we want to fit such an object only on the training data and do nothing on the test data. Particularly, one might want to use cross-validation here.
What could a solution look like?
Build a class that inherits from TransformerMixin (and from BaseEstimator, for parameter tuning).
Now define a fit_transform function that stores whether it has been called before. If it hasn't been called yet, the function fits the outlier detector on the data and uses its predictions to filter that data. If it has been called before, the outlier detection has already been run on the training data, so we assume that we are now seeing the test data, which we simply return unchanged.
Does such an approach have a chance to work or am I missing something here?
Your problem is basically the outlier detection problem.
Fortunately, scikit-learn provides some functions to predict whether a sample in your training set is an outlier or not.
How does it work? If you look at the documentation, it basically says:
One common way of performing outlier detection is to assume that the regular data come from a known distribution (e.g. data are Gaussian distributed). From this assumption, we generally try to define the “shape” of the data, and can define outlying observations as observations which stand far enough from the fit shape.
sklearn provides some functions that allow you to estimate the shape of your data. Take a look at the elliptic envelope and isolation forests.
Personally, I prefer the IsolationForest algorithm, which returns an anomaly score for each sample in your training set. You can then remove the flagged samples from your training set.
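A minimal sketch of that last step: fit an IsolationForest on the training set only and drop the samples it flags. The data and the contamination value (the expected share of outliers) are assumptions.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic training set (assumption): 200 regular points plus two obvious outliers.
rng = np.random.RandomState(0)
X_train = np.vstack([
    rng.normal(0, 1, size=(200, 2)),
    [[8.0, 8.0], [-9.0, 7.0]],
])

# contamination is a guess at the fraction of outliers in the training data.
iso = IsolationForest(contamination=0.02, random_state=0)

# fit_predict returns +1 for inliers and -1 for outliers.
labels = iso.fit_predict(X_train)

# score_samples gives the per-sample anomaly score (lower = more abnormal).
scores = iso.score_samples(X_train)

# Keep only the inliers for training the downstream model.
X_train_clean = X_train[labels == 1]
print("Removed", len(X_train) - len(X_train_clean), "of", len(X_train), "samples")
```

The test data is left untouched: the detector is fitted and applied on the training set only, and the downstream model is then trained on X_train_clean. Inside cross-validation the same filtering would have to be repeated on each training fold.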