python - t-SNE: reduce memory consumption

I have a very large CSV file that I am trying to do a tf-idf analysis on. I am using sklearn's TfidfVectorizer, and then I want to reduce the dimensionality with t-SNE so I can plot the results. However, whenever I run t-SNE, I get a memory error. Is there a way I can prevent this from happening? I read that you can apply TruncatedSVD before running t-SNE. How would I go about doing this (or are there any other suggestions)? Here is my code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import NMF
from sklearn.manifold import TSNE

tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, max_features=n_features, stop_words='english', lowercase=False)
tfidf = tfidf_vectorizer.fit_transform(df['paper_text'])

# topic model on the tf-idf matrix, then standardise the embedding
nmf = NMF(n_components=n_topics, random_state=0, alpha=.1, l1_ratio=.5).fit(tfidf)
nmf_embedding = nmf.transform(tfidf)
nmf_embedding = (nmf_embedding - nmf_embedding.mean(axis=0)) / nmf_embedding.std(axis=0)

# 2-D embedding for plotting -- this is where the memory error occurs
tsne = TSNE(random_state=3211)
tsne_embedding = tsne.fit_transform(nmf_embedding)
tsne_embedding = pd.DataFrame(tsne_embedding, columns=['x', 'y'])
tsne_embedding['hue'] = nmf_embedding.argmax(axis=1)
Thanks in advance!
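One commonly suggested route (presumably the "TruncatedSVD" step you read about) is to reduce the sparse tf-idf matrix to a few dozen dense components first and hand only that to t-SNE. A minimal sketch, reusing the tfidf matrix from the snippet above; the choice of 50 components is an assumption you would tune:
from sklearn.decomposition import TruncatedSVD
from sklearn.manifold import TSNE

# reduce the sparse tf-idf matrix to 50 dense dimensions first,
# so t-SNE never has to touch the full high-dimensional data
svd = TruncatedSVD(n_components=50, random_state=0)
svd_embedding = svd.fit_transform(tfidf)

tsne = TSNE(random_state=3211)
tsne_embedding = tsne.fit_transform(svd_embedding)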

Related

Partial fit or incremental learning for autoregressive model

I have two time series representing two independent periods of data observation. I would like to fit an autoregressive model to this data. In other words, I would like to perform two partial fits, or two sessions of incremental learning.
This is a simplified description of a not-unusual scenario which could also apply to batch fitting on large datasets.
How do I do this (in statsmodels or otherwise)? Bonus points if the solution can generalise to other time-series models like ARIMA.
In pseudocode, something like:
import statsmodels.api as sm
from statsmodels.tsa.ar_model import AutoReg
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
res = AutoReg(data_1, lags=12).fit()
res.aic
# This is more like what I would like to do
model = AutoReg(lags=12)
model.partial_fit(data_1)
model.partial_fit(data_2)
model.results.aic
Statsmodels does not directly have this functionality. As Kevin S mentioned, though, pmdarima does have a wrapper that provides it, specifically the update method. Per their documentation: "Update the model fit with additional observed endog/exog values.".
See the example below, adapted to your particular code:
from pmdarima.arima import ARIMA
import statsmodels.api as sm
data = sm.datasets.sunspots.load_pandas().data['SUNACTIVITY']
data_1 = data[:len(data)//3]
data_2 = data[len(data)-len(data)//3:]
# This is the standard single fit usage
model = ARIMA(order=(12,0,0))
model.fit(data_1)
# update the model fit with the new observations
model.update(data_2)
I don't know how to achieve that with AutoReg. I think it can be done somehow, but you would need to evaluate the results manually or append the data yourself.
But in ARIMA and SARIMAX it's already implemented, and it's simple.
For incremental learning there are three related methods, documented here. The first is apply, which uses the fitted parameters on new, unrelated data. Then there are extend and append; append can refit. I don't know the exact difference, though.
Here is my example, which is different but similar:
import numpy as np
from statsmodels.tsa.api import ARIMA

data = np.array(range(200))
order = (4, 2, 1)
model = ARIMA(data, order=order)
fitted_model = model.fit()
prediction = fitted_model.forecast(7)
new_data = np.array(range(600, 800))
fitted_model = fitted_model.apply(new_data)
new_prediction = fitted_model.forecast(7)
print(prediction) # [200. 201. 202. 203. 204. 205. 206.]
print(new_prediction) # [800. 801. 802. 803. 804. 805. 806.]
This replaces all of the data, so it can be used on unrelated data (with an unknown index). I profiled it, and apply is very fast compared to fit.
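For completeness, here is a sketch of append and extend on the same kind of fitted model. This follows the statsmodels ARIMAResults API as I understand it: both expect data that continues the original series, and refit=True on append re-estimates the parameters on the combined data.
import numpy as np
from statsmodels.tsa.api import ARIMA

data = np.array(range(200), dtype=float)
fitted = ARIMA(data, order=(4, 2, 1)).fit()

more_data = np.array(range(200, 260), dtype=float)

# append: the new observations continue the original series;
# with refit=True the parameters are re-estimated on all the data
appended = fitted.append(more_data, refit=True)

# extend: also continues the series, but the results object keeps
# only the new observations and reuses the fitted parameters
extended = fitted.extend(more_data)

print(appended.forecast(3))
print(extended.forecast(3))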

I'm unable to transform my DataFrame into a Variable to store data-values/features (Linear Discriminant Analysis)

I'm using LDA to reduce two tables I've created, holds and latency, down from 9 and 18 features respectively (plus a target column each). I am currently trying to parse the features into a variable, but that doesn't seem to be working: I receive a KeyError(1) whenever I do. My data is perfectly fine (the original post showed a tail of both DataFrames). If anyone could tell me what's wrong with this code, I'd be very grateful:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
lda = LDA(n_components=2)
X = holds[[0,1,2,3,4,5,6,7,8]].values
Y = holds[9].values
X2 = latency[[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17]].values
Y2 = latency[9].values
This error has nothing to do with LDA or scikit-learn in general.
It comes from the way you are trying to index your pandas DataFrame.
Use this:
X = holds.iloc[: , [0,1,2,3,4,5,6,7,8]].values
Y = holds.iloc[:, 9].values
Similarly, for X2 and Y2.
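For the latency table, that would look like the following (assuming, as the feature counts in your question suggest, that its target sits in the last column, index 18, rather than 9):
X2 = latency.iloc[:, 0:18].values   # the 18 feature columns
Y2 = latency.iloc[:, 18].values     # the target column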

How to input twitter data (csv/txt) into DBSCAN python?

Could someone guide me on how I could cluster Twitter data using DBSCAN in Python? I am totally new to DBSCAN. Also, how do I determine the eps value and the iloc or loc value?
import numpy as np
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

def clusterEvaluate(cluster):
    count_cluster = np.bincount(cluster)
    count_cluster = np.argmax(count_cluster)
    same_clusters = np.count_nonzero(cluster == count_cluster) / np.size(cluster)
    return same_clusters

dataset = np.loadtxt('tweetdata.csv')  # not sure if this works
X = StandardScaler().fit_transform(dataset)
y_valid = dataset.iloc[:6].values()
dbscan = DBSCAN(eps=0.5, min_samples=5, metric='euclidean')
y = dbscan.fit_predict(X)
cluster_labels = np.unique(y)
same_clusters = []
for index in cluster_labels:
    cluster = y_valid[y == index]
    same_clusters.append(clusterEvaluate(cluster))
You need to choose an appropriate data representation and distance function for this. Furthermore, scalability will kill you.
I do not think it will work well. I haven't seen anything that gives insightful results beyond counting frequent words in an unnecessarily complex fashion. Twitter data is a bitch: the messages are just too short. All the good approaches, like LDA, need much longer documents.
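If you want to try it anyway, here is a minimal sketch of one common setup: tf-idf vectors of the tweet text with cosine distance instead of euclidean. The 'text' column name and the eps value are assumptions you would need to adapt to your file:
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# assumed: a CSV with a 'text' column holding the tweet bodies
tweets = pd.read_csv('tweetdata.csv')

# tf-idf gives each tweet a sparse vector; cosine is a more sensible
# distance than euclidean for this representation
X = TfidfVectorizer(stop_words='english').fit_transform(tweets['text'])

db = DBSCAN(eps=0.7, min_samples=5, metric='cosine').fit(X)
print(pd.Series(db.labels_).value_counts())  # -1 marks noise points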

How can I read in a sparse matrix saved in a CSV file? (Python 3.6.4)

Using CountVectorizer, I extracted feature vectors from thousands of emails and saved them in a CSV file:
import csv
from sklearn.feature_extraction.text import CountVectorizer

dictionary = open(r'''C:\Users\User\Desktop\csmp3\stemmedDictionary.txt''', "r")
dic = list(set(dictionary.read().splitlines()))
cv = CountVectorizer(vocabulary=dic, binary=True)

#~PRESENCE FEATURE VECTOR~#
#TRAIN
pdt = open(r'''C:\Users\User\Desktop\csmp3\presence-dataset-training-stemmed.csv''', "w")
matWriter = csv.writer(pdt, delimiter=',')
for i in range(1, 2):  # 45252
    processed_email = open(r'''C:\Users\User\Desktop\csmp3\processed\processed''' + str(i) + '''.txt''', "r")
    presence_array = cv.transform(processed_email)
    matWriter.writerow(presence_array)
    processed_email.close()
pdt.close()
This is part of a spam-filtering project using Naive Bayes, and our data set is rather large. I'm hoping to use this sparse matrix with Bernoulli Naive Bayes' partial_fit method. I just can't quite figure out how to load the sparse matrix back from the file. I've already tried numpy.loadtxt, but it gives me "ValueError: could not convert string to float: ".
Any help would be appreciated! Thank you!
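In case it helps, here is a sketch of one way to sidestep the CSV round-trip entirely: stack the per-email rows into a single sparse matrix and use scipy's native sparse format (save_npz/load_npz). This reuses the cv vectorizer from the snippet above; the .npz filename is just an example:
from scipy import sparse

# build all rows as one sparse matrix instead of writing CSV text
rows = []
for i in range(1, 2):  # 45252
    with open(r'''C:\Users\User\Desktop\csmp3\processed\processed''' + str(i) + '''.txt''', "r") as f:
        rows.append(cv.transform(f))
presence = sparse.vstack(rows)

# save and reload in scipy's native sparse format
sparse.save_npz('presence-dataset-training-stemmed.npz', presence)
presence = sparse.load_npz('presence-dataset-training-stemmed.npz')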

ML - Getting feature names after feature selection - SelectPercentile, python

I have been struggling with this one for a while.
My goal is to take a text feature that I have and find the best 5-10 words in it to help me classify. Hence, I am running a TfidfVectorizer and choosing the ~90 best features for now. However, after I downsize the feature count, I am unable to see which features were actually chosen.
Here is what I have:
import pandas
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif

train = pandas.read_csv("train.tsv", sep='\t')
labels_train = train["label"]

documents = []
for i, row in train.iterrows():
    documents.append(row['boilerplate'][1:-1].lower())

vectorizer = TfidfVectorizer(sublinear_tf=True, stop_words="english")
features_train_transformed = vectorizer.fit_transform(documents)

selector = SelectPercentile(f_classif, percentile=0.1)
selector.fit(features_train_transformed, labels_train)
features_train_transformed = selector.transform(features_train_transformed).toarray()
The result is that features_train_transformed contains a matrix of the tf-idf scores per selected word per document, but I have no idea which words were chosen, and methods like get_feature_names() are unavailable on SelectPercentile.
This is necessary because I need to add these features to a bunch of numeric features and only then make my training and predictions.
selector.get_support() gets you a boolean array of the columns that were within the percentile range you specified.
train.columns.values gets you the complete list of column names for the original dataframe.
Filtering the latter with the former gives you the names of the columns that make up your chosen percentile range.
The code below (cut-pasted from working code) is similar enough to yours that it's hopefully helpful:
import numpy as np
from sklearn.feature_selection import SelectPercentile, f_regression
selection = SelectPercentile(f_regression, percentile=2)
train_minus_target = train.drop("y", axis=1)
x_features = selection.fit_transform(train_minus_target, y_train)
columns = np.asarray(train_minus_target.columns.values)
support = np.asarray(selection.get_support())
columns_with_support = columns[support]
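Translated back to your tf-idf case, where the feature names live in the vectorizer rather than in dataframe columns, a sketch (get_feature_names was the vectorizer method in older scikit-learn; newer versions call it get_feature_names_out):
import numpy as np

# names come from the vectorizer, mask from the selector
feature_names = np.asarray(vectorizer.get_feature_names())
chosen_words = feature_names[selector.get_support()]
print(chosen_words)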
Reference:
about get_support
