I'm looking for a way to improve the quality of the eigenvectors produced by sklearn's TruncatedSVD. The documentation at scikit-learn.org suggests that the n_oversamples parameter is a good place to start. My input is a sparse 2200 x 2200 matrix, provided as three separate files containing the row indexes, column indexes, and data values. Here's my code:
from array import array
import sys
import numpy as np
import struct
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import csr_matrix

path = "c:\\users\\lenwh\\documents\\wikipedia\\weights\\"
file = sys.argv[1]
dims = int(sys.argv[2])  # I use 300

# Load the row indexes, column indexes, and data values from their binary files.
with open(path + file + ".rows", "rb") as f:
    rows = np.fromfile(f, dtype=np.int32)
with open(path + file + ".cols", "rb") as f:
    cols = np.fromfile(f, dtype=np.int32)
with open(path + file + ".data", "rb") as f:
    data = np.fromfile(f, dtype=np.float32)

rowCount = len(np.unique(rows))
csr = csr_matrix((data, (rows, cols)), shape=(rowCount, rowCount))

vectorsfile = path + "eigens.vec"
transfile = path + file + ".eig"

oversamples = 10
pca = TruncatedSVD(n_components=dims, n_oversamples=oversamples)
pca.fit(csr)
np.savetxt(transfile, pca.transform(csr), fmt='%16f')
The problem is that whether I set oversamples to 10, 100, or 1000, the results are not discernibly different: the explained variance is the same for all of them, as is the performance of the results in my application. At a minimum, I expected the explained variance to change. I would appreciate any explanation of where my expectations are misguided, and any pointers to other settings -- or alternatives to TruncatedSVD -- that I could look to, other than the n_components setting.
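A minimal sketch of one way I could compare solver settings side by side (the specific value combinations and the arpack comparison are my own assumptions, reusing the csr and dims defined above):

from sklearn.decomposition import TruncatedSVD

# Randomized solver: vary n_oversamples and n_iter and compare explained variance.
for n_over, n_iter in [(10, 5), (100, 5), (10, 20)]:
    svd = TruncatedSVD(n_components=dims, n_oversamples=n_over, n_iter=n_iter,
                       algorithm='randomized', random_state=0)
    svd.fit(csr)
    print(n_over, n_iter, svd.explained_variance_ratio_.sum())

# ARPACK computes an exact partial SVD, which gives a reference point.
svd_exact = TruncatedSVD(n_components=dims, algorithm='arpack')
svd_exact.fit(csr)
print('arpack', svd_exact.explained_variance_ratio_.sum())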
I would like some assistance with a problem I have. I have a big CSV file (shape (6239292, 5)) and want to apply an unsupervised machine learning technique (kmodes). My code is this:
import numpy as np
import pandas as pd
print("initialising")
syms = np.genfromtxt('foo.csv', delimiter = ';', dtype=str, skip_header=1, invalid_raise=False)[:, 0:]
print(syms.shape)
X = np.genfromtxt('foo.csv',dtype=object, delimiter=';', invalid_raise=False, skip_header=1)[:, 1:]
X[1:, 0] = X[1:, 0].astype(float)
from kmodes.kprototypes import KPrototypes
print("Imported successfully")
kproto = KPrototypes(n_clusters=6, init='random', n_init=2, verbose=2)
clusters = kproto.fit_predict(X, categorical=[2,1,3,])
Due to the size of the file, it's taking forever. Is there any technique I could use to reduce the time? Thank you in advance!
You can read only the first n rows with pandas:
pd.read_csv(..., nrows=999999)
or skip some rows and then read the next n rows:
pd.read_csv(..., skiprows=1000000, nrows=999999)
Thanks to the Central Limit Theorem, working on a subsample shouldn't distort your results:
The Central Limit Theorem (CLT) is a statistical theory which states that,
given a sufficiently large sample size from a population with a finite
level of variance, the mean of all samples from the same population
will be approximately equal to the mean of the population.
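A minimal sketch of that subsampling idea with pandas (the row count is arbitrary, and the column handling simply mirrors the original fit_predict call):

import pandas as pd
from kmodes.kprototypes import KPrototypes

# Read only a sample of rows; skiprows=range(1, m) would instead skip the
# first m-1 data rows while keeping the header.
sample = pd.read_csv('foo.csv', sep=';', nrows=500_000)

# Mirror the original preprocessing: drop the first column, cast the numeric one.
X = sample.iloc[:, 1:].to_numpy(dtype=object)
X[:, 0] = X[:, 0].astype(float)

kproto = KPrototypes(n_clusters=6, init='random', n_init=2, verbose=2)
clusters = kproto.fit_predict(X, categorical=[2, 1, 3])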
I have a dataset in which each datum has sparse labels. Below is what the data looks like.
[["Snow","Winter","Freezing","Fun","Beanie","Footwear","Headgear","Fur","Playing in the snow","Photography"],["Tree","Sky","Daytime","Urban area","Branch","Metropolitan area","Winter","Town","City","Street light"],...]
There are around 50 distinct labels in total, and about 200K data points. I want to cluster this data, but I'm having trouble doing so.
I want to cluster the data with four clustering algorithms (AgglomerativeClustering, SpectralClustering, MiniBatchKMeans, KMeans), but none of them worked because of memory issues.
Below is my code.
from scipy.sparse import csr_matrix
from sklearn.cluster import KMeans
from sklearn.cluster import MiniBatchKMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.cluster import SpectralClustering
import json
NUM_OF_CLUSTERS = 10
with open('./data/sample.json') as json_file:
    json_data = json.load(json_file)

# Build a document-term style matrix: one row per datum, one column per label.
indptr = [0]
indices = []
data = []
vocabulary = {}
for d in json_data:
    for term in d:
        index = vocabulary.setdefault(term, len(vocabulary))
        indices.append(index)
        data.append(1)
    indptr.append(len(indices))

X = csr_matrix((data, indices, indptr), dtype=int).toarray()

# None of these algorithms work properly. I think it's because of memory issues.
# miniBatchKMeans = MiniBatchKMeans(n_clusters=NUM_OF_CLUSTERS, n_init=5, random_state=0).fit(X)
# agglomerative = AgglomerativeClustering(n_clusters=NUM_OF_CLUSTERS).fit(X)
# spectral = SpectralClustering(n_clusters=NUM_OF_CLUSTERS, assign_labels="discretize", random_state=0).fit(X)
#
# print(miniBatchKMeans.labels_)
# print(agglomerative.labels_)
# print(spectral.labels_)

with open('data.json', 'w') as outfile:
    json.dump(miniBatchKMeans.labels_.tolist(), outfile)
Are there any solutions or other recommendations for my problem?
What is the size of X?
With toarray() you are converting the data into a dense format. That significantly increases the memory requirements.
With 200k instances you cannot use spectral clustering or affinity propagation, because these need O(n²) memory. So either choose other algorithms or subsample your data. There is also no point in running both k-means and mini-batch k-means (which is an approximation of k-means); use only one.
To work efficiently with sparse data, you may need to implement the algorithms yourself. K-means is designed for dense data, so implementations are typically tuned for dense data by default. In fact, using the mean on sparse data is rather questionable, so I would not expect k-means results to be very good on your data either.
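A minimal sketch of the sparse route, reusing the variables from the question's code and keeping only mini-batch k-means (scikit-learn's MiniBatchKMeans accepts scipy sparse input directly):

from scipy.sparse import csr_matrix
from sklearn.cluster import MiniBatchKMeans

# Same label/term matrix as in the question, but without the .toarray() call,
# so it stays sparse and small in memory.
X_sparse = csr_matrix((data, indices, indptr), dtype=int)

mbk = MiniBatchKMeans(n_clusters=NUM_OF_CLUSTERS, n_init=5, random_state=0)
labels = mbk.fit_predict(X_sparse)
print(labels)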
I am working with TFIDF sparse matrices for document classification and want to retain only the top n (say 50) terms for each document (ranked by TFIDF score). See EDIT below.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
n = 50
df = pd.read_pickle('my_df.pickle')
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t
Out[15]:
<21175x201380 sparse matrix of type '<class 'numpy.float64'>'
with 6055621 stored elements in Compressed Sparse Row format>
I have tried following the example in this post, although my aim is not to display the features, but just to select the top n for each document before training. But I get a memory error as my data is too large to be converted into a dense matrix.
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
Traceback (most recent call last):
File "<ipython-input-16-e0a74c393ca5>", line 1, in <module>
df_t_sorted = np.argsort(df_t.toarray()).flatten()[::1][n]
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\compressed.py", line 943, in toarray
out = self._process_toarray_args(order, out)
File "C:\Users\Me\AppData\Local\Continuum\anaconda3\lib\site-packages\scipy\sparse\base.py", line 1130, in _process_toarray_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
Is there any way to do what I want without working with a dense representation (i.e. without the toarray() call) and without reducing the feature space too much more than I already have (with min_df)?
Note: the max_features parameter is not what I want as it only considers "the top max_features ordered by term frequency across the corpus" (docs here) and what I want is a document-level ranking.
EDIT: I wonder if the best way to address this problem is to set the values of all features except the n-best to zero. I say this because the vocabulary has already been calculated, so feature indices must remain the same, as I will want to use them for other purposes (e.g. to visualise the actual words that correspond to the n-best features).
A colleague wrote some code to retrieve the indices of the n highest-ranked features:
n = 2
tops = np.zeros((df_t.shape[0], n), dtype=int)  # store the top indices in a new array
for ind in range(df_t.shape[0]):
    # For each row (i.e. document), argsort the negated values (argsort is ascending)
    # and keep the first n column indices.
    tops[ind, :] = np.argsort(-df_t[ind].toarray())[0, 0:n]
But from there, I would need to either:
retrieve the list of remaining (i.e. lowest-ranked) indices and modify the values "in place", or
loop through the original matrix (df_t) and set all values to 0 except for the n best indices in tops.
There is a post here explaining how to work with a csr_matrix, but I'm not sure how to put this into practice to get what I want.
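One possible sketch of the zero-out idea from the EDIT, working directly on the CSR structure so nothing is densified (the helper name keep_top_n and its loop are my own, not from any of the answers below):

import numpy as np

def keep_top_n(m, n):
    """Return a copy of CSR matrix m with all but the n largest values per row zeroed."""
    m = m.tocsr(copy=True)
    for i in range(m.shape[0]):
        start, end = m.indptr[i], m.indptr[i + 1]
        row = m.data[start:end]            # view into the stored values of row i
        if row.size > n:
            row[np.argsort(row)[:-n]] = 0  # zero out everything below the top n
    m.eliminate_zeros()                    # drop the explicit zeros from the structure
    return m

# Column indices stay aligned with the vectorizer's vocabulary:
# df_t_top = keep_top_n(df_t, 50)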
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
vect = TfidfVectorizer(tokenizer=word_tokenize,ngram_range=(1,2), binary=True, max_features=50)
TFIDF=vect.fit_transform(df['processed_cv_data'])
The max_features parameter passed in the TfidfVectorizer will pick out the top 50 features ordered by their term frequency but not by their Tf-idf score.
You can view the features by using:
print(vect.get_feature_names())
As you mention, the max_features parameter of the TfidfVectorizer is one way of selecting features.
If you are looking for an alternative way which takes the relationship to the target variable into account, you can use sklearn's SelectKBest. By setting k=50, this will filter your data for the best features. The metric to use for selection can be specified as the parameter score_func.
Example:
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(df['text'])
df_t_reduced = SelectKBest(k=50).fit_transform(df_t, df['target'])
You can also chain it in a pipeline:
pipeline = Pipeline([("vectorizer", TfidfVectorizer()),
("feature_reduction", SelectKBest(k=50)),
("classifier", classifier)])
You could break the matrix into multiple chunks to keep memory usage down, then concatenate the results:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train').data
tfidfvectorizer = TfidfVectorizer(analyzer='word', stop_words='english',
                                  token_pattern=r'[A-Za-z][\w\-]*', max_df=0.25)
df_t = tfidfvectorizer.fit_transform(data)

n = 10
# Process 500 rows at a time; keep the column indices of the n largest TF-IDF
# values per row (argsort is ascending, so slice from the end).
df_top = [np.argsort(df_t[i: i+500, :].toarray(), axis=1)[:, -n:]
          for i in range(0, df_t.shape[0], 500)]
np.concatenate(df_top, axis=0).shape
>> (11314, 10)
I have data produced by COMSOL which I would like to use as a lookup table in a Python/SciPy program I am building. The output from COMSOL looks like B(ri, thick, L) and will contain approximately 20,000 entries. An example of the output is shown below for a reduced 3x3x3 version.
While I have found many good solutions for 3D interpolation using e.g. RegularGridInterpolator (first link below), I am still looking for a solution in the lookup-table style. The second link below seems close; however, I am unsure how that method interpolates over all three dimensions.
I am having a hard time believing that a lookup table requires such an elaborate implementation, so any suggestions are most appreciated!
COMSOL data example
interpolate 3D volume with numpy and or scipy
Interpolating data from a look up table
I was able to figure this out and wanted to pass on my solution to the next person. I found that merely averaging the two closest points found via a cKDTree yielded errors as large as 10%.
Instead, I used the cKDTree to find the appropriate entry in the scattered lookup table / data file and assign it to the correct entry of a 3D numpy array (you can save this numpy array to file if you like). Then I use RegularGridInterpolator on this array. Errors were on the order of 0.5 percent, which was an order of magnitude better than the cKDTree alone.
import numpy as np
from scipy.spatial import cKDTree
from scipy.interpolate import RegularGridInterpolator
# Known bounds of the (L, ri, thick) grid.
l_data = np.linspace(0.125, 0.5, 16)        # range for "short L"; alternatively np.linspace(0.01, 0.1, 10)
ri_data = np.linspace(0.005, 0.075, 29)
thick_data = np.linspace(0.0025, 0.1225, 25)

F = np.zeros((np.size(l_data), np.size(ri_data), np.size(thick_data)))

LUT = np.genfromtxt('a_data_file.csv', delimiter=',')
F_val = LUT[:, 3]
tree_small_l = cKDTree(LUT[:, :3])  # xyz coords

# Fill the regular 3D grid with the nearest scattered data point for each node.
for ri_iter in np.arange(np.size(ri_data)):
    for thick_iter in np.arange(np.size(thick_data)):
        for l_iter in np.arange(np.size(l_data)):
            dist, ind = tree_small_l.query((l_data[l_iter], ri_data[ri_iter], thick_data[thick_iter]))
            F[l_iter, ri_iter, thick_iter] = F_val[ind]

interp_F_func = RegularGridInterpolator((l_data, ri_data, thick_data), F)
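Hypothetical usage of the resulting interpolator (the query values are made up, but they must lie within the grid bounds defined above):

# Interpolate B at an arbitrary (L, ri, thick) point inside the grid.
query_point = np.array([[0.25, 0.03, 0.05]])
print(interp_F_func(query_point))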
I have the following code
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
dates = pd.date_range('2000-01-01', periods=1000)  # any 1000-element index will do here
df = pd.DataFrame(np.random.randn(1000, 25), index=dates,
                  columns=list('ABCDEFGHIJKLMOPQRSTUVWXYZ'))  # 25 letters (no 'N')

def reduce(dim):
    svd = TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
    return svd.fit(df)

fitted = reduce(5)
How do I get the column names from fitted?
In continuation of Mikhail's post.
Assume that you already have feature_names from vectorizer.get_feature_names() and that you have already called svd.fit(X).
Now you can also extract sorted best feature names using the following code:
best_features = [feature_names[i] for i in svd.components_[0].argsort()[::-1]]
The above code takes the arguments of a descending sort of svd.components_[0], looks up the corresponding entries in feature_names (all of the features), and constructs the best_features array.
Then you can see for example the 10 best features:
In[21]: best_features[:10]
Out[21]:
['manag',
'develop',
'busi',
'solut',
'initi',
'enterprise',
'project',
'program',
'process',
'plan']
The "column names" of fitted would be the SVD dimensions.
Each dimension is a linear combination of input features. To understand what a particular dimension means, take a look at the svd.components_ array - it contains the matrix of coefficients that the input features are multiplied by.
Your original example, slightly changed:
import pandas as pd
import numpy as np
from sklearn.decomposition import TruncatedSVD
feature_names = list('ABCDEF')
df = pd.DataFrame(
    np.random.randn(1000, len(feature_names)),
    columns=feature_names
)

def reduce(dim):
    svd = TruncatedSVD(n_components=dim, n_iter=7, random_state=42)
    return svd.fit(df)
svd = reduce(3)
Then you can do something like that to get a more readable SVD dimension name - let's compute it for 0th dimension:
" ".join([
"%+0.3f*%s" % (coef, feat)
for coef, feat in zip(svd.components_[0], feature_names)
])
It shows +0.170*A -0.564*B -0.118*C +0.367*D +0.528*E +0.475*F - this is a "feature name" you can use for the 0th SVD dimension in this case (of course, the coefficients depend on the data, so the feature name also depends on the data).
If you have many input dimensions you may trade some "precision" for inspectability, e.g. sort the coefficients and use only a few of the top ones. A more elaborate example can be found in https://github.com/TeamHG-Memex/eli5/pull/208 (disclaimer: I'm one of the eli5 maintainers; the pull request is not by me).
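For example, a quick sketch of that truncation idea (k and the formatting are my own choices, not part of eli5 or the answer above):

# Build a shorter name for dimension 0 from only its k largest-magnitude coefficients.
k = 3
comp = svd.components_[0]
top = np.argsort(np.abs(comp))[::-1][:k]
short_name = " ".join("%+0.3f*%s" % (comp[i], feature_names[i]) for i in top)
print(short_name)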