Memory Error when trying to create numpy matrix - python

import codecs
import numpy as np

text = codecs.open("lith.txt", encoding='utf-8')
text = text.read().lower().replace('"', '').replace('?', '').replace(',', '').replace('!', '').replace('.', '')
text = text.split()
words = sorted(set(text))
Unigram = np.zeros(len(words))
ind = range(len(words))
Lexicon = dict(zip(words, ind))
Bigram = np.zeros([len(words), len(words)])
I keep running into major issues with the last line of this portion of the program. The text file is about 7,000,000 words long, and len(words) is currently about 200,000. When I cut the text file down to the point where len(words) is 40,000 or so, the program works. Is there any way to get around this memory limitation? Thanks for any help. The results I get in later parts of the program really seem to suffer if I just keep cutting out portions of the text until the memory error goes away.
for n in range(len(text) - 1):
    Unigram[Lexicon[text[n]]] += 1
    Bigram[Lexicon[text[n]]][Lexicon[text[n + 1]]] += 1
Unigram_sorted = np.argsort(Unigram)
Unigram_sorted = Unigram_sorted[::-1]
Unigram_sorted = Unigram_sorted[0:4999]

I assume the line that raises the exception is:
Bigram = np.zeros([len(words),len(words)])
If len(words) is 200,000, then the matrix has 200,000^2 entries. At 8 bytes per entry (np.zeros defaults to float64; int64 is the same size), that is about 320 GB of memory.
Assuming most entries will remain zero, a sparse matrix could help, for example one of scipy's sparse matrix types. For counting joint pairs, a DOK (dictionary-of-keys) matrix is convenient:
from scipy.sparse import dok_matrix
Bigrams = dok_matrix((len(words), len(words)))
# Bigrams[i, j] += 1
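For example, a rough, untested sketch of the counting loop from the question adapted to the sparse matrix above (reusing the text, words and Lexicon variables from the question):
import numpy as np
from scipy.sparse import dok_matrix

Unigram = np.zeros(len(words))
Bigrams = dok_matrix((len(words), len(words)), dtype=np.int64)  # only nonzero counts are stored
for n in range(len(text) - 1):
    Unigram[Lexicon[text[n]]] += 1
    Bigrams[Lexicon[text[n]], Lexicon[text[n + 1]]] += 1  # note the single [row, col] index for sparse matrices
If later steps need fast arithmetic or row slicing, the DOK matrix can be converted with Bigrams.tocsr().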
Regarding the code itself, the first part is quite similar to what scikit-learn's text vectorizers already implement.

Related

Improving the performance of a sliding-window fragment function in Python 3

I have a script in Python 3.6.8 which reads through a very large text file, where each line is an ASCII string drawn from the alphabet {a,b,c,d,e,f}.
For each line, I have a function which fragments the string using a sliding window of size k, and then increments a fragment counter dictionary fragment_dict by 1 for each fragment seen.
The same fragment_dict is used for the entire file, and it is initialized for all possible 5^k fragments mapping to zero.
I also ignore any fragment which has the character c in it. Note that c is uncommon, and most lines will not contain it at all.
def fragment_string(mystr, fragment_dict, k):
    for i in range(len(mystr) - k + 1):
        fragment = mystr[i:i+k]
        if 'c' in fragment:
            continue
        fragment_dict[fragment] += 1
Because my file is so large, I would like to optimize the performance of the above function as much as possible. Could anyone provide any potential optimizations to make this function faster?
I'm worried I may be limited by the speed of pure-Python loops, in which case I would need to consider dropping down into C/Cython.
Numpy may help in speeding up your code:
import collections
import numpy as np

x = np.array([ord(c) - ord('a') for c in mystr])
filter = np.geomspace(1, 5**(k-1), k, dtype=int)
fragment_dict = collections.Counter(np.convolve(x, filter, mode='valid'))
The idea is to represent each k-length segment as a k-digit base-5 number. Converting the string, mapped to a list of small integers, to its base-5 value is then equivalent to applying a convolution with [1, 5, 25, 125, ...] as the filter.
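As a rough, untested sketch of how the snippet above could be applied per line of the file (it keeps the digit encoding from this answer, so the resulting counts are keyed by integer codes rather than fragment strings, and it does not handle the 'c' filtering from the question; the file name is just a placeholder):
import collections
import numpy as np

def count_fragments_vectorized(line, k):
    x = np.array([ord(c) - ord('a') for c in line])
    weights = np.geomspace(1, 5 ** (k - 1), k, dtype=int)  # [1, 5, 25, ...]
    return collections.Counter(np.convolve(x, weights, mode='valid'))

total_counts = collections.Counter()
with open('fragments.txt') as fh:  # hypothetical input file
    for line in fh:
        total_counts.update(count_fragments_vectorized(line.rstrip('\n'), k=4))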

ValueError: cannot reshape array of size 3800 into shape (1,200)

I am trying to apply word embeddings to tweets. I was trying to create a vector for each tweet by taking the average of the vectors of the words present in the tweet, as follows:
def word_vector(tokens, size):
    vec = np.zeros(size).reshape((1, size))
    count = 0.
    for word in tokens:
        try:
            vec += model_w2v[word].reshape((1, size))
            count += 1.
        except KeyError:  # handling the case where the token is not in vocabulary
            continue
    if count != 0:
        vec /= count
    return vec
Next, when I try to prepare the word2vec feature set as follows:
wordvec_arrays = np.zeros((len(tokenized_tweet), 200))
# the length of the vector is 200
for i in range(len(tokenized_tweet)):
    wordvec_arrays[i, :] = word_vector(tokenized_tweet[i], 200)
wordvec_df = pd.DataFrame(wordvec_arrays)
wordvec_df.shape
I get the following error inside the loop:
ValueError Traceback (most recent call last)
<ipython-input-32-72aee891e885> in <module>
4 # wordvec_arrays.reshape(1,200)
5 for i in range(len(tokenized_tweet)):
----> 6 wordvec_arrays[i,:] = word_vector(tokenized_tweet[i], 200)
7
8 wordvec_df = pd.DataFrame(wordvec_arrays)
<ipython-input-31-9e6501810162> in word_vector(tokens, size)
4 for word in tokens:
5 try:
----> 6 vec += model_w2v.wv.__getitem__(word).reshape((1, size))
7 count += 1.
8 except KeyError: # handling the case where the token is not in vocabulary
ValueError: cannot reshape array of size 3800 into shape (1,200)
I checked all the related posts on Stack Overflow but none of them really helped me.
I tried reshaping the array and it still gives me the same error.
My model is:
import gensim

tokenized_tweet = df['tweet'].apply(lambda x: x.split())  # tokenizing
model_w2v = gensim.models.Word2Vec(
    tokenized_tweet,
    size=200,      # desired no. of features/independent variables
    window=5,      # context window size
    min_count=2,
    sg=1,          # 1 for skip-gram model
    hs=0,
    negative=10,   # for negative sampling
    workers=2,     # no. of cores
    seed=34)
model_w2v.train(tokenized_tweet, total_examples=len(df['tweet']), epochs=20)
Any suggestions, please?
It looks like the intent of your word_vector() method is to take a list of words, and then with respect to a given Word2Vec model, return the average of all those words' vectors (when present).
To do that, you shouldn't need to do any explicit re-shaping of vectors – or even specification of size, because that's forced by what the model already provides. You could use utility methods from numpy to simplify the code a lot. For example, the gensim n_similarity() method, as part of its comparison of two lists-of-words, already does an averaging much like what you're trying to do, and you can look at its source as a model:
https://github.com/RaRe-Technologies/gensim/blob/f97d0e793faa57877a2bbedc15c287835463eaa9/gensim/models/keyedvectors.py#L996
So, while I haven't tested this code, I think your word_vector() method could be essentially replaced with:
import numpy as np

def average_words_vectors(tokens, wv_model):
    vectors = [wv_model[word] for word in tokens
               if word in wv_model]  # avoiding KeyError
    return np.array(vectors).mean(axis=0)
(It sometimes makes sense to work with vectors that have been normalized to unit length, as the linked gensim code does by applying gensim.matutils.unitvec() to the average. I haven't done this here, since your method hadn't taken that step, but it is something to consider.)
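If you do want that unit-length behaviour, a minimal, untested variant of the function above could be:
import numpy as np
from gensim import matutils

def average_words_vectors_normalized(tokens, wv_model):
    vectors = [wv_model[word] for word in tokens
               if word in wv_model]  # avoiding KeyError
    return matutils.unitvec(np.array(vectors).mean(axis=0))  # scale the average to unit length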
Separate observations about your Word2Vec training code:
typically words with just 1, 2, or a few occurrences don't get good vectors (due to the limited number & variety of examples), but they do interfere with the improvement of other, more-common-word vectors. That's why the default is min_count=5. So just be aware: your surviving vectors may get better if you use the default (or an even larger) value here, discarding more of the rarer words.
the dimensions of a "dense embedding" like word2vec-vectors aren't really "independent variables" (or standalone, individually-interpretable "features") as implied by your code comment, even though they may seem that way as separate values/slots in the data. For example, you can't pick one dimension out and conclude, "that's the foo-ness of this sample" (like 'coldness' or 'hardness' or 'positiveness' etc). Rather, any of those human-describable meanings tend to be other directions in the combined space, not perfectly aligned with any of the individual dimensions. You can sort-of tease those out by comparing vectors, and downstream ML algorithms can make use of those complicated/entangled multi-dimensional interactions. But if you think of each dimension as its own "feature" – in any way other than yes, it's technically a single number associated with the item – you may be prone to misinterpreting the vector space.

Going out of memory for python dictionary when the numbers are integer

I have Python code that is supposed to read large files into a dictionary in memory and do some operations. What puzzles me is that it runs out of memory in only one case: when the values in the file are integers...
The structure of my file is like this:
string value_1 .... value_n
The files I have vary in size from 2 GB to 40 GB, and I have 50 GB of memory to read a file into. When a line looks like this:
string 0.001334 0.001473 -0.001277 -0.001093 0.000456 0.001007 0.000314 ...
with n=100 and about 10M rows, I am able to read it into memory relatively fast; the file size is about 10 GB. However, when a line looks like
string 4 -2 3 1 1 1 ...
with the same dimension (n=100) and the same number of rows, I am not able to read it into memory.
for line in f:
    tokens = line.strip().split()
    if len(tokens) <= 5:  # ignore w2v first line
        continue
    word = tokens[0]
    number_of_columns = len(tokens) - 1
    features = {}
    for dim, val in enumerate(tokens[1:]):
        val = float(val)
        features[dim] = val
    matrix[word] = features
This results in the process being Killed in the second case, while it works in the first case.
I know this does not answer the question specifically, but it probably offers a better solution to the underlying problem:
May I suggest you use pandas for this kind of work?
It seems a lot more appropriate for what you're trying to do. http://pandas.pydata.org/index.html
import pandas as pd
pd.read_csv('file.txt', sep=' ', skiprows=1)
then do all your manipulations
Pandas is a package designed specifically to handle and process large datasets. It has tons of useful features you will probably end up needing if you're dealing with big data.
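As a rough, untested sketch of how the word-to-vector lookup from your loop could then be done with the resulting DataFrame (assuming the first column holds the word and the remaining columns hold the values; 'some_word' is just a placeholder):
import numpy as np
import pandas as pd

df = pd.read_csv('file.txt', sep=' ', skiprows=1, header=None)
df = df.set_index(0)                  # index rows by the word in the first column
df = df.astype(np.float32)            # optional: roughly halves the memory of the value columns
vector = df.loc['some_word'].values   # rough equivalent of matrix['some_word'] from your code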

Stanford Glove : Dimension anomaly in glove.twitter.27B.200d

I downloaded Glove-twitter pretrained vectors from http://nlp.stanford.edu/data/glove.twitter.27B.zip
When I load the vectors (using glove.twitter.27B.200d.txt) into memory, I find 900 words whose vectors have 199 dimensions, while the vectors of all the remaining words have 200 dimensions. As per my understanding, every vector in this file is supposed to have exactly 200 dimensions. No?
I am using the following Python code to arrive at my conclusion:
import numpy as np

glove_model_path = './glove.twitter.27B.200d.txt'
f = open(glove_model_path, 'r')
model = {}
counter = 0
vary_length = 0
anamolies = []
for line in f:
    counter += 1
    items = line.replace('\r', '').replace('\n', '').split(' ')
    word = items[0]
    vect = np.array([float(i) for i in items[1:] if len(i) > 1])
    if len(vect) != 200:
        vary_length += 1
        anamolies.append(word)
f.close()
print vary_length
Output is: 900
Correct, every vector should be 200 elements.
To elaborate, I suspect the problem is in your code, specifically:
items = line.replace('\r','').replace('\n','').split(' ')
Why don't you print some of those 900 lines and see what they look like? Depending on how tokenization was done, you may have situations where \r or \n is being treated as a word, and so you are eliminating some elements. I find it odd, though, that the whitespace wouldn't have been merged together by default.
Also, you may want to check whether there is an API to read in those vectors rather than rolling your own. Your code is making some formatting assumptions that may be incorrect.
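As a rough, untested sketch of a parse without a length filter (note that the if len(i) > 1 condition in your list comprehension also drops legitimate single-character value tokens such as "0"):
import numpy as np

vary_length = 0
with open(glove_model_path, 'r') as f:
    for line in f:
        items = line.rstrip('\r\n').split(' ')
        vect = np.asarray(items[1:], dtype=float)  # keep every value token
        if len(vect) != 200:
            vary_length += 1
print(vary_length)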

how can I complete the text classification task using less memory

(1) My goal:
I am trying to use an SVM to classify 10000 documents (each with 400 words) into 10 classes (evenly distributed). The features explored in my work include word n-grams (n = 1~4) and character n-grams (n = 1~6).
(2) My approach: I represent each document using a vector of frequency values for each feature in the document, and use TF-IDF to normalize the vectors. Parts of my code are below:
import numpy as np
from sklearn import svm
from sklearn.feature_extraction.text import TfidfTransformer

def commonVec(dicts, count1, count2):
    '''put features with frequency between count1 and count2 into a common vector used for SVM training'''
    global_vector = []
    master = {}
    for i, d in enumerate(dicts):
        for k in d:
            master.setdefault(k, []).append(i)
    for key in master:
        if len(master[key]) >= count1 and len(master[key]) <= count2:
            global_vector.append(key)
    global_vector1 = sorted(global_vector)
    return global_vector1

def featureComb(mix, count1, count2, res1):
    '''combine word n-grams and character n-grams into a vector'''
    if mix[0]:
        common_vector1 = []
        for i in mix[0]:
            dicts1 = []
            for res in res1:  # I stored all documents in a database. res1 is the document result set and res is each document.
                dicts1.append(ngram.characterNgrams(res[1], i))  # characterNgrams() returns a dictionary with feature name as the key, frequency as the value.
            common_vector1.extend(commonVec(dicts1, count1, count2))
    else:
        common_vector1 = []
    if mix[1]:
        common_vector2 = []
        for j in mix[1]:
            dicts2 = []
            for res in res1:
                dicts2.append(ngram.wordNgrams(res[1], j))
            common_vector2.extend(commonVec(dicts2, count1, count2))
    else:
        common_vector2 = []
    return common_vector1 + common_vector2

def svmCombineVector(mix, global_combine, label, X, y, res1):
    '''Construct the X vector that can be used to train the SVM'''
    lstm = []
    for res in res1:
        y.append(label[res[0]])  # insert class label into y
        dici1 = {}
        dici2 = {}
        freq_term_vector = []
        for i in mix[0]:
            dici1.update(ngram.characterNgrams(res[1], i))
        freq_term_vector.extend(dici1[gram] if gram in dici1 else 0 for gram in global_combine)
        for j in mix[1]:
            dici2.update(ngram.wordNgrams(res[1], j))
        freq_term_vector.extend(dici2[gram] if gram in dici2 else 0 for gram in global_combine)
        lstm.append(freq_term_vector)
    freq_term_matrix = np.matrix(lstm)
    transformer = TfidfTransformer(norm="l2")
    tfidf = transformer.fit_transform(freq_term_matrix)
    X.extend(tfidf.toarray())

# res1 (the document result set) and label (the class labels) come from the database query mentioned above
X = []
y = []
character = [1, 2, 3, 4, 5, 6]
word = [1, 2, 3, 4]
mix = [character, word]
global_vector_combine = featureComb(mix, 2, 5000, res1)
print len(global_vector_combine)  # 542401
svmCombineVector(mix, global_vector_combine, label, X, y, res1)
clf1 = svm.LinearSVC()
clf1.fit(X, y)
(3) My problem: When I ran the source code, a memory error occurred:
Traceback (most recent call last):
File "svm.py", line 110, in <module>
functions.svmCombineVector(mix,global_vector_combine,label,X,y,res1)
File "/home/work/functions.py", line 201, in svmCombineVector
X.extend(tfidf.toarray())
File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/compressed.py", line 901, in toarray
return self.tocoo(copy=False).toarray(order=order, out=out)
File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/coo.py", line 269, in toarray
B = self._process_toarray_args(order, out)
File "/home/anaconda/lib/python2.7/site-packages/scipy/sparse/base.py", line 789, in _process_toarray
_args
return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
I am really having a hard time with it and need help from Stack Overflow.
Could anyone explain some details and give me some idea of how to solve it?
Could anyone check my source code and show me some other methods to make use of memory more effectively?
The main problem you're facing is that you're using far too many features. It's actually quite extraordinary that you've managed to generate 542401 features from documents that contain just 400 words! I've seen SVM classifiers separate spam from non-spam with high accuracy using just 150 features -- word counts of selected words that say a lot about whether the document is spam. These use stemming and other normalization tricks to make the features more effective.
You need to spend some time thinning out your features. Think about which features are most likely to contain information useful for this task. Experiment with different features. As long as you keep throwing everything but the kitchen sink in, you'll get memory errors. Right now you're trying to pass 10000 data points with 542401 dimensions each to your SVM. That's 542401 * 10000 * 4 = 21 gigabytes (conservatively) of data. My computer only has 4 gigabytes of RAM. You've got to pare this way down.[1]
A first step towards doing so would be to think about how big your total vocabulary size is. Each document has only 400 words, but let's say those 400 words are taken from a vocabulary of 5000 words. That means there will be 5000 ** 4 = 6.25 * 10 ** 14 possible 4-grams. That's half a quadrillion possible 4-grams. Of course not all those 4-grams will appear in your documents, but this goes a long way towards explaining why you're running out of memory. Do you really need these 4-grams? Could you get away with 2-grams only? There are a measly 5000 ** 2 = 25 million possible 2-grams. That will fit much more easily in memory, even if all possible 2-grams appear (unlikely).
Also keep in mind that even if the SVM could handle quadrillions of datapoints, it would probably give bad results, because when you give any learning algorithm too many features, it will tend to overfit, picking up on irrelevant patterns and overgeneralizing from them. There are ways of dealing with this, but it's best not to deal with it at all if you can help it.
I will also mention that these are not "newbie" problems. These are problems that machine learning specialists with PhDs have to deal with. They come up with lots of clever solutions, but we're not so clever that way, so we have to be clever a different way.
Although I can't offer you specific suggestions for cleverness without knowing more, I would say that, first, stemming is a good idea in at least some cases. Stemming simply removes grammatical inflection, so that different forms of the same word ("swim" and "swimming") are treated as identical. This will probably reduce your vocabulary size significantly, at least if you're dealing with English text. A common choice is the porter stemmer, which is included in nltk, as well as in a number of other packages. Also, if you aren't already, you should probably strip punctuation and reduce all words to lower-case. From there, it really depends. Stylometry (identifying authors) sometimes requires only particles ("a", "an", "the"), conjunctions ("and", "but") and other very common words; spam, on the other hand, has its own oddball vocabularies of interest. At this level, it is very difficult to say in advance what will work; you'll almost certainly need to try different approaches to see which is most effective. As always, testing is crucial!
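For instance, a quick illustration with the Porter stemmer from nltk (assuming nltk is installed):
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
print(stemmer.stem('swimming'))  # 'swim'
print(stemmer.stem('swim'))      # 'swim' -- both forms collapse to the same feature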
[1] Well, possibly you have a huge amount of RAM at your disposal. For example, I have access to a machine with 48G of RAM at my current workplace. But I doubt it could handle this either, because the SVM will have its own internal representation of the data, which means there will be at least one copy at some point; if a second copy is needed at any point -- kaboom.
