Problem with Python/NLTK Stop Words and File Write

I am trying to write to file a list of stop words from NLTK.
So, I wrote this script:
import nltk
from nltk.corpus import stopwords
from string import punctuation

file_name = 'OUTPUT.CSV'
file = open(file_name, 'w+')
_stopwords = set(stopwords.words('english') + list(punctuation))

i = 0
file.write(f'\n\nSTOP WORDS:+++\n\n')
for w in _stopwords:
    i = i + 1
    out1 = f'{i:3}. {w}\n'
    out2 = f'{w}\n'
    out3 = f'{i:3}. {w}'
    file.write(out2)
    print(out3)
file.close()
The original program used file.write(w), but since I encountered problems, I started trying things.
So, I tried using file.write(out1). That works, but the order of the stop words appears to be random.
What's interesting is that if I use file.write(out2), only some of the stop words get written, seemingly in random order and always fewer than 211. I experience the same problem both in Visual Studio 2017 and Jupyter Notebook.
For example, the last run wrote 175 words ending with:
its
wouldn
shan
Using file.write(out1) I get all 211 words and the column ends like this:
209. more
210. have
211. ,
Has anyone run into a similar problem? Any idea of what may be going on?
I'm new to Python/NLTK so I decided to ask.

The reason you are getting a random order of stop words is the use of a set.
_stopwords = set(stopwords.words('english')+list(punctuation))
A set is an unordered collection with no duplicate elements. Read more here.
Unlike arrays, where the elements are stored as an ordered list, the order of elements in a set is undefined (moreover, the set elements are usually not stored in their order of insertion; this makes checking whether an element belongs to a set faster than going through all the elements of the set).
You can use this simple example to check this:
test = set('abcd')
for i in test:
    print(i)
It outputs a different order on different systems (e.g. I tried it on two different systems; this is what I got):
On the first system
a
d
b
c
and,
on the second system
d
c
a
b
There are other alternatives for ordered sets. Check here.
Incidentally, I've checked that all three of out1, out2, and out3 give 211 stop words.
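If a deterministic order is all you need (rather than true insertion order), a simple option is to sort the set before writing, and to use a with block so the file is always flushed and closed. A small sketch, assuming the NLTK stopword corpus is already downloaded:
from nltk.corpus import stopwords
from string import punctuation

stop_words = sorted(set(stopwords.words('english') + list(punctuation)))

with open('OUTPUT.CSV', 'w') as f:
    for i, w in enumerate(stop_words, start=1):
        f.write(f'{i:3}. {w}\n')
        print(f'{i:3}. {w}')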

Related

I can't delete cases from .sav files using spss with python

I have some .sav files that I want to check for bad data. What I mean by bad data is irrelevant to the problem. I have written a script in python using the spss module to check the cases and then delete them if they are bad. I do that within a datastep by defining a dataset object and then getting its case list. I then use
del datasetObj.cases[k]
to delete the problematic cases within the datastep.
Here is my problem:
Say I have a data set foo.sav and it is the active data set in spss, then I can run something like:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[k]
spss.EndDataStep()
END PROGRAM.
from within the spss client and it will delete the case k from the data set foo.sav. But, if I run something like the following using the directory of foo.sav as the working directory:
import os, spss
pathname = os.getcwd()
foopathname = os.path.join(pathname, 'foo.sav')
spss.Submit("""
GET FILE='%(foopathname)s'.
DATASET NAME file1.
DATASET ACTIVATE file1.
""" %locals())
spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
del caselist[3]
spss.EndDataStep()
from the command line, then it doesn't delete case k (k = 3 in the snippet above). Similar code which gets values will work fine. E.g.,
print caselist[3]
will print case k (when it is in the data step). I can even change the values for the various entries of a case. But it will not delete cases. Any ideas?
I am new to Python and SPSS, so there may be something that I am not seeing which is obvious to others; hence I am asking the question.
Your first piece of code did not work for me. I adjusted it as follows to get it working:
BEGIN PROGRAM PYTHON.
import spss
spss.StartDataStep()
datasetObj = spss.Dataset()
del datasetObj.cases[k]
spss.EndDataStep()
END PROGRAM.
Notice that, in your code, caselist is just a list, containing values taken from the datasetObj in SPSS. The attribute .cases belongs to datasetObj.
With spss.Submit, you can also delete cases (or actually, not select them) using the SPSS command SELECT IF. For example, if your file has a variable (column) named age, with values ranging from 0 to 100, you can delete all cases with an age lower than (in SPSS: lt or <) 25 using:
BEGIN PROGRAM PYTHON.
import spss
spss.Submit("""
SELECT IF age lt 25.
""")
END PROGRAM.
Don't forget to add some code to save the edited file.
caselist is not actually a regular list containing the dataset values. Although its interface is the list interface, it actually works directly with the dataset, so it does not contain a list of values. It just accesses operations on the SPSS side to retrieve, change, or delete values. The most important difference is that since Statistics is not keeping the data in memory, the size of the caselist is not limited by memory.
However, if you are trying to iterate over the cases with a loop using
range(spss.GetCaseCount())
and deleting some, the loop will eventually fail, because the actual case count reflects the deletions, but the loop limit doesn't reflect that. And datasetObj.cases[k] might not be the case you expect if an earlier case has been deleted. So you need to keep track of the deletions and adjust the limit or the k value appropriately.
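One way to do that bookkeeping is to walk the case indices backwards, so a deletion never shifts the index of a case you haven't visited yet. A rough sketch (untested against a live SPSS session; is_bad() is a hypothetical placeholder for whatever check flags a bad case):
import spss

spss.StartDataStep()
datasetObj = spss.Dataset()
caselist = datasetObj.cases
# walk backwards so earlier deletions don't shift the remaining indices
for k in range(spss.GetCaseCount() - 1, -1, -1):
    if is_bad(caselist[k]):     # hypothetical test for a "bad" case
        del caselist[k]
spss.EndDataStep()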
HTH

Normalize all words in a document

I need to normalize all words in a huge corpus. Any ideas how to optimize this code? It's too slow...
texts = [[list(morph.normalize(word.upper()))[0] for word in document.split()]
         for document in documents]
documents is a list of strings, where each string is a single book's text.
morph.normalize works only on upper-case input, so I apply .upper() to all words. Moreover, it returns a set with one element, which is the normalized word (a string).
The first and obvious thing I'd do would be to cache the normalized words in a local dict, so as to avoid calling morph.normalize() more than once for a given word.
A second optimization is to alias methods to local variables - this avoids going through the whole attribute lookup + function descriptor invocation + method object instantiation on each iteration of the loop.
Then, since it's a "huge" corpus, you probably want to avoid creating a full list of lists at once, which might eat all your RAM, make your computer start to swap (which is guaranteed to make it snail slow) and finally crash with a memory error. I don't know what you're supposed to do with this list of lists nor how huge each document is, but as an example I iterate over per-document results and write them to stdout - what should really be done depends on the context and concrete use case.
NB: untested code, obviously, but at least this should get you started:
def iterdocs(documents, morph):
    # keep track of already normalized words
    # beware: this dict might get too big if you
    # have lots of different words. Depending on
    # your corpus, you may want to either use an LRU
    # cache instead and/or use a per-document cache
    # and/or any other appropriate caching strategy...
    cache = {}
    # aliasing methods as local variables
    # is faster for tight loops
    normalize = morph.normalize

    def norm(word):
        upw = word.upper()
        if upw in cache:
            return cache[upw]
        nw = cache[upw] = normalize(upw).pop()
        return nw

    for doc in documents:
        words = [norm(word) for word in doc.split() if word]
        yield words

for text in iterdocs(docs, morph):
    # if you need all the texts for further use,
    # at least write them to disk or some other persistent
    # storage and re-read them when needed.
    # Here I just write them to sys.stdout as an example
    print(text)
Also, I don't know where you get your documents from but if they are text files, you may want to avoid loading them all in memory. Just read them one by one, and if they are themselves huge don't even read a whole file at once (you can iterate over a file line by line - the most obvious choice for text).
Finally, once you've made sure your code doesn't eat up too much memory for a single document, the next obvious optimisation is parallelisation: run a process per available core and split the corpus between processes (each writing its results to a given place). Then you just have to sum up the results if you need them all at once...
Oh, and yes: if that's still not enough, you may want to distribute the work with some map-reduce framework; your problem looks like a perfect fit for map-reduce.
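For instance, a minimal sketch of the per-core split using the standard multiprocessing module (untested; make_morph is a hypothetical factory that rebuilds the analyzer inside each worker, since the morph object itself may not be picklable, and documents is assumed to be an iterable of document strings):
import multiprocessing as mp

def init_worker():
    # build one analyzer per worker process
    global _morph
    _morph = make_morph()  # hypothetical factory - adapt to your library

def normalize_doc(document):
    # normalize a single document using the per-worker analyzer
    return [list(_morph.normalize(w.upper()))[0] for w in document.split() if w]

if __name__ == '__main__':
    with mp.Pool(processes=mp.cpu_count(), initializer=init_worker) as pool:
        for words in pool.imap(normalize_doc, documents, chunksize=16):
            print(words)  # or write to disk / aggregate as needed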

Effective 1-5 grams extraction with python

I have huge files of 3,000,000 lines, and each line has 20-40 words. I have to extract 1- to 5-grams from the corpus. My input files are tokenized plain text, e.g.:
This is a foo bar sentence .
There is a comma , in this sentence .
Such is an example text .
Currently, I am doing it as below, but this doesn't seem to be an efficient way to extract the 1-5 grams:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import io, os
from collections import Counter
import sys; reload(sys); sys.setdefaultencoding('utf-8')

with io.open('train-1.tok.en', 'r', encoding='utf8') as srcfin, \
     io.open('train-1.tok.jp', 'r', encoding='utf8') as trgfin:
    # Extract words from the files.
    src_words = ['<s>'] + srcfin.read().replace('\n', ' </s> <s> ').split()
    del src_words[-1]  # Removes the final '<s>'
    trg_words = ['<s>'] + trgfin.read().replace('\n', ' </s> <s> ').split()
    del trg_words[-1]  # Removes the final '<s>'

    # Unigram counts.
    src_unigrams = Counter(src_words)
    trg_unigrams = Counter(trg_words)
    # Sum of unigram counts.
    src_sum_unigrams = sum(src_unigrams.values())
    trg_sum_unigrams = sum(trg_unigrams.values())

    # Bigram counts.
    src_bigrams = Counter(zip(src_words, src_words[1:]))
    trg_bigrams = Counter(zip(trg_words, trg_words[1:]))
    # Sum of bigram counts.
    src_sum_bigrams = sum(src_bigrams.values())
    trg_sum_bigrams = sum(trg_bigrams.values())

    # Trigram counts.
    src_trigrams = Counter(zip(src_words, src_words[1:], src_words[2:]))
    trg_trigrams = Counter(zip(trg_words, trg_words[1:], trg_words[2:]))
    # Sum of trigram counts.
    src_sum_trigrams = sum(src_trigrams.values())
    trg_sum_trigrams = sum(trg_trigrams.values())
Is there any other way to do this more efficiently?
How can I optimally extract n-grams for several different values of n simultaneously?
From Fast/Optimize N-gram implementations in python, essentially this:
zip(*[words[i:] for i in range(n)])
when hard-coded is this for bigrams, n=2:
zip(src_words,src_words[1:])
and is this for trigrams, n=3:
zip(src_words,src_words[1:],src_words[2:])
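For example, a quick sketch (reusing src_words from the question) that applies the generalized form for every n from 1 to 5:
from collections import Counter

# one Counter per order n, built with the generalized zip formula above
src_ngram_counters = {n: Counter(zip(*[src_words[i:] for i in range(n)]))
                      for n in range(1, 6)}
src_ngram_totals = {n: sum(c.values()) for n, c in src_ngram_counters.items()}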
If you are interested only in the most common (frequent) n-grams (which is your case, I suppose), you can reuse the central idea of the Apriori algorithm. Given s_min, a minimal support which can be thought of as the number of lines that contain a given n-gram, it efficiently searches for all such n-grams.
The idea is as follows: write a query function which takes an n-gram and tests how many times it is contained in the corpus. After you have such a function prepared (may be optimized as discussed later), scan the whole corpus and get all the 1-grams, i.e. bare tokens, and select those which are contained at least s_min times. This gives you subset F1 of frequent 1-grams. Then test all the possible 2-grams by combining all the 1-grams from F1. Again, select those which hold the s_min criterion and you'll get F2. By combining all the 2-grams from F2 and selecting the frequent 3-grams, you'll get F3. Repeat for as long as Fn is non-empty.
Many optimizations can be done here. When combining n-grams from Fn, you can exploit the fact that n-grams x and y may only be combined to form an (n+1)-gram iff x[1:] == y[:-1] (which may be checked in constant time for any n if proper hashing is used). Moreover, if you have enough RAM (for your corpus, many GBs), you can speed up the query function enormously. For each 1-gram, store a hash-set of the line indices containing that 1-gram. When combining two n-grams into an (n+1)-gram, use the intersection of the two corresponding sets, obtaining the set of lines where the (n+1)-gram may be contained.
The time complexity grows as s_min decreases. The beauty is that infrequent (and hence uninteresting) n-grams are completely filtered as the algorithm runs, saving computational time for the frequent ones only.
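A rough, untested sketch of this procedure (the function and helper names are my own; it keeps a set of line indices per frequent n-gram so that the join and support test stay cheap, at the price of holding the tokenized corpus in memory):
from collections import defaultdict

def frequent_ngrams(lines, s_min, max_n=5):
    """Yield {ngram_tuple: set_of_line_indices} for n = 1, 2, ... while non-empty."""
    tokenized = [line.split() for line in lines]

    def occurs_in(tokens, gram):
        # does `gram` appear as a contiguous run inside `tokens`?
        k = len(gram)
        return any(tuple(tokens[i:i + k]) == gram
                   for i in range(len(tokens) - k + 1))

    # F1: frequent 1-grams, each with the set of lines it occurs in
    occ = defaultdict(set)
    for idx, tokens in enumerate(tokenized):
        for tok in tokens:
            occ[(tok,)].add(idx)
    frequent = {g: s for g, s in occ.items() if len(s) >= s_min}

    n = 1
    while frequent:
        yield frequent
        if n == max_n:
            break
        n += 1
        candidates = {}
        for x, x_lines in frequent.items():
            for y, y_lines in frequent.items():
                if x[1:] == y[:-1]:              # Apriori join condition
                    gram = x + y[-1:]
                    maybe = x_lines & y_lines    # lines where gram *may* occur
                    real = {i for i in maybe if occurs_in(tokenized[i], gram)}
                    if len(real) >= s_min:
                        candidates[gram] = real
        frequent = candidates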
I am giving you a bunch of pointers regarding the general problems you are trying to solve. One or more of these should be useful for you and help you figure this out.
For what you are doing (I am guessing some sort of machine translation experiment), you don't really need to load the two files srcfin and trgfin into memory at the same time (at least not for the code sample you have provided). Processing them separately will be less expensive in terms of the amount of stuff you need to hold in memory at a given time.
You are reading a ton of data into memory, processing it (which takes even more memory), and then holding the results in some in-memory data-structures. Instead of doing that, you should strive to be lazier. Learn about python generators and write a generator which streams out all the ngrams from a given text without needing to hold the entire text in memory at any given point in time. The itertools python package will probably come in handy while writing this.
Beyond a point, it will no longer be feasible for you to hold all this data in memory. You should consider looking at map-reduce to help you break this down. Check out the mrjob python package, which lets you write map-reduce jobs in python. In the mapper step, you will break the text down into its ngrams, and in the reducer stage you will count the number of times you see each ngram to get its overall count. mrjob jobs can also be run locally, which obviously won't give you any parallelization benefits, but is nice because mrjob will still do a lot of the heavy lifting for you.
If you are compelled to hold all the counts in memory at the same time (for a massive amount of text), then either implement some pruning strategy to prune out very rare ngrams, or consider using some persistent file-based lookup table such as SQLite to hold all the data for you.
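For what it's worth, a rough sketch of that mapper/reducer split with mrjob (untested; it assumes mrjob is installed and joins the n-gram tokens with spaces so the keys serialize cleanly):
from mrjob.job import MRJob

class MRNgramCount(MRJob):

    def mapper(self, _, line):
        # emit every 1- to 5-gram on the line with a count of 1
        words = ['<s>'] + line.split() + ['</s>']
        for n in range(1, 6):
            for i in range(len(words) - n + 1):
                yield ' '.join(words[i:i + n]), 1

    def reducer(self, ngram, counts):
        # sum the per-line counts to get the overall count
        yield ngram, sum(counts)

if __name__ == '__main__':
    MRNgramCount.run()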
Assuming you don't want to count ngrams between lines, and assuming naive tokenization:
import collections

def ngrams(n, f):
    deque = collections.deque(maxlen=n)
    for line in f:
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1])  # pre-seed so the 5-gram counter doesn't count incomplete 5-grams
        for word in words[n-1:]:
            deque.append(word)
            yield tuple(str(w) for w in deque)  # n-gram tokenization

counters = [collections.Counter(ngrams(i, open('somefile.txt'))) for i in range(1, 6)]
edit: added beginning/end line tokens
I believe the resultant data object is about as sparse as possible. 3m lines with 40 words each is ~120m tokens. With ~1m words in English (though fewer in common use), you'll probably get a rather long tail. If you can assume your data are exchangeable/iid, then you can add some pruning in the middle:
def ngrams(n, f, prune_after=10000):
    counter = collections.Counter()
    deque = collections.deque(maxlen=n)
    for i, line in enumerate(f):
        deque.clear()
        words = ["<s>"] + line.split() + ["</s>"]
        deque.extend(words[:n-1])
        for word in words[n-1:]:
            deque.append(word)
            ngram = tuple(str(w) for w in deque)
            if i < prune_after or ngram in counter:
                counter[ngram] += 1
    return counter
Relaxing the exchangeability assumption would require something like Tregoreg's answer for efficient pruning, but in most cases exchangeability should hold.
As far as raw speed goes, I think zip (as in the original code) vs deque is the fundamental question. zip removes the innermost loop, so it is likely already very fast. deque requires the innermost loop but also consumes the data iteratively, so its working memory footprint should be much smaller. Which is better will likely depend on your machine, but I'd imagine that for large machines/small data zip would be faster. Once you start running out of memory (especially if you start talking about pruning), however, deque gets a few more advantages.

Trying to fill an array with data opened from files

The following is code I have written that tries to open individual files, which are long strips of data, and read them into an array. Essentially I have files for 15 forecast times (24 hours to 360 hours), and each file has 50 iterations, hence the two loops. I then try to read the files into an array. When I try to print a specific element in the array, I get the error "'file' object has no attribute '__getitem__'". Any ideas what the problem is? Thanks.
#!/usr/bin/python
############################################
#
import csv
import sys
import numpy as np
import scipy as sp
#
#############################################
level = input("Enter a level: ");
LEVEL = str(level);
MODEL = raw_input("Enter a model: ");
NX = 360;
NY = 181;
date = 201409060000;
DATE = str(date);
#############################################
FileList = [];
data = [];
for j in range(1,51,1):
    J = str(j);
    for i in range(24,384,24):
        I = str(i);
        fileName = '/Users/alexg/ECMWF_DATA/DAT_FILES/'+MODEL+'_'+LEVEL+'_v_'+J+'_FT0'+I+'_'+DATE+'.dat';
        FileList.append(fileName);
        fo = open(fileName,"rb");
        data.append(fo);
        fo.close();
print data[1][1];
print FileList;
EDITED TO ADD:
Below, find the CORRECT array that the python script should be producing (sorry, it won't let me post this inline yet):
http://i.stack.imgur.com/ItSxd.png
The problem I now run into, is that the first three values in the first row of the output matrix are:
-7.090874
-7.004936
-6.920952
These values are actually the first three values of the 11th row in the array below, which is the how it should look (performed in MATLAB). The next three values the python script outputs (as what it believes to be the second row) are:
-5.255577
-5.159874
-5.064171
These values should be found in the 22nd row. In other words, Python is placing the 11th row of values in the first position, the 22nd in the second, and so on. I don't have a clue as to why, or where in the code I'm specifying it to do this.
You're appending the file objects themselves to data, not their contents:
fo = open(fileName,"rb");
data.append(fo);
So, when you try to print data[1][1], data[1] is a file object (a closed file object, to boot, but it would be just as broken if still open), so data[1][1] tries to treat that file object as if it were a sequence, and file objects aren't sequences.
It's not clear what format your data are in, or how you want to split it up.
If "long strips of data" just means "a bunch of lines", then you probably wanted this:
data.append(list(fo))
A file object is an iterable of lines, it's just not a sequence. You can copy any iterable into a sequence with the list function. So now, data[1][1] will be the second line in the second file.
(The difference between "iterable" and "sequence" probably isn't obvious to a newcomer to Python. The tutorial section on Iterators explains it briefly, the Glossary gives some more information, and the ABCs in the collections module define exactly what you can do with each kind of thing. But briefly: An iterable is anything you can loop over. Some iterables are sequences, like list, which means they're indexable collections that you can access like spam[0]. Others are not, like file, which just reads one line at a time into memory as you loop over it.)
If, on the other hand, you actually imported csv for a reason, you more likely wanted something like this:
reader = csv.reader(fo)
data.append(list(reader))
Now, data[1][1] will be a list of the columns from the second row of the second file.
Or maybe you just wanted to treat it as a sequence of characters:
data.append(fo.read())
Now, data[1][1] will be the second character of the second file.
There are plenty of other things you could just as easily mean, and easy ways to write each one of them… but until you know which one you want, you can't write it.
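Putting one of those options together with the original double loop, a minimal sketch using the list(fo) variant (this assumes the .dat files are line-oriented text; if they are actually binary, something like numpy.fromfile would be the right tool instead):
FileList = []
data = []
for j in range(1, 51):
    for i in range(24, 384, 24):
        fileName = ('/Users/alexg/ECMWF_DATA/DAT_FILES/' + MODEL + '_' + LEVEL +
                    '_v_' + str(j) + '_FT0' + str(i) + '_' + DATE + '.dat')
        FileList.append(fileName)
        with open(fileName, 'r') as fo:   # text mode, and closed automatically
            data.append(list(fo))         # store the lines, not the file object

print data[1][1]   # second line of the second file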

Set() on a very long list and creating an even longer matrix

I'm trying to get the set() of all words in a very long database of books (around 60,000 books) and to store in a matrix the 'vocabularies' of each book (the paths of books are in "files"):
for f in files:
    book = open(f, 'r')
    vocabulary = []
    for lines in book.readlines():
        words = string.split(lines)
        vocabulary += set(words)
    matrix.extend([vocabulary])
    V += set(vocabulary)
OK, I solved the (memory) problem by creating a file to store everything, but now I get another memory error when trying to create a matrix with:
entries = numpy.zeros((len(V),a))
I tried to solve this by:
entries = numpy.memmap('matrice.mymemmap', shape=(len(V),a))
but the terminal says:
File "/usr/lib/python2.7/dist-packages/numpy/core/memmap.py", line 193, in new
fid = open(filename, (mode == 'c' and 'r' or mode)+'b')
IOError: [Errno 2] No such file or directory: 'matrice.mymemmap'
Can you help me solve this too?
V = set()
for f in files:
    with open(f, 'r') as book:
        for lines in book.readlines():
            words = lines.split(" ")
            V.update(words)
Here you first create an empty set. Then for each file you iterate through the lines in the file and split each line by the spaces. This gives you a list of words on the line. Then you update the set by the list of words, i.e. only unique words remain in the set.
So, you will end up with V which contains all the words in your library.
Of course, you might want to clean up case and punctuation in the words, and remove empty words (""), before updating the set. That should happen before the V.update() statement. Otherwise you end up with duplicates such as both It and it, or fortunately, (with a trailing comma) and fortunately, etc.
Please note the with statement with file operation. This ensures that whatever happens, the file will be closed before you leave the with-block.
If you want to do this book-by-book, then:
vocabularies = []
for f in files:
    V = set()
    with open(f, 'r') as book:
        for lines in book.readlines():
            words = lines.split(" ")
            V.update(words)
    vocabularies.append(V)
Also, instead of for lines in book.readlines(): you may use just for lines in book:.
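A small sketch of that cleaning step (lower-casing, stripping surrounding punctuation and dropping empty strings before the update; adjust the rules to your needs):
import string

V = set()
for f in files:
    with open(f, 'r') as book:
        for line in book:
            words = (w.strip(string.punctuation).lower() for w in line.split())
            V.update(w for w in words if w)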
I don't think your code does what you think it does:
for f in files:
    book = open(f, 'r')
    vocabulary = []
You've created an empty list called vocabulary
    for lines in book.readlines():
        words = string.split(lines)
        vocabulary += set(words)
For each line in the file, you're creating a set of the words in that line. But then you add it to vocabulary, which is a list. This just puts the elements on the end of the list. If a word appears on multiple lines, it will appear in vocabulary once for every line. This could make vocabulary very large.
matrix.extend([vocabulary])
From this, I would assume that matrix is also a list. This will give you one entry in matrix for each book, and that entry will be a huge list as described above.
V += set(vocabulary)
Is V a list or a set? This builds a set from vocabulary (removing the duplicates again), and then adds all of that set's elements to V.
First of all, I think you probably intend for vocabulary to be a set. To create an empty set, use vocabulary = set(). To add one item to a set, use vocabulary.add(word) and to add a collection use vocabulary.update(words). It looks like you mean to do the same with V. That should reduce your memory requirements a lot. That alone might be enough to fix your problem.
If that's not enough to make it work, consider whether you need all of matrix in memory at once. You could write it to a file instead of accumulating it in memory.
You'll probably accumulate lots of extra words due to punctuation and capitalization. Your sets would be smaller if you didn't count 'clearly', 'Clearly', 'Clearly.', 'clearly.', 'clearly,'... as being distinct.
As others have noted, you should use a with statement to make sure your file is closed. However, I doubt this is causing your problem. While it's not guaranteed by all Pythons, in this case the file is probably getting closed automatically quite promptly.
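For the "write it to a file instead of accumulating it in memory" suggestion, a minimal sketch (one book per line, words separated by spaces; the output format is just an assumption for illustration):
with open('vocabularies.txt', 'w') as out:
    for f in files:
        vocabulary = set()
        with open(f, 'r') as book:
            for line in book:
                vocabulary.update(line.split())
        out.write(' '.join(sorted(vocabulary)) + '\n')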
In Python, you can't add values to a list using +=. Instead, use
vocabulary.append(set(words))
EDIT: I was wrong.
