NGram based Language detection William B. Cavnar and John M. Trenkle - python

I am trying to implement the n-gram-based language detection paper by William B. Cavnar and John M. Trenkle, using https://github.com/z0mbiehunt3r/ngrambased-textcategorizer/blob/master/ngramfreq.py:
import operator
import string
import glob
import os.path
from nltk.util import ngrams

#file which contains the language to be detected
filename = raw_input("Enter the file name: ")
fp = open(filename)
text = str(fp.read())
fp.close()

#tokenize the text
rawtext = text.translate(None, string.punctuation)
words = [w.lower() for w in rawtext.split(" ")]

#generate ngrams for the text
gen_ngrams = []
for word in words:
    for i in range(1, 6):
        temp = ngrams(word, i, pad_left=True, pad_right=True, left_pad_symbol=' ', right_pad_symbol=' ')
        #join the characters of individual ngrams
        for t in temp:
            ngram = ' '.join(t)
            gen_ngrams.append(ngram)

#calculate ngram frequencies of the text
ngram_stats = {}
for n in gen_ngrams:
    if not ngram_stats.has_key(n):
        ngram_stats.update({n: 1})
    else:
        ng_count = ngram_stats[n]
        ngram_stats.update({n: ng_count + 1})

#now sort them, add an iterator to dict and reverse sort based on second column (count of ngrams)
ngrams_txt_sorted = sorted(ngram_stats.iteritems(), key=operator.itemgetter(1), reverse=True)[0:300]

#Load ngram language statistics
lang_stats = {}
for filepath in glob.glob('./langdata/*.dat'):
    filename = os.path.basename(filepath)
    lang = os.path.splitext(filename)[0]
    ngram_stats = open(filepath, "r").readlines()
    ngram_stats = [x.rstrip() for x in ngram_stats]
    lang_stats.update({lang: ngram_stats})

#compare ngram frequency statistics by doing a rank order lookup
lang_ratios = {}
txt_ng = [ng[0] for ng in ngrams_txt_sorted]
print txt_ng
max_out_of_place = len(txt_ng)
for lang, ngram_stat in lang_stats.iteritems():
    lang_ng = [ng[0] for ng in lang_stats]
    doc_dist = 0
    for n in txt_ng:
        try:
            txt_ng_index = txt_ng.index(n)
            lang_ng_index = lang_ng.index(n)
        except ValueError:
            lang_ng_index = max_out_of_place
        doc_dist += abs(lang_ng_index - txt_ng_index)
    lang_ratios.update({lang: doc_dist})

for i in lang_ratios.iteritems():
    print i

predicted_lang = min(lang_ratios, key=lang_ratios.get)
print "The language is", predicted_lang
It outputs 'English' every time I execute it. The computed distances are always the same for all the languages. I am not able to figure out the logical error in the above code. Kindly help me.

Comparing to the Cavnar & Trenkle code, it looks like
ngram = ' '.join(t)
should be
ngram = ''.join(t)
(without the space)
I bet this is what's throwing off your stats.
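For reference, here is a minimal sketch of what the corrected joining step and the out-of-place comparison could look like. It reuses the variable names from the question (words, txt_ng, lang_stats, max_out_of_place) and assumes each line of the .dat files holds a single n-gram, already in rank order. Note that the question's comparison loop also builds lang_ng from the lang_stats dict rather than from ngram_stat, which would give every language the same distance.

# Sketch only: relies on words, txt_ng, lang_stats and max_out_of_place
# being defined as in the question, and on each .dat file containing one
# n-gram per line in rank order.
from nltk.util import ngrams

gen_ngrams = []
for word in words:
    for i in range(1, 6):
        for t in ngrams(word, i, pad_left=True, pad_right=True,
                        left_pad_symbol=' ', right_pad_symbol=' '):
            gen_ngrams.append(''.join(t))  # no separator between the characters

lang_ratios = {}
for lang, ngram_stat in lang_stats.items():
    lang_ng = ngram_stat  # this language's ranked n-gram list, not the lang_stats dict
    doc_dist = 0
    for txt_ng_index, n in enumerate(txt_ng):
        try:
            lang_ng_index = lang_ng.index(n)
        except ValueError:
            lang_ng_index = max_out_of_place
        doc_dist += abs(lang_ng_index - txt_ng_index)
    lang_ratios[lang] = doc_dist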

Related

How can I iterate over a file looking for keywords defined within a list

I have a defined list of keywords and a text file. I would like to search the text file and count how many times each of the keywords within my list appear. Example:
kw = ['max speed', 'time', 'distance', 'travel', 'down', 'up']
with open("file.txt", "r") as f:
data_file = f.read()
d = dict()
for line in data_file:
line = line.strip()
line = line.lower()
words = line.split(" ")
for word in words:
if word in d:
d[word] = d[word] + 1
else:
d[word] = 1
for key in list(d.keys()):
print(key, ":", d[key])
Now let's say we run the code: it should search file.txt and loop through the list. If it finds a keyword from the list, it should print that word and how many times it was found. If a keyword is not found, it isn't reported.
Example output:
Keywords Found:
max speed: 3
travel: 7
distance: 3
Can't quite get this to work like I want. Any feedback would be great! Thank you in advance!
There are several algorithms you can use; there are special algorithms for finding specific words in texts. The easiest one is the naive algorithm; here is code that I wrote:
def naive_string_matching(text, pattern):
    txt_len, pat_len = len(text), len(pattern)
    result = []
    for s in range(txt_len - pat_len + 1):
        if pattern == text[s:s+pat_len]:
            result.append(s)
    return result
This naive algorithm takes as input a text and a single pattern word to search for. Its complexity is O((n − m + 1)m), where m is the length of the pattern and n is the length of the text.
The next algorithm you can use, which has better complexity than the naive algorithm, is the finite automaton algorithm; you can read more about it if you are interested. Here is my implementation of it:
def transition_table(pattern):
    alphabet = set(pattern)
    ptt_len = len(pattern)
    result = []
    for q in range(ptt_len+1):
        result.append({})
        for l in alphabet:
            k = min(len(pattern), q+1)
            while True:
                if k == 0 or pattern[:k] == (pattern[:q] + l)[-k:]:
                    break
                k -= 1
            result[q][l] = k
    return result

def fa_string_matching(text, pattern):
    q = 0
    delta = transition_table(pattern)
    txt_len = len(text)
    result = []
    for s in range(txt_len):
        if text[s] in delta[q]:
            q = delta[q][text[s]]
            if q == len(delta) - 1:
                result.append(s+1-q)
        else:
            q = 0
    return result
The matching itself runs in O(n), but the pre-processing (the transition_table function) is more expensive: as written it is roughly O(m³·|Σ|), where |Σ| is the size of the pattern's alphabet, n is again the length of the text, and m is the length of the pattern.
The last algorithm I can propose is the KMP (Knuth–Morris–Pratt) algorithm, which is the fastest of the three. Again, my implementation of it:
def prefix_function(pattern):
    pat_len = len(pattern)
    pi = [0]
    k = 0
    for q in range(1, pat_len):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k-1]
        if pattern[k] == pattern[q]:
            k += 1
        pi.append(k)
    return pi

def kmp_string_matching(text, pattern):
    txt_len, pat_len = len(text), len(pattern)
    pi = prefix_function(pattern)
    q = 0
    result = []
    for i in range(txt_len):
        while q > 0 and pattern[q] != text[i]:
            q = pi[q-1]
        if pattern[q] == text[i]:
            q += 1
        if q == pat_len:
            result.append(i - q + 1)
            q = pi[q-1]
    return result
As input it takes the full text and the pattern you are looking for. The matching complexity of the KMP algorithm is the same O(n) as the finite automaton algorithm, but its pre-processing (prefix_function) is faster, running in O(m).
If you are interested in topics like pattern matching and finding the occurrences of a pattern in a text, I highly recommend becoming acquainted with all of them.
To open a file you can simply run:
with open(file_name) as file:
    text = file.read()

result = naive_string_matching(text, pattern)
where file_name is the name of your file and pattern is the phrase you want to search for in the text. To search for several patterns from a list, you can try:
example_patterns = ['max speed', 'time', 'distance', 'travel', 'down', 'up']

with open(file_name) as file:
    text = file.read()

for pattern in example_patterns:
    result = kmp_string_matching(text, pattern)
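Since the question asks for a count per keyword, you would presumably collect the lengths of the returned match lists rather than overwrite result on each iteration; a minimal sketch built on the kmp_string_matching function above:

counts = {}
for pattern in example_patterns:
    matches = kmp_string_matching(text, pattern)
    if matches:  # only report keywords that were actually found
        counts[pattern] = len(matches)
print(counts)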
import re

keywords = ['max speed', 'time', 'distance', 'travel', 'down', 'up']
keywords = [x.replace(' ', r'\s') for x in keywords]  # replaces spaces with whitespace indicator

with open('file.txt', 'r') as file:
    data = file.read()

keywords_found = {}
for key in keywords:
    found = re.findall(key, data, re.I)  # re.I means it'll ignore case.
    if found:
        keywords_found[key] = len(found)

print(keywords_found)

python "indexerror: list index out of range" when iterating through files in a directory

I have written a program to iterate through files in a directory, get the words that come 1 index before and 2 indexes after a specific word from a list (a sort of highly specific concordance), then compare those words to a dictionary and count the number of matches. It worked well on a small set of test files, but when applied to a directory of about 800 files, only about half process before I get an "IndexError: list index out of range".
Program
import sys
import os
import io
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

files_path = sys.argv[1]
textfile_dictionary = sys.argv[2]
rmsa_words = ["face", "countenance", "manner", "look", "expression", "appearance"]

for filename in os.listdir(files_path):
    if filename.endswith(".txt"):
        file = open(os.path.join(files_path, filename), "rt")
        text = file.read()
        words = word_tokenize(text)
        words = [word.lower() for word in words if word.isalpha()]
        stops = stopwords.words("english")
        tokens = [word for word in words if word not in stops]
        conc_words = []
        i = []
        for word in rmsa_words:
            i += [x for x, token in enumerate(tokens) if token == word]
        for number in i:
            if number > 0 and number <= (len(tokens) - 2):
                conc_words = conc_words + [tokens[number-1], tokens[number+1], tokens[number+2]]
        ps = PorterStemmer()
        conc_stems = []
        for word in conc_words:
            conc_stems.append(ps.stem(word))
        file = io.open(textfile_dictionary, mode="r", encoding="utf8")
        dictionaryread = file.read()
        dictionary = dictionaryread.split()
        dictionary_stems = []
        for word in dictionary:
            dictionary_stems.append(ps.stem(word))
        rmsa_count = 0
        for element in dictionary_stems:
            for w in conc_stems:
                if w == element:
                    rmsa_count = rmsa_count + 1
        print(filename, len(tokens), rmsa_count)
The problem has something to do with the indexing in this section:
for word in rmsa_words:
    i += [x for x, token in enumerate(tokens) if token == word]
for number in i:
    if number > 0 and number <= (len(tokens) - 2):
        conc_words = conc_words + [tokens[number-1], tokens[number+1], tokens[number+2]]
I included the "if number > 0 and number <= (len(tokens) - 2)" to try and stop the loop searching for an index beyond the bounds of the list, but I'm still getting the index error. I have read several index error posts on SO, but they are all too different from my code to help.
Any thoughts on why this is happening, particularly why it seems to work for some files and not others, would be much appreciated. I am new to coding.
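One thing worth noting about the guard quoted above: number <= (len(tokens) - 2) still allows number to equal len(tokens) - 2, in which case tokens[number+2] indexes one position past the end of the list. A minimal sketch of a tighter check, reusing the question's variable names:

# Only keep positions where number-1, number+1 and number+2 all exist.
for number in i:
    if 0 < number < len(tokens) - 2:
        conc_words = conc_words + [tokens[number-1], tokens[number+1], tokens[number+2]]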

natural language corpus string to int

Take a sample of sentences from each of the corpus1, corpus2 and corpus3 corpora and display the average length (measured as the number of characters in the sentence).
So I have 3 corpora, and sample_raw_sents is a defined function that returns random sentences:
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()

sample_size = 50

for sentence in tcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    print(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    print(len(sentence))
Using this code all the lengths are printed, but how do I sum() these lengths?
Use zip, it will allow you to draw a sentence from each corpus all at once.
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50

zipped = zip(tcr.sample_raw_sents(sample_size),
             rcr.sample_raw_sents(sample_size),
             mcr.sample_raw_sents(sample_size))

for s1, s2, s3 in zipped:
    summed = len(s1) + len(s2) + len(s3)
    average = summed / 3
    print(summed, average)
You could store all the sentence lengths in a list and then sum them up.
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50

lengths = []
for sentence in tcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in rcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))
for sentence in mcr.sample_raw_sents(sample_size):
    lengths.append(len(sentence))

print(sum(lengths) / len(lengths))
tcr = corpus1()
rcr = corpus2()
mcr = corpus3()
sample_size = 50

s = 0
for sentence in tcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in rcr.sample_raw_sents(sample_size):
    s = s + len(sentence)
for sentence in mcr.sample_raw_sents(sample_size):
    s = s + len(sentence)

average = s / 150  # 150 sentences in total: 3 corpora * 50 samples each
print('average: {}'.format(average))

Counting phrase frequency in Python 3.3.2

I have been examining different sources on the web and have tried various methods, but could only find how to count the frequency of unique words, not unique phrases. The code I have so far is as follows:
import collections
import re

wanted = set(['inflation', 'gold', 'bank'])
cnt = collections.Counter()
words = re.findall('\w+', open('02.2003.BenBernanke.txt').read().lower())
for word in words:
    if word in wanted:
        cnt[word] += 1
print(cnt)
If possible, I would also like to count the number of times the phrases 'central bank' and 'high inflation' are used in this text. I appreciate any suggestion or guidance you can give.
First of all, this is how I would generate the cnt that you do (to reduce memory overhead)
def findWords(filepath):
    with open(filepath) as infile:
        for line in infile:
            words = re.findall('\w+', line.lower())
            yield from words

cnt = collections.Counter(findWords('02.2003.BenBernanke.txt'))
Now, on to your question about phrases:
from itertools import tee

phrases = {'central bank', 'high inflation'}
fw1, fw2 = tee(findWords('02.2003.BenBernanke.txt'))
next(fw2)
for w1, w2 in zip(fw1, fw2):
    phrase = ' '.join([w1, w2])
    if phrase in phrases:
        cnt[phrase] += 1
Hope this helps
To count literal occurrences of a couple of phrases in a small file:
with open("input_text.txt") as file:
text = file.read()
n = text.count("high inflation rate")
There is the nltk.collocations module, which provides tools to identify words that often appear consecutively:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.collocations import BigramCollocationFinder, TrigramCollocationFinder

# run nltk.download() if there are files missing
words = [word.casefold() for sentence in sent_tokenize(text)
         for word in word_tokenize(sentence)]
words_fd = nltk.FreqDist(words)
bigram_fd = nltk.FreqDist(nltk.bigrams(words))
finder = BigramCollocationFinder(words_fd, bigram_fd)
bigram_measures = nltk.collocations.BigramAssocMeasures()
print(finder.nbest(bigram_measures.pmi, 5))
print(finder.score_ngrams(bigram_measures.raw_freq))

# finder can be constructed from words directly
finder = TrigramCollocationFinder.from_words(words)
# filter words
finder.apply_word_filter(lambda w: w not in wanted)
# top n results
trigram_measures = nltk.collocations.TrigramAssocMeasures()
print(sorted(finder.nbest(trigram_measures.raw_freq, 2)))
Assuming the file is not huge, this is the easiest way:
for w1, w2 in zip(words, words[1:]):
    phrase = w1 + " " + w2
    if phrase in wanted:
        cnt[phrase] += 1
print(cnt)

counting a set dictionary of words in a specific html tag

I am trying to parse a document, and if there is a name associated with a specific docno, count the total number of names. After the for loop ends for that docno, I want to store names[docno] = word count. So, if namedict = {'henry': '', 'joe': ''} and henry appears in docno 'doc 1' 4 times and joe 6 times, the dictionary would store it as ('doc 1': 10). So far, all I can figure out is how to count the total number of names in the entire text file.
from xml.dom.minidom import *
import re
from string import punctuation
from operator import itemgetter

def parseTREC1 (atext):
    fc = open(atext,'r').read()
    fc = '<DOCS>\n' + fc + '\n</DOCS>'
    dom = parseString(fc)
    w_re = re.compile('[a-z]+',re.IGNORECASE)
    doc_nodes = dom.getElementsByTagName('DOC')
    namelist={'Matt':'', 'Earl':'', 'James':''}
    default=0
    indexdict={}
    N=10
    names={}
    words={}
    for doc_node in doc_nodes:
        docno = doc_node.getElementsByTagName('DOCNO')[0].firstChild.data
        cnt = 1
        for p_node in doc_node.getElementsByTagName('P'):
            p = p_node.firstChild.data
            words = w_re.findall(p)
            words_gen=(word.strip(punctuation).lower() for line in words
                       for word in line.split())
            for aword in words:
                if aword in namelist:
                    names[aword]=names.get(aword, 0) + 1
    print names
    # top_words=sorted(names.iteritems(), key=lambda(word, count): (-count, word))[:N]
    # for word, frequency in top_words:
    #     print "%s: %d" % (word, frequency)
    #print words + top_words
    #print docno + "\t" + str(numbers)

parseTREC1('LA010189.txt')
I've cleaned up your code a bit to make it easier to follow. Here are a few comments and suggestions:
To answer the key question: you should be storing the count in names[docno] = names.get(docno, 0) + 1.
Use a defaultdict(int) instead of names.get(aword, 0) + 1 to accumulate the count.
Use set() for the namelist.
Adding the re.MULTILINE option to your regular expression should remove the need for line.split().
You didn't use your words_gen; was that an oversight?
I used this doc to test with, based on your code:
<DOC>
<DOCNO>1</DOCNO>
<P>groucho
harpo
zeppo</P>
<P>larry
moe
curly</P>
</DOC>
<DOC>
<DOCNO>2</DOCNO>
<P>zoe
inara
kaylie</P>
<P>mal
wash
jayne</P>
</DOC>
Here is a cleaned-up version of the code to count names in each paragraph:
import re
from collections import defaultdict
from string import punctuation
from xml.dom.minidom import *

RE_WORDS = re.compile('[a-z]+', re.IGNORECASE | re.M)

def parse(path, names):
    data = '<DOCS>' + open(path, 'rb').read() + '</DOCS>'
    tree = parseString(data)
    hits = defaultdict(int)
    for doc in tree.getElementsByTagName('DOC'):
        doc_no = 'doc ' + doc.getElementsByTagName('DOCNO')[0].firstChild.data
        for node in doc.getElementsByTagName('P'):
            text = node.firstChild.data
            words = (w.strip(punctuation).lower()
                     for w in RE_WORDS.findall(text))
            hits[doc_no] += len(names.intersection(words))
    for item in hits.iteritems():
        print item

names = set(['zoe', 'wash', 'groucho', 'moe', 'curly'])
parse('doc.xml', names)
Output:
(u'doc 2', 2)
(u'doc 1', 3)
