Comparing two strings with low/no consistency - python

I have two strings
a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'
I need to check if they match and print the result accordingly. Since b is the string I want to check for in a, the result should print 'Match' or 'Mis-match' accordingly. The code should not depend on the '_' in a, as they can be '-' or spaces as well.
I have tried using the fuzzywuzzy library and fuzz.token_set_ratio to calculate the ratio.
From observation, I chose a value of 95 to be convincing.
I want to know if there is another way to check this without using fuzzywuzzy, probably difflib.
I tried using difflib and SequenceMatcher, but all I get is a word-wise comparison and I am unable to combine the results exactly.
I have tried the following code.
from fuzzywuzzy import fuzz
a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos1'
ratio = fuzz.token_set_ratio(a.lower(), b.lower())
if ratio >= 95:
    print('Match')
else:
    print('Mis-Match')
Output:
'Mis-Match'
This gives a score of 64, even though controlling, pos0 and pos1 all appear in both a and b, so it should report a match instead.
I tried this approach because it does not depend on the '_', '-' or spaces.

You can use the gensim library to implement a MatchSemantic function like this:
Initialization
If you run the code for the first time, a progress bar will go from 0% to 100% while gensim downloads glove-wiki-gigaword-50; after that everything is set up and you can simply run the code.
Code
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity


def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize, clean up input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary and the TF-idf model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    doc_similarity_scores = index[query_tf]

    # Output the sorted similarity scores and documents
    sorted_indexes = np.argsort(doc_similarity_scores)[::-1]
    for idx in sorted_indexes:
        if documents[idx] != '':
            if doc_similarity_scores[idx] > 0.0:
                print('Match')
            else:
                print('Mis-Match')
Usage
For example, we want to see if 'Fruit and Vegetables' matches any of the sentences or items inside documents.
Test:
query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)
The first item, 'I have an apple in my basket', has a semantic relation with 'Fruit and Vegetables', so it prints Match; for the second item no relation is found, so it prints Mis-Match.
output:
Match
Mis-Match

It looks like you are trying to use tools like fuzzywuzzy that were not really designed for that.
One possible approach to this problem could be to find how many tokens from the second text are present in the first text.
This can be normalized by the total number of tokens in the second text.
Then you can threshold to whatever value you deem fit.
One possible way of implementing this is the following:
Tokenize (i.e. convert to a list of tokens) the input texts a and b.
Collect each token list into a corresponding counter (i.e. some data structure for counting which tokens are present).
Compute the intersection a_i_b of the tokens for a and b.
Compute some metric based on the total occurrences of a_i_b (weight_a_i_b) and the total occurrences of b (weight_b). This final metric is a proxy of the "amount" of b contained in a. It could be a ratio or a difference and should use the fact that weight_a_i_b <= weight_b by construction.
The difference weight_b - weight_a_i_b results in a number between 0 and the number of tokens in b, which is also a direct measure of how many tokens from b are not found in a, hence 0 indicates perfect matching.
The ratio weight_a_i_b / weight_b results in a number between 0 and 1, with 1 meaning perfect matching and 0 meaning no matching.
The difference metric is probably better suited for small numbers of tokens and easier to interpret and threshold in a meaningful way (e.g. accepting a value below 2 means that at most one token from b is not present in a).
On the other hand, the ratio is more standard and probably better suited for larger token lists.
All this would translate into the following code, leveraging collections.Counter() for counting the tokens:
import collections


def contains_tokens(
        text_a,
        text_b,
        tokenize_kws=None,
        metric=lambda a, b, a_i_b: b - a_i_b):
    """Compute a metric of how much of `b` is contained in `a`."""
    tokenize_kws = dict(tokenize_kws) if tokenize_kws is not None else {}
    counter_a = collections.Counter(tokenize(text_a, **tokenize_kws))
    counter_b = collections.Counter(tokenize(text_b, **tokenize_kws))
    counter_a_i_b = counter_a & counter_b
    weight_a = counter_total(counter_a)
    weight_b = counter_total(counter_b)
    weight_a_i_b = counter_total(counter_a_i_b)
    return metric(weight_a, weight_b, weight_a_i_b)
The first step, i.e. tokenization, is achieved with the following function.
This is a bit primitive, but it gets the job done for your input.
What it essentially does is replace a number of special characters (ignores) with blanks, and then split the string along the blanks, optionally excluding the tokens in a blacklist (excludes).
def tokenize(
        text,
        case_sensitive=False,
        ignores=('_', '-', ':', ',', '.', '?', '!'),
        excludes=('the', 'from', 'to')):
    """Tokenize a text, ignoring some characters and excluding some tokens."""
    if not case_sensitive:
        text = text.lower()
    for ignore in ignores:
        text = text.replace(ignore, ' ')
    for token in text.split():
        if token not in excludes:
            yield token
To count the total number of values in a counter, the following function is used. However, for Python 3.10 and later, there is a built-in method Counter.total() which does exactly the same.
def counter_total(counter):
    """Count the total number of values."""
    return sum(counter.values())
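For example, on Python 3.10+ the helper is equivalent to the built-in method (a minimal illustration):
import collections

c = collections.Counter('controlling robotic hand'.split())
print(sum(c.values()))  # 3
print(c.total())        # 3, Python 3.10+ only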
For the given input this becomes:
a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'
# all tokens from `b` are in `a`
print(contains_tokens(a, b))
# 0
and
a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos1'
# all tokens from `b` are in `a`
print(contains_tokens(a, b))
# 0
a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos2'
# one token from `b` (`pos2`) not in `a`
print(contains_tokens(a, b))
# 1
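The ratio described earlier can be obtained by passing a different metric, e.g. (a small example reusing the last a and b):
# ratio of tokens from `b` found in `a`: 1.0 means perfect matching
print(contains_tokens(a, b, metric=lambda a, b, a_i_b: a_i_b / b))
# 0.666... (two of the three tokens of `b` are found in `a`)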
Note that distance-based functions (like fuzz.token_set_ratio() or fuzz.partial_ratio()) cannot be used in this context because they are sensitive to how much "noise" is present in the first text. For example, if b = 'a b c', those tokens are contained equally in a = 'a c' as well as in a = 'a b c d e f g h i', and a distance cannot account for that, most notably because distance functions are symmetric (i.e. f(a, b) = f(b, a)) while the function you are looking for is not (i.e. f(a, b) != f(b, a)).
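As a quick illustration of that asymmetry, swapping the arguments of contains_tokens() changes the result (a small sketch reusing the functions defined above):
a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'

print(contains_tokens(a, b))
# 0 -> every token of `b` is found in `a`
print(contains_tokens(b, a))
# 5 -> 'test', '4567', 'with', 'arduino' and 'uno' are not found in `b`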

Related

Clustering script fails with German, but works as expected with English

I have a script to cluster keywords, utilizing pandas and polyfuzz. With English, it works as expected. When I try to use the script with keywords in German, it recognizes multiple keywords wrongly.
What "wrongly recognized" means: the clustering recognizes the first and second word in the keyword. And as you can see on the screenshot, columns G and H (First Word and Second Word) contain other words than the corresponding keywords in column B (Keyword).
The script does not always fail with German; many keywords are clustered correctly. But the share of wrongly recognized keywords is very high, up to 20%.
Could somebody explain to me why the script fails with German keywords and, ideally, improve the script so that it works with German?
Here is the part of the script, which does clustering:
# find keywords from one column in another in any order and count the frequency
df_matched['Cluster Name'] = df_matched['Cluster Name'].str.strip()
df_matched['Keyword'] = df_matched['Keyword'].str.strip()
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[0]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[1]
df_matched['Total Keywords'] = df_matched['First Word'].str.count(' ') + 1

def ismatch(s):
    A = set(s["First Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A

df_matched['Found'] = df_matched.apply(ismatch, axis=1)
df_matched = df_matched.fillna('')

def ismatch(s):
    A = set(s["Second Word"].split())
    B = set(s['Keyword'].split())
    return A.intersection(B) == A

df_matched['Found 2'] = df_matched.apply(ismatch, axis=1)

# todo - document this algo. Essentially if it matches on the second word only, it renames the cluster to the second word
# clean up code and variable names
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == True), "Cluster Name"] = df_matched["Second Word"]
df_matched.loc[(df_matched["Found"] == False) & (df_matched["Found 2"] == False), "Cluster Name"] = "zzz_no_cluster_available"

# count cluster_size
df_matched['Cluster Size'] = df_matched['Cluster Name'].map(df_matched.groupby('Cluster Name')['Cluster Name'].count())
df_matched.loc[df_matched["Cluster Size"] == 1, "Cluster Name"] = "zzz_no_cluster_available"
df_matched = df_matched.sort_values(by="Cluster Name", ascending=True)
Here are two datasets:
Working dataset in English: http://dl.dropboxusercontent.com/s/zrobh2x4bs3ztlf/working-dataset-english.txt
Badly working dataset in German: http://dl.dropboxusercontent.com/s/i1p3j3zi1t0cev3/badly-working-dataset-german.txt
And here, the working Colab with the whole script.
I opened the full code to understand where df_matched came from.
I'm not 100% sure of what you are trying to do, but I think that the problem comes from before the snippet you shared here.
It comes from the way that df_matched is created. It uses fuzzy matching to create clusters. So the words of "Cluster Name" are not all guaranteed to be present in "Keyword".
If you run the code for the English data, and check the words in position -1 and -2 (last two words of the Cluster Name) instead of 0 and 1...
df_matched['First Word'] = df_matched['Cluster Name'].str.split(" ").str[-1]
df_matched['Second Word'] = df_matched['Cluster Name'].str.split(" ").str[-2]
...then calculate how many of them are not found...
print((~df_matched["Found"]).sum())
print((~df_matched["Found 2"]).sum())
# 140
# 10
...you can see that for 104 out of 158 rows, the last word is not part of the keywords.
(I don't know if you care about the first two words more than the last two... but this looks worse than the 20% you noticed in the German data.)
For the German one the problem is more visible because this language uses a lot of compound words and many frequent suffixes (e.g., "ung")... So they will fuzzy-match a lot.
Example of df_matched for German: the "From" words are not present in "To", but there are large overlaps.
This is df_matched for English: some words of "From" are not even close to the words in "To", and the similarity score can be worse than in the German dataset.
Possible improvements
I think that the part where you could improve the clustering is this (from the colab notebook)
df_1_list = df_1.Keyword.tolist()  # create list from df
model = PolyFuzz("TF-IDF")
cluster_tags = df_1_list[::]
cluster_tags = set(cluster_tags)
cluster_tags = list(cluster_tags)
print("Cleaning up the cluster tags.. Please be patient!")
substrings = {w1 for w1 in tqdm(cluster_tags) for w2 in cluster_tags if w1 in w2 and w1 != w2}
longest_word = set(cluster_tags) - substrings
longest_word = list(longest_word)
shortest_word_list = list(set(cluster_tags) - set(longest_word))
try:
    model.match(df_1_list, shortest_word_list)
except ValueError:
    print("Empty Dataframe, Can't Match - Check the URL Filter!")
    sys.exit()
model.group(link_min_similarity=sim_match_percent)
df_matched = model.get_matches()
Here you compute the similarity between df_1_list and shortest_word_list.
shortest_word_list is created by looking for substrings, which might lead to weird clusters in German because of compound words.
You could try to normalize the text with (language-specific) stemming or lemmatization before/instead of checking for substrings and creating clusters. This should help by transforming each word into its "root form" while retaining its meaning.
You can use the spaCy library, which provides language-specific pretrained models for stemming, embedding and other language operations.
You can select the correct model for each language and use the lemmatization function to replace each word of df_1_list with its "base form" before trying to cluster.
Lemmatization example
import spacy
nlp = spacy.load("en_core_web_sm") # load English or German model
lemmatizer = nlp.get_pipe("lemmatizer")
print(lemmatizer.mode) # 'rule'
doc = nlp("I was reading the paper.")
print([token.lemma_ for token in doc])
# ['I', 'be', 'read', 'the', 'paper', '.']
Link to spaCy German model: https://spacy.io/models/de
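A rough sketch of how lemmatizing the keyword list could look for the German data (assuming the German model de_core_news_sm is installed; the variable name df_1_list mirrors the notebook snippet above):
import spacy

# German model; install with: python -m spacy download de_core_news_sm
nlp = spacy.load("de_core_news_sm")

def lemmatize_keyword(keyword):
    # replace every word of the keyword with its lemma ("base form")
    return " ".join(token.lemma_ for token in nlp(keyword))

# lemmatize the keywords before building cluster_tags and matching with PolyFuzz
df_1_list = [lemmatize_keyword(kw) for kw in df_1_list]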

Calculate TF-IDF using sklearn for variable-n-grams in python

Problem:
Using scikit-learn to find the number of hits of variable n-grams of a particular vocabulary.
Explanation:
I got examples from here.
Imagine I have a corpus and I want to find how many hits (counting) has a vocabulary like the following one:
myvocabulary = [(window=4, words=['tin', 'tan']),
                (window=3, words=['electrical', 'car']),
                (window=3, words=['elephant', 'banana'])]
What I call window here is the length of the span of words within which the words can appear, as follows:
'tin tan' is a hit (within 4 words)
'tin dog tan' is a hit (within 4 words)
'tin dog cat tan' is a hit (within 4 words)
'tin car sun eclipse tan' is NOT a hit: tin and tan appear more than 4 words away from each other.
I just want to count how many times (window=4, words=['tin', 'tan']) appears in a text, do the same for all the other entries, and then add the result to a pandas DataFrame in order to calculate tf-idf.
I could only find something like this:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
where vocabulary is a simple list of strings, being single words or several words.
Besides, from the scikit-learn documentation:
class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
does not help either.
Any ideas?
I am not sure if this can be done using CountVectorizer or TfidfVectorizer. I have written my own function for doing this as follows:
import pandas as pd
import numpy as np
import string

def contained_within_window(token, word1, word2, threshold):
    word1 = word1.lower()
    word2 = word2.lower()
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    if (word1 in token) and (word2 in token):
        word_list = token.split(" ")
        word1_index = [i for i, x in enumerate(word_list) if x == word1]
        word2_index = [i for i, x in enumerate(word_list) if x == word2]
        count = 0
        for i in word1_index:
            for j in word2_index:
                if np.abs(i - j) <= threshold:
                    count = count + 1
        return count
    return 0
SAMPLE:
corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]
df = pd.DataFrame(corpus, columns=["Test"])
your df will look like this:
Test
0 This is the first document. And this is what I...
1 This document is the second document.
2 And this is the third one.
3 Is this the first document?
4 I like coding in sklearn
5 This is a very good question
Now you can apply contained_within_window as follows:
sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
And you get:
2
You can just run a for loop to check different instances, as sketched below.
And you can use this to construct your pandas DataFrame and apply tf-idf on it, which is straightforward.
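A minimal sketch of that loop (the vocabulary is rewritten here as plain (window, word1, word2) tuples, since the (window=..., words=...) notation from the question is not valid Python; the threshold is taken as window - 1 so that words exactly window positions apart do not count, matching the examples in the question):
# hypothetical rewrite of the question's vocabulary as plain tuples
myvocabulary = [(4, 'tin', 'tan'),
                (3, 'electrical', 'car'),
                (3, 'elephant', 'banana')]

counts = {}
for window, word1, word2 in myvocabulary:
    counts['{}_{}'.format(word1, word2)] = df.Test.apply(
        lambda x: contained_within_window(x, word1=word1, word2=word2, threshold=window - 1))

counts_df = pd.DataFrame(counts)  # one column per vocabulary entry, one row per document
print(counts_df.sum())            # total number of hits per vocabulary entry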

python word grouping based on words before and after

I am trying to create groups of words. First I am counting all words. Then I establish the top 10 words by word count. Then I want to create 10 groups of words based on those top 10. Each group consists of all the words that appear before and after the top word.
I have survey results stored in a python pandas dataframe structured like this
Question_ID | Customer_ID | Answer
1 234 Data is very important to use because ...
2 234 We value data since we need it ...
I also saved the answers column as a string.
I am using the following code to find 3 words before and after a word (I actually had to create a string out of the answers column):
answers_str = df.Answer.apply(str)
for value in answers_str:
    non_data = re.split('data|Data', value)
    terms_list = [term for term in non_data if len(term) > 0]  # skip empty terms
    substrs = [term.split()[0:3] for term in terms_list]  # slice and grab first three terms
    result = [' '.join(term) for term in substrs]  # combine the terms back into substrings
    print result
I have been manually creating groups of words - but is there a way of doing it in python?
So based on the example shown above the group with word counts would look like this:
group "data":
data : 2
important: 1
value: 1
need:1
then when it goes through the whole file, there would be another group:
group "analytics:
analyze: 5
report: 7
list: 10
visualize: 16
The idea would be to get rid of "we", "to","is" as well - but I can do it manually, if that's not possible.
Then to establish the 10 most used words (by word count) and then create 10 groups with words that are in front and behind those main top 10 words.
We can use regex for this. We'll be using this regular expression
((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})
which you can test for yourself here, to extract the three words before and after each occurrence of data.
First, let's remove all the words we don't like from the strings.
import re

# If you're processing a lot of sentences, it's probably wise to preprocess
# the pattern, assuming that bad_words is the same for all sentences
def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)
Then we want to get the words that surround data in each line:
data_pat = r'((?:\b\w+?\b\s*){0,3})[dD]ata((?:\s*\b\w+?\b){0,3})'
res = re.findall(data_pat, s, flags=re.IGNORECASE)
This gives us a list of tuples of strings. We want to get a list of those strings after they are split.
from itertools import chain
list_of_words = list(chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res))))))
That's not pretty, but it works. Basically, we pull the tuples out of the list, pull the strings out of each tuple, split each string, and then pull all the resulting strings into one big list.
Let's put this all together with your pandas code. pandas isn't my strongest area, so please don't assume that I haven't made some elementary mistake if you see something weird looking.
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))
c = Counter()
data_pat = r'((?:\b\w+?\b\s*){0,3})data((?:\s*\b\w+?\b){0,3})'
for sentence in sentence_list:
    res = re.findall(data_pat, sentence, flags=re.IGNORECASE)
    words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
    c.update(words)
The nice thing about the regex we're using is that all of the complicated parts don't care about what word we're using. With a slight change, we can make a format string
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
such that
base_pat.format('data') == data_pat
So with some list of words we want to collect information about, key_words:
import re
from itertools import chain
from collections import Counter

def remove_words(sentence, bad_words):
    pat = r'(?:{})'.format(r'|'.join(bad_words))
    return re.sub(pat, '', sentence, flags=re.IGNORECASE)

bad_words = ['we', 'is', 'to']
sentence_list = df.Answer.apply(lambda x: remove_words(str(x), bad_words))

key_words = ['data', 'analytics']
d = {}
base_pat = r'((?:\b\w+?\b\s*){{0,3}}){}((?:\s*\b\w+?\b){{0,3}})'
for keyword in key_words:
    key_pat = base_pat.format(keyword)
    c = Counter()
    for sentence in sentence_list:
        res = re.findall(key_pat, sentence, flags=re.IGNORECASE)
        words = chain.from_iterable(map(str.split, chain.from_iterable(map(chain, chain(res)))))
        c.update(words)
    d[keyword] = c
Now we have a dictionary d that maps keywords, like data and analytics, to Counters that map words that are not on our blacklist to their counts in the vicinity of the associated keyword. Something like
d = {'data': Counter({'important': 2,
                      'very': 3}),
     'analytics': Counter({'boring': 5,
                           'sleep': 3})}
As to how we get the top 10 words, that's basically the thing Counter is best at.
key_words, _ = zip(*Counter(w for sentence in sentence_list for w in sentence.split()).most_common(10))
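Putting it together (a small sketch reusing the names defined above): once key_words has been computed this way, run the grouping loop to build d and inspect the most common neighbouring words per group:
# assume key_words was computed with the most_common(10) line above
# and `d` was built by the keyword loop; then:
for keyword in key_words:
    print(keyword, d[keyword].most_common(5))  # top words around each keyword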

Generating random titles and descriptions in Python

Are there any Python libraries which can generate random titles and random descriptions?
Random title: a grammatically correct (but random) English sentence with fewer than 5 words.
Random description: a grammatically correct (but random) English sentence with fewer than 20 words.
I am testing a product which has title and description field. I want to create multiple objects with random title and random descriptions instead of "Title 1" "Description 1".
For a fairly simple solution, just find matches for a regex like [A-Z][a-z'\-]+[, ]([a-zA-Z'\-]+[;,]? ){15,25}[a-zA-Z'\-]+[.?!] (match a capitalized word, followed by 15-25 words (potentially with commas or semicolons after them), followed by a final word and an ending punctuation mark) in some large block of text. To get shorter, title-like phrases, you could just match any sequence of about 5 words (probably without punctuation between them):
([a-zA-Z'\-]+ ){4,6}
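A minimal sketch of that idea (corpus.txt is a hypothetical file containing any large block of English text):
import re
import random

text = open('corpus.txt').read()  # hypothetical file with a large block of English text

desc_pat = r"[A-Z][a-z'\-]+[, ]([a-zA-Z'\-]+[;,]? ){15,25}[a-zA-Z'\-]+[.?!]"
title_pat = r"([a-zA-Z'\-]+ ){4,6}"

# findall() would only return the last captured group, so use finditer()
# and take the whole match instead
descriptions = [m.group(0) for m in re.finditer(desc_pat, text)]
titles = [m.group(0).strip() for m in re.finditer(title_pat, text)]

print(random.choice(titles))
print(random.choice(descriptions))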
From Generating pseudo random text with Markov chains using Python:
You can use Markov chains to achieve this. To do that, you'll need to do the following steps (from the page I linked):
Have a text which will serve as the corpus from which we choose the next transitions.
Start with two consecutive words from the text. The last two words constitute the present state.
Generating the next word is the Markov transition. To generate the next word, look in the corpus and find which words are present after the given two words. Choose one of them randomly.
Repeat step 2 until text of the required size is generated.
The code they supply to accomplish this:
import random

class Markov(object):

    def __init__(self, open_file):
        self.cache = {}
        self.open_file = open_file
        self.words = self.file_to_words()
        self.word_size = len(self.words)
        self.database()

    def file_to_words(self):
        self.open_file.seek(0)
        data = self.open_file.read()
        words = data.split()
        return words

    def triples(self):
        """ Generates triples from the given data string. So if our string were
        "What a lovely day", we'd generate (What, a, lovely) and then
        (a, lovely, day).
        """
        if len(self.words) < 3:
            return
        for i in range(len(self.words) - 2):
            yield (self.words[i], self.words[i+1], self.words[i+2])

    def database(self):
        for w1, w2, w3 in self.triples():
            key = (w1, w2)
            if key in self.cache:
                self.cache[key].append(w3)
            else:
                self.cache[key] = [w3]

    def generate_markov_text(self, size=25):
        seed = random.randint(0, self.word_size-3)
        seed_word, next_word = self.words[seed], self.words[seed+1]
        w1, w2 = seed_word, next_word
        gen_words = []
        for i in xrange(size):
            gen_words.append(w1)
            w1, w2 = w2, random.choice(self.cache[(w1, w2)])
        gen_words.append(w2)
        return ' '.join(gen_words)
With this code, you then do something like the following example, replacing their jeeves.txt with some seed text of your choice (longer is better).
In [1]: file_ = open('/home/shabda/jeeves.txt')
In [2]: import markovgen
In [3]: markov = markovgen.Markov(file_)
In [4]: markov.generate_markov_text()
Out[4]: 'Can you put a few years of your twin-brother Alfred,
who was apt to rally round a bit. I should strongly advocate
the blue with milk'
After In[1] through In[3], you'd just need to call markov.generate_markov_text() with the proper arguments to generate sequences of 5 and 20 words as you needed them.
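For instance (a small sketch, keeping the size argument from the code above; the output is only statistically plausible text, not guaranteed to be grammatically correct):
title = markov.generate_markov_text(size=5)         # short, title-like snippet
description = markov.generate_markov_text(size=20)  # longer, description-like snippet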

Accumulated Frequencies, Ngrams

Quick question here: if you run the code below, you get a list of frequencies of bigrams per list from the corpus.
I would like to be able to display and keep track of a total running tally. I.e., instead of the 1 or maybe 2 you currently see displayed for a frequency (because each index covers so little text), it should count through the whole corpus and display the accumulated frequencies.
I then basically need to generate text from the frequencies that models the original corpus.
#---------------------------------------------------------
#!/usr/bin/env python
#Ngram Project
#Import all of the libraries we will need for the program to function
import nltk
import nltk.collocations
from collections import defaultdict
import nltk.corpus as corpus
from nltk.corpus import brown
#---------------------------------------------------------
#create our list with the Brown corpus inside variable called "news"
news = corpus.brown.sents(categories = 'editorial')
#This will display the type of variable Python recognizes this as
print "News Is Of The Variable Type : ",type(news),'\n'
#---------------------------------------------------------
#This function will take in the corpus one line at a time
#After searching through and adding a <s> to the beginning of each list item, it also annotates periods out for </s>
def alter_list(corpus_list):
    #Simply check for an instance of a period, and if so, replace with '</s>'
    if corpus_list[-1] == '.':
        corpus_list[-1] = '</s>'
        #Strip is a modifier that allows us to remove all special characters, IE '\n'
        corpus_list[-1].strip()
    #Else add to the end of the list item
    else:
        corpus_list.append('</s>')
    return ['<s>'] + corpus_list

#Displays the length of the list 'news'
print "The Length of News is : ",len(news),'\n'
#Allows the user to choose how much of the annotated corpus they would like to see
print "How many lines of the <s> // </s> annotated corpus would you like to see? ", '\n'
user = input()
#Takes user input to determine how many lines to display if any
if(user >= 1):
    print "The Corpus Annotated with <s> and </s> looks like : "
    print "Displaying [",user,"] rows of the corpus : ", '\n'
    for corpus_list in news[:user]:
        print(alter_list(corpus_list),'\n')
#Non positive number catch
else:
    print "Fine I Won't Show You Any... ",'\n'
#---------------------------------------------------------
print '\n'
#Again allows the user to choose the number of lists from Brown corpus to be displayed in
# Unigram, bigram, trigram and quadgram format
user2 = input("How many list sequences would you like to see broken into bigrams, trigrams, and quadgrams? ")
count = 0
#Function 'ngrams' is run in a loop so that each entry in the list can be gone through and turned into information
#Displayed to the user
while(count < user2):
    passer = news[count]

    def ngrams(passer, n = 2, padding = True):
        #Padding refers to the same idea demonstrated above, that is bump the first word to the second, making
        #'None' the first item in each list so that calculations of frequencies can be made
        pad = [] if not padding else [None]*(n-1)
        grams = pad + passer + pad
        return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))

    #In this case, arguments are first: n-gram type (bi, tri, quad)
    #Followed by in our case the addition of 'padding'
    #Padding is used in every case here because we need it for calculations
    #This function structure allows us to pull in corpus parts without the added annotations if need be
    for size, padding in ((1,1), (2,1), (3, 1), (4, 1)):
        print '\n%d - grams || padding = %d' % (size, padding)
        print list(ngrams(passer, size, padding))

    # show frequency
    counts = defaultdict(int)
    for n_gram in ngrams(passer, 2, False):
        counts[n_gram] += 1

    print ("======================================================================================")
    print '\nFrequencies Of Bigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram

    print '\nFrequencies Of Trigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram

    count = count + 1
#---------------------------------------------------------
I'm not sure I understand the question. nltk has a function generate. The book from which nltk comes is available online.
http://nltk.org/book/ch01.html
Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
The problem is that you define the dict counts anew for each sentence, so the ngram counts get reset to zero. Define it above the while loop and the counts will accumulate over the entire Brown corpus.
Bonus advice: You should also move the definition of ngrams outside the loop; it's pointless to define the same function over and over (though it does no harm, except to performance). Better yet, you should use nltk's ngrams function and read about FreqDist, which is like a dict counter on steroids, as sketched below. It will come in handy when you tackle the statistical text generation.
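A rough sketch of that suggestion (assuming the Brown corpus has been downloaded via nltk.download('brown'); this accumulates bigram counts over the whole editorial category instead of per sentence):
from nltk import ngrams
from nltk.probability import FreqDist
from nltk.corpus import brown

bigram_counts = FreqDist()  # defined once, outside the loop, so counts accumulate

for sentence in brown.sents(categories='editorial'):
    sentence = ['<s>'] + list(sentence) + ['</s>']
    bigram_counts.update(ngrams(sentence, 2))

# accumulated bigram frequencies over the whole corpus
for bigram, count in bigram_counts.most_common(10):
    print(count, bigram)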
