Calculate TF-IDF using sklearn for variable n-grams in Python

Problem:
Using scikit-learn to find the number of hits of variable n-grams of a particular vocabulary.
Explanation:
I got examples from here.
Imagine I have a corpus and I want to find how many hits (counting) a vocabulary like the following one has:
myvocabulary = [(window=4, words=['tin', 'tan']),
                (window=3, words=['electrical', 'car']),
                (window=3, words=['elephant', 'banana'])]
What I call window here is the length of the span of words in which the words can appear, as follows:
'tin tan' is a hit (within 4 words)
'tin dog tan' is a hit (within 4 words)
'tin dog cat tan' is a hit (within 4 words)
'tin car sun eclipse tan' is NOT a hit. tin and tan appear more than 4 words away from each other.
I just want to count how many times (window=4, words=['tin', 'tan']) appears in a text, and the same for all the other entries, and then add the result to a pandas DataFrame in order to calculate a tf-idf algorithm.
I could only find something like this:
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(vocabulary = myvocabulary, stop_words = 'english')
tfs = tfidf.fit_transform(corpus.values())
where vocabulary is a simple list of strings, being single words or several words.
Besides, from scikit-learn:
class sklearn.feature_extraction.text.CountVectorizer
ngram_range : tuple (min_n, max_n)
The lower and upper boundary of the range of n-values for different n-grams to be extracted. All values of n such that min_n <= n <= max_n will be used.
does not help either.
Any ideas?

I am not sure if this can be done using CountVectorizer or TfidfVectorizer. I have written my own function for doing this as follows:
import pandas as pd
import numpy as np
import string

def contained_within_window(token, word1, word2, threshold):
    word1 = word1.lower()
    word2 = word2.lower()
    # strip punctuation and lowercase the whole sentence
    token = token.translate(str.maketrans('', '', string.punctuation)).lower()
    if (word1 in token) and (word2 in token):
        word_list = token.split(" ")
        word1_index = [i for i, x in enumerate(word_list) if x == word1]
        word2_index = [i for i, x in enumerate(word_list) if x == word2]
        count = 0
        for i in word1_index:
            for j in word2_index:
                # count every pair of occurrences that lies within the window
                if np.abs(i - j) <= threshold:
                    count = count + 1
        return count
    return 0
SAMPLE:
corpus = [
    'This is the first document. And this is what I want',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
    'I like coding in sklearn',
    'This is a very good question'
]
df = pd.DataFrame(corpus, columns=["Test"])
Your df will look like this:
Test
0 This is the first document. And this is what I...
1 This document is the second document.
2 And this is the third one.
3 Is this the first document?
4 I like coding in sklearn
5 This is a very good question
Now you can apply contained_within_window as follows:
sum(df.Test.apply(lambda x: contained_within_window(x,word1="this", word2="document",threshold=2)))
And you get:
2
You can just run a for loop to check the different vocabulary entries, and you can use this to construct your pandas df and apply tf-idf on it, which is straightforward.
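As a minimal sketch of that loop (assuming the vocabulary is stored as plain (window, words) tuples rather than the keyword-style notation in the question, and that tf-idf is then computed from the raw counts with TfidfTransformer):
from sklearn.feature_extraction.text import TfidfTransformer

# assumed plain-tuple version of the vocabulary from the question
myvocabulary = [(4, ['tin', 'tan']),
                (3, ['electrical', 'car']),
                (3, ['elephant', 'banana'])]

# one column of counts per vocabulary entry, one row per document
counts = pd.DataFrame({
    " ".join(words): df.Test.apply(
        lambda x: contained_within_window(
            x, word1=words[0], word2=words[1],
            threshold=window))  # may need window - 1, depending on how "window" is defined
    for window, words in myvocabulary
})

# turn the raw counts into tf-idf weights
tfidf_matrix = TfidfTransformer().fit_transform(counts.values)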

Related

Comparing two strings with low/no consistency

I have two strings
a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'
I need to check if they match and print out the result accordingly. As b is the string I want checked for in a, the result should print 'Match' or 'Mis-match' accordingly. The code should not depend on the '_' in a, as they can be '-' or spaces as well.
I have tried using the fuzzywuzzy library and fuzz.token_set_ratio to calculate the ratio.
From observation, I chose a value of 95 to be convincing.
I want to know if there is another way to check this without using fuzzywuzzy, probably difflib.
I tried using difflib and SequenceMatcher, but all I get is a word-wise comparison and I am unable to combine the results exactly.
I have tried the following code.
from fuzzywuzzy import fuzz

a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos1'

ratio = fuzz.token_set_ratio(a.lower(), b.lower())
if ratio >= 95:
    print('Match')
else:
    print('Mis-Match')
output
'Mis-Match'
This gives a score of 64, while controlling, pos0 and pos1 are all present in both a and b, so it should give a match instead.
I tried this as this doesn't depend on the '_' or '-' or spaces.
You can use the gensim library to implement MatchSemantic and write the code as a function like this:
Initialization
If you run the code for the first time, a progress bar will go from 0% to 100% while gensim downloads glove-wiki-gigaword-50. After that everything will be set up and you can simply run the code.
Code
from re import sub
import numpy as np
from gensim.utils import simple_preprocess
import gensim.downloader as api
from gensim.corpora import Dictionary
from gensim.models import TfidfModel
from gensim.similarities import SparseTermSimilarityMatrix, WordEmbeddingSimilarityIndex, SoftCosineSimilarity
def MatchSemantic(query_string, documents):
    stopwords = ['the', 'and', 'are', 'a']

    if len(documents) == 1:
        documents.append('')

    def preprocess(doc):
        # Tokenize, clean up input document string
        doc = sub(r'<img[^<>]+(>|$)', " image_token ", doc)
        doc = sub(r'<[^<>]+(>|$)', " ", doc)
        doc = sub(r'\[img_assist[^]]*?\]', " ", doc)
        doc = sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[$-_#.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', " url_token ", doc)
        return [token for token in simple_preprocess(doc, min_len=0, max_len=float("inf")) if token not in stopwords]

    # Preprocess the documents, including the query string
    corpus = [preprocess(document) for document in documents]
    query = preprocess(query_string)

    # Load the model: this is a big file, can take a while to download and open
    glove = api.load("glove-wiki-gigaword-50")
    similarity_index = WordEmbeddingSimilarityIndex(glove)

    # Build the term dictionary and the TF-idf model
    dictionary = Dictionary(corpus + [query])
    tfidf = TfidfModel(dictionary=dictionary)

    # Create the term similarity matrix
    similarity_matrix = SparseTermSimilarityMatrix(similarity_index, dictionary, tfidf)

    query_tf = tfidf[dictionary.doc2bow(query)]

    index = SoftCosineSimilarity(
        tfidf[[dictionary.doc2bow(document) for document in corpus]],
        similarity_matrix)

    doc_similarity_scores = index[query_tf]

    # Output the sorted similarity scores and documents
    sorted_indexes = np.argsort(doc_similarity_scores)[::-1]

    for idx in sorted_indexes:
        if documents[idx] != '':
            if doc_similarity_scores[idx] > 0.0:
                print('Match')
            else:
                print('Mis-Match')
Usage
For example, we want to see if 'Fruit and Vegetables' matches any of the sentences or items inside documents.
Test:
query_string = 'Fruit and Vegetables'
documents = ['I have an apple in my basket', 'I have a car in my house']
MatchSemantic(query_string, documents)
We know that the first item, 'I have an apple in my basket', has a semantic relation with 'Fruit and Vegetables', so it prints Match; for the second item no relation is found, so it prints Mis-Match.
output:
Match
Mis-Match
It looks like you are trying to use tools like fuzzywuzzy that were not really designed for that.
One possible approach to this problem could be to find how many tokens from the second text are present in the first text.
This can be normalized by the total number of tokens in the second text.
Then you can threshold to whatever value you deem fit.
One possible way of implementing this is the following:
Tokenize (i.e. convert to a list of tokens) the input texts a and b.
Collect each token list into a corresponding counter (i.e. some data structure for counting which tokens are present).
Compute the intersection a_i_b of the tokens for a and b.
Compute some metric based on the total occurrences of a_i_b (weight_a_i_b) and the total occurrences of b (weight_b). This final metric is a proxy of the "amount" of b contained in a. This could be a ratio or a difference and should use the fact that weight_a_i_b <= weight_b by construction.
The difference weight_b - weight_a_i_b results in a number between 0 and the number of tokens in b, which is also a direct measure of how many tokens from b are not found in a, hence 0 indicates perfect matching.
The ratio weight_a_i_b / weight_b results in a number between 0 and 1, with 1 meaning perfect matching and 0 meaning no matching.
The difference metric is probably more suited for small number of tokens and easier to interpret and threshold in a meaningful way (e.g. accepting a value below 2 means that at most one token from b is not present in a).
On the other hand, the ratio is more standard and is probably more suited for larger token lists.
All this would translate into the following code, leveraging collections.Counter() for counting the tokens:
import collections

def contains_tokens(
        text_a,
        text_b,
        tokenize_kws=None,
        metric=lambda a, b, a_i_b: b - a_i_b):
    """Compute the ratio of `b` contained in `a`."""
    tokenize_kws = dict(tokenize_kws) if tokenize_kws is not None else {}
    counter_a = collections.Counter(tokenize(text_a, **tokenize_kws))
    counter_b = collections.Counter(tokenize(text_b, **tokenize_kws))
    counter_a_i_b = counter_a & counter_b
    weight_a = counter_total(counter_a)
    weight_b = counter_total(counter_b)
    weight_a_i_b = counter_total(counter_a_i_b)
    return metric(weight_a, weight_b, weight_a_i_b)
The first step, i.e. tokenization, is achieved with the following function.
This is a bit primitive, but it gets the job done for your input.
What it essentially does is replace a number of special characters (ignores) with blanks, and then split the string along the blanks, optionally excluding the tokens in a blacklist (excludes).
def tokenize(
        text,
        case_sensitive=False,
        ignores=('_', '-', ':', ',', '.', '?', '!'),
        excludes=('the', 'from', 'to')):
    """Tokenize a text, ignoring some characters and excluding some tokens."""
    if not case_sensitive:
        text = text.lower()
    for ignore in ignores:
        text = text.replace(ignore, ' ')
    for token in text.split():
        if token not in excludes:
            yield token
To count the total number of values in a counter, the following function is used. However, for Python 3.10 and later there is a built-in method Counter.total() which does exactly the same thing.
def counter_total(counter):
    """Count the total number of values."""
    return sum(counter.values())
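For instance, on Python 3.10+ the two are interchangeable:
import collections

counter = collections.Counter('abracadabra')
print(counter_total(counter))  # 11
print(counter.total())         # 11, built-in equivalent (Python 3.10+)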
For the given input this becomes:
a = 'Test - 4567: Controlling_robotic_hand_with_Arduino_uno'
b = 'Controlling robotic hand'
# all tokens from `b` are in `a`
print(contains_tokens(a, b))
# 0
and
a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos1'
# all tokens from `b` are in `a`
print(contains_tokens(a, b))
# 0
a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos2'
# one token from `b` (`pos2`) not in `a`
print(contains_tokens(a, b))
# 1
Note that distance-based functions (like fuzz.token_set_ratio() or fuzz.partial_ratio()) cannot be used in this context because they are sensitive to how much "noise" is present in the first text. For example, if b = 'a b c', those tokens are contained equally in a = 'a c' as well as in a = 'a b c d e f g h i', and a distance cannot account for that. Most notably, distance functions are symmetric (i.e. f(a, b) = f(b, a)), while the function you are looking for is not (i.e. f(a, b) != f(b, a)).
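To illustrate the symmetry point with the functions defined above (a quick check, assuming fuzzywuzzy is installed for the comparison):
from fuzzywuzzy import fuzz

a = 'Test - 4567: robotic_hand_with_Arduino_uno_controlling_pos0_pos1'
b = 'Controlling from pos0 to pos1'

# fuzz.token_set_ratio() is symmetric: swapping the arguments changes nothing
print(fuzz.token_set_ratio(a, b) == fuzz.token_set_ratio(b, a))  # True

# contains_tokens() is not symmetric: "b contained in a" differs from "a contained in b"
print(contains_tokens(a, b))  # 0 -> every token of b is found in a
print(contains_tokens(b, a))  # 7 -> most tokens of a are not found in b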

Sentence comparison: how to highlight differences

I have the following sequences of strings within a column in pandas:
SEQ
An empty world
So the word is
So word is
No word is
I can check the similarity using fuzzywuzzy or cosine distance.
However, I would like to know how to get information about the word which changes position from one row to another.
For example:
Similarity between the first row and the second one is 0, but there is similarity between rows 2 and 3.
They contain almost the same words in the same positions. I would like to visualize this change (the missing word) if possible, and similarly for the 3rd and 4th rows.
How can I see the changes between two rows/texts?
Assuming you're using Jupyter/IPython and you are just interested in comparisons between a row and the one preceding it, I would do something like this.
The general concept is:
find shared tokens between the two strings (by splitting on ' ' and finding the intersection of two sets).
apply some html formatting to the tokens shared between the two strings.
apply this to all rows.
output the resulting dataframe as html and render it in ipython.
import pandas as pd

data = ['An empty world',
        'So the word is',
        'So word is',
        'No word is']
df = pd.DataFrame(data, columns=['phrase'])

bold = lambda x: f'<b>{x}</b>'

def highlight_shared(string1, string2, format_func):
    shared_toks = set(string1.split(' ')) & set(string2.split(' '))
    return ' '.join([format_func(tok) if tok in shared_toks else tok for tok in string1.split(' ')])

highlight_shared('the cat sat on the mat', 'the cat is fat', bold)

df['previous_phrase'] = df.phrase.shift(1, fill_value='')
df['tokens_shared_with_previous'] = df.apply(lambda x: highlight_shared(x.phrase, x.previous_phrase, bold), axis=1)

from IPython.core.display import HTML
HTML(df.loc[:, ['phrase', 'tokens_shared_with_previous']].to_html(escape=False))
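For reference, the standalone highlight_shared call above should return the first string with the shared tokens wrapped in <b> tags, which is what ends up rendered in the HTML table:
highlight_shared('the cat sat on the mat', 'the cat is fat', bold)
# '<b>the</b> <b>cat</b> sat on <b>the</b> mat'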

Word count frequency: removing stopwords

I have the following list of word frequencies generated by the code below.
Frequency
the 3
15 5
18 1
a 1
2020 4
... ...
house 1
apartment 1
hotel 5
pool 1
swimming 1
The code is
from sklearn.feature_extraction.text import CountVectorizer
word_vectorizer = CountVectorizer(ngram_range=(1,1), analyzer='word')
sparse_matrix = word_vectorizer.fit_transform(df['Sentences'])
w_freq = sum(sparse_matrix).toarray()[0]
w_df=pd.DataFrame(w_freq, index=word_vectorizer.get_feature_names(), columns=['Frequency'])
w_df
I would like to remove the stopwords from the list of words above (not in the column of my dataframe, but just in the output, creating a new variable in case it would be needed).
I have tried w_df = [w for w in w_df if not w in stop_words] but it gave me ['Frequency'] as output.
I think this happens because it is not a list.
Could you please tell me how to remove stopwords (numbers included) from there?
Thanks
CountVectorizer has a parameter that does that for you. You can feed it a custom list of stopwords, or set it to 'english' to use a built-in stop word list. Here's an example:
s = pd.Series('Just a random sentence with more than one stopword')

word_vectorizer = CountVectorizer(ngram_range=(1, 1),
                                  analyzer='word',
                                  stop_words='english')
sparse_matrix = word_vectorizer.fit_transform(s)
w_freq = sum(sparse_matrix).toarray()[0]
w_df = pd.DataFrame(w_freq,
                    index=word_vectorizer.get_feature_names(),
                    columns=['Frequency'])
print(w_df)
Frequency
just 1
random 1
sentence 1
stopword 1
Just to add, your approach wasn't all that wrong. You needed just a minor change.
w_df = [w for w in w_df.index if not w in stop_words]
Your problem was simply that, in the list comprehension, you iterated over the dataframe itself rather than the tokens which are in its index. This would also return the desired result.
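If you also want to keep the frequencies rather than just the surviving words, one option (a sketch, assuming stop_words is a list or set of stopword strings) is to filter the dataframe by its index instead:
# keep only rows whose index (the word itself) is not a stopword,
# then drop purely numeric "words" such as years
w_df_clean = w_df[~w_df.index.isin(stop_words)]
w_df_clean = w_df_clean[~w_df_clean.index.str.isnumeric()]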

TFIDF separate for each label

Using TfidfVectorizer (sklearn), how do I obtain a word ranking based on tf-idf score for each label separately? I want the word frequency for each label (positive and negative).
Relevant code:
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english', use_idf=True, ngram_range=(1, 1))
features_train = vectorizer.fit_transform(features_train).todense()
features_test = vectorizer.transform(features_test).todense()

for i in range(len(features_test)):
    first_document_vector = features_test[i]
    df_t = pd.DataFrame(first_document_vector.T, index=feature_names, columns=["tfidf"])
    df_t.sort_values(by=["tfidf"], ascending=False).head(50)
This will give you positive, neutral, and negative sentiment analysis for each row of comments in a field of a dataframe. There is a lot of preprocessing code, to get things cleaned up, filter out stop-words, do some basic charting, etc.
import pickle
import pandas as pd
import numpy as np
import re
import nltk
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
global str
df = pd.read_csv('C:\\your_path\\test_dataset.csv')
print(df.shape)
# let's experiment with some sentiment analysis concepts
# first we need to clean up the stuff in the independent field of the DF we are workign with
df['body'] = df[['body']].astype(str)
df['review_text'] = df[['review_text']].astype(str)
df['body'] = df['body'].str.replace('\d+', '')
df['review_text'] = df['review_text'].str.replace('\d+', '')
# get rid of special characters
df['body'] = df['body'].str.replace(r'[^\w\s]+', '')
df['review_text'] = df['review_text'].str.replace(r'[^\w\s]+', '')
# get rid fo double spaces
df['body'] = df['body'].str.replace(r'\^[a-zA-Z]\s+', '')
df['review_text'] = df['review_text'].str.replace(r'\^[a-zA-Z]\s+', '')
# convert all case to lower
df['body'] = df['body'].str.lower()
df['review_text'] = df['review_text'].str.lower()
# It looks like the language in body and review_text is very similar (2 fields in dataframe). let's check how closely they match...
# seems like the tone is similar, but the text is not matching at a high rate...less than 20% match rate
import difflib
body_list = df['body'].tolist()
review_text_list = df['review_text'].tolist()
body = body_list
reviews = review_text_list
s = difflib.SequenceMatcher(None, body, reviews).ratio()
print ("ratio:", s, "\n")
# filter out stop words
# these are the most common words such as: “the“, “a“, and “is“.
from nltk.corpus import stopwords
english_stopwords = stopwords.words('english')
print(len(english_stopwords))
text = str(body_list)
# split into words
from nltk.tokenize import word_tokenize
tokens = word_tokenize(text)
# convert to lower case
tokens = [w.lower() for w in tokens]
# remove punctuation from each word
import string
table = str.maketrans('', '', string.punctuation)
stripped = [w.translate(table) for w in tokens]
# remove remaining tokens that are not alphabetic
words = [word for word in stripped if word.isalpha()]
# filter out stop words
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
words = [w for w in words if not w in stop_words]
print(words[:100])
# plot most frequently occurring words in a bar chart
# remove unwanted characters, numbers and symbols
df['review_text'] = df['review_text'].str.replace("[^a-zA-Z#]", " ")
#Let’s try to remove the stopwords and short words (<2 letters) from the reviews.
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
# function to remove stopwords
def remove_stopwords(rev):
    rev_new = " ".join([i for i in rev if i not in stop_words])
    return rev_new
# remove short words (length < 3)
df['review_text'] = df['review_text'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))
# remove stopwords from the text
reviews = [remove_stopwords(r.split()) for r in df['review_text']]
# make entire text lowercase
reviews = [r.lower() for r in reviews]
#Let’s again plot the most frequent words and see if the more significant words have come out.
# NOTE: freq_words() is a plotting helper assumed to be defined elsewhere; it is not part of this snippet
freq_words(reviews, 35)
###############################################################################
###############################################################################
# Tf-idf is a very common technique for determining roughly what each document in a set of
# documents is “about”. It cleverly accomplishes this by looking at two simple metrics: tf
# (term frequency) and idf (inverse document frequency). Term frequency is the proportion
# of occurrences of a specific term to total number of terms in a document. Inverse document
# frequency is the inverse of the proportion of documents that contain that word/phrase.
# Simple, right!? The general idea is that if a specific phrase appears a lot of times in a
# given document, but it doesn’t appear in many other documents, then we have a good idea
# that the phrase is important in distinguishing that document from all the others.
# Starting with the CountVectorizer/TfidfTransformer approach...
# convert fields in datframe to list
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
cvec = CountVectorizer(stop_words='english', min_df=1, max_df=.5, ngram_range=(1,2))
cvec
# Calculate all the n-grams found in all documents
from itertools import islice
cvec.fit(body_list)
list(islice(cvec.vocabulary_.items(), 20))
len(cvec.vocabulary_)
# Let’s take a moment to describe these parameters as they are the primary levers for adjusting what
# feature set we end up with. First is “min_df” or mimimum document frequency. This sets the minimum
# number of documents that any term is contained in. This can either be an integer which sets the
# number specifically, or a decimal between 0 and 1 which is interpreted as a percentage of all documents.
# Next is “max_df” which similarly controls the maximum number of documents any term can be found in.
# If 90% of documents contain the word “spork” then it’s so common that it’s not very useful.
# Initialize the vectorizer with new settings and check the new vocabulary length
cvec = CountVectorizer(stop_words='english', min_df=.0025, max_df=.5, ngram_range=(1,2))
cvec.fit(body_list)
len(cvec.vocabulary_)
# Our next move is to transform the document into a “bag of words” representation which essentially is
# just a separate column for each term containing the count within each document. After that, we’ll
# take a look at the sparsity of this representation which lets us know how many nonzero values there
# are in the dataset. The more sparse the data is the more challenging it will be to model
cvec_counts = cvec.transform(body_list)
print('sparse matrix shape:', cvec_counts.shape)
print('nonzero count:', cvec_counts.nnz)
print('sparsity: %.2f%%' % (100.0 * cvec_counts.nnz / (cvec_counts.shape[0] * cvec_counts.shape[1])))
# get counts of frequently occurring terms; top 20
occ = np.asarray(cvec_counts.sum(axis=0)).ravel().tolist()
counts_df = pd.DataFrame({'term': cvec.get_feature_names(), 'occurrences': occ})
counts_df.sort_values(by='occurrences', ascending=False).head(20)
# Now that we’ve got term counts for each document we can use the TfidfTransformer to calculate the
# weights for each term in each document
transformer = TfidfTransformer()
transformed_weights = transformer.fit_transform(cvec_counts)
transformed_weights
# we can take a look at the top 20 terms by average tf-idf weight.
weights = np.asarray(transformed_weights.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame({'term': cvec.get_feature_names(), 'weight': weights})
weights_df.sort_values(by='weight', ascending=False).head(20)
# FINALLY!!!!
# Here we are doing some sentiment analysis, and distilling the 'review_text' field into positive, neutral, or negative,
# based on the tone of the text in each record. Also, we are filtering out the records that have <.2 negative score;
# keeping only those that have >.2 negative score. This is interesting, but this can contain some non-intuitive results.
# For instance, one record in 'review_text' literally says 'no issues'. This is probably positive, but the algo sees the
# word 'no' and interprets the comment as negative. I would argue that it's positive. We'll circle back and resolve
# this potential issue a little later.
import nltk
nltk.download('vader_lexicon')
nltk.download('punkt')
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sid = SentimentIntensityAnalyzer()
df['sentiment'] = df['review_text'].apply(lambda x: sid.polarity_scores(x))
def convert(x):
    if x < 0:
        return "negative"
    elif x > .2:
        return "positive"
    else:
        return "neutral"
df['result'] = df['sentiment'].apply(lambda x:convert(x['compound']))
# df.groupby(['brand','result']).size()
# df.groupby(['brand','result']).count()
x = df.groupby(['review_text','brand'])['result'].value_counts(normalize=True)
x = df.groupby(['brand'])['result'].value_counts(normalize=True)
y = x.loc[(x.index.get_level_values(1) == 'negative')]
print(y[y>0.2])
Result:
brand result
ABH negative 0.500000
Alexander McQueen negative 0.500000
Anastasia negative 0.498008
BURBERRY negative 0.248092
Beats negative 0.272947
Bowers & Wilkins negative 0.500000
Breitling Official negative 0.666667
Capri Blue negative 0.333333
FERRARI negative 1.000000
Fendi negative 0.283582
GIORGIO ARMANI negative 1.000000
Jan Marini Skin Research negative 0.250000
Jaybird negative 0.235294
LANCÔME negative 0.500000
Longchamp negative 0.271605
Longchamps negative 0.500000
M.A.C negative 0.203390
Meaningful Beauty negative 0.222222
Polk Audio negative 0.256410
Pumas negative 0.222222
Ralph Lauren Polo negative 0.500000
Roberto Cavalli negative 0.250000
Samsung negative 0.332298
T3 Micro negative 0.224138
Too Faced negative 0.216216
VALENTINO by Mario Valentino negative 0.333333
YSL negative 0.250000
Feel free to skip things you find to be irrelevant, but as-is, the code does a fairly comprehensive NLP analysis.
Also, take a look at these two links.
https://www.analyticsvidhya.com/blog/2018/02/the-different-methods-deal-text-data-predictive-python/
https://towardsdatascience.com/fine-grained-sentiment-analysis-in-python-part-1-2697bb111ed4
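Coming back to the original per-label ranking question, one way to get that (a minimal sketch, assuming a dataframe with a text column and the 'result' label column produced above; top_terms_per_label is a hypothetical helper name) is to fit a TfidfVectorizer on each label's subset and rank terms by their mean tf-idf weight:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

def top_terms_per_label(df, text_col='review_text', label_col='result', n=20):
    """Hypothetical helper: top-n terms by mean tf-idf weight within each label."""
    rankings = {}
    for label in df[label_col].unique():
        texts = df.loc[df[label_col] == label, text_col]
        vec = TfidfVectorizer(stop_words='english')
        weights = vec.fit_transform(texts)
        mean_weights = np.asarray(weights.mean(axis=0)).ravel()
        # use get_feature_names() instead on older scikit-learn versions, as elsewhere in this answer
        ranking = pd.Series(mean_weights, index=vec.get_feature_names_out())
        rankings[label] = ranking.sort_values(ascending=False).head(n)
    return rankings

# e.g. top_terms_per_label(df)['negative'] lists the highest-weighted terms
# among the documents labelled negative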

How can I use nltk to get the chance of the next word being something?

Problem
I have a problem where I have one word and certain restrictions on what the second might be (for example "I _o__"). What I want is a list of words like "rode", "love", and "most" and telling me how common each one is following "I".
I want to be able to get a list of two-tuples (nextword, probability) where nextword is a word that satisfies a regex and probability is the chance that nextword follows after the first word, given by (number of times it is seen after the first word in a corpus of text)/(number of times the first word appears).
Like this:
[(nextword, follow_probability("I", nextword)) for nextword in findwords('.o..')]
My approach to this is to first generate a list of possible words that satisfy the regex, and then look up the probability of each. The first part is easy, but I don't know how to do the second part. Ideally I would be able to have a function taking an argument for each word and returning the probability the second follows the first.
What I Have Tried
Using the markovify library to generate a chain and the sentences with a certain starting word and a state size of 1
Using nltk's BigramCollocationFinder
Try something like this:
from collections import Counter, deque
from nltk.tokenize import regexp_tokenize
import pandas as pd

def grouper(iterable, length=2):
    # yield sliding windows of `length` consecutive tokens
    i = iter(iterable)
    q = deque(map(next, [i] * length))
    while True:
        yield tuple(q)
        try:
            q.append(next(i))
            q.popleft()
        except StopIteration:
            break

def tokenize(text):
    return [word.lower() for word in regexp_tokenize(text, r'\w+')]

def follow_probability(word1, word2, vec):
    subvec = vec.loc[word1]
    try:
        ct = subvec.loc[word2]
    except KeyError:
        ct = 0
    return float(ct) / (subvec.sum() or 1)

text = 'This is some training text this this'
tokens = tokenize(text)
markov = list(grouper(tokens))
vec = pd.Series(Counter(markov))

follow_probability('this', 'is', vec)
Output:
0.5
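To get the (nextword, probability) pairs the question asks for, you could combine follow_probability with a small regex filter over the observed next-words. A sketch using the vec series built above; findwords here is a hypothetical helper, not an nltk function:
import re

def findwords(pattern, first_word, vec):
    # hypothetical helper: next-words observed after first_word that match the pattern
    regex = re.compile(pattern + r'$')
    candidates = {w2 for (w1, w2) in vec.index if w1 == first_word}
    return [w for w in candidates if regex.match(w)]

# two-letter words ending in "s" that can follow 'this'
pairs = [(w, follow_probability('this', w, vec)) for w in findwords('.s', 'this', vec)]
print(pairs)  # [('is', 0.5)] for the toy training text above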
