I am trying to give a score to a randomly generated word based on how close it is "myname" (I only generated a word with three letters to make sure it works first). Since I have only generating words with three letters, if the random word is "myn" then the score is 3.. if the random word is "mym" the score is 2, if the random word is "mim" the score is one.. etc etc. Essentially the "closeness" of the random word is based on whether the random word generated is in the first three letters of "myname" Here is the code I have thus far.
def generate(x):
word = "myname"
alpha_1 = alpha[random.randint(0,21)]
vowel_1 = vowels[random.randint(0,4)]
alpha_2 = alpha[random.randint(0,21)]
two = {}
for x in range(1,7315):
two.update({alpha_1 + vowel_1 + alpha_2:0})
return two
generate(x)
You are looking for fuzzy matching. Try using fuzzywuzzy (pip install fuzzywuzzy). You can use the doc here to know how it works,but, basically :
Import the module from fuzzywuzzy import fuzz, process
Then, the ratio of the two strings : fuzz.ratio(myname, "The name to compare with")
You can also see if the name is contained in the string using partial ratio :
fuzz.partial_ratio("abcdef", "bcd") will return 100 (%) as bcd is in abcdef
fuzz.partial_ratio("aaaa", "abcd") will return 25 (%) as the string is not really contained in the other one : only 1 letter is common.
Related
I am building a word frequency, and relative frequency, list(s) for a collection of text files. Having discovered, by hand, that a couple of texts can overly influence the frequency of a word, one of the things I want to be able to do is count the number of times a word occurs. It strikes me that there are two ways to do this:
First, to compile a word frequency dictionary (as below -- and I'm not using the NLTK FreqDist because this code actually runs more quickly but if FreqDist has the above functionality built-in and I just didn't know it, I'll take it):
import nltk
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
freq_dic = {}
for text in ftexts:
words = tokenizer.tokenize(text)
for word in words:
# form dictionary
try:
freq_dic[word] += 1
except:
freq_dic[word] = 1
From there, I assume I'll need to write another loop that uses the keys above as keywords:
# This is just scratch code
for text in ftexts:
while True:
if keyword not in line:
continue
else:
break
count = count + 1
And then I'll find some way to mesh these two dictionaries into a tuple or, possibly, a pandas dataframe by word, such that:
word1, frequency, # of texts in which it occurs
word2, frequency, # of texts in which it occurs
The other thing that occurred to me as I was writing this question was to use SciKit's term frequency matrix and then count rows in which a word occurs? Is that possible?
ADDED TO CLARIFY:
Imagine three sentences:
["I need to keep count of the children.",
"If you want to know what the count is, just ask."
"There is nothing here but chickens, chickens, chickens."]
"count" occurs 2x but is in two different texts; "chickens" occurs three times, but is in only one text. What I want is a report that looks like this:
WORD, FREQ, TEXTS
count, 2, 2
chicken, 3, 1
Are there any python libraries which can generate random titles and random descriptions.
Random title: A grammatically correct(but random) English sentence with less than 5 words.
Random description: A grammatically correct(but random) English sentence with less than 20 words.
I am testing a product which has title and description field. I want to create multiple objects with random title and random descriptions instead of "Title 1" "Description 1".
For a fairly simple solution, just find matches for a regex like [A-Z][a-z'\-]+[, ]([a-zA-Z'\-]+[;,]? ){15,25}[a-zA-Z'\-]+[.?!] (Match a capitalized word followed by 15-25 words (potentially with commas or semicolons following them) then followed by a final word and an ending punctuation mark) in some large block of text. To get shorter, title-like phrases, you could just match any sequence of about 5 words (probably without punctuation between them):
([a-zA-Z'\-]+ ){4,6}
From Generating pseudo random text with Markov chains using Python:
You can use Markov chains to achieve this. To do that, you'll need to do the following steps (from the page I linked):
Have a text which will serve as the corpus from which we choose the
next transitions.
Start with two consecutive words from the text.
The last two words constitute the present state.
Generating next
word is the markov transition. To generate the next word, look in
the corpus, and find which words are present after the given two
words. Choose one of them randomly.
Repeat 2, until text of required
size is generated.
The code they supply to accomplish this:
import random
class Markov(object):
def __init__(self, open_file):
self.cache = {}
self.open_file = open_file
self.words = self.file_to_words()
self.word_size = len(self.words)
self.database()
def file_to_words(self):
self.open_file.seek(0)
data = self.open_file.read()
words = data.split()
return words
def triples(self):
""" Generates triples from the given data string. So if our string were
"What a lovely day", we'd generate (What, a, lovely) and then
(a, lovely, day).
"""
if len(self.words) < 3:
return
for i in range(len(self.words) - 2):
yield (self.words[i], self.words[i+1], self.words[i+2])
def database(self):
for w1, w2, w3 in self.triples():
key = (w1, w2)
if key in self.cache:
self.cache[key].append(w3)
else:
self.cache[key] = [w3]
def generate_markov_text(self, size=25):
seed = random.randint(0, self.word_size-3)
seed_word, next_word = self.words[seed], self.words[seed+1]
w1, w2 = seed_word, next_word
gen_words = []
for i in xrange(size):
gen_words.append(w1)
w1, w2 = w2, random.choice(self.cache[(w1, w2)])
gen_words.append(w2)
return ' '.join(gen_words)
With this code, you then do something like the following example, replacing their jeeves.txt with some seed text of your choice (longer is better).
In [1]: file_ = open('/home/shabda/jeeves.txt')
In [2]: import markovgen
In [3]: markov = markovgen.Markov(file_)
In [4]: markov.generate_markov_text()
Out[4]: 'Can you put a few years of your twin-brother Alfred,
who was apt to rally round a bit. I should strongly advocate
the blue with milk'
After In[1] through In[3], you'd just need to call markov.generate_markov_text() with the proper arguments to generate sequences of 5 and 20 words as you needed them.
I have extracted the list of sentences from a document. I am pre-processing this list of sentences to make it more sensible. I am faced with the following problem
I have sentences such as "more recen t ly the develop ment, wh ich is a po ten t "
I would like to correct such sentences using a look up dictionary? to remove the unwanted spaces.
The final output should be "more recently the development, which is a potent "
I would assume that this is a straight forward task in preprocessing text? I need help with some pointers to look for such approaches. Thanks.
Take a look at word or text segmentation. The problem is to find the most probable split of a string into a group of words. Example:
thequickbrownfoxjumpsoverthelazydog
The most probable segmentation should be of course:
the quick brown fox jumps over the lazy dog
Here's an article including prototypical source code for the problem using Google Ngram corpus:
http://jeremykun.com/2012/01/15/word-segmentation/
The key for this algorithm to work is access to knowledge about the world, in this case word frequencies in some language. I implemented a version of the algorithm described in the article here:
https://gist.github.com/miku/7279824
Example usage:
$ python segmentation.py t hequi ckbrownfoxjum ped
thequickbrownfoxjumped
['the', 'quick', 'brown', 'fox', 'jumped']
Using data, even this can be reordered:
$ python segmentation.py lmaoro fll olwt f pwned
lmaorofllolwtfpwned
['lmao', 'rofl', 'lol', 'wtf', 'pwned']
Note that the algorithm is quite slow - it's prototypical.
Another approach using NLTK:
http://web.archive.org/web/20160123234612/http://www.winwaed.com:80/blog/2012/03/13/segmenting-words-and-sentences/
As for your problem, you could just concatenate all string parts you have to get a single string and the run a segmentation algorithm on it.
Your goal is to improve text, not necessarily to make it perfect; so the approach you outline makes sense in my opinion. I would keep it simple and use a "greedy" approach: Start with the first fragment and stick pieces to it as long as the result is in the dictionary; if the result is not, spit out what you have so far and start over with the next fragment. Yes, occasionally you'll make a mistake with cases like the me thod, so if you'll be using this a lot, you could look for something more sophisticated. However, it's probably good enough.
Mainly what you require is a large dictionary. If you'll be using it a lot, I would encode it as a "prefix tree" (a.k.a. trie), so that you can quickly find out if a fragment is the start of a real word. The nltk provides a Trie implementation.
Since this kind of spurious word breaks are inconsistent, I would also extend my dictionary with words already processed in the current document; you may have seen the complete word earlier, but now it's broken up.
--Solution 1:
Lets think of these chunks in your sentence as beads on an abacus, with each bead consisting of a partial string, the beads can be moved left or right to generate the permutations. The position of each fragment is fixed between two adjacent fragments.
In current case, the beads would be :
(more)(recen)(t)(ly)(the)(develop)(ment,)(wh)(ich)(is)(a)(po)(ten)(t)
This solves 2 subproblems:
a) Bead is a single unit,so We do not care about permutations within the bead i.e. permutations of "more" are not possible.
b) The order of the beads is constant, only the spacing between them changes. i.e. "more" will always be before "recen" and so on.
Now, generate all the permutations of these beads , which will give output like :
morerecentlythedevelopment,which is a potent
morerecentlythedevelopment,which is a poten t
morerecentlythedevelop ment, wh ich is a po tent
morerecentlythedevelop ment, wh ich is a po ten t
morerecentlythe development,whichisapotent
Then score these permutations based on how many words from your relevant dictionary they contain, most correct results can be easily filtered out.
more recently the development, which is a potent will score higher than morerecentlythedevelop ment, wh ich is a po ten t
Code which does the permutation part of the beads:
import re
def gen_abacus_perms(frags):
if len(frags) == 0:
return []
if len(frags) == 1:
return [frags[0]]
prefix_1 = "{0}{1}".format(frags[0],frags[1])
prefix_2 = "{0} {1}".format(frags[0],frags[1])
if len(frags) == 2:
nres = [prefix_1,prefix_2]
return nres
rem_perms = gen_abacus_perms(frags[2:])
res = ["{0}{1}".format(prefix_1, x ) for x in rem_perms] + ["{0} {1}".format(prefix_1, x ) for x in rem_perms] + \
["{0}{1}".format(prefix_2, x ) for x in rem_perms] + ["{0} {1}".format(prefix_2 , x ) for x in rem_perms]
return res
broken = "more recen t ly the develop ment, wh ich is a po ten t"
frags = re.split("\s+",broken)
perms = gen_abacus_perms(frags)
print("\n".join(perms))
demo:http://ideone.com/pt4PSt
--Solution#2:
I would suggest an alternate approach which makes use of text analysis intelligence already developed by folks working on similar problems and having worked on big corpus of data which depends on dictionary and grammar .e.g. search engines.
I am not well aware of such public/paid apis, so my example is based on google results.
Lets try to use google :
You can keep putting your invalid terms to Google, for multiple passes, and keep evaluating the results for some score based on your lookup dictionary.
here are two relevant outputs by using 2 passes of your text :
This outout is used for a second pass :
Which gives you the conversion as ""more recently the development, which is a potent".
To verify the conversion, you will have to use some similarity algorithm and scoring to filter out invalid / not so good results.
One raw technique could be using a comparison of normalized strings using difflib.
>>> import difflib
>>> import re
>>> input = "more recen t ly the develop ment, wh ich is a po ten t "
>>> output = "more recently the development, which is a potent "
>>> input_norm = re.sub(r'\W+', '', input).lower()
>>> output_norm = re.sub(r'\W+', '', output).lower()
>>> input_norm
'morerecentlythedevelopmentwhichisapotent'
>>> output_norm
'morerecentlythedevelopmentwhichisapotent'
>>> difflib.SequenceMatcher(None,input_norm,output_norm).ratio()
1.0
I would recommend stripping away the spaces and looking for dictionary words to break it down into. There are a few things you can do to make it more accurate. To make it get the first word in text with no spaces, try taking the entire string, and going through dictionary words from a file (you can download several such files from http://wordlist.sourceforge.net/), the longest ones first, than taking off letters from the end of the string you want to segment. If you want it to work on a big string, you can make it automatically take off letters from the back so that the string you are looking for the first word in is only as long as the longest dictionary word. This should result in you finding the longest words, and making it less likely to do something like classify "asynchronous" as "a synchronous". Here is an example that uses raw input to take in the text to correct and a dictionary file called dictionary.txt:
dict = open("dictionary.txt",'r') #loads a file with a list of words to break string up into
words = raw_input("enter text to correct spaces on: ")
words = words.strip() #strips away spaces
spaced = [] #this is the list of newly broken up words
parsing = True #this represents when the while loop can end
while parsing:
if len(words) == 0: #checks if all of the text has been broken into words, if it has been it will end the while loop
parsing = False
iterating = True
for iteration in range(45): #goes through each of the possible word lengths, starting from the biggest
if iterating == False:
break
word = words[:45-iteration] #each iteration, the word has one letter removed from the back, starting with the longest possible number of letters, 45
for line in dict:
line = line[:-1] #this deletes the last character of the dictionary word, which will be a newline. delete this line of code if it is not a newline, or change it to [1:] if the newline character is at the beginning
if line == word: #this finds if this is the word we are looking for
spaced.append(word)
words = words[-(len(word)):] #takes away the word from the text list
iterating = False
break
print ' '.join(spaced) #prints the output
If you want it to be even more accurate, you could try using a natural language parsing program, there are several available for python free online.
Here's something really basic:
chunks = []
for chunk in my_str.split():
chunks.append(chunk)
joined = ''.join(chunks)
if is_word(joined):
print joined,
del chunks[:]
# deal with left overs
if chunks:
print ''.join(chunks)
I assume you have a set of valid words somewhere that can be used to implement is_word. You also have to make sure it deals with punctuation. Here's one way to do that:
def is_word(wd):
if not wd:
return False
# Strip of trailing punctuation. There might be stuff in front
# that you want to strip too, such as open parentheses; this is
# just to give the idea, not a complete solution.
if wd[-1] in ',.!?;:':
wd = wd[:-1]
return wd in valid_words
You can iterate through a dictionary of words to find the best fit. Adding the words together when a match is not found.
def iterate(word,dictionary):
for word in dictionary:
if words in possibleWord:
finished_sentence.append(words)
added = True
else:
added = False
return [added,finished_sentence]
sentence = "more recen t ly the develop ment, wh ich is a po ten t "
finished_sentence = ""
sentence = sentence.split()
for word in sentence:
added,new_word = interate(word,dictionary)
while True:
if added == False:
word += possible[sentence.find(possibleWord)]
iterate(word,dictionary)
else:
break
finished_sentence.append(word)
This should work. For the variable dictionary, download a txt file of every single english word, then open it in your program.
my index.py file be like
from wordsegment import load, segment
load()
print(segment('morerecentlythedevelopmentwhichisapotent'))
my index.php file be like
<html>
<head>
<title>py script</title>
</head>
<body>
<h1>Hey There!Python Working Successfully In A PHP Page.</h1>
<?php
$python = `python index.py`;
echo $python;
?>
</body>
</html>
Hope this will work
quick question here: if you run the code below you get a list of frequencies of bigrams per list from the corpus.
I would like to be able to display and keep track of a total running tally. IE instead of what you see displayed when you run it as 1 or maybe 2 for the frequency because the index is so small, it counts through the whole corpus and displays frequencies.
I then basically need to generate text from the frequencies that models the original corpus.
#---------------------------------------------------------
#!/usr/bin/env python
#Ngram Project
#Import all of the libraries we will need for the program to function
import nltk
import nltk.collocations
from collections import defaultdict
import nltk.corpus as corpus
from nltk.corpus import brown
#---------------------------------------------------------
#create our list with the Brown corpus inside variable called "news"
news = corpus.brown.sents(categories = 'editorial')
#This will display the type of variable Python recognizes this as
print "News Is Of The Variable Type : ",type(news),'\n'
#---------------------------------------------------------
#This function will take in the corpus one line at a time
#After searching through and adding a <s> to the beggning of each list item, it also annotates periods out for </s>'
def alter_list(corpus_list):
#Simply check for an instance of a period, and if so, replace with '</s>'
if corpus_list[-1] == '.':
corpus_list[-1] = '</s>'
#Stripe is a modifier that allows us to remove all special characters, IE '\n'
corpus_list[-1].strip()
#Else add to the end of the list item
else:
corpus_list.append('</s>')
return ['<s>'] + corpus_list
#Displays the length of the list 'news'
print "The Length of News is : ",len(news),'\n'
#Allows the user to choose how much of the annotated corpus they would like to see
print "How many lines of the <s> // </s> annotated corpus would you like to see? ", '\n'
user = input()
#Takes user input to determine how many lines to display if any
if(user >= 1):
print "The Corpus Annotated with <s> and </s> looks like : "
print "Displaying [",user,"] rows of the corpus : ", '\n'
for corpus_list in news[:user]:
print(alter_list(corpus_list),'\n')
#Non positive number catch
else:
print "Fine I Won't Show You Any... ",'\n'
#---------------------------------------------------------
print '\n'
#Again allows the user to choose the number of lists from Brown corpus to be displayed in
# Unigram, bigram, trigram and quadgram format
user2 = input("How many list sequences would you like to see broken into bigrams, trigrams, and quadgrams? ")
count = 0
#Function 'ngrams' is run in a loop so that each entry in the list can be gone through and turned into information
#Displayed to the user
while(count < user2):
passer = news[count]
def ngrams(passer, n = 2, padding = True):
#Padding refers to the same idea demonstrated above, that is bump the first word to the second, making
#'None' the first item in each list so that calculations of frequencies can be made
pad = [] if not padding else [None]*(n-1)
grams = pad + passer + pad
return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))
#In this case, arguments are first: n-gram type (bi, tri, quad)
#Followed by in our case the addition of 'padding'
#Padding is used in every case here because we need it for calculations
#This function structure allows us to pull in corpus parts without the added annotations if need be
for size, padding in ((1,1), (2,1), (3, 1), (4, 1)):
print '\n%d - grams || padding = %d' % (size, padding)
print list(ngrams(passer, size, padding))
# show frequency
counts = defaultdict(int)
for n_gram in ngrams(passer, 2, False):
counts[n_gram] += 1
print ("======================================================================================")
print '\nFrequencies Of Bigrams:'
for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
print c, n_gram
print '\nFrequencies Of Trigrams:'
for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
print c, n_gram
count = count + 1
#---------------------------------------------------------
I'm not sure I understand the question. nltk has a function generate. The book from which nltk comes from is available online.
http://nltk.org/book/ch01.html
Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
The problem is that you define the dict counts anew for each sentence, so the ngram counts get reset to zero. Define it above the while loop and the counts will accumulate over the entire Brown corpus.
Bonus advice: You should also move the definition of ngram outside the loop-- it's nonsensical to define the same function over and over and over. (But it does no harm, except to performance). Better yet, you should use the nltk's ngram function and read about FreqDist, which is like a dict counter on steroids. It will come in handy when you tackle the statistical text generation.
I'm working on a random text generator -without using Markov chains- and currently it works without too many problems -actually generates a good amount of random sentences by my criteria but I want to make it even more accurate to prevent as many sentence repeats as possible-. Firstly, here is my code flow:
Enter a sentence as input -this is called trigger string, is assigned to a variable-
Get longest word in trigger string
Search all Project Gutenberg database for sentences that contain this word -regardless of uppercase lowercase-
Return the longest sentence that has the word I spoke about in step 3
Append the sentence in Step 1 and Step4 together
Assign the sentence in Step 4 as the new 'trigger' sentence and repeat the process. Note that I have to get the longest word in second sentence and continue like that and so on-
And here is my code:
import nltk
from nltk.corpus import gutenberg
from random import choice
import smtplib #will be for send e-mail option later
triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
longestLength = 0
longestString = ""
longestLen2 = 0
longestStr2 = ""
listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-
listOfWords = gutenberg.words()# all words in gutenberg books -list format-
while triggerSentence:#run the loop so long as there is a trigger sentence
sets = []
sets2 = []
split_str = triggerSentence.split()#split the sentence into words
#code to find the longest word in the trigger sentence input
for piece in split_str:
if len(piece) > longestLength:
longestString = piece
longestLength = len(piece)
#code to get the sentences containing the longest word, then selecting
#random one of these sentences that are longer than 40 characters
for sentence in listOfSents:
if sentence.count(longestString):
sents= " ".join(sentence)
if len(sents) > 40:
sets.append(" ".join(sentence))
triggerSentence = choice(sets)
print triggerSentence #the first sentence that comes up after I enter input-
split_str = triggerSentence.split()
for apiece in triggerSentence: #find the longest word in this new sentence
if len(apiece) > longestLen2:
longestStr2 = piece
longestLen2 = len(apiece)
if longestStr2 == longestString:
second_longest = sorted(split_str, key=len)[-2]#this should return the second longest word in the sentence in case it's longest word is as same as the longest word of last sentence
#print second_longest #now get second longest word if first is same
#as longest word in previous sentence
for sentence in listOfSents:
if sentence.count(second_longest):
sents = " ".join(sentence)
if len(sents) > 40:
sets2.append(" ".join(sentence))
triggerSentence = choice(sets2)
else:
for sentence in listOfSents:
if sentence.count(longestStr2):
sents = " ".join(sentence)
if len(sents) > 40:
sets.append(" ".join(sentence))
triggerSentence = choice(sets)
print triggerSentence
According to my code, once I enter a trigger sentence, I should get another one that contains the longest word of the trigger sentence I entered. Then this new sentence becomes the trigger sentence and it's longest word is picked. This is where the problem sometimes occurs. I observed that despite the code lines I placed - starting from line 47 to the end- , the algorithm still can pick the same longest word in the sentences that come along, not looking for the second longest word.
For example:
Trigger string = "Scotland is a nice place."
Sentence 1 = -This is a random sentence with the word Scotland in it-
Now, this is where the problem can occur in my code at times -doesn't matter whether it comes up in sentence 2 or 942 or zillion or whatever, but I give it in sent.2 for example's sake-
Sentence 2 = Another sentence that has the word Scotland in it but not the second longest word in sentence 1. According to my code, this sentence should have been some sentence that contained the second longest word in sentence 1, not Scotland !
How can I solve this ? I'm trying to optimize the code as much as possible and any help is welcome.
There is nothing random about your algorithm at all. It should always be deterministic.
I'm not quite sure what you want to do here. If it is to generate random words, just use a dictionary and the random module. If you want to grab random sentences from the Gutenberg project, use the random module to pick a work and then a sentence out of that work.