So I have been learning Python and I am working on a little project, but I have run into a problem.
What I am trying to do is get the program to pick x words from a text file and then repeat that task x times.
So let's say, for example, I wanted to have 5 words in the sentence and do this 3 times. The result would be the following:
word1 word2 word3 word4 word5
word1 word2 word3 word4 word5
word1 word2 word3 word4 word5
This is what I have so far:
import random

word_file = "words.txt"  # the word file to read
Words = open(word_file).read().splitlines()  # read the file into a list of words
sent = random.randrange(0, 1100)  # pick a random index into the 1100-word list
print(Words[sent])  # print the word
That will generate one word from the list of 1100 words. So then I tried to repeat this task x times, but instead it just repeated the same randomly chosen word x times.
Here is that code:
import random

word_file = "words.txt"  # the word file to read
Words = open(word_file).read().splitlines()  # read the file into a list of words
sent = random.randrange(0, 1100)  # pick a random index -- note this happens only once
for x in range(0, 3):  # repeat 3 times
    print(Words[sent])  # print the word
So I am really running into two problems. The first is that it keeps repeating the word that was chosen the first time; the second is that it prints each word on its own line instead of x words per line before moving to the next line.
Would anyone be able to point me in the right direction to sorting this out?
Let me explain your code a little bit:

sent = random.randrange(0,1100)  # <= returns a random number in the range 0 to 1099; it is computed once and never changes

for x in range(0, 3):
    print(Words[sent])  # <= prints the word at index sent; since Words and sent never change inside the loop, the same word is printed 3 times
To fix this, you need to generate a new random number each time you want a new word to be output:
for x in range(0, 3):
    sent = random.randrange(0, 1100)
    print(Words[sent])
You just need to compute a new random number each time:

for x in range(0, 3):
    sent = random.randrange(0, 1100)
    print(Words[sent])
Though what might be easier for your case is to use the built-in random.choices() function:

print(random.choices(Words, k=3))

This will print a list of 3 random words (chosen with replacement) from your Words list.
If you aren't using Python 3.6 or newer, you can just call random.choice(Words) over and over again.
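For example, here is a minimal sketch that produces the exact output from the question (5 words per line, 3 lines); it assumes words.txt sits alongside the script and Python 3.6+ for random.choices:

import random

Words = open("words.txt").read().splitlines()
for _ in range(3):                                # 3 lines
    print(" ".join(random.choices(Words, k=5)))  # 5 random words per line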
You could abstract it into a function:

def my_function(x, y):
    pass  # your code here

# your script goes here
my_function(x, y)
You are only generating the random number once; you need to generate a new random number each time for it to be different (that's where the function can help a lot). Make sure your function definition comes before you call it.
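For instance, a minimal sketch of that abstraction (the function name and the 5-words/3-lines values are just illustrative):

import random

def pick_word(words):
    # a fresh random index is generated on every call
    return words[random.randrange(0, len(words))]

words = open("words.txt").read().splitlines()
for _ in range(3):
    print(" ".join(pick_word(words) for _ in range(5)))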
fhand = open(r"Apple - 2019.txt")
lines = fhand.readlines()
for line in lines:
    print(line)

LMword = ["apple", "company", "numbers"]

How do I go further to answer the question: how do I define a function that counts how many times the words from the LMword list (all words in total) appear in the Apple text?
This will give you the total count (for all words):
c = sum([apple.count(i) for i in LMword])
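Here is a hedged sketch of how the pieces fit together; it assumes apple should hold the whole file as one lowercased string, which the snippet above never actually builds:

fhand = open(r"Apple - 2019.txt")
apple = fhand.read().lower()               # the whole text as one string
LMword = ["apple", "company", "numbers"]

c = sum(apple.count(w) for w in LMword)    # total count across all words
print(c)

# per-word breakdown, if needed; note that str.count matches substrings,
# so "apple" would also be counted inside "apples"
print({w: apple.count(w) for w in LMword})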
I am building word frequency, and relative frequency, lists for a collection of text files. Having discovered, by hand, that a couple of texts can overly influence the frequency of a word, one of the things I want to be able to do is count the number of texts in which a word occurs. It strikes me that there are two ways to do this:
First, to compile a word frequency dictionary (as below; I'm not using the NLTK FreqDist because this code actually runs more quickly, but if FreqDist has the above functionality built in and I just didn't know it, I'll take it):
import nltk

tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')

freq_dic = {}
for text in ftexts:  # ftexts is the collection of raw text strings
    words = tokenizer.tokenize(text)
    for word in words:
        # build the frequency dictionary
        try:
            freq_dic[word] += 1
        except KeyError:
            freq_dic[word] = 1
From there, I assume I'll need to write another loop that uses the keys above as keywords:
# This is just scratch code
count = 0
for text in ftexts:
    if keyword not in text:
        continue
    count = count + 1
And then I'll find some way to mesh these two dictionaries into a tuple or, possibly, a pandas dataframe by word, such that:
word1, frequency, # of texts in which it occurs
word2, frequency, # of texts in which it occurs
The other thing that occurred to me as I was writing this question was to use SciKit's term frequency matrix and then count rows in which a word occurs? Is that possible?
ADDED TO CLARIFY:
Imagine three sentences:
["I need to keep count of the children.",
"If you want to know what the count is, just ask."
"There is nothing here but chickens, chickens, chickens."]
"count" occurs 2x but is in two different texts; "chickens" occurs three times, but is in only one text. What I want is a report that looks like this:
WORD, FREQ, TEXTS
count, 2, 2
chicken, 3, 1
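To answer the scikit-learn part: yes, that works. Below is a hedged sketch (my own code, not from the question) using CountVectorizer on the three example sentences; summing each column of the term-document matrix gives FREQ, and counting the nonzero rows in each column gives TEXTS. (get_feature_names_out requires scikit-learn 1.0+; older versions use get_feature_names.)

from sklearn.feature_extraction.text import CountVectorizer

texts = ["I need to keep count of the children.",
         "If you want to know what the count is, just ask.",
         "There is nothing here but chickens, chickens, chickens."]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)   # rows = texts, columns = words

freq = X.sum(axis=0).A1               # total occurrences of each word
docs = (X > 0).sum(axis=0).A1         # number of texts containing each word

for word, f, d in zip(vectorizer.get_feature_names_out(), freq, docs):
    print(word, f, d)                 # e.g. count 2 2, chickens 3 1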
I have a long string which I want to save to a file. Words are separated by spaces, and it is given that the number of words in the long string is divisible by 3.
Basically I'm looking for a way to split the string into chunks, where each chunk is less than n characters long and the number of words in each chunk is also divisible by 3.
e.g.
>>> longstring = "This is a very long string and the sum of words is divisible by three"
>>> len(longstring.split())
15
say max line length is n=30:
>>> split_string(longstring, 30)
['This is a very long string', 'and the sum of words is', 'divisible by three']
In summary, the rules are:
No line longer than n characters.
A twist is that each new line must contain a multiple of 3 words.
So far I have tried using textwrap, but I don't know how to implement rule 2:
import textwrap
textwrap.fill(long_line, width=69)
If you are certain that the total number of words in a string will always be divisible by 3, you can do something like this:
import sys

# long string; 84 words; divisible by 3
longString = "The charges are still sealed under orders from a federal judge. Plans were prepared Friday for anyone charged to be into custody as soon as Monday, the sources said. It is unclear what the charges are. A spokesman for the special counsel's office declined to comment. The White House also had no comment, a senior administration official said Saturday morning. A spokesman for the special counsel's office declined to comment. The White House also had no comment, a senior administration official said Saturday morning."

# convert the string to a list of words
listOfWords = longString.split()

# list to contain the finished lines
lines = []

# make sure the number of words is divisible by 3
if len(listOfWords) % 3 != 0:
    print "word count is not divisible by 3"
    sys.exit()

# keep going until the word list is empty
while listOfWords:
    i = 0
    line = ""
    # build one line
    while True:
        # join the next 3 words into a string
        temp = " ".join(listOfWords[i:i+3])
        # check the new length of the line after adding these 3 words;
        # if it would exceed 70 characters, break out of the loop
        if len(line) + len(temp) > 70:
            break
        line += "{} ".format(temp)
        i += 3
    # remove the consumed words from the list completely
    listOfWords = listOfWords[i:]
    # add the line to the result list
    lines.append(line.strip())

# to make sure this works
for line in lines:
    print "Number of words: {}".format(len(line.split()))
    print "number of chars: {}".format(len(line))
    print line
    print "----------------------------------------"
Quick question here: if you run the code below, you get a list of bigram frequencies per sentence from the corpus.
I would like to be able to display and keep track of a total running tally. That is, instead of the frequencies of 1 or maybe 2 you see displayed when you run it (because each sentence is so small), it should count through the whole corpus and display the accumulated frequencies.
I then basically need to generate text from the frequencies that models the original corpus.
#---------------------------------------------------------
#!/usr/bin/env python
#Ngram Project
#Import all of the libraries we will need for the program to function
import nltk
import nltk.collocations
from collections import defaultdict
import nltk.corpus as corpus
from nltk.corpus import brown
#---------------------------------------------------------
#create our list with the Brown corpus inside a variable called "news"
news = corpus.brown.sents(categories = 'editorial')
#This will display the type of variable Python recognizes this as
print "News Is Of The Variable Type : ",type(news),'\n'
#---------------------------------------------------------
#This function takes in the corpus one sentence (list of words) at a time.
#It adds '<s>' to the beginning of each sentence and swaps a final period for '</s>'
def alter_list(corpus_list):
    #Simply check for a final period, and if there is one, replace it with '</s>'
    if corpus_list[-1] == '.':
        corpus_list[-1] = '</s>'
    #Else append '</s>' to the end of the sentence
    else:
        corpus_list.append('</s>')
    return ['<s>'] + corpus_list
#Displays the length of the list 'news'
print "The Length of News is : ", len(news), '\n'
#Allows the user to choose how much of the annotated corpus they would like to see
print "How many lines of the <s> // </s> annotated corpus would you like to see? ", '\n'
user = input()
#Takes user input to determine how many lines to display, if any
if(user >= 1):
    print "The Corpus Annotated with <s> and </s> looks like : "
    print "Displaying [", user, "] rows of the corpus : ", '\n'
    for corpus_list in news[:user]:
        print(alter_list(corpus_list), '\n')
#Non-positive number catch
else:
    print "Fine I Won't Show You Any... ", '\n'
#---------------------------------------------------------
print '\n'
#Again allows the user to choose the number of lists from Brown corpus to be displayed in
# Unigram, bigram, trigram and quadgram format
user2 = input("How many list sequences would you like to see broken into bigrams, trigrams, and quadgrams? ")
count = 0
#The function 'ngrams' is run in a loop so that each sentence in the list can be
#processed and the resulting information displayed to the user
while(count < user2):
    passer = news[count]
    def ngrams(passer, n = 2, padding = True):
        #Padding refers to the same idea demonstrated above: bump the first word to the
        #second position, making 'None' the first item, so that frequency calculations work
        pad = [] if not padding else [None]*(n-1)
        grams = pad + passer + pad
        return (tuple(grams[i:i+n]) for i in range(0, len(grams) - (n - 1)))
    #In this case the arguments are, first, the n-gram size (uni, bi, tri, quad),
    #followed in our case by the addition of 'padding'.
    #Padding is used in every case here because we need it for the calculations.
    #This function structure lets us pull in corpus parts without the added annotations if need be.
    for size, padding in ((1, 1), (2, 1), (3, 1), (4, 1)):
        print '\n%d - grams || padding = %d' % (size, padding)
        print list(ngrams(passer, size, padding))
    # show frequency
    counts = defaultdict(int)
    for n_gram in ngrams(passer, 2, False):
        counts[n_gram] += 1
    print "======================================================================================"
    print '\nFrequencies Of Bigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram
    print '\nFrequencies Of Trigrams:'
    for c, n_gram in sorted(((c, n_gram) for n_gram, c in counts.iteritems()), reverse = True):
        print c, n_gram
    count = count + 1
#---------------------------------------------------------
I'm not sure I understand the question. nltk has a generate function, and the book that nltk comes from is available online.
http://nltk.org/book/ch01.html
Now, just for fun, let's try generating some random text in the various styles we have just seen. To do this, we type the name of the text followed by the term generate. (We need to include the parentheses, but there's nothing that goes between them.)
>>> text3.generate()
In the beginning of his brother is a hairy man , whose top may reach
unto heaven ; and ye shall sow the land of Egypt there was no bread in
all that he was taken out of the month , upon the earth . So shall thy
wages be ? And they made their father ; and Isaac was old , and kissed
him : and Laban with his cattle in the midst of the hands of Esau thy
first born , and Phichol the chief butler unto his son Isaac , she
The problem is that you define the dict counts anew for each sentence, so the ngram counts get reset to zero. Define it above the while loop and the counts will accumulate over the entire Brown corpus.
Bonus advice: you should also move the definition of ngrams outside the loop; it's pointless to define the same function over and over (though it does no harm, except to performance). Better yet, use nltk's ngrams function and read about FreqDist, which is like a dict counter on steroids. It will come in handy when you tackle the statistical text generation.
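As a hedged sketch of that fix (my own code, written for Python 3; nltk.ngrams and FreqDist are the library functions mentioned above): the counter is created once, before the loop, so the bigram counts accumulate over the whole corpus instead of being reset per sentence.

import nltk
from nltk import FreqDist
from nltk.corpus import brown

counts = FreqDist()  # created once, outside the loop
for sent in brown.sents(categories='editorial'):
    counts.update(nltk.ngrams(sent, 2))  # running tally of bigrams

print(counts.most_common(10))  # the ten most frequent bigrams overall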
I'm working on a random text generator (without using Markov chains). Currently it works without too many problems; it actually generates a good number of random sentences by my criteria, but I want to make it even more accurate to prevent as many sentence repeats as possible. Firstly, here is my code flow:
Enter a sentence as input (this is called the trigger string and is assigned to a variable)
Get the longest word in the trigger string
Search the whole Project Gutenberg database for sentences that contain this word (regardless of case)
Return the longest sentence that contains the word I spoke about in step 3
Append the sentences in Step 1 and Step 4 together
Assign the sentence in Step 4 as the new 'trigger' sentence and repeat the process. Note that I have to get the longest word in the second sentence and continue like that, and so on.
And here is my code:
import nltk
from nltk.corpus import gutenberg
from random import choice
import smtplib #will be for send e-mail option later
triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
longestLength = 0
longestString = ""
longestLen2 = 0
longestStr2 = ""
listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-
listOfWords = gutenberg.words()# all words in gutenberg books -list format-
while triggerSentence:  #run the loop so long as there is a trigger sentence
    sets = []
    sets2 = []
    split_str = triggerSentence.split()  #split the sentence into words

    #code to find the longest word in the trigger sentence input
    for piece in split_str:
        if len(piece) > longestLength:
            longestString = piece
            longestLength = len(piece)

    #code to get the sentences containing the longest word, then selecting a
    #random one of these sentences that is longer than 40 characters
    for sentence in listOfSents:
        if sentence.count(longestString):
            sents = " ".join(sentence)
            if len(sents) > 40:
                sets.append(" ".join(sentence))

    triggerSentence = choice(sets)
    print triggerSentence  #the first sentence that comes up after I enter input

    split_str = triggerSentence.split()
    for apiece in triggerSentence:  #find the longest word in this new sentence
        if len(apiece) > longestLen2:
            longestStr2 = piece
            longestLen2 = len(apiece)

    if longestStr2 == longestString:
        #this should return the second longest word in the sentence, in case its
        #longest word is the same as the longest word of the last sentence
        second_longest = sorted(split_str, key=len)[-2]
        #print second_longest  #now get the second longest word if the first is
        #the same as the longest word in the previous sentence
        for sentence in listOfSents:
            if sentence.count(second_longest):
                sents = " ".join(sentence)
                if len(sents) > 40:
                    sets2.append(" ".join(sentence))
        triggerSentence = choice(sets2)
    else:
        for sentence in listOfSents:
            if sentence.count(longestStr2):
                sents = " ".join(sentence)
                if len(sents) > 40:
                    sets.append(" ".join(sentence))
        triggerSentence = choice(sets)

    print triggerSentence
According to my code, once I enter a trigger sentence, I should get another one that contains the longest word of the trigger sentence I entered. Then this new sentence becomes the trigger sentence, and its longest word is picked. This is where the problem sometimes occurs. I observed that, despite the code lines I placed (starting from line 47 to the end), the algorithm can still pick the same longest word in the sentences that come along instead of looking for the second longest word.
For example:
Trigger string = "Scotland is a nice place."
Sentence 1 = -This is a random sentence with the word Scotland in it-
Now, this is where the problem can occur in my code at times. It doesn't matter whether it comes up in sentence 2 or 942 or a zillion, but I use sentence 2 for example's sake:
Sentence 2 = another sentence that has the word Scotland in it, but not the second longest word of sentence 1. According to my code, this sentence should have been one that contained the second longest word of sentence 1, not Scotland!
How can I solve this ? I'm trying to optimize the code as much as possible and any help is welcome.
There is nothing random about your algorithm at all. It should always be deterministic.
I'm not quite sure what you want to do here. If it is to generate random words, just use a dictionary and the random module. If you want to grab random sentences from the Gutenberg project, use the random module to pick a work and then a sentence out of that work.
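For the second suggestion, a minimal sketch (my own, assuming the NLTK Gutenberg corpus has been downloaded) could look like this:

import random
from nltk.corpus import gutenberg

# pick a random work, then a random sentence out of that work
work = random.choice(gutenberg.fileids())
sentence = random.choice(gutenberg.sents(work))
print(" ".join(sentence))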