For an assessment in my computing class I have completed the first two tasks, but I need help understanding what the third one is asking for. It states: "Develop a program that builds upon the technique from Task 2 to compress a text file with several sentences, including punctuation. The program should be able to compress a file into a list of words and list of positions to recreate the original file. It should also be able to take a compressed file and recreate the full text, including punctuation and capitalisation, of the original file".
I understand some of this, but I don't really understand what it actually wants me to do. Since I have to build on the technique from Task 2, the description and my solution for Task 2 are below (the solution isn't finished because I don't have access to my finished one).
"Develop a program that identifies individual words in a sentence, stores these in a list and replaces each word in the original sentence with the position of that word in the list.
For example, the sentence ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY
contains the words ASK, NOT, WHAT, YOUR, COUNTRY, CAN, DO, FOR, YOU
The sentence can be recreated from the positions of these words in this list using the sequence
1,2,3,4,5,6,7,8,9,1,3,9,6,7,8,4,5
Save the list of words and the positions of these words in the sentence as separate files or as a single file."
And the code for task 2:
restart = 'y'
while (True):
    sentence = input("What is your sentence?: ")
    sentence_split = sentence.split()
    sentence2 = [0]
    print(sentence)
    for count, i in enumerate(sentence_split):
        if sentence_split.count(i) < 2:
            sentence2.append(max(sentence2) + 1)
        else:
            sentence2.append(sentence_split.index(i) +1)
    sentence2.remove(0)
    print(sentence2)
    restart = input("would you like restart the programme y/n?").lower()
    if (restart == "n"):
        print ("programme terminated")
        break
    elif (restart == "y"):
        pass
    else:
        print ("Please enter y or n")
As your solution for the second task shows, you have already compressed one sentence with the technique described in the task.
You should now provide a program that has two functions:
Read a file and use your technique to create a list of all the words it contains and the sequence of their positions, then write both to a file (or stdout).
Read the output created by the first function and reproduce the original file.
Your program could have a command-line interface like this, which may make the task clearer for you:
python task3.py compress /path/to/inputtext.txt /path/to/outputfile
python task3.py extract /path/to/outputfile /path/to/inputtext.txt
This is a very simple way to compress a text file. On top of that you need to deal with Python's file API. Nice task!
I am doing the same task as you for my GCSE and I was confused as well.
However, Task 3 is asking you to alter your code so that splitting the sentence is now case sensitive. E.g. hello and Hello must be treated as separate entities, so they must get different numbers when the text is regenerated.
Also, your code must work for multiple sentences rather than just one sentence.
Finally, you must split the punctuation marks off into separate entities as well.
Use a regular expression to separate the punctuation from the words.
Don't apply .lower() to the text, so that your sentences stay case sensitive.
Allow the code to treat the "." mark as an entity.
Hope that helped.
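Putting the two answers together, a minimal sketch in Python 3 could look like the following. The function names (tokenize, compress, decompress, compress_file, extract_file), the regular expression and the choice of JSON as the storage format are all assumptions made for illustration; the task does not prescribe any of them. Whitespace is stored as tokens too, so that joining the tokens recreates the text exactly.

```python
import json
import re


def tokenize(text):
    # Split into runs of word characters and single non-word characters
    # (punctuation, spaces, newlines). Every character lands in exactly
    # one token, so "".join(tokens) gives back the original text.
    return re.findall(r"\w+|\W", text)


def compress(text):
    words = []       # unique tokens, in order of first appearance
    index = {}       # token -> 1-based position in `words`
    positions = []   # one entry per token of the original text
    for token in tokenize(text):
        if token not in index:      # case sensitive: "Hello" != "hello"
            words.append(token)
            index[token] = len(words)
        positions.append(index[token])
    return words, positions


def decompress(words, positions):
    # Positions are 1-based, as in the Task 2 example.
    return "".join(words[p - 1] for p in positions)


def compress_file(src, dst):
    # newline="" keeps the original line endings untouched.
    with open(src, encoding="utf-8", newline="") as f:
        words, positions = compress(f.read())
    with open(dst, "w", encoding="utf-8") as f:
        json.dump({"words": words, "positions": positions}, f)


def extract_file(src, dst):
    with open(src, encoding="utf-8") as f:
        data = json.load(f)
    with open(dst, "w", encoding="utf-8", newline="") as f:
        f.write(decompress(data["words"], data["positions"]))
```

Storing spaces as tokens makes the position list longer than strictly necessary, but it keeps the round trip exact without any special rules for where spaces go around punctuation.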
I've been stuck on this problem:
I am given a list of text files (e.g. 'ZchEuxJ9VJ.txt', 'QAih70niIq.txt') which each contain a randomly generated paragraph. I am to write a parser that will extract the individual words from each of the files listed in the folder. My end goal is to create a dictionary with keys that represent the individual words found in my files. The values are supposed to be a list of files that contain that word. For example, if I print ('happy:', search['happy']), the files that contain the word happy should be added as values to my dictionary. If the word is a "new" word I would have to set it up for the first time. If the word is not a "new" word I would have to update the list associated with that word. I also have to make sure that I don't add the same filename to the same word twice.
I've already created a new dictionary called search, visited each of the files and opened them for reading, then isolated each word using the split() method. The thing I am struggling with is how to "find" a word in a particular file and mark down which file a word can be found in. I've tried "searching" for a word in a file then adding the file to the dictionary, but that gets me an error message.
Instead of searching for words in files, you should be going about it the other way around. You know you need to index every word in every file eventually, so why not just go through every word in every file, in order? In pseudocode it might look something like this.
for each file:
    for each word in file:
        if not word in search:
            search[word] = []
        search[word].append(file)
This is homework, so I'm going to help you with the algorithm, not the code. You seem to have figured out most of the problem. The only thing you need help with is how to actually populate the dictionary.
Open the file (say fname), read its contents
Split the contents to separate each word
Iterate over each word. Say we call it fword.
Is fword in the dictionary?
    No? Create the key with an empty list as the value
    Yes? Do nothing and move on
Now you know that fword is a key in the search dictionary, and its value is a list. (Say we call this list fwlist)
You also know that fword was found in the file fname
Check if fname is already in fwlist
    No? Add fname to fwlist.
    Yes? Don't add it again. Do nothing.
Now, there are optimizations you can make, such as using a set instead of a list. This way you don't need to check if fname already exists in fwlist, because sets automatically discard duplicates, but this should be enough for what you need.
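For reference, the set-based variant mentioned above could be sketched like this in Python 3. The function name build_index, the folder argument and the "*.txt" pattern are assumptions for illustration, and words are separated with a plain split(), so punctuation stays attached to them.

```python
from collections import defaultdict
from pathlib import Path


def build_index(folder):
    # Maps each word to the set of file names that contain it.
    search = defaultdict(set)
    for path in sorted(Path(folder).glob("*.txt")):
        for word in path.read_text(encoding="utf-8").split():
            # Adding to a set does nothing if the name is already there,
            # so no explicit duplicate check is needed.
            search[word].add(path.name)
    return search
```

With this, search['happy'] is the set of file names in which the word happy occurs.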
Remember: before you start writing the program, it's helpful to sit down, think about the problem you're trying to solve, and plan out how you're going to attack the problem. Drawing a flowchart helps immensely when you're a novice programmer because it helps you organize your thoughts and figure out how your program is supposed to flow.
Debugging is also a crucial skill: your code is useless if you can't fix errors. See "How to debug small programs" and "What is a debugger and how can it help me diagnose problems?".
The file is a plain-text ("txt") file, and I have separate files for phrases of different lengths (spaces count towards the phrase length). I say phrases because an entry can be multiple words, but in the example below I use three-letter entries that are each one word. Each phrase is on its own line and is followed by a comma. Imagine you have a file like this:
app,
bar,
car,
eel,
get,
pod,
What I want is to be able to add one or more phrases that will be assumed to only contain lowercase alphabetical letters and/or spaces.
For example, let us say I want to add the phrases in this order:
(cat, bat, car, hat, mom, rat)
Basically, I want to add these phrases to the file without deleting the file, making sure that no phrase is repeated in the file and that the phrases stay alphabetically sorted. Spaces are assumed to come after the letter z when sorting. So after inputting these phrases, the file should look like this:
'
app,
bar,
bat,
car,
eel,
get,
hat,
mom,
pod,
rat
'
Each file is assumed to grow to at least a gigabyte of data, so copying the file in order to accomplish this is a no-go. What is the fastest and least memory-consuming way to do this?
I haven't tried anything that 100% works. I know what to do, I just don't know how to do it. Here are the main points that I need to accomplish.
1) Make sure the phrase(s) are created (using input() function)
2) Open the file of organized words (using "with open(filename)" statements)
3) Put each phrase into the "correct" spot in the file. By "correct" I mean that is alphabetical and is not a repeat.
4) Make sure the file doesn't get deleted.
Here is what I have currently (changed it a bit and it is doing MORE of what I want, but not everything):
phrase_to_add = input('Please enter the phrase: ').lower()
with open('/Users/ian/Documents/three_character_phrases.txt') as file:
    unique_phrases = list(file.read().split())
    unique_phrases.append(phrase_to_add)
    unique_phrases.sort()
    list_of_phrases = set()
    for phrase in unique_phrases:
        list_of_phrases.add(phrase)
with open('/Users/ian/Documents/three_character_phrases.txt', 'w') as fin:
    for phrase in list_of_phrases:
        fin.write(phrase + '\n')
So I started with BOTH files being empty and I added the word 'cow' by typing it at the prompt, and this is what the file looked like:
three_character_phrases.txt:
cow
then I inputted the word "bat" and I got this:
bat
cow
then I added the word "bawk" (I know it isn't a 3 letter word but I'll take care of making sure the right words go into the right files)
I got this:
bawk
bat
cow
It looks like you're getting wrapped up in the implementation instead of trying to understand the concept, so let me invite you to take a step back with me.
You have a data structure that resembles a list (since order is relevant) but allows no duplicates.
['act', 'bar', 'dog']
You want to add an entry to that list
['act', 'bar', 'cat', 'dog']
and serialize the whole thing to file afterwards so you can use the same data between multiple sessions.
First up is to establish your serialization method. You've chosen a plain text file, line delimited. There's nothing wrong with that, but if you were looking for alternatives then a csv, a json, or indeed serializing directly to database might be good too. Let's proceed forward under the assumption that you won't change serialization schemas, though.
It's easy to read from file
from pathlib import Path
FILEPATH = Path("/Users/ian/Documents/three_character_phrases.txt")
def read_phrases():
    with FILEPATH.open(mode='r') as f:
        return [line.strip() for line in f]
and it's easy to write to it, too.
# Assume FILEPATH is defined here, and in all future snippets as well.
def write_phrases(phrases):
    with FILEPATH.open(mode='w') as f:
        f.writelines(f'{phrase}\n' for phrase in phrases)
        # this is equivalent to:
        # text = '\n'.join(phrases)
        # f.write(text + '\n')
You've even figured out how to have the user enter a new value, though your algorithm could use work to make the worst case better. Since you're always inserting into a sorted list, the bisect stdlib module can help your performance here for large lists; I'll leave that for a different question, though.
Since you've successfully done all the single steps, the only thing holding you back is to put them all together.
phrases = read_phrases()
phrase_to_add = input('Please enter the phrase: ').lower()
if phrase_to_add not in phrases:
    phrases.append(phrase_to_add)
    phrases.sort() # this is, again, not optimal. Look at bisect!
write_phrases(phrases)
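As a rough illustration of the bisect hint, the membership test, append and sort can be replaced by a binary search followed by a single insert. The helper name add_phrase is hypothetical, and the sketch assumes the list is already sorted in Python's default string order (where a space sorts before the letters, not after z as the question asks). It still holds the whole list in memory, so it does not address the gigabyte-sized files.

```python
import bisect


def add_phrase(phrases, phrase):
    # `phrases` must already be sorted; it is modified in place.
    i = bisect.bisect_left(phrases, phrase)   # binary search for the slot
    if i < len(phrases) and phrases[i] == phrase:
        return False                          # already present, nothing to do
    phrases.insert(i, phrase)                 # the list stays sorted
    return True
```

Called as add_phrase(phrases, phrase_to_add) between reading and writing, it keeps the list sorted and free of duplicates without re-sorting it each time.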
for word in wordStr:
    word = word.strip()
print word
When the above code analyzes a .txt with thousands of words, why does it only return the last word in the .txt file? What do I need to do to get it to return all the words in the text file?
Because you're overwriting word in the loop, not really a good idea. You can try something like:
wordlist = ""
for word in wordStr:
    wordlist = "%s %s"%(wordlist,word.strip())
print wordlist[1:]
This is fairly primitive Python and I'm sure there's a more Pythonic way to do it with list comprehensions and all that new-fangled stuff :-) but I usually prefer readability where possible.
What this does is to maintain a list of the words in a separate string and then add each stripped word to the end of that list. The [1:] at the end is simply to get rid of the initial space that was added when the first word was tacked on to the end of the empty word list.
It will suffer eventually as the word count becomes substantial since tacking things on to the end of a string is less optimal than other data structures. However, even up to 10,000 words (with the print removed), it's still well under a second of execution time.
At 50,000 words it becomes noticeable, taking 3 seconds on my box. If you're going to be processing that sort of quantity, you would probably opt for a real list-based solution like (equivalent to above but with a different underlying data structure):
wordlist = []
for word in wordStr:
    wordlist.append (word.strip())
print wordlist
That takes about 0.22 seconds (without the print) to do my entire dictionary file, some 110,000 words.
To print all the words in wordStr (assuming that wordStr is some kind of iterable that yields strings), you can simply write
for word in wordStr:
    word = word.strip()
    print word # Notice that the only difference is the indentation on this line
Python cares about indentation, so in your code the print statement is outside the loop and is only executed once. In the modified version, the print statement is inside the loop and is executed once per word.
That is because the variable word always holds the current word, so once the loop has gone through the whole file it holds the last one. Collect the words in a list instead:
def stuff():
    words = []
    for word in wordStr:
        words.append(word.strip())
    print words
    return words
List comprehensions should make your code snazzier and more pythonic.
wordStr = 'here are some words'
print [word.strip() for word in wordStr.split()]
returns
['here', 'are', 'some', 'words']
If you don't know why the code in your example is "returning" only the last word (actually it's not returning anything, it's printing a single word), then I'm afraid nothing here is going to help you very much.
I know that sounds hostile, but I don't wish to be. It seems from your question that you are throwing bits of Python together with little real understanding of the fundamental basics of programming in Python.
Now don't get me wrong, trying stuff out is a great early learning activity, and having a task to motivate your learning is also a great way to do it. So I don't want to tell you to stop! But whether you're trying to learn to program or just have a task that needs you to write a program, you aren't going to get very far without developing an understanding of the fundamental underlying issues that make your example not do what you want it to do.
We can tell you here that this:
for word in wordStr:
    word = word.strip()
print word
is a program that roughly means "for every word in wordStr, bind word to the result of word.strip(); then after all that, print the contents of word", whereas what you wanted is likely:
for word in wordStr:
    word = word.strip()
    print word
which is a program that roughly means "for every word in wordStr, bind word to the result of word.strip() and then print word". And that solves your immediate problem. But you're going to run into many more problems of a very similar nature, and without an understanding of the very basics you won't be able to see that they're all "of a kind", and you'll just end up asking more questions here.
What you need is to gain a basic understanding of how variables, statements, loops, etc work in Python. You will probably eventually gain that if you just keep trying to apply code and ask questions here. But Stack Overflow is not the most efficient resource for gaining that understanding; a much better bet would be to find yourself a good book or tutorial (there's an official one at http://docs.python.org/tutorial/).
Here endeth the soap box.
I have to write a program in Python where the user is given a menu with four different "word games". There is a file called dictionary.txt, and one of the games requires the user to input a) the number of letters in a word and b) a letter to exclude from the words being searched for in the dictionary (dictionary.txt has the whole dictionary). The program then prints the words that meet the user's requirements. My question is how on earth do I open the file and search for words with a certain length in that file? I only have some basic code which just asks the user for inputs. I am very new at this, please help :(
this is what I have up to the first option. The others are fine and I know how to break the loop but this specific one is really giving me trouble. I have tried everything and I just keep getting errors. Honestly, I only took this class because someone said it would be fun. It is, but recently I've really been falling behind and I have no idea what to do now. This is an intro level course so please be nice I've never done this before until now :(
print
print "Choose Which Game You Want to Play"
print "a) Find words with only one vowel and excluding a specific letter."
print "b) Find words containing all but one of a set of letters."
print "c) Find words containing a specific character string."
print "d) Find words containing state abbreviations."
print "e) Find US state capitals that start with months."
print "q) Quit."
print
choice = raw_input("Enter a choice: ")
choice = choice.lower()
print choice
while choice != "q":
    if choice == "a":
        #wordlen = word length user is looking for.
        wordlen = raw_input("Please enter the word length you are looking for: ")
        wordlen = int(wordlen)
        print wordlen
        #letterex = letter user wishes to exclude.
        letterex = raw_input("Please enter the letter you'd like to exclude: ")
        letterex = letterex.lower()
        print letterex
Here's what you'd want to do, algorithmically:
Open up your file
Read it line by line, and on each line (assuming each line has one and only one word), check whether that word a) has the appropriate length and b) does not contain the excluded character
What sort of control flow would this suggest you use? Think about it.
I'm not sure if you're confused about how to approach this from a problem-solving standpoint or a Python standpoint, but if you're not sure how to do this specifically in Python, here are some helpful links:
The Input and Output section of the official Python tutorial
The len() function, which can be used to get the length of a string, list, set, etc.
To open the file, use open(). You should also read the Python tutorial sec. 7, file input/output.
Open a file and get each line
Assuming your dictionary.txt has each word on a separate line:
opened_file = open('dictionary.txt')
for line in opened_file:
    print(line) # Put your code here to run it for each word in the dictionary
Word length:
You can check the length of a string using the built-in len() function. See the Python documentation on built-in functions.
len("Bacon, eggs and spam") # returns 20, since the string is 20 characters long
Check if a letter is in a word:
Use the in operator, or str.find() from the Python string methods, which returns -1 when the letter is not found.
Further comments after seeing your code sample:
If you want to print a multi-line prompt, use a triple-quoted string instead of repeated print statements.
What happens if, when asked "how many letters long", your user enters bacon sandwich instead of a number? (Your assignment may not specify that you should gracefully handle incorrect user input, but it never hurts to think about it.)
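One way to handle that case, sketched here in Python 3, is to catch the ValueError that int() raises for non-numeric text. The helper name parse_length and the decision to return None for unusable input are assumptions for illustration, not something the assignment asks for.

```python
def parse_length(text):
    # Returns a positive int, or None if the text is not a usable number.
    try:
        value = int(text.strip())
    except ValueError:        # e.g. "bacon sandwich"
        return None
    return value if value > 0 else None
```

The menu code could then keep asking until parse_length returns something other than None.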
My question is how on earth do I open the file
Use the with statement
with open('dictionary.txt','r') as f:
    for line in f:
        print line
and search for words with a certain length in that file.
First, decide what is the length of the word you want to search.
Then, read each line of the file that has the words.
Check each word for its length.
If it matches the length you are looking for, add it to a list.
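The steps above, together with the excluded-letter check from the question, could be sketched as follows. This is written for Python 3, unlike the Python 2 code in the question; find_words is a hypothetical name, and the sketch assumes one word per line.

```python
def find_words(lines, length, excluded):
    # `lines` can be any iterable of strings, for example an open file.
    matches = []
    for line in lines:
        word = line.strip()   # drop the trailing newline
        if len(word) == length and excluded.lower() not in word.lower():
            matches.append(word)
    return matches


# Possible usage, assuming dictionary.txt sits next to the script:
# with open('dictionary.txt') as f:
#     for word in find_words(f, wordlen, letterex):
#         print(word)
```

Because the function only needs an iterable of lines, it can be tried out on a small list of strings before pointing it at the real dictionary file.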
Exercise problem: "given a word list and a text file, spell check the contents of the text file and print all (unique) words which aren't found in the word list."
I didn't get solutions to the problem so can somebody tell me how I went and what the correct answer should be?:
As a disclaimer none of this parses in my python console...
My attempt:
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
#I'm aware that something is wrong here since I get an error when I use it.....when I just write blablabla.txt it says that it can't find the thing. Is this function only gonna work if I'm working off the online IVLE program where all those files are automatically linked to the console or how would I do things from python without logging into the online IVLE?
for words in data:
for words not in a
print words
wrong = words not in a
right = words in a
print="wrong spelling:" + "properly splled words:" + right
oh yeh...I'm very sure I've indented everything correctly but I don't know how to format my question here so that it doesn't come out as a block like it has. sorry.
What do you think?
There are many things wrong with this code - I'm going to mark some of them below, but I strongly recommend that you read up on Python control flow constructs, comparison operators, and built-in data types.
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
# The filename needs to be a string value - put "C:\..." in quotes!
for words in data:
# data is a string - iterating over it will give you one letter
# per iteration, not one word
for words not in a
# aside from syntax (remember the colons!), remember what for means - it
# executes its body once for every item in a collection. "not in a" is not a
# collection of any kind!
print words
wrong = words not in a
# this does not say what you think it says - "not in" is an operator which
# takes an arbitrary value on the left, and some collection on the right,
# and returns a single boolean value
right = words in a
# same as the previous line
print="wrong spelling:" + "properly splled words:" + right
I don't know what you are trying to iterate over, but why don't you just first iterate over the words of your text file and then, for every one of those words, check whether or not it is in the word list (which is in the variable a, I guess?).
I won't paste code since it seems like homework to me (if so, please add the homework tag).
Btw the first argument to open() should be a string.
It's simple really. Turn both lists into sets then take the difference. Should take like 10 lines of code. You just have to figure out the syntax on your own ;) You aren't going to learn anything by having us write it for you.
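For readers who want to check their own attempt afterwards, the set-difference idea might be sketched like this in Python 3. The function name misspelled, the lower-casing and the rule that a word is a run of letters or apostrophes are all assumptions for illustration; the exercise does not define what counts as a word.

```python
import re


def misspelled(word_list, text):
    known = {w.lower() for w in word_list}
    # Assumption: a "word" is a run of letters or apostrophes.
    found = set(re.findall(r"[a-z']+", text.lower()))
    # Set difference: words in the text that are not in the word list.
    return found - known
```

Because both sides are sets, each unknown word is reported only once, however often it occurs in the text.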