The file is a plain text (".txt") file. I keep separate files for phrases of different lengths (spaces count towards the phrase length). I say "phrases" because an entry can be multiple words, but in the example below I use three-letter entries, all of which are single words. Each phrase is on its own line and is followed by a comma. Imagine you have a file like this:
app,
bar,
car,
eel,
get,
pod,
What I want is to be able to add one or more phrases that will be assumed to only contain lowercase alphabetical letters and/or spaces.
For example, let us say I want to add the phrases in this order:
(cat, bat, car, hat, mom, rat)
Basically, I want to add these phrases to the file without deleting the file, making sure that no phrase is repeated and that the phrases stay alphabetically sorted. Spaces are assumed to come after the letter z when sorting. So after inputting these phrases, the file should look like this:
'
app,
bar,
bat,
car,
cat,
eel,
get,
hat,
mom,
pod,
rat
'
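The "spaces sort after z" rule can be written as a sort key. This is only a minimal sketch: it assumes phrases contain nothing but lowercase letters and spaces, as stated above, and the helper name phrase_key is made up for illustration. It replaces each space with '{', the ASCII character immediately after 'z', before comparing.

```python
def phrase_key(phrase):
    # '{' is the ASCII character right after 'z', so a space compares as "after z".
    return phrase.replace(' ', '{')

phrases = ['a b', 'abc', 'a z', 'azz']
ordered = sorted(phrases, key=phrase_key)
print(ordered)
```

With Python's default string ordering, the phrases containing spaces would come first instead, because a space sorts before every letter.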
Each file is expected to grow to at least a gigabyte of data, so copying the file in order to accomplish this is a no-go. What is the fastest and least memory-consuming way to do this?
I haven't tried anything that 100% works. I know what to do, I just don't know how to do it. Here are the main points that I need to accomplish.
1) Make sure the phrase(s) are created (using input() function)
2) Open the file of organized words (using "with open(filename)" statements)
3) Put each phrase into the "correct" spot in the file. By "correct" I mean that is alphabetical and is not a repeat.
4) Make sure the file doesn't get deleted.
Here is what I have currently (changed it a bit and it is doing MORE of what I want, but not everything):
phrase_to_add = input('Please enter the phrase: ').lower()
with open('/Users/ian/Documents/three_character_phrases.txt') as file:
    unique_phrases = list(file.read().split())
unique_phrases.append(phrase_to_add)
unique_phrases.sort()
list_of_phrases = set()
for phrase in unique_phrases:
    list_of_phrases.add(phrase)
with open('/Users/ian/Documents/three_character_phrases.txt', 'w') as fin:
    for phrase in list_of_phrases:
        fin.write(phrase + '\n')
So I started with BOTH files being empty and I added the word 'cow' by putting it into the input, and this is what the file looked like:
three_character_phrases.txt:
cow
then I inputted the word "bat" and I got this:
bat
cow
then I added the word "bawk" (I know it isn't a 3 letter word but I'll take care of making sure the right words go into the right files)
I got this:
bawk
bat
cow
It looks like you're getting wrapped up in the implementation instead of trying to understand the concept, so let me invite you to take a step back with me.
You have a data structure that resembles a list (since order is relevant) but allows no duplicates.
['act', 'bar', 'dog']
You want to add an entry to that list
['act', 'bar', 'cat', 'dog']
and serialize the whole thing to file afterwards so you can use the same data between multiple sessions.
First up is to establish your serialization method. You've chosen a plain text file, line delimited. There's nothing wrong with that, but if you were looking for alternatives then a csv, a json, or indeed serializing directly to database might be good too. Let's proceed forward under the assumption that you won't change serialization schemas, though.
It's easy to read from file
from pathlib import Path

FILEPATH = Path("/Users/ian/Documents/three_character_phrases.txt")

def read_phrases():
    with FILEPATH.open(mode='r') as f:
        return [line.strip() for line in f]
and it's easy to write to it, too.
# Assume FILEPATH is defined here, and in all future snippets as well.
def write_phrases(phrases):
    with FILEPATH.open(mode='w') as f:
        f.writelines(f'{phrase}\n' for phrase in phrases)
        # this is equivalent to:
        # text = '\n'.join(phrases)
        # f.write(text + '\n')
You've even figured out how to have the user enter a new value (though your algorithm could use work to make the worst case better. Since you're always inserting into a sorted list, the bisect stdlib module can help your performance here for large lists. I'll leave that for a different question though).
Since you've successfully done all the single steps, the only thing holding you back is to put them all together.
phrases = read_phrases()
phrase_to_add = input('Please enter the phrase: ').lower()
if phrase_to_add not in phrases:
    phrases.append(phrase_to_add)
    phrases.sort()  # this is, again, not optimal. Look at bisect!
write_phrases(phrases)
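The bisect suggestion can be sketched as follows. This is a minimal sketch rather than part of the original answer: the helper name add_phrase is hypothetical, and it assumes the list is already sorted in Python's default string order, so it does not implement the "spaces after z" rule from the question.

```python
import bisect

def add_phrase(phrases, new_phrase):
    """Insert new_phrase into the sorted list phrases unless it is already present.

    Returns True if the phrase was inserted, False if it was a duplicate.
    """
    i = bisect.bisect_left(phrases, new_phrase)  # binary search for the insertion point
    if i < len(phrases) and phrases[i] == new_phrase:
        return False  # already in the list, nothing to do
    phrases.insert(i, new_phrase)  # the insert itself still shifts later elements
    return True

phrases = ['app', 'bar', 'car', 'eel', 'get', 'pod']
for p in ['cat', 'bat', 'car', 'hat', 'mom', 'rat']:
    add_phrase(phrases, p)
print(phrases)
```

This replaces both the linear "not in" check and the full sort. The file is still read and rewritten in full, so it only speeds up the in-memory step.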
I've been stuck on this problem:
I am given a list of text files (e.g. 'ZchEuxJ9VJ.txt', 'QAih70niIq.txt') which each contain a randomly generated paragraph. I am to write a parser that will extract the individual words from each of the files listed in the folder. My end goal is to create a dictionary with keys that represent the individual words found in my files. The values are supposed to be a list of files that contain that word. For example, if I print ('happy:', search['happy']), the files that contain the word happy should be added as values to my dictionary. If the word is a "new" word I would have to set it up for the first time. If the word is not a "new" word I would have to update the list associated with that word. I also have to make sure that I don't add the same filename to the same word twice.
I've already created a new dictionary called search, visited each of the files and opened them for reading, then isolated each word using the split() method. The thing I am struggling with is how to "find" a word in a particular file and mark down which file a word can be found in. I've tried "searching" for a word in a file then adding the file to the dictionary, but that gets me an error message.
Instead of searching for words in files, you should be going about it the other way around. You know you need to index every word in every file eventually, so why not just go through every word in every file, in order? In pseudocode it might look something like this.
for each file:
    for each word in file:
        if not word in search:
            search[word] = []
        if not file in search[word]:
            search[word].append(file)
This is homework, so I'm going to help you with the algorithm, not the code. You seem to have figured out most of the problem. The only thing you need help with is how to actually populate the dictionary.
1) Open the file (say fname) and read its contents.
2) Split the contents to separate each word.
3) Iterate over each word. Say we call it fword.
4) Is fword in the dictionary?
    No? Create the key with an empty list as the value.
    Yes? Do nothing and move on.
5) Now you know that fword is a key in the search dictionary, and its value is a list. (Say we call this list fwlist.)
6) You also know that fword was found in the file fname.
7) Check if fname is already in fwlist.
    No? Add fname to fwlist.
    Yes? Don't add it again. Do nothing.
Now, there are optimizations you can make, such as using a set instead of a list. This way you don't need to check if fname already exists in fwlist, because sets automatically discard duplicates, but this should be enough for what you need.
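For reference, the set-based variant might look roughly like this. It is a sketch under assumptions: the function name build_search is made up, words are split on whitespace only (no punctuation handling), and files are read with the platform's default encoding.

```python
from pathlib import Path

def build_search(paths):
    """Map each word to the set of file names that contain it."""
    search = {}
    for path in paths:
        path = Path(path)
        for word in path.read_text().split():
            # setdefault creates the empty set the first time a word is seen;
            # adding to a set silently ignores a file name that is already there.
            search.setdefault(word, set()).add(path.name)
    return search
```

It could be called as build_search(Path('.').glob('*.txt')), after which search['happy'] would hold the names of the files containing that word.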
Remember: before you start writing the program, it's helpful to sit down, think about the problem you're trying to solve, and plan out how you're going to attack the problem. Drawing a flowchart helps immensely when you're a novice programmer because it helps you organize your thoughts and figure out how your program is supposed to flow.
Debugging is also a crucial skill -- your code is useless if you can't fix errors. See "How to debug small programs" and "What is a debugger and how can it help me diagnose problems?"
For the assessment in my computing class I have completed the first two tasks, but I need help understanding what the third one is asking. It states: "Develop a program that builds upon the technique from Task 2 to compress a text file with several sentences, including punctuation. The program should be able to compress a file into a list of words and list of positions to recreate the original file. It should also be able to take a compressed file and recreate the full text, including punctuation and capitalisation, of the original file".
I understand some of this, but I don't really understand what it actually wants me to do. As it says, I have to build on the technique from Task 2, so the description and solution for Task 2 are below (the solution isn't finished because I don't have access to my finished one).
"Develop a program that identifies individual words in a sentence, stores these in a list and replaces each word in the original sentence with the position of that word in the list.
For example, the sentence ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY
contains the words ASK, NOT, WHAT, YOUR, COUNTRY, CAN, DO, FOR, YOU
The sentence can be recreated from the positions of these words in this list using the sequence
1,2,3,4,5,6,7,8,9,1,3,9,6,7,8,4,5
Save the list of words and the positions of these words in the sentence as separate files or as a single file."
And the code for task 2:
restart = 'y'
while (True):
    sentence = input("What is your sentence?: ")
    sentence_split = sentence.split()
    sentence2 = [0]
    print(sentence)
    for count, i in enumerate(sentence_split):
        if sentence_split.count(i) < 2:
            sentence2.append(max(sentence2) + 1)
        else:
            sentence2.append(sentence_split.index(i) +1)
    sentence2.remove(0)
    print(sentence2)
    restart = input("would you like restart the programme y/n?").lower()
    if (restart == "n"):
        print ("programme terminated")
        break
    elif (restart == "y"):
        pass
    else:
        print ("Please enter y or n")
As your solution for the second task shows, you have already compressed one sentence with the technique described in the task.
You should now provide a program that has two functions:
1) Read a file and use your technique to create a list of all the words it contains and a sequence of positions for those words, writing both to a file (or stdout).
2) Read the output created by the first function and reproduce the original file.
Your program might have a command line interface like this, which may make the task clearer:
python task3.py compress /path/to/inputtext.txt /path/to/outputfile
python task3.py extract /path/to/outputfile /path/to/inputtext.txt
This is a very simple way to compress a text file. On top of that, you need to deal with Python's file API. Nice task!
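The two functions could be sketched like this. It is a sketch under assumptions, not the expected solution: the names compress and extract are illustrative, whitespace runs are stored as tokens alongside words and punctuation so that the original text comes back exactly, and JSON is just one possible format for the output file.

```python
import json
import re

# Each token is a word, a run of whitespace, or a single punctuation character,
# so every character of the text belongs to exactly one token.
TOKEN = re.compile(r"\w+|\s+|[^\w\s]")

def compress(text):
    words = []      # unique tokens in order of first appearance
    index = {}      # token -> 1-based position in words
    positions = []  # one entry per token of the original text
    for token in TOKEN.findall(text):
        if token not in index:
            words.append(token)
            index[token] = len(words)
        positions.append(index[token])
    return words, positions

def extract(words, positions):
    # Positions are 1-based, as in the task description.
    return "".join(words[p - 1] for p in positions)

text = "Ask not what your country can do for you. Ask what you can do for your country!"
words, positions = compress(text)
saved = json.dumps({"words": words, "positions": positions})  # what would go in the output file
loaded = json.loads(saved)
restored = extract(loaded["words"], loaded["positions"])
print(restored == text)
```

Because tokens are compared as-is, "Ask" and "ask" would get different positions, which keeps the capitalisation of the original.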
I am doing the same task as you for my GCSE and I was confused as well.
However, task 3 is asking you to alter your code so that when you split your sentence, it is now case sensitive, e.g. hello and Hello must be treated as separate entities, so they must have different numbers when regenerating.
Also, your code must work with multiple sentences rather than just one sentence.
Finally, you must also split the punctuation marks into separate entities.
Use regular expressions to split out the punctuation.
Remove .lower() to make your sentences case sensitive.
Allow the code to treat the "." mark as an entity.
Hope that helped.
I am reading a file, then save that information with readlines(). I then check to see if any of the data from one of my lists is in readlines. The problem I am facing is removing all the information from readlines that isn't in my list, so readlines only contains the information that is in my list, that is if there are any matches. When I say match, I mean if any of the words are found in any order. Could someone please help point me in the right direction? Thank you. I am using python 2.7 and am reading utf-8 files.
Edit: I am reading files and storing their contents with readlines(). I then use my list to check whether the file contains what I am looking for. If it does, I want to remove all the data from the readlines() result except the matches found from my list. I save the matches to a text file. I hope this makes sense. If I am going about this the right way, please let me know.
Edit2: I am reading a file and then using readlines, which stores the data from that file in my readlines() variable. I know it would be helpful to share my code, but I am not allowed to do so.
Edit 3: Pseudo code
alist= ['hamburger','cow','meat']
openit = open.codecs('afile.html','utf-8-sig')
justreadit = openit.readlines()
for alist in justreadit:
    print "found matches"
comment: remove any data that is not a list. When I tried putting in the pound sign as a normal comment, it didn't work.
edit4: I am looking for any of the words in the file in alist. No order, I just need to find the word and save it to a text file.
So let me see if I'm understanding this right.
You have a file that looks something like this:
I am a farmer
Sometimes, I farm chickens
I also have a cow
I like to eat hamburger meat
But not lamb
You want to grab the third and fourth lines out of this, because the third line has "cow", and the fourth line has both "hamburger" and "meat". If this is a correct understanding of your problem, here is code that will achieve that (assuming the above text is saved to afile.html in the current working directory).
word_list = ['hamburger', 'cow', 'meat']
with open('afile.html', encoding='utf-8-sig') as f:
    lines = f.readlines()
for line in lines:
    for word in word_list:
        if word in line:
            print(line)
            break
Result:
I also have a cow
I like to eat hamburger meat
Is this the result you wanted?
Note that there are many ways this could fail. For example, the line I LIKE COW would not be printed, because "COW" is not in the same case as "cow". The line "I like cows" would be printed, because the substring "cow" is found in that line (even though the word "cow" isn't). Because the specification in your question is unclear about these things, I have not tried to guess at which of these behaviors you do or do not want.
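If whole-word, case-insensitive matching is what you want, one possible variant uses a regular expression with word boundaries. This is a sketch under that assumption, not a claim about the intended behavior; the function name matching_lines and the sample lines are made up for illustration.

```python
import re

word_list = ['hamburger', 'cow', 'meat']
# \b anchors each alternative at word boundaries, so "cows" does not match "cow";
# re.escape guards against words that contain regex metacharacters.
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, word_list)) + r')\b',
    re.IGNORECASE,
)

def matching_lines(lines):
    """Return the lines that contain at least one word from word_list."""
    return [line for line in lines if pattern.search(line)]

sample = ['I am a farmer', 'I LIKE COW', 'I like cows', 'I like to eat hamburger meat']
print(matching_lines(sample))
```

Here "I LIKE COW" is kept and "I like cows" is dropped, the opposite of what the plain substring test does.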
I'm pretty new at this, but I think file.readlines() returns a list, with each list entry being a line from the target file. To only return matches, I would:
def find_matches(openit, alist):
    justreadit = openit.readlines()
    matchlist = []
    for i in justreadit:
        for h in alist:
            if h in i:  # i is a whole line, so test for containment, not equality
                matchlist.append(i)
                break
    return matchlist
I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()
filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so I have spaces and it is not a list. I can match things such as a single letter, so if I just have 't' then I get a bunch of t's. So this tells me that I am matching single letters. What I want, however, is to match whole words and, even more, to preserve the wildcard structure.
What I would like is for a user to enter text (including a wildcard) that I can substitute in place of 'th*', with the wildcard still behaving as it should. That leads to the question: can I just put a variable holding the search text in place of 'th*'? After some investigation I am wondering if I am somehow supposed to translate the 'th*', and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way in going about "passing in regex formulas" as well as perhaps an idea of what I have wrong in the code as it is not operating on the string of incoming text in the second set of code as it does (correctly) in the first.
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split()  # split on whitespace to make a list of words
filtered = fnmatch.filter(file_contents, 'th*')
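To answer the other part of the question: the pattern can simply be a variable, and fnmatch.filter handles the wildcard translation itself, so calling fnmatch.translate by hand is not required. A minimal sketch follows; the function name wildcard_search is made up, and it assumes words consist of letters and apostrophes only, so punctuation is dropped.

```python
import fnmatch
import re

def wildcard_search(text, pattern):
    """Return the words in text that match a shell-style wildcard pattern."""
    # Pull out runs of letters and apostrophes so that punctuation and
    # newlines do not stick to the words.
    words = re.findall(r"[a-z']+", text.lower())
    return fnmatch.filter(words, pattern)

text = "This is just a test thing.\nThe other one"
print(wildcard_search(text, 'th*'))
```

The pattern argument could come straight from input(). Matches are returned in order of appearance, with repeats included; wrap the result in set() if only unique words are wanted.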
Exercise problem: "given a word list and a text file, spell check the
contents of the text file and print all (unique) words which aren't
found in the word list."
I didn't get solutions to the problem so can somebody tell me how I went and what the correct answer should be?:
As a disclaimer none of this parses in my python console...
My attempt:
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
#I'm aware that something is wrong here since I get an error when I use it.....when I just write blablabla.txt it says that it can't find the thing. Is this function only gonna work if I'm working off the online IVLE program where all those files are automatically linked to the console or how would I do things from python without logging into the online IVLE?
for words in data:
for words not in a
print words
wrong = words not in a
right = words in a
print="wrong spelling:" + "properly splled words:" + right
oh yeh...I'm very sure I've indented everything correctly but I don't know how to format my question here so that it doesn't come out as a block like it has. sorry.
What do you think?
There are many things wrong with this code - I'm going to mark some of them below, but I strongly recommend that you read up on Python control flow constructs, comparison operators, and built-in data types.
a=list[....,.....,....,whatever goes here,...]
data = open(C:\Documents and Settings\bhaa\Desktop\blablabla.txt).read()
# The filename needs to be a string value - put "C:\..." in quotes!
for words in data:
# data is a string - iterating over it will give you one letter
# per iteration, not one word
for words not in a
# aside from syntax (remember the colons!), remember what for means - it
# executes its body once for every item in a collection. "not in a" is not a
# collection of any kind!
print words
wrong = words not in a
# this does not say what you think it says - "not in" is an operator which
# takes an arbitrary value on the left, and some collection on the right,
# and returns a single boolean value
right = words in a
# same as the previous line
print="wrong spelling:" + "properly splled words:" + right
I don't know what you are trying to iterate over, but why don't you just first iterate over your words (which are in the variable a I guess?) and then for every word in a you iterate over the wordlist and check whether or not that word is in the wordslist.
I won't paste code since it seems like homework to me (if so, please add the homework tag).
Btw the first argument to open() should be a string.
It's simple really. Turn both lists into sets then take the difference. Should take like 10 lines of code. You just have to figure out the syntax on your own ;) You aren't going to learn anything by having us write it for you.
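For readers who want to check their own attempt afterwards, the set-difference idea might look roughly like this. It is a sketch under assumptions: the function name misspelled is made up, the text is split on whitespace only, and case and punctuation are not handled.

```python
def misspelled(text, word_list):
    """Return the unique words in text that are not in word_list, sorted."""
    known = set(word_list)
    # Set difference keeps each unknown word once, however often it appears.
    return sorted(set(text.split()) - known)

word_list = ["the", "quick", "brown", "fox", "jumps", "over", "dog"]
print(misspelled("the quick brwn fox jumps ovr the dog", word_list))
```

In a real run, text would come from open(filename).read(), with the filename passed as a string.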