Get list of words from text file

Get list of words from text file - python

In my code on line I have no idea why it is wrong I've tried a gazillion different ways but they don't work. I want it to print out:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
romeo.txt is the text document name
this is whats inside:
"But soft what light through yonder window breaks It is the east and
Juliet is the sun Arise fair sun and kill the envious moon Who is
already sick and pale with grief "
Also the output is in alphabetic order.
fname = "romeo.txt"#raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
lst.append(line)
words = lst.split(line)
# line = line.sort()
print lst

fname = "romeo.txt"
fh = open(fname)
lst = []
for line in fh:
words = lst.split(line) # this comes first
lst.extend(words) # add all the words to the current list
lst = sorted(lst) # sorts lexicographically
print lst
Comments in code. Basically, split up your line and accumulate it in your list. Sorting should be done at the end, once.
A (slightly) more pythonic solution:
import re
lst = sorted(re.split('[\s]+', open("romeo.txt").read(), flags=re.M))
Regex will split your text into a list of words based on the regexp (delimiters as whitespaces). Everything else is basically multiple lines condensed into 1.

Related

How to extract specific words from a string?

I have to extract two things from a string: A list that contains stop-words, and another list that contains the rest of the string.
text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = []
normal_words = []
for i in text.split():
for j in stopwords:
if i in j:
contains_stopwords.append(i)
else:
normal_words.append(i)
if text.split() in stopwords:
contains_stopwords.append(text.split())
else:
normal_words.append(text.split())
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)
Output:
contains_stopwords: ['he', 'he', 'the', 'our']
normal_words: ['he', 'is', 'is', 'is', 'the', 'the', 'best', 'best', 'best', 'when', 'when', 'when', 'people', 'people', 'people', 'in', 'in', 'in', 'our', 'our', 'life', 'life', 'life', ['he', 'is', 'the', 'best', 'when', 'people', 'in', 'our', 'life']]
Desired result:
contains_stopwords: ['he', 'the', 'our']
normal_words: ['is', 'best', 'when', 'people', 'in', 'life']

One answer could be:
text = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
contains_stopwords = set() # The set data structure guarantees there won't be any duplicate
normal_words = []
for word in text.split():
if word in stopwords:
contains_stopwords.add(word)
else:
normal_words.append(word)
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)

you seem to have chosen the most difficult path. The code under should do the trick.
for word in text.split():
if word in stopwords:
contains_stopwords.append(word)
else:
normal_words.append(word)
First, we separate the text into a list of words using split, then we iterate and check if that word is in the list of stopwords (yeah, python allows you to do this). If it is, we just append it to the list of stopwords, if not, we append it to the other list.

Use the list comprehention and eliminate the duplicates by creating a dictionary with keys as list values and converting it again to a list:
itext = 'he is the best when people in our life'
stopwords = ['he', 'the', 'our']
split_words = itext.split(' ')
contains_stopwords = list(dict.fromkeys([word for word in split_words if word in stopwords]))
normal_words = list(dict.fromkeys([word for word in split_words if word not in stopwords]))
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)

Some list comprehension could work and then use set() to remove duplicates from the list. I reconverted the set datastructure to a list as per your question, but you can leave it as a set:
text = 'he is the best when people in our life he he he'
stopwords = ['he', 'the', 'our']
list1 = {item for item in text.split(" ") if item in stopwords}
list2 = [item for item in text.split(" ") if item not in list1]
Output:
list1 - ['he', 'the', 'our']
list2 - ['is', 'best', 'when', 'people', 'in', 'life']

text = 'he is the best when people in our life'
# I will suggest make `stopwords` a set
# cuz the membership operator(ie. in) will take O(1)
stopwords = set(['he', 'the', 'our'])
contains_stopwords = []
normal_words = []
for word in text.split():
if word in stopwords: # here checking membership
contains_stopwords.append(word)
else:
normal_words.append(word)
print("contains_stopwords:", contains_stopwords)
print("normal_words:", normal_words)

python: how to sort a string alphabetically and by len

shakespeare = ‘All the world is a stage, and all the men and women merely players. They have their exits and their entrances, And one man in his time plays many parts.’
Create a function that returns a string with all the words of the sentence shakespeare ordered alphabetically. Eliminate punctuation marks.
(Tip: the three first words should be ‘ a all all’, this time duplicates are allowed and remember that there are words in mayus)
def sort_string(shakespeare):
return string_sorted

Here you get a one-liner
import re
shakespeare = "All the world is a stage, and all the men and women merely players. They have their exits and their entrances, And one man in his time plays many parts."
print (sorted(re.sub(r"[^\w\s]","",shakespeare.lower()).split(), key=lambda x: (x,-len(x))))
Output:
['a', 'all', 'all', 'and', 'and', 'and', 'and', 'entrances', 'exits', 'have', 'his', 'in', 'is', 'man', 'many', 'men', 'merely', 'one', 'parts', 'players', 'plays', 'stage', 'the', 'the', 'their', 'their', 'they', 'time', 'women', 'world']
The corresponding function:
def sort_string(shakespeare)
return sorted(re.sub(r"[^\w\s]","",shakespeare.lower()).split(), key=lambda x: (x,-len(x)))
In case you want a string to be returned:
def sort_string(shakespeare)
return " ".join(sorted(re.sub(r"[^\w\s]","",shakespeare.lower()).split(), key=lambda x: (x,-len(x))))

Confusion with split function in Python

I am trying to alphabetically sort the words from a file. However, the program sorts the lines, not the words, according to their first words. Here it is.
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
lst2 = line.strip()
words = lst2.split()
lst.append(words)
lst.sort()
print lst
Here is my input file
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
And this is what I'm hoping to get
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']

lst.append(words) append a list at the end of lst, it does not concatenates lst and words. You need to use lst.extend(words) or lst += words.
Also, you should not sort the list at each iteration but only at the end of your loop:
lst = []
for line in fh:
lst2 = line.strip()
words = lst2.split()
lst.extend(words)
lst.sort()
print lst
If you don't want repeated word, use a set:
st = set()
for line in fh:
lst2 = line.strip()
words = lst2.split()
st.update(words)
lst = list(st)
lst.sort()
print lst

lst.append(words) is adding the list as a member to the outer list. For instance:
lst = []
lst.append(['another','list'])
lst ## [['another','list']]
So you're getting a nested list. Use .extend(...) instead:
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
lst2 = line.strip()
words = lst2.split()
lst.extend(words)
lst.sort()
print lst

line.split() returns a list of strings. Now you want to join those words with the list of strings you've already accumulated with the previous lines. When you call lst.append(words) you're just adding the list of words to your list, so you end up with a list of lists. What you probably want is extend() which simply adds all the elements of one list to the other.
So instead of doing lst.append(words), you would want lst.extend(words).

The problem is that words is an array of your words from the split. When you append words to lst, you are making a list of arrays, and sorting it will only sort that list.
You want to do something like:
for x in words:
lst.append(x)
lst.sort()
I believe
Edit: I have implemented your text file, this following code works for me:
inp=open('test.txt','r')
lst=list()
for line in inp:
tokens=line.split('\n')[0].split() #This is to split away new line characters but shouldnt impact
for x in tokens:
lst.append(x)
lst.sort()
lst

How to look into a list and varify the existing of elements inside the list?

I'm new to Python and I'm trying to write a piece of code which has accomplishes task:
I need to open the file romeo.txt and read it line by line.
For each line, split the line into a list of words using the split() function. * * Build a list of words as follows:
For each word on each line check to see if the word is already in the list
If not append it to the list.
When the program completes, sort and print the resulting words in alphabetical order.
You can download the sample data at http://www.pythonlearn.com/code/romeo.txt
This is what I have so far:
fname = raw_input("Enter file name: ")
if len(fname) == 0:
fname = open('romeo.txt')
newList = []
for line in fname:
words = line.rstrip().split()
print words
I know that I need to use another for loop to check for any missing words and finally I need to sort them out by using the sort() function. The Python interpreter is giving me an error saying that I have to use append() to add the missing words if they don't exist.
I have managed to build the following list with my code:
['But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks'] ← Mismatch
['It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun']
['Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon']
['Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief']
but the output should come look like this:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks','east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick','soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
How can I adjust my code to produce that output?
Important Note:
To everyone wants to help, Please make sure that you go from my code to finish this tast as it's an assignment and we have to follow the level of the course. Thanks
That is my updates for the code :
fname = raw_input("Enter file name: ")
if len(fname) == 0:
fname = open('romeo.txt')
newList = list()
for line in fname:
words = line.rstrip().split()
for i in words:
newList.append(i)
newList.sort()
print newList
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
But I'm getting duplication! Why is that and how to avoide that?

fname = raw_input("Enter file name: ")
fh = open(frame)
lst = list()
for line in fh
for i in line.split():
lst.append(i)
lst.sort()
print list(set(lst))
The above code worked for me.

fname = input("Enter file name: ") #Ask the user for the filename.
fh = open(fname) #Open the file.
lst = list() #Create a list called lst.
for line in fh: #For each line in the fh?
words = line.split() . #Separate the words in the line.
for word in words : #For each word in words.
if word not in lst : #If word not in the lst.
lst.append(word) #Add the word.
elif word in lst : #Continue looping.
continue
lst.sort()
print(lst) #Print the lst.

I struggled with this question for quite a long time while i was doing a Online Python course in Coursera. But i managed to do it without too many nested loops or for loops. Hope this helps.
file = input('Enter File Name: ')
try:
file = open(file)
except:
print('File Not Found')
quit()
F = file.read()
F = F.rstrip().split()
L = list()
for a in F:
if a in L:
continue
else:
L.append(a)
print(sorted(L))

You want to gather all of the words into a single list. Or, uh, a set, because sets enforce uniqueness and you don't care about order anyways.
fname = raw_input("Enter file name: ")
if len(fname) == 0: fname = 'romeo.txt')
with open(fname, 'r') as f: # Context manager
words = set()
for line in f: words.update(line.rstrip().split())
#Now for the sorting
print sorted(words, key = str.lower)
I'm using key = str.lower because I assume you want to sort by human alphabetical and not by computer alphabetical. If you want computer alphabetical, get rid of that argument.
Now, if you actually want to use a list, although it's O(n) for this application...
words = []
with open(filename, "r") as f:
for word in line.rstrip().split():
if word not in words:
words.append(word)

The 'Pythonic' way is to use a set to make a list of unique words and to interate over the file line-by-line:
with open(fn) as f: # open your file and auto close it
uniq=set() # a set only has one entry of each
for line in f: # file line by line
for word in line.split(): # line word by word
uniq.add(word) # uniqueify by adding to a set
print sorted(uniq) # print that sorted
Which you can make terse Pythonic by having a set comprehension that flattens the list of lists produced by 1) a list of lines 2) the lines from the file:
with open(fn) as f:
uniq={w for line in f for w in line.split()}

8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.
You can download the sample data at http://www.pythonlearn.com/code/romeo.txt
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
line=line.rstrip()
a=line.split()
i=0
for z in a:
if z not in lst:
lst.append(z)
else:
continue
lst.sort()
print lst

fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
word=line.rstrip().split()
for i in word:
if i in lst:
continue
else:
lst.append(i)
lst.sort()
print lst

I am doing the EXACT SAME Python Online course in Coursera - "Python for everybody" - and It took me 3 days to complete this assignment and come up with the following piece of code. A short recommendation just if you care
1) Try writing the code exclusively without ANY hint or help - try at least 10 hours
2) Leave the questions as "Last Resort"
When you don't give up and write the code independently the reward is IMMENSE.
For the following code I used EXCLUSIVELY the materials covered in week 4 for the course
fname = input("Enter file name: ")
fh = open("romeo.txt")
newlist = list ()
for line in fh:
words = line.split()
for word in words:
if word not in newlist :
newlist.append (word)
elif word in newlist :
continue
newlist.sort ()
print (newlist)

fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
line=line.rstrip()
line=line.split()
for i in line:
if i in lst:
continue
else:
lst.append(i)
lst.sort()
print (lst)

fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
for i in line.split():
if i not in lst:
lst.append(i)
lst.sort()
print(lst)

Is there a better way to get just 'important words' from a list in python?

I wrote some code to find the most popular words in submission titles on reddit, using the reddit praw api.
import nltk
import praw
picksub = raw_input('\nWhich subreddit do you want to analyze? r/')
many = input('\nHow many of the top words would you like to see? \n\t> ')
print 'Getting the top %d most common words from r/%s:' % (many,picksub)
r = praw.Reddit(user_agent='get the most common words from chosen subreddit')
submissions = r.get_subreddit(picksub).get_top_from_all(limit=200)
hey = []
for x in submissions:
hey.extend(str(x).split(' '))
fdist = nltk.FreqDist(hey) # creates a frequency distribution for words in 'hey'
top_words = fdist.keys()
common_words = ['its','am', 'ago','took', 'got', 'will', 'been', 'get', 'such','your','don\'t', 'if', 'why', 'do', 'does', 'or', 'any', 'but', 'they', 'all', 'now','than','into','can', 'i\'m','not','so','just', 'out','about','have','when', 'would' ,'where', 'what', 'who' 'I\'m','says' 'not', '', 'over', '_', '-','after', 'an','for', 'who', 'by', 'from', 'it', 'how', 'you', 'about' 'for', 'on', 'as', 'be', 'has', 'that', 'was', 'there', 'with','what', 'we', '::', 'to', 'the', 'of', ':', '...', 'a', 'at', 'is', 'my', 'in' , 'i', 'this', 'and', 'are', 'he', 'she', 'is', 'his', 'hers']
already = []
counter = 0
number = 1
print '-----------------------'
for word in top_words:
if word.lower() not in common_words and word.lower() not in already:
print str(number) + ". '" + word + "'"
counter +=1
number +=1
already.append(word.lower())
if counter == many:
break
print '-----------------------\n'
so inputting subreddit 'python' and getting 10 posts returns:
'Python'
'PyPy'
'code'
'use'
'136'
'181'
'd...'
'IPython'
'133'
10. '158'
How can I make this script not return numbers, and error words like 'd...'? The first 4 results are acceptable, but I would like to replace this rest with words that make sense. Making a list common_words is unreasonable, and doesn't filter these errors. I'm relatively new to writing code, and I appreciate the help.

I disagree. Making a list of common words is correct, there is no easier way to filter out the, for, I, am, etc.. However, it is unreasonable to use the common_words list to filter out results that aren't words, because then you'd have to include every possible non-word you don't want. Non-words should be filtered out differently.
Some suggestions:
1) common_words should be a set(), since your list is long this should speed things up. The in operation for sets in O(1), while for lists it is O(n).
2) Getting rid of all number strings is trivial. One way you could do it is:
all([w.isdigit() for w in word])
Where if this returns True, then the word is just a series of numbers.
3) Getting rid of the d... is a little more tricky. It depends on how you define a non-word. This:
tf = [ c.isalpha() for c in word ]
Returns a list of True/False values (where it is False if the char was not a letter). You can then count the values like:
t = tf.count(True)
f = tf.count(False)
You could then define a non-word as one that has more non-letter chars in it than letters, as one that has any non-letter characters at all, etc. For example:
def check_wordiness(word):
# This returns true only if a word is all letters
return all([ c.isalpha() for c in word ])
4) In the for word in top_words: block, are you sure that you have not mixed up counter and number? Also, counter and number are pretty much redundant, you could rewrite the last bit as:
for word in top_words:
# Since you are calling .lower() so much,
# you probably want to define it up here
w = word.lower()
if w not in common_words and w not in already:
# String formatting is preferred over +'s
print "%i. '%s'" % (number, word)
number +=1
# This could go under the if statement. You only want to add
# words that could be added again. Why add words that are being
# filtered out anyways?
already.append(w)
# this wasn't indented correctly before
if number == many:
break
Hope that helps.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Get list of words from text file - python

Related

How to extract specific words from a string?

python: how to sort a string alphabetically and by len

Confusion with split function in Python

How to look into a list and varify the existing of elements inside the list?

Is there a better way to get just 'important words' from a list in python?

Categories

Resources