Confusion with split function in Python - python

I am trying to alphabetically sort the words from a file. However, the program sorts the lines, not the words, according to their first words. Here it is.
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    lst2 = line.strip()
    words = lst2.split()
    lst.append(words)
    lst.sort()
print lst
Here is my input file
But soft what light through yonder window breaks
It is the east and Juliet is the sun
Arise fair sun and kill the envious moon
Who is already sick and pale with grief
And this is what I'm hoping to get
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']

lst.append(words) appends the whole list as a single element at the end of lst; it does not concatenate lst and words. You need to use lst.extend(words) or lst += words.
Also, you should not sort the list at each iteration, but only once at the end of your loop:
lst = []
for line in fh:
    lst2 = line.strip()
    words = lst2.split()
    lst.extend(words)
lst.sort()
print lst
If you don't want repeated words, use a set:
st = set()
for line in fh:
    lst2 = line.strip()
    words = lst2.split()
    st.update(words)
lst = list(st)
lst.sort()
print lst

lst.append(words) is adding the list as a member to the outer list. For instance:
lst = []
lst.append(['another','list'])
lst ## [['another','list']]
So you're getting a nested list. Use .extend(...) instead:
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    lst2 = line.strip()
    words = lst2.split()
    lst.extend(words)
lst.sort()
print lst

line.split() returns a list of strings. Now you want to join those words with the list of strings you've already accumulated with the previous lines. When you call lst.append(words) you're just adding the list of words to your list, so you end up with a list of lists. What you probably want is extend() which simply adds all the elements of one list to the other.
So instead of doing lst.append(words), you would want lst.extend(words).
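To see the difference concretely, here is a small sketch in Python 3 syntax. The two sample lines are shortened from the question's input file, and the names `nested` and `flat` are made up for illustration:

```python
lines = ["But soft what light", "It is the east"]

nested = []
flat = []
for line in lines:
    words = line.split()
    nested.append(words)  # adds the whole list as a single element
    flat.extend(words)    # adds each word individually

print(nested)  # [['But', 'soft', 'what', 'light'], ['It', 'is', 'the', 'east']]
print(flat)    # ['But', 'soft', 'what', 'light', 'It', 'is', 'the', 'east']
```

Sorting `nested` compares the inner lists element by element, which is why the original program appeared to sort lines by their first word.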

The problem is that words is a list of the words from the split. When you append words to lst, you are making a list of lists, and sorting it only sorts that outer list.
You want to do something like:
for x in words:
    lst.append(x)
lst.sort()
I believe
Edit: I tried it with your text file, and the following code works for me:
inp=open('test.txt','r')
lst=list()
for line in inp:
    tokens=line.split('\n')[0].split() #This is to split away newline characters, but it shouldn't have an impact
    for x in tokens:
        lst.append(x)
lst.sort()
lst

Related

How to create a list of words as long as the word is not a word within a list of tuples

I have some words:
wordlist = ['change', 'my', 'diaper', 'please']
I also have a list of tuples that I need to check against:
mylist = [('verb', 'change'), ('prep', 'my')]
What I want to do is create a list out of all the words that are not in the list of tuples.
So the result of this example would be ['diaper', 'please']
What I tried seems to create duplicates:
[word for tuple in mylist for word in wordlist if word not in tuple]
How do I generate a list of the words not in the tuple-list, and do it as efficiently as possible?
No use of sets.
Edit: chose answer based on following restriction of set
Here is a oneliner using list comprehension
[word for word in wordlist if word not in [ w[1] for w in mylist ]]
The inner list, [ w[1] for w in mylist ] extracts the second element from the tuple list.
The outer list, [word for word in wordlist if word not in innerlist] extracts the words filtering out the ones in the just extracted list.
P.S. I assumed you wanted to filter only the second element of the tuple list.
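One thing worth noting about the one-liner above: the inner list is rebuilt for every word in wordlist. A possible variant, still without sets, builds it once before filtering. The name `taken` is made up for this sketch:

```python
wordlist = ['change', 'my', 'diaper', 'please']
mylist = [('verb', 'change'), ('prep', 'my')]

# Build the list of second tuple elements once, outside the filter.
taken = [pair[1] for pair in mylist]

# Keep only the words that do not appear in that list.
result = [word for word in wordlist if word not in taken]
print(result)  # ['diaper', 'please']
```

Each membership test is still a linear scan of `taken`, so this only removes the repeated construction, not the cost of the lookup itself.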
Make a set of known words from your tuples list:
myList = [('verb', 'change'), ('prep', 'my')]
known_words = set(tup[1] for tup in myList)
then use it as you did before:
wordlist = ['change', 'my', 'diaper', 'please']
out = [word for word in wordlist if word not in known_words]
print(out)
# ['diaper', 'please']
Checking if an item exists in a set is O(1) on average, while checking in a list or tuple is O(length of the list), so it is really worth using sets in such cases.
Also, if you don't care about the order of the words and want to remove duplicates, you could do:
unique_new_words = set(wordlist) - known_words
print(unique_new_words)
# {'diaper', 'please'}
This is a version where I flatten your tuples (using itertools.chain) into a set and compare against that set (using a set speeds up the lookup for the in operator):
from itertools import chain
wordlist = ['change', 'my', 'diaper', 'please']
mylist = [('verb', 'change'), ('prep', 'my')]
veto = set(chain(*mylist)) # {'prep', 'change', 'verb', 'my'}
print([word for word in wordlist if word not in veto])
# ['diaper', 'please']
I have made an assumption that tuple[1] would have only one element; if not, that would need a small change.
[word for word in wordlist if word not in [tuple[1] for tuple in mylist]]

Get list of words from text file

I have no idea why my code is wrong. I've tried a gazillion different ways, but they don't work. I want it to print out:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
romeo.txt is the text document name
this is whats inside:
"But soft what light through yonder window breaks It is the east and
Juliet is the sun Arise fair sun and kill the envious moon Who is
already sick and pale with grief "
Also the output is in alphabetic order.
fname = "romeo.txt"#raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    lst.append(line)
    words = lst.split(line)
#    line = line.sort()
print lst
fname = "romeo.txt"
fh = open(fname)
lst = []
for line in fh:
    words = line.split()  # this comes first
    lst.extend(words)     # add all the words to the current list
lst = sorted(lst)  # sorts lexicographically
print lst
Comments in code. Basically, split up your line and accumulate it in your list. Sorting should be done at the end, once.
A (slightly) more pythonic solution:
import re
lst = sorted(re.split(r'\s+', open("romeo.txt").read().strip()))
The regex splits the text into a list of words, treating any run of whitespace as the delimiter. Everything else is basically multiple lines condensed into one.
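For comparison, plain str.split() with no arguments already splits on any run of whitespace, including newlines, and ignores leading and trailing whitespace. Here is a sketch in Python 3 syntax, where the `text` string is a stand-in for the contents of romeo.txt:

```python
import re

# Stand-in for open("romeo.txt").read(); note the trailing newline.
text = "But soft what light\nIt is the east\n"

by_split = text.split()
by_regex = re.split(r'\s+', text.strip())

print(by_split == by_regex)  # True
print(sorted(by_split))
```

Without the strip(), re.split would leave an empty string at the end of the list for text that ends in a newline, which would then sort to the front.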

How can I find duplicate words in a text file?

file_str = input("Enter poem: ")
my_file = open(file_str, "r")
words = file_str.split(',' or ';')
I have a file on my computer that contains a really long poem, and I want to see if there are any words that are duplicated per line (hence it being split by punctuation).
I have that much, and I don't want to use a module or Counter, I would prefer to use loops. Any ideas?
You can use sets to track seen items and duplicates:
>>> words = 'the fox jumped over the lazy dog and over the bear'.split()
>>> seen = set()
>>> dups = set()
>>> for word in words:
...     if word in seen:
...         if word not in dups:
...             print(word)
...             dups.add(word)
...     else:
...         seen.add(word)
...
the
over
with open (r"specify the path of the file") as f:
    words = f.read().split()
# collect every word that occurs more than once
if set([w for w in words if words.count(w) > 1]):
    print "Duplicates found"
else:
    print "None"
SOLVED !!!
I can give the explanation with working program
file content of sam.txt
sam.txt
Hello this is star hello the data are Hello so you can move to the
hello
file_content = []
resultant_list = []
repeated_element_list = []
with open(file="sam.txt", mode="r") as file_obj:
    file_content = file_obj.readlines()
print("\n debug the file content ",file_content)
for line in file_content:
    temp = line.strip('\n').split(" ")  # This will strip('\n') and split the line with spaces and stored as list
    for _ in temp:
        resultant_list.append(_)
print("\n debug resultant_list",resultant_list)
#Now this is the main for loop to check the string with the adjacent string
for ii in range(0, len(resultant_list)):
    # is_repeated will check the element count is greater than 1. If so it will proceed with identifying duplicate logic
    is_repeated = resultant_list.count(resultant_list[ii])
    if is_repeated > 1:
        if ii not in repeated_element_list:
            for2count = ii + 1
            #This for loop for shifting the iterator to the adjacent string
            for jj in range(for2count, len(resultant_list)):
                if resultant_list[ii] == resultant_list[jj]:
                    repeated_element_list.append(resultant_list[ii])
print("The repeated strings are {}\n and total counts {}".format(repeated_element_list, len(repeated_element_list)))
Output:
debug the file content ['Hello this is abdul hello\n', 'the data are Hello so you can move to the hello']
debug resultant_list ['Hello', 'this', 'is', 'abdul', 'hello', 'the', 'data', 'are', 'Hello', 'so', 'you', 'can', 'move', 'to', 'the', 'hello']
The repeated strings are ['Hello', 'hello', 'the']
and total counts 3
Thanks
def Counter(text):
    d = {}
    for word in text.split():
        d[word] = d.get(word,0) + 1
    return d
It does use loops :/
To split on punctuation, just use
import re
matches = re.split("[!.?]",my_corpus)
for match in matches:
    print Counter(match)
For a file like this:
A hearth came to us from your hearth
foreign hairs with hearth are same are hairs
This will check the whole poem:
lst = []
with open ("coz.txt") as f:
    for line in f:
        for word in line.split(): #split on whitespace
            if word not in lst:
                lst.append(word)
            else:
                print (word)
Output:
>>>
hearth
hearth
are
hairs
>>>
As you can see, hearth is printed twice here, because it appears 3 times in the whole poem.
To collect the duplicates while reading line by line:
lst = []
lst2 = []
with open ("coz.txt") as f:
    for line in f:
        for word in line.split():
            if word not in lst:
                lst.append(word)   # first time this word is seen
            else:
                lst2.append(word)  # seen before, so it is a duplicate
print (set(lst2))
>>>
{'hearth', 'are', 'hairs'}
>>>
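Since the question asks about duplicates per line, here is a rough sketch of that variant using only loops and lists, in Python 3 syntax. It assumes words are separated by whitespace rather than punctuation, and it uses the two sample lines above in place of coz.txt. `find_line_duplicates` is a made-up helper name:

```python
def find_line_duplicates(lines):
    """Return one list of repeated words for each line."""
    result = []
    for line in lines:
        seen = []  # words already met on this line
        dups = []  # words met more than once on this line
        for word in line.split():
            if word in seen:
                if word not in dups:
                    dups.append(word)
            else:
                seen.append(word)
        result.append(dups)
    return result

poem = [
    "A hearth came to us from your hearth",
    "foreign hairs with hearth are same are hairs",
]
print(find_line_duplicates(poem))  # [['hearth'], ['are', 'hairs']]
```

Because `seen` is reset for every line, hearth on the second line is not reported: it occurs only once there.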

How to look into a list and verify the existence of elements inside the list?

I'm new to Python and I'm trying to write a piece of code that accomplishes the following task:
I need to open the file romeo.txt and read it line by line.
For each line, split the line into a list of words using the split() function. Build a list of words as follows:
For each word on each line check to see if the word is already in the list
If not append it to the list.
When the program completes, sort and print the resulting words in alphabetical order.
You can download the sample data at http://www.pythonlearn.com/code/romeo.txt
This is what I have so far:
fname = raw_input("Enter file name: ")
if len(fname) == 0:
    fname = open('romeo.txt')
newList = []
for line in fname:
    words = line.rstrip().split()
    print words
I know that I need to use another for loop to check for any missing words and finally I need to sort them out by using the sort() function. The Python interpreter is giving me an error saying that I have to use append() to add the missing words if they don't exist.
I have managed to build the following list with my code:
['But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks']
['It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun']
['Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon']
['Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief']
but the output should look like this:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks','east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick','soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
How can I adjust my code to produce that output?
Important Note:
To everyone who wants to help: please make sure that you start from my code to finish this task, as it's an assignment and we have to follow the level of the course. Thanks
Here is my updated code:
fname = raw_input("Enter file name: ")
if len(fname) == 0:
    fname = open('romeo.txt')
newList = list()
for line in fname:
    words = line.rstrip().split()
    for i in words:
        newList.append(i)
newList.sort()
print newList
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
But I'm getting duplicates! Why is that, and how can I avoid it?
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    for i in line.split():
        lst.append(i)
# a set drops the duplicates; sort afterwards, because a set has no order
print sorted(set(lst))
The above code worked for me.
fname = input("Enter file name: ")  #Ask the user for the filename.
fh = open(fname)                    #Open the file.
lst = list()                        #Create a list called lst.
for line in fh:                     #For each line in fh.
    words = line.split()            #Separate the words in the line.
    for word in words:              #For each word in words.
        if word not in lst:         #If word not in the lst.
            lst.append(word)        #Add the word.
        elif word in lst:           #Continue looping.
            continue
lst.sort()
print(lst)                          #Print the lst.
I struggled with this question for quite a long time while I was doing an online Python course on Coursera. But I managed to do it without too many nested loops or for loops. Hope this helps.
file = input('Enter File Name: ')
try:
    file = open(file)
except:
    print('File Not Found')
    quit()
F = file.read()
F = F.rstrip().split()
L = list()
for a in F:
    if a in L:
        continue
    else:
        L.append(a)
print(sorted(L))
You want to gather all of the words into a single list. Or, uh, a set, because sets enforce uniqueness and you don't care about order anyways.
fname = raw_input("Enter file name: ")
if len(fname) == 0: fname = 'romeo.txt'
with open(fname, 'r') as f: # Context manager
    words = set()
    for line in f: words.update(line.rstrip().split())
#Now for the sorting
print sorted(words, key = str.lower)
I'm using key = str.lower because I assume you want to sort by human alphabetical and not by computer alphabetical. If you want computer alphabetical, get rid of that argument.
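A quick sketch of the difference between the two orderings, in Python 3 syntax, using a handful of words from the sample file:

```python
words = {'Arise', 'But', 'and', 'already', 'Who'}

# Default ordering compares code points, so capitals sort before lowercase.
print(sorted(words))                 # ['Arise', 'But', 'Who', 'already', 'and']

# key=str.lower compares the lowercased form, ignoring case.
print(sorted(words, key=str.lower))  # ['already', 'and', 'Arise', 'But', 'Who']
```

Note that the expected output shown in the question lists the capitalized words first, which matches the default ordering rather than the case-insensitive one.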
Now, if you actually want to use a list, although each membership check is O(n) for this application...
words = []
with open(fname, "r") as f:
    for line in f:
        for word in line.rstrip().split():
            if word not in words:
                words.append(word)
The 'Pythonic' way is to use a set to collect the unique words and to iterate over the file line by line:
with open(fn) as f:                   # open your file and auto close it
    uniq=set()                        # a set only has one entry of each
    for line in f:                    # file line by line
        for word in line.split():     # line word by word
            uniq.add(word)            # uniqueify by adding to a set
print sorted(uniq)                    # print that sorted
You can make this tersely Pythonic with a set comprehension that flattens the two nested loops: 1) over the lines from the file, 2) over the words in each line:
with open(fn) as f:
    uniq={w for line in f for w in line.split()}
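To try the set comprehension without a file on disk, here is a sketch in Python 3 syntax that assumes io.StringIO as a stand-in for the opened romeo.txt; only two of the sample lines are used:

```python
import io

# io.StringIO behaves like an open text file, so the loop is unchanged.
f = io.StringIO(
    "It is the east and Juliet is the sun\n"
    "Arise fair sun and kill the envious moon\n"
)

uniq = {w for line in f for w in line.split()}
print(sorted(uniq))
# ['Arise', 'It', 'Juliet', 'and', 'east', 'envious', 'fair', 'is', 'kill', 'moon', 'sun', 'the']
```

The comprehension reads left to right in the same order as the nested for loops it replaces.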
8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.
You can download the sample data at http://www.pythonlearn.com/code/romeo.txt
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line=line.rstrip()
    a=line.split()
    i=0
    for z in a:
        if z not in lst:
            lst.append(z)
        else:
            continue
lst.sort()
print lst
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    word=line.rstrip().split()
    for i in word:
        if i in lst:
            continue
        else:
            lst.append(i)
lst.sort()
print lst
I am doing the EXACT SAME Python Online course in Coursera - "Python for everybody" - and It took me 3 days to complete this assignment and come up with the following piece of code. A short recommendation just if you care
1) Try writing the code exclusively without ANY hint or help - try at least 10 hours
2) Leave the questions as "Last Resort"
When you don't give up and write the code independently the reward is IMMENSE.
For the following code I used EXCLUSIVELY the materials covered in week 4 for the course
fname = input("Enter file name: ")
fh = open("romeo.txt")
newlist = list ()
for line in fh:
    words = line.split()
    for word in words:
        if word not in newlist :
            newlist.append (word)
        elif word in newlist :
            continue
newlist.sort ()
print (newlist)
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line=line.rstrip()
    line=line.split()
    for i in line:
        if i in lst:
            continue
        else:
            lst.append(i)
lst.sort()
print (lst)
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    for i in line.split():
        if i not in lst:
            lst.append(i)
lst.sort()
print(lst)

Split strings in a list of lists

I currently have a list of lists:
[['Hi my name is'],['What are you doing today'],['Would love some help']]
And I would like to split the strings in the lists, while remaining in their current location. For example
[['Hi','my','name','is']...]..
How can I do this?
Also, if I would like to use a specific of the lists after searching for it, say I search for "Doing", and then want to append something to that specific list.. how would I go about doing that?
You can use a list comprehension to create new list of lists with all the sentences split:
[lst[0].split() for lst in list_of_lists]
Now you can loop through this and find the list matching a condition:
for sublist in list_of_lists:
    if 'doing' in sublist:
        sublist.append('something')
or, to search case-insensitively, use any() and a generator expression; this will test the minimum number of words needed to find a match:
for sublist in list_of_lists:
    if any(w.lower() == 'doing' for w in sublist):
        sublist.append('something')
list1 = [['Hi my name is'],['What are you doing today'],['Would love some help']]
use
[i[0].split() for i in list1]
then you will get the output like
[['Hi', 'my', 'name', 'is'], ['What', 'are', 'you', 'doing', 'today'], ['Would', 'love', 'some', 'help']]
l = [['Hi my name is'],['What are you doing today'],['Would love some help']]
for x in l:
    l[l.index(x)] = x[0].split(' ')
print l
Or simply:
l = [x[0].split(' ') for x in l]
Output
[['Hi', 'my', 'name', 'is'], ['What', 'are', 'you', 'doing', 'today'], ['Would', 'love', 'some', 'help']]
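Putting both parts of the question together, here is a sketch in Python 3 syntax that splits the strings and then appends to the sublist containing the search word. The names `split_lists` and `target` and the appended value 'something' are illustrative only:

```python
list_of_lists = [['Hi my name is'], ['What are you doing today'], ['Would love some help']]

# Split the single string inside each sublist into words.
split_lists = [sub[0].split() for sub in list_of_lists]

# Case-insensitive search for "Doing", then append to each matching sublist.
target = 'Doing'.lower()
for sub in split_lists:
    if any(w.lower() == target for w in sub):
        sub.append('something')

print(split_lists)
# [['Hi', 'my', 'name', 'is'], ['What', 'are', 'you', 'doing', 'today', 'something'], ['Would', 'love', 'some', 'help']]
```

The search has to run on the split result, since the original sublists each hold one whole sentence and would never contain the single word.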
