I'm trying to convert a set I've defined into a list so I can use it for indexing.
seen = set()
for line in p:
    for word in line.split():
        if word not in seen and not word.isdigit():
            seen.add(word)
been = list(seen)
The set gets populated just fine. However, the list is always empty when I monitor its value in the variable explorer (and when I later call the index function).
What am I doing wrong?
EDIT: This is the entire code. I'm trying to find the location of words in 'p' in 'o' and chart the number of its occurrences in a single line. It's a huge list of words so manually entering anything is out of the question.
p = open("p.txt", 'r')
o = open("o.txt", 'r')
t = open("t.txt", 'w')
lines = p.readlines()
vlines = o.readlines()
seen = set()
for line in p:
    for word in line.split():
        if word not in seen and not word.isdigit():
            seen.add(word)
been = list(seen)
for i in lines:
    thisline = i.split()
    thisline[:] = [word for word in thisline if not word.isdigit()]
    count = len(thisline)
    j = []
    j.append(count)
    for sword in thisline:
        num = thisline.count(sword)
        #index = 0
        #for m in vlines:
        #    if word is not m:
        #        index += 1
        ix = been.index(sword)
        j.append(' ' + str(ix) + ':' + str(num))
    j.append('\n')
    for item in j:
        t.write("%s" % item)
Output should be in the format '(total number of items in line) (index):(no. of occurrences)'.
I think I'm pretty close but this part is bugging me.
Your code is working just fine.
>>> p = '''
... the 123 dogs
... chased 567 cats
... through 89 streets'''.splitlines()
>>> seen = set()
>>> for line in p:
...     for word in line.split():
...         if word not in seen and not word.isdigit():
...             seen.add(word)
...
>>> been = list(seen)
>>> seen
set(['streets', 'chased', 'cats', 'through', 'the', 'dogs'])
>>> been
['streets', 'chased', 'cats', 'through', 'the', 'dogs']
Unless there's a reason why you want to read line by line you can simply replace this:
seen = set()
for line in p:
    for word in line.split():
        if word not in seen and not word.isdigit():
            seen.add(word)
been = list(seen)
with:
been = list(set([w for w in open('p.txt', 'r').read().split() if not w.isdigit()]))
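For what it's worth, the same one-liner reads slightly cleaner as a set comprehension; the sketch below runs it on an inline sample string rather than p.txt so it is self-contained:

```python
# Same idea as the one-liner, written as a set comprehension.
# An inline sample stands in for the contents of p.txt.
sample = "the 123 dogs\nchased 567 cats\nthrough 89 streets"

been = list({w for w in sample.split() if not w.isdigit()})

print(sorted(been))
```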
Related
I’m a programming neophyte and would like some assistance in understanding why the following algorithm is behaving in a particular manner.
My objective is for the function to read in a text file containing words (can be capitalized), strip the whitespace, split the items into separate lines, convert all capital first characters to lowercase, remove all single characters (e.g., “a”, “b”, “c”, etc.), and add the resulting words to a list. All words are to be a separate item in the list for further processing.
Input file:
A text file (‘sample.txt’) contains the following data - “a apple b Banana c cherry”
Desired output:
[‘apple’, ‘banana’, ‘cherry’]
In my initial attempt I tried to iterate through the list of words to test if their length was equal to 1. If so, the word was to be removed from the list, with the other words remaining in the list. This resulted in the following, non-desired output: [None, None, None]
filename = 'sample.txt'
with open(filename) as input_file:
    word_list = input_file.read().strip().split(' ')
word_list = [word.lower() for word in word_list]
word_list = [word_list.remove(word) for word in word_list if len(word) == 1]
print(word_list)
Produced non-desired output = [None, None, None]
My next attempt was to instead iterate through the list for words to test if their length was greater than 1. If so, the word was to be added to the list (leaving the single characters behind). The desired output was achieved using this method.
filename = 'sample.txt'
with open(filename) as input_file:
    word_list = input_file.read().strip().split(' ')
word_list = [word.lower() for word in word_list]
word_list = [word for word in word_list if len(word) > 1]
print(word_list)
Produced desired output = ['apple', 'banana', 'cherry']
My questions are:
Why didn’t the initial code produce the desired result when it seemed to be the most logical and most efficient?
What is the best ‘Pythonic’ way to achieve the desired result?
The reason you got the output you got is
You're removing items from the list as you're looping through it
You are trying to use the output of list.remove (which just modifies the list and returns None)
Your last list comprehension (word_list = [word_list.remove(word) for word in word_list if len(word) == 1]) is essentially equivalent to this:
new_word_list = []
for word in word_list:
    if len(word) == 1:
        new_word_list.append(word_list.remove(word))
word_list = new_word_list
And as you loop through it this happens:
# word_list == ['a', 'apple', 'b', 'banana', 'c', 'cherry']
# new_word_list == []
word = word_list[0] # word == 'a'
new_word_list.append(word_list.remove(word))
# word_list == ['apple', 'b', 'banana', 'c', 'cherry']
# new_word_list == [None]
word = word_list[1] # word == 'b'
new_word_list.append(word_list.remove(word))
# word_list == ['apple', 'banana', 'c', 'cherry']
# new_word_list == [None, None]
word = word_list[2] # word == 'c'
new_word_list.append(word_list.remove(word))
# word_list == ['apple', 'banana', 'cherry']
# new_word_list == [None, None, None]
word_list = new_word_list
# word_list == [None, None, None]
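The core of the trace can be reproduced in two lines: list.remove() mutates the list in place and returns None, which is exactly what ended up in new_word_list.

```python
# list.remove() modifies the list in place and returns None.
word_list = ['a', 'apple', 'b', 'banana', 'c', 'cherry']

result = word_list.remove('a')

print(result)     # the return value, not the removed element
print(word_list)  # the list after removal
```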
The best 'Pythonic' way to do this (in my opinion) would be:
with open('sample.txt') as input_file:
    file_content = input_file.read()

word_list = []
for word in file_content.strip().split(' '):
    if len(word) == 1:
        continue
    word_list.append(word.lower())

print(word_list)
In your first approach, you are storing the result of word_list.remove(word) in the list, and that result is None, because list.remove() returns nothing; it only performs an action on the given list.
Your second approach is the pythonic way to achieve your goal.
The second attempt is the most Pythonic. The first idea can still be made to work with the following, though note that removing items from a list while iterating over it is fragile in general:
filename = 'sample.txt'
with open(filename) as input_file:
    word_list = input_file.read().strip().split(' ')
word_list = [word.lower() for word in word_list]
for word in word_list:
    if len(word) == 1:
        word_list.remove(word)
print(word_list)
Why didn’t the initial code produce the desired result when it seemed
to be the most logical and most efficient?
It's advised never to alter a list while iterating over it. The iterator walks the list by index, so removing an item shifts the later elements down and causes some of them to be skipped.
What is the best ‘Pythonic’ way to achieve the desired result?
Your second attempt. But I'd use a better naming convention, and your two comprehensions can be combined, since the first one only lowercases the words:
word_list = input_file.read().strip().split(' ')
filtered_word_list = [word.lower() for word in word_list if len(word) > 1]
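A quick sanity check of that combined comprehension against the sample data (inlined here in place of reading sample.txt):

```python
# The file content from the question, inlined instead of read from disk.
content = "a apple b Banana c cherry"

word_list = content.strip().split(' ')
# Lowercase and filter single characters in one pass.
filtered_word_list = [word.lower() for word in word_list if len(word) > 1]

print(filtered_word_list)
```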
I am writing a mini program, and within it there is a function which reads in a text file and returns the individual words. However, I am having trouble seeing the individual words printed, even though I return them. I don't really get why, unless I have a big problem with my whitespace. Can you please help? For your information, I am only a beginner. The program asks the user for a filename; the function then reads the file, turns it into a list, finds the individual words, and stores them in that list.
file_input = input("enter a filename to read: ")

#unique_words = []
def file(user):
    unique_words = []
    csv_file = open(user + ".txt","w")
    main_file = csv_file.readlines()
    csv_file.close()
    for i in main_list:
        if i not in unique_words:
            unique_words.append(i)
    return unique_words

#display the results of the file being read in
print (file(file_input))
Sorry I am using notepad:
check to see if checking works
It seems you only have one word on each line of your file.
def read_file(user):
    with open(user + ".txt", "r") as f:
        data = [line.strip() for line in f.readlines()]
    return list(set(data))
--- update ---
If you have more than one word in each line, separated by spaces:
def read_file(user):
    with open(user + ".txt", "r") as f:
        data = [item.strip() for line in f.readlines() for item in line.split(' ')]
    return list(set(data))
In fact, I cannot reproduce your problem. Given a proper CSV input file 1) such as
a,b,c,d
e,f,g,h
i,j,k,l
your program prints this, which, apart from the last '', seems fine:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', '']
However, you can significantly simplify your code.
instead of appending a , to each line, and then joining by "", just join by , (this will also get rid of that last '')
do the strip directly in join, using a generator expression
main_string = ",".join(line.strip() for line in main_file)
instead of join and then split, use a double-for-loop list comprehension:
main_list = [word for line in csv_file for word in line.strip().split(",")]
instead of doing all this by hand, use the csv module:
main_list = [word for row in csv.reader(csv_file) for word in row]
assuming that order is not important, use a set to remove duplicates:
unique_words = set(main_list)
and if order is important, you can (ab)use collections.OrderedDict:
unique_words = list(collections.OrderedDict((x, None) for x in main_list))
use with to open and close the file
Putting it all together:
import csv
def read_file(user):
with open(user + ".txt") as csv_file:
main_list = [word for row in csv.reader(csv_file) for word in row]
unique_words = set(main_list) # or OrderedDict, see above
return unique_words
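As a side note to the OrderedDict trick: on Python 3.7 and later, plain dicts preserve insertion order, so dict.fromkeys gives the same order-preserving de-duplication with no import at all.

```python
# Order-preserving de-duplication with a plain dict (Python 3.7+,
# where dicts keep insertion order).
main_list = ['b', 'a', 'b', 'c', 'a']

unique_words = list(dict.fromkeys(main_list))

print(unique_words)
```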
1) Update: The reason why it does not work on the "Example text..." file shown in your edit is that it is not a CSV file. CSV means "comma-separated values", but the words in that file are separated by spaces, so you will have to split on spaces instead of on commas:
def read_file(user):
    with open(user + ".txt") as text_file:
        main_list = [word for line in text_file for word in line.strip().split()]
    return set(main_list)
If all you want is a list of each word that occurs in the text, you are doing far too much work. You want something like this:
unique_words = []
all_words = []
with open(file_name, 'r') as in_file:
    text_lines = in_file.readlines()  # Read in all lines from the file as a list.
for line in text_lines:
    all_words.extend(line.split())  # Extend the list of all words with the words in this line.
unique_words = list(set(all_words))  # Reduce the list of all words to unique words.
You can simplify your code by using a set because it will only contain unique elements.
user_file = raw_input("enter a filename to read: ")

#function to read any file
def read_file(user):
    unique_words = set()
    csv_file = open(user + ".txt", "r")
    main_file = csv_file.readlines()
    csv_file.close()
    for line in main_file:
        line = line.split(',')
        unique_words.update([x.strip() for x in line])
    return list(unique_words)

#display the results of the file being read in
print (read_file(user_file))
The output for a file with the contents:
Hello, world1
Hello, world2
is
['world2', 'world1', 'Hello']
file_str = input("Enter poem: ")
my_file = open(file_str, "r")
words = file_str.split(',' or ';')
I have a file on my computer that contains a really long poem, and I want to see if there are any words that are duplicated per line (hence it being split by punctuation).
I have that much, and I don't want to use a module or Counter, I would prefer to use loops. Any ideas?
You can use sets to track seen items and duplicates:
>>> words = 'the fox jumped over the lazy dog and over the bear'.split()
>>> seen = set()
>>> dups = set()
>>> for word in words:
...     if word in seen:
...         if word not in dups:
...             print(word)
...             dups.add(word)
...     else:
...         seen.add(word)
...
the
over
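Since the question is about duplicates per line, the same seen/dups idea works with the sets reset for every line. The poem is inlined here (a made-up sample) so the sketch runs standalone:

```python
# Per-line duplicate check: reset the tracking sets for every line.
poem = "the fox jumped over the lazy dog\nand over and over it went"

results = []
for line in poem.splitlines():
    seen = set()
    dups = set()
    for word in line.split():
        if word in seen:
            dups.add(word)
        else:
            seen.add(word)
    results.append(sorted(dups))

print(results)
```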
with open(r"specify the path of the file") as f:
    data = f.read().split()
    if set(i for i in data if data.count(i) > 1):
        print "Duplicates found"
    else:
        print "None"
SOLVED!
Here is the explanation, with a working program.
file content of sam.txt
sam.txt
Hello this is star hello the data are Hello so you can move to the
hello
file_content = []
resultant_list = []
repeated_element_list = []

with open(file="sam.txt", mode="r") as file_obj:
    file_content = file_obj.readlines()

print("\n debug the file content ", file_content)

for line in file_content:
    temp = line.strip('\n').split(" ")  # strip the newline and split the line on spaces, giving a list
    for word in temp:
        resultant_list.append(word)

print("\n debug resultant_list", resultant_list)

# Main for loop: compare each string with the strings that follow it
for ii in range(0, len(resultant_list)):
    # is_repeated holds the element's count; greater than 1 means duplicates exist
    is_repeated = resultant_list.count(resultant_list[ii])
    if is_repeated > 1:
        if resultant_list[ii] not in repeated_element_list:
            for2count = ii + 1
            # This for loop shifts the iterator to the strings after the current one
            for jj in range(for2count, len(resultant_list)):
                if resultant_list[ii] == resultant_list[jj]:
                    repeated_element_list.append(resultant_list[ii])

print("The repeated strings are {}\n and total counts {}".format(repeated_element_list, len(repeated_element_list)))
Output:
debug the file content ['Hello this is abdul hello\n', 'the data are Hello so you can move to the hello']
debug resultant_list ['Hello', 'this', 'is', 'abdul', 'hello', 'the', 'data', 'are', 'Hello', 'so', 'you', 'can', 'move', 'to', 'the', 'hello']
The repeated strings are ['Hello', 'hello', 'the']
and total counts 3
Thanks
def Counter(text):
    d = {}
    for word in text.split():
        d[word] = d.get(word, 0) + 1
    return d
There are loops :/
To split on punctuation, just use:
import re

matches = re.split("[!.?]", my_corpus)
for match in matches:
    print Counter(match)
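For example, running that hand-rolled Counter on a single sentence gives a plain dict of word counts:

```python
# The hand-rolled Counter from the answer, applied to one sentence.
def Counter(text):
    d = {}
    for word in text.split():
        d[word] = d.get(word, 0) + 1
    return d

counts = Counter("the fox and the dog")
print(counts)
```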
For this kind of file:
A hearth came to us from your hearth
foreign hairs with hearth are same are hairs
This will check the whole poem:
lst = []
with open("coz.txt") as f:
    for line in f:
        for word in line.split():  # split on gaps (spaces)
            if word not in lst:
                lst.append(word)
            else:
                print (word)
Output:
>>>
hearth
hearth
are
hairs
>>>
As you see, hearth is printed twice here, because in the whole poem hearth occurs 3 times.
To check line by line:
lst = []
lst2 = []
with open("coz.txt") as f:
    for line in f:
        for word in line.split():
            lst2.append(word)
for x in lst2:
    if x not in lst:
        lst.append(x)
        lst2.remove(x)
print (set(lst2))
>>>
{'hearth', 'are', 'hairs'}
>>>
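An alternative line-by-line version that avoids removing items from a list while iterating over it: count the words of each line in a plain dict and keep those seen more than once. The sample text is inlined (matching the example poem) so the sketch runs standalone:

```python
# Count words per line; words with a count above 1 are that line's duplicates.
text = "A hearth came to us from your hearth\nforeign hairs with hearth are same are hairs"

duplicates_per_line = []
for line in text.splitlines():
    counts = {}
    for word in line.split():
        counts[word] = counts.get(word, 0) + 1
    duplicates_per_line.append(sorted(w for w, n in counts.items() if n > 1))

print(duplicates_per_line)
```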
I'm new to Python and I'm trying to write a piece of code which accomplishes this task:
I need to open the file romeo.txt and read it line by line.
For each line, split the line into a list of words using the split() function. Build a list of words as follows:
For each word on each line check to see if the word is already in the list
If not append it to the list.
When the program completes, sort and print the resulting words in alphabetical order.
You can download the sample data at http://www.pythonlearn.com/code/romeo.txt
This is what I have so far:
fname = raw_input("Enter file name: ")
if len(fname) == 0:
    fname = open('romeo.txt')
newList = []
for line in fname:
    words = line.rstrip().split()
    print words
I know that I need to use another for loop to check for any missing words and finally I need to sort them out by using the sort() function. The Python interpreter is giving me an error saying that I have to use append() to add the missing words if they don't exist.
I have managed to build the following list with my code:
['But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks'] ← Mismatch
['It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun']
['Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon']
['Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief']
but the output should look like this:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks','east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick','soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
How can I adjust my code to produce that output?
Important Note:
To everyone who wants to help: please make sure that you work from my code to finish this task, as it's an assignment and we have to follow the level of the course. Thanks
These are my updates to the code:
fname = raw_input("Enter file name: ")
if len(fname) == 0:
    fname = open('romeo.txt')
newList = list()
for line in fname:
    words = line.rstrip().split()
    for i in words:
        newList.append(i)
newList.sort()
print newList
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
But I'm getting duplication! Why is that, and how do I avoid it?
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    for i in line.split():
        lst.append(i)
print sorted(set(lst))
The above code worked for me.
fname = input("Enter file name: ")  # Ask the user for the filename.
fh = open(fname)                    # Open the file.
lst = list()                        # Create a list called lst.
for line in fh:                     # For each line in fh:
    words = line.split()            # Separate the words in the line.
    for word in words:              # For each word in words:
        if word not in lst:         # If the word is not in lst,
            lst.append(word)        # add the word.
        elif word in lst:           # Otherwise keep looping.
            continue
lst.sort()
print(lst)                          # Print the lst.
I struggled with this question for quite a long time while I was doing an online Python course on Coursera, but I managed to do it without too many nested loops or for loops. Hope this helps.
file = input('Enter File Name: ')
try:
    file = open(file)
except:
    print('File Not Found')
    quit()
F = file.read()
F = F.rstrip().split()
L = list()
for a in F:
    if a in L:
        continue
    else:
        L.append(a)
print(sorted(L))
You want to gather all of the words into a single list. Or, uh, a set, because sets enforce uniqueness and you don't care about order anyways.
fname = raw_input("Enter file name: ")
if len(fname) == 0: fname = 'romeo.txt'
with open(fname, 'r') as f:  # Context manager
    words = set()
    for line in f: words.update(line.rstrip().split())
# Now for the sorting
print sorted(words, key=str.lower)
I'm using key = str.lower because I assume you want to sort by human alphabetical and not by computer alphabetical. If you want computer alphabetical, get rid of that argument.
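For example, the two orderings differ like this on a handful of the words involved:

```python
# Default sort puts all uppercase words before lowercase ones;
# key=str.lower sorts case-insensitively.
words = {'Arise', 'but', 'It', 'already'}

computer_order = sorted(words)               # "computer alphabetical"
human_order = sorted(words, key=str.lower)   # "human alphabetical"

print(computer_order)
print(human_order)
```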
Now, if you actually want to use a list, although each `word not in words` membership test is O(n), which makes the whole build quadratic...
words = []
with open(filename, "r") as f:
    for line in f:
        for word in line.rstrip().split():
            if word not in words:
                words.append(word)
The 'Pythonic' way is to use a set to make a list of unique words and to iterate over the file line by line:
with open(fn) as f:                # open your file and auto-close it
    uniq = set()                   # a set only has one entry of each
    for line in f:                 # file line by line
        for word in line.split():  # line word by word
            uniq.add(word)         # uniquify by adding to a set
print sorted(uniq)                 # print that, sorted
Which you can make tersely Pythonic with a set comprehension that flattens the nesting directly: the file yields lines, and each line yields words:
with open(fn) as f:
    uniq = {w for line in f for w in line.split()}
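A quick check of that comprehension against an in-memory file object (io.StringIO stands in for the real open file here, with a few words from the poem):

```python
import io

# io.StringIO stands in for open(fn), so the sketch runs without a file on disk.
f = io.StringIO("But soft what light\nIt is the east\nthe sun")
uniq = {w for line in f for w in line.split()}

print(sorted(uniq, key=str.lower))
```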
8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.
You can download the sample data at http://www.pythonlearn.com/code/romeo.txt
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip()
    a = line.split()
    for z in a:
        if z not in lst:
            lst.append(z)
        else:
            continue
lst.sort()
print lst
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    word = line.rstrip().split()
    for i in word:
        if i in lst:
            continue
        else:
            lst.append(i)
lst.sort()
print lst
I am doing the EXACT SAME online Python course on Coursera, "Python for Everybody", and it took me 3 days to complete this assignment and come up with the following piece of code. A short recommendation, if you care:
1) Try writing the code exclusively without ANY hint or help; try for at least 10 hours.
2) Leave the questions as a last resort.
When you don't give up and write the code independently, the reward is IMMENSE.
For the following code I used EXCLUSIVELY the materials covered in week 4 of the course.
fname = input("Enter file name: ")
fh = open(fname)
newlist = list()
for line in fh:
    words = line.split()
    for word in words:
        if word not in newlist:
            newlist.append(word)
        elif word in newlist:
            continue
newlist.sort()
print(newlist)
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    line = line.rstrip()
    line = line.split()
    for i in line:
        if i in lst:
            continue
        else:
            lst.append(i)
lst.sort()
print (lst)
fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
    for i in line.split():
        if i not in lst:
            lst.append(i)
lst.sort()
print(lst)
The sample below strips punctuation from a ranbo.txt file and converts the text to lowercase...
Help me split this on whitespace.
import string

infile = open('ranbo.txt', 'r')
lowercased = infile.read().lower()
for c in string.punctuation:
    lowercased = lowercased.replace(c, "")
white_space_words = lowercased.split(?????????)
print white_space_words
Now, after this split, how can I find out how many words are in this list? The count or the len function?
white_space_words = lowercased.split()
splits using any length of whitespace characters.
'a b \t cd\n ef'.split()
returns
['a', 'b', 'cd', 'ef']
But you could also do it the other way round:
import re
words = re.findall(r'\w+', text)
returns a list of all "words" from text.
Get its length using len():
len(words)
and if you want to join them into a new string with newlines:
text = '\n'.join(words)
As a whole:
import re

with open('ranbo.txt', 'r') as f:
    lowercased = f.read().lower()
words = re.findall(r'\w+', lowercased)
number_of_words = len(words)
text = '\n'.join(words)
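Note the difference between the two approaches when the text still contains punctuation: split() leaves it attached to the words, while the regex extracts runs of word characters only (and splits contractions at the apostrophe).

```python
import re

text = "hello, world! it's here"

split_words = text.split()              # punctuation stays attached to words
regex_words = re.findall(r'\w+', text)  # word characters only; "it's" splits

print(split_words)
print(regex_words)
```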