How can I find duplicate words in a text file?

How can I find duplicate words in a text file? - python

file_str = input("Enter poem: ")
my_file = open(file_str, "r")
words = file_str.split(',' or ';')
I have a file on my computer that contains a really long poem, and I want to see if there are any words that are duplicated per line (hence it being split by punctuation).
I have that much, and I don't want to use a module or Counter, I would prefer to use loops. Any ideas?

You can use sets to track seen items and duplicates:
>>> words = 'the fox jumped over the lazy dog and over the bear'.split()
>>> seen = set()
>>> dups = set()
>>> for word in words:
if word in seen:
if word not in dups:
print(word)
dups.add(word)
else:
seen.add(word)
the
over

with open (r"specify the path of the file") as f:
data = f.read()
if(set([i for i in data if f.count(f)>1])):
print "Duplicates found"
else:
print "None"

SOLVED !!!
I can give the explanation with working program
file content of sam.txt
sam.txt
Hello this is star hello the data are Hello so you can move to the
hello
file_content = []
resultant_list = []
repeated_element_list = []
with open(file="sam.txt", mode="r") as file_obj:
file_content = file_obj.readlines()
print("\n debug the file content ",file_content)
for line in file_content:
temp = line.strip('\n').split(" ") # This will strip('\n') and split the line with spaces and stored as list
for _ in temp:
resultant_list.append(_)
print("\n debug resultant_list",resultant_list)
#Now this is the main for loop to check the string with the adjacent string
for ii in range(0, len(resultant_list)):
# is_repeated will check the element count is greater than 1. If so it will proceed with identifying duplicate logic
is_repeated = resultant_list.count(resultant_list[ii])
if is_repeated > 1:
if ii not in repeated_element_list:
for2count = ii + 1
#This for loop for shifting the iterator to the adjacent string
for jj in range(for2count, len(resultant_list)):
if resultant_list[ii] == resultant_list[jj]:
repeated_element_list.append(resultant_list[ii])
print("The repeated strings are {}\n and total counts {}".format(repeated_element_list, len(repeated_element_list)))
Output:
debug the file content ['Hello this is abdul hello\n', 'the data are Hello so you can move to the hello']
debug resultant_list ['Hello', 'this', 'is', 'abdul', 'hello', 'the', 'data', 'are', 'Hello', 'so', 'you', 'can', 'move', 'to', 'the', 'hello']
The repeated strings are ['Hello', 'hello', 'the']
and total counts 3
Thanks

def Counter(text):
d = {}
for word in text.split():
d[word] = d.get(word,0) + 1
return d
there is loops :/
to split on punctionation just us
matches = re.split("[!.?]",my_corpus)
for match in matches:
print Counter(match)

For this kinda file;
A hearth came to us from your hearth
foreign hairs with hearth are same are hairs
This will check whole poem;
lst = []
with open ("coz.txt") as f:
for line in f:
for word in line.split(): #splited by gaps (space)
if word not in lst:
lst.append(word)
else:
print (word)
Output:
>>>
hearth
hearth
are
hairs
>>>
As you see there are two hearth here, because in whole poem there are 3 hearth.
For check line by line;
lst = []
lst2 = []
with open ("coz.txt") as f:
for line in f:
for word in line.split():
lst2.append(word)
for x in lst2:
if x not in lst:
lst.append(x)
lst2.remove(x)
print (set(lst2))
>>>
{'hearth', 'are', 'hairs'}
>>>

Related

How to check if a txt file exists a keyword in another txt file?

I have a input txtfile like,
The quick brown fox jumps over the lazy dog
The quick brown fox
A beautiful dog
And I have keywords saved as txtfile like,
fox dog ...
I want to check each line of the input file if it has these keywords, I know how to check the keyword one by one,
with open("input.txt") as f:
a_file = f.read().splitlines()
b_file = []
for line in a_file:
if "dog" in line:
b_file.append("dog")
elif "fox" in line:
b_file.append("fox")
else:
b_file.append("Not found")
with open('output.txt', 'w') as f:
f.write('\n'.join(b_file) + '\n')
but how to check them if they are in another file? P.S. I need to check some specific line not all content in a file and for examples, the result should like,
fox dog
fox
dog

Although you changed a few of the requirements, it appears you want this:
to read a list of keywords from a file with these keywords on a single line, separated by space
to find lines of a text document that have any of these keywords on them, and output the line number (index) of the line they appear on and exactly which keywords were on it, for all lines that have them
This script does that:
with open('keywords.txt') as f:
keywords = f.read().split()
with open('document.txt') as f, open('output.txt', 'w') as o:
for n, line in enumerate(f):
if matches := [k for k in keywords if k in line]:
o.write(f'{n+1}: {matches}\n')
With keywords.txt something like:
fox dog
And document.txt something like:
the quick brown fox
jumped over the lazy dog
on a beautiful dog day afternoon, you foxy dog
there is nothing on FOX
and sometimes you're in a foxhole with a dog
It will write output.txt with:
1: ['fox']
2: ['dog']
3: ['fox', 'dog']
5: ['fox', 'dog']
If you don't want partial matches (like foxhole) and if you care about the order in which words were found, and perhaps want to know about duplicates as well, and you want to make sure capitalisation doesn't matter:
with open('keywords.txt') as f:
keywords = [k.lower() for k in f.read().split()]
with open('document.txt') as f, open('output.txt', 'w') as o:
for n, line in enumerate(f):
if matches := [w for w in line.split() if w.lower() in keywords]:
o.write(f'{n+1}: {matches}\n')
And finally, perhaps your document.txt gets a 6th line with punctuation:
I watch "FOX", but although I search doggedly, I can't find a thing, you foxy dog!
Then this script:
import re
import string
with open('keywords.txt') as f:
keywords = [k.lower() for k in f.read().split()]
with open('document.txt') as f, open('output.txt', 'w') as o:
for n, line in enumerate(f):
if matches := [w for w in re.sub('['+string.punctuation+']', '', line).split() if w.lower() in keywords]:
o.write(f'{n+1}: {matches}\n')
Gets this written to output.txt:
1: ['fox']
2: ['dog']
3: ['dog', 'dog']
4: ['FOX']
5: ['dog']
6: ['FOX', 'dog']

You should load both files. One is for keyword query, another is for the content for searching. Ex I have a file named keywords.txt, and content.txt
Then open it all:
with open("keywords.txt") as f1, open("content.txt") as f2:
keywords = f1.read()
content = f2.read()
# keywords: fox dog
# content: The quick brown fox jumps over the lazy dog\nThe quick brown fox\nA beautiful dog
If you only want to check if the content contains the keyword, then just do this:
keywords = [line.split() for line in keywords.split("\n")]
keywords = sum(keywords, [])
# keywords: ['fox', 'dog']
content = [line.split() for line in content.split("\n")]
content = sum(content, [])
# content: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', 'The', 'quick', 'brown', 'fox', 'A', 'beautiful', 'dog']
# check intersection of 2 sets, if there is some words overlap
# ==> keywords appear in the content
if set(keywords)&set(content):
print(True)
else:
print(False)

For all unfamiliar with Python I want to extend Grismar's manifold answer with two goals:
explain the language constructs used
extract all the matching-variants into functions and an enum
1. language constructs
[expr for var in generator] is a List comprehension for building a list
i, var in enumerate(list) uses enumerate to have index and iterator-variable inside the loop
var := expr is the Walrus operator (assignment expression) introduced in Python 3.8
2. extract matching variants
The Enum (class) defines the 3 proposed matching-modes. We can then use this mode for both:
(a) reading the keywords ready-to-match, using the extracted function keywords_from
(b) find matches of those keywords, using the extracted function match_keywords
from enum import Enum
class KeywordMatch(Enum):
EXACT = 'exact'
LOWER = 'lower'
PARTIAL = 'partial'
# Usage: keywords = keywords_from('keywords.txt', KeywordMatch.LOWER)
def keywords_from(filename, mode):
with open(filename) as f:
if mode == KeywordMatch.LOWER:
keywords = [k.lower() for k in f.read().split()]
else:
keywords = f.read().split()
return keywords
import re
import string
# Usage: if match_keywords(line, KeywordMatch.LOWER):
def match_keywords(line, mode):
if mode == KeywordMatch.LOWER
matches = [w for w in line.split() if w.lower() in keywords]
elif mode == KeywordMatch.PARTIAL:
matches = [w for w in re.sub('['+string.punctuation+']', '', line).split() if w.lower() in keywords]
else:
matches = [k for k in keywords if k in line]
return matches
if __name__ == "__main__":
mode = KeywordMatch.LOWER
keywords = keywords_from('keywords.txt', mode)
with open('document.txt') as f, open('output.txt', 'w') as o:
for n, line in enumerate(f):
matches = match_keywords(line, mode)
# can also test or debug-print matches
if matches:
o.write(f'{n+1}: {matches}\n')
Note:
despite all the modularization, the keywords list is still a global variable (which is not so clean)
removed the Walrus operator and kept matches separate to test or debug them before writing to file
See also:
Real Python: How to Use Generators and yield in Python
Real Python: Python enumerate(): Simplify Looping With Counters
Real Python: Assignment Expressions: The Walrus Operator
":=" syntax and assignment expressions: what and why?

Remove conjunction from file.txt and punctuation from user input

I want to clean a string from user input from punctuation and conjunction. the conjunction is stored in the file.txt (Stop Word.txt)
I already tried this code:
f = open("Stop Word.txt", "r")
def message(userInput):
punctuation = "!##$%^&*()_+<>?:.,;/"
words = userInput.lower().split()
conjunction = f.read().split("\n")
for char in words:
punc = char.strip(punctuation)
if punc in conjunction:
words.remove(punc)
print(words)
message(input("Pesan: "))
OUTPUT
when i input "Hello, how are you? and where are you?"
i expect the output is [hello,how,are,you,where,are,you]
but the output is [hello,how,are,you?,where,are,you?]
or [hello,how,are,you?,and,where,are,you?]

Use list comprehension to construct words and check if the word is in your conjunction list:
f = open("Stop Word.txt", "r")
def message(userInput):
punctuation = "!##$%^&*()_+<>?:.,;/"
words = userInput.lower().split()
conjunction = f.read().split("\n")
return [char.strip(punctuation) for char in words if char not in conjunction]
print (message("Hello, how are you? and where are you?"))
#['hello', 'how', 'are', 'you', 'where', 'are', 'you']

Why arent the individual words printed?

I am writing a mini program and within my program there is a function which reads in a text file and returns the individual words from the sentence. However I am having trouble seeing the individual words printed even though I return them. I don't really get why unless I have a big problem with my whitespace. Can you please help? For your information I am only a beginner. The program asks the user for an input of a filename the program then reads the file in the function should then turn the fie into a list and find the individual words from the list and stores them in that list
file_input = input("enter a filename to read: ")
#unique_words = []
def file(user):
unique_words = []
csv_file = open(user + ".txt","w")
main_file = csv_file.readlines()
csv_file.close()
for i in main_list:
if i not in unique_words:
unique_words.append(i)
return unique_words
#display the results of the file being read in
print (file(file_input))
Sorry I am using notepad:
check to see if checking works

it seems you only have one word for each line in your file.
def read_file(user):
with open(user + ".txt","r") as f:
data = [ line.strip() for line in f.readlines() ]
return list( set(data) )
--update---
if you have more than one word in each line and separated by space
def read_file(user):
with open(user + ".txt","r") as f:
data = [ item.strip() for line in f.readlines() for item in line.split(' ')]
return list( set(data) )

In fact, I can not reproduce you problem. Given a proper CSV input file 1) such as
a,b,c,d
e,f,g,h
i,j,k,l
your program prints this, which apart from the last '' seems fine:
['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', '']
However, you can significantly simplify your code.
instead of appending a , to each line, and then joining by "", just join by , (this will also get rid of that last '')
do the strip directly in join, using a generator expression
main_string = ",".join(line.strip() for line in main_file)
instead of join and then split, use a double-for-loop list comprehension:
main_list = [word for line in csv_file for word in line.strip().split(",")]
instead of doing all this by hand, use the csv module:
main_list = [word for row in csv.reader(csv_file) for word in row]
assuming that order is not important, use a set to remove duplicates:
unique_words = set(main_list)
and if order is important, you can (ab)use collections.OrderedDict:
unique_words = list(collections.OrderedDict((x, None) for x in main_list))
use with to open and close the file
Putting it all together:
import csv
def read_file(user):
with open(user + ".txt") as csv_file:
main_list = [word for row in csv.reader(csv_file) for word in row]
unique_words = set(main_list) # or OrderedDict, see above
return unique_words
1) Update: The reason why it does not work on your "Example text..." file shown in your edit is because that is not a CSV file. CSV mean "comma separated values", but the words in that file a separated by spaces, so you will have to split by spaces instead of by commas:
def read_file(user):
with open(user + ".txt") as text_file:
main_list = [word for line in text_file for word in line.strip().split()]
return set(main_list)

If all you want is a list of each word that occurs in the text, you are doing far too much work. You want something like this:
unique_words = []
all_words = []
with open(file_name, 'r') as in_file:
text_lines = in_file.readlines() # Read in all line from the file as a list.
for line in text_lines:
all_words.extend(line.split()) # iterate through the list of lines, extending the list of all words to include the words in this line.
unique_words = list(set(all_words)) # reduce the list of all words to unique words.

You can simplify your code by using a set because it will only contain unique elements.
user_file = raw_input("enter a filename to read: ")
#function to read any file
def read_file(user):
unique_words = set()
csv_file = open(user + ".txt","r")
main_file = csv_file.readlines()
csv_file.close()
for line in main_file:
line = line.split(',')
unique_words.update([x.strip() for x in line])
return list(unique_words)
#display the results of the file being read in
print (read_file(user_file))
The output for a file with the contents:
Hello, world1
Hello, world2
is
['world2', 'world1', 'Hello']

How to look into a list and varify the existing of elements inside the list?

I'm new to Python and I'm trying to write a piece of code which has accomplishes task:
I need to open the file romeo.txt and read it line by line.
For each line, split the line into a list of words using the split() function. * * Build a list of words as follows:
For each word on each line check to see if the word is already in the list
If not append it to the list.
When the program completes, sort and print the resulting words in alphabetical order.
You can download the sample data at http://www.pythonlearn.com/code/romeo.txt
This is what I have so far:
fname = raw_input("Enter file name: ")
if len(fname) == 0:
fname = open('romeo.txt')
newList = []
for line in fname:
words = line.rstrip().split()
print words
I know that I need to use another for loop to check for any missing words and finally I need to sort them out by using the sort() function. The Python interpreter is giving me an error saying that I have to use append() to add the missing words if they don't exist.
I have managed to build the following list with my code:
['But', 'soft', 'what', 'light', 'through', 'yonder', 'window', 'breaks'] ← Mismatch
['It', 'is', 'the', 'east', 'and', 'Juliet', 'is', 'the', 'sun']
['Arise', 'fair', 'sun', 'and', 'kill', 'the', 'envious', 'moon']
['Who', 'is', 'already', 'sick', 'and', 'pale', 'with', 'grief']
but the output should come look like this:
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'breaks','east', 'envious', 'fair', 'grief', 'is', 'kill', 'light', 'moon', 'pale', 'sick','soft', 'sun', 'the', 'through', 'what', 'window', 'with', 'yonder']
How can I adjust my code to produce that output?
Important Note:
To everyone wants to help, Please make sure that you go from my code to finish this tast as it's an assignment and we have to follow the level of the course. Thanks
That is my updates for the code :
fname = raw_input("Enter file name: ")
if len(fname) == 0:
fname = open('romeo.txt')
newList = list()
for line in fname:
words = line.rstrip().split()
for i in words:
newList.append(i)
newList.sort()
print newList
['Arise', 'But', 'It', 'Juliet', 'Who', 'already', 'and', 'and', 'and', 'breaks', 'east', 'envious', 'fair', 'grief', 'is', 'is', 'is', 'kill', 'light', 'moon', 'pale', 'sick', 'soft', 'sun', 'sun', 'the', 'the', 'the', 'through', 'what', 'window', 'with', 'yonder']
But I'm getting duplication! Why is that and how to avoide that?

fname = raw_input("Enter file name: ")
fh = open(frame)
lst = list()
for line in fh
for i in line.split():
lst.append(i)
lst.sort()
print list(set(lst))
The above code worked for me.

fname = input("Enter file name: ") #Ask the user for the filename.
fh = open(fname) #Open the file.
lst = list() #Create a list called lst.
for line in fh: #For each line in the fh?
words = line.split() . #Separate the words in the line.
for word in words : #For each word in words.
if word not in lst : #If word not in the lst.
lst.append(word) #Add the word.
elif word in lst : #Continue looping.
continue
lst.sort()
print(lst) #Print the lst.

I struggled with this question for quite a long time while i was doing a Online Python course in Coursera. But i managed to do it without too many nested loops or for loops. Hope this helps.
file = input('Enter File Name: ')
try:
file = open(file)
except:
print('File Not Found')
quit()
F = file.read()
F = F.rstrip().split()
L = list()
for a in F:
if a in L:
continue
else:
L.append(a)
print(sorted(L))

You want to gather all of the words into a single list. Or, uh, a set, because sets enforce uniqueness and you don't care about order anyways.
fname = raw_input("Enter file name: ")
if len(fname) == 0: fname = 'romeo.txt')
with open(fname, 'r') as f: # Context manager
words = set()
for line in f: words.update(line.rstrip().split())
#Now for the sorting
print sorted(words, key = str.lower)
I'm using key = str.lower because I assume you want to sort by human alphabetical and not by computer alphabetical. If you want computer alphabetical, get rid of that argument.
Now, if you actually want to use a list, although it's O(n) for this application...
words = []
with open(filename, "r") as f:
for word in line.rstrip().split():
if word not in words:
words.append(word)

The 'Pythonic' way is to use a set to make a list of unique words and to interate over the file line-by-line:
with open(fn) as f: # open your file and auto close it
uniq=set() # a set only has one entry of each
for line in f: # file line by line
for word in line.split(): # line word by word
uniq.add(word) # uniqueify by adding to a set
print sorted(uniq) # print that sorted
Which you can make terse Pythonic by having a set comprehension that flattens the list of lists produced by 1) a list of lines 2) the lines from the file:
with open(fn) as f:
uniq={w for line in f for w in line.split()}

8.4 Open the file romeo.txt and read it line by line. For each line, split the line into a list of words using the split() method. The program should build a list of words. For each word on each line check to see if the word is already in the list and if not append it to the list. When the program completes, sort and print the resulting words in alphabetical order.
You can download the sample data at http://www.pythonlearn.com/code/romeo.txt
fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
line=line.rstrip()
a=line.split()
i=0
for z in a:
if z not in lst:
lst.append(z)
else:
continue
lst.sort()
print lst

fname = raw_input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
word=line.rstrip().split()
for i in word:
if i in lst:
continue
else:
lst.append(i)
lst.sort()
print lst

I am doing the EXACT SAME Python Online course in Coursera - "Python for everybody" - and It took me 3 days to complete this assignment and come up with the following piece of code. A short recommendation just if you care
1) Try writing the code exclusively without ANY hint or help - try at least 10 hours
2) Leave the questions as "Last Resort"
When you don't give up and write the code independently the reward is IMMENSE.
For the following code I used EXCLUSIVELY the materials covered in week 4 for the course
fname = input("Enter file name: ")
fh = open("romeo.txt")
newlist = list ()
for line in fh:
words = line.split()
for word in words:
if word not in newlist :
newlist.append (word)
elif word in newlist :
continue
newlist.sort ()
print (newlist)

fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
line=line.rstrip()
line=line.split()
for i in line:
if i in lst:
continue
else:
lst.append(i)
lst.sort()
print (lst)

fname = input("Enter file name: ")
fh = open(fname)
lst = list()
for line in fh:
for i in line.split():
if i not in lst:
lst.append(i)
lst.sort()
print(lst)

Empty list when converting from set

I'm trying to convert a set I've defined into a list so I can use it for indexing.
seen = set()
for line in p:
for word in line.split():
if word not in seen and not word.isdigit():
seen.add(word)
been = list(seen)
The set seems to contain items just fine. However the list is always empty when I monitor its value in the variable explorer (and when I later call the index function).
What am I doing wrong?
EDIT: This is the entire code. I'm trying to find the location of words in 'p' in 'o' and chart the number of its occurrences in a single line. It's a huge list of words so manually entering anything is out of the question.
p = open("p.txt", 'r')
o = open("o.txt", 'r')
t = open("t.txt", 'w')
lines = p.readlines()
vlines = o.readlines()
seen = set()
for line in p:
for word in line.split():
if word not in seen and not word.isdigit():
seen.add(word)
been = list(seen)
for i in lines:
thisline = i.split();
thisline[:] = [word for word in thisline if not word.isdigit()]
count = len(thisline)
j = []
j.append(count)
for sword in thisline:
num = thisline.count(sword)
#index=0
#for m in vlines:
#if word is not m:
#index+=1
ix = been.index(sword)
j.append(' ' + str(ix) + ':' + str(num))
j.append('\n')
for item in j:
t.write("%s" % item)
Output should be in the format '(total number of items in line) (index):(no. of occurrences)'.
I think I'm pretty close but this part is bugging me.

Your code is working just fine.
>>> p = '''
the 123 dogs
chased 567 cats
through 89 streets'''.splitlines()
>>> seen = set()
>>> for line in p:
for word in line.split():
if word not in seen and not word.isdigit():
seen.add(word)
>>> been = list(seen)
>>>
>>> seen
set(['streets', 'chased', 'cats', 'through', 'the', 'dogs'])
>>> been
['streets', 'chased', 'cats', 'through', 'the', 'dogs']

Unless there's a reason why you want to read line by line you can simply replace this:
seen = set()
for line in p:
for word in line.split():
if word not in seen and not word.isdigit():
seen.add(word)
been = list(seen)
with:
been = list(set([w for w in open('p.txt', 'r').read().split() if not w.isdigit()]))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How can I find duplicate words in a text file? - python

You can use sets to track seen items and duplicates: >>> words = 'the fox jumped over the lazy dog and over the bear'.split() >>> seen = set() >>> dups = set() >>> for word in words: if word in seen: if word not in dups: print(word) dups.add(word) else: seen.add(word) the over

with open (r"specify the path of the file") as f: data = f.read() if(set([i for i in data if f.count(f)>1])): print "Duplicates found" else: print "None"

def Counter(text): d = {} for word in text.split(): d[word] = d.get(word,0) + 1 return d there is loops :/ to split on punctionation just us matches = re.split("[!.?]",my_corpus) for match in matches: print Counter(match)

Related

How to check if a txt file exists a keyword in another txt file?

Remove conjunction from file.txt and punctuation from user input

Why arent the individual words printed?

How to look into a list and varify the existing of elements inside the list?

Empty list when converting from set

Categories

Resources