Duplicates within a sentence of a text file in Python

Hi, I want to write code that reads a text file and identifies the sentences that contain duplicated words. I was thinking of putting each sentence of the file in a dictionary and finding which sentences have duplicates. Since I am new to Python, I need some help writing the code.
This is what I have so far:
def Sentences():
    def Strings():
        l = string.split('.')
        for x in range(len(l)):
            print('Sentence', x + 1, ': ', l[x])
        return
    text = open('Rand article.txt', 'r')
    string = text.read()
    Strings()
    return
The code above splits the file into sentences.

Suppose you have a file where each line is a sentence, e.g. "sentences.txt":
I contain unique words.
This sentence repeats repeats a word.
The strategy could be to split the sentence into its constituent words, then use a set to find the unique words in the sentence. If the resulting set is shorter than the list of all words, you know the sentence contains at least one duplicated word:
sentences_with_dups = []
with open("sentences.txt") as fh:
    for sentence in fh:
        words = sentence.split(" ")
        if len(set(words)) != len(words):
            sentences_with_dups.append(sentence)
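One caveat worth noting (my addition, not part of the original answer): splitting on a single space leaves newlines and punctuation attached to words, so "repeats." and "repeats" count as different strings. A hedged variant that normalizes case and strips surrounding punctuation before comparing:
import string

sentences_with_dups = []
with open("sentences.txt") as fh:
    for sentence in fh:
        # normalize each word: lowercase, strip surrounding punctuation
        words = [w.strip(string.punctuation).lower() for w in sentence.split()]
        if len(set(words)) != len(words):
            sentences_with_dups.append(sentence)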

Related

Python program which can count and identify the number of acronyms in a text file

I have tried the code below; any suggestions and help are appreciated. To be more specific, I want to create a Python program that can count and identify the acronyms in a text file. The output of the program should display every acronym present in the specified text file and how many times each of those acronyms occurred in the file.
*Note: the code below is not giving the desired output. Any help and suggestions are appreciated.
Link for the text file, so you can have a look: https://drive.google.com/file/d/1zlqsmJKqGIdD7qKicVmF0W6OgF5-g7Qk/view?usp=sharing
The text file contains various acronyms. I basically want to write a Python script to identify those acronyms and count how many times each of them occurs. The acronyms vary: they can be 2 or more letters, in either lowercase or capital letters. For further reference, please have a look at the text file provided on Google Drive.
Any updated code is also appreciated.
acronyms = 0  # number of acronyms
# open file Larex_text_file.txt in read mode with name file
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = str(file.read())
import re
print(re.sub("([a-zA-Z]\.*){2,}s?", "", text))
for line in text:  # for every line in file
    for word in line.split(' '):  # for every word in line
        if word.isupper():  # if word is all uppercase letters
            acronyms += 1
print("Number of acronyms:", acronyms)  # print number of acronyms
After building a small text file and trying out your code, I came up with a couple of tweaks that simplify it while still counting the words in the text file that are all uppercase letters.
acronyms = 0  # number of acronyms
# open file Larex_text_file.txt in read mode with name file
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = str(file.read())
for word in text.split(' '):  # for every word in the text
    if word.isupper() and word.isalpha():  # if word is all uppercase letters
        acronyms += 1
print("Number of words that are all uppercase:", acronyms)
First off, a simple loop runs through the words split out from the read text, and the program checks that each word is all alphabetic and that all of its letters are uppercase.
To test, I built a small text file with some words all in uppercase.
NASA and UCLA have teamed up with the FBI and JPL.
also UNICEF and the WWE have teamed up.
With that, there should be five words that are all uppercase.
And, when run, this was the output on the terminal.
#Una:~/Python_Programs/Acronyms$ python3 Acronym.py
Number of words that are all uppercase: 5
You will note that I am being a bit pedantic here, referring to the count of "uppercase" words rather than calling them acronyms. I am not sure whether you are attempting to derive true acronyms, but if you are, this link might help:
Acronyms
Give that a try to see if it meets the spirit of your project.
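If you do want to go beyond "all-uppercase word" detection, here is a hedged pattern-based sketch. The regular expression below is my own assumption about what counts as an acronym (two or more capital letters, optionally period-separated, e.g. NASA or U.S.A.); adjust it to your definition:
import re

# acronym-like tokens: two or more capitals each followed by a period (U.S.A.),
# or a run of two or more capital letters (NASA)
ACRONYM = re.compile(r'\b(?:[A-Z]\.){2,}|\b[A-Z]{2,}\b')

sample = "NASA and UCLA have teamed up with the FBI and JPL."
print(ACRONYM.findall(sample))  # ['NASA', 'UCLA', 'FBI', 'JPL']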
Answer to the question:
acronyms = 0  # number of acronyms
acronym_word = []  # list to collect the acronyms found
# open file Larex_text_file.txt in read mode with name file
with open('Larex_text_file.txt', "r", errors='ignore') as file:
    text = str(file.read())
for word in text.split(' '):  # for every word in the text
    if word.isupper() and word.isalpha():  # if word is all uppercase letters
        acronyms += 1
        if len(word) == 1:  # ignore single-character words; they are not acronyms
            pass
        else:
            acronym_word.append(word)  # store every acronym found in the file
uniqWords = sorted(set(acronym_word))  # remove duplicate words and sort the list of acronyms
for word in uniqWords:
    print(word, ":", acronym_word.count(word))
From your comments, it sounds like every acronym appears at least once as an all-uppercase word, and can then appear several more times in lowercase.
I suggest making two passes on the text: a first time to collect all uppercase words, and a second pass to search for every occurrence, case-insensitive, of the words you collected on the first pass.
You can use collections.Counter to quickly count words.
You can use ''.join(filter(str.isalpha, word.lower())) to strip a word of its non-alphabetical characters and disregard its case.
In the code snippet below, I used io.StringIO to emulate opening a text file.
from io import StringIO
from collections import Counter

text = '''First we have CR and PU as uppercase words. A word which first
appeared as uppercase can also appear as lowercase.
For instance, cr and pu appear in lowercase, and pu appears again.
And again: here is a new occurrence of pu.
An acronym might or might not have punctuation or numbers in it: CR-1,
C.R., cr.
A word that contains only a single letter will look like an acronym
if it ever appears as the first word of a sentence.'''

# with open('path/to/file.txt', 'r') as f:
with StringIO(text) as f:
    counts = Counter(''.join(filter(str.isalpha, word.lower()))
                     for line in f for word in line.split())
    f.seek(0)
    uppercase_words = set(''.join(filter(str.isalpha, word.lower()))
                          for line in f
                          for word in line.split() if word.isupper())

acronyms = Counter({w: c for w, c in counts.items() if w in uppercase_words})
print(acronyms)
# Counter({'cr': 5, 'a': 5, 'pu': 4})

remove only the unknown words from a text but leave punctuation and digits

I have a text in French containing words that are split by a space (e.g. répu blique). I want to remove these split words from the text and append them to a list, while keeping punctuation and digits in the text. My code works for appending the words that are split, but it does not keep the digits in the text.
import nltk
from nltk.tokenize import word_tokenize
import re

# opening the text containing the split words
with open('french_text.txt') as tx:
    text = word_tokenize(tx.read().lower())  # stores the text with the split words
with open('Fr-dictionary.txt') as fr:  # opens the dictionary
    dic = word_tokenize(fr.read().lower())  # stores the dictionary
pat = re.compile(r'[.?\-",:]+|\d+')
out_file = open("newtext.txt", "w")  # defining name of output file
valid_words = []  # empty list for the words accepted by the dictionary
invalid_words = []  # empty list for the errors found
for word in text:
    reg = pat.findall(word)
    if reg is True:
        valid_words.append(word)
    elif word in dic:
        valid_words.append(word)  # appending the checked words to a list
    else:
        invalid_words.append(word)  # appending the invalid words
a = ' '.join(valid_words)  # converting the list into a string
print(a)  # print the converted list
print(invalid_words)  # print the errors found
out_file.write(a)  # writing the output to a file
out_file.close()
So, with this code, my list of errors comes with the digits:
['ments', 'prési', 'répu', 'blique', 'diri', 'geants', '»', 'grand-est', 'elysée', 'emmanuel', 'macron', 'sncf', 'pepy', 'montparnasse', '1er', '2017.', 'geoffroy', 'hasselt', 'afp', 's', 'empare', 'sncf', 'grand-est', '26', 'elysée', 'emmanuel', 'macron', 'sncf', 'saint-dié', 'epinal', '23', '2018', 'etat', 's', 'vosges', '2018']
I think the problem is with the regular expression. Any suggestions? Thank you!!
The problem is with your if statement where you check reg is True. You should not use the is operator with True to check whether pat.findall(word) found a match: findall returns a list, and a list is never the same object as True, so that branch never runs.
You can do this instead:
for word in text:
    if pat.match(word):
        valid_words.append(word)
    elif word in dic:
        valid_words.append(word)  # appending the checked words to a list
    else:
        invalid_words.append(word)  # appending the invalid words
Caveat user: this is actually a complex problem, because it all depends on what we define to be a word:
is l’Académie a single word, how about j’eus ?
is gallo-romanes a single word, or c'est-à-dire?
how about J.-C.?
and xivᵉ (with a superscript, as in 14ᵉ siècle)?
and then QDN or QQ1 or LOL?
Here's a direct solution, that's summarised as:
break up text into "words" and "non-words" (punctuation, spaces)
validate "words" against a dictionary
import re

# Adjust this to your locale
WORD = re.compile(r'\w+')

text = "foo bar, baz"
while True:
    m = WORD.search(text)
    if not m:
        if text:
            print(f"punctuation: {text!r}")
        break
    start, end = m.span()
    punctuation = text[:start]
    word = text[start:end]
    text = text[end:]
    if punctuation:
        print(f"punctuation: {punctuation!r}")
    print(f"possible word: {word!r}")
This prints:
possible word: 'foo'
punctuation: ' '
possible word: 'bar'
punctuation: ', '
possible word: 'baz'
I get a feeling that you are trying to deal with intentionally misspelt / broken up words, e.g. if someone is trying to get around forum blacklist rules or speech analysis.
Then, a better approach would be:
identify what might be a "word" or "non-word" using a dictionary
then break up the text
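As a hedged sketch of that dictionary-driven idea (my assumption, not the asker's code: dic is a set of known words, such as the one loaded by the word_tokenize call above, and the merge rule simply tries to rejoin adjacent tokens):
def merge_split_words(tokens, dic):
    # greedily rejoin adjacent tokens when their concatenation is a known word
    result = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and tokens[i] + tokens[i + 1] in dic:
            result.append(tokens[i] + tokens[i + 1])  # e.g. 'répu' + 'blique'
            i += 2
        else:
            result.append(tokens[i])
            i += 1
    return result

print(merge_split_words(['la', 'répu', 'blique'], {'la', 'république'}))
# ['la', 'république']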
If the original text was made to evade computers but be readable by humans, your best bet would be ML/AI, most likely a neural network, like the RNNs used to identify objects in images.

How to find length of words in a file in Python

I am fairly new to files in Python and want to find the words in a file that have, say, 8 letters in them, print them, and keep a numerical total of how many there actually are. Can you look through a file as if it were a very large string, or is there a specific way it has to be done?
You could use Python's Counter for doing this:
from collections import Counter
import re

with open('input.txt') as f_input:
    text = f_input.read().lower()

words = re.findall(r'\b(\w+)\b', text)
word_counts = Counter(w for w in words if len(w) == 8)

for word, count in word_counts.items():
    print(word, count)
This works as follows:
It reads in a file called input.txt, as one very long string.
It then converts it all to lowercase to make sure the same words with different case are counted as the same word.
It uses a regular expression to split all of the text into a list of words.
It uses a generator expression to store any word that has a length of 8 characters into a Counter.
It displays all of the matching entries along with the counts.
Try this code, where "eight_l_words" is an array of all the eight letter words and where "number_of_8lwords" is the number of eight letter words:
# defines text to be used
your_file = open("file_location", "r+")
text = your_file.read()
# divides the text into lines and defines some arrays
lines = text.split("\n")
words = []
eight_l_words = []
# iterating through "lines", adding each separate word to the "words" array
for each in lines:
    words += each.split(" ")
# checking to see if each word in the "words" array is 8 chars long, and if so
# appending that word to the "eight_l_words" array
for each in words:
    if len(each) == 8:
        eight_l_words.append(each)
# finding the number of eight letter words
number_of_8lwords = len(eight_l_words)
# displaying results
print(eight_l_words)
print("There are " + str(number_of_8lwords) + " eight letter words")
Running the code with
text = "boomhead shot\nshamwow slapchop"
Yields the results:
['boomhead', 'slapchop']
There are 2 eight letter words
There's a useful post from 2 years ago called "How to split a text file to its words in python?"
How to split a text file to its words in python?
It describes splitting the line by whitespace. If you have punctuation such as commas and full stops in there, then you'll have to be a bit more sophisticated. There's help here: "Python - Split Strings with Multiple Delimiters" Split Strings with Multiple Delimiters?
You can use the function len() to get the length of each individual word.
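For example, a hedged sketch combining the multi-delimiter approach from those links with len() (the delimiter set in the pattern is my assumption; extend it as needed):
import re

text = "boomhead shot,\nshamwow! slapchop."
# split on any run of whitespace or common punctuation, dropping empty strings
words = [w for w in re.split(r"[\s,.:;!?]+", text) if w]
eight_l_words = [w for w in words if len(w) == 8]
print(eight_l_words)       # ['boomhead', 'slapchop']
print(len(eight_l_words))  # 2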

Automatically separating words into letters?

So I have this code:
import sys  # the 'sys' module lets us read command line arguments

words1 = open(sys.argv[2], 'r')  # sys.argv[2] is your dictionary text file
words = str(words1.read())

def main():
    # Get the dictionary to search
    if len(sys.argv) != 3:
        print("Proper format: python filename.py scrambledword filename.txt")
        exit(1)  # the non-zero return code indicates an error
    scrambled = sys.argv[1]
    print(sys.argv[1])
    unscrambled = sorted(scrambled)
    print(unscrambled)
    for line in words:
        print(line)
When I print words, it prints the words in the dictionary, one word at a time, which is great. But as soon as I try to do anything with those words, like in my last two lines, it automatically separates the words into letters and prints one letter per line. Is there any way to keep the words together? My end goal is to do ordered = sorted(line), and then, if ordered == unscrambled, have it print the original word from the dictionary.
Your words is an instance of str. You should use split to iterate over words:
for word in words.split():
    print(word)
A for-loop takes one element at a time from the "sequence" you pass it. You have read the contents of your file into a single string, so Python treats it as a sequence of letters. What you need to do is convert it into a list yourself: split it into a list of strings that are as large as you like:
lines = words.splitlines()  # makes a list of lines
for line in lines:
    ....
Or
wordlist = words.split()  # makes a list of "words", by splitting at whitespace
for word in wordlist:
    ....
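To reach the end goal described in the question, here is a hedged sketch (the argv layout and file names follow the question's format; comparing sorted letters finds anagrams of the scrambled word):
import sys

def main():
    if len(sys.argv) != 3:
        print("Proper format: python filename.py scrambledword filename.txt")
        exit(1)
    unscrambled = sorted(sys.argv[1])  # the scrambled word's letters, sorted
    with open(sys.argv[2]) as f:  # the dictionary file, one word per line
        for line in f:
            word = line.strip()
            if sorted(word) == unscrambled:  # same letters => anagram match
                print(word)

main()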

Optimize a find and match code in Python

I have a code which takes as input two files:
(1) a dictionary/lexicon
(2) a text file (one sentence per line)
The first part of my code reads the dictionary into tuples and outputs something like:
('mthy3lkw', 'weakBelief', 'U')
('mthy3lkm', 'firmBelief', 'B')
('mthy3lh', 'notBelief', 'A')
The second part of the code searches each sentence in the text file for the words in position 0 of those tuples and then prints out the sentence, the search word, and its type.
So, given the sentence mthy3lkw ana mesh 3arif, the desired output is:
["mthy3lkw ana mesh 3arif", 'mthy3lkw', 'weakBelief', 'U'] given that the highlighted word is found in the dictionary.
The second part of my code - the matching part - is TOO slow. How do I make it faster?
Here is my code
findings = []
for sentence in data:  # I open the sentences file with .readlines()
    for word in tuples:  # tuples similar to the ones mentioned above
        p1 = re.compile('\\b%s\\b' % word[0])  # get the first word in every tuple
        if p1.findall(sentence) and word[1] == "firmBelief":
            findings.append([sentence, word[0], "firmBelief"])
print(findings)
Build a dict lookup structure so you can find the correct tuple quickly. Then you can restructure your loops: instead of going through your whole dictionary for each sentence, trying to match every entry, you go over each word in the sentence and look it up in the dictionary dict:
# Create a lookup structure for words
word_dictionary = dict((entry[0], entry) for entry in tuples)

findings = []
word_re = re.compile(r'\b\S+\b')  # only need to compile the regexp once
for sentence in data:
    for word in word_re.findall(sentence):  # check every word in the sentence
        if word in word_dictionary:  # a match was found
            entry = word_dictionary[word]
            findings.append([sentence, word, entry[1], entry[2]])
Convert your list of tuples into a trie, and use that for searching.
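For reference, a minimal trie sketch (my assumption, not the answerer's code: tuples and data are the structures from the question, and the '_end' marker is one common way to build a dict-based trie):
def build_trie(words):
    # nested dicts, one level per character; '_end' marks a complete word
    root = {}
    for w in words:
        node = root
        for ch in w:
            node = node.setdefault(ch, {})
        node['_end'] = True
    return root

def in_trie(trie, word):
    # walk the trie character by character; fail fast on a missing branch
    node = trie
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return '_end' in node

trie = build_trie(entry[0] for entry in tuples)
for sentence in data:
    for word in sentence.split():
        if in_trie(trie, word):
            print([sentence, word])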
