Check if a string contains any items of a set of strings? - python

I have a text file with one sentence on each line, and I have a word list. I want to keep only the sentences that contain at least one word from the list. Is there a Pythonic way to do that?

sentences = [line for line in f if any(word in line for word in word_list)]
Here f is your file object; for example, you could replace it with open('file.txt') if file.txt is the name of your file and it is located in the current working directory (usually the directory you run the script from).

Using set.intersection (word_set is a set of lowercase words):
with open('file') as f:
    matches = [line for line in f if set(line.lower().split()).intersection(word_set)]
or with filter:
with open('file') as f:
    matches = list(filter(lambda x: word_set.intersection(x.lower().split()), f))

This will give you a start:
words = ['a', 'and', 'foo']
match_sentences = []
with open('myfile.txt', 'r') as infile:
    for line in infile:
        # check for words in this line
        if any(word in line.split() for word in words):
            # if match, append to match_sentences list
            match_sentences.append(line)
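One thing to keep in mind with the first answer: `word in line` is a substring test, so a word such as 'cat' would also match a line containing 'concatenate'. Below is a minimal sketch of a whole-word variant, assuming that words are runs of letters, digits, and underscores and that matching should ignore case. The function name sentences_with_words and the sample data are made up for illustration.

```python
import re

def sentences_with_words(lines, word_list):
    """Return the lines that contain at least one whole word from word_list."""
    wanted = {w.lower() for w in word_list}
    matches = []
    for line in lines:
        # \w+ splits on punctuation and whitespace, so "foo," still yields "foo"
        tokens = set(re.findall(r"\w+", line.lower()))
        if tokens & wanted:
            matches.append(line)
    return matches

# Hypothetical sample data standing in for the lines of a file
sample = ["The cat sat.\n", "We concatenate strings.\n", "Foo, and bar!\n"]
result = sentences_with_words(sample, ["cat", "foo"])
print(result)
```

Any iterable of lines works, so an open file object can be passed in place of the sample list.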

Related

How to get the longest word in txt file python

article = open("article.txt", encoding="utf-8")
for i in article:
    print(max(i.split(), key=len))
The text is written with line breaks, and it gives me the longest words from each line. How to get the longest word from all of the text?
One approach would be to read the entire text file into a Python string, replace the newlines with spaces (so that words on adjacent lines are not glued together), and then find the longest word:
import re
with open('article.txt', 'r') as file:
    data = re.sub(r'\r?\n', ' ', file.read())
longest_word = max(re.findall(r'\w+', data), key=len)
longest = 0
curr_word = ""
with open("article.txt", encoding="utf-8") as f:
    for line in f:  # iterate line by line to avoid loading a large file into memory
        for word in line.split(" "):
            word = word.strip()
            if (wl := len(word)) > longest:  # Python 3.8+, otherwise use 2 lines
                longest = wl
                curr_word = word
print(curr_word)
Instead of iterating through each line, you can read the entire text of the file and then split it using article.read().split():
article = open("test.txt", encoding="utf-8")
print(max(article.read().split(), key=len))
article.close()
There are many ways to do that. This would work:
with open("article.txt", encoding="utf-8") as article:
    txt = [word for item in article.readlines() for word in item.split()]
biggest_word = sorted(txt, key=lambda word: (-len(word), word))[0]
Note that I am using a with statement to close the file when the reading is done, that I use readlines to read the entire file, returning a list of lines, and that I use two for clauses in the comprehension to get a flat list of words. The last line of code sorts the list and uses -len(word) to reverse the order from ascending to descending length.
I hope this is what you are looking for :)
If your file is small enough to fit in memory, you can read all the lines at once.
file = open("article.txt", encoding="utf-8", mode='r')
all_text = file.read()
file.close()
longest = max(all_text.split(), key=len)
print(longest)
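The answers above differ mainly in whether they load the whole file and in how they treat punctuation and empty lines. As a hedged sketch that combines the line-by-line reading with the \w+ tokenization, the helper below (longest_word is a made-up name, and the sample lines are invented) accepts any iterable of lines and returns an empty string when there are no words at all.

```python
import re

def longest_word(lines):
    """Return the longest word across all lines, or '' if there are no words."""
    words = (w for line in lines for w in re.findall(r"\w+", line))
    # max() keeps the first of several equally long words; default covers empty input
    return max(words, key=len, default="")

# Hypothetical sample data standing in for the lines of article.txt
sample = ["short words here\n", "\n", "an extraordinarily long one\n"]
result = longest_word(sample)
print(result)
```

With a real file, the call would look like longest_word(f) inside a with open("article.txt", encoding="utf-8") as f: block.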

Completely deleting duplicates words in a text file

I have some words in a text file like:
joynal
abedin
rahim
mohammad
joynal
abedin
mohammad
kudds
I want to delete the duplicate names, so that every name that appears more than once is removed from the text file completely.
The output should be like:
rahim
kudds
I have tried some code, but it only collapses the duplicates into a single entry each, so joynal and abedin still appear once.
Edited: This is the code I tried:
content = open('file.txt' , 'r').readlines()
content_set = set(content)
cleandata = open('data.txt' , 'w')
for line in content_set:
    cleandata.write(line)
Use a Counter:
from collections import Counter
with open(fn) as f:
    cntr = Counter(w.strip() for w in f)
Then just print the words with a count of 1:
>>> print('\n'.join(w for w,cnt in cntr.items() if cnt==1))
rahim
kudds
Or do it the old-fashioned way with a dict as a counter:
cntr = {}
with open(fn) as f:
    for line in f:
        k = line.strip()
        cntr[k] = cntr.get(k, 0) + 1
>>> print('\n'.join(w for w,cnt in cntr.items() if cnt==1))
# same
If you want to output to a new file:
with open(new_file, 'w') as f_out:
    f_out.write('\n'.join(w for w,cnt in cntr.items() if cnt==1))
You can build a list that appends a name if it is not in the list yet and removes it when it occurs a second time. Note that this only works when no name appears more than twice, because a third occurrence would be appended again.
with open("file1.txt", "r") as f, open("output_file.txt", "w") as g:
    output_list = []
    for line in f:
        word = line.strip()
        if word not in output_list:
            output_list.append(word)
        else:
            output_list.remove(word)
    g.write("\n".join(output_list))
print(output_list)
['rahim', 'kudds']
# in the output file there is one name per row, like this:
rahim
kudds
The solution with counter is still the more elegant way imo
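To illustrate why counting is the more robust choice: a name that occurs three times would end up back in the list with the append/remove approach, whereas a count handles any number of repeats. The sketch below is only an illustration; names_seen_once is a made-up helper name, and the sample list is the question's data with one extra "joynal" added.

```python
from collections import Counter

def names_seen_once(lines):
    """Return the names that occur exactly once, in order of first appearance."""
    counts = Counter(line.strip() for line in lines if line.strip())
    # Counter is a dict subclass, so on Python 3.7+ it iterates in insertion order
    return [name for name, n in counts.items() if n == 1]

# The question's data plus a third "joynal" (added here for illustration)
names = ["joynal", "abedin", "rahim", "mohammad",
         "joynal", "abedin", "mohammad", "kudds", "joynal"]
result = names_seen_once(names)
print("\n".join(result))
```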
For completeness, if you don't care about order:
with open(fn) as f:
    words = set(x.strip() for x in f)
with open(new_fn, "w") as f:
    f.write("\n".join(words))
Where fn is the file you want to read from, and new_fn the file you want to write to.
In general, for uniqueness think set, remembering that order is not guaranteed.
file = open("yourFile.txt") # open file
text = file.read() # returns content of the file
file.close()
wordList = text.split() # creates list of every word
wordList = list(dict.fromkeys(wordList)) # removes duplicate elements, keeping first-seen order
result = " ".join(wordList) # creates a string that contains every word (avoids shadowing the built-in str)
file = open("yourFile.txt", "w")
file.write(result) # writes the new string in the file
file.close()

How to create a list that contains the first word of every line in a file (Python 3)

The file is called "emotion_words", and I want the first word of each line in it.
I want to use a nested for loop, but I am not sure how.
Would I do this:
emotions=open("emotion_words.txt","r+")
content = emotions.read()
for line in content.split(' ',1):
And add an append function before the second for loop?
with open("emotion_words.txt","r+") as f:
    for line in f:
        first_word_in_line = line.split(" ")[0]
fileref = open("emotion_words.txt", "r")
line = fileref.readlines()
fileref.close()
emotions = []
for words in line:
    word = words.split()
    emotions.append(word[0])
print(emotions)
If I understand your question correctly, this should work for you:
words = []
emotions = open("emotion_words.txt", "r+")
for l in emotions:
    first_word = l.split()[0]
    words.append(first_word)
After that you have your words in the 'words' list.
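One caveat that applies to the split()[0] answers: on an empty or whitespace-only line, split() returns an empty list, so indexing [0] raises IndexError. A minimal sketch that skips such lines follows; first_words is a made-up helper name, and the sample lines are invented.

```python
def first_words(lines):
    """Return the first word of every non-blank line."""
    result = []
    for line in lines:
        parts = line.split()
        if parts:  # skip empty or whitespace-only lines
            result.append(parts[0])
    return result

# Hypothetical sample data standing in for the lines of emotion_words.txt
sample = ["happy joyful glad\n", "\n", "  sad unhappy\n"]
words = first_words(sample)
print(words)
```

Since a file object iterates over its lines, first_words(f) would work the same way inside a with open("emotion_words.txt") as f: block.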

Counting number of words in Python file

I'm trying to count how many times each of several words appears in a file.
Here is my code:
#!/usr/bin/env python
file = open('my_output', 'r')
word1 = 'wordA'
print('wordA', file.read().split().count(word1))
word2 = 'wordB'
print('wordB', file.read().split().count(word2))
word3 = 'wordC'
print('wordC', file.read().split().count(word3))
The issue in the code is that it only counts the number of instances of word1. How can this code be fixed to count word2 and word3?
Thanks!
I think that instead of repeatedly reading and splitting the file, it would work better to read it once and build a term-frequency dictionary; this way you can look up the frequency of any word found in the file:
file = open('my_output', 'r')
s = file.read()
file.close()
s = s.split()
w = set(s)
tf = {}
for i in w:
    tf[i] = s.count(i)
print(tf)
The main problem is that file.read() consumes the file. Thus the second time you search you end up searching an empty file. The simplest solution is to read the file once (if it is not too large) and then just search the previously read text:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    text = file.read()
word1 = 'wordA'
print('wordA', text.split().count(word1))
word2 = 'wordB'
print('wordB', text.split().count(word2))
word3 = 'wordC'
print('wordC', text.split().count(word3))
To improve performance it is also possible to split only once:
#!/usr/bin/env python
with open('my_output', 'r') as file:
    split_text = file.read().split()
word1 = 'wordA'
print('wordA', split_text.count(word1))
word2 = 'wordB'
print('wordB', split_text.count(word2))
word3 = 'wordC'
print('wordC', split_text.count(word3))
Using with will also ensure that the file closes correctly after being read.
Can you try this:
file = open('my_output', 'r')
splitFile = file.read().split()
lst = ['wordA','wordB','wordC']
for wrd in lst:
    print(wrd, splitFile.count(wrd))
Short solution using collections.Counter object:
import collections
with open('my_output', 'r') as f:
    wordnames = ('wordA', 'wordB', 'wordC')
    counts = (i for i in collections.Counter(f.read().split()).items() if i[0] in wordnames)
    for c in counts:
        print(c[0], c[1])
For the following sample text line:
'wordA some dfasd asdasdword B wordA sdfsd sdasdasdddasd wordB wordC wordC sdfsdfsdf wordA'
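we would obtain the following output (on Python 3.7+ the order follows first appearance in the text; older versions may print a different order):
wordA 3
wordB 1
wordC 2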
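Note that filtering the Counter's items only reports words that actually occur in the text. If a zero count should be reported as well, one option is to index the Counter directly, since a missing key yields 0. The sketch below reuses the sample line from above; wordD is a made-up word added to show the zero case.

```python
from collections import Counter

text = ("wordA some dfasd asdasdword B wordA sdfsd sdasdasdddasd "
        "wordB wordC wordC sdfsdfsdf wordA")
wordnames = ("wordA", "wordB", "wordC", "wordD")  # wordD never occurs (illustration only)

counts = Counter(text.split())
# Indexing a Counter with a missing key returns 0 instead of raising KeyError
report = {name: counts[name] for name in wordnames}
for name, n in report.items():
    print(name, n)
```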
in your code the file is consumed (exhausted) in the first line so the next lines will not return anything to count: the first file.read() reads the whole contents of the file and returns it as a string. the second file.read() has nothing left to read and just returns an empty string '' - as does the third file.read() .
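The exhaustion described above can be reproduced without a file on disk. In this small sketch, io.StringIO stands in for the opened file (an assumption made purely for the demo), and seek(0) shows one way to rewind if a second read is really needed.

```python
import io

# io.StringIO stands in for an open text file so the demo needs no file on disk
fake_file = io.StringIO("wordA wordB wordA\n")

first = fake_file.read()    # whole contents
second = fake_file.read()   # '' because the position is already at the end

fake_file.seek(0)           # rewind to the start
third = fake_file.read()    # whole contents again

print(repr(first), repr(second), repr(third))
```

Reading once and reusing the result is usually simpler than rewinding.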
this is a version that should do what you want:
from collections import Counter
counter = Counter()
with open('my_output', 'r') as file:
    for line in file:
        counter.update(line.split())
print(counter)
you may have to do some preprocessing (in order to get rid of special characters and , and . and what not).
Counter is in the python standard library and is very useful for exactly that kind of thing.
note that this way you iterate once over the file only and you do not have to store the whole file in memory at any time.
if you only want to keep track of certain words you could select only them instead of passing the whole line to a counter:
from collections import Counter
import string
counter = Counter()
words = ('wordA', 'wordB', 'wordC')
chars_to_remove = str.maketrans('', '', string.punctuation)
with open('my_output', 'r') as file:
    for line in file:
        line = line.translate(chars_to_remove)
        w = (word for word in line.split() if word in words)
        counter.update(w)
print(counter)
i also included an example of what i meant with preprocessing: punctuation will be removed before counting.
from collections import Counter
# Create an empty word_list which stores each of the words from a line.
word_list = []
# file_handle to refer to the file object
file_handle = open(r'my_file.txt', 'r+')
# read all the lines in a file
for line in file_handle.readlines():
    # get each line,
    # split each line into list of words
    # extend those returned words into the word_list
    word_list.extend(line.split())
# close the file object
file_handle.close()
# Pass the word_list to Counter() and get the dictionary of the words
dictionary_of_words = Counter(word_list)
print(dictionary_of_words)

Search for a string from file in python

I am reading a file with a different string on each line. I want to be able to search an input string for a substring that matches an entire line in the file and then save that substring so that it can be printed. This is what I have right now:
wordsDoc = open('Database.doc', 'r', encoding='latin-1')
words = wordsDoc.read().lower()
matching = [string for string in words if string in op_text]
But this matches on each character. How would I do this properly?
This will create a list named "matching" containing all the lines in the file that exactly match the string in op_text, once lowercased and stripped of the trailing newline (op_text is assumed to be lowercase already).
with open('Database.doc', 'r', encoding='latin-1') as wordsDoc:
    matching = [line for line in wordsDoc if op_text == line.strip().lower()]
I assume the idea is that there is some search phrase and, if it is contained in any line from the file, you want to select those lines.
Try this, which compares against the lowercased version of the line but returns the original line from the file if it contains search_key.
with open('somefile.doc') as f:
    matching = [line for line in f if search_key in line.lower()]
A couple of comments:
First, using with to open a file is usually better:
with open('Database.doc', 'r', encoding='latin-1') as f:
    # closes the file automagically at the end of this block...
Second, there is no need to read in the whole file unless you are doing something with the file as a whole. Since you are searching lines, deal with the lines one by one:
matches=[]
with open('Database.doc', 'r', encoding='latin-1') as f:
    for line in f:
        if string in line.lower():
            matches.append(line)
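If you are trying to match the entire line (strip the trailing newline first, or the comparison fails on every line that ends with one):
matches=[]
with open('Database.doc', 'r', encoding='latin-1') as f:
    for line in f:
        if string == line.strip().lower():
            matches.append(line)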
Or, more Pythonically, with a list comprehension:
with open('Database.doc', 'r', encoding='latin-1') as f:
    matches=[line for line in f if line.strip().lower()==string]
etc...
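The question can also be read the other way round: find which whole lines of the file occur somewhere inside the input string. A hedged sketch of that reading follows, assuming case-insensitive matching is wanted; lines_found_in is a made-up helper name, and the sample lines and op_text value are invented for illustration.

```python
def lines_found_in(text, lines):
    """Return the stripped lines that occur as substrings of text, ignoring case."""
    haystack = text.lower()
    found = []
    for line in lines:
        candidate = line.strip()
        # skip blank lines: an empty string is a substring of everything
        if candidate and candidate.lower() in haystack:
            found.append(candidate)
    return found

# Hypothetical sample data standing in for the file's lines and the input string
database_lines = ["Apple Pie\n", "\n", "banana\n", "cherry tart\n"]
op_text = "I would like some apple pie and a BANANA, please."
matching = lines_found_in(op_text, database_lines)
print(matching)
```

Passing an open file object as the second argument would check each line of the file against the input string in the same way.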
