Python file handling can't find all occurrences of a word

I'm trying to count the number of times a word appears in a file using Python file handling. For example, I was searching for 'believer' in the lyrics of the song "Believer" to see how many times it occurs. It appears 18 times, but my program reports 12. What conditions am I missing?
def no_words_function():
    f = open("believer.txt", "r")
    data = f.read()
    cnt = 0
    ws = input("Enter word to find: ")
    word = data.split()
    for w in word:
        if w in ws:
            cnt += 1
    f.close()
    print(ws, "found", cnt, "times in the file.")

no_words_function()

If you want the search to ignore case (assuming the entered word is in lower case), you can change the loop as below:
for w in word:
    if ws.lower() in w.lower():
        cnt += 1

You are not cleaning the data of trailing characters such as ',', '"', '.' etc. This means your code will not find "believer," in the text.
You are also not doing case-insensitive comparisons. This means your code will not find "Believer" in the text. Based on your search needs, you might want to do that.
For cleaning data:
word = data.split()
word = [w.strip("'\".,") for w in word] # Add other trailing characters you do not want
For case-insensitive search:
word = [w.lower() for w in word]

The reason you only find 12 of the 18 times "believer" occurs is the test inside the for loop.
Instead of writing
if w in ws:
    cnt += 1
you should reverse the order:
if ws in w:
    cnt += 1
To understand why, let's look at one of the lines in your text: You break me down, you build me up, believer, believer. If you split this line you get the following result:
line = "You break me down, you build me up, believer, believer"
line.split()
Out[26]:
['You', 'break', 'me', 'down,',
'you', 'build', 'me', 'up,',
'believer,', 'believer']
As you can see, the ninth element in this list is 'believer,'. If you test 'believer,' in 'believer' the result will be False. However, if you test 'believer' in 'believer,' the result will be True.
As others have mentioned, it is also a good idea to convert the search string and your search word to lower case, if you want to ignore case.
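Putting the suggestions together (reversed containment test, stripping trailing punctuation, and case-insensitive comparison), a minimal sketch of the corrected function could look like this; it keeps the filename and prompt from the question:
def no_words_function():
    cnt = 0
    ws = input("Enter word to find: ").lower()
    with open("believer.txt", "r") as f:
        data = f.read()
    for w in data.split():
        w = w.strip("'\".,!?").lower()   # drop surrounding punctuation, ignore case
        if ws in w:                      # search word inside the cleaned token
            cnt += 1
    print(ws, "found", cnt, "times in the file.")

no_words_function()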

Related

How to check generated strings against a text file

I'm trying to have the user input a string of characters with one asterisk. The asterisk indicates a character that can be subbed out for a vowel (a, e, i, o, u) in order to see which substitutions produce valid words.
Essentially, I want to take an input "l*g" and have it return "lag, leg, log, lug", because "lig" is not a valid English word. Below, invalid words are represented as "x".
I've gotten it to properly output each possible combination (e.g., including "lig"), but once I try to compare these words with the text file I'm referencing (for the list of valid words), it'll only return 5 lines of x's. I'm guessing I'm improperly importing or reading the file?
Here's the link to the file I'm looking at so you can see the formatting:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/packages/corpora/words.zip
I'm using the "en" file (~2.5 MB).
It's not in a dictionary layout, i.e. there are no corresponding keys/values, just lines (maybe I could use the line number as the index, but I don't know how to do that). What can I change to check the test words and narrow down which are valid words based on the text file?
import os.path

with open(os.path.expanduser('~/Downloads/words/en')) as f:
    words = f.readlines()

inputted_word = input("Enter a word with ' * ' as the missing letter: ")

letters = []
for l in inputted_word:
    letters.append(l)

### find the index of the blank
asterisk = inputted_word.index('*') # also used a redundant int(), works fine

### sub in vowels
vowels = ['a', 'e', 'i', 'o', 'u']
list_of_new_words = []
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters)
    list_of_new_words.append(new_word)

for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
There are probably more efficient ways to do this, but I'm brand new to this. The last two for loops could probably be combined but debugging it was tougher that way.
print(list_of_new_words)
gives
['lag', 'leg', 'lig', 'log', 'lug']
So far, so good.
But this:
for w in list_of_new_words:
    if w in words:
        print(new_word)
    else:
        print('x')
Here you print new_word, which is defined in the previous for loop:
for v in vowels:
    letters[asterisk] = v
    new_word = ''.join(letters) # <----
    list_of_new_words.append(new_word)
So after the loop, new_word still has the last value it was assigned to: "lug" (if the script input was l*g).
You probably meant w instead?
for w in list_of_new_words:
    if w in words:
        print(w)
    else:
        print('x')
But it still prints 5 xs ...
So that means that w in words is always False. How is that ?
Looking at words :
print(words[0:10]) # the first 10 will suffice
['A\n', 'a\n', 'aa\n', 'aal\n', 'aalii\n', 'aam\n', 'Aani\n', 'aardvark\n', 'aardwolf\n', 'Aaron\n']
All the words from the dictionary contain a newline character (\n) at the end. I guess you were not aware that this is what readlines does. So I recommend using:
words = f.read().splitlines()
instead.
With these 2 modifications (w and splitlines):
Enter a word with ' * ' as the missing letter: l*g
lag
leg
x
log
lug
🎉
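For reference, a minimal sketch of the whole script with both fixes applied (same file path and prompt as in the question):
import os.path

with open(os.path.expanduser('~/Downloads/words/en')) as f:
    words = f.read().splitlines()   # splitlines drops the trailing '\n' from each entry

inputted_word = input("Enter a word with ' * ' as the missing letter: ")
letters = list(inputted_word)
asterisk = inputted_word.index('*')

for v in 'aeiou':
    letters[asterisk] = v
    candidate = ''.join(letters)
    # print the candidate if it is in the word list, otherwise 'x'
    print(candidate if candidate in words else 'x')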

Count total number of words in a file?

I want to find the total number of words in a file (text/string). I was able to get output with my code, but I'm not sure if it is correct. Here are some sample files for y'all to try and see what you get.
Also note, use of any modules/libraries is not permitted.
sample1: https://www.dropbox.com/s/kqwvudflxnmldqr/sample1.txt?dl=0
sample2: https://www.dropbox.com/s/7xph5pb9bdf551h/sample2.txt?dl=0
sample3: https://www.dropbox.com/s/4mdb5hgnxyy5n2p/sample3.txt?dl=0
There are some things you must consider before counting the words.
A sentence is a sequence of words followed by either a full-stop, question mark or exclamation mark, which in turn must be followed either by a quotation mark (so the sentence is the end of a quote or spoken utterance), or white space (space, tab or new-line character).
E.g. if a full-stop is not at the end of a sentence, it is to be regarded as white space, and so serves to end a word.
For example, 3.42 would be two words, and P.yth.on would be three words.
A double hyphen (--) is to be regarded as a space character.
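A quick interpreter check of those rules, assuming full-stops and double hyphens are simply replaced by spaces before splitting:
line = "3.42 P.yth.on double--hyphen"
line = line.replace('--', ' ').replace('.', ' ')
print(line.split())
# ['3', '42', 'P', 'yth', 'on', 'double', 'hyphen'] -> 2 + 3 + 2 = 7 words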
That being said, first of all, I opened and read the file to get all the text. I then replaced all the useless characters with blank space so it is easier to count the words. This includes '--' as well.
Then I split the text into words, created a dictionary to store count of the words. After completing the dictionary, I added all the values to get the total number of words and printed this. See below for code:
def countwords():
    filename = input("Name of file? ")
    text = open(filename, "r").read()
    text = text.lower()
    for ch in '!.?"#$%&()*+/:<=>#[\\]^_`{|}~':
        text = text.replace(ch, ' ')
    text = text.replace('--', ' ')
    text = text.rstrip("\n")
    words = text.split()
    count = {}
    for w in words:
        count[w] = count.get(w, 0) + 1
    wordcount = sum(count.values())
    print(wordcount)
So for the sample1 text file, my word count is 321.
For sample2: 542.
For sample3: 139.
I was hoping I could compare these answers with some Python pros here and see whether my results are correct, and if they are not, what I'm doing wrong.
You can try this solution using regex.
# word counter using regex
import re

while True:
    string = input("Enter the string: ")
    if string == "Done":  # command to terminate the loop
        break
    count = len(re.findall("[a-zA-Z_]+", string))
    print(count)
print("Terminated")

remove only the unknown words from a text but leave punctuation and digits

I have a text in French containing words that are split by a space (e.g. répu blique*). I want to remove these split words from the text and append them to a list, while keeping punctuation and digits in the text. My code works for appending the split words, but it does not keep the digits in the text.
import nltk
from nltk.tokenize import word_tokenize
import re

with open('french_text.txt') as tx:
    # opening the text containing the separated words
    text = word_tokenize(tx.read().lower()) # stores the text with the separated words

with open('Fr-dictionary.txt') as fr: # opens the dictionary
    dic = word_tokenize(fr.read().lower()) # stores the first dictionary

pat = re.compile(r'[.?\-",:]+|\d+')
out_file = open("newtext.txt", "w") # defining name of output file
valid_words = [] # empty list to append the words checked by the dictionary
invalid_words = [] # empty list to append the errors found

for word in text:
    reg = pat.findall(word)
    if reg is True:
        valid_words.append(word)
    elif word in dic:
        valid_words.append(word) # appending to a list the words checked
    else:
        invalid_words.append(word) # appending the invalid_words

a = ' '.join(valid_words) # converting list into a string
print(a) # print converted list
print(invalid_words) # print errors found
out_file.write(a) # writing the output to a file
out_file.close()
So, with this code, my list of errors comes out with the digits included:
['ments', 'prési', 'répu', 'blique', 'diri', 'geants', '»', 'grand-est', 'elysée', 'emmanuel', 'macron', 'sncf', 'pepy', 'montparnasse', '1er', '2017.', 'geoffroy', 'hasselt', 'afp', 's', 'empare', 'sncf', 'grand-est', '26', 'elysée', 'emmanuel', 'macron', 'sncf', 'saint-dié', 'epinal', '23', '2018', 'etat', 's', 'vosges', '2018']
I think the problem is with the regular expression. Any suggestions? Thank you!!
The problem is with your if statement where you check reg is True. pat.findall(word) returns a list of matches (possibly empty), never the object True, so reg is True is always False and that branch is never taken. You should not use the is operator with True to check whether the result of pat.findall(word) was positive (i.e. you had a matching word).
You can do this instead:
for word in text:
    if pat.match(word):
        valid_words.append(word)
    elif word in dic:
        valid_words.append(word) # appending to a list the words checked
    else:
        invalid_words.append(word) # appending the invalid_words
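A quick illustration of why the original test never succeeds (using the same pattern as the question):
import re

pat = re.compile(r'[.?\-",:]+|\d+')
reg = pat.findall("2018")
print(reg)          # ['2018'] -- a non-empty list, which is truthy
print(reg is True)  # False -- a list is never the object True
print(bool(reg))    # True  -- this is what a plain `if reg:` test would use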
Caveat user: this is actually a complex problem, because it all depends on what we define to be a word:
is l’Académie a single word, how about j’eus ?
is gallo-romanes a single word, or c'est-à-dire?
how about J.-C.?
and xiv(e) (with a superscript e, as in 14ᵉ siècle)?
and then QDN or QQ1 or LOL?
Here's a direct solution, that's summarised as:
break up text into "words" and "non-words" (punctuation, spaces)
validate "words" against a dictionary
import re

# Adjust this to your locale
WORD = re.compile(r'\w+')

text = "foo bar, baz"
while True:
    m = WORD.search(text)
    if not m:
        if text:
            print(f"punctuation: {text!r}")
        break
    start, end = m.span()
    punctuation = text[:start]
    word = text[start:end]
    text = text[end:]
    if punctuation:
        print(f"punctuation: {punctuation!r}")
    print(f"possible word: {word!r}")
possible word: 'foo'
punctuation: ' '
possible word: 'bar'
punctuation: ', '
possible word: 'baz'
I get a feeling that you are trying to deal with intentionally misspelt / broken up words, e.g. if someone is trying to get around forum blacklist rules or speech analysis.
Then, a better approach would be:
identify what might be a "word" or "non-word" using a dictionary
then break up the text
If the original text was made to evade computers but be readable by humans, your best bet would be ML/AI, most likely a neural network, like RNN's used to identify objects in images.

Parsing a huge dictionary file with Python. Simple task I can't get my head around

I just got a giant 1.4 million-line dictionary for other programming uses, and I'm sad to see Notepad++ is not powerful enough to do the parsing job. The dictionary contains three types of lines:
<ar><k>-aaltoiseen</k>
yks.ill..ks. <kref>-aaltoinen</kref></ar>
yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista; yks.ill. -aaltoiseen; mon.gen. -aaltoisten -aaltoisien; mon.part. -aaltoisia; mon.ill. -aaltoisiinesim. Lyhyt-, pitkäaaltoinen.</ar>
and I want to extract every word from it into a list of words without duplicates. Let's start with my code.
f = open('dic.txt')
p = open('parsed_dic.txt', 'r+')
lines = f.readlines()
for line in lines:
    #<ar><k> lines
    #<kref> lines
    #ending to ";" - lines
    for word in listofwordsfromaline:
        p.write(word, "\n")
f.close()
p.close()
I'm not particularly asking you how to do this whole thing, but anything would be helpful. A link to a tutorial or one type of line-parsing method would be highly appreciated.
For the first two cases you can see that each word starts and ends with a specific tag: looking closely, every word has a ">-" string preceding it and a "</" string following it, so you can slice the word out by index:
# First and second cases
start = line.find(">-") + 2
end = line.find("</")
required_word = line[start:end]
In the last case you can use the split method:
word_lst = line.split(";")
ans = []
for word in word_lst:
    start = word.find("-")
    ans.append(word[start:])
ans = set(ans)
First find what defines a word for you.
Make a regular expression to capture those matches. For example, the word-boundary token '\b' matches the boundary between word and non-word characters.
https://docs.python.org/2/howto/regex.html
If the word definition in each type of line is different, then use if statements to match the line type first, then the corresponding regular-expression match for the word, and so on.
Match groups in Python
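A hedged sketch of that regex approach, handling the three line types in one pass and de-duplicating with a set (the file names match the question; the patterns are assumptions about the format):
import re

words = set()
with open('dic.txt', encoding='utf-8') as f:
    for line in f:
        if line.startswith('<ar><k>'):
            # e.g. <ar><k>-aaltoiseen</k>
            words.update(re.findall(r'<k>([^<]+)</k>', line))
        elif '<kref>' in line:
            # e.g. yks.ill..ks. <kref>-aaltoinen</kref></ar>
            words.update(re.findall(r'<kref>([^<]+)</kref>', line))
        else:
            # e.g. yks.nom. -aaltoinen; yks.gen. -aaltoisen; ...
            words.update(re.findall(r'-[\w-]+', line))

with open('parsed_dic.txt', 'w', encoding='utf-8') as p:
    for w in sorted(words):
        p.write(w + '\n')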

Using sent_tokenize in a specific area of a file in Python using NLTK?

I have a file with thousands of sentences, and I want to find the sentence containing a specific character/word.
Originally, I was tokenizing the entire file (using sent_tokenize) and then iterating through the sentences to find the word. However, this is too slow. Since I can quickly find the indices of the words, can I use this to my advantage? Is there a way to just tokenize an area around a word (i.e. figure out which sentence contains a word)?
Thanks.
Edit: I'm in Python and using the NLTK library.
What platform are you using? On unix/linux/macOS/cygwin, you can do the following:
sed 's/[.?!]/\n/g' < myfile | grep 'myword'
Which will display just the lines containing your word (and the sed will get a very rough tokenisation into sentences). If you want a solution in a particular language, you should say what you're using!
EDIT for Python:
The following will work---it only calls the tokenization if there's a regexp match on your word (this is a very fast operation). This will mean you only tokenize lines that contain the word you want:
import re
import os.path
from nltk.tokenize import sent_tokenize

myword = 'using'
fname = os.path.abspath('path/to/my/file')

try:
    f = open(fname)
    matching_lines = list(l for l in f if re.search(r'\b' + myword + r'\b', l))
    for match in matching_lines:
        # do something with matching lines
        sents = sent_tokenize(match)
except IOError:
    print("Can't open file " + fname)
finally:
    f.close()
Here's an idea that might speed up the search. You create an additional list in which you store the running total of the word counts for each sentence in your big text. Using a generator function that I learned from Alex Martelli, try something like:
def running_sum(a):
    tot = 0
    for item in a:
        tot += item
        yield tot

from nltk.tokenize import sent_tokenize

sen_list = sent_tokenize(bigtext)
wc = [len(s.split()) for s in sen_list]
runningwc = list(running_sum(wc)) # list of the word count for each sentence (running total for the whole text)

word_index = # some number that you get from the word index

for index, w in enumerate(runningwc):
    if w > word_index:
        sentnumber = index - 1 # found the index of the sentence that contains the word
        break

print(sen_list[sentnumber])
Hope the idea helps.
UPDATE: If sent_tokenize is what is slow, then you can try avoiding it altogether. Use the known index to find the word in your big text.
Now, move forward and backward, character by character, to detect sentence ends and sentence starts. Something like "[.!?] " (a period, exclamation mark or question mark, followed by a space) would signify a sentence end and the start of the next. You will only be searching in the vicinity of your target word, so it should be much faster than sent_tokenize.
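A minimal sketch of that idea, assuming bigtext holds the full text and word_index is the known character offset of the target word:
def sentence_around(bigtext, word_index):
    # scan backward for the previous sentence terminator followed by a space
    start = 0
    for i in range(word_index - 1, 0, -1):
        if bigtext[i] in '.!?' and bigtext[i + 1] == ' ':
            start = i + 2
            break
    # scan forward for the next sentence terminator followed by a space
    end = len(bigtext)
    for i in range(word_index, len(bigtext) - 1):
        if bigtext[i] in '.!?' and bigtext[i + 1] == ' ':
            end = i + 1
            break
    return bigtext[start:end]

text = "First one. Second sentence with using here. Third."
print(sentence_around(text, text.index("using")))   # -> 'Second sentence with using here.'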
