I'm trying to find the longest sentence in a text file, using the dot (.) to mark the beginning and end of sentences. The text file doesn't contain special punctuation (like ? or !).
My code currently returns only the first letter of the file, and I'm not sure why.
def recherche(source):
    "find the biggest sentence"
    fs = open(source, "r")
    while 1:
        txt = fs.readline()
        if txt == "":
            break
        else:
            grande_phrase = max(txt, key=len)
            print(grande_phrase)
    fs.close()

recherche("for92.txt")
Your current code reads each line, and finds the max of that line. Since a string is just a collection of characters, your expression max(txt, key=len) gives you the character in txt that has the maximum length. Since all characters have a length of 1, you just get the first character of the line.
You want to create a list of all sentences, and then use max on that list. There seems to be no guarantee that your input file will have one sentence per line. Since you use a period to define where a sentence ends, you're going to have to split the entire file at . to get your list of sentences. Keep in mind that this is not a foolproof strategy to split any text into sentences, since you risk splitting at other occurrences of ., such as a decimal point or an abbreviation.
def recherche(source):
    "find the biggest sentence"
    with open(source, "r") as fs:
        sentences = fs.read().split(".")
        grande_phrase = max(sentences, key=len)
        print(grande_phrase)
With an input file that looks like so:
It was the best of times. It was the worst of times. It was the age of wisdom. It was the age of foolishness. It was the epoch of belief. It was the epoch of incredulity. It was the season of light. It was the season of darkness. It was the spring of hope. It was the winter of despair.
we get the output:
It was the epoch of incredulity
Try it online! (Note: I replaced the file with an io.StringIO to make it work on tio.run.)
I have a txt file containing one sentence per line, and there are lines containing numbers attached to letters. For instance:
The boy3 was strolling on the beach while four seagulls appeared flying.
There were 3 women sunbathing as well.
All children were playing happily.
I would like to remove lines like the first one (i.e. lines with numbers stuck to words) but not lines like the second, which are properly written.
Has anybody got a slight idea?
You can use a simple regex pattern. Start with [0-9]+: this matches a run of one or more digits, so 6, 56, or 56790 all match. To detect numbers attached to a word, you can use something like ([a-zA-Z][0-9]+)|([0-9]+[a-zA-Z]), which matches a letter immediately before or immediately after a digit run. You can search strings using:
import re

lines = [
    'The boy3 was strolling on the beach while 4 seagulls appeared flying.',
    'There were 3 women sunbathing as well.',
]

for line in lines:
    res = re.search("([a-zA-Z][0-9]+)|([0-9]+[a-zA-Z])", line)
    if res is not None:
        continue  # a number is attached to a word: drop this line
    print(line)  # properly written lines are kept
You can also widen the letter class if your sentences may contain accented or other special characters.
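For instance, here is a quick sketch that also allows Latin-1 accented letters (the exact range À-ÿ is an assumption; adjust it for your language):

import re

# Letter class extended with the Latin-1 accented range (an assumption;
# widen or narrow it to match your own text).
pattern = re.compile(r"([a-zA-ZÀ-ÿ][0-9]+)|([0-9]+[a-zA-ZÀ-ÿ])")

print(bool(pattern.search("The café3 was busy today.")))  # True: digit stuck to a word
print(bool(pattern.search("There were 3 cafés open.")))   # False: standalone number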
Suppose your input text is stored in the file in.txt; you can then use the following code:
import re

with open("in.txt", "r") as f:
    for line in f:
        if not re.search(r'(?!\d)[\w]\d|\d(?!\d)[\w]', line, flags=re.UNICODE):
            print(line, end="")
The pattern (?!\d)[\w] looks for word characters (\w) excluding digits. The idea is stolen from https://stackoverflow.com/a/12349464/2740367
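As a quick sanity check, here is a minimal demo of that pattern (the sample sentences are made up; re.UNICODE is the default in Python 3 and is kept only for clarity):

import re

pattern = re.compile(r'(?!\d)[\w]\d|\d(?!\d)[\w]', flags=re.UNICODE)

print(bool(pattern.search("The boy3 was strolling there.")))   # True: digit glued to a word
print(bool(pattern.search("There were 3 women sunbathing.")))  # False: standalone number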
I have two strings, "loss of gene" and "aquaporin protein". I want to find whether these two occur in a line of my file within a proximity of five words.
Any ideas? I have searched extensively but cannot find anything.
Also, since these are multi-word strings, I cannot use abs(array.index) for the two (which was possible with single words).
Thanks
You could try the following approach:
First sanitise your text by converting it to lowercase, keeping only the word characters, and enforcing a single space between words.
Next, search for each of the phrases in the resulting text and keep a note of the starting index and the length of the phrase matched. Sort this index list.
Next make sure that all of the phrases were present in the text by making sure all found indexes are not -1.
If all are found count the number of words between the end of the first phrase, and the start of the last phrase. To do this take a text slice starting from the end of the first phrase to the start of the second phrase, and split it into words.
Script as follows:
import re

text = "The Aquaporin protein, sometimes 'may' exhibit a big LOSS of gene."
text = ' '.join(re.findall(r'\b(\w+)\b', text.lower()))
indexes = sorted((text.find(x), len(x)) for x in ['loss of gene', 'aquaporin protein'])
if all(i[0] != -1 for i in indexes) and len(text[indexes[0][0] + indexes[0][1] : indexes[-1][0]].split()) <= 5:
    print("matched")
To extend this to work on a file with a list of phrases, the following approach could be used:

import re

log = 'loss of gene'
phrases = ['aquaporin protein', 'another protein']

with open('input.txt') as f_input:
    for number, line in enumerate(f_input, start=1):
        # Sanitise the line
        text = ' '.join(re.findall(r'\b(\w+)\b', line.lower()))
        # Only process lines containing 'loss of gene'
        log_index = text.find(log)
        if log_index != -1:
            for phrase in phrases:
                phrase_index = text.find(phrase)
                if phrase_index != -1:
                    if log_index < phrase_index:
                        start, end = (log_index + len(log), phrase_index)
                    else:
                        start, end = (phrase_index + len(phrase), log_index)
                    if len(text[start:end].split()) <= 5:
                        print("line {} matched - {}".format(number, phrase))
                        break
This would give you the following kind of output:
line 1 matched - aquaporin protein
line 5 matched - another protein
Note, this will only spot one phrase pair per line.
I am not completely sure if this is what you want, but I'll give it a shot!
In Python, you can use "in" to check if a string is in another string. I am going to assume you already have a way to store a line from a file:
"loss of gene" in fileLine -> returns boolean (either True or False)
With this you can check if "loss of gene" and "aquaporin protein" are in your line from your file. Once you have confirmed that they are both there you can check their proximity by splitting the line of text into a list as so:
wordsList = fileLine.split()
If in your text file you have the string:
"The aquaporin protein sometimes may exhibit a loss of gene"
After splitting it becomes:
["The","aquaporin","protein","sometimes","may","exhibit","a","loss","of","gene"]
I'm not sure if that is a valid sentence but for the sake of example let's ignore it :P
Once you have the line of text split into a list of words and confirmed the words are in there, you can get their proximity with the index function that comes with lists in python!
wordsList.index("protein") -> returns index 2
After finding what index "protein" is at you can check what index "loss" is at, then subtract them to find out if they are within a 5 word proximity.
You can use the index function to discern if "loss of gene" comes before or after "aquaporin protein". If "loss of gene" comes first, index "gene" and "aquaporin" and subtract those indexes. If "aquaporin protein" comes first, index "protein" and "loss" and subtract those indexes.
You will have to do a bit more to ensure that you subtract indexes correctly if the words come in different orders, but this should cover the meat of the problem. Good luck Chahat!
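Putting the steps above together, a minimal sketch might look like this (the sample line and the 5-word threshold are assumptions for illustration):

fileLine = "The aquaporin protein sometimes may exhibit a loss of gene"

if "loss of gene" in fileLine and "aquaporin protein" in fileLine:
    wordsList = fileLine.split()
    if wordsList.index("protein") < wordsList.index("loss"):
        # "aquaporin protein" comes first: count words from "protein" to "loss"
        gap = wordsList.index("loss") - wordsList.index("protein") - 1
    else:
        # "loss of gene" comes first: count words from "gene" to "aquaporin"
        gap = wordsList.index("aquaporin") - wordsList.index("gene") - 1
    print("matched" if gap <= 5 else "too far apart")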
I just got a giant 1.4M-line dictionary for other programming uses, and I'm sad to see that Notepad++ is not powerful enough to handle the parsing job. The dictionary contains three types of lines:
<ar><k>-aaltoiseen</k>
yks.ill..ks. <kref>-aaltoinen</kref></ar>
yks.nom. -aaltoinen; yks.gen. -aaltoisen; yks.part. -aaltoista; yks.ill. -aaltoiseen; mon.gen. -aaltoisten -aaltoisien; mon.part. -aaltoisia; mon.ill. -aaltoisiinesim. Lyhyt-, pitkäaaltoinen.</ar>
and I want to extract every word from it into a list of words without duplicates. Let's start with my code.
f = open('dic.txt')
p = open('parsed_dic.txt', 'w')
lines = f.readlines()
for line in lines:
    #<ar><k> lines
    #<kref> lines
    #ending to ";" - lines
    for word in listofwordsfromaline:
        p.write(word + "\n")
f.close()
p.close()
I'm not particularly asking you to do this whole thing for me, but anything would be helpful. A link to a tutorial, or a parsing method for one type of line, would be highly appreciated.
For the first two cases you can see that each word starts and ends with a specific tag; looking closely, every word has a ">-" immediately before it and a "</" immediately after it:
# First and second cases
start = line.find(">-") + 2
end = line.find("</")  # slice end is exclusive, so this stops just before "<"
required_word = line[start:end]
In the last case you can use the split method:
word_list = line.split(";")
ans = []
for word in word_list:
    start = word.find("-")
    ans.append(word[start:])
ans = set(ans)
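Stitched into your skeleton, the whole thing might look roughly like this (the line classification below is a guess based on your three sample lines, so treat it as a sketch):

import re

words = set()
with open('dic.txt') as f:
    for line in f:
        if '>-' in line:
            # <ar><k> and <kref> lines: the word sits between ">-" and "</"
            start = line.find('>-') + 1
            words.add(line[start:line.find('</', start)])
        elif ';' in line:
            # ";"-separated inflection lines: grab each "-word" chunk
            for part in line.split(';'):
                words.update(re.findall(r'-\w+', part))

with open('parsed_dic.txt', 'w') as p:
    for word in sorted(words):
        p.write(word + '\n')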
First find what defines a word for you.
Make a regular expression to capture those matches. For example, the word-boundary escape '\b' matches at the boundary between word and non-word characters.
https://docs.python.org/2/howto/regex.html
If the word definition differs between line types, use if statements to match the line type first, then apply the corresponding regular expression for the word, and so on.
Match groups in Python
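As a tiny illustration of the regex route (the token pattern is an assumption; note that tag names like kref also match unless you branch on the line type first, as suggested above):

import re

line = "yks.ill..ks. <kref>-aaltoinen</kref></ar>"
# \b marks word boundaries; [^\W\d_]+ keeps letter-only runs
print(re.findall(r'\b[^\W\d_]+\b', line))
# ['yks', 'ill', 'ks', 'kref', 'aaltoinen', 'kref', 'ar']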
I have a file with thousands of sentences, and I want to find the sentence containing a specific character/word.
Originally, I was tokenizing the entire file (using sent_tokenize) and then iterating through the sentences to find the word. However, this is too slow. Since I can quickly find the indices of the words, can I use this to my advantage? Is there a way to just tokenize an area around a word (i.e. figure out which sentence contains a word)?
Thanks.
Edit: I'm in Python and using the NLTK library.
What platform are you using? On unix/linux/macOS/cygwin, you can do the following:
sed 's/[.?!]/\n/g' < myfile | grep 'myword'
Which will display just the lines containing your word (and the sed will get a very rough tokenisation into sentences). If you want a solution in a particular language, you should say what you're using!
EDIT for Python:
The following will work; it only calls the tokenization if there's a regexp match on your word (a very fast operation). This means you only tokenize lines that contain the word you want:
import re
import os.path
from nltk.tokenize import sent_tokenize

myword = 'using'
fname = os.path.abspath('path/to/my/file')

try:
    with open(fname) as f:
        matching_lines = [l for l in f if re.search(r'\b' + myword + r'\b', l)]
    for match in matching_lines:
        # do something with matching lines
        sents = sent_tokenize(match)
except IOError:
    print("Can't open file " + fname)
Here's an idea that might speed up the search. You create an additional list in which you store the running total of the word counts for each sentence in your big text. Using a generator function that I learned from Alex Martelli, try something like:
def running_sum(a):
    tot = 0
    for item in a:
        tot += item
        yield tot

from nltk.tokenize import sent_tokenize

sen_list = sent_tokenize(bigtext)
wc = [len(s.split()) for s in sen_list]
runningwc = list(running_sum(wc))  # running word count at the end of each sentence (cumulative over the whole text)

word_index = ...  # the 0-based index of your word, which you get from your word index

for index, w in enumerate(runningwc):
    if w > word_index:
        sentnumber = index  # the first sentence whose cumulative count exceeds the word index contains the word
        break

print(sen_list[sentnumber])
Hope the idea helps.
UPDATE: If sent_tokenize is what is slow, then you can try avoiding it altogether. Use the known index to find the word in your big text.
Now, move forward and backward, character by character, to detect sentence ends and sentence starts. Something like "[.!?] " (a period, exclamation mark, or question mark followed by a space) would signify a sentence boundary. You will only be searching in the vicinity of your target word, so it should be much faster than sent_tokenize.
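A rough sketch of that idea (bigtext and the character offset of your word are assumptions; this uses a simplified version of the punctuation rule just described):

def sentence_around(bigtext, word_offset):
    # Nearest sentence-ending punctuation before the word marks the start
    start = max(bigtext.rfind(p, 0, word_offset) for p in '.!?') + 1
    # Nearest one at or after the word marks the end
    ends = [bigtext.find(p, word_offset) for p in '.!?']
    end = min([e for e in ends if e != -1], default=len(bigtext) - 1) + 1
    return bigtext[start:end].strip()

text = "It was cold. The word using appears here! End of text."
print(sentence_around(text, text.index("using")))  # The word using appears here!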
I have written the following code to tokenize the input paragraph that comes from the file samp.txt. Can anybody help me out to find and print the number of sentences, words and characters in the file? I have used NLTK in python for this.
>>> import nltk.data
>>> import nltk.tokenize
>>> f = open('samp.txt')
>>> raw = f.read()
>>> tokenized_sentences = nltk.sent_tokenize(raw)
>>> for each_sentence in tokenized_sentences:
...     words = nltk.tokenize.word_tokenize(each_sentence)
...     print(each_sentence)  # prints tokenized sentences from samp.txt
>>> tokenized_words = nltk.word_tokenize(raw)
>>> for each_word in tokenized_words:
...     print(each_word)  # prints tokenized words from samp.txt
Try it this way (this program assumes that you are working with one text file in the directory specified by dirpath):
import nltk

folder = nltk.data.find(dirpath)
corpusReader = nltk.corpus.PlaintextCorpusReader(folder, r'.*\.txt')

print("The number of sentences =", len(corpusReader.sents()))
print("The number of paragraphs =", len(corpusReader.paras()))
print("The number of words =", len([word for sentence in corpusReader.sents() for word in sentence]))
print("The number of characters =", len([char for sentence in corpusReader.sents() for word in sentence for char in word]))
Hope this helps
With NLTK, you can also use FreqDist (see the O'Reilly NLTK book, ch. 3.1).
And in your case:
import nltk

with open('samp.txt', encoding='utf-8') as f:
    raw = f.read()
text = nltk.Text(nltk.word_tokenize(raw))
fdist = nltk.FreqDist(text)
print(fdist.N())
For what it's worth, if someone comes along here: I think this addresses everything the OP asked. With the textstat package, counting sentences and characters is very easy. Note that punctuation at the end of each sentence matters for the counts.
import textstat
your_text = "This is a sentence! This is sentence two. And this is the final sentence?"
print("Num sentences:", textstat.sentence_count(your_text))
print("Num chars:", textstat.char_count(your_text, ignore_spaces=True))
print("Num words:", len(your_text.split()))
I believe this to be the right solution, because it properly counts things like "..." and "??" as a single sentence:
len(re.findall(r"[^?!.][?!.]", paragraph))
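A quick check of that claim (the sample paragraph is made up):

import re

paragraph = "Wait... what?? This is one. And two!"
# Runs like "..." and "??" collapse to a single sentence each
print(len(re.findall(r"[^?!.][?!.]", paragraph)))  # 4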
Characters are easy to count.
Paragraphs are usually easy to count too. Whenever you see two consecutive newlines you probably have a paragraph. You might say that an enumeration or an unordered list is a paragraph, even though their entries can be delimited by two newlines each. A heading or a title too can be followed by two newlines, even-though they're clearly not paragraphs. Also consider the case of a single paragraph in a file, with one or no newlines following.
Sentences are tricky. You might settle for a period, exclamation mark, or question mark followed by whitespace or end-of-file. It's tricky because sometimes a colon marks an end of sentence and sometimes it doesn't. Usually when it does, the next non-whitespace character is capitalized, in the case of English. But sometimes not; for example, if it's a digit. And sometimes an open parenthesis marks the end of a sentence (though that is arguable, as in this case).
Words too are tricky. Usually words are delimited by whitespace or punctuation marks. Sometimes a dash delimits a word, sometimes not. That is the case with a hyphen, for example.
For words and sentences you will probably need to clearly state your definition of a sentence and a word and program for that.
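As one concrete, deliberately naive reading of those definitions, here is a baseline counter; all of the caveats above still apply, and samp.txt is just the file name used elsewhere in this thread:

import re

with open('samp.txt') as f:
    text = f.read()

characters = len(text)
# Paragraphs: blocks separated by blank lines
paragraphs = len([p for p in re.split(r'\n\s*\n', text) if p.strip()])
# Sentences: ., ! or ? followed by whitespace or end-of-file
sentences = len(re.findall(r'[.!?](?=\s|$)', text))
# Words: runs of non-space, non-punctuation characters
words = len(re.findall(r"[^\s.!?,;:\"']+", text))

print(characters, words, sentences, paragraphs)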
Not 100% correct, but I gave it a try. I have not taken all the points by @wilhelmtell into consideration; I'll try them once I have time...
if __name__ == "__main__":
    f = open("1.txt")
    c = w = 0
    s = 1
    prevIsSentence = False
    for x in f:
        x = x.strip()
        if x != "":
            words = x.split()
            w = w + len(words)
            c = c + sum([len(word) for word in words])
            prevIsSentence = True
        else:
            if prevIsSentence:
                s = s + 1
            prevIsSentence = False
    if not prevIsSentence:
        s = s - 1
    print("%d:%d:%d" % (c, w, s))
Here 1.txt is the file name.
The only way you can solve this fully is by creating an AI program that uses natural language processing, which is not very easy to do.
Input:
"This is a paragraph about the Turing machine. Dr. Allan Turing invented the Turing Machine. It solved a problem that has a .1% change of being solved."
Check out OpenNLP:
https://sourceforge.net/projects/opennlp/
http://opennlp.apache.org/
There's already a program to count words and characters: wc.