Count number of sentences ending with punctuation mark in a text file - python

I'm attempting to make a function that counts the number of sentences in a textfile. In this case, a sentence refers to any string ending with either a '.', '?', or a '!'.
I'm new to Python and I'm having trouble figuring out how to do this. I keep getting the error 'UnboundLocalError: local variable 'numberofSentences' referenced before assignment.' Any help would be appreciated!
def countSentences(filename):
    endofSentence =['.', '!', '?']
    for sentence in filename:
        for fullStops in endofSentence:
            if numberofSentences.find(fullStops) == true:
                numberofSentences = numberofSentences+1
    return numberofSentences
print(countSentences('paragraph.txt'))

You need to initialize the variable to something before incrementing it. You also need to open the file itself, read its text and evaluate that, rather than counting the characters in the given filename.
# create file
with open("paragraph.txt","w") as f:
    f.write("""
Some text. More text. Even
more text? No, dont need more of that!
Nevermore.""")
def countSentences(filename):
    """Count the number of !.? in a file. Return it."""
    numberofSentences = 0 # init variable here before using it
    with open(filename) as f: # read file
        for line in f: # process line wise , each line char wise
            for char in line:
                if char in {'.', '!', '?'}:
                    numberofSentences += 1
    return numberofSentences
print(countSentences('paragraph.txt'))
Output:
5
Documentation: reading files
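The character-by-character count treats every '.', '!' or '?' as a sentence end, so a string like "Wait... what?!" would count as five. Below is a minimal sketch of one possible refinement, assuming that a run of consecutive terminators should count as a single sentence end. The names count_sentences and count_sentences_in_file are illustrative and do not come from the answer.

```python
import re

# Assumption: one or more consecutive terminators ("...", "?!") end ONE sentence.
SENTENCE_END = re.compile(r"[.!?]+")

def count_sentences(text):
    """Count runs of '.', '!' or '?' in a string."""
    return len(SENTENCE_END.findall(text))

def count_sentences_in_file(filename):
    """Read the whole file and count sentence ends in its text."""
    with open(filename) as f:
        return count_sentences(f.read())
```

Keeping the counting logic in a function that takes a string makes it easy to try on small inputs without creating a file first.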

It would work if you actually checked whether the full stop was in the sentence in the first place and you initialized numberofSentences beforehand.
I think the best approach is to avoid find, which returns an index (a number, or -1 when nothing is found) rather than a bool, and instead write:
if sentence in endofSentence:
    numberofSentences += 1

Related

Read a text file and return punctuation as a string

I want to create a program in Python which reads the text from a text file and returns a string where everything except punctuation (period, comma, colon, semicolon, exclamation point, question mark) has been removed. This is my code:
def punctuation(filename):
    with open(filename, mode='r') as f:
        s = ''
        punctations = '''.,;:!?'''
        for line in f:
            for c in line:
                if c == punctations:
                    c.append(s)
        return s
But it only returns ''. I have also tried s = + c instead of s.append(c), since append might not work on strings, but the problem remains. Can anyone help me find out why?
How it should work:
If we have a text file named hello.txt with the text "Hello, how are you today?" then punctuation('hello.txt') should give us the output string ',?'
You were comparing each character to the whole string when you should have been checking whether it is one of the characters in punctations. Also, append was not the appropriate method here, because you are building a string, not a list; instead you can concatenate the characters onto s.
def punctuation(filename):
    with open(filename, mode='r') as f:
        s = ''
        punctations = '''.,;:!?'''
        text = f.read()
        # iterating over a string yields single characters, so `line` is one character here
        for line in text:
            if line in set(punctations):
                s+=line
        return s
Another approach to checking whether a character is a symbol is the isalnum() method: it is False for any character that is not a letter or a digit, so it also catches symbols you may have left out of your list.
if line != " " and line != "\n" and not line.isalnum():
The problem is that c == punctations will never be True, since c is a single character and punctations is a longer string. Another problem is that append doesn't work on strings; you should use + to concatenate strings instead.
def punctuation(filename):
    with open(filename, mode='r') as f:
        s = ''
        punctations = '''.,;:!?'''
        for line in f:
            for c in line:
                if c in punctations:
                    s += c
        return s
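The same filter can be written more compactly with a generator expression and str.join. This is only a sketch of an alternative; the helper name keep_punctuation and its chars parameter are made up for illustration.

```python
def keep_punctuation(text, chars=".,;:!?"):
    """Return only the characters of text that appear in chars, in order."""
    return ''.join(c for c in text if c in chars)

def punctuation(filename):
    """Read the whole file and keep only its punctuation characters."""
    with open(filename, mode='r') as f:
        return keep_punctuation(f.read())

print(keep_punctuation("Hello, how are you today?"))  # ,?
```

Splitting the string logic from the file handling means the filter can be checked on a literal string before it is pointed at a file.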
Issues
Some statements seem to have issues:
if c == punctations: # 1
    c.append(s) # 2
1. A single character is never equal to a string of several characters like your punctations (e.g. '.' == '.?' is never true). So we need a different boolean operator, in, because a character can be an element of a collection of characters: a string, list or set.
2. As you already spotted: since c is a character and s is a str, not lists, we cannot use the append method. So we have to use s = s + c or the shortcut s += c (your solution was almost right).
Extract a testable & reusable function
Why not extract and test the part that fails:
def extract_punctuation(line):
    punctuation_chars = set('.,;:!?') # typo in name, unique thus set
    symbols = []
    for char in line:
        if char in punctuation_chars:
            symbols.append(char)
    return symbols
# test
symbol_list = extract_punctuation('Hello, how are you today?')
print(symbol_list) # [',', '?']
print(''.join(symbol_list)) # ',?'
Solution: use a function on file-read
Then you could reuse that function on any text, or a file like:
def punctuation(filename):
    symbols = []
    with open(filename, mode='r') as f:
        symbols += extract_punctuation(f.read())
    return ''.join(symbols)
Explained:
The default result is defined first as an empty list [] (it stays empty if the file contains no punctuation).
The list of extracted symbols is added to symbols using += for each file read inside the with block (here the whole file is read at once).
The function returns the symbols joined into a string: ''.join([]) gives '' for an empty list, otherwise e.g. ',?'.
See:
How do I concatenate two lists in Python?
Extend: return a list to play with
For a file with multiple sentences like dialogue.txt:
Hi, how are you?
Well, I am fine!
What about you .. ready to start, huh?
You could get a list (ordered by appearance) like:
[',', '?', ',', '!', '.', '.', ',', '?']
which will result in a string with ordered duplicates:
,?,!..,?
To extend, a list might be a better return type:
Filter unique as set: set( list_punctuation(filename) )
Count frequency using pandas: pd.Series(list_punctuation(filename)).value_counts()
def list_punctuation(filename):
    with open(filename, mode='r') as f:
        return extract_punctuation(f.read())
lp = list_punctuation('dialogue.txt')
print(lp)
print(''.join(lp))
unique = set(lp)
print(unique)
# pass the list to pandas to easily do statistics
import pandas as pd
frequency = pd.Series(lp).value_counts()
print(frequency)
Prints the list and string above, plus the following set:
{',', '?', '!', '.'}
as well as the ranked frequency for each punctuation symbol:
, 3
? 2
. 2
! 1
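If pandas is not available, the standard library's collections.Counter gives comparable counts. A small sketch, using the list shown earlier for dialogue.txt as literal input rather than reading the file:

```python
from collections import Counter

# The list of symbols extracted from dialogue.txt, in order of appearance.
symbols = [',', '?', ',', '!', '.', '.', ',', '?']

frequency = Counter(symbols)

# most_common() yields (symbol, count) pairs, highest count first.
for symbol, count in frequency.most_common():
    print(symbol, count)
```

Counter is a dict subclass, so frequency[','] looks up a single count directly.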
Today I learned - by playing with
punctuation & Python's data structures

How to count number of replacements made in string

I am currently working on a beginner problem
(https://www.reddit.com/r/beginnerprojects/comments/1i6sax/challenge_count_and_fix_green_eggs_and_ham/).
The challenge is to read through a file, replacing lower case 'i' with 'I' and writing a new corrected file.
I am at a point where the program reads the input file, replaces the relevant lower case characters, and writes a new corrected file. However, I need to also count the number of corrections.
I have looked through the .replace() documentation and I cannot see that it is possible to find out the number of replacements made. Is it possible to count corrections using the replace method?
def capitalize_i(file):
    file = file.replace('i ', 'I ')
    file = file.replace('-i-', '-I-')
    return file
with open("green_eggs.txt", "r") as f_open:
    file_1 = f_open.read()
file_2 = open("result.txt", "w")
file_2.write(capitalize_i(file_1))
You can just use the count function:
i_count = file.count('i ')
file = file.replace('i ', 'I ')
i_count += file.count('-i-')
file = file.replace('-i-', '-I-')
i_count will then hold the total number of replacements made. You can also keep the two counts separate by using different variables if you want.
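str.replace itself does not report how many replacements it made, but re.subn from the standard library returns both the new string and the number of substitutions. A hedged sketch of that alternative follows; the tuple return value is a design choice for this example, not part of the original program.

```python
import re

def capitalize_i(text):
    """Return (corrected_text, number_of_replacements)."""
    total = 0
    for old, new in (('i ', 'I '), ('-i-', '-I-')):
        # re.escape makes the search string a literal pattern;
        # re.subn returns the new string and how many substitutions were made.
        text, n = re.subn(re.escape(old), new, text)
        total += n
    return text, total

corrected, count = capitalize_i("i am -i- here")
print(corrected, count)
```

Like the replace calls in the question, this also changes an 'i ' at the end of a longer word (for example "hi "), so the patterns may need tightening for real text.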

Reading characters from file into list

I have a part of a program with the following code
file1 = [line.strip() for line in open('sometext.txt').readlines()]
print(file1[0])
When the code is executed, it gives me the whole contents of the txt file, which is one very long sentence.
How would I go about reading every letter and placing it in a list so I can index each character separately? I have used the list() function, which seems to put the whole text file into a list and not each character.
You can use file.read() rather than file.readlines():
file1 = [char for char in open('sometext.txt').read()]
You don't really need a list comprehension, however; instead you can do this:
file1 = list(open('sometext.txt').read())
Also, as @furas mentioned in his comment, you don't need a list to have indexing; a str can be indexed directly. str also has a method called index, so you could say file1 = open('sometext.txt').read() and still be able to use file1.index(). Note that str also has a find method, which returns -1 if the substring is not found rather than raising a ValueError.
A read() is enough. Also, if you want to store the list without \n and spaces, you can use:
char_list = [ch for ch in open('test.txt').read() if ch != '\n' if ch != ' ']
You can remove the if clauses if you want to keep those characters.
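The one-liners above leave the file object to be closed by the interpreter at some later point. A minimal sketch that closes the file explicitly with a with block; the function name read_chars and its skip parameter are assumptions made for this example.

```python
def read_chars(path, skip=(' ', '\n')):
    """Return the file's characters as a list, leaving out those in skip."""
    with open(path) as f:  # the file is closed when the block ends
        return [ch for ch in f.read() if ch not in skip]
```

Calling read_chars('sometext.txt', skip=()) keeps every character, including spaces and newlines.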

When counting the occurrence of a string in a file, my code does not count the very first word

Code
def main():
    try:
        file=input('Enter the name of the file you wish to open: ')
        thefile=open(file,'r')
        line=thefile.readline()
        line=line.replace('.','')
        line=line.replace(',','')
        thefilelist=line.split()
        thefilelistset=set(thefilelist)
        d={}
        for item in thefilelist:
            thefile.seek(0)
            wordcount=line.count(' '+item+' ')
            d[item]=wordcount
        for i in d.items():
            print(i)
        thefile.close()
    except IOError:
        print('IOError: Sorry but i had an issue opening the file that you specified to READ from please try again but keep in mind to check your spelling of the file you want to open')
main()
Problem
Basically I am trying to read the file and count the number of times each word in the file appears then print that word with the number of times it appeared next to it.
It all works except that it will not count the first word in the file.
File I am using
my practice file that I am testing this code on contains this text:
This file is for testing. It is going to test how many times the words
in here appear.
output
('for', 1)
('going', 1)
('the', 1)
('testing', 1)
('is', 2)
('file', 1)
('test', 1)
('It', 1)
('This', 0)
('appear', 1)
('to', 1)
('times', 1)
('here', 1)
('how', 1)
('in', 1)
('words', 1)
('many', 1)
note
If you notice it says that 'This' appears 0 times but it does in fact appear in the file.
any ideas?
My guess would be this line:
wordcount=line.count(' '+item+' ')
You are looking for "space" + YourWord + "space", and the first word is not preceded by space.
I would suggest more use of Python utilities. A big flaw is that you only read one line from the file.
Then you create a set of unique words and then start counting them individually which is highly inefficient; the line is traversed many times: once to create the set and then for each unique word.
Python has a built-in "high performance counter" (https://docs.python.org/2/library/collections.html#collections.Counter) which is specifically meant for use cases like this.
The following few lines replace your program; it also uses "re.split()" to split each line by word boundaries (https://docs.python.org/2/library/re.html#regular-expression-syntax).
The idea is to execute this split() function on each of the lines of the file and update the wordcounter with the results from this split. Also re.sub() is used to replace the dots and commas in one go before handing the line to the split function.
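import re, collections
from functools import reduce  # reduce is no longer a builtin in Python 3
with open(input('Enter the name of the file you wish to open: '), 'r') as file:
    # remove dots and commas, split each line on runs of non-word characters,
    # drop the empty strings split() can produce, and feed the words to one Counter
    for d in reduce(lambda acc, line: acc.update(w for w in re.split(r"\W+", line) if w) or acc,
                    map(lambda line: re.sub(r"[.,]", "", line), file),
                    collections.Counter()).items():
        print(d)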
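The single-expression version is compact but hard to read. The same idea can be sketched as a plain loop; this is a simplified variant that uses re.findall to pick out words instead of removing punctuation and splitting, and the function name count_words is hypothetical.

```python
import re
from collections import Counter

def count_words(lines):
    """Count words across an iterable of lines (an open file works too)."""
    counts = Counter()
    for line in lines:
        # \w+ matches runs of letters, digits and underscores,
        # so dots, commas and whitespace never end up in a word.
        counts.update(re.findall(r"\w+", line))
    return counts

sample = ["This file is for testing. It is going to test how many times the words\n",
          "in here appear.\n"]
for word, n in count_words(sample).items():
    print(word, n)
```

Because an open file is itself an iterable of lines, count_words(f) inside a with open(...) as f: block would process a file of any length.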
If you want a simple fix, the problem is in this line:
wordcount=line.count(' '+item+' ')
There is no space before "This".
I think there are a couple of ways to fix it, but I recommend using a with block and .readlines().
I also recommend using more of Python's capabilities. A couple of suggestions: first, if the file has more than one line, this code won't work. Second, if two sentences run together without a space, as in lastwordofsentence.Firstwordofnextsentence, removing the period joins the two words into one. So change your replace calls to substitute a space (' ') instead of an empty string (''); split() treats any run of whitespace as a single separator, so the extra spaces do no harm.
Also, please post whether you are using Python 2.7 or 3.X. It helps with small possible syntax problems.
filename = input('Enter the name of the file you wish to open: ')
# Using a with block like this is cleaner and nicer than try catch
with open(filename, "r") as f:
    all_lines = f.readlines()
d={} # Create empty dictionary
# Iterate through all lines in file
for line in all_lines:
    # Replace periods and commas with spaces
    line=line.replace('.',' ')
    line=line.replace(',',' ')
    # Get all words on this line
    words_in_this_line = line.split() # Split into all words
    # Iterate through all words
    for word in words_in_this_line:
        #Check if word already exists in dictionary
        if word in d: # Word exists increment count
            d[word] += 1
        else: #Word doesn't exist, add it with count 1
            d[word] = 1
# Print all words with frequency of occurrence in file
for i in d.items():
    print(i)
You check if line contains ' '+item+' ', which means you are searching for a word starting and ending with a space. Because "This" is the first word of the line, it is not surrounded by two spaces.
To fix that, you can use the following code:
wordcount=(' '+line+' ').count(' '+item+' ')
The code above ensures that the first and the last word are counted correctly.
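A small sketch of why the padding matters, using a shortened version of the sample sentence with the punctuation already removed. The helper name count_padded is made up for this illustration.

```python
def count_padded(line, word):
    """Count word in line, padding the line with spaces so edge words match."""
    return (' ' + line + ' ').count(' ' + word + ' ')

line = "This file is for testing It is going to test"

print(line.count(' This '))         # unpadded search: the first word has no leading space
print(count_padded(line, 'This'))   # padded search finds it
print(count_padded(line, 'test'))   # the last word is found too, and 'testing' is not matched
```

One caveat: str.count counts non-overlapping matches, so two adjacent repeats such as "is is" share a space and are counted once. Splitting the line into words and counting those avoids the problem.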
The problem is in this line: wordcount=line.count(' '+item+' '). The first word will not have a space in front of it. I have also removed some other redundant statements from your code:
import string
def main():
    try:
        #file=input('Enter the name of the file you wish to open: ')
        thefile=open('C:/Projects/Python/data.txt','r')
        line=thefile.readline()
        # Python 3: the third argument of str.maketrans lists the characters to delete
        line = line.translate(str.maketrans("", "", string.punctuation))
        thefilelist=line.split()
        d={}
        for item in thefilelist:
            if item not in d:
                d[item] = 0
            d[item] = d[item]+1
        for i in d.items():
            print(i)
        thefile.close()
    except IOError:
        print('IOError: Sorry but i had an issue opening the file that you specified to READ from please try again but keep in mind to check your spelling of the file you want to open')
main()
"This" does not have a space in front of it.
Quick fix:
line= ' ' + thefile.readline()
But there are many other problems in your code.
For example:
What about a multi-line file?
What about a file without a . at the end?

complex regex matches in python

I have a txt file that contains the following data:
chrI
ATGCCTTGGGCAACGGT...(multiple lines)
chrII
AGGTTGGCCAAGGTT...(multiple lines)
I want to first find 'chrI' and then iterate through the multiple lines of ATGC until I find the xth char. Then I want to print the xth char until the yth char. I have been using regex but once I have located the line containing chrI, I don't know how to continue iterating to find the xth char.
Here is my code:
for i, line in enumerate(sacc_gff):
    for match in re.finditer(chromo_val, line):
        print(line)
        for match in re.finditer(r"[ATGC]{%d},{%d}\Z" % (int(amino_start), int(amino_end)), line):
            print(match.group())
What the variables mean:
chromo_val = chrI
amino_start = (some start point my program found)
amino_end = (some end point my program found)
Note: amino_start and amino_end need to be in variable form.
Please let me know if I could clarify anything for you, Thank you.
It looks like you are working with fasta data, so I will provide an answer with that in mind; if you aren't, you can still use the subsequence selection part.
fasta_data = {} # creates an empty dictionary
with open( fasta_file, 'r' ) as fh:
    for line in fh:
        if line[0] == '>':
            seq_id = line.rstrip()[1:] # strip newline character and remove leading '>' character
            fasta_data[seq_id] = ''
        else:
            fasta_data[seq_id] += line.rstrip()
# return substring from chromosome 'chrI' with a first character at amino_start up to but not including amino_end
sequence_string1 = fasta_data['chrI'][amino_start:amino_end]
# return substring from chromosome 'chrII' with a first character at amino_start up to and including amino_end
sequence_string2 = fasta_data['chrII'][amino_start:amino_end+1]
fasta format:
>chr1
ATTTATATATAT
ATGGCGCGATCG
>chr2
AATCGCTGCTGC
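The slices above use Python's 0-based indexing. If the positions found by your program are 1-based and inclusive, which is an assumption here since the question does not say, a small helper can do the conversion. The sketch below parses the example records from a list of lines instead of a file; parse_fasta and subsequence are illustrative names.

```python
def parse_fasta(lines):
    """Build {sequence_id: sequence} from an iterable of FASTA lines."""
    sequences = {}
    seq_id = None
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            seq_id = line[1:]
            sequences[seq_id] = ''
        elif seq_id is not None:
            sequences[seq_id] += line  # join multi-line sequences
    return sequences

def subsequence(sequence, start, end):
    """Assumption: start and end are 1-based and both inclusive."""
    return sequence[start - 1:end]

records = parse_fasta([">chr1\n", "ATTTATATATAT\n", "ATGGCGCGATCG\n",
                       ">chr2\n", "AATCGCTGCTGC\n"])
print(subsequence(records['chr1'], 2, 5))
```

Passing an open file handle to parse_fasta works the same way, because a file is an iterable of lines.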
Since you are working with fasta files, which are formatted like this:
>Chr1
ATCGACTACAAATTT
>Chr2
ACCTGCCGTAAAAATTTCC
and since I am guessing that, as a bioinformatics major, you will be manipulating sequences often, I recommend installing the Perl package called FAST. Once it is installed, to get characters 2 through 14 of every sequence you would do this:
fascut 2..14 fasta_file.fa
Here is the recent publication for FAST and the GitHub repository, which contains a whole toolbox for manipulating molecular sequence data on the command line.
