Counting word occurrences from specific part of txt files using python3


I have a folder with a number of txt files.
I want to count the number of occurrences of a set of words in a certain part of each txt file and export the results to a new Excel file.
Specifically, I want to look for occurrences only in the part of the text that begins after the words "Company A" and ends at the words "Company B."
For example:
I want to look for the words "Corporation" and "Board" in the bold part of the following text:
...the Board of Company A oversees the management of risks inherent in the operation of the Corporation businesses and the implementation of its strategic plan. The Board reviews the risks associated with the Corporation strategic plan at an annual strategic planning session and periodically throughout the year as part of its consideration of the strategic direction of Company B. In addition, the Board addresses the primary risks associated with...
I have managed to count the occurrences of the set of words but from the whole txt file and not the part from Company A up to Company B.
import os
import sys
import glob
def countWords(filename, list_words):
    try:
        reading = open(filename, "r+", encoding="utf-8")
        check = reading.readlines()
        reading.close()
        for each in list_words:
            lower = each.lower()
            count = 0
            for string in check:
                word_check = string.split()
                for word in word_check:
                    lowerword = word.lower()
                    line = lowerword.strip("!##$%^&*()_+?><:.,-'\\ ")
                    if lower == line:
                        count += 1
            print(lower, ":", count)
    except FileNotFoundError:
        print("This file doesn't exist.")
        for zero in list_words:
            if zero != "":
                print(zero, ":", "0")
            else:
                pass
for filename in glob.iglob('file path' + '**/*', recursive=True):
    print('----')
    print(os.path.basename(filename))
    countWords(filename, ["Corporation", "Board"])
The final output for the example text should be like this:
txtfile1
Corporation: 2
Board: 1
And the above process should be replicated for all txt files of the folder and exported as an excel file.
Thanks for the consideration and I apologize in advance for the length of the question.

You might try a regular expression, assuming you want the whole string when company a is repeated before company b appears:
import re
re.findall('company a.*?company b', 'company a did some things in agreement with company b')
This returns a list of every substring that starts with company a and ends with company b. Pass flags=re.DOTALL if a section can span several lines, since . does not match a newline by default.
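Building on the regex idea, here is a minimal sketch of counting the words only inside the extracted section. The function name count_in_section and its parameters are made up for illustration, and the tokenizing rule (runs of letters and apostrophes) is an assumption about what should count as a word:

```python
import re
from collections import Counter

def count_in_section(text, words, start="Company A", end="Company B"):
    """Count whole-word, case-insensitive hits of `words` between `start` and `end`."""
    # Non-greedy capture; DOTALL lets a section span several lines.
    pattern = re.escape(start) + r"(.*?)" + re.escape(end)
    sections = re.findall(pattern, text, flags=re.DOTALL | re.IGNORECASE)
    tokens = Counter()
    for section in sections:
        # Assumption: a word is a run of letters and apostrophes.
        tokens.update(re.findall(r"[a-z']+", section.lower()))
    return {word: tokens[word.lower()] for word in words}

sample = (
    "...the Board of Company A oversees the management of risks inherent in "
    "the operation of the Corporation businesses and the implementation of "
    "its strategic plan. The Board reviews the risks associated with the "
    "Corporation strategic plan at an annual strategic planning session and "
    "periodically throughout the year as part of its consideration of the "
    "strategic direction of Company B. In addition, the Board addresses the "
    "primary risks associated with..."
)
print(count_in_section(sample, ["Corporation", "Board"]))
# {'Corporation': 2, 'Board': 1}
```

Calling this once per file inside the glob loop gives one dict per file. Those rows could then be written with csv.DictWriter to a .csv file that Excel opens, or with a third-party library if a real .xlsx file is required.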

Related

Input Function and returning number of words in dataset

I'm supposed to be writing a function that takes any word as input, searches for it in the song titles, and returns the number of songs that contain the word. If the word is not found, it should return a statement saying no songs were found. Instead, my output runs the elif statement and then my if statement. I'll post what my output looks like.
import csv
word_count = 0
with open("billboard_songs.csv") as data:
    word = input("Enter any word: ")
    for line in data:
        line_strip = line.split(",")
        if word.casefold() in line_strip[1]:
            word_count += 1
            print(word_count, "songs were found to contain", word.casefold(), "in this data set")
        elif word_count == 1:
            print("No songs were found to contain the words: ", word.casefold())
Current output:
No songs were found to contain the words: war
No songs were found to contain the words: war
No songs were found to contain the words: war
No songs were found to contain the words: war
2 songs were found to contain war in this data set
3 songs were found to contain war in this data set
4 songs were found to contain war in this data set
5 songs were found to contain war in this data set
6 songs were found to contain war in this data set
7 songs were found to contain war in this data set
8 songs were found to contain war in this data set

There are several issues with the code.
You should be using the csv library you've already imported rather than splitting on commas.
Your if/elif prints inside the loop, so it produces a line of output for every row instead of one result at the end.
You should do something similar to the following:
import csv # Use it!
Store the word as a variable:
word = input("Enter any word: ").casefold()
Hopefully your CSV has headers in it... use csv.DictReader if it does:
reader = csv.DictReader(open('billboard_songs.csv', 'r'))
Iterate through each song in the CSV. From line_strip[1], it looks as if your song lyrics are in the second field, so check that field in every row. Set up a variable to hold the count of songs containing the word, and iterate through the full CSV before printing any output:
word_count = 0
for row in reader:
    # replace 'song_lyrics' with your CSV header for the field with song lyrics
    # casefold the field too, so the comparison ignores case on both sides
    if word in row['song_lyrics'].casefold():
        word_count += 1
Once that finishes, you can use an if/else statement to print any desired output:
if word_count == 0:
    print('No songs were found to contain the words: {}'.format(word))
else:
    # at least one set of lyrics had the word!
    print('{} song(s) were found to contain {} in this data set'.format(word_count, word))
Or, instead of the for loop and everything else below reader, you could use sum as follows (each True counts as 1):
word_count = sum(word in row['song_lyrics'].casefold() for row in reader)
Then you could just use a generic print statement:
print('There were {} songs that contained the word: {}'.format(word_count, word))
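Putting those steps together, a self-contained sketch might look like the following. The helper name count_songs_containing, the column name title, and the sample rows are all made up for illustration, with io.StringIO standing in for the real billboard_songs.csv:

```python
import csv
import io

def count_songs_containing(csv_file, word, column="title"):
    """Count rows whose `column` contains `word`, ignoring case."""
    needle = word.casefold()
    reader = csv.DictReader(csv_file)
    # Consume the whole file before reporting anything.
    return sum(needle in row[column].casefold() for row in reader)

# Made-up rows standing in for billboard_songs.csv.
sample = io.StringIO(
    "artist,title\n"
    "Artist A,War Song\n"
    "Artist B,One Love\n"
    "Artist C,Civil War\n"
)
count = count_songs_containing(sample, "war")
if count == 0:
    print("No songs were found to contain the word:", "war")
else:
    print(count, "song(s) were found to contain", "war", "in this data set")
```

Note that this is a substring test, so "war" would also match a title such as "Toward". Splitting the field into words first would avoid that, if whole-word matching is what you need.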

Counting the words a character said in a movie script

I already managed to extract the spoken words with some help.
Now I'm looking to get the text spoken by a chosen person,
so I can type in MIA and get every single word she says in the movie.
Like this:
name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)
That way I'm able to count the words afterwards.
This is what the movie script looks like:
An awkward beat. They pass a wooden SALOON -- where a WESTERN
is being shot. Extras in COWBOY costumes drink coffee on the
steps.
Revision 25.
MIA (CONT'D)
I love this stuff. Makes coming to work
easier.
SEBASTIAN
I know what you mean. I get breakfast
five miles out of the way just to sit
outside a jazz club.
MIA
Oh yeah?
SEBASTIAN
It was called Van Beek. The swing bands
played there. Count Basie. Chick Webb.
(then,)
It's a samba-tapas place now.
MIA
A what?
SEBASTIAN
Samba-tapas. It's... Exactly. The joke's on
history.

I would ask the user for all the names in the script first, then ask which name they want the words for. I would search the text word by word until I found the wanted name, then copy the following words into a variable until I hit a name that matches someone else in the script. People could say the name of another character, but if you assume the headings for people speaking are either all caps or on a line of their own, the text should be rather easy to filter.
recording = False
spoken_text = ""
for word in script.split():
    if word == speaker and word.isupper(): # you may want to check that this is on its own line as well.
        recording = True
    elif word in character_names and word.isupper(): # you may want to check that this is on its own line as well.
        recording = False
    elif recording: # elif, so the speaker's own name is not recorded
        spoken_text += word + " "

I will outline how you could generate a dict which can give you the number of words spoken for all speakers and one which approximates your existing implementation.
General Use
If we define a word to be any chunk of characters in a string split along ' ' (space)...
import re
speaker = '' # current speaker
words = 0 # number of words on line
word_count = {} # dict of speakers and the number of words they speak
for line in script.split('\n'):
    if re.match('^[ ]{19}[^ ]{1,}.*', line): # name of speaker
        speaker = line.split(' (')[0][19:]
    if re.match('^[ ]{6}[^ ]{1,}.*', line): # dialogue line
        words = len(line.split())
        if speaker in word_count:
            word_count[speaker] += words
        else:
            word_count[speaker] = words
Generates a dict with the format {'JOHN DOE':55} if John Doe says 55 words.
Example output:
>>> word_count['MIA']
13
Your Implementation
Here is a version of the above procedure that approximates your implementation.
import re
def wordsspoken(script,name):
    word_count = 0
    speaker = '' # no speaker seen yet
    for line in script.split('\n'):
        if re.match('^[ ]{19}[^ ]{1,}.*', line): # name of speaker
            speaker = line.split(' (')[0][19:]
        if re.match('^[ ]{6}[^ ]{1,}.*', line): # dialogue line
            if speaker == name:
                word_count += len(line.split())
    print(word_count)
def main():
    name = input("Enter name:")
    wordsspoken(script, name)
    name1 = input("Enter another name:")
    wordsspoken(script, name1)

If you want to compute your tally with only one pass over the script (which I imagine could be pretty long), you could just track which character is speaking; set things up like a little state machine:
import re
from collections import Counter, defaultdict
words_spoken = defaultdict(Counter)
currently_speaking = 'Narrator'
for line in SCRIPT.split('\n'):
    name = line.replace('(CONT\'D)', '').strip()
    if re.match('^[A-Z]+$', name):
        currently_speaking = name
    else:
        words_spoken[currently_speaking].update(line.split())
You could use a more sophisticated regex to detect when the speaker changes, but this should do the trick.
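As a rough illustration, the same loop can be run on a shortened excerpt of the script and the per-word counters collapsed into one total per speaker. The excerpt and the totals name are chosen only for this sketch:

```python
import re
from collections import Counter, defaultdict

# Shortened excerpt of the script from the question.
SCRIPT = """\
MIA (CONT'D)
I love this stuff. Makes coming to work
easier.
SEBASTIAN
I know what you mean.
MIA
Oh yeah?
"""

words_spoken = defaultdict(Counter)
currently_speaking = "Narrator"
for line in SCRIPT.split("\n"):
    name = line.replace("(CONT'D)", "").strip()
    if re.match(r"^[A-Z]+$", name):
        # A line made only of capitals switches the current speaker.
        currently_speaking = name
    else:
        words_spoken[currently_speaking].update(line.split())

# Collapse each speaker's Counter into a single word total.
totals = {speaker: sum(counts.values()) for speaker, counts in words_spoken.items()}
print(totals)
# {'MIA': 11, 'SEBASTIAN': 5}
```

Because the pattern only accepts the letters A to Z, a heading such as a two-word name would be treated as dialogue. Widening the character class is one way to handle that.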

There are some good ideas above. The following should work just fine in Python 2.x and 3.x:
import codecs
from collections import defaultdict
speaker_words = defaultdict(str)
with codecs.open('script.txt', 'r', 'utf8') as f:
    speaker = ''
    for line in f.read().split('\n'):
        # skip empty lines
        if not line.split():
            continue
        # speakers have their names in all uppercase
        first_word = line.split()[0]
        if (len(first_word) > 1) and all([char.isupper() for char in first_word]):
            # remove the (CONT'D) from a speaker string
            speaker = line.split('(')[0].strip()
        # check if this is a dialogue line
        elif len(line) - len(line.lstrip()) == 6:
            speaker_words[speaker] += line.strip() + ' '
# get a Python-version-agnostic input
try:
    prompt = raw_input
except NameError: # raw_input does not exist in Python 3
    prompt = input
speaker = prompt('Enter name: ').strip().upper()
print(speaker_words[speaker])
Example Output:
Enter name: sebastian
I know what you mean. I get breakfast five miles out of the way just to sit outside a jazz club. It was called Van Beek. The swing bands played there. Count Basie. Chick Webb. It's a samba-tapas place now. Samba-tapas. It's... Exactly. The joke's on history.

Trying to read text file and count words within defined groups

I'm a novice Python user. I'm trying to create a program that reads a text file and searches that text for certain words that are grouped (that I predefine by reading from csv). For example, if I wanted to create my own definition for "positive" containing the words "excited", "happy", and "optimistic", the csv would contain those terms. I know the below is messy - the txt file I am reading from contains 7 occurrences of the three "positive" tester words I read from the csv, yet the results print out to be 25. I think it's returning character count, not word count. Code:
import csv
import string
import re
from collections import Counter
remove = dict.fromkeys(map(ord, '\n' + string.punctuation))
# Read the .txt file to analyze.
with open("test.txt", "r") as f:
    textanalysis = f.read()
    textresult = textanalysis.lower().translate(remove).split()
# Read the CSV list of terms.
with open("positivetest.csv", "r") as senti_file:
    reader = csv.reader(senti_file)
    positivelist = list(reader)
# Convert term list into flat chain.
from itertools import chain
newposlist = list(chain.from_iterable(positivelist))
# Convert chain list into string.
posstring = ' '.join(str(e) for e in newposlist)
posstring2 = posstring.split(' ')
posstring3 = ', '.join('"{}"'.format(word) for word in posstring2)
# Count number of words as defined in list category
def positive(str):
    counts = dict()
    for word in posstring3:
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    total = sum (counts.values())
    return total
# Print result; will write to CSV eventually
print ("Positive: ", positive(textresult))

I'm a beginner as well but I stumbled upon a process that might help. After you read in the file, split the text at every space, tab, and newline. In your case, I would keep all the words lowercase and include punctuation in your split call. Save this as an array and then parse it with some sort of loop to get the number of instances of each 'positive,' or other, word.
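The split-and-count approach described above could be sketched roughly as follows. The function name count_group_words and the sample sentence are made up, the word list is a stand-in for whatever the CSV contains, and punctuation is stripped with str.translate rather than inside the split call. Since posstring3 in the question is a single string, looping over it visits characters rather than words, which fits the suspicion that the result is a character count:

```python
import string
from collections import Counter

def count_group_words(text, group_words):
    """Count how often each word of a predefined group occurs in text."""
    # Lowercase and strip punctuation so "Happy!" matches "happy".
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    counts = Counter(cleaned.split())
    return {word: counts[word] for word in group_words}

# Stand-in for the terms read from the CSV file.
positive_words = ["excited", "happy", "optimistic"]
text = "Happy days! I am excited, really excited, and optimistic. Not sad."
per_word = count_group_words(text, positive_words)
total = sum(per_word.values())
print(per_word, total)
# {'excited': 2, 'happy': 1, 'optimistic': 1} 4
```

Keeping the group as a list of words, and the text as a list of tokens, means the loop always compares word against word.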
Look at this, specifically the "train" function:
https://github.com/G3Kappa/Adjustable-Markov-Chains/blob/master/markovchain.py
Also, this link, ignore the JSON stuff at the beginning, the article talks about sentiment analysis:
https://dev.to/rodolfoferro/sentiment-analysis-on-trumpss-tweets-using-python-
Same applies with this link:
http://adilmoujahid.com/posts/2014/07/twitter-analytics/
Good luck!

I looked at your code and passed through some of my own as a sample.
I have two ideas for you, based on what I think you may want.
First Assumption: You want a basic sentiment count?
Getting to 'textresult' is great. Then you did the same with the 'positive lexicon' - to [positivelist] which I thought would be the perfect action? Then you converted [positivelist] to essentially a big sentence.
Would you not just:
1. Pass a 'stop_words' list through [textresult]
2. merge the two dataframes [textresult (less stopwords) and positivelist] for common words - as in an 'inner join'
3. Then basically do your term frequency
4. It is much easier to aggregate the score then
Second assumption: you are focusing on "excited", "happy", and "optimistic"
and you are trying to isolate text themes into those 3 categories?
1. again stop at [textresult]
2. download the 'nrc' and/or 'syuzhet' emotional valence dictionaries
They breakdown emotive words by 8 emotional groups
So if you only want 3 of the 8 emotive groups (subset)
3. Process it like you did to get [positivelist]
4. do another join
Sorry, this is a bit hashed up, but if I was anywhere near what you were thinking, let me know and we can make contact.
Second apology: I'm also a novice Python user, and I am adapting what I use in R to Python in the above (it's not subtle either :) )

Having trouble with two of my functions for text analysis

I'm having trouble trying to find the number of unique words in a speech text file (well, actually 3 files). I'm just going to give you my full code so there are no misunderstandings.
#This program will serve to analyze text files for the number of words in
#the text file, number of characters, sentences, unique words, and the longest
#word in the text file. This program will also provide the frequency of unique
#words. In particular, the text will be three political speeches which we will
#analyze, building on searching techniques in Python.
def main():
    harper = readFile("Harper's Speech.txt")
    newWords = cleanUpWords(harper)
    print(numCharacters(harper), "Characters.")
    print(numSentances(harper), "Sentances.")
    print(numWords(newWords), "Words.")
    print(uniqueWords(newWords), "Unique Words.")
    print("The longest word is: ", longestWord(newWords))
    obama1 = readFile("Obama's 2009 Speech.txt")
    newWords = cleanUpWords(obama1)
    print(numCharacters(obama1), "Characters.")
    print(numSentances(obama1), "Sentances.")
    print(numWords(obama1), "Words.")
    print(uniqueWords(newWords), "Unique Words.")
    print("The longest word is: ", longestWord(newWords))
    obama2 = readFile("Obama's 2008 Speech.txt")
    newWords = cleanUpWords(obama2)
    print(numCharacters(obama2), "Characters.")
    print(numSentances(obama2), "Sentances.")
    print(numWords(obama2), "Words.")
    print(uniqueWords(newWords), "Unique Words.")
    print("The longest word is: ", longestWord(newWords))
def readFile(filename):
    '''Function that reads a text file, then prints the name of file without
    '.txt'. The function returns the read file for main() to call, and prints
    the file's name so the user knows which file is read'''
    inFile1 = open(filename, "r")
    fileContentsList = inFile1.read()
    inFile1.close()
    print("\n", filename.replace(".txt", "") + ":")
    return fileContentsList
def numCharacters(file):
    '''Function returns the length of the READ file (not readlines because it
    would only read the amount of lines and counting characters would be wrong),
    which will be the correct amount of total characters in the text file.'''
    return len(file)
def numSentances(file):
    '''Function returns the occurrences of a period, exclamation point, or
    a question mark, thus counting the amount of full sentences in the text file.'''
    return file.count(".") + file.count("!") + file.count("?")
def cleanUpWords(file):
    words = (file.replace("-", " ").replace("  ", " ").replace("\n", " "))
    onlyAlpha = ""
    for i in words:
        if i.isalpha() or i == " ":
            onlyAlpha += i
    return onlyAlpha.replace("  ", " ")
def numWords(newWords):
    '''Function finds the amount of words in the text file by returning
    the length of the cleaned up version of words from cleanUpWords().'''
    return len(newWords.split())
def uniqueWords(newWords):
    unique = sorted(newWords.split())
    unique = set(unique)
    return str(len(unique))
def longestWord(file):
    max(file.split())
main()
So, my last two functions, uniqueWords and longestWord, will not work properly, or at least my output is wrong. For unique words, I'm supposed to get 527, but I'm actually getting 567 for some odd reason. Also, my longest word function always prints None, no matter what I do. I've tried many ways to get the longest word; the above is just one of them, but all return None. Please help me with my two sad functions!

Your longestWord prints None because it has no return statement, and max without key=len compares words alphabetically rather than by length. Try to do it this way:
def longestWord(file):
    return sorted(file.split(), key = len)[-1]
Or it would be even easier to do both in uniqueWords:
def uniqueWords(newWords):
    unique = set(newWords.split())
    return (str(len(unique)), max(unique, key=len))
info = uniqueWords("My name is Harper")
print("Unique words" + info[0])
print("Longest word" + info[1])
You don't need sorted before set to get all the unique words, because a set is an unordered collection of unique elements.
Also look at cleanUpWords. If you have a string like Hello I'm Harper. Harper I am., then after cleaning it up you get 5 unique words (Hello, Im, Harper, I, am), one of which is the non-word Im.
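One possible reason for the inflated unique count, and this is only an assumption since the speech files are not shown, is that The and the are counted as different words. A minimal sketch that lowercases the text and keeps apostrophes inside words (the function names here are illustrative, not the asker's):

```python
import re

def unique_words(text):
    """Return the distinct words, ignoring case and keeping inner apostrophes."""
    return set(re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower()))

def longest_word(text):
    """Return the longest distinct word in the text."""
    return max(unique_words(text), key=len)

sample = "Hello I'm Harper. Harper I am. HELLO!"
print(sorted(unique_words(sample)))
# ['am', 'harper', 'hello', 'i', "i'm"]
print(longest_word(sample))
# harper
```

Whether I'm should count as one word or two is a judgment call. The pattern above keeps it as one.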

outputting items from a list with a specific format

I have a program that stores car registration plates along with their speed for a stretch of road. The input data is stored in two separate lists. The format for a vehicle registration is two letters, two numbers, three letters (e.g. DV61 EUB).
I need to output any cars that do not meet this format (personalised or foreign number plates) but are above the speed limit. I'm finding it difficult to pick items from the list that do not meet the standard vehicle registration format.
I need a section of code that pulls any Car Registrations from the list that are a different format and over the speed limit.
def nonstandard ():
    global OverLimit
    global speed
    count = 0
    while count < len(RegPlate):
        speed.append(SensorDist/(int(time[count])))
        OverLimit = "Registration Plate: " + str(RegPlate[count])+ "\n" + "Speed recorded: "+ str(speed[count])
        if RegPlate != []:
            with open ("H:\\Programming Practice\\New controlled assessment\\Output.txt", "w") as text_file:
                text_file.write(OverLimit)

If you just want the items from the list that match the pattern you provided, you can use re and filter (note the raw string, so that \d and \s reach the regex engine unchanged):
import re
l = ["DV61 EUB","DV61 EUBN"]
print (list(filter(lambda x: re.match(r"[A-Z]{2}\d{2}\s+[A-Z]{3}$",x),l)))
['DV61 EUB']
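The question asks for the opposite selection: plates that do not match the standard format and are over the limit. A sketch of that, where the function name nonstandard_speeders, the sample plates, the speeds, and the limit are all invented for illustration:

```python
import re

# Two letters, two digits, whitespace, three letters.
STANDARD_PLATE = re.compile(r"[A-Z]{2}\d{2}\s+[A-Z]{3}")

def nonstandard_speeders(plates, speeds, limit):
    """Return (plate, speed) pairs over the limit whose plate is non-standard."""
    return [
        (plate, speed)
        for plate, speed in zip(plates, speeds)
        if speed > limit and not STANDARD_PLATE.fullmatch(plate)
    ]

# Made-up readings; the two lists are parallel, as in the question.
plates = ["DV61 EUB", "B16 BOSS", "AB12 CDE", "1-ABC-234"]
speeds = [35.0, 42.5, 50.0, 28.0]
result = nonstandard_speeders(plates, speeds, limit=30.0)
print(result)
# [('B16 BOSS', 42.5)]
```

fullmatch requires the whole plate to fit the pattern, which plays the same role as the trailing $ in the pattern above.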
