Counting the words a character said in a movie script - python

I already managed to extract the spoken words with some help.
Now I'm looking to get the text spoken by a chosen person,
so that I can type in MIA and get every single word she says in the movie,
like this:
name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)
That way I can count the words afterwards.
This is what the movie script looks like:
An awkward beat. They pass a wooden SALOON -- where a WESTERN
is being shot. Extras in COWBOY costumes drink coffee on the
steps.
Revision 25.
MIA (CONT'D)
I love this stuff. Makes coming to work
easier.
SEBASTIAN
I know what you mean. I get breakfast
five miles out of the way just to sit
outside a jazz club.
MIA
Oh yeah?
SEBASTIAN
It was called Van Beek. The swing bands
played there. Count Basie. Chick Webb.
(then,)
It's a samba-tapas place now.
MIA
A what?
SEBASTIAN
Samba-tapas. It's... Exactly. The joke's on
history.

I would ask the user for all the names in the script first. Then ask which name they want the words for. I would search the text word by word till I found the name wanted and copy the following words into a variable until I hit a name that matches someone else in the script. Now people could say the name of another character, but if you assume titles for people speaking are either all caps, or on a single line, the text should be rather easy to filter.
recording = False  # True while the chosen speaker is talking
spoken_text = ""
for word in script.split():
    if word == speaker and word.isupper():  # you may want to check that this is on its own line as well.
        recording = True
        continue  # do not record the speaker's name itself
    elif word in character_names and word.isupper():  # you may want to check that this is on its own line as well.
        recording = False
    if recording:
        spoken_text += word + " "

I will outline two approaches: one that builds a dict giving the number of words spoken by every speaker, and one that approximates your existing implementation.
General Use
If we define a word to be any chunk of characters in a string split along ' ' (space)...
import re
speaker = ''     # current speaker
words = 0        # number of words on line
word_count = {}  # dict of speakers and the number of words they speak
for line in script.split('\n'):
    if re.match(r'^[ ]{19}[^ ]{1,}.*', line):  # name of speaker
        speaker = line.split(' (')[0][19:]
    if re.match(r'^[ ]{6}[^ ]{1,}.*', line):  # dialogue line
        words = len(line.split())
        if speaker in word_count:
            word_count[speaker] += words
        else:
            word_count[speaker] = words
Generates a dict with the format {'JOHN DOE':55} if John Doe says 55 words.
Example output:
>>> word_count['MIA']
13
Your Implementation
Here is a version of the above procedure that approximates your implementation.
import re
def wordsspoken(script, name):
    word_count = 0
    speaker = ''  # no speaker until the first name line is seen
    for line in script.split('\n'):
        if re.match(r'^[ ]{19}[^ ]{1,}.*', line):  # name of speaker
            speaker = line.split(' (')[0][19:]
        if re.match(r'^[ ]{6}[^ ]{1,}.*', line):  # dialogue line
            if speaker == name:
                word_count += len(line.split())
    print(word_count)
def main():
    # script is assumed to already hold the full text of the movie script
    name = input("Enter name:")
    wordsspoken(script, name)
    name1 = input("Enter another name:")
    wordsspoken(script, name1)

If you want to compute your tally with only one pass over the script (which I imagine could be pretty long), you could just track which character is speaking; set things up like a little state machine:
import re
from collections import Counter, defaultdict
words_spoken = defaultdict(Counter)
currently_speaking = 'Narrator'
for line in SCRIPT.split('\n'):
    name = line.replace('(CONT\'D)', '').strip()
    if re.match('^[A-Z]+$', name):
        currently_speaking = name
    else:
        words_spoken[currently_speaking].update(line.split())
You could use a more sophisticated regex to detect when the speaker changes, but this should do the trick.
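As a rough sketch of what a "more sophisticated regex" could look like: the version below assumes a speaker cue is a line made up only of capitalised words of two or more characters, optionally followed by a parenthetical such as (CONT'D), and that parentheticals like (then,) should not be counted. The names SPEAKER_RE and count_words are illustrative and not part of the answer above.

```python
import re
from collections import Counter

# Assumption: a speaker cue is one or more capitalised words (2+ characters
# each), optionally followed by a parenthetical such as (CONT'D).
SPEAKER_RE = re.compile(r"^([A-Z][A-Z.'-]+(?: [A-Z][A-Z.'-]+)*)(?: \(.*\))?$")


def count_words(script):
    """Return a Counter mapping each speaker to the number of words spoken."""
    counts = Counter()
    speaker = None  # nobody is speaking until the first cue
    for raw in script.split('\n'):
        line = raw.strip()
        if not line:
            continue
        match = SPEAKER_RE.match(line)
        if match:
            speaker = match.group(1)
        elif speaker is not None and not line.startswith('('):
            # lines starting with "(" are stage directions such as "(then,)"
            counts[speaker] += len(line.split())
    return counts


sample = """MIA (CONT'D)
I love this stuff. Makes coming to work
easier.
SEBASTIAN
I know what you mean.
MIA
Oh yeah?"""

print(count_words(sample))
```

Two caveats: a dialogue line consisting only of capitals (for example "OK") would be mistaken for a speaker cue, and action lines are still attributed to whoever spoke last, which the indentation-based answers avoid.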

There are some good ideas above. The following should work just fine in Python 2.x and 3.x:
import codecs
from collections import defaultdict
speaker_words = defaultdict(str)
with codecs.open('script.txt', 'r', 'utf8') as f:
    speaker = ''
    for line in f.read().split('\n'):
        # skip empty lines
        if not line.split():
            continue
        # speakers have their names in all uppercase
        first_word = line.split()[0]
        if (len(first_word) > 1) and all([char.isupper() for char in first_word]):
            # remove the (CONT'D) from a speaker string
            speaker = line.split('(')[0].strip()
        # check if this is a dialogue line
        elif len(line) - len(line.lstrip()) == 6:
            speaker_words[speaker] += line.strip() + ' '
# get a Python-version-agnostic input
try:
    prompt = raw_input
except NameError:  # raw_input does not exist in Python 3
    prompt = input
speaker = prompt('Enter name: ').strip().upper()
print(speaker_words[speaker])
Example Output:
Enter name: sebastian
I know what you mean. I get breakfast five miles out of the way just to sit outside a jazz club. It was called Van Beek. The swing bands played there. Count Basie. Chick Webb. It's a samba-tapas place now. Samba-tapas. It's... Exactly. The joke's on history.

Related

Input Function and returning number of words in dataset

I'm supposed to write a function that takes any word as input, searches for it in the song titles, and returns the number of songs that contain the word. If the word is not found, it should return a statement saying no words were found. My output runs the elif statement and then my if statement. I'll post what my output looks like.
import csv
word_count = 0
with open("billboard_songs.csv") as data:
    word = input("Enter any word: ")
    for line in data:
        line_strip = line.split(",")
        if word.casefold() in line_strip[1]:
            word_count += 1
            print(word_count, "songs were found to contain", word.casefold(), "in this data set")
        elif word_count == 1:
            print("No songs were found to contain the words: ", word.casefold())
Current output:
No songs were found to contain the words: war
No songs were found to contain the words: war
No songs were found to contain the words: war
No songs were found to contain the words: war
2 songs were found to contain war in this data set
3 songs were found to contain war in this data set
4 songs were found to contain war in this data set
5 songs were found to contain war in this data set
6 songs were found to contain war in this data set
7 songs were found to contain war in this data set
8 songs were found to contain war in this data set
There are so many issues with the code.
You should be using the csv library you've already imported, not splitting on comma ,.
Your if statement really isn't doing what you might expect.
You should do something similar to the following:
import csv # Use it!
Store the word as a variable:
word = input("Enter any word: ").casefold()
Hopefully your CSV has headers in it... use csv.DictReader if it does:
reader = csv.DictReader(open('billboard_songs.csv', 'r'))
Iterate through each row in the CSV. From line_strip[1], it looks as if your song lyrics are in the second field, so check that field in every row. You should also set up a variable to store the count of songs containing the word at this stage. Iterate through the full CSV first, before printing any output:
word_count = 0
for row in reader:
    lyrics = row['song_lyrics']  # replace 'song_lyrics' with your CSV header for the field with song lyrics
    # Check the word is present
    if word in lyrics:
        word_count += 1
Once that finishes, you can use an if/else statement to print any desired output:
if word_count == 0:
    print('No songs were found to contain the words: {}'.format(word))
else:
    # at least one set of lyrics had the word!
    print('{} song(s) were found to contain {} in this data set'.format(word_count, word))
Or, instead of the for loop and everything else below reader, you could use sum as follows (each True counts as 1):
word_count = sum(word in row['song_lyrics'] for row in reader)
Then you could just use a generic print statement:
print('There were {} songs that contained the word: {}'.format(word_count, word))
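Putting these steps together, a minimal self-contained sketch might look like the following. It reads from an in-memory string instead of billboard_songs.csv, and both the sample rows and the column names ('artist', 'song_lyrics') are made up for illustration; substitute your real file and headers. The function name count_songs_containing is likewise hypothetical.

```python
import csv
import io

# Hypothetical stand-in for billboard_songs.csv; headers and rows are made up.
SAMPLE_CSV = """artist,song_lyrics
Artist A,war is over
Artist B,all about love
Artist C,no more WAR
"""


def count_songs_containing(csv_file, word, field='song_lyrics'):
    """Count rows whose `field` contains `word`, ignoring case."""
    word = word.casefold()
    reader = csv.DictReader(csv_file)
    # each True counts as 1, so sum() yields the number of matching rows
    return sum(word in row[field].casefold() for row in reader)


word = 'war'
word_count = count_songs_containing(io.StringIO(SAMPLE_CSV), word)
if word_count == 0:
    print('No songs were found to contain the word: {}'.format(word))
else:
    print('{} song(s) were found to contain {} in this data set'.format(word_count, word))
```

Note that this is a substring test, so 'war' would also match 'warm'; split the field into words first if you need whole-word matches.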

Counting word occurrences from specific part of txt files using python3

I have a folder with a number of txt files.
I want to count the number of occurrences of a set of words in a certain part of each txt file and export the results to a new Excel file.
Specifically, I want to look for occurrences of the words only in the part of the text that begins after "Company A" and ends at "Company B".
For example:
I want to look for the words "Corporation" and "Board" in the part of the following text that lies between "Company A" and "Company B":
...the Board of Company A oversees the management of risks inherent in the operation of the Corporation businesses and the implementation of its strategic plan. The Board reviews the risks associated with the Corporation strategic plan at an annual strategic planning session and periodically throughout the year as part of its consideration of the strategic direction of Company B. In addition, the Board addresses the primary risks associated with...
I have managed to count the occurrences of the set of words but from the whole txt file and not the part from Company A up to Company B.
import os
import sys
import glob
for filename in glob.iglob('file path' + '**/*', recursive=True):
    def countWords(filename, list_words):
        try:
            reading = open(filename, "r+", encoding="utf-8")
            check = reading.readlines()
            reading.close()
            for each in list_words:
                lower = each.lower()
                count = 0
                for string in check:
                    word_check = string.split()
                    for word in word_check:
                        lowerword = word.lower()
                        line = lowerword.strip("!##$%^&*()_+?><:.,-'\\ ")
                        if lower == line:
                            count += 1
                print(lower, ":", count)
        except FileNotFoundError:
            print("This file doesn't exist.")
            for zero in list_words:
                if zero != "":
                    print(zero, ":", "0")
                else:
                    pass
    print('----')
    print(os.path.basename(filename))
    countWords(filename, ["Corporation", "Board"])
The final output for the example text should be like this:
txtfile1
Corporation: 2
Board: 1
And the above process should be replicated for all txt files of the folder and exported as an excel file.
Thanks for the consideration and I apologize in advance for the length of the question.
You might try a regular expression, assuming you want the whole string when "company a" is repeated before "company b" appears:
re.findall('company a.*?company b', 'company a did some things in agreement with company b')
That returns a list of all the substrings that start with company a and end with company b. Pass flags=re.DOTALL if the passage can span several lines.
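To connect this to the counting step, here is a small sketch that extracts the text between the two markers and counts whole words inside it. The function name count_in_region and its parameters are illustrative, matching is case-insensitive by assumption, and the Excel export is left out.

```python
import re
from collections import Counter


def count_in_region(text, words, start='Company A', end='Company B'):
    """Count whole-word occurrences of `words` between `start` and `end`."""
    # re.DOTALL lets '.' cross line breaks; '.*?' stops at the first `end`.
    pattern = re.escape(start) + r'(.*?)' + re.escape(end)
    regions = re.findall(pattern, text, flags=re.DOTALL | re.IGNORECASE)
    tokens = Counter()
    for region in regions:
        tokens.update(re.findall(r'\w+', region.lower()))
    return {word: tokens[word.lower()] for word in words}


text = ("...the Board of Company A oversees the management of risks inherent "
        "in the operation of the Corporation businesses and the implementation "
        "of its strategic plan. The Board reviews the risks associated with the "
        "Corporation strategic plan at an annual strategic planning session and "
        "periodically throughout the year as part of its consideration of the "
        "strategic direction of Company B. In addition, the Board addresses "
        "the primary risks associated with...")

print(count_in_region(text, ['Corporation', 'Board']))
# prints {'Corporation': 2, 'Board': 1}
```

The "Board" occurrences before "Company A" and after "Company B" are ignored, which gives the counts requested in the question for the example text.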

Trying to create index, but variable cannot reach defined value [Python]

I am trying to create an index of a .txt file, which has section titles in all caps. My attempt looks like this:
dictionary = {}
line_count = 0
for line in file:
    line_count += 1
    line = re.sub(r'[^a-zA-Z0-9 -]','',line)
    list = []
    if line.isupper():
        head = line
    else:
        list = line.split(' ')
        for i in list:
            if i not in stopwords:
                dictionary.setdefault(i, {}).setdefault(head, []).append(line_count)
The head variable, however, cannot find its value, which I am trying to assign to any lines that are all caps. My desired output would be something like:
>>dictionary['cat']
{'THE PARABLE': [3894, 3924, 3933, 3936, 3939], 'SNOW': [4501], 'THE CHASE': [6765, 6767, 6772, 6773, 6785, 6802, 6807, 6820, 6823, 6839]}
Here is a slice of the data:
THE GOLDEN BIRD
A certain king had a beautiful garden, and in the garden stood a tree
which bore golden apples. These apples were always counted, and about
the time when they began to grow ripe it was found that every night one
of them was gone.
THE PARABLE
Influenced by those remarks, the bird next morning refused to bring in
the wood, telling the others that he had been their servant long enough,
and had been a fool into the bargain, and that it was now time to make a
change, and to try some other way of arranging the work. Beg and pray
as the mouse and the sausage might, it was of no use; the bird remained
master of the situation, and the venture had to be made. They therefore
drew lots, and it fell to the sausage to bring in the wood, to the mouse
to cook, and to the bird to fetch the water.
At its heart, your problem is this test:
if line.isupper():
It fails for a line that contains no letters, such as a number ('0'). Instead try:
if line.upper() == line:
But in general your code could use a little Pythonic love, like maybe:
import re
data = {}
head = ''  # lines before the first title are filed under ''
with open('file1', 'r') as f:
    for line_num, line in enumerate(f):
        line = re.sub(r'[^a-zA-Z0-9 -]', '', line)
        # a title is a non-empty line with no lowercase letters
        if line.strip() and line.upper() == line:
            head = line
        else:
            # stopwords is assumed to be defined as in your code
            for word in (w for w in line.split(' ') if w not in stopwords):
                data.setdefault(word, {}).setdefault(head, []).append(line_num+1)
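For something you can run without the original file, here is a sketch of the same idea wrapped in a function and fed a few lines from the sample data. The tiny STOPWORDS set is a placeholder, build_index is an illustrative name, and lowercasing the words is an extra assumption so that "Garden" and "garden" share an entry.

```python
import re

STOPWORDS = {'a', 'and', 'in', 'the', 'had'}  # placeholder list, an assumption


def build_index(lines, stopwords=STOPWORDS):
    """Map word -> {section title -> [line numbers]}."""
    index = {}
    head = ''  # lines before the first title are filed under ''
    for line_num, line in enumerate(lines, start=1):
        line = re.sub(r'[^a-zA-Z0-9 -]', '', line).strip()
        if not line:
            continue  # blank lines must not reset the current title
        if line.upper() == line:
            head = line  # no lowercase letters: treat as a section title
            continue
        for word in line.lower().split():
            if word not in stopwords:
                index.setdefault(word, {}).setdefault(head, []).append(line_num)
    return index


sample = [
    'THE GOLDEN BIRD\n',
    'A certain king had a beautiful garden, and in the garden stood a tree\n',
    '\n',
    'THE PARABLE\n',
    'the bird next morning refused to bring in the wood\n',
]
index = build_index(sample)
print(index['garden'])  # {'THE GOLDEN BIRD': [2, 2]}
print(index['bird'])    # {'THE PARABLE': [5]}
```

A word that occurs twice on one line gets that line number twice; store the numbers in a set instead if you want each line listed once.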

Split txt file into multiple new files with regex

I am calling on the collective wisdom of Stack Overflow because I am at my wits' end trying to figure out how to do this, and I'm a newbie, self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The files are all formatted in relatively the same way with:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word. (for example the answers posted here how to split single txt file into multiple txt files by Python and here Python read through file until match, read until next pattern). It all seems to not work when I have to adjust it to accept my regex of all capital words.
The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.
import re
thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")
with open (thefile, 'rt') as in_file:
    for line in in_file:
        full_file.append(line)
        if pattern.search(line):
            name_occur.append(line)
totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)","",thefile)
while letters <= totalFiles:
    f1 = open(thefile + '-' + str(letters) + ".txt", "a")
    doIHaveToCopyTheLine = False
    ignoreLines = False
    for line in full_file:
        if not ignoreLines:
            f1.write(line)
            full_file.remove(line)
        if pattern.search(line):
            doIHaveToCopyTheLine = True
            ignoreLines = True
    letters += 1
    f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. This will then write each letter to new files with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.
import string
def split_letters(fullpath):
    current_letter = []
    letter_index = 1
    fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
    with open(fullpath, 'r') as letters_file:
        letters = letters_file.readlines()
    for line in letters:
        words = line.split()
        upper_words = []
        for word in words:
            upper_word = ''.join(
                c for c in word if c in string.ascii_uppercase)
            upper_words.append(upper_word)
        len_upper_words = len(upper_words)
        first_word_upper = len_upper_words and len(upper_words[0]) > 1
        second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
        third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
        if first_word_upper and (second_word_upper or third_word_upper):
            current_letter.append(line)
            new_filename = '{0}{1}.{2}'.format(
                fullpath_base, letter_index, fullpath_ext)
            with open(new_filename, 'w') as new_letter:
                new_letter.writelines(current_letter)
            current_letter = []
            letter_index += 1
        else:
            current_letter.append(line)
I tested it on your sample input and it worked fine.
While the other answer is suitable, you may still be curious about using a regex to split up a file.
import re
smallfile = None
buf = ""
with open ('input_file.txt', 'rt') as f:
    for line in f:
        buf += str(line)
        if re.search(r'^([A-Z\s\.]+\b)' , line) is not None:
            if smallfile:
                smallfile.close()
            match = re.findall(r'^([A-Z\s\.]+\b)' , line)
            # strip the trailing space the pattern captures after the name
            smallfile_name = '{}.txt'.format(match[0].strip())
            smallfile = open(smallfile_name, 'w')
            smallfile.write(buf)
            buf = ""
    if smallfile:
        smallfile.close()
If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?
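If you would rather stay in Python, here is a sketch of a regex-based split that groups the lines in memory, so it can be tried without touching the file system. It assumes every letter ends with a signature line that starts with at least two fully capitalised words, with an optional middle initial such as "L."; SIGNATURE_RE and group_letters are illustrative names.

```python
import re

# Assumption: a signature line starts with two capitalised words, optionally
# separated by a middle initial, e.g. "MARJORIE L. SCHWARTZ Los Angeles".
SIGNATURE_RE = re.compile(r"^[A-Z]{2,}(?: [A-Z]\.)? [A-Z]{2,}\b")


def group_letters(lines):
    """Group lines into letters; each letter ends with its signature line."""
    letters, current = [], []
    for line in lines:
        current.append(line)
        if SIGNATURE_RE.match(line):
            letters.append(current)
            current = []
    if current:  # trailing text without a signature
        letters.append(current)
    return letters


sample = [
    "For once, before offering such generous but the unasked for advice...\n",
    "PAUL STONEHILL Los Angeles\n",
    "There you go again. Your editorial again makes groundless criticisms...\n",
    "MATTHEW SHUGART Laguna Beach\n",
    "If West Los Angeles is any indication of the no-growth policy, where do we go from here?\n",
    "MARJORIE L. SCHWARTZ Los Angeles\n",
]
letters = group_letters(sample)
print(len(letters))  # 3
```

Each inner list can then be written out with writelines to a numbered file. Names containing apostrophes or hyphens, or body lines that happen to open with two capitalised words, would need a more careful pattern.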

Using re.findall in python outputting one set of parameters rather than a set of parameters for each line

I've used readlines to split all of the sentences in a file up and I want to use re.findall to go through and find the capitals within them. However, the only output I can get is one set of capitals for all the sentences but I want a set of capitals for each sentence in the file.
I'm using a for loop to attempt this at the moment, but I'm not sure whether this is the best course of action with this task.
Input:
Line 01: HE went to the SHOP
Line 02: THE SHOP HE went
This is what I'm getting as an output:
[HE, SHOP, THE]
and I want to get the output:
[HE, SHOP], [THE, SHOP, HE]
Is there a way of doing this? I've put my coding at the minute below. Thanks!
import re, sys
f = open('findallEX.txt', 'r')
lines = f.readlines()
ii=0
for l in lines:
    sys.stdout.write('line %s: %s' %(ii, l))
    ii = ii + 1
    for x in l
        re.findall('[A-Z]+', l)
        print x
I think the way to do that is as follows:
import re
txt = """HE went to the SHOP
THE SHOP HE went"""
result = []
for s in txt.split('\n'):
    result += [re.findall(r'[A-Z]+', s)]
print(result)  # prints [['HE', 'SHOP'], ['THE', 'SHOP', 'HE']]
Or using list comprehensions (a bit less readable):
import re
txt = """HE went to the SHOP
THE SHOP HE went"""
print([re.findall(r'[A-Z]+', s) for s in txt.split('\n')])
If your data really is in that form (words fully capitalized), you don't even need regexes. isupper is all you need.
with open('findallEX.txt') as f:
    for line in f:
        print([word for word in line.split() if word.isupper()])
