Python 3 Extracting Candidate Words from a Debate File

This is my first post, so I'm sorry if I do anything wrong. That said, I searched for the question and found something similar that was never answered due to the OP not giving sufficient information. This is also homework, so I'm just looking for a hint. I really want to get this on my own.
I need to read in a debate file (.txt), and pull and store all of the lines that one candidate says to put in a word cloud. The file format is supposed to help, but I'm blanking on how to do this. The hint is that each time a new person speaks, their name followed by a colon is the first word in the first line. However, candidates' data can span multiple lines. I am supposed to store each person's lines separately. Here is a sample of the file:
LEHRER: This debate and the next three -- two presidential, one vice
presidential -- are sponsored by the Commission on Presidential
Debates. Tonight's 90 minutes will be about domestic issues and will
follow a format designed by the commission. There will be six roughly
15-minute segments with two-minute answers for the first question,
then open discussion for the remainder of each segment.
Gentlemen, welcome to you both. Let's start the economy, segment one,
and let's begin with jobs. What are the major differences between the
two of you about how you would go about creating new jobs?
LEHRER: You have two minutes. Each of you have two minutes to start. A
coin toss has determined, Mr. President, you go first.
OBAMA: Well, thank you very much, Jim, for this opportunity. I want to
thank Governor Romney and the University of Denver for your
hospitality.
There are a lot of points I want to make tonight, but the most
important one is that 20 years ago I became the luckiest man on Earth
because Michelle Obama agreed to marry me.
This is what I have for a function so far:
def getCandidate(myFile):
    file = open(myFile, "r")
    obama = []
    romney = []
    lehrer = []
    file = file.readlines()
I'm just not sure how to iterate through the data so that it separates each person's words correctly. I created a dummy file to create the word cloud, and I'm able to do that fine, so all I am wondering is how to extract the information I need.
Thank you! If there is more information I can offer please let me know. This is a beginning Python course.
EDIT: New code added from a response. This works to an extent, but only grabs the first line of each candidate's response, not their entire response. I need to write code that continues to store each line under that candidate until a new name is at the start of a line.
def getCandidate(myFile, candidate):
    file = open(myFile, "r")
    OBAMA = []
    ROMNEY = []
    LEHRER = []
    file = file.readlines()
    for line in file:
        if line.startswith("OBAMA:"):
            OBAMA.append(line)
        if line.startswith("ROMNEY:"):
            ROMNEY.append(line)
        if line.startswith("LEHRER:"):
            LEHRER.append(line)
    if candidate == "OBAMA":
        return OBAMA
    if candidate == "ROMNEY":
        return ROMNEY
EDIT: I now have a new question. How can I generalize the file so that I can open any debate file between two people and a moderator? I am having a lot of trouble with this one.
I've been given a hint to look at the first word of each line and check whether it ends in ":", but I'm still not sure how to do this. I tried splitting each line on spaces and then looking at the first item in the line, but that's as far as I've gotten.

The hint is this: after you split your lines, iterate over them and check with the string function startswith for each candidate, then append.
The iteration over a file is very simple:
for row in file:
    do_something_with_row
EDIT:
To keep appending lines until you find a new candidate, keep track of the last candidate seen in a variable. If a line does not start with any candidate's name, it still belongs to the same candidate as before.
if line.startswith('OBAMA'):
    last_seen = OBAMA
    OBAMA.append(line)
elif blah blah blah
else:
    last_seen.append(line)
By the way, I would change the definition of the function: instead of taking the name of a candidate and returning only that candidate's lines, it would be better to return a dictionary with the candidate names as keys and their lines as values, so you wouldn't need to parse the file more than once. When you work with bigger files this could be a lifesaver.
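Here is a minimal sketch that combines both ideas (remember the last speaker seen, and return a dictionary). It assumes a speaker label is a single all-caps word followed by a colon at the very start of a line; the function name `split_by_speaker` and the regular expression are illustrative choices, not something given in the assignment. Because it does not hard-code any names, it should also cope with a debate between two other people and a moderator, as long as that assumption holds.

```python
import re
from collections import defaultdict

# Assumption: a speaker label is one all-caps word followed by a colon
# at the start of a line, e.g. "OBAMA:" or "LEHRER:".
SPEAKER_RE = re.compile(r"^([A-Z]+):\s*(.*)$")


def split_by_speaker(lines):
    """Return {speaker: [lines spoken]} for an iterable of transcript lines."""
    speeches = defaultdict(list)
    current = None  # last speaker seen so far
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        match = SPEAKER_RE.match(line)
        if match:
            current = match.group(1)  # a new speaker starts here
            line = match.group(2)     # drop the "NAME:" prefix
        if current is not None and line:
            speeches[current].append(line)
    return dict(speeches)
```

Since a file object is itself an iterable of lines, it could be called as `split_by_speaker(open(myFile))`, and `" ".join(result["OBAMA"])` would then give one string to feed to the word cloud.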

Related

How to separate words to single letters from text file python

How do I separate words from a text file into single letters?
I'm given a text where I have to calculate the frequency of the letters in a text. However, I can't seem to figure out how I separate the words into single letters so I can count the unique elements and from there determine their frequency.
I apologize for not having the text in a text file, but the following text I'm given:
alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it, and what is the use of a book,' thought alice without pictures or conversation?'
so she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy- chain would be worth the trouble of getting up and picking the daisies, when suddenly a white rabbit with pink eyes ran close by her.
there was nothing so very remarkable in that; nor did alice think it so very much out of the way to hear the rabbit say to itself, `oh dear! oh dear! i shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the rabbit actually took a watch out of its waistcoat- pocket, and looked at it, and then hurried on, alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
in another moment down went alice after it, never once considering how in the world she was to get out again.
the rabbit-hole went straight on like a tunnel for some way, and then dipped suddenly down, so suddenly that alice had not a moment to think about stopping herself before she found herself falling down a very deep well.
I'm supposed to separate into getting 26 variables a-z, and then determine their frequency which is given as the following:
I tried making the following code so far:
# Check where the current file you are working in, is saved.
import os
os.getcwd()
#print(os.getcwd())
# 1. Change the current working directory to the place where you have saved the file.
os.chdir('C:/Users/Annik/Desktop/DTU/02633 Introduction to programming/Datafiles')
os.getcwd()
#print(os.chdir('C:/Users/Annik/Desktop/DTU/02633 Introduction to programming/Datafiles'))
# 2. Listing the content of current working directory type
os.listdir(os.getcwd())
#print(os.listdir(os.getcwd()))
#importing the file
filein = open("small_text.txt", "r") #opens the file for reading
lines = filein.readlines() #reads all lines into an array
smalltxt = "".join(lines) #Joins the lines into one big string.
import numpy as np
def letterFrequency(filename):
    #counts the frequency of letters in a text
    unique_elems, counts = np.unique(separate_words, return_counts=True)
    return unique_elems
I just don't know how to separate the letters in the text, so I can count the unique elements.

You can use collections.Counter to get your frequencies directly from the text.
Then just select the 26 keys you are interested in, because the counter will also include whitespace and other symbols.
from collections import Counter
[...]
with open("small_text.txt", "r") as file:
    text = file.read()
keys = "abcdefghijklmnopqrstuvwxyz"
c = Counter(text.lower())
# initialize occurrence with zeros to have all keys present.
occurrence = dict.fromkeys(keys, 0)
occurrence.update({k:v for k,v in c.items() if k in keys})
total = sum(occurrence.values())
frequency = {k:v/total for k,v in occurrence.items()}
[...]
To handle upper case str.lower might be useful as well.
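The same idea can be wrapped in a small function. This is only a sketch: the name `letter_frequency` is made up for illustration, and it assumes that only the 26 ASCII letters count, with everything else (spaces, punctuation, digits) ignored.

```python
from collections import Counter
import string


def letter_frequency(text):
    """Return {letter: relative frequency} for the 26 letters a-z."""
    # Iterating over a string yields single characters, so no explicit
    # "splitting into letters" step is needed.
    counts = Counter(ch for ch in text.lower() if ch in string.ascii_lowercase)
    total = sum(counts.values())
    if total == 0:
        # No letters at all: avoid dividing by zero.
        return {letter: 0.0 for letter in string.ascii_lowercase}
    return {letter: counts[letter] / total for letter in string.ascii_lowercase}
```

A Counter returns 0 for missing keys, so letters that never occur still appear in the result with frequency 0.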

"how I separate the words into single letters": since you want to count the characters, you can use Counter from the collections module.
For example
import collections
import pprint
...
...
file_input = input('File_Name: ')
with open(file_input, 'r') as info:
    count = collections.Counter(info.read().upper()) # reading file
value = pprint.pformat(count)
print(value)
...
...
This reads your file and prints the count of each character present.

If an input string is too long or has paragraphs, input() won't copy it all

I have a question about input
description = input('add description: ')
I'm adding a text using Ctrl+C and Ctrl+V.
For example:
"The short story is a crafted form in its own right. Short stories
make use of plot, resonance, and other dynamic components as in a
novel, but typically to a lesser degree. While the short story is
largely distinct from the novel or novella/short novel, authors
generally draw from a common pool of literary techniques.
Determining what exactly separates a short story from longer fictional
formats is problematic. A classic definition of a short story is that
one should be able to read it in one sitting, a point most notably
made in Edgar Allan Poe's essay "The Philosophy of Composition"
(1846)"
Result is:
description = "The short story is a crafted form in its own right. Short stories make use of plot, resonance, and other dynamic components as in a novel, but typically to a lesser degree. While the short story is largely distinct from the novel or novella/short novel, authors generally draw from a common pool of literary techniques."
Whilst I want description to hold the entire text chain I copied.

Normally the input() function terminates on an End Of Line or \n. I would suggest using a setup like this:
lines = []
while True:
    line = input()
    if line == "EOF":
        break
    else:
        lines.append(line)
text = ' '.join(lines)
What this does is read input and add it to a list until you type "EOF" on its own line and hit Enter. This should solve the multi-line problem.
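The loop can also be packaged as a reusable function. The sketch below is an illustration under a few assumptions: the name `read_multiline` and the `read_line` parameter are invented here (the parameter only exists so the function can be exercised without a keyboard), and the lines are joined with a newline rather than a space so that paragraph breaks survive.

```python
def read_multiline(sentinel="EOF", read_line=input):
    """Collect lines until `sentinel` appears alone on a line.

    `read_line` defaults to the built-in input(); passing another
    zero-argument callable is a hypothetical hook for testing.
    """
    lines = []
    while True:
        try:
            line = read_line()
        except EOFError:
            break  # real end of input (e.g. Ctrl+D / Ctrl+Z) also stops
        if line.strip() == sentinel:
            break
        lines.append(line)
    # Join with newlines so pasted paragraph breaks are preserved.
    return "\n".join(lines)
```

With this, `description = read_multiline()` would accept the pasted text followed by a line containing only EOF.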

The problem you're facing here is that input() returns as soon as Enter is hit or (in this case) as soon as the pasted text reaches its first line break. One way around this is to write \n in place of each real line break instead of actually starting a new paragraph, since that is how a line break is represented in a string. If you want to avoid the issue altogether, though, I highly recommend you learn how to use the Tkinter module, since it is one of the best modules if you want to create any kind of app with a front end. Here is a link to get you started: https://www.tutorialspoint.com/python/python_gui_programming.htm

How can I extract specific data from e-prime output (.txt file)

Been learning Python the last couple of days for the function of completing a data extraction. I'm not getting anywhere & hope one of you lovely people can advise.
I need to extract data that follows: RESP, CRESP, RTTime and RT.
Here's a snippet as an example of the mess I have to deal with.
Thoughts?
Level: 4
*** LogFrame Start ***
Procedure: ActProcScenarios
No: 1
Line1: It is almost time for your town's spring festival. A friend of yours is
Line2: on the committee and asks if you would be prepared to help out with the
Line3: barbecue in the park. There is a large barn for use if it rains.
Line4: You hope that on that day it will be
pfrag: s-n-y
pword: sunny
pletter: u
Quest: Does the town have an autumn festival?
Correct: {LEFTARROW}
ScenarioListPract: 1
Topic: practice
Subtheme: practice
ActPracScenarios: 1
Running: ActPracScenarios
ActPracScenarios.Cycle: 1
ActPracScenarios.Sample: 1
DisplayFragInstr.OnsetDelay: 17
DisplayFragInstr.OnsetTime: 98031
DisplayFragInstr.DurationError: -999999
DisplayFragInstr.RTTime: 103886
DisplayFragInstr.ACC: 0
DisplayFragInstr.RT: 5855
DisplayFragInstr.RESP: {DOWNARROW}
DisplayFragInstr.CRESP:
FragInput.OnsetDelay: 13
FragInput.OnsetTime: 103899
FragInput.DurationError: -999999
FragInput.RTTime: 104998

I think regular expressions would be the right tool here because the \b word boundary anchors allow you to make sure that RESP only matches a whole word RESP and not just part of a longer word (like CRESP).
Something like this should get you started:
>>> import re
>>> for line in myfile:
...     match = re.search(r"\b(RT|RTTime|RESP|CRESP): (.*)", line)
...     if match:
...         print("Matched {0} with value {1}".format(match.group(1),
...                                                   match.group(2)))
Output:
Matched RTTime with value 103886
Matched RT with value 5855
Matched RESP with value {DOWNARROW}
Matched CRESP with value
Matched RTTime with value 104998
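A function version of the same idea might look like the sketch below. The names `FIELD_RE` and `extract_fields` are made up for illustration, and the pattern differs slightly from the one above: the space after the colon is optional, on the assumption that an empty field such as CRESP may have had its trailing space stripped.

```python
import re

# \b ensures RESP is matched as a whole word and not as the tail of CRESP.
# The whitespace after the colon is optional so empty values still match.
FIELD_RE = re.compile(r"\b(RTTime|RT|CRESP|RESP):[ \t]*(.*)$")


def extract_fields(lines):
    """Return a list of (field, value) tuples for the four fields of interest."""
    results = []
    for line in lines:
        match = FIELD_RE.search(line.rstrip("\n"))
        if match:
            results.append((match.group(1), match.group(2).strip()))
    return results
```

Passing an open file, as in `extract_fields(myfile)`, works because a file object iterates over its lines.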

Transform it to a dict first, then just get items from the dict as you wish. Splitting only on the first colon keeps values that themselves contain a colon intact.
d = {k.strip(): v.strip() for (k, v) in
     [line.split(':', 1) for line in s.split('\n') if line.find(':') != -1]}
print (d['DisplayFragInstr.RESP'], d['DisplayFragInstr.CRESP'],
       d['DisplayFragInstr.RTTime'], d['DisplayFragInstr.RT'])
>>> ('{DOWNARROW}', '', '103886', '5855')
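One caveat with a single dictionary is that a key seen twice overwrites the earlier value. If the log holds several trials, a variant that builds one dictionary per trial may be more useful. The sketch below assumes every trial begins with a "*** LogFrame Start ***" line, as in the posted snippet, and that the same keys recur in later frames; the function name `parse_log_frames` is hypothetical.

```python
def parse_log_frames(text):
    """Return a list of dicts, one per '*** LogFrame Start ***' block."""
    frames = []
    current = None  # dict for the frame being read, None before the first frame
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "*** LogFrame Start ***":
            current = {}
            frames.append(current)
        elif current is not None and ":" in stripped:
            # Split on the first colon only, so values may contain colons.
            key, value = stripped.split(":", 1)
            current[key.strip()] = value.strip()
    return frames
```

Each element can then be queried like the dictionary above, for example `frames[0]['DisplayFragInstr.RT']`.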

I think you may be making things harder for yourself than needed. E-prime has a file format called .edat that is designed for the purpose you are describing. An edat file contains the same information as the .txt file, but in a way that makes extracting variables easier. I personally only use the type of text file you have posted here as a form of data storage redundancy.
If you are doing things this way because you do not have a software key, it might help to know that the E-Merge and E-DataAid programs for eprime don't require a key. You only need the key for editing build files. Whoever provided you with the .txt files should probably have an install disk for these programs. If not, it is available on the PST website (I believe you need a serial code to create an account, but not certain)
Eprime generally creates a .edat file that matches the content of the text file you have posted an example of. Sometimes though if eprime crashes you don't get the edat file and only have the .txt. Luckily you can generate the edat file from the .txt file.
Here's how I would approach this issue: If you do not have the edat files available first use E-DataAid to recover the files.
Then, presuming you have multiple participants, you can use E-Merge to merge the edat files for all participants who completed this task.
Open the merged file. It might look a little chaotic depending on how much you have in the file. You can go to tools->Arrange columns, which will show a list of all your variables. Adjust so that only the desired variables are in the right-hand box. Hit OK.
Looking at the file you posted, it says level 4 at the top, so I'm guessing there are a lot of procedures in this experiment. If you have many procedures in the program, you might at this point have lines that just have startup info and NULL in the locations where your variables of interest are. You can fix this by going to tools->filter and creating a filter to eliminate those lines. Sometimes, depending on the file structure, you might also end up with duplicate lines of the same data. You can also fix this with filtering.
You can then export this file as a csv

import re
import pprint
def parse_logs(file_name):
    with open(file_name, "r") as f:
        lines = [line.strip() for line in f.readlines()]
    # \b keeps "RESP" from also matching lines whose key ends in "CRESP"
    base_regex = r'^.*\b{0}: (.*)$'
    match_terms = ["RESP", "CRESP", "RTTime", "RT"]
    regexes = {term: base_regex.format(term) for term in match_terms}
    output_list = []
    for line in lines:
        for key, regex in regexes.items():
            match = re.match(regex, line)
            if match:
                match_tuple = (key, match.groups()[0])
                output_list.append(match_tuple)
    return output_list
pprint.pprint(parse_logs("respregex"))
Edit: Tim and Guy's answers are both better. I was in a hurry to write something and missed two much more elegant solutions.

How to loop using .split() function on a text file python

I have an HTML file with different team names written throughout the file. I just want to grab the team names. The team names always occur after certain text and end before certain text, so I've used the split function to find the team name. I'm a beginner, and I'm sure I'm making this harder than it is. data is the contents of the file.
teams = data.split('team-away">')[1].split("</sp")[0]
for team in teams:
    print team
This prints each individual character of the first team that it finds (so for example, if teams = San Francisco 49ers, it prints "S", then "a", etc.) instead of what I need it to do: print "San Francisco 49ers", then on the next line the next team, "Carolina Panthers", etc.
Thank you!

"I'm a beginner, and I'm sure I'm making this harder than it is."
Well, kind of.
import re
teams = re.findall('team-away">(.*?)</sp', data)
(with credit to Kurtis, for a simpler regular expression than I originally had)
Though an actual HTML parser would be best practice.
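As a rough illustration of the parser route, here is a sketch using the standard library's html.parser. It assumes each team name is the text of a tag carrying the class "team-away", which is a guess based on the fragment in the question; the class `TeamNameParser` and the function `extract_teams` are names invented for this example.

```python
from html.parser import HTMLParser


class TeamNameParser(HTMLParser):
    """Collect the text inside every tag whose class list contains target_class."""

    def __init__(self, target_class="team-away"):
        super().__init__()
        self.target_class = target_class
        self.teams = []
        self._capturing = False
        self._tag = None
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        classes = (dict(attrs).get("class") or "").split()
        if not self._capturing and self.target_class in classes:
            self._capturing = True
            self._tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._capturing:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first matching end tag closes the capture.
        if self._capturing and tag == self._tag:
            self.teams.append("".join(self._buffer).strip())
            self._capturing = False


def extract_teams(html, target_class="team-away"):
    parser = TeamNameParser(target_class)
    parser.feed(html)
    parser.close()
    return parser.teams
```

`for team in extract_teams(data): print(team)` would then print one team name per line.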

Don't re-invent the wheel! Look into BeautifulSoup, it'll do the job for you.

lists and sublists

I use this code to split some data into a list with three sublists, splitting wherever there is a * or a -. But it also picks up the \n\n * sequences inside the text, and I don't know why. I don't want to read those. Can someone tell me what I'm doing wrong?
this is the data
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
i use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
but the result is wrong; it is not the list I am supposed to get.

The "\n\n" is part of the input data, so it's preserved in python. Just add a strip() to remove it:
finallist = [[sub.strip() for sub in item] for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [[sub.replace("\\n", "\n").strip() for sub in item] for item in newlist]

open("data.dat").read() reads every character in the file, not only those you want.
If you don't need '\n' you can try contents.replace("\n",""), or read the file line by line (not the whole content at once) and strip the trailing '\n' from each line.

This is going to split on any asterisk you have in the text as well.
A better implementation would be to do something like:
lines = []
for line in open("data.dat"):
    if line.lstrip().startswith("*"):
        lines.append([line.strip()]) # append a list with your line
    elif line.lstrip().startswith("-"):
        lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.
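A slightly more defensive sketch of the same line-by-line approach is shown below. It assumes, as in the sample data, that every heading and every item sits on its own line; the function name `parse_sections` is made up, and dropping the leading marker character is a choice made here, not something the question requires.

```python
def parse_sections(lines):
    """Group '-' items under the most recent '*' heading.

    Only the first character of each line is inspected, so asterisks
    and dashes inside the text do not cause extra splits.
    """
    sections = []
    for raw in lines:
        line = raw.strip()
        if line.startswith("*"):
            # New heading: start a new sublist, dropping the marker.
            sections.append([line[1:].strip()])
        elif line.startswith("-") and sections:
            # Item line: attach it to the latest heading.
            sections[-1].append(line[1:].strip())
        # Anything else (such as the final END line) is ignored.
    return sections
```

It can be used as `with open("data.dat") as f: newlist = parse_sections(f)`, which also closes the file when the block ends.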

The following solves your problem, I believe:
result = [ [subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
           for item in open('data.txt').read().split('\n*') ]
# now let's pretty print the result
for i in result:
    print('***', i[0], '***')
    for j in i[1:]:
        print('\t--', j)
    print()
Note that I split on new-line + * or -; this way it won't split on dashes inside the text. I also replace the textual character sequence \ n \ n (r'\n\n') with a new-line character '\n'. The one-liner expression is a list comprehension, a way to construct lists in one gulp, without multiple .append() or + calls.
