I am trying to create an index of a .txt file, which has section titles in all caps. My attempt looks like this:
dictionary = {}
line_count = 0
for line in file:
line_count += 1
line = re.sub(r'[^a-zA-Z0-9 -]','',line)
list = []
if line.isupper():
head = line
else:
list = line.split(' ')
for i in list:
if i not in stopwords:
dictionary.setdefault(i, {}).setdefault(head, []).append(line_count)
The head variable, however, cannot find its value, which I am trying to assign to any lines that are all caps. My desired output would be something like:
>>dictionary['cat']
{'THE PARABLE': [3894, 3924, 3933, 3936, 3939], 'SNOW': [4501], 'THE CHASE': [6765, 6767, 6772, 6773, 6785, 6802, 6807, 6820, 6823, 6839]}
Here is a slice of the data:
THE GOLDEN BIRD
A certain king had a beautiful garden, and in the garden stood a tree
which bore golden apples. These apples were always counted, and about
the time when they began to grow ripe it was found that every night one
of them was gone.
THE PARABLE
Influenced by those remarks, the bird next morning refused to bring in
the wood, telling the others that he had been their servant long enough,
and had been a fool into the bargain, and that it was now time to make a
change, and to try some other way of arranging the work. Beg and pray
as the mouse and the sausage might, it was of no use; the bird remained
master of the situation, and the venture had to be made. They therefore
drew lots, and it fell to the sausage to bring in the wood, to the mouse
to cook, and to the bird to fetch the water.
At the heart your problem is that this:
if line.isupper()
this test will fail for a number ('0') or other such things. Instead try:
if line.isupper() == line:
But in general your code could use a little Pythonic love like maybe:
import re
data = {}
head = ''
with open('file1', 'rU') as f:
for line_num, line in enumerate(f):
line = re.sub(r'[^a-zA-Z0-9 -]', '', line)
if line.isupper() == line:
head = line
else:
for word in (w for w in line.split(' ') if w not in stopwords):
data.setdefault(word, {}).setdefault(head, []).append(line_num+1)
Related
I have a .txt file that reads:
Areca Palm
2018-11-03 18:21:26
Tropical/sub-Tropical plant
Leathery leaves, mid to dark green
Moist and well-draining soil
Semi-shade/full shade light requirements
Water only when top 2 inches of soil is dry
Intolerant to root rot
Propagate by cuttings in water
Canary Date Palm
2018-11-05 10:12:15
Semi-shade, full sun
Dark green leathery leaves
Like lots of water,but soil cannot be water-logged
Like to be root bound in pot
I want to convert these .txt file into dictionary on python and the output should look something like this:
d = {'Areca Palm': ('2018-11-03 18:21:26', 'Tropical/sub-Tropical plant', 'Leathery leaves, mid to dark green', 'Moist and well-draining soil'..etc 'Canary Date Palm': ('2018-11-05 10:12:15', 'Semi-shade, full sun'...)
How do I go about doing this?
The following code shows one way to do this, by reading the file with a very simple two-state state machine:
with open("data.in") as inFile:
# Initialise dictionary and simple state machine.
afterBlank = True
myDict = {}
# Process each line in turn.
for line in inFile.readlines():
line = line.strip()
if afterBlank:
# First non-blank after blank (or at file start) is key
# (blanks after blanks are ignored).
if line != "":
key = line
myDict[key] = []
afterBlank = False
else:
# Subsequent non-blanks are additional lines for key
# (blank after non-blank switches state).
if line != "":
myDict[key].append(line)
else:
afterBlank = True
# Dictionary holds lists, make into tuples if desired.
for key in myDict.keys():
myDict[key] = tuple(myDict[key])
import pprint
pprint.pprint(myDict)
Using your input data gives the output (output with pprint to make a little more readable than the standard Python print):
{'Areca Palm': ('2018-11-03 18:21:26',
'Tropical/sub-Tropical plant',
'Leathery leaves, mid to dark green',
'Moist and well-draining soil',
'Semi-shade/full shade light requirements',
'Water only when top 2 inches of soil is dry',
'Intolerant to root rot',
'Propagate by cuttings in water'),
'Canary Date Palm': ('2018-11-05 10:12:15',
'Semi-shade, full sun',
'Dark green leathery leaves',
'Like lots of water,but soil cannot be water-logged',
'Like to be root bound in pot')}
Many parsing problems are greatly simplified by writing a function
to process the file and yield its lines one meaningful section at a time.
Quite often, the logic needed for this part is pretty simple. And
it stays simple, because the function isn't concerned with any other
details about your larger problem.
That step then simplifies the downstream code, which focuses on deconstruction
of one meaningful section at a time. This part can ignore larger file concerns -- also
keeping things simple.
An illustration:
import sys
def get_paragraphs(path):
par = []
with open(path) as fh: # The basic pattern tends to repeat:
for line in fh:
line = line.rstrip()
if line: # Store lines you want.
par.append(line)
elif par: # Yield prior batch.
yield par
par = []
if par: # Don't forget the last one.
yield par
path = sys.argv[1]
d = {
p[0] : tuple(p[1:])
for p in get_paragraphs(path)
}
I already managed to uncover the spoken words with some help.
Now I'm looking for to get the text spoken by a chosen person.
So I can type in MIA and get every single words she is saying in the movie
Like this:
name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)
So I'm able to count the words afterwards.
This is how the movie script looks like
An awkward beat. They pass a wooden SALOON -- where a WESTERN
is being shot. Extras in COWBOY costumes drink coffee on the
steps.
Revision 25.
MIA (CONT'D)
I love this stuff. Makes coming to work
easier.
SEBASTIAN
I know what you mean. I get breakfast
five miles out of the way just to sit
outside a jazz club.
MIA
Oh yeah?
SEBASTIAN
It was called Van Beek. The swing bands
played there. Count Basie. Chick Webb.
(then,)
It's a samba-tapas place now.
MIA
A what?
SEBASTIAN
Samba-tapas. It's... Exactly. The joke's on
history.
I would ask the user for all the names in the script first. Then ask which name they want the words for. I would search the text word by word till I found the name wanted and copy the following words into a variable until I hit a name that matches someone else in the script. Now people could say the name of another character, but if you assume titles for people speaking are either all caps, or on a single line, the text should be rather easy to filter.
for word in script:
if word == speaker and word.isupper(): # you may want to check that this is on its own line as well.
recording = True
elif word in character_names and word.isupper(): # you may want to check that this is on its own line as well.
recording = False
if recording:
spoken_text += word + " "
I will outline how you could generate a dict which can give you the number of words spoken for all speakers and one which approximates your existing implementation.
General Use
If we define a word to be any chunk of characters in a string split along ' ' (space)...
import re
speaker = '' # current speaker
words = 0 # number of words on line
word_count = {} # dict of speakers and the number of words they speak
for line in script.split('\n'):
if re.match('^[ ]{19}[^ ]{1,}.*', line): # name of speaker
speaker = line.split(' (')[0][19:]
if re.match('^[ ]{6}[^ ]{1,}.*', line): # dialogue line
words = len(line.split())
if speaker in word_count:
word_count[speaker] += words
else:
word_count[speaker] = words
Generates a dict with the format {'JOHN DOE':55} if John Doe says 55 words.
Example output:
>>> word_count['MIA']
13
Your Implementation
Here is a version of the above procedure that approximates your implementation.
import re
def wordsspoken(script,name):
word_count = 0
for line in script.split('\n'):
if re.match('^[ ]{19}[^ ]{1,}.*', line): # name of speaker
speaker = line.split(' (')[0][19:]
if re.match('^[ ]{6}[^ ]{1,}.*', line): # dialogue line
if speaker == name:
word_count += len(line.split())
print(word_count)
def main():
name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)
If you want to compute your tally with only one pass over the script (which I imagine could be pretty long), you could just track which character is speaking; set things up like a little state machine:
import re
from collections import Counter, defaultdict
words_spoken = defaultdict(Counter)
currently_speaking = 'Narrator'
for line in SCRIPT.split('\n'):
name = line.replace('(CONT\'D)', '').strip()
if re.match('^[A-Z]+$', name):
currently_speaking = name
else:
words_spoken[currently_speaking].update(line.split())
You could use a more sophisticated regex to detect when the speaker changes, but this should do the trick.
demo
There are some good ideas above. The following should work just fine in Python 2.x and 3.x:
import codecs
from collections import defaultdict
speaker_words = defaultdict(str)
with codecs.open('script.txt', 'r', 'utf8') as f:
speaker = ''
for line in f.read().split('\n'):
# skip empty lines
if not line.split():
continue
# speakers have their names in all uppercase
first_word = line.split()[0]
if (len(first_word) > 1) and all([char.isupper() for char in first_word]):
# remove the (CONT'D) from a speaker string
speaker = line.split('(')[0].strip()
# check if this is a dialogue line
elif len(line) - len(line.lstrip()) == 6:
speaker_words[speaker] += line.strip() + ' '
# get a Python-version-agnostic input
try:
prompt = raw_input
except:
prompt = input
speaker = prompt('Enter name: ').strip().upper()
print(speaker_words[speaker])
Example Output:
Enter name: sebastian
I know what you mean. I get breakfast five miles out of the way just to sit outside a jazz club. It was called Van Beek. The swing bands played there. Count Basie. Chick Webb. It's a samba-tapas place now. Samba-tapas. It's... Exactly. The joke's on history.
I am calling on the collective wisdom of Stack Overflow because I am at my wits end trying to figure out how to do this and I'm a newbie self-taught coder.
I have a txt file of Letters to the Editor that I need to split into their own individual files.
The files are all formatted in relatively the same way with:
For once, before offering such generous but the unasked for advice, put yourselves in...
Who has Israel to talk to? The cowardly Jordanian monarch? Egypt, a country rocked...
Why is it that The Times does not urge totalitarian Arab slates and terrorist...
PAUL STONEHILL Los Angeles
There you go again. Your editorial again makes groundless criticisms of the Israeli...
On Dec. 7 you called proportional representation “bizarre," despite its use in the...
Proportional representation distorts Israeli politics? Huh? If Israel changes the...
MATTHEW SHUGART Laguna Beach
Was Mayor Tom Bradley’s veto of the expansion of the Westside Pavilion a political...
Although the mayor did not support Proposition U (the slow-growth initiative) his...
If West Los Angeles is any indication of the no-growth policy, where do we go from here?
MARJORIE L. SCHWARTZ Los Angeles
I thought that the best way to go about it would be to try and use regex to identify the lines that started with a name that's all in capital letters since that's the only way to really tell where one letter ends and another begins.
I have tried quite a few different approaches but nothing seems to work quite right. All the other answers I have seen are based on a repeatable line or word. (for example the answers posted here how to split single txt file into multiple txt files by Python and here Python read through file until match, read until next pattern). It all seems to not work when I have to adjust it to accept my regex of all capital words.
The closest I've managed to get is the code below. It creates the right number of files. But after the second file is created it all goes wrong. The third file is empty and in all the rest the text is all out of order and/or incomplete. Paragraphs that should be in file 4 are in file 5 or file 7 etc or missing entirely.
import re
thefile = raw_input('Filename to split: ')
name_occur = []
full_file = []
pattern = re.compile("^[A-Z]{4,}")
with open (thefile, 'rt') as in_file:
for line in in_file:
full_file.append(line)
if pattern.search(line):
name_occur.append(line)
totalFiles = len(name_occur)
letters = 1
thefile = re.sub("(.txt)","",thefile)
while letters <= totalFiles:
f1 = open(thefile + '-' + str(letters) + ".txt", "a")
doIHaveToCopyTheLine = False
ignoreLines = False
for line in full_file:
if not ignoreLines:
f1.write(line)
full_file.remove(line)
if pattern.search(line):
doIHaveToCopyTheLine = True
ignoreLines = True
letters += 1
f1.close()
I am open to completely scrapping this approach and doing it another way (but still in Python). Any help or advice would be greatly appreciated. Please assume I am the inexperienced newbie that I am if you are awesome enough to take your time to help me.
I took a simpler approach and avoided regex. The tactic here is essentially to count the uppercase letters in the first three words and make sure they pass certain logic. I went for first word is uppercase and either the second or third word is uppercase too, but you can adjust this if needed. This will then write each letter to new files with the same name as the original file (note: it assumes your file has an extension like .txt or such) but with an incremented integer appended. Try it out and see how it works for you.
import string
def split_letters(fullpath):
current_letter = []
letter_index = 1
fullpath_base, fullpath_ext = fullpath.rsplit('.', 1)
with open(fullpath, 'r') as letters_file:
letters = letters_file.readlines()
for line in letters:
words = line.split()
upper_words = []
for word in words:
upper_word = ''.join(
c for c in word if c in string.ascii_uppercase)
upper_words.append(upper_word)
len_upper_words = len(upper_words)
first_word_upper = len_upper_words and len(upper_words[0]) > 1
second_word_upper = len_upper_words > 1 and len(upper_words[1]) > 1
third_word_upper = len_upper_words > 2 and len(upper_words[2]) > 1
if first_word_upper and (second_word_upper or third_word_upper):
current_letter.append(line)
new_filename = '{0}{1}.{2}'.format(
fullpath_base, letter_index, fullpath_ext)
with open(new_filename, 'w') as new_letter:
new_letter.writelines(current_letter)
current_letter = []
letter_index += 1
else:
current_letter.append(line)
I tested it on your sample input and it worked fine.
While the other answer is suitable, you may still be curious about using a regex to split up a file.
smallfile = None
buf = ""
with open ('input_file.txt', 'rt') as f:
for line in f:
buf += str(line)
if re.search(r'^([A-Z\s\.]+\b)' , line) is not None:
if smallfile:
smallfile.close()
match = re.findall(r'^([A-Z\s\.]+\b)' , line)
smallfile_name = '{}.txt'.format(match[0])
smallfile = open(smallfile_name, 'w')
smallfile.write(buf)
buf = ""
if smallfile:
smallfile.close()
If you run on Linux, use csplit.
Otherwise, check out these two threads:
How can I split a text file into multiple text files using python?
How to match "anything up until this sequence of characters" in a regular expression?
I have very limited coding background except for some Ruby, so if there's a better way of doing this, please let me know!
Essentially I have a .txt file full of words. I want to import the .txt file and turn it into a list. Then, I want to take the first item in the list, assign it to a variable, and use that variable in an external request that sends off to get the definition of the word. The definition is returned, and tucked into a different .txt file. Once that's done, I want the code to grab the next item in the list and do it all again until the list is exhausted.
Below is my code in progress to give an idea of where I'm at. I'm still trying to figure out how to iterate through the list correctly, and I'm having a hard time interpreting the documentation.
Sorry in advance if this was already asked! I searched, but couldn't find anything that specifically answered my issue.
from __future__ import print_function
import requests
import urllib2, urllib
from bs4 import BeautifulSoup
lines = []
with open('words.txt') as f:
lines = f.readlines()
for each in lines
wordlist = open('test.txt', 'a')
word = ##figure out how to get items from list and assign them here
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
# print url and make sure it's correct
html = urllib.urlopen(url).read()
# print html (deprecated)
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)[0]
print(visible_text, file=wordlist)
Keep everything in a loop. Like that:
with open('test.txt', 'a') as wordlist:
for word in lines:
url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
print url
# print url and make sure it's correct
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html)
visible_text = soup.find('pre')(text=True)[0]
wordlist.write("{0}\n".format(visible_text))
Secondly, some suggestions:
f.readlines() won't discard the trailing \n. So, I would use f.read().splitlines()
lines = f.read().splitlines()
You don't to initialize the lines list with [ ], as you are forming the list at one shot and assigning it to lines. You need to initialize the list, only when you consider using append() to the list. So, the below line isn't needed.
lines = []
You can handle KeyError by the following:
try:
value = soup.find('pre', text=True)[0]
return value
except KeyError:
return None
I also show how you can use the Python requests library for retrieving the raw html page. This allows us to easily check the status code for whether the retrieval was successful. You can replace the relevant lines to urllib with this if you like.
You can install requests in the command line using pip: pip install requests
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import re
import requests
import urllib2, urllib
from bs4 import BeautifulSoup
def get_html_with_urllib(word):
url = "http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={word}".format(word=word)
html = urllib.urlopen(url).read()
return html
def get_html(word):
url = "http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={word}".format(word=word)
response = requests.get(url)
# Something bad happened
if response.status_code != 200:
return ""
# Did not get back html
if not response.headers["Content-Type"].startswith("text/html"):
return ""
html = response.content
return html
def format_definitions(raw_definitions_text):
# Get individual lines in definitions text
parts = raw_definitions_text.split('\n')
# Convert to str
# Remove extra spaces on the left.
# Add one space at the end for later joining with next line
parts = map(lambda x: str(x).lstrip() + ' ', parts)
result = []
current = ""
for p in parts:
if re.search("\w*[0-9]+:", p):
# Start of new line. Contains some char followed by <number>:
# Save previous lines
result.append(current.replace('\n', ' '))
# Set start of current line
current = p
else:
# Continue line
current += p
result.append(current)
return '\n'.join(result)
def get_definitions(word):
# Uncomment this to use requests
# html = get_html(word)
# Could not get definition
# if not html:
# return None
html = get_html_with_urllib(word)
soup = BeautifulSoup(html, "html.parser")
# Get block containing definition
definitions = soup.find("pre").get_text()
definitions = format_definitions(definitions)
return definitions
def batch_query(input_filepath):
with open(input_filepath) as infile:
for word in infile:
word = word.strip() # Remove spaces from both ends
definitions = get_definitions(word)
if not definitions:
print("Could not retrieve definitions for {word}".format(word=word))
print("Definition for {word} is: ".format(word=word))
print(definitions)
def main():
input_filepath = sys.argv[1] # Alternatively, change this to file containing words
batch_query(input_filepath)
if __name__ == "__main__":
main()
Output:
Definition for cat is:
cat
n 1: feline mammal usually having thick soft fur and being unable to roar; domestic cats; wildcats [syn: true cat]
2: an informal term for a youth or man; "a nice guy"; "the guy's only doing it for some doll" [syn: guy, hombre, bozo]
3: a spiteful woman gossip; "what a cat she is!"
4: the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant; "in Yemen kat is used daily by 85% of adults" [syn: kat, khat, qat, quat, Arabian tea, African tea]
5: a whip with nine knotted cords; "British sailors feared the cat" [syn: cat-o'-nine-tails]
6: a large vehicle that is driven by caterpillar tracks; frequently used for moving earth in construction and farm work [syn: Caterpillar]
7: any of several large cats typically able to roar and living in the wild [syn: big cat]
8: a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis [syn: computerized tomography, computed tomography, CT, computerized axial tomography, computed axial tomography]
v 1: beat with a cat-o'-nine-tails
2: eject the contents of the stomach through the mouth; "After drinking too much, the students vomited"; "He purged continuously"; "The patient regurgitated the food we gave him last night" [syn: vomit, vomit up, purge, cast, sick, be sick, disgorge, regorge, retch, puke, barf, spew, spue, chuck, upchuck, honk, regurgitate, throw up] [ant: keep down] [also: catting, catted]
Definition for dog is:
dog
n 1: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds; "the dog barked all night" [syn: domestic dog, Canis familiaris]
2: a dull unattractive unpleasant girl or woman; "she got a reputation as a frump"; "she's a real dog" [syn: frump]
3: informal term for a man; "you lucky dog"
4: someone who is morally reprehensible; "you dirty dog" [syn: cad, bounder, blackguard, hound, heel]
5: a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll [syn: frank, frankfurter, hotdog, hot dog, wiener, wienerwurst, weenie]
6: a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward [syn: pawl, detent, click]
7: metal supports for logs in a fireplace; "the andirons were too hot to touch" [syn: andiron, firedog, dog-iron] v : go after with the intent to catch; "The policeman chased the mugger down the alley"; "the dog chased the rabbit" [syn: chase, chase after, trail, tail, tag, give chase, go after, track] [also: dogging, dogged]
Definition for car is:
car
n 1: 4-wheeled motor vehicle; usually propelled by an internal combustion engine; "he needs a car to get to work" [syn: auto, automobile, machine, motorcar]
2: a wheeled vehicle adapted to the rails of railroad; "three cars had jumped the rails" [syn: railcar, railway car, railroad car]
3: a conveyance for passengers or freight on a cable railway; "they took a cable car to the top of the mountain" [syn: cable car]
4: car suspended from an airship and carrying personnel and cargo and power plant [syn: gondola]
5: where passengers ride up and down; "the car was on the top floor" [syn: elevator car]
I've used readlines to split all of the sentences in a file up and I want to use re.findall to go through and find the capitals within them. However, the only output I can get is one set of capitals for all the sentences but I want a set of capitals for each sentence in the file.
I'm using a for loop to attempt this at the moment, but I'm not sure whether this is the best course of action with this task.
Input:
Line 01: HE went to the SHOP
Line 02: THE SHOP HE went
This is what I'm getting as an output:
[HE, SHOP, THE]
and I want to get the output:
[HE, SHOP], [THE, SHOP, HE]
Is there a way of doing this? I've put my coding at the minute below. Thanks!
import re, sys
f = open('findallEX.txt', 'r')
lines = f.readlines()
ii=0
for l in lines:
sys.stdout.write('line %s: %s' %(ii, l))
ii = ii + 1
for x in l
re.findall('[A-Z]+', l)
print x
I think the way to do that is as follows:
txt = """HE went to the SHOP
THE SHOP HE went"""
result = []
for s in txt.split('\n'):
result += [re.findall(r'[A-Z]+', s)]
print(result) # prints [['HE', 'SHOP'], ['THE', 'SHOP', 'HE']]
Or using list comprehensions (a bit less readable):
txt = """HE went to the SHOP
THE SHOP HE went"""
print([re.findall(r'[A-Z]+', s) for s in txt.split('\n')])
If your data really is in that form (words fully capitalized), you don't even need regexes. isupper is all you need.
with open('findallEX.txt') as f:
for line in f.readlines():
print [word for word in line.split() if word.isupper()]
Added an example.