I have very limited coding background except for some Ruby, so if there's a better way of doing this, please let me know!
Essentially I have a .txt file full of words. I want to import the .txt file and turn it into a list. Then, I want to take the first item in the list, assign it to a variable, and use that variable in an external request that sends off to get the definition of the word. The definition is returned, and tucked into a different .txt file. Once that's done, I want the code to grab the next item in the list and do it all again until the list is exhausted.
Below is my code in progress to give an idea of where I'm at. I'm still trying to figure out how to iterate through the list correctly, and I'm having a hard time interpreting the documentation.
Sorry in advance if this was already asked! I searched, but couldn't find anything that specifically answered my issue.
from __future__ import print_function
import requests
import urllib2, urllib
from bs4 import BeautifulSoup
lines = []
with open('words.txt') as f:
    lines = f.readlines()

for each in lines
    wordlist = open('test.txt', 'a')
    word = ##figure out how to get items from list and assign them here
    url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
    # print url and make sure it's correct
    html = urllib.urlopen(url).read()
    # print html (deprecated)
    soup = BeautifulSoup(html)
    visible_text = soup.find('pre')(text=True)[0]
    print(visible_text, file=wordlist)
Keep everything in a loop, like this:
with open('test.txt', 'a') as wordlist:
    for word in lines:
        url = 'http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query=%s' % word
        # print the url and make sure it's correct
        print(url)
        html = urllib.urlopen(url).read()
        soup = BeautifulSoup(html)
        visible_text = soup.find('pre')(text=True)[0]
        wordlist.write("{0}\n".format(visible_text))
Secondly, some suggestions:
f.readlines() won't discard the trailing \n from each line, so I would use f.read().splitlines() instead:
lines = f.read().splitlines()
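For example, if words.txt contained just the words cat and dog on separate lines, the two calls would give (a quick illustration, not from the original post):

with open('words.txt') as f:
    print(f.readlines())          # ['cat\n', 'dog\n'] -- trailing newlines kept
with open('words.txt') as f:
    print(f.read().splitlines())  # ['cat', 'dog']     -- newlines stripped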
You don't need to initialize lines with [], as you build the whole list in one shot and assign it to lines. You only need to initialize a list when you plan to append() to it, so the line below isn't needed:
lines = []
You can handle KeyError by the following:
try:
    value = soup.find('pre', text=True)[0]
    return value
except KeyError:
    return None
I also show how you can use the Python requests library for retrieving the raw html page. This makes it easy to check the status code to see whether the retrieval was successful. You can replace the relevant urllib lines with this if you like.
You can install requests in the command line using pip: pip install requests
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import re
import requests
import urllib2, urllib
from bs4 import BeautifulSoup


def get_html_with_urllib(word):
    url = "http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={word}".format(word=word)
    html = urllib.urlopen(url).read()
    return html


def get_html(word):
    url = "http://services.aonaware.com/DictService/Default.aspx?action=define&dict=wn&query={word}".format(word=word)
    response = requests.get(url)
    # Something bad happened
    if response.status_code != 200:
        return ""
    # Did not get back html
    if not response.headers["Content-Type"].startswith("text/html"):
        return ""
    html = response.content
    return html


def format_definitions(raw_definitions_text):
    # Get individual lines in definitions text
    parts = raw_definitions_text.split('\n')
    # Convert to str, remove extra spaces on the left,
    # and add one space at the end for later joining with the next line.
    parts = map(lambda x: str(x).lstrip() + ' ', parts)
    result = []
    current = ""
    for p in parts:
        if re.search(r"\w*[0-9]+:", p):
            # Start of a new definition: some chars followed by <number>:
            # Save the previous definition.
            result.append(current.replace('\n', ' '))
            # Set the start of the current definition.
            current = p
        else:
            # Continuation of the current definition.
            current += p
    result.append(current)
    return '\n'.join(result)


def get_definitions(word):
    # Uncomment these lines to use requests instead:
    # html = get_html(word)
    # if not html:
    #     # Could not get definition
    #     return None
    html = get_html_with_urllib(word)
    soup = BeautifulSoup(html, "html.parser")
    # Get the block containing the definitions
    definitions = soup.find("pre").get_text()
    definitions = format_definitions(definitions)
    return definitions


def batch_query(input_filepath):
    with open(input_filepath) as infile:
        for word in infile:
            word = word.strip()  # Remove spaces from both ends
            definitions = get_definitions(word)
            if not definitions:
                print("Could not retrieve definitions for {word}".format(word=word))
                continue
            print("Definition for {word} is: ".format(word=word))
            print(definitions)


def main():
    input_filepath = sys.argv[1]  # Alternatively, change this to the file containing the words
    batch_query(input_filepath)


if __name__ == "__main__":
    main()
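Assuming the script above is saved as, say, define_words.py (the filename is just for illustration), it can be run from the command line with the word file as its argument:

python define_words.py words.txt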
Output:
Definition for cat is:
cat
n 1: feline mammal usually having thick soft fur and being unable to roar; domestic cats; wildcats [syn: true cat]
2: an informal term for a youth or man; "a nice guy"; "the guy's only doing it for some doll" [syn: guy, hombre, bozo]
3: a spiteful woman gossip; "what a cat she is!"
4: the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant; "in Yemen kat is used daily by 85% of adults" [syn: kat, khat, qat, quat, Arabian tea, African tea]
5: a whip with nine knotted cords; "British sailors feared the cat" [syn: cat-o'-nine-tails]
6: a large vehicle that is driven by caterpillar tracks; frequently used for moving earth in construction and farm work [syn: Caterpillar]
7: any of several large cats typically able to roar and living in the wild [syn: big cat]
8: a method of examining body organs by scanning them with X rays and using a computer to construct a series of cross-sectional scans along a single axis [syn: computerized tomography, computed tomography, CT, computerized axial tomography, computed axial tomography]
v 1: beat with a cat-o'-nine-tails
2: eject the contents of the stomach through the mouth; "After drinking too much, the students vomited"; "He purged continuously"; "The patient regurgitated the food we gave him last night" [syn: vomit, vomit up, purge, cast, sick, be sick, disgorge, regorge, retch, puke, barf, spew, spue, chuck, upchuck, honk, regurgitate, throw up] [ant: keep down] [also: catting, catted]
Definition for dog is:
dog
n 1: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds; "the dog barked all night" [syn: domestic dog, Canis familiaris]
2: a dull unattractive unpleasant girl or woman; "she got a reputation as a frump"; "she's a real dog" [syn: frump]
3: informal term for a man; "you lucky dog"
4: someone who is morally reprehensible; "you dirty dog" [syn: cad, bounder, blackguard, hound, heel]
5: a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll [syn: frank, frankfurter, hotdog, hot dog, wiener, wienerwurst, weenie]
6: a hinged catch that fits into a notch of a ratchet to move a wheel forward or prevent it from moving backward [syn: pawl, detent, click]
7: metal supports for logs in a fireplace; "the andirons were too hot to touch" [syn: andiron, firedog, dog-iron] v : go after with the intent to catch; "The policeman chased the mugger down the alley"; "the dog chased the rabbit" [syn: chase, chase after, trail, tail, tag, give chase, go after, track] [also: dogging, dogged]
Definition for car is:
car
n 1: 4-wheeled motor vehicle; usually propelled by an internal combustion engine; "he needs a car to get to work" [syn: auto, automobile, machine, motorcar]
2: a wheeled vehicle adapted to the rails of railroad; "three cars had jumped the rails" [syn: railcar, railway car, railroad car]
3: a conveyance for passengers or freight on a cable railway; "they took a cable car to the top of the mountain" [syn: cable car]
4: car suspended from an airship and carrying personnel and cargo and power plant [syn: gondola]
5: where passengers ride up and down; "the car was on the top floor" [syn: elevator car]
Related
I have a .txt file that reads:
Areca Palm
2018-11-03 18:21:26
Tropical/sub-Tropical plant
Leathery leaves, mid to dark green
Moist and well-draining soil
Semi-shade/full shade light requirements
Water only when top 2 inches of soil is dry
Intolerant to root rot
Propagate by cuttings in water
Canary Date Palm
2018-11-05 10:12:15
Semi-shade, full sun
Dark green leathery leaves
Like lots of water,but soil cannot be water-logged
Like to be root bound in pot
I want to convert this .txt file into a dictionary in Python, and the output should look something like this:
d = {'Areca Palm': ('2018-11-03 18:21:26', 'Tropical/sub-Tropical plant', 'Leathery leaves, mid to dark green', 'Moist and well-draining soil'..etc 'Canary Date Palm': ('2018-11-05 10:12:15', 'Semi-shade, full sun'...)
How do I go about doing this?
The following code shows one way to do this, by reading the file with a very simple two-state state machine:
with open("data.in") as inFile:
# Initialise dictionary and simple state machine.
afterBlank = True
myDict = {}
# Process each line in turn.
for line in inFile.readlines():
line = line.strip()
if afterBlank:
# First non-blank after blank (or at file start) is key
# (blanks after blanks are ignored).
if line != "":
key = line
myDict[key] = []
afterBlank = False
else:
# Subsequent non-blanks are additional lines for key
# (blank after non-blank switches state).
if line != "":
myDict[key].append(line)
else:
afterBlank = True
# Dictionary holds lists, make into tuples if desired.
for key in myDict.keys():
myDict[key] = tuple(myDict[key])
import pprint
pprint.pprint(myDict)
Using your input data gives the following output (printed with pprint to make it a little more readable than the standard Python print):
{'Areca Palm': ('2018-11-03 18:21:26',
'Tropical/sub-Tropical plant',
'Leathery leaves, mid to dark green',
'Moist and well-draining soil',
'Semi-shade/full shade light requirements',
'Water only when top 2 inches of soil is dry',
'Intolerant to root rot',
'Propagate by cuttings in water'),
'Canary Date Palm': ('2018-11-05 10:12:15',
'Semi-shade, full sun',
'Dark green leathery leaves',
'Like lots of water,but soil cannot be water-logged',
'Like to be root bound in pot')}
Many parsing problems are greatly simplified by writing a function
to process the file and yield its lines one meaningful section at a time.
Quite often, the logic needed for this part is pretty simple. And
it stays simple, because the function isn't concerned with any other
details about your larger problem.
That step then simplifies the downstream code, which focuses on deconstruction
of one meaningful section at a time. This part can ignore larger file concerns -- also
keeping things simple.
An illustration:
import sys

def get_paragraphs(path):
    par = []
    with open(path) as fh:      # The basic pattern tends to repeat:
        for line in fh:
            line = line.rstrip()
            if line:            # Store lines you want.
                par.append(line)
            elif par:           # Yield prior batch.
                yield par
                par = []
        if par:                 # Don't forget the last one.
            yield par

path = sys.argv[1]

d = {
    p[0] : tuple(p[1:])
    for p in get_paragraphs(path)
}
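As a quick check (not part of the original sketch), the resulting dict can be pretty-printed the same way as in the previous answer:

import pprint
pprint.pprint(d)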
I have selected some fields from a JSON file and saved each name along with its respective comments to do some preprocessing.
Below is my code:
import re
import json
import string  # needed for string.punctuation below

with open('C:/Users/User/Desktop/Coding/parsehubjsonfileeg/all.json', encoding='utf8') as f:
    data = json.load(f)

# dictionary for the elements you want to keep
new_data = {'selection1': []}
print(new_data)

# copy item from old data to new data if it has 'reviews'
for item in data['selection1']:
    if 'reviews' in item:
        new_data['selection1'].append(item)
        print(item['reviews'])
        print('--')

# save in file
with open('output.json', 'w') as f:
    json.dump(new_data, f)

selection1 = new_data['selection1']

for item in selection1:
    name = item['name']
    print('>>>>>>>.', name)
    CommentID = item['reviews']
    for com in CommentID:
        comment = com['review'].lower()  # converting all to lowercase
        result = re.sub(r'\d+', '', comment)  # remove numbers
        results = (result.translate(
            str.maketrans('', '', string.punctuation))).strip()  # remove punctuation and whitespace
        comments = (results)
        print(comment)
my output is:
>>>>>>>. Heritage The Villas
we booked at villa valriche through mari deal for 2 nights and check-in was too lengthy (almost 2 hours) and we were requested to make a deposit of rs 10,000 or credit card which we were never informed about it upon booking.
lovely place to recharge.
one word: suoerb
definitely not a 5 star. extremely poor staff service.
>>>>>>>. Oasis Villas by Evaco Holidays
excellent
spent 3 days with my family and really enjoyed my stay. the advantage of oasis is its privacy - with 3 children under 6 years, going to dinner/breakfast at hotels is often a burden rather than an enjoyable experience.
staff were very friendly and welcoming. artee and menni made sure everything was fine and brought breakfast - warm croissants - every morning. atish made the check-in arrangements - and was fast and hassle free.
will definitely go again!
What should I do to convert this output to a dataframe with the columns name and comment?
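One way to get there (a sketch only, assuming the pandas library and the selection1 structure built above) is to collect each (name, comment) pair into a list of rows and pass it to pandas.DataFrame:

import pandas as pd

rows = []  # one dict per comment
for item in selection1:
    name = item['name']
    for com in item['reviews']:
        comment = com['review'].lower()
        rows.append({'name': name, 'comment': comment})

df = pd.DataFrame(rows, columns=['name', 'comment'])
print(df.head())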
After tokenizing, my sentence contains many weird characters. How can I remove them?
This is my code:
def summary(filename, method):
    list_names = glob.glob(filename)
    orginal_data = []
    topic_data = []
    print(list_names)
    for file_name in list_names:
        article = []
        article_temp = io.open(file_name, "r", encoding="utf-8-sig").readlines()
        for line in article_temp:
            print(line)
            if (line.strip()):
                tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
                sentences = tokenizer.tokenize(line)
                print(sentences)
                article = article + sentences
        orginal_data.append(article)
        topic_data.append(preprocess_data(article))
    if (method == "orig"):
        summary = generate_summary_origin(topic_data, 100, orginal_data)
    elif (method == "best-avg"):
        summary = generate_summary_best_avg(topic_data, 100, orginal_data)
    else:
        summary = generate_summary_simplified(topic_data, 100, orginal_data)
    return summary
The print(line) call prints a line of the .txt file, and print(sentences) prints the tokenized sentences from that line.
But sometimes the sentences contain weird characters after NLTK's processing.
Assaly, who is a fan of both Pusha T and Drake, said he and his friends
wondered if people in the crowd might boo Pusha T during the show, but
said he never imagined actual violence would take place.
[u'Assaly, who is a fan of both Pusha T and Drake, said he and his
friends wondered if people in\xa0the crowd might boo Pusha\xa0T during
the show, but said he never imagined actual violence would take
place.']
As in the example above, where do the \xa0 and \xa0T come from?
# \xa0 is a NO-BREAK SPACE (U+00A0); it usually comes from the source text itself
# (e.g. &nbsp; in scraped HTML), not from NLTK. Either replace it or normalize it away:
x = u'Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in\xa0the crowd might boo Pusha\xa0T during the show, but said he never imagined actual violence would take place.'

# method 1: replace the non-breaking space with a regular space
x = x.replace('\xa0', ' ')

# method 2: Unicode compatibility normalization maps \xa0 to a plain space
import unicodedata
x = unicodedata.normalize('NFKD', x)

print(x)
Output:
Assaly, who is a fan of both Pusha T and Drake, said he and his friends wondered if people in the crowd might boo Pusha T during the show, but said he never imagined actual violence would take place.
Reference: unicodedata.normalize()
I already managed to uncover the spoken words with some help.
Now I'm looking to get the text spoken by a chosen person, so that I can type in MIA and get every single word she says in the movie.
Like this:
name = input("Enter name:")
wordsspoken(script, name)
name1 = input("Enter another name:")
wordsspoken(script, name1)
So I'm able to count the words afterwards.
This is what the movie script looks like:
An awkward beat. They pass a wooden SALOON -- where a WESTERN
is being shot. Extras in COWBOY costumes drink coffee on the
steps.

Revision 25.

                   MIA (CONT'D)
      I love this stuff. Makes coming to work
      easier.

                   SEBASTIAN
      I know what you mean. I get breakfast
      five miles out of the way just to sit
      outside a jazz club.

                   MIA
      Oh yeah?

                   SEBASTIAN
      It was called Van Beek. The swing bands
      played there. Count Basie. Chick Webb.
            (then,)
      It's a samba-tapas place now.

                   MIA
      A what?

                   SEBASTIAN
      Samba-tapas. It's... Exactly. The joke's on
      history.
I would ask the user for all the names in the script first, then ask which name they want the words for. I would search the text word by word until I found the wanted name, and copy the following words into a variable until I hit a name that matches someone else in the script. Characters may mention another character's name in their dialogue, but if you assume speaker names are either all caps or on their own line, the text should be fairly easy to filter.
recording = False
spoken_text = ""
for word in script:  # assumes script is already split into a list of words
    if word == speaker and word.isupper():  # you may want to check that this is on its own line as well
        recording = True
    elif word in character_names and word.isupper():  # you may want to check that this is on its own line as well
        recording = False
    if recording:
        spoken_text += word + " "
I will outline how you could generate a dict that gives you the number of words spoken by every speaker, and also a version that approximates your existing implementation.
General Use
If we define a word to be any chunk of characters in a string split along ' ' (space)...
import re

speaker = ''     # current speaker
words = 0        # number of words on the line
word_count = {}  # dict of speakers and the number of words they speak

for line in script.split('\n'):
    if re.match('^[ ]{19}[^ ]{1,}.*', line):  # name of speaker
        speaker = line.split(' (')[0][19:]
    if re.match('^[ ]{6}[^ ]{1,}.*', line):   # dialogue line
        words = len(line.split())
        if speaker in word_count:
            word_count[speaker] += words
        else:
            word_count[speaker] = words
Generates a dict with the format {'JOHN DOE':55} if John Doe says 55 words.
Example output:
>>> word_count['MIA']
13
Your Implementation
Here is a version of the above procedure that approximates your implementation.
import re

def wordsspoken(script, name):
    word_count = 0
    for line in script.split('\n'):
        if re.match('^[ ]{19}[^ ]{1,}.*', line):  # name of speaker
            speaker = line.split(' (')[0][19:]
        if re.match('^[ ]{6}[^ ]{1,}.*', line):   # dialogue line
            if speaker == name:
                word_count += len(line.split())
    print(word_count)

def main():
    name = input("Enter name:")
    wordsspoken(script, name)
    name1 = input("Enter another name:")
    wordsspoken(script, name1)
If you want to compute your tally with only one pass over the script (which I imagine could be pretty long), you could just track which character is speaking; set things up like a little state machine:
import re
from collections import Counter, defaultdict

words_spoken = defaultdict(Counter)
currently_speaking = 'Narrator'

for line in SCRIPT.split('\n'):
    name = line.replace('(CONT\'D)', '').strip()
    if re.match('^[A-Z]+$', name):
        currently_speaking = name
    else:
        words_spoken[currently_speaking].update(line.split())
You could use a more sophisticated regex to detect when the speaker changes, but this should do the trick.
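For instance, a slightly stricter pattern (an untested sketch, not from the original answer) could require the whole line to be upper case while still allowing spaces, periods, apostrophes and an optional (CONT'D) suffix:

import re

SPEAKER_RE = re.compile(r"^\s*[A-Z][A-Z .']*(\(CONT'D\))?\s*$")

def is_speaker_line(line):
    # True for lines like "MIA" or "MIA (CONT'D)", False for action/dialogue lines.
    return bool(SPEAKER_RE.match(line))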
There are some good ideas above. The following should work just fine in Python 2.x and 3.x:
import codecs
from collections import defaultdict

speaker_words = defaultdict(str)

with codecs.open('script.txt', 'r', 'utf8') as f:
    speaker = ''
    for line in f.read().split('\n'):
        # skip empty lines
        if not line.split():
            continue

        # speakers have their names in all uppercase
        first_word = line.split()[0]
        if (len(first_word) > 1) and all([char.isupper() for char in first_word]):
            # remove the (CONT'D) from a speaker string
            speaker = line.split('(')[0].strip()

        # check if this is a dialogue line
        elif len(line) - len(line.lstrip()) == 6:
            speaker_words[speaker] += line.strip() + ' '

# get a Python-version-agnostic input
try:
    prompt = raw_input
except NameError:
    prompt = input

speaker = prompt('Enter name: ').strip().upper()
print(speaker_words[speaker])
Example Output:
Enter name: sebastian
I know what you mean. I get breakfast five miles out of the way just to sit outside a jazz club. It was called Van Beek. The swing bands played there. Count Basie. Chick Webb. It's a samba-tapas place now. Samba-tapas. It's... Exactly. The joke's on history.
I am trying to create an index of a .txt file, which has section titles in all caps. My attempt looks like this:
dictionary = {}
line_count = 0
for line in file:
    line_count += 1
    line = re.sub(r'[^a-zA-Z0-9 -]', '', line)
    list = []
    if line.isupper():
        head = line
    else:
        list = line.split(' ')
    for i in list:
        if i not in stopwords:
            dictionary.setdefault(i, {}).setdefault(head, []).append(line_count)
The head variable, however, never receives the value I am trying to assign from the all-caps lines. My desired output would be something like:
>>dictionary['cat']
{'THE PARABLE': [3894, 3924, 3933, 3936, 3939], 'SNOW': [4501], 'THE CHASE': [6765, 6767, 6772, 6773, 6785, 6802, 6807, 6820, 6823, 6839]}
Here is a slice of the data:
THE GOLDEN BIRD
A certain king had a beautiful garden, and in the garden stood a tree
which bore golden apples. These apples were always counted, and about
the time when they began to grow ripe it was found that every night one
of them was gone.
THE PARABLE
Influenced by those remarks, the bird next morning refused to bring in
the wood, telling the others that he had been their servant long enough,
and had been a fool into the bargain, and that it was now time to make a
change, and to try some other way of arranging the work. Beg and pray
as the mouse and the sausage might, it was of no use; the bird remained
master of the situation, and the venture had to be made. They therefore
drew lots, and it fell to the sausage to bring in the wood, to the mouse
to cook, and to the bird to fetch the water.
At the heart of it, your problem is this:
if line.isupper():
This test will fail for a heading that is just a number ('0') or other such edge cases. Instead try:
if line.upper() == line:
But in general, your code could use a little Pythonic love, maybe something like:
import re

data = {}
head = ''
with open('file1', 'rU') as f:
    for line_num, line in enumerate(f):
        line = re.sub(r'[^a-zA-Z0-9 -]', '', line)
        if line.upper() == line:
            head = line
        else:
            # stopwords is assumed to be defined elsewhere, as in your code
            for word in (w for w in line.split(' ') if w not in stopwords):
                data.setdefault(word, {}).setdefault(head, []).append(line_num + 1)
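The index can then be queried the same way as in your desired output (a usage sketch; the actual line numbers depend on your file):

print(data.get('cat', {}))  # maps each section title to the line numbers where the word appears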