How to retrieve the storyline paragraph with imdbpy? - python

So far, I haven't figured out how to retrieve the short description paragraph from imdb with imdbpy.
I can retrieve a very (very) long plot this way, though:
from imdb import IMDb

ia = IMDb()
movie = ia.search_movie("brazil")
movie = movie[0]
movie = ia.get_movie(movie.movieID)
plot = movie.get('plot', [''])[0]
plot = plot.split('::')[0]
The last line removes the submitter username.
In the HTML source, the block I'm looking for is marked up as <p itemprop="description">.
Any idea?
Thanks!

description = movie.get('plot outline')
This will give you a list of the types of information available for a movie:
movie.keys()
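For reference, a minimal end-to-end sketch (assuming a current IMDbPY install, imported as the imdb package) that pulls the short plot outline for the first search result:
from imdb import IMDb

ia = IMDb()
results = ia.search_movie("brazil")
movie = ia.get_movie(results[0].movieID)
# 'plot outline' holds the short storyline text; fall back to '' if it is missing
description = movie.get('plot outline', '')
print(description)
# movie.keys() lists every field IMDbPY populated for this title
print(movie.keys())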

Related

Python Web-scraping, category extraction

I have the below code to extract the quote text and author using BeautifulSoup. I am able to get those, however each quote falls under a category (e.g. KINDNESS in the HTML below, at the end of the string). Kindly let me know how to get the category along with the quote text and author.
table = soup.findAll('img')
for image in table:
    alt_table = image.attrs['alt'].split('#')
    # print(alt_table[0])  # Quote text extracted
    # print(len(alt_table))
    # To prevent index error if author is not there
    if len(alt_table) > 1:
        quote = alt_table[0]
        author = alt_table[1]
        author = (alt_table[1]).replace('<Author:', '').replace('>', '')  # Format author label
        print('Quote: %s \nAuthor: %s' % (quote, author))
    else:
        quote = alt_table[0]
        print('Quote: %s' % (quote))
HTML example:
</div><div class="col-6 col-lg-3 text-center margin-30px-bottom sm-margin-30px-top">
<img alt="Extend yourself in kindness to other human beings wherever you can. #<Author:0x00007f7746c65b78>" class="margin-10px-bottom shadow" height="310" src="https://assets.passiton.com/quotes/quote_artwork/8165/medium/20201208_tuesday_quote_alternate.jpg?1607102963" width="310"/>
<h5 class="value_on_red">KINDNESS</h5>
Since you are dealing with an <img> tag, use find_next to get the next tag and .text to get its value.
table = soup.findAll('img')
for image in table:
    alt_table = image.attrs['alt'].split('#')
    # print(alt_table[0])  # Quote text extracted
    # print(len(alt_table))
    # To prevent index error if author is not there
    if len(alt_table) > 1:
        quote = alt_table[0]
        author = alt_table[1]
        author = (alt_table[1]).replace('<Author:', '').replace('>', '')  # Format author label
        print('Quote: %s \nAuthor: %s' % (quote, author))
        print(image.find_next('h5', class_='value_on_red').find_next('a').text)
    else:
        quote = alt_table[0]
        print('Quote: %s' % (quote))
        print(image.find_next('h5', class_='value_on_red').find_next('a').text)

Web-scraping: unable to extract the required text

I am trying to extract the novel description from this url https://www.wuxiaworld.co/Horizon-Bright-Moon-Sabre/
However, when I try this code:
import requests
from bs4 import BeautifulSoup

site = "https://www.wuxiaworld.co/Horizon-Bright-Moon-Sabre/"
html = requests.get(site)
html.encoding = html.apparent_encoding
soup = BeautifulSoup(html.text, "html.parser")
summary = soup.find(id='intro').get_text()
print(summary)
I get:
Description
Process finished with exit code 0
Any help would be appreciated, thanks in advance.
Try this:
site = "https://www.wuxiaworld.co/Horizon-Bright-Moon-Sabre/"
html = requests.get(site)
soup = BeautifulSoup(html.content)
summary = soup.find(id ='intro')
print(summary.text)
This prints out:
Description Fu Hongxue was a cripple, born with a lame leg and subject
to epileptic seizures. He was also one of the most powerful, legendary
figures of the martial arts world, with a dull black saber that was
second to none. His fame made him a frequent target of challengers,
but whenever his saber left its sheath, only corpses would remain in
its wake. One day, however, F...

How to pass a Dynamic String value from Python to html in a folium based Map?

I am trying to create a map with dynamic information for all map markers, for example a map with markers for restaurants in an area that displays the name, picture and other information relevant to each restaurant.
Problem: how do I pass a dynamic string value from Python to HTML for each marker in a map?
I am able to link pictures with each marker correctly, but not text fields like names. It doesn't matter whether I put the html inside or outside the for loop; it always gives me the wrong, static value.
P.S - I am new to programming
# creating map layout, center point & view
m_sat = folium.Map(location=[28.595793, 77.414752], zoom_start=13, tiles='https://server.arcgisonline.com/ArcGIS/rest/services/World_Imagery/MapServer/tile/{z}/{y}/{x}', attr='Created by')  # added Esri_WorldImagery
html = '''<img ALIGN="Right" src="data:image/png;base64,{}">\
<h1>Name: </h1>{{ var_pass }}<br />\
<h2>Location: </h2>Noida<br />'''.format
# creating for loop below for dynamic content
for plot_numb in range(address.shape[0]):
    picture = base64.b64encode(open(str(plot_numb+1)+'.png', 'rb').read()).decode()
    iframe = IFrame(html(picture), width=300+200, height=300+20)
    popup = folium.Popup(iframe, max_width=650)
    icon = folium.Icon(color="white", icon="cloud", icon_color='black')
    tooltip = 'Click to view more about: ' + address.iloc[plot_numb, 0]
    var_pass = address.iloc[plot_numb, 0]
    marker = folium.Marker(location=address.iloc[plot_numb, 1],
                           popup=popup, tooltip=tooltip, icon=icon).add_to(m_sat)
m_sat
I should be able to display the relevant name information for each marker on the map.
Attaching an end-result picture of the issue:
Example of "address" DataFrame:
Name Location
0 Farzi Cafe [28.562, 77.387]
1 Skylounge [28.562, 77.387]
2 Tamasha Cafe [28.562, 77.387]
3 Starbucks [28.565, 77.449]
4 Pizza Hut [28.620, 77.425]
Try this:
var_name = 'restaurant_name'
var_loc = 'restaurant_location'
var_picture = '<base64 image data>'  # placeholder for the base64-encoded picture string

html = f'''<img ALIGN="Right" src="data:image/png;base64,{var_picture}">\
<h1>Name: </h1>{var_name}<br />\
<h2>Location: </h2>{var_loc}<br />\
'''
html
## '<img ALIGN="Right" src="data:image/png;base64,<base64 image data>"><h1>Name: </h1>restaurant_name<br /><h2>Location: </h2>restaurant_location<br />'
Now, loop over your data frame for all the names and locations.
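A minimal sketch of that loop, assuming the same address DataFrame, PNG files named 1.png, 2.png, ... as in the question, and the usual folium imports; the key point is that the HTML string is built inside the loop so every marker gets its own name and location:
import base64
import folium
from folium import IFrame

m_sat = folium.Map(location=[28.595793, 77.414752], zoom_start=13)
for plot_numb in range(address.shape[0]):
    var_name = address.iloc[plot_numb, 0]   # restaurant name for this row
    var_loc = address.iloc[plot_numb, 1]    # [lat, lon] for this row
    var_picture = base64.b64encode(open(str(plot_numb + 1) + '.png', 'rb').read()).decode()
    # build the popup HTML inside the loop so each marker gets its own values
    html = f'''<img ALIGN="Right" src="data:image/png;base64,{var_picture}">\
<h1>Name: </h1>{var_name}<br />\
<h2>Location: </h2>{var_loc}<br />'''
    iframe = IFrame(html, width=500, height=320)
    popup = folium.Popup(iframe, max_width=650)
    folium.Marker(location=var_loc, popup=popup,
                  tooltip='Click to view more about: ' + var_name).add_to(m_sat)
m_sat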

How to extract text from a html table row

This is my string:
content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'
I have tried the regular expression below to extract the text between the h5 tags:
reg = re.search(r'<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>([A-Za-z0-9%s]+)</h5></span></td></tr>' % string.punctuation,content)
It returns exactly what I want.
Is there a more Pythonic way to do this?
Dunno whether this qualifies as more pythonic or not, but it handles it as HTML data.
from lxml import html
content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'
HtmlData = html.fromstring(content)
ListData = HtmlData.xpath('//text()')
And to get the last element:
ListData[-1]
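If you already have BeautifulSoup around (it is used in the other answers on this page), a comparable sketch, assuming bs4 is installed, would be to target the h5 tag directly:
from bs4 import BeautifulSoup

content = '<tr class="cart-subtotal"><th>RTO / Registration office :</th><td><span class="amount"><h5>Yadgiri</h5></span></td></tr>'
soup = BeautifulSoup(content, 'html.parser')
office = soup.find('h5').get_text(strip=True)
print(office)  # Yadgiri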

Python extracting data from HTML using split

A certain page retrieved from a URL has the following syntax:
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
I want to extract the data in Name, Surname etc. (I have to repeat this task for many pages)
For that I tried using the following code:
import urllib2
url = 'http://www.my.lk/details.aspx?view=1&id=%2031'
source = urllib2.urlopen(url)
start = '<p><strong>Given Name:</strong>'
end = '<strong>Surname'
givenName=(source.read().split(start))[1].split(end)[0]
start = 'Surname: </strong>'
end = 'Former/AKA Name'
surname=(source.read().split(start))[1].split(end)[0]
print(givenName)
print(surname)
When I call the source.read().split() method only once, it works fine, but when I use it twice it gives a list index out of range error.
Can someone suggest a solution?
You can use BeautifulSoup to parse the HTML string.
Here is some code you might try. It uses BeautifulSoup to get the text produced by the HTML, then parses that string to extract the data.
from bs4 import BeautifulSoup as bs

dic = {}
data = \
"""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""

soup = bs(data)
# Get the text of the html through BeautifulSoup
text = soup.get_text()
# parsing the text
lines = text.splitlines()
for line in lines:
    # check if line has ':', if it doesn't, move to the next line
    if line.find(':') == -1:
        continue
    # split the string at ':'
    parts = line.split(':')
    # You can add more tests here like
    # if len(parts) != 2:
    #     continue
    # stripping whitespace
    for i in range(len(parts)):
        parts[i] = parts[i].strip()
    # adding the values to a dictionary
    dic[parts[0]] = parts[1]
    # printing the data after processing
    print '%16s %20s' % (parts[0], parts[1])
A tip:
If you are going to use BeautifulSoup to parse HTML, it helps when tags carry consistent attributes like class="input" or id="10"; that is, keep all tags of the same type under the same id or class.
Update
As for your comment, see the code below. It applies the tip above, making life (and coding) a lot easier.
from bs4 import BeautifulSoup as bs

c_addr = []
id_addr = []
data = \
"""
<h2>Primary Location</h2>
<div class="address" id="10">
<p>
No. 4<br>
Private Drive,<br>
Sri Lanka ON K7L LK <br>
"""

soup = bs(data)
for i in soup.find_all('div'):
    # get data using "class" attribute
    addr = ""
    if i.get("class")[0] == u'address':  # unicode string
        text = i.get_text()
        for line in text.splitlines():  # line-wise
            line = line.strip()  # remove whitespace
            addr += line  # add to address string
        c_addr.append(addr)

    # get data using "id" attribute
    addr = ""
    if int(i.get("id")) == 10:  # integer
        text = i.get_text()
        # same processing as above
        for line in text.splitlines():
            line = line.strip()
            addr += line
        id_addr.append(addr)

print "id_addr"
print id_addr
print "c_addr"
print c_addr
You are calling read() twice. That is the problem. Instead of doing that you want to call read once, store the data in a variable, and use that variable where you were calling read(). Something like this:
fetched_data = source.read()
Then later...
givenName=(fetched_data.split(start))[1].split(end)[0]
and...
surname=(fetched_data.split(start))[1].split(end)[0]
That should work. The reason your code didn't work is that read() consumes the content the first time; once it finishes, the file-like object is positioned at the end of the data. The next call to read() returns an empty string, so the split produces a list that is too short and indexing it raises the list index out of range error.
Check out the docs for urllib2 and methods on file objects
If you want to be quick, regexes are more useful for this kind of task. It can be a harsh learning curve at first but regexes will save your butt one day.
Try this code:
import re

# read the whole document into memory
full_source = source.read()

# note: flags belong in re.compile(); the second argument to pattern.search() is a start position
NAME_RE = re.compile('Name:.+?>(.*?)<', re.MULTILINE)
SURNAME_RE = re.compile('Surname:.+?>(.*?)<', re.MULTILINE)

name = NAME_RE.search(full_source).group(1).strip()
surname = SURNAME_RE.search(full_source).group(1).strip()
See here for more info on how to use regexes in python.
A more comprehensive solution would involve parsing the HTML (using a lib like BeautifulSoup), but that can be overkill depending on your particular application.
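For instance, a short sketch of that BeautifulSoup route, assuming bs4 is installed and the markup keeps the label/value shape shown above (each <strong> holds a label and the text node right after it holds the value):
from bs4 import BeautifulSoup

html = """
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""

soup = BeautifulSoup(html, 'html.parser')
fields = {}
for label in soup.find_all('strong'):
    key = label.get_text(strip=True).rstrip(':')  # e.g. 'Name'
    value = label.next_sibling.strip()            # the text right after the </strong>
    fields[key] = value
print(fields['Name'] + ' ' + fields['Surname'])   # Pasan Wijesingher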
You can use HTQL:
page="""
<p>
<strong>Name:</strong> Pasan <br/>
<strong>Surname: </strong> Wijesingher <br/>
<strong>Former/AKA Name:</strong> No Former/AKA Name <br/>
<strong>Gender:</strong> Male <br/>
<strong>Language Fluency:</strong> ENGLISH <br/>
</p>
"""
import htql
print(htql.query(page, "<p>.<strong> {a=:tx; b=:xx} "))
# [('Name:', ' Pasan '),
# ('Surname: ', ' Wijesingher '),
# ('Former/AKA Name:', ' No Former/AKA Name '),
# ('Gender:', ' Male '),
# ('Language Fluency:', ' ENGLISH ')
# ]
