Copy text from a specific point on a web page in Python

Say I was given a web page, e.g. this one: how could I copy the text starting from <root response="True"> and ending at </root>?
How could I do this in Python?

import xml.etree.ElementTree as et
import requests

URL = "http://www.omdbapi.com/?t=True%20Grit&r=XML"

def main():
    pg = requests.get(URL).content
    root = et.fromstring(pg)
    for attr, value in root[0].items():
        print("{:>10}: {}".format(attr, value))

if __name__ == "__main__":
    main()
results in
poster: http://ia.media-imdb.com/images/M/MV5BMjIxNjAzODQ0N15BMl5BanBnXkFtZTcwODY2MjMyNA##._V1_SX300.jpg
metascore: 80
director: Ethan Coen, Joel Coen
released: 22 Dec 2010
awards: Nominated for 10 Oscars. Another 30 wins & 85 nominations.
year: 2010
genre: Adventure, Drama, Western
imdbVotes: 184,711
plot: A tough U.S. Marshal helps a stubborn young woman track down her father's murderer.
rated: PG-13
language: English
title: True Grit
country: USA
writer: Joel Coen (screenplay), Ethan Coen (screenplay), Charles Portis (novel)
actors: Jeff Bridges, Hailee Steinfeld, Matt Damon, Josh Brolin
imdbID: tt1403865
runtime: 110 min
type: movie
imdbRating: 7.7

I would use requests and BeautifulSoup for this:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.omdbapi.com/?t=True%20Grit&r=XML')
>>> soup = BeautifulSoup(r.text, 'html.parser')
>>> list(soup('root')[0].children)
[<movie actors="Jeff Bridges, Hailee Steinfeld, Matt Damon, Josh Brolin" awards="Nominated for 10 Oscars. Another 30 wins & 85 nominations." country="USA" director="Ethan Coen, Joel Coen" genre="Adventure, Drama, Western" imdbid="tt1403865" imdbrating="7.7" imdbvotes="184,711" language="English" metascore="80" plot="A tough U.S. Marshal helps a stubborn young woman track down her father's murderer." poster="http://ia.media-imdb.com/images/M/MV5BMjIxNjAzODQ0N15BMl5BanBnXkFtZTcwODY2MjMyNA##._V1_SX300.jpg" rated="PG-13" released="22 Dec 2010" runtime="110 min" title="True Grit" type="movie" writer="Joel Coen (screenplay), Ethan Coen (screenplay), Charles Portis (novel)" year="2010"></movie>]
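From there, individual attributes of the <movie> tag can be read like dictionary keys (a small follow-up in the same session; note that bs4 lowercases attribute names when parsing as HTML):
>>> movie = soup('root')[0].find('movie')
>>> movie['title'], movie['year']
('True Grit', '2010')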

Download the document with urllib2: http://docs.python.org/2/howto/urllib2.html
A good parser for short, simple, well-formed XML like this is minidom. Here is how to parse with it:
http://docs.python.org/2/library/xml.dom.minidom.html
Then get the text, e.g.: Getting text between xml tags with minidom
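A minimal sketch of that approach, using the Python 3 equivalents (urllib.request and xml.dom.minidom) of the Python 2 modules linked above:

from urllib.request import urlopen
from xml.dom import minidom

URL = "http://www.omdbapi.com/?t=True%20Grit&r=XML"

# Parse the downloaded XML; the single <movie> element carries its
# data as attributes rather than as text nodes.
doc = minidom.parseString(urlopen(URL).read())
movie = doc.getElementsByTagName("movie")[0]
for name, value in movie.attributes.items():
    print("{:>10}: {}".format(name, value))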

Related

How to clean duplicate data from webscraping?

So I want to make a list of books from a bookstore with a web scraper. I need the title and author of each book. I can get the title nicely. The problem is the author: the class that holds the author holds other data too. If I run the script, it duplicates the data and also pulls in data I don't need (book code, publisher, year, etc.).
Here is the script:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}
response = requests.get('https://www.apollo.ee/raamatud/eestikeelsed-raamatud/ilukirjandus/ulme-ja-oudus?page=7&mode=list', headers=headers)
webpage = response.content
soup = BeautifulSoup(webpage, 'html.parser')

for parent in soup.find_all('div', class_='product-info'):
    for n, tag in enumerate(parent.find_all('div')):
        title = [x for x in tag.find_all('a', class_='block no-underline product-link')]
        author = [x for x in tag.find_all('span', class_='info__text block weight-700 mt8 mb16')]
        for item in title:
            print('TITLE:', item.text.strip())
        for item in author:
            author = item.text.split('\n')
            print('AUTHOR: ', item.text.strip())
Example result:
TITLE: Parv II
AUTHOR: Frank Schätzing
AUTHOR: 9789985317105
AUTHOR: Varrak
AUTHOR: Kõvakaaneline
AUTHOR: 2010
AUTHOR: Frank Schätzing
AUTHOR: 9789985317105
AUTHOR: Varrak
AUTHOR: Kõvakaaneline
AUTHOR: 2010
AUTHOR: Frank Schätzing
AUTHOR: 9789985317105
AUTHOR: Varrak
As you can see, the author data duplicates and I get data that I don't need (publisher, year, code, etc.). I know that all these different fields share the same class. My question would be: is there a way to get only the author? Or how do I clean it up?
Thank you
You can use CSS selectors to properly select the Title/Author:
import requests
from bs4 import BeautifulSoup

url = "https://www.apollo.ee/raamatud/eestikeelsed-raamatud/ilukirjandus/ulme-ja-oudus?page=7&mode=list"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for book in soup.select(".product"):
    title = book.select_one(".product-title").text.strip()
    author = book.select_one(
        ".info__title:-soup-contains(Autor) + span"
    ).text.strip()
    print("{:<50} {}".format(title, author))
Prints:
Eric Terry Pratchett
Lõputuse piirid Lois McMaster Bujold
Mõõkade maru. II raamat George R. R. Martin
Euromant Maniakkide Tänav
Joosta oma varju eest. Täheaeg 9 Triinu Meres, Kadri Pettai, Maniakkide tänav, Tea Roosvald, Ülle Lätte, Jeff Vandermeer
Ilus pimedus Kami Garcia, Margaret Stohl
Bal-Sagothi jumalad Robert E. Howard
Tüdruk veiniplekiga kleidis Eda Kalmre
Talvesepp Terry Pratchett
Võluv võrdsus Terry Pratchett
Nekromanteion Arvi Nikkarev
Vere magus lõhn Suzanne McLeod
Tantsud armastuse lõppu Liis Koger
Kõrbeoda Peter V. Brett
Palimpsest Charles Stross
Tigana Guy Gavriel Kay
Kardinali mõõgad Pierre Pevel
Ajaratsurid: Kättemaksja. P.C.Cast
Hullumeelsuse mägedes H. P. Lovecraft
Vori mäng Lois McMaster Bujold
Ajalooja Milvi Martina Piir
Punane Sonja Robert E. Howard
Sõda kosmose rannavetes Raul Sulbi
Katastroof Krystyna Kuhn
Robotid ja impeerium Isaac Asimov
Ajaratsurid: Tapja Cindy Dees
Hundipäikese aeg III Tamur Kusnets
Gort Ashryn III osa. Rahu Leo Kunnas
Surnud, kuni jõuab öö Charlaine Harris
Lase sisse see õige John Ajvide Lindqvist
Mäng Krystyna Kuhn
Järgmiseks Crichton, Michael
Kõigi hingede öö Jennifer Armintrout
Kaose märk Roger Zelazny
Allakäik Guillermo del Toro, Chuck Hogan
Külmavõetud Richelle Mead
Nähtamatud akadeemikud Terry Pratchett
Varjus hiilija Aleksei Pehhov
Ajaratsurid: Otsija Lindsay McKenna
Maalingutega mees Peter V. Brett
Viimane sõrmusekandja Kirill Jeskov
Falkenbergi leegion Jerry Pournelle
Sõduri õpilane Lois McMaster Bujold
Vampiiri kättemaks Alexis Morgan
Kurjuse küüsis Jennifer Armintrout
Tiiger! Tiiger! Alfred Bester
Tume torn Stephen King
Koidu robotid Isaac Asimov
Nõid Michael Scott
Päevane vahtkond Sergei Lukjanenko
Perpendikulaarne maailm Kir Bulöchov
Ümbersünd Jennifer Armintrout
Kadunud oaas Paul Sussman
Täheaeg 7. Ingel ja kvantkristall Raul Sulbi
Leek sügaviku kohal Vernor Vinge
Sümfoonia katkenud keelele Siim Veskimees
Parv II Frank Schätzing
Kuldne linn John Twelve Hawks
Hõbekeel Charlie Fletcher
Rahategu Terry Pratchett
Amberi veri Roger Zelazny
Kuninga tagasitulek J.R.R. Tolkien
Tõbi Guillermo del Toro, Chuck Hogan
Võõrana võõral maal Robert A. Heinlein
Elu, universum ja kõik Douglas Adams
Frenchi ja Koulu reisid Indrek Hargla
13. sõdalane. Laibaõgijad Michael Crichton
Accelerando Charles Stross
Universumi lõpu restoran Douglas Adams
Surmakarva Maniakkide Tänav
Kauge maa laulud Arthur C. Clarke
Pöidlaküüdi reisijuht galaktikas Douglas Adams
Alasti päike Isaac Asimov
Barrayar Loius McMaster Bujold
Fevre´i unelm George R.R. Martin
Hukatuse kaardid Roger Zelazny
Loomine Bernard Beckett
Pika talve algus. Täheaeg 6 Raul Sulbi
Troonide mäng. II raamat George R. R. Martin
Malakas Terry Pratchett
Kuninglik salamõrtsukas Robin Hobb
Algolagnia Marion Andra
Tormikuninganna Marion Zimmer Bradley
Munk maailma äärel Skarabeus
Alaizabel Cray painaja Chris Wooding
Anubise väravad Tim Powers
Lisey lugu Stephen King
Postiteenistus Terry Pratchett
Linnusild Barry Hughart
Raudkäsi Charlie Fletcher
Südame metsad Charles de Lint
Meie, kromanjoonlased Tiit Tarlap
Calla hundid Stephen King
Ruunimärgid Joanne Harris
Tuule nimi. Kuningatapja kroonika I osa 2. raamat Patrick Rothfuss
Windhaven George R. R. Martin, Lisa Tuttle
Taevane tuli Anne Robillard
Vampiiri käealune Darren Shan
Teemärgid Roger Zelazny
Kinnimüüritud tuba Ilo

Webscraping past a show more button that extends the page

I'm trying to scrape data from Elle.com for a search term. I noticed that when I click the "show more" button, it sends a request that updates &page=2 in the URL. However, the following code just gets me a lot of duplicate entries. I need help finding a way to set a start point for each iteration of the loop (I think). Any ideas?
import requests, nltk, pandas as pd
from bs4 import BeautifulSoup as bs

def get_hits(url):
    r = requests.get(url)
    soup = bs(r.content, 'html')
    body = []
    for p in soup.find_all('p', {'class': 'body-text'}):
        sentences = nltk.sent_tokenize(p.text)
        result1 = [s for s in sentences if 'kim' in s]
        body.append(result1)
        result2 = [s for s in sentences if 'kanye' in s]
        body.append(result2)
    body = [a for a in body if a != []]
    if body == []:
        body.append("no hits")
    return body

titles = []
key_hits = []
urls = []
counter = 1
for i in range(1, 10):
    url = f'https://www.elle.com/search/?page={i}&q=kanye'
    r = requests.get(url)
    soup = bs(r.content, 'html')
    groups = soup.find_all('div', {'class': 'simple-item grid-simple-item'})
    for j in range(len(groups)):
        urls.append('https://www.elle.com' + groups[j].find('a')['href'])
        titles.append(groups[j].find('div', {'class': 'simple-item-title item-title'}).text)
        key_hits.append(get_hits('https://www.elle.com' + groups[j].find('a')['href']))
        if counter == 100:
            break
        counter += 1

data = pd.DataFrame({
    'Title': titles,
    'Body': key_hits,
    'Links': urls
})
data.head()
Let me know if there's something I don't understand that I probably should. Just a marketing researcher trying to learn powerful tools here.
To get pagination working on the site, you can use their infinite-scroll API URL (this example will print 9*42 titles):
import requests
from bs4 import BeautifulSoup

api_url = "https://www.elle.com/ajax/infiniteload/"
params = {
    "id": "search",
    "class": "CoreModels\\search\\TagQueryModel",
    "viewset": "search",
    "trackingId": "search-results",
    "trackingLabel": "kanye",
    "params": '{"input":"kanye","page_size":"42"}',
    "page": "1",
    "cachebuster": "undefined",
}

all_titles = set()
for page in range(1, 10):
    params["page"] = page
    soup = BeautifulSoup(
        requests.get(api_url, params=params).content, "html.parser"
    )
    for title in soup.select(".item-title"):
        print(title.text)
        all_titles.add(title.text)
    print()

print("Unique titles:", len(all_titles))  # <-- 9 * 42 = 378
Prints:
...
Kim Kardashian and Kanye West Respond to Those Divorce Rumors
People Are Noticing Something Fishy About Taylor Swift's Response to Kim Kardashian
Kim Kardashian Just Went on an Intense Twitter Rant Defending Kanye West
Trump Is Finally Able to Secure a Meeting With a Kim
Kim Kardashian West is Modeling Yeezy on the Street Again
Aziz Ansari's Willing to Model Kanye's Clothes
Unique titles: 378
Actually, the "load more" pagination is generated by API calls that return a plain HTML response. Each page link is a relative URL, so I convert it into an absolute URL using the urljoin method, and I paginate by building the list of api_urls.
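As a quick illustration of what urljoin does with these relative links (the example path is taken from the output below):

from urllib.parse import urljoin

# urljoin keeps the base's scheme and host and attaches the relative path
base = "https://www.elle.com"
print(urljoin(base, "/culture/celebrities/a37833256/kim-kardashian-kanye-west-reconciled/"))
# https://www.elle.com/culture/celebrities/a37833256/kim-kardashian-kanye-west-reconciled/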
Code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

api_urls = ["https://www.elle.com/ajax/infiniteload/?id=search&class=CoreModels%5Csearch%5CTagQueryModel&viewset=search&trackingId=search-results&trackingLabel=kanye&params=%7B%22input%22%3A%22kanye%22%2C%22page_size%22%3A%2242%22%7D&page=" + str(x) + "&cachebuster=undefined" for x in range(1, 4)]
base_url = "https://www.elle.com"

for url in api_urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.content, "lxml")
    cards = soup.select("div.simple-item.grid-simple-item")
    for card in cards:
        title = card.select_one("div.simple-item-title.item-title")
        p = card.select_one("a")
        l = p['href']
        abs_link = urljoin(base_url, l)
        print("Title:" + title.text + " Links: " + abs_link)
    print("-" * 80)
Output:
Title:Inside Kim Kardashian and Kanye West’s Current Relationship Amid Dinner Sighting Links: https://www.elle.com/culture/celebrities/a37833256/kim-kardashian-kanye-west-reconciled/
Title:Kim Kardashian And Ex Kanye West Left For SNL Together Amid Reports of Reconciliation Efforts Links: https://www.elle.com/culture/celebrities/a37919434/kim-kardashian-kanye-west-leave-for-snl-together-reconciliation/
Title:Kim Kardashian Wore a Purple Catsuit for Dinner With Kanye West Amid Reports She's Open to Reconciling Links: https://www.elle.com/culture/celebrities/a37822625/kim-kardashian-kanye-west-nobu-dinner-september-2021/
Title:How Kim Kardashian Really Feels About Kanye West Saying He ‘Wants Her Back’ Now Links: https://www.elle.com/culture/celebrities/a37463258/kim-kardashian-kanye-west-reconciliation-feelings-september-2021/
Title:Why Irina Shayk and Kanye West Called Off Their Two-Month Romance Links: https://www.elle.com/culture/celebrities/a37366860/why-irina-shayk-kanye-west-broke-up-august-2021/
Title:Kim Kardashian and Kanye West Reportedly Are ‘Working on Rebuilding’ Relationship and May Call Off Divorce Links: https://www.elle.com/culture/celebrities/a37421190/kim-kardashian-kanye-west-repairing-relationship-divorce-august-2021/
Title:What Kim Kardashian and Kanye West's ‘Donda’ Wedding Moment Really Means for Their Relationship Links: https://www.elle.com/culture/celebrities/a37415557/kim-kardashian-kanye-west-donda-wedding-moment-explained/
Title:What Kim Kardashian and Kanye West's Relationship Is Like Now: ‘The Tension Has Subsided’ Links: https://www.elle.com/culture/celebrities/a37383301/kim-kardashian-kanye-west-relationship-details-august-2021/
Title:How Kim Kardashian and Kanye West’s Relationship as Co-Parents Has Evolved Links: https://www.elle.com/culture/celebrities/a37250155/kim-kardashian-kanye-west-co-parents/
Title:Kim Kardashian Went Out in a Giant Shaggy Coat and a Black Wrap Top for Dinner in NYC Links: https://www.elle.com/culture/celebrities/a37882897/kim-kardashian-shaggy-coat-black-outfit-nyc-dinner/
Title:Kim Kardashian Wore Two Insane, Winter-Ready Outfits in One Warm NYC Day Links: https://www.elle.com/culture/celebrities/a37906750/kim-kardashian-overdressed-fall-outfits-october-2021/
Title:Kim Kardashian Dressed Like a Superhero for Justin Bieber's 2021 Met Gala After Party Links: https://www.elle.com/culture/celebrities/a37593656/kim-kardashian-superhero-outfit-met-gala-after-party-2021/
Title:Kim Kardashian Killed It In Her Debut as a Saturday Night Live Host Links: https://www.elle.com/culture/celebrities/a37918950/kim-kardashian-saturday-night-live-best-sketches/
Title:Kim Kardashian Has Been Working ‘20 Hours a Day’ For Her Appearance On SNL Links: https://www.elle.com/culture/celebrities/a37915962/kim-kardashian-saturday-night-live-preperation/
Title:Why Taylor Swift and Joe Alwyn Skipped the 2021 Met Gala Links: https://www.elle.com/culture/celebrities/a37446411/why-taylor-swift-joe-alwyn-skipped-met-gala-2021/
Title:Kim Kardashian Says North West Still Wants to Be an Only Child Five Years Into Having Siblings Links: https://www.elle.com/culture/celebrities/a37620539/kim-kardashian-north-west-only-child-comment-september-2021/
Title:How Kim Kardashian's Incognito 2021 Met Gala Glam Came Together Links: https://www.elle.com/beauty/makeup-skin-care/a37584576/kim-kardashians-incognito-2021-met-gala-beauty-breakdown/
Title:Kim Kardashian Completely Covered Her Face and Everything in a Black Balenciaga Look at the 2021 Met Gala Links: https://www.elle.com/culture/celebrities/a37578520/kim-kardashian-faceless-outfit-met-gala-2021/
Title:How Kim Kardashian Feels About Kanye West Singing About Their Divorce and ‘Losing My Family’ on Donda Album Links: https://www.elle.com/culture/celebrities/a37113130/kim-kardashian-kanye-west-divorce-song-donda-album-feelings/
Title:Kanye West Teases New Song In Beats By Dre Commercial Starring Sha'Carri Richardson Links: https://www.elle.com/culture/celebrities/a37090223/kanye-west-teases-new-song-in-beats-by-dre-commercial-starring-shacarri-richardson/
Title:Inside Kim Kardashian and Kanye West's Relationship Amid His Irina Shayk Romance Links: https://www.elle.com/culture/celebrities/a37077662/kim-kardashian-kanye-west-relationship-irina-shayk-romance-july-2021/
and ... so on

Python Webscraping Approach for Comparing Football Players' college alma maters with total NFL Fantasy Football output

I am looking to do a data science project where I sum up fantasy football points by the college the players went to (e.g. Alabama has 56 active players in the NFL, so I would go through a database and add up all of their fantasy points to compare with other schools).
I was looking at the website:
https://fantasydata.com/nfl/fantasy-football-leaders?season=2020&seasontype=1&scope=1&subscope=1&aggregatescope=1&range=3
and I was going to use Beautiful Soup to scrape the rows of players and statistics and ultimately, fantasy football points.
However, I am having trouble figuring out how to extract each player's college alma mater. To do so, I would have to:
Click each player's name
Scrape each and every profile of the hundreds of NFL players for the one line "College"
Place all of this information into its own column.
Any suggestions here?
There's no need for Selenium, or other headless, automated browsers. That's overkill.
If you take a look at your browser's network traffic, you'll notice that your browser makes a POST request to this REST API endpoint: https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read
If the POST request is well-formed, the API responds with JSON, containing information about every single player. Normally, this information would be used to populate the DOM asynchronously using JavaScript. There's quite a lot of information there, but unfortunately, the college information isn't part of the JSON response. However, there is a field PlayerUrlString, which is a relative-URL to a given player's profile page, which does contain the college name. So:
Make a POST request to the API to get information about all players
For each player in the response JSON:
Visit that player's profile
Use BeautifulSoup to extract the college name from the current player's profile
Code:
def main():
    import requests
    from bs4 import BeautifulSoup

    url = "https://fantasydata.com/NFL_FantasyStats/FantasyStats_Read"
    data = {
        "sort": "FantasyPoints-desc",
        "pageSize": "50",
        "filters.season": "2020",
        "filters.seasontype": "1",
        "filters.scope": "1",
        "filters.subscope": "1",
        "filters.aggregatescope": "1",
        "filters.range": "3",
    }

    response = requests.post(url, data=data)
    response.raise_for_status()
    players = response.json()["Data"]

    for player in players:
        url = "https://fantasydata.com" + player["PlayerUrlString"]
        response = requests.get(url)
        response.raise_for_status()
        soup = BeautifulSoup(response.content, "html.parser")
        college = soup.find("dl", {"class": "dl-horizontal"}).findAll("dd")[-1].text.strip()
        print(player["Name"] + " went to " + college)

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
Output:
Patrick Mahomes went to Texas Tech
Kyler Murray went to Oklahoma
Aaron Rodgers went to California
Russell Wilson went to Wisconsin
Josh Allen went to Wyoming
Deshaun Watson went to Clemson
Ryan Tannehill went to Texas A&M
Lamar Jackson went to Louisville
Dalvin Cook went to Florida State
...
You can also edit the pageSize POST parameter in the data dictionary. The 50 corresponds to the first 50 players in the JSON response (according to the filters set by the other POST parameters). Changing this value will yield more or fewer players in the JSON response.
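For example, to pull 300 players in one response instead of 50 (assuming the endpoint honors larger page sizes), change that one field before making the POST request:

data["pageSize"] = "300"  # assumption: the API accepts values larger than 50
response = requests.post(url, data=data)
response.raise_for_status()
print(len(response.json()["Data"]))  # up to 300 player records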
I agree, APIs are the way to go if they are there. My second "go to" is pandas' .read_html() (which uses BeautifulSoup under the hood to parse <table> tags). Here's an alternate solution using ESPN's API to get team roster links, then using pandas to pull the table from each link. This saves you the trouble of having to iterate through each player to get the college (I wish they just had an API that returned all players; nfl.com USED to have that, but it is no longer publicly available, that I know of).
Code:
import requests
import pandas as pd

all_teams = []
roster_links = []
for i in range(1, 35):
    url = 'http://site.api.espn.com/apis/site/v2/sports/football/nfl/teams/{teamId}'.format(teamId=i)
    jsonData = requests.get(url).json()
    print(jsonData['team']['displayName'])
    for link in jsonData['team']['links']:
        if link['text'] == 'Roster':
            roster_links.append(link['href'])
            break

for link in roster_links:
    print(link)
    tables = pd.read_html(link)
    df = pd.concat(tables).drop('Unnamed: 0', axis=1)
    # The roster table fuses name and jersey number; split them apart.
    df['Jersey'] = df['Name'].str.replace("([A-Za-z.' ]+)", '')
    df['Name'] = df['Name'].str.extract("([A-Za-z.' ]+)")
    all_teams.append(df)

final_df = pd.concat(all_teams).reset_index(drop=True)
Output:
print (final_df)
Name POS Age HT WT Exp College Jersey
0 Matt Ryan QB 35 6' 4" 217 lbs 13 Boston College 2
1 Matt Schaub QB 39 6' 6" 245 lbs 17 Virginia 8
2 Todd Gurley II RB 26 6' 1" 224 lbs 6 Georgia 21
3 Brian Hill RB 25 6' 1" 219 lbs 4 Wyoming 23
4 Qadree Ollison RB 24 6' 1" 232 lbs 2 Pittsburgh 30
... .. ... ... ... .. ... ...
1772 Jonathan Owens S 25 5' 11" 210 lbs 2 Missouri Western 36
1773 Justin Reid S 23 6' 1" 203 lbs 3 Stanford 20
1774 Ka'imi Fairbairn PK 26 6' 0" 183 lbs 5 UCLA 7
1775 Bryan Anger P 32 6' 3" 205 lbs 9 California 9
1776 Jon Weeks LS 34 5' 10" 242 lbs 11 Baylor 46
[1777 rows x 8 columns]
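Once college and fantasy-point data sit side by side in one DataFrame, the aggregation the original question asks for is a one-line groupby. A sketch with hypothetical placeholder numbers (a real run would merge the scraped stats with the roster's College column):

import pandas as pd

# Hypothetical combined data; the point values are placeholders, not real stats
df = pd.DataFrame({
    'Name': ['Patrick Mahomes', 'Kyler Murray', 'Josh Allen'],
    'College': ['Texas Tech', 'Oklahoma', 'Wyoming'],
    'FantasyPoints': [100.0, 90.0, 80.0],
})

# Sum fantasy points per college, highest total first
print(df.groupby('College')['FantasyPoints'].sum().sort_values(ascending=False))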

Extracting tag under tag values from HTML in python

<div class="book-cover-image">
<img alt="NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities" class="img-responsive" src="https://cdn.downtoearth.org.in/library/medium/2016-05-23/0.42611000_1463993925_book-cover.jpg" title="NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities"/>
</div>
I need to extract the title value from all such div tags. What would be the best way to perform this operation? Please suggest.
I am trying to fetch the titles of all the books mentioned on this page.
I have tried this so far:
import requests
from bs4 import BeautifulSoup as bs

url1 = "https://www.downtoearth.org.in/books"
page1 = requests.get(url1, verify=False)
#print(page1.content)
soup1 = bs(page1.content, 'html.parser')
class_names = soup1.find_all('div', {'class': 'book-cover-image'})
for class_name in class_names:
    title_text = class_name.text
    print(class_name)
    print(title_text)
To get all title attributes for the book covers, you can use the CSS selector .book-cover-image img[title] (select all <img> tags with a title attribute that are under a tag with class book-cover-image):
import requests
from bs4 import BeautifulSoup

url = 'https://www.downtoearth.org.in/books'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

for i, img in enumerate(soup.select('.book-cover-image img[title]'), 1):
    print('{:>4}\t{}'.format(i, img['title']))
Prints:
1 State of India’s Environment 2019: In Figures (eBook)
2 Victim Africa (eBook)
3 Frames of change - Heartening tales that define new India
4 STATE OF INDIA’S ENVIRONMENT 2019
5 State of India’s Environment In Figures 2018 (eBook)
6 Getting to know about environment
7 CLIMATE CHANGE NOW - The Story of Carbon Colonisation
8 Climate change - For the young and curious
9 Conflicts of Interest: My Journey through India’s Green Movement
10 Body Burden: Lifestyle Diseases
11 STATE OF INDIA’S ENVIRONMENT 2018
12 DROUGHT BUT WHY? How India can fight the scourge by abandoning drought relief
13 SOE 2017 (Print version) and SOE 2017 in Figures (Digital version) combo offer
14 State of India's Environment 2017 In Figures (eBook)
15 Environment Reader for Universities
16 Not in My Backyard (Book & DVD combo offer)
17 The Crow, Honey Hunter and the Kitchen Garden
18 BIOSCOPE OF PIU & POM
19 SOE 2017 and Food book combo offer
20 FIRST FOOD: Culture of Taste
21 Annual State Of India’s Environment - SOE 2017
22 An 8-million-year-old mysterious date with monsoon (e-book)
23 Why I Should be Tolerant
24 NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities
You can do it with XPath like this:
import requests
from lxml import html

url1 = "https://www.downtoearth.org.in/books"
res = requests.get(url1, verify=False)
tree = html.fromstring(res.text)
# XPath: the title attribute of every <img> under a div with class book-cover-image
d = tree.xpath("//div[@class='book-cover-image']//img/@title")
for title in d:
    print(title)
Output
State of India’s Environment 2019: In Figures (eBook)
Victim Africa (eBook)
Frames of change - Heartening tales that define new India
STATE OF INDIA’S ENVIRONMENT 2019
State of India’s Environment In Figures 2018 (eBook)
Getting to know about environment
CLIMATE CHANGE NOW - The Story of Carbon Colonisation
Climate change - For the young and curious
Conflicts of Interest: My Journey through India’s Green Movement
Body Burden: Lifestyle Diseases
STATE OF INDIA’S ENVIRONMENT 2018
DROUGHT BUT WHY? How India can fight the scourge by abandoning drought relief
SOE 2017 (Print version) and SOE 2017 in Figures (Digital version) combo offer
State of India's Environment 2017 In Figures (eBook)
Environment Reader for Universities
Not in My Backyard (Book & DVD combo offer)
The Crow, Honey Hunter and the Kitchen Garden
BIOSCOPE OF PIU & POM
SOE 2017 and Food book combo offer
FIRST FOOD: Culture of Taste
Annual State Of India’s Environment - SOE 2017
An 8-million-year-old mysterious date with monsoon (e-book)
Why I Should be Tolerant
NOT IN MY BACKYARD – Solid Waste Mgmt in Indian Cities

How to scrape data from imdb business page?

I am making a project that requires data from the IMDb business page. I'm using Python. The data is stored between two tags like this:
Budget
$220,000,000 (estimated)
I want the numeric amount but have not been successful so far. Any suggestions?
Take a look at Beautiful Soup; it's a useful library for scraping. If you take a look at the source, the "Budget" label is inside an h4 element, and the value comes next in the DOM. This may not be the best example, but it works for your case:
import urllib
from bs4 import BeautifulSoup

page = urllib.urlopen('http://www.imdb.com/title/tt0118715/?ref_=fn_al_nm_1a')
soup = BeautifulSoup(page.read())
for h4 in soup.find_all('h4'):
    if "Budget:" in h4:
        print h4.next_sibling.strip()
        # $15,000,000
This is a whole bunch of code (you can find what you need inside it). The Python script below will give you 1) the list of Top Box Office movies from IMDb, and 2) the list of cast members for each of them.
from lxml.html import parse

def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])
    if no_of_movies <= bo_total:
        count = no_of_movies
    else:
        count = bo_total
    movies = {}
    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com' + bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()
        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')
        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            if flag == 0:
                flag = 1  # skip the header row of the cast table
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)
        movies[i] = mo
    return movies

if __name__ == '__main__':
    no_of_movies = raw_input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))
    for k, v in bo_movies.iteritems():
        print '#'+str(k+1)+' '+v['title']+' ('+v['year']+')'
        print 'URL: '+v['url']
        print 'Weekend: '+v['weekend']
        print 'Gross: '+v['gross']
        print 'Weeks: '+v['weeks']
        print 'Cast: '+', '.join(v['cast'])
        print '\n'
Output (run in terminal):
parag#parag-innovate:~/python$ python imdb_bo_scraper.py
Enter no. of Box office movies to display:3
#1 Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden
#2 Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski
#3 Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long
Well, you asked for Python and you asked for a scraping solution.
But there is no need for Python and no need to scrape anything, because the budget figures are available in the business.list text file at http://www.imdb.com/interfaces
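If you'd rather stay in Python anyway, here is a hedged sketch of scanning that dump; the "MV:" (movie) and "BT:" (budget) record prefixes are assumptions about the old IMDb plain-text interface format:

# Hypothetical parser for business.list; the record prefixes are assumptions
def budgets(path="business.list"):
    title = None
    with open(path, encoding="latin-1") as f:
        for line in f:
            if line.startswith("MV: "):                # a movie record begins
                title = line[4:].strip()
            elif line.startswith("BT: ") and title:    # budget line for that movie
                yield title, line[4:].strip()

for title, budget in budgets():
    print(title, "->", budget)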
Try IMDbPY and its documentation. To install, just pip install imdbpy
from imdb import IMDb
ia = IMDb()
movie = ia.search_movie('The Untouchables')[0]
ia.update(movie)
#Lots of info for the movie from IMDB
movie.keys()
Though I'm not sure where to find the budget info specifically.
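If the default keys don't include it, one hedged guess is IMDbPY's optional info sets (treating the 'business' set name as an assumption):

ia.update(movie, info=['business'])  # assumption: 'business' is a valid info set
print(movie.get('business'))         # may hold budget/gross details if available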