So I want to make a list of books from a bookstore with a web scraper. I need the title and author of each book. I can get the title nicely; the problem is with the author. The author sits in an element whose class is shared by other data, so when I run the script the data is duplicated and I also get data I don't need (book code, publisher, year etc.).
Here is the script
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}
response = requests.get('https://www.apollo.ee/raamatud/eestikeelsed-raamatud/ilukirjandus/ulme-ja-oudus?page=7&mode=list', headers=headers)
webpage = response.content
soup = BeautifulSoup(webpage, 'html.parser')

for parent in soup.find_all('div', class_='product-info'):
    for n, tag in enumerate(parent.find_all('div')):
        title = [x for x in tag.find_all('a', class_='block no-underline product-link')]
        author = [x for x in tag.find_all('span', class_='info__text block weight-700 mt8 mb16')]
        for item in title:
            print('TITLE:', item.text.strip())
        for item in author:
            author = item.text.split('\n')
            print('AUTHOR: ', item.text.strip())
Example result:
TITLE: Parv II
AUTHOR: Frank Schätzing
AUTHOR: 9789985317105
AUTHOR: Varrak
AUTHOR: Kõvakaaneline
AUTHOR: 2010
AUTHOR: Frank Schätzing
AUTHOR: 9789985317105
AUTHOR: Varrak
AUTHOR: Kõvakaaneline
AUTHOR: 2010
AUTHOR: Frank Schätzing
AUTHOR: 9789985317105
AUTHOR: Varrak
As you can see, the author data is duplicated and I get data I don't need (publisher, year, code etc.). I know that all these different values share the same class. My question is: is there a way to get only the author data, or how can I clean the result?
Thank you
You can use CSS selectors to properly select the Title/Author:
import requests
from bs4 import BeautifulSoup
url = "https://www.apollo.ee/raamatud/eestikeelsed-raamatud/ilukirjandus/ulme-ja-oudus?page=7&mode=list"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for book in soup.select(".product"):
    title = book.select_one(".product-title").text.strip()
    author = book.select_one(".info__title:-soup-contains(Autor) + span").text.strip()
    print("{:<50} {}".format(title, author))
Prints:
Eric Terry Pratchett
Lõputuse piirid Lois McMaster Bujold
Mõõkade maru. II raamat George R. R. Martin
Euromant Maniakkide Tänav
Joosta oma varju eest. Täheaeg 9 Triinu Meres, Kadri Pettai, Maniakkide tänav, Tea Roosvald, Ülle Lätte, Jeff Vandermeer
Ilus pimedus Kami Garcia, Margaret Stohl
Bal-Sagothi jumalad Robert E. Howard
Tüdruk veiniplekiga kleidis Eda Kalmre
Talvesepp Terry Pratchett
Võluv võrdsus Terry Pratchett
Nekromanteion Arvi Nikkarev
Vere magus lõhn Suzanne McLeod
Tantsud armastuse lõppu Liis Koger
Kõrbeoda Peter V. Brett
Palimpsest Charles Stross
Tigana Guy Gavriel Kay
Kardinali mõõgad Pierre Pevel
Ajaratsurid: Kättemaksja. P.C.Cast
Hullumeelsuse mägedes H. P. Lovecraft
Vori mäng Lois McMaster Bujold
Ajalooja Milvi Martina Piir
Punane Sonja Robert E. Howard
Sõda kosmose rannavetes Raul Sulbi
Katastroof Krystyna Kuhn
Robotid ja impeerium Isaac Asimov
Ajaratsurid: Tapja Cindy Dees
Hundipäikese aeg III Tamur Kusnets
Gort Ashryn III osa. Rahu Leo Kunnas
Surnud, kuni jõuab öö Charlaine Harris
Lase sisse see õige John Ajvide Lindqvist
Mäng Krystyna Kuhn
Järgmiseks Crichton, Michael
Kõigi hingede öö Jennifer Armintrout
Kaose märk Roger Zelazny
Allakäik Guillermo del Toro, Chuck Hogan
Külmavõetud Richelle Mead
Nähtamatud akadeemikud Terry Pratchett
Varjus hiilija Aleksei Pehhov
Ajaratsurid: Otsija Lindsay McKenna
Maalingutega mees Peter V. Brett
Viimane sõrmusekandja Kirill Jeskov
Falkenbergi leegion Jerry Pournelle
Sõduri õpilane Lois McMaster Bujold
Vampiiri kättemaks Alexis Morgan
Kurjuse küüsis Jennifer Armintrout
Tiiger! Tiiger! Alfred Bester
Tume torn Stephen King
Koidu robotid Isaac Asimov
Nõid Michael Scott
Päevane vahtkond Sergei Lukjanenko
Perpendikulaarne maailm Kir Bulöchov
Ümbersünd Jennifer Armintrout
Kadunud oaas Paul Sussman
Täheaeg 7. Ingel ja kvantkristall Raul Sulbi
Leek sügaviku kohal Vernor Vinge
Sümfoonia katkenud keelele Siim Veskimees
Parv II Frank Schätzing
Kuldne linn John Twelve Hawks
Hõbekeel Charlie Fletcher
Rahategu Terry Pratchett
Amberi veri Roger Zelazny
Kuninga tagasitulek J.R.R. Tolkien
Tõbi Guillermo del Toro, Chuck Hogan
Võõrana võõral maal Robert A. Heinlein
Elu, universum ja kõik Douglas Adams
Frenchi ja Koulu reisid Indrek Hargla
13. sõdalane. Laibaõgijad Michael Crichton
Accelerando Charles Stross
Universumi lõpu restoran Douglas Adams
Surmakarva Maniakkide Tänav
Kauge maa laulud Arthur C. Clarke
Pöidlaküüdi reisijuht galaktikas Douglas Adams
Alasti päike Isaac Asimov
Barrayar Loius McMaster Bujold
Fevre´i unelm George R.R. Martin
Hukatuse kaardid Roger Zelazny
Loomine Bernard Beckett
Pika talve algus. Täheaeg 6 Raul Sulbi
Troonide mäng. II raamat George R. R. Martin
Malakas Terry Pratchett
Kuninglik salamõrtsukas Robin Hobb
Algolagnia Marion Andra
Tormikuninganna Marion Zimmer Bradley
Munk maailma äärel Skarabeus
Alaizabel Cray painaja Chris Wooding
Anubise väravad Tim Powers
Lisey lugu Stephen King
Postiteenistus Terry Pratchett
Linnusild Barry Hughart
Raudkäsi Charlie Fletcher
Südame metsad Charles de Lint
Meie, kromanjoonlased Tiit Tarlap
Calla hundid Stephen King
Ruunimärgid Joanne Harris
Tuule nimi. Kuningatapja kroonika I osa 2. raamat Patrick Rothfuss
Windhaven George R. R. Martin, Lisa Tuttle
Taevane tuli Anne Robillard
Vampiiri käealune Darren Shan
Teemärgid Roger Zelazny
Kinnimüüritud tuba Ilo
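Note that :-soup-contains() is a soupsieve pseudo-class, so it needs a reasonably recent bs4/soupsieve install. If that selector isn't available, the same pairing can be done without it; a sketch built on the same .info__title label / following span structure the selector above assumes:
import requests
from bs4 import BeautifulSoup

url = "https://www.apollo.ee/raamatud/eestikeelsed-raamatud/ilukirjandus/ulme-ja-oudus?page=7&mode=list"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for book in soup.select(".product"):
    title = book.select_one(".product-title").get_text(strip=True)
    author = ""
    # walk the info labels and take the <span> that follows the "Autor" label
    for label in book.select(".info__title"):
        if "Autor" in label.get_text():
            sibling = label.find_next_sibling("span")
            if sibling:
                author = sibling.get_text(strip=True)
            break
    print("{:<50} {}".format(title, author))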
So this is the HTML I'm working with:
<hr>
<b>1914 December 12 - </b>.
<ul>
<li>
<b>Birth of Herbert Hans Guendel</b> - .
<i>Nation</i>:
Germany,
USA.
<i>Related Persons</i>:
Guendel.
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
</li>
</ul>
I would like for it to look like this:
Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York.
Here's my code:
from bs4 import BeautifulSoup
import requests
import linkMaker as linkMaker
url = linkMaker.link
page = requests.get(url)
soup = BeautifulSoup(page.content, "html.parser")
with open("test1.txt", "w") as file:
hrs = soup.find_all('hr')
for hr in hrs:
lis = soup.find_all('li')
for li in lis:
file.write(str(li.text)+str(hr.text)+"\n"+"\n"+"\n")
Here's what it's returning:
Birth of Herbert Hans Guendel - .
: Germany,
USA.
Related Persons: Guendel.
German-American engineer in WW2, member of the Rocket Team in the United States thereafter. German expert in guided missiles during WW2. As of January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
My ultimate goal is to get those two parts of the HTML to tweet them out.
Looking at the HTML snippet: for the title you can search for the first <b> inside the <li> tag, and for the text you can take the last element of the <li> tag's .contents:
from bs4 import BeautifulSoup
html_doc = """\
<hr>
<b>1914 December 12 - </b>.
<ul>
<li>
<b>Birth of Herbert Hans Guendel</b> - .
<i>Nation</i>:
Germany,
USA.
<i>Related Persons</i>:
Guendel.
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York..
</li>
</ul>"""
soup = BeautifulSoup(html_doc, "html.parser")
title = soup.find("li").b.text
text = soup.find("li").contents[-1].strip(" .\n")
print(title)
print(text)
Prints:
Birth of Herbert Hans Guendel
German-American engineer in WW2, member of the Rocket Team in the United
States thereafter. German expert in guided missiles during WW2. As of
January 1947, working at Fort Bliss, Texas. Died at Boston, New York
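If the real page has many such <li> entries (the question's code loops over them), the same idea extends to a loop; a sketch, assuming every entry follows the <b> title plus trailing text layout above:
for li in soup.find_all("li"):
    if li.b is None:  # skip list items without a <b> title
        continue
    title = li.b.get_text(strip=True)
    text = li.contents[-1].strip(" .\n")
    print(title)
    print(text)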
How do I get the output from the title_op() and address_op() functions into a data frame like the one below?
title                   Address
Silk Court              16 Ivimey Street, Bethnal Green, London E2 6LR
Westport Care Home      14/26 Westport Street, Lime House, Wapping, London E1 0RA
Aspen Court Care Home   17/21 Dod Street, Poplar, London E14 7EG
etc etc
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup as soup
from selenium import webdriver

# grabs page and parses it through, ready for picking apart
my_url = "https://www.carehome.co.uk/care_search_results.cfm/searchunitary/Tower-Hamlets"
driver = webdriver.Chrome(executable_path='C:/Users/lemonade/Documents/work/chromedriver')
driver.get(my_url)
page_s = soup(driver.page_source, features='html.parser')

def title_op(page):
    title_container = page_s.select("div.home-name>p>a[href]")
    for container in title_container:
        titles = container.text.strip()
        print(titles)

def address_op(page):
    address_container = page_s.select("div.home-name>p.grey")
    for address in address_container:
        addresses = address.text
        print(addresses)

title_op(page_s)
address_op(page_s)
OUTPUT
Silk Court
Westport Care Home
Aspen Court Care Home
Beaumont Court
Hawthorn Green Residential and Nursing Home
Coxley House
Toby Lodge
Hotel in the Park
34/35 Huddleston Close
Approach Lodge
16 Ivimey Street, Bethnal Green, London E2 6LR
14/26 Westport Street, Lime House, Wapping, London E1 0RA
17/21 Dod Street, Poplar, London E14 7EG
Beaumont Square, Stepney, London E1 4NA
82 Redmans Road, Stepney Green, London E1 3AG
28 Bow Road, Bow, London E3 4LN
141 White Horse Road, London E1 0NW
130 Sewardstone Road, Bethnal Green, London E2 9HN
Bethnal Green, London E2 9NR
2 Approach Road, London E2 9LY
Assuming you are getting an appropriate response, gather the two lists, zip them, and wrap the result in a DataFrame call from pandas:
import pandas as pd
# your code
page_s = soup(driver.page_source, features='html.parser')
titles = [i.text for i in page_s.select('.home-name [href]')]
addresses = [i.text for i in page_s.select('p.grey')]
df = pd.DataFrame(zip(titles, addresses), columns = ['title','address'])
print(df)
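One caveat: zip pairs the two lists purely by position, so a single card missing an address would shift every row after it. A more defensive sketch, assuming each result sits in its own div.home-name container as the question's selectors suggest, pairs title and address per card:
rows = []
for card in page_s.select('div.home-name'):
    link = card.select_one('p > a[href]')  # home name
    addr = card.select_one('p.grey')       # address line
    if link and addr:
        rows.append({'title': link.text.strip(), 'address': addr.text.strip()})
df = pd.DataFrame(rows)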
I'm trying to pull table data from the following website: https://msih.bgu.ac.il/md-program/residency-placements/
While there are no table tags, I found that the common tag for the individual segments of the table is div class="accord-con". I made a dictionary where the keys are the graduation years (i.e. 2019, 2018, etc.) and the values are the HTML from each div class="accord-con". I'm stuck and don't know how to parse the HTML within the dictionary. My goal is to have separate lists of the specialty, hospital, and location for each year, and I don't know how to move forward.
Below is my working code:
import numpy as np
import bs4 as bs
from bs4 import BeautifulSoup
import urllib.request
import pandas as pd
sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
headers = soup.find_all('div', class_={'accord-head'})
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])
rez_classes = soup.find_all('div', class_={'accord-con'})
data_dict = dict(zip(grad_yr_list, rez_classes))
Here is a sample of what my dictionary looks like:
{'2019': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>University at Buffalo School of Medicine, Buffalo, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Aventura Hospital, Aventura, Fl</li></ul><h4>Family Medicine</h4><ul><li>Louisiana State University School of Medicine, New Orleans, LA</li><li>UT St Thomas Hospitals, Murfreesboro, TN</li><li>Sea Mar Community Health Center, Seattle, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>St Joseph Hospital, Denver, CO </li></ul><h4>Obstetrics-Gynecology</h4><ul><li>Jersey City Medical Center, Jersey City, NJ</li><li>New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY</li></ul><h4>Pediatrics</h4><ul><li>St Louis Children’s Hospital, St Louis, MO</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>St Christopher’s Hospital, Philadelphia, PA</li></ul><h4>Surgery</h4><ul><li>Mountain Area Health Education Center, Asheville, NC</li></ul><p></p></div>,
'2018': <div class="accord-con"><h4>Anesthesiology</h4><ul><li>NYU School of Medicine, New York, NY</li></ul><h4>Emergency Medicine</h4><ul><li>Kent Hospital, Warwick, Rhode Island</li><li>University of Connecticut School of Medicine, Farmington, CT</li><li>University of Texas Health Science Center at San Antonio, San Antonio, TX</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Family Medicine</h4><ul><li>University of Kansas Medical Center, Wichita, KS</li><li>Ellis Hospital, Schenectady, NY</li><li>Harrison Medical Center, Seattle, WA</li><li>St Francis Hospital, Wilmington, DE </li><li>University of Virginia, Charlottesville, VA</li><li>Valley Medical Center, Renton, WA</li></ul><h4>Internal Medicine</h4><ul><li>Oregon Health and Science University, Portland, OR</li><li>Virginia Commonwealth University Health Systems, Richmond, VA</li><li>University of Chicago Medical Center, Chicago, IL</li></ul><h4>Obstetrics-Gynecology</h4><ul><li>St Francis Hospital, Hartford, CT</li></ul><h4>Pediatrics</h4><ul><li>Case Western University Hospitals Cleveland Medical Center, Cleveland, OH</li><li>Jersey Shore University Medical Center, Neptune City, NJ</li><li>University of Maryland Medical Center, Baltimore, MD</li><li>University of Virginia, Charlottesville, VA</li><li>Vidant Medical Center East Carolina University, Greenville, NC</li></ul><h4>Preliminary Medicine Neurology</h4><ul><li>Howard University Hospital, Washington, DC</li></ul><h4>Preliminary Medicine Radiology</h4><ul><li>Maimonides Medical Center, Bronx, NY</li></ul><h4>Preliminary Medicine Surgery</h4><ul><li>Providence Park Hospital, Southfield, MI</li></ul><h4>Psychiatry</h4><ul><li>University of Maryland Medical Center, Baltimore, MI</li></ul><p></p></div>,
My ultimate goal is to pull this data into a pandas dataframe with the following columns: grad year, specialty, hospital, location
Your code is quite close to the end result. Once you have paired the years with the student placement data, simply apply an extraction function to the latter:
from bs4 import BeautifulSoup as soup
import re
from selenium import webdriver
_d = webdriver.Chrome('/path/to/chromedriver')
_d.get('https://msih.bgu.ac.il/md-program/residency-placements/')
d = soup(_d.page_source, 'html.parser')
def placement(block):
    r = block.find_all(re.compile('ul|h4'))
    return {r[i].text: [b.text for b in r[i+1].find_all('li')] for i in range(0, len(r)-1, 2)}

result = {i.h2.text: placement(i) for i in d.find_all('div', {'class': 'accord-head'})}
print(result['Class of 2019'])
Output:
{'Anesthesiology': ['University at Buffalo School of Medicine, Buffalo, NY'], 'Emergency Medicine': ['Aventura Hospital, Aventura, Fl'], 'Family Medicine': ['Louisiana State University School of Medicine, New Orleans, LA', 'UT St Thomas Hospitals, Murfreesboro, TN', 'Sea Mar Community Health Center, Seattle, WA'], 'Internal Medicine': ['Oregon Health and Science University, Portland, OR', 'St Joseph Hospital, Denver, CO\xa0'], 'Obstetrics-Gynecology': ['Jersey City Medical Center, Jersey City, NJ', 'New York Presbyterian Brooklyn Methodist Hospital, Brooklyn, NY'], 'Pediatrics': ['St Louis Children’s Hospital, St Louis, MO', 'University of Maryland Medical Center, Baltimore, MD', 'St Christopher’s Hospital, Philadelphia, PA'], 'Surgery': ['Mountain Area Health Education Center, Asheville, NC']}
Note: I ended up using selenium because, for me, the HTML returned by requests.get did not include the rendered student placement data.
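Since the stated goal is a dataframe with grad year, specialty, hospital, and location columns, the nested result dict above flattens readily; a sketch, assuming the 'Class of YYYY' keys shown above and that the location is everything after the first comma of each placement string:
import pandas as pd

rows = [
    {'grad_year': year[-4:],
     'specialty': specialty,
     'hospital': placement_str,
     'location': ','.join(placement_str.split(',')[1:]).strip()}
    for year, specialties in result.items()
    for specialty, placements in specialties.items()
    for placement_str in placements
]
df = pd.DataFrame(rows)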
You have a dictionary of BS elements (bs4.element.Tag), so you don't have to parse them again. You can use find(), find_all(), etc. on them directly:
for key, value in data_dict.items():
    print(type(value), key, value.find('h4').text)
Result
<class 'bs4.element.Tag'> 2019 Anesthesiology
<class 'bs4.element.Tag'> 2018 Anesthesiology
<class 'bs4.element.Tag'> 2017 Anesthesiology
<class 'bs4.element.Tag'> 2016 Emergency Medicine
<class 'bs4.element.Tag'> 2015 Emergency Medicine
<class 'bs4.element.Tag'> 2014 Anesthesiology
<class 'bs4.element.Tag'> 2013 Anesthesiology
<class 'bs4.element.Tag'> 2012 Emergency Medicine
<class 'bs4.element.Tag'> 2011 Emergency Medicine
<class 'bs4.element.Tag'> 2010 Dermatology
<class 'bs4.element.Tag'> 2009 Emergency Medicine
<class 'bs4.element.Tag'> 2008 Family Medicine
<class 'bs4.element.Tag'> 2007 Anesthesiology
<class 'bs4.element.Tag'> 2006 Triple Board (Pediatrics/Adult Psychiatry/Child Psychiatry)
<class 'bs4.element.Tag'> 2005 Family Medicine
<class 'bs4.element.Tag'> 2004 Anesthesiology
<class 'bs4.element.Tag'> 2003 Emergency Medicine
<class 'bs4.element.Tag'> 2002 Family Medicine
Full code:
import urllib.request
import bs4 as bs
sauce = urllib.request.urlopen('https://msih.bgu.ac.il/md-program/residency-placements/').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
headers = soup.find_all('div', class_={'accord-head'})
grad_yr_list = []
for header in headers:
    grad_yr_list.append(header.h2.text[-4:])

rez_classes = soup.find_all('div', class_={'accord-con'})
data_dict = dict(zip(grad_yr_list, rez_classes))

for key, value in data_dict.items():
    print(type(value), key, value.find('h4').text)
You can go straight to pandas once you have the soup, then parse out the necessary information:
df = pd.DataFrame(soup)
df['grad_year'] = df[0].map(lambda x: x.text[-4:])
df['specialty'] = df[1].map(lambda x: [i.text for i in x.find_all('h4')])
df['hospital'] = df[1].map(lambda x: [i.text for i in x.find_all('li')])
df['location'] = df[1].map(lambda x: [''.join(i.text.split(',')[1:]) for i in x.find_all('li')])
You will still need some pandas work after that, because the per-column lists are not aligned row by row (one <h4> can own several <li> entries).
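One way to keep specialty and hospital aligned is to skip the list columns and pair each <h4> with its following <ul> while walking the soup; a sketch, reusing the soup and pandas import from the question's code:
rows = []
for head in soup.find_all('div', class_='accord-head'):
    year = head.h2.text[-4:]
    for h4 in head.find_all('h4'):
        ul = h4.find_next_sibling('ul')  # the list that belongs to this specialty
        if ul is None:
            continue
        for li in ul.find_all('li'):
            hospital = li.text.strip()
            rows.append({'grad_year': year,
                         'specialty': h4.text,
                         'hospital': hospital,
                         'location': ','.join(hospital.split(',')[1:]).strip()})
df = pd.DataFrame(rows)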
I don't know pandas. The following code can get the data in the table. I don't know if this will meet your needs.
import requests
from simplified_scrapy.simplified_doc import SimplifiedDoc

url = 'https://msih.bgu.ac.il/md-program/residency-placements/'
response = requests.get(url)
doc = SimplifiedDoc(response.text)

divs = doc.getElementsByClass('accord-head')
datas = {}
for div in divs:
    grad_year = div.h2.text[-4:]
    rez_classe = div.getElementByClass('accord-con')
    h4s = rez_classe.h4s  # get the h4 headings
    for h4 in h4s:
        if not h4.next:
            continue
        lis = h4.next.lis
        specialty = h4.text
        hospital = [li.text for li in lis]
        # append rather than assign, so each year keeps all of its specialties
        datas.setdefault(grad_year, []).append({'specialty': specialty, 'hospital': hospital})

for data in datas:
    print(data, datas[data])
Here's my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from ..items import TutorialItem

class Tutorial1(BaseSpider):
    name = "Tut"
    allowed_domains = ['nytimes.com']
    start_urls = ["http://nytimes.com"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="span-ab-layout layout"]')
        items = []
        for site in sites:
            item = TutorialItem()
            item['title'] = map(unicode.strip, site.select('//h2[@class="story-heading"]/a/text()').extract())
            item['time'] = map(unicode.strip, site.select('//time[@class="timestamp"]/text()').extract())
            yield item
Here is my output:
author time
By PETER BAKER,By JONATHAN M. KATZ and RICHARD PÉREZ-PEÑA,By NEIL MacFARQUHAR,By RON NIXON,By RICHARD GOLDSTEIN,By LOUISE STORY and ALEJANDRA XANIC von BERTRAB,By DAVID CARR,By A.O. SCOTT,By JERÉ LONGMAN,By THE EDITORIAL BOARD,By JON BECKMANN,By C. J. HUGHES,By JOANNE KAUFMAN 10:26 AM ET,1:08 PM ET,11:57 AM ET,8:33 AM ET,10:01 AM ET,12:35 PM ET,1:47 PM ET,10:36 AM ET,10:26 AM ET,9:49 AM ET,12:05 PM ET,9:21 AM ET,12:22 PM ET,11:52 AM ET,8:59 AM ET
By PETER BAKER,By JONATHAN M. KATZ and RICHARD PÉREZ-PEÑA,By NEIL MacFARQUHAR,By RON NIXON,By RICHARD GOLDSTEIN,By LOUISE STORY and ALEJANDRA XANIC von BERTRAB,By DAVID CARR,By A.O. SCOTT,By JERÉ LONGMAN,By THE EDITORIAL BOARD,By JON BECKMANN,By C. J. HUGHES,By JOANNE KAUFMAN 10:26 AM ET,1:08 PM ET,11:57 AM ET,8:33 AM ET,10:01 AM ET,12:35 PM ET,1:47 PM ET,10:36 AM ET,10:26 AM ET,9:49 AM ET,12:05 PM ET,9:21 AM ET,12:22 PM ET,11:52 AM ET,8:59 AM ET
I added the indentation above so it is clear where the duplication happens.
My problem is that when I print my results to CSV, everything always comes out in one giant row. It also makes a duplicate column for some reason. Can anyone help me with this dilemma?
I was able to fix it by experimenting with:
hxs = HtmlXPathSelector(response)
Apparently, there is a huge difference between Selector and HtmlXPathSelector.
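For what it's worth, the duplication in the output above is the classic symptom of using an absolute //... XPath inside the per-site loop: an absolute path searches the whole document again on every iteration. Making the inner paths relative (note the leading dot) is the usual fix; a hedged sketch against the question's spider:
for site in sites:
    item = TutorialItem()
    # './/' scopes the query to this site node instead of the whole page
    item['title'] = map(unicode.strip, site.xpath('.//h2[@class="story-heading"]/a/text()').extract())
    item['time'] = map(unicode.strip, site.xpath('.//time[@class="timestamp"]/text()').extract())
    yield item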
Say I was given a web page, e.g. this one: how could I copy the text starting from <root response="True"> and ending at </root>? How could I do this in Python?
import xml.etree.ElementTree as et
import requests

URL = "http://www.omdbapi.com/?t=True%20Grit&r=XML"

def main():
    pg = requests.get(URL).content
    root = et.fromstring(pg)
    for attr, value in root[0].items():
        print("{:>10}: {}".format(attr, value))

if __name__ == "__main__":
    main()
results in
poster: http://ia.media-imdb.com/images/M/MV5BMjIxNjAzODQ0N15BMl5BanBnXkFtZTcwODY2MjMyNA##._V1_SX300.jpg
metascore: 80
director: Ethan Coen, Joel Coen
released: 22 Dec 2010
awards: Nominated for 10 Oscars. Another 30 wins & 85 nominations.
year: 2010
genre: Adventure, Drama, Western
imdbVotes: 184,711
plot: A tough U.S. Marshal helps a stubborn young woman track down her father's murderer.
rated: PG-13
language: English
title: True Grit
country: USA
writer: Joel Coen (screenplay), Ethan Coen (screenplay), Charles Portis (novel)
actors: Jeff Bridges, Hailee Steinfeld, Matt Damon, Josh Brolin
imdbID: tt1403865
runtime: 110 min
type: movie
imdbRating: 7.7
I would use requests and BeautifulSoup for this:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.omdbapi.com/?t=True%20Grit&r=XML')
>>> soup = BeautifulSoup(r.text)
>>> list(soup('root')[0].children)
[<movie actors="Jeff Bridges, Hailee Steinfeld, Matt Damon, Josh Brolin" awards="Nominated for 10 Oscars. Another 30 wins & 85 nominations." country="USA" director="Ethan Coen, Joel Coen" genre="Adventure, Drama, Western" imdbid="tt1403865" imdbrating="7.7" imdbvotes="184,711" language="English" metascore="80" plot="A tough U.S. Marshal helps a stubborn young woman track down her father's murderer." poster="http://ia.media-imdb.com/images/M/MV5BMjIxNjAzODQ0N15BMl5BanBnXkFtZTcwODY2MjMyNA##._V1_SX300.jpg" rated="PG-13" released="22 Dec 2010" runtime="110 min" title="True Grit" type="movie" writer="Joel Coen (screenplay), Ethan Coen (screenplay), Charles Portis (novel)" year="2010"></movie>]
Download the document with urllib2: http://docs.python.org/2/howto/urllib2.html
A good parser for short, simple, well-formed XML like this is Minidom. Here is how to parse: http://docs.python.org/2/library/xml.dom.minidom.html
Then get the text, e.g.: Getting text between xml tags with minidom
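A minimal minidom sketch along those lines (my addition, using Python 3's urllib.request in place of the linked urllib2 docs):
from urllib.request import urlopen
from xml.dom import minidom

# parse the raw XML response, then read attributes off the <movie> element
doc = minidom.parseString(urlopen('http://www.omdbapi.com/?t=True%20Grit&r=XML').read())
movie = doc.getElementsByTagName('movie')[0]
print(movie.getAttribute('title'))  # -> True Grit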