Scrapy CSV outputting on multiple lines - python

Here's my spider:
from scrapy.spider import BaseSpider
from scrapy.selector import Selector
from ..items import TutorialItem

class Tutorial1(BaseSpider):
    name = "Tut"
    allowed_domains = ['nytimes.com']
    start_urls = ["http://nytimes.com",]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//div[@class="span-ab-layout layout"]')
        items = []
        for site in sites:
            item = TutorialItem()
            item['title'] = map(unicode.strip, site.select('//h2[@class="story-heading"]/a/text()').extract())
            item['time'] = map(unicode.strip, site.select('//time[@class="timestamp"]/text()').extract())
            yield item
Here is my output:
author time
By PETER BAKER,By JONATHAN M. KATZ and RICHARD PÉREZ-PEÑA,By NEIL MacFARQUHAR,By RON NIXON,By RICHARD GOLDSTEIN,By LOUISE STORY and ALEJANDRA XANIC von BERTRAB,By DAVID CARR,By A.O. SCOTT,By JERÉ LONGMAN,By THE EDITORIAL BOARD,By JON BECKMANN,By C. J. HUGHES,By JOANNE KAUFMAN 10:26 AM ET,1:08 PM ET,11:57 AM ET,8:33 AM ET,10:01 AM ET,12:35 PM ET,1:47 PM ET,10:36 AM ET,10:26 AM ET,9:49 AM ET,12:05 PM ET,9:21 AM ET,12:22 PM ET,11:52 AM ET,8:59 AM ET
By PETER BAKER,By JONATHAN M. KATZ and RICHARD PÉREZ-PEÑA,By NEIL MacFARQUHAR,By RON NIXON,By RICHARD GOLDSTEIN,By LOUISE STORY and ALEJANDRA XANIC von BERTRAB,By DAVID CARR,By A.O. SCOTT,By JERÉ LONGMAN,By THE EDITORIAL BOARD,By JON BECKMANN,By C. J. HUGHES,By JOANNE KAUFMAN 10:26 AM ET,1:08 PM ET,11:57 AM ET,8:33 AM ET,10:01 AM ET,12:35 PM ET,1:47 PM ET,10:36 AM ET,10:26 AM ET,9:49 AM ET,12:05 PM ET,9:21 AM ET,12:22 PM ET,11:52 AM ET,8:59 AM ET
I added the indentation so it is clear where the output duplicates.
My problem is that when I export my work to CSV, it always comes out in one giant row. It also makes a duplicate column for some reason. Can anyone help me with this?

I was able to find the solution by experimenting with:
hxs = HtmlXPathSelector(response)
Apparently, there is a big difference between Selector and HtmlXPathSelector.
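A separate way to look at the "one giant row" symptom: each item field in the spider holds a whole list, so the CSV export writes that list into a single cell. The sketch below is not Scrapy code; it uses only the standard csv module and a small made-up item (a few values copied from the output above, paired purely for illustration) to show how zipping the parallel lists gives one row per story. The names rows_from_item and to_csv are hypothetical helpers.

```python
import csv
import io

# Hypothetical item shaped like the question's output: every field is a whole list.
item = {
    "author": ["By PETER BAKER", "By RON NIXON", "By DAVID CARR"],
    "time": ["10:26 AM ET", "1:08 PM ET", "11:57 AM ET"],
}

def rows_from_item(item):
    """Pair the parallel lists so each story becomes its own (author, time) row."""
    return list(zip(item["author"], item["time"]))

def to_csv(rows, header=("author", "time")):
    """Write the header and the rows to an in-memory CSV and return the text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()

rows = rows_from_item(item)
csv_text = to_csv(rows)
print(csv_text)
```

Note that zip stops at the shorter list, so if the two selectors match a different number of elements (as the author and time counts in the output above suggest), some values are silently dropped. Selecting both fields relative to each story element avoids that mismatch.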

Related

How to clean duplicate data from webscraping?

I want to build a list of books from a bookstore with a web scraper. I need the title and author of each book. I can get the title without trouble; the problem is the author. The element holding the author has the same class as the elements holding other data. If I run the script, it duplicates the data and also returns data I don't need (book code, publisher, year, etc.).
Here is the script
import requests
from bs4 import BeautifulSoup

headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}
response = requests.get('https://www.apollo.ee/raamatud/eestikeelsed-raamatud/ilukirjandus/ulme-ja-oudus?page=7&mode=list',headers=headers)
webpage = response.content
soup = BeautifulSoup(webpage,'html.parser')
for parent in soup.find_all('div',class_='product-info'):
    for n,tag in enumerate(parent.find_all('div')):
        title = [x for x in tag.find_all('a',class_='block no-underline product-link')]
        author = [x for x in tag.find_all('span',class_='info__text block weight-700 mt8 mb16')]
        for item in title:
            print('TITLE:',item.text.strip())
        for item in author:
            author = item.text.split('\n')
            print('AUTHOR: ',item.text.strip())
Example result:
TITLE: Parv II
AUTHOR: Frank Schätzing
AUTHOR: 9789985317105
AUTHOR: Varrak
AUTHOR: Kõvakaaneline
AUTHOR: 2010
AUTHOR: Frank Schätzing
AUTHOR: 9789985317105
AUTHOR: Varrak
AUTHOR: Kõvakaaneline
AUTHOR: 2010
AUTHOR: Frank Schätzing
AUTHOR: 9789985317105
AUTHOR: Varrak
As you can see the data for the author duplicates and I get data that I don't need(publisher, year, code etc). Now I know that all these different data have the same class. My question would be: Is there a way to get only data for the author? Or how to clean it?
Thank you
You can use CSS selectors to properly select the Title/Author:
import requests
from bs4 import BeautifulSoup

url = "https://www.apollo.ee/raamatud/eestikeelsed-raamatud/ilukirjandus/ulme-ja-oudus?page=7&mode=list"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for book in soup.select(".product"):
    title = book.select_one(".product-title").text.strip()
    author = book.select_one(
        ".info__title:-soup-contains(Autor) + span"
    ).text.strip()
    print("{:<50} {}".format(title, author))
Prints:
Eric Terry Pratchett
Lõputuse piirid Lois McMaster Bujold
Mõõkade maru. II raamat George R. R. Martin
Euromant Maniakkide Tänav
Joosta oma varju eest. Täheaeg 9 Triinu Meres, Kadri Pettai, Maniakkide tänav, Tea Roosvald, Ülle Lätte, Jeff Vandermeer
Ilus pimedus Kami Garcia, Margaret Stohl
Bal-Sagothi jumalad Robert E. Howard
Tüdruk veiniplekiga kleidis Eda Kalmre
Talvesepp Terry Pratchett
Võluv võrdsus Terry Pratchett
Nekromanteion Arvi Nikkarev
Vere magus lõhn Suzanne McLeod
Tantsud armastuse lõppu Liis Koger
Kõrbeoda Peter V. Brett
Palimpsest Charles Stross
Tigana Guy Gavriel Kay
Kardinali mõõgad Pierre Pevel
Ajaratsurid: Kättemaksja. P.C.Cast
Hullumeelsuse mägedes H. P. Lovecraft
Vori mäng Lois McMaster Bujold
Ajalooja Milvi Martina Piir
Punane Sonja Robert E. Howard
Sõda kosmose rannavetes Raul Sulbi
Katastroof Krystyna Kuhn
Robotid ja impeerium Isaac Asimov
Ajaratsurid: Tapja Cindy Dees
Hundipäikese aeg III Tamur Kusnets
Gort Ashryn III osa. Rahu Leo Kunnas
Surnud, kuni jõuab öö Charlaine Harris
Lase sisse see õige John Ajvide Lindqvist
Mäng Krystyna Kuhn
Järgmiseks Crichton, Michael
Kõigi hingede öö Jennifer Armintrout
Kaose märk Roger Zelazny
Allakäik Guillermo del Toro, Chuck Hogan
Külmavõetud Richelle Mead
Nähtamatud akadeemikud Terry Pratchett
Varjus hiilija Aleksei Pehhov
Ajaratsurid: Otsija Lindsay McKenna
Maalingutega mees Peter V. Brett
Viimane sõrmusekandja Kirill Jeskov
Falkenbergi leegion Jerry Pournelle
Sõduri õpilane Lois McMaster Bujold
Vampiiri kättemaks Alexis Morgan
Kurjuse küüsis Jennifer Armintrout
Tiiger! Tiiger! Alfred Bester
Tume torn Stephen King
Koidu robotid Isaac Asimov
Nõid Michael Scott
Päevane vahtkond Sergei Lukjanenko
Perpendikulaarne maailm Kir Bulöchov
Ümbersünd Jennifer Armintrout
Kadunud oaas Paul Sussman
Täheaeg 7. Ingel ja kvantkristall Raul Sulbi
Leek sügaviku kohal Vernor Vinge
Sümfoonia katkenud keelele Siim Veskimees
Parv II Frank Schätzing
Kuldne linn John Twelve Hawks
Hõbekeel Charlie Fletcher
Rahategu Terry Pratchett
Amberi veri Roger Zelazny
Kuninga tagasitulek J.R.R. Tolkien
Tõbi Guillermo del Toro, Chuck Hogan
Võõrana võõral maal Robert A. Heinlein
Elu, universum ja kõik Douglas Adams
Frenchi ja Koulu reisid Indrek Hargla
13. sõdalane. Laibaõgijad Michael Crichton
Accelerando Charles Stross
Universumi lõpu restoran Douglas Adams
Surmakarva Maniakkide Tänav
Kauge maa laulud Arthur C. Clarke
Pöidlaküüdi reisijuht galaktikas Douglas Adams
Alasti päike Isaac Asimov
Barrayar Loius McMaster Bujold
Fevre´i unelm George R.R. Martin
Hukatuse kaardid Roger Zelazny
Loomine Bernard Beckett
Pika talve algus. Täheaeg 6 Raul Sulbi
Troonide mäng. II raamat George R. R. Martin
Malakas Terry Pratchett
Kuninglik salamõrtsukas Robin Hobb
Algolagnia Marion Andra
Tormikuninganna Marion Zimmer Bradley
Munk maailma äärel Skarabeus
Alaizabel Cray painaja Chris Wooding
Anubise väravad Tim Powers
Lisey lugu Stephen King
Postiteenistus Terry Pratchett
Linnusild Barry Hughart
Raudkäsi Charlie Fletcher
Südame metsad Charles de Lint
Meie, kromanjoonlased Tiit Tarlap
Calla hundid Stephen King
Ruunimärgid Joanne Harris
Tuule nimi. Kuningatapja kroonika I osa 2. raamat Patrick Rothfuss
Windhaven George R. R. Martin, Lisa Tuttle
Taevane tuli Anne Robillard
Vampiiri käealune Darren Shan
Teemärgid Roger Zelazny
Kinnimüüritud tuba Ilo
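The question also asks how to clean data that has already been scraped with repeats. As a minimal sketch (not part of the answer above), an order-preserving de-duplication can be done with a plain dict; the helper name dedupe_keep_order is hypothetical, and the sample list copies the repeated values from the example result in the question.

```python
def dedupe_keep_order(values):
    """Drop repeated values while keeping the first occurrence of each, in order."""
    # dict keys are unique and remember insertion order (Python 3.7+).
    return list(dict.fromkeys(values))

# Values as printed in the question's example result, repeats included.
scraped = [
    "Frank Schätzing", "9789985317105", "Varrak", "Kõvakaaneline", "2010",
    "Frank Schätzing", "9789985317105", "Varrak", "Kõvakaaneline", "2010",
    "Frank Schätzing", "9789985317105", "Varrak",
]

cleaned = dedupe_keep_order(scraped)
print(cleaned)
```

This only removes the repeats. It does not tell the author apart from the book code or publisher, so a selector that targets the labelled field, as in the answer above, is still needed for that part.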

Webscraping past a show more button that extends the page

I'm trying to scrape data from Elle.com under a search term. I noticed when I click the button, it sends a request that updates the &page=2 in the url. However, the following code just gets me a lot of duplicate entries. I need help finding a way to set a start point for each iteration of the loop (I think). Any ideas?
import requests,nltk,pandas as pd
from bs4 import BeautifulSoup as bs

def get_hits(url):
    r = requests.get(url)
    soup = bs(r.content, 'html')
    body = []
    for p in soup.find_all('p',{'class':'body-text'}):
        sentences = nltk.sent_tokenize(p.text)
        result1 = [s for s in sentences if 'kim' in s]
        body.append(result1)
        result2 = [s for s in sentences if 'kanye' in s]
        body.append(result2)
    body = [a for a in body if a!=[]]
    if body == []:
        body.append("no hits")
    return body

titles =[]
key_hits = []
urls = []
counter = 1
for i in range(1,10):
    url = f'https://www.elle.com/search/?page={i}&q=kanye'
    r = requests.get(url)
    soup = bs(r.content, 'html')
    groups = soup.find_all('div',{'class':'simple-item grid-simple-item'})
    for j in range(len(groups)):
        urls.append('https://www.elle.com'+ groups[j].find('a')['href'])
        titles.append(groups[j].find('div',{'class':'simple-item-title item-title'}).text)
        key_hits.append(get_hits('https://www.elle.com'+ groups[j].find('a')['href']))
        if (counter == 100):
            break
        counter+=1
data = pd.DataFrame({
    'Title':titles,
    'Body':key_hits,
    'Links':urls
})
data.head()
Let me know if there's something I don't understand that I probably should. Just a marketing researcher trying to learn powerful tools here.
To get pagination working on the site, you can use their infinite-scroll API URL (this example will print 9*42 titles):
import requests
from bs4 import BeautifulSoup

api_url = "https://www.elle.com/ajax/infiniteload/"
params = {
    "id": "search",
    "class": "CoreModels\\search\\TagQueryModel",
    "viewset": "search",
    "trackingId": "search-results",
    "trackingLabel": "kanye",
    "params": '{"input":"kanye","page_size":"42"}',
    "page": "1",
    "cachebuster": "undefined",
}
all_titles = set()
for page in range(1, 10):
    params["page"] = page
    soup = BeautifulSoup(
        requests.get(api_url, params=params).content, "html.parser"
    )
    for title in soup.select(".item-title"):
        print(title.text)
        all_titles.add(title.text)
print()
print("Unique titles:", len(all_titles)) # <-- 9 * 42 = 378
Prints:
...
Kim Kardashian and Kanye West Respond to Those Divorce Rumors
People Are Noticing Something Fishy About Taylor Swift's Response to Kim Kardashian
Kim Kardashian Just Went on an Intense Twitter Rant Defending Kanye West
Trump Is Finally Able to Secure a Meeting With a Kim
Kim Kardashian West is Modeling Yeezy on the Street Again
Aziz Ansari's Willing to Model Kanye's Clothes
Unique titles: 378
The "load more" pagination is driven by API calls that return plain HTML. Each link in that response is a relative URL, so I convert it to an absolute URL with the urljoin method, and I handle the pagination by building the list api_urls.
Code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

api_urls = ["https://www.elle.com/ajax/infiniteload/?id=search&class=CoreModels%5Csearch%5CTagQueryModel&viewset=search&trackingId=search-results&trackingLabel=kanye&params=%7B%22input%22%3A%22kanye%22%2C%22page_size%22%3A%2242%22%7D&page="+str(x)+"&cachebuster=undefined" for x in range(1,4)]
# The base must be the site being scraped, otherwise urljoin builds links to the wrong host.
Base_url = "https://www.elle.com"
for url in api_urls:
    req = requests.get(url)
    soup = BeautifulSoup(req.content,"lxml")
    cards = soup.select("div.simple-item.grid-simple-item")
    for card in cards:
        title = card.select_one("div.simple-item-title.item-title")
        p = card.select_one("a")
        l=p['href']
        abs_link=urljoin(Base_url,l)
        print("Title:" + title.text + " Links: " + abs_link)
    print("-" * 80)
Output:
Title:Inside Kim Kardashian and Kanye West’s Current Relationship Amid Dinner Sighting Links: https://www.elle.com/culture/celebrities/a37833256/kim-kardashian-kanye-west-reconciled/
Title:Kim Kardashian And Ex Kanye West Left For SNL Together Amid Reports of Reconciliation Efforts Links: https://www.elle.com/culture/celebrities/a37919434/kim-kardashian-kanye-west-leave-for-snl-together-reconciliation/
Title:Kim Kardashian Wore a Purple Catsuit for Dinner With Kanye West Amid Reports She's Open to Reconciling Links: https://www.elle.com/culture/celebrities/a37822625/kim-kardashian-kanye-west-nobu-dinner-september-2021/
Title:How Kim Kardashian Really Feels About Kanye West Saying He ‘Wants Her Back’ Now Links: https://www.elle.com/culture/celebrities/a37463258/kim-kardashian-kanye-west-reconciliation-feelings-september-2021/
Title:Why Irina Shayk and Kanye West Called Off Their Two-Month Romance Links: https://www.elle.com/culture/celebrities/a37366860/why-irina-shayk-kanye-west-broke-up-august-2021/
Title:Kim Kardashian and Kanye West Reportedly Are ‘Working on Rebuilding’ Relationship and May Call Off Divorce Links: https://www.elle.com/culture/celebrities/a37421190/kim-kardashian-kanye-west-repairing-relationship-divorce-august-2021/
Title:What Kim Kardashian and Kanye West's ‘Donda’ Wedding Moment Really Means for Their Relationship Links: https://www.elle.com/culture/celebrities/a37415557/kim-kardashian-kanye-west-donda-wedding-moment-explained/
Title:What Kim Kardashian and Kanye West's Relationship Is Like Now: ‘The Tension Has Subsided’ Links: https://www.elle.com/culture/celebrities/a37383301/kim-kardashian-kanye-west-relationship-details-august-2021/
Title:How Kim Kardashian and Kanye West’s Relationship as Co-Parents Has Evolved Links: https://www.elle.com/culture/celebrities/a37250155/kim-kardashian-kanye-west-co-parents/
Title:Kim Kardashian Went Out in a Giant Shaggy Coat and a Black Wrap Top for Dinner in NYC Links: https://www.elle.com/culture/celebrities/a37882897/kim-kardashian-shaggy-coat-black-outfit-nyc-dinner/
Title:Kim Kardashian Wore Two Insane, Winter-Ready Outfits in One Warm NYC Day Links: https://www.elle.com/culture/celebrities/a37906750/kim-kardashian-overdressed-fall-outfits-october-2021/
Title:Kim Kardashian Dressed Like a Superhero for Justin Bieber's 2021 Met Gala After Party Links: https://www.elle.com/culture/celebrities/a37593656/kim-kardashian-superhero-outfit-met-gala-after-party-2021/
Title:Kim Kardashian Killed It In Her Debut as a Saturday Night Live Host Links: https://www.elle.com/culture/celebrities/a37918950/kim-kardashian-saturday-night-live-best-sketches/
Title:Kim Kardashian Has Been Working ‘20 Hours a Day’ For Her Appearance On SNL Links: https://www.elle.com/culture/celebrities/a37915962/kim-kardashian-saturday-night-live-preperation/
Title:Why Taylor Swift and Joe Alwyn Skipped the 2021 Met Gala Links: https://www.elle.com/culture/celebrities/a37446411/why-taylor-swift-joe-alwyn-skipped-met-gala-2021/
Title:Kim Kardashian Says North West Still Wants to Be an Only Child Five Years Into Having Siblings Links: https://www.elle.com/culture/celebrities/a37620539/kim-kardashian-north-west-only-child-comment-september-2021/
Title:How Kim Kardashian's Incognito 2021 Met Gala Glam Came Together Links: https://www.elle.com/beauty/makeup-skin-care/a37584576/kim-kardashians-incognito-2021-met-gala-beauty-breakdown/
Title:Kim Kardashian Completely Covered Her Face and Everything in a Black Balenciaga Look at the 2021 Met Gala Links: https://www.elle.com/culture/celebrities/a37578520/kim-kardashian-faceless-outfit-met-gala-2021/
Title:How Kim Kardashian Feels About Kanye West Singing About Their Divorce and ‘Losing My Family’ on Donda Album Links: https://www.elle.com/culture/celebrities/a37113130/kim-kardashian-kanye-west-divorce-song-donda-album-feelings/
Title:Kanye West Teases New Song In Beats By Dre Commercial Starring Sha'Carri Richardson Links: https://www.elle.com/culture/celebrities/a37090223/kanye-west-teases-new-song-in-beats-by-dre-commercial-starring-shacarri-richardson/
Title:Inside Kim Kardashian and Kanye West's Relationship Amid His Irina Shayk Romance Links: https://www.elle.com/culture/celebrities/a37077662/kim-kardashian-kanye-west-relationship-irina-shayk-romance-july-2021/
... and so on
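Both answers depend on two URL operations that can be checked without any network access: encoding the query parameters into the long API URL, and resolving a relative link against a base. The sketch below uses only urllib.parse; the helper build_page_url and the constants are hypothetical names, and the relative path is taken from the output above. It assumes the base should be the site being scraped (www.elle.com), since urljoin keeps whatever host the base URL names.

```python
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.elle.com"  # assumption: the host the relative links belong to
API_URL = "https://www.elle.com/ajax/infiniteload/"

def build_page_url(page, term="kanye", page_size=42):
    """Build the infinite-scroll URL for one page by percent-encoding the parameters."""
    params = {
        "id": "search",
        "class": "CoreModels\\search\\TagQueryModel",
        "viewset": "search",
        "trackingId": "search-results",
        "trackingLabel": term,
        "params": '{"input":"%s","page_size":"%d"}' % (term, page_size),
        "page": page,
        "cachebuster": "undefined",
    }
    return API_URL + "?" + urlencode(params)

page_url = build_page_url(2)
print(page_url)

# A relative href as returned in the API's HTML, resolved against the base.
relative_link = "/culture/celebrities/a37833256/kim-kardashian-kanye-west-reconciled/"
absolute_link = urljoin(BASE_URL, relative_link)
print(absolute_link)
```

Passing a params dict to requests.get, as the first answer does, performs the same encoding internally, so hand-building the string is only needed when the full URL has to be stored or logged.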

How can I webscrape a Website for the Winners

Hi I am trying to scrape this website with Python 3 and noticed that in the source code it does not give a clear indication of how I would scrape the names of the winners in these primary elections. Can you show me how to scrape a list of all the winners in every MD primary election with this website?
https://elections2018.news.baltimoresun.com/results/
The parsing is a little complicated because the results are spread over many subpages. This script collects them and prints the result (all data is stored in the variable data):
from bs4 import BeautifulSoup
import requests

url = "https://elections2018.news.baltimoresun.com/results/"
r = requests.get(url)
data = {}
soup = BeautifulSoup(r.text, 'lxml')
for race in soup.select('div[id^=race]'):
    r = requests.get(f"https://elections2018.news.baltimoresun.com/results/contests/{race['id'].split('-')[1]}.html")
    s = BeautifulSoup(r.text, 'lxml')
    l = []
    data[(s.find('h3').text, s.find('div', {'class': 'party-header'}).text)] = l
    for candidate, votes, percent in zip(s.select('td.candidate'), s.select('td.votes'), s.select('td.percent')):
        l.append((candidate.text, votes.text, percent.text))
print('Winners:')
for (race, party), v in data.items():
    print(race, party, v[0])
# print(data)
Outputs:
Winners:
Governor / Lt. Governor Democrat ('Ben Jealous and Susan Turnbull', '227,764', '39.6%')
U.S. Senator Republican ('Tony Campbell', '50,915', '29.2%')
U.S. Senator Democrat ('Ben Cardin', '468,909', '80.4%')
State's Attorney Democrat ('Marilyn J. Mosby', '39,519', '49.4%')
County Executive Democrat ('John "Johnny O" Olszewski, Jr.', '27,270', '32.9%')
County Executive Republican ('Al Redmer, Jr.', '17,772', '55.7%')
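The script above takes v[0] as the winner, which relies on each subpage listing the leading candidate first. If that ordering cannot be trusted, a sketch like the following picks the candidate with the most votes instead. It works on tuples shaped like the ones stored in data; parse_votes and pick_winner are hypothetical helper names, and the second and third rows in the sample are invented placeholders, not real results.

```python
def parse_votes(text):
    """Turn a vote count such as '50,915' into the integer 50915."""
    return int(text.replace(",", ""))

def pick_winner(candidates):
    """Return the (candidate, votes, percent) tuple with the highest vote count."""
    return max(candidates, key=lambda row: parse_votes(row[1]))

# First row copied from the output above; the other two are made-up placeholders.
sample_race = [
    ("Placeholder Candidate A", "1,200", "0.7%"),
    ("Tony Campbell", "50,915", "29.2%"),
    ("Placeholder Candidate B", "9,999", "5.7%"),
]

winner = pick_winner(sample_race)
print(winner[0])
```

Comparing the raw strings would not work here, because '9,999' sorts after '50,915' as text; converting to int first gives the numeric comparison.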

How to scrape data from imdb business page?

I am working on a project that requires data from the IMDb business page. I'm using Python. The data is stored between two tags like this:
Budget
$220,000,000 (estimated)
I want the numeric amount but have not been successful so far. Any suggestions?
Take a look at Beautiful Soup; it's a useful library for scraping. If you look at the page source, "Budget" is inside an h4 element, and the value is the next node in the DOM. This may not be the best example, but it works for your case:
from urllib.request import urlopen
from bs4 import BeautifulSoup

page = urlopen('http://www.imdb.com/title/tt0118715/?ref_=fn_al_nm_1a')
soup = BeautifulSoup(page.read(), 'html.parser')
for h4 in soup.find_all('h4'):
    # Match the heading whose text is "Budget:" and print the node that follows it.
    if "Budget:" in h4:
        print(h4.next_sibling.strip())
# $15,000,000
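The question asks for the numeric amount, while the snippet above prints the value as text. A minimal sketch of that last step, using only the standard re module: the helper name parse_amount is hypothetical, and it assumes the amount is written with a dollar sign and comma thousands separators, as in the two examples in this thread.

```python
import re

def parse_amount(text):
    """Extract a dollar amount like '$220,000,000 (estimated)' as an int, or None."""
    match = re.search(r"\$\s*(\d[\d,]*)", text)
    if match is None:
        return None
    # Remove the thousands separators before converting.
    return int(match.group(1).replace(",", ""))

budget = parse_amount("$220,000,000 (estimated)")
print(budget)
```

Other currencies or formats on the page would need their own pattern; this one deliberately returns None rather than guessing when no dollar amount is found.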
Here is a fuller script (you may find what you need in it).
The Python script below gives you (1) the list of top box office movies from IMDb and (2) the cast list for each of them.
from lxml.html import parse

def imdb_bo(no_of_movies=5):
    bo_url = 'http://www.imdb.com/chart/'
    bo_page = parse(bo_url).getroot()
    bo_table = bo_page.cssselect('table.chart')
    bo_total = len(bo_table[0][2])
    # Never ask for more movies than the chart contains.
    if no_of_movies <= bo_total:
        count = no_of_movies
    else:
        count = bo_total
    movies = {}
    for i in range(0, count):
        mo = {}
        mo['url'] = 'http://www.imdb.com'+bo_page.cssselect('td.titleColumn')[i][0].get('href')
        mo['title'] = bo_page.cssselect('td.titleColumn')[i][0].text_content().strip()
        mo['year'] = bo_page.cssselect('td.titleColumn')[i][1].text_content().strip(" ()")
        mo['weekend'] = bo_page.cssselect('td.ratingColumn')[i*2].text_content().strip()
        mo['gross'] = bo_page.cssselect('td.ratingColumn')[(i*2)+1][0].text_content().strip()
        mo['weeks'] = bo_page.cssselect('td.weeksColumn')[i].text_content().strip()
        m_page = parse(mo['url']).getroot()
        m_casttable = m_page.cssselect('table.cast_list')
        flag = 0
        mo['cast'] = []
        for cast in m_casttable[0]:
            # Skip the first row of the cast table (the header).
            if flag == 0:
                flag = 1
            else:
                m_starname = cast[1][0][0].text_content().strip()
                mo['cast'].append(m_starname)
        movies[i] = mo
    return movies

if __name__ == '__main__':
    no_of_movies = input("Enter no. of Box office movies to display:")
    bo_movies = imdb_bo(int(no_of_movies))
    for k,v in bo_movies.items():
        print('#'+str(k+1)+' '+v['title']+' ('+v['year']+')')
        print('URL: '+v['url'])
        print('Weekend: '+v['weekend'])
        print('Gross: '+v['gross'])
        print('Weeks: '+v['weeks'])
        print('Cast: '+', '.join(v['cast']))
        print('\n')
Output (run in terminal):
parag@parag-innovate:~/python$ python imdb_bo_scraper.py
Enter no. of Box office movies to display:3
#1 Cinderella (2015)
URL: http://www.imdb.com/title/tt1661199?ref_=cht_bo_1
Weekend: $67.88M
Gross: $67.88M
Weeks: 1
Cast: Cate Blanchett, Lily James, Richard Madden, Helena Bonham Carter, Nonso Anozie, Stellan Skarsgård, Sophie McShera, Holliday Grainger, Derek Jacobi, Ben Chaplin, Hayley Atwell, Rob Brydon, Jana Perez, Alex Macqueen, Tom Edden
#2 Run All Night (2015)
URL: http://www.imdb.com/title/tt2199571?ref_=cht_bo_2
Weekend: $11.01M
Gross: $11.01M
Weeks: 1
Cast: Liam Neeson, Ed Harris, Joel Kinnaman, Boyd Holbrook, Bruce McGill, Genesis Rodriguez, Vincent D'Onofrio, Lois Smith, Common, Beau Knapp, Patricia Kalember, Daniel Stewart Sherman, James Martinez, Radivoje Bukvic, Tony Naumovski
#3 Kingsman: The Secret Service (2014)
URL: http://www.imdb.com/title/tt2802144?ref_=cht_bo_3
Weekend: $6.21M
Gross: $107.39M
Weeks: 5
Cast: Adrian Quinton, Colin Firth, Mark Strong, Jonno Davies, Jack Davenport, Alex Nikolov, Samantha Womack, Mark Hamill, Velibor Topic, Sofia Boutella, Samuel L. Jackson, Michael Caine, Taron Egerton, Geoff Bell, Jordan Long
Well you asked for python and you asked for a scraping solution.
But there is no need for python and no need to scrape anything because the budget figures are available in the business.list text file available at http://www.imdb.com/interfaces
Try IMDbPY and its documentation. To install, just pip install imdbpy
from imdb import IMDb
ia = IMDb()
movie = ia.search_movie('The Untouchables')[0]
ia.update(movie)
#Lots of info for the movie from IMDB
movie.keys()
Though I'm not sure where to find specifically budget info

Copy text from a specific point on a web page in Python

Say I was given a web page, e.g. this one. How could I copy the text starting from <root response="True"> and ending at </root>?
How could I do this in Python?
import xml.etree.ElementTree as et
import requests

URL = "http://www.omdbapi.com/?t=True%20Grit&r=XML"

def main():
    pg = requests.get(URL).content
    root = et.fromstring(pg)
    for attr,value in root[0].items():
        print("{:>10}: {}".format(attr, value))

if __name__=="__main__":
    main()
results in
poster: http://ia.media-imdb.com/images/M/MV5BMjIxNjAzODQ0N15BMl5BanBnXkFtZTcwODY2MjMyNA@@._V1_SX300.jpg
metascore: 80
director: Ethan Coen, Joel Coen
released: 22 Dec 2010
awards: Nominated for 10 Oscars. Another 30 wins & 85 nominations.
year: 2010
genre: Adventure, Drama, Western
imdbVotes: 184,711
plot: A tough U.S. Marshal helps a stubborn young woman track down her father's murderer.
rated: PG-13
language: English
title: True Grit
country: USA
writer: Joel Coen (screenplay), Ethan Coen (screenplay), Charles Portis (novel)
actors: Jeff Bridges, Hailee Steinfeld, Matt Damon, Josh Brolin
imdbID: tt1403865
runtime: 110 min
type: movie
imdbRating: 7.7
I would use requests and BeautifulSoup for this:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://www.omdbapi.com/?t=True%20Grit&r=XML')
>>> soup = BeautifulSoup(r.text)
>>> list(soup('root')[0].children)
[<movie actors="Jeff Bridges, Hailee Steinfeld, Matt Damon, Josh Brolin" awards="Nominated for 10 Oscars. Another 30 wins & 85 nominations." country="USA" director="Ethan Coen, Joel Coen" genre="Adventure, Drama, Western" imdbid="tt1403865" imdbrating="7.7" imdbvotes="184,711" language="English" metascore="80" plot="A tough U.S. Marshal helps a stubborn young woman track down her father's murderer." poster="http://ia.media-imdb.com/images/M/MV5BMjIxNjAzODQ0N15BMl5BanBnXkFtZTcwODY2MjMyNA@@._V1_SX300.jpg" rated="PG-13" released="22 Dec 2010" runtime="110 min" title="True Grit" type="movie" writer="Joel Coen (screenplay), Ethan Coen (screenplay), Charles Portis (novel)" year="2010"></movie>]
Download the document with urllib2: http://docs.python.org/2/howto/urllib2.html
A good parser for short, simple, well formed XML like this is Minidom. Here is how to parse:
http://docs.python.org/2/library/xml.dom.minidom.html
Then get the text, e.g.: Getting text between xml tags with minidom
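The minidom suggestion above comes without code, so here is a minimal sketch of it. To stay runnable offline it parses a small inline XML string instead of downloading the page; SAMPLE_XML is a shortened, hand-written stand-in modelled on the response shown earlier in this thread, and movie_attributes is a hypothetical helper name.

```python
from xml.dom import minidom

# Shortened stand-in for the real response: a <root> element wrapping one <movie>.
SAMPLE_XML = (
    '<root response="True">'
    '<movie title="True Grit" year="2010" rated="PG-13" runtime="110 min"/>'
    '</root>'
)

def movie_attributes(xml_text):
    """Parse the XML and return the attributes of the first <movie> as a dict."""
    doc = minidom.parseString(xml_text)
    movie = doc.getElementsByTagName("movie")[0]
    return dict(movie.attributes.items())

attrs = movie_attributes(SAMPLE_XML)
print(attrs["title"], attrs["year"])

# The text from <root ...> through </root>, serialized back from the parsed tree.
root_text = minidom.parseString(SAMPLE_XML).documentElement.toxml()
print(root_text)
```

For the real page, the downloaded response body would be passed to movie_attributes in place of SAMPLE_XML.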
