The site I am trying to crawl is http://www.boxofficemojo.com/yearly/chart/?yr=2013&p=.htm. The specific page I'm focusing on now is http://www.boxofficemojo.com/movies/?id=catchingfire.htm.
From this page, I am having trouble with two things. The first is the "Foreign Gross" amount (under Total Lifetime Grosses). I got the amount with this function:
def getForeign(item_url):
    response = requests.get(item_url)
    soup = BeautifulSoup(response.content)
    print soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
The problem is, I can print this amount out to the console, but I can't append these values to a list or write them to a csv file. For the previous data I needed to get on this site, I got the individual piece of information for each movie and appended them all to one list, which I then exported to the csv file.
How can I get this "Foreign Gross" amount as a separate amount for each movie? What do I need to change?
The second problem is related to getting a list of the actors/actresses for each movie. I have this function:
def getActors(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    tempActors = []
    print soup.find(text="Actors:").find_parent("tr").text[7:]
This prints out the actors like so: Jennifer LawrenceJosh HutchersonLiam HemsworthElizabeth BanksStanley TucciWoody HarrelsonPhilip Seymour HoffmanJeffrey WrightJena MaloneAmanda PlummerSam ClaflinDonald SutherlandLenny Kravitz
I am also having the same problem as with the foreign gross amount. I want to get each individual actor separately, append them all to a temporary list, and then later append that list to another full list for all the movies. I did this with the list of directors, but since all the directors are links while not all of the actors/actresses have HTML links, I can't do the same. Another issue is that there is no space between the actors' names.
Why are my current functions not working, and how can I fix them?
More code:
def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.select('td > b > font > a[href^=/movies/?]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            details(href)
            listOfForeign.append(getForeign(href))
            listOfDirectors.append(getDirectors(href))
            str(listOfDirectors).replace('[','').replace(']','')
            getActors(href)
            title = link.string
            listOfTitles.append(title)
        page += 1
listOfForeign = []

def getForeign(item_url):
    s = urlopen(item_url).read()
    soup = BeautifulSoup(s)
    return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.select('td > b > font > a[href^=/movies/?]'):
            href = 'http://www.boxofficemojo.com' + link.get('href')
            listOfForeign.append(getForeign(href))
        page += 1
    print listOfForeign
This raises:
Traceback (most recent call last):
  File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 75, in <module>
    spider(1)
  File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 29, in spider
    listOfForeign.append(getForeign(href))
  File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 73, in getForeign
    return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
AttributeError: 'NoneType' object has no attribute 'find_parent'
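The traceback suggests two fixable issues: the functions print instead of returning (so there is nothing to append), and `soup.find(text="Foreign:")` returns `None` on movie pages that have no "Foreign:" row, which crashes the chained `find_parent` call. Below is a sketch of both fixes against an inline HTML snippet; the table layout is an assumption modeled on the Box Office Mojo pages above, not verified against the live site:

```python
from bs4 import BeautifulSoup

# Inline stand-in for a movie page; the real layout is assumed to
# resemble http://www.boxofficemojo.com/movies/?id=catchingfire.htm.
HTML = """
<table>
  <tr><td>Foreign:</td><td>$539,286,916</td></tr>
  <tr><td>Actors:</td><td><a>Jennifer Lawrence</a><br><a>Josh Hutcherson</a></td></tr>
</table>
"""

def get_foreign(soup):
    # Return the amount (or None) instead of printing, so the caller
    # can append it to a list; guard against pages with no Foreign row.
    label = soup.find(text="Foreign:")
    if label is None:
        return None  # e.g. a movie with no reported foreign gross
    return label.find_parent("td").find_next_sibling("td").get_text(strip=True)

def get_actors(soup):
    # Each actor sits in its own tag, so stripped_strings yields them
    # one by one -- no run-together names, and no dependence on links.
    label = soup.find(text="Actors:")
    if label is None:
        return []
    row = label.find_parent("tr")
    return [s for s in row.stripped_strings if s != "Actors:"]

soup = BeautifulSoup(HTML, "html.parser")
print(get_foreign(soup))   # $539,286,916
print(get_actors(soup))    # ['Jennifer Lawrence', 'Josh Hutcherson']
```

With return values in place, `listOfForeign.append(getForeign(href))` in the spider collects one entry per movie (possibly `None`), and the actor list can be appended to a per-movie list the same way as the directors.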
Related
I am trying to create a function that scrapes college baseball team roster pages for a project. I have created a function that crawls the roster page and gets a list of the links I want to scrape. But when I try to scrape the individual links for each player, it runs but cannot find the data that is on their pages.
This is the link to the page I am crawling from at the start:
https://gvsulakers.com/sports/baseball/roster
These are just functions that I call within the function that I am having a problem with:
def parse_row(rows):
    return [str(x.string) for x in rows.find_all('td')]

def scrape(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return(soop)

def find_data(url):
    page = requests.get(url, headers=headers)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    row = soop.find_all('tr')
    lopr = [parse_row(rows) for rows in row]
    return(lopr)
Here is what I am having an issue with: when I assign type1_roster to a variable and print it, I only get an empty list. Ideally it should contain data about a player or players from a player's roster page.
# Roster page crawler
def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    href_tags = soop.find_all(href=True)
    hrefs = [tag.get('href') for tag in href_tags]
    # get all player links
    player_hrefs = []
    for href in hrefs:
        if 'sports/baseball/roster' in href:
            if 'sports/baseball/roster/coaches' not in href:
                if 'https:' not in href:
                    player_hrefs.append(href)
    # get rid of duplicates
    player_links = list(set(player_hrefs))
    # scrape the roster links
    for link in player_links:
        player_ = url + link[24:]
    return(find_data(player_))
A number of things:
- I would pass the headers as a global.
- You are slicing the link one character too late for player_, I think.
- You need to re-work the logic of find_data(), as the data is present in a mixture of element types and not in table/tr/td elements, e.g. it is found in spans. The HTML attributes are nice and descriptive and will support targeting content easily.
- You can target the player links from the landing page more tightly with the CSS selector list shown below. This removes the need for multiple loops as well as the use of list(set()).
import requests
from bs4 import BeautifulSoup

HEADERS = {'User-Agent': 'Mozilla/5.0'}

def scrape(url):
    page = requests.get(url, headers=HEADERS)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    return(soop)

def find_data(url):
    page = requests.get(url, headers=HEADERS)
    #print(page)
    html = page.text
    soop = BeautifulSoup(html, 'lxml')
    # re-think logic here to return desired data e.g.
    # soop.select_one('.sidearm-roster-player-jersey-number').text
    first_name = soop.select_one('.sidearm-roster-player-first-name').text
    # soop.select_one('.sidearm-roster-player-last-name').text
    # need targeted string cleaning possibly
    bio = soop.select_one('#sidearm-roster-player-bio').get_text('')
    return (first_name, bio)

def type1_roster(team_id):
    url = "https://" + team_id + ".com/sports/baseball/roster"
    soop = scrape(url)
    player_links = [i['href'] for i in soop.select(
        '.sidearm-roster-players-container .sidearm-roster-player h3 > a')]
    # scrape the roster links
    for link in player_links:
        player_ = url + link[23:]
        # print(player_)
    return(find_data(player_))

print(type1_roster('gvsulakers'))
I am trying to scrape a page that includes 12 links. I need to open each of these links and scrape all of their titles. Each link leads to multiple pages, but my code only scrapes the first page of each of the 12 links.
With the code below, I can print the URLs of all 12 links that exist on the main page.
url = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/index.html'
res = requests.get(url)
soup = BeautifulSoup(res.text, 'html.parser')
links = soup.find_all("a")
all_urls = []
for link in links[1:]:
    link_address = 'http://mlg.ucd.ie/modules/COMP41680/assignment2/' + link.get("href")
    all_urls.append(link_address)
Then, I looped over all of them.
for i in range(0, 12):
    url = all_urls[i]
    res = requests.get(url)
    soup = BeautifulSoup(res.text, 'html.parser')
The titles can be extracted by the lines below:
title_news = []
news_div = soup.find_all('div', class_='article')
for container in news_div:
    title = container.h5.a.text
    title_news.append(title)
The output of this code only includes the titles from the first page of each of these 12 links, while I need my code to go through the multiple pages behind each of the 12 URLs.
The line below gives me the link of the next page within each of these 12 links, if it is placed in an appropriate loop (it reads the pagination section and looks for the next-page URL):
page = soup.find('ul', {'class' : 'pagination'}).select('li', {'class': "page-link"})[2].find('a')['href']
How should I use a page variable inside my code so that it extracts the multiple pages behind all of these 12 links and reads all the titles, not only the first-page titles?
You can use this code to get all titles from all the pages:
import requests
from bs4 import BeautifulSoup

base_url = "http://mlg.ucd.ie/modules/COMP41680/assignment2/"

soup = BeautifulSoup(
    requests.get(base_url + "index.html").content, "html.parser"
)

title_news = []
for a in soup.select("#all a"):
    next_link = a["href"]
    print("Getting", base_url + next_link)
    while True:
        soup = BeautifulSoup(
            requests.get(base_url + next_link).content, "html.parser"
        )
        for title in soup.select("h5 a"):
            title_news.append(title.text)
        next_link = soup.select_one('a[aria-label="Next"]')["href"]
        if next_link == "#":
            break

print("Length of title_news:", len(title_news))
Prints:
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jan-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-feb-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-mar-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-apr-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-may-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jun-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-jul-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-aug-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-sep-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-oct-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-nov-001.html
Getting http://mlg.ucd.ie/modules/COMP41680/assignment2/month-dec-001.html
Length of title_news: 16226
I am trying to make a sitemap for the following website:
http://aogweb.state.ak.us/WebLink/0/fol/12497/Row1.aspx
The code first determines how many pages are on the top directory level, then stores each page number and its corresponding link. It then goes through each page and creates a dictionary that maps each 3-digit file value to the corresponding link for that value. From there the code creates another dictionary of the pages and links for each 3-digit directory (this is the point at which I am stuck). Once this is complete, the goal is to create a dictionary that contains each 6-digit file number and its corresponding link.
However, the code randomly fails at certain points throughout the scraping process and gives the following error message:
Traceback (most recent call last):
  File "C:\Scraping_Test.py", line 76, in <module>
    totalPages = totalPages.text
AttributeError: 'NoneType' object has no attribute 'text'
Sometimes the code does not even run and automatically skips to the end of the program without any errors.
I am currently running Python 3.6.0 and using all updated libraries in Visual Studio Community 2015. Any help will be appreciated, as I am new to programming.
import bs4 as bs
import requests
import re
import time

def stop():
    print('sleep 5 sec')
    time.sleep(5)

url0 = 'http://aogweb.state.ak.us'
url1 = 'http://aogweb.state.ak.us/WebLink/'

r = requests.get('http://aogweb.state.ak.us/WebLink/0/fol/12497/Row1.aspx')
soup = bs.BeautifulSoup(r.content, 'lxml')
print('Status: ' + str(r.status_code))
stop()

pagesTopDic = {}
pagesTopDic['1'] = '/WebLink/0/fol/12497/Row1.aspx'
dig3Dic = {}

for link in soup.find_all('a'):  # find top pages
    if not link.get('title') is None:
        if 'page' in link.get('title').lower():
            page = link.get('title')
            page = page.split(' ')[1]
            #print(page)
            pagesTopDic[page] = link.get('href')

listKeys = pagesTopDic.keys()
for page in listKeys:  # on each page find urls for beginning 3 digits
    url = url0 + pagesTopDic[page]
    r = requests.get(url)
    soup = bs.BeautifulSoup(r.content, 'lxml')
    print('Status: ' + str(r.status_code))
    stop()
    for link in soup.find_all('a'):
        if not link.get("aria-label") is None:
            folder = link.get("aria-label")
            folder = folder.split(' ')[0]
            dig3Dic[folder] = link.get('href')

listKeys = dig3Dic.keys()
pages3Dic = {}
for entry in listKeys:  # pages for each three digit num
    print(entry)
    url = url1 + dig3Dic[entry]
    r = requests.get(url)
    soup = bs.BeautifulSoup(r.content, 'lxml')
    print('Status: ' + str(r.status_code))
    stop()

    tmpDic = {}
    tmpDic['1'] = '/Weblink/' + dig3Dic[entry]

    totalPages = soup.find('div', {"class": "PageXofY"})
    print(totalPages)
    totalPages = totalPages.text
    print(totalPages)
    totalPages = totalPages.split(' ')[3]
    print(totalPages)

    while len(tmpDic.keys()) < int(totalPages):
        r = requests.get(url)
        soup = bs.BeautifulSoup(r.content, 'lxml')
        print('Status: ' + str(r.status_code))
        stop()
        for link in soup.find_all('a'):  # find top pages
            if not link.get('title') is None:
                #print(link.get('title'))
                if 'Page' in link.get('title'):
                    page = link.get('title')
                    page = page.split(' ')[1]
                    tmpDic[page] = link.get('href')
        num = len(tmpDic.keys())
        url = url0 + tmpDic[str(num)]
    print()
    pages3Dic[entry] = tmpDic
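The intermittent AttributeError above happens when `soup.find('div', {"class": "PageXofY"})` returns `None`, most likely because the server occasionally returns an error or throttling page instead of the folder listing. One way to harden this is to split the parsing out and retry the fetch when the counter is missing. This is only a sketch: the "PageXofY" class and the "Page X of Y" text format are taken from the code above, and the retry counts are arbitrary assumptions:

```python
import time
import requests
import bs4 as bs

def parse_total_pages(soup):
    # Read the "Page X of Y" counter; return None if it is absent,
    # instead of crashing on a NoneType attribute access.
    counter = soup.find('div', {'class': 'PageXofY'})
    if counter is None or not counter.text:
        return None
    # counter.text is assumed to look like "Page 1 of 7"
    return int(counter.text.split(' ')[3])

def get_total_pages(url, retries=3, delay=5):
    # Re-fetch the page a few times if the counter element is missing,
    # backing off between attempts (delay mirrors the stop() helper).
    for attempt in range(retries):
        r = requests.get(url)
        total = parse_total_pages(bs.BeautifulSoup(r.content, 'html.parser'))
        if total is not None:
            return total
        time.sleep(delay)
    return None  # caller decides how to handle a permanently missing counter
```

In the main loop, `totalPages = get_total_pages(url)` followed by a `None` check would skip or retry the problematic entries rather than aborting the whole crawl.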
Is there a way to count the number of results crawled in BeautifulSoup?
Here is the code.
def crawl_first_url(max_page):
    page = 1
    while page <= max_page:
        url = 'http://www.hdwallpapers.in/page/' + str(page)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, 'html.parser')
        for div in soup.select('.thumb a'):
            href = 'http://www.hdwallpapers.in' + div.get('href')
            crawl_second_url(href)
        page += 1

def crawl_second_url(second_href):
    #need to count the number of results here.
    #I tried len(second_href), but it doesn't work well.
    pass

crawl_first_url(1)
I want the second function to count the number of crawled results; for example, if 19 URLs have been crawled, I want that amount.
Since you only want to count the number of results, I don't see a reason to have a separate function, just add a counter.
page = 1
numResults = 0
while page <= max_page:
    url = 'http://www.hdwallpapers.in/page/' + str(page)
    source_code = requests.get(url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')
    for div in soup.select('.thumb a'):
        href = 'http://www.hdwallpapers.in' + div.get('href')
        numResults += 1
    page += 1
print("There are " + str(numResults) + " results.")
This will only count the number of subpages. If you also want to count the top-level pages, just add another increment line after the soup is created. You might also want to add a try/except block to avoid crashes.
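The try/except suggestion can be sketched as follows. The URL pattern and the `.thumb a` selector come from the question's code; `requests.RequestException` is the umbrella exception class that requests raises for connection errors, timeouts, and (via `raise_for_status()`) HTTP error codes:

```python
import requests
from bs4 import BeautifulSoup

def count_links(soup):
    # '.thumb a' is the selector from the question's code
    return len(soup.select('.thumb a'))

def count_results(max_page):
    # Count every thumbnail link across pages, skipping pages that fail
    # instead of letting one bad response crash the whole crawl.
    num_results = 0
    for page in range(1, max_page + 1):
        url = 'http://www.hdwallpapers.in/page/' + str(page)
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
        except requests.RequestException:
            continue  # unreachable page: skip it and keep counting
        num_results += count_links(BeautifulSoup(resp.text, 'html.parser'))
    return num_results
```

Returning the count (rather than printing it) also makes the function easy to reuse or test.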
import requests
from bs4 import BeautifulSoup

def trade_spider(max_pages):
    page = 0
    while page <= max_pages:
        url = 'http://orangecounty.craigslist.org/search/foa?s=' + str(page * 100)
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        for link in soup.findAll('a', {'class': 'hdrlnk'}):
            href = 'http://orangecounty.craigslist.org/' + link.get('href')
            title = link.string
            print title
            #print href
            get_single_item_data(href)
        page += 1

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text)
    for item_name in soup.findAll('section', {'id': 'postingbody'}):
        print item_name.string

trade_spider(1)
I am trying to crawl craigslist (for practice), http://orangecounty.craigslist.org/search/foa?s=0 in particular. I have it right now set to print the title of the entry and the description of the entry. The issue is that although the title correctly prints for every object listed, the description is listed as "None" for most of them, even though there is clearly a description. Any help would be appreciated. Thanks.
You are almost there. Just change item_name.string to item_name.text
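The reason: `.string` returns `None` whenever a tag has more than one child (a craigslist posting body usually mixes text with nested tags), while `.text` concatenates all descendant strings. A minimal demonstration on inline HTML:

```python
from bs4 import BeautifulSoup

# A posting body with a single text child vs. one with mixed children.
simple = BeautifulSoup('<section id="postingbody">just text</section>', 'html.parser')
mixed = BeautifulSoup('<section id="postingbody">couch <b>free</b> pickup</section>', 'html.parser')

print(simple.section.string)  # 'just text' -- exactly one child, so .string works
print(mixed.section.string)   # None -- multiple children, so .string gives up
print(mixed.section.text)     # 'couch free pickup' -- .text joins all strings
```

This is why the titles (plain link text) printed fine while most descriptions came out as "None".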
Instead of getting the .string, get the text of the posting body (worked for me):
item_name.get_text(strip=True)
As a side note, your script has a blocking "nature", you may speed things up dramatically by switching to Scrapy web-scraping framework.