Scraping multiple pages for parsing in Beautiful Soup - python

I'm trying to scrape multiple pages off of a single website for BeautifulSoup to parse. So far, I've tried using urllib2 to do this, but have been encountering some problems. What I've attempted is:
import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
html = urllib2.urlopen(address).read()
soup = BeautifulSoup(html)
title = soup.find("span", {"class":"paperstitle"})
date = soup.find("span", {"class":"docdate"})
span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
paras = [x for x in span.findAllNext("p")]
first = title.string
second = date.string
start = span.string
middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
last = paras[-1].contents[0]
print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
This only gives me results for the second number in the numb sequence, i.e. http://www.presidency.ucsb.edu/ws/index.php?pid=87433. I also made some attempts at using mechanize, but had no success. Ideally what I would like to be able to do is have a page, with a list of links, and then automatically select a link, pass the HTML off to BeautifulSoup, and then move to the next link in the list.

You need to put the rest of the code inside the loop. Right now you're iterating over both items in the tuple, but at the end of the iteration only the last item remains assigned to address which subsequently gets parsed outside the loop.
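A minimal illustration of why only the second page shows up: the loop just reassigns address on every pass, so by the time the parsing code runs only the last value is left.

for numb in ('85753', '87433'):
    address = 'http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb

print address  # prints http://www.presidency.ucsb.edu/ws/index.php?pid=87433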

I think you just missed the indentation in the loop:
import urllib2,sys
from BeautifulSoup import BeautifulSoup

for numb in ('85753', '87433'):
    address = ('http://www.presidency.ucsb.edu/ws/index.php?pid=' + numb)
    html = urllib2.urlopen(address).read()
    soup = BeautifulSoup(html)
    title = soup.find("span", {"class":"paperstitle"})
    date = soup.find("span", {"class":"docdate"})
    span = soup.find("span", {"class":"displaytext"})  # span.string gives you the first bit
    paras = [x for x in span.findAllNext("p")]
    first = title.string
    second = date.string
    start = span.string
    middle = "\n\n".join(["".join(x.findAll(text=True)) for x in paras[:-1]])
    last = paras[-1].contents[0]
    print "%s\n\n%s\n\n%s\n\n%s\n\n%s" % (first, second, start, middle, last)
I think this should solve the problem..

Here's a tidier solution (using lxml):
import lxml.html as lh

root_url = 'http://www.presidency.ucsb.edu/ws/index.php?pid='
page_ids = ['85753', '87433']

def scrape_page(page_id):
    url = root_url + page_id
    tree = lh.parse(url)
    title = tree.xpath("//span[@class='paperstitle']")[0].text
    date = tree.xpath("//span[@class='docdate']")[0].text
    text = tree.xpath("//span[@class='displaytext']")[0].text_content()
    return title, date, text

if __name__ == '__main__':
    for page_id in page_ids:
        title, date, text = scrape_page(page_id)
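The asker also mentioned wanting to start from a page that lists links and then visit each one automatically. A rough sketch of that workflow, reusing scrape_page above; the index URL is a placeholder and the href filter is an assumption, not something taken from the original question:

def scrape_index(index_url):
    # Parse the listing page and keep only links that point at a pid= document.
    tree = lh.parse(index_url)
    hrefs = tree.xpath("//a/@href")
    return [href for href in hrefs if 'index.php?pid=' in href]

# Hypothetical usage (relative hrefs may need urljoin to become absolute):
# for doc_url in scrape_index('http://www.presidency.ucsb.edu/SOME_INDEX_PAGE'):
#     title, date, text = scrape_page(doc_url.split('pid=')[-1])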

Related

Beautiful soup doesn't load the whole page

I got this project where I'm scraping data on Trulia.com and I want to get the maximum page number (the last number in the pagination) for a specific location, so I can loop through the pages and get all the hrefs.
To get that last number, my code runs as planned and should return an integer, but it doesn't always return the same number. I added the print of the comprehension list to understand what's wrong. Here is the code and the output below. The return is commented out, but it should return the last number of the output list as an int.
import requests as r
from bs4 import BeautifulSoup as bs

# req_headers is defined elsewhere in the original question; a generic value is assumed here
req_headers = {"User-Agent": "Mozilla/5.0"}

city_link = "https://www.trulia.com/for_rent/San_Francisco,CA/"

def bsoup(url):
    resp = r.get(url, headers=req_headers)
    soup = bs(resp.content, 'html.parser')
    return soup

def max_page(link):
    soup = bsoup(link)
    page_num = soup.find_all(attrs={"data-testid": "pagination-page-link"})
    print([x.get_text() for x in page_num])
    # return int(page_num[-1].get_text())

for x in range(10):
    max_page(city_link)
I have no clue why it sometimes returns something wrong. The city_link above is the page in question.
Okay, now if I understand what you want, you are trying to see how many pages of links there are for a given location for rent. If we can assume the given link is the only required link, this code:
import requests
import bs4

url = "https://www.trulia.com/for_rent/San_Francisco,CA/"
req = requests.get(url)
soup = bs4.BeautifulSoup(req.content, features='lxml')

def get_number_of_pages(soup):
    caption_tag = soup.find('div', class_="Text__TextBase-sc-1cait9d-0-div Text__TextContainerBase-sc-1cait9d-1 RBSGf")
    pagination = caption_tag.text
    words = pagination.split(" ")
    values = []
    for word in words:
        if not word.isalpha():
            values.append(word)
    links_per_page = values[0].split('-')[1]
    total_links = values[1].replace(',', '')
    no_of_pages = round(int(total_links)/int(links_per_page) + 0.5)
    return no_of_pages

for i in range(10):
    print(get_number_of_pages(soup))
achieves what you're looking for, and gives repeatable results because it doesn't interact with JavaScript, only with the pagination caption at the bottom of the page.
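The round(... + 0.5) at the end is just a ceiling division. With made-up numbers (say the caption reads "1-40 of 3,515 Results"), math.ceil expresses the same idea a little more directly:

import math

links_per_page = 40    # hypothetical value parsed from the caption, e.g. "1-40"
total_links = 3515     # hypothetical total, e.g. "3,515" with the comma removed
print(math.ceil(total_links / links_per_page))  # 88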

Web scraping with bs4 python: How to display football matchups

I'm a beginner to Python and am trying to create a program that will scrape the football/soccer schedule from skysports.com and will send it through SMS to my phone through Twilio. I've excluded the SMS code because I have that figured out, so here's the web scraping code I am getting stuck with so far:
import requests
from bs4 import BeautifulSoup
from collections import defaultdict  # needed for defaultdict below

URL = "https://www.skysports.com/football-fixtures"
page = requests.get(URL)
results = BeautifulSoup(page.content, "html.parser")

d = defaultdict(list)
comp = results.find('h5', {"class": "fixres__header3"})
team1 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side1"})
date = results.find('span', {"class": "matches__date"})
team2 = results.find('span', {"class": "matches__item-col matches__participant matches__participant--side2"})

for ind in range(len(d)):
    d['comp'].append(comp[ind].text)
    d['team1'].append(team1[ind].text)
    d['date'].append(date[ind].text)
    d['team2'].append(team2[ind].text)
The code down below should do the trick for you:
from bs4 import BeautifulSoup
import requests

a = requests.get('https://www.skysports.com/football-fixtures')
soup = BeautifulSoup(a.text, features="html.parser")

teams = []
for date in soup.find_all(class_="fixres__header2"):  # searching in that date
    for i in soup.find_all(class_="swap-text--bp30")[1:]:  # skips the first one because that's a heading
        teams.append(i.text)

date = soup.find(class_="fixres__header2").text
print(date)

teams = [i.strip('\n') for i in teams]
for x in range(0, len(teams), 2):
    print(teams[x] + " vs " + teams[x+1])
Let me further explain what I have done:
All the football teams have this class name: swap-text--bp30
So we can use find_all to extract all the elements with that class.
Once we have our results we can put them into an array, teams = [], and append to it in a for loop with teams.append(i.text). .text strips out the HTML tags.
Then we get rid of the "\n" in the array by stripping each string, and print the strings two by two.
That should give you your final output.
EDIT: To scrape the titles of the leagues we do pretty much the same:
league = []
for date in soup.find_all(class_="fixres__header2"):  # searching in that date
    for i in soup.find_all(class_="fixres__header3"):  # the league/competition headings
        league.append(i.text)
Strip the array and create another one:
league = [i.strip('\n') for i in league]
final = []
Then add this final bit of code which is essentially just printing the league then the two teams over and over:
for x in range(0, len(teams), 5):
    final.append(teams[x] + " vs " + teams[x+1])
for i in league:
    print(i)
for i in final:
    print(i)
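Putting the pieces together, here is a consolidated sketch that pairs the fixtures two by two and prints the league headings first. It assumes the class names swap-text--bp30 and fixres__header3 still exist on skysports.com; the markup may change at any time:

from bs4 import BeautifulSoup
import requests

resp = requests.get('https://www.skysports.com/football-fixtures')
soup = BeautifulSoup(resp.text, features="html.parser")

# Team names (the first swap-text--bp30 element is a heading, so skip it)
teams = [i.text.strip('\n') for i in soup.find_all(class_="swap-text--bp30")[1:]]

# League/competition headings
leagues = [i.text.strip('\n') for i in soup.find_all(class_="fixres__header3")]

for league in leagues:
    print(league)
for x in range(0, len(teams) - 1, 2):
    print(teams[x] + " vs " + teams[x + 1])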

Beautiful Soup: how to select <a href> and <td> elements with whitespaces

I'm trying to use BeautifulSoup to select the date, url, description, and additional url from the table and am having trouble accessing them given the weird white spaces:
So far I've written:
import urllib
import urllib.request
from bs4 import BeautifulSoup

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, "html.parser")
    return soupdata

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')
test1 = soup.findAll("td", {"nowrap" : "nowrap"})
test2 = [item.text.strip() for item in test1]
With bs4 4.7.1 you can use :has and :nth-of-type in combination with next_sibling to get those columns:
from bs4 import BeautifulSoup
import requests, re

def make_soup(url):
    the_page = requests.get(url)
    soup_data = BeautifulSoup(the_page.content, "html.parser")
    return soup_data

soup = make_soup('https://www.sec.gov/litigation/litreleases/litrelarchive/litarchive2010.shtml')

releases = []
links = []
dates = []
descs = []
addit_urls = []

for i in soup.select('td:nth-of-type(1):has([href^="/litigation/litreleases/"])'):
    sib_sib = i.next_sibling.next_sibling.next_sibling.next_sibling
    releases += [i.a.text]
    links += [i.a['href']]
    dates += [i.next_sibling.next_sibling.text.strip()]
    descs += [re.sub('\t+|\s+', ' ', sib_sib.text.strip())]
    addit_urls += ['N/A' if sib_sib.a is None else sib_sib.a['href']]

result = list(zip(releases, links, dates, descs, addit_urls))
print(result)
Unfortunately there is no class or id HTML attribute to quickly identify the table to scrape; after experimentation I found it was the table at index 4.
Next we ignore the header by separating it from the data, which still has table rows that are just separations for quarters. We can skip over these using a try-except block since those only contain one table data tag.
I noticed that the description is separated by tabs, so I split the text on \t.
For the urls, I used .get('href') rather than ['href'] since not every anchor tag has an href attribute from my experience scraping. This avoids errors should that case occur. Finally the second anchor tag does not always appear, so this is wrapped in a try-except block as well.
data = []
table = soup.find_all('table')[4]  # target the specific table
header, *rows = table.find_all('tr')

for row in rows:
    try:
        litigation, date, complaint = row.find_all('td')
    except ValueError:
        continue  # ignore quarter rows
    id = litigation.text.strip().split('-')[-1]
    date = date.text.strip()
    desc = complaint.text.strip().split('\t')[0]
    lit_url = litigation.find('a').get('href')
    try:
        comp_url = complaint.find('a').get('href')
    except AttributeError:
        comp_url = None  # the complaint url is optional
    info = dict(id=id, date=date, desc=desc, lit_url=lit_url, comp_url=comp_url)
    data.append(info)
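For a quick look at what ends up in data, one option is to dump the first few parsed rows as JSON (each entry is a plain dict of strings, so it serializes cleanly):

import json

print(json.dumps(data[:3], indent=2))  # peek at the first few rows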

cannot get the <span></span> texts

I cannot get the span text within the "table", thanks!
from bs4 import BeautifulSoup
import urllib2

url1 = "url"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1, "lxml")
table = soup.findAll("div", {"class": "iw_component", "id": "c1417094965154"})
rows = table.find_all('span', recursive=False)
for row in rows:
    print(row.text)
table = soup.findAll("div", {"class" : "iw_component","id":"c1417094965154"})
In the above line, findAll() returns a list (a ResultSet), not a single element.
So, in the next line you get the error because a list has no find_all() method.
If you expect only one table, try using the following code. Just replace
rows = table.find_all('span',recursive=False)
with
rows = table[0].find_all('span')
If you expect multiple tables in the page, run a for loop on the table and then run the rest of the statements inside the for loop.
Also, for pretty output, you can strip out the tabs as in the following code:
row = row.get_text()
row = row.replace('\t', '')
print(row)
The final working code for you is:
from bs4 import BeautifulSoup
import urllib2

url1 = "url"
content1 = urllib2.urlopen(url1).read()
soup = BeautifulSoup(content1, "lxml")
table = soup.findAll("div", {"class": "iw_component", "id": "c1417094965154"})
rows = table[0].find_all('span')
for row in rows:
    row_str = row.get_text()
    row_str = row_str.replace('\t', '')
    print(row_str)
Regarding the recursive=False parameter: if it is set to False, find_all() only searches the direct children, which in your case gives no result.
Recursive Argument in find()
If you only want Beautiful Soup to consider direct children, you can pass in recursive=False
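A small self-contained example of the difference (standalone, not tied to the page in the question):

from bs4 import BeautifulSoup

html = "<div><p><span>nested</span></p></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.find("div")

print(div.find_all("span"))                   # [<span>nested</span>] - searches all descendants
print(div.find_all("span", recursive=False))  # [] - only direct children of the div are considered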
Here's another approach using lxml instead of beautifulsoup:
import requests
from lxml import html

req = requests.get("<URL>")
raw_html = html.fromstring(req.text)
spans = raw_html.xpath('//div[@id="c1417094965154"]//span/text()')
print("".join([x.replace("\t", "").replace("\r\n", "").strip() for x in spans]))
Output: Kranji Mile Day simulcast races, Kranji Racecourse, SINClass 3 Handicap - 1200M TURFSaturday, 26 May 2018Race 1, 5:15 PM
As you can see, the output needs a little formatting; spans is a list of all the span texts, so you can do any processing you need.
You seem to be using Python 2.x; here is a Python 3.x solution, since I do not have a Python 2.x environment at the moment:
from bs4 import BeautifulSoup
import urllib.request as urllib

url1 = "<URL>"

# Read the HTML page
content1 = urllib.urlopen(url1).read()
soup = BeautifulSoup(content1, "lxml")

# Find the div (there is only one, so you do not need findAll) -> this is your problem
div = soup.find("div", class_="iw_component", id="c1417094965154")

# Now you retrieve all the spans within this div
rows = div.find_all("span")

# You can do what you want with it!
line = ""
for row in rows:
    row_str = row.get_text()
    row_str = row_str.replace('\t', '')
    line += row_str + ", "
print(line)

Having problems following links with webcrawler

I am trying to create a webcrawler that parses all the html on the page, grabs a specified (via raw_input) link, follows that link, and then repeats this process a specified number of times (once again via raw_input). I am able to grab the first link and successfully print it. However, I am having problems "looping" the whole process, and usually grab the wrong link. This is the first link
https://pr4e.dr-chuck.com/tsugi/mod/python-data/data/known_by_Fikret.html
(Full disclosure, this questions pertains to an assignment for a Coursera course)
Here's my code
import urllib
from BeautifulSoup import *

url = raw_input('Enter - ')
rpt = raw_input('Enter Position')
rpt = int(rpt)
cnt = raw_input('Enter Count')
cnt = int(cnt)
count = 0
counts = 0
tags = list()
soup = None

while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    # Retrieve all of the anchor tags
    tags = soup.findAll('a')
    for tag in tags:
        url = tag.get('href')
        count = count + 1
        if count == rpt:
            break
    counts = counts + 1
    if counts == cnt:
        x == 1
    else:
        continue
print url
Based on DJanssens' response, I found the solution:
url = tags[position-1].get('href')
did the trick for me!
Thanks for the assistance!
I also worked on that course, and with help from a friend I got this worked out:
import urllib
from bs4 import BeautifulSoup

url = "http://python-data.dr-chuck.net/known_by_Happy.html"
rpt = 7
position = 18
count = 0
counts = 0
tags = list()
soup = None
x = 0

while x == 0:
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.findAll('a')
    url = tags[position-1].get('href')
    count = count + 1
    if count == rpt:
        break
print url
I believe this is what you are looking for:
import urllib
from bs4 import *

url = raw_input('Enter - ')
position = int(raw_input('Enter Position'))
count = int(raw_input('Enter Count'))

# perform the loop "count" times.
for _ in xrange(0, count):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html)
    tags = soup.findAll('a')
    # if the link does not exist at that position, show an error.
    if not tags[position-1]:
        print "A link does not exist at that position."
    # if the link at that position exists, overwrite url so the next search will use it.
    url = tags[position-1].get('href')
print url
The code will now loop the amount of times as specified in the input, each time it will take the href at the given position and replace it with the url, in that way each loop will look further in the tree structure.
I advise you to use full names for variables, which are a lot easier to understand. In addition, you could cast them and read them in a single line, which makes the beginning of your script easier to follow.
Here is my 2-cents:
import urllib
#import ssl
from bs4 import BeautifulSoup

#'http://py4e-data.dr-chuck.net/known_by_Fikret.html'
url = raw_input('Enter URL : ')
position = int(raw_input('Enter position : '))
count = int(raw_input('Enter count : '))

print('Retrieving: ' + url)
soup = BeautifulSoup(urllib.urlopen(url).read())

for x in range(1, count + 1):
    link = list()
    for tag in soup('a'):
        link.append(tag.get('href', None))
    print('Retrieving: ' + link[position - 1])
    soup = BeautifulSoup(urllib.urlopen(link[position - 1]).read())
