Scrape href not working with Python

I have copied this exact code line by line from working examples, and every time I do it still isn't working right. I am more than frustrated and can't seem to figure out where it goes wrong. What I am trying to do is go to a website and scrape the different ratings pages, which are labelled A, B, C, etc. Then I go to each of those pages to pull the total number of pages they use. I am trying to scrape the anchors inside <span class='letter-pages'> (href='/ratings/A/1' and so on). What am I doing wrong?
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

hrefs = []
ratings = []
ks = []
pages_scrape = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])

for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)
    # elif good_ratings.startswith('/401k'):
    #     ks.append(url[:-9] + good_ratings)

del ratings[0]
del ratings[27:]
print(ratings)

for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    for href in soup.find('span', class_='letter-pages'):
        # Not working here
        pages_scrape.append(href.attrs['href'])
        # Will print all the anchor tags with hrefs if I remove the line above.
        print(href)

You are trying to get the href prematurely: you are extracting the attribute directly from a span tag that has nested a tags, rather than from the a tags themselves.
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    for a in span.find_all('a'):
        href = a.get('href')
        pages_scrape.append(href)
I didn't test this on all pages, but it worked for the first one. You pointed out that on some of the pages the content wasn't getting scraped, which is due to the span search returning None. To get around this you can do something like:
for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if span:
        for a in span.find_all('a'):
            href = a.get('href')
            pages_scrape.append(href)
            print(href)
    else:
        print('span.letter-pages not found on ' + each_rating)
Depending on your use case you might want to do something different, but this will indicate to you which pages don't match your scraping model and need to be manually investigated.
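For example, a minimal sketch of that idea, collecting the misses in a list so you can review them after the run instead of scanning printed output:

failed_pages = []

for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if span:
        # collect every pagination link inside the span
        pages_scrape.extend(a.get('href') for a in span.find_all('a'))
    else:
        # remember the page for manual investigation later
        failed_pages.append(each_rating)

print(failed_pages)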

You probably meant to do find_all instead of find -- so change
for href in soup.find('span', class_='letter-pages'):
to
for href in soup.find_all('span', class_='letter-pages'):
You want to be iterating over a list of tags, not a single tag. find gives you a single tag object, and when you iterate over a single tag you get its children (NavigableString objects and nested tags), not the tag itself. find_all gives you the list of tag objects you want.
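A minimal illustration of the difference, using a hypothetical snippet of markup:

from bs4 import BeautifulSoup

html = "<span class='letter-pages'> <a href='/ratings/A/1'>1</a> </span>"
soup = BeautifulSoup(html, 'html.parser')

for child in soup.find('span'):      # iterating ONE tag walks its children
    print(type(child).__name__)      # NavigableString, Tag, NavigableString

for tag in soup.find_all('span'):    # find_all returns a list of matching tags
    print(type(tag).__name__)        # Tag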

Related

How can I extract a specific item attribute from an ebay listing using BeautifulSoup?

import requests
from bs4 import BeautifulSoup

def get_data(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'html.parser')
    return soup

current_data = get_data(link)
x = current_data.find_all(text="Style Code:")
I'm trying to get the style code of a shoe off eBay, but the problem is that it doesn't have a specific class or any kind of unique identifier, so I can't just use find() to get the data. Currently I search by text to find 'Style Code:', but how can I get to the next element? An example of a shoe product page would be this.
soup.select_one('span.ux-textspans:-soup-contains("Style Code:")').find_next('span').get_text(strip=True)
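That one-liner uses the :-soup-contains() CSS pseudo-class (provided by soupsieve, which BeautifulSoup's select methods use). A slightly guarded sketch of the same idea, in case a listing has no style code at all:

label = soup.select_one('span.ux-textspans:-soup-contains("Style Code:")')
if label:
    style_code = label.find_next('span').get_text(strip=True)
else:
    style_code = None  # no 'Style Code:' label on this listing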
Try this,
spans = soup.find_all('span', attrs={'class': 'ux-textspans'})
style_code = None
for idx, span in enumerate(spans):
    if span.text == 'Style Code:':
        style_code = spans[idx + 1].text
        break

print(style_code)
# 554724-371
Since there are lots of similar spans (all with class 'ux-textspans'), you need to iterate through them and take the span that comes right after 'Style Code:'.

Access attributes with beautifulSoup and print

I'd like to scrape a site to find all title attributes of h2 tags:
<h2 class="1">Titanic_Caprio</h2>
Using this code, I'm accessing the entire h2 tag
from bs4 import BeautifulSoup
import urllib2

url = "http://www.example.it"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
links = soup.findAll('h2')
print "".join([str(x) for x in links])
Using findAll('h2', attrs={'title'}) gives no results. What am I doing wrong? How can I print the entire list of titles to a file?
The problem is that title is not an attribute of the h2 tag, but of a tag nested inside it. So you must first search for <h2> tags, and then for subtags that have a title attribute:
titles = []
h2_list = soup.findAll('h2')
for h2 in h2_list:
    titles.extend(h2.findAll(lambda x: x.has_attr('title')))
It works because BeautifulSoup can use functions as search filters.
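To answer the file part of the question, a minimal sketch that writes each collected title attribute to a file, one per line (assuming the titles list built above):

with open('titles.txt', 'w') as f:
    for tag in titles:
        # each entry is a tag that has a title attribute
        f.write(tag['title'] + '\n')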
You need to pass key-value pairs in attrs:
findAll('h2', attrs={"key": "value"})
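For this question there is no specific value to match, so note that you can also pass True, which matches any tag that simply has the attribute:

# True matches any attribute value, so this finds every tag carrying a title
titled = soup.findAll(attrs={'title': True})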

Parsing multiple urls with Python and BeautifulSoup

I started learning Python today and so it is not a surprise that I am struggling with some basics. I am trying to parse data from a school website for a project and I managed to parse the first page. However, there are multiple pages (results are paginated).
I have an idea about how to go about it, i.e., run through the URLs in a loop since I know the URL format, but I have no idea how to proceed. I figured it would be best to somehow search for the "next" button and run the function if it is there, and stop if it is not.
I would appreciate any help I can get.
import requests
from bs4 import BeautifulSoup

url = "http://www.myschoolwebsite.com/1"
#url2 = "http://www.myschoolwebsite.com/2"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
g_data = soup.find_all('ul', {"class": "searchResults"})

for item in g_data:
    for li in item.findAll('li'):
        for resultnameh2 in li.findAll('h2'):
            for resultname in resultnameh2.findAll('a'):
                print(resultname.text)
        for resultAddress in li.findAll('p', {"class": "resultAddress"}):
            print(resultAddress.text.replace('Get directions', '').strip())
        for resultContact in li.findAll('ul', {"class": "resultContact"}):
            for resultContact in li.findAll('a', {"class": "resultMainNumber"}):
                print(resultContact.text)
First, you can assume a maximum number of pages in the directory (if you know the pattern of the URL). I am assuming the URL is of the form http://base_url/page. Then you can write this:
base_url = 'http://www.myschoolwebsite.com'
total_pages = 100

def parse_content(r):
    soup = BeautifulSoup(r.content, 'lxml')
    g_data = soup.find_all('ul', {"class": "searchResults"})
    for item in g_data:
        for li in item.findAll('li'):
            for resultnameh2 in li.findAll('h2'):
                for resultname in resultnameh2.findAll('a'):
                    print(resultname.text)
            for resultAddress in li.findAll('p', {"class": "resultAddress"}):
                print(resultAddress.text.replace('Get directions', '').strip())
            for resultContact in li.findAll('ul', {"class": "resultContact"}):
                for resultContact in li.findAll('a', {"class": "resultMainNumber"}):
                    print(resultContact.text)

for page in range(1, total_pages + 1):
    response = requests.get(base_url + '/' + str(page))
    if response.status_code != 200:
        break
    parse_content(response)
I would make an array with all the URLs and loop through it, or if there is a clear pattern, write a regex to search for that pattern.
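If you would rather follow the "next" button idea from the question instead of assuming a page count, a rough sketch (the class name of the next link is a guess; inspect the site's actual markup):

url = base_url + '/1'
while url:
    response = requests.get(url)
    if response.status_code != 200:
        break
    parse_content(response)
    soup = BeautifulSoup(response.content, 'lxml')
    # hypothetical selector for the pagination "next" link
    next_link = soup.find('a', {'class': 'next'})
    url = base_url + next_link['href'] if next_link else None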

Scraping all h1 tags contents with beautiful soup

I am trying to scrape some review data with beautiful soup, and it will only let me grab a single element:
BASE_URL = "http://consequenceofsound.net/category/reviews/album-reviews/"
html = urlopen(BASE_URL + section_url).read()
soup = BeautifulSoup(html, "lxml")
meta = soup.find("div", {"class": "content"}).h1
wordage = [s.contents for s in meta]
This lets me grab a single review title from the page. When I change find to find_all, though, I can't access h1 on that line, so I end up with code like this:
meta = soup.find("div", {"class": "content"})
wordage = [s.h1 for s in meta]
and I'm unable to find a way to isolate the contents.
meta = soup.find_all("div", {"class": "content"})
wordage = [s.h1 for s in meta if s.h1 not in ([], None)]
link = [s.a['href'] for s in wordage]
Note the addition of the 'not in' check. It seems that on occasion empty lists and None values get mixed into the results, so this is an important safeguard.
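If you then want just the title text, one more comprehension over the wordage list from above should do it:

# extract the review titles as plain strings
titles = [h1.get_text(strip=True) for h1 in wordage]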

Sending data to sqlite from beautifulsoup on scraperwiki but getting KeyError: 'href'

I am trying to learn Python and Beautiful Soup by using ScraperWiki. I want a list of all the kickstarter projects in Edmonton.
I have successfully scraped the page I am looking for and pulled out the data I want. I am having trouble getting that data formatted and exported to the database.
Console output:
Line 42 - url = link["href"]
/usr/local/lib/python2.7/dist-packages/bs4/element.py:879, in __getitem__ (self=<h2 class="bbcard_nam...)
KeyError: 'href'
Code:
import scraperwiki
from bs4 import BeautifulSoup

search_page = "http://www.kickstarter.com/projects/search?term=edmonton"
html = scraperwiki.scrape(search_page)
soup = BeautifulSoup(html)
max = soup.find("p", {"class": "blurb"}).get_text()
num = int(max.split(" ")[0])

if num % 12 != 0:
    last_page = int(num / 12) + 1
else:
    last_page = int(num / 12)

for n in range(1, last_page + 1):
    html = scraperwiki.scrape(search_page + "&page=" + str(n))
    soup = BeautifulSoup(html)
    projects = soup.find_all("h2", {"class": "bbcard_name"})
    counter = (n - 1) * 12 + 1
    print projects
    for link in projects:
        url = link["href"]
        data = {"URL": url, "id": counter}
        # save into the data store, giving the unique parameter
        scraperwiki.sqlite.save(["URL"], data)
        counter += 1
There are anchors with hrefs inside projects. How can I get the URL from each <h2> element in the for loop?
Well, you're asking for <h2> tags, so that's what BeautifulSoup is giving you. None of these will have href attributes, obviously, because headers can't have href attributes.
Saying for link in projects merely gives each item in projects (which are level-2 headers) the name link, it doesn't magically turn them into links.
At the risk of seeming insultingly obvious, if you want links, look for <a> tags instead...? Or perhaps you want all the links inside each header... e.g.
for project in projects:
    for link in project.find_all("a"):
Or maybe do away with finding the projects and go straight for the links in the first place:
for link in soup.select("h2.bbcard_name a"):
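Putting that last suggestion together with the saving logic from the question, a rough sketch:

for link in soup.select("h2.bbcard_name a"):
    data = {"URL": link["href"], "id": counter}
    # one row per project link, keyed on URL as in the original code
    scraperwiki.sqlite.save(["URL"], data)
    counter += 1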
You're looking for the href attribute in <h2> tags.
This piece of code:
for link in projects:
iterates through projects, which contains <h2> tags, not links.
I'm not entirely clear on what you want, but I assume you want the href attribute of the <a> tags inside the <h2> tags. Try this one:
data = {"URL":[], "id":counter}
for header in projects: #take the header)
links = header.find_all("a")
for link in links:
url = link["href"]
Also, data = {"URL": url, "id": counter} overwrites the dictionary data on each loop. So change it to this:
data["URL"].append(url) # store it on this format {'URL':[link1,link2,link3]}
