rewriting spider in OOP terms - python

Hope everyone is well.
I wrote some code for a spider not long ago, for a website that shows house prices and sale dates in London. I have recently decided to improve it by making it object-oriented.
The scope of the spider extends to 680 areas such as one given by this link:
http://www.rightmove.co.uk/house-prices/St-Johns-Wood.html
and if you click on the page, you will see that there are 40 pages for each area.
The reason I would like to make this object-oriented is that I need to add a few methods that deal with updating the database it saves to, so I do not have to run the whole spider again.
My first question is this: is it better to abstract the individual pages and treat each one as an object, or to treat each of the 680 areas as an object with methods to crawl each of the 40 pages for that area?
My second question is this:
Would someone be kind enough to show me how this would be implemented in OOP terms, given the answer to the first question?
Below I provide code for the two jobs:
def forty_page_getter(link):
    driver = webdriver.PhantomJS()
    driver.get(link)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    link_box = soup.find('div', {'id': 'sliderBottom'})
    rest = link_box.find_all('a')
    links = []
    for i in rest:
        try:
            link = i.get('href')
            links.append(link)
        except:
            pass
    return links
def page_stripper(link):
    '''this function strips all the information on houses and transactions from a page,
    ready for entry into the database, now make a function for the database '''
    myopener = MyOpener()
    page = myopener.open(link)
    soup = BeautifulSoup(page, 'html.parser')
    houses = soup.find_all('div', {'class': 'soldDetails'})
    for house in houses:
        try:
            address = house.a.text
        except:
            address = house.find('div', {'class': 'soldAddress'}).text
        postcode_list = address.split()[-2:]
        postcode = postcode_list[0] + ' ' + postcode_list[1]
        table = house.find('table').find_all('tr')
        bedrooms = table[0].find('td', {'class': 'noBed'}).text
        if not bedrooms:
            bedrooms = '0'
        else:
            bedrooms = bedrooms[0]
        house_key = save_house_to_db(address=address, postcode=postcode, bedrooms=bedrooms)
        for row in table[::-1]:
            price = row.find('td', {'class': 'soldPrice'}).text
            date = row.find('td', {'class': 'soldDate'}).text
            save_transactions_to_db(id=house_key, date=date, sale_price=price)
        print('saved %s to DB' % str(address))
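For context, save_house_to_db and save_transactions_to_db are simple insert helpers that are not shown above; a rough sketch of the idea (sqlite3, the table names, and the columns here are placeholders rather than my actual schema) would be:
# Rough sketch only: backend and schema are placeholders, not the real helpers.
# Assumes the houses and transactions tables already exist.
import sqlite3

conn = sqlite3.connect('houses.db')


def save_house_to_db(address, postcode, bedrooms):
    """Insert one house and return its primary key for linking transactions."""
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO houses (address, postcode, bedrooms) VALUES (?, ?, ?)",
        (address, postcode, bedrooms),
    )
    conn.commit()
    return cur.lastrowid


def save_transactions_to_db(id, date, sale_price):
    """Insert one sale record linked to a house key."""
    cur = conn.cursor()
    cur.execute(
        "INSERT INTO transactions (house_id, sale_date, sale_price) VALUES (?, ?, ?)",
        (id, date, sale_price),
    )
    conn.commit()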
I am somewhat confused: if we were to treat the page, or even the area link, as an object, how would that work given that we use BeautifulSoup (plus Selenium for the dynamically updating scrollbar that exposes the forty pages), both of which already hand us objects of their own?
For example, I was considering the following two ways:
class area(BeautifulSoup):
    def __init__(self, html_code):
        ...
    def get_all_pages(self):
        ...
    def strip_all_pages(self):
        ...

# OR would it be better to do this?
class page(object):
    def __init__(self, html_code):
        ...
    def strip(self):
        ...
but I am new to classes and was wondering how to work with bs4, and potentially Selenium as well, inside my own defined class.
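To make the comparison concrete, here is the kind of composition-based sketch I have in mind, where each class holds a BeautifulSoup object rather than inheriting from it. The capitalised class names, the opener argument, and the way the driver is created are only assumptions for illustration, not tested code:
# Sketch only: composition instead of subclassing BeautifulSoup.
# Class names and the opener/driver handling are assumptions for illustration.
from bs4 import BeautifulSoup
from selenium import webdriver


class Page(object):
    """One results page; wraps a BeautifulSoup object instead of inheriting from it."""

    def __init__(self, html):
        self.soup = BeautifulSoup(html, 'html.parser')

    def houses(self):
        """Return the soldDetails blocks, ready to be saved to the database."""
        return self.soup.find_all('div', {'class': 'soldDetails'})


class Area(object):
    """One of the 680 areas; it owns its link and knows how to reach its ~40 pages."""

    def __init__(self, link, opener):
        self.link = link
        self.opener = opener  # e.g. a MyOpener() instance, as in page_stripper above

    def page_links(self):
        """Use Selenium once per area to pull the paginated links out of the slider."""
        driver = webdriver.PhantomJS()
        driver.get(self.link)
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        driver.quit()
        link_box = soup.find('div', {'id': 'sliderBottom'})
        return [a.get('href') for a in link_box.find_all('a') if a.get('href')]

    def pages(self):
        """Yield a Page object for every paginated link in this area."""
        for href in self.page_links():
            yield Page(self.opener.open(href))
With this shape, the database-update methods could live on Area, so a single area can be refreshed without rerunning the whole spider.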
Thanks for any help guys. I appreciate it.

Related

Beautifulsoup - Python For loop only runs 8 times then exits with code 0 in visual studio code

I've got a python script that scrapes the first page on an auction site. The page it's scraping is trademe.co.nz - similar to ebay/amazon etc. Its purpose is to scrape all listings on the first page - only if they're not already in my database. It's working as expected with one caveat - it only scrapes the first 8 listings (regardless of the trademe url) and then exits with code 0 in Visual Studio Code. If I try to run it again it exits immediately, as it thinks there are no new auction IDs. If a new listing gets added and I run the script again, it will add the new one.
from bs4 import BeautifulSoup
from time import sleep
import requests
import datetime
import sqlite3

# Standard for all scrapings
dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")

def mechanicalKeyboards():
    url = "https://www.trademe.co.nz/a/marketplace/computers/peripherals/keyboards/mechanical/search?condition=used&sort_order=expirydesc"
    category = "Mechanical Keyboards"
    dateAdded = datetime.datetime.now().strftime("%d/%m/%Y %H:%M:%S")
    trademeLogo = "https://www.trademe.co.nz/images/frend/trademe-logo-no-tagline.png"
    # getCode = requests.get(url).status_code
    # print(getCode)
    r = requests.get(url)
    soup = BeautifulSoup(r.text, "html.parser")
    listingContainer = soup.select(".tm-marketplace-search-card__wrapper")
    conn = sqlite3.connect('trademe.db')
    c = conn.cursor()
    c.execute('''SELECT ID FROM trademe ORDER BY DateAdded DESC ''')
    allResult = str(c.fetchall())
    for listing in listingContainer:
        title = listing.select("#-title")
        location = listing.select("#-region")
        auctionID = listing['data-aria-id'].split("-").pop()
        fullListingURL = "https://www.trademe.co.nz/a/" + auctionID
        image = listing.select("picture img")
        try:
            buyNow = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price ng-star-inserted").text.strip()
        except:
            buyNow = "None"
        try:
            price = listing.select(".tm-marketplace-search-card__footer-pricing-row")[0].find(class_="tm-marketplace-search-card__price").text.strip()
        except:
            price = "None"
        for t, l, i in zip(title, location, image):
            if auctionID not in allResult:
                print("Adding new data - " + t.text)
                c.execute(''' INSERT INTO trademe VALUES(?,?,?,?)''', (auctionID, t.text, dateAdded, fullListingURL))
                conn.commit()
                sleep(5)
I thought perhaps I was getting rate-limited, but I get a 200 status code, and changing URLs works for the first 8 listings again. I had a look at the elements and can't see any changes after the 8th listing. I'm hoping someone could assist, thanks so much.
When using requests.get(url) to scrape a website with lazy-loaded content, it only returns the HTML with images for the first 8 listings, so zip(title, location, image) only yields 8 items, because the image variable is an empty list for every listing in listingContainer after the 8th.
To properly scrape this type of website, I would recommend using tools such as Playwright or Selenium.
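For example, here is a minimal sketch of that approach using Playwright's sync API. This is only an illustration: the scroll distance and the fixed wait are rough assumptions, and the selector is taken from your code.
# Sketch, assuming Playwright is installed (pip install playwright; playwright install chromium).
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

url = ("https://www.trademe.co.nz/a/marketplace/computers/peripherals/keyboards/"
       "mechanical/search?condition=used&sort_order=expirydesc")

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto(url)
    # Scroll down so the lazily loaded listing images are actually fetched.
    page.mouse.wheel(0, 20000)
    page.wait_for_timeout(2000)  # crude wait; waiting for a selector would be more robust
    html = page.content()
    browser.close()

soup = BeautifulSoup(html, "html.parser")
listingContainer = soup.select(".tm-marketplace-search-card__wrapper")
print(len(listingContainer))  # each listing should now have its image rendered
The same idea works with Selenium: load the page in a real browser, scroll, and only then hand driver.page_source to BeautifulSoup.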

Paginating pages using things other than numbers in python

I am trying to paginate a scraper on my university's website.
Here is the url for one of the pages:
https://www.bu.edu/com/profile/david-abel/
where david-abel is a first name followed by a last name. (It would be first-middle-last if a middle name was given, which poses a problem, since my code currently only finds first and last names.) I have a plan to deal with middle names, but my question is:
How do I go about adding names from my firstnames and lastnames lists to my base url to get a corresponding url in the layout above?
import requests
from bs4 import BeautifulSoup

url = 'https://www.bu.edu/com/profiles/faculty/page/1/'
data = requests.get(url)

my_data = []
split_names = []
firstnames = []
lastnames = []
middlenames = []

html = BeautifulSoup(data.text, 'html.parser')
professors = html.select('h4.profile-card__name')

for professor in professors:
    my_data.append(professor.text)

for name in my_data:
    x = name.split()
    split_names.append(x)

for name in split_names:
    f, l = zip(*split_names)
    firstnames.append(f)
    lastnames.append(l)

#\/ appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl +

print(firstnames)
print(lastnames)
This simple modification should give you what you want, let me know if you have any more questions or if anything needs to be changed!
# appending searchable url using names
for name in split_names:
    baseurl = "https://www.bu.edu/com/profile/"
    newurl = baseurl + "-".join(name)
    print(newurl)
Even better:
for name in split_names:
    profile_url = f"https://www.bu.edu/com/profile/{'-'.join(name)}"
    print(profile_url)
As for the pagination part, this should work and is not hard coded. Let's say that new faculty join and there are now 9 pages. This code should still work in that case.
url = 'https://www.bu.edu/com/profiles/faculty/page'

with requests.get(f"{url}/1") as response:
    soup = BeautifulSoup(response.text, 'html.parser')

# select pagination numbers shown ex: [2, 3, 7, Next] (Omit the next)
page_numbers = [int(n.text) for n in soup.select("a.page-numbers")[:-1]]

# take the min and max for pagination
start_page, stop_page = min(page_numbers), max(page_numbers) + 1

# loop through pages
for page in range(start_page, stop_page):
    with requests.get(f"{url}/{page}") as response:
        soup = BeautifulSoup(response.text, 'html.parser')
    professors = soup.select('h4.profile-card__name')
I believe this is the best and most concise way to solve your problem. Just as a tip, you should use with when making requests, as it takes care of a lot of issues for you and you don't have to pollute the namespace with things like resp1, resp2, etc. As mentioned above, f-strings are amazing and super easy to use.

How to perform paging to scrape quotes over several pages?

I'm looking to scrape the website 'https://quotes.toscrape.com/' and retrieve for each quote, the author's full name, date of birth, and location of birth. There are 10 pages of quotes. To retrieve the author's date of birth and location of birth, one must follow the <a href 'about'> link next to the author's name.
Functionally speaking, I need to scrape 10 pages of quotes and follow each quote author's 'about' link to retrieve their data mentioned in the paragraph above ^, and then compile this data into a list or dict, without duplicates.
I can complete some of these tasks separately, but I am new to BeautifulSoup and Python and am having trouble putting them all together. My success so far is limited to retrieving the author's info from quotes on page 1; I am unable to properly assign the function's returns to a variable (without an erroneous in-function print statement), and unable to implement the 10-page scan... Any help is greatly appreciated.
def get_author_dob(url):
    response_auth = requests.get(url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth)
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return [auth_tag.text]

def get_author_bplace(url):
    response_auth2 = requests.get(url)
    html_auth2 = response_auth2.content
    auth_soup2 = BeautifulSoup(html_auth2)
    auth_tag2 = auth_soup2.find("span", class_="author-born-location")
    return [auth_tag2.text]

url = 'http://quotes.toscrape.com/'
soup = BeautifulSoup(html)
tag = soup.find_all("div", class_="quote")

def auth_retrieval(url):
    for t in tag:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
I need to use 'return' in the above function to be able to assign the results to a variable, but when I do, it only returns one value. I have tried the generator route with yield, but I'm confused about how to implement the counter when I am already iterating over tag. I'm also confused about where and how to insert the 10-page scan task. Thanks in advance.
You are on the right track, but you could simplify the process a bit:
Use a while-loop and check whether the next button is available to perform the paging. This also works if the number of pages is not known. You could still stop after a specific number of pages if needed.
Reduce the number of requests and scrape all the available and necessary information in one go.
If you pick up a bit more than you need, that is not bad - you can filter it easily to reach your goal: df[['author','dob','lob']].drop_duplicates()
Store the information in a structured way, like a dict, instead of single variables.
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_author(url):
    soup = BeautifulSoup(requests.get(url).text)
    author = {
        'dob': soup.select_one('.author-born-date').text,
        'lob': soup.select_one('.author-born-location').text,
        'url': url
    }
    return author

base_url = 'http://quotes.toscrape.com'
url = base_url
quotes = []

while True:
    soup = BeautifulSoup(requests.get(url).text)
    for e in soup.select('div.quote'):
        qoute = {
            'author': e.select_one('small.author').text,
            'qoute': e.select_one('span.text').text
        }
        qoute.update(get_author(base_url + e.a.get('href')))
        quotes.append(qoute)
    if soup.select_one('li.next a'):
        url = base_url + soup.select_one('li.next a').get('href')
        print(url)
    else:
        break

pd.DataFrame(quotes)
Output
    | author | qoute | dob | lob | url
0   | Albert Einstein | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | March 14, 1879 | in Ulm, Germany | http://quotes.toscrape.com/author/Albert-Einstein
1   | J.K. Rowling | “It is our choices, Harry, that show what we truly are, far more than our abilities.” | July 31, 1965 | in Yate, South Gloucestershire, England, The United Kingdom | http://quotes.toscrape.com/author/J-K-Rowling
... | ... | ... | ... | ... | ...
98  | Dr. Seuss | “A person's a person, no matter how small.” | March 02, 1904 | in Springfield, MA, The United States | http://quotes.toscrape.com/author/Dr-Seuss
99  | George R.R. Martin | “... a mind needs books as a sword needs a whetstone, if it is to keep its edge.” | September 20, 1948 | in Bayonne, New Jersey, The United States | http://quotes.toscrape.com/author/George-R-R-Martin
Your code is almost working and just needs a bit of refactoring.
One thing I found out was that you can access individual pages using this URL pattern:
https://quotes.toscrape.com/page/{page_number}/
Now, once you've figured that out, we can take advantage of this pattern in the code:
# refactored auth_retrieval into this function for reusability
def get_page_data(base_url, tags):
    all_authors = []
    for t in tags:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = base_url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
        all_authors.append(authorss)
    return all_authors

url = 'https://quotes.toscrape.com/'  # base url for the website
total_pages = 10

all_page_authors = []
for i in range(1, total_pages + 1):
    page_url = f'{url}page/{i}/'  # https://quotes.toscrape.com/page/1, 2, ... 10
    print(page_url)
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    tags = soup.find_all("div", class_="quote")
    all_page_authors += get_page_data(url, tags)  # merge all authors into one list
print(all_page_authors)
get_author_dob and get_author_bplace remain the same.
The final output will be an array of authors where each author's info is an array.
[['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],
['J.K. Rowling', 'July 31, 1965', 'in Yate, South Gloucestershire, England, The United Kingdom'],
['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],...]
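Since the question also asked for the data without duplicates, one extra step you could bolt on at the end (my addition, not part of the code above) is a simple order-preserving de-duplication:
seen = set()
unique_authors = []
for record in all_page_authors:
    key = tuple(record)  # lists aren't hashable, so use a tuple as the set key
    if key not in seen:
        seen.add(key)
        unique_authors.append(record)
print(unique_authors)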

Adjusting Web Scraping Code for another site

I'm currently using this code to web scrape reviews from TrustPilot. I wish to adjust the code to scrape reviews from (https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create). However, unlike most other review sites, the reviews are not separated into multiple sub-pages but there is instead a button at the end of the page to "view more reviews" which shows 3 additional reviews whenever you press it.
Is it possible to adjust the code such that it is able to scrape all the reviews from this particular product within the website with this kind of web structure?
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json

print('all imported successfuly')

# Initialize an empty dataframe
df = pd.DataFrame()

for x in range(1, 44):
    names = []
    headers = []
    bodies = []
    ratings = []
    published = []
    updated = []
    reported = []
    link = (f'https://www.trustpilot.com/review/birchbox.com?page={x}')
    print(link)
    req = requests.get(link)
    content = req.content
    soup = BeautifulSoup(content, "lxml")
    articles = soup.find_all('article', {'class': 'review'})
    for article in articles:
        names.append(article.find('div', attrs={'class': 'consumer-information__name'}).text.strip())
        headers.append(article.find('h2', attrs={'class': 'review-content__title'}).text.strip())
        try:
            bodies.append(article.find('p', attrs={'class': 'review-content__text'}).text.strip())
        except:
            bodies.append('')
        try:
            #ratings.append(article.find('div', attrs={'class':'star-rating star-rating--medium'}).text.strip())
            #ratings.append(article.find('div', attrs={'class': 'star-rating star-rating--medium'})['alt'])
            ratings.append(article.find_all("img", alt=True)[0]["alt"])
        except:
            ratings.append('')
        dateElements = article.find('div', attrs={'class': 'review-content-header__dates'}).text.strip()
        jsonData = json.loads(dateElements)
        published.append(jsonData['publishedDate'])
        updated.append(jsonData['updatedDate'])
        reported.append(jsonData['reportedDate'])
    # Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
    temp_df = pd.DataFrame({'User Name': names, 'Header': headers, 'Body': bodies, 'Rating': ratings, 'Published Date': published, 'Updated Date': updated, 'Reported Date': reported})
    df = df.append(temp_df, sort=False).reset_index(drop=True)
    print('pass1')

df.to_csv('BirchboxReviews2.0.csv', index=False, encoding='utf-8')
print('excel done')
Basically you are dealing with a website that is dynamically loaded via JavaScript once the page loads, and the comments are rendered with JS code on each scroll down.
I've been able to navigate to the XHR request that obtains the comments, call it, and retrieve all the comments you asked for.
You don't need to use Selenium, as it will slow down your task.
Here is how you can achieve your target, assuming that each page includes 3 comments, so we just do the math to cover all the pages.
import requests
from bs4 import BeautifulSoup
import math

def PageNum():
    r = requests.get(
        "https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create")
    soup = BeautifulSoup(r.text, 'html.parser')
    num = int(
        soup.find("a", class_="show-more-reviews").text.split(" ")[3][1:-1])
    if num % 3 == 0:
        return int(num / 3) + 1  # cast to int so range() below accepts it
    else:
        return math.ceil(num / 3) + 2

def Main():
    num = PageNum()
    headers = {
        'X-Requested-With': 'XMLHttpRequest'
    }
    with requests.Session() as req:
        for item in range(1, num):
            print(f"Extracting Page# {item}")
            r = req.get(
                f"https://boxes.mysubscriptionaddiction.com/get_user_reviews?box_id=105&page={item}", headers=headers)
            soup = BeautifulSoup(r.text, 'html.parser')
            for com in soup.findAll("div", class_=r'\"comment-body\"'):
                print(com.text[5:com.text.find(r"\n", 3)])

Main()
Sample of the output:
Number of Pages 49
Extracting Page# 1
****************************************
I think Boxycharm overall is the best beauty subscription. However, I think it's
ridiculous that if you want to upgrade you have to pay the 25 for the first box and then add additional money to get the premium. Even though it's only one time,
that's insane. So about 80 bucks just to switch to Premium. And suppose U do that and then my Boxy Premium shows up at my door. I open it ....and absolutely hate
the majority if everything I have. Yeah I would be furious! Not worth taking a chance on. Boxy only shows up half the time with actual products or colors I use.
I love getting the monthly boxes, just wish they would have followed my preferences for colors!
I used to really get excited for my boxes. But not so much anymore. This months
Fenty box choices lack! I am not a clown
Extracting Page# 2
****************************************
Love it its awsome
Boxycharm has always been a favorite subscription box, I’ve had it off and on , love most of the goodies. I get frustrated when they don’t curate it to fit me and or customer service isn’t that helpful but overall a great box’!
I like BoxyCharm but to be honest I feel like some months they don’t even look at your beauty profile because I sometimes get things I clearly said I wasn’t interested in getting.
Extracting Page# 3
****************************************
The BEST sub box hands down.
I love all the boxy charm boxes everything is amazing all full size products and
the colors are outstanding
I absolutely love Boxycharm. I have received amazing high end products. My makeup cart is so full I have such a variety everyday. I love the new premium box and paired with Boxyluxe I recieve 15 products for $85 The products are worth anywhere from $500 to $700 total. I used to spend $400 a month buying products at Ulta. I would HIGHLY recommend this subscription.
Also, I have worked out the code for your website. It uses Selenium for button clicks and scrolling; do let me know if you have any doubts. I still suggest you go through the article first:
# -*- coding: utf-8 -*-
"""
Created on Sun Mar 8 18:09:45 2020
@author: prakharJ
"""
from selenium import webdriver
import time
import pandas as pd

names_found = []
comments_found = []
ratings_found = []
dateElements_found = []

# Web extraction of web page boxes
print("scheduled to run boxesweb scrapper ")
driver = webdriver.Chrome(executable_path='Your/path/to/chromedriver.exe')
webpage = 'https://boxes.mysubscriptionaddiction.com/box/boxycharm?ratings=true#review-update-create'
driver.get(webpage)

SCROLL_PAUSE_TIME = 6

# Get scroll height
last_height = driver.execute_script("return document.body.scrollHeight")

while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight*0.80);")
    time.sleep(SCROLL_PAUSE_TIME)
    try:
        b = driver.find_element_by_class_name('show-more-reviews')
        b.click()
        time.sleep(SCROLL_PAUSE_TIME)
    except Exception:
        s = 'no button'
    # Calculate new scroll height and compare with last scroll height
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

names_list = driver.find_elements_by_class_name('name')
comment_list = driver.find_elements_by_class_name('comment-body')
rating_list = driver.find_elements_by_xpath("//meta[@itemprop='ratingValue']")
date_list = driver.find_elements_by_class_name('comment-date')

for names in names_list:
    names_found.append(names.text)
for bodies in comment_list:
    try:
        comments_found.append(bodies.text)
    except:
        comments_found.append('NA')
for ratings in rating_list:
    try:
        ratings_found.append(ratings.get_attribute("content"))
    except:
        ratings_found.append('NA')
for dateElements in date_list:
    dateElements_found.append(dateElements.text)

# Create your temporary dataframe of the first iteration, then append that into your "final" dataframe
temp_df = pd.DataFrame({'User Name': names_found, 'Body': comments_found, 'Rating': ratings_found, 'Published Date': dateElements_found})
#df = df.append(temp_df, sort=False).reset_index(drop=True)

print('extraction completed for the day and system goes into sleep mode')
driver.quit()

python - How would i scrape this website for specific data that's constantly changing/being updated?

the website is:
https://pokemongo.gamepress.gg/best-attackers-type
my code is as follows, for now:
from bs4 import BeautifulSoup
import requests
import re
site = 'https://pokemongo.gamepress.gg/best-attackers-type'
page_data = requests.get(site, headers=headers)
soup = BeautifulSoup(page_data.text, 'html.parser')
check_gamepress = soup.body.findAll(text=re.compile("Strength"))
print(check_gamepress)
However, I really want to scrape certain data, and I'm really having trouble.
For example, how would I scrape the portion that shows the following for the best Bug type:
"Good typing and lightning-fast attacks. Though cool-looking, Scizor is somewhat fragile."
This information could obviously be updated, as it has been in the past when a better Pokemon came out for that type. So how would I scrape this data, given that it'll probably be updated in the future, without me having to make code changes when that occurs?
In advance, thank you for reading!
This particular site is a bit tough because of how the HTML is organized. The relevant tags containing the information don't really have many distinguishing features, so we have to get a little clever. To make matters more complicated, the divs that contain the information across the whole page are all siblings. We'll also have to make up for this web-design travesty with some ingenuity.
I did notice a pattern that is (almost entirely) consistent throughout the page. Each 'type' and underlying section are broken into 3 divs:
A div containing the type and pokemon, for example Dark Type: Tyranitar.
A div containing the 'specialty' and moves.
A div containing the 'ratings' and commentary.
The basic idea that follows here is that we can begin to organize this markup chaos through a procedure that loosely goes like this:
Identify each of the type title divs
For each of those divs, get the other two divs by accessing its siblings
Parse the information out of each of those divs
With this in mind, I produced a working solution. The meat of the code consists of 5 functions. One to find each section, one to extract the siblings, and three functions to parse each of those divs.
import re
import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup

def type_section(tag):
    """Find the tags that has the move type and pokemon name"""
    pattern = r"[A-z]{3,} Type: [A-z]{3,}"
    # if all these things are true, it should be the right tag
    return all((tag.name == 'div',
                len(tag.get('class', '')) == 1,
                'field__item' in tag.get('class', []),
                re.findall(pattern, tag.text),
                ))

def parse_type_pokemon(tag):
    """Parse out the move type and pokemon from the tag text"""
    s = tag.text.strip()
    poke_type, pokemon = s.split(' Type: ')
    return {'type': poke_type, 'pokemon': pokemon}

def parse_speciality(tag):
    """Parse the tag containing the speciality and moves"""
    table = tag.find('table')
    rows = table.find_all('tr')
    speciality_row, fast_row, charge_row = rows
    speciality_types = []
    for anchor in speciality_row.find_all('a'):
        # Each type 'badge' has a href with the type name at the end
        href = anchor.get('href')
        speciality_types.append(href.split('#')[-1])
    fast_move = fast_row.find('td').text
    charge_move = charge_row.find('td').text
    return {'speciality': speciality_types,
            'fast_move': fast_move,
            'charge_move': charge_move}

def parse_rating(tag):
    """Parse the tag containing categorical ratings and commentary"""
    table = tag.find('table')
    category_tags = table.find_all('th')
    strength_tag, meta_tag, future_tag = category_tags
    str_rating = strength_tag.parent.find('td').text.strip()
    meta_rating = meta_tag.parent.find('td').text.strip()
    future_rating = future_tag.parent.find('td').text.strip()
    blurb_tags = table.find_all('td', {'colspan': '2'})
    if blurb_tags:
        # `if` to accomodate fire section bug
        str_blurb_tag, meta_blurb_tag, future_blurb_tag = blurb_tags
        str_blurb = str_blurb_tag.text.strip()
        meta_blurb = meta_blurb_tag.text.strip()
        future_blurb = future_blurb_tag.text.strip()
    else:
        str_blurb = None; meta_blurb = None; future_blurb = None
    return {'strength': {
                'rating': str_rating,
                'commentary': str_blurb},
            'meta': {
                'rating': meta_rating,
                'commentary': meta_blurb},
            'future': {
                'rating': future_rating,
                'commentary': future_blurb}
            }

def extract_divs(tag):
    """
    Get the divs containing the moves/ratings
    determined based on sibling position from the type tag
    """
    _, speciality_div, _, rating_div, *_ = tag.next_siblings
    return speciality_div, rating_div

def main():
    """All together now"""
    url = 'https://pokemongo.gamepress.gg/best-attackers-type'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    types = {}
    for type_tag in soup.find_all(type_section):
        type_info = {}
        type_info.update(parse_type_pokemon(type_tag))
        speciality_div, rating_div = extract_divs(type_tag)
        type_info.update(parse_speciality(speciality_div))
        type_info.update(parse_rating(rating_div))
        type_ = type_info.get('type')
        types[type_] = type_info
    pprint(types)  # We did it
    with open('pokemon.json', 'w') as outfile:
        json.dump(types, outfile)
There is, for now, one small wrench in the whole thing. Remember when I said this pattern was almost entirely consistent? Well, the Fire type is an odd-ball here, because they included two pokemon for that type, so the Fire type results are not correct. I or some brave person may come up with a way to deal with that. Or maybe they'll decide on one fire pokemon in the future.
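If someone does want to patch it, one possible direction (untested, and it assumes the two fire pokemon appear in the title div separated by something like '/' or ',') is to make parse_type_pokemon return a list of names instead of a single string:
import re

def parse_type_pokemon(tag):
    """Variant that tolerates more than one pokemon per type (assumes a '/' or ',' separator)."""
    s = tag.text.strip()
    poke_type, pokemon = s.split(' Type: ')
    # "Moltres / Entei"-style strings become a list; single names become a one-item list.
    pokemon_list = [p.strip() for p in re.split(r'[/,]', pokemon) if p.strip()]
    return {'type': poke_type, 'pokemon': pokemon_list}
Downstream code would then need to handle 'pokemon' being a list, so treat this strictly as a starting point.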
This code, the resulting json (prettified), and an archive of HTML response used can be found in this gist.
