How to perform paging to scrape quotes over several pages? - python

I'm looking to scrape the website 'https://quotes.toscrape.com/' and retrieve, for each quote, the author's full name, date of birth, and location of birth. There are 10 pages of quotes. To retrieve the author's date of birth and location of birth, one must follow the 'about' link (<a href>) next to the author's name.
Functionally speaking, I need to scrape 10 pages of quotes, follow each quote author's 'about' link to retrieve the data mentioned above, and then compile this data into a list or dict, without duplicates.
I can complete some of these tasks separately, but I am new to BeautifulSoup and Python and am having trouble putting them all together. My success so far is limited to retrieving the authors' info from the quotes on page 1; I am unable to properly assign the function's results to a variable (without an erroneous in-function print statement), and unable to implement the 10-page scan. Any help is greatly appreciated.
import requests
from bs4 import BeautifulSoup

def get_author_dob(url):
    response_auth = requests.get(url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth, 'html.parser')
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return [auth_tag.text]

def get_author_bplace(url):
    response_auth2 = requests.get(url)
    html_auth2 = response_auth2.content
    auth_soup2 = BeautifulSoup(html_auth2, 'html.parser')
    auth_tag2 = auth_soup2.find("span", class_="author-born-location")
    return [auth_tag2.text]

url = 'http://quotes.toscrape.com/'
html = requests.get(url).content  # fetch the page (this line was missing from the snippet)
soup = BeautifulSoup(html, 'html.parser')
tag = soup.find_all("div", class_="quote")
def auth_retrieval(url):
    for t in tag:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
I need to use 'return' in the above function to be able to assign the results to a variable, but when I do, it only returns one value. I have tried the generator route with yield, but I am confused about how to implement the counter when I am already iterating over tag. I am also unsure where and how to insert the 10-page scan task. Thanks in advance.
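As an aside, the fix for the return problem is the pattern both answers below rely on: collect each result in a list and return the list once, after the loop. A tiny, self-contained sketch with placeholder names (not the actual scraping code):

def collect_items(items):
    results = []
    for item in items:
        results.append(item.upper())  # stand-in for the real per-quote work
    return results                    # single return, after the loop

print(collect_items(["a", "b", "c"]))  # ['A', 'B', 'C']

# A generator version would replace the append with `yield item.upper()`,
# and the caller would wrap it: list(collect_items(["a", "b", "c"]))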

You are on the right track, but you could simplify the process a bit:
Use a while loop and check whether a next button is available to perform the paging. This also works if the number of pages is not known, and you could still break out after a specific number of pages if needed.
Reduce the number of requests and scrape the available and necessary information in one go.
Picking up a bit more data than you need is not a problem; you can easily filter it afterwards to reach your goal: df[['author','dob','lob']].drop_duplicates() (a short example of this follows the output below).
Store the information in a structured way, like a dict, instead of in single variables.
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_author(url):
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    author = {
        'dob': soup.select_one('.author-born-date').text,
        'lob': soup.select_one('.author-born-location').text,
        'url': url
    }
    return author

base_url = 'http://quotes.toscrape.com'
url = base_url
quotes = []

while True:
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    for e in soup.select('div.quote'):
        quote = {
            'author': e.select_one('small.author').text,
            'quote': e.select_one('span.text').text
        }
        quote.update(get_author(base_url + e.a.get('href')))
        quotes.append(quote)
    if soup.select_one('li.next a'):
        url = base_url + soup.select_one('li.next a').get('href')
        print(url)
    else:
        break

pd.DataFrame(quotes)
Output
    | author             | quote                                                                                                                | dob                | lob                                                         | url
0   | Albert Einstein    | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | March 14, 1879     | in Ulm, Germany                                             | http://quotes.toscrape.com/author/Albert-Einstein
1   | J.K. Rowling       | “It is our choices, Harry, that show what we truly are, far more than our abilities.”                               | July 31, 1965      | in Yate, South Gloucestershire, England, The United Kingdom | http://quotes.toscrape.com/author/J-K-Rowling
... | ...                | ...                                                                                                                  | ...                | ...                                                         | ...
98  | Dr. Seuss          | “A person's a person, no matter how small.”                                                                          | March 02, 1904     | in Springfield, MA, The United States                       | http://quotes.toscrape.com/author/Dr-Seuss
99  | George R.R. Martin | “... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”                                    | September 20, 1948 | in Bayonne, New Jersey, The United States                   | http://quotes.toscrape.com/author/George-R-R-Martin
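As mentioned above, the duplicate-free author information can then be filtered out of that DataFrame; a short usage sketch, assuming the code above has run and its result is bound to df:

df = pd.DataFrame(quotes)

# one row per author, without the quote text and without duplicates
authors = df[['author', 'dob', 'lob']].drop_duplicates().reset_index(drop=True)
print(authors)

# or as a plain list of dicts
author_records = authors.to_dict('records')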

Your code is almost working and just needs a bit of refactoring.
One thing I found out was that you could access individual pages using this URL pattern,
https://quotes.toscrape.com/page/{page_number}/
Now, once you've figured that out, we can take advantage of this pattern in the code:
# refactored auth_retrieval into this function for reusability
def get_page_data(base_url, tags):
    all_authors = []
    for t in tags:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = base_url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
        all_authors.append(authorss)
    return all_authors

url = 'https://quotes.toscrape.com/'  # base url for the website
total_pages = 10

all_page_authors = []
for i in range(1, total_pages + 1):  # range stops before the end value, so +1 to reach page 10
    page_url = f'{url}page/{i}/'  # https://quotes.toscrape.com/page/1, 2, ... 10
    print(page_url)
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    tags = soup.find_all("div", class_="quote")
    all_page_authors += get_page_data(url, tags)  # merge all authors into one list
print(all_page_authors)
get_author_dob and get_author_bplace remain the same.
The final output will be a list of authors, where each author's info is itself a list:
[['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],
['J.K. Rowling', 'July 31, 1965', 'in Yate, South Gloucestershire, England, The United Kingdom'],
['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],...]
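Since the question asks for the result without duplicates, and the same author appears once per quote, one way to deduplicate this list of lists afterwards (a small sketch, not part of the original answer):

# dict.fromkeys keeps the first occurrence and preserves order;
# the inner lists are converted to tuples so they are hashable.
unique_authors = [list(t) for t in dict.fromkeys(tuple(a) for a in all_page_authors)]
print(unique_authors)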

Related

Regex - Extracting PubMed publications via Beautiful Soup, identify authors from my list that appear in PubMed article, and add bold HTML tags

I'm working on a project where we are web-scraping PubMed research abstracts and detecting whether any researchers from our organization have authorship on any new publications. When we detect a match, we want to add a bold HTML tag. For example, you might see something like this in PubMed: Sanjay Gupta 1 2 3, Mehmet Oz 3 4, Terry Smith 2 4 (the numbers denote their academic affiliation, which corresponds to a different field, but I've left this out for simplicity). If Mehmet Oz and Sanjay Gupta were in my list, I would add a bold tag before their first name and a closing tag at the end of their name.
One of my challenges with PubMed is that the authors sometimes only show their first and last name, while other times a middle initial is included (e.g., Sanjay K Gupta versus just Sanjay Gupta). In my list of people, I only have first and last names. What I tried to do is import my list of names, split first and last name, and then bold them in the list of authors. The problem is that my code will bold anyone with a matching first name or anyone with a matching last name (for example, in Sanjay Smith 1 2 3, Sanjay Gupta 1 3 4, Wendy Gupta 4 5 6, Linda Oz 4, Mehmet Jones 5, Mehmet Oz 1 4 6, every name gets bolded). I realize the flaw in my code, but I'm struggling with how to get around this. Any help is appreciated.
Bottom line: I have a list of people by first name and last name, I want to find their publications in PubMed and bold their names in the author credits. PubMed sometimes lists just their first and last name, but sometimes includes a middle initial.
To make things easier, I denoted the section in all caps for the part in my code where I need help.
import time
import requests
import re
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup

all_pmids = []
out = []

base_urls = ['https://pubmed.ncbi.nlm.nih.gov/?term=sanjay+gupta&filter=years.2021-2021','https://pubmed.ncbi.nlm.nih.gov/?term=AHRQ+Diabetes+telehealth&filter=years.2016-2016', 'https://pubmed.ncbi.nlm.nih.gov/?term=mehmet+oz&filter=years.2020-2020']
author_list = ['Mehmet Oz', 'Sanjay Gupta', 'Ken Jeong', 'Susie Bates', 'Vijay Singh', 'Cynthia Berg']

for search_url in base_urls:
    response = requests.get(search_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    pmids = soup.find_all('span', {'class': 'docsum-pmid'})
    for p in pmids:
        p = p.get_text()
        all_pmids.append(p) if p not in all_pmids else print(p + ' already in list, skipping')

for pmid in all_pmids:
    url = 'https://pubmed.ncbi.nlm.nih.gov/' + pmid
    response2 = requests.get(url)
    soup2 = BeautifulSoup(response2.content, 'html.parser')
    title = soup2.select('h1.heading-title')[0].text.strip() if soup2.find(class_='item-list') is not None else ''
    #THIS IS THE START OF THE SECTION I NEED HELP WITH
    authors = soup2.find(class_='authors-list').get_text(' ') if soup2.find(class_='authors-list') is not None else ''
    authors = authors.rstrip() if soup2.find(class_='authors-list') is not None else ''
    authors = " ".join(authors.split()) if soup2.find(class_='authors-list') is not None else ''
    for au in author_list:
        au_l = au.split()[1] + ' '
        au_f = au.split()[0] + ' '
        authors = re.sub(au_f, '<b>' + au_f, authors) if '<b>' + au_f not in authors else authors
        authors = re.sub(au_l, au_l + '</b>', authors) if '</b>' + au_l not in authors else authors
    #THIS IS THE END OF THE SECTION I NEED HELP WITH
    data = {'title': title, 'authors': authors}
    time.sleep(5)
    out.append(data)

df = pd.DataFrame(out)
df.to_excel('my_output.xlsx')
Here is the modification that needs to be done in the section you want help with.
Here is the algorithm:
Create a list of authors by splitting on ','.
For each name in author_list, check whether both au_f and au_l are present in the author entry.
If true, wrap the entry in <b> tags.
#THIS IS THE START OF THE SECTION I NEED HELP WITH
authors = None
if (authors_html := soup2.find(class_='authors-list')):
    authors = authors_html.get_text(' ')
if not authors:
    continue
authors = " ".join(authors.rstrip().split()).split(",")
for au in author_list:
    au_f, au_l = au.split()
    for i in range(len(authors)):
        if au_f in authors[i] and au_l in authors[i]:
            authors[i] = f"<b>{authors[i]}</b>"
#THIS IS THE END OF THE SECTION I NEED HELP WITH
data = {'title': title, 'authors': ",".join(authors)}
Also, made some minor updates to improve readability.
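To see why this handles the middle-initial case the question raises, here is a small, self-contained check with made-up author strings (the names and affiliation numbers are only examples):

author_list = ['Mehmet Oz', 'Sanjay Gupta']
authors = "Sanjay Smith 1 2 3, Sanjay K Gupta 1 3 4, Wendy Gupta 4 5 6, Mehmet Oz 1 4 6".split(",")

for au in author_list:
    au_f, au_l = au.split()
    for i in range(len(authors)):
        # both first AND last name must appear in the same entry, so 'Sanjay Smith'
        # and 'Wendy Gupta' are left alone, while 'Sanjay K Gupta' still matches
        # despite the middle initial
        if au_f in authors[i] and au_l in authors[i]:
            authors[i] = f"<b>{authors[i]}</b>"

print(",".join(authors))
# Sanjay Smith 1 2 3,<b> Sanjay K Gupta 1 3 4</b>, Wendy Gupta 4 5 6,<b> Mehmet Oz 1 4 6</b>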

Web scraping with Python/BeautifulSoup: Site with multiple links to profiles > needing profile contents

For my Master's thesis I want to send a questionnaire to as many people as possible in the field (Early Childhood Education), so my goal is to scrape emails of daycare centres (KiTa) from a public site. I am very new to Python, so while this seems trivial to most, it has proven to be quite a challenge at my level of knowledge. I'm also not familiar with the lingo, so I don't even know what I need to look for.
This is the site (German): https://www.kitanetz.de/
To get to the content I want, I first have to select a state ("Bundesland") and am directed to the next level, where I need to click "Kreise auflisten" ("list districts"). That leads to the next level, where all the small districts inside the state are listed. Every link opens the next level of pages with postal codes and profile links. Some of those profiles have emails, some don't (I found tutorials that make that a non-issue).
It took me two days to scrape postal codes and names of the centres from one of those pages. What do I need to do so Python can iterate through every state, every district and every profile to get to the links? If you know a resource or a keyword I should look for, that would be a great next step. I also haven't tried to put the data from this code into a dataframe using pandas yet, but my other attempts didn't work.
This is my attempt so far. I added ## to my comments/questions in the code. # are comments from the tutorial:
import requests
from bs4 import BeautifulSoup

## Here's the tutorial I was following: https://www.pluralsight.com/guides/extracting-data-html-beautifulsoup

# Step 1: Sending a HTTP request to a URL
url = requests.get("https://www.kitanetz.de/bezirke/bezirke.php?land=Baden-W%C3%BCrttemberg&kreis=Alb-Donau-Kreis")

# Step 2: Parse the html content
soup = BeautifulSoup(url.text, 'lxml')
# print(soup.prettify()) # print the parsed data of html

# Step 3: Analyze the HTML tag, where your content lives
# Create a data dictionary to store the data.
data = {}  ## it says in the tutorial, but what does that actually do?

## Get the table inside the <div id="inhalt">
table = soup.find_all('table')[0]

## Get the data you want: PLZ, Name Kita (ids) and href to profiles
plz = table.find_all('td', attrs={"headers": "header2"})
ids = table.find_all('td', attrs={"headers": "header3"})
table_data = table.find_all("tr")  ## contains 101 rows. row [0] is header, using th tags. Rows [1]:[101] use td tags

for link in table.find_all("a"):
    print("Name: {}".format(link.text))
    print("href: {}".format(link.get("href")))

# Get the headers of the list
t_headers = []
for th in table.find_all("th"):
    # remove any newlines and extra spaces from left and right
    t_headers.append(th.text.replace('\n', ' ').strip())

# Get all the rows of table
table_data = []
for tr in table.find_all('tr'):  # find all tr's from table ## no, it doesn't
    t_row = {}
    # Each table row is stored in the form of
    ## t_row = {'.': '', 'PLZ': '', 'Name Kita': '', 'Alter': '', 'Profil': ''}
    ## we want: t_row = {'PLZ': '', 'Name Kita': '', 'EMail': ''}. Emails are stored in the hrefs -> next layer
    ## how do I get my plz, ids and hrefs in one dataframe? I'd know in R but this here works different.
    # find all td's(3) in tr and zip it with t_header
    for td, th in zip(tr.find_all("td"), t_headers):
        t_row[th] = td.text.replace('\n', '').strip()
    table_data.append(t_row)
You can use the site's sitemap.xml to get all links to profiles. When you have all links, then it's just simple parsing:
import re
import requests
from bs4 import BeautifulSoup

url = 'https://www.kitanetz.de/sitemap.xml'
sitemap = BeautifulSoup(requests.get(url).content, 'html.parser')

r = re.compile(r'/\d+/[^/]+\.php')
for loc in sitemap.select('loc'):
    if r.search(loc.text):
        html_data = requests.get(loc.text).text
        soup = BeautifulSoup(html_data, 'html.parser')
        title = soup.h1.text
        email = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)
        if email:
            email = email[1] + '#' + email[2] + '.' + email[3]
        else:
            email = '-'
        print('{:<60} {:<35} {}'.format(title, email, loc.text))
Prints:
Evangelisch-lutherische Kindertagessstätte Lemförde kts.lemfoerde#evlka.de https://www.kitanetz.de/niedersachsen/49448/stettiner-str-43b.php
Kindertagesstätte Stuhr I kiga.stuhr#stuhr.de https://www.kitanetz.de/niedersachsen/28816/stuhrer-landstrasse33a.php
Kita St. Bonifatius (Frankestraße) frankestr#kath-kita-wunstorf.de https://www.kitanetz.de/niedersachsen/31515/frankestrasse11.php
Ev. Kita Ketzin ektketzin.wagenschuetz#arcor.de https://www.kitanetz.de/brandenburg/14669/rathausstr17.php
Humanistische Kindertagesstätte `Die kleinen Strolche´ strolche#humanisten.de https://www.kitanetz.de/niedersachsen/30823/auf_der_horst115.php
Kindertagesstätte Idensen kita.idensen#wunstorf.de https://www.kitanetz.de/niedersachsen/31515/an_der_sigwardskirche2.php
Kindergroßtagespflege `Nesthäkchen´ nesthaekchen-isernhagen#gmx.de https://www.kitanetz.de/niedersachsen/30916/am_rathfeld4.php
Venhof Kindertagesstätte venhof#t-online.de https://www.kitanetz.de/niedersachsen/31515/schulstrasse14.php
Kindergarten Uetze `Buddelkiste´ buddelkiste#uetze.de https://www.kitanetz.de/niedersachsen/31311/eichendorffstrasse2b.php
Kita Lindenblüte m.herzog#lebenshilfe-dh.de https://www.kitanetz.de/niedersachsen/27232/lindern17.php
DRK Kita Luthe kita.luthe#drk-hannover.de https://www.kitanetz.de/niedersachsen/31515/an_der_boehmerke7.php
Freier Kindergarten Allerleirauh info#kindergarten-allerleirauh.de https://www.kitanetz.de/niedersachsen/31303/dachtmisser_weg3.php
Ev.-luth. Kindergarten St. Johannis johannis.bs.kita#lk-bs.de https://www.kitanetz.de/niedersachsen/38102/leonhardstr40.php
Kindertagesstätte Immensen-Arpke I kita.immensen#htp-tel.de https://www.kitanetz.de/niedersachsen/31275/am_schnittgraben15.php
SV Mörsen-Scharrendorf Mini-Club svms-mini-club#freenet.de https://www.kitanetz.de/niedersachsen/27239/am-sportheim6.php
Kindergarten Transvaal kiga-transvaal#awo-emden.de https://www.kitanetz.de/niedersachsen/26723/althusiusstr89.php
Städtische Kindertagesstätte Gartenstadt kita.gartenstadt#braunschweig.de https://www.kitanetz.de/niedersachsen/38122/wurmbergstr48.php
Kindergruppe Till Eulenspiegel e.V. - Bärenbande & Windelrocker tilleulenspiegel-bs#gmx.de https://www.kitanetz.de/niedersachsen/38102/kurt-schumacher-str7.php
Ev. luth. Kindertagesstätte der Versöhnun kts.versoehnung-garbsen#evlka.de https://www.kitanetz.de/niedersachsen/30823/im_alten_dorfe6.php
Kinderkrippe Ratzenspatz ratzenspatz#kila-ini.de https://www.kitanetz.de/niedersachsen/31535/am_goetheplatz5.php
Kinderkrippe Hemmingen-Westerfeld krippe-hw#stadthemmingen.de https://www.kitanetz.de/niedersachsen/30966/berliner_strasse16-22.php
... and so on.
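Since the question also mentions wanting the data in a pandas DataFrame, here is a hedged variation of the loop above that collects rows instead of printing them (same parsing as the answer; the column names are only suggestions):

import pandas as pd

rows = []
for loc in sitemap.select('loc'):
    if r.search(loc.text):
        html_data = requests.get(loc.text).text
        soup = BeautifulSoup(html_data, 'html.parser')
        email = re.search(r"ez='(.*?)'.*?ey='(.*?)'.*?ex='(.*?)'", html_data, flags=re.S)
        rows.append({
            'name': soup.h1.text,
            'email': email[1] + '#' + email[2] + '.' + email[3] if email else '-',
            'url': loc.text,
        })

df = pd.DataFrame(rows)
df.to_csv('kitas.csv', index=False)  # or keep working with df directly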

Removing new line characters in web scrape

I'm trying to scrape baseball lineup data but would only like to return the player names. However, as of right now it is giving me the position, a newline character, the name, another newline character, and then the batting side. For example, I want
'D. Fletcher'
but instead I get
'LF\nD. Fletcher\nR'
Additionally, it is giving me all players on the page. It would be preferable to group them by team, which maybe requires a dictionary setup of some sort, but I am not sure what that code would look like.
I've tried using the strip function, but I believe that only removes leading or trailing characters as opposed to ones in the middle. I've tried researching how to get just the title information from the anchor tag but have not figured out how to do that.
from bs4 import BeautifulSoup
import requests

url = 'https://www.rotowire.com/baseball/daily_lineups.htm'
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")

players = soup.find_all('li', {'class': 'lineup__player'})
####for link in players.find('a'):
#####    print (link.string)
awayPlayers = [player.text.strip() for player in players]
print(awayPlayers)
You should only get the .text for the a tag, not the whole li:
awayPlayers = [player.find('a').text.strip() for player in players]
That would result in something like the following:
['L. Martin', 'Jose Ramirez', 'J. Luplow', 'C. Santana', ...
Say you wanted to build that dict with team names and players, you could do something like the following. I don't know if you want the highlighted players, e.g. Trevor Bauer? I have added variables to hold them in case they are needed.
Ad boxes and tool boxes are excluded via the :not pseudo-class, which is passed a list of classes to ignore.
from bs4 import BeautifulSoup as bs
import requests

r = requests.get('https://www.rotowire.com/baseball/daily-lineups.php')
soup = bs(r.content, 'lxml')

team_dict = {}
teams = [item.text for item in soup.select('.lineup__abbr')]  # 26

matches = {}
i = 0

for teambox in soup.select('.lineups > div:not(.is-ad, .is-tools)'):
    team_visit = teams[i]
    team_home = teams[i + 1]
    highlights = teambox.select('.lineup__player-highlight-name a')
    visit_highlight = highlights[0].text
    home_highlight = highlights[1].text
    match = team_visit + ' v ' + team_home
    visitors = [item['title'] for item in teambox.select('.is-visit .lineup__player [title]')]
    home = [item['title'] for item in teambox.select('.is-home .lineup__player [title]')]
    matches[match] = {'visitor': [{team_visit: visitors}],
                      'home': [{team_home: home}]}
    i += 2  # each game box consumes two abbreviations: visiting team, then home team
I think you were almost there, you just needed to tweak it a little bit:
awayPlayers = [player.find('a').text for player in players]
This list comprehension grabs each player from the list and pulls the text from the anchor, so you get just a list of the names:
['L. Martin',
'Jose Ramirez',
'J. Luplow'...]
You have to find the a tag and get its title attribute; see the code below.
awayPlayers = [player.find('a').get('title') for player in players]
print(awayPlayers)
Output is:
['Leonys Martin', 'Jose Ramirez', 'Jordan Luplow', 'Carlos Santana',

python - How would I scrape this website for specific data that's constantly changing/being updated?

the website is:
https://pokemongo.gamepress.gg/best-attackers-type
my code is as follows, for now:
from bs4 import BeautifulSoup
import requests
import re

headers = {'User-Agent': 'Mozilla/5.0'}  # headers was undefined in the original snippet; any browser-like value works

site = 'https://pokemongo.gamepress.gg/best-attackers-type'
page_data = requests.get(site, headers=headers)
soup = BeautifulSoup(page_data.text, 'html.parser')
check_gamepress = soup.body.findAll(text=re.compile("Strength"))
print(check_gamepress)
However, I really want to scrape certain data, and I'm really having trouble.
For example, how would I scrape the portion that shows the following for the best Bug type:
"Good typing and lightning-fast attacks. Though cool-looking, Scizor is somewhat fragile."
This information could obviously be updated, as it has been in the past, when a better Pokemon comes out for that type. So how would I scrape this data, given that it will probably be updated in the future, without having to make code changes when that happens?
In advance, thank you for reading!
This particular site is a bit tough because of how the HTML is organized. The relevant tags containing the information don't really have many distinguishing features, so we have to get a little clever. To make matters more complicated, the divs that contain the information across the whole page are siblings. We'll also have to make up for this web-design travesty with some ingenuity.
I did notice a pattern that is (almost entirely) consistent throughout the page. Each 'type' and underlying section are broken into 3 divs:
A div containing the type and pokemon, for example Dark Type: Tyranitar.
A div containing the 'specialty' and moves.
A div containing the 'ratings' and commentary.
The basic idea that follows here is that we can begin to organize this markup chaos through a procedure that loosely goes like this:
Identify each of the type title divs
For each of those divs, get the other two divs by accessing its siblings
Parse the information out of each of those divs
With this in mind, I produced a working solution. The meat of the code consists of five functions: one to find each section, one to extract the siblings, and three to parse each of those divs.
import re
import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup


def type_section(tag):
    """Find the tags that have the move type and pokemon name"""
    pattern = r"[A-z]{3,} Type: [A-z]{3,}"
    # if all these things are true, it should be the right tag
    return all((tag.name == 'div',
                len(tag.get('class', '')) == 1,
                'field__item' in tag.get('class', []),
                re.findall(pattern, tag.text),
                ))


def parse_type_pokemon(tag):
    """Parse out the move type and pokemon from the tag text"""
    s = tag.text.strip()
    poke_type, pokemon = s.split(' Type: ')
    return {'type': poke_type, 'pokemon': pokemon}


def parse_speciality(tag):
    """Parse the tag containing the speciality and moves"""
    table = tag.find('table')
    rows = table.find_all('tr')
    speciality_row, fast_row, charge_row = rows
    speciality_types = []
    for anchor in speciality_row.find_all('a'):
        # Each type 'badge' has a href with the type name at the end
        href = anchor.get('href')
        speciality_types.append(href.split('#')[-1])
    fast_move = fast_row.find('td').text
    charge_move = charge_row.find('td').text
    return {'speciality': speciality_types,
            'fast_move': fast_move,
            'charge_move': charge_move}


def parse_rating(tag):
    """Parse the tag containing categorical ratings and commentary"""
    table = tag.find('table')
    category_tags = table.find_all('th')
    strength_tag, meta_tag, future_tag = category_tags
    str_rating = strength_tag.parent.find('td').text.strip()
    meta_rating = meta_tag.parent.find('td').text.strip()
    future_rating = future_tag.parent.find('td').text.strip()
    blurb_tags = table.find_all('td', {'colspan': '2'})
    if blurb_tags:
        # `if` to accommodate fire section bug
        str_blurb_tag, meta_blurb_tag, future_blurb_tag = blurb_tags
        str_blurb = str_blurb_tag.text.strip()
        meta_blurb = meta_blurb_tag.text.strip()
        future_blurb = future_blurb_tag.text.strip()
    else:
        str_blurb = meta_blurb = future_blurb = None
    return {'strength': {
                'rating': str_rating,
                'commentary': str_blurb},
            'meta': {
                'rating': meta_rating,
                'commentary': meta_blurb},
            'future': {
                'rating': future_rating,
                'commentary': future_blurb}
            }


def extract_divs(tag):
    """
    Get the divs containing the moves/ratings
    determined based on sibling position from the type tag
    """
    _, speciality_div, _, rating_div, *_ = tag.next_siblings
    return speciality_div, rating_div


def main():
    """All together now"""
    url = 'https://pokemongo.gamepress.gg/best-attackers-type'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    types = {}
    for type_tag in soup.find_all(type_section):
        type_info = {}
        type_info.update(parse_type_pokemon(type_tag))
        speciality_div, rating_div = extract_divs(type_tag)
        type_info.update(parse_speciality(speciality_div))
        type_info.update(parse_rating(rating_div))
        type_ = type_info.get('type')
        types[type_] = type_info
    pprint(types)  # We did it
    with open('pokemon.json', 'w') as outfile:
        json.dump(types, outfile)


if __name__ == '__main__':
    main()
There is, for now, one small wrench in the whole thing. Remember when I said this pattern was almost entirely consistent? Well, the Fire type is an odd-ball here, because they included two pokemon for that type, so the Fire type results are not correct. I or some brave person may come up with a way to deal with that. Or maybe they'll decide on one fire pokemon in the future.
This code, the resulting json (prettified), and an archive of HTML response used can be found in this gist.
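As a usage note, the blurb the question asks about (the Bug-type commentary) can then be read back out of the saved JSON; a small sketch, assuming the structure built in main() above and that the page labels the section 'Bug':

import json

with open('pokemon.json') as infile:
    types = json.load(infile)

# the strength commentary for the Bug type, i.e. the Scizor blurb from the question
print(types['Bug']['strength']['commentary'])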

rewriting spider in OOP terms

Hope everyone is well
So I wrote some code for a spider not long ago, for a website that shows house prices and sale dates in London. I have recently decided to improve it by making it object-oriented.
The scope of the spider extends to 680 areas, such as the one given by this link:
http://www.rightmove.co.uk/house-prices/St-Johns-Wood.html
and if you click on the page, you will see that there are 40 pages for each area.
The reason I would like to do OOP on this is that I need to add a few methods that deal with updating the database it saves to, so that I do not have to run the whole spider again.
My first question is this: is it better to abstract the individual pages and treat them as objects, or to treat each of the 680 areas as an object with methods to crawl each of its 40 pages?
My second question is this:
Would someone be kind enough to show me the way that this would be implemented in OOP terms given the answer to the first question?
Below I provide code for the two jobs:
from selenium import webdriver
from bs4 import BeautifulSoup


def forty_page_getter(link):
    driver = webdriver.PhantomJS()
    driver.get(link)
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    link_box = soup.find('div', {'id': 'sliderBottom'})
    rest = link_box.find_all('a')
    links = []
    for i in rest:
        try:
            link = i.get('href')
            links.append(link)
        except:
            pass
    return links


def page_stripper(link):
    '''this function strips all the information on houses and transactions from a page,
    ready for entry into the database; now make a function for the database'''
    myopener = MyOpener()  # MyOpener and the save_*_to_db helpers are defined elsewhere
    page = myopener.open(link)
    soup = BeautifulSoup(page, 'html.parser')
    houses = soup.find_all('div', {'class': 'soldDetails'})
    for house in houses:
        try:
            address = house.a.text
        except:
            address = house.find('div', {'class': 'soldAddress'}).text
        postcode_list = address.split()[-2:]
        postcode = postcode_list[0] + ' ' + postcode_list[1]
        table = house.find('table').find_all('tr')
        bedrooms = table[0].find('td', {'class': 'noBed'}).text
        if not bedrooms:
            bedrooms = '0'
        else:
            bedrooms = bedrooms[0]
        house_key = save_house_to_db(address=address, postcode=postcode, bedrooms=bedrooms)
        for row in table[::-1]:
            price = row.find('td', {'class': 'soldPrice'}).text
            date = row.find('td', {'class': 'soldDate'}).text
            save_transactions_to_db(id=house_key, date=date, sale_price=price)
        print('saved %s to DB' % str(address))
I am somewhat confused: if we were to treat the page, or even the area link, as an object, how would that work, given that we use BeautifulSoup and selenium (for the dynamically updating scrollbar that lists the forty pages), which are themselves objects?
for example, I was considering the two following ways:
class area(BeautifulSoup):
    def __init__(self, html_code):
        ...

    def get_all_pages(self):
        ...

    def strip_all_pages(self):
        ...

# OR would it be better to do this?

class page(object):
    def __init__(self, html_code):
        ...

    def strip(self):
        ...
but I am new to classes and was wondering how to work with bs4 (and potentially another library called selenium) and make sure that it works within my own defined class.
Thanks for any help guys. I appreciate it.
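For what it's worth, here is a minimal, purely illustrative sketch of the composition approach: wrap BeautifulSoup and selenium inside your own classes instead of subclassing BeautifulSoup. The class and method names are only suggestions; it reuses the selectors and the MyOpener helper from the code above:

from bs4 import BeautifulSoup
from selenium import webdriver


class Page:
    """One results page; knows how to pull the house blocks out of its own HTML."""

    def __init__(self, html):
        self.soup = BeautifulSoup(html, 'html.parser')

    def houses(self):
        return self.soup.find_all('div', {'class': 'soldDetails'})


class Area:
    """One of the 680 areas; knows how to find its 40 page links and build Page objects."""

    def __init__(self, link):
        self.link = link

    def page_links(self):
        driver = webdriver.PhantomJS()
        try:
            driver.get(self.link)
            soup = BeautifulSoup(driver.page_source, 'html.parser')
            slider = soup.find('div', {'id': 'sliderBottom'})
            return [a.get('href') for a in slider.find_all('a') if a.get('href')]
        finally:
            driver.quit()

    def pages(self, opener):
        # `opener` is whatever fetches plain HTML, e.g. your MyOpener instance
        return [Page(opener.open(link)) for link in self.page_links()]

Each Area then owns the paging logic and each Page owns the stripping logic, which keeps the database-update methods you mentioned separate from the crawling itself.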
