Combining BeautifulSoup and json into one output - python

I have probably not explained my question well, but this is all new to me... Anyway, I need to combine these two pieces of code.
I can get the BS working, but it uses the wrong image. To get the right fields and the right image, I have to parse the json part of the website and therefore BS won't work.
The JSON parsing is here:
import json
import urllib
import requests
import re
r = urllib.urlopen('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
data = json.loads(r.read().decode('utf-8'))
for post in data['posts']:
    print post['episodeNumber']
    print post['title']
    print post['audioSource']
    print post['image']['medium']
    print post['content']
And it needs to replace the try / BeautifulSoup part here:
def get_playable_podcast(soup):
    """
    #param: parsed html page
    """
    subjects = []
    for content in soup.find_all('item'):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print "\n\nLink: ", link
            title = content.find('title')
            title = title.get_text()
            desc = content.find('itunes:subtitle')
            desc = desc.get_text()
            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')
        except AttributeError:
            continue
        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail
        }
        subjects.append(item)
    return subjects
def compile_playable_podcast(playable_podcast):
    """
    #param: list containing dicts of key/value pairs for playable podcasts
    """
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items
I have tried all sorts of variations of passing the output through to the items section, but the most common error I get is shown below. I just have no idea how to pass the data from the JSON through.
Error Type: <type 'exceptions.NameError'>
Error Contents: name 'title' is not defined
Traceback (most recent call last):
  File ".../addon.py", line 6, in <module>
    from resources.lib import thisiscriminal
  File "....resources/lib/thisiscriminal.py", line 132, in <module>
    'title': title,
NameError: name 'title' is not defined

Your JSON request should contain all the information you need. You should print json_data and take a look at what is returned and decide which parts you need.
Based on what your other code was looking for, the following code shows how you could extract some of the fields:
import requests

r = requests.get('https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1')
json_data = r.json()

items = []
for post in json_data['posts']:
    items.append([
        post['title'].encode('utf-8'),
        post['image']['thumb'],
        post['excerpt']['long'],
        post['permalink'],
    ])

for item in items:
    print item
This would give you output starting:
['Stowaway', u'https://thisiscriminal.com/wp-content/uploads/2019/07/Stowaway_art-150x150.png', u'One day in 1969, Paulette Cooper decided to see what she could get away with.', u'https://thisiscriminal.com/episode-118-stowaway-7-5-2019/']
['The Lake', u'https://thisiscriminal.com/wp-content/uploads/2019/06/Lake_art-150x150.png', u'Amanda Hamm and her boyfriend Maurice LaGrone drove to Clinton Lake one night in 2003. The next day, DeWitt County Sheriff Roger Massey told a local newspaper, \u201cWe don\u2019t want to blow this up into something that it\u2019s not. But on the other side, we\u2019ve got three children...', u'https://thisiscriminal.com/episode-117-the-lake-6-21-2019/']
['Jessica and the Bunny Ranch', u'https://thisiscriminal.com/wp-content/uploads/2019/06/Bunny_art-150x150.png', u'In our\xa0last episode\xa0we spoke Cecilia Gentili, a trans Latina who worked for many years as an undocumented sex worker. Today, we get two more views of sex work in America. We speak with a high-end escort in New York City, and take a trip to one of the...', u'https://thisiscriminal.com/episode-116-jessica-and-the-bunny-ranch-6-7-2019/']
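If the goal is to keep compile_playable_podcast() unchanged, a minimal sketch of a drop-in replacement for get_playable_podcast() that builds the same dicts from the JSON feed instead of the RSS page. The field names ('audioSource', 'excerpt', 'image') are the ones shown in the question and answer above; adjust them if the API returns something different.

import requests

def get_playable_podcast():
    """Build the same list of dicts the old BeautifulSoup version returned,
    but from the JSON feed instead of the parsed RSS page."""
    url = 'https://thisiscriminal.com/wp-json/criminal/v1/episodes?posts=10000&page=1'
    json_data = requests.get(url).json()
    subjects = []
    for post in json_data['posts']:
        subjects.append({
            'url': post['audioSource'],        # direct audio link
            'title': post['title'],
            'desc': post['excerpt']['long'],   # or post['content'], as in the question
            'thumbnail': post['image']['medium'],
        })
    return subjects

# compile_playable_podcast() can then be used unchanged:
# items = compile_playable_podcast(get_playable_podcast())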

Related

How to perform paging to scrape quotes over several pages?

I'm looking to scrape the website 'https://quotes.toscrape.com/' and retrieve for each quote, the author's full name, date of birth, and location of birth. There are 10 pages of quotes. To retrieve the author's date of birth and location of birth, one must follow the <a href 'about'> link next to the author's name.
Functionally speaking, I need to scrape 10 pages of quotes and follow each quote author's 'about' link to retrieve their data mentioned in the paragraph above ^, and then compile this data into a list or dict, without duplicates.
I can complete some of these tasks separately, but I am new to BeautifulSoup and Python and am having trouble putting them all together. My success so far is limited to retrieving the authors' info from the quotes on page 1; I am unable to properly assign the function's results to a variable (without relying on an in-function print statement), and unable to implement the 10-page scan... Any help is greatly appreciated.
def get_author_dob(url):
    response_auth = requests.get(url)
    html_auth = response_auth.content
    auth_soup = BeautifulSoup(html_auth)
    auth_tag = auth_soup.find("span", class_="author-born-date")
    return [auth_tag.text]

def get_author_bplace(url):
    response_auth2 = requests.get(url)
    html_auth2 = response_auth2.content
    auth_soup2 = BeautifulSoup(html_auth2)
    auth_tag2 = auth_soup2.find("span", class_="author-born-location")
    return [auth_tag2.text]

url = 'http://quotes.toscrape.com/'
soup = BeautifulSoup(html)
tag = soup.find_all("div", class_="quote")

def auth_retrieval(url):
    for t in tag:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
I need to use 'return' in the above function to be able to assign the results to a variable, but when I do, it only returns one value. I have tried the generator route with yield but am confused about how to implement the counter when I am already iterating over tag. I am also confused about where and how to insert the 10-page scan. Thanks in advance.
You are on the right track, but you could simplify the process a bit:
Use a while loop and check whether the next button is available to perform paging. This also works if the number of pages is not known, and you could still stop after a specific number of pages if needed.
Reduce the number of requests and scrape the available and necessary information in one go.
Picking up a bit more information than you need does no harm; you can easily filter it down to your goal with df[['author','dob','lob']].drop_duplicates().
Store information in a structured way, e.g. a dict, instead of in single variables.
Example
import pandas as pd
import requests
from bs4 import BeautifulSoup

def get_author(url):
    soup = BeautifulSoup(requests.get(url).text)
    author = {
        'dob': soup.select_one('.author-born-date').text,
        'lob': soup.select_one('.author-born-location').text,
        'url': url
    }
    return author

base_url = 'http://quotes.toscrape.com'
url = base_url
quotes = []

while True:
    soup = BeautifulSoup(requests.get(url).text)
    for e in soup.select('div.quote'):
        quote = {
            'author': e.select_one('small.author').text,
            'quote': e.select_one('span.text').text
        }
        quote.update(get_author(base_url + e.a.get('href')))
        quotes.append(quote)
    if soup.select_one('li.next a'):
        url = base_url + soup.select_one('li.next a').get('href')
        print(url)
    else:
        break

pd.DataFrame(quotes)
Output
|    | author             | quote                                                                                                                | dob                | lob                                                         | url                                                 |
|----|--------------------|----------------------------------------------------------------------------------------------------------------------|--------------------|-------------------------------------------------------------|-----------------------------------------------------|
| 0  | Albert Einstein    | “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.” | March 14, 1879     | in Ulm, Germany                                             | http://quotes.toscrape.com/author/Albert-Einstein   |
| 1  | J.K. Rowling       | “It is our choices, Harry, that show what we truly are, far more than our abilities.”                               | July 31, 1965      | in Yate, South Gloucestershire, England, The United Kingdom | http://quotes.toscrape.com/author/J-K-Rowling       |
| ...| ...                | ...                                                                                                                  | ...                | ...                                                         | ...                                                 |
| 98 | Dr. Seuss          | “A person's a person, no matter how small.”                                                                          | March 02, 1904     | in Springfield, MA, The United States                       | http://quotes.toscrape.com/author/Dr-Seuss          |
| 99 | George R.R. Martin | “... a mind needs books as a sword needs a whetstone, if it is to keep its edge.”                                    | September 20, 1948 | in Bayonne, New Jersey, The United States                   | http://quotes.toscrape.com/author/George-R-R-Martin |
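If only the de-duplicated author data is needed (as the question asks), the DataFrame can be filtered as suggested in the tips above. A small sketch, assuming the quotes list built by the while-loop shown earlier:

import pandas as pd

# Assumes `quotes` was filled by the while-loop above.
df = pd.DataFrame(quotes)
authors = df[['author', 'dob', 'lob']].drop_duplicates().reset_index(drop=True)
print(authors)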
Your code is almost working and just needs a bit of refactoring.
One thing I found out was that you could access individual pages using this URL pattern,
https://quotes.toscrape.com/page/{page_number}/
Now, once you've figured out that, we can take advantage of this pattern in the code,
# refactored auth_retrieval into this one for reusability
def get_page_data(base_url, tags):
    all_authors = []
    for t in tags:
        a = t.find("small", class_="author")
        author = [a.text]
        hrefs = t.a
        link = hrefs.get('href')
        link_url = base_url + link
        dob = get_author_dob(link_url)
        b_place = get_author_bplace(link_url)
        authorss = author + dob + b_place
        print(authorss)
        all_authors.append(authorss)
    return all_authors

url = 'https://quotes.toscrape.com/'  # base url for the website
total_pages = 10

all_page_authors = []
for i in range(1, total_pages + 1):  # range stops before the end value, so +1 reaches page 10
    page_url = f'{url}page/{i}/'  # https://quotes.toscrape.com/page/1, 2, ... 10
    print(page_url)
    page = requests.get(page_url)
    soup = BeautifulSoup(page.content, 'html.parser')
    tags = soup.find_all("div", class_="quote")
    all_page_authors += get_page_data(url, tags)  # merge all authors into one list
print(all_page_authors)
get_author_dob and get_author_bplace remain the same.
The final output will be an array of authors where each author's info is an array.
[['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],
['J.K. Rowling', 'July 31, 1965', 'in Yate, South Gloucestershire, England, The United Kingdom'],
['Albert Einstein', 'March 14, 1879', 'in Ulm, Germany'],...]
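The question also asks for results without duplicates; since each entry here is a plain list, a small sketch that de-duplicates all_page_authors while preserving order:

# Assumes all_page_authors was built by the loop above.
unique_authors = []
seen = set()
for author in all_page_authors:
    key = tuple(author)  # lists are not hashable, so use a tuple as the set key
    if key not in seen:
        seen.add(key)
        unique_authors.append(author)
print(unique_authors)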

Formatting Results from Scraper

I'm getting an error while trying to format the results of a simple Amazon scraper.
I'm trying to scrape Amazon and then create a tweet using the Twitter API. After scraping Amazon I want to format the results so I can pass them to my Twitter API call.
While trying to format them I get this error:
ERROR:
File "/Users/user/Coding/TestRequests/amazonscraper.py", line 32, in <module>
deals = tvprices[0].replace("'title'", "Product: ")
AttributeError: 'dict' object has no attribute 'replace'
CODE:
from requests_html import HTMLSession

urls = ['https://amzn.to/3PUatLc']

def getPrice(url):
    s = HTMLSession()
    r = s.get(url)
    r.html.render(sleep=1)
    try:
        product = {
            'title': r.html.xpath('//*[@id="productTitle"]', first=True).text,
            'price': r.html.xpath('//*[@id="corePriceDisplay_desktop_feature_div"]/div[1]/span[2]/span[1]', first=True).text,
            'discount': r.html.xpath('//*[@id="corePriceDisplay_desktop_feature_div"]/div[1]/span[1]', first=True).text.replace('-', '')
        }
        print(product)
    except:
        product = {
            'title': r.html.xpath('//*[@id="productTitle"]', first=True).text,
            'price': 'item unavailable'
        }
        print(product)
    return product

tvprices = []
for url in urls:
    tvprices.append(getPrice(url))

deals = tvprices[0].replace("'title'", "Product: ")
print(deals)
Any help would be appreciated. I'm just learning so this might be way more simple than I'm thinking.
Thanks all!
You can't call replace on a dictionary. If you really wanted to do something along similar lines, you could delete the existing key and add another key named Product: . However, that's not the best approach.
You might want to build another list with the formatted data instead.
from typing import List

formatted_deals: List[str] = []
for tvprice in tvprices:
    formatted_deals.append(f"Product: {tvprice['title']}")
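If the tweet should also carry the price and the discount, the same pattern extends naturally. A sketch, assuming the product dicts returned by getPrice() above; dict.get() covers the fallback dict that has no 'discount' key:

# Build one tweet-ready string per scraped product.
tweets = []
for tvprice in tvprices:
    text = f"Product: {tvprice['title']}\nPrice: {tvprice['price']}"
    discount = tvprice.get('discount')  # missing when the item was unavailable
    if discount:
        text += f"\nDiscount: {discount}"
    tweets.append(text)

for tweet in tweets:
    print(tweet)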

Parsing just first result with beautiful soup

I have the following code which succesfully pulls links, titles, etc. for podcast episodes. How would I go about just pulling the first one it comes to (i.e. the latest episode) and then immediately stop and produce just that result? Any advice would be greatly appreciated.
def get_playable_podcast(soup):
    """
    #param: parsed html page
    """
    subjects = []
    for content in soup.find_all('item'):
        try:
            link = content.find('enclosure')
            link = link.get('url')
            print "\n\nLink: ", link
            title = content.find('title')
            title = title.get_text()
            desc = content.find('itunes:subtitle')
            desc = desc.get_text()
            thumbnail = content.find('itunes:image')
            thumbnail = thumbnail.get('href')
        except AttributeError:
            continue
        item = {
            'url': link,
            'title': title,
            'desc': desc,
            'thumbnail': thumbnail
        }
        subjects.append(item)
    return subjects
def compile_playable_podcast(playable_podcast):
    """
    #param: list containing dicts of key/value pairs for playable podcasts
    """
    items = []
    for podcast in playable_podcast:
        items.append({
            'label': podcast['title'],
            'thumbnail': podcast['thumbnail'],
            'path': podcast['url'],
            'info': podcast['desc'],
            'is_playable': True,
        })
    return items
The answer of @John Gordon is completely correct.
@John Gordon pointed out that:
soup.find()
will always return the first found item (for you that's perfectly fine, since you want to scrape the latest episode).
However, imagine you wanted to select the second, third, fourth, etc. item of your BeautifulSoup results. You could do that by indexing find_all() instead:
soup.find_all('item')[0]  # works the same way as soup.find('item') and returns the first item
When you replace the 0 with any other index (e.g. 3), you get only the chosen item (in this example, the fourth) ;).
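Applied to the code in the question, a minimal sketch of both approaches: index the existing result, or parse only the first <item> with soup.find() (this sketch assumes the feed's first <item> has all the tags, since there is no try/except here):

# Option 1: keep get_playable_podcast() as it is and take the first (latest) result.
latest = get_playable_podcast(soup)[0]

# Option 2 (sketch): parse only the first <item> in the feed.
def get_latest_podcast(soup):
    content = soup.find('item')  # find() returns only the first match
    return {
        'url': content.find('enclosure').get('url'),
        'title': content.find('title').get_text(),
        'desc': content.find('itunes:subtitle').get_text(),
        'thumbnail': content.find('itunes:image').get('href'),
    }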

Can't manage to print "None" with beautifulsoup: 'NoneType' object is not subscriptable

I've been trying to find a solution in questions posted here but could not find one that gives me a solution or a similar approach to my problem. I'm very new to Python, and as a first step I wanted to learn how to scrape data from IMDB using Beautiful Soup. I want to scrape the name of the movie, the IMDB rating, and the number of votes. Some movies in the list do not have a rating or a number of votes, and for those Python throws an error instead. Thanks so much for all your comments. The complete traceback is the following:
Traceback (most recent call last):
  File "C:/Users/nmartine/PycharmProjects/ratings_ScraperMetracritic/venv/ratings_ScraperMetacritic.py", line 24, in <module>
    votes = container.find('span', attrs= {'name':'nv'})['data-value']
TypeError: 'NoneType' object is not subscriptable
I'm getting the names of the titles correctly, but I want the output to return "None" when a title does not have an IMDB rating or number of votes. This is my code so far:
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.imdb.com/search/title?release_date=2014-01-01,2018-12-31&count=250&page=3&sort=moviemeter,asc&ref_=adv_nxt'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

program_containers = html_soup.find_all('div', class_='lister-item mode-advanced')
print(len(program_containers))

for container in program_containers:
    name = container.h3.a.text
    print(name)
    if (container.strong):
        imdb = float(container.strong.text)
        print(imdb)
    else: 'None'
    votes = container.find('span', attrs={'name': 'nv'})['data-value']
    print(votes)
Hope someone can help me! Thanks!
Accessing 'data-value' appears to be causing the current issue, since find('span', attrs={'name':'nv'}) has to return a BeautifulSoup object for ['data-value'] to succeed. Instead of 'data-value', the text attribute can be used along with getattr. getattr attempts to access the text attribute on the result of find('span', attrs={'name':'nv'}); if that result is None (which has no text attribute), the None passed as getattr's third parameter is returned instead:
from bs4 import BeautifulSoup as soup
import requests, re
from typing import NamedTuple

class Movie(NamedTuple):
    title: str
    rating: str
    votes: str

def get_films(placeholder=None):
    d = soup(requests.get('https://www.imdb.com/search/title?release_date=2014-01-01,2018-12-31&count=250&page=3&sort=moviemeter,asc&ref_=adv_nxt').text, 'html.parser')
    films = [i for i in d.find_all('div', {'class': re.compile(r'lister-item[\w\W]+')})]
    final_films = [[getattr(i.find(*c), 'text', placeholder) for c in [['a'], ['strong'], ['span', {'name': 'nv'}]]] for i in films]
    return [Movie(a, b, c) for a, b, c in final_films if a != ' \n']

new_films = get_films()
First ten elements in new_films:
[Movie(title='The OA', rating='7.8', votes='54,496'), Movie(title='Parmanu: The Story of Pokhran', rating='8.5', votes='4,116'), Movie(title='Batman Ninja', rating='5.7', votes='6,847'), Movie(title='Verónica', rating='6.2', votes='20,634'), Movie(title='Set It Up', rating=None, votes=None), Movie(title='Wynonna Earp', rating='7.5', votes='11,771'), Movie(title='Spectre', rating='6.8', votes='333,593'), Movie(title='Van Helsing', rating='6.0', votes='10,719'), Movie(title='The Year of Spectacular Men', rating='6.6', votes='64'), Movie(title='The Heretics', rating='4.8', votes='298')]
Notice that for some movies on the list, rating and votes are not listed, and this solution simply provides None in its place:
[Movie(title="Tom Clancy's Jack Ryan", rating=None, votes=None)]
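For completeness, the same None handling can be written directly into the loop from the question by checking find()'s result before subscripting. A minimal sketch, assuming program_containers from the question's code:

for container in program_containers:
    name = container.h3.a.text
    # Only convert the rating if the <strong> tag is present.
    imdb = float(container.strong.text) if container.strong else None
    # Only subscript the <span name="nv"> tag if find() actually returned one.
    votes_tag = container.find('span', attrs={'name': 'nv'})
    votes = votes_tag['data-value'] if votes_tag else None
    print(name, imdb, votes)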

python - How would I scrape this website for specific data that's constantly changing/being updated?

the website is:
https://pokemongo.gamepress.gg/best-attackers-type
my code is as follows, for now:
from bs4 import BeautifulSoup
import requests
import re
site = 'https://pokemongo.gamepress.gg/best-attackers-type'
page_data = requests.get(site, headers=headers)
soup = BeautifulSoup(page_data.text, 'html.parser')
check_gamepress = soup.body.findAll(text=re.compile("Strength"))
print(check_gamepress)
However, I really want to scrape certain data, and I'm really having trouble.
For example, how would I scrape the portion that show's the following for best Bug type:
"Good typing and lightning-fast attacks. Though cool-looking, Scizor is somewhat fragile."
This information could obviously be updated as it has been in the past, when a better Pokemon comes out for that type. So, how would I scrape this data where it'll probably be updated in the future, without me having to make code changes when that occurs.
In advance, thank you for reading!
This particular site is a bit tough because of how the HTML is organized. The relevant tags containing the information don't really have many distinguishing features, so we have to get a little clever. To make matters complicated, the divs that contain the information across the whole page are siblings. We'll also have to make up for this web-design travesty with some ingenuity.
I did notice a pattern that is (almost entirely) consistent throughout the page. Each 'type' and underlying section are broken into 3 divs:
A div containing the type and pokemon, for example Dark Type: Tyranitar.
A div containing the 'specialty' and moves.
A div containing the 'ratings' and commentary.
The basic idea that follows here is that we can begin to organize this markup chaos through a procedure that loosely goes like this:
Identify each of the type title divs
For each of those divs, get the other two divs by accessing its siblings
Parse the information out of each of those divs
With this in mind, I produced a working solution. The meat of the code consists of 5 functions. One to find each section, one to extract the siblings, and three functions to parse each of those divs.
import re
import json
import requests
from pprint import pprint
from bs4 import BeautifulSoup


def type_section(tag):
    """Find the tags that have the move type and pokemon name"""
    pattern = r"[A-z]{3,} Type: [A-z]{3,}"
    # if all these things are true, it should be the right tag
    return all((tag.name == 'div',
                len(tag.get('class', '')) == 1,
                'field__item' in tag.get('class', []),
                re.findall(pattern, tag.text),
                ))


def parse_type_pokemon(tag):
    """Parse out the move type and pokemon from the tag text"""
    s = tag.text.strip()
    poke_type, pokemon = s.split(' Type: ')
    return {'type': poke_type, 'pokemon': pokemon}


def parse_speciality(tag):
    """Parse the tag containing the speciality and moves"""
    table = tag.find('table')
    rows = table.find_all('tr')
    speciality_row, fast_row, charge_row = rows
    speciality_types = []
    for anchor in speciality_row.find_all('a'):
        # Each type 'badge' has a href with the type name at the end
        href = anchor.get('href')
        speciality_types.append(href.split('#')[-1])
    fast_move = fast_row.find('td').text
    charge_move = charge_row.find('td').text
    return {'speciality': speciality_types,
            'fast_move': fast_move,
            'charge_move': charge_move}


def parse_rating(tag):
    """Parse the tag containing categorical ratings and commentary"""
    table = tag.find('table')
    category_tags = table.find_all('th')
    strength_tag, meta_tag, future_tag = category_tags
    str_rating = strength_tag.parent.find('td').text.strip()
    meta_rating = meta_tag.parent.find('td').text.strip()
    future_rating = future_tag.parent.find('td').text.strip()
    blurb_tags = table.find_all('td', {'colspan': '2'})
    if blurb_tags:
        # `if` to accommodate fire section bug
        str_blurb_tag, meta_blurb_tag, future_blurb_tag = blurb_tags
        str_blurb = str_blurb_tag.text.strip()
        meta_blurb = meta_blurb_tag.text.strip()
        future_blurb = future_blurb_tag.text.strip()
    else:
        str_blurb = None
        meta_blurb = None
        future_blurb = None
    return {'strength': {
                'rating': str_rating,
                'commentary': str_blurb},
            'meta': {
                'rating': meta_rating,
                'commentary': meta_blurb},
            'future': {
                'rating': future_rating,
                'commentary': future_blurb}
            }


def extract_divs(tag):
    """
    Get the divs containing the moves/ratings
    determined based on sibling position from the type tag
    """
    _, speciality_div, _, rating_div, *_ = tag.next_siblings
    return speciality_div, rating_div


def main():
    """All together now"""
    url = 'https://pokemongo.gamepress.gg/best-attackers-type'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')

    types = {}
    for type_tag in soup.find_all(type_section):
        type_info = {}
        type_info.update(parse_type_pokemon(type_tag))
        speciality_div, rating_div = extract_divs(type_tag)
        type_info.update(parse_speciality(speciality_div))
        type_info.update(parse_rating(rating_div))

        type_ = type_info.get('type')
        types[type_] = type_info

    pprint(types)  # We did it
    with open('pokemon.json', 'w') as outfile:
        json.dump(types, outfile)
There is, for now, one small wrench in the whole thing. Remember when I said this pattern was almost entirely consistent? Well, the Fire type is an odd-ball here, because they included two pokemon for that type, so the Fire type results are not correct. I or some brave person may come up with a way to deal with that. Or maybe they'll decide on one fire pokemon in the future.
This code, the resulting json (prettified), and an archive of HTML response used can be found in this gist.
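To pull out just the blurb the question mentions (the Bug-type strength commentary), the dictionary that main() writes to pokemon.json can be queried directly. A sketch, assuming main() has already run and assuming the key names produced by parse_type_pokemon() and parse_rating() above:

import json

# Load the data main() wrote to disk and read one type's commentary.
with open('pokemon.json') as infile:
    types = json.load(infile)

print(types['Bug']['strength']['commentary'])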
