Parsing JSON web scraper output - python

I am practicing web scraping using the requests and BeautifulSoup modules on the following website:
https://www.imdb.com/title/tt0080684/
My code thus far properly outputs the JSON in question. I'd like help extracting only the name and description from that JSON into a response dictionary.
Code
# Send HTTP requests
import requests
import json
from bs4 import BeautifulSoup
class WebScraper:
    def send_http_request():
        # Obtain the URL via user input
        url = input('Input the URL:\n')
        # Get the webpage
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type': 'application/ld+json'}).contents))
            print(p)
        else:
            print('\nInvalid movie page!')

WebScraper.send_http_request()
Desired Output
{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training with Yoda, while his friends are pursued by Darth Vader and a bounty hunter named Boba Fett all over the galaxy."}

You can parse the dictionary and then print a new JSON object using the dumps method:
# Send HTTP requests
import requests
import json
from bs4 import BeautifulSoup
class WebScraper:
    def send_http_request():
        # Obtain the URL via user input
        url = input('Input the URL:\n')
        # Get the webpage
        r = requests.get(url)
        soup = BeautifulSoup(r.content, 'html.parser')
        # Check response object's status code
        if r:
            p = json.loads("".join(soup.find('script', {'type': 'application/ld+json'}).contents))
            output = json.dumps({"title": p["name"], "description": p["description"]})
            print(output)
        else:
            print('\nInvalid movie page!')

WebScraper.send_http_request()
Output:
{"title": "Star Wars: Episode V - The Empire Strikes Back", "description": "Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training..."}

You just need to create a new dictionary from p using the two keys name and description. If you need valid JSON (double quotes) rather than a Python dict, pass the result through json.dumps as in the previous answer.
# Check response object's status code
if r:
    p = json.loads("".join(soup.find('script', {'type': 'application/ld+json'}).contents))
    desired_output = {"title": p["name"], "description": p["description"]}
    print(desired_output)
else:
    print('\nInvalid movie page!')
Output:
{'title': 'Star Wars: Episode V - The Empire Strikes Back', 'description': 'Star Wars: Episode V - The Empire Strikes Back is a movie starring Mark Hamill, Harrison Ford, and Carrie Fisher. After the Rebels are brutally overpowered by the Empire on the ice planet Hoth, Luke Skywalker begins Jedi training...'}

Related

Scrape all URLs of a webpage

I have the following url https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=517801 where the last 6 digits are a unique identifier for a specific runner. I want to find all of the 6-digit unique identifiers on this page.
I've tried to scrape all URLs on the page (code shown below), but unfortunately I only get a high-level summary rather than an in-depth list, which should contain >5000 runners. I'm hoping to get a list/dataframe which shows:
https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=517801
https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=500000
https://www.gbgb.org.uk/greyhound-profile/?greyhoundId=500005
etc.
This is what I've been able to do so far. I appreciate any help!
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
req = Request("https://www.gbgb.org.uk//greyhound-profile//")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
links = []
for link in soup.findAll('a'):
    links.append(link.get('href'))
print(links)
Thanks for the help in advance!
The data is loaded dynamically from an external API URL. You can use the next example to load the data (with the IDs):
import json
import requests
api_url = "https://api.gbgb.org.uk/api/results/dog/517801" # <-- 517801 is the ID from your URL in the question
params = {"page": "1", "itemsPerPage": "20", "race_type": "race"}
page = 1
while True:
    params["page"] = page
    data = requests.get(api_url, params=params).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    if not data["items"]:
        break

    for i in data["items"]:
        print(
            "{:<30} {}".format(
                i.get("winnerOr2ndName", ""), i.get("winnerOr2ndId", "")
            )
        )

    page += 1
Prints:
Ferndale Boom 534358
Laganore Mustang 543937
Tickity Kara 535237
Thor 511842
Ballyboughlewiss 519556
Beef Cakes 551323
Distant Millie 546674
Lissan Kels 525148
Rosstemple Marko 534276
Happy Harry 550042
Porthall Ella 550841
Southlodge Eden 531677
Effernogue Beef 547416
Faydas Truffle 528780
Johns Lass 538763
Faydas Truffle 528780
Toms Hero 543659
Affane Buzz 547555
Emkay Flyer 531456
Ballymac Tilly 492923
Kilcrea Duke 542178
Sporting Sultan 541880
Droopys Poet 542020
Shortwood Elle 527241
Rosstemple Marko 534276
Erics Bozo 541863
Swift Launch 536667
Longsearch 523017
Swift Launch 536667
Takemyhand 535023
Floral Print 527192
Rustys Aero 497270
Autumn Dapper 519528
Droopys Kiwi 511989
Deep Chest 520634
Newtack Henry 525511
Indian Nightmare 524636
Lady Mascara 528399
Tarsna Yankee 517373
Leathems Act 516918
Final Star 514015
Ascot Faye 500812
Ballymac Ernie 503569
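If you want the profile URLs in the same format as in the question rather than just the IDs, a small variation of the loop above can assemble them (collecting into a set to drop duplicates):
import requests

api_url = "https://api.gbgb.org.uk/api/results/dog/517801"
params = {"page": "1", "itemsPerPage": "20", "race_type": "race"}

profile_urls = set()
page = 1
while True:
    params["page"] = page
    data = requests.get(api_url, params=params).json()
    if not data["items"]:
        break
    for i in data["items"]:
        dog_id = i.get("winnerOr2ndId")
        if dog_id:  # some rows may have no winner/runner-up ID
            profile_urls.add(f"https://www.gbgb.org.uk/greyhound-profile/?greyhoundId={dog_id}")
    page += 1

for url in sorted(profile_urls):
    print(url)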
You can convert the result content to a pandas DataFrame and then just use the winnerOr2ndName and winnerOr2ndId columns.
Example
import json
import requests
import pandas as pd
def get_items(dog_id):
    url = f"https://api.gbgb.org.uk/api/results/dog/{dog_id}"  # page is passed via params below
    params = {"page": "-1", "itemsPerPage": "20", "race_type": "race"}
    response = requests.get(url, params=params).json()
    MAX_PAGES = response["meta"]["pageCount"]
    result = pd.DataFrame(pd.DataFrame(response["items"]).loc[:, ['winnerOr2ndName', 'winnerOr2ndId']].dropna())
    result["winnerOr2ndId"] = result["winnerOr2ndId"].astype(int)
    while int(params.get("page")) < MAX_PAGES:
        params["page"] = str(int(params.get("page")) + 1)
        response = requests.get(url, params=params).json()
        new_items = pd.DataFrame(pd.DataFrame(response["items"]).loc[:, ['winnerOr2ndName', 'winnerOr2ndId']].dropna())
        new_items["winnerOr2ndId"] = new_items["winnerOr2ndId"].astype(int)
        result = pd.concat([result, new_items])
    return result.drop_duplicates()
It returns a DataFrame with the winnerOr2ndName and winnerOr2ndId columns, one row per unique runner.
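A usage sketch (the exact rows depend on the live API data):
# Hypothetical call using the ID from the question.
df = get_items(517801)
print(df.head())
print(f"{len(df)} unique winners/runners-up")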

Recursive Web Scraping Pagination

I'm trying to scrape some real estate articles from the following website:
Link
I manage to get the links I need, but I am struggling with pagination on the web page. I'm trying to scrape every link under each category: 'building relationships', 'building your team', 'capital rising', etc. Some of these category pages have pagination and some of them do not. I tried the following code, but it just gives me the links from the second page.
from requests_html import HTMLSession
def tag_words_links(url):
    global _session
    _request = _session.get(url)
    tags = _request.html.find('a.tag-cloud-link')
    links = []
    for link in tags:
        links.append({
            'Tags': link.find('a', first=True).text,
            'Links': link.find('a', first=True).attrs['href']
        })
    return links

def parse_tag_links(link):
    global _session
    _request = _session.get(link)
    articles = []
    try:
        next_page = _request.html.find('link[rel="next"]', first=True).attrs['href']
        _request = _session.get(next_page)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])
    except:
        _request = _session.get(link)
        article_links = _request.html.find('h3 a')
        for article in article_links:
            articles.append(article.find('a', first=True).attrs['href'])
    return articles

if __name__ == '__main__':
    _session = HTMLSession()
    url = 'https://lifebridgecapital.com/podcast/'
    links = tag_words_links(url)
    print(parse_tag_links('https://lifebridgecapital.com/tag/multifamily/'))
To print the title of every article under each tag, and on each page of the tag, you can use this example:
import requests
from bs4 import BeautifulSoup
url = "https://lifebridgecapital.com/podcast/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
tag_links = [a["href"] for a in soup.select(".tagcloud a")]
for link in tag_links:
    while True:
        print(link)
        print("-" * 80)

        soup = BeautifulSoup(requests.get(link).content, "html.parser")
        for title in soup.select("h3 a"):
            print(title.text)
        print()

        next_link = soup.select_one("a.next")
        if not next_link:
            break

        link = next_link["href"]
Prints:
...
https://lifebridgecapital.com/tag/multifamily/
--------------------------------------------------------------------------------
WS890: Successful Asset Classes In The Current Market with Jerome Maldonado
WS889: How To Avoid A $1,000,000 Mistake with Hugh Odom
WS888: Value-Based On BRRRR VS Cap Rate with John Stoeber
WS887: Slow And Steady Still Wins The Race with Nicole Pendergrass
WS287: Increase Your NOI by Converting Units to Short Term Rentals with Michael Sjogren
WS271: Investment Strategies To Survive An Economic Downturn with Vinney Chopra
WS270: Owning a Construction Company Creates More Value with Abraham Ng’hwani
WS269: The Impacts of Your First Deal with Kyle Mitchell
WS260: Structuring Deals To Get The Best Return On Investment with Jeff Greenberg
WS259: Capital Raising For Newbies with Bryan Taylor
https://lifebridgecapital.com/tag/multifamily/page/2/
--------------------------------------------------------------------------------
WS257: Why Ground Up Development is the Best Investment with Sam Bates
WS256: Mobile Home Park Investing: The Real Deal with Jefferson Lilly
WS249: Managing Real Estate Paperwork Successfully with Krista Testani
WS245: Multifamily Syndication with Venkat Avasarala
WS244: Passive Investing In Real Estate with Kay Kay Singh
WS243: Getting Started In Real Estate Brokerage with Tyler Chesser
WS213: Data Analytics In Real Estate with Raj Tekchandani
WS202: Ben Leybovich and Sam Grooms on The Advantages Of A Partnership In Real Estate Business
WS199: Financial Freedom Through Real Estate Investing with Rodney Miller
WS197: Loan Qualifications: How The Whole Process Works with Vinney Chopra
https://lifebridgecapital.com/tag/multifamily/page/3/
--------------------------------------------------------------------------------
WS172: Real Estate Syndication with Kyle Jones
...

How to avoid getting broken words while webcrawling

I'm trying to web crawl movie titles from this website: https://www.the-numbers.com/market/2019/top-grossing-movies
I keep getting truncated titles like "John Wick: Chapter 3 — ".
This is the code:
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")

movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"

for i in range(len(movie_list)):
    print(movie_list[i].text)
And these are the outputs:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw…
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho…
John Wick: Chapter 3 — Para…
How to Train Your Dragon: T…
The Secret Life of Pets 2
Pokémon: Detective Pikachu
Once Upon a Time…in Hollywo…
I want to know why I keep getting these broken words and how to fix this!
Because this page is server-rendered with the long titles already truncated, you can request each movie's own page to get the full title. (Also, don't forget to extract the title with a regex if you need it bare, because the title on a movie's page contains the release year.)
Try the code below:
import requests
from bs4 import BeautifulSoup
url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "html.parser")

movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"
for movie in movie_list:
    raw = requests.get("https://www.the-numbers.com" + movie.get("href"), headers={'User-Agent': 'Mozilla/5.0'})
    raw.encoding = 'utf-8'
    html = BeautifulSoup(raw.text, "html.parser")
    print(html.select_one("#main > div > h1").text)
That's gave me:
Avengers: Endgame (2019)
The Lion King (2019)
Frozen II (2019)
Toy Story 4 (2019)
Captain Marvel (2019)
Star Wars: The Rise of Skywalker (2019)
Spider-Man: Far From Home (2019)
....
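If you need the bare title without the trailing year, the regex mentioned above can strip it. A minimal sketch:
import re

# Strip a trailing "(2019)"-style year from a title string.
def strip_year(title):
    return re.sub(r'\s*\(\d{4}\)$', '', title).strip()

print(strip_year("Avengers: Endgame (2019)"))  # -> Avengers: Endgame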
You can handle the strings by normalizing the Unicode to plain ASCII; the solution code is:
import requests
import unicodedata
from bs4 import BeautifulSoup

url = "https://www.the-numbers.com/market/" + "2019" + "/top-grossing-movies"
raw = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = BeautifulSoup(raw.text, "lxml")

movie_list = html.select("#page_filling_chart table tr > td > b > a")  # "#page_filling_chart > table > tbody > tr > td > b"

for i in range(len(movie_list)):
    movie_name = movie_list[i].text
    print(unicodedata.normalize('NFKD', movie_name).encode('ascii', 'ignore').decode())
The output is like this:
Avengers: Endgame
The Lion King
Frozen II
Toy Story 4
Captain Marvel
Star Wars: The Rise of Skyw...
Spider-Man: Far From Home
Aladdin
Joker
Jumanji: The Next Level
It: Chapter Two
Us
Fast & Furious Presents: Ho...
John Wick: Chapter 3 a Para...
How to Train Your Dragon: T...
The Secret Life of Pets 2
PokAmon: Detective Pikachu
Once Upon a Timeain Hollywo...
Shazam!
Aquaman
Knives Out
Dumbo
Maleficent: Mistress of Evil
...
Narcissister Organ Player
Chef Flynn
I am Not a Witch
Divide and Conquer: The Sto...
Senso
Never-Ending Man: Hayao Miy...

Scraping citation text from PubMed search results with BeautifulSoup and Python?

So I'm attempting to scrape the AMA-format citations for every article in a PubMed search. The following code is just intended to get the citation data from the first article.
import requests
import xlsxwriter
from bs4 import BeautifulSoup
URL = 'https://pubmed.ncbi.nlm.nih.gov/?term=infant+formula&size=200'
response = requests.get(URL)
html_soup = BeautifulSoup(response.text, 'html5lib')
article_containers = html_soup.find_all('article', class_ = 'labs-full-docsum')
first_article = article_containers[0]
citation_text = first_article.find('div', class_ = 'docsum-wrap').find('div', class_ = 'result-actions-bar').div.div.find('div', class_ = 'content').div.div.text
print(citation_text)
The script returns a blank line, even though when I inspect the source through Google Chrome, the text is clearly visible within that "div".
Does this have something to do with JavaScript, and if so, how do I fix it?
This script will get all citations in "AMA" format from the URL provided:
import json
import requests
from bs4 import BeautifulSoup
URL = 'https://pubmed.ncbi.nlm.nih.gov/?term=infant+formula&size=200'
response = requests.get(URL)
html_soup = BeautifulSoup(response.text, 'html5lib')
for article in html_soup.select('article'):
    print(article.select_one('.labs-docsum-title').get_text(strip=True, separator=' '))
    citation_id = article.input['value']
    data = requests.get('https://pubmed.ncbi.nlm.nih.gov/{citation_id}/citations/'.format(citation_id=citation_id)).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    print(data['ama']['orig'])
    print('-' * 80)
Prints:
Review of Infant Feeding: Key Features of Breast Milk and Infant Formula .
Martin CR, Ling PR, Blackburn GL. Review of Infant Feeding: Key Features of Breast Milk and Infant Formula. Nutrients. 2016;8(5):279. Published 2016 May 11. doi:10.3390/nu8050279
--------------------------------------------------------------------------------
Prebiotics in infant formula .
Vandenplas Y, De Greef E, Veereman G. Prebiotics in infant formula. Gut Microbes. 2014;5(6):681-687. doi:10.4161/19490976.2014.972237
--------------------------------------------------------------------------------
Effects of infant formula composition on long-term metabolic health.
Lemaire M, Le Huërou-Luron I, Blat S. Effects of infant formula composition on long-term metabolic health. J Dev Orig Health Dis. 2018;9(6):573-589. doi:10.1017/S2040174417000964
--------------------------------------------------------------------------------
Selenium in infant formula milk.
He MJ, Zhang SQ, Mu W, Huang ZW. Selenium in infant formula milk. Asia Pac J Clin Nutr. 2018;27(2):284-292. doi:10.6133/apjcn.042017.12
--------------------------------------------------------------------------------
... and so on.
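Since your script already imports xlsxwriter, here is a minimal sketch of writing each citation to a spreadsheet instead of printing it; the file name citations.xlsx and the one-citation-per-row layout are assumptions:
import requests
import xlsxwriter
from bs4 import BeautifulSoup

URL = 'https://pubmed.ncbi.nlm.nih.gov/?term=infant+formula&size=200'
response = requests.get(URL)
html_soup = BeautifulSoup(response.text, 'html5lib')

# Hypothetical output file: one AMA citation per row in column A.
workbook = xlsxwriter.Workbook('citations.xlsx')
worksheet = workbook.add_worksheet()

for row, article in enumerate(html_soup.select('article')):
    citation_id = article.input['value']
    data = requests.get('https://pubmed.ncbi.nlm.nih.gov/{citation_id}/citations/'.format(citation_id=citation_id)).json()
    worksheet.write(row, 0, data['ama']['orig'])

workbook.close()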

Extracting title from link in Python (Beautiful soup)

I am new to Python and I'm looking to extract the title from a link. So far I have the following but have hit a dead end:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
books = soup.find("section")
book_list = books.find_all(class_="product_pod")
tonight = book_list[0]
for book in book_list:
    price = book.find(class_="price_color").get_text()
    title = book.find('a')
    print(price)
    print(title.contents[0])
To extract the title from the links, you can use the title attribute.
For example:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.select('h3 > a'):
    print(a['title'])
Prints:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red
The Dirty Little Secrets of Getting Your Dream Job
The Coming Woman: A Novel Based on the Life of the Infamous Feminist, Victoria Woodhull
The Boys in the Boat: Nine Americans and Their Epic Quest for Gold at the 1936 Berlin Olympics
The Black Maria
Starving Hearts (Triangular Trade Trilogy, #1)
Shakespeare's Sonnets
Set Me Free
Scott Pilgrim's Precious Little Life (Scott Pilgrim #1)
Rip it Up and Start Again
Our Band Could Be Your Life: Scenes from the American Indie Underground, 1981-1991
Olio
Mesaerion: The Best Science Fiction Stories 1800-1849
Libertarianism for Beginners
It's Only the Himalayas
You can use the alt attribute of the book's thumbnail image:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://books.toscrape.com/")
soup = BeautifulSoup(page.content, 'html.parser')
books = soup.find("section")
book_list = books.find_all(class_="product_pod")
tonight = book_list[0]
for book in book_list:
    price = book.find(class_="price_color").get_text()
    title = book.select_one('a img')['alt']
    print(title)
Output:
A Light in the Attic
Tipping the Velvet
Soumission
Sharp Objects
Sapiens: A Brief History of Humankind
The Requiem Red...
By just modifying your existing code, you can use the alt text, which contains the book titles in your example.
print(title.contents[0].attrs["alt"])
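In context, the loop from your question becomes:
for book in book_list:
    price = book.find(class_="price_color").get_text()
    title = book.find('a')
    print(price)
    print(title.contents[0].attrs["alt"])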
