Get specific information out of application/ld+json - python

I want to get just the "ratingValue" and "reviewCount" from the following application/ld+json, but I cannot figure out how to do this after looking through numerous how-tos. Thank you in advance for your help.
Sample of the application/ld+json
{
"#context": "http://schema.org",
"#graph": [
{
"#type": "Product",
"name": "MERV 8 Replacement for Trion Air Bear 20x20x5 (19.63x20.13x4.88) ",
"description": "MERV 8 Replacement for Trion Air Bear 20x20x5 (19.63x20.13x4.88) - FilterBuy.com",
"productID": 30100,
"sku": "ABR20x20x5M8",
"mpn": "ABR20x20x5M8",
"url": "https://filterbuy.com/brand/trion-air-bear-air-filters/20x20x5-air-bear-20x20/merv-8/",
"itemCondition": "new",
"brand": "FilterBuy",
"image": "https://filterbuy.com/media/pla_images/20x25x5AB/20x25x5AB-m8-(x1).jpg",
"aggregateRating": {
"#type": "AggregateRating",
"ratingValue": 4.79926,
"reviewCount": 538}
My Code:
from bs4 import BeautifulSoup
import bs4
import requests
import json
import re
import numpy as np
import csv
urls = ['https://filterbuy.com/brand/trion-air-bear-air-filters/20x20x5-air-bear-20x20',
'https://filterbuy.com/brand/trion-air-bear-air-filters/16x25x5-air-bear-1400/?selected_merv=11']
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    mervs = BeautifulSoup(response.text, 'lxml').find_all('strong')
    product = BeautifulSoup(response.text, 'lxml').find("h1", class_="text-center")
    jsonString = soup.find_all('script', type='application/ld+json')[1].text
    json_schema = soup.find_all('script', attrs={'type': 'application/ld+json'})[1]
    json_file = json.loads(json_schema.get_text())
    for i, cart in enumerate(BeautifulSoup(response.text, 'lxml').find_all('form', class_='cart')):
        for tax in cart.attrs:
            if 'data-price' in tax:
                print(product.text, mervs[i].get_text(), [tax], cart[tax], json_file)

In Python, pretty much anything can be nested inside other things. This is an example of lists and dictionaries nested inside a dictionary. You can get the value by working out what you need to do at each level.
Start by assigning the above dictionary to a variable, like the_dict. You want to access the "#graph" key, then access the first item in the list it returns, then access "aggregateRating". From there, you can get both the values you want. Your code may look something like this:
the_dict = ...
d = the_dict['#graph'][0]['aggregateRating']
rating_value, review_count = d['ratingValue'], d['reviewCount']
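As a self-contained sketch of the lookup above, the snippet below parses a trimmed copy of the sample from the question and pulls out the two values. The helper name extract_rating is hypothetical, and the keys are spelled "#graph" as they appear in the sample; JSON-LD served by a live page normally spells these keywords "@graph" and "@type", so adjust the key to whatever your parsed data actually contains.

```python
import json

# Trimmed copy of the sample from the question (keys spelled as shown there).
SAMPLE = """
{
  "#context": "http://schema.org",
  "#graph": [
    {
      "#type": "Product",
      "sku": "ABR20x20x5M8",
      "aggregateRating": {
        "#type": "AggregateRating",
        "ratingValue": 4.79926,
        "reviewCount": 538
      }
    }
  ]
}
"""


def extract_rating(the_dict, graph_key="#graph"):
    """Return (ratingValue, reviewCount) from the first graph item that has a rating.

    Returns (None, None) when no item carries an "aggregateRating" entry.
    """
    for item in the_dict.get(graph_key, []):
        rating = item.get("aggregateRating")
        if rating:
            return rating.get("ratingValue"), rating.get("reviewCount")
    return None, None


the_dict = json.loads(SAMPLE)
rating_value, review_count = extract_rating(the_dict)
print(rating_value, review_count)  # 4.79926 538
```

Looping over the graph items instead of hard-coding index 0 is a design choice for pages where the Product entry is not the first item; if it always is, the one-line lookup above is enough.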


I'm unsure how to print the rest of the information I need from HTML

import requests
from bs4 import BeautifulSoup
from datetime import datetime
from dateutil.relativedelta import relativedelta
evr_begin = datetime.now().strftime("%m/%d/%Y")
evr_end = (datetime.now() + relativedelta(months=1)).strftime("%m/%d/%Y")
url = "https://mms.kcbs.us/members/evr_search_ol_json.php?" \
      f"otype=TEXT&evr_map_type=2&org_id=KCBA&evr_begin={evr_begin}&evr_end=.{evr_end}&" \
      "evr_radius=50&evr_type=269&evr_region_type=1"
response = requests.request("GET", url)
soup = BeautifulSoup(response.text, features='lxml')
for event in soup.find_all('div', class_='row'):
    print(event.find('b').getText())
    print(event.find('i').getText())
Link to website https://mms.kcbs.us/members/evr_search.php?org_id=KCBA
I'm unsure how to print what comes after the information I'm already printing. Part of the issue is that some of the other text shares the same tag; for the rest, I'm just not sure how to reach it.
For example, for the first event I need to print
Frisco, CO 80443
UNITED STATES
STATE CHAMPIONSHIP
Reps: BUNNY TUTTLE, RICH TUTTLE, MICHAEL WINTER
Prize Money: $13,050.00
all separately.
If I use
print(event.find('div', class_='col-md-4').getText()) within the for loop, it prints everything clumped together.
What I would do is create a dictionary that maps each position in a row of the table to the name of the piece of data found there. Then collect each row into its own dictionary and append it to a list that you can work with once parsing is finished.
For example:
import requests
from bs4 import BeautifulSoup
from datetime import datetime
from dateutil.relativedelta import relativedelta
import json
data = {
0:{ 0:"title", 1:"dates", 2:"city/state", 3:"country" },
1:{ 0:"event", 1:"reps", 2:"prize" },
2:{ 0:"results" }
}
evr_begin = datetime.now().strftime("%m/%d/%Y")
evr_end = (datetime.now() + relativedelta(months=1)).strftime("%m/%d/%Y")
url = f"https://mms.kcbs.us/members/evr_search_ol_json.php?otype=TEXT&evr_map_type=2&org_id=KCBA&evr_begin={evr_begin}&evr_end=.{evr_end}&evr_radius=50&evr_type=269&evr_region_type=1"
response = requests.request("GET", url)
print(response.content)
soup = BeautifulSoup(response.text, features='lxml')
all_data = []
for element in soup.find_all('div', class_="row"):
    event = {}
    for i, col in enumerate(element.find_all('div', class_='col-md-4')):
        for j, item in enumerate(col.strings):
            event[data[i][j]] = item
    all_data.append(event)
print(json.dumps(all_data,indent=4))
The output would look something like this:
{
"title": "Frisco BBQ Challenge",
"dates": "6/16/2022 - 6/18/2022",
"city/state": "Frisco, CO 80443",
"country": "UNITED STATES",
"event": "STATE CHAMPIONSHIP",
"reps": "Reps: BUNNY TUTTLE, RICH TUTTLE, MICHAEL WINTER",
"prize": "Prize Money: $13,050.00",
"results": "Results Not In"
},
{
"title": "York County BBQ Festival",
"dates": "6/17/2022 - 6/18/2022",
"city/state": "Delta, PA 17314",
"country": "UNITED STATES",
"event": "STATE CHAMPIONSHIP",
"reps": "Reps: ANGELA MCKEE, ROBERT MCKEE, LOUISE WEIDNER",
"prize": "Prize Money: $5,500.00",
"results": "Results Not In"
},
...
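The positional mapping can be tried without a network request. The sketch below feeds it the strings of the first event shown above, already grouped by column. The names FIELD_NAMES and row_to_event are hypothetical, and the extra_i_j fallback key is an added assumption so that a column with more strings than expected does not raise a KeyError.

```python
# Field names keyed by column index, then by position of the string in that column.
FIELD_NAMES = {
    0: {0: "title", 1: "dates", 2: "city/state", 3: "country"},
    1: {0: "event", 1: "reps", 2: "prize"},
    2: {0: "results"},
}


def row_to_event(columns):
    """Map each string in each column to its field name by position."""
    event = {}
    for i, strings in enumerate(columns):
        for j, text in enumerate(strings):
            # Fall back to a generated key if the layout has an unexpected extra string.
            key = FIELD_NAMES.get(i, {}).get(j, f"extra_{i}_{j}")
            event[key] = text.strip()
    return event


# Strings of the first event, one inner list per col-md-4 column.
columns = [
    ["Frisco BBQ Challenge", "6/16/2022 - 6/18/2022", "Frisco, CO 80443", "UNITED STATES"],
    ["STATE CHAMPIONSHIP", "Reps: BUNNY TUTTLE, RICH TUTTLE, MICHAEL WINTER", "Prize Money: $13,050.00"],
    ["Results Not In"],
]

event = row_to_event(columns)
for name in ("city/state", "country", "event", "reps", "prize"):
    print(event[name])
```

With BeautifulSoup, each inner list could plausibly come from list(col.stripped_strings), which also skips whitespace-only strings that would otherwise shift the positions.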

Python BeautifulSoup Find data inside a variable

I am trying to use BeautifulSoup to get some data from a website. The data is returned as follows:
window._sharedData = {
"config": {
"csrf_token": "DMjhhPBY0i6ZyMKYQPjMjxJhRD0gkRVQ",
"viewer": null,
"viewerId": null
},
"country_code": "IN",
"language_code": "en",
"locale": "en_US"
}
How can I import the same into json.loads so I can extract the data?
You need to change it first to a json format by removing the variable name and parsing it as a string:
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.find('script').text
text = text.replace('window._sharedData = ', '')
data = json.loads(text)
country_code = data['country_code']
Or you can use the eval function to turn it into a Python dictionary. For that, you need to replace the JSON literals with their Python equivalents (here null becomes None) so that the text is a valid Python dictionary literal:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.find('script').text
text = text.replace('null', 'None')
text = text.replace('window._sharedData = ', '')
data = eval(text)
country_code = data['country_code']
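A variation on the first approach that avoids both the hard-coded replace string and eval: json.JSONDecoder.raw_decode parses one JSON value starting at a given index and ignores whatever follows it, such as a trailing semicolon. This is a sketch under the assumption that the object literal is valid JSON; the function name parse_js_assignment and the trailing semicolon in the sample are illustrative additions.

```python
import json

# Script text as in the question, with a trailing semicolon added for illustration.
script_text = """window._sharedData = {
    "config": {
        "csrf_token": "DMjhhPBY0i6ZyMKYQPjMjxJhRD0gkRVQ",
        "viewer": null,
        "viewerId": null
    },
    "country_code": "IN",
    "language_code": "en",
    "locale": "en_US"
};"""


def parse_js_assignment(text, variable="window._sharedData"):
    """Parse the JSON object assigned to `variable` inside a script's text."""
    start = text.index(variable)    # raises ValueError if the variable is absent
    brace = text.index("{", start)  # first opening brace after the variable name
    # raw_decode returns (object, end_index) and ignores trailing text such as ";".
    data, _end = json.JSONDecoder().raw_decode(text, brace)
    return data


data = parse_js_assignment(script_text)
print(data["country_code"])      # IN
print(data["config"]["viewer"])  # None
```

In a real page, script_text would be the text of whichever script tag contains the variable name, for example the first one for which "window._sharedData" in tag.text is true.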

How to print json info with python?

I have a json (url = http://open.data.amsterdam.nl/ivv/parkeren/locaties.json) and I want to print all 'title', 'adres', 'postcode'. How can I do that?
I want to print it like this:
title.
adres.
postcode.
title.
adres.
postcode.
so each value on its own line, one below the other.
I hope you can help me with this
import urllib, json
url = "http://open.data.amsterdam.nl/ivv/parkeren/locaties.json"
import requests
search = requests.get(url).json()
print(search['title'])
print(search['adres'])
print(search['postcode'])
Using print(json.dumps(search, indent=4)) you can see that the structure is
{
"parkeerlocaties": [
{
"parkeerlocatie": {
"title": "Fietsenstalling Tolhuisplein",
"Locatie": "{\"type\":\"Point\",\"coordinates\":[4.9032801,52.3824545]}",
...
}
},
{
"parkeerlocatie": {
"title": "Fietsenstalling Paradiso",
"Locatie": "{\"type\":\"Point\",\"coordinates\":[4.8833735,52.3621851]}",
...
}
},
So to access the inner properties, you need to follow the JSON path
import requests
url = 'http://open.data.amsterdam.nl/ivv/parkeren/locaties.json'
search = requests.get(url).json()
for parkeerlocatie in search["parkeerlocaties"]:
    content = parkeerlocatie['parkeerlocatie']
    print(content['title'])
    print(content['adres'])
    print(content['postcode'])
    print()
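The same traversal can be checked offline. In the sketch below the two titles come from the structure shown above, while the adres and postcode values are made-up placeholders, and the second entry deliberately omits them to show how dict.get avoids a KeyError when a location lacks a field. The function name iter_locations is hypothetical.

```python
def iter_locations(search):
    """Yield (title, adres, postcode) for every entry under "parkeerlocaties"."""
    for entry in search.get("parkeerlocaties", []):
        content = entry.get("parkeerlocatie", {})
        # .get returns "" instead of raising KeyError when a field is missing.
        yield (
            content.get("title", ""),
            content.get("adres", ""),
            content.get("postcode", ""),
        )


# Stand-in for requests.get(url).json(); adres/postcode are placeholder values.
search = {
    "parkeerlocaties": [
        {"parkeerlocatie": {"title": "Fietsenstalling Tolhuisplein",
                            "adres": "Example street 1",
                            "postcode": "0000 AA"}},
        {"parkeerlocatie": {"title": "Fietsenstalling Paradiso"}},
    ]
}

for title, adres, postcode in iter_locations(search):
    print(title)
    print(adres)
    print(postcode)
    print()  # blank line between locations
```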

How to get text within <script> tag

I am scraping the LaneBryant website.
Part of the source code is
<script type="application/ld+json">
{
"#context": "http://schema.org/",
"#type": "Product",
"name": "Flip Sequin Teach & Inspire Graphic Tee",
"image": [
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477",
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477_Back"
],
"description": "Get inspired with [...]",
"brand": "Lane Bryant",
"sku": "356861",
"offers": {
"#type": "Offer",
"url": "https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861",
"priceCurrency": "USD",
"price":"44.95",
"availability": "http://schema.org/InStock",
"itemCondition": "https://schema.org/NewCondition"
}
}
</script>
In order to get price in USD, I have written this script:
def getPrice(self,start):
    fprice=[]
    discount = ""
    price1 = start.find('script', {'type': 'application/ld+json'})
    data = ""
    #print("price 1 is + "+ str(price1)+"data is "+str(data))
    price1 = str(price1).split(",")
    #price1=str(price1).split(":")
    print("final price +"+ str(price1[11]))
where start is :
d = webdriver.Chrome('/Users/fatima.arshad/Downloads/chromedriver')
d.get(url)
start = BeautifulSoup(d.page_source, 'html.parser')
It doesn't print the price even though I am getting correct text. How do I get just the price?
In this instance you can just regex for the price
import requests, re
r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
p = re.compile(r'"price":"(.*?)"')
print(p.findall(r.text)[0])
Otherwise, target the appropriate script tag by id and then parse the .text with json library
import requests, json
from bs4 import BeautifulSoup
r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
start = BeautifulSoup(r.text, 'html.parser')
data = json.loads(start.select_one('#pdpInitialData').text)
price = data['pdpDetail']['product'][0]['price_range']['sale_price']
print(price)
price1 = start.find('script', {'type': 'application/ld+json'})
This is actually the <script> tag, so a better name would be
script_tag = start.find('script', {'type': 'application/ld+json'})
You can access the text inside the script tag using .text. That will give you the JSON in this case.
json_string = script_tag.text
Instead of splitting by commas, use a JSON parser to avoid misinterpretations:
import json
clothing=json.loads(json_string)
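Putting these steps together, the price sits under the "offers" key of the parsed dictionary. The sketch below works on a trimmed string copy of the snippet from the question rather than a live page, so json_string here stands in for script_tag.text; the keys are spelled as they appear in the question.

```python
import json

# Stand-in for script_tag.text: a trimmed copy of the snippet from the question.
json_string = """
{
  "#context": "http://schema.org/",
  "#type": "Product",
  "name": "Flip Sequin Teach & Inspire Graphic Tee",
  "sku": "356861",
  "offers": {
    "#type": "Offer",
    "priceCurrency": "USD",
    "price": "44.95"
  }
}
"""

clothing = json.loads(json_string)
offer = clothing["offers"]

# The price is stored as a string in this markup; convert it if you need a number.
price = float(offer["price"])
print(offer["priceCurrency"], price)  # USD 44.95
```

Indexing by key is independent of the order and number of fields, which is what made the comma-split position price1[11] fragile.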

Access a dictionary inside a script tag using beautiful soup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay
url ='https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html' #from where i need data
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find('script')
print(links)
this gives-->
<script type="application/ld+json">
{
"#context": "https://schema.org",
"#type": "Organization",
"address": {
"#type": "PostalAddress",
"addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
"postalCode": "411016 ",
"streetAddress": " Pune/Maharashtra "
},
"name": "Banctec Tps India Pvt Ltd",
"telephone": "(020) "
}
</script>
I need to print the address dictionary, which is nested inside another dictionary, and access addressLocality, postalCode, and streetAddress.
I tried different methods and failed.
The text of the script tag is a string of JSON-formatted data; deserialize it with json.loads():
import json
links= soup.find('script')
print(links)
after this,
address = json.loads(links.text)['address']
print(address)
Use the string property to get the text of the element, then you can parse it as JSON.
links_dict = json.loads(links.string)
address = links_dict['address']
Use the json package:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay
import json
url ='https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html' #from where i need data
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find_all('script')
print(links)
for script in links:
    if '#context' in script.text:
        jsonStr = script.string
        jsonObj = json.loads(jsonStr)
        print (jsonObj['address'])
Output:
print (jsonObj['address'])
{'#type': 'PostalAddress', 'addressLocality': '3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi', 'postalCode': '411016 ', 'streetAddress': ' Pune/Maharashtra '}
Oftentimes script tags contain a lot of JavaScript fluff. You can use a regex to isolate the dictionary:
import json
import re
scripts = soup.find_all('script')
for script in scripts:
    if '#context' in script.text:
        # Extra step to isolate the dictionary; re.DOTALL lets "." span line breaks.
        jsonStr = re.search(r'\{.*\}', str(script), re.DOTALL).group()
        # Create dictionary
        dct = json.loads(jsonStr)
        print(dct['address'])
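One detail worth checking with the regex approach: by default "." does not match newlines, so a pattern like \{.*\} only matches a JSON object that spans several lines when re.DOTALL is passed. The sketch below shows this on a string modelled on the script tag printed in the question, standing in for str(script). The name extract_json_object is hypothetical.

```python
import json
import re

# Stand-in for str(script): the tag from the question.
script_html = """<script type="application/ld+json">
{
"#context": "https://schema.org",
"#type": "Organization",
"address": {
"#type": "PostalAddress",
"addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
"postalCode": "411016 ",
"streetAddress": " Pune/Maharashtra "
},
"name": "Banctec Tps India Pvt Ltd",
"telephone": "(020) "
}
</script>"""


def extract_json_object(text):
    """Return the outermost {...} span in text parsed as JSON, or None if absent."""
    # re.DOTALL lets "." cross line breaks; the greedy .* runs to the last "}".
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group())


dct = extract_json_object(script_html)
address = dct["address"]
print(address["addressLocality"])
print(address["postalCode"].strip())
print(address["streetAddress"].strip())
```

The greedy match assumes the script holds a single top-level object; if other braces follow it, json.loads will raise an error rather than return wrong data silently.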
