I am trying to use BeautifulSoup to get some data from a website. The data is returned as follows:
window._sharedData = {
"config": {
"csrf_token": "DMjhhPBY0i6ZyMKYQPjMjxJhRD0gkRVQ",
"viewer": null,
"viewerId": null
},
"country_code": "IN",
"language_code": "en",
"locale": "en_US"
}
How can I import the same into json.loads so I can extract the data?
You first need to turn it into valid JSON by removing the variable assignment, then parse the remaining string:
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.find('script').text
text = text.replace('window._sharedData = ', '')
data = json.loads(text)
country_code = data['country_code']
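If the script text also carries a trailing semicolon or extra whitespace, a plain replace leaves text that json.loads rejects. The following is a minimal sketch of a slightly more tolerant variant, assuming the script contains a single assignment of an object literal. The helper name parse_js_assignment and the sample string are made up for illustration, and the token value is a placeholder.

```python
import json
import re


def parse_js_assignment(script_text, var_name="window._sharedData"):
    """Hypothetical helper: pull the object literal out of `var_name = {...};`."""
    # re.DOTALL lets '.' match newlines, so the object may span several lines.
    pattern = re.escape(var_name) + r"\s*=\s*(\{.*\})\s*;?\s*$"
    match = re.search(pattern, script_text.strip(), re.DOTALL)
    if match is None:
        raise ValueError("assignment to %s not found" % var_name)
    return json.loads(match.group(1))


# Stand-in for soup.find('script').text
sample = """window._sharedData = {
    "config": {"csrf_token": "example-token", "viewer": null, "viewerId": null},
    "country_code": "IN",
    "language_code": "en",
    "locale": "en_US"
};"""

data = parse_js_assignment(sample)
country_code = data["country_code"]
print(country_code)
```

json.loads maps the JSON null to Python's None, so no manual replacement of literals is needed with this approach.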
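Or you can use the eval function to turn it into a Python dictionary. For that, you need to replace the JSON literals with their Python equivalents (null becomes None) before evaluating the text. Keep in mind that eval executes arbitrary code, so only use it on input you trust: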
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
text = soup.find('script').text
text = text.replace('null', 'None')  # str.replace needs string arguments
text = text.replace('window._sharedData = ', '')
data = eval(text)
country_code = data['country_code']
Related
I would like to put some data from an HTML file into a pandas DataFrame, but I'm getting an error. My data has the following structure. It is the data between the square brackets after "lots" that I would like to put into a DataFrame, but I'm pretty confused as to what type of object this is.
html_doc = """<html><head><script>
"unrequired_data = [{"ID":XXX, "Name":XXX, "Price":100GBP, "description": null },
{"ID":XXX, "Name":XXX, "Price":150GBP, "description": null },
{"ID":XXX, "Name":XXX, "Price":150GBP, "description": null }]
"lots":[{"ID":123, "Name":ABC, "Price":100, "description": null },
{"ID":456, "Name":DEF, "Price":150, "description": null },
{"ID":789, "Name":GHI, "Price":150, "description": null }]
</script></head></html>"""
I have tried the following code
from bs4 import BeautifulSoup
import pandas as pd
soup = BeautifulSoup(html_doc)
df = pd.DataFrame("lots")
The output I would like to get would be in this format.
Your data is not valid JSON, so you need to fix it.
I would use:
from bs4 import BeautifulSoup
import pandas as pd
import json, re
soup = BeautifulSoup(html_doc)
# extract script
script = soup.find("script").text.strip()
# get first value that starts with "lot"
data = next((s.split(':', maxsplit=1)[-1] for s in re.split('\n{2,}', script) if s.startswith('"lots"')), None)
# fix the json
if data:
    data = (re.sub(r':\s*([^",}]+)\s*', r':"\1"', data))
    df = pd.DataFrame(json.loads(data))
    print(df)
Output:
    ID Name Price description
0  123  ABC   100        null
1  456  DEF   150        null
2  789  GHI   150        null
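The regex above turns every value into a string, including the numbers and null. As a variant (not the answer's method), the sketch below quotes only the bare words and leaves numbers and null alone, so json.loads returns proper ints and None. The helper name quote_bare_values is made up for illustration, and the approach assumes no value contains a colon, comma, or brace.

```python
import json
import re


def quote_bare_values(text):
    """Hypothetical helper: wrap unquoted, non-numeric values in double quotes."""
    def fix(match):
        value = match.group(1).strip()
        if not value:
            return match.group(0)  # only whitespace before a quoted value
        if value in ("null", "true", "false"):
            return ":" + value  # already a valid JSON literal
        if re.fullmatch(r"-?\d+(\.\d+)?", value):
            return ":" + value  # already a valid JSON number
        return ':"' + value + '"'
    # Match a colon followed by anything that is not a quote or a JSON delimiter.
    return re.sub(r':\s*([^",}\]\[{]+)', fix, text)


raw = """[{"ID":123, "Name":ABC, "Price":100, "description": null },
{"ID":456, "Name":DEF, "Price":150, "description": null }]"""

rows = json.loads(quote_bare_values(raw))
print(rows[0])
```

The resulting list of dictionaries can be passed straight to pd.DataFrame(rows).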
I am looking to get just the "ratingValue" and "reviewCount" from the following application/ld+json but cannot figure out how to do this after looking through numerous How-to's, so I've essentially given up. In advance thank you for your help.
Sample of the application/ld+json
{
"#context": "http://schema.org",
"#graph": [
{
"#type": "Product",
"name": "MERV 8 Replacement for Trion Air Bear 20x20x5 (19.63x20.13x4.88) ",
"description": "MERV 8 Replacement for Trion Air Bear 20x20x5 (19.63x20.13x4.88) - FilterBuy.com",
"productID": 30100,
"sku": "ABR20x20x5M8",
"mpn": "ABR20x20x5M8",
"url": "https://filterbuy.com/brand/trion-air-bear-air-filters/20x20x5-air-bear-20x20/merv-8/",
"itemCondition": "new",
"brand": "FilterBuy",
"image": "https://filterbuy.com/media/pla_images/20x25x5AB/20x25x5AB-m8-(x1).jpg",
"aggregateRating": {
"#type": "AggregateRating",
"ratingValue": 4.79926,
"reviewCount": 538}
My Code:
from bs4 import BeautifulSoup
import bs4
import requests
import json
import re
import numpy as np
import csv
urls = ['https://filterbuy.com/brand/trion-air-bear-air-filters/20x20x5-air-bear-20x20',
'https://filterbuy.com/brand/trion-air-bear-air-filters/16x25x5-air-bear-1400/?selected_merv=11']
for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "lxml")
    mervs = BeautifulSoup(response.text, 'lxml').find_all('strong')
    product = BeautifulSoup(response.text, 'lxml').find("h1", class_="text-center")
    jsonString = soup.find_all('script', type='application/ld+json')[1].text
    json_schema = soup.find_all('script', attrs={'type': 'application/ld+json'})[1]
    json_file = json.loads(json_schema.get_text())
    for i, cart in enumerate(BeautifulSoup(response.text, 'lxml').find_all('form', class_='cart')):
        for tax in cart.attrs:
            if 'data-price' in tax:
                print(product.text, mervs[i].get_text(), [tax], cart[tax], json_file)
In Python, pretty much anything can be nested inside other things. This is an example of lists and dictionaries nested inside a dictionary. You can get the value by thinking about what you need to do at each level.
Start by assigning the above dictionary to a variable, like the_dict. You want to access the "#graph" key, then access the first item in the list it returns, then access "aggregateRating". From there, you can get both the values you want. Your code may look something like this:
the_dict = ...
d = the_dict['#graph'][0]['aggregateRating']
rating_value, review_count = d['ratingValue'], d['reviewCount']
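The same walk can be sketched one level at a time. This is only an illustration: the_dict below is a trimmed copy of the sample from the question (key names exactly as shown there), and dig is a hypothetical helper, not part of any library. It returns a default instead of raising when a level is missing, which can help when not every page carries a rating.

```python
# Trimmed copy of the sample data from the question (keys as shown there).
the_dict = {
    "#context": "http://schema.org",
    "#graph": [
        {
            "#type": "Product",
            "sku": "ABR20x20x5M8",
            "aggregateRating": {
                "#type": "AggregateRating",
                "ratingValue": 4.79926,
                "reviewCount": 538,
            },
        }
    ],
}


def dig(obj, *path, default=None):
    """Hypothetical helper: follow keys and indexes, returning default if a step is missing."""
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return obj


rating_value = dig(the_dict, "#graph", 0, "aggregateRating", "ratingValue")
review_count = dig(the_dict, "#graph", 0, "aggregateRating", "reviewCount")
print(rating_value, review_count)
```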
I am scraping the LaneBryant website.
Part of the source code is
<script type="application/ld+json">
{
"#context": "http://schema.org/",
"#type": "Product",
"name": "Flip Sequin Teach & Inspire Graphic Tee",
"image": [
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477",
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477_Back"
],
"description": "Get inspired with [...]",
"brand": "Lane Bryant",
"sku": "356861",
"offers": {
"#type": "Offer",
"url": "https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861",
"priceCurrency": "USD",
"price":"44.95",
"availability": "http://schema.org/InStock",
"itemCondition": "https://schema.org/NewCondition"
}
}
}
}
</script>
In order to get price in USD, I have written this script:
def getPrice(self,start):
    fprice=[]
    discount = ""
    price1 = start.find('script', {'type': 'application/ld+json'})
    data = ""
    #print("price 1 is + "+ str(price1)+"data is "+str(data))
    price1 = str(price1).split(",")
    #price1=str(price1).split(":")
    print("final price +"+ str(price1[11]))
where start is :
d = webdriver.Chrome('/Users/fatima.arshad/Downloads/chromedriver')
d.get(url)
start = BeautifulSoup(d.page_source, 'html.parser')
It doesn't print the price even though I am getting correct text. How do I get just the price?
In this instance you can just regex for the price
import requests, re
r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
p = re.compile(r'"price":"(.*?)"')
print(p.findall(r.text)[0])
Otherwise, target the appropriate script tag by id and then parse the .text with json library
import requests, json
from bs4 import BeautifulSoup
r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
start = BeautifulSoup(r.text, 'html.parser')
data = json.loads(start.select_one('#pdpInitialData').text)
price = data['pdpDetail']['product'][0]['price_range']['sale_price']
print(price)
price1 = start.find('script', {'type': 'application/ld+json'})
This is actually the <script> tag, so a better name would be
script_tag = start.find('script', {'type': 'application/ld+json'})
You can access the text inside the script tag using .text. That will give you the JSON in this case.
json_string = script_tag.text
Instead of splitting by commas, use a JSON parser to avoid misinterpretations:
import json
clothing=json.loads(json_string)
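Once the JSON is parsed, the price is an ordinary dictionary lookup. A minimal sketch, assuming the script content is valid JSON: json_string here is a cut-down stand-in for script_tag.text, with the key names as shown in the question.

```python
import json

# Stand-in for script_tag.text: a cut-down version of the markup in the question.
json_string = """
{
  "#type": "Product",
  "name": "Flip Sequin Teach & Inspire Graphic Tee",
  "sku": "356861",
  "offers": {
    "#type": "Offer",
    "priceCurrency": "USD",
    "price": "44.95"
  }
}
"""

clothing = json.loads(json_string)
offer = clothing["offers"]
price = float(offer["price"])  # the price is stored as a string in the source
print(offer["priceCurrency"], price)
```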
I've been trying to scrape some content from a news site, such as the news description, tags, comments, etc. The description and tags work. But when I scrape the comments, the comment elements are missing from what BeautifulSoup finds, although they are visible when I inspect the page.
I just want to scrape all the comments (nested comments also) in the page and make them a single string to save in a csv file.
import requests
import bs4
from time import sleep
import os
url = 'https://www.prothomalo.com/bangladesh/article/1573772/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE%E0%A6%A6%E0%A7%87%E0%A6%B6%E0%A6%BF-%E0%A6%AA%E0%A6%BE%E0%A6%B8%E0%A6%AA%E0%A7%8B%E0%A6%B0%E0%A7%8D%E0%A6%9F%E0%A6%A7%E0%A6%BE%E0%A6%B0%E0%A7%80-%E0%A6%B0%E0%A7%8B%E0%A6%B9%E0%A6%BF%E0%A6%99%E0%A7%8D%E0%A6%97%E0%A6%BE%E0%A6%B0%E0%A6%BE-%E0%A6%B8%E0%A7%8C%E0%A6%A6%E0%A6%BF-%E0%A6%A5%E0%A7%87%E0%A6%95%E0%A7%87-%E0%A6%A2%E0%A6%BE%E0%A6%95%E0%A6%BE%E0%A7%9F'
resource = requests.get(url, timeout = 3.0)
soup = bs4.BeautifulSoup(resource.text, 'lxml')
# working as expected
tags = soup.find('div', {'class':'topic_list'})
tag = ''
tags = tags.findAll('a', {'':''})
for t in range(len(tags)):
    tag = tag + tags[t].text + '|'
# working as expected
content_tag = soup.find('div', {'itemprop':'articleBody'})
content_all = content_tag.findAll('p', {'':''})
content = ''
for c in range(len(content_all)):
    content = content + content_all[c].text
# comments not found
comment = soup.find('div', {'class':'comments_holder'})
print(comment)
console:
<div class="comments_holder">
<div class="comments_holder_inner">
<div class="comments_loader"> </div>
<ul class="comments_holder_ul latest">
</ul>
</div>
</div>
What you see in Firefox/Developer tools is not what you received through requests. The comments are loaded separately through AJAX, and they are in JSON format.
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.prothomalo.com/bangladesh/article/1573772/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE%E0%A6%A6%E0%A7%87%E0%A6%B6%E0%A6%BF-%E0%A6%AA%E0%A6%BE%E0%A6%B8%E0%A6%AA%E0%A7%8B%E0%A6%B0%E0%A7%8D%E0%A6%9F%E0%A6%A7%E0%A6%BE%E0%A6%B0%E0%A7%80-%E0%A6%B0%E0%A7%8B%E0%A6%B9%E0%A6%BF%E0%A6%99%E0%A7%8D%E0%A6%97%E0%A6%BE%E0%A6%B0%E0%A6%BE-%E0%A6%B8%E0%A7%8C%E0%A6%A6%E0%A6%BF-%E0%A6%A5%E0%A7%87%E0%A6%95%E0%A7%87-%E0%A6%A2%E0%A6%BE%E0%A6%95%E0%A6%BE%E0%A7%9F'
comment_url = 'https://www.prothomalo.com/api/comments/get_comments_json/?content_id={}'
article_id = re.findall(r'article/(\d+)', url)[0]
comment_data = requests.get(comment_url.format(article_id)).json()
print(json.dumps(comment_data, indent=4))
Prints:
{
"5529951": {
"comment_id": "5529951",
"parent": "0",
"label_depth": "0",
"commenter_name": "MD Asif Iqbal",
"commenter_image": "//profiles.prothomalo.com/profile/999009/picture/",
"comment": "\u098f\u0987 \u09ad\u09be\u09b0 \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u0995\u09c7 \u09b8\u09be\u09b0\u09be\u099c\u09c0\u09ac\u09a8 \u09ac\u09b9\u09a8 \u0995\u09b0\u09a4\u09c7 \u09b9\u09ac\u09c7",
"create_time": "2019-01-08 19:59",
"comment_status": "published",
"like_count": "\u09e6",
"dislike_count": "\u09e6",
"like_me": null,
"dislike_me": null,
"device": "phone",
"content_id": "1573772"
},
"5529952": {
"comment_id": "5529952",
"parent": "0",
... and so on.
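To get from that response to a single string in a CSV file, something along these lines may work. It is a sketch under assumptions: comment_data below is a made-up sample in the same shape as the response printed above, replies are assumed to appear as their own entries (as the parent field suggests), and join_comments is a hypothetical helper.

```python
import csv
import io

# Hypothetical sample in the same shape as the response printed above.
comment_data = {
    "5529951": {"comment_id": "5529951", "parent": "0", "comment": "first comment"},
    "5529952": {"comment_id": "5529952", "parent": "5529951", "comment": "a reply"},
}


def join_comments(data, separator=" | "):
    """Concatenate the 'comment' text of every entry, replies included."""
    return separator.join(entry["comment"] for entry in data.values())


all_comments = join_comments(comment_data)

# Write one CSV row; io.StringIO stands in for a real file opened with newline="".
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["content_id", "comments"])
writer.writerow(["1573772", all_comments])
print(buffer.getvalue())
```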
from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay
url ='https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html' #from where i need data
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find('script')
print(links)
This gives:
<script type="application/ld+json">
{
"#context": "https://schema.org",
"#type": "Organization",
"address": {
"#type": "PostalAddress",
"addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
"postalCode": "411016 ",
"streetAddress": " Pune/Maharashtra "
},
"name": "Banctec Tps India Pvt Ltd",
"telephone": "(020) "
}
</script>
I need to print out the address dictionary, which is nested inside another dictionary, and access addressLocality, postalCode, and streetAddress.
I tried different methods and failed.
The script content is a string of JSON-formatted data. In Python, deserialize it with json.loads():
import json
links= soup.find('script')
print(links)
after this,
address = json.loads(links.text)['address']
print(address)
Use the string property to get the text of the element, then you can parse it as JSON.
links_dict = json.loads(links.string)
address = links_dict['address']
Use the json package:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay
import json
url ='https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html' #from where i need data
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find_all('script')
print(links)
for script in links:
    if '#context' in script.text:
        jsonStr = script.string
        jsonObj = json.loads(jsonStr)
        print (jsonObj['address'])
Output:
print (jsonObj['address'])
{'#type': 'PostalAddress', 'addressLocality': '3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi', 'postalCode': '411016 ', 'streetAddress': ' Pune/Maharashtra '}
Oftentimes script tags contain a lot of JavaScript fluff. You can use regex to isolate the dictionary:
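import re  # needed for re.search below
scripts = soup.findAll('script')  # soup is the BeautifulSoup object built earlier
for script in scripts:
    if '#context' in script.text:
        # Extra step to isolate the dictionary; re.DOTALL lets '.' span newlines.
        jsonStr = re.search(r'\{.*\}', str(script), re.DOTALL).group()
        # Create dictionary
        dct = json.loads(jsonStr)
        print(dct['address'])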
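A greedy regex grabs everything up to the last closing brace, which breaks when more JavaScript with braces follows the object. As an alternative sketch (not the method above), json.JSONDecoder().raw_decode parses one JSON value starting at a given index and ignores whatever comes after it. The helper name first_json_object and the sample script are made up for illustration, and the approach assumes the first '{' in the text opens the JSON object.

```python
import json


def first_json_object(text):
    """Hypothetical helper: decode the first JSON object in text, ignoring what follows."""
    start = text.find("{")
    if start == -1:
        raise ValueError("no '{' found in text")
    # raw_decode returns the parsed value and the index where parsing stopped.
    obj, _end = json.JSONDecoder().raw_decode(text, start)
    return obj


# Stand-in for script.text with JavaScript before and after the object.
script_text = '''var tracking = true;
var org = {
  "#context": "https://schema.org",
  "address": {"postalCode": "411016 ", "streetAddress": " Pune/Maharashtra "}
};
console.log("done");'''

dct = first_json_object(script_text)
print(dct["address"])
```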