I've been trying to scrape some content from a news site: the article description, tags, comments, etc. The description and tags work fine, but for the comments, BeautifulSoup finds nothing even though the elements show up when I inspect the page.
I just want to scrape all the comments on the page (nested comments too) and join them into a single string to save in a CSV file.
import requests
import bs4
from time import sleep
import os
url = 'https://www.prothomalo.com/bangladesh/article/1573772/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE%E0%A6%A6%E0%A7%87%E0%A6%B6%E0%A6%BF-%E0%A6%AA%E0%A6%BE%E0%A6%B8%E0%A6%AA%E0%A7%8B%E0%A6%B0%E0%A7%8D%E0%A6%9F%E0%A6%A7%E0%A6%BE%E0%A6%B0%E0%A7%80-%E0%A6%B0%E0%A7%8B%E0%A6%B9%E0%A6%BF%E0%A6%99%E0%A7%8D%E0%A6%97%E0%A6%BE%E0%A6%B0%E0%A6%BE-%E0%A6%B8%E0%A7%8C%E0%A6%A6%E0%A6%BF-%E0%A6%A5%E0%A7%87%E0%A6%95%E0%A7%87-%E0%A6%A2%E0%A6%BE%E0%A6%95%E0%A6%BE%E0%A7%9F'
resource = requests.get(url, timeout = 3.0)
soup = bs4.BeautifulSoup(resource.text, 'lxml')
# working as expected
tags = soup.find('div', {'class': 'topic_list'})
tag = ''
for t in tags.find_all('a'):
    tag = tag + t.text + '|'

# working as expected
content_tag = soup.find('div', {'itemprop': 'articleBody'})
content = ''
for c in content_tag.find_all('p'):
    content = content + c.text

# comments not found
comment = soup.find('div', {'class': 'comments_holder'})
print(comment)
Console output:
<div class="comments_holder">
<div class="comments_holder_inner">
<div class="comments_loader"> </div>
<ul class="comments_holder_ul latest">
</ul>
</div>
</div>
What you see in Firefox/Developer tools is not what you received through requests. The comments are loading separately through AJAX and they are in JSON format.
import re
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.prothomalo.com/bangladesh/article/1573772/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE%E0%A6%A6%E0%A7%87%E0%A6%B6%E0%A6%BF-%E0%A6%AA%E0%A6%BE%E0%A6%B8%E0%A6%AA%E0%A7%8B%E0%A6%B0%E0%A7%8D%E0%A6%9F%E0%A6%A7%E0%A6%BE%E0%A6%B0%E0%A7%80-%E0%A6%B0%E0%A7%8B%E0%A6%B9%E0%A6%BF%E0%A6%99%E0%A7%8D%E0%A6%97%E0%A6%BE%E0%A6%B0%E0%A6%BE-%E0%A6%B8%E0%A7%8C%E0%A6%A6%E0%A6%BF-%E0%A6%A5%E0%A7%87%E0%A6%95%E0%A7%87-%E0%A6%A2%E0%A6%BE%E0%A6%95%E0%A6%BE%E0%A7%9F'
comment_url = 'https://www.prothomalo.com/api/comments/get_comments_json/?content_id={}'
article_id = re.findall(r'article/(\d+)', url)[0]
comment_data = requests.get(comment_url.format(article_id)).json()
print(json.dumps(comment_data, indent=4))
Prints:
{
    "5529951": {
        "comment_id": "5529951",
        "parent": "0",
        "label_depth": "0",
        "commenter_name": "MD Asif Iqbal",
        "commenter_image": "//profiles.prothomalo.com/profile/999009/picture/",
        "comment": "\u098f\u0987 \u09ad\u09be\u09b0 \u09ac\u09be\u0982\u09b2\u09be\u09a6\u09c7\u09b6\u0995\u09c7 \u09b8\u09be\u09b0\u09be\u099c\u09c0\u09ac\u09a8 \u09ac\u09b9\u09a8 \u0995\u09b0\u09a4\u09c7 \u09b9\u09ac\u09c7",
        "create_time": "2019-01-08 19:59",
        "comment_status": "published",
        "like_count": "\u09e6",
        "dislike_count": "\u09e6",
        "like_me": null,
        "dislike_me": null,
        "device": "phone",
        "content_id": "1573772"
    },
    "5529952": {
        "comment_id": "5529952",
        "parent": "0",
... and so on.
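From there, the original goal of joining all comments (nested replies included, identified by a non-zero "parent") into a single string for a CSV file is ordinary dictionary work. A minimal sketch, using made-up sample comments in the structure shown above:

```python
import csv

# Sample data in the structure shown above: two top-level comments and
# one nested reply (non-zero "parent"). Keys trimmed for brevity.
comment_data = {
    "5529951": {"comment_id": "5529951", "parent": "0", "comment": "first comment"},
    "5529952": {"comment_id": "5529952", "parent": "0", "comment": "second comment"},
    "5529953": {"comment_id": "5529953", "parent": "5529951", "comment": "a reply"},
}

# Join every comment, nested or not, into one string.
all_comments = ' | '.join(c['comment'] for c in comment_data.values())
print(all_comments)

# Save it as a single CSV field.
with open('comments.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['comments'])
    writer.writerow([all_comments])
```

Dictionaries preserve insertion order in Python 3.7+, so the comments come out in the order the API returned them.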
Related
I'm trying to extract links from a website using Beautiful Soup. The website is https://www.thehindu.com/search/?q=central+vista&sort=relevance&start=#gsc.tab=0&gsc.q=central%20vista&gsc.page=1
The code I used is given below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.thehindu.com/search/?q=central+vista&sort=relevance&start=#gsc.tab=0&gsc.q=central%20vista&gsc.page=1'
reqs = requests.get(url)
soup = BeautifulSoup(reqs.text, 'html.parser')
urls = []
for link in soup.find_all('a'):
    print(link.get('href'))
    urls.append(link.get('href'))
The code runs and gives all the URLs present on the page except the ones in the embedded Google search results, which is the part I need. I'm basically stuck. Can someone help me sort it out?
The data you see is loaded with JavaScript, so BeautifulSoup doesn't see it. You can use requests plus the re/json modules to get the data:
import re
import json
import requests
url = "https://cse.google.com/cse/element/v1"
params = {
    "rsz": "filtered_cse",
    "num": "10",
    "hl": "sk",
    "source": "gcsc",
    "gss": ".com",
    "cselibv": "f275a300093f201a",
    "cx": "264d7caeb1ba04bfc",
    "q": "central vista",
    "safe": "active",
    "cse_tok": "AB1-RNWPlN01WUQgebV0g3LpWU6l:1670351743367",
    "lr": "",
    "cr": "",
    "gl": "",
    "filter": "0",
    "sort": "",
    "as_oq": "",
    "as_sitesearch": "",
    "exp": "csqr,cc,4861326",
    "callback": "google.search.cse.api3099",
}
data = requests.get(url, params=params).text
data = re.search(r"(?s)\((.*)\)", data).group(1)
data = json.loads(data)
for r in data["results"]:
    print(r["url"])
Prints:
https://www.thehindu.com/news/national/estimated-cost-of-central-vista-revamp-plan-without-pmo-goes-up-to-13450-cr/article33358124.ece
https://www.thehindu.com/news/national/central-vista-project-sc-dismisses-plea-against-delhi-hc-verdict-refusing-to-halt-work/article35031575.ece
https://www.thehindu.com/opinion/editorial/monumental-hurry-on-central-vista-project/article31734021.ece
https://www.thehindu.com/news/national/central-vista-new-buildings-on-kg-marg-africa-avenue-proposed-for-relocating-govt-offices/article31702494.ece
https://www.thehindu.com/society/beyond-the-veils-of-secrecy-the-central-vista-project-is-both-the-cause-and-effect-of-its-own-multiple-failures/article32980560.ece
https://www.thehindu.com/news/national/estimated-cost-of-central-vista-revamp-plan-without-pmo-goes-up-to-13450-cr/article33358124.ece%3Fhomepage%3Dtrue
https://www.thehindu.com/news/national/work-on-new-parliament-central-vista-avenue-projects-on-track/article36296821.ece
https://www.thehindu.com/news/national/2466-trees-removed-for-central-vista-projects-so-far-govt/article65665595.ece
https://www.thehindu.com/news/national/central-vista-avenue-redevelopment-project-to-be-completed-by-july-18-puri/article65611471.ece%3Fhomepage%3Dtrue
https://www.thehindu.com/news/national/central-vista-jharkhand-firm-is-lowest-bidder-for-vice-president-enclave/article37310541.ece
I am trying to scrape a university world-rankings website; however, I have trouble extracting one of the keys without its HTML tags.
I get <div class="td-wrap"> Massachusetts Institute of Technology (MIT) </div>
I'd like to get: Massachusetts Institute of Technology (MIT)
Here is how I parse the data:
def parser_page(json):
    if json:
        items = json.get('data')
        for i in range(len(items)):
            item = items[i]
            qsrank = {}
            if "=" in item['rank_display']:
                rk_str = str(item['rank_display']).split('=')[-1]
                qsrank['rank_display'] = rk_str
            else:
                qsrank['rank_display'] = item['rank_display']
            qsrank['title'] = item['title']
            qsrank['region'] = item['region']
            qsrank['score'] = item['score']
            yield qsrank
For more information, here is how the keys are presented:
https://www.topuniversities.com/sites/default/files/qs-rankings-data/en/3740566.txt?1624879808?v=1625562924528
Everything is fine besides the title; as you can see above, I am trying to extract the data without the tags around it.
To get the text out of the HTML stored in the JSON value, you can use BeautifulSoup. For example:
import json
from bs4 import BeautifulSoup
json_data = r"""{
    "core_id": "624",
    "country": "Italy",
    "city": "Trieste",
    "guide": "",
    "nid": "297237",
    "title": "<div class=\"td-wrap\"><a href=\"\/universities\/university-trieste\" class=\"uni-link\">University of Trieste<\/a><\/div>",
    "logo": "\/sites\/default\/files\/university-of-trieste_624_small.jpg",
    "score": "",
    "rank_display": "651-700",
    "region": "Europe",
    "stars": "",
    "recm": "0--"
}"""
json_data = json.loads(json_data)
soup = BeautifulSoup(json_data["title"], "html.parser")
print(soup.get_text(strip=True))
Prints:
University of Trieste
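Plugged into the question's parser_page, the same trick cleans the title at the point where it is copied. A sketch, assuming every item carries an HTML-wrapped title like the one above (clean_title is a hypothetical helper name):

```python
from bs4 import BeautifulSoup

def clean_title(html_title):
    # Strip the wrapping <div>/<a> tags, keeping only the visible text.
    return BeautifulSoup(html_title, "html.parser").get_text(strip=True)

# One item in the shape the rankings feed returns.
item = {"title": '<div class="td-wrap"><a href="/universities/university-trieste" class="uni-link">University of Trieste</a></div>'}
print(clean_title(item["title"]))  # University of Trieste
```

Inside the generator, the assignment would then read `qsrank['title'] = clean_title(item['title'])` instead of copying the raw HTML.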
I am trying to use BeautifulSoup to get some data from a website. The data is returned as follows:
window._sharedData = {
    "config": {
        "csrf_token": "DMjhhPBY0i6ZyMKYQPjMjxJhRD0gkRVQ",
        "viewer": null,
        "viewerId": null
    },
    "country_code": "IN",
    "language_code": "en",
    "locale": "en_US"
}
How can I import the same into json.loads so I can extract the data?
You first need to turn it into valid JSON by removing the variable assignment, then parse the resulting string:
import json
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
text = soup.find('script').text
text = text.replace('window._sharedData = ', '')
data = json.loads(text)
country_code = data['country_code']
Or you can use the eval function to turn it into a Python dictionary. For that you need to replace JSON literals with their Python equivalents (null becomes the string 'None') before evaluating the string:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
text = soup.find('script').text
text = text.replace('window._sharedData = ', '')
text = text.replace('null', 'None')
data = eval(text)
country_code = data['country_code']
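A middle ground worth mentioning: ast.literal_eval from the standard library evaluates only literals, so it avoids eval's arbitrary-code-execution risk while still accepting Python-style dictionaries. A sketch, assuming the same window._sharedData payload (trimmed here to two keys):

```python
import ast

text = 'window._sharedData = {"country_code": "IN", "viewer": null}'
text = text.replace('window._sharedData = ', '')
# literal_eval only accepts Python literals, so map JSON's null to None first
# (true/false would likewise need mapping to True/False).
text = text.replace('null', 'None')
data = ast.literal_eval(text)
print(data['country_code'])  # IN
```

That said, json.loads remains the cleanest option when the payload is already valid JSON.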
I am scraping the LaneBryant website.
Part of the source code is
<script type="application/ld+json">
{
"#context": "http://schema.org/",
"#type": "Product",
"name": "Flip Sequin Teach & Inspire Graphic Tee",
"image": [
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477",
"http://lanebryant.scene7.com/is/image/lanebryantProdATG/356861_0000015477_Back"
],
"description": "Get inspired with [...]",
"brand": "Lane Bryant",
"sku": "356861",
"offers": {
"#type": "Offer",
"url": "https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861",
"priceCurrency": "USD",
"price":"44.95",
"availability": "http://schema.org/InStock",
"itemCondition": "https://schema.org/NewCondition"
}
}
</script>
In order to get price in USD, I have written this script:
def getPrice(self, start):
    fprice = []
    discount = ""
    price1 = start.find('script', {'type': 'application/ld+json'})
    data = ""
    # print("price 1 is + " + str(price1) + " data is " + str(data))
    price1 = str(price1).split(",")
    # price1 = str(price1).split(":")
    print("final price +" + str(price1[11]))
where start is:
d = webdriver.Chrome('/Users/fatima.arshad/Downloads/chromedriver')
d.get(url)
start = BeautifulSoup(d.page_source, 'html.parser')
It doesn't print the price even though I am getting the correct text. How do I get just the price?
In this instance you can just use a regex for the price:
import requests, re
r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
p = re.compile(r'"price":"(.*?)"')
print(p.findall(r.text)[0])
Otherwise, target the appropriate script tag by id and then parse its .text with the json library:
import requests, json
from bs4 import BeautifulSoup
r = requests.get('https://www.lanebryant.com/flip-sequin-teach-inspire-graphic-tee/prd-356861#color/0000015477', headers = {'User-Agent':'Mozilla/5.0'})
start = BeautifulSoup(r.text, 'html.parser')
data = json.loads(start.select_one('#pdpInitialData').text)
price = data['pdpDetail']['product'][0]['price_range']['sale_price']
print(price)
price1 = start.find('script', {'type': 'application/ld+json'})
This is actually the <script> tag, so a better name would be
script_tag = start.find('script', {'type': 'application/ld+json'})
You can access the text inside the script tag using .text. That will give you the JSON in this case.
json_string = script_tag.text
Instead of splitting by commas, use a JSON parser to avoid misinterpretations:
import json
clothing = json.loads(json_string)
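Once the JSON is parsed, the price is a plain nested lookup rather than an index into a comma split. A sketch using a trimmed version of the JSON-LD block from the question:

```python
import json

# JSON-LD snippet from the question, trimmed to the relevant keys.
json_string = """{
    "#context": "http://schema.org/",
    "#type": "Product",
    "name": "Flip Sequin Teach & Inspire Graphic Tee",
    "offers": {
        "#type": "Offer",
        "priceCurrency": "USD",
        "price": "44.95"
    }
}"""

clothing = json.loads(json_string)
print(clothing['offers']['price'])  # 44.95
```

Unlike splitting on commas, this keeps working if the site reorders or adds fields.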
from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time #to add delay
url = 'https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html'  # from where I need data
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
type(soup)
links= soup.find('script')
print(links)
This gives:
<script type="application/ld+json">
{
    "#context": "https://schema.org",
    "#type": "Organization",
    "address": {
        "#type": "PostalAddress",
        "addressLocality": "3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi",
        "postalCode": "411016 ",
        "streetAddress": " Pune/Maharashtra "
    },
    "name": "Banctec Tps India Pvt Ltd",
    "telephone": "(020) "
}
</script>
I need to print out the address dictionary, which is nested inside another dictionary, and access addressLocality, postalCode, and streetAddress.
I tried different methods and failed.
You have a string of JSON-formatted data; deserialize it with json.loads():
import json

links = soup.find('script')
print(links)
After this:
address = json.loads(links.text)['address']
print(address)
Use the string property to get the text of the element, then you can parse it as JSON.
links_dict = json.loads(links.string)
address = links_dict['address']
Use the json package:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import lxml
import time  # to add delay
import json

url = 'https://www.fundoodata.com/companies-detail/Banctec-Tps-India-Pvt-Ltd/48600.html'  # from where I need data
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')

links = soup.find_all('script')
for script in links:
    if '#context' in script.text:
        jsonStr = script.string
        jsonObj = json.loads(jsonStr)
        print(jsonObj['address'])
Output:
{'#type': 'PostalAddress', 'addressLocality': '3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi', 'postalCode': '411016 ', 'streetAddress': ' Pune/Maharashtra '}
Often, script tags contain a lot of JavaScript fluff. You can use a regex to isolate the dictionary:
import re
import json

scripts = soup.findAll('script')
for script in scripts:
    if '#context' in script.text:
        # Extra step to isolate the dictionary.
        jsonStr = re.search(r'\{.*\}', str(script)).group()
        # Create dictionary
        dct = json.loads(jsonStr)
        print(dct['address'])
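To then reach the individual fields the question asks for, index into the nested dictionary. A sketch using the address dict shown in the output above:

```python
# The address dictionary as printed by the answers above.
address = {
    '#type': 'PostalAddress',
    'addressLocality': '3rd Floor, Sharda Arcade, Pune Satara Road, Bibvewadi',
    'postalCode': '411016 ',
    'streetAddress': ' Pune/Maharashtra ',
}

# Plain key lookups; .strip() clears the stray whitespace in the source data.
print(address['addressLocality'])
print(address['postalCode'].strip())     # 411016
print(address['streetAddress'].strip())  # Pune/Maharashtra
```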