Web Scraping - No content displayed - python

I am trying to fetch the stock price of a company that the user specifies at a prompt. I am using requests to get the page source and BeautifulSoup to scrape it. I am fetching the data from google.com and want only the last stock price (806.93 in the picture). When I run my script, it prints None; none of the data is being fetched. What am I missing?
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
company = raw_input("Enter the company name:")
URL = "https://www.google.co.in/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q="+company+"+stock"
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
print code.contents[0]
The source code of the page looks like this :

Looks like that source is from inspecting the element, not the actual source. A couple of suggestions. Use google finance to get rid of some noise - https://www.google.com/finance?q=googl would be the URL. On that page there is a section that looks like this:
<div class=g-unit>
<div id=market-data-div class="id-market-data-div nwp g-floatfix">
<div id=price-panel class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span id="ref_694653_l">806.93</span>
</span>
<div class="id-price-change nwp">
<span class="ch bld"><span class="chg" id="ref_694653_c">+9.68</span>
<span class="chg" id="ref_694653_cp">(1.21%)</span>
</span>
</div>
</div>
You should be able to pull the number out of that.
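As a minimal sketch of pulling the number out of that markup, the snippet below parses a trimmed copy of the HTML shown above with BeautifulSoup. The `#price-panel span.pr span` selector and the `html.parser` backend are my own choices for illustration, and the markup on the live site may differ.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup quoted above; the live page may differ.
html = """
<div id=price-panel class="id-price-panel goog-inline-block">
<div>
<span class="pr">
<span id="ref_694653_l">806.93</span>
</span>
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumed selector: the inner span of the "pr" span inside the price panel.
price_tag = soup.select_one("#price-panel span.pr span")
price = float(price_tag.get_text(strip=True)) if price_tag else None
print(price)
```

The `if price_tag else None` guard avoids an AttributeError when the selector matches nothing, which is what happens when the server returns different HTML than expected.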

I went to
https://www.google.com/?gfe_rd=cr&ei=-AKmV6eqC-LH8AfRqb_4Aw#newwindow=1&safe=off&q=+google+stock
, did a right click and "View Page Source" but did not see the code that you screenshotted.
Then I typed out a section of your code screenshot and created a BeautifulSoup object with it and then ran your find on it:
test_screenshot = BeautifulSoup('<div class="_F0c" data-tmid="/m/07zln7n"><span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span> = $0<span class ="_hgj">USD</span>')
test_screenshot.find('span',{'class':'_Rnb fmob_pr fac-l','data-symbol':'GOOGL'})
Which will output what you want:
<span class="_Rnb fmob_pr fac-l" data-symbol="GOOGL" data-tmid="/m/07zln7n" data-value="806.93">806.93.</span>
This means that the code you are getting is not the code you expect to get.
I suggest using the google finance page:
https://www.google.com/finance?q=google (replace 'google' with what you want to search), which will give you what you are looking for:
request = requests.get(URL)
soup = BeautifulSoup(request.content,"lxml")
code = soup.find("span",{'class':'pr'})
print code.contents
Will give you
[u'\n', <span id="ref_694653_l">806.93</span>, u'\n'].
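The surrounding newline entries are just whitespace text nodes. A small sketch, using a static copy of the span rather than a live request, of getting only the number with `get_text(strip=True)`:

```python
from bs4 import BeautifulSoup

# Static copy of the span from the finance page (assumed unchanged).
html = '<span class="pr">\n<span id="ref_694653_l">806.93</span>\n</span>'
soup = BeautifulSoup(html, "html.parser")

code = soup.find("span", {"class": "pr"})

# .contents keeps the whitespace-only text nodes around the inner span...
print(code.contents)

# ...while get_text(strip=True) drops them and returns only the price text.
last_price = code.get_text(strip=True)
print(last_price)
```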
In general, scraping Google search results can get really nasty, so try to avoid it if you can.
You might also want to look into Yahoo Finance Python API.

You're looking for this:
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
It might be because there's no user-agent specified in your request headers.
The default requests user-agent is python-requests, so Google can tell that the request comes from a bot rather than a "real" user visit and blocks it. You then receive different HTML, with different selectors and elements, or some sort of error page. Setting a browser user-agent in the HTTP request headers makes the request look like a regular user visit.
Pass user-agent into request headers:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
response = requests.get('YOUR_URL', headers=headers)
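To see what requests sends when no header is passed, the sketch below prepares (but never sends) two requests and compares their User-Agent values. The URL is a placeholder and the browser string is the one from above.

```python
import requests

url = "https://www.google.com/search?q=test"  # placeholder, never sent

headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

session = requests.Session()

# prepare_request() merges the session's default headers without sending anything.
default_req = session.prepare_request(requests.Request("GET", url))
custom_req = session.prepare_request(requests.Request("GET", url, headers=headers))

default_ua = default_req.headers["User-Agent"]
custom_ua = custom_req.headers["User-Agent"]
print(default_ua)  # starts with "python-requests/"
print(custom_ua)
```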
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0'
}
params = {
'q': 'alphabet inc class a stock',
'gl': 'us'
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# two selectors which will handle two layouts
current_price = soup.select_one('.wT3VGc, .XcVN5d').text
print(current_price)
# 2,816.00
Alternatively, you can achieve the same thing by using Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and get the data you want fast rather than figuring out why certain things don't work as expected and then to maintain it over time.
Code to integrate:
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "alphabet inc class a stock",
"gl": "us",
"hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
current_price = results['answer_box']['price']
print(current_price)
# 2,816.00
P.S - I wrote an in-depth blog post about how to reduce the chance of being blocked while web scraping search engines.
Disclaimer, I work for SerpApi.

Related

Empty list while scraping Google Search Result

I'm trying to scrape Google search results, but all I'm getting as output is an empty list. Do you have any idea what's wrong here? I found a similar post on Stack Overflow whose solution says to try setting a user agent. I tried that, but it still returns nothing. Please share if you have any idea.
import requests, webbrowser
from bs4 import BeautifulSoup
user_input = input("Enter something to search:")
print("googling.....")
google_search = requests.get("https://www.google.com/search?q="+user_input)
# print(google_search.text)
soup = BeautifulSoup(google_search.text , 'html.parser')
# print(soup.prettify())
search_results = soup.select('.r a')
# print(search_results)
for link in search_results[:5]:
    actual_link = link.get('href')
    print(actual_link)
    webbrowser.open('https://google.com/'+actual_link)
Google blocks your requests and returns this error page:
"This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services. This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help - a different computer using the same IP address may be responsible. Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly."
Try using selenium + python to get all the links
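Before parsing, it can help to check whether the response is the block page rather than search results. This is only a heuristic sketch: `looks_blocked` is a hypothetical helper, the phrase it searches for is taken from the error text quoted above, and treating HTTP 429 as a block signal is an assumption.

```python
# Phrase taken from the block page text quoted above.
BLOCK_PHRASE = "detects requests coming from your computer network"


def looks_blocked(status_code, html):
    """Heuristic: True if the response resembles Google's CAPTCHA/block page."""
    # Assumption: a 429 (Too Many Requests) status also indicates blocking.
    return status_code == 429 or BLOCK_PHRASE in html


sample_block_page = (
    "This page appears when Google automatically detects requests "
    "coming from your computer network which appear to be in violation "
    "of the Terms of Service."
)
print(looks_blocked(200, sample_block_page))
print(looks_blocked(200, "<html><body>normal results</body></html>"))
```

With requests this would be called as `looks_blocked(response.status_code, response.text)` before handing the text to BeautifulSoup.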
To get results from the Google page, you have to specify a User-Agent HTTP header. For English results, add the hl=en parameter to the search URL:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
    actual_link = link.get('href')
    print(actual_link)
Prints:
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:wHCoEH9G9w8J:https://en.wikipedia.org/wiki/Tree+&cd=22&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAVegQIAxAH
https://simple.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:tNzOpY417g8J:https://simple.wikipedia.org/wiki/Tree+&cd=23&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://simple.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAWegQIARAH
https://www.britannica.com/plant/tree
#
https://webcache.googleusercontent.com/search?q=cache:91hg5d2649QJ:https://www.britannica.com/plant/tree+&cd=24&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://www.britannica.com/plant/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAXegQIAhAJ
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
#
https://webcache.googleusercontent.com/search?q=cache:AVSszZLtPiQJ:https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree+&cd=25&hl=en&ct=clnk&gl=sk
https://teamtrees.org/
#
https://webcache.googleusercontent.com/search?q=cache:gVbpYoK7meUJ:https://teamtrees.org/+&cd=26&hl=en&ct=clnk&gl=sk
https://www.ldoceonline.com/dictionary/tree
#
https://webcache.googleusercontent.com/search?q=cache:oyS4e3WdMX8J:https://www.ldoceonline.com/dictionary/tree+&cd=27&hl=en&ct=clnk&gl=sk
https://en.wiktionary.org/wiki/tree
#
https://webcache.googleusercontent.com/search?q=cache:s_tZIjpvHZIJ:https://en.wiktionary.org/wiki/tree+&cd=28&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wiktionary.org/wiki/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAbegQICBAH
https://www.dictionary.com/browse/tree
#
https://webcache.googleusercontent.com/search?q=cache:EhFIP6m4MuIJ:https://www.dictionary.com/browse/tree+&cd=29&hl=en&ct=clnk&gl=sk
https://www.treepeople.org/tree-benefits
#
https://webcache.googleusercontent.com/search?q=cache:4wLYFp4zTuUJ:https://www.treepeople.org/tree-benefits+&cd=30&hl=en&ct=clnk&gl=sk
EDIT: To filter results you can use this:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
    actual_link = link.get('href')
    if actual_link.startswith('#') or \
       actual_link.startswith('https://webcache.googleusercontent.com') or \
       actual_link.startswith('/search?'):
        continue
    print(actual_link)
Prints (for example):
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
https://simple.wikipedia.org/wiki/Tree
https://www.britannica.com/plant/tree
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
https://teamtrees.org/
https://www.ldoceonline.com/dictionary/tree
https://en.wiktionary.org/wiki/tree
https://www.dictionary.com/browse/tree
https://www.treepeople.org/tree-benefits
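The three `startswith` checks can be folded into one call, since `str.startswith` accepts a tuple of prefixes. A small sketch with hard-coded sample links based on the output above; `keep_link` is a hypothetical helper name:

```python
# Prefixes of links that are not organic results (from the filter above).
SKIP_PREFIXES = (
    "#",
    "https://webcache.googleusercontent.com",
    "/search?",
)


def keep_link(href):
    """Return True for links that should be printed."""
    # href can be None when an <a> tag has no href attribute.
    return bool(href) and not href.startswith(SKIP_PREFIXES)


sample_links = [
    "https://en.wikipedia.org/wiki/Tree",
    "#",
    "https://webcache.googleusercontent.com/search?q=cache:wHCoEH9G9w8J",
    "/search?hl=en&q=related:https://en.wikipedia.org/wiki/Tree+tree",
    "https://simple.wikipedia.org/wiki/Tree",
    None,
]

filtered = [href for href in sample_links if keep_link(href)]
print(filtered)
```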
Most websites nowadays use JavaScript to dynamically load their webpages. Google is one of those websites. In order for the full DOM (document object model) to load, you need a JavaScript engine, which BeautifulSoup and requests don't have. Arun recommended selenium, and I do too, as it has an embedded JavaScript engine.
Here is the Python Selenium documentation:
https://selenium-python.readthedocs.io/
The output the OP wants doesn't come from JavaScript, contrary to what Serket mentioned. All the data the OP needs is located in the HTML.
For the same reason there's no point in using selenium: it's all there in the HTML, not rendered via JavaScript.
One of the problems, as other people mentioned, is that no user-agent was specified, or possibly that the wrong user-agent was passed, which leads to completely different HTML that contains an error message or something similar. Check what your user-agent is.
Pass user-agent:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get(YOUR_URL, headers=headers)
You can also grab attributes by passing them in square brackets:
element.get('href')
# is equivalent to
element['href']
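The two forms only differ when the attribute is missing: `get()` returns `None`, while square brackets raise `KeyError`. A minimal sketch on a static pair of tags:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="https://example.com">link</a><a>no href</a>',
                     "html.parser")
with_href, without_href = soup.find_all("a")

print(with_href.get("href"))     # same value as with_href["href"]
print(without_href.get("href"))  # None, no exception

try:
    without_href["href"]
except KeyError:
    print("KeyError: tag has no href attribute")
```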
Code and example in the online IDE (CSS selectors reference):
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah" # query
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# container with links and iterate over it
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
-------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.textualtees.com/products/fus-ro-dah-t-shirt
'''
Alternatively, you can achieve the same thing by using Google Search Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out why or how to deal with such a problem since this part (extraction/scraping) is already done for the end-user. All that needs to be done is just to iterate over structured JSON and get what you want.
Code:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro dah",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['link'])
---------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.etsy.com/market/fus_ro_dah
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.textualtees.com/products/fus-ro-dah-t-shirt
https://tenor.com/search/fus-ro-dah-gifs
'''
P.S - I have a blog post that covers a bit more in-depth how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.

Using linkGrabber to get 'href' from google search in python

Ok, so all I want to do is get the very first link on the first page of a Google search. I tried to use BeautifulSoup but it didn't work out at all; I couldn't seem to find a way to get the link. I then tried linkGrabber, so now I get all the URLs in the Google search (I have limited the results to only 1 per page). My code is:
import re
import linkGrabber
import urllib
input = str(input('Give movie name: '))
input = urllib.parse.quote_plus(input)
imdb_s = '+imdb+review'
n = 1
g_s = 'https://www.google.com/search?q='+ input + imdb_s +'&num=' + str(n)
links = linkGrabber.Links(g_s)
gb = links.find(pretty=True)
print(gb)
However, when I print, I get about 15 links that come from Google itself and that I do not want to use. I want to focus on only one specific href and get that. Can anyone please help me?
You can use the google search library (I think it is pip install google). This library also relies on Beautiful Soup, but it is built to return only search results. The problem is that the page Google returns when you search has ads and a bunch of other links that aren't the actual search results.
You can also add "site:imdb.com+" to your query to search only on IMDb.
That said, I've stopped using that for my googling needs because it's against Google's terms of service. I'm not moralizing, but the reality is that I can't get much reliability because Google keeps sniffing out bots and showing them reCAPTCHAs.
The correct way to do it would be to use Google's Custom Search API, which is also good for returning only the info you need, and it's free for 100 searches per day.
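A rough sketch of what a Custom Search request could look like. The endpoint and the `key`, `cx`, `q`, and `num` parameters are as I understand the Custom Search JSON API; the key and engine ID values are placeholders, the query is an invented example, and the request is only prepared here, not sent.

```python
import requests

params = {
    "key": "YOUR_API_KEY",           # placeholder
    "cx": "YOUR_SEARCH_ENGINE_ID",   # placeholder
    "q": "inception imdb review",    # example query
    "num": 1,                        # ask for a single result
}

# Build the request without sending it, to inspect the final URL.
prepared = requests.Request(
    "GET", "https://www.googleapis.com/customsearch/v1", params=params
).prepare()
print(prepared.url)

# With real credentials, something like this should return the first link,
# assuming the response JSON has an "items" list with "link" fields:
# data = requests.get(prepared.url).json()
# first_link = data["items"][0]["link"]
```

Passing the query through `params` also takes care of URL encoding, so there is no need to call `quote_plus` by hand.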
To get the very first link you can use the bs4 select_one() method.
It didn't work because you didn't specify a user-agent in the request headers. Without one, requests sends its default user-agent, python-requests, and Google does not treat the request as a real user visit.
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get(f'https://www.google.com/search?q=minecraft', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
for container in soup.find_all('div', class_='tF2Cxc'):
    title = container.select_one('.DKV0Md').text
    link = container.find('a')['href']
    print(f'{title}\n{link}')
# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
Alternatively, you can do it as well by using Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches.
The main difference is that you don't have to think about why Google blocks you or why a certain selector gives the wrong output even though it shouldn't. That is already done for the end-user, with JSON output.
Check out the Playground.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
"api_key": os.getenv("API_KEY"), # environment for API_KEY
"engine": "google",
"q": "minecraft",
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    print(f'{title}\n{link}')
# part of the output:
'''
Minecraft Official Site | Minecraft
https://www.minecraft.net/en-us/
Minecraft Classic
https://classic.minecraft.net/
'''
Disclaimer, I work for SerpApi.

Beautiful Soup CSS selector not finding anything

I'm using Python 3. The code below is supposed to let the user enter a search term into the command line, after which it searches Google and runs through the HTML of the results page to find tags matching the CSS selector ('.r a').
Say we search for the term "cats." I know the tags I'm looking for exist on the "cats" search results page since I looked through the page source myself.
But when I run my code, the linkElems list is empty. What is going wrong?
import requests, sys, bs4
print('Googling...')
res = requests.get('http://google.com/search?q=' +' '.join(sys.argv[1:]))
print(res.raise_for_status())
soup = bs4.BeautifulSoup(res.text, 'html5lib')
linkElems = soup.select(".r a")
print(linkElems)
The ".r" class is rendered by JavaScript, so it's not available in the HTML received. You can either render the JavaScript using selenium or similar, or you can try a more creative solution to extract the links from the tags. First check that the tags exist by finding them without the ".r" class, using soup.find_all("a"). Then, as an example, you can use a regex to extract all URLs beginning with "/url?q=":
import re
linkelems = soup.find_all(href=re.compile(r"^/url\?q=.*"))
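The matched `href` values still have the `/url?q=` wrapper around the real address. A sketch of unwrapping them with `urllib.parse`, run on a hand-written sample that assumes the target sits in the `q` query parameter:

```python
import re
from urllib.parse import urlparse, parse_qs

from bs4 import BeautifulSoup

# Hand-written sample of the result markup (an assumption, not real output).
html = """
<a href="/url?q=https://en.wikipedia.org/wiki/Cat&amp;sa=U&amp;ved=abc">Cat</a>
<a href="/search?q=cats&amp;tbm=isch">Images</a>
"""
soup = BeautifulSoup(html, "html.parser")

links = []
for tag in soup.find_all(href=re.compile(r"^/url\?q=")):
    # parse_qs decodes the query string; "q" holds the destination URL.
    query = parse_qs(urlparse(tag["href"]).query)
    links.extend(query.get("q", []))

print(links)
```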
The parts you want to extract are not rendered by JavaScript, contrary to what Matts mentioned, and you don't need regex for such a task.
Make sure you're passing a user-agent, otherwise Google will eventually block your request. That might be why you were getting empty output: you received completely different HTML. Check what your user-agent is. I have already answered elsewhere about what a user-agent and HTTP headers are.
Pass user-agent into HTTP headers:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
html5lib is the slowest parser; try lxml instead, it's way faster. If you want an even faster parser, have a look at selectolax.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "selena gomez"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
Alternatively, you can achieve the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with the parsing part, instead, you only need to iterate over structured JSON and get the data you want, plus you don't have to maintain the parser over time.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "selena gomez",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    link = result['link']
    print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
P.S - I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.

How to scrape data in h4 with beautifulsoup?

I am trying to scrape the results data from this website (https://www.ufc.com/matchup/908/7717/post) and I am completely at a loss for why my proposed solution isn't working.
The outer html that I am trying to scrape is <h4 class="e-t5 winner">Jon Jones</h4>. I don't have a lot of experience with web scraping or HTML but all of the relevant information is contained in the h4 tag.
I have been successful in extracting the data from the h2 tag but I am confused as to why the same approach doesn't work for h4. For example, to extract the relevant data from <h2 class="field--name-name name_given red">Jon Jones <span class="field--field-rank rank"></span></h2> the following code works.
from requests import get
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
}
raw_html = get('https://www.ufc.com/matchup/908/7717/post', headers=headers)
html = BeautifulSoup(raw_html.content)
# this works
html.find_all('h2', attrs={'class': 'field--name-name name_given red'})[0].get_text().strip()
# this does not work?
html.find_all('h4', attrs={'class': 'e-t5 winner red'})
# this code gets me to the headers but not the actual listed data inside
html.find('div', attrs={'class': 'l-flex--4col-2to4'})
I am mostly confused as to why the above doesn't work and why the text I can see when inspecting the element in my browser, doesn't appear in the scraped HTML.
It is added dynamically. You can find the source in the network tab. Assuming there is always one winner you can use something like
import requests
r = requests.get('https://dvk92099qvr17.cloudfront.net/V1/908/Fnt.json').json()
winner = [fighter['FullName'] for fighter in r['FMLiveFeed']['Fights'][0]['Fighters'] if fighter['Outcome'] == 'Win'][0]
print(winner)
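The list comprehension followed by `[0]` raises `IndexError` if no fighter has a `Win` outcome. A sketch using `next()` with a default instead, run on a made-up dictionary that mirrors only the keys used above (the names and the `Loss` value are invented for illustration):

```python
# Made-up sample shaped like the keys used above; not real feed data.
sample = {
    "FMLiveFeed": {
        "Fights": [
            {
                "Fighters": [
                    {"FullName": "Fighter A", "Outcome": "Win"},
                    {"FullName": "Fighter B", "Outcome": "Loss"},
                ]
            }
        ]
    }
}


def find_winner(feed):
    """Return the first fighter marked 'Win' in the first fight, else None."""
    fighters = feed["FMLiveFeed"]["Fights"][0]["Fighters"]
    return next(
        (f["FullName"] for f in fighters if f.get("Outcome") == "Win"),
        None,
    )


print(find_winner(sample))
```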

Retrieve a number from a span tag, using Python requests and Beautiful Soup

I'm new to python and html. I am trying to retrieve the number of comments from a page using requests and BeautifulSoup.
In this example I am trying to get the number 226. Here is the code as I can see it when I inspect the page in Chrome:
<a title="Go to the comments page" class="article__comments-counts" href="http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/comments/">
<span class="civil-comment-count" data-site-id="globeandmail" data-id="33519766" data-language="en">
226
</span>
Comments
</a>
When I request the text from the URL, I can find the code but there is no content between the span tags, no 226. Here is my code:
import requests, bs4
url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'html.parser')
span = soup.find('span', class_='civil-comment-count')
It returns this, same as the above but no 226.
<span class="civil-comment-count" data-id="33519766" data-language="en" data-site-id="globeandmail">
</span>
I'm at a loss as to why the value isn't appearing. Thank you in advance for any assistance.
The page, and specifically the number of comments, does involve JavaScript to be loaded and shown. But you don't have to use Selenium; make a request to the API behind it:
import requests
with requests.Session() as session:
    session.headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36"}
    # visit main page
    base_url = 'http://www.theglobeandmail.com/opinion/will-kevin-oleary-be-stopped/article33519766/'
    session.get(base_url)
    # get the comments count
    url = "https://api-civilcomments.global.ssl.fastly.net/api/v1/topics/multiple_comments_count.json"
    params = {"publication_slug": "globeandmail",
              "reference_language": "en",
              "reference_ids": "33519766"}
    r = session.get(url, params=params)
    print(r.json())
Prints:
{'comment_counts': {'33519766': 226}}
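To get the bare number out of that response, index it by the article ID. A small sketch on the dictionary printed above, with the ID pulled from the article URL by a regular expression (the `article(\d+)` pattern is an assumption about the URL format):

```python
import re

base_url = ("http://www.theglobeandmail.com/opinion/"
            "will-kevin-oleary-be-stopped/article33519766/")

# Response body as printed above.
data = {"comment_counts": {"33519766": 226}}

# Assumed URL pattern: the numeric ID follows the word "article".
match = re.search(r"article(\d+)", base_url)
article_id = match.group(1) if match else None

count = data["comment_counts"].get(article_id)
print(count)
```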
This page uses JavaScript to get the comment count. This is what the page looks like when JavaScript is disabled:
You can find the real URL that contains the number in Chrome's Developer Tools:
Then you can mimic the request using @alecxe's code.
