BeautifulSoup - find() function not working for some elements - python

I'm trying to scrape financial data from this URL: https://www.londonstockexchange.com/stock/STAN/standard-chartered-plc/fundamentals
On this webpage, scraping the h1 tag works perfectly by referencing its class.
Source HTML:
<h1 _ngcontent-ng-lseg-c11="" class="company-name font-bold hero-font"><!----><!---->STANDARD CHARTERED PLC<!----><!----><!----></h1>
My Python Code:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = 'https://www.londonstockexchange.com/stock/{}/{}'
stock = 'STAN/standard-chartered-plc'
info = 'fundamentals'
full_url = url.format(stock, info)
print(full_url)
r = requests.get(full_url)
soup = BeautifulSoup(r.text, 'lxml')
title = soup.find('title')
print(title)
rows = soup.find(class_='company-name font-bold hero-font')
print(rows)
Output:
https://www.londonstockexchange.com/stock/STAN/standard-chartered-plc/fundamentals
<title>STANDARD CHARTERED PLC STAN Fundamentals - Stock | London Stock Exchange</title>
<h1 _ngcontent-sc12="" class="company-name font-bold hero-font"><!-- --><!-- -->STANDARD CHARTERED PLC<!-- --><!-- --><!-- --></h1>
But when trying to scrape another part of the webpage, namely the following tag, this function ceases to work:
<thead _ngcontent-ng-lseg-c21="" class="accordion-header gtm-trackable">
My Python Code:
from bs4 import BeautifulSoup
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
url = 'https://www.londonstockexchange.com/stock/{}/{}'
stock = 'STAN/standard-chartered-plc'
info = 'fundamentals'
full_url = url.format(stock, info)
print(full_url)
r = requests.get(full_url)
soup = BeautifulSoup(r.text, 'lxml')
title = soup.find('title')
print(title)
rows = soup.find(class_='accordion-header gtm-trackable')
print(rows)
My output is as follows:
https://www.londonstockexchange.com/stock/STAN/standard-chartered-plc/fundamentals
<title>STANDARD CHARTERED PLC STAN Fundamentals - Stock | London Stock Exchange</title>
None
I've tried using 'html.parser' and 'lxml' and both cause the same problem.

The data is dynamically loaded from a script tag. You can extract the string holding your data, then replace some entities to get a string that the json module can turn into a JSON object. Then parse out what you want.
import requests
from bs4 import BeautifulSoup
import json
r = requests.get('https://www.londonstockexchange.com/stock/STAN/standard-chartered-plc/fundamentals')
soup = BeautifulSoup(r.text, 'lxml')
data = json.loads(soup.select_one('#ng-lseg-state').string.replace('&q;','"'))
print(data['sortedComponents']['content'][1]['status']['childComponents'][1]['content'].keys())
There may be some other entities to replace. It may be sufficient to add the following:
import html
and later
data = json.loads(html.unescape(soup.select_one('#ng-lseg-state').string.replace('&q;','"')))
To pretty-print a sample of the data:
from pprint import pprint
pprint(data['sortedComponents']['content'][1]['status']['childComponents'][1]['content'])
Or to get a string you can paste into a JSON viewer:
json.dumps(data['sortedComponents']['content'][1]['status']['childComponents'][1]['content'])
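If you're unsure where the values you need sit in this deeply nested structure, a small throwaway helper (my addition, not part of the original answer) can print the key paths up to a chosen depth:
def walk(node, path='data', max_depth=4, depth=0):
    # Recursively print the key paths of a nested JSON structure
    if depth >= max_depth:
        return
    if isinstance(node, dict):
        for key, value in node.items():
            walk(value, f"{path}['{key}']", max_depth, depth + 1)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            walk(item, f'{path}[{i}]', max_depth, depth + 1)
    else:
        print(path, '=', repr(node)[:60])

walk(data)  # increase max_depth to drill further down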


How can I scrape this data when requests doesn't return it?

I want to scrape the information from this page:
https://databases.usatoday.com/nfl-arrests/
Each of the arrests is listed in a table on the page under the CSS selector #csp-data. I can see this in the page's source as well: <div id="csp-data" class="csp-data"></div>, but there is nothing in between those tags for me to parse.
When I try to run the following code, I return no results.
import requests
from bs4 import BeautifulSoup
url = "https://databases.usatoday.com/nfl-arrests/"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
test = soup.select('#csp-data > div > div:nth-child(3) > div > div.table-responsive > table > tbody')
print(test)
If I use test = soup.select('#csp-data'), I get back <div class="csp-data" id="csp-data"></div>. If I move to the next step, #csp-data > div, I return no results.
I'm assuming that the data isn't being loaded when requests gets the data, but I'm not sure. When I go in through my browser and use inspect element, I can see the table has loaded.
Does anyone have an idea on how I could move forward here?
Here is working code that pulls the data from the site's AJAX endpoint:
import requests
body = 'action=cspFetchTable&security=3193d24eb0&pageID=10&blogID=&sortBy=Date&sortOrder=desc&page=1&searches={}&heads=true'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded'}
url = 'https://databases.usatoday.com/wp-admin/admin-ajax.php'
r = requests.post(url, data=body, headers=headers)
tables = r.json()['data']['Result']
for table in tables:
    print(table['First_name'])
Output:
Bradley
Deonte
Barkevious
Darius
Jarron
Tamorrion
Zaven
Frank
Justin
Aldon
Jeff
Marshon
Broderick
Frank
Jaydon
Kevin
Kemah
Chad
Isaiah
Rashard
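If you need more than the first page, the page field in the request body can be incremented. Note that the security value is a nonce copied from the browser request and may expire; if it does, it has to be re-scraped from the page. A sketch under those assumptions:
import requests

url = 'https://databases.usatoday.com/wp-admin/admin-ajax.php'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36',
    'Content-Type': 'application/x-www-form-urlencoded'}

for page in range(1, 4):  # first three pages, as an illustration
    # %-formatting avoids clashing with the literal {} in searches=
    body = ('action=cspFetchTable&security=3193d24eb0&pageID=10&blogID='
            '&sortBy=Date&sortOrder=desc&page=%d&searches={}&heads=true' % page)
    r = requests.post(url, data=body, headers=headers)
    for row in r.json()['data']['Result']:
        print(page, row['First_name'])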

BeautifulSoup returning 'None' object type

After watching a video, I tried to fetch the price of an item from the amazon.de website using the BeautifulSoup API.
# My code
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Neues-Apple-iPhone-Pro-128-GB/dp/B08L5SNWD2/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=3UH87RWLLO40E&dchild=1&keywords=iphone+12+pro&qid=1605603669&sprefix=Iphone+12%2Caps%2C175&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzRjAxN0xWNTk0TVpYJmVuY3J5cHRlZElkPUEwNzE4ODIxMktCWlhJMVlHWDFNMyZlbmNyeXB0ZWRBZElkPUExMDMwODk2Tk5OVkdZRTJISDVMJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
#I tried other parsing methods too: 'html.parser', 'html5lib'. Not helpful
title = soup.find(id="productTitle").get_text()
price = soup.find(id='priceblock_ourprice')
print(title) #returns correct string from the URL above
print(price)
#returns 'None'. Unexpected. Expecting price with some extensions from <span id="priceblock_ourprice"
If anyone can spot what's wrong in my code, that would be really helpful. Thanks in advance!
I can't reproduce the None; the code works fine. I just added get_text() to the price and strip() to both variables, to make the result a little bit cleaner.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.de/Neues-Apple-iPhone-Pro-128-GB/dp/B08L5SNWD2/ref=sr_1_1_sspa?__mk_de_DE=%C3%85M%C3%85%C5%BD%C3%95%C3%91&crid=3UH87RWLLO40E&dchild=1&keywords=iphone+12+pro&qid=1605603669&sprefix=Iphone+12%2Caps%2C175&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzRjAxN0xWNTk0TVpYJmVuY3J5cHRlZElkPUEwNzE4ODIxMktCWlhJMVlHWDFNMyZlbmNyeXB0ZWRBZElkPUExMDMwODk2Tk5OVkdZRTJISDVMJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    'Cache-Control': 'no-cache'
}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
#I tried other parsing methods too: 'html.parser', 'html5lib'. Not helpful
title = soup.find(id="productTitle").get_text().strip()
# to prevent the script from crashing when there isn't a price for the product
try:
    price = soup.find(id='priceblock_ourprice').get_text().strip()
    # slice off the currency symbol (note: the result is still a string, not a float)
    convertedPrice = price[:8]
except AttributeError:
    price = 'not loaded'
    convertedPrice = 'not loaded'
print(title) #returns correct string from the URL above
print(price)
print(convertedPrice)
Output
Neues Apple iPhone 12 Pro (128 GB) - Graphit
1.120,00 €
1.120,00
But
As @Chase mentioned, if the content is dynamically generated, you may give Selenium a try: it can handle the load with its waits. By adding a delay, you can wait until the page is loaded and the dynamically generated content is present, and then grab your information.
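A minimal sketch of that Selenium approach, assuming Chrome and a matching driver are installed (the URL and element id are taken from the question; the 10-second timeout is arbitrary):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get(URL)  # the Amazon product URL from above
    # Explicit wait: block up to 10 seconds until the price element is in the DOM
    price_el = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'priceblock_ourprice'))
    )
    print(price_el.text)
finally:
    driver.quit()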

Unable to parse a rating information from a webpage using requests

I tried to scrape a certain piece of information from a webpage but failed miserably. The text I wish to grab is available in the page source, but I still can't fetch it. I'm after the portion shown on the page as Not Rated.
Relevant html:
<div class="subtext">
Not Rated
<span class="ghost">|</span> <time datetime="PT188M">
3h 8min
</time>
<span class="ghost">|</span>
Drama,
Musical,
Romance
<span class="ghost">|</span>
<a href="/title/tt0150992/releaseinfo?ref_=tt_ov_inf" title="See more release dates">18 June 1999 (India)
</a> </div>
I've tried with:
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)
I get None using the script above.
Expected output:
Not Rated
How can I get the rating from that webpage?
If you want to get the correct version of the HTML page, specify the Accept-Language HTTP header:
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    s.headers['Accept-Language'] = 'en-US,en;q=0.5'  # <-- specify this too!
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    rating = soup.select_one(".titleBar .subtext").next_element
    print(rating)
Prints:
Not Rated
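Note that next_element returns the raw text node, which typically still carries the surrounding whitespace, so you may want to strip it:
print(rating.strip())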
There is a better way of getting the info on the page. If you dump the HTML content returned by the request:
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    with open("response.html", "w", encoding=r.encoding) as file:
        file.write(r.text)
you will find an element <script type="application/ld+json"> which contains all the information about the movie.
Then you simply get the element's text, parse it as JSON, and use the JSON to extract the info you want.
Here is a working example:
import json
import requests
from bs4 import BeautifulSoup
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
with requests.Session() as s:
    s.headers['User-Agent'] = "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36"
    r = s.get(link)
    soup = BeautifulSoup(r.text, "lxml")
    # Find the <script type="application/ld+json"> element and get its content
    movie_data = soup.find("script", attrs={"type": "application/ld+json"}).next
    movie_data = json.loads(movie_data)  # parse the data as JSON
    content_rating = movie_data["contentRating"]  # get the rating
IMDb is one of those websites that makes it incredibly easy to do web scraping, and I love it. What they do to make it easy for web scrapers is put a script at the top of the HTML that contains the whole movie object in JSON format.
So to get all the relevant information and organize it, you simply need to get the content of that single script tag and convert it to JSON; then you can ask for specific information as with a dictionary.
import requests
import json
from bs4 import BeautifulSoup
#This part is basically the same as yours
link = "https://www.imdb.com/title/tt0150992/?ref_=ttfc_fc_tt"
r = requests.get(link)
soup = BeautifulSoup(r.content,"lxml")
#Why not get the whole json element of the movie?
script = soup.find('script', {"type" : "application/ld+json"})
element = json.loads(script.text)
print(element['contentRating'])
#Outputs "Not Rated"
# You can also inspect the rest of the JSON; it has all the relevant information inside
#Just -> print(json.dumps(element, indent=2))
Note:
Headers and session are not necessary in this example.
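One caveat (my addition, not from the answers above): contentRating is not guaranteed to be present for every title, so using dict.get() with a fallback avoids a KeyError:
# contentRating may be absent for some titles
print(element.get('contentRating', 'Not available'))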

Webscraping latitude longitude from google results

How can I scrape the latitude and longitude that Google shows in its search results (the coordinates card) using Beautiful Soup?
Here is the code to do it with bs4:
from requests import get
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',}
response = get("https://www.google.com/search?q=latitude+longitude+of+75270+postal+code+paris+france",headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
a = soup.find("div", class_= "Z0LcW").text
print(a)
Please provide more input in future questions, since we don't want to do the pre-work to create a solution.
You will have to grab this container:
<div class="HwtpBd gsrt PZPZlf" data-attrid="kc:/location/location:coordinates" aria-level="3" role="heading"><div class="Z0LcW XcVN5d">48.8573° N, 2.3370° E</div><div></div></div>
BS4
# BeautifulSoup stuff
import requests
from bs4 import BeautifulSoup
# Make the request
url = "https://www.google.com/search?q=latitude+longitude+of+75270+postal+code+paris+france&rlz=1C1CHBF_deDE740DE740&oq=latitude+longitude+of+75270+postal+code+paris+france&aqs=chrome..69i57.4020j0j8&sourceid=chrome&ie=UTF-8"
response = requests.get(url)
# Get the HTML text of the response
html = response.text
# Parse it into an HTML document
soup = BeautifulSoup(html, 'html.parser')
# Grab the container and its content
target_container = soup.find("div", {"class": "Z0LcW XcVN5d"}).text
Then you have the string inside the returned div... assuming Google doesn't change the class declarations randomly. I tried five refreshes and the class name didn't change, but who knows.
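If the class names do rotate, one fallback worth trying (an assumption about Google's markup, based on the container quoted above) is selecting by the data-attrid attribute instead of the obfuscated classes:
# data-attrid is taken from the container shown above; still not guaranteed stable
coords = soup.find('div', {'data-attrid': 'kc:/location/location:coordinates'})
if coords is not None:
    print(coords.get_text(strip=True))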
Make sure you're sending a User-Agent header (you can also use the Python fake-useragent library).
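For example, a minimal sketch using that third-party package (assuming it is installed via pip install fake-useragent):
from fake_useragent import UserAgent

ua = UserAgent()
headers = {'User-agent': ua.random}  # a random real-world browser UA string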
Code that grabs the location from Google Search results:
from bs4 import BeautifulSoup
import requests
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=latitude longitude of 75270 postal code paris france',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')
location = soup.select_one('.XcVN5d').text
print(location)
Output:
48.8573° N, 2.3370° E
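If you then need the coordinates as numbers, the returned string can be parsed with a regex; a sketch that assumes the degree format shown in the output above:
import re

def parse_coords(text):
    # '48.8573° N, 2.3370° E' -> (48.8573, 2.337); south/west become negative
    m = re.match(r'([\d.]+)°\s*([NS]),\s*([\d.]+)°\s*([EW])', text)
    if not m:
        return None
    lat = float(m.group(1)) * (1 if m.group(2) == 'N' else -1)
    lon = float(m.group(3)) * (1 if m.group(4) == 'E' else -1)
    return lat, lon

print(parse_coords('48.8573° N, 2.3370° E'))  # (48.8573, 2.337)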

Cleaning up scraping results to return anchor text, but not HTML

I'm trying to scrape the prices of hockey sticks from the given URL. Eventually I'd also like to grab the names + URLs, but I don't consider that necessary for solving this.
Here's what I've got:
import requests
from pandas.io.json import json_normalize
from bs4 import BeautifulSoup
url = 'https://www.prohockeylife.com/collections/senior-hockey-sticks'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'}
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')
stick_names = soup.find_all(class_='product-title')
stick_prices = soup.find_all(class_='regular-product')
print(stick_prices)
The above code successfully returns prices of the hockey sticks, but it looks like this:
[<p class="regular-product">
<span>$319.99</span>
</p>, <p class="regular-product">
<span>$339.99</span>
</p>, <p class="regular-product">
<span>$319.99</span>
I'd like to clean it up and have only the actual price returned.
I've tried a few things, including:
dirty_prices = soup.find_all(class_='regular-product')
clean_prices = dirty_prices.get('a')
print(clean_prices)
But to little success. Pointers are appreciated!
Not sure, but I think the following is what you may be looking for:
Instead of print(stick_prices), use:
for name, price in zip(stick_names, stick_prices):
    print(name["href"], name.text, price.text)
The start of the output is:
/collections/senior-hockey-sticks/products/ccm-ribcor-trigger-3d-sr-hockey-stick
CCM RIBCOR TRIGGER 3D SR HOCKEY STICK
$319.99
/collections/senior-hockey-sticks/products/bauer-vapor-1x-lite-sr-hockey-stick
BAUER VAPOR 1X LITE SR HOCKEY STICK
$339.99
etc.
You need the .text property, which you can also extract inside a list comprehension. Then zip the two lists at the end for a list of (name, price) tuples.
import requests
from bs4 import BeautifulSoup
url = 'https://www.prohockeylife.com/collections/senior-hockey-sticks'
headers = {'user-agent': 'Mozilla/5.0'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
stick_names = [item.text.strip() for item in soup.find_all(class_='product-title')]
stick_prices = [item.text.strip() for item in soup.find_all(class_='regular-product')]
print(list(zip(stick_names, stick_prices)))
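Since the question already imports pandas, the zipped pairs also drop straight into a DataFrame if you want tabular output (optional, my addition):
import pandas as pd

df = pd.DataFrame(list(zip(stick_names, stick_prices)), columns=['name', 'price'])
print(df.head())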
