Using Selenium "find_elements_by_class_name" to retrieve data from website - python

I'm trying my hand at some python code, and am having a hell of a time with Selenium. Any help you could offer would be super appreciated. Long story short, I'm trying to pull the average rating of a given movie from Letterboxd.com. For example:
https://letterboxd.com/film/the-dark-knight/
The value I'm looking for is the average rating to 2 decimal places, which you can see if you mouseover the rating that's displayed on the page:
[Screenshot: "Average Rating 4.43" tooltip shown on mouseover]
In this case, the average rating is 4.43, and that's the number I'm trying to retrieve.
So far, I've managed to successfully grab the 1 decimal place version using driver.find_elements_by_class_name('average-rating')
In this case, that returns "4.4", but I need "4.43".
I can see the correct value in the developer tools. It appears twice. Once here:
<span class="average-rating" itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
4.4
And again in what appears to be metadata:
<meta name="twitter:data2" content="4.43 out of 5">
Any suggestions on how I can grab that value correctly? Thanks so much!
Cheers,
Ari

There is another way you might want to consider: grab the rating along with the count of users who voted for it, given that both are available in the page source within a script tag.
import re
import json
import requests

URL = 'https://letterboxd.com/film/the-dark-knight/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    r = s.get(URL)
    # The structured data lives in a CDATA-wrapped JSON blob inside a script tag.
    data = json.loads(re.findall(r"CDATA[^{]+(.*)", r.text)[0])
    rating = data['aggregateRating']['ratingValue']
    user_voted = data['aggregateRating']['ratingCount']
    print(rating, user_voted)
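If you'd rather stay within Selenium, the same two-decimal value can be read straight from the twitter:data2 meta tag quoted in the question. A small sketch, assuming driver has already loaded the film page (it uses the same legacy find_element_by_* API as the rest of this thread):

# Read "4.43" out of <meta name="twitter:data2" content="4.43 out of 5">.
meta = driver.find_element_by_xpath('//meta[@name="twitter:data2"]')
print(meta.get_attribute('content').split(' ')[0])  # -> 4.43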

Please find the code below and let me know if anything is unclear. To hover over the main rating you should use ActionChains.
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
import time

driver = webdriver.Chrome()
driver.get("https://letterboxd.com/film/the-dark-knight/")
wait = WebDriverWait(driver, 20)
time.sleep(10)

# Grab the 1-decimal rating shown on the page.
Main_Rating = driver.find_element_by_class_name('average-rating')
print("Main Rating is :- " + Main_Rating.text)
time.sleep(5)

# Hover over the rating so the tooltip with the 2-decimal value renders.
ActionChains(driver).move_to_element(Main_Rating).perform()
More_Rating_info = driver.find_element_by_xpath('//div[@class="twipsy-inner"]').text
# The tooltip is a sentence; split it and pick the numeric rating out of it.
More_Message = More_Rating_info.split()
print("More Rating :- " + More_Message[3])

Try the below code using Beautiful Soup and requests.
Benefits of using Beautiful Soup and requests here:
Faster at getting the result.
Less error-prone.
Easier access to HTML tags.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs

def extract_avg_rating(movie_name):
    url = 'https://letterboxd.com/film/' + movie_name
    session = requests.Session()
    url_response = session.get(url, verify=False)
    soup = bs(url_response.text, 'html.parser')
    # The 2-decimal rating lives in the twitter:data2 meta tag,
    # e.g. <meta name="twitter:data2" content="4.43 out of 5">.
    extracted_meta = soup.find('meta', attrs={'name': 'twitter:data2'})
    extracted_rating = extracted_meta.attrs['content'].split(' ')[0]
    print('Movie ' + movie_name + ' rating ' + extracted_rating)

extract_avg_rating('the-dark-knight')
In the above code you can pass any film name as the movie_name parameter, for example lucky-grandma, and it will give you the accurate rating. The code is dynamic and will help you extract other movies' ratings and other information as your requirements grow.

Related

How to fetch this data with Beautiful Soup 4 or lxml?

Here's the website in question:
https://www.gurufocus.com/stock/AAPL
The part that interests me is the GF Score shown in the upper part of the page.
I need to extract the strings 'GF Score' and '98/100'.
Firefox Inspector gives me span.t-h6 > span:nth-child(1) as a CSS selector, but I just can't seem to fetch either the numbers or the descriptor.
Here's the code that I've used so far to extract the "GF Score" part:
import requests
from bs4 import BeautifulSoup
from lxml import html

req = requests.get('https://www.gurufocus.com/stock/AAPL')
soup = BeautifulSoup(req.content, 'html.parser')
score_soup = soup.select('#gf-score-section-003550 > span > span:nth-child(1)')
score_soup_2 = soup.select('span.t-h6 > span')
print(score_soup)
print(score_soup_2)

tree = html.fromstring(req.content)
score_lxml = tree.xpath('//*[@id="gf-score-section-003550"]/span/span[1]')
print(score_lxml)
As a result, I'm getting three empty brackets.
The XPath was taken directly out of Chrome via the copy function, as was the nth-child expression in the BS4 part.
Any suggestions as to what might be at fault here?
Unfortunately, getting the page with the requests lib is impossible, as is access to the API, which requires a signature.
There are two options:
1. Use the API. It's not free, but it is much more convenient and faster.
2. Selenium. It's free, but slow unless you fine-tune the element waits. The other problem is the protection - Cloudflare: without changing headers and/or IP you will probably get banned soon. Here is an example:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

def get_gf_score(ticker_symbol: str, timeout=10):
    driver.get(f'https://www.gurufocus.com/stock/{ticker_symbol}/summary')
    try:
        # Once the register dialog appears, the page has rendered.
        element_present = EC.presence_of_element_located((By.ID, 'register-dialog-email-input'))
        WebDriverWait(driver, timeout).until(element_present)
        return BeautifulSoup(driver.page_source, 'lxml').find('span', text='GF Score:').findNext('span').get_text(strip=True)
    except TimeoutException:
        print("Timed out waiting for page to load")

tickers = ['AAPL', 'MSFT', 'AMZN']
driver = webdriver.Chrome()
for ticker in tickers:
    print(ticker, get_gf_score(ticker), sep=': ')
OUTPUT:
AAPL: 98/100
MSFT: 97/100
AMZN: 88/100
One way you could get the desired value is to make the request to the page - pretending the request comes from a browser - and then extract the info you need from the JSON object inside the script HTML tag [1].
NOTE:
[1] Please be warned that I couldn't take the JSON object - this is the JSON result, btw - and extract the value by following the path:
js_data['fetch']['data-v-4446338b:0']['stock']['gf_score']
So, as an alternative (not a very good one, IMHO, but it works for your purpose), I decided to find certain markers in the JSON string and extract the desired value by slicing it (i.e. a substring).
Full code:
import requests
from bs4 import BeautifulSoup

geturl = r'https://www.gurufocus.com/stock/AAPL'
getheaders = {
    'Accept': 'text/html; charset=utf-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.gurufocus.com'
}

s = requests.Session()
r = s.get(geturl, headers=getheaders)
soup = BeautifulSoup(r.text, "html.parser")

# This is the "<script>" element tag that contains the full JSON object
# with all the data.
scripts = soup.findAll("script")[1]

# Get only the JSON data:
js_data = scripts.get_text("", strip=True)

# -- Get the value from the "gf_score" string - by getting its position:
# Part I: where the "gf_score" key begins.
part_I = js_data.find("gf_score")

# Part II: the final position - in this case right AFTER the "gf_score" value.
part_II = js_data.find(",gf_score_med")

# Build the desired result and print it:
gf_score = js_data[part_I:part_II].replace("gf_score:", "GF Score: ") + "/100"
print(gf_score)
Result:
GF Score: 98/100
The data is dynamic. I think rank is what you are looking for, but the API requires authentication. Maybe you can use Selenium or Playwright to render the page?
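For instance, a minimal Playwright sketch along those lines - assuming the rendered page still shows a 'GF Score' text label, which you should verify against the live markup, and setting aside the Cloudflare concerns mentioned above:

# Render the page with Playwright and wait for the "GF Score" label.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.gurufocus.com/stock/AAPL/summary")
    # The text selector is an assumption; narrow it after inspecting the DOM.
    page.wait_for_selector("text=GF Score", timeout=15000)
    print(page.inner_text("body"))
    browser.close()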

How can I get fast data in python?

Hi everyone.
I am working on a Python project that uses Selenium to scrape data.
But there is one problem: I have to scrape the data every 5 minutes.
So I run the Chrome driver with Selenium, but Selenium's scraping speed is very slow.
If I run this project, it takes at least 30 minutes, so I can't get the data every 5 minutes.
If you have experience in this field, please help me.
If you can suggest other ways (for example Beautiful Soup), I will be very happy.
Note: the site I want to get data from is rendered using JavaScript.
This is my source code. I am testing it.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import pandas as pd
import time

driver = webdriver.Chrome()
driver.set_window_size(800, 600)

tickerNames = []
finvizUrl = "https://finviz.com/screener.ashx?v=111&f=exch_nasd,geo_usa,sh_float_u10,sh_price_u10,sh_relvol_o2"
nasdaqUrl = "https://www.nasdaq.com/market-activity/stocks/"
tickerPrice = []

def openPage(url):
    driver.get(url)

def exitDrive():
    driver.quit()

def getTickers():
    tickers = driver.find_elements_by_class_name('screener-link-primary')
    for i in range(len(tickers)):
        tickerNames.append(tickers[i].text)
    return tickerNames

def comparePrice(tickers):
    for i in range(len(tickers)):
        openPage(nasdaqUrl + tickers[i])
        # append rather than assign by index into the (initially empty) list
        tickerPrice.append(driver.find_element_by_class_name('symbol-page-header__pricing-price').text)
    return tickerPrice

openPage(finvizUrl)
# comparePrice() needs the ticker list, so pass getTickers() in:
print(comparePrice(getTickers()))
There seems to be an API on the Nasdaq site that you can query (found using the network tools), so there isn't really any need to use Selenium for this. Here is an example that gets the data using requests:
import requests
import lxml.html
import time

FINVIZ_URL = "https://finviz.com/screener.ashx?v=111&f=exch_nasd,geo_usa,sh_float_u10,sh_price_u10,sh_relvol_o2"
NASDAQ_URL = "https://api.nasdaq.com/api/quote/{}/summary?assetclass=stocks"

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0.3 Safari/605.1.15"
}

session = requests.Session()
session.headers.update(headers)
r = session.get(FINVIZ_URL)

# Get data using lxml xpath but use whatever you want
x = lxml.html.fromstring(r.text)
stocks = x.xpath("//*[@class='screener-link-primary']/text()")

for stock in stocks:
    data = session.get(NASDAQ_URL.format(stock))
    print(f"INFO for {stock}")
    print(data.json())  # This might have the data you want
    # Sleep in case there is a rate limit (may not be needed)
    time.sleep(5)
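If the summary payload follows the data -> summaryData shape visible in the network tab (an assumption - verify it against your own capture before relying on it), you can pull the fields out defensively inside the loop, e.g.:

# Assumed layout: {"data": {"summaryData": {<field>: {"label": ..., "value": ...}}}}.
# Falls back to printing the raw payload if the shape differs.
payload = data.json() or {}
summary = (payload.get("data") or {}).get("summaryData") or {}
if summary:
    print({key: field.get("value") for key, field in summary.items()})
else:
    print(payload)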

Python web scraping with bs4 on Patreon

I've written a script that looks up a few blogs and sees if a new post has been added. However, when I try to do this on Patreon I cannot find the right element with bs4.
Let's take https://www.patreon.com/cubecoders for example.
Say I want to get the number of exclusive posts under the 'Become a patron to' section, which would be 25 as of now.
This code works just fine:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("div", class_="sc-AxjAm fXpRSH").text
print(text_of_newest_post)
Output: 25
Now, I want to get the title of the newest post, which would be 'New in AMP 2.0.2 - Integrated SCP/SFTP server!' as of now.
I inspect the title in my browser and see that it is contained by a span tag with the class 'sc-1di2uql-1 vYcWR'.
However, when I try to run this code I cannot fetch the element:
import requests
from bs4 import BeautifulSoup
plain_html = requests.get("https://www.patreon.com/cubecoders").text
full_html = BeautifulSoup(plain_html, "html.parser")
text_of_newest_post = full_html.find("span", class_="sc-1di2uql-1 vYcWR")
print(text_of_newest_post)
Output: None
I've already tried to fetch the element with XPath or CSS selector but couldn't do it. I thought it might be because the site is rendered first with JavaScript and thus I cannot access the elements before they are rendered correctly.
When I use Selenium to render the site first I can see the title when printing out all div tags on the page but when I want to get only the very first title I can't access it.
Do you guys know a workaround maybe?
Thanks in advance!
EDIT:
In Selenium I can do this:
from selenium import webdriver

browser = webdriver.Chrome(r"C:\webdrivers\chromedriver.exe")
browser.get("https://www.patreon.com/cubecoders")
divs = browser.find_elements_by_tag_name("div")

def find_text(divs):
    for div in divs:
        for span in div.find_elements_by_tag_name("span"):
            if span.get_attribute("class") == "sc-1di2uql-1 vYcWR":
                return span.text

print(find_text(divs))
browser.close()
Output: New in AMP 2.0.2 - Integrated SCP/SFTP server!
When I just try to search for the spans with class 'sc-1di2uql-1 vYcWR' from the start it won't give me the result though. Could it be that the find_elements method does not look deeper inside for nested tags?
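(One note on that: find_elements_by_tag_name does search nested tags, but find_elements_by_class_name only accepts a single class name, so a compound value like 'sc-1di2uql-1 vYcWR' is rejected. A CSS selector that chains both classes should work once the page has rendered - a sketch, assuming the rendered browser session from the EDIT above:)

# Chain both classes in one CSS selector; by-class-name lookups
# take a single class and reject "sc-1di2uql-1 vYcWR".
spans = browser.find_elements_by_css_selector("span.sc-1di2uql-1.vYcWR")
if spans:
    print(spans[0].text)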
The data you see is loaded via Ajax from their API. You can use the requests module to load it.
For example:
import re
import json
import requests

url = 'https://www.patreon.com/cubecoders'
api_url = 'https://www.patreon.com/api/posts'
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': url
}

with requests.session() as s:
    html_text = s.get(url, headers=headers).text
    # The campaign id is embedded in the page HTML.
    campaign_id = re.search(r'https://www\.patreon\.com/api/campaigns/(\d+)', html_text).group(1)
    data = s.get(api_url, headers=headers, params={
        'filter[campaign_id]': campaign_id,
        'filter[contains_exclusive_posts]': 'true',
        'sort': '-published_at'
    }).json()

    # uncomment this to print all data:
    # print(json.dumps(data, indent=4))

    # print some information to screen:
    for d in data['data']:
        print('{:<70} {}'.format(d['attributes']['title'], d['attributes']['published_at']))
Prints:
New in AMP 2.0.2 - Integrated SCP/SFTP server! 2020-07-17T13:28:49.000+00:00
AMP Enterprise Pricing Reveal! 2020-07-07T10:02:02.000+00:00
AMP Enterprise Edition Waiting List 2020-07-03T13:25:35.000+00:00
Upcoming changes to the user system 2020-05-29T10:53:43.000+00:00
More video tutorials! What do you want to see? 2020-05-21T12:20:53.000+00:00
Third AMP tutorial - Windows installation! 2020-05-21T12:19:23.000+00:00
Another day, another video tutorial! 2020-05-08T22:56:45.000+00:00
AMP Video Tutorial - Out takes! 2020-05-05T23:01:57.000+00:00
AMP Video Tutorials - Installing AMP on Linux 2020-05-05T23:01:46.000+00:00
What is the AMP Console Assistant (AMPCA), and why does it exist? 2020-05-04T01:14:39.000+00:00
Well that was unexpected... 2020-05-01T11:21:09.000+00:00
New Goal - MariaDB/MySQL Support! 2020-04-22T13:41:51.000+00:00
Testing out AMP Enterprise Features 2020-03-31T18:55:42.000+00:00
Temporary feature unlock for all Patreon backers! 2020-03-11T14:53:31.000+00:00
Preparing for Enterprise 2020-03-11T13:09:40.000+00:00
Aarch64/ARM64 and Raspberry Pi is here! 2020-03-06T19:07:09.000+00:00
Aarch64/ARM64 and Raspberry Pi progress! 2020-02-26T17:53:53.000+00:00
Wallpaper! 2020-02-13T11:04:39.000+00:00
Instance Templating - Make once, deploy many. 2020-02-06T15:26:09.000+00:00
Time for a new module! 2020-01-07T13:41:17.000+00:00

Item visible in browser not collected by scraper

I'm trying to collect data from the SumofUs website; specifically the number of signatures on the petition. The datum is presented like this: <div class="percent">256,485 </div> (this is the only item of this class on the page).
So I tried this:
import requests
from bs4 import BeautifulSoup

user_agent = {'User-agent': 'Mozilla/5.0'}
url = 'http://action.sumofus.org/a/nhs-patient-corporations/'
raw = requests.get(url, headers=user_agent)
html = BeautifulSoup(raw.text, 'html.parser')

# get the item we're seeking
number = html.find("div", class_="percent")
print(number)
It seems that the number isn't rendered (I've tried a couple of user-agent strings). What else could be causing this? How can I work around this in future?
In the general case you should use a headless browser. Ghost.py is written in Python, so it's probably a good choice to try first.
In this specific case a little research reveals a much simpler method. Using the network tab in Chrome you can see that the site makes an Ajax call to populate the value, so you can just get it directly:
url = "http://action.sumofus.org/api/ak_action_count_by_action/?action=nhs-patient-corporations&additional="
number = int(requests.get(url).text)
You could use Selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = 'http://action.sumofus.org/a/nhs-patient-corporations/'
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait to load

# then load BeautifulSoup with the browser's content
html = BeautifulSoup(driver.page_source)
...

Open a page programmatically in python

Can you extract the VIN number from this webpage?
I tried urllib2.build_opener, requests, and mechanize. I provided user-agent as well, but none of them could see the VIN.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
                                     'AppleWebKit/535.1 (KHTML, like Gecko) '
                                     'Chrome/13.0.782.13 Safari/535.1'))]
page = opener.open(link)
soup = BeautifulSoup(page)
table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.contents[0]
print(vin)
That page has much of its information loaded and displayed with JavaScript (probably through Ajax calls), most likely as a direct protection against scraping. To scrape it you therefore either need to use a browser that runs JavaScript and control it remotely, write the scraper itself in JavaScript, or deconstruct the site, figure out exactly what it loads with JavaScript and how, and see if you can duplicate those calls.
You can use browser automation tools for this purpose.
For example, this simple Selenium script can do the job:
from selenium import webdriver
from bs4 import BeautifulSoup

link = "https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0"
browser = webdriver.Firefox()
browser.get(link)
page = browser.page_source
soup = BeautifulSoup(page)
table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.span.contents[0]
print(vin)

BTW, table.contents[0] prints the entire span, including the span tags.
table.span.contents[0] prints only the VIN number.
You could use Selenium, which drives a browser. This works for me:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time

# See: http://stackoverflow.com/questions/20242794/open-a-page-programatically-in-python
browser = webdriver.Firefox()  # Get local session of firefox
browser.get("https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0")  # Load page
time.sleep(0.5)  # Let the page load

# Search for a "span" tag whose "id" attribute contains "ctl00_ContentPlaceHolder1_VINc_VINLabel"
e = browser.find_element_by_xpath("//span[contains(@id,'ctl00_ContentPlaceHolder1_VINc_VINLabel')]")
print(e.text)  # Works for me: u'4JGBF7BE9BA648275'
browser.close()
You do not have to use Selenium.
Just make an additional get request:
import requests

stock_number = '123456789'  # located at VEHICLE INFORMATION
url = 'https://www.clearvin.com/ads/iaai/check?stockNumber={}&vin='.format(stock_number)
vin = requests.get(url).json()['car']['vin']
print(vin)
