How to fetch this data with Beautiful Soup 4 or lxml?

Here's the website in question:
https://www.gurufocus.com/stock/AAPL
The part that interests me is the GF Score shown in the upper part of the page.
I need to extract the strings 'GF Score' and '98/100'.
Firefox Inspector gives me span.t-h6 > span:nth-child(1) as a CSS selector, but I can't seem to fetch either the numbers or the descriptor.
Here's the code that I've used so far to extract the "GF Score" part:
import requests
from bs4 import BeautifulSoup
from lxml import html

req = requests.get('https://www.gurufocus.com/stock/AAPL')

soup = BeautifulSoup(req.content, 'html.parser')
score_soup = soup.select('#gf-score-section-003550 > span > span:nth-child(1)')
score_soup_2 = soup.select('span.t-h6 > span')
print(score_soup)
print(score_soup_2)

tree = html.fromstring(req.content)
score_lxml = tree.xpath('//*[@id="gf-score-section-003550"]/span/span[1]')
print(score_lxml)
As a result, I'm getting three empty lists.
The XPath was copied directly out of Chrome via the copy function, and so was the nth-child expression in the BS4 part.
Any suggestions as to what might be at fault here?

Unfortunately, getting the page with the requests library is impossible here, and access to the API it calls requires a signature.
There are two options:
Use the API. It's not free, but it's much more convenient and faster.
The second one is Selenium. It's free, but slow without fine-tuning the waits for elements. The other problem is the Cloudflare protection: without changing the headers and/or IP, you will probably get banned soon. So here is an example:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup

def get_gf_score(ticker_symbol: str, timeout=10):
    driver.get(f'https://www.gurufocus.com/stock/{ticker_symbol}/summary')
    try:
        # Wait until the registration dialog appears - a reliable marker
        # that the dynamic content has finished loading.
        element_present = EC.presence_of_element_located((By.ID, 'register-dialog-email-input'))
        WebDriverWait(driver, timeout).until(element_present)
        return BeautifulSoup(driver.page_source, 'lxml').find('span', text='GF Score:').find_next('span').get_text(strip=True)
    except TimeoutException:
        print("Timed out waiting for page to load")

tickers = ['AAPL', 'MSFT', 'AMZN']
driver = webdriver.Chrome()
for ticker in tickers:
    print(ticker, get_gf_score(ticker), sep=': ')
OUTPUT:
AAPL: 98/100
MSFT: 97/100
AMZN: 88/100
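Since Cloudflare bans repeat visitors quickly, one mitigation mentioned above is changing the headers between runs. A minimal sketch of passing a custom User-Agent to Chrome (the header value is illustrative, and this alone may not be enough to avoid a ban):

from selenium import webdriver

options = webdriver.ChromeOptions()
# Illustrative User-Agent string; rotate it between runs as needed.
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36')
driver = webdriver.Chrome(options=options)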

One way you could get the desired value is:
Make the request to the page - pretending the request comes from a browser - and then extract the info you need from the JSON object inside a script HTML tag¹.
NOTE:
¹ Be warned that I couldn't load the JSON object - this is the JSON result, by the way - and extract the value by following the path:
js_data['fetch']['data-v-4446338b:0']['stock']['gf_score']
So, as an alternative (not a very good one, IMHO, but it works for your purpose), I decided to find certain markers in the JSON/string result and then extract the desired value from the substring between them.
Full code:
import requests
from bs4 import BeautifulSoup

geturl = 'https://www.gurufocus.com/stock/AAPL'
getheaders = {
    'Accept': 'text/html; charset=utf-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.gurufocus.com'
}

r = requests.get(geturl, headers=getheaders)
soup = BeautifulSoup(r.text, "html.parser")

# This is the "<script>" element that contains the full JSON object
# with all the data.
scripts = soup.find_all("script")[1]

# Get only the JSON data:
js_data = scripts.get_text("", strip=True)

# -- Get the value of "gf_score" by locating its position in the string:
# Part I: where "gf_score" begins.
part_I = js_data.find("gf_score")

# Part II: the end position - in this case right AFTER the "gf_score" value.
part_II = js_data.find(",gf_score_med")

# Build the desired result and print it:
gf_score = js_data[part_I:part_II].replace("gf_score:", "GF Score: ") + "/100"
print(gf_score)
Result:
GF Score: 98/100
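A slightly tidier variant of the same substring idea, still a sketch and still dependent on the script embedding a literal gf_score:<number> pair (reusing js_data from above):

import re

match = re.search(r'gf_score:(\d+)', js_data)
if match:
    print('GF Score: ' + match.group(1) + '/100')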

The data is dynamic. I think rank is what you are looking for, but the API requires authentication. Maybe you can use Selenium or Playwright to render the page?
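For completeness, a minimal Playwright sketch of that suggestion (the text selector is a guess at the rendered markup, and Cloudflare may still get in the way):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto('https://www.gurufocus.com/stock/AAPL')
    # Locate the score label rendered by JavaScript (selector is an assumption).
    print(page.locator('text=GF Score').first.text_content())
    browser.close()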

Related

Accessing href link using BeautifulSoup

I'm trying to scrape the href of the first link titled "BACC B ET A COMPTABILITE CONSEIL". However, I can't seem to extract the href when I'm using BeautifulSoup. Could you please recommend a solution?
Here's the link to the url - https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160
My code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.content, 'html.parser')
a = soup.find('div', {'class': 'nom-entreprise'})
print(a)
Result:
None.
The link is constructed dynamically with JavaScript. All you need is a number, which can be obtained with an Ajax query:
import json
import requests

# url = "https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160"
api_url = "https://api.pappers.fr/v2/recherche"
payload = {
    "q": "B & A COMPTABILITE CONSEIL",  # <-- your search query
    "code_naf": "",
    "code_postal": "94160",  # <-- this is "ville" from the URL
    "api_token": "97a405f1664a83329a7d89ebf51dc227b90633c4ba4a2575",
    "precision": "standard",
    "bases": "entreprises,dirigeants,beneficiaires,documents,publications",
    "page": "1",
    "par_page": "20",
}

data = requests.get(api_url, params=payload).json()

# uncomment this to print all data (all details):
# print(json.dumps(data, indent=4))

print("https://www.pappers.fr/entreprise/" + data["resultats"][0]["siren"])
Prints:
https://www.pappers.fr/entreprise/378002208
Opening the link will automatically redirect to:
https://www.pappers.fr/entreprise/bacc-b-et-a-comptabilite-conseil-378002208
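If you need more than the first hit, the same JSON can be iterated. A sketch (field names other than "siren" are assumptions about the API response):

for result in data["resultats"]:
    # "siren" appears above; "nom_entreprise" is an assumed field name.
    print(result["siren"], result.get("nom_entreprise"))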
The website is loaded dynamically, so requests alone can't retrieve the rendered content. We can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the ChromeDriver that matches your Chrome version.
To find the links you can use a CSS selector: a.gros-gros-nom
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
url = "https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160"
driver = webdriver.Chrome()
driver.get(url)
# Wait for the link to be visible on the page and save element to a variable `link`
link = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "a.gros-gros-nom"))
)
print(link.get_attribute("href"))
driver.quit()
Output:
https://www.pappers.fr/entreprise/bacc-b-et-a-comptabilite-conseil-378002208
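If you don't need a visible browser window, the same script can also run headless, which is usually a bit faster; a minimal sketch of the driver setup (everything else stays as above):

from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)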

Using Selenium "find_elements_by_class_name" to retrieve data from website

I'm trying my hand at some python code, and am having a hell of a time with Selenium. Any help you could offer would be super appreciated. Long story short, I'm trying to pull the average rating of a given movie from Letterboxd.com. For example:
https://letterboxd.com/film/the-dark-knight/
The value I'm looking for is the average rating to 2 decimal places, which you can see if you mouse over the rating displayed on the page (a tooltip reading "Average Rating 4.43").
In this case, the average rating is 4.43, and that's the number I'm trying to retrieve.
So far, I've managed to grab the 1-decimal-place version using driver.find_elements_by_class_name('average-rating'), which returns "4.4". But I need "4.43".
I can see the correct value in the developer tools. It appears twice. Once here:
<span class="average-rating" itemprop="aggregateRating" itemscope itemtype="http://schema.org/AggregateRating">
4.4
And again in what appears to be metadata:
<meta name="twitter:data2" content="4.43 out of 5">
Any suggestions on how I can grab that value correctly? Thanks so much!
Cheers,
Ari
There is another way you might want to consider for getting the rating along with the count of users who voted for it, given that both are available in the page source within a script tag.
import re
import json
import requests

URL = 'https://letterboxd.com/film/the-dark-knight/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36'
    r = s.get(URL)
    # The JSON payload sits right after a CDATA marker inside a script tag.
    data = json.loads(re.findall(r"CDATA[^{]+(.*)", r.text)[0])
    rating = data['aggregateRating']['ratingValue']
    user_voted = data['aggregateRating']['ratingCount']
    print(rating, user_voted)
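An equivalent sketch that grabs the same block with BeautifulSoup instead of a bare regex, reusing r and json from above (the application/ld+json type and the comment wrapper are assumptions about how Letterboxd embeds the JSON):

from bs4 import BeautifulSoup

soup = BeautifulSoup(r.text, 'html.parser')
raw = soup.find('script', type='application/ld+json').string
# Assumed wrapper: /* <![CDATA[ */ ... /* ]]> */ around the JSON payload.
data = json.loads(raw.split('*/', 1)[1].rsplit('/*', 1)[0])
print(data['aggregateRating']['ratingValue'])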
Please find the code below and let me know if you don't understand anything. To hover over the main rating you should use ActionChains.
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
import time
driver = webdriver.Chrome()
driver.get("https://letterboxd.com/film/the-dark-knight/")
wait = WebDriverWait(driver, 20)
time.sleep(10)
Main_Rating = driver.find_element_by_class_name('average-rating')
print("Main Rating is :- " + Main_Rating.text)
time.sleep(5)
ActionChains(driver).move_to_element(Main_Rating).perform()
More_Rating_info = driver.find_element_by_xpath('//div[@class="twipsy-inner"]').text
More_Message = More_Rating_info.split()
print("More Rating :- " + More_Message[3])
Try the code below using Beautiful Soup and requests.
Benefits of using Beautiful Soup and requests:
Fast in terms of getting results.
Fewer errors.
Easier access to HTML tags.
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
from bs4 import BeautifulSoup as bs

def extract_avg_rating():
    movie_name = 'the-dark-knight'
    url = 'https://letterboxd.com/film/' + movie_name
    session = requests.Session()
    url_response = session.get(url, verify=False)
    soup = bs(url_response.text, 'html.parser')
    extracted_meta = soup.find_all('meta')[19]
    extracted_rating = extracted_meta.attrs['content'].split(' ')[0]
    print('Movie ' + movie_name + ' rating ' + extracted_rating)

extract_avg_rating()
In the above code you can put any film slug into the movie_name variable (for example lucky-grandma) and it will give you the accurate rating. The code is dynamic and will help you extract other movies' ratings and other information as per your requirement.
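A less brittle variant of the soup.find_all('meta')[19] line above: select the tag the question itself quotes by its name attribute instead of a positional index (a sketch to drop into extract_avg_rating):

meta = soup.find('meta', attrs={'name': 'twitter:data2'})
# content looks like "4.43 out of 5", per the question.
print(meta['content'].split(' ')[0])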

Selenium is really slow for me, is there something wrong with my code?

I'm new to web scraping and Python. I have done a script before that worked just fine. I'm doing basically the same thing in this one, but it runs way slower.
This is my code:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import selenium
from selenium.webdriver import Firefox
from selenium.webdriver.firefox.options import Options
import time
start = time.time()
opp = Options()
opp.add_argument('-headless')
browser = webdriver.Firefox(executable_path = "/Users/0581279/Desktop/L&S/Watchlist/geckodriver", options=opp)
browser.delete_all_cookies()
browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")
c = browser.page_source
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("span", {"class": "fieldValue__2d582aa7"})
price = all[6].text
browser.quit()
print(price)
end = time.time()
print(end-start)
Sometimes a single page can take up to 2 minutes to load. Also, I'm just web scraping Bloomberg.
Any help would be appreciated :)
Using requests and BeautifulSoup you can scrape information easily and quickly. Here is code to get the Key Statistics for Bloomberg's MSGFINA:LX:
import requests
from bs4 import BeautifulSoup

headers = {
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/72.0.3626.119 Safari/537.36',
    'DNT': '1'
}

response = requests.get('https://www.bloomberg.com/quote/MSGFINA:LX', headers=headers)
page = BeautifulSoup(response.text, "html.parser")
key_statistics = page.select("div[class^='module keyStatistics'] div[class^='rowListItemWrap']")
for key_statistic in key_statistics:
    fieldLabel = key_statistic.select_one("span[class^='fieldLabel']")
    fieldValue = key_statistic.select_one("span[class^='fieldValue']")
    print("%s: %s" % (fieldLabel.text, fieldValue.text))
Selenium's speed depends on several factors:
If the site is slow, the Selenium script is slow.
If the performance of the internet connection is not good, the Selenium script is slow.
If the computer running the script is not performing well, the Selenium script is slow.
These situations are not usually in our hands, but the code is. One way to increase speed is to block image loading (if we don't need the images); blocking image loading reduces the runtime. This is the way to block it:
opp.add_argument('--blink-settings=imagesEnabled=false')
Also, once the driver is open, you don't need BeautifulSoup to get the data; Selenium's own functions provide it. Try the code below; Selenium will be faster.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
import time

start = time.time()

opp = Options()
opp.add_argument('--blink-settings=imagesEnabled=false')

driver_path = r'Your driver path'
browser = webdriver.Chrome(executable_path=driver_path, options=opp)
browser.delete_all_cookies()
browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")

get_element = browser.find_elements_by_css_selector("span[class='fieldValue__2d582aa7']")
print(get_element[6].text)

browser.quit()
end = time.time()
print(end - start)
So I made some alterations to your code and could load it almost instantly. I used the ChromeDriver I had installed and then ran the following code:
from bs4 import BeautifulSoup
from selenium import webdriver
import time

start = time.time()
browser = webdriver.Chrome("/Users/XXXXXXXX/Desktop/Programming/FacebookControl/package/chromedriver")
browser.get("https://www.bloomberg.com/quote/MSGFINA:LX")
c = browser.page_source
soup = BeautifulSoup(c, "html.parser")
all = soup.find_all("span", {"class": "fieldValue__2d582aa7"})
price = all[6].text
browser.quit()
print(price)
end = time.time()
print(end - start)
While testing they did block me, lol; you might want to change headers every once in a while. It printed the price as well.
chromedriver link http://chromedriver.chromium.org/
hope this helps.
output was this:
34.54
7.527994871139526
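As for changing headers: with Selenium this mostly means the User-Agent, which has to be set when the driver starts. A sketch of rotating it between runs (the strings are copied from other snippets on this page and are only illustrative):

import random
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Illustrative pool of User-Agent strings.
agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
]

opts = Options()
opts.add_argument('user-agent=' + random.choice(agents))
browser = webdriver.Chrome(options=opts)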

Item visible in browser not collected by scraper

I'm trying to collect data from the SumOfUs website; specifically, the number of signatures on the petition. The datum is presented like this: <div class="percent">256,485 </div> (this is the only item of this class on the page).
So I tried this:
import requests
from bs4 import BeautifulSoup
user_agent = {'User-agent': 'Mozilla/5.0'}
url = 'http://action.sumofus.org/a/nhs-patient-corporations/'
raw = requests.get(url, headers = user_agent)
html = BeautifulSoup(raw.text)
# get the item we're seeking
number = html.find("div", class_="percent")
print number
It seems that the number isn't rendered (I've tried a couple of user agent strings.) What else could be causing this? How can I work around this in future?
In the general case you should use a headless browser. Ghost.py is written in Python, so it's probably a good choice to try first.
In this specific case, a little research reveals that there's a much simpler method. Using the Network tab in Chrome you can see that the site makes an Ajax call to populate the value. So you can just get it directly:
url = "http://action.sumofus.org/api/ak_action_count_by_action/?action=nhs-patient-corporations&additional="
number = int(requests.get(url).text)
You could use Selenium:
from selenium import webdriver
from bs4 import BeautifulSoup
import time

url = 'http://action.sumofus.org/a/nhs-patient-corporations/'
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # wait for the page to load
# then load BeautifulSoup with the browser's content
html = BeautifulSoup(driver.page_source)
...
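From there, the lookup from the question should work against the rendered source; a sketch of how the elided part could continue:

number = html.find("div", class_="percent")
print(number.get_text(strip=True))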

Open a page programmatically in python

Can you extract the VIN number from this webpage?
I tried urllib2.build_opener, requests, and mechanize. I provided user-agent as well, but none of them could see the VIN.
import urllib2
from bs4 import BeautifulSoup

link = 'https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0'
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) '
                                     'AppleWebKit/535.1 (KHTML, like Gecko) '
                                     'Chrome/13.0.782.13 Safari/535.1'))]
page = opener.open(link)
soup = BeautifulSoup(page)
table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.contents[0]
print vin
That page has much of its information loaded and displayed with JavaScript (probably through Ajax calls), most likely as a direct protection against scraping. To scrape it you therefore either need to use a browser that runs JavaScript and control it remotely, or write the scraper itself in JavaScript, or deconstruct the site, figure out exactly what it loads with JavaScript and how, and see if you can duplicate those calls.
You can use browser automation tools for this purpose.
For example, this simple Selenium script can do your work:
from selenium import webdriver
from bs4 import BeautifulSoup

link = "https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0"
browser = webdriver.Firefox()
browser.get(link)
page = browser.page_source
soup = BeautifulSoup(page)
table = soup.find('dd', attrs={'class': 'tip_vehicleStats'})
vin = table.span.contents[0]
print vin
BTW, table.contents[0] prints the entire span, including the span tags; table.span.contents[0] prints only the VIN.
You could use Selenium, which drives a browser. This works for me:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
# See: http://stackoverflow.com/questions/20242794/open-a-page-programatically-in-python
browser = webdriver.Firefox() # Get local session of firefox
browser.get("https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0") # Load page
time.sleep(0.5) # Let the page load
# Search for a tag "span" with an attribute "id" which contains "ctl00_ContentPlaceHolder1_VINc_VINLabel"
e = browser.find_element_by_xpath("//span[contains(@id,'ctl00_ContentPlaceHolder1_VINc_VINLabel')]")
e.text
# Works for me : u'4JGBF7BE9BA648275'
browser.close()
You do not have to use Selenium.
Just make an additional get request:
import requests
stock_number = '123456789' # located at VEHICLE INFORMATION
url = 'https://www.clearvin.com/ads/iaai/check?stockNumber={}&vin='.format(stock_number)
vin = requests.get(url).json()['car']['vin']
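If you want this to fail loudly instead of raising a bare KeyError when the lookup misses, a defensive sketch of the same call:

import requests

stock_number = '123456789'
url = 'https://www.clearvin.com/ads/iaai/check?stockNumber={}&vin='.format(stock_number)
resp = requests.get(url, timeout=10)
resp.raise_for_status()  # surface HTTP errors explicitly
vin = resp.json().get('car', {}).get('vin')
print(vin)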
