Accessing href link using BeautifulSoup - python

I'm trying to scrape the href of the first link titled "BACC B ET A COMPTABILITE CONSEIL". However, I can't seem to extract the href when I'm using BeautifulSoup. Could you please recommend a solution?
Here's the link to the url - https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160
My code:
import requests
from bs4 import BeautifulSoup

url = 'https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = BeautifulSoup(resp.content, 'html.parser')
a = soup.find('div', {'class': 'nom-entreprise'})
print(a)
Result:
None.

The link is constructed dynamically with JavaScript. All you need is the company's SIREN number, which can be obtained with the same Ajax query the page makes:
import json
import requests
# url = "https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160"
api_url = "https://api.pappers.fr/v2/recherche"
payload = {
    "q": "B & A COMPTABILITE CONSEIL",  # <-- your search query
    "code_naf": "",
    "code_postal": "94160",  # <-- this is "ville" from the URL
    "api_token": "97a405f1664a83329a7d89ebf51dc227b90633c4ba4a2575",
    "precision": "standard",
    "bases": "entreprises,dirigeants,beneficiaires,documents,publications",
    "page": "1",
    "par_page": "20",
}
data = requests.get(api_url, params=payload).json()
# uncomment this to print all data (all details):
# print(json.dumps(data, indent=4))
print("https://www.pappers.fr/entreprise/" + data["resultats"][0]["siren"])
Prints:
https://www.pappers.fr/entreprise/378002208
Opening the link automatically redirects to:
https://www.pappers.fr/entreprise/bacc-b-et-a-comptabilite-conseil-378002208
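If you need more than the first hit, the same response can be iterated. A minimal sketch, assuming each entry in resultats also carries a nom_entreprise field (verify the exact keys with the json.dumps() dump above):

# Sketch: list every match returned by the search (field names assumed
# from the API response above; check them against the full dump).
for result in data["resultats"]:
    name = result.get("nom_entreprise", "?")
    print(name, "https://www.pappers.fr/entreprise/" + result["siren"])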

The website is loaded dynamically, so requests alone can't see the rendered content. We can use Selenium as an alternative to scrape the page.
Install it with: pip install selenium.
Download the correct ChromeDriver from here.
To find the links you can use a CSS selector: a.gros-gros-nom
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium import webdriver
url = "https://www.pappers.fr/recherche?q=B+%26+A+COMPTABILITE+CONSEIL&ville=94160"
driver = webdriver.Chrome()
driver.get(url)
# Wait for the link to be visible on the page and save element to a variable `link`
link = WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "a.gros-gros-nom"))
)
print(link.get_attribute("href"))
driver.quit()
Output:
https://www.pappers.fr/entreprise/bacc-b-et-a-comptabilite-conseil-378002208
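If the search returns several companies and you want every result link rather than just the first, a small variation on the same selector (a sketch, reusing the imports and driver from above) is:

# Sketch: collect the href of every result link instead of only the first.
WebDriverWait(driver, 20).until(
    EC.visibility_of_element_located((By.CSS_SELECTOR, "a.gros-gros-nom"))
)
for link in driver.find_elements(By.CSS_SELECTOR, "a.gros-gros-nom"):
    print(link.get_attribute("href"))
driver.quit()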

Related

How to fetch this data with Beautiful Soup 4 or lxml?

Here's the website in question:
https://www.gurufocus.com/stock/AAPL
And the part that interests me is this one (it's the GF Score in the upper part of the website):
I need to extract the strings 'GF Score' and '98/100'.
Firefox Inspector gives me span.t-h6 > span:nth-child(1) as a CSS selector, but I can't seem to fetch either the numbers or the descriptor.
Here's the code that I've used so far to extract the "GF Score" part:
import requests
from bs4 import BeautifulSoup
from lxml import html
req = requests.get('https://www.gurufocus.com/stock/AAPL')
soup = BeautifulSoup(req.content, 'html.parser')
score_soup = soup.select('#gf-score-section-003550 > span > span:nth-child(1)')
score_soup_2 = soup.select('span.t-h6 > span')
print(score_soup)
print(score_soup_2)
tree = html.fromstring(req.content)
score_lxml = tree.xpath('//*[@id="gf-score-section-003550"]/span/span[1]')
print(score_lxml)
As a result, I'm getting three empty brackets.
The XPath was taken directly out of Chrome via the copy function, and so was the nth-child expression in the BS4 part.
Any suggestions as to what might be at fault here?
Unfortunately, getting the page with the requests library isn't possible, and neither is accessing the API, which requires a signature.
There are 2 options:
Use the API. It's not free, but it is much more convenient and faster.
The second option is Selenium. It's free, but it is slow unless you fine-tune the element waits. The other problem is the Cloudflare protection: without changing the headers and/or IP you will probably get banned before long. Here is an example:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from bs4 import BeautifulSoup
def get_gf_score(ticker_symbol: str, timeout=10):
    driver.get(f'https://www.gurufocus.com/stock/{ticker_symbol}/summary')
    try:
        element_present = EC.presence_of_element_located((By.ID, 'register-dialog-email-input'))
        WebDriverWait(driver, timeout).until(element_present)
        return BeautifulSoup(driver.page_source, 'lxml').find('span', text='GF Score:').findNext('span').get_text(strip=True)
    except TimeoutException:
        print("Timed out waiting for page to load")

tickers = ['AAPL', 'MSFT', 'AMZN']
driver = webdriver.Chrome()

for ticker in tickers:
    print(ticker, get_gf_score(ticker), sep=': ')
OUTPUT:
AAPL: 98/100
MSFT: 97/100
AMZN: 88/100
One way you could get the desired value is:
Make the request to the page while pretending it comes from a browser, and then extract the info you need from the JSON object inside the script HTML tag [1].
NOTE:
[1] Please be warned that I couldn't get the JSON object (this is the JSON result, by the way) and extract the value by following the path:
js_data['fetch']['data-v-4446338b:0']['stock']['gf_score']
So, as an alternative (not a very good one, IMHO, but it works for your purpose), I decided to find certain markers in the JSON/string result and then extract the desired value by slicing the string (i.e., taking a substring).
Full code:
import requests
from bs4 import BeautifulSoup
import json
geturl = r'https://www.gurufocus.com/stock/AAPL'
getheaders = {
    'Accept': 'text/html; charset=utf-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36',
    'Referer': 'https://www.gurufocus.com'
}
s = requests.Session()
r = s.get(geturl, headers=getheaders)
soup = BeautifulSoup(r.text, "html.parser")
# This is the "<script>" element tag that contains the full JSON object
# with all the data.
scripts = soup.findAll("script")[1]
# Get only the JSON data:
js_data = scripts.get_text("", strip=True)
# -- Get the value from the "gf_score" string - by getting its position:
# Part I: is where the "gf_score" begins.
part_I = js_data.find("gf_score")
# Part II: is where the final position is declared - in this case AFTER the "gf_score" value.
part_II = js_data.find(",gf_score_med")
# Build the desired result and print it:
gf_score = js_data[part_I:part_II].replace("gf_score:", "GF Score: ") + "/100"
print(gf_score)
Result:
GF Score: 98/100
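Since those substring positions are brittle, a regex over the same script text is a slightly more robust variant of the same idea (a sketch only, assuming the gf_score value appears as unquoted digits in the inlined script, as the slicing above suggests):

import re

# Hypothetical refinement of the slicing above: pull the digits that
# immediately follow "gf_score:" out of the raw <script> text.
match = re.search(r'gf_score:(\d+)', js_data)
if match:
    print("GF Score: " + match.group(1) + "/100")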
The data is dynamic. I think rank is what you are looking for, but the API requires authentication. Maybe you can use Selenium or Playwright to render the page?
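For completeness, a minimal Playwright sketch of that idea (the waiting strategy and the 'GF Score:' label lookup are borrowed from the Selenium answer above and are assumptions; adjust them to whatever the rendered page actually contains):

# Sketch only: render the page with Playwright, then parse it with BeautifulSoup.
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    page = browser.new_page()
    page.goto("https://www.gurufocus.com/stock/AAPL/summary")
    page.wait_for_load_state("networkidle")  # wait for the JS-rendered content
    soup = BeautifulSoup(page.content(), "lxml")
    label = soup.find("span", text="GF Score:")
    if label:
        print(label.find_next("span").get_text(strip=True))
    browser.close()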

How to scrape information from website that requires login

I am working on a python web scraping project. The website I am trying to scrape data from contains info about all the medicines sold in India. The website requires a user to login before giving access to this information.
I want to access all the links in this url https://mims.com/india/browse/alphabet/a?cat=drug&tab=brand and store them in an array.
Here is my code for logging into the website
##################################### Method 1
import mechanize
import http.cookiejar as cookielib
from bs4 import BeautifulSoup
import html2text
br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Chrome')]
br.open('https://sso.mims.com/Account/SignIn')
# View available forms
for f in br.forms():
    print(f)
br.select_form(nr=0)
# User credentials
br.form['EmailAddress'] = <USERNAME>
br.form['Password'] = <PASSWORD>
# Login
br.submit()
print(br.open('https://mims.com/india/browse/alphabet/a?cat=drug&tab=brand').read())
But the problem is that when the credentials are submitted, a middle page pops up with the following information.
You will be redirected to your destination shortly.
This page submits a hidden form and only then is the required end page shown. I want to access the end page. But br.open('https://mims.com/india/browse/alphabet/a?cat=drug&tab=brand').read() accesses the middle page and prints the results.
How do I wait for the middle page to submit the hidden form and then access the contents of the end page?
I've posted a selenium solution below, which works, but after understanding a bit more about the login process, it's possible to login using BeautifulSoup and requests only. Please read the comments on the code.
BeautifulSoup / requests solution
import requests
from bs4 import BeautifulSoup
d = {
    "EmailAddress": "your@email.tld",
    "Password": "password",
    "RememberMe": True,
    "SubscriberId": "",
    "LicenseNumber": "",
    "CountryCode": "SG"
}
req = requests.Session()
login_u = "https://sso.mims.com/"
html = req.post(login_u, data=d)
products_url = "https://mims.com/india/browse/alphabet/a?cat=drug"
html = req.get(products_url)  # The cookies generated on the previous request will be used on this one automatically because we use Sessions
# Here's the tricky part. The site uses 2 intermediary "relogin" pages that
# (theoretically) are only available with JavaScript enabled, but we can bypass that, i.e.:
soup = BeautifulSoup(html.text, "html.parser")
form = soup.find('form', {"id": "openid_message"})
form_url = form['action']  # used on the next post request
inputs = form.find_all('input')
form_dict = {}
for input in inputs:
    if input.get('name'):
        form_dict[input.get('name')] = input.get('value')
form_dict['submit_button'] = "Continue"
relogin = req.post(form_url, data=form_dict)

soup = BeautifulSoup(relogin.text, "html.parser")
form = soup.find('form', {"id": "openid_message"})
form_url = form['action']  # used on the next post request
inputs = form.find_all('input')
form_dict = {}
for input in inputs:
    if input.get('name'):
        form_dict[input.get('name')] = input.get('value')
products_a = req.post(form_url, data=form_dict)
print(products_a.text)
# You can now request any url normally because the necessary cookies are already present on the current Session()
products_url = "https://mims.com/india/browse/alphabet/c?cat=drug"
products_c = req.get(products_url)
print(products_c.text)
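Since the two relogin hops are identical, the form-resubmission step can be factored into a small helper. This is just a refactoring sketch of the code above (the helper name is made up), not new behaviour:

def submit_hidden_form(session, html_text, extra=None):
    # Parse the hidden "openid_message" form and re-submit it, returning the response.
    soup = BeautifulSoup(html_text, "html.parser")
    form = soup.find('form', {"id": "openid_message"})
    form_dict = {i.get('name'): i.get('value') for i in form.find_all('input') if i.get('name')}
    form_dict.update(extra or {})
    return session.post(form['action'], data=form_dict)

# The two relogin hops above then collapse to:
# relogin = submit_hidden_form(req, html.text, extra={'submit_button': "Continue"})
# products_a = submit_hidden_form(req, relogin.text)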
Selenium solution
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.keys import Keys
from time import sleep
driver = webdriver.Firefox()
wait = WebDriverWait(driver, 10)
driver.maximize_window()
driver.get("https://sso.mims.com/")
el = wait.until(EC.element_to_be_clickable((By.ID, "EmailAddress")))
el.send_keys("your#email.com")
el = wait.until(EC.element_to_be_clickable((By.ID, "Password")))
el.send_keys("password")
el = wait.until(EC.element_to_be_clickable((By.ID, "btnSubmit")))
el.click()
wait.until(EC.element_to_be_clickable((By.CLASS_NAME, "profile-section-header"))) # we logged in successfully
driver.get("http://mims.com/india/browse/alphabet/a?cat=drug")
wait.until(EC.visibility_of_element_located((By.CLASS_NAME, "searchicon")))
print(driver.page_source)
# do what you need with the source code

Scraping Google Finance (BeautifulSoup)

I'm trying to scrape Google Finance, and get the "Related Stocks" table, which has id "cc-table" and class "gf-table" based on the webpage inspector in Chrome. (Sample Link: https://www.google.com/finance?q=tsla)
But when I run .find("table") or .findAll("table"), this table does not come up. I can find JSON-looking objects with the table's contents in the HTML content in Python, but do not know how to get it. Any ideas?
The page is rendered with JavaScript. There are several ways to render and scrape it.
I can scrape it with Selenium.
First install Selenium:
sudo pip3 install selenium
Then get a driver https://sites.google.com/a/chromium.org/chromedriver/downloads
import bs4 as bs
from selenium import webdriver
browser = webdriver.Chrome()
url = ("https://www.google.com/finance?q=tsla")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
Alternatively PyQt5
from PyQt5.QtGui import *
from PyQt5.QtCore import *
from PyQt5.QtWebKit import *
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
import bs4 as bs
import sys
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
url = "https://www.google.com/finance?q=tsla"
r = Render(url)
result = r.frame.toHtml()
soup = bs.BeautifulSoup(result,'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
Alternatively Dryscrape
import bs4 as bs
import dryscrape
url = "https://www.google.com/finance?q=tsla"
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
All three approaches print the same output:
Valuation▲▼Company name▲▼Price▲▼Change▲▼Chg %▲▼d | m | y▲▼Mkt Cap▲▼TSLATesla Inc328.40-1.52-0.46%53.69BDDAIFDaimler AG72.94-1.50-2.01%76.29BFFord Motor Company11.53-0.17-1.45%45.25BGMGeneral Motors Co...36.07-0.34-0.93%53.93BRNSDFRENAULT SA EUR3.8197.000.000.00%28.69BHMCHonda Motor Co Lt...27.52-0.18-0.65%49.47BAUDVFAUDI AG NPV840.400.000.00%36.14BTMToyota Motor Corp...109.31-0.53-0.48%177.79BBAMXFBAYER MOTOREN WER...94.57-2.41-2.48%56.93BNSANYNissan Motor Co L...20.400.000.00%42.85BMMTOFMITSUBISHI MOTOR ...6.86+0.091.26%10.22B
EDIT
QtWebKit got deprecated upstream in Qt 5.5 and removed in 5.6.
You can switch to PyQt5.QtWebEngineWidgets
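A rough port of the Render helper above to QtWebEngine might look like this (a sketch only; note that toHtml() is asynchronous and delivers the rendered HTML to a callback):

import sys
import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage

class Render(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebEnginePage.__init__(self)
        self.html = None
        self.loadFinished.connect(self._load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _load_finished(self, ok):
        # toHtml() is asynchronous: the HTML arrives in the callback below.
        self.page = self  # kept for symmetry with the QWebPage version
        self.toHtml(self._store_html)

    def _store_html(self, html):
        self.html = html
        self.app.quit()

r = Render("https://www.google.com/finance?q=tsla")
soup = bs.BeautifulSoup(r.html, 'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())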
You can scrape Google Finance with the BeautifulSoup web scraping library without needing Selenium, as the data you want to extract doesn't render via JavaScript. Plus it will be much faster than launching a whole browser.
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, lxml, json
params = {
    "hl": "en"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}

html = requests.get("https://www.google.com/finance?q=tsla", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

ticker_data = []

for ticker in soup.select('.tOzDHb'):
    title = ticker.select_one('.RwFyvf').text
    price = ticker.select_one('.YMlKec').text
    index = ticker.select_one('.COaKTb').text
    price_change = ticker.select_one("[jsname=Fe7oBc]")["aria-label"]

    ticker_data.append({
        "index": index,
        "title": title,
        "price": price,
        "price_change": price_change
    })

print(json.dumps(ticker_data, indent=2))
Example output:
[
  {
    "index": "Index",
    "title": "Dow Jones Industrial Average",
    "price": "32,774.41",
    "price_change": "Down by 0.18%"
  },
  {
    "index": "Index",
    "title": "S&P 500",
    "price": "4,122.47",
    "price_change": "Down by 0.42%"
  },
  {
    "index": "TSLA",
    "title": "Tesla Inc",
    "price": "$850.00",
    "price_change": "Down by 2.44%"
  },
  # ...
]
There's a scrape Google Finance Ticker Quote Data in Python blog post if you need to scrape more data from Google Finance.
Most website owners don't like scrapers because they take data the company values, use up a whole bunch of their server time and bandwidth, and give nothing in return. Big companies like Google may have entire teams employing a whole host of methods to detect and block bots trying to scrape their data.
There are several ways around this:
Scrape from another less secured website.
See if Google or another company has an API for public use.
Use a more advanced scraper like Selenium (and probably still be blocked by Google).
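As one illustration of the API route, the unofficial yfinance package exposes similar quote data without any scraping. This is just one possible substitute (it is not a Google API, and the column names below are yfinance's, not Google Finance's):

# Sketch: fetching a quote through yfinance instead of scraping Google Finance.
import yfinance as yf

ticker = yf.Ticker("TSLA")
data = ticker.history(period="1d")  # one daily bar as a pandas DataFrame
print(data["Close"].iloc[-1])       # latest closing price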

Navigating a website in python, scraping, and posting

There are many good resources already on stackoverflow but I'm still having an issue. I've visited these sources:
how to submit query to .aspx page in python
Submitting a post request to an aspx page
Scrapping aspx webpage with Python using BeautifulSoup
http://www.pythonforbeginners.com/cheatsheet/python-mechanize-cheat-sheet
I'm attempting to visit http://www.latax.state.la.us/Menu_ParishTaxRolls/TaxRolls.aspx and select a Parish. I believe this forces a post and allows me to select a year, which posts again, and allows for yet more selection. I've written my script a few different ways following the above sources, but I haven't been able to successfully submit the form so that the site lets me enter a year.
My current code
import urllib
from bs4 import BeautifulSoup
import mechanize
headers = [
    ('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8'),
    ('Origin', 'http://www.indiapost.gov.in'),
    ('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17'),
    ('Content-Type', 'application/x-www-form-urlencoded'),
    ('Referer', 'http://www.latax.state.la.us/Menu_ParishTaxRolls/TaxRolls.aspx'),
    ('Accept-Encoding', 'gzip,deflate,sdch'),
    ('Accept-Language', 'en-US,en;q=0.8'),
]
br = mechanize.Browser()
br.addheaders = headers
url = 'http://www.latax.state.la.us/Menu_ParishTaxRolls/TaxRolls.aspx'
response = br.open(url)
# first HTTP request without form data
soup = BeautifulSoup(response)
# parse and retrieve two vital form values
viewstate = soup.findAll("input", {"type": "hidden", "name": "__VIEWSTATE"})
eventvalidation = soup.findAll("input", {"type": "hidden", "name": "__EVENTVALIDATION"})
formData = (
    ('__EVENTVALIDATION', eventvalidation[0]['value']),
    ('__VIEWSTATE', viewstate[0]['value']),
    ('__VIEWSTATEENCRYPTED', ''),
)
try:
    fout = open('C:\\GIS\\tmp.htm', 'w')
except:
    print('Could not open output file\n')
fout.writelines(response.readlines())
fout.close()
I've also attempted this in the shell and what I entered plus what I received (modified to cut down on the bulk) can be found http://pastebin.com/KAW5VtXp
Any way I try to change the value in the Parish dropdown list and post, I get taken to a webmaster login page.
Am I approaching this the correct way? Any thoughts would be extremely helpful.
Thanks!
I ended up using selenium.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
driver = webdriver.Firefox()
driver.get("http://www.latax.state.la.us/Menu_ParishTaxRolls/TaxRolls.aspx")
elem = driver.find_element_by_name("ctl00$ContentPlaceHolderMain$ddParish")
elem.send_keys("TERREBONNE PARISH")
elem.send_keys(Keys.RETURN)
elem = driver.find_element_by_name("ctl00$ContentPlaceHolderMain$ddYear")
elem.send_keys("2013")
elem.send_keys(Keys.RETURN)
elem = driver.find_element_by_id("ctl00_ContentPlaceHolderMain_rbSearchField_1")
elem.click()
APN = 'APN # here'
elem = driver.find_element_by_name("ctl00$ContentPlaceHolderMain$txtSearch")
elem.send_keys(APN)
elem.send_keys(Keys.RETURN)
# Access the PDF
elem = driver.find_element_by_link_text('Generate Report')
elem.click()
elements = driver.find_elements_by_tag_name('a')
elements[1].click()

Open a page programmatically in python

Can you extract the VIN number from this webpage?
I tried urllib2.build_opener, requests, and mechanize. I provided user-agent as well, but none of them could see the VIN.
opener = urllib2.build_opener()
opener.addheaders = [('User-agent',('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_7) ' 'AppleWebKit/535.1 (KHTML, like Gecko) ' 'Chrome/13.0.782.13 Safari/535.1'))]
page = opener.open(link)
soup = BeautifulSoup(page)
table = soup.find('dd', attrs = {'class': 'tip_vehicleStats'})
vin = table.contents[0]
print vin
That page loads and displays much of its information with JavaScript (probably through Ajax calls), most likely as a direct protection against scraping. To scrape it you therefore either need a browser that runs JavaScript and control it remotely, or write the scraper itself in JavaScript, or deconstruct the site, figure out exactly what it loads with JavaScript and how, and see if you can duplicate those calls.
You can use browser automation tools for the purpose.
For example this simple selenium script can do your work.
from selenium import webdriver
from bs4 import BeautifulSoup
link = "https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0"
browser = webdriver.Firefox()
browser.get(link)
page = browser.page_source
soup = BeautifulSoup(page)
table = soup.find('dd', attrs = {'class': 'tip_vehicleStats'})
vin = table.span.contents[0]
print vin
BTW, table.contents[0] prints the entire span, including the span tags.
table.span.contents[0] prints only the VIN number.
You could use selenium, which calls a browser. This works for me:
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
import time
# See: http://stackoverflow.com/questions/20242794/open-a-page-programatically-in-python
browser = webdriver.Firefox() # Get local session of firefox
browser.get("https://www.iaai.com/Vehicles/VehicleDetails.aspx?auctionID=14712591&itemID=15775059&RowNumber=0") # Load page
time.sleep(0.5) # Let the page load
# Search for a tag "span" with an attribute "id" which contains "ctl00_ContentPlaceHolder1_VINc_VINLabel"
e = browser.find_element_by_xpath("//span[contains(@id,'ctl00_ContentPlaceHolder1_VINc_VINLabel')]")
e.text
# Works for me : u'4JGBF7BE9BA648275'
browser.close()
You do not have to use Selenium.
Just make an additional get request:
import requests
stock_number = '123456789' # located at VEHICLE INFORMATION
url = 'https://www.clearvin.com/ads/iaai/check?stockNumber={}&vin='.format(stock_number)
vin = requests.get(url).json()['car']['vin']
