I want to build a program that automatically gets the live price of the German index (DAX). For that I use a website whose price provider is FXCM.
In my code I use the BeautifulSoup and requests packages. The div box where the current value is stored looks like this:
<div class="left" data-item="quoteContainer" data-bg_quotepush="133962:74:bid">
<div class="wrapper cf">
<div class="left">
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="quote" data-bg_quotepush_c="40">13.599,24</span>
<span class="label" data-bg_quotepush="time" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="time" data-bg_quotepush_c="41">25.12.2020</span>
<span class="label"> • </span>
<span class="label" data-item="currency"></span>
</div>
<div class="right">
<span class="percent up" data-bg_quotepush="percent" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="percent" data-bg_quotepush_c="42">+0,00<span>%</span></span>
<span class="label up" data-bg_quotepush="change" data-bg_quotepush_i="133962:74:bid" data-bg_quotepush_f="change" data-bg_quotepush_c="43">0,00</span>
</div>
</div>
</div>
The value I want is the one after data-bg_quotepush_c="40", which is 13.599,24.
My Python code looks like this:
import requests as rq
from bs4 import BeautifulSoup as bs
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
response = rq.get(url)
soup = bs(response.text, "lxml")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price["data-bg_quotepush_c"])
It returns the following error:
File "C:\Users\Felix\anaconda3\lib\site-packages\bs4\element.py", line 1406, in __getitem__
return self.attrs[key]
KeyError: 'data-bg_quotepush_c'
Use Selenium instead of requests when working with dynamically generated content
What is going on?
Requesting the website with requests only provides the initial content, which does not contain all the dynamically generated information, so you cannot find what you are looking for.
To wait until the website has loaded completely, use Selenium with sleep() as the simple method, or Selenium waits as the more advanced option.
Avoiding the error
Use price.text to get the text of the element, which looks like this:
<span class="quote quote_standard" data-bg_quotepush="quote" data-bg_quotepush_c="40" data-bg_quotepush_f="quote" data-bg_quotepush_i="133962:74:bid">13.599,24</span>
Example
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://news.guidants.com/#Ticker/Profil/?i=133962&e=74"
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
driver.get(url)
driver.implicitly_wait(3)
soup = BeautifulSoup(driver.page_source,"html5lib")
price = soup.find_all("div", {"class":"left"})[0].find("span")
print(price.text)
driver.close()
Output
13.599,24
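Note that the scraped string uses the German number format. If you want to calculate with it, a small helper can convert it to a float (a sketch, assuming the thousands-dot / decimal-comma format shown above):

```python
def parse_german_number(text: str) -> float:
    # "13.599,24" uses "." as thousands separator and "," as decimal point
    return float(text.replace(".", "").replace(",", "."))

print(parse_german_number("13.599,24"))  # → 13599.24
```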
If you are scraping the value of a div class, try this example:
from selenium import webdriver
from bs4 import BeautifulSoup

# create the driver (pass the path to your chromedriver)
driver = webdriver.Chrome(executable_path=r"C:\path\to\chromedriver.exe")

# create a variable to store the url string
url = 'https://news.guidants.com/#Ticker/Profil/?i=133962&e=74'
driver.get(url)

# scraping process: parse the loaded page
soup = BeautifulSoup(driver.page_source, "html5lib")
prices = soup.find_all("div", attrs={"class": "left"})
for price in prices:
    total_price = price.find('span')
    if total_price is not None:
        print(total_price.text)

# close the driver
driver.close()
If you are using the requests module, try a different parser. You can install one with pip, for example html5lib:
pip install html5lib
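Once installed, you pass the parser name as the second argument to BeautifulSoup. The sketch below uses the built-in `html.parser` so it runs without extra installs; the comment notes where `html5lib` would go:

```python
from bs4 import BeautifulSoup

html = '<div class="left"><span class="quote">13.599,24</span></div>'

# After `pip install html5lib` you can pass "html5lib" here instead;
# "html.parser" ships with Python and needs no install.
soup = BeautifulSoup(html, "html.parser")
print(soup.find("span", class_="quote").text)  # → 13.599,24
```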
thanks
I'm trying to get links to group members:
response.css('.text--ellipsisOneLine::attr(href)').getall()
Why isn't this working?
html:
<div class="flex flex--row flex--noGutters flex--alignCenter">
<div class="flex-item _memberItem-module_name__BSx8i">
<a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
</a>
</div>
</div>
Your selector isn't working because you are looking for an attribute (href) that this element doesn't have.
response.css('.text--ellipsisOneLine::attr(href)').getall()
This selector searches for href inside elements of class text--ellipsisOneLine. In your HTML snippet that class matches only this element:
<h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
As you can see, there is no href attribute. Now, if you want the text inside this h4 element, you need to use the ::text pseudo-element.
response.css('.text--ellipsisOneLine::text').getall()
Read more here.
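If what you actually want is the link rather than the text, you can select the parent `<a>` that contains the `h4` instead. A sketch with lxml, using the HTML snippet from the question (in scrapy the same expression works via `response.xpath(...)`):

```python
from lxml import html

snippet = '''
<div class="flex-item">
  <a href="/ru-RU/Connect-IT-Meetup-in-Chisinau/members/280162178/profile/?returnPage=1">
    <h4 class="text--bold text--ellipsisOneLine">Liviu Cernei</h4>
  </a>
</div>
'''
tree = html.fromstring(snippet)
# Take the href of the <a> whose child <h4> carries the target class
links = tree.xpath('//a[h4[contains(@class, "text--ellipsisOneLine")]]/@href')
print(links)
```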
I realize that this isn't scrapy, but personally for web scraping I use the requests module and BeautifulSoup4, and the following code snippet will get you a list of users with the aforementioned modules:
import requests
from bs4 import BeautifulSoup

response = requests.get('https://www.meetup.com/ru-RU/Connect-IT-Meetup-in-Chisinau/members/')
if response.status_code == 200:
    html_doc = response.text
    html_source = BeautifulSoup(html_doc, 'html.parser')
    users = html_source.find_all('h4')
    for user in users:
        print(user.text)
css:
response.css('.member-item .flex--alignCenter a::attr(href)').getall()
from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver import ActionChains
import selenium.webdriver.common.keys
from bs4 import BeautifulSoup
import requests
import time
driver = webdriver.Chrome(executable_path="../drivers/chromedriver.exe")
driver.get("https://www.Here the address of the relevant website ends with aspx.com.aspx")
element=driver.find_element_by_id("ctl00_ContentPlaceHolder1_LB_SEKTOR")
drp=Select(element)
drp.select_by_index(0)
element1=driver.find_element_by_id("ctl00_ContentPlaceHolder1_Lb_Oran")
drp=Select(element1)
drp.select_by_index(41)
element2=driver.find_element_by_id("ctl00_ContentPlaceHolder1_LB_DONEM")
drp=Select(element2)
drp.select_by_index(1)
driver.find_element_by_id("ctl00_ContentPlaceHolder1_ImageButton1").click()
time.sleep(1)
print(driver.page_source)
With the last part of this code I can print the page source. But from the page source I only need the following table part, which is built by JavaScript. How can I extract this section and output it as a CSV table?
Note: In the Selenium test I thought of pressing the CTRL+U keys while in Chrome, but I was not successful with this. The web page is interactive; some interactions are required to get the data I want. That's why I used Selenium.
<span id="ctl00_ContentPlaceHolder1_Label2" class="Georgia_10pt_Red"></span>
<div id="ctl00_ContentPlaceHolder1_Divtable">
<div id="table">
<layer name="table" top="0"><IMG height="2" src="../images/spacer.gif" width="2"><br>
<font face="arial" color="#000000" size="2"><b>Tablo Yükleniyor. Lütfen Bekleyiniz...</b></font><br>
</layer>
</div>
</div>
<script language=JavaScript> var theHlp='/yardim/matris.asp';var theTitle = 'Piya Deg';var theCaption='OtomoT (TL)';var lastmod = '';var h='<a class=hislink href=../Hisse/Hisealiz.aspx?HNO=';var e='<a class=hislink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('cksart',4,50);theCols[1] = new Array('2018.12',1,60);theCols[2] = new Array('2019.03',1,60);theCols[3] = new Array('2019.06',1,60);theCols[4] = new Array('2019.09',1,60);theCols[5] = new Array('2019.12',1,60);theCols[6] = new Array('2020.03',1,60);var theRows = new Array();theRows[0] = new Array ('<b>'+h+'42>AHRT</B></a>','519,120,000.00','590,520,000.00','597,240,000.00','789,600,000.00','1,022,280,000.00','710,640,000.00');
theRows[1] = new Array ('<b>'+h+'427>SEEL</B></a>','954,800,000.00','983,400,000.00','1,201,200,000.00','1,716,000,000.00','2,094,400,000.00','-');
theRows[2] = new Array ('<b>'+h+'140>TOFO</B></a>','17,545,500,000.00','17,117,389,800.00','21,931,875,000.00','20,844,054,000.00','24,861,973,500.00','17,292,844,800.00');
theRows[3] = new Array ('<b>'+h+'183>MSO</B></a>','768,000,000.00','900,000,000.00','732,000,000.00','696,000,000.00','1,422,000,000.00','1,134,000,000.00');
theRows[4] = new Array ('<b>'+h+'237>KURT</B></a>','2,118,000,000.00','2,517,600,000.00','2,736,000,000.00','3,240,000,000.00','3,816,000,000.00','2,488,800,000.00');
theRows[5] = new Array ('<b>'+h+'668>GRTY</B></a>','517,500,000.00','500,250,000.00','445,050,000.00','552,000,000.00','737,150,000.00','-');
theRows[6] = new Array ('<b>'+h+'291>MEME</B></a>','8,450,000,000.00','8,555,000,000.00','9,650,000,000.00','10,140,000,000.00','13,430,000,000.00','8,225,000,000.00');
theRows[7] = new Array ('<b>'+h+'292>AMMI</B></a>','-','-','-','-','-','-');
theRows[8] = new Array ('<b>'+h+'426>GOTE</B></a>','1,862,578,100.00','1,638,428,300.00','1,689,662,540.00','2,307,675,560.00','2,956,642,600.00','2,121,951,440.00');
var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script></form>
<div style="clear: both; margin-top: 10px;">
<div style="background-color: Red; border: 2px solid Green; display: none">
TABLO-ALT</div>
<div id="Bannerctl00_SiteBannerControl2">
<div id="_bannerctl00_SiteBannerControl2">
<div id="Sayfabannerctl00_SiteBannerControl2" class="banner_Codex">
</div>
Please note that I've only used Selenium in Java, so I'll give you the most generic and language-agnostic answer I can. Keep in mind that Python Selenium MAY provide a method to do this directly.
Steps:
Make all Selenium interactions so the WebDriver actually has a VALID page version with all your contents loaded
Extract from selenium the current contents of the whole page
Load it with an HTML parsing library. I use JSoup in Java; I don't know if there's a Python version. From now on, Selenium does not matter.
Use CSS selectors on your parser Object to get the section you want
Convert that section to String to print.
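The steps above can be sketched in Python with BeautifulSoup standing in for JSoup (the selector here is only a placeholder; replace it with the real table's container id, e.g. the Divtable id from the question):

```python
from bs4 import BeautifulSoup

def extract_section(page_source: str, css_selector: str) -> str:
    # Steps 2-5: parse the page Selenium already loaded,
    # select the wanted section, and stringify it for printing
    soup = BeautifulSoup(page_source, "html.parser")
    section = soup.select_one(css_selector)
    return str(section) if section is not None else ""

# Usage after the Selenium interactions:
# print(extract_section(driver.page_source, "#ctl00_ContentPlaceHolder1_Divtable"))
print(extract_section('<div id="table"><b>hi</b></div>', "div#table"))
```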
If performance is a requirement, this approach may be a bit too expensive, as the contents are parsed twice: Selenium does it first, and your HTML parser will do it again later with the String extracted from Selenium.
ALTERNATIVE: If your "target page" uses AJAX, you may directly interact with the REST API that the JavaScript is accessing to get the data. I tend to follow this approach when doing serious web scraping, but sometimes it is not an option, so I use the approach above.
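A minimal sketch of that alternative, assuming a hypothetical JSON endpoint you would discover in the browser's Network tab (the payload shape and field names below are made up for illustration; the real request would be `requests.get(endpoint).text`):

```python
import json

# Hypothetical JSON payload shaped like the table the page's JavaScript renders
payload = '{"rows": [{"ticker": "AHRT", "value": "1,022,280,000.00"}]}'

def rows_to_csv_lines(raw: str) -> list:
    # Turn each JSON row into one CSV-style line
    data = json.loads(raw)
    return [f'{r["ticker"]},{r["value"]}' for r in data["rows"]]

print(rows_to_csv_lines(payload))
```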
EDIT
Some more details based on questions in the comments:
You can use BeautifulSoup as an HTML parsing library.
To load a page in BeautifulSoup use:
html = "<html><head></head><body><div id=\"events-horizontal\">Hello world</div></body></html>"
soup = BeautifulSoup(html, "html.parser")
Then look at this answer to see how to extract the specific contents from your soup:
your_div = soup.select_one('div#events-horizontal')
That would give you the first div with the events-horizontal id:
<div id="events-horizontal">Hello world</div>
BeautifulSoup code based on:
How to use CSS selectors to retrieve specific links lying in some class using BeautifulSoup?