Taking a certain part of the page with selenium - python

from selenium import webdriver
from selenium.webdriver.support.ui import Select
from selenium.webdriver import ActionChains
import selenium.webdriver.common.keys
from bs4 import BeautifulSoup
import requests
import time
driver = webdriver.Chrome(executable_path="../drivers/chromedriver.exe")
driver.get("https://www.Here the address of the relevant website ends with aspx.com.aspx")
element=driver.find_element_by_id("ctl00_ContentPlaceHolder1_LB_SEKTOR")
drp=Select(element)
drp.select_by_index(0)
element1=driver.find_element_by_id("ctl00_ContentPlaceHolder1_Lb_Oran")
drp=Select(element1)
drp.select_by_index(41)
element2=driver.find_element_by_id("ctl00_ContentPlaceHolder1_LB_DONEM")
drp=Select(element2)
drp.select_by_index(1)
driver.find_element_by_id("ctl00_ContentPlaceHolder1_ImageButton1").click()
time.sleep(1)
print(driver.page_source)
The last part of these codes, I can print the source codes of the page as a result. So I can get the source codes of the page as a print.
But in source codes of the page I just need the following table part written in java. How can I extract this section. and I can output csv as a table. (How can I get the table in the Java section.)
Not:In the Selenium test, I thought of pressing the CTRL U keys while in Chrome, but I was not successful in this.The web page is a user interactive page. Some interactions are required to get the data I want. That's why I used Selenium.
<span id="ctl00_ContentPlaceHolder1_Label2" class="Georgia_10pt_Red"></span>
<div id="ctl00_ContentPlaceHolder1_Divtable">
<div id="table">
<layer name="table" top="0"><IMG height="2" src="../images/spacer.gif" width="2"><br>
<font face="arial" color="#000000" size="2"><b>Tablo Yükleniyor. Lütfen Bekleyiniz...</b></font><br>
</layer>
</div>
</div>
<script language=JavaScript> var theHlp='/yardim/matris.asp';var theTitle = 'Piya Deg';var theCaption='OtomoT (TL)';var lastmod = '';var h='<a class=hislink href=../Hisse/Hisealiz.aspx?HNO=';var e='<a class=hislink href=../endeks/endeksAnaliz.aspx?HNO=';var d='<center><font face=symbol size=1 color=#FF0000><b>ß</b></font></center>';var u='<center><font face=symbol size=1 color=#008000><b>İ</b></font></center>';var n='<center><font face=symbol size=1 color=#00A000><b>=</b></font></center>';var fr='<font color=#FF0000>';var fg='<font color=#008000>';var theFooter=new Array();var theCols = new Array();theCols[0] = new Array('cksart',4,50);theCols[1] = new Array('2018.12',1,60);theCols[2] = new Array('2019.03',1,60);theCols[3] = new Array('2019.06',1,60);theCols[4] = new Array('2019.09',1,60);theCols[5] = new Array('2019.12',1,60);theCols[6] = new Array('2020.03',1,60);var theRows = new Array();theRows[0] = new Array ('<b>'+h+'42>AHRT</B></a>','519,120,000.00','590,520,000.00','597,240,000.00','789,600,000.00','1,022,280,000.00','710,640,000.00');
theRows[1] = new Array ('<b>'+h+'427>SEEL</B></a>','954,800,000.00','983,400,000.00','1,201,200,000.00','1,716,000,000.00','2,094,400,000.00','-');
theRows[2] = new Array ('<b>'+h+'140>TOFO</B></a>','17,545,500,000.00','17,117,389,800.00','21,931,875,000.00','20,844,054,000.00','24,861,973,500.00','17,292,844,800.00');
theRows[3] = new Array ('<b>'+h+'183>MSO</B></a>','768,000,000.00','900,000,000.00','732,000,000.00','696,000,000.00','1,422,000,000.00','1,134,000,000.00');
theRows[4] = new Array ('<b>'+h+'237>KURT</B></a>','2,118,000,000.00','2,517,600,000.00','2,736,000,000.00','3,240,000,000.00','3,816,000,000.00','2,488,800,000.00');
theRows[5] = new Array ('<b>'+h+'668>GRTY</B></a>','517,500,000.00','500,250,000.00','445,050,000.00','552,000,000.00','737,150,000.00','-');
theRows[6] = new Array ('<b>'+h+'291>MEME</B></a>','8,450,000,000.00','8,555,000,000.00','9,650,000,000.00','10,140,000,000.00','13,430,000,000.00','8,225,000,000.00');
theRows[7] = new Array ('<b>'+h+'292>AMMI</B></a>','-','-','-','-','-','-');
theRows[8] = new Array ('<b>'+h+'426>GOTE</B></a>','1,862,578,100.00','1,638,428,300.00','1,689,662,540.00','2,307,675,560.00','2,956,642,600.00','2,121,951,440.00');
var thetable=new mytable();thetable.tableWidth=650;thetable.shownum=false;thetable.controlaccess=true;thetable.visCols=new Array(true,true,true,true,true);thetable.initsort=new Array(0,-1);thetable.inittable();thetable.refreshTable();</script></form>
<div style="clear: both; margin-top: 10px;">
<div style="background-color: Red; border: 2px solid Green; display: none">
TABLO-ALT</div>
<div id="Bannerctl00_SiteBannerControl2">
<div id="_bannerctl00_SiteBannerControl2">
<div id="Sayfabannerctl00_SiteBannerControl2" class="banner_Codex">
</div>

Please, note that I've only used Selenium in Java, so I'll give you the most generic and languaje-agnostic answer I can. Keep in mind that Python Selenium MAY provide a method to do this directly.
Steps:
Make all Selenium interactions so the WebDriver actually has a VALID page version with all your contents loaded
Extract from selenium the current contents of the whole page
Load it with a HTML parsing library. I use JSoup in Java, I don't now if there's a Python version. From now on, Selenium does not matter.
Use CSS selectors on your parser Object to get the section you want
Convert that section to String to print.
If performance is a requeriment this approach may be a bit too expensive, as the contents are parsed twice: Selenium does it first, and your HTML parser will do it again later with the extracted String from Selenium.
ALTERNATIVE: If your "target page" uses AJAX, you may directly interact with the REST API that javascript is accesing to get the data to fill for you. I tend to follow this approach when doing serious web scraping, but sometimes this is not an option, so I use the above approach.
EDIT
Some more details base on questions in comments:
You can use BeautifullSoup as a html parsing library.
To load a page in BeautifullSoup use:
html = "<html><head></head><body><div id=\"events-horizontal\">Hello world</div></body></html>"
soup = BeautifulSoup(html, "html.parser")
Then look at this answer to see how to extract the specific contents from your soup:
your_div = soup.select_one('div#events-horizontal')
That would give you the first div with events-horizontal id:
<div id="events-horizontal">Hello world</div>
BeautifullSoup code based on:
How to use CSS selectors to retrieve specific links lying in some class using BeautifulSoup?

Related

Scraping giphy with selenium. Unable to retrieve the correct 'src' attribute

I am trying to scrape giphy.com with python selenium package. When I select the required attribute 'src' from the xPath, it is returning something different than is in the 'inspect' section of the website.
it returns this: giphy-gif-img giphy-img-loaded

whereas I am looking to extract the src element as per the website:
src="https://media0.giphy.com/media/j6x5zFoaJN9rAejDfZ/giphy.gif?cid=ecf05e47v5k8qf29vp649xd8nsbba2c0ai8m6ftuifkrnipp&rid=giphy.gif&ct=g"
Weirdly, when i was running this previously, it would get me the required element but has now stopped providing that element!
url = 'https://giphy.com/search/fall-over'
img_x_path = '//*[#id="react-target"]/div/div[6]/div[2]/div[1]/a[11]/div/picture/img'
#%%
#first initialise the driver and then get the webpage
def initialise_chrome():
driver = webdriver.Chrome()
driver.get(url)
return driver
driver = initialise_chrome()
# then let's find the xpath element
print(driver)
#%%
x_path_req = driver.find_element_by_xpath(img_x_path)
def retrive_image_link(x_path_req):
#first - locate the img with pre-defined x_path
print(x_path_req)
#from that, then pick the src bit
image_link = x_path_req.get_attribute('src')
print(image_link)
retrive_image_link(x_path_req)
It looks to me like giphy is using base64 encoded images rather than loading them from a URL source. For example I see this when I inspect the page.
<a href="https://giphy.com/gifs/1stLookTV-montreal-johnny-bananas-1st-look-tv-fCUCWxvDVyuE9gLQSC" class="giphy-gif css-r2u7fp" tabindex="0" style="width: 248px; height: 136px; position: absolute; transform: translate3d(792px, 517px, 0px);">
<div style="width: 248px; height: 136px; position: relative;">
<picture>
<source type="image/webp" srcset="https://media4.giphy.com/media/fCUCWxvDVyuE9gLQSC/200w.webp?cid=ecf05e478y8zdwetr9wadyulcwbss0njcd0gvhp8wbw1sf59&rid=200w.webp&ct=g">.
<img class="giphy-gif-img " src="" width="248" height="136" alt="fall over wipe out GIF by 1st Look" style="background: rgb(153, 51, 255);">
</picture>
</div>
</a>
It does appear though that the <source> element above the <img> has the actual URL source in the srcset attribute, so maybe you can alter your XPath to extract that instead.
You could also edit you XPath to extract the srcset attribute too. I think it would be this instead
//*[#id="react-target"]/div/div[6]/div[2]/div[1]/a[11]/div/picture/source/#srcset

Scrape data and interact with webpage rendered in HTML

I am trying to scrape some data off of a FanGraphs webpage as well as interact with the page itself. Since there are many buttons and dropdowns on the page to narrow down my search results, I need to be able to find the corresponding elements in the HTML. However, when I tried to use a 'classic' approach and use modules like requests and urllib.requests, the portions of the HTML containing the data I need did not appear.
HTML Snippet
Here is a part of the HTML which contains the elements which I need.
<div id="root-season-grid">
<div class="season-grid-wrapper">
<div class="season-grid-title">Season Stat Grid</div>
<div class="season-grid-controls">
<div class="season-grid-controls-button-row">
<div class="fgButton button-green active isActive">Batting</div>
<div class="fgButton button-green">Pitching</div>
<div class="spacer-v-20"></div>
<div class="fgButton button-green active isActive">Normal</div>
<div class="fgButton button-green">Normal & Changes</div>
<div class="fgButton button-green">Year-to-Year Changes</div>
</div>
</div>
</div>
</div>
</div>
The full CSS path:
html > body > div#wrapper > div#content > div#root-season-grid div.season-grid-wrapper > div.season-grid-controls > div.season-grid-controls-button-row
Attempts
requests and bs4
>>> res = requests.get("https://fangraphs.com/leaders/season-stat-grid")
>>> soup = bs4.BeautifulSoup4(res.text, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"></div>]
>>> soup.select(".season-grid-wrapper")
[]
So bs4 was able to find the <div id="root-season-grid"></div> element, but could not find any descendants of that element.
urllib and lxml
>>> res = urllib.request.urlopen("https://fangraphs.com/leaders/season-stat-grid")
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(res, parser)
>>> tree.xpath("//div[#id='root-season-grid']")
[<Element div at 0x131e1b3f8c0>]
>>> tree.xpath("//div[#class='season-grid-wrapper']")
[]
Again, no descendants of the div element could be found, this time with lxml.
I started to wonder if I should be using a different URL address to pass to both requests.get() and urlopen(), so I created a selenium remote browser, browser, then passed browser.current_url to both function. Unfortunately, the results were identical.
selenium
I did notice however, that using selenium.find_element_by_* and selenium.find_elements_by_* were able to find the elements, so I started using that. However, doing so took a lot of memory and was extremely slow.
selenium and bs4
Since selenium.find_element_by_* worked properly, I came up with a very hacky 'solution'. I selected the full HTML by using the "*" CSS selector then passed that to bs4.BeautifulSoup()
>>> browser = selenium.webdriver.Firefox()
>>> html_elem = browser.find_element_by_css_selector("*")
>>> html = html_elem.get_attribute("innerHTML")
>>> soup = bs4.BeautifulSoup(html, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"><div class="season-grid-wrapper">...</div></div>]
>>> soup.select(".season-grid-wrapper")
[<div class="season-grid-wrapper">...</div>]
So this last attempt was somewhat of a success, as I was able to get the elements I needed. However, after running a bunch of unit test and a few integration tests for the module, I realized how inconsistent this is.
Problem
After doing a bunch of research, I concluded the reason why Attempts (1) and (2) didn't work and why Attempt (3) is inconsistent is because the table in the page is rendered by JavaScript, along with the buttons and dropdowns. This also explains why the HTML above is not present when you click View Page Source. It seems that, when requests.get() and urlopen() are called, the JavaScript is not fully rendered, and whether bs4+selenium works depends on how fast the JavaScript renders. Are there any Python libraries which can render the JavaScript before returning the HTML content?
Hopefully this isn't too long of a question. I tried to condense as far as possible without sacrificing clarity.
Just get the page_source from Selenium and pass it to bs4.
browser.get("https://fangraphs.com/leaders/season-stat-grid")
soup = bs4.BeautifulSoup(browser.page_source, features="lxml")
print(soup.select("#root-season-grid"))
I'd recommend using their api however https://www.fangraphs.com/api/leaders/season-grid/data?position=B&seasonStart=2011&seasonEnd=2019&stat=WAR&pastMinPt=400&curMinPt=0&mode=normal

Beautiful Soup find() isn't finding all results for Class

I have code trying to pull all the html stuff within the tracklist container, which should have 88 songs. The information is definitely there (I printed the soup to check), so I'm not sure why everything after the first 30 react-contextmenu-wrapper are lost.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
spotify = 'https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP'
html = urlopen(spotify)
soup = BeautifulSoup(html, "html5lib")
main = soup.find(class_ = 'tracklist-container')
print(main)
Thank you for the help.
Current output from printing is as follows:
1.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Move On - Teen Daze Remix</span><span class="artists-albums"><span dir="auto">Garden City Movement</span> • <span dir="auto">Entertainment</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">5:11</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="2" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
2.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Flicker</span><span class="artists-albums"><span dir="auto">Forhill</span> • <span dir="auto">Flicker</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">3:45</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="3" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
...
30.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Trapdoor</span><span class="artists-albums"><span dir="auto">Eagle Eyed Tiger</span> • <span dir="auto">Future or Past</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">4:14</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li></ol><button class="link js-action-button" data-track-type="view-all-button">View all on Spotify</button></div>
Last entry should be the 88th. It just feels like my search results got truncated.
It is all there in the response just within a script tag.
You can see the start of the relevant javascript object here:
I would regex out the required string and parse with json library.
Py:
import requests, re, json
r = s.get('https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP')
p = re.compile(r'Spotify\.Entity = (.*?);')
data = json.loads(p.findall(r.text)[0])
print(len(data['tracks']['items']))
Since it seemed you were on right track, I did not try to solve the full problem and rather tried to provide you a hint which could be helpful: Do dynamic webscraping.
"Why Selenium? Isn’t Beautiful Soup enough?
Web scraping with Python often requires no more than the use of the Beautiful Soup to reach the goal. Beautiful Soup is a very powerful library that makes web scraping by traversing the DOM (document object model) easier to implement. But it does only static scraping. Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in “view page source”, and then you slice and dice it. If the data you are looking for is available in “view page source” only, you don’t need to go any further. But if you need data that are present in components which get rendered on clicking JavaScript links, dynamic scraping comes to the rescue. The combination of Beautiful Soup and Selenium will do the job of dynamic scraping. Selenium automates web browser interaction from python. Hence the data rendered by JavaScript links can be made available by automating the button clicks with Selenium and then can be extracted by Beautiful Soup."
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
Here is what I see at the end of the 30 songs in the DOM which refers to a button:
</li>
</ol>
<button class="link js-action-button" data-track-type="view-all-button">
View all on Spotify
</button>
</div>
It's because you're doing
main = soup.find(class_ = 'tracklist-container')
the class "tracklist-container" only holds these 30 items,
i'm not sure what you're trying to accomplish, but if you want
what's afterwards try parsing the class afterwards.
in other words, the class contains 30 songs, i visited the site and found 30 songs so it might be only for logged in users.

How to get specific data using BeautifulSoup

I'm not sure how to get a specific result from this:
<div class="videoPlayer">
<div class="border-radius-player">
<div id="allplayers" style="position:relative;width:100%;height:100%;overflow: hidden;">
<div id="box">
<div id="player_content" class="todo" style="text-align: center; display: block;">
<div id="player" class="jwplayer jew-reset jew-skin-seven jw-state-paused jw-flag-user-inactive" tabindex="0">
<div class="jw-media jw-reset">
<video class="jw-video jw-reset" x-webkit-playsinline="" src="https:EXAMPLE-URL-HERE" preload="metadata"></video>
</div">
How would I get the src in <video class="jw-video jw-reset" x-webkit-playsinline="" src="https:EXAMPLE-URL-HERE" preload="metadata"></video>
This is what I've tried so far:
import urllib.request
from bs4 import BeautifulSoup
url = "https://someurlhere"
a = urllib.request.Request(url, headers={'User-Agent' : "Cliqz"})
b = urllib.request.urlopen(a) # prevent "Permission denies"
soup = BeautifulSoup(b, 'html.parser')
for video_class in soup.select("div.videoPlayer"):
print(video_class.text)
Which returns parts of it but not down to video class
Requests is a simple html client, it cannot execute javascripts.
You have three more options to try here though!
try going over the html source (b) and see if any of the javascripts in the site have the data you need. usually, the page would have the url (which, i assume you want to scrape) in some sort of holder (a javascript code or a json object) that you can scrape off.
Try looking at the XHR requests of the site and see if any of the requests query external sources for the video data. In this case, see if you can imitate that request to get the data you need.
(last resort) You need to use a phantomjs + selenium browser to download the website (Link1, Link2). You can find out more about how to use selenium in this SO post: https://stackoverflow.com/a/26440563/3986395

Python - beautifulSoup unable to iterate repetitive blocks

Unsure how to properly word the issue.
I am trying to parse through an HTML document with a tree similar to that of
div(unique-class)
|-a
|-h4
|-div(class-a)
|-div(class-b)
|-div(class-c)
|-p
Etc, it continues. I only listed the few items I need. It is a lot of sibling hierarchy, all existing within one div.
I've been working quite a bit with BeautifulSoup for the past few hours, and I finally have a working version (Beta) of what I'm trying to parse, in this example.
from bs4 import BeautifulSoup
import urllib2
import csv
file = "C:\\Python27\\demo.html"
soup = BeautifulSoup (open(file), 'html.parser')
#(page, 'html.parser')
#Let's pull prices
names = []
pricing = []
discounts = []
for name in soup.find_all('div', attrs={'class': 'unique_class'}):
names.append(name.h4.text)
for price in soup.find_all('div', attrs={'class': 'class-b'}):
pricing.append(price.text)
for discount in soup.find_all('div', attrs={'class': 'class-a'}):
discounts.append(discount.text)
ofile = open('output2.csv','wb')
fieldname = ['name', 'discountPrice', 'originalPrice']
writer = csv.DictWriter(ofile, fieldnames = fieldname)
writer.writeheader()
for i in range(len(names)):
print (names[i], pricing[i], discounts[i])
writer.writerow({'name': names[i], 'discountPrice':pricing[i], 'originalPrice': discounts[i]})
ofile.close()
As you can tell this it iterating from top to bottom and appending to a distinct array for each one. The issue is, if I'm iterating over, let's say, 30,000 items and the website can modify itself (We'll say a ScoreBoard app on a JS Framework), by the time I get to the 2nd iteration, the order may have changed. (As I type this I realize this scenario actually would need more variables since BS would 'catch' the website at time of load, but I think the point still stands.)
I believe I need to leverage the next_sibling function within BS4 but when I did that I started capturing items I wasn't specifying, because I couldn't apply a 'class' to the sibling.
Update
An additional issue I encouraged when trying to do a loop within a loop to find the 3 children I need under the unique-class was I would end up with the first price being listed for all names.
Update - Adding sample HTML
<div class="unique_class">
<h4>World</h4>
<div class="class_b">$1.99</div>
<div class="class_a">$1.99</div>
</div>
<div class="unique_class">
<h4>World2</h4>
<div class="class_b">$2.99</div>
<div class="class_a">$2.99</div>
</div>
<div class="unique_class">
<h4>World3</h4>
<div class="class_b">$3.99</div>
<div class="class_a">$3.99</div>
</div>
<div class="unique_class">
<h4>World4</h4>
<div class="class_b">$4.99</div>
<div class="class_a">$3.99</div>
</div>
I have also found a fix, and submitted the answer to be Optimized - Located at CodeReview
If the site you are looking to scrape the data from is using JS you may want to use selenium and use its page_source method to extract snapshots of the page with loaded JS you can then load into BS.
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.get(<url>)
page = driver.page_source
Then you can use BS to parse the JS loaded 'page'
If you want to wait for other JS events to load up you are able to specify events to wait for in selenium.

Categories