parse page with beautifulsoup - python

I'm trying to parse this webpage and take some of information:
http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513
import requests
page = requests.get("http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=778253364357513")
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
All_Information = soup.find(id="MainContent")
print(All_Information)
it seams all information between tag is hidden. when i run the code this data is returned.
<div class="tabcontent content" id="MainContent">
<div id="TopBox"></div>
<div id="ThemePlace" style="text-align:center">
<div class="box1 olive tbl z2_4 h250" id="Section_relco" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_history" style="display:none"></div>
<div class="box1 silver tbl z2_4 h250" id="Section_tcsconfirmedorders" style="display:none"></div>
</div>
</div>
Why is the information not there, and how can I find and/or access it?

The information that I assume you are looking for is not loaded in your request. The webpage makes additional requests after it has initally loaded. There are a few ways you can get that information.
You can try selenium. It is a python package that simulates a web browser. This allows the page to load all the information before you try to scrape.
Another way is to reverse enginneer the website and find out where it is getting the information you need.
Have a look at this link.
http://www.tsetmc.com/tsev2/data/instinfofast.aspx?i=778253364357513&c=57+
It is called by your page every few seconds, and it appears to contain all the pricing information you are looking for. It may be easier to call that webpage to get your information.

Related

Scrape data and interact with webpage rendered in HTML

I am trying to scrape some data off of a FanGraphs webpage as well as interact with the page itself. Since there are many buttons and dropdowns on the page to narrow down my search results, I need to be able to find the corresponding elements in the HTML. However, when I tried to use a 'classic' approach and use modules like requests and urllib.requests, the portions of the HTML containing the data I need did not appear.
HTML Snippet
Here is a part of the HTML which contains the elements which I need.
<div id="root-season-grid">
<div class="season-grid-wrapper">
<div class="season-grid-title">Season Stat Grid</div>
<div class="season-grid-controls">
<div class="season-grid-controls-button-row">
<div class="fgButton button-green active isActive">Batting</div>
<div class="fgButton button-green">Pitching</div>
<div class="spacer-v-20"></div>
<div class="fgButton button-green active isActive">Normal</div>
<div class="fgButton button-green">Normal & Changes</div>
<div class="fgButton button-green">Year-to-Year Changes</div>
</div>
</div>
</div>
</div>
</div>
The full CSS path:
html > body > div#wrapper > div#content > div#root-season-grid div.season-grid-wrapper > div.season-grid-controls > div.season-grid-controls-button-row
Attempts
requests and bs4
>>> res = requests.get("https://fangraphs.com/leaders/season-stat-grid")
>>> soup = bs4.BeautifulSoup4(res.text, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"></div>]
>>> soup.select(".season-grid-wrapper")
[]
So bs4 was able to find the <div id="root-season-grid"></div> element, but could not find any descendants of that element.
urllib and lxml
>>> res = urllib.request.urlopen("https://fangraphs.com/leaders/season-stat-grid")
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(res, parser)
>>> tree.xpath("//div[#id='root-season-grid']")
[<Element div at 0x131e1b3f8c0>]
>>> tree.xpath("//div[#class='season-grid-wrapper']")
[]
Again, no descendants of the div element could be found, this time with lxml.
I started to wonder if I should be using a different URL address to pass to both requests.get() and urlopen(), so I created a selenium remote browser, browser, then passed browser.current_url to both function. Unfortunately, the results were identical.
selenium
I did notice however, that using selenium.find_element_by_* and selenium.find_elements_by_* were able to find the elements, so I started using that. However, doing so took a lot of memory and was extremely slow.
selenium and bs4
Since selenium.find_element_by_* worked properly, I came up with a very hacky 'solution'. I selected the full HTML by using the "*" CSS selector then passed that to bs4.BeautifulSoup()
>>> browser = selenium.webdriver.Firefox()
>>> html_elem = browser.find_element_by_css_selector("*")
>>> html = html_elem.get_attribute("innerHTML")
>>> soup = bs4.BeautifulSoup(html, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"><div class="season-grid-wrapper">...</div></div>]
>>> soup.select(".season-grid-wrapper")
[<div class="season-grid-wrapper">...</div>]
So this last attempt was somewhat of a success, as I was able to get the elements I needed. However, after running a bunch of unit test and a few integration tests for the module, I realized how inconsistent this is.
Problem
After doing a bunch of research, I concluded the reason why Attempts (1) and (2) didn't work and why Attempt (3) is inconsistent is because the table in the page is rendered by JavaScript, along with the buttons and dropdowns. This also explains why the HTML above is not present when you click View Page Source. It seems that, when requests.get() and urlopen() are called, the JavaScript is not fully rendered, and whether bs4+selenium works depends on how fast the JavaScript renders. Are there any Python libraries which can render the JavaScript before returning the HTML content?
Hopefully this isn't too long of a question. I tried to condense as far as possible without sacrificing clarity.
Just get the page_source from Selenium and pass it to bs4.
browser.get("https://fangraphs.com/leaders/season-stat-grid")
soup = bs4.BeautifulSoup(browser.page_source, features="lxml")
print(soup.select("#root-season-grid"))
I'd recommend using their api however https://www.fangraphs.com/api/leaders/season-grid/data?position=B&seasonStart=2011&seasonEnd=2019&stat=WAR&pastMinPt=400&curMinPt=0&mode=normal

Scraping webpage with _ngcontent value within different html tags

I am new to scraping and coding as well. So far I am able to scrape data using beautiful soup using below code:
sub_soup = BeautifulSoup(sub_page, 'html.parser')
content = sub_soup.find('div',class_='detail-view-content')
print(content)
This works correct when tag and class are in format:
<div class="masthead-card masthead-hover">
But fail when format is with _ngcontent:
<span _ngcontent-ixr-c5="" class="btn-trailer-text">
or
<div _ngcontent-wak-c4="" class="col-md-6">
An example of _ngcontent webpage screenshot I am trying to scrape is below :
All I tried results in blank or 'None'. What am I missing.
BeautifulSoup runs faster than page loading.
so you should use Selenium library and ChromeDriver.
here it is

Beautiful Soup find() isn't finding all results for Class

I have code trying to pull all the html stuff within the tracklist container, which should have 88 songs. The information is definitely there (I printed the soup to check), so I'm not sure why everything after the first 30 react-contextmenu-wrapper are lost.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
spotify = 'https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP'
html = urlopen(spotify)
soup = BeautifulSoup(html, "html5lib")
main = soup.find(class_ = 'tracklist-container')
print(main)
Thank you for the help.
Current output from printing is as follows:
1.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Move On - Teen Daze Remix</span><span class="artists-albums"><span dir="auto">Garden City Movement</span> • <span dir="auto">Entertainment</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">5:11</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="2" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
2.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Flicker</span><span class="artists-albums"><span dir="auto">Forhill</span> • <span dir="auto">Flicker</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">3:45</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="3" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
...
30.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Trapdoor</span><span class="artists-albums"><span dir="auto">Eagle Eyed Tiger</span> • <span dir="auto">Future or Past</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">4:14</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li></ol><button class="link js-action-button" data-track-type="view-all-button">View all on Spotify</button></div>
Last entry should be the 88th. It just feels like my search results got truncated.
It is all there in the response just within a script tag.
You can see the start of the relevant javascript object here:
I would regex out the required string and parse with json library.
Py:
import requests, re, json
r = s.get('https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP')
p = re.compile(r'Spotify\.Entity = (.*?);')
data = json.loads(p.findall(r.text)[0])
print(len(data['tracks']['items']))
Since it seemed you were on right track, I did not try to solve the full problem and rather tried to provide you a hint which could be helpful: Do dynamic webscraping.
"Why Selenium? Isn’t Beautiful Soup enough?
Web scraping with Python often requires no more than the use of the Beautiful Soup to reach the goal. Beautiful Soup is a very powerful library that makes web scraping by traversing the DOM (document object model) easier to implement. But it does only static scraping. Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in “view page source”, and then you slice and dice it. If the data you are looking for is available in “view page source” only, you don’t need to go any further. But if you need data that are present in components which get rendered on clicking JavaScript links, dynamic scraping comes to the rescue. The combination of Beautiful Soup and Selenium will do the job of dynamic scraping. Selenium automates web browser interaction from python. Hence the data rendered by JavaScript links can be made available by automating the button clicks with Selenium and then can be extracted by Beautiful Soup."
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
Here is what I see at the end of the 30 songs in the DOM which refers to a button:
</li>
</ol>
<button class="link js-action-button" data-track-type="view-all-button">
View all on Spotify
</button>
</div>
It's because you're doing
main = soup.find(class_ = 'tracklist-container')
the class "tracklist-container" only holds these 30 items,
i'm not sure what you're trying to accomplish, but if you want
what's afterwards try parsing the class afterwards.
in other words, the class contains 30 songs, i visited the site and found 30 songs so it might be only for logged in users.

Python HTML Parsing with BS4

I'm having the problem of trying to parse through HTML using Python & Beautiful Soup and I'm encountering the problem of which I want to parse for a very specific piece of data. This is the kind of code I'm encountering:
<div class="big_div">
<div class="smaller div">
<div class="other div">
<div class="this">A</div>
<div class="that">2213</div>
<div class="other div">
<div class="this">B</div>
<div class="that">215</div>
<div class="other div">
<div class="this">C</div>
<div class="that">253</div>
There is a series of repeat HTML as you can see with only the values being different, my problem is locating a specific value. I want to locate the 253 in the last div. I would appreciate any help as this is a recurring problem in parsing through HTML.
Thank you in advance!
So far I've tried to parse for it but because the names are the same I have no idea how to navigate through it. I've tried using the for loop too but made little to no progress at all.
You can use string attribute as argument in find. BS docs for string attr.
"""Suppose html is the object holding html code of your web page that you want to scrape
and req_text is some text that you want to find"""
soup = BeautifulSoup(html, 'lxml')
req_div = soup.find('div', string=req_text)
req_div will contain the div element which you want.

How to get specific data using BeautifulSoup

I'm not sure how to get a specific result from this:
<div class="videoPlayer">
<div class="border-radius-player">
<div id="allplayers" style="position:relative;width:100%;height:100%;overflow: hidden;">
<div id="box">
<div id="player_content" class="todo" style="text-align: center; display: block;">
<div id="player" class="jwplayer jew-reset jew-skin-seven jw-state-paused jw-flag-user-inactive" tabindex="0">
<div class="jw-media jw-reset">
<video class="jw-video jw-reset" x-webkit-playsinline="" src="https:EXAMPLE-URL-HERE" preload="metadata"></video>
</div">
How would I get the src in <video class="jw-video jw-reset" x-webkit-playsinline="" src="https:EXAMPLE-URL-HERE" preload="metadata"></video>
This is what I've tried so far:
import urllib.request
from bs4 import BeautifulSoup
url = "https://someurlhere"
a = urllib.request.Request(url, headers={'User-Agent' : "Cliqz"})
b = urllib.request.urlopen(a) # prevent "Permission denies"
soup = BeautifulSoup(b, 'html.parser')
for video_class in soup.select("div.videoPlayer"):
print(video_class.text)
Which returns parts of it but not down to video class
Requests is a simple html client, it cannot execute javascripts.
You have three more options to try here though!
try going over the html source (b) and see if any of the javascripts in the site have the data you need. usually, the page would have the url (which, i assume you want to scrape) in some sort of holder (a javascript code or a json object) that you can scrape off.
Try looking at the XHR requests of the site and see if any of the requests query external sources for the video data. In this case, see if you can imitate that request to get the data you need.
(last resort) You need to use a phantomjs + selenium browser to download the website (Link1, Link2). You can find out more about how to use selenium in this SO post: https://stackoverflow.com/a/26440563/3986395

Categories