I'm currently experimenting with Beautiful Soup 4 in Python 2.7.6
Right now, I have a simple script to scrape Soundcloud.com. I'm trying to print out the number of button tags on the page, but I'm not getting the answer I expect.
from bs4 import BeautifulSoup
import requests
page = requests.get('http://soundcloud.com/sondersc/waterfalls-sonder')
data = page.text
soup = BeautifulSoup(data)
buttons = soup.findAll('button')
print 'num buttons =', len(buttons)
When I run this, I get the output
num buttons = 0
This confuses me. I know for a fact that the button tags exist on this page so it shouldn't be returning 0. Upon inspecting the button elements directly underneath the waveform, I find these...
<button class="sc-button sc-button-like sc-button-medium sc-button-responsive" tabindex="0" title="Like">Like</button>
<button class="sc-button sc-button-medium sc-button-responsive sc-button-addtoset" tabindex="0" title="Add to playlist">Add to playlist</button>
<button class="sc-button sc-button-medium sc-button-responsive sc-button-addtogroup" tabindex="0" title="Add to group">Add to group</button>
<button class="sc-button sc-button-share sc-button-medium sc-button-responsive" title="Share" tabindex="0">Share</button>
At first I thought that the way I was trying to find the button elements was incorrect. However, if I modify my code to scrape an arbitrary YouTube page...
page = requests.get('http://www.youtube.com/watch?v=UiyDmqO59QE')
then I get the output
num buttons = 37
So that means that soup.findAll('button') is doing what it's supposed to, just not on SoundCloud.
I've also tried specifying the exact button I want, expecting to get a return result of 1
buttons = soup.findAll('button', class_='sc-button sc-button-like sc-button-medium sc-button-responsive')
print 'num buttons =', len(buttons)
but it still returns 0.
I'm kind of stumped on this one. Can anyone explain why this is?
The reason you cannot get the buttons is that there are no button tags inside the html you are getting:
>>> import requests
>>> page = requests.get('http://soundcloud.com/sondersc/waterfalls-sonder')
>>> data = page.text
>>> '<button' in data
False
This means that there is much more involved in forming the page: AJAX requests, JavaScript function calls, etc.
Also, note that SoundCloud provides an API, so there is no need to crawl the site's HTML pages. There is also a Python wrapper around the SoundCloud API available.
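For instance, a minimal sketch with the soundcloud wrapper (you would need to register for a client ID; YOUR_CLIENT_ID is a placeholder):
import soundcloud

# minimal sketch using the wrapper's documented interface;
# YOUR_CLIENT_ID is a placeholder for a registered API key
client = soundcloud.Client(client_id='YOUR_CLIENT_ID')
# resolve the human-readable track URL into an API resource
track = client.get('/resolve', url='http://soundcloud.com/sondersc/waterfalls-sonder')
print track.title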
Also, be careful about web-scraping, study Terms of Use:
You must not employ scraping or similar techniques to aggregate,
repurpose, republish or otherwise make use of any Content.
Related
I am trying to scrape some data off of a FanGraphs webpage as well as interact with the page itself. Since there are many buttons and dropdowns on the page to narrow down my search results, I need to be able to find the corresponding elements in the HTML. However, when I tried a 'classic' approach with modules like requests and urllib.request, the portions of the HTML containing the data I need did not appear.
HTML Snippet
Here is a part of the HTML which contains the elements which I need.
<div id="root-season-grid">
<div class="season-grid-wrapper">
<div class="season-grid-title">Season Stat Grid</div>
<div class="season-grid-controls">
<div class="season-grid-controls-button-row">
<div class="fgButton button-green active isActive">Batting</div>
<div class="fgButton button-green">Pitching</div>
<div class="spacer-v-20"></div>
<div class="fgButton button-green active isActive">Normal</div>
<div class="fgButton button-green">Normal & Changes</div>
<div class="fgButton button-green">Year-to-Year Changes</div>
</div>
</div>
</div>
</div>
The full CSS path:
html > body > div#wrapper > div#content > div#root-season-grid > div.season-grid-wrapper > div.season-grid-controls > div.season-grid-controls-button-row
Attempts
requests and bs4
>>> res = requests.get("https://fangraphs.com/leaders/season-stat-grid")
>>> soup = bs4.BeautifulSoup(res.text, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"></div>]
>>> soup.select(".season-grid-wrapper")
[]
So bs4 was able to find the <div id="root-season-grid"></div> element, but could not find any descendants of that element.
urllib and lxml
>>> res = urllib.request.urlopen("https://fangraphs.com/leaders/season-stat-grid")
>>> parser = lxml.etree.HTMLParser()
>>> tree = lxml.etree.parse(res, parser)
>>> tree.xpath("//div[#id='root-season-grid']")
[<Element div at 0x131e1b3f8c0>]
>>> tree.xpath("//div[#class='season-grid-wrapper']")
[]
Again, no descendants of the div element could be found, this time with lxml.
I started to wonder if I should be using a different URL to pass to both requests.get() and urlopen(), so I created a Selenium remote browser, browser, and passed browser.current_url to both functions. Unfortunately, the results were identical.
selenium
I did notice, however, that selenium.find_element_by_* and selenium.find_elements_by_* were able to find the elements, so I started using those. Unfortunately, doing so took a lot of memory and was extremely slow.
selenium and bs4
Since selenium.find_element_by_* worked properly, I came up with a very hacky 'solution': I selected the full HTML using the "*" CSS selector, then passed that to bs4.BeautifulSoup()
>>> browser = selenium.webdriver.Firefox()
>>> html_elem = browser.find_element_by_css_selector("*")
>>> html = html_elem.get_attribute("innerHTML")
>>> soup = bs4.BeautifulSoup(html, features="lxml")
>>> soup.select("#root-season-grid")
[<div id="root-season-grid"><div class="season-grid-wrapper">...</div></div>]
>>> soup.select(".season-grid-wrapper")
[<div class="season-grid-wrapper">...</div>]
So this last attempt was somewhat of a success, as I was able to get the elements I needed. However, after running a bunch of unit tests and a few integration tests for the module, I realized how inconsistent this approach is.
Problem
After doing a bunch of research, I concluded that the reason Attempts (1) and (2) didn't work, and Attempt (3) is inconsistent, is that the table on the page is rendered by JavaScript, along with the buttons and dropdowns. This also explains why the HTML above is not present when you click View Page Source. It seems that when requests.get() and urlopen() are called, the JavaScript is not yet rendered, and whether bs4 + selenium works depends on how fast the JavaScript renders. Are there any Python libraries which can render the JavaScript before returning the HTML content?
Hopefully this isn't too long of a question. I tried to condense as far as possible without sacrificing clarity.
Just get the page_source from Selenium and pass it to bs4.
browser.get("https://fangraphs.com/leaders/season-stat-grid")
soup = bs4.BeautifulSoup(browser.page_source, features="lxml")
print(soup.select("#root-season-grid"))
I'd recommend using their API instead, however: https://www.fangraphs.com/api/leaders/season-grid/data?position=B&seasonStart=2011&seasonEnd=2019&stat=WAR&pastMinPt=400&curMinPt=0&mode=normal
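For example, a rough sketch of hitting that endpoint directly; the response structure is an assumption, so inspect it before relying on specific keys:
import requests

# a rough sketch: fetch the JSON endpoint mentioned above directly
url = ("https://www.fangraphs.com/api/leaders/season-grid/data"
       "?position=B&seasonStart=2011&seasonEnd=2019&stat=WAR"
       "&pastMinPt=400&curMinPt=0&mode=normal")
res = requests.get(url)
res.raise_for_status()
data = res.json()  # assumption: the endpoint returns JSON
print(type(data))  # check the top-level shape before drilling in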
I want to scrape the href link using Python 3.
Existing code:
import lxml.html
import requests
dom = lxml.html.fromstring(requests.get('https://www.tripadvisor.co.uk/Search?singleSearchBox=true&geo=191&pid=3825&redirect=&startTime=1576072392277&uiOrigin=MASTHEAD&q=the%20grilled%20cheese%20truck&supportedSearchTypes=find_near_stand_alone_query&enableNearPage=true&returnTo=https%253A__2F____2F__www__2E__tripadvisor__2E__co__2E__uk__2F__&searchSessionId=AF4BFA0308CF336B90FD9602FA122CD11576072382852ssid&social_typeahead_2018_feature=true&sid=AF4BFA0308CF336B90FD9602FA122CD11576072410521&blockRedirect=true&ssrc=a&rf=1').content)
result = dom.xpath("//a[@class='review_count']/@href")
print (result)
from this code:
<a class="review_count" href="/Restaurant_Review-g54774-d10073153-Reviews-The_Grilled_Cheese_Truck-Rapid_City_South_Dakota.html#REVIEWS" onclick="return false;" data-clicksource="ReviewCount">3 reviews</a>
With my existing code I'm getting an empty list printed.
I have located the link here:
widgetEvCall('handlers.openResult', event, this, '/Restaurant_Review-g54774-d10073153-Reviews-The_Grilled_Cheese_Truck-Rapid_City_South_Dakota.html', {type: 'EATERY',element: this,index: 0,section: 1,locationId: '10073153',parentId: '54774',elementType: 'title',selectedId: '10073153'});
So I need help with this; in this case it would be even better to also get the locationId and selectedId to print.
Any ideas?
The problem you're having is that the data is loaded via JavaScript; try viewing the page with JavaScript disabled.
You could try using a tool that can execute JavaScript, e.g. Selenium: https://selenium-python.readthedocs.io/
Or try to track down where the JavaScript is loading the data from and then request that endpoint directly with Python.
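For example, a rough sketch of the Selenium route (assuming Firefox with geckodriver installed; once the page's JavaScript has run, the review_count anchor should be in the DOM):
import lxml.html
from selenium import webdriver

search_url = 'https://www.tripadvisor.co.uk/Search?...'  # the full search URL from the question

browser = webdriver.Firefox()
browser.get(search_url)
# run the same XPath as before, but against the rendered page source
dom = lxml.html.fromstring(browser.page_source)
print(dom.xpath("//a[@class='review_count']/@href"))
browser.quit()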
I have code trying to pull all the HTML within the tracklist container, which should hold 88 songs. The information is definitely there (I printed the soup to check), so I'm not sure why everything after the first 30 react-contextmenu-wrapper elements is lost.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
spotify = 'https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP'
html = urlopen(spotify)
soup = BeautifulSoup(html, "html5lib")
main = soup.find(class_ = 'tracklist-container')
print(main)
Thank you for the help.
Current output from printing is as follows:
1.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Move On - Teen Daze Remix</span><span class="artists-albums"><span dir="auto">Garden City Movement</span> • <span dir="auto">Entertainment</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">5:11</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="2" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
2.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Flicker</span><span class="artists-albums"><span dir="auto">Forhill</span> • <span dir="auto">Flicker</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">3:45</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="3" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
...
30.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Trapdoor</span><span class="artists-albums"><span dir="auto">Eagle Eyed Tiger</span> • <span dir="auto">Future or Past</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">4:14</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li></ol><button class="link js-action-button" data-track-type="view-all-button">View all on Spotify</button></div>
Last entry should be the 88th. It just feels like my search results got truncated.
It is all there in the response, just within a script tag.
The relevant JavaScript object is the one assigned to Spotify.Entity in the page source.
I would regex out the required string and parse it with the json library.
Py:
import requests, re, json
r = requests.get('https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP')
p = re.compile(r'Spotify\.Entity = (.*?);')
data = json.loads(p.findall(r.text)[0])
print(len(data['tracks']['items']))
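If you also need per-track details, you can walk the parsed object further. A hedged continuation of the snippet above; the item['track']['name'] path is an assumption to verify against real data:
# assuming each item mirrors the Spotify playlist-track object,
# the name should live at item['track']['name'] (verify on real data)
for item in data['tracks']['items']:
    print(item['track']['name'])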
Since it seemed you were on the right track, I did not try to solve the full problem and instead tried to provide a hint which could be helpful: do dynamic web scraping.
"Why Selenium? Isn’t Beautiful Soup enough?
Web scraping with Python often requires no more than the use of the Beautiful Soup to reach the goal. Beautiful Soup is a very powerful library that makes web scraping by traversing the DOM (document object model) easier to implement. But it does only static scraping. Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in “view page source”, and then you slice and dice it. If the data you are looking for is available in “view page source” only, you don’t need to go any further. But if you need data that are present in components which get rendered on clicking JavaScript links, dynamic scraping comes to the rescue. The combination of Beautiful Soup and Selenium will do the job of dynamic scraping. Selenium automates web browser interaction from python. Hence the data rendered by JavaScript links can be made available by automating the button clicks with Selenium and then can be extracted by Beautiful Soup."
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
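As a concrete sketch of that combination (assuming Firefox with geckodriver installed; the class name is taken from your own code):
from bs4 import BeautifulSoup
from selenium import webdriver

# Selenium executes the page's JavaScript, so page_source is the rendered DOM
browser = webdriver.Firefox()
browser.get('https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP')
soup = BeautifulSoup(browser.page_source, 'html5lib')
main = soup.find(class_='tracklist-container')
browser.quit()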
Here is what I see at the end of the 30 songs in the DOM which refers to a button:
</li>
</ol>
<button class="link js-action-button" data-track-type="view-all-button">
View all on Spotify
</button>
</div>
It's because you're doing
main = soup.find(class_ = 'tracklist-container')
and the class "tracklist-container" only holds these 30 items.
I'm not sure what you're trying to accomplish, but if you want what comes afterwards, try parsing the class that follows it.
In other words, the class contains 30 songs; I visited the site and found 30 songs too, so the rest might only be available to logged-in users.
I have been stuck on this for a while... I am trying to scrape the player name and projection from this site: https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793
The script is going to loop through past pages by just going through all the PIDs in a range, but that isn't the problem. The main problem is that when I inspect the element, I find the value is stored within this class:
<div class="salarybox expanded"...
which is located in the 5th position of my projectionsView list.
The scraper finds the projectionsView class fine but can't find anything within it.
When I go to view the actual HTML of the site, it seems this content just doesn't exist within it...
<div id="salData" class="projectionsView">
<!-- Fill in with Salary Data -->
</div>
I'm super new to scraping and have successfully scraped everything else I need for my project, just not this damn site... I think it may be because I have to sign up for the site? But either way, the information is viewable without signing in, so I figured I didn't need to use Selenium, and even if I did, I don't think that would find it.
Anyway here's the code I have so far that is obviously returning a blank list.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import pandas as pd
import os
url = "https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793"
uClient = uReq(url)
page_read = uClient.read()
uClient.close()
page_soup = soup(page_read, "html.parser")
salarybox = page_soup.findAll("div",{"class":"projectionsView"})
print(salarybox[4].findAll("div",{"class":"salarybox expanded"}))
Any ideas would be greatly appreciated!
The whole idea of the script is just to find the ppText of each "salarybox expanded" element on each page. I just want to know how to find these elements. Perhaps a different parser?
Based on your URL's page, the <div id="salData" class="projectionsView"> is rewritten by JavaScript, but urllib.request only returns the raw response from the server, which means the JavaScript-generated content will not be in it. Hence the div will be empty:
<div id="salData" class="projectionsView">
<!-- Fill in with Salary Data -->
</div>
You'd better try Selenium; Splash will also work for this kind of dynamic website.
BTW, after you get the right response, select the div by id; it will be more specific:
salarybox = page_soup.find("div",{"id":"salData"})
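For instance, a minimal sketch of the Selenium route (assuming Chrome with chromedriver installed, and that the salary data appears once the page's JavaScript has run):
from bs4 import BeautifulSoup
from selenium import webdriver

url = "https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793"
browser = webdriver.Chrome()
browser.get(url)
# hand the rendered DOM to Beautiful Soup, then select the div by id
page_soup = BeautifulSoup(browser.page_source, "html.parser")
salData = page_soup.find("div", {"id": "salData"})
print(salData.find_all("div", {"class": "salarybox expanded"}))
browser.quit()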
Pretty new to Python... and I'm trying my hand at my first project.
I've been able to replicate a few simple demos... but I think there are a few extra complexities with what I'm trying to do.
I'm trying to scrape the game logs from the NHL website.
Here is what I came up with. Similar code works for the top section of the site (e.g. getting the age), but it fails on the section with display logic (dependent on whether the user clicks on Career, Game Logs, or Splits).
Thanks in advance for your help.
import urllib2
from bs4 import BeautifulSoup
url = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
Test = soup.find_all('div', attrs={'id': "gamelogsTable"})
This happens with many web pages. It's because some of the content is downloaded by JavaScript code that is part of the initial download. By doing this, designers are able to show visitors the most important parts of a page without waiting for the entire page to download.
When you want to scrape a page the first thing you should do is to examine the source code for it (often using Ctrl-u in a Windows environment) to see if the content you require is available. If not then you will need to use something beyond BeautifulSoup.
>>> getzlafURL = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
>>> import requests
>>> import selenium.webdriver as webdriver
>>> import lxml.html as html
>>> import lxml.html.clean as clean
>>> browser = webdriver.Chrome()
>>> browser.get(getzlafURL)
>>> content = browser.page_source
>>> cleaner = clean.Cleaner()
>>> content = cleaner.clean_html(content)
>>> doc = html.fromstring(content)
>>> type(doc)
<class 'lxml.html.HtmlElement'>
>>> open('c:/scratch/temp.htm', 'w').write(content)
775838
By searching within the file temp.htm for the heading 'Ryan Getzlaf Game Logs' I was able to find this section of HTML code. As you can see, it's about what you expected to find in the original downloaded HTML. However, this additional step is required to get at it.
</div>
</li>
</ul>
<h5 class="statistics__subheading">Ryan Getzlaf Game Logs</h5>
<div id="gamelogsTable"><div class="responsive-datatable">
I should mention that there are alternative ways of accessing such code, one of them being dryscrape. I simply can't be bothered installing that one on this Windows machine.
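For reference, the dryscrape route would look roughly like this (untested here; a sketch based on its documented Session interface):
import dryscrape
from bs4 import BeautifulSoup

session = dryscrape.Session()
session.visit(getzlafURL)  # the URL defined above
# body() returns the page source after the JavaScript has executed
soup = BeautifulSoup(session.body(), 'html.parser')
print(soup.find_all('div', attrs={'id': 'gamelogsTable'}))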