I have been stuck on this for a while... I am trying to scrape the player name and projection from this site: https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793
The script is eventually going to loop through the pages by just going through all the PIDs in a range, but that isn't the problem. The main problem is that when I inspect the element, I find the value is stored within this class:
<div class="salarybox expanded"...
which is located in the 5th position of my projectionsView list.
The scraper finds the projectionsView class fine but can't find anything within it.
When I go to view the actual HTML of the site, it seems this content just doesn't exist within it:
<div id="salData" class="projectionsView">
<!-- Fill in with Salary Data -->
</div>
I'm super new to scraping and have successfully scraped everything else I need for my project, just not this damn site... I think it may be because I have to sign up for the site? But the information is viewable without signing in, so I figured I didn't need to use Selenium, and even if I did, I don't think that would find it either.
Anyway, here's the code I have so far, which is obviously returning a blank list.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq
import pandas as pd
import os

url = "https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793"

# download the raw HTML (no JavaScript runs here)
uClient = uReq(url)
page_read = uClient.read()
uClient.close()

page_soup = soup(page_read, "html.parser")

# grab every projectionsView div, then search inside the fifth one
salarybox = page_soup.find_all("div", {"class": "projectionsView"})
print(salarybox[4].find_all("div", {"class": "salarybox expanded"}))
Any ideas would be greatly appreciated!
The whole idea of the script is to just find the ppText of each "salarybox expanded" class on each page. I just want to know how to find these elements. Perhaps a different parser?
Based on your URL, the <div id="salData" class="projectionsView"> is rewritten by JavaScript, but urllib.request only returns the raw response without executing any scripts, which means the JavaScript-generated content will not be in it. Hence the div is empty:
<div id="salData" class="projectionsView">
<!-- Fill in with Salary Data -->
</div>
You'd better try Selenium or Splash; either works for this kind of dynamic website.
BTW, after you get the right response, select the div by id; it is more specific:
salarybox = page_soup.find("div",{"id":"salData"})
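If you do go the Selenium route, a minimal sketch could look like this. The wait condition is an assumption (it just waits for anything with the salarybox class to show up inside #salData), and get_text() stands in for however the ppText value is actually marked up:

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url = "https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793"

driver = webdriver.Chrome()
try:
    driver.get(url)
    # wait (up to 10 s) until the JavaScript has filled in #salData
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "#salData .salarybox"))
    )
    # hand the rendered HTML to BeautifulSoup for the actual parsing
    page_soup = BeautifulSoup(driver.page_source, "html.parser")
    sal_data = page_soup.find("div", {"id": "salData"})
    for box in sal_data.find_all("div", {"class": "salarybox"}):
        print(box.get_text(strip=True))
finally:
    driver.quit()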
I am trying to webscrape the main table from this site: https://www.atptour.com/en/stats/leaderboard?boardType=serve&timeFrame=52Week&surface=all&versusRank=all&formerNo1=false
Here is my code:
import requests
from bs4 import BeautifulSoup, Comment
import pandas as pd
url = "https://www.atptour.com/en/stats/leaderboard?boardType=serve&timeFrame=52Week&surface=all&versusRank=all&formerNo1=false"
request = requests.get(url).text
soup = BeautifulSoup(request, 'lxml')
divs = soup.find_all('tbody', id='leaderboardTable')
print(divs)
However, the only output of this is the empty table body, [<tbody id="leaderboardTable"></tbody>].
How do I access the rest of the HTML? It appears not to be there when I search through the soup. I have also attached an image of the HTML I am seeking to access. Any help is appreciated. Thank you!
There is an ajax request that fetches that data, but it's protected by Cloudflare. There is a package (cloudscraper) that can bypass that kind of protection, but it doesn't seem to work for this site.
What you'd need to do now is use something like Selenium to let the page render first, then pull the data.
from selenium import webdriver
import pandas as pd

browser = webdriver.Chrome('C:/chromedriver_win32/chromedriver.exe')
browser.get("https://www.atptour.com/en/stats/leaderboard?boardType=serve&timeFrame=52Week&surface=all&versusRank=all&formerNo1=false")

# read the rendered page source; pandas parses the now-populated table
df = pd.read_html(browser.page_source, header=0)[0]
browser.close()
Your code is working as expected. The HTML you are parsing does not have any data under the table.
$ wget https://www.atptour.com/en/stats/leaderboard\?boardType\=serve\&timeFrame\=52Week\&surface\=all\&versusRank\=all\&formerNo1\=false -O page.html
$ grep -C 3 'leaderboardTable' page.html
class="stat-listing-table-content no-pagination">
<table class="stats-listing-table">
<!-- TODO: This table head will only appear on DESKTOP-->
<thead id="leaderboardTableHeader" class="leaderboard-table-header">
</thead>
<tbody id="leaderboardTable"></tbody>
</table>
</div>
You have shown a screenshot of the developer view that does contain the data. I would guess that there is JavaScript that modifies the HTML after it is loaded and puts in the rows. Your browser is able to run this JavaScript, and hence you see the rows. requests of course doesn't run any scripts; it only downloads the HTML.
You can do "save as" in your browser to get the resulting HTML, or you will have to use a more advanced web module such as Selenium that can run scripts.
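As a sketch of that Selenium approach with an explicit wait (Selenium 4 style; the wait condition simply checks that rows have appeared in the tbody the grep output showed as empty):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd

url = ("https://www.atptour.com/en/stats/leaderboard?boardType=serve"
       "&timeFrame=52Week&surface=all&versusRank=all&formerNo1=false")

browser = webdriver.Chrome()
try:
    browser.get(url)
    # wait (up to 15 s) until the script has injected rows into the empty <tbody>
    WebDriverWait(browser, 15).until(
        lambda d: d.find_elements(By.CSS_SELECTOR, "#leaderboardTable tr")
    )
    df = pd.read_html(browser.page_source, header=0)[0]
    print(df.head())
finally:
    browser.quit()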
I have code trying to pull all the HTML within the tracklist container, which should have 88 songs. The information is definitely there (I printed the soup to check), so I'm not sure why everything after the first 30 react-contextmenu-wrapper elements is lost.
from bs4 import BeautifulSoup
from urllib.request import urlopen

spotify = 'https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP'
html = urlopen(spotify)
soup = BeautifulSoup(html, "html5lib")
main = soup.find(class_='tracklist-container')
print(main)
Thank you for the help.
Current output from printing is as follows:
1.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Move On - Teen Daze Remix</span><span class="artists-albums"><span dir="auto">Garden City Movement</span> • <span dir="auto">Entertainment</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">5:11</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="2" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
2.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Flicker</span><span class="artists-albums"><span dir="auto">Forhill</span> • <span dir="auto">Flicker</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">3:45</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="3" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
...
30.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Trapdoor</span><span class="artists-albums"><span dir="auto">Eagle Eyed Tiger</span> • <span dir="auto">Future or Past</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">4:14</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li></ol><button class="link js-action-button" data-track-type="view-all-button">View all on Spotify</button></div>
Last entry should be the 88th. It just feels like my search results got truncated.
It is all there in the response, just within a script tag.
You can see the start of the relevant JavaScript object (Spotify.Entity = ...) in the page source.
I would regex out the required string and parse it with the json library.
Py:
import requests, re, json

r = requests.get('https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP')

# capture the object assigned to Spotify.Entity inside the script tag
p = re.compile(r'Spotify\.Entity = (.*?);')
data = json.loads(p.findall(r.text)[0])
print(len(data['tracks']['items']))
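From there, pulling out each track is straightforward. The item['track']['name'] key layout is an assumption about the Spotify.Entity payload, so print one item first to confirm:

for item in data['tracks']['items']:
    track = item.get('track', item)  # some payloads nest the track object
    print(track['name'])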
Since it seemed you were on the right track, I did not try to solve the full problem and instead tried to provide a hint that could be helpful: do dynamic web scraping.
"Why Selenium? Isn’t Beautiful Soup enough?
Web scraping with Python often requires no more than the use of the Beautiful Soup to reach the goal. Beautiful Soup is a very powerful library that makes web scraping by traversing the DOM (document object model) easier to implement. But it does only static scraping. Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in “view page source”, and then you slice and dice it. If the data you are looking for is available in “view page source” only, you don’t need to go any further. But if you need data that are present in components which get rendered on clicking JavaScript links, dynamic scraping comes to the rescue. The combination of Beautiful Soup and Selenium will do the job of dynamic scraping. Selenium automates web browser interaction from python. Hence the data rendered by JavaScript links can be made available by automating the button clicks with Selenium and then can be extracted by Beautiful Soup."
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
Here is what I see at the end of the 30 songs in the DOM which refers to a button:
</li>
</ol>
<button class="link js-action-button" data-track-type="view-all-button">
View all on Spotify
</button>
</div>
It's because you're doing
main = soup.find(class_='tracklist-container')
and the class "tracklist-container" only holds these 30 items.
I'm not sure what you're trying to accomplish, but if you want what comes after, try parsing the class that follows it.
In other words, the class contains 30 songs; I visited the site and also found only 30, so the full list might be available only to logged-in users.
I want to scrape the company info from this page.
The div related to the data is div class="col-xs-12 col-md-6 col-lg-6", but when I run the following code to extract all classes, this class is not available:
import requests
from bs4 import BeautifulSoup
page = requests.get("http://gyeonquartz.com/distributors-detailers/")
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.prettify())
When we inspect the web source, all dealers' details are given under the div class="col-xs-12 col-md-6 col-lg-6", but in the parsed output there is no such div.
The data you want to scrape is populated after the page loads, through an ajax request. When you make a request with the Python Requests library, you only get the initial page HTML.
You have 2 options.
Use Selenium (or alternatives such as requests-html) to render the JavaScript-loaded contents.
Directly make the ajax request and get the JSON response. You can find it by using the Network tab of your browser's inspect tool.
The second option in this case is as follows.
import requests

page = requests.get("http://gyeonquartz.com/wp-admin/admin-ajax.php?action=gyeon_load_partners")
print(page.json())
This will output a very long JSON response. I have converted it into a DataFrame to view it better.
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get("http://gyeonquartz.com/wp-admin/admin-ajax.php?action=gyeon_load_partners")
df = pd.DataFrame.from_dict(page.json())

# each address field contains embedded HTML; strip the tags and line breaks
df['address'] = [BeautifulSoup(text, 'html.parser').get_text().replace("\r\n", "") for text in df['address']]
print(df)  # just use df if in a Jupyter notebook
If you look at the page source, you'll see that none of the div tags you are looking for exist within it. Because requests only makes the initial request and does not load any dynamic content rendered by JavaScript, the tags you are looking for are not contained in the returned HTML.
To get the dynamic content, you would instead need to mimic whatever requests the page is making (for example with a curl request) or load the page in a headless browser (like Selenium). The problem is not with the parser but with the content.
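As a sketch of the headless-browser route (standard Chrome options; the target div class is taken from the question):

from selenium import webdriver
from bs4 import BeautifulSoup

options = webdriver.ChromeOptions()
options.add_argument("--headless")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
try:
    driver.get("http://gyeonquartz.com/distributors-detailers/")
    # hand the fully rendered HTML to BeautifulSoup
    soup = BeautifulSoup(driver.page_source, "html.parser")
    for div in soup.find_all("div", class_="col-xs-12 col-md-6 col-lg-6"):
        print(div.get_text(" ", strip=True))
finally:
    driver.quit()

If the ajax call is slow, you may also need an explicit WebDriverWait before reading page_source.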
Very similar to the solution for How to use requests or other module to get data from a page where the url doesn't change?
Pretty new to Python... and I'm trying my hand at my first project.
I've been able to replicate a few simple demos... but I think there are a few extra complexities with what I'm trying to do.
I'm trying to scrape the game logs from the NHL website.
Here is what I came up with... similar code works for the top section of the site (e.g. getting the age), but it fails on the section with display logic (dependent on whether the user clicks on Career, Game Logs, or Splits).
Thanks in advance for your help.
import urllib2  # Python 2; on Python 3 use urllib.request instead
from bs4 import BeautifulSoup

url = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')

# the game-log section is filled in by JavaScript, so nothing useful is found here
Test = soup.find_all('div', attrs={'id': "gamelogsTable"})
This happens with many web pages. It's because some of the content is downloaded by JavaScript code that is part of the initial download. By doing this, designers are able to show visitors the most important parts of a page without waiting for the entire page to download.
When you want to scrape a page, the first thing you should do is examine its source (often with Ctrl-U in a Windows environment) to see if the content you require is available. If not, you will need to use something beyond BeautifulSoup.
>>> getzlafURL = 'https://www.nhl.com/player/ryan-getzlaf-8470612?stats=gamelogs-r-nhl&season=20162017'
>>> import requests
>>> import selenium.webdriver as webdriver
>>> import lxml.html as html
>>> import lxml.html.clean as clean
>>> browser = webdriver.Chrome()
>>> browser.get(getzlafURL)
>>> content = browser.page_source
>>> cleaner = clean.Cleaner()
>>> content = cleaner.clean_html(content)
>>> doc = html.fromstring(content)
>>> type(doc)
<class 'lxml.html.HtmlElement'>
>>> open('c:/scratch/temp.htm', 'w').write(content)
775838
By searching within the file temp.htm for the heading 'Ryan Getzlaf Game Logs' I was able to find this section of HTML code. As you can see, it's about what you expected to find in the original downloaded HTML. However, this additional step is required to get at it.
</div>
</li>
</ul>
<h5 class="statistics__subheading">Ryan Getzlaf Game Logs</h5>
<div id="gamelogsTable"><div class="responsive-datatable">
I should mention that there are alternative ways of accessing such code, one of them being dryscrape. I simply can't be bothered installing that one on this Windows machine.
For a project I have to scrape data from different websites, and I'm having a problem with one of them.
When I look at the source code, the things I want are in a table, so it seems easy to scrape. But when I run my script, that part of the source doesn't show up.
Here is my code. I tried different things: at first there weren't any headers, then I added some, but it made no difference.
# import libraries
import requests
from bs4 import BeautifulSoup

# specify the url
quote_page = 'http://www.airpl.org/Pollens/pollinariums-sentinelles'

# query the website; headers must be passed with the request itself,
# not attached to the response afterwards
response = requests.get(quote_page, headers={'User-Agent': 'Mozilla/5.0'})
print(response.text)

# parse the html using BeautifulSoup and save it to a file
soup = BeautifulSoup(response.text, 'html.parser')
with open('allergene.txt', 'w', encoding='utf-8') as f:
    f.write(soup.prettify())
What I'm looking for on the website is the content after "Herbacée", whose HTML looks like:
<p class="level1">
<img src="/static/img/state-0.png" alt="pas d'émission" class="state">
Herbacee
</p>
Do you have any idea what's wrong?
Thanks for your help, and happy new year guys :)
This page uses JavaScript to render the table; the real page containing the table is:
http://www.alertepollens.org/gardens/garden/1/state/
You can find this URL in Chrome DevTools >> Network.
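So the simplest fix is to request that URL directly. A minimal sketch (the level1 class comes from the HTML in the question; whether the state page uses the same markup is an assumption, so inspect the response first):

import requests
from bs4 import BeautifulSoup

# fetch the fragment that the JavaScript would normally load
response = requests.get('http://www.alertepollens.org/gardens/garden/1/state/',
                        headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')

# each plant entry is assumed to sit in a <p class="level1"> element
for p in soup.find_all('p', class_='level1'):
    print(p.get_text(strip=True))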