I have code that tries to pull all the HTML within the tracklist container, which should hold 88 songs. The information is definitely there (I printed the soup to check), so I'm not sure why everything after the first 30 react-contextmenu-wrapper elements is lost.
from bs4 import BeautifulSoup
from urllib.request import urlopen

spotify = 'https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP'
html = urlopen(spotify)
soup = BeautifulSoup(html, "html5lib")

# Grab the element that wraps the track rows
main = soup.find(class_='tracklist-container')
print(main)
Thank you for the help.
Current output from printing is as follows:
1.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Move On - Teen Daze Remix</span><span class="artists-albums"><span dir="auto">Garden City Movement</span> • <span dir="auto">Entertainment</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">5:11</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="2" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
2.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Flicker</span><span class="artists-albums"><span dir="auto">Forhill</span> • <span dir="auto">Flicker</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">3:45</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li><li class="tracklist-row js-track-row tracklist-row--track track-has-preview" data-position="3" role="button" tabindex="0"><div class="tracklist-col position-outer"><div class="play-pause top-align"><svg aria-label="Play" class="svg-play" role="button"><use xlink:href="#icon-play" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg><svg aria-label="Pause" class="svg-pause" role="button"><use xlink:href="#icon-pause" xmlns:xlink="http://www.w3.org/1999/xlink"></use></svg></div><div class="tracklist-col__track-number position top-align">
...
30.
</div></div><div class="tracklist-col name"><div class="top-align track-name-wrapper"><span class="track-name" dir="auto">Trapdoor</span><span class="artists-albums"><span dir="auto">Eagle Eyed Tiger</span> • <span dir="auto">Future or Past</span></span></div></div><div class="tracklist-col explicit"></div><div class="tracklist-col duration"><div class="top-align"><span class="total-duration">4:14</span><span class="preview-duration">0:30</span></div></div><div class="progress-bar-outer"><div class="progress-bar"></div></div></li></ol><button class="link js-action-button" data-track-type="view-all-button">View all on Spotify</button></div>
The last entry should be the 88th; it feels like my search results got truncated.
It is all there in the response, just inside a script tag: the page embeds the relevant JavaScript object as Spotify.Entity. I would regex out the required string and parse it with the json library.
Python:
import requests, re, json

r = requests.get('https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP')
# Pull the JSON assigned to Spotify.Entity out of the inline script
p = re.compile(r'Spotify\.Entity = (.*?);')
data = json.loads(p.findall(r.text)[0])
print(len(data['tracks']['items']))
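For what it's worth, here is a minimal sketch of pulling the track names out of that parsed JSON. It assumes the entity follows the usual Spotify playlist shape, where each element of data['tracks']['items'] carries a 'track' dict with a 'name' key; dump data yourself to confirm before relying on it.
# Sketch: list track names, assuming the standard playlist shape
# (tracks -> items -> track -> name); verify against your own data dump
for item in data['tracks']['items']:
    track = item.get('track', {})
    print(track.get('name', '<unknown>'))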
Since it seemed you were on the right track, I did not try to solve the full problem; instead I tried to provide a hint that could be helpful: do dynamic web scraping.
"Why Selenium? Isn’t Beautiful Soup enough?
Web scraping with Python often requires no more than the use of the Beautiful Soup to reach the goal. Beautiful Soup is a very powerful library that makes web scraping by traversing the DOM (document object model) easier to implement. But it does only static scraping. Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in “view page source”, and then you slice and dice it. If the data you are looking for is available in “view page source” only, you don’t need to go any further. But if you need data that are present in components which get rendered on clicking JavaScript links, dynamic scraping comes to the rescue. The combination of Beautiful Soup and Selenium will do the job of dynamic scraping. Selenium automates web browser interaction from python. Hence the data rendered by JavaScript links can be made available by automating the button clicks with Selenium and then can be extracted by Beautiful Soup."
https://medium.com/ymedialabs-innovation/web-scraping-using-beautiful-soup-and-selenium-for-dynamic-page-2f8ad15efe25
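To make the quoted approach concrete, here is a minimal sketch of the Selenium-plus-Beautiful-Soup pattern, assuming a local chromedriver is installed and on your PATH. Whether all 88 rows appear without scrolling is something to check, since lazy-loaded lists sometimes need the scrolling automated too.
from bs4 import BeautifulSoup
from selenium import webdriver

# Let a real browser render the page so any JavaScript runs first
driver = webdriver.Chrome()
driver.get('https://open.spotify.com/playlist/3vSFv2hZICtgyBYYK6zqrP')
html = driver.page_source  # HTML after JavaScript has executed
driver.quit()

# Hand the rendered HTML to Beautiful Soup as usual
soup = BeautifulSoup(html, 'html5lib')
main = soup.find(class_='tracklist-container')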
Here is what I see at the end of the 30 songs in the DOM, referring to a button:
</li>
</ol>
<button class="link js-action-button" data-track-type="view-all-button">
View all on Spotify
</button>
</div>
It's because you're doing
main = soup.find(class_ = 'tracklist-container')
the class "tracklist-container" only holds these 30 items,
i'm not sure what you're trying to accomplish, but if you want
what's afterwards try parsing the class afterwards.
in other words, the class contains 30 songs, i visited the site and found 30 songs so it might be only for logged in users.
Related
I'm a complete beginner with web scraping and programming in Python. The answer might be somewhere on the forum, but I'm so new that I don't really know what to look for, so I hope you can help me.
Last week I completed a three-day course in web scraping with Python, and at the moment I'm trying to brush up on what I've learned so far.
I'm trying to scrape a specific link from a website, so that I can later create a loop that extracts all the other links. But I can't seem to extract any link, even though they are visible in the HTML code.
The website (in Danish) is the one fetched in the code below. The link I'm trying to extract is located in this HTML code:
<a class="nav-action-arrow-underlined" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/" aria-label="Læs mere om Regionen tilbød ikke"\>Læs mere\</a\>
Here is the Python code I've tried so far:
url = "https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/"
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html.parser")
a_tags = soup.find_all("a") len(a_tags)
#there is 34
I've then tried going through all the a tags from 0 to 33 without finding the link.
If I print a_tags[26], I get this code:
<a aria-current="page" class="nav-action is-current" href="/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/"\>Afgørelser fra Styrelsen for Patientklager\</a\>
That is somewhere at the top of the website, but the next one, a_tags[27], is an element from the bottom of the site:
<a class="footer-linkedin" href="``https://www.linkedin.com/company/styrelsen-for-patientklager/``" rel="noopener" target="_blank" title="``https://www.linkedin.com/company/styrelsen-for-patientklager/``"><span class="sr-only">Linkedin profil</span></a>
Can anyone tell me how to access the specific part of the HTML code that contains the link?
Once I find out how to pull out the link, my plan is to do the following:
path = "/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/"
full_url = f"htps://stpk.dk{path}"
print(full_url)
You will not find what you are looking for, because requests does not render websites the way a browser does. But no worries, there is an alternative.
The content is loaded dynamically via an API, so you should call that directly, and you will get JSON that contains the displayed information.
To find such endpoints, take a closer look at your browser's developer tools and check the XHR requests tab. It may take a minute to read and follow the topic:
https://developer.mozilla.org/en-US/docs/Glossary/XHR_(XMLHttpRequest)
Simply iterate over the items, extract the url value, and prepend the base_url.
Adjust the following query parameters to your needs:
containerKey: a76f4a50-6106-4128-bc09-a1da7695902b
query:
year:
category:
legalTheme:
specialty:
profession:
treatmentPlace:
critiqueType:
take: 200
skip: 0
Example
import requests

url = 'https://stpk.dk/api/verdicts/settlements/?containerKey=a76f4a50-6106-4128-bc09-a1da7695902b&query=&year=&category=&legalTheme=&specialty=&profession=&treatmentPlace=&critiqueType=&take=200&skip=0'
base_url = 'https://stpk.dk'

# Each item carries a relative url; prepend the base to get a full link
for e in requests.get(url).json()['items']:
    print(base_url + e['url'])
Output
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp108/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp107/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp106/
https://stpk.dk/afgorelser-og-domme/afgorelser-fra-styrelsen-for-patientklager/22sfp105/
...
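If the endpoint ever holds more than 200 results, the take/skip parameters suggest plain offset pagination. A hedged sketch of paging through everything, assuming the API honors skip the way the parameter name implies (verify in the network tab first):
import requests

base_url = 'https://stpk.dk'
api = 'https://stpk.dk/api/verdicts/settlements/'
params = {'containerKey': 'a76f4a50-6106-4128-bc09-a1da7695902b', 'take': 200, 'skip': 0}

links = []
while True:
    items = requests.get(api, params=params).json()['items']
    if not items:
        break  # an empty page means we are past the end (assumption)
    links.extend(base_url + e['url'] for e in items)
    params['skip'] += params['take']  # advance the offset

print(len(links))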
I tried this code, but it isn't working right (it doesn't extract from all the sites, among many other issues). Need help!
from bs4 import BeautifulSoup
import re
import requests

allsite = ["https://www.ionixxtech.com/", "https://sumatosoft.com", "https://4irelabs.com/", "https://www.leewayhertz.com/",
           "https://stackoverflow.com", "https://www.vardot.com/en", "http://www.clickjordan.net/", "https://vtechbd.com/"]

emails = []
tels = []
for l in allsite:
    r = requests.get(l)
    soup = BeautifulSoup(r.content, "html.parser")
    # Collect anchors whose href starts with mailto: or tel:
    for link in soup.find_all('a', attrs={'href': re.compile("^mailto:")}):
        emails.append(link.get('href'))
    for tel in soup.find_all('a', attrs={'href': re.compile("^tel:")}):
        tels.append(tel.get('href'))

print(emails)
print(tels)
This is neither a regex nor an HTML parsing issue. Print out r.content and you will notice (e.g. for https://vtechbd.com/) that the actual HTML source you are parsing isn't the same as the one rendered by your browser when you access the site.
<!-- Contact Page -->
<section class="content hide" id="contact">
<h1>Contact</h1>
<h5>Get in touch.</h5>
<p>Email: <span class="__cf_email__" data-cfemail="2e474048416e585a4b4d464c4a004d4143">[email protected]</span><br />
So I assume the information you are interested in is loaded dynamically by some JavaScript. Python's requests library is an HTTP client, not a web scraper.
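As an aside, the __cf_email__ span in that snippet is Cloudflare's email obfuscation: data-cfemail is a hex string whose first byte is an XOR key for the rest. A small sketch of the standard decoding, applied to the value from the snippet above:
def decode_cfemail(cfemail):
    # First hex byte is the XOR key; the remaining bytes encode the address
    key = int(cfemail[:2], 16)
    return ''.join(chr(int(cfemail[i:i + 2], 16) ^ key)
                   for i in range(2, len(cfemail), 2))

print(decode_cfemail('2e474048416e585a4b4d464c4a004d4143'))  # info@vtechbd.com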
...also, it's not cool to ask people to debug your code because it's 5 pm, you want to get out of the office, and you hope somebody will have solved your issue by tomorrow morning... I may be wrong, but the way your question is asked leaves me with the impression you spent about two minutes pasting your source code in...
I have been stuck on this for a while... I am trying to scrape the player name and projection from this site: https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793
The script is going to loop through past pages by going through all the PIDs in a range, but that isn't the problem. The main problem is that when I inspect the element, I find the value is stored within this class:
<div class="salarybox expanded"...
which is located in the 5th position of my projectionsView list.
The scraper finds the projectionsView class fine but can't find anything within it.
When I go to view the actual HTML of the site, it seems this content just doesn't exist within it:
<div id="salData" class="projectionsView">
<!-- Fill in with Salary Data -->
</div>
I'm super new to scraping and have successfully scraped everything else I need for my project, just not this damn site... I think it may be because I have to sign up for the site? Either way, the information is viewable without signing in, so I figured I didn't need to use Selenium, and even if I did, I don't think that would find it.
Anyway, here's the code I have so far, which is obviously returning a blank list.
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq

url = "https://www.fantasysportsco.com/Projections/Sport/MLB/Site/DraftKings/PID/793"
uClient = uReq(url)
page_read = uClient.read()
uClient.close()

page_soup = soup(page_read, "html.parser")
# Grab every projectionsView div, then look inside the 5th one
salarybox = page_soup.findAll("div", {"class": "projectionsView"})
print(salarybox[4].findAll("div", {"class": "salarybox expanded"}))
Any ideas would be greatly appreciated!
The whole idea of the script is to just find the ppText of each "salarybox expanded" class on each page. I just want to know how to find these elements. Perhaps a different parser?
On your URL's page, the <div id="salData" class="projectionsView"> is rewritten by JavaScript, but urllib.request only returns the raw server response, which means the JavaScript-generated content will not be in it. Hence the div is empty:
<div id="salData" class="projectionsView">
<!-- Fill in with Salary Data -->
</div>
You'd better try Selenium; Splash will also work for this kind of dynamic website.
By the way, after you get the right response, select the div by id, which is more specific:
salarybox = page_soup.find("div",{"id":"salData"})
I am trying to change the content of a <p> tag in an HTML DOM using bs4. I have successfully selected the particular tag and updated it, but how do I save it back to the HTML file?
Python code
import bs4

exampleFile = open('stefan/index.html')
exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
exampleFile.close()

test = "testing testing"
elems = exampleSoup.select('#page_main_text')
print(elems[0].getText())

# Replace the text of the first <p> tag
tag = exampleSoup.p
tag.string = test
print(tag)
HTML
<p class="grey" id="page_main_text">It was an awesome experience to grow up with you. I can recall all the memories that we shared together. We laughed together and cried together. All these things are to remind you – happy birthday.</p>
I'm currently experimenting with Beautiful Soup 4 in Python 2.7.6
Right now, I have a simple script to scrape Soundcloud.com. I'm trying to print out the number of button tags on the page, but I'm not getting the answer I expect.
from bs4 import BeautifulSoup
import requests
page = requests.get('http://soundcloud.com/sondersc/waterfalls-sonder')
data = page.text
soup = BeautifulSoup(data)
buttons = soup.findAll('button')
print 'num buttons =', len(buttons)
When I run this, I get the output
num buttons = 0
This confuses me. I know for a fact that the button tags exist on this page so it shouldn't be returning 0. Upon inspecting the button elements directly underneath the waveform, I find these...
<button class="sc-button sc-button-like sc-button-medium sc-button-responsive" tabindex="0" title="Like">Like</button>
<button class="sc-button sc-button-medium sc-button-responsive sc-button-addtoset" tabindex="0" title="Add to playlist">Add to playlist</button>
<button class="sc-button sc-button-medium sc-button-responsive sc-button-addtogroup" tabindex="0" title="Add to group">Add to group</button>
<button class="sc-button sc-button-share sc-button-medium sc-button-responsive" title="Share" tabindex="0">Share</button>
At first I thought that the way I was trying to find the button elements was incorrect. However, if I modify my code to scrape an arbitrary YouTube page...
page = requests.get('http://www.youtube.com/watch?v=UiyDmqO59QE')
then I get the output
num buttons = 37
So that means soup.findAll('button') is doing what it's supposed to, just not on SoundCloud.
I've also tried specifying the exact button I want, expecting to get a return result of 1
buttons = soup.findAll('button', class_='sc-button sc-button-like sc-button-medium sc-button-responsive')
print 'num buttons =', len(buttons)
but it still returns 0.
I'm kind of stumped on this one. Can anyone explain why this is?
The reason you cannot find any buttons is that there are no button tags inside the HTML you are getting:
>>> import requests
>>> page = requests.get('http://soundcloud.com/sondersc/waterfalls-sonder')
>>> data = page.text
>>> '<button' in data
False
This means that there is much more involved in forming the page: AJAX requests, JavaScript function calls, etc.
Also, note that SoundCloud provides an API, so there is no need to crawl the site's HTML pages. There is also a Python wrapper around the SoundCloud API available.
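For illustration, a sketch of resolving a track through that wrapper. The client_id is a placeholder you would need to register for, and the Client/resolve calls follow the historical soundcloud package's README; treat the exact surface as an assumption and check the wrapper's docs.
import soundcloud

# Placeholder: register an app with SoundCloud to get a real client_id
client = soundcloud.Client(client_id='YOUR_CLIENT_ID')

# Resolve the track page URL to an API resource and read its metadata
track = client.get('/resolve', url='http://soundcloud.com/sondersc/waterfalls-sonder')
print(track.title)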
Also, be careful about web scraping; study the Terms of Use:
You must not employ scraping or similar techniques to aggregate,
repurpose, republish or otherwise make use of any Content.