I'm trying to create a basic scraper that will scrape Username and Song Title from a search on Soundcloud. By inspecting the element I needed (using Chrome), I found I needed to find the string associated with every tag 'span' with title="soundTitle__usernameText". Using BeautifulSoup, urllib2, and lxml, I have the following code for a search 'robert delong':
from lxml import html
from bs4 import BeautifulSoup
from urllib2 import urlopen
import requests
def search_results(url):
    html = urlopen(url).read()
    # html = requests.get(url) I've tried this also
    soup = BeautifulSoup(html, "lxml")
    usernames = [span.string for span in soup.find_all("span", "soundTitle__usernameText")]
    return usernames
print search_results('http://soundcloud.com/search?q=robert%20delong')
This returns an empty list. However, when I save the complete webpage on Chrome by selecting File>Save>Format-Webpage, Complete, and use that associated HTML file instead of the file obtained with urlopen, the code then prints
[u'Two Door Cinema Club', u'whatever-28', u'AWOLNATION', u'Two Door Cinema Club', u'Sean Glass', u'Capital Cities', u'Robert DeLong', u'RAC', u'JR JR']
which is the ideal outcome. To me, it appears that urlopen retrieves a somewhat truncated version of the HTML, which is why the search returns an empty list.
Any thoughts on how I may be able to access the same HTML obtained by manually saving the webpage, but using Python/Terminal? Thank you.
You guessed right: the downloaded HTML does not contain all the data. JavaScript is used to request information in JSON format, which is then inserted into the document.
By looking at the request Chrome made (ctrl+shift+i, "Network"), I see that it requested https://api-v2.soundcloud.com/search?q=robert%20delong. I believe the response to that has the information you need.
Actually, this is good for you. Reading JSON should be much more straightforward than parsing HTML ;)
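A minimal sketch of that approach (the endpoint may also require a client_id parameter, and the 'collection'/'user'/'username' field names are assumptions; inspect the actual JSON to confirm):
import requests
resp = requests.get('https://api-v2.soundcloud.com/search',
                    params={'q': 'robert delong'})  # a client_id param may be required
data = resp.json()
# field names below are assumptions; adjust after inspecting the real response
usernames = [item['user']['username']
             for item in data.get('collection', [])
             if 'user' in item]
print(usernames)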
You can use wget from the terminal to download the HTML of the webpage together with its related links and images:
wget -p --convert-links http://www.website.com/directory/webpage.html
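Here -p tells wget to also download the page requisites (images, CSS, scripts) and --convert-links rewrites the links so the saved copy works offline. Note that, like urlopen above, wget only fetches the static HTML, so content injected by JavaScript will still be missing.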
Related
I want to build small tool to help a family member download podcasts off a site.
In order to get the links to the files I first need to filter them out (with bs4 + python3).
The files are on this website (Estonian): Download Page "Laadi alla" = "Download"
So far my code is as follows:
(most of it is from examples on stackoverflow)
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print ("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken HTML, and bs4 / the parser is not able to find anything else.
I've tried different parsers, but the result doesn't change.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: this is a public radio station and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.
Please be aware that the listings on the page are served through an API. So instead of requesting the HTML page, I suggest you request the API link, which has 200 .mp3 links.
Please follow the below steps:
Request the API link, not the HTML page link
Check the response; it's JSON, so extract the fields you need
Help your family, any time :)
Solution
import requests, json
myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)
all_mp3 = {}
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']
all_mp3 is what you need. all_mp3 is a dictionary with download urls as keys and mp3 names as the values.
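If you then want the links in a plain text file, as the question describes, a short follow-up sketch (the file name is just an example):
with open('podcast_links.txt', 'w') as f:
    for url in all_mp3:
        f.write(url + '\n')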
I am using python 2.7 and version 4.5.1 of Beautiful Soup
I'm at my wits' end trying to make this very simple script work. My goal is to get the online availability status of the NES console from Best Buy's website by parsing the HTML of the product page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, it seems that you are looking for a tag which is generated client side by JavaScript; it shows up using 'inspect' on the loaded page, but not when viewing the page source, which is what the call to requests is pulling back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup you'll see it probably returns something like Access Denied. This is because Best Buy requires an acceptable User-Agent on the GET request. Since you do not specify a User-Agent header, the site does not return the real page.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
or you could look up the User-Agent your own browser sends when you view the webpage
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
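A minimal sketch of the same request with a browser-like header (the exact User-Agent string below is only an example; any current browser string should work):
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36'}
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
page = requests.get(url, headers=headers)
print(page.status_code)  # should no longer be the access-denied page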
Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson
sku = 10488665 # look at the URL of the web page, it is <blablah>//10488665.aspx
# change locations to get the right store
response = urllib.urlopen('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus=%s&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3'%sku)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']
import requests
from lxml import html
page = requests.get('http://www.cnn.com')
html_content = html.fromstring(page.content)
for i in html_content.iterchildren():
    print i
news_stories = html_content.xpath('//h2[@data-analytics]/a/span/text()')
news_links = html_content.xpath('//h2[@data-analytics]/a/@href')
I am trying to run this code to understand how web scraping in Python works.
I want to scrape the top news stories and their links from CNN.
When I run this in the Python shell, the output I get for news_stories and news_links is:
[]
My question is: where am I going wrong with this, and is there a better way to achieve what I am trying to do?
In your code, html_content is returning only the page address and not the actual content of the page.
html_content = html.fromstring(page.content)
You can try printing the following to see the complete HTML code for that page:
import requests
from lxml import html
page = requests.get('http://www.cnn.com')
print page.text
Even if you do get the content somehow, you may receive it as a gzipped response from the server. (Get html using Python requests?)
I would highly recommend using the httplib2 library and BeautifulSoup to scrape news stories from CNN. They are really handy and will get you what you want. You can see another Stack Overflow post on this here (retrieve links from web page using python and BeautifulSoup).
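A minimal sketch of that httplib2 + BeautifulSoup combination (the h2[data-analytics] selector mirrors the XPath in the question and is an assumption about CNN's markup at the time):
import httplib2
from bs4 import BeautifulSoup
http = httplib2.Http()
resp, content = http.request('http://www.cnn.com')
soup = BeautifulSoup(content, 'html.parser')
# collect (headline, link) pairs from the assumed h2 > a structure
stories = [(a.get_text(), a.get('href'))
           for a in soup.select('h2[data-analytics] a')]
print(stories)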
I hope that helps you.
I'm getting started with web scraping, and I would like to get the URLs from the page provided below.
import requests
from bs4 import BeautifulSoup as Soup
page = "http://www.zillow.com/homes/for_sale/fore_lt/2-_beds/any_days/globalrelevanceex_sort/57.610107,-65.170899,15.707662,-128.452149_rect/3_zm/"
response = requests.get(page)
soup = Soup(response.text, 'html.parser')
Now I have all the info of the page in the soup content, and I would like to get the URLs of all the homes shown in the image.
When I inspect any of the homes, Chrome opens this DOM element in the image:
How would I get the link inside the <a href=""> tag using the soup? I think the parent is <div id="lis-results">, but I need a way to navigate to the element. Actually, I need all of the URLs (391,479 of them) in a text file.
Zillow has an API and also a Python wrapper for the convenience of this kind of data job, and I'm looking at the code now. All I need are the URLs for the FOR SALE -> Foreclosures and POTENTIAL LISTING -> Foreclosed and Pre-foreclosed information.
The issue is that the request you send does not return the URLs. In fact, if I look at the response (using e.g. Jupyter), the listing URLs are not in it.
I would suggest a different strategy: these kinds of websites often communicate via JSON files.
From the Network tab of Web Developer in Firefox you can find the URL to request the json file:
Now, with this file you can get all the information needed.
import json
import requests
from bs4 import BeautifulSoup as Soup
page = "http://www.zillow.com/search/GetResults.htm?spt=homes&status=110001&lt=001000&ht=111111&pr=,&mp=,&bd=2%2C&ba=0%2C&sf=,&lot=,&yr=,&pho=0&pets=0&parking=0&laundry=0&income-restricted=0&pnd=0&red=0&zso=0&days=any&ds=all&pmf=1&pf=1&zoom=3&rect=-134340820,16594081,-56469727,54952386&p=1&sort=globalrelevanceex&search=maplist&disp=1&listright=true&isMapSearch=true&zoom=3"
response = requests.get(page)  # request the json file
json_response = json.loads(response.text)  # parse the json file
soup = Soup(json_response['list']['listHTML'], 'html.parser')
and the soup has what you are looking for. If you explore the json, you will find a lot of useful information.
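For instance, a quick way to see what else is there (only 'list' and 'listHTML' are confirmed by the code above; the other keys are yours to explore):
print(json_response.keys())          # top-level fields of the JSON
print(json_response['list'].keys())  # fields that sit next to 'listHTML'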
The list of all the URLs can be found with:
links = [i.attrs['href'] for i in soup.findAll("a",{"class":"hdp-link"})]
Every URL appears twice. If you want them to be unique, you can fix the list afterwards, or instead look for the class "hdp-link routable" above.
But I always prefer more rather than less!
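If you do want them unique, one order-preserving sketch that works on Python 2.7 and 3:
from collections import OrderedDict
unique_links = list(OrderedDict.fromkeys(links))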
I am trying to get video links from 'https://www.youtube.com/trendsdashboard#loc0=ind'. When I inspect elements, the browser displays the HTML source for each video. But in the source code retrieved using
urllib2.urlopen("https://www.youtube.com/trendsdashboard#loc0=ind").read()
the HTML source for the videos does not appear. Is there any other way to do this?
<a href="/watch?v=dCdvyFkctOo" alt="Flipkart Wish Chain">
<img src="//i.ytimg.com/vi/dCdvyFkctOo/hqdefault.jpg" alt="Flipkart Wish Chain">
</a>
This simple code appears when we inspect elements in the browser, but not in the source code retrieved by urllib.
To view the source code you need to use the read method.
If you just call urlopen, it gives you something like this:
In [12]: urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind')
Out[12]: <addinfourl at 3054207052L whose fp = <socket._fileobject object at 0xb60a6f2c>>
To see the source, use read:
urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind').read()
Whenever you compare source code between Python and a web browser, don't do it through Inspect Element; right-click on the webpage and click View Source, and then you will find the actual source. Inspect Element displays the aggregated DOM built from however many network requests were made, plus the result of any JavaScript that has executed.
Keep the developer console open before opening the webpage, stay on the Network tab, and make sure 'Preserve Log' is enabled in Chrome (or 'Persist' in Firebug for Firefox); then you will see all the network requests made.
works for me...
import urllib2
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
html = urllib2.urlopen(url).read()
IMO I'd use requests instead of urllib - it's a bit easier to use:
import requests
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
response = requests.get(url)
html = response.content
Edit
This will get you a list of all <a></a> tags with hyperlinks, as per your edit. I use the BeautifulSoup library to parse the HTML:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
links = [tag for tag in soup.findAll('a') if tag.has_attr('href')]
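To pull just the URLs out of those tags (every tag in links is already known to have an href):
hrefs = [tag['href'] for tag in links]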
We also need to decode the data to UTF-8.
Here is the code; just use:
html = response.content.decode('utf-8')
print(html)