I want to make a web crawler that can fetch .mp3 links from a website. However, the content is dynamic on the example site http://raagtune.net/ . On this site the content is fetched from "http://raag.me//music/data/.. some random .mp3". How can we crawl the entire site and get all MP3 links using Scrapy in Python? Or are there any alternative methods?
You can use the Beautiful Soup package for Python.
import httplib2
from bs4 import BeautifulSoup, SoupStrainer
http = httplib2.Http()
status, response = http.request('http://www.nytimes.com')
# Parse only the <a> tags and keep those that carry an href attribute.
soup = BeautifulSoup(response, 'html.parser', parse_only=SoupStrainer('a'))
for link in soup.find_all('a', href=True):
    print(link['href'])
Then check whether each URL contains .mp3. If it does, store it in a list; otherwise continue.
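The check described above could be sketched roughly like this. It is a minimal example using only the standard library; the helper name `is_mp3` and the sample hrefs are made up for illustration. It tests the path of each URL rather than doing a plain substring match, so a query string after the file name does not get in the way. Note that it can only see links that are present in the fetched HTML, not ones added later by JavaScript.

```python
from urllib.parse import urlparse


def is_mp3(href):
    """Return True if the path part of href ends in .mp3 (case-insensitive)."""
    return urlparse(href).path.lower().endswith(".mp3")


# Hypothetical hrefs, as they might come out of link['href'] in the loop above.
hrefs = ["/music/data/song1.mp3", "about.html", "http://example.com/a/B.MP3?dl=1"]

# Store the matching links in a list and skip everything else.
mp3_links = [href for href in hrefs if is_mp3(href)]
print(mp3_links)
```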
Hope this helps !
Link I want to scrape: https://digital.auraria.edu/work/ns/8fb66c05-0ad2-4e56-8cc7-6ced34d0c126
I'm currently having some trouble scraping the "Download" button on this website to download the PDF file using Python and Beautiful Soup. Normally there's a link and I can just do
soup = BeautifulSoup(r.content, 'lxml')
links = soup.find_all("a")
for link in links:
    if 'pdf' in link.get('href'):  # find if the book pdf link is in there
        i += 1
        response = requests.get(link.get('href'))
        print(f"Retrieving PDF for: {title}")
        write_pdf(pdf_path, response.content)
However, I'm not quite sure what the link for the PDF is here. Do I have to use a headless browser, and if so, how would I extract this link?
The way I found the PDF link is by going to the page and looking at the page source. Then I used the finder tool and searched for PDF and found a meta tag.
<meta name="citation_pdf_url" content="https://dashboard.digital.auraria.edu/downloads/1e0b44c6-cd79-49a3-9eac-0b10d1a4392e"/>
I followed the link and it downloaded a PDF with the same title. With the code below, you can get the entire tag, or just its contents by adding .attrs.get('content') at the end.
Required -> pip install bs4 requests
from bs4 import BeautifulSoup
import requests
url = "https://digital.auraria.edu/work/ns/8fb66c05-0ad2-4e56-8cc7-6ced34d0c126"
req = requests.get(url)
soup = BeautifulSoup(req.content, 'html.parser')
pdf_link = soup.find("meta", attrs={'name': "citation_pdf_url"}).attrs.get('content')
print(pdf_link)
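One caveat with the snippet above: if a page has no citation_pdf_url meta tag, soup.find returns None and the chained .attrs call raises an AttributeError. Below is a small sketch of the same lookup that returns None instead. It deliberately swaps Beautiful Soup for the standard library's html.parser so that it runs without extra installs; the names `CitationPdfFinder` and `find_pdf_url` are made up for this example.

```python
from html.parser import HTMLParser


class CitationPdfFinder(HTMLParser):
    """Remembers the content of <meta name="citation_pdf_url">, if present."""

    def __init__(self):
        super().__init__()
        self.pdf_url = None

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "citation_pdf_url":
            self.pdf_url = attrs.get("content")


def find_pdf_url(html):
    """Return the citation_pdf_url value, or None if the tag is missing."""
    finder = CitationPdfFinder()
    finder.feed(html)
    return finder.pdf_url


tag = '<meta name="citation_pdf_url" content="https://dashboard.digital.auraria.edu/downloads/1e0b44c6-cd79-49a3-9eac-0b10d1a4392e"/>'
print(find_pdf_url(tag))
print(find_pdf_url("<html><head></head></html>"))
```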
Good luck and let me know if you face any other issues!
Just scrape the filename and append it to the link below. I got the link by downloading the file, copying its download address, removing the file name, and adding a different one to test it. It works like a charm. https://storage.googleapis.com/us-fedora/fcrepo.binary.directory/ea/05/2a/ea052a597af18fb6a46c44254e9e596a7e93571f?GoogleAccessId=k8s-storage%40repositories-307112.iam.gserviceaccount.com&Expires=1643411246&Signature=ICo51cFbe3By7JPJol8nLfxcic%2BV%2Bv1uvGYjodATCXJc2I6XWSi7JWC8l%2BM6BTSVFOL8A0YioZOQggY8Afc0JJtiwInkxFHmVjleQ41he3RK5pwF4IwONeuQxcgUXYzd8p94sA5L0YZC6drAFb9mx4AJLwTdKQt7dZh146FmaQYY8ElGT6BpHX2t%2BK31UGP0pC75uFGUq6b3IDK11gPOCSvnrLGSAM1yulE8togDgZmw0BU77nLPkinXSIATCTjlHNxf5aUxlJkg0%2FtSM21b53JFvHGHHCQf8QSKtST4WCBA1up6BVX1YLbGLZXxQ07mf8K7jnQ4U%2FXfnw6IoTpQxw%3D%3D&response-content-disposition=attachment%3B%20filename%3D
So, according to your example, the link would look like this: https://storage.googleapis.com/us-fedora/fcrepo.binary.directory/ea/05/2a/ea052a597af18fb6a46c44254e9e596a7e93571f?GoogleAccessId=k8s-storage%40repositories-307112.iam.gserviceaccount.com&Expires=1643411246&Signature=ICo51cFbe3By7JPJol8nLfxcic%2BV%2Bv1uvGYjodATCXJc2I6XWSi7JWC8l%2BM6BTSVFOL8A0YioZOQggY8Afc0JJtiwInkxFHmVjleQ41he3RK5pwF4IwONeuQxcgUXYzd8p94sA5L0YZC6drAFb9mx4AJLwTdKQt7dZh146FmaQYY8ElGT6BpHX2t%2BK31UGP0pC75uFGUq6b3IDK11gPOCSvnrLGSAM1yulE8togDgZmw0BU77nLPkinXSIATCTjlHNxf5aUxlJkg0%2FtSM21b53JFvHGHHCQf8QSKtST4WCBA1up6BVX1YLbGLZXxQ07mf8K7jnQ4U%2FXfnw6IoTpQxw%3D%3D&response-content-disposition=attachment%3B%20filename%3DIR00000195_00001.pdf
I tried to scrape my YouTube subscriptions list into a csv file. But I faced a problem in the middle of the code. Here is my code:
import requests
from bs4 import BeautifulSoup
import csv
url = 'https://www.youtube.com/feed/channels'
source = requests.get(url)
soup = BeautifulSoup(source.content, 'lxml')
text = soup.find_all('yt-formatted-string', {'id': 'text'})
for i in range(len(text)):
    print(text[i].yt-formatted-string.text)
I am wondering why VS Code didn't recognize 'yt-formatted-string' even though it's found on the HTML page. Also, when I tried another div from the HTML, this code didn't give any output.
Your code is not working because you are not logged in to your account when sending the request to "https://www.youtube.com/feed/channels". You must log in first to get all subscriptions.
You can solve this problem by using Selenium. First, log in to your account using Selenium, and then you can use either Selenium or Beautiful Soup to extract subscriptions from that page.
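As for the editor not recognizing 'yt-formatted-string': a hyphen is not valid inside a Python attribute name, so text[i].yt-formatted-string.text is read as a chain of subtractions rather than an attribute lookup. The short sketch below uses the standard library's ast module to show how Python parses that expression; it only inspects the syntax and does not touch YouTube or Beautiful Soup.

```python
import ast

expr = "text[i].yt-formatted-string.text"
tree = ast.parse(expr, mode="eval")

# The top-level node is a binary operation (a subtraction), not an attribute lookup.
print(type(tree.body).__name__, type(tree.body.op).__name__)

# The variable names Python believes the expression refers to.
names = sorted({node.id for node in ast.walk(tree) if isinstance(node, ast.Name)})
print(names)
```

Since find_all already returned the yt-formatted-string tags themselves, reading text[i].text (or text[i].get_text()) avoids the hyphenated attribute entirely.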
I need to download all the files under these links, where only the suburb name changes in each link.
For reference:
https://www.data.vic.gov.au/data/dataset/2014-town-and-community-profile-for-thornbury-suburb
All the files under this search link:
https://www.data.vic.gov.au/data/dataset?q=2014+town+and+community+profile
Any possibilities?
Thanks :)
You can download a file like this:
from urllib.request import urlopen
response = urlopen('http://www.example.com/file_to_download')
content = response.read()  # raw bytes of the downloaded file
To get all the links in a page:
from bs4 import BeautifulSoup
import requests
r = requests.get("http://site-to.crawl")
data = r.text
soup = BeautifulSoup(data, 'html.parser')
for link in soup.find_all('a'):
    print(link.get('href'))
You should first read the html, parse it using Beautiful Soup and then find links according to the file type you want to download. For instance, if you want to download all pdf files, you can check if the links end with the .pdf extension or not.
There's a good explanation and code available here:
https://medium.com/@dementorwriter/notesdownloader-use-web-scraping-to-download-all-pdfs-with-python-511ea9f55e48
I want to be able to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all . I tried some other suggestions I found, but they didn't seem to work with this particular website; I would end up not finding any URLs at all.
from urllib.request import urlopen
import lxml.html
connection = urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print(link)
Perhaps it would be useful to make use of modules specifically designed for this. Here's a quick and dirty script that gets the relative links on the page:
#!/usr/bin/python3
import requests, bs4
res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('a', href=True)  # skip anchors that have no href
for link in links:
    print(link.attrs['href'])
It generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
Is this what you are looking for? Requests and Beautiful Soup are amazing tools for scraping.
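If full URLs are needed rather than relative ones, the relative links can be resolved against the page address. Here is a minimal sketch using urljoin from the standard library; the three hrefs are copied from the sample output above, and the variable names are only illustrative.

```python
from urllib.parse import urljoin

base = "https://yeezysupply.com/pages/all"
relative_links = [
    "/pages/jewelry",
    "/cart",
    "/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall",
]

# Resolve each relative href against the page it was found on.
absolute_links = [urljoin(base, href) for href in relative_links]
for url in absolute_links:
    print(url)
```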
There are no links in the page source; they are inserted using JavaScript after the page is loaded in the browser.
I am using Python 2.7 and version 4.5.1 of Beautiful Soup.
I'm at my wits' end trying to make this very simple script work. My goal is to get the online availability status of the NES console from Best Buy's website by parsing the HTML of the product's page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, it seems that you are looking for a tag which is generated client side by JavaScript; it shows up using 'inspect' on the loaded page, but not when viewing the page source, which is what the call to requests is pulling back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response)
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup, you'll see it probably returns something like Access Denied. This is because Best Buy requires an acceptable User-Agent on the GET request. As you do not have a User-Agent specified in the headers, it does not return the page.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
Or you could find the user agent that your own browser sends when you view the webpage:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
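Setting the header could look roughly like the sketch below. It uses urllib.request from the standard library so it runs without extra installs; the User-Agent string is only an example value (an assumption, substitute the one your own browser sends), the helper names `build_request` and `fetch` are made up, and whether Best Buy accepts this particular value has not been verified.

```python
from urllib.request import Request, urlopen

# Example browser-like User-Agent; an assumption, replace it with your own browser's value.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(url, user_agent=USER_AGENT):
    """Return a Request object that will send the given User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch(url):
    """Fetch url with the custom User-Agent and return the raw response bytes."""
    with urlopen(build_request(url), timeout=10) as response:
        return response.read()
```

With the requests library from the question, the equivalent is passing headers={'User-Agent': ...} to requests.get.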
Availability is loaded in JSON. You don't even need to parse HTML for that:
import json
from urllib.request import urlopen
sku = 10488665  # look at the URL of the web page, it is <blablah>/10488665.aspx
# change locations to get the right store
# The SKU is concatenated instead of %-formatted because the URL itself contains literal % escapes.
url = ('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus='
       + str(sku)
       + '&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3')
response = urlopen(url)
availability = json.loads(response.read())
print(availability[0]['shipping']['status'])
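One thing to watch for: the URL above passes callback=apiAvailability, which usually means the server wraps the JSON in a function call (JSONP). That is an assumption about this API and has not been verified here. If it is the case, the wrapper has to be stripped before the JSON can be parsed. A minimal sketch with a made-up sample payload and a hypothetical helper `strip_jsonp`:

```python
import json
import re


def strip_jsonp(text):
    """Return the JSON payload from 'callback(...)' text, or text unchanged."""
    match = re.fullmatch(r"\s*[\w$.]+\s*\((.*)\)\s*;?\s*", text, flags=re.DOTALL)
    return match.group(1) if match else text


# Made-up payload for illustration; the real response shape is not verified.
sample = 'apiAvailability([{"shipping": {"status": "SoldOutOnline"}}]);'
data = json.loads(strip_jsonp(sample))
print(data[0]["shipping"]["status"])
```

Plain JSON passes through strip_jsonp unchanged, so the helper is safe to apply whether or not the wrapper is present.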