I want to get all links from a certain webpage using Python

I want to be able to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all. I tried some suggestions I found elsewhere, but they didn't seem to work with this particular website; I would end up not finding any URLs at all.
import urllib
import lxml.html
connection = urllib.urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link

Perhaps it would be useful for you to make use of modules specifically designed for this. Here's a quick and dirty script that gets the relative links on the page:
#!/usr/bin/python3
import requests, bs4
res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text, 'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.attrs['href'])
It generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
Is this what you are looking for? requests and Beautiful Soup are amazing tools for scraping.
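If you need absolute URLs rather than the relative ones shown above, a small variation of the same script (just a sketch, using urllib.parse.urljoin from the standard library) resolves each href against the page URL:
import requests, bs4
from urllib.parse import urljoin
base = 'https://yeezysupply.com/pages/all'
res = requests.get(base)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.find_all('a', href=True):
    # urljoin resolves relative paths like /cart against the base URL
    print(urljoin(base, link['href']))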

There are no links in the page source; they are inserted using JavaScript after the page is loaded in the browser.
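If that turns out to be the case for a page you are scraping, one option is to let a real browser render the page first and then parse the result. A minimal sketch with Selenium (assuming Chrome and a matching chromedriver are installed; any other driver works the same way):
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://yeezysupply.com/pages/all')
# page_source contains the DOM after JavaScript has run
soup = BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.find_all('a', href=True):
    print(a['href'])
driver.quit()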

Related

Exporting LinkedIn Learning video source to .txt file

I am very new to Python and to programming in general.
I am trying to scrape a LinkedIn Learning web page to locate the full file path for the video on the page.
Ideally, I would like the script to accept a course URL and cycle through each video in the course, pulling the video file path from each video page.
From reviewing the source, I found the area I am interested in is as follows:
<div data-vjs-player etc etc> ... </div>
Within this div there is a video element, and within that element is a src attribute which contains the video link I am looking for, for example:
<video id="vjs_video_3_html5_api" class="vjs-tech" preload="auto" poster="https://media-exp1.licdn.com/dms/image/C4E0DAQEEM3rME8wwFw/learning-public-crop_675_1200/0?e=1595858400&v=beta&t=V5KkqHuGqUTliAMbL7oUBXeEWcrfBDdi4QrZbyGyAWE" src="https://files3.lynda.com/secure/courses/614299/VBR_MP4h264_main_HD720/614299_00_02_XR15_exfiles.mp4?0pnG4-hMq6_WSlXmJvkGQa6ubLk5EIuE8SG-D0jd9RJOztR5jY8wmlBcsWjHLzBK22z6DydJXGoV8njYeJ_A-dMb6BIZrtkZdUq20t2tD6hxhdNKeWVvik7aOfN3Oyv78_wqePFK1rGmujQnzbCYudW9r0Oyl54EcFQhUqUFnGpkVqHLgQ_Gndo"></video>
I attempted to utilize the following code as a basis, following a BeautifulSoup tutorial to parse the website link for the src callout:
from bs4 import BeautifulSoup
from lxml import html
import requests
URL = 'https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html5lib')
results = soup.find(id="vjs_video_3_html5_api")
print(results.prettify())
However, at this point I have come to a standstill, as I do not understand where to go from here, despite researching this to the best of my ability.
I really would appreciate any help or guidance that you may be able to provide on this.
Thank you all in advance.
When you look at the source HTML (for example with print(soup)), you will see that the class of the <video> tag is different.
You can extract the video url with this example:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data = json.loads(soup.video['data-sources'])
print(data[0]['src'])
Prints:
https://files3.lynda.com/secure/courses/614299/VBR_MP4h264_main_SD/614299_00_03_XR30_aboutpy.mp4?jNwDi0oWUMSPUqh0j6w7yy2IDyBgoGZEeY9Tj2TKVmZmpSMisIoXxG9K1BbRELSP_pM9ySZOFiOq6TzNFvxhEWoGujEGQYT7TfRhuXGwJyGffd5uWTdYBCoc65J-YJuvdg7xijnaDwVjFuUKSAJZxqvYyq8f5nOZrE0Mgckk-1XANfovQ8E
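Since your goal is to export the video sources to a .txt file, you could loop over the video page URLs and write each extracted src out as you go. A rough sketch building on the code above (the urls list is a hypothetical placeholder for the course's video pages, which you would still need to collect):
import json
import requests
from bs4 import BeautifulSoup
# hypothetical list of video page URLs to cycle through
urls = ['https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233']
with open('video_sources.txt', 'w') as f:
    for url in urls:
        soup = BeautifulSoup(requests.get(url).content, 'html.parser')
        data = json.loads(soup.video['data-sources'])
        f.write(data[0]['src'] + '\n')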

webbrowser module searching url with absolute path

I want to open a website to download a resume from it, but the following code tries to open an absolute local path instead of just the URL:
import webbrowser
soup = BeautifulSoup(webbrowser.open('www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0'),"lxml")
generates the following error:
gvfs-open: /home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0:
error opening location: Error when getting information for file '/home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0': No such file or directory
Clearly it is prepending my home directory to the address and trying to open that as a local file, which does not exist. What am I doing wrong here? Thanks in advance.
I suppose you are confusing the usage of Beautiful Soup and webbrowser. webbrowser is not needed to access the page.
From the documentation:
Beautiful Soup provides a few simple methods and Pythonic idioms for
navigating, searching, and modifying a parse tree: a toolkit for
dissecting a document and extracting what you need. It doesn't take
much code to write an application
Adapting the tutorial example to your task, this prints the resume:
from bs4 import BeautifulSoup
import requests
url = "www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0"
r = requests.get("http://" + url)  # requests needs the scheme, so prepend http://
data = r.text
soup = BeautifulSoup(data, "html.parser")
print(soup.find("div", {"id": "resume"}))
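If you want to save the resume rather than just print it, you could write the extracted markup to a file; a sketch (assuming the div with id "resume" is present in the fetched page):
from bs4 import BeautifulSoup
import requests
url = "http://www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
resume = soup.find("div", {"id": "resume"})
# write the resume markup to disk so it can be opened in a browser later
with open("resume.html", "w") as f:
    f.write(resume.prettify())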

How do I filter out .mp3 links using beautifulsoup from (possibly) broken html? (JSON)

I want to build small tool to help a family member download podcasts off a site.
In order to get the links to the files I first need to filter them out (with bs4 + python3).
The files are on this website (Estonian): Download Page ("Laadi alla" = "Download").
So far my code is as follows:
(most of it is from examples on stackoverflow)
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
links = [a['href'] for a in soup.find_all('a', href=re.compile(r'http.*\.mp3'))]
print ("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken html and bs4 / the parser is not able to find anything else.
I've tried different parsers with resulting in no change.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: This is a public radio and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
    print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.
Please be aware that the listings on the page are fetched through an API. So instead of requesting the HTML page, I suggest you request the API link, which has 200 .mp3 links.
Please follow the steps below:
Request the API link, not the HTML page link.
Check the response; it is JSON, so extract the fields you need.
Help your Family, All Time :)
Solution
import requests, json
myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)
all_mp3 = {}
for lstngs in abc['ListItems']:
    for asd in lstngs['Podcasts']:
        all_mp3[asd['DownloadUrl']] = lstngs['Header']
print(all_mp3)
all_mp3 is what you need: a dictionary with download URLs as keys and mp3 names as values.
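If you then want to actually download the files, you could stream each URL to disk; a rough sketch building on the all_mp3 dictionary above:
import requests
for download_url, header in all_mp3.items():
    # use the last part of the URL as the local file name
    filename = download_url.split('/')[-1]
    r = requests.get(download_url, stream=True)
    with open(filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)
    print('Saved', header, 'as', filename)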

Using Beautiful Soup in Python to check availability of a product online

I am using Python 2.7 and version 4.5.1 of Beautiful Soup.
I'm at my wits' end trying to make this very simple script work. My goal is to get the online availability status of the NES console from Best Buy's website by parsing the HTML of the product page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, it seems that you are looking for a tag which is generated client-side by JavaScript; it shows up when you 'inspect' the loaded page, but not when viewing the page source, which is what the call to requests pulls back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup, you'll see it probably returns something like Access Denied. This is because Best Buy requires an allowable User-Agent to be making the GET request. As you do not have a User-Agent specified in the header, it is not returning the page.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
Or you could look up the User-Agent your own browser sends when you view the webpage:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
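For example, a minimal sketch passing a browser-like User-Agent with the request (the User-Agent string below is just an example; substitute whatever your own browser sends):
import requests
from bs4 import BeautifulSoup
headers = {
    # example desktop Chrome User-Agent; replace with your own if needed
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})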
Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson
sku = 10488665  # look at the URL of the web page, it is <blablah>//10488665.aspx
# change locations to get the right store
response = urllib.urlopen('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus=%s&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3' % sku)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']
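One caveat: the callback=apiAvailability parameter in that URL suggests the response may come back as JSONP, i.e. the JSON wrapped in apiAvailability(...). If simplejson.loads complains, you may need to strip the wrapper first; a sketch:
raw = response.read()
# strip a JSONP wrapper like apiAvailability({...}) if present
if raw.startswith('apiAvailability('):
    raw = raw[len('apiAvailability('):].rstrip(');')
availability = simplejson.loads(raw)
print availability[0]['shipping']['status']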

Getting no output while running Python Script to Scrape News Stories from CNN

import requests
from lxml import html
page = requests.get('http://www.cnn.com')
html_content = html.fromstring(page.content)
for i in html_content.iterchildren():
    print i
news_stories = html_content.xpath('//h2[@data-analytics]/a/span/text()')
news_links = html_content.xpath('//h2[@data-analytics]/a/@href')
I am trying to run this code to understand how web scraping in Python works.
I want to scrape the top news stories and their links from CNN.
When I run this in the Python shell, the output I get for news_stories and news_links is:
[]
My question is: where am I going wrong with this, and is there a better way to achieve what I am trying to do?
In your code, html_content holds the parsed element object rather than the readable content of the page:
html_content = html.fromstring(page.content)
You can print the following to see the complete HTML code for that page:
import requests
from lxml import html
page = requests.get('http://www.cnn.com')
print page.text
Even if you do get the content this way, you may receive a gzipped response from the server (see Get html using Python requests?).
I would highly recommend the httplib2 library together with BeautifulSoup to scrape news stories from CNN. They are really handy to use and get you what you want. You can see another Stack Overflow post here (retrieve links from web page using python and BeautifulSoup).
I hope that helps you.
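For completeness, here is a small sketch of that httplib2 + BeautifulSoup approach (the h3 selector is an assumption; inspect CNN's current markup to find the right tags for headlines):
import httplib2
from bs4 import BeautifulSoup
http = httplib2.Http()
# httplib2 returns a (response headers, body) tuple
response, content = http.request('http://www.cnn.com')
soup = BeautifulSoup(content, 'html.parser')
# assumption: headlines live in <h3> tags that contain a link
for h3 in soup.find_all('h3'):
    a = h3.find('a')
    if a and a.get('href'):
        print(a.get_text(strip=True), a['href'])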
