webbrowser module searching url with absolute path - python

I want to open a website and download a resume from it, but the following code tries to open an absolute file path instead of the URL:
import webbrowser
soup = BeautifulSoup(webbrowser.open('www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0'),"lxml")
generates the following error:
gvfs-open: /home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0: error opening location: Error when getting information for file '/home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0': No such file or directory
Clearly it is prepending my current working directory to the URL and looking for that path on disk, where it does not exist. What am I doing wrong here? Thanks in advance.

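I think you are confusing the roles of Beautiful Soup and webbrowser. The webbrowser module is not needed to fetch the page: webbrowser.open only asks a browser to display a URL and returns a boolean, not the page's HTML. Because your string has no http:// scheme, it was also treated as a local file path.
From the documentation: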
Beautiful Soup provides a few simple methods and Pythonic idioms for
navigating, searching, and modifying a parse tree: a toolkit for
dissecting a document and extracting what you need. It doesn't take
much code to write an application
Adapting the tutorial example to your task, this prints the resume:
from bs4 import BeautifulSoup
import requests
url = "www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0"
# requests needs an explicit scheme, so prepend it to the bare host/path
r = requests.get("http://" + url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
# print the element whose id is "resume"
print(soup.find("div", {"id": "resume"}))
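The "http://" prefix is the part that matters here. As a minimal sketch (Python 3, standard library only), the check could be wrapped in a small helper. The name ensure_scheme and the default of "http" are my own choices for illustration, not something from the question, and it does not try to handle inputs such as host:port.

```python
from urllib.parse import urlsplit

def ensure_scheme(url, default="http"):
    """Return url unchanged if it already has a scheme, else prepend one."""
    # A bare "www.example.com/path" string parses with an empty scheme
    if urlsplit(url).scheme:
        return url
    return f"{default}://{url}"

print(ensure_scheme("www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0"))
print(ensure_scheme("https://www.indeed.com/"))
```

The result can then be passed to requests.get as in the answer above.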

Related

Exporting Linkedin Learning Video source to .txt file

I am very new to python, and programming in general.
I am trying to scrape a LinkedIn Learning web page to locate the full file path for the video on the page.
Ideally, I would like the script to be able to accept a course url and cycle through each video in the course, to pull the video file path from each video page within the course.
From reviewing the source, I found the area I am interested in is as follows:
<div> data-vjs-player etc etc </div>
Within this div, there is a video element. Within this element, is a src callout which contains the video link I am looking for, example as follows:
<video id="vjs_video_3_html5_api" class="vjs-tech" preload="auto" poster="https://media-exp1.licdn.com/dms/image/C4E0DAQEEM3rME8wwFw/learning-public-crop_675_1200/0?e=1595858400&v=beta&t=V5KkqHuGqUTliAMbL7oUBXeEWcrfBDdi4QrZbyGyAWE" src="https://files3.lynda.com/secure/courses/614299/VBR_MP4h264_main_HD720/614299_00_02_XR15_exfiles.mp4?0pnG4-hMq6_WSlXmJvkGQa6ubLk5EIuE8SG-D0jd9RJOztR5jY8wmlBcsWjHLzBK22z6DydJXGoV8njYeJ_A-dMb6BIZrtkZdUq20t2tD6hxhdNKeWVvik7aOfN3Oyv78_wqePFK1rGmujQnzbCYudW9r0Oyl54EcFQhUqUFnGpkVqHLgQ_Gndo"></video>
I attempted to utilize the following code as a basis, following a BeautifulSoup tutorial to parse the website link for the src callout:
from bs4 import BeautifulSoup
from lxml import html
import requests
URL = 'https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html5lib')
results = soup.find(id="vjs_video_3_html5_api")
print(results.prettify())
However, it is at this point I have come to a standstill, as I do not understand where to go here, despite researching this to the best of my abilities at this current time.
I really would appreciate any help or guidance that you may be able to provide on this.
Thank you all in advance.
When you look at the HTML that requests actually receives (for example with print(soup)), you will see that the class of the <video> tag is different from the one in your snippet.
You can extract the video URL with this example:
import json
import requests
from bs4 import BeautifulSoup
url = 'https://www.linkedin.com/learning/python-essential-training-2/about-python-3?u=2154233'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
data = json.loads(soup.video['data-sources'])
print(data[0]['src'])
Prints:
https://files3.lynda.com/secure/courses/614299/VBR_MP4h264_main_SD/614299_00_03_XR30_aboutpy.mp4?jNwDi0oWUMSPUqh0j6w7yy2IDyBgoGZEeY9Tj2TKVmZmpSMisIoXxG9K1BbRELSP_pM9ySZOFiOq6TzNFvxhEWoGujEGQYT7TfRhuXGwJyGffd5uWTdYBCoc65J-YJuvdg7xijnaDwVjFuUKSAJZxqvYyq8f5nOZrE0Mgckk-1XANfovQ8E
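To see what the json.loads step is doing without hitting the site, here is a rough offline sketch (Python 3). It uses the standard library's html.parser instead of Beautiful Soup so it runs without extra packages. The sample markup and the example.com URLs are made up, and I am assuming data-sources holds a JSON list of objects with a "src" key, as data[0]['src'] above suggests.

```python
import json
from html.parser import HTMLParser

class VideoSources(HTMLParser):
    """Collect the 'src' values listed in a <video data-sources="..."> attribute."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag != "video":
            return
        raw = dict(attrs).get("data-sources")
        if raw:
            # The attribute value is a JSON string, so decode it first
            for item in json.loads(raw):
                if "src" in item:
                    self.sources.append(item["src"])

# Hypothetical markup, only for illustration
sample = (
    "<video class=\"player\" data-sources='"
    '[{"src": "https://example.com/a.mp4"}, {"src": "https://example.com/b.mp4"}]'
    "'></video>"
)

parser = VideoSources()
parser.feed(sample)
print(parser.sources)
```

With the real page you would feed it the text returned by requests, or simply keep the Beautiful Soup version above.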

Python scraping website with flight tickets

I am trying to extract information about prices of flight tickets with a python script. Please take a look at the picture:
I would like to parse all the prices (such as "121" at the bottom of the tree). I have constructed a simple script and my problem is that I am not sure how to get the right parts from the code behind page's "inspect element". My code is below:
import urllib3
from bs4 import BeautifulSoup as BS
http = urllib3.PoolManager()
URL = "https://greatescape.co/?datesType=oneway&dateRangeType=exact&departDate=2019-08-19&origin=EAP&originType=city&continent=europe&flightType=3&city=WAW"
response = http.request('GET', URL)
soup = BS(response.data, "html.parser")
body = soup.find('body')
__next = body.find('div', {'id':'__next'})
ui_container = __next.find('div', {'class':'ui-container'})
bottom_container_root = ui_container.find('div', {'class':'bottom-container-root'})
print(bottom_container_root)
The problem is that I am stuck at the level of ui-container: bottom_container_root comes back empty, even though bottom-container-root is a direct child of ui-container. Could someone please let me know how to parse this tree properly?
I have no experience in web scraping, but as it happens it is one step in a bigger workflow I am building.
.find_next_siblings and .next_element can be useful for navigating through containers.
Here is some example usage:
from bs4 import BeautifulSoup
# small.html is a local sample file
html = open("small.html").read()
# name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(html, "html.parser")
print(soup.head.next_element)
print(soup.head.next_element.next_element)

I want to get all links from a certain webpage using python

I want to be able to pull all URLs from the following webpage using Python: https://yeezysupply.com/pages/all. I tried some other suggestions I found, but they didn't seem to work with this particular website; I would end up not finding any URLs at all.
import urllib
import lxml.html
connection = urllib.urlopen('https://yeezysupply.com/pages/all')
dom = lxml.html.fromstring(connection.read())
for link in dom.xpath('//a/@href'):
    print link
Perhaps it would be useful to use modules specifically designed for this. Here's a quick and dirty script that gets the relative links on the page:
#!/usr/bin/python3
import requests, bs4
res = requests.get('https://yeezysupply.com/pages/all')
soup = bs4.BeautifulSoup(res.text,'html.parser')
links = soup.find_all('a')
for link in links:
    print(link.attrs['href'])
it generates output like this:
/pages/jewelry
/pages/clothing
/pages/footwear
/pages/all
/cart
/products/womens-boucle-dress-bleach/?back=%2Fpages%2Fall
/products/double-sleeve-sweatshirt-bleach/?back=%2Fpages%2Fall
/products/boxy-fit-zip-up-hoodie-light-sand/?back=%2Fpages%2Fall
/products/womens-boucle-skirt-cream/?back=%2Fpages%2Fall
etc...
Is this what you are looking for? requests and Beautiful Soup are amazing tools for scraping.
There are no links in the page source; they are inserted by JavaScript after the page is loaded in the browser.
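A quick way to tell which situation you are in is to count the <a> tags in the raw HTML you downloaded. Here is a rough sketch (Python 3, standard library only) on two made-up pages; the markup and the renderLinks function are hypothetical and only stand in for a page that builds its links in script.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags present in the raw HTML."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)

def collect_links(html):
    collector = LinkCollector()
    collector.feed(html)
    return collector.links

# Links written directly into the markup
static_page = '<ul><li><a href="/pages/all">All</a></li><li><a href="/cart">Cart</a></li></ul>'
# Links that would only exist after a script runs in a browser
script_page = '<div id="app"></div><script>renderLinks(["/pages/all", "/cart"]);</script>'

print(collect_links(static_page))
print(collect_links(script_page))
```

If the second case is what you get for the real page, no HTML parser will find the links, and you need something that executes JavaScript.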

Using Beautiful Soup in Python to check availability of a product online

I am using python 2.7 and version 4.5.1 of Beautiful Soup
I'm at my wits' end trying to get this very simple script to work. My goal is to get the online availability status of the NES console from Best Buy's website by parsing the HTML of the product's page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, it seems that you are looking for a tag which is generated client side by JavaScript; it shows up using 'inspect' on the loaded page, but not when viewing the page source, which is what the call to requests is pulling back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup, you will probably see something like Access Denied. This is because Best Buy only serves the page to GET requests that carry an acceptable User-Agent. Since you do not set a browser-like User-Agent in the request headers, the product page is not returned.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
Or you could look up the user agent your own browser sends when you view the page:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
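As a minimal sketch of setting the header (Python 3, standard library, nothing is actually sent): the User-Agent string below is a placeholder I made up, and I have not checked which values Best Buy accepts, so substitute the one your own browser sends.

```python
from urllib.request import Request

# Placeholder value; replace it with the User-Agent string of your own browser
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"

def build_request(url, user_agent=USER_AGENT):
    """Build a GET request object that carries an explicit User-Agent header."""
    return Request(url, headers={"User-Agent": user_agent})

req = build_request("http://www.bestbuy.ca/")
# urllib stores header names capitalized as "User-agent"
print(req.get_header("User-agent"))
```

With requests, the same idea is requests.get(url, headers={"User-Agent": USER_AGENT}).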
Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson
sku = 10488665 # look at the URL of the web page, it is <blablah>//10488665.aspx
# change locations to get the right store
# str.format is used because the literal %2F, %2B and %7C escapes in the URL would be misread by the % operator
response = urllib.urlopen('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus={}&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3'.format(sku))
availability = simplejson.loads(response.read())
print(availability[0]['shipping']['status'])
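The query string is easier to edit if you let the standard library do the percent-encoding. This is only a sketch in Python 3: the parameter names and values are copied from the URL above, the helper name availability_url is my own, and I have not verified what the endpoint returns today.

```python
from urllib.parse import urlencode

BASE = "http://api.bestbuy.ca/availability/products"

def availability_url(sku, postal_code="M5G2C3",
                     locations=("977", "203", "931", "62", "617")):
    """Build the availability URL; urlencode handles the %2F, %2B and %7C escapes."""
    params = {
        "callback": "apiAvailability",
        "accept-language": "en",
        "skus": sku,
        "accept": "application/vnd.bestbuy.standardproduct.v1+json",
        "postalCode": postal_code,
        "locations": "|".join(locations),  # store ids separated by "|"
        "maxlos": 3,
    }
    return BASE + "?" + urlencode(params)

url = availability_url(10488665)
print(url)
```

Changing the stores or the postal code is then a matter of passing different arguments rather than editing escapes by hand.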

Pass over URLs scraping

I am trying to do some web scraping, and I wrote a simple script that aims to print all URLs present in the webpage. I don't know why it skips many URLs and prints the list starting from the middle instead of from the first URL.
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print(links['href'])
Why is that? Could anyone explain what is happening?
I am using Python 3.7.1, OS Windows 10 - Visual Studio Code
Often an href contains only part of a URL (a relative path), not the complete address. No worries.
Open the link in a new browser tab, find the missing part of the URL, and prepend it to the href as a string.
In this case, that prefix must be 'http://www.bda-ieo.it/test/'.
Here is your code:
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print('http://www.bda-ieo.it/test/' + links['href'])
And this is the result:
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=C
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=D
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=E
...
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=8721_2
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=347_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=2021_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=805958_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=349_1
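Instead of gluing the prefix on by hand, you could also let urllib.parse.urljoin resolve each href against the page URL. A small sketch (Python 3); the href list is a hand-picked sample based on the output above, and the helper name absolutize is mine.

```python
from urllib.parse import urljoin

PAGE_URL = "http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25"

def absolutize(hrefs, page_url=PAGE_URL):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]

# Sample hrefs: two relative ones and one that is already absolute
hrefs = [
    "Alphabetical.aspx?Lan=Ita&FL=A",
    "ComponentiAlimento.aspx?Lan=Ita&foodid=349_1",
    "http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B",
]

for url in absolutize(hrefs):
    print(url)
```

Unlike plain concatenation, this leaves hrefs that are already absolute untouched.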
