Pass over URLs scraping - python

I am trying to do some web scraping and wrote a simple script that should print every URL present in the webpage. I don't know why it skips over many URLs and prints the list starting from the middle instead of from the first URL.
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print(links['href'])
Why is that? Could anyone explain to me what is happening?
I am using Python 3.7.1, OS Windows 10 - Visual Studio Code

Often, hrefs provide only part of a URL, not the complete one. No worries.
Open one of them in a new browser tab and find the missing part of the URL, then prepend it to the href as a string.
In this case, that missing part is 'http://www.bda-ieo.it/test/'.
Here is your code.
from urllib import request
from bs4 import BeautifulSoup
source = request.urlopen("http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25")
soup = BeautifulSoup(source, "html.parser")
for links in soup.select('a'):
    print('http://www.bda-ieo.it/test/' + links['href'])
And this is the result.
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=A
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=B
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=C
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=D
http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=E
...
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=8721_2
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=347_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=2021_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=805958_1
http://www.bda-ieo.it/test/ComponentiAlimento.aspx?Lan=Ita&foodid=349_1
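If you prefer not to hard-code the prefix, the standard library's urljoin handles both relative and already-absolute hrefs. A minimal sketch of that variation (same page as above, just a different way of building the full URL):
from urllib import request
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base = "http://www.bda-ieo.it/test/Alphabetical.aspx?Lan=Ita&FL=%25"
soup = BeautifulSoup(request.urlopen(base), "html.parser")
for a in soup.select('a[href]'):
    # urljoin resolves relative hrefs against the page URL and leaves absolute ones untouched
    print(urljoin(base, a['href']))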

Related

Browser code and beautifulsoup collection different

I am trying to parse the soccer matches on the soccerstand front page and failing, because the items I get with BeautifulSoup are very different from what I see in the browser.
My code is simple at the moment:
import urllib.request
from bs4 import BeautifulSoup
with urllib.request.urlopen('https://soccerstand.com/') as response:
    url_data = response.read()

soup = BeautifulSoup(url_data, 'html.parser')
print(soup.find_all('div.event__match'))
So I tried this and it failed. When I checked the soup variable, it turned out not to contain such divs at all, so what I get with BS is different from what I see by inspecting the code on the website.
What's the reason for that? Is there any workaround?

webbrowser module searching url with absolute path

I want to open a website to download a resume from it, but the following code tries to open an absolute file path instead of just the URL:
import webbrowser
from bs4 import BeautifulSoup

soup = BeautifulSoup(webbrowser.open('www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0'), "lxml")
generates the following error:
gvfs-open: /home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0:
error opening location: Error when getting information for file '/home/utkarsh/Documents/Extract_Resume/www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0': No such file or directory
Clearly it is prepending the home directory and trying to open that path as a local file, which does not exist. What am I doing wrong here? Thanks in advance.
I suppose you are confusing the usage of Beautiful Soup and webbrowser. webbrowser is not needed to access the page.
From the documentation:
Beautiful Soup provides a few simple methods and Pythonic idioms for navigating, searching, and modifying a parse tree: a toolkit for dissecting a document and extracting what you need. It doesn't take much code to write an application.
Adapting the tutorial example to your task, to print the resume as output:
from bs4 import BeautifulSoup
import requests
url = "www.indeed.com/r/Prabhanshu-Pandit/dee64d1418e20069?sp=0"
r = requests.get("http://" + url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
print(soup.find("div", {"id": "resume"}))

Using Beautiful Soup in Python to check availability of a product online

I am using python 2.7 and version 4.5.1 of Beautiful Soup
I'm at my wits' end trying to make this very simple script work. My goal is to get the online availability status of the NES console from Best Buy's website by parsing the HTML of the product page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.
As the comments above suggest, it seems that you are looking for a tag which is generated client side by JavaScript; it shows up using 'inspect' on the loaded page, but not when viewing the page source, which is what the call to requests is pulling back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python
If you try printing soup, you'll see it probably returns something like Access Denied. This is because Best Buy requires an allowable User-Agent to make the GET request. As you do not have a User-Agent specified in the header, it is not returning anything.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
or you could figure out the User-Agent your own browser sends when you view the webpage
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
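As a rough sketch of the second suggestion (the User-Agent string below is just an example desktop browser string; substitute your own):
import requests
from bs4 import BeautifulSoup

url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
# example desktop User-Agent string; any current browser's string should work
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
print(avail)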
Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson
sku = 10488665 # look at the URL of the web page, it is <blahblah>/10488665.aspx
# change locations to get the right store
# note: literal % signs in the URL are escaped as %% for the %-formatting below
response = urllib.urlopen('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus=%s&accept=application%%2Fvnd.bestbuy.standardproduct.v1%%2Bjson&postalCode=M5G2C3&locations=977%%7C203%%7C931%%7C62%%7C617&maxlos=3' % sku)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']

Data scraping with Python and Beautiful Soup

I am currently taking my first steps with Python & Beautiful Soup in order to scrape data from the Russian statistics website.
Looking at different examples here on Stack Overflow, I think the code is correct, and yet my simple query does not return anything from this site. When executing the code, my Python command line remains blank, but also does not return an error.
What's wrong here?
My (very simple) code:
from bs4 import BeautifulSoup
import urllib2
url = "http://www.gks.ru/bgd/free/B00_25/IssWWW.exe/Stg/d000/000715.HTM"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
print(soup)
You need to specify a parser:
soup = BeautifulSoup(page.read(), 'html.parser')
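For context, here is the original script with only the parser argument added (everything else unchanged):
from bs4 import BeautifulSoup
import urllib2

url = "http://www.gks.ru/bgd/free/B00_25/IssWWW.exe/Stg/d000/000715.HTM"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), 'html.parser')
print(soup)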

Scraping all links using Python BeautifulSoup/lxml

http://www.snapdeal.com/
I was trying to scrape all links from this site and when I do, I get an unexpected result. I figured out that this is happening because of JavaScript.
Under the "See All Categories" tab you will find all the major product categories. If you hover the mouse over any category it expands into subcategories. I want those links from each major category.
import urllib2
from bs4 import BeautifulSoup

url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url)
page = BeautifulSoup(data)
#print data
for link in page.findAll('a'):
    l = link.get('href')
    print l
But this gave me a different result than what I expected (I turned off JavaScript and looked at the page source, and the output matched that source).
I just want to find all the sub-links from each major category. Any suggestions will be appreciated.
This is happening just because you are letting BeautifulSoup choose its own best parser, and you might not have lxml installed.
The best option is to use html.parser to parse the URL.
from bs4 import BeautifulSoup
import urllib2
url = 'http://www.snapdeal.com/'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
This worked for me. Make sure to install the dependencies.
I think you should try another library such as selenium; it provides a web driver for you, and that is the advantage of this library. For myself, I couldn't handle JavaScript with bs4 alone.
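For example, a minimal sketch with headless Chrome (this assumes selenium and a matching chromedriver are installed; the PhantomJS answer further down shows the same idea):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')  # run Chrome without a visible window
driver = webdriver.Chrome(options=options)
driver.get('http://www.snapdeal.com/')

# page_source here includes content rendered by JavaScript
soup = BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.find_all('a'):
    print(a.get('href'))
driver.quit()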
The Categories menu is the URL you are looking for. Many websites generate their content dynamically using XHR (XMLHttpRequest).
In order to examine the components of a website, get familiar with the Firebug add-on in Firefox or the built-in Developer Tools in Chrome. You can check the XHR requests a website makes under the Network tab of those tools, and then call them directly, as sketched below.
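A sketch of that pattern (the endpoint below is a placeholder, not the real Snapdeal URL; copy the actual request URL from the Network tab):
import requests

# placeholder endpoint - replace with the XHR URL copied from the Network tab
xhr_url = 'http://www.snapdeal.com/example/categoryTree'
response = requests.get(xhr_url, headers={'X-Requested-With': 'XMLHttpRequest'})
print(response.json())  # many XHR endpoints return JSON rather than HTML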
Use a web scraping tool such as scrapy or mechanize
In mechanize, to get all the links in the snapdeal homepage,
from mechanize import Browser

br = Browser()
br.open("http://www.snapdeal.com")
for link in br.links():
    print link.name
    print link.url
I have been looking into a way to scrape links from webpages that are only rendered in an actual browser, but wanted this to run using a headless browser.
I was able to achieve this using PhantomJS, Selenium and Beautiful Soup:
#!/usr/bin/python
import bs4
import requests
from selenium import webdriver
driver = webdriver.PhantomJS('phantomjs')
url = 'http://www.snapdeal.com/'
driver.get(url)
content = driver.page_source
soup = bs4.BeautifulSoup(content, 'html.parser')
links = [a.attrs.get('href') for a in soup.find_all('a')]
for paths in links:
    print paths
driver.close()
The following examples will work for both HTTP and HTTPS. I'm writing this answer to show how this can be used in both Python 2 and Python 3.
Python 2
This is inspired by this answer.
from bs4 import BeautifulSoup
import urllib2
url = 'https://stackoverflow.com'
data = urllib2.urlopen(url).read()
page = BeautifulSoup(data,'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print l
Python 3
from bs4 import BeautifulSoup
from urllib.request import urlopen
import ssl
# to open up HTTPS URLs
gcontext = ssl.SSLContext()
# You can give any URL here. I have given the Stack Overflow homepage
url = 'https://stackoverflow.com'
data = urlopen(url, context=gcontext).read()
page = BeautifulSoup(data, 'html.parser')
for link in page.findAll('a'):
    l = link.get('href')
    print(l)
Other Languages
For other languages, please see this answer.
