How to use BeautifulSoup to scrape a webpage url

How to use BeautifulSoup to scrape a webpage url - python

I get started with web scraping and I would like to get the URLs from certain page provided below.
import requests
from bs4 import BeautifulSoup as Soup
page = "http://www.zillow.com/homes/for_sale/fore_lt/2-_beds/any_days/globalrelevanceex_sort/57.610107,-65.170899,15.707662,-128.452149_rect/3_zm/"
response = requests.get(page)
soup = Soup(response.text)
Now, I have all the info of the page in the soup content and I would like to get URLs of all the homes provided in the image
When, I INSPECT any of the videos of the home, the chrome opens this DOM element in the image:
How would I get the link inside the <a href=""> tag using the soup? I think the parent is <div id = "lis-results">, but, I need a way to navigate to the element. Actually, I need all the URLs (391,479) of in a text file.
Zillow has an API and also, Python wrapper for the convenience of this kind of data job and I'm looking the code now. All I need to get is the URLs for the FOR SALE -> Foreclosures and POTENTIAL LISTING -> Foreclosed and Pre-foreclosed informations.

The issue is that the request you send doesn't get the URLs. In fact, if I look at the response (using e.g. jupyter) I get:
I would suggest a different strategy: these kind of websites often communicate via json files.
From the Network tab of Web Developer in Firefox you can find the URL to request the json file:
Now, with this file you can get all the information needed.
import json
page = "http://www.zillow.com/search/GetResults.htm?spt=homes&status=110001&lt=001000&ht=111111&pr=,&mp=,&bd=2%2C&ba=0%2C&sf=,&lot=,&yr=,&pho=0&pets=0&parking=0&laundry=0&income-restricted=0&pnd=0&red=0&zso=0&days=any&ds=all&pmf=1&pf=1&zoom=3&rect=-134340820,16594081,-56469727,54952386&p=1&sort=globalrelevanceex&search=maplist&disp=1&listright=true&isMapSearch=true&zoom=3"
response = requests.get(page) # request the json file
json_response = json.loads(response.text) # parse the json file
soup = Soup(json_response['list']['listHTML'], 'html.parser')
and the soup has what you are looking for. If you explore the json, you will find a lot of useful information.
The list of all the URLs can be find with
links = [i.attrs['href'] for i in soup.findAll("a",{"class":"hdp-link"})]
All the URLs appears twice. If you want that they are unique, you can fix the list, or, otherwise, look for "hdp-link routable" in class above.
But, I always prefer more then less!

Related

Scraping data from span tag with class using BS

Im trying to scrape project's names from Gitlab. When I inspect source code I see that name of project is in:
<span class='project-name'>Project Name</span>
Unfortunately, when I try to scrape this date I got empty list, My code looks like:
url = 'https://gitlab.com/users/USER/projects'
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(source,'lxml')
repos = [repo.text for repo in soup.find_all('span',{'class':'project-name'})]
I was trying other solutions like using attrs, class_ or using other HTML tags, but nothing works. What can be wrong here?

Ok, so it looks like when you inspect the page in the network tabs in chrome developer tools, you can see that projects are not rendered when request is made:
What that means is, project information is requested after. In order to get the projects you need to send a request to https://gitlab.com/users/USER/projects.json endpoint:
After that you can inspect the response from that endpoint. As you can see the response here is json so we can load json data with json module and then in that dictionary there is an entry called html which has html data in it, so we can parse it with beautifulsoup and the rest of the code stays the same:
import bs4 as bs
import urllib, json
url = 'https://gitlab.com/users/USER/projects.json'
source = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(json.loads(source)["html"],'html.parser')
repos = [repo.text for repo in soup.find_all('span',{'class':'project-name'})]
print(repos)
Output:
['freebsd', 'freebsd-ports', 'freebsd-test', 'risc-vhdl', 'dotfiles', 'tideyBot']

Webscraping in Python (beautifulsoup)

I am trying to webscrape and am currently stuck on how I should continue with the code. I am trying to create a code that scrapes the first 80 Yelp! reviews. Since there are only 20 reviews per page, I am also stuck on figuring out how to create a loop to change the webpage to the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time
all_reviews = ''
def get_description(pullman):
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
page_title = soup.title.text
#get a tag content based on class
p_tag = soup.find_all('p',class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
#print the text within the tag
return p_tag.text

General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, its also going to work much nicer if you visit the website and parse BeautifulSoup and then use the soup object in functions - visit once, parse as many times as you want. You won't be blacklisted by websites as often this way. An example structure below.
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
get_description(soup)
get_reviews(soup)
If you inspect the page, each review appears as a copy of a template. If you take each review as an individual object and parse it, you can get the reviews you are looking for. The review template has the class id:lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
As for pagination, the pagination numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page number links are contained within a-href tags, so just write a for loop to iterate over the links.

To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same as before plus #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox] and switch to the network tab, then click the next button, you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
which looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website, by sending their headers to them, otherwise you get different data that doesn't look like comments.
They look like this in Chrome
My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise but this at least gives you a fighting chance.

How do I filter out .mp3 links using beautifulsoup from (possibly) broken html? (JSON)

I want to build small tool to help a family member download podcasts off a site.
In order to get the links to the files I first need to filter them out (with bs4 + python3).
The files are on this website (Estonian): Download Page "Laadi alla" = "Download"
So far my code is as follows:
(most of it is from examples on stackoverflow)
from bs4 import BeautifulSoup
import urllib.request
import re
url = urllib.request.urlopen("http://vikerraadio.err.ee/listing/mystiline_venemaa#?page=1&pagesize=902&phrase=&from=&to=&path=mystiline_venemaa&showAll")
content = url.read()
soup = BeautifulSoup(content, "lxml")
links = [a['href'] for a in soup.find_all('a',href=re.compile('http.*\.mp3'))]
print ("Links:", links)
Unfortunately I always get only two results.
Output:
Links: ['http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3', 'http://heli.err.ee/helid/exp/ERR_raadiouudised.mp3']
These are not the ones I want.
My best guess is that the page has somewhat broken html and bs4 / the parser is not able to find anything else.
I've tried different parsers with resulting in no change.
Maybe I'm doing something else wrong too.
My goal is to have the individual links in a list for example.
I'll filter out any duplicates / unwanted entries later myself.
Just a quick note, just in case: This is a public radio and all the content is legally hosted.
My new code is:
for link in soup.find_all('d2p1:DownloadUrl'):
print(link.text)
I am very unsure if the tag is selected correctly.
None of the examples listed in this question are actually working. See the answer below for working code.

Please be aware that the listings from the page are interfaced through an API. So instead of requesting the HTML page, I suggest you to request the API link which has 200 .mp3 links.
Please follow the below steps:
Request the API link, not the HTML page link
Check the response, it's a JSON. So extract the fields that are of your need
Help your Family, All Time :)
Solution
import requests, json
from bs4 import BeautifulSoup
myurl = 'http://vikerraadio.err.ee/api/listing/bypath?path=mystiline_venemaa&page=1&pagesize=200&phrase=&from=&to=&showAll=false'
r = requests.get(myurl)
abc = json.loads(r.text)
all_mp3 = {}
for lstngs in abc['ListItems']:
for asd in lstngs['Podcasts']:
all_mp3[asd['DownloadUrl']] = lstngs['Header']
all_mp3
all_mp3 is what you need. all_mp3 is a dictionary with download urls as keys and mp3 names as the values.

Using Beautiful Soup in Python to check availability of a product online

I am using python 2.7 and version 4.5.1 of Beautiful Soup
I'm at my wits end trying to make this very simple script to work. My goal is to to get the information on the online availability status of the NES console from Best Buy's website by parsing the html for the product's page and extracting the information in
<div class="status online-availability-status"> Sold out online </div>
This is my first time using the Beautiful Soup module so forgive me if I have missed something obvious. Here is the script I wrote to try to get the information above:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02')
soup = BeautifulSoup(page.content, 'html.parser')
avail = soup.findAll('div', {"class": "status online-availability-status"})
But then I just get an empty list for avail. Any idea why?
Any help is greatly appreciated.

As the comments above suggest, it seems that you are looking for a tag which is generated client side by JavaScript; it shows up using 'inspect' on the loaded page, but not when viewing the page source, which is what the call to requests is pulling back. You might try using dryscrape (which you may need to install with pip install dryscrape).
import dryscrape
from bs4 import BeautifulSoup
session = dryscrape.Session()
url = 'http://www.bestbuy.ca/en-CA/product/nintendo-nintendo-entertainment-system-nes-classic-edition-console-clvsnesa/10488665.aspx?path=922de2a5ceb066b0f058cc567ad3d547en02'
session.visit(url)
response = session.body()
soup = BeautifulSoup(response)
avail = soup.findAll('div', {"class": "status online-availability-status"})
This was the most popular solution in a question relating to scraping dynamically generated content:
Web-scraping JavaScript page with Python

If you try printing soup you'll see it probably returns something like Access Denied. This is because Best Buy requires an allowable User-Agent to be making the GET request. As you do not have a User-Agent specified in the Header, it is not returning anything.
Here is a link to generate a User Agent
How to use Python requests to fake a browser visit a.k.a and generate User Agent?
or you could figure out your user agent generated when you are viewing the webpage in your own browser
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent

Availability is loaded in JSON. You don't even need to parse HTML for that:
import urllib
import simplejson
sku = 1048865 # look at the URL of the web page, it is <blablah>//10488665.aspx
# chnage locations to get the right store
response = urllib.urlopen('http://api.bestbuy.ca/availability/products?callback=apiAvailability&accept-language=en&skus=%s&accept=application%2Fvnd.bestbuy.standardproduct.v1%2Bjson&postalCode=M5G2C3&locations=977%7C203%7C931%7C62%7C617&maxlos=3'%sku)
availability = simplejson.loads(response.read())
print availability[0]['shipping']['status']

source code of web page not available using urllib.urlopen()

I am trying to get video links from 'https://www.youtube.com/trendsdashboard#loc0=ind'. When I do inspect elements, it displays me the source html code for each videos. In source code retrieved using
urllib2.urlopen("https://www.youtube.com/trendsdashboard#loc0=ind").read()
It does not display html source for videos. Is there any otherway to do this?
<a href="/watch?v=dCdvyFkctOo" alt="Flipkart Wish Chain">
<img src="//i.ytimg.com/vi/dCdvyFkctOo/hqdefault.jpg" alt="Flipkart Wish Chain">
</a>
This simple code appears when we inspect elements from browser, but not in source code retrived by urllib

To view the source code you need use read method
If you just use open it gives you something like this.
In [12]: urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind')
Out[12]: <addinfourl at 3054207052L whose fp = <socket._fileobject object at 0xb60a6f2c>>
To see the source use read
urllib2.urlopen('https://www.youtube.com/trendsdashboard#loc0=ind').read()

Whenever you compare the source code between Python code and Web browser, dont do it through Insect Element, right click on the webpage and click view source, then you will find the actual source. Inspect Element displays the aggregated source code returned by as many network requests created as well as javascript code being executed.
Keep Developer Console open before opening the webpage, stay on Network tab and make sure that 'Preserve Log' is open for Chrome or 'Persist' for Firebug in Firefox, then you will see all the network requests made.

works for me...
import urllib2
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
html = urllib.urlopen(url).read()
IMO I'd use requests instead of urllib - it's a bit easier to use:
import requests
url = 'https://www.youtube.com/trendsdashboard#loc0=ind'
response = requests.get(url)
html = response.content
Edit
This will get you a list of all <a></a> tags with hyperlinks as per your edit. I use the library BeautifulSoup to parse the html:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
links = [tag for tag in soup.findAll('a') if tag.has_attr('href')]

we also need to decode the data to utf-8.
here is the code:
just use
response.decode('utf-8')
print(response)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to use BeautifulSoup to scrape a webpage url - python

Related

Scraping data from span tag with class using BS

Webscraping in Python (beautifulsoup)

How do I filter out .mp3 links using beautifulsoup from (possibly) broken html? (JSON)

Using Beautiful Soup in Python to check availability of a product online

source code of web page not available using urllib.urlopen()

Categories

Resources