Webscraping in Python (beautifulsoup) - python

I am trying to webscrape and am currently stuck on how I should continue with the code. I am trying to create a code that scrapes the first 80 Yelp! reviews. Since there are only 20 reviews per page, I am also stuck on figuring out how to create a loop to change the webpage to the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time
all_reviews = ''
def get_description(pullman):
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
page_title = soup.title.text
#get a tag content based on class
p_tag = soup.find_all('p',class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
#print the text within the tag
return p_tag.text

General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, its also going to work much nicer if you visit the website and parse BeautifulSoup and then use the soup object in functions - visit once, parse as many times as you want. You won't be blacklisted by websites as often this way. An example structure below.
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
get_description(soup)
get_reviews(soup)
If you inspect the page, each review appears as a copy of a template. If you take each review as an individual object and parse it, you can get the reviews you are looking for. The review template has the class id:lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
As for pagination, the pagination numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page number links are contained within a-href tags, so just write a for loop to iterate over the links.

To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same as before plus #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox] and switch to the network tab, then click the next button, you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
which looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website, by sending their headers to them, otherwise you get different data that doesn't look like comments.
They look like this in Chrome
My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise but this at least gives you a fighting chance.

Related

How to get simple information through a crawler

I am trying to make a simple crawler that scrapes through this https://en.wikipedia.org/wiki/Web_scraping page, then proceeds to extract the 19 links from the See About section. This I manage to do, however I am also trying to extract the first paragraph from each of those 19 links and this is where it stops "working". I get the same paragraph from the first page and not from each one. This is what I have so far. I know there might be better options for doing this but i want to stick to BeautifulSoup and simple python code.
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text
soup = BeautifulSoup(data, 'html.parser')
def visit():
try:
p = soup.p
print(p.get_text())
except AttributeError:
print('<p> Tag was not found')
links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
if 'href' in link.attrs:
links_todo.append(urljoin(url, link.attrs['href']))
while links_todo:
url_to_visit = links_todo.pop()
print('Now visiting:', url_to_visit)
visit()
Example of the first print
Now visiting: https://en.wikipedia.org/wiki/OpenSocial
Web scraping, web harvesting, or web data extraction is data scraping used for extracting data from websites. The web scraping software may directly access the World Wide Web using the Hypertext Transfer Protocol or a web browser. While web scraping can be done manually by a software user, the term typically refers to automated processes implemented using a bot or web crawler. It is a form of copying in which specific data is gathered and copied from the web, typically into a central local database or spreadsheet, for later retrieval or analysis.
Intended function should be that it prints the first paragraph for every new link printed, not the same paragraph from the first link. What do I need to do in order to fix this? Or any tips on what I am missing. I am fairly new to python so I am still learning the concepts as I work on things.
At the top of your code you define data and soup. Both are tied to https://en.wikipedia.org/wiki/Web_scraping.
Every time you call visit(), you print from soup, and soup never changes.
You need to pass the url to visit(), e.g. visit(url_to_visit). The visit function should accept the url as an argument, then visit the page using requests, and create a new soup from the returned data, then print the first paragraph.
Edited to add code explaining my original answer:
from urllib.parse import urljoin
import requests
from bs4 import BeautifulSoup
start_url = 'https://en.wikipedia.org/wiki/Web_scraping'
# Renamed this to start_url to make it clear that this is the source page
data = requests.get(start_url).text
soup = BeautifulSoup(data, 'html.parser')
def visit(new_url): # function now accepts a url as an argument
try:
new_data = requests.get(new_url).text # retrieve the text from the url
new_soup = BeautifulSoup(new_data, 'html.parser') # process the retrieved html in beautiful soup
p = new_soup.p
print(p.get_text())
except AttributeError:
print('<p> Tag was not found')
links_todo = []
links = soup.find('div', {'class': 'div-col'}).find_all('a')
for link in links:
if 'href' in link.attrs:
links_todo.append(urljoin(start_url, link.attrs['href']))
while links_todo:
url_to_visit = links_todo.pop()
print('Now visiting:', url_to_visit)
visit(url_to_visit) # here's where we pass each line to the visit() function

How do I use Bs4 to pull similar information but from different places in DOM hierarchy?

I'm trying to scrape information from a series of pages from like these two:
https://www.nysenate.gov/legislation/bills/2019/s240
https://www.nysenate.gov/legislation/bills/2019/s8450
What I want to do is build a scraper that can pull down the text of "See Assembly Version of this Bill". In the two links listed above, the classes are the same but for one page it's the only iteration of that class, but for another it's the third.
I'm trying to make something like this work:
assembly_version = soup.select_one(".bill-amendment-detail content active > dd")
print(assembly_version)
But I keep getting None
Any thoughts?
url = "https://www.nysenate.gov/legislation/bills/2019/s11"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
assembly_version = soup.find(class_="c-block c-bill-section c-bill--details").find("a").text.strip()
print(assembly_version)

Scraping information from a website with a 'Show More' button

I'm trying to scrape information for dresses from this website: https://www.libertylondon.com/uk/department/women/clothing/dresses/
Obviously, I'm not only interested in the first 60 results, but all of them. When clicking on the 'Show More' button a couple of times, I arrive at this url: https://www.libertylondon.com/uk/department/women/clothing/dresses/#sz=60&start=300
I would have expected that using the following code, I get a full download of the page mentioned above, but for some reason, it will still only yield the first 60 results.
import requests
import bs4
url = "https://www.libertylondon.com/uk/department/women/clothing/dresses/#sz=60&start=300"
res = requests.get(url)
res.encoding = 'utf-8'
res.raise_for_status()
html = res.text
soup = bs4.BeautifulSoup(html, "lxml")
elements = soup.find_all("div", attrs = {"class": "product product-tile"})
I can see that the issue lies within the request itself, since the soup variable does not contain the full html text I see when inspecting the page, but I can not figure out why that is.
Try the below url which fetch you 331 elements.
url : https://www.libertylondon.com/uk/department/women/clothing/dresses/?sz=331&start=0&format=ajax
import requests
import bs4
url="https://www.libertylondon.com/uk/department/women/clothing/dresses/?sz=331&start=0&format=ajax"
res = requests.get(url)
res.encoding = 'utf-8'
res.raise_for_status()
html = res.text
soup = bs4.BeautifulSoup(html, "lxml")
elements = soup.find_all("div", attrs = {"class": "product product-tile"})
print(len(elements))
The link you show after having clicked to the "Show more" button uses fragments (notice the # sign). This is not something sent to the server, but rather used by JavaScript in the front-end to load more items without reloading the full page.
However, you're lucky because if you look at the HTTP requests made in your browser console, you'll see that it does a request to https://www.libertylondon.com/uk/department/women/clothing/dresses/?sz=60&start=60. Those are query params (and seem to exactly match the fragments!), so this means the server will send the extra items.
I think in this case, the button "show more", load *sz* dress from the *from* th dress.
So when you do a http request with #sz=60&start=300 attribute, the database will only fetch the dresses from index 300 to 360, that is why the request only contains 60 dresses.
There is another button on the page that indicate another url: SHOW ALL, this button give this url: https://www.libertylondon.com/uk/department/women/clothing/dresses/?sz=120
With only the ?sz=120 url parameter you could get a answer with sz number of dresses. But it seems there is a limit of how many dresses you can load at once.
HTTP GET to https://www.libertylondon.com/uk/department/women/clothing/dresses/#sz=331&start=0 will return all items (331 is the current number of items and may be changed in the future)

Scraping with Python. Can't get wanted data

I am trying to scrape website, but I encountered a problem. When I try to scrape data, it looks like the html differs from what I see on google inspect and from what I get from python. I get this with http://edition.cnn.com/election/results/states/arizona/house/01 I tried to scrape election results. I used this script to check HTML part of the webpage, and I noticed that they different. There is no classes that I need, like section-wrapper.
page =requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Anyone knows what is the problem ?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site use JavaScript fetch data, you can check the url above.
You can find this url in chrome dev-tools, there are many links, check it out
Chrome >>F12>> network tab>>F5(refresh page)>>double click the .josn url>> open new tab
import requests
from bs4 import BeautifulSoup
page=requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content)
#you can try all sorts of tags here I used class: "ad" and class:"ec-placeholder"
g_data = soup.find_all("div", {"class":"ec-placeholder"})
h_data = soup.find_all("div"),{"class":"ad"}
for item in g_data:print item
#print '\n'
#for item in h_data:print item

How to use BeautifulSoup to scrape a webpage url

I get started with web scraping and I would like to get the URLs from certain page provided below.
import requests
from bs4 import BeautifulSoup as Soup
page = "http://www.zillow.com/homes/for_sale/fore_lt/2-_beds/any_days/globalrelevanceex_sort/57.610107,-65.170899,15.707662,-128.452149_rect/3_zm/"
response = requests.get(page)
soup = Soup(response.text)
Now, I have all the info of the page in the soup content and I would like to get URLs of all the homes provided in the image
When, I INSPECT any of the videos of the home, the chrome opens this DOM element in the image:
How would I get the link inside the <a href=""> tag using the soup? I think the parent is <div id = "lis-results">, but, I need a way to navigate to the element. Actually, I need all the URLs (391,479) of in a text file.
Zillow has an API and also, Python wrapper for the convenience of this kind of data job and I'm looking the code now. All I need to get is the URLs for the FOR SALE -> Foreclosures and POTENTIAL LISTING -> Foreclosed and Pre-foreclosed informations.
The issue is that the request you send doesn't get the URLs. In fact, if I look at the response (using e.g. jupyter) I get:
I would suggest a different strategy: these kind of websites often communicate via json files.
From the Network tab of Web Developer in Firefox you can find the URL to request the json file:
Now, with this file you can get all the information needed.
import json
page = "http://www.zillow.com/search/GetResults.htm?spt=homes&status=110001&lt=001000&ht=111111&pr=,&mp=,&bd=2%2C&ba=0%2C&sf=,&lot=,&yr=,&pho=0&pets=0&parking=0&laundry=0&income-restricted=0&pnd=0&red=0&zso=0&days=any&ds=all&pmf=1&pf=1&zoom=3&rect=-134340820,16594081,-56469727,54952386&p=1&sort=globalrelevanceex&search=maplist&disp=1&listright=true&isMapSearch=true&zoom=3"
response = requests.get(page) # request the json file
json_response = json.loads(response.text) # parse the json file
soup = Soup(json_response['list']['listHTML'], 'html.parser')
and the soup has what you are looking for. If you explore the json, you will find a lot of useful information.
The list of all the URLs can be find with
links = [i.attrs['href'] for i in soup.findAll("a",{"class":"hdp-link"})]
All the URLs appears twice. If you want that they are unique, you can fix the list, or, otherwise, look for "hdp-link routable" in class above.
But, I always prefer more then less!

Categories