How do I further narrow my search with BeautifulSoup? - python

So I'm just learning how to scrape websites for data, and I have one I want to scrape to populate a database to do something with for practice. The site doesn't have all the information posted in one big page i can scrape, but instead has the information broken up into multiple "sets", with each set having its own page/collection of the data I want to scrape. All the "sets" are listed on a singular page though, with each set listed having the link to its individual page. I figured my best bet would be to scrape the "sets" page for their URL, and then request through to the "set" pages to collect the data I'm trying to get. Checking the html, each set is listed in a container, with the URL being the first thing listed within each section, like this:
<td class="flexbox">
<a href="url_i_need">
<more stuff i don't need>
</td>
<repeats_as_above_for_next_set>
what I've tried is:
response = requests.get('site_url')
content = response.content
soup = BeautifulSoup(content, 'html.parser')
data = soup.find_all('td', 'flexbox')
This seems to do the trick scraping each of the TD sections, but nothing I try let me further skim through the data to just the portion I need. After narrowing my search down to the general section I care about, how do I scrape the URL of each of those sections?

You can loop over data and for each item find nested tag a:
for item in data:
link = item.find('a')
url = link['href']
Btw is this line correct?
soup = BeautifulSoup.find_all(content, 'html.parser')
Standard way is this one:
soup = BeautifulSoup(content, 'html.parser')

Related

How do I use Bs4 to pull similar information but from different places in DOM hierarchy?

I'm trying to scrape information from a series of pages from like these two:
https://www.nysenate.gov/legislation/bills/2019/s240
https://www.nysenate.gov/legislation/bills/2019/s8450
What I want to do is build a scraper that can pull down the text of "See Assembly Version of this Bill". In the two links listed above, the classes are the same but for one page it's the only iteration of that class, but for another it's the third.
I'm trying to make something like this work:
assembly_version = soup.select_one(".bill-amendment-detail content active > dd")
print(assembly_version)
But I keep getting None
Any thoughts?
url = "https://www.nysenate.gov/legislation/bills/2019/s11"
raw_html = requests.get(url).content
soup = BeautifulSoup(raw_html, "html.parser")
assembly_version = soup.find(class_="c-block c-bill-section c-bill--details").find("a").text.strip()
print(assembly_version)

Accessing all elements from main website page with Beautiful Soup

I want to scrape news from this website:
https://www.bbc.com/news
You can see that website has categories such as Home, US Election, Coronavirus etc.
For example, If I go to specific news article such as:
https://www.bbc.com/news/election-us-2020-54912611
I can write a scraper that will give me the headline, this is the code:
from bs4 import BeautifulSoup
response = requests.get("https://www.bbc.com/news/election-us-2020-54912611", headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.select("header h1")
print(title)
On this website there are hundreds of news, so my question is, Is there a way to access each news article thats on the website (all categories) from the home page url? On home page I cant see all news articles, I can see only some of them, so is there a way for me to load whole HTML code for whole website, so that I can easily get all news headlines with:
soup.select("header h1")
Ok, then after getting this headlines you can also have another links in this page, you just again open that links and fetch information from that links it can look like this:
visited = set()
links = [....]
while links:
if link_for_fetch in visited:
continue
link_for_fetch = links.pop()
content = get_contents(link_for_fetch)
headlines += parse_headlines()
links += parse_links()
visited.add(link_for_fetch)
it's just pseudocode, you can write in any programming language. but this can take a lot of time for parsing whole site :( and robots can block your ip address

Webscraping in Python (beautifulsoup)

I am trying to webscrape and am currently stuck on how I should continue with the code. I am trying to create a code that scrapes the first 80 Yelp! reviews. Since there are only 20 reviews per page, I am also stuck on figuring out how to create a loop to change the webpage to the next 20 reviews.
from bs4 import BeautifulSoup
import requests
import time
all_reviews = ''
def get_description(pullman):
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
page_title = soup.title.text
#get a tag content based on class
p_tag = soup.find_all('p',class_='lemon--p__373c0__3Qnnj text__373c0__2pB8f comment__373c0__3EKjH text-color--normal__373c0__K_MKN text-align--left__373c0__2pnx_')[0]
#print the text within the tag
return p_tag.text
General notes/tips:
Use the "Inspect" tool on pages you want to scrape.
As for your question, its also going to work much nicer if you visit the website and parse BeautifulSoup and then use the soup object in functions - visit once, parse as many times as you want. You won't be blacklisted by websites as often this way. An example structure below.
url = f'https://www.yelp.com/biz/pullman-bar-and-diner-iowa-city'
# get webpage data from url
response = requests.get(url)
#sleep for 2 seconds
time.sleep(2)
# get html document from web page data
html_doc = response.text
# parser
soup = BeautifulSoup(html_doc, "lxml")
get_description(soup)
get_reviews(soup)
If you inspect the page, each review appears as a copy of a template. If you take each review as an individual object and parse it, you can get the reviews you are looking for. The review template has the class id:lemon--li__373c0__1r9wz u-space-b3 u-padding-b3 border--bottom__373c0__uPbXS border-color--default__373c0__2oFDT
As for pagination, the pagination numbers are contained in a template with class="lemon--div__373c0__1mboc pagination-links__373c0__2ZHo6 border-color--default__373c0__2oFDT nowrap__373c0__1_N1j"
The individual page number links are contained within a-href tags, so just write a for loop to iterate over the links.
To get the next page, you're going to have to follow the "Next" link. The problem here is that the link is just the same as before plus #. Open the Inspector [Ctrl-Shift-I in Chrome, Firefox] and switch to the network tab, then click the next button, you'll see a request to something like:
https://www.yelp.com/biz/U4mOl3TRbaJ9-bgTQ1d6fw/review_feed?rl=en&sort_by=relevance_desc&q=&start=40
which looks something like:
{"reviews": [{"comment": {"text": "Such a great experience every time you come into this place...
This is JSON. The only problem is that you'll need to fool Yelp's servers into thinking you're browsing the website, by sending their headers to them, otherwise you get different data that doesn't look like comments.
They look like this in Chrome
My usual approach is to copy-paste the headers not prefixed with a colon (ignore :authority, etc) directly into a triple-quoted string called raw_headers, then run
headers = dict([[h.partition(':')[0], h.partition(':')[2]] for h in raw_headers.split('\n')])
over them, and pass them as an argument to requests with:
requests.get(url, headers=headers)
Some of the headers won't be necessary, cookies might expire, and all sorts of other issues might arise but this at least gives you a fighting chance.

Scraping with Python. Can't get wanted data

I am trying to scrape website, but I encountered a problem. When I try to scrape data, it looks like the html differs from what I see on google inspect and from what I get from python. I get this with http://edition.cnn.com/election/results/states/arizona/house/01 I tried to scrape election results. I used this script to check HTML part of the webpage, and I noticed that they different. There is no classes that I need, like section-wrapper.
page =requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content, "lxml")
print(soup)
Anyone knows what is the problem ?
http://data.cnn.com/ELECTION/2016/AZ/county/H_d1_county.json
This site use JavaScript fetch data, you can check the url above.
You can find this url in chrome dev-tools, there are many links, check it out
Chrome >>F12>> network tab>>F5(refresh page)>>double click the .josn url>> open new tab
import requests
from bs4 import BeautifulSoup
page=requests.get('http://edition.cnn.com/election/results/states/arizona/house/01')
soup = BeautifulSoup(page.content)
#you can try all sorts of tags here I used class: "ad" and class:"ec-placeholder"
g_data = soup.find_all("div", {"class":"ec-placeholder"})
h_data = soup.find_all("div"),{"class":"ad"}
for item in g_data:print item
#print '\n'
#for item in h_data:print item

Website Scraping Specific Forms

For an extra curricular school project, I'm learning how to scrape a website. As you can see by the code below, I am able to scrape a form called, 'elqFormRow' off of one page.
How would one go about scraping all occurrences of the 'elqFormRow' on the whole website? I'd like to return the URL of where that form was located into a list, but am running into trouble while doing so because I don't know how lol.
import bs4 as bs
import urllib.request
sauce = urllib.request.urlopen('http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage').read()
soup = bs.BeautifulSoup(sauce, 'lxml')
for div in soup.find_all('div', class_='elqFormRow'):
print(div.text.strip())
You can grab the URLs from a page and follow them to (presumably) scrape the whole site. Something like this, which will require a little massaging depending on where you want to start and what pages you want:
import bs4 as bs
import requests
domain = "engage.hpe.com"
initial_url = 'http://engage.hpe.com/Template_NGN_Convert_EG-SW_Combined_TEALIUM-RegPage'
# get urls to scrape
text = requests.get(initial_url).text
initial_soup = bs.BeautifulSoup(text, 'lxml')
tags = initial_soup.findAll('a', href=True)
urls = []
for tag in tags:
if domain in tag:
urls.append(tag['href'])
urls.append(initial_url)
print(urls)
# function to grab your info
def scrape_desired_info(url):
out = []
text = requests.get(url).text
soup = bs.BeautifulSoup(text, 'lxml')
for div in soup.find_all('div', class_='elqFormRow'):
out.append(div.text.strip())
return out
info = [scrape_desired_info(url) for url in urls if domain in url]
URLlib stinks, use requests. If you need to go multiple levels down in the site put the URL finding section in a function and call it X number of times, where X is the number of levels of links you want to traverse.
Scrape responsibly. Try not to get into a sorcerer's apprentice situation where you're hitting the site over and over in a loop, or following links external to the site. In general, I'd also not put in the question the page you want to scrape.

Categories