Web scraping with Python and Beautiful Soup - python

I am practicing building web scrapers. One that I am working on now involves going to a site, scraping links for the various cities on that site, then taking all of the links for each of the cities and scraping all the links for the properties in said cites.
I'm using the following code:
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title") # Bottom page not loaded dynamycally
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")] # Links to cities
If I print out city_tags I get the HTML I want. However, when I print cities_links I get AttributeError: 'ResultSet' object has no attribute 'find_all'.
I gather from other q's on here that this error occurs because city_tags returns none, but this can't be the case if it is printing out the desired html? I have noticed that said html is in [] - does this make a difference?

Well city_tags is a bs4.element.ResultSet (essentially a list) of tags and you are calling find_all on it. You probably want to call find_all in every element of the resultset or in this specific case just retrieve their href attribute
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title") # Bottom page not loaded dynamycally
cities_links = [main_url + tag["href"] for tag in city_tags] # Links to cities

As the error says, the city_tags is a ResultSet which is a list of nodes and it doesn't have the find_all method, you either have to loop through the set and apply find_all on each individual node or in your case, I think you can simply extract the href attribute from each node:
[tag['href'] for tag in city_tags]
#['https://www.chapter-living.com/blog/',
# 'https://www.chapter-living.com/testimonials/',
# 'https://www.chapter-living.com/events/']

Related

How to print the first google search result link using bs4?

I'm a beginner in python, I'm trying to get the first search result link from google which was stored inside a div with class='yuRUbf' using beautifulsoup. When I run the script output is 'None' what is the error here.
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result=requests.get( url )
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
productDivs = soup.find("div", {"class": "yuRUbf"})
print(productDivs)
Let's see:
from bs4 import BeautifulSoup
import requests, json
headers = {
'User-agent':
"useragent"
}
html = requests.get('https://www.google.com/search?q=hello', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# locating div element with a tF2Cxc class
# calling for <a> tag and then calling for 'href' attribute
link = soup.find('div', class_='tF2Cxc').a['href']
print(link)
output:
'''
https://www.youtube.com/watch?v=YQHsXMglC9A
As you want first google search in which class name which you are looking for might be differ with name so first you can first find manually that link so it will be easy to identify
import requests
import bs4
url = 'https://www.google.com/search?q=site%3Astackoverflow.com+how+to+use+bs4+in+python&sxsrf=AOaemvKrCLt-Ji_EiPLjcEso3DVfBUmRbg%3A1630215433722&ei=CR0rYby7K7ue4-EP7pqIkAw&oq=site%3Astackoverflow.com+how+to+use+bs4+in+python&gs_lcp=Cgdnd3Mtd2l6EAM6BwgAEEcQsAM6BwgjELACECc6BQgAEM0CSgQIQRgAUMw2WPh_YLiFAWgBcAJ4AIABkAKIAd8lkgEHMC4xMC4xM5gBAKABAcgBCMABAQ&sclient=gws-wiz&ved=0ahUKEwj849XewdXyAhU7zzgGHW4NAsIQ4dUDCA8&uact=5'
request_result=requests.get( url )
soup = bs4.BeautifulSoup(request_result.text,"html.parser")
Using select method:
I have used css selector method in which it identifies all matching
divs and from list i have taken from index postion 1
And than i have use select_one to get a tag and find href
according to it!
main_data=soup.select("div.ZINbbc.xpd.O9g5cc.uUPGi")[1:]
main_data[0].select_one("a")['href'].replace("/url?q=","")
Using find method:
main_data=soup.find_all("div",class_="ZINbbc xpd O9g5cc uUPGi")[1:]
main_data[0].find("a")['href'].replace("/url?q=","")
Output [Same for Both the Case]:
'https://stackoverflow.com/questions/23102833/how-to-scrape-a-website-which-requires-login-using-python-and-beautifulsoup&sa=U&ved=2ahUKEwjGxv2wytXyAhUprZUCHR8mBNsQFnoECAkQAQ&usg=AOvVaw280R9Wlz2mUKHFYQUOFVv8'

BeautifulSoup - How to extract info from deeper layers of html

I'm using BeautifulSoup to scrape some real estate data and having trouble getting to what I need which are several href links that are deep in the .
http://www.mls.com/Search/New-York.mvc
To make the code stable, I've started with a parent that is two steps above my target that I need:
area_links = soup.findAll('ul', class_="sub-section-list", limit=2)
now I have a ResultSet element but have failed in getting anything but errors out of it.
I've tried a number of arguments using area_links.findAll and findAllNext
I need to extract the links to the different metro areas so I can then dig into those.
I prefer concise css selectors to target the a tags of interest:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.mls.com/Search/New-York.mvc')
soup = bs(r.content, 'lxml')
links = ['http://www.mls.com' + i['href'] for i in soup.select('.sub-section-list a')]
print(links)
For yours, you need to loop the returned list and find the child a tags and extract the href attributes:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.mls.com/Search/New-York.mvc')
soup = bs(r.content, 'lxml')
area_links = soup.find_all('ul', class_="sub-section-list", limit=2)
for area in area_links:
print(['http://www.mls.com' + i['href'] for i in area.find_all('a')])

Treating a list of items like a single item error: how to find links within each 'link' within string already scraped

I am writing a python code to scrape the pdfs of meetings off this website: https://www.gmcameetings.co.uk
The pdf links are within links, which are also within links. I have the first set of links off the page above, then I need to scrape links within the new urls.
When I do this I get the following error:
AttributeError: ResultSet object has no attribute 'find_all'. You're
probably treating a list of items like a single item. Did you call
find_all() when you meant to call find()?
This is my code so far which is all fine and checked in jupyter notebook:
# importing libaries and defining
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# creating folder to store pfds - if not create seperate folder
folder_location = r'E:\Internship\WORK'
# getting all meeting href off url
meeting_links = soup.find_all('a',href='TRUE')
for link in meeting_links:
print(link['href'])
if link['href'].find('/meetings/')>1:
print("Meeting!")
This is the line that then receives the error:
second_links = meeting_links.find_all('a', href='TRUE')
I have tried the find() as python suggests but that doesn't work either. But I understand that it can't treat meeting_links as a single item.
So basically, how do you search for links within each bit of the new string variable (meeting_links).
I already have code to get the pdfs once I have the second set of urls which seems to work fine but need to obviously get these first.
Hopefully this makes sense and I've explained ok - I only properly started using python on Monday so I'm a complete beginner.
To get all meeting links try
from bs4 import BeautifulSoup as bs
import requests
# set url
url = "https://www.gmcameetings.co.uk/"
# grab html
r = requests.get(url)
page = r.text
soup = bs(page,'lxml')
# Scrape to find all links
all_links = soup.find_all('a', href=True)
# Loop through links to find those containing '/meetings/'
meeting_links = []
for link in all_links:
href = link['href']
if '/meetings/' in href:
meeting_links.append(href)
print(meeting_links)
The .find() function that you use in your original code is specific to beautiful soup objects. To find a substring within a string, just use native Python: 'a' in 'abcd'.
Hope that helps!

Search in Sub-Pages of a Main Webpage using BeautifulSoup

I am trying to search for a div with class = 'class', but I need to find all matches in the mainpage as well as in the sub (or children) pages. How can I do this using BeautifulSoup or anything else?
I have found the closest answer in this search
Search the frequency of words in the sub pages of a webpage using Python
but this method only retrieved partial result, the page of interest has many more subpages. Is there another way of doing this?
My code so far:
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
subpages = []
for anchor in soup.find_all('a', href=True):
string = 'https://www.mainpage.nl/'+str(anchor['href'])
subpages.append(string)
for subpage in subpages:
try:
soup_sub = BeautifulSoup(requests.get(subpage).content, 'html.parser')
promotie = soup_sub.find_all('strong', class_='c-action-banner__subtitle')
if len(promotie) > 0:
print(promotie)
except Exception:
pass
Thanks!

How to extract the link of the first gallery on imgur using BS4 and urllib

I am trying to extract the gallery link of the first result on an imgur search.
theurl = "https://imgur.com/search?q=" +text
thepage = urlopen(theurl)
soup = BeautifulSoup(thepage,"html.parser")
link = soup.findAll('a',{"class":"image-list-link"})[0].decode_contents()
Here is what is being displayed for link:
I am mainly trying to get the href value from only this section (the first result for the search)
Here is what the inspect element looks like:
Actually, it's pretty easy to accomplish what you're trying to do. As shown in the image, the href of first image (or any image for that matter) is located inside the <a> tag with the attribute class="image-list-link". So, you can use the find() function, which returns the first match found. And then, use ['href'] to get the link.
Code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://imgur.com/search?q=python')
soup = BeautifulSoup(r.text, 'lxml')
first_image_link = soup.find('a', class_='image-list-link')['href']
print(first_image_link)
# /gallery/AxKwQ2c
If you want to get the links for all the images, you can use a list comprehension.
all_image_links = [a['href'] for a in soup.find_all('a', class_='image-list-link')]

Categories