For some reason I keep getting the following error when I run the fnmatch function.
Error: 'NoneType' object has no attribute 'replace'
It works when I try it with a single link, but doesn't work when I loop through an array and try to match every link in the array.
from fnmatch import fnmatch

import requests
from bs4 import BeautifulSoup

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
soup = BeautifulSoup(xml, 'html.parser')
all_links = [link.get('href') for link in soup.find_all('a')]
matched_links = [fnmatch(link, pattern) for link in all_links]  # pattern is defined elsewhere in the script
Not all <a> tags have href attributes, so your all_links list probably contains some None values, which fnmatch() can't match (it does shell-style wildcard matching on strings, and a None blows up inside it). This happens because some of those tags were probably being used as named anchors, <a name="whatever">, instead of links.
You could add a guard condition to your list comprehension to make sure these get filtered out.
all_links = [link.get('href') for link in soup.find_all('a') if link.get('href')]
Alternatively, you could put the same guard on your second comprehension instead.
matched_links = [fnmatch(link, pattern) for link in all_links if link]
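Putting both fixes together, a minimal end-to-end sketch. The pattern below is a made-up placeholder, since the question doesn't show the real one. Note also that fnmatch() returns True or False, so if the goal is the matching links themselves, use it as a filter rather than mapping it:

from fnmatch import fnmatch

import requests
from bs4 import BeautifulSoup

pattern = '*/epolls/*'  # hypothetical placeholder pattern, substitute your own

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
soup = BeautifulSoup(xml, 'html.parser')

# The guard drops <a> tags without an href, so no None value reaches fnmatch
all_links = [link.get('href') for link in soup.find_all('a') if link.get('href')]
matched_links = [link for link in all_links if fnmatch(link, pattern)]
print(matched_links)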
I am writing Python code to scrape the PDFs of meetings off this website: https://www.gmcameetings.co.uk
The PDF links are nested inside links, which are themselves inside links. I have the first set of links off the page above; next, I need to scrape links from within those new URLs.
When I do this I get the following error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
This is my code so far, which is all fine and checked in a Jupyter notebook:
# importing libraries and defining
import requests
import urllib.request
import time
from bs4 import BeautifulSoup as bs

# set url
url = "https://www.gmcameetings.co.uk/"

# grab html
r = requests.get(url)
page = r.text
soup = bs(page, 'lxml')

# creating folder to store pdfs - if not, create separate folder
folder_location = r'E:\Internship\WORK'

# getting all meeting hrefs off url
meeting_links = soup.find_all('a', href='TRUE')
for link in meeting_links:
    print(link['href'])
    if link['href'].find('/meetings/') > 1:
        print("Meeting!")
This is the line that then receives the error:
second_links = meeting_links.find_all('a', href='TRUE')
I have tried find() as Python suggests, but that doesn't work either; I do understand that it can't treat meeting_links as a single item.
So basically: how do you search for links within each item of the new variable (meeting_links)?
I already have code to get the PDFs once I have the second set of URLs, and that seems to work fine, but obviously I need to get those URLs first.
Hopefully this makes sense and I've explained it OK - I only properly started using Python on Monday, so I'm a complete beginner.
To get all the meeting links, try:
from bs4 import BeautifulSoup as bs
import requests

# set url
url = "https://www.gmcameetings.co.uk/"

# grab html
r = requests.get(url)
page = r.text
soup = bs(page, 'lxml')

# scrape all links that actually have an href attribute
all_links = soup.find_all('a', href=True)

# loop through the links to keep those containing '/meetings/'
meeting_links = []
for link in all_links:
    href = link['href']
    if '/meetings/' in href:
        meeting_links.append(href)

print(meeting_links)
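To then follow each of those meeting links and pull the document links off each page, a rough sketch. It assumes the PDF links end in .pdf, which you would want to verify against the real markup:

from urllib.parse import urljoin

pdf_links = []
for meeting_link in meeting_links:
    meeting_url = urljoin(url, meeting_link)  # handles relative and absolute hrefs
    meeting_page = bs(requests.get(meeting_url).text, 'lxml')
    for a in meeting_page.find_all('a', href=True):
        # assumption: the document links end in .pdf, adjust to the real markup
        if a['href'].lower().endswith('.pdf'):
            pdf_links.append(urljoin(meeting_url, a['href']))
print(pdf_links)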
One subtlety with the .find() in your original code: BeautifulSoup objects have a .find() that searches the parse tree, but strings also have a .find() method, which returns the index of a substring (-1 if absent). That means link['href'].find('/meetings/') > 1 silently misses hrefs where the substring starts at position 0 or 1. To test for a substring in native Python, use the in operator: 'a' in 'abcd'.
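A quick illustration of the difference:

href = '/meetings/2019/july'
print(href.find('/meetings/'))  # 0, so the > 1 test fails despite a match
print('/meetings/' in href)     # True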
Hope that helps!
I keep getting a traceback error saying AttributeError: 'NoneType' object has no attribute 'startswith' when I get to the end of my script. What I am doing up to this point is scraping all kinds of different pages, then pulling those pages into one list and scraping the final URL for each business page. I went to each_page and scraped all the <a> tags off the page, and now I want to search through them and keep only the ones that start with '/401k/'. I know I could probably do it without adding them to yet another list, because I feel like I have too many already. I was thinking of doing it like this:
for a in soup.findAll('a'):
    href = a.get('href')
    if href.startswith('/401k/'):
        final_url.append(href)
# Even when I try this I get the same 'NoneType' has no attribute error
Either way it isn't getting the data, and I can't figure out what is going on. Maybe I've been looking at the screen too much.
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

hrefs = []
ratings = []
pages = []
s_names = []
final_url = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])

for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)

del ratings[0]
del ratings[27:]

for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if soup.find('span', class_='letter-pages'):
        for a in span.find_all('a'):
            href = a.get('href')
            pages.append('https://www.brightscope.com' + href)
    else:
        pages.append(page.url)

hrefs = []
pages = set(pages)

for each_page in pages:
    page = requests.get(each_page)
    soup = BeautifulSoup(page.text, 'html.parser')
    for a in soup.findAll('a'):
        href = a.get('href')
        s_names.append(href)

# I am getting a traceback error AttributeError: 'NoneType' object has no
# attribute 'startswith' starting with the code below.
for each in s_names:
    if each.startswith('/401k'):
        final_url.append(each)
The problem you are facing is that you call startswith regardless of whether the value is present. You should first check that the each variable actually has a value. Try this:
for each in s_names:
    if each and each.startswith('/401k'):
        final_url.append(each)
What the condition above does is first check whether the value is None (or otherwise empty). Only if it has a value does it move on to the startswith check.
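A quick demonstration of why that works: Python's and operator short-circuits, so the right-hand side is never evaluated when the left side is falsy.

each = None
# No AttributeError: startswith is never called because each is falsy
print(each and each.startswith('/401k'))  # prints None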
<a> tags are allowed to have no href attribute in HTML5, so a.get('href') returns None for them. That's probably what you're experiencing.
What you need is to make sure you don't get None:
for a in soup.findAll('a'):
    href = a.get('href')
    if href is not None:
        s_names.append(href)
See the HTML 5.1 spec for more details: https://www.w3.org/TR/2016/REC-html51-20161101/textlevel-semantics.html#the-a-element
If the a element has no href attribute, then the element represents a placeholder for where a link might otherwise have been placed, if it had been relevant, consisting of just the element’s contents.
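A related shortcut: BeautifulSoup can exclude those placeholder anchors at search time by passing href=True, so None never enters the list at all. A sketch of the loop from the question with that filter:

for a in soup.find_all('a', href=True):  # only anchors that actually have an href
    s_names.append(a['href'])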
I am practicing building web scrapers. One that I am working on now involves going to a site, scraping links for the various cities on that site, then taking all of the links for each of the cities and scraping all the links for the properties in said cities.
I'm using the following code:
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title")  # bottom of page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")] # Links to cities
If I print out city_tags I get the HTML I want. However, when I print cities_links I get AttributeError: 'ResultSet' object has no attribute 'find_all'.
I gather from other questions on here that this error occurs because city_tags is None, but that can't be the case if it is printing out the desired HTML? I have noticed that said HTML is wrapped in [] - does this make a difference?
Well, city_tags is a bs4.element.ResultSet (essentially a list) of tags, and you are calling find_all on it. You probably want to call find_all on every element of the ResultSet, or in this specific case just retrieve each tag's href attribute:
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title")  # bottom of page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags] # Links to cities
As the error says, city_tags is a ResultSet, which is a list of nodes, and it doesn't have a find_all method. You either have to loop through the set and apply find_all to each individual node or, in your case, simply extract the href attribute from each node:
[tag['href'] for tag in city_tags]
#['https://www.chapter-living.com/blog/',
# 'https://www.chapter-living.com/testimonials/',
# 'https://www.chapter-living.com/events/']
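One caveat with main_url + tag["href"] in both snippets: the sample output above shows these hrefs are already absolute URLs, so prefixing main_url would double up the domain. urllib.parse.urljoin handles both cases; a sketch:

from urllib.parse import urljoin

# urljoin leaves absolute hrefs untouched and resolves relative ones against main_url
cities_links = [urljoin(main_url, tag['href']) for tag in city_tags]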
I'm scraping a website and I don't want to print the same href twice, only once. I can't figure it out; could someone give me an approach to follow?
url = "http://www.fveconstruction.ch/anMetier.asp?M=04&R=4&PageSize=1000&BoolsMember=0"
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, "html.parser")
for link in soup.find_all("a", href=re.compile('anDetails.asp')):
href = link.get('href')
#If statement ?
print(href)
For example, if I run the code here, each href link is printed twice. Is there an if statement that would skip the duplicate and keep only one of them?
You don't need any conditional statement to do this. All you need is to use the built-in set type to remove duplicates from the result, here via a set comprehension:
soup = BeautifulSoup(get_text, "html.parser")
links = {link['href'] for link in soup.find_all("a", href=re.compile('anDetails.asp'))}
print(links)
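One caveat: sets are unordered, so the links may not print in page order. If a stable order matters, sort them when printing:

print(sorted(links))  # alphabetical, reproducible order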
You can try using set() on the find_all result, but you'd most likely still have duplicates, since two tag objects can differ while containing the same href.
In that case, just create a list and append each href to it. Then add an if condition that checks whether the href is already in the list before printing it.
So you'd have:
href_list = []
for link in soup.find_all("a", href=re.compile('anDetails.asp')):
    href = link.get('href')
    if href not in href_list:
        print(href)
        href_list.append(href)
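If the page has many links, the same pattern with a set instead of a list gives constant-time membership checks while still printing in document order. A sketch reusing the soup from above:

seen = set()
for link in soup.find_all("a", href=re.compile('anDetails.asp')):
    href = link.get('href')
    if href not in seen:  # set lookup is O(1), unlike membership tests on a list
        print(href)
        seen.add(href)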
Do you know why the first example in the BeautifulSoup tutorial http://www.crummy.com/software/BeautifulSoup/documentation.html#QuickStart gives AttributeError: 'NavigableString' object has no attribute 'name'? According to this answer, the space characters in the HTML cause the problem. I tried with the sources of a few pages: one worked, but the others gave the same error (I removed spaces). Can you explain what "name" refers to and why this error happens? Thanks.
Just ignore NavigableString objects while iterating through the tree:
from bs4 import BeautifulSoup, NavigableString, Tag
import requests

response = requests.get(url)  # url: whatever page you are inspecting
soup = BeautifulSoup(response.text, 'html.parser')

for body_child in soup.body.children:
    if isinstance(body_child, NavigableString):
        continue
    if isinstance(body_child, Tag):
        print(body_child.name)
name refers to the name of the tag when the object is a Tag (e.g. for <html>, name is "html").
If you have spaces in your markup between nodes, BeautifulSoup turns them into NavigableStrings. So if you use the index of .contents to grab nodes, you might grab a NavigableString instead of the next Tag.
To avoid this, query for the node you are looking for (see "Searching the Parse Tree" in the docs). Or, if you know the name of the next tag you want, you can use that name as a property; it returns the first Tag with that name, or None if no child with that name exists (see "Using Tag Names as Members").
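A minimal sketch of that attribute-style access (assuming soup comes from a parsed page with a <div> in its body):

first_div = soup.body.div  # first <div> inside <body>, or None if there is none
if first_div is not None:
    print(first_div.name)  # prints "div"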
If you want to use .contents, you have to check what kind of object you are working with. The error you are getting just means the code tried to access the name property on something it assumed was a Tag.
You can use try/except to skip the iterations where a NavigableString is encountered in the loop. Note that NavigableString is not an exception class, so you catch the AttributeError it provokes instead:
for j in soup.find_all(...):
    try:
        print(j.find(...))
    except AttributeError:
        # raised when j is a NavigableString rather than a Tag
        pass
This is the latest working code to obtain the names of the tags in the soup.
from bs4 import BeautifulSoup, Tag
import requests

res = requests.get(url).content  # url: whatever page you are inspecting
soup = BeautifulSoup(res, 'lxml')

for child in soup.body.children:
    if isinstance(child, Tag):  # skip NavigableStrings between tags
        print(child.name)
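One small note: the 'lxml' parser used here requires the third-party lxml package (pip install lxml). If it isn't installed, the standard library parser behaves the same for this snippet:

soup = BeautifulSoup(res, 'html.parser')  # stdlib parser, no extra install needed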