I keep getting a traceback saying AttributeError: 'NoneType' object has no attribute 'startswith' near the end of my script. Up to that point I am scraping several different listing pages, pulling them all into one list, and then scraping the final URL for each business page. I go to each_page, scrape all the 'a' tags off the page, and then I want to search through them and keep only the ones that start with '/401k/'. I know I could probably do this without adding the results to yet another list, because I feel like I already have too many. I was thinking of doing it like this:
for a in soup.findAll('a'):
    href = a.get('href')
    if href.startswith('/401k/'):
        final_url.append(href)
# Even when I try this I get the same error about a missing attribute
Either way it isn't getting the data, and I can't figure out what is going on. Maybe I've been looking at the screen too much.
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

hrefs = []
ratings = []
pages = []
s_names = []
final_url = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])

for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)

del ratings[0]
del ratings[27:]

for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if span:
        for a in span.find_all('a'):
            href = a.get('href')
            pages.append('https://www.brightscope.com' + href)
    else:
        pages.append(page.url)

hrefs = []
pages = set(pages)

for each_page in pages:
    page = requests.get(each_page)
    soup = BeautifulSoup(page.text, 'html.parser')
    for a in soup.findAll('a'):
        href = a.get('href')
        s_names.append(href)

# I am getting a traceback error AttributeError: 'NoneType' object has no attribute 'startswith' starting with the code below.
for each in s_names:
    if each.startswith('/401k'):
        final_url.append(each)
The problem you are facing is that you are calling startswith regardless of whether the value is present or not. You should first check whether the each variable has a value. Try this:
for each in s_names:
    if each and each.startswith('/401k'):
        final_url.append(each)
What the statement above does is first check whether the value is None. Only if it is not None does it move on to the startswith check; since and short-circuits, startswith is never called on a None value.
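Alternatively, you can avoid collecting None values in the first place. A minimal sketch, using BeautifulSoup's href=True filter so that only a tags that actually carry an href are matched:

# href=True matches only tags that have an href attribute,
# so .get('href') can never return None here.
for a in soup.find_all('a', href=True):
    href = a.get('href')
    if href.startswith('/401k/'):
        final_url.append(href)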
a tags are allowed to have no href in HTML5, so a.get('href') returns None for those tags. That's probably what you're experiencing.
What you need is to make sure you don't get None:
for a in soup.findAll('a'):
    href = a.get('href')
    if href is not None:
        s_names.append(href)
See the HTML 5.1 specification for more details: https://www.w3.org/TR/2016/REC-html51-20161101/textlevel-semantics.html#the-a-element
If the a element has no href attribute, then the element represents a placeholder for where a link might otherwise have been placed, if it had been relevant, consisting of just the element’s contents.
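A tiny self-contained demonstration of that behaviour (the HTML here is made up for illustration):

from bs4 import BeautifulSoup

# A named anchor with no href, next to a normal link.
html = '<a name="top">placeholder</a> <a href="/401k/acme">Acme 401k</a>'
soup = BeautifulSoup(html, 'html.parser')

print([a.get('href') for a in soup.find_all('a')])
# [None, '/401k/acme']

print([a['href'] for a in soup.find_all('a', href=True)])
# ['/401k/acme']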
Related
I am using beautifulsoup to get all the links from a page. My code is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.acontecaeventos.com.br/marketing-promocional-sao-paulo'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')
soup.find_all('href')
All that I get is:
[]
How can I get a list of all the href links on that page?
You are telling the find_all method to find href tags, not attributes.
You need to find the <a> tags, they're used to represent link elements.
links = soup.find_all('a')
Later you can access their href attributes like this:
link = links[0] # get the first link in the entire page
url = link['href'] # get value of the href attribute
url = link.get('href') # or like this
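Putting it together, a minimal end-to-end sketch (assuming the same URL and lxml parser as in the question):

import requests
from bs4 import BeautifulSoup

url = 'http://www.acontecaeventos.com.br/marketing-promocional-sao-paulo'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# Collect the href of every <a> tag; skip anchors that have none.
hrefs = [a.get('href') for a in soup.find_all('a') if a.get('href')]
print(hrefs)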
Replace your last line:
links = soup.find_all('a')
with this line:
links = [a.get('href') for a in soup.find_all('a', href=True)]
It will scrape all the a tags, and for each a tag it will append the href attribute to the links list.
If you want to know more about the for loop between the [], read up on list comprehensions.
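For reference, the comprehension above is equivalent to this explicit loop:

links = []
for a in soup.find_all('a', href=True):
    links.append(a.get('href'))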
To get a list of every href regardless of tag, use:
href_tags = soup.find_all(href=True)
hrefs = [tag.get('href') for tag in href_tags]
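Note that href=True matches any tag carrying an href, not just a tags. A quick illustration with made-up HTML:

from bs4 import BeautifulSoup

html = '<link href="style.css"><a href="/page">x</a><area href="/map">'
soup = BeautifulSoup(html, 'html.parser')

print([tag.get('href') for tag in soup.find_all(href=True)])
# ['style.css', '/page', '/map']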
For some reason I keep getting the following error when I run the fnmatch function.
Error: 'NoneType' object has no attribute 'replace'
It works when I try it with a single link, but doesn't work when I loop through an array and try to match every link in the array.
import requests
from fnmatch import fnmatch
from bs4 import BeautifulSoup

xml = requests.get("https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html").text
soup = BeautifulSoup(xml, 'html.parser')
all_links = [link.get('href') for link in soup.find_all('a')]
matched_links = [fnmatch(link, pattern) for link in all_links]  # pattern is defined elsewhere in my script
Not all <a> tags have href attributes. Your all_links probably has some None values in it, and fnmatch() can't do its shell-style pattern matching against None. This happens because some of those tags were probably being used as named anchors (<a name="whatever">) instead of links.
You could add a guard condition to your list comprehension to make sure these get filtered out.
all_links = [link.get('href') for link in soup.find_all('a') if link.get('href')]
Subsequently, you could also do the conditional check on your second comprehension.
matched_links = [fnmatch(link, pattern) for link in all_links if link]
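Putting both guards together, a complete sketch. The glob pattern here is a hypothetical example, and this version keeps the matching links themselves rather than the list of booleans the original comprehension produced, which is usually what you want:

import requests
from fnmatch import fnmatch
from bs4 import BeautifulSoup

# Hypothetical glob pattern; fnmatch uses shell-style wildcards, not regex.
pattern = '*governor*'

url = "https://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

# The guard filters out <a> tags that have no href at all.
all_links = [a.get('href') for a in soup.find_all('a') if a.get('href')]
matched_links = [link for link in all_links if fnmatch(link, pattern)]
print(matched_links)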
I am practicing building web scrapers. The one I am working on now involves going to a site, scraping links for the various cities on that site, then taking all of the links for each city and scraping all the links for the properties in those cities.
I'm using the following code:
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title") # Bottom of page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags.find_all("a")] # Links to cities
If I print out city_tags I get the HTML I want. However, when I print cities_links I get AttributeError: 'ResultSet' object has no attribute 'find_all'.
I gather from other questions on here that this error occurs because city_tags returns None, but that can't be the case if it is printing out the desired HTML? I have noticed that said HTML is wrapped in []. Does this make a difference?
Well, city_tags is a bs4.element.ResultSet (essentially a list) of tags, and you are calling find_all on it. You probably want to call find_all on every element of the ResultSet, or, in this specific case, just retrieve their href attributes:
import requests
from bs4 import BeautifulSoup
main_url = "http://www.chapter-living.com/"
# Getting individual cities url
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('a', class_="nav-title") # Bottom of page not loaded dynamically
cities_links = [main_url + tag["href"] for tag in city_tags] # Links to cities
As the error says, city_tags is a ResultSet, which is a list of nodes, and it doesn't have the find_all method. You either have to loop through the set and apply find_all to each individual node, or, in your case, simply extract the href attribute from each node:
[tag['href'] for tag in city_tags]
#['https://www.chapter-living.com/blog/',
# 'https://www.chapter-living.com/testimonials/',
# 'https://www.chapter-living.com/events/']
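If the matched nodes really were containers with links nested inside them, the loop-and-find_all route mentioned above would look like this (a sketch, pretending city_tags matched container elements rather than the <a> tags themselves):

nested_links = []
for node in city_tags:
    # find_all works on an individual Tag, just not on the ResultSet itself.
    for a in node.find_all('a'):
        nested_links.append(a.get('href'))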
I'm using beautifulsoup4 to parse a webpage and collect all the href values using this code:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

allProductInfo = soup.find_all("a", class_="name-link")
print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
linksList1 prints two of each link. I believe this is happening because it's taking the link from the title as well as from the item colour. I have tried a few things but cannot get BeautifulSoup to parse only the title link and give me a list with one of each link instead of two. I imagine it's something really simple but I'm missing it. Thanks in advance.
This code will give you the result without duplicates. (Using set() may also be a good idea, as @Tarum Gupta suggested.) But I changed the way you crawl:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

# Get all divs with the inner-article class, then search for an <a> with
# the name-link class that is inside an h1 tag
allProductInfo = soup.select("div.inner-article h1 a.name-link")
# print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
alldiv = soup.findAll("div", {"class": "inner-article"})
for div in alldiv:
    linksList1.append(div.h1.a['href'])

set(linksList1)        # use set() to remove duplicate links
list(set(linksList1))  # use list() to convert the set back to a list if you need one
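One caveat: set() does not preserve the order in which the links appeared on the page. If order matters, a common idiom (sketched below, relying on Python 3.7+ dicts keeping insertion order) is dict.fromkeys:

# Deduplicate while keeping first-seen page order (Python 3.7+).
linksList1 = list(dict.fromkeys(linksList1))
print(linksList1)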
I'm scraping a website and I don't want to print the same href twice, just once. I can't figure it out; could someone give me an intuition to follow?
import re
import requests
from bs4 import BeautifulSoup

url = "http://www.fveconstruction.ch/anMetier.asp?M=04&R=4&PageSize=1000&BoolsMember=0"
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, "html.parser")

for link in soup.find_all("a", href=re.compile('anDetails.asp')):
    href = link.get('href')
    # If statement?
    print(href)
For example, if I run this code, each href link comes out doubled. Is there an if statement that removes the duplicates and keeps one of each?
You don't need any conditional statement to do this. All you need is the set built-in to remove duplicates from the result, here via a set comprehension:
soup = BeautifulSoup(get_text, "html.parser")
links = {link['href'] for link in soup.find_all("a", href=re.compile('anDetails.asp'))}
print(links)
You can try using set() on the find_all result, but you'd most likely still have duplicates, since the Tag objects can differ while still containing the same href. In that case, just create a list and append each href to it, with an if condition that checks whether the href is already in the list before printing it. So you'd have:
href_list = []
for link in soup.find_all("a", href=re.compile('anDetails.asp')):
    href = link.get('href')
    if href not in href_list:
        print(href)
        href_list.append(href)
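A minor refinement (a sketch): tracking seen hrefs in a set avoids the linear scan that a list membership check performs, while printing in the same order:

seen = set()
for link in soup.find_all("a", href=re.compile('anDetails.asp')):
    href = link.get('href')
    if href not in seen:
        print(href)
        seen.add(href)  # set membership checks are O(1) on average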