Remove an href if the same one has already been scraped - python

I'm scraping a website and I don't want to print the same href twice, only once. I can't figure it out; could someone give me an approach to follow?
import re
import requests
from bs4 import BeautifulSoup

url = "http://www.fveconstruction.ch/anMetier.asp?M=04&R=4&PageSize=1000&BoolsMember=0"
get_url = requests.get(url)
get_text = get_url.text
soup = BeautifulSoup(get_text, "html.parser")

for link in soup.find_all("a", href=re.compile('anDetails.asp')):
    href = link.get('href')
    # If statement?
    print(href)
For example, if I run this code, each href link appears twice. Is there an if statement that would keep only one of them?

You don't need any conditional statement to do this. All you need is to use the built-in set to remove duplicates from the result.
soup = BeautifulSoup(get_text, "html.parser")
links = {link['href'] for link in soup.find_all("a", href=re.compile('anDetails.asp'))}
print(links)
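If you want one link per line instead of the printed set, you can iterate over it (a small usage sketch; sorting is only there to make the output order deterministic):
# print each unique href on its own line
for href in sorted(links):
    print(href)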

You can try using set() on the result of find_all, but you'd most likely still have duplicates, since the tag objects can differ while containing the same href.
In that case, create a list and append each href to it.
Then you can use an if condition to check whether the href is already in the list before printing it.
So you'd have:
href_list = []
for link in soup.find_all("a", href=re.compile('anDetails.asp')):
    href = link.get('href')
    if href not in href_list:
        print(href)
        href_list.append(href)
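Checking membership in a list is O(n) per lookup, so with many links a common variation (just a sketch, not part of the original answer) is to track the hrefs you have already seen in a set while still keeping first-seen order:
seen = set()          # O(1) membership checks
unique_hrefs = []     # keeps first-seen order
for link in soup.find_all("a", href=re.compile('anDetails.asp')):
    href = link.get('href')
    if href not in seen:
        seen.add(href)
        unique_hrefs.append(href)
        print(href)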

Related

Beginner - Web scraping - Download data

I want to download this data: https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/index.html
So far, I am able to get the links from the p tags, and those are the month links, but the challenge is that under each of those links there are up to 31 files (one per day). I tried several methods from Stack Overflow to get the h2 headings:
import requests
from bs4 import BeautifulSoup

url = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html"
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")

# Attempt 1: grab the h2 month headings
headings = soup.find_all('h2')
print(headings)

# Attempt 2: print every href on the page
print("The href links are :")
for link in soup.find_all('a'):
    print(link.get('href'))

# Attempt 3: keep only hrefs from a tags that have text
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
links_with_text
and here is the output (only pasting the last one):
['https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#December',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#November',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#October',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#September',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#August',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#July',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#June',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#May',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#April',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#March',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#February',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#January']
My question is that these are the h2 tag headings, and I further need the links from under each h2 tag, which are stored in the a tags that follow.
The program above does give me all the links, but if I could get them in an organized way, or in any other way that makes it easier to store the data straight from the HTML pages, that would be great. I will appreciate any help. Thank you!
Find all the h2 tags and loop over them. The h2 tag itself has no link data, so we have to find the next tag; the find_next method is used to get the p tag that follows it.
Then we find all the a tags inside that p with find_all, done in one line of code, which returns a list of links.
While looping over them we extract only the href part, but there is a catch: the href is not correct as-is, it contains something like 20001231ps.html while we need 20041231ps.html, which is why the string is fixed by replacing "2000" with "2004" and prepending the base URL.
I have used dict1, where the month name is the key and the list of links is the value, so it is easy to extract.
Code:
months = soup.find_all("h2")
dict1 = {}
main_url = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004"
for month in months:
    # Fix each day href (20001231ps.html -> 20041231ps.html) and make it absolute
    dict1[month.text] = [main_url + "/" + link['href'].replace("2000", "2004")
                         for link in month.find_next("p").find_all("a")]
Output:
{'December': ['https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041231ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041230ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041229ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041228ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041227ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041226ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041225ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041224ps.html',
.....
]}
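Since the original goal was to download these files, here is a minimal follow-up sketch (not part of the answer above; the output directory name is an assumption) showing how dict1 could be used to save each daily report:
import os
import requests

os.makedirs("nrc_2004", exist_ok=True)
for month, day_links in dict1.items():
    for day_url in day_links:
        resp = requests.get(day_url)
        if resp.ok:
            # save each daily report under its original filename, e.g. 20041231ps.html
            filename = day_url.rsplit("/", 1)[-1]
            with open(os.path.join("nrc_2004", filename), "w", encoding="utf-8") as f:
                f.write(resp.text)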

How to get rid of duplicate links using python

I'm new at coding and I'm trying to scrape all unique web links from https://www.census.gov/programs-surveys/popest.html. I've tried to put the links into a set, but the output comes back as {'/'}. I don't know any other way to get rid of duplicates. Below is my code. Thank you for your help.
from bs4 import BeautifulSoup
import urllib
import urllib.request
import requests

with urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html') as response:
    html = response.read()

soup = BeautifulSoup(html, 'html.parser')
for link in soup.find_all('a', href=True):
    links = (link['href'])
    link = str(link.get('href'))
    if link.startswith('https'):
        print(link)
    elif link.endswith('html'):
        print(link)

unique_links = set(link)
print(unique_links)
Let's say all the links are stored in a list called links1. Here is how you can remove duplicates without the use of set():
links2 = []
for link in links1:
    if link not in links2:
        links2.append(link)
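Equivalently (a sketch, not from the original answer), dict.fromkeys gives the same order-preserving de-duplication in one line, since dicts keep insertion order in Python 3.7+:
links2 = list(dict.fromkeys(links1))  # removes duplicates, keeps first-seen order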
Your set only contains the final link; declare the set() earlier, then add to it inside the loop.
unique_links = set()
for link in soup.find_all('a', href=True):
    link = str(link.get('href'))
    if link.startswith('https'):
        print(link)
    elif link.endswith('html'):
        print(link)
    unique_links.add(link)
print(unique_links)
Create the set outside the for loop, then add to set inside the loop.
link_set = set()
for link in soup.find_all('a', href=True):
    link_set.add(link['href'])
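Putting the pieces together for the original question (a sketch using the same filter conditions as the question's code), you can filter and de-duplicate in one pass:
import urllib.request
from bs4 import BeautifulSoup

with urllib.request.urlopen('https://www.census.gov/programs-surveys/popest.html') as response:
    soup = BeautifulSoup(response.read(), 'html.parser')

unique_links = set()
for a in soup.find_all('a', href=True):
    href = a['href']
    # keep only the links the question was filtering for
    if href.startswith('https') or href.endswith('html'):
        unique_links.add(href)

print(unique_links)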

Traceback error with startswith

I keep getting a traceback error saying AttributeError: 'NoneType' object has no attribute 'startswith' when I get to the end of my script. What I am doing up to this point is scraping all kinds of different pages then pulling all these different pages into one list that scrapes the final URL for each business page. What I did was go to each_page and scrape all the 'a' tags off of the page, then I am wanting to search through them and only keep the ones that start with '/401k/'. I know I could probably do it without having to add it to another list because I feel like I have too many. I was thinking of doing it like this:
for a in soup.findAll('a'):
    href = a.get('href')
    if href.startswith('/401k/'):
        final_url.append(href)
# Even when I try this I get the same "no attribute" error
Either way it isn't getting the data and I can't figure out what is going on. Maybe I've been looking at the screen too much.
import requests
from bs4 import BeautifulSoup

url = "https://www.brightscope.com/ratings/"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

hrefs = []
ratings = []
pages = []
s_names = []
final_url = []

for href in soup.findAll('a'):
    if 'href' in href.attrs:
        hrefs.append(href.attrs['href'])

for good_ratings in hrefs:
    if good_ratings.startswith('/ratings/'):
        ratings.append(url[:-9] + good_ratings)

del ratings[0]
del ratings[27:]

for each_rating in ratings:
    page = requests.get(each_rating)
    soup = BeautifulSoup(page.text, 'html.parser')
    span = soup.find('span', class_='letter-pages')
    if soup.find('span', class_='letter-pages'):
        for a in span.find_all('a'):
            href = a.get('href')
            pages.append('https://www.brightscope.com' + href)
    else:
        pages.append(page.url)

hrefs = []
pages = set(pages)

for each_page in pages:
    page = requests.get(each_page)
    soup = BeautifulSoup(page.text, 'html.parser')
    for a in soup.findAll('a'):
        href = a.get('href')
        s_names.append(href)

# I am getting AttributeError: 'NoneType' object has no attribute 'startswith' starting with the code below.
for each in s_names:
    if each.startswith('/401k'):
        final_url.append(each)
The problem you are facing is that you are calling startswith regardless of whether the value is present. You should first check whether the each variable actually has a value. Try this:
for each in s_names:
    if each and each.startswith('/401k'):
        final_url.append(each)
What the above statement does is first check whether the value is None (or otherwise empty). Only if the value is not None does it go on to make the check using startswith.
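You can see the short-circuit behaviour of and in a tiny example (just an illustration, not part of the answer):
value = None
# the right-hand side is never evaluated when the left side is falsy,
# so no AttributeError is raised for None values
print(value and value.startswith('/401k'))  # prints None instead of raising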
a tags can have no href in HTML5, so a.get('href') returns None; that's probably what you're experiencing.
What you need is to make sure you don't collect None values:
for a in soup.findAll('a'):
    href = a.get('href')
    if href is not None:
        s_names.append(href)
See here for more details https://www.w3.org/TR/2016/REC-html51-20161101/textlevel-semantics.html#the-a-element
If the a element has no href attribute, then the element represents a placeholder for where a link might otherwise have been placed, if it had been relevant, consisting of just the element’s contents.
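Alternatively (a sketch, not from either answer), you can ask BeautifulSoup to return only the a tags that actually have an href, which sidesteps the None case entirely:
# only a tags with an href attribute are returned, so no None values appear
for a in soup.find_all('a', href=True):
    s_names.append(a['href'])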

Getting all Links from a page Beautiful Soup

I am using beautifulsoup to get all the links from a page. My code is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.acontecaeventos.com.br/marketing-promocional-sao-paulo'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')
soup.find_all('href')
All that I get is:
[]
How can I get a list of all the href links on that page?
You are telling the find_all method to find href tags, not attributes.
You need to find the <a> tags; they're used to represent link elements.
links = soup.find_all('a')
Later you can access their href attributes like this:
link = links[0] # get the first link in the entire page
url = link['href'] # get value of the href attribute
url = link.get('href') # or like this
Replace your last line:
links = soup.find_all('a')
with this line:
links = [a.get('href') for a in soup.find_all('a', href=True)]
It will scrape all the a tags, and for each a tag, it will append the href attribute to the links list.
If you want to know more about the for loop between the [], read about List comprehensions.
To get a list of every href regardless of tag, use:
href_tags = soup.find_all(href=True)
hrefs = [tag.get('href') for tag in href_tags]

How to solve, finding two of each link (Beautifulsoup, python)

I'm using beautifulsoup4 to parse a webpage and collect all the href values using this code:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

allProductInfo = soup.find_all("a", class_="name-link")
print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
linksList1 prints two of each link. I believe this is happening because it's taking the link from the title as well as the item colour. I have tried a few things but cannot get BS to parse only the title link and produce a list with one of each link instead of two. I imagine it's something really simple, but I'm missing it. Thanks in advance.
This code will give you the result without duplicates
(also, using set() may be a good idea, as #Tarum Gupta suggested),
but I changed the way you crawl:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

# Gets all divs with the inner-article class, then searches for an a tag
# with the name-link class that is inside an h1 tag
allProductInfo = soup.select("div.inner-article h1 a.name-link")
# print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
alldiv = soup.findAll("div", {"class": "inner-article"})
for div in alldiv:
    linksList1.append(div.h1.a['href'])

set(linksList1)        # use set() to remove duplicate links
list(set(linksList1))  # use list() to convert the set back to a list if you need one
