Beginner - Web scraping - Download data - python

I want to download this data https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/index.html
So far, I have been able to get the links from the p tags, which are the monthly links. The challenge is that under each of those links there are up to 31 files (one for each day). I have tried several methods from Stack Overflow to get the h2 headings:
from bs4 import BeautifulSoup
import urllib.request as urllib2
import requests

url = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html"

# First attempt: urllib + BeautifulSoup to grab the h2 headings
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
headings = soup.findAll('h2')

# Second attempt: requests, then print every href on the page
req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
print("The href links are :")
print(headings)
for link in soup.find_all('a'):
    print(link.get('href'))

# Third attempt: collect only the <a> tags that have an href and text
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])

# Equivalent one-liner
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
links_with_text
and here is the output (only pasting the last one):
['https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#December',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#November',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#October',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#September',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#August',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#July',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#June',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#May',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#April',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#March',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#February',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#January']
My question is that these are the h2 headings, and I further need the links under each h2 heading, which are stored in the a tags below it.
The program above does give me all the links, but if I could get them in an organized way, or by any other approach that makes it easier to store data directly from HTML pages, that would be great. I will appreciate any help. Thank you!

Find all the h2 tags and loop over them. The h2 tag itself holds no links, so we have to move to the next tag; the find_next method is used to reach the p tag that follows.
Inside that p tag we find all the a tags with find_all, which returns a list of links (done in one line of code below).
Looping over that list, we extract only the href part. There is a catch: the href is not quite right, it contains filenames like 20001231ps.html,
but we need 20041231ps.html, so the string is replaced before the base URL is prepended.
I have used dict1, where each key is a month and each value is the list of that month's links, so it will be easy to extract.
Code:
months = soup.find_all("h2")
dict1 = {}
main_url = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004"
for month in months:
    # the daily links sit in the <p> that follows each <h2>; fix the year in the filename
    dict1[month.text] = [main_url + "/" + link['href'].replace("2000", "2004")
                         for link in month.find_next("p").find_all("a")]
Output:
{'December': ['https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041231ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041230ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041229ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041228ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041227ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041226ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041225ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041224ps.html',
.....
]}
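If the goal is to actually download each day's report once dict1 is built, a minimal sketch (assuming the soup above came from the 2004 index page and that requests is installed; the output folder name is just a placeholder) could save each daily page to disk:
import os
import requests

out_dir = "reactor_status_2004"  # placeholder output folder
os.makedirs(out_dir, exist_ok=True)

for month, day_links in dict1.items():
    for day_url in day_links:
        resp = requests.get(day_url)
        if resp.status_code != 200:  # skip days with no report
            continue
        filename = day_url.rsplit("/", 1)[-1]  # e.g. 20041231ps.html
        with open(os.path.join(out_dir, filename), "w", encoding="utf-8") as f:
            f.write(resp.text)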

Related

Scrape tab href value from a webpage by python Beautiful Soup

I have code that extracts links from the main page and navigates through each page in the list of links. Each new page has a tab that is represented as follows in the source:
<Li Class=" tab-contacts" Id="contacts"><A Href="?id=448&tab=contacts"><Span Class="text">Contacts</Span>
I want to extract the href value and navigate to that page to get some information, here is my code so far:
import re
import requests
from bs4 import BeautifulSoup
r = requests.get(link_to_the_website)
data = r.content
soup = BeautifulSoup(data, "html.parser")
links = []
for i in soup.find_all('div', {'class': 'leftInfoWrap'}):
    link = i.find('a', href=True)
    if link is None:
        continue
    links.append(link.get('href'))
for link in links:
    soup = BeautifulSoup(link, "lxml")
    tabs = soup.select('Li', {'class': ' tab-contacts'})
    print(tabs)
However I am getting an empty list with 'print(tabs)' command. I did verify the link variable and it is being populated. Thanks in advance
Looks like you are trying to mix find syntax with select.
I would use the parent id as an anchor, then navigate to the child with CSS selectors and a child combinator.
partial_link = soup.select_one('#contacts > a')['href']
You need to append the appropriate prefix.
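To turn that partial href into a page you can fetch, a rough sketch (reusing the links list from the question and urljoin to resolve the relative "?id=448&tab=contacts" href; nothing here is specific to the real site) might look like this:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

for link in links:
    page = requests.get(link)  # fetch the item page itself, not just the URL string
    soup = BeautifulSoup(page.content, "html.parser")
    anchor = soup.select_one('#contacts > a')  # the <a> inside the element with id="contacts"
    if anchor is None:
        continue
    contacts_url = urljoin(link, anchor['href'])  # append the appropriate prefix
    contacts_soup = BeautifulSoup(requests.get(contacts_url).content, "html.parser")
    # ... extract the contact information you need from contacts_soup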

How to disable all links not in a list, using beautiful soup

I am currently working on a web application (using flask for backend).
In my backend, I retrieve the page source of a given url using selenium. I want to go through the page_source and disable all links whose href is not inside a list. Something like:
body = browser.page_source
soup = BeautifulSoup(body, 'html.parser')
for link in soup.a:
    if not (link['href'] in link_list):
        link['href'] = ""
I am new to beautiful soup, so I am unsure about the syntax. I am using Beautiful soup 4
Figured it out:
soup = BeautifulSoup(c_body, 'lxml')  # you can also use html.parser
for a in soup.findAll('a'):
    if not (a['href'] in src_lst):  # src_lst is a list of the urls you want to keep
        del a['href']
        a.name = 'span'  # to avoid the style associated with links
soup.span.unwrap()  # to remove span tags and keep text only
c_body = str(soup)  # c_body will be displayed in an iframe using srcdoc
EDIT: The above code might break if there are no span tags, so this would be a better approach:
soup = BeautifulSoup(c_body, 'lxml')
for a in soup.findAll('a'):
    if a.has_attr("href"):
        if not (a['href'] in src_lst):
            del a['href']
            a.name = 'span'
if len(soup.findAll('span')) > 0:
    soup.span.unwrap()
c_body = str(soup)
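As a quick illustration of what that transformation does, here is a tiny self-contained example (the c_body string and src_lst below are made up for demonstration); the link not in the list ends up as plain text with no href:
from bs4 import BeautifulSoup

c_body = '<p><a href="https://keep.me">keep</a> and <a href="https://drop.me">drop</a></p>'
src_lst = ['https://keep.me']  # urls whose links should stay clickable

soup = BeautifulSoup(c_body, 'lxml')
for a in soup.findAll('a'):
    if a.has_attr("href"):
        if not (a['href'] in src_lst):
            del a['href']
            a.name = 'span'
if len(soup.findAll('span')) > 0:
    soup.span.unwrap()

print(str(soup))  # the second link is reduced to the bare text "drop"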

Search in Sub-Pages of a Main Webpage using BeautifulSoup

I am trying to search for a div with class = 'class', but I need to find all matches in the mainpage as well as in the sub (or children) pages. How can I do this using BeautifulSoup or anything else?
I have found the closest answer in this search
Search the frequency of words in the sub pages of a webpage using Python
but this method only retrieved partial results; the page of interest has many more subpages. Is there another way of doing this?
My code so far:
import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.mainpage.nl/')  # fetch the main page first
soup = BeautifulSoup(page.content, 'html.parser')
subpages = []
for anchor in soup.find_all('a', href=True):
    string = 'https://www.mainpage.nl/' + str(anchor['href'])
    subpages.append(string)
for subpage in subpages:
    try:
        soup_sub = BeautifulSoup(requests.get(subpage).content, 'html.parser')
        promotie = soup_sub.find_all('strong', class_='c-action-banner__subtitle')
        if len(promotie) > 0:
            print(promotie)
    except Exception:
        pass
Thanks!
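One way to reach subpages that the main page does not link to directly is to crawl: keep a queue of pages to visit and a set of pages already seen, and only follow links that stay on the same site. A rough breadth-first sketch of that idea (the start URL, page limit, and target selector are placeholders, not taken from the question):
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin, urlparse

start_url = 'https://www.mainpage.nl/'  # placeholder start page
max_pages = 200  # safety limit so the crawl terminates

seen = set()
queue = [start_url]
results = []

while queue and len(seen) < max_pages:
    url = queue.pop(0)
    if url in seen:
        continue
    seen.add(url)
    try:
        soup = BeautifulSoup(requests.get(url, timeout=10).content, 'html.parser')
    except requests.RequestException:
        continue
    # collect the elements of interest on this page
    results.extend(soup.find_all('strong', class_='c-action-banner__subtitle'))
    # queue internal links for later visits
    for anchor in soup.find_all('a', href=True):
        link = urljoin(url, anchor['href'])
        if urlparse(link).netloc == urlparse(start_url).netloc and link not in seen:
            queue.append(link)

print(results)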

Getting all Links from a page Beautiful Soup

I am using beautifulsoup to get all the links from a page. My code is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.acontecaeventos.com.br/marketing-promocional-sao-paulo'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')
soup.find_all('href')
All that I get is:
[]
How can I get a list of all the href links on that page?
You are telling the find_all method to find href tags, not attributes.
You need to find the <a> tags, they're used to represent link elements.
links = soup.find_all('a')
Later you can access their href attributes like this:
link = links[0] # get the first link in the entire page
url = link['href'] # get value of the href attribute
url = link.get('href') # or like this
Replace your last line:
links = soup.find_all('a')
with this line:
links = [a.get('href') for a in soup.find_all('a', href=True)]
It will scrape all the a tags, and for each a tag it will append the href attribute to the links list.
If you want to know more about the for loop between the [], read about List comprehensions.
To get a list of every href regardless of tag, use:
href_tags = soup.find_all(href=True)
hrefs = [tag.get('href') for tag in href_tags]

How to solve, finding two of each link (Beautifulsoup, python)

I'm using beautifulsoup4 to parse a webpage and collect all the href values using this code:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")
allProductInfo = soup.find_all("a", class_="name-link")
print(allProductInfo)
linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
linksList1 prints two of each link. I believe this is happening because it is taking the link from the title as well as from the item colour. I have tried a few things but cannot get BeautifulSoup to parse only the title link and give me a list with one of each link instead of two. I imagine it is something really simple that I am missing. Thanks in advance.
This code will give you the result without duplicates (also, using set() may be a good idea, as #Tarum Gupta suggested), but I changed the way you crawl:
import requests
from bs4 import BeautifulSoup

# Collect links from 'new' page
pageRequest = requests.get('http://www.supremenewyork.com/shop/all/shirts')
soup = BeautifulSoup(pageRequest.content, "html.parser")
links = soup.select("div.turbolink_scroller a")

# Gets all divs with class inner-article, then searches for an <a> with the
# name-link class that is inside an h1 tag
allProductInfo = soup.select("div.inner-article h1 a.name-link")
# print(allProductInfo)

linksList1 = []
for href in allProductInfo:
    linksList1.append(href.get('href'))
print(linksList1)
alldiv = soup.findAll("div", {"class": "inner-article"})
for div in alldiv:
    linksList1.append(div.h1.a['href'])

set(linksList1)        # use set() to remove duplicate links
list(set(linksList1))  # use list() to convert the set back to a list if you need one
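Note that converting to a set loses the original order. If order matters, a standard Python idiom (not from the answer above) is dict.fromkeys, which removes duplicates while keeping first-seen order:
unique_links = list(dict.fromkeys(linksList1))  # deduplicate, preserving order
print(unique_links)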
