I am using beautifulsoup to get all the links from a page. My code is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.acontecaeventos.com.br/marketing-promocional-sao-paulo'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')
soup.find_all('href')
All that I get is:
[]
How can I get a list of all the href links on that page?
You are telling the find_all method to find href tags, not attributes.
You need to find the <a> tags; they're used to represent link elements.
links = soup.find_all('a')
Later you can access their href attributes like this:
link = links[0] # get the first link in the entire page
url = link['href'] # get value of the href attribute
url = link.get('href') # or like this
Replace the line:
links = soup.find_all('a')
with this line:
links = [a.get('href') for a in soup.find_all('a', href=True)]
It will scrape all the <a> tags and, for each one, append its href attribute to the links list.
If you want to know more about the for loop between the [], read about List comprehensions.
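For reference, the comprehension above is equivalent to this explicit loop:
links = []
for a in soup.find_all('a', href=True):  # only <a> tags that actually have an href
    links.append(a.get('href'))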
To get a list of every href regardless of tag, use:
href_tags = soup.find_all(href=True)
hrefs = [tag.get('href') for tag in href_tags]
Related
I want to download this data https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/index.html
So far, I have been able to get the links from the p tags; those are the links for each month. The challenge is that under each of those links there are up to 31 files (one for each day). I tried several methods from Stack Overflow to get the h2 headings:
from bs4 import BeautifulSoup
import urllib.request as urllib2
import requests

url = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read(), "html.parser")
headings = soup.findAll('h2')
print(headings)

req = requests.get(url)
soup = BeautifulSoup(req.text, "html.parser")
print("The href links are:")
for link in soup.find_all('a'):
    print(link.get('href'))
soup = BeautifulSoup(req.text, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])

# or, equivalently, as a list comprehension:
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
print(links_with_text)
Here is the output (only pasting the last one):
['https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#December',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#November',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#October',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#September',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#August',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#July',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#June',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#May',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#April',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#March',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#February',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2000/index.html#January']
These are the h2 headings; what I actually need are the daily links under each h2 heading, which are stored in the a tags that follow it.
The program above does give me all the links, but if I could get them in an organized way, or in any other way that makes it easier to store the data straight from HTML sites, that would be great. I will appreciate any help. Thank you!
Find all the h2 tags and loop over them. The h2 tag itself holds no link data, so we have to find the next tag; the find_next method is used to get the p tag that follows it.
Then we have to find all the a tags inside that p tag, so we use the find_all method. I have done this in one line of code; it returns a list of links.
Now we loop over that list and extract the href part, but there is a catch: the href is not correct as-is. It contains values like 20001231ps.html, but we need 20041231ps.html, which is why the code replaces the year and prepends the base URL.
I have used dict1, which maps each month name (key) to its list of links (value), so the data is easy to extract.
Code:
import requests
from bs4 import BeautifulSoup

main_url = "https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004"
soup = BeautifulSoup(requests.get(main_url + "/index.html").text, "html.parser")
months = soup.find_all("h2")
dict1 = {}
for month in months:
    dict1[month.text] = [main_url + "/" + link['href'].replace("2000", "2004") for link in month.find_next("p").find_all("a")]
Output:
{'December': ['https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041231ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041230ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041229ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041228ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041227ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041226ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041225ps.html',
'https://www.nrc.gov/reading-rm/doc-collections/event-status/reactor-status/2004/20041224ps.html',
.....
]}
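If the goal is to download the daily status reports themselves, a minimal sketch building on dict1 might look like this (the "reports" output directory and the filename scheme are my own assumptions):
import os
import requests

os.makedirs("reports", exist_ok=True)  # hypothetical output directory
for month, links in dict1.items():
    for link in links:
        resp = requests.get(link)
        # reuse the last path segment as the filename, e.g. 20041231ps.html
        filename = os.path.join("reports", link.rsplit("/", 1)[-1])
        with open(filename, "w", encoding="utf-8") as f:
            f.write(resp.text)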
I just started using beautifulsoup and am stuck on an issue regarding getting attributes of tags inside other tags. I am using the whitehouse.gov/briefing-room/ for practice. What I'm trying to do right now is just get all the links on this page and append them to an empty list. This is my code right now:
import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')

urls = []
for h2_tags in soup.find_all('h2'):
    a_tag = h2_tags.find('a')
    urls.append(a_tag.attrs['href'])  # This is where I get the NoneType error
This code returns the <a> tags, but the first and last three results are None, and because of this I get a TypeError when trying to access their attributes to get the href.
The problem is that some <h2> tags don't contain <a> tags, so you have to check for that (a sketch of that variant follows after the output below). Alternatively, just select all <a> tags that are under <h2> tags using a CSS selector:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for a_tag in soup.select('h2 a'):  # <-- select <A> tags that are under <H2> tags
    urls.append(a_tag.attrs['href'])
print(*urls, sep='\n')
Prints:
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/10/statement-by-nsc-spokesperson-emily-horne-on-national-security-advisor-jake-sullivan-leading-the-first-virtual-meeting-of-the-u-s-israel-strategic-consultative-group/
https://www.whitehouse.gov/briefing-room/press-briefings/2021/03/09/press-briefing-by-press-secretary-jen-psaki-and-deputy-director-of-the-national-economic-council-bharat-ramamurti-march-9-2021/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-the-white-houses-meeting-with-climate-finance-leaders/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-vice-president-kamala-harris-call-with-prime-minister-erna-solberg-of-norway/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/nomination-sent-to-the-senate-3/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-biden-announces-key-hire-for-the-office-of-management-and-budget/
https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/03/09/remarks-by-president-biden-during-tour-of-w-s-jenks-son/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-joseph-r-biden-jr-approves-louisiana-disaster-declaration/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/statement-by-president-joe-biden-on-the-house-taking-up-the-pro-act/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/white-house-announces-additional-staff/
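For completeness, the None-check variant mentioned above might look like this sketch:
urls = []
for h2_tag in soup.find_all('h2'):
    a_tag = h2_tag.find('a')
    if a_tag is not None:  # some <h2> tags contain no <a> tag
        urls.append(a_tag.attrs['href'])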
This is HTML source code that I am dealing with:
<a href="/people/charles-adams" class="gridlist__link">
So what I want to do is extract the href attribute, which in this case would be "/people/charles-adams", with the beautifulsoup module. I need this because I want to get the html source code for that particular webpage with the soup.findAll method. But I am struggling to extract this attribute from the webpage. Could anyone help me with this problem?
P.S.
I am using this method to get the html source code with the Python module beautifulSoup:
import requests
from bs4 import BeautifulSoup

request = requests.get(link, headers=header)  # link and header are defined elsewhere
html = request.text
soup = BeautifulSoup(html, 'html.parser')
Try something like:
refs = soup.find_all('a')
for i in refs:
    if i.has_attr('href'):
        print(i['href'])
It should output:
/people/charles-adams
You can tell beautifulsoup to find all anchor tags with soup.find_all('a'). Then you can filter them with a list comprehension and get the links.
request = requests.get(link, headers=header)
html = request.text
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
tags = [tag for tag in tags if tag.has_attr('href')]
links = [tag['href'] for tag in tags]
links will be ['/people/charles-adams']
I have an HTML page that contains multiple links with the same href value, like the following (each link wraps an image):
<a href="1"><img src="myImage.png"></a>
<a href="1"><img src="myImage22.png"></a>
<a href="1"><img src="myImage33.png"></a>
When I request the page and look for all the tags (links) that have the href value 1, only the first one is returned. How can I tell the code to return all the links, not only the first one?
This is my code:
page = requests.get('http://www.myWebsite.com')
soup = BeautifulSoup(page.content, 'html.parser')
author_name = soup.find('a', href='1')
You can do it like this:
page = requests.get('http://www.myWebsite.com')
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.find_all('a', {'href':'1'}):
    print(link.getText())
or if you want to make a list out of them you can just do this:
author_names = [link.getText() for link in soup.find_all('a', {'href':'1'})]
The problem with your solution was that find() only returns the first result, while find_all() returns all of them.
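To make the difference concrete:
first = soup.find('a', {'href': '1'})      # a single Tag, or None if nothing matches
every = soup.find_all('a', {'href': '1'})  # always a list, possibly empty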
You can read more in the Beautiful Soup documentation.
I'm pretty new to Python and mainly need it for getting information from websites.
Here I tried to get the short headlines from the bottom of the website, but can't quite get them.
from bfs4 import BeautifulSoup
import requests
url = "http://some-website"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
nachrichten = soup.findAll('ul', {'class':'list'})
Now I would need another findAll to get all the links (a tags) from the variable nachrichten, but how can I do this?
Use a css selector with select if you want all the links in a single list:
anchors = soup.select('ul.list a')
If you want individual lists:
anchors = [ul.find_all('a') for ul in soup.find_all('ul', {'class':'list'})]
Also if you want the hrefs you can make sure you only find the anchors with href attributes and extract:
hrefs = [a["href"] for a in soup.select('ul.list a[href]')]
With find_all, set href=True, i.e. ul.find_all('a', href=True), to make sure you only find the anchors with href attributes.
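Put together, the per-list version with the href filter might look like this sketch:
hrefs_per_list = [
    [a['href'] for a in ul.find_all('a', href=True)]
    for ul in soup.find_all('ul', {'class': 'list'})
]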
from bs4 import BeautifulSoup
import requests
url = "http://www.n-tv.de/ticker/"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
nachrichten = soup.findAll('ul', {'class':'list'})
links = []
for ul in nachrichten:
    links.extend(ul.findAll('a'))
print(len(links))
Hope this solves your problem. Also, I think the import should be bs4; I have never heard of bfs4.