How to extract the href attribute from HTML source code - Python

This is HTML source code that I am dealing with:
<a href="/people/charles-adams" class="gridlist__link">
So what I want to do is extract the href attribute, which in this case would be "/people/charles-adams", with the beautifulsoup module. I need this because I want to fetch the HTML source code of that particular webpage with the soup.findAll method. But I am struggling to extract such an attribute from the webpage. Could anyone help me with this problem?
P.S.
I am using this method to get the HTML source code with the Python module BeautifulSoup:
import requests
from bs4 import BeautifulSoup

request = requests.get(link, headers=header)  # link and header defined elsewhere
html = request.text
soup = BeautifulSoup(html, 'html.parser')

Try something like:
refs = soup.find_all('a')
for i in refs:
    if i.has_attr('href'):
        print(i['href'])
It should output:
/people/charles-adams

You can tell BeautifulSoup to find all anchor tags with soup.find_all('a'). Then you can filter them with a list comprehension and get the links.
request = requests.get(link, headers=header)
html = request.text
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
tags = [tag for tag in tags if tag.has_attr('href')]
links = [tag['href'] for tag in tags]
links will be ['/people/charles-adams']
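Since the anchor in your snippet carries class="gridlist__link", you could also narrow the search with a CSS selector instead of filtering all anchors. A minimal sketch, assuming that class is used consistently for the links you want:
links = [a['href'] for a in soup.select('a.gridlist__link[href]')]
# e.g. ['/people/charles-adams'] for the snippet in the question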

Related

How to find all a tags with aria-label attribute BeautifulSoup

I am trying to use Beautiful Soup to find all a tags that have an aria-label attribute (not trying to find a tags with any specific value for the attribute, just every tag that has the attribute in general). My code is shown below. When I run the code, I get an error indicating that the aria-label parameter cannot be parsed. How can I do this correctly?
url = 'https://www.encodeproject.org/search/?type=Experiment&control_type!=*&status=released&perturbed=false&assay_title=TF+ChIP-seq&assay_title=Histone+ChIP-seq&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&biosample_ontology.term_name=K562&biosample_ontology.term_name=HEK293&biosample_ontology.term_name=MCF-7&biosample_ontology.term_name=HepG2&limit=all'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
a_tags = soup.findAll('a', aria-label=True)
for tag in a_tags:
    print(tag.text.strip())
You can use the CSS selector a[aria-label], which will select all a tags that have the aria-label attribute.
To use a CSS selector, use select() instead of find_all():
import requests
from bs4 import BeautifulSoup
url = 'https://www.encodeproject.org/search/?type=Experiment&control_type!=*&status=released&perturbed=false&assay_title=TF+ChIP-seq&assay_title=Histone+ChIP-seq&replicates.library.biosample.donor.organism.scientific_name=Homo+sapiens&biosample_ontology.term_name=K562&biosample_ontology.term_name=HEK293&biosample_ontology.term_name=MCF-7&biosample_ontology.term_name=HepG2&limit=all'
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
a_tags = soup.select('a[aria-label]')
for tag in a_tags:
    print(tag.text.strip())
Or use the attrs= argument:
a_tags = soup.findAll('a', attrs={"aria-label": True})
Or check whether aria-label is in the tag's .attrs:
a_tags = soup.findAll(lambda tag: tag.name == "a" and "aria-label" in tag.attrs)
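For reference, a minimal self-contained sketch (using inline HTML rather than a live request, so it is reproducible) showing that all three approaches match the same tags:
from bs4 import BeautifulSoup

html = '<a aria-label="Home" href="/">Home</a><a href="/about">About</a>'
soup = BeautifulSoup(html, 'html.parser')

# each of these finds only the first anchor, the one with aria-label
print(soup.select('a[aria-label]'))
print(soup.find_all('a', attrs={'aria-label': True}))
print(soup.find_all(lambda tag: tag.name == 'a' and 'aria-label' in tag.attrs))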

Using BeautifulSoup to get a tags and attributes of these a tags

I just started using BeautifulSoup and am stuck on an issue regarding getting attributes of tags inside other tags. I am using whitehouse.gov/briefing-room/ for practice. What I'm trying to do right now is just get all the links on this page and append them to an empty list. This is my code right now:
result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for h2_tags in soup.find_all('h2'):
    a_tag = h2_tags.find('a')
    urls.append(a_tag.attr['href']) # This is where I get the NoneType error
This code returns the <a> tags, but the first and last three results are None, and because of that I get a NoneType error when trying to access the attributes to get the href for those tags.
The problem is that some <h2> tags don't contain <a> tags, so you have to check for that case (a sketch of that check follows the output below). Or just select all <a> tags that sit under <h2> tags using a CSS selector:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for a_tag in soup.select('h2 a'): # <-- select <A> tags that are under <H2> tags
    urls.append(a_tag.attrs['href'])
print(*urls, sep='\n')
Prints:
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/10/statement-by-nsc-spokesperson-emily-horne-on-national-security-advisor-jake-sullivan-leading-the-first-virtual-meeting-of-the-u-s-israel-strategic-consultative-group/
https://www.whitehouse.gov/briefing-room/press-briefings/2021/03/09/press-briefing-by-press-secretary-jen-psaki-and-deputy-director-of-the-national-economic-council-bharat-ramamurti-march-9-2021/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-the-white-houses-meeting-with-climate-finance-leaders/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-vice-president-kamala-harris-call-with-prime-minister-erna-solberg-of-norway/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/nomination-sent-to-the-senate-3/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-biden-announces-key-hire-for-the-office-of-management-and-budget/
https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/03/09/remarks-by-president-biden-during-tour-of-w-s-jenks-son/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-joseph-r-biden-jr-approves-louisiana-disaster-declaration/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/statement-by-president-joe-biden-on-the-house-taking-up-the-pro-act/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/white-house-announces-additional-staff/
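And the explicit None check mentioned above, as a minimal sketch that keeps the original find_all('h2') loop:
urls = []
for h2_tag in soup.find_all('h2'):
    a_tag = h2_tag.find('a')
    # skip <h2> tags that contain no <a> tag at all
    if a_tag is not None and a_tag.has_attr('href'):
        urls.append(a_tag['href'])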

BS4 Get text from within all DIV tags but not children

I am crawling multiple webpages but am having an issue with some websites that keep their content/text in div tags rather than p or span. Previously the script worked fine getting text from p and span tags, but consider a snippet of HTML like the one below:
<div>Hello<p>this is a test</p></div>
Using find_all('div') and .getText() provides the following output:
Hello this is a test
I am looking to get just Hello as the result. This will allow me to determine what content is in which tags. I have tried using recursive=False, however this doesn't appear to work on a whole webpage with multiple div tags that contain content.
ADDED SNIPPET OF CODE
import urllib.request
from bs4 import BeautifulSoup

req = urllib.request.Request("https://www.healthline.com/health/fitness-exercise/pushups-everyday", headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read().decode("utf-8").lower()
soup = BeautifulSoup(html, 'html.parser')
divTag = soup.find_all('div')
text = []
for div in divTag:
    i = div.getText()
    text.append(i)
print(text)
Thanks in advance.
Based on your information, this is answered here: how to get text from within a tag, but ignore other child tags.
this would lead to something like this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
for div in soup.find_all('div'):
    print(div.find(text=True, recursive=False))
EDIT:
you just have to change
i = div.getText()
to
i = div.find(text=True, recursive=False)
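As an aside, newer bs4 releases prefer string= over text= for this argument (text= still works as an alias), so on a recent version the same line can be written as:
i = div.find(string=True, recursive=False)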
Here is a possible solution: we extract all <p> tags from the soup.
from bs4 import BeautifulSoup
html = "<div>Hello<p>this is a test</p></div>"
soup = BeautifulSoup(html, 'html.parser')
for p in soup.find_all('p'):
    p.extract()
print(soup.text)
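It should output:
Hello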

Getting all Links from a page Beautiful Soup

I am using beautifulsoup to get all the links from a page. My code is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.acontecaeventos.com.br/marketing-promocional-sao-paulo'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')
soup.find_all('href')
All that I get is:
[]
How can I get a list of all the href links on that page?
You are telling the find_all method to find href tags, not attributes.
You need to find the <a> tags; they're used to represent link elements.
links = soup.find_all('a')
Later you can access their href attributes like this:
link = links[0] # get the first link in the entire page
url = link['href'] # get value of the href attribute
url = link.get('href') # or like this
Replace your last line:
links = soup.find_all('a')
with this line:
links = [a.get('href') for a in soup.find_all('a', href=True)]
It will scrape all the <a> tags, and for each of them it will append the href attribute value to the links list.
If you want to know more about the for loop between the [], read about list comprehensions.
To get a list of every href regardless of tag, use:
href_tags = soup.find_all(href=True)
hrefs = [tag.get('href') for tag in href_tags]
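Since many of those href values will be relative (e.g. /marketing-promocional-sao-paulo), you may want absolute URLs. A minimal sketch using urllib.parse.urljoin together with the url and soup variables from above:
from urllib.parse import urljoin

# resolve each href against the page URL; absolute hrefs pass through unchanged
absolute_links = [urljoin(url, a['href']) for a in soup.find_all('a', href=True)]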

Beautifulsoup unable to extract data using attrs=class

I am extracting data for a research project and I have successfully used findAll('div', attrs={'class':'someClassName'}) on many websites, but this particular website,
WebSite Link
doesn't return any values when I use the attrs option. But when I don't use the attrs option, I get the entire HTML DOM.
Here is the simple code that I started with to test it out:
soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print div
My code is working fine with requests:
import requests
from BeautifulSoup import BeautifulSoup as bs
#grab HTML
r = requests.get(r'http://www.amazon.com/s/ref=sr_pg_1?rh=n:172282,k%3adigital%20camera&keywords=digital%20camera&ie=UTF8&qid=1343600585')
html = r.text
#parse the HTML
soup = bs(html)
results= soup.findAll('div', attrs={'class': 'data'})
print results
If you or anyone reading this question would like to know why the code wasn't able to find the attrs value, here is the code you gave (copied below):
soup = bs(urlopen(url))
for div in soup.findAll('div', attrs={'class':'data'}):
    print div
The issue is in how you created the BeautifulSoup object, soup = bs(urlopen(url)): the value of urlopen(url) is a response object, not the HTML itself.
I'm sure any issues you had encountered could have been more easily resolved by using bs(urlopen(url).read()) instead.
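In other words, a minimal sketch of the fix, keeping the asker's Python 2 / BeautifulSoup 3 style and assuming the same imports are in scope:
soup = bs(urlopen(url).read())  # .read() returns the raw HTML string
for div in soup.findAll('div', attrs={'class':'data'}):
    print div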
