beautifulsoup4 doesn't filter only by class - python

I'm trying to get every "a" tag in an HTML page using
soup.find_all
Here's my code (r.text is the YouTube home page's HTML):
soup = BeautifulSoup(r.text, 'html.parser')
for lnk in soup.find_all('a', {'class': 'ytd-thumbnail'}):
    print(lnk)
    link = lnk.get("href")
    writeFile("queue.txt", "https://youtube.com" + link)
    removeQueue(url)
I'm trying to get something like this:
<a id="thumbnail" class="yt-simple-endpoint inline-block style-scope ytd-thumbnail" aria-hidden="true" tabindex="-1" href="youtubelink">
but it never even enters the for loop, and I don't know why.

Use attrs when passing the dictionary to the find_all or find method:
soup = BeautifulSoup(r.text, 'html.parser')
for lnk in soup.find_all('a', attrs={'class': 'ytd-thumbnail'}):
    print(lnk)
    link = lnk.get("href")
    writeFile("queue.txt", "https://youtube.com" + link)
    removeQueue(url)
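As an aside, BeautifulSoup also accepts a class_ keyword argument for class filtering (the trailing underscore is needed because "class" is reserved in Python). A minimal sketch with a made-up anchor tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class_ matches when the class is one of the
# tag's space-separated CSS classes.
html = '<a class="yt-simple-endpoint ytd-thumbnail" href="/watch?v=abc">video</a>'
soup = BeautifulSoup(html, 'html.parser')

for lnk in soup.find_all('a', class_='ytd-thumbnail'):
    print(lnk.get('href'))  # /watch?v=abc
```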

You can try a CSS selector instead. I find them cleaner and more robust. Here, select returns a list of all a tags whose class attribute contains the substring ytd-thumbnail. As a side note, I'd also suggest using the lxml parser when working with bs4.
soup = BeautifulSoup(r.text, 'lxml')
for lnk in soup.select('a[class*=ytd-thumbnail]'):
    link = lnk.get("href")
    writeFile("queue.txt", "https://youtube.com" + link)
    removeQueue(url)

Related

How to extract href attribute in html source code

This is HTML source code that I am dealing with:
<a href="/people/charles-adams" class="gridlist__link">
So what I want to do is extract the href attribute, in this case "/people/charles-adams", with the beautifulsoup module. I need this because I want to fetch the HTML source code of that particular webpage afterwards with the soup.findAll method. But I am struggling to extract the attribute from the page. Could anyone help me with this problem?
P.S.
I am using this method to get the HTML source code with the Python module BeautifulSoup:
request = requests.get(link, headers=header)
html = request.text
soup = BeautifulSoup(html, 'html.parser')
Try something like:
refs = soup.find_all('a')
for i in refs:
    if i.has_attr('href'):
        print(i['href'])
It should output:
/people/charles-adams
You can tell BeautifulSoup to find all anchor tags with soup.find_all('a'). Then you can filter the result with a list comprehension and get the links.
request = requests.get(link, headers=header)
html = request.text
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
tags = [tag for tag in tags if tag.has_attr('href')]
links = [tag['href'] for tag in tags]
links will be ['/people/charles-adams']

How to get ALL the HTML tags associated with a specific id (Beautiful soup)

I have an HTML page that contains multiple links with the same reference as following:
<img src="myImage.png">
<img src="myImage22.png">
<img src="myImage33.png">
When I ask the page to return all the tags (links) that have the href "1", it returns only the first one. How do I tell the code to return all the links, not only the first one?
This is my code:
page = requests.get('http://www.myWebsite.com')
soup = BeautifulSoup(page.content, 'html.parser')
author_name = soup.find('a', href= '1')
You can do it like this:
page = requests.get('http://www.myWebsite.com')
soup = BeautifulSoup(page.content, 'html.parser')
for link in soup.find_all('a', {'href': '1'}):
    print(link.getText())
or if you want to make a list out of them you can just do this:
author_names = [link.getText() for link in soup.find_all('a', {'href':'1'})]
The problem with your solution was that find() only returns the first result, while find_all() returns all of them.
You can read up more on Beautiful Soup here
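The find/find_all difference can be seen in a few lines; this is a minimal sketch with hypothetical markup:

```python
from bs4 import BeautifulSoup

html = '<a href="1">first</a><a href="1">second</a>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a', href='1')          # only the first match
all_links = soup.find_all('a', href='1')  # every match

print(first.getText())                    # first
print([a.getText() for a in all_links])   # ['first', 'second']
```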

Getting all Links from a page Beautiful Soup

I am using beautifulsoup to get all the links from a page. My code is:
import requests
from bs4 import BeautifulSoup
url = 'http://www.acontecaeventos.com.br/marketing-promocional-sao-paulo'
r = requests.get(url)
html_content = r.text
soup = BeautifulSoup(html_content, 'lxml')
soup.find_all('href')
All that I get is:
[]
How can I get a list of all the href links on that page?
You are telling the find_all method to find href tags, not attributes.
You need to find the <a> tags, they're used to represent link elements.
links = soup.find_all('a')
Later you can access their href attributes like this:
link = links[0] # get the first link in the entire page
url = link['href'] # get value of the href attribute
url = link.get('href') # or like this
Replace your last line:
links = soup.find_all('a')
with this line:
links = [a.get('href') for a in soup.find_all('a', href=True)]
It will scrape all the a tags and, for each one, append its href attribute to the links list.
If you want to know more about the for loop between the [], read about List comprehensions.
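For reference, the list comprehension is equivalent to this plain for loop (sketched with a tiny made-up document):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/x">x</a><a>no link</a>', 'html.parser')

links = []
for a in soup.find_all('a', href=True):  # href=True skips <a> tags without href
    links.append(a.get('href'))

print(links)  # ['/x']
```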
To get a list of every href regardless of tag, use:
href_tags = soup.find_all(href=True)
hrefs = [tag.get('href') for tag in href_tags]

Python + BeautifulSoup: How to get ‘href’ attribute of ‘a’ element?

I have the following:
html =
'''<div class=“file-one”>
<a href=“/file-one/additional” class=“file-link">
<h3 class=“file-name”>File One</h3>
</a>
<div class=“location”>
Down
</div>
</div>'''
And would like to get just the text of href which is /file-one/additional. So I did:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
link_text = “”
for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]
    print “Link: “ + link_text
But it just prints a blank, nothing. Just Link:. So I tested it out on another site but with a different HTML, and it worked.
What could I be doing wrong? Or is there a possibility that the site intentionally programmed to not return the href?
Thank you in advance and will be sure to upvote/accept answer!
The 'a' tag in your HTML does not have any text directly; it contains an 'h3' tag that does. This means that text is None, and .find_all() fails to select the tag. In general, do not use the text parameter if a tag contains any other HTML elements besides text content.
You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements. Then add a condition in the loop to check if they contain text.
soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
Or you could use a list comprehension, if you prefer one-liners.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
Or you could pass a lambda to .find_all().
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
If you want to collect all links whether they have text or not, just select all 'a' tags that have a 'href' attribute. Anchor tags usually have links but that's not a requirement, so I think it's best to use the href argument.
Using .find_all().
links = [a['href'] for a in soup.find_all('a', href=True)]
Using .select() with CSS selectors.
links = [a['href'] for a in soup.select('a[href]')]
You can also match the href with a regex search and then read it from attrs:
soup.find('a', href = re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
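Put into a self-contained sketch (with a made-up anchor), the same pattern looks like this:

```python
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/file-one/additional">File One</a>', 'html.parser')

# href=re.compile(...) keeps only tags whose href matches the pattern
link = soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+'))
print(link.attrs['href'])  # /file-one/additional
```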
First of all, use a different text editor that doesn't use curly quotes.
Second, remove the text=True flag from the soup.find_all
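A minimal sketch of the pitfall, using straight quotes and the question's markup: the <a> has no direct string of its own (its text lives inside the nested <h3>), so text=True matches nothing, while dropping it works.

```python
from bs4 import BeautifulSoup

html = '''<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all('a', href=True, text=True))            # []
print([a['href'] for a in soup.find_all('a', href=True)])  # ['/file-one/additional']
```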
You could solve this with just a couple lines of gazpacho:
from gazpacho import Soup
html = """\
<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
<div class="location">
Down
</div>
</div>
"""
soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']
Which would output:
'/file-one/additional'
A bit late to the party, but I had the same issue recently scraping some recipes and got mine printing cleanly by doing this:
from bs4 import BeautifulSoup
import requests
source = requests.get('url for website')
soup = BeautifulSoup(source.text, 'lxml')
for article in soup.find_all('article'):
    link = article.find('a', href=True)['href']
    print(link)

Getting href using Beautiful Soup

I am trying to extract a specific link for this html code
<a class="pageNum taLnk" data-offset="10" data-page-number="1"
href="www.blahblahblah.com/bb32123">Page 1 </a>
<a class="pageNum taLnk" data-offset="20" data-page-number="2"
href="www.blahblahblah.com/bb45135">Page 2 </a>
As you can see, the links (href) are disorganized, so there is no pattern for me to use, which means I need to extract each href manually using BeautifulSoup.
I want to specifically get Page 2's href.
This is the code I have now:
from bs4 import BeautifulSoup
import urllib.request
url = 'https://www.tripadvisor.com/ShowUserReviews-g293917-d539542-r447460956-Duangtawan_Hotel_Chiang_Mai-Chiang_Mai.html#REVIEWS'
page = urllib.request.urlopen(url)
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('a', attrs={'class': 'pageNum taLnk'}):
    print(link)
As you can see, I am stuck trying to obtain the href specifically for Page 2. Is there any way to use the extra information within the tags, such as data-page-number="2" or data-offset="20"?
page_2 = soup.find('a', attrs={'data-page-number': '2'})
This will only get you page 2. If you want to get the next page no matter what the current page is, you should find the next-page link instead:
next_page = soup.find('a', attrs={'class': 'nav next rndBtn ui_button primary taLnk'})
Some attributes, like the data-* attributes in HTML 5, have names that
can’t be used as the names of keyword arguments:
data_soup = BeautifulSoup('<div data-foo="value">foo!</div>')
data_soup.find_all(data-foo="value")
# SyntaxError: keyword can't be an expression
You can use these attributes in searches by putting them into a
dictionary and passing the dictionary into find_all() as the attrs
argument:
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]
