I have the following:
html =
'''<div class=“file-one”>
    <a href=“/file-one/additional” class=“file-link">
        <h3 class=“file-name”>File One</h3>
    </a>
    <div class=“location”>
        Down
    </div>
</div>'''
And I would like to get just the value of the href, which is /file-one/additional. So I did:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
link_text = “”
for a in soup.find_all(‘a’, href=True, text=True):
    link_text = a[‘href’]
print “Link: “ + link_text
But it just prints a blank, nothing after Link:. So I tested it out on another site with different HTML, and it worked.
What could I be doing wrong? Or is it possible that the site is intentionally programmed not to return the href?
Thank you in advance and will be sure to upvote/accept answer!
The 'a' tag in your html does not have any text directly, but it contains an 'h3' tag that does. This means that its .string (which is what the text parameter matches against) is None, and .find_all() fails to select the tag. Generally, do not use the text parameter if a tag contains any other HTML elements besides text content.
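You can see this in a quick check: because the 'a' wraps whitespace plus an 'h3' rather than a single string, its .string comes back None. A minimal sketch (using straight quotes):
from bs4 import BeautifulSoup

snippet = '''<a href="/file-one/additional">
    <h3 class="file-name">File One</h3>
</a>'''
a = BeautifulSoup(snippet, 'html.parser').find('a')
print(a.string)  # None: the <a> holds whitespace and an <h3>, not one string
print(a.text)    # still contains 'File One', gathered recursively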
You can resolve this issue if you use only the tag's name (and the href keyword argument) to select elements, and then add a condition inside the loop to check whether they contain text.
soup = BeautifulSoup(html, 'html.parser')
links_with_text = []
for a in soup.find_all('a', href=True):
    if a.text:
        links_with_text.append(a['href'])
Or you could use a list comprehension, if you prefer one-liners.
links_with_text = [a['href'] for a in soup.find_all('a', href=True) if a.text]
Or you could pass a lambda to .find_all().
tags = soup.find_all(lambda tag: tag.name == 'a' and tag.get('href') and tag.text)
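Note that the lambda version hands back Tag objects rather than hrefs, so you would still pull the attribute out afterwards:
links_with_text = [tag['href'] for tag in tags]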
If you want to collect all links whether they have text or not, just select all 'a' tags that have a 'href' attribute. Anchor tags usually have links but that's not a requirement, so I think it's best to use the href argument.
Using .find_all().
links = [a['href'] for a in soup.find_all('a', href=True)]
Using .select() with CSS selectors.
links = [a['href'] for a in soup.select('a[href]')]
You can also match the href attribute against a regex and read the result from attrs:
import re

soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+')).attrs['href']
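Keep in mind that .find() returns None when nothing matches, so chaining .attrs onto it can raise an AttributeError. A slightly safer sketch of the same idea:
import re

tag = soup.find('a', href=re.compile(r'[/]([a-z]|[A-Z])\w+'))
if tag is not None:
    print(tag['href'])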
First of all, use a different text editor that doesn't use curly quotes; they are not valid Python syntax.
Second, remove the text=True flag from the soup.find_all call.
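Applied to the question's loop, that would look roughly like this:
for a in soup.find_all('a', href=True):
    print("Link: " + a['href'])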
You could solve this with just a couple lines of gazpacho:
from gazpacho import Soup
html = """\
<div class="file-one">
<a href="/file-one/additional" class="file-link">
<h3 class="file-name">File One</h3>
</a>
<div class="location">
Down
</div>
</div>
"""
soup = Soup(html)
soup.find("a", {"class": "file-link"}).attrs['href']
Which would output:
'/file-one/additional'
A bit late to the party, but I had the same issue recently while scraping some recipes, and got mine printing cleanly by doing this:
from bs4 import BeautifulSoup
import requests

source = requests.get('url for website')
soup = BeautifulSoup(source.text, 'lxml')  # parse the response body, not the Response object
for article in soup.find_all('article'):
    link = article.find('a', href=True)['href']
    print(link)
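If any <article> happens to lack a link, article.find('a', href=True) returns None and the ['href'] lookup raises a TypeError, so a guarded variant of the same loop may be safer:
for article in soup.find_all('article'):
    a = article.find('a', href=True)
    if a is not None:
        print(a['href'])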
I just started using beautifulsoup and am stuck on an issue regarding getting attributes of tags inside other tags. I am using the whitehouse.gov/briefing-room/ for practice. What I'm trying to do right now is just get all the links on this page and append them to an empty list. This is my code right now:
import requests
from bs4 import BeautifulSoup

result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for h2_tags in soup.find_all('h2'):
    a_tag = h2_tags.find('a')
    urls.append(a_tag.attrs['href'])  # This is where I get the NoneType error
This code returns the <a> tags, but the first and last three results are None, and because of this I get a NoneType error when trying to access their attributes to get the href.
The problem is that some <h2> tags don't contain <a> tags, so you have to check for that case (a sketch of that approach follows the output below). Or just select all <a> tags that sit under an <h2> using a CSS selector:
import requests
from bs4 import BeautifulSoup
result = requests.get("https://www.whitehouse.gov/briefing-room/")
src = result.content
soup = BeautifulSoup(src, 'lxml')
urls = []
for a_tag in soup.select('h2 a'):  # <-- select <a> tags that are under <h2> tags
    urls.append(a_tag.attrs['href'])
print(*urls, sep='\n')
Prints:
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/10/statement-by-nsc-spokesperson-emily-horne-on-national-security-advisor-jake-sullivan-leading-the-first-virtual-meeting-of-the-u-s-israel-strategic-consultative-group/
https://www.whitehouse.gov/briefing-room/press-briefings/2021/03/09/press-briefing-by-press-secretary-jen-psaki-and-deputy-director-of-the-national-economic-council-bharat-ramamurti-march-9-2021/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-the-white-houses-meeting-with-climate-finance-leaders/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/readout-of-vice-president-kamala-harris-call-with-prime-minister-erna-solberg-of-norway/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/nomination-sent-to-the-senate-3/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-biden-announces-key-hire-for-the-office-of-management-and-budget/
https://www.whitehouse.gov/briefing-room/speeches-remarks/2021/03/09/remarks-by-president-biden-during-tour-of-w-s-jenks-son/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/president-joseph-r-biden-jr-approves-louisiana-disaster-declaration/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/statement-by-president-joe-biden-on-the-house-taking-up-the-pro-act/
https://www.whitehouse.gov/briefing-room/statements-releases/2021/03/09/white-house-announces-additional-staff/
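For completeness, the first approach mentioned above, keeping the <h2> loop and skipping headings without a link, could look like this sketch:
urls = []
for h2_tag in soup.find_all('h2'):
    a_tag = h2_tag.find('a')
    if a_tag is not None and a_tag.has_attr('href'):
        urls.append(a_tag['href'])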
I'm trying to get every 'a' tag in an HTML page using soup.find_all. Here's my code:
# r.text -- the YouTube home page as HTML
soup = BeautifulSoup(r.text, 'html.parser')
for lnk in soup.find_all('a', {'class': 'ytd-thumbnail'}):
    print(lnk)
    link = lnk.get("href")
    writeFile("queue.txt", "https://youtube.com" + link)
    removeQueue(url)
I'm trying to get something like this:
<a id="thumbnail" class="yt-simple-endpoint inline-block style-scope ytd-thumbnail" aria-hidden="true" tabindex="-1" href="youtubelink">
but it doesn't even go into the for loop, and I don't know why.
Pass the dictionary via the attrs keyword argument of the find_all or find method.
soup = BeautifulSoup(r.text, 'html.parser')
for lnk in soup.find_all('a', attrs={'class': 'ytd-thumbnail'}):
    print(lnk)
    link = lnk.get("href")
    writeFile("queue.txt", "https://youtube.com" + link)
    removeQueue(url)
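As a side note, bs4 also accepts a class_ keyword for filtering on CSS classes, which reads a little more directly:
for lnk in soup.find_all('a', class_='ytd-thumbnail'):
    print(lnk.get('href'))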
You can try a CSS selector instead; I find them cleaner and more robust. Here, select creates a list of all 'a' tags whose class attribute contains the substring ytd-thumbnail. As a side note, I'd also suggest using the lxml parser when working with bs4.
soup = BeautifulSoup(r.text, 'lxml')
for lnk in soup.select('a[class*=ytd-thumbnail]'):
    link = lnk.get("href")
    writeFile("queue.txt", "https://youtube.com" + link)
    removeQueue(url)
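If you want an exact class-token match instead of the substring match above, the plain CSS class selector works too:
for lnk in soup.select('a.ytd-thumbnail'):
    print(lnk.get("href"))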
import requests, bs4, webbrowser

url = 'https://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords='
keywords = "keyboard"
full_link = url + keywords
res = requests.get(full_link)
soup = bs4.BeautifulSoup(res.text, 'html.parser')
webbrowser.open(full_link)
a = soup.find('a', {'class': 'a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal'})
print(a)
Hi, I'm trying to get a very specific HTML element that is buried deep in divs, but to no avail. Here is the HTML:
<a class="a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal" title="AmazonBasics Wired Keyboard" href="https://rads.stackoverflow.com/amzn/click/com/B005EOWBHC" rel="nofollow noreferrer"><h2 data-attribute="AmazonBasics Wired Keyboard" data-max-rows="0" class="a-size-medium s-inline s-access-title a-text-normal">AmazonBasics Wired Keyboard</h2></a>
and this is buried pretty deep. I want to get the href of this element, but currently my variable a returns None.
You need to use findAll and supply the classes as a list. For example:
a = soup.findAll('a', {'class': ['a-link-normal', 's-access-detail-page', 's-color-twister-title-link', 'a-text-normal']})
But I would also recommend against such specific class selection; the only one you really need is probably s-access-detail-page.
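Following that advice, a minimal sketch keyed on just the one class (with a guard in case nothing matches):
a = soup.find('a', class_='s-access-detail-page')
if a is not None:
    print(a['href'])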
Trying to achieve the following logic:
If a URL in the text is surrounded by paragraph tags (example: <p>URL</p>), replace it in place so it becomes a link instead: <a href="URL">Click Here</a>.
The original file is a database dump (sql, UTF-8). Some URLs already exist in the desired format. I need to fix the missing links.
I am working on a script which uses BeautifulSoup. If other solutions make more sense (regex, etc.), I am open to suggestions.
You can search for all p elements whose text starts with http, then replace each with a link:
for elm in soup.find_all("p", text=lambda text: text and text.startswith("http")):
    elm.replace_with(soup.new_tag("a", href=elm.get_text()))
Example working code:
from bs4 import BeautifulSoup

data = """
<div>
    <p>http://google.com</p>
    <p>https://stackoverflow.com</p>
</div>
"""

soup = BeautifulSoup(data, "html.parser")
for elm in soup.find_all("p", text=lambda text: text and text.startswith("http")):
    elm.replace_with(soup.new_tag("a", href=elm.get_text()))

print(soup.prettify())
Prints:
<div>
 <a href="http://google.com">
 </a>
 <a href="https://stackoverflow.com">
 </a>
</div>
I can imagine cases where this approach breaks, but it should be a good start for you.
If you additionally want to add text to your links, set the .string property:
soup = BeautifulSoup(data, "html.parser")
for elm in soup.find_all("p", text=lambda text: text and text.startswith("http")):
    a = soup.new_tag("a", href=elm.get_text())
    a.string = "link"
    elm.replace_with(a)
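which should print markup along these lines:
<div>
 <a href="http://google.com">
  link
 </a>
 <a href="https://stackoverflow.com">
  link
 </a>
</div>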
I'm working on a Python script to do some web scraping. I want to find the base URL of a given section on a web page that looks like this:
<div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
...
</div>
So, I just need to get everything from the first href besides the number ('webpage-category/page/'), and I have the following working code:
pages = [l['href'] for link in soup.find_all('div', class_='pagination')
for l in link.find_all('a') if not re.search('pageSub', l['href'])]
s = pages[0]
f = ''.join([i for i in s if not i.isdigit()])
The problem is that generating this whole list is wasteful, since I just need the first href. I think a generator would be the answer, but I couldn't pull it off. Maybe you could help me make this code more concise?
What about this:
from bs4 import BeautifulSoup
html = """ <div class='pagination'>
<a href='webpage-category/page/1'>1</a>
<a href='webpage-category/page/2'>2</a>
</div>"""
soup = BeautifulSoup(html, 'html.parser')
link = soup.find('div', {'class': 'pagination'}).find('a')['href']
print('/'.join(link.split('/')[:-1]))
prints:
webpage-category/page
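As a side note on the split/join trick, str.rsplit trims the trailing segment without rebuilding the whole path:
print(link.rsplit('/', 1)[0])  # webpage-category/page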
Just FYI, speaking about the code you've provided: you can use next() instead of a list comprehension:
s = next(l['href'] for link in soup.find_all('div', class_='pagination')
for l in link.find_all('a') if not re.search('pageSub', l['href']))
UPD (using the website link provided):
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = "http://www.hdwallpapers.in/cars-desktop-wallpapers/page/2"
soup = BeautifulSoup(urlopen(url), 'html.parser')
links = soup.find_all('div', {'class': 'pagination'})[1].find_all('a')
print(next('/'.join(link['href'].split('/')[:-1]) for link in links
           if link.text.isdigit() and link.text != "1"))