Python: extract the href surrounding an image

I am using bs4 and want to extract the href of the link surrounding a specified image.
For example, in the HTML I have:
<div style="text-align:center;"><a href="page/folder1/image.jpg"><img src="page_files/image.jpg" alt="Picture" border="0" width="150" height="150"></a></div>
</div>
I am given the image src (page_files/image.jpg) and want to extract the corresponding href, which in this example is page/folder1/image.jpg. I tried the find_previous method, but I have a small problem extracting the href value:
soup = bs4.BeautifulSoup(page)
for img in soup('img'):
    imgLink = img.find_previous("a")
This returns the whole tag:
<img alt="Tumblr" border="0" src="Here_is_source"/>
But I can't get the href value, because when I try:
imgLink = img.find_previous("a")['href']
I get an error.
The same thing happens when I use find_parent, like:
imgLink = img.find_parent("a")['href']
How can I fix that? And what is better: find_previous() or find_parent()?

Make sure you only look at images that have an <a> ancestor tag with an href attribute:
for img in soup.select('a[href] img'):
    link = img.find_parent('a', href=True)
    print(link['href'])
The CSS selector picks only images nested inside an <a href="..."> tag. The find_parent() search then applies the same restriction, returning the nearest enclosing <a> tag that has an href attribute.
If you search all images instead, chances are that some of them have an enclosing or preceding <a> tag without an href attribute; <a> tags can also be used as link targets with <a name="...">, for example. Subscripting such a tag with ['href'] raises a KeyError. If you are getting errors that mention NoneType, that simply means there is no such parent tag for the given <img> tag.
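As a rough illustration of the difference between find_parent() and find_previous(), here is a minimal sketch. The sample markup, the file names anchor.jpg and plain.jpg, and the helper href_for_image are made up for this example; only the first image mirrors the question.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one linked image, one image inside a named
# anchor without href, and one image with no <a> ancestor at all.
html = """
<div><a href="page/folder1/image.jpg"><img src="page_files/image.jpg" alt="Picture"></a></div>
<div><a name="top"><img src="page_files/anchor.jpg"></a></div>
<div><img src="page_files/plain.jpg"></div>
"""

soup = BeautifulSoup(html, "html.parser")


def href_for_image(soup, src):
    """Return the href of the nearest enclosing <a href> for the <img>
    with the given src, or None if there is no such image or link."""
    img = soup.find("img", src=src)
    if img is None:
        return None
    link = img.find_parent("a", href=True)
    return link["href"] if link is not None else None


print(href_for_image(soup, "page_files/image.jpg"))
print(href_for_image(soup, "page_files/anchor.jpg"))
print(href_for_image(soup, "page_files/plain.jpg"))

# find_previous() walks backwards through the document, so for the unlinked
# image it returns the link that wraps a *different* image.
plain = soup.find("img", src="page_files/plain.jpg")
wrong = plain.find_previous("a", href=True)
print(wrong["href"])
```

For this reason find_parent() is usually the safer choice when the link is expected to wrap the image: it only ever returns an enclosing tag, and returns None otherwise.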

Related

How Can I Get Information From An A Tag Between Two Span Tags in BeautifulSoup Using Python?

I am trying to get information from the <a> tag in between these two span tags
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">>>28669</a>
</span>
For example I would like to be able to get the value of the href in it. How can I do this?
You can look for the mentioned-123 class and then access the href with:
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a", class_="mentioned-123")["href"])
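If the numeric part of the class name varies from link to link, one possible alternative is to go through the enclosing span instead. The sketch below assumes the markup from the question and that every such link is a direct child of a span with the class mentioned.

```python
from bs4 import BeautifulSoup

html = """
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">&gt;&gt;28669</a>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Every <a> with an href that is a direct child of <span class="mentioned">.
hrefs = [a["href"] for a in soup.select("span.mentioned > a[href]")]
print(hrefs)

# find() returns None when nothing matches, so guard before reading the
# attribute; .get() returns None instead of raising if href is absent.
first = soup.find("a", class_="mentioned-123")
first_href = first.get("href") if first is not None else None
print(first_href)
```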

How to extract a href from an a class in beautiful soup?

I'm trying to extract the href from an <a> tag with a class but am unable to.
I've tried url = tag_variable.find("href"), but am getting None.
<a class="product-card__name" href="/store/groceryGateway/en/Herbs/Fresh/Longo%27s-Fresh-Herbs-Basil/p/00772468010517">
<strong>
Longo's Fresh Herbs Basil</strong>
</a>
href is an attribute of the a tag, not a tag itself, and find searches for tags.
Assuming you have the desired a tag as tag_variable, you can subscript it like a dict:
url = tag_variable["href"]
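A small sketch of the distinction, using the markup from the question. How tag_variable was obtained is not shown in the question, so the find() call that produces it here is an assumption.

```python
from bs4 import BeautifulSoup

html = """<a class="product-card__name" href="/store/groceryGateway/en/Herbs/Fresh/Longo%27s-Fresh-Herbs-Basil/p/00772468010517">
<strong>
Longo's Fresh Herbs Basil</strong>
</a>"""

soup = BeautifulSoup(html, "html.parser")
# Assumed way of obtaining the tag.
tag_variable = soup.find("a", class_="product-card__name")

# find() looks for nested *tags*; there is no <href> element, hence None.
print(tag_variable.find("href"))
# A nested tag such as <strong> is found as expected.
print(tag_variable.find("strong") is not None)

# Attributes are read with dict-style access.
url = tag_variable["href"]
# .get() returns None for a missing attribute instead of raising KeyError.
title = tag_variable.get("title")
# The visible text of the link, with surrounding whitespace stripped.
name = tag_variable.get_text(strip=True)
print(url, title, name)
```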

Use BeautifulSoup to get profile picture without class name

While learning the BeautifulSoup library and trying to crawl a webpage, I found that I can narrow the search results by tag and attributes, such as an a tag with the class name user-name, which I can find by inspecting the HTML source.
Here is a successful example:
<a href="https://thenewboston.com/profile.php?user=2" class="user-name">
Bucky Roberts </a>
I can easily write:
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.findAll('a', {'class': 'user-name'}):
    ...  # handle each matching link here
However, when I try to get the profile photo's link, inspecting the page shows the code below:
<div class="panel profile-photo">
<a href="https://thenewboston.com/profile.php?user=2">
<img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
</a>
</div>
In this case the <img> tag has no class or other attribute to search on. What should I do to get the .jpg link for each user?
You can use the parent elements of the img element to build your locator. I would use the following CSS selector, which matches img elements directly under a elements directly under the element with the profile-photo class:
soup.select(".profile-photo > a > img")
To get the src values:
for image in soup.select(".profile-photo > a > img"):
    print(image['src'])
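Since the src in this markup is a site-relative path, a crawler would probably want an absolute URL. The following sketch pairs each profile link with its photo; the base_url value and the photos dictionary are assumptions made for the example, not part of the original answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

html = """
<div class="panel profile-photo">
<a href="https://thenewboston.com/profile.php?user=2">
<img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
</a>
</div>
"""

# Assumed base URL of the crawled page, used to resolve relative paths.
base_url = "https://thenewboston.com/"

soup = BeautifulSoup(html, "html.parser")

photos = {}
# Requiring [href] and [src] in the selector keeps the lookups below safe.
for image in soup.select(".profile-photo > a[href] > img[src]"):
    profile_url = image.parent["href"]  # the <a> directly above the <img>
    photos[profile_url] = urljoin(base_url, image["src"])

print(photos)
```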

Detecting Image Tag in html

I am using BeautifulSoup to parse a page's HTML. Because the HTML is broken, the markup is not consistent. I have the following HTML:
<div id='VideoID'>
<a href=#><img src='file.png'></a>
</div>
On another page it is broken like this:
<div id='VideoID'>
<a href=#></a> [Image Tag not enclosed here]
<img src='file.png'>
</div>
The following line works for the first snippet as expected:
imageURL = imageElement.contents[1].contents[0]['src'].strip()
But it does not work for the second one, obviously.
Is there any way to detect the img tag within the div with id 'VideoID', whether or not it is enclosed in the anchor tag?
Yes, with .descendants.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#descendants
Iterate over the descendants and check the .name of each one.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#name
Or even easier with CSS selectors:
soup.select("div#VideoID img")
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
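Both suggestions can be sketched roughly as follows. The two sample strings are compact versions of the snippets in the question with the href value quoted, and the helper names image_src_descendants and image_src_select are made up for this example.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Compact versions of the two markup variants from the question.
well_formed = "<div id='VideoID'><a href='#'><img src='file.png'></a></div>"
broken = "<div id='VideoID'><a href='#'></a><img src='file.png'></div>"


def image_src_descendants(html):
    """Walk every descendant of div#VideoID and return the first img src."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", id="VideoID")
    if container is None:
        return None
    for node in container.descendants:
        # Descendants include text nodes, so check for real tags first.
        if isinstance(node, Tag) and node.name == "img":
            return node.get("src")
    return None


def image_src_select(html):
    """Same lookup with a CSS selector; matches an img at any depth."""
    soup = BeautifulSoup(html, "html.parser")
    img = soup.select_one("div#VideoID img")
    return img.get("src") if img is not None else None


for markup in (well_formed, broken):
    print(image_src_descendants(markup), image_src_select(markup))
```

Either helper gives the same result whether or not the image sits inside the anchor, because neither depends on the position of the img within the div.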
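You can use recursiveChildGenerator() to walk every nested child element and find the image tag. In bs4 this is the older name for the .descendants generator.
Example:
for child in childs.recursiveChildGenerator():
    # Text nodes have no tag name, so only real <img> tags match.
    if getattr(child, "name", None) == "img":
        image_file = child
This will find the image tag at any depth in the hierarchy.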

Python 3, beautiful soup, get next tag

I have the following HTML fragment, which repeats several times with other href links:
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
Now I want to get all the href links in this document that are directly after the div tag with the class "product-list-item".
Pretty new to beautifulsoup and nothing that I came up with worked.
Thanks for your ideas.
EDIT: It does not really have to be BeautifulSoup; if it can be done with regex and the Python HTML parser, that is also OK.
EDIT2: What I tried (I'm pretty new to Python, so what I did might be totally stupid from an advanced viewpoint):
soup = bs4.BeautifulSoup(htmlsource)
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].get("class"))
This will give me a list of all the "product-list-item" but then I tried something like
print(x[i].get("class").next_element)
Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:
print(x[i][0].get("class").next_element)
Which led to this error: return self.attrs[key] KeyError: 0.
Also tried with .find_all("href") and .get("href") but this all leads to the same errors.
EDIT3: Ok seems I found out how to solve it, now I did:
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].next_element.next_element.get("href"))
This can also be shortened by passing the class name as a second argument to find_all:
x = soup.find_all("div", "product-list-item")
for i in x:
    print(i.next_element.next_element.get("href"))
greetings
I want to get all the href links in this document that are directly after the div tag with the class "product-list-item"
To find the first <a href> element in the <div>:
links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True) # find <a> anywhere in <div>
    if a is not None:
        links.append(a['href'])
It assumes that the link is inside <div>. Any elements in <div> before the first <a href> are ignored.
If you'd like, you can be stricter about it, e.g., taking the link only if it is the first child in <div>:
a = div.contents[0] # take the very first child even if it is not a Tag
if a.name == 'a' and a.has_attr('href'):
    links.append(a['href'])
Or if <a> is not inside <div>:
a = div.find_next('a', href=True) # find <a> that appears after <div>
if a is not None:
    links.append(a['href'])
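The practical difference between find() and find_next() shows up when one of the divs has no link. Here is a small sketch; the closing tags, the link text, the second URL, and the link-free middle div are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the middle product div contains no link at all.
html = """
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">first</a>
</div>
<div class="product-list-item margin-bottom"><p>no link here</p></div>
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_2">second</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
divs = soup.find_all("div", class_="product-list-item")


def href_or_none(tag):
    """Return the tag's href, or None when no tag was found."""
    return tag["href"] if tag is not None else None


# find() only searches inside each <div>.
inside = [href_or_none(div.find("a", href=True)) for div in divs]

# find_next() keeps going past the end of the <div>, so the link-free div
# picks up the link that belongs to the following product.
after = [href_or_none(div.find_next("a", href=True)) for div in divs]

print(inside)
print(after)
```

With markup like this, find() reports the missing link as None, while find_next() silently attributes the next product's link to the wrong div.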
There are many ways to search and navigate in BeautifulSoup.
If you search with lxml.html, you could also use XPath and CSS expressions if you are familiar with them.
