Python: extract the href surrounding image

I am using bs4 and want to extract the href of the link surrounding a specified image.
For example, the HTML contains:
<div style="text-align:center;"><a href="page/folder1/image.jpg"><img src="page_files/image.jpg" alt="Picture" border="0" width="150" height="150"></a></div>
</div>
I am given the image src (page_files/image.jpg) and want to extract the corresponding href, which in this example is page/folder1/image.jpg. I tried the find_previous method, but I have a small problem extracting the href content:
soup = bs4.BeautifulSoup(page)
for img in soup('img'):
    imgLink = img.find_previous("a")
This returns the whole tag:
<img alt="Tumblr" border="0" src="Here_is_source"/>
But I can't get the href value, because when I try:
imgLink = img.find_previous("a")['href']
I get an error.
The same thing happens when I use find_parent:
imgLink = img.find_parent("a")['href']
How can I fix that? And what is better: find_previous() or find_parent()?

Make sure you only look at images that have an <a> parent tag with an href attribute:
for img in soup.select('a[href] img'):
    link = img.find_parent('a', href=True)
    print(link['href'])
The CSS selector picks only images that sit inside an <a href="..."> tag. The find_parent() call then applies the same restriction, returning only an <a> tag that has the href attribute set.
If you search all images, chances are you will find some whose parent or preceding <a> tag has no href attribute; <a> tags can also be used as link targets with <a name="...">, for example. If you are getting NoneType errors, find_parent() or find_previous() returned None because there is no matching <a> tag for the given <img> tag.
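On the question of find_previous() versus find_parent(), here is a minimal sketch of how the two differ. The sample HTML and the variable names are made up for illustration. find_parent() only walks up through the ancestors, while find_previous() walks backwards through the whole document and can land on an unrelated earlier link:

```python
from bs4 import BeautifulSoup

# Hypothetical sample: one image wrapped in a link, one loose image
# that merely follows an unrelated text link.
html = """
<div><a href="page/folder1/image.jpg"><img src="page_files/image.jpg"></a></div>
<div><a href="other.html">text link</a><img src="page_files/loose.jpg"></div>
"""
soup = BeautifulSoup(html, "html.parser")

parents = {}
previous = {}
for img in soup.find_all("img"):
    parent = img.find_parent("a", href=True)    # ancestors only
    prev = img.find_previous("a", href=True)    # anything earlier in the document
    # Guard against None before subscripting to avoid the NoneType error.
    parents[img["src"]] = parent["href"] if parent is not None else None
    previous[img["src"]] = prev["href"] if prev is not None else None

print(parents)
print(previous)
```

For the loose image, find_parent() yields None while find_previous() returns the unrelated text link, so find_parent() is usually the safer choice when you want the link that actually surrounds the image.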

Related

How Can I Get Information From An A Tag Between Two Span Tags in BeautifulSoup Using Python?

I am trying to get information from the <a> tag in between these two span tags
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">>>28669</a>
</span>
For example I would like to be able to get the value of the href in it. How can I do this?

You can look for the mentioned-123 class and then access the href with:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
print(soup.find("a", class_="mentioned-123")["href"])
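If the class on the link changes from one mention to the next (an assumption, since the question shows only one link), it may be easier to select by the enclosing span instead. A small sketch using the markup from the question:

```python
from bs4 import BeautifulSoup

html = """
<span class="mentioned">
<a class="mentioned-123" onclick="information('123');" href="#28669">&gt;&gt;28669</a>
</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Every <a> that has an href and sits inside <span class="mentioned">.
hrefs = [a["href"] for a in soup.select("span.mentioned a[href]")]
print(hrefs)
```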

How to extract a href from an a class in beautiful soup?

I'm trying to extract the href from an <a> tag with a class, but am unable to.
I've tried url = tag_variable.find("href"), but I get None.
<a class="product-card__name" href="/store/groceryGateway/en/Herbs/Fresh/Longo%27s-Fresh-Herbs-Basil/p/00772468010517">
<strong>
Longo's Fresh Herbs Basil</strong>
</a>

href is an attribute of the a tag, not a tag itself, and find() searches for tags.
Assuming you have the desired a tag as tag_variable, you can subscript it like a dict:
url = tag_variable["href"]
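A short sketch of the three behaviours side by side. The href is shortened and the strong text simplified for illustration; the variable names other than tag_variable and url are hypothetical:

```python
from bs4 import BeautifulSoup

# Shortened, illustrative version of the markup from the question.
html = '<a class="product-card__name" href="/store/p/00772468010517"><strong>Basil</strong></a>'
soup = BeautifulSoup(html, "html.parser")
tag_variable = soup.find("a", class_="product-card__name")

missing = tag_variable.find("href")       # searches for a child <href> tag, so None
url = tag_variable["href"]                # dict-style access; KeyError if the attribute is absent
maybe_title = tag_variable.get("title")   # .get() returns None instead of raising

print(missing, url, maybe_title)
```

Use .get() when the attribute may be missing and you would rather test for None than catch a KeyError.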

Use BeautifulSoup to get profile picture without class name

While learning the BeautifulSoup library and trying to crawl a webpage, I can narrow the search by tag and attributes, for example a tags with the class name user-name, which I found by inspecting the HTML source.
Here is an example that works:
<a href="https://thenewboston.com/profile.php?user=2" class="user-name">
Bucky Roberts </a>
I can easily select it with:
soup = BeautifulSoup(plain_text,'html.parser')
for link in soup.findAll('a', {'class': 'user-name'}):
However, when I try to get the profile photo's link, I see the code below by inspecting:
<div class="panel profile-photo">
<a href="https://thenewboston.com/profile.php?user=2">
<img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
</a>
</div>
In this case the img tag holding the .jpg link has no class or other attribute to select on. What should I do to get the .jpg link for each user?

You can use the img element's ancestors to build your selector. I would use the following CSS selector, which matches img elements directly under a elements that are directly under the element with the profile-photo class:
soup.select(".profile-photo > a > img")
To get the src values:
for image in soup.select(".profile-photo > a > img"):
    print(image['src'])
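Since the src in the question is site-relative, you will probably want an absolute URL before downloading it. A minimal sketch, assuming the page was fetched from https://thenewboston.com/ (base_url and the other variable names are assumptions for illustration):

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "https://thenewboston.com/"  # assumed URL of the crawled page

html = """
<div class="panel profile-photo">
<a href="https://thenewboston.com/profile.php?user=2">
<img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

photos = []
for image in soup.select(".profile-photo > a > img"):
    profile_url = image.parent["href"]           # the <a> directly wrapping the image
    photo_url = urljoin(base_url, image["src"])  # resolve the relative src
    photos.append((profile_url, photo_url))

print(photos)
```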

Detecting Image Tag in html

I am using BeautifulSoup to parse a page's HTML. Because the HTML is broken, the markup is not consistent. I have the following HTML:
<div id='VideoID'>
<a href=#><img src='file.png'></a>
</div>
On another page it is broken like this:
<div id='VideoID'>
<a href=#></a> [Image Tag not enclosed here]
<img src='file.png'>
</div>
The following line works for the first snippet as expected:
imageURL = imageElement.contents[1].contents[0]['src'].strip()
But it does not work for the second one, which is to be expected.
Is there any way to detect the img tag within the div with id 'VideoID', whether or not it is enclosed in the anchor tag?

Yes with .descendants.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#descendants
Iterate over the descendants and check each node's .name:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#name
Or even easier with CSS selectors:
soup.select("div#VideoID img")
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
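A minimal sketch showing both approaches on the two markup variants. The helper name image_sources and the sample strings are made up for illustration, and the href is quoted here for simplicity:

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# The two variants from the question: image inside the link, and image beside it.
well_formed = "<div id='VideoID'><a href='#'><img src='file.png'></a></div>"
broken = "<div id='VideoID'><a href='#'></a><img src='file.png'></div>"


def image_sources(html):
    """Return the img src values found via .descendants and via a CSS selector."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", id="VideoID")
    # .descendants yields text nodes too, so keep only Tag objects named "img".
    via_descendants = [
        node["src"]
        for node in div.descendants
        if isinstance(node, Tag) and node.name == "img"
    ]
    # The descendant combinator (a space) matches img at any depth under the div.
    via_selector = [img["src"] for img in soup.select("div#VideoID img")]
    return via_descendants, via_selector


print(image_sources(well_formed))
print(image_sources(broken))
```

Both approaches give the same result for either variant, because neither depends on the image's position relative to the anchor.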

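You can also use recursiveChildGenerator(), the older name for .descendants, to walk every nested node and pick out the image tag.
Example, where div is the <div id='VideoID'> tag:
for child in div.recursiveChildGenerator():
    if getattr(child, 'name', None) == 'img':  # text nodes are skipped
        image_file = child
This finds the image tag at any depth in the hierarchy.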

Python 3, beautiful soup, get next tag

I have the following HTML fragment, which repeats several times with different href links:
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">
Now I want to get all the href links in this document that are directly after the div tag with the class "product-list-item".
Pretty new to beautifulsoup and nothing that I came up with worked.
Thanks for your ideas.
EDIT: Does not really have to be beautifulsoup; when it can be done with regex and the python html parser this is also ok.
EDIT2: What I tried (I'm pretty new to Python, so what I did might look totally stupid from an advanced viewpoint):
soup = bs4.BeautifulSoup(htmlsource)
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].get("class"))
This will give me a list of all the "product-list-item" but then I tried something like
print(x[i].get("class").next_element)
Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:
print(x[i][0].get("class").next_element)
Which led to this error: return self.attrs[key] KeyError: 0.
Also tried with .find_all("href") and .get("href") but this all leads to the same errors.
EDIT3: OK, it seems I found out how to solve it. Now I do:
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].next_element.next_element.get("href"))
This can also be shortened by passing the class as another argument to the find_all function:
x = soup.find_all("div", "product-list-item")
for i in x:
    print(i.next_element.next_element.get("href"))
greetings

I want to get all the href links in this document that are directly after the div tag with the class "product-list-item"
To find the first <a href> element in the <div>:
links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True) # find <a> anywhere in <div>
    if a is not None:
        links.append(a['href'])
It assumes that the link is inside <div>. Any elements in <div> before the first <a href> are ignored.
If you'd like, you can be stricter about it, e.g. by taking the link only if it is the first child in <div>:
a = div.contents[0] # take the very first child even if it is not a Tag
if getattr(a, 'name', None) == 'a' and a.has_attr('href'): # getattr: safe if a is a text node
    links.append(a['href'])
Or if <a> is not inside <div>:
a = div.find_next('a', href=True) # find <a> that appears after <div>
if a is not None:
    links.append(a['href'])
There are many ways to search and navigate in BeautifulSoup.
If you search with lxml.html, you could also use xpath and css expressions if you are familiar with them.
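To see how these strategies differ, here is a small sketch. The sample HTML is hypothetical and extends the fragment from the question with closing tags, a div without a link, and a link outside any div. The CSS selector at the end is an additional option, not part of the approaches above:

```python
from bs4 import BeautifulSoup

# Hypothetical sample built around the fragment from the question.
html = """
<div class="product-list-item margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">one</a>
</div>
<div class="product-list-item margin-bottom">
<span>no link here</span>
</div>
<a href="http://www.urlexample.com/outside">outside</a>
"""
soup = BeautifulSoup(html, "html.parser")

inside = []   # first <a href> inside each div, or None
after = []    # first <a href> that appears after each div's opening tag
for div in soup.find_all("div", class_="product-list-item"):
    a = div.find("a", href=True)
    inside.append(a["href"] if a is not None else None)
    nxt = div.find_next("a", href=True)
    after.append(nxt["href"] if nxt is not None else None)

# CSS alternative: only <a href> elements that are direct children of the div.
direct = [a["href"] for a in soup.select("div.product-list-item > a[href]")]

print(inside)
print(after)
print(direct)
```

Note that find_next() keeps going past the end of the div, so for the second div it picks up the link outside it, whereas find() reports that the div has no link.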
