Use BeautifulSoup to get profile picture without class name - python

While learning the BeautifulSoup library and trying to crawl a webpage, I found that I can narrow the search results by tag and attribute, for example an a tag with class name user-name, both of which can be found by inspecting the HTML source.
Here is a successful example:
<a href="https://thenewboston.com/profile.php?user=2" class="user-name">
Bucky Roberts </a>
I can easily select it with:
from bs4 import BeautifulSoup
soup = BeautifulSoup(plain_text, 'html.parser')
for link in soup.find_all('a', {'class': 'user-name'}):
    print(link.get('href'))
However, when I try to get the profile photo's link, I see the code below by inspecting:
<div class="panel profile-photo">
<a href="https://thenewboston.com/profile.php?user=2">
<img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
</a>
</div>
In this case the img tag holding the .jpg link has no class or id of its own to search by. What should I do to get the .jpg link for each user?

You can use the img element's parent elements to build your locator. I would use the following CSS selector, which matches img elements directly under a elements that are directly under the element with the profile-photo class:
soup.select(".profile-photo > a > img")
To get the src values:
for image in soup.select(".profile-photo > a > img"):
    print(image['src'])
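As a minimal sketch of the selector above, the following runs it against the markup from the question and resolves the relative src against the page address. The `base_url` value and the use of `urljoin` are assumptions for illustration, not part of the original answer.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Sample markup copied from the question.
plain_text = """
<div class="panel profile-photo">
  <a href="https://thenewboston.com/profile.php?user=2">
    <img src="/photos/users/2/resized/869b40793dc9aa91a438b1eb6ceeaa96.jpg" alt="">
  </a>
</div>
"""

# Assumed address of the page the markup came from.
base_url = "https://thenewboston.com/profile.php?user=2"

soup = BeautifulSoup(plain_text, "html.parser")

# The src is site-relative, so join it with the page address to get a full URL.
photo_urls = [
    urljoin(base_url, image["src"])
    for image in soup.select(".profile-photo > a > img")
]
print(photo_urls)
```

Because the selector relies on the enclosing div's class rather than anything on the img tag itself, it keeps working as long as the profile-photo wrapper stays in place.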

Related

How to get links within a subelement in selenium?

I have the following html code:
<div id="category"> //parent div
<div class="username"> // n-number of elements of class username which all exist within parent div
<a rel="" href="link" title="smth">click</a>
</div>
</div>
I want to get all the links within the class username, but only those within the parent div where id=category. When I execute the code below it doesn't work: I can only access the title attribute by default but can't extract the link. Does anyone have a solution?
a = driver.find_element_by_id('category').find_elements_by_class_name("username")
links = [x.get_attribute("href") for x in a]
Use the following css selector which will return all the anchor tags.
from selenium.webdriver.common.by import By
links = [x.get_attribute("href") for x in driver.find_elements(By.CSS_SELECTOR, "#category > .username > a")]
Or, with the older helper method (removed in recent Selenium 4 releases):
links = [x.get_attribute("href") for x in driver.find_elements_by_css_selector("#category > .username >a")]

Creating a css selector to locate multiple ids in a single-shot

I've defined css selectors within the script to get the text within span elements, and I'm getting the text as expected. However, the way I did it is definitely messy: I just separated different css selectors with a comma to tell the script I'm after this or that.
If I opt for xpath I could have used 'div//span[.="Featured" or .="Sponsored"]' but in case of css selector I could not find anything similar to serve the same purpose. I know using 'span:contains("Featured"),span:contains("Sponsored")' I can get the text but there is the comma in between as usual.
What is the ideal way to locate the elements (within different ids) using css selectors without a comma?
My attempt so far:
from lxml.html import fromstring
html = """
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/pizza-hut">
Pizza Hut
</a>
<div id="featured other-dynamic-ids">
<span>Sponsored</span>
</div>
</div>
<div class="rest-list-information">
<a class="restaurant-header" href="/madison-wi/restaurants/salads-up">
Salads UP
</a>
<div id="other-dynamic-ids border">
<span>Featured</span>
</div>
</div>
"""
root = fromstring(html)
for item in root.cssselect("[id~='featured'] span,[id~='border'] span"):
    print(item.text)
You can do:
.rest-list-information div span
But I think it's a bad idea to consider the comma messy. You won't find many stylesheets that don't have commas.
If you are just looking to get all 'span' text from the HTML then the following should suffice:
root_spans = root.xpath('//span')
# Use a separate loop variable so the list being iterated is not overwritten.
for span in root_spans:
    span_text = span.xpath('.//text()')[0]
    print(span_text)
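For comparison, here is a minimal sketch of the XPath `or` test mentioned in the question, run through lxml. The markup is a trimmed version of the question's sample, and the third block with the "Ignored" span is a hypothetical addition to show that unrelated spans are skipped.

```python
from lxml.html import fromstring

# Trimmed sample from the question; the last block is hypothetical.
html = """
<div class="rest-list-information">
  <div id="featured other-dynamic-ids"><span>Sponsored</span></div>
</div>
<div class="rest-list-information">
  <div id="other-dynamic-ids border"><span>Featured</span></div>
</div>
<div class="rest-list-information">
  <div id="unrelated"><span>Ignored</span></div>
</div>
"""

root = fromstring(html)

# A single XPath expression with "or" replaces the two comma-separated selectors.
xpath = '//div[@class="rest-list-information"]//span[.="Featured" or .="Sponsored"]'
labels = [span.text for span in root.xpath(xpath)]
print(labels)
```

This matches on the span text rather than on the changing id values, so it only suits cases where the set of label texts is known in advance.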

Python BeautifulSoup4: How to get link from anchor tag without duplicate

<a href="link" class="link_to_img">
<img src="doesn't matter"></img>
</a>
<span>
<a href="link" class="link_to_img">Title Of Image</a>
</span>
As you can see, there are two <a href="link" class="link_to_img"></a> tags. When I try to get the link, for example with href.findAll('a', {'class': 'link_to_img'}), it gets the link but duplicates it, and I need it only once. Is there a way I can target the <a></a> inside the <span></span>?
You can use the limit argument in findAll().
find_all("a", limit=1)
You can select your link with a span > a CSS selector which would match an a tag directly inside a span tag:
soup.select("span > a")
You can additionally check the class:
soup.select("span > a.link_to_img")
If you are using the latest beautifulsoup4 package, you can use select_one() to have it return a single Tag instance instead of a list:
soup.select_one("span > a.link_to_img")
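A minimal sketch of the select_one() approach, run against the markup from the question. The placeholder src value and the variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Markup from the question, with a placeholder image source.
html = """
<a href="link" class="link_to_img">
  <img src="placeholder.jpg">
</a>
<span>
  <a href="link" class="link_to_img">Title Of Image</a>
</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Both anchors share the class, so a class-only search finds two tags.
all_anchors = soup.select("a.link_to_img")

# Restricting the match to anchors directly inside a span yields one tag.
anchor = soup.select_one("span > a.link_to_img")
href = anchor["href"] if anchor is not None else None
title = anchor.get_text(strip=True) if anchor is not None else None
print(len(all_anchors), href, title)
```

select_one() returns None when nothing matches, which is why the sketch checks the result before reading href.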

Python: extract the href surrounding image

I am using bs4 and want to extract a href of a specified image.
For example in the html code I have:
<div style="text-align:center;"><img src="page_files/image.jpg" alt="Picture" border="0" width="150" height="150"></div>
</div>
I am given the image src (page_files/image.jpg) and want to extract the corresponding href, which in this example is page/folder1/image.jpg. I tried the find_previous method, but I have a small problem extracting the href content:
soup = bs4.BeautifulSoup(page)
for img in soup('img'):
    imgLink = img.find_previous("a")
This returns the whole tag:
<img alt="Tumblr" border="0" src="Here_is_source"/>
But I can't get the href content, because when I try:
imgLink = img.find_previous("a")['href']
I have an error.
The same thing happens when I try to use find_parent:
imgLink = img.find_parent("a")['href']
How can I fix that? And what is better: find_previous() or find_parent()?
Make sure you are only looking for images that are inside an <a> tag with an href attribute:
for img in soup.select('a[href] img'):
    link = img.find_parent('a', href=True)
    print(link['href'])
The CSS selector picks only images that sit inside an <a> tag with an href attribute. The find_parent() search then applies the same restriction, so it returns only an enclosing <a> tag that has the attribute set.
If you are searching for all images, chances are some of them have an enclosing or preceding <a> tag without an href attribute; <a> tags can also be used for link targets with <a name="...">, for example. If you are getting NoneType errors, that simply means there is no such parent tag for the given <img> tag.
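A minimal sketch of looking up the href for one given image src while guarding against the None case described above. The markup and the helper name `href_for_image` are hypothetical; only the src and href values of the first image come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a linked image, an unlinked image, and an image
# inside a named anchor that has no href.
page = """
<div><a href="page/folder1/image.jpg"><img src="page_files/image.jpg" alt="Picture"></a></div>
<div><img src="page_files/plain.jpg" alt="No link"></div>
<div><a name="target"><img src="page_files/anchor.jpg" alt="Named anchor"></a></div>
"""

soup = BeautifulSoup(page, "html.parser")


def href_for_image(soup, src):
    """Return the href of the <a> enclosing the image with this src, or None."""
    img = soup.find("img", src=src)
    if img is None:
        return None
    # href=True skips enclosing anchors that lack an href attribute.
    link = img.find_parent("a", href=True)
    return link["href"] if link is not None else None


print(href_for_image(soup, "page_files/image.jpg"))
print(href_for_image(soup, "page_files/plain.jpg"))
```

find_parent() only walks up through enclosing tags, whereas find_previous() returns any earlier <a> in the document, so find_parent() is the safer choice when the link must actually wrap the image.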

Detecting Image Tag in html

I am using BeautifulSoup to parse a page's HTML. Because the HTML is broken, the markup is not consistent. I have the following HTML:
<div id='VideoID'>
<a href=#><img src='file.png'></a>
</div>
On another page it is broken like this:
<div id='VideoID'>
<a href=#></a> [Image Tag not enclosed here]
<img src='file.png'>
</div>
The following line works for the first snippet as expected:
imageURL = imageElement.contents[1].contents[0]['src'].strip()
But, as expected, it does not work for the second one.
Is there any way to detect the img tag within the div with id 'VideoID', whether or not it is enclosed in the anchor tag?
Yes with .descendants.
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#descendants
Iterate over the descendants and check each one's .name:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#name
Or even easier with CSS selectors:
soup.select("div#VideoID img")
http://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
You can also use recursiveChildGenerator() to walk through the nested child elements and find the image tag.
Example:
for child in childs.recursiveChildGenerator():
    # Text nodes have no tag name, so only <img> tags match here.
    if getattr(child, "name", None) == "img":
        image_file = child["src"]
This finds the image tag at any depth in the hierarchy.
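A minimal sketch of the CSS selector approach applied to both markup variants. The two sample strings are adapted from the question (with the href quoted), and the helper name `video_image_src` is an assumption for illustration.

```python
from bs4 import BeautifulSoup


def video_image_src(html):
    """Return the src of the first <img> anywhere inside div#VideoID, or None."""
    soup = BeautifulSoup(html, "html.parser")
    # The descendant combinator (a space) matches at any depth, so it does not
    # matter whether the image is wrapped in the anchor.
    img = soup.select_one("div#VideoID img")
    return img["src"].strip() if img is not None else None


# Image enclosed in the anchor.
enclosed = '<div id="VideoID"><a href="#"><img src="file.png"></a></div>'
# Image placed after the anchor instead of inside it.
broken = '<div id="VideoID"><a href="#"></a><img src="file.png"></div>'

print(video_image_src(enclosed))
print(video_image_src(broken))
```

Unlike indexing into .contents, this does not depend on the exact position of the image among its siblings.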
