How to get website image src attributes using BeautifulSoup - Python

I'm trying to get all the image src/hyperlink attributes from a webpage:
import requests
from bs4 import BeautifulSoup

image_list = []
r = requests.get('https://example.com')
soup = BeautifulSoup(r.content, 'html.parser')
for link in soup.find_all('img'):
    image_list.append(link)

You can read the attributes of an HTML tag with the get function: pass it the name of the attribute you want to extract:
for link in soup.find_all('img'):
    image_list.append(link.get('src'))
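Note that get('src') returns None for img tags that lack the attribute, and many pages use relative paths. A minimal sketch of a safer variant (the example.com URL is the placeholder from the question) that skips missing attributes and resolves relative URLs with urljoin:

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = 'https://example.com'  # placeholder URL from the question
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

image_list = []
for link in soup.find_all('img'):
    src = link.get('src')  # None instead of a KeyError when the attribute is missing
    if src:
        image_list.append(urljoin(url, src))  # resolve relative paths against the page URL
print(image_list)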

Related

Python: list all webpages with a specific image

I want to create a list of all pages with this image (image.png).
I did the following; what do I need to change?
import requests
from bs4 import BeautifulSoup
def getdata(url):
    r = requests.get(url)
    return r.text

data = getdata("https://url")
soup = BeautifulSoup(data, 'html.parser')
image_links = soup.find_all('img', {'src': 'image.png'})
print(image_links)
It sounds like you are trying to scrape a JavaScript-rendered website. In that case you can use Selenium or a similar browser-automation library, so the page's scripts run before you parse it, as in the sketch below; see also the next question.
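A minimal sketch, assuming Selenium 4 and a local Chrome install (the URL and image name are the placeholders from the question):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless=new')  # run Chrome without opening a window
driver = webdriver.Chrome(options=options)
driver.get('https://url')  # placeholder URL from the question
soup = BeautifulSoup(driver.page_source, 'html.parser')  # HTML after JavaScript has run
image_links = soup.find_all('img', {'src': 'image.png'})
print(image_links)
driver.quit()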

How to get an image tag from a dynamic web page using BeautifulSoup?

Hi, I am trying to get the images on a webpage using requests and BeautifulSoup.
import requests
from bs4 import BeautifulSoup as BS
data = requests.get(url, headers=headers).content
soup = BS(data, "html.parser")
for imgtag in soup.find_all("img", class_="slider-img"):
    print(imgtag["src"])
The problem is that the page I get in data does not contain the image tags, yet when I go to the webpage in my browser the div tag is populated with multiple <img class="slider-img"> tags.
I am new to this, so I am not sure what is going on with that web page. Thanks in advance for any help.
PS: the web page uses the Fotorama slider, and the src attributes contain CDN links, if that matters.
The image tags are created dynamically by JavaScript. You only need the uuids to construct the image URLs, and they are stored within the page:
import re
import requests
from ast import literal_eval
url = "https://fotorama.io/"
img_url = "https://ucarecdn.com/{uuid}/-/stretch/off/-/resize/760x/"
html_doc = requests.get(url).text
uuids = re.search(r"uuids: (\[.*?\])", html_doc, flags=re.S).group(1)
uuids = literal_eval(uuids)
for uuid in uuids:
    print(img_url.format(uuid=uuid))
Prints:
https://ucarecdn.com/05e7ff61-c1d5-4d96-ae79-c381956cca2e/-/stretch/off/-/resize/760x/
https://ucarecdn.com/cd8dfa25-2bc5-4546-995a-f3fd23809e1d/-/stretch/off/-/resize/760x/
https://ucarecdn.com/382a5139-6712-4418-b25e-cc8ba69ab07f/-/stretch/off/-/resize/760x/
https://ucarecdn.com/3ed25902-4a51-4628-a057-1e55fbca7856/-/stretch/off/-/resize/760x/
https://ucarecdn.com/5b0b329d-050e-4143-bc92-7f40cdde46f5/-/stretch/off/-/resize/760x/
https://ucarecdn.com/464f96db-6ae3-4875-ac6a-cbede40c4a51/-/stretch/off/-/resize/760x/
https://ucarecdn.com/4facbe78-b4e8-4b7d-8fb0-d3659f46f1b4/-/stretch/off/-/resize/760x/
https://ucarecdn.com/379c6c28-f726-48a3-b59e-1248e1e30443/-/stretch/off/-/resize/760x/
https://ucarecdn.com/631479df-27a8-4047-ae59-63f9167001f2/-/stretch/off/-/resize/760x/
https://ucarecdn.com/8e1e4402-84f0-4d78-b7d8-c48ec437b5af/-/stretch/off/-/resize/760x/
https://ucarecdn.com/f55e6755-198a-408d-8e82-a50370527aed/-/stretch/off/-/resize/760x/
https://ucarecdn.com/5264c896-cf01-4ad9-9216-114c20a388cc/-/stretch/off/-/resize/760x/
https://ucarecdn.com/c6284eae-9be4-4811-b45b-17a5b6e99ad2/-/stretch/off/-/resize/760x/
https://ucarecdn.com/40ff508f-01e5-4417-bee0-20633efc6147/-/stretch/off/-/resize/760x/
https://ucarecdn.com/eaaee377-f1b5-49d7-a7db-d7a1f86b2805/-/stretch/off/-/resize/760x/
https://ucarecdn.com/584c29c8-b521-48ee-8104-6656d4faac97/-/stretch/off/-/resize/760x/
https://ucarecdn.com/798aa641-01fe-4ed2-886b-bac818c5fdfc/-/stretch/off/-/resize/760x/
https://ucarecdn.com/f82be8f5-d517-4642-8fe1-8987b4e530d0/-/stretch/off/-/resize/760x/
https://ucarecdn.com/23b818d0-07c3-40de-a070-c999c1323ff3/-/stretch/off/-/resize/760x/
https://ucarecdn.com/7ca0e7f6-90eb-4254-82ea-58c77e74f6a0/-/stretch/off/-/resize/760x/
https://ucarecdn.com/42dc8c54-2315-453f-9b40-07e332b8ee39/-/stretch/off/-/resize/760x/
https://ucarecdn.com/8e62227c-5acb-4603-abb9-ac0643b7b478/-/stretch/off/-/resize/760x/
https://ucarecdn.com/80713821-5d54-4819-810a-19991502ca56/-/stretch/off/-/resize/760x/
https://ucarecdn.com/35ce83fa-eac1-4326-83e9-e445450b35ce/-/stretch/off/-/resize/760x/
https://ucarecdn.com/3df9ac37-4e86-49e5-9095-28679ab37718/-/stretch/off/-/resize/760x/
https://ucarecdn.com/9e7211c0-b73b-4b1d-8b47-4b1700f9a80f/-/stretch/off/-/resize/760x/
https://ucarecdn.com/1cc3c44b-e4a9-4e37-96cf-afafeb3eb748/-/stretch/off/-/resize/760x/
https://ucarecdn.com/ab52465c-b3d8-4bf6-986a-a4bf815dfaed/-/stretch/off/-/resize/760x/
https://ucarecdn.com/69e43c1d-9fac-4278-bec5-52291c1b1c2b/-/stretch/off/-/resize/760x/
https://ucarecdn.com/0627c11f-522d-48b9-9f17-9ea05b769aaa/-/stretch/off/-/resize/760x/
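If you also want to save the image files, a short follow-up sketch reusing uuids and img_url from the code above (the slide_N.jpg filenames are made up for illustration):

for i, uuid in enumerate(uuids):
    resp = requests.get(img_url.format(uuid=uuid))
    with open('slide_{}.jpg'.format(i), 'wb') as f:  # hypothetical filenames
        f.write(resp.content)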

How to filter image links based on the words in their URLs with BeautifulSoup 4

I am new to BeautifulSoup 4 and I am trying to fetch all image links from a site, e.g. Unsplash, but I only want URLs that contain the word "photo", e.g.
https://images.unsplash.com/photo-1541892079-2475b9253785?ixlib=rb-1.2.1&auto=format&fit=crop&w=500&q=60
I don't want URLs that contain the word "profile", e.g.
https://images.unsplash.com/profile-1508728808608-d3781b017e73?dpr=1&auto=format&fit=crop&w=32&h=32&q=60&crop=faces&bg=fff
How can I do this? I am using Python 3.6 and urllib3.
You can use this script as an example of how to filter the links:
import requests
from bs4 import BeautifulSoup
url = 'https://unsplash.com'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for img in soup.find_all('img'):
    if 'photo' in img['src']:  # print only links with `photo` inside them
        print(img['src'])
Prints:
https://images.unsplash.com/photo-1597649260558-e2bd7d35f043?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format%2Ccompress&fit=crop&w=1000&h=1000
https://images.unsplash.com/photo-1598929214025-d6bb6167d43b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599567513879-604247ea2bd3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599366611308-719895c34512?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1598929214025-d6bb6167d43b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599366611308-719895c34512?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599567513879-604247ea2bd3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1598929214025-d6bb6167d43b?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599567513879-604247ea2bd3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
https://images.unsplash.com/photo-1599366611308-719895c34512?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&w=1000&q=80
With urllib:
import urllib.request
from bs4 import BeautifulSoup
url = 'https://unsplash.com'
soup = BeautifulSoup(urllib.request.urlopen(url).read(), 'html.parser')
for img in soup.find_all('img'):
    if 'photo' in img['src']:
        print(img['src'])
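Note that img['src'] raises a KeyError for img tags that have no src attribute. A slightly safer variant (a sketch under the same assumptions) uses .get and also drops the "profile" links explicitly:

import requests
from bs4 import BeautifulSoup

url = 'https://unsplash.com'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for img in soup.find_all('img'):
    src = img.get('src', '')  # empty string instead of a KeyError when src is missing
    if 'photo' in src and 'profile' not in src:  # keep photo links, skip profile avatars
        print(src)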

How to extract specific URL from HTML using Beautiful Soup?

I want to extract specific URLs from an HTML page.
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = 'http://bassrx.tumblr.com/tagged/tt'  # nsfw link
page = urlopen(url)
html = page.read() # get the html from the url
# this works without BeautifulSoup, but it is slow:
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
The output of the above is exactly the URL, nothing else: http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg
The only downside is it is very slow.
BeautifulSoup is extremely fast at parsing HTML, so that's why I want to use it.
The URLs that I want are actually the img src values. Here's a snippet from the HTML that contains the information I want.
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
So, my question is: how can I get BeautifulSoup to extract all of those img src URLs cleanly, without any other cruft?
I just want a list of matching URLs. I've been trying to use the soup.findall() function, but cannot get any useful results.
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://bassrx.tumblr.com/tagged/tt'
soup = BeautifulSoup(urlopen(url).read())
for element in soup.findAll('img'):
    print(element.get('src'))
You can use the div.media > a > img CSS selector to find img tags inside an a tag which is inside a div tag with the media class:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "<url_here>"
soup = BeautifulSoup(urlopen(url))
images = soup.select('div.media > a > img')
print [image.get('src') for image in images]
In order to make the parsing faster, you can use the lxml parser:
soup = BeautifulSoup(urlopen(url), "lxml")
You need to install the lxml module first, of course.
Also, you can make use of the SoupStrainer class to parse only the relevant part of the document, as in the sketch below.
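A minimal sketch (same Python 2 / urllib2 setup as above):

from urllib2 import urlopen
from bs4 import BeautifulSoup, SoupStrainer

url = "<url_here>"
only_imgs = SoupStrainer('img')  # parse nothing but the <img> tags
soup = BeautifulSoup(urlopen(url), 'lxml', parse_only=only_imgs)
print [img.get('src') for img in soup.find_all('img')]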
Hope that helps.
Have a look at BeautifulSoup's find_all combined with re.compile:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read()
bs = BeautifulSoup(html, "html.parser")
a_tumblr = bs.find_all(href=re.compile(r"media\.tumblr"))
##[<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>, <link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="apple-touch-icon"/>]

How to find specific video html tag using beautiful soup?

Does anyone know how to use BeautifulSoup in Python?
I have this search engine with a list of different urls.
I want to get only the HTML tag containing a video embed URL, and get the link.
Example:
import BeautifulSoup
html = '''https://archive.org/details/20070519_detroit2'''
#or this.. html = '''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
#or this... html = '''https://www.youtube.com/watch?v=fI3zBtE_S_k'''
soup = BeautifulSoup.BeautifulSoup(html)
What should I do next to get the HTML tag of the video or object, or the exact link to the video?
I need it to put in my iframe. I will integrate the Python into my PHP: get the link to the video, output it with Python, then echo it into my iframe.
You need to get the HTML of the page, not just the URL.
Use the built-in urllib library like this:
import urllib
from bs4 import BeautifulSoup as BS
url = '''https://archive.org/details/20070519_detroit2'''
#open and read page
page = urllib.urlopen(url)
html = page.read()
#create BeautifulSoup parse-able "soup"
soup = BS(html)
#get the src attribute from the video tag
video = soup.find("video").get("src")
Also, with the site you are using, I noticed that to get the embed link you can just change "details" in the link to "embed", so it looks like this:
https://archive.org/embed/20070519_detroit2
So if you want to do this for multiple URLs without having to parse each page, just do something like this:
url = '''https://archive.org/details/20070519_detroit2'''
spl = url.split('/')
spl[3] = 'embed'
embed = "/".join(spl)
print embed
EDIT
To get the embed link for the other links you provided in your edit, you need to look through the HTML of the page you are parsing until you find the link, then get the tag it's in and its attribute.
For
'''http://www.kumby.com/avatar-the-last-airbender-book-3-chapter-5/'''
just do
soup.find("iframe").get("src")
Use the iframe because the link is in the iframe tag, and .get("src") because the link is its src attribute.
You can try the next one yourself, because you should learn how to do it if you want to be able to do it in the future :)
Good luck!
You can't parse a URL. BeautifulSoup is used to parse an HTML page. Retrieve the page first:
import urllib2
data = urllib2.urlopen("https://archive.org/details/20070519_detroit2")
html = data.read()
Then you can use find, and then take the src attribute:
soup = BeautifulSoup(html)
video = soup.find('video')
src = video['src']
This is a one-liner to get all the downloadable MP4 files on that page, in case you need it.
import bs4, urllib2
url = 'https://archive.org/details/20070519_detroit2'
soup = bs4.BeautifulSoup(urllib2.urlopen(url))
links = [a['href'] for a in soup.find_all(lambda tag: tag.name == "a" and '.mp4' in tag.get('href', ''))]
print links
Here is the output:
['/download/20070519_detroit2/20070519_detroit_jungleearth.mp4',
'/download/20070519_detroit2/20070519_detroit_sweetkissofdeath.mp4',
'/download/20070519_detroit2/20070519_detroit_goodman.mp4',
...
'/download/20070519_detroit2/20070519_detroit_wilson_512kb.mp4']
These are relative links; join them with the site's domain to get absolute paths, as in the sketch below.
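A minimal sketch (Python 2 to match the code above; in Python 3 use urllib.parse.urljoin):

from urlparse import urljoin

base = 'https://archive.org'
absolute_links = [urljoin(base, link) for link in links]
print absolute_links[0]  # https://archive.org/download/20070519_detroit2/20070519_detroit_jungleearth.mp4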
