When scraping image url src, get data:image/jpeg;base64 - python

I was trying to scrape the image url from a website using python urllib2.
Here is my code to get the html string:
req = urllib2.Request(url, headers = urllib2Header)
htmlStr = urllib2.urlopen(req, timeout=15).read()
When I view from the browser, the html code of the image looks like this:
<img id="main-image" src="http://abcd.com/images/41Q2VRKA2QL._SY300_.jpg" alt="" rel="" style="display: inline; cursor: pointer;">
However, when I read from the htmlStr I captured, the image was converted to base64 image, which looks like this:
<img id="main-image" src="data:image/jpeg;base64,/9j/4AAQSkZJRgABAQAAAQABAAD/2wBDAAUDBAQEAwUEBAQFBQU....">
I am wondering why this happened. Is there a way to get the original image url rather than the base64 image string?
Thanks.

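You could use BeautifulSoup.
Example:
import urllib2
from bs4 import BeautifulSoup
url = "http://www.theurlyouwanttoscrape.com"  # urllib2 needs the scheme (http:// or https://)
html = urllib2.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# the id in the question is "main-image" (hyphen, not underscore)
img_src = soup.find('img', {'id': 'main-image'})['src']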
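The src captured in the question is a data: URI, meaning the image bytes are embedded in the HTML itself rather than referenced by URL. A minimal sketch (Python 3, standard library only) for telling the two cases apart and decoding the embedded bytes follows. The function name split_data_uri and the shortened sample payload are assumptions for illustration; the payload is not the page's real, truncated one.

```python
import base64


def split_data_uri(src):
    """Return (mime_type, raw_bytes) for a base64 data: URI, or None otherwise."""
    if not src.startswith("data:"):
        return None  # an ordinary URL such as http://abcd.com/images/...
    header, _, payload = src.partition(",")
    # header looks like "data:image/jpeg;base64"
    meta = header[len("data:"):].split(";")
    mime_type = meta[0] or "text/plain"
    if "base64" not in meta[1:]:
        return None  # percent-encoded data URIs are not handled in this sketch
    return mime_type, base64.b64decode(payload)


# Shortened stand-in payload (just the first four JPEG bytes), for illustration only.
sample = "data:image/jpeg;base64,/9j/4A=="
print(split_data_uri(sample))
```

With BeautifulSoup, the string passed in would be the value of the tag's src attribute. If the function returns None, the src is a regular URL and can be used directly.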

Related

Extract image url from html

<div class="col-lg-2 col-md-2 vcenter no-pad-top no-pad-bot">
<img itemprop="image" src="/uploads/images/cache/20955226c5c975c230cb8e1f8cff0e6f1583249561_150_150.png" alt="SPINNEY MOBILE DEVELOPMENT" class="b-lazy pull-left center-block img-responsive b-loaded"></div>
I need to extract the image URL from this particular class only:
"/uploads/images/cache/20955226c5c975c230cb8e1f8cff0e6f1583249561_150_150.png"
My code:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = "https://www.appfutura.com/developers/spinney"
html = urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
soup.prettify()
for link in soup.find_all('img'):
    print(link.get('src'))
How can I achieve this? Please help.
If I understood your question correctly:
When printing the img "src" attribute, you can check whether it contains "/uploads/images/cache/" or not.
img = soup.find_all('img')
for link in img:
    src = link.get('src')
    # skip img tags without a src attribute (get() returns None for them)
    if src and "/uploads/images/cache/" in src:
        print(src)
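The src in the question is a relative path, so it usually has to be resolved against the page URL before the image can be fetched. A minimal sketch of filtering and resolving in one step, using only the standard library; the function name absolute_cache_images and the second and third list entries are made up for illustration.

```python
from urllib.parse import urljoin

PAGE_URL = "https://www.appfutura.com/developers/spinney"
CACHE_PREFIX = "/uploads/images/cache/"


def absolute_cache_images(page_url, srcs, prefix=CACHE_PREFIX):
    """Keep only the srcs that contain prefix and resolve them against page_url."""
    return [urljoin(page_url, src) for src in srcs if src and prefix in src]


srcs = [
    "/uploads/images/cache/20955226c5c975c230cb8e1f8cff0e6f1583249561_150_150.png",
    "/img/logo.png",  # hypothetical entry that should be filtered out
    None,             # what link.get('src') returns for an img without src
]
print(absolute_cache_images(PAGE_URL, srcs))
```

In the scraping loop, srcs would be built from link.get('src') for each img tag.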
There is a module called 'webbrowser' which lets you open a URL in your browser, such as:
import webbrowser
url = "https://www.appfutura.com/developers/spinney"
webbrowser.open(url)
To download the page instead, you can use the requests library:
import requests
url = 'https://www.appfutura.com/developers/spinney'
r = requests.get(url, allow_redirects=True)
with open('yoururlname.html', 'wb') as f:
    f.write(r.content)
The file is saved in the current working directory, so if the script sits in its own folder and is run from there, the download ends up in that folder. When run by double-clicking, the program is just a console window that pops up and closes once the page has been downloaded.

Python beautifulsoup: Get a placeholder from img src

I am trying to read the address of an image from a web page with BeautifulSoup.
In the page source I can see the URL of the image.
But when I try to read the address with BeautifulSoup's find_all, I only get a placeholder for the image URL.
The image tag in the page source is structured as follows:
<br /><img src="mangas/Young Justice (2019)/Young Justice (2019) Issue 11/cw002.jpg" alt="" width="1200" height="1846" class="picture" />
In BeautifulSoup I get this:
<img 0="" alt="" class="picture" height="" src="/pics/placeholder2.jpg" width=""/>]
I hope somebody can give me a tip or explain why I get a placeholder instead of the original image URL.
My Code:
import requests
from bs4 import BeautifulSoup as BS
from requests.exceptions import ConnectionError
def getimageurl(url):
    try:
        response = requests.get(url)
        soup = BS(response.text, 'html.parser')
        data = soup.find_all('a', href=True)
        for a in data:
            t = a.find_all('img', attrs={'class': 'picture'})
            print(t)
    except ConnectionError:
        print('Cant open url: {0}'.format(url))

Retrieving Imgur Image Link via Web Scraping Python

I am trying to retrieve the link for an image using imgur.com. It seems that the picture (if .jpg or .png) is usually stored within (div class="image post-image") on their website, like:
<div class='image post-image'>
<img alt="" src="//i.imgur.com/QSGvOm3.jpg" original-title="" style="max-width: 100%; min-height: 666px;">
</div>
so here is my code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://imgur.com/gallery/0PTPt'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
info = soup.find_all('div', {'class':'post-image'})
file = open('imgur-html.txt', 'w')
file.write(str(info))
file.close()
Instead of being able to get everything within these tags, this is my output:
<div class="post-image" style="min-height: 666px">
</div>
What do I need to do in order to access this further so I can get the image link? Or is this simply something where I need to only use the API? Thanks for any help.
The child img appears to be added dynamically, so it is not present in the HTML you download. You can extract the full link from the element whose rel attribute is image_src:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('https://imgur.com/gallery/0PTPt')
soup = bs(r.content, 'lxml')
print(soup.select_one('[rel=image_src]')['href'])

Python: get CAPTCHA file from html

I am trying to decode a CAPTCHA in Python, but I don't know how to get it from the HTML.
I use
html = session.get(page, headers=headers).text
soup = BeautifulSoup(html, "html.parser")
and the HTML looks like
<img src="/captcha.gif" style="width:1px;height:1px"/>
How can I extract it? Can I only do it by saving the image?
You can save the image to your PC like this:
import os
import urllib.request
from bs4 import BeautifulSoup as BS
tag = '<img src="/captcha.gif" style="width:1px;height:1px"/>'
soup = BS(tag, 'html.parser')
img_tag = soup.find('img')
# 'https://absolute/path/to' is a placeholder for the site's base URL
urllib.request.urlretrieve('https://absolute/path/to' + img_tag['src'], os.path.join(os.getcwd(), 'temp_img'))
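Saving to disk is not the only option: the image can also be kept in memory as bytes. The sketch below assumes that session in the question is a requests.Session and that the CAPTCHA may be tied to that session's cookies, which is why the same session is reused (the source confirms neither). fetch_image_bytes, the example.com URL, and the two stub classes are made up so that the example runs without network access.

```python
from urllib.parse import urljoin


def fetch_image_bytes(session, page_url, src):
    """Resolve src against page_url and return the image body as bytes."""
    image_url = urljoin(page_url, src)
    response = session.get(image_url)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.content


# Stand-ins so the sketch runs offline; in real use, pass the
# requests.Session that loaded the page instead of _StubSession.
class _StubResponse:
    content = b"GIF89a"

    def raise_for_status(self):
        pass


class _StubSession:
    def __init__(self):
        self.requested = []

    def get(self, url, **kwargs):
        self.requested.append(url)
        return _StubResponse()


stub = _StubSession()
data = fetch_image_bytes(stub, "https://example.com/login", "/captcha.gif")
print(stub.requested[0], len(data))
```

In the question's code the call would look like fetch_image_bytes(session, page, soup.find('img')['src']), and the returned bytes can be handed to whatever decodes the CAPTCHA.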

How to extract specific URL from HTML using Beautiful Soup?

I want to extract specific URLs from an HTML page.
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read() # get the html from the url
# this works without BeautifulSoup, but it is slow:
image_links = re.findall("src.\"(\S*?media.tumblr\S*?tumblr_\S*?jpg)", html)
print image_links
The output of the above is exactly the URL, nothing else: http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg
The only downside is it is very slow.
BeautifulSoup is extremely fast at parsing HTML, so that's why I want to use it.
The URLs that I want are actually the img src values. Here's a snippet from the HTML that contains the information I want.
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
So, my question is, how can I get BeautifulSoup to extract all of those 'img src' urls cleanly without any other cruft?
I just want a list of matching urls. I've been trying to use soup.findall() function, but cannot get any useful results.
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://bassrx.tumblr.com/tagged/tt'
soup = BeautifulSoup(urlopen(url).read())
for element in soup.findAll('img'):
    print(element.get('src'))
You can use the div.media > a > img CSS selector to find img tags that are direct children of an a tag, which is in turn a direct child of a div with the media class:
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = "<url_here>"
soup = BeautifulSoup(urlopen(url))
images = soup.select('div.media > a > img')
print [image.get('src') for image in images]
To make the parsing faster, you can use the lxml parser:
soup = BeautifulSoup(urlopen(url), "lxml")
You need to install the lxml module first, of course.
Also, you can use the SoupStrainer class to parse only the relevant part of the document.
Hope that helps.
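The SoupStrainer suggestion above can be sketched as follows (Python 3, assuming bs4 is installed). The sample HTML reuses the snippet from the question plus one made-up paragraph; image_sources is a name chosen for illustration.

```python
from bs4 import BeautifulSoup, SoupStrainer

SAMPLE_HTML = """
<div class="media"><a href="http://bassrx.tumblr.com/image/85635265422">
<img src="http://38.media.tumblr.com/tumblr_ln5gwxHYei1qi02clo1_500.jpg"/>
</a></div>
<p>unrelated text</p>
"""


def image_sources(html):
    """Parse only the img tags of html and return their src values."""
    only_img = SoupStrainer("img")  # everything that is not an img is skipped
    soup = BeautifulSoup(html, "html.parser", parse_only=only_img)
    return [img["src"] for img in soup.find_all("img") if img.has_attr("src")]


print(image_sources(SAMPLE_HTML))
```

Because the strainer drops the surrounding div and a tags, a selector such as div.media > a > img would no longer match on this soup. The approach fits best when every img on the page is wanted, or when filtering by src afterwards.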
Have a look at combining BeautifulSoup.find_all with re.compile:
from urllib2 import urlopen
import re
from bs4 import BeautifulSoup
url = "http://bassrx.tumblr.com/tagged/tt" # nsfw link
page = urlopen(url)
html = page.read()
bs = BeautifulSoup(html)
a_tumblr = [a_element for a_element in bs.find_all(href=re.compile("media\.tumblr"))]
##[<link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="shortcut icon"/>, <link href="http://37.media.tumblr.com/avatar_df3a9e37c757_128.png" rel="apple-touch-icon"/>]
