Extracting image src based on attribute with BeautifulSoup - python

I'm using BeautifulSoup to get an HTML page from IMDb, and I would like to extract the poster image from the page. I've got the image tag based on one of its attributes, but I don't know how to extract the data inside it.
Here's my code:
url = 'http://www.imdb.com/title/tt%s/' % (id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for src in soup.find(itemprop="image"):
    print("inside FOR")
    print(link.get('src'))

You're almost there - just a couple of mistakes. soup.find() gets the first element that matches, not a list, so you don't need to iterate over it. Once you have got the element, you can get its attributes (like src) using dictionary access. Here's a reworked version:
film_id = '0423409'
url = 'http://www.imdb.com/title/tt%s/' % (film_id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
link = soup.find(itemprop="image")
print(link["src"])
# output:
http://ia.media-imdb.com/images/M/MV5BMTg2ODMwNTY3NV5BMl5BanBnXkFtZTcwMzczNjEzMQ##._V1_SY317_CR0,0,214,317_.jpg
I've changed id to film_id, because id() is a built-in function, and it's bad practice to shadow those.
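For example, a quick sketch of what that shadowing looks like:
id = '0423409'   # shadows the built-in id()
# id(link)       # would now raise TypeError: 'str' object is not callable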

I believe your example is very close. You need to use findAll() instead of find(), and when you iterate, you switch from src to link. In the example below I renamed the loop variable to tag.
This code is working for me with BeautifulSoup4:
url = 'http://www.imdb.com/title/tt%s/' % (id,)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for tag in soup.findAll(itemprop="image"):
    print("inside FOR")
    print(tag['src'])
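Note that in BeautifulSoup4, findAll still works as a legacy alias; the preferred modern spelling is find_all:
for tag in soup.find_all(itemprop="image"):
    print(tag['src'])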

If I understand correctly, you are looking for the src of the image, so that you can extract it afterwards.
First you need to find (using the inspector) where in the HTML the image sits. For example, in my particular case, where I was scraping soccer team shields, I needed:
# imports for the uOpen / BS aliases used below
from urllib.request import urlopen as uOpen
from bs4 import BeautifulSoup as BS

m_url = 'http://www.marca.com/futbol/primera/equipos.html'
client = uOpen(m_url)
page = client.read()
client.close()
page_soup = BS(page, 'html.parser')
teams = page_soup.findAll('li', {'id': 'nombreEquipo'})
for team in teams:
    name = team.h2.text
    shield_url = team.img['src']
Then you need to process the image. You have two options.
1st: using NumPy and OpenCV:
import numpy as np
import cv2

def url_to_image(url):
    '''
    Extract an image from a URL.
    '''
    resp = uOpen(url)
    image = np.asarray(bytearray(resp.read()), dtype='uint8')
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

shield = url_to_image(shield_url)
2nd: using the scikit-image library (which you will probably need to install):
from skimage import io

shield = io.imread('http:' + shield_url)
Note: just in this particular example I needed to add http: at the beginning, because the src on the page is protocol-relative (it starts with //).
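A more general way to handle such protocol-relative srcs is to resolve them against the page URL with the standard library; a minimal sketch:
from urllib.parse import urljoin

# resolves '//host/path', '/path' and 'relative/path' srcs against the page URL
shield_url = urljoin(m_url, team.img['src'])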
Hope it helps!

Here's a full working example with gazpacho:
Step 1 - import everything and download the html:
from pathlib import Path
from urllib.request import urlretrieve as download
from gazpacho import Soup
id = 'tt5057054'
url = f"https://www.imdb.com/title/{id}"
soup = Soup.get(url)
Step 2 - find the src url for the image asset:
image = (soup
    .find("div", {"id": "title-overview"})
    .find("div", {"class": "poster"})
    .find("img")
    .attrs['src']
)
Step 3 - save it to your machine:
directory = "images"
Path(directory).mkdir(exist_ok=True)
extension = image.split('.')[-1]
download(image, f"{directory}/{id}.{extension}")
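For comparison, the same lookup with requests and BeautifulSoup would look something like this (a sketch; the poster markup is the same assumption as in the gazpacho snippet above, and may change whenever IMDb updates its page):
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://www.imdb.com/title/tt5057054", headers={"User-Agent": "Mozilla/5.0"})
soup = BeautifulSoup(resp.text, "html.parser")
poster = soup.find("div", {"class": "poster"})
print(poster.find("img")["src"])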

Related

How do I get 1 href per one <li> (multiple of them)?

I am trying to get images which are placed inside of <li>'s anchor tag as a href.
I am able to get only one link, but not everything.
I am trying to scrape the following page:
https://www.msxdistribution.com/love-triangle
As you can see there are multiple product images and I am trying to get them, but unfortunately I am not able to do so; what I did successfully is to get only the first image, but not the others...
Here's my code:
def scraping_data(productlinks, r):
    ix = int(0)
    for link in productlinks:
        ix = ix + 1
        f = requests.get(link, headers=headers).text
        hun = BeautifulSoup(f, 'html.parser')
        dom = etree.HTML(str(hun))
        # Here I get the description of the product
        try:
            name = hun.find("h1", {"class": "product-name"}).get_text().replace('\n', "")
            print(name)
        except:
            name = None
        try:
            print("Trying to fetch image...")
            all_imgs = hun.find_all('img')  # Here I tried to fetch every img from the web-site
            for image in all_imgs:
                print(all_imgs)
                ioner = image.find_all(attrs={'class': 'zoomImg'})  # Tried to get only images with class zoomImg - unsuccessful
                print(ioner)
            ss = hun.find("a", {"class": "fancy-images"}).get('href')  # This one gets only the first img and it works
            print(ss)
        except Exception as e:
            print("No images")
Try:
import requests
from bs4 import BeautifulSoup

url = "https://www.msxdistribution.com/love-triangle"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for img in soup.select(".etalage_thumb_image"):
    print(img["src"])
Prints:
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-1_1/stimolatore-love-triangle-11.jpg
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-2_1/stimolatore-love-triangle-12.jpg
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-3_1/stimolatore-love-triangle-13.jpg
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-4_1/stimolatore-love-triangle-14.jpg
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-5/stimolatore-love-triangle-15.jpg
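To actually save those files, a minimal follow-up sketch (reusing the soup and the requests import from above; file names are just taken from the tail of each URL):
from pathlib import Path

Path("imgs").mkdir(exist_ok=True)
for img in soup.select(".etalage_thumb_image"):
    src = img["src"]
    Path("imgs", src.rsplit("/", 1)[-1]).write_bytes(requests.get(src).content)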

Python web scraping script does not find element by xPath even though it exists

Currently I'm working on a small script which should extract the name, link, price and image of the cheapest product, given a link to a price comparison website of my country.
An example link would look like this: https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist
This is the code I have so far:
#!/usr/bin/env python3
from urllib.request import Request, urlopen
from lxml import html
from lxml import etree
from lxml.etree import tostring
link = 'https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist'
link = 'https://geizhals.at/?cat=monlcd19wide&v=e&hloc=at&sort=p&bl1_id=30&xf=11939_23%7E11955_IPS%7E11963_240%7E14591_19201080'
link = 'https://geizhals.at/?cat=cpuamdam4&xf=25_6%7E5_PCIe+4.0%7E5_SMT%7E820_AM4'
def get_webSite():
    req = Request(link, headers={'User-Agent': 'Mozilla/5.0'})
    return urlopen(req).read()

webpage = get_webSite() # Contains all HTML from the site
root = html.fromstring(webpage)
price = root.xpath("//*[@id=\"product0\"]/div[6]/span/span")[0].text.strip()
name = root.xpath("//*[@id=\"product0\"]/div[2]/a/span")[0].text.strip()
link = "https://geizhals.at/" + root.xpath("//*[@id=\"product0\"]/div[2]/a/@href")[0]
picture = root.xpath("//*[@id=\"product0\"]/div[1]/a/div/picture/img/@big-image-url")[0]
# the @ refers to an attribute of the selected element; the slashes separate the location steps
# The [0] takes the first element of the list, because xpath() returns a list with exactly one item
price = price.lstrip('€ ') # removes the euro sign and the space
price = price.replace(',', '.') # replaces the comma with a dot
price = float(price) # converts price string to float
print(f"Price : {price}")
print("Name : " + (name))
print("Link : " + (link))
print("PictureLink : " + (picture))
Everything works and gets printed to the console except the link to the picture thumbnail.
I have tried the normal xPath and the full xPath to no avail. No such element gets found even though it exists.
What could be the problem?
The error in your xpath is in doing:
img/@big-image-url
It should be:
img[@big-image-url]
Otherwise, the / will traverse to children of the img but you want to check the attribute of the img tag itself. Here's an example of grabbing all images from the page:
import requests
from lxml import html

res = requests.get('https://geizhals.at/?cat=monlcd19wide&xf=11939_23~11955_IPS~11963_144~14591_19201080&asuch=&bpmin=&bpmax=&v=e&hloc=at&plz=&dist=&mail=&sort=p&bl1_id=30#productlist')
root = html.fromstring(res.content)
[item.attrib['big-image-url'] for item in root.xpath('//img[@big-image-url]')]
['https://gzhls.at/i/61/20/2436120-n0.jpg', 'https://gzhls.at/i/05/53/2430553-n0.jpg', 'https://gzhls.at/i/75/76/2237576-n0.jpg', 'https://gzhls.at/i/15/28/2201528-n0.jpg', 'https://gzhls.at/i/19/26/2221926-n0.jpg', 'https://gzhls.at/i/06/38/2410638-n0.jpg', 'https://gzhls.at/i/98/04/2459804-n0.jpg', 'https://gzhls.at/i/14/04/2201404-n0.jpg', 'https://gzhls.at/i/24/52/2132452-n0.jpg', 'https://gzhls.at/i/17/64/2401764-n0.jpg', 'https://gzhls.at/i/07/97/2350797-n0.jpg', 'https://gzhls.at/i/50/31/2365031-n0.jpg', 'https://gzhls.at/i/25/01/2322501-n0.jpg', 'https://gzhls.at/i/26/50/2152650-n0.jpg', 'https://gzhls.at/i/27/93/2202793-n0.jpg', 'https://gzhls.at/i/72/69/2267269-n0.jpg', 'https://gzhls.at/i/20/79/2142079-n0.jpg', 'https://gzhls.at/i/06/48/2430648-n0.jpg', 'https://gzhls.at/i/41/24/2294124-n0.jpg', 'https://gzhls.at/i/82/46/2378246-n0.jpg', 'https://gzhls.at/i/46/35/2124635-n0.jpg', 'https://gzhls.at/i/43/84/2304384-n0.jpg', 'https://gzhls.at/i/29/73/2382973-n0.jpg', 'https://gzhls.at/i/07/36/2410736-n0.jpg', 'https://gzhls.at/i/97/54/2459754-n0.jpg', 'https://gzhls.at/i/67/40/2456740-n0.jpg', 'https://gzhls.at/i/15/03/2151503-n0.jpg', 'https://gzhls.at/i/45/26/2244526-n0.jpg', 'https://gzhls.at/i/91/51/2089151-n0.jpg', 'https://gzhls.at/i/39/71/2393971-n0.jpg']
So the URL should indeed be there within the html big-image-url attribute.
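Applied back to the original script, the single-product lookup could be rewritten along these lines (a sketch, assuming the product0 id still matches the page):
# select img tags under product0 that carry the attribute, then read it
nodes = root.xpath('//*[@id="product0"]//img[@big-image-url]')
if nodes:
    picture = nodes[0].attrib['big-image-url']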

trying to get all imgs from web python + bs4

First of all I'll say that I'll try to explain every step even though the code is pretty obvious and easy to understand. Don't feel insulted if I'm explaining things that are too obvious :)
So I'm trying to get all img from a website but it seems it just doesn't want to. I've read some similar articles but none seem to work.
import requests
from bs4 import BeautifulSoup as bs
import os
You can visit the web and see the html yourself.
# url of the imgs
url = 'https://dbz.space/cards/'
Here I request the web page so it can be parsed
# download the page to scrape it
page = requests.get(url)
soup = bs(page.text, 'html.parser')
Here I search for all the images with that class
# locate all the imgs with that class
image_tags = soup.findAll("img", {"class": "thumb on"})
Here I just check if the folder imgs exists, and if it doesn't, create it and then go inside it
# if imgs doesn't exist, we create it
if not os.path.exists('imgs'):
    os.makedirs('imgs')
# change directory
os.chdir('imgs')
A variable used to name the images
# for the image names
x = 0
And finally the saving process
# saving the images
for image in image_tags:
    try:
        url = image['src']
        response = requests.get(url)
        if response.status_code == 200:
            with open('img-' + str(x) + '.jpg', 'wb') as f:
                f.write(requests.get(url).content)
                f.close()
            x += 1
            print('New image in the folder')
    except:
        pass
So, the imgs on the web are inside a div tag, they have the class "thumb on", and they also contain the src link (obviously), which is the one I want to save into my folder called "imgs".
If all you want is the URL of the image file itself...
<img class="thumb on" src="https://dbz.s3.amazonaws.com/v2/global/character/thumb/card_1011720_thumb.png">
Then simply...
yourBSobj.find("img", {"class": "thumb on"}).attrs['src']
I would use find_all() actually so you can iterate through a loop of images, do your processing/saving etc and then see your results afterwards.
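A sketch of that find_all() variant:
for img in yourBSobj.find_all("img", {"class": "thumb on"}):
    src = img.attrs.get("src")
    if src:
        print(src)  # or save/process it here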
First of all, as @cricket_007 said, the img tags are indeed loaded asynchronously by JavaScript. But there is no need to use Selenium.
Upon inspection, you can see that each img tag is located inside this tag:
<div class="..." res="..." base="..." aim="" quantity="" release="1" imgur="x">
This tag is available in the source code (i.e. not loaded by JavaScript). Here, we can get the x value, which is part of the imgur URL. One example:
<div class="..." res="1010160" base="1010161" aim="" quantity="" release="1" imgur="yK0wNs3">
After getting the imgur value, you can make the URL like this:
'https://i.imgur.com/{}.png'.format(imgur)
As the URL is https://i.imgur.com/yK0wNs3.png.
Complete code:
import requests
from bs4 import BeautifulSoup as bs
import os

r = requests.get('https://dbz.space/cards/')
#soup = BeautifulSoup(r.text, 'lxml')
soup = bs(r.text, 'html.parser')
if not os.path.exists('imgs'):
    os.makedirs('imgs')
os.chdir('imgs')
i = 0
for item in soup.find_all('div', imgur=True):
    imgur = item['imgur']
    if imgur:
        r = requests.get('https://i.imgur.com/{}.png'.format(imgur))
        with open('img-{}.jpg'.format(i), 'wb') as f:
            f.write(r.content)
        i += 1
Note: I'm using f.write(r.content) and not f.write(requests.get(url).content). There's no need to send one more request.
So the error that popped up, saying File "pilla.py", line 6, in <module> soup = BeautifulSoup(r.text, 'lxml') NameError: name 'BeautifulSoup' is not defined
is solved by changing BeautifulSoup to bs in the soup variable, and lxml to html.parser. The complete code is right here:
import requests
from bs4 import BeautifulSoup as bs
import os

r = requests.get('https://dbz.space/cards/')
soup = bs(r.text, 'html.parser')
if not os.path.exists('imgs'):
    os.makedirs('imgs')
os.chdir('imgs')
i = 0
for item in soup.find_all('div', imgur=True):
    imgur = item['imgur']
    if imgur:
        r = requests.get('https://i.imgur.com/{}.png'.format(imgur))
        with open('img-{}.jpg'.format(i), 'wb') as f:
            f.write(r.content)
        i += 1
Thank you all very much for the help. Really appreciate it :)

Extract image links from the webpage using Python

So I wanted to get all of the pictures on this page (of the NBA teams).
http://www.cbssports.com/nba/draft/mock-draft
However, my code gives a bit more than that. It gives me,
<img src="http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png" alt="Orlando Magic" width="30" height="30" border="0" />
How can I shorten it to give me only http://sports.cbsimg.net/images/nba/logos/30x30/ORL.png?
My code:
import urllib2
from BeautifulSoup import BeautifulSoup
# or if you're using BeautifulSoup4:
# from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read())
rows = soup.findAll("table", attrs={'class': 'data borderTop'})[0].tbody.findAll("tr")[2:]
for row in rows:
    fields = row.findAll("td")
    if len(fields) >= 3:
        anchor = row.findAll("td")[1].find("a")
        if anchor:
            print anchor
I know this can be "traumatic", but for those automatically generated pages, where you just want to grab the damn images away and never come back, a quick-n-dirty regular expression that takes the desired pattern tends to be my choice (no Beautiful Soup dependency is a great advantage):
import urllib, re

source = urllib.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read()
## every image name is an abbreviation composed of capital letters, so...
for link in re.findall('http://sports.cbsimg.net/images/nba/logos/30x30/[A-Z]*.png', source):
    print link
    ## the code above just prints the link;
    ## if you want to actually download, set the flag below to True
    actually_download = False
    if actually_download:
        filename = link.split('/')[-1]
        urllib.urlretrieve(link, filename)
Hope this helps!
To save all the images on http://www.cbssports.com/nba/draft/mock-draft,
import urllib2
import os
from BeautifulSoup import BeautifulSoup
URL = "http://www.cbssports.com/nba/draft/mock-draft"
default_dir = os.path.join(os.path.expanduser("~"),"Pictures")
opener = urllib2.build_opener()
urllib2.install_opener(opener)
soup = BeautifulSoup(urllib2.urlopen(URL).read())
imgs = soup.findAll("img",{"alt":True, "src":True})
for img in imgs:
    img_url = img["src"]
    filename = os.path.join(default_dir, img_url.split("/")[-1])
    img_data = opener.open(img_url)
    f = open(filename, "wb")
    f.write(img_data.read())
    f.close()
To save any particular image on http://www.cbssports.com/nba/draft/mock-draft,
use
soup.find("img",{"src":"image_name_from_source"})
You can use these functions to get the list of all image URLs from a URL.
import re
import requests

#
# get_url_images_in_text()
#
# @param html - the html to extract image urls from.
# @param protocol - the protocol of the website, to prepend to urls that don't start with a protocol.
#
# @return list of image urls.
#
def get_url_images_in_text(html, protocol):
    urls = []
    all_urls = re.findall(r'((http\:|https\:)?\/\/[^"\' ]*?\.(png|jpg))', html, flags=re.IGNORECASE | re.MULTILINE | re.UNICODE)
    for url in all_urls:
        if not url[0].startswith("http"):
            urls.append(protocol + url[0])
        else:
            urls.append(url[0])
    return urls
#
# get_images_from_url()
#
# @param url - the url to extract image urls from.
#
# @return list of image urls.
#
def get_images_from_url(url):
    protocol = url.split('/')[0]
    resp = requests.get(url)
    return get_url_images_in_text(resp.text, protocol)
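Usage would look something like this (hypothetical URL):
images = get_images_from_url('https://www.example.com/gallery.html')
for image_url in images:
    print(image_url)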

How come I don't print an image

This is my code:
import urllib2
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('http://www.cbssports.com/nba/draft/mock-draft').read())
rows = soup.findAll("table", attrs={'class': 'data borderTop'})[0].tbody.findAll("tr")[2:]
for row in rows:
    fields = row.findAll("td")
    if len(fields) >= 3:
        anchor = row.findAll("td")[1].find("a")
        if anchor:
            print anchor
Instead of printing out an image, it gives me where the image is in the page source. Any reasons as to why?
According to the BeautifulSoup documentation, soup.findAll returns a list of tags or NavigableStrings.
So you have to navigate further into each tag, for example via the contents attribute, or pull out the attribute you want.
Visit http://www.crummy.com/software/BeautifulSoup/bs3/documentation.html at "Navigating the Parse Tree" subtitle to find what you need in this case.
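In this case that means reading the attribute instead of printing the tag itself, e.g. inside the question's loop (a sketch):
if anchor:
    img = anchor.find("img")
    if img is not None:
        print(img["src"])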
It looks like you want the team logo thumbnails?
import urllib2
import BeautifulSoup
url = 'http://www.cbssports.com/nba/draft/mock-draft'
txt = urllib2.urlopen(url).read()
bs = BeautifulSoup.BeautifulSoup(txt)
# get the main table
t = bs.findAll('table', attrs={'class': 'data borderTop'})[0]
# get the thumbnail urls
imgs = [im["src"] for im in t.findAll('img') if "logos" in im["src"]]
imgs now looks like
[u'http://sports.cbsimg.net/images/nba/logos/30x30/NO.png',
u'http://sports.cbsimg.net/images/nba/logos/30x30/CHA.png',
u'http://sports.cbsimg.net/images/nba/logos/30x30/WAS.png',
u'http://sports.cbsimg.net/images/nba/logos/30x30/CLE.png',
etc. These are the file locations for each logo, which is all the HTML actually contains; if you want the actual pictures, you have to get each one separately.
The list contains duplicate references to each logo; the quickest way to remove duplicates is
imgs = list(set(imgs))
Alternatively, the list does not include every team; if you had a full list of team name contractions, you could build the logo-url list directly.
Also, looking at the site, each 30x30 logo has a corresponding 90x90 logo which you might prefer - much larger and clearer. If so,
imgs = [im.replace('30x30', '90x90') for im in imgs]
imgs now looks like
[u'http://sports.cbsimg.net/images/nba/logos/90x90/BOS.png',
u'http://sports.cbsimg.net/images/nba/logos/90x90/CHA.png',
u'http://sports.cbsimg.net/images/nba/logos/90x90/CLE.png',
u'http://sports.cbsimg.net/images/nba/logos/90x90/DAL.png',
etc.
Now, for each url, we download the image and save it:
import os

savedir = 'c:\\my documents\\logos' # assumes this dir actually exists!
for im in imgs:
    fname = im.rsplit('/', 1)[1]
    fname = os.path.join(savedir, fname)
    with open(fname, 'wb') as outf:
        outf.write(urllib2.urlopen(im).read())
and you have your logos.
