I am trying to get the images that are placed inside each <li>'s anchor tag as its href.
I am able to get only one link, not all of them.
I am trying to scrape the following page:
https://www.msxdistribution.com/love-triangle
As you can see, there are multiple product images, but unfortunately I have only managed to get the first image, not the others.
Here's my code:
def scraping_data(productlinks, r):
    ix = int(0)
    for link in productlinks:
        ix = ix + 1
        f = requests.get(link, headers=headers).text
        hun = BeautifulSoup(f, 'html.parser')
        dom = etree.HTML(str(hun))
        # Here I get description of product
        try:
            name = hun.find("h1", {"class": "product-name"}).get_text().replace('\n', "")
            print(name)
        except:
            name = None
        try:
            print("Trying to fetch image...")
            all_imgs = hun.find_all('img')  # Here I tried to fetch every img from the website
            for image in all_imgs:
                print(all_imgs)
                ioner = image.find_all(attrs={'class': 'zoomImg'})  # Tried to get only images with class of zoomImg # Unsuccessful
                print(ioner)
                ss = hun.find("a", {"class": "fancy-images"}).get('href')  # This one gets only the first img and it works
                print(ss)
        except Exception as e:
            print("No images")
Try:
import requests
from bs4 import BeautifulSoup
url = "https://www.msxdistribution.com/love-triangle"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
for img in soup.select(".etalage_thumb_image"):
    print(img["src"])
Prints:
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-1_1/stimolatore-love-triangle-11.jpg
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-2_1/stimolatore-love-triangle-12.jpg
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-3_1/stimolatore-love-triangle-13.jpg
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-4_1/stimolatore-love-triangle-14.jpg
https://www.msxdistribution.com/media/catalog/product/cache/4/thumbnail/800x800/9df78eab33525d08d6e5fb8d27136e95/7/1/7101024-5/stimolatore-love-triangle-15.jpg
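If you also want to save those thumbnails locally, here is a minimal follow-up sketch building on the same selector (the images directory name is just an example, not part of the original answer):
import os
import requests
from bs4 import BeautifulSoup

url = "https://www.msxdistribution.com/love-triangle"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

os.makedirs("images", exist_ok=True)
for img in soup.select(".etalage_thumb_image"):
    src = img["src"]
    filename = os.path.join("images", src.rsplit("/", 1)[-1])  # e.g. stimolatore-love-triangle-11.jpg
    with open(filename, "wb") as f:
        f.write(requests.get(src).content)  # fetch the image bytes and write them to disk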
When I was trying to scrape data from Sephora and Ulta using BeautifulSoup, I could get the HTML content of the page. But when I then tried to parse it with lxml using XPath, I didn't get any output, whereas the same XPath in Selenium does give me output.
Using BeautifulSoup
for i in range(len(df)):
    response = requests.get(df['product_url'].iloc[i])
    my_url = df['product_url'].iloc[i]
    My_url = ureq(my_url)
    my_html = My_url.read()
    My_url.close()
    soup = BeautifulSoup(my_html, 'html.parser')
    dom = et.HTML(str(soup))
    # price
    try:
        price = (dom.xpath('//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span/text()'))
        df['price'].iloc[i] = price
    except:
        pass
Using Selenium
lst = []
urls = df['product_url']
for url in urls[:599]:
    time.sleep(1)
    driver.get(url)
    time.sleep(2)
    try:
        prize = driver.find_element('xpath', '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span').text
    except:
        pass
    lst.append([prize])
    pz = None
    dt = None
Does anyone know why I can't get the content when I parse with lxml using the same XPath that works in Selenium? Thanks so much in advance.
Sample link of Ulta: https://www.ulta.com/p/coco-mademoiselle-eau-de-parfum-spray-pimprod2015831
Sample link of Sephora: https://www.sephora.com/product/coco-mademoiselle-P12495?skuId=513168&icid2=products
1. About the XPath
driver.find_element('xpath', '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span').text
I'm a bit surprised that the selenium code works for your Sephora links - the link you provided redirects to a productnotcarried page, but at this link (for example), that XPath has no matches. You can use //p[@data-comp="Price "]//span/b instead.
Actually, even for Ulta, I prefer //*[@class="ProductHero__content"]//*[@class="ProductPricing"]/span just for human readability, although it looks better if you use this path with CSS selectors:
prize=driver.find_element("css selector", '*.ProductHero__content *.ProductPricing>span').text
[Coding for both sites - Selenium]
To account for both sites, you could set up something like this reference dictionary:
xRef = {
    'www.ulta.com': '//*[@id="1b7a3ab3-2765-4ee2-8367-c8a0e7230fa4"]/span',
    'www.sephora.com': '//p[@data-comp="Price "]//span/b'
}
# for url in urls[:599]:... ################ REST OF CODE #############
and then use it accordingly
# from urllib.parse import urlsplit
# lst, urls, xRef = ....
# for url in urls[:599]:
#     sleep...driver.get...sleep...
    try:
        uxrKey = urlsplit(url).netloc
        prize = driver.find_element('xpath', xRef[uxrKey]).text
    except:
        # pass # you'll just be repeating whatever you got in the previous loop for prize
        # [also, if this happens in the first loop, an error will be raised at lst.append([prize])]
        prize = None  # 'MISSING' # '' #
    ################ REST OF CODE #############
2. Limitations of Scraping with bs4+requests
I don't know what et and ureq are, but the response from requests.get can be parsed without them; although [afaik] bs4 doesn't have any XPath support, CSS selectors can be used with .select.
price = soup.select('.ProductHero__content .ProductPricing>span') # for Ulta
price = soup.select('p[data-comp~="Price"] span>b') # for Sephora
Although that's enough for Sephora, there's another issue - the prices on Ulta pages are loaded with JS, so the parent of the price span is empty.
3. [Suggested Solution] Extracting from JSON inside script Tags
For both sites, product data can be found inside script tags, so this function can be used to extract price from either site:
# import json

############ LONGER VERSION ##########
def getPrice_fromScript(scriptTag):
    try:
        s, sj = scriptTag.get_text(), json.loads(scriptTag.get_text())
        while s:
            sPair = s.split('"@type"', 1)[1].split(':', 1)[1].split(',', 1)
            t, s = sPair[0].strip(), sPair[1]
            try:
                if t == '"Product"': return sj['offers']['price']  # Ulta
                elif t == '"Organization"': return sj['offers'][0]['price']  # Sephora
                # elif.... # can add more options
                # else.... # can add a default
            except: continue
    except: return None
#######################################
############ SHORTER VERSION ##########
def getPrice_fromScript(scriptTag):
    try:
        sj = json.loads(scriptTag.get_text())
        try: return sj['offers']['price']  # Ulta
        except: pass
        try: return sj['offers'][0]['price']  # Sephora
        except: pass
        # try...except: pass # can try more options
    except: return None
#######################################
and you can use it with your BeautifulSoup code:
# from requests_html import HTMLSession # IF you use instead of requests
# def getPrice_fromScript....
for i in range(len(df)):
    response = requests.get(df['product_url'].iloc[i])  # takes too long [for me]
    # response = HTMLSession().get(df['product_url'].iloc[i]) # is faster [for me]

    ## error handling, just in case ##
    if response.status_code != 200:
        errorMsg = f'Failed to scrape [{response.status_code} {response.reason}] - '
        print(errorMsg, df['product_url'].iloc[i])
        continue  # skip to next loop/url

    soup = BeautifulSoup(response.content, 'html.parser')
    pList = [p.strip() for p in [
        getPrice_fromScript(s) for s in soup.select('script[type="application/ld+json"]')[:5]  # [1:2]
    ] if p and p.strip()]
    if pList: df['price'].iloc[i] = pList[0]
(The price should be in the second script tag with type="application/ld+json", but this is searching the first 5 just in case....)
Note: requests.get was being very slow when I was testing these codes, especially for Sephora, so I ended up using HTMLSession().get instead.
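For reference, the HTMLSession swap mentioned above is a one-line change (a sketch, assuming the requests_html package is installed); the parsing stays exactly the same:
from requests_html import HTMLSession
from bs4 import BeautifulSoup

session = HTMLSession()
response = session.get("https://www.ulta.com/p/coco-mademoiselle-eau-de-parfum-spray-pimprod2015831")
soup = BeautifulSoup(response.content, "html.parser")  # parse just as with requests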
I am trying to scrape App Store pages of games to fetch their icons and screenshots using Python. However, sometimes it returns random images that are not necessarily game screenshots. For example, it might return an Apple Maps screenshot, etc.
Here is my script to get the image URLs:
def get_app_store_images_icon(url):
    response = requests.get(url)
    data = response.text
    soup = BeautifulSoup(data, 'html.parser')
    link = soup.find('ul', {"class": "l-row l-row--peek we-screenshot-viewer__screenshots-list"})
    link_icon = soup.find('div', {"class": "product-hero__media l-column small-5 medium-4 large-3 small-valign-top"})
    images = []
    try:
        for subHtml in link.find_all('picture'):
            image_sources = subHtml.find_all('source')[-1].get('srcset')
            # to get last image in srcset
            start = image_sources.rindex(", ")
            end = image_sources.rindex(" ")
            images.append(image_sources[start+2:end])
    except:
        return None
    try:
        image_icon_sources = link_icon.find_all('source')[-1].get('srcset')
        start = image_icon_sources.rindex(", ")
        end = image_icon_sources.rindex(" ")
        image_icon = image_icon_sources[start+2:end]
    except:
        return None
    all_images = {'images': images, 'icon': image_icon}
    return all_images
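(The rindex logic above just takes the last candidate URL out of a srcset string; an equivalent, slightly less fragile sketch using split, shown with a made-up srcset value:)
srcset = "https://example.com/img-300.png 300w, https://example.com/img-600.png 600w"  # hypothetical value
last_candidate = srcset.split(",")[-1].strip()  # "https://example.com/img-600.png 600w"
last_url = last_candidate.split()[0]            # drop the width descriptor
print(last_url)  # https://example.com/img-600.png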
I am using BeautifulSoup4 version 4.9.1 and Python 3.7
I want to know if this is a bug or one of Apple's security conventions that dynamically changes images for crawlers.
I am somewhat new to Python and can't for the life of me figure out why the following code isn’t pulling the element I am trying to get.
Here is my code:
for player in all_players:
    player_first, player_last = player.split()
    player_first = player_first.lower()
    player_last = player_last.lower()
    first_name_letters = player_first[:2]
    last_name_letters = player_last[:5]
    player_url_code = '/{}/{}{}01'.format(last_name_letters[0], last_name_letters, first_name_letters)
    player_url = 'https://www.basketball-reference.com/players' + player_url_code + '.html'
    print(player_url)  # test
    req = urlopen(player_url)
    soup = bs.BeautifulSoup(req, 'lxml')
    wrapper = soup.find('div', id='all_advanced_pbp')
    table = wrapper.find('div', class_='table_outer_container')
    for td in table.find_all('td'):
        player_pbp_data.append(td.get_text())
Currently returning:
--> for td in table.find_all('td'):
player_pbp_data.append(td.get_text()) #if this works, would like to
AttributeError: 'NoneType' object has no attribute 'find_all'
Note: iterating through children of the wrapper object returns <div class="table_outer_container"> as part of the tree.
Thanks!
Make sure that table contains the data you expect.
For example https://www.basketball-reference.com/players/a/abdulka01.html doesn't seem to contain a div with id='all_advanced_pbp'
Try to explicitly pass the html instead:
bs.BeautifulSoup(the_html, 'html.parser')
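A minimal sketch of that check (reusing the URL pattern from the question), so that find_all is never called on None:
import bs4 as bs
from urllib.request import urlopen

player_url = 'https://www.basketball-reference.com/players/a/abdulka01.html'
soup = bs.BeautifulSoup(urlopen(player_url), 'html.parser')

wrapper = soup.find('div', id='all_advanced_pbp')
if wrapper is None:
    print('no all_advanced_pbp div on', player_url)  # skip or log this player
else:
    table = wrapper.find('div', class_='table_outer_container')
    if table is not None:
        for td in table.find_all('td'):
            print(td.get_text())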
I tried to extract data from the URL you gave, but requests did not return the full DOM. I then accessed the page in a browser with and without JavaScript; the website needs JavaScript to load some of the data, although pages like the players list do not. The simple way to get dynamic data is to use Selenium.
This is my test code
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

player_pbp_data = []

def get_list(t="a"):
    with requests.Session() as se:
        url = "https://www.basketball-reference.com/players/{}/".format(t)
        req = se.get(url)
        soup = BeautifulSoup(req.text, "lxml")
        with open("a.html", "wb") as f:
            f.write(req.text.encode())
        table = soup.find("div", class_="table_wrapper setup_long long")
        players = {player.a.text: "https://www.basketball-reference.com" + player.a["href"] for player in table.find_all("th", class_="left ")}

def get_each_player(player_url="https://www.basketball-reference.com/players/a/abdulta01.html"):
    with webdriver.Chrome() as ph:
        ph.get(player_url)
        text = ph.page_source
    '''
    with requests.Session() as se:
        text = se.get(player_url).text
    '''
    soup = BeautifulSoup(text, 'lxml')
    try:
        wrapper = soup.find('div', id='all_advanced_pbp')
        table = wrapper.find('div', class_='table_outer_container')
        for td in table.find_all('td'):
            player_pbp_data.append(td.get_text())
    except Exception as e:
        print("This page does not contain pbp")

get_each_player()
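After the get_each_player() call above, whatever was scraped sits in player_pbp_data; a quick way to check (a small sketch using the names from the code above):
print(len(player_pbp_data), "cells scraped")
print(player_pbp_data[:10])  # first few cell values, if any were found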
I am looking to identify the urls that request external resources in html files.
I currently use the src attribute in the img and script tags, and the href attribute in the link tag (to identify CSS).
Are there other tags that I should be examining to identify other resources?
For reference, my code in Python is currently:
html = read_in_file(file)
soup = BeautifulSoup(html)

image_scr = [x['src'] for x in soup.findAll('img')]
css_link = [x['href'] for x in soup.findAll('link')]

scipt_src = []  ## Often times script doesn't have attributes 'src' hence need for try/except
for x in soup.findAll('script'):
    try:
        scipt_src.append(x['src'])
    except KeyError:
        pass
I updated my code to capture what seem to be the most common resources in HTML. Obviously this doesn't look at resources requested in either CSS or JavaScript. If I am missing tags, please comment.
from bs4 import BeautifulSoup

def find_list_resources(tag, attribute, soup):
    list = []
    for x in soup.findAll(tag):
        try:
            list.append(x[attribute])
        except KeyError:
            pass
    return list

html = read_in_file(file)
soup = BeautifulSoup(html)

image_scr = find_list_resources('img', "src", soup)
scipt_src = find_list_resources('script', "src", soup)
css_link = find_list_resources("link", "href", soup)
video_src = find_list_resources("video", "src", soup)
audio_src = find_list_resources("audio", "src", soup)
iframe_src = find_list_resources("iframe", "src", soup)
embed_src = find_list_resources("embed", "src", soup)
object_data = find_list_resources("object", "data", soup)
soruce_src = find_list_resources("source", "src", soup)
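One more place URLs can hide: img and source tags may also carry several candidate URLs in a srcset attribute. A small sketch extending the helper above to pull those out as well (find_srcset_resources is a made-up name, not part of the original code):
def find_srcset_resources(tag, soup):
    urls = []
    for x in soup.findAll(tag):
        try:
            # srcset looks like "url1 1x, url2 2x"; keep just the URLs
            for candidate in x["srcset"].split(","):
                candidate = candidate.strip()
                if candidate:
                    urls.append(candidate.split()[0])  # drop the width/density descriptor
        except KeyError:
            pass
    return urls

img_srcset = find_srcset_resources("img", soup)
source_srcset = find_srcset_resources("source", soup)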
I'm using BeautifulSoup to get an HTML page from IMDb, and I would like to extract the poster image from the page. I've found the element based on one of its attributes, but I don't know how to extract the data inside it.
Here's my code:
url = 'http://www.imdb.com/title/tt%s/' % (id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print("before FOR")
for src in soup.find(itemprop="image"):
    print("inside FOR")
    print(link.get('src'))
You're almost there - just a couple of mistakes. soup.find() gets the first element that matches, not a list, so you don't need to iterate over it. Once you have got the element, you can get its attributes (like src) using dictionary access. Here's a reworked version:
film_id = '0423409'
url = 'http://www.imdb.com/title/tt%s/' % (film_id)
soup = BeautifulSoup(urllib2.urlopen(url).read())
link = soup.find(itemprop="image")
print(link["src"])
# output:
http://ia.media-imdb.com/images/M/MV5BMTg2ODMwNTY3NV5BMl5BanBnXkFtZTcwMzczNjEzMQ@@._V1_SY317_CR0,0,214,317_.jpg
I've changed id to film_id, because id() is a built-in function, and it's bad practice to mask those.
I believe your example is very close. You need to use findAll() instead of find(), and when you iterate, you switch from src to link. In the example below I switched it to tag.
This code is working for me with BeautifulSoup4:
url = 'http://www.imdb.com/title/tt%s/' % (id,)
soup = BeautifulSoup(urllib2.urlopen(url).read())
print "before FOR"
for tag in soup.findAll(itemprop="image"):
    print "inside FOR"
    print(tag['src'])
If I understand correctly, you are looking for the src of the image so that you can extract the image afterwards.
First you need to find (using the inspector) where the image sits in the HTML. For example, in my particular case, where I was scraping soccer team shields, I needed:
m_url = 'http://www.marca.com/futbol/primera/equipos.html'
client = uOpen(m_url)
page = client.read()
client.close()

page_soup = BS(page, 'html.parser')
teams = page_soup.findAll('li', {'id': 'nombreEquipo'})

for team in teams:
    name = team.h2.text
    shield_url = team.img['src']
Then, you need to process the image. You have two options.
1st: using NumPy (with OpenCV):
# import numpy as np
# import cv2

def url_to_image(url):
    '''
    Function to extract an image from a URL
    '''
    resp = uOpen(url)
    image = np.asarray(bytearray(resp.read()), dtype='uint8')
    image = cv2.imdecode(image, cv2.IMREAD_COLOR)
    return image

shield = url_to_image(shield_url)
2nd: using the scikit-image library (which you will probably need to install):
# from skimage import io
shield = io.imread('http:' + shield_url)
Note: Just in this particular example I needed to add http: at the beginning.
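A more general way to handle that is urljoin from the standard library, which resolves protocol-relative and relative src values against the page URL (a small sketch reusing m_url and shield_url from above):
from urllib.parse import urljoin

shield = io.imread(urljoin(m_url, shield_url))  # works whether shield_url starts with //, /, or http: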
Hope it helps!
Here's a full working example with gazpacho:
Step 1 - import everything and download the html:
from pathlib import Path
from urllib.request import urlretrieve as download
from gazpacho import Soup
id = 'tt5057054'
url = f"https://www.imdb.com/title/{id}"
soup = Soup.get(url)
Step 2 - find the src url for the image asset:
image = (soup
    .find("div", {"id": "title-overview"})
    .find("div", {"class": "poster"})
    .find("img")
    .attrs['src']
)
Step 3 - save it to your machine:
directory = "images"
Path(directory).mkdir(exist_ok=True)
extension = image.split('.')[-1]
download(image, f"{directory}/{id}.{extension}")