How to web-scrape images that do not have a source? - python

Link:https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0
This website presents its questions as images, which I need to scrape. However, I cannot get a link to their source: my script only outputs links to some loading GIFs. When I looked at the page source, the images did not even have a "src" attribute. You can see how the website works at the link above. How can I download all of these images?
from bs4 import BeautifulSoup
import requests
import os
url = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')
images = soup.find_all('img')
for image in images:
    link = image['src']
    print(link)

The question IDs are embedded in the page; try extracting them with the re (regex) module.
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
}
URL = "https://www.exam-mate.com/topicalpastpapers/?cat=3&subject=22&years=&seasons=&paper=&zone=&chapter=&order=asc0"
BASE_URL = "https://www.exam-mate.com"
# Pass the headers defined above so the request looks like it comes from a browser
soup = BeautifulSoup(requests.get(URL, headers=headers).content, "html.parser")
for tag in soup.select("td:nth-of-type(1) a"):
    # Find the question image path inside the onclick handler
    question_link = re.search(r"/questions.*\.png", tag["onclick"]).group()
    print(BASE_URL + question_link)
Output:
https://www.exam-mate.com/questions/1240/1362/1240_q_1362_1_1.png
https://www.exam-mate.com/questions/1240/1363/1240_q_1363_2_1.png
https://www.exam-mate.com/questions/1240/1364/1240_q_1364_3_1.png
https://www.exam-mate.com/questions/1240/1365/1240_q_1365_4_1.png
https://www.exam-mate.com/questions/1240/1366/1240_q_1366_5_1.png
...And on
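Once the image URLs are known, the remaining step the question asks about is saving them. The following is a minimal sketch, not the answerer's code: the sample onclick string, the helper names extract_image_path and download_image, and the out_dir folder are assumptions made for illustration, and the real onclick attribute on the site may look different.

```python
import os
import re
import urllib.request
from urllib.parse import urljoin, urlsplit

BASE_URL = "https://www.exam-mate.com"

def extract_image_path(onclick):
    """Return the '/questions/....png' path found in an onclick string, or None."""
    match = re.search(r"/questions[^'\"]*?\.png", onclick or "")
    return match.group() if match else None

def download_image(image_url, out_dir="questions"):
    """Save image_url into out_dir, naming the file after the last URL segment."""
    os.makedirs(out_dir, exist_ok=True)
    filename = os.path.basename(urlsplit(image_url).path)
    target = os.path.join(out_dir, filename)
    urllib.request.urlretrieve(image_url, target)  # performs the network request
    return target

# Hypothetical onclick value, shaped only loosely after the output above
sample = "show('/questions/1240/1362/1240_q_1362_1_1.png', 1)"
path = extract_image_path(sample)
image_url = urljoin(BASE_URL, path)
print(image_url)
```

Calling download_image(image_url) for each extracted URL would then write one file per question, assuming the server allows direct requests for the image files.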

Because the page is dynamic, BeautifulSoup doesn't work here; you have to use Selenium:
1. Navigate to the site.
2. Get all the questions using the XPath //div/div[3]/center/table/tbody/tr/td[1]/center/a, then loop over them and click on each one.
3. Get the image source using the XPath //*[@id="question_prev"]/div[2]/img/@src, then fetch and save the image.
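The steps above can be sketched roughly as follows. This is an illustration under assumptions, not tested against the site: the function name collect_image_urls and the pause parameter are made up here, and the XPaths are taken from the answer as given. Two details differ from the text: XPath attribute tests are written with "@", and Selenium locators have to select elements rather than attributes, so the sketch selects the img element and reads its src afterwards.

```python
import time

# XPath locators from the steps above; the img element is selected and
# its src attribute is read with get_attribute() afterwards.
QUESTION_LINKS_XPATH = "//div/div[3]/center/table/tbody/tr/td[1]/center/a"
QUESTION_IMAGE_XPATH = '//*[@id="question_prev"]/div[2]/img'

def collect_image_urls(driver, page_url, pause=1.0):
    """Click every question link and collect the src of the preview image."""
    driver.get(page_url)
    urls = []
    # "xpath" is the string value of selenium's By.XPATH constant
    for link in driver.find_elements("xpath", QUESTION_LINKS_XPATH):
        link.click()
        time.sleep(pause)  # crude wait for the preview to update
        image = driver.find_element("xpath", QUESTION_IMAGE_XPATH)
        src = image.get_attribute("src")
        if src and src not in urls:
            urls.append(src)
    return urls
```

With Selenium installed, this would be called with a real browser, for example collect_image_urls(webdriver.Chrome(), url), and each returned URL could then be saved with requests or urllib.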

Related

Why can't I scrape images inside a class or a div?

I want to get all the images within a div, but every time I try, the output is None or just an empty list. The issue only seems to happen when I try to scrape inside a div or a class, even when using different user agents and either .find or .find_all.
from bs4 import BeautifulSoup
import requests
abcde = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36'}
r = requests.get('https://www.gettyimages.com.br/fotos/randon', headers=abcde)
soup = BeautifulSoup(r.content, 'html.parser')
check = soup.find_all('img', class_="GalleryItems-module__searchContent___DbMmK")
print(check)
I would recommend working with the API, since there is one: https://developers.gettyimages.com/docs/
To answer your question about the images themselves: classes are not the best identifiers because they are often dynamic, and the page also has both a gallery (fixed) view and a mosaic view.
Simply select the <article> and its child <img> to reach your goal:
import requests
from bs4 import BeautifulSoup
r = requests.get(
    'https://www.gettyimages.com.br/fotos/randon?assettype=image&sort=mostpopular&phrase=randon',
    headers={'User-Agent': 'Mozilla/5.0'}
)
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.text, 'html.parser')
for e in soup.select('article img'):
    print(e.get('src'))
Output
https://media.gettyimages.com/photos/randon-norway-picture-id974597088?k=20&m=974597088&s=612x612&w=0&h=EIwbJNzCld1tbU7rTyt42pie2yCEk5z4e6L6Z4kWhdo=
https://media.gettyimages.com/photos/caption-patrick-roanhouse-a-266-member-chats-about-some-software-on-picture-id97112678?k=20&m=97112678&s=612x612&w=0&h=zmwqIlVv2f-M9Vz_qcpITPzj-3SON99G3P69h69J5Gs=
https://media.gettyimages.com/photos/12th-and-f-streets-nw-washington-dc-pedestrians-teofila-randon-left-picture-id97102402?k=20&m=97102402&s=612x612&w=0&h=potzNUgMo3gKab5eS_pwyNggS2YGn6sCnDQYxdGUHqc=
https://media.gettyimages.com/photos/randon-perdue-kari-barnhart-attend-the-other-nashville-society-one-picture-id969787702?k=20&m=969787702&s=612x612&w=0&h=kcaYmOKruLb57Vqga68xvEZB1V12wSPPYkC6GdvXO18=
https://media.gettyimages.com/photos/death-of-duguesclin-to-chateauneuf-de-randon-july-13-1380-during-the-picture-id959538894?k=20&m=959538894&s=612x612&w=0&h=lx3DHDSf3kBc_h-O2kjR2D6UYDjPPvhn8xJ_KM0cmMc=
https://media.gettyimages.com/photos/ski-de-randone-a-saintefoy-au-dessus-du-couloir-de-la-croix-savoie-mr-picture-id945817638?k=20&m=945817638&s=612x612&w=0&h=fRd3M2KCa5dd0z8ePnPw2IkAKhXYJpuCFuUTz7jpVPU=
...

How can I get URLs from Oddsportal?

How can I get all the URLs from this particular link: https://www.oddsportal.com/results/#soccer
For every URL on this page, there are multiple pages e.g. the first link of the page:
https://www.oddsportal.com/soccer/africa/
leads to the below page as an example:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/...
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/#/page/2/...
I would ideally like to code this in Python, as I am more comfortable with it than with other languages (though still far from what I would call comfortable).
When I inspect the elements, I can see that the links can be scraped; however, I am very new to this.
Please help.
I have extracted the URLs from the main page that you mentioned.
import requests
import bs4 as bs
url = 'https://www.oddsportal.com/results/#soccer'
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'html.parser')
base_url = 'https://www.oddsportal.com'
a = soup.findAll('a', attrs={'foo': 'f'})
# This set will have all the URLs of the main page
s = set()
for i in a:
    s.add(base_url + i['href'])
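Two details of the snippet above are worth a small sketch. Concatenating base_url and href only works when every href starts with "/", whereas urljoin from the standard library also copes with absolute and relative values. And the "#/page/2/" part of the paginated URLs in the question is a URL fragment, which browsers do not send to the server, so a plain requests call receives the same document for every page number. The href values and the helper name absolute_links below are hypothetical.

```python
from urllib.parse import urljoin, urldefrag

BASE_URL = "https://www.oddsportal.com"

def absolute_links(hrefs, base=BASE_URL):
    """Resolve hrefs against base and return them de-duplicated, in order."""
    seen = []
    for href in hrefs:
        if not href:
            continue  # skip anchors without an href
        full = urljoin(base, href)
        if full not in seen:
            seen.append(full)
    return seen

# Hypothetical href values of the kind the loop above collects
hrefs = ["/soccer/africa/", "https://www.oddsportal.com/soccer/africa/", None]
print(absolute_links(hrefs))

# urldefrag splits off the fragment, i.e. the part that never reaches the server
page_two = "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/"
print(urldefrag(page_two))
```

For the paginated results this suggests that either browser automation or the underlying request the page makes would be needed, although which one applies to this site is not confirmed here.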
Since you are new to web scraping, I suggest you go through these:
Beautiful Soup - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests - Requests is an elegant and simple HTTP library for Python.
Docs: https://docs.python-requests.org/en/master/
Selenium - Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.
Docs: https://selenium-python.readthedocs.io/

Could not scrape some image links from a webpage using requests module

I've created a script using the requests and BeautifulSoup libraries to parse the links of some images from a webpage. The image links are visible when you use the selector [class^='cylindo-viewer-frame'] > img[src*='/frames/'] in the search bar (Ctrl + F) after inspecting an element. This is how they look in the DOM.
I know I can grab those image links using selenium but I would like to stick with requests module. I've noticed several times that there are always possibilities to grab such dynamic content using requests module. I've tried finding those links within script tag and in dev tools but no luck.
Two of the expected links out of 32 are:
https://content.cylindo.com/api/v2/4616/products/657285/frames/5/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/7/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
This is how I've tried:
import requests
from bs4 import BeautifulSoup
link = 'https://www.ethanallen.com/on/demandware.store/Sites-ethanallen-us-Site/en_US/Product-Variation?pid=emersonQS&dwvar_emersonQS_Fabric=Q1031&dwvar_emersonQS_seatingSize=90sofa&step=2'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    for item in soup.select(".cylindo-viewer-container li[class^='cylindo-viewer-frame'] > img[src*='/frames/']"):
        print(item.get("src"))
How can I grab those image links using requests?
Why you should use Selenium
The website serves its content dynamically, which requests cannot handle, because the information you are trying to match is not in the response.
Take a look, it is not that hard ;)
Example
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
# Raw string so the backslashes in the Windows path are not treated as escapes
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = "https://www.ethanallen.com/on/demandware.store/Sites-ethanallen-us-Site/en_US/Product-Variation?pid=emersonQS&dwvar_emersonQS_Fabric=Q1031&dwvar_emersonQS_seatingSize=90sofa&step=2"
driver.get(url)
sleep(2)
soup = BeautifulSoup(driver.page_source, 'lxml')
for item in soup.select(".cylindo-viewer-container li[class^='cylindo-viewer-frame'] > img[src*='/frames/']"):
    print(item.get("src"))
driver.close()
Output
https://content.cylindo.com/api/v2/4616/products/657285/frames/3/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/27/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/29/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/11/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/31/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
https://content.cylindo.com/api/v2/4616/products/657285/frames/5/657285.JPG?background=FFFFFF&feature=FABRIC:Q1031&size=1268
...
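The fixed sleep(2) in the example may be too short on a slow connection and is wasted time on a fast one. One alternative is to poll until the frame images are present. The helper below is a hand-written sketch of that idea (the name wait_for_elements and its parameters are assumptions); Selenium's own WebDriverWait offers the same kind of explicit wait and would usually be preferred in real code.

```python
import time

FRAME_SELECTOR = ".cylindo-viewer-container li[class^='cylindo-viewer-frame'] > img[src*='/frames/']"

def wait_for_elements(driver, css_selector, timeout=10.0, interval=0.5):
    """Poll until css_selector matches at least one element or timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        # "css selector" is the string value of selenium's By.CSS_SELECTOR
        found = driver.find_elements("css selector", css_selector)
        if found or time.monotonic() >= deadline:
            return found  # empty list if nothing appeared in time
        time.sleep(interval)
```

In the example above, a call such as wait_for_elements(driver, FRAME_SELECTOR) would take the place of sleep(2), and the src of each returned element could be read with get_attribute("src").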

How can I scrape all the images from a website?

I have a website from which I'd like to get all the images.
The website is somewhat dynamic in nature. I tried using the Agenty extension for Google Chrome and followed these steps:
1. I chose one image that I want to extract using a CSS selector; this makes the extension select the other, similar images automatically.
2. Clicked the Show button and selected ATTR (attribute).
3. Set src as the ATTR field.
4. Gave the field a name in the name option.
5. Saved it and ran it using the Agenty platform/API.
This should give me the result, but it doesn't; it returns an empty output.
Is there a better option? Would BS4 be a better option for this? Any help is appreciated.
I am assuming you want to download all the images on the website. It is actually very easy to do this effectively using Beautiful Soup 4 (BS4).
# Code to find all images in a given webpage
from bs4 import BeautifulSoup
import urllib.request
import requests
import shutil
url = 'https://www.mcmaster.com/'
html_page = urllib.request.urlopen(url)
soup = BeautifulSoup(html_page, features="lxml")
for img in soup.findAll('img'):
    assa = img.get('src')
    new_image = url + assa
You can also download the image with this tacked on to the end (my_url is the URL of the image, for example new_image from the loop above):
response = requests.get(my_url, stream=True)
with open('Mypic.bmp', 'wb') as file:
    shutil.copyfileobj(response.raw, file)
Everything in two lines:
from bs4 import BeautifulSoup; import urllib.request; from urllib.request import urlretrieve
for img in (BeautifulSoup((urllib.request.urlopen("https://apod.nasa.gov/apod/astropix.html")), features="lxml")).findAll('img'): assa=(img.get('src')); urlretrieve(("https://apod.nasa.gov/apod/"+assa), "Mypic.bmp")
The new image should be in the same directory as the python file, but can be moved with:
os.rename()
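Note that both snippets write to the fixed name Mypic.bmp, so inside a loop each download overwrites the previous one. A small sketch of one way around that, assuming the last segment of the image URL is an acceptable file name; the helper name image_target, the images folder and the sample src values are hypothetical.

```python
import os
from urllib.parse import urljoin, urlsplit

def image_target(page_url, src, out_dir="images"):
    """Return (absolute image URL, local file path) for one <img> src value."""
    image_url = urljoin(page_url, src)
    # Use the last path segment as the file name; the query string is dropped
    name = os.path.basename(urlsplit(image_url).path) or "unnamed"
    return image_url, os.path.join(out_dir, name)

page = "https://apod.nasa.gov/apod/astropix.html"
# Hypothetical src values, only to show how each one maps to its own file
for src in ["image/2401/example_a.jpg", "/apod/image/2401/example_b.jpg?v=2"]:
    print(image_target(page, src))
```

The returned pair can be passed straight to urlretrieve(image_url, path) once the output folder exists, for example after os.makedirs("images", exist_ok=True).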
In the case of the McMaster website, the images are linked differently, so the above methods won't work. The following code should get most of the images on the website:
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import urllib.request
import shutil
import requests
req = Request("https://www.mcmaster.com/")
html_page = urlopen(req)
soup = BeautifulSoup(html_page, "lxml")
links = []
for link in soup.findAll('link'):
    links.append(link.get('href'))
print(links)
UPDATE: I found the code below in a GitHub post; it is MUCH more accurate:
import requests
import re
image_link_home=("https://images1.mcmaster.com/init/gfx/home/.*[0-9]")
html_page = requests.get(('https://www.mcmaster.com/'),headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text
for item in re.findall(image_link_home,html_page):
    if str(item).startswith('http') and len(item) < 150:
        print(item.strip())
    else:
        for elements in item.split('background-image:url('):
            for item in re.findall(image_link_home,elements):
                print((str(item).split('")')[0]).strip())
Hope this helps!
You should use Scrapy; it makes crawling seamless. By selecting the content you wish to download with CSS selectors, you can automate the crawling easily.
You can use the Agenty Web Scraping Tool.
1. Set up your scraper using the Chrome extension to extract the src attribute from images.
2. Save the agent to run in the cloud.
Here is a similar question answered on the Agenty forum: https://forum.agenty.com/t/can-i-extract-images-from-website/24
Full disclosure: I work at Agenty.
This site uses CSS embedding to store its images. If you check the source code, you can find links that contain https://images1.mcmaster.com/init/gfx/home/. Those are the actual images, but they are stitched together into sprites (a row of images).
Example: https://images1.mcmaster.com/init/gfx/home/Fastening-and-Joining-Fasteners-sprite-60.png?ver=1539608820
import requests
import re
url=('https://www.mcmaster.com/')
image_urls = []
html_page = requests.get(url,headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'}).text
for values in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]',html_page):
    if str(values).startswith('http') and len(values) < 150:
        image_urls.append(values.strip())
    else:
        for elements in values.split('background-image:url('):
            for urls in re.findall('https://images1.mcmaster.com/init/gfx/home/.*[0-9]',elements):
                urls = str(urls).split('")')[0]
                image_urls.append(urls.strip())
print(len(image_urls))
print(image_urls)
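The pattern .*[0-9] is greedy, so on a line that holds several URLs it runs from the first URL to the last digit on that line, which is why the code above needs the length check and the split on background-image:url(. A possible simplification, shown only as a sketch, is a pattern that stops at the first quote, closing parenthesis or whitespace. The sample CSS text and the helper name find_sprite_urls are made up for illustration; only the first URL comes from the answer.

```python
import re

# Dots are escaped, and the URL ends at the first quote, ")" or whitespace
IMAGE_URL = re.compile(r"""https://images1\.mcmaster\.com/init/gfx/home/[^"')\s]+""")

def find_sprite_urls(html_text):
    """Return the distinct home-page image URLs in html_text, in order of appearance."""
    urls = []
    for url in IMAGE_URL.findall(html_text):
        if url not in urls:
            urls.append(url)
    return urls

# Hypothetical CSS: three rules on one line, the last two sharing an image
sample = (
    '.a{background-image:url("https://images1.mcmaster.com/init/gfx/home/Fastening-and-Joining-Fasteners-sprite-60.png?ver=1539608820")}'
    '.b{background-image:url("https://images1.mcmaster.com/init/gfx/home/example-sprite-60.png?ver=1")}'
    '.c{background-image:url("https://images1.mcmaster.com/init/gfx/home/example-sprite-60.png?ver=1")}'
)
print(find_sprite_urls(sample))
```

Whether this catches every image on the real page depends on how the site writes its URLs, which is not verified here.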
Note: Scraping websites is subject to copyright.

Using python to scrape push data?

I'm trying to scrape the left side of this news site (= SENESTE NYT):
https://www.dr.dk/nyheder/
But the data doesn't seem to be anywhere to be found, neither in the HTML nor in a related API/JSON response. Is it some kind of push data?
Using the Network panel in Chrome's developer tools, I found this API, but it doesn't contain the news items on the left side:
https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100
Can anyone help me? How do I scrape "SENESTE NYT"?
I first loaded the page with selenium and then processed with BeautifulSoup.
from selenium import webdriver
from bs4 import BeautifulSoup
url = "https://www.dr.dk/nyheder"
driver = webdriver.Chrome()
driver.get(url)
page_source = driver.page_source
soup = BeautifulSoup(page_source, "lxml")
div = soup.find("div", {"class":"timeline-container"})
headlines = div.find_all("h3")
print(headlines)
And it seems to find the headlines:
[<h3>Puigdemont: Debatterede spørgsmål af interesse for hele Europa</h3>,
<h3>Afblæser tsunami-varsel for Hawaii</h3>,
<h3>56.000 flygter fra vulkan i udbrud </h3>,
<h3>Pence: USA offentliggør snart plan for ambassadeflytning </h3>,
<h3>Østjysk motorvej genåbnet </h3>]
Not sure if this is what you wanted.
-----EDITED----
A more efficient way would be to create a request with some custom headers (already confirmed this is not working):
import requests
headers = {
    "Accept":"*/*",
    "Host":"www.dr.dk",
    "Referer":"https://www.dr.dk/nyheder",
    "User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36"
}
r = requests.get(url="https://www.dr.dk/tjenester/newsapp-content/teasers?reqoffset=0&reqlimit=100", headers=headers)
r.json()
