I am trying to scrape image files from Instagram using Selenium and bs4. The script below seems to work, but it prints the complete img tags. Eventually I'd like it to print just the img src, for example https://scontent-lax3-2.cdninstagram.com/vp/2592f6b07f88bfc4bfdf6d73400a04b8/5BA6E998/t51.2885-15/s640x640/sh0.08/e35/28752330_1972627949433283_1816022201220988928_n.jpg, so that I can download the images later. For now I only need help printing that src link without the tags and extras. Thanks for the advice.
Code:
import requests
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
url = ('https://www.instagram.com/kitties/')
driver = webdriver.Firefox()
driver.get(url)
soup = BeautifulSoup(driver.page_source, 'lxml')
img_url = soup.find_all('img', class_='_2di5p')
print(img_url)
Just print out the src of the found images.
imgs = soup.find_all('img', class_='_2di5p')
for img in imgs:
    img_url = img["src"]
    print(img_url)
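A minimal offline sketch of the same idea, run against an inline HTML string instead of a live browser. The markup and the helper name extract_img_srcs are made up for illustration, and the class _2di5p comes from the question and may no longer match what Instagram serves. Reading the attribute with img.get("src") rather than img["src"] skips tags that have no src instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for driver.page_source.
SAMPLE_HTML = """
<div>
  <img class="_2di5p" src="https://example.com/a.jpg">
  <img class="_2di5p" src="https://example.com/b.jpg">
  <img class="_2di5p">
  <img class="other" src="https://example.com/icon.png">
</div>
"""


def extract_img_srcs(html, class_name):
    """Return the src of every <img> with the given class, skipping tags without one."""
    soup = BeautifulSoup(html, "html.parser")
    srcs = []
    for img in soup.find_all("img", class_=class_name):
        src = img.get("src")  # None when the attribute is missing
        if src:
            srcs.append(src)
    return srcs


if __name__ == "__main__":
    for src in extract_img_srcs(SAMPLE_HTML, "_2di5p"):
        print(src)
```

With Selenium, the same helper could be called as extract_img_srcs(driver.page_source, '_2di5p').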
I'm trying to scrape a link from the video description on YouTube, but the list always comes back empty.
I've tried changing the tag I'm scraping from, but neither the output nor the error message changes.
Here's the code I'm using:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.youtube.com/watch?v=gqUqGaXipe8').text
soup = BeautifulSoup(source, 'lxml')
link = [i['href'] for i in soup.findAll('a', class_='yt-simple-endpoint style-scope yt-formatted-string', href=True)]
print(link)
What is wrong, and how can I solve it?
In your case, requests doesn't return the full HTML structure of the page. If YouTube fills in the data using JavaScript, the page has to be run through a real browser to get the rendered source, for example headless Chrome driven by the Selenium library. Here is a general solution:
from bs4 import BeautifulSoup
from selenium import webdriver
import time
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options = options)
url = "https://www.youtube.com/watch?v=Oh1nqnZAKxw"
driver.get(url)
time.sleep(2)
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()
link = [i['href'] for i in soup.select('div#meta div#description [href]')]
print(link)
I'm trying to download images from this page. I wrote the following Python script:
import requests
import subprocess
from bs4 import BeautifulSoup
request = requests.get("http://ottofrello.dk/malerierstor.htm")
content = request.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find_all("img")
for img in element:
    print(img.get('src'))
However, I get only the image names and not the complete path. On the site, I can hover over the image name when I inspect the html and the link appears. Is there any way I can parse this link using BeautifulSoup?
The image URIs in your page are written relative to the hostname.
You can build an absolute URL for each image using the urljoin function in the urllib.parse module.
from urllib.parse import urljoin
page_url = "http://ottofrello.dk/malerierstor.htm"
request = requests.get(page_url)
...
for img in element:
    image_url = urljoin(
        page_url,
        img.get('src')
    )
    print(image_url)
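To make the urljoin behaviour concrete, here is a small self-contained sketch that needs no network access. The src values and the helper name absolute_image_urls are hypothetical and are not taken from the real page; only the page URL comes from the question.

```python
from urllib.parse import urljoin

PAGE_URL = "http://ottofrello.dk/malerierstor.htm"


def absolute_image_urls(page_url, srcs):
    """Resolve each src against the page URL, dropping empty or missing values."""
    return [urljoin(page_url, src) for src in srcs if src]


if __name__ == "__main__":
    # Hypothetical src values: document-relative, root-relative, already absolute, missing.
    samples = [
        "billeder/maleri1.jpg",
        "/billeder/maleri2.jpg",
        "http://example.com/x.jpg",
        None,
    ]
    for url in absolute_image_urls(PAGE_URL, samples):
        print(url)
```

A src that is already absolute passes through unchanged, so the helper can be applied to every img tag without checking the form of the value first.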
From what I understand, you want the absolute image path rather than the relative path you are getting now. The only change I made is in your print statement.
import requests
import subprocess
from bs4 import BeautifulSoup
request = requests.get("http://ottofrello.dk/malerierstor.htm")
content = request.content
soup = BeautifulSoup(content, "html.parser")
element = soup.find_all("img")
for img in element:
    print('http://ottofrello.dk/' + img.get('src'))
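Prefixing the hostname works as long as every src is a plain relative path. The sketch below compares that approach with urljoin on a few hypothetical src values (none of them taken from the real page) to show where the two diverge. The function names are illustrative only.

```python
from urllib.parse import urljoin

BASE = "http://ottofrello.dk/"
PAGE_URL = "http://ottofrello.dk/malerierstor.htm"


def by_concatenation(src):
    """Prefix the hostname, as in the print statement above."""
    return BASE + src


def by_urljoin(src):
    """Resolve the src against the page URL."""
    return urljoin(PAGE_URL, src)


if __name__ == "__main__":
    # Hypothetical values: plain relative, root-relative, already absolute.
    for src in ["pic.jpg", "/pic.jpg", "http://cdn.example.com/pic.jpg"]:
        print(src, "|", by_concatenation(src), "|", by_urljoin(src))
```

Both approaches agree on the plain relative path. They differ once a src starts with a slash or is already a full URL.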
I'm unable to scrape images from the website www.kissmanga.com. I'm using Python 3 with the Requests and BeautifulSoup libraries. The scraped image tags have a blank "src".
SRC:
from bs4 import BeautifulSoup
import cfscrape
import requests
scraper = cfscrape.create_scraper()
url = "http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206"
response = requests.get(url)
soup2 = BeautifulSoup(response.text, 'html.parser')
divImage = soup2.find('div',{"id": "divImage"})
for img in divImage.findAll('img'):
    print(img)
response.close()
I suspect image scraping is blocked because the website uses Cloudflare. On that assumption, I also tried using the "cfscrape" library to scrape the content.
You need to wait for JavaScript to inject the HTML for the images.
Several tools can do this; here are some of them:
Ghost
PhantomJS (Ghost Driver)
Selenium
I was able to get it working with Selenium:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
driver = webdriver.Firefox()
# it takes forever to load the page, therefore we are setting a threshold
driver.set_page_load_timeout(5)
try:
    driver.get("http://kissmanga.com/Manga/Bleach/Bleach-634--Friend-004?id=235206")
except TimeoutException:
    # never ignore exceptions silently in real world code
    pass
soup2 = BeautifulSoup(driver.page_source, 'html.parser')
divImage = soup2.find('div', {"id": "divImage"})
# close the browser
driver.close()
for img in divImage.findAll('img'):
    print(img.get('src'))
Refer to "How to download image using requests" if you also want to download these images.
Have you tried setting a custom user-agent?
It's typically considered unethical to do so, but so is scraping manga.
I am trying to fetch the important images, not thumbnails or other GIFs, from a Wikipedia page using the following code. However, "imgs" comes back with a length of 0. Any suggestions on how to fix it?
Code :
import urllib
import urllib2
from bs4 import BeautifulSoup
import os
html = urllib2.urlopen("http://en.wikipedia.org/wiki/Main_Page")
soup = BeautifulSoup(html)
imgs = soup.findAll("div",{"class":"image"})
Also, it would be awesome if someone could explain in detail how to work out the findAll arguments by looking at the "source element" of a web page.
The a tags on the page have an image class, not div:
>>> img_links = soup.findAll("a", {"class":"image"})
>>> for img_link in img_links:
...     print img_link.img['src']
...
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
Or, even better, use the a.image > img CSS selector:
>>> for img in soup.select('a.image > img'):
...     print img['src']
//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg
//upload.wikimedia.org/wikipedia/commons/thumb/4/4b/Christuss%C3%A4ule_8.jpg/77px-Christuss%C3%A4ule_8.jpg
...
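As a rough illustration of how markup maps to findAll arguments, the sketch below runs all three searches against a small inline fragment. The fragment is hypothetical and only shaped like the markup described above; it is not copied from Wikipedia. The first argument is the tag name as it appears in the source, and the dict lists attributes that tag must carry.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: the "image" class sits on <a>, not on <div>.
SAMPLE_HTML = """
<div id="content">
  <a class="image" href="/wiki/File:One.jpg"><img src="//example.org/one.jpg"></a>
  <a class="image" href="/wiki/File:Two.jpg"><img src="//example.org/two.jpg"></a>
  <a class="external" href="http://example.org/"><img src="//example.org/icon.png"></a>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# No <div class="image"> exists in the fragment, so this search finds nothing.
divs = soup.find_all("div", {"class": "image"})

# The class is on the <a> tags, so search for those and step into the child <img>.
from_find_all = [a.img["src"] for a in soup.find_all("a", {"class": "image"})]

# Same result with a CSS selector: an <img> directly inside <a class="image">.
from_select = [img["src"] for img in soup.select("a.image > img")]

if __name__ == "__main__":
    print(len(divs))
    print(from_find_all)
    print(from_select)
```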
UPD (downloading images using urllib.urlretrieve):
from urllib import urlretrieve
import urlparse
from bs4 import BeautifulSoup
import urllib2
url = "http://en.wikipedia.org/wiki/Main_Page"
soup = BeautifulSoup(urllib2.urlopen(url))
for img in soup.select('a.image > img'):
    img_url = urlparse.urljoin(url, img['src'])
    file_name = img['src'].split('/')[-1]
    urlretrieve(img_url, file_name)
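The src values printed earlier start with //, which makes them protocol-relative, and that is why the download code joins them with the page URL first. Below is a Python 3 sketch of just that step, with no download involved (the code above is Python 2). The helper names are hypothetical, and percent-decoding the file name is an optional extra that the original code does not do.

```python
import posixpath
from urllib.parse import unquote, urljoin, urlsplit

PAGE_URL = "http://en.wikipedia.org/wiki/Main_Page"


def resolve(page_url, src):
    """Turn a protocol-relative or relative src into an absolute URL."""
    return urljoin(page_url, src)


def local_file_name(image_url):
    """Last segment of the URL path, percent-decoded, ignoring any query string."""
    path = urlsplit(image_url).path
    return unquote(posixpath.basename(path))


if __name__ == "__main__":
    src = ("//upload.wikimedia.org/wikipedia/commons/thumb/1/1f/"
           "Stora_Kronan.jpeg/100px-Stora_Kronan.jpeg")
    full_url = resolve(PAGE_URL, src)
    print(full_url)
    print(local_file_name(full_url))
```

The resolved URL takes its scheme from the page URL, so an https page would yield https image URLs.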
I don't see any div tags with a class called 'image' on that page.
You could get all the image tags and throw away ones that are small.
imgs = soup.select('img')
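One possible way to "throw away the small ones" is to look at the width and height attributes declared on each tag. This is only a sketch under that assumption: the markup, the helper name large_images, and the 200-pixel threshold are all made up, and images that declare no size are simply skipped here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a mix of declared sizes.
SAMPLE_HTML = """
<img src="/big.jpg" width="640" height="480">
<img src="/thumb.jpg" width="100" height="75">
<img src="/icon.gif" width="16" height="16">
<img src="/unknown.jpg">
"""


def _to_int(value):
    """Parse a width/height attribute; return None if missing or not a plain integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def large_images(html, min_side=200):
    """Return the src of images whose declared width and height are both >= min_side."""
    soup = BeautifulSoup(html, "html.parser")
    kept = []
    for img in soup.select("img"):
        width = _to_int(img.get("width"))
        height = _to_int(img.get("height"))
        if width is None or height is None:
            continue  # size not declared in the markup
        if width >= min_side and height >= min_side:
            kept.append(img.get("src"))
    return kept


if __name__ == "__main__":
    print(large_images(SAMPLE_HTML))
```

Pages that do not declare sizes in the markup would need another approach, such as downloading each file and measuring it.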
I am looking to grab the full size product images from here
My thinking was:
Follow the image link
Download the picture
Go back
Repeat for n+1 pictures
I know how to open the image thumbnails but not how to get the full size images. Any ideas on how this could be done?
This will get you the URLs of all the images:
import urllib2
from bs4 import BeautifulSoup
url = "http://icecat.biz/p/toshiba/pscbxe-01t00een/satellite-pro-notebooks-4051528049077-Satellite+Pro+C8501GR-17732197.html"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html)
imgs = soup.findAll("div", {"class":"thumb-pic"})
for img in imgs:
    print img.a['href'].split("imgurl=")[1]
Output:
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g1_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g4_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g2_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g5_satellite-pro-c850.jpg
http://www.toshiba.fr/contents/fr_FR/SERIES_DESCRIPTION/images/g3_satellite-pro-c850.jpg
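Splitting on "imgurl=" works when that parameter is the last thing in the href. As a hedged alternative, the Python 3 sketch below reads the parameter with urllib.parse instead, which also copes with trailing parameters and percent-encoded values. The href shapes shown are assumptions, since the real page's links are not reproduced here, and image_url_from_href is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlsplit


def image_url_from_href(href, param="imgurl"):
    """Return the value of the given query parameter, or None if it is absent."""
    query = urlsplit(href).query
    values = parse_qs(query).get(param)
    return values[0] if values else None


if __name__ == "__main__":
    # Hypothetical hrefs: plain, with a trailing parameter, and percent-encoded.
    hrefs = [
        "http://example.com/viewer?imgurl=http://www.toshiba.fr/images/g1.jpg",
        "http://example.com/viewer?imgurl=http://www.toshiba.fr/images/g1.jpg&w=800",
        "http://example.com/viewer?imgurl=http%3A%2F%2Fwww.toshiba.fr%2Fimages%2Fg1.jpg",
    ]
    for href in hrefs:
        print(image_url_from_href(href))
```

One caveat: if the embedded image URL itself contains an unencoded &, parse_qs will cut it off at that point.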
And this code is for downloading and saving those images:
import os
import urllib
import urllib2
from bs4 import BeautifulSoup
url = "http://icecat.biz/p/toshiba/pscbxe-01t00een/satellite-pro-notebooks-4051528049077-Satellite+Pro+C8501GR-17732197.html"
html = urllib2.urlopen(url)
soup = BeautifulSoup(html)
imgs = soup.findAll("div", {"class":"thumb-pic"})
for img in imgs:
    imgUrl = img.a['href'].split("imgurl=")[1]
    urllib.urlretrieve(imgUrl, os.path.basename(imgUrl))
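The code above is Python 2 (urllib2 and urllib.urlretrieve). A rough Python 3 sketch of the download step follows, assuming the image URLs have already been collected. The function names and the "image" fallback file name are hypothetical. The file name is taken from the URL path only, so a query string does not end up in it.

```python
import os
import posixpath
from urllib.parse import urlsplit
from urllib.request import urlretrieve


def target_path(image_url, dest_dir="."):
    """Build a local path from the last segment of the URL path."""
    name = posixpath.basename(urlsplit(image_url).path) or "image"
    return os.path.join(dest_dir, name)


def download_all(image_urls, dest_dir="."):
    """Download each URL into dest_dir and return the list of saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for image_url in image_urls:
        path = target_path(image_url, dest_dir)
        urlretrieve(image_url, path)  # network call; raises on HTTP or connection errors
        saved.append(path)
    return saved
```

A call such as download_all(urls, "images") would save every file into an images folder. Files that share a name overwrite each other in this sketch.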