I am trying to pull a set of weave sample images from a site.
The objective is to create a dataset for a creative project.
Code and a screenshot of the site to scrape are included below.
Any pointers greatly appreciated, thank you.
'''
from bs4 import BeautifulSoup
import requests
import urllib.request
import shutil

url = "https://cdndrafts-01-2019.handweaving.net"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
aas = soup.find_all("right-padding", class_='img')

image_info = []
for a in aas:
    image_tag = a.findChildren('img')
    image_info.append((image_tag[0]['src'], image_tag[0]['alt']))

def download_image(image):
    response = requests.get(image[0], stream=True)
    realname = ''.join(e for e in image[1] if e.isalnum())
    file = open("C://cdnddrafts{}/jpg".format(realname))
    response.raw.decode_conent = True
    shutil.copyfileobj(response.raw, file)
    del response

for i in range(0, len(image_info)):
    download_image(image_info[i])
'''
[screenshot: one of the images to scrape]
This is a pretty in-depth guide on how to do this; take a look. Some parts are redundant and you can leave them out.
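In the meantime, your snippet has a few fixable bugs: find_all() takes a tag name first (you pass the class name where the tag goes), the output file is opened without write mode and with a malformed path, and decode_conent is a typo for decode_content. A minimal corrected sketch, assuming the drafts sit in containers with class "right-padding" (I can't verify the site's markup from here):

from bs4 import BeautifulSoup
import requests
import shutil

url = "https://cdndrafts-01-2019.handweaving.net"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

image_info = []
# assumption: each draft sits in a container with class "right-padding"
for a in soup.find_all(class_="right-padding"):
    img = a.find("img")
    if img is not None:
        # resolve relative src values against the page URL
        image_info.append((requests.compat.urljoin(url, img["src"]), img.get("alt", "")))

def download_image(src, alt):
    response = requests.get(src, stream=True)
    realname = "".join(e for e in alt if e.isalnum()) or "untitled"
    # binary write mode, a valid path, and .jpg as the extension (the folder must exist)
    with open("C:/cdndrafts/{}.jpg".format(realname), "wb") as f:
        response.raw.decode_content = True  # note the corrected spelling
        shutil.copyfileobj(response.raw, f)

for src, alt in image_info:
    download_image(src, alt)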
I'm trying to create a program that scrapes a site for images using bs4. The site contains two types of images, low quality ones and high quality ones. The high quality files are named the same as their low quality versions, but contain the word "website" at the end, before the .png. I'd like to only download the "website" files. Here's what I tried.
from bs4 import BeautifulSoup
import requests
URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')
images = soup.find_all('img')
resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))
for image in resolvedURLs:
    if not image.endswith("Website.png"):
        continue
        if image.endswith("Website.png"):
            webs = requests.get(image)
            open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)
I don't get any error messages, but no files download. Any tips?
You are only checking if it ends with "Website.png" after you have already established that it doesn't. Better not to even check if it doesn't:
from bs4 import BeautifulSoup
import requests
URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')
images = soup.find_all('img')
resolvedURLs = []
for image in images:
    src = image.get('src')
    resolvedURLs.append(requests.compat.urljoin(URL, src))
for image in resolvedURLs:
    if image.endswith("Website.png"):
        webs = requests.get(image)
        open('scraped_images/' + image.split('/')[-1], 'wb').write(webs.content)
Actually, using a list comprehension you can make your code less procedural and prevent mistakes of this sort in the future:
from bs4 import BeautifulSoup
import requests
from requests.compat import urljoin
URL = "https://www.ssbwiki.com/Category:Head_icons_(SSBU)"
getURL = requests.get(URL, headers={"User-Agent":"Mozilla/5.0"})
soup = BeautifulSoup(getURL.text, 'html.parser')
image_urls = [urljoin(URL, image.get('src')) for image in soup.find_all('img')]
# let's make this one a generator so we don't keep too many downloaded
# images in memory
images = (requests.get(url) for url in image_urls if url.endswith("Website.png"))
for image in images:
    # use the context manager so the files are closed after write
    # (image is a Response here, so take the filename from image.url)
    with open('scraped_images/' + image.url.split('/')[-1], 'wb') as f:
        f.write(image.content)
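One caveat with all of the snippets above: they assume a scraped_images/ folder already exists next to the script, and open() will raise FileNotFoundError if it doesn't. A one-line guard, if needed:

import os

os.makedirs('scraped_images', exist_ok=True)  # create the output folder if it's missing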
I want to download videos from a website.
Here is my code.
Every time I run this code, it returns an empty result.
Here is the live code: https://colab.research.google.com/drive/19NDLYHI2n9rG6KeBCiv9vKXdwb5JL9Nb?usp=sharing
from bs4 import BeautifulSoup
import requests
url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")
page = url.content
soup = BeautifulSoup(page, "html.parser")
#print(soup.prettify())
result = soup.find_all('video', class_="video-player")
print(result)
Using regex:
import requests
import re
response = requests.get("....../video/20000kCCF0")
videoId = '20000kCCF0'
videos = re.findall(r'https://[^"]+' + videoId + '[^"]+mp4', response.text)
print(videos)
You always get an empty result because soup.find_all() doesn't find anything.
Maybe you should inspect the url.content you receive by hand and then decide what to look for with find_all().
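For example, a quick way to eyeball what the server actually returned (variable names follow the question's code, where url is the Response object):

# dump the raw HTML so you can search it by hand for the video URL
with open("page_dump.html", "w", encoding="utf-8") as f:
    f.write(url.text)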
EDIT: After digging around a bit, I found out how to get the content_url_orig:
from bs4 import BeautifulSoup
import requests
import json
url = requests.get("https://www.mxtakatak.com/xt0.3a7ed6f84ded3c0f678638602b48bb1b840bea7edb3700d62cebcf7a400d4279/video/20000kCCF0")
page = url.content
soup = BeautifulSoup(page, "html.parser")
result = str(soup.find_all('script')[1]) #looking for script tag inside the html-file
result = result.split('window._state = ')[1].split("</script>']")[0].split('\n')[0]
#separating the json from the whole script-string, digged around in the file to find out how to do it
result = json.loads(result)
#navigating in the json to get the video-url
entity = list(result['entities'].items())[0][1]
download_url = entity['content_url_orig']
print(download_url)
Funny sidenote: if I read the JSON correctly, you can find download URLs for all the videos the creator uploaded :)
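If so, a hedged sketch to collect them all, reusing the key names from the snippet above (untested beyond this page):

# walk every entity in the page state and keep any original-content URL
all_urls = [e['content_url_orig'] for e in result['entities'].values()
            if 'content_url_orig' in e]
print(all_urls)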
I'm learning about Python's requests library so that I can automatically download some images through their links.
But the images that I'm trying to download are behind Cloudflare, so I get ERROR 1020 Access Denied.
Here's my code:
import requests
from bs4 import BeautifulSoup
# -------------------------------------------------------------------------------------------------------
response = requests.get("https://main_link").text
soup = BeautifulSoup( response , 'html.parser')
for i, link in enumerate(soup.find_all('img')): # getting all image elements
    l = link.get('src') # image link -> https://link/link/image.jpg
    img_data = requests.get(l).content
    with open(f'Test{i}.png', 'wb') as f:
        f.write(img_data)
I looked at some resources like this StackOverflow question
which says to use cfscrape
And this is my code:
import requests
import cfscrape
from bs4 import BeautifulSoup
# ------------------------------------------------------------------------------------------------------
scraper = cfscrape.create_scraper()
response = scraper.get("https://main_link").text
soup = BeautifulSoup( response , 'html.parser')
for i, link in enumerate(soup.find_all('img')):
    l = link.get('src') # https://link/link/image.jpg
    img_data = scraper.get(l).content
    with open(f'Test{i}.png', 'wb') as f:
        f.write(img_data)
But I still get the 1020 error.
I even tried the cloudscraper library, but that does not work either.
I've looked at other resources but can't seem to understand what to do.
Any help is appreciated.
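For completeness, the cloudscraper attempt looked essentially the same; here is a minimal sketch of that variant, assuming the same loop as above (cloudscraper.create_scraper() returns a requests-compatible session):

import cloudscraper
from bs4 import BeautifulSoup

# cloudscraper's session object is a drop-in replacement for requests
scraper = cloudscraper.create_scraper()
response = scraper.get("https://main_link").text
soup = BeautifulSoup(response, 'html.parser')

for i, link in enumerate(soup.find_all('img')):
    img_data = scraper.get(link.get('src')).content
    with open(f'Test{i}.png', 'wb') as f:
        f.write(img_data)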
I'm trying to scrape images from a site using the beautifulsoup HTML parser.
There are 2 kinds of image tags for each image on the site. One is for the thumbnail, and the other is the bigger-size image that only appears after I click on the thumbnail and expand it. The bigger-size tag contains a class="expanded-image" attribute.
I'm trying to parse through the HTML and get the "src" attribute of the expanded image, which contains the source for the image.
When I try to execute my code, nothing happens. It just says the process finished without scraping any images. But when I don't try to filter and just give the tag as an argument, it downloads all the thumbnails.
Here's my code:
import webbrowser, requests, os
from bs4 import BeautifulSoup
def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata('https://boards.4chan.org/a/thread/30814')
soup = BeautifulSoup(htmldata, 'html.parser')

list = []
for i in soup.find_all("img",{"class":"expanded-thumb"}):
    list.append(i['src'].replace("//","https://"))

def download(url, pathname):
    if not os.path.isdir(pathname):
        os.makedirs(pathname)
    filename = os.path.join(pathname, url.split("/")[-1])
    response = requests.get(url, stream=True)
    with open(filename, "wb") as f:
        f.write(response.content)

for a in list:
    download(a,"file")
You might be running into a problem using "list" as a variable name; it's a built-in type in Python. Start with this (replacing TEST_4CHAN_URL with whatever thread you want), incorporating my suggestion from the comment above.
import requests
from bs4 import BeautifulSoup
TEST_4CHAN_URL = "https://boards.4chan.org/a/thread/<INSERT_THREAD_ID_HERE>"
def getdata(url):
    r = requests.get(url)
    return r.text

htmldata = getdata(TEST_4CHAN_URL)
soup = BeautifulSoup(htmldata, "html.parser")

src_list = []
for i in soup.find_all("a", {"class":"fileThumb"}):
    src_list.append(i['href'].replace("//", "https://"))
print(src_list)
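From there, a hedged sketch of actually saving each full-size file, reusing the question's output folder name (the filename is just the last URL segment):

import os

os.makedirs("file", exist_ok=True)  # the same output folder the question used
for url in src_list:
    # name each file after the last segment of its URL
    with open(os.path.join("file", url.split("/")[-1]), "wb") as f:
        f.write(requests.get(url).content)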
I am trying to find downloadable video links on a website. For example, I am working with URLs like this one: https://www.loc.gov/item/2015669100/. You can see that there is an m3u8 video link under the mejs__mediaelement div tag.
However, my code is not printing anything, meaning that it's not finding the video URLs for the website.
My code is below
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
with open('pages2crawl.txt', 'r') as inFile:
    lines = [line.rstrip() for line in inFile]

for page in lines:
    req = Request(page, headers={'User-Agent': 'Mozilla/5.0'})
    soup = BeautifulSoup(urlopen(req).read(), 'html.parser')
    pages = soup.findAll('div', attrs={'class' : 'mejs__mediaelement'})
    for e in pages:
        video = e.find("video").get("src")
        if video.endswith("m3u8"):
            print(video)
If you just want to make a simple script, it would probably be easier to use regex.
import re, requests

s = requests.Session() # start the session
url = "https://www.loc.gov/item/2015669100/" # the page from the question
data = s.get(url) # http get request to download data
data = data.text # get the raw text
vidlinks = re.findall("src='(.*?).m3u8'/>", data) # find all text between the two delimiters
print(vidlinks[0] + ".m3u8") # print the full link with extension
You can use the CSS selector source[type="application/x-mpegURL"] to extract the MPEG-stream link (or source[type="video/mp4"] to extract the mp4 link):
import requests
from bs4 import BeautifulSoup
url = "https://www.loc.gov/item/2015669100/"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
link_mpeg = soup.select_one('source[type="application/x-mpegURL"]')["src"]
link_mp4 = soup.select_one('source[type="video/mp4"]')["src"]
print(link_mpeg)
print(link_mp4)
Prints:
https://tile.loc.gov/streaming-services/iiif/service:afc:afc2010039:afc2010039_crhp0001:afc2010039_crhp0001_mv04/full/full/0/full/default.m3u8
https://tile.loc.gov/storage-services/service/afc/afc2010039/afc2010039_crhp0001/afc2010039_crhp0001_mv04.mp4
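From there, a hedged sketch of saving the mp4 itself, streamed so a large video isn't buffered entirely in memory:

import requests

# stream the response body in chunks rather than loading the whole file at once
with requests.get(link_mp4, stream=True) as r:
    r.raise_for_status()
    with open(link_mp4.split("/")[-1], "wb") as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)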