Downloading Images using BS4

I have been trying to download images from a site using bs4. The images are not JPEG or PNG, so I suspect that bs4 is unable to find them, though I could be wrong about that.
Here's my code:
#--- IMAGE --- NOT WORKING
#Finds the image URL with the name of the product
#done
image = soup.find('img', attrs={'class': "image_container"})
try:
    image = image.get("src")
except AttributeError:
    print("NO image FOUND")
    image = "NO image FOUND"
if image != "NO image FOUND":  # if the image is found
    try:
        pos = image.index("?")
        image = "http:" + image[:pos]
    except ValueError:
        pass
    pathImg += name[:nameLength]  # Truncates to 5 characters and adds to pathImg file
    if generateFiles:
        download(image, pathImg)  # Downloads image
self.image = image  # Exporting var to class global var
Here's an image of where the source is on the page:
[image: source code for the image container]
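
Independent of how the tag is located, the URL clean-up in the snippet above (cutting at "?" and prefixing "http:") can be handed to the standard library. A sketch, assuming Python 3; `clean_image_url` and the example URLs are hypothetical names chosen for illustration:

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def clean_image_url(src, page_url):
    """Resolve src against the page URL and drop any query string or fragment."""
    # urljoin handles absolute, relative, and scheme-less ("//host/path") values.
    absolute = urljoin(page_url, src)
    parts = urlsplit(absolute)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


# Hypothetical example values:
print(clean_image_url("//cdn.example.com/a/b.webp?w=200", "https://example.com/p/1"))
# https://cdn.example.com/a/b.webp
```

Unlike the manual slicing, this also gives a usable URL when the src has no "?" in it.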

Related

Python Image extraction sequence from pdf

I am trying to extract images from a PDF using PyMuPDF (fitz). My PDF has multiple images on a single page, and I number the images sequentially as I save them. However, the images are not extracted in a consistent order: sometimes extraction starts from the bottom of the page, sometimes from the top, and so on. Is there a way to modify my code so that the extraction follows the order on the page?
The code I am using is given below:
import fitz
from PIL import Image
filename = "document.pdf"
doc = fitz.open(filename)
for i in range(len(doc)):
    img_num = 0
    p_no = i + 1  # 1-based page number
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        img_num += 1
        if pix.n - pix.alpha < 4:
            # grayscale or RGB: can be written directly
            pix.writeImage("%s-%s.jpg" % (str(p_no), str(img_num)))
        else:
            # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writeImage("%s-%s.jpg" % (str(p_no), str(img_num)))
            pix1 = None
        pix = None
Given below is a sample page of the pdf

I have the same problem. I used the following code:
import fitz
import io
from PIL import Image
file = "file_path"
pdf_file = fitz.open(file)
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on the given pdf page", page_index)
    for image_index, img in enumerate(image_list, start=1):
        print(img)
        print(image_index)
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(f"image{page_index+1}_{image_index}.{image_ext}")
The most likely fix is to locate each 'img' entry on the page and sort them accordingly.
I'd love to hear any further suggestions, or whether you have found a better idea or solution.
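
One way to act on the ordering idea above, as a sketch: PyMuPDF can report where an image is drawn on a page (recent versions document `Page.get_image_bbox(item)` for this; treat the exact call as an assumption for your installed version), so the entries can be sorted top-to-bottom, then left-to-right, before numbering them. The helper below is hypothetical and works on plain `(xref, (x0, y0, x1, y1))` tuples, so it does not depend on any particular PyMuPDF version. It assumes y grows downward from the top of the page.

```python
def sort_reading_order(images, row_tolerance=5.0):
    """Sort (xref, (x0, y0, x1, y1)) tuples top-to-bottom, then left-to-right.

    Images whose top edge lies within row_tolerance points of the first
    image in a row are treated as belonging to that row.
    """
    by_top = sorted(images, key=lambda item: item[1][1])
    rows = []  # list of (top_of_first_image, [items in this row])
    for item in by_top:
        top = item[1][1]
        if rows and abs(top - rows[-1][0]) <= row_tolerance:
            rows[-1][1].append(item)
        else:
            rows.append((top, [item]))
    ordered = []
    for _, row in rows:
        ordered.extend(sorted(row, key=lambda item: item[1][0]))
    return ordered


# Made-up positions: two images side by side, one further down the page.
images = [
    (7, (300, 100, 400, 200)),
    (3, (50, 102, 150, 200)),
    (9, (50, 400, 150, 500)),
]
print([xref for xref, _ in sort_reading_order(images)])  # [3, 7, 9]
```

The sorted xrefs can then be fed to the extraction loop in place of the raw image list.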

Add page title, img2pdf

I recently found a (wonderful) piece of Python software, img2pdf, that converts multiple images into a single PDF. After creating my first PDF I realized that the pages have no titles, which makes it difficult to identify the original image for each page (there are 400 of them). Does anyone know how I can add a page title?
Thanks in advance.

I looked for the same thing but ended up writing a Python program to solve it. I don't know if it helps you, but here is a solution nonetheless.
In Python I used PIL.Image and ImageDraw to go through all the images and draw the filename onto each of them. After that I used img2pdf as a Python library to generate the PDF.
It must be run in the same folder as the images.
import os
import traceback
import img2pdf
from PIL import Image, ImageDraw, ImageFont, ExifTags
# Enter the path to the font you want, 'fc-list' on ubuntu will get a list of fonts you can use.
#image_text_font = ImageFont.truetype('/Library/Fonts/Arial.ttf', 15)
image_text_font = ImageFont.truetype("/usr/share/fonts/truetype/dejavu/DejaVuSansMono.ttf", 32)

# Tags the images with 'file name' in the upper left corner
def tag_images():
    for file in os.listdir('.'):
        if file.endswith(".jpg") and str(file + "_tagged.jpg") not in os.listdir('.') and not file.endswith("_tagged.jpg"):
            one_image = check_and_adjust_rotation(Image.open(file))
            one_image_draw = ImageDraw.Draw(one_image)
            # Add textbox to image
            # Note: textsize() and getoffset() were removed in Pillow 10; textbbox() replaces them there.
            size = one_image_draw.textsize(file, font=image_text_font)
            offset = image_text_font.getoffset(file)
            one_image_draw.rectangle((10, 10, 10 + size[0] + offset[0], 10 + size[1] + offset[1]), fill='white', outline='black')
            # Add text to image
            one_image_draw.text((10, 10), file, font=image_text_font, fill='black')
            # Save tagged image
            one_image.save(file + "_tagged.jpg")
            print(f'Tagged and saved "{file}_tagged.jpg".')

# Generate the PDF
def generate_pdf_from_multiple_images():
    with open("output.pdf", "wb") as f:
        f.write(img2pdf.convert([image_file for image_file in os.listdir('.') if image_file.endswith("_tagged.jpg")]))

# Use exif information about rotation to apply proper rotation to the image
def check_and_adjust_rotation(image):
    try:
        # Find the numeric EXIF tag id whose name is 'Orientation'
        for orientation in ExifTags.TAGS.keys():
            if ExifTags.TAGS[orientation] == 'Orientation':
                break
        exif = dict(image._getexif().items())
        print(exif[orientation])
        if exif[orientation] == 3:
            image = image.rotate(180, expand=True)
        elif exif[orientation] == 6:
            image = image.rotate(270, expand=True)
        elif exif[orientation] == 8:
            image = image.rotate(90, expand=True)
    except Exception:
        # Images without EXIF data end up here and are returned unrotated
        traceback.print_exc()
    return image

def main():
    tag_images()
    generate_pdf_from_multiple_images()

if __name__ == '__main__':
    main()

Change color scheme when extracting an image from PDF in Python

I am trying to read an image from a PDF, following this post:
Extract images from PDF without resampling, in python?
So far I have managed to get the image file out of the PDF, but it uses a CMYK color space and the picture comes out messed up.
My code is the following:
import PyPDF2
import struct
pdf_filename = 'document.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
page = cond_scan_reader.getPage(4)
xObject = page['/Resources']['/XObject'].getObject()
for obj in xObject:
    print(xObject[obj])
    if xObject[obj]['/Subtype'] == '/Image':
        if xObject[obj]['/Filter'] == '/DCTDecode':
            data = xObject[obj]._data
            img = open("image" + ".jpg", "wb")
            img.write(data)
            img.close()
pdf_file.close()
The problem is that the colors of the saved image are all wrong, and I believe it's because of the color space. The console prints the following:
{'/Type': '/XObject', '/Subtype': '/Image', '/Width': 1122, '/Height': 502, '/Interpolate': <PyPDF2.generic.BooleanObject object at 0x1061574a8>, '/ColorSpace': '/DeviceCMYK', '/BitsPerComponent': 8, '/Filter': '/DCTDecode'}
As you can see, the ColorSpace is CMYK, and I believe that's why the colors of the image are weird.
That's the image I have:
This is the original image (it is inside a pdf file):
Can anyone help me?
Thanks in advance.
Israel

A CMYK-mode JPEG image contained in a PDF must be inverted.
But PIL does not support inverting a CMYK-mode image,
so I solved it using numpy.
The full source is at the link below.
https://github.com/Gaia3D/pdfImageExtractor/blob/master/extrectImage.py
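import numpy as np
from PIL import Image
# img is the CMYK-mode PIL image opened from the extracted JPEG data
imgData = np.frombuffer(img.tobytes(), dtype='B')
invData = np.full(imgData.shape, 255, dtype='B')
invData -= imgData  # invert every channel value: 255 - value
img = Image.frombytes(img.mode, img.size, invData.tobytes())
img.save(outFileName + ".jpg")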
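
The numpy lines above compute 255 minus each channel byte. A dependency-free sketch of the same arithmetic, assuming `raw` holds the bytes returned by `img.tobytes()`; the helper name `invert_channels` is made up for illustration:

```python
# Lookup table mapping each byte value v to 255 - v.
_INVERT_TABLE = bytes(255 - value for value in range(256))


def invert_channels(raw):
    """Return a copy of raw with every byte replaced by 255 - byte."""
    return raw.translate(_INVERT_TABLE)


# One CMYK pixel (C, M, Y, K) stored as four bytes.
pixel = bytes([0, 64, 255, 10])
print(list(invert_channels(pixel)))  # [255, 191, 0, 245]
```

The result can be passed to `Image.frombytes(img.mode, img.size, ...)` in the same way as `invData.tobytes()`.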

Use custom scrapy imagePipeline to download images and overwrite existing images

I am practising using Scrapy to crop images with a custom ImagesPipeline.
I am using this code:
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def convert_image(self, image, size=None):
        if image.format == 'PNG' and image.mode == 'RGBA':
            # flatten transparency onto a white background
            background = Image.new('RGBA', image.size, (255, 255, 255))
            background.paste(image, image)
            image = background.convert('RGB')
        elif image.mode != 'RGB':
            image = image.convert('RGB')
        if size:
            image = image.copy()
            image.thumbnail(size, Image.ANTIALIAS)
        else:
            # cut water image TODO use defined image replace Not cut
            x, y = image.size
            if y > 120:
                image = image.crop((0, 0, x, y - 25))
        buf = StringIO()
        try:
            image.save(buf, 'JPEG')
        except Exception as ex:
            raise ImageException("Cannot process image. Error: %s" % ex)
        return image, buf
It works well, but there is one problem.
If the original images already exist in the folder when I run the spider,
the images it downloads won't replace them.
How can I get it to overwrite the original images?

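There is an expiration setting (IMAGES_EXPIRES in Scrapy's settings), and by default it is 90 days, so images downloaded within that period are not downloaded again.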
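
If the expiration setting meant here is Scrapy's `IMAGES_EXPIRES` (the images pipeline skips files it stored within that many days), then lowering it in the project settings should make the spider fetch and rewrite existing files. A minimal sketch, assuming a standard `settings.py`; the value 0 is an illustrative choice, not something stated in the answer:

```python
# settings.py (sketch): number of days a stored image counts as fresh.
# With 0, a previously stored file should be treated as expired and fetched again.
IMAGES_EXPIRES = 0
```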

Python StringIO memory leak

I have a Python program that slows to a crawl over time. I've tested thoroughly and narrowed it down to a method that downloads an image. The method uses cStringIO and urllib. The problem may also be some sort of infinite download with urllib (the program just freezes after a few hundred downloads).
Any thoughts on where the issue may be?
foundImages = []
images = soup.find_all('img')
print('downloading Images')
for imageTag in images:
    gc.collect()
    url = None
    try:
        #load image into a file to determine size and width
        url = imageTag.attrs['src']
        imgFile = StringIO(urllib.urlopen(url).read())
        im = Image.open(imgFile)
        width, height = im.size
        #if width and height are both above a threshold, it is a valid image
        #so add to recipe images
        if width > self.minOptimalWidth and height > self.minOptimaHeight:
            image = MIImage({})
            image.originalUrl = url.encode('ascii', 'ignore')
            image.width = width
            image.height = height
            foundImages.append(image)
        imgFile = None
        im = None
    except Exception:
        print('failed image download url: ' + url)
        traceback.print_exc()
        continue
#set the main image to be the first in the array
if len(foundImages) > 0:
    first = foundImages[0]
    recipe.imageUrl = first.originalUrl
return foundImages
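
One way to test the "infinite download" hypothesis is to give each request a timeout and cap how much is read. A sketch in Python 3 (the question's code uses the Python 2 `urllib.urlopen`); `read_capped`, `fetch_image_bytes`, and the limits are illustrative assumptions, not part of the original program:

```python
import io
import urllib.request


def read_capped(stream, max_bytes=5_000_000, chunk_size=64 * 1024):
    """Read a file-like object in chunks; raise ValueError past max_bytes."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise ValueError("response larger than %d bytes" % max_bytes)
        chunks.append(chunk)
    return b"".join(chunks)


def fetch_image_bytes(url, timeout=10):
    """Download url, failing instead of hanging when the socket stalls."""
    # timeout is in seconds and applies to each blocking socket operation.
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return read_capped(response)


# read_capped works on any file-like object, so it can be tried offline:
print(len(read_capped(io.BytesIO(b"x" * 1000), max_bytes=2000)))  # 1000
```

If the freezes turn into timeout or size errors caught by the existing `except` block, that points at the download rather than at StringIO.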
