Python StringIO memory leak

I have a Python program that slows to a crawl over time. I've tested thoroughly and narrowed the problem down to a method that downloads images. The method uses cStringIO and urllib. The problem may also be some sort of infinite download with urllib (the program just freezes after a few hundred downloads).
Any thoughts on where the issue may be?
foundImages = []
images = soup.find_all('img')
print('downloading Images')
for imageTag in images:
    gc.collect()
    url = None
    try:
        #load image into a file to determine size and width
        url = imageTag.attrs['src']
        imgFile = StringIO(urllib.urlopen(url).read())
        im = Image.open(imgFile)
        width, height = im.size
        #if width and height are both above a threshold, it is a valid image
        #so add to recipe images
        if width > self.minOptimalWidth and height > self.minOptimaHeight:
            image = MIImage({})
            image.originalUrl = url.encode('ascii', 'ignore')
            image.width = width
            image.height = height
            foundImages.append(image)
        imgFile = None
        im = None
    except Exception:
        print('failed image download url: ' + url)
        traceback.print_exc()
        continue
#set the main image to be the first in the array
if len(foundImages) > 0:
    first = foundImages[0]
    recipe.imageUrl = first.originalUrl
return foundImages
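
One thing worth ruling out before blaming StringIO: the download above is made without a timeout and the response object is never explicitly closed, so a single stalled server could plausibly account for the freeze. The following is a minimal sketch of a bounded download, not a confirmed fix. It assumes Python 3 (`urllib.request.urlopen` instead of the Python 2 `urllib.urlopen` used in the question), and `fetch_bytes`, `max_bytes` and the injectable `opener` parameter are hypothetical names introduced only for illustration.

```python
import io
from contextlib import closing
from urllib.request import urlopen


def fetch_bytes(url, timeout=10.0, max_bytes=5 * 1024 * 1024, opener=urlopen):
    """Download at most max_bytes from url and return them as a BytesIO.

    Hypothetical helper: the timeout bounds how long a stalled server can
    block the loop, and the size cap guards against an endless response.
    """
    # closing() guarantees response.close() runs even if read() raises
    with closing(opener(url, timeout=timeout)) as response:
        # read one byte more than allowed so an oversized body is detectable
        data = response.read(max_bytes + 1)
    if len(data) > max_bytes:
        raise ValueError("response larger than %d bytes: %s" % (max_bytes, url))
    return io.BytesIO(data)
```

In the loop from the question, the returned buffer would take the place of `imgFile` before it is passed to `Image.open`. The `opener` argument exists only so the helper can be exercised without network access.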

Related

Downloading Images using BS4

I have been trying to download images from a site using bs4. The images are not JPEG or PNG, so I think bs4 is unable to find them, though I could be wrong about that.
Here's my code:
#--- IMAGE --- NOT WORKING
#Finds the image URL with the name of the product
#done
image = soup.find('img', attrs={'class':"image_container"})
try:
    image = image.get("src")
except AttributeError:
    print("NO image FOUND")
    image = "NO image FOUND"
if(image != "NO image FOUND"): #if the image is found
    try:
        pos = image.index("?")
        image = "http:" + image[:pos]
    except ValueError:
        pass
    pathImg += name[:nameLength] # Truncates to 5 characters and adds to pathImg file
    if(generateFiles):
        download(image, pathImg) # Downloads image
self.image = image # Exporting var to class global var
Here's an image of where the source is on the page:
[Image: source code for the image container]

Python Image extraction sequence from pdf

I am trying to extract images from a PDF using PyMuPDF (fitz). My PDF has multiple images on a single page, and I maintain a sequence number while saving the images. However, the images are not extracted in a consistent order: sometimes extraction starts from the bottom of the page, sometimes from the top, and so on. Is there a way to modify my code so that the extraction follows a proper sequence?
The code I am using is given below:
import fitz
from PIL import Image
filename = "document.pdf"
doc = fitz.open(filename)
for i in range(len(doc)):
    img_num = 0
    p_no = 1
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n - pix.alpha < 4:
            img_num += 1
            pix.writeImage("%s-%s.jpg" % (str(p_no),str(img_num)))
        else:
            img_num += 1
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writeImage("%s-%s.jpg" % (str(p_no),str(img_num)))
            pix1 = None
        pix = None
    p_no += 1
[Image: sample page of the PDF]

I have the same problem. I've used the following code:
import fitz
import io
from PIL import Image
file = "file_path"
pdf_file = fitz.open(file)
for page_index in range(len(pdf_file)):
    # get the page itself
    page = pdf_file[page_index]
    image_list = page.getImageList()
    # printing number of images found in this page
    if image_list:
        print(f"[+] Found {len(image_list)} images in page {page_index}")
    else:
        print("[!] No images found on the given pdf page", page_index)
    for image_index, img in enumerate(page.getImageList(), start=1):
        print(img)
        print(image_index)
        # get the XREF of the image
        xref = img[0]
        # extract the image bytes
        base_image = pdf_file.extractImage(xref)
        image_bytes = base_image["image"]
        # get the image extension
        image_ext = base_image["ext"]
        # load it to PIL
        image = Image.open(io.BytesIO(image_bytes))
        # save it to local disk
        image.save(open(f"image{page_index+1}_{image_index}.{image_ext}", "wb"))
The most likely fix is to find where each entry of the 'img' list sits on the page and order the entries by position.
I'd love to hear any further suggestions, or whether you have found a better idea or solution.
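
The idea of ordering the images by where they sit on the page can be sketched as a plain sort on bounding boxes. This is only an illustration under stated assumptions: it assumes each image's rectangle has already been obtained (recent PyMuPDF versions expose this through `Page.get_image_bbox`, which is not used in the code above), that coordinates follow PyMuPDF's top-left origin, and `reading_order` is a hypothetical helper name.

```python
def reading_order(placements):
    """Sort (xref, (x0, y0, x1, y1)) pairs top-to-bottom, then left-to-right.

    Assumes a top-left page origin, so a smaller y0 means higher on the page.
    """
    def top_left(item):
        _xref, (x0, y0, _x1, _y1) = item
        return (y0, x0)

    return sorted(placements, key=top_left)


# Example with made-up xrefs and rectangles
placements = [
    (7, (50, 400, 150, 500)),   # lower left
    (3, (300, 100, 400, 200)),  # upper right
    (5, (50, 100, 150, 200)),   # upper left
]
ordered_xrefs = [xref for xref, _bbox in reading_order(placements)]
```

Saving the images while iterating over the sorted list, rather than over the list as returned by the library, would then give sequence numbers that follow the visual layout.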

I'm resizing an Image via PILLOW on my flask backend, but when it gets to the line of code .resize, it fails?

I'm trying to resize an uploaded file. So far I am confident that the image is loaded properly and that the Pillow Image object is created. The request runs through my resizing function, but it always stops at the .resize call.
When I run the code on my desktop (not on a server), the resize works, but when I combine the resizing function with an image uploaded via POST, it fails with a 500 error. What's going on?
I added print imageThumbnail.size right after the imageResizer call and got AttributeError: 'NoneType' object has no attribute 'size'
def imageResizer(im, pixellimit):
    width, height = im.size
    if width > height:
        #Land scape mode. Scale to width.
        aspectRatio = float(height)/float(width)
        Scaledwidth = pixellimit
        Scaledheight = int(round(Scaledwidth * aspectRatio))
        newSize = (Scaledwidth, Scaledheight)
    elif height > width:
        #Portrait mode, Scale to height.
        aspectRatio = float(width)/float(height)
        Scaledheight = pixellimit
        Scaledwidth = int(round(Scaledheight * aspectRatio))
        newSize = (Scaledwidth, Scaledheight)
    #FAILS RIGHT HERE... I double checked by writing print flags all over, and it so happens nothing past this line gets written
    imageThumbnail = im.resize(newSize)
    return imageThumbnail
Here's the relevant portion of the Flask code.
file = request.files['file']
location = str(args['lat']) + str(args['lon'])
location = location.replace('.','_')
GUID = datetime.strftime(datetime.now(), '%Y%m%d%H%M%S') + location
datetimeEntry = datetime.strftime(datetime.now(), '%Y-%m-%d %H:%M:%S')
fullFileName = GUID + '.' + file.filename.rsplit('.', 1)[1]
if file and allowed_file(file.filename):
    filename = secure_filename(file.filename)
    image = Image.open(file)
    imageThumbnail = imageResizer(image, 800)
    #NOTHING PAST THIS POINT GETS EXECUTED
    imageThumbnailName = GUID + "thumb" + '.' + file.filename.rsplit('.', 1)[1]
    imageThumbnailName.save(os.path.join(app.config['UPLOAD_FOLDER'], imageThumbnailName))
    file.save(os.path.join(app.config['UPLOAD_FOLDER_LARGE_IMAGES'], fullFileName))

The problem is that you are trying to open:
file = request.files['file']
image = Image.open(file)
That file is not an actual file, but some metadata object with upload information. What you should do instead is:
image = Image.open(file.stream)
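
Separately from the upload issue, the `imageResizer` function in the question never assigns `newSize` when the width equals the height, so a square image would fail at the `.resize` line. A minimal sketch of the size calculation that covers all three cases follows; `scaled_size` is a hypothetical helper name, and the result is meant to be passed to `im.resize(...)`.

```python
def scaled_size(width, height, pixel_limit):
    """Return (new_width, new_height) with the longer side equal to pixel_limit.

    The aspect ratio is preserved; square images scale both sides equally.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    scale = pixel_limit / float(max(width, height))
    # never let rounding collapse a side to zero pixels
    new_width = max(1, int(round(width * scale)))
    new_height = max(1, int(round(height * scale)))
    return (new_width, new_height)
```

Scaling by the longer side replaces the separate landscape and portrait branches, which is what removes the unhandled square case.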

Use custom scrapy imagePipeline to download images and overwrite existing images

I am practising using Scrapy to crop images with a custom ImagesPipeline.
I am using this code:
class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)
    def convert_image(self, image, size=None):
        if image.format == 'PNG' and image.mode == 'RGBA':
            background = Image.new('RGBA', image.size, (255, 255, 255))
            background.paste(image, image)
            image = background.convert('RGB')
        elif image.mode != 'RGB':
            image = image.convert('RGB')
        if size:
            image = image.copy()
            image.thumbnail(size, Image.ANTIALIAS)
        else:
            # cut water image TODO use defined image replace Not cut
            x,y = image.size
            if(y>120):
                image = image.crop((0,0,x,y-25))
        buf = StringIO()
        try:
            image.save(buf, 'JPEG')
        except Exception, ex:
            raise ImageException("Cannot process image. Error: %s" % ex)
        return image, buf
It works well but has one problem. If the original images are already in the folder when I run the spider, the images it downloads do not replace the originals.
How can I get it to overwrite the original images?

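There is an expiration setting (IMAGES_EXPIRES in Scrapy's images pipeline), which defaults to 90 days. Images that were downloaded more recently than that are not downloaded again.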
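
If the expiry check is indeed what keeps the existing files in place, one option is to set the expiration to zero days in the project's settings.py, so that every stored image counts as expired. This is a sketch under that assumption, not a verified fix for the pipeline above.

```python
# settings.py (sketch): treat every stored image as expired so it is fetched again
IMAGES_EXPIRES = 0  # in days; Scrapy's default is 90
```

The trade-off is that every run re-downloads every image, which may be acceptable while experimenting but is wasteful for large crawls.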

Converting an UploadedFile to PIL image in Django

I'm trying to check an image's dimensions before saving it. I don't need to change the image, just make sure it fits my limits.
Right now, I can read the file, and save it to AWS without a problem.
output['pic file'] = request.POST['picture_file']
conn = myproject.S3.AWSAuthConnection(aws_key_id, aws_key)
filedata = request.FILES['picture'].read()
content_type = 'image/png'
conn.put(
    bucket_name,
    request.POST['picture_file'],
    myproject.S3.S3Object(filedata),
    {'x-amz-acl': 'public-read', 'Content-Type': content_type},
)
I need to add a step in the middle that makes sure the file has the right width and height. My file isn't coming from a form that uses ImageField, and all the solutions I've seen use that.
Is there a way to do something like this?
img = Image.open(filedata)

image = Image.open(file)
# Get the image size in pixels (size is an attribute, not a method).
(width, height) = image.size
# Check the width and height against your limits and resize if needed.
image = image.resize((width_new, height_new))

I've done this before but I can't find my old snippet, so here it is off the top of my head:
picture = request.FILES.get('picture')
img = Image.open(picture)
#check sizes .... probably using img.size and then resize
#resave if necessary
imgstr = StringIO()
img.save(imgstr, 'PNG')
# rewind the buffer before reading it back
imgstr.seek(0)
filedata = imgstr.read()

The code below creates the image from the request, as you want:
from PIL import ImageFile
def image_upload(request):
    for f in request.FILES.values():
        p = ImageFile.Parser()
        while 1:
            s = f.read(1024)
            if not s:
                break
            p.feed(s)
        im = p.close()
        im.save("/tmp/" + f.name)
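
Since the question only needs the dimensions and declares the upload as image/png, one alternative that avoids decoding the image at all is to read the width and height straight from the PNG header. This is a swapped-in technique, not what the answers above do, and it works for PNG data only. `png_dimensions` is a hypothetical helper name, and `data` stands for the raw bytes the question already reads into `filedata`.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_dimensions(data):
    """Return (width, height) read from the IHDR chunk of raw PNG bytes."""
    # 8 signature bytes + 4 length + 4 chunk type + 4 width + 4 height
    if len(data) < 24 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    _length, chunk_type, width, height = struct.unpack(">I4sII", data[8:24])
    if chunk_type != b"IHDR":
        raise ValueError("PNG does not start with an IHDR chunk")
    return width, height
```

A check such as comparing the returned pair against the allowed limits could then run before the bytes are handed to the upload call. For other formats, opening the bytes with PIL as in the answers above remains the general route.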
