Displaying lots of png files to a webpage - python

Hopefully this question won't be asking for too much and is understandable, but any help would be amazing. Currently I am doing [astronomy] research, and I am required to construct a webpage of quasar spectra to look like this... Sample of final product
This is to be done by downloading each individual spectrum from this source: https://data.sdss.org/sas/dr13/eboss/spectro/redux/images/v5_9_0/v5_9_0/3590-55201/.
The problem is, I am struggling to find a way to download large quantities of png files all at once. For some reason, none of the spectra at this link have their coordinates (right ascension and declination) in the file name, whereas the files used by the example code provided to me do.
If I have the png "00:14:53.206-09:12:17.70-4536-55857-0770.png" downloaded, it should be displayed. However, as mentioned before, none of the files I have viewed when trying to do this myself include the coordinates. My page shows only raw code and no actual images. It stays that way because it cannot load the spectra, since they are not downloaded, and I would prefer to have them sorted by their coordinates.
Downloading a FITS file which contains the quasar catalog was suggested to me. Presumably, the coordinates would in some way have to be added to the names of the downloaded png files. Apparently this is all supposed to be easy.
In summary: how do I download large quantities of png files whose names do not include their coordinates? I also need a method of renaming the image files so that their file names correspond to the coordinates, and then of displaying them on a webpage.

When displaying images on a website (regardless of where you sourced the images from, or the format - jpg/png etc), it is advisable that you COMPRESS your images. This is especially true where the images are big and where there are a number of images on the page (pages like yours!). There are online image compressors like tinypng (where you can upload ~30 images at a time, and it compresses both jpgs and pngs) and command-line tools like pngcrush.
Compressing images this way will reduce the file size (greatly in some cases) while the image appears the same. This will very much improve the load time on your site.
When you download a file (any file, not just an image file), you can save it under any name you want, so you can rename the files on download. You will need to upload all the [preferably compressed] images to a web server in order to display them on a webpage. If you don't know ANY web scripting, start with learning basic html (you won't need a lot for this project), but the best way to display the images would probably be to loop through the image folder using either javascript or php.
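Since the question is tagged python, the rename-on-download and folder loop can also be sketched in Python instead of javascript or php. This is only a minimal sketch under assumptions: the helper names (coordinate_filename, download_as, build_gallery_html) are made up for illustration, and the coordinates, plate, MJD and fiber values are assumed to come from the quasar catalog mentioned in the question; how to read them from the FITS file is not shown.

```python
import html
import urllib.parse
import urllib.request
from pathlib import Path


def coordinate_filename(ra, dec, plate, mjd, fiber):
    """Build a name in the style of the question's example file.

    ra and dec are assumed to be preformatted strings taken from the catalog;
    dec carries its own sign. Colons are not allowed in Windows file names.
    """
    return f"{ra}{dec}-{plate}-{mjd}-{fiber:04d}.png"


def download_as(url, dest_dir, new_name):
    """Download url and save it under new_name (rename on download)."""
    dest = Path(dest_dir) / new_name
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest


def build_gallery_html(image_dir, title="Quasar spectra"):
    """Loop through a folder of PNGs and return a simple HTML page as text."""
    names = sorted(p.name for p in Path(image_dir).glob("*.png"))
    tags = "\n".join(
        # quote() escapes the colons so the browser treats src as a relative path
        f'<img src="{urllib.parse.quote(name)}" alt="{html.escape(name)}">'
        for name in names
    )
    return (
        "<!DOCTYPE html>\n"
        f"<html><head><title>{html.escape(title)}</title></head>\n"
        f"<body>\n{tags}\n</body></html>\n"
    )
```

Writing the returned text to an index.html file inside the image folder gives a page that shows the spectra sorted by file name, which for names built this way means sorted by the right ascension string.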

Related

Searching Information at PDFs

I am a beginner in Python and programming, and I am trying to solve a problem.
I have a list of keywords. I want to look for these keywords in some folders which contain a lot of PDFs. The PDFs are not character based, they are image based (they contain text as images). In other words, the PDFs were scanned with a scanner in the first decade of the 2000s. So I cannot search for a word in the PDF file, and I could not use Windows search etc. I can only check with my own eyes, which is time consuming and boring.
I researched the question on the internet and found some solutions. Based on these solutions, I tried to write some code in Python. It worked, but the success rate is a bit low.
Firstly, my code converts the PDF file to image files (PyMuPDF package).
Secondly, my code reads the text in these images and builds the text as a string (PIL, pytesseract packages).
Finally, the code searches for the keywords in this text and returns True if a keyword is found.
Example:
keyword_list = ["a-001", "b-002", "c-003"]
pdf_list = ["a.pdf", "b.pdf", "c.pdf", ...., "z.pdf"]
The code should find a-001 in the a.pdf file, because I checked with my own eyes and a.pdf contains a-001. The code actually found it.
The code should find b-002 in the b.pdf file too, because I checked with my own eyes and b.pdf contains b-002. The code could not find it.
So my code's success rate is 50%. When it finds something, it finds the right PDF file; I have no problem with that. A found PDF really contains what I am looking for. But sometimes it cannot detect a keyword in a PDF file where I can see it clearly.
Do you have any better idea to solve this problem more accurately? I am not chasing a 100% success rate; that is impossible, because some PDFs contain handwriting. But most of them contain printed text, and that should be detected. Can I raise the success rate to 75%?
Your best chance is to extract the image with the highest possible resolution, which might mean not "converting the PDF to an image" but rather "parsing the PDF and extracting the image stream" (given it was scanned in the 2000s, it is probably a TIFF stream, at that). This is an example using PyMuPDF.
You can perhaps try to further improve the image by adjusting brightness and contrast and by using filters such as "despeckling". With poorly scanned images I have had good results with sharpening filters, and there are some filters ("erode" and "washout") that might improve poor typewriting (I remember some "e"s where the eye of the "e" was almost completely dark, and they were easily mistaken for "c"s).
Then train Tesseract to improve the recognition rate. I am not sure how this can be done with the Python interface, though.
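The first two suggestions can be sketched roughly as follows. This is a sketch under assumptions, not a tested pipeline: the function names are made up for illustration, the filter choices (autocontrast, a 3x3 median filter as a stand-in for despeckling, then sharpening) are only one plausible combination, and the whitespace-insensitive matching in find_keywords is an extra assumption that is not part of the answer above. Some embedded image formats may not be readable by PIL, and training Tesseract is not covered.

```python
import io
import re


def extract_embedded_images(pdf_path):
    """Yield each image stored in the PDF at its native resolution."""
    import fitz  # PyMuPDF
    from PIL import Image

    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for info in page.get_images(full=True):
                xref = info[0]  # first item is the image's cross-reference number
                data = doc.extract_image(xref)["image"]
                yield Image.open(io.BytesIO(data))
    finally:
        doc.close()


def clean_for_ocr(image):
    """Grayscale, stretch contrast, despeckle and sharpen a scanned page."""
    from PIL import ImageFilter, ImageOps

    gray = ImageOps.autocontrast(ImageOps.grayscale(image))
    gray = gray.filter(ImageFilter.MedianFilter(size=3))  # removes isolated specks
    return gray.filter(ImageFilter.SHARPEN)


def normalize(text):
    """Lowercase and drop all whitespace so OCR spacing errors matter less."""
    return re.sub(r"\s+", "", text).lower()


def find_keywords(text, keywords):
    """Return the keywords that occur in the OCR text after normalization."""
    haystack = normalize(text)
    return [k for k in keywords if normalize(k) in haystack]


def search_pdf(pdf_path, keywords):
    """OCR every embedded image of one PDF and report which keywords appear."""
    import pytesseract

    text = "\n".join(
        pytesseract.image_to_string(clean_for_ocr(image))
        for image in extract_embedded_images(pdf_path)
    )
    return find_keywords(text, keywords)
```

Dropping whitespace before matching can occasionally join two neighbouring words into a false match, so it trades a few false positives for fewer misses.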

Extract images from a pdf as pdfs

I need to find a tool (python, adobe suite, some cmd line utility, etc) that can extract images from a PDF as individual PDF files - not jpegs, pngs, etc.
Does such a thing exist? Seems like there is a bunch of stuff out there for extracting image files to png, jpeg, etc, but nothing for extracting the images as PDFs. A strange request I know.
I am working with a large set of PDFs that contain images in all kinds of different formats: bitmaps, vectors, etc. If there was some way to programmatically pull out images as PDFs it would save me a lot of time.
Right now I am selecting a portion of the page in the PDF in acrobat pro, choosing to edit in illustrator, and then saving as PDF.
Very time consuming.
Any ideas?
You could use poppler's pdfimages utility to extract all bitmap images as-is from a PDF. In a second step, you can convert these bitmaps back to PDFs. img2pdf seems like a good candidate for this.
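The two steps could be chained from Python roughly like this. It is a sketch that assumes poppler's pdfimages is on the PATH and the img2pdf package is installed; the helper names and the "img" output prefix are made up for illustration. Only bitmap images are covered, and some formats that pdfimages can write with -all (for example JBIG2 or CCITT streams) may not be accepted by img2pdf.

```python
import subprocess
from pathlib import Path


def pdfimages_command(pdf_path, prefix):
    """Command line that extracts every bitmap in its native format (-all)."""
    return ["pdfimages", "-all", str(pdf_path), str(prefix)]


def images_to_pdfs(pdf_path, work_dir):
    """Extract the bitmaps of one PDF, then wrap each one in its own PDF."""
    import img2pdf

    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdfimages_command(pdf_path, work_dir / "img"), check=True)

    outputs = []
    # pdfimages names its output img-000.<ext>, img-001.<ext>, ...
    for image in sorted(work_dir.glob("img-*")):
        target = image.with_suffix(".pdf")
        target.write_bytes(img2pdf.convert(str(image)))
        outputs.append(target)
    return outputs
```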

Custom file structure to save multiple images in python

I am experimenting with packaging data, and since most of my data is stored as images, graphs and similar data, I am looking for a more efficient way to store these images.
I have read about saving them in a DB as blobs, while others are more inclined to save them in the file system, but what I would like is for the images not to be visible outside the application. This is essential because when I run analysis on instruments, I am not interested in showing users all the images, only the ones related to their particular instrument.
Plus it is convenient to pack data in one single file, compared to a folder with 20-30 images in it.
I was thinking of storing the images in a custom structure, a sort of bin file, using python, unless there is something that already covers that functionality. In my search I didn't notice any specific structure for saving images; the most common solutions were either a folder in the file system or the DB approach.
If you can convert your images to raster arrays, you can store them in an HDF5 file: Add raster image to HDF5 file using h5py

Integration testing and images

I'm writing an app that converts different images to JPG. It operates over a complex directory structure. There, a directory may include other directories, image files (JPG, GIF, PNG, TIFF), PDF files, RAR/ZIP archives, which in turn may include anything of the above. The app finds everything that can be converted to an image and places the resulting JPGs into a separate folder.
How do I write integration tests to test the conversion of images? Specifically, how should I fake the complex directory structure with all the files?
Currently I just store a sample directory structure, which I manually assembled out of various image, PDF and archive files, in a tests/ directory. In a setUp method I put this sample directory in place of the actual data and run the code. I had an idea to generate all these sample files myself (generate JPGs via Imagemagick, for example), but it proved hard.
How is integration testing on images usually done?
Do you write your own library to convert images, or do you just use an existing library? In the latter case you simply do not test it. Its author has already tested it somehow. You just need to create an abstraction layer between your code and the image library you use. Then you can simply check that your code calls the library with the desired parameters.
If you really insist on testing pictures, then you need to make the transformation deterministic (and compare the actual result with the expected result) or you need to make the comparison a bit less strict (anything from ignoring date fields to running OCR on the image).
Testing files is way easier (you do not need probability-based OCR). Check that your program placed all files in the expected location.
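A minimal sketch of that idea using only the standard library. Everything here is hypothetical: convert_tree stands in for the app's directory walker, the convert argument is the abstraction layer over the image library, and the sample files hold placeholder bytes rather than real images. Archives are created but not unpacked in this sketch.

```python
import tempfile
import zipfile
from pathlib import Path
from unittest import mock

IMAGE_SUFFIXES = {".jpg", ".gif", ".png", ".tiff"}


def convert_tree(root, out_dir, convert):
    """Call convert(src, dst) for every image file found under root."""
    out_dir = Path(out_dir)
    converted = []
    for src in sorted(Path(root).rglob("*")):
        if src.is_file() and src.suffix.lower() in IMAGE_SUFFIXES:
            dst = out_dir / (src.stem + ".jpg")
            convert(src, dst)  # the only point that touches the image library
            converted.append(dst)
    return converted


def make_sample_tree(root):
    """Generate a small nested structure instead of storing it in tests/."""
    root = Path(root)
    (root / "nested" / "deeper").mkdir(parents=True)
    (root / "a.png").write_bytes(b"placeholder")
    (root / "nested" / "b.gif").write_bytes(b"placeholder")
    (root / "nested" / "deeper" / "notes.txt").write_text("not an image")
    with zipfile.ZipFile(root / "nested" / "archive.zip", "w") as archive:
        archive.writestr("c.tiff", b"placeholder")


def run_check():
    """Walk a generated tree with a fake converter and report what was called."""
    with tempfile.TemporaryDirectory() as tmp:
        src_root = Path(tmp) / "in"
        src_root.mkdir()
        make_sample_tree(src_root)
        fake_convert = mock.Mock()
        result = convert_tree(src_root, Path(tmp) / "out", fake_convert)
        called = sorted(call.args[0].name for call in fake_convert.call_args_list)
        return called, [path.name for path in result]
```

Because the converter is a mock, the check never decodes an image; it only verifies which files were passed on and where the results would be placed.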

Can you reduce memory consumption by ReportLab when embedding very large images, or is there a Python PDF toolkit that can?

Right now reportlab is making PDFs most of the time. However when one file gets several large images (125 files with a total on disk size of 7MB), we end up running out of memory and crashing trying to build a PDF that should ultimately be smaller than 39MB. The problem stems from:
elif mode not in ('L','RGB','CMYK'):
    im = im.convert('RGB')
    self.mode = 'RGB'
Here, nice b&w (bitonal) images are converted to RGB, and when the images are around 2595x3000 pixels they consume a lot of memory. (Not sure why they consume 2GB, but that point is moot.) When we add them to reportlab our entire python memory footprint is about 50MB, but when we call
doc.build(elements, canvasmaker=canvasmaker)
Memory usage skyrockets as we go from bitonal PNGs to RGB and then render them onto the page.
While I try to see if I can figure out how to inject bitonal images into reportlab PDFs, I thought I would see if anyone else had an idea of how to fix this problem either in reportlab or with another tool.
We have a working PDF maker using PODOFO in C++, one of my possible solutions is to write a script/outline for that tool that will simply generate the PDF in a subprocess and then return that via a file or stdout.
Short of redoing PIL you are out of luck. The Images are converted internally in PIL to 24 bit color TIFs. This is not something you can easily change.
We switched to Podofo and generate the PDF outside of python.
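A rough estimate of why the RGB conversion is so costly. This is only back-of-the-envelope arithmetic, and it assumes that all 125 images are 2595x3000 pixels and that they are all held uncompressed in memory at the same time, neither of which is confirmed above.

```python
WIDTH, HEIGHT, IMAGE_COUNT = 2595, 3000, 125

pixels = WIDTH * HEIGHT
rgb_bytes = pixels * 3        # 24-bit colour: 3 bytes per pixel
bitonal_bytes = pixels // 8   # 1 bit per pixel, packed 8 pixels per byte

print(f"one RGB image:     {rgb_bytes / 1e6:.1f} MB")
print(f"one bitonal image: {bitonal_bytes / 1e6:.2f} MB")
print(f"all images as RGB: {IMAGE_COUNT * rgb_bytes / 1e9:.2f} GB")
```

Under these assumptions each page grows from under 1 MB to about 23 MB once converted to RGB, and 125 such pages add up to roughly 2.9 GB, the same order of magnitude as the 2GB reported in the question.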
