I have a simple loop to download a large number of images (1.5 million). The images themselves are small, but I estimated that the total size will be about 250 GB, which is too much for my internal HDD.
I got an external HDD, but even though the code runs without errors, the designated image folder stays empty!
I tried the same code with a directory on my internal HDD and it works fine, slowly retrieving the images. Interestingly, the code reads the .csv file from the external HDD, so reading does not seem to be the problem.
Any idea what I could do?
import os
import pandas as pd
import urllib.request

# change paths and dependencies:
file_name = "ID_with_image_links.csv"
file_path = "/Volumes/Extreme SSD/"
path_for_images = "/Volumes/Extreme SSD/images"

os.chdir(file_path)
df = pd.read_csv(file_name)
total_len = len(df)

os.chdir(path_for_images)
df = df.head(10)  # this is for try-out

n = 1
for index, row in df.iterrows():
    id = str(row['ID'])
    im_num = str(row["Image Number"])
    link = str(row["Links"])
    urllib.request.urlretrieve(link, id + "_" + im_num + ".jpg")
    print("Image", n, "of", total_len, "downloaded")
    n = n + 1
Try setting the directory writable. I figure you are using macOS?
You can set the directory's permissions to read/write with chmod 777 "/Volumes/Extreme SSD/images/" in the terminal as root (the path needs quoting because of the space, and a directory also needs the execute bit to be entered).
At least on BSD (and macOS is based on that), mounting an external drive is read-only by default, IIRC.
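Before kicking off the full download, a quick sanity check from Python itself can confirm whether the folder is actually writable. This is just a diagnostic sketch, reusing the image folder path from the question:
import os

path_for_images = "/Volumes/Extreme SSD/images"

# Check that the folder exists and that the mounted volume is writable
print("exists:  ", os.path.isdir(path_for_images))
print("writable:", os.access(path_for_images, os.W_OK))

# Try an actual write, since os.access can be misleading on some mounts
test_file = os.path.join(path_for_images, "write_test.txt")
try:
    with open(test_file, "w") as f:
        f.write("ok")
    os.remove(test_file)
    print("write test passed")
except OSError as e:
    print("write test failed:", e)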
I need to save images as .tif files. I cannot install a new Python package (university-controlled computer, with Anaconda), so I decided to use the Tifffile writer available in skimage.external.
Here is the trouble:
The following code works nicely and saves the 3 channels of my image:
from skimage.external.tifffile import TiffWriter
import numpy as np

path = '26.01.2021 NLFK CPV timeseries A3B10 imaged 07.04.2021/26.01.2021 NLFK CPV timeseries A3B10 imaged 07.04.2021/Processed'
result_img2 = np.zeros((3, 512, 512))
res = (6.006634765625e-08, 6.006634765625e-08)
raw_b = str('abcdefg')
meta = {}

for i, c in enumerate(result_img2):
    with TiffWriter(path + '/' + 'test_C' + str(i) + '.tif') as tif:
        tif.save(np.array(c, dtype='uint8'), resolution=res, description=raw_b, metadata=meta)
But I just cannot keep my images named 'test'!
However, doing it like this:
filename = "07.04.2021 NLFK nucleolin r488 A3B10 m546 DAPI mock Preview 2_nucleus1_"
with TiffWriter(path + '/' + filename + str(i) + '.tif') as tif:
    tif.save(np.array(c, dtype='uint8'), resolution=res, description=raw_b, metadata=meta)
results in an Errno 2 error.
I tried to check whether this filename contains forbidden characters (based on another post):
keepcharacters = (' ','.','_')
filename = "".join(c for c in filename if c.isalnum() or c in keepcharacters).rstrip()
I also tried encoding it with filename.encode('utf8'), and that failed too (again, Errno 2).
The name indicates the condition of each experiment and changes every time, so I need to save the files under these names.
After a quick check, I can save the file at the root of the .py script, but I need the files to be saved in their respective folders.
Any idea where my trouble could be?
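One quick diagnostic (just a sketch, reusing path, filename, and the loop index i from the code above) is to print the exact path being written and check that the target folder actually exists, since Errno 2 usually means some part of the path does not exist:
import os

# path, filename and i as defined in the question's code
full_path = os.path.join(path, filename + str(i) + '.tif')

print(repr(full_path))       # show the exact path, including any hidden characters
print(os.path.isdir(path))   # the target folder must already exist
print(len(full_path))        # very long paths/names can also trigger errors on some systems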
I have a dataset of 100,000+ .IMG files that I need to convert to .PNG / .JPG format to apply a CNN for a simple classification task.
I referred to this answer and the solution works for me partially. What I mean is that some images are not properly converted. The reason for that, according to my understanding, is that some images have a pixel depth of 16 while some have 8.
import re
from PIL import Image

for file in fileList:
    rawData = open(file, 'rb').read()
    size = re.search(r"(LINES = \d\d\d\d)|(LINES = \d\d\d)", str(rawData))
    pixelDepth = re.search(r"(SAMPLE_BITS = \d\d)|(SAMPLE_BITS = \d)", str(rawData))
    size = str(size)[-6:-2]
    pixelDepth = str(pixelDepth)[-4:-2]
    print(int(size))
    print(int(pixelDepth))
    imgSize = (int(size), int(size))
    img = Image.frombytes('L', imgSize, rawData)
    img.save(str(file) + '.jpg')
Data Source: NASA Messenger Mission
.IMG files and their corresponding converted .JPG Files
Files with Pixel Depth of 8 are successfully converted:
Files with Pixel Depth of 16 are NOT properly converted:
Please let me know if there's any more information that I should provide.
Hopefully, from my other answer, here, you now have a better understanding of how your files are formatted. So, the code should look something like this:
#!/usr/bin/env python3
import sys
import re
import numpy as np
from PIL import Image
import cv2
rawData = open('EW0220137564B.IMG', 'rb').read()
# File size in bytes
fs = len(rawData)
bitDepth = int(re.search(r"SAMPLE_BITS\s+=\s+(\d+)", str(rawData)).group(1))
bytespp = int(bitDepth/8)
height = int(re.search(r"LINES\s+=\s+(\d+)", str(rawData)).group(1))
width = int(re.search(r"LINE_SAMPLES\s+=\s+(\d+)", str(rawData)).group(1))
print(bitDepth, height, width)

# Offset from start of file to image data - assumes image at tail end of file
offset = fs - (width*height*bytespp)

# Check bitDepth
if bitDepth == 8:
    na = np.frombuffer(rawData, offset=offset, dtype=np.uint8).reshape(height, width)
elif bitDepth == 16:
    dt = np.dtype(np.uint16)
    dt = dt.newbyteorder('>')
    # Scale 16-bit data down to 8-bit rather than truncating to the low byte
    na = (np.frombuffer(rawData, offset=offset, dtype=dt).reshape(height, width) >> 8).astype(np.uint8)
else:
    print(f'ERROR: Unexpected bit depth: {bitDepth}', file=sys.stderr)

# Save either with PIL
Image.fromarray(na).save('result.jpg')

# Or with OpenCV, which may be faster
cv2.imwrite('result.jpg', na)
If you have thousands to do, I would recommend GNU Parallel which you can easily install on your Mac with homebrew using:
brew install parallel
You can then change my program above to accept a filename as a parameter in place of the hard-coded filename, and the command to get them all done in parallel is:
parallel --dry-run script.py {} ::: *.IMG
For a bit more effort, you can get it done even faster by putting the code above in a function and calling the function for each file specified as a parameter. That way you avoid starting a new Python interpreter per image, and you can tell GNU Parallel to pass as many files as possible to each invocation of your script, like this:
parallel -X --dry-run script.py ::: *.IMG
The structure of the script then looks like this:
def processOne(filename):
    # open, read, search, extract, save as per my code above
    ...

# Main - process all filenames received as parameters
for filename in sys.argv[1:]:
    processOne(filename)
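Put together, a sketch of such a script (simply combining the answer's code above, with the output name derived from the input name) might look like this:
#!/usr/bin/env python3
import sys
import re
import numpy as np
from PIL import Image

def processOne(filename):
    rawData = open(filename, 'rb').read()
    fs = len(rawData)
    bitDepth = int(re.search(r"SAMPLE_BITS\s+=\s+(\d+)", str(rawData)).group(1))
    bytespp = int(bitDepth / 8)
    height = int(re.search(r"LINES\s+=\s+(\d+)", str(rawData)).group(1))
    width = int(re.search(r"LINE_SAMPLES\s+=\s+(\d+)", str(rawData)).group(1))
    # Assume the image data sits at the tail end of the file
    offset = fs - (width * height * bytespp)
    if bitDepth == 8:
        na = np.frombuffer(rawData, offset=offset, dtype=np.uint8).reshape(height, width)
    elif bitDepth == 16:
        dt = np.dtype(np.uint16).newbyteorder('>')
        na = (np.frombuffer(rawData, offset=offset, dtype=dt).reshape(height, width) >> 8).astype(np.uint8)
    else:
        print(f'ERROR: Unexpected bit depth in {filename}: {bitDepth}', file=sys.stderr)
        return
    # Name the output after the input, e.g. EW0220137564B.IMG -> EW0220137564B.IMG.jpg
    Image.fromarray(na).save(filename + '.jpg')

# Main - process all filenames received as parameters
for filename in sys.argv[1:]:
    processOne(filename)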
I have the following code to create multiple JPGs from a single multi-page PDF. However, I get the following error: wand.exceptions.BlobError: unable to open image '{uuid}.jpg': No such file or directory # error/blob.c/OpenBlob/2841, even though the image has been created. I initially thought it might be a race condition, so I put in a time.sleep(), but that didn't work either, so I don't believe that's it. Has anyone seen this before?
def split_pdf(pdf_obj, step_functions_client, task_token):
    print(time.time())
    read_pdf = PyPDF2.PdfFileReader(pdf_obj)
    images = []
    for page_num in range(read_pdf.numPages):
        output = PyPDF2.PdfFileWriter()
        output.addPage(read_pdf.getPage(page_num))
        generateduuid = str(uuid.uuid4())
        filename = generateduuid + ".pdf"
        outputfilename = generateduuid + ".jpg"
        with open(filename, "wb") as out_pdf:
            output.write(out_pdf)  # write to local instead
        image = {"page": str(page_num + 1)}  # Start at 1 rather than 0
        create_image_process = subprocess.Popen(["gs", "-o " + outputfilename, "-sDEVICE=jpeg", "-r300", "-dJPEGQ=100", filename], stdout=subprocess.PIPE)
        create_image_process.wait()
        time.sleep(10)
        with Image(filename=outputfilename) as img:
            image["image_data"] = img.make_blob('jpeg')
            image["height"] = img.height
            image["width"] = img.width
        images.append(image)
        if hasattr(step_functions_client, 'send_task_heartbeat'):
            step_functions_client.send_task_heartbeat(taskToken=task_token)
    return images
It looks like you aren't passing in a value when you try to open the PDF in the first place - hence the error you are receiving.
Make sure you format the string with the full file path as well, e.g. f'/path/to/file/{uuid}.jpg' or '/path/to/file/{}.jpg'.format(uuid)
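A minimal sketch of that suggestion, assuming a hypothetical output directory (not in the original code), would build the filenames as absolute paths before handing them to Ghostscript and Wand:
import os
import uuid

# Hypothetical output directory - adjust to wherever the JPGs should actually live
output_dir = "/tmp/pdf_split"
os.makedirs(output_dir, exist_ok=True)

generateduuid = str(uuid.uuid4())
filename = os.path.join(output_dir, generateduuid + ".pdf")
outputfilename = os.path.join(output_dir, generateduuid + ".jpg")

# Pass the flag and its value as separate arguments so Ghostscript sees a clean path
gs_args = ["gs", "-o", outputfilename, "-sDEVICE=jpeg", "-r300", "-dJPEGQ=100", filename]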
I don't really understand why you're using PyPDF2, Ghostscript, and Wand. You're not parsing/manipulating any PostScript, and Wand sits on top of ImageMagick, which sits on top of Ghostscript. You might be able to reduce the function down to one PDF utility.
def split_pdf(pdf_obj, step_functions_client, task_token):
    images = []
    with Image(file=pdf_obj, resolution=300) as document:
        for index, page in enumerate(document.sequence):
            image = {
                "page": index + 1,
                "height": page.height,
                "width": page.width,
            }
            with Image(page) as frame:
                image["image_data"] = frame.make_blob("JPEG")
            images.append(image)
            if hasattr(step_functions_client, 'send_task_heartbeat'):
                step_functions_client.send_task_heartbeat(taskToken=task_token)
    return images
I initially thought it may be a race condition so I put in a time.sleep() but that didn't work either so I don't believe that's it. Has anyone seen this before?
The example code doesn't have any error handling. PDFs can be generated by many software vendors, and a lot of them do a sloppy job. It's more than possible that PyPDF or Ghostscript failed, and you never got a chance to handle this.
For example, when I use Ghostscript for PDFs generated by a random website, I often see the following message on stderr...
ignoring zlib error: incorrect data check
... which results in incomplete documents, or blank pages.
Another common example is that system resources have been exhausted and no additional memory can be allocated. This happens all the time with web servers, and the solution is usually to migrate the task over to a queue worker that can shut down cleanly at the end of each completed task.
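As a sketch of what minimal error handling around the Ghostscript call from the question could look like (reusing the filename and outputfilename variables from that code; the exact messages Ghostscript emits will vary):
import subprocess

result = subprocess.run(
    ["gs", "-o", outputfilename, "-sDEVICE=jpeg", "-r300", "-dJPEGQ=100", filename],
    capture_output=True,
    text=True,
)

# A non-zero return code or anything on stderr deserves a look before trusting the JPEG
if result.returncode != 0:
    raise RuntimeError(f"Ghostscript failed ({result.returncode}): {result.stderr}")
if result.stderr:
    print("Ghostscript warnings:", result.stderr)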
I'm trying to load two memory-mapped files,
temp = numpy.load(currentDirectory + "\\tmp\\temperature.npy", mmap_mode='r')
salinity = numpy.load(currentDirectory + "\\tmp\\salinity.npy", mmap_mode='r')
but Python throws the following error:
IOError: Failed to interpret file 'C:\\my\\file\\path\\..\\tmp\\salinity.npy' as a pickle
When I load either by itself, it works just fine.
The files are quite large (~500 MB), but otherwise I don't think they are notable.
What might the problem be here?
This works for me. Both files are larger than 5 GB.
X = np.load(os.path.join(path, '_file1.npy'), mmap_mode='r')
Y = np.load(os.path.join(path, '_file2.npy'), mmap_mode='r')
Which operating system are you using? The problem is not with the size of the .npy files; the problem is with the "\" in the path. Change your path to something like:
path = '/media/gtx1060/DATA/Datasets'
I am running a query to select a polygon from a set of polygons. Then I input that polygon into a feature dataset in a geodatabase. I then dissolve this polygon (or set of polygons) to get the boundary and the centroid of the polygon(s), which are also entered into separate feature datasets in the geodatabase.
import arcpy, os

# Specify the drive you have stored the NCT_GIS folder on
drive = arcpy.GetParameterAsText(0)

arcpy.env.workspace = (drive + ":\\NCT_GIS\\DATA\\RF_Properties.gdb")
arcpy.env.overwriteOutput = True

lot_DP = arcpy.GetParameterAsText(1).split(';')
PropertyName = arcpy.GetParameterAsText(2)

queryList = []
for i in range(0, len(lot_DP)):
    if i % 2 == 0:
        lt = lot_DP[i]
        DP = lot_DP[i+1]
        query_line = """( "LOTNUMBER" = '{0}' AND "PLANNUMBER" = {1} )""".format(lt, DP)
        queryList.append(query_line)
        if i < len(lot_DP):
            queryList.append(" OR ")
del queryList[len(queryList)-1]
query = ''.join(queryList)

# Feature dataset for lot file
RF_Prop = drive + ":\\NCT_GIS\\DATA\\RF_Properties.gdb\\Lots\\"
# Feature dataset for the property boundary
RF_Bound = drive + ":\\NCT_GIS\\DATA\\RF_Properties.gdb\\Boundary\\"
# Feature dataset for the property centroid
RF_Centroid = drive + ":\\NCT_GIS\\DATA\\RF_Properties.gdb\\Centroid\\"

lotFile = drive + ":\\NCT_GIS\\DATA\\NSWData.gdb\\Admin\\cadastre"

arcpy.MakeFeatureLayer_management(lotFile, "lot_lyr")
arcpy.SelectLayerByAttribute_management("lot_lyr", "NEW_SELECTION", query)

# Create lot polygons in feature dataset
arcpy.CopyFeatures_management("lot_lyr", RF_Prop + PropertyName)

# Create property boundary in feature dataset
arcpy.Dissolve_management(RF_Prop + PropertyName, RF_Bound + PropertyName, "", "", "SINGLE_PART", "DISSOLVE_LINES")

# Create property centroid in feature dataset
arcpy.FeatureToPoint_management(RF_Bound + PropertyName, RF_Centroid + PropertyName, "CENTROID")
Every time I run this I get an error when trying to add anything to the geodatabase, EXCEPT when copying the lot layer into the geodatabase.
I have tried not copying the lots into the geodatabase and copying them into a shapefile and then using that, but the boundary and centroid still will not import into the geodatabase. I tried outputting the boundaries into shapefiles and then using the FeatureClassToGeodatabase tool, but I still get error after error.
If anyone can shed light on this, I would be grateful.
In my experience, I found that if I had recently opened and then closed ArcMap or ArcCatalog, it would leave two ArcGIS services running (check Task Manager) even though I had closed ArcMap and ArcCatalog. If I tried running a script while these two services were running, I would get this error. Finding these services in Windows Task Manager and ending them fixed the error for me. The two services were:
ArcGIS cache manager
ArcGIS online services
I've also heard that your computer's security/anti-virus software can interfere with scripts running, so adding your working directory as an exception to your security software may also help.
On the rare occasions when this didn't work, I just had to restart the computer.