Undersampling with image data in Python

The main idea of undersampling is to randomly delete samples from the class that has plenty of observations, so that the ratio between the classes in the data becomes less skewed.
So, how do I undersample image data in Python? Please help me :(
I took the fundus image data from Kaggle. There are 35127 images in 5 classes:
class 0: 25810 images,
class 1: 2443 images,
class 2: 5292 images,
class 3: 873 images,
class 4: 708 images.
I want each class to have only 708 images, matching class 4. How do I delete the rest of the images in Python?

I know it is an old question but for the sake of people looking for the answer, this code works perfectly:
import os
import random

path = r'C:/The_Path'  # You can provide the path here
n = 2500  # Number of random images to be removed
img_names = os.listdir(path)  # Get image names in folder
img_names = random.sample(img_names, n)  # Pick n random images
for image in img_names:  # Go over each image name to be deleted
    f = os.path.join(path, image)  # Create valid path to image
    os.remove(f)  # Remove the image
As your question states, you want all classes to be equal to class 4, i.e. 708 images. Simply find the difference and replace n; for example, the difference between the number of class 3 images and 708 is 873 - 708 = 165, so n = 165. Furthermore, you can turn this into a function to generalise it.
The code is adapted, with edits, from:
How can i delete multiple images from multiple folders using python
https://stackoverflow.com/users/10512332/vikrant-sharma answered that question.
Thank you!
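As a side note (not part of the original answer), a minimal sketch of that generalisation could look like this; the function name undersample_folder, the per-class folder layout, and the target of 708 images are assumptions for illustration:
import os
import random

def undersample_folder(folder, target=708, seed=None):
    """Randomly delete images in `folder` until only `target` remain."""
    rng = random.Random(seed)
    img_names = os.listdir(folder)
    n_excess = len(img_names) - target
    if n_excess <= 0:
        return  # already at or below the target, nothing to delete
    for image in rng.sample(img_names, n_excess):
        os.remove(os.path.join(folder, image))

# Example usage, assuming one sub-folder per class, e.g. C:/The_Path/class_0, ...
# for cls in ['class_0', 'class_1', 'class_2', 'class_3']:
#     undersample_folder(os.path.join(r'C:/The_Path', cls), target=708)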

Related

How to calculate unnested watersheds in GRASS GIS?

I am running into a few issues using the GRASS GIS module r.accumulate while running it in Python. I use the module to calculate sub-watersheds for over 7000 measurement points. Unfortunately, the output of the algorithm is nested, so all sub-watersheds overlap each other. Running the r.accumulate sub-watershed module takes roughly 2 minutes for either one or multiple points; I assume the bottleneck is loading the direction raster.
I was wondering if there is an unnested variant available in GRASS GIS and, if not, how to overcome the bottleneck of loading the direction raster every time r.accumulate is called. Below is a code snippet of what I have tried so far (resulting in a nested variant):
import grass.script as gs
from grass.pygrass.vector import VectorTopo

locations = VectorTopo('locations', mapset='PERMANENT')
locations.open('r')
points = []
for i in range(len(locations)):
    points.append(locations.read(i+1).coords())
for j in range(0, len(points), 255):
    output = "watershed_batch_{}@Watersheds".format(j)
    gs.run_command("r.accumulate", direction='direction@PERMANENT', subwatershed=output, overwrite=True, flags="r", coordinates=points[j:j+255])
    gs.run_command('r.stats', flags="ac", input=output, output="stat_batch_{}.csv".format(j), overwrite=True)
Any thoughts or ideas are very welcome.
I already replied to your email, but now I see your Python code and better understand your "overlapping" issue. In this case, you don't want to feed individual outlet points one at a time. You can just run
r.accumulate direction=direction@PERMANENT subwatershed=output outlet=locations
r.accumulate's outlet option can handle multiple outlets and will generate non-overlapping subwatersheds.
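For reference, a rough Python equivalent of that single call via grass.script could look like the following; the map and vector names are taken from the question, and the output name subwatersheds is a placeholder:
import grass.script as gs

# One r.accumulate call with the outlet vector produces
# non-overlapping subwatersheds for all outlets at once.
gs.run_command(
    "r.accumulate",
    direction="direction@PERMANENT",
    subwatershed="subwatersheds",
    outlet="locations",
    overwrite=True,
)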
The answer provided via email was very useful. To share it, I have provided the code below to do an unnested subwatershed calculation. A small remark: I had to feed the coordinates in batches, as the list of coordinates exceeded the maximum command-line length Windows can handle.
Thanks to @Huidae Cho, the call to r.accumulate to calculate subwatersheds and the longest flow path can now be done in one call instead of two separate calls.
The output is unnested basins, where the larger subwatersheds are separated from the smaller subbasins instead of being split up into them. This has to do with the fact that the output is in raster format, where each cell can only represent one basin.
gs.run_command('g.mapset', mapset='Watersheds')
gs.run_command('g.region', rast='direction@PERMANENT')
StationIds = list(gs.vector.vector_db_select('locations_snapped_new', columns='StationId')["values"].values())
XY = list(gs.vector.vector_db_select('locations_snapped_new', columns='x_real,y_real')["values"].values())
for j in range(0, len(XY), 255):
    output_ws = "watershed_batch_{}@Watersheds".format(j)
    output_lfp = "lfp_batch_{}@Watersheds".format(j)
    output_lfp_unique = "lfp_unique_batch_{}@Watersheds".format(j)
    gs.run_command("r.accumulate", direction='direction@PERMANENT', subwatershed=output_ws, flags="ar", coordinates=XY[j:j+255], lfp=output_lfp, id=StationIds[j:j+255], id_column="id", overwrite=True)
    gs.run_command("r.to.vect", input=output_ws, output=output_ws, type="area", overwrite=True)
    gs.run_command("v.extract", input=output_lfp, where="1 order by id", output=output_lfp_unique, overwrite=True)
To export the unique watersheds I used the following code. I had to convert the longest flow path to points, as some of the longest flow paths intersected with the corner boundary of the neighbouring watershed and were thus not fully within their own subwatershed. See the image below, where the red line (longest flow path) touches the subwatershed boundary:
[image: red longest flow path touching the boundary of the adjacent subwatershed]
import numpy as np

gs.run_command('g.mapset', mapset='Watersheds')
lfps = gs.list_grouped('vect', pattern='lfp_unique_*')['Watersheds']
ws = gs.list_grouped('vect', pattern='watershed_batch*')['Watersheds']
files = np.stack((lfps, ws)).T
#print(files)
for file in files:
    print(file)
    ids = list(gs.vector.vector_db_select(file[0], columns="id")["values"].values())
    for idx in ids:
        idx = int(idx[0])
        expr = f'id="{idx}"'
        gs.run_command('v.extract', input=file[0], where=expr, output="tmp_lfp", overwrite=True)
        gs.run_command("v.to.points", input="tmp_lfp", output="tmp_lfp_points", use="vertex", overwrite=True)
        gs.run_command('v.select', ainput=file[1], binput="tmp_lfp_points", output="tmp_subwatersheds", overwrite=True)
        gs.run_command('v.db.update', map="tmp_subwatersheds", col="value", value=idx)
        gs.run_command('g.mapset', mapset='vector_out')
        gs.run_command('v.dissolve', input="tmp_subwatersheds@Watersheds", output="subwatersheds_{}".format(idx), col="value", overwrite=True)
        gs.run_command('g.mapset', mapset='Watersheds')
        gs.run_command("g.remove", flags="f", type="vector", name="tmp_lfp,tmp_subwatersheds")
I ended up with a vector for each subwatershed.

Loading a high number of images into memory and saving a pickle

I have a problem...
I have a dataset with 1200 cases, 30 classes per case, and 160 images per class. The images are grayscale ndarrays with float64 dtype.
I would like to slice each case so that I keep only 30 images from each class, put them in a dictionary whose first key is the case name and whose second key is the class name, and then save the whole dictionary to a pickle.
But I run out of memory every time.
import nibabel as nib  # path and path_save are assumed to be pathlib.Path objects defined elsewhere

brain_all = []
for dir in path.iterdir():
    brain_sample = {}
    path_dir = path_save / dir.name
    try:
        path_dir.mkdir(parents=True, exist_ok=False)
    except FileExistsError:
        print('Folder is already there')
    for file in dir.iterdir():
        sample = nib.load(file).get_fdata()[:, :, 75:105]
        if 'flair' in file.name:
            brain_sample['flair'] = sample
        elif 't1ce' in file.name:
            brain_sample['t1ce'] = sample
    brain_all.append([file, brain_sample])
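The thread has no answer here, but purely as a sketch of the structure described above, one option is to write one pickle per case instead of keeping everything in brain_all, so the whole dataset never sits in memory at once. The directory layout, the slice of 30 images, the float32 cast, and the use of the filename stem as the class key are assumptions:
import pickle
import numpy as np
import nibabel as nib
from pathlib import Path

def save_case(case_dir: Path, out_dir: Path, n_slices: int = 30):
    """Build {case_name: {class_name: array}} for one case and pickle it on its own."""
    case_sample = {}
    for file in case_dir.iterdir():
        data = nib.load(file).get_fdata()[:, :, 75:75 + n_slices]
        # float32 halves the memory footprint of float64 (assumes the precision loss is acceptable)
        case_sample[file.stem] = data.astype(np.float32)
    with open(out_dir / f"{case_dir.name}.pkl", "wb") as f:
        pickle.dump({case_dir.name: case_sample}, f)

# for case_dir in path.iterdir():        # `path` as in the question
#     save_case(case_dir, path_save)     # one pickle per case keeps peak memory low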

Modifying a large number of DICOM (.dcm) files.

A bit of a simple question perhaps, but I am not making progress and would appreciate help.
I have a list of size 422. Within index 0 there are 135 file paths to .dcm images, for example '~/images/0001.dcm', '~/images/0135.dcm'. Within index 1 there are 112 image paths, index 2 has 110, and so on.
All images are of size 512 x 512. I am looking to re-size them to be 64 x 64.
This is my first time working with both image and .dcm data so I'm very unsure about how to resize. I am also unsure how to access and modify the files within the 'inner' list, if you will.
Is something like this way off the mark?
IMG_PX_SIZE = 64
result = []
for i in test_list:
    result_inner_list = []
    for image in i:
        # resize all images at index position i and store in list
        new_img = cv2.resize(np.array(image.pixel_array), (IMG_PX_SIZE, IMG_PX_SIZE))
        result_inner_list.append(new_img)
    # Once all images at index position i are modified, append them to a master list.
    result.append(result_inner_list)
You seem to be struggling with two issues:
accessing the file paths
resize
To make progress, it is better to separate these two tasks; sample code below:
IMG_PX_SIZE = 64

def resize(image):
    # your resize code here, similar to:
    # return cv2.resize(np.array(image.pixel_array), (IMG_PX_SIZE, IMG_PX_SIZE))
    pass

def read(path):
    # your file read operation here
    pass

big_list = [['~/images/0001.dcm', '~/images/0135.dcm'],
            ['~/images/0002.dcm', '~/images/0136.dcm']]

resized_images = [[resize(read(path)) for path in paths] for paths in big_list]
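For reference, one possible way to fill in those two placeholders, assuming pydicom and OpenCV are installed (the answer deliberately leaves them open, so this is only a sketch):
import cv2
import numpy as np
import pydicom

IMG_PX_SIZE = 64

def read(path):
    # Load the DICOM file and return its pixel data as a NumPy array
    return pydicom.dcmread(path).pixel_array

def resize(image):
    # Downscale the 512x512 slice to 64x64
    return cv2.resize(np.asarray(image), (IMG_PX_SIZE, IMG_PX_SIZE))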

Extracting bounding boxes and category labels in MS-COCO dataset

I am working with MS-COCO dataset and I want to extract bounding boxes as well as labels for the images corresponding to backpack (category ID: 27) and laptop (category ID: 73) categories, and store them into different text files to train a neural network based model later.
I have already extracted the images corresponding to the aforementioned two categories and have created empty annotation files in a separate folder, where I am looking to store the annotations along with labels (the format of the annotation file is something like label x y w h, where w and h indicate the width and height of the detected category). I built upon COCO-API (coco.py, to be precise) to extract the images and create the empty text annotation files.
Following is the main function I wrote on top of coco.py to do so:
if __name__ == "__main__":
    littleCo = COCO('/home/r.bohare/coco_data/annotations/instances_train2014.json')
    #id_laptop = littleCo.getCatIds('laptop')
    """Extracting image ids corresponding to backpack and laptop images."""
    bag_img_ids = littleCo.getImgIds(catIds=[27])
    laptop_img_ids = littleCo.getImgIds(catIds=[73])
    #print "IDs of bag images:", bag_img_ids
    #print "IDs of laptop imgs:", laptop_img_ids
    """Extracting annotation ids corresponding to backpack and laptop images."""
    bag_ann_ids = littleCo.getAnnIds(catIds=[27])
    laptop_ann_ids = littleCo.getAnnIds(catIds=[73])
    #print "Annotation IDs of bags:", bag_ann_ids
    #print "Annotation IDs of laptops:", laptop_ann_ids
    """Extracting image names corresponding to bag and laptop categories."""
    bag_imgs = littleCo.loadImgs(ids=bag_img_ids)
    laptop_imgs = littleCo.loadImgs(ids=laptop_img_ids)
    #print "Bag images:", bag_imgs
    #print "Laptop images:", laptop_imgs
    bag_img_names = [image['file_name'] for image in bag_imgs]
    laptop_img_names = [image['file_name'] for image in laptop_imgs]
    print "Bag Images:", len(bag_img_names), bag_img_names[:5]
    print "Laptop Images:", len(laptop_img_names), laptop_img_names[:5]
    """Extracting annotations corresponding to bag and laptop images."""
    bag_ann = littleCo.loadAnns(ids=bag_ann_ids)
    laptop_ann = littleCo.loadAnns(ids=laptop_ann_ids)
    bag_bbox = [ann['bbox'] for ann in bag_ann]
    laptop_bbox = [ann['bbox'] for ann in laptop_ann]
    print "Bags' bounding boxes:", len(bag_ann), bag_bbox[:5]
    print "Laptops' bounding boxes:", len(laptop_bbox), laptop_bbox[:5]
    """Saving files corresponding to bags and laptop category in a directory."""
    import shutil
    #path_to_imgs = "/export/work/Data Pool/coco_data/train2014/"
    #path_to_subset_imgs = "/export/work/Data Pool/coco_subset_data/"
    path_to_ann = "/export/work/Data Pool/coco_subset_data/annotations/"
    dirs_list = [("/export/work/Data Pool/coco_data/train2014/", "/export/work/Data Pool/coco_subset_data/")]
    for source_folder, destination_folder in dirs_list:
        for img in bag_img_names:
            shutil.copy(source_folder + img, destination_folder + img)
        print "Bag images copied!"
        for img in laptop_img_names:
            shutil.copy(source_folder + img, destination_folder + img)
        print "Laptop images copied!"
    """Creating empty files for annotation."""
    for f in os.listdir("/export/work/Data Pool/coco_subset_data/images/"):
        if f.endswith('.jpg'):
            open(os.path.join(path_to_ann, f.replace('.jpg', '.txt')), 'w+').close()
    print "Done creating empty annotation files."
I provided only the main function here as the rest of the code is coco.py file in COCO-API.
I debugged the code to find that there are different data structures:
cats, a dictionary which maps category IDs to their supercategories and category names (labels).
imgToAnns, also a dictionary, which maps every image ID to its segmentation ground truth, bounding box ground truth, category ID, etc. From what I have managed to learn so far, I think I need to use this dictionary to map the image names in the bag_img_names and laptop_img_names lists to their labels and bounding boxes, but I cannot work out how to access this dictionary (no method in coco.py returns it directly).
imgs, another dictionary which gives meta information about all images, such as, image name, image url, captured date etc.
Finally, I know this is an extremely specific question. Feel free to let me know if it belongs to a community other than Stack Overflow (stats.stackexchange.com, for example), and I will remove it. Also, I may have missed some vital information; I will provide it if I can think of it, or if someone asks.
I am only a beginner in Python, so please forgive me if I might have missed something obvious.
Any help whatsoever is highly appreciated. Thank You.
Two years have passed. coco.py can now do what you were doing, by adding at the end some functions that map the annotations, converted into RLE format, to the images. Take a look at this cocoapi.
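For readers arriving today, a minimal sketch of the per-image extraction with the COCO API could look like this; it reuses the category IDs from the question, while the label names, the output file naming, and the annotation path are illustrative placeholders:
from pycocotools.coco import COCO

coco = COCO('instances_train2014.json')   # path to the annotation file (placeholder)
cat_ids = [27, 73]                        # backpack and laptop, as in the question
labels = {27: 'backpack', 73: 'laptop'}

# Images that contain at least one of the two categories
img_ids = set(coco.getImgIds(catIds=[27])) | set(coco.getImgIds(catIds=[73]))
for img_id in img_ids:
    img_info = coco.loadImgs(ids=[img_id])[0]
    ann_ids = coco.getAnnIds(imgIds=[img_id], catIds=cat_ids, iscrowd=None)
    lines = []
    for ann in coco.loadAnns(ids=ann_ids):
        x, y, w, h = ann['bbox']          # COCO bbox format: [x, y, width, height]
        lines.append(f"{labels[ann['category_id']]} {x} {y} {w} {h}")
    # One "label x y w h" file per image, named after the image (assumed naming scheme)
    with open(img_info['file_name'].replace('.jpg', '.txt'), 'w') as f:
        f.write('\n'.join(lines))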

How to loop through one element of a zip() function twice - Python

So here's my dilemma... I'm writing a script that reads all .png files from a folder and then converts them to a number of different dimensions which I have specified in a list. Everything works as it should, except that it quits after handling one image.
Here is my code:
sizeFormats = ["1024x1024", "114x114", "40x40", "58x58", "60x60", "640x1136", "640x960"]

def resizeImages():
    widthList = []
    heightList = []
    resizedHeight = 0
    resizedWidth = 0
    #targetPath is the path to the folder that contains the images
    folderToResizeContents = os.listdir(targetPath)
    #This splits the dimensions into 2 separate lists for height and width (ex: 640x960 adds
    #640 to widthList and 960 to heightList)
    for index in sizeFormats:
        widthList.append(index.split("x")[0])
        heightList.append(index.split("x")[1])
    #for every image in the folder, apply the dimensions from the populated lists and save
    for image, w, h in zip(folderToResizeContents, widthList, heightList):
        resizedWidth = int(w)
        resizedHeight = int(h)
        sourceFilePath = os.path.join(targetPath, image)
        imageFileToConvert = Image.open(sourceFilePath)
        outputFile = imageFileToConvert.resize((resizedWidth, resizedHeight), Image.ANTIALIAS)
        outputFile.save(sourceFilePath)
The following will be returned if the target folder contains 2 images called image1.png,image2.png (for sake of visualization I'll add the dimensions that get applied to the image after an underscore):
image1_1024x1024.png,
..............,
image1_640x690.png (Returns all 7 different dimensions for image1 fine)
It stops there, when I need it to apply the same transformations to image2. I know this is because widthList and heightList are only 7 elements long, so the loop exits before image2 gets its turn. Is there any way I can loop through widthList and heightList for every image in targetPath?
Why not keep it simple:
for image in folderToResizeContents:
    for fmt in sizeFormats:
        (w, h) = fmt.split('x')
N.B. You are overwriting the files produced, as you are not changing the name of the output path.
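A minimal way to avoid that overwriting, assuming a size suffix in the output filename is acceptable (the naming scheme here is an assumption, not part of the original answer):
import os
from PIL import Image

for image in folderToResizeContents:
    for fmt in sizeFormats:
        w, h = map(int, fmt.split('x'))
        name, ext = os.path.splitext(image)
        sourceFilePath = os.path.join(targetPath, image)
        outputFilePath = os.path.join(targetPath, f"{name}_{fmt}{ext}")  # e.g. image1_640x960.png
        # Image.ANTIALIAS as used in the question; newer Pillow versions call it Image.LANCZOS
        Image.open(sourceFilePath).resize((w, h), Image.ANTIALIAS).save(outputFilePath)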
Nest your for loops and you can apply all 7 dimensions to each image
for image in folderToResizeContents:
    for w, h in zip(widthList, heightList):
The first for loop ensures it happens for each image, whereas the second for loop ensures that the image is resized to each size.
You need to re-iterate through sizeFormats for every file. zip doesn't do this unless you get even trickier with cyclic iterators for the heights and widths.
Sometimes tools such as zip make for longer, more complicated code when a couple of nested for loops work fine. I think it's more straightforward than splitting into multiple lists and then zipping them back together again.
sizeFormats = ["1024x1024", "114x114", "40x40", "58x58", "60x60", "640x1136", "640x960"]
sizeTuples = [(int(w), int(h)) for w, h in map(lambda wh: wh.split('x'), sizeFormats)]

def resizeImages():
    #for every image in the folder, apply the dimensions from the populated lists and save
    for image in os.listdir(targetPath):
        for resizedWidth, resizedHeight in sizeTuples:
            sourceFilePath = os.path.join(targetPath, image)
            imageFileToConvert = Image.open(sourceFilePath)
            outputFile = imageFileToConvert.resize((resizedWidth, resizedHeight), Image.ANTIALIAS)
            outputFile.save(sourceFilePath)
