I have a folder with many subfolders that contain images.
I split the data using train_test_split from sklearn.model_selection same as below:
folder_data_train, folder_data_test, train_target, test_target = train_test_split(
    data, targets_array, test_size=0.20, random_state=42, shuffle=True, stratify=targets_array)
folder_data_test contains paths to .png images.
The output of print(folder_data_test) is:
['/avi_images/A4CH_RV\\12505310b836710d_c18.png'
'/avi_images/PLAX_valves\\6ad39d497bc07141_c21.png'
'/avi_images/A4CH_LV\\7f50b7e4c051d48f_b52.png' ...
'/avi_images/Suprasternal\\6978b0ee7068a69e_b37.png'
'/avi_images/A5CH\\61cabd1291a81fc8_b43.png'
'/avi_images/PLAX_full\\2cab9cf0dd8d6480_b7.png']
I want to copy these images from folder_data_test to a new directory, preserving the subfolders (for example, the subfolder A4CH_RV). My current code is:
dst_dir_test = '/avi_images_search/test/'
for testdata in folder_data_test:
    shutil.copy(testdata, dst_dir_test)
It copies all the images from folder_data_test into the dst_dir_test directory without the subfolder structure. How can I copy them into the relevant subfolders?
shutil acts almost like a shell in this case.
Your code does this (for each file):
shutil.copy('/avi_images/A4CH_RV\\12505310b836710d_c18.png', '/avi_images_search/test/')
This is roughly equivalent to
cp /avi_images/A4CH_RV\\12505310b836710d_c18.png /avi_images_search/test/
I'm a little confused by the \\, but I guess you are on Windows, and what you want is
cp /avi_images/A4CH_RV\\12505310b836710d_c18.png /avi_images_search/test/A4CH_RV
To do this in python you'll have to play around with the path, and create the directory before the copy.
import os
import shutil

src_base_path = '/avi_images/'
for testdata in folder_data_test:
    # Split the path into its directory and the file name
    src_dir_path, file_name = os.path.split(testdata)
    # Rebuild the same subfolder under the destination directory
    sub_dir = os.path.join(dst_dir_test, src_dir_path[len(src_base_path):])
    os.makedirs(sub_dir, exist_ok=True)
    shutil.copy(testdata, sub_dir)
I didn't try it, but it should be something along those lines.
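If you'd rather not slice strings, a relpath-based variant sidesteps the mixed / and \\ separators in the paths. This is a sketch under the assumption that every path sits below the source base directory; the function name copy_with_subfolders is mine:

```python
import os
import shutil

def copy_with_subfolders(paths, src_base, dst_base):
    """Copy each file into dst_base, recreating the subfolder it came from."""
    for src in paths:
        # normpath collapses mixed separators like '/avi_images/A4CH_RV\\img.png'
        src = os.path.normpath(src)
        # Path of the file relative to the source base, e.g. 'A4CH_RV/img.png'
        rel = os.path.relpath(src, os.path.normpath(src_base))
        dst = os.path.join(dst_base, rel)
        os.makedirs(os.path.dirname(dst), exist_ok=True)
        shutil.copy(src, dst)
```

Called as copy_with_subfolders(folder_data_test, '/avi_images/', dst_dir_test), it creates each subfolder on demand before copying.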
Please help me.
I'm new to Python and I want to load all jpg images in 500 folders (each folder has up to 100 jpg images).
I also have a CSV file that contains a label for each folder, and I want every jpg image in a folder to take that folder's label.
Example file name: folder_name[.....].jpg
Each file starts with its folder's name; only the part inside the brackets differs from file to file.
How can I tell Python to match the file no matter what is inside the brackets?
I would appreciate any help.
train = pd.read_csv("COAD_CMS_label_train.csv")
train_image = []
for i in tqdm(range(train.shape[0])):
    img = image.load_img('tiles/'+train['folder_name'][i]+' [*]'.astype('str')+'.jpg',
                         target_size=(256,256,3), grayscale=False)
example for labels:
folder_name,Label
TCGA-A6-2683-01Z-00-DX1.0dfc5d0a-68f4-45e1-a879-0428313c6dbc,CMS2
TCGA-F4-6459-01Z-00-DX1.80a78213-1137-4521-9d60-ac64813dec4c,CMS4
TCGA-A6-6653-01Z-00-DX1.e130666d-2681-4382-9e7a-4a4d27cb77a4,CMS1
You could do something like this, which yields all files with the jpg extension under parent_dir, the parent directory that contains all the subdirectories with the images. Note that recursive=True only takes effect when the pattern contains **:
images_files = glob.iglob(f'{parent_dir}/**/*.jpg', recursive=True)
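To attach each folder's label from the CSV to its images, here is a minimal standard-library sketch. The function name label_images is mine; it assumes the CSV has folder_name and Label columns (as in your example) and labels each image by its immediate parent folder:

```python
import csv
import glob
import os

def label_images(parent_dir, csv_path):
    """Return (image_path, label) pairs, labeling every jpg in a
    subfolder with that folder's label from the CSV."""
    # Build a folder_name -> Label lookup from the CSV
    with open(csv_path, newline='') as f:
        labels = {row['folder_name']: row['Label'] for row in csv.DictReader(f)}
    pairs = []
    # Recursively collect jpg files; '**' is required for recursion
    for img in glob.iglob(os.path.join(parent_dir, '**', '*.jpg'), recursive=True):
        folder = os.path.basename(os.path.dirname(img))
        if folder in labels:
            pairs.append((img, labels[folder]))
    return pairs
```

This way the bracketed part of each file name never matters: the label comes from the directory, not from parsing the name.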
Hope that helps.
Good luck
I have a question about splitting a dataset of 20k images along with their labels, the dataset is in the format of YOLOv3 which has an image file and a .txt file with the same name as the image, the text file has the labels inside it.
I want to split the dataset into train/test splits, is there a way to randomly select the image and its labels .txt file with it and store it in a separate folder using Python?
I want to be able to split the dataset randomly. For instance, select 16k files along with label file too and store them separately in a train folder and the remaining 4k should be stored in a test folder.
This could be done manually in the file explorer by selecting the first 16k files and moving them to a different folder, but that split wouldn't be random, and I plan to do this over and over again for the same dataset.
Here is what the data looks like
images and labels screenshot
I suggest you take a look at the following Python built-in modules
glob
random
os
shutil
for manipulating files and paths in Python. Here is my code with comments that might solve your problem. It's very simple:
import glob
import random
import os
import shutil
# Get all paths to your images files and text files
PATH = 'path/to/dataset/'
img_paths = glob.glob(PATH+'*.jpg')
txt_paths = glob.glob(PATH+'*.txt')
# Calculate the number of files for training and validation
data_size = len(img_paths)
r = 0.8  # fraction of the data used for training
train_size = int(data_size * r)
# Shuffle two list
img_txt = list(zip(img_paths, txt_paths))
random.seed(43)
random.shuffle(img_txt)
img_paths, txt_paths = zip(*img_txt)
# Now split them
train_img_paths = img_paths[:train_size]
train_txt_paths = txt_paths[:train_size]
valid_img_paths = img_paths[train_size:]
valid_txt_paths = txt_paths[train_size:]
# Move them to train, valid folders
train_folder = PATH+'train/'
valid_folder = PATH+'valid/'
os.makedirs(train_folder, exist_ok=True)
os.makedirs(valid_folder, exist_ok=True)
def move(paths, folder):
    for p in paths:
        shutil.move(p, folder)
move(train_img_paths, train_folder)
move(train_txt_paths, train_folder)
move(valid_img_paths, valid_folder)
move(valid_txt_paths, valid_folder)
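One caveat: zip(img_paths, txt_paths) relies on both glob lists coming back in the same order and with the same length; if a single .txt file is missing, every later pair drifts apart. A safer sketch pairs the files by stem first (paired_stems is a name I made up):

```python
import glob
import os

def paired_stems(path):
    """Return (image, label) path pairs matched by file stem, so a
    shuffled split can never separate an image from its .txt label."""
    # Map each label file's stem (path without extension) to its path
    txts = {os.path.splitext(p)[0]: p for p in glob.glob(os.path.join(path, '*.txt'))}
    pairs = []
    for img in sorted(glob.glob(os.path.join(path, '*.jpg'))):
        stem = os.path.splitext(img)[0]
        # Keep only images that actually have a matching label file
        if stem in txts:
            pairs.append((img, txts[stem]))
    return pairs
```

You can then shuffle and slice the returned pairs exactly as in the code above, with no risk of mismatched image/label files.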
This is my first Kaggle kernel and I am not sure how things work in Kaggle.
I tried to create a new kernel for a cats vs dogs classifier.
I created a new kernel at https://www.kaggle.com/c/dogs-vs-cats/notebooks
Then,
!ls ../input/dogs-vs-cats/
# sampleSubmission.csv test1.zip train.zip
!unzip ../input/dogs-vs-cats/train.zip
# this gives a report that looks like it works.
# it displays jpg files names
# but when I check the folder train, it does not exist
!ls ../input/dogs-vs-cats/train/
# there is no folder train
import os
print(os.listdir("../input/dogs-vs-cats"))
# ['train.zip', 'test1.zip', 'sampleSubmission.csv']
# there is no unzipped folder
How to access the data in kaggle kernel?
If the zip archive contains a single CSV, you can load it into pandas directly,
df = pd.read_csv('train.zip')
df
You are looking for the unzipped files in the wrong place.
Instead of:
!unzip ../input/dogs-vs-cats/train.zip
!ls ../input/dogs-vs-cats/train/
Do this:
!unzip ../input/dogs-vs-cats/train.zip
!ls train/
To check in Python:
import os
print(os.listdir("train"))
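The key point is that ../input is read-only, so the unzip output lands in the working directory, not next to the zip. A pure-Python equivalent of the shell commands above (the helper name extract_zip is mine):

```python
import zipfile

def extract_zip(zip_path, dest='.'):
    """Extract a zip archive into dest (the working directory by default)
    and return the list of member names it contained."""
    with zipfile.ZipFile(zip_path) as z:
        z.extractall(dest)
        return z.namelist()
```

For example, extract_zip('../input/dogs-vs-cats/train.zip') would put the train/ folder into the kernel's working directory, where os.listdir("train") can see it.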
I would like to use datasets: emotions, scene, and yeast in my project in anaconda (python 3.6.5).
I have used the following codes:
from skmultilearn.dataset import load_dataset
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
It works successfully when I am connected to the internet, but when I am offline it doesn't work!
I have downloaded all 3 named above datasets in a folder like this:
H:\Projects\Datasets
How can I use this folder as my source datasets while I am offline?
(I'm using windows 10)
The datasets I have downloaded have the .rar extension, like this: emotions.rar, scene.rar, and yeast.rar. I downloaded them from: http://mulan.sourceforge.net/datasets-mlc.html
You can, but you first need to know the path where the dataset was stored.
To find it, load the dataset once and read the path from it. This path does not change, so you only need to do the following once. Then, knowing the path, you can load whatever you want offline.
Example:
from sklearn.datasets import load_iris
import os
import pandas as pd

# Get the path
path = load_iris()['filename']
print(path)

# Offline load
df = pd.read_csv(path)

# The path: THIS IS WHAT YOU NEED
main_path_with_datasets = os.path.dirname(path)
Once you have main_path_with_datasets, you can use it to list and load all the downloaded datasets.
os.listdir(main_path_with_datasets)
['digits.csv.gz',
'wine_data.csv',
'diabetes_target.csv.gz',
'iris.csv',
'breast_cancer.csv',
'diabetes_data.csv.gz',
'linnerud_physiological.csv',
'linnerud_exercise.csv',
'boston_house_prices.csv']
EDIT for skmultilearn
from skmultilearn.dataset import load_dataset_dump
path = 'C:\\Users\\myname\\scikit_ml_learn_data\\'
X, y, feature_names, label_names = load_dataset_dump(path + 'emotions-train.scikitml.bz2')
I'm on Windows 10, Jupyter Notebook, PyTorch 1.0, Python 3.6.x.
First I confirmed the path of the files using print(os.listdir('./Dataset/images/')), and I could verify that this path is correct.
But I got this error:
RuntimeError: Found 0 files in subfolders of: ./Dataset/images/
Supported extensions are: .jpg,.jpeg,.png,.ppm,.bmp,.pgm,.tif
What is the matter? Could you suggest a solution?
I also tried ./dataset/1/images as suggested elsewhere, but the result was the same.
img_dir = './Dataset/images/'
img_data = torchvision.datasets.ImageFolder(os.path.join(img_dir), transforms.Compose([
transforms.Scale(256),
transforms.RandomResizedCrop(224),
transforms.RandomHorizontalFlip(),
transforms.ToTensor(),
]))
img_batch = data.DataLoader(img_data, batch_size=batch_size,
shuffle = True, drop_last=True)
I met the same problem when using celebA, which includes 200,000 images. With a small sample (I tried 20 images) I checked that the error is not raised and the images can be read successfully, but when the number of images grows we should use another method.
I solved the problem according to this website. Thanks to QimingChen.
Github solution
Simply adding another folder named 1 (/train/ -> /train/1/) inside the original folder makes the program work, without changing the path. That's because ImageFolder expects images to be sorted into subfolders, one per class.
The Original answer on Github:
Let's say I am going to use ImageFolder("/train/") to read jpg files in folder train.
The file structure is
/train/
-- 1.jpg
-- 2.jpg
-- 3.jpg
I failed to load them, leading to errors:
RuntimeError: Found 0 images in subfolders of: ./data
Supported image extensions are: .jpg,.JPG,.jpeg,.JPEG,.png,.PNG,.ppm,.PPM,.bmp,.BMP
I read the solution above and tried it many times. I changed the structure to
/train/1/
-- 1.jpg
-- 2.jpg
-- 3.jpg
while keeping the reading code as ImageFolder("/train/"), and IT WORKS.
It seems the program reads files recursively, which is convenient in some cases.
Hope this would help!!
Can you post the structure of your files? In your case, it is supposed to be:
img_dir
|_class1
|_a.jpg
|_b.jpg
|_class2
|_a.jpg
|_b.jpg
...
According to the rules of ImageFolder in PyTorch, you should pass the parent of the image directory. That means if your images are located in './Dataset/images/', the path given to the loader should be './Dataset' instead. I hope this fixes your bug. :)
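ImageFolder maps each immediate subfolder of the root to one class, indexed in sorted order. Here is a quick plain-Python sketch to preview what it will infer from your directory (find_classes is my name, mimicking torchvision's behavior, no torch needed):

```python
import os

def find_classes(root):
    """Mimic torchvision ImageFolder's class discovery: each immediate
    subfolder of root becomes one class, indexed in sorted order."""
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    return {cls: i for i, cls in enumerate(classes)}
```

If this returns an empty dict for your root, ImageFolder will find 0 files, which is exactly the error above.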
You can modify the ImageFolder class to read files from the root folder directly (without subfolders):
class ImageFolder(Dataset):
    def __init__(self, root, transform=None):
        # Call make_dataset to collect the files
        self.samples = make_dataset(root)
        self.imgs = self.samples
        self.transform = transform
        ...
We call the make_dataset method to collect our files:
import os

def make_dataset(dir):
    images = []
    d = os.path.expanduser(dir)
    if not os.path.exists(d):
        print('path does not exist')
    for root, _, fnames in sorted(os.walk(d)):
        for fname in sorted(fnames):
            path = os.path.join(root, fname)
            images.append(path)
    return images
All the action takes place in the loop containing os.walk. Here, the files are collected from the 'root' directory, which we specify as the directory containing our files.
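For completeness, here is roughly how the pieces fit into a minimal Dataset-style class. FlatImageFolder is my name; the actual image loading with PIL and the transform application are left as comments so the sketch stays dependency-free:

```python
import os

class FlatImageFolder:
    """Minimal sketch of a dataset that reads files straight from root,
    with no class subfolders required."""
    def __init__(self, root, transform=None):
        # Collect every file under root, walking subdirectories too
        self.samples = []
        for r, _, fnames in sorted(os.walk(os.path.expanduser(root))):
            for fname in sorted(fnames):
                self.samples.append(os.path.join(r, fname))
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path = self.samples[idx]
        # In a real Dataset you would open the image here, e.g.
        # img = PIL.Image.open(path).convert('RGB'), and apply
        # self.transform before returning it.
        return path
```

Because __len__ and __getitem__ are defined, a DataLoader can iterate over it just like over a torchvision ImageFolder.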
See the documentation of the ImageFolder dataset to see how it expects the images to be organized into subfolders under ./Dataset/images according to image classes. Make sure your images adhere to this layout.
Apparently, the solution is just making the picture names alphanumeric. There may be another solution, but this works.