I am working on image classification in tensorflow. I am at the point of loading a local dataset from my project directory into my python file. I am following the tensorflow docs (https://www.tensorflow.org/tutorials/images/classification), and when I reach the point of adding data, the docs import the data from the internet using a google dataset. They use
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
and then
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
How would I do the same thing with a local directory called DataSet?
get_file will download the file only if it does not already exist. So you can set fname to a local file and set origin='' like:
data_dir = tf.keras.utils.get_file(os.path.abspath('flower_photos'), origin='', untar=True)
os.path.abspath is needed since Keras searches cache_dir for the file by default.
And since untar is deprecated, you are better off using extract instead:
data_dir = tf.keras.utils.get_file(os.path.abspath('flower_photos.tar.gz'), origin='', extract=True)
Suppose that your DataSet folder contains subfolders that each contain .png images.
import pathlib
import tensorflow as tf
data_dir = pathlib.Path('path/to/your/DataSet_folder')
list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*.png'))
list_ds contains all paths to your images.
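If what you ultimately want is a labeled, batched dataset rather than a list of file paths, here is a minimal sketch assuming TensorFlow 2.x and the same hypothetical DataSet layout (one subfolder per class):
import tensorflow as tf

# 'path/to/your/DataSet_folder' is the same placeholder path as above;
# subfolder names are inferred as the class labels.
train_ds = tf.keras.utils.image_dataset_from_directory(
    'path/to/your/DataSet_folder',
    image_size=(180, 180),  # images are resized on load
    batch_size=32)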
I have collected a group of images that I want to train a model on.
How do I load the image dataset? I have a folder of training data with two folders in it denoting the two different kinds of objects. How would I go about loading this data set and then training a model?
This might help you load your dataset into a data variable from a single folder of images:
import cv2
import os
import numpy as np

path = 'path to your dataset'
list_of_files = os.listdir(path)

data = []  # use a list; np.empty(0) cannot be appended to in place
for i in list_of_files:
    x = cv2.imread(os.path.join(path, i))  # join takes the parts as separate arguments
    data.append(x)
data = np.array(data)
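Since the question mentions two subfolders for the two kinds of objects, here is a hedged extension of the same idea, assuming the subfolder names act as the class labels:
import cv2
import os
import numpy as np

base_path = 'path to your dataset'  # placeholder path; one subfolder per class
classes = sorted(os.listdir(base_path))

images, labels = [], []
for label, class_name in enumerate(classes):
    class_dir = os.path.join(base_path, class_name)
    for fname in os.listdir(class_dir):
        img = cv2.imread(os.path.join(class_dir, fname))
        if img is not None:  # skip files OpenCV cannot read
            images.append(img)
            labels.append(label)

# np.array only stacks cleanly if all images share the same dimensions.
X = np.array(images)
y = np.array(labels)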
I have a folder with many subfolders that contain images.
I split the data using train_test_split from sklearn.model_selection, as below:
folder_data_train, folder_data_test, train_target, test_target = train_test_split(
    data, targets_array, test_size=0.20, random_state=42, shuffle=True, stratify=targets_array)
folder_data_test contains .png image paths. The output of print(folder_data_test) is:
['/avi_images/A4CH_RV\\12505310b836710d_c18.png'
'/avi_images/PLAX_valves\\6ad39d497bc07141_c21.png'
'/avi_images/A4CH_LV\\7f50b7e4c051d48f_b52.png' ...
'/avi_images/Suprasternal\\6978b0ee7068a69e_b37.png'
'/avi_images/A5CH\\61cabd1291a81fc8_b43.png'
'/avi_images/PLAX_full\\2cab9cf0dd8d6480_b7.png']
I want to copy these images from folder_data_test to a new directory, preserving the subfolders; for example, one subfolder is A4CH_RV. My current code is:
dst_dir_test = '/avi_images_search/test/'
for testdata in folder_data_test:
    shutil.copy(testdata, dst_dir_test)
It copies all the images from folder_data_test into the dst_dir_test directory without subfolders. How can I copy them into the relevant subfolders?
shutil acts much like the shell in this case.
Your code does this (for each file):
shutil.copy('/avi_images/A4CH_RV\\12505310b836710d_c18.png', '/avi_images_search/test/')
This is roughly equivalent to
cp /avi_images/A4CH_RV\\12505310b836710d_c18.png /avi_images_search/test/
I'm a little confused by the \\, but I guess you are on Windows and what you want is
cp /avi_images/A4CH_RV\\12505310b836710d_c18.png /avi_images_search/test/A4CH_RV
To do this in Python you'll have to play around with the path and create the directory before the copy:
src_base_path = '/avi_images/'
for testdata in folder_data_test:
    src_dir_path, file_name = os.path.split(testdata)
    sub_dir = os.path.join(dst_dir_test, src_dir_path[len(src_base_path):])
    os.makedirs(sub_dir, exist_ok=True)  # create the subfolder before copying
    shutil.copy(testdata, sub_dir)
I didn't try it, but it should be something along those lines.
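A variant of the same idea using pathlib (my sketch, not the original answer), which sidesteps the string slicing; it assumes Windows paths, where both slash types act as separators:
import shutil
from pathlib import Path

src_base_path = Path('/avi_images/')
dst_dir_test = Path('/avi_images_search/test/')

for testdata in folder_data_test:
    # Recreate the source subfolder (e.g. A4CH_RV) under the destination.
    rel = Path(testdata).relative_to(src_base_path)
    target_dir = dst_dir_test / rel.parent
    target_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy(testdata, target_dir)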
I would like to use the emotions, scene, and yeast datasets in my project in Anaconda (Python 3.6.5).
I have used the following code:
from skmultilearn.dataset import load_dataset
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
It works when I am connected to the internet, but when I am offline it doesn't work!
I have downloaded all three datasets named above into a folder:
H:\Projects\Datasets
How can I use this folder as my source datasets while I am offline?
(I'm using windows 10)
The datasets I downloaded have the .rar extension, like this: emotions.rar, scene.rar, and yeast.rar, and I downloaded them from: http://mulan.sourceforge.net/datasets-mlc.html
You can, but first you need to know the path where the dataset was stored.
To find it, load the dataset once while online and read off the path. This path never changes, so you only need to do the following once. After that, knowing the path, you can load whatever you want offline.
Example:
from sklearn.datasets import load_iris
import pandas as pd
import os
#get the path
path = load_iris()['filename']
print(path)
#offline load
df = pd.read_csv(path)
#the path: THIS IS WHAT YOU NEED
main_path_with_datasets = os.path.dirname(path)
Once you have main_path_with_datasets (from os.path.dirname(path) above), you can use it to list and load all the downloaded datasets:
os.listdir(main_path_with_datasets)
['digits.csv.gz',
'wine_data.csv',
'diabetes_target.csv.gz',
'iris.csv',
'breast_cancer.csv',
'diabetes_data.csv.gz',
'linnerud_physiological.csv',
'linnerud_exercise.csv',
'boston_house_prices.csv']
EDIT for skmultilearn
from skmultilearn.dataset import load_dataset_dump
path = 'C:\\Users\\myname\\scikit_ml_learn_data\\'
X, y, feature_names, label_names = load_dataset_dump(path + 'emotions-train.scikitml.bz2')
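To avoid hard-coding the cache directory, here is a hedged sketch; get_data_home is assumed to be available in your scikit-multilearn version (it is in recent releases), with a fallback to the default location otherwise:
import os
from skmultilearn.dataset import load_dataset_dump

try:
    from skmultilearn.dataset import get_data_home
    data_home = get_data_home()  # skmultilearn's cache directory
except ImportError:
    data_home = os.path.expanduser('~/scikit_ml_learn_data')  # assumed default location

X, y, feature_names, label_names = load_dataset_dump(
    os.path.join(data_home, 'emotions-train.scikitml.bz2'))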
I'm new to Machine Learning and I'm following a Sentdex tutorial on Google Colab. It's supposed to be an ML program that distinguishes between cat and dog images. However, whenever I run my code, something's wrong with my 'file or directory':
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\atlgwc16\\PetImages/Dog'
I honestly don't know where Google Colab stores its files so I don't know where to put the folder of images.
Here is my full code so far:
import numpy as np
import matplotlib.pyplot as plt
import os
import cv2
from tqdm import tqdm
DATADIR = "C:\Users\atlgwc16\PetImages"
CATEGORIES = ["Dog", "Cat"]
for category in CATEGORIES:
    path = os.path.join(DATADIR, category)
    for img in os.listdir(path):
        img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
        plt.imshow(img_array, cmap='gray')
        plt.show()
        break
Tutorial being followed as referenced in the question:
https://pythonprogramming.net/loading-custom-data-deep-learning-python-tensorflow-keras/
Since you are using Google Colab, you can upload the Kaggle dataset of dog and cat images to Google Drive. See the Google Colab Jupyter notebook provided by Google that explains how to do this:
https://colab.research.google.com/notebooks/io.ipynb#scrollTo=u22w3BFiOveA
You would then access files from your Google Drive (in this case, the training set after you upload it to Google Drive) much in the same way as accessing files locally on your computer.
This is the example provided in the link above:
with open('/content/gdrive/My Drive/foo.txt', 'w') as f:
    f.write('Hello Google Drive!')
!cat /content/gdrive/My\ Drive/foo.txt
So, since you are using Google Colab, you would need to adjust the code from the Sentdex tutorial to work better with the notebook you are creating. Google Colab uses Jupyter notebooks. Each cell in the notebook runs off of the same 'session'. So, if you import a Python module in one cell, it can be used in the next cells. It's magic like that.
It would look like this:
[CELL 1]
from google.colab import drive
drive.mount('/content/gdrive')
You will then give permission for Google Colab to access your Google Drive.
[CELL 2]
import numpy as np
import matplotlib.pyplot as plt
import os
import cv2
from tqdm import tqdm
DATADIR = '/content/gdrive/My Drive/PetImages/'
# ^ Note the path change. You would need to go to Google Drive and create the
# 'PetImages' folder at the top level of your Drive, then upload the dataset
# into it, creating a 'Dog' subfolder and a 'Cat' subfolder.
CATEGORIES = ["Dog", "Cat"]
for category in CATEGORIES:  # do dogs and cats
    path = os.path.join(DATADIR, category)  # create path to dogs and cats
    for img in os.listdir(path):  # iterate over each image per dogs and cats
        img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)  # convert to array
        plt.imshow(img_array, cmap='gray')  # graph it
        plt.show()  # display!
        break  # we just want one for now so break
    break  # ...and one more!
After properly uploading the data set to Google Drive and using the special google.colab module, you should be able to easily access your training data. Google Colab is a cloud-based tool for creating Jupyter notebooks and running Python programs. So, while similar to running a Python program locally on your computer, it is not exactly the same. It would help to read up on how Google Colab works if you want to use it completely in the cloud, with Google Drive storing your files rather than your own computer. See the link I posted above from Google.
Happy coding.
I did this myself and it works for me.
I use a dataset from my local drive, such as a hard disk.
Note: your dataset folder must be in zip form.
Follow the method with me and you will access your dataset from the local drive. I use Google Colab. First, create a Jupyter notebook in Google Colab and run the code below step by step.
First step: run the code below in your notebook and upload your dataset from your hard drive or local drive:
from google.colab import files
uploaded = files.upload()
When the process is 100 percent complete, do the second step.
Second step: copy and run the code below; this step will unzip the dataset:
import zipfile
import io
zf = zipfile.ZipFile(io.BytesIO(uploaded['DogVsCat.zip']), "r")
zf.extractall()
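As a side note, files.upload() also saves the uploaded file into the notebook's working directory, so (assuming the same DogVsCat.zip name) you could extract it straight from disk instead of from memory:
import zipfile

# The uploaded zip also exists on disk, so it can be opened by name.
with zipfile.ZipFile('DogVsCat.zip', 'r') as zf:
    zf.extractall()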
Third step: run the code below; it will import all the required libraries:
import numpy as np
import os
import cv2
import matplotlib.pyplot as plt
Fourth step: specify the path. To find it, do the following (check the picture below for more ease): in the left sidebar, click the folder icon and you will see your unzipped dataset; in my case my dataset is 'DogVsCat'.
Note: you will see two kinds of dataset there, zipped and unzipped; copy the path of the unzipped data.
Right-click on it, copy the path, and run the code below:
DIRECTORY ='/content/DogVsCats'
CATEGORIES = ['cats', 'dogs']
Note: put your own path in DIRECTORY (the path above is mine), and put your own folder names in CATEGORIES, not my folder names. For more information, see the picture:
my dataset structure
Fifth step: at the end, create the training data:
data = []
for category in CATEGORIES:
    path = os.path.join(DIRECTORY, category)
    for img in os.listdir(path):
        img_path = os.path.join(path, img)
        label = CATEGORIES.index(category)
        arr = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        new_arr = cv2.resize(arr, (60, 60))
        data.append([new_arr, label])
Sixth step: print the data by running the code below:
data
Seventh step: shuffle your data:
import random
random.shuffle(data)
Eighth step: specify the features and labels for training the model:
X = []
y = []
for features, label in data:
    X.append(features)
    y.append(label)
X = np.array(X)
y = np.array(y)
Ninth step: print the features:
X
Tenth step: print the labels:
y
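If you later feed X into a Keras convolutional network, a common preparation step (my addition, not part of the original post) is to add a channel axis and scale the pixel values:
# Grayscale images need an explicit channel axis for Conv2D layers,
# and scaling to [0, 1] usually helps training.
X = X.reshape(-1, 60, 60, 1) / 255.0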
Note: I cannot share all the code with you for lack of time.
Note: for more clarity, check the pictures of my code:
pic1-Of-My-Code
pic2-of-my-code
I'm on Windows 10, Jupyter Notebook, PyTorch 1.0, and Python 3.6.x currently.
First I confirmed the path to the files using this code: print(os.listdir('./Dataset/images/')),
and I could check that this path is correct.
But I got this error:
RuntimeError: Found 0 files in subfolders of: ./Dataset/images/
Supported extensions are: .jpg,.jpeg,.png,.ppm,.bmp,.pgm,.tif
What is the matter?
Could you suggest a solution?
I tried ./dataset/1/images as in this method, but the result was the same.
img_dir = './Dataset/images/'
img_data = torchvision.datasets.ImageFolder(
    os.path.join(img_dir),
    transforms.Compose([
        transforms.Scale(256),
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]))
img_batch = data.DataLoader(img_data, batch_size=batch_size,
                            shuffle=True, drop_last=True)
I met the same problem when using CelebA, which includes 200,000 images. With a small sample (I tried 20 images), I checked that the error is not raised, which means the images can be read successfully. But when the number grows, we should use another method.
I solved the problem according to this website; thanks to QimingChen.
Github solution
Simply adding another folder named 1 (/train/ ---> /train/1/) inside the original folder makes the program work, without changing the path. That's because ImageFolder expects the images to be sorted into subfolders, one per class.
The Original answer on Github:
Let's say I am going to use ImageFolder("/train/") to read jpg files in folder train.
The file structure is
/train/
-- 1.jpg
-- 2.jpg
-- 3.jpg
I failed to load them, leading to errors:
RuntimeError: Found 0 images in subfolders of: ./data
Supported image extensions are: .jpg,.JPG,.jpeg,.JPEG,.png,.PNG,.ppm,.PPM,.bmp,.BMP
I read the solution above and tried tens of times. When I changed the structure to
/train/1/
-- 1.jpg
-- 2.jpg
-- 3.jpg
But with the reading code still ImageFolder("/train/"), IT WORKS.
It seems the program reads files in recursively, which is convenient in some cases.
Hope this would help!!
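Applied to the question's code, here is a minimal sketch assuming the images were moved into a single dummy class folder named 1, and using Resize since Scale is deprecated:
import torchvision
from torchvision import transforms

# Expects images under './Dataset/images/1/' so ImageFolder can find them.
img_data = torchvision.datasets.ImageFolder(
    './Dataset/images/',
    transform=transforms.Compose([
        transforms.Resize(256),
        transforms.RandomResizedCrop(224),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(),
    ]))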
Can you post the structure of your files? In your case, it is supposed to be:
img_dir
|_class1
|_a.jpg
|_b.jpg
|_class2
|_a.jpg
|_b.jpg
...
According to the rules of PyTorch's ImageFolder dataset, you should choose the parent of the image directory. That means if your images are located in './Dataset/images/', the path given to the loader should be './Dataset' instead. I hope this fixes your bug. :)
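In code, this is a one-line change to the question's snippet (a sketch under that assumption, with 'images' then treated as the single class folder):
# Point ImageFolder at the parent directory; 'images' becomes the class folder.
img_data = torchvision.datasets.ImageFolder(
    './Dataset',
    transform=transforms.Compose([transforms.ToTensor()]))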
You can modify the ImageFolder class to get to the root folder directly (without subfolders):
class ImageFolder(Dataset):
    def __init__(self, root, transform=None):
        # Call make_dataset to collect the files.
        self.samples = make_dataset(root)
        self.imgs = self.samples
        self.transform = transform
        ...
We call the make_dataset method to collect our files:
def make_dataset(dir):
    import os
    images = []
    d = os.path.expanduser(dir)
    if not os.path.exists(dir):
        print('path does not exist')
    for root, _, fnames in sorted(os.walk(d)):
        for fname in sorted(fnames):
            path = os.path.join(root, fname)
            images.append(path)
    return images
All the action takes place in the loop containing os.walk. Here, the files are collected from the 'root' directory, which we specify as the directory containing our files.
See the documentation of the ImageFolder dataset to see how this class expects the images to be organized into subfolders under './Dataset/images' according to image classes. Make sure your images adhere to this layout.
Apparently, the solution is just making the picture names alphanumeric. There may be another solution, but this works.