Loading datasets in offline mode in sklearn and skmultilearn - python

I would like to use datasets: emotions, scene, and yeast in my project in anaconda (python 3.6.5).
I have used the following code:
from skmultilearn.dataset import load_dataset
X_train, y_train, feature_names, label_names = load_dataset('emotions', 'train')
It works when I am connected to the internet, but when I am offline it doesn't work.
I have downloaded all three of the datasets named above into a folder like this:
H:\Projects\Datasets
How can I use this folder as my source datasets while I am offline?
(I'm using windows 10)
The datasets I downloaded have the .rar extension, like this: emotions.rar, scene.rar, and yeast.rar. I downloaded them from: http://mulan.sourceforge.net/datasets-mlc.html

You can, but you first need to know the path where the dataset was stored.
To find it, load the dataset once while online and read the path. The path does not change, so you only need to do this once. After that, knowing the path, you can load whatever you want offline.
Example:
from sklearn.datasets import load_iris
import pandas as pd
import os
# get the path
path = load_iris()['filename']
print(path)
# offline load
df = pd.read_csv(path)
# the path: THIS IS WHAT YOU NEED
main_path_with_datasets = os.path.dirname(path)
Once you have main_path_with_datasets (i.e. main_path_with_datasets = os.path.dirname(path)), you can use it to list and load all of the downloaded datasets.
os.listdir(main_path_with_datasets)
['digits.csv.gz',
'wine_data.csv',
'diabetes_target.csv.gz',
'iris.csv',
'breast_cancer.csv',
'diabetes_data.csv.gz',
'linnerud_physiological.csv',
'linnerud_exercise.csv',
'boston_house_prices.csv']
EDIT for skmultilearn
from skmultilearn.dataset import load_dataset_dump
path = 'C:\\Users\\myname\\scikit_ml_learn_data\\'
X, y, feature_names, label_names = load_dataset_dump(path + 'emotions-train.scikitml.bz2')
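If you would rather point skmultilearn directly at the files downloaded from MULAN, load_from_arff can read the ARFF files inside those archives. This is only a sketch: the path below is hypothetical, you first need to extract emotions.rar yourself (skmultilearn does not open .rar files), and label_count=6 assumes the emotions dataset's six labels sit at the end of each ARFF row.
from skmultilearn.dataset import load_from_arff
# hypothetical path to the ARFF file extracted from emotions.rar
path = 'H:\\Projects\\Datasets\\emotions\\emotions-train.arff'
# emotions has 6 labels; in the MULAN files they are stored at the end of each row
X_train, y_train = load_from_arff(path, label_count=6, label_location='end')
print(X_train.shape, y_train.shape)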

Related

Python import local dataset in tensorflow

I am working on image classification in tensorflow. I am at the point of loading a local dataset from my project directory into my python file. I am following the tensorflow docs (https://www.tensorflow.org/tutorials/images/classification), and when I reach the point of adding data, the docs import the data from the internet using a google dataset. They use
dataset_url = "https://storage.googleapis.com/download.tensorflow.org/example_images/flower_photos.tgz"
and then
data_dir = tf.keras.utils.get_file('flower_photos', origin=dataset_url, untar=True)
How would I do the same thing with a local directory called DataSet?
get_file only downloads if the file does not already exist, so you can set fname to a local file and origin='' like:
data_dir = tf.keras.utils.get_file(os.path.abspath('flower_photos'), origin='', untar=True)
os.path.abspath is needed since Keras searches cache_dir for the file by default.
And since untar is deprecated, you may prefer to use extract instead:
data_dir = tf.keras.utils.get_file(os.path.abspath('flower_photos.tar.gz'), origin='', extract=True)
Suppose that your DataSet folder contains subfolders of .png images.
import pathlib
data_dir = pathlib.Path('path/to/your/DataSet_folder')
list_ds = tf.data.Dataset.list_files(str(data_dir/'*/*.png'))
list_ds contains all paths to your images.
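If you need decoded image tensors rather than just paths, one possible follow-up (a sketch, not part of the original answer; the 180x180 size is an arbitrary choice) is to map a small decode function over list_ds:
import tensorflow as tf
def decode_png(path):
    img = tf.io.read_file(path)                 # read the raw bytes
    img = tf.image.decode_png(img, channels=3)  # decode to a uint8 tensor
    return tf.image.resize(img, (180, 180))     # resize; 180x180 is arbitrary
image_ds = list_ds.map(decode_png)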

Copy images in subfolders to another directory with subfolders using Python

I have a folder with many subfolders that contain images.
I split the data using train_test_split from sklearn.model_selection as below:
folder_data_train, folder_data_test, train_target, test_target = train_test_split(
    data, targets_array, test_size=0.20, random_state=42, shuffle=True, stratify=targets_array)
folder_data_test includes .png images.
The output of print(folder_data_test) is:
['/avi_images/A4CH_RV\\12505310b836710d_c18.png'
'/avi_images/PLAX_valves\\6ad39d497bc07141_c21.png'
'/avi_images/A4CH_LV\\7f50b7e4c051d48f_b52.png' ...
'/avi_images/Suprasternal\\6978b0ee7068a69e_b37.png'
'/avi_images/A5CH\\61cabd1291a81fc8_b43.png'
'/avi_images/PLAX_full\\2cab9cf0dd8d6480_b7.png']
I want to copy these images from folder_data_test to a new directory, preserving the subfolders (for example, the subfolder A4CH_RV). My current code is:
dst_dir_test = '/avi_images_search/test/'
for testdata in folder_data_test:
    shutil.copy(testdata, dst_dir_test)
It copies all images from folder_data_test into the dst_dir_test directory without subfolders. How can I copy them into the relevant subfolders?
shutil acts almost like shell in this case.
Your code does this (for each file):
shutil.copy('/avi_images/A4CH_RV\\12505310b836710d_c18.png', '/avi_images_search/test/')
This is roughly equivalent to
cp /avi_images/A4CH_RV\\12505310b836710d_c18.png /avi_images_search/test/
I'm a little bit confused by the \\, but I guess you are on Windows and what you want is
cp /avi_images/A4CH_RV\\12505310b836710d_c18.png /avi_images_search/test/A4CH_RV
To do this in Python you'll have to play around with the path and create the directory before the copy.
src_base_path = '/avi_images/'
for testdata in folder_data_test:
    # split e.g. '/avi_images/A4CH_RV\\12505310b836710d_c18.png' into directory and file name
    src_dir_path, file_name = os.path.split(testdata)
    # rebuild the source subfolder (e.g. 'A4CH_RV') under the destination directory
    sub_dir = os.path.join(dst_dir_test, src_dir_path[len(src_base_path):])
    os.makedirs(sub_dir, exist_ok=True)
    shutil.copy(testdata, sub_dir)
I didn't try it, but it should be something along those lines.
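As a quick sanity check after the copy (just a hypothetical verification step, not part of the original answer), you can walk the destination tree and confirm each subfolder received its files:
import os
for root, dirs, files in os.walk(dst_dir_test):
    print(root, len(files))  # e.g. '/avi_images_search/test/A4CH_RV' and its file count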

How to unzip cats-vs-dogs data in kaggle?

This is my first Kaggle kernel and I am not sure how things work in Kaggle.
I tried to create a new kernel for a cats-vs-dogs classifier.
I created a new kernel in https://www.kaggle.com/c/dogs-vs-cats/notebooks
Then,
!ls ../input/dogs-vs-cats/
# sampleSubmission.csv test1.zip train.zip
!unzip ../input/dogs-vs-cats/train.zip
# this gives output that looks like it works
# it displays jpg file names
# but when I check the train folder, it does not exist
!ls ../input/dogs-vs-cats/train/
# there is no folder train
import os
print(os.listdir("../input/dogs-vs-cats"))
# ['train.zip', 'test1.zip', 'sampleSubmission.csv']
# there is no unzipped folder
How to access the data in kaggle kernel?
You can load the zip file into pandas,
df = pd.read_csv('train.zip')
df
You are looking for the unzipped files in the wrong place.
Instead of:
!unzip ../input/dogs-vs-cats/train.zip
!ls ../input/dogs-vs-cats/train/
Do this:
!unzip ../input/dogs-vs-cats/train.zip
!ls train/
To check in python
import os
print(os.listdir("train"))
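The same extraction can be done from Python with the standard zipfile module. This is a sketch that assumes, as above, that train.zip unpacks into a train/ folder; it extracts into the kernel's working directory because ../input is read-only on Kaggle:
import os
import zipfile
with zipfile.ZipFile('../input/dogs-vs-cats/train.zip') as zf:
    zf.extractall('.')  # extract next to the notebook, not under ../input
print(len(os.listdir('train')))  # number of extracted training images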

How to load image dataset for SVM image classification task

I'm trying to make a linear SVM classifier (AD vs NC) for the classification of Alzheimer's Disease by using MRI images. How can I load the image dataset correctly?
I found an example of SVM image classification and I tried to run through the trial, but there was an error when loading the dataset.
The folder name is "images"
There are five subfolders in "images", named doller_bill, sunflower, pizza, dog, and ball. Each subfolder contains 50-60 photos in jpg format. The following is the sample code I downloaded.
download from github
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
%matplotlib notebook
from sklearn import svm, metrics, datasets
from sklearn.utils import Bunch
from sklearn.model_selection import GridSearchCV, train_test_split
from skimage.io import imread
from skimage.transform import resize

def load_image_files(container_path, dimension=(64, 64)):
    image_dir = Path(container_path)
    folders = [directory for directory in image_dir.iterdir() if directory.is_dir()]
    categories = [fo.name for fo in folders]
    descr = "A image classification dataset"
    images = []
    flat_data = []
    target = []
    for i, direc in enumerate(folders):
        for file in direc.iterdir():
            img = skimage.io.imread(file)
            img_resized = resize(img, dimension, anti_aliasing=True, mode='reflect')
            flat_data.append(img_resized.flatten())
            images.append(img_resized)
            target.append(i)
    flat_data = np.array(flat_data)
    target = np.array(target)
    images = np.array(images)
    return Bunch(data=flat_data,
                 target=target,
                 target_names=categories,
                 images=images,
                 DESCR=descr)

image_dataset = load_image_files("images/")
However, when I run the code, the following error appears:
NameError: name 'skimage' is not defined
So, would you please help me figure out how to load the image dataset?
For instance, I have a folder named "images".
The subfolders are named "MRI images_NC" and "MRI images_AD".
Each folder contains approximately 1500 photos.
Thanks again.
name 'skimage' is not defined
means that during the import
from skimage.io import imread
the skimage package cannot be found.
Please run
pip install scikit-image
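If scikit-image is installed and you still see the NameError, note that the snippet calls skimage.io.imread even though only imread was imported, so the name skimage is never defined. A minimal fix (two hypothetical variants, pick one) is to make the call match the import:
# Variant 1: import the submodule so skimage.io.imread resolves
import skimage.io
img = skimage.io.imread(file)
# Variant 2: keep the original import and call imread directly
from skimage.io import imread
img = imread(file)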

FileNotFoundError: No such file or directory (for Dogs and Cats code)

I'm new to Machine Learning and I'm following a Sentdex tutorial on Google Colab. It's supposed to be an ML program that distinguishes between cat and dog images. However, whenever I run my code, something's wrong with my 'file or directory.'
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\atlgwc16\\PetImages/Dog'
I honestly don't know where Google Colab stores its files so I don't know where to put the folder of images.
Here is my full code so far:
import numpy as np
import matplotlib.pyplot as plt
import os
import cv2
from tqdm import tqdm
DATADIR = "C:\Users\atlgwc16\PetImages"
CATEGORIES = ["Dog", "Cat"]
for category in CATEGORIES:
    path = os.path.join(DATADIR, category)
    for img in os.listdir(path):
        img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)
        plt.imshow(img_array, cmap='gray')
        plt.show()
        break
Tutorial being followed as referenced in the question:
https://pythonprogramming.net/loading-custom-data-deep-learning-python-tensorflow-keras/
Since you are using Google Colab, you can upload the Kaggle dataset of dog and cat images to Google Drive. See the Google Colab Jupyter notebook provided by Google that explains how to do this:
https://colab.research.google.com/notebooks/io.ipynb#scrollTo=u22w3BFiOveA
You would then access files from your Google Drive (in this case, the training set after you upload it to Google Drive) much in the same way as accessing files locally on your computer.
This is the example provided in the link above:
with open('/content/gdrive/My Drive/foo.txt', 'w') as f:
    f.write('Hello Google Drive!')
!cat /content/gdrive/My\ Drive/foo.txt
So, since you are using Google Colab, you would need to adjust the code from the Sentdex tutorial to work better with the notebook you are creating. Google Colab uses Jupyter notebooks. Each cell in the notebook runs off of the same 'session'. So, if you import a Python module in one cell, it can be used in the next cells. It's magic like that.
It would look like this:
[CELL 1]
from google.colab import drive
drive.mount('/content/gdrive')
You will then give permission for Google Colab to access your Google Drive.
[CELL 2]
import numpy as np
import matplotlib.pyplot as plt
import os
import cv2
from tqdm import tqdm
DATADIR = '/content/gdrive/My Drive/PetImages/'
#^See?#
# You would need to go to Google Drive and create the 'PetImages' folder at the top level of your Google Drive. You would upload the data set to the PetImages folder creating a 'Dog' subfolder and a 'Cat' subfolder.
CATEGORIES = ["Dog", "Cat"]
for category in CATEGORIES:  # do dogs and cats
    path = os.path.join(DATADIR, category)  # create path to dogs and cats
    for img in os.listdir(path):  # iterate over each image per dogs and cats
        img_array = cv2.imread(os.path.join(path, img), cv2.IMREAD_GRAYSCALE)  # convert to array
        plt.imshow(img_array, cmap='gray')  # graph it
        plt.show()  # display!
        break  # we just want one for now so break
    break  # ...and one more!
After properly uploading the data set to Google Drive and using the special google.colab module, you should be able to easily access your training data. Google Colab is a cloud-based tool for creating Jupyter notebooks and running Python programs. So, while similar to running a Python program locally on your computer, it is not exactly the same. It would help to read through how Google Colab works more if you want to use it completely in the cloud--using GDrive to store files rather than your own computer. See the link I posted above from Google.
Happy coding.
I did it myself and it works for me.
I use a dataset from my local drive, such as a hard disk.
Note: your dataset folder must be in zip form.
Follow the method with me and you will access your dataset from the local drive. I use Google Colab. First, create a Jupyter notebook in Google Colab and run the code below step by step.
First step: run the code below in your notebook and upload your dataset from your hard drive or local drive:
from google.colab import files
uploaded = files.upload()
When the upload is 100 percent complete, do the second step.
Second step: copy and run the code below; this step will unzip the dataset.
import zipfile
import io
zf = zipfile.ZipFile(io.BytesIO(uploaded['DogVsCat.zip']), "r")
zf.extractall()
Third step: run the code below; it will import all the required libraries.
import numpy as np
import os
import cv2
import matplotlib.pyplot as plt
This will import all the required libraries for you.
Fourth step: specify the path. To specify the path, follow the steps below.
First, check the picture for clarity: on the left, click the highlighted folder icon and you will see your unzipped dataset (in my case my dataset is "DogVsCat").
Note: there you will see two versions of the dataset, zipped and unzipped; copy the path of the unzipped data.
Right-click on it, copy the path, and run the code below:
DIRECTORY ='/content/DogVsCats'
CATEGORIES = ['cats', 'dogs']
Note: please put your own path in DIRECTORY (this directory is the path for me), not my path, and run the code again.
Note: please put your own folder names in CATEGORIES, not my folder names. For more information see the picture:
my dataset structure
At the end, create the training data.
Fifth step:
data = []
for category in CATEGORIES:
    path = os.path.join(DIRECTORY, category)
    for img in os.listdir(path):
        img_path = os.path.join(path, img)
        label = CATEGORIES.index(category)
        arr = cv2.imread(img_path, cv2.IMREAD_GRAYSCALE)
        new_arr = cv2.resize(arr, (60, 60))
        data.append([new_arr, label])
Sixth step: print the data.
Run the code below to display it:
data
Seventh step: shuffle your data:
import random
random.shuffle(data)
Eighth step: specify features and labels for training the model.
X = []
y = []
for features, label in data:
    X.append(features)
    y.append(label)
X = np.array(X)
y = np.array(y)
Ninth step: print the features:
X
Tenth step: print the labels:
y
Note: I cannot share all the code with you for lack of time.
Note: for more clarity, check my code pictures:
pic1-Of-My-Code
pic2-of-my-code
