I have found a piece of sklearn code that I find relatively straightforward to run on the example Iris Dataset, but how do I create my own dataset in the same format?
iris.data - contains the data measurements of three types of flower
iris.target - contains the labels of three types of flower
For example, rather than analysing the three types of flower in the Iris Dataset, I would like to make my own datasets that follow this format and that I can pass through the same code:
example_sports.data - contains the data measurements of three types of sports players
example_sports.target - contains the labels of three types of sport
from sklearn.datasets import load_iris #load inbuilt dataset from sklearn
iris = load_iris() #assign variable name iris to inbuilt dataset
iris.data # contains the four numeric variables, sepal length, sepal width, petal length, petal width
print(iris.data) #printing the measurements of the Iris Dataset below
iris.target # relates to the different species shown as 0,1,2 for the three different
# species of Iris, a categorical variable, basically a label
print(iris.target)
The full code can be found at https://www.youtube.com/watch?v=asW8tp1qiFQ
sklearn datasets are stored in a Bunch, which is essentially a dictionary with attribute-style access. The data and target entries are plain NumPy arrays and can be fed straight into the fit() method of whichever sklearn estimator you are interested in. If you want to package your own data as a Bunch, you can do something like the following:
from sklearn.utils import Bunch
import numpy as np
example_sports = Bunch(data = np.array([[1.0,1.2,2.1],[3.0,2.3,1.0],[4.5,3.4,0.5]]), target = np.array([3,2,1]))
print(example_sports.data)
print(example_sports.target)
Naturally, you can read your own custom lists into the data and target entries of the Bunch. Pandas is a good tool if you have the data in Excel/CSV files.
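For example, if the sports measurements live in a CSV file, a minimal sketch of loading it into a Bunch could look like this (the file name example_sports.csv and the label column sport are made-up assumptions):
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

df = pd.read_csv("example_sports.csv")            # hypothetical file, one row per player
codes, names = pd.factorize(df["sport"])          # integer labels 0, 1, 2 like iris.target

example_sports = Bunch(
    data=df.drop(columns=["sport"]).to_numpy(),   # the numeric measurements
    target=codes,
    target_names=list(names),
    feature_names=[c for c in df.columns if c != "sport"],
)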
Try using the type() command whenever you are stuck. In this case it shows you that the object is a Bunch. Then you can search the documentation of that class on the web and learn how to use it.
The following will help you.
import numpy as np
import pandas as pd
from sklearn.utils import Bunch

b = Bunch(a=1, b="text", c=pd.Series(np.arange(5)), d=np.asarray([0, 8, 9]))
b.c
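Calling type() and keys() on the object confirms what you are dealing with (the exact module path printed may vary between sklearn versions):
print(type(b))   # e.g. <class 'sklearn.utils.Bunch'>
print(b.keys())  # dict_keys(['a', 'b', 'c', 'd'])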
I want to run an evaluation of imputation methods on my data, rather than the California Housing data on the following sklearn page:
https://scikit-learn.org/stable/auto_examples/impute/plot_iterative_imputer_variants_comparison.html#sphx-glr-auto-examples-impute-plot-iterative-imputer-variants-comparison-py
I can remove the following code
from sklearn.datasets import fetch_california_housing
but don't know how to add my data (as a *.csv file) for evaluation and to what extent the code below needs to be modified.
N_SPLITS = 5
rng = np.random.RandomState(0)
X_full, y_full = fetch_california_housing(return_X_y=True)
# ~2k samples is enough for the purpose of the example.
# Remove the following two lines for a slower run with different error bars.
X_full = X_full[::10]
y_full = y_full[::10]
n_samples, n_features = X_full.shape
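One way to swap in your own data (a rough sketch: the file name my_data.csv and the target column name target are assumptions, and all remaining columns are assumed to be numeric features) is to replace the fetch_california_housing call so that X_full, y_full, n_samples and n_features keep their meaning for the rest of the example:
import pandas as pd

df = pd.read_csv("my_data.csv")                    # hypothetical file name
y_full = df["target"].to_numpy()                   # hypothetical column to predict
X_full = df.drop(columns=["target"]).to_numpy()    # the remaining numeric features

n_samples, n_features = X_full.shape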
I have not clustered data in a while, and at the moment I have a massive list of accounts with their respective areas (or OUs in the table below).
I have used kmeans and kmodes to try to cluster based on OU, meaning that I want the output to group the 17 OUs I have and cluster them based on the provided information. So far the output has given me clusters based on each record individually rather than based on each OU. Can someone help me figure out how to group the data and then cluster it? Below is a sample of the code used.
from kmodes.kmodes import KModes

# Building the model with 3 clusters
kmode = KModes(n_clusters=3, init="random", n_init=5, verbose=1)
clusters = kmode.fit_predict(df)
clusters
#insert the predicted cluster values in our original dataset.
df.insert(0, "Cluster", clusters, True)
df.head(10)
I don't have access to your data set, but below is a generic example of how to do clustering.
# Cluster analysis, or clustering, is an unsupervised machine learning task.
# It involves automatically discovering natural grouping in data. Unlike supervised learning (like predictive modeling),
# clustering algorithms only interpret the input data and find natural groups or clusters in feature space.
import statsmodels.api as sm
import numpy as np
import pandas as pd
mtcars = sm.datasets.get_rdataset("mtcars", "datasets", cache=True).data
df_cars = pd.DataFrame(mtcars)
df_cars.head()
from sklearn.cluster import KMeans
from matplotlib import pyplot
# define dataset
X = df_cars[['mpg','hp']].copy()  # .copy() avoids a SettingWithCopyWarning when adding the cluster column
# define the model
model = KMeans(n_clusters=8)
# fit the model
model.fit(X)
# assign a cluster to each example
yhat = model.predict(X)
X['kmeans']=yhat
pyplot.scatter(X['mpg'], X['hp'], c=X['kmeans'], cmap='rainbow', s=50, alpha=0.8)
See the link below for more details.
https://github.com/ASH-WICUS/Notebooks/blob/master/Clustering%20Algorithms%20Compared.ipynb
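Coming back to the original question of clustering at the OU level rather than per record: one option (a rough sketch with made-up column names, assuming each account row has an 'OU' column plus some numeric attributes) is to aggregate the accounts into one profile row per OU and then fit the model on that small table, so each of the 17 OUs gets exactly one cluster label:
import pandas as pd
from sklearn.cluster import KMeans

# df: one row per account; 'OU', 'account_id' and 'spend' are hypothetical column names
ou_profiles = df.groupby("OU").agg(
    n_accounts=("account_id", "count"),
    mean_spend=("spend", "mean"),
)

# cluster the OU-level profiles (17 rows) instead of the individual accounts
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
ou_profiles["cluster"] = kmeans.fit_predict(ou_profiles)
print(ou_profiles.sort_values("cluster"))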
As an introduction to machine learning I have to find the first five names corresponding to the flowers of the Iris dataset from the scikit-learn library.
I'm not quite sure how to approach this, as I'm completely new in the field. I was told I can do some numpy indexing to retrieve these.
I know that the integers in iris.target correspond to 0 = 'setosa', 1 = 'versicolor', 2 = 'virginica'.
EDIT:
To clarify, what I actually want to achieve is to map the integers to names for the first 5 flowers from iris.data (assign setosa, veriscolor or virginica to each of the first five observations).
Do you want to convert the numbers to their corresponding categories? If so, try:
# Load the dataset and store the integer labels of the first five flowers in `y`
from sklearn.datasets import load_iris
y = load_iris()['target'][:5]
# Declare dictionary to map each number to its corresponding text
dictionary = {0:'setosa',1:'versicolor',2:'virginica'}
# Translate each number to text using the dictionary
[dictionary[i] for i in y]
You can do the same with numpy.where:
# Import numpy
import numpy as np
# Case-like structure
np.where(y == 0, 'setosa',
         np.where(y == 1, 'versicolor',
                  'virginica'))
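You can also skip the hand-written mapping entirely: the loaded Bunch already carries target_names, and NumPy fancy indexing turns the integer labels into names.
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.target_names[iris.target[:5]])
# ['setosa' 'setosa' 'setosa' 'setosa' 'setosa']  (the first five samples are all setosa)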
Use pandas if you can; it's simple:
import pandas as pd
import numpy as np
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
new_names = ['sepal_length','sepal_width','petal_length','petal_width','iris_class']
dataset = pd.read_csv(url, names=new_names, skiprows=0, delimiter=',') # load iris dataset from url
dataset.info() # gives details about your dataset
dataset.head() # this will give you first 5 entries in your dataset
# for more details
# check out this link
# https://medium.com/#yosik81/machine-learning-in-30-minutes-with-python-and-google-colab-6e6dfb77f5e1
I am trying to learn about decision trees and I ended up finding an article about them. The goal of the article is to decide whether a flower is an iris flower or not, but I seem to run into some errors that I hope somebody can explain. I get two errors like the following:
iris: Bunch iris: inner_f Instance of 'tuple' has no 'target' member
and
iris: Bunch iris: inner_f Instance of 'tuple' has no 'data' member
I get these errors at the x = iris.data line and at the y = iris.target line.
Here is the code:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
#load iris data
iris = datasets.load_iris()
x = iris.data
y = iris.target
d = [{"sepal_length":row[0],
"sepal_width":row[1],
"petal_length":row[2],
"petal_width":row[3]} for row in x]
df = pd.DataFrame(d) # construct dataframe
df["types"] = y # assign types
df = df.sample(frac=1.0) # random shuffle rows
df.head()
Does anybody know why I get these errors?
Your error message indicates that the problematic value iris is a tuple, which doesn't have the attributes you're referencing. Check the documentation for the tools you're using; they should explain how to unpack datasets.load_iris() into the objects you need.
I would not filter warnings in most cases, as you get useful information from the warnings.
So, the sklearn datasets format is a Bunch, which is a specialized container object that works like a dictionary. You can access it with dot notation, e.g. iris.data or dictionary notation, e.g. iris['data']. Here, it is unclear what the error is on your machine, as I (like other commenters) had no problem accessing iris.data or iris['data'] in python 3.8.5.
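For example, both access styles return the same array:
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
# attribute access and dictionary access are two views of the same Bunch entry
assert np.array_equal(iris.data, iris['data'])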
I wanted to let you know a couple of places to improve your approach:
(1) It is unclear why you need to construct a dataframe as you can get the samples you need directly from calling train_test_split on the concatenated numpy arrays or you can get a random sample of indices from the numpy arrays directly.
(2) Your method for constructing the dataframe is more complex than it needs to be.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# load iris data
iris = datasets.load_iris()
# train test split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
# random shuffle of data/target indices
rng = np.random.default_rng()
rng_size = iris.data.shape[0]
idx_sample = rng.choice(np.arange(rng_size), size=rng_size, replace=False)
# simpler way to create dataframe
# concatenate along the columns (axis 1)
# then set the column names in one place
df = pd.concat([pd.DataFrame(iris.data), pd.DataFrame(iris.target)], axis=1)
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "types"]
I am attempting to take a .dat file of about 90,000 data lines of two variables (wavelength and intensity) and apply sklearn's PCA to it.
Here is a small set of that data:
wavelength intensity
[um] [W/m**2/um/sr]
196.078431372549 1.108370393265022E-003
192.307692307692 1.163428008597600E-003
188.679245283019 1.223639983609668E-003
The code I am using to analyze the data is below
from sklearn.decomposition import PCA

pca = PCA(n_components=2)
pca.fit(data)  # data: the loaded (wavelength, intensity) array
print(pca.components_)
This is the error I get when I try to apply 2 PCA components to one of the data sets:
ValueError: Datatype coercion is not allowed
Any help resolving this would be much appreciated.
I think in your case, the problem is the column name, especially [W/m**2/um/sr].
Also when using PCA, do not forget to rescale the input variables into "comparable" units using StandardScaler.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
data = pd.DataFrame({
    'wavelength [um]': [196.078431372549, 192.307692307692, 188.679245283019],
    'intensity [W/m**2/um/sr]': [1.108370393265022E-003, 1.163428008597600E-003, 1.223639983609668E-003],
})
scaler = StandardScaler(with_mean=True, with_std=True)
pca = PCA(n_components=2)
pca.fit(scaler.fit_transform(data))
print(pca.components_)
Worked well for me. Maybe you just need to specify:
data.columns = data.columns.astype(str)
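If the 90,000 lines are still in the raw .dat file, reading it could look roughly like this (a sketch; the file name spectrum.dat is an assumption, and the two header lines from the snippet in the question are skipped):
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# skip the 'wavelength intensity' and '[um] [W/m**2/um/sr]' header lines
data = pd.read_csv("spectrum.dat", sep=r"\s+", skiprows=2,
                   names=["wavelength [um]", "intensity [W/m**2/um/sr]"])

pca = PCA(n_components=2)
pca.fit(StandardScaler().fit_transform(data))
print(pca.components_)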