I am new to programming and I am working with the Titanic dataset from Kaggle. I have been trying to build a logistic regression model after performing one-hot encoding, but I keep getting an error. I think the error is caused by the dummy variables. Below is my code.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Loading data
df=pd.read_csv(r"C:\Users\Downloads\train.csv")
#Deleting unwanted columns
df.drop(["PassengerId","Name","Cabin","Ticket"],axis=1,inplace=True)
#Count of missing values in each column
print(df.isnull().sum())
#Deleting rows with missing values based on column name
df.dropna(subset=['Embarked','Age'],inplace=True)
print(df.isnull().sum())
#One hot encoding for categorical variables
#Creating dummy variables for Sex column
dummies = pd.get_dummies(df.Sex)
dummies2=pd.get_dummies(df.Embarked)
#Appending the dummies dataframe with original dataframe
new_df= pd.concat([df,dummies,dummies2],axis='columns')
print(type(new_df))
#print(new_df.head(10))
#Drop the original Sex and Embarked columns and one of the dummy columns for both variables
new_df.drop(['Sex','Embarked'],axis='columns',inplace=True)
print(new_df.head(10))
new_df.info()
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix,accuracy_score
x = df.drop('Survived', axis=1)
y = df['Survived']
logmodel = LogisticRegression()
logmodel.fit(x, y)
As we discussed in the comments, here is the solution:
First, you need to modify your x and y variables to use new_df instead of df, like so:
x = new_df.drop('Survived', axis=1)
y = new_df['Survived']
Then, you need to increase the number of iterations of your logistic regression model, like so:
logmodel = LogisticRegression(max_iter=1000)
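For completeness, here is a minimal end-to-end sketch of the corrected flow. It assumes seaborn's bundled titanic dataset as a stand-in for the Kaggle train.csv (the column names are lowercase there), so treat it as an illustration rather than a drop-in replacement:
import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
#seaborn's titanic data stands in for train.csv; keep only comparable columns
df = sns.load_dataset("titanic")[["survived", "pclass", "sex", "age", "sibsp", "parch", "fare", "embarked"]]
df = df.dropna(subset=["age", "embarked"])
#one-hot encode the categorical columns in one call
new_df = pd.get_dummies(df, columns=["sex", "embarked"])
#fit on the encoded frame, not the original one
x = new_df.drop("survived", axis=1)
y = new_df["survived"]
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=0)
logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(x_train, y_train)
print(accuracy_score(y_test, logmodel.predict(x_test)))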
My dataset is already defined and I have done some analysis on it without any problem.
But when I start to define X and y, an error message appears: "NameError: name 'MyNewDataSet' is not defined".
The dataset name is "MyNewDataSet".
Do I need to define a new dataset before assigning the values to X and y, or what should I do?
Here is how I define X and Y from my dataframe:
#Create X
X = MyNewDataSet.drop(['y'], axis=1)
For X I am using multiple columns, so all I do is drop the column I will be using for my y variable.
#Create y
y = MyNewDataSet['y']
Here I create y by assigning the 'y' column to the variable.
If this does not work, please share some of your code; that way it will be easier for us to see your problem. I hope this helps.
Here is my code to define my new dataset:
###KNN imputation to fill missing values.
import numpy as np
import pandas as pd
from numpy import isnan
from sklearn.impute import KNNImputer
imputer = KNNImputer(n_neighbors=5)
imputer.fit(MyDataSet)
MyNewDataSet = pd.DataFrame(imputer.transform(MyDataSet), columns = MyDataSet.columns)
MyNewDataSet.set_index(MyDataSet.index, inplace= True)
MyNewDataSet = MyNewDataSet.astype(MyDataSet.dtypes.to_dict())
MyNewDataSet
And here is my code to assign values to x and y:
### Split the dataset into a set of features (X) and Target variable (y)
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import HuberRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
y=MyNewDataSet["Target"]
x=MyNewDataSet.drop("Target", axis=1)
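For reference, here is a minimal, self-contained sketch of the same flow run top to bottom in one script. MyDataSet here is just a small hypothetical frame with placeholder names and values; the point is that when everything executes in order like this, MyNewDataSet exists by the time x and y are assigned.
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
#hypothetical stand-in for MyDataSet: two features plus a Target column, with gaps
MyDataSet = pd.DataFrame({
    "feat1": [1.0, 2.0, np.nan, 4.0, 5.0],
    "feat2": [10.0, np.nan, 30.0, 40.0, 50.0],
    "Target": [0, 1, 0, 1, 0],
})
#KNN imputation to fill the missing values, keeping column names and index
imputer = KNNImputer(n_neighbors=2)
MyNewDataSet = pd.DataFrame(imputer.fit_transform(MyDataSet), columns=MyDataSet.columns, index=MyDataSet.index)
#the name is now defined, so the split works
y = MyNewDataSet["Target"]
x = MyNewDataSet.drop("Target", axis=1)
print(x.shape, y.shape)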
I am trying to learn about decision trees and I ended up finding an article about them. The goal of the article is to decide whether a flower is an iris flower or not, but I seem to run into some errors that I hope somebody has the answer to. I get two errors like the following:
iris: Bunch iris: inner_f Instance of 'tuple' has no 'target' member
and
iris: Bunch iris: inner_f Instance of 'tuple' has no 'data' member
I get these errors at the x = iris.data line and at the y = iris.target line.
Here is the code:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
#load iris data
iris = datasets.load_iris()
x = iris.data
y = iris.target
d = [{"sepal_length":row[0],
"sepal_width":row[1],
"petal_length":row[2],
"petal_width":row[3]} for row in x]
df = pd.DataFrame(d) # construct dataframe
df["types"] = y # assign types
df = df.sample(frac=1.0) # random shuffle rows
df.head()
Does anybody know why I get these errors?
Your error message indicates that the problematic value iris is a tuple, which doesn't have the attributes you're referencing. Check the documentation for the tools you're using; they should explain how to unpack datasets.load_iris() into the objects you need.
I would not filter warnings in most cases, as you get useful information from the warnings.
So, the sklearn datasets format is a Bunch, which is a specialized container object that works like a dictionary. You can access it with dot notation, e.g. iris.data, or dictionary notation, e.g. iris['data']. Here, it is unclear what the error is on your machine, as I (like other commenters) had no problem accessing iris.data or iris['data'] in Python 3.8.5.
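As a quick illustration of the two access styles (and of load_iris's return_X_y option, which returns a plain (data, target) tuple and may be what a linter is inferring when it complains about a tuple), here is a short sketch:
from sklearn import datasets
#Bunch access: dot notation and dictionary notation both work
iris = datasets.load_iris()
x = iris.data          #dot notation
y = iris["target"]     #dictionary notation
#alternative: unpack the data and target arrays directly as a tuple
x2, y2 = datasets.load_iris(return_X_y=True)
print(x.shape, y.shape, x2.shape, y2.shape)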
I wanted to let you know a couple of places to improve your approach:
(1) It is unclear why you need to construct a dataframe, as you can get the samples you need directly by calling train_test_split on the numpy arrays, or you can get a random sample of indices from the numpy arrays directly.
(2) Your method for constructing the dataframe is more complex than it needs to be.
import pandas as pd
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
# load iris data
iris = datasets.load_iris()
# train test split
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
# random shuffle of data/target indices
rng = np.random.default_rng()
rng_size = iris.data.shape[0]
idx_sample = rng.choice(np.arange(rng_size), size=rng_size, replace=False)
# simpler way to create dataframe
# concatenate along the columns (axis 1)
# then set the column names in one place
df = pd.concat([pd.DataFrame(iris.data), pd.DataFrame(iris.target)], axis=1)
df.columns = ["sepal_length", "sepal_width", "petal_length", "petal_width", "types"]
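One detail: the idx_sample indices above are computed but never applied. A minimal way to use them (an assumption about the intended shuffle) would be:
#apply the shuffled indices to reorder the dataframe rows
df = df.iloc[idx_sample].reset_index(drop=True)
print(df.head())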
Currently, I'm working on the Titanic dataset on Kaggle. The Age column has some missing values, and I tried to impute them using SimpleImputer from sklearn.impute.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error as mae
from sklearn.model_selection import train_test_split as tts
from sklearn.impute import SimpleImputer
titanic_data = pd.read_csv("../input/titanic/train.csv")
imputer = SimpleImputer(missing_values=np.nan)
features = ['Age', 'Pclass']
X = titanic_data[features]
y = titanic_data.Survived
age_arr = X.Age.values.reshape(1, -1)
imputed_age = pd.DataFrame(imputer.fit_transform(age_arr))
X.Age = imputed_age
print(imputed_age)
As shown above, I have some trouble arranging and converting those arrays and data columns. When I print imputed_age, it gives me a dataframe where each age is a separate column, but I need all of the values in a single Age column. How could I do the imputing easily and put the imputed values back into the dataframe?
How could I put those imputed values into the dataframe?
I asked this on a forum elsewhere and someone gave me a solution. I'll put it here, and I've modified it a bit.
import pandas as pd
import seaborn as sns
from sklearn.impute import SimpleImputer
df = sns.load_dataset("titanic")
features = ["pclass","age"]
X = df.loc[:,features]
y = df.survived
imputer = SimpleImputer()
age_transform = pd.DataFrame(imputer.fit_transform(pd.DataFrame(X.age)),columns=["Age"])
I checked your code and found that if we pass a DataFrame to imputer.fit_transform, we don't need to reshape to (1, -1).
So I just made the age column a DataFrame and passed it to the imputer's fit_transform, and I think it works well.
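To put the imputed values back into the original frame (the second part of the question), one option is to assign the transformed values back by position. This is a minimal sketch that assumes you want to overwrite the age column in place:
#overwrite the original age column with the imputed values
X = X.copy()
X["age"] = age_transform["Age"].values
print(X.isnull().sum())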
I'm trying to work my way toward a slightly more flexible KNN input script than the tutorials based on the iris dataset, but I'm having some trouble (I think) adding the matching 2nd dimension to the numpy array in #6, and again when I come to #11, the fitting.
File "G:\PROGRAMMERING\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 212, in check_consistent_length
" samples: %r" % [int(l) for l in lengths]) ValueError: Found input variables with inconsistent numbers of samples: [150, 1]
x is (150, 5) and y is (150, 1). 150 is the number of samples in both, but they differ in the number of fields; is this the problem, and if so, how do I fix it?
#1. Loading the Pandas libraries as pd
import pandas as pd
import numpy as np
#2. Read data from the file 'custom.csv' placed in your code directory
data = pd.read_csv("custom.csv")
#3. Preview the first 5 lines of the loaded data
print(data.head())
print(type(data))
#4.Test the shape of the data
print(data.shape)
df = pd.DataFrame(data)
print(df)
#5. Convert non-numericals to numericals
print(df.dtypes)
# Any object should be converted to numerical
df['species'] = pd.Categorical(df['species'])
df['species'] = df.species.cat.codes
print("outcome:")
print(df.dtypes)
#6.Convert df to numpy.ndarray
np = df.to_numpy()
print(type(np)) #this should state <class 'numpy.ndarray'>
print(data.shape)
print(np)
x = np.data
y = [df['species']]
print(y)
#K-nearest neighbor (find closest) - search for the K nearest observations in the dataset
#The model calculates the distance to all, and selects the K nearest ones.
#8. Import the class you plan to use
from sklearn.neighbors import (KNeighborsClassifier)
#9. Pick a value for K
k = 2
#10. Instantiate the "estimator" (make an instance of the model)
knn = KNeighborsClassifier(n_neighbors=k)
print(knn)
#11. fit the model with data/model training
knn.fit(x, y)
#12. Predict the response for a new observation
print(knn.predict([[3, 5, 4, 2]]))
This is how I used the scikit-learn KNeighborsClassifier to fit the knn model:
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
df = datasets.load_iris()
X = pd.DataFrame(df.data)
y = df.target
knn = KNeighborsClassifier(n_neighbors = 2)
knn.fit(X,y)
print(knn.predict([[6, 3, 5, 2]]))
#prints output class [2]
print(knn.predict([[3, 5, 4, 2]]))
#prints output class [1]
You don't need to convert the DataFrame to a numpy array; you can fit the model directly on the DataFrame. Also, while converting the DataFrame to a numpy array you named the result np, which is the same name already used for numpy at the top in import numpy as np.
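As a small, hypothetical snippet illustrating why that rebinding is a problem:
import numpy as np
import pandas as pd
df = pd.DataFrame({"a": [1, 2, 3]})
np = df.to_numpy()   #rebinding 'np' shadows the numpy module
#from here on, np.nan (or any other numpy attribute) raises
#AttributeError: 'numpy.ndarray' object has no attribute 'nan'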
The prediction input has 4 columns, leaving out the fifth column, 'species'. Also, if 'species' is the target, it cannot be given to the knn as an input at the same time. The pop removes this particular column from the DataFrame df.
#npdf = df.to_numpy()
df = df.apply(lambda x: pd.Series(x))
#target as a 1-D numpy array
y = np.asarray(df['species'])
#removes the target from the sample
df.pop('species')
x = df.to_numpy()
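With x and y prepared this way, the fit and prediction from steps #11 and #12 should go through. A short follow-up, assuming the custom.csv data has the four iris measurement columns plus 'species', would be:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=2)
knn.fit(x, y)
#predict the class for a single new observation with the same four features
print(knn.predict([[3, 5, 4, 2]]))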
I have this code that normalizes a pandas dataframe.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from sklearn import preprocessing
df = pd.read_csv('DS/RS_DS/final_dataset.csv')
rec_df = df.drop(['person_id','encounter_id','birthdate','CN','HN','DN','DIAG_DM','DIAG_NONDM','TPN'], axis=1)
#normalize values from 0 to 1
df_val = rec_df.values
min_max_scaler = preprocessing.MinMaxScaler()
df_val_scaled = min_max_scaler.fit_transform(df_val)
df_scaled = pd.DataFrame(df_val_scaled)
df_flask = pd.DataFrame([[42.8,151,73,79,0,1,74]],columns=['weight','height','wc','hc','isMale','isFemale','age'])
df_flask_val = df_flask.values
df_flask_val_scaled = min_max_scaler.fit_transform(df_flask_val)
df_flask_scaled = pd.DataFrame(df_flask_val_scaled)
df_scaled returns a dataframe that is normalized. df_flask is a dataframe that I want to normalize based on df_scaled so I can use it for comparison. df_flask_scaled returns all 0s; I think it didn't normalize based on the dataframe. Is there any way to normalize the single-row dataframe, or should I add this data to the dataframe and then normalize?
I think you should do fit and transform separately; this ensures that the distribution of the data used in fitting is preserved. Calling fit_transform on the single-row dataframe refits the scaler on that one row alone, so each column's minimum and maximum are the same value and everything maps to 0.
# initialise scaler
min_max_scaler = preprocessing.MinMaxScaler()
# fit here
min_max_scaler.fit(rec_df.values)
# apply transformation
df_val_scaled = min_max_scaler.transform(rec_df.values)
df_flask_val_scaled = min_max_scaler.transform(df_flask_val)
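If you still want DataFrames with the original column names (an assumption about how you use them downstream), you can wrap the transformed arrays afterwards:
#wrap the scaled arrays back into labelled DataFrames
#(assumes rec_df and df_flask have their columns in the same order)
df_scaled = pd.DataFrame(df_val_scaled, columns=rec_df.columns)
df_flask_scaled = pd.DataFrame(df_flask_val_scaled, columns=df_flask.columns)
print(df_flask_scaled)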