I'm trying to run the code provided by yhat in their article about random forests in Python, but I keep getting the following error message:
File "test_iris_with_rf.py", line 11, in <module>
df['species'] = pd.Factor(iris.target, iris.target_names)
AttributeError: 'module' object has no attribute 'Factor'
Code:
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
import pandas as pd
import numpy as np
iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
print df
print iris.target_names
df['is_train'] = np.random.uniform(0, 1, len(df)) <= .75
df['species'] = pd.Factor(iris.target, iris.target_names)
df.head()
In newer versions of pandas, the Factor is called Categorical instead. Change your line to:
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
Categorical variables seem to be one of the more active areas of development in pandas, so I believe it has changed yet again in pandas 0.15.0:
df['species'] = pd.Categorical.from_codes(iris.target, iris.target_names)
(I lacked sufficient reputation to add this as a comment on David Robinson's answer)
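For reference, here is a minimal sketch of what `from_codes` does. The hard-coded codes and names are stand-ins for `iris.target` and `iris.target_names` (illustrative values only, not the real dataset):

```python
import numpy as np
import pandas as pd

# Stand-ins for iris.target and iris.target_names (illustrative values only).
codes = np.array([0, 0, 1, 2, 1])
names = ["setosa", "versicolor", "virginica"]

# Each integer code is treated as an index into the list of category names.
species = pd.Categorical.from_codes(codes, categories=names)

df = pd.DataFrame({"species": species})
print(df["species"].tolist())
```

The integer codes stay available through `species.codes` if a classifier needs numeric labels later.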
def factor(series):
    # input should be a pandas Series object
    dic = {}
    for i, val in enumerate(series.value_counts().index):
        dic[val] = i
    return [dic[val] for val in series.values]
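If the goal is just integer codes for a column, pandas' built-in `pd.factorize` may be enough. One difference worth noting: `factorize` numbers values in order of first appearance, whereas the function above numbers them by frequency (the most common value gets 0). A small sketch with made-up data:

```python
import pandas as pd

series = pd.Series(["b", "a", "b", "c"])  # made-up example data

# codes: one integer per element; uniques: the distinct values in first-seen order.
codes, uniques = pd.factorize(series)
print(codes.tolist())
print(list(uniques))
```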
I am trying to learn about decision trees and found a tutorial on the subject. The goal of the tutorial is to decide whether a flower is a special type of flower called iris, but I seem to run into some errors that I hope somebody has the answer to. I get two errors like the following:
iris: Bunch
iris: inner_f
Instance of 'tuple' has no 'target' member
and
iris: Bunch
iris: inner_f
Instance of 'tuple' has no 'data' member
here is the code:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
#load iris data
iris = datasets.load_iris()
x = iris.data
y = iris.target
d = [{"sepal_length":row[0],
"sepal_width":row[1],
"petal_length":row[2],
"petal_width":row[3]} for row in x]
df = pd.DataFrame(d) # construct dataframe
df["types"] = y # assign types
df = df.sample(frac=1.0) # random shuffle rows
df.head()
Does anybody know why I get these errors?
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
%matplotlib inline
boston = load_boston()
print(boston.keys())
When I type this I get the output:
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
so I know that feature_names is an attribute. However, when I type
boston.columns = boston.feature_names
the output comes as
'DataFrame' object has no attribute 'feature_names'
To convert the boston sklearn dataset to a pandas DataFrame, use:
df = pd.DataFrame(boston.data,columns=boston.feature_names)
df['target'] = pd.Series(boston.target)
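Here is a sketch of the same pattern with a small stand-in object, since `load_boston` is not available in recent scikit-learn releases (it was removed in version 1.2). The `bunch` object and its values are hypothetical; only the attribute names `data`, `target`, and `feature_names` mirror what the question shows:

```python
from types import SimpleNamespace

import numpy as np
import pandas as pd

# Hypothetical stand-in for the object returned by a scikit-learn loader.
bunch = SimpleNamespace(
    data=np.array([[0.1, 2.0], [0.2, 3.0], [0.3, 4.0]]),
    target=np.array([24.0, 21.6, 34.7]),
    feature_names=np.array(["CRIM", "ZN"]),
)

# Keep the loader result and the DataFrame under different names, so the
# loader's attributes (feature_names, target) stay reachable.
df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
df["target"] = bunch.target
print(df.columns.tolist())
```

The error message in the question suggests that `boston` had already been rebound to a DataFrame at that point, which would explain why `feature_names` could no longer be found on it.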
I had something similar, also with scikit-learn, while building a random forest with this tutorial:
https://www.datacamp.com/tutorial/random-forests-classifier-python
I stumbled upon this line of code:
import pandas as pd
feature_imp = pd.Series(clf.feature_importances_,**index=iris.feature_names**).sort_values(ascending=False)
feature_imp
I got an error from the bold part (between the **).
Thanks to the suggestions of @anky and @David Meu I tried:
feature_imp = pd.Series(clf.feature_importances_, index = dfNAN.columns.values).sort_values(ascending=False)
that results in the error:
ValueError: Length of values (4) does not match length of index (5)
so I tried:
feature_imp = pd.Series(clf.feature_importances_, index = dfNAN.columns.values[:4]).sort_values(ascending=False)
which works!
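The length mismatch presumably arises because the DataFrame has one more column (such as the target) than the model has features. Rather than slicing with `[:4]`, it may be safer to drop the non-feature column by name. A sketch with made-up importances and a hypothetical `species` target column:

```python
import numpy as np
import pandas as pd

# Made-up importances standing in for clf.feature_importances_ (4 features).
importances = np.array([0.10, 0.05, 0.45, 0.40])

# Hypothetical frame: 4 feature columns plus one target column.
df = pd.DataFrame(
    columns=["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
)

# Using every column as the index fails: 4 values vs. 5 labels.
try:
    pd.Series(importances, index=df.columns)
    mismatch_raised = False
except ValueError:
    mismatch_raised = True

# Drop the target column by name so the index lines up with the features.
feature_names = df.columns.drop("species")
feature_imp = pd.Series(importances, index=feature_names).sort_values(ascending=False)
print(feature_imp)
```

Dropping by name keeps working if the target column is not the last one, which positional slicing does not guarantee.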
I am running this code
import pandas as pd
import numpy as np
from sklearn import cluster
from sklearn.cluster import KMeans
model = cluster.KMeans(n_clusters=4, random_state=10)
Then I fit it on a dataframe I am working on, using its age and income columns, which are the features I am clustering on:
model.fit(df[['income', 'age']])
So far it works well, until I run the following bit, which aims to create a column with the label of the cluster each data point belongs to.
df['cluster'] = model.labels_df.head()
And this is the error code I get:
AttributeError: 'KMeans' object has no attribute 'labels_df'
Any suggestions?
The attribute to access the labels of the model is: model.labels_
Use:
df['cluster'] = model.labels_
By typing model.labels_df.head() you request the head of model.labels_df, which does not exist.
I believe you have mistyped it and you need:
df['cluster'] = model.labels_
df.head()
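A small end-to-end sketch of the corrected version, using made-up income and age values in place of the question's dataframe (the numbers and the choice of three clusters are assumptions for illustration):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up data standing in for the question's income and age columns.
df = pd.DataFrame({
    "income": [20, 22, 21, 80, 82, 81, 50, 52],
    "age": [25, 27, 26, 50, 52, 51, 35, 36],
})

model = KMeans(n_clusters=3, n_init=10, random_state=10)
model.fit(df[["income", "age"]])

# labels_ is a NumPy array with one cluster index per fitted row.
df["cluster"] = model.labels_
print(df.head())
```

Putting the assignment and `df.head()` on separate lines is what avoids the accidental `labels_df` attribute lookup.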
I have used the following code to convert the sklearn breast cancer data set to a data frame, but I am not getting any output. I am very new to Python and not able to figure out what is wrong.
def answer_one():
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    data = numpy.c_[cancer.data, cancer.target]
    columns = numpy.append(cancer.feature_names, ["target"])
    return pandas.DataFrame(data, columns=columns)
answer_one()
Use pandas
There was a great answer here: How to convert a Scikit-learn dataset to a Pandas dataset?
The keys of the Bunch object give you an idea of which data you can turn into columns.
df = pd.DataFrame(cancer.data, columns=cancer.feature_names)
df['target'] = pd.Series(cancer.target)
The following code works
def answer_one():
    import numpy as np
    import pandas as pd
    from sklearn.datasets import load_breast_cancer
    cancer = load_breast_cancer()
    data = np.c_[cancer.data, cancer.target]
    columns = np.append(cancer.feature_names, ["target"])
    return pd.DataFrame(data, columns=columns)
answer_one()
The reason your code didn't work before is that you referred to the numpy and pandas packages by their full names after importing them as np and pd respectively.
However, I suggest doing the package imports and aliasing at the beginning of the script, outside the function definition.
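A sketch of that layout, with the imports at module level and only the aliases `np` and `pd` used afterwards. The helper name `to_frame` and the small arrays are made up for illustration; they stand in for `cancer.data`, `cancer.target`, and `cancer.feature_names`:

```python
import numpy as np
import pandas as pd


def to_frame(data, target, feature_names):
    """Combine features and target into one DataFrame."""
    values = np.c_[data, target]  # append target as the last column
    columns = np.append(feature_names, ["target"])
    return pd.DataFrame(values, columns=columns)


# Made-up stand-ins for cancer.data, cancer.target, cancer.feature_names.
frame = to_frame(
    data=np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]),
    target=np.array([0, 1, 0]),
    feature_names=np.array(["mean radius", "mean texture"]),
)
print(frame)
```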
As of scikit-learn 0.23 you can do the following to get a DataFrame and save some keystrokes:
df = load_breast_cancer(as_frame=True)
df.frame
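A slightly fuller sketch of the `as_frame=True` route. The result is still a Bunch, so naming it `cancer` rather than `df` may be less confusing; its `frame`, `data`, and `target` attributes then hold pandas objects:

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True requires scikit-learn 0.23 or newer.
cancer = load_breast_cancer(as_frame=True)

df = cancer.frame                   # features plus a "target" column
X, y = cancer.data, cancer.target   # features DataFrame and target Series
print(df.shape)
```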
dataframe = pd.DataFrame(data=cancer.data, columns=cancer.feature_names)
dataframe['target'] = cancer.target
return dataframe
I am a newbie in Machine Learning and am following a great tutorial on YouTube, but the following code gives me an error. I read a similar question here, but timetuple() does not solve my case, nor do any of the solutions from the video.
Here is my code:
import pandas as pd
import quandl, math
from datetime import datetime, date, time, timedelta
import time
import numpy as np
from sklearn import preprocessing, cross_validation, svm
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt #plot stuff, how to plot in graph
from matplotlib import style #nice looking thing
style.use('ggplot') #which nice-looking-thing i wanna use
quandl.ApiConfig.api_key = '...' #erased my key for secrecy
df = quandl.get_table('WIKI/PRICES')
## ... ##other irrelevant code snippets
forecast_out = int(math.ceil(0.01*len(df)))
df['label'] = df[forecast_col].shift(-forecast_out)
X = np.array(df.drop(['label'],1))
X = preprocessing.scale(X)
X_lately = X[-forecast_out:]
X = X[:-forecast_out]
df.dropna(inplace=True)
y = np.array(df['label'])
# ... #other irrelevant code snippets
forecast_set = clf.predict(X_lately)
df['Forecast'] = np.nan
last_date = df.iloc[-1].name
last_unix = last_date.timestamp() ###MAIN attribute error found here
one_day = 86400
next_unix = last_unix + one_day
For the above code, I got the following error:
AttributeError Traceback (most recent call last)
<ipython-input-8-4a1a193ea81d> in <module>()
1 last_date = df.iloc[-1].name
----> 2 last_unix = last_date.timestamp()
3 one_day = 86400
4 next_unix = last_unix + one_day
AttributeError: 'numpy.int64' object has no attribute 'timestamp'
I couldn't figure out the solution. There are many solutions on the internet, but nothing worked for me. I am using Python 3.5 in Anaconda. timetuple() doesn't work for me either; the same attribute error occurs.
It seems the rows are not indexed by date, so when you try to get last_date you actually get an int instead of a date.
As I understand it, you can add a date index by using the following line after the code that reads the CSV: df.set_index('date', inplace=True)
Replace last_date.timestamp() with time.mktime(datetime.datetime.strptime(last_date,"%d%m%y").timetuple())
Hi, sometimes when you do such operations the index changes, and the Date column is no longer the index.
df=df.set_index('Date')
Just add this right after you create the dataframe.
Example:
Add it after df = quandl.get_table('WIKI/PRICES') or df=pd.read_csv("stock_data.csv").
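A self-contained sketch of the date-index fix, using a few made-up rows in place of the downloaded table. The column names `date` and `adj_close` are assumptions; the real table may spell them differently (for example `Date`):

```python
import pandas as pd

# Made-up rows standing in for the downloaded price table.
df = pd.DataFrame({
    "date": ["2018-03-23", "2018-03-26", "2018-03-27"],
    "adj_close": [100.0, 101.5, 99.8],
})

# With the default integer index, the last row's name is an integer,
# which has no timestamp() method.
print(type(df.iloc[-1].name))

# Parse the dates and use them as the index.
df["date"] = pd.to_datetime(df["date"])
df = df.set_index("date")

last_date = df.iloc[-1].name        # now a pandas Timestamp
last_unix = last_date.timestamp()   # seconds since the epoch
one_day = 86400
next_unix = last_unix + one_day
next_date = pd.to_datetime(next_unix, unit="s")
print(next_date)
```

Converting the column with `pd.to_datetime` before calling `set_index` matters here: if the dates stay as plain strings, the index labels are strings and `timestamp()` is still unavailable.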