pycaret, regression on target column

I'm trying to apply some machine learning based regression on data from a CSV file. My columns are:
Index(['date', 'customer_id', 'product_category', 'payment_method',
'value [USD]', 'time_on_site', 'clicks_in_site', 'USD/[Minutes]',
'USD/clicks_in_site'],
dtype='object')
When I run:
from pycaret.regression import *
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
exp_reg = setup(data = df, target='value [USD]', session_id=123,
high_cardinality_features = ['product_category'],
normalize = True,
ignore_features = ['customer_id', 'date', 'time_on_site']
)
I get the following error message:
KeyError Traceback (most recent call last)
<ipython-input-43-20eab85de0cc> in <module>()
2 high_cardinality_features = ['product_category'],
3 normalize = True,
----> 4 ignore_features = ['customer_id', 'date', 'time_on_site']
5 )
6
5 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/indexes/base.py in drop(self, labels, errors)
5285 if mask.any():
5286 if errors != "ignore":
-> 5287 raise KeyError(f"{labels[mask]} not found in axis")
5288 indexer = indexer[~mask]
5289 return self.delete(indexer)
KeyError: "['value [USD]'] not found in axis"

I found the solution. The column name 'value [USD]' was the problem. After renaming it, the code works as intended. It probably has something to do with the brackets inside the column name, which may be interpreted as a dictionary or list, but I'm not sure.
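The rename described above might look like the sketch below. The tiny DataFrame and the new name value_usd are made up for illustration; only the original column name comes from the question. Whether the brackets are really the cause is not confirmed here.

```python
import pandas as pd

# Toy stand-in for the CSV data; only 'value [USD]' comes from the question.
df = pd.DataFrame({
    "value [USD]": [10.0, 25.5, 7.2],
    "clicks_in_site": [3, 8, 2],
})

# Rename the target column to a name without brackets or spaces.
# 'value_usd' is an arbitrary choice for this sketch.
df = df.rename(columns={"value [USD]": "value_usd"})

print(df.columns.tolist())
```

The new name would then be passed as target= in the setup() call.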

Related

Error when trying to incorporate target data from sklearn in Python

I am trying to build a matrix to evaluate how different features impact the data set's target. Here I use scikit-learn's breast cancer data. My code is below, but it raises an error that I could not figure out how to fix.
import numpy as np
import seaborn as sns; sns.set(style="ticks", color_codes=True)
import sklearn.datasets
import pandas as pd
from sklearn.datasets import load_breast_cancer
WBC_dataset = load_breast_cancer()
WBC_df = pd.DataFrame(
data= np.c_[WBC_dataset['data'],WBC_dataset['target']],
columns= np.append(WBC_dataset['feature_names'], ['Condition']))
cols = WBC_dataset.columns.drop('Condition')
WBC_df[cols] = WBC_df[cols].apply(pd.to_numeric)
g = sns.pairplot(WBC_df, hue='Condition')
Here is the error:
KeyError Traceback (most recent call last)
~\Anaconda3\lib\site-packages\sklearn\utils\__init__.py in __getattr__(self, key)
104 try:
--> 105 return self[key]
106 except KeyError:
KeyError: 'columns'
During handling of the above exception, another exception occurred:
AttributeError Traceback (most recent call last)
<ipython-input-245-1123cb0b25bb> in <module>
9 data= np.c_[WBC_dataset['data'],WBC_dataset['target']],
10 columns= np.append(WBC_dataset['feature_names'], ['Condition']))
---> 11 cols = WBC_dataset.columns.drop('Condition')
12 WBC_df[cols] = WBC_df[cols].apply(pd.to_numeric)
13 g = sns.pairplot(WBC_df, hue='Condition')
~\Anaconda3\lib\site-packages\sklearn\utils\__init__.py in __getattr__(self, key)
105 return self[key]
106 except KeyError:
--> 107 raise AttributeError(key)
108
109 def __setstate__(self, state):
AttributeError: columns

WBC_dataset is not a DataFrame; it is a dict-like object containing multiple values, so it has no columns attribute. Check the documentation.
data = load_breast_cancer()
WBC_df = pd.DataFrame(data.data,columns = data.feature_names)
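To keep the target column and the rest of the question's code, the column names can be taken from the DataFrame rather than from the loaded dataset object. The sketch below uses a small hand-made dict (hypothetical values) in place of the object returned by load_breast_cancer(), so that it runs without scikit-learn; the keys mirror the ones used in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for load_breast_cancer(): a dict-like object
# with 'data', 'target' and 'feature_names' keys, as used in the question.
dataset = {
    "data": np.array([[17.99, 10.38], [20.57, 17.77], [11.42, 20.38]]),
    "target": np.array([0, 0, 1]),
    "feature_names": np.array(["mean radius", "mean texture"]),
}

# Build the DataFrame from features plus target, as in the question.
WBC_df = pd.DataFrame(
    data=np.c_[dataset["data"], dataset["target"]],
    columns=np.append(dataset["feature_names"], ["Condition"]),
)

# The column index lives on the DataFrame, not on the dataset object.
cols = WBC_df.columns.drop("Condition")
WBC_df[cols] = WBC_df[cols].apply(pd.to_numeric)

print(list(cols))
```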

The raw WBC dataset file contains non-numeric entries that need removing before you pass the data to sklearn.
By default, when pandas reads a column containing such entries, it creates the column with type object.
The easiest way to fix this is to clean the data up at import time using pandas:
## creating a custom missing value list
missing_values = ["NA","N/a",np.nan,"?"]
l1 = pd.read_csv("../../DataSets/Breast cancer dataset/breast-cancer-wisconsin.data",
header=None,na_values=missing_values,
names=['id','clump_thickness','uniformity_of_cell_size','uniformity_of_cell_shape','marginal_adhesion','single_epithelial_cell_size','bare_nuclei','bland_chromatin','normal_nucleoli','mitoses','diagnosis'])
l1 = l1.dropna()
### 'bare_nuclei' is the column with 16 occurrences of "?" in it, so dropna() removes the rows containing "?".
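The effect of na_values can be checked on a small in-memory sample. This is only a sketch: the three rows below are made up, and just three of the column names from the answer are used.

```python
import io
import pandas as pd

# Made-up sample in the spirit of the raw file: '?' marks a missing value.
raw = io.StringIO(
    "1000025,5,1\n"
    "1002945,5,?\n"
    "1015425,3,2\n"
)

df = pd.read_csv(
    raw,
    header=None,
    na_values=["?"],
    names=["id", "clump_thickness", "bare_nuclei"],
)

# Count the '?' entries that were read as NaN, then drop those rows.
print(df["bare_nuclei"].isna().sum())
clean = df.dropna()
print(clean.dtypes)
```

Because "?" is read as NaN, the column is parsed as numeric instead of object.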

Dealing with outliers in Pandas [duplicate]

Good day. My problem is the following: when I try to remove outliers from one of the columns in the table
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
from scipy.stats import norm
from scipy import stats
import numpy as np
df = pd.read_csv("https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBM-DA0321EN-SkillsNetwork/LargeData/m2_survey_data.csv")
df["ConvertedComp"].plot(kind="box", figsize=(10,10))
z_scores = stats.zscore(df["ConvertedComp"])
abs_z_scores = np.abs(z_scores)
filtered_entries = (abs_z_scores < 3).all(axis=1)
new_df = df[filtered_entries]
the following error is raised.
---------------------------------------------------------------------------
AxisError Traceback (most recent call last)
<ipython-input-133-7811da442811> in <module>
4 z_scores
5 abs_z_scores = np.abs(z_scores)
----> 6 filtered_entries = (abs_z_scores < 3).all(axis=1)
7 #new_df = df[filtered_entries]
C:\ProgramData\WatsonStudioDesktop\miniconda3\envs\desktop\lib\site-packages\numpy\core\_methods.py in _all(a, axis, dtype, out, keepdims)
44
45 def _all(a, axis=None, dtype=None, out=None, keepdims=False):
---> 46 return umr_all(a, axis, dtype, out, keepdims)
47
48 def _count_reduce_items(arr, axis):
AxisError: axis 1 is out of bounds for array of dimension 1
I would be grateful for your advice; I am almost out of ideas.

Your zscore is computed over only one column, so the result is a one-dimensional array and there is no axis 1 to reduce over:
z_scores = stats.zscore(df["ConvertedComp"])
new_df = df[np.abs(z_scores) < 3]
If you ran zscore over multiple columns, your original code would work:
z_scores = stats.zscore(df[["ConvertedComp", 'AnotherColumn']])
new_df = df[(np.abs(z_scores) < 3).all(axis=1)]
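If the column contains missing values, it may help to compute the z-scores with plain pandas arithmetic, which skips NaN when taking the mean and standard deviation. The sketch below is not scipy's stats.zscore but an equivalent formula (population standard deviation, ddof=0); the data are made up and only the column name comes from the question.

```python
import numpy as np
import pandas as pd

# Made-up data: 20 ordinary values, one extreme value and one missing value.
df = pd.DataFrame(
    {"ConvertedComp": [50.0, 52.0, 48.0, 51.0, 49.0] * 4 + [10_000.0, np.nan]}
)

col = df["ConvertedComp"]

# z-score of a single column; mean() and std() ignore NaN by default.
z = (col - col.mean()) / col.std(ddof=0)

# One-dimensional boolean mask, so no .all(axis=1) is needed.
# Rows with NaN compare as False and are dropped as well.
mask = z.abs() < 3
new_df = df[mask]

print(len(df), len(new_df))
```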

A squared variable is outside the index

(A variation of this post, without the detailed traceback, was posted on SO about two hours ago. This version contains the whole traceback.)
I am running StatsModels to get parameter estimates from ordinary least squares (OLS). Data-processing and model-specific commands are shown below. When I use import statsmodels.formula.api as sm as the operative API, the OLS works as desired (after I drop some 15 rows programmatically), giving intuitive results. But when I switch to import statsmodels.api as sm as the binding API, with almost no changes to the code, things fall apart, and the Python interpreter raises an error saying that 'inc_2' is not in the index. Mind you, inc_2 was computed after the dataframe was read into StatsModels in both model runs, and yet the run was successful in the first case but not in the second. (BTW, p_c_inc_18 is per-capita income, and inc_2 is the former squared. inc_2 is the offending element in the second run.)
import pandas as pd
import numpy as np
import statsmodels.api as sm
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set(style="whitegrid")
eg=pd.read_csv(r'C:/../../../une_edu_pipc_06.csv')
pd.options.display.precision = 3
plt.rc("figure", figsize=(16,8))
plt.rc("font", size=14)
sm_col = eg["lt_hsd_17"] + eg["hsd_17"]
eg["ut_hsd_17"] = sm_col
sm_col2 = eg["sm_col_17"] + eg["col_17"]
eg["bnd_hsd_17"] = sm_col2
eg["d_09"]= eg["Rate_09"]-eg["Rate_06"]
eg["d_10"]= eg["Rate_10"]-eg["Rate_06"]
inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
X = eg[["p_c_inc_18","ut_hsd_17","d_10","inc_2"]]
y = eg["Rate_18"]
X = sm.add_constant(X)
mod = sm.OLS(y, X)
res = mod.fit()
print(res.summary())
Here is the traceback in full.
KeyError Traceback (most recent call last)
<ipython-input-21-e2f4d325145e> in <module>
17 eg["d_10"]= eg["Rate_10"]-eg["Rate_06"]
18 inc_2=eg["p_c_inc_18"]*eg["p_c_inc_18"]
---> 19 X = eg[["p_c_inc_18","ut_hsd_17","d_10","inc_2"]]
20 y = eg["Rate_18"]
21 X = sm.add_constant(X)
~\Anaconda3\lib\site-packages\pandas\core\frame.py in __getitem__(self, key)
2804 if is_iterator(key):
2805 key = list(key)
-> 2806 indexer = self.loc._get_listlike_indexer(key, axis=1, raise_missing=True)[1]
2807
2808 # take() does not accept boolean indexers
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _get_listlike_indexer(self, key, axis, raise_missing)
1550 keyarr, indexer, new_indexer = ax._reindex_non_unique(keyarr)
1551
-> 1552 self._validate_read_indexer(
1553 keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
1554 )
~\Anaconda3\lib\site-packages\pandas\core\indexing.py in _validate_read_indexer(self, key, indexer, axis, raise_missing)
1644 if not (self.name == "loc" and not raise_missing):
1645 not_found = list(set(key) - set(ax))
-> 1646 raise KeyError(f"{not_found} not in index")
1647
1648 # we skip the warning on Categorical/Interval
KeyError: "['inc_2'] not in index"
What am I doing wrong?

Indexing eg with a list of strings selects the columns with those names, so every name in the list has to be a column of eg. If you print(eg.columns), you'll see that it has no 'inc_2' column: inc_2 was computed as a standalone variable and never assigned back to the DataFrame. Assign it as a column first, then select as before.
eg["inc_2"] = eg["p_c_inc_18"] * eg["p_c_inc_18"]
X = eg[["p_c_inc_18", "ut_hsd_17", "d_10", "inc_2"]]
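One thing worth checking in the question's code is that inc_2 is assigned to a standalone variable rather than to a column of eg, which is consistent with the KeyError. A minimal sketch with made-up numbers (only the column names are borrowed from the question) shows the difference:

```python
import pandas as pd

# Toy data; the column name is borrowed from the question, the values are made up.
eg = pd.DataFrame({"p_c_inc_18": [30.0, 45.0, 52.0]})

# A standalone variable does not become a column of eg.
inc_2 = eg["p_c_inc_18"] ** 2
try:
    eg[["p_c_inc_18", "inc_2"]]
    missing = False
except KeyError:
    missing = True

# Assigning the result to eg makes it selectable by name.
eg["inc_2"] = inc_2
X = eg[["p_c_inc_18", "inc_2"]]

print(missing, list(X.columns))
```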

ValueError: Found infinity in column x

I got the error ValueError: Found infinity in column x.
The traceback says
---> 20 model.fit(df)
242 df['x'] = pd.to_numeric(df['x'])
243 if np.isinf(df['x'].values).any():
--> 244 raise ValueError('Found infinity in column y.')
245 df['d'] = pd.to_datetime(df['d'])
246 if df['d'].isnull().any():
I really cannot understand what this error message means, because I do not have any infinite values in df. How should I fix this? What is wrong with my code?
My code is
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from fbprophet import Prophet
for i in range(10):
    # str(i) is needed because a str and an int cannot be concatenated
    df = pd.read_csv('data' + str(i) + '.csv', encoding='shift-jis')
    model = Prophet()
    model.fit(df)
    future_data = model.make_future_dataframe(periods=12, freq = 'm')
    forecast_data = model.predict(future_data)
    model.plot(forecast_data)
    model.plot_components(forecast_data)
    plt.show()

You need to remove the infinite values from your DataFrame. This can be done like this:
DataFrame.replace([np.inf, -np.inf], np.nan)
Once the infinite values have been replaced with NaN, you can remove those rows from the DataFrame via dropna:
DataFrame.dropna(subset=["YourColumn"], how="all")
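Put together, the two steps might look like the sketch below. The frame and its column names d and x are made up to mirror the traceback; note that replace() and dropna() return new DataFrames by default, so the results have to be assigned back.

```python
import numpy as np
import pandas as pd

# Made-up frame with one infinite value in the 'x' column.
df = pd.DataFrame({
    "d": ["2020-01-31", "2020-02-29", "2020-03-31"],
    "x": [1.5, np.inf, 2.5],
})

# Check whether any infinities are present before fitting.
print(np.isinf(df["x"]).any())

# Replace +/- infinity with NaN, then drop the affected rows.
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna(subset=["x"])

print(len(df))
```

Running the np.isinf check on each file before model.fit(df) would show which input actually contains the infinite value.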

Convert DF into Numpy Array for calculations

I have the data in a dataframe format that I will use for linear regression calculation using user-built function. Here is the code:
from sklearn.datasets import load_boston
boston = load_boston()
bos = pd.DataFrame(boston.data) # convert to DF
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
y = bos.PRICE
x = bos.drop('PRICE', axis = 1) # DROP PRICE since only want X-type variables (not Y-target)
xw = df.to_array(x)
xw = np.insert(xw,0,1, axis = 1) # to insert a column of "1" values
However, I am getting the error:
AttributeError Traceback (most recent call last)
<ipython-input-131-272f1b4d26ba> in <module>()
1 import copy
2
----> 3 xw = df.to_array(x)
AttributeError: 'int' object has no attribute 'to_array'
I am not sure where the problem is. I need to pass an array of values (x in this case) to the function to execute some matrix operations.
The insert function was working when I developed the code step by step, but for some reason it is failing here.
I tried:
xw = copy.deepcopy(x)
with no success
Any thoughts?

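It is x.as_matrix(), not df.to_array(x). Note that as_matrix() was deprecated and then removed in pandas 1.0; on current versions, use x.to_numpy() (or x.values) instead.
Please refer to the pandas documentation for more detail.
Here is code that works on pandas versions that still provide as_matrix():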
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
boston = load_boston()
bos = pd.DataFrame(boston.data) # convert to DF
bos.columns = boston.feature_names
bos['PRICE'] = boston.target
y = bos.PRICE
x = bos.drop('PRICE', axis = 1) # DROP PRICE since only want X-type variables (not Y-target)
xw = x.as_matrix()
xw = np.insert(xw,0,1, axis = 1) # to insert a column of "1" values
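On recent pandas versions, the same conversion can be sketched with to_numpy(). The small DataFrame below is made up and merely stands in for the feature table x from the question.

```python
import numpy as np
import pandas as pd

# Small made-up feature table standing in for x from the question.
x = pd.DataFrame({"CRIM": [0.1, 0.2, 0.3], "RM": [6.5, 6.4, 7.1]})

# Convert the DataFrame to a 2-D NumPy array.
xw = x.to_numpy()

# Insert a column of ones at position 0 (e.g. for an intercept term).
xw = np.insert(xw, 0, 1, axis=1)

print(xw.shape)
```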
