FeatureAgglomeration: feature_names_in and get_feature_names_out - python

I used FeatureAgglomeration to cluster my 105x105 dataframe into 40 clusters based on Spearman correlation. Now I want to get the output feature names using feature_names_in_ and get_feature_names_out, but it does not seem to work, and I cannot find a solution. This is my code:
import pandas as pd
import numpy as np
from sklearn.cluster import FeatureAgglomeration
features = np.array([...])
print(features.shape)
>>> (105,)
Class1_rank=pd.read_excel(r'H:\PycharmProjects\RadiomicsPipeline\Class1_rank.xlsx')
print(Class1_rank)
>>>                                 original_shape_Elongation  ...  original_ngtdm_Strength
original_shape_Elongation                            1.000000  ...                -0.054310
original_shape_Flatness                              0.616327  ...                -0.019544
original_shape_LeastAxisLength                       0.271645  ...                -0.293157
>>> [105 rows x 105 columns]
agglo = FeatureAgglomeration(n_clusters=40)   # fitting step with 40 clusters, as described above
agglo.fit(Class1_rank)
print(agglo.n_features_in_)
>>> 105
print(agglo.feature_names_in_(Class1_rank))
print(agglo.get_feature_names_out())
df_reduced = agglo.transform(Class1)
At print(agglo.feature_names_in_(Class1_rank)) I get the following error:
TypeError: 'numpy.ndarray' object is not callable
However, Class1_rank is a DataFrame, so why do I get that error? What am I doing wrong here?
What I have tried:
Commenting out print(agglo.feature_names_in_(Class1_rank)) works, but then print(agglo.get_feature_names_out()) gives the following result, not the names of the features I included:
['featureagglomeration0' 'featureagglomeration1' 'featureagglomeration2' 'featureagglomeration3' 'featureagglomeration4'....]
Using features as the input for both functions gives the same error.
Inserting the features as strings into Class1_rank gives the same error.

feature_names_in_ is an array, not a callable, so agglo.feature_names_in_ is correct, but adding parentheses after it (empty or not) attempts to call the array, which is exactly the TypeError you see.
get_feature_names_out() gives a name for each cluster; these are not in one-to-one correspondence with the input features, so it cannot return something like the original feature names. You can, however, use the labels_ attribute to find which input features went into which output feature, for example:
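A minimal sketch of that mapping, assuming agglo was fit on the DataFrame Class1_rank as above (fitting on a DataFrame is what populates feature_names_in_):

names = agglo.feature_names_in_   # an array attribute: no parentheses
for cluster_id, out_name in enumerate(agglo.get_feature_names_out()):
    members = names[agglo.labels_ == cluster_id]   # input features merged into this cluster
    print(out_name, '->', list(members))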

Related

How to fix "DeprecationWarning: DataFrames with non-bool types result in worse computationalperformance..."

I have been trying to implement the Apriori algorithm in Python. There are several examples online; they all use similar methods and mostly the same example dataset. The reference link: https://www.kaggle.com/code/rockystats/apriori-algorithm-or-market-basket-analysis/notebook
(starting from the line [26])
I have a different dataset that has the same structure as the example datasets online. I keep getting this warning:
"DeprecationWarning: DataFrames with non-bool types result in worse computational performance and their support might be discontinued in the future. Please use a DataFrame with bool type"
Here is my code:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
df1 = pd.read_csv(r'C:\Users\USER\dataset', sep=';')
df=df1.fillna(0)
basket = pd.pivot_table(data=df, index='cust_id', columns='Product',
                        values='quantity', aggfunc='count', fill_value=0.0)

def convert_into_binary(x):
    if x > 0:
        return 1
    else:
        return 0
basket_sets = basket.applymap(convert_into_binary)
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
print(frequent_itemsets)
# association rule
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
print(rules)
In addition, in the last step of my code, I get an empty dataframe; I can see the column headings of the dataset but the output is empty.
Empty DataFrame
Columns: [antecedents, consequents, antecedent support, consequent support, support, confidence, lift, leverage, conviction]
Index: []
I am not sure if this issue is related to this error that I am having. I am new to python and I would really appreciate assistance and support on this issue.
I ran into the same issue even after converting my dataframe fields to 0 and 1.
The fix was just making sure the apriori module knows the dataframe is of boolean type, so in your case you should run this:
frequent_itemsets = apriori(basket_sets.astype('bool'), min_support=0.07, use_colnames=True)
In addition, in the last step of my code, I get an empty dataframe; I can see the column headings of the dataset but the output is empty.
Try using a smaller min_support
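As a side note, the 0/1 conversion can produce a boolean frame in one step, which also silences the warning; a small sketch (min_support=0.01 is just an illustrative lower value):

basket_sets = basket > 0   # element-wise comparison yields a bool-typed DataFrame
frequent_itemsets = apriori(basket_sets, min_support=0.01, use_colnames=True)
print(frequent_itemsets)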

AttributeError: 'numpy.ndarray' object has no attribute 'rolling' arises only after filtering the CSV data

If I pass the CSV data in the way given below, it produces the expected output.
data = pd.read_csv("abc.csv")
avg = data['A'].rolling(3).mean()
print(avg)
But if I pass the value in the following way, it produces an error.
import scipy.signal

dff1 = abs(data['A'])
b, a = scipy.signal.butter(2, 0.05, 'highpass')
dff = scipy.signal.filtfilt(b, a, dff1)
avg = dff.rolling(3).mean()
print(avg)
Error is:
AttributeError: 'numpy.ndarray' object has no attribute 'rolling'
I can't figure out what is wrong with the code.
After applying dff = pd.DataFrame(dff), a new problem arises: an unexpected zero is displayed at the top. What is the reason behind this, and how do I get rid of it?
rolling is a method of Pandas Series and DataFrames. SciPy knows nothing about these and produces NumPy ndarrays as output; it can accept dataframes and series as input because the Pandas types can mimic ndarrays when needed.
The solution might be as simple as re-wrapping the ndarray as a dataframe:
dff = pd.DataFrame(dff)
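Or, to keep the original index so the result lines up with the rest of your data, a sketch along these lines (the stray 0 you saw at the top after wrapping is just the default column label a DataFrame assigns to a 1-D array, not a data value):

dff = pd.Series(scipy.signal.filtfilt(b, a, dff1), index=dff1.index)
avg = dff.rolling(3).mean()
print(avg)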

How do I remove float64 from Pandas query

I am working on my first student project with the Iris Dataset and learning Pandas. I wondered if anyone could help. I'm trying to remove dtype: float64 from the pandas results. I have also noticed the results are prefixed with [37m on the other part of the print statement.
Reading solutions to similar questions I have tried substituting
IrisData = pd.read_csv('IRIS.csv')
with
IrisData = pd.loadtxt('IRIS.csv', dtype='float')
but this returns an error (loadtxt is a NumPy function, not a pandas one):
raise AttributeError(f"module 'pandas' has no attribute '{name}'")
AttributeError: module 'pandas' has no attribute 'loadtxt'
CODE USED TO GET THE AVERAGE SIZE OF ALL IRIS
# importing pandas as pd
import pandas as pd
# Creating the dataframe
IrisData = pd.read_csv('IRIS.csv')
# sum over the column axis.
averageofdata = IrisData.mean(axis = 0, skipna = True)
print("Average Sizes of All Iris Data")
print(averageofdata)
RESULTS OF CODE
[output: a pandas Series of the column means, ending with the line "dtype: float64"]
You should distinguish between the data itself and the way it is displayed. The dtype: float64 line appears because you are printing a pandas Series. A simple way to get rid of it is to convert the Series into a DataFrame:
print(pd.DataFrame(averageofdata))
For the [37m, it is probably the ANSI escape sequence Esc[37m. Those are used on certain terminals for fancy displays (colors, blinking, etc.); 37 sets the foreground color to white. I cannot guess what produced it in your setup.
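If the goal is only to suppress the trailing dtype line rather than build a DataFrame, Series.to_string() renders the values without it:

print(averageofdata.to_string())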

Python sklearn-pandas Transform Multiple Columns at the same time error

I am using python with pandas and sklearn and trying to use the new and very convenient sklearn-pandas.
I have a big data frame and need to transform multiple columns in a similar way.
I have multiple column names in the variable other. The documentation in the source code states explicitly that it is possible to transform multiple columns with the same transformation, but the following code does not behave as expected:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn_pandas import DataFrameMapper
mapper = DataFrameMapper([[other[0],other[1]],LabelEncoder()])
mapper.fit_transform(df.copy())
I get the following error:
raise ValueError("bad input shape {0}".format(shape))
ValueError: ['EFW', 'BPD']: bad input shape (154, 2)
When I use the following code, it works great:
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
mapper = DataFrameMapper(cols)
mapper.fit_transform(df.copy())
To my understanding, both should work and yield the same results.
What am I doing wrong here?
Thanks!
The problem you encounter here is that the two snippets of code are completely different in terms of data structure.
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)] builds a list of tuples. Do note that you can shorten this line of code to:
cols = [(col, LabelEncoder()) for col in other]
Anyway, the first snippet, [[other[0],other[1]],LabelEncoder()], results in a list containing two elements: a list and a LabelEncoder instance. The documentation does say that you can transform multiple columns at once:
Transformations may require multiple input columns. In these cases, the column names can be specified in a list:
mapper2 = DataFrameMapper([
    (['children', 'salary'], sklearn.decomposition.PCA(1))
])
This is a list containing tuple(list, object) structured elements, not list[list, object] structured elements.
If we take a look at the source code itself,
class DataFrameMapper(BaseEstimator, TransformerMixin):
    """
    Map Pandas data frame column subsets to their own
    sklearn transformation.
    """

    def __init__(self, features, default=False, sparse=False, df_out=False,
                 input_df=False):
        """
        Params:

        features    a list of tuples with features definitions.
                    The first element is the pandas column selector. This can
                    be a string (for one column) or a list of strings.
                    The second element is an object that supports
                    sklearn's transform interface, or a list of such objects.
                    The third element is optional and, if present, must be
                    a dictionary with the options to apply to the
                    transformation. Example: {'alias': 'day_of_week'}
It is also clearly stated in the class definition that the features argument to DataFrameMapper is required to be a list of tuples, where the elements of the tuple may be lists.
As a last note, as to why you actually get your error message: the LabelEncoder transformer in sklearn is meant for labelling 1-D arrays. As such, it is fundamentally unable to handle two columns at once and will raise an exception. So, if you want to use LabelEncoder, you will have to build N tuples, each with one column name and the transformer, where N is the number of columns you wish to transform.
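If you do want a single transformer to handle several columns at once, a 2-D-capable encoder such as sklearn's OrdinalEncoder should accept the list selector; a sketch using the column names from your error message, untested against your data:

from sklearn.preprocessing import OrdinalEncoder
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    (['EFW', 'BPD'], OrdinalEncoder())   # one tuple: list selector plus a 2-D transformer
])
mapper.fit_transform(df.copy())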

min-max scaling of dataframe in iPython

I'm new to Python. I have a dataframe and I want to do min-max (0-1) scaling on every column (every attribute). I found the MinMaxScaler method, but I don't know how to use it with a dataframe.
from sklearn import preprocessing

def sci_minmax(X):
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
    return minmax_scale.fit_transform(X)

data_normalized = sci_minmax(data)
data_variance = data_normalized.var()
data_variance.head(10)
The error is 'numpy.float64' object has no attribute 'head'. I need the return type to be a DataFrame.
There is no head method in scipy/numpy.
If you want a pandas.DataFrame, you'll have to call the constructor.
Any chance you mean to look at the first 10 records with head?
You can do this easily with numpy, too. To select the first 10 records of an array, the Python syntax is array[:10]. With 2-D NumPy arrays, you specify rows and columns separately: array[:10, :] for the first 10 rows, or array[:, :10] for the first 10 columns.
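A sketch that keeps the pandas labels through the scaling, so var() and head() behave as you expect:

import pandas as pd
from sklearn import preprocessing

def sci_minmax(X):
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
    # wrap the ndarray result back into a DataFrame with the original labels
    return pd.DataFrame(minmax_scale.fit_transform(X),
                        columns=X.columns, index=X.index)

data_normalized = sci_minmax(data)
data_variance = data_normalized.var()   # per-column variance, a pandas Series
print(data_variance.head(10))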
