How do I remove float64 from Pandas query - python

I am working on my first student project with the Iris dataset and learning pandas, and I wondered if anyone can help. I'm trying to remove dtype: float64 from the pandas results. I am also noticing that the results are prefixed with [37m on the other part of the print statement.
Reading solutions to similar questions I have tried substituting
IrisData = pd.read_csv('IRIS.csv')
with
IrisData = pd.loadtxt('IRIS.csv', dtype='float')
but this raises an error:
raise AttributeError(f"module 'pandas' has no attribute '{name}'")
AttributeError: module 'pandas' has no attribute 'loadtxt'
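For what it's worth, loadtxt is a NumPy function, not a pandas one, which is exactly what the AttributeError is saying. A NumPy version would look something like the sketch below, though note the column layout of IRIS.csv is an assumption here (four numeric columns, then the species text column):
import numpy as np

# numpy.loadtxt, not pandas.loadtxt; usecols assumes the first four
# columns of IRIS.csv are the numeric measurements (species is text).
iris_array = np.loadtxt('IRIS.csv', dtype='float', delimiter=',',
                        skiprows=1, usecols=(0, 1, 2, 3))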
CODE USED TO GET THE AVERAGE SIZE OF ALL IRIS
# importing pandas as pd
import pandas as pd
# Creating the dataframe
IrisData = pd.read_csv('IRIS.csv')
# mean over the column axis (one value per numeric column)
averageofdata = IrisData.mean(axis=0, skipna=True)
print("Average Sizes of All Iris Data")
print(averageofdata)
RESULTS OF CODE

You should distinguish between the data and the way it is displayed. The dtype: float64 is displayed because you are printing a pandas Series. A simple way to get rid of it is to convert the Series into a DataFrame:
print(pd.DataFrame(averageofdata))
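For reference, two equivalent one-liners: Series.to_frame() performs the same conversion, and Series.to_string() renders the values as plain text without the dtype footer.
print(averageofdata.to_frame())    # same as pd.DataFrame(averageofdata)
print(averageofdata.to_string())   # plain text, no "dtype: float64" line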
For the [37m, it is probably the ANSI escape sequence Esc [ 3 7 m, which sets the terminal foreground color to white. Such sequences are used on certain terminals for fancy displays (colors, blinking, etc.), but I cannot guess what produced it in your setup.

Related

How to fix "DeprecationWarning: DataFrames with non-bool types result in worse computational performance..."

I have been trying to implement the Apriori algorithm in Python. There are several examples online; they all use similar methods and mostly the same example dataset. The reference link: https://www.kaggle.com/code/rockystats/apriori-algorithm-or-market-basket-analysis/notebook
(starting from the line [26])
I have a different dataset that has the same structure as the example datasets online. I keep getting this warning:
"DeprecationWarning: DataFrames with non-bool types result in worse computational performance and their support might be discontinued in the future. Please use a DataFrame with bool type"
Here is my code:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
df1 = pd.read_csv(r'C:\Users\USER\dataset', sep=';')
df=df1.fillna(0)
basket = pd.pivot_table(data=df, index='cust_id', columns='Product', values='quantity', aggfunc='count',fill_value=0.0)
def convert_into_binary(x):
    if x > 0:
        return 1
    else:
        return 0
basket_sets = basket.applymap(convert_into_binary)
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
print(frequent_itemsets)
# association rule
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
print(rules)
In addition, in the last step of my code, I get an empty dataframe; I can see the column headings of the dataset but the output is empty.
Empty DataFrame
Columns: [antecedents, consequents, antecedent support, consequent support, support, confidence, lift, leverage, conviction]
Index: []
I am not sure if this issue is related to this error that I am having. I am new to python and I would really appreciate assistance and support on this issue.
I ran into the same issue even after converting my dataframe fields to 0 and 1.
The fix was just making sure the apriori module knows the dataframe is of boolean type, so in your case you should run this:
frequent_itemsets = apriori(basket_sets.astype('bool'), min_support=0.07, use_colnames=True)
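Equivalently, you could skip the 0/1 mapping altogether; comparing the pivot table against zero yields a boolean DataFrame directly (a small sketch based on the code above):
# basket > 0 produces a DataFrame of dtype bool, which is exactly what
# apriori wants, so convert_into_binary/astype become unnecessary.
basket_sets = basket > 0
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)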
In addition, in the last step of my code, I get an empty dataframe; I can see the column headings of the dataset but the output is empty.
Try using a smaller min_support

AttributeError: 'numpy.ndarray' object has no attribute 'rolling' arises only after filtering the CSV data

If I pass the CSV data the following way, it produces the expected output.
data = pd.read_csv("abc.csv")
avg = data['A'].rolling(3).mean()
print(avg)
But if I pass the value the following way, it produces an error.
import scipy.signal

dff1 = abs(data['A'])
b, a = scipy.signal.butter(2, 0.05, 'highpass')   # 2nd-order high-pass filter
dff = scipy.signal.filtfilt(b, a, dff1)           # returns a numpy ndarray
avg = dff.rolling(3).mean()                       # AttributeError here
print(avg)
Error is:
AttributeError: 'numpy.ndarray' object has no attribute 'rolling'
I can't figure out what is wrong with the code.
After applying dff = pd.DataFrame(dff), a new problem arises: an unexpected zero is displayed at the top. What is the reason behind this? How do I get rid of it?
rolling is a method on pandas Series and DataFrames. SciPy knows nothing about these and produces NumPy ndarrays as output. It can accept DataFrames and Series as input, because the pandas types can mimic ndarrays when needed.
The solution might be as simple as re-wrapping the ndarray as a DataFrame using
dff = pd.DataFrame(dff)
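As for the "unexpected zero at the top": wrapping a bare ndarray in a DataFrame gives its single column the default label 0, so that zero is a column header, not data. A sketch that avoids it is to re-wrap as a named Series instead, reusing the original index (the name 'A_filtered' is just illustrative):
import pandas as pd

# filtfilt returned a plain ndarray; give it back an index and a name
# so it prints like the original column and supports .rolling().
dff = pd.Series(dff, index=data.index, name='A_filtered')
avg = dff.rolling(3).mean()
print(avg)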

Python 3.6.5 returns '<' not supported between instances of 'tuple' and 'str' error message

I'm trying to split a data set into a training and a testing part. I am struggling with a structural problem: the hierarchy of the data seems to be wrong for the code below.
I tried the following:
import pandas as pd
import pandas_datareader.data as web

data = pd.DataFrame(web.DataReader('SPY', data_source='morningstar')['Close'])
cutoff = '2015-1-1'
data = data[data.index < cutoff].dropna().copy()
As data.head() will reveal, data is not actually a pd.DataFrame but a pd.Series whose index is a pd.MultiIndex (as suggested also by the error which hints that each element is a tuple) rather than a pd.DatetimeIndex.
What you could do is simply let
df = data.unstack(0)
With that, df[df.index < cutoff] performs the filtering you are trying to do.
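A toy illustration of the same reshaping, with made-up prices standing in for the DataReader output:
import pandas as pd

# Hypothetical Series whose MultiIndex (Symbol, Date) mimics what
# DataReader returns here; the prices are invented for illustration.
idx = pd.MultiIndex.from_product(
    [['SPY'], pd.to_datetime(['2014-12-30', '2014-12-31', '2015-01-02'])],
    names=['Symbol', 'Date'])
data = pd.Series([205.5, 206.1, 205.4], index=idx, name='Close')

df = data.unstack(0)              # Symbol level becomes the column
print(df[df.index < '2015-1-1'])  # DatetimeIndex now compares cleanly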

Python aggregate functions (e.g. sum) not working on object dtypes, but won't work when they're converted either?

I'm importing data from a CSV file which has text, date and numeric columns. I'm using pandas.read_csv() to read it in, but I'm not specifying what each column's dtype should be. Here's a cut of that csv file (apologies for the shoddy formatting).
Now these two columns (total_imp_pma, char_value_aa503) are imported very differently. I import all the number fields and create a new dataframe called base_varlist4, which only contains the number columns.
When I run base_varlist4.dtypes, I get:
total_imp_pma object
char_value_aa503 float64
So as you can see, total_imp_pma was imported as an object. The problem then means that if I run this:
#calculate max, and group by obs_date
output_max_temp=base_varlist4.groupby('obs_date').max(skipna=True)
#reset obs_date to be treated as a column rather than an index
output_max_temp.reset_index()
#reshape temporary output to have 2 columns corresponding to variable and value
output_max=pd.melt(output_max_temp, id_vars='obs_date', value_vars=varlist4)
Where varlist4 is just my list of columns, I get the wrong max value for total_imp_pma but the correct max value for char_value_aa503.
Logically, this means I should change the object total_imp_pma to either a float or an integer. However, when I run:
base_varlist4[varlist4] = base_varlist4[varlist4].apply(pd.to_numeric, errors='coerce')
And then proceed to do the max value, I still get an incorrect result.
What's going on here? Why does pandas.read_csv() import some columns as an object dtype, and others as an int64 or float64 dtype? Why does conversion not work?
I have a theory but I'm not sure how to work around it. The only difference I see between the two columns in my source data is that total_imp_pma has mixed-type cells all the way down. For example, 66979 is a General cell, while a cell a little further down holds 1,760.60 formatted as a number.
I think the mixed cell types in certain columns is causing pandas.read_csv() to be confused and just say "whelp, dunno what this is, import it as an object".
... how do I fix this?
EDIT: Here's an MCVE as per the request below.
Data in CSV is:
Char_Value_AA503  Total_IMP_PMA
1293              19.9
1831              0.9
                  1.2
243               2,666.50
Code is:
import pandas as pd
loc = r"xxxxxxxxxxxxxx"
source_data_name = 'import_problem_example.csv'
reporting_date = '01Feb2018'
source_data = pd.read_csv(loc + source_data_name)
source_data.columns = source_data.columns.str.lower()
varlist4 = ["char_value_aa503","total_imp_pma"]
base_varlist4 = source_data[varlist4]
base_varlist4['obs_date'] = reporting_date
base_varlist4[varlist4] = base_varlist4[varlist4].apply(pd.to_numeric, errors='coerce')
output_max_temp=base_varlist4.groupby('obs_date').max(skipna=True)
#reset obs_date to be treated as a column rather than an index
output_max_temp.reset_index()
#reshape temporary output to have 2 columns corresponding to variable and value
output_max=pd.melt(output_max_temp, id_vars='obs_date', value_vars=varlist4)
""" Test some stuff"""
source_data.dtypes
output_max
source_data.dtypes
As you can see, the max value of total_imp_pma comes out as 19.9, when it should be 2666.50.
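A likely explanation, judging from the sample data: "2,666.50" contains a comma as a thousands separator, so pd.to_numeric(..., errors='coerce') turns it into NaN and it silently drops out of the max. One hedged fix is to tell read_csv about the separator up front, so the column arrives as float64 and no coercion step is needed:
# thousands=',' lets the parser read "2,666.50" as 2666.50
source_data = pd.read_csv(loc + source_data_name, thousands=',')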

min-max scaling of dataframe in iPython

I'm new to Python. I have a dataframe and I want to do min-max (0-1) scaling on every column (every attribute). I found the MinMaxScaler method but I don't know how to use it with a dataframe.
from sklearn import preprocessing

def sci_minmax(X):
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
    return minmax_scale.fit_transform(X)

data_normalized = sci_minmax(data)
data_variance = data_normalized.var()
data_variance.head(10)
The error is 'numpy.float64' object has no attribute 'head'. I need the return type to be a dataframe.
There is no head method in scipy/numpy.
If you want a pandas.DataFrame, you'll have to call the constructor.
Any chance you mean to look at the first 10 records with head?
You can do this easily with numpy, too.
To select the first 10 records of an array, the Python syntax is array[:10]. With two-dimensional numpy arrays, you will want to specify rows and columns: array[:10, :] for the first 10 rows or array[:, :10] for the first 10 columns.
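If the goal is to keep a DataFrame end to end, a minimal sketch (assuming data is a pandas DataFrame) is to wrap the scaled ndarray back up with the original index and column labels, so .var() and .head() behave as expected:
import pandas as pd
from sklearn import preprocessing

def sci_minmax(X):
    # fit_transform returns an ndarray; re-wrap it so labels survive
    minmax_scale = preprocessing.MinMaxScaler(feature_range=(0, 1), copy=True)
    return pd.DataFrame(minmax_scale.fit_transform(X),
                        index=X.index, columns=X.columns)

data_normalized = sci_minmax(data)      # still a DataFrame
data_variance = data_normalized.var()   # per-column variance, a Series
print(data_variance.head(10))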
