Getting a strange error while calculating z-score in Python

I want to calculate the z-score of my whole dataset. I have tried two versions of the code, but unfortunately both gave me the same error.
My first attempt:
zee=stats.zscore(df)
print(zee)
My second attempt:
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df))
print(z)
I am using Jupyter.
The error I got:
-----
TypeError Traceback (most recent call last)
<ipython-input-23-ef429aebacfd> in <module>
1 from scipy import stats
2 import numpy as np
----> 3 z = np.abs(stats.zscore(df))
4 print(z)
~/.local/lib/python3.8/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof, nan_policy)
2495 sstd = np.nanstd(a=a, axis=axis, ddof=ddof, keepdims=True)
2496 else:
-> 2497 mns = a.mean(axis=axis, keepdims=True)
2498 sstd = a.std(axis=axis, ddof=ddof, keepdims=True)
2499
~/.local/lib/python3.8/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
160 ret = umr_sum(arr, axis, dtype, out, keepdims)
161 if isinstance(ret, mu.ndarray):
--> 162 ret = um.true_divide(
163 ret, rcount, out=ret, casting='unsafe', subok=False)
164 if is_float16_result and out is None:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
And here is the info of my dataframe, in case there is something wrong with it:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Region 100 non-null object
1 Country 100 non-null object
2 Item Type 100 non-null object
3 Sales Channel 100 non-null object
4 Order Priority 100 non-null object
5 Order Date 100 non-null object
6 Order ID 100 non-null int64
7 Ship Date 100 non-null object
8 Units Sold 100 non-null int64
9 Unit Price 100 non-null float64
10 Unit Cost 100 non-null float64
11 Total Revenue 100 non-null float64
12 Total Cost 100 non-null float64
13 Total Profit 100 non-null float64
dtypes: float64(5), int64(2), object(7)
memory usage: 11.1+ KB
Thanks in advance.

Your df contains non-numeric columns (the ones with dtype object), and zscore cannot take the mean of strings. Pass only the int/float columns to zscore, for example:
stats.zscore(df[['Unit Cost', 'Total Revenue', 'Total Cost', 'Total Profit']])
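If you would rather not list the columns by hand, a minimal sketch along these lines should work. The small frame below is a made-up stand-in for the real data, and select_dtypes is just one way to pick the numeric columns. Note that it also keeps integer ID columns such as Order ID, which you may want to drop first.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical stand-in for the dataframe in the question
df = pd.DataFrame({
    "Region": ["Asia", "Europe", "Asia", "Africa"],
    "Units Sold": [10, 20, 30, 40],
    "Unit Price": [1.5, 2.5, 3.5, 4.5],
})

# Keep only int/float columns; object (string) columns are dropped
numeric = df.select_dtypes(include=[np.number])

# Column-wise z-scores, wrapped back into a DataFrame with the same labels
z = pd.DataFrame(
    stats.zscore(numeric.to_numpy(), axis=0),
    columns=numeric.columns,
    index=numeric.index,
)
print(z)
```

Here zscore uses its default ddof=0, so each column of z has mean 0 and population standard deviation 1.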


ValueError: at least one array or dtype is required

Here is my df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1162 entries, 0 to 1161
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Algorithms&DataStructures 428 non-null float64
1 C/C++Programming 688 non-null float64
2 Calculus1 835 non-null float64
3 Calculus2 752 non-null float64
4 Calculus3 366 non-null float64
5 ChemistryLaboratory 497 non-null float64
6 ChemistryforEngineers 823 non-null float64
7 ComputerArchitecture 433 non-null float64
And this is the function used to impute the NaN values:
from sklearn.neighbors import KNeighborsRegressor
# function that imputes a dataframe
def impute_knn(df):
    ''' inputs: pandas df containing feature matrix '''
    ''' outputs: dataframe with NaN imputed '''
    # imputation with KNN unsupervised method
    # separate dataframe into numerical/categorical
    ldf = df.select_dtypes(include=[np.number]) # select numerical columns in df
    ldf_putaside = df.select_dtypes(exclude=[np.number]) # select categorical columns in df
    # define columns w/ and w/o missing data
    cols_nan = ldf.columns[ldf.isna().any()].tolist() # columns w/ nan
    cols_no_nan = ldf.columns.difference(cols_nan).values # columns w/o nan
    for col in cols_nan:
        imp_test = ldf[ldf[col].isna()] # indicies which have missing data will become our test set
        imp_train = ldf.dropna() # all indicies which which have no missing data
        model = KNeighborsRegressor(n_neighbors=5) # KNR Unsupervised Approach
        knr = model.fit(imp_train[cols_no_nan], imp_train[col])
        ldf.loc[df[col].isna(), col] = knr.predict(imp_test[cols_no_nan])
    return pd.concat([ldf,ldf_putaside],axis=1)
I got it from: Bayesian Regression | House Price Prediction
However, when I apply it to my dataframe, it reports an error:
Full error:
ValueError Traceback (most recent call last)
<ipython-input-284-b13fac408835> in <module>
----> 1 df2 = impute_knn(df)
2 # looks like we have a full feature matrix
3 df2.info()
5 frames
/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
663
664 if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
--> 665 dtype_orig = np.result_type(*dtypes_orig)
666
667 if dtype_numeric:
<__array_function__ internals> in result_type(*args, **kwargs)
ValueError: at least one array or dtype is required
I'd love to hear your comments! Thank you!
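One plausible cause, assuming the truncated df.info() above is representative: every column shown has fewer non-null values than rows, so cols_no_nan may be empty and the regressor would be fitted on a feature matrix with zero columns. The sketch below only reproduces that situation on a tiny made-up frame. It does not call scikit-learn, and it uses .tolist() instead of .values purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame in which every numeric column has at least one NaN
df = pd.DataFrame({
    "Calculus1": [7.0, np.nan, 8.5, 6.0],
    "Calculus2": [np.nan, 5.5, 9.0, 7.5],
})

ldf = df.select_dtypes(include=[np.number])
cols_nan = ldf.columns[ldf.isna().any()].tolist()           # columns with NaN
cols_no_nan = ldf.columns.difference(cols_nan).tolist()      # columns without NaN

# Feature matrix that impute_knn would hand to model.fit
train_features = ldf.dropna()[cols_no_nan]

print(cols_no_nan)            # predictor columns available for KNN
print(train_features.shape)   # second entry is the number of predictor columns
```

If the second shape entry is 0 on the real data, there is nothing for the regressor to use as predictors, and the function needs at least one complete numeric column (or a different imputation approach).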

How to convert array to being 1 dimensional for use as an index

I'm trying to create an index from a numpy array, but every time I try I get the following error: 'ValueError: Cannot index with multidimensional key'. How can I get this 'indices' array into the correct format?
Here is the relevant code:
Dataframe:
default student balance income
0 No No 729.526495 44361.625074
1 No Yes 817.180407 12106.134700
2 No No 1073.549164 31767.138947
3 No No 529.250605 35704.493935
4 No No 785.655883 38463.495879
... ... ... ... ...
9995 No No 711.555020 52992.378914
9996 No No 757.962918 19660.721768
9997 No No 845.411989 58636.156984
9998 No No 1569.009053 36669.112365
9999 No Yes 200.922183 16862.952321
10000 rows × 4 columns
default.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 4 columns):
default 10000 non-null object
student 10000 non-null object
balance 10000 non-null float64
income 10000 non-null float64
dtypes: float64(2), object(2)
memory usage: 312.6+ KB
def regression (X,y,indices):
    reg = smf.logit('Default_ ~ balance + income',default,subset=indices).fit()
    beta_0 = reg.coeffs(1)
    print(reg.coeffs)
n_iter = 1
for i in range(0,n_iter):
    sample_size = len(default)
    X = default[['balance','income']]
    y = default['default']
    #create random set of indices
    indices = np.round(np.random.rand(len(default),1)*len(default)).astype(int)
    regression(X,y,indices)
Format of the array I'm trying to use as an index:
[[2573]
[8523]
[2403]
...
[1635]
[6665]
[6364]]
Your array has shape (n, 1), which is two-dimensional. Just collapse it to a one-dimensional array using flatten():
https://docs.scipy.org/doc/numpy/reference/generated/numpy.ndarray.flatten.html
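A minimal sketch of the difference, using a small n instead of the 10000 rows in the question. The second half is an alternative I am assuming would suit the bootstrap here: drawing the indices as a 1-D array in the first place. It also avoids a side issue with np.round(np.random.rand(...) * n), which can occasionally produce n itself, one past the last valid row.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10  # stand-in for len(default)

# Shape (n, 1), like the array printed in the question
indices_2d = rng.integers(0, n, size=(n, 1))

# flatten() returns a 1-D copy with shape (n,)
indices_1d = indices_2d.flatten()
print(indices_2d.shape, indices_1d.shape)

# Alternative: draw the indices as 1-D directly (upper bound n is excluded)
indices_direct = rng.integers(0, n, size=n)
print(indices_direct.shape)
```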

Python Scatter_Matrix Issue : 'matrix' object has no attribute '_get_numeric_data'

I have been trying to plot a lot of different time series against each others but I am unable to create a scatter_matrix.
print(ptExcitationInside.as_matrix) returns:
[[<bound method NDFrame.as_matrix of Particle excitation inside(j)
date
2017-03-07 08:00:00.779 7.0
2017-03-07 08:00:00.780 7.0
2017-03-07 08:00:00.781 7.0
... ...
2017-03-06 14:34:32.041 168.0
2017-03-06 14:34:32.042 169.0
[23671264 rows x 1 columns]>]]
and print(ptExcitationOutside.as_matrix) returns:
<bound method NDFrame.as_matrix of Particle excitation outside(j)
date
2017-03-06 08:00:00.779 47.0
2017-03-06 08:00:00.780 47.0
2017-03-06 08:00:00.781 47.0
... ...
2017-03-06 14:34:32.041 168.0
2017-03-06 14:34:32.042 169.0
[23671264 rows x 1 columns]>]]
I would like to use scatter_matrix to look at the correlation between the variables. (I have more than 2 time series like these, these are just examples)
I tried to create a big matrix with observables = np.matrix[ptExcitationInside.as_matrix, ptExcitationOutside.as_matrix][ptExcitationInside.as_matrix, ptExcitationOutside.as_matrix] and then called pd.scatter_matrix(observables),
but it returns:
AttributeError Traceback (most recent call last)
<ipython-input-232-218b3e0bf63a> in <module>()
----> 1 pd.scatter_matrix(observables, c='blue', alpha = 0.5, figsize = (10, 10), diagonal = 'kde');
/louis/anaconda3/lib/python3.6/site-packages/pandas/tools/plotting.py in scatter_matrix(frame, alpha, figsize, ax, grid, diagonal, marker, density_kwds, hist_kwds, range_padding, **kwds)
343 import matplotlib.pyplot as plt
344
--> 345 df = frame._get_numeric_data()
346 n = df.columns.size
347 naxes = n * n
AttributeError: 'matrix' object has no attribute '_get_numeric_data'
so I tried to get only the value column but print((ptExcitationInside.as_matrix)[:,0]) returns :
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-227-b6ec50a21c53> in <module>()
1
----> 2 print((ptExcitationInside.as_matrix)[:,0])
TypeError: 'method' object is not subscriptable
Can someone help me, I am new to Python (first week on it), I am a C/Java developer and I don't know how to understand what variable types I am using.
EDIT :
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 23671264 entries, 2017-03-07 08:00:00.779000 to 2017-03-07 14:34:32.042000
Freq: L
Data columns (total 1 columns):
Particle excitation inside float64
dtypes: float64(1)
memory usage: 361.2 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 23671264 entries, 2017-03-07 08:00:00.779000 to 2017-03-07 14:34:32.042000
Freq: L
Data columns (total 1 columns):
Particle excitation outside float64
dtypes: float64(1)
memory usage: 361.2 MB
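A hedged sketch of one way to get a frame that scatter_matrix accepts: the traceback shows it calls frame._get_numeric_data(), so it needs a DataFrame rather than an np.matrix. The two small frames below are made-up stand-ins, and the sketch assumes the series share the same DatetimeIndex so that concat can line them up. Also note that as_matrix is a method, so it needs parentheses to be called. It has since been removed from pandas, and to_numpy() does the same job.

```python
import pandas as pd

# Hypothetical stand-ins for the two single-column frames in the question
idx = pd.date_range("2017-03-07 08:00:00", periods=5, freq="min")
pt_inside = pd.DataFrame(
    {"Particle excitation inside": [7.0, 7.0, 8.0, 9.0, 10.0]}, index=idx
)
pt_outside = pd.DataFrame(
    {"Particle excitation outside": [47.0, 47.0, 46.0, 45.0, 44.0]}, index=idx
)

# One DataFrame with one column per variable, aligned on the shared index
observables = pd.concat([pt_inside, pt_outside], axis=1)
print(observables)

# Getting the raw values of one column as a 1-D numpy array
values = pt_inside.to_numpy()[:, 0]
print(values)
```

With matplotlib installed, pd.plotting.scatter_matrix(observables, alpha=0.5, figsize=(10, 10), diagonal='kde') should then work (in recent pandas the function lives under pandas.plotting rather than pd.scatter_matrix). With 23 million rows per series you will probably want to plot a sample, e.g. observables.sample(10000).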

Plotting Pandas' pivot_table from long data

I have an xls file with data organized in long format. It has four columns: the variable name, the country name, the year and the value.
After importing the data in Python with pandas.read_excel, I want to plot the time series of one variable for different countries. To do so, I create a pivot table that transforms the data in wide format. When I try to plot with matplotlib, I get an error
ValueError: could not convert string to float: 'ZAF'
(where 'ZAF' is the label of one country)
What's the problem?
This is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data = pd.read_excel('raw_emissions_energy.xls','raw data', index_col = None, thousands='.',parse_cols="A,C,F,M")
data['Year'] = data['Year'].astype(str)
data['COU'] = data['COU'].astype(str)
# generate sub-datasets for specific VARs
data_CO2PROD = pd.pivot_table(data[(data['VAR']=='CO2_PBPROD')], index='COU', columns='Year')
plt.plot(data_CO2PROD)
The xls file with raw data looks like:
(image: raw data Excel view)
This is what I get from data_CO2PROD.info()
<class 'pandas.core.frame.DataFrame'>
Index: 105 entries, ARE to ZAF
Data columns (total 16 columns):
(Value, 1990) 104 non-null float64
(Value, 1995) 105 non-null float64
(Value, 2000) 105 non-null float64
(Value, 2001) 105 non-null float64
(Value, 2002) 105 non-null float64
(Value, 2003) 105 non-null float64
(Value, 2004) 105 non-null float64
(Value, 2005) 105 non-null float64
(Value, 2006) 105 non-null float64
(Value, 2007) 105 non-null float64
(Value, 2008) 105 non-null float64
(Value, 2009) 105 non-null float64
(Value, 2010) 105 non-null float64
(Value, 2011) 105 non-null float64
(Value, 2012) 105 non-null float64
(Value, 2013) 105 non-null float64
dtypes: float64(16)
memory usage: 13.9+ KB
None
Using data_CO2PROD.plot() instead of plt.plot(data_CO2PROD) allowed me to plot the data. http://pandas.pydata.org/pandas-docs/stable/visualization.html.
Simple code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
data= pd.DataFrame(np.random.randn(3,4), columns=['VAR','COU','Year','VAL'])
data['VAR'] = ['CC','CC','KK']
data['COU'] =['ZAF','NL','DK']
data['Year']=['1987','1987','2006']
data['VAL'] = [32,33,35]
data['Year'] = data['Year'].astype(str)
data['COU'] = data['COU'].astype(str)
# generate sub-datasets for specific VARs
data_CO2PROD = pd.pivot_table(data=data[(data['VAR']=='CC')], index='COU', columns='Year')
data_CO2PROD.plot()
plt.show()
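I think you need to add the values parameter to pivot_table, so the result has plain year columns instead of a (Value, year) MultiIndex. Use the name of your value column: 'VAL' in the simple example above, 'Value' in your original data.
data_CO2PROD = pd.pivot_table(data=data[(data['VAR']=='CC')],
index='COU',
columns='Year',
values='VAL')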
data_CO2PROD.plot()
plt.show()
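A small sketch of what the values parameter changes, on made-up long-format data using the column names of the simple example. It also swaps index and columns compared with the code above, on the assumption that the goal is one line per country over the years, since DataFrame.plot() draws one line per column with the index on the x-axis.

```python
import pandas as pd

# Hypothetical long-format data: one row per (variable, country, year)
data = pd.DataFrame({
    "VAR": ["CC", "CC", "CC", "CC"],
    "COU": ["ZAF", "ZAF", "NL", "NL"],
    "Year": ["1987", "2006", "1987", "2006"],
    "VAL": [32, 33, 35, 36],
})

# Years as the index, one column per country, plain (non-MultiIndex) columns
wide = pd.pivot_table(
    data[data["VAR"] == "CC"],
    index="Year",
    columns="COU",
    values="VAL",
)
print(wide)
```

Calling wide.plot() followed by plt.show() would then give one time series per country.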

pandas dataframe conversion for linear regression

I read a CSV file into a dataframe (name: data) that has a few columns. The first few are numeric long (type: pandas.core.series.Series) and the last column (the label) is a binary response variable holding the strings 'P' (pass) / 'F' (fail).
import statsmodels.api as sm
label = data.ix[:, -1]
label[label == 'P'] = 1
label[label == 'F'] = 0
fea = data.ix[:, 0: -1]
logit = sm.Logit(label, fea)
result = logit.fit()
print result.summary()
I get this error message: "ValueError: Pandas data cast to numpy dtype of object. Check input data with np.asarray(data)"
Numpy, Pandas etc. are imported already. I tried to convert the fea columns to float but it still does not go through. Could someone tell me how to correct this?
Thanks
update:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 500 entries, 68135 to 3002
Data columns (total 8 columns):
TestQty 500 non-null int64
WaferSize 500 non-null int64
ChuckTemp 500 non-null int64
Notch 500 non-null int64
ORIGINALDIEX 500 non-null int64
ORIGINALDIEY 500 non-null int64
DUTNo 500 non-null int64
PassFail 500 non-null object
dtypes: int64(7), object(1)
memory usage: 35.2+ KB
data.sum()
TestQty 530
WaferSize 6000
ChuckTemp 41395
Notch 135000
ORIGINALDIEX 12810
ORIGINALDIEY 7885
DUTNo 271132
PassFail 20
dtype: float64
Shouldn't your features be this:
fea = data.ix[:, 0:-1]
From your data, you can see that PassFail sums to 20 before you convert 'P' to 1 and 'F' to 0. I believe that is the source of your error.
To see what is in there, try:
data.PassFail.unique()
To verify that the counts of 0s and 1s total 500 (the number of rows in the DataFrame):
(label == 0).sum() + (label == 1).sum()
Finally, try passing values to the function rather than Series and DataFrames:
logit = sm.Logit(label.values, fea.values)
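Another thing worth checking, sketched here on a made-up frame: assigning 1 and 0 into the label Series element by element leaves its dtype as object, which is what the "cast to numpy dtype of object" message complains about. Mapping the strings and casting to int gives a numeric label. The column names are borrowed from the question, the data is invented, and .iloc is used because .ix has been removed from recent pandas.

```python
import pandas as pd

# Hypothetical stand-in for the dataframe in the question
data = pd.DataFrame({
    "TestQty": [1, 2, 1, 3],
    "ChuckTemp": [85, 25, 85, 25],
    "PassFail": ["P", "F", "P", "P"],
})

# Map the strings to 0/1; astype(int) fails loudly if any value
# other than 'P' or 'F' was present (it would have become NaN)
label = data["PassFail"].map({"P": 1, "F": 0}).astype(int)

# All columns except the last one are the features
fea = data.iloc[:, :-1]

print(label.dtype)
print(fea.dtypes)
```

After that, sm.Logit(label, fea).fit() should receive purely numeric input.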
