I have been trying to plot a lot of different time series against each other, but I am unable to create a scatter_matrix.
print(ptExcitationInside.as_matrix) returns:
[[<bound method NDFrame.as_matrix of Particle excitation inside(j)
date
2017-03-07 08:00:00.779 7.0
2017-03-07 08:00:00.780 7.0
2017-03-07 08:00:00.781 7.0
... ...
2017-03-06 14:34:32.041 168.0
2017-03-06 14:34:32.042 169.0
[23671264 rows x 1 columns]>]]
and print(ptExcitationOutside.as_matrix) returns:
<bound method NDFrame.as_matrix of Particle excitation outside(j)
date
2017-03-06 08:00:00.779 47.0
2017-03-06 08:00:00.780 47.0
2017-03-06 08:00:00.781 47.0
... ...
2017-03-06 14:34:32.041 168.0
2017-03-06 14:34:32.042 169.0
[23671264 rows x 1 columns]>]]
I would like to use scatter_matrix to look at the correlation between the variables. (I have more than two time series like these; these are just examples.)
I tried to create a big matrix, observables = np.matrix[ptExcitationInside.as_matrix, ptExcitationOutside.as_matrix][ptExcitationInside.as_matrix, ptExcitationOutside.as_matrix], and called pd.scatter_matrix(observables),
but it returns:
AttributeError Traceback (most recent call last)
<ipython-input-232-218b3e0bf63a> in <module>()
----> 1 pd.scatter_matrix(observables, c='blue', alpha = 0.5, figsize = (10, 10), diagonal = 'kde');
/louis/anaconda3/lib/python3.6/site-packages/pandas/tools/plotting.py in scatter_matrix(frame, alpha, figsize, ax, grid, diagonal, marker, density_kwds, hist_kwds, range_padding, **kwds)
343 import matplotlib.pyplot as plt
344
--> 345 df = frame._get_numeric_data()
346 n = df.columns.size
347 naxes = n * n
AttributeError: 'matrix' object has no attribute '_get_numeric_data'
So I tried to get only the value column, but print((ptExcitationInside.as_matrix)[:,0]) returns:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-227-b6ec50a21c53> in <module>()
1
----> 2 print((ptExcitationInside.as_matrix)[:,0])
TypeError: 'method' object is not subscriptable
Can someone help me? I am new to Python (first week with it). I am a C/Java developer and I don't know how to tell which variable types I am using.
EDIT:
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 23671264 entries, 2017-03-07 08:00:00.779000 to 2017-03-07 14:34:32.042000
Freq: L
Data columns (total 1 columns):
Particle excitation inside    float64
dtypes: float64(1)
memory usage: 361.2 MB
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 23671264 entries, 2017-03-07 08:00:00.779000 to 2017-03-07 14:34:32.042000
Freq: L
Data columns (total 1 columns):
Particle excitation outside    float64
dtypes: float64(1)
memory usage: 361.2 MB
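Two things in the attempts above stand out: as_matrix is a method, so without parentheses it prints the bound method rather than the values, and scatter_matrix expects a DataFrame, not a NumPy matrix. A minimal sketch of one way around both, assuming the frames share the same DatetimeIndex (the printed samples suggest the dates may differ, in which case the join would produce NaN rows). The names pt_inside and pt_outside and the random data are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the two single-column frames in the question.
index = pd.date_range("2017-03-07 08:00:00.779", periods=1000, freq="ms")
rng = np.random.default_rng(0)
pt_inside = pd.DataFrame({"Particle excitation inside": rng.normal(size=1000)}, index=index)
pt_outside = pd.DataFrame({"Particle excitation outside": rng.normal(size=1000)}, index=index)

# Join the frames column-wise on their shared DatetimeIndex.
observables = pd.concat([pt_inside, pt_outside], axis=1)

# Millions of points are slow to draw, so keep every 10th row for plotting.
sample = observables.iloc[::10]

# Pairwise correlations are available directly, without plotting.
correlations = observables.corr()
print(correlations)
```

With matplotlib installed, passing the thinned frame to pd.plotting.scatter_matrix(sample, alpha=0.5, figsize=(10, 10), diagonal='hist') should then draw the grid. The same pattern extends to more than two series by adding them to the list given to pd.concat.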
Here is my df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1162 entries, 0 to 1161
Data columns (total 61 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Algorithms&DataStructures 428 non-null float64
1 C/C++Programming 688 non-null float64
2 Calculus1 835 non-null float64
3 Calculus2 752 non-null float64
4 Calculus3 366 non-null float64
5 ChemistryLaboratory 497 non-null float64
6 ChemistryforEngineers 823 non-null float64
7 ComputerArchitecture 433 non-null float64
And this is the function used to impute the NaN values:
import numpy as np
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor

# function that imputes a dataframe
def impute_knn(df):
    ''' inputs: pandas df containing feature matrix
        outputs: dataframe with NaN imputed '''
    # imputation with KNN unsupervised method
    # separate dataframe into numerical/categorical
    ldf = df.select_dtypes(include=[np.number])           # select numerical columns in df
    ldf_putaside = df.select_dtypes(exclude=[np.number])  # select categorical columns in df
    # define columns w/ and w/o missing data
    cols_nan = ldf.columns[ldf.isna().any()].tolist()      # columns w/ nan
    cols_no_nan = ldf.columns.difference(cols_nan).values  # columns w/o nan
    for col in cols_nan:
        imp_test = ldf[ldf[col].isna()]  # rows with missing data in col become our test set
        imp_train = ldf.dropna()         # all rows which have no missing data
        model = KNeighborsRegressor(n_neighbors=5)  # KNR Unsupervised Approach
        knr = model.fit(imp_train[cols_no_nan], imp_train[col])
        ldf.loc[df[col].isna(), col] = knr.predict(imp_test[cols_no_nan])
    return pd.concat([ldf, ldf_putaside], axis=1)
I got it from: Bayesian Regression | House Price Prediction
However, when I apply it to my dataframe, it reports an error:
Full error:
ValueError Traceback (most recent call last)
<ipython-input-284-b13fac408835> in <module>
----> 1 df2 = impute_knn(df)
2 # looks like we have a full feature matrix
3 df2.info()
5 frames
/usr/local/lib/python3.8/dist-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
663
664 if all(isinstance(dtype, np.dtype) for dtype in dtypes_orig):
--> 665 dtype_orig = np.result_type(*dtypes_orig)
666
667 if dtype_numeric:
<__array_function__ internals> in result_type(*args, **kwargs)
ValueError: at least one array or dtype is required
I'd love to hear your comment! Thank you!
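One plausible reading of this traceback, assuming every numeric column in the real frame contains at least one NaN (all eight columns visible in the df.info() output do): cols_no_nan comes out empty, so the model is fitted on a frame with zero feature columns and scikit-learn has no dtypes to combine. The sketch below reproduces just that condition on a small, made-up frame, without calling scikit-learn:

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real frame: every numeric column has at least one NaN.
df = pd.DataFrame({
    "Calculus1": [1.0, np.nan, 3.0],
    "Calculus2": [np.nan, 2.0, 4.0],
})

# Same selection logic as impute_knn above.
ldf = df.select_dtypes(include=[np.number])
cols_nan = ldf.columns[ldf.isna().any()].tolist()
cols_no_nan = ldf.columns.difference(cols_nan).values

# With no complete column left, the feature matrix has rows but zero columns.
features = ldf.dropna()[list(cols_no_nan)]
print(cols_nan)
print(features.shape)
```

If that is what is happening, an imputer that tolerates NaN in every column, such as scikit-learn's KNNImputer from sklearn.impute, may be a better fit than this per-column regressor.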
I want to calculate the z-score of my whole dataset. I have tried two versions of the code, but unfortunately they both give me the same error.
My first attempt:
zee=stats.zscore(df)
print(zee)
My second attempt:
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(df))
print(z)
I am using Jupyter.
The error I got:
-----
TypeError Traceback (most recent call last)
<ipython-input-23-ef429aebacfd> in <module>
1 from scipy import stats
2 import numpy as np
----> 3 z = np.abs(stats.zscore(df))
4 print(z)
~/.local/lib/python3.8/site-packages/scipy/stats/stats.py in zscore(a, axis, ddof, nan_policy)
2495 sstd = np.nanstd(a=a, axis=axis, ddof=ddof, keepdims=True)
2496 else:
-> 2497 mns = a.mean(axis=axis, keepdims=True)
2498 sstd = a.std(axis=axis, ddof=ddof, keepdims=True)
2499
~/.local/lib/python3.8/site-packages/numpy/core/_methods.py in _mean(a, axis, dtype, out, keepdims)
160 ret = umr_sum(arr, axis, dtype, out, keepdims)
161 if isinstance(ret, mu.ndarray):
--> 162 ret = um.true_divide(
163 ret, rcount, out=ret, casting='unsafe', subok=False)
164 if is_float16_result and out is None:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
And here is the info of my dataframe, in case there is something wrong with it.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 14 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Region 100 non-null object
1 Country 100 non-null object
2 Item Type 100 non-null object
3 Sales Channel 100 non-null object
4 Order Priority 100 non-null object
5 Order Date 100 non-null object
6 Order ID 100 non-null int64
7 Ship Date 100 non-null object
8 Units Sold 100 non-null int64
9 Unit Price 100 non-null float64
10 Unit Cost 100 non-null float64
11 Total Revenue 100 non-null float64
12 Total Cost 100 non-null float64
13 Total Profit 100 non-null float64
dtypes: float64(5), int64(2), object(7)
memory usage: 11.1+ KB
Thanks in advance.
Your df contains non-numeric (object) columns. Try passing only the int/float columns to zscore:
stats.zscore(df[['Unit Cost', 'Total Revenue', 'Total Cost', 'Total Profit']])
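Rather than listing the numeric columns by hand, they can be picked by dtype. A minimal sketch on a made-up frame (the rows are invented; only the column names echo the question), computing the z-score with pandas alone and ddof=0, which is also the default of scipy's stats.zscore:

```python
import pandas as pd

# Made-up sample with one text column and two numeric columns.
df = pd.DataFrame({
    "Region": ["Europe", "Asia", "Europe", "Africa"],
    "Units Sold": [10, 20, 30, 40],
    "Unit Price": [1.5, 2.5, 3.5, 4.5],
})

# Keep only int/float columns; object columns such as Region are dropped.
numeric = df.select_dtypes(include="number")

# Population z-score (ddof=0) per column.
z = (numeric - numeric.mean()) / numeric.std(ddof=0)
print(z.abs())
```

Note that select_dtypes would also keep an identifier column such as Order ID, because it is int64; dropping it first with numeric.drop(columns=["Order ID"]) is probably what you want there.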
I encountered an error while trying to reduce my high-dimensional vectors to 2 dimensions using PCA.
This is my input data, each row has 300 dimensions:
vector
0 [0.01053525, -0.007869658, 0.0024931028, -0.04...
1 [-0.024436072, -0.016484523, 0.03859031, 0.000...
2 [0.015011676, -0.020465894, 0.004854744, -0.00...
3 [-0.010836455, -0.006562917, 0.00265073, 0.022...
4 [-0.018123362, -0.026007563, 0.04781856, -0.03...
... ...
45124 [-0.016111804, -0.041917775, 0.010192914, -0.0...
45125 [0.0311568, -0.013044083, 0.030656694, -0.0126...
45126 [-0.021875003, -0.005635035, 0.0076896898, -0....
45127 [-0.0062000924, -0.041035958, 0.0077403532, 0....
45128 [0.007794927, 0.0019561667, 0.15995999, -0.054...
[45129 rows x 1 columns]
My Code:
data = pd.read_parquet('1.parquet', engine='fastparquet')
reduced = pca.fit_transform(data)
Error:
TypeError Traceback (most recent call last)
TypeError: float() argument must be a string or a number, not 'list'
The above exception was the direct cause of the following exception:
ValueError Traceback (most recent call last)
<ipython-input-15-8e547411a212> in <module>
----> 1 reduced = pca.fit_transform(data)
...
...
ValueError: setting an array element with a sequence.
Edit
>>data.shape
(45129, 1)
>>data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45129 entries, 0 to 45128
Data columns (total 1 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 vector 45129 non-null object
dtypes: object(1)
memory usage: 352.7+ KB
Scikit-learn doesn't know how to handle a column that contains an array (list), so you'll need to expand the column. Since each row has an array of the same size, you can do this fairly easily with only 45,000 rows. Once you expand your data, you should be fine.
import pandas as pd
from sklearn.decomposition import PCA
df = pd.DataFrame({"a": [[0.01, 0.02, 0.03], [0.04, 0.4, 0.1]]})
expanded_df = pd.DataFrame(df.a.tolist())
expanded_df
0 1 2
0 0.01 0.02 0.03
1 0.04 0.40 0.10
pca = PCA(n_components=2)
reduced = pca.fit_transform(expanded_df)
reduced
array([[ 1.93778224e-01, 1.43048962e-17],
[-1.93778224e-01, 1.43048962e-17]])
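An alternative sketch, assuming every list in the column has the same length: build a plain 2-D NumPy array instead of an expanded DataFrame. The small frame below is made up; with the real data the result would have shape (45129, 300) and could be passed to fit_transform in the same way.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the real 'vector' column of equal-length lists.
data = pd.DataFrame({"vector": [[0.01, 0.02, 0.03], [0.04, 0.4, 0.1]]})

# One row per original row, one column per vector component.
matrix = np.array(data["vector"].tolist(), dtype=float)
print(matrix.shape)
```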
I'm trying to generate a new column in a pandas dataframe from other columns and am getting some math errors that I don't understand. Here is a snapshot of the problem and some simplifying diagnostics...
I can generate a data frame that looks pretty good:
import pandas
import math as m
data = {'loc':['1','2','3','4','5'],
'lat':[61.3850,32.7990,34.9513,14.2417,33.7712],
'lng':[-152.2683,-86.8073,-92.3809,-170.7197,-111.3877]}
frame = pandas.DataFrame(data)
frame
Out[15]:
lat lng loc
0 61.3850 -152.2683 1
1 32.7990 -86.8073 2
2 34.9513 -92.3809 3
3 14.2417 -170.7197 4
4 33.7712 -111.3877 5
5 rows × 3 columns
I can do simple math (i.e. degrees to radians):
In [32]:
m.pi*frame.lat/180.
Out[32]:
0 1.071370
1 0.572451
2 0.610015
3 0.248565
4 0.589419
Name: lat, dtype: float64
But I can't convert from degrees to radians using the python math library:
In [33]:
m.radians(frame.lat)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-33-99a986252f80> in <module>()
----> 1 m.radians(frame.lat)
/Users/user/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
72 return converter(self.iloc[0])
73 raise TypeError(
---> 74 "cannot convert the series to {0}".format(str(converter)))
75 return wrapper
76
TypeError: cannot convert the series to <type 'float'>
And can't even convert the values to floats to try to force it to work:
In [34]:
float(frame.lat)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-34-3311aee92f31> in <module>()
----> 1 float(frame.lat)
/Users/user/anaconda/lib/python2.7/site-packages/pandas/core/series.pyc in wrapper(self)
72 return converter(self.iloc[0])
73 raise TypeError(
---> 74 "cannot convert the series to {0}".format(str(converter)))
75 return wrapper
76
TypeError: cannot convert the series to <type 'float'>
I'm sure there must be a simple explanation and would appreciate your help in finding it. Thanks!
math functions such as math.radians expect a numeric value such as a float, not a sequence such as a pandas.Series.
Instead, you could use numpy.radians, since numpy.radians can accept an array as input:
In [95]: np.radians(frame['lat'])
Out[95]:
0 1.071370
1 0.572451
2 0.610015
3 0.248565
4 0.589419
Name: lat, dtype: float64
Only a Series of length 1 can be converted to a float. So while this works,
In [103]: math.radians(pd.Series([1]))
Out[103]: 0.017453292519943295
in general it does not:
In [104]: math.radians(pd.Series([1,2]))
TypeError: cannot convert the series to <type 'float'>
math.radians is calling float on its argument. Note that you get the same error calling float on pd.Series([1,2]):
In [105]: float(pd.Series([1,2]))
TypeError: cannot convert the series to <type 'float'>
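Since the original goal was a new column, here is a minimal sketch that stores the converted values back on the frame from the question. The column names lat_rad and lng_rad are made up for illustration:

```python
import numpy as np
import pandas as pd

data = {'loc': ['1', '2', '3', '4', '5'],
        'lat': [61.3850, 32.7990, 34.9513, 14.2417, 33.7712],
        'lng': [-152.2683, -86.8073, -92.3809, -170.7197, -111.3877]}
frame = pd.DataFrame(data)

# np.radians works element-wise on a Series and returns a Series.
frame['lat_rad'] = np.radians(frame['lat'])
frame['lng_rad'] = np.radians(frame['lng'])
print(frame)
```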
I had a similar issue but was using a custom function. The solution was to use the apply function:
def monthdiff(x):
z = (int(x/100) * 12) + (x - int(x/100) * 100)
return z
series['age'].apply(monthdiff)
Now, I have a new column with my simple (yet beautiful) calculation applied to every line in the data frame!
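A runnable sketch of the same idea, assuming the column holds non-negative integers that pack years and months as years * 100 + months (that encoding is a guess from the formula, and the sample values are made up). It also shows a vectorized form that gives the same result without calling a Python function per row:

```python
import pandas as pd

def monthdiff(x):
    # Whole hundreds count as years (12 months each); the remainder is months.
    return (int(x / 100) * 12) + (x - int(x / 100) * 100)

# Hypothetical data; the 'age' column name follows the answer above.
frame = pd.DataFrame({"age": [201703, 201612, 305]})
frame["months"] = frame["age"].apply(monthdiff)

# Vectorized equivalent for non-negative integers.
frame["months_vectorized"] = (frame["age"] // 100) * 12 + frame["age"] % 100
print(frame)
```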
Try pd.to_numeric() to convert the column to a numeric dtype first.
When I got the same error, this is what worked for me.
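This only helps when the column holds numbers stored as strings (object dtype), which is not the case in the question above. A minimal sketch under that assumption, with made-up values; errors="coerce" turns anything unparseable into NaN instead of raising:

```python
import pandas as pd

# Made-up column of numbers stored as text, with one bad entry.
raw = pd.Series(["61.385", "32.799", "n/a"])

# Convert to float64; values that cannot be parsed become NaN.
numeric = pd.to_numeric(raw, errors="coerce")
print(numeric)
```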
New to pandas here. A (trivial) problem: hosts, operations, execution times. I want to group by host, then by host+operation, calculate std deviation for execution time per host, then by host+operation pair. Seems simple?
It works for grouping by a single column:
df
Out[360]:
<class 'pandas.core.frame.DataFrame'>
Int64Index: 132564 entries, 0 to 132563
Data columns (total 9 columns):
datespecial 132564 non-null values
host 132564 non-null values
idnum 132564 non-null values
operation 132564 non-null values
time 132564 non-null values
...
dtypes: float32(1), int64(2), object(6)
byhost = df.groupby('host')
byhost.std()
Out[362]:
datespecial idnum time
host
ahost1.test 11946.961952 40367.033852 0.003699
host1.test 15484.975077 38206.578115 0.008800
host10.test NaN 37644.137631 0.018001
...
Nice. Now:
byhostandop = df.groupby(['host', 'operation'])
byhostandop.std()
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-364-2c2566b866c4> in <module>()
----> 1 byhostandop.std()
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in std(self, ddof)
386 # todo, implement at cython level?
387 if ddof == 1:
--> 388 return self._cython_agg_general('std')
389 else:
390 f = lambda x: x.std(ddof=ddof)
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_general(self, how, numeric_only)
1615
1616 def _cython_agg_general(self, how, numeric_only=True):
-> 1617 new_blocks = self._cython_agg_blocks(how, numeric_only=numeric_only)
1618 return self._wrap_agged_blocks(new_blocks)
1619
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in _cython_agg_blocks(self, how, numeric_only)
1653 values = com.ensure_float(values)
1654
-> 1655 result, _ = self.grouper.aggregate(values, how, axis=agg_axis)
1656
1657 # see if we can cast the block back to the original dtype
/home/username/anaconda/lib/python2.7/site-packages/pandas/core/groupby.pyc in aggregate(self, values, how, axis)
838 if is_numeric:
839 result = lib.row_bool_subset(result,
--> 840 (counts > 0).view(np.uint8))
841 else:
842 result = lib.row_bool_subset_object(result,
/home/username/anaconda/lib/python2.7/site-packages/pandas/lib.so in pandas.lib.row_bool_subset (pandas/lib.c:16540)()
ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'
Huh?? Why do I get this exception?
More questions:
how do I calculate std deviation on dataframe.groupby([several columns])?
how can I limit calculation to a selected column? E.g. it obviously doesn't make sense to calculate std dev on dates/timestamps here.
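It's important to know your version of pandas / Python. It looks like this exception could arise in pandas versions < 0.10 (see ValueError: Buffer dtype mismatch, expected 'float64_t' but got 'float'). To avoid it, you can cast your float32 column to float64. astype returns a new object, so assign the result back, and cast only the float column, because casting the whole frame would fail on the string columns:
df['time'] = df['time'].astype('float64')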
To calculate std() on selected columns, just select columns :)
>>> df = pd.DataFrame({'a':range(10), 'b':range(10,20), 'c':list('abcdefghij'), 'g':[1]*3 + [2]*3 + [3]*4})
>>> df
a b c g
0 0 10 a 1
1 1 11 b 1
2 2 12 c 1
3 3 13 d 2
4 4 14 e 2
5 5 15 f 2
6 6 16 g 3
7 7 17 h 3
8 8 18 i 3
9 9 19 j 3
>>> df.groupby('g')[['a', 'b']].std()
a b
g
1 1.000000 1.000000
2 1.000000 1.000000
3 1.290994 1.290994
update
As far as I can tell, std() calls aggregate() on the groupby result, which runs into a subtle bug (see Python Pandas: Using Aggregate vs Apply to define new columns). To avoid this, you can use apply():
byhostandop['time'].apply(lambda x: x.std())
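Putting both parts of the question together, a minimal sketch on made-up data (the host, operation and time values are invented; only the column names come from the question), run against a current pandas version. Selecting the time column before calling std() limits the calculation to that column for both groupings:

```python
import pandas as pd

# Made-up timings for two hosts and two operations.
df = pd.DataFrame({
    "host": ["a", "a", "a", "a", "b", "b"],
    "operation": ["read", "read", "write", "write", "read", "read"],
    "time": [1.0, 3.0, 2.0, 2.0, 5.0, 9.0],
})

# Standard deviation of execution time per host.
per_host = df.groupby("host")["time"].std()

# Standard deviation per (host, operation) pair; the result has a MultiIndex.
per_host_op = df.groupby(["host", "operation"])["time"].std()

print(per_host)
print(per_host_op)
```

A single pair can then be looked up with a tuple, for example per_host_op[("a", "read")].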