See the error message below. It points to this code, which takes two numpy arrays of company brands and checks whether there are any new brand names in the new_df brand column.
I have looked at the input variables new_df['brand'].unique() and existing_df['brand'].unique(), and neither of them is None; they are numpy arrays, so I don't get what the problem is:
#find new brands
brand_diff = np.setdiff1d(new_df['brand'].unique(),existing_df['brand'].unique(),False)
count_brand_diff = len(brand_diff)
TypeError Traceback (most recent call last)
<ipython-input-75-254b4c01e085> in <module>
71
72 #find new brands
---> 73 brand_diff = np.setdiff1d(new_df['brand'].unique(),existing_df['brand'].unique(),False)
74 count_brand_diff = len(brand_diff)
75
<__array_function__ internals> in setdiff1d(*args, **kwargs)
~/opt/anaconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py in setdiff1d(ar1, ar2, assume_unique)
782 ar1 = np.asarray(ar1).ravel()
783 else:
--> 784 ar1 = unique(ar1)
785 ar2 = unique(ar2)
786 return ar1[in1d(ar1, ar2, assume_unique=True, invert=True)]
<__array_function__ internals> in unique(*args, **kwargs)
~/opt/anaconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
260 ar = np.asanyarray(ar)
261 if axis is None:
--> 262 ret = _unique1d(ar, return_index, return_inverse, return_counts)
263 return _unpack_tuple(ret)
264
~/opt/anaconda3/lib/python3.7/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
308 aux = ar[perm]
309 else:
--> 310 ar.sort()
311 aux = ar
312 mask = np.empty(aux.shape, dtype=np.bool_)
TypeError: '<' not supported between instances of 'NoneType' and 'NavigableString'
The problem is with the data you are using, because the code itself is correct. For example:
>>existing_df
brand
apple
apple
bmw
>>new_df
brand
apple
lexus
bmw
>>count_brand_diff
1
Hence, if you need more help, please provide an example of the data you are using.
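If the brand columns were scraped with BeautifulSoup, they can contain None values and NavigableString objects mixed together, which is what the "'<' not supported between instances of 'NoneType' and 'NavigableString'" message suggests. A minimal sketch of one way to normalise the columns before taking the set difference (the example values are hypothetical):
import numpy as np
import pandas as pd

# hypothetical columns mixing plain strings with a missing value
new_df = pd.DataFrame({'brand': ['apple', 'lexus', None, 'bmw']})
existing_df = pd.DataFrame({'brand': ['apple', 'apple', 'bmw']})

# drop missing values and force everything to str so np.setdiff1d can sort the arrays
new_brands = new_df['brand'].dropna().astype(str).unique()
existing_brands = existing_df['brand'].dropna().astype(str).unique()

brand_diff = np.setdiff1d(new_brands, existing_brands)
count_brand_diff = len(brand_diff)
print(brand_diff, count_brand_diff)   # ['lexus'] 1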
I need some help figuring this out. I have been trying a few things, but nothing is working. I have a pandas data frame, shown below (at the end).
The data is available at irregular intervals (the frequency is not fixed). I want to resample the data at a fixed frequency, for example every 1 minute. If the column is a float, taking the mean every 1 minute works fine:
df1.resample('1T',base = 1).mean()
but since the data is categorical, the mean doesn't make sense, and I also tried sum, which doesn't make sense for this kind of sampling either. What I essentially need is the most frequent value of the column when sampled at 1 minute. To do this, I used the following code to apply a custom function to the values that fall within each minute when resampling:
def custome_mod(arraylike):
    vals, counts = np.unique(arraylike, return_counts=True)
    return (np.argwhere(counts == np.max(counts)))
df1.resample('1T',base = 1).apply(custome_mod)
The output I am expecting is a data frame available at every 1 minute, with the value that has the maximum count among the data that fall in that minute.
For some reason it does not seem to work and gives me an error. I have been trying to debug for a very long time. Can somebody please provide some input or check the code?
The error I get is the following:
ValueError: zero-size array to reduction operation maximum which has no identity
ValueError Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
264 try:
--> 265 return self._python_agg_general(func, *args, **kwargs)
266 except (ValueError, KeyError):
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
935
--> 936 result, counts = self.grouper.agg_series(obj, f)
937 assert result is not None
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in agg_series(self, obj, func)
862 grouper = libreduction.SeriesBinGrouper(obj, func, self.bins, dummy)
--> 863 return grouper.get_result()
864
pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesBinGrouper.get_result()
pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()
pandas/_libs/reduction.pyx in pandas._libs.reduction._check_result_array()
ValueError: Function does not reduce
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
358 # Check if the function is reducing or not.
--> 359 result = grouped._aggregate_item_by_item(how, *args, **kwargs)
360 else:
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_item_by_item(self, func, *args, **kwargs)
1171 try:
-> 1172 result[item] = colg.aggregate(func, *args, **kwargs)
1173
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
268 # see see test_groupby.test_basic
--> 269 result = self._aggregate_named(func, *args, **kwargs)
270
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/generic.py in _aggregate_named(self, func, *args, **kwargs)
453 if isinstance(output, (Series, Index, np.ndarray)):
--> 454 raise ValueError("Must produce aggregated value")
455 result[name] = output
ValueError: Must produce aggregated value
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
<command-36984414005459> in <module>
----> 1 df1.resample('1T',base = 1).apply(custome_mod)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in aggregate(self, func, *args, **kwargs)
283 how = func
284 grouper = None
--> 285 result = self._groupby_and_aggregate(how, grouper, *args, **kwargs)
286
287 result = self._apply_loffset(result)
/databricks/python/lib/python3.7/site-packages/pandas/core/resample.py in _groupby_and_aggregate(self, how, grouper, *args, **kwargs)
380 # we have a non-reducing function
381 # try to evaluate
--> 382 result = grouped.apply(how, *args, **kwargs)
383
384 result = self._apply_loffset(result)
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in apply(self, func, *args, **kwargs)
733 with option_context("mode.chained_assignment", None):
734 try:
--> 735 result = self._python_apply_general(f)
736 except TypeError:
737 # gh-20949
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/groupby.py in _python_apply_general(self, f)
749
750 def _python_apply_general(self, f):
--> 751 keys, values, mutated = self.grouper.apply(f, self._selected_obj, self.axis)
752
753 return self._wrap_applied_output(
/databricks/python/lib/python3.7/site-packages/pandas/core/groupby/ops.py in apply(self, f, data, axis)
204 # group might be modified
205 group_axes = group.axes
--> 206 res = f(group)
207 if not _is_indexed_like(res, group_axes):
208 mutated = True
<command-36984414005658> in custome_mod(arraylike)
1 def custome_mod(arraylike):
2 vals, counts = np.unique(arraylike, return_counts=True)
----> 3 return (np.argwhere(counts == np.max(counts)))
<__array_function__ internals> in amax(*args, **kwargs)
/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in amax(a, axis, out, keepdims, initial, where)
2666 """
2667 return _wrapreduction(a, np.maximum, 'max', axis, None, out,
-> 2668 keepdims=keepdims, initial=initial, where=where)
2669
2670
/databricks/python/lib/python3.7/site-packages/numpy/core/fromnumeric.py in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
88 return reduction(axis=axis, out=out, **passkwargs)
89
---> 90 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)
91
92
ValueError: zero-size array to reduction operation maximum which has no identity
Sample Dataframe and expected Output
Sample Df
6/3/2021 1:19:05 0
6/3/2021 1:19:15 1
6/3/2021 1:19:26 1
6/3/2021 1:19:38 1
6/3/2021 1:20:06 0
6/3/2021 1:20:16 0
6/3/2021 1:20:36 1
6/3/2021 1:21:09 1
6/3/2021 1:21:19 1
6/3/2021 1:21:45 0
6/4/2021 1:19:15 0
6/4/2021 1:19:25 0
6/4/2021 1:19:36 0
6/4/2021 1:19:48 1
6/4/2021 1:22:26 1
6/4/2021 1:22:36 0
6/4/2021 1:22:46 0
6/5/2021 2:20:19 0
6/5/2021 2:20:21 1
6/5/2021 2:20:40 0
Expected Output
6/3/2021 1:19 1
6/3/2021 1:20 0
6/3/2021 1:21 1
6/4/2021 1:19 0
6/4/2021 1:22 0
6/5/2021 2:20 0
Notice that the original data frame has data available at an irregular frequency (sometimes every 5 seconds, sometimes every 20 seconds, etc.). The expected output is also shown above: I need data every 1 minute (resampled to every minute instead of the original irregular seconds), and the categorical column should have the most frequent value during that minute. For example, in the original data at minute 19 there are four data points and the most frequent value among them is 1. Similarly, at minute 20 there are three data points in the original data and the most frequent value is 0, and at minute 21 there are three data points and the most frequent value is 1. The data I am working with has 20 million rows; this is an effort to reduce the data dimension.
After getting the expected output, I would group by the column and count. This count will be in minutes, and I will be able to know how long this column was 1 (in time).
Update after your edit:
out = df.set_index(pd.to_datetime(df.index).floor('T')) \
        .groupby(level=0)['category'] \
        .apply(lambda x: x.value_counts().idxmax())
print(out)
# Output
2021-06-03 01:19:00 1
2021-06-03 01:20:00 0
2021-06-03 01:21:00 1
2021-06-04 01:19:00 0
2021-06-04 01:22:00 0
2021-06-05 02:20:00 0
Name: category, dtype: int64
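If you want to stay with resample instead of groupby, guarding against empty minutes avoids the zero-size array error from the question, since resample creates a bin for every minute in the range and some of those bins are empty. A minimal sketch, assuming df1 has a DatetimeIndex and the column is named 'category':
import pandas as pd

def safe_mode(s):
    # empty bins would otherwise reach np.max / idxmax with nothing to reduce
    if s.empty:
        return pd.NA
    return s.value_counts().idxmax()

out = df1['category'].resample('1T').apply(safe_mode).dropna()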
Old answer
# I used 'D' instead of 'T'
>>> df.set_index(df.index.floor('D')).groupby(level=0).count()
category
2021-06-03 6
2021-06-04 2
2021-06-06 1
2021-06-08 1
2021-06-25 1
2021-06-29 6
2021-06-30 3
# OR
>>> df.set_index(df.index.floor('D')).groupby(level=0).sum()
category
2021-06-03 2
2021-06-04 0
2021-06-06 1
2021-06-08 1
2021-06-25 0
2021-06-29 3
2021-06-30 1
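To follow up on the asker's note about counting how long the column was 1, a short sketch using the out series from the update above:
# each row of `out` represents one minute, so counting rows equal to 1 gives the
# number of minutes whose most frequent value was 1
minutes_at_1 = (out == 1).sum()
print(minutes_at_1)   # 2 for the sample data shown above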
I am new to Python and am learning LSTM using Pandas with a sample project that I've modified from Github to use with my own data. I am running it on Kaggle.
For reference, the project is found here: https://github.com/abaranovskis-redsamurai/automation-repo/blob/master/forecast-lstm/forecast_lstm_shampoo_sales.ipynb
My data is simply a csv with dates and sales. Here's what the first few lines look like, with the date being YYYY-MM:
"date","num"
"1995-12",700
"1996-1",500
"1997-2",1300
"1996-3",2800
"1996-4",3500
The error I am getting says that "TypeError: float() argument must be a string or a number, not 'datetime.datetime'".
The code is here:
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dropout
from sklearn.preprocessing import MinMaxScaler
import warnings
warnings.filterwarnings("ignore")
def parser(x):
    return pd.datetime.strptime(x, '%Y-%m')
df = pd.read_csv('../input/smalltestb/smalltest1b.csv', parse_dates=[0], date_parser=parser)
df.tail()
train = df
scaler = MinMaxScaler()
scaler.fit(train)
train = scaler.transform(train)
n_input = 12
n_features = 1
generator = TimeseriesGenerator(train, train, length=n_input, batch_size=6)
model = Sequential()
model.add(LSTM(200, activation='relu', input_shape=(n_input, n_features)))
model.add(Dropout(0.15))
Finally, the error message:
TypeError Traceback (most recent call last)
/tmp/ipykernel_35/785266029.py in <module>
25
26 scaler = MinMaxScaler()
---> 27 scaler.fit(train)
28 train = scaler.transform(train)
29 n_input = 12
/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in fit(self, X, y)
334 # Reset internal state before fitting
335 self._reset()
--> 336 return self.partial_fit(X, y)
337
338 def partial_fit(self, X, y=None):
/opt/conda/lib/python3.7/site-packages/sklearn/preprocessing/_data.py in partial_fit(self, X, y)
369 X = self._validate_data(X, reset=first_pass,
370 estimator=self, dtype=FLOAT_DTYPES,
--> 371 force_all_finite="allow-nan")
372
373 data_min = np.nanmin(X, axis=0)
/opt/conda/lib/python3.7/site-packages/sklearn/base.py in _validate_data(self, X, y, reset, validate_separately, **check_params)
418 f"requires y to be passed, but the target y is None."
419 )
--> 420 X = check_array(X, **check_params)
421 out = X
422 else:
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in inner_f(*args, **kwargs)
70 FutureWarning)
71 kwargs.update({k: arg for k, arg in zip(sig.parameters, args)})
---> 72 return f(**kwargs)
73 return inner_f
74
/opt/conda/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, estimator)
596 array = array.astype(dtype, casting="unsafe", copy=False)
597 else:
--> 598 array = np.asarray(array, order=order, dtype=dtype)
599 except ComplexWarning:
600 raise ValueError("Complex data not supported\n"
/opt/conda/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
/opt/conda/lib/python3.7/site-packages/pandas/core/generic.py in __array__(self, dtype)
1991
1992 def __array__(self, dtype: NpDtype | None = None) -> np.ndarray:
-> 1993 return np.asarray(self._values, dtype=dtype)
1994
1995 def __array_wrap__(
/opt/conda/lib/python3.7/site-packages/numpy/core/_asarray.py in asarray(a, dtype, order)
81
82 """
---> 83 return array(a, dtype, copy=False, order=order)
84
85
TypeError: float() argument must be a string or a number, not 'datetime.datetime'
So, I decided to run just the import part in another notebook and look at the head. It didn't format correctly:
date num
0 1995-12-01 00:00:00 700
1 1996-01-01 00:00:00 500
2 1996-02-01 00:00:00 1300
3 1996-03-01 00:00:00 2800
4 1997-04-01 00:00:00 3500
This is definitely not what I wanted (I wanted YYYY-MM), and I know the csv is saved that way. I know this must come from the parser, and it is not saving the dates to the dataframe in the way that I am expecting.
How do I address this? As a note, the guy on GitHub had this for his parser, but it choked when I tried it:
def parser(x):
    return pd.datetime.strptime('190'+x, '%Y-%m')

df = pd.read_csv('shampoo.csv', parse_dates=[0], index_col=0, date_parser=parser)
(He prepended '190' to a one-digit year followed by a dash and a month number, whereas I am using a full year, a dash, and a month number.)
Any suggestions? Thanks for having a look!
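For illustration only, here is a minimal sketch of the shape of the problem: the parsed datetime column is still inside the frame that gets passed to MinMaxScaler, and the scaler only accepts numeric data, so it raises the float() TypeError. Keeping just the numeric column (or moving the date into the index) avoids it. The column names are taken from the sample data above.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# hypothetical reproduction of the parsed frame: one datetime column, one numeric column
df = pd.DataFrame({'date': pd.to_datetime(['1995-12', '1996-01', '1996-02']),
                   'num': [700, 500, 1300]})

# move the date into the index so only the numeric column reaches the scaler
train = df.set_index('date')[['num']]
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
print(train_scaled)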
I have a dataframe (cgf) that looks as follows and I want to remove the outliers for only the numerical columns:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 180 entries, 0 to 179
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Product 180 non-null object
1 Age 180 non-null int64
2 Gender 180 non-null object
3 Education 180 non-null category
4 MaritalStatus 180 non-null object
5 Usage 180 non-null int64
6 Fitness 180 non-null category
7 Income 180 non-null int64
8 Miles 180 non-null int64
dtypes: category(2), int64(4), object(3)
I tried several scripts using z-score and IQR methods, but none of them worked. For example, here is a z-score script that didn't work:
from scipy import stats
import numpy as np
z = np.abs(stats.zscore(cgf)) # get the z-score of every value with respect to their columns
print(z)
I get this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-102-2759aa3fbd60> in <module>
----> 1 z = np.abs(stats.zscore(cgf)) # get the z-score of every value with respect to their columns
2 print(z)
~\anaconda3\lib\site-packages\scipy\stats\stats.py in zscore(a, axis, ddof, nan_policy)
2495 sstd = np.nanstd(a=a, axis=axis, ddof=ddof, keepdims=True)
2496 else:
-> 2497 mns = a.mean(axis=axis, keepdims=True)
2498 sstd = a.std(axis=axis, ddof=ddof, keepdims=True)
2499
~\anaconda3\lib\site-packages\numpy\core\_methods.py in _mean(a, axis, dtype, out, keepdims)
160 ret = umr_sum(arr, axis, dtype, out, keepdims)
161 if isinstance(ret, mu.ndarray):
--> 162 ret = um.true_divide(
163 ret, rcount, out=ret, casting='unsafe', subok=False)
164 if is_float16_result and out is None:
TypeError: unsupported operand type(s) for /: 'str' and 'int'
Here is the IQR method I tried, but it also failed as follows:
np.where((cgf < (Q1 - 1.5 * IQR)) | (cgf > (Q3 + 1.5 * IQR)))
error message:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-96-bb3dfd2ce6c5> in <module>
----> 1 np.where((cgf < (Q1 - 1.5 * IQR)) | (cgf > (Q3 + 1.5 * IQR)))
~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in f(self, other)
702
703 # See GH#4537 for discussion of scalar op behavior
--> 704 new_data = dispatch_to_series(self, other, op, axis=axis)
705 return self._construct_result(new_data)
706
~\anaconda3\lib\site-packages\pandas\core\ops\__init__.py in dispatch_to_series(left, right, func, axis)
273 # _frame_arith_method_with_reindex
274
--> 275 bm = left._mgr.operate_blockwise(right._mgr, array_op)
276 return type(left)(bm)
277
~\anaconda3\lib\site-packages\pandas\core\internals\managers.py in operate_blockwise(self, other, array_op)
362 Apply array_op blockwise with another (aligned) BlockManager.
363 """
--> 364 return operate_blockwise(self, other, array_op)
365
366 def apply(self: T, f, align_keys=None, **kwargs) -> T:
~\anaconda3\lib\site-packages\pandas\core\internals\ops.py in operate_blockwise(left, right, array_op)
36 lvals, rvals = _get_same_shape_values(blk, rblk, left_ea, right_ea)
37
---> 38 res_values = array_op(lvals, rvals)
39 if left_ea and not right_ea and hasattr(res_values, "reshape"):
40 res_values = res_values.reshape(1, -1)
~\anaconda3\lib\site-packages\pandas\core\ops\array_ops.py in comparison_op(left, right, op)
228 if should_extension_dispatch(lvalues, rvalues):
229 # Call the method on lvalues
--> 230 res_values = op(lvalues, rvalues)
231
232 elif is_scalar(rvalues) and isna(rvalues):
~\anaconda3\lib\site-packages\pandas\core\ops\common.py in new_method(self, other)
63 other = item_from_zerodim(other)
64
---> 65 return method(self, other)
66
67 return new_method
~\anaconda3\lib\site-packages\pandas\core\arrays\categorical.py in func(self, other)
74 if not self.ordered:
75 if opname in ["__lt__", "__gt__", "__le__", "__ge__"]:
---> 76 raise TypeError(
77 "Unordered Categoricals can only compare equality or not"
78 )
TypeError: Unordered Categoricals can only compare equality or not
How do I resolve some of these errors? It appears the combination of categorical and numerical data in my df is causing the problem, but I am a newbie and I don't know how to fix it so that I can remove the outliers.
For example, if you drop outliers in the 'Age' column, the change is reflected in the data frame: the entire row containing the outlier is dropped.
Reference: towardsdatascience
Reference: how-to-remove-outliers
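A common way to make the z-score approach work on a frame like this is to restrict it to the numeric columns first. A minimal sketch, assuming cgf is the frame from the question (numeric columns Age, Usage, Income, Miles) and the usual 3-sigma cut-off:
import numpy as np
from scipy import stats

# compute z-scores only on the numeric columns, leaving the object/category columns alone
numeric_cols = cgf.select_dtypes(include=np.number)
z = np.abs(stats.zscore(numeric_cols))

# keep the rows whose numeric values all stay within 3 standard deviations
cgf_no_outliers = cgf[(z < 3).all(axis=1)]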
So I have multiple data frames, and all of them need the same formula applied to certain sets within the data frame. I have the locations of the sets inside the df, but I don't know how to access those sets.
This is my code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # might use/need it later to check the output
df = pd.read_csv('Dalfsen.csv')
l = []
x = []
y = []
#the formula (trendline)
def rechtzetten(x,y):
    a = (len(x)*sum(x*y) - sum(x)*sum(y))/(len(x)*sum(x**2) - sum(x)**2)
    b = (sum(y) - a*sum(x))/len(x)
    y1 = x*a + b
    print(y1)

METING = df.ID.str.contains("<METING>") #locating the sets
indicatie = np.where(METING == False)[0] #and saving them somewhere

if n in df[n] != indicatie & n+1 != indicatie: #attempt to add parts of the set in l
    append.l
elif n in df[n] != indicatie & n+1 == indicatie: #attempt defining the end of the set and using the formula for the set
    append.l
    rechtzetten(l.x, l.y)
else: #emptying the storage for the new set
    l = []
indicatie has the following numbers:
0 12 13 26 27 40 41 53 54 66 67 80 81 94 95 108 109 121
122 137 138 149 150 162 163 177 178 190 191 204 205 217 218 229 230 242
243 255 256 268 269 291 292 312 313 340 341 373 374 401 402 410 411 420
421 430 431 449 450 468 469 487 488 504 505 521 522 538 539 558 559 575
576 590 591 604 605 619 620 633 634 647
Because my df looks like this:
ID,NUM,x,y,nap,abs,end
<PROFIEL>not used data
<METING>data</METING>
<METING>data</METING>
...
<METING>data</METING>
<METING>data</METING>
</PROFIEL>,,,,,,
<PROFIEL>not usde data
...
</PROFIEL>,,,,,,
tl;dr I'm trying to apply a formula to each profile as shown above. I want to edit the data between two consecutive numbers of the list indicatie.
For example:
the function rechtzetten(x,y) for df.x and df.y[1:11] (because [0] and [12] are in the list indicatie), and then the same for [14:25], and so on.
What I am trying to avoid is typing the following hundreds of times manually:
x_#=df.x[1:11]
y_#=df.y[1:11]
rechtzetten(x_#,y_#)
I can't understand your question clearly, but if you want to replace a specific column of your pandas dataframe with a numpy array, you can simply assign it:
df['Column'] = numpy_array
Could you be more clear?
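If the goal is to run rechtzetten on every block of <METING> rows, that is, on the rows between consecutive entries of indicatie, one possible sketch is shown below. It reuses df, indicatie, and rechtzetten from the question and assumes the x and y columns are numeric:
# walk over consecutive boundary positions and apply the trendline to each block in between
for start, end in zip(indicatie, indicatie[1:]):
    segment = df.iloc[start + 1:end]   # the <METING> rows strictly between two boundaries
    if len(segment) > 1:               # skip blocks where the boundaries are adjacent
        rechtzetten(segment['x'], segment['y'])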
I am trying to convert my categorical columns into integers with Label Encoder in order to create a correlation matrix consisting of a mix of numerical and categorical variables. This is my table structure:
a int64
b int64
c object
d object
e object
f object
g object
dtype: object
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
for x in df.columns:
    if df[x].dtypes=='object':
        df[x]=le.fit_transform(df[x])
corr = df.corr()
Then I get this error:
TypeError: unorderable types: int() < str()
TypeError Traceback (most recent call last)
<command-205607> in <module>()
3 for x in df.columns:
4 if df[x].dtypes=='object':
----> 5 df[x]=le.fit_transform(df[x])
6 corr = df.corr()
/databricks/python/lib/python3.5/site-packages/sklearn/preprocessing/label.py in fit_transform(self, y)
129 y = column_or_1d(y, warn=True)
130 _check_numpy_unicode_bug(y)
--> 131 self.classes_, y = np.unique(y, return_inverse=True)
132 return y
133
/databricks/python/lib/python3.5/site-packages/numpy/lib/arraysetops.py in unique(ar, return_index, return_inverse, return_counts, axis)
221 ar = np.asanyarray(ar)
222 if axis is None:
--> 223 return _unique1d(ar, return_index, return_inverse, return_counts)
224 if not (-ar.ndim <= axis < ar.ndim):
225 raise ValueError('Invalid axis kwarg specified for unique')
/databricks/python/lib/python3.5/site-packages/numpy/lib/arraysetops.py in _unique1d(ar, return_index, return_inverse, return_counts)
278
279 if optional_indices:
--> 280 perm = ar.argsort(kind='mergesort' if return_index else 'quicksort')
281 aux = ar[perm]
282 else:
TypeError: unorderable types: int() < str()
Does anybody have an idea what is wrong?
Change df[x]=le.fit_transform(df[x]) to
df[x]=le.fit_transform(df[x].astype(str))
and it should work. The object columns apparently contain a mix of integers and strings; LabelEncoder sorts the values internally (via np.unique), and in Python 3 an int cannot be compared with a str, which is exactly what the traceback shows. Casting the column to str first gives every value a single, orderable type.
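Putting the fix into the original loop, the whole snippet would look roughly like this:
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
for x in df.columns:
    if df[x].dtypes == 'object':
        # cast to str so every value in the column has one orderable type before encoding
        df[x] = le.fit_transform(df[x].astype(str))

corr = df.corr()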