Get Pandas Duplicate Row Count with Original Index - python

I need to find duplicate rows in a Pandas DataFrame and then add an extra column with the count. Let's say we have a dataframe:
>>> print(df)
+----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 |
| 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 12 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 |
| 18 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+
The above frame would then become the one below, with an additional column holding the count. Note that each remaining row keeps the index of its first occurrence.
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|----+-----+-----+-----+-----+-----+-----+-----+-----+-----|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 2 | 2 | 4 | 3 | 4 | 1 | 1 | 4 | 4 | 1 |
| 3 | 4 | 3 | 4 | 0 | 0 | 0 | 0 | 0 | 2 |
| 4 | 2 | 3 | 4 | 3 | 4 | 0 | 0 | 0 | 1 |
| 5 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 |
| 6 | 4 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 1 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 1 |
| 10 | 3 | 3 | 4 | 3 | 5 | 5 | 5 | 0 | 1 |
| 11 | 5 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 13 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 15 | 1 | 3 | 5 | 0 | 0 | 0 | 0 | 0 | 1 |
| 16 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 17 | 3 | 3 | 4 | 4 | 0 | 0 | 0 | 0 | 1 |
+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
I've seen other solutions to this such as:
df.groupby(list(df.columns.values)).size()
But that returns a result indexed by the grouped values (displayed with gaps) and without the original index.

You can use reset_index first to convert the index to a column, and then aggregate that column with first and size.
To group by all of the original columns, exclude the index column with difference:
print (df.columns.difference(['index']))
Index(['2', '3', '4', '5', '6', '7', '8', '9'], dtype='object')
print (df.reset_index()
         .groupby(df.columns.difference(['index']).tolist())['index']
         .agg(['first', 'size'])
         .reset_index()
         .set_index(['first'])
         .sort_index()
         .rename_axis(None))
2 3 4 5 6 7 8 9 size
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1
If the count column should be named after the next column number (10), rename it:
# name of the next column; convert to str because the column names are strings
last_col = str(df.columns.astype(int).max() + 1)
print (last_col)
10
print (df.reset_index()
         .groupby(df.columns.difference(['index']).tolist())['index']
         .agg(['first', 'size'])
         .reset_index()
         .set_index(['first'])
         .sort_index()
         .rename_axis(None)
         .rename(columns={'size':last_col}))
2 3 4 5 6 7 8 9 10
0 0 0 0 0 0 0 0 0 2
1 2 0 0 0 0 0 0 0 2
2 2 4 3 4 1 1 4 4 1
3 4 3 4 0 0 0 0 0 2
4 2 3 4 3 4 0 0 0 1
5 5 0 0 0 0 0 0 0 3
6 4 5 0 0 0 0 0 0 1
7 1 1 4 0 0 0 0 0 1
10 3 3 4 3 5 5 5 0 1
11 5 4 0 0 0 0 0 0 1
13 0 4 0 0 0 0 0 0 1
15 1 3 5 0 0 0 0 0 1
16 4 0 0 0 0 0 0 0 1
17 3 3 4 4 0 0 0 0 1
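As a shorter variant of the same idea, here is a minimal sketch that skips the reset_index round trip. It assumes the new column may simply be called count (a name that is not in the original) and uses a small made-up frame. groupby(...).transform('size') broadcasts each group's size back onto the original rows, and drop_duplicates then keeps the first occurrence together with its original index:

```python
import pandas as pd

# Small made-up frame: rows 0 and 3 match, rows 1, 4 and 5 match
df = pd.DataFrame(
    [[0, 0, 0], [2, 0, 0], [2, 4, 3], [0, 0, 0], [2, 0, 0], [2, 0, 0]],
    columns=["2", "3", "4"],
)

cols = df.columns.tolist()
# Size of each group of identical rows, aligned to the original index
counts = df.groupby(cols)[cols[0]].transform("size")
# Keep only the first occurrence of each row; its index label is untouched
out = df.assign(count=counts).drop_duplicates(subset=cols)
print(out)
```

Note that groupby drops rows whose key columns contain NaN by default, so this sketch assumes there are no missing values in the grouped columns.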

Related

Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is observing a subset of the features, and if there are duplicate rows, to keep the first and then denote which id: label pair is the duplicate.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
However, what I am trying to do is for a particular row, determine which rows are duplicates of it by saving those rows as id: label combination. So while I'm able to extract the id and label for each duplicate, I have no ability to map it back to the original row for which it is a duplicate.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is really complicated; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# mask marking duplicates (every occurrence after the first)
m = sub_df.duplicated()
# build (id, label) tuples for the duplicates, then aggregate each group into a dict
s = (df.assign(a = df[['id','label']].apply(tuple, axis=1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))
# add new column
df = df.join(s.rename('duplicates'), on=cols)
# replace missing values, and the rows that are themselves duplicates, with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
An alternative uses a custom function that, for each group, stores a dict of all duplicates except the first in the first row of a new column; the mask is then changed so that every other row gets an empty string:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# True only for the first row of each group that has duplicates
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)
def f(x):
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id','label']].to_numpy()[1:])]
    return x
# group_keys=False keeps the original row index instead of prepending the group keys
df = df.groupby(cols, group_keys=False).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
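The question also mentions not being able to adapt the transform('idxmin') idea to a list of columns. A minimal sketch of that route follows, under some assumptions: the column name original_id is made up for illustration, ft1 and ft2 stand in for the real feature columns, and the dicts are built with a plain Python loop rather than inside groupby. transform('first') tags every row with the id of the first row that has the same feature values, which gives the mapping back to the original directly:

```python
import pandas as pd

# Made-up subset of the example data; ft1/ft2 stand in for the feature columns
df = pd.DataFrame({
    "id": ["id_100", "id_101", "id_102", "id_103"],
    "ft1": [1, 1, 1, 1],
    "ft2": [1, 0, 1, 1],
    "label": ["High", "Medium", "Low", "Low"],
})
cols = ["ft1", "ft2"]

# For every row, the id of the first row sharing the same feature values
df["original_id"] = df.groupby(cols)["id"].transform("first")

# Rows that repeat an earlier row
dupes = df[df.duplicated(subset=cols)]

# original id -> {duplicate id: duplicate label}
mapping = {}
for orig, dup_id, label in zip(dupes["original_id"], dupes["id"], dupes["label"]):
    mapping.setdefault(orig, {})[dup_id] = label

# Attach the dict to the original row only; every other row gets ''
df["duplicates"] = [mapping.get(i, "") for i in df["id"]]
print(df)
```

Keeping the mapping in an ordinary dict outside the frame may be easier to work with at the stated data size than storing dicts in a column, but that is a design preference rather than something the answers above require.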

Group by as a list and create new column for each value

I have a dataframe where every row holds a user id and an item that user has:
| user | item_id |
|------|---------|
| 1 | a |
| 1 | b |
| 2 | b |
| 3 | c |
| 4 | a |
| 4 | c |
I want to create n columns, where n is the number of possible values of item_id, with one row per user, filled with 1/0 according to whether the value is present for that user.
| user | item_a | item_b | item_c |
|------|---------|---------|----------|
| 1 | 1 | 1 | 0 |
| 2 | 0 | 1 | 0 |
| 3 | 0 | 0 | 1 |
| 4 | 1 | 0 | 1 |
Use pivot_table:
import pandas as pd
df = pd.DataFrame({'user': [1,1,2,3,4,4], 'item_id': list('abbcac')})
# keep the result in a new name so the original df stays available below
out = df.assign(val=1).pivot_table(values='val',
                                   index='user',
                                   columns='item_id',
                                   fill_value=0)
Or use crosstab on the original df:
pd.crosstab(df.user, df.item_id).add_prefix('item_').reset_index()
Yet another approach is to use get_dummies, then group by user and sum:
df = pd.get_dummies(df, columns=['item_id']).groupby('user').sum().reset_index()
Result:
user item_id_a item_id_b item_id_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
To rename the columns:
df.columns = df.columns.str.replace(r"_id", "")
df
user item_a item_b item_c
0 1 1 1 0
1 2 0 1 0
2 3 0 0 1
3 4 1 0 1
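One thing to watch with the crosstab and get_dummies routes is that they count occurrences, so a user who has the same item twice would get 2 rather than 1. A minimal sketch of one way to force 1/0 flags, using a made-up frame in which user 1 has item a twice (the name flags is only for illustration):

```python
import pandas as pd

# Made-up data: user 1 appears twice with item 'a'
df = pd.DataFrame({"user": [1, 1, 1, 2], "item_id": ["a", "a", "b", "b"]})

flags = (
    pd.crosstab(df["user"], df["item_id"])  # counts per user and item
    .gt(0)                                  # True where the item occurs at all
    .astype(int)                            # True/False -> 1/0
    .add_prefix("item_")
    .reset_index()
)
print(flags)
```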

Iterate and assign value in Pandas dataframe based on condition

I have a pandas dataframe with 8 columns (c1 to c7, plus a last one called total). c1 to c7 contain 0s and 1s.
The column total should hold the maximum number of consecutive 1s within c1 to c7. c1 to c7 represent weekdays, so the sequence wraps around: c7 is followed by c1.
For example, if we had an initial dataframe like df:
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | total |
|:---|:---|:---|:---|:---|:---|:---|:-----:|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
My initial thought was to create a loop with an if statement within to evaluate the criteria within the columns and assign the value to the column total.
i = "c1"
d =
for i in df.iloc[:,0:7]:
    if df[i] == 1 and df[i-1] == 1:
        df["total"]:= df["total"] + 1
I would expect df to look like:
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | total |
|:---|:---|:---|:---|:---|:---|:---|:-----:|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 5 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 2 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 6 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 | 3 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 2 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
I haven't been able to get a result. I was trying to build it up step by step, but kept getting an error when the if statement is evaluated:
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
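import pandas as pd
df = pd.DataFrame([[1,1,1,1,1,0,0], [0,0,1,0,0,0,0], [1,1,0,1,1,0,0]])
def fun(x):
    count = 0
    result = 0
    n = len(x)
    # walk the row twice so that a run can wrap from the last day to the first
    for i in range(2 * n):
        if x.iloc[i % n] == 0:
            result = max(result, count)
            count = 0
        else:
            count += 1
    # a row without any 0 never reaches the branch above, so cap the run at n
    return min(max(result, count), n)
df['total'] = df.apply(fun, axis=1)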
0 1 2 3 4 5 6 total
0 1 1 1 1 1 0 0 5
1 0 0 1 0 0 0 0 1
2 1 1 0 1 1 0 0 2
Bugs in your loop:
df[i-1] will throw an error on the first iteration, because there is no previous column
df[i] gives the values of the ith column for all the rows
"7 should then flip to 1": this part is missing in your code
To wrap the tail (7) of the row back to the head (1), append a copy of the row to its tail and then check for consecutive 1s. This can also be done by looping over the row twice and using the modulus operator. Check this algorithm for more details
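The "append a copy of the row to its tail" idea can be sketched with itertools.groupby, which splits a sequence into runs of equal values. This is only an illustration under stated assumptions: the function name longest_circular_run and the sample frame are made up, and a row consisting only of 1s is treated as a run of the full row length:

```python
from itertools import groupby

import pandas as pd


def longest_circular_run(bits):
    """Length of the longest run of 1s, letting the row wrap around."""
    bits = list(bits)
    n = len(bits)
    if all(bits):          # no zero anywhere: the run is the whole row
        return n
    doubled = bits + bits  # a wrapped run shows up as a plain run here
    return max(
        (sum(1 for _ in run) for value, run in groupby(doubled) if value == 1),
        default=0,
    )


df = pd.DataFrame(
    [[1, 1, 1, 0, 1, 1, 1], [1, 0, 1, 0, 1, 0, 1], [0, 0, 0, 0, 0, 0, 0]],
    columns=[f"c{i}" for i in range(1, 8)],
)
# Compute the totals from the raw rows before the new column is added
df["total"] = [longest_circular_run(row) for row in df.to_numpy()]
print(df)
```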

Create a month for every date between a period and make them columns

I want to separate out every month in the period between the 'start' and 'end' columns; then I know I can use pivot_table to turn them into columns:
subscription|values| start | end
x |1 |5/5/2018 |6/5/2018
y |2 |5/5/2018 |8/5/2018
z |1 |5/5/2018 |9/5/2018
a |3 |5/5/2018 |10/5/2018
b |4 |5/5/2018 |11/5/2018
c |2 |5/5/2018 |12/5/2018
Desired Output:
subscription|jan| feb | mar | abr | jun | jul | aug | sep | out | nov | dez
x | | | | | 1 | 1 | | | | |
y | | | | | 2 | 2 | 2 | | | |
z | | | | | 1 | 1 | 1 | 1 | | |
a | | | | | 3 | 3 | 3 | 3 | 3 | |
b | | | | | 4 | 4 | 4 | 4 | 4 | 4 |
c | | | | | 2 | 2 | 2 | 2 | 2 | 2 | 2
Using a simple cumsum:
import calendar
import numpy as np
import pandas as pd
# work on a plain array: column 0 is unused, columns 1..12 are Jan..Dec
arr = np.zeros((len(df), 13))
First set the start month to values and the end month to -values (start and end must be datetime columns):
r = np.arange(len(df))
arr[r, df.start.dt.month] = df['values']
arr[r, df.end.dt.month] = -df['values']
Then take the cumulative sum along axis=1:
arr = arr.cumsum(axis=1)
Set the end month back to values:
arr[r, df.end.dt.month] = df['values']
df2 = pd.DataFrame(arr, columns=list(calendar.month_abbr))
Final output:
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
0 0 0 0 0 0 1 1 0 0 0 0 0 0
1 0 0 0 0 0 2 2 2 2 0 0 0 0
2 0 0 0 0 0 1 1 1 1 1 0 0 0
3 0 0 0 0 0 3 3 3 3 3 3 0 0
4 0 0 0 0 0 4 4 4 4 4 4 4 0
5 0 0 0 0 0 2 2 2 2 2 2 2 2
A method using sklearn's MultiLabelBinarizer:
from sklearn.preprocessing import MultiLabelBinarizer
df['L'] = [pd.date_range(x, y, freq='M') for x, y in zip(df.start, df.end)]
mlb = MultiLabelBinarizer()
yourdf = pd.DataFrame(mlb.fit_transform(df['L']), columns=mlb.classes_, index=df.index).mul(df['values'], axis=0)
yourdf.columns = yourdf.columns.strftime('%Y%B')
yourdf['subscription'] = df['subscription']
yourdf
Out[75]:
2018May 2018June ... 2018November subscription
0 1 0 ... 0 x
1 2 2 ... 0 y
2 1 1 ... 0 z
3 3 3 ... 0 a
4 4 4 ... 0 b
5 2 2 ... 2 c
[6 rows x 8 columns]
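Since the question mentions pivoting, here is a minimal sketch of that route as well, assuming start and end are already datetimes and that both the start month and the end month should be included (unlike the month-end date_range above, which leaves out a partial last month). The frame is a made-up two-row subset and the names rows and wide are only for illustration:

```python
import pandas as pd

# Made-up two-row subset of the example, with parsed dates
df = pd.DataFrame({
    "subscription": ["x", "y"],
    "values": [1, 2],
    "start": pd.to_datetime(["2018-05-05", "2018-05-05"]),
    "end": pd.to_datetime(["2018-06-05", "2018-08-05"]),
})

# One record per subscription and month covered by [start, end]
rows = [
    {"subscription": sub, "month": str(period), "value": val}
    for sub, val, start, end in zip(
        df["subscription"], df["values"], df["start"], df["end"]
    )
    for period in pd.period_range(start, end, freq="M")
]

# Months become columns; months a subscription does not cover are filled with 0
wide = (
    pd.DataFrame(rows)
    .pivot(index="subscription", columns="month", values="value")
    .fillna(0)
    .astype(int)
)
print(wide)
```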

Unexpected Python TypeError: when using scalars

I am new to Python and in my opinion it is much different than Java.
I have looked at other answers, which imply that the error occurs because I am passing an array where a single value is expected. I don't know about that. I am pretty sure I am simply passing a value.
The line, 97, is:
exponential = math.exp(-(math.pow(feature_value-mean, 2) / (2*math.pow(standard_deviation, 2))))
The complete text of the error is:
Traceback (most recent call last):
File "D:/Personal/Python/NB.py", line 153, in <module>
main()
File "D:/Personal/Python/NB.py", line 148, in main
predictions = getPredict(summaries, testing_set)
File "D:/Personal/Python/NB.py", line 129, in getPredict
classification = predict(results, testData[index])
File "D:/Personal/Python/NB.py", line 117, in predict
probabilities = Classify(feature_summaries, classifications)
File "D:/Personal/Python/NB.py", line 113, in Classify
probabilities[classes] = probabilities[classes] * GaussianProbabilityDensity(feature_value, mean, standard_deviation)
File "D:/Personal/Python/NB.py", line 97, in GaussianProbabilityDensity
exponential = math.exp(-(math.pow(feature_value-mean, 2) / (2*math.pow(standard_deviation, 2))))
TypeError: only size-1 arrays can be converted to Python scalars
If it is useful, the csv is below. It should be noted I have two other algorithms that run on this dataset just fine.
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Your issue is that class_summaries (from line 107) is a list of tuples, one of which you select and pass into the GaussianProbabilityDensity function as feature_value.
It ends up causing the error on line 97. Note that if you were to fix it (I replaced the value with a hard 1.0), you'll end up with a division by zero error, as the standard_deviation you're putting in happens to be 0 at that point.
The way I found this was to use a Python IDE that has a proper debugger (I like PyCharm), set a breakpoint on the line you indicated, and inspect the various variables before the error occurs. I recommend trying to solve these types of problems in a similar fashion, as it saves a lot of time and spurious print statements.
math.pow (like all math functions) only works with scalars, that is single numbers (integer or float). The error says that one of the arguments, such as standard_deviation is a numpy array with more than one element, so it can't be converted to a scalar and passed to math.pow.
This occurs in your own code, so there's no difficulty in tracing those variables back to their source.
Either you unintentionally passed an array to this function, or you need to replace math.pow and math.exp with np.power and np.exp, which do work with arrays.
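To illustrate the difference, here is a minimal sketch of the density written with NumPy functions so that it accepts either a single number or an array. The function name gaussian_pdf, its parameter names, and the explicit zero check are assumptions for illustration, not the asker's code; the check is there because a zero standard deviation would otherwise lead to a division by zero, as noted above:

```python
import math

import numpy as np


def gaussian_pdf(x, mean, std):
    """Gaussian density that accepts a scalar or an array for x."""
    if std == 0:
        raise ValueError("standard deviation must be non-zero")
    x = np.asarray(x, dtype=float)
    exponent = np.exp(-np.power(x - mean, 2) / (2 * np.power(std, 2)))
    return exponent / (math.sqrt(2 * math.pi) * std)


values = np.array([0.0, 1.0])
densities = gaussian_pdf(values, mean=0.0, std=1.0)
print(densities)

# The math module wants a single number, so an array of size 2 is rejected
try:
    math.exp(values)
    math_accepts_array = True
except TypeError:
    math_accepts_array = False
print(math_accepts_array)
```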
You generate a numpy array when loading from the csv:
data = numpy.loadtxt(data, delimiter=',')
# Loop through the data in the array
for index in range(len(data)):
    # Utilize a try catch to try and convert to float, if it can't convert to float, converts to 0
    try:
        data[index] = [float(x) for x in data[index]]
    except ValueError:
        data[index] = 0
loadtxt returns an array with float dtype (the default). All its elements will be floats; if it had read something that wasn't a valid float, it would have raised an error. Thus the loop isn't needed. The loop also looks as if it was written for a list, not an array.
randomize_data doesn't need to return anything, because np.random.shuffle operates in-place on csv. That isn't what causes the error, though.
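Both points are easy to check on a tiny input. The sketch below reads a made-up three-line CSV from an in-memory string (an assumption for illustration, in place of the real file) and shows that loadtxt already yields floats and that np.random.shuffle returns None while reordering the rows in place:

```python
import io

import numpy as np

# Made-up CSV content standing in for the real file
csv_text = "0,1,0\n1,1,1\n1,0,0\n"

data = np.loadtxt(io.StringIO(csv_text), delimiter=",")
print(data.dtype)   # every element is already a float

# shuffle reorders the rows of data itself and returns None
result = np.random.shuffle(data)
print(result)
```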
