pandas: Split and convert series of alphanumeric texts to columns and rows - python

Current data frame: I have a pandas data frame where each employee has a text code (all codes start with T) and an associated frequency right next to the code. All text codes have 8 characters.
+----------+-------------------------------------------------------------+
| emp_id | text |
+----------+-------------------------------------------------------------+
| E0001 | [T0431516,-8,T0401531,-12,T0517519,12] |
| E0002 | [T0701540,-1,T0431516,-2] |
| E0003 | [T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]|
| E0004 | [T0516319,-3] |
| E0005 | [T0431516,2] |
+----------+-------------------------------------------------------------+
Expected data frame: I am trying to turn the text codes present in the data frame into individual columns; if an employee has a frequency for a code, populate that frequency, else 0.
+----------+----------------------------------------------------------------------------------------+
| emp_id | T0431516 | T0401531 | T0517519 | T0701540 | T0421531 | T0516319 | T0500371 | T0309711 |
+----------+----------------------------------------------------------------------------------------+
| E0001 | -8 | -12 | 12 | 0 | 0 | 0 | 0 | 0 |
| E0002 | -2 | 0 | 0 | -1 | 0 | 0 | 0 | 0 |
| E0003 | 0 | 0 | -1 | 0 | -7 | 9 | -6 | -3 |
| E0004 | 0 | 0 | 0 | 0 | 0 | -3 | 0 | 0 |
| E0005 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------+----------------------------------------------------------------------------------------+
Sample data:
pd.DataFrame({'emp_id' : {0: 'E0001', 1: 'E0002', 2: 'E0003', 3: 'E0004', 4: 'E0005'},
'text' : {0: '[T0431516,-8,T0401531,-12,T0517519,12]', 1: '[T0701540,-1,T0431516,-2]', 2: '[T0517519,-1,T0421531,-7,T0516319,9,T0500371,-6,T0309711,-3]', 3: '[T0516319,-3]', 4: '[T0431516,2]'}
})
So far, my attempts have been unsuccessful. Any pointers/help is much appreciated!

You can explode the dataframe and then create a pivot_table:
df = pd.DataFrame({'emp_id': ['E0001', 'E0002', 'E0003', 'E0004', 'E0005'],
                   'text': [['T0431516', -8, 'T0401531', -12, 'T0517519', 12],
                            ['T0701540', -1, 'T0431516', -2],
                            ['T0517519', -1, 'T0421531', -7, 'T0516319', 9, 'T0500371', -6, 'T0309711', -3],
                            ['T0516319', -3],
                            ['T0431516', 2]]})
df = df.explode('text')
# each frequency immediately follows its code, so shift it up next to the code
df['freq'] = df['text'].shift(-1)
# keep only the code rows
df = df[df['text'].str[0] == 'T']
df['freq'] = df['freq'].astype(int)
df = pd.pivot_table(df, index='emp_id', columns='text', values='freq', aggfunc='sum').fillna(0).astype(int)
df
Out[1]:
text T0309711 T0401531 T0421531 T0431516 T0500371 T0516319 T0517519 \
emp_id
E0001 0 -12 0 -8 0 0 12
E0002 0 0 0 -2 0 0 0
E0003 -3 0 -7 0 -6 9 -1
E0004 0 0 0 0 0 -3 0
E0005 0 0 0 2 0 0 0
text T0701540
emp_id
E0001 0
E0002 -1
E0003 0
E0004 0
E0005 0
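Note that the answer above re-creates df with the text column already split into Python lists. If text is stored as a single string, as in the question's sample data, a small sketch of converting it first (same column names as the question; the explode/pivot_table steps above then apply unchanged):

# convert the bracketed string into a list of alternating codes and frequencies
df['text'] = df['text'].str.strip('[]').str.split(',')
# the frequencies are still strings at this point; the .astype(int) step above converts them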

Related

Pandas: group and custom transform dataframe long to wide

I have a dataframe in following form:
+---------+-------+-------+---------+---------+
| payment | type | err | country | source |
+---------+-------+-------+---------+---------+
| visa | type1 | OK | AR | source1 |
| paypal | type1 | OK | DE | source1 |
| mc | type2 | ERROR | AU | source2 |
| visa | type3 | OK | US | source2 |
| visa | type2 | OK | FR | source3 |
| visa | type1 | OK | FR | source2 |
+---------+-------+-------+---------+---------+
df = pd.DataFrame({'payment':['visa','paypal','mc','visa','visa','visa'],
'type':['type1','type1','type2','type3','type2','type1'],
'err':['OK','OK','ERROR','OK','OK','OK'],
'country':['AR','DE','AU','US','FR','FR'],
'source':['source1','source1','source2','source2','source3','source2'],
})
My goal is to transform it so that I group by payment and country and create new columns:
number_payments - just the row count for the group,
num_errors - the number of ERROR values in the group,
num_type1 .. num_type3 - the number of corresponding values in column type (only 3 possible values),
num_source1 .. num_source3 - the number of corresponding values in column source (only 3 possible values).
Like this:
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| payment | country | number_payments | num_errors | num_type1 | num_type2 | num_type3 | num_source1 | num_source2 | num_source3 |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
| visa | AR | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| visa | US | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| visa | FR | 2 | 0 | 1 | 2 | 0 | 0 | 1 | 1 |
| mc | AU | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
+---------+---------+-----------------+------------+-----------+-----------+-----------+-------------+-------------+-------------+
I tried to combine pandas groupby and pivot, but failed to make it all work, and the result is ugly. I'm pretty sure there are some good and fast methods to do this.
Appreciate any help.
You can use get_dummies, then select the two grouper columns and create the groups, and finally join the group sizes with the sums:
c = df['err'].eq("ERROR")
g = (df[['payment', 'country']]
     .assign(num_errors=c, **pd.get_dummies(df[['type', 'source']], prefix=['num', 'num']))
     .groupby(['payment', 'country']))
out = g.size().to_frame("number_payments").join(g.sum()).reset_index()
print(out)
payment country number_payments num_errors num_type1 num_type2 \
0 mc AU 1 1 0 1
1 paypal DE 1 0 1 0
2 visa AR 1 0 1 0
3 visa FR 2 0 1 1
4 visa US 1 0 0 0
num_type3 num_source1 num_source2 num_source3
0 0 0 1 0
1 0 1 0 0
2 0 1 0 0
3 0 0 1 1
4 1 0 1 0
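As a side note, pd.crosstab can produce the same per-group type and source counts without building dummies first; a minimal sketch, not part of the answer above:

grp = [df['payment'], df['country']]
type_counts = pd.crosstab(grp, df['type']).add_prefix('num_')      # num_type1 .. num_type3
source_counts = pd.crosstab(grp, df['source']).add_prefix('num_')  # num_source1 .. num_source3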
First it is better to clean the data for your stated purposes:
df['err_bool'] = (df['err'] == 'ERROR').astype(int)
Then we use groupby for applicable columns:
df_grouped = df.groupby(['country', 'payment']).agg(
    number_payments=('err', 'count'),
    num_errors=('err_bool', 'sum'))
Then we can use pivot_table for type and source:
df['dummy'] = 1
df_type = df.pivot_table(
    index=['country', 'payment'],
    columns='type',
    values='dummy',
    aggfunc='sum',
    fill_value=0
)
df_source = df.pivot_table(
    index=['country', 'payment'],
    columns='source',
    values='dummy',
    aggfunc='sum',
    fill_value=0
)
Then we join everything together:
df_grouped = df_grouped.join(df_type).join(df_source)
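The pivoted columns keep the raw type/source values as their names, so a small hedged follow-up can line them up with the num_* names in the expected output:

# rename the pivoted count columns and turn the group keys back into columns
df_grouped.columns = ['num_' + c if c.startswith(('type', 'source')) else c
                      for c in df_grouped.columns]
df_grouped = df_grouped.reset_index()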

Mapping duplicate rows to originals with dictionary - Python 3.6

I am trying to locate duplicate rows in my pandas dataframe. In reality, df.shape is (438796, 4531), but I am using the toy example below as an MRE.
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low |
| id_104 | 1 | 1 | 10 | 1 | 1 | High |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low |
| id_106 | 0 | 0 | 0 | 0 | 0 | High |
| id_107 | 1 | 1 | 6 | 0 | 1 | High |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium |
| id_110 | 0 | 1 | 32 | 0 | 1 | High |
What I am trying to accomplish is, looking at a subset of the features: if there are duplicate rows, keep the first and denote which id: label pairs are its duplicates.
I have looked at the following posts:
find duplicate rows in a pandas dataframe
(I could not figure out how to replace col1 in df['index_original'] = df.groupby(['col1', 'col2']).col1.transform('idxmin') with my list of cols)
Find all duplicate rows in a pandas dataframe
I know pandas has a duplicated() call. So I tried implementing that and it sort of works:
import pandas as pd
# Read in example data
df = pd.read_clipboard()
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# Create a list of duplicates
dupes = sub_df.index[sub_df.duplicated(keep='first')].tolist()
# Loop through the duplicates and print out the values I want
for idx in dupes:
    # print(df[:idx])
    print(df.loc[[idx], ['id', 'label']])
However, what I am trying to do is, for a particular row, determine which rows are duplicates of it by saving those rows as id: label combinations. So while I'm able to extract the id and label for each duplicate, I have no way to map it back to the original row of which it is a duplicate.
An ideal dataset would look like:
| id | ft1 | ft2 | ft3 | ft4 | ft5 | label | duplicates |
|:------:|:---:|:---:|:---:|:---:|:---:|:------:|:-------------------------------------------:|
| id_100 | 1 | 1 | 43 | 1 | 1 | High | {id_102: Low, id_104: High, id_108: Medium} |
| id_101 | 1 | 1 | 33 | 0 | 1 | Medium | {id_107: High} |
| id_102 | 1 | 1 | 12 | 1 | 1 | Low | |
| id_103 | 1 | 1 | 46 | 1 | 0 | Low | |
| id_104 | 1 | 1 | 10 | 1 | 1 | High | |
| id_105 | 0 | 1 | 99 | 0 | 1 | Low | {id_110: High} |
| id_106 | 0 | 0 | 0 | 0 | 0 | High | |
| id_107 | 1 | 1 | 6 | 0 | 1 | High | |
| id_108 | 1 | 1 | 29 | 1 | 1 | Medium | |
| id_109 | 1 | 0 | 27 | 0 | 0 | Medium | |
| id_110 | 0 | 1 | 32 | 0 | 1 | High | |
How can I take my duplicated values and map them back to their originals efficiently (understanding the size of my actual dataset)?
Working with dictionaries in columns is really complicated; here is one possible solution:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
# Create a subset of my dataframe with only the columns I care about
sub_df = df[cols]
# mask marking every row that duplicates an earlier row
m = sub_df.duplicated()
# create (id, label) tuples and aggregate them to a dict per group of duplicated rows
s = (df.assign(a=df[['id', 'label']].apply(tuple, axis=1))[m]
       .groupby(cols)['a']
       .agg(lambda x: dict(list(x))))
# add new column
df = df.join(s.rename('duplicates'), on=cols)
# replace missing values and the non-first duplicates with empty strings
df['duplicates'] = df['duplicates'].fillna('').mask(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicates
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10
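If a plain lookup from each original id to its duplicates is more convenient than a column, a one-line sketch on the result above:

# dict keyed by the original id; the value is its duplicates dict (or '' where there are none)
dup_map = df.set_index('id')['duplicates'].to_dict()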
An alternative uses a custom function to assign all duplicates except the first one to the new column of each group's first row; the last step uses the changed mask to replace the remaining values with empty strings:
# Declare columns I am interested in
cols = ['ft1', 'ft2', 'ft4', 'ft5']
m = ~df.duplicated(subset=cols) & df.duplicated(subset=cols, keep=False)
def f(x):
    x.loc[x.index[0], 'duplicated'] = [dict(x[['id', 'label']].to_numpy()[1:])]
    return x
df = df.groupby(cols).apply(f)
df['duplicated'] = df['duplicated'].where(m, '')
print (df)
id ft1 ft2 ft3 ft4 ft5 label \
0 id_100 1 1 43 1 1 High
1 id_101 1 1 33 0 1 Medium
2 id_102 1 1 12 1 1 Low
3 id_103 1 1 46 1 0 Low
4 id_104 1 1 10 1 1 High
5 id_105 0 1 99 0 1 Low
6 id_106 0 0 0 0 0 High
7 id_107 1 1 6 0 1 High
8 id_108 1 1 29 1 1 Medium
9 id_109 1 0 27 0 0 Medium
10 id_110 0 1 32 0 1 High
duplicated
0 {'id_102': 'Low', 'id_104': 'High', 'id_108': ...
1 {'id_107': 'High'}
2
3
4
5 {'id_110': 'High'}
6
7
8
9
10

Pandas: Wide to first, second, third, identified categories

I am wondering if anyone knows a quick way in pandas to pivot a dataframe to achieve the desired transformation below. It is sort of a wide-to-long pivot, but not quite.
Input Dataframe structure (needs to be able to support N number of categories, not just 3 as case below)
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| id | catA_present | catA_pos | catA_neg | catA_ntrl | catB_present | catB_pos | catB_neg | catB_ntrl | catC_present | catC_pos | catC_neg | catC_ntrl |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0001 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0002 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0003 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0004 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
Output Transformed Dataframe structure: (needs to support N number of categories, not just 3 as example shows)
+------+------+-------+------+-------+------+-------+
| id | cat1 | sent1 | cat2 | sent2 | cat3 | sent3 |
+------+------+-------+------+-------+------+-------+
| 0001 | catA | pos | catC | neg | NULL | NULL |
+------+------+-------+------+-------+------+-------+
| 0002 | catB | pos | catC | pos | NULL | NULL |
+------+------+-------+------+-------+------+-------+
| 0003 | catA | ntrl | catB | ntrl | NULL | NULL |
+------+------+-------+------+-------+------+-------+
| 0004 | catA | pos | catB | pos | catC | ntrl |
+------+------+-------+------+-------+------+-------+
| 0005 | catC | neg | NULL | NULL | NULL | NULL |
+------+------+-------+------+-------+------+-------+
I don't think it's a pivot at all. However, anything is possible, so here we go:
import io
import itertools
import pandas
# Your data
data = io.StringIO(
"""
id | catA_present | catA_pos | catA_neg | catA_ntrl | catB_present | catB_pos | catB_neg | catB_ntrl | catC_present | catC_pos | catC_neg | catC_ntrl
0001 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0
0002 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0
0003 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0
0004 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1
0005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0
"""
)
df = pandas.read_table(data, sep=r"\s*\|\s*", engine="python")

def get_sentiment(row: pandas.Series):
    if row["cat_pos"] == 1:
        return "pos"
    elif row["cat_neg"] == 1:
        return "neg"
    elif row["cat_ntrl"] == 1:
        return "ntrl"
    else:
        return None
# Initialize a dict that will hold an entry for every index in the dataframe, with a list of categories and sentiments
categories_per_index = {index: [] for index in df.index}
# Extract the unique names of all possible categories (the 4th character of the column name, e.g. the "A" in "catA_present")
categories = set([column[3] for column in df.columns if column.startswith("cat")])
# Loop over the unique categories
for key in categories:
    # Select only the columns for a particular category, and where that category is present
    group = df.loc[df[f"cat{key}_present"] == 1, [f"cat{key}_present", f"cat{key}_pos", f"cat{key}_neg", f"cat{key}_ntrl"]]
    # Change the column names for generic processing
    group.columns = ["cat_present", "cat_pos", "cat_neg", "cat_ntrl"]
    # Figure out the sentiment for every line
    group["sentiment"] = group.apply(get_sentiment, axis=1)
    # Loop the rows in the group and add the sentiment for this category to the indices
    for index, row in group.iterrows():
        # Add the name of the category and the sentiment to the index
        categories_per_index[index].append(f"cat{key}")
        categories_per_index[index].append(row["sentiment"])
# Reconstruct the dataframe from the dictionary
df = pandas.DataFrame.from_dict(
    categories_per_index,
    orient="index",
    columns=list(itertools.chain.from_iterable([[f"cat{i}", f"sent{i}"] for i in range(len(categories))])),
)
Output:
print(df)
cat0 sent0 cat1 sent1 cat2 sent2
0 catA pos catC neg None None
1 catB pos catC pos None None
2 catB ntrl catA ntrl None None
3 catB pos catA pos catC ntrl
4 catC neg None None None None
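To match the question's layout exactly, a hedged final touch: renumber the columns from 1 and restore the id column. This assumes the original frame was kept around under a separate name (orig below is hypothetical) before df was overwritten:

# cat0/sent0 .. becomes cat1/sent1 .., in the same order as the reconstructed frame
df.columns = [f"{kind}{i + 1}" for i in range(len(categories)) for kind in ("cat", "sent")]
# orig is assumed to be a saved copy of the input dataframe
df.insert(0, "id", orig["id"].to_numpy())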

Iterate and assign value in Pandas dataframe based on condition

I have a pandas dataframe composed of 8 columns (c1 to c7 and the last is called total). c1 to c7 are 0 and 1.
The total column should hold the maximum number of consecutive 1s within c1 to c7. c1 to c7 represent weekdays, hence c7 should wrap around to c1.
For example, if we had an initial dataframe like df:
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | total |
|:---|:---|:---|:---|:---|:---|:---|:-----:|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 | 0 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |
My initial thought was to create a loop with an if statement within to evaluate the criteria within the columns and assign the value to the column total.
i = "c1"
d =
for i in df.iloc[:,0:7]:
if df[i] == 1 and df[i-1] == 1:
df["total"]:= df["total"] + 1
I would expect df to look like:
| c1 | c2 | c3 | c4 | c5 | c6 | c7 | total |
|:---|:---|:---|:---|:---|:---|:---|:-----:|
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 5 |
| 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 2 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 6 |
| 1 | 0 | 1 | 1 | 1 | 0 | 1 | 3 |
| 1 | 0 | 1 | 0 | 1 | 0 | 1 | 2 |
| 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
I haven't been able to get to a result; I was trying to build it step by step, but I kept getting an error in the if statement evaluation:
ValueError: The truth value of a Series is ambiguous. Use a.empty,
a.bool(), a.item(), a.any() or a.all().
import pandas as pd

df = pd.DataFrame([[1,1,1,1,1,0,0], [0,0,1,0,0,0,0], [1,1,0,1,1,0,0]])

def fun(x):
    count = 0
    result = 0
    n = len(x)
    # walk the row twice so a run that wraps from the last day back to the first is counted
    for i in range(0, 2*n):
        if x[i % n] == 0:
            result = max(result, count)
            count = 0
        else:
            count += 1
    return result

df['total'] = df.apply(lambda x: fun(x), axis=1)
0 1 2 3 4 5 6 total
0 1 1 1 1 1 0 0 5
1 0 0 1 0 0 0 0 1
2 1 1 0 1 1 0 0 2
Bugs in your loop
df[i-1] will throw an error: i is a column label (a string such as "c1"), so i-1 is not a valid index
df[i] gives the values of the ith column for all rows (a Series), which is why comparing it with 1 inside an if raises the ambiguous truth value error
7 should then flip to 1: this part is missing in your code
To flip the tail (c7) of the row back to the head (c1), place a copy of the row at the tail and then check for consecutive 1's. This can also be done by looping the row twice and using a modulus operator. Check this algorithm for more details.
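For completeness, a quick check of fun against the full example from the question (a sketch: column names c1..c7 assumed, and each row converted to a numpy array so the positional indexing inside fun is unambiguous):

cols = [f'c{i}' for i in range(1, 8)]
df_full = pd.DataFrame([[1, 1, 1, 1, 1, 0, 0],
                        [0, 0, 1, 0, 0, 0, 0],
                        [1, 1, 0, 1, 1, 0, 0],
                        [1, 1, 1, 0, 1, 1, 1],
                        [1, 0, 1, 1, 1, 0, 1],
                        [1, 0, 1, 0, 1, 0, 1],
                        [0, 1, 0, 1, 0, 1, 0]], columns=cols)
df_full['total'] = df_full[cols].apply(lambda r: fun(r.to_numpy()), axis=1)
# totals: 5, 1, 2, 6, 3, 2, 1 - matching the expected dataframe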

Unexpected Python TypeError: when using scalars

I am new to Python, and in my opinion it is much different from Java.
I have looked at other answers, which imply that the error occurs because I am passing an array when a value is expected. I don't know about that; I am pretty sure I am simply passing a value.
The line in question, line 97, is:
exponential = math.exp(-(math.pow(feature_value-mean, 2) / (2*math.pow(standard_deviation, 2))))
The complete text of the error is:
Traceback (most recent call last):
File "D:/Personal/Python/NB.py", line 153, in <module>
main()
File "D:/Personal/Python/NB.py", line 148, in main
predictions = getPredict(summaries, testing_set)
File "D:/Personal/Python/NB.py", line 129, in getPredict
classification = predict(results, testData[index])
File "D:/Personal/Python/NB.py", line 117, in predict
probabilities = Classify(feature_summaries, classifications)
File "D:/Personal/Python/NB.py", line 113, in Classify
probabilities[classes] = probabilities[classes] * GaussianProbabilityDensity(feature_value, mean, standard_deviation)
File "D:/Personal/Python/NB.py", line 97, in GaussianProbabilityDensity
exponential = math.exp(-(math.pow(feature_value-mean, 2) / (2*math.pow(standard_deviation, 2))))
TypeError: only size-1 arrays can be converted to Python scalars
If it is useful, the csv is below. It should be noted I have two other algorithms that run on this dataset just fine.
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Your issue is that class_summaries (from line 107) is a list of tuples, one of which you select and pass into the GaussianProbabilityDensity function as feature_value.
It ends up causing the error on line 97. Note that if you were to fix it (I replaced the value with a hard 1.0), you'll end up with a division by zero error, as the standard_deviation you're putting in happens to be 0 at that point.
The way I found this was to use a Python IDE with a proper debugger (I like PyCharm), set a breakpoint on the line you indicated, and inspect the various variables before the error occurs. I recommend trying to solve these types of problems in a similar fashion, as it saves a lot of time and spurious print statements.
math.pow (like all math functions) only works with scalars, that is, single numbers (integer or float). The error says that one of the arguments, such as standard_deviation, is a numpy array with more than one element, so it can't be converted to a scalar and passed to math.pow.
This occurs in your own code, so there's no difficulty in tracing those variables back to their source.
Either you unintentionally passed an array to this function, or you need to replace math.pow with np.power (and math.exp with np.exp), which do work with arrays.
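A minimal sketch of the difference (assuming numpy is imported as np):

import math
import numpy as np

arr = np.array([1.0, 2.0, 3.0])
print(np.power(arr, 2))   # elementwise: [1. 4. 9.]
print(np.exp(-arr))       # also elementwise
# math.pow(arr, 2)        # TypeError: only size-1 arrays can be converted to Python scalars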
You generate a numpy array when loading from the csv
data = numpy.loadtxt(data, delimiter=',')
# Loop through the data in the array
for index in range(len(data)):
    # Use a try/except to convert to float; if the row can't be converted, set it to 0
    try:
        data[index] = [float(x) for x in data[index]]
    except ValueError:
        data[index] = 0
loadtxt returns an array, with float dtype (the default). All its elements will be floats - if it read something that wasn't valid float, it would have raised an error. Thus the loop isn't needed. And the loop looks too much like it was written for a list, not an array.
randomize_data shouldn't return anything. np.random.shuffle operates in-place on csv. That doesn't cause an error.
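A sketch of what that could look like (the body of randomize_data is not shown in the question, so this is an assumption):

import numpy as np

def randomize_data(data):
    # np.random.shuffle reorders the rows of the array in place and returns None
    np.random.shuffle(data)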
