Pandas: Wide to first, second, third, identified categories - python

I am wondering if anyone knows a quick way in pandas to pivot a dataframe to achieve the desired transformation below. It is sort of a wide-to-long pivot, but not quite.
Input Dataframe structure (needs to be able to support N number of categories, not just 3 as case below)
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| id | catA_present | catA_pos | catA_neg | catA_ntrl | catB_present | catB_pos | catB_neg | catB_ntrl | catC_present | catC_pos | catC_neg | catC_ntrl |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0001 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0002 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0003 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0004 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
| 0005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
+------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+--------------+----------+----------+-----------+
Output Transformed Dataframe structure: (needs to support N number of categories, not just 3 as example shows)
+------+------+-------+------+-------+------+-------+
| id | cat1 | sent1 | cat2 | sent2 | cat3 | sent3 |
+------+------+-------+------+-------+------+-------+
| 0001 | catA | pos | catC | neg | NULL | NULL |
+------+------+-------+------+-------+------+-------+
| 0002 | catB | pos | catC | pos | NULL | NULL |
+------+------+-------+------+-------+------+-------+
| 0003 | catA | ntrl | catB | ntrl | NULL | NULL |
+------+------+-------+------+-------+------+-------+
| 0004 | catA | pos | catB | pos | catC | ntrl |
+------+------+-------+------+-------+------+-------+
| 0005 | catC | neg | NULL | NULL | NULL | NULL |
+------+------+-------+------+-------+------+-------+

I don't think it's a pivot at all.. However, anything is possible so here we go:
import io
import itertools
import pandas
# Your data
data = io.StringIO(
"""
id | catA_present | catA_pos | catA_neg | catA_ntrl | catB_present | catB_pos | catB_neg | catB_ntrl | catC_present | catC_pos | catC_neg | catC_ntrl
0001 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0
0002 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0
0003 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0
0004 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1
0005 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0
"""
)
df = pandas.read_table(data, sep="\s*\|\s*")
def get_sentiment(row: pandas.Series) -> pandas.Series:
if row["cat_pos"] == 1:
return "pos"
elif row["cat_neg"] == 1:
return "neg"
elif row["cat_ntrl"] == 1:
return "ntrl"
else:
return None
# Initialize a dict that will hold an entry for every index in the dataframe, with a list of categories and sentiments
categories_per_index = {index: [] for index in df.index}
# Extract a list of unique names of all possible categories
categories = set([column[3] for column in df.columns if column.startswith("cat")])
# Loop over the unique categories
for key in categories:
# Select only the columns for a particular category, and where that category is present
group = df.loc[df[f"cat{key}_present"] == 1, [f"cat{key}_present", f"cat{key}_pos", f"cat{key}_neg", f"cat{key}_ntrl"]]
# Change the column names for generic processing
group.columns = ["cat_present", "cat_pos", "cat_neg", "cat_ntrl"]
# Figure out the sentiment for every line
group["sentiment"] = group.apply(get_sentiment, axis=1)
# Loop the rows in the group and add the sentiment for this category to the indices
for index, row in group.iterrows():
# Add the name of the category and the sentiment to the index
categories_per_index[index].append(f"cat{key}")
categories_per_index[index].append(row["sentiment"])
# Reconstruct the dataframe from the dictionary
df = pandas.DataFrame.from_dict(categories_per_index, orient="index", columns=list(itertools.chain.from_iterable([ [f"cat{i}", f"sent{i}"] for i in range(len(categories)) ])))
Output:
print(df)
cat0 sent0 cat1 sent1 cat2 sent2
0 catA pos catC neg None None
1 catB pos catC pos None None
2 catB ntrl catA ntrl None None
3 catB pos catA pos catC ntrl
4 catC neg None None None None

Related

How to split a column into several columns by taking the string values as column headers?

This is my dataset:
| Name | Dept | Project area/areas interested |
| -------- | -------- |-----------------------------------|
| Joe | Biotech | Cell culture//Bioinfo//Immunology |
| Ann | Biotech | Cell culture |
| Ben | Math | Trigonometry//Algebra |
| Keren | Biotech | Microbio |
| Alice | Physics | Optics |
This is how I want my result:
| Name | Dept |Cell culture|Bioinfo|Immunology|Trigonometry|Algebra|Microbio|Optics|
| -------- | -------- |------------|-------|----------|------------|-------|--------|------|
| Joe | Biotech | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| Ann | Biotech | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| Ben | Math | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
| Keren | Biotech | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| Alice | Physics | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Not only do I have to split the last column into the different columns based on the rows - I have to resplit certain column values that are seperated by "//". And the values in the dataframe have to be replaced with 1 or 0 (int).
I've been stuck on this for a while now (-_-;)
You can use pandas.concat in combination with pandas.get_dummies like this:
pd.concat([df[["Name", "Dept"]], df["Project area/areas interested"].str.get_dummies(sep='//')], axis=1)

Is there a method in turning user input into csv format?

This is the example data that would be pasted into an input() prompt and ideally I would like it to be processed and made into a csv file through python:
,,,,,,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Expected,Expected,Expected,SCA,SCA,Passes,Passes,Passes,Passes,Carries,Carries,Dribbles,Dribbles,-additional
Player,#,Nation,Pos,Age,Min,Gls,Ast,PK,PKatt,Sh,SoT,CrdY,CrdR,Touches,Press,Tkl,Int,Blocks,xG,npxG,xA,SCA,GCA,Cmp,Att,Cmp%,Prog,Carries,Prog,Succ,Att,-9999
Gabriel Jesus,9,br BRA,FW,25-124,82,0,0,0,0,1,0,0,0,40,13,1,1,0,0.1,0.1,0.0,4,0,20,27,74.1,2,33,1,4,5,b66315ae
Eddie Nketiah,14,eng ENG,FW,23-067,8,0,0,0,0,0,0,0,0,6,2,0,0,0,0.0,0.0,0.1,2,0,4,4,100.0,1,4,1,0,0,a53649b7
Martinelli,11,br BRA,LW,21-048,90,1,0,0,0,2,1,0,0,38,21,0,2,1,0.6,0.6,0.1,1,0,24,28,85.7,1,34,5,3,4,48a5a5d6
Bukayo Saka,7,eng ENG,RW,20-334,90,0,0,0,0,3,0,0,0,52,23,3,0,3,0.2,0.2,0.0,2,1,24,36,66.7,2,37,8,2,2,bc7dc64d
Martin Ødegaard,8,no NOR,AM,23-231,89,0,0,0,0,2,0,0,0,50,22,2,1,2,0.1,0.1,0.0,2,0,30,39,76.9,5,28,3,1,2,79300479
Albert Sambi Lokonga,23,be BEL,CM,22-287,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0.0,0.0,0.0,0,0,1,1,100.0,0,1,1,0,0,1b4f1169
Granit Xhaka,34,ch SUI,DM,29-312,90,0,0,0,0,0,0,1,0,60,5,0,2,3,0.0,0.0,0.0,4,0,42,49,85.7,6,32,2,0,0,e61b8aee
Thomas Partey,5,gh GHA,DM,29-053,90,0,0,0,0,1,0,0,0,62,25,7,1,2,0.1,0.1,0.0,0,0,40,47,85.1,5,26,4,0,1,529f49ab
Oleksandr Zinchenko,35,ua UKR,LB,25-233,82,0,1,0,0,1,1,0,0,64,16,3,3,1,0.0,0.0,0.3,2,1,44,54,81.5,6,36,5,0,0,51cf8561
Kieran Tierney,3,sct SCO,LBWB,25-061,8,0,0,0,0,0,0,0,0,6,1,0,0,0,0.0,0.0,0.0,0,0,2,4,50.0,0,1,0,0,0,fce2302c
Gabriel Dos Santos,6,br BRA,CB,24-229,90,0,0,0,0,0,0,0,0,67,5,1,1,2,0.0,0.0,0.0,0,0,52,58,89.7,1,48,3,0,0,67ac5bb8
William Saliba,12,fr FRA,CB,21-134,90,0,0,0,0,0,0,0,0,58,3,1,2,2,0.0,0.0,0.0,0,0,42,46,91.3,1,35,1,0,0,972aeb2a
Ben White,4,eng ENG,RB,24-301,90,0,0,0,0,0,0,1,0,61,22,7,4,5,0.0,0.0,0.1,1,0,29,40,72.5,5,25,2,1,1,35e413f1
Aaron Ramsdale,1,eng ENG,GK,24-083,90,0,0,0,0,0,0,0,0,33,0,0,0,0,0.0,0.0,0.0,0,0,24,32,75.0,0,21,0,0,0,466fb2c5
14 Players,,,,,990,1,1,0,0,10,2,2,0,599,158,25,17,21,1.1,1.1,0.5,18,2,378,465,81.3,35,361,36,11,15,-9999
The link to the table is: https://fbref.com/en/matches/e62f6e78/Crystal-Palace-Arsenal-August-5-2022-Premier-League#stats_18bb7c10_summary
I have attempted to use pandas dataframe but I am only able to export the first row of headers and nothing else (only the items before player).
Would have been nice for you to include your attempt.
Pandas works just fine:
import pandas as pd
url = 'https://fbref.com/en/matches/e62f6e78/Crystal-Palace-Arsenal-August-5-2022-Premier-League#stats_18bb7c10_summary'
df = pd.read_html(url)[10]
cols = [f'{each[0]}_{each[1]}' if 'Unnamed' not in each[0] else f'{each[1]}' for each in df.columns]
df.columns = cols
df.to_csv('output.csv', index=False)
Output:
print(df.to_markdown())
| | Player | # | Nation | Pos | Age | Min | Gls | Ast | PK | PKatt | Sh | SoT | CrdY | CrdR | Touches | Press | Tkl | Int | Blocks | xG | npxG | xA | SCA | GCA | Cmp | Att | Cmp% | Prog | Carries | Prog.1 | Succ | Att.1 |
|---:|:---------------------|----:|:---------|:------|:-------|------:|------:|------:|-----:|--------:|-----:|------:|-------:|-------:|----------:|--------:|------:|------:|---------:|-----:|-------:|-----:|------:|------:|------:|------:|-------:|-------:|----------:|---------:|-------:|--------:|
| 0 | Gabriel Jesus | 9 | br BRA | FW | 25-124 | 82 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 40 | 13 | 1 | 1 | 0 | 0.1 | 0.1 | 0 | 4 | 0 | 20 | 27 | 74.1 | 2 | 33 | 1 | 4 | 5 |
| 1 | Eddie Nketiah | 14 | eng ENG | FW | 23-067 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 2 | 0 | 0 | 0 | 0 | 0 | 0.1 | 2 | 0 | 4 | 4 | 100 | 1 | 4 | 1 | 0 | 0 |
| 2 | Martinelli | 11 | br BRA | LW | 21-048 | 90 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 38 | 21 | 0 | 2 | 1 | 0.6 | 0.6 | 0.1 | 1 | 0 | 24 | 28 | 85.7 | 1 | 34 | 5 | 3 | 4 |
| 3 | Bukayo Saka | 7 | eng ENG | RW | 20-334 | 90 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 0 | 52 | 23 | 3 | 0 | 3 | 0.2 | 0.2 | 0 | 2 | 1 | 24 | 36 | 66.7 | 2 | 37 | 8 | 2 | 2 |
| 4 | Martin Ødegaard | 8 | no NOR | AM | 23-231 | 89 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 50 | 22 | 2 | 1 | 2 | 0.1 | 0.1 | 0 | 2 | 0 | 30 | 39 | 76.9 | 5 | 28 | 3 | 1 | 2 |
| 5 | Albert Sambi Lokonga | 23 | be BEL | CM | 22-287 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 100 | 0 | 1 | 1 | 0 | 0 |
| 6 | Granit Xhaka | 34 | ch SUI | DM | 29-312 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 60 | 5 | 0 | 2 | 3 | 0 | 0 | 0 | 4 | 0 | 42 | 49 | 85.7 | 6 | 32 | 2 | 0 | 0 |
| 7 | Thomas Partey | 5 | gh GHA | DM | 29-053 | 90 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 62 | 25 | 7 | 1 | 2 | 0.1 | 0.1 | 0 | 0 | 0 | 40 | 47 | 85.1 | 5 | 26 | 4 | 0 | 1 |
| 8 | Oleksandr Zinchenko | 35 | ua UKR | LB | 25-233 | 82 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 64 | 16 | 3 | 3 | 1 | 0 | 0 | 0.3 | 2 | 1 | 44 | 54 | 81.5 | 6 | 36 | 5 | 0 | 0 |
| 9 | Kieran Tierney | 3 | sct SCO | LB,WB | 25-061 | 8 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 4 | 50 | 0 | 1 | 0 | 0 | 0 |
| 10 | Gabriel Dos Santos | 6 | br BRA | CB | 24-229 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 67 | 5 | 1 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 52 | 58 | 89.7 | 1 | 48 | 3 | 0 | 0 |
| 11 | William Saliba | 12 | fr FRA | CB | 21-134 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 58 | 3 | 1 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 42 | 46 | 91.3 | 1 | 35 | 1 | 0 | 0 |
| 12 | Ben White | 4 | eng ENG | RB | 24-301 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 61 | 22 | 7 | 4 | 5 | 0 | 0 | 0.1 | 1 | 0 | 29 | 40 | 72.5 | 5 | 25 | 2 | 1 | 1 |
| 13 | Aaron Ramsdale | 1 | eng ENG | GK | 24-083 | 90 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 33 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 24 | 32 | 75 | 0 | 21 | 0 | 0 | 0 |
| 14 | 14 Players | nan | nan | nan | nan | 990 | 1 | 1 | 0 | 0 | 10 | 2 | 2 | 0 | 599 | 158 | 25 | 17 | 21 | 1.1 | 1.1 | 0.5 | 18 | 2 | 378 | 465 | 81.3 | 35 | 361 | 36 | 11 | 15 |
could you elaborate more?
maybe you could split the raw text by comma and then convert it to a dataframe
like:
list_of_string = input.split(',')
df = pd.DataFrame(list_of_string)
df.to_csv('yourfile.csv')
The correct approach is as proposed by chitown88, however if you want to copy paste the data by hand into the terminal and get a csv you can do something like this:
import pandas as pd
from datetime import datetime
while True:
print("Enter/Paste your content. Ctrl-D or Ctrl-Z ( windows ) to save it.")
contents = []
while True:
try:
line = input()
except EOFError:
break
contents.append(line)
df = pd.DataFrame(contents)
df.to_csv(f"df_{int(datetime.now().timestamp())}.csv", index=None)
Start the Python script, paste the data into the terminal, press CTRL+D and press enter to export the data you pasted into the terminal into a csv file.
You can use user input controlled while loop to get user input. Finally, you may exit depending on the user’s choice. Look at the code below:
user_input = 'Y'
while user_input.lower() == 'y':
# Run your code here.
user_input = input('Do you want to add one more entry: Y or N?')
This is most intuitive and understandable solution I could come up with uses of basic linear algebra to solve the problem which I find pretty neat. I recommend you to find an another way to parse the data. Check out beautifulsoup and requests.
import pandas as pd#for dataframe
data = '''
,,,,,,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Performance,Expected,Expected,Expected,SCA,SCA,Passes,Passes,Passes,Passes,Carries,Carries,Dribbles,Dribbles,-additional
Player,#,Nation,Pos,Age,Min,Gls,Ast,PK,PKatt,Sh,SoT,CrdY,CrdR,Touches,Press,Tkl,Int,Blocks,xG,npxG,xA,SCA,GCA,Cmp,Att,Cmp%,Prog,Carries,Prog,Succ,Att,-9999
Gabriel Jesus,9,br BRA,FW,25-124,82,0,0,0,0,1,0,0,0,40,13,1,1,0,0.1,0.1,0.0,4,0,20,27,74.1,2,33,1,4,5,b66315ae
Eddie Nketiah,14,eng ENG,FW,23-067,8,0,0,0,0,0,0,0,0,6,2,0,0,0,0.0,0.0,0.1,2,0,4,4,100.0,1,4,1,0,0,a53649b7
Martinelli,11,br BRA,LW,21-048,90,1,0,0,0,2,1,0,0,38,21,0,2,1,0.6,0.6,0.1,1,0,24,28,85.7,1,34,5,3,4,48a5a5d6
Bukayo Saka,7,eng ENG,RW,20-334,90,0,0,0,0,3,0,0,0,52,23,3,0,3,0.2,0.2,0.0,2,1,24,36,66.7,2,37,8,2,2,bc7dc64d
Martin Ødegaard,8,no NOR,AM,23-231,89,0,0,0,0,2,0,0,0,50,22,2,1,2,0.1,0.1,0.0,2,0,30,39,76.9,5,28,3,1,2,79300479
Albert Sambi Lokonga,23,be BEL,CM,22-287,1,0,0,0,0,0,0,0,0,2,0,0,0,0,0.0,0.0,0.0,0,0,1,1,100.0,0,1,1,0,0,1b4f1169
Granit Xhaka,34,ch SUI,DM,29-312,90,0,0,0,0,0,0,1,0,60,5,0,2,3,0.0,0.0,0.0,4,0,42,49,85.7,6,32,2,0,0,e61b8aee
Thomas Partey,5,gh GHA,DM,29-053,90,0,0,0,0,1,0,0,0,62,25,7,1,2,0.1,0.1,0.0,0,0,40,47,85.1,5,26,4,0,1,529f49ab
Oleksandr Zinchenko,35,ua UKR,LB,25-233,82,0,1,0,0,1,1,0,0,64,16,3,3,1,0.0,0.0,0.3,2,1,44,54,81.5,6,36,5,0,0,51cf8561
Kieran Tierney,3,sct SCO,LBWB,25-061,8,0,0,0,0,0,0,0,0,6,1,0,0,0,0.0,0.0,0.0,0,0,2,4,50.0,0,1,0,0,0,fce2302c
Gabriel Dos Santos,6,br BRA,CB,24-229,90,0,0,0,0,0,0,0,0,67,5,1,1,2,0.0,0.0,0.0,0,0,52,58,89.7,1,48,3,0,0,67ac5bb8
William Saliba,12,fr FRA,CB,21-134,90,0,0,0,0,0,0,0,0,58,3,1,2,2,0.0,0.0,0.0,0,0,42,46,91.3,1,35,1,0,0,972aeb2a
Ben White,4,eng ENG,RB,24-301,90,0,0,0,0,0,0,1,0,61,22,7,4,5,0.0,0.0,0.1,1,0,29,40,72.5,5,25,2,1,1,35e413f1
Aaron Ramsdale,1,eng ENG,GK,24-083,90,0,0,0,0,0,0,0,0,33,0,0,0,0,0.0,0.0,0.0,0,0,24,32,75.0,0,21,0,0,0,466fb2c5
14 Players,,,,,990,1,1,0,0,10,2,2,0,599,158,25,17,21,1.1,1.1,0.5,18,2,378,465,81.3,35,361,36,11,15,-9999
'''
#you can just replace data with user input
def tryNum(x):#input a value and if its a number then it returns a number, if not it returns itself back
try:
x = float(x)
return x
except:
return x
rows = [i.split(',')[:-1] for i in data.split('\n')[2:-2]]#removing useless lines
col_names = [i for i in rows[0]]#fetching all column names
cols = [[tryNum(rows[j][i]) for j in range(1,len(rows))] for i in range(len(rows[0]))]#get all column info by transposing the "matrix" if you will
full = {}#setting up the dictionary
for i,y in zip(col_names,cols):#putting the data in the dict
full[i]=y
df = pd.DataFrame(data = full)#uploading it all to the df
print(df.head())

how to find max of a columns with same name

im having some trouble with this data frame where columns having the same name have to be reduced to values with at least one "1" as "1".
+---+---+---+---+---+---+---+---+---+
| a | a | a | b | c | c | c | d | d |
+---+---+---+---+---+---+---+---+---+
| 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 |
| 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
+---+---+---+---+---+---+---+---+---+
to something like this using "or" condition for every column for a huge dataset could be a time-consuming task so I am having trouble figuring it out. I used max(axis=1, level=0) still couldn't make it.
my desired output :
+---+---+---+---+
| a | b | c | d |
+---+---+---+---+
| 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 1 | 0 |
+---+---+---+---+
Check with max
df = df.max(level=0, axis=1)

Unexpected Python TypeError: when using scalars

I am new to Python and in my opinion it is much different than Java.
I have looked at other answers which implies that the error is because I am passing an array when it is expecting a value. I don't know about that. I am pretty sure I am simply passing a value.
The line, 97, is:
exponential = math.exp(-(math.pow(feature_value-mean, 2) / (2*math.pow(standard_deviation, 2))))
The complete text of the error is:
Traceback (most recent call last):
File "D:/Personal/Python/NB.py", line 153, in <module>
main()
File "D:/Personal/Python/NB.py", line 148, in main
predictions = getPredict(summaries, testing_set)
File "D:/Personal/Python/NB.py", line 129, in getPredict
classification = predict(results, testData[index])
File "D:/Personal/Python/NB.py", line 117, in predict
probabilities = Classify(feature_summaries, classifications)
File "D:/Personal/Python/NB.py", line 113, in Classify
probabilities[classes] = probabilities[classes] * GaussianProbabilityDensity(feature_value, mean, standard_deviation)
File "D:/Personal/Python/NB.py", line 97, in GaussianProbabilityDensity
exponential = math.exp(-(math.pow(feature_value-mean, 2) / (2*math.pow(standard_deviation, 2))))
TypeError: only size-1 arrays can be converted to Python scalars
If it is useful, the csv is below. It should be noted I have two other algorithms that run on this dataset just fine.
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 |
| 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 1 |
| 1 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 |
Your issue is that class_summaries (from line 107) is a list of tuples, one of which you select and pass into the GaussianProbabilityDensity function as feature_value.
It ends up causing the error on line 97. Note that if you were to fix it (I replaced the value with a hard 1.0), you'll end up with a division by zero error, as the standard_deviation you're putting in happens to be 0 at that point.
The way I found this was to use a Python IDE that has a proper debugger (I like PyCharm) and by setting a breakpoint on the line you indicated, inspecting the various variables before the error occurs. I recommend trying to solve these types of problems in a similar fashion, as it save a lot of time and spurious print statements.
math.pow (like all math functions) only works with scalars, that is single numbers (integer or float). The error says that one of the arguments, such as standard_deviation is a numpy array with more than one element, so it can't be converted to a scalar and passed to math.pow.
This occurs in your own code, so there's no difficulty in tracing those variables back to their source.
Either you unintentionally passed an array to this function, or you need to replace math.pow with np.pow (and np.exp) functions which do work with arrays.
You generate a numpy array when loading from the csv
data = numpy.loadtxt(data, delimiter=',')
# Loop through the data in the array
for index in range(len(data)):
# Utilize a try catch to try and convert to float, if it can't convert to float, converts to 0
try:
data[index] = [float(x) for x in data[index]]
except ValueError:
data[index] = 0
loadtxt returns an array, with float dtype (the default). All its elements will be floats - if it read something that wasn't valid float, it would have raised an error. Thus the loop isn't needed. And the loop looks too much like it was written for a list, not an array.
randomize_data shouldn't return anything. np.random.shuffle operates in-place on csv. That doesn't cause an error.

Iterate through rows of grouped pandas dataframe to create new columns

I'm new to Python and am trying to get to grips with Pandas for data analysis.
I wondered if anyone can help me loop through rows of grouped data in a dataframe to create new variables.
Suppose I have a dataframe called data, that looks like this:
+----+-----------+--------+
| ID | YearMonth | Status |
+----+-----------+--------+
| 1 | 201506 | 0 |
| 1 | 201507 | 0 |
| 1 | 201508 | 0 |
| 1 | 201509 | 0 |
| 1 | 201510 | 0 |
| 2 | 201506 | 0 |
| 2 | 201507 | 1 |
| 2 | 201508 | 2 |
| 2 | 201509 | 3 |
| 2 | 201510 | 0 |
| 3 | 201506 | 0 |
| 3 | 201507 | 1 |
| 3 | 201508 | 2 |
| 3 | 201509 | 3 |
| 3 | 201510 | 4 |
+----+-----------+--------+
There are multiple rows for each ID, MonthYear is of the form yyyymm, and Status is the status at each MonthYear (takes values 0 to 6)
I have manged to create columns to show me the cumulative maximum status, and an ever3 (to show me if an ID has ever had a status or 3 or more regardless of current status) indicator like this:
data1['Max_Stat'] = data1.groupby(['Custno'])['Status'].cummax()
data1['Ever3'] = np.where(data1['Max_Stat'] >= 3, 1, 0)
What I would also like to do, is create the other columns to create metrics such as the number of times something has happened, or how long since an event. For example
Times3Plus : To show how many times the ID has had a status 3 or more at that point in time
Into3 : Set to Y the first time the ID has a status of 3 or more (not for subsequent times)
+----+-----------+--------+----------+-------+------------+-------+
| ID | YearMonth | Status | Max_Stat | Ever3 | Times3Plus | Into3 |
+----+-----------+--------+----------+-------+------------+-------+
| 1 | 201506 | 0 | 0 | 0 | 0 | |
| 1 | 201507 | 0 | 0 | 0 | 0 | |
| 1 | 201508 | 0 | 0 | 0 | 0 | |
| 1 | 201509 | 0 | 0 | 0 | 0 | |
| 1 | 201510 | 0 | 0 | 0 | 0 | |
| 2 | 201506 | 0 | 0 | 0 | 0 | |
| 2 | 201507 | 1 | 1 | 0 | 0 | |
| 2 | 201508 | 2 | 2 | 0 | 0 | |
| 2 | 201509 | 3 | 3 | 1 | 1 | Y |
| 2 | 201510 | 0 | 3 | 1 | 1 | |
| 3 | 201506 | 0 | 0 | 0 | 0 | |
| 3 | 201507 | 1 | 1 | 0 | 0 | |
| 3 | 201508 | 2 | 2 | 0 | 0 | |
| 3 | 201509 | 3 | 3 | 1 | 1 | Y |
| 3 | 201510 | 4 | 4 | 1 | 2 | |
+----+-----------+--------+----------+-------+------------+-------+
I can do this quite easily in SAS, using BY and RETAIN statements, but can't work out how to replicate this in Python.
I have managed to do this without iterating over each row, as I'm not sure what I was trying to do was possible. I had wanted to set up counters or indicators at group level,as is possible in SAS, and modify these row by row. Eg something like
Times3Plus=0
if row['Status'] >= 3:
Times3Plus += 1
Return Times3Plus
In the end, I created a binary 3Plus indicator
data['3Plus'] = np.where(data1['Status'] >= 3, 1, 0)
Then used groupby to summarise these to create Times3Plus at group level
data['Times3Plus'] = data.groupby(['ID'])['3Plus'].cumsum()
Into3 could then be populated using a function
def into3(row):
if row['3Plus'] == 1 and row['Times3Plus'] == 1: #i.e it is the first time
return 1
data['Into3'] = data.apply(into3, axis = 1)

Categories