I have a DataFrame; a subset of it can be rebuilt from this dict constructor:
df = pd.DataFrame(data = {'K': {8: 3.9274999999999998, 9: 1.9275, 10: 2.9274999999999998, 11: 2.9274999999999998, 12: 2.275, 13: 3.2750000000000004, 14: 2.775, 15: 2.8000000000000003, 16: 1.7999999999999998, 17: 2.8000000000000003, 18: 2.82, 19: 2.82, 20: 2.8000000000000003, 21: 2.8000000000000003, 22: 2.82, 23: 2.82, 24: 1.82, 25: 1.82, 26: 1.7999999999999998}, 'Struct': {8: 'Call', 9: 'Put', 10: 'Straddle', 11: 'Straddle', 12: 'Put', 13: 'Call', 14: 'Straddle', 15: 'Delta', 16: 'Put', 17: 'Put', 18: 'Put', 19: 'Delta', 20: 'Put', 21: 'Delta', 22: 'Delta', 23: 'Put', 24: 'Put', 25: 'Put', 26: 'Put'}, 'MainID': {8: 10, 9: 10, 10: 10, 11: 10, 12: 20, 13: 20, 14: 20, 15: 21, 16: 21, 17: 21, 18: 23, 19: 23, 20: 23, 21: 23, 22: 23, 23: 23, 24: 23, 25: 23, 26: 23}})
As a table:

| Index | K | Struct | MainID |
|-------|--------|----------|--------|
| 8 | 3.9275 | Call | 10 |
| 9 | 1.9275 | Put | 10 |
| 10 | 2.9275 | Straddle | 10 |
| 11 | 2.9275 | Straddle | 10 |
| 12 | 2.275 | Put | 20 |
| 13 | 3.275 | Call | 20 |
| 14 | 2.775 | Straddle | 20 |
| 15 | 2.8 | Delta | 21 |
| 16 | 1.8 | Put | 21 |
| 17 | 2.8 | Put | 21 |
| 18 | 2.82 | Put | 23 |
| 19 | 2.82 | Delta | 23 |
| 20 | 2.8 | Put | 23 |
| 21 | 2.8 | Delta | 23 |
| 22 | 2.82 | Delta | 23 |
| 23 | 2.82 | Put | 23 |
| 24 | 1.82 | Put | 23 |
| 25 | 1.82 | Put | 23 |
| 26 | 1.8 | Put | 23 |
I am trying to find a way to do the following steps:
Groupby("MainID")
For any Call or Put, subtract the "Straddle" (or "Delta") "K" from its "K", if one exists within the same Groupby("MainID") group
In the case that you have multiple Delta/Put/Call rows within a Groupby("MainID") group, subtract based on ascending values; i.e., if K[Struct==Put] = [1,2,3] and K[Struct==Delta] = [2,2,3], the result would be [-1, 0, 0]
The resulting DF would look like:

| Index | K | Struct | MainID |
|-------|--------|----------|--------|
| 8 | 1 | Call | 10 |
| 9 | -1 | Put | 10 |
| 10 | 2.9275 | Straddle | 10 |
| 11 | 2.9275 | Straddle | 10 |
| 12 | -0.50 | Put | 20 |
| 13 | 0.50 | Call | 20 |
| 14 | 2.775 | Straddle | 20 |
| 15 | 2.8 | Delta | 21 |
| 16 | -1 | Put | 21 |
| 17 | 0 | Put | 21 |
| 18 | 0 | Put | 23 |
| 19 | 2.82 | Delta | 23 |
| 20 | 0 | Put | 23 |
| 21 | 2.8 | Delta | 23 |
| 22 | 2.82 | Delta | 23 |
| 23 | 0 | Put | 23 |
| 24 | -1 | Put | 23 |
| 25 | -1 | Put | 23 |
| 26 | -1 | Put | 23 |
Thanks so much! It's a tricky one...
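For what it's worth, one reading of the matching rule that reproduces the desired output is: within each MainID group, sort the Call/Put strikes and the Straddle/Delta strikes in descending order, pair them positionally, and recycle the Straddle/Delta strikes when there are more options than bases. A sketch under that assumption (the `adjust` helper and the recycling rule are my guesses, not from the post), run here on a subset with groups 10 and 21:

```python
import numpy as np
import pandas as pd

# Subset of the frame above: MainID groups 10 and 21.
df = pd.DataFrame({
    "K": [3.9275, 1.9275, 2.9275, 2.9275, 2.8, 1.8, 2.8],
    "Struct": ["Call", "Put", "Straddle", "Straddle", "Delta", "Put", "Put"],
    "MainID": [10, 10, 10, 10, 21, 21, 21],
}, index=[8, 9, 10, 11, 15, 16, 17])

def adjust(group: pd.DataFrame) -> pd.Series:
    """Within one MainID group, subtract matched Straddle/Delta strikes
    from the Call/Put strikes; all other rows are left unchanged."""
    out = group["K"].astype(float)
    bases = np.sort(group.loc[group["Struct"].isin(["Straddle", "Delta"]), "K"].to_numpy())[::-1]
    opt_idx = group.index[group["Struct"].isin(["Call", "Put"])]
    if len(bases) and len(opt_idx):
        # Pair the largest option strike with the largest base strike,
        # recycling base strikes when options outnumber them (an assumption).
        ordered = out.loc[opt_idx].sort_values(ascending=False)
        out.loc[ordered.index] = ordered.to_numpy() - np.resize(bases, len(ordered))
    return out

df["K"] = df.groupby("MainID", group_keys=False).apply(adjust)
```

On this subset the Call at index 8 becomes 1, the Put at index 9 becomes -1, and the Straddle rows keep their original K, matching the desired table above.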
I am trying to build a dataframe that combines individual dataframes of county-level high school enrollment projections generated in a for loop.
I can do this for a single county, based on this SO question. It works great. My goal now is to do a nested for loop that would take multiple county FIPS codes, filter the inner loop on that, and generate an 11-row dataframe that would then be appended to a master dataframe. For three counties, for example, the final dataframe would be 33 rows.
But I haven't been able to get it right. I've tried to model on this SO question and answer.
This is my starting dataframe:
df = pd.DataFrame({"year": ['2020_21', '2020_21','2020_21'],
"county_fips" : ['06019','06021','06023'] ,
"grade11" : [5000,2000,2000],
"grade12": [5200,2200,2200],
"grade11_chg": [1.01,1.02,1.03],
"grade11_12_ratio": [0.9,0.8,0.87]})
df
This is my code with the nested loops. My intent is to run through the county codes in the outer loop and the projection year calculations in the inner loop.
projection_years=['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
for i in df['county_fips'].unique():
    print(i)
    grade11_change=df.iloc[0]['grade11_chg']
    grade11_12_ratio=df.iloc[0]['grade11_12_ratio']
    full_name=[]
    for year in projection_years:
        #print(year)
        df_select=df[df['county_fips']==i]
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row = {}
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
        full_name.append(df_select)
df_final=pd.concat(full_name)
df_final=df_final[['year','county_fips','grade11','grade12']]
print('Finished processing')
But I end up with NaN values and repeating years. Below shows my desired output (I built this in Excel, and the numbers reflect rounding). Update: this corrects the original df_final_goal.
df_final_goal=pd.DataFrame({'year': {0: '2020_21', 1: '2021_22', 2: '2022_23', 3: '2023_24', 4: '2024_25', 5: '2025_26',
6: '2026_27', 7: '2027_28', 8: '2028_29', 9: '2029_30', 10: '2030_31', 11: '2020_21', 12: '2021_22', 13: '2022_23',
14: '2023_24', 15: '2024_25', 16: '2025_26', 17: '2026_27', 18: '2027_28', 19: '2028_29', 20: '2029_30', 21: '2030_31',
22: '2020_21', 23: '2021_22', 24: '2022_23', 25: '2023_24', 26: '2024_25', 27: '2025_26', 28: '2026_27', 29: '2027_28',
30: '2028_29', 31: '2029_30', 32: '2030_31'},
'county_fips': {0: '06019', 1: '06019', 2: '06019', 3: '06019', 4: '06019', 5: '06019', 6: '06019', 7: '06019', 8: '06019',
9: '06019', 10: '06019', 11: '06021', 12: '06021', 13: '06021', 14: '06021', 15: '06021', 16: '06021', 17: '06021', 18: '06021',
19: '06021', 20: '06021', 21: '06021', 22: '06023', 23: '06023', 24: '06023', 25: '06023', 26: '06023', 27: '06023',
28: '06023', 29: '06023', 30: '06023', 31: '06023', 32: '06023'},
'grade11': {0: 5000, 1: 5050, 2: 5101, 3: 5152, 4: 5203, 5: 5255, 6: 5308, 7: 5361, 8: 5414, 9: 5468, 10: 5523,
11: 2000, 12: 2040, 13: 2081, 14: 2122, 15: 2165, 16: 2208, 17: 2252, 18: 2297, 19: 2343, 20: 2390, 21: 2438,
22: 2000, 23: 2060, 24: 2122, 25: 2185, 26: 2251, 27: 2319, 28: 2388, 29: 2460, 30: 2534, 31: 2610, 32: 2688},
'grade12': {0: 5200, 1: 4500, 2: 4545, 3: 4590, 4: 4636, 5: 4683, 6: 4730, 7: 4777, 8: 4825, 9: 4873, 10: 4922,
11: 2200, 12: 1600, 13: 1632, 14: 1665, 15: 1698, 16: 1732, 17: 1767, 18: 1802, 19: 1838, 20: 1875, 21: 1912,
22: 2200, 23: 1740, 24: 1792, 25: 1846, 26: 1901, 27: 1958, 28: 2017, 29: 2078, 30: 2140, 31: 2204, 32: 2270}})
Thanks for any assistance.
Creating a helper function for calculating grade11 helps make this a bit easier.
import pandas as pd
def expand_grade11(
    grade11: int,
    grade11_chg: float,
    len_projection_years: int
) -> list:
    """
    Calculate `grade11` values based on current
    `grade11`, `grade11_chg`, and number of
    `projection_years`.
    """
    list_of_vals = []
    while len(list_of_vals) < len_projection_years:
        grade11 = int(grade11 * grade11_chg)
        list_of_vals.append(grade11)
    return list_of_vals

# initial info
df = pd.DataFrame({
    "year": ['2020_21', '2020_21', '2020_21'],
    "county_fips": ['06019', '06021', '06023'],
    "grade11": [5000, 2000, 2000],
    "grade12": [5200, 2200, 2200],
    "grade11_chg": [1.01, 1.02, 1.03],
    "grade11_12_ratio": [0.9, 0.8, 0.87]
})
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
# converting to pd.MultiIndex
prods_index = pd.MultiIndex.from_product((df.county_fips.unique(), projection_years), names=["county_fips", "year"])
# setting index for future grouping/joining
df.set_index(["county_fips", "year"], inplace=True)
# calculate grade11
final = df.groupby([
    "county_fips",
    "year",
]).apply(lambda x: expand_grade11(x.grade11, x.grade11_chg, len(projection_years)))
final = final.explode()
final.index = prods_index
final = final.to_frame("grade11")
# concat with original df to get other columns
final = pd.concat([
    df, final
])
final.sort_index(level=["county_fips", "year"], inplace=True)
final.grade11_12_ratio.ffill(inplace=True)
# calculate grade12
grade12 = final.groupby([
    "county_fips"
]).apply(lambda x: x["grade11"] * x["grade11_12_ratio"])
grade12 = grade12.groupby("county_fips").shift(1)
grade12 = grade12.droplevel(0)
# put it all together
final.grade12.fillna(grade12, inplace=True)
final = final[["grade11", "grade12"]]
final = final.astype(int)
final.reset_index(inplace=True)
There are some bugs in the code. The version below seems to produce the result you expect (note that the desired final dataframe is currently not consistent with the initial one):
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
full_name = []
for i in df['county_fips'].unique():
    print(i)
    df_select = df[df['county_fips']==i]
    grade11_change = df_select.iloc[0]['grade11_chg']
    grade11_12_ratio = df_select.iloc[0]['grade11_12_ratio']
    for year in projection_years:
        #print(year)
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
    full_name.append(df_select)
df_final = pd.concat(full_name)
df_final = df_final[['year','county_fips','grade11','grade12']].reset_index()
print('Finished processing')
Fixes:
- full_name is initialized before the outer loop
- df_select is no longer redefined inside the inner loop
- row was initialized twice inside the inner loop; the duplicate is removed
- full_name.append is moved out of the inner loop, after it
- added reset_index() to df_final (mostly cosmetic)
- (edit) the grade change variables (grade11_change and grade11_12_ratio) are now computed from df_select (and not df)
the final result (print(df_final.to_markdown())) with the above code is:
|    | index | year    | county_fips | grade11 | grade12 |
|----|-------|---------|-------------|---------|---------|
| 0  | 0 | 2020_21 | 06019 | 5000 | 5200 |
| 1  | 0 | 2021_22 | 06019 | 5050 | 4500 |
| 2  | 0 | 2022_23 | 06019 | 5100 | 4545 |
| 3  | 0 | 2023_24 | 06019 | 5151 | 4590 |
| 4  | 0 | 2024_25 | 06019 | 5202 | 4635 |
| 5  | 0 | 2025_26 | 06019 | 5254 | 4681 |
| 6  | 0 | 2026_27 | 06019 | 5306 | 4728 |
| 7  | 0 | 2027_28 | 06019 | 5359 | 4775 |
| 8  | 0 | 2028_29 | 06019 | 5412 | 4823 |
| 9  | 0 | 2029_30 | 06019 | 5466 | 4870 |
| 10 | 0 | 2030_31 | 06019 | 5520 | 4919 |
| 11 | 1 | 2020_21 | 06021 | 2000 | 2200 |
| 12 | 0 | 2021_22 | 06021 | 2040 | 1600 |
| 13 | 0 | 2022_23 | 06021 | 2080 | 1632 |
| 14 | 0 | 2023_24 | 06021 | 2121 | 1664 |
| 15 | 0 | 2024_25 | 06021 | 2163 | 1696 |
| 16 | 0 | 2025_26 | 06021 | 2206 | 1730 |
| 17 | 0 | 2026_27 | 06021 | 2250 | 1764 |
| 18 | 0 | 2027_28 | 06021 | 2295 | 1800 |
| 19 | 0 | 2028_29 | 06021 | 2340 | 1836 |
| 20 | 0 | 2029_30 | 06021 | 2386 | 1872 |
| 21 | 0 | 2030_31 | 06021 | 2433 | 1908 |
| 22 | 2 | 2020_21 | 06023 | 2000 | 2200 |
| 23 | 0 | 2021_22 | 06023 | 2060 | 1740 |
| 24 | 0 | 2022_23 | 06023 | 2121 | 1792 |
| 25 | 0 | 2023_24 | 06023 | 2184 | 1845 |
| 26 | 0 | 2024_25 | 06023 | 2249 | 1900 |
| 27 | 0 | 2025_26 | 06023 | 2316 | 1956 |
| 28 | 0 | 2026_27 | 06023 | 2385 | 2014 |
| 29 | 0 | 2027_28 | 06023 | 2456 | 2074 |
| 30 | 0 | 2028_29 | 06023 | 2529 | 2136 |
| 31 | 0 | 2029_30 | 06023 | 2604 | 2200 |
| 32 | 0 | 2030_31 | 06023 | 2682 | 2265 |
note: edited to address the comments
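One more caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the loop above fails on current pandas. A sketch of the same logic using pd.concat instead (trimmed to one county and two projection years for brevity):

```python
import pandas as pd

# Trimmed starting frame: one county, one base year.
df = pd.DataFrame({
    "year": ["2020_21"], "county_fips": ["06019"],
    "grade11": [5000], "grade12": [5200],
    "grade11_chg": [1.01], "grade11_12_ratio": [0.9],
})
projection_years = ["2021_22", "2022_23"]

frames = []
for fips in df["county_fips"].unique():
    df_select = df[df["county_fips"] == fips]
    grade11_change = df_select.iloc[0]["grade11_chg"]
    grade11_12_ratio = df_select.iloc[0]["grade11_12_ratio"]
    for year in projection_years:
        lr = df_select.iloc[-1]  # last (most recent) row for this county
        row = pd.DataFrame([{
            "year": year,
            "county_fips": fips,
            "grade11": int(lr["grade11"] * grade11_change),
            "grade12": int(lr["grade11"] * grade11_12_ratio),
        }])
        # pd.concat replaces the removed DataFrame.append
        df_select = pd.concat([df_select, row], ignore_index=True)
    frames.append(df_select)

df_final = pd.concat(frames, ignore_index=True)[["year", "county_fips", "grade11", "grade12"]]
```

The projected values match the table above: 5050/4500 for 2021_22 and 5100/4545 for 2022_23.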
I am working with a dataset (10000 data points) that provides 100 different account numbers with transaction amounts, date and time of transactions etc.
From this dataset I want to create a separate data frame for one account number, which then contains all the transactions (ordered by time) that that account number made throughout the year.
I tried to do this by:
group = df.groupby('account_num')
which then gives me
pandas.core.groupby.generic.DataFrameGroupBy
Then, when I want to get the group for a specific account number, say 51234:
group.get_group('51234')
I receive an error:
KeyError: 51234
How can I make a separate data frame containing all the transaction for one single account number?
(Sorry if this is a very basic question, I'm a newbie.)
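One likely cause of the KeyError, shown on a toy frame (the `amount` column here is illustrative): group keys keep the column's dtype, so if account_num is stored as an integer, get_group needs an int, not the string '51234'.

```python
import pandas as pd

df = pd.DataFrame({
    "account_num": [51234, 51234, 13123],  # integers, not strings
    "amount": [10.0, 20.0, 5.0],           # illustrative column
})
group = df.groupby("account_num")

# group.get_group('51234')   # raises KeyError: the group keys are ints
acct = group.get_group(51234)  # works: returns the two rows for this account
```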
IIUC, you can get your output in a slightly different way. Start by making sure your time column (which I assume is a date, based on your description) is actually a datetime object, then filter your dataframe for the specific account number. There are plenty of ways to do this; a common one is loc, but here I use query. Then sort by date with sort_values, and lastly group by the year part of your date column:
# Convert your date column to datetime
df['date'] = pd.to_datetime(df['date'])
# Filter and sort
>>> print(df.query('account_num == 51234')\
.sort_values(by=['date'],ascending=True))
# Equivalently with loc
print(
df.loc[df['account_num'] == 51234]\
.sort_values(by=['date'],ascending=True))
account_num date
0 51234 2020-01-01
1 51234 2020-02-01
2 51234 2020-03-01
7 51234 2020-08-01
9 51234 2020-08-01
11 51234 2020-08-01
13 51234 2020-08-01
3 51234 2021-04-01
4 51234 2021-05-01
5 51234 2023-06-01
6 51234 2023-07-01
8 51234 2023-07-01
10 51234 2023-07-01
12 51234 2023-07-01
# Filter, sort, and get yearly count
>>> print(
df.query('account_num == 51234')\
.sort_values(by=['date'],ascending=True)\
.groupby(df['date'].dt.year).account_num.count())
date
2020 7
2021 2
2023 5
Based on the below sample DF:
{'account_num': {0: 51234,
1: 51234,
2: 51234,
3: 51234,
4: 51234,
5: 51234,
6: 51234,
7: 51234,
8: 51234,
9: 51234,
10: 51234,
11: 51234,
12: 51234,
13: 51234,
14: 512346,
15: 512346,
16: 512346,
17: 512346,
18: 512346,
19: 512346,
20: 512346,
21: 512346,
22: 512346,
23: 13123,
24: 13123,
25: 13123,
26: 13123,
27: 13123,
28: 13123,
29: 13123,
30: 13123,
31: 13123},
'date': {0: '01/01/2020',
1: '02/01/2020',
2: '03/01/2020',
3: '04/01/2021',
4: '05/01/2021',
5: '06/01/2023',
6: '07/01/2023',
7: '08/01/2020',
8: '07/01/2023',
9: '08/01/2020',
10: '07/01/2023',
11: '08/01/2020',
12: '07/01/2023',
13: '08/01/2020',
14: '09/01/2020',
15: '10/01/2020',
16: '11/01/2020',
17: '12/01/2020',
18: '13/01/2020',
19: '14/01/2020',
20: '15/01/2020',
21: '16/01/2020',
22: '17/01/2020',
23: '18/01/2020',
24: '19/01/2020',
25: '20/01/2020',
26: '21/01/2020',
27: '22/01/2020',
28: '23/01/2020',
29: '24/01/2020',
30: '25/01/2020',
31: '26/01/2020'}}
I have a list:
sorted_info = [' 1: surgery?\n', ' 2: Age\n', ' 3: Hospital Number\n', ' 4: rectal temperature\n', ' 5: pulse\n', ' - is a reflection of the heart condition: 30 -40 is normal for adults\n', ' 6: respiratory rate\n', ' 7: temperature of extremities\n', ' - possible values:\n', ' 8: peripheral pulse\n', ' - possible values are:\n', ' 9: mucous membranes\n', ' - possible values are:\n', ' 10: capillary refill time\n', " 11: pain - a subjective judgement of the horse's pain level\n", ' - possible values:\n', ' 12: peristalsis\n', ' - possible values:\n', ' 13: abdominal distension\n', ' 14: nasogastric tube\n', ' - possible values:\n', ' 15: nasogastric reflux\n', ' 16: nasogastric reflux PH\n', ' 17: rectal examination - feces\n', ' 18: abdomen\n', ' 19: packed cell volume\n', ' 20: total protein\n', ' 21: abdominocentesis appearance\n', ' - possible values:\n', ' 22: abdomcentesis total protein\n', ' 23: outcome\n', ' - possible values:\n', ' 24: surgical lesion?\n', ' - possible values:\n', ' 25, 26, 27: type of lesion\n', ' 28: cp_data\n']
Further, I do:
import pandas as pd
pd.DataFrame(sorted_info)
     0
0    1: surgery?\n
1    2: Age\n
2    3: Hospital Number\n
3    4: rectal temperature\n
4    5: pulse\n
5    - is a reflection of the heart condi...
6    6: respiratory rate\n
7    7: temperature of extremities\n
8    - possible values:\n
9    8: peripheral pulse\n
10   - possible values are:\n
11   9: mucous membranes\n
12   - possible values are:\n
13   10: capillary refill time\n
14   11: pain - a subjective judgement of the hors...
15   - possible values:\n
16   12: peristalsis\n
17   - possible values:\n
18   13: abdominal distension\n
19   14: nasogastric tube\n
20   - possible values:\n
21   15: nasogastric reflux\n
22   16: nasogastric reflux PH\n
23   17: rectal examination - feces\n
24   18: abdomen\n
25   19: packed cell volume\n
26   20: total protein\n
27   21: abdominocentesis appearance\n
28   - possible values:\n
29   22: abdomcentesis total protein\n
30   23: outcome\n
31   - possible values:\n
32   24: surgical lesion?\n
33   - possible values:\n
34   25, 26, 27: type of lesion\n
35   28: cp_data\n
So I am trying to transform it so that the DF will look like:
Col1 Col2
1: surgery
2: Age
3: Hospital Number
etc.
Any suggestions how to split Series into 2 Cols and clean/delete rest of info?
Try:
import pandas as pd
sorted_info = [' 1: surgery?\n', ' 2: Age\n', ' 3: Hospital Number\n', ' 4: rectal temperature\n', ' 5: pulse\n', ' - is a reflection of the heart condition: 30 -40 is normal for adults\n', ' 6: respiratory rate\n', ' 7: temperature of extremities\n', ' - possible values:\n', ' 8: peripheral pulse\n', ' - possible values are:\n', ' 9: mucous membranes\n', ' - possible values are:\n', ' 10: capillary refill time\n', " 11: pain - a subjective judgement of the horse's pain level\n", ' - possible values:\n', ' 12: peristalsis\n', ' - possible values:\n', ' 13: abdominal distension\n', ' 14: nasogastric tube\n', ' - possible values:\n', ' 15: nasogastric reflux\n', ' 16: nasogastric reflux PH\n', ' 17: rectal examination - feces\n', ' 18: abdomen\n', ' 19: packed cell volume\n', ' 20: total protein\n', ' 21: abdominocentesis appearance\n', ' - possible values:\n', ' 22: abdomcentesis total protein\n', ' 23: outcome\n', ' - possible values:\n', ' 24: surgical lesion?\n', ' - possible values:\n', ' 25, 26, 27: type of lesion\n', ' 28: cp_data\n']
sorted_info = [x.strip() for x in sorted_info]
joined_list = []
for x in sorted_info:
    if x.startswith('-'):
        joined_list[-1] += ' ' + x
    else:
        joined_list.append(x)
df = pd.DataFrame(joined_list)
df[['Number', 'Text']] = df[0].str.split(':', n=1, expand=True)
del df[0]
print(df)
Output:
Number Text
0 1 surgery?
1 2 Age
2 3 Hospital Number
3 4 rectal temperature
4 5 pulse - is a reflection of the heart conditi...
5 6 respiratory rate
6 7 temperature of extremities - possible values:
7 8 peripheral pulse - possible values are:
8 9 mucous membranes - possible values are:
9 10 capillary refill time
10 11 pain - a subjective judgement of the horse's ...
11 12 peristalsis - possible values:
12 13 abdominal distension
13 14 nasogastric tube - possible values:
14 15 nasogastric reflux
15 16 nasogastric reflux PH
16 17 rectal examination - feces
17 18 abdomen
18 19 packed cell volume
19 20 total protein
20 21 abdominocentesis appearance - possible values:
21 22 abdomcentesis total protein
22 23 outcome - possible values:
23 24 surgical lesion? - possible values:
24 25, 26, 27 type of lesion
25 28 cp_data
Additional:
If you wanted to then go forwards and expand the 25, 26, 27 values in row 24 try:
df = df.apply(lambda x: x.str.split(',').explode()).reset_index(drop=True)
Output:
Number Text
0 1 surgery?
1 2 Age
2 3 Hospital Number
3 4 rectal temperature
4 5 pulse - is a reflection of the heart conditi...
5 6 respiratory rate
6 7 temperature of extremities - possible values:
7 8 peripheral pulse - possible values are:
8 9 mucous membranes - possible values are:
9 10 capillary refill time
10 11 pain - a subjective judgement of the horse's ...
11 12 peristalsis - possible values:
12 13 abdominal distension
13 14 nasogastric tube - possible values:
14 15 nasogastric reflux
15 16 nasogastric reflux PH
16 17 rectal examination - feces
17 18 abdomen
18 19 packed cell volume
19 20 total protein
20 21 abdominocentesis appearance - possible values:
21 22 abdomcentesis total protein
22 23 outcome - possible values:
23 24 surgical lesion? - possible values:
24 25 type of lesion
25 26 type of lesion
26 27 type of lesion
27 28 cp_data
We can do it this way:
1. Read the list sorted_info into a Pandas Series.
2. Use .str.extract() with a regex to extract the number tag and the main contents of each line of text, giving Col1 and Col2.
3. For continuation lines without a number tag, use .ffill() to forward-fill the missing tag number.
4. Group by the tag number in Col1 and join the texts of continuation lines sharing the same tag number.
Here is the code:
# Read the list `sorted_info` into a Pandas Series:
s = pd.Series(sorted_info)
# Extract the number tag and main contents of a line of text:
df = s.str.extract(r'\s*(?:(?P<Col1>\d+(?:,\s*\d+)*:)|-)\s*(?P<Col2>.*)', expand=True)
# For continuation lines without number tag, forward fill the missing tag number
df['Col1'] = df['Col1'].ffill()
# Group by the tag numbers in `Col1` and join text in continuation line based on the same tag number
df_out = df.groupby('Col1', sort=False, as_index=False).agg(' - '.join)
Result:
print(df_out)
Col1 Col2
0 1: surgery?
1 2: Age
2 3: Hospital Number
3 4: rectal temperature
4 5: pulse - is a reflection of the heart condition: 30 -40 is normal for adults
5 6: respiratory rate
6 7: temperature of extremities - possible values:
7 8: peripheral pulse - possible values are:
8 9: mucous membranes - possible values are:
9 10: capillary refill time
10 11: pain - a subjective judgement of the horse's pain level - possible values:
11 12: peristalsis - possible values:
12 13: abdominal distension
13 14: nasogastric tube - possible values:
14 15: nasogastric reflux
15 16: nasogastric reflux PH
16 17: rectal examination - feces
17 18: abdomen
18 19: packed cell volume
19 20: total protein
20 21: abdominocentesis appearance - possible values:
21 22: abdomcentesis total protein
22 23: outcome - possible values:
23 24: surgical lesion? - possible values:
24 25, 26, 27: type of lesion
25 28: cp_data
You should first clean the data itself. It is unfortunate that it reaches the part where you want to make a DataFrame in such a noisy form. Ideally, the cleanup should come closer to the data collection itself.
Generally speaking, data cleaning is highly domain-dependent. For simple string cleanup, usually a combination of string.split(), re.match() and list comprehensions can go a long way.
For your specific case, the following gives good results (exercise for the reader: understand each bit of the expression, by trying reduced forms of it, starting by e.g. [v.splitlines()[0].split(':', 1) for v in sorted_info] building up toward the final form):
import re
cleaned = [
[s.split(' - ', 1)[0].strip().rstrip('?')
for s in v.splitlines()[0].split(':', 1)]
    for v in sorted_info if re.match(r'^ *\d+:', v)
]
>>> cleaned
[['1', 'surgery'],
['2', 'Age'],
['3', 'Hospital Number'],
['4', 'rectal temperature'],
['5', 'pulse'],
['6', 'respiratory rate'],
['7', 'temperature of extremities'],
['8', 'peripheral pulse'],
['9', 'mucous membranes'],
['10', 'capillary refill time'],
['11', 'pain'],
['12', 'peristalsis'],
['13', 'abdominal distension'],
['14', 'nasogastric tube'],
['15', 'nasogastric reflux'],
['16', 'nasogastric reflux PH'],
['17', 'rectal examination'],
['18', 'abdomen'],
['19', 'packed cell volume'],
['20', 'total protein'],
['21', 'abdominocentesis appearance'],
['22', 'abdomcentesis total protein'],
['23', 'outcome'],
['24', 'surgical lesion'],
['28', 'cp_data']]
# and
df = pd.DataFrame(cleaned, columns=['Col1', 'Col2'])
>>> df
Col1 Col2
0 1 surgery
1 2 Age
2 3 Hospital Number
3 4 rectal temperature
4 5 pulse
5 6 respiratory rate
6 7 temperature of extremities
7 8 peripheral pulse
8 9 mucous membranes
9 10 capillary refill time
10 11 pain
11 12 peristalsis
12 13 abdominal distension
13 14 nasogastric tube
14 15 nasogastric reflux
15 16 nasogastric reflux PH
16 17 rectal examination
17 18 abdomen
18 19 packed cell volume
19 20 total protein
20 21 abdominocentesis appearance
21 22 abdomcentesis total protein
22 23 outcome
23 24 surgical lesion
24 28 cp_data
I'm trying to use SciPy's dendrogram method to cut my data into a number of clusters based on a threshold value. However, once I create a dendrogram and retrieve its color_list, there is one fewer entry in the list than there are labels.
Alternatively, I've tried using fcluster with the same threshold value I identified in dendrogram; however, this does not render the same result -- it gives me one cluster instead of three.
Here's my code.
import pandas
data = pandas.DataFrame({'total_runs': {0: 2.489857755536053,
1: 1.2877651950650333, 2: 0.8898850111727028, 3: 0.77750321282732704, 4: 0.72593099987615461, 5: 0.70064977003207007,
6: 0.68217502514600825, 7: 0.67963194285399975, 8: 0.64238326692987524, 9: 0.6102581538587678, 10: 0.52588765899448564,
11: 0.44813665774322564, 12: 0.30434031343774476, 13: 0.26151929543260161, 14: 0.18623657993534984, 15: 0.17494230269731209,
16: 0.14023670906519603, 17: 0.096817318756050832, 18: 0.085822227670014059, 19: 0.042178447746868117, 20: -0.073494398270518693,
21: -0.13699665903273103, 22: -0.13733324345373216, 23: -0.31112299949731331, 24: -0.42369178918768974, 25: -0.54826542322710636,
26: -0.56090603814914863, 27: -0.63252372328438811, 28: -0.68787316140457322, 29: -1.1981351436422796, 30: -1.944118415387774,
31: -2.1899746357945964, 32: -2.9077222144449961},
'total_salaries': {0: 3.5998991340231234,
1: 1.6158435140488829, 2: 0.87501176080187315, 3: 0.57584734201367749, 4: 0.54559862861592978, 5: 0.85178295446270169,
6: 0.18345463930386757, 7: 0.81380836410678736, 8: 0.43412670908952178, 9: 0.29560433676606418, 10: 1.0636736398252848,
11: 0.08930130612600648, 12: -0.20839133305170349, 13: 0.33676911316165403, 14: -0.12404710480916628, 15: 0.82454221267393346,
16: -0.34510456295395986, 17: -0.17162157282367937, 18: -0.064803261585569982, 19: -0.22807757277294818, 20: -0.61709008778669083,
21: -0.42506873158089231, 22: -0.42637946918743924, 23: -0.53516500398181921, 24: -0.68219830809296633, 25: -1.0051418692474947,
26: -1.0900316082184143, 27: -0.82421065378673986, 28: 0.095758053930450004, 29: -0.91540963929213015, 30: -1.3296449323844519,
31: -1.5512503530547552, 32: -1.6573856443389405}})
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram
distanceMatrix = pdist(data)
dend = dendrogram(linkage(distanceMatrix, method='complete'),
                  color_threshold=4,
                  leaf_font_size=10,
                  labels=df.teamID.tolist())
len(dend['color_list'])
Out[169]: 32
len(df.index)
Out[170]: 33
Why is dendrogram only assigning colors to 32 labels, although there are 33 observations in the data? Is this how I extract the labels and their corresponding clusters (colored in blue, green and red above)? If not, how else do I 'cut' the tree properly?
Here's my attempt at using fcluster. Why does it return only one cluster for the set, when the same threshold for dend returns three?
from scipy.cluster.hierarchy import fcluster
fcluster(linkage(distanceMatrix, method='complete'), 4)
Out[175]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
Here's the answer: I didn't pass 'distance' as the criterion option to fcluster, so it used the default criterion='inconsistent'. With criterion='distance', the threshold 4 is interpreted as a cophenetic-distance cutoff (the same cut dendrogram colors with color_threshold), and I get the correct (3) cluster assignments.
assignments = fcluster(linkage(distanceMatrix, method='complete'),4,'distance')
print(assignments)
[3 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
cluster_output = pandas.DataFrame({'team':df.teamID.tolist() , 'cluster':assignments})
print(cluster_output)
cluster team
0 3 NYA
1 2 BOS
2 2 PHI
3 2 CHA
4 2 SFN
5 2 LAN
6 2 TEX
7 2 ATL
8 2 SLN
9 2 SEA
10 2 NYN
11 2 HOU
12 1 BAL
13 2 DET
14 1 ARI
15 2 CHN
16 1 CLE
17 1 CIN
18 1 TOR
19 1 COL
20 1 OAK
21 1 MIL
22 1 MIN
23 1 SDN
24 1 KCA
25 1 TBA
26 1 FLO
27 1 PIT
28 1 LAA
29 1 WAS
30 1 ANA
31 1 MON
32 1 MIA
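The criterion difference can be illustrated on a toy dataset (this example is mine, not from the original post): with criterion='distance', fcluster cuts the tree at cophenetic distance t, whereas the default criterion='inconsistent' uses inconsistency coefficients, which is why the same threshold of 4 gave a single cluster above.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Three tight, well-separated groups on a line.
rng = np.random.default_rng(0)
pts = np.concatenate([
    rng.normal(0, 0.1, 10),
    rng.normal(5, 0.1, 10),
    rng.normal(10, 0.1, 10),
]).reshape(-1, 1)

Z = linkage(pts, method='complete')

# 'distance' interprets t as a cophenetic-distance cutoff,
# so cutting at t=2 separates the three groups.
labels = fcluster(Z, t=2, criterion='distance')
```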