I'm trying to use SciPy's dendrogram method to cut my data into a number of clusters based on a threshold value. However, once I create a dendrogram and retrieve its color_list, there is one fewer entry in the list than there are labels.
Alternatively, I've tried using fcluster with the same threshold value I identified in dendrogram; however, this does not render the same result -- it gives me one cluster instead of three.
Here's my code:
import pandas
data = pandas.DataFrame({'total_runs': {0: 2.489857755536053,
1: 1.2877651950650333, 2: 0.8898850111727028, 3: 0.77750321282732704, 4: 0.72593099987615461, 5: 0.70064977003207007,
6: 0.68217502514600825, 7: 0.67963194285399975, 8: 0.64238326692987524, 9: 0.6102581538587678, 10: 0.52588765899448564,
11: 0.44813665774322564, 12: 0.30434031343774476, 13: 0.26151929543260161, 14: 0.18623657993534984, 15: 0.17494230269731209,
16: 0.14023670906519603, 17: 0.096817318756050832, 18: 0.085822227670014059, 19: 0.042178447746868117, 20: -0.073494398270518693,
21: -0.13699665903273103, 22: -0.13733324345373216, 23: -0.31112299949731331, 24: -0.42369178918768974, 25: -0.54826542322710636,
26: -0.56090603814914863, 27: -0.63252372328438811, 28: -0.68787316140457322, 29: -1.1981351436422796, 30: -1.944118415387774,
31: -2.1899746357945964, 32: -2.9077222144449961},
'total_salaries': {0: 3.5998991340231234,
1: 1.6158435140488829, 2: 0.87501176080187315, 3: 0.57584734201367749, 4: 0.54559862861592978, 5: 0.85178295446270169,
6: 0.18345463930386757, 7: 0.81380836410678736, 8: 0.43412670908952178, 9: 0.29560433676606418, 10: 1.0636736398252848,
11: 0.08930130612600648, 12: -0.20839133305170349, 13: 0.33676911316165403, 14: -0.12404710480916628, 15: 0.82454221267393346,
16: -0.34510456295395986, 17: -0.17162157282367937, 18: -0.064803261585569982, 19: -0.22807757277294818, 20: -0.61709008778669083,
21: -0.42506873158089231, 22: -0.42637946918743924, 23: -0.53516500398181921, 24: -0.68219830809296633, 25: -1.0051418692474947,
26: -1.0900316082184143, 27: -0.82421065378673986, 28: 0.095758053930450004, 29: -0.91540963929213015, 30: -1.3296449323844519,
31: -1.5512503530547552, 32: -1.6573856443389405}})
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram
distanceMatrix = pdist(data)
dend = dendrogram(linkage(distanceMatrix, method='complete'),
                  color_threshold=4,
                  leaf_font_size=10,
                  labels=df.teamID.tolist())
len(dend['color_list'])
Out[169]: 32
len(df.index)
Out[170]: 33
Why is dendrogram only assigning colors to 32 labels, although there are 33 observations in the data? Is this how I extract the labels and their corresponding clusters (colored in blue, green and red above)? If not, how else do I 'cut' the tree properly?
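The off-by-one is expected: `color_list` holds one color per link line drawn, and a linkage over n observations contains n-1 merge steps, so 33 leaves yield 32 link colors. A quick check on toy data (random points standing in for the team frame):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
pts = rng.normal(size=(33, 2))        # 33 observations, like the 33 teams
Z = linkage(pdist(pts), method='complete')
print(Z.shape)                        # (32, 4) -- n-1 merge steps
dend = dendrogram(Z, no_plot=True)    # no_plot skips drawing the figure
print(len(dend['color_list']))        # 32 -- one color per link, not per leaf
```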
Here's my attempt at using fcluster. Why does it return only one cluster for the set, when the same threshold for dend returns three?
from scipy.cluster.hierarchy import fcluster
fcluster(linkage(distanceMatrix, method='complete'), 4)
Out[175]:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
Here's the answer: I hadn't passed 'distance' as the criterion option to fcluster. With it, I get the correct (3) cluster assignments.
assignments = fcluster(linkage(distanceMatrix, method='complete'), 4, 'distance')
print(assignments)
[3 2 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
cluster_output = pandas.DataFrame({'team': df.teamID.tolist(), 'cluster': assignments})
print(cluster_output)
cluster team
0 3 NYA
1 2 BOS
2 2 PHI
3 2 CHA
4 2 SFN
5 2 LAN
6 2 TEX
7 2 ATL
8 2 SLN
9 2 SEA
10 2 NYN
11 2 HOU
12 1 BAL
13 2 DET
14 1 ARI
15 2 CHN
16 1 CLE
17 1 CIN
18 1 TOR
19 1 COL
20 1 OAK
21 1 MIL
22 1 MIN
23 1 SDN
24 1 KCA
25 1 TBA
26 1 FLO
27 1 PIT
28 1 LAA
29 1 WAS
30 1 ANA
31 1 MON
32 1 MIA
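If the goal is a fixed number of clusters rather than a distance cut, `criterion='maxclust'` asks fcluster for exactly k clusters directly. A minimal sketch on synthetic blobs (not the team data):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
# three well-separated blobs stand in for the standardized team features
pts = np.concatenate([rng.normal(0, 0.1, (5, 2)),
                      rng.normal(3, 0.1, (5, 2)),
                      rng.normal(6, 0.1, (5, 2))])
Z = linkage(pdist(pts), method='complete')
labels = fcluster(Z, 3, criterion='maxclust')   # request exactly 3 clusters
print(sorted(set(labels)))                      # [1, 2, 3]
```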
Dataset is something like this (there will be duplicate rows in the original):
Code:
import pandas as pd
df_in = pd.DataFrame({'email_ID': {0: 'sachinlaltaprayoohoo',
1: 'sachinlaltaprayoohoo',
2: 'sachinlaltaprayoohoo',
3: 'sachinlaltaprayoohoo',
4: 'sachinlaltaprayoohoo',
5: 'sachinlaltaprayoohoo',
6: 'sheldon.yokoohoo',
7: 'sheldon.yokoohoo',
8: 'sheldon.yokoohoo',
9: 'sheldon.yokoohoo',
10: 'sheldon.yokoohoo',
11: 'sheldon.yokoohoo'},
'time_stamp': {0: '2021-09-10 09:01:56.340259',
1: '2021-09-10 09:01:56.672814',
2: '2021-09-10 09:01:57.471423',
3: '2021-09-10 09:01:57.480891',
4: '2021-09-10 09:01:57.484644',
5: '2021-09-10 09:01:57.984644',
6: '2021-09-10 09:01:56.340259',
7: '2021-09-10 09:01:56.672814',
8: '2021-09-10 09:01:57.471423',
9: '2021-09-10 09:01:57.480891',
10: '2021-09-10 09:01:57.484644',
11: '2021-09-10 09:01:57.984644'},
'screen': {0: 'rewardapp.SplashActivity',
1: 'i1',
2: 'rewardapp.Signup_in',
3: 'rewardapp.PaymentFinalConfirmationActivity',
4: 'rewardapp.Signup_in',
5: 'i1',
6: 'rewardapp.SplashActivity',
7: 'i1',
8: 'rewardapp.Signup_in',
9: 'i1',
10: 'rewardapp.Signup_in',
11: 'rewardapp.PaymentFinalConfirmationActivity'}})
df_in['time_stamp'] = df_in['time_stamp'].astype('datetime64[ns]')
df_in
Output should be this:
Code:
import pandas as pd
df_out = pd.DataFrame({'email_ID': {0: 'sachinlaltaprayoohoo',
1: 'sachinlaltaprayoohoo',
2: 'sachinlaltaprayoohoo',
3: 'sachinlaltaprayoohoo',
4: 'sachinlaltaprayoohoo',
5: 'sachinlaltaprayoohoo',
6: 'sheldon.yokoohoo',
7: 'sheldon.yokoohoo',
8: 'sheldon.yokoohoo',
9: 'sheldon.yokoohoo',
10: 'sheldon.yokoohoo',
11: 'sheldon.yokoohoo'},
'time_stamp': {0: '2021-09-10 09:01:56.340259',
1: '2021-09-10 09:01:56.672814',
2: '2021-09-10 09:01:57.471423',
3: '2021-09-10 09:01:57.480891',
4: '2021-09-10 09:01:57.484644',
5: '2021-09-10 09:01:57.984644',
6: '2021-09-10 09:01:56.340259',
7: '2021-09-10 09:01:56.672814',
8: '2021-09-10 09:01:57.471423',
9: '2021-09-10 09:01:57.480891',
10: '2021-09-10 09:01:57.484644',
11: '2021-09-10 09:01:57.984644'},
'screen': {0: 'rewardapp.SplashActivity',
1: 'i1',
2: 'rewardapp.Signup_in',
3: 'rewardapp.PaymentFinalConfirmationActivity',
4: 'rewardapp.Signup_in',
5: 'i1',
6: 'rewardapp.SplashActivity',
7: 'i1',
8: 'rewardapp.Signup_in',
9: 'i1',
10: 'rewardapp.Signup_in',
11: 'rewardapp.PaymentFinalConfirmationActivity'},
'series1': {0: 0,
1: 1,
2: 2,
3: 3,
4: 0,
5: 1,
6: 0,
7: 1,
8: 2,
9: 3,
10: 4,
11: 5},
'series2': {0: 0,
1: 0,
2: 0,
3: 0,
4: 1,
5: 1,
6: 2,
7: 2,
8: 2,
9: 2,
10: 2,
11: 2}})
df_out['time_stamp'] = df_out['time_stamp'].astype('datetime64[ns]')
df_out
'series1' column values start row by row as 0, 1, 2, and so on, but reset to 0 when:
'email_ID' column value changes, or
'screen' column value == 'rewardapp.PaymentFinalConfirmationActivity'.
'series2' column values start at 0 and increment by 1 whenever 'series1' resets.
My progress:
series1 = [0]
x = 0
for index in df[1:].index:
    if ((df._get_value(index - 1, 'email_ID')) == df._get_value(index, 'email_ID')) and (df._get_value(index - 1, 'screen') != 'rewardapp.PaymentFinalConfirmationActivity'):
        x += 1
        series1.append(x)
    else:
        x = 0
        series1.append(x)
df['series1'] = series1
df

series2 = [0]
x = 0
for index in df[1:].index:
    if df._get_value(index, 'series1') - df._get_value(index - 1, 'series1') == 1:
        series2.append(x)
    else:
        x += 1
        series2.append(x)
df['series2'] = series2
df
I think the code above is working; I'll test the answered code and select the best in a few hours. Thank you.
Let's try:
m = (df_in['email_ID'].ne(df_in['email_ID'].shift().bfill()) |
df_in['screen'].shift().eq('rewardapp.PaymentFinalConfirmationActivity'))
df_in['series1'] = df_in.groupby(m.cumsum()).cumcount()
df_in['series2'] = m.cumsum()
print(df_in)
email_ID time_stamp screen series1 series2
0 sachinlaltaprayoohoo 2021-09-10 09:01:56.340259 rewardapp.SplashActivity 0 0
1 sachinlaltaprayoohoo 2021-09-10 09:01:56.672814 i1 1 0
2 sachinlaltaprayoohoo 2021-09-10 09:01:57.471423 rewardapp.Signup_in 2 0
3 sachinlaltaprayoohoo 2021-09-10 09:01:57.480891 rewardapp.PaymentFinalConfirmationActivity 3 0
4 sachinlaltaprayoohoo 2021-09-10 09:01:57.484644 rewardapp.Signup_in 0 1
5 sachinlaltaprayoohoo 2021-09-10 09:01:57.984644 i1 1 1
6 sheldon.yokoohoo 2021-09-10 09:01:56.340259 rewardapp.SplashActivity 0 2
7 sheldon.yokoohoo 2021-09-10 09:01:56.672814 i1 1 2
8 sheldon.yokoohoo 2021-09-10 09:01:57.471423 rewardapp.Signup_in 2 2
9 sheldon.yokoohoo 2021-09-10 09:01:57.480891 i1 3 2
10 sheldon.yokoohoo 2021-09-10 09:01:57.484644 rewardapp.Signup_in 4 2
11 sheldon.yokoohoo 2021-09-10 09:01:57.984644 rewardapp.PaymentFinalConfirmationActivity 5 2
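A sketch of why the mask works (toy columns, assuming the same reset rules): `m` flags the first row of each new run, so `m.cumsum()` forms a run id that `cumcount` restarts on.

```python
import pandas as pd

email = pd.Series(['a', 'a', 'a', 'b', 'b'])
screen = pd.Series(['s1', 'PAY', 's2', 's1', 's2'])   # 'PAY' plays the payment screen

m = (email.ne(email.shift().bfill()) |   # email changed on this row
     screen.shift().eq('PAY'))           # previous row was the payment screen
print(m.cumsum().tolist())               # [0, 0, 1, 2, 2] -> three runs
print(email.groupby(m.cumsum()).cumcount().tolist())   # [0, 1, 0, 0, 1]
```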
You can use:
import numpy as np

m = df_in['screen']=='rewardapp.PaymentFinalConfirmationActivity'
df_in['pf'] = np.where(m, 1, np.nan)
df_in.loc[m, 'pf'] = df_in[m].cumsum()
grouper = df_in.groupby('email_ID')['pf'].bfill()
df_in['series1'] = df_in.groupby(grouper).cumcount()
df_in['series2'] = df_in.groupby(grouper.fillna(0), sort=False).ngroup()
df_in.drop('pf', axis=1, inplace=True)
print(df_in)
email_ID time_stamp \
0 sachinlaltaprayoohoo 2021-09-10 09:01:56.340259
1 sachinlaltaprayoohoo 2021-09-10 09:01:56.672814
2 sachinlaltaprayoohoo 2021-09-10 09:01:57.471423
3 sachinlaltaprayoohoo 2021-09-10 09:01:57.480891
4 sachinlaltaprayoohoo 2021-09-10 09:01:57.484644
5 sachinlaltaprayoohoo 2021-09-10 09:01:57.984644
6 sheldon.yokoohoo 2021-09-10 09:01:56.340259
7 sheldon.yokoohoo 2021-09-10 09:01:56.672814
8 sheldon.yokoohoo 2021-09-10 09:01:57.471423
9 sheldon.yokoohoo 2021-09-10 09:01:57.480891
10 sheldon.yokoohoo 2021-09-10 09:01:57.484644
11 sheldon.yokoohoo 2021-09-10 09:01:57.984644
screen series1 series2
0 rewardapp.SplashActivity 0 0
1 i1 1 0
2 rewardapp.Signup_in 2 0
3 rewardapp.PaymentFinalConfirmationActivity 3 0
4 rewardapp.Signup_in 0 1
5 i1 1 1
6 rewardapp.SplashActivity 0 2
7 i1 1 2
8 rewardapp.Signup_in 2 2
9 i1 3 2
10 rewardapp.Signup_in 4 2
11 rewardapp.PaymentFinalConfirmationActivity 5 2
Explanation:
First locate the rows where 'screen' is 'rewardapp.PaymentFinalConfirmationActivity', then use cumsum() to number them.
This is accomplished by:
df_in['pf'] = np.where(m, 1, np.nan)
df_in.loc[m, 'pf'] = df_in[m].cumsum()
Then use bfill to backfill the NaN values with the positions where 'screen' shows 'PaymentFinalConfirmationActivity'. This will ensure the above rows are of the same group, but do it per email_ID. This is accomplished by:
grouper = df_in.groupby('email_ID')['pf'].bfill()
Then it is straightforward to see that once you define a grouper, you can use cumcount to get the series1 column. This is done by:
df_in['series1'] = df_in.groupby(grouper).cumcount()
Then get series2 column by using ngroup(). But make sure the groupby is done with sort=False. Done by:
df_in['series2'] = df_in.groupby(grouper.fillna(0), sort=False).ngroup()
Finally drop the unwanted column pf.
df_in.drop('pf', axis=1, inplace=True)
I have a DataFrame, and taking a subset of it, it has a dict constructor like:
df = pd.DataFrame(data = {'K': {8: 3.9274999999999998, 9: 1.9275, 10: 2.9274999999999998, 11: 2.9274999999999998, 12: 2.275, 13: 3.2750000000000004, 14: 2.775, 15: 2.8000000000000003, 16: 1.7999999999999998, 17: 2.8000000000000003, 18: 2.82, 19: 2.82, 20: 2.8000000000000003, 21: 2.8000000000000003, 22: 2.82, 23: 2.82, 24: 1.82, 25: 1.82, 26: 1.7999999999999998}, 'Struct': {8: 'Call', 9: 'Put', 10: 'Straddle', 11: 'Straddle', 12: 'Put', 13: 'Call', 14: 'Straddle', 15: 'Delta', 16: 'Put', 17: 'Put', 18: 'Put', 19: 'Delta', 20: 'Put', 21: 'Delta', 22: 'Delta', 23: 'Put', 24: 'Put', 25: 'Put', 26: 'Put'}, 'MainID': {8: 10, 9: 10, 10: 10, 11: 10, 12: 20, 13: 20, 14: 20, 15: 21, 16: 21, 17: 21, 18: 23, 19: 23, 20: 23, 21: 23, 22: 23, 23: 23, 24: 23, 25: 23, 26: 23}})
Markdown:

| Index | K | Struct | MainID |
|---|---|---|---|
| 8 | 3.9275 | Call | 10 |
| 9 | 1.9275 | Put | 10 |
| 10 | 2.9275 | Straddle | 10 |
| 11 | 2.9275 | Straddle | 10 |
| 12 | 2.275 | Put | 20 |
| 13 | 3.275 | Call | 20 |
| 14 | 2.775 | Straddle | 20 |
| 15 | 2.8 | Delta | 21 |
| 16 | 1.8 | Put | 21 |
| 17 | 2.8 | Put | 21 |
| 18 | 2.82 | Put | 23 |
| 19 | 2.82 | Delta | 23 |
| 20 | 2.8 | Put | 23 |
| 21 | 2.8 | Delta | 23 |
| 22 | 2.82 | Delta | 23 |
| 23 | 2.82 | Put | 23 |
| 24 | 1.82 | Put | 23 |
| 25 | 1.82 | Put | 23 |
| 26 | 1.8 | Put | 23 |
I am trying to find a way to do the following steps:
Groupby("MainID")
For any Call or Put, subtract "K" from either "Straddle", or "Delta" if it exists within the Groupby("MainID")
In the case that you have multiple Delta/Put/Call within a Groupby("MainID"), you would want to subtract based on ascending values. I.e., if K[Struct==Put] = [1,2,3] and K[Struct==Delta] = [2,2,3], the result would be [-1, 0, 0].
The resulting DF would look like:

| Index | K | Struct | MainID |
|---|---|---|---|
| 8 | 1 | Call | 10 |
| 9 | -1 | Put | 10 |
| 10 | 2.9275 | Straddle | 10 |
| 11 | 2.9275 | Straddle | 10 |
| 12 | -0.50 | Put | 20 |
| 13 | 0.50 | Call | 20 |
| 14 | 2.775 | Straddle | 20 |
| 15 | 2.8 | Delta | 21 |
| 16 | -1 | Put | 21 |
| 17 | 0 | Put | 21 |
| 18 | 0 | Put | 23 |
| 19 | 2.82 | Delta | 23 |
| 20 | 0 | Put | 23 |
| 21 | 2.8 | Delta | 23 |
| 22 | 2.82 | Delta | 23 |
| 23 | 0 | Put | 23 |
| 24 | -1 | Put | 23 |
| 25 | -1 | Put | 23 |
| 26 | -1 | Put | 23 |
Thanks so much! It's a tricky one...
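No accepted solution appears above, so here is one possible sketch. It assumes the reference strikes are the Straddles when present, otherwise the Deltas, and that sorted Call/Put strikes are paired against the sorted reference strikes, recycling the references when there are more Calls/Puts than references. That recycling reproduces the MainID 21 and 23 rows of the desired output, but it is a guess at the intended rule:

```python
import numpy as np
import pandas as pd

# df as given in the question (values rounded for readability)
df = pd.DataFrame({
    'K': {8: 3.9275, 9: 1.9275, 10: 2.9275, 11: 2.9275, 12: 2.275, 13: 3.275,
          14: 2.775, 15: 2.8, 16: 1.8, 17: 2.8, 18: 2.82, 19: 2.82, 20: 2.8,
          21: 2.8, 22: 2.82, 23: 2.82, 24: 1.82, 25: 1.82, 26: 1.8},
    'Struct': {8: 'Call', 9: 'Put', 10: 'Straddle', 11: 'Straddle', 12: 'Put',
               13: 'Call', 14: 'Straddle', 15: 'Delta', 16: 'Put', 17: 'Put',
               18: 'Put', 19: 'Delta', 20: 'Put', 21: 'Delta', 22: 'Delta',
               23: 'Put', 24: 'Put', 25: 'Put', 26: 'Put'},
    'MainID': {8: 10, 9: 10, 10: 10, 11: 10, 12: 20, 13: 20, 14: 20, 15: 21,
               16: 21, 17: 21, 18: 23, 19: 23, 20: 23, 21: 23, 22: 23, 23: 23,
               24: 23, 25: 23, 26: 23}})

def adjust_group(g: pd.DataFrame) -> pd.DataFrame:
    g = g.copy()
    # reference strikes: Straddle if present, else Delta (assumption)
    ref = g.loc[g['Struct'] == 'Straddle', 'K']
    if ref.empty:
        ref = g.loc[g['Struct'] == 'Delta', 'K']
    if ref.empty:
        return g
    mask = g['Struct'].isin(['Call', 'Put'])
    ordered = g.loc[mask, 'K'].sort_values()                  # ascending Call/Put strikes
    refs = np.resize(np.sort(ref.to_numpy()), len(ordered))   # recycle refs if needed
    g.loc[ordered.index, 'K'] = ordered.to_numpy() - refs
    return g

out = df.groupby('MainID', group_keys=False).apply(adjust_group)
```

Straddle and Delta rows keep their original K; only the Call/Put rows are replaced by differences.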
I am trying to build a dataframe that combines individual dataframes of county-level high school enrollment projections generated in a for loop.
I can do this for a single county, based on this SO question. It works great. My goal now is to do a nested for loop that would take multiple county FIPS codes, filter the inner loop on that, and generate an 11-row dataframe that would then be appended to a master dataframe. For three counties, for example, the final dataframe would be 33 rows.
But I haven't been able to get it right. I've tried to model on this SO question and answer.
This is my starting dataframe:
df = pd.DataFrame({"year": ['2020_21', '2020_21','2020_21'],
"county_fips" : ['06019','06021','06023'] ,
"grade11" : [5000,2000,2000],
"grade12": [5200,2200,2200],
"grade11_chg": [1.01,1.02,1.03],
"grade11_12_ratio": [0.9,0.8,0.87]})
df
This is my code with the nested loops. My intent is to run through the county codes in the outer loop and the projection year calculations in the inner loop.
projection_years=['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
for i in df['county_fips'].unique():
    print(i)
    grade11_change=df.iloc[0]['grade11_chg']
    grade11_12_ratio=df.iloc[0]['grade11_12_ratio']
    full_name=[]
    for year in projection_years:
        #print(year)
        df_select=df[df['county_fips']==i]
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row = {}
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
        full_name.append(df_select)
df_final=pd.concat(full_name)
df_final=df_final[['year','county_fips','grade11','grade12']]
print('Finished processing')
But I end up with NaN values and repeating years. Below is my desired output; I built this in Excel, and the numbers reflect rounding. (Update: this corrects the original df_final_goal.)
df_final_goal=pd.DataFrame({'year': {0: '2020_21', 1: '2021_22', 2: '2022_23', 3: '2023_24', 4: '2024_25', 5: '2025_26',
6: '2026_27', 7: '2027_28', 8: '2028_29', 9: '2029_30', 10: '2030_31', 11: '2020_21', 12: '2021_22', 13: '2022_23',
14: '2023_24', 15: '2024_25', 16: '2025_26', 17: '2026_27', 18: '2027_28', 19: '2028_29', 20: '2029_30', 21: '2030_31',
22: '2020_21', 23: '2021_22', 24: '2022_23', 25: '2023_24', 26: '2024_25', 27: '2025_26', 28: '2026_27', 29: '2027_28',
30: '2028_29', 31: '2029_30', 32: '2030_31'},
'county_fips': {0: '06019', 1: '06019', 2: '06019', 3: '06019', 4: '06019', 5: '06019', 6: '06019', 7: '06019', 8: '06019',
9: '06019', 10: '06019', 11: '06021', 12: '06021', 13: '06021', 14: '06021', 15: '06021', 16: '06021', 17: '06021', 18: '06021',
19: '06021', 20: '06021', 21: '06021', 22: '06023', 23: '06023', 24: '06023', 25: '06023', 26: '06023', 27: '06023',
28: '06023', 29: '06023', 30: '06023', 31: '06023', 32: '06023'},
'grade11': {0: 5000, 1: 5050, 2: 5101, 3: 5152, 4: 5203, 5: 5255, 6: 5308, 7: 5361, 8: 5414, 9: 5468, 10: 5523,
11: 2000, 12: 2040, 13: 2081, 14: 2122, 15: 2165, 16: 2208, 17: 2252, 18: 2297, 19: 2343, 20: 2390, 21: 2438,
22: 2000, 23: 2060, 24: 2122, 25: 2185, 26: 2251, 27: 2319, 28: 2388, 29: 2460, 30: 2534, 31: 2610, 32: 2688},
'grade12': {0: 5200, 1: 4500, 2: 4545, 3: 4590, 4: 4636, 5: 4683, 6: 4730, 7: 4777, 8: 4825, 9: 4873, 10: 4922,
11: 2200, 12: 1600, 13: 1632, 14: 1665, 15: 1698, 16: 1732, 17: 1767, 18: 1802, 19: 1838, 20: 1875, 21: 1912,
22: 2200, 23: 1740, 24: 1792, 25: 1846, 26: 1901, 27: 1958, 28: 2017, 29: 2078, 30: 2140, 31: 2204, 32: 2270}})
Thanks for any assistance.
Creating a helper function for calculating grade11 helps make this a bit easier.
import pandas as pd
def expand_grade11(
    grade11: int,
    grade11_chg: float,
    len_projection_years: int
) -> list:
    """
    Calculate `grade11` values based on the current
    `grade11`, `grade11_chg`, and the number of
    `projection_years`.
    """
    list_of_vals = []
    while len(list_of_vals) < len_projection_years:
        grade11 = int(grade11 * grade11_chg)
        list_of_vals.append(grade11)
    return list_of_vals

# initial info
df = pd.DataFrame({
    "year": ['2020_21', '2020_21', '2020_21'],
    "county_fips": ['06019', '06021', '06023'],
    "grade11": [5000, 2000, 2000],
    "grade12": [5200, 2200, 2200],
    "grade11_chg": [1.01, 1.02, 1.03],
    "grade11_12_ratio": [0.9, 0.8, 0.87]
})
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']

# converting to pd.MultiIndex
prods_index = pd.MultiIndex.from_product((df.county_fips.unique(), projection_years), names=["county_fips", "year"])

# setting index for future grouping/joining
df.set_index(["county_fips", "year"], inplace=True)

# calculate grade11
final = df.groupby([
    "county_fips",
    "year",
]).apply(lambda x: expand_grade11(x.grade11, x.grade11_chg, len(projection_years)))
final = final.explode()
final.index = prods_index
final = final.to_frame("grade11")

# concat with original df to get other columns
final = pd.concat([
    df, final
])
final.sort_index(level=["county_fips", "year"], inplace=True)
final.grade11_12_ratio.ffill(inplace=True)

# calculate grade12
grade12 = final.groupby([
    "county_fips"
]).apply(lambda x: x["grade11"] * x["grade11_12_ratio"])
grade12 = grade12.groupby("county_fips").shift(1)
grade12 = grade12.droplevel(0)

# put it all together
final.grade12.fillna(grade12, inplace=True)
final = final[["grade11", "grade12"]]
final = final.astype(int)
final.reset_index(inplace=True)
There are some bugs in the code. The code below seems to produce the result you expect (note that the final dataframe in the question is currently not consistent with the initial one):
projection_years = ['2021_22','2022_23','2023_24','2024_25','2025_26','2026_27','2027_28','2028_29','2029_30','2030_31']
full_name = []
for i in df['county_fips'].unique():
    print(i)
    df_select = df[df['county_fips']==i]
    grade11_change = df_select.iloc[0]['grade11_chg']
    grade11_12_ratio = df_select.iloc[0]['grade11_12_ratio']
    for year in projection_years:
        #print(year)
        lr = df_select.iloc[-1]
        row = {}
        row['year'] = year
        row['county_fips'] = i
        row['grade11'] = int(lr['grade11'] * grade11_change)
        row['grade12'] = int(lr['grade11'] * grade11_12_ratio)
        df_select = df_select.append([row])
    full_name.append(df_select)
df_final = pd.concat(full_name)
df_final = df_final[['year','county_fips','grade11','grade12']].reset_index()
print('Finished processing')
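One caveat with this fix: `DataFrame.append` was deprecated in pandas 1.4 and removed in 2.0. The same loop can instead collect plain dict rows and build each county's frame once; a sketch:

```python
import pandas as pd

df = pd.DataFrame({"year": ['2020_21', '2020_21', '2020_21'],
                   "county_fips": ['06019', '06021', '06023'],
                   "grade11": [5000, 2000, 2000],
                   "grade12": [5200, 2200, 2200],
                   "grade11_chg": [1.01, 1.02, 1.03],
                   "grade11_12_ratio": [0.9, 0.8, 0.87]})
projection_years = ['2021_22', '2022_23', '2023_24', '2024_25', '2025_26',
                    '2026_27', '2027_28', '2028_29', '2029_30', '2030_31']

frames = []
for fips, grp in df.groupby('county_fips', sort=False):
    # seed each county with its observed base-year row
    rows = grp[['year', 'county_fips', 'grade11', 'grade12']].to_dict('records')
    chg = grp.iloc[0]['grade11_chg']
    ratio = grp.iloc[0]['grade11_12_ratio']
    for year in projection_years:
        last = rows[-1]   # last projected (or base) row drives the next one
        rows.append({'year': year, 'county_fips': fips,
                     'grade11': int(last['grade11'] * chg),
                     'grade12': int(last['grade11'] * ratio)})
    frames.append(pd.DataFrame(rows))

df_final = pd.concat(frames, ignore_index=True)
```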
fixes:
full_name initialized before the outer loop
do not redefine df_select in the inner loop
row was initialized twice inside the inner loop
full_name.append moved outside of the inner loop and after it
added reset_index() to df_final (mostly cosmetic)
(edit) grade change variables (grade11_change and grade11_12_ratio) are now computed from the first row of df_select (and not of df)
the final result (print(df_final.to_markdown())) with the above code is:
| | index | year | county_fips | grade11 | grade12 |
|---|---|---|---|---|---|
| 0 | 0 | 2020_21 | 06019 | 5000 | 5200 |
| 1 | 0 | 2021_22 | 06019 | 5050 | 4500 |
| 2 | 0 | 2022_23 | 06019 | 5100 | 4545 |
| 3 | 0 | 2023_24 | 06019 | 5151 | 4590 |
| 4 | 0 | 2024_25 | 06019 | 5202 | 4635 |
| 5 | 0 | 2025_26 | 06019 | 5254 | 4681 |
| 6 | 0 | 2026_27 | 06019 | 5306 | 4728 |
| 7 | 0 | 2027_28 | 06019 | 5359 | 4775 |
| 8 | 0 | 2028_29 | 06019 | 5412 | 4823 |
| 9 | 0 | 2029_30 | 06019 | 5466 | 4870 |
| 10 | 0 | 2030_31 | 06019 | 5520 | 4919 |
| 11 | 1 | 2020_21 | 06021 | 2000 | 2200 |
| 12 | 0 | 2021_22 | 06021 | 2040 | 1600 |
| 13 | 0 | 2022_23 | 06021 | 2080 | 1632 |
| 14 | 0 | 2023_24 | 06021 | 2121 | 1664 |
| 15 | 0 | 2024_25 | 06021 | 2163 | 1696 |
| 16 | 0 | 2025_26 | 06021 | 2206 | 1730 |
| 17 | 0 | 2026_27 | 06021 | 2250 | 1764 |
| 18 | 0 | 2027_28 | 06021 | 2295 | 1800 |
| 19 | 0 | 2028_29 | 06021 | 2340 | 1836 |
| 20 | 0 | 2029_30 | 06021 | 2386 | 1872 |
| 21 | 0 | 2030_31 | 06021 | 2433 | 1908 |
| 22 | 2 | 2020_21 | 06023 | 2000 | 2200 |
| 23 | 0 | 2021_22 | 06023 | 2060 | 1740 |
| 24 | 0 | 2022_23 | 06023 | 2121 | 1792 |
| 25 | 0 | 2023_24 | 06023 | 2184 | 1845 |
| 26 | 0 | 2024_25 | 06023 | 2249 | 1900 |
| 27 | 0 | 2025_26 | 06023 | 2316 | 1956 |
| 28 | 0 | 2026_27 | 06023 | 2385 | 2014 |
| 29 | 0 | 2027_28 | 06023 | 2456 | 2074 |
| 30 | 0 | 2028_29 | 06023 | 2529 | 2136 |
| 31 | 0 | 2029_30 | 06023 | 2604 | 2200 |
| 32 | 0 | 2030_31 | 06023 | 2682 | 2265 |
note: edited to address the comments
I have a pandas dataframe df which looks as follows:
Unnamed: 0 Characters Length Characters Split A B C D Names with common 3-letters
0 FROKDUWJU 9 [FRO, KDU, WJU] FRO KDU WJU NaN
1 IDJWPZSUR 9 [IDJ, WPZ, SUR] IDJ WPZ SUR NaN
2 UCFURKIRODCQ 12 [UCF, URK, IRO, DCQ] UCF URK IRO DCQ
3 ORI 3 [ORI] ORI NaN NaN NaN
4 PROIRKIQARTIBPO 15 [PRO, IRK, IQA, RTI, BPO] PRO IRK IQA RTI
5 QAZWREDCQIBR 12 [QAZ, WRE, DCQ, IBR] QAZ WRE DCQ IBR
6 PLPRUFSWURKI 12 [PLP, RUF, SWU, RKI] PLP RUF SWU RKI
7 FROIEUSKIKIR 12 [FRO, IEU, SKI, KIR] FRO IEU SKI KIR
8 ORIUWJZSRFRO 12 [ORI, UWJ, ZSR, FRO] ORI UWJ ZSR FRO
9 URKIFJVUR 9 [URK, IFJ, VUR] URK IFJ VUR NaN
10 RUFOFR 6 [RUF, OFR] RUF OFR NaN NaN
11 IEU 3 [IEU] IEU NaN NaN NaN
12 PIMIEU 6 [PIM, IEU] PIM IEU NaN NaN
In the last column, Names with common 3-letters, I'd like to have a list of all the names from the first column which share a 3-letter set with that row's name. For example, in the first row, I'd like a list of all the names which have FRO, KDU, or WJU in their names. These 3-letter splits of the names can also be found in the "Characters Split" or A, B, C, and D columns for reference.
To put it stepwise: I need to check whether a 3-letter set present in a given row's name is also present in any name in the rest of the rows. If it is, I need to add the corresponding name of the other row to the list in the "Names with common 3-letters" column. As an example, in the screenshot attached, in column C, the yellow highlighted cells have the names that have a common 3-letter set with the name in the same row.
What would be the appropriate way to accomplish this? Should I use a function or a loop-statement?
Note: df.to_dict() looks as follows:
{'Unnamed: 0': {0: 'FROKDUWJU',
1: 'IDJWPZSUR',
2: 'UCFURKIRODCQ',
3: 'ORI',
4: 'PROIRKIQARTIBPO',
5: 'QAZWREDCQIBR',
6: 'PLPRUFSWURKI',
7: 'FROIEUSKIKIR',
8: 'ORIUWJZSRFRO',
9: 'URKIFJVUR',
10: 'RUFOFR',
11: 'IEU',
12: 'PIMIEU'},
'Characters Length': {0: 9,
1: 9,
2: 12,
3: 3,
4: 15,
5: 12,
6: 12,
7: 12,
8: 12,
9: 9,
10: 6,
11: 3,
12: 6},
'Characters Split': {0: ['FRO', 'KDU', 'WJU'],
1: ['IDJ', 'WPZ', 'SUR'],
2: ['UCF', 'URK', 'IRO', 'DCQ'],
3: ['ORI'],
4: ['PRO', 'IRK', 'IQA', 'RTI', 'BPO'],
5: ['QAZ', 'WRE', 'DCQ', 'IBR'],
6: ['PLP', 'RUF', 'SWU', 'RKI'],
7: ['FRO', 'IEU', 'SKI', 'KIR'],
8: ['ORI', 'UWJ', 'ZSR', 'FRO'],
9: ['URK', 'IFJ', 'VUR'],
10: ['RUF', 'OFR'],
11: ['IEU'],
12: ['PIM', 'IEU']},
'A': {0: 'FRO',
1: 'IDJ',
2: 'UCF',
3: 'ORI',
4: 'PRO',
5: 'QAZ',
6: 'PLP',
7: 'FRO',
8: 'ORI',
9: 'URK',
10: 'RUF',
11: 'IEU',
12: 'PIM'},
'B': {0: 'KDU',
1: 'WPZ',
2: 'URK',
3: nan,
4: 'IRK',
5: 'WRE',
6: 'RUF',
7: 'IEU',
8: 'UWJ',
9: 'IFJ',
10: 'OFR',
11: nan,
12: 'IEU'},
'C': {0: 'WJU',
1: 'SUR',
2: 'IRO',
3: nan,
4: 'IQA',
5: 'DCQ',
6: 'SWU',
7: 'SKI',
8: 'ZSR',
9: 'VUR',
10: nan,
11: nan,
12: nan},
'D': {0: nan,
1: nan,
2: 'DCQ',
3: nan,
4: 'RTI',
5: 'IBR',
6: 'RKI',
7: 'KIR',
8: 'FRO',
9: nan,
10: nan,
11: nan,
12: nan},
'Names with common 3-letters': {0: '',
1: '',
2: '',
3: '',
4: '',
5: '',
6: '',
7: '',
8: '',
9: '',
10: '',
11: '',
12: ''}}
There may be a quicker way to search and create the lists, but this works:
# create a separate temporary column (you can't search the Characters Split
# column directly, as the 3-letter boundaries aren't honored)
df['patrn'] = df.apply(lambda x: '|'.join(x['Characters Split']), axis=1)

def find_matches(x):
    # x.name is the row's index label
    new_df = df[~df.index.isin([x.name])]  # all rows except the current one
    return set(new_df.loc[df['patrn'].str.contains(x['patrn'], case=False)]['Unnamed: 0'].tolist())

df['Names with common 3-letters'] = df.apply(lambda x: find_matches(x), axis=1)
df
Output
Unnamed: 0 Characters Length Characters Split A B C D Names with common 3-letters patrn
0 FROKDUWJU 9 [FRO, KDU, WJU] FRO KDU WJU NaN {FROIEUSKIKIR, ORIUWJZSRFRO} FRO|KDU|WJU
1 IDJWPZSUR 9 [IDJ, WPZ, SUR] IDJ WPZ SUR NaN {} IDJ|WPZ|SUR
2 UCFURKIRODCQ 12 [UCF, URK, IRO, DCQ] UCF URK IRO DCQ {URKIFJVUR, QAZWREDCQIBR} UCF|URK|IRO|DCQ
3 ORI 3 [ORI] ORI NaN NaN NaN {ORIUWJZSRFRO} ORI
4 PROIRKIQARTIBPO 15 [PRO, IRK, IQA, RTI, BPO] PRO IRK IQA RTI {} PRO|IRK|IQA|RTI|BPO
5 QAZWREDCQIBR 12 [QAZ, WRE, DCQ, IBR] QAZ WRE DCQ IBR {UCFURKIRODCQ} QAZ|WRE|DCQ|IBR
6 PLPRUFSWURKI 12 [PLP, RUF, SWU, RKI] PLP RUF SWU RKI {RUFOFR} PLP|RUF|SWU|RKI
7 FROIEUSKIKIR 12 [FRO, IEU, SKI, KIR] FRO IEU SKI KIR {PIMIEU, FROKDUWJU, ORIUWJZSRFRO, IEU} FRO|IEU|SKI|KIR
8 ORIUWJZSRFRO 12 [ORI, UWJ, ZSR, FRO] ORI UWJ ZSR FRO {FROKDUWJU, FROIEUSKIKIR, ORI} ORI|UWJ|ZSR|FRO
9 URKIFJVUR 9 [URK, IFJ, VUR] URK IFJ VUR NaN {UCFURKIRODCQ} URK|IFJ|VUR
10 RUFOFR 6 [RUF, OFR] RUF OFR NaN NaN {PLPRUFSWURKI} RUF|OFR
11 IEU 3 [IEU] IEU NaN NaN NaN {PIMIEU, FROIEUSKIKIR} IEU
12 PIMIEU 6 [PIM, IEU] PIM IEU NaN NaN {FROIEUSKIKIR, IEU} PIM|IEU
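An alternative that skips regex altogether is to compare the triplet sets directly with set intersection; a sketch with a trimmed-down frame (a subset of the question's names, with the splits rebuilt from the names):

```python
import pandas as pd

# trimmed-down frame: a subset of the question's names
names = ['FROKDUWJU', 'UCFURKIRODCQ', 'ORI', 'FROIEUSKIKIR',
         'ORIUWJZSRFRO', 'URKIFJVUR', 'IEU', 'PIMIEU']
df = pd.DataFrame({'Unnamed: 0': names})
df['Characters Split'] = df['Unnamed: 0'].apply(
    lambda s: [s[i:i + 3] for i in range(0, len(s), 3)])

triplets = df['Characters Split'].apply(set)

def common_names(i):
    # names of every other row whose triplet set intersects this row's set
    return {df.at[j, 'Unnamed: 0'] for j in df.index
            if j != i and triplets[j] & triplets[i]}

df['Names with common 3-letters'] = [common_names(i) for i in df.index]
```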
I have a list:
sorted_info = [' 1: surgery?\n', ' 2: Age\n', ' 3: Hospital Number\n', ' 4: rectal temperature\n', ' 5: pulse\n', ' - is a reflection of the heart condition: 30 -40 is normal for adults\n', ' 6: respiratory rate\n', ' 7: temperature of extremities\n', ' - possible values:\n', ' 8: peripheral pulse\n', ' - possible values are:\n', ' 9: mucous membranes\n', ' - possible values are:\n', ' 10: capillary refill time\n', " 11: pain - a subjective judgement of the horse's pain level\n", ' - possible values:\n', ' 12: peristalsis\n', ' - possible values:\n', ' 13: abdominal distension\n', ' 14: nasogastric tube\n', ' - possible values:\n', ' 15: nasogastric reflux\n', ' 16: nasogastric reflux PH\n', ' 17: rectal examination - feces\n', ' 18: abdomen\n', ' 19: packed cell volume\n', ' 20: total protein\n', ' 21: abdominocentesis appearance\n', ' - possible values:\n', ' 22: abdomcentesis total protein\n', ' 23: outcome\n', ' - possible values:\n', ' 24: surgical lesion?\n', ' - possible values:\n', ' 25, 26, 27: type of lesion\n', ' 28: cp_data\n']
Further, I do:
import pandas as pd
pd.DataFrame(sorted_info)
0     1: surgery?\n
1     2: Age\n
2     3: Hospital Number\n
3     4: rectal temperature\n
4     5: pulse\n
5     - is a reflection of the heart condi...
6     6: respiratory rate\n
7     7: temperature of extremities\n
8     - possible values:\n
9     8: peripheral pulse\n
10    - possible values are:\n
11    9: mucous membranes\n
12    - possible values are:\n
13    10: capillary refill time\n
14    11: pain - a subjective judgement of the hors...
15    - possible values:\n
16    12: peristalsis\n
17    - possible values:\n
18    13: abdominal distension\n
19    14: nasogastric tube\n
20    - possible values:\n
21    15: nasogastric reflux\n
22    16: nasogastric reflux PH\n
23    17: rectal examination - feces\n
24    18: abdomen\n
25    19: packed cell volume\n
26    20: total protein\n
27    21: abdominocentesis appearance\n
28    - possible values:\n
29    22: abdomcentesis total protein\n
30    23: outcome\n
31    - possible values:\n
32    24: surgical lesion?\n
33    - possible values:\n
34    25, 26, 27: type of lesion\n
35    28: cp_data\n
So I am trying to split and clean it so that the DF will look like:
Col1 Col2
1: surgery
2: Age
3: Hospital Number
etc.
Any suggestions on how to split the Series into two columns and clean/delete the rest of the info?
Try:
import pandas as pd
sorted_info = [' 1: surgery?\n', ' 2: Age\n', ' 3: Hospital Number\n', ' 4: rectal temperature\n', ' 5: pulse\n', ' - is a reflection of the heart condition: 30 -40 is normal for adults\n', ' 6: respiratory rate\n', ' 7: temperature of extremities\n', ' - possible values:\n', ' 8: peripheral pulse\n', ' - possible values are:\n', ' 9: mucous membranes\n', ' - possible values are:\n', ' 10: capillary refill time\n', " 11: pain - a subjective judgement of the horse's pain level\n", ' - possible values:\n', ' 12: peristalsis\n', ' - possible values:\n', ' 13: abdominal distension\n', ' 14: nasogastric tube\n', ' - possible values:\n', ' 15: nasogastric reflux\n', ' 16: nasogastric reflux PH\n', ' 17: rectal examination - feces\n', ' 18: abdomen\n', ' 19: packed cell volume\n', ' 20: total protein\n', ' 21: abdominocentesis appearance\n', ' - possible values:\n', ' 22: abdomcentesis total protein\n', ' 23: outcome\n', ' - possible values:\n', ' 24: surgical lesion?\n', ' - possible values:\n', ' 25, 26, 27: type of lesion\n', ' 28: cp_data\n']
sorted_info = [x.strip() for x in sorted_info]
joined_list = []
for x in sorted_info:
    if x.startswith('-'):
        joined_list[-1] += ' ' + x
    else:
        joined_list.append(x)
df = pd.DataFrame(joined_list)
df[['Number', 'Text']] = df[0].str.split(':', n=1, expand=True)
del df[0]
print(df)
Output:
Number Text
0 1 surgery?
1 2 Age
2 3 Hospital Number
3 4 rectal temperature
4 5 pulse - is a reflection of the heart conditi...
5 6 respiratory rate
6 7 temperature of extremities - possible values:
7 8 peripheral pulse - possible values are:
8 9 mucous membranes - possible values are:
9 10 capillary refill time
10 11 pain - a subjective judgement of the horse's ...
11 12 peristalsis - possible values:
12 13 abdominal distension
13 14 nasogastric tube - possible values:
14 15 nasogastric reflux
15 16 nasogastric reflux PH
16 17 rectal examination - feces
17 18 abdomen
18 19 packed cell volume
19 20 total protein
20 21 abdominocentesis appearance - possible values:
21 22 abdomcentesis total protein
22 23 outcome - possible values:
23 24 surgical lesion? - possible values:
24 25, 26, 27 type of lesion
25 28 cp_data
Additional:
If you wanted to then go forward and expand the 25, 26, 27 values in row 24, try:
df = df.apply(lambda x: x.str.split(',').explode()).reset_index(drop=True)
Output:
Number Text
0 1 surgery?
1 2 Age
2 3 Hospital Number
3 4 rectal temperature
4 5 pulse - is a reflection of the heart conditi...
5 6 respiratory rate
6 7 temperature of extremities - possible values:
7 8 peripheral pulse - possible values are:
8 9 mucous membranes - possible values are:
9 10 capillary refill time
10 11 pain - a subjective judgement of the horse's ...
11 12 peristalsis - possible values:
12 13 abdominal distension
13 14 nasogastric tube - possible values:
14 15 nasogastric reflux
15 16 nasogastric reflux PH
16 17 rectal examination - feces
17 18 abdomen
18 19 packed cell volume
19 20 total protein
20 21 abdominocentesis appearance - possible values:
21 22 abdomcentesis total protein
22 23 outcome - possible values:
23 24 surgical lesion? - possible values:
24 25 type of lesion
25 26 type of lesion
26 27 type of lesion
27 28 cp_data
We can do it this way:
Read the list sorted_info into a Pandas Series.
Use .str.extract() with a regex to extract the number tag and the main contents of each line of text, giving Col1 and Col2 after the extraction.
For continuation lines without a number tag, use .ffill() to forward-fill the missing tag number.
Group by the tag number in Col1 and join the texts of continuation lines that share the same tag number.
Here is the code:
# Read the list `sorted_info` into a Pandas Series:
s = pd.Series(sorted_info)
# Extract the number tag and main contents of a line of text:
df = s.str.extract(r'\s*(?:(?P<Col1>\d+(?:,\s*\d+)*:)|-)\s*(?P<Col2>.*)', expand=True)
# For continuation lines without number tag, forward fill the missing tag number
df['Col1'] = df['Col1'].ffill()
# Group by the tag numbers in `Col1` and join text in continuation line based on the same tag number
df_out = df.groupby('Col1', sort=False, as_index=False).agg(' - '.join)
Result:
print(df_out)
Col1 Col2
0 1: surgery?
1 2: Age
2 3: Hospital Number
3 4: rectal temperature
4 5: pulse - is a reflection of the heart condition: 30 -40 is normal for adults
5 6: respiratory rate
6 7: temperature of extremities - possible values:
7 8: peripheral pulse - possible values are:
8 9: mucous membranes - possible values are:
9 10: capillary refill time
10 11: pain - a subjective judgement of the horse's pain level - possible values:
11 12: peristalsis - possible values:
12 13: abdominal distension
13 14: nasogastric tube - possible values:
14 15: nasogastric reflux
15 16: nasogastric reflux PH
16 17: rectal examination - feces
17 18: abdomen
18 19: packed cell volume
19 20: total protein
20 21: abdominocentesis appearance - possible values:
21 22: abdomcentesis total protein
22 23: outcome - possible values:
23 24: surgical lesion? - possible values:
24 25, 26, 27: type of lesion
25 28: cp_data
You should first clean the data itself. It is unfortunate that it arrives at the DataFrame-construction stage in such a noisy form; ideally, the cleanup should happen closer to the data collection itself.
Generally speaking, data cleaning is highly domain-dependent. For simple string cleanup, a combination of string.split(), re.match() and list comprehensions usually goes a long way.
For your specific case, the following gives good results (exercise for the reader: understand each bit of the expression by trying reduced forms of it, starting with e.g. [v.splitlines()[0].split(':', 1) for v in sorted_info] and building up toward the final form):
import re

cleaned = [
    [s.split(' - ', 1)[0].strip().rstrip('?')
     for s in v.splitlines()[0].split(':', 1)]
    for v in sorted_info if re.match(r'^ *\d+:', v)
]
>>> cleaned
[['1', 'surgery'],
['2', 'Age'],
['3', 'Hospital Number'],
['4', 'rectal temperature'],
['5', 'pulse'],
['6', 'respiratory rate'],
['7', 'temperature of extremities'],
['8', 'peripheral pulse'],
['9', 'mucous membranes'],
['10', 'capillary refill time'],
['11', 'pain'],
['12', 'peristalsis'],
['13', 'abdominal distension'],
['14', 'nasogastric tube'],
['15', 'nasogastric reflux'],
['16', 'nasogastric reflux PH'],
['17', 'rectal examination'],
['18', 'abdomen'],
['19', 'packed cell volume'],
['20', 'total protein'],
['21', 'abdominocentesis appearance'],
['22', 'abdomcentesis total protein'],
['23', 'outcome'],
['24', 'surgical lesion'],
['28', 'cp_data']]
# and
df = pd.DataFrame(cleaned, columns=['Col1', 'Col2'])
>>> df
Col1 Col2
0 1 surgery
1 2 Age
2 3 Hospital Number
3 4 rectal temperature
4 5 pulse
5 6 respiratory rate
6 7 temperature of extremities
7 8 peripheral pulse
8 9 mucous membranes
9 10 capillary refill time
10 11 pain
11 12 peristalsis
12 13 abdominal distension
13 14 nasogastric tube
14 15 nasogastric reflux
15 16 nasogastric reflux PH
16 17 rectal examination
17 18 abdomen
18 19 packed cell volume
19 20 total protein
20 21 abdominocentesis appearance
21 22 abdomcentesis total protein
22 23 outcome
23 24 surgical lesion
24 28 cp_data