I am trying to change the data of a column based on a condition. However, the condition doesn't seem to be applied correctly, and every value in the column is overwritten when it shouldn't be. Here is the code:
uh['Age']= uh['Age']
uh['AgeStatus'] = uh['Age']
uh['AgeStatus'] = uh.loc[uh['AgeStatus'] > 25.0, 'AgeStatus'] = 'Veteran'
It raises this TypeError:
TypeError: '>' not supported between instances of 'str' and 'float'
and the dataframe:
Year Age Tm Lg G PA ... BB SO BA OBP SLG AgeStatus
5 2021 28.0 CHW AL 88 391 ... 18 87 0.299 0.332 0.437 Veteran
2 2021 23.0 TOR AL 101 443 ... 29 90 0.296 0.348 0.487 Veteran
8 2021 28.0 BOS AL 97 409 ... 37 75 0.309 0.374 0.522 Veteran
6 2021 26.0 HOU AL 96 416 ... 53 80 0.272 0.368 0.476 Veteran
5 2021 27.0 ATL NL 105 431 ... 30 116 0.249 0.305 0.475 Veteran
2 2021 22.0 SDP NL 87 362 ... 43 102 0.292 0.373 0.651 Veteran
6 2021 28.0 WSN NL 96 420 ... 26 77 0.322 0.369 0.521 Veteran
[7 rows x 21 columns]
Really confused on what's causing this.
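Python's chained assignment (a = b = value) assigns the value to every target from left to right, so your line first sets the whole AgeStatus column to 'Veteran' and then compares those strings with 25.0, which raises the TypeError. Assign once through .loc, with the condition on the numeric Age column: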
uh.loc[uh['Age'] > 25.0, 'AgeStatus'] = 'Veteran'
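As a minimal sketch of the same pattern on made-up data (the four ages and the 'Rookie' default label are assumptions for illustration, not part of the original frame), the boolean mask selects the rows and the column label selects where to write:

```python
import pandas as pd

# Hypothetical stand-in for the question's frame; only the Age column matters here.
uh = pd.DataFrame({"Age": [28.0, 23.0, 26.0, 22.0]})

# Start from a default label (an assumed name), then overwrite only matching rows.
uh["AgeStatus"] = "Rookie"
uh.loc[uh["Age"] > 25.0, "AgeStatus"] = "Veteran"

status = uh["AgeStatus"].tolist()
print(status)
```

Rows that fail the condition keep whatever value the column already had, which is why the default is assigned first.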
I have a dataframe df1
    index   A   B   C   D   E
0       0  92  84
1       1  98  49
2       2  49  68
3       3   0  58
4       4  91  95
5       5  47  56  52  25  58
6       6  86  71  34  39  40
7       7  80  78   0  86  12
8       8   0   8  30  88  42
9       9  69  83   7  65  60
10     10  93  39  10  90  45
and also this data frame df2
index C D E F
0 0 27 95 51 45
1 1 99 33 92 67
2 2 68 37 29 65
3 3 99 25 48 40
4 4 33 74 55 66
5 13 65 76 19 62
I wish to get to the following outcome when merging df1 and df2
    index   A   B   C   D   E   F
0       0  92  84  27  95  51  45
1       1  98  49  99  33  92  67
2       2  49  68  68  37  29  65
3       3   0  58  99  25  48  40
4       4  91  95  33  74  55  66
5       5  47  56  52  25  58
6       6  86  71  34  39  40
7       7  80  78   0  86  12
8       8   0   8  30  88  42
9       9  69  83   7  65  60
10     10  93  39  10  90  45
11     13          65  76  19  62
However, I keep getting this when using pd.merge():
df_total=df1.merge(df2,how="outer",on="index",suffixes=(None,"_"))
df_total.replace(to_replace=np.nan,value=" ", inplace=True)
df_total
    index   A   B   C   D   E  C_  D_  E_   F
0       0  92  84              27  95  51  45
1       1  98  49              99  33  92  67
2       2  49  68              68  37  29  65
3       3   0  58              99  25  48  40
4       4  91  95              33  74  55  66
5       5  47  56  52  25  58
6       6  86  71  34  39  40
7       7  80  78   0  86  12
8       8   0   8  30  88  42
9       9  69  83   7  65  60
10     10  93  39  10  90  45
11     13                      65  76  19  62
Is there a way to get the desired outcome using pd.merge or a similar function?
Thanks
You can use .combine_first():
# convert the empty cells ("") to NaNs
df1 = df1.replace("", np.nan)
df2 = df2.replace("", np.nan)
# set indices and combine the dataframes
df1 = df1.set_index("index")
print(df1.combine_first(df2.set_index("index")).reset_index().fillna(""))
Prints:
    index     A     B     C     D     E     F
0       0  92.0  84.0  27.0  95.0  51.0  45.0
1       1  98.0  49.0  99.0  33.0  92.0  67.0
2       2  49.0  68.0  68.0  37.0  29.0  65.0
3       3   0.0  58.0  99.0  25.0  48.0  40.0
4       4  91.0  95.0  33.0  74.0  55.0  66.0
5       5  47.0  56.0  52.0  25.0  58.0
6       6  86.0  71.0  34.0  39.0  40.0
7       7  80.0  78.0   0.0  86.0  12.0
8       8   0.0   8.0  30.0  88.0  42.0
9       9  69.0  83.0   7.0  65.0  60.0
10     10  93.0  39.0  10.0  90.0  45.0
11     13              65.0  76.0  19.0  62.0
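To make the behaviour of combine_first easier to see, here is a reduced sketch on two tiny frames (the names left and right and their values are made up for illustration). The calling frame's non-null values win, its NaNs are filled from the other frame, and the result covers the union of both indices and both column sets:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature versions of df1 and df2 sharing column "C".
left = pd.DataFrame({"index": [0, 1], "A": [92, 98], "C": [np.nan, 52.0]})
right = pd.DataFrame({"index": [0, 13], "C": [27.0, 65.0], "F": [45, 62]})

combined = (
    left.set_index("index")
    .combine_first(right.set_index("index"))  # left wins where it is not NaN
    .reset_index()
)
print(combined)
```

Because missing cells become NaN, integer columns are upcast to float, which is why the printed result above shows values such as 92.0.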
I have used seaborn's titanic dataset as a proxy for my very large dataset to create the chart and data based on that.
The following code runs without any errors:
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_theme(style="darkgrid")
# Load the example Titanic dataset
df = sns.load_dataset("titanic")
# split fare into decile groups and order them
df['fare_grp'] = pd.qcut(df['fare'], q=10,labels=None, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp'],dropna=False).size()
df['fare_grp_num'] = pd.qcut(df['fare'], q=10,labels=False, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp_num'],dropna=False).size()
df['fare_ord_grp'] = df['fare_grp_num'] + ' ' +df['fare_grp']
df['fare_ord_grp']
# set variables
target = 'survived'
ydim = 'fare_ord_grp'
xdim = 'embark_town'
#del [result]
non_events = pd.DataFrame(df[df[target]==0].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'non_events'})
non_events[xdim]=non_events[xdim].replace(np.nan, 'Missing', regex=True)
non_events[ydim]=non_events[ydim].replace(np.nan, 'Missing', regex=True)
non_events_total = pd.DataFrame(df[df[target]==0].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'non_events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
events = pd.DataFrame(df[df[target]==1].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'events'})
events[xdim]=events[xdim].replace(np.nan, 'Missing', regex=True)
events[ydim]=events[ydim].replace(np.nan, 'Missing', regex=True)
events_total = pd.DataFrame(df[df[target]==1].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total = pd.DataFrame(df.groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total=grand_total.merge(non_events_total, how='left', on=xdim).merge(events_total, how='left', on=xdim)
result = pd.merge(non_events, events, how="outer",on=[ydim,xdim])
result['total'] = result['non_events'].fillna(0) + result['events'].fillna(0)
result[xdim] = result[xdim].replace(np.nan, 'Missing', regex=True)
result = pd.merge(result, grand_total, how="left",on=[xdim])
result['survival rate %'] = round(result['events']/result['total']*100,2)
result['% event dist by xdim'] = round(result['events']/result['events_total_by_xdim']*100,2)
result['% non-event dist by xdim'] = round(result['non_events']/result['non_events_total_by_xdim']*100,2)
result['% total dist by xdim'] = round(result['total']/result['total_by_xdim']*100,2)
display(result)
value_name1 = "% dist by " + str(xdim)
dfl = pd.melt(result, id_vars=[ydim, xdim],value_vars =['% total dist by xdim'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl2 = dfl.pivot(index=ydim, columns=xdim, values=value_name1)
print(dfl2)
title1 = "% dist by " + str(xdim)
ax=dfl2.T.plot(kind='bar', stacked=True, rot=1, figsize=(8, 8), title=title1)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
ax.legend(bbox_to_anchor=(1.0, 1.0),title = 'Fare Range')
ax.set_ylabel('% Dist')
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x+width/2, y+height/2,'{:.0f}%'.format(height),horizontalalignment='center', verticalalignment='center')
It produces the following stacked percent bar chart, which shows the % of total distribution by embark town.
I also want to show the survival rate along with the %distribution in each block. For example, for Queenstown, fare range 1 (7.6, 7.9], the % total distribution is 56%. I want to display the survival rate 37.21% as (56%, 37.21%). I am not able to figure it out. Kindly offer any suggestions. Thanks.
Here is the output summary table for reference
 | fare_ord_grp | embark_town | non_events | events | total | total_by_xdim | non_events_total_by_xdim | events_total_by_xdim | survival rate % | % event dist by xdim | % non-event dist by xdim | % total dist by xdim
0 | 0 (-0.1,7.6] | Cherbourg | 22 | 7 | 29 | 168 | 75 | 93 | 24.14 | 7.53 | 29.33 | 17.26
1 | 0 (-0.1,7.6] | Queenstown | 4 | NaN | 4 | 77 | 47 | 30 | NaN | NaN | 8.51 | 5.19
2 | 0 (-0.1,7.6] | Southampton | 53 | 6 | 59 | 644 | 427 | 217 | 10.17 | 2.76 | 12.41 | 9.16
3 | 1 (7.6,7.9] | Queenstown | 27 | 16 | 43 | 77 | 47 | 30 | 37.21 | 53.33 | 57.45 | 55.84
4 | 1 (7.6,7.9] | Southampton | 34 | 10 | 44 | 644 | 427 | 217 | 22.73 | 4.61 | 7.96 | 6.83
5 | 2 (7.9,8] | Cherbourg | 4 | 1 | 5 | 168 | 75 | 93 | 20 | 1.08 | 5.33 | 2.98
6 | 2 (7.9,8] | Southampton | 83 | 13 | 96 | 644 | 427 | 217 | 13.54 | 5.99 | 19.44 | 14.91
7 | 3 (8.0,10.5] | Cherbourg | 2 | 1 | 3 | 168 | 75 | 93 | 33.33 | 1.08 | 2.67 | 1.79
8 | 3 (8.0,10.5] | Queenstown | 2 | NaN | 2 | 77 | 47 | 30 | NaN | NaN | 4.26 | 2.6
9 | 3 (8.0,10.5] | Southampton | 56 | 17 | 73 | 644 | 427 | 217 | 23.29 | 7.83 | 13.11 | 11.34
10 | 4 (10.5,14.5] | Cherbourg | 7 | 8 | 15 | 168 | 75 | 93 | 53.33 | 8.6 | 9.33 | 8.93
11 | 4 (10.5,14.5] | Queenstown | 1 | 2 | 3 | 77 | 47 | 30 | 66.67 | 6.67 | 2.13 | 3.9
12 | 4 (10.5,14.5] | Southampton | 40 | 26 | 66 | 644 | 427 | 217 | 39.39 | 11.98 | 9.37 | 10.25
13 | 5 (14.5,21.7] | Cherbourg | 9 | 10 | 19 | 168 | 75 | 93 | 52.63 | 10.75 | 12 | 11.31
14 | 5 (14.5,21.7] | Queenstown | 5 | 3 | 8 | 77 | 47 | 30 | 37.5 | 10 | 10.64 | 10.39
15 | 5 (14.5,21.7] | Southampton | 37 | 24 | 61 | 644 | 427 | 217 | 39.34 | 11.06 | 8.67 | 9.47
16 | 6 (21.7,27] | Cherbourg | 1 | 4 | 5 | 168 | 75 | 93 | 80 | 4.3 | 1.33 | 2.98
17 | 6 (21.7,27] | Queenstown | 2 | 3 | 5 | 77 | 47 | 30 | 60 | 10 | 4.26 | 6.49
18 | 6 (21.7,27] | Southampton | 40 | 39 | 79 | 644 | 427 | 217 | 49.37 | 17.97 | 9.37 | 12.27
19 | 7 (27.0,39.7] | Cherbourg | 14 | 10 | 24 | 168 | 75 | 93 | 41.67 | 10.75 | 18.67 | 14.29
20 | 7 (27.0,39.7] | Queenstown | 5 | NaN | 5 | 77 | 47 | 30 | NaN | NaN | 10.64 | 6.49
21 | 7 (27.0,39.7] | Southampton | 38 | 24 | 62 | 644 | 427 | 217 | 38.71 | 11.06 | 8.9 | 9.63
22 | 8 (39.7,78] | Cherbourg | 5 | 19 | 24 | 168 | 75 | 93 | 79.17 | 20.43 | 6.67 | 14.29
23 | 8 (39.7,78] | Southampton | 37 | 28 | 65 | 644 | 427 | 217 | 43.08 | 12.9 | 8.67 | 10.09
24 | 9 (78.0,512.3] | Cherbourg | 11 | 33 | 44 | 168 | 75 | 93 | 75 | 35.48 | 14.67 | 26.19
25 | 9 (78.0,512.3] | Queenstown | 1 | 1 | 2 | 77 | 47 | 30 | 50 | 3.33 | 2.13 | 2.6
26 | 9 (78.0,512.3] | Southampton | 9 | 30 | 39 | 644 | 427 | 217 | 76.92 | 13.82 | 2.11 | 6.06
27 | 2 (7.9,8] | Queenstown | NaN | 5 | 5 | 77 | 47 | 30 | 100 | 16.67 | NaN | 6.49
28 | 9 (78.0,512.3] | Missing | NaN | 2 | 2 | 2 | NaN | 2 | 100 | 100 | NaN | 100
dfl2.T is being plotted, but 'survival rate %' is in result. As such, the indices for the values from dfl2.T do not correspond with 'survival rate %'.
Because the values in result['% total dist by xdim'] are not all unique, we can't use a dict of matched key-values.
Create a corresponding pivoted DataFrame for 'survival rate %', and then flatten it. All of the values will be in the same order as the '% total dist by xdim' values from dfl2.T. As such, they can be indexed.
With respect to dfl2.T, the plot API plots in column order, which means .flatten(order='F') must be used to flatten the array in the correct order to be indexed.
# create a corresponding pivoted dataframe for survival rate %
dfl3 = pd.melt(result, id_vars=[ydim, xdim],value_vars =['survival rate %'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl4 = dfl3.pivot(index=ydim, columns=xdim, values=value_name1)
# flatten dfl4.T in column order
dfl4_flattened = dfl4.T.to_numpy().flatten(order='F')
for i, p in enumerate(ax.patches):
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    # only print values when height is not 0
    if height != 0:
        # create the text string
        text = f'{height:.0f}%, {dfl4_flattened[i]:.0f}%'
        # annotate the bar segments
        ax.text(x+width/2, y+height/2, text, horizontalalignment='center', verticalalignment='center')
Notes
Here we can see dfl2.T and dfl4.T
# dfl2.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 17.26 NaN 2.98 1.79 8.93 11.31 2.98 14.29 14.29 26.19
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown 5.19 55.84 6.49 2.60 3.90 10.39 6.49 6.49 NaN 2.60
Southampton 9.16 6.83 14.91 11.34 10.25 9.47 12.27 9.63 10.09 6.06
# dfl4.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 24.14 NaN 20.00 33.33 53.33 52.63 80.00 41.67 79.17 75.00
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown NaN 37.21 100.00 NaN 66.67 37.50 60.00 NaN NaN 50.00
Southampton 10.17 22.73 13.54 23.29 39.39 39.34 49.37 38.71 43.08 76.92
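The ordering argument can be checked without drawing anything. The sketch below uses two made-up 2x2 frames (pct and rate are hypothetical stand-ins for dfl2.T and dfl4.T) and flattens both in column-major order, so position i in one array lines up with position i in the other. Skipping pairs that contain NaN is an extra choice made here, not something the code above does:

```python
import numpy as np
import pandas as pd

towns = ["Cherbourg", "Queenstown"]
# Hypothetical miniature versions of dfl2.T (distribution) and dfl4.T (survival rate).
pct = pd.DataFrame({"grp0": [17.0, 5.0], "grp1": [np.nan, 56.0]}, index=towns)
rate = pd.DataFrame({"grp0": [24.0, np.nan], "grp1": [np.nan, 37.0]}, index=towns)

# order='F' walks down each column before moving to the next one.
pct_flat = pct.to_numpy().flatten(order="F")
rate_flat = rate.to_numpy().flatten(order="F")

# Build one label per position, skipping positions where either value is NaN.
labels = [
    f"{p:.0f}%, {r:.0f}%"
    for p, r in zip(pct_flat, rate_flat)
    if not (np.isnan(p) or np.isnan(r))
]
print(labels)
```

The same positional pairing is what lets dfl4_flattened[i] be matched to the i-th patch in the loop above.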
I have the following code to extract data from a table, but because of the second for loop it saves all the data of a column as a single element of the array.
Is there a way to separate each element of the array below? Link for stat_table:
for table in stat_table:
    for cell in table.find_all('table'):
        stmat.append(cell.text)
        print(cell.text)
        count = count + 1
print(count)
print(stmat)
print(stmat[0])
This is the output, where all the data from the second loop is saved as a single element:
[' Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 ', ' Max Avg Min 82 73.6 70 82 72.9 0 81 74.2 70 84 76.4
70 86 75.3 68 82 74.6 68 82 74.6 68 82 74.1 68 81 73.9 68 82 75.4 68 84
75.4 68 81 73.9 68 82 75.0 68 79 72.8 68 81 73.6 68 81 73.5 68 82 74.2 68
82 74.9 68 82 73.6 68 79 71.9 66 82 72.7 66 81 71.3 63 82 74.1 63 82 75.0
64 86 76.4 68 84 75.7 68 82 75.4 68 84 75.5 66 84 74.0 66 86 76.7 66 ',
' Max Avg Min 68 66.6 66 68 64.8 0 66 65.2 64 66 65.9 64 68 65.8 64 66 65.3
64 66 64.7 64 68 66.3 64 70 67.1 64 68 65.9 63 70 66.4 64 68 67.2 66 68
66.4 64 68 66.0 64 70 67.4 66 70 67.0 66 68 65.5 64 66 65.4 64 70 67.1 64
70 67.1 66 68 65.6 64 66 61.6 59 66 60.3 55 64 60.0 50 66 62.7 59 68 64.8
63 68 63.8 61 66 63.9 61 68 64.3 63 68 64.8 61 ', ' Max Avg Min 94 80.1 58
88 75.1 0 88 75.3 58 88 71.4 51 94 74.0 48 94 74.8 54 94 73.4 54 94 78.4
54 100 80.7 58 100 73.9 51 100 76.7 51 100 81.0 61 94 76.0 58 94 80.3 65
94 82.5 61 94 81.4 61 94 76.8 54 94 74.4 54 100 82.0 58 100 86.1 65 100
80.4 54 100 73.1 48 94 64.6 39 100 62.2 32 88 64.3 40 94 70.4 48 94 69.2
48 94 68.8 45 88 73.4 48 94 68.9 43 ', ' Max Avg Min 23 15.9 10 22 15.7 10
26 15.2 8 20 13.6 8 21 13.6 8 21 13.2 8 22 14.8 9 20 12.2 7 15 10.4 3
14 8.8 0 16 10.2 5 14 8.7 1 16 10.9 6 17 12.1 7 17 11.1 6 16 11.2 5 18
11.2 5 17 12.4 8 15 10.1 5 15 9.2 3 17 11.6 7 15 9.3 3 12 6.1 0 12 5.2
0 10 6.1 0 10 5.8 0 9 4.8 0 10 5.2 0 10 4.5 0 14 4.7 0 ', ' Max Avg Min
26.8 26.7 26.6 26.8 26.1 0.0 26.8 26.8 26.7 26.8 26.8 26.7 26.8 26.8 26.7 26.9
26.8 26.7 26.8 26.8 26.7 26.8 26.8 26.7 26.9 26.8 26.8 26.9 26.8 26.7 26.8 26.8
26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.8 26.7
26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.9 26.8 26.9 26.9 26.8 26.9 26.8 26.8 26.9
26.8 26.8 26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.9 26.8 26.9 26.9 26.8 26.9 26.8
26.8 26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.8 26.8 ', ' Total 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ']
This is the output of stmat[0], whereas I want stmat[0] = 'Sep':
Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Given the outputs you show, I'm guessing that
cell.text == "Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 ', ' Max Avg Min 82 73.6 70 82 72.9 0 81 74.2 70 84 76.4
70 86 75.3 68 82 74.6 68 82 74.6 68 82 74.1 68 81 73.9 68 82 75.4 68 84
75.4 68 81 73.9 68 82 75.0 68 79 72.8 68 81 73.6 68 81 73.5 68 82 74.2 68
82 74.9 68 82 73.6 68 79 71.9 66 82 72.7 66 81 71.3 63 82 74.1 63 82 75.0
64 86 76.4 68 84 75.7 68 82 75.4 68 84 75.5 66 84 74.0 66 86 76.7 66"
So if you actually want individual values, you should probably do something like:
for table in stat_table:
    for cell in table.find_all('table'):
        # split() with no argument splits on any whitespace and drops empty strings
        cell_values = cell.text.split()
        stmat.extend(cell_values)
        count = count + len(cell_values)
print(count)
print(stmat)
print(stmat[0])
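One detail worth checking when splitting cell.text: the strings in the output above start and end with a space. A small sketch on a shortened, made-up string shows how split(" ") and split() differ in that case:

```python
# Shortened, hypothetical version of one cell's text, with the same
# leading and trailing space seen in the printed output.
cell_text = " Sep 1 2 3 "

# Splitting on a single literal space keeps empty strings at both ends.
naive = cell_text.split(" ")

# Splitting with no argument uses any run of whitespace and drops empties.
clean = cell_text.split()

print(naive)
print(clean)
```

With the no-argument form, the first element is 'Sep', which is what the question asks for in stmat[0].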
I have a file that I parsed into a pandas DataFrame, and I want to group the values in column 2 by each distinct value in column 3.
0 1 2 3 4
0 00B2 0 -67 39 1.13
1 00B2 85 -72 39 1.13
2 00B2 1 -67 86 1.13
3 00B2 2 -67 87 1.13
4 00B2 3 -67 88 1.13
5 00B2 91 -67 39 1.13
6 00B2 4 -67 246 1.13
7 00B2 5 -67 78 1.13
8 00B2 6 -67 10 1.13
9 00B2 7 -67 153 1.13
10 00B2 1 -67 38 1.13
11 00B2 8 -67 225 1.13
12 00B2 9 -67 135 1.13
13 00B2 10 -67 23 1.13
14 00B2 4 -67 38 1.13
15 00B2 11 -67 132 1.13
16 00B2 12 -71 214 1.13
17 00B2 13 -71 71 1.13
18 00B2 14 -71 215 1.13
19 00B2 8 -71 38 1.13
20 00B2 15 -71 249 1.13
21 00B2 16 -71 174 1.13
22 00B2 17 -71 196 1.13
23 00B2 18 -71 38 1.13
24 00B2 19 -71 252 1.13
25 00B2 20 -71 196 1.13
26 00B2 21 -71 39 1.13
27 00B2 22 -71 39 1.13
28 00B2 23 -71 252 1.13
29 00B2 24 -71 39 1.13
.. ... .. ... ... ...
I want the data that looks something like this
DF1:
-67 37
-72 37
-71 37
... ...
DF2:
-68 38
-67 38
-70 38
... ...
DF3:
-64 39
-63 39
-62 39
... ...
I have tried the following:
e1 = pd.DataFrame(e1)
print (e1)
group = e1[3][2] == "group"
print (e1[group])
This is nowhere close to what I want. How can I group the data this way?
I think you need to create a dictionary of Series by converting the groupby object to tuples and then to a dict:
d = dict(tuple(df.groupby(3)[2]))
print (d[39])
0 -67
1 -72
5 -67
26 -71
27 -71
29 -71
Name: 2, dtype: int64
For a dictionary of DataFrames:
d1 = dict(tuple(df.groupby(3)))
print (d1[39])
0 1 2 3 4
0 00B2 0 -67 39 1.13
1 00B2 85 -72 39 1.13
5 00B2 91 -67 39 1.13
26 00B2 21 -71 39 1.13
27 00B2 22 -71 39 1.13
29 00B2 24 -71 39 1.13
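For anyone who wants to try this without the original file, here is a self-contained sketch on a four-row made-up frame (the values are illustrative, and integer column labels 2 and 3 mirror the question). Iterating a groupby yields (key, group) pairs, which is why dict(tuple(...)) works:

```python
import pandas as pd

# Hypothetical rows; column 3 is the grouping key, column 2 holds the values.
df = pd.DataFrame({2: [-67, -72, -67, -71], 3: [39, 39, 86, 38]})

# One Series of column-2 values per distinct value in column 3.
series_by_key = dict(tuple(df.groupby(3)[2]))

# One sub-DataFrame per distinct value in column 3.
frames_by_key = dict(tuple(df.groupby(3)))

values_39 = series_by_key[39].tolist()
print(values_39)
print(frames_by_key[39])
```

Each group keeps the original row labels, so the rows can still be traced back to the source frame.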