Grouping pandas dataframe based on common key - python

I have a file that I have parsed into a pandas DataFrame, and I want to group the values in column 2 by their corresponding value in column 3.
0 1 2 3 4
0 00B2 0 -67 39 1.13
1 00B2 85 -72 39 1.13
2 00B2 1 -67 86 1.13
3 00B2 2 -67 87 1.13
4 00B2 3 -67 88 1.13
5 00B2 91 -67 39 1.13
6 00B2 4 -67 246 1.13
7 00B2 5 -67 78 1.13
8 00B2 6 -67 10 1.13
9 00B2 7 -67 153 1.13
10 00B2 1 -67 38 1.13
11 00B2 8 -67 225 1.13
12 00B2 9 -67 135 1.13
13 00B2 10 -67 23 1.13
14 00B2 4 -67 38 1.13
15 00B2 11 -67 132 1.13
16 00B2 12 -71 214 1.13
17 00B2 13 -71 71 1.13
18 00B2 14 -71 215 1.13
19 00B2 8 -71 38 1.13
20 00B2 15 -71 249 1.13
21 00B2 16 -71 174 1.13
22 00B2 17 -71 196 1.13
23 00B2 18 -71 38 1.13
24 00B2 19 -71 252 1.13
25 00B2 20 -71 196 1.13
26 00B2 21 -71 39 1.13
27 00B2 22 -71 39 1.13
28 00B2 23 -71 252 1.13
29 00B2 24 -71 39 1.13
.. ... .. ... ... ...
I want the data to look something like this:
DF1:
-67 37
-72 37
-71 37
... ...
DF2:
-68 38
-67 38
-70 38
... ...
DF3:
-64 39
-63 39
-62 39
... ...
I have tried the following:
e1 = pd.DataFrame(e1)
print (e1)
group = e1[3][2] == "group"
print (e1[group])
This gets me nowhere close to what I want, so how do I group the data to match my requirement?

I think you need to create a dictionary of Series by converting the groupby object to a tuple of (key, group) pairs and then to a dict:
d = dict(tuple(df.groupby(3)[2]))
print (d[39])
0 -67
1 -72
5 -67
26 -71
27 -71
29 -71
Name: 2, dtype: int64
For DataFrame:
d1 = dict(tuple(df.groupby(3)))
print (d1[39])
0 1 2 3 4
0 00B2 0 -67 39 1.13
1 00B2 85 -72 39 1.13
5 00B2 91 -67 39 1.13
26 00B2 21 -71 39 1.13
27 00B2 22 -71 39 1.13
29 00B2 24 -71 39 1.13
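If you only need one pass over the groups rather than random access by key, you can also iterate over the groupby object directly instead of building the dict first; a minimal sketch of the same grouping:
# each iteration yields a (key, sub-DataFrame) pair for one value of column 3
for key, group in df.groupby(3):
    print(key)
    print(group)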

Related

How to split some pandas data frame rows that are not lists?

Is there a way to change this data frame:
40 4.5 95
41 1.76 95
112 0.17/0.43 >95/>95
to this using pandas:
40 4.5 95
41 1.76 95
112 0.17 95
112 0.43 95
This is the pandas dataframe:
a b
19 560 80
40 4.5 95
41 1.76 95
112 0.17/0.43 >95/>95
154 7.2/1 >95/>95
... ... ...
2991 55 95
2992 33 95
3887 6.1 87.7
3893 3.9 70.3
3908 100 40
216 rows × 2 columns
I would use explode:
df = df.apply(lambda x: x.astype(str).str.split('/').explode(ignore_index=True))
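For completeness, here is a self-contained sketch of that pipeline on the sample rows above; the str.lstrip('>') step is my addition to also drop the '>' prefix shown in the desired output, and note that ignore_index=True resets the row labels:
import pandas as pd

df = pd.DataFrame({'a': ['4.5', '1.76', '0.17/0.43'],
                   'b': ['95', '95', '>95/>95']},
                  index=[40, 41, 112])
# split every column on '/' and explode the pieces; ignore_index=True keeps
# the exploded columns aligned with each other positionally
df = df.apply(lambda x: x.astype(str).str.split('/').explode(ignore_index=True))
# drop the '>' prefix to match the desired output
df['b'] = df['b'].str.lstrip('>')
print(df)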

Merge dataframes and also merge columns into a single column

I have a dataframe df1
index A B C D E
0 0 92 84
1 1 98 49
2 2 49 68
3 3 0 58
4 4 91 95
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
and also this data frame df2
index C D E F
0 0 27 95 51 45
1 1 99 33 92 67
2 2 68 37 29 65
3 3 99 25 48 40
4 4 33 74 55 66
5 13 65 76 19 62
I wish to get to the following outcome when merging df1 and df2
index A B C D E F
0 0 92 84 27 95 51 45
1 1 98 49 99 33 92 67
2 2 49 68 68 37 29 65
3 3 0 58 99 25 48 40
4 4 91 95 33 74 55 66
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
11 13 65 76 19 62
However, I keep getting this when using pd.merge():
df_total=df1.merge(df2,how="outer",on="index",suffixes=(None,"_"))
df_total.replace(to_replace=np.nan,value=" ", inplace=True)
df_total
index A B C D E C_ D_ E_ F
0 0 92 84 27 95 51 45
1 1 98 49 99 33 92 67
2 2 49 68 68 37 29 65
3 3 0 58 99 25 48 40
4 4 91 95 33 74 55 66
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
11 13 65 76 19 62
Is there a way to get the desired outcome using pd.merge or a similar function?
Thanks
You can use .combine_first():
# convert the empty cells ("") to NaNs
df1 = df1.replace("", np.nan)
df2 = df2.replace("", np.nan)
# set indices and combine the dataframes
df1 = df1.set_index("index")
print(df1.combine_first(df2.set_index("index")).reset_index().fillna(""))
Prints:
index A B C D E F
0 0 92.0 84.0 27.0 95.0 51.0 45.0
1 1 98.0 49.0 99.0 33.0 92.0 67.0
2 2 49.0 68.0 68.0 37.0 29.0 65.0
3 3 0.0 58.0 99.0 25.0 48.0 40.0
4 4 91.0 95.0 33.0 74.0 55.0 66.0
5 5 47.0 56.0 52.0 25.0 58.0
6 6 86.0 71.0 34.0 39.0 40.0
7 7 80.0 78.0 0.0 86.0 12.0
8 8 0.0 8.0 30.0 88.0 42.0
9 9 69.0 83.0 7.0 65.0 60.0
10 10 93.0 39.0 10.0 90.0 45.0
11 13 65.0 76.0 19.0 62.0
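If the trailing .0s in that output are unwanted, one option (a sketch, assuming pandas >= 1.0 for convert_dtypes) is to convert to nullable integer dtypes and blank out missing values only when printing:
# combine as before, then let pandas pick nullable integer dtypes
out = df1.combine_first(df2.set_index("index")).convert_dtypes()
# na_rep="" renders the <NA> cells as blanks without touching the data
print(out.reset_index().to_string(na_rep=""))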

How to add additional text to matplotlib annotations

I have used seaborn's titanic dataset as a proxy for my very large dataset, and created the chart and data based on it.
The following code runs without any errors:
import seaborn as sns
import pandas as pd
import numpy as np
sns.set_theme(style="darkgrid")
# Load the example Titanic dataset
df = sns.load_dataset("titanic")
# split fare into decile groups and order them
df['fare_grp'] = pd.qcut(df['fare'], q=10,labels=None, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp'],dropna=False).size()
df['fare_grp_num'] = pd.qcut(df['fare'], q=10,labels=False, retbins=False, precision=0).astype(str)
df.groupby(['fare_grp_num'],dropna=False).size()
df['fare_ord_grp'] = df['fare_grp_num'] + ' ' +df['fare_grp']
df['fare_ord_grp']
# set variables
target = 'survived'
ydim = 'fare_ord_grp'
xdim = 'embark_town'
#del [result]
non_events = pd.DataFrame(df[df[target]==0].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'non_events'})
non_events[xdim]=non_events[xdim].replace(np.nan, 'Missing', regex=True)
non_events[ydim]=non_events[ydim].replace(np.nan, 'Missing', regex=True)
non_events_total = pd.DataFrame(df[df[target]==0].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'non_events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
events = pd.DataFrame(df[df[target]==1].groupby([ydim,xdim],as_index=False, dropna=False)[target].count()).rename(columns={target: 'events'})
events[xdim]=events[xdim].replace(np.nan, 'Missing', regex=True)
events[ydim]=events[ydim].replace(np.nan, 'Missing', regex=True)
events_total = pd.DataFrame(df[df[target]==1].groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'events_total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total = pd.DataFrame(df.groupby([xdim],dropna=False,as_index=False)[target].count()).rename(columns={target: 'total_by_xdim'}).replace(np.nan, 'Missing', regex=True)
grand_total=grand_total.merge(non_events_total, how='left', on=xdim).merge(events_total, how='left', on=xdim)
result = pd.merge(non_events, events, how="outer",on=[ydim,xdim])
result['total'] = result['non_events'].fillna(0) + result['events'].fillna(0)
result[xdim] = result[xdim].replace(np.nan, 'Missing', regex=True)
result = pd.merge(result, grand_total, how="left",on=[xdim])
result['survival rate %'] = round(result['events']/result['total']*100,2)
result['% event dist by xdim'] = round(result['events']/result['events_total_by_xdim']*100,2)
result['% non-event dist by xdim'] = round(result['non_events']/result['non_events_total_by_xdim']*100,2)
result['% total dist by xdim'] = round(result['total']/result['total_by_xdim']*100,2)
display(result)
value_name1 = "% dist by " + str(xdim)
dfl = pd.melt(result, id_vars=[ydim, xdim],value_vars =['% total dist by xdim'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl2 = dfl.pivot(index=ydim, columns=xdim, values=value_name1)
print(dfl2)
title1 = "% dist by " + str(xdim)
ax=dfl2.T.plot(kind='bar', stacked=True, rot=1, figsize=(8, 8), title=title1)
ax.set_xticklabels(ax.get_xticklabels(), rotation=45)
ax.legend(bbox_to_anchor=(1.0, 1.0),title = 'Fare Range')
ax.set_ylabel('% Dist')
for p in ax.patches:
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    ax.text(x+width/2, y+height/2,'{:.0f}%'.format(height),horizontalalignment='center', verticalalignment='center')
It produces the following stacked percent bar chart, which shows the % of total distribution by embark town.
I also want to show the survival rate along with the %distribution in each block. For example, for Queenstown, fare range 1 (7.6, 7.9], the % total distribution is 56%. I want to display the survival rate 37.21% as (56%, 37.21%). I am not able to figure it out. Kindly offer any suggestions. Thanks.
Here is the output summary table for reference
    fare_ord_grp  embark_town  non_events  events  total  total_by_xdim  non_events_total_by_xdim  events_total_by_xdim  survival rate %  % event dist by xdim  % non-event dist by xdim  % total dist by xdim
0   0 (-0.1,7.6]  Cherbourg  22  7  29  168  75  93  24.14  7.53  29.33  17.26
1   0 (-0.1,7.6]  Queenstown  4  NaN  4  77  47  30  NaN  NaN  8.51  5.19
2   0 (-0.1,7.6]  Southampton  53  6  59  644  427  217  10.17  2.76  12.41  9.16
3   1 (7.6,7.9]  Queenstown  27  16  43  77  47  30  37.21  53.33  57.45  55.84
4   1 (7.6,7.9]  Southampton  34  10  44  644  427  217  22.73  4.61  7.96  6.83
5   2 (7.9,8]  Cherbourg  4  1  5  168  75  93  20  1.08  5.33  2.98
6   2 (7.9,8]  Southampton  83  13  96  644  427  217  13.54  5.99  19.44  14.91
7   3 (8.0,10.5]  Cherbourg  2  1  3  168  75  93  33.33  1.08  2.67  1.79
8   3 (8.0,10.5]  Queenstown  2  NaN  2  77  47  30  NaN  NaN  4.26  2.6
9   3 (8.0,10.5]  Southampton  56  17  73  644  427  217  23.29  7.83  13.11  11.34
10  4 (10.5,14.5]  Cherbourg  7  8  15  168  75  93  53.33  8.6  9.33  8.93
11  4 (10.5,14.5]  Queenstown  1  2  3  77  47  30  66.67  6.67  2.13  3.9
12  4 (10.5,14.5]  Southampton  40  26  66  644  427  217  39.39  11.98  9.37  10.25
13  5 (14.5,21.7]  Cherbourg  9  10  19  168  75  93  52.63  10.75  12  11.31
14  5 (14.5,21.7]  Queenstown  5  3  8  77  47  30  37.5  10  10.64  10.39
15  5 (14.5,21.7]  Southampton  37  24  61  644  427  217  39.34  11.06  8.67  9.47
16  6 (21.7,27]  Cherbourg  1  4  5  168  75  93  80  4.3  1.33  2.98
17  6 (21.7,27]  Queenstown  2  3  5  77  47  30  60  10  4.26  6.49
18  6 (21.7,27]  Southampton  40  39  79  644  427  217  49.37  17.97  9.37  12.27
19  7 (27.0,39.7]  Cherbourg  14  10  24  168  75  93  41.67  10.75  18.67  14.29
20  7 (27.0,39.7]  Queenstown  5  NaN  5  77  47  30  NaN  NaN  10.64  6.49
21  7 (27.0,39.7]  Southampton  38  24  62  644  427  217  38.71  11.06  8.9  9.63
22  8 (39.7,78]  Cherbourg  5  19  24  168  75  93  79.17  20.43  6.67  14.29
23  8 (39.7,78]  Southampton  37  28  65  644  427  217  43.08  12.9  8.67  10.09
24  9 (78.0,512.3]  Cherbourg  11  33  44  168  75  93  75  35.48  14.67  26.19
25  9 (78.0,512.3]  Queenstown  1  1  2  77  47  30  50  3.33  2.13  2.6
26  9 (78.0,512.3]  Southampton  9  30  39  644  427  217  76.92  13.82  2.11  6.06
27  2 (7.9,8]  Queenstown  NaN  5  5  77  47  30  100  16.67  NaN  6.49
28  9 (78.0,512.3]  Missing  NaN  2  2  2  NaN  2  100  100  NaN  100
dfl2.T is being plotted, but 'survival rate %' is in result. As such, the indices for the values from dfl2.T do not correspond with 'survival rate %'.
Because the values in result['% total dist by xdim'] are not unique, we can't use a dict of matched key-value pairs.
Create a corresponding pivoted DataFrame for 'survival rate %', and then flatten it. All of the values will be in the same order as the '% total dist by xdim' values from dfl2.T. As such, they can be indexed.
With respect to dfl2.T, the plot API plots in column order, which means .flatten(order='F') must be used to flatten the array in the correct order to be indexed.
# create a corresponding pivoted dataframe for survival rate %
dfl3 = pd.melt(result, id_vars=[ydim, xdim],value_vars =['survival rate %'], var_name = 'Type',value_name=value_name1).drop(columns='Type')
dfl4 = dfl3.pivot(index=ydim, columns=xdim, values=value_name1)
# flatten dfl4.T in column order
dfl4_flattened = dfl4.T.to_numpy().flatten(order='F')
for i, p in enumerate(ax.patches):
    width, height = p.get_width(), p.get_height()
    x, y = p.get_xy()
    # only print values when height is not 0
    if height != 0:
        # create the text string
        text = f'{height:.0f}%, {dfl4_flattened[i]:.0f}%'
        # annotate the bar segments
        ax.text(x+width/2, y+height/2, text, horizontalalignment='center', verticalalignment='center')
Notes
Here we can see dfl2.T and dfl4.T
# dfl2.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 17.26 NaN 2.98 1.79 8.93 11.31 2.98 14.29 14.29 26.19
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown 5.19 55.84 6.49 2.60 3.90 10.39 6.49 6.49 NaN 2.60
Southampton 9.16 6.83 14.91 11.34 10.25 9.47 12.27 9.63 10.09 6.06
# dfl4.T
fare_ord_grp 0 (-0.1, 7.6] 1 (7.6, 7.9] 2 (7.9, 8.0] 3 (8.0, 10.5] 4 (10.5, 14.5] 5 (14.5, 21.7] 6 (21.7, 27.0] 7 (27.0, 39.7] 8 (39.7, 78.0] 9 (78.0, 512.3]
embark_town
Cherbourg 24.14 NaN 20.00 33.33 53.33 52.63 80.00 41.67 79.17 75.00
Missing NaN NaN NaN NaN NaN NaN NaN NaN NaN 100.00
Queenstown NaN 37.21 100.00 NaN 66.67 37.50 60.00 NaN NaN 50.00
Southampton 10.17 22.73 13.54 23.29 39.39 39.34 49.37 38.71 43.08 76.92
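As an alternative to indexing ax.patches by hand, newer matplotlib (3.4+, an assumption about the environment) offers ax.bar_label. Each entry of ax.containers is one stack level, i.e. one column of dfl2.T, so walking the dfl4 columns in the same order lines the survival rates up with the bar heights:
# one container per fare group (stack level); dfl4.T.items() yields the
# matching survival-rate column for each level, indexed by embark_town
for container, (col, rates) in zip(ax.containers, dfl4.T.items()):
    labels = [f'{bar.get_height():.0f}%, {rate:.0f}%' if bar.get_height() != 0 else ''
              for bar, rate in zip(container, rates)]
    ax.bar_label(container, labels=labels, label_type='center')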

for loop has saved a list as a single element

I have the following code to extract data from a table, but because of the second for loop it saves all the data of a column as a single element of the list. Is there a way to separate each element of the array below? Link for stat_table:
for table in stat_table:
    for cell in table.find_all('table'):
        stmat.append(cell.text)
        print(cell.text)
        count = count + 1
print(count)
print(stmat)
print(stmat[0])
This is the output, where all the data from the second loop is saved as a single element:
[' Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 ', ' Max Avg Min 82 73.6 70 82 72.9 0 81 74.2 70 84 76.4
70 86 75.3 68 82 74.6 68 82 74.6 68 82 74.1 68 81 73.9 68 82 75.4 68 84
75.4 68 81 73.9 68 82 75.0 68 79 72.8 68 81 73.6 68 81 73.5 68 82 74.2 68
82 74.9 68 82 73.6 68 79 71.9 66 82 72.7 66 81 71.3 63 82 74.1 63 82 75.0
64 86 76.4 68 84 75.7 68 82 75.4 68 84 75.5 66 84 74.0 66 86 76.7 66 ',
' Max Avg Min 68 66.6 66 68 64.8 0 66 65.2 64 66 65.9 64 68 65.8 64 66 65.3
64 66 64.7 64 68 66.3 64 70 67.1 64 68 65.9 63 70 66.4 64 68 67.2 66 68
66.4 64 68 66.0 64 70 67.4 66 70 67.0 66 68 65.5 64 66 65.4 64 70 67.1 64
70 67.1 66 68 65.6 64 66 61.6 59 66 60.3 55 64 60.0 50 66 62.7 59 68 64.8
63 68 63.8 61 66 63.9 61 68 64.3 63 68 64.8 61 ', ' Max Avg Min 94 80.1 58
88 75.1 0 88 75.3 58 88 71.4 51 94 74.0 48 94 74.8 54 94 73.4 54 94 78.4
54 100 80.7 58 100 73.9 51 100 76.7 51 100 81.0 61 94 76.0 58 94 80.3 65
94 82.5 61 94 81.4 61 94 76.8 54 94 74.4 54 100 82.0 58 100 86.1 65 100
80.4 54 100 73.1 48 94 64.6 39 100 62.2 32 88 64.3 40 94 70.4 48 94 69.2
48 94 68.8 45 88 73.4 48 94 68.9 43 ', ' Max Avg Min 23 15.9 10 22 15.7 10
26 15.2 8 20 13.6 8 21 13.6 8 21 13.2 8 22 14.8 9 20 12.2 7 15 10.4 3
14 8.8 0 16 10.2 5 14 8.7 1 16 10.9 6 17 12.1 7 17 11.1 6 16 11.2 5 18
11.2 5 17 12.4 8 15 10.1 5 15 9.2 3 17 11.6 7 15 9.3 3 12 6.1 0 12 5.2
0 10 6.1 0 10 5.8 0 9 4.8 0 10 5.2 0 10 4.5 0 14 4.7 0 ', ' Max Avg Min
26.8 26.7 26.6 26.8 26.1 0.0 26.8 26.8 26.7 26.8 26.8 26.7 26.8 26.8 26.7 26.9
26.8 26.7 26.8 26.8 26.7 26.8 26.8 26.7 26.9 26.8 26.8 26.9 26.8 26.7 26.8 26.8
26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.8 26.7
26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.9 26.8 26.9 26.9 26.8 26.9 26.8 26.8 26.9
26.8 26.8 26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.9 26.8 26.9 26.9 26.8 26.9 26.8
26.8 26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.8 26.8 ', ' Total 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ']
This is the output of stmat[0], whereas I want stmat[0] to be just 'Sep':
Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Given the outputs you show, I'm guessing that
cell.text == "Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 ', ' Max Avg Min 82 73.6 70 82 72.9 0 81 74.2 70 84 76.4
70 86 75.3 68 82 74.6 68 82 74.6 68 82 74.1 68 81 73.9 68 82 75.4 68 84
75.4 68 81 73.9 68 82 75.0 68 79 72.8 68 81 73.6 68 81 73.5 68 82 74.2 68
82 74.9 68 82 73.6 68 79 71.9 66 82 72.7 66 81 71.3 63 82 74.1 63 82 75.0
64 86 76.4 68 84 75.7 68 82 75.4 68 84 75.5 66 84 74.0 66 86 76.7 66"
So if you actually want individual values, you should probably do something like:
for table in stat_table:
    for cell in table.find_all('table'):
        # split on any run of whitespace so each value becomes its own element
        # (split(" ") would leave empty strings from the repeated spaces)
        cell_values = cell.text.split()
        stmat.extend(cell_values)
        count = count + len(cell_values)
print(count)
print(stmat)
print(stmat[0])
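Since each cell here is itself an HTML table, it's worth noting that pandas.read_html can often do this kind of extraction in one step (a sketch, assuming the raw page markup is available; html is a hypothetical variable holding it):
import pandas as pd

# hypothetical: html holds the page source that stat_table was parsed from
tables = pd.read_html(html)  # returns one DataFrame per <table> element
print(tables[0])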

Conditionally replacing values in one DataFrame column with values from another column

Below is a sample DataFrame with column Y already present, but I want to calculate Y from column X in this way:
If there is a decline in X for 3 consecutive weeks, and that cumulative decline is < -2%, then Y for this and the previous weeks should be equal to the last X value in that run of declining X values, up to 12 weeks previously.
Week X %Change Y
w1 96.07 NA 88.478
w2 95.835 -0.24% 88.478
w3 95.402 -0.45% 88.478
w4 94.914 -0.51% 88.478
w5 94.28 -0.67% 88.478
w6 93.042 -1.31% 88.206
w7 91.891 -1.24% 87.993
w8 90.074 -1.98% 87.189
w9 90.541 0.52% 86.637
w10 90.13 -0.45% 86.304
w11 88.635 -1.66% 86.304
w12 88.478 -0.18% 86.304
w13 88.486 0.01% 86.304
w14 87.798 -0.78% 86.304
w15 88.23 0.49% 86.304
w16 88.395 0.19% 90
w17 88.206 -0.21% 87.842
w18 87.993 -0.24% 86.301
w19 87.189 -0.91% 85.133
w20 86.637 -0.63% 83.567
w21 86.304 -0.38% 81.418
w22 86.539 0.27% 80.193
w23 88.411 2.16% 80.193
w24 89.475 1.20% 79.62
w25 90.229 0.84% 79.191
w26 90.581 0.39% 77.519
w27 90 -0.64% 77.513
w28 87.842 -2.40% 77.513
w29 86.301 -1.75% 76.651
w30 85.133 -1.35% 75.48
w31 83.567 -1.84% 74.813
w32 81.418 -2.57% 74.512
w33 80.193 -1.50% 73.479
w34 80.28 0.11% 72.895
w35 79.62 -0.82% 71.888
w36 79.191 -0.54% 71.24
w37 77.519 -2.11% 70.064
w38 77.513 -0.01% 69.456
w39 77.57 0.07% 67.542
w40 76.651 -1.18% 66.687
w41 75.48 -1.53% 65.568
w42 74.813 -0.88% 64.483
w43 74.512 -0.40% 63.60
w44 73.479 -1.39% 62.979
w45 72.895 -0.79% 62.829
w46 71.888 -1.38% 62.39
w47 71.24 -0.90% 61.819
w48 70.064 -1.65% 61.819
w49 69.456 -0.87% 61.819
w50 67.542 -2.76% 61.819
w51 66.687 -1.27% 61.819
w52 65.568 -1.68% 61.819
w53 64.483 -1.65% 61.819
w54 63.604 -1.36% 61.819
w55 62.979 -0.98% 61.819
w56 62.829 -0.24% 61.819
w57 62.39 -0.70% 61.819
w58 61.819 -0.92% 61.819
w59 61.83 0.02% 61.83
w60 62.796 1.56% 62.796
w61 63.52 1.15% 63.52
w62 65.132 2.54% 65.132
w63 66.148 1.56% 66.148
w64 66.698 0.83% 66.698
w65 67.324 0.94% 67.324
w66 68.418 1.62% 68.418
w67 68.432 0.02% 68.432
w68 67.818 -0.90% 72.41
w69 69.108 1.90% 72.296
w70 69.911 1.16% 71.682
w71 70.484 0.82% 71.411
w72 71.479 1.41% 70.835
w73 72.155 0.95% 69.561
w74 73.549 1.93% 68.628
w75 73.452 -0.13% 67.344
w76 73.928 0.65% 67.344
w77 72.832 -1.48% 67.344
w78 72.934 0.14% 67.344
w79 72.41 -0.72% 67.344
w80 72.296 -0.16% 67.344
w81 71.682 -0.85% 67.344
w82 71.411 -0.38% 67.344
w83 70.835 -0.81% 67.344
w84 69.561 -1.80% 67.344
w85 68.628 -1.34% 67.344
w86 67.344 -1.87% 67.344
w87 67.669 0.48% 67.669
Based on our discussion in the comments, I hope this does what you need:
import pandas as pd
def find_nY(i):
    """For index number i, find the number n of Y values to be replaced."""
    if df.Change[i] >= 0:
        return 1
    j = i
    while j >= 1 and df.Change[j - 1] < 0:
        j -= 1
    if i - j >= 2 and sum(df.Change[j:i+1]) <= -2:
        n = min(i - j + 1, 12)
    else:
        n = 1
    return n

def replace_Y(i):
    """Replaces Y values with X for a run of decreases ending at i."""
    n = find_nY(i)
    df.loc[i-n+1:i, 'Y'] = [df.X[i]] * n

df = pd.read_csv('ShiftingValues.txt', sep=' ', header=0)
df['Week'] = df['Week'].str.strip('w').astype(int)
df['Change'] = df['Change'].astype(str).str.strip('%').astype(float)
df['Y'] = df['X']
for i in df.index[2:df.index[-1]]:
    if df.Change[i + 1] >= 0:
        replace_Y(i)
replace_Y(df.index[-1])
print(df.to_string())
Week X Change Y
0 1 96.070 NaN 96.070
1 2 95.835 -0.24 90.074
2 3 95.402 -0.45 90.074
3 4 94.914 -0.51 90.074
4 5 94.280 -0.67 90.074
5 6 93.042 -1.31 90.074
6 7 91.891 -1.24 90.074
7 8 90.074 -1.98 90.074
8 9 90.541 0.52 90.541
9 10 90.130 -0.45 88.478
10 11 88.635 -1.66 88.478
11 12 88.478 -0.18 88.478
12 13 88.486 0.01 88.486
13 14 87.798 -0.78 87.798
14 15 88.230 0.49 88.230
15 16 88.395 0.19 88.395
16 17 88.206 -0.21 86.304
17 18 87.993 -0.24 86.304
18 19 87.189 -0.91 86.304
19 20 86.637 -0.63 86.304
20 21 86.304 -0.38 86.304
21 22 86.539 0.27 86.539
22 23 88.411 2.16 88.411
23 24 89.475 1.20 89.475
24 25 90.229 0.84 90.229
25 26 90.581 0.39 90.581
26 27 90.000 -0.64 80.193
27 28 87.842 -2.40 80.193
28 29 86.301 -1.75 80.193
29 30 85.133 -1.35 80.193
30 31 83.567 -1.84 80.193
31 32 81.418 -2.57 80.193
32 33 80.193 -1.50 80.193
33 34 80.280 0.11 80.280
34 35 79.620 -0.82 77.513
35 36 79.191 -0.54 77.513
36 37 77.519 -2.11 77.513
37 38 77.513 -0.01 77.513
38 39 77.570 0.07 77.570
39 40 76.651 -1.18 76.651
40 41 75.480 -1.53 75.480
41 42 74.813 -0.88 74.813
42 43 74.512 -0.40 74.512
43 44 73.479 -1.39 73.479
44 45 72.895 -0.79 72.895
45 46 71.888 -1.38 71.888
46 47 71.240 -0.90 61.819
47 48 70.064 -1.65 61.819
48 49 69.456 -0.87 61.819
49 50 67.542 -2.76 61.819
50 51 66.687 -1.27 61.819
51 52 65.568 -1.68 61.819
52 53 64.483 -1.65 61.819
53 54 63.604 -1.36 61.819
54 55 62.979 -0.98 61.819
55 56 62.829 -0.24 61.819
56 57 62.390 -0.70 61.819
57 58 61.819 -0.92 61.819
58 59 61.830 0.02 61.830
59 60 62.796 1.56 62.796
60 61 63.520 1.15 63.520
61 62 65.132 2.54 65.132
62 63 66.148 1.56 66.148
63 64 66.698 0.83 66.698
64 65 67.324 0.94 67.324
65 66 68.418 1.62 68.418
66 67 68.432 0.02 68.432
67 68 67.818 -0.90 67.818
68 69 69.108 1.90 69.108
69 70 69.911 1.16 69.911
70 71 70.484 0.82 70.484
71 72 71.479 1.41 71.479
72 73 72.155 0.95 72.155
73 74 73.549 1.93 73.549
74 75 73.452 -0.13 73.452
75 76 73.928 0.65 73.928
76 77 72.832 -1.48 72.832
77 78 72.934 0.14 72.934
78 79 72.410 -0.72 67.344
79 80 72.296 -0.16 67.344
80 81 71.682 -0.85 67.344
81 82 71.411 -0.38 67.344
82 83 70.835 -0.81 67.344
83 84 69.561 -1.80 67.344
84 85 68.628 -1.34 67.344
85 86 67.344 -1.87 67.344
86 87 67.669 0.48 67.669
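If you don't have ShiftingValues.txt, a hypothetical stand-in built from the first rows of the question's table is enough to exercise the functions; Change is entered as numbers directly here, so the two string-parsing lines can be skipped:
import pandas as pd

# hypothetical replacement for pd.read_csv('ShiftingValues.txt', sep=' ', header=0)
df = pd.DataFrame({
    'Week':   [1, 2, 3, 4, 5, 6, 7, 8, 9],
    'X':      [96.07, 95.835, 95.402, 94.914, 94.28, 93.042, 91.891, 90.074, 90.541],
    'Change': [None, -0.24, -0.45, -0.51, -0.67, -1.31, -1.24, -1.98, 0.52],
})
df['Y'] = df['X']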
