Conditionally replacing values in one DataFrame column with values from another column - python

Below is a sample DataFrame with column Y already present, but I want to calculate Y from column X in this way:
If X declines for 3 or more consecutive weeks and the cumulative decline over that run is more than 2%, then Y for this week and the preceding weeks in the run (up to 12 weeks back) should be set to the last X value in that run of declining X values.
Week X %Change Y
w1 96.07 NA 88.478
w2 95.835 -0.24% 88.478
w3 95.402 -0.45% 88.478
w4 94.914 -0.51% 88.478
w5 94.28 -0.67% 88.478
w6 93.042 -1.31% 88.206
w7 91.891 -1.24% 87.993
w8 90.074 -1.98% 87.189
w9 90.541 0.52% 86.637
w10 90.13 -0.45% 86.304
w11 88.635 -1.66% 86.304
w12 88.478 -0.18% 86.304
w13 88.486 0.01% 86.304
w14 87.798 -0.78% 86.304
w15 88.23 0.49% 86.304
w16 88.395 0.19% 90
w17 88.206 -0.21% 87.842
w18 87.993 -0.24% 86.301
w19 87.189 -0.91% 85.133
w20 86.637 -0.63% 83.567
w21 86.304 -0.38% 81.418
w22 86.539 0.27% 80.193
w23 88.411 2.16% 80.193
w24 89.475 1.20% 79.62
w25 90.229 0.84% 79.191
w26 90.581 0.39% 77.519
w27 90 -0.64% 77.513
w28 87.842 -2.40% 77.513
w29 86.301 -1.75% 76.651
w30 85.133 -1.35% 75.48
w31 83.567 -1.84% 74.813
w32 81.418 -2.57% 74.512
w33 80.193 -1.50% 73.479
w34 80.28 0.11% 72.895
w35 79.62 -0.82% 71.888
w36 79.191 -0.54% 71.24
w37 77.519 -2.11% 70.064
w38 77.513 -0.01% 69.456
w39 77.57 0.07% 67.542
w40 76.651 -1.18% 66.687
w41 75.48 -1.53% 65.568
w42 74.813 -0.88% 64.483
w43 74.512 -0.40% 63.60
w44 73.479 -1.39% 62.979
w45 72.895 -0.79% 62.829
w46 71.888 -1.38% 62.39
w47 71.24 -0.90% 61.819
w48 70.064 -1.65% 61.819
w49 69.456 -0.87% 61.819
w50 67.542 -2.76% 61.819
w51 66.687 -1.27% 61.819
w52 65.568 -1.68% 61.819
w53 64.483 -1.65% 61.819
w54 63.604 -1.36% 61.819
w55 62.979 -0.98% 61.819
w56 62.829 -0.24% 61.819
w57 62.39 -0.70% 61.819
w58 61.819 -0.92% 61.819
w59 61.83 0.02% 61.83
w60 62.796 1.56% 62.796
w61 63.52 1.15% 63.52
w62 65.132 2.54% 65.132
w63 66.148 1.56% 66.148
w64 66.698 0.83% 66.698
w65 67.324 0.94% 67.324
w66 68.418 1.62% 68.418
w67 68.432 0.02% 68.432
w68 67.818 -0.90% 72.41
w69 69.108 1.90% 72.296
w70 69.911 1.16% 71.682
w71 70.484 0.82% 71.411
w72 71.479 1.41% 70.835
w73 72.155 0.95% 69.561
w74 73.549 1.93% 68.628
w75 73.452 -0.13% 67.344
w76 73.928 0.65% 67.344
w77 72.832 -1.48% 67.344
w78 72.934 0.14% 67.344
w79 72.41 -0.72% 67.344
w80 72.296 -0.16% 67.344
w81 71.682 -0.85% 67.344
w82 71.411 -0.38% 67.344
w83 70.835 -0.81% 67.344
w84 69.561 -1.80% 67.344
w85 68.628 -1.34% 67.344
w86 67.344 -1.87% 67.344
w87 67.669 0.48% 67.669

Based on our discussion in the comments, I hope this does what you need:
import pandas as pd

def find_nY(i):
    """For index number i, find the number n of Y values to be replaced."""
    if df.Change[i] >= 0:
        return 1
    # Walk back to the start of the run of consecutive declines ending at i
    j = i
    while j >= 1 and df.Change[j - 1] < 0:
        j -= 1
    # Replace only if the run spans at least 3 weeks and the cumulative
    # decline is at least 2%; cap the replacement window at 12 weeks
    if i - j >= 2 and sum(df.Change[j:i + 1]) <= -2:
        n = min(i - j + 1, 12)
    else:
        n = 1
    return n

def replace_Y(i):
    """Replace Y values with X for a run of decreases ending at i."""
    n = find_nY(i)
    df.loc[i - n + 1:i, 'Y'] = [df.X[i]] * n

df = pd.read_csv('ShiftingValues.txt', sep=' ', header=0)
df['Week'] = df['Week'].str.strip('w').astype(int)
df['Change'] = df['Change'].astype(str).str.strip('%').astype(float)
df['Y'] = df['X']

# Apply the replacement at the end of each run of declines
for i in df.index[2:df.index[-1]]:
    if df.Change[i + 1] >= 0:
        replace_Y(i)
replace_Y(df.index[-1])
print(df.to_string())
Week X Change Y
0 1 96.070 NaN 96.070
1 2 95.835 -0.24 90.074
2 3 95.402 -0.45 90.074
3 4 94.914 -0.51 90.074
4 5 94.280 -0.67 90.074
5 6 93.042 -1.31 90.074
6 7 91.891 -1.24 90.074
7 8 90.074 -1.98 90.074
8 9 90.541 0.52 90.541
9 10 90.130 -0.45 88.478
10 11 88.635 -1.66 88.478
11 12 88.478 -0.18 88.478
12 13 88.486 0.01 88.486
13 14 87.798 -0.78 87.798
14 15 88.230 0.49 88.230
15 16 88.395 0.19 88.395
16 17 88.206 -0.21 86.304
17 18 87.993 -0.24 86.304
18 19 87.189 -0.91 86.304
19 20 86.637 -0.63 86.304
20 21 86.304 -0.38 86.304
21 22 86.539 0.27 86.539
22 23 88.411 2.16 88.411
23 24 89.475 1.20 89.475
24 25 90.229 0.84 90.229
25 26 90.581 0.39 90.581
26 27 90.000 -0.64 80.193
27 28 87.842 -2.40 80.193
28 29 86.301 -1.75 80.193
29 30 85.133 -1.35 80.193
30 31 83.567 -1.84 80.193
31 32 81.418 -2.57 80.193
32 33 80.193 -1.50 80.193
33 34 80.280 0.11 80.280
34 35 79.620 -0.82 77.513
35 36 79.191 -0.54 77.513
36 37 77.519 -2.11 77.513
37 38 77.513 -0.01 77.513
38 39 77.570 0.07 77.570
39 40 76.651 -1.18 76.651
40 41 75.480 -1.53 75.480
41 42 74.813 -0.88 74.813
42 43 74.512 -0.40 74.512
43 44 73.479 -1.39 73.479
44 45 72.895 -0.79 72.895
45 46 71.888 -1.38 71.888
46 47 71.240 -0.90 61.819
47 48 70.064 -1.65 61.819
48 49 69.456 -0.87 61.819
49 50 67.542 -2.76 61.819
50 51 66.687 -1.27 61.819
51 52 65.568 -1.68 61.819
52 53 64.483 -1.65 61.819
53 54 63.604 -1.36 61.819
54 55 62.979 -0.98 61.819
55 56 62.829 -0.24 61.819
56 57 62.390 -0.70 61.819
57 58 61.819 -0.92 61.819
58 59 61.830 0.02 61.830
59 60 62.796 1.56 62.796
60 61 63.520 1.15 63.520
61 62 65.132 2.54 65.132
62 63 66.148 1.56 66.148
63 64 66.698 0.83 66.698
64 65 67.324 0.94 67.324
65 66 68.418 1.62 68.418
66 67 68.432 0.02 68.432
67 68 67.818 -0.90 67.818
68 69 69.108 1.90 69.108
69 70 69.911 1.16 69.911
70 71 70.484 0.82 70.484
71 72 71.479 1.41 71.479
72 73 72.155 0.95 72.155
73 74 73.549 1.93 73.549
74 75 73.452 -0.13 73.452
75 76 73.928 0.65 73.928
76 77 72.832 -1.48 72.832
77 78 72.934 0.14 72.934
78 79 72.410 -0.72 67.344
79 80 72.296 -0.16 67.344
80 81 71.682 -0.85 67.344
81 82 71.411 -0.38 67.344
82 83 70.835 -0.81 67.344
83 84 69.561 -1.80 67.344
84 85 68.628 -1.34 67.344
85 86 67.344 -1.87 67.344
86 87 67.669 0.48 67.669

Related

Python Pandas Conditional changes to column filling/filtering correctly

I am trying to change the data of a column based on a condition. However, it doesn't seem to pass through the condition correctly and fills every value in the column with the change when it shouldn't. Here is the code:
uh['Age']= uh['Age']
uh['AgeStatus'] = uh['Age']
uh['AgeStatus'] = uh.loc[uh['AgeStatus'] > 25.0, 'AgeStatus'] = 'Veteran'
and it returns the Type Error:
TypeError: '>' not supported between instances of 'str' and 'float'
and the dataframe:
Year Age Tm Lg G PA ... BB SO BA OBP SLG AgeStatus
5 2021 28.0 CHW AL 88 391 ... 18 87 0.299 0.332 0.437 Veteran
2 2021 23.0 TOR AL 101 443 ... 29 90 0.296 0.348 0.487 Veteran
8 2021 28.0 BOS AL 97 409 ... 37 75 0.309 0.374 0.522 Veteran
6 2021 26.0 HOU AL 96 416 ... 53 80 0.272 0.368 0.476 Veteran
5 2021 27.0 ATL NL 105 431 ... 30 116 0.249 0.305 0.475 Veteran
2 2021 22.0 SDP NL 87 362 ... 43 102 0.292 0.373 0.651 Veteran
6 2021 28.0 WSN NL 96 420 ... 26 77 0.322 0.369 0.521 Veteran
[7 rows x 21 columns]
Really confused on what's causing this.
The problem is the chained assignment on the last line: Python assigns to the targets left to right, so uh['AgeStatus'] = 'Veteran' runs first and fills the whole column with the string, and the subsequent comparison of those strings against 25.0 raises the TypeError. Assign through a boolean mask instead:
uh.loc[uh['Age'] > 25.0, 'AgeStatus'] = 'Veteran'
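A minimal runnable sketch of the same pattern, using a made-up frame in place of the question's `uh` (only the Age column is assumed):

```python
import pandas as pd

# Hypothetical stand-in for the question's `uh` frame
uh = pd.DataFrame({"Age": [28.0, 23.0, 26.0]})

# Create the column with a default, then overwrite only the masked rows
uh["AgeStatus"] = ""
uh.loc[uh["Age"] > 25.0, "AgeStatus"] = "Veteran"
print(uh)
```

Rows that fail the condition keep the default value instead of being overwritten.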

Merge dataframes and merge also columns into a single column

I have a dataframe df1
index A B C D E
0 0 92 84
1 1 98 49
2 2 49 68
3 3 0 58
4 4 91 95
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
and also this data frame df2
index C D E F
0 0 27 95 51 45
1 1 99 33 92 67
2 2 68 37 29 65
3 3 99 25 48 40
4 4 33 74 55 66
5 13 65 76 19 62
I wish to get to the following outcome when merging df1 and df2
index A B C D E F
0 0 92 84 27 95 51 45
1 1 98 49 99 33 92 67
2 2 49 68 68 37 29 65
3 3 0 58 99 25 48 40
4 4 91 95 33 74 55 66
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
11 13 65 76 19 62
However, I keep getting this when using pd.merge():
df_total=df1.merge(df2,how="outer",on="index",suffixes=(None,"_"))
df_total.replace(to_replace=np.nan,value=" ", inplace=True)
df_total
index A B C D E C_ D_ E_ F
0 0 92 84 27 95 51 45
1 1 98 49 99 33 92 67
2 2 49 68 68 37 29 65
3 3 0 58 99 25 48 40
4 4 91 95 33 74 55 66
5 5 47 56 52 25 58
6 6 86 71 34 39 40
7 7 80 78 0 86 12
8 8 0 8 30 88 42
9 9 69 83 7 65 60
10 10 93 39 10 90 45
11 13 65 76 19 62
Is there a way to get the desired outcome using pd.merge or a similar function?
Thanks
You can use .combine_first():
import numpy as np

# convert the empty cells ("") to NaNs
df1 = df1.replace("", np.nan)
df2 = df2.replace("", np.nan)
# set the "index" column as the index and combine the dataframes
df1 = df1.set_index("index")
print(df1.combine_first(df2.set_index("index")).reset_index().fillna(""))
Prints:
index A B C D E F
0 0 92.0 84.0 27.0 95.0 51.0 45.0
1 1 98.0 49.0 99.0 33.0 92.0 67.0
2 2 49.0 68.0 68.0 37.0 29.0 65.0
3 3 0.0 58.0 99.0 25.0 48.0 40.0
4 4 91.0 95.0 33.0 74.0 55.0 66.0
5 5 47.0 56.0 52.0 25.0 58.0
6 6 86.0 71.0 34.0 39.0 40.0
7 7 80.0 78.0 0.0 86.0 12.0
8 8 0.0 8.0 30.0 88.0 42.0
9 9 69.0 83.0 7.0 65.0 60.0
10 10 93.0 39.0 10.0 90.0 45.0
11 13 65.0 76.0 19.0 62.0
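To see why .combine_first() produces the desired shape, here is a self-contained sketch with small made-up frames (the column names mirror the question; the values are illustrative only):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for df1 and df2: they share column C and the "index" key
df1 = pd.DataFrame({"index": [0, 1, 2],
                    "A": [92, 98, 49],
                    "C": [np.nan, np.nan, 68.0]})
df2 = pd.DataFrame({"index": [0, 1, 5],
                    "C": [27, 99, 65],
                    "F": [45, 67, 62]})

# combine_first keeps df1's non-NaN values and fills the gaps (both
# individual cells and whole rows) from df2, aligning on the shared index
out = df1.set_index("index").combine_first(df2.set_index("index")).reset_index()
print(out)
```

The shared column C ends up as a single column: df1's value wins where present, df2's fills the rest, and df2's extra row (index 5) is appended rather than suffixed.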

for loop has saved a list as a single element

I have the following code to extract data from a table, but because of the second for loop it saves all the data of a column as a single element of the array.
Is there a way to separate each element in the array below? Link for stat_table:
for table in stat_table:
    for cell in table.find_all('table'):
        stmat.append(cell.text)
        print(cell.text)
        count = count + 1
print(count)
print(stmat)
print(stmat[0])
This is the output, where all the data from the second loop is saved as a single element:
[' Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 ', ' Max Avg Min 82 73.6 70 82 72.9 0 81 74.2 70 84 76.4
70 86 75.3 68 82 74.6 68 82 74.6 68 82 74.1 68 81 73.9 68 82 75.4 68 84
75.4 68 81 73.9 68 82 75.0 68 79 72.8 68 81 73.6 68 81 73.5 68 82 74.2 68
82 74.9 68 82 73.6 68 79 71.9 66 82 72.7 66 81 71.3 63 82 74.1 63 82 75.0
64 86 76.4 68 84 75.7 68 82 75.4 68 84 75.5 66 84 74.0 66 86 76.7 66 ',
' Max Avg Min 68 66.6 66 68 64.8 0 66 65.2 64 66 65.9 64 68 65.8 64 66 65.3
64 66 64.7 64 68 66.3 64 70 67.1 64 68 65.9 63 70 66.4 64 68 67.2 66 68
66.4 64 68 66.0 64 70 67.4 66 70 67.0 66 68 65.5 64 66 65.4 64 70 67.1 64
70 67.1 66 68 65.6 64 66 61.6 59 66 60.3 55 64 60.0 50 66 62.7 59 68 64.8
63 68 63.8 61 66 63.9 61 68 64.3 63 68 64.8 61 ', ' Max Avg Min 94 80.1 58
88 75.1 0 88 75.3 58 88 71.4 51 94 74.0 48 94 74.8 54 94 73.4 54 94 78.4
54 100 80.7 58 100 73.9 51 100 76.7 51 100 81.0 61 94 76.0 58 94 80.3 65
94 82.5 61 94 81.4 61 94 76.8 54 94 74.4 54 100 82.0 58 100 86.1 65 100
80.4 54 100 73.1 48 94 64.6 39 100 62.2 32 88 64.3 40 94 70.4 48 94 69.2
48 94 68.8 45 88 73.4 48 94 68.9 43 ', ' Max Avg Min 23 15.9 10 22 15.7 10
26 15.2 8 20 13.6 8 21 13.6 8 21 13.2 8 22 14.8 9 20 12.2 7 15 10.4 3
14 8.8 0 16 10.2 5 14 8.7 1 16 10.9 6 17 12.1 7 17 11.1 6 16 11.2 5 18
11.2 5 17 12.4 8 15 10.1 5 15 9.2 3 17 11.6 7 15 9.3 3 12 6.1 0 12 5.2
0 10 6.1 0 10 5.8 0 9 4.8 0 10 5.2 0 10 4.5 0 14 4.7 0 ', ' Max Avg Min
26.8 26.7 26.6 26.8 26.1 0.0 26.8 26.8 26.7 26.8 26.8 26.7 26.8 26.8 26.7 26.9
26.8 26.7 26.8 26.8 26.7 26.8 26.8 26.7 26.9 26.8 26.8 26.9 26.8 26.7 26.8 26.8
26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.7 26.7 26.8 26.8 26.7
26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.9 26.8 26.9 26.9 26.8 26.9 26.8 26.8 26.9
26.8 26.8 26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.9 26.8 26.9 26.9 26.8 26.9 26.8
26.8 26.9 26.8 26.8 26.9 26.8 26.8 26.9 26.8 26.8 ', ' Total 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 ']
This is the output of stmat[0], whereas I want stmat[0] to be just "Sep":
Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
Given the outputs you show, I'm guessing that
cell.text == "Sep 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
25 26 27 28 29 30 ', ' Max Avg Min 82 73.6 70 82 72.9 0 81 74.2 70 84 76.4
70 86 75.3 68 82 74.6 68 82 74.6 68 82 74.1 68 81 73.9 68 82 75.4 68 84
75.4 68 81 73.9 68 82 75.0 68 79 72.8 68 81 73.6 68 81 73.5 68 82 74.2 68
82 74.9 68 82 73.6 68 79 71.9 66 82 72.7 66 81 71.3 63 82 74.1 63 82 75.0
64 86 76.4 68 84 75.7 68 82 75.4 68 84 75.5 66 84 74.0 66 86 76.7 66"
So if you actually want individual values, you should probably do something like:
for table in stat_table:
    for cell in table.find_all('table'):
        # split() with no argument splits on any run of whitespace and
        # drops the empty strings that split(" ") would produce
        cell_values = cell.text.split()
        stmat.extend(cell_values)
        count = count + len(cell_values)
print(count)
print(stmat)
print(stmat[0])

How can I read each row up to the first occurrence of NaN?

From an Excel file I want to read each row and process it independently.
Here is how the data looks in the Excel file:
12 32 45 67 89 54 23 56 78 98
34 76 34 89 34 3
76 34 54 12 43 78 56
76 56 45 23 43 45 67 76 67 8
87 9 9 0 89 90 6 89
23 90 90 32 23 34 56 9 56 87
23 56 34 3 5 8 7 6 98
32 23 34 6 65 78 67 87 89 87
12 23 34 32 43 67 45
343 76 56 7 8 9 4
But when I read it through pandas, the remaining columns are filled with NaN.
The data after reading with pandas looks like:
0 12 32 45 67 89 54 23.0 56.0 78.0 98.0
1 34 76 34 89 34 3 NaN NaN NaN NaN
2 76 34 54 12 43 78 56.0 NaN NaN NaN
3 76 56 45 23 43 45 67.0 76.0 67.0 8.0
4 87 9 9 0 89 90 6.0 89.0 NaN NaN
5 23 90 90 32 23 34 56.0 9.0 56.0 87.0
6 23 56 34 3 5 8 7.0 6.0 98.0 NaN
7 32 23 34 6 65 78 67.0 87.0 89.0 87.0
8 12 23 34 32 43 67 45.0 NaN NaN NaN
9 343 76 56 7 8 9 4.0 5.0 8.0 68.0
Here it can be seen that the remaining columns of each row are filled with NaN, which I don't want.
I don't want to replace them with some other value, nor drop the whole rows containing NaN.
How can I read the columns of each row up to the first occurrence of NaN?
For example, the second row in pandas is 34 76 34 89 34 3 NaN NaN NaN NaN,
so my desired output would be that it reads only 34 76 34 89 34 3.
My preference is pandas, but if that's not possible, is there any other way of doing it, with some other library?
Any resource or reference would be helpful.
Thanks
While calling the pd.read_excel function, try setting keep_default_na=False. This avoids the default NaN values while reading: empty cells come back as empty strings instead.
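If the frame has already been read with NaNs (as in the printed output above), another option is to truncate each row with dropna. A sketch with made-up data standing in for the Excel sheet; note that dropna removes every NaN cell, which matches the requirement here because the NaNs only appear at the end of a row:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the result of pd.read_excel(...)
df = pd.DataFrame([[34, 76, 34, 89, 34, 3, np.nan, np.nan, np.nan, np.nan],
                   [76, 34, 54, 12, 43, 78, 56, np.nan, np.nan, np.nan]])

# One plain Python list per row, with the trailing NaNs dropped
rows = [row.dropna().tolist() for _, row in df.iterrows()]
print(rows[0])
```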

Grouping pandas dataframe based on common key

I have a file which I have parsed as a pandas DataFrame, and I want to group the values in column 2 by their corresponding element in column 3.
0 1 2 3 4
0 00B2 0 -67 39 1.13
1 00B2 85 -72 39 1.13
2 00B2 1 -67 86 1.13
3 00B2 2 -67 87 1.13
4 00B2 3 -67 88 1.13
5 00B2 91 -67 39 1.13
6 00B2 4 -67 246 1.13
7 00B2 5 -67 78 1.13
8 00B2 6 -67 10 1.13
9 00B2 7 -67 153 1.13
10 00B2 1 -67 38 1.13
11 00B2 8 -67 225 1.13
12 00B2 9 -67 135 1.13
13 00B2 10 -67 23 1.13
14 00B2 4 -67 38 1.13
15 00B2 11 -67 132 1.13
16 00B2 12 -71 214 1.13
17 00B2 13 -71 71 1.13
18 00B2 14 -71 215 1.13
19 00B2 8 -71 38 1.13
20 00B2 15 -71 249 1.13
21 00B2 16 -71 174 1.13
22 00B2 17 -71 196 1.13
23 00B2 18 -71 38 1.13
24 00B2 19 -71 252 1.13
25 00B2 20 -71 196 1.13
26 00B2 21 -71 39 1.13
27 00B2 22 -71 39 1.13
28 00B2 23 -71 252 1.13
29 00B2 24 -71 39 1.13
.. ... .. ... ... ...
I want the data that looks something like this
DF1:
-67 37
-72 37
-71 37
... ...
DF2:
-68 38
-67 38
-70 38
... ...
DF3:
-64 39
-63 39
-62 39
... ...
I have tried the following:
e1 = pd.DataFrame(e1)
print (e1)
group = e1[3][2] == "group"
print (e1[group])
This gets nowhere close to what I want, so how can I group the data according to my requirement?
I think you need to create a dictionary of Series by converting the groupby object to tuples and then a dict:
d = dict(tuple(df.groupby(3)[2]))
print (d[39])
0 -67
1 -72
5 -67
26 -71
27 -71
29 -71
Name: 2, dtype: int64
For a dictionary of DataFrames:
d1 = dict(tuple(df.groupby(3)))
print (d1[39])
0 1 2 3 4
0 00B2 0 -67 39 1.13
1 00B2 85 -72 39 1.13
5 00B2 91 -67 39 1.13
26 00B2 21 -71 39 1.13
27 00B2 22 -71 39 1.13
29 00B2 24 -71 39 1.13
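The dict-of-groups pattern can be checked with a tiny made-up frame (integer column labels as in the question; the values are illustrative only):

```python
import pandas as pd

# Toy frame: column 2 holds the values, column 3 holds the grouping key
df = pd.DataFrame({2: [-67, -72, -67, -71],
                   3: [39, 39, 86, 39]})

# Iterating a groupby yields (key, group) tuples, so dict(...) maps
# each key in column 3 to the matching Series of column-2 values
d = dict(tuple(df.groupby(3)[2]))
print(d[39].tolist())
```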
