Shift only selected rows in Pandas - python

I would like to shift only specific rows in my DataFrame by 1 period on the columns axis.
Df
Out:
Month Year_2005 Year_2006 Year_2007
0 01 NaN 31 35
1 02 NaN 40 45
2 03 NaN 87 46
3 04 NaN 55 41
4 05 NaN 36 28
5 06 31 21 NaN
6 07 29 27 NaN
To have something like this:
Df
Out:
Month Year_2005 Year_2006 Year_2007
0 01 NaN 31 35
1 02 NaN 40 45
2 03 NaN 87 46
3 04 NaN 55 41
4 05 NaN 36 28
5 06 NaN 31 21
6 07 NaN 29 27
My code so far:
rows_to_shift = Df[Df['Year_2005'].notnull()].index
Df.iloc[rows_to_shift, 1] = Df.iloc[rows_to_shift,2].shift(1)

Try:
df = df.set_index("Month")
df[df["Year_2005"].notnull()] = df[df["Year_2005"].notnull()].shift(axis=1)
>>> df
Year_2005 Year_2006 Year_2007
Month
1 NaN 31.0 35.0
2 NaN 40.0 45.0
3 NaN 87.0 46.0
4 NaN 55.0 41.0
5 NaN 36.0 28.0
6 NaN 31.0 21.0
7 NaN 29.0 27.0

You can try:
df1 = df.set_index('Month')
df1 = df1.apply(lambda x: pd.Series(sorted(x, key=pd.notna), index=x.index), axis=1)
df = df1.reset_index()
Result:
Month Year_2005 Year_2006 Year_2007
0 1 NaN 31.0 35.0
1 2 NaN 40.0 45.0
2 3 NaN 87.0 46.0
3 4 NaN 55.0 41.0
4 5 NaN 36.0 28.0
5 6 NaN 31.0 21.0
6 7 NaN 29.0 27.0

Related

Pandas merge 2 dataframes

I am trying merge 2 dataframes.
df1
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
df2
Date B
07.01.2021 14
08.01.2021 27
09.01.2021 28
10.01.2021 29
11.01.2021 30
12.01.2021 31
13.01.2021 32
Both dataframes have one same row (although there could be several overlappings).
So I want to get df3 that looks as follows:
df3
Date A B C
01.01.2021 1 8 14
02.01.2021 2 9 15
03.01.2021 3 10 16
04.01.2021 4 11 17
05.01.2021 5 12 18
06.01.2021 6 13 19
07.01.2021 7 14 20
08.01.2021 Nan 27 Nan
09.01.2021 Nan 28 Nan
10.01.2021 Nan 29 Nan
11.01.2021 Nan 30 Nan
12.01.2021 Nan 31 Nan
13.01.2021 Nan 32 Nan
I've tried
df3=df1.merge(df2, on='Date', how='outer') but it gives extra A,B,C columns. Could you give some idea how to get df3?
Thanks a lot.
merge outer without specifying on (default on is the intersection of columns between the two DataFrames in this case ['Date', 'B']):
df3 = df1.merge(df2, how='outer')
df3:
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
7 08.01.2021 NaN 27 NaN
8 09.01.2021 NaN 28 NaN
9 10.01.2021 NaN 29 NaN
10 11.01.2021 NaN 30 NaN
11 12.01.2021 NaN 31 NaN
12 13.01.2021 NaN 32 NaN
Assuming you always want to keep the first full version, you can concat the df2 on the end of df1 and drop duplicates on the Date column.
pd.concat([df1,df2]).drop_duplicates(subset='Date')
Output
Date A B C
0 01.01.2021 1.0 8 14.0
1 02.01.2021 2.0 9 15.0
2 03.01.2021 3.0 10 16.0
3 04.01.2021 4.0 11 17.0
4 05.01.2021 5.0 12 18.0
5 06.01.2021 6.0 13 19.0
6 07.01.2021 7.0 14 20.0
1 08.01.2021 NaN 27 NaN
2 09.01.2021 NaN 28 NaN
3 10.01.2021 NaN 29 NaN
4 11.01.2021 NaN 30 NaN
5 12.01.2021 NaN 31 NaN
6 13.01.2021 NaN 32 NaN

Python add columns to DataFame with rolling window based on 3 previous rows

I have a dataframe like this:
original = pd.DataFrame(np.random.randint(0,100,size=(10, 3)), columns=["P1_day", "P1_week", "P1_month"])
print(original)
P1_day P1_week P1_month
0 50 17 55
1 45 3 10
2 93 79 84
3 99 38 33
4 44 35 35
5 25 43 87
6 38 88 56
7 20 66 6
8 4 23 6
9 39 75 3
I need to generate new dataframe starting from 3rd row of original dataframe and add new 9 columns based on rolling window defined as 3 previous rows with corresponding prefixes: [_0,_1, _2]. So, It's rows with index [0,1,2] from original dataframe .
For example, the next 3 columns will be from the original.iloc[0],
and after the next 3 columns will be from the original.iloc[1],
and the last 3 columns will be from the original.iloc[2]
I tried to solve it by the next code:
subset_shifted = original[["P1_day", "P1_week", "P1_month"]].shift(3)
subset_shifted.columns = ["P1_day_0", "P1_week_0", "P1_month_0"]
original_ = pd.concat([original, subset_shifted], axis = 1)
print(original_)
In result, I Have 3 additional columns with value from the previous 0 row:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 50 17 55 NaN NaN NaN
1 45 3 10 NaN NaN NaN
2 93 79 84 NaN NaN NaN
3 99 38 33 50.0 17.0 55.0
4 44 35 35 45.0 3.0 10.0
5 25 43 87 93.0 79.0 84.0
6 38 88 56 99.0 38.0 33.0
7 20 66 6 44.0 35.0 35.0
8 4 23 6 25.0 43.0 87.0
9 39 75 3 38.0 88.0 56.0
In the next iteration I did shift(2) with the same approach and received the columns from the original.iloc[1].
On the last iteration I did shift(1) and got expected result in view of:
result = original_.iloc[3:]
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 99 38 33 50.0 17.0 55.0 45.0 3.0 10.0 93.0 79.0 84.0
4 44 35 35 45.0 3.0 10.0 93.0 79.0 84.0 99.0 38.0 33.0
5 25 43 87 93.0 79.0 84.0 99.0 38.0 33.0 44.0 35.0 35.0
6 38 88 56 99.0 38.0 33.0 44.0 35.0 35.0 25.0 43.0 87.0
7 20 66 6 44.0 35.0 35.0 25.0 43.0 87.0 38.0 88.0 56.0
8 4 23 6 25.0 43.0 87.0 38.0 88.0 56.0 20.0 66.0 6.0
9 39 75 3 38.0 88.0 56.0 20.0 66.0 6.0 4.0 23.0 6.0
Question:
Is there any way to solve this task with better approach as I described? Thanks.
Unless you want all these extra DataFrames, you could just add the new columns to your orignal df directly:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
original[
["P1_day_0", "P1_week_0", "P1_month_0"]
] = original[
["P1_day", "P1_week", "P1_month"]
].shift(3)
print(original)
output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0
0 2 35 26 NaN NaN NaN
1 99 4 96 NaN NaN NaN
2 4 67 6 NaN NaN NaN
3 76 33 31 2.0 35.0 26.0
4 84 60 98 99.0 4.0 96.0
5 57 1 58 4.0 67.0 6.0
6 35 70 96 76.0 33.0 31.0
7 81 32 39 84.0 60.0 98.0
8 25 4 38 57.0 1.0 58.0
9 83 4 60 35.0 70.0 96.0
python tutor link to example
Edit: OP asked the follow up question:
yes, for the first row it makes sense. But, my task is to add first 3 rows with index 0-1-2 as new 9 columns for the respected rows started from 3rd index. In your output row with index 1st is not added to the 3rd row as 3 columns. In my code that's why I used shift(2) and shift(1) iteratively.
Here is how this could be done iteratively:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
for shift, n in ((3,0),(2,1),(1,2)):
original[
[f"P1_day_{n}", f"P1_week_{n}", f"P1_month_{n}"]
] = original[
["P1_day", "P1_week", "P1_month"]
].shift(shift)
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_0 P1_week_0 P1_month_0 P1_day_1 \
3 58 43 74 26.0 56.0 82.0 56.0
4 44 27 40 56.0 87.0 38.0 31.0
5 2 90 4 31.0 32.0 87.0 58.0
6 90 70 6 58.0 43.0 74.0 44.0
7 1 31 57 44.0 27.0 40.0 2.0
8 96 22 69 2.0 90.0 4.0 90.0
9 13 98 47 90.0 70.0 6.0 1.0
P1_week_1 P1_month_1 P1_day_2 P1_week_2 P1_month_2
3 87.0 38.0 31.0 32.0 87.0
4 32.0 87.0 58.0 43.0 74.0
5 43.0 74.0 44.0 27.0 40.0
6 27.0 40.0 2.0 90.0 4.0
7 90.0 4.0 90.0 70.0 6.0
8 70.0 6.0 1.0 31.0 57.0
9 31.0 57.0 96.0 22.0 69.0
python tutor link
Edit 2: Not to make any assumptions here, but if your end goal is to get something like the 4 period moving average from the data in all of these new columns then you might not need them at all. You can use pandas.DataFrame.rolling instead:
import pandas as pd
import numpy as np
original = pd.DataFrame(
np.random.randint(0,100,size=(10, 3)),
columns=["P1_day", "P1_week", "P1_month"],
)
original[
["P1_day_4PMA", "P1_week_4PMA", "P1_month_4PMA"]
] = original[
["P1_day", "P1_week", "P1_month"]
].rolling(4).mean()
pd.set_option('display.max_columns', None)
print(original.iloc[3:])
Output:
P1_day P1_week P1_month P1_day_4PMA P1_week_4PMA P1_month_4PMA
3 1 13 48 31.25 38.00 55.00
4 10 4 40 22.00 21.00 45.75
5 7 76 0 5.50 23.75 37.00
6 5 69 9 5.75 40.50 24.25
7 63 31 82 21.25 45.00 32.75
8 26 67 22 25.25 60.75 28.25
9 89 41 40 45.75 52.00 38.25
another python tutor link

An optimal way of getting a correct reshape of that dataframe

I've the following dataframe:
df =
c f V E
0 M 5 32 22
1 M 7 45 40
2 R 7 42 36
3 R 9 41 38
4 R 3 28 24
And I want a result like this, in which the values ​​of column 'f' are my new columns, and my new indexes are a combination of column 'c' and the rest of columns in the dataframe (the order of rows doesn't matter):
df_result =
3 5 7 9
V(M) NaN 32 45 NaN
E(M) NaN 22 40 NaN
V(R) 28 NaN 42 41
E(R) 24 NaN 36 38
Currently, my code is:
df_result = pd.concat([df.pivot('c','f',col).rename(index = {e: col + '(' + e + ')' for e in df.pivot('c','f',col).index}) for col in [e for e in df.columns if e not in ['c','f']]])
With that code I'm getting:
df_result =
f 3 5 7 9
c
E(M) NaN 22 40 NaN
E(R) 24 NaN 36 38
V(M) NaN 32 45 NaN
V(R) 28 NaN 42 41
I think it's a valid result, however, I don't know if there is a way to get exactly my desire result or, at least, a better way to get what I am already getting.
Thanks you very much in advance.
To get the table, this is .melt + .pivot_table
df_result = df.melt(['f', 'c']).pivot_table(index=['variable', 'c'], columns='f')
Then we can clean up the naming:
df_result = df_result.rename_axis([None, None], 1)
df_result.columns = [y for _,y in df_result.columns]
df_result.index = [f'{x}({y})' for x,y in df_result.index]
# Python 2.: ['{0}({1})'.format(*x) for x in df_result.index]
Output:
3 5 7 9
E(M) NaN 22.0 40.0 NaN
E(R) 24.0 NaN 36.0 38.0
V(M) NaN 32.0 45.0 NaN
V(R) 28.0 NaN 42.0 41.0
You might consider keeping the MultiIndex instead of flattening to new strings, as it can be simpler for certain aggregations.
Check with pivot_table
s=pd.pivot_table(df,index='c',columns='f',values=['V','E']).stack(level=0).sort_index(level=1)
s.index=s.index.map('{0[1]}({0[0]})'.format)
s
Out[95]:
f 3 5 7 9
E(M) NaN 22.0 40.0 NaN
E(R) 24.0 NaN 36.0 38.0
V(M) NaN 32.0 45.0 NaN
V(R) 28.0 NaN 42.0 41.0

Cumsum of fixed amount of elements without loops

I have Python which counts cumsum of 14 elements in column, starting from different elements and writes this sum in other column. Does anyone knows the way how to do it without loops?
import pandas as pd
import numpy as np
a = pd.DataFrame({"A": [i for i in range(25)]})
b = pd.DataFrame({"B": [np.nan for i in range(25)]})
for i in range(4, len(b)):
cumsum = 0
for k in range(i - 4, i):
cumsum += a.A[k]
b.B[k] = cumsum
pd.concat([a,b], axis=1)
IIUC you are looking for rolling(4) + sum():
In [83]: a['new'] = a.A.rolling(4).sum()
In [84]: a
Out[84]:
A new
0 0 NaN
1 1 NaN
2 2 NaN
3 3 6.0
4 4 10.0
5 5 14.0
6 6 18.0
7 7 22.0
8 8 26.0
9 9 30.0
10 10 34.0
11 11 38.0
12 12 42.0
13 13 46.0
14 14 50.0
15 15 54.0
16 16 58.0
17 17 62.0
18 18 66.0
19 19 70.0
20 20 74.0
21 21 78.0
22 22 82.0
23 23 86.0
24 24 90.0
check:
In [86]: pd.concat([a,b], axis=1)
Out[86]:
A new B
0 0 NaN NaN
1 1 NaN NaN
2 2 NaN NaN
3 3 6.0 6.0
4 4 10.0 10.0
5 5 14.0 14.0
6 6 18.0 18.0
7 7 22.0 22.0
8 8 26.0 26.0
9 9 30.0 30.0
10 10 34.0 34.0
11 11 38.0 38.0
12 12 42.0 42.0
13 13 46.0 46.0
14 14 50.0 50.0
15 15 54.0 54.0
16 16 58.0 58.0
17 17 62.0 62.0
18 18 66.0 66.0
19 19 70.0 70.0
20 20 74.0 74.0
21 21 78.0 78.0
22 22 82.0 82.0
23 23 86.0 86.0
24 24 90.0 NaN

Pandas Sort Multiindex by Group Sum

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'County':['A','B','C','D','A','B','C','D','A','B','C','D','A','B','C','D','A','B'],
'Hospital':['a','b','c','d','e','a','b','c','e','a','b','c','d','e','a','b','c','e'],
'Enrollment':[44,55,42,57,95,54,27,55,81,54,65,23,89,76,34,12,1,67],
'Year':['2012','2012','2012','2012','2012','2012','2012','2012','2012','2013',
'2013','2013','2013','2013','2013','2013','2013','2013']})
d2=pd.pivot_table(df,index=['County','Hospital'],columns=['Year'])#.sort_columns
d2
Enrollment
Year 2012 2013
County Hospital
A a 44.0 NaN
c NaN 1.0
d NaN 89.0
e 88.0 NaN
B a 54.0 54.0
b 55.0 NaN
e NaN 71.5
C a NaN 34.0
b 27.0 65.0
c 42.0 NaN
D b NaN 12.0
c 55.0 23.0
d 57.0 NaN
I need to sort the data frame such that County is sorted descendingly by the sum of Enrollment for the most recent year (I want to avoid using '2013' directly) like this:
Enrollment
Year 2012 2013
County Hospital
B a 54 54
b 55 NaN
e NaN 71.5
C a NaN 34
b 27 65
c 42 NaN
A a 44 NaN
c NaN 1
d NaN 89
e 88 NaN
D b NaN 12
c 55 23
d 57 NaN
Then, I'd like each hospital sorted within each county, descendingly, but 2013 enrollments like this:
Enrollment
Year 2012 2013
County Hospital
B e NaN 71.5
a 54 54
b 55 NaN
C b 27 65
a NaN 34
c 42 NaN
A d NaN 89
c NaN 1
a 44 NaN
e 88 NaN
D c 55 23
b NaN 12
d 57 NaN
So far, I've tried using groupby to get the sums and merge the back but have not had any luck:
d2.groupby('County').sum()
Thanks in advance!
You could:
max_col = max(d2.columns.get_level_values(1)) # get column 2013
d2['sum'] = d2.groupby(level='County').transform('sum').loc[:, ('Enrollment', max_col)]
d2 = d2.sort_values(['sum', ('Enrollment', max_col)], ascending=[False, False])
to get:
Enrollment sum
Year 2012 2013
County Hospital
B e NaN 71.5 125.5
a 54.0 54.0 125.5
b 55.0 NaN 125.5
C b 27.0 65.0 99.0
a NaN 34.0 99.0
c 42.0 NaN 99.0
A d NaN 89.0 90.0
c NaN 1.0 90.0
a 44.0 NaN 90.0
e 88.0 NaN 90.0
D c 55.0 23.0 35.0
b NaN 12.0 35.0
d 57.0 NaN 35.0

Categories