python pandas dataframe Melt multiindex multi-levels - python

I have a DF with the following structure:
| Level | Rate |
Indicator | AAA | BBB | CCC | XXX | YYY |
location variable |
One 2017 0.69 0.22 0.71 0.02 0.98
2018 0.31 0.15 0.78 0.03 0.96
2019 0.55 0.19 0.82 0.04 0.83
Two 2017 0.31 0.33 0.93 0.11 0.21
2018 0.24 0.35 0.01 0.12 0.14
2019 0.16 0.25 0.12 0.14 0.17
Three 2017 0.58 0.11 0.55 0.21 0.27
2018 0.75 0.10 0.68 0.22 0.25
2019 0.42 0.08 0.71 0.23 0.41
I need to get a DF with the following structure (with only one column level):
location | variable | Indicator | Level | Rate |
------------------------------------------------
One | 2017 | AAA | 0.69 | NaN |
...
Three | 2019 | YYY | NaN | 0.41 |
I've made several attempts like this below but they don't work:
df.melt(col_level=0, id_vars = ['Location','Indicator','variable'] , value_vars = ['Level', 'Rate'])
Any help would be highly appreciated

Use DataFrame.stack with DataFrame.rename_axis and DataFrame.reset_index:
df = df.stack().rename_axis(('location','variable','indicator')).reset_index()
print (df.head(10))
location variable indicator Level Rate
0 One 2017 AAA 0.69 NaN
1 One 2017 BBB 0.22 NaN
2 One 2017 CCC 0.71 NaN
3 One 2017 XXX NaN 0.02
4 One 2017 YYY NaN 0.98
5 One 2018 AAA 0.31 NaN
6 One 2018 BBB 0.15 NaN
7 One 2018 CCC 0.78 NaN
8 One 2018 XXX NaN 0.03
9 One 2018 YYY NaN 0.96
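For anyone who wants to try this without the original data, here is a minimal sketch that rebuilds a two-row slice of the frame and applies the same stack / rename_axis / reset_index chain. The names `wide` and `tidy` are only illustrative, and the construction of the MultiIndex columns is an assumption about how the original frame was built. Recent pandas versions may emit a FutureWarning about the stack implementation; the reshaped values should be the same here.

```python
import pandas as pd

# Two-level columns: the top level holds Level/Rate, the bottom level the indicator.
columns = pd.MultiIndex.from_tuples(
    [("Level", "AAA"), ("Level", "BBB"), ("Level", "CCC"),
     ("Rate", "XXX"), ("Rate", "YYY")],
    names=[None, "Indicator"],
)
index = pd.MultiIndex.from_tuples(
    [("One", 2017), ("One", 2018)], names=["location", "variable"]
)
wide = pd.DataFrame(
    [[0.69, 0.22, 0.71, 0.02, 0.98],
     [0.31, 0.15, 0.78, 0.03, 0.96]],
    index=index,
    columns=columns,
)

# stack() moves the innermost column level (Indicator) into the row index,
# leaving Level and Rate as ordinary columns.
tidy = (
    wide.stack()
    .rename_axis(["location", "variable", "indicator"])
    .reset_index()
)
print(tidy)
```

Each indicator belongs to only one of Level or Rate, so the other column is NaN in that row, which is the shape the question asks for.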

Related

Copy a column to multiple columns of a DataFrame with Pandas

I have a DataFrame with multiple columns, a few columns being NaN. The dataframe is quite big having around 5,000 columns. Below is a sample from it:
GeoCode ESP FIN USA EZ19 PRT
1 Geography Spain Finland USA EZ Portugal
2 31-Mar-15 NaN NaN 0.26 0.89 NaN
3 30-Jun-15 NaN NaN NaN 0.90 NaN
4 30-Sep-15 NaN NaN 0.31 0.90 NaN
5 31-Dec-15 NaN NaN 0.41 0.91 NaN
I want to copy the value of column 'EZ19' to all columns where all values for row 2 and below are NaN. I tried the following code and it works:
nan_cols = df.columns[df.loc[2:].isnull().all()].to_list()
for c in nan_cols:
    df.loc[2:, c] = df.loc[2:, 'EZ19']
But I was thinking there should be a way to assign the values of column 'EZ19' to the target columns without using a loop, and I am surprised that there doesn't seem to be a straightforward way to do this. Other questions here don't seem to handle the exact issue I have, and I couldn't find a solution that worked for me.
Given the size of my dataframe (and it is expected to grow larger over time) I really want to avoid using a loop in my final code, so any help with this will be greatly appreciated.
If you're interested in replacing values of columns that contain only nulls, you can take a shortcut and simply overwrite all values from row 2 onwards after identifying which columns are entirely null.
# Identify columns that contain only null values from row 2 onwards
all_null_cols = df.loc[2:].isnull().all()
# overwrite row 2 onwards in only our null columns with values from "EZ19"
df.loc[2:, all_null_cols] = df.loc[2:, ["EZ19"]].values
print(df)
GeoCode ESP FIN USA EZ19 PRT
1 Geography Spain Finland USA EZ Portugal
2 31-Mar-15 0.89 0.89 0.26 0.89 0.89
3 30-Jun-15 0.90 0.90 NaN 0.90 0.90
4 30-Sep-15 0.90 0.90 0.31 0.90 0.90
5 31-Dec-15 0.91 0.91 0.41 0.91 0.91
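Here is a self-contained sketch of the same loop-free idea on the sample data. It is a variant, not the exact code above: the EZ19 values are repeated explicitly so the assigned array has the same shape as the target block, and the frame is built with dtype=object (an assumption about the real data, where the first row holds text labels).

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    {
        "GeoCode": ["Geography", "31-Mar-15", "30-Jun-15", "30-Sep-15", "31-Dec-15"],
        "ESP": ["Spain", nan, nan, nan, nan],
        "FIN": ["Finland", nan, nan, nan, nan],
        "USA": ["USA", 0.26, nan, 0.31, 0.41],
        "EZ19": ["EZ", 0.89, 0.90, 0.90, 0.91],
        "PRT": ["Portugal", nan, nan, nan, nan],
    },
    index=range(1, 6),
    dtype=object,  # mixed text and numbers, as in the sample
)

# Boolean Series: True for columns that are entirely null from row 2 onwards.
all_null = df.loc[2:].isnull().all()
target_cols = all_null.index[all_null].tolist()

# Repeat the EZ19 column once per target column so the shapes match exactly.
source = df.loc[2:, ["EZ19"]].to_numpy()
df.loc[2:, target_cols] = np.repeat(source, len(target_cols), axis=1)
print(df)
```

USA is left alone because it is only partially null from row 2 onwards.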
Not sure if this is what you have in mind:
outcome = df.loc[2:, df.loc[2:].isna().all()].mask(
    lambda df: df.isna(), df.loc[2:, "EZ19"], axis=0
)
outcome
ESP FIN PRT
2 0.89 0.89 0.89
3 0.90 0.90 0.90
4 0.90 0.90 0.90
5 0.91 0.91 0.91
df.update(outcome)
df
GeoCode ESP FIN USA EZ19 PRT
1 Geography Spain Finland USA EZ Portugal
2 31-Mar-15 0.89 0.89 0.26 0.89 0.89
3 30-Jun-15 0.90 0.90 NaN 0.90 0.90
4 30-Sep-15 0.90 0.90 0.31 0.90 0.90
5 31-Dec-15 0.91 0.91 0.41 0.91 0.91
It only fills columns that are completely null from row 2 downwards; USA is not completely null from row 2, which is why it was not altered.
A simple one-liner that replaces every empty value in a row with that row's EZ19 value (unlike the answers above, this also fills partially null columns such as USA):
df = df.apply(lambda row: row.where(pd.notnull(row), row.EZ19), axis=1)
Output:
GeoCode ESP FIN USA EZ19 PRT
0 Geography Spain Finland USA EZ Portugal
1 31-Mar-15 0.89 0.89 0.26 0.89 0.89
2 30-Jun-15 0.90 0.90 0.90 0.90 0.90
3 30-Sep-15 0.90 0.90 0.31 0.90 0.90
4 31-Dec-15 0.91 0.91 0.41 0.91 0.91

Python Dataframe Get Value of Last Non Null Column for Each Row

I have a dataframe such as the following:
ID 2016 2017 2018 2019 2020
0 1 1.64 NaN NaN NaN NaN
1 2 NaN NaN NaN 0.78 NaN
2 3 1.11 0.97 1.73 1.23 0.87
3 4 0.84 0.74 1.64 1.47 0.41
4 5 0.75 1.05 NaN NaN NaN
I want to get the values from the last non-null column such that:
ID 2016 2017 2018 2019 2020 LastValue
0 1 1.64 NaN NaN NaN NaN 1.64
1 2 NaN NaN NaN 0.78 NaN 0.78
2 3 1.11 0.97 1.73 1.23 0.87 0.87
3 4 0.84 0.74 1.64 1.47 0.41 0.41
4 5 0.75 1.05 NaN NaN NaN 1.05
I tried to loop through the year columns in reverse as follows but couldn't fully achieve what I want.
for i in reversed(df.columns[1:]):
    if df[i] is not None:
        val = df[i]
Could you help with this issue? Thanks.
The idea is to select all columns except the first with DataFrame.iloc, forward fill the missing values along each row, and then select the last column:
df['LastValue'] = df.iloc[:, 1:].ffill(axis=1).iloc[:, -1]
print (df)
ID 2016 2017 2018 2019 2020 LastValue
0 1 1.64 NaN NaN NaN NaN 1.64
1 2 NaN NaN NaN 0.78 NaN 0.78
2 3 1.11 0.97 1.73 1.23 0.87 0.87
3 4 0.84 0.74 1.64 1.47 0.41 0.41
4 5 0.75 1.05 NaN NaN NaN 1.05
Detail:
print (df.iloc[:, 1:].ffill(axis=1))
2016 2017 2018 2019 2020
0 1.64 1.64 1.64 1.64 1.64
1 NaN NaN NaN 0.78 0.78
2 1.11 0.97 1.73 1.23 0.87
3 0.84 0.74 1.64 1.47 0.41
4 0.75 1.05 1.05 1.05 1.05
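A runnable version of the above on the sample data from the question, as a sketch (the year column labels are assumed to be strings here; the approach does not depend on that):

```python
import numpy as np
import pandas as pd

nan = np.nan
df = pd.DataFrame(
    {
        "ID": [1, 2, 3, 4, 5],
        "2016": [1.64, nan, 1.11, 0.84, 0.75],
        "2017": [nan, nan, 0.97, 0.74, 1.05],
        "2018": [nan, nan, 1.73, 1.64, nan],
        "2019": [nan, 0.78, 1.23, 1.47, nan],
        "2020": [nan, nan, 0.87, 0.41, nan],
    }
)

# Forward fill along each row (axis=1) so the last column carries the
# last non-null value of that row, then pick that last column.
df["LastValue"] = df.iloc[:, 1:].ffill(axis=1).iloc[:, -1]
print(df)
```

A row with no values at all would simply get NaN in LastValue.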

Merging two pandas data frame with common columns

I have a lower triangular matrix and then I transpose it and I have the transpose of it.
I am trying to merge them together
lower triangular:
Data :
0 1 2 3
0 1 0 0 0
1 0.21 0 0 0
2 0.31 0.32 0 0
3 0.41 0.42 0.43 0
4 0.51 0.52 0.53 0.54
transpose triangular:
Data :
0 1 2 3
0 1 0.21 0.31 0.41
1 0 0 0.32 0.52
2 0 0 0 0.53
3 0 0 0 0.54
4 0 0 0 0
Merged matrix:
Data :
0 1 2 3 4
0 1 0.21 0.31 0.41 0.51
1 0.21 0 0.32 0.42 0.52
2 0.31 0.32 0 0.43 0.53
3 0.41 0.42 0.43 0 0.54
4 0.51 0.52 0.53 0.54 0
I tried using pd.merge but I couldn't get it to work
Let us use combine_first after mask:
df.mask(df==0).T.combine_first(df).fillna(0)
Out[1202]:
0 1 2 3 4
0 1.00 0.21 0.31 0.41 0.51
1 0.21 0.00 0.32 0.42 0.52
2 0.31 0.32 0.00 0.43 0.53
3 0.41 0.42 0.43 0.00 0.54
4 0.51 0.52 0.53 0.54 0.00
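To see why this works, here is a small sketch using the lower triangular data from the question (the name `merged` is just for illustration). The zeros are masked to NaN, the frame is transposed, and combine_first fills the remaining holes from the original frame; the union of both indexes and both column sets gives the 5x5 shape.

```python
import pandas as pd

# 5x4 lower triangular frame from the question.
df = pd.DataFrame(
    [
        [1.00, 0.00, 0.00, 0.00],
        [0.21, 0.00, 0.00, 0.00],
        [0.31, 0.32, 0.00, 0.00],
        [0.41, 0.42, 0.43, 0.00],
        [0.51, 0.52, 0.53, 0.54],
    ]
)

# mask(df == 0) turns zeros into NaN so they do not override real values;
# combine_first keeps the transposed values and falls back to df elsewhere.
merged = df.mask(df == 0).T.combine_first(df).fillna(0)
print(merged)
```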
How about just adding the two dataframes?
df3 = df1.add(df2, fill_value=0)

splitting a dataframe into chunks and naming each new chunk into a dataframe

Is there a good way to split a dataframe into chunks and automatically name each chunk as its own dataframe?
For example, dfmaster has 1000 records. Split by 200 and create df1, df2, ..., df5.
Any guidance would be much appreciated.
I've looked on other boards and found no guidance on a function that can automatically create new dataframes.
Use numpy for splitting:
See example below:
In [2095]: df
Out[2095]:
0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.00 0.0 0.00 0.0 0.94 0.00 0.00 0.63 0.00
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.00 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN
In [2096]: np.split(df, 2)
Out[2096]:
[ 0 1 2 3 4 5 6 7 8 9 10
0 0.25 0.00 0.0 0.0 0.0 0.0 0.94 0.0 0.0 0.63 0.0
1 0.51 0.51 NaN NaN NaN NaN NaN NaN NaN NaN NaN,
0 1 2 3 4 5 6 7 8 9 10
2 0.54 0.54 0.00 0.0 0.63 0.0 0.51 0.54 0.51 1.0 0.51
3 0.81 0.05 0.13 0.7 0.02 NaN NaN NaN NaN NaN NaN]
df gets split into 2 dataframes having 2 rows each.
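np.split takes the number of sections, so for 1000 records in chunks of 200 you can do np.split(dfmaster, 5); np.split(df, 500) would instead give 500 chunks of 2 rows each.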
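As an alternative sketch that does not rely on numpy at all, the chunks can be cut with plain iloc slicing and kept in a dict keyed by name, which is usually easier to work with than creating separate df1, df2, ... variables. The frame contents, the `chunk_size` variable and the `chunks` dict are made up for illustration.

```python
import pandas as pd

# Stand-in for the 1000-record frame from the question.
dfmaster = pd.DataFrame({"value": range(1000)})
chunk_size = 200

# One entry per chunk: "df1" holds rows 0-199, "df2" rows 200-399, and so on.
chunks = {
    f"df{i + 1}": dfmaster.iloc[start:start + chunk_size]
    for i, start in enumerate(range(0, len(dfmaster), chunk_size))
}

print(list(chunks))
print(chunks["df2"].head())
```

If the row count is not a multiple of the chunk size, the last chunk is simply shorter.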
I find these ideas helpful:
solution via list:
https://stackoverflow.com/a/49563326/10396469
solution using numpy.split:
https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.split.html
just use df = df.values first to convert from dataframe to numpy.array.

pandas why does int64 - float64 column subtraction yield NaN's

I am confused by the results of pandas subtraction of two columns. When I subtract two float64 and int64 columns it yields several NaN entries. Why is this happening? What could be the cause of this strange behavior?
Final Update: As N.Wouda pointed out, my problem was that the index columns did not match.
Y_predd.reset_index(drop=True,inplace=True)
Y_train_2.reset_index(drop=True,inplace=True)
solved my problem
Update 2: It seems like my index columns don't match, which makes sense because they are both sampled from the same data frame. How can I "start fresh" with new index columns?
Update: Y_predd- Y_train_2.astype('float64') also yields NaN values. I am confused why this did not raise an error. They are the same size. Why could this be yielding NaN?
In [48]: Y_predd.size
Out[48]: 182527
In [49]: Y_train_2.astype('float64').size
Out[49]: 182527
Original documentation of error:
In [38]: Y_train_2
Out[38]:
66419 0
2319 0
114195 0
217532 0
131687 0
144024 0
94055 0
143479 0
143124 0
49910 0
109278 0
215905 1
127311 0
150365 0
117866 0
28702 0
168111 0
64625 0
207180 0
14555 0
179268 0
22021 1
120169 0
218769 0
259754 0
188296 1
63503 1
175104 0
218261 0
35453 0
..
112048 0
97294 0
68569 0
60333 0
184119 1
57632 0
153729 1
155353 0
114979 1
180634 0
42842 0
99979 0
243728 0
203679 0
244381 0
55646 0
35557 0
148977 0
164008 0
53227 1
219863 0
4625 0
155759 0
232463 0
167807 0
123638 0
230463 1
198219 0
128459 1
53911 0
Name: objective_for_classifier, dtype: int64
In [39]: Y_predd
Out[39]:
0 0.00
1 0.48
2 0.04
3 0.00
4 0.48
5 0.58
6 0.00
7 0.00
8 0.02
9 0.06
10 0.22
11 0.32
12 0.12
13 0.26
14 0.18
15 0.18
16 0.28
17 0.30
18 0.52
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 0.64
26 0.30
27 0.76
28 0.10
29 0.42
...
182497 0.60
182498 0.00
182499 0.06
182500 0.12
182501 0.00
182502 0.40
182503 0.70
182504 0.42
182505 0.54
182506 0.24
182507 0.56
182508 0.34
182509 0.10
182510 0.18
182511 0.06
182512 0.12
182513 0.00
182514 0.22
182515 0.08
182516 0.22
182517 0.00
182518 0.42
182519 0.02
182520 0.50
182521 0.00
182522 0.08
182523 0.16
182524 0.00
182525 0.32
182526 0.06
Name: prediction_method_used, dtype: float64
In [40]: Y_predd - Y_train_2
Out[41]:
0 NaN
1 NaN
2 0.04
3 NaN
4 0.48
5 NaN
6 0.00
7 0.00
8 NaN
9 NaN
10 NaN
11 0.32
12 -0.88
13 -0.74
14 0.18
15 NaN
16 NaN
17 NaN
18 NaN
19 0.32
20 0.38
21 0.00
22 0.02
23 0.00
24 0.22
25 NaN
26 0.30
27 NaN
28 0.10
29 0.42
...
260705 NaN
260706 NaN
260709 NaN
260710 NaN
260711 NaN
260713 NaN
260715 NaN
260716 NaN
260718 NaN
260721 NaN
260722 NaN
260723 NaN
260724 NaN
260725 NaN
260726 NaN
260727 NaN
260731 NaN
260735 NaN
260737 NaN
260738 NaN
260739 NaN
260740 NaN
260742 NaN
260743 NaN
260745 NaN
260748 NaN
260749 NaN
260750 NaN
260751 NaN
260752 NaN
dtype: float64
Posting here so we can close the question, from the comments:
Are you sure each dataframe has the same index range?
You can reset the indices on both frames by df.reset_index(drop=True) and then subtract the frames as you were already doing. This process should result in the desired output.
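A tiny sketch of what is going on, with made-up values: pandas aligns the two Series on their index labels before subtracting, so any label present in only one of them produces NaN, regardless of dtype or size.

```python
import pandas as pd

# Predictions carry a fresh 0..n-1 index; the targets keep the labels
# they had in the frame they were sampled from.
y_pred = pd.Series([0.00, 0.48, 0.04], index=[0, 1, 2])
y_true = pd.Series([0, 1, 0], index=[66419, 2, 0])

# Subtraction aligns on labels: only labels 0 and 2 exist in both Series.
misaligned = y_pred - y_true
print(misaligned)

# Dropping the old labels makes the subtraction positional.
aligned = y_pred.reset_index(drop=True) - y_true.reset_index(drop=True)
print(aligned)
```

This also explains why the result in the question has index labels far beyond the length of either input: it is indexed by the union of both sets of labels.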
