How to calculate correlation on a specific dataset


My dataset:
Rate Class
11.2 A
21.4 A
0.11 B
51.6 B
43.7 C
90.8 C
8.3 D
14.4 D
I want to know the correlation with rate for each class.
   A     B     C     D
11.2   NaN   NaN   NaN
21.4   NaN   NaN   NaN
 NaN  0.11   NaN   NaN
 NaN  51.6   NaN   NaN
 NaN   NaN  43.7   NaN
 NaN   NaN  90.8   NaN
 NaN   NaN   NaN   8.3
 NaN   NaN   NaN  14.4
I reshaped the dataset as shown above, but I cannot compute the correlation because of the NaN values.
Does anyone know how to find the correlation in this case?
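A possible starting point, offered as a sketch rather than a definitive answer: Pearson correlation needs paired observations, and in the wide layout above every row holds exactly one non-NaN value, so pandas has no pairs to work with. The example below assumes (this is not stated in the question) that observations can be paired by their position within each class. The helper column `obs` is hypothetical and only exists to line the classes up side by side.

```python
import pandas as pd

# Data copied from the question
df = pd.DataFrame({
    "Rate": [11.2, 21.4, 0.11, 51.6, 43.7, 90.8, 8.3, 14.4],
    "Class": ["A", "A", "B", "B", "C", "C", "D", "D"],
})

# Assumption: the n-th observation of one class pairs with the n-th of another
df["obs"] = df.groupby("Class").cumcount()

# One column per class, one row per within-class position, no NaN padding
wide = df.pivot(index="obs", columns="Class", values="Rate")

# Pairwise Pearson correlation between the class columns
corr = wide.corr()
print(wide)
print(corr)
```

With only two observations per class every pairwise correlation is trivially +1 or -1, so a meaningful result needs more rows per class.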

Related

Row by row mapping keys of dictionary of dataframes to new dictionary of dataframes

I have two dictionaries of data frames LP3 and ExeedenceDict. The ExeedenceDict is a dictionary of 4 dataframes with keys 'two','ten','twentyfive','onehundred'. The LP3 dictionary has keys 'LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston'
Edit: I am not sure of the most concise way to title this question, but I think the title suits what I am asking.
There is a column in each dataframe within the ExeedenceDict that has row values equal to the keys in the LP3 dictionary.
Below is the code I used to create the 'blank' dataframes in ExeedenceDict, followed by the dataframe for the key 'two':
import pandas as pd

# LP_names is the list of LP3 keys defined further below
cols = ['Location','Size','Annual Exceedence', 'With Reg Skew','Without Reg Skew','5% Lower','95% Upper']
ExeedenceDF = []
for _ in range(4):  # one blank dataframe per key of ExeedenceDict
    blank = pd.DataFrame(columns=cols)
    blank['Location'] = LP_names
    blank['Size'] = [39.8,24,34,29.7,21.2,53.7,61.7,27.6,31.6]
    ExeedenceDF.append(blank)
ExeedenceDict = {'two':ExeedenceDF[0], 'ten':ExeedenceDF[1], 'twentyfive':ExeedenceDF[2], 'onehundred':ExeedenceDF[3]}
          Location  Size  Annual Exceedence  With Reg Skew  Without Reg Skew  5% Lower  95% Upper
0    LP_DevilMalad  39.8                NaN            NaN               NaN       NaN        NaN
1   LP_Bloomington  24.0                NaN            NaN               NaN       NaN        NaN
2    LP_DevilEvans  34.0                NaN            NaN               NaN       NaN        NaN
3          LP_Deep  29.7                NaN            NaN               NaN       NaN        NaN
4         LP_Maple  21.2                NaN            NaN               NaN       NaN        NaN
5      LP_CubMaple  53.7                NaN            NaN               NaN       NaN        NaN
6    LP_Cottonwood  61.7                NaN            NaN               NaN       NaN        NaN
7          LP_Mill  27.6                NaN            NaN               NaN       NaN        NaN
8  LP_CubNrPreston  31.6                NaN            NaN               NaN       NaN        NaN
Below is the code that builds the LP3 dictionary (the data was read in from 10 Excel spreadsheets), followed by the dataframe for the key LP_DevilMalad:
LP_names = ['LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston']
# LP_Data is the list of dataframes read from the spreadsheets
for i, df in enumerate(LP_Data):
    df = df.dropna().copy()
    # invert the values (e.g. 0.5 -> 2, 0.01 -> 100)
    df['Annual Exceedence'] = 1 / df['Annual Exceedence']
    # keep only the rows of interest
    LP_Data[i] = df.loc[df['Annual Exceedence'].isin([2, 10, 25, 100])]
LP3 = dict(zip(LP_names, LP_Data))
'LP_DevilMalad':
    Annual Exceedence  With Reg Skew  Without Reg Skew  Log Variance of Est  5% Lower  95% Upper
 6                2.0           21.4              22.4               0.0091      14.1       31.2
 9               10.0           46.5              44.7               0.0119      32.1       85.7
10               25.0           60.2              54.6               0.0166      40.6      136.2
12              100.0           81.4              67.4               0.0270      51.3      250.6
I am having trouble matching the keys of LP3 to the Location column in the ExeedenceDict dataframes and copying the corresponding column values across. The goal is a script that does all of this iteratively, ideally with some sort of dictionary comprehension.
The caveat is that the 'two' dataframe takes only the row with index 6 from each LP3 dataframe, 'ten' takes index 9, 'twentyfive' takes index 10, and 'onehundred' takes index 12.
Based on the two dataframes above, the goal dataframe for the key 'two' in ExeedenceDict would look something like the one below. The remaining rows would be filled with the index-6 values from the other dataframes in the LP3 dictionary.
          Location  Size  Annual Exceedence  With Reg Skew  Without Reg Skew  5% Lower  95% Upper
0    LP_DevilMalad  39.8                  2           21.4              22.4      14.1       31.2
1   LP_Bloomington  24.0                NaN            NaN               NaN       NaN        NaN
2    LP_DevilEvans  34.0                NaN            NaN               NaN       NaN        NaN
3          LP_Deep  29.7                NaN            NaN               NaN       NaN        NaN
4         LP_Maple  21.2                NaN            NaN               NaN       NaN        NaN
5      LP_CubMaple  53.7                NaN            NaN               NaN       NaN        NaN
6    LP_Cottonwood  61.7                NaN            NaN               NaN       NaN        NaN
7          LP_Mill  27.6                NaN            NaN               NaN       NaN        NaN
8  LP_CubNrPreston  31.6                NaN            NaN               NaN       NaN        NaN

I can't test it without a reproducible example, but I would do something along these lines:
index_map = {
    "two": 6,
    "ten": 9,
    "twentyfive": 10,
    "onehundred": 12,
}
col_of_interest = ["Annual Exceedence", "With Reg Skew", "Without Reg Skew", "5% Lower", "95% Upper"]
for index_key, df in ExeedenceDict.items():
    lp_index = index_map[index_key]
    for lp_val in df['Location'].values:
        # copy the row labelled lp_index from the matching LP3 dataframe
        df.loc[df['Location'] == lp_val, col_of_interest] = LP3[lp_val].loc[lp_index, col_of_interest].values
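For reference, here is a small self-contained variant of the same idea that builds each target dataframe in one step instead of filling it cell by cell. It is only a sketch on toy data: the LP_DevilMalad rows come from the question, while the LP_Bloomington numbers and the `sizes` series are made up for illustration, and only two of the four keys are shown.

```python
import pandas as pd

cols = ["Annual Exceedence", "With Reg Skew", "Without Reg Skew", "5% Lower", "95% Upper"]

# Toy LP3: first frame uses values from the question, second is hypothetical
LP3 = {
    "LP_DevilMalad": pd.DataFrame(
        [[2.0, 21.4, 22.4, 14.1, 31.2], [10.0, 46.5, 44.7, 32.1, 85.7]],
        index=[6, 9], columns=cols),
    "LP_Bloomington": pd.DataFrame(
        [[2.0, 1.0, 2.0, 3.0, 4.0], [10.0, 5.0, 6.0, 7.0, 8.0]],
        index=[6, 9], columns=cols),
}
sizes = pd.Series({"LP_DevilMalad": 39.8, "LP_Bloomington": 24.0}, name="Size")
index_map = {"two": 6, "ten": 9}

ExeedenceDict = {}
for key, lp_index in index_map.items():
    # One row per location: the row labelled lp_index from each LP3 frame
    rows = pd.DataFrame(
        {name: frame.loc[lp_index, cols] for name, frame in LP3.items()}
    ).T
    # Attach the sizes by location name and turn the names into a column
    ExeedenceDict[key] = (
        sizes.to_frame().join(rows).rename_axis("Location").reset_index()
    )

print(ExeedenceDict["two"])
```

Joining on the location name avoids relying on the row order of the two dictionaries matching.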

Pandas row filtering reproduces entire table and turns data into NaNs [duplicate]

This question already has answers here:
How to change column names in pandas Dataframe using a list of names?
(5 answers)
Closed 3 years ago.
I have a Pandas dataframe with several columns. I want to create a new dataframe which contains all the rows in the original dataframe for which the boolean value "Present" is True.
Normally the way you are supposed to do this is by calling grades[grades['Present']], but I get the following unexpected result:
It reproduces the entire dataframe, except changes the True values in the "Present" column to 1s (the False ones become NaNs).
Any idea why this might be happening?
Here is my full script:
import pandas as pd
# read CSV and clean up data
grades = pd.read_csv("2학기 speaking test grades - 2·3학년.csv")
grades = grades[["Year","Present?","내용 / 30","유찬성 / 40","태도 / 30"]]
grades.columns = [["Year","Present","Content","Fluency","Attitude"]]
# Change integer Present to a boolean
grades['Present']=grades['Present']==1
print(grades.head())
print(grades.dtypes)
print(grades[grades['Present']])
And terminal output:
Year Present Content Fluency Attitude
0 2 True 30.0 40.0 30.0
1 2 True 30.0 40.0 30.0
2 2 True 30.0 40.0 30.0
3 2 True 30.0 40.0 30.0
4 2 True 30.0 40.0 30.0
Year int64
Present bool
Content float64
Fluency float64
Attitude float64
dtype: object
Year Present Content Fluency Attitude
0 NaN 1.0 NaN NaN NaN
1 NaN 1.0 NaN NaN NaN
2 NaN 1.0 NaN NaN NaN
3 NaN 1.0 NaN NaN NaN
4 NaN 1.0 NaN NaN NaN
5 NaN 1.0 NaN NaN NaN
6 NaN 1.0 NaN NaN NaN
7 NaN 1.0 NaN NaN NaN
8 NaN 1.0 NaN NaN NaN
9 NaN 1.0 NaN NaN NaN
10 NaN 1.0 NaN NaN NaN
11 NaN 1.0 NaN NaN NaN
12 NaN 1.0 NaN NaN NaN
13 NaN 1.0 NaN NaN NaN
14 NaN 1.0 NaN NaN NaN
15 NaN 1.0 NaN NaN NaN
16 NaN 1.0 NaN NaN NaN
17 NaN 1.0 NaN NaN NaN
18 NaN 1.0 NaN NaN NaN
19 NaN 1.0 NaN NaN NaN
20 NaN 1.0 NaN NaN NaN
21 NaN 1.0 NaN NaN NaN
22 NaN 1.0 NaN NaN NaN
23 NaN 1.0 NaN NaN NaN
24 NaN 1.0 NaN NaN NaN
25 NaN 1.0 NaN NaN NaN
26 NaN 1.0 NaN NaN NaN
27 NaN 1.0 NaN NaN NaN
28 NaN 1.0 NaN NaN NaN
29 NaN 1.0 NaN NaN NaN
.. ... ... ... ... ...
91 NaN NaN NaN NaN NaN
92 NaN NaN NaN NaN NaN
93 NaN 1.0 NaN NaN NaN
94 NaN 1.0 NaN NaN NaN
95 NaN NaN NaN NaN NaN
96 NaN 1.0 NaN NaN NaN
97 NaN 1.0 NaN NaN NaN
98 NaN 1.0 NaN NaN NaN
99 NaN 1.0 NaN NaN NaN
100 NaN 1.0 NaN NaN NaN
101 NaN 1.0 NaN NaN NaN
102 NaN 1.0 NaN NaN NaN
103 NaN 1.0 NaN NaN NaN
104 NaN 1.0 NaN NaN NaN
105 NaN 1.0 NaN NaN NaN
106 NaN 1.0 NaN NaN NaN
107 NaN 1.0 NaN NaN NaN
108 NaN 1.0 NaN NaN NaN
109 NaN 1.0 NaN NaN NaN
110 NaN 1.0 NaN NaN NaN
111 NaN 1.0 NaN NaN NaN
112 NaN 1.0 NaN NaN NaN
113 NaN 1.0 NaN NaN NaN
114 NaN 1.0 NaN NaN NaN
115 NaN 1.0 NaN NaN NaN
116 NaN 1.0 NaN NaN NaN
117 NaN 1.0 NaN NaN NaN
118 NaN 1.0 NaN NaN NaN
119 NaN 1.0 NaN NaN NaN
120 NaN 1.0 NaN NaN NaN
[121 rows x 5 columns]
Here is the CSV file. SE won't let me upload it directly, so if you paste it into your own CSV file you'll need to modify the Python code above to specify that it's in the EUC-KR encoding, like so: pd.read_csv("paste.csv",encoding="EUC-KR")
Year,Class,Year / class * presence (used to filter for averages),Present?,내용 / 30,유찬성 / 40,태도 / 30,Total,,,Averages (평균점),,
2,2,22,1,30,40,30,100,,,Grade distribution (점수 막대 그래프),,
2,2,22,1,30,40,30,100,,,The graph below includes the scores of all students in grades 2 and 3. ,,
2,2,22,1,30,40,30,100,,,아래 그래프에는 2·3학년에서 모든 학생의 점수가 정리됩니다.,,
2,2,22,1,30,40,30,100,,,,,
2,2,22,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
2,1,21,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
3,2,32,1,30,40,30,100,,,,,
2,2,22,1,30,30,30,90,,,,,
2,2,22,1,30,30,30,90,,,,,
2,2,22,1,30,30,30,90,,,,,
2,2,22,1,30,30,30,90,,,Average scores (평균점),,
2,2,22,1,30,30,30,90,,,These averages only count students who were present for the test.,,
2,2,22,1,30,30,30,90,,,평균점에는 참석한 학생의 점수만 포함됩니다.,,
2,2,22,1,30,30,30,90,,,,,
2,2,22,1,30,30,30,90,,,2학년 1반,,77.1
2,1,21,1,30,30,30,90,,,2학년 2반,,77.6
2,1,21,1,30,30,30,90,,,3학년 1반,,71.5
2,1,21,1,30,30,30,90,,,3학년 2반,,77.4
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
2,1,21,1,30,30,30,90,,,,,
3,2,32,1,30,30,30,90,,,,,
3,2,32,1,30,30,30,90,,,,,
3,2,32,1,30,30,30,90,,,,,
3,2,32,1,30,30,30,90,,,,,
3,2,32,1,20,40,30,90,,,,,
2,2,22,1,20,30,30,80,,,,,
2,2,22,1,20,30,30,80,,,,,
2,2,22,1,30,20,30,80,,,,,
2,2,22,1,30,20,30,80,,,,,
2,2,22,1,30,30,20,80,,,,,
2,2,22,1,30,20,30,80,,,,,
2,1,21,1,20,30,30,80,,,,,
2,1,21,1,20,30,30,80,,,,,
2,1,21,1,30,30,20,80,,,,,
3,2,32,1,20,30,30,80,,,,,
3,2,32,1,30,20,30,80,,,,,
3,2,32,1,20,30,30,80,,,,,
3,2,32,1,30,30,20,80,,,,,
3,2,32,1,30,20,30,80,,,,,
2,2,22,1,10,30,30,70,,,,,
2,2,22,1,20,20,30,70,,,,,
2,2,22,1,30,20,20,70,,,,,
2,2,22,1,20,20,30,70,,,,,
2,2,22,1,20,20,30,70,,,,,
3,2,32,1,30,10,30,70,,,,,
3,2,32,1,20,30,20,70,,,,,
3,2,32,1,20,20,30,70,,,,,
2,1,21,1,20,20,20,60,,,,,
2,1,21,1,10,20,30,60,,,,,
2,2,22,1,10,20,20,50,,,,,
2,2,22,1,10,10,30,50,,,,,
2,1,21,1,10,10,30,50,,,,,
2,1,21,1,20,20,10,50,,,,,
3,2,32,1,10,10,30,50,,,,,
3,2,32,1,10,10,30,50,,,,,
2,2,22,1,10,0,30,40,,,,,
2,1,21,1,10,0,30,40,,,,,
3,2,32,1,10,0,30,40,,,,,
3,2,32,1,10,10,20,40,,,,,
2,2,22,1,0,0,30,30,,,,,
2,1,21,1,0,0,30,30,,,,,
2,1,21,1,0,0,30,30,,,,,
3,2,32,1,0,0,30,30,,,,,
3,2,32,1,0,0,20,20,,,,,
2,1,21,1,0,0,10,10,,,,,
2,2,22,1,0,0,30,30,,,,,
2,2,0,0,,,,0,,,,,
2,2,0,0,,,,0,,,,,
2,1,0,0,,,,0,,,,,
2,1,0,0,,,,0,,,,,
2,1,0,0,,,,0,,,,,
2,1,0,0,,,,0,,,,,
3,2,0,0,,,,0,,,,,
3,2,0,0,,,,0,,,,,
3,1,0,0,,,,0,,,,,
3,1,31,1,30,20,30,80,,,,,
3,1,31,1,0,0,30,30,,,,,
3,1,0,0,,,,0,,,,,
3,1,31,1,30,20,10,60,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,20,20,20,60,,,,,
3,1,31,1,30,20,30,80,,,,,
3,1,31,1,30,40,30,100,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,20,30,20,70,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,40,30,100,,,,,
3,1,31,1,30,20,10,60,,,,,
3,1,31,1,20,10,20,50,,,,,
3,1,31,1,30,20,30,80,,,,,
3,1,31,1,0,0,20,20,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,0,0,20,20,,,,,
3,1,31,1,20,10,10,40,,,,,
3,1,31,1,30,30,30,90,,,,,
3,1,31,1,20,20,30,70,,,,,
3,1,31,1,30,20,10,60,,,,,
3,1,31,1,10,10,30,50,,,,,
Thank you.

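The filter itself is fine: grades[grades['Present']] is the usual way to select the rows where Present is True.
The problem is the double brackets in grades.columns = [[...]]. Assigning a nested list makes the columns a MultiIndex, so grades['Present'] returns a DataFrame rather than a Series, and indexing with a boolean DataFrame masks individual cells (turning everything else into NaN) instead of selecting rows.
Assign a flat list of names instead:
grades.columns = ["Year","Present","Content","Fluency","Attitude"]
After that, either of these returns only the rows where Present is True:
grades = grades[grades["Present"]]
grades = grades[grades["Present"] == True]
If the boolean is stored as a string, compare against the string instead:
grades = grades[grades["Present"] == "True"]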
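Going by the script in the question, a likely cause of the NaN output is the nested list assigned to grades.columns, which is also what the linked duplicate about renaming columns points at. The sketch below uses a tiny made-up frame (not the real CSV) to contrast a nested list of names with a flat one. The variable names `raw`, `nested`, `flat` and `present_only` are illustrative only.

```python
import pandas as pd

# Toy stand-in for the grades data (values are made up for illustration)
raw = pd.DataFrame({"Present?": [1, 0, 1], "Total": [100, 0, 90]})

# Nested list, as in the question: pandas builds a MultiIndex for the columns
nested = raw.copy()
nested.columns = [["Present", "Total"]]
print(type(nested.columns).__name__)

# Flat list: ordinary column labels, so the boolean mask is a plain Series
flat = raw.copy()
flat.columns = ["Present", "Total"]
flat["Present"] = flat["Present"] == 1

# Row filtering now keeps only the rows where Present is True
present_only = flat[flat["Present"]]
print(present_only)
```

Checking `type(df.columns)` after renaming is a quick way to spot this kind of accidental MultiIndex.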

Transforming pandas dataframe, where column entries are column headers

My dataset has 12 columns, X1-X6 and Y1-Y6. The variables X and Y match to each other - the first record means: 80 parts of A, 10 parts of C, 2 parts of J and 8 parts of K (each row has 100 total).
I would like to be able to transform my dataset into a dataset in which the entries in columns X1-X6 are now the headers. See before and after datasets below.
My dataset (before):
X1 X2 X3 X4 X5 X6 Y1 Y2 Y3 Y4 Y5 Y6
0 A C J K NaN NaN 80.0 10.0 2.0 8.0 NaN NaN
1 F N O NaN NaN NaN 2.0 25.0 73.0 NaN NaN NaN
2 A H J M NaN NaN 70.0 6.0 15.0 9.0 NaN NaN
3 B I K P NaN NaN 0.5 1.5 2.0 96.0 NaN NaN
4 A B F H O P 83.0 4.0 9.0 2.0 1.0 1.0
5 A B F G NaN NaN 1.0 16.0 9.0 74.0 NaN NaN
6 A B D F L NaN 95.0 2.0 1.0 1.0 1.0 NaN
7 B F H P NaN NaN 0.2 0.4 0.4 99.0 NaN NaN
8 A D F L NaN NaN 35.0 12.0 30.0 23.0 NaN NaN
9 A B F I O NaN 95.0 0.3 0.1 1.6 3.0 NaN
10 B E G NaN NaN NaN 10.0 31.0 59.0 NaN NaN NaN
11 A F G L NaN NaN 24.0 6.0 67.0 3.0 NaN NaN
12 A C I NaN NaN NaN 65.0 30.0 5.0 NaN NaN NaN
13 A F G L NaN NaN 55.0 6.0 4.0 35.0 NaN NaN
14 A F J K L NaN 22.0 3.0 12.0 0.8 62.2 NaN
15 B F I P NaN NaN 0.6 1.2 0.2 98.0 NaN NaN
16 A B F H O NaN 27.0 6.0 46.0 13.0 8.0 NaN
The dataset I'd like to transform to:
A B C D E F G H I J K L M \
0 80.0 NaN 10.0 NaN NaN NaN NaN NaN NaN 2.0 8.0 NaN NaN
1 NaN NaN NaN NaN NaN 2.0 NaN NaN NaN NaN NaN NaN NaN
2 70.0 NaN NaN NaN NaN NaN NaN 6.0 NaN 15.0 NaN NaN 9.0
3 NaN 0.5 NaN NaN NaN NaN NaN NaN 1.5 NaN 2.0 NaN NaN
4 83.0 4.0 NaN NaN NaN 9.0 NaN 2.0 NaN NaN NaN NaN NaN
5 1.0 16.0 NaN NaN NaN 9.0 74.0 NaN NaN NaN NaN NaN NaN
6 95.0 2.0 NaN 1.0 NaN 1.0 NaN NaN NaN NaN NaN 1.0 NaN
7 NaN 0.2 NaN NaN NaN 0.4 NaN 0.4 NaN NaN NaN NaN NaN
8 35.0 NaN NaN 12.0 NaN 30.0 NaN NaN NaN NaN NaN 23.0 NaN
9 95.0 0.3 NaN NaN NaN 0.1 NaN NaN 1.6 NaN NaN NaN NaN
10 NaN 10.0 NaN NaN 31.0 NaN 59.0 NaN NaN NaN NaN NaN NaN
11 24.0 NaN NaN NaN NaN 6.0 67.0 NaN NaN NaN NaN 3.0 NaN
12 65.0 NaN 30.0 NaN NaN NaN NaN NaN 5.0 NaN NaN NaN NaN
13 55.0 NaN NaN NaN NaN 6.0 4.0 NaN NaN NaN NaN 35.0 NaN
14 22.0 NaN NaN NaN NaN 3.0 NaN NaN NaN 12.0 0.8 62.2 NaN
15 NaN 0.6 NaN NaN NaN 1.2 NaN NaN 0.2 NaN NaN NaN NaN
16 27.0 6.0 NaN NaN NaN 46.0 NaN 13.0 NaN NaN NaN NaN NaN
N O P
0 NaN NaN NaN
1 25.0 73.0 NaN
2 NaN NaN NaN
3 NaN NaN 96.0
4 NaN 1.0 1.0
5 NaN NaN NaN
6 NaN NaN NaN
7 NaN NaN 99.0
8 NaN NaN NaN
9 NaN 3.0 NaN
10 NaN NaN NaN
11 NaN NaN NaN
12 NaN NaN NaN
13 NaN NaN NaN
14 NaN NaN NaN
15 NaN NaN 98.0
16 NaN 8.0 NaN

Since the Xi columns hold the column names of the new dataframe and the Yi columns hold the values, it is enough to turn every row into a dict where Xi is the key and Yi the value. The list of those dictionaries is then used to build the new dataframe (this assumes the missing entries are real NaN values rather than the string 'NaN'):
data = list(df.apply(lambda x: {x['X' + str(i)]: x['Y' + str(i)] for i in range(1, 7)
                                if pd.notna(x['X' + str(i)])}, axis=1))
resul = pd.DataFrame(data)
print(resul)
This gives:
A B C D E F ... K L M N O P
0 80.0 NaN 10.0 NaN NaN NaN ... 8.0 NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN 2.0 ... NaN NaN NaN 25.0 73.0 NaN
2 70.0 NaN NaN NaN NaN NaN ... NaN NaN 9.0 NaN NaN NaN
3 NaN 0.5 NaN NaN NaN NaN ... 2.0 NaN NaN NaN NaN 96.0
4 83.0 4.0 NaN NaN NaN 9.0 ... NaN NaN NaN NaN 1.0 1.0
5 1.0 16.0 NaN NaN NaN 9.0 ... NaN NaN NaN NaN NaN NaN
6 95.0 2.0 NaN 1.0 NaN 1.0 ... NaN 1.0 NaN NaN NaN NaN
7 NaN 0.2 NaN NaN NaN 0.4 ... NaN NaN NaN NaN NaN 99.0
8 35.0 NaN NaN 12.0 NaN 30.0 ... NaN 23.0 NaN NaN NaN NaN
9 95.0 0.3 NaN NaN NaN 0.1 ... NaN NaN NaN NaN 3.0 NaN
10 NaN 10.0 NaN NaN 31.0 NaN ... NaN NaN NaN NaN NaN NaN
11 24.0 NaN NaN NaN NaN 6.0 ... NaN 3.0 NaN NaN NaN NaN
12 65.0 NaN 30.0 NaN NaN NaN ... NaN NaN NaN NaN NaN NaN
13 55.0 NaN NaN NaN NaN 6.0 ... NaN 35.0 NaN NaN NaN NaN
14 22.0 NaN NaN NaN NaN 3.0 ... 0.8 62.2 NaN NaN NaN NaN
15 NaN 0.6 NaN NaN NaN 1.2 ... NaN NaN NaN NaN NaN 98.0
16 27.0 6.0 NaN NaN NaN 46.0 ... NaN NaN NaN NaN 8.0 NaN
[17 rows x 16 columns]

One way to handle this is to loop through each row, split it in half using iloc, and build a dictionary from the two halves using zip. The resulting dictionary of dictionaries then becomes the new dataframe:
df_dict = {x: dict(zip(df.iloc[x, 0:6], df.iloc[x, 6:12])) for x in range(df.shape[0])}
df1 = pd.DataFrame.from_dict(df_dict, orient='index')
df1.sort_index(axis=1)
A B C F H I J K M N O P nan
0 80.0 NaN 10.0 NaN NaN NaN 2.0 8.0 NaN NaN NaN NaN NaN
1 NaN NaN NaN 2.0 NaN NaN NaN NaN NaN 25.0 73.0 NaN NaN
2 70.0 NaN NaN NaN 6.0 NaN 15.0 NaN 9.0 NaN NaN NaN NaN
3 NaN 0.5 NaN NaN NaN 1.5 NaN 2.0 NaN NaN NaN 96.0 NaN
4 83.0 4.0 NaN 9.0 2.0 NaN NaN NaN NaN NaN 1.0 1.0 NaN
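Both answers above build one dict per row and let pandas align the keys into columns. A compact, runnable sketch of that technique follows, using three rows from the question trimmed to four X/Y pairs (the trimming and the names `n_pairs`, `records` and `result` are choices made for this example). Missing labels are skipped with pd.notna so that no "nan" column appears.

```python
import numpy as np
import pandas as pd

# Three rows from the question, trimmed to four X/Y pairs for brevity
df = pd.DataFrame({
    "X1": ["A", "F", "B"], "X2": ["C", "N", "E"],
    "X3": ["J", "O", "G"], "X4": ["K", np.nan, np.nan],
    "Y1": [80.0, 2.0, 10.0], "Y2": [10.0, 25.0, 31.0],
    "Y3": [2.0, 73.0, 59.0], "Y4": [8.0, np.nan, np.nan],
})
n_pairs = 4

# One dict per row: label from Xi -> value from Yi, skipping missing labels
records = [
    {row[f"X{i}"]: row[f"Y{i}"]
     for i in range(1, n_pairs + 1)
     if pd.notna(row[f"X{i}"])}
    for _, row in df.iterrows()
]

# pandas aligns the dict keys into columns and fills the gaps with NaN
result = pd.DataFrame(records, index=df.index).sort_index(axis=1)
print(result)
```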

pd.MultiIndex.from_tuples is adding nan values to the table

Changing the original values in the df
I have a df as shown below, which I obtain after performing a number of calculations:
Acc Ep Direction Ttest_t Ttest_s T_count TPNL TS TotalPNL TotalS
A KA B -10.62 -0.21 3 -625.52 14.59 -667.61 24.28
B EF B -4.25 2.63 2 -448.08 26.88 -448.08 26.88
D SE B -3.94 8.63 4 -533.70 75.41 -550.26 128.38
G UA S -6.85 -0.09 3 -563.83 19.58 -411.06 21.54
N EL B -5.39 2.84 2 -2230.23 464.56 -6641.1 1232.79
N SD B -4.70 -0.21 2 -1057.0 117.45 -6641.1 1232.79
S UD B -5.48 0.18 33 1416.69 3981.32 955.34 4475.32
Then I use the MultiIndex function as follows:
columns = [('Index','Acc'), ('Index','Ep'), ('EPNL','Ttest_t'), ('EPNL','TPNL'), ('EPNL','TotalPNL'), ('SPaid','Ttest_s'), ('SPaid','TS'), ('SPaid','TotalS'), ('Other','Direction'), ('Other','T_count')]
temp3.columns=pd.MultiIndex.from_tuples(columns)
This does give me the table format I want, but it adds null values to my table (as shown below):
Index EPNL SPaid Other O
Acc Epic Ttest_t TPNL TotalPNL Ttest_s TS TotalS Direction t
NaN NaN NaN NaN NaN NaN NaN NaN NaN h
NaN NaN NaN NaN NaN NaN NaN NaN NaN e
NaN NaN NaN NaN NaN NaN NaN NaN NaN r
NaN NaN NaN NaN NaN NaN NaN NaN NaN T
NaN NaN NaN NaN NaN NaN NaN NaN NaN r
NaN NaN NaN NaN NaN NaN NaN NaN NaN a
NaN NaN NaN NaN NaN NaN NaN NaN NaN d
NaN NaN NaN NaN NaN NaN NaN NaN NaN e
NaN NaN NaN NaN NaN NaN NaN NaN NaN
NaN NaN NaN NaN NaN NaN NaN NaN NaN c
NaN NaN NaN NaN NaN NaN NaN NaN NaN o
NaN NaN NaN NaN NaN NaN NaN NaN NaN u
NaN NaN NaN NaN NaN NaN NaN NaN NaN n
NaN NaN NaN NaN NaN NaN NaN NaN NaN t
A KA B -10.62 -0.21 3 -625.52 14.59 -667.61 24.28
B EF B -4.25 2.63 2 -448.08 26.88 -448.08 26.88
D SE B -3.94 8.63 4 -533.70 75.41 -550.26 128.38
G UA S -6.85 -0.09 3 -563.83 19.58 -411.06 21.54
N EL B -5.39 2.84 2 -2230.23 464.56 -6641.17 1232.79
N SD B -4.70 -0.21 2 -1057.02 117.45 -6641.17 1232.79
S UD B -5.48 0.18 33 1416.69 3981.32 955.34 4475.32
Any ideas why it is doing that? I would prefer that it not add these values (I don't want to use dropna).

Because a string is iterable, ('Other' 'T_count') (note the missing comma) is converted to
'O','t','h','e','r',' ','T','_','c','o','u','n','t'
and a 16-level MultiIndex is created.
The solution is to add the comma, as in ('Other', 'T_count'):
columns = [('Index','Acc'), ('Index','Ep'),
('EPNL','Ttest_t'), ('EPNL','TPNL'),
('EPNL','TotalPNL'), ('SPaid','Ttest_s'),
('SPaid','TS'), ('SPaid','TotalS'),
('Other','Direction'), ('Other','T_count')]
temp3.columns=pd.MultiIndex.from_tuples(columns)
print (temp3)
Index EPNL SPaid Other \
Acc Ep Ttest_t TPNL TotalPNL Ttest_s TS TotalS Direction
0 A KA B -10.62 -0.21 3 -625.52 14.59 -667.61
1 B EF B -4.25 2.63 2 -448.08 26.88 -448.08
2 D SE B -3.94 8.63 4 -533.70 75.41 -550.26
3 G UA S -6.85 -0.09 3 -563.83 19.58 -411.06
4 N EL B -5.39 2.84 2 -2230.23 464.56 -6641.10
5 N SD B -4.70 -0.21 2 -1057.00 117.45 -6641.10
6 S UD B -5.48 0.18 33 1416.69 3981.32 955.34
T_count
0 24.28
1 26.88
2 128.38
3 21.54
4 1232.79
5 1232.79
6 4475.32
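To make the failure mode concrete, here is a minimal sketch. Python joins adjacent string literals, so a missing comma inside the parentheses yields a single string rather than a 2-tuple. The small frame and the shortened column list are made up for the example, and the final check is just one hypothetical way to catch the mistake before building the MultiIndex.

```python
import pandas as pd

# Missing comma: the two literals are concatenated into one string
broken = ('Other' 'T_count')
print(type(broken).__name__, repr(broken))

# With the comma it is a proper 2-tuple
fixed = ('Other', 'T_count')

# Shortened column list and toy data for illustration
columns = [('Index', 'Acc'), ('EPNL', 'TPNL'), fixed]
temp3 = pd.DataFrame([["A", -625.52, 3], ["B", -448.08, 2]],
                     columns=["Acc", "TPNL", "T_count"])

# Guard: every entry must be a 2-tuple before building the MultiIndex
assert all(isinstance(c, tuple) and len(c) == 2 for c in columns)

temp3.columns = pd.MultiIndex.from_tuples(columns)
print(temp3)
```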

Pandas: Merge data with different timing

I have two data frames that contain time-series data that are on different ranges. One starts earlier, and ends earlier. Also, one is monthly and one is quarterly. However, the index of both is in the form of YYYY-MM-DD. Is there a cute way of merging these dataframes using "Python" and "Pandas"?
Thanks!
/edit
One set:
DATE GDP GPDI NFLS
0 1947-01-01 243.1 35.9 112.815
1 1947-04-01 246.3 34.5 111.253
2 1947-07-01 250.1 34.9 113.023
3 1947-10-01 260.3 43.2 111.440
The other one:
DATE INDPRO M08354USM310NNBR GDP
(...)
334 1946-11-01 13.3916 NaN NaN
335 1946-12-01 13.4721 NaN NaN
336 1947-01-01 13.6332 42.8 NaN
337 1947-02-01 13.7137 42.5 NaN
Together I would like to join them, such that
DATE INDPRO M08354USM310NNBR GDP GPDI NFLS
1946-11-01 13.3916 NaN NaN NaN NaN
1946-12-01 13.4721 NaN NaN NaN NaN
1947-01-01 13.6332 42.8 243.1 35.9 112.815
1947-02-01 13.7137 42.5 NaN NaN NaN
(...)

Just perform an outer merge on 'DATE'; the fact that the periods differ and do not fully overlap is not a problem:
merged = df1.merge(df2, on='DATE', how='outer')
merged
Out[54]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
[7 rows x 7 columns]
You can then rename, fill, or drop the erroneous 'GDP_y' column.
To sort the merged frame by the 'DATE' column, call sort_values (DataFrame.sort has been removed from newer pandas versions):
In [57]:
merged.sort_values('DATE')
Out[57]:
DATE GDP_x GPDI NFLS INDPRO M08354USM310NNBR GDP_y
4 1946-11-01 NaN NaN NaN 13.3916 NaN NaN
5 1946-12-01 NaN NaN NaN 13.4721 NaN NaN
0 1947-01-01 243.1 35.9 112.815 13.6332 42.8 NaN
6 1947-02-01 NaN NaN NaN 13.7137 42.5 NaN
1 1947-04-01 246.3 34.5 111.253 NaN NaN NaN
2 1947-07-01 250.1 34.9 113.023 NaN NaN NaN
3 1947-10-01 260.3 43.2 111.440 NaN NaN NaN
[7 rows x 7 columns]
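The steps above can be put together in a self-contained sketch using the sample rows from the question. Two choices here are assumptions rather than part of the answer: DATE is parsed with pd.to_datetime so the sort is chronological, and the all-NaN GDP column is dropped from the monthly frame before merging so that no GDP_x / GDP_y pair is created. The names `quarterly` and `monthly` are illustrative.

```python
import numpy as np
import pandas as pd

# Quarterly series (rows from the question)
quarterly = pd.DataFrame({
    "DATE": pd.to_datetime(["1947-01-01", "1947-04-01", "1947-07-01", "1947-10-01"]),
    "GDP": [243.1, 246.3, 250.1, 260.3],
    "GPDI": [35.9, 34.5, 34.9, 43.2],
    "NFLS": [112.815, 111.253, 113.023, 111.440],
})

# Monthly series (rows from the question); its GDP column is entirely NaN
monthly = pd.DataFrame({
    "DATE": pd.to_datetime(["1946-11-01", "1946-12-01", "1947-01-01", "1947-02-01"]),
    "INDPRO": [13.3916, 13.4721, 13.6332, 13.7137],
    "M08354USM310NNBR": [np.nan, np.nan, 42.8, 42.5],
    "GDP": [np.nan] * 4,
})

# Drop the empty duplicate column so the merge keeps a single GDP column
monthly = monthly.drop(columns="GDP")

# Outer merge keeps every date from both frames, then sort chronologically
merged = (
    quarterly.merge(monthly, on="DATE", how="outer")
    .sort_values("DATE")
    .reset_index(drop=True)
)
print(merged)
```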
