Converting text (json object?) in pandas cell to columns - python

I am trying to pull data from text values in a pandas DataFrame.
df = pd.DataFrame(['{58={1=4.5}, 50={1=4.0}, 42={1=3.5}, 62={1=4.75}, 54={1=4.25}, 46={1=3.75}}',
'{a={1=15.0}, b={1=14.0}, c={1=13.0}, d={1=15.5}, e={1=14.5}, f={1=13.5}}',
'{58={1=15.5}, 50={1=14.5}, 42={1=13.5}, 62={1=16.0}, 54={1=15.0}, 46={1=14.0}}'])
I have tried
df.apply(pd.Series)
pd.DataFrame(df.tolist(),index=df.index)
json_normalize(df)
But with no success.
I want to have new columns 50, 52, a, b, c, etc., with the values stripped of the '1=' prefix; I don't mind the NaNs. How can I do that? And what is this format?
Really appreciate your help.

With a targeted replacement to prepare a valid JSON string:
In [184]: import json

In [185]: new_df = pd.DataFrame(df[0].str.replace(r'(\w+)=\{1=([^}]+)\}',
     ...:                                          r'"\1":\2', regex=True)
     ...:                            .apply(json.loads).tolist())

In [186]: new_df
Out[186]:
     42     46    50     54    58     62     a     b     c     d     e     f
0   3.5   3.75   4.0   4.25   4.5   4.75   NaN   NaN   NaN   NaN   NaN   NaN
1   NaN    NaN   NaN    NaN   NaN    NaN  15.0  14.0  13.0  15.5  14.5  13.5
2  13.5  14.00  14.5  15.00  15.5  16.00   NaN   NaN   NaN   NaN   NaN   NaN

You can also do it by rewriting the strings so that your data looks like a Python dictionary literal. There is probably a smarter way using regex (a sketch follows the output below), but that will depend on what assumptions hold for the full data you have available.
My steps below are:
Change the strings so the data takes on a dict-like structure
Use literal_eval to turn each string into a dict
Unfold the result into a new dataframe
from ast import literal_eval

df[0] = (df[0].str.replace('={1=', "':", regex=False)  # close the key quote, drop 1= and the inner {
              .str.replace('}, ', ",'", regex=False)   # drop the inner }, open the next key's quote
              .str.replace('}}', '}', regex=False)     # collapse the outermost extra }
              .str.replace('{', "{'", regex=False)     # open the first key's quote
              .apply(literal_eval))                    # parse each string as a dict

pd.DataFrame(df[0].values.tolist())  # unfold into a new dataframe
Out[1]:
     58    50    42     62     54     46     a     b     c     d     e     f
0   4.5   4.0   3.5   4.75   4.25   3.75   NaN   NaN   NaN   NaN   NaN   NaN
1   NaN   NaN   NaN    NaN    NaN    NaN  15.0  14.0  13.0  15.5  14.5  13.5
2  15.5  14.5  13.5  16.00  15.00  14.00   NaN   NaN   NaN   NaN   NaN   NaN
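As for that smarter regex route: str.extractall can pull every key={1=value} pair out directly and unstack them, skipping the string surgery entirely. A sketch, assuming every cell follows the same two-level pattern:

pairs = df[0].str.extractall(r'(\w+)=\{1=([^}]+)\}')  # one row per key/value match
pairs.columns = ['key', 'value']
new_df = (pairs.droplevel('match')                 # keep only the original row index
               .set_index('key', append=True)['value']
               .astype(float)
               .unstack('key'))                    # rows = original rows, columns = keys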

Row by row mapping keys of dictionary of dataframes to new dictionary of dataframes

I have two dictionaries of dataframes, LP3 and ExeedenceDict. ExeedenceDict is a dictionary of 4 dataframes with keys 'two', 'ten', 'twentyfive', 'onehundred'. The LP3 dictionary has keys 'LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston'.
Edit: I am not sure of the most concise way to title this question, but I think the title suits what I am asking.
There is a column in each dataframe within the ExeedenceDict that has row values equal to the keys in the LP3 dictionary.
Below is a 'blank' dataframe for two in the ExeedenceDict, which I created using this code:
ExeedenceDF = []
cols = ['Location', 'Size', 'Annual Exceedence', 'With Reg Skew', 'Without Reg Skew', '5% Lower', '95% Upper']
for i in range(5):
    i = pd.DataFrame(columns=cols)
    i['Location'] = LP_names
    i['Size'] = [39.8, 24, 34, 29.7, 21.2, 53.7, 61.7, 27.6, 31.6]
    ExeedenceDF.append(i)
ExeedenceDict = {'two': ExeedenceDF[0], 'ten': ExeedenceDF[1], 'twentyfive': ExeedenceDF[2], 'onehundred': ExeedenceDF[3]}
          Location  Size  Annual Exceedence  With Reg Skew  Without Reg Skew  5% Lower  95% Upper
0    LP_DevilMalad  39.8                NaN            NaN               NaN       NaN        NaN
1   LP_Bloomington  24.0                NaN            NaN               NaN       NaN        NaN
2    LP_DevilEvans  34.0                NaN            NaN               NaN       NaN        NaN
3          LP_Deep  29.7                NaN            NaN               NaN       NaN        NaN
4         LP_Maple  21.2                NaN            NaN               NaN       NaN        NaN
5      LP_CubMaple  53.7                NaN            NaN               NaN       NaN        NaN
6    LP_Cottonwood  61.7                NaN            NaN               NaN       NaN        NaN
7          LP_Mill  27.6                NaN            NaN               NaN       NaN        NaN
8  LP_CubNrPreston  31.6                NaN            NaN               NaN       NaN        NaN
Below is the dataframe for key LP_DevilMalad in the LP3 dictionary. This dictionary was built by reading data in from 10 Excel spreadsheets, using this code:
LP_names = ['LP_DevilMalad', 'LP_Bloomington', 'LP_DevilEvans', 'LP_Deep', 'LP_Maple', 'LP_CubMaple', 'LP_Cottonwood', 'LP_Mill', 'LP_CubNrPreston']
for i, df in enumerate(LP_Data):
    LP_Data[i] = LP_Data[i].dropna()
    LP_Data[i]['Annual Exceedence'] = 1 / LP_Data[i]['Annual Exceedence']
    LP_Data[i] = LP_Data[i].loc[LP_Data[i]['Annual Exceedence'].isin([2, 10, 25, 100])]
LP3 = {k: v for (k, v) in zip(LP_names, LP_Data)}
'LP_DevilMalad':
    Annual Exceedence  With Reg Skew  Without Reg Skew  Log Variance of Est  5% Lower  95% Upper
6                 2.0           21.4              22.4               0.0091      14.1       31.2
9                10.0           46.5              44.7               0.0119      32.1       85.7
10               25.0           60.2              54.6               0.0166      40.6      136.2
12              100.0           81.4              67.4               0.0270      51.3      250.6
I am having trouble matching the keys of LP3 to the values in the Location column of the dataframes within ExeedenceDict, with the goal of a script that does all of this iteratively, perhaps via some sort of dictionary comprehension.
The caveat is that the two dataframe takes the row at index 6 of each LP3 dataframe, ten the row at index 9, twentyfive the row at index 10, and onehundred the row at index 12.
The goal dataframe for key two in ExeedenceDict, based on the two dataframes above, would look something like this:
(The rest of the dataframe would be filled with the index-6 row from each of the other LP3 dataframes.)
          Location  Size  Annual Exceedence  With Reg Skew  Without Reg Skew  5% Lower  95% Upper
0    LP_DevilMalad  39.8                  2           21.4              22.4      14.1       31.2
1   LP_Bloomington  24.0                NaN            NaN               NaN       NaN        NaN
2    LP_DevilEvans  34.0                NaN            NaN               NaN       NaN        NaN
3          LP_Deep  29.7                NaN            NaN               NaN       NaN        NaN
4         LP_Maple  21.2                NaN            NaN               NaN       NaN        NaN
5      LP_CubMaple  53.7                NaN            NaN               NaN       NaN        NaN
6    LP_Cottonwood  61.7                NaN            NaN               NaN       NaN        NaN
7          LP_Mill  27.6                NaN            NaN               NaN       NaN        NaN
8  LP_CubNrPreston  31.6                NaN            NaN               NaN       NaN        NaN
I can't test it without a reproducible example, but I would do something along these lines:
index_map = {
    "two": 6,
    "ten": 9,
    "twentyfive": 10,
    "onehundred": 12
}
col_of_interest = ["Annual Exceedence", "With Reg Skew", "Without Reg Skew", "5% Lower", "95% Upper"]
for index_key, df in ExeedenceDict.items():
    lp_index = index_map[index_key]
    for lp_val in df['Location'].values:
        df.loc[df['Location'] == lp_val, col_of_interest] = LP3[lp_val].loc[lp_index, col_of_interest].values
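If you want something closer to the dictionary comprehension mentioned in the question, the same fill can be written as a single expression. A sketch under the same assumptions (index_map and col_of_interest as above, and every Location value present as a key in LP3):

ExeedenceDict = {
    key: df.assign(**{
        col: [LP3[name].loc[index_map[key], col] for name in df['Location']]
        for col in col_of_interest
    })
    for key, df in ExeedenceDict.items()
}

Unlike the loop, this builds new dataframes rather than mutating the existing ones in place.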

How to calculate correlation on a specific dataset

My dataset:
Rate Class
11.2 A
21.4 A
0.11 B
51.6 B
43.7 C
90.8 C
8.3 D
14.4 D
I want to know the correlation with rate for each class.
    A      B     C     D
 11.2    NaN   NaN   NaN
 21.4    NaN   NaN   NaN
  NaN   0.11   NaN   NaN
  NaN   51.6   NaN   NaN
  NaN    NaN  43.7   NaN
  NaN    NaN  90.8   NaN
  NaN    NaN   NaN   8.3
  NaN    NaN   NaN  14.4
I modified the dataset as shown above, but I cannot compute the correlation due to the NaN values.
Do you know how to find the correlation in this case?
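For reference, a minimal sketch that reproduces the reshaped layout above and one way to get a correlation out of it (assuming each class has the same number of observations and that row order defines the pairing):

import pandas as pd

df = pd.DataFrame({'Rate': [11.2, 21.4, 0.11, 51.6, 43.7, 90.8, 8.3, 14.4],
                   'Class': ['A', 'A', 'B', 'B', 'C', 'C', 'D', 'D']})

# Reproduces the NaN-padded layout above; wide.corr() comes back all-NaN
# because no two class columns ever share a row.
wide = df.pivot(columns='Class', values='Rate')

# Align each class's observations positionally so .corr() has pairs to work with.
aligned = pd.DataFrame({cls: grp.to_numpy() for cls, grp in df.groupby('Class')['Rate']})
print(aligned.corr())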

Python Pandas Dynamically Create a Dataframe

The code below generates the desired output in ONE dataframe; however, I would like to dynamically create dataframes in a FOR loop and then assign the shifted values to each of them. For example, dataframe df_lags_12 would contain only column1_t12 and column2_t12. Any ideas would be greatly appreciated. I attempted to dynamically create 12 dataframes using the exec statement, but searching suggests this is poor practice.
import pandas as pd

list1 = list(range(0, 20))
list2 = list(range(19, -1, -1))
d = {'column1': list(range(0, 20)),
     'column2': list(range(19, -1, -1))}
df = pd.DataFrame(d)

df_lags = pd.DataFrame()
for col in df.columns:
    for i in range(12, 0, -1):
        df_lags[col + '_t' + str(i)] = df[col].shift(i)
    df_lags[col] = df[col].values
print(df_lags)

for df in range(12, 0, -1):
    exec('model_data_lag_' + str(df) + '=pd.DataFrame()')
Desired output for the dynamically created dataframe df_lags_12:
var_list=['column1_t12','column2_t12']
df_lags_12=df_lags[var_list]
print(df_lags_12)
I think the best approach is to create a dictionary of DataFrames:
d = {}
for i in range(12, 0, -1):
    d['t' + str(i)] = df.shift(i).add_suffix('_t' + str(i))
If you need to specify the columns first:
d = {}
cols = ['column1', 'column2']
for i in range(12, 0, -1):
    d['t' + str(i)] = df[cols].shift(i).add_suffix('_t' + str(i))
A dict comprehension solution:
d = {'t' + str(i): df.shift(i).add_suffix('_t' + str(i)) for i in range(12,0,-1)}
print (d['t10'])
    column1_t10  column2_t10
0           NaN          NaN
1           NaN          NaN
2           NaN          NaN
3           NaN          NaN
4           NaN          NaN
5           NaN          NaN
6           NaN          NaN
7           NaN          NaN
8           NaN          NaN
9           NaN          NaN
10          0.0         19.0
11          1.0         18.0
12          2.0         17.0
13          3.0         16.0
14          4.0         15.0
15          5.0         14.0
16          6.0         13.0
17          7.0         12.0
18          8.0         11.0
19          9.0         10.0
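If you later need all the lags side by side, the dictionary also concatenates cleanly (a small sketch reusing d from above):

all_lags = pd.concat(list(d.values()), axis=1)  # column1_t12 ... column2_t1 in one frame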
EDIT: It is possible with globals(), but a dictionary is much better:
cols = ['column1', 'column2']
for i in range(12, 0, -1):
    globals()['df' + str(i)] = df[cols].shift(i).add_suffix('_t' + str(i))
print (df10)
    column1_t10  column2_t10
0           NaN          NaN
1           NaN          NaN
2           NaN          NaN
3           NaN          NaN
4           NaN          NaN
5           NaN          NaN
6           NaN          NaN
7           NaN          NaN
8           NaN          NaN
9           NaN          NaN
10          0.0         19.0
11          1.0         18.0
12          2.0         17.0
13          3.0         16.0
14          4.0         15.0
15          5.0         14.0
16          6.0         13.0
17          7.0         12.0
18          8.0         11.0
19          9.0         10.0
for i in range(1, 16):
    text = f"Version{i}=pd.DataFrame()"
    exec(text)
A combination of exec and f-strings will help you do that.
If you need to create many versions of the same variable iteratively, the statement above will do it.

How to swap index and values on pandas dataframe

I have some data in which the index is a threshold, and the values are tnrs (true negative rates) for two classes, 0 and 1.
I want to get a dataframe, indexed by the tnr, of the threshold that corresponds to that tnr, for each class.
I am able to achieve this effect by using the following:
pd.concat([pd.Series(data[0].index.values, index=data[0]),
           pd.Series(data[1].index.values, index=data[1])],
          axis=1)
Or, generalizing to any number of columns:
def invert_dataframe(df):
    return pd.concat([pd.Series(df[col].index.values,
                                index=df[col]) for col in df.columns],
                     axis=1)
However, this seems extremely hacky and error prone. Is there a better way to do this, and is there maybe native Pandas functionality that would do this?
You can use stack with pivot:
data = pd.DataFrame({0: [10, 20, 31], 10: [4, 22, 36],
                     1: [7, 5, 6]}, index=[2.1, 1.07, 2.13])
print (data)
       0  1  10
2.10  10  7   4
1.07  20  5  22
2.13  31  6  36
df = data.stack().reset_index()
df.columns = list('abc')
df = df.pivot(index='c', columns='b', values='a')
print (df)
b      0     1    10
c
4    NaN   NaN  2.10
5    NaN  1.07   NaN
6    NaN  2.13   NaN
7    NaN  2.10   NaN
10  2.10   NaN   NaN
20  1.07   NaN   NaN
22   NaN   NaN  1.07
31  2.13   NaN   NaN
36   NaN   NaN  2.13
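For what it's worth, the same inversion can be written with the native melt instead of the manual rename step. A sketch; as far as I can tell it gives the same table for the data above, just with axis names 'value' and 'variable':

inverted = (data.reset_index()                  # thresholds become a column named 'index'
                .melt(id_vars='index')          # long form: index, variable, value
                .pivot(index='value', columns='variable', values='index'))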

How to get indexes of values in a Pandas DataFrame?

I am sure there must be a very simple solution to this problem, but I am failing to find it (and browsing through previously asked questions, I didn't find the answer I wanted or didn't understand it).
I have a dataframe similar to this (just much bigger, with many more rows and columns):
       x  val1  val2   val3
0    0.0  10.0   NaN    NaN
1    0.5  10.5   NaN    NaN
2    1.0  11.0   NaN    NaN
3    1.5  11.5   NaN  11.60
4    2.0  12.0   NaN  12.08
5    2.5  12.5  12.2  12.56
6    3.0  13.0  19.8  13.04
7    3.5  13.5  13.3  13.52
8    4.0  14.0  19.8  14.00
9    4.5  14.5  14.4  14.48
10   5.0  15.0  19.8  14.96
11   5.5  15.5  15.5  15.44
12   6.0  16.0  19.8  15.92
13   6.5  16.5  16.6  16.40
14   7.0  17.0  19.8  18.00
15   7.5  17.5  17.7    NaN
16   8.0  18.0  19.8    NaN
17   8.5  18.5  18.8    NaN
18   9.0  19.0  19.8    NaN
19   9.5  19.5  19.9    NaN
20  10.0  20.0  19.8    NaN
In the next step, I need to compute the derivative dVal/dx for each of the value columns (in reality I have more than 3 columns, so I need a robust solution in a loop; I can't select the rows manually each time). But because of the NaN values in some of the columns, x and val end up with different dimensions. I feel the way to overcome this would be to select only those x intervals for which val is notnull, but I have not been able to do that.
Here is the code so far (I may have introduced some mistakes by leaving in old pieces of code, because I've been messing with it for a while trying different things):
import pandas as pd
import numpy as np

df = pd.read_csv('H:/DocumentsRedir/pokus/dataframe.csv', delimiter=',')
vals = list(df.columns.values)[1:]

for i in vals:
    V = np.asarray(pd.notnull(df[i]))
    mask = pd.notnull(df[i])
    X = np.asarray(df.loc[mask]['x'])
    derivative = np.diff(V) / np.diff(X)
But I am getting this error:
ValueError: operands could not be broadcast together with shapes (20,) (15,)
So, apparently, it did not select only the notnull values...
Is there an obvious mistake that I am making or a different approach that I should adopt? Thanks!
(And another less important question: is np.diff the right function to use here, or had I better calculate it manually with finite differences? I'm not finding the numpy documentation very helpful.)
To calculate dVal/dX:
dVal = df.iloc[:, 1:].diff() # `x` is in column 0.
dX = df['x'].diff()
>>> dVal.apply(lambda series: series / dX)
    val1  val2  val3
0    NaN   NaN   NaN
1      1   NaN   NaN
2      1   NaN   NaN
3      1   NaN   NaN
4      1   NaN  0.96
5      1   NaN  0.96
6      1  15.2  0.96
7      1 -13.0  0.96
8      1  13.0  0.96
9      1 -10.8  0.96
10     1  10.8  0.96
11     1  -8.6  0.96
12     1   8.6  0.96
13     1  -6.4  0.96
14     1   6.4  3.20
15     1  -4.2   NaN
16     1   4.2   NaN
17     1  -2.0   NaN
18     1   2.0   NaN
19     1   0.2   NaN
20     1  -0.2   NaN
We difference all columns (except the first one), and then apply a lambda function to each column that divides it by the difference in column x.
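As an aside, dVal.div(dX, axis=0) should give the same result as the apply above. And the ValueError in the question's loop comes from V holding the boolean mask itself (length 20) rather than the masked values (length 15). A minimal fix of the original approach, keeping the question's variable names, might be:

for i in vals:
    mask = df[i].notnull()
    V = df.loc[mask, i].to_numpy()    # the non-null values, not the boolean mask
    X = df.loc[mask, 'x'].to_numpy()  # the matching x positions
    derivative = np.diff(V) / np.diff(X)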
