Retain R dataframe index values when converting to a pandas dataframe - python

I fitted a mixed effects model using the R (base version 3.5.2) package lme4, run via rpy2 2.9.4 from Python 3.6.
I am able to print the random effects as an indexed dataframe, where the index values are the values of the categorical variable(s) used to define the groups (using the radon data):
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri, default_converter
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr
lme4 = importr('lme4')
mod = lme4.lmer(**kwargs) # Omitting arguments for brevity
r_ranef = ro.r['ranef']
re = r_ranef(mod)
print(re[1])
Uppm (Intercept) floor (Intercept)
AITKIN -0.0026783361 -2.588735e-03 1.742426e-09 -0.0052003670
ANOKA -0.0056688495 -6.418760e-03 -4.482764e-09 -0.0128942943
BECKER 0.0021906431 1.190746e-03 1.211201e-09 0.0023920238
BELTRAMI 0.0093246041 8.190172e-03 5.135196e-09 0.0164527872
BENTON 0.0018747838 1.049496e-03 1.746748e-09 0.0021082742
BIG STONE -0.0073756824 -2.430404e-03 0.000000e+00 -0.0048823057
BLUE EARTH 0.0112939204 4.176931e-03 5.507525e-09 0.0083908075
BROWN 0.0069223055 2.544912e-03 4.911563e-11 0.0051123339
When I convert this to a pandas DataFrame, the categorical values are lost from the index and replaced by integers:
pandas2ri.ri2py_dataframe(re[1]) # re is the R list of dataframes returned by ranef
Uppm (Intercept) floor (Intercept)
0 -0.002678 -0.002589 1.742426e-09 -0.005200
1 -0.005669 -0.006419 -4.482764e-09 -0.012894
2 0.002191 0.001191 1.211201e-09 0.002392
3 0.009325 0.008190 5.135196e-09 0.016453
4 0.001875 0.001049 1.746748e-09 0.002108
5 -0.007376 -0.002430 0.000000e+00 -0.004882
6 0.011294 0.004177 5.507525e-09 0.008391
7 0.006922 0.002545 4.911563e-11 0.005112
How do I retain the values of the original index?
The doc suggests as.data.frame could contain grp, which might be the values I'm after, but I'm struggling to get at it through rpy2; e.g.,
r_ranef = ro.r['ranef.as.data.frame']
does not work.
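For what it's worth, the as.data.frame method for ranef objects is an S3 method, so it isn't reachable by a qualified name like that; looking up the generic and letting R dispatch should get closer. A minimal sketch, assuming lme4's method returns the long format with grpvar, term, grp and condval columns:
# Sketch: call the generic as.data.frame; R dispatches to lme4's method
# for ranef objects, and 'grp' carries the original index values
as_data_frame = ro.r['as.data.frame']
re_long = as_data_frame(re)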

Consider adding row.names as a new column in the R data frame and then using that column to set_index in the pandas data frame:
base = importr('base')
# ADD NEW COLUMN TO R DATA FRAME
re[1] = base.transform(re[1], index=base.row_names(re[1]))

# SET INDEX IN PANDAS DATA FRAME
py_df = (pandas2ri.ri2py_dataframe(re[1])
           .set_index('index')
           .rename_axis(None))
And to do so across all data frames in the list, use R's lapply and then a Python list comprehension to build a new list of indexed pandas data frames.
base = importr('base')
mod = lme4.lmer(**kwargs) # Omitting arguments for brevity
r_ranef = lme4.ranef(mod)
# R LAPPLY
new_r_ranef = base.lapply(r_ranef, lambda df:
                          base.transform(df, index=base.row_names(df)))

# PYTHON LIST COMPREHENSION
py_df_list = [(pandas2ri.ri2py_dataframe(df)
                 .set_index('index')
                 .rename_axis(None)
               ) for df in new_r_ranef]
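If the names of the grouping factors should survive as well, one option (a sketch; rpy2 list vectors expose the R names attribute as .names, and R's lapply preserves names) is to zip them back together:
# Rebuild a dict keyed by the grouping-factor names from the R list
py_df_dict = dict(zip(new_r_ranef.names, py_df_list))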

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri, default_converter
from rpy2.robjects.conversion import localconverter
r_dataf = ro.r("""
data.frame(
    Uppm = rnorm(5),
    row.names = letters[1:5]
)
""")

with localconverter(default_converter + pandas2ri.converter) as conv:
    pd_dataf = conv.rpy2py(r_dataf)

# row names are "a".."e"
print(r_dataf)

# row names / indexes are now 0..4
print(pd_dataf)
This is likely a minor bug/missing feature in rpy2, but the workaround is rather straightforward:
with localconverter(default_converter + pandas2ri.converter) as conv:
    pd_dataf = conv.rpy2py(r_dataf)
    pd_dataf.index = r_dataf.rownames


How can I operate on grouped arrays from nested pandas DataFrames?

I have a series of nested pandas DataFrames containing several hundred arrays, and I would like to average each variable across different nesting levels.
The variable mydatadf contains a very simple representative example of my actual data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

mydata = dict()
participant = ['participantA', 'participantB']
for p in participant:
    ses = dict()
    session = ['ses_1', 'ses_2']
    for s in session:
        series = dict()
        sets = ['s_1', 's_2', 's_3']
        for se in sets:
            reps = dict()
            rep = ['r_1', 'r_2', 'r_3', 'r_4', 'r_5']
            for r in rep:
                vars = {'var1': np.sin(np.random.rand(1000)*2),
                        'var2': np.sin(np.random.rand(1000)*2)}
                reps[r] = vars
            series[se] = reps
        ses[s] = series
    mydata[p] = ses
mydatadf = pd.DataFrame(mydata)
How could I effectively average (for example) var1 across the nesting levels reps, series, ses and/or participant?
Eventually, I would like to plot all var1 objects and highlight with different colours averaged data across any desired nesting level.
for p in mydatadf.keys():
    for ses in mydatadf[p].keys():
        for sets in mydatadf[p][ses].keys():
            for rep in mydatadf[p][ses][sets].keys():
                data = mydatadf[p][ses][sets][rep]['var1']
                plt.plot(data)
plt.show()
You can always flatten the dataframe and do standard groupby operations (I don't know if it is optimal, but it works):
df = pd.io.json.json_normalize(mydata) # single-row frame; the dot-joined column names encode the nesting
df_flat = pd.DataFrame(df.T.index.str.split('.').tolist()).assign(values=df.T.values)
df_flat.head(3)
>>            0      1    2    3     4  \
0  participantA  ses_1  s_1  r_1  var1
1  participantA  ses_1  s_1  r_1  var2
2  participantA  ses_1  s_1  r_2  var1

                                              values
0  [0.7267196257553268, 0.9822775511169437, 0.991...
1  [0.6633676714415264, 0.2823588336690545, 0.977...
2  [0.2211576389168905, 0.9399581790280525, 0.645...
Edit: to groupby and apply a function (say, mean):
# in this case I choose column 4, which holds the 'var1'/'var2' labels.
# You can rename the columns with df_flat.rename(columns=...)
# note that I use np.hstack as you are dealing with an array of arrays
column = 4
df_flat.groupby(column)['values'].apply(lambda x: np.hstack(x).mean())
>> 4
var1    0.707803
var2    0.707821
Name: values, dtype: float64
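To average within any other nesting level, add that level's column to the groupby; for example (a sketch against the same flattened layout, where column 0 holds the participant):
# Average each variable per participant rather than across everything
df_flat.groupby([0, column])['values'].apply(lambda x: np.hstack(x).mean())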

h2o frame from pandas casting

I am using h2o to perform predictive modeling from python.
I have loaded some data from a csv using pandas, specifying some column types:
import os
import pandas as pd

dtype_dict = {'SIT_SSICCOMP': 'object',
              'SIT_CAPACC': 'object',
              'PTT_SSIRMPOL': 'object',
              'PTT_SPTCLVEI': 'object',
              'cap_pad': 'object',
              'SIT_SADNS_RESP_PERC': 'object',
              'SIT_GEOCODE': 'object',
              'SIT_TIPOFIRMA': 'object',
              'SIT_TPFRODESI': 'object',
              'SIT_CITTAACC': 'object',
              'SIT_INDIRACC': 'object',
              'SIT_NUMCIVACC': 'object'}
date_cols = ["SIT_SSIDTSIN", "SIT_SSIDTDEN", "PTT_SPTDTEFF", "PTT_SPTDTSCA",
             "SIT_DTANTIFRODE", "PTT_DTELABOR"]
columns_to_drop = ['SIT_TPFRODESI', 'SIT_CITTAACC',
                   'SIT_INDIRACC', 'SIT_NUMCIVACC', 'SIT_CAPACC', 'SIT_LONGITACC',
                   'SIT_LATITACC', 'cap_pad', 'SIT_DTANTIFRODE']
comp = 'mycomp'
file_completo = os.path.join(dataDir, "db4modelrisk_" + comp + ".csv")
db4scoring = pd.read_csv(filepath_or_buffer=file_completo, sep=";", encoding='latin1',
                         header=0, infer_datetime_format=True, na_values=[''],
                         keep_default_na=False, parse_dates=date_cols,
                         dtype=dtype_dict, nrows=500e3)
db4scoring.drop(labels=columns_to_drop, axis=1, inplace=True)
Then, after I set up an h2o cluster, I import the frame into h2o using db4scoring_h2o = H2OFrame(db4scoring) and convert categorical predictors to factors, for example:
db4scoring_h2o["SIT_SADTPROV"]=db4scoring_h2o["SIT_SADTPROV"].asfactor()
db4scoring_h2o["PTT_SPTFRAZ"]=db4scoring_h2o["PTT_SPTFRAZ"].asfactor()
When I check data types using db4scoring.dtypes, I notice that they are properly set, but when I import into h2o the H2OFrame performs some unwanted conversions to enum (e.g. from float or from int). I wonder if there is a way to specify the variable format in H2OFrame.
Yes, there is. See the H2OFrame doc here: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
You just need to use the column_types argument when you cast.
Here's a short example:
# imports
import h2o
import numpy as np
import pandas as pd
# create small random pandas df
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)),
                  columns=list('AB'))
print(df)
# A B
#0 5 0
#1 1 3
#2 4 8
#3 3 9
# ...
# start h2o, convert pandas frame to H2OFrame
# use column_types dict to set data types
h2o.init()
h2o_df = h2o.H2OFrame(df, column_types={'A':'numeric', 'B':'enum'})
h2o_df.describe() # you should now see the desired data types
# A B
# type int enum
# ...

How to pass an array in python pandas to plot two axes?

I am trying to create an XY chart using Python and the Pygal library. The source data is contained in a CSV file with three columns: ID, Portfolio and Value. Unfortunately I can only plot one axis, and I suspect it's an issue with the array. Can anyone point me in the right direction? Do I need to use numpy? Thank you!
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio','Value'])  # <-- I suspect this is wrong
xy_chart.render_in_browser()
With
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio'])
xy_chart.render_in_browser()
I get:
A graph with a series of horizontal data points/values; i.e. it has the X values but no Y values.
With:
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
xy_chart = pygal.XY()
xy_chart.add('Portfolio', data['Portfolio','Value'])
xy_chart.render_in_browser()
I get:
KeyError: ('Portfolio', 'Value')
Sample data:
ID Portfolio Value
1 1 -2560.042036
2 2 1208.106958
3 3 5702.386949
4 4 -8827.63913
5 5 -3881.665733
6 6 5951.602484
Maybe a little late here, but I just did something similar. Your second example needs the two columns handed in as a list (so that you get a DataFrame back), and then that DataFrame converted into a list of tuples.
import pygal
import pandas as pd
data = pd.read_csv("profit.csv")
data.columns = ["ID", "Portfolio", "Value"]
points = data[['Portfolio','Value']].to_records(index=False).tolist()
xy_chart = pygal.XY()
xy_chart.add('Portfolio', points)
xy_chart.render_in_browser()
There may be a more elegant use of the pandas or pygal API to get the columns into a list of tuples.
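One such alternative (a sketch; itertuples with name=None yields plain tuples, so no intermediate records array is needed):
# Build the (x, y) pairs directly from the two columns
points = list(data[['Portfolio', 'Value']].itertuples(index=False, name=None))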

Use Pandas GroupBy Columns in new DataFrame

I have a large temperature time series that I'm performing some functions on: I take hourly observations and create daily statistics. After I'm done with my calculations, I want to take the grouped Year and Julian day keys from the GroupBy object ('aa' below), together with the drangeT and drangeHI arrays that come out, and make an entirely new DataFrame from those variables. Code is below:
import numpy as np
import scipy.stats as st
import pandas as pd

city = ['BUF']  # ,'PIT','CIN','CHI','STL','MSP','DET'
mons = np.arange(5, 11, 1)
for a in city:
    data = 'H:/Classwork/GEOG612/Project/' + a + 'Data_cut.txt'
    df = pd.read_table(data, sep='\t')
    df['TempF'] = ((9./5.)*df['TempC']) + 32.
    df1 = df.loc[df['Month'].isin(mons)]
    aa = df1.groupby(['Year', 'Julian'], as_index=False)
    maxT = aa.aggregate({'TempF': np.max})
    minT = aa.aggregate({'TempF': np.min})
    maxHI = aa.aggregate({'HeatIndex': np.max})
    minHI = aa.aggregate({'HeatIndex': np.min})
    drangeT = maxT - minT
    drangeHI = maxHI - minHI
    df2 = pd.DataFrame(data={'Year': aa.Year, 'Day': aa.Julian,
                             'TRange': drangeT, 'HIRange': drangeHI})
All variables in the df2 command are of length 8250, but I get this error message when I run it:
ValueError: cannot copy sequence with size 3 to array axis with dimension 8250
Any suggestions are welcomed and appreciated. Thanks!
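For what it's worth, one likely culprit: aa.Year and aa.Julian are SeriesGroupBy objects rather than arrays, and with as_index=False the aggregated frames already carry Year and Julian as ordinary columns (so maxT - minT differences those key columns too). A minimal sketch of one possible fix, assuming the column names above:
# Take the group keys from one aggregated frame and compute the ranges
# on the value columns only (column names assumed from the code above)
df2 = maxT[['Year', 'Julian']].rename(columns={'Julian': 'Day'})
df2['TRange'] = maxT['TempF'] - minT['TempF']
df2['HIRange'] = maxHI['HeatIndex'] - minHI['HeatIndex']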

Associating units with Pandas DataFrame

I'm using a web service that returns a CSV response in which the 1st row contains the column names, and the 2nd row contains the column units, for example:
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
I can read this into a Pandas DataFrame:
import pandas as pd
from StringIO import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
# Create a Pandas DataFrame
obs=pd.read_csv(StringIO(x.strip()), sep=",\s*")
print(obs)
which produces
longitude latitude
0 degrees_east degrees_north
1 -142.842 -1.82
2 -25.389 39.87
3 -37.704 27.114
But what would be the best approach to associate the units with the DataFrame columns for later use, for example labeling plots?
Allowing pandas to read the second line as data screws up the dtype for the columns. Instead of a float dtype, the presence of strings makes the dtype of the columns object, and the underlying objects, even the numbers, are strings. This breaks all numerical operations:
In [8]: obs['latitude']+obs['longitude']
Out[8]:
0 degrees_northdegrees_east
1 -1.82-142.842
2 39.87-25.389
3 27.114-37.704
In [9]: obs['latitude'][1]
Out[9]: '-1.82'
So it is imperative that pd.read_csv skip the second line.
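Skipping that line by itself is easy, e.g. (a sketch):
# skiprows=[1] drops the units row, so the columns come back as float64,
# but the units themselves are thrown away
obs = pd.read_csv(StringIO(x.strip()), sep=",\s*", skiprows=[1])
The fiddly part is capturing the units at the same time, which is what the helper below does.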
The following is pretty ugly, but given the format of the input, I don't see a better way.
import pandas as pd
from StringIO import StringIO

x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''

content = StringIO(x.strip())

def read_csv(content):
    columns = next(content).strip().split(',')
    units = next(content).strip().split(',')
    obs = pd.read_table(content, sep=",\s*", header=None)
    obs.columns = ['{c} ({u})'.format(c=col, u=unit)
                   for col, unit in zip(columns, units)]
    return obs

obs = read_csv(content)
print(obs)
# longitude (degrees_east) latitude (degrees_north)
# 0 -142.842 -1.820
# 1 -25.389 39.870
# 2 -37.704 27.114
print(obs.dtypes)
# longitude (degrees_east) float64
# latitude (degrees_north) float64
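Since the units are now embedded in the column names, plot labels pick them up for free; for instance (a sketch, assuming matplotlib):
import matplotlib.pyplot as plt

# Axis labels inherit the '(unit)' suffix from the column names
obs.plot(x='longitude (degrees_east)', y='latitude (degrees_north)', kind='scatter')
plt.show()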
