h2o frame from pandas casting - python

I am using H2O to perform predictive modeling from Python.
I have loaded some data from a CSV using pandas, specifying some column types:
dtype_dict = {'SIT_SSICCOMP': 'object',
              'SIT_CAPACC': 'object',
              'PTT_SSIRMPOL': 'object',
              'PTT_SPTCLVEI': 'object',
              'cap_pad': 'object',
              'SIT_SADNS_RESP_PERC': 'object',
              'SIT_GEOCODE': 'object',
              'SIT_TIPOFIRMA': 'object',
              'SIT_TPFRODESI': 'object',
              'SIT_CITTAACC': 'object',
              'SIT_INDIRACC': 'object',
              'SIT_NUMCIVACC': 'object'}
date_cols = ["SIT_SSIDTSIN","SIT_SSIDTDEN","PTT_SPTDTEFF","PTT_SPTDTSCA","SIT_DTANTIFRODE","PTT_DTELABOR"]
columns_to_drop = ['SIT_TPFRODESI', 'SIT_CITTAACC',
                   'SIT_INDIRACC', 'SIT_NUMCIVACC', 'SIT_CAPACC', 'SIT_LONGITACC',
                   'SIT_LATITACC', 'cap_pad', 'SIT_DTANTIFRODE']
comp = 'mycomp'
file_completo = os.path.join(dataDir, "db4modelrisk_" + comp + ".csv")
db4scoring = pd.read_csv(filepath_or_buffer=file_completo, sep=";", encoding='latin1',
                         header=0, infer_datetime_format=True, na_values=[''],
                         keep_default_na=False, parse_dates=date_cols,
                         dtype=dtype_dict, nrows=500000)
db4scoring.drop(labels=columns_to_drop, axis=1, inplace=True)
Then, after I set up an H2O cluster, I import it into H2O using db4scoring_h2o = H2OFrame(db4scoring) and convert the categorical predictors to factors, for example:
db4scoring_h2o["SIT_SADTPROV"]=db4scoring_h2o["SIT_SADTPROV"].asfactor()
db4scoring_h2o["PTT_SPTFRAZ"]=db4scoring_h2o["PTT_SPTFRAZ"].asfactor()
When I check the data types using db4scoring.dtypes, I see that they are properly set, but when I import the frame into H2O the H2OFrame performs some unwanted conversions to enum (e.g. from float or from int). I wonder if there is a way to specify the column types when creating the H2OFrame.

Yes, there is. See the H2OFrame doc here: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
You just need to use the column_types argument when you create the H2OFrame.
Here's a short example:
# imports
import h2o
import numpy as np
import pandas as pd
# create a small random pandas df
df = pd.DataFrame(np.random.randint(0, 10, size=(10, 2)),
                  columns=list('AB'))
print(df)
#    A  B
# 0  5  0
# 1  1  3
# 2  4  8
# 3  3  9
# ...
# start h2o, convert pandas frame to H2OFrame
# use column_types dict to set data types
h2o.init()
h2o_df = h2o.H2OFrame(df, column_types={'A':'numeric', 'B':'enum'})
h2o_df.describe() # you should now see the desired data types
#        A    B
# type   int  enum
# ...
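Applied to the question's frame, the same column_types argument replaces the per-column asfactor() calls. Here is a sketch using a few of the question's column names (the full mapping would list every column whose type you want to pin; columns left out should keep H2O's inferred types):
db4scoring_h2o = h2o.H2OFrame(db4scoring,
                              column_types={'SIT_SADTPROV': 'enum',    # instead of .asfactor()
                                            'PTT_SPTFRAZ': 'enum',     # instead of .asfactor()
                                            'SIT_GEOCODE': 'string'})  # keep as string, not enum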


Related

Retain R dataframe index values when converting to a pandas dataframe

I fitted a mixed effects model using the R (base version 3.5.2) package lme4, run via rpy2 2.9.4 from Python 3.6.
I am able to print the random effects as an indexed dataframe, where the index values are the values of the categorical variable(s) used to define the groups (using the radon data):
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri, default_converter
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr
lme4 = importr('lme4')
mod = lme4.lmer(**kwargs) # Omitting arguments for brevity
r_ranef = ro.r['ranef']
re = r_ranef(mod)
print(re[1])
                    Uppm   (Intercept)         floor   (Intercept)
AITKIN     -0.0026783361 -2.588735e-03  1.742426e-09 -0.0052003670
ANOKA      -0.0056688495 -6.418760e-03 -4.482764e-09 -0.0128942943
BECKER      0.0021906431  1.190746e-03  1.211201e-09  0.0023920238
BELTRAMI    0.0093246041  8.190172e-03  5.135196e-09  0.0164527872
BENTON      0.0018747838  1.049496e-03  1.746748e-09  0.0021082742
BIG STONE  -0.0073756824 -2.430404e-03  0.000000e+00 -0.0048823057
BLUE EARTH  0.0112939204  4.176931e-03  5.507525e-09  0.0083908075
BROWN       0.0069223055  2.544912e-03  4.911563e-11  0.0051123339
Converting this to a pandas DataFrame, the categorical values are lost from the index and replaced by integers:
pandas2ri.ri2py_dataframe(re[1])  # re is the R list of data frames returned by ranef
       Uppm  (Intercept)         floor  (Intercept)
0 -0.002678    -0.002589  1.742426e-09    -0.005200
1 -0.005669    -0.006419 -4.482764e-09    -0.012894
2  0.002191     0.001191  1.211201e-09     0.002392
3  0.009325     0.008190  5.135196e-09     0.016453
4  0.001875     0.001049  1.746748e-09     0.002108
5 -0.007376    -0.002430  0.000000e+00    -0.004882
6  0.011294     0.004177  5.507525e-09     0.008391
7  0.006922     0.002545  4.911563e-11     0.005112
How do I retain the values of the original index?
The doc suggests as.data.frame could contain grp, which might be the values I'm after, but I'm struggling to implement that through rpy2; e.g.,
r_ranef = ro.r['ranef.as.data.frame']
does not work
Consider adding row.names as a new column in the R data frame and then using this column to set_index in the Pandas data frame:
base = importr('base')
# ADD NEW COLUMN TO R DATA FRAME
re[1] = base.transform(re[1], index = base.row_names(re[1]))
# SET INDEX IN PANDAS DATA FRAME
py_df = (pandas2ri.ri2py_dataframe(re[1])
         .set_index('index')
         .rename_axis(None))
And to do the same across all data frames in the list, use R's lapply loop and then a Python list comprehension to build a new list of indexed Pandas data frames.
base = importr('base')
mod = lme4.lmer(**kwargs) # Omitting arguments for brevity
r_ranef = lme4.ranef(mod)
# R LAPPLY
new_r_ranef = base.lapply(r_ranef, lambda df:
                          base.transform(df, index=base.row_names(df)))
# PYTHON LIST COMPREHENSION
py_df_list = [(pandas2ri.ri2py_dataframe(df)
               .set_index('index')
               .rename_axis(None))
              for df in new_r_ranef]
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri, default_converter
from rpy2.robjects.conversion import localconverter
r_dataf = ro.r("""
data.frame(
  Uppm = rnorm(5),
  row.names = letters[1:5]
)
""")
with localconverter(default_converter + pandas2ri.converter) as conv:
    pd_dataf = conv.rpy2py(r_dataf)
# row names are "a".."e"
print(r_dataf)
# row names / indexes are now 0..4
print(pd_dataf)
This is likely a minor bug/missing feature in rpy2, but the workaround is rather straightforward:
with localconverter(default_converter + pandas2ri.converter) as conv:
    pd_dataf = conv.rpy2py(r_dataf)
    pd_dataf.index = r_dataf.rownames
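The same rownames fix extends across the whole list returned by ranef. A minimal sketch under the same converter setup, assuming re is the R list from the question:
with localconverter(default_converter + pandas2ri.converter) as conv:
    py_df_list = []
    for r_df in re:  # each element is an R data.frame of random effects
        p_df = conv.rpy2py(r_df)
        p_df.index = r_df.rownames  # restore the group labels lost in conversion
        py_df_list.append(p_df)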

KeyError: 'column_name'

I am writing Python code that should read the values of two columns, but I am getting KeyError: 'column_name'. Can anyone please tell me how to fix this issue?
import numpy as np
from sklearn.cluster import KMeans
import pandas as pd
### For the purposes of this example, we store feature data from our
### dataframe `df`, in the `f1` and `f2` arrays. We combine this into
### a feature matrix `X` before entering it into the algorithm.
df = pd.read_csv(r'C:\Users\Desktop\data.csv')
print (df)
#df = pd.read_csv(csv_file)
"""
saved_column = df.Distance_Feature
saved_column = df.Speeding_Feature
print(saved_column)
"""
f1 = df['Distance_Feature'].tolist()
f2 = df['Speeding_Feature'].tolist()
print(f1)
print(f2)
X = np.array(list(zip(f1, f2)))  # np.matrix(zip(...)) fails on Python 3; build a 2-D array instead
print(X)
kmeans = KMeans(n_clusters=2).fit(X)
Can anyone please help me.
Assuming 'C:\Users\Desktop\data.csv' contains the following whitespace-separated data:
Distance_Feature  Speeding_Feature
1                 2
3                 4
5                 6
...
Change
df = pd.read_csv(r'C:\Users\Desktop\data.csv')
to
df = pd.read_csv(r'C:\Users\Desktop\data.csv', sep=r"\s+")
# The file is whitespace-separated, not comma-separated; if another separator is used, change `sep`.
# (The single regex r"\s+" covers any mix of spaces and tabs.)
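A KeyError right after read_csv usually means the separator was wrong and the whole header line was parsed as a single column. A quick check worth running before any indexing (same path as in the question):
import pandas as pd
df = pd.read_csv(r'C:\Users\Desktop\data.csv', sep=r"\s+")
print(df.columns.tolist())
# expected: ['Distance_Feature', 'Speeding_Feature']
# with the wrong separator you would instead see one combined name,
# e.g. ['Distance_Feature Speeding_Feature'], which triggers the KeyError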

Get the name of a pandas DataFrame

How do I get the name of a DataFrame and print it as a string?
Example:
boston (the variable name assigned to a CSV file)
import pandas as pd
boston = pd.read_csv('boston.csv')
print('The winner is team A based on the %s table.' % boston)  # intended: substitute the variable's name, 'boston', not its contents
You can name the dataframe with the following, and then call the name wherever you like:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.ones([4, 4]))
df.name = 'Ones'
print(df.name)
>>>
Ones
Sometimes df.name doesn't work;
you might get the error message:
'DataFrame' object has no attribute 'name'
Try the function below:
def get_df_name(df):
    name = [x for x in globals() if globals()[x] is df][0]
    return name
In many situations, a custom attribute attached to a pd.DataFrame object is not necessary. In addition, note that pandas-object attributes may not serialize. So pickling will lose this data.
Instead, consider creating a dictionary with appropriately named keys and access the dataframe via dfs['some_label'].
df = pd.DataFrame()
dfs = {'some_label': df}
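The label is then available wherever the frame is used, which covers the original use case without attaching anything to the DataFrame; for example (file name taken from the question):
import pandas as pd
dfs = {'boston': pd.read_csv('boston.csv')}
for name, frame in dfs.items():
    print('The winner is team A based on the %s table.' % name)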
From the pandas documentation, what I understand is that DataFrames are:
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects.
And Series are:
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.).
Series have a name attribute which can be accessed like so:
In [27]: s = pd.Series(np.random.randn(5), name='something')
In [28]: s
Out[28]:
0    0.541
1   -1.175
2    0.129
3    0.043
4   -0.429
Name: something, dtype: float64
In [29]: s.name
Out[29]: 'something'
EDIT: Based on OP's comments, I think OP was looking for something like:
>>> df = pd.DataFrame(...)
>>> df.name = 'df' # making a custom attribute that DataFrame doesn't intrinsically have
>>> print(df.name)
'df'
DataFrames don't have names, but you have an (experimental) attribute dictionary you can use. For example:
df.attrs['name'] = "My name" # Can be retrieved later
Attributes are retained through some operations.
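For instance, in recent pandas versions attrs survives a copy(); a quick check:
import pandas as pd
df = pd.DataFrame({'a': [1, 2]})
df.attrs['name'] = 'My name'
print(df.copy().attrs)  # {'name': 'My name'} -- the attribute is carried over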
Here is a sample function; `df.name = file` (the sixth line in the code below) stores each file's name on its frame:
def df_list():
    filename_list = current_stage_files(PATH)
    df_list = []
    for file in filename_list:
        df = pd.read_csv(PATH + file)
        df.name = file
        df_list.append(df)
    return df_list
I am working on a module for feature analysis and I had the same need as yours, as I would like to generate a report with the name of the pandas.DataFrame being analyzed. To solve this, I used the same solution presented by @scohe001 and @LeopardShark, originally in https://stackoverflow.com/a/18425523/8508275, implemented with the inspect library:
import inspect
def aux_retrieve_name(var):
    callers_local_vars = inspect.currentframe().f_back.f_back.f_locals.items()
    return [var_name for var_name, var_val in callers_local_vars if var_val is var]
Note the additional .f_back term since I intend to call it from another function:
def header_generator(df):
    print('--------- Feature Analyzer ----------')
    print('Dataframe name: "{}"'.format(aux_retrieve_name(df)[0]))  # [0]: the helper returns a list of matching names
    print('Memory usage: {:03.2f} MB'.format(df.memory_usage(deep=True).sum() / 1024 ** 2))
    return
Running this code with a given dataframe, I get the following output:
header_generator(trial_dataframe)
--------- Feature Analyzer ----------
Dataframe name: "trial_dataframe"
Memory usage: 63.08 MB

Associating units with Pandas DataFrame

I'm using a web service that returns a CSV response in which the 1st row contains the column names, and the 2nd row contains the column units, for example:
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
I can read this into a Pandas DataFrame:
import pandas as pd
from io import StringIO  # Python 3; the original used Python 2's `from StringIO import StringIO`
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
# Create a Pandas DataFrame (the regex separator needs the python engine)
obs = pd.read_csv(StringIO(x.strip()), sep=r",\s*", engine='python')
print(obs)
which produces
       longitude       latitude
0   degrees_east  degrees_north
1       -142.842          -1.82
2        -25.389          39.87
3        -37.704         27.114
But what would be the best approach to associate the units with the DataFrame columns for later use, for example labeling plots?
Allowing pandas to read the second line as data screws up the dtype for the columns. Instead of a float dtype, the presence of strings makes the dtype of each column object, and the underlying values, even the numbers, are strings. This breaks all numerical operations:
In [8]: obs['latitude']+obs['longitude']
Out[8]:
0 degrees_northdegrees_east
1 -1.82-142.842
2 39.87-25.389
3 27.114-37.704
In [9]: obs['latitude'][1]
Out[9]: '-1.82'
So it is imperative that pd.read_csv skip the second line.
The following is pretty ugly, but given the format of the input, I don't see a better way.
import pandas as pd
from io import StringIO  # Python 3
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
content = StringIO(x.strip())
def read_csv(content):
    # first line: column names; second line: units
    columns = next(content).strip().split(',')
    units = next(content).strip().split(',')
    # read the remaining numeric rows (pd.read_table is deprecated; pd.read_csv is equivalent here)
    obs = pd.read_csv(content, sep=r",\s*", header=None, engine='python')
    obs.columns = ['{c} ({u})'.format(c=col, u=unit)
                   for col, unit in zip(columns, units)]
    return obs
obs = read_csv(content)
print(obs)
#    longitude (degrees_east)  latitude (degrees_north)
# 0                  -142.842                    -1.820
# 1                   -25.389                    39.870
# 2                   -37.704                    27.114
print(obs.dtypes)
# longitude (degrees_east)    float64
# latitude (degrees_north)    float64
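On pandas >= 1.0 there is another option: keep the column names clean and store the units in the frame's experimental attrs dict instead. A hypothetical variant of the read_csv function above (same parsing, units moved out of the column labels):
def read_csv_with_units(content):
    columns = next(content).strip().split(',')
    units = next(content).strip().split(',')
    obs = pd.read_csv(content, sep=r",\s*", header=None, engine='python', names=columns)
    obs.attrs['units'] = dict(zip(columns, units))  # e.g. obs.attrs['units']['longitude'] -> 'degrees_east'
    return obs
The unit strings are then available later, e.g. for plot labels, without cluttering column access.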

Exporting Pandas DataFrame with MultiIndex

I have just discovered pandas and am impressed by its capabilities.
I am having difficulties understanding how to work with DataFrame with MultiIndex.
I have two questions:
(1) Exporting the DataFrame
Here is my problem:
This dataset
import pandas as pd
from io import StringIO  # Python 3; the question originally used Python 2's StringIO module
d1 = StringIO(
"""Gender,Employed,Region,Degree
m,yes,east,ba
m,yes,north,ba
f,yes,south,ba
f,no,east,ba
f,no,east,bsc
m,no,north,bsc
m,yes,south,ma
f,yes,west,phd
m,no,west,phd
m,yes,west,phd """
)
df = pd.read_csv(d1)
# Frequencies tables
tab1 = pd.crosstab(df.Gender, df.Region)
tab2 = pd.crosstab(df.Gender, [df.Region, df.Degree])
tab3 = pd.crosstab([df.Gender, df.Employed], [df.Region, df.Degree])
# Now we export the datasets
tab1.to_excel('H:/test_tab1.xlsx') # OK
tab2.to_excel('H:/test_tab2.xlsx') # fails
tab3.to_excel('H:/test_tab3.xlsx') # fails
One work-around I could think of is to change the columns (the way R does):
def NewColums(DFwithMultiIndex):
    NewCol = []
    for item in DFwithMultiIndex.columns:
        NewCol.append('-'.join(item))
    return NewCol
# New Columns
tab2.columns = NewColums(tab2)
tab3.columns = NewColums(tab3)
# New export
tab2.to_excel('H:/test_tab2.xlsx') # OK
tab3.to_excel('H:/test_tab3.xlsx') # OK
My question is: is there a more efficient way to do this in pandas that I missed in the documentation?
(2) Selecting columns
This new structure does not allow selecting columns on a given variable (the advantage of hierarchical indexing in the first place). How can I select columns containing a given string (e.g. '-ba')?
P.S.: I have seen this question, which is related, but I have not understood the proposed reply.
This looks like a bug in to_excel; for the moment, as a workaround, I would recommend using to_csv (which does not seem to show this issue).
I added this as an issue on GitHub.
To answer the second question, if you really need to use to_excel...
You can use filter to select only those columns which include '-ba' (on Python 3, wrap it in list(), since filter returns an iterator):
In [21]: list(filter(lambda x: '-ba' in x, tab2.columns))
Out[21]: ['east-ba', 'north-ba', 'south-ba']
In [22]: tab2[list(filter(lambda x: '-ba' in x, tab2.columns))]
Out[22]:
        east-ba  north-ba  south-ba
Gender
f             1         0         1
m             1         1         0
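If the MultiIndex is kept instead of flattened, selection by level avoids the string matching entirely. A sketch against the original tab2 from the question, before its columns were renamed:
tab2_multi = pd.crosstab(df.Gender, [df.Region, df.Degree])
ba_cols = tab2_multi.xs('ba', axis=1, level='Degree')  # all columns with Degree == 'ba'
print(ba_cols)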
