Associating units with Pandas DataFrame - python

I'm using a web service that returns a CSV response in which the 1st row contains the column names, and the 2nd row contains the column units, for example:
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
I can read this into a Pandas DataFrame:
import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
# Create a Pandas DataFrame
obs = pd.read_csv(StringIO(x.strip()), sep=r",\s*", engine="python")
print(obs)
which produces
longitude latitude
0 degrees_east degrees_north
1 -142.842 -1.82
2 -25.389 39.87
3 -37.704 27.114
But what would be the best approach to associate the units with the DataFrame columns for later use, for example labeling plots?

Allowing pandas to read the second line as data is screwing up the dtype of the columns. Instead of a float dtype, the presence of strings makes the dtype of the columns object, and the underlying values, even the numbers, are strings. This screws up all numerical operations:
In [8]: obs['latitude']+obs['longitude']
Out[8]:
0 degrees_northdegrees_east
1 -1.82-142.842
2 39.87-25.389
3 27.114-37.704
In [9]: obs['latitude'][1]
Out[9]: '-1.82'
So it is imperative that pd.read_csv skip the second line.
The following is pretty ugly, but given the format of the input, I don't see a better way.
import pandas as pd
from io import StringIO  # Python 3; on Python 2 use: from StringIO import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
content = StringIO(x.strip())
def read_csv(content):
    # The first two lines hold the column names and the units
    columns = next(content).strip().split(',')
    units = next(content).strip().split(',')
    # Everything that remains in the buffer is numeric data
    obs = pd.read_csv(content, sep=r",\s*", header=None, engine="python")
    obs.columns = ['{c} ({u})'.format(c=col, u=unit)
                   for col, unit in zip(columns, units)]
    return obs
obs = read_csv(content)
print(obs)
# longitude (degrees_east) latitude (degrees_north)
# 0 -142.842 -1.820
# 1 -25.389 39.870
# 2 -37.704 27.114
print(obs.dtypes)
# longitude (degrees_east) float64
# latitude (degrees_north) float64
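If you would rather keep the column names clean and hold the units separately (for example, to build plot labels later), one possible variation is sketched below. It assumes the units always sit on the second line of the file; `read_with_units` and the `units` dictionary are hypothetical names introduced for illustration, not part of pandas.

```python
from io import StringIO

import pandas as pd

raw = """longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
"""


def read_with_units(text):
    """Return (data, units), assuming line 1 holds names and line 2 holds units."""
    # Read the header plus a single row: that row is the units line.
    head = pd.read_csv(StringIO(text), nrows=1)
    units = head.iloc[0].to_dict()
    # Read again, skipping the units line so the columns are parsed as numbers.
    data = pd.read_csv(StringIO(text), skiprows=[1])
    return data, units


obs, units = read_with_units(raw)

# Build axis labels from the separate units mapping when they are needed.
labels = ["{} ({})".format(col, units[col]) for col in obs.columns]
print(obs.dtypes)
print(labels)
```

Keeping the units in a plain dictionary means the column names stay short for indexing, at the cost of having to pass the dictionary around alongside the DataFrame.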

Related

Count of number of locations within certain distance

I have a dataframe named SD_Apartments that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of apartment names, and their coordinates.
I have another dataframe named SD_Coffee that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of coffee shop names, and their coordinates.
I want to add another variable to SD_apartments called coffee_count that would have the number of coffee shop locations listed in my SD_coffee dataframe that are within x (for example, 300) meters from each apartment listed in SD_apartments.
Here is a setup of the code I'm working with:
import pandas as pd
import geopy.distance
from geopy.distance import geodesic
data = [['Insomnia', 32.784782, -117.129130], ['Starbucks', 32.827521, -117.139966], ['Dunkin', 32.778519, -117.154720]]
data1 = [['DreamAPT', 32.822090, -117.184200], ['OKAPT', 32.748081, -117.130691], ['BadAPT', 32.786886, -117.097536]]
SD_Coffee = pd.DataFrame(data, columns = ['name', 'latitude', 'longitude'])
SD_Apartments = pd.DataFrame(data1, columns = ['name', 'latitude', 'longitude'])
Here is the code I'm attempting to use to accomplish my goal:
def geodesic_pd(df1, df2_row):
    return [(geodesic([tuple(x) for x in row.values], [tuple(x) for x in df2_row.values]).m for row in df1)]
SD_Apartments['coffee_count'] = pd.Series([(sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row) < 300) for row in SD_Apartments[['latitude', 'longitude']])])
If you run it and print SD_Apartments, you will see that SD_Apartments looks like:
name ... coffee_count
0 DreamAPT ... <generator object <genexpr> at 0x000002E178849...
1 OKAPT ... NaN
2 BadAPT ... NaN
This will probably help you:
import pandas as pd
df = pd.DataFrame({'geodesic': [1, 10, 8, 11, 20,2,2],'apartment': list('aaceeee')})
df.nsmallest(3, 'geodesic')
Another way of doing this is by using K-Nearest neighbors using the geodesic distance:
SKLearn-KNN
Assuming you are using pandas dataframes, you should be able to use something like this unless you have very large arrays -
import numpy as np
from geopy.distance import geodesic
def geodesic_pd(df1, df2_row):
    # Distance in meters from every row of df1 to the single point df2_row
    dist = []
    for _, row in df1.iterrows():
        dist.append(geodesic(tuple(row.values), tuple(df2_row.values)).m)
    return np.array(dist)
SD_Apartments['coffee_count'] = SD_Apartments.apply(lambda row: sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row[['latitude', 'longitude']]) < 300), axis=1)
The geodesic_pd function extends the geodesic calculation from a pair of individual tuples to a whole dataframe against one point, and the next statement counts the coffee shops within 300 meters of each apartment and stores the counts in a new column.
If you have large arrays, you should combine this with KNN so that the distance calculation is only performed over a subset of nearby points.
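As a rough alternative to the row-by-row loop, the distances can be computed for all apartment/coffee-shop pairs at once with NumPy broadcasting. The sketch below swaps geopy's geodesic for the haversine great-circle formula on a spherical Earth, so the distances are an approximation; `count_within` and `EARTH_RADIUS_M` are hypothetical names introduced here.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_M = 6371000.0  # assumed mean Earth radius for the spherical model

data = [['Insomnia', 32.784782, -117.129130], ['Starbucks', 32.827521, -117.139966], ['Dunkin', 32.778519, -117.154720]]
data1 = [['DreamAPT', 32.822090, -117.184200], ['OKAPT', 32.748081, -117.130691], ['BadAPT', 32.786886, -117.097536]]
SD_Coffee = pd.DataFrame(data, columns=['name', 'latitude', 'longitude'])
SD_Apartments = pd.DataFrame(data1, columns=['name', 'latitude', 'longitude'])


def count_within(points, targets, radius_m):
    """For each row of points, count the rows of targets closer than radius_m."""
    # Column vectors for points, row vectors for targets: broadcasting
    # produces a (len(points), len(targets)) matrix of pairwise values.
    lat1 = np.radians(points['latitude'].to_numpy())[:, None]
    lon1 = np.radians(points['longitude'].to_numpy())[:, None]
    lat2 = np.radians(targets['latitude'].to_numpy())[None, :]
    lon2 = np.radians(targets['longitude'].to_numpy())[None, :]
    # Haversine formula (great-circle distance, not the ellipsoidal geodesic).
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_m = 2 * EARTH_RADIUS_M * np.arcsin(np.sqrt(a))
    return (dist_m < radius_m).sum(axis=1)


SD_Apartments['coffee_count'] = count_within(SD_Apartments, SD_Coffee, 300)
print(SD_Apartments)
```

The full distance matrix needs memory proportional to the number of apartments times the number of coffee shops, so for very large inputs a spatial index is still the better fit.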

Retain R dataframe index values when converting to a pandas dataframe

I fitted a mixed effects model using the R (base version 3.5.2) package lme4, run via rpy2 2.9.4 from Python 3.6.
I am able to print the random effects as an indexed dataframe, where the index values are the values of the categorical variable(s) used to define the groups (using radon data):
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri, default_converter
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr
lme4 = importr('lme4')
mod = lme4.lmer(**kwargs) # Omitting arguments for brevity
r_ranef = ro.r['ranef']
re = r_ranef(mod)
print(re[1])
Uppm (Intercept) floor (Intercept)
AITKIN -0.0026783361 -2.588735e-03 1.742426e-09 -0.0052003670
ANOKA -0.0056688495 -6.418760e-03 -4.482764e-09 -0.0128942943
BECKER 0.0021906431 1.190746e-03 1.211201e-09 0.0023920238
BELTRAMI 0.0093246041 8.190172e-03 5.135196e-09 0.0164527872
BENTON 0.0018747838 1.049496e-03 1.746748e-09 0.0021082742
BIG STONE -0.0073756824 -2.430404e-03 0.000000e+00 -0.0048823057
BLUE EARTH 0.0112939204 4.176931e-03 5.507525e-09 0.0083908075
BROWN 0.0069223055 2.544912e-03 4.911563e-11 0.0051123339
Converting this to a pandas DataFrame, the categorical values are lost from the index and replaced by integers:
pandas2ri.ri2py_dataframe(r_ranef[1]) # r_ranef is a dict of dataframes
Uppm (Intercept) floor (Intercept)
0 -0.002678 -0.002589 1.742426e-09 -0.005200
1 -0.005669 -0.006419 -4.482764e-09 -0.012894
2 0.002191 0.001191 1.211201e-09 0.002392
3 0.009325 0.008190 5.135196e-09 0.016453
4 0.001875 0.001049 1.746748e-09 0.002108
5 -0.007376 -0.002430 0.000000e+00 -0.004882
6 0.011294 0.004177 5.507525e-09 0.008391
7 0.006922 0.002545 4.911563e-11 0.005112
How do I retain the values of the original index?
The doc suggests as.data.frame could contain grp, which might be the values I'm after, but I'm struggling to implement that through rpy2; e.g.,
r_ranef = ro.r['ranef.as.data.frame']
does not work
Consider adding row.names as a new column in R data frame and then use this column to set_index in Pandas data frame:
base = importr('base')
# ADD NEW COLUMN TO R DATA FRAME
re[1] = base.transform(re[1], index = base.row_names(re[1]))
# SET INDEX IN PANDAS DATA FRAME
py_df = (pandas2ri.ri2py_dataframe(re[1])
         .set_index('index')
         .rename_axis(None)
         )
And to do so across all data frames in list, use R's lapply loop and then Python's list comprehension for new list of Pandas indexed data frames.
base = importr('base')
mod = lme4.lmer(**kwargs) # Omitting arguments for brevity
r_ranef = lme4.ranef(mod)
# R LAPPLY
new_r_ranef = base.lapply(r_ranef, lambda df:
                          base.transform(df, index=base.row_names(df)))
# PYTHON LIST COMPREHENSION
py_df_list = [(pandas2ri.ri2py_dataframe(df)
               .set_index('index')
               .rename_axis(None)
               ) for df in new_r_ranef]
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri, default_converter
from rpy2.robjects.conversion import localconverter
r_dataf = ro.r("""
data.frame(
Uppm = rnorm(5),
row.names = letters[1:5]
)
""")
with localconverter(default_converter + pandas2ri.converter) as conv:
    pd_dataf = conv.rpy2py(r_dataf)
# row names are "a".."e"
print(r_dataf)
# row names / indexes are now 0..4
print(pd_dataf)
This is likely a minor bug/missing feature in rpy2, but the workaround is rather straightforward:
with localconverter(default_converter + pandas2ri.converter) as conv:
    pd_dataf = conv.rpy2py(r_dataf)
pd_dataf.index = r_dataf.rownames

Appending multiple elements to a dataframe

I have a function that extracts a number of variables from zillow. I used a lambda function to append the returned values to a dataframe. I am wondering if there is a faster way to return all the variables and append them to the dataframe instead of individually.
Here is my code:
from xml.dom.minidom import parse,parseString
import xml.dom.minidom
import requests
import sys
import pandas as pd
import numpy as np
l_zwsid=''
df = pd.read_csv('data.csv')
def getElementValue(p_dom,p_element):
    if len(p_dom.getElementsByTagName(p_element)) > 0:
        l_value=p_dom.getElementsByTagName(p_element)[0]
        return(l_value.firstChild.data)
    else:
        l_value='NaN'
        return(l_value)
def getData(l_zwsid, a_addr, a_zip):
    try:
        l_url='http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id='+l_zwsid+'&address='+a_addr+'&citystatezip='+a_zip
        xml=requests.get(l_url)
        dom=parseString(xml.text)
        responses=dom.getElementsByTagName('response')
        zpid=getElementValue(dom,'zpid')
        usecode=getElementValue(dom,'useCode')
        taxyear=getElementValue(dom,'taxAssessmentYear')
        tax=getElementValue(dom,'taxAssessment')
        yearbuilt=getElementValue(dom,'yearBuilt')
        sqft=getElementValue(dom,'finishedSqFt')
        lotsize=getElementValue(dom,'lotSizeSqFt')
        bathrooms=getElementValue(dom,'bathrooms')
        bedrooms=getElementValue(dom,'bedrooms')
        totalrooms=getElementValue(dom,'totalRooms')
        lastSale=getElementValue(dom,'lastSoldDate')
        lastPrice=getElementValue(dom,'lastSoldPrice')
        latitude=getElementValue(dom, 'latitude')
        longitude=getElementValue(dom, 'longitude')
        for response in responses:
            addresses=response.getElementsByTagName('address')
            for addr in addresses:
                street=getElementValue(addr,'street')
                zipcode=getElementValue(addr,'zipcode')
            zestimates=response.getElementsByTagName('zestimate')
            for zest in zestimates:
                amt=getElementValue(zest,'amount')
                lastupdate=getElementValue(zest,'last-updated')
                valranges=zest.getElementsByTagName('valuationRange')
                for val in valranges:
                    low=getElementValue(val,'low')
                    high=getElementValue(val,'high')
        return longitude, latitude
    except AttributeError:
        return None
df['Longtitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis = 1)
df['Latitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis = 1)
This currently does not work because the new columns will contain both the longitude and latitude.
Your getData function returns a tuple, which is why both columns have both lat and lon. One workaround could be to parameterise this function as follows:
def getData(l_zwsid, a_addr, a_zip, axis='lat'):
    valid = ['lat', 'lon']
    if axis not in valid:
        raise ValueError(f'axis must be one of {valid}')
    ...
    if axis == 'lat':
        return latitude
    else:
        return longitude
This won't improve efficiency, however; it will make it even slower. Your main overhead comes from making API calls for every row in the DataFrame, so you are constrained by network performance.
You can make your getData function return a string that contains comma-separated values of all the elements.
Append this CSV string as an ALL_TEXT column in the dataframe df.
Split the ALL_TEXT column into multiple columns (Lat, long, zipcode, street, etc.):
def split_into_columns(text):
    required_columns = ['Latitude', 'Longtitude', 'Zipcode']
    columns_value_list = text['ALL_TEXT'].split(',')
    for i in range(len(required_columns)):
        text[required_columns[i]] = columns_value_list[i]
    return text
df= pd.DataFrame([ ['11.49, 12.56, 9823A'], ['14.02, 15.29, 9674B'] ], columns=['ALL_TEXT'])
updated_df = df.apply(split_into_columns, axis=1)
df
ALL_TEXT
0 11.49, 12.56, 9823A
1 14.02, 15.29, 9674B
updated_df
ALL_TEXT Latitude Longtitude Zipcode
0 11.49, 12.56, 9823A 11.49 12.56 9823A
1 14.02, 15.29, 9674B 14.02 15.29 9674B
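Another option, sketched here under some assumptions, is to call the lookup once per row and have it return several named values, which pandas can expand into columns when the applied function returns a Series. `fake_lookup` below is a hypothetical stand-in for getData (it makes no network request and returns dummy numbers), so only the apply/join pattern is meant to carry over.

```python
import pandas as pd


def fake_lookup(street, zipcode):
    """Hypothetical replacement for getData: returns dummy values, no API call."""
    return {'Longitude': -float(len(street)), 'Latitude': float(len(zipcode))}


df = pd.DataFrame({'Street': ['1 Main St', '22 Oak Ave'],
                   'Zip': ['98101', '98052']})

# One call per row; returning a Series makes apply build one column per key.
coords = df.apply(lambda row: pd.Series(fake_lookup(row['Street'], row['Zip'])),
                  axis=1)

# Attach the new columns to the original frame in a single step.
df = df.join(coords)
print(df)
```

This halves the number of calls compared with running apply once per output column, although the per-row network request remains the dominant cost.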

h2o frame from pandas casting

I am using h2o to perform predictive modeling from python.
I have loaded some data from a csv using pandas, specifying some column types:
dtype_dict = {'SIT_SSICCOMP':'object',
'SIT_CAPACC':'object',
'PTT_SSIRMPOL':'object',
'PTT_SPTCLVEI':'object',
'cap_pad':'object',
'SIT_SADNS_RESP_PERC':'object',
'SIT_GEOCODE':'object',
'SIT_TIPOFIRMA':'object',
'SIT_TPFRODESI':'object',
'SIT_CITTAACC':'object',
'SIT_INDIRACC':'object',
'SIT_NUMCIVACC':'object'
}
date_cols = ["SIT_SSIDTSIN","SIT_SSIDTDEN","PTT_SPTDTEFF","PTT_SPTDTSCA","SIT_DTANTIFRODE","PTT_DTELABOR"]
columns_to_drop = ['SIT_TPFRODESI','SIT_CITTAACC',
'SIT_INDIRACC', 'SIT_NUMCIVACC', 'SIT_CAPACC', 'SIT_LONGITACC',
'SIT_LATITACC','cap_pad','SIT_DTANTIFRODE']
comp='mycomp'
file_completo = os.path.join(dataDir,"db4modelrisk_"+comp+".csv")
db4scoring = pd.read_csv(filepath_or_buffer=file_completo,sep=";", encoding='latin1',
header=0,infer_datetime_format =True,na_values=[''], keep_default_na =False,
parse_dates=date_cols,dtype=dtype_dict,nrows=500e3)
db4scoring.drop(labels=columns_to_drop,axis=1,inplace =True)
Then, after I set up a h2o cluster I import it in h2o using db4scoring_h2o = H2OFrame(db4scoring) and I convert categorical predictors in factor for example:
db4scoring_h2o["SIT_SADTPROV"]=db4scoring_h2o["SIT_SADTPROV"].asfactor()
db4scoring_h2o["PTT_SPTFRAZ"]=db4scoring_h2o["PTT_SPTFRAZ"].asfactor()
When I check the data types using db4scoring.dtypes I see that they are properly set, but when I import the frame into h2o I notice that H2OFrame performs some unwanted conversions to enum (e.g. from float or from int). I wonder if there is a way to specify the variable format in H2OFrame.
Yes, there is. See the H2OFrame doc here: http://docs.h2o.ai/h2o/latest-stable/h2o-py/docs/frame.html#h2oframe
You just need to use the column_types argument when you cast.
Here's a short example:
# imports
import h2o
import numpy as np
import pandas as pd
# create small random pandas df
df = pd.DataFrame(np.random.randint(0,10,size=(10, 2)),
columns=list('AB'))
print(df)
# A B
#0 5 0
#1 1 3
#2 4 8
#3 3 9
# ...
# start h2o, convert pandas frame to H2OFrame
# use column_types dict to set data types
h2o.init()
h2o_df = h2o.H2OFrame(df, column_types={'A':'numeric', 'B':'enum'})
h2o_df.describe() # you should now see the desired data types
# A B
# type int enum
# ...

How can I add columns in a data frame?

I have the following data:
Example:
DRIVER_ID;TIMESTAMP;POSITION
156;2014-02-01 00:00:00.739166+01;POINT(41.8836718276551 12.4877775603346)
I want to create a pandas dataframe with 4 columns that are the id, time, longitude, latitude.
So far, I got:
cur_cab = pd.DataFrame.from_csv(
path,
sep=";",
header=None,
parse_dates=[1]).reset_index()
cur_cab.columns = ['cab_id', 'datetime', 'point']
path specifies the .txt file containing the data.
I already wrote a function that returns the longitude and latitude values from the point formatted string.
How do I expand the data frame with the additional columns and the split values?
After loading, if you're using a recent version of pandas then you can use the vectorised str methods to parse the column:
In [87]:
df[['pos_x', 'pos_y']] = df['point'].str[6:-1].str.split(expand=True)
df
Out[87]:
cab_id datetime \
0 156 2014-01-31 23:00:00.739166
point pos_x pos_y
0 POINT(41.8836718276551 12.4877775603346) 41.8836718276551 12.4877775603346
Also, you should stop using from_csv, as it's no longer updated. Use the top-level read_csv instead, so your loading code would be:
cur_cab = pd.read_csv(
path,
sep=";",
header=None,
parse_dates=[1],
names=['cab_id', 'datetime', 'point'],
skiprows=1)
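If the POINT strings might vary in spacing, a regular expression with named groups is one possible alternative to fixed slicing. The sketch below builds a one-row frame by hand instead of reading the file, and assumes the two numbers are plain decimals separated by whitespace; the pos_x and pos_y names follow the answer above.

```python
import pandas as pd

cur_cab = pd.DataFrame({
    'cab_id': [156],
    'point': ['POINT(41.8836718276551 12.4877775603346)'],
})

# Named groups become the column names of the extracted frame.
pattern = r'POINT\(\s*(?P<pos_x>[-\d.]+)\s+(?P<pos_y>[-\d.]+)\s*\)'
coords = cur_cab['point'].str.extract(pattern).astype(float)

# Add both numeric columns next to the original data.
cur_cab = cur_cab.join(coords)
print(cur_cab.dtypes)
```

Unlike the split-based version, the extracted columns end up as floats rather than strings because of the explicit astype(float).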
