I have a function that extracts a number of variables from Zillow. I used a lambda function to append the returned values to a dataframe. I am wondering if there is a faster way to return all the variables and append them to the dataframe at once instead of individually.
Here is my code:
from xml.dom.minidom import parse,parseString
import xml.dom.minidom
import requests
import sys
import pandas as pd
import numpy as np
l_zwsid=''
df = pd.read_csv('data.csv')
def getElementValue(p_dom, p_element):
    if len(p_dom.getElementsByTagName(p_element)) > 0:
        l_value = p_dom.getElementsByTagName(p_element)[0]
        return l_value.firstChild.data
    else:
        l_value = 'NaN'
        return l_value
def getData(l_zwsid, a_addr, a_zip):
    try:
        l_url = 'http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id=' + l_zwsid + '&address=' + a_addr + '&citystatezip=' + a_zip
        xml = requests.get(l_url)
        dom = parseString(xml.text)
        responses = dom.getElementsByTagName('response')
        zpid = getElementValue(dom, 'zpid')
        usecode = getElementValue(dom, 'useCode')
        taxyear = getElementValue(dom, 'taxAssessmentYear')
        tax = getElementValue(dom, 'taxAssessment')
        yearbuilt = getElementValue(dom, 'yearBuilt')
        sqft = getElementValue(dom, 'finishedSqFt')
        lotsize = getElementValue(dom, 'lotSizeSqFt')
        bathrooms = getElementValue(dom, 'bathrooms')
        bedrooms = getElementValue(dom, 'bedrooms')
        totalrooms = getElementValue(dom, 'totalRooms')
        lastSale = getElementValue(dom, 'lastSoldDate')
        lastPrice = getElementValue(dom, 'lastSoldPrice')
        latitude = getElementValue(dom, 'latitude')
        longitude = getElementValue(dom, 'longitude')
        for response in responses:
            addresses = response.getElementsByTagName('address')
            for addr in addresses:
                street = getElementValue(addr, 'street')
                zipcode = getElementValue(addr, 'zipcode')
            zestimates = response.getElementsByTagName('zestimate')
            for zest in zestimates:
                amt = getElementValue(zest, 'amount')
                lastupdate = getElementValue(zest, 'last-updated')
                valranges = zest.getElementsByTagName('valuationRange')
                for val in valranges:
                    low = getElementValue(val, 'low')
                    high = getElementValue(val, 'high')
        return longitude, latitude
    except AttributeError:
        return None
df['Longtitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis = 1)
df['Latitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis = 1)
This currently does not work because the new columns will contain both the longitude and latitude.
Your getData function returns a tuple, which is why both columns have both lat and lon. One workaround could be to parameterise this function as follows:
def getData(l_zwsid, a_addr, a_zip, axis='lat'):
    valid = ['lat', 'lon']
    if axis not in valid:
        raise ValueError(f'axis must be one of {valid}')
    ...
    if axis == 'lat':
        return latitude
    else:
        return longitude
This won't improve efficiency, though; if anything it makes things slower, because you still end up calling the API once per column. Your main overhead comes from making an API call for every row in the DataFrame, so you are constrained by network performance.
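If the duplicated calls do bother you, one workaround is to apply getData once and unpack the returned tuples afterwards. A rough sketch, reusing the names from the question and assuming getData still returns (longitude, latitude) or None:

coords = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis=1)
# coords is a Series of (longitude, latitude) tuples, or None where the lookup failed
df['Longtitude'] = coords.apply(lambda t: t[0] if t else np.nan)
df['Latitude'] = coords.apply(lambda t: t[1] if t else np.nan)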
You can:

1. Make your getData function return a string containing comma-separated values of all the elements.
2. Append this CSV string as an ALL_TEXT column in the dataframe df.
3. Split the ALL_TEXT column into multiple columns (Lat, Long, Zipcode, Street, etc.), as shown below.
def split_into_columns(text):
    required_columns = ['Latitude', 'Longtitude', 'Zipcode']
    columns_value_list = text['ALL_TEXT'].split(',')
    for i in range(len(required_columns)):
        text[required_columns[i]] = columns_value_list[i]
    return text
df= pd.DataFrame([ ['11.49, 12.56, 9823A'], ['14.02, 15.29, 9674B'] ], columns=['ALL_TEXT'])
updated_df = df.apply(split_into_columns, axis=1)
df
ALL_TEXT
0 11.49, 12.56, 9823A
1 14.02, 15.29, 9674B
updated_df
ALL_TEXT Latitude Longtitude Zipcode
0 11.49, 12.56, 9823A 11.49 12.56 9823A
1 14.02, 15.29, 9674B 14.02 15.29 9674B
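Tying that back to the question: getData would need to return one comma-separated string per row, and the ALL_TEXT column is then filled with a single apply. A rough sketch, assuming getData is rewritten so it ends with return ','.join([latitude, longitude, zipcode]):

# Assumes getData now returns a 'latitude,longitude,zipcode' string per row
df['ALL_TEXT'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis=1)
updated_df = df.apply(split_into_columns, axis=1)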
Apologies if something similar has been asked before.
I have a task where I need a function that is fed a list of unix times, and a pandas df.
The pandas df has a column for unix time, a column for latitude, and a column for longitude.
I need to extract the latitude from the df where the df unix time matches the unix time in my list I pass to the function.
So far I have:
def nl_lat_array(pandas_df, unixtime_list):
    lat = dict()
    data = pandas_df
    for x, row in data.iterrows():
        if data[data['DateTime_Unix']] == i in unixtime_list:
            lat[i] = data[data['Latitude']]
    v = list(lat.values())
    nl_lat_array = np.array(v)
    return nl_lat_array
This results in the following error:
KeyError: "None of [Float64Index([1585403852.468, 1585403852.518, 1585403852.568, 1585403852.618,\n 1585403852.668, 1585403852.718, 1585403852.768, 1585403852.818,\n 1585403852.868, 1585403852.918,\n ...\n 1585508348.524, 1585508348.574, 1585508348.624, 1585508348.674,\n 1585508348.724, 1585508348.774, 1585508348.824, 1585508348.874,\n 1585508348.924, 1585508348.974],\n dtype='float64', length=2089945)] are in the [columns]"
However, these values in the DataFrame do exist in the list I am passing.
Any help would be greatly appreciated.
import pandas as pd
data = pd.DataFrame([[1,4,7],[2,5,8],[3,6,9]])
data.columns = ['time', 'lat', 'long']
time_list = [1,2]
d = data[data['time'].isin(time_list)]['lat'].values
# [4, 5]
You can do something like this.
filtered_data = data[data['DateTime_Unix'].isin(unixtime_list)]
filtered_data['Latitude'].values
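Wrapped back into the function signature from the question, that becomes something like this (a sketch using the column names you posted):

def nl_lat_array(pandas_df, unixtime_list):
    # keep only the rows whose unix time appears in the list,
    # then return the matching latitudes as a numpy array
    mask = pandas_df['DateTime_Unix'].isin(unixtime_list)
    return pandas_df.loc[mask, 'Latitude'].values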
I have a dataframe named SD_Apartments that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of apartment names, and their coordinates.
I have another dataframe named SD_Coffee that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of coffee shop names, and their coordinates.
I want to add another variable to SD_Apartments called coffee_count that holds the number of coffee shop locations from my SD_Coffee dataframe that are within x (for example, 300) meters of each apartment listed in SD_Apartments.
Here is a setup of the code I'm working with:
import pandas as pd
import geopy.distance
from geopy.distance import geodesic
data = [['Insomnia', 32.784782, -117.129130], ['Starbucks', 32.827521, -117.139966], ['Dunkin', 32.778519, -117.154720]]
data1 = [['DreamAPT', 32.822090, -117.184200], ['OKAPT', 32.748081, -117.130691], ['BadAPT', 32.786886, -117.097536]]
SD_Coffee = pd.DataFrame(data, columns = ['name', 'latitude', 'longitude'])
SD_Apartments = pd.DataFrame(data1, columns = ['name', 'latitude', 'longitude'])
Here is the code I'm attempting to use to accomplish my goal:
def geodesic_pd(df1, df2_row):
    return [(geodesic([tuple(x) for x in row.values], [tuple(x) for x in df2_row.values]).m for row in df1)]
SD_Apartments['coffee_count'] = pd.Series([(sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row) < 300) for row in SD_Apartments[['latitude', 'longitude']])])
If you run it and print SD_Apartments, you will see that SD_Apartments looks like:
name ... coffee_count
0 DreamAPT ... <generator object <genexpr> at 0x000002E178849...
1 OKAPT ... NaN
2 BadAPT ... NaN
This will probably help you:
import pandas as pd
df = pd.DataFrame({'geodesic': [1, 10, 8, 11, 20,2,2],'apartment': list('aaceeee')})
df.nsmallest(3, 'geodesic')
Another way of doing this is to use K-nearest neighbors with the geodesic distance:
SKLearn-KNN
Assuming you are using pandas dataframes, you should be able to use something like this unless you have very large arrays -
import numpy as np

def geodesic_pd(df1, df2_row):
    dist = []
    for _, row in df1.iterrows():
        dist.append(geodesic(tuple(row.values), tuple(df2_row.values)).m)
    return np.array(dist)

SD_Apartments['coffee_count'] = SD_Apartments.apply(lambda row: sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row[['latitude', 'longitude']]) < 300), axis=1)
The geodesic_pd function extends the geodesic calculation from individual tuples to a whole dataframe, and the next statement counts the coffee shops within 300 meters and stores the result in a new column.
If you have large arrays, you should combine this with KNN so the distance calculation only runs over a subset of points.
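For example, here is a hedged sketch of that idea with scikit-learn's BallTree. Note it uses the haversine great-circle distance rather than the geodesic (close enough at a 300 m scale), coordinates must be in radians, and the radius is expressed as a fraction of the Earth's radius:

import numpy as np
from sklearn.neighbors import BallTree

# Build a tree over the coffee-shop coordinates; haversine expects (lat, lon) in radians
coffee_rad = np.radians(SD_Coffee[['latitude', 'longitude']].values)
apt_rad = np.radians(SD_Apartments[['latitude', 'longitude']].values)
tree = BallTree(coffee_rad, metric='haversine')

# 300 m as a fraction of the Earth's mean radius (~6,371 km)
radius = 300.0 / 6371000.0
SD_Apartments['coffee_count'] = tree.query_radius(apt_rad, r=radius, count_only=True)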
I have a csv file with the following columns:
Date|Mkt-RF|SMB|HML|RF|C|aig-RF|ford-RF|ibm-RF|xom-RF|
I am trying to run a multiple OLS regression in python, regressing 'Mkt-RF', 'SMB' and 'HML' on 'aig-RF' for instance.
It seems like I first need to build the DataFrame from the arrays, but I cannot seem to understand how:
# Regression
x = df[['Mkt-RF','SMB','HML']]
y = df['aig-RF']
df = pd.DataFrame({'x':x, 'y':y})
df['constant'] = 1
df.head()
sm.OLS(y,df[['constant','x']]).fit().summary()
The full code is:
import numpy as np
import pandas as pd
from pandas import DataFrame
from sklearn import linear_model
import statsmodels.api as sm
def ReadFF(sIn):
    """
    Purpose:
        Read the FF data

    Inputs:
        sIn     string, name of input file

    Return value:
        df      dataframe, data
    """
    df = pd.read_csv(sIn, header=3, names=["Date", "Mkt-RF", "SMB", "HML", "RF"])
    df = df.dropna(how='any')

    # Reformat the dates, as date-time, and place them as index
    vDate = pd.to_datetime(df["Date"].values, format='%Y%m%d')
    df.index = vDate

    # Add in a constant
    iN = len(vDate)
    df["C"] = np.ones(iN)

    print(df)
    return df
def JoinStock(df, sStock, sPer):
    """
    Purpose:
        Join the stock into the dataframe, as excess returns

    Inputs:
        df      dataframe, data including RF
        sStock  string, name of stock to read
        sPer    string, extension indicating period

    Return value:
        df      dataframe, enlarged
    """
    df1 = pd.read_csv(sStock + "_" + sPer + ".csv", index_col="Date", usecols=["Date", "Adj Close"])
    df1.columns = [sStock]

    # Add prices to original dataframe, to get correct dates
    df = df.join(df1, how="left")

    # Extract returns
    vR = 100 * np.diff(np.log(df[sStock].values))

    # Add a missing, as one observation was lost differencing
    vR = np.hstack([np.nan, vR])

    # Add excess return to dataframe
    df[sStock + "-RF"] = vR - df["RF"]

    print(df)
    return df
def SaveFF(df, asStock, sOut):
    """
    Purpose:
        Save data for FF regressions

    Inputs:
        df      dataframe, all data
        asStock list of strings, stocks
        sOut    string, output file name

    Output:
        file written to disk
    """
    df = df.dropna(how='any')
    asOut = ['Mkt-RF', 'SMB', 'HML', 'RF', 'C']
    for sStock in asStock:
        asOut.append(sStock + "-RF")

    print("Writing columns ", asOut, "to file ", sOut)
    df.to_csv(sOut, columns=asOut, index_label="Date", float_format="%.8g")

    print(df)
    return df
def main():
    sPer = "0018"
    sIn = "Research_Data_Factors_weekly.csv"
    sOut = "ffstocks"
    asStock = ["aig", "ford", "ibm", "xom"]

    # Initialisation
    df = ReadFF(sIn)
    for sStock in asStock:
        df = JoinStock(df, sStock, sPer)

    # Output
    SaveFF(df, asStock, sOut + "_" + sPer + ".csv")
    print("Done")

    # Regression
    x = df[['Mkt-RF', 'SMB', 'HML']]
    y = df['aig-RF']
    df = pd.DataFrame({'x': x, 'y': y})
    df['constant'] = 1
    df.head()
    sm.OLS(y, df[['constant', 'x']]).fit().summary()
What exactly do I need to modify in pd.DataFrame in order to get the multiple OLS regression table?
I propose to change the first chunk of your code to below (mostly just swapping line orders):
# add constant column to the original dataframe
df['constant'] = 1
# define x as a subset of original dataframe
x = df[['Mkt-RF', 'SMB', 'HML', 'constant']]
# define y as a series
y = df['aig-RF']
# pass x as a dataframe, while pass y as a series
sm.OLS(y, x).fit().summary()
Hope this helps.
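As a side note on the constant column: statsmodels also provides sm.add_constant, which does the same thing as adding the column by hand; a minimal equivalent sketch:

import statsmodels.api as sm

# add_constant appends a 'const' column, equivalent to df['constant'] = 1
x = sm.add_constant(df[['Mkt-RF', 'SMB', 'HML']])
y = df['aig-RF']
print(sm.OLS(y, x).fit().summary())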
I've imported deque from collections to limit the size of my data frame. When new data is entered, the older ones should be progressively deleted over time.
Big Picture:
I'm creating a DataFrame of historical values for the previous 26 days, counting back from time "whatever day it is..."
Confusion:
I think my data arrives each minute in Series format; I attempted to restrict its maxlen using a deque, then tried loading the data into a DataFrame. However, I just get NaN values.
Code:
import numpy as np
import pandas as pd
from collections import deque
def initialize(context):
    context.stocks = symbol('AAPL')

def before_trading_start(context, data):
    data = data.history(context.stocks, 'close', 20, '1m').dropna()
    length = 5
    d = deque(maxlen=length)
    data = d.append(data)
    index = pd.DatetimeIndex(start='2016-04-03 00:00:00', freq='S', periods=length)
    columns = ['price']
    df = pd.DataFrame(index=index, columns=columns, data=data)
    print(df)
How can I get this to work?
Mike
If I understand the question correctly, you want to keep all the values from the last twenty-six days. Is the following function enough for you?
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        .loc[lambda x: x.index > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
If the dates are not in the index:
def select_values_of_the_last_twenty_six_days(old_data, new_data):
    length = 5
    twenty_six_day_before = (
        pd.Timestamp.now(tz='Europe/Paris').round('D')
        - pd.to_timedelta(26, 'D')
    )
    return (
        pd.concat([old_data, new_data])
        # the following line is changed for values in a specific column
        .loc[lambda x: x['column_with_date'] > twenty_six_day_before, :]
        .iloc[-length:, :]
    )
Don't forget to change the hard coded timezone if you are not in France. :-)
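A small usage sketch with made-up data, just to show the call pattern for the first variant (the index has to hold timezone-aware datetimes here, since the cut-off is computed with tz='Europe/Paris'):

import pandas as pd

now = pd.Timestamp.now(tz='Europe/Paris')
old_data = pd.DataFrame({'price': [101.2, 101.5]},
                        index=[now - pd.Timedelta(days=30), now - pd.Timedelta(days=10)])
new_data = pd.DataFrame({'price': [102.1]}, index=[now])

# only rows from the last 26 days survive, and at most `length` of them
recent = select_values_of_the_last_twenty_six_days(old_data, new_data)
print(recent)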
I'm using a web service that returns a CSV response in which the 1st row contains the column names, and the 2nd row contains the column units, for example:
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
I can read this into a Pandas DataFrame:
import pandas as pd
from StringIO import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
# Create a Pandas DataFrame
obs=pd.read_csv(StringIO(x.strip()), sep=",\s*")
print(obs)
which produces
longitude latitude
0 degrees_east degrees_north
1 -142.842 -1.82
2 -25.389 39.87
3 -37.704 27.114
But what would be the best approach to associate the units with the DataFrame columns for later use, for example labeling plots?
Allowing pandas to read the second line as data screws up the dtype of the columns. Instead of a float dtype, the presence of strings makes the columns' dtype object, and the underlying values, even the numbers, are strings. This breaks all numerical operations:
In [8]: obs['latitude']+obs['longitude']
Out[8]:
0 degrees_northdegrees_east
1 -1.82-142.842
2 39.87-25.389
3 27.114-37.704
In [9]: obs['latitude'][1]
Out[9]: '-1.82'
So it is imperative that pd.read_csv skip the second line.
The following is pretty ugly, but given the format of the input, I don't see a better way.
import pandas as pd
from StringIO import StringIO
x = '''
longitude,latitude
degrees_east,degrees_north
-142.842,-1.82
-25.389,39.87
-37.704,27.114
'''
content = StringIO(x.strip())
def read_csv(content):
    columns = next(content).strip().split(',')
    units = next(content).strip().split(',')
    obs = pd.read_table(content, sep=",\s*", header=None)
    obs.columns = ['{c} ({u})'.format(c=col, u=unit)
                   for col, unit in zip(columns, units)]
    return obs
obs = read_csv(content)
print(obs)
# longitude (degrees_east) latitude (degrees_north)
# 0 -142.842 -1.820
# 1 -25.389 39.870
# 2 -37.704 27.114
print(obs.dtypes)
# longitude (degrees_east) float64
# latitude (degrees_north) float64
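A lighter alternative, sketched under the assumption that you can read the input twice: let read_csv skip the units row for the data, and read the units separately so they can be reused later, for example when labelling plots (io.StringIO is used here for Python 3; the question used the Python 2 StringIO module):

import pandas as pd
from io import StringIO

content = StringIO(x.strip())
units_row = pd.read_csv(content, nrows=1)          # header plus the units line only
content.seek(0)
obs = pd.read_csv(content, skiprows=[1])           # same data, but the units line is skipped
units = dict(zip(obs.columns, units_row.iloc[0]))  # {'longitude': 'degrees_east', ...}

# the units can then be used when labelling plots, for example:
# ax = obs.plot(x='longitude', y='latitude', kind='scatter')
# ax.set_xlabel('longitude ({})'.format(units['longitude']))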