I'm currently trying to plot AirBnb locations in Paris using folium. My code is as below:
f = folium.Figure(width = 800, height = 500)
map = folium.Map(location = [48.8569421129686, 2.3503337285332204], # Coords for Paris
                 zoom_start = 10,
                 tiles = 'CartoDB positron').add_to(f)
for index in range(0, len(df4)-1):
    lat = df4['latitude'][index]
    long = df4['longitude'][index]
    temp = lat, long
    folium.Marker(temp, marker_icon = 'cloud').add_to(map)
map
df4 is structured with the following columns:
Index(['id', 'name', 'host_id', 'host_since', 'host_location',
'host_acceptance_rate', 'host_is_superhost', 'host_neighbourhood',
'neighbourhood_cleansed', 'latitude', 'longitude', 'property_type',
'room_type', 'accommodates', 'bedrooms', 'beds', 'price',
'minimum_nights', 'maximum_nights', 'number_of_reviews',
'review_scores_rating', 'review_scores_accuracy',
'review_scores_communication', 'review_scores_location',
'review_scores_value'],
dtype='object')
Why am I getting KeyError: 6 when I attempt to run my code? I attempted to use an if statement to catch index 6, but then I got KeyError 10. The data is formatted correctly, and all of the latitudes and longitudes are formatted uniformly. Why is it getting hung up on random rows?
The KeyError comes from the way you index the DataFrame: df4['latitude'][index] looks rows up by index label, not by position, so if df4's index is not a contiguous 0..len(df4)-1 range (for example because rows were dropped during cleaning), labels such as 6 or 10 simply don't exist. In your case, it is better to use the iterrows() method of the pandas DataFrame to iterate over the rows:
for row in df4.iterrows():
    lat = row[1]['latitude']
    long = row[1]['longitude']
    temp = lat, long
    folium.Marker(temp, marker_icon = 'cloud').add_to(map)
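If the loop feels slow on a larger DataFrame, an alternative sketch uses itertuples(), which avoids building a Series per row and is usually faster than iterrows(); it also uses folium's Icon class, which is how a specific glyph such as 'cloud' is normally attached to a Marker. It assumes df4 and the map object from the question already exist:
# Sketch: same marker loop with itertuples; assumes df4 and map are defined as above.
for row in df4.itertuples():
    # row.latitude / row.longitude are attribute lookups on a namedtuple
    folium.Marker((row.latitude, row.longitude),
                  icon=folium.Icon(icon='cloud')).add_to(map)
map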
I want to extract monthly temperature data from several netCDF files in different locations. Files are built as follows:
> print(data.variables.keys())
dict_keys(['lon', 'lat', 'time', 'tmp','stn'])
Files hold names like "tmp_1901_1910."
Here is the code I use:
import glob
import pandas as pd
import os
import numpy as np
import time
from netCDF4 import Dataset

os.chdir('PATH/data_tmp')

all_years = []
for file in glob.glob('*.nc'):
    data = Dataset(file, 'r')
    time_data = data.variables['time'][:]
    time = data.variables['time']
    year = str(file)[4:13]
    all_years.append(year)

# Empty pandas dataframe
year_start = min(all_years)
end_year = max(all_years)
date_range = pd.date_range(start = str(year_start[0:4]) + '-01-01', end = str(end_year[5:9]) + '-12-31', freq ='M')
df = pd.DataFrame(0.0, columns = ['Temp'], index = date_range)
# Defining the location, lat, lon based on the csv data
cities = pd.read_csv(r'PATH/cities_coordinates.csv', sep =',')
cities['city'] = cities['city'].map(str)

for index, row in cities.iterrows():
    location = row['code_nbs']
    location_latitude = row['lat']
    location_longitude = row['lon']

    # Sorting the list
    all_years.sort()

    for yr in all_years:
        # Reading in the data
        data = Dataset('tmp_'+str(yr)+'.nc','r')
        # Storing the lat and lon data of the netCDF file into variables
        lat = data.variables['lat'][:]
        lon = data.variables['lon'][:]
        # Squared difference between the specified lat, lon and the lat, lon of the netCDF
        sq_diff_lat = (lat - location_latitude)**2
        sq_diff_lon = (lon - location_longitude)**2
        # Retrieving the index of the min value for lat and lon
        min_index_lat = sq_diff_lat.argmin()
        min_index_lon = sq_diff_lon.argmin()
        # Accessing the temperature data
        tmp = data.variables['tmp']
        start = str(yr[0:4])+'-01-01'
        end = str(yr[5:11])+'-12-31'
        d_range = pd.date_range(start = start, end = end, freq='M')

        for t_index in np.arange(0, len(d_range)):
            print('Recording the value for: '+str(d_range[t_index]))
            df.loc[d_range[t_index]]['Temp'] = tmp[min_index_lon, min_index_lat, t_index]

    df.to_csv(location + '.csv')
I obtain the following message while running the command df.loc[d_range[t_index]]['Temp']=tmp[min_index_lon, min_index_lat, t_index]
IndexError: index exceeds dimension bounds
I inspect the object's values and have:
print(d_range)
DatetimeIndex(['1901-01-31', '1901-02-28', '1901-03-31', '1901-04-30',
'1901-05-31', '1901-06-30', '1901-07-31', '1901-08-31',
'1901-09-30', '1901-10-31',
...
'1910-03-31', '1910-04-30', '1910-05-31', '1910-06-30',
'1910-07-31', '1910-08-31', '1910-09-30', '1910-10-31',
'1910-11-30', '1910-12-31'],
dtype='datetime64[ns]', length=120, freq='M')
On the first t_index within the loop, I have:
print(t_index)
0
print(d_range[t_index])
1901-01-31 00:00:00
print(min_index_lat)
259
print(min_index_lon)
592
I don't understand what went wrong with the dimensions.
Thank you for any help!
I assume you want to read in all the .nc data and map each location to its closest city. For that, I suggest reading all the data first and afterwards working out which city each location belongs to. The following code will probably need some adaptation to your data, but it should show the direction you could take to make the code more robust.
Step 1: Import your 'raw' data
e.g. into one or more DataFrames. This depends on whether you can load all data at once; if not, split steps 1 and 2 into chunks.
df_list = []
for file in glob.glob('*.nc'):
    data = Dataset(file, 'r')
    df_i = pd.DataFrame({
        'time': data.variables['time'][:],
        'lat': data.variables['lat'][:],
        'lon': data.variables['lon'][:],
        'tmp': data.variables['tmp'][:],
        'stn': data.variables['stn'][:],
        'year': str(file)[4:13],  # maybe not needed as 'time' should have this info already, and [4:13] needs exactly this format
        'file_name': file,  # to track back the file
        # ... and more
    })
    df_list.append(df_i)
df = pd.concat(df_list, ignore_index=True)
Second step: map the locations
e.g. with groupby, though there are several other methods. Depending on the amount of data, I suggest using pandas or numpy routines over plain Python loops; they are much faster.
df['city'] = None
gp = df.groupby(['lon', 'lat'])
for values_i, indexes_i in gp.groups.items():
    # Add your code to get the closest city
    # values_i[0] is 'lon'
    # values_i[1] is 'lat'
    # e.g.:
    diff_lon_lat = np.hypot(cities['lon'] - values_i[0], cities['lat'] - values_i[1])
    location = cities.loc[diff_lon_lat.idxmin(), 'code_nbs']
    # and add the parameters to the df
    df.loc[indexes_i, 'city'] = location
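From there, a per-city monthly series can be built with pandas' own grouping tools. The following is only a sketch and assumes the numeric 'time' variable has already been converted into a pandas datetime column (e.g. via netCDF4.num2date or pd.to_datetime):
# Sketch: monthly mean temperature per city, assuming df['time'] is a datetime
# column and df['city'] was filled in the step above.
monthly = (df.groupby(['city', pd.Grouper(key='time', freq='M')])['tmp']
             .mean()
             .reset_index())

# one CSV per city, mirroring the original df.to_csv(location + '.csv') idea
for city, group in monthly.groupby('city'):
    group.to_csv(str(city) + '.csv', index=False)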
Apologies if something similar has been asked before.
I have a task where I need a function that is fed a list of unix times, and a pandas df.
The pandas df has a column for unix time, a column for latitude, and a column for longitude.
I need to extract the latitude from the df where the df unix time matches the unix time in my list I pass to the function.
So far I have:
def nl_lat_array(pandas_df, unixtime_list):
    lat = dict()
    data = pandas_df
    for x, row in data.iterrows():
        if data[data['DateTime_Unix']] == i in unixtime_list:
            lat[i] = data[data['Latitude']]
    v = list(lat.values())
    nl_lat_array = np.array(v)
    return nl_lat_array
This results in the following error:
KeyError: "None of [Float64Index([1585403852.468, 1585403852.518, 1585403852.568, 1585403852.618,\n 1585403852.668, 1585403852.718, 1585403852.768, 1585403852.818,\n 1585403852.868, 1585403852.918,\n ...\n 1585508348.524, 1585508348.574, 1585508348.624, 1585508348.674,\n 1585508348.724, 1585508348.774, 1585508348.824, 1585508348.874,\n 1585508348.924, 1585508348.974],\n dtype='float64', length=2089945)] are in the [columns]"
However these values in the pandas array do exist in the list I am passing.
Any help would be greatly appreciated.
import pandas as pd
data = pd.DataFrame([[1,4,7],[2,5,8],[3,6,9]])
data.columns = ['time', 'lat', 'long']
time_list = [1,2]
d = data[data['time'].isin(time_list)]['lat'].values
# [4, 5]
You can do something like this:
filtered_data = data[data['DateTime_Unix'].isin(unixtime_list)]
filtered_data['Latitude'].values
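Folding that back into the original function, a minimal sketch (assuming the column names 'DateTime_Unix' and 'Latitude' from the question) could look like this:
import numpy as np

def nl_lat_array(pandas_df, unixtime_list):
    # keep only the rows whose unix time appears in the list,
    # then return the matching latitudes as a numpy array
    mask = pandas_df['DateTime_Unix'].isin(unixtime_list)
    return pandas_df.loc[mask, 'Latitude'].to_numpy()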
I'm not sure what happened, but my code worked earlier today; now it won't. I have an Excel spreadsheet of projects I want to individually import and put into lists. However, I'm getting an "IndexError: index 8 is out of bounds for axis 0 with size 8" error and Google searches have not resolved this for me. Any help is appreciated. I have the following fields in my Excel sheet: id, funding_end, keywords, pi, summaryurl, htmlabstract, abstract, project_num, title. Not sure what I'm missing...
import pandas as pd

dataset = pd.read_excel('new_ahrq_projects_current.xlsx', encoding="ISO-8859-1")
df = pd.DataFrame(dataset)

cols = [0,1,2,3,4,5,6,7,8]
df = df[df.columns[cols]]

tt = df['funding_end'] = df['funding_end'].astype(str)
tt = df.funding_end.tolist()
for t in tt:
    allenddates.append(t)

bb = df['keywords'] = df['keywords'].astype(str)
bb = df.keywords.tolist()
for b in bb:
    allkeywords.append(b)

uu = df['pi'] = df['pi'].astype(str)
uu = df.pi.tolist()
for u in uu:
    allpis.append(u)

vv = df['summaryurl'] = df['summaryurl'].astype(str)
vv = df.summaryurl.tolist()
for v in vv:
    allsummaryurls.append(v)

ww = df['htmlabstract'] = df['htmlabstract'].astype(str)
ww = df.htmlabstract.tolist()
for w in ww:
    allhtmlabstracts.append(w)

xx = df['abstract'] = df['abstract'].astype(str)
xx = df.abstract.tolist()
for x in xx:
    allabstracts.append(x)

yy = df['project_num'] = df['project_num'].astype(str)
yy = df.project_num.tolist()
for y in yy:
    allprojectnums.append(y)

zz = df['title'] = df['title'].astype(str)
zz = df.title.tolist()
for z in zz:
    alltitles.append(z)
"IndexError: index 8 is out of bounds for axis 0 with size 8"
cols = [0,1,2,3,4,5,6,7,8]
should be cols = [0,1,2,3,4,5,6,7].
I think you have 8 columns, but your cols list contains 9 indices.
IndexError: index out of bounds means you're trying to insert or access something which is beyond its limit or range.
Whenever you load a file such as test.xls, test.csv or test.xlsx with pandas, for example:
data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
it is worth checking how many columns the DataFrame actually has; this helps you move forward when working with large datasets, e.g.:
import pandas as pd

data_set = pd.read_excel('file_example_XLS_10.xls', encoding="ISO-8859-1")
data_frames = pd.DataFrame(data_set)
print("Length of Columns:", len(data_frames.columns))
This gives you the exact number of columns in the spreadsheet. Then you can specify the column indices accordingly:
Length of Columns: 8
cols = [0, 1, 2, 3, 4, 5, 6, 7]
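As a small extension of that idea, the index list can also be derived from the column count itself, so it can never run past the last column; a minimal sketch:
# Sketch: build the column indices from the DataFrame instead of hard-coding them.
cols = list(range(len(df.columns)))
df = df[df.columns[cols]]  # selects every column, and can never go out of bounds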
I agree with @Bill CX that it sounds like you're trying to access a column that doesn't exist. Although I cannot reproduce your error, I have some ideas that may help you move forward.
First, double check the shape of your data frame:
import pandas as pd
dataset = pd.read_excel('new_ahrq_projects_current.xlsx',encoding="ISO-8859-1")
df = pd.DataFrame(dataset)
print(df.shape) # print shape of data read in to python
The output should be
(X, 9) # "X" is the number of rows
If the data frame has 8 columns, then df.shape will be (X, 8). This could be why you are getting the error.
Another check is to print out the first few rows of your data frame:
print(df.head())
This will let you double-check that you have read in the data in the correct form. I'm not sure, but it might be possible that your .xlsx file has 9 columns and pandas is reading in only 8 of them.
I have a dataframe named SD_Apartments that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of apartment names, and their coordinates.
I have another dataframe named SD_Coffee that has 3 variables: name (object), latitude (float64), longitude (float64). It's a list of coffee shop names, and their coordinates.
I want to add another variable to SD_apartments called coffee_count that would have the number of coffee shop locations listed in my SD_coffee dataframe that are within x (for example, 300) meters from each apartment listed in SD_apartments.
Here is a setup of the code I'm working with:
import pandas as pd
import geopy.distance
from geopy.distance import geodesic
data = [['Insomnia', 32.784782, -117.129130], ['Starbucks', 32.827521, -117.139966], ['Dunkin', 32.778519, -117.154720]]
data1 = [['DreamAPT', 32.822090, -117.184200], ['OKAPT', 32.748081, -117.130691], ['BadAPT', 32.786886, -117.097536]]
SD_Coffee = pd.DataFrame(data, columns = ['name', 'latitude', 'longitude'])
SD_Apartments = pd.DataFrame(data1, columns = ['name', 'latitude', 'longitude'])
Here is the code I'm attempting to use to accomplish my goal:
def geodesic_pd(df1, df2_row):
    return [(geodesic([tuple(x) for x in row.values], [tuple(x) for x in df2_row.values]).m for row in df1)]

SD_Apartments['coffee_count'] = pd.Series([(sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row) < 300) for row in SD_Apartments[['latitude', 'longitude']])])
If you run it and print SD_Apartments, you will see that SD_Apartments looks like:
name ... coffee_count
0 DreamAPT ... <generator object <genexpr> at 0x000002E178849...
1 OKAPT ... NaN
2 BadAPT ... NaN
This will probably help you:
import pandas as pd
df = pd.DataFrame({'geodesic': [1, 10, 8, 11, 20,2,2],'apartment': list('aaceeee')})
df.nsmallest(3, 'geodesic')
Another way of doing this is to use K-Nearest Neighbors with the geodesic distance:
SKLearn-KNN
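As a sketch of that nearest-neighbour idea, scikit-learn's BallTree supports the haversine metric (coordinates in radians, distances on the unit sphere), which can count neighbours within a radius without a Python loop; the Earth-radius constant below is an assumed approximation:
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_M = 6371000  # assumed mean Earth radius in metres

# BallTree with the haversine metric expects (lat, lon) in radians
coffee_rad = np.radians(SD_Coffee[['latitude', 'longitude']].to_numpy())
apt_rad = np.radians(SD_Apartments[['latitude', 'longitude']].to_numpy())

tree = BallTree(coffee_rad, metric='haversine')
# query_radius works in unit-sphere units, so scale the 300 m radius by the Earth radius
SD_Apartments['coffee_count'] = tree.query_radius(
    apt_rad, r=300 / EARTH_RADIUS_M, count_only=True)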
Assuming you are using pandas dataframes, you should be able to use something like this unless you have very large arrays -
import numpy as np

def geodesic_pd(df1, df2_row):
    dist = []
    for _, row in df1.iterrows():
        dist.append(geodesic(tuple(row.values), tuple(df2_row.values)).m)
    return np.array(dist)

SD_Apartments['coffee_count'] = SD_Apartments.apply(lambda row: sum(geodesic_pd(SD_Coffee[['latitude', 'longitude']], row[['latitude', 'longitude']]) < 300), axis=1)
The geodesic_pd function extends the geodesic calculation from individual tuples to a whole dataframe, and the next statement counts the coffee shops within 300 metres of each apartment and stores the result in a new column.
If you have large arrays, you should combine this with KNN so the calculation only runs over a subset of points.
I have a function that extracts a number of variables from zillow. I used a lambda function to append the returned values to a dataframe. I am wondering if there is a faster way to return all the variables and append them to the dataframe instead of individually.
Here is my code:
from xml.dom.minidom import parse, parseString
import xml.dom.minidom
import requests
import sys
import pandas as pd
import numpy as np

l_zwsid = ''

df = pd.read_csv('data.csv')

def getElementValue(p_dom, p_element):
    if len(p_dom.getElementsByTagName(p_element)) > 0:
        l_value = p_dom.getElementsByTagName(p_element)[0]
        return(l_value.firstChild.data)
    else:
        l_value = 'NaN'
        return(l_value)

def getData(l_zwsid, a_addr, a_zip):
    try:
        l_url = 'http://www.zillow.com/webservice/GetDeepSearchResults.htm?zws-id='+l_zwsid+'&address='+a_addr+'&citystatezip='+a_zip
        xml = requests.get(l_url)
        dom = parseString(xml.text)
        responses = dom.getElementsByTagName('response')
        zpid = getElementValue(dom,'zpid')
        usecode = getElementValue(dom,'useCode')
        taxyear = getElementValue(dom,'taxAssessmentYear')
        tax = getElementValue(dom,'taxAssessment')
        yearbuilt = getElementValue(dom,'yearBuilt')
        sqft = getElementValue(dom,'finishedSqFt')
        lotsize = getElementValue(dom,'lotSizeSqFt')
        bathrooms = getElementValue(dom,'bathrooms')
        bedrooms = getElementValue(dom,'bedrooms')
        totalrooms = getElementValue(dom,'totalRooms')
        lastSale = getElementValue(dom,'lastSoldDate')
        lastPrice = getElementValue(dom,'lastSoldPrice')
        latitude = getElementValue(dom, 'latitude')
        longitude = getElementValue(dom, 'longitude')
        for response in responses:
            addresses = response.getElementsByTagName('address')
            for addr in addresses:
                street = getElementValue(addr,'street')
                zipcode = getElementValue(addr,'zipcode')
            zestimates = response.getElementsByTagName('zestimate')
            for zest in zestimates:
                amt = getElementValue(zest,'amount')
                lastupdate = getElementValue(zest,'last-updated')
                valranges = zest.getElementsByTagName('valuationRange')
                for val in valranges:
                    low = getElementValue(val,'low')
                    high = getElementValue(val,'high')
        return longitude, latitude
    except AttributeError:
        return None

df['Longtitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis=1)
df['Latitude'] = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis=1)
This currently does not work because the new columns will contain both the longitude and latitude.
Your getData function returns a tuple, which is why both columns have both lat and lon. One workaround could be to parameterise this function as follows:
def getData(l_zwsid, a_addr, a_zip, axis='lat'):
    valid = ['lat', 'lon']
    if axis not in valid:
        raise ValueError(f'axis must be one of {valid}')
    ...
    if axis == 'lat':
        return latitude
    else:
        return longitude
This won't improve efficiency, however; it will actually make things slower, since each column now triggers its own API call per row. Your main overhead comes from making an API call for every row in the DataFrame, so you are constrained by network performance.
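One sketch that keeps a single API call per row is to store the returned (longitude, latitude) tuple once and split it afterwards; the column names below match the question's spelling:
# Sketch: call getData once per row, then unpack the returned tuple.
coords = df.apply(lambda row: getData(l_zwsid, row['Street'], row['Zip']), axis=1)
# getData returns (longitude, latitude), or None when an AttributeError was caught
df['Longtitude'] = coords.map(lambda c: c[0] if c else None)
df['Latitude'] = coords.map(lambda c: c[1] if c else None)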
You can make your getData function return a string containing comma-separated values of all the elements.
Append this CSV string as an ALL_TEXT column in the dataframe df.
Then split the ALL_TEXT column into multiple columns (Lat, Long, Zipcode, Street, etc.):
def split_into_columns(text):
    required_columns = ['Latitude', 'Longtitude', 'Zipcode']
    columns_value_list = text['ALL_TEXT'].split(',')
    for i in range(len(required_columns)):
        text[required_columns[i]] = columns_value_list[i]
    return text

df = pd.DataFrame([ ['11.49, 12.56, 9823A'], ['14.02, 15.29, 9674B'] ], columns=['ALL_TEXT'])
updated_df = df.apply(split_into_columns, axis=1)

df
ALL_TEXT
0 11.49, 12.56, 9823A
1 14.02, 15.29, 9674B
updated_df
ALL_TEXT Latitude Longtitude Zipcode
0 11.49, 12.56, 9823A 11.49 12.56 9823A
1 14.02, 15.29, 9674B 14.02 15.29 9674B
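A shorter variant of the same split, as a sketch, uses pandas' vectorised string methods instead of a row-wise apply:
# Sketch: split the comma-separated ALL_TEXT column into named columns in one go.
parts = df['ALL_TEXT'].str.split(',', expand=True)
parts.columns = ['Latitude', 'Longtitude', 'Zipcode']
# strip the whitespace left over from the ", " separators
updated_df = df.join(parts.apply(lambda col: col.str.strip()))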