I have a CSV of about 100 million logs, where one of the columns is an address, and I am trying to get the latitude and longitude for each address. I wanted to try something like the linked solution, but it uses ArcGIS, which is a commercial tool. I did try the Google API, but it has a limit of only 2000 entries.
What is the next best alternative for getting each address's latitude and longitude into such a large dataset?
Input: the site column holds the address, all within the city of Paris.
start_time,stop_time,duration,input_octets,output_octets,os,browser,device,langue,site
2016-08-27T16:15:00+05:30,2016-08-27T16:28:00+05:30,721.0,69979.0,48638.0,iOS,CFNetwork,iOS-Device,zh_CN,NULL
2016-08-27T16:16:00+05:30,2016-08-27T16:30:00+05:30,835.0,2528858.0,247541.0,iOS,Mobile Safari UIWebView,iPhone,en_GB,Berges de Seine Rive Gauche - Gros Caillou
2016-08-27T16:16:00+05:30,2016-08-27T16:47:00+05:30,1805.0,133303549.0,4304680.0,Android,Android,Samsung GT-N7100,fr_FR,Centre d'Accueil Kellermann
2016-08-27T16:17:00+05:30,,2702.0,32499482.0,7396904.0,Other,Apache-HttpClient,Other,NULL,Bibliothèque Saint Fargeau
2016-08-27T16:17:00+05:30,2016-08-27T17:07:00+05:30,2966.0,39208187.0,1856761.0,iOS,Mobile Safari UIWebView,iPad,fr_FR,NULL
2016-08-27T16:18:00+05:30,,2400.0,1505716.0,342726.0,NULL,NULL,NULL,NULL,NULL
2016-08-27T16:18:00+05:30,,302.0,3424123.0,208827.0,Android,Chrome Mobile,Samsung SGH-I337M,fr_CA,Square Jean Xxiii
2016-08-27T16:19:00+05:30,,1500.0,35035181.0,1913667.0,iOS,Mobile Safari UIWebView,iPhone,fr_FR,Parc Monceau 1 (Entrée)
2016-08-27T16:19:00+05:30,,6301.0,9227174.0,5681273.0,Mac OS X,AppleMail,Other,fr_FR,Bibliothèque Parmentier
Rows where the site is NULL can be ignored and removed from the output.
The output should have the following columns:
start_time,stop_time,duration,input_octets,output_octets,os,browser,device,langue,site,latitude,longitude
Appreciate all the help. Thank you in advance!
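Whatever geocoder you end up with, the decisive optimisation at 100 million rows is to geocode each distinct site only once and map the results back, since hotspot names repeat constantly. A minimal sketch with pandas; `geocode_site` and its coordinates are placeholders for a real geocoder, not actual lookup results:

```python
import pandas as pd

# Hypothetical stand-in for a real geocoder (e.g. geopy's Nominatim);
# returns (lat, lon), or (None, None) when a site is unknown.
def geocode_site(site):
    lookup = {"Parc Monceau 1 (Entrée)": (48.8797, 2.3089)}
    return lookup.get(site, (None, None))

df = pd.DataFrame({"site": ["Parc Monceau 1 (Entrée)", "NULL",
                            "Parc Monceau 1 (Entrée)"]})
df = df[df["site"] != "NULL"]                               # drop NULL sites
coords = {s: geocode_site(s) for s in df["site"].unique()}  # one call per distinct site
df["latitude"] = df["site"].map(lambda s: coords[s][0])
df["longitude"] = df["site"].map(lambda s: coords[s][1])
```

With Wi-Fi hotspot logs the number of distinct sites is typically a few hundred, so this turns 100 million lookups into a handful.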
import csv
from geopy.geocoders import Nominatim

# Nominatim requires a custom user_agent; if your sites are all in France,
# pass country_codes to geocode() to restrict the search
geolocator = Nominatim(user_agent="my_geocoder")

with open('c:/temp/input.csv', newline='') as csvinput:
    with open('c:/temp/output.csv', 'w', newline='', encoding='utf-8') as csvoutput:
        output_fieldnames = ['Site', 'Address_found', 'Latitude', 'Longitude']
        writer = csv.DictWriter(csvoutput, delimiter=';', fieldnames=output_fieldnames)
        writer.writeheader()
        reader = csv.DictReader(csvinput)
        for row in reader:
            site = row['site']
            if site != "NULL":
                try:
                    location = geolocator.geocode(site, country_codes="fr")
                    address = location.address
                    latitude = location.latitude
                    longitude = location.longitude
                except Exception:  # no match or geocoder error
                    address = 'Not found'
                    latitude = 'N/A'
                    longitude = 'N/A'
            else:
                address = 'N/A'
                latitude = 'N/A'
                longitude = 'N/A'
            # here is the writing section
            output_row = {'Site': site,
                          'Address_found': address,
                          'Latitude': latitude,
                          'Longitude': longitude}
            writer.writerow(output_row)
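Since Nominatim's usage policy caps you at roughly one request per second, a job this size will not finish in one sitting, so it helps to persist results as you go and resume after interruptions. A sketch of an on-disk cache using only the standard library; the table and function names are my own, and `fake_geocoder` stands in for a real geocoder call:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a real, restartable run
conn.execute("CREATE TABLE IF NOT EXISTS geocache (site TEXT PRIMARY KEY, lat REAL, lon REAL)")

def cached_geocode(site, geocode_fn):
    hit = conn.execute("SELECT lat, lon FROM geocache WHERE site = ?", (site,)).fetchone()
    if hit is not None:
        return hit                       # already geocoded in an earlier run
    lat, lon = geocode_fn(site)          # one network call per new site
    conn.execute("INSERT OR REPLACE INTO geocache VALUES (?, ?, ?)", (site, lat, lon))
    conn.commit()
    return (lat, lon)

calls = []
def fake_geocoder(site):                 # stand-in for geolocator.geocode
    calls.append(site)
    return (48.85, 2.35)

first = cached_geocode("Parc Monceau", fake_geocoder)
second = cached_geocode("Parc Monceau", fake_geocoder)  # served from the cache
```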
I'm working on a personal project and I'm trying to retrieve air quality data from the https://aqicn.org website using their API.
I've used this code, which I've copied and adapted for the city of Bucharest as follows:
import pandas as pd
import folium
import requests

# GET data from AQI website through the API
base_url = "https://api.waqi.info"
path_to_file = "~/path"

# Got token from: https://aqicn.org/data-platform/token/#/
with open(path_to_file) as f:
    contents = f.readlines()
    key = contents[0].strip()  # strip the trailing newline so the URL stays valid

# (lat, lon) -> bottom left, (lat, lon) -> top right
latlngbox = "44.300264,25.920181,44.566991,26.297836"  # For Bucharest
trail_url = f"/map/bounds/?token={key}&latlng={latlngbox}"
my_data = pd.read_json(base_url + trail_url)  # Joined parts of URL
print('columns->', my_data.columns)  # 2 cols: 'status' and 'data' (JSON)

### Build a dataframe from the json response
all_rows = []
for each_row in my_data['data']:
    all_rows.append([each_row['station']['name'],
                     each_row['lat'],
                     each_row['lon'],
                     each_row['aqi']])
df = pd.DataFrame(all_rows, columns=['station_name', 'lat', 'lon', 'aqi'])

# Clean the DataFrame
df['aqi'] = pd.to_numeric(df.aqi, errors='coerce')  # invalid parsing -> NaN
# Remove NaN entries in the aqi column
df1 = df.dropna(subset=['aqi'])
Unfortunately it only retrieves 4 stations, whereas many more are available on the actual site. The only limitation I saw in the API documentation was "1,000 (one thousand) requests per second", so why can't I get more of them?
Also, I've tried modifying the lat/long values and managed to get more stations, but they were outside the city I was interested in.
Here is a view of the actual perimeter I used in the code above.
If you have any suggestions on how to solve this issue, I'd be very happy to read your thoughts. Thank you!
Try using waqi through aqicn... not exactly a clean API, but I found it to work quite well:
import pandas as pd
import folium
from folium.plugins import HeatMap

url1 = 'https://api.waqi.info'
# Get token from: https://aqicn.org/data-platform/token/#/
token = 'XXX'
box = '113.805332,22.148942,114.434299,22.561716'  # polygon around Hong Kong via bboxfinder.com
url2 = f'/map/bounds/?latlng={box}&token={token}'
my_data = pd.read_json(url1 + url2)

all_rows = []
for each_row in my_data['data']:
    all_rows.append([each_row['station']['name'],
                     each_row['lat'],
                     each_row['lon'],
                     each_row['aqi']])
df = pd.DataFrame(all_rows, columns=['station_name', 'lat', 'lon', 'aqi'])
From there it's easy to plot:
df['aqi'] = pd.to_numeric(df.aqi, errors='coerce')
print('with NaN->', df.shape)
df1 = df.dropna(subset=['aqi'])
df2 = df1[['lat', 'lon', 'aqi']]

init_loc = [22.396428, 114.109497]
max_aqi = int(df1['aqi'].max())
print('max_aqi->', max_aqi)

m = folium.Map(location=init_loc, zoom_start=5)
heat_aqi = HeatMap(df2, min_opacity=0.1, max_val=max_aqi,
                   radius=60, blur=20, max_zoom=2)
m.add_child(heat_aqi)
m
Or as such:
centre_point = [22.396428, 114.109497]
m2 = folium.Map(location=centre_point, tiles='Stamen Terrain', zoom_start=6)
for idx, row in df1.iterrows():
    lat = row['lat']
    lon = row['lon']
    station = row['station_name'] + ' AQI=' + str(row['aqi'])
    station_aqi = row['aqi']
    if station_aqi > 300:
        pop_color = 'red'
    elif station_aqi > 200:
        pop_color = 'orange'
    else:
        pop_color = 'green'
    folium.Marker(location=[lat, lon],
                  popup=station,
                  icon=folium.Icon(color=pop_color)).add_to(m2)
m2
Checking for stations whose name contains 'HongKong' returns 19:
df[df['station_name'].str.contains('HongKong')]
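If a single bounding box still returns only a handful of stations, one workaround worth trying (this is an assumption about the endpoint's behaviour, not something its documentation promises) is to tile the box into smaller sub-boxes and issue one /map/bounds/ query per tile, merging the results:

```python
def split_bbox(lat1, lng1, lat2, lng2, n=2):
    # Split one lat/lng bounding box into an n x n grid of sub-boxes
    dlat = (lat2 - lat1) / n
    dlng = (lng2 - lng1) / n
    boxes = []
    for i in range(n):
        for j in range(n):
            boxes.append((lat1 + i * dlat, lng1 + j * dlng,
                          lat1 + (i + 1) * dlat, lng1 + (j + 1) * dlng))
    return boxes

# Each sub-box then goes into its own /map/bounds/ request
boxes = split_bbox(22.148942, 113.805332, 22.561716, 114.434299, n=2)
```

Deduplicate stations by name after merging, since a station on a tile boundary can appear in two responses.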
I'm trying to associate the city/country/state name with the latitude and longitude in my dataset. This is what I did:
import pandas as pd
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="geoapiExercises")

def city_state_country(row):
    coord = f"{row['latitude']}, {row['longitude']}"
    location = geolocator.reverse(coord, exactly_one=True)
    address = location.raw['address']
    city = address.get('city', '')
    state = address.get('state', '')
    country = address.get('country', '')
    row['city'] = city
    row['state'] = state
    row['country'] = country
    return row

ddf_slice = ddf_slice.apply(city_state_country, axis=1)
But I have so many rows that it takes forever. How can I solve this?
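Reverse geocoding is network-bound (Nominatim allows roughly one request per second), so the practical fix is to cut the number of calls rather than speed up the loop. A sketch that rounds coordinates so nearby points share one cached lookup; `fake_reverse` is a stand-in for a real `geolocator.reverse` call:

```python
from functools import lru_cache

def make_cached_reverse(reverse_fn, precision=3):
    # ~3 decimal places is roughly 110 m, so nearby points share one lookup
    @lru_cache(maxsize=None)
    def lookup(lat, lon):
        return reverse_fn(lat, lon)
    def cached(lat, lon):
        return lookup(round(lat, precision), round(lon, precision))
    return cached

calls = []
def fake_reverse(lat, lon):          # stand-in for geolocator.reverse
    calls.append((lat, lon))
    return {"city": "Paris", "country": "France"}

reverse = make_cached_reverse(fake_reverse)
first = reverse(48.85661, 2.35222)
second = reverse(48.85668, 2.35219)  # rounds to the same key: served from cache
```

Whether rounding is acceptable depends on how precise your city/state assignment needs to be near boundaries.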
I'm looping through records in a dataframe column and trying to pull geocode data for each. Here's the code that I'm testing.
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="http")
for item in df_fin['market_address']:
    try:
        location = geolocator.geocode(item)
        df_fin.loc['address'] = location.address
        df_fin.loc['latitude'] = location.latitude
        df_fin.loc['longitude'] = location.longitude
        df_fin.loc['raw'] = location.raw
        print(location.raw)
    except:
        df_fin.loc['raw'] = 'no info for: ' + item
        print('no info for: ' + item)
I must be missing something simple, but I'm just not seeing what the issue is here.
UPDATE:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="http")
for index, row in df_fin.market_address.iterrows():
    try:
        location = geolocator.geocode(row)
        row['address'] = location.address
        row['latitude'] = location.latitude
        row['longitude'] = location.longitude
        row['raw'] = location.raw
        print(location.raw)
    except:
        row['raw'] = 'no info for: ' + row
        print('no info for: ' + row)
df_fin.tail(10)
You can reference the code below:
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="http")
for index, row in df_fin.iterrows():
    item = row['market_address']
    try:
        location = geolocator.geocode(item)
        # write through .loc so the changes persist in df_fin
        df_fin.loc[index, 'address'] = location.address
        df_fin.loc[index, 'latitude'] = location.latitude
        df_fin.loc[index, 'longitude'] = location.longitude
        df_fin.loc[index, 'raw'] = str(location.raw)
        print(location.raw)
    except Exception:
        df_fin.loc[index, 'raw'] = 'no info for: ' + item
        print('no info for: ' + item)
And if you are more familiar with pandas, you can use @DYZ's answer.
You should define a function that converts market_address into the address, lat, and long, and .apply that function to the DataFrame.
import numpy as np
import pandas as pd

def locate(market_address):
    loc = geolocator.geocode(market_address)
    return pd.Series({'address': loc.address if loc else np.nan,
                      'latitude': loc.latitude if loc else np.nan,
                      'longitude': loc.longitude if loc else np.nan,
                      'raw': loc.raw if loc else np.nan})

df_fin = df_fin.join(df_fin['market_address'].apply(locate))
Note that loc.raw is a dictionary. When you store a dictionary in a DataFrame, you are looking for trouble in the future.
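If the raw response does need to be kept, one safer pattern is to serialise it to a JSON string before storing it in the frame (a sketch with a made-up payload):

```python
import json
import pandas as pd

# Hypothetical raw geocoder payload
raw = {"place_id": 123, "display_name": "Paris, Île-de-France, France"}
df = pd.DataFrame({"raw": [json.dumps(raw)]})   # store as plain text
restored = json.loads(df.loc[0, "raw"])         # parse back only when needed
```

Strings round-trip cleanly through CSV and parquet, whereas dict-valued cells often break serialisation and comparisons later.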
starter code: https://docs.bokeh.org/en/latest/docs/gallery/texas.html
I am trying to replace the unemployment percentage with a different percentage that I have in a csv file. The csv columns are county name and concentration.
I am using the same call method for the county data as in the example. Just pulling in different data for the percentage value.
I have tried turning the csv into a dictionary to then look up the county name value and return the corresponding concentration using the same format as the starter code. I've tried inner joining, outer joining, appending. What am I missing here?
from bokeh.io import show
from bokeh.models import LogColorMapper
from bokeh.palettes import Viridis6 as palette
from bokeh.plotting import figure
from bokeh.sampledata.us_counties import data as counties
import pandas as pd
import csv

#with open('resources/concentration.csv', mode='r') as infile:
#    reader = csv.reader(infile)
#    with open('concentration_new.csv', mode='w') as outfile:
#        writer = csv.writer(outfile)
#        mydict = {rows[0]: rows[1] for rows in reader}
#d_1_2 = dict(list(counties.items()) + list(mydict.items()))

pharmacy_concentration = []
with open('resources/unemployment.csv', mode='r') as infile:
    # remove the quotechar attribute if you don't have '"' in your csv file
    reader = csv.reader(infile, delimiter=',', quotechar=' ')
    for row in reader:
        name, concentration = row
        pharmacy_concentration[name] = concentration

counties = {
    code: county for code, county in counties.items() if county["state"] == "tx"
}

palette.reverse()

county_xs = [county["lons"] for county in counties.values()]
county_ys = [county["lats"] for county in counties.values()]
county_names = [county['name'] for county in counties.values()]
# this is the line I am trying to have pull the corresponding value for the correct county
#county_rates = [d_1_2['concentration'] for county in counties.values()]

color_mapper = LogColorMapper(palette=palette)

data = dict(
    x=county_xs,
    y=county_ys,
    name=county_names,
    #rate=county_rates,
)

TOOLS = "pan,wheel_zoom,reset,hover,save"
p = figure(
    title="Texas Pharmacy Concentration", tools=TOOLS,
    x_axis_location=None, y_axis_location=None,
    tooltips=[
        ("Name", "@name"), ("Pharmacy Concentration", "@rate%"),
        ("(Long, Lat)", "($x, $y)")])
p.grid.grid_line_color = None
p.hover.point_policy = "follow_mouse"
p.patches('x', 'y', source=data,
          fill_color={'field': 'rate', 'transform': color_mapper},
          fill_alpha=0.7, line_color="white", line_width=0.5)
show(p)
It is hard to speculate without knowing the exact structure of your csv file. Assuming there are just two columns in it, county_name and concentration (no empty first column or the like), the following code may work for you:
from bokeh.models import LogColorMapper
from bokeh.palettes import Viridis256 as palette
from bokeh.plotting import figure, show
from bokeh.sampledata.us_counties import data as counties
import csv

pharmacy_concentration = {}
with open('resources/concentration.csv', mode='r') as infile:
    reader = [row for row in csv.reader(infile.read().splitlines())]
for row in reader:
    try:
        county_name, concentration = row  # add "dummy" before "county_name" if there is an empty column in the csv file
        pharmacy_concentration[county_name] = float(concentration)
    except Exception as error:
        print(error, row)

counties = {code: county for code, county in counties.items() if county["state"] == "tx"}

county_xs = [county["lons"] for county in counties.values()]
county_ys = [county["lats"] for county in counties.values()]
county_names = [county['name'] for county in counties.values()]
county_pharmacy_concentration_rates = [pharmacy_concentration[counties[county]['name']] for county in counties if counties[county]['name'] in pharmacy_concentration]

palette.reverse()
color_mapper = LogColorMapper(palette=palette)
data = dict(x=county_xs, y=county_ys, name=county_names, rate=county_pharmacy_concentration_rates)

p = figure(title="Texas Pharmacy Concentration, 2009",
           tools="pan,wheel_zoom,reset,hover,save",
           tooltips=[("Name", "@name"), ("Pharmacy Concentration", "@rate%"),
                     ("(Long, Lat)", "($x, $y)")],
           x_axis_location=None, y_axis_location=None)
p.grid.grid_line_color = None
p.hover.point_policy = "follow_mouse"
p.patches('x', 'y', source=data,
          fill_color={'field': 'rate', 'transform': color_mapper},
          fill_alpha=0.7, line_color="white", line_width=0.5)
show(p)
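One caveat: the filtering comprehension above drops counties that are missing from the csv, so `rate` can end up shorter than `x` and `y` and the patch data become misaligned. Using a dictionary default keeps every list the same length (toy data for illustration):

```python
import math

county_names = ["Travis", "Harris", "Bexar"]             # toy data
pharmacy_concentration = {"Travis": 1.2, "Harris": 0.9}  # "Bexar" missing
# NaN for missing counties keeps rates aligned with county_names
rates = [pharmacy_concentration.get(name, float("nan")) for name in county_names]
```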
The result:
I have a huge file, which has some missing rows. The data needs to be rooted at Country.
The input data is like:
csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""
which needs to become:
csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,,
4,USA,WI,Dane,
4,USA,WI,Dane,Madison
"""
The key, per my logic, is the Type field: if I cannot find a County row (type 3) for a City row (type 4), insert a row with fields up to County. Same with County: if I cannot find a State row (type 2) for a County row (type 3), insert a row with fields up to State.
With my limited understanding of Python's facilities, I was trying more of a brute-force approach, which is problematic as it needs a lot of iteration over the same file.
I also tried Google Refine but couldn't get it to work, and doing this manually is quite error prone.
Any help appreciated.
import csv
import io

csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""

found_county = []
missing_county = []

def check_missing_county(row):
    found = False
    for elm in found_county:
        if elm.Type == row.Type:
            found = True
    if not found:
        missing_county.append(row)
        print(row)

reader = csv.reader(io.StringIO(csv_str))
for row in reader:
    check_missing_county(row)
I've knocked up the following based on my understanding of the question:
import csv
import io

csv_str = """Type,Country,State,County,City,
1,USA,,,
2,USA,OH,,
3,USA,OH,Franklin,
4,USA,OH,Franklin,Columbus
4,USA,OH,Franklin,Springfield
4,USA,WI,Dane,Madison
"""

counties = []
states = []

def handle_missing_data(row):
    try:
        int(row[0])  # skip rows whose Type is not numeric (the header)
    except ValueError:
        return []
    rtype = row[0]
    country = row[1]
    state = row[2]
    county = row[3]
    rows = []
    # if a state is present and it hasn't a row of its own
    if state and state not in states:
        rows.append([rtype, country, state, '', ''])
        states.append(state)
    # if a county is present and it hasn't a row of its own
    if county and county not in counties:
        rows.append([rtype, country, state, county, ''])
        counties.append(county)
    # if the row hasn't already been added, add it now
    if row not in rows:
        rows.append(row)
    return rows

csvf = io.StringIO(csv_str)
reader = csv.reader(csvf)
for row in reader:
    new_rows = handle_missing_data(row)
    for new_row in new_rows:
        print(new_row)