I have postal code data in Excel, which I load into Python as a list.
I use the geocoder library to get the latitude and longitude for each postal code so I can plot them on a map later on.
g = geocoder.google('1700002')
g.latlng
g.latlng returns a list of the form [latitude, longitude].
Since it only takes strings, I converted the values from float to int to get rid of the trailing .0 (e.g. 133.0 becomes 133), then converted them to strings so they can be read.
yubin_frame = yubin['yubin'] #post data
# first convert to int to get rid of the float's trailing .0
yubin_list_int = map(int, yubin_list)
# then convert everything to strings
yubin_list_str = map(str, yubin_list_int)
I wrote this for loop to build a list of the latitude and longitude for every postal code:
# create a new list that holds the coordinates for all postal codes
Yubin_zahyou = []
for i in range(len(yubin_list_str)):
    Yubin_zahyou.append(geocoder.google(yubin_list_str[i]).latlng)
My problem is that I have nearly 30,000 entries, but geocoder returns results for only about 2,500 of them. Does this mean geocoder has a limit, or did I make a mistake somewhere?
Yes, it has a rate limit, as described in the Providers section here:
https://github.com/DenisCarriere/geocoder
https://developers.google.com/maps/documentation/geocoding/usage-limits
For Google, the limit is roughly 2,500 requests per day.
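One way to live with a daily quota is to work through the list in resumable batches. The following is only a sketch under assumptions: `geocode_batch`, `lookup`, `done`, and `daily_limit` are names made up for illustration, and `lookup` stands for whatever call returns the coordinates (for example a small wrapper around `geocoder.google(code).latlng`).

```python
def geocode_batch(codes, lookup, done=None, daily_limit=2500):
    """Geocode at most `daily_limit` not-yet-resolved codes.

    `lookup` is any callable mapping a postal code string to [lat, lng]
    (or None on failure); `done` holds results from earlier runs.
    """
    results = dict(done or {})
    requests_made = 0
    for code in codes:
        if code in results:
            continue  # already resolved on a previous run
        if requests_made >= daily_limit:
            break  # quota used up; call again later with the returned dict
        requests_made += 1
        latlng = lookup(code)
        if latlng is not None:
            results[code] = latlng
    return results
```

Saving the returned dictionary (for example with `json.dump`) and passing it back in as `done` the next day would let the remaining codes be processed without repeating earlier requests.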
I have the following polygon of a geographic area that I fetch via a request in CAP/XML format from an API
The raw data looks like this:
<polygon>22.3243,113.8659 22.3333,113.8691 22.4288,113.8691 22.4316,113.8742 22.4724,113.9478 22.5101,113.9951 22.5099,113.9985 22.508,114.0017 22.5046,114.0051 22.5018,114.0085 22.5007,114.0112 22.5007,114.0125 22.502,114.0166 22.5038,114.0204 22.5066,114.0245 22.5067,114.0281 22.5057,114.0371 22.5051,114.0409 22.5041,114.0453 22.5025,114.0494 22.5023,114.0511 22.5035,114.0549 22.5047,114.0564 22.5059,114.057 22.5104,114.0576 22.512,114.0584 22.5144,114.0608 22.5163,114.0637 22.517,114.0657 22.5172,114.0683 22.5181,114.0717 22.5173,114.0739</polygon>
I store the requested items in a dictionary and then work through them to transform to a GeoJSON list object that is suitable for ingestion into Elasticsearch according to the schema I'm working with. I've removed irrelevant code here for ease of reading.
# fetch the feed and store the entries in a dictionary
r = requests.get("https://alerts.weather.gov/cap/ny.php?x=0")
xpars = xmltodict.parse(r.text)
json_entry = json.dumps(xpars['feed']['entry'])
dict_entry = json.loads(json_entry)
# transform items if necessary
for entry in dict_entry:
    if entry['cap:polygon']:
        polygon = entry['cap:polygon']
        polygon = polygon.split(" ")
        coordinates = []
        # take the split list items swap their positions and enclose them in their own arrays
        for p in polygon:
            p = p.split(",")
            p[0], p[1] = float(p[1]), float(p[0])  # swap lon/lat
            coordinates += [p]
        # more code adding fields to new dict object, not relevant to the question
The output of the p in polygon loop looks like:
[ [113.8659, 22.3243], [113.8691, 22.3333], [113.8691, 22.4288], [113.8742, 22.4316], [113.9478, 22.4724], [113.9951, 22.5101], [113.9985, 22.5099], [114.0017, 22.508], [114.0051, 22.5046], [114.0085, 22.5018], [114.0112, 22.5007], [114.0125, 22.5007], [114.0166, 22.502], [114.0204, 22.5038], [114.0245, 22.5066], [114.0281, 22.5067], [114.0371, 22.5057], [114.0409, 22.5051], [114.0453, 22.5041], [114.0494, 22.5025], [114.0511, 22.5023], [114.0549, 22.5035], [114.0564, 22.5047], [114.057, 22.5059], [114.0576, 22.5104], [114.0584, 22.512], [114.0608, 22.5144], [114.0637, 22.5163], [114.0657, 22.517], [114.0683, 22.5172], [114.0717, 22.5181], [114.0739, 22.5173] ]
Is there a way to do this that is better than O(N^2)? Thank you for taking the time to read.
O(KxNxM)
This process involves three obvious loops. These are:
Checking each entry (K)
Splitting valid entries into points (MxN) and iterating through those points (N)
Splitting those points into respective coordinates (M)
The number of characters in a polygon string is ~MxN because there are N points, each roughly M characters long, so splitting it iterates through MxN characters.
Now that we know all of this, let's pinpoint where each occurs.
ENTRIES (K):
    IF:
        SPLIT (MxN)
        POINTS (N):
            COORDS (M)
So, we can finally conclude that this is O(K(MxN + MxN)) which is just O(KxNxM).
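Since every character of the polygon string has to be read at least once, this is already linear in the size of the input, so there is little left to gain asymptotically. A minimal sketch of the same transformation for one polygon string (the function name `polygon_to_lonlat` is my own, not from the question):

```python
def polygon_to_lonlat(polygon):
    """Turn 'lat,lon lat,lon ...' into [[lon, lat], ...] in one pass."""
    coordinates = []
    for pair in polygon.split():
        lat, lon = pair.split(",")
        coordinates.append([float(lon), float(lat)])  # GeoJSON order: lon, lat
    return coordinates

sample = "22.3243,113.8659 22.3333,113.8691 22.4288,113.8691"
print(polygon_to_lonlat(sample))
```

Each point is touched once, so the cost per entry is proportional to the length of its polygon string.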
I'm trying to geolocate all the businesses related to a keyword in my city using, first, the radarsearch API in order to retrieve the Place ID and later using the Places API to get more information of each Place ID (such as the name, or the formatted address).
In my first approach I split my city into 9 circles, each with a radius of 22 km, avoiding rural zones where there are not supposed to be any businesses. This way I obtained (after removing duplicate results caused by the overlapping circles) approximately 150 businesses. This result is not reliable, because the company's official webpage states there are 245.
In order to retrieve ALL the businesses, I split my city into circles with a radius of 10 km. With approximately 50 pairs of coordinates I now cover the whole city, including both rural and non-rural zones. Surprisingly, I now obtain only 81 businesses! How can this be possible?
I'm storing all the information in separate dictionaries, and I noticed that the amount of data in each dictionary grows as the radius increases and is always the same for a fixed radius.
Now, apart from the previous question, is there any way to limit the amount of results each request yields?
The code I'm using is the following:
dict1 = {}
radius=20000
keyword='keyworkd'
key=YOUR_API_KEY
url_base="https://maps.googleapis.com/maps/api/place/radarsearch/json?"
list_dicts = []
for i,(lo, la) in enumerate(zip(lon_txt,lat_txt)):
    url=url_base+'location='+str(lo)+','+str(la)+'&radius='+str(radius)+'&keyword='+keyword+'&key='+key
    response = urllib2.urlopen(url)
    table = json.load(response)
    if table['status']=='OK':
        for j,line in enumerate(table['results']):
            temp = {j : line['place_id']}
            dict1.update(temp)
        list_dicts.append(dict1)
    else:
        pass
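I finally managed to solve this problem.
The issue was that the dict must be re-initialized on each loop iteration. Otherwise every entry in list_dicts refers to the same dictionary, and the keys j from later requests overwrite the values from earlier ones. Now it stores all the information and I retrieve what I wanted from the beginning.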
dict1 = {}
radius=20000
keyword='keyworkd'
key=YOUR_API_KEY
url_base="https://maps.googleapis.com/maps/api/place/radarsearch/json?"
list_dicts = []
for i,(lo, la) in enumerate(zip(lon_txt,lat_txt)):
    url=url_base+'location='+str(lo)+','+str(la)+'&radius='+str(radius)+'&keyword='+keyword+'&key='+key
    response = urllib2.urlopen(url)
    table = json.load(response)
    if table['status']=='OK':
        for j,line in enumerate(table['results']):
            temp = {j : line['place_id']}
            dict1.update(temp)
        list_dicts.append(dict1)
        dict1 = {}
    else:
        pass
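Since the circles overlap, it may be simpler to collect the place IDs in a set, which removes duplicates automatically. This is only a sketch: `collect_place_ids` is a hypothetical helper, and `tables` is assumed to be a list of already parsed JSON responses with the same `status` / `results` / `place_id` layout used above.

```python
def collect_place_ids(tables):
    """Return the unique place_id values from a list of parsed responses."""
    place_ids = set()
    for table in tables:
        if table.get('status') != 'OK':
            continue  # skip responses that carry no results
        for line in table.get('results', []):
            place_ids.add(line['place_id'])
    return place_ids

# Two overlapping searches that both return place "b"
tables = [
    {'status': 'OK', 'results': [{'place_id': 'a'}, {'place_id': 'b'}]},
    {'status': 'ZERO_RESULTS', 'results': []},
    {'status': 'OK', 'results': [{'place_id': 'b'}, {'place_id': 'c'}]},
]
print(sorted(collect_place_ids(tables)))
```

With a set there is no shared dictionary to reset between requests, so the overwriting problem cannot occur.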
I have a 5GB file of businesses and I'm trying to extract all the businesses whose business type codes (SNACODE) start with one of the SNACODEs corresponding to grocery stores. For example, SNACODEs for some businesses could be 42443013, 44511003, 44419041, 44512001, 44522004 and I want all businesses whose codes start with one of my grocery SNACODEs, codes = [4451,4452,447,772,45299,45291,45212]. In this case, I'd get the rows for 44511003, 44512001, and 44522004.
Based on what I googled, the most efficient way to read in the file seemed to be one row at a time (if not the SQL route). I then used a for loop and checked if my SNACODE column started with any of my codes (which probably was a bad idea but the only way I could get to work).
I have no idea how many rows are in the file, but there are 84 columns. My computer was running for so long that I asked a friend who said it should only take 10-20 min to complete this task. My friend edited the code but I think he misunderstood what I was trying to do because his result returns nothing.
I am now trying to find a more efficient method than re-doing my 9.5 hours and having my laptop run for an unknown amount of time. The closest thing I've been able to find is most efficient way to find partial string matches in large file of strings (python), but it doesn't seem like what I was looking for.
Questions:
What's the best way to do this? How long should this take?
Is there any way that I can start where I stopped? (I have no idea how many rows of my 5gb file I read, but I have the last saved line of data--is there a fast/easy way to find the line corresponding to a unique ID in the file without having to read each line?)
This is what I tried -- in 9.5 hours it outputted a 72MB file (200k+ rows) of grocery stores
codes = [4451,4452,447,772,45299,45291,45212] #codes for grocery stores
for df in pd.read_csv('infogroup_bus_2010.csv',sep=',', chunksize=1):
    data = np.asarray(df)
    data = pd.DataFrame(data, columns = headers)
    for code in codes:
        if np.char.startswith(str(data["SNACODE"][0]), str(code)):
            with open("grocery.csv", "a") as myfile:
                data.to_csv(myfile, header = False)
            print code
            break #break code for loop if match
grocery.to_csv("grocery.csv", sep = '\t')
This is what my friend edited it to. I'm pretty sure the x = df[df.SNACODE.isin(codes)] is only matching perfect matches, and thus returning nothing.
codes = [4451,4452,447,772,45299,45291,45212]
matched = []
for df in pd.read_csv('infogroup_bus_2010.csv',sep=',', chunksize=1024*1024, dtype = str, low_memory=False):
    x = df[df.SNACODE.isin(codes)]
    if len(x):
        matched.append(x)
    print "Processed chunk and found {} matches".format(len(x))
output = pd.concat(matched, axis=0)
output.to_csv("grocery.csv", index = False)
Thanks!
To increase speed you could pre-build a single regexp matching the lines you need, then read the raw file lines (no csv parsing) and check them with the regexp...
import re

codes = [4451,4452,447,772,45299,45291,45212]
col_num = 4 # Number of columns before SNACODE
# The alternatives are grouped so that "|" applies only to the codes
expr = re.compile("[^,]*," * col_num +
                  "(?:" + "|".join(map(str, codes)) + ")" +
                  ".*")
for L in open('infogroup_bus_2010.csv'):
    if expr.match(L):
        print(L)
Note that this is just a simple sketch as no escaping is considered... if the SNACODE column is not the first one and preceding fields may contain a comma you need a more sophisticated regexp like:
...
'([^"][^,]*,|"([^"]|"")*",)' * col_num +
...
that ignores commas inside double-quotes
You can probably make your pandas solution much faster:
codes = [4451, 4452, 447, 772, 45299, 45291, 45212]
codes = [str(code) for code in codes]
sna = pd.read_csv('infogroup_bus_2010.csv', usecols=['SNACODE'],
                  chunksize=int(1e6), dtype={'SNACODE': str})
with open('grocery.csv', 'w') as fout:
    for chunk in sna:
        for code in chunk['SNACODE']:
            for target_code in codes:
                if code.startswith(target_code):
                    fout.write('{}\n'.format(code))
Read only the needed column with usecols=['SNACODE']. You can adjust the chunk size with chunksize=int(1e6). Depending on your RAM you can likely make it much bigger.
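If you need the complete rows rather than just the codes, a plain streaming pass with the standard csv module may also be enough. This is a sketch under assumptions: `filter_rows` is a made-up helper, the file is assumed to have a header row containing a SNACODE column, and the sample rows are invented for the demonstration. It relies on `str.startswith` accepting a tuple of prefixes.

```python
import csv
import io

def filter_rows(infile, outfile, prefixes, column='SNACODE'):
    """Copy rows whose `column` starts with any prefix; return the match count."""
    prefixes = tuple(str(p) for p in prefixes)
    reader = csv.DictReader(infile)
    writer = csv.DictWriter(outfile, fieldnames=reader.fieldnames)
    writer.writeheader()
    matched = 0
    for row in reader:
        # A missing value is treated as an empty string and never matches
        if (row[column] or '').startswith(prefixes):
            writer.writerow(row)
            matched += 1
    return matched

# Tiny in-memory demonstration with made-up rows
src = io.StringIO("ID,SNACODE\n1,42443013\n2,44511003\n3,44512001\n4,44522004\n")
dst = io.StringIO()
print(filter_rows(src, dst, [4451, 4452, 447, 772, 45299, 45291, 45212]))
```

For the real file you would pass file objects opened with `newline=''` instead of the `io.StringIO` buffers. Only one row is held in memory at a time.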
I would like to test the accuracy of a Highcharts graph presenting data from a JSON file (which I already read) using Python and Selenium Webdriver.
How can I read the Highchart data from the website?
thank you,
Evgeny
The highchart data is converted to an SVG path, so you'd have to interpret the path yourself. I'm not sure why you would want to do this, actually: in general you can trust 3rd party libraries to work as advertised; the testing of that code should reside in that library.
If you still want to do it, then you'd have to dive into Javascript to retrieve the data. Taking the Highcharts Demo as an example, you can extract the data points for the first line as shown below. This will give you the SVG path definition as a string, which you can then parse to determine the origin and the data points. Comparing this to the size of the vertical axis should allow you to calculate the value implied by the graph.
import re

# Get the origin and datapoints of the first line
s = selenium.get_eval("window.jQuery('svg g.highcharts-tracker path:eq(0)')")
splitted = re.split(r'\s+L\s+', s)
origin = splitted[0].split(' ')[1:]
data = [p.split(' ') for p in splitted[1:]]
# Convert to floats
origin = [float(origin[0]), float(origin[1])]
data = [[float(x), float(y)] for x, y in data]
# Get the min and max y-axis value and position
min_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').text()"))
max_y_val = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').text()"))
min_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:first').attr('y')"))
max_y_pos = float(selenium.get_eval(
    "window.jQuery('svg g.highcharts-axis:eq(1) text:last').attr('y')"))
# Calculate the value based on the retrieved positions
y_scale = min_y_pos - max_y_pos
y_range = max_y_val - min_y_val
y_percentage = data[0][1] * 100.0 / y_scale
value = max_y_val - (y_range * y_percentage / 100.0)
Disclaimer: I didn't have time to fully verify it, but something along these lines should give you what you want.
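The parsing and scaling steps can be tried out without a browser. The sketch below makes several assumptions: the path uses only absolute M and L commands with plain decimal numbers, the y-axis is linear, and the pixel positions of the lowest and highest axis labels are known. `parse_path` and `pixel_to_value` are hypothetical helpers, not part of Selenium or Highcharts.

```python
import re

def parse_path(d):
    """Extract (x, y) pixel pairs from a path such as 'M 10 200 L 60 150'."""
    numbers = [float(n) for n in re.findall(r'-?\d+(?:\.\d+)?', d)]
    return list(zip(numbers[0::2], numbers[1::2]))

def pixel_to_value(y_pos, min_y_pos, max_y_pos, min_y_val, max_y_val):
    """Linearly map a pixel position on the y-axis to a data value."""
    fraction = (min_y_pos - y_pos) / (min_y_pos - max_y_pos)
    return min_y_val + fraction * (max_y_val - min_y_val)

# Made-up path: the axis runs from value 0 at pixel 200 to value 40 at pixel 0
points = parse_path("M 10 200 L 60 150 L 110 100")
print([pixel_to_value(y, 200.0, 0.0, 0.0, 40.0) for _, y in points])
```

The recovered values can then be compared against the numbers in the JSON file, allowing for a small tolerance because of pixel rounding.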
I have an XML file that contains a number of points with their longitude and latitude.
My Python code at the moment gets the nearest point by simply looping through the XML file, computing the distance to each point (in miles or whatever), and comparing it with the previous closest point. If it's nearer, I assign the new point to the variable. So everything is working in that regard.
Now, what I want to do is actually store the closest 2 or 3 points.
How do I go about doing this? The XML file isn't ordered by distance, and besides, the user's location will change each time a request is made. Can I do this with an XML file, or will I perhaps have to look into storing the data in SQL Server or MySQL?
Thanks for the help.
PS, the sample code is available here if anyone is interested. This is part of a college project.
As you parse the XML file, you should store all the points together with their distances, for example in a list of tuples.
mypoints = [(distance12, x1, x2),...,(distancenm, xn, xm)]
mypoints.sort()
three_closer = mypoints[:3]
Adapting this to your code:
..............
mypoints = []
for row in rows:
    # Get coords for current record
    curr_coords = row.getAttribute("lat") + ',' + row.getAttribute("lng")
    # Get distance
    tempDistance = distance.distance(user_coords, curr_coords).miles
    mypoints.append((tempDistance, row))
# sort by distance only, so rows at equal distance are never compared
mypoints.sort(key=lambda pair: pair[0])
# the three closest points:
mythree_shorter = mypoints[0:3]
# "dist" rather than "distance", to avoid shadowing the distance module
for dist, row in mythree_shorter:
    shortestStation = json.dumps(
        {'number': row.getAttribute("number"),
         'address': row.getAttribute("address"),
         'lat': row.getAttribute("lat"),
         'lng': row.getAttribute("lng"),
         'open': row.getAttribute("open")},
        sort_keys=True,
        indent=4)
    save_in_some_way(shortestStation)  # maybe writing to a file?
..................
Here's a solution that will work for any number of points:
closest = points[:NUM_CLOSEST]
closest.sort()
for point in points[NUM_CLOSEST:]:
    if point.distance < closest[-1].distance:
        closest[-1] = point
        closest.sort()
Obviously, a bit pseudo-cody. The sort() calls will probably need an argument so they are sorted in a useful way, and you'll probably want a function to calculate the distance to replace the distance member.
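The same idea can also be written with the standard library's heapq.nsmallest, which keeps only the n best candidates while scanning. This is a sketch under assumptions: points are plain (lat, lng) tuples, the distance is a haversine approximation with a mean Earth radius of 3958.8 miles, and `haversine_miles`, `closest_points`, and the sample coordinates are all made up for illustration.

```python
import heapq
from math import asin, cos, radians, sin, sqrt

def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lng) pairs."""
    lat1, lng1, lat2, lng2 = map(radians, (a[0], a[1], b[0], b[1]))
    h = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lng2 - lng1) / 2) ** 2)
    return 2 * 3958.8 * asin(sqrt(h))

def closest_points(user, points, n=3):
    """Return the n points nearest to `user`, closest first."""
    return heapq.nsmallest(n, points, key=lambda p: haversine_miles(user, p))

stations = [(53.35, -6.26), (53.34, -6.25), (53.30, -6.30), (53.40, -6.10)]
print(closest_points((53.34, -6.26), stations, n=2))
```

Because the user's location changes with every request, the distances have to be recomputed each time anyway, so a single pass like this over the parsed XML should be fine for a modest number of points.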