I am trying to do reverse geocoding and extract pincodes (postal codes) for lat-long pairs. The .csv file has around 1 million records.
Below are my problems:
1. The Google API fails to return addresses when I send a large number of records, and it takes a huge amount of time. I will move this to a batch process later, though.
2. I split the file into chunks (1000 records per file) and ran a few files manually, one by one. Surprisingly, I then got a 100% result.
3. Later, I ran the files one by one in a loop, and again the Google API failed to return results.
Note: Right now we are looking for free APIs only.
**Below is my code**
import requests

def reverse_geocode(latlng):
    # Returns the first geocoding result, or an empty dict if there is none.
    result = {}
    url = 'https://maps.googleapis.com/maps/api/geocode/json?latlng={}'
    request = url.format(latlng)
    key = '&key=' + api_key
    request = request + key
    data = requests.get(request).json()
    if len(data['results']) > 0:
        result = data['results'][0]
    return result

def parse_postal_code(geocode_data):
    # Returns the postal code of a geocoding result, or None if it has none.
    if (geocode_data is not None) and ('formatted_address' in geocode_data):
        for component in geocode_data['address_components']:
            if 'postal_code' in component['types']:
                return component['short_name']
    return None
dfinal = pd.DataFrame(columns=colnames)
dmiss = pd.DataFrame(columns=colnames)
for fl in files:
    df = pd.read_csv(fl)
    print('Processing file : ' + fl[36:])
    df['geocode_data'] = df['latlng'].map(reverse_geocode)
    df['Pincode'] = df['geocode_data'].map(parse_postal_code)
    missing = df['Pincode'].isnull()
    if missing.any():
        print("Missing Pincodes : " + str(missing.sum()) + " / " + str(len(df)))
        # pd.concat returns a new frame, so the result must be assigned back
        dmiss = pd.concat([dmiss, df[missing]])
    dfinal = pd.concat([dfinal, df[~missing]])
Can anybody help me find the problem in my code? If any additional info is required, please let me know.
You've run into Google API usage limits.
I'm working with Google (and any) API calls for the first time, and I keep hitting a rate limit threshold despite having limited my request rate. How would I go about changing the following code to a batch format to avoid this?
# API call function
from ratelimit import limits, sleep_and_retry
import requests
from googleapiclient.discovery import build

@sleep_and_retry
@limits(calls=1, period=4.5)
def pull_sheet_data(SCOPE,SPREADSHEET_ID,DATA_TO_PULL):
    creds = gsheet_api_check(SCOPE)
    service = build('sheets', 'v4', credentials=creds)
    sheet = service.spreadsheets()
    result = sheet.values().get(
        spreadsheetId=SPREADSHEET_ID,
        range=DATA_TO_PULL).execute()
    values = result.get('values', [])
    if not values:
        print('No data found.')
    else:
        rows = sheet.values().get(spreadsheetId=SPREADSHEET_ID,
                                  range=DATA_TO_PULL).execute()
        data = rows.get('values')
        print("COMPLETE: Data copied")
        return data
# list files in the active brews folder
activebrews = drive.ListFile({'q':"'0BxU70FB_wb-Da0x5amtYbUkybXc' in parents"}).GetList()
tabs = ['Brew Log','Fermentation Log','Centrifuge Log']
brewsheetdict = {}
# Pulls data from the entire spreadsheet tab.
for i in activebrews:
    for j in tabs:
        # set spreadsheet parameters
        SCOPE = ['https://www.googleapis.com/auth/drive.readonly']
        SPREADSHEET_ID = i['id']
        data = pull_sheet_data(SCOPE,SPREADSHEET_ID,j)
        dftitle = str(i['title'])
        dftab = str(j)
        dfname = dftitle+'_'+dftab
        brewsheetdict[dfname] = pd.DataFrame(data)
Thanks!
Thanks to Tanaike's suggestion, I settled on the following solution. There are still issues that I haven't resolved. Namely, the resulting API calls occurred at a rate of 0.5 per second, well below the published limit, but any faster rate still resulted in rate limiting. Additionally, the code finished after running the for loops over only half of the list every time, requiring repeated executions. To work around the second problem, I included the indicated line to remove list items after they were iterated over, so that re-running the code began with the last incomplete record each time.
import pickle
import time

SCOPE = ['https://www.googleapis.com/auth/drive.readonly']
recordcount=0
#if (current - start)/(recordcount) >= 1:
sleeptime=2.2
start=time.time()
for i in activebrews:
    SPREADSHEET_ID = i['id']
    for j in tabs:
        data = pull_sheet_data(SCOPE,SPREADSHEET_ID,j)
        dftitle = str(i['title'])
        dftab = str(j)
        dfname = dftitle+'_'+dftab
        brewsheetdict[dfname] = pd.DataFrame(data)
        recordcount+=1
        time.sleep(sleeptime)
        end=time.time()
        print(recordcount,dfname, (end-start)/recordcount)
    activebrews.remove(i)  # remove list item after iterated over
    time.sleep(1)
brewsheetdata = open("brewsheetdata.pkl","wb")
pickle.dump(brewsheetdict,brewsheetdata)
brewsheetdata.close()
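One possible explanation for the loop stopping halfway, offered as a hedged observation rather than a confirmed diagnosis: calling `activebrews.remove(i)` while iterating over `activebrews` itself makes Python skip every other element. The sketch below shows the effect on a plain list and an alternative that iterates over a copy. `process_all` and `handle` are hypothetical names standing in for the per-spreadsheet work.

```python
def process_all(items, handle):
    """Handle every item, removing each one from `items` once it is done."""
    done = []
    for item in list(items):      # iterate over a snapshot, mutate the original
        handle(item)
        items.remove(item)
        done.append(item)
    return done

# Removing from the list that is being iterated skips elements:
naive = ['a', 'b', 'c', 'd']
seen = []
for x in naive:
    seen.append(x)
    naive.remove(x)
print(seen, naive)                # only half of the items were visited

# Iterating over a copy visits everything and still empties the list:
pending = ['a', 'b', 'c', 'd']
handled = process_all(pending, lambda item: None)
print(handled, pending)
```

If a run is interrupted by an exception, the items left in the list are exactly the unprocessed ones, so a re-run resumes where the previous one stopped.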
I have Python code that connects to SAP using the BAPI RFC_READ_TABLE, queries the USR02 table and brings back the results. The input is taken from column A of an Excel sheet and the output is written to column B.
The code runs fine. However, for 1000 records it takes approximately 8 minutes to run.
Could you please help me optimize the code? I am really new to Python; I managed to write this heavy code but am now stuck on the optimization part.
It would be really great if this could run in 1-2 minutes at most.
from pyrfc import Connection, ABAPApplicationError, ABAPRuntimeError, LogonError, CommunicationError
from configparser import ConfigParser
from pprint import PrettyPrinter
import openpyxl

ASHOST='***'
CLIENT='***'
SYSNR='***'
USER='***'
PASSWD='***'
conn = Connection(ashost=ASHOST, sysnr=SYSNR, client=CLIENT, user=USER, passwd=PASSWD)

try:
    wb = openpyxl.load_workbook('new2.xlsx')
    ws = wb['Sheet1']
    for i in range(1,len(ws['A'])+1):
        x = ws['A'+ str(i)].value
        options = [{ 'TEXT': "BNAME = '" +x+"'"}]
        fields = [{'FIELDNAME': 'CLASS'},{'FIELDNAME':'USTYP'}]
        pp = PrettyPrinter(indent=4)
        ROWS_AT_A_TIME = 10
        rowskips = 0
        while True:
            result = conn.call('RFC_READ_TABLE', \
                               QUERY_TABLE = 'USR02', \
                               OPTIONS = options, \
                               FIELDS = fields, \
                               ROWSKIPS = rowskips, ROWCOUNT = ROWS_AT_A_TIME)
            rowskips += ROWS_AT_A_TIME
            if len(result['DATA']) < ROWS_AT_A_TIME:
                break
        data_result = result['DATA']
        length_result = len(data_result)
        for line in range(0,length_result):
            a= data_result[line]["WA"].strip()
            wb = openpyxl.load_workbook('new2.xlsx')
            ws = wb['Sheet1']
            ws['B'+str(i)].value = a
            wb.save('new2.xlsx')
except CommunicationError:
    print("Could not connect to server.")
    raise
except LogonError:
    print("Could not log in. Wrong credentials?")
    raise
except (ABAPApplicationError, ABAPRuntimeError):
    print("An error occurred.")
    raise
EDIT:
Here is my updated code. For now, I have decided to print the data on the command line only. The output shows where the time is spent.
import time

start_time = time.time()
try:
    output_list = []
    wb = openpyxl.load_workbook('new3.xlsx')
    ws = wb['Sheet1']
    col = ws['A']
    col_lis = [col[x].value for x in range(len(col))]
    length = len(col_lis)
    for i in range(length):
        print("--- %s seconds Start of the loop ---" % (time.time() - start_time))
        x = col_lis[i]
        options = [{ 'TEXT': "BNAME = '" + x +"'"}]
        fields = [{'FIELDNAME': 'CLASS'},{'FIELDNAME':'USTYP'}]
        ROWS_AT_A_TIME = 10
        rowskips = 0
        while True:
            result = conn.call('RFC_READ_TABLE', QUERY_TABLE = 'USR02', OPTIONS = options, FIELDS = fields, ROWSKIPS = rowskips, ROWCOUNT = ROWS_AT_A_TIME)
            rowskips += ROWS_AT_A_TIME
            if len(result['DATA']) < ROWS_AT_A_TIME:
                break
        print("--- %s seconds in SAP ---" % (time.time() - start_time))
        data_result = result['DATA']
        length_result = len(data_result)
        for line in range(0,length_result):
            a= data_result[line]["WA"]
            output_list.append(a)
    print(output_list)
except (CommunicationError, LogonError, ABAPApplicationError, ABAPRuntimeError):
    # same exceptions as in the first version
    raise
First I put timing marks at different places in the code, having divided it into functional sections (SAP processing, Excel processing).
After analyzing the timings, I found that most of the runtime is consumed by the Excel-writing code.
Consider the intervals:
16:52:37.306272
16:52:37.405006 moment it was fetched from SAP
16:52:37.552611 moment it was pushed to Excel
16:52:37.558631
16:52:37.634395 moment it was fetched from SAP
16:52:37.796002 moment it was pushed to Excel
16:52:37.806930
16:52:37.883724 moment it was fetched from SAP
16:52:38.060254 moment it was pushed to Excel
16:52:38.067235
16:52:38.148098 moment it was fetched from SAP
16:52:38.293669 moment it was pushed to Excel
16:52:38.304640
16:52:38.374453 moment it was fetched from SAP
16:52:38.535054 moment it was pushed to Excel
16:52:38.542004
16:52:38.618800 moment it was fetched from SAP
16:52:38.782363 moment it was pushed to Excel
16:52:38.792336
16:52:38.873119 moment it was fetched from SAP
16:52:39.034687 moment it was pushed to Excel
16:52:39.040712
16:52:39.114517 moment it was fetched from SAP
16:52:39.264716 moment it was pushed to Excel
16:52:39.275649
16:52:39.346005 moment it was fetched from SAP
16:52:39.523721 moment it was pushed to Excel
16:52:39.530741
16:52:39.610487 moment it was fetched from SAP
16:52:39.760086 moment it was pushed to Excel
16:52:39.771057
16:52:39.839873 moment it was fetched from SAP
16:52:40.024574 moment it was pushed to Excel
As you can see, the Excel-writing part takes roughly twice as long as the SAP-querying part.
What is wrong in your code is that you open and initialize the workbook and sheet in each loop iteration. This slows execution a lot and is redundant, as you can reuse the workbook variables from the top.
Another thing that looks redundant is the strip() call: it only removes leading and trailing whitespace from the value, and the variant below writes the value to the cell without it.
This variant of the code
from datetime import datetime

try:
    wb = openpyxl.load_workbook('new2.xlsx')
    ws = wb['Sheet1']
    print(datetime.now().time())
    for i in range(1,len(ws['A'])+1):
        x = ws['A'+ str(i)].value
        options = [{ 'TEXT': "BNAME = '" + x +"'"}]
        fields = [{'FIELDNAME': 'CLASS'},{'FIELDNAME':'USTYP'}]
        ROWS_AT_A_TIME = 10
        rowskips = 0
        while True:
            result = conn.call('RFC_READ_TABLE', QUERY_TABLE = 'USR02', OPTIONS = options, FIELDS = fields, ROWSKIPS = rowskips, ROWCOUNT = ROWS_AT_A_TIME)
            rowskips += ROWS_AT_A_TIME
            if len(result['DATA']) < ROWS_AT_A_TIME:
                break
        data_result = result['DATA']
        length_result = len(data_result)
        for line in range(0,length_result):
            ws['B'+str(i)].value = data_result[line]["WA"]
    # the workbook is saved once, after all rows have been filled in
    wb.save('new2.xlsx')
    print(datetime.now().time())
except ...
gives me the following timestamps for a program run:
>>> exec(open('RFC_READ_TABLE.py').read())
18:14:03.003174
18:16:29.014373
That is about 2.5 minutes for 1000 user records, which looks like a fair price for this kind of processing.
In my opinion, the problem is in the while True loop. I think you need to optimize your query logic (or change it). It is hard to say more without knowing what you are looking for in the DB. The other parts look simple and fast.
Something that could help is to avoid opening and closing the file continuously: compute your "B" column first, then open the file and write everything to the xlsx at once. It could help (but I'm pretty sure the query is the problem).
P.S. Maybe you can use a timing library to find out where you spend most of the time.
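For the timing suggestion, here is a minimal sketch that uses only the standard library. The `timed` helper and the section names are hypothetical, and the `time.sleep` calls stand in for the real SAP call and the Excel write.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

totals = defaultdict(float)   # accumulated seconds per named section

@contextmanager
def timed(section):
    """Add the time spent inside the with-block to totals[section]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        totals[section] += time.perf_counter() - start

for _ in range(3):
    with timed("query"):
        time.sleep(0.01)      # stand-in for conn.call('RFC_READ_TABLE', ...)
    with timed("excel"):
        time.sleep(0.02)      # stand-in for writing the cell and saving

for section, seconds in sorted(totals.items()):
    print("{}: {:.3f} s".format(section, seconds))
```

Comparing the totals after a run over a few dozen rows is usually enough to see which section dominates.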
I'm fairly new to Python and web scraping in general. The code below works, but it seems awfully slow for the amount of information it is actually going through. Is there any easy way to cut down on execution time? I'm not sure, but it does seem like I have typed out more, or made it more difficult, than I actually needed to. Any help would be appreciated.
Currently the code starts at the sitemap, then iterates through a list of additional sitemaps. Within the new sitemaps it pulls information to construct a URL for the JSON data of a webpage. From the JSON data I pull an XML link that I use to search for a string. If the string is found, it is appended to a text file.
import io
import requests
from bs4 import BeautifulSoup

# global variables
start = 'https://www.govinfo.gov/wssearch/getContentDetail?packageId='
dash = '-'
urlSitemap="https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml"
old_xml=requests.get(urlSitemap)
print (old_xml)
new_xml= io.BytesIO(old_xml.content).read()
final_xml=BeautifulSoup(new_xml)
linkToBeFound = final_xml.findAll('loc')
for loc in linkToBeFound:
    urlPLmap=loc.text
    old_xmlPLmap=requests.get(urlPLmap)
    print(old_xmlPLmap)
    new_xmlPLmap= io.BytesIO(old_xmlPLmap.content).read()
    final_xmlPLmap=BeautifulSoup(new_xmlPLmap)
    linkToBeFound2 = final_xmlPLmap.findAll('loc')
    for pls in linkToBeFound2:
        argh = pls.text.find('PLAW')
        theWanted = pls.text[argh:]
        thisShallWork =eval(requests.get(start + theWanted).text)
        print(requests.get(start + theWanted))
        dict1 = (thisShallWork['download'])
        finaldict = (dict1['modslink'])[2:]
        print(finaldict)
        url2='https://' + finaldict
        try:
            old_xml4=requests.get(url2)
            print(old_xml4)
            new_xml4= io.BytesIO(old_xml4.content).read()
            final_xml4=BeautifulSoup(new_xml4)
            references = final_xml4.findAll('identifier',{'type': 'Statute citation'})
            for sec in references:
                if sec.text == "106 Stat. 4845":
                    print(dash * 20)
                    print(sec.text)
                    print(dash * 20)
                    sec313 = open('sec313info.txt','a')
                    sec313.write("\n")
                    sec313.write(pls.text + '\n')
                    sec313.close()
        except:
            print('error at: ' + url2)
No idea why I spent so long on this, but I did. Your code was really hard to read through, so I started with that. I broke it up into two parts: getting the links from the sitemaps, then the other stuff. I broke out a few bits into separate functions too.
This checks about 2 URLs per second on my machine, which seems about right.
How this is better (you can argue with me about this part):
- It doesn't have to reopen and close the output file after each write.
- I removed a fair bit of unneeded code.
- I gave your variables better names (this does not improve speed in any way, but please do this, especially if you are asking for help).
Really the main thing: once you break it all up, it becomes fairly clear that what's slowing you down is waiting on the requests, which is pretty standard for web scraping. You can look into multithreading to avoid the wait. Once you get into multithreading, the benefit of breaking up your code will likely also become much more evident.
import requests
from bs4 import BeautifulSoup

# returns sitemap links
def get_links(s):
    old_xml = requests.get(s)
    new_xml = old_xml.text
    final_xml = BeautifulSoup(new_xml, "lxml")
    return final_xml.findAll('loc')

# gets the final url from your middle url and looks through it for the thing you are looking for
def scrapey(link):
    link_id = link[link.find("PLAW"):]
    r = requests.get('https://www.govinfo.gov/wssearch/getContentDetail?packageId={}'.format(link_id))
    print(r.url)
    try:
        r = requests.get("https://{}".format(r.json()["download"]["modslink"][2:]))
        print(r.url)
        soup = BeautifulSoup(r.text, "lxml")
        references = soup.findAll('identifier', {'type': 'Statute citation'})
        for ref in references:
            if ref.text == "106 Stat. 4845":
                return r.url
        else:
            # for/else: reached only when no reference matched
            return False
    except:
        print("bah" + r.url)
        return False

sitemap_links_el = get_links("https://www.govinfo.gov/sitemap/PLAW_sitemap_index.xml")
sitemap_links = map(lambda x: x.text, sitemap_links_el)
nlinks_el = map(get_links, sitemap_links)
links = [num.text for elem in nlinks_el for num in elem]

with open("output.txt", "a") as f:
    for link in links:
        url = scrapey(link)
        if url is False:
            print("no find")
        else:
            print("found on: {}".format(url))
            f.write("{}\n".format(url))
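As a rough illustration of the multithreading idea mentioned above, the sketch below runs a worker over many links with `concurrent.futures.ThreadPoolExecutor` from the standard library. `check_all` and `fake_worker` are hypothetical names. In the real script the worker would be `scrapey`, and the pool size of 8 is an arbitrary assumption that should stay small enough to be polite to the server.

```python
from concurrent.futures import ThreadPoolExecutor

def check_all(links, worker, max_workers=8):
    """Run worker(link) for every link on a thread pool.

    Returns the truthy results in the same order as the input links.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(worker, links)   # results come back in input order
        return [r for r in results if r]

def fake_worker(link):
    # Stand-in for scrapey(link): returns a "URL" on a hit, False otherwise.
    return link.upper() if "hit" in link else False

found = check_all(["hit-1", "miss", "hit-2"], fake_worker)
print(found)
```

Collecting the results first and writing them to the file afterwards keeps the file handling in a single thread.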
I'm fairly new to AWS and, for the past week, have been following all the helpful documentation on the site.
I am currently stuck on being unable to pull the External Image Id data from a Rekognition collection after a 'search faces by image' call. I just need to be able to put that data into a variable or print it. Does anybody know how I could do that?
Basically, this is my code:
import boto3

if __name__ == "__main__":
    bucket = 'bucketname'
    collectionId = 'collectionname'
    fileName = 'test.jpg'
    threshold = 90
    maxFaces = 2
    admin = 'test'
    targetFile = "%sTarget.jpg" % admin
    imageTarget = open(targetFile, 'rb')
    client = boto3.client('rekognition')
    response = client.search_faces_by_image(CollectionId=collectionId,
                                            Image={'Bytes': imageTarget.read()},
                                            FaceMatchThreshold=threshold,
                                            MaxFaces=maxFaces)
    faceMatches = response['FaceMatches']
    print ('Matching faces')
    for match in faceMatches:
        print ('FaceId:' + match['Face']['FaceId'])
        print ('Similarity: ' + "{:.2f}".format(match['Similarity']) + "%")
At the end of it, I receive:
Matching faces
FaceId:8081ad90-b3bf-47e0-9745-dfb5a530a1a7
Similarity: 96.12%
Process finished with exit code 0
What I need is the External Image Id instead of the FaceId.
Thanks!
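A possible approach, sketched under the assumption that the faces were added to the collection with an ExternalImageId (for example via the `ExternalImageId` parameter of `index_faces`): each entry in `FaceMatches` carries a `Face` dictionary, and the value sits there under the key `ExternalImageId`, next to `FaceId`. The helper name and the sample response below are hypothetical. The sample only mimics the shape of the real response so the snippet runs without AWS credentials.

```python
def external_image_ids(response):
    """Return the ExternalImageId of every match (None where it was never set)."""
    return [match['Face'].get('ExternalImageId')
            for match in response.get('FaceMatches', [])]

# Made-up response with the same structure as search_faces_by_image output.
sample_response = {
    'FaceMatches': [
        {'Similarity': 96.12,
         'Face': {'FaceId': '8081ad90-b3bf-47e0-9745-dfb5a530a1a7',
                  'ExternalImageId': 'test'}},   # 'test' is an invented value
    ]
}

ids = external_image_ids(sample_response)
print(ids)
```

In the loop from the question this corresponds to reading `match['Face'].get('ExternalImageId')` alongside `match['Face']['FaceId']`.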
I have files with the following structure:
{
"function": "ComAl_Set_nad_crtl_xcall_state",
"timeStamp": 1488500329974,
"Param1": "SIG_NAD_XCALL_ATTEMPTS_COUNT",
"Value1": "2"
}
These JSON files are created by some functions which I have in my program. But I have an issue getting the last value from these files (Value1). Currently this is the code I am using to get data from the file:
def get_json_from_stub(self, file_name):
    def jsonize_stub(raw_data):
        end = raw_data.rfind(",")
        parsed_data = "[" + raw_data[:end] + "]"
        return json.loads(parsed_data.replace("\00", ""))

    command = "'cat " + self.stub_path + file_name + "'"
    content = self.send_ssh_command(command)
    json_stub = jsonize_stub(content)
    return json_stub
and this is the code for getting Value1:
@app.route('/stub/comal/getSignal/ComAl_Set_nad_crtl_xcall_requests', methods=['GET'])
def get_nad_crtl_xcall_requests():
    file_name = "ComAl_Set_nad_crtl_xcall_requests.out"
    json_stub = self.stubManager.get_json_from_stub(file_name)
    return MapEcallRequests().tech_to_business(json_stub[-1]["Value1"])
More specifically, I want to replace json_stub[-1]["Value1"] with another way of getting Value1. The problem is that sometimes these files don't get written, so I would like to get Value1 in a different way and raise an error message in case Value1 isn't there, to avoid my application crashing. Is there a way to do it? Thanks.
You can check that the key exists (and also that the list is not empty):
if len(json_stub) > 0 and json_stub[-1].get('Value1') is not None:
    value1_node = json_stub[-1]['Value1']
else:
    # 'Value1' key does not exist
    value1_node = None
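If an explicit error is preferred over a silent fallback, the same check can be wrapped in a small helper. This is only a sketch: `last_value` is a hypothetical name, and `LookupError` is just one reasonable choice of exception.

```python
def last_value(json_stub, key="Value1"):
    """Return json_stub[-1][key], or raise LookupError with a readable message."""
    if not json_stub:
        raise LookupError("stub is empty (was the file written?)")
    last = json_stub[-1]
    if not isinstance(last, dict) or key not in last:
        raise LookupError("'{}' is missing from the last entry".format(key))
    return last[key]

stub = [{"function": "ComAl_Set_nad_crtl_xcall_state",
         "Param1": "SIG_NAD_XCALL_ATTEMPTS_COUNT",
         "Value1": "2"}]
print(last_value(stub))

try:
    last_value([])
except LookupError as exc:
    print("error:", exc)
```

The route handler can then catch `LookupError` and return an error response instead of letting the exception crash the request.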