API request works once but returns an empty result in a loop, why? - python

I've searched various methods, but none work and I don't understand which part went wrong.
The request works for a single ID. How do I loop through a list of IDs, skipping any ID that returns an error?
Current code:
import requests
import pandas as pd

x = 22555003
URL = "https://data.gcis.nat.gov.tw/od/data/api/5F64D864-61CB-4D0D-8AD9-492047CC1EA6?$format=json&$filter=Business_Accounting_NO eq {}".format(x)
response = requests.get(url=URL)
data = response.json()
result = pd.DataFrame(data)
result.head()
Desired approach, but the result comes back empty:
listID = ['22555003', '12345678', '27240313']
# 12345678 is an error ID
result = []
for x in listID:
    try:
        JSONContent = requests.get("https://data.gcis.nat.gov.tw/od/data/api/5F64D864-61CB-4D0D-8AD9-492047CC1EA6?$format=json&$filter=Business_Accounting_NO eq {}".format(x)).json()
        result.append([JSONContent['Business_Accounting_NO'],
                       JSONContent['Capital_Stock_Amount']])
    except:
        pass
dataset = pd.DataFrame(result)
dataset.head()
Why is result empty?
Thanks!!!

The API returns a JSON list, so each record needs to be indexed as JSONContent[0][...]; indexing the list with a string key raised a TypeError on every iteration, and the bare except: pass swallowed it, which is why result stayed empty. Printing the exception makes the failure visible, and the bad ID is simply skipped:
import pandas as pd
import requests

listID = ['22555003', '12345678', '27240313']
# 12345678 is an error ID
result = []
for x in listID:
    try:
        JSONContent = requests.get("https://data.gcis.nat.gov.tw/od/data/api/5F64D864-61CB-4D0D-8AD9-492047CC1EA6?$format=json&$filter=Business_Accounting_NO eq {}".format(x)).json()
        # print(JSONContent[0]['Business_Accounting_NO'])
        result.append([JSONContent[0]['Business_Accounting_NO'], JSONContent[0]['Capital_Stock_Amount']])
        print(result)
    except Exception as e:
        print(e)

dataset = pd.DataFrame(result)
dataset.head()
print(result)
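As a small extension (not part of the original answer): checking the HTTP status and the shape of the response explicitly makes it clearer why a given ID was skipped than a blanket except does. A sketch against the same endpoint, with illustrative column names:

import pandas as pd
import requests

API = ("https://data.gcis.nat.gov.tw/od/data/api/5F64D864-61CB-4D0D-8AD9-492047CC1EA6"
       "?$format=json&$filter=Business_Accounting_NO eq {}")

rows = []
for x in ['22555003', '12345678', '27240313']:
    try:
        resp = requests.get(API.format(x), timeout=10)
        resp.raise_for_status()                      # HTTP-level failures raise here
        records = resp.json()
        if not isinstance(records, list) or not records:
            print("no record for", x)                # e.g. the error ID 12345678
            continue
        rows.append([records[0]['Business_Accounting_NO'],
                     records[0]['Capital_Stock_Amount']])
    except Exception as e:
        print("skipping", x, "-", e)

dataset = pd.DataFrame(rows, columns=['Business_Accounting_NO', 'Capital_Stock_Amount'])
print(dataset.head())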

Related

Looping through links with a paginated API in Python works, but only ends on a timeout error?

I'm new to Python, and I don't see much information on Stack Overflow about paginating with the links method. The loop works perfectly in that it pulls all the data I want, but it only breaks when there's a timeout error, which happens when my Mac falls asleep. Sometimes it runs for 2 hours until my Mac sleeps. I'm wondering if there's a faster way to retrieve this data? Here is my Python script:
import requests
import pandas as pd

res = []
url = "https://horizon.stellar.org/accounts/GCQJVAXWHB23WBNIG7TWEWHWUGGB6HWBC2ASPF5HMSADO5R5UKI4T7SD/trades"
querystring = {"limit": "200"}
try:
    while True:
        response = requests.request("GET", url, params=querystring)
        data = response.json()
        res += data['_embedded']['records']
        if "href" not in data['_links']['next']:
            break
        url = data['_links']['next']['href']
except Exception as ex:
    print("Exception:", ex)
df = pd.json_normalize(res)
df.to_csv('stellar_t7sd_trades.csv')
It returns with the following:
Exception: ('Connection aborted.', TimeoutError(60, 'Operation timed out'))
But it does write the desired data to the csv file.
Is there a problem with my loop in that it doesn't properly break when it's done returning the data? I'm just trying to figure out a way so it doesn't run for 2 hours; other than that, I get the desired data.
I solved this by adding a break after n number of loop iterations. This only works because I know exactly how many iterations of the loop will pull the data I need.
res = []
url = "https://horizon.stellar.org/accounts/GCQJVAXWHB23WBNIG7TWEWHWUGGB6HWBC2ASPF5HMSADO5R5UKI4T7SD/trades"
querystring = {"limit": "200"}
n = 32
try:
    while n > 0:  # True:
        response = requests.request("GET", url, params=querystring)
        n -= 1
        data = response.json()
        res += data['_embedded']['records']
        if "href" not in data['_links']['next']:
            break
        elif n == 32:
            break
        url = data['_links']['next']['href']
except Exception as ex:
    print("Exception:", ex)
df = pd.json_normalize(res)
df.to_csv('stellar_t7sd_tradestest.csv')
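A guess at a more general stopping rule (not from the original post): Horizon appears to keep returning a next href even on the last page, which would explain why the original condition never triggers, so breaking when a page comes back with no records avoids hard-coding the iteration count. A sketch, assuming the same endpoint and response structure:

import requests
import pandas as pd

res = []
url = "https://horizon.stellar.org/accounts/GCQJVAXWHB23WBNIG7TWEWHWUGGB6HWBC2ASPF5HMSADO5R5UKI4T7SD/trades"
querystring = {"limit": "200"}

while True:
    data = requests.get(url, params=querystring, timeout=30).json()
    records = data['_embedded']['records']
    if not records:                       # an empty page means there is nothing left to fetch
        break
    res += records
    next_href = data['_links'].get('next', {}).get('href')
    if not next_href:                     # defensive: also stop if the next link ever disappears
        break
    url = next_href

df = pd.json_normalize(res)
df.to_csv('stellar_t7sd_trades.csv')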

How to create an Excel file if the script stops executing

I am reading an Excel file which contains multiple unique numbers, which I use to perform request activity against a REST API. At the end of the program, I try to write another Excel file recording the status as 'Success' or 'Fail'.
The issue is that the input file contains over 100k numbers, so if my program stops for any reason, or even if I stop it intentionally, the output Excel file is never created. How do I make sure I get an output file for whatever has been processed while my script runs?
Here is my code:
from openpyxl import Workbook
from openpyxl import load_workbook
import requests
from datetime import datetime

def api_status():
    wk = Workbook()
    ws = wk.active
    start_row = 2
    start_column = 1
    status_column = 2
    wk = load_workbook("Data-File.xlsx")
    source = wk["Sheet"]
    global IDNO
    for id_list in source['A']:
        IDNO = id_list.value
        url = "someURL"
        payload = {'id_no': str(IDNO)}
        headers = {}
        response = requests.request("POST", json_url, data=json.dumps(payload), headers=headers)
        json_obj = response.json()
        ws.cell(row=start_row, column=start_column).value = IDNO
        json_message = (json_obj.get('message'))
        if json_message == "Success":
            ws.cell(row=start_row, column=status_column).value = "Success"
            start_row += 1
        else:
            print("NO")
            ws.cell(row=start_row, column=status_column).value = "FAIL"
            start_row += 1
    wb.save("STATUS-FILE-%s.xlsx" % datetime.now().strftime("%d-%m-%Y_%I-%M-%S_%p"))

api_status()
You can just fall back to recovery code by using a try...except. Replace this line:
api_status()
With this block of code:
try:
    api_status()
except:
    # CODE TO WRITE YOUR "FAIL" STATUS
    pass
Or we can do this for the loop inside the function. Of course, there's a little more to it than that: you can specify different actions for different error types, or you might want to put that try...except block inside your function so that specific lines can fail while the rest still runs.
I'm assuming the most likely error would come from your web request. In that case:
try:
    response = requests.request("POST", json_url, data=json.dumps(payload), headers=headers)
    json_obj = response.json()
except:
    json_obj = {}
By making json_obj an empty dict when the request doesn't work, I guarantee that the next lines write FAIL in your Excel file instead of Success.
Combining both ideas to make sure your code reaches the save as well would look like this (using a finally to make sure it runs in any possible case):
import json
from datetime import datetime

import requests
from openpyxl import Workbook
from openpyxl import load_workbook


def api_status():
    out_wb = Workbook()  # separate name so load_workbook below does not overwrite the output workbook
    ws = out_wb.active
    start_row = 2
    start_column = 1
    status_column = 2
    wk = load_workbook("Data-File.xlsx")
    source = wk["Sheet"]
    try:
        global IDNO
        for id_list in source['A']:
            IDNO = id_list.value
            url = "someURL"
            payload = {'id_no': str(IDNO)}
            headers = {}
            try:
                # 'json_url' in the original was undefined; the url defined above is used here
                response = requests.request("POST", url, data=json.dumps(payload), headers=headers)
                json_obj = response.json()
            except:
                json_obj = {}  # an empty dict makes the check below fall through to FAIL
            ws.cell(row=start_row, column=start_column).value = IDNO
            json_message = json_obj.get('message')
            if json_message == "Success":
                ws.cell(row=start_row, column=status_column).value = "Success"
                start_row += 1
            else:
                print("NO")
                ws.cell(row=start_row, column=status_column).value = "FAIL"
                start_row += 1
    finally:
        out_wb.save("STATUS-FILE-%s.xlsx" % datetime.now().strftime("%d-%m-%Y_%I-%M-%S_%p"))
api_status()
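One more caveat, beyond the original answer: a finally block still cannot run if the process is killed outright (a crash, a kill -9, or power loss). A rough additional safeguard is to save the output workbook every so many rows, so a recent partial file always exists on disk. This is a self-contained sketch with hypothetical names and an arbitrary interval of 500, not the poster's actual request logic:

from openpyxl import Workbook

SAVE_EVERY = 500                          # hypothetical checkpoint interval
out_path = "STATUS-FILE-partial.xlsx"     # hypothetical output name
out_wb = Workbook()
ws = out_wb.active

for row, id_no in enumerate(range(2000), start=2):   # stand-in for the real ID column
    ws.cell(row=row, column=1).value = id_no
    ws.cell(row=row, column=2).value = "Success"      # stand-in for the real API status
    if (row - 1) % SAVE_EVERY == 0:
        out_wb.save(out_path)             # checkpoint: the file on disk is at most SAVE_EVERY rows stale

out_wb.save(out_path)                     # final save for the remaining rows

Each save rewrites the whole file, so the interval trades safety against I/O time; appending lines to a CSV as you go is cheaper if the intermediate file does not have to be an .xlsx.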

multithreading for loop not working in Python with no errors

I have put together the script below and wanted to test multithreading.
I am trying to make the for loop run threaded, so several URLs in the list can be processed in parallel.
This script doesn't error, but it doesn't do anything and I am not sure why.
If I remove the multithreading pieces, it works fine.
Can anyone help me?
import multiprocessing.dummy as mp
import requests
import pandas as pd
import datetime

urls = [
    'http://google.co.uk',
    'http://bbc.co.uk/'
]

def do_print(s):
    check_data = pd.DataFrame([])
    now = datetime.datetime.now()
    try:
        response = requests.get(url)
    except:
        response = 'null'
    try:
        response_code = response.status_code
    except:
        response_code = 'null'
    try:
        response_content = response.content
    except:
        response_content = 'null'
    try:
        response_text = response.text
    except:
        response_text = 'null'
    try:
        response_content_type = response.headers['Content-Type']
    except:
        response_content_type = 'null'
    try:
        response_server = response.headers['Server']
    except:
        response_server = 'null'
    try:
        response_last_modified = response.headers['Last-Modified']
    except:
        response_last_modified = 'null'
    try:
        response_content_encoding = response.headers['Content-Encoding']
    except:
        response_content_encoding = 'null'
    try:
        response_content_length = response.headers['Content-Length']
    except:
        response_content_length = 'null'
    try:
        response_url = response.url
    except:
        response_url = 'null'
    if int(response_code) < 400:
        availability = 'OK'
    elif int(response_code) > 399 and int(response_code) < 500:
        availability = 'Client Error'
    elif int(response_code) > 499:
        availability = 'Server Error'
    if int(response_code) < 400:
        availability_score = 1
    elif int(response_code) > 399 and int(response_code) < 500:
        availability_score = 0
    elif int(response_code) > 499:
        availability_score = 0
    d = {'check_time': [now], 'code': [response_code], 'type': [response_content_type], 'url': [response_url], 'server': [response_server], 'modified': [response_last_modified], 'encoding': [response_content_encoding], 'availability': [availability], 'availability_score': [availability_score]}
    df = pd.DataFrame(data=d)
    check_data = check_data.append(df, ignore_index=True, sort=False)

if __name__ == "__main__":
    p = mp.Pool(4)
    p.map(do_print, urls)
    p.close()
    p.join()
When I run the code I get an error, because it tries to convert int("null") - all because you have
    except:
        response_code = 'null'
If I use except Exception as ex: print(ex) then I get an error that the variable url doesn't exist. And that is true, because you have def do_print(s): but it should be def do_print(url):
BTW: instead of 'null' you could use the standard None and later check if response_code: before you try to convert it to an integer. Or simply skip the rest of the code when you get an error.
The other problem: the function should return df, and you should collect the results with
results = p.map(...)
and then use results to create the DataFrame check_data.
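Putting those fixes together, a trimmed sketch of what the corrected code could look like (my own illustration, not the original poster's script; the parameter is named url, a failed request leaves the code as None, the worker returns its one-row DataFrame, and the results from p.map are concatenated at the end; the availability logic is shortened here):

import datetime
import multiprocessing.dummy as mp

import pandas as pd
import requests

urls = [
    'http://google.co.uk',
    'http://bbc.co.uk/'
]

def do_check(url):                            # renamed parameter, as suggested above
    now = datetime.datetime.now()
    try:
        response = requests.get(url, timeout=10)
        code = response.status_code
    except requests.RequestException:
        code = None                           # None instead of the string 'null'
    availability = 'OK' if code is not None and code < 400 else 'Error'
    return pd.DataFrame({'check_time': [now], 'code': [code],
                         'url': [url], 'availability': [availability]})

if __name__ == "__main__":
    with mp.Pool(4) as p:
        results = p.map(do_check, urls)       # one DataFrame per URL
    check_data = pd.concat(results, ignore_index=True)
    print(check_data)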

How to handle an exception so the script continues to work?

There is a script which makes API requests by iterating over a params dictionary.
If the params are not compatible with each other (metrics and dimensions) or there is a mistake, it throws an exception:
googleapiclient.errors.HttpError: "Could not parse content (N/A) of field parameters.filters.">
and the script stops working.
It looks like this:
def yt_return_api_response(yt_params):
    responses = []
    timestamp = []
    try:
        youtubeAnalytics = get_service()
        for k, v in yt_params.items():
            request = execute_api_request(
                youtubeAnalytics.reports().query,
                ids=v['ids'],
                startDate=v['startDate'],
                endDate=v['endDate'],
                metrics=v['metrics'],
                dimensions=v['dimensions'],
                filters=v['filters'],
                maxResults=v['maxResults'],
                sort=v['sort'])
            response = youtube_response(request)
            responses.append(response)
            # get the timestamp
            timestamp_request = dt.datetime.now()
            timestamp_request = timestamp_request.strftime('%Y-%m-%d %H:%M:%S.%f')
            timestamp.append(timestamp_request)
        return responses, timestamp
    except Exception as e:
        logging.error('Check the request params, unsupported query', exc_info=True)
I've tried to change it so that if there is a mistake in one iteration, it does not crash but continues working.
With 'while True' it starts and just keeps running without any result:
def yt_return_api_response(yt_params, request_ids, filters):
    responses = []
    timestamp = []
    while True:
        try:
With 'finally' it returns empty lists:
def yt_return_api_response(yt_params):
    responses = []
    timestamp = []
    try:
        youtubeAnalytics = get_service()
        for k, v in yt_params.items():
            request = execute_api_request(
                youtubeAnalytics.reports().query,
                ids=v['ids'],
                startDate=v['startDate'],
                endDate=v['endDate'],
                metrics=v['metrics'],
                dimensions=v['dimensions'],
                filters=v['filters'],
                maxResults=v['maxResults'],
                sort=v['sort'])
            response = youtube_response(request)
            responses.append(response)
            # get the timestamp
            timestamp_request = dt.datetime.now()
            timestamp_request = timestamp_request.strftime('%Y-%m-%d %H:%M:%S.%f')
            timestamp.append(timestamp_request)
    except Exception as e:
        logging.error('Check the request params, unsupported query', exc_info=True)
    finally:
        return responses, timestamp
Is there another way to handle exceptions?
You need to skip just that one iteration; as written, when your code catches an exception it leaves the whole loop. You can skip a single iteration like this:
def yt_return_api_response(yt_params):
    responses = []
    timestamp = []
    youtubeAnalytics = get_service()
    for k, v in yt_params.items():
        try:
            request = execute_api_request(
                youtubeAnalytics.reports().query,
                ids=v['ids'],
                startDate=v['startDate'],
                endDate=v['endDate'],
                metrics=v['metrics'],
                dimensions=v['dimensions'],
                filters=v['filters'],
                maxResults=v['maxResults'],
                sort=v['sort'])
            response = youtube_response(request)
            responses.append(response)
            # get the timestamp
            timestamp_request = dt.datetime.now()
            timestamp_request = timestamp_request.strftime('%Y-%m-%d %H:%M:%S.%f')
            timestamp.append(timestamp_request)
        except Exception as e:
            logging.error('Check the request params, unsupported query', exc_info=True)
    return responses, timestamp

Scraping Instagram with API ?__a=1

I've been trying to scrape Instagram posts for a certain hashtag, extracting the keys display_url, taken_at_timestamp, text, and edge_liked_by. This works fine for some hundreds of posts at the start, but then it stops fetching only the 'text' key; the other three fields are still fetched successfully. I am not sure why this is happening.
I am parsing the JSON at https://www.instagram.com/explore/tags/something/?__a=1.
base_url = "https://www.instagram.com/explore/tags/salonedelmobile/?__a=1"
url = "https://www.instagram.com/explore/tags/salonedelmobile/?__a=1"
while True:
response = url_req.urlopen(url)
json_file = json.load(response)
for i in json_file['graphql']['hashtag']['edge_hashtag_to_media']['edges']:
try:
post_text = i['node']['edge_media_to_caption']['edges'][0]['node']['text']
except IndexError as e:
post_text = e
try:
display_url = i['node']['display_url']
except:
display_url = e
try:
like_count = i['node']['edge_liked_by']['count']
except:
like_count = e
try:
time_stamp = i['node']['taken_at_timestamp']
except:
time_stamp = e
output.append([display_url, like_count, time_stamp, post_text])
df = pd.DataFrame(output,columns=['URL', 'Like Count', 'Time', 'Text'])
try:
df.to_excel('instagram.xlsx')
except:
pass
if json_file['graphql']['hashtag']['edge_hashtag_to_media']['page_info']['has_next_page'] == True:
end_cursor = json_file['graphql']['hashtag']['edge_hashtag_to_media']['page_info']['end_cursor']
url = base_url + '&max_id=' + end_cursor
else:
break
