I am new to Pandas and Python. I have a requirement where I pass 100 post codes to a URL in a for loop and try to extract the latitude and longitude for each post code.
I need to save the results in a dataframe. Below is the code I am using.
query_cust = "select custMasterID,Full_Name,POSTCODE from DMON.BANK_CUSTOMERS"
df_cust = pd.read_sql(query_cust, con=con_str)
df_cust["URL"] = "https://api.getthedata.com/postcode/" + df_cust['POSTCODE'].str.replace(" ", "")

for column in df_cust["URL"]:
    # print(column)
    response = requests.get(column)
    response_text = response.text
    # df = json.loads(response_text)['data']
    parse_json = json.loads(response_text)
    df_cust["Lat"] = pd.json_normalize(parse_json['data']['latitude'])
    df_cust["Long"] = parse_json['data']['longitude']

print(df_cust)
Below is the error that comes up when I try to run it:
df_cust["Lat"] = pd.json_normalize(parse_json['data']['latitude'])
in _json_normalize
raise NotImplementedError
NotImplementedError
You don't need json_normalize to get what you need from the response data; json_normalize expects a dict or a list of dicts, and parse_json['data']['latitude'] is just a scalar, which is why it raises NotImplementedError. Just iterate through each row of the dataframe and update the values:
import pandas as pd
import json
import requests
pd.options.display.max_columns = None
pd.options.display.max_rows = None
df_cust = pd.DataFrame(columns=['POSTCODE'])
# Just appending some data
df_cust = df_cust.append({'POSTCODE': 'SW1A-1AA'}, ignore_index=True)
df_cust = df_cust.append({'POSTCODE': 'WC2B-4AB'}, ignore_index=True)
df_cust = df_cust.append({'POSTCODE': 'ASDF-QWE'}, ignore_index=True) # Wrong postal code
for i, row in df_cust.iterrows():
    df_cust.at[i, 'URL'] = 'https://api.getthedata.com/postcode/' + row['POSTCODE'].replace('-', '')
    response = requests.get(df_cust.loc[i, 'URL'])
    parse_json = json.loads(response.text)
    if 'data' in parse_json:
        if 'latitude' in parse_json['data']:
            df_cust.at[i, 'LAT'] = parse_json['data']['latitude']
        else:
            df_cust.at[i, 'LAT'] = None
        if 'longitude' in parse_json['data']:
            df_cust.at[i, 'LON'] = parse_json['data']['longitude']
        else:
            df_cust.at[i, 'LON'] = None
    else:
        df_cust.at[i, 'LAT'] = None
        df_cust.at[i, 'LON'] = None

print(df_cust)
Output:
POSTCODE URL LAT LON
0 SW1A-1AA https://api.getthedata.com/postcode/SW1A1AA 51.501009 -0.141588
1 WC2B-4AB https://api.getthedata.com/postcode/WC2B4AB 51.514206 -0.119893
2 ASDF-QWE https://api.getthedata.com/postcode/ASDFQWE None None
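Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the snippet above only runs on older pandas versions. On current pandas the same lookup can be written by collecting plain dicts and building the frame once at the end; a minimal sketch, assuming the same api.getthedata.com endpoint and column names:

import requests
import pandas as pd

postcodes = ['SW1A-1AA', 'WC2B-4AB', 'ASDF-QWE']
rows = []
for pc in postcodes:
    url = 'https://api.getthedata.com/postcode/' + pc.replace('-', '')
    payload = requests.get(url).json()
    data = payload.get('data', {})  # empty dict if the lookup failed
    rows.append({'POSTCODE': pc,
                 'URL': url,
                 'LAT': data.get('latitude'),
                 'LON': data.get('longitude')})

df_cust = pd.DataFrame(rows)  # build the frame once, after the loop
print(df_cust)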
This is a typical example of a JSON response from a US Census Geocoder API request for addresses.
When I geocode the addresses using my API call, I collect the payload into a JSON file. When parsing that file with the Python code below, the geocodes sometimes get associated with the wrong input address: whenever a response hits a timeout, an exception, or random HTML text, the addresses and their geocodes start to mismatch once I convert the address geographies to a dataframe.
How can I modify my Python script so the geocodes stay mapped to the correct input addresses? Any help would be appreciated!
street = []
city = []
ipstate = []
zipcode = []
status = []
geoid = []
centlat = []
centlon = []
block = []
state = []
basename = []
oid = []
intptlat = []
objectid = []
tract = []
centlon = []
blkgrp = []
arealand = []
intptlon = []
county = []
for i in range(len(payload)):
    if '<!DOCTYPE html>' in payload[i]:
        print(i, 'HTML Response')
        status.append('HTML response')
        geoid.append(np.nan)
        centlat.append(np.nan)
        block.append(np.nan)
        state.append(np.nan)
        basename.append(np.nan)
        oid.append(np.nan)
        intptlat.append(np.nan)
        objectid.append(np.nan)
        tract.append(np.nan)
        centlon.append(np.nan)
        blkgrp.append(np.nan)
        arealand.append(np.nan)
        intptlon.append(np.nan)
        county.append(np.nan)
        street.append(np.nan)
        city.append(np.nan)
        ipstate.append(np.nan)
        zipcode.append(np.nan)
    else:
        data = json.loads(payload[i])
        inputAddress = data['result']['input']['address']
        street.append(inputAddress['street'])
        city.append(inputAddress['city'])
        ipstate.append(inputAddress['state'])
        zipcode.append(inputAddress['zip'])
        censusParams = data['result']['addressMatches']
        if len(censusParams) == 0:
            # print('No Match', i)
            status.append('No Match')
            geoid.append(np.nan)
            centlat.append(np.nan)
            block.append(np.nan)
            state.append(np.nan)
            basename.append(np.nan)
            oid.append(np.nan)
            intptlat.append(np.nan)
            objectid.append(np.nan)
            tract.append(np.nan)
            centlon.append(np.nan)
            blkgrp.append(np.nan)
            arealand.append(np.nan)
            intptlon.append(np.nan)
            county.append(np.nan)
            # print(inputAddress['street'], inputAddress['city'], inputAddress['state'], inputAddress['zip'])
        else:
            # print('Match', i)
            status.append('Match')
            # print(inputAddress['street'], inputAddress['city'], inputAddress['state'], inputAddress['zip'])
            for c in censusParams:
                for key, value in c.items():
                    if key == 'geographies':
                        censusBlocks = dict_get(value, 'Census Blocks')
                        params = censusBlocks[0][0]
                        geoid.append(params['GEOID'])
                        centlat.append(params['CENTLAT'])
                        centlon.append(params['CENTLON'])
                        block.append(params['BLOCK'])
                        state.append(params['STATE'])
                        basename.append(params['BASENAME'])
                        oid.append(params['OID'])
                        intptlat.append(params['INTPTLAT'])
                        intptlon.append(params['INTPTLON'])
                        objectid.append(params['OBJECTID'])
                        tract.append(params['TRACT'])
                        blkgrp.append(params['BLKGRP'])
                        arealand.append(params['AREALAND'])
                        county.append(params['COUNTY'])
df_columns = ['Match',
'STREET',
'CITY',
'IP_STATE',
'ZIP',
'GEOID',
'CENTLAT',
'CENTLON',
'BLOCK',
'STATE',
'BASENAME',
'OID',
'INTPTLAT',
'INTPTLON',
'OBJECTID',
'TRACT',
'BLKGRP',
'AREALAND',
'COUNTY']
json_df = pd.DataFrame(list(zip(status,
street,
city,
ipstate,
zipcode,
geoid,
centlat,
centlon,
block,
state,
basename,
oid,
intptlat,
intptlon,
objectid,
tract,
blkgrp,
arealand,
county)), columns = df_columns)
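For reference, one way to keep each geocode tied to the address it came from is to build a single dict per payload entry and append it whole, so a timeout or missing field can never shift the other columns. This is only a sketch, not a drop-in replacement: it covers a subset of the columns, reuses the payload list from the question, and the exact nesting under 'geographies' is assumed to match what dict_get(value, 'Census Blocks')[0][0] returns above.

import json
import numpy as np
import pandas as pd

rows = []
for raw in payload:
    row = {'Match': None, 'STREET': np.nan, 'CITY': np.nan, 'IP_STATE': np.nan,
           'ZIP': np.nan, 'GEOID': np.nan, 'CENTLAT': np.nan, 'CENTLON': np.nan}
    if '<!DOCTYPE html>' in raw:
        row['Match'] = 'HTML response'
    else:
        data = json.loads(raw)
        addr = data['result']['input']['address']
        row.update(STREET=addr['street'], CITY=addr['city'],
                   IP_STATE=addr['state'], ZIP=addr['zip'])
        matches = data['result']['addressMatches']
        if matches:
            row['Match'] = 'Match'
            # nesting assumed; adjust to whatever dict_get(..., 'Census Blocks')[0][0] yields above
            params = matches[0]['geographies']['Census Blocks'][0]
            row.update(GEOID=params['GEOID'], CENTLAT=params['CENTLAT'],
                       CENTLON=params['CENTLON'])
        else:
            row['Match'] = 'No Match'
    rows.append(row)  # one dict per input address, so the columns always stay aligned

json_df = pd.DataFrame(rows)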
I have a fairly complex sequence of functions calling APIs and appending the result set to a dataframe. The thing is, when I print the dataframe during each loop iteration I see new values, but when the loop finishes I only see one value in final_df. Any thoughts as to why?
df = pd.DataFrame(columns=['repo', 'number', 'title', 'branch', 'merged_at', 'created_at',
                           'authored_by', 'merged_by', 'from_version', 'to_version'])

def get_prs(repo, pr_number):
    response = requests.request("GET", pgv.github_pr_url + str(repo) + '/pulls/' + str(pr_number), headers=pgv.headers)
    response = response.json()
    return response

def get_commits(repo, from_version, to_version):
    response = requests.request("GET", pgv.github_commits_url + str(repo) + '/compare/' + str(from_version) + '...' + str(to_version), headers=pgv.headers)
    response = response.json()
    # print(len(response['commits']))
    # print(response['commits'])
    for i in range(0, len(response['commits'])):
        # print(response['commits'][i])
        # x = re.match(r"\AMerge pull request #(?P<number>\d+) from/(?P<branch>(.+)\s*$)", response['commits'][i].get('commit').get('message'))
        x = re.search("\AMerge pull request #(?P<number>\d+) from/(?P<branch>.*)", response['commits'][i].get('commit').get('message'))
        # print(x)
        if x is None:
            pass
        else:
            # return re.search("(\d+)",x.group(0)).group(0), response['commits'][i].get('branches_url')
            return x.group('number'), x.group('branch')
            # print(x.group('branch'))

# query GitHub to get all commits between from_version and to_version.
def return_deploy_events():
    final_object = []
    response = requests.request('POST', pgv.url, params={'api_key': pgv.key}, json=pgv.query_params)
    response = response.json()
    if "jobs" in response:
        time.sleep(5)
    else:
        for i in range(0, len(response['query_result']['data']['rows'])):
            # print(response['query_result']['data']['rows'])
            # get_prs(response['query_result']['data']['rows'][i].get('REPO'),get_commits(response['query_result']['data']['rows'][i].get('REPO'),response['query_result']['data']['rows'][i].get('FROM_VERSION'), response['query_result']['data']['rows'][i].get('TO_VERSION'))).get('merged_at')
            try:
                repo = response['query_result']['data']['rows'][i].get('REPO')
                from_version = response['query_result']['data']['rows'][i].get('FROM_VERSION')
                to_version = response['query_result']['data']['rows'][i].get('TO_VERSION')
                # print(get_prs(repo,get_commits(repo,from_version, to_version)))
                pull_requests = get_prs(repo, get_commits(repo, from_version, to_version)[0])
                ## pack into all one return
                final_df = df.append({
                    'repo': repo,
                    'title': pull_requests.get('title'),
                    'branch': get_commits(repo, from_version, to_version)[1],
                    'created_at': pull_requests.get('created_at'),
                    'merged_at': pull_requests.get('merged_at'),
                    'authored_by': pull_requests.get('user').get('login'),
                    'merged_by': pull_requests.get('merged_by').get('login'),
                    'number': get_commits(repo, from_version, to_version)[0],
                    'from_version': from_version,
                    'to_version': to_version}, ignore_index=True)
                # print(get_commits(repo,from_version, to_version))
                # **HERE, WHEN UNCOMMENTED, PRINTS ALL RECORDS I WANT APPENDED**
                # print(final_df.head(10))
            except Exception:
                pass
            # 'title':, 'branch',
            # 'merged_at', 'created_at', 'authored_by', 'merged_by',
            # 'from_version': response['query_result']['data']['rows'][i].get('FROM_VERSION'), 'to_version': response['query_result']['data']['rows'][i].get('TO_VERSION')},
            # ignore_index = True)
        # **BELOW IS WHERE IT PRINTS ONLY 1 RECORD**
        print(final_df)
        # final_df = json.loads(final_df.to_json(orient='records'))
        # gec.json_to_s3(final_df, glob_common_vars.s3_resource, glob_common_vars.s3_bucket_name, 'test/test.json.gzip')

return_deploy_events()
I think the problem is that you are assigning each row to the same variable, so only the last row is left at the end. Try appending each row to a result list instead:
def return_deploy_events():
    final_object = []
    result = []
    response = requests.request('POST', pgv.url, params={'api_key': pgv.key}, json=pgv.query_params)
    response = response.json()
    if "jobs" in response:
        time.sleep(5)
    else:
        for i in range(0, len(response['query_result']['data']['rows'])):
            # print(response['query_result']['data']['rows'])
            # get_prs(response['query_result']['data']['rows'][i].get('REPO'),get_commits(response['query_result']['data']['rows'][i].get('REPO'),response['query_result']['data']['rows'][i].get('FROM_VERSION'), response['query_result']['data']['rows'][i].get('TO_VERSION'))).get('merged_at')
            try:
                repo = response['query_result']['data']['rows'][i].get('REPO')
                from_version = response['query_result']['data']['rows'][i].get('FROM_VERSION')
                to_version = response['query_result']['data']['rows'][i].get('TO_VERSION')
                # print(get_prs(repo,get_commits(repo,from_version, to_version)))
                pull_requests = get_prs(repo, get_commits(repo, from_version, to_version)[0])
                ## pack into all one return
                final_df = df.append({
                    'repo': repo,
                    'title': pull_requests.get('title'),
                    'branch': get_commits(repo, from_version, to_version)[1],
                    'created_at': pull_requests.get('created_at'),
                    'merged_at': pull_requests.get('merged_at'),
                    'authored_by': pull_requests.get('user').get('login'),
                    'merged_by': pull_requests.get('merged_by').get('login'),
                    'number': get_commits(repo, from_version, to_version)[0],
                    'from_version': from_version,
                    'to_version': to_version}, ignore_index=True)
                # print(get_commits(repo,from_version, to_version))
                # **HERE, WHEN UNCOMMENTED, PRINTS ALL RECORDS I WANT APPENDED**
                # print(final_df.head(10))
                result.append(final_df)  # append the current row to result
            except Exception:
                pass
        # **BELOW IS WHERE IT PRINTS ONLY 1 RECORD**
        print(result)  # print the final result
I just added two lines of code, but I hope it works.
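A follow-up note, not part of the original answer: since df starts empty and df.append returns a new frame, each element of result is a one-row DataFrame, and DataFrame.append itself was removed in pandas 2.0. A pattern that avoids both issues is to collect plain dicts in the loop and build the frame once afterwards; a minimal sketch with placeholder values standing in for the API results:

import pandas as pd

rows = []  # accumulate one dict per record inside the loop
for repo, number in [('repo-a', '1'), ('repo-b', '2')]:  # placeholder values, not real API output
    rows.append({'repo': repo, 'number': number})

final_df = pd.DataFrame(rows)  # build the frame once, after the loop
print(final_df)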
I was able to extract the data from the url_query URL, but additionally I would like to get the data from the list of URLs (issue_list) built from the query['ids'] column of the dataframe. Please see the current logic below:
url = 'https://instancename.some-platform.com/api/now/table/data?display_value=true&'
team = 'query=group_name=123456789'
url_query = url+team
The dataframe query:
                                ids
0  aaabbb1cccdddeee4ffggghhhhh5iijj
1  aa1bbb2cccdddeee5ffggghhhhh6iijj
issue_list = []
for issue in query['ids']:
    issue_list.append(f'https://instancename.some-platform.com/api/now/table/data?display_value=true&?display_value=true&query=group_name&sys_id={issue}')

response = requests.get(url_query, headers=headers, auth=auth, proxies=proxies)
data = response.json()

def api_response(k):
    dct = dict(
        event_id = k['number'],
        created_time = k['created'],
        status = k['status'],
        created_by = k['raised_by'],
        short_desc = k['short_description'],
        group = k['team']
    )
    return dct

raw_data = []
for p in data['result']:
    rec = api_response(p)
    raw_data.append(rec)

df = pd.DataFrame.from_records(raw_data)
The url_query response extracts what I need, but I would like to add the data from issue_list to the existing df. I don't know how to feed issue_list into the request; I tried response = requests.get(issue_list, headers=headers, auth=auth, proxies=proxies), but that raises an invalid schema error because requests.get expects a single URL string, not a list.
You can create a list of DataFrames, requesting each URL q instead of url_query, and then join them together with concat:
dfs = []
for issue in query['ids']:
    q = f'https://instancename.some-platform.com/api/now/table/data?display_value=true&?display_value=true&query=group_name&sys_id={issue}'
    response = requests.get(q, headers=headers, auth=auth, proxies=proxies)
    data = response.json()

    raw_data = [api_response(p) for p in data['result']]
    df = pd.DataFrame.from_records(raw_data)
    dfs.append(df)

df = pd.concat(dfs, ignore_index=True)
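If you also want to keep the rows that came from the original url_query request, you can include that frame in the same concat. A small sketch with stand-in frames (the real ones would be the df built from url_query and the dfs list above):

import pandas as pd

# stand-ins for the frame built from url_query and the per-issue frames
df_existing = pd.DataFrame([{'event_id': 'INC001'}])
dfs = [pd.DataFrame([{'event_id': 'INC002'}]), pd.DataFrame([{'event_id': 'INC003'}])]

df_all = pd.concat([df_existing] + dfs, ignore_index=True)  # one combined frame
print(df_all)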
I'm working on a personal project and I'm trying to retrieve air quality data from the https://aqicn.org website using their API.
I've used this code, which I've copied and adapted for the city of Bucharest as follows:
import pandas as pd
import folium
import requests
# GET data from AQI website through the API
base_url = "https://api.waqi.info"
path_to_file = "~/path"
# Got token from:- https://aqicn.org/data-platform/token/#/
with open(path_to_file) as f:
    contents = f.readlines()
key = contents[0]
# (lat, long)-> bottom left, (lat, lon)-> top right
latlngbox = "44.300264,25.920181,44.566991,26.297836" # For Bucharest
trail_url=f"/map/bounds/?token={key}&latlng={latlngbox}" #
my_data = pd.read_json(base_url + trail_url) # Joined parts of URL
print('columns->', my_data.columns)  # 2 cols: 'status' and 'data' (JSON)
### Built a dataframe from the json file
all_rows = []
for each_row in my_data['data']:
    all_rows.append([each_row['station']['name'],
                     each_row['lat'],
                     each_row['lon'],
                     each_row['aqi']])
df = pd.DataFrame(all_rows, columns=['station_name', 'lat', 'lon', 'aqi'])
# Cleaned the DataFrame
df['aqi'] = pd.to_numeric(df.aqi, errors='coerce') # Invalid parsing to NaN
# Remove NaN entries in col
df1 = df.dropna(subset = ['aqi'])
Unfortunately it only retrieves 4 stations, whereas many more are available on the actual site. In the API documentation the only limitation I saw was "1,000 (one thousand) requests per second", so why can't I get more of them?
Also, I've tried modifying the lat/long values and managed to get more stations, but they were outside the city I was interested in.
Here is a view of the actual perimeter I've used in the embedded code.
If you have any suggestions as to how I can solve this issue, I'd be very happy to read your thoughts. Thank you!
Try using waqi through aqicn... not exactly a clean API but I found it to work quite well
import pandas as pd
import folium
from folium.plugins import HeatMap  # used for the heat-map further down
url1 = 'https://api.waqi.info'
# Get token from:- https://aqicn.org/data-platform/token/#/
token = 'XXX'
box = '113.805332,22.148942,114.434299,22.561716' # polygon around HongKong via bboxfinder.com
url2=f'/map/bounds/?latlng={box}&token={token}'
my_data = pd.read_json(url1 + url2)
all_rows = []
for each_row in my_data['data']:
    all_rows.append([each_row['station']['name'], each_row['lat'], each_row['lon'], each_row['aqi']])
df = pd.DataFrame(all_rows,columns=['station_name', 'lat', 'lon', 'aqi'])
From there it's easy to plot:
df['aqi'] = pd.to_numeric(df.aqi,errors='coerce')
print('with NaN->', df.shape)
df1 = df.dropna(subset = ['aqi'])
df2 = df1[['lat', 'lon', 'aqi']]
init_loc = [22.396428, 114.109497]
max_aqi = int(df1['aqi'].max())
print('max_aqi->', max_aqi)
m = folium.Map(location = init_loc, zoom_start = 5)
heat_aqi = HeatMap(df2, min_opacity = 0.1, max_val = max_aqi,
radius = 60, blur = 20, max_zoom = 2)
m.add_child(heat_aqi)
m
Or like this, with a marker per station:
centre_point = [22.396428, 114.109497]
m2 = folium.Map(location = centre_point,tiles = 'Stamen Terrain', zoom_start= 6)
for idx, row in df1.iterrows():
    lat = row['lat']
    lon = row['lon']
    station = row['station_name'] + ' AQI=' + str(row['aqi'])
    station_aqi = row['aqi']
    if station_aqi > 300:
        pop_color = 'red'
    elif station_aqi > 200:
        pop_color = 'orange'
    else:
        pop_color = 'green'
    folium.Marker(location=[lat, lon],
                  popup=station,
                  icon=folium.Icon(color=pop_color)).add_to(m2)
m2
Checking for stations within Hong Kong returns 19:
df[df['station_name'].str.contains('HongKong')]
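One small addition, not from the original answer: the bare m / m2 at the end of each block only renders inline inside a Jupyter notebook. If you run this as a plain script, a folium map can be written to a standalone HTML file with its save method; a sketch with a hypothetical filename:

import folium

m = folium.Map(location=[22.396428, 114.109497], zoom_start=5)
m.save('aqi_map.html')  # writes an HTML file you can open in any browser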
I'm trying to get some information from a webshop website with Python.
I tried this:
def proba():
    my_url = requests.get('https://www.telekom.hu/shop/categoryresults/?N=10994&contractType=list_price&instock_products=1&Ns=sku.sortingPrice%7C0%7C%7Cproduct.displayName%7C0&No=0&Nrpp=9&paymentType=FULL')
    data = my_url.json()
    results = []
    products = data['MainContent'][0]['contents'][0]['productList']['products']
    for product in products:
        name = product['productModel']['displayName']
        try:
            priceGross = product['priceInfo']['priceItemSale']['gross']
        except:
            priceGross = product['priceInfo']['priceItemToBase']['gross']
        url = product['productModel']['url']
        results.append([name, priceGross, url])
    df = pd.DataFrame(results, columns=['Name', 'Price', 'Url'])
    # print(df)  ## print df
    df.to_csv(r'/usr/src/Python-2.7.13/test.csv', sep=',', encoding='utf-8-sig', index=False)
while True:
    mytime = datetime.now().strftime("%H:%M:%S")
    while mytime < "23:59:59":
        print mytime
        proba()
        mytime = datetime.now().strftime("%H:%M:%S")
The webshop lists 9 items, but I only see 1 row in the CSV file.
Not entirely sure what you intend as the end result. Are you wanting to update an existing file, or get the data and write it all out in one go? An example of the latter is shown below, where I add each new dataframe to an overall dataframe and use a return statement in the function to provide each new dataframe.
import requests
from datetime import datetime
import pandas as pd
def proba():
    my_url = requests.get('https://www.telekom.hu/shop/categoryresults/?N=10994&contractType=list_price&instock_products=1&Ns=sku.sortingPrice%7C0%7C%7Cproduct.displayName%7C0&No=0&Nrpp=9&paymentType=FULL')
    data = my_url.json()
    results = []
    products = data['MainContent'][0]['contents'][0]['productList']['products']
    for product in products:
        name = product['productModel']['displayName']
        try:
            priceGross = product['priceInfo']['priceItemSale']['gross']
        except:
            priceGross = product['priceInfo']['priceItemToBase']['gross']
        url = product['productModel']['url']
        results.append([name, priceGross, url])
    df = pd.DataFrame(results, columns=['Name', 'Price', 'Url'])
    return df
headers = ['Name', 'Price', 'Url']
df = pd.DataFrame(columns = headers)
while True:
    mytime = datetime.now().strftime("%H:%M:%S")
    while mytime < "23:59:59":
        print(mytime)
        dfCurrent = proba()
        mytime = datetime.now().strftime("%H:%M:%S")
        df = pd.concat([df, dfCurrent])

df.to_csv(r"C:\Users\User\Desktop\test.csv", encoding='utf-8')