I want to collect the places around my city, Pekanbaru, at lat/long (0.507068, 101.447777), and convert them into a dataset with place_name, place_id, lat, long and type columns.
Below is the script that I tried.
import json
import urllib.request as url_req
import time
import pandas as pd
NATAL_CENTER = (0.507068,101.447777)
API_KEY = 'API'
API_NEARBY_SEARCH_URL = 'https://maps.googleapis.com/maps/api/place/nearbysearch/json'
RADIUS = 30000
PLACES_TYPES = [('airport', 1), ('bank', 2)] ## TESTING
# PLACES_TYPES = [('airport', 1), ('bank', 2), ('bar', 3), ('beauty_salon', 3), ('book_store', 1), ('cafe', 1), ('church', 3), ('doctor', 3), ('dentist', 2), ('gym', 3), ('hair_care', 3), ('hospital', 2), ('pharmacy', 3), ('pet_store', 1), ('night_club', 2), ('movie_theater', 1), ('school', 3), ('shopping_mall', 1), ('supermarket', 3), ('store', 3)]
def request_api(url):
    response = url_req.urlopen(url)
    json_raw = response.read()
    json_data = json.loads(json_raw)
    return json_data
def get_places(types, pages):
    location = str(NATAL_CENTER[0]) + "," + str(NATAL_CENTER[1])
    mounted_url = ('%s'
                   '?location=%s'
                   '&radius=%s'
                   '&type=%s'
                   '&key=%s') % (API_NEARBY_SEARCH_URL, location, RADIUS, types, API_KEY)
    results = []
    next_page_token = None
    if pages is None:
        pages = 1
    for num_page in range(pages):
        if num_page == 0:
            api_response = request_api(mounted_url)
            results = results + api_response['results']
        else:
            page_url = ('%s'
                        '?key=%s'
                        '&pagetoken=%s') % (API_NEARBY_SEARCH_URL, API_KEY, next_page_token)
            api_response = request_api(str(page_url))
            results += api_response['results']
        if 'next_page_token' in api_response:
            next_page_token = api_response['next_page_token']
        else:
            break
        time.sleep(1)
    return results
def parse_place_to_list(place, type_name):
    # Using name, place_id, lat, lng, type_name
    return [
        place['name'],
        place['place_id'],
        place['geometry']['location']['lat'],
        place['geometry']['location']['lng'],
        type_name
    ]
def mount_dataset():
    data = []
    for place_type in PLACES_TYPES:
        type_name = place_type[0]
        type_pages = place_type[1]
        print("Getting into " + type_name)
        result = get_places(type_name, type_pages)
        result_parsed = list(map(lambda x: parse_place_to_list(x, type_name), result))
        data += result_parsed
    dataframe = pd.DataFrame(data, columns=['place_name', 'place_id', 'lat', 'lng', 'type'])
    dataframe.to_csv('places.csv')

mount_dataset()
But the script returned an empty DataFrame.
How can I fix this and get the right dataset?
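One way to narrow this down (a minimal debugging sketch of my own, not part of the original question): the Nearby Search response carries a status field, and inspecting it usually explains an empty result set, for example REQUEST_DENIED for an invalid key or ZERO_RESULTS for the given location and radius. The sketch reuses the names from the script above.

# Hedged debugging sketch: print the API status and error message, if any,
# before building the DataFrame.
location = str(NATAL_CENTER[0]) + "," + str(NATAL_CENTER[1])
test_url = ('%s?location=%s&radius=%s&type=%s&key=%s'
            % (API_NEARBY_SEARCH_URL, location, RADIUS, 'bank', API_KEY))
api_response = request_api(test_url)
print(api_response.get('status'))         # e.g. OK, ZERO_RESULTS, REQUEST_DENIED
print(api_response.get('error_message'))  # present when the request is rejected
print(len(api_response.get('results', [])))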
I am afraid that scraping and storing this data is prohibited by the Terms of Service of the Google Maps Platform.
Have a look at the Terms of Service before proceeding with the implementation. Paragraph 3.2.4, 'Restrictions Against Misusing the Services', reads:
(a) No Scraping. Customer will not extract, export, or otherwise scrape Google Maps Content for use outside the Services. For example, Customer will not: (i) pre-fetch, index, store, reshare, or rehost Google Maps Content outside the services; (ii) bulk download Google Maps tiles, Street View images, geocodes, directions, distance matrix results, roads information, places information, elevation values, and time zone details; (iii) copy and save business names, addresses, or user reviews; or (iv) use Google Maps Content with text-to-speech services.
source: https://cloud.google.com/maps-platform/terms/#3-license
Sorry to be the bearer of bad news.
I'm working on a personal project and I'm trying to retrieve air quality data from the https://aqicn.org website using their API.
I've used this code, which I've copied and adapted for the city of Bucharest as follows:
import pandas as pd
import folium
import requests
# GET data from AQI website through the API
base_url = "https://api.waqi.info"
path_to_file = "~/path"
# Got token from:- https://aqicn.org/data-platform/token/#/
with open(path_to_file) as f:
    contents = f.readlines()
    key = contents[0]
# (lat, long)-> bottom left, (lat, lon)-> top right
latlngbox = "44.300264,25.920181,44.566991,26.297836" # For Bucharest
trail_url = f"/map/bounds/?token={key}&latlng={latlngbox}"
my_data = pd.read_json(base_url + trail_url) # Joined parts of URL
print('columns->', my_data.columns) #2 cols ‘status’ and ‘data’ JSON
### Built a dataframe from the json file
all_rows = []
for each_row in my_data['data']:
    all_rows.append([each_row['station']['name'],
                     each_row['lat'],
                     each_row['lon'],
                     each_row['aqi']])
df = pd.DataFrame(all_rows, columns=['station_name', 'lat', 'lon', 'aqi'])
# Cleaned the DataFrame
df['aqi'] = pd.to_numeric(df.aqi, errors='coerce') # Invalid parsing to NaN
# Remove NaN entries in col
df1 = df.dropna(subset = ['aqi'])
Unfortunately it only retrieves 4 stations whereas there are many more available on the actual site. In the API documentation the only limitation I saw was for "1,000 (one thousand) requests per second" so why can't I get more of them?
Also, I've tried to modify the lat-long values and managed to get more stations, but they were outside the city I was interested in.
Here is a view of the actual perimeter I've used in the embedded code.
If you have any suggestions as of how I can solve this issue, I'd be very happy to read your thoughts. Thank you!
Try using waqi through aqicn... not exactly a clean API, but I found it to work quite well.
import pandas as pd
import folium
from folium.plugins import HeatMap
url1 = 'https://api.waqi.info'
# Get token from:- https://aqicn.org/data-platform/token/#/
token = 'XXX'
box = '113.805332,22.148942,114.434299,22.561716' # polygon around HongKong via bboxfinder.com
url2=f'/map/bounds/?latlng={box}&token={token}'
my_data = pd.read_json(url1 + url2)
all_rows = []
for each_row in my_data['data']:
    all_rows.append([each_row['station']['name'], each_row['lat'], each_row['lon'], each_row['aqi']])
df = pd.DataFrame(all_rows,columns=['station_name', 'lat', 'lon', 'aqi'])
From there it's easy to plot:
df['aqi'] = pd.to_numeric(df.aqi,errors='coerce')
print('with NaN->', df.shape)
df1 = df.dropna(subset = ['aqi'])
df2 = df1[['lat', 'lon', 'aqi']]
init_loc = [22.396428, 114.109497]
max_aqi = int(df1['aqi'].max())
print('max_aqi->', max_aqi)
m = folium.Map(location = init_loc, zoom_start = 5)
heat_aqi = HeatMap(df2, min_opacity = 0.1, max_val = max_aqi,
radius = 60, blur = 20, max_zoom = 2)
m.add_child(heat_aqi)
m
Or, with markers:
centre_point = [22.396428, 114.109497]
m2 = folium.Map(location = centre_point,tiles = 'Stamen Terrain', zoom_start= 6)
for idx, row in df1.iterrows():
    lat = row['lat']
    lon = row['lon']
    station = row['station_name'] + ' AQI=' + str(row['aqi'])
    station_aqi = row['aqi']
    if station_aqi > 300:
        pop_color = 'red'
    elif station_aqi > 200:
        pop_color = 'orange'
    else:
        pop_color = 'green'
    folium.Marker(location=[lat, lon],
                  popup=station,
                  icon=folium.Icon(color=pop_color)).add_to(m2)
m2
Checking for stations within HK returns 19:
df[df['station_name'].str.contains('HongKong')]
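For the original Bucharest question, a hedged follow-up of my own (assuming the same DataFrame layout as above): rather than enlarging the bounding box and keeping stations from other cities, you can request a wider box and then filter the resulting rows back down to the exact city bounds.

# Hedged sketch: keep only stations inside a lat/lon box.
# The coordinates are the Bucharest box from the question.
lat_min, lon_min, lat_max, lon_max = 44.300264, 25.920181, 44.566991, 26.297836
in_city = df1[df1['lat'].between(lat_min, lat_max) &
              df1['lon'].between(lon_min, lon_max)]
print(in_city.shape)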
I am quite new to Python and I need your professional advice.
What I want in the end is to get the length of the injury_list per player. The players are stored in playerLinks:
playerLinks = ['https://www.transfermarkt.de/Serge Gnabry/verletzungen/spieler/159471',
'https://www.transfermarkt.de/Jamal Musiala/verletzungen/spieler/580195',
'https://www.transfermarkt.de/Douglas Costa/verletzungen/spieler/75615',
'https://www.transfermarkt.de/Joshua Kimmich/verletzungen/spieler/161056',
'https://www.transfermarkt.de/Alexander Nübel/verletzungen/spieler/195778',
'https://www.transfermarkt.de/Kingsley Coman/verletzungen/spieler/243714',
'https://www.transfermarkt.de/Christopher Scott/verletzungen/spieler/503162',
'https://www.transfermarkt.de/Corentin Tolisso/verletzungen/spieler/190393',
'https://www.transfermarkt.de/Leon Goretzka/verletzungen/spieler/153084',
'https://www.transfermarkt.de/Javi Martínez/verletzungen/spieler/44017',
'https://www.transfermarkt.de/Tiago Dantas/verletzungen/spieler/429987',
'https://www.transfermarkt.de/Robert Lewandowski/verletzungen/spieler/38253',
'https://www.transfermarkt.de/Lucas Hernández/verletzungen/spieler/281963',
'https://www.transfermarkt.de/Josip Stanisic/verletzungen/spieler/483046',
'https://www.transfermarkt.de/Thomas Müller/verletzungen/spieler/58358',
'https://www.transfermarkt.de/Benjamin Pavard/verletzungen/spieler/353366',
'https://www.transfermarkt.de/Bouna Sarr/verletzungen/spieler/190685',
'https://www.transfermarkt.de/Leroy Sané/verletzungen/spieler/192565',
'https://www.transfermarkt.de/Manuel Neuer/verletzungen/spieler/17259',
'https://www.transfermarkt.de/David Alaba/verletzungen/spieler/59016',
'https://www.transfermarkt.de/Niklas Süle/verletzungen/spieler/166601',
'https://www.transfermarkt.de/Tanguy Nianzou/verletzungen/spieler/538996',
'https://www.transfermarkt.de/Ron-Thorben Hoffmann/verletzungen/spieler/317444',
'https://www.transfermarkt.de/Jérôme Boateng/verletzungen/spieler/26485',
'https://www.transfermarkt.de/Alphonso Davies/verletzungen/spieler/424204',
'https://www.transfermarkt.de/Eric Maxim Choupo-Moting/verletzungen/spieler/45660',
'https://www.transfermarkt.de/Marc Roca/verletzungen/spieler/336869']
injury_list = []
name_list = []
With the code below I get a list of all injuries across all playerLinks.
However, the lists are not of equal size, and I need the name of each player next to the injuries of that specific player.
I tried the following, but the length of injury_list then ends up being an arbitrary total rather than the number per player.
How do I instead get the length of the injury_list per player, so that I have the correct names next to the injuries?
for p in range(len(playerLinks)):
    page = playerLinks[p]
    response = requests.get(page, headers={'User-Agent': 'Custom5'})
    print(response.status_code)
    injury_data = response.text
    soup = BeautifulSoup(injury_data, 'html.parser')
    table = soup.find(id="yw1")
    injurytypes = table.select("td[class='hauptlink']")
    for j in range(len(injurytypes)):
        all_injuries = [injury.text for injury in injurytypes]
        injury_list.extend(all_injuries)
    image = soup.find("div", {"class": "dataBild"})
    for j in range(len(image)):
        names = image.find("img").get("title")
        name_list.append(''.join(names))
name_list_def = name_list * len(injury_list)
Through the img tag I get the names of the players.
Do you have any advice?
Thanks a lot!
import requests
from bs4 import BeautifulSoup

player_inj_numb = []
for url in playerLinks:
    player_name = url.split('/')[3]
    response = requests.get(url, headers={'User-Agent': 'Custom5'})
    print(response.status_code)
    injury_data = response.text
    soup = BeautifulSoup(injury_data, 'html.parser')
    table = soup.find(id="yw1")
    nbInjury = len(table.findAll("tr"))
    player_inj_numb.append((player_name, nbInjury - 1))
print(player_inj_numb)
which outputs:
[('Serge Gnabry', 15), ('Jamal Musiala', 0), ('Douglas Costa', 15), ('Joshua Kimmich', 12), ('Alexander Nübel', 2), ('Kingsley Coman', 15), ('Christopher Scott', 0), ('Corentin Tolisso', 14), ('Leon Goretzka', 15), ('Javi Martínez', 15), ('Tiago Dantas', 0), ('Robert Lewandowski', 15), ('Lucas Hernández', 15), ('Josip Stanisic', 8), ('Thomas Müller', 13), ('Benjamin Pavard', 9), ('Bouna Sarr', 8), ('Leroy Sané', 11), ('Manuel Neuer', 15), ('David Alaba', 15), ('Niklas Süle', 13), ('Tanguy Nianzou', 4), ('Ron-Thorben Hoffmann', 4), ('Jérôme Boateng', 15), ('Alphonso Davies', 3), ('Eric Maxim Choupo-Moting', 15), ('Marc Roca', 7)]
I got the name from the URL, since it is already there; no need for additional scraping.
The number of injuries equals the number of rows in the table minus the first row, which is the table header.
Please note that some players have more than 15 injuries, so you will need to fetch the subsequent pages in those cases.
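If you want the result as a small table rather than a list of tuples, here is a minimal follow-up sketch of my own (assuming pandas is available):

import pandas as pd

# Hedged sketch: turn the (name, injury count) tuples into a DataFrame.
injury_df = pd.DataFrame(player_inj_numb, columns=['player_name', 'injury_count'])
print(injury_df.head())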
I have a script I'm trying to use (not mine) which is meant to import CSV data into Blender as a mesh.
I have two CSV files which are two parts of the same model; the script works fine for importing one of them but throws a ValueError when I try to import the second.
The error can be seen here:
As far as I know there's only one "-1" in the script; I've tried changing it and removing it, but nothing works and I still get the same error.
I've also tried changing the encoding of the CSV files themselves, but that didn't work either.
I don't know much about scripting and I'm at my wit's end; any help would be greatly appreciated.
The script I'm using is here:
https://github.com/sbobovyc/GameTools/blob/master/Blender/import_pix.py
# ##### BEGIN GPL LICENSE BLOCK #####
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software Foundation,
# Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
#
# ##### END GPL LICENSE BLOCK #####
# <pep8 compliant>
import bpy
import csv
import mathutils
from bpy_extras.io_utils import unpack_list, unpack_face_list, axis_conversion
from bpy.props import (BoolProperty,
FloatProperty,
StringProperty,
EnumProperty,
)
from collections import OrderedDict
bl_info = {
"name": "PIX CSV",
"author": "Stanislav Bobovych",
"version": (1, 0, 0),
"blender": (2, 7, 8),
"location": "File > Import-Export",
"description": "Import PIX csv dump of mesh. Import mesh, normals and UVs.",
"category": "Import"}
class PIX_CSV_Operator(bpy.types.Operator):
    bl_idname = "object.pix_csv_importer"
    bl_label = "Import PIX csv"
    filepath = bpy.props.StringProperty(subtype="FILE_PATH")
    filter_glob = StringProperty(default="*.csv", options={'HIDDEN'})
    mirror_x = bpy.props.BoolProperty(name="Mirror X",
                                      description="Mirror all the vertices across X axis",
                                      default=True)
    vertex_order = bpy.props.BoolProperty(name="Change vertex order",
                                          description="Reorder vertices in counter-clockwise order",
                                          default=True)
    axis_forward = EnumProperty(
        name="Forward",
        items=(('X', "X Forward", ""),
               ('Y', "Y Forward", ""),
               ('Z', "Z Forward", ""),
               ('-X', "-X Forward", ""),
               ('-Y', "-Y Forward", ""),
               ('-Z', "-Z Forward", ""),
               ),
        default='Z')
    axis_up = EnumProperty(
        name="Up",
        items=(('X', "X Up", ""),
               ('Y', "Y Up", ""),
               ('Z', "Z Up", ""),
               ('-X', "-X Up", ""),
               ('-Y', "-Y Up", ""),
               ('-Z', "-Z Up", ""),
               ),
        default='Y',
        )

    def execute(self, context):
        keywords = self.as_keywords(ignore=("axis_forward",
                                            "axis_up",
                                            "filter_glob"))
        global_matrix = axis_conversion(from_forward=self.axis_forward,
                                        from_up=self.axis_up,
                                        ).to_4x4()
        keywords["global_matrix"] = global_matrix
        print(keywords)
        importCSV(**keywords)
        return {'FINISHED'}

    def invoke(self, context, event):
        context.window_manager.fileselect_add(self)
        return {'RUNNING_MODAL'}

    def draw(self, context):
        layout = self.layout
        col = layout.column()
        col.label(text="Import options")
        row = col.row()
        row.prop(self, "mirror_x")
        row = col.row()
        row.prop(self, "vertex_order")
        layout.prop(self, "axis_forward")
        layout.prop(self, "axis_up")
def make_mesh(verteces, faces, normals, uvs, global_matrix):
    mesh = bpy.data.meshes.new('name')
    mesh.vertices.add(len(verteces))
    mesh.vertices.foreach_set("co", unpack_list(verteces))
    mesh.tessfaces.add(len(faces))
    mesh.tessfaces.foreach_set("vertices_raw", unpack_face_list(faces))
    index = 0
    for vertex in mesh.vertices:
        vertex.normal = normals[index]
        index += 1
    uvtex = mesh.tessface_uv_textures.new()
    uvtex.name = "UV"
    for face, uv in enumerate(uvs):
        data = uvtex.data[face]
        data.uv1 = uv[0]
        data.uv2 = uv[1]
        data.uv3 = uv[2]
    mesh.update(calc_tessface=False, calc_edges=False)
    obj = bpy.data.objects.new('name', mesh)
    # apply transformation matrix
    obj.matrix_world = global_matrix
    bpy.context.scene.objects.link(obj)  # link object to scene
def importCSV(filepath=None, mirror_x=False, vertex_order=True, global_matrix=None):
    if global_matrix is None:
        global_matrix = mathutils.Matrix()
    if filepath is None:
        return
    vertex_dict = {}
    normal_dict = {}
    vertices = []
    faces = []
    normals = []
    uvs = []
    with open(filepath) as f:
        reader = csv.reader(f)
        next(reader)  # skip header
        # face_count = sum(1 for row in reader) / 3
        # print(face_count)
        f.seek(0)
        reader = csv.reader(f)
        next(reader)  # skip header
        current_face = []
        current_uv = []
        i = 0
        x_mod = 1
        if mirror_x:
            x_mod = -1
        for row in reader:
            vertex_index = int(row[0])
            vertex_dict[vertex_index] = (x_mod*float(row[2]), float(row[3]), float(row[4]))
            # TODO how are axis really ligned up?
            # x, y, z = (vertex_dict[vertex_index][0], vertex_dict[vertex_index][1], vertex_dict[vertex_index][2])
            # vertex_dict[vertex_index] = (x, z, y)
            normal_dict[vertex_index] = (float(row[6]), float(row[7]), float(row[8]))
            # TODO add support for changing the origin of UV coords
            uv = (float(row[9]), 1.0 - float(row[10]))  # modify V
            if i < 2:
                current_face.append(vertex_index)
                current_uv.append(uv)
                i += 1
            else:
                current_face.append(vertex_index)
                # TODO add option to change order of marching vertices
                if vertex_order:
                    faces.append((current_face[2], current_face[1], current_face[0]))
                else:
                    faces.append(current_face)
                current_uv.append(uv)
                uvs.append(current_uv)
                current_face = []
                current_uv = []
                i = 0
    for i in range(len(vertex_dict)):
        if i in vertex_dict:
            pass
        else:
            # print("missing", i)
            vertex_dict[i] = (0, 0, 0)
            normal_dict[i] = (0, 0, 0)
    # dictionary sorted by key
    vertex_dict = OrderedDict(sorted(vertex_dict.items(), key=lambda t: t[0]))
    normal_dict = OrderedDict(sorted(normal_dict.items(), key=lambda t: t[0]))
    for key in vertex_dict:
        vertices.append(list(vertex_dict[key]))
        # print(key, vertex_dict[key])
    for key in normal_dict:
        normals.append(list(normal_dict[key]))
    # print(vertices)
    # print(faces)
    # print(normals)
    # print(uvs)
    make_mesh(vertices, faces, normals, uvs, global_matrix)
def menu_func_import(self, context):
    self.layout.operator(PIX_CSV_Operator.bl_idname, text="PIX CSV (.csv)")

def register():
    bpy.utils.register_module(__name__)
    bpy.types.INFO_MT_file_import.append(menu_func_import)

def unregister():
    bpy.utils.unregister_module(__name__)
    bpy.types.INFO_MT_file_import.remove(menu_func_import)

if __name__ == "__main__":
    # register()
    # These run the script from "Run script" button
    bpy.utils.register_class(PIX_CSV_Operator)
    bpy.ops.object.pix_csv_importer('INVOKE_DEFAULT')
Can you search if the string "-," occurs anywhere in the CSV file that doesn't work? - mkrieger1
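Building on that comment, a hedged sketch of my own rather than a fix from the original thread: the ValueError almost certainly comes from one of the float(row[...]) calls in importCSV hitting an empty or malformed field such as "-". A tolerant conversion helper (hypothetical, not part of the original script) makes it easy to see which row and column is at fault:

# Hypothetical helper, not part of the original script: fall back to 0.0 and
# report the offending value instead of raising ValueError.
def safe_float(value, row_number, column):
    try:
        return float(value)
    except ValueError:
        print("Bad value %r at row %d, column %d" % (value, row_number, column))
        return 0.0

Each float(row[i]) in the reading loop could then be replaced by safe_float(row[i], reader.line_num, i) to locate the problem before deciding how to handle it.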
I am trying to write a crawler using Scrapy/Python that reads some values from a page.
I then want this crawler to store the highest and lowest values in separate fields.
So far I am able to read the values from the page (please see my code below), but I am not sure how to calculate the lowest and highest values and store them in separate fields.
For example, say the crawler reads the page and returns these values
t1-score = 75.25
t2-score = 85.04
t3-score = '' (value missing)
t4-score = 90.67
t5-score = 50.00
So I want to populate ....
'highestscore': 90.67
'lowestscore': 50.00
How do I do that? Do I need to use an array, put all the values in it and then pick the highest/lowest?
Any help is much appreciated.
Here is my code so far. I am storing -1 in case of missing values.
class MySpider(BaseSpider):
    name = "courses"
    start_urls = ['http://www.example.com/courses-listing']
    allowed_domains = ["example.com"]

    def parse(self, response):
        hxs = Selector(response)
        for courses in response.xpath("//meta"):
            {
                d = {
                    'courset1score': float(courses.xpath('//meta[@name="t1-score"]/@content').extract_first('').strip() or -1),
                    'courset2score': float(courses.xpath('//meta[@name="t2-score"]/@content').extract_first('').strip() or -1),
                    'courset3score': float(courses.xpath('//meta[@name="t3-score"]/@content').extract_first('').strip() or -1),
                    'courset4score': float(courses.xpath('//meta[@name="t4-score"]/@content').extract_first('').strip() or -1),
                    'courset5score': float(courses.xpath('//meta[@name="t5-score"]/@content').extract_first('').strip() or -1),
                }
                d['highestscore'] = max(d.values())
                d['lowestscore'] = min(d.values())
                'pagetitle': courses.xpath('//meta[@name="pagetitle"]/@content').extract_first(),
                'pageurl': courses.xpath('//meta[@name="pageurl"]/@content').extract_first(),
            }
            for url in hxs.xpath('//ul[@class="scrapy"]/li/a/@href').extract():
                # yield Request(response.urljoin(url), callback=self.parse)
                yield d
Build the dictionary before the yield statement. This will let you reference the values already in the dictionary.
for courses in response.xpath("//meta"):
    d = {'courset1score': float(courses.xpath('//meta[@name="t1-score"]/@content').extract_first('').strip() or -1),
         'courset2score': float(courses.xpath('//meta[@name="t2-score"]/@content').extract_first('').strip() or -1),
         'courset3score': float(courses.xpath('//meta[@name="t3-score"]/@content').extract_first('').strip() or -1),
         'courset4score': float(courses.xpath('//meta[@name="t4-score"]/@content').extract_first('').strip() or -1),
         'courset5score': float(courses.xpath('//meta[@name="t5-score"]/@content').extract_first('').strip() or -1),
         }
    d['highestscore'] = max(d.values())
    d['lowestscore'] = min(d.values())
    yield d
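One caveat, my addition rather than part of the original answer: with -1 used as the placeholder for missing scores, min(d.values()) will return -1 instead of the lowest real score. A hedged sketch that excludes the placeholder before computing the fields:

# Hypothetical follow-up: ignore the -1 placeholders when computing min/max.
real_scores = [v for v in d.values() if v != -1]
d['highestscore'] = max(real_scores) if real_scores else None
d['lowestscore'] = min(real_scores) if real_scores else None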
Assuming we have this html document example:
from scrapy import Selector

body = """
<meta name="t1-score" content="10"></meta>
<meta name="t2-score" content="20"></meta>
<meta name="t3-score" content="5"></meta>
<meta name="t4-score" content="8"></meta>
"""
sel = Selector(text=body)
We can extract the scores, convert them to numbers and use the built-in min and max functions.
# you can use this xpath to select any score
scores = sel.xpath("//meta[re:test(@name, 't\d-score')]/@content").extract()
# ['10', '20', '5', '8']
scores = [float(score) for score in scores]
# [10.0, 20.0, 5.0, 8.0]
min(scores)
# 5.0
max(scores)
# 20.0
Combining output:
item = dict()
item['max_score'] = max(scores)
item['min_score'] = min(scores)
for i, score in enumerate(scores):
    item['score{}'.format(i)] = score
I'm trying to download data from OECD API (https://data.oecd.org/api/sdmx-json-documentation/) into python.
I managed to download data in SDMX-JSON format (and transform it to JSON) so far:
import requests as rq

OECD_ROOT_URL = "http://stats.oecd.org/SDMX-JSON/data"

def make_OECD_request(dsname, dimensions, params = None, root_dir = OECD_ROOT_URL):
    """Make URL for the OECD API and return a response"""
    """4 dimensions: location, subject, measure, frequency"""
    if not params:
        params = {}
    dim_args = ['+'.join(d) for d in dimensions]
    dim_str = '.'.join(dim_args)
    url = root_dir + '/' + dsname + '/' + dim_str + '/all'
    print('Requesting URL ' + url)
    return rq.get(url = url, params = params)

response = make_OECD_request('MEI'
                             , [['USA', 'CZE'], [], [], ['M']]
                             , {'startTime': '2009-Q1', 'endTime': '2010-Q1'})

if (response.status_code == 200):
    json = response.json()
How can I transform the data set into pandas.DataFrame? I tried pandas.read_json() and pandasdmx library, but I was not able to solve this.
The documentation the original question points to does not (yet?) mention that the API accepts the parameter contentType, which may be set to csv. That makes it trivial to use with Pandas.
import pandas as pd
def get_from_oecd(sdmx_query):
    return pd.read_csv(
        f"https://stats.oecd.org/SDMX-JSON/data/{sdmx_query}?contentType=csv"
    )
print(get_from_oecd("MEI_FIN/IRLT.AUS.M/OECD").head())
Update:
The function to automatically download data from the OECD API is now available in my Python library CIF (short for Composite Indicators Framework, installable via pip):
from cif import cif
data, subjects, measures = cif.createDataFrameFromOECD(countries = ['USA'], dsname = 'MEI', frequency = 'M')
Original answer:
If you need your data as a pandas DataFrame, it is IMHO better to send your request to the OECD with the additional parameter 'dimensionAtObservation': 'AllDimensions', which results in a more comprehensive JSON file.
Use the following functions to download the data:
import requests as rq
import pandas as pd
import re
OECD_ROOT_URL = "http://stats.oecd.org/SDMX-JSON/data"
def make_OECD_request(dsname, dimensions, params = None, root_dir = OECD_ROOT_URL):
    # Make URL for the OECD API and return a response
    # 4 dimensions: location, subject, measure, frequency
    # OECD API: https://data.oecd.org/api/sdmx-json-documentation/#d.en.330346
    if not params:
        params = {}
    dim_args = ['+'.join(d) for d in dimensions]
    dim_str = '.'.join(dim_args)
    url = root_dir + '/' + dsname + '/' + dim_str + '/all'
    print('Requesting URL ' + url)
    return rq.get(url = url, params = params)
def create_DataFrame_from_OECD(country = 'CZE', subject = [], measure = [], frequency = 'M', startDate = None, endDate = None):
    # Request data from OECD API and return pandas DataFrame
    # country: country code (max 1)
    # subject: list of subjects, empty list for all
    # measure: list of measures, empty list for all
    # frequency: 'M' for monthly and 'Q' for quarterly time series
    # startDate: date in YYYY-MM (2000-01) or YYYY-QQ (2000-Q1) format, None for all observations
    # endDate: date in YYYY-MM (2000-01) or YYYY-QQ (2000-Q1) format, None for all observations

    # Data download
    response = make_OECD_request('MEI'
                                 , [[country], subject, measure, [frequency]]
                                 , {'startTime': startDate, 'endTime': endDate, 'dimensionAtObservation': 'AllDimensions'})

    # Data transformation
    if (response.status_code == 200):
        responseJson = response.json()
        obsList = responseJson.get('dataSets')[0].get('observations')
        if (len(obsList) > 0):
            print('Data downloaded from %s' % response.url)
            timeList = [item for item in responseJson.get('structure').get('dimensions').get('observation') if item['id'] == 'TIME_PERIOD'][0]['values']
            subjectList = [item for item in responseJson.get('structure').get('dimensions').get('observation') if item['id'] == 'SUBJECT'][0]['values']
            measureList = [item for item in responseJson.get('structure').get('dimensions').get('observation') if item['id'] == 'MEASURE'][0]['values']
            obs = pd.DataFrame(obsList).transpose()
            obs.rename(columns = {0: 'series'}, inplace = True)
            obs['id'] = obs.index
            obs = obs[['id', 'series']]
            obs['dimensions'] = obs.apply(lambda x: re.findall('\d+', x['id']), axis = 1)
            obs['subject'] = obs.apply(lambda x: subjectList[int(x['dimensions'][1])]['id'], axis = 1)
            obs['measure'] = obs.apply(lambda x: measureList[int(x['dimensions'][2])]['id'], axis = 1)
            obs['time'] = obs.apply(lambda x: timeList[int(x['dimensions'][4])]['id'], axis = 1)
            obs['names'] = obs['subject'] + '_' + obs['measure']
            data = obs.pivot_table(index = 'time', columns = ['names'], values = 'series')
            return(data)
        else:
            print('Error: No available records, please change parameters')
    else:
        print('Error: %s' % response.status_code)
You can create requests like these:
data = create_DataFrame_from_OECD(country = 'CZE', subject = ['LOCOPCNO'])
data = create_DataFrame_from_OECD(country = 'USA', frequency = 'Q', startDate = '2009-Q1', endDate = '2010-Q1')
data = create_DataFrame_from_OECD(country = 'USA', frequency = 'M', startDate = '2009-01', endDate = '2010-12')
data = create_DataFrame_from_OECD(country = 'USA', frequency = 'M', subject = ['B6DBSI01'])
data = create_DataFrame_from_OECD(country = 'USA', frequency = 'Q', subject = ['B6DBSI01'])
You can recover the data from the source using code like this.
from urllib.request import urlopen
import json
URL = 'http://stats.oecd.org/SDMX-JSON/data/MEI/USA+CZE...M/all'
response = urlopen(URL).read()
responseDict = json.loads(str(response)[2:-1])
print (responseDict.keys())
print (len(responseDict['dataSets']))
Here is the output from this code.
dict_keys(['header', 'structure', 'dataSets'])
1
If you are curious about the appearance of the [2:-1] (I would be): it's there because str() applied to a bytes object returns the repr of the bytes, which wraps the content in a leading b' and a trailing '. json.loads is documented to require a string as input.
This is the code I used to get to this point.
>>> from urllib.request import urlopen
>>> import json
>>> URL = 'http://stats.oecd.org/SDMX-JSON/data/MEI/USA+CZE...M/all'
>>> response = urlopen(URL).read()
>>> len(response)
9886387
>>> response[:50]
b'{"header":{"id":"1975590b-346a-47ee-8d99-6562ccc11'
>>> str(response[:50])
'b\'{"header":{"id":"1975590b-346a-47ee-8d99-6562ccc11\''
>>> str(response[-50:])
'b\'"uri":"http://www.oecd.org/contact/","text":""}]}}\''
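A cleaner alternative to the [2:-1] slicing (my addition, a small hedged sketch): decode the bytes explicitly, which avoids the b'...' repr entirely.

# Hedged sketch: decode the HTTP body instead of slicing its repr.
responseDict = json.loads(response.decode('utf-8'))
print(responseDict.keys())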
I understand that this is not a complete solution, as you must still crack into the dataSets structure to get the data into pandas. It's a list, but you could explore it starting with a sketch like the one below.
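A possible starting point (my addition, hedged: it assumes the usual SDMX-JSON layout, in which each entry of dataSets holds a 'series' mapping keyed by dimension indices, each with its own 'observations'):

# Hedged exploration sketch of the SDMX-JSON dataSets structure.
data_set = responseDict['dataSets'][0]
print(data_set.keys())  # typically includes 'series' (or 'observations')

series = data_set.get('series', {})
for series_key, series_val in list(series.items())[:3]:
    # series_key is a ':'-joined tuple of dimension indices;
    # observations map a time-period index to [value, ...]
    print(series_key, list(series_val['observations'].items())[:2])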
The latest release of pandasdmx (pandasdmx.readthedocs.io) fixes previous issues accessing OECD data in sdmx-json.