I'm working on pulling data from a public API and converting the response JSON file to a Pandas Dataframe. I've written the code to pull the data and gotten a successful JSON response. The issue I'm having is parsing through the file and converting the data to a dataframe. Whenever I run through my for loop, I get a dataframe that retruns 1 row when it should be returning approximately 2500 rows & 6 columns. I've copied and pasted my code below:
Things to note:
I've commented out my api key with "api_key".
I'm new(ish) to python so I understand that my code formatting might not be best practices. I'm open to changes.
Here is the link to the API that I am requesting from: https://developer.va.gov/explore/facilities/docs/facilities?version=current
facilities_data = pd.DataFrame(columns=['geometry_type', 'geometry_coordinates', 'id', 'facility_name', 'facility_type','facility_classification'])
# function that will make the api call and sort through the json data
def get_facilities_data(facilities_data):
# Make API Call
res = requests.get('https://sandboxapi.va.gov/services/va_facilities/v0/facilities/all',headers={'apikey': 'api_key'})
data = json.loads(res.content.decode('utf-8'))
time.sleep(1)
for facility in data['features']:
geometry_type = data['features'][0]['geometry']['type']
geometry_coordinates = data['features'][0]['geometry']['coordinates']
facility_id = data['features'][0]['properties']['id']
facility_name = data['features'][0]['properties']['name']
facility_type = data['features'][0]['properties']['facility_type']
facility_classification = data['features'][0]['properties']['classification']
# Save data into pandas dataframe
facilities_data = facilities_data.append(
{'geometry_type': geometry_type, 'geometry_coordinates': geometry_coordinates,
'facility_id': facility_id, 'facility_name': facility_name, 'facility_type': facility_type,
'facility_classification': facility_classification}, ignore_index=True)
return facilities_data
facilities_data = get_facilities_data(facilities_data)
print(facilities_data)```
As mentioned, you should
loop over facility instead of data['features'][0]
append within the loop
This will get you the result you are after.
facilities_data = pd.DataFrame(columns=['geometry_type', 'geometry_coordinates', 'id', 'facility_name', 'facility_type','facility_classification'])
def get_facilities_data(facilities_data):
# Make API Call
res = requests.get("https://sandbox-api.va.gov/services/va_facilities/v0/facilities/all",
headers={"apikey": "REDACTED"})
data = json.loads(res.content.decode('utf-8'))
time.sleep(1)
for facility in data['features']:
geometry_type = facility['geometry']['type']
geometry_coordinates = facility['geometry']['coordinates']
facility_id = facility['properties']['id']
facility_name = facility['properties']['name']
facility_type = facility['properties']['facility_type']
facility_classification = facility['properties']['classification']
# Save data into pandas dataframe
facilities_data = facilities_data.append(
{'geometry_type': geometry_type, 'geometry_coordinates': geometry_coordinates,
'facility_id': facility_id, 'facility_name': facility_name, 'facility_type': facility_type,
'facility_classification': facility_classification}, ignore_index=True)
return facilities_data
facilities_data = get_facilities_data(facilities_data)
print(facilities_data.head())
There are some more things we can improve upon;
json() can be called directly on requests output
time.sleep() is not needed
appending to a DataFrame on each iteration is discouraged; we can collect the data in another way and create the DataFrame afterwards.
Implementing these improvements results in;
def get_facilities_data():
data = requests.get("https://sandbox-api.va.gov/services/va_facilities/v0/facilities/all",
headers={"apikey": "REDACTED"}).json()
facilities_data = []
for facility in data["features"]:
facility_data = (facility["geometry"]["type"],
facility["geometry"]["coordinates"],
facility["properties"]["id"],
facility["properties"]["name"],
facility["properties"]["facility_type"],
facility["properties"]["classification"])
facilities_data.append(facility_data)
facilities_df = pd.DataFrame(data=facilities_data,
columns=["geometry_type", "geometry_coords", "id", "name", "type", "classification"])
return facilities_df
Related
I'm working on python code to update and append token price and volume data using gate.io's API to a .csv file. Basically trying to check to see if it's up to date, and update with the most recently hour's data if not. The below code isn't throwing any errors, but it's not working. My columns are all in the same order as they are in the code. Any assistance would be greatly appreciated, thank you
import requests
import pandas as pd
from datetime import datetime
# Define API endpoint and parameters
host = "https://api.gateio.ws"
prefix = "/api/v4"
url = '/spot/candlesticks'
currency_pair = "BTC_USDT"
interval = "1h"
# Read the existing data from the csv file
df = pd.read_csv("price_calcs.csv")
# Extract the last timestamp from the csv file
last_timestamp = df["time1"].iloc[-1]
# Convert the timestamp to datetime and add an hour to get the new "from" parameter
from_time = datetime.utcfromtimestamp(last_timestamp).strftime('%Y-%m-%d %H:%M:%S')
to_time = datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S')
# Use the last timestamp to make a 'GET' request to the API to get the latest hourly data for the token
query_params = {"currency_pair": currency_pair, "from": from_time, "to": to_time, "interval": interval}
r = requests.get(host + prefix + url, params=query_params)
# Append the new data to the existing data from the csv file
new_data = pd.DataFrame(r.json(), columns=["time1", "volume1", "close1", "high1", "low1", "open1", "volume2"])
df = pd.concat([df, new_data])
# Write the updated data to the csv file
df.to_csv("price_calcs.csv", index=False)
Nevermind figured it out myself
I'm pulling data from the NHL API for player stats based on individual games. I'm trying to make a loop that calls the data, parses the JSON, creates a dict which I then can create a data frame from for an entire team. The code before my looping looks like this:
API_URL = "https://statsapi.web.nhl.com/api/v1"
response = requests.get(API_URL + "/people/8477956/stats?stats=gameLog", params={"Content-Type": "application/json"})
data = json.loads(response.text)
df_list_dict = []
for game in data['stats'][0]['splits']:
curr_dict = game['stat']
curr_dict['date'] = game['date']
curr_dict['isHome'] = game['isHome']
curr_dict['isWin'] = game['isWin']
curr_dict['isOT'] = game['isOT']
curr_dict['team'] = game['team']['name']
curr_dict['opponent'] = game['opponent']['name']
df_list_dict.append(curr_dict)
df = pd.DataFrame.from_dict(df_list_dict)
print(df)
This gives me a digestible data frame for a single player. (/people/{player}/....
I want to iterate through a list (the list being an NHL team), while adding a column that identifies the player and concatenates the created data frames. My attempt thus far looks like this:
import requests
import json
import pandas as pd
Rangers = ['8478550', '8476459', '8479323', '8476389', '8475184', '8480817', '8480078', '8476624', '8481554', '8482109', '8476918', '8476885', '8479324',
'8482073', '8479328', '8480833', '8478104', '8477846', '8477380', '8477380', '8477433', '8479333', '8479991']
def callapi(player):
response = (requests.get(f'https://statsapi.web.nhl.com/api/v1/people/{player}/stats?stats=gameLog', params={"Content-Type": "application/json"}))
data = json.loads(response.text)
df_list_dict = []
for game in data['stats'][0]['splits']:
curr_dict = game['stat']
curr_dict['date'] = game['date']
curr_dict['isHome'] = game['isHome']
curr_dict['isWin'] = game['isWin']
curr_dict['isOT'] = game['isOT']
curr_dict['team'] = game['team']['name']
curr_dict['opponent'] = game['opponent']['name']
df_list_dict.append(curr_dict)
df = pd.DataFrame.from_dict(df_list_dict)
print(df)
for player in Rangers:
callapi(player)
print(callapi)
When this is printed I can see all the data frames that were created. I cannot use curr_dict[] to add a column based on the list position (the player ID) because must be a slice or integer, not string.
What I'm hoping to do is make this one data frame in which the stats are identified by a player id column.
My python knowledge is very scattered, I feel as if with the progress I've made I should know how to complete this but I've simply hit a wall. Any help would be appreciated.
You can use concurrent.futures to parallelize the requests before concatenating them all together, and json_normalize to parse the json.
import concurrent.futures
import json
import os
import pandas as pd
import requests
class Scrape:
def main(self) -> pd.DataFrame:
rangers = ["8478550", "8476459", "8479323", "8476389", "8475184", "8480817", "8480078",
"8476624", "8481554", "8482109", "8476918", "8476885", "8479324", "8482073",
"8479328", "8480833", "8478104", "8477846", "8477380", "8477380", "8477433",
"8479333", "8479991"]
with concurrent.futures.ProcessPoolExecutor(max_workers=os.cpu_count()) as executor:
return pd.concat(executor.map(self.get_stats, rangers)).reset_index(drop=True).fillna(0)
#staticmethod
def get_stats(player: str) -> pd.DataFrame:
url = f"https://statsapi.web.nhl.com/api/v1/people/{player}/stats?stats=gameLog"
with requests.Session() as request:
response = request.get(url, timeout=30)
if response.status_code != 200:
print(response.raise_for_status())
data = json.loads(response.text)
df = (pd.
json_normalize(data=data, record_path=["stats", "splits"])
.rename(columns={"team.id": "team_id", "team.name": "team_name",
"opponent.id": "opponent_id", "opponent.name": "opponent_name"})
).assign(player_id=player)
df = df[df.columns.drop(list(df.filter(regex="link|gamePk")))]
df.columns = df.columns.str.split(".").str[-1]
if "faceOffPct" not in df.columns:
df["faceOffPct"] = 0
return df
if __name__ == "__main__":
stats = Scrape().main()
print(stats)
I am new to python and currently learning this language. I am trying to build a web scraper that will export the data to a CSV. I have the data I want and downloaded it to a CSV. The problem is that I have only managed to dump the data from one index and I want to dump all the data from all the indexes into the same CSV to form a database.
The problem I have is that I can only request n_companies indicating the index. For example (n_company[0] ) and I get the data from the first index of the list. What I want is to get all the data from all the indexes in the same function and then dump them with pandas in a CSV and thus be able to create a DB.
I'm stuck at this point and don't know how to proceed. Can you help me please.
This is the function:
def datos_directorio(n_empresa):
r = session.get(n_empresa[0])
home=r.content.decode('UTF-8')
tree=html.fromstring(home)
descripcion_direccion_empresas = '//p[#class = "paragraph"][2]//text()[normalize-space()]'
nombre_e = '//h1[#class ="mb3 h0 bold"][normalize-space()]/text()'
email = '//div[#class = "inline-block mb1 mr1"][3]/a[#class = "mail button button-inverted h4"]/text()[normalize-space()]'
teléfono = '//div[#class = "inline-block mb1 mr1"][2]/a[#class = "tel button button-inverted h4"]/text()[normalize-space()]'
d_empresas=tree.xpath(descripcion_direccion_empresas)
d_empresas = " ".join(d_empresas)
empresas_n=tree.xpath(nombre_e)
empresas_n = " ".join(empresas_n[0].split())
email_e=tree.xpath(email)
email_e = " ".join(email_e[0].split())
teléfono_e=tree.xpath(teléfono)
teléfono_e = " ".join(teléfono_e[0].split())
contenido = {
'EMPRESA' : empresas_n,
'EMAIL' : email_e,
'TELÉFONO' : teléfono_e,
'CONTACTO Y DIRECCIÓN' : d_empresas
}
return contenido
Best regards.
I have a python code that loops through multiple location and pulls data from a third part API. Below is the code sublocation_idsare location id coming from a directory.
As you can see from the code the data gets converted to a data frame and then saved to a Excel file. The current issue I am facing is if the API does not returns data for publication_timestamp for certain location the loop stops and does not proceeds and I get error as shown below the code.
How do I avoid this and skip to another loop if no data is returned by the API?
for sub in sublocation_ids:
city_num_int = sub['id']
city_num_str = str(city_num_int)
city_name = sub['name']
filter_text_new = filter_text.format(city_num_str)
data = json.dumps({"filters": [filter_text_new], "sort_by":"created_at", "size":2})
r = requests.post(url = api_endpoint, data = data).json()
articles_list = r["articles"]
articles_list_normalized = json_normalize(articles_list)
df = articles_list_normalized
df['publication_timestamp'] = pd.to_datetime(df['publication_timestamp'])
df['publication_timestamp'] = df['publication_timestamp'].apply(lambda x: x.now().strftime('%Y-%m-%d'))
df.to_excel(writer, sheet_name = city_name)
writer.save()
Key Error: publication_timestamp
Change this bit of code:
df = articles_list_normalized
if 'publication_timestamp' in df.columns:
df['publication_timestamp'] = pd.to_datetime(df['publication_timestamp'])
df['publication_timestamp'] = df['publication_timestamp'].apply(lambda x: x.now().strftime('%Y-%m-%d'))
df.to_excel(writer, sheet_name = city_name)
else:
continue
If the API literally returns no data i.e. {} then you might even do the check before normalizing it:
if articles_list:
df = json_normalize(articles_list)
# ... rest of code ...
else:
continue
I've a json list and I can't convert to Pandas dataframe (various rows and 19 columns)
Link to response : https://www.byma.com.ar/wp-admin/admin-ajax.php?action=get_historico_simbolo&simbolo=INAG&fecha=01-02-2018
json response:
[
{"Apertura":35,"Apertura_Homogeneo":35,"Cantidad_Operaciones":1,"Cierre":35,"Cierre_Homogeneo":35,"Denominacion":"INSUMOS AGROQUIMICOS S.A.","Fecha":"02\/02\/2018","Maximo":35,"Maximo_Homogeneo":35,"Minimo":35,"Minimo_Homogeneo":35,"Monto_Operado_Pesos":175,"Promedio":35,"Promedio_Homogeneo":35,"Simbolo":"INAG","Variacion":-5.15,"Variacion_Homogeneo":0,"Vencimiento":"48hs","Volumen_Nominal":5},
{"Apertura":34.95,"Apertura_Homogeneo":34.95,"Cantidad_Operaciones":2,"Cierre":34.95,"Cierre_Homogeneo":34.95,"Denominacion":"INSUMOS AGROQUIMICOS S.A.","Fecha":"05\/02\/2018","Maximo":34.95,"Maximo_Homogeneo":34.95,"Minimo":34.95,"Minimo_Homogeneo":34.95,"Monto_Operado_Pesos":5243,"Promedio":-79228162514264337593543950335,"Promedio_Homogeneo":-79228162514264337593543950335,"Simbolo":"INAG","Variacion":-0.14,"Variacion_Homogeneo":-0.14,"Vencimiento":"48hs","Volumen_Nominal":150},
{"Apertura":32.10,"Apertura_Homogeneo":32.10,"Cantidad_Operaciones":2,"Cierre":32.10,"Cierre_Homogeneo":32.10,"Denominacion":"INSUMOS AGROQUIMICOS S.A.","Fecha":"07\/02\/2018","Maximo":32.10,"Maximo_Homogeneo":32.10,"Minimo":32.10,"Minimo_Homogeneo":32.10,"Monto_Operado_Pesos":98756,"Promedio":32.10,"Promedio_Homogeneo":32.10,"Simbolo":"INAG","Variacion":-8.16,"Variacion_Homogeneo":-8.88,"Vencimiento":"48hs","Volumen_Nominal":3076}
]
I use the next piece of code to convert this json to dataframe:
def getFinanceHistoricalStockFromByma(tickerList):
dataFrameHistorical = pd.DataFrame()
for item in tickerList:
url = 'https://www.byma.com.ar/wp-admin/admin-ajax.php?action=get_historico_simbolo&simbolo=' + item + '&fecha=01-02-2018'
response = requests.get(url)
if response.content : print 'ok info Historical Stock'
data = response.json()
dfItem = jsonToDataFrame(data)
dataFrameHistorical = dataFrameHistorical.append(dfItem, ignore_index=True)
return dataFrameHistorical
def jsonToDataFrame(jsonStr):
return json_normalize(jsonStr)
The result of json_normalize is 1 row and a lot of columns. How can I convert this json response to 1 row per list?
If you change this line in your function: dfItem = jsonToDataFrame(data) to:
dfItem = pd.DataFrame.from_records(data)
it should work. I tested your function with this line replaced, using ['INAG'] as a parameter passed to your getFinanceHistoricalStockFromByma function, and it returned a DataFrame.
You can directly call pd.DataFrame() directly on a list of dictionaries as in the sample in OP (.from_records() is not necessary). Try:
df = pd.DataFrame(data)
For the function in the OP, since pd.DataFrame.append() is deprecated, the best way to write it currently (pandas >= 1.4.0) is to collect the json responses in a Python list and create a DataFrame once at the end of the loop.
def getFinanceHistoricalStockFromByma(tickerList):
dataHistorical = [] # <----- list
for item in tickerList:
url = 'https://www.byma.com.ar/wp-admin/admin-ajax.php?action=get_historico_simbolo&simbolo=' + item + '&fecha=01-02-2018'
response = requests.get(url)
if response.content:
print('ok info Historical Stock')
data = response.json()
dataHistorical.append(data) # <----- list.append()
dataFrameHistorical = pd.DataFrame(dataHistorical) # <----- dataframe construction
return dataFrameHistorical