I'm trying to build my first cloud function. It's a function that should get data from an API, transform it into a DataFrame and push it to BigQuery. I've set the cloud function up with an HTTP trigger using validate_http as the entry point. The problem is that it states the function is working, but it doesn't actually write anything. It's a similar problem to the one discussed here: Passing data from http api to bigquery using google cloud function python
import pandas as pd
import json
import requests
from pandas.io import gbq
import pandas_gbq
import gcsfs
#function 1: Responding and validating any HTTP request
def validate_http(request):
    request_json = request.get_json()
    if request.args:
        get_api_data()
        return f'Data pull complete'
    elif request_json:
        get_api_data()
        return f'Data pull complete'
    else:
        get_api_data()
        return f'Data pull complete'
#function 2: Get data and transform
def get_api_data():
    import pandas as pd
    import requests
    import json

    #Setting up variables with tokens
    base_url = "https://"
    token = "&token="
    token2 = "&token="
    fields = "&fields=date,id,shippingAddress,items"
    date_filter = "&filter=date in '2022-01-22'"
    data_limit = "&limit=99999999"

    #Performing API call on request with variables
    def main_requests(base_url, token, fields, date_filter, data_limit):
        req = requests.get(base_url + token + fields + date_filter + data_limit)
        return req.json()

    #Making API Call and storing in data
    data = main_requests(base_url, token, fields, date_filter, data_limit)

    #transforming the data
    df = pd.json_normalize(data['orders']).explode('items').reset_index(drop=True)
    items = df['items'].agg(pd.Series)[['id', 'itemNumber', 'colorNumber', 'amount', 'size', 'quantity', 'quantityReturned']]
    df = df.drop(columns=['items', 'shippingAddress.id', 'shippingAddress.housenumber', 'shippingAddress.housenumberExtension', 'shippingAddress.address2', 'shippingAddress.name', 'shippingAddress.companyName', 'shippingAddress.street', 'shippingAddress.postalcode', 'shippingAddress.city', 'shippingAddress.county', 'shippingAddress.countryId', 'shippingAddress.email', 'shippingAddress.phone'])
    df = df.rename(columns=
        {'date': 'Date',
         'shippingAddress.countryIso': 'Country',
         'id': 'order_id'})
    df = pd.concat([df, items], axis=1, join='inner')

    #Push data function
    bq_load('Return_data_api', df)
#function 3: Convert to bigquery table
def bq_load(key, value):
    project_name = '375215'
    dataset_name = 'Returns'
    table_name = key
    value.to_gbq(destination_table='{}.{}'.format(dataset_name, table_name), project_id=project_name, if_exists='replace')
The problem is that the script doesn't write to BigQuery and doesn't return any error. I know that the get_api_data() function is working, since I tested it locally and it does seem to be able to write to BigQuery. Using Cloud Functions I can't seem to trigger this function and make it write data to BigQuery.
There are a couple of things wrong with the code that, once fixed, would set you right.
You have list data, so store it as a CSV file (in preference to JSON).
This would mean updating (and probably renaming) the JsonArrayStore class and its methods to work with CSV.
Once you have completed the above and written well-formed CSV, you can proceed to this:
reading the csv in the del_btn method would then look like this:
import csv
import tkinter as tk

class ToDoGUI(tk.Tk):
    ...
    # methods
    ...
    def del_btn(self):
        a = JsonArrayStore('test1.csv')
        # read to list
        with open('test1.csv') as csvfile:
            reader = csv.reader(csvfile)
            data = list(reader)
        print(data)
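For completeness, here is a minimal sketch of the writing side, assuming each item of your list data is itself a list of values (the helper name is hypothetical, not part of your JsonArrayStore class):

import csv

# hypothetical helper: write a list of rows out as a well-formed CSV file
def save_rows_to_csv(rows, path='test1.csv'):
    with open(path, 'w', newline='') as csvfile:
        writer = csv.writer(csvfile)
        writer.writerows(rows)  # each element of rows becomes one CSV row

# example usage
save_rows_to_csv([['task', 'done'], ['buy milk', 'no']])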
Good work. You have a lot to do; if you get stuck further, please post again.
I have a FastAPI GET endpoint that is returning a large amount of JSON data (~160,000 rows and 45 columns). Unsurprisingly, it is extremely slow to return the data using json.dumps(). I am first reading the data from a file using json.loads() and filtering it per the inputted parameters. Is there a faster way to return the data to the user than using return data? It takes nearly a minute in the current state.
My code currently looks like this:
# helper function to parse parquet file (where data is stored)
def parse_parquet(file_path):
    df = pd.read_parquet(file_path)
    result = df.to_json(orient='records')
    parsed = json.loads(result)
    return parsed
@app.get('/endpoint')
# has several more parameters
async def some_function(year: int | None = None, id: str | None = None):
    if year is not None:
        data = parse_parquet(f'path/{year}_data.parquet')
    else:
        # no year
        data = parse_parquet('path/all_data.parquet')
    if id is not None:
        data = [d for d in data if d['id'] == id]
    return data
One of the reasons for the response being that slow is that in your parse_parquet() method, you initially convert the file into JSON (using df.to_json()), then into a dictionary (using json.loads()) and finally into JSON again, as FastAPI, behind the scenes, automatically converts the returned value into JSON-compatible data using the jsonable_encoder, and then uses the Python standard json.dumps() to serialise the object—a process that is quite slow (see this answer for more details).
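In other words, a simplified sketch of what effectively happens per request under the current setup (the last step is performed by FastAPI itself, not your code):

result = df.to_json(orient='records')   # DataFrame -> JSON string
parsed = json.loads(result)             # JSON string -> list of dicts
# FastAPI then applies jsonable_encoder(parsed) and json.dumps(...) -> JSON string again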
As suggested by @MatsLindh in the comments section, you could use alternative JSON encoders, such as orjson or ujson (see this answer as well), which would indeed speed up the process, compared to letting FastAPI use the jsonable_encoder and then the standard json.dumps() for converting the data into JSON. However, using pandas to_json() and returning a custom Response directly—as described in Option 1 (Update 2) of this answer—seems to be the best-performing solution. You can use the code given below—which uses a custom APIRoute class—to compare the response time for all available solutions.
Use your own parquet file or the below code to create a sample parquet file consisting of 160K rows and 45 columns.
create_parquet.py
import pandas as pd
import numpy as np
columns = ['C' + str(i) for i in range(1, 46)]
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(160000,45)),columns=columns)
df.to_parquet('data.parquet')
Run the FastAPI app below and access each endpoint separately to inspect the time taken to complete the process of loading and converting the data into JSON.
app.py
from fastapi import FastAPI, APIRouter, Response, Request
from fastapi.routing import APIRoute
from typing import Callable
import pandas as pd
import json
import time
import ujson
import orjson
class TimedRoute(APIRoute):
    def get_route_handler(self) -> Callable:
        original_route_handler = super().get_route_handler()

        async def custom_route_handler(request: Request) -> Response:
            before = time.time()
            response: Response = await original_route_handler(request)
            duration = time.time() - before
            response.headers["Response-Time"] = str(duration)
            print(f"route duration: {duration}")
            return response

        return custom_route_handler
app = FastAPI()
router = APIRouter(route_class=TimedRoute)
#router.get("/defaultFastAPIencoder")
def get_data_default():
df = pd.read_parquet('data.parquet')
return df.to_dict(orient="records")
#router.get("/orjson")
def get_data_orjson():
df = pd.read_parquet('data.parquet')
return Response(orjson.dumps(df.to_dict(orient='records')), media_type="application/json")
#router.get("/ujson")
def get_data_ujson():
df = pd.read_parquet('data.parquet')
return Response(ujson.dumps(df.to_dict(orient='records')), media_type="application/json")
# Preferred way
#router.get("/pandasJSON")
def get_data_pandasJSON():
df = pd.read_parquet('data.parquet')
return Response(df.to_json(orient="records"), media_type="application/json")
app.include_router(router)
Even though the response time is quite fast using /pandasJSON above (and this should be the preferred way), you may encounter some delay on displaying the data on the browser. That, however, has nothing to do with the server side, but with the client side, as the browser is trying to display a large amount of data. If you don't want to display the data, but instead let the user download the data to their device (which would be much faster), you can set the Content-Disposition header in the Response using the attachment parameter and passing a filename as well, indicating to the browser that the file should be downloaded. For more details, have a look at this answer and this answer.
#router.get("/download")
def get_data():
df = pd.read_parquet('data.parquet')
headers = {'Content-Disposition': 'attachment; filename="data.json"'}
return Response(df.to_json(orient="records"), headers=headers, media_type='application/json')
I should also mention that there is a library called Dask, which can handle large datasets, as described here, in case you have to process a large number of records that takes too long to complete. Similar to Pandas, you can use the .read_parquet() method to read the file. As Dask doesn't seem to provide an equivalent .to_json() method, you could convert the Dask DataFrame to a Pandas DataFrame using df.compute(), and then use Pandas df.to_json() to convert the DataFrame into a JSON string, and return it as demonstrated above. A rough sketch of that approach is shown below.
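As a minimal sketch only (assuming the same data.parquet file and the router defined above):

import dask.dataframe as dd

@router.get("/dask")
def get_data_dask():
    ddf = dd.read_parquet('data.parquet')   # lazy, out-of-core read
    df = ddf.compute()                      # materialise as a Pandas DataFrame
    return Response(df.to_json(orient="records"), media_type="application/json")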
I would also suggest you take a look at this answer, which provides details and solutions on streaming/returning a DataFrame. This is relevant when you are dealing with a large amount of data where converting it into JSON (using .to_json()) or CSV (using .to_csv()) may cause memory issues on the server side, if you opt to store the output string (either JSON or CSV) in RAM—which is the default behaviour, if you don't pass a path parameter to the aforementioned functions—since a large amount of memory would already be allocated for the original DataFrame as well. One possible streaming variant is sketched below.
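As a rough illustration only (not necessarily the approach from the linked answer), you could stream the output in CSV chunks with FastAPI's StreamingResponse, so the serialised output string is never accumulated in memory all at once (the DataFrame itself is still loaded, though):

from fastapi.responses import StreamingResponse

@router.get("/stream")
def stream_data():
    df = pd.read_parquet('data.parquet')

    def iter_csv(chunk_size: int = 10_000):
        # emit the header once, then the rows in slices
        yield df.iloc[:0].to_csv(index=False)
        for start in range(0, len(df), chunk_size):
            yield df.iloc[start:start + chunk_size].to_csv(index=False, header=False)

    return StreamingResponse(iter_csv(), media_type="text/csv")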
I guess json.loads(result) will return a dict data type in your case, and you are filtering that dict. You can send the dict data type as JSON as follows:
from fastapi.encoders import jsonable_encoder
from fastapi.responses import JSONResponse

@app.get('/endpoint')
# has several more parameters
async def some_function(year: int | None = None, id: str | None = None):
    if year is not None:
        data = parse_parquet(f'path/{year}_data.parquet')
    else:
        # no year
        data = parse_parquet('path/all_data.parquet')
    if id is not None:
        data = [d for d in data if d['id'] == id]
    json_compatible_item_data = jsonable_encoder(data)
    return JSONResponse(content=json_compatible_item_data)
I'm trying to extract nutritional information using the Nutritionix API/database in Python. I was able to run a successful query and place the result into a pandas DataFrame. However, I'm a bit confused, because the resulting JSON claims that there are several thousand 'hits' for my query, yet at most 10 are ever returned. For instance, when I query for Garbanzo, the JSON file says that there are 513 total_hits, but only 10 are actually returned. Does anyone know what is causing this? The code I'm using is below.
import requests
import json
import pandas as pd
from nutritionix import Nutritionix
nix_apikey = ''
nix_appid = ''
nix = Nutritionix(app_id = nix_appid, api_key = nix_apikey)
results = nix.search('Garbanzo').json()
df = pd.json_normalize(results, record_path = ['hits'])
I'm not including my api_key or app_id for obvious reasons. Here's a link to the Nutritionix API: https://github.com/leetrout/python-nutritionix
Thanks for any suggestions!
I am trying to pull Twitter streaming data in a Cloud Function and essentially export the stream data into BigQuery.
Currently, I have this code. The entry point is set to stream_twitter.
main.py:
import os
import tweepy
import pandas as pd
import datalab.bigquery as bq
from google.cloud import bigquery
#access key
api_key = os.environ['API_KEY']
secret_key = os.environ['SECRET_KEY']
bearer_token = os.environ['BEARER_TOKEN']
def stream_twitter(event, context):
    #authentication
    auth = tweepy.Client(bearer_token = bearer_token)
    api = tweepy.API(auth)

    #create Stream Listener
    class Listener(tweepy.StreamingClient):
        #save list to dataframe
        tweets = []

        def on_tweet(self, tweet):
            if tweet.referenced_tweets == None: #Original tweet not reply or retweet
                self.tweets.append(tweet)

        def on_error(self, status_code):
            if status_code == 420:
                #returning False in on_data disconnects the stream
                return False

    stream_tweet = Listener(bearer_token)

    #filtered Stream using rules
    rule = tweepy.StreamRule("(covid OR covid19 OR coronavirus OR pandemic OR #covid19 OR #covid) lang:en")
    stream_tweet.add_rules(rule, dry_run = True)
    stream_tweet.filter(tweet_fields=["referenced_tweets"])

    #insert into dataframe
    columns = ["UserID", "Tweets"]
    data = []
    for tweet in stream_tweet.tweets:
        data.append([tweet.id, tweet.text, ])
    stream_df = pd.DataFrame(data, columns=columns)

    ## Insert time col - TimeStamp to give the time that data is pulled from API
    stream_df.insert(0, 'TimeStamp', pd.to_datetime('now').replace(microsecond=0))

    ## Converting UTC Time to SGT(UTC+8hours)
    stream_df.insert(1, 'SGT_TimeStamp', '')
    stream_df['SGT_TimeStamp'] = stream_df['TimeStamp'] + pd.Timedelta(hours=8)

    ## Define BQ dataset & table names
    bigquery_dataset_name = 'streaming_dataset'
    bigquery_table_name = 'streaming-table'

    ## Define BigQuery dataset & table
    dataset = bq.Dataset(bigquery_dataset_name)
    table = bq.Table(bigquery_dataset_name + '.' + bigquery_table_name)

    if not table.exists():
        # Create or overwrite the existing table if it exists
        table_schema = bq.Schema.from_dataframe(stream_df)
        table.create(schema = table_schema, overwrite = False)

    # Write the DataFrame to a BigQuery table
    table.insert_data(stream_df)
requirements.txt:
tweepy
pandas
google-cloud-bigquery
However, I keep getting a
"Deployment failure: Function deployment failed due to a health check failure. This usually indicates that your code was built successfully but failed during a test execution. Examine the logs to determine the cause. Try deploying again in a few minutes if it appears to be transient."
I can't seem to figure out how to solve this error. Is there something wrong with my code? Or is there something that I should have done? I tested the streaming code in PyCharm and was able to pull the data.
I would appreciate any help I can get. Thank you.
The logs for the function are below. (I am unfamiliar with logs, hence I shall include a screenshot.) Essentially, those were the two info and error entries I've been getting.
I managed to replicate your error message. All I did was add datalab==1.2.0 inside requirements.txt. Since you are importing the datalab library, you need to include the support package for it, which is the latest version of datalab.
Here's the reference that I used: Migrating from the datalab Python package.
See the requirements.txt file to view the versions of the libraries used for these code snippets.
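For reference, a minimal requirements.txt along those lines might look like the following (only datalab==1.2.0 comes from the fix above; whether to pin the other libraries is up to you):

tweepy
pandas
google-cloud-bigquery
datalab==1.2.0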
Here's the screenshot of the logs:
I'm trying to get some stats from the NBA stats page. I'm following this tutorial-idea
https://towardsdatascience.com/using-python-pandas-and-plotly-to-generate-nba-shot-charts-e28f873a99cb
The basic idea is to put the data into a CSV file.
So I tried this code to get the data from the NBA site, requesting the JSON file and then converting it to a CSV:
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
shot_data_url_start="https://stats.nba.com/events/?flag=3&CFID=33&CFPARAMS=2017-18&PlayerID="
player_id="202695"
shot_data_url_end="&ContextMeasure=FGA&Season=2017-18&section=player&sct=plot"
def shoy_chart(player_id):
    full_url = shot_data_url_start + str(player_id) + shot_data_url_end
    json = requests.get(full_url, headers=headers).json()
    return(json)
data = json['resultSets'][0]['rowSets']
columns = json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
And this is the error that the notebook shows me:
TypeError Traceback (most recent call last)
<ipython-input-42-a3452c3a4fc8> in <module>
18
19
---> 20 data = json['resultSets'][0]['rowSets']
21 columns = json['resultSets'][0]['headers']
22
TypeError: 'module' object is not subscriptable
Can anyone help me, or does anyone know another way to get the data into a .csv or Excel file?
When imported with import json, the name json is referring to the JSON module of the Python standard library. You cannot use it as a regular variable name. If you rename your variable to something else such as response_json, this part of your code will work.
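To illustrate the clash with a self-contained toy example (the payload below is a made-up sample, not the real API response):

import json

payload = '{"resultSets": [{"rowSet": [], "headers": []}]}'

# json = payload                      # don't do this: it would shadow the json module
response_json = json.loads(payload)   # keep the module name free; bind the parsed data to a new name
print(response_json['resultSets'][0]['headers'])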
Regarding the rest of the code, the page https://stats.nba.com/events/ doesn't return any JSON text; it is a regular web page with images, menus, a video player, etc. If you want to access the API that returns the shots in JSON format, you will have to use the https://stats.nba.com/stats/shotchartdetail endpoint (with the right query string). This API endpoint is mentioned in the tutorial, in the "Chrome XHR tab and resulting json linked by url" image.
Ok I've changed the code like this:
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
def shot_chart(player_id):
    full_url = "https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2017-18&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID=202695&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
    response_json = requests.get(full_url, headers=headers)
    return(response_json)
data = response_json['resultSets'][0]['rowSets']
columns = response_json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
import requests
import json
import pandas as pd
from pandas import DataFrame as df
import urllib.request
shot_data_url_start="https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2019-20&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID="
player_id="202330"
shot_data_url_end="&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2019-20&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
def shot_chart(player_id):
    full_url = shot_data_url_start + str(player_id) + shot_data_url_end
    response_json = requests.get(full_url).json()
    return(response_json)
data = response_json['resultSets'][0]['rowSets']
columns = response_json['resultSets'][0]['headers']
df = pd.DataFrame.from_records(data, columns=columns)
shot_chart("202330")
What is going on now? The notebook is stuck right now.
Try this out
import requests
import pandas as pd
from pandas import DataFrame as df
shot_data_url_start = "https://stats.nba.com/stats/shotchartdetail?AheadBehind=&CFID=33&CFPARAMS=2017-18&ClutchTime=&Conference=&ContextFilter=&ContextMeasure=FGA&DateFrom=&DateTo=&Division=&EndPeriod=10&EndRange=28800&GROUP_ID=&GameEventID=&GameID=&GameSegment=&GroupID=&GroupMode=&GroupQuantity=5&LastNGames=0&LeagueID=00&Location=&Month=0&OnOff=&OpponentTeamID=0&Outcome=&PORound=0&Period=0&PlayerID="
player_id = "204001"
shot_data_url_end = "&PlayerID1=&PlayerID2=&PlayerID3=&PlayerID4=&PlayerID5=&PlayerPosition=&PointDiff=&Position=&RangeType=0&RookieYear=&Season=2017-18&SeasonSegment=&SeasonType=Regular+Season&ShotClockRange=&StartPeriod=1&StartRange=0&StarterBench=&TeamID=0&VsConference=&VsDivision=&VsPlayerID1=&VsPlayerID2=&VsPlayerID3=&VsPlayerID4=&VsPlayerID5=&VsTeamID="
def get_shot_data(player_id):
    full_url = shot_data_url_start + player_id + shot_data_url_end
    data = requests.get(
        full_url,
        headers = {
            "User-Agent": "PostmanRuntime/7.4.0"
        }
    )
    return data.json()
shot_results = get_shot_data(player_id)
result_sets = shot_results['resultSets']
first_result_set = result_sets[0]
row_set = first_result_set['rowSet']
set_headers = first_result_set['headers']
df = pd.DataFrame.from_records(row_set, columns=set_headers)
I see how you got confused with that Medium post. You were missing the headers, and the URL for the NBA API wasn't right. That's what @pierre was trying to say in his response. The URL you're using isn't right. If you reread the post you were following, you'll see that the author said he had to dig into dev tools in order to find the actual URL to use in order to grab the JSON.
Edit: Forgot to mention that when I didn't pass a User-Agent in the headers, the request would time out. If you don't pass that in, you won't get a successful response.
Has anyone used the importRows() function from the Fusion Tables API?
As per the API reference below,
https://developers.google.com/fusiontables/docs/v1/reference/table/importRows
I have to supply CSV data in the request body.
But what exactly should I put in the request body?
My code:
http = getAuthorizedHttp()
DISCOVERYURL = 'https://www.googleapis.com/discovery/v1/apis/{api}/{apiVersion}/rest'
ftable = build('fusiontables', 'v1', discoveryServiceUrl=DISCOVERYURL, http=http)
body = create_ft(CSVFILE,"title here") # the function to load csv file and create the table with columns from csv file.
result = ftable.table().insert(body=body).execute()
print result["tableId"] # good, I have got the id for new created table
# I have no idea how to go on here..
f = ftable.table().importRows(tableId=result["tableId"])
f.body = ?????????????
f.execute()
I finally fixed my problem; my code can be found at the following link:
https://github.com/childnotfound/parser/blob/master/uploader.py
I fixed the problem like this:
media = http.MediaFileUpload('example.csv', mimetype='application/octet-stream', resumable=True)
request = service.table().importRows(media_body=media, tableId='1cowubQ0vj_H9q3owo1vLM_gMyavvbuoNmRQaYiZV').execute()
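Putting it together, here's a minimal end-to-end sketch, assuming the google-api-python-client library, your existing getAuthorizedHttp() helper, and a placeholder table ID (swap in the ID returned by the insert call):

from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

DISCOVERYURL = 'https://www.googleapis.com/discovery/v1/apis/{api}/{apiVersion}/rest'

http = getAuthorizedHttp()  # your helper that returns an authorized Http object
ftable = build('fusiontables', 'v1', discoveryServiceUrl=DISCOVERYURL, http=http)

# upload the CSV rows into the (already created) table
media = MediaFileUpload('example.csv', mimetype='application/octet-stream', resumable=True)
result = ftable.table().importRows(tableId='YOUR_TABLE_ID', media_body=media).execute()
print(result)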