Creating a pandas dataframe from a JSON request - python

I am trying to create a machine learning application with Flask. I have created a POST API route that will take the data and transform it into a pandas DataFrame. Here is the Flask code that calls the Python function to transform the data.
from flask import Flask, abort, request
import json
import mlFlask as ml
import pandas as pd

app = Flask(__name__)

@app.route('/test', methods=['POST'])
def test():
    if not request.json:
        abort(400)
    print type(request.json)
    result = ml.classification(request.json)
    return json.dumps(result)
This is the file that contains the helper function.
import pandas as pd

def jsonToDataFrame(data):
    print type(data)
    df = pd.DataFrame.from_dict(data, orient='columns')
    return df
But I am getting an import error. Also, when I print it, the type of data is dict, so I don't know why it would cause an issue. It works when I orient the dataframe by index, but it doesn't work by columns.
ValueError: If using all scalar values, you must pass an index
Here is the body of the request in JSON format.
{
    "updatedDate": "2012-09-30T23:51:45.778Z",
    "createdDate": "2012-09-30T23:51:45.778Z",
    "date": "2012-06-30T00:00:00.000Z",
    "name": "Mad Max",
    "Type": "SBC",
    "Org": "Private",
    "month": "Feb"
}
What am I doing wrong here?
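Judging by that ValueError, the posted JSON is a single flat object of scalar values, so pandas needs an index when orienting by columns. A minimal sketch of two common workarounds (the shortened data dict below is illustrative):

import pandas as pd

data = {
    "name": "Mad Max",
    "Type": "SBC",
    "Org": "Private",
    "month": "Feb",
}

# Option 1: wrap each scalar in a list so every column has one row
df = pd.DataFrame({key: [value] for key, value in data.items()})

# Option 2: keep the scalars but supply an explicit one-row index
df = pd.DataFrame(data, index=[0])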

Related

Python API call to BigQuery using cloud functions

I'm trying to build my first cloud function. It's a function that should get data from an API, transform it to a DataFrame and push it to BigQuery. I've set the cloud function up with an HTTP trigger, using validate_http as the entry point. The problem is that it states the function is working, but it doesn't actually write anything. It's a similar problem to the one discussed here: Passing data from http api to bigquery using google cloud function python
import pandas as pd
import json
import requests
from pandas.io import gbq
import pandas_gbq
import gcsfs

# function 1: Responding and validating any HTTP request
def validate_http(request):
    request_json = request.get_json()
    if request.args:
        get_api_data()
        return f'Data pull complete'
    elif request_json:
        get_api_data()
        return f'Data pull complete'
    else:
        get_api_data()
        return f'Data pull complete'

# function 2: Get data and transform
def get_api_data():
    import pandas as pd
    import requests
    import json

    # Setting up variables with tokens
    base_url = "https://"
    token = "&token="
    token2 = "&token="
    fields = "&fields=date,id,shippingAddress,items"
    date_filter = "&filter=date in '2022-01-22'"
    data_limit = "&limit=99999999"

    # Performing API call on request with variables
    def main_requests(base_url, token, fields, date_filter, data_limit):
        req = requests.get(base_url + token + fields + date_filter + data_limit)
        return req.json()

    # Making API call and storing in data
    data = main_requests(base_url, token, fields, date_filter, data_limit)

    # Transforming the data
    df = pd.json_normalize(data['orders']).explode('items').reset_index(drop=True)
    items = df['items'].agg(pd.Series)[['id', 'itemNumber', 'colorNumber', 'amount', 'size', 'quantity', 'quantityReturned']]
    df = df.drop(columns=['items', 'shippingAddress.id', 'shippingAddress.housenumber', 'shippingAddress.housenumberExtension', 'shippingAddress.address2', 'shippingAddress.name', 'shippingAddress.companyName', 'shippingAddress.street', 'shippingAddress.postalcode', 'shippingAddress.city', 'shippingAddress.county', 'shippingAddress.countryId', 'shippingAddress.email', 'shippingAddress.phone'])
    df = df.rename(columns={
        'date': 'Date',
        'shippingAddress.countryIso': 'Country',
        'id': 'order_id'})
    df = pd.concat([df, items], axis=1, join='inner')

    # Push data function
    bq_load('Return_data_api', df)

# function 3: Convert to bigquery table
def bq_load(key, value):
    project_name = '375215'
    dataset_name = 'Returns'
    table_name = key
    value.to_gbq(destination_table='{}.{}'.format(dataset_name, table_name), project_id=project_name, if_exists='replace')
The problem is that the script doesn't write to BigQuery and doesn't return any error. I know that the get_api_data() function is working, since I tested it locally and it does seem to be able to write to BigQuery. Using Cloud Functions, I can't seem to trigger this function and make it write data to BigQuery.
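As an aside, a minimal sketch of how the entry point could be instrumented so that failures show up in Cloud Logging instead of the function silently reporting success (the logging here is an illustrative assumption, not part of the original code):

import logging

def validate_http(request):
    # Log the incoming call so the logs show whether the trigger actually fired
    logging.info("validate_http called with args=%s", dict(request.args))
    try:
        get_api_data()
        return 'Data pull complete'
    except Exception:
        # Log and re-raise so the function reports an error instead of a silent success
        logging.exception("get_api_data failed")
        raise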
There are a couple of things wrong with the code; fixing them would set you right.
You have list data, so store it as a CSV file (in preference to JSON).
This would mean updating (and probably renaming) the JsonArrayStore class and its methods to work with CSV.
Once you have completed the above and written well-formed CSV, you can proceed to this:
Reading the CSV in the del_btn method would then look like this:
import csv
import tkinter as tk

class ToDoGUI(tk.Tk):
    ...
    # methods
    ...
    def del_btn(self):
        a = JsonArrayStore('test1.csv')
        # read to list
        with open('test1.csv') as csvfile:
            reader = csv.reader(csvfile)
            data = list(reader)
        print(data)
Good work. You have a lot to do; if you get stuck further, please post again.
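For completeness, a minimal sketch of writing that well-formed CSV in the first place (the task rows below are made up for illustration; this is not the JsonArrayStore code):

import csv

tasks = [["buy milk", "open"], ["post question", "done"]]

# newline='' prevents blank rows on Windows when writing CSV
with open('test1.csv', 'w', newline='') as csvfile:
    writer = csv.writer(csvfile)
    writer.writerows(tasks)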

FastAPI is very slow in returning a large amount of JSON data

I have a FastAPI GET endpoint that is returning a large amount of JSON data (~160,000 rows and 45 columns). Unsurprisingly, it is extremely slow to return the data using json.dumps(). I am first reading the data from a file using json.loads() and filtering it per the inputted parameters. Is there a faster way to return the data to the user than using return data? It takes nearly a minute in the current state.
My code currently looks like this:
import json
import pandas as pd

# helper function to parse parquet file (where data is stored)
def parse_parquet(file_path):
    df = pd.read_parquet(file_path)
    result = df.to_json(orient='records')
    parsed = json.loads(result)
    return parsed
@app.get('/endpoint')
# has several more parameters
async def some_function(year: int | None = None, id: str | None = None):
    if year is not None:
        data = parse_parquet(f'path/{year}_data.parquet')
    # no year given
    else:
        data = parse_parquet('path/all_data.parquet')
    if id is not None:
        data = [d for d in data if d['id'] == id]
    return data
One of the reasons for the response being that slow is that in your parse_parquet() method, you initially convert the file into JSON (using df.to_json()), then into a dictionary (using json.loads()) and finally into JSON again, as FastAPI, behind the scenes, automatically converts the returned value into JSON-compatible data using the jsonable_encoder, and then uses the Python standard json.dumps() to serialise the object—a process that is quite slow (see this answer for more details).
As suggested by @MatsLindh in the comments section, you could use alternative JSON encoders, such as orjson or ujson (see this answer as well), which would indeed speed up the process, compared to letting FastAPI use the jsonable_encoder and then the standard json.dumps() for converting the data into JSON. However, using pandas to_json() and returning a custom Response directly—as described in Option 1 (Update 2) of this answer—seems to be the best-performing solution. You can use the code given below—which uses a custom APIRoute class—to compare the response time for all available solutions.
Use your own parquet file, or the code below, to create a sample parquet file consisting of 160K rows and 45 columns.
create_parquet.py
import pandas as pd
import numpy as np

columns = ['C' + str(i) for i in range(1, 46)]
df = pd.DataFrame(data=np.random.randint(99999, 99999999, size=(160000, 45)), columns=columns)
df.to_parquet('data.parquet')
Run the FastAPI app below and access each endpoint separately to inspect the time taken to complete the process of loading and converting the data into JSON.
app.py
from fastapi import FastAPI, APIRouter, Response, Request
from fastapi.routing import APIRoute
from typing import Callable
import pandas as pd
import json
import time
import ujson
import orjson

class TimedRoute(APIRoute):
    def get_route_handler(self) -> Callable:
        original_route_handler = super().get_route_handler()

        async def custom_route_handler(request: Request) -> Response:
            before = time.time()
            response: Response = await original_route_handler(request)
            duration = time.time() - before
            response.headers["Response-Time"] = str(duration)
            print(f"route duration: {duration}")
            return response

        return custom_route_handler

app = FastAPI()
router = APIRouter(route_class=TimedRoute)

@router.get("/defaultFastAPIencoder")
def get_data_default():
    df = pd.read_parquet('data.parquet')
    return df.to_dict(orient="records")

@router.get("/orjson")
def get_data_orjson():
    df = pd.read_parquet('data.parquet')
    return Response(orjson.dumps(df.to_dict(orient='records')), media_type="application/json")

@router.get("/ujson")
def get_data_ujson():
    df = pd.read_parquet('data.parquet')
    return Response(ujson.dumps(df.to_dict(orient='records')), media_type="application/json")

# Preferred way
@router.get("/pandasJSON")
def get_data_pandasJSON():
    df = pd.read_parquet('data.parquet')
    return Response(df.to_json(orient="records"), media_type="application/json")

app.include_router(router)
Even though the response time is quite fast using /pandasJSON above (and this should be the preferred way), you may encounter some delay in displaying the data in the browser. That, however, has nothing to do with the server side, but with the client side, as the browser is trying to display a large amount of data. If you don't want to display the data, but instead let the user download it to their device (which would be much faster), you can set the Content-Disposition header in the Response using the attachment parameter and pass a filename as well, indicating to the browser that the file should be downloaded. For more details, have a look at this answer and this answer.
@router.get("/download")
def get_data():
    df = pd.read_parquet('data.parquet')
    headers = {'Content-Disposition': 'attachment; filename="data.json"'}
    return Response(df.to_json(orient="records"), headers=headers, media_type='application/json')
I should also mention that there is a library called Dask, which can handle large datasets, as described here, in case you have to process a large number of records that takes too long to complete. Similar to Pandas, you can use the .read_parquet() method to read the file. As Dask doesn't seem to provide an equivalent .to_json() method, you could convert the Dask DataFrame to a Pandas DataFrame using df.compute(), and then use Pandas df.to_json() to convert the DataFrame into a JSON string, and return it as demonstrated above.
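A minimal sketch of that Dask approach, assuming the same data.parquet file as above (this is an illustrative adaptation, not code from the linked answer):

import dask.dataframe as dd

# Read the parquet file with Dask instead of pandas
ddf = dd.read_parquet('data.parquet')

# Materialise into a pandas DataFrame, then reuse pandas' fast to_json()
df = ddf.compute()
json_str = df.to_json(orient='records')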
I would also suggest you take a look at this answer, which provides details and solutions on streaming/returning a DataFrame, in case you are dealing with so much data that converting it into JSON (using .to_json()) or CSV (using .to_csv()) might cause memory issues on the server side, if you opt to store the output string (either JSON or CSV) in RAM (which is the default behaviour, if you don't pass a path parameter to the aforementioned functions)—since a large amount of memory would already be allocated for the original DataFrame as well.
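As a rough illustration of that streaming idea (a generic sketch, not the linked answer's code), one could write the JSON to disk and stream the file back in chunks instead of holding the whole string in memory:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import pandas as pd

app = FastAPI()

@app.get("/stream")
def stream_data():
    df = pd.read_parquet('data.parquet')
    # Write the JSON to disk instead of keeping the whole string in RAM
    df.to_json('data.json', orient='records')

    def iter_file(path, chunk_size=1024 * 1024):
        with open(path, 'rb') as f:
            while chunk := f.read(chunk_size):
                yield chunk

    return StreamingResponse(iter_file('data.json'), media_type='application/json')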
I guess the json.loads(result) call will return dict-based data in your case, and you are filtering that data. You can send it as JSON as follows:
from fastapi.responses import JSONResponse

@app.get('/endpoint')
# has several more parameters
async def some_function(year: int | None = None, id: str | None = None):
    if year is not None:
        data = parse_parquet(f'path/{year}_data.parquet')
    # no year given
    else:
        data = parse_parquet('path/all_data.parquet')
    if id is not None:
        data = [d for d in data if d['id'] == id]
    # data is already JSON-compatible (it came from json.loads), so return it directly
    return JSONResponse(content=data)

rets-python with flask cannot return as json

I am doing a project which has to do with RETS; somehow my collaborator has problems installing rets-client for JS.
Anyway, I decided to look for a Python alternative instead and found rets-python at https://github.com/opendoor-labs/rets
I am able to get data returned from RETS. I wanted to make an API, so I installed Flask (I'm new to Flask).
However, when I tried to return the data as JSON I just kept getting errors.
This is my code:
import json
from flask import Flask, Response, jsonify

app = Flask(__name__)

@app.route('/')
def mls():
    from rets.client import RetsClient

    client = RetsClient(
        login_url='xxxxxxxxxxxx',
        username='xxxxxxxxxxxx',
        password='xxxxxxxxxxxxx',
        # Ensure that you are using the right auth_type for this particular MLS
        # auth_type='basic',
        # Alternatively authenticate using user agent password
        user_agent='xxxxxxxxxxxxxxxx',
        user_agent_password='xxxxxxxxxxxxxxxx'
    )
    resource = client.get_resource('Property')
    print(resource)
    resource.key_field
    print(resource.key_field)
    resource_class = resource.get_class('RD_1')
    print(resource_class)
    search_result = resource_class.search(query='(L_Status=1_0)', limit=2)
I will write my return attempts here, as I have tried a few different returns:
    # return jsonify(results=search_result.data)
    # TypeError: <Record: Property:RD_1:259957852> is not JSON serializable

    return Response(json.dumps(search_result.data), mimetype='application/json')
    # TypeError: <Record: Property:RD_1:259957852> is not JSON serializable

    return json.dumps(search_result.data)
    # TypeError: <Record: Property:RD_1:259957852> is not JSON serializable
I even tried making the results into a dict, such as dict(search_result.data), which gives me errors like TypeError: cannot convert dictionary update sequence element #0 to a sequence.
If I try print(type(search_result.data[0].data)), the return is <class 'collections.OrderedDict'>. I tried jsonify, json.dumps and also Response with just this one record, with or without dict(), and I get this error: TypeError: Decimal('1152') is not JSON serializable.
Also, print(type(search_result.data)) gives tuple.
Is anyone able to give me a hand on this?
Thanks in advance for any help and advice.
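For reference, the Decimal('1152') is not JSON serializable error is commonly worked around by giving json.dumps a default handler; a minimal sketch (the record below is made up, not actual RETS data):

import json
from decimal import Decimal

record = {'L_ListPrice': Decimal('1152'), 'L_Status': '1_0'}

# default=str converts anything json can't serialise natively (Decimal, dates, ...) to a string
print(json.dumps(record, default=str))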

Getting null from flask request python

I am writing a simple Flask application where, based on my query, I should get the required answer in the desired format.
The code is as below:
# -*- coding: utf-8 -*-
import StringIO
import os
import pandas as pd
import numpy as np
from flask import Flask, request, Response, abort, jsonify, send_from_directory, make_response
import io
from pandas import DataFrame
import urllib2, json
import requests
from flask import session
import sys

reload(sys)
sys.setdefaultencoding("ISO-8859-1")

app = Flask(__name__)

@app.route("/api/conversation/", methods=['POST'])
def chatbot():
    df = pd.DataFrame(json.load(urllib2.urlopen('http://192.168.21.245/sixthsensedata/server/Test_new.json')))
    question = request.form.get('question')
    store = []
    if question == 'What is the number of total observation of the dataset':
        store.append(df.shape)
    if question == 'What are the column names of the dataset':
        store.append(df.columns)
    return jsonify(store)

if __name__ == '__main__':
    app.debug = True
    app.run(host='192.168.21.11', port=5000)
It's running properly but returning a null response. I would like to create ~30 more questions like this and store values in the store array. But the values are not getting appended to store, I think.
In a Jupyter notebook, though, I am getting a proper response:
df = pd.DataFrame(json.load(urllib2.urlopen('http://192.168.21.245/sixthsensedata/server/Test_new.json')))
store = []
store.append(df.shape)
print store
[(521, 24)]
Why are the values not getting appended in Flask? I am testing my application in Postman. Please guide me on where I am lacking.
Screenshot from postman
When the content type is not provided for the POST request, request.form evaluates to
ImmutableMultiDict([('{"question": "What is the number of total observation of the dataset"}', u'')])
and question = request.form.get('question') ends up being None.
You can explicitly set the content type to JSON, or force-load it:
@app.route('/api/conversation/', methods=['POST'])
def chatbot():
    question = request.get_json(force=True).get('question')
    store = []
    if question == 'What is the number of total observation of the dataset':
        store.append("shape")
    elif question == 'What are the column names of the dataset':
        store.append("columns")
    return jsonify(store)
Curl requests
$curl -X POST -d '{"question": "What is the number of total observation of the dataset"}' http://127.0.0.1:5000/api/conversation/
["shape"]
$curl -H 'Content-Type: application/json' -X POST -d '{"question": "What is the number of total observation of the dataset"}' http://127.0.0.1:5000/api/conversation/
["shape"]

Transforming JSON string to Pandas DataFrame in Flask

I'm parsing JSON data in Flask from a POST request. Everything seems to be fine and works OK:
from flask import Flask
from flask import request
import io
import json
import pandas as pd

app = Flask(__name__)

@app.route('/postjson', methods=['POST'])
def postJsonHandler():
    print (request.is_json)
    content = request.get_json()
    df = pd.io.json.json_normalize(content)
    print (df)
    return 'JSON posted'

app.run(host='0.0.0.0', port=8090)
The output looks like this:
True
columns data
0 [Days, Orders] [[10/1/16, 284], [10/2/16, 633], [10/3/16, 532...
Then I try to transform the JSON to a pandas DataFrame using the json_normalize() function. I'm receiving a result close to a pandas DataFrame, but it is not quite it yet.
What changes should I make in the code to receive the classical pandas DataFrame format, with columns and data inside?
Thanks in advance.
Solved the problem. The idea was to use the parameters of the json_normalize() function, something like this:
df = pd.io.json.json_normalize(content, 'data')
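As a further illustration, for a payload shaped like the one above ({"columns": [...], "data": [...]}), the DataFrame could also be built directly, skipping json_normalize (a sketch using the sample values from the printed output):

import pandas as pd

content = {
    "columns": ["Days", "Orders"],
    "data": [["10/1/16", 284], ["10/2/16", 633], ["10/3/16", 532]],
}

# Build the DataFrame from the row data and the column labels
df = pd.DataFrame(content["data"], columns=content["columns"])
print(df)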
