I am getting info from an API, and this is the resulting JSON:
{'business_discovery': {'media': {'data': [{'media_url': 'a link',
'timestamp': '2022-01-01T01:00:00+0000',
'caption': 'Caption',
'media_type': 'type',
'media_product_type': 'product_type',
'comments_count': 1,
'like_count': 1,
'id': 'ID'},
{'media_url': 'link',
# ... and so on
# NOTE: I scrubbed the numbers with dummy data
I know that to extract the records I can run the script below on the inner data list.
# "a" is the json without business discovery or media, which would be this:
a = {'data': [{'media_url': 'a link',
'timestamp': '2022-01-01T01:00:00+0000',
'caption': 'Caption',
'media_type': 'type',
'media_product_type': 'product_type',
'comments_count': 1,
'like_count': 1,
'id': 'ID'},
{'media_url': 'link',
# ... and so on
media_url, timestamp, caption, media_type, media_product_type, comment_count, like_count, id_code = [], [], [], [], [], [], [], []
for result in a['data']:
    media_url.append(result['media_url'])  # appending each field's value to its list
    timestamp.append(result['timestamp'])
    caption.append(result['caption'])
    media_type.append(result['media_type'])
    media_product_type.append(result['media_product_type'])
    comment_count.append(result['comments_count'])
    like_count.append(result['like_count'])
    id_code.append(result['id'])  # all fields exist, even when a value is 0
df = pd.DataFrame([media_url, timestamp, caption, media_type, media_product_type, comment_count, like_count, id_code]).T
When I run the above on the raw API response, I get errors saying that 'data' is not found.
This works for now, but I'm trying to figure out a way to "hop" over both business_discovery and media to get straight to data, so I can run this more effectively rather than copying and pasting the response with those outer levels stripped out.
Use pd.json_normalize:
df = pd.json_normalize(data=data["business_discovery"]["media"], record_path="data")
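As a sketch using the scrubbed structure from the question (dummy values), record_path walks past the outer keys straight to the list of records:

```python
import pandas as pd

# Scrubbed structure from the question (dummy values).
data = {'business_discovery': {'media': {'data': [
    {'media_url': 'a link', 'timestamp': '2022-01-01T01:00:00+0000',
     'caption': 'Caption', 'media_type': 'type',
     'media_product_type': 'product_type',
     'comments_count': 1, 'like_count': 1, 'id': 'ID'},
]}}}

# Hop straight past business_discovery and media to the record list.
df = pd.json_normalize(data=data["business_discovery"]["media"], record_path="data")
print(df.columns.tolist())
```

Each record's keys become columns directly, with no per-field append loop needed.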
Related
I am reading data from an API call and the data is in the form of JSON like below:
{'success': True, 'errors': [], 'requestId': '151a2#fg', 'warnings': [], 'result': [{'id': 10322433, 'name': 'sdfdgd', 'desc': '', 'createdAt': '2016-09-20T13:48:58Z+0000', 'updatedAt': '2020-07-16T13:08:03Z+0000', 'url': 'https://eda', 'subject': {'type': 'Text', 'value': 'Register now'}, 'fromName': {'type': 'Text', 'value': 'ramjdn fg'}, 'fromEmail': {'type': 'Text', 'value': 'ffdfee#ozx.com'}, 'replyEmail': {'type': 'Text', 'value': 'ffdfee#ozx.com'}, 'folder': {'type': 'Folder', 'value': 478, 'folderName': 'sjha'}, 'operational': False, 'textOnly': False, 'publishToMSI': False, 'webView': False, 'status': 'approved', 'template': 1031, 'workspace': 'Default', 'isOpenTrackingDisabled': False, 'version': 2, 'autoCopyToText': True, 'preHeader': None}]}
Now I am creating a dataframe out of this data using the below code:
df = spark.read.json(sc.parallelize([data]))
I am getting only one column, _corrupt_record; below is the dataframe output I am getting. I have tried using multiLine=True but am still not getting the desired output.
+--------------------+
|     _corrupt_record|
+--------------------+
|{'id': 12526, 'na...|
+--------------------+
Expected output is the dataframe after exploding the JSON into different columns: id as one column, name as another, and so on.
I have tried a lot of things but am not able to fix this.
I made certain changes and it worked.
I needed to define a custom schema, then used this bit of code:
data = sc.parallelize([items])
df = spark.createDataFrame(data,schema=schema)
And it worked.
If there are any optimized solution to this please feel free to share.
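One possible optimization (a sketch, not run against a live Spark session): the _corrupt_record column typically appears because spark.read.json received the dict's Python repr (single quotes, True/None) instead of valid JSON text. Serializing with json.dumps first may let Spark infer the schema without defining a custom one:

```python
import json

# Trimmed sample in the shape of the API response above.
data = {'success': True, 'errors': [], 'requestId': '151a2#fg',
        'result': [{'id': 10322433, 'name': 'sdfdgd', 'status': 'approved'}]}

# Valid JSON text: double quotes, true/false/null instead of True/False/None.
json_line = json.dumps(data)

# With a live session (untested here):
# df = spark.read.json(sc.parallelize([json_line]))
```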
I'm using Plaid's API to return balances on banking accounts. Their documentation indicates that all responses come in standard JSON. I have experience loading JSON responses from the request module, but I'm not able to directly load Plaid's response to a pandas dataframe. Here's what happens when I try:
request = AccountsBalanceGetRequest(access_token=token)
response = client.accounts_balance_get(request)
df = pd.json_normalize(response, record_path=['accounts'])
ERROR:
File "C:\Users\<me>\AppData\Local\Programs\Python\Python39\lib\site-packages\pandas\io\json\_normalize.py", line 423, in _json_normalize
raise NotImplementedError
For reference, print(response['accounts']) correctly accesses the relevant part of the response. Here's the section of _normalize from the error, though I don't understand how to apply this to solve the problem:
if isinstance(data, list) and not data:
    return DataFrame()
elif isinstance(data, dict):
    # A bit of a hackjob
    data = [data]
elif isinstance(data, abc.Iterable) and not isinstance(data, str):
    # GH35923 Fix pd.json_normalize to not skip the first element of a
    # generator input
    data = list(data)
else:
    raise NotImplementedError
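This branch explains the error: the Plaid response is a model object, not a dict or list, so json_normalize falls through to the final else. A minimal reproduction (with a hypothetical stand-in class, since the Plaid model itself isn't needed to trigger it):

```python
import pandas as pd

class FakeResponse:
    """Hypothetical stand-in for Plaid's response model object."""
    pass

# Not a list, not a dict, not a non-string iterable -> NotImplementedError.
try:
    pd.json_normalize(FakeResponse(), record_path=['accounts'])
except NotImplementedError:
    print("NotImplementedError: input is not a dict, list, or iterable")
```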
If I print the response, it looks like this:
{'accounts': [{'account_id': 'account_1',
'balances': {'available': 300.0,
'current': 300.0,
'iso_currency_code': 'USD',
'limit': None,
'unofficial_currency_code': None},
'mask': 'xxx1',
'name': 'SAVINGS',
'official_name': 'Bank Savings',
'subtype': 'savings',
'type': 'depository'},
{'account_id': 'account_2',
'balances': {'available': 500.00,
'current': 600.0,
'iso_currency_code': 'USD',
'limit': None,
'unofficial_currency_code': None},
'mask': 'xxx2',
'name': 'CHECKING',
'official_name': 'Bank Checking',
'subtype': 'checking',
'type': 'depository'},
{'account_id': 'account_3',
'balances': {'available': 2000.00,
'current': 2000.00,
'iso_currency_code': 'USD',
'limit': None,
'unofficial_currency_code': None},
'mask': 'xxx3',
'name': 'BUSINESS CHECKING',
'official_name': 'Bank Business Checking',
'subtype': 'checking',
'type': 'depository'}],
'item': {'available_products': ['balance'],
'billed_products': ['auth', 'transactions'],
'consent_expiration_time': None,
'error': None,
'institution_id': 'ins_123xyz',
'item_id': 'item_123xyz',
'update_type': 'background',
'webhook': ''},
'request_id': 'request_123xyz'}
I assume if Plaid's response is standard JSON, the single quotes are only there because Python's print converted them from double quotes. If I take this string as a base and replace single with double quotes, and replace None with "None", I can load to a dataframe:
data = json.loads(responseString.replace("'", '"').replace('None', '"None"'))
df = pd.json_normalize(data, record_path=['accounts'])
print(df)
Applying this directly to Plaid's response also works:
data = str(response)
data = data.replace("'", '"').replace('None', '"None"')
data = json.loads(data)
df = pd.json_normalize(data, record_path=['accounts'])
What I have seems to be a temporary workaround, but not a robust or intended solution. Is there a more preferred way to get there?
UPDATE 1: Expected output from the first block of code in this post would produce the dataframe below, rather than an error:
account_id mask name official_name subtype ... balances.available balances.current balances.iso_currency_code balances.limit balances.unofficial_currency_code
0 account_1 xxx1 SAVINGS Bank Savings savings ... 300.0 300.0 USD None None
1 account_2 xxx2 CHECKING Bank Checking checking ... 500.0 600.0 USD None None
2 account_3 xxx3 BUSINESS CHECKING Bank Business Checking checking ... 2000.0 2000.0 USD None None
I can get the same output with the workaround, but I don't understand why it's necessary, and relying on replacing single quotes with double quotes doesn't seem like a great way to get the result.
UPDATE 2: I installed the plaid components on 10/15/2021 using the non-docker instructions and npm.
print(plaid.__version__)
8.2.0
$ py --version
Python 3.9.6
UPDATE 3: Adding full solution based on Stephen's suggested answer. The response needs to be explicitly converted to a dict first, then processed from there. What worked:
json_string = json.loads(json.dumps(response.to_dict()))
df = pd.json_normalize(json_string, record_path=['accounts'])
This allowed me to cut out all the workarounds needed after converting to a string and basically load straight to a dataframe.
So I think the solution is something like this
json_string = json.dumps(response.to_dict())
# which you can then input into a df
Basically, we moved from returning dictionaries from the API to returning Python models. So we need to travel from model -> dictionary -> json. to_dict is a method on every model that outputs a dictionary, and then json.dumps takes in a dictionary and converts it to valid JSON.
LMK if this works for you :)
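The same model -> dict -> DataFrame pattern, sketched with a plain dict standing in for the Plaid model (and since response.to_dict() already returns a dict, the json.dumps/json.loads round trip in UPDATE 3 can be dropped):

```python
import pandas as pd

# Stand-in for response.to_dict(); values mirror the scrubbed response above.
response_dict = {
    'accounts': [
        {'account_id': 'account_1', 'name': 'SAVINGS',
         'balances': {'available': 300.0, 'current': 300.0}},
        {'account_id': 'account_2', 'name': 'CHECKING',
         'balances': {'available': 500.0, 'current': 600.0}},
    ],
    'request_id': 'request_123xyz',
}

# Nested 'balances' dicts are flattened to balances.available, balances.current.
df = pd.json_normalize(response_dict, record_path=['accounts'])
print(df['account_id'].tolist())  # -> ['account_1', 'account_2']
```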
I have a file from an Open API Spec that I have been trying to access in a Jupyter notebook. It is a .yaml file. I was able to upload it into Jupyter and put it in the same folder as the notebook I'd like to use to access it. I am new to Jupyter and Python, so I'm sorry if this is a basic question. I found a forum that suggested this code to read the data (in my file: "openapi.yaml"):
import yaml
with open("openapi.yaml", 'r') as stream:
    try:
        print(yaml.safe_load(stream))
    except yaml.YAMLError as exc:
        print(exc)
This seems to bring the data in, but it is a completely unstructured stream like so:
{'openapi': '3.0.0', 'info': {'title': 'XY Tracking API', 'version': '2.0', 'contact': {'name': 'Narrativa', 'url': 'http://link, 'email': '}, 'description': 'The XY Tracking Project collects information from different data sources to provide comprehensive data for the XYs, X-Y. Contact Support:'}, 'servers': [{'url': 'link'}], 'paths': {'/api': {'get': {'summary': 'Data by date range', 'tags': [], 'responses': {'200': {'description': 'OK', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/covidtata'}}}}}, 'operationId': 'get-api', 'parameters': [{'schema': {'type': 'string', 'format': 'date'}, 'in': 'query', 'name': 'date_from', 'description': 'Date range beginig (YYYY-DD-MM)', 'required': True}, {'schema': {'type': 'string', 'format': 'date'}, 'in': 'query', 'name': 'date_to', 'description': 'Date range ending (YYYY-DD-MM)'}], 'description': 'Returns the data for a specific date range.'}}, '/api/{date}': {'parameters': [{'schema': {'type': 'string', 'format': 'date'}, 'name': 'date', 'in': 'path', 'required': True}], 'get': {'summary': 'Data by date', 'tags': [], 'responses': {'200': {'description': 'OK', 'content': {'application/json': {'schema': {'$ref': '#/components/schemas/data'}}}}}, 'operationId': 'get-api-date', 'description': 'Returns the data for a specific day.'}}, '/api/country/{country}': {'parameters': [{'schema': {'type': 'string', 'example': 'spain'}, 'name': 'country', 'in': 'path', 'required': True, 'example': 'spain'}, {'schema': {'type': 'strin
...etc.
I'd like to work through the data for analysis but can't seem to access it correctly. Any help would be extremely appreciated!!! Thank you so much for reading.
What you're seeing in the output is the parsed data: yaml.safe_load returns plain Python dictionaries and lists, printed without human-readable newlines or indentation. You should be able to work with this data just fine in your code.
Alternatively, you may want to consider another parser/emitter such as ruamel.yaml which can make dealing with YAML files considerably easier than the package you're currently importing. Print statements with this package can preserve lines and indentation for better readability.
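For instance, since safe_load just returns nested dicts and lists, you can walk the paths section directly. A sketch using a trimmed stand-in for the loaded spec (keys as shown in the question's output):

```python
# Trimmed stand-in for the dict returned by yaml.safe_load on "openapi.yaml".
spec = {
    'openapi': '3.0.0',
    'paths': {
        '/api': {'get': {'summary': 'Data by date range',
                         'operationId': 'get-api'}},
        '/api/{date}': {'get': {'summary': 'Data by date',
                                'operationId': 'get-api-date'}},
    },
}

# One row per (path, method) pair; skips non-method keys like 'parameters'.
rows = [(path, method, op.get('summary'))
        for path, methods in spec['paths'].items()
        for method, op in methods.items()
        if method in ('get', 'post', 'put', 'delete')]
print(rows)
```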
I have data being fetched via an API, but the data is in HTML format, so I used pandas to convert the HTML with to_dict. But when fetching the data in Django, it comes wrapped in a byte string, which means I'm not able to use a for loop to parse the data. How do I remove the string wrapper so that I can fetch the data?
Data:
output = fetchdata(datacenter)
## Dict format to fetch
context = {
'datacenter': datacenter,
'output': output
}
Here is the below OUTPUT:
{'datacenter': 'DC1', 'output': b"[{'Device': 'device01', 'Port': 'Ge0/0/5', 'Provider': 'L3', 'ID': 3324114459135, 'Remote': 'ISP Circuit', 'Destination Port': 'ISP Port'}, {'Device': 'device02', 'Port': 'Ge0/0/5', 'Provider': 'L3', 'ID': 334555114459135, 'Remote': 'ISP Circuit', 'Destination Port': 'ISP Port'}]\n"}
I would like to grab the data from the output and present it in table format.
The output should be a JSON object, so:
import json
json.loads(output)
That should work.
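One caveat: the bytes shown use single quotes, i.e. a Python repr rather than strict JSON, in which case json.loads will raise a decode error. ast.literal_eval from the standard library safely parses that form (sketch using a trimmed version of the output from the question):

```python
import ast

# Trimmed byte string in the shape of the question's output.
output = b"[{'Device': 'device01', 'Port': 'Ge0/0/5', 'Provider': 'L3', 'ID': 3324114459135}]\n"

# Decode the bytes, then evaluate the Python-literal list of dicts.
records = ast.literal_eval(output.decode())
print(records[0]['Device'])  # -> device01
```

From there, `records` is a normal list of dicts that a Django template can loop over to render the table.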
I am using Elasticsearch and trying to connect to it from Python.
I can create an index with a document. When I try to fetch that same document right after successful creation, it returns empty.
If I make the code sleep for 2 seconds after creating the data in Elasticsearch, it returns the actual data.
Is there an interval required between creating and searching the same data?
Sample code:
es_client.index(index=index_id, doc_type=doc_type, id=doc_id, body=body)
returns:
{'_index': 'account_001', '_type': 'UnitTest', '_id': '9f48ae128e4811e88c4b0242ac120013', '_version': 1, 'result': 'created', '_shards': {'total': 2, 'successful': 2, 'failed': 0}, 'created': True}
es_client.search(index=index_id, doc_type=doc_type, body=query, filter_path=filter_path)
returns {}
I got the answer from the Elastic discuss forum:
https://discuss.elastic.co/t/when-i-search-for-document-after-index-a-document-it-returns-empty-but-it-returns-document-with-sleep-of-2-seconds-between-creating-and-fetching/141480/3
In short: Elasticsearch search is near-real-time. A newly indexed document only becomes searchable after the next index refresh, and refresh_interval defaults to 1 second, which is why the 2-second sleep works. Instead of sleeping, you can pass refresh="wait_for" to the index call so it returns only once the document is searchable.