I want to change json format to table format - python

I want to convert JSON format to tabular format using Python. The structure uses nested dicts and lists.
Currently:
{'tables': [{'name': 'PrimaryResult', 'columns': [{'name': 'TimeGenerated', 'type': 'datetime'}, {'name': 'OperationName', 'type': 'string'}, {'name': 'Category', 'type': 'string'}], 'rows': [['2021-08-24T04:08:01.966Z', 'Restore application', 'ApplicationManagement'], ['2021-08-24T06:52:22.14Z', 'Bulk create users - started (bulk)', 'UserManagement'], ['2021-08-24T06:52:22.671Z', 'Bulk create users - finished (bulk)', 'UserManagement'], ['2021-08-24T06:52:22.471Z', 'Add user', 'UserManagement'], ['2021-08-24T06:52:22.501Z', 'Add user', 'UserManagement'], ['2021-08-24T06:52:22.594Z', 'Add user', 'UserManagement'], ['2021-08-24T06:52:22.513Z', 'Add user', 'UserManagement'], ['2021-08-24T06:54:48.482Z', 'Enable Strong Authentication', 'UserManagement'], ['2021-08-24T06:54:48.487Z', 'Update user', 'UserManagement'], ['2021-08-24T06:54:33.391Z', 'Enable Strong Authentication', 'UserManagement']]}]}
Desired table:
headers: tables | TimeGenerated | OperationName | Category
e.g.: PrimaryResult, 2021-08-24T04:08:01.966Z, Restore application, ApplicationManagement

Here is a quick and straightforward solution:
import pandas as pd
import json

# Open the JSON file
with open('{your_file_path}') as json_file:
    data = json.load(json_file)

# Build the dataframe: rows and column names come from the first table
pd_data = data['tables'][0]['rows']
pd_columns = [col['name'] for col in data['tables'][0]['columns']]
df = pd.DataFrame(data=pd_data, columns=pd_columns)
You may export the dataframe to any of the table formats pandas provides.
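A minimal end-to-end sketch, using a two-row subset of the payload from the question, that builds the dataframe and exports it two of those ways:

```python
import pandas as pd

# Same shape as the question's payload, trimmed to two rows for brevity
data = {'tables': [{'name': 'PrimaryResult',
                    'columns': [{'name': 'TimeGenerated', 'type': 'datetime'},
                                {'name': 'OperationName', 'type': 'string'},
                                {'name': 'Category', 'type': 'string'}],
                    'rows': [['2021-08-24T04:08:01.966Z', 'Restore application', 'ApplicationManagement'],
                             ['2021-08-24T06:52:22.14Z', 'Bulk create users - started (bulk)', 'UserManagement']]}]}

table = data['tables'][0]
df = pd.DataFrame(table['rows'], columns=[c['name'] for c in table['columns']])

csv_text = df.to_csv(index=False)       # plain CSV text
records = df.to_dict(orient='records')  # list of row dicts
```

`df.to_html()`, `df.to_markdown()`, and `df.to_excel()` work the same way if another output format is needed.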

Related

Airflow: How to get the current date of when data is inserted into a BigQuery table?

I am inserting data from a GCS bucket into BigQuery, and I am unsure how to populate a column with the current date at the time the data is inserted.
This is my operator:
load_csv = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq_example',
    bucket='cloud-samples-data',
    source_objects=['SOURCE-FILE-LOCATION'],
    destination_project_dataset_table='airflow_test.gcs_to_bq_table',
    schema_fields=[
        {'name': 'item', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'date', 'type': 'DATE', 'mode': 'NULLABLE'},
    ],
    write_disposition='WRITE_TRUNCATE',
    dag=dag)
So, in my schema, I have item and date.
Therefore, when triggering my DAG to insert the data from the GCS Bucket to BigQuery, how do I make it so that the date column contains the current date of when the data gets inserted?
For example, if I insert it today, then the date column should be 2022-11-24.
There might be two ways to reach the desired result, though I am not certain of either.
The first one is to use default values as described here and add a column to your schema:
schema_fields=[
    {'name': 'item', 'type': 'STRING', 'mode': 'NULLABLE'},
    {'name': 'date', 'type': 'DATE', 'mode': 'NULLABLE'},
    {'name': 'load_date', 'type': 'DATE', 'default': 'CURRENT_DATE'},
]
However, this feature is pre-GA, so I am not sure whether you can use it (I also haven't tested it, sorry).
The other possibility would be to use Airflow's templating ability and add another step:
load_csv = gcs_to_bq.GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq_example',
    bucket='cloud-samples-data',
    source_objects=['SOURCE-FILE-LOCATION'],
    destination_project_dataset_table='airflow_test.gcs_to_bq_table_{{ ds_nodash }}',
    schema_fields=[
        {'name': 'item', 'type': 'STRING', 'mode': 'NULLABLE'},
        {'name': 'date', 'type': 'DATE', 'mode': 'NULLABLE'},
    ],
    write_disposition='WRITE_TRUNCATE',
    dag=dag)
With this operation you'll get your file loaded into a table whose name carries the ingestion date (or timestamp, if you use ts_nodash). You're then free to use the BigQueryOperator to insert this staged data into your destination table with some SQL.
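A sketch of what that follow-up SQL step might look like. The destination table name and column list are assumptions based on the question's schema; in a real DAG this string would be passed to a BigQuery operator as its SQL argument, with {{ ds_nodash }} rendered by Airflow's templating before execution:

```python
# Staging table name with Airflow's templated ingestion date (rendered at run time)
staging_table = "airflow_test.gcs_to_bq_table_{{ ds_nodash }}"

# Hypothetical destination table; CURRENT_DATE() stamps the load date on insert
insert_sql = f"""
INSERT INTO `airflow_test.destination_table` (item, date)
SELECT item, CURRENT_DATE()
FROM `{staging_table}`
"""
```

This keeps the raw file load and the date-stamping as two separate, retryable tasks.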

converting a deep nested loop from JSON into Pandas DF

I am getting info from an API, and getting this as the resulting JSON file:
{'business_discovery': {'media': {'data': [{'media_url': 'a link',
'timestamp': '2022-01-01T01:00:00+0000',
'caption': 'Caption',
'media_type': 'type',
'media_product_type': 'product_type',
'comments_count': 1,
'like_count': 1,
'id': 'ID'},
{'media_url': 'link',
# ... and so on
# NOTE: I scrubbed the numbers with dummy data
I know that to get the data I can run this script over the inner object:
# "a" is the json without business_discovery or media, i.e. this:
a = {'data': [{'media_url': 'a link',
'timestamp': '2022-01-01T01:00:00+0000',
'caption': 'Caption',
'media_type': 'type',
'media_product_type': 'product_type',
'comments_count': 1,
'like_count': 1,
'id': 'ID'},
{'media_url': 'link',
# ... and so on
media_url, timestamp, caption, media_type, media_product_type, comment_count, like_count, id_code = [], [], [], [], [], [], [], []
for result in a['data']:
    media_url.append(result['media_url'])  # appending each field from the JSON to its list
    timestamp.append(result['timestamp'])
    caption.append(result['caption'])
    media_type.append(result['media_type'])
    media_product_type.append(result['media_product_type'])
    comment_count.append(result['comments_count'])
    like_count.append(result['like_count'])
    id_code.append(result['id'])  # all fields exist, even when a value is 0
df = pd.DataFrame([media_url, timestamp, caption, media_type, media_product_type, comment_count, like_count, id_code]).T
When I run the above on the full API response, I get errors saying the keys are not found.
It works fine on the inner dict, but I'm trying to figure out a way to "hop" over both business_discovery and media to get straight to data, so I can run this more effectively rather than copying and pasting the inner part by hand.
Use pd.json_normalize with a record_path:
df = pd.json_normalize(data=data["business_discovery"]["media"], record_path="data")
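A self-contained sketch of that call, using one record in the same shape as the question (values are placeholders); record_path hops straight to the inner list, skipping the outer keys:

```python
import pandas as pd

# Nested payload shaped like the question's API response
data = {'business_discovery': {'media': {'data': [
    {'media_url': 'a link', 'timestamp': '2022-01-01T01:00:00+0000',
     'caption': 'Caption', 'media_type': 'type',
     'media_product_type': 'product_type',
     'comments_count': 1, 'like_count': 1, 'id': 'ID'},
]}}}

# record_path="data" expands each dict in the inner list into one row
df = pd.json_normalize(data=data['business_discovery']['media'], record_path='data')
```

Each key of the inner dicts becomes a column, so no per-field append loop is needed.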

Pandas read the chat log log json to data frame?

How do I convert this nested structure into a data frame? The dict below contains details about cloud workspaces; I want to extract the name, language, description, and workspace_id.
{'workspaces': [{'name': 'A_SupportAgent_dev',
                 'language': 'en',
                 'metadata': {'api_version': {'major_version': 'v1',
                                              'minor_version': '2019-02-28'},
                              'digressions': True},
                 'description': 'Credit Card Banking Support Agent to assist with Sales And Service, created by Oliver Ivanoski and Steve Green',
                 'workspace_id': '',
                 'learning_opt_out': False},
                {'name': 'Neatnik Watson Assistant Webhook Demo Skill',
                 'language': 'en',
                 'metadata': {'api_version': {'major_version': 'v1',
                                              'minor_version': '2019-02-28'}},
                 'webhooks': [{'url': 'https://neatnik.net/watson/assistant/webhook/',
                               'name': 'main_webhook',
                               'headers': []}],
                 'description': '',
                 'workspace_id': '',
                 'system_settings': {'tooling': {'store_generic_responses': True},
                                     'system_entities': {'enabled': True},
                                     'spelling_auto_correct': True},
                 'learning_opt_out': False}],
 'pagination': {'refresh_url': '/v1/workspaces?version=2019-02-28'}}
I want to convert the above structure into a flat data frame.
I tried:
pd.DataFrame(list(Workspace_List.items()), columns=['workspaces', 'pagination'])
columns = list(Workspace_List.keys())
values = list(Workspace_List.values())
arr_len = len(values)
You need to pick the columns yourself, since each workspace is another dictionary. I think the code below will help you organize your desired output:
key = ['name', 'language', 'description', 'workspace_id']
output = pd.DataFrame(columns=key)
for i in range(len(df['workspaces'])):  # df here is the original dict, e.g. Workspace_List
    ll = df['workspaces'][i]
    output.loc[i] = [ll[x] for x in key]
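A runnable variant of the same idea, with dummy workspace data in the question's shape. Using .get() guards against workspaces that omit a key entirely (the indexing above would raise a KeyError in that case):

```python
import pandas as pd

# Dummy data shaped like the question's payload; the second workspace
# deliberately has no 'description' key at all
workspace_list = {'workspaces': [
    {'name': 'A_SupportAgent_dev', 'language': 'en',
     'description': 'Support Agent', 'workspace_id': 'abc'},
    {'name': 'Neatnik Watson Assistant Webhook Demo Skill', 'language': 'en',
     'workspace_id': 'def'},
]}

keys = ['name', 'language', 'description', 'workspace_id']
# .get(k, '') substitutes an empty string when a workspace lacks a key
rows = [[ws.get(k, '') for k in keys] for ws in workspace_list['workspaces']]
output = pd.DataFrame(rows, columns=keys)
```

Building all rows first and constructing the DataFrame once is also faster than appending with output.loc inside a loop.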

Parsing Panda to_dict

I have data being fetched via an API, but it arrives as HTML, so I used pandas to convert the HTML and then called to_dict. When fetching the data in Django, the output is wrapped in a byte string, so I'm not able to parse it with a for loop. How do I remove the string wrapper so that I can use the data?
Data:
output = fetchdata(datacenter)

# dict passed to the template
context = {
    'datacenter': datacenter,
    'output': output,
}
Here is the below OUTPUT:
{'datacenter': 'DC1', 'output': b"[{'Device': 'device01', 'Port': 'Ge0/0/5', 'Provider': 'L3', 'ID': 3324114459135, 'Remote': 'ISP Circuit', 'Destination Port': 'ISP Port'}, {'Device': 'device02', 'Port': 'Ge0/0/5', 'Provider': 'L3', 'ID': 334555114459135, 'Remote': 'ISP Circuit', 'Destination Port': 'ISP Port'}]\n"}
I would like to grab the data from output and present it in table format.
The output should be a JSON object, so:
import json
json.loads(output)
That should work. Note, though, that the bytes shown use single quotes, which is not valid JSON; if json.loads raises an error, parse the Python-style literal with ast.literal_eval(output.decode()) instead.
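Since the bytes shown in the question use single quotes (a Python repr, not valid JSON), json.loads would raise a JSONDecodeError on them. A minimal sketch of the safer parse, using a trimmed copy of the sample output:

```python
import ast

# Trimmed copy of the byte string from the question: a Python-literal list
# of dicts, not valid JSON (single quotes)
raw = b"[{'Device': 'device01', 'Port': 'Ge0/0/5', 'Provider': 'L3'}]\n"

# ast.literal_eval safely parses Python literals (lists, dicts, strings,
# numbers) without executing arbitrary code
rows = ast.literal_eval(raw.decode())
```

Once parsed, rows is a plain list of dicts and can be iterated in the Django template as usual.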

How do I delete a dict item that is a 'NoneType'?

I'm making a program that takes Salesforce reports, iterates through them, and displays them in a Flask app.
from flask import Flask
from flask import render_template
import csv
import collections

app = Flask(__name__)

# odict_keys(['Edit', 'Activity ID', 'Assigned', 'Subject', 'Last Modified Date', 'Date', 'Priority',
#             'Status', 'Company / Account', 'Created By', 'Activity Type', 'Comments'])

@app.route('/')
def hello_world():
    reports = {}
    with open('./reports/report040717.csv', newline='') as csvfile:
        reader = csv.DictReader(csvfile, delimiter=',', quotechar='"')
        for row in reader:
            temp = {row['Activity ID']: {'subject': row['Subject'], 'due_date': row['Date'],
                                         'last_modified': row['Last Modified Date'], 'status': row['Status'],
                                         'company': row['Company / Account'], 'type': row['Activity Type'],
                                         'comments': row['Comments']}}
            reports.update(temp)
    reports = collections.OrderedDict(reversed(list(reports.items())))
    for k, v in reports.items():
        print(k, v)
    return render_template('home.html', reports=reports)

if __name__ == '__main__':
    app.run()
I'm storing each row read into a dictionary, then pushing that dictionary into another dictionary with the Activity ID as the key.
The problem is I keep getting an empty entry and I can't figure out how to delete it.
This is how it shows up when I print the k, v values before calling render:
None {'subject': None, 'due_date': None, 'last_modified': None, 'status': None, 'company': None, 'type': None, 'comments': None}
and this is how all of the others show up:
00TF000003Ti9iE {'subject': 'some text', 'due_date': '4/18/2017', 'last_modified': '8/23/2016', 'status': 'Not Started', 'company': 'some text', 'type': 'some text', 'comments': 'some text'}
Any suggestions on how to delete the None entry in the report dict?
I would imagine you're reading in empty rows from your CSV file. You can do the following so those empty rows aren't loaded into the dictionary at all:
if row['Activity ID'] and row['Subject']:
    # the row is only added if the values above aren't None/empty
    temp = {row['Activity ID']: {'subject': row['Subject'], 'due_date': row['Date'],
                                 'last_modified': row['Last Modified Date'], 'status': row['Status'],
                                 'company': row['Company / Account'], 'type': row['Activity Type'],
                                 'comments': row['Comments']}}
    reports.update(temp)
You shouldn't put anything into the larger dictionary that contains only None values. My original code was incorrect; this is TemporalWolf's suggested fix:
if None not in temp:
    reports.update(temp)
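If the None entry has already made it into the dict, it can also be removed after the fact. A small sketch with dummy data in the question's shape, showing both a targeted delete and a rebuild that keeps only truthy keys:

```python
# Dummy report dict shaped like the question's output: one good entry
# and one empty entry keyed by None
reports = {
    '00TF000003Ti9iE': {'subject': 'some text', 'status': 'Not Started'},
    None: {'subject': None, 'status': None},
}

# Option 1: delete the None key if present (no error if it's absent)
reports.pop(None, None)

# Option 2: rebuild, keeping only entries with a truthy key
cleaned = {k: v for k, v in reports.items() if k}
```

Filtering at read time, as above, is still preferable, but this is handy when the dict comes from code you don't control.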
