Read a JSON file with multiple nested levels using Python pandas

I'm a beginner with Python.
I want to read the JSON file data1 shown below.
I have tried to read all the columns in the file, but I can only read the "data" nest; I don't know how to read the columns in both the "data" and "quote" nests. Can you please help me?
Thank you.
My code:
import json
import pandas as pd

data = json.load(open('C:/JSON_IMPORT/data1.json'))
df = pd.DataFrame(data["data"])
print(df)
**JSON file:**
```
{"status":
  {"timestamp":"2021-03-16T19:27:55.404Z","error_code":0,"error_message":null,"elapsed":173,"credit_count":22,"notice":null,"total_count":4368},
 "data":[{"id":1,
   "name":"Bitcoin",
   "symbol":"BTC",
   "slug":"bitcoin",
   "num_market_pairs":9862,
   "date_added":"2013-04-28T00:00:00.000Z",
   "tags":["mineable","pow","sha-256","store-of-value","state-channels","coinbase-ventures-portfolio","three-arrows-capital-portfolio","polychain-capital-portfolio"],
   "max_supply":21000000,
   "circulating_supply":18655725,
   "total_supply":18655725,
   "platform":null,
   "cmc_rank":1,
   "last_updated":"2021-03-16T19:26:11.000Z",
   "quote":{
     "USD":{
       "price":55643.86231386882,
       "volume_24h":57006039705.56386,
       "percent_change_1h":-0.22948654,
       "percent_change_24h":-0.66133846,
       "percent_change_7d":3.26713607,
       "percent_change_30d":14.24843475,
       "percent_change_60d":54.21680422,
       "percent_change_90d":168.83609047,
       "market_cap":1038076593265.4004,
       "last_updated":"2021-03-16T19:26:11.000Z"}}
   }]
}
```

Here you go. Use pd.json_normalize on the nested 'data' records and concatenate the result with the dataframe made from data['status']:
df = pd.concat([pd.DataFrame(data['status'], index=[0]),
                pd.json_normalize(data, record_path=['data'])],
               axis=1)
print(df)
# > timestamp error_code error_message elapsed credit_count notice total_count id name symbol slug num_market_pairs date_added tags max_supply circulating_supply total_supply platform cmc_rank last_updated quote.USD.price quote.USD.volume_24h quote.USD.percent_change_1h quote.USD.percent_change_24h quote.USD.percent_change_7d quote.USD.percent_change_30d quote.USD.percent_change_60d quote.USD.percent_change_90d quote.USD.market_cap quote.USD.last_updated
0 2021-03-16T19:27:55.404Z 0 null 173 22 null 4368 1 Bitcoin BTC bitcoin 9862 2013-04-28T00:00:00.000Z ['mineable', 'pow', 'sha-256', 'store-of-value', 'state-channels', 'coinbase-ventures-portfolio', 'three-arrows-capital-portfolio', 'polychain-capital-portfolio'] 21000000 18655725 18655725 null 1 2021-03-16T19:26:11.000Z 55643.86231386882 57006039705.56386 -0.22948654 -0.66133846 3.26713607 14.24843475 54.21680422 168.83609047 1038076593265.4004 2021-03-16T19:26:11.000Z
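A small variant, equivalent for this payload: pd.json_normalize also flattens the one-level 'status' dict into a single-row frame, which avoids the index=[0] workaround:
```
df = pd.concat([pd.json_normalize(data['status']),
                pd.json_normalize(data, record_path=['data'])],
               axis=1)
```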

Related

Convert Log file to Dataframe Pandas

I have log files with many lines of the form:
<log uri="Brand" t="2017-01-24T11:33:54" u="Rohan" a="U" ref="00000000-2017-01" desc="This has been updated."></log>
I am trying to convert each line in the log file into a dataframe and store it in CSV or Excel format. I want only the values: uri, t (which is nothing but the time), u for the username, and desc for the description.
Something like this:
uri    Date        Time      User   Description
Brand  2017-01-24  11:33:54  Rohan  This has been updated.
and so on.
As mentioned by @Corralien in the comments, you can use BeautifulSoup functions (the BeautifulSoup constructor and find_all) to parse each line in your logfile separately, then use the pandas.DataFrame constructor with a list comprehension to make a DataFrame for each line:
import pandas as pd
import bs4  # pip install beautifulsoup4

with open("/tmp/logfile.txt", "r") as f:
    logFile = f.read()

soupObj = bs4.BeautifulSoup(logFile, "html5lib")

dfList = [pd.DataFrame([(x["uri"], *x["t"].split("T"), x["u"], x["desc"])],
                       columns=["uri", "Date", "Time", "User", "Description"])
          for x in soupObj.find_all("log")]

# this block creates an Excel file for each df
for lineNumber, df in enumerate(dfList, start=1):
    df.to_excel(f"logfile_{lineNumber}.xlsx", index=False)
Output:
print(dfList[0])
uri Date Time User Description
0 Brand 2017-01-24 11:33:54 Rohan This has been updated.
Update:
If you need a single dataframe/spreadsheet for all the lines, use this:
with open("/tmp/logfile.txt", "r") as f:
    soupObj = bs4.BeautifulSoup(f, "html5lib")

df = pd.DataFrame([(x["uri"], *x["t"].split("T"), x["u"], x["desc"])
                   for x in soupObj.find_all("log")],
                  columns=["uri", "Date", "Time", "User", "Description"])
df.to_excel("logfile.xlsx", index=False)

pyspark streaming dataframe write to different path depending on column values

In a Databricks notebook, I am reading JSON files with readStream. The JSON has a structure like, for example:
id  entityType  eventId
1   person      123
2   employee    234
3   client      687
4   client      687
My code:
cloudfile = {
    "cloudFiles.format": "json",
    "cloudFiles.schemaLocation": SCHEMA_LOCATION,
    "cloudFiles.useNotifications": True,
}
df = (spark.readStream
      .format('cloudfiles')
      .options(**cloudfile)
      .load(SOURCE_PATH)
      )
How can I write it using writeStream to different folders, depending on column values?
Output example:
mainPath/{entityType}/{eventId}/data.json
entity with id = 1 to file: mainPath/person/123/data.json
entity with id = 2 to file: mainPath/employee/234/data.json
entity with id = 3 to file: mainPath/client/687/data.json
...
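One possible approach (a sketch, untested, not from the original thread): DataStreamWriter.partitionBy writes each combination of the named columns into its own subdirectory. The layout is mainPath/entityType=person/eventId=123/part-*.json rather than exactly mainPath/person/123/data.json, but it achieves the per-value folder split:
```
query = (df.writeStream
         .format("json")
         .option("checkpointLocation", CHECKPOINT_PATH)  # CHECKPOINT_PATH is a placeholder
         .partitionBy("entityType", "eventId")
         .start(MAIN_PATH))  # MAIN_PATH is a placeholder for mainPath
```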

Extract nested values from data frame using python

I've extracted the data from an API response and created a function that builds a dictionary:
def data_from_api(a):
    dictionary = dict(
        data=a['number'],
        created_by=a['opened_by'],
        assigned_to=a['assigned'],
        closed_by=a['closed'],
    )
    return dictionary
and then converted it to a df (around 1k records):
raw_data = []
for k in data['resultsData']:
    records = data_from_api(k)
    raw_data.append(records)
I would like to create a function that extracts the nested {display_value} fields from the columns of the dataframe. I need only the names, like John Snow, etc.
I've tried to create something like:
df = pd.DataFrame.from_records(raw_data)

def get_nested_fields(nested):
    if isinstance(nested, dict):
        return nested['display_value']
    else:
        return ''

df['created_by'] = df['opened_by'].apply(get_nested_fields)
df['assigned_to'] = df['assigned'].apply(get_nested_fields)
df['closed_by'] = df['closed'].apply(get_nested_fields)
but I'm getting an error:
KeyError: 'created_by'
Could you please help me?
You can use .str.get() like below. If the key isn't there, it'll write None.
df = pd.DataFrame({'data': [1234, 5678, 5656],
                   'created_by': [{'display_value': 'John Snow', 'link': 'a.com'},
                                  {'display_value': 'John Dow'},
                                  {'my_value': 'Jane Doe'}]})
df['author'] = df['created_by'].str.get('display_value')
Output:
data created_by author
0 1234 {'display_value': 'John Snow', 'link': 'a.com'} John Snow
1 5678 {'display_value': 'John Dow'} John Dow
2 5656 {'my_value': 'Jane Doe'} None
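Note that in the question's dataframe the columns were already renamed by data_from_api (created_by, assigned_to, closed_by), so looking them up by the original API field names raises the KeyError. A sketch of the same .str.get() fix applied in place, assuming those columns hold the raw nested dicts:
```
df = pd.DataFrame.from_records(raw_data)
for col in ['created_by', 'assigned_to', 'closed_by']:
    df[col] = df[col].str.get('display_value')
```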

How can I split a batch string message received from Azure Service Bus row by row?

I'm a beginner in Python. I have an Azure Function that runs on a time trigger. This function reads a batch of raw JSON data, in string format, from an Azure Service Bus.
Below is two rows of data; in reality I receive about 50 such rows in one continuous message. I want to split this message row by row and then archive it to Azure Storage.
The message looks like the sample below (a concatenation of row 1, row 2, and so on):
{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTACHED_DEVICE_SERIAL_NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}{"Name":"","Seri":4560000,"SiName":"","As":"","PId":2107401,"ICheck":0,"SeeNum":40509704561424,"Type":0,"Counter":34,"PaId":0,"MeType":31,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTACHED_DEVICE_SERIAL_NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.0254566,\"10\":-0.054562772}},\"NUMBER_TAG\":\"2145600\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":1,"id":"074222a38-2816-42c7-b95c-6644448ba9d","t":-2}
Row 1 is:
{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTACHED_DEVICE_SERIAL_NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}
Row 2 is:
{"Name":"","Seri":4560000,"SiName":"","As":"","PId":2107401,"ICheck":0,"SeeNum":40509704561424,"Type":0,"Counter":34,"PaId":0,"MeType":31,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTTAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-2}
In my opinion, I should first split each row, then create a data frame and insert each value into the related column, and after that append it to a blob. Is that right?
How can I do this? What is your suggested solution?
Edited:
My code for reading from service bus:
from azure.servicebus import ServiceBusClient, ServiceBusMessage

connection_str = "**"
topic_name = "***"
subscription_name = "***"

servicebus_client = ServiceBusClient.from_connection_string(
    conn_str=connection_str, logging_enable=True)

with servicebus_client:
    # get the Subscription Receiver object for the subscription
    receiver = servicebus_client.get_subscription_receiver(
        topic_name=topic_name, subscription_name=subscription_name)
    with receiver:
        for msg in receiver:
            print("Received: " + str(msg))
            # complete the message so that it is removed from the subscription
            receiver.complete_message(msg)
Since the messages are sent individually, you can process them individually; there is no need to concatenate them into a string. Just keep appending them to a data frame. The sample below is for a queue, but you can extend it to a topic/subscription. I've also attached the results to show what the output looks like.
from azure.servicebus import ServiceBusClient
import pandas as pd
import json
from pandas import json_normalize

CONNECTION_STR = 'Endpoint=sb://xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
QUEUE_NAME = 'xxxxxxxxxxx'

servicebus_client = ServiceBusClient.from_connection_string(conn_str=CONNECTION_STR)

with servicebus_client:
    receiver = servicebus_client.get_queue_receiver(queue_name=QUEUE_NAME)
    # create an empty DataFrame object
    df = pd.DataFrame()
    with receiver:
        received_msgs = receiver.receive_messages(max_message_count=10, max_wait_time=5)
        for msg in received_msgs:
            msg_dict = json.loads(str(msg))
            df2 = json_normalize(msg_dict)
            df = df.append(df2, ignore_index=True)
            receiver.complete_message(msg)
    print(df)
    print("Receive is done.")
Name Seri SiName As ... Id Asse id t
0 21000000 ... 0 None 075f0a38-2816-42c7-b95c-66c425b8ba9d -1
1 4560000 ... 0 None 075f0a38-2816-42c7-b95c-66c425b8ba9d -2
[2 rows x 21 columns]
Receive is done.
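Note that DataFrame.append was removed in pandas 2.0; a sketch of the same receive loop that collects the normalized frames in a list and concatenates once at the end (which is also faster than appending row by row):
```
frames = []
with receiver:
    for msg in receiver.receive_messages(max_message_count=10, max_wait_time=5):
        frames.append(json_normalize(json.loads(str(msg))))
        receiver.complete_message(msg)

df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()
```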
Consider sample data with three rows:
data = '{"Name": "Hassan", "code":"12"}{"Name": "Jack", "code":"345"}{"Name": "Jack", "code":"345"}'
Here is how you can get a dataframe from this data:
from ast import literal_eval

data = [literal_eval(d + '}') for d in data.split('}')[0:-1]]
df = pd.DataFrame.from_records(data)
Output:
Name code
0 Hassan 12
1 Jack 345
2 Jack 345
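The split('}') trick only works when the objects contain no nested braces. For messages like the ones in the question, which nest objects inside objects, json.JSONDecoder.raw_decode can walk the concatenated string one top-level object at a time; a sketch:
```
import json
import pandas as pd

def iter_concatenated_json(s):
    """Yield each top-level JSON object from a concatenated string."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(s):
        # skip any whitespace between objects
        while idx < len(s) and s[idx].isspace():
            idx += 1
        if idx >= len(s):
            break
        obj, idx = decoder.raw_decode(s, idx)
        yield obj

df = pd.DataFrame.from_records(iter_concatenated_json(data))
```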

Read CSV with OOP Python

I'm new to OOP in Python, and I want to read CSV files using OOP.
I have a CSV file with 5 columns separated by commas.
I want to read that CSV file, with each column stored as a column of a new dataframe.
So, suppose I have data like these:
1434,"2021-08-13 06:31:59",unread,082196788998,kuse.hamdy#gmail.com
1433,"2021-08-13 06:09:41",unread,081554220007,ritaambarwati1#umsida.ac.id
1432,"2021-08-13 05:35:07",unread,081911075017,rifqinaufalfayyadh#gmail.com
I want the OOP code to read that CSV file and store it in a new table like this:
id    date                 status  number        email
1434  2021-08-13 06:31:59  unread  089296788998  kuse.hamdy@gmail.com
1433  2021-08-13 06:09:41  unread  081554271927  ritati1@yahoo.com
1432  2021-08-13 05:35:07  unread  081911075017  rifqinaufalfayyadh@gmail.com
I tried this code:
import csv

class Complete_list:
    def __init__(self, row, header, list_):
        self.__dict__ = dict(zip(header, row))
        self.list_ = list_

    def __repr__(self):
        return self.list_

data = list(csv.reader(open("complete_list.csv")))
instances = [Complete_list(a, data[1], "date_{}".format(i + 1)) for i, a in enumerate(data[1:])]
instances = list(instances)
for i in instances:
    j = i.list_.split(',')
    print(j)
Somehow, I could not access the comma-separated values of each row and put them into a new dataframe with multiple columns. Instead, I got results like this:
['date_1']
['date_2']
['date_3']
To be honest, you are better off using libraries like pandas, but this is how I would approach it:
class complete_list:
    def __init__(self, path, header=None):
        self.data = path
        self.header = header

    def read(self):
        with open(self.data, 'r') as f:
            data = [x.split(',') for x in f.readlines()]
        return data

    def printer(self):
        if self.header:
            a, b, c, d, e = self.header
            yield f'{a:^10} {b:^15} {c:^25}{d:10}{e:^10}'
        for i in self.read():
            yield f'{i[0]:^10}| {i[1]:^10} | {i[2]:^10} | {i[3]:^10} | {i[4]:^10}'

headers = ['id', 'date', 'status', 'number', 'email']
data_frame = complete_list('yes.txt', header=headers).printer()
Output:
    id          date              status      number        email
  1434   | "2021-08-13 06:31:59" |   unread   | 082196788998 | kuse.hamdy@gmail.com
  1433   | "2021-08-13 06:09:41" |   unread   | 081554220007 | ritaambarwati1@umsida.ac
  1432   | "2021-08-13 05:35:07" |   unread   | 081911075017 | rifqinaufalfayyadh@gmail.com
The pandas library is the perfect tool for that
import pandas as pd
df = pd.read_csv("data.csv", sep=",", names=['id', 'date', 'status', 'number', 'email'])
print(df)
     id                 date  status       number                        email
0  1434  2021-08-13 06:31:59  unread  82196788998         kuse.hamdy@gmail.com
1  1433  2021-08-13 06:09:41  unread  81554220007  ritaambarwati1@umsida.ac.id
2  1432  2021-08-13 05:35:07  unread  81911075017  rifqinaufalfayyadh@gmail.com
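One caveat: pandas inferred the number column as integers, which drops the leading zeros visible in the source file. A sketch of the same call that keeps them, using the dtype parameter:
```
df = pd.read_csv("data.csv", sep=",",
                 names=['id', 'date', 'status', 'number', 'email'],
                 dtype={'number': str})  # keep the phone numbers as text
```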
