I'm a beginner in Python. I have an Azure Function that runs on a timer trigger. The function reads a batch of raw JSON data, in string format, from an Azure Service Bus.
The sample below contains two rows of data; in reality I receive about 50, and messages like this arrive continuously. Now I want to split the message row by row and then archive it to Azure Storage.
The message looks like the sample below (row 1 and row 2 concatenated):
{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTACHED_DEVICE_SERIAL_NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}{"Name":"","Seri":4560000,"SiName":"","As":"","PId":2107401,"ICheck":0,"SeeNum":40509704561424,"Type":0,"Counter":34,"PaId":0,"MeType":31,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 
09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTACHED_DEVICE_SERIAL_NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.0254566,\"10\":-0.054562772}},\"NUMBER_TAG\":\"2145600\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":1,"id":"074222a38-2816-42c7-b95c-6644448ba9d","t":-2}
Row 1 is:
{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTACHED_DEVICE_SERIAL_NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}
Row 2 is:
{"Name":"","Seri":4560000,"SiName":"","As":"","PId":2107401,"ICheck":0,"SeeNum":40509704561424,"Type":0,"Counter":34,"PaId":0,"MeType":31,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTTAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-2}
The structure of a row is shown in the image below:
In my opinion, I should first split each row, then create a DataFrame and insert each value into the related column, and after that append it to a blob. Is that right?
How can I do this, and what solution would you suggest?
Edit: my code for reading from the Service Bus:
from azure.servicebus import ServiceBusClient, ServiceBusMessage

connection_str = "**"
topic_name = "***"
subscription_name = "***"

servicebus_client = ServiceBusClient.from_connection_string(
    conn_str=connection_str, logging_enable=True)

with servicebus_client:
    # get the Subscription Receiver object for the subscription
    receiver = servicebus_client.get_subscription_receiver(
        topic_name=topic_name, subscription_name=subscription_name)
    with receiver:
        for msg in receiver:
            print("Received: " + str(msg))
            # complete the message so that it is removed from the subscription
            receiver.complete_message(msg)
Since the messages are sent individually, you can process them individually; there is no need to concatenate them into a string. Just keep appending them to a DataFrame. The sample below is for a queue, but you can extend it to a topic/subscription. I've also attached the results to show what the output looks like.
from azure.servicebus import ServiceBusClient
import pandas as pd
import json
from pandas import json_normalize

CONNECTION_STR = 'Endpoint=sb://xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
QUEUE_NAME = 'xxxxxxxxxxx'

servicebus_client = ServiceBusClient.from_connection_string(conn_str=CONNECTION_STR)

with servicebus_client:
    receiver = servicebus_client.get_queue_receiver(queue_name=QUEUE_NAME)
    # create an empty DataFrame to collect the rows
    df = pd.DataFrame()
    with receiver:
        received_msgs = receiver.receive_messages(max_message_count=10, max_wait_time=5)
        for msg in received_msgs:
            msg_dict = json.loads(str(msg))
            df2 = json_normalize(msg_dict)
            # DataFrame.append is deprecated (removed in pandas 2.0); use pd.concat
            df = pd.concat([df, df2], ignore_index=True)
            receiver.complete_message(msg)
    print(df)
print("Receive is done.")
Name Seri SiName As ... Id Asse id t
0 21000000 ... 0 None 075f0a38-2816-42c7-b95c-66c425b8ba9d -1
1 4560000 ... 0 None 075f0a38-2816-42c7-b95c-66c425b8ba9d -2
[2 rows x 21 columns]
Receive is done.
Consider sample data with three rows:
data = '{"Name": "Hassan", "code":"12"}{"Name": "Jack", "code":"345"}{"Name": "Jack", "code":"345"}'
Here is how you can get a DataFrame from this data (splitting on '}' works here because the objects contain no nested braces):

from ast import literal_eval
import pandas as pd

data = [literal_eval(d + '}') for d in data.split('}')[:-1]]
df = pd.DataFrame.from_records(data)
Output:
Name code
0 Hassan 12
1 Jack 345
2 Jack 345
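The literal_eval split works for these flat objects, but it breaks as soon as an object contains nested braces, which the real Service Bus messages above do inside their "text" field. A sketch of a more robust splitter using the standard library's json.JSONDecoder.raw_decode, which understands nesting and escaped quotes:

```python
import json

def split_concatenated_json(raw):
    """Split a string of back-to-back JSON objects into a list of dicts."""
    decoder = json.JSONDecoder()
    objects, pos = [], 0
    while pos < len(raw):
        # raw_decode returns the parsed object and the index just past its end
        obj, pos = decoder.raw_decode(raw, pos)
        objects.append(obj)
        # skip any whitespace between consecutive objects
        while pos < len(raw) and raw[pos].isspace():
            pos += 1
    return objects

rows = split_concatenated_json(
    '{"Name": "Hassan", "code": "12"}{"Name": "Jack", "code": "345"}'
)
print(rows)
```

The resulting list of dicts can be fed straight to pd.DataFrame.from_records, exactly as in the answer above.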
I'm new to OOP in Python, and I want to read CSV files using OOP.
I have a CSV file with 5 columns separated by commas.
I want to read that CSV file, with each column stored as a column of a new DataFrame.
So, suppose I have data like these:
1434,"2021-08-13 06:31:59",unread,082196788998,kuse.hamdy#gmail.com
1433,"2021-08-13 06:09:41",unread,081554220007,ritaambarwati1#umsida.ac.id
1432,"2021-08-13 05:35:07",unread,081911075017,rifqinaufalfayyadh#gmail.com
I want the OOP code to read that CSV file and store it in a new table like this:
id date status number email
1434 2021-08-13 06:31:59 unread 089296788998 kuse.hamdy#gmail.com
1433 2021-08-13 06:09:41 unread 081554271927 ritati1#yahoo.com
1432 2021-08-13 05:35:07 unread 081911075017 rifqinaufalfayyadh#gmail.com
I tried this code:

import csv

class Complete_list:
    def __init__(self, row, header, list_):
        self.__dict__ = dict(zip(header, row))
        self.list_ = list_

    def __repr__(self):
        return self.list_

data = list(csv.reader(open("complete_list.csv")))
instances = [Complete_list(a, data[1], "date_{}".format(i + 1)) for i, a in enumerate(data[1:])]

for i in instances:
    j = i.list_.split(',')
    print(j)
Somehow, I could not access the values in each list separated by commas and put them into a new DataFrame with multiple columns. Instead, I got a result like this:
['date_1']
['date_2']
['date_3']
To be honest, you are better off using libraries like pandas, but this is how I would approach it.
class complete_list:
    def __init__(self, path, header=None):
        self.data = path
        self.header = header

    def read(self):
        with open(self.data, 'r') as f:
            # strip the trailing newline before splitting on commas
            data = [x.strip().split(',') for x in f.readlines()]
        return data

    def printer(self):
        if self.header:
            a, b, c, d, e = self.header
            yield f'{a:^10} {b:^15} {c:^25}{d:10}{e:^10}'
        for i in self.read():
            yield f'{i[0]:^10}| {i[1]:^10} | {i[2]:^10} | {i[3]:^10} | {i[4]:^10}'

headers = ['id', 'date', 'status', 'number', 'email']
# printer() is a generator, so iterate over it to actually print
for line in complete_list('yes.txt', header=headers).printer():
    print(line)
Output:
id date status number email
1434 | "2021-08-13 06:31:59" | unread | 082196788998 | kuse.hamdy#gmail.com
1433 | "2021-08-13 06:09:41" | unread | 081554220007 | ritaambarwati1#umsida.ac
1432 | "2021-08-13 05:35:07" | unread | 081911075017 | rifqinaufalfayyadh#gmail.com
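Note that split(',') keeps the literal quote characters around the date field, which is why they show up in the output above. The standard library's csv module strips the quoting for you; a minimal sketch on the sample rows:

```python
import csv
import io

sample = (
    '1434,"2021-08-13 06:31:59",unread,082196788998,kuse.hamdy#gmail.com\n'
    '1433,"2021-08-13 06:09:41",unread,081554220007,ritaambarwati1#umsida.ac.id\n'
)

# csv.reader splits on commas and removes the surrounding quotes
for row in csv.reader(io.StringIO(sample)):
    print(row)
```

The same reader accepts a file object directly, so `open("complete_list.csv")` works in place of the StringIO above.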
The pandas library is the perfect tool for that:
import pandas as pd
df = pd.read_csv("data.csv", sep=",", names=['id', 'date', 'status', 'number', 'email'])
print(df)
id date status number email
0 1434 2021-08-13 06:31:59 unread 82196788998 kuse.hamdy#gmail.com
1 1433 2021-08-13 06:09:41 unread 81554220007 ritaambarwati1#umsida.ac.id
2 1432 2021-08-13 05:35:07 unread 81911075017 rifqinaufalfayyadh#gmail.com
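One caveat: read_csv infers the number column as integers, which drops the leading zeros (compare 082196788998 in the input with 82196788998 in the output above). If the leading zeros matter, the dtype parameter can keep that column as strings; a sketch on a one-row stand-in for data.csv:

```python
import io
import pandas as pd

# a one-row stand-in for data.csv
csv_text = '1434,"2021-08-13 06:31:59",unread,082196788998,kuse.hamdy#gmail.com\n'

df = pd.read_csv(io.StringIO(csv_text), sep=",",
                 names=['id', 'date', 'status', 'number', 'email'],
                 dtype={'number': str})
print(df['number'][0])  # the leading zero is preserved
```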
I'm a beginner with Python.
I want to read the JSON file data1 shown in the attachment.
I have tried to read all the columns in the file, but I can only read the 'data' nest; I don't know how to read the columns in both the 'data' and 'quote' nests. Can you please help me?
Thank you.
My code:

import json
import pandas as pd

data = json.load(open('C:/JSON_IMPORT/data1.json'))
df = pd.DataFrame(data["data"])
print(df)
**JSON file:**
```
{"status":
{"timestamp":"2021-03-16T19:27:55.404Z","error_code":0,"error_message":null,"elapsed":173,"credit_count":22,"notice":null,"total_count":4368},
"data":[{"id":1,
"name":"Bitcoin",
"symbol":"BTC",
"slug":"bitcoin",
"num_market_pairs":9862,
"date_added":"2013-04-28T00:00:00.000Z",
"tags":["mineable","pow","sha-256","store-of-value","state-channels","coinbase-ventures-portfolio","three-arrows-capital-portfolio","polychain-capital-portfolio"],
"max_supply":21000000,
"circulating_supply":18655725,
"total_supply":18655725,
"platform":null,
"cmc_rank":1,
"last_updated":"2021-03-16T19:26:11.000Z",
"quote":{
"USD":{
"price":55643.86231386882,
"volume_24h":57006039705.56386,
"percent_change_1h":-0.22948654,
"percent_change_24h":-0.66133846,
"percent_change_7d":3.26713607,
"percent_change_30d":14.24843475,
"percent_change_60d":54.21680422,
"percent_change_90d":168.83609047,
"market_cap":1038076593265.4004,
"last_updated":"2021-03-16T19:26:11.000Z"}}
}
]
}
```
Here you go. You should use pd.json_normalize and concatenate the result with the DataFrame made from data['status']:
df = pd.concat([pd.DataFrame(data['status'], index=[0]),
                pd.json_normalize(data, record_path=['data'])],
               axis=1)
print(df)
# > timestamp error_code error_message elapsed credit_count notice total_count id name symbol slug num_market_pairs date_added tags max_supply circulating_supply total_supply platform cmc_rank last_updated quote.USD.price quote.USD.volume_24h quote.USD.percent_change_1h quote.USD.percent_change_24h quote.USD.percent_change_7d quote.USD.percent_change_30d quote.USD.percent_change_60d quote.USD.percent_change_90d quote.USD.market_cap quote.USD.last_updated
0 2021-03-16T19:27:55.404Z 0 null 173 22 null 4368 1 Bitcoin BTC bitcoin 9862 2013-04-28T00:00:00.000Z ['mineable', 'pow', 'sha-256', 'store-of-value', 'state-channels', 'coinbase-ventures-portfolio', 'three-arrows-capital-portfolio', 'polychain-capital-portfolio'] 21000000 18655725 18655725 null 1 2021-03-16T19:26:11.000Z 55643.86231386882 57006039705.56386 -0.22948654 -0.66133846 3.26713607 14.24843475 54.21680422 168.83609047 1038076593265.4004 2021-03-16T19:26:11.000Z