I have log files with many lines of the form:
<log uri="Brand" t="2017-01-24T11:33:54" u="Rohan" a="U" ref="00000000-2017-01" desc="This has been updated."></log>
I am trying to convert each line of the log file into a DataFrame and store it in CSV or Excel format. I only want the values of uri, t (the timestamp), u (the username), and desc (the description).
Something like this:
Columns: uri Date Time User Description
Brand 2017-01-24 11:33:54 Rohan This has been updated.
and so on.
As mentioned by @Corralien in the comments, you can use some of BeautifulSoup's functions (BeautifulSoup and find_all) to parse each line of your logfile separately, then use the pandas.DataFrame constructor with a list comprehension to make a DataFrame for each line:
import pandas as pd
import bs4  # pip install beautifulsoup4

with open("/tmp/logfile.txt", "r") as f:
    logFile = f.read()

soupObj = bs4.BeautifulSoup(logFile, "html5lib")

dfList = [pd.DataFrame([(x["uri"], *x["t"].split("T"), x["u"], x["desc"])],
                       columns=["uri", "Date", "Time", "User", "Description"])
          for x in soupObj.find_all("log")]

# this block creates an Excel file for each DataFrame
for lineNumber, df in enumerate(dfList, start=1):
    df.to_excel(f"logfile_{lineNumber}.xlsx", index=False)
Output:
print(dfList[0])
uri Date Time User Description
0 Brand 2017-01-24 11:33:54 Rohan This has been updated.
Update:
If you need a single DataFrame/spreadsheet for all the lines, use this:
with open("/tmp/logfile.txt", "r") as f:
    soupObj = bs4.BeautifulSoup(f, "html5lib")

df = pd.DataFrame([(x["uri"], *x["t"].split("T"), x["u"], x["desc"])
                   for x in soupObj.find_all("log")],
                  columns=["uri", "Date", "Time", "User", "Description"])
df.to_excel("logfile.xlsx", index=False)
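If you would rather avoid the BeautifulSoup dependency, the attributes can also be pulled out with a small regex. This is only a sketch, assuming attribute values never contain embedded quotes:

```python
import re

line = '<log uri="Brand" t="2017-01-24T11:33:54" u="Rohan" a="U" ref="00000000-2017-01" desc="This has been updated."></log>'

# grab every key="value" pair from the tag
attrs = dict(re.findall(r'(\w+)="([^"]*)"', line))
date, time = attrs["t"].split("T")
row = (attrs["uri"], date, time, attrs["u"], attrs["desc"])
# row -> ('Brand', '2017-01-24', '11:33:54', 'Rohan', 'This has been updated.')
```

Collecting one such tuple per line gives you the same records to feed into the pandas.DataFrame constructor.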
I'm a beginner in Python. I have an Azure Function that runs on a time trigger. This function reads a batch of raw JSON data, in string format, from an Azure Service Bus.
Below is a sample of two rows of data; in reality I receive about 50 such messages continuously. I want to split this message row by row and then archive each row to Azure Storage.
The message looks like the sample below (row 1 and row 2 concatenated):
{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTACHED_DEVICE_SERIAL_NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}{"Name":"","Seri":4560000,"SiName":"","As":"","PId":2107401,"ICheck":0,"SeeNum":40509704561424,"Type":0,"Counter":34,"PaId":0,"MeType":31,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 
09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTACHED_DEVICE_SERIAL_NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.0254566,\"10\":-0.054562772}},\"NUMBER_TAG\":\"2145600\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":1,"id":"074222a38-2816-42c7-b95c-6644448ba9d","t":-2}
Row 1 is:
{"Name":"","Seri":21000000,"SiName":"","As":"","PId":21070101,"ICheck":0,"SeeNum":405097041391424,"Type":0,"Counter":33,"PaId":0,"MeType":30,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTACHED_DEVICE_SERIAL_NUMBER_TAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-1}
Row 2 is:
{"Name":"","Seri":4560000,"SiName":"","As":"","PId":2107401,"ICheck":0,"SeeNum":40509704561424,"Type":0,"Counter":34,"PaId":0,"MeType":31,"RecTime":"2021-10-21T09:04:41.0151Z","ReaTime":null,"Cape":"2021-10-21T09:04:40.644","Status":0,"text":"{\"TYPE_TAG\":\"00\",\"ENSORAG\":{\"date_time\":\"2021-10-21 09:04:40.644\",\"seber\":10,\"seqmber\":405097041391424,\"lo_name\":\"\",\"accati\":{\"0\":0.0,\"1\":-0.037665367,\"2\":-0.033863068,\"3\":-0.026795387,\"4\":-0.03757,\"5\":-0.02809906,\"6\":-0.016090393,\"7\":-0.040496826,\"8\":-0.05318451,\"9\":-0.025012016,\"10\":-0.057872772}},\"ATTTAG\":\"21000000\",\"error\":{}}","CerId":null,"Id":null,"Asse":null,"Id":0,"id":"075f0a38-2816-42c7-b95c-66c425b8ba9d","t":-2}
The structure of a row is like the below image:
In my opinion, I should first split each row, then create a DataFrame and insert each value into the related column, and finally append it to a blob. Is that right?
How can I do this? What solution would you suggest?
Edited:
My code for reading from the service bus:
from azure.servicebus import ServiceBusClient, ServiceBusMessage

connection_str = "**"
topic_name = "***"
subscription_name = "***"

servicebus_client = ServiceBusClient.from_connection_string(
    conn_str=connection_str, logging_enable=True)

with servicebus_client:
    # get the subscription receiver object for the subscription
    receiver = servicebus_client.get_subscription_receiver(
        topic_name=topic_name, subscription_name=subscription_name)
    with receiver:
        for msg in receiver:
            print("Received: " + str(msg))
            # complete the message so that it is removed from the subscription
            receiver.complete_message(msg)
Since the messages are sent individually, you can process them individually; there is no need to concatenate them into a string. Just keep appending them to a DataFrame. The sample below is for a queue, but you can extend it to a topic/subscription. I've also attached the results to show what the output looks like.
from azure.servicebus import ServiceBusClient
import pandas as pd
import json
from pandas import json_normalize

CONNECTION_STR = 'Endpoint=sb://xxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
QUEUE_NAME = 'xxxxxxxxxxx'

servicebus_client = ServiceBusClient.from_connection_string(conn_str=CONNECTION_STR)

with servicebus_client:
    receiver = servicebus_client.get_queue_receiver(queue_name=QUEUE_NAME)
    dfs = []  # collect one DataFrame per message
    with receiver:
        received_msgs = receiver.receive_messages(max_message_count=10, max_wait_time=5)
        for msg in received_msgs:
            msg_dict = json.loads(str(msg))
            dfs.append(json_normalize(msg_dict))
            receiver.complete_message(msg)
    # DataFrame.append was removed in pandas 2.0; concatenate the parts instead
    df = pd.concat(dfs, ignore_index=True) if dfs else pd.DataFrame()
    print(df)

print("Receive is done.")
Name Seri SiName As ... Id Asse id t
0 21000000 ... 0 None 075f0a38-2816-42c7-b95c-66c425b8ba9d -1
1 4560000 ... 0 None 075f0a38-2816-42c7-b95c-66c425b8ba9d -2
[2 rows x 21 columns]
Receive is done.
Consider sample data with three rows:
data = '{"Name": "Hassan", "code":"12"}{"Name": "Jack", "code":"345"}{"Name": "Jack", "code":"345"}'
Here is how you can get a DataFrame from this data:
import pandas as pd
from ast import literal_eval

data = [literal_eval(d + '}') for d in data.split('}')[0:-1]]
df = pd.DataFrame.from_records(data)
Output:
Name code
0 Hassan 12
1 Jack 345
2 Jack 345
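Note that splitting on '}' only works because these sample objects are flat; the real messages above contain nested objects (and escaped braces inside the text field), where this would break. A more robust sketch uses json.JSONDecoder.raw_decode, which consumes one complete JSON object at a time:

```python
import json

def split_concatenated_json(blob):
    """Yield each top-level object from a string of back-to-back JSON objects."""
    decoder = json.JSONDecoder()
    idx = 0
    while idx < len(blob):
        obj, end = decoder.raw_decode(blob, idx)
        yield obj
        # skip any whitespace between objects
        while end < len(blob) and blob[end].isspace():
            end += 1
        idx = end

rows = list(split_concatenated_json('{"a": 1, "b": {"c": 2}}{"a": 3, "b": {}}'))
# rows -> [{'a': 1, 'b': {'c': 2}}, {'a': 3, 'b': {}}]
```

The resulting list of dicts can be passed straight to pd.DataFrame.from_records as above.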
I'm new to OOP in Python, and I want to read CSV files using OOP.
I have a CSV file with 5 columns separated by commas.
I want to read that CSV file, with each column stored in a column of a new DataFrame.
So, suppose I have data like this:
1434,"2021-08-13 06:31:59",unread,082196788998,kuse.hamdy@gmail.com
1433,"2021-08-13 06:09:41",unread,081554220007,ritaambarwati1@umsida.ac.id
1432,"2021-08-13 05:35:07",unread,081911075017,rifqinaufalfayyadh@gmail.com
I want the OOP code to read that CSV file and store it in a new table like this:
id date status number email
1434 2021-08-13 06:31:59 unread 089296788998 kuse.hamdy@gmail.com
1433 2021-08-13 06:09:41 unread 081554271927 ritati1@yahoo.com
1432 2021-08-13 05:35:07 unread 081911075017 rifqinaufalfayyadh@gmail.com
I tried this code:
import csv

class Complete_list:
    def __init__(self, row, header, list_):
        self.__dict__ = dict(zip(header, row))
        self.list_ = list_

    def __repr__(self):
        return self.list_

data = list(csv.reader(open("complete_list.csv")))
instances = [Complete_list(a, data[1], "date_{}".format(i + 1)) for i, a in enumerate(data[1:])]
instances = list(instances)

for i in instances:
    j = i.list_.split(',')
    print(j)
Somehow, I could not access the values of each list separated by commas and put them into a new DataFrame with multiple columns. Instead, I got a result like this:
['date_1']
['date_2']
['date_3']
To be honest, you are better off using libraries like pandas, but this is how I would approach it:
class complete_list:
    def __init__(self, path, header=None):
        self.data = path
        self.header = header

    def read(self):
        with open(self.data, 'r') as f:
            data = [x.split(',') for x in f.readlines()]
        return data

    def printer(self):
        if self.header:
            a, b, c, d, e = self.header
            yield f'{a:^10} {b:^15} {c:^25}{d:10}{e:^10}'
        for i in self.read():
            yield f'{i[0]:^10}| {i[1]:^10} | {i[2]:^10} | {i[3]:^10} | {i[4]:^10}'

headers = ['id', 'date', 'status', 'number', 'email']
data_frame = complete_list('yes.txt', header=headers).printer()
Output:
id date status number email
1434 | "2021-08-13 06:31:59" | unread | 082196788998 | kuse.hamdy@gmail.com
1433 | "2021-08-13 06:09:41" | unread | 081554220007 | ritaambarwati1@umsida.ac
1432 | "2021-08-13 05:35:07" | unread | 081911075017 | rifqinaufalfayyadh@gmail.com
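A safer variant of the same idea, if the standard csv module is acceptable: csv.reader understands the quoted date field, so you don't have to split on commas yourself (the class name CompleteList here is illustrative, not from the original code):

```python
import csv

class CompleteList:
    """Minimal sketch: read rows with the csv module so quoted fields survive."""
    def __init__(self, path, header=None):
        self.path = path
        self.header = header

    def rows(self):
        with open(self.path, newline="") as f:
            for row in csv.reader(f):
                # map header names onto the values when a header is given
                yield dict(zip(self.header, row)) if self.header else row
```

Each row then comes back as a dict keyed by the header names, without the stray quotes seen in the output above.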
The pandas library is the perfect tool for that:
import pandas as pd
df = pd.read_csv("data.csv", sep=",", names=['id', 'date', 'status', 'number', 'email'])
print(df)
id date status number email
0 1434 2021-08-13 06:31:59 unread 82196788998 kuse.hamdy@gmail.com
1 1433 2021-08-13 06:09:41 unread 81554220007 ritaambarwati1@umsida.ac.id
2 1432 2021-08-13 05:35:07 unread 81911075017 rifqinaufalfayyadh@gmail.com
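One caveat with the output above: pandas parsed the number column as an integer, so the leading zero was dropped. If the phone numbers must stay intact, force that column to string via dtype; a sketch using an in-memory buffer in place of the file:

```python
import io
import pandas as pd

csv_text = '1434,"2021-08-13 06:31:59",unread,082196788998,kuse.hamdy@gmail.com\n'
df = pd.read_csv(io.StringIO(csv_text),
                 names=['id', 'date', 'status', 'number', 'email'],
                 dtype={'number': str})
# df.loc[0, 'number'] -> '082196788998' (leading zero preserved)
```

The same dtype argument works unchanged with pd.read_csv("data.csv", ...).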