How to read Avro files from S3 in Python?


I have a bunch of Avro files that I would like to read one by one from S3. I have no problem reading the files as bytes, but I am wondering how to iterate over the entries after that. Current code:
conn = boto.s3.connect_to_region("us-east-1")
my_bucket=boto.s3.bucket.Bucket(conn, "my_bucket")
my_key = my_bucket.get_key("folder/file.avro")
raw_bytes = my_key.read()
test_schema = '''
{
"namespace": "com.company",
"type": "record",
"name": "MimeMessage_v2",
"fields": [
{
"name": "record_timestamp",
"type": "long"
},
{
"name": "contents",
"type": "bytes"
}
],
"message_id": 2
}
'''
schema = avro.schema.Parse(test_schema)
#this is the problematic section
dreader = DatumReader(schema, schema)
v = dreader.read(raw_bytes)
I am wondering how to properly read a variable containing the bytes of an Avro file.

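Here is one of the ways that worked for me in Python 3:
import io
import avro.io
from avro.datafile import DataFileReader
# Wrap the raw bytes in a file-like object so the reader can seek in it
avro_bytes = io.BytesIO(raw_bytes)
reader = DataFileReader(avro_bytes, avro.io.DatumReader())
for record in reader:
    print(record)
reader.close()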
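The reason DataFileReader works where the bare DatumReader call in the question does not is, as far as I can tell, that bytes fetched from a .avro file form an object container file (a header with an embedded schema, followed by data blocks) rather than a single encoded datum. A minimal stdlib-only sketch for checking this before parsing, assuming the four-byte magic from the Avro specification; the helper name is_avro_container is made up for illustration:

```python
# Magic bytes that open every Avro object container file ("Obj" followed by 0x01).
AVRO_MAGIC = b"Obj\x01"


def is_avro_container(raw_bytes):
    """Return True if the bytes start with the Avro container-file magic."""
    return raw_bytes[:4] == AVRO_MAGIC


# Hypothetical inputs: a container-like prefix and a plain JSON payload.
print(is_avro_container(b"Obj\x01" + b"rest of file"))
print(is_avro_container(b'{"not": "avro"}'))
```

If the check fails, the object in S3 is probably not a container file, and DataFileReader will not be able to read it either.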

Related

am getting identical sha256 for each json file in python

I am in a huge hashing crisis. Using CHIP-0007's default format, I generated a few JSON files, and I have been trying to generate a SHA-256 hash value for each of them. I expect a unique hash value for each file.
However, my Python code isn't producing one. I thought there might be an issue with the JSON files, but there is not; the problem seems to be in the SHA-256 code.
All the json files ->
JSON File 1
{ "format": "CHIP-0007", "name": "adewale-the-amebo", "description": "Adewale always wants to be in everyone's business.", "attributes": [ { "trait_type": "Gender", "value": "male" } ], "collection": { "name": "adewale-the-amebo Collection", "id": "1" } }
JSON File 2
{ "format": "CHIP-0007", "name": "alli-the-queeny", "description": "Alli is an LGBT Stan.", "attributes": [ { "trait_type": "Gender", "value": "male" } ], "collection": { "name": "alli-the-queeny Collection", "id": "2" } }
JSON File 3
{ "format": "CHIP-0007", "name": "aminat-the-snnobish", "description": "Aminat never really wants to talk to anyone.", "attributes": [ { "trait_type": "Gender", "value": "female" } ], "collection": { "name": "aminat-the-snnobish Collection", "id": "3" } }
Sample CSV File:
Series Number,Filename,Description,Gender
1,adewale-the-amebo,Adewale always wants to be in everyone's business.,male
2,alli-the-queeny,Alli is an LGBT Stan.,male
3,aminat-the-snnobish,Aminat never really wants to talk to anyone.,female
Python CODE
# TODO 2 : Generate a JSON file per entry in team's sheet in CHIP-0007's default format
new_jsonFile = f"{row[1]}.json"
json_data = {}
json_data["format"] = "CHIP-0007"
json_data["name"] = row[1]
json_data["description"] = row[2]
attribute_data = {}
attribute_data["trait_type"] = "Gender" # gender
attribute_data["value"] = row[3] # "value/male/female"
json_data["attributes"] = [attribute_data]
collection_data = {}
collection_data["name"] = f"{row[1]} Collection"
collection_data["id"] = row[0] # "ID of the NFT collection"
json_data["collection"] = collection_data
filepath = f"Json_Files/{new_jsonFile}"
with open(filepath, 'w') as f:
    json.dump(json_data, f, indent=2)
    C += 1
    sha256_hash = sha256_gen(filepath)
    temp.append(sha256_hash)
NEW.append(temp)
# TODO 3 : Calculate sha256 of the each entry
def sha256_gen(fn):
    return hashlib.sha256(open(fn, 'rb').read()).hexdigest()
How can I generate a unique sha256 hash for each JSON?
I tried reading in byte blocks. That is also not working out. After many trials, I am going nowhere. Sharing the unexpected outputs of each JSON file:
[ All hashes are identical ]
Unexpected SHA256 output:
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Expected:
Unique Hash value. Different from each other

Because of output buffering, you're calling sha256_gen(filepath) before anything is written to the file, so you're getting the hash of an empty file. You should do that outside the with, so that the JSON file is closed and the buffer is flushed.
with open(filepath, 'w') as f:
    json.dump(json_data, f, indent=2)
C += 1
sha256_hash = sha256_gen(filepath)
temp.append(sha256_hash)
NEW.append(temp)
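The repeated value e3b0c442...b855 in the question is the SHA-256 digest of zero bytes, which confirms that an empty file was being hashed. A self-contained sketch of the write-then-hash order, using a temporary folder and a hypothetical helper write_json_and_hash that is not part of the original code:

```python
import hashlib
import json
import os
import tempfile

# Digest of empty input; seeing this value means nothing had been written yet.
EMPTY_SHA256 = hashlib.sha256(b"").hexdigest()


def write_json_and_hash(path, payload):
    """Write payload as JSON, then hash the file once it has been closed."""
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    # The with block has exited, so the buffer is flushed and the file closed.
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()


with tempfile.TemporaryDirectory() as folder:
    first = write_json_and_hash(os.path.join(folder, "a.json"), {"name": "adewale-the-amebo"})
    second = write_json_and_hash(os.path.join(folder, "b.json"), {"name": "alli-the-queeny"})

print(first != second)
print(first != EMPTY_SHA256)
```

Reading the file inside its own with block also closes the read handle, which the one-line sha256_gen in the question leaves to the garbage collector.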

Convert a JSON string to multiple CSV's based on its structure and name it to a certain value

I currently have a JSON file saved containing some data I want to convert to CSV. Here is a data sample below; please note that I have censored the actual values for security and privacy reasons.
{
"ID value1": {
"Id": "ID value1",
"TechnischContactpersoon": {
"Naam": "Value",
"Telefoon": "Value",
"Email": "Value"
},
"Disclaimer": [
"Value"
],
"Voorzorgsmaatregelen": [
{
"Attributes": {},
"FileId": "value",
"FileName": "value",
"FilePackageLocation": "value"
},
{
"Attributes": {},
"FileId": "value",
"FileName": "value",
"FilePackageLocation": "value"
},
]
},
"ID value2": {
"Id": "id value2",
"TechnischContactpersoon": {
"Naam": "Value",
"Telefoon": "Value",
"Email": "Value"
},
"Disclaimer": [
"Placeholder"
],
"Voorzorgsmaatregelen": [
{
"Attributes": {},
"FileId": "value",
"FileName": "value",
"FilePackageLocation": "value"
}
]
},
I know how to do this with a simple, flat JSON string (I already have a function that handles JSON-to-CSV conversion). I do not know how to do it with a JSON file that has this kind of layered structure, that is, a second layer beneath the first. As you may have noticed, every object sits under a key that holds its ID value.
So in total I need two kinds of CSV files:
1. The main CSV file, containing just the ID and the Disclaimer. This file is called utility networks and contains all possible ID values and their values.
2. A file containing the "Voorzorgsmaatregelen" values. Because there are multiple values in this section, one CSV file per unique ID is needed, and it needs to be named after that unique ID value.
Deleted this part because it was irrelevant.
Data_folder = "Data"
Unazones_file_name = "UnaZones"
Utilitynetworks_file_name = "utilityNetworks"
folder_path_JSON_BS_JSON = folder_path_creation(Data_folder)
pkml_file_path = os.path.join(folder_path_JSON_BS_JSON,"pmkl.json")
print(pkml_file_path)
json_object = json_open(pkml_file_path)
json_content_unazones = json_object.get("mapRequest").get("UnaZones")
json_content_utility_Networks = json_object.get("utilityNetworks")
Unazones_json_location = json_to_save(json_content_unazones,folder_path_JSON_BS_JSON,Unazones_file_name)
csv_file_location_unazones = os.path.join(folder_path_CSV_file_path(Data_folder),(Unazones_file_name+".csv"))
csv_file_location_Utilitynetwork = os.path.join(folder_path_CSV_file_path(Data_folder),(Utilitynetworks_file_name+".csv"))
json_content_utility_Networks = json_object.get("utilityNetworks")
Utility_networks_json_location = json_to_save(json_content_utility_Networks,folder_path_JSON_BS_JSON,Utilitynetworks_file_name)
def json_to_csv_convertion(json_file_path: str, csv_file_location: str):
    loaded_json_data = json_open(json_file_path)
    # now we will open a file for writing
    data_file = open(csv_file_location, 'w', newline='')
    # create the csv writer object
    csv_writer = csv.writer(data_file, delimiter=";")
    # Counter variable used for writing
    # headers to the CSV file
    count = 0
    for row in loaded_json_data:
        if count == 0:
            # Writing headers of CSV file
            header = row.keys()
            csv_writer.writerow(header)
            count += 1
        # Writing data of CSV file
        csv_writer.writerow(row.values())
    data_file.close()
def folder_path_creation(path: str):
    if not os.path.exists(path):
        os.makedirs(path)
    return path
def json_open(complete_folder_path):
    with open(complete_folder_path) as f:
        json_to_load = json.load(f)  # Modified "objectids" to "object_ids" for readability -sg
    return json_to_load
def json_to_save(input_json, folder_path: str, file_name: str):
    json_save_location = save_file(input_json, folder_path, file_name, "json")
    return json_save_location
So how do I do this, starting from this?
for obj in json_content_utility_Networks:
Go from there?
Keep in mind that this JSON value already has one layer above every object, so for every object I need to start one layer below it.
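One possible way to handle the extra layer is to iterate over the top-level dictionary with .items(), so that each key is the ID and each value is the object one layer below it. The following is only a sketch under assumptions: the function name write_network_csvs, the output file name utilityNetworks.csv, and joining multiple disclaimers with " | " are choices made for illustration, and the IDs are assumed to be usable as file names.

```python
import csv
import os


def write_network_csvs(networks, out_dir):
    """Write one summary CSV plus one detail CSV per top-level ID."""
    os.makedirs(out_dir, exist_ok=True)
    main_path = os.path.join(out_dir, "utilityNetworks.csv")
    with open(main_path, "w", newline="") as main_file:
        writer = csv.writer(main_file, delimiter=";")
        writer.writerow(["Id", "Disclaimer"])
        # The top-level keys are the IDs; the payload sits one layer below.
        for network_id, network in networks.items():
            disclaimer = " | ".join(network.get("Disclaimer", []))
            writer.writerow([network_id, disclaimer])

            rows = network.get("Voorzorgsmaatregelen", [])
            if not rows:
                continue
            # One detail file per ID, named after that ID.
            detail_path = os.path.join(out_dir, f"{network_id}.csv")
            with open(detail_path, "w", newline="") as detail_file:
                detail = csv.DictWriter(
                    detail_file,
                    fieldnames=list(rows[0].keys()),
                    delimiter=";",
                    extrasaction="ignore",
                )
                detail.writeheader()
                detail.writerows(rows)
    return main_path
```

Called as write_network_csvs(json_content_utility_Networks, "Data"), this would write the main file and the per-ID files into the Data folder. Nested values such as "Attributes" are written as their string form; flattening them further would need its own rule.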

How to convert csv to nested arrays in json using python

I am trying to read data from a CSV file and convert it into nested arrays using Python.
The column names of my CSV are
"hallticket_Number ","student_name","gender","course_name","university_course_code ","university_college_code","caste","course_year","semester_yearly_exams","subject_name1","subject_code1","marks_or_grade_points_obtained1","maximum_marks_or_grade_points1","pass_mark1","no_of_credits1","pass_fail_absent1","subject_name2","subject_code2","marks_or_grade_points_obtained2","maximum_marks_or_grade_points2","no_of_credits2","pass_fail_absent2" ,"subject_name3","subject_code3", "marks_or_grade_points_obtained3","maximum_marks_or_grade_points3","no_of_credits3", "pass_fail_absent3" ,"subject_name4" ,"subject_code4" ,"marks_or_grade_points_obtained4","maximum_marks_or_grade_points4","no_of_credits4" , "pass_fail_absent4" ,"subject_code5", "marks_or_grade_points_obtained5" ,"maximum_marks_or_grade_points5","no_of_credits5","pass_fail_absent5","subject_name6","marks_or_grade_points_obtained6","maximum_marks_or_grade_points6", "no_of_credits6","pass_fail_absent","final_result_pass_fail","marks_or_sgpa_
The output I need in JSON is
{
"hallticket_": 22342,
"student_name": "abc",
"gender": "m",
"course_name":" fgd",
"course_code":52,
"college_code ":521,
"caste":"open",
"year":55,
"exam":"s1",
"subject": [ {
"subject_name1":"hh",
"subject_code1":52,
"marks_or_grade_points_obtained1":85,
"maximum_marks_or_grade_points1":50,
"pass_mark1":52,
"no_of_credits1":85,
"pass_fail_absent1":"pass"},]
"subject": [ {
"subject_name2":"hh",
"subject_code2":52,
"marks_or_grade_points_obtained2":85,
"maximum_marks_or_grade_points2":50,
"pass_mark2":52,
"no_of_credits2":85,
"pass_fail_absent2":"pass"},]
"subject": [ {
"subject_name3":"hh",
"subject_code3":52,
"marks_or_grade_points_obtained3":85,
"maximum_marks_or_grade_points3":50,
"pass_mark3":52,
"no_of_credits3":85,
"pass_fail_absent3":"pass"},]
"subject": [ {
"subject_name4":"hh",
"subject_code4":52,
"marks_or_grade_points_obtained4":85,
"maximum_marks_or_grade_points4":50,
"pass_mark4":52,
"no_of_credits4":85,
"pass_fail_absent4":"pass"},]
"subject": [ {
"subject_name5":"hh",
"subject_code5":52,
"marks_or_grade_points_obtained5":85,
"maximum_marks_or_grade_points5":50,
"pass_mark5":52,
"no_of_credits5":85,
"pass_fail_absent5":"pass"},]
"subject": [ {
"subject_name6":"hh",
"subject_code6":52,
"marks_or_grade_points_obtained6":85,
"maximum_marks_or_grade_points6":50,
"pass_mark6":52,
"no_of_credits6":85,
"pass_fail_absent6":"pass"},]
"final_result_pass_fail":"pass",
" marks_or_sgpa_obtained":"8.00",
"maximum_marks_sgpa":"10",
"total_credits":"135"
}

import csv
import json
# Open the CSV (the with block closes the file when done)
with open('data.csv', 'r', newline='') as f:
    reader = csv.DictReader(f)
    # Parse the CSV into JSON
    out = json.dumps([row for row in reader])
print(out)
Hopefully this works as you expect!
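Note that the snippet above produces one flat object per CSV row; it does not group the numbered subject columns. A hedged sketch of one way to nest them, assuming every column whose name ends in digits (subject_name1, subject_code2, and so on) belongs to the subject with that number; the function name nest_subjects and the sample data are made up for illustration:

```python
import csv
import io
import json
import re

# Splits e.g. "subject_name1" into field "subject_name" and index "1".
SUBJECT_COLUMN = re.compile(r"^(?P<field>.+?)(?P<index>\d+)$")


def nest_subjects(row):
    """Group numbered columns into a list of per-subject dictionaries."""
    record = {}
    subjects = {}
    for column, value in row.items():
        name = column.strip()  # some headers carry stray spaces
        match = SUBJECT_COLUMN.match(name)
        if match:
            subjects.setdefault(int(match["index"]), {})[match["field"]] = value
        else:
            record[name] = value
    record["subject"] = [subjects[i] for i in sorted(subjects)]
    return record


# Tiny illustrative CSV standing in for data.csv.
sample = (
    "student_name,subject_name1,subject_code1,subject_name2,subject_code2,final_result_pass_fail\n"
    "abc,hh,52,kk,53,pass\n"
)
rows = [nest_subjects(r) for r in csv.DictReader(io.StringIO(sample))]
print(json.dumps(rows, indent=2))
```

This yields a single "subject" array holding one object per subject, rather than the repeated "subject" keys shown in the question, since JSON parsers generally keep only one value per key.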

Write all values in one line csv.DictWriter

I'm having trouble generating a well-formatted CSV file from some data I fetched from the Leadfeeder API. In the CSV file that is currently being created, not all values are in one row: the id and type ("leads") values end up one row higher than the rest. Like here:
CSV Output
Later, I would also like to load another JSON file, use it to map some values by id, and then also put the visits per lead into my CSV file.
Do you also have some advice for this?
This is my code so far:
import json
import csv
csv_columns = ['name', 'industry', 'website_url', 'status', 'crm_lead_id', 'crm_organization_id', 'employee_count', 'id', 'type' ]
with open('data.json', 'r') as d:
    d = json.load(d)
csv_file = 'lead_daten.csv'
try:
    with open('leads.csv', 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=csv_columns, extrasaction='ignore')
        writer.writeheader()
        for item in d['data']:
            writer.writerow(item)
            writer.writerow(item['attributes'])
except IOError:
    print("I/O error")
My json data has the following structure:
I need also some of the nested values like the id in relationships!
{
"data": [
{
"attributes": {
"crm_lead_id": null,
"crm_organization_id": null,
"employee_count": 5000,
"facebook_url": null,
"first_visit_date": "2019-01-31",
"industry": "Furniture",
"last_visit_date": "2019-01-31",
"linkedin_url": null,
"name": "Example Inc",
"phone": null,
"status": "new",
"twitter_handle": "example",
"website_url": "http://www.example.com"
},
"id": "s7ybF6VxqhQqVM1m1BCnZT_8SRo9XnuoxSUP5ChvERZS9",
"relationships": {
"location": {
"data": {
"id": "8SRo9XnuoxSUP5ChvERZS9",
"type": "locations"
}
}
},
"type": "leads"
},
{
"attributes": {
"crm_lead_id": null,

When you write to a CSV, you must write one full row at a time. Your current code writes one row with only id and type, and then a different row with the other fields.
The correct way is to first fully build a dictionary containing all the fields and only then write it in one single operation. Code could be:
...
        writer.writeheader()
        for item in d['data']:
            item.update(item["attributes"])
            writer.writerow(item)
...
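The question also asks for nested values such as the id under relationships. A sketch of the same build-the-row-first idea that pulls that value in as well; the column name location_id and the helpers flatten_lead and leads_to_csv are assumptions for illustration, and the output goes to an in-memory buffer so the example runs without any files:

```python
import csv
import io

COLUMNS = ["name", "industry", "website_url", "status",
           "employee_count", "id", "type", "location_id"]


def flatten_lead(item):
    """Merge attributes, top-level fields and the nested location id into one row."""
    row = dict(item.get("attributes", {}))
    row["id"] = item.get("id")
    row["type"] = item.get("type")
    location = item.get("relationships", {}).get("location", {}).get("data") or {}
    row["location_id"] = location.get("id")
    return row


def leads_to_csv(items):
    """Write one complete row per lead and return the CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for item in items:
        writer.writerow(flatten_lead(item))
    return buffer.getvalue()


# One lead, abbreviated from the structure shown in the question.
sample = [{
    "attributes": {"employee_count": 5000, "industry": "Furniture", "name": "Example Inc",
                   "phone": None, "status": "new", "website_url": "http://www.example.com"},
    "id": "s7ybF6VxqhQqVM1m1BCnZT_8SRo9XnuoxSUP5ChvERZS9",
    "relationships": {"location": {"data": {"id": "8SRo9XnuoxSUP5ChvERZS9", "type": "locations"}}},
    "type": "leads",
}]
csv_text = leads_to_csv(sample)
print(csv_text)
```

Copying the attributes into a new dictionary, instead of calling item.update, leaves the loaded JSON untouched in case it is needed again for the later mapping step.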

Reformat non-serializable JSON-ish data into a format suitable for value extraction in Python

With the following simple Python script:
import json
file = 'toy.json'
data = json.loads(file)
print(data['gas']) # example
My data generates the error ...is not JSON serializable.
With this, slightly more sophisticated, Python script:
import json
import sys
#load the data into an element
data = open('transactions000000000029.json', 'r')
#dumps the json object into an element
json_str = json.dumps(data)
#load the json to a string
resp = json.loads(json_str)
#extract an element in the response
print(resp['gas'])
The same.
What I'd like to do is extract all the values of a particular index, so ideally I'd like to render the input like so:
...
"hash": "0xf2b5b8fb173e371cbb427625b0339f6023f8b4ec3701b7a5c691fa9cef9daf63",
"gasUsed": "21000",
"hash": "0xf8f2a397b0f7bb1ff212b6bcc57e4a56ce3e27eb9f5839fef3e193c0252fab26"
"gasUsed": "21000"
...
The data looks like this:
{
"blockNumber": "1941794",
"blockHash": "0x41ee74e34cbf9ef4116febea958dbc260e2da3a6bf6f601bfaeb2cd9ab944a29",
"hash": "0xf2b5b8fb173e371cbb427625b0339f6023f8b4ec3701b7a5c691fa9cef9daf63",
"from": "0x3c0cbb196e3847d40cb4d77d7dd3b386222998d9",
"to": "0x2ba24c66cbff0bda0e3053ea07325479b3ed1393",
"gas": "121000",
"gasUsed": "21000",
"gasPrice": "20000000000",
"input": "",
"logs": [],
"nonce": "14",
"value": "0x24406420d09ce7440000",
"timestamp": "2016-07-24 20:28:11 UTC"
}
{
"blockNumber": "1941716",
"blockHash": "0x75e1602cad967a781f4a2ea9e19c97405fe1acaa8b9ad333fb7288d98f7b49e3",
"hash": "0xf8f2a397b0f7bb1ff212b6bcc57e4a56ce3e27eb9f5839fef3e193c0252fab26",
"from": "0xa0480c6f402b036e33e46f993d9c7b93913e7461",
"to": "0xb2ea1f1f997365d1036dd6f00c51b361e9a3f351",
"gas": "121000",
"gasUsed": "21000",
"gasPrice": "20000000000",
"input": "",
"logs": [],
"nonce": "1",
"value": "0xde0b6b3a7640000",
"timestamp": "2016-07-24 20:12:17 UTC"
}
What would be the best way to achieve that?
I've been thinking that perhaps the best way would be to reformat it as valid json?
Or maybe to just treat it like regex?

Your JSON file is not valid. The data should be a list of dictionaries, with each dictionary separated from the next by a comma, like this:
[
{
"blockNumber":"1941794",
"blockHash": "0x41ee74bf9ef411d9ab944a29",
"hash":"0xf2ef9daf63",
"from":"0x3c0cbb196e3847d40cb4d77d7dd3b386222998d9",
"to":"0x2ba24c66cbff0bda0e3053ea07325479b3ed1393",
"gas":"121000",
"gasUsed":"21000",
"gasPrice":"20000000000",
"input":"",
"logs":[
],
"nonce":"14",
"value":"0x24406420d09ce7440000",
"timestamp":"2016-07-24 20:28:11 UTC"
},
{
"blockNumber":"1941716",
"blockHash":"0x75e1602ca8d98f7b49e3",
"hash":"0xf8f2a397b0f7bb1ff212e193c0252fab26",
"from":"0xa0480c6f402b036e33e46f993d9c7b93913e7461",
"to":"0xb2ea1f1f997365d1036dd6f00c51b361e9a3f351",
"gas":"121000",
"gasUsed":"21000",
"gasPrice":"20000000000",
"input":"",
"logs":[
],
"nonce":"1",
"value":"0xde0b6b3a7640000",
"timestamp":"2016-07-24 20:12:17 UTC"
}
]
Then use this to open the file:
with open('toy.json') as data_file:
    data = json.load(data_file)
You can then render the desired output like:
for item in data:
    print(item['hash'])
    print(item['gasUsed'])

If each block is valid JSON data, you can parse them separately:
import json
data = []
with open('transactions000000000029.json') as inpt:
    lines = []
    for line in inpt:
        if line.startswith('{'):  # block starts
            lines = [line]
        else:
            lines.append(line)
        if line.startswith('}'):  # block ends
            data.append(json.loads(''.join(lines)))
for block in data:
    print("hash: {}".format(block['hash']))
    print("gasUsed: {}".format(block['gasUsed']))
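An alternative that does not depend on where the braces fall on a line is the standard library's json.JSONDecoder.raw_decode, which parses one value from a given position and reports where it stopped. A minimal sketch, assuming the file holds only JSON objects separated by whitespace; the helper name iter_json_objects and the shortened sample values are made up for illustration:

```python
import json


def iter_json_objects(text):
    """Yield each JSON value from a string of concatenated JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


# Two concatenated objects, shortened from the data in the question.
text = '{\n"hash": "0xf2b5", "gasUsed": "21000"\n}\n{\n"hash": "0xf8f2", "gasUsed": "21000"\n}\n'
blocks = list(iter_json_objects(text))
for block in blocks:
    print("hash: {}".format(block["hash"]))
    print("gasUsed: {}".format(block["gasUsed"]))
```

For a real file, the text would come from reading the whole file into a string first, which is fine for moderate sizes but not for very large dumps.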
