I am starting to work with some data manipulation and I need to create a new file (with new features) out of an old one. However, I could not figure out how to customize my own dataframe before using the ".to_json" method.
For example, I have a .csv as:
seller, customer, product, price
Roger, Will, 8129, 30
Roger, Markus, 1234, 100
Roger, Will, 2334, 50
Mike, Markus, 2295, 20
Mike, Albert, 1234, 100
...and I want to generate a .json file to help me visualize a network out of it. It should look more or less like:
{
"node": [
{"id":"Roger", "group": "seller" },
{"id":"Mike", "group": "seller" },
{"id":"Will", "group": "customer" },
{"id":"Markus", "group": "customer" },
{"id":"Albert", "group": "customer" }
],
"links":[
{"source":"Roger","target":"Will","product":8129,"price":30},
#...and so on
]
}
I tried to do something like:
df1 = pd.read_csv('file.csv')
seller_list = df1.seller.unique()
customer_list = df1.customer.unique()
...and I could indeed get lists with unique items. However, I could not find how I should add them to a dataframe in order to create a structure such as:
"node":[
...
{"id":"Mike", "group": "seller" },
{"id":"Markus", "group": "customer" },
...
]...#see above
Any support or hint on this is appreciated.
This will be a two-step process. First, create the nodes list using melt + drop_duplicates + to_dict -
nodes = df[['customer', 'seller']]\
            .melt(var_name='group', value_name='id')\
            .drop_duplicates()\
            .to_dict('records')
Now, create the links list using rename + to_dict (note that the 'r' shorthand for 'records' is deprecated and was removed in pandas 2.0) -
links = df.rename(columns={'seller' : 'source', 'customer' : 'target'}).to_dict('records')
Now, combine the data into one dictionary, and dump it as JSON to a file.
import json

data = {'nodes' : nodes, 'links' : links}
with open('data.json', 'w') as f:
    json.dump(data, f, indent=4)
Your data.json file should look like this -
{
"nodes": [
{
"id": "Will",
"group": "customer"
},
{
"id": "Markus",
"group": "customer"
},
{
"id": "Albert",
"group": "customer"
},
{
"id": "Roger",
"group": "seller"
},
{
"id": "Mike",
"group": "seller"
}
],
"links": [
{
"product": 8129,
"target": "Will",
"source": "Roger",
"price": 30
},
{
"product": 1234,
"target": "Markus",
"source": "Roger",
"price": 100
},
{
"product": 2334,
"target": "Will",
"source": "Roger",
"price": 50
},
{
"product": 2295,
"target": "Markus",
"source": "Mike",
"price": 20
},
{
"product": 1234,
"target": "Albert",
"source": "Mike",
"price": 100
}
]
}
Related
I am new to Python and want to convert a CSV file into a JSON file. The JSON file is nested with a dynamic structure; the structure is defined by the CSV header.
From csv input:
ID, Name, person_id/id_type, person_id/id_value,person_id_expiry_date,additional_info/0/name,additional_info/0/value,additional_info/1/name,additional_info/1/value,salary_info/details/0/grade,salary_info/details/0/payment,salary_info/details/0/amount,salary_info/details/1/next_promotion
1,Peter,PASSPORT,A452817,1-01-2055,Age,19,Gender,M,Manager,Monthly,8956.23,unknown
2,Jane,PASSPORT,B859804,2-01-2035,Age,38,Gender,F,Worker, Monthly,125980.1,unknown
To json output:
[
{
"ID": 1,
"Name": "Peter",
"person_id": {
"id_type": "PASSPORT",
"id_value": "A452817"
},
"person_id_expiry_date": "1-01-2055",
"additional_info": [
{
"name": "Age",
"value": 19
},
{
"name": "Gender",
"value": "M"
}
],
"salary_info": {
"details": [
{
"grade": "Manager",
"payment": "Monthly",
"amount": 8956.23
},
{
"next_promotion": "unknown"
}
]
}
},
{
"ID": 2,
"Name": "Jane",
"person_id": {
"id_type": "PASSPORT",
"id_value": "B859804"
},
"person_id_expiry_date": "2-01-2035",
"additional_info": [
{
"name": "Age",
"value": 38
},
{
"name": "Gender",
"value": "F"
}
],
"salary_info": {
"details": [
{
"grade": "Worker",
"payment": " Monthly",
"amount": 125980.1
},
{
"next_promotion": "unknown"
}
]
}
}
]
Is this something that can be done with the existing pandas API, or do I have to write a lot of complex code to dynamically construct the JSON object? Thanks.
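There is no single pandas call for this, but it does not take much code either: split each slash-delimited header into path segments and build the nested structure path by path. Below is a minimal sketch (the `insert` helper and the inline `SAMPLE` are mine, not from the question); numeric path segments become list indices, everything else becomes a dict key, and values are kept as strings here - converting "19" to the integer 19 as in the expected output would need an extra casting step.

```python
import csv
import io
import json

# inline stand-in for the question's CSV file
SAMPLE = """ID,Name,person_id/id_type,person_id/id_value,additional_info/0/name,additional_info/0/value
1,Peter,PASSPORT,A452817,Age,19
"""

def insert(container, parts, value):
    """Place value at the nested path described by parts, creating
    intermediate dicts/lists as needed."""
    head, rest = parts[0], parts[1:]
    if head.isdigit():                        # list index
        head = int(head)
        while len(container) <= head:         # grow the list as needed
            container.append(None)
        if not rest:
            container[head] = value
        else:
            if container[head] is None:
                container[head] = [] if rest[0].isdigit() else {}
            insert(container[head], rest, value)
    else:                                     # dict key
        if not rest:
            container[head] = value
        else:
            if head not in container:
                container[head] = [] if rest[0].isdigit() else {}
            insert(container[head], rest, value)

rows = []
for record in csv.DictReader(io.StringIO(SAMPLE), skipinitialspace=True):
    row = {}
    for path, value in record.items():
        insert(row, path.strip().split('/'), value)
    rows.append(row)

print(json.dumps(rows, indent=2))
```

For the real file, replace `io.StringIO(SAMPLE)` with an `open(...)` call; `skipinitialspace=True` absorbs the stray spaces after commas in the sample header.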
Goal: create a script that takes a nested JSON object as input and outputs a CSV file with all keys as rows.
Example:
{
"Document": {
"DocumentType": 945,
"Version": "V007",
"ClientCode": "WI",
"Shipment": [
{
"ShipmentHeader": {
"ShipmentID": 123456789,
"OrderChannel": "Shopify",
"CustomerNumber": 234234,
"VendorID": "2343SDF",
"ShipViaCode": "FEDX2D",
"AsnDate": "2018-01-27",
"AsnTime": "09:30:47-08:00",
"ShipmentDate": "2018-01-23",
"ShipmentTime": "09:30:47-08:00",
"MBOL": 12345678901234568,
"BOL": 12345678901234566,
"ShippingNumber": "1ZTESTTEST",
"LoadID": 321456987,
"ShipmentWeight": 10,
"ShipmentCost": 2.3,
"CartonsTotal": 2,
"CartonPackagingCode": "CTN25",
"OrdersTotal": 2
},
"References": [
{
"Reference": {
"ReferenceQualifier": "TST",
"ReferenceText": "Testing text"
}
}
],
"Addresses": {
"Address": [
{
"AddressLocationQualifier": "ST",
"LocationNumber": 23234234,
"Name": "John Smith",
"Address1": "123 Main St",
"Address2": "Suite 12",
"City": "Hometown",
"State": "WA",
"Zip": 92345,
"Country": "USA"
},
{
"AddressLocationQualifier": "BT",
"LocationNumber": 2342342,
"Name": "Jane Smith",
"Address1": "345 Second Ave",
"Address2": "Building 32",
"City": "Sometown",
"State": "CA",
"Zip": "23665-0987",
"Country": "USA"
}
]
},
"Orders": {
"Order": [
{
"OrderHeader": {
"PurchaseOrderNumber": 23456342,
"RetailerPurchaseOrderNumber": 234234234,
"RetailerOrderNumber": 23423423,
"CustomerOrderNumber": 234234234,
"Department": 3333,
"Division": 23423,
"OrderWeight": 10.23,
"CartonsTotal": 2,
"QTYOrdered": 12,
"QTYShipped": 23
},
"Cartons": {
"Carton": [
{
"SSCC18": 12345678901234567000,
"TrackingNumber": "1ZTESTTESTTEST",
"CartonContentsQty": 10,
"CartonWeight": 10.23,
"LineItems": {
"LineItem": [
{
"LineNumber": 1,
"ItemNumber": 1234567890,
"UPC": 9876543212,
"QTYOrdered": 34,
"QTYShipped": 32,
"QTYUOM": "EA",
"Description": "Shoes",
"Style": "Tall",
"Size": 9.5,
"Color": "Bllack",
"RetailerItemNumber": 2342333,
"OuterPack": 10
},
{
"LineNumber": 2,
"ItemNumber": 987654321,
"UPC": 7654324567,
"QTYOrdered": 12,
"QTYShipped": 23,
"QTYUOM": "EA",
"Description": "Sunglasses",
"Style": "Short",
"Size": 10,
"Color": "White",
"RetailerItemNumber": 565465456,
"OuterPack": 12
}
]
}
}
]
}
}
]
}
}
]
}
}
In the above JSON object, I want all the keys (nested included) in a list (duplicates can be removed by using a set data structure). If a nested key occurs multiple times in the actual JSON, it can appear as a key multiple times in the CSV!
I personally feel that recursion is a perfect fit for this type of problem when the amount of nesting you will encounter is unpredictable. Here is an example in Python of how you can utilise recursion to extract all keys. Cheers.
import json

row = ""

def extract_keys(data):
    # append every key found in a nested dict/list structure to `row`
    global row
    if isinstance(data, dict):
        for key, value in data.items():
            row += key + "\n"
            extract_keys(value)
    elif isinstance(data, list):
        for element in data:
            extract_keys(element)

# MAIN
with open("input.json", "r") as rfile:
    dicts = json.load(rfile)

extract_keys(dicts)

with open("output.csv", "w") as wfile:
    wfile.write(row)
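If you would rather avoid the module-level global, the same traversal can accumulate into a set and return it, which also de-duplicates the keys as the question suggests (`collect_keys` is my name, not from the original code):

```python
import json

def collect_keys(data, keys=None):
    """Recursively gather every dict key in a nested JSON structure."""
    if keys is None:
        keys = set()
    if isinstance(data, dict):
        for key, value in data.items():
            keys.add(key)
            collect_keys(value, keys)
    elif isinstance(data, list):
        for element in data:
            collect_keys(element, keys)
    return keys

doc = json.loads('{"Document": {"Shipment": [{"ShipmentHeader": {"ShipmentID": 1}}]}}')
print(sorted(collect_keys(doc)))
# ['Document', 'Shipment', 'ShipmentHeader', 'ShipmentID']
```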
Is there a way to import this kind of JSON response into pandas? I've been trying to get it usable with json_normalize, but I can't seem to get more than one level to work at a time (I can get notes, but can't bring in custom_fields). I also cannot figure out how to call out something like ['reporter']['name'] (which should be jdoe). This is from Mantis, and it's the JSON output of a requests response. I'm now wondering if it needs to be broken up into multiple frames and put back together, or should I use a for loop and put the data I want into a better format for pandas to import?
In my head each item should be a column in the series all tied to the id column like this.
id | summary | project.name | reporter.name ..|.. custom.fields.Project_Stage | ... notes1.reporter.name | notes1.text ... notes2.reporter.name | notes2.text
{
"issues": [
{
"id": 1234,
"summary": "Some text",
"project": {
"id": 1,
"name": "North America"
},
"category": {
"id": 11,
"name": "Retail"
},
"reporter": {
"id": 1099,
"name": "jdoe"
},
"custom_fields": [
{
"field": {
"id": 107,
"name": "Product Escalations"
},
"value": ""
},
{
"field": {
"id": 1,
"name": "Project_Stage"
},
"value": "Pending"
}
],
"notes": [
{
"id": 214288,
"reporter": {
"id": 9999,
"name": "jdoe"
},
"text": "Worked with Mark over e-mail",
"view_state": {
"id": 10,
"name": "public",
"label": "public"
},
"type": "note",
"created_at": "2020-12-04T15:55:02-08:00",
"updated_at": "2020-12-04T15:55:02-08:00"
},
{
"id": 214289,
"reporter": {
"id": 9999,
"name": "jdoe"
},
"text": "I attempted on numerous occasions to setup a meeting with him to set it up for him.",
"view_state": {
"id": 10,
"name": "public",
"label": "public"
},
"type": "note",
"created_at": "2020-12-04T15:57:02-08:00",
"updated_at": "2020-12-04T15:57:02-08:00"
}
]
}
]
}
Here is what the DF would look like in my head. All the data for one ticket on one line/series.
Structures with many lists at the same level are tricky. Try flatten_json: https://github.com/amirziai/flatten
If your response is called 'dic', you can use this:
import pandas as pd
from flatten_json import flatten

dic_flattened = (flatten(d, '.') for d in dic['issues'])
df = pd.DataFrame(dic_flattened)
Output
id summary project.id project.name category.id category.name ... notes.1.view_state.id notes.1.view_state.name notes.1.view_state.label notes.1.type notes.1.created_at notes.1.updated_at
0 1234 Some text 1 North America 11 Retail ... 10 public public note 2020-12-04T15:57:02-08:00 2020-12-04T15:57:02-08:00
In [101]: df.columns
Out[101]:
Index(['id', 'summary', 'project.id', 'project.name', 'category.id',
'category.name', 'reporter.id', 'reporter.name',
'custom_fields.0.field.id', 'custom_fields.0.field.name',
'custom_fields.0.value', 'custom_fields.1.field.id',
'custom_fields.1.field.name', 'custom_fields.1.value', 'notes.0.id',
'notes.0.reporter.id', 'notes.0.reporter.name', 'notes.0.text',
'notes.0.view_state.id', 'notes.0.view_state.name',
'notes.0.view_state.label', 'notes.0.type', 'notes.0.created_at',
'notes.0.updated_at', 'notes.1.id', 'notes.1.reporter.id',
'notes.1.reporter.name', 'notes.1.text', 'notes.1.view_state.id',
'notes.1.view_state.name', 'notes.1.view_state.label', 'notes.1.type',
'notes.1.created_at', 'notes.1.updated_at'],
dtype='object')
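On the ['reporter']['name'] part of the question: nested dicts alone (as opposed to the parallel notes/custom_fields lists, which are what json_normalize cannot spread across columns in one call) flatten fine with pandas' own json_normalize, after which the value is just a dotted column. The single-issue list here is a stub, not the full Mantis payload:

```python
import pandas as pd

issues = [{"id": 1234, "reporter": {"id": 1099, "name": "jdoe"}}]
df = pd.json_normalize(issues, sep='.')
print(df['reporter.name'].iloc[0])   # jdoe
```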
I have a CSV file with 4 columns of data, as below.
type,MetalType,Date,Acknowledge
Metal,abc123451,2018-05-26,Success
Metal,abc123452,2018-05-27,Success
Metal,abc123454,2018-05-28,Failure
Iron,abc123455,2018-05-29,Success
Iron,abc123456,2018-05-30,Failure
(I just provided a header in the above example data, but in my case I don't have a header in the data.)
How can I convert the above CSV file to JSON in the below format?
1st column: belongs to --> "type": "Metal"
2nd column: MetalType: "values": "value": "abc123451"
3rd column: "Date": "values": "value": "2018-05-26"
4th column: "Acknowledge": "values": "value": "Success"
All remaining columns are default values.
As per the format below:
{
"entities": [
{
"id": "XXXXXXX",
"type": "Metal",
"data": {
"attributes": {
"MetalType": {
"values": [
{
"source": "XYZ",
"locale": "Australia",
"value": "abc123451"
}
]
},
"Date": {
"values": [
{
"source": "XYZ",
"locale": "Australia",
"value": "2018-05-26"
}
]
},
"Acknowledge": {
"values": [
{
"source": "XYZ",
"locale": "Australia",
"value": "Success"
}
]
}
}
}
}
]
}
Even though jww is right, I built something for you:
First, I import the CSV using pandas (since your file has no header row, supply the column names explicitly):
import pandas as pd

df = pd.read_csv('data.csv', header=None,
                 names=['type', 'MetalType', 'Date', 'Acknowledge'])
Then I create a template for the dictionaries you want to add:
d_json = {"entities": []}
template = {
"id": "XXXXXXX",
"type": "",
"data": {
"attributes": {
"MetalType": {
"values": [
{
"source": "XYZ",
"locale": "Australia",
"value": ""
}
]
},
"Date": {
"values": [
{
"source": "XYZ",
"locale": "Australia",
"value": ""
}
]
},
"Acknowledge": {
"values": [
{
"source": "XYZ",
"locale": "Australia",
"value": ""
}
]
}
}
}
}
Now you just need to fill in the dictionary. Note the deepcopy: a plain `d = template` would rebind the same dict on every pass, so every appended entry would end up identical to the last row:
import copy

for i in range(len(df)):
    d = copy.deepcopy(template)
    d['type'] = df['type'][i]
    d['data']['attributes']['MetalType']['values'][0]['value'] = df['MetalType'][i]
    d['data']['attributes']['Date']['values'][0]['value'] = df['Date'][i]
    d['data']['attributes']['Acknowledge']['values'][0]['value'] = df['Acknowledge'][i]
    d_json['entities'].append(d)
I know my way of iterating over the df is kind of ugly, maybe someone knows a cleaner way.
Cheers!
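The answer stops short of writing the result out; the last step is presumably a plain json.dump. The stub `d_json` below stands in for the dictionary filled in by the loop above:

```python
import json

# stand-in for the d_json built in the loop above
d_json = {"entities": [{"id": "XXXXXXX", "type": "Metal"}]}

with open('output.json', 'w') as f:
    json.dump(d_json, f, indent=4)
```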
{
"matching_results": 1264,
"results": [
{
"main_image_url": "https://s4.reutersmedia.net/resources_v2/images/rcom-default.png",
"enriched_text": {
"entities": [
{
"relevance": 0.33,
"disambiguation": {
"subtype": [
"Country"
]
},
"sentiment": {
"score": 0
},
"type": "Location",
"count": 1,
"text": "China"
},
{
"relevance": 0.33,
"disambiguation": {
"subtype": [
"Country"
]
},
"sentiment": {
"score": 0
},
This is a very large file, so I want to find "relevance" and "score" using Python.
How can I fetch this info?
Regardless of how large it is, it is only a simple dictionary.
Iterate the lists and extract the key-values:
for result in data['results']:
    for e in result['enriched_text']['entities']:
        print(e['relevance'])
        print(e['sentiment']['score'])
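If you need the values rather than printed output, the same two loops collapse into a comprehension. `data` here is a small stub shaped like the response above, since the original JSON is truncated:

```python
# stub shaped like the Watson-style response in the question
data = {"results": [{"enriched_text": {"entities": [
    {"relevance": 0.33, "sentiment": {"score": 0}},
]}}]}

# one (relevance, score) pair per entity across all results
pairs = [(e['relevance'], e['sentiment']['score'])
         for result in data['results']
         for e in result['enriched_text']['entities']]
print(pairs)   # [(0.33, 0)]
```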