The file is from a Slack server export, so the structure varies from message to message (depending on whether people responded to a thread with text or reactions).
I have tried several SO questions with similar problems, but I guarantee my question is different: this one, this one too, and this one as well.
Sample JSON file:
{
"client_msg_id": "f347abdc-9e2a-4cad-a37d-8daaecc5ad51",
"type": "message",
"text": "I came here just to check <#U3QSFG5A4> This is a sample :slightly_smiling_face:",
"user": "U51N464MN",
"ts": "1550511445.321100",
"team": "T1559JB9V",
"user_team": "T1559JB9V",
"source_team": "T1559JB9V",
"user_profile": {
"avatar_hash": "gcc8ae3d55bb",
"image_72": "https:\/\/secure.gravatar.com\/avatar\/fcc8ae3d55bb91cb750438657694f8a0.jpg?s=72&d=https%3A%2F%2Fa.slack-edge.com%2Fdf10d%2Fimg%2Favatars%2Fava_0026-72.png",
"first_name": "A",
"real_name": "a name",
"display_name": "user",
"team": "T1559JB9V",
"name": "name",
"is_restricted": false,
"is_ultra_restricted": false
},
"thread_ts": "1550511445.321100",
"reply_count": 3,
"reply_users_count": 3,
"latest_reply": "1550515952.338000",
"reply_users": [
"U51N464MN",
"U8DUH4U2V",
"U3QSFG5A4"
],
"replies": [
{
"user": "U51N464MN",
"ts": "1550511485.321200"
},
{
"user": "U8DUH4U2V",
"ts": "1550515191.337300"
},
{
"user": "U3QSFG5A4",
"ts": "1550515952.338000"
}
],
"subscribed": false,
"reactions": [
{
"name": "trolldance",
"users": [
"U51N464MN",
"U4B30MHQE",
"U68E6A0JF"
],
"count": 3
},
{
"name": "trollface",
"users": [
"U8DUH4U2V"
],
"count": 1
}
]
}
The issue is that several keys vary, so the structure changes between messages within the same JSON file depending on how other users interact with a given message.
import json
import pandas as pd

with open("file.json") as file:
    d = json.load(file)

# pd.io.json.json_normalize is deprecated in recent pandas; use pd.json_normalize
df = pd.io.json.json_normalize(d)
df.columns = df.columns.map(lambda x: x.split(".")[-1])
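One thing worth knowing here (a sketch with invented sample messages, not your actual export): json_normalize simply fills keys that are missing from a given message with NaN, so the per-message structural differences only change which cells end up empty.

```python
import pandas as pd

# Two messages with different shapes, as in a Slack export:
# only the second one carries thread metadata.
messages = [
    {"ts": "1550511445.321100", "text": "plain message"},
    {"ts": "1550515952.338000", "text": "threaded message", "reply_count": 3},
]

# Keys missing from a given message become NaN cells, so the varying
# structure does not break the top-level flattening.
df = pd.json_normalize(messages, sep=".")
```

List-valued keys such as replies and reactions stay as Python lists inside single cells; those usually need a second pass (e.g. DataFrame.explode) if you want them as rows.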
I am new to Python and now want to convert a CSV file into a JSON file. The JSON is nested with a dynamic structure; the structure is defined by the CSV header.
From this CSV input:
ID, Name, person_id/id_type, person_id/id_value,person_id_expiry_date,additional_info/0/name,additional_info/0/value,additional_info/1/name,additional_info/1/value,salary_info/details/0/grade,salary_info/details/0/payment,salary_info/details/0/amount,salary_info/details/1/next_promotion
1,Peter,PASSPORT,A452817,1-01-2055,Age,19,Gender,M,Manager,Monthly,8956.23,unknown
2,Jane,PASSPORT,B859804,2-01-2035,Age,38,Gender,F,Worker, Monthly,125980.1,unknown
To this JSON output:
[
{
"ID": 1,
"Name": "Peter",
"person_id": {
"id_type": "PASSPORT",
"id_value": "A452817"
},
"person_id_expiry_date": "1-01-2055",
"additional_info": [
{
"name": "Age",
"value": 19
},
{
"name": "Gender",
"value": "M"
}
],
"salary_info": {
"details": [
{
"grade": "Manager",
"payment": "Monthly",
"amount": 8956.23
},
{
"next_promotion": "unknown"
}
]
}
},
{
"ID": 2,
"Name": "Jane",
"person_id": {
"id_type": "PASSPORT",
"id_value": "B859804"
},
"person_id_expiry_date": "2-01-2035",
"additional_info": [
{
"name": "Age",
"value": 38
},
{
"name": "Gender",
"value": "F"
}
],
"salary_info": {
"details": [
{
"grade": "Worker",
"payment": " Monthly",
"amount": 125980.1
},
{
"next_promotion": "unknown"
}
]
}
}
]
Is this something that can be done with the existing pandas API, or do I have to write a lot of complex code to dynamically construct the JSON object? Thanks.
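pandas has no built-in inverse of json_normalize for this header convention, but the path logic itself is small. Below is a pure-Python sketch (the csv_text sample and the insert_path helper are my own, not from the question) that treats / as a nesting separator and numeric segments as list indices; values are kept as strings, so numeric conversion would be an extra step.

```python
import csv
import io
import json

def insert_path(container, parts, value):
    """Walk/create the nested structure described by parts and set value."""
    for i, part in enumerate(parts[:-1]):
        # A numeric next segment means the current key holds a list.
        default = [] if parts[i + 1].isdigit() else {}
        if isinstance(container, list):
            idx = int(part)
            while len(container) <= idx:
                container.append(None)
            if container[idx] is None:
                container[idx] = default
            container = container[idx]
        else:
            container = container.setdefault(part, default)
    last = parts[-1]
    if isinstance(container, list):
        idx = int(last)
        while len(container) <= idx:
            container.append(None)
        container[idx] = value
    else:
        container[last] = value

csv_text = """ID,Name,person_id/id_type,person_id/id_value,additional_info/0/name,additional_info/0/value
1,Peter,PASSPORT,A452817,Age,19
"""

records = []
for row in csv.DictReader(io.StringIO(csv_text)):
    rec = {}
    for header, value in row.items():
        insert_path(rec, header.strip().split("/"), value)
    records.append(rec)

print(json.dumps(records, indent=2))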
I'm extracting certain keys from several JSON files and then converting the result to a CSV in Python. I'm able to define a key list when I run my code and get the information I need.
However, there are certain sub-keys that I want to ignore from the JSON file. For example, if we look at the following snippet:
JSON Sample
[
{
"callId": "abc123",
"errorCode": 0,
"apiVersion": 2,
"statusCode": 200,
"statusReason": "OK",
"time": "2020-12-14T12:00:32.744Z",
"registeredTimestamp": 1417731582000,
"UID": "_guid_abc123==",
"created": "2014-12-04T22:19:42.894Z",
"createdTimestamp": 1417731582000,
"data": {},
"preferences": {},
"emails": {
"verified": [],
"unverified": []
},
"identities": [
{
"provider": "facebook",
"providerUID": "123",
"allowsLogin": true,
"isLoginIdentity": true,
"isExpiredSession": true,
"lastUpdated": "2014-12-04T22:26:37.002Z",
"lastUpdatedTimestamp": 1417731997002,
"oldestDataUpdated": "2014-12-04T22:26:37.002Z",
"oldestDataUpdatedTimestamp": 1417731997002,
"firstName": "John",
"lastName": "Doe",
"nickname": "John Doe",
"profileURL": "https://www.facebook.com/John.Doe",
"age": 50,
"birthDay": 31,
"birthMonth": 12,
"birthYear": 1969,
"city": "City, State",
"education": [
{
"school": "High School Name",
"schoolType": "High School",
"degree": null,
"startYear": 0,
"fieldOfStudy": null,
"endYear": 0
}
],
"educationLevel": "High School",
"favorites": {
"music": [
{
"name": "Music 1",
"id": "123",
"category": "Musician/band"
},
{
"name": "Music 2",
"id": "123",
"category": "Musician/band"
}
],
"movies": [
{
"name": "Movie 1",
"id": "123",
"category": "Movie"
},
{
"name": "Movie 2",
"id": "123",
"category": "Movie"
}
],
"television": [
{
"name": "TV 1",
"id": "123",
"category": "Tv show"
}
]
},
"followersCount": 0,
"gender": "m",
"hometown": "City, State",
"languages": "English",
"likes": [
{
"name": "Like 1",
"id": "123",
"time": "2014-10-31T23:52:53.0000000Z",
"category": "TV",
"timestamp": "1414799573"
},
{
"name": "Like 2",
"id": "123",
"time": "2014-09-16T08:11:35.0000000Z",
"category": "Music",
"timestamp": "1410855095"
}
],
"locale": "en_US",
"name": "John Doe",
"photoURL": "https://graph.facebook.com/123/picture?type=large",
"timezone": "-8",
"thumbnailURL": "https://graph.facebook.com/123/picture?type=square",
"username": "john.doe",
"verified": "true",
"work": [
{
"companyID": null,
"isCurrent": null,
"endDate": null,
"company": "Company Name",
"industry": null,
"title": "Company Title",
"companySize": null,
"startDate": "2010-12-31T00:00:00"
}
]
}
],
"isActive": true,
"isLockedOut": false,
"isRegistered": true,
"isVerified": false,
"lastLogin": "2014-12-04T22:26:33.002Z",
"lastLoginTimestamp": 1417731993000,
"lastUpdated": "2014-12-04T22:19:42.769Z",
"lastUpdatedTimestamp": 1417731582769,
"loginProvider": "facebook",
"loginIDs": {
"emails": [],
"unverifiedEmails": []
},
"rbaPolicy": {
"riskPolicyLocked": false
},
"oldestDataUpdated": "2014-12-04T22:19:42.894Z",
"oldestDataUpdatedTimestamp": 1417731582894,
"registered": "2014-12-04T22:19:42.956Z",
"regSource": "",
"socialProviders": "facebook"
}
]
I want to extract data from created and identities but ignore identities.favorites and identities.likes, as well as the data underneath them.
This is what I have so far, below. I defined the JSON keys that I want to extract in the key_list variable:
Current Code
import json, pandas
from flatten_json import flatten
# Enter the path to the JSON and the filename without appending '.json'
file_path = r'C:\Path\To\file_name'
# Open and load the JSON file
with open(file_path + '.json', 'r', encoding='utf-8', errors='ignore') as f:
    json_list = json.load(f)
# Extract data from the defined key names
key_list = ['created', 'identities']
json_list = [{k:d[k] for k in key_list} for d in json_list]
# Flatten and convert to a data frame
json_list_flattened = (flatten(d, '.') for d in json_list)
df = pandas.DataFrame(json_list_flattened)
# Export to CSV in the same directory with the original file name
df.to_csv(file_path + '.csv', sep=',', encoding='utf-8', index=None, header=True)
Similar to the key_list, I suspect I would make an ignore list and factor it into the json_list comprehension I have? Something like:
key_ignore = ['identities.favorites', 'identities.likes']
Then utilize dict.pop(), which looks like it will remove the unwanted sub-keys if they match? I'm just not sure how to implement that correctly.
Expected Output
As a result, the code should extract data from the keys defined in key_list and ignore the sub-keys defined in key_ignore (identities.favorites and identities.likes). The rest of the code then converts it into a CSV:
created,identities.0.provider,identities.0.providerUID,identities...
2014-12-04T19:23:05.191Z,site,cb8168b0cf734b70ad541f0132763761,...
If the keys are always there, you can use
del d[0]['identities'][0]['likes']
del d[0]['identities'][0]['favorites']
or if you want to remove the columns from the dataframe after reading all the json data in you can use
df.drop(df.filter(regex='identities.0.favorites|identities.0.likes').columns, axis=1, inplace=True)
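If the keys are not always there, or the identities list can have more than one element, a small recursive helper generalizes the dict.pop() idea from the question: it removes a dotted path everywhere it occurs, descending through intermediate lists. This is a sketch (drop_paths and the trimmed data sample are my own names, not part of the original code):

```python
def drop_paths(node, paths):
    """Remove dotted key paths such as 'identities.favorites' from nested
    JSON, descending into every element of intermediate lists."""
    for path in paths:
        _drop(node, path.split("."))

def _drop(node, parts):
    if isinstance(node, list):
        for item in node:
            _drop(item, parts)
    elif isinstance(node, dict):
        if len(parts) == 1:
            node.pop(parts[0], None)  # pop with a default: no KeyError if absent
        elif parts[0] in node:
            _drop(node[parts[0]], parts[1:])

# Minimal example shaped like the snippet above
data = [{"created": "2014-12-04", "identities": [
    {"provider": "facebook", "favorites": {"music": []}, "likes": [{"name": "Like 1"}]}
]}]
drop_paths(data, ["identities.favorites", "identities.likes"])
```

Running this before the flatten step means the unwanted columns never appear in the data frame at all.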
Is there a way to import this kind of JSON response into pandas? I've been trying to get it usable with json_normalize, but I can't seem to get more than one level to work at a time (I can get notes but can't bring in custom_fields). I also cannot figure out how to call out something like ['reporter']['name'] (which should be jdoe). This is from Mantis and it's the JSON output of a requests response. I'm now wondering if it needs to be broken up into multiple frames and put back together, or should I use a for loop and put the data I want into a better format for pandas to import?
In my head, each item should be a column in the series, all tied to the id column, like this:
id | summary | project.name | reporter.name ..|.. custom.fields.Project_Stage | ... notes1.reporter.name | notes1.text ... notes2.reporter.name | notes2.text
{
"issues": [
{
"id": 1234,
"summary": "Some text",
"project": {
"id": 1,
"name": "North America"
},
"category": {
"id": 11,
"name": "Retail"
},
"reporter": {
"id": 1099,
"name": "jdoe"
},
"custom_fields": [
{
"field": {
"id": 107,
"name": "Product Escalations"
},
"value": ""
},
{
"field": {
"id": 1,
"name": "Project_Stage"
},
"value": "Pending"
}
],
"notes": [
{
"id": 214288,
"reporter": {
"id": 9999,
"name": "jdoe"
},
"text": "Worked with Mark over e-mail",
"view_state": {
"id": 10,
"name": "public",
"label": "public"
},
"type": "note",
"created_at": "2020-12-04T15:55:02-08:00",
"updated_at": "2020-12-04T15:55:02-08:00"
},
{
"id": 214289,
"reporter": {
"id": 9999,
"name": "jdoe"
},
"text": "I attempted on numerous occasions to setup a meeting with him to set it up for him.",
"view_state": {
"id": 10,
"name": "public",
"label": "public"
},
"type": "note",
"created_at": "2020-12-04T15:57:02-08:00",
"updated_at": "2020-12-04T15:57:02-08:00"
}
]
}
]
}
Here is what the DF would look like in my head. All the data for one ticket on one line/series.
Those structures with many lists at the same level are tricky. Try flatten_json. https://github.com/amirziai/flatten
If your response is called dic, you can use this:
from flatten_json import flatten
dic_flattened = (flatten(d, '.') for d in dic['issues'])
df = pd.DataFrame(dic_flattened)
Output
id summary project.id project.name category.id category.name ... notes.1.view_state.id notes.1.view_state.name notes.1.view_state.label notes.1.type notes.1.created_at notes.1.updated_at
0 1234 Some text 1 North America 11 Retail ... 10 public public note 2020-12-04T15:57:02-08:00 2020-12-04T15:57:02-08:00
In [101]: df.columns
Out[101]:
Index(['id', 'summary', 'project.id', 'project.name', 'category.id',
'category.name', 'reporter.id', 'reporter.name',
'custom_fields.0.field.id', 'custom_fields.0.field.name',
'custom_fields.0.value', 'custom_fields.1.field.id',
'custom_fields.1.field.name', 'custom_fields.1.value', 'notes.0.id',
'notes.0.reporter.id', 'notes.0.reporter.name', 'notes.0.text',
'notes.0.view_state.id', 'notes.0.view_state.name',
'notes.0.view_state.label', 'notes.0.type', 'notes.0.created_at',
'notes.0.updated_at', 'notes.1.id', 'notes.1.reporter.id',
'notes.1.reporter.name', 'notes.1.text', 'notes.1.view_state.id',
'notes.1.view_state.name', 'notes.1.view_state.label', 'notes.1.type',
'notes.1.created_at', 'notes.1.updated_at'],
dtype='object')
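If you would rather stay with json_normalize, one common pattern is to build a long table with one row per note and carry issue-level fields along via the meta parameter. A sketch against a trimmed-down sample (assuming, as above, that the parsed response is called dic):

```python
import pandas as pd

dic = {"issues": [{
    "id": 1234,
    "reporter": {"id": 1099, "name": "jdoe"},
    "notes": [
        {"id": 214288, "text": "Worked with Mark over e-mail",
         "reporter": {"id": 9999, "name": "jdoe"}},
    ],
}]}

# One row per note; issue-level columns are repeated on every row.
notes = pd.json_normalize(
    dic["issues"],
    record_path="notes",
    meta=["id", ["reporter", "name"]],
    record_prefix="note.",
    meta_prefix="issue.",
    sep=".",
)
```

This yields one row per note rather than one row per ticket; custom_fields could be handled the same way in a second frame and merged back on issue.id.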
I am new at Python; I've worked with other languages. I made this work in Java, but now I must do it in Python. I have a JSON with 3 levels; the first two are usages and resources, and I want to count the names on the third level. I've seen several examples but I can't get it done.
import json
data = {
"startDate": "2019-06-23T16:07:21.205Z",
"endDate": "2019-07-24T16:07:21.205Z",
"status": "Complete",
"usages": [
{
"name": "PureCloud Edge Virtual Usage",
"resources": [
{
"name": "Edge01-VM-GNS-DemoSite01 (1f279086-a6be-4a21-ab7a-2bb1ae703fa0)",
"date": "2019-07-24T09:00:28.034Z"
},
{
"name": "329ad5ae-e3a3-4371-9684-13dcb6542e11",
"date": "2019-07-24T09:00:28.034Z"
},
{
"name": "e5796741-bd63-4b8e-9837-4afb95bb0c09",
"date": "2019-07-24T09:00:28.034Z"
}
]
},
{
"name": "PureCloud for SmartVideo Add-On Concurrent",
"resources": [
{
"name": "jpizarro#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "jaguilera#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "dcortes#gns.com.co",
"date": "2019-07-15T15:06:09.203Z"
}
]
},
{
"name": "PureCloud 3 Concurrent User Usage",
"resources": [
{
"name": "jpizarro#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "jaguilera#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "dcortes#gns.com.co",
"date": "2019-07-15T15:06:09.203Z"
}
]
},
{
"name": "PureCloud Skype for Business WebSDK",
"resources": [
{
"name": "jpizarro#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "jaguilera#gns.com.co",
"date": "2019-06-25T04:54:17.662Z"
},
{
"name": "dcortes#gns.com.co",
"date": "2019-07-15T15:06:09.203Z"
}
]
}
],
"selfUri": "/api/v2/billing/reports/billableusage"
}
cantidadDeLicencias = 0
cantidadDeUsages = len(data['usages'])
for x in range(cantidadDeUsages):
    temporal = data[x]
    cantidadDeResources = len(temporal['resource'])
    for z in range(cantidadDeResources):
        print(x)
What changes do I have to make? Maybe I need another approach? Thanks in advance.
Update
Code that works
cantidadDeLicencias = 0
for usage in data['usages']:
    cantidadDeLicencias = cantidadDeLicencias + len(usage['resources'])
print(cantidadDeLicencias)
You can do this:
for usage in data['usages']:
    print(len(usage['resources']))
If you want to know the number of names at each resources level, counting duplicated names (e.g. "jaguilera#gns.com.co" appears more than once in your data), then just iterate over the first level (usages) and sum the length of each array:
cantidadDeLicencias = 0
for usage in data['usages']:
    cantidadDeLicencias += len(usage['resources'])
print(cantidadDeLicencias)
If you don't want to count duplicates, then use a set and iterate over each resources array
cantidadDeLicencias_set = set()  # note: {} creates a dict, so use set() here
for usage in data['usages']:
    for resource in usage['resources']:
        cantidadDeLicencias_set.add(resource['name'])
print(len(cantidadDeLicencias_set))
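For what it's worth, both counts can also be collapsed into one-liners with a generator expression and a set comprehension (shown here against a trimmed sample shaped like the data above):

```python
data = {"usages": [
    {"name": "A", "resources": [{"name": "x"}, {"name": "y"}]},
    {"name": "B", "resources": [{"name": "x"}]},
]}

total = sum(len(u["resources"]) for u in data["usages"])       # with duplicates
unique = len({r["name"] for u in data["usages"] for r in u["resources"]})
print(total, unique)  # 3 2
```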
I have a rather weird requirement now: I have the JSON below, and somehow I have to convert it into a flat CSV.
[
{
"authorizationQualifier": "SDA",
"authorizationInformation": " ",
"securityQualifier": "ASD",
"securityInformation": " ",
"senderQualifier": "ASDAD",
"senderId": "FADA ",
"receiverQualifier": "ADSAS",
"receiverId": "ADAD ",
"date": "140101",
"time": "0730",
"standardsId": null,
"version": "00501",
"interchangeControlNumber": "123456789",
"acknowledgmentRequested": "0",
"testIndicator": "T",
"functionalGroups": [
{
"functionalIdentifierCode": "ADSAD",
"applicationSenderCode": "ASDAD",
"applicationReceiverCode": "ADSADS",
"date": "20140101",
"time": "07294900",
"groupControlNumber": "123456789",
"responsibleAgencyCode": "X",
"version": "005010X221A1",
"transactions": [
{
"name": "ASDADAD",
"transactionSetIdentifierCode": "adADS",
"transactionSetControlNumber": "123456789",
"implementationConventionReference": null,
"segments": [
{
"BPR03": "ad",
"BPR14": "QWQWDQ",
"BPR02": "1.57",
"BPR13": "23223",
"BPR01": "sad",
"BPR12": "56",
"BPR10": "32424",
"BPR09": "12313",
"BPR08": "DA",
"BPR07": "123456789",
"BPR06": "12313",
"BPR05": "ASDADSAD",
"BPR16": "21313",
"BPR04": "SDADSAS",
"BPR15": "11212",
"id": "aDSASD"
},
{
"TRN02": "2424",
"TRN03": "35435345",
"TRN01": "3435345",
"id": "FSDF"
},
{
"REF02": "fdsffs",
"REF01": "sfsfs",
"id": "fsfdsfd"
},
{
"DTM02": "2432424",
"id": "sfsfd",
"DTM01": "234243"
}
],
"loops": [
{
"id": "24324234234",
"segments": [
{
"N101": "sfsfsdf",
"N102": "sfsf",
"id": "dgfdgf"
},
{
"N301": "sfdssfdsfsf",
"N302": "effdssf",
"id": "fdssf"
},
{
"N401": "sdffssf",
"id": "sfds",
"N402": "sfdsf",
"N403": "23424"
},
{
"PER06": "Wsfsfdsfsf",
"PER05": "sfsf",
"PER04": "23424",
"PER03": "fdfbvcb",
"PER02": "Pedsdsf",
"PER01": "sfsfsf",
"id": "fdsdf"
}
]
},
{
"id": "2342",
"segments": [
{
"N101": "sdfsfds",
"N102": "vcbvcb",
"N103": "dsfsdfs",
"N104": "343443",
"id": "fdgfdg"
},
{
"N401": "dfsgdfg",
"id": "dfgdgdf",
"N402": "dgdgdg",
"N403": "234244"
},
{
"REF02": "23423342",
"REF01": "fsdfs",
"id": "sfdsfds"
}
]
}
]
}
]
}
]
}
]
The column header corresponding to a deeper key-value pair may take a nested form, like functionalGroups[0].transactions[0].segments[0].BPR15.
I am able to do this in Java using this GitHub project (where you can find the output format I desire in the explanation) in one line:
flatJson = JSONFlattener.parseJson(new File("files/simple.json"), "UTF-8");
The output was:
date,securityQualifier,testIndicator,functionalGroups[1].functionalIdentifierCode,functionalGroups[1].date,functionalGroups[1].applicationReceiverCode, ...
140101,00,T,HP,20140101,ETIN,...
But I want to do this in python. I tried as suggested in this answer:
with open('data.json') as data_file:
    data = json.load(data_file)
df = json_normalize(data, record_prefix=True)
with open('temp2.csv', "w", newline='\n') as csv_file:
    csv_file.write(df.to_csv())
However, for the functionalGroups column it dumps the nested JSON as a cell value.
I also tried as suggested in this answer:
with open('data.json') as f:  # this ensures opening and closing the file
    a = json.loads(f.read())
df = pandas.DataFrame(a)
print(df.transpose())
But this also seems to do the same:
0
acknowledgmentRequested 0
authorizationInformation
authorizationQualifier SDA
date 140101
functionalGroups [{'functionalIdentifierCode': 'ADSAD', 'applic...
interchangeControlNumber 123456789
receiverId ADAD
receiverQualifier ADSAS
securityInformation
securityQualifier ASD
senderId FADA
senderQualifier ASDAD
standardsId None
testIndicator T
time 0730
version 00501
Is it possible to do what I desire in Python?
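Yes. If you don't want a third-party dependency, the bracket-style keys from JSONFlattener are straightforward to reproduce with a small recursive function; csv.DictWriter then writes one row per top-level object, using the union of all flattened keys as the header. This is a sketch (flatten and the trimmed data sample are mine), using 0-based indices as in your functionalGroups[0]... example:

```python
import csv
import sys

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into one flat dict with keys like
    functionalGroups[0].transactions[0].segments[0].BPR15."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, f"{prefix}.{key}" if prefix else key))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, f"{prefix}[{i}]"))
    else:
        out[prefix] = obj
    return out

data = [{"date": "140101",
         "functionalGroups": [{"transactions": [{"segments": [{"BPR15": "11212"}]}]}]}]

rows = [flatten(record) for record in data]
header = sorted({key for row in rows for key in row})
writer = csv.DictWriter(sys.stdout, fieldnames=header)
writer.writeheader()
writer.writerows(rows)
```

With more than one top-level object, DictWriter leaves cells blank for keys a given row lacks, which matches the ragged output of the Java flattener.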