Substring string column Pandas Python - python

I have a pandas dataframe with two columns : ticket number and history.
History is a string with the following structure. I need to create third column which include author name who change status from New to Open. Is it possible?
[
{
"id": "1,
"author": {
"name": "user1",
"emailAddress": "user1#test.com",
"displayName": "user1"
},
"created": "2021-06-09T12:54:22.915+0000",
"items": [
{
"field": "name",
"from": "1",
"fromString": null,
"to": "2",
"toString": "test"
}
]
},
{
"id": "2",
"author": {
"name": "user2",
"emailAdress": "user2#test.com",
"displayName": "user2"
},
"created": "2021-06-11T09:33:18.692+0000",
"items": [
{
"field": "status",
"from": 3,
"fromString": "New",
"to": "7",
"toString": "Open"
}
]
}]

If your dataframe is named df, the history column (column 2) is named history and the items in the history column actually are json strings with a structure like the one you've provided, you could do the following:
import json
def extract_author(json_string):
records = json.loads(json_string)
for record in records:
items = record['items'][0]
if (items['field'] == 'status'
and items['fromString'] == 'New'
and items['toString'] == 'Open'):
return record['author']['name']
return None
df['author'] = df['history'].map(extract_author)

Related

Very nested JSON with optional fields into pandas dataframe

I have a JSON with the following structure. I want to extract some data to different lists so that I will be able to transform them into a pandas dataframe.
{
"ratings": {
"like": {
"average": null,
"counts": {
"1": {
"total": 0,
"users": []
}
}
}
},
"sharefile_vault_url": null,
"last_event_on": "2021-02-03 00:00:01",
],
"fields": [
{
"type": "text",
"field_id": 130987800,
"label": "Name and Surname",
"values": [
{
"value": "John Smith"
}
],
{
"type": "category",
"field_id": 139057651,
"label": "Gender",
"values": [
{
"value": {
"status": "active",
"text": "Male",
"id": 1,
"color": "DCEBD8"
}
}
],
{
"type": "category",
"field_id": 151333010,
"label": "Field of Studies",
"values": [
{
"value": {
"status": "active",
"text": "Languages",
"id": 3,
"color": "DCEBD8"
}
}
],
}
}
For example, I create a list
names = []
where if "label" in the "fields" list is "Name and Surname" I append ["values"][0]["value"] so names now contains "John Smith". I do exactly the same for the "Gender" label and append the value to the list genders.
The above dictionary is contained in a list of dictionaries so I just have to loop though the list and extract the relevant fields like this:
names = []
genders = []
for r in range(len(users)):
for i in range(len(users[r].json()["items"])):
for field in users[r].json()["items"][i]["fields"]:
if field["label"] == "Name and Surname":
names.append(field["values"][0]["value"])
elif field["label"] == "Gender":
genders.append(field["values"][0]["value"]["text"])
else:
# Something else
where users is a list of responses from the API, each JSON of which has the items is a list of dictionaries where I can find the field key which has as the value a list of dictionaries of different fields (like Name and Surname and Gender).
The problem is that the dictionary with "label: Field of Studies" is optional and is not always present in the list of fields.
How can I manage to check for its presence, and if so append its value to a list, and None otherwise?
To me it seems that the data you have is not valid JSON. However if I were you I would try using pandas.json_normalize. According to the documentation this function will put None if it encounters an object with a label not inside it.

Ignore specific JSON keys when extracting data in Python

I'm extracting certain keys in several JSON files and then converting it to a CSV in Python. I'm able to define a key list when I run my code and get the information I need.
However, there are certain sub-keys that I want to ignore from the JSON file. For example, if we look at the following snippet:
JSON Sample
[
{
"callId": "abc123",
"errorCode": 0,
"apiVersion": 2,
"statusCode": 200,
"statusReason": "OK",
"time": "2020-12-14T12:00:32.744Z",
"registeredTimestamp": 1417731582000,
"UID": "_guid_abc123==",
"created": "2014-12-04T22:19:42.894Z",
"createdTimestamp": 1417731582000,
"data": {},
"preferences": {},
"emails": {
"verified": [],
"unverified": []
},
"identities": [
{
"provider": "facebook",
"providerUID": "123",
"allowsLogin": true,
"isLoginIdentity": true,
"isExpiredSession": true,
"lastUpdated": "2014-12-04T22:26:37.002Z",
"lastUpdatedTimestamp": 1417731997002,
"oldestDataUpdated": "2014-12-04T22:26:37.002Z",
"oldestDataUpdatedTimestamp": 1417731997002,
"firstName": "John",
"lastName": "Doe",
"nickname": "John Doe",
"profileURL": "https://www.facebook.com/John.Doe",
"age": 50,
"birthDay": 31,
"birthMonth": 12,
"birthYear": 1969,
"city": "City, State",
"education": [
{
"school": "High School Name",
"schoolType": "High School",
"degree": null,
"startYear": 0,
"fieldOfStudy": null,
"endYear": 0
}
],
"educationLevel": "High School",
"favorites": {
"music": [
{
"name": "Music 1",
"id": "123",
"category": "Musician/band"
},
{
"name": "Music 2",
"id": "123",
"category": "Musician/band"
}
],
"movies": [
{
"name": "Movie 1",
"id": "123",
"category": "Movie"
},
{
"name": "Movie 2",
"id": "123",
"category": "Movie"
}
],
"television": [
{
"name": "TV 1",
"id": "123",
"category": "Tv show"
}
]
},
"followersCount": 0,
"gender": "m",
"hometown": "City, State",
"languages": "English",
"likes": [
{
"name": "Like 1",
"id": "123",
"time": "2014-10-31T23:52:53.0000000Z",
"category": "TV",
"timestamp": "1414799573"
},
{
"name": "Like 2",
"id": "123",
"time": "2014-09-16T08:11:35.0000000Z",
"category": "Music",
"timestamp": "1410855095"
}
],
"locale": "en_US",
"name": "John Doe",
"photoURL": "https://graph.facebook.com/123/picture?type=large",
"timezone": "-8",
"thumbnailURL": "https://graph.facebook.com/123/picture?type=square",
"username": "john.doe",
"verified": "true",
"work": [
{
"companyID": null,
"isCurrent": null,
"endDate": null,
"company": "Company Name",
"industry": null,
"title": "Company Title",
"companySize": null,
"startDate": "2010-12-31T00:00:00"
}
]
}
],
"isActive": true,
"isLockedOut": false,
"isRegistered": true,
"isVerified": false,
"lastLogin": "2014-12-04T22:26:33.002Z",
"lastLoginTimestamp": 1417731993000,
"lastUpdated": "2014-12-04T22:19:42.769Z",
"lastUpdatedTimestamp": 1417731582769,
"loginProvider": "facebook",
"loginIDs": {
"emails": [],
"unverifiedEmails": []
},
"rbaPolicy": {
"riskPolicyLocked": false
},
"oldestDataUpdated": "2014-12-04T22:19:42.894Z",
"oldestDataUpdatedTimestamp": 1417731582894,
"registered": "2014-12-04T22:19:42.956Z",
"regSource": "",
"socialProviders": "facebook"
}
]
I want to extract data from created and identities but ignore identities.favorites and identities.likes as well as their data underneath it.
This is what I have so far, below. I defined the JSON keys that I want to extract in the key_list variable:
Current Code
import json, pandas
from flatten_json import flatten
# Enter the path to the JSON and the filename without appending '.json'
file_path = r'C:\Path\To\file_name'
# Open and load the JSON file
json_list = json.load(open(file_path + '.json', 'r', encoding='utf-8', errors='ignore'))
# Extract data from the defined key names
key_list = ['created', 'identities']
json_list = [{k:d[k] for k in key_list} for d in json_list]
# Flatten and convert to a data frame
json_list_flattened = (flatten(d, '.') for d in json_list)
df = pandas.DataFrame(json_list_flattened)
# Export to CSV in the same directory with the original file name
export_csv = df.to_csv (file_path + r'.csv', sep=',', encoding='utf-8', index=None, header=True)
Similar to the key_list, I suspect that I would make an ignore list and factor that in the json_list for loop that I have? Something like:
key_ignore = ['identities.favorites', 'identities.likes']`
Then utilize the dict.pop() which looks like it will remove the unwanted sub-keys if it matches? Just not sure how to implement that correctly.
Expected Output
As a result, the code should extract data from the defined keys in key_list and ignore the sub keys defined in key_ignore, which is identities.favorites and identities.likes. Then the rest of the code will continue to convert it into a CSV:
created
identities.0.provider
identities.0.providerUID
identities...
2014-12-04T19:23:05.191Z
site
cb8168b0cf734b70ad541f0132763761
...
If the keys are always there, you can use
del d[0]['identities'][0]['likes']
del d[0]['identities'][0]['favorites']
or if you want to remove the columns from the dataframe after reading all the json data in you can use
df.drop(df.filter(regex='identities.0.favorites|identities.0.likes').columns, axis=1, inplace=True)

How to read fields without numeric index in JSON

I have a json file where I need to read it in a structured way to insert in a database each value in its respective column, but in the tag "customFields" the fields change index, example: "Tribe / Customer" can be index 0 (row['customFields'][0]) in a json block, and in the other one be index 3 (row['customFields'][3]), so I tried to read the data using the name of the row field ['customFields'] ['Tribe / Customer'], but I got the error below:
TypeError: list indices must be integers or slices, not str
Script:
def getCustomField(ModelData):
for row in ModelData["data"]["squads"][0]["cards"]:
print(row['identifier'],
row['customFields']['Tribe / Customer'],
row['customFields']['Stopped with'],
row['customFields']['Sub-Activity'],
row['customFields']['Activity'],
row['customFields']['Complexity'],
row['customFields']['Effort'])
if __name__ == "__main__":
f = open('test.json')
json_file = json.load(f)
getCustomField(json_file)
JSON:
{
"data": {
"squads": [
{
"name": "TESTE",
"cards": [
{
"identifier": "0102",
"title": "TESTE",
"description": " TESTE ",
"status": "on_track",
"priority": null,
"assignees": [
{
"fullname": "TESTE",
"email": "TESTE"
}
],
"createdAt": "2020-04-16T15:00:31-03:00",
"secondaryLabel": null,
"primaryLabels": [
"TESTE",
"TESTE"
],
"swimlane": "TESTE",
"workstate": "Active",
"customFields": [
{
"name": "Tribe / Customer",
"value": "TESTE 1"
},
{
"name": "Checkpoint",
"value": "GNN"
},
{
"name": "Stopped with",
"value": null
},
{
"name": "Sub-Activity",
"value": "DEPLOY"
},
{
"name": "Activity",
"value": "TOOL"
},
{
"name": "Complexity",
"value": "HIGH"
},
{
"name": "Effort",
"value": "20"
}
]
},
{
"identifier": "0103",
"title": "TESTE",
"description": " TESTE ",
"status": "on_track",
"priority": null,
"assignees": [
{
"fullname": "TESTE",
"email": "TESTE"
}
],
"createdAt": "2020-04-16T15:00:31-03:00",
"secondaryLabel": null,
"primaryLabels": [
"TESTE",
"TESTE"
],
"swimlane": "TESTE",
"workstate": "Active",
"customFields": [
{
"name": "Tribe / Customer",
"value": "TESTE 1"
},
{
"name": "Stopped with",
"value": null
},
{
"name": "Checkpoint",
"value": "GNN"
},
{
"name": "Sub-Activity",
"value": "DEPLOY"
},
{
"name": "Activity",
"value": "TOOL"
},
{
"name": "Complexity",
"value": "HIGH"
},
{
"name": "Effort",
"value": "20"
}
]
}
]
}
]
}
}
You'll have to parse the list of custom fields into something you can access by name. Since you're accessing multiple entries from the same list, a dictionary is the most appropriate choice.
for row in ModelData["data"]["squads"][0]["cards"]:
custom_fields_dict = {field['name']: field['value'] for field in row['customFields']}
print(row['identifier'],
custom_fields_dict['Tribe / Customer'],
...
)
If you only wanted a single field you could traverse the list looking for a match, but it would be less efficient to do that repeatedly.
I'm skipping over dealing with missing fields - you'd probably want to use get('Tribe / Customer', some_reasonable_default) if there's any possibility of the field not being present in the json list.

How to print a specific item from within a JSON file into python

I want to print a user from a JSON list into Python that I select however I can only print all the users. How do you print a specific user? At the moment I have this which prints all the users out in a ugly format
import json
with open('Admin_sample.json') as f:
admin_json = json.load(f)
print(admin_json['staff'])
The JSON file looks like this
{
"staff": [
{
"id": "DA7153",
"name": [
"Fran\u00c3\u00a7ois",
"Ullman"
],
"department": {
"name": "Admin"
},
"server_admin": "true"
},
{
"id": "DA7356",
"name": [
"Bob",
"Johnson"
],
"department": {
"name": "Admin"
},
"server_admin": "false"
},
],
"assets": [
{
"asset_name": "ENGAGED SLOTH",
"asset_type": "File",
"owner": "DA8333",
"details": {
"security": {
"cia": [
"HIGH",
"INTERMEDIATE",
"LOW"
],
"data_categories": {
"Personal": "true",
"Personal Sensitive": "true",
"Customer Sensitive": "true"
}
},
"retention": 2
},
"file_type": "Document",
"server": {
"server_name": "ISOLATED UGUISU",
"ip": [
10,
234,
148,
52
]
}
},
{
"asset_name": "ISOLATED VIPER",
"asset_type": "File",
"owner": "DA8262",
"details": {
"security": {
"cia": [
"LOW",
"HIGH",
"LOW"
],
"data_categories": {
"Personal": "false",
"Personal Sensitive": "false",
"Customer Sensitive": "true"
}
},
"retention": 2
},
},
]
I just can't work it out. Any help would be appreciated.
Thanks.
You need to index into the staff list, e.g.:
print(admin_json['staff'][0])
I suggest reading up a bit on dictionaries in Python. Dictionary values can be set to any object: in this case, the value of the staff key is set to a list of dicts. Here's an example that will loop through all the staff members and print their names:
staff_list = admin_json['staff']
for person in staff_list:
name_parts = person['name']
full_name = ' '.join(name_parts) # combine name parts into a string
print(full_name)
Try something like this:
import json
def findStaffWithId(allStaff, id):
for staff in allStaff:
if staff["id"] == id:
return staff
return {} # no staff found
with open('Admin_sample.json') as f:
admin_json = json.load(f)
print(findStaffWithId(admin_json['staff'], "DA7356"))
You can list all the users name with
users = [user["name"] for user in admin_json['staff']]
You have two lists in this JSON file. When you try to parse it, you'll be reach a list. For example getting the first staff id:
print(admin_json['staff'][0]['id'])
This will print:
DA7153
When you use "json.loads" this will simply converts JSON file to the Python dictionary. For further info:
https://docs.python.org/3/tutorial/datastructures.html#dictionaries

Unable to pull data from json using python

I have the following json
{
"response": {
"message": null,
"exception": null,
"context": [
{
"headers": null,
"name": "aname",
"children": [
{
"type": "cluster-connectivity",
"name": "cluster-connectivity"
},
{
"type": "consistency-groups",
"name": "consistency-groups"
},
{
"type": "devices",
"name": "devices"
},
{
"type": "exports",
"name": "exports"
},
{
"type": "storage-elements",
"name": "storage-elements"
},
{
"type": "system-volumes",
"name": "system-volumes"
},
{
"type": "uninterruptible-power-supplies",
"name": "uninterruptible-power-supplies"
},
{
"type": "virtual-volumes",
"name": "virtual-volumes"
}
],
"parent": "/clusters",
"attributes": [
{
"value": "true",
"name": "allow-auto-join"
},
{
"value": "0",
"name": "auto-expel-count"
},
{
"value": "0",
"name": "auto-expel-period"
},
{
"value": "0",
"name": "auto-join-delay"
},
{
"value": "1",
"name": "cluster-id"
},
{
"value": "true",
"name": "connected"
},
{
"value": "synchronous",
"name": "default-cache-mode"
},
{
"value": "true",
"name": "default-caw-template"
},
{
"value": "blah",
"name": "default-director"
},
{
"value": [
"blah",
"blah"
],
"name": "director-names"
},
{
"value": [
],
"name": "health-indications"
},
{
"value": "ok",
"name": "health-state"
},
{
"value": "1",
"name": "island-id"
},
{
"value": "blah",
"name": "name"
},
{
"value": "ok",
"name": "operational-status"
},
{
"value": [
],
"name": "transition-indications"
},
{
"value": [
],
"name": "transition-progress"
}
],
"type": "cluster"
}
],
"custom-data": null
}
}
which im trying to parse using the json module in python. I am only intrested in getting the following information out of it.
Name Value
operational-status Value
health-state Value
Here is what i have tried.
in the below script data is the json returned from a webpage
json = json.loads(data)
healthstate= json['response']['context']['operational-status']
operationalstatus = json['response']['context']['health-status']
Unfortunately i think i must be missing something as the above results in an error that indexes must be integers not string.
if I try
healthstate= json['response'][0]
it errors saying index 0 is out of range.
Any help would be gratefully received.
json['response']['context'] is a list, so that object requires you to use integer indices.
Each item in that list is itself a dictionary again. In this case there is only one such item.
To get all "name": "health-state" dictionaries out of that structure you'd need to do a little more processing:
[attr['value'] for attr in json['response']['context'][0]['attributes'] if attr['name'] == 'health-state']
would give you a list of of matching values for health-state in the first context.
Demo:
>>> [attr['value'] for attr in json['response']['context'][0]['attributes'] if attr['name'] == 'health-state']
[u'ok']
You have to follow the data structure. It's best to interactively manipulate the data and check what every item is. If it's a list you'll have to index it positionally or iterate through it and check the values. If it's a dict you'll have to index it by it's keys. For example here is a function that get's the context and then iterates through it's attributes checking for a particular name.
def get_attribute(data, attribute):
for attrib in data['response']['context'][0]['attributes']:
if attrib['name'] == attribute:
return attrib['value']
return 'Not Found'
>>> data = json.loads(s)
>>> get_attribute(data, 'operational-status')
u'ok'
>>> get_attribute(data, 'health-state')
u'ok'
json['reponse']['context'] is a list, not a dict. The structure is not exactly what you think it is.
For example, the only "operational status" I see in there can be read with the following:
json['response']['context'][0]['attributes'][0]['operational-status']

Categories