My data looks similar to the data shown below ('snippet.json'). I want to be able to replace values for example, for id:1, replace employee number to 2455.
Data Snippet:
{"employees": [{"level1":{"id":1, "firstname": "John", "employee
number": 2343 },{"level1":{"id":2, "firstname": "Jane", "employee
number": 5647 }}]}
I understand that it is much easier to replace values when in the form of a list or a dictionary, so I did the following to convert it to a list.
import json
viewer_string=open('snippet.json','r')
data_str = viewer_string.read()
data_list = []
data_list.append(data_str)
But this doesn't seem to be working. Is there anyway I could convert Snippet.json into a dictionary? Or is there another way to go about this?
Since you are importing json, you might want to do something like below,
json_data = json.loads(viewer_string.read())
you have your data in dict type and you can loop through and replace values as you wish. Make sure the file has valid json
Related
I am trying to access inner attributes of following json using pyspark
[
{
"432": [
{
"atttr1": null,
"atttr2": "7DG6",
"id":432,
"score": 100
}
]
},
{
"238": [
{
"atttr1": null,
"atttr2": "7SS8",
"id":432,
"score": 100
}
]
}
]
In the output, I am looking for something like below in form of csv
atttr1, atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details like below but I don't want to pass 432 or 238 in lambda expression as in bigger json this(italic one) will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x:(x['*432*'])).first())
print(inputDF.rdd.map(lambda x:(x['*238*'])).first())
I also tried registering a temp table with the name "test" but it gave an error with message element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using spark 2.4
Without using pyspark features, you can do it like this:
data = json.loads(json_str) # or whatever way you're getting the data
columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns)) # headers
for item in data:
for obj in list(item.values())[0]: # since each list has only one object
print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or
for item in data:
obj = list(item.values())[0][0] # since the object is the one and only item in list
print(','.join(str(obj[col]) for col in columns))
FYI, you can store those in a variable or write it out to csv instead of/and also printing it.
And if you're just looking to dump that to csv, see this answer.
I have some dynamically generated nested json that I want to convert to a CSV file using python. I am trying to use pandas for this. My question is - is there a way to use this and flatten the json data to put in the csv without knowing the json keys that need flattened in advance? An example of my data is this:
{
"reports": [
{
"name": "report_1",
"details": {
"id": "123",
"more info": "zyx",
"people": [
"person1",
"person2"
]
}
},
{
"name": "report_2",
"details": {
"id": "123",
"more info": "zyx",
"actions": [
"action1",
"action2"
]
}
}
]
}
More nested json objects can be dynamically generated in the "details" section that I do not know about in advance but need to be represented in their own cell in the csv.
For the above example, I'd want the csv to look something like this:
Name, Id, More Info, People_1, People_2, Actions_1, Actions_2
report_1, 123, zxy, person1, person2, ,
report_2, 123, zxy , , , action1 , action2
Here's the code I have:
data = json.loads('{"reports": [{"name": "report_1","details": {"id": "123","more info": "zyx","people": ["person1","person2"]}},{"name": "report_2","details": {"id": "123","more info": "zyx","actions": ["action1","action2"]}}]}')
df = pd.json_normalize(data['reports'])
df.to_csv("test.csv")
And here is the outcome currently:
,name,details.id,details.more info,details.people,details.actions
0,report_1,123,zyx,"['person1', 'person2']",
1,report_2,123,zyx,,"['action1', 'action2']"
I think what your are lookig for is:
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
If using pandas doesn't work for you, here's the more canonical Python way of doing it.
You're trying to write out a CSV file, and that implicitly means you must write out a header containing all the keys.
The constraint that you don't know the keys in advance means you can't do this in a single pass.
def convert_record_to_flat_dict(record):
# You need to figure out exactly how you want to do this; everything
# should be strings.
record.update(record.pop('details'))
return record
header = {}
rows = [[]] # Leave the header row blank for now.
csv_out = csv.writer(buffer)
for record in report:
record = convert_record_to_flat_dict(record)
for key in record.keys():
if key not in header:
header[key] = len(header)
rows[0].append(key)
row = [''] * len(header)
for key, index in header.items():
row[index] = record.get(key, '')
rows.append(row)
# And you can go back to ensure all rows have the same number of keys:
for row in rows:
row.extend([''] * (len(row) - len(header)))
Now you have a list of lists that's ready to be sent to csv.csvwriter() or the like.
If memory is an issue, another technique is to write out a temporary file and then reprocess it once you know the header.
This is the flattened version of the column. I still need the keys as column titles for the dataframe and the values as values for the corresponding column.
reaction
{ "veddra_term_code": "99026", "veddra_version": "3", "veddra_term_name": "Tablets, Abnormal" }
I want my data to look like this so I can add it to the dataframe.
veddra_term_code veddra_version veddra_term_name
99026 3 'Tablets, Abnormal'
Use f-strings. Theyre made for creating strings formatted like you want:
d = { "veddra_term_code": "99026", "veddra_version": "3", "veddra_term_name": "Tablets, Abnormal" }
s = f'veddra_term_code veddra_version veddra_term_name {d["veddra_term_code"]} {d["veddra_version"]} \'{d["veddra_term_name"]}\''
print(s) # prints veddra_term_code veddra_version veddra_term_name 99026 3 'Tablets, Abnormal'
I'm trying to extract a field from a json that contains a list then append that list to a dataframe, but I'm running in to a few different errors.
I think I can write it to a csv then read the csv with Pandas, but I'm trying to avoid writing any files. I know that I can also use StringIO to make a csv, but that has issues with null bytes. Replacing those would be (i think) another line-by-line step that will further extend the time the script takes to complete... i'm running this against a query that returns thens of thousands of results so keeping it fast and simple is a priority
First I tried this:
hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
df = df.append(ln['_source'], ignore_index=True)
print(df)
This gives me a result that looks like this:
1 2 3 4
a b d,e,f... x
Then I tried this:
df = df.append(ln['_source']['payload'], ignore_index=True)
But that gives me this error:
TypeError: cannot concatenate object of type "<class 'str'>"; only pd.Series,
pd.DataFrame, and pd.Panel (deprecated) objs are valid
What I'm looking for would be something like this:
0 1 2 3 4
d e f g h
On top of this... I need to figure out a way to handle a specific string in this list that contains a comma... which may be a headache that's best handled in a different question... something like:
# Obviously this is incorrect but I think you get the idea :)
str.replace(',', '^')
except if ',' followed by ' '
Greatly appreciate any help!
EDITING TO ADD JSON AS REQUESTED
{
"_index": "sanitized",
"_type": "sanitized",
"_id": "sanitized".,
"_score": sanitized,
"_source": {
"sanitized": sanitized,
"sanitized": "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,\"34,35\",36,37,38,39,40",
"sanitized": "sanitized",
"sanitized": ["sanitized"],
"sanitized": "sanitized",
"sanitized": "sanitized",
"sanitized": "sanitized",
"sanitized": "sanitized",
}
}]
}
}
You can maybe write a temporary file with StringIO, like it's done here.
Then for the second part you could do
if ',' in data and ', ' not in data:
data = data.replace(',', '^')
You can try the following
hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
data = ln['_source']["payload"].split(",")
df.loc[len(df)] = pd.Series(data, index=range(len(data)))
print(df)
The benefit of the loc is that you will not create a new dataframe each time so it will be fast. You can find the post here.
I would also like to suggest an alternative that can be faster. First create a dictionary with all the data and then dump the dictionary into a dataframe.
I have json objects in a notepad(C:\data.txt).There are millions of records I just used one record as an example.But I want to see only data on my notepad like:
1 123-567-9876 TEST1 TEST 717-567-9876 Harrisburg null US_PA
I dont want paranthesis,etc
Once I get the clean data,plan is to import the data from notepad(say C:\data2.txt) into SQL database.
This is the format of json object.
{
"status":"ok",
"items":[
{
"1":{
"Work_Phone":"123-567-9876",
"Name_Part":[
"TEST1",
"TEST"
],
"Residence_Phone":"717-567-9876",
"Mailing_City":"Harrisburg",
"Mailing_Street_Address_line_1":"",
"Cell_Phone":null,
"Mailing_Country_AND_Province_OR_State":"US_PA"
}
}
]
}
Can someone pls help with python code to format this json object and export it to notepad.
You can use
import simplejson as json
Then you can open your file and load it into a python-Dictionary:
f = file("C:/data.txt","r")
data = json.loads(f.read())
But this works only, when the json-objects are stored in an array in your file. So this has to look like this:
[{ ... first date ...},
{... second date ...},
...,
{... last date ...}]
Then in data there is an array of dictionaries. Now you can write the dates in another file:
g = file("output.txt","w")
for d in data:
for i in items:
for k in i.keys:
g.write(... some string build from the parameters ...)
If well done the file output.txt contains the lines. In detail it might be difficult becaus each item seems to contain some arrays.