Remove duplicate JSON keys in Python

I have a JSON file with 2M entries that looks like this, with some integer keys that occasionally repeat, and I'm trying to figure out how to remove all duplicates from the dictionary:
{
    "1": {
        "id": 1,
        "some_data": "some_data_1",
    },
    "2": {
        "id": 2,
        "some_data": "some_data_2",
    },
    "2": {
        "id": 2,
        "some_data": "some_data_2",
    },
    "3": {
        "id": 3,
        "some_data": "some_data_3",
    },
}
So, basically, I'm looking for a generic function that will iterate over the keys, look for duplicates, and return clean JSON. I tried something like { each[''] : each for each in te }.values(), but to no avail.
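For what it's worth, json.loads already collapses duplicate keys by keeping the last value it sees for each one, so loading and re-dumping the file removes the duplicates on its own. If you want to control which copy wins (or detect duplicates), you can pass object_pairs_hook. Below is a minimal sketch, assuming the data lives in a file named data.json:
import json

def dedupe_pairs(pairs):
    # Keep the first value seen for each key; later duplicates are ignored.
    result = {}
    for key, value in pairs:
        if key not in result:
            result[key] = value
    return result

with open("data.json") as f:
    clean = json.load(f, object_pairs_hook=dedupe_pairs)

with open("clean.json", "w") as f:
    json.dump(clean, f, indent=2)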

Related

Find a value in a list of dictionaries

I have the following list:
{
    "id": 1,
    "name": "John",
    "status": 2,
    "custom_attributes": [
        {
            "attribute_code": "address",
            "value": "st"
        },
        {
            "attribute_code": "city",
            "value": "st"
        },
        {
            "attribute_code": "job",
            "value": "test"
        }
    ]
}
I need to get the value for the attribute_code that is equal to "city".
I've tried this code:
if list["custom_attributes"]["attribute_code"] == "city" in list:
    var = list["value"]
But this gives me the following error:
TypeError: list indices must be integers or slices, not str
What am I doing wrong here? I've read this solution and this solution but didn't understand how to access each value.
Another solution, using next():
dct = {
    "id": 1,
    "name": "John",
    "status": 2,
    "custom_attributes": [
        {"attribute_code": "address", "value": "st"},
        {"attribute_code": "city", "value": "st"},
        {"attribute_code": "job", "value": "test"},
    ],
}
val = next(d["value"] for d in dct["custom_attributes"] if d["attribute_code"] == "city")
print(val)
Prints:
st
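One caveat worth adding: next() raises StopIteration when nothing matches the condition, so if a missing attribute_code is possible, a default value can be supplied (the None fallback here is just an example):
val = next(
    (d["value"] for d in dct["custom_attributes"] if d["attribute_code"] == "city"),
    None,  # returned when no attribute_code equals "city"
)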
Your data is a dict, not a list.
You need to scan the attributes according to the criteria you mentioned.
See below:
data = {
    "id": 1,
    "name": "John",
    "status": 2,
    "custom_attributes": [
        {
            "attribute_code": "address",
            "value": "st"
        },
        {
            "attribute_code": "city",
            "value": "st"
        },
        {
            "attribute_code": "job",
            "value": "test"
        }
    ]
}
for attr in data['custom_attributes']:
    if attr['attribute_code'] == 'city':
        print(attr['value'])
        break
output
st

pandas read_json with orient="table"

I have dumped a dict of dataframes by extending the JSON encoder based on this answer. I just altered the way the dataframe is dumped, changing orient="records" to orient="table" for my own purpose.
Somehow I can't manage to read the dataframes back from the JSON; to be precise, pandas seems to read it all right (no exception is raised), but the result is filled with NaN values.
Can anybody check whether I'm doing anything wrong or whether this is a bug in pandas (maybe related to multi-indexed dataframes)?
I'm using pandas version 1.1.4.
The following code should be enough (I hope) to test whether pandas is broken on my machine or whether I've somehow messed up the dataframe's format. I've also tried to reproduce this with a dummy dataframe with two indexes and didn't run into trouble.
Note also that the JSON contains "pandas_version": "0.20.0", which is inconsistent with my version (I just made a fresh installation to be sure, and it stays there). The same 0.20.0 version is displayed in the docs' example for the current pandas version...
import pandas as pd
s = """{
"schema": {
"fields": [{
"name": "grandeur",
"type": "string"
}, {
"name": "unite",
"type": "string"
}, {
"name": "year",
"type": "integer"
}, {
"name": 1,
"type": "number"
}, {
"name": 2,
"type": "number"
}, {
"name": 3,
"type": "number"
}, {
"name": 4,
"type": "number"
}, {
"name": 5,
"type": "number"
}, {
"name": 6,
"type": "number"
}, {
"name": 7,
"type": "number"
}, {
"name": 8,
"type": "number"
}, {
"name": 9,
"type": "number"
}, {
"name": 10,
"type": "number"
}, {
"name": 11,
"type": "number"
}, {
"name": 12,
"type": "number"
}
],
"primaryKey": ["grandeur", "unite", "year"],
"pandas_version": "0.20.0"
},
"data": [{
"grandeur": "Volumetric soil water layer 1",
"unite": "m3 m-3",
"year": 1981,
"1": 0.3893150916,
"2": 0.3614713229,
"3": 0.3965121538,
"4": 0.3513062306,
"5": 0.3860211495,
"6": 0.3507631742,
"7": 0.3499931922,
"8": 0.3195245205,
"9": 0.3078848032,
"10": 0.3917079828,
"11": 0.380486904,
"12": 0.3987094194
}, {
"grandeur": "Volumetric soil water layer 1",
"unite": "m3 m-3",
"year": 1982,
"1": 0.3924450997,
"2": 0.360954089,
"3": 0.3714920435,
"4": 0.3366828332,
"5": 0.329994006,
"6": 0.3659116305,
"7": 0.3035419171,
"8": 0.3143600073,
"9": 0.3099404359,
"10": 0.3938543858,
"11": 0.383870834,
"12": 0.3909665621
}]
}"""
pd.read_json(s, orient="table")
This is because the field names in the schema do not match the keys in data.
For example,
schema:
{
    "name": 1, // integer
    "type": "number"
}
data:
"1": 0.3893150916 // "1" is a string
====================================================
If you change the schema to match the data keys, read_json should read it properly.
schema:
{
    "name": "1", // string
    "type": "number"
}
data:
"1": 0.3893150916 // "1" is a string
If the example JSON string was generated by pandas to_json, then to_json is generating an incorrect schema for integer column names.
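One possible workaround, assuming you control the side that dumps the data: cast the column names to strings before calling to_json, so the schema's field names and the data keys agree. A rough sketch with a stand-in dataframe (the column and index values here are made up):
import pandas as pd

# Hypothetical dataframe standing in for the real one: month columns 1..12,
# indexed by (grandeur, unite, year) like the JSON above.
df = pd.DataFrame(
    {m: [0.1 * m, 0.2 * m] for m in range(1, 13)},
    index=pd.MultiIndex.from_tuples(
        [("Volumetric soil water layer 1", "m3 m-3", 1981),
         ("Volumetric soil water layer 1", "m3 m-3", 1982)],
        names=["grandeur", "unite", "year"],
    ),
)

df.columns = df.columns.astype(str)          # "1", "2", ... instead of 1, 2, ...
s = df.to_json(orient="table")
roundtrip = pd.read_json(s, orient="table")  # should come back without NaNs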

Add ascending serial number field to all existing mongodb documents in a collection

I have a MongoDB collection which looks something like this:
[
    {
        "Code": "018906",
        "X": "0.12",
    },
    {
        "Code": "018907",
        "X": "0.18",
    },
    {
        "Code": "018910",
        "X": "0.24",
    },
    {
        "Code": "018916",
        "X": "0.75",
    },
]
I want to add an ascending serial number field to all existing documents inside the collection. After adding it, the collection will look like this:
[
    {
        "Serial": 1,
        "Code": "018906",
        "X": "0.12",
    },
    {
        "Serial": 2,
        "Code": "018907",
        "X": "0.18",
    },
    {
        "Serial": 3,
        "Code": "018910",
        "X": "0.24",
    },
    {
        "Serial": 4,
        "Code": "018916",
        "X": "0.75",
    },
]
I am open to using any python mongodb library such as pymongo or mongoengine.
I am using python 3.7, mongodb v4.2.
You can do it with a single aggregation query by grouping all documents into a single array, then unwinding it with the element index included:
db.collection.aggregate([
    {
        $group: {
            _id: null,
            doc: {
                $push: "$$ROOT"
            }
        }
    },
    {
        $unwind: {
            path: "$doc",
            includeArrayIndex: "doc.Serial"
        }
    },
    {
        $replaceRoot: {
            newRoot: "$doc"
        }
    },
    {
        $out: "new_collection_name"
    }
])
All the work is done server-side; there is no need to load the whole collection into the application's memory. If the collection is large enough, you may need to run the aggregation with "allowDiskUse".
Prepend a sorting stage if you need to guarantee a particular order.
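Since the question mentions pymongo, roughly the same pipeline can be run from Python. This is only a sketch (the database, collection, and output collection names are assumptions); also note that includeArrayIndex is zero-based, so the serial produced this way starts at 0 unless you add an extra stage to shift it:
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycollection"]  # assumed database/collection names

pipeline = [
    {"$sort": {"Code": 1}},                                         # optional: fix the order first
    {"$group": {"_id": None, "doc": {"$push": "$$ROOT"}}},
    {"$unwind": {"path": "$doc", "includeArrayIndex": "doc.Serial"}},
    {"$replaceRoot": {"newRoot": "$doc"}},
    {"$out": "new_collection_name"},
]
coll.aggregate(pipeline, allowDiskUse=True)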
First you need to find all the _id values in the collection, then use a bulk write operation.
from pymongo import UpdateOne

records = db.collection.find({}, {'_id': 1})
requests = []
for i, record in enumerate(records, start=1):
    requests.append(UpdateOne({'_id': record['_id']}, {'$set': {'serial': i}}))
db.collection.bulk_write(requests)
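If the serial numbers should follow a particular order (say, ascending Code), the cursor can be sorted before numbering; a one-line variation on the find above:
records = db.collection.find({}, {'_id': 1}).sort('Code', 1)  # number documents in ascending Code order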

Get the most currently dated subset of data from a Python dict with multiple dated dicts within

In Python, is there a good way to programmatically get the value of ["win"]["amount"] from the entry with the most recent date?
To provide a more concrete example of what I'm asking, I'd like the amount for win from April 2, 2018 (2018-04-02), which would be 199.51.
This is the source JSON (I convert this to a Python dict using json.loads):
{
    "numbers": [
        {
            "lose": {"amount": "122.50"},
            "win": {"amount": "232.50"},
            "date": "2018-01-08"
        },
        {
            "lose": {"amount": "233.75"},
            "win": {"amount": "216.25"},
            "date": "2018-03-05"
        },
        {
            "lose": {"amount": "123.50"},
            "win": {"amount": "543.00"},
            "date": "2018-03-12"
        },
        {
            "lose": {"amount": "213.31"},
            "win": {"amount": "253.33"},
            "date": "2018-03-19"
        },
        {
            "lose": {"amount": "217.00"},
            "win": {"amount": "199.51"},
            "date": "2018-04-02"
        }
    ]
}
It seems like a very simple solution is in order, but I cannot quite nail down what that solution is, or whether there is a Pythonic way of accomplishing this. I did write some logic to find the largest date by putting all of the dates into a list called datelist and taking max(datelist), but I'm not sure how to relate that back to get ["win"]["amount"].
If the 'date' attributes follow ISO 8601, so that lexicographical order corresponds to chronological order, then you can simply use max with an appropriate key:
In [12]: max(data['numbers'], key=lambda d:d['date'])['win']['amount']
Out[12]: '199.51'
Given that your dates are in ISO 8601 notation, you can use a simple string comparison as the sorting key for the inner numbers list and then pick the first or last element:
import operator
oldest = sorted(your_data["numbers"], key=operator.itemgetter("date"))[0]["win"]["amount"]
# 232.50
newest = sorted(your_data["numbers"], key=operator.itemgetter("date"))[-1]["win"]["amount"]
# 199.51
Or just sort the inner list once and then pick whichever element you want, if you need to access the entries more often.
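If the dates were not in a sortable string format, the same approach would still work by parsing them inside the key function; a sketch using datetime (the format string is an assumption about hypothetical non-ISO data):
from datetime import datetime

latest = max(
    data["numbers"],
    key=lambda d: datetime.strptime(d["date"], "%Y-%m-%d"),  # parse instead of string-compare
)
print(latest["win"]["amount"])  # 199.51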

The fastest way to structure data in JSON for python

I have a complex document that I am trying to structure as conveniently and efficiently as possible with JSON in Python. I would like to be able to retrieve one of the items in my document in one line (i.e. not via a for loop).
A demo of the structure looks like this:
{
    "movies": {
        "0": {
            "name": "charles",
            "id": 0,
            "loopable": true
        },
        "1": {
            "name": "ray",
            "id": 1,
            "loopable": true
        }
    }
}
I am trying to be able to easily fetch a movie based on its id field. To do this, right now, I have made the key of each entry in the movies object the same as its id. So when I json.load the object, to find movie 1's name I can just do movie[(id)]['name'].
It seems like I should have a list of movies in the json file but it also seems like that would be more complicated. It could look like this:
{
    "movies": [
        {
            "name": "charles",
            "id": 0,
            "loopable": true
        },
        {
            "name": "ray",
            "id": 1,
            "loopable": true
        }
    ]
}
but if that were the case I would have to loop through the entire array like this:
for movie in movies:
    if movie['id'] == id:
        # Now I can get movie['name']
Is there a more efficient way of doing this?
Let 'movies' be a dict and not a list:
{
    "movies": {
        "12": {
            "name": "charles",
            "id": 12,
            "loopable": true
        },
        "39": {
            "name": "ray",
            "id": 39,
            "loopable": true
        }
    }
}
and you can access a movie by id with yourjson['movies'][str(id)].
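If the data ever does arrive as a list, it only takes one pass to build the same id-keyed mapping; a small sketch (movie_list stands in for the list-form data above):
movie_list = [
    {"name": "charles", "id": 0, "loopable": True},
    {"name": "ray", "id": 1, "loopable": True},
]

# One-time O(n) indexing step, then every lookup by id is O(1).
movies_by_id = {str(m["id"]): m for m in movie_list}

print(movies_by_id["1"]["name"])  # ray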
