How to count element in JSON and remove its duplicate with python

How to count element in JSON and remove its duplicate with python - python

I have below code to count the element in JSON and remove its duplicate.
My problem is when it need to read thousand line of data, this code take long time to finish.
Can anyone help me if we have better way to do this?
#count json element
BattleAmount = []
for i in DATA:
amount = DATA.count(i)
j = copy.deepcopy(i)
j['md']["amount"] = j['md']["amount"] + amount
BattleAmount.append(j)
print("Number of BattleAmount are ", len(BattleAmount))
#remove duplicate
duplicates=[]
for i in BattleAmount:
if BattleAmount.count(i)>1:
if i not in duplicates:
duplicates.append(i)
JSON as this format
[{"_id": {"$oid": "SL"}, "md": {"mana": 24, "rule_set": "Standard", "amount": 12}, "team": {other dict here}
full JSON structure as below
thank you

If you do not care about the order of your elements in BattleAmount you can just use set() function
unique_elements = set(BattleAmount)

Related

Converting nested dict into a single unchanging unique number

I'm working on a minimax algorithm project and I am trying to find a way to save board values in a text file so they don't need to be calculated over and over again each time the program is tested. I have the board stored as a nested dictionary.
rows = {
4:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
3:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
2:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
1:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
}
I tried doing this, which gives the desired result but is not at all optimized and I'm sure there is a way to do this better.
rows = {
4:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
3:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
2:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
1:{1:0,2:0,3:0,4:0,5:0,6:0,7:0,8:0},
}
e = []
for key in rows:
e.append(list(rows[key].values()))
e=str(e)
e=e.replace ("[",""); e=e.replace ("]","")
e=e.replace (" ","")
e=e.replace (",","")
print(e)

You could make use of a str.join(), map is used to convert integers to strings:
res = ''.join(''.join(map(str, r.values())) for r in rows.values())
print(res)
Out:
00000000000000000000000000000000

Get the value of specific key from JSON file python

I have a JSON file containing the list of price changes of all cryptocurrencies
I want to extract all 'percentage' for all the coins.
Using the code below it throws TypeError: string indices must be integers (which I know is totally wrong, Basically trying to understand how can I search for percentage and get its value for all items)
with open('balance.txt') as json_file:
data = json.load(json_file)
for json_i in data:
print(json_i['priceChangePercent'])
Any help is appreciated
I have attached the json file hereJSON FILE
Below is the sample of JSON file for those who dont want to open link
{
"ETH/BTC":{
"symbol":"ETH/BTC",
"timestamp":1630501910299,
"datetime":"2021-09-01T13:11:50.299Z",
"open":0.071579,
"close":0.0744,
"last":0.0744,
"previousClose":0.071585,
"change":0.002821,
"percentage":3.941,
"average":null,
"baseVolume":178776.0338,
"quoteVolume":13026.89979053,
"info":{
"symbol":"ETHBTC",
"priceChange":"0.00282100",
"priceChangePercent":"3.941",
"count":"279051"
}
},
"LTC/BTC":{
"symbol":"LTC/BTC",
"timestamp":1630501909389,
"datetime":"2021-09-01T13:11:49.389Z",
"open":0.003629,
"close":0.00365,
"last":0.00365,
"previousClose":0.003629,
"change":2.1e-05,
"percentage":0.579,
"average":null,
"baseVolume":132964.808,
"quoteVolume":485.12431556,
"info":{
"symbol":"LTCBTC",
"priceChange":"0.00002100",
"priceChangePercent":"0.579",
"count":"36021"
}
},
"BNB/BTC":{
"symbol":"BNB/BTC",
"timestamp":1630501910176,
"datetime":"2021-09-01T13:11:50.176Z",
"open":0.009848,
"close":0.010073,
"last":0.010073,
"previousClose":0.009848,
"change":0.000225,
"percentage":2.285,
"average":null,
"baseVolume":220645.713,
"quoteVolume":2187.75954249,
"info":{
"symbol":"BNBBTC",
"priceChange":"0.00022500",
"priceChangePercent":"2.285",
"count":"130422"
}
},

If it is single dictionary, it could be done the following way:
data['LTC/BTC']['info']['priceChangePercent']

Extract it using list comprehension.
percentage_list = [value['percentage'] for value in data.values()]
priceChangePercent_list = [value['info']['priceChangePercent'] for value in data.values()]
print(percentage_list)
print(priceChangePercent_list)
[3.941, 0.579, 2.285]
['3.941', '0.579', '2.285']

try this bro
t = []
for key, value in a.items():
if "info" in value and "priceChangePercent" in value["info"]:
t.append(value["info"]["priceChangePercent"])

Sort a json file in python

I had a list of single long string and I wanted to print the output in a particular form.
convert list to a particular json in python
but after conversion order of data changed. How can I maintain the same order?
input_data =
[
"21:15-21:30 IllegalAgrumentsException 1,
21:15-21:30 NullPointerException 2,
22:00-22:15 UserNotFoundException 1,
22:15-22:30 NullPointerException 1
....."
]
Code to covert the data in particular json form:
input_data = input[0] // input is list of single long string.
input_data = re.split(r',\s*', input_data)
output = collections.defaultdict(collections.Counter)
# print(output)
for line in input_data:
time, error, count = line.split(None, 2)
output[time][error] += int(count)
print(output)
response = [
{
"time": time,
"logs": [
{"exception": exception, "count": count}
for (exception, count) in counter.items()
],
}
for (time, counter) in output.items())
]
print(response)
My output :
{
"response": [
{
"logs": [
{
"count": 1,
"exception": "UserNotFoundException"
}
],
"time": "22:45-23:00"
},
{
"logs": [
{
"count": 1,
"exception": "NullPointerException"
}
],
"time": "23:00-23:15"
}...
]
}
so my order is changed but I need my data to be in same order i.e start from 21:15-21:30 and so on.. How can I maintain the same order ?

Your timestamps are already sortable, so if you don't care about the order of individual exceptions, you can just do:
for (time, counter) in sorted(output.items())
which will do a lexicographical sort by time and then by count. You can do sorted(output.items(), key=lambda x: x[0]) if you want just sort by time, or key=lambda x: x[0], -x[1] for by time and count descending.

The data is read into a dictionary, a defaultdict to be precise:
output[time][error] += int(count)
This data structure is grouping the data by time and by error type, which implies that there may be multiple items with the same time and the same error time. There is no way to have the "same order", if the data is regrouped like that.
On the other hand, you probably expect the time to be ordered in the input and even if it is not, you want output ordered by time, yo sou just need to do that, so instead of this:
for (time, counter) in output.items()
do this:
for time in sorted(output)
and then get the counter as
counter = output[time]
EDIT: time is sorted, but not starting at 0:00, sorting by time string is not correct. Instead, sorting the time by the original time order is correct.
Therefore, remember the original time order:
time_order = []
for line in input_data:
time, error, count = line.split(None, 2)
output[time][error] += int(count)
time_order.append(time)
Then later sort by it:
for time in sorted(output, key=time_order.index)

Python - Count JSON elements before extracting data

I use an API which gives me a JSON file structured like this:
{
offset: 0,
results: [
{
source_link: "http://www.example.com/1",
source_link/_title: "Title example 1",
source_link/_source: "/1",
source_link/_text: "Title example 1"
},
{
source_link: "http://www.example.com/2",
source_link/_title: "Title example 2",
source_link/_source: "/2",
source_link/_text: "Title example 2"
},
...
And I use this code in Python to extract the data I need:
import json
import urllib2
u = urllib2.urlopen('myapiurl')
z = json.load(u)
u.close
link = z['results'][1]['source_link']
title = z['results'][1]['source_link/_title']
The problem is that to use it I have to know the number of the element from which I'm extracting the data. My results can have different length every time, so what I want to do is to count the number of elements in results at first, so I would be able to set up a loop to extract data from each element.

To check the length of the results key:
len(z["results"])
But if you're just looping around them, a for loop is perfect:
for result in x["results"]:
print(result["source_link"])

You didn't need to know the length of the result, you are fine with a for loop:
for result in z['results']:
# process the results here
Anyway, if you want to know the length of 'results': len(z.results)

If you want to get the length, you can try:
len(z['result'])
But in python, what we usually do is:
for i in z['result']:
# do whatever you like with `i`
Hope this helps.

You don't need, or likely want, to count them in order to loop over them, you could do:
import json
import urllib2
u = urllib2.urlopen('myapiurl')
z = json.load(u)
u.close
for result in z['results']:
link = result['source_link']
title = result['source_link/_title']
# do something with link/title
Or you could do:
u = urllib2.urlopen('myapiurl')
z = json.load(u)
u.close
link = [result['source_link'] for result in z['results']]
title = [result['source_link/_title'] for result in z['results']]
# do something with links/titles lists

Few pointers:
No need to know results's length to iterate it. You can use for result in z['results'].
lists start from 0.
If you do need the index take a look at enumerate.

use this command to print the result on the terminal and then can check the number of results
print(len(z['results'][0]))

Optimize loops with big datasets Python

It's the first time I go so big with Python so I need some help.
I have a mongodb (or python dict) with the following structure:
{
"_id": { "$oid" : "521b1fabc36b440cbe3a6009" },
"country": "Brazil",
"id": "96371952",
"latitude": -23.815124482000001649,
"longitude": -45.532670811999999216,
"name": "coffee",
"users": [
{
"id": 277659258,
"photos": [
{
"created_time": 1376857433,
"photo_id": "525440696606428630_277659258",
},
{
"created_time": 1377483144,
"photo_id": "530689541585769912_10733844",
}
],
"username": "foo"
},
{
"id": 232745390,
"photos": [
{
"created_time": 1369422344,
"photo_id": "463070647967686017_232745390",
}
],
"username": "bar"
}
]
}
Now, I want to create two files, one with the summaries and the other with the weight of each connection. My loop which works for small datasets is the following:
#a is the dataset
data = db.collection.find()
a =[i for i in data]
#here go the connections between the locations
edges = csv.writer(open("edges.csv", "wb"))
#and here the location data
nodes = csv.writer(open("nodes.csv", "wb"))
for i in a:
#find the users that match
for q in a:
if i['_id'] <> q['_id'] and q.get('users') :
weight = 0
for user_i in i['users']:
for user_q in q['users']:
if user_i['id'] == user_q['id']:
weight +=1
if weight>0:
edges.writerow([ i['id'], q['id'], weight])
#find the number of photos
photos_number =0
for p in i['users']:
photos_number += len(p['photos'])
nodes.writerow([ i['id'],
i['name'],
i['latitude'],
i['longitude'],
len(i['users']),
photos_number
])
The scaling problems: I have 20000 locations, each location might have up to 2000 users, each user might have around 10 photos.
Is there any more efficient way to create the above loops? Maybe Multithreads, JIT, more indexes?
Because if I run the above in a single thread can be up to 20000^2 *2000 *10 results...
So how can I handle more efficiently the above problem?
Thanks

#YuchenXie and #PaulMcGuire's suggested microoptimizations probably aren't your main problem, which is that you're looping over 20,000 x 20,000 = 400,000,000 pairs of entries, and then have an inner loop of 2,000 x 2,000 user pairs. That's going to be slow.
Luckily, the inner loop can be made much faster by pre-caching sets of the user ids in i['users'], and replacing your inner loop with a simple set intersection. That changes an O(num_users^2) operation that's happening in the Python interpreter to an O(num_users) operation happening in C, which should help. (I just timed it with lists of integers of size 2,000; on my computer, it went from 156ms the way you're doing it to 41µs this way, for a 4,000x speedup.)
You can also cut off half your work of the main loop over pairs of locations by noticing that the relationship is symmetric, so there's no point in doing both i = a[1], q = a[2] and i = a[2], q = a[1].
Taking these and #PaulMcGuire's suggestions into account, along with some other stylistic changes, your code becomes (caveat: untested code ahead):
from itertools import combinations, izip
data = db.collection.find()
a = list(data)
user_ids = [{user['id'] for user in i['users']} if 'users' in i else set()
for i in a]
with open("edges.csv", "wb") as f:
edges = csv.writer(f)
for (i, i_ids), (q, q_ids) in combinations(izip(a, user_ids), 2):
weight = len(i_ids & q_ids)
if weight > 0:
edges.writerow([i['id'], q['id'], weight])
edges.writerow([q['id'], i['id'], weight])
with open("nodes.csv", "wb") as f:
nodes = csv.writer(f)
for i in a:
nodes.writerow([
i['id'],
i['name'],
i['latitude'],
i['longitude'],
len(i['users']),
sum(len(p['photos']) for p in i['users']), # total number of photos
])
Hopefully this should be enough of a speedup. If not, it's possible that #YuchenXie's suggestion will help, though I'm doubtful because the stdlib/OS is fairly good at buffering that kind of thing. (You might play with the buffering settings on the file objects.)
Otherwise, it may come down to trying to get the core loops out of Python (in Cython or handwritten C), or giving PyPy a shot. I'm doubtful that'll get you any huge speedups now, though.
You may also be able to push the hard weight calculations into Mongo, which might be smarter about that; I've never really used it so I don't know.

The bottle neck is disk I/O.
It should be much faster when you merge the results and use one or several writerows call instead of many writerow.

Does collapsing this loop:
photos_number =0
for p in i['users']:
photos_number += len(p['photos'])
down to:
photos_number = sum(len(p['photos']) for p in i['users'])
help at all?
Your weight computation:
weight = 0
for user_i in i['users']:
for user_q in q['users']:
if user_i['id'] == user_q['id']:
weight +=1
should also be collapsible down to:
weight = sum(user_i['id'] == user_q['id']
for user_i,user_q in product([i['users'],q['users']))
Since True equates to 1, summing all the boolean conditions is the same as counting all the values that are True.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

How to count element in JSON and remove its duplicate with python - python

If you do not care about the order of your elements in BattleAmount you can just use set() function unique_elements = set(BattleAmount)

Related

Converting nested dict into a single unchanging unique number

Get the value of specific key from JSON file python

Sort a json file in python

Python - Count JSON elements before extracting data

Optimize loops with big datasets Python

Categories

Resources